I'm attempting to dig through my computer and plot a bunch of CSVs on one plot (I'm using Python 2.7 and Pandas).
While all the CSV files have the same name, file.csv, they are located in a myriad of different folders. I've done the following below, where I read each CSV into a dataframe and then plot the dataframes over a certain range of values.
I would like to label each plot with the folder name (i.e. have the legend specify the directory that each CSV is located in):
import pandas as pd
from pandas import read_csv
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import os
class do(object):
    def something(self):
        style.use('ggplot')

        file_1 = r'C:\User\me\PathABC\Folder123\file.csv'
        file_2 = r'C:\User\me\PathABC\Folder456\file.csv'
        file_3 = r'C:\User\me\PathABC\Folder789\file.csv'
        file_4 = r'C:\User\me\PathABC\Folder101112\file.csv'

        df1 = pd.read_csv(file_1, header=None)
        df2 = pd.read_csv(file_2, header=None)
        df3 = pd.read_csv(file_3, header=None)
        df4 = pd.read_csv(file_4, header=None)

        plt.plot(df1[0], df1[1], label='Folder123')
        plt.plot(df2[0], df2[1], label='Folder456')
        plt.plot(df3[0], df3[1], label='Folder789')
        plt.plot(df4[0], df4[1], label='Folder101112')

        plt.xlim([200000, 800000])
        plt.legend()
        plt.ylabel('Amplitude')
        plt.xlabel('Hz')
        plt.grid(True, color='k')
        plt.show()

x = do()
x.something()
Essentially, I would like to automate this process using the following logic:
where file.csv exists, plot it
label the plot with the name of the folder that file.csv came from
Walking a file path is one answer, but you may be able to use glob.glob in simpler cases where the target folders are all at the same depth in the filesystem. For example,
for filename in glob.glob('somewhere/sheets/*/file.csv'):
    ...
will iterate over all files called file.csv in any direct subfolder of somewhere/sheets. If they are all two levels down, glob.glob('somewhere/sheets/*/*/file.csv') will work, and if some are one level down and some two, you can join the lists from two glob invocations.
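Putting that together with the plotting code from the question, a minimal sketch (assuming the same two-column, headerless CSV layout) might look like this:

import glob
import os
import pandas as pd
import matplotlib.pyplot as plt

for filename in glob.glob('somewhere/sheets/*/file.csv'):
    # the folder name is the last directory component of the path
    folder = os.path.basename(os.path.dirname(filename))
    df = pd.read_csv(filename, header=None)
    plt.plot(df[0], df[1], label=folder)

plt.legend()
plt.show()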
Take a look at How to list all files of a directory? by @pycruft, edited by @Martin Thoma. I would use walk to get the full paths of all CSV files existing in the folders under a specific path, as follows:
from os import walk
from os.path import join, splitext

f = []
# specific_path is the root folder to search under
for (dirpath, dirnames, filenames) in walk(specific_path):
    for filename in filenames:
        if splitext(filename)[1].upper() == '.CSV':
            f.append(join(dirpath, filename))
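Once f holds the full paths, the legend label can be recovered from each path; a short sketch, assuming the same two-column layout as in the question:

import pandas as pd
import matplotlib.pyplot as plt
from os.path import basename, dirname

for csv_path in f:
    df = pd.read_csv(csv_path, header=None)
    # label each curve with the name of the folder the CSV came from
    plt.plot(df[0], df[1], label=basename(dirname(csv_path)))

plt.legend()
plt.show()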
Related
I have a wide range of CSV files that give me the solar energy produced by several systems on a daily basis. Each CSV file corresponds to one day in the year, for one particular site (I have 12 sites).
My goal is to develop a code that reads all CSV files (located across different folders), extracts the daily produced solar energy for every specific day and site, stores the values in a dataframe, and finally exports the dataframe collecting all daily produced solar energy across all sites on a new Excel file.
So far I have written the code to extract the values of all CSV files stored within the same folder, which gives me the solar energy produced for all days for which a CSV file exists in that folder:
import csv
import pandas as pd
import numpy as np
import glob
path = r"C:\Users\XX\Documents\XX\DataLogs\NameofSite\CSV\2020\02\CSV\*.csv"
Monthly_PV=[]
for fname in glob.glob(path):
    df = pd.read_csv(fname, header=7, decimal=',')
    kWh_produced = df["kWh"]
    daily_start = kWh_produced.iloc[0]
    daily_end = kWh_produced.iloc[-1]
    DailyPV = daily_end - daily_start
    Monthly_PV.append(DailyPV)

print(Monthly_PV)
MonthlyTotal = sum(Monthly_PV)
Monthly_PV = pd.DataFrame(Monthly_PV)
print(MonthlyTotal)
Monthly_PV.to_excel(r"C:\Users\XXX\Documents\XXX\DataLogs\NameofSite\CSV\2020\02\CSV\Summary.xlsx")
I get the result I want: a list in which each value corresponds to the daily produced solar energy of each CSV in the one folder I called "path". My aim is to add something to this code so that it is also applied to CSV files located in folders above this one, or in parallel folders within the same parent folder.
Any tips will be much appreciated.
Thanks!
You can add an extra for loop to handle a list of paths:
import pandas as pd
import numpy as np
import glob

paths = [r"C:\Users\XX\Documents\XX\DataLogs\NameofSite\CSV\2020\02\CSV\*.csv",
         r"C:\Foo\*.csv",
         r"..\..\Bar\*.csv"]

Monthly_PV = []
for path in paths:
    for fname in glob.glob(path):
        df = pd.read_csv(fname, header=7, decimal=',')
        kWh_produced = df["kWh"]
        daily_start = kWh_produced.iloc[0]
        daily_end = kWh_produced.iloc[-1]
        DailyPV = daily_end - daily_start
        Monthly_PV.append(DailyPV)

print(Monthly_PV)
MonthlyTotal = sum(Monthly_PV)
Monthly_PV = pd.DataFrame(Monthly_PV)
print(MonthlyTotal)
Monthly_PV.to_excel(r"C:\Users\XXX\Documents\XXX\DataLogs\NameofSite\CSV\2020\02\CSV\Summary.xlsx")
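If the folders all live under one root, another option (assuming Python 3.5+, where glob supports the recursive ** pattern) is to build the list with a single recursive glob instead of hardcoding it. Note that this yields file paths directly, so you would loop over them without the inner glob call; the root shown here is hypothetical:

import glob

root = r"C:\Users\XX\Documents\XX\DataLogs"
# every .csv anywhere under root, however deeply nested
csv_files = glob.glob(root + r"\**\*.csv", recursive=True)
for fname in csv_files:
    ...  # same per-file processing as above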
If you do not want to hardcode a list of directories in your program, maybe try something based on this?
import collections
import os
import typing

def get_input_directories(depth: int, base_directory: str) -> typing.DefaultDict[str, typing.List[DeltaFile]]:
    """
    Build a dict with keys that are directories, and values that are lists of filenames.

    Does not include blocking uninteresting tracks.
    """
    # DeltaFile, appropriate_extension and hidden are assumed to be defined elsewhere
    result: typing.DefaultDict[str, typing.List[DeltaFile]] = collections.defaultdict(list)
    original_directory = os.getcwd()
    os.chdir(base_directory)
    try:
        for root, directories, filenames in os.walk('.'):
            if root.count('/') != depth:
                # We only want to deal with /band/album (e.g., with depth == 2) in root
                continue
            assert not directories, f"root is {root}, directories is {directories}"
            for filename in filenames:
                if appropriate_extension(filename) and not hidden(filename):
                    result[root].append(DeltaFile(filename))
    finally:
        # chdir back to where we started, not just one level up
        os.chdir(original_directory)
    return result
You can safely remove the type annotations if you don't want them.
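For example, with the band/album layout the comment mentions (depth == 2), a hypothetical call could look like this, assuming DeltaFile and the helper predicates are defined:

collected = get_input_directories(depth=2, base_directory='music')
for directory, delta_files in collected.items():
    print(directory, len(delta_files))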
I have multiple .csv files that represent a series of measurements made.
I need to plot them in order to compare successive alterations.
I basically want to create a function with which I can read the files into a list, apply the same data cleaning to each .csv file, and then plot them all together in one graph.
This is a task I need to do to analyze some results. I intend to do this in Python/pandas, as I might need to integrate it into a bigger picture in the future, but for now this is it.
I also have one file that represents background noise. I want to remove these values from the other .csv files as well.
import os
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FormatStrFormatter

PATH = r'C:\Users\UserName\Documents\FSC\Folder_name'
FileNames = [os.listdir(PATH)]
for files in FileNames:
    df = pd.read_csv(PATH + file, index_col=0)
I expected to read every file and store it in the list, but instead I got this error:
FileNotFoundError: [Errno 2] File b'C:\Users\UserName\Documents\FSC\FolderNameFileName.csv' does not exist: b'C:\Users\UserName\Documents\FSC\FolderNameFileName.csv'
Have you used pathlib from the standard library? It makes working with the file system a breeze.
I recommend reading: https://realpython.com/python-pathlib/
Try:
from pathlib import Path

files = Path('/your/path/here/').glob('*.csv')  # get all CSVs in your dir
for file in files:
    df = pd.read_csv(file, index_col=0)
    # your plots
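To handle the background-noise file mentioned in the question, one approach is to read it first and subtract it from each measurement before plotting. This is only a sketch: it assumes a file named background.csv (a hypothetical name) and that every CSV shares the same index and column layout:

import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path

folder = Path('/your/path/here/')
noise = pd.read_csv(folder / 'background.csv', index_col=0)

for file in folder.glob('*.csv'):
    if file.name == 'background.csv':
        continue  # skip the noise file itself
    df = pd.read_csv(file, index_col=0)
    cleaned = df - noise  # element-wise background subtraction
    plt.plot(cleaned.index, cleaned.iloc[:, 0], label=file.stem)

plt.legend()
plt.show()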
I am having problems getting txt files located in zipped files to load and concatenate using pandas. There are many examples on here using pd.concat(zip_file.open), but I still can't get anything to work in my case, since I have more than one zip file and multiple txt files in each.
For example, let's say I have TWO zipped files in a specific folder "Main". Each zipped file contains FIVE txt files. I want to read all of these txt files and pd.concat them all together. In my real-world example I will have dozens of zip folders, each containing five txt files.
Can you help please?
Folder and File Structure for Example:
'C:/User/Example/Main'
    TAG_001.zip
        sample001_1.txt
        sample001_2.txt
        sample001_3.txt
        sample001_4.txt
        sample001_5.txt
    TAG_002.zip
        sample002_1.txt
        sample002_2.txt
        sample002_3.txt
        sample002_4.txt
        sample002_5.txt
I started like this but everything after this is throwing errors:
import os
import glob
import pandas as pd
import zipfile
path = 'C:/User/Example/Main'
ziplist = glob.glob(os.path.join(path, "*TAG*.zip"))
This isn't efficient but it should give you some idea of how it might be done.
import os
import zipfile
import pandas as pd

frames = {}
BASE_DIR = 'C:/User/Example/Main'
_, _, zip_filenames = list(os.walk(BASE_DIR))[0]
for zip_filename in zip_filenames:
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                new_frame = pd.read_csv(file_, sep='\t')
                frame = frames.get(filename)
                if frame is not None:
                    # pd.concat returns a new frame, so store the result
                    frames[filename] = pd.concat([frame, new_frame])
                else:
                    frames[filename] = new_frame
# once all frames have been concatenated, loop over the dict and write them back out
Depending on how much data there is, you will have to design a solution that balances processing power, memory, and disk space. This solution could potentially use up a lot of memory.
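One way to ease that memory pressure is to collect the pieces in lists and call pd.concat once per name at the end, since repeated concatenation copies the accumulated data every time. A sketch under the same assumptions about the zip layout:

import os
import zipfile
import pandas as pd

BASE_DIR = 'C:/User/Example/Main'
pieces = {}  # maps inner filename -> list of DataFrames

for zip_filename in os.listdir(BASE_DIR):
    if not zip_filename.endswith('.zip'):
        continue
    with zipfile.ZipFile(os.path.join(BASE_DIR, zip_filename)) as zip_:
        for filename in zip_.namelist():
            with zip_.open(filename) as file_:
                pieces.setdefault(filename, []).append(pd.read_csv(file_, sep='\t'))

# one concat per inner filename, instead of one per zip member
frames = {name: pd.concat(dfs, ignore_index=True) for name, dfs in pieces.items()}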
I want to load multiple xlsx files with varying structures from a directory and assign each its own data frame based on the file name. I have 30+ files with differing structures, but for brevity please consider the following:
3 excel files [wild_animals.xlsx, farm_animals.xlsx, domestic_animals.xlsx]
I want to assign each its own data frame, so if the file name contains 'wild' it is assigned to wild_df, if 'farm' then farm_df, and if 'domestic' then dom_df. This is just the first step in a process, as the actual files contain a lot of 'noise' that needs to be cleaned depending on file type etc.; the file names will also change on a weekly basis, with only a few key markers staying the same.
My assumption is that the glob module is the best way to begin, but in terms of taking very specific parts of the file name and using them to assign to a specific df I become a bit lost, so any help is appreciated.
I asked a similar question a while back but it was part of a wider question most of which I have now solved.
I would parse them into a dictionary of DataFrame's:
import os
import glob
import pandas as pd
files = glob.glob('/path/to/*.xlsx')
dfs = {}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
then you can access them as normal dictionary elements:
dfs['wild_animals']
dfs['domestic_animals']
etc.
You need to get all the xlsx files; then, using a dict comprehension, you can access any element:
import pandas as pd
import os
import glob
path = 'Your_path'
extension = 'xlsx'
os.chdir(path)
result = glob.glob('*.{}'.format(extension))
dfs = {elm: pd.ExcelFile(elm) for elm in result}
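Assuming the animal workbooks from the question are in the directory, an element can then be pulled out and parsed into a DataFrame:

wild_book = dfs['wild_animals.xlsx']  # an ExcelFile object
wild_df = wild_book.parse()           # first sheet as a DataFrame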
For completeness, I wanted to show the solution I ended up using, very close to Khelili's suggestion with a few tweaks to suit my particular code, including not creating a DataFrame at this stage:
import os
import pandas as pd
import openpyxl as excel
import glob
#setting up path
path = 'data_inputs'
extension = 'xlsx'
os.chdir(path)
files = [i for i in glob.glob('*.{}'.format(extension))]
#Grouping files - brings multiple files of same type together in a list
wild_groups = ([s for s in files if "wild" in s])
domestic_groups = ([s for s in files if "domestic" in s])
#Sets up a dictionary associated with the file groupings to be called in another module
file_names = {"WILD":wild_groups, "DOMESTIC":domestic_groups}
...
The code below generates a sum from the "Value" column of a CSV file called 'File1.csv'.
How do I apply this code to every file in a directory and place the sums in a new file called Sum.csv?
import pandas as pd
import numpy as np
df = pd.read_csv("~/File1.csv")
df["Value"].sum()
Many thanks!
There's probably a nice way to do this with a pandas Panel, but this is a basic Python implementation.
import os
import pandas as pd
# Get the home directory (not recommended, work somewhere else)
directory = os.environ["HOME"]
# Read all files in directory, filter out non-csv
files = [os.path.join(directory, f)
for f in os.listdir(directory) if f.endswith(".csv")]
# Make list of tuples [(filename, sum)]
sums = [(filename, pd.read_csv(filename)["Value"].sum())
        for filename in files]
# Make a dataframe
df = pd.DataFrame(sums, columns=["filename", "sum"])
df.to_csv(os.path.join(directory, "files_with_sum.csv"))
Note that the built-in Python os.listdir() doesn't understand "~/" like pandas does, so we get the home directory out of the environment map. Using the home directory isn't really recommended, so this gives any adopter of this code an opportunity to set a different path.
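If you do want to keep "~/"-style paths, os.path.expanduser can translate them before listing; a small sketch:

import os

directory = os.path.expanduser("~")  # same home directory, portable across platforms
print(directory)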