Iterate a simple calculation in Pandas across multiple files - python

The code below generates a sum from the "Value" column in a CSV file called 'File1.csv'.
How do I apply this code to every file in a directory and place the sums in a new file called Sum.csv?
import pandas as pd
import numpy as np
df = pd.read_csv("~/File1.csv")
df["Value"].sum()
Many thanks!

There's probably a nice way to do this with a pandas Panel, but this is a basic python implementation.
import os
import pandas as pd
# Get the home directory (not recommended, work somewhere else)
directory = os.environ["HOME"]
# Read all files in directory, filter out non-csv
files = [os.path.join(directory, f)
         for f in os.listdir(directory) if f.endswith(".csv")]
# Make list of tuples [(filename, sum)]
sums = [(filename, pd.read_csv(filename)["Value"].sum())
        for filename in files]
# Make a dataframe
df = pd.DataFrame(sums, columns=["filename", "sum"])
df.to_csv(os.path.join(directory, "files_with_sum.csv"))
Note that Python's built-in os.listdir() doesn't expand "~/" the way pandas does, so the home directory is read from the environment instead (os.path.expanduser("~") is a portable alternative). Working in the home directory isn't really recommended, so this also gives anyone adopting the code an obvious place to set a different path.
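For what it's worth, here is a minimal sketch of the same idea that writes the result to Sum.csv as the question asked. The function name sum_column_per_file and its parameters are made up for illustration, and it assumes every CSV in the folder has the column being summed. It skips its own output file, so a second run does not try to sum the summary.

```python
from pathlib import Path

import pandas as pd


def sum_column_per_file(directory, column="Value", output_name="Sum.csv"):
    """Sum one column in every CSV in `directory` and write the totals to a CSV."""
    directory = Path(directory).expanduser()  # Path does not expand "~" on its own
    output_path = directory / output_name

    rows = []
    for csv_path in sorted(directory.glob("*.csv")):
        if csv_path == output_path:
            continue  # skip a summary file left over from an earlier run
        total = pd.read_csv(csv_path)[column].sum()
        rows.append({"filename": csv_path.name, "sum": total})

    # One row per input file; index=False keeps the row numbers out of the output
    summary = pd.DataFrame(rows, columns=["filename", "sum"])
    summary.to_csv(output_path, index=False)
    return summary
```

Calling sum_column_per_file("~/some_folder") would then leave a Sum.csv next to the input files; "some_folder" is only a placeholder.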

Related

Pandas - Reading CSVs to dataframes in a FOR loop then appending to a master DF is returning a blank DF

I've searched for about an hour for an answer to this and none of the solutions I've found are working. I'm trying to get a folder full of CSVs into a single dataframe, to output to one big csv. Here's my current code:
import os
sourceLoc = "SOURCE"
destLoc = sourceLoc + "MasterData.csv"
masterDF = pd.DataFrame([])
for file in os.listdir(sourceLoc):
    workingDF = pd.read_csv(sourceLoc + file)
    print(workingDF)
    masterDF.append(workingDF)
print(masterDF)
The SOURCE is a folder path but I've had to remove it as it's a work network path. The loop is reading the CSVs to the workingDF variable as when I run it it prints the data into the console, but it's also finding 349 rows for each file. None of them have that many rows of data in them.
When I print masterDF it prints Empty DataFrame Columns: [] Index: []
My code is from this solution but that example is using xlsx files and I'm not sure what changes, if any, are needed to get it to work with CSVs. The Pandas documentation on .append and read_csv is quite limited and doesn't indicate anything specific I'm doing wrong.
Any help would be appreciated.
There are a couple of things wrong with your code, but the main one is that DataFrame.append returns a new dataframe instead of modifying the original in place. So on pandas versions that still have append (it was removed in pandas 2.0), you would have to do:
masterDF = masterDF.append(workingDF)
I also like the approach taken by I_Al-thamary - concat will probably be faster.
One last thing I would suggest: instead of using glob, check out pathlib.
import pandas as pd
from pathlib import Path
path = Path("your path")
df = pd.concat(map(pd.read_csv, path.rglob("*.csv")))
You can use glob:
import glob
import pandas as pd
import os
path = "your path"
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join(path,'*.csv'))))
print(df)
You can store them all in a list and pd.concat them at the end.
dfs = [
    pd.read_csv(os.path.join(sourceLoc, file))
    for file in os.listdir(sourceLoc)
]
masterDF = pd.concat(dfs)
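As a rough sketch of the list-then-concat approach, the helper below also records which file each row came from. The name combine_csvs and the source_file column are assumptions added for illustration; nothing in the question requires them.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(source_dir, pattern="*.csv"):
    """Read every matching CSV in `source_dir` into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(source_dir).glob(pattern)):
        frame = pd.read_csv(csv_path)
        frame["source_file"] = csv_path.name  # hypothetical column, added for traceability
        frames.append(frame)
    if not frames:
        return pd.DataFrame()  # pd.concat raises ValueError on an empty list
    # ignore_index=True renumbers rows 0..n-1 instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)
```

One thing to watch: if the combined file is written back into the same folder, as destLoc does in the question, it will be read as an input on the next run unless it is filtered out or saved elsewhere.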

Loop/iterate through a directory of excel files & add to the bottom of the dataframe

I am currently working on importing and formatting a large number of excel files (all the same format/scheme, but different values) with Python.
I have already read in and formatted one file and everything worked fine so far.
I would now like to do the same for all the other files and combine everything in one dataframe, i.e. read the first excel file into a dataframe, add the second at the bottom of the dataframe, add the third at the bottom of the dataframe, and so on until I have all the excel files in one dataframe.
So far my script looks something like this:
import pandas as pd
import numpy as np
import xlrd
import os
path = os.getcwd()
path = "path of the directory"
wbname = "name of the excel file"
files = os.listdir(path)
files
wb = xlrd.open_workbook(path + wbname)
# I only need the second sheet
df = pd.read_excel(path + wbname, sheet_name="sheet2", skiprows = 2, header = None,
skipfooter=132)
# here is where all the formatting is happening ...
df
So, "files" is a list with all the relevant file names. Now I need to put one file after the other through a loop (?) so that they all eventually end up in df.
Has anyone ever done something like this, or can anyone help me here?
Something like this might work:
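import os
import pandas as pd
folder = 'path_to_all_xlsx'
list_dfs = []
for file in os.listdir(folder):
    # os.listdir returns bare file names, so join each one with the folder
    df = pd.read_excel(os.path.join(folder, file), sheet_name="sheet2",
                       skiprows=2, header=None, skipfooter=132)  # options from the question
    list_dfs.append(df)
all_dfs = pd.concat(list_dfs)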
You read all the dataframes and add them to a list, and then the concat method adds them all together into one big dataframe.

Loading Multiple Data files from same folder in Python

I am trying to load a large number of data files from the same folder in Python. The ultimate goal here is to simply choose which file I would like to use in calculations, rather than individually opening files.
Here is what I have. This seems to work in opening the data in the files, but I am having a hard time choosing a specific file I want to work with (and assigning a value to each column in each file).
import astropy
import numpy as np
import matplotlib.pyplot as plt
dir = '/S34_east_tfa/'
import glob, os
os.chdir(dir)
for file in glob.glob("*.data"):
    data = np.loadtxt(file)
    print (data)
Time = data[:,0]
Use a python dictionary, instead of overwriting the results in data variable inside your loop.
data_dict = dict()
for file in glob.glob("*.data"):
    data_dict[file] = np.loadtxt(file)
Is this what you were looking for?
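To make the dictionary idea a bit more concrete, here is a small sketch that avoids os.chdir and keys each array by its file name. The function name load_data_files is made up, and it assumes every file is plain numeric text that np.loadtxt can parse.

```python
import glob
import os

import numpy as np


def load_data_files(directory, pattern="*.data"):
    """Return {file name: 2-D array} for every matching file in `directory`."""
    data_by_file = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        # ndmin=2 keeps a single-row file two-dimensional so [:, 0] still works
        data_by_file[os.path.basename(path)] = np.loadtxt(path, ndmin=2)
    return data_by_file
```

After data_by_file = load_data_files('/S34_east_tfa/'), one file's first column could be picked with data_by_file["some_star.data"][:, 0], where "some_star.data" stands in for whatever file names are actually in the folder.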

plotting CSVs that are buried in different directories

I'm attempting to dig through my computer and plot a bunch of CSVs on one plot (I'm using Python 2.7 and Pandas).
While all the CSV files have the same name of file.csv, they are located in a myriad of different folders. I've done the following below where I wrap the CSVs into a dataframe and then plot the dataframe from a certain range of values.
I would like to label each plot as the folder name (i.e. have the legend specify the folder directory that the CSV is located in)
import pandas as pd
from pandas import read_csv
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import os
class do(object):
    def something(self):
        style.use('ggplot')
        file_1 = r'C:\User\me\PathABC\Folder123\file.csv'
        file_2 = r'C:\User\me\PathABC\Folder456\file.csv'
        file_3 = r'C:\User\me\PathABC\Folder789\file.csv'
        file_4 = r'C:\User\me\PathABC\Folder101112\file.csv'
        df1 = pd.read_csv(file_1,header=None)
        df2 = pd.read_csv(file_2,header=None)
        df3 = pd.read_csv(file_3,header=None)
        df4 = pd.read_csv(file_4,header=None)
        plt.plot(df1[0],df1[1],label='Folder123')
        plt.plot(df2[0],df2[1],label='Folder456')
        plt.plot(df3[0],df3[1],label='Folder789')
        plt.plot(df4[0],df4[1],label='Folder101112')
        plt.xlim([200000,800000])
        plt.legend()
        plt.ylabel('Amplitude')
        plt.xlabel('Hz')
        plt.grid(True,color='k')
        plt.show()
x=do()
x.something()
Essentially, I would like to automate this process so that I can search my computer using the following logic:
where file.csv exists, plot it
label plot with folder name of where file.csv came from
Walking a file path is one answer, but you may be able to use glob.glob in simpler cases where the target folders are all at the same depth in the filesystem. For example,
for filename in glob.glob('somewhere/sheets/*/file.csv'):
will iterate over all files called file.csv in any subfolder of somewhere/sheets. If they are all two levels down, glob.glob('somewhere/sheets/*/*/file.csv') will work, and if they are all one or two levels down, you can join the lists from two glob invocations.
Take a look at How to list all files of a directory? by @pycruft, edited by @Martin Thoma. I would use walk to get the full path of all csv files existing in several folders inside a specific path as follows:
from os import walk
from os.path import join,splitext
f = []
for (dirpath, dirnames, filenames) in walk(specific_path):
    for filename in filenames:
        if splitext(filename)[1].upper() == '.CSV':
            f.append(join(dirpath, filename))
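Building on the walk idea, here is a sketch that pairs each file.csv with the name of the folder it sits in, which is what the legend labels need. The function name find_labelled_files is invented for this example, and it only looks for an exact file name match.

```python
import os


def find_labelled_files(root, target_name="file.csv"):
    """Return (folder name, full path) pairs for every `target_name` under `root`."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if target_name in filenames:
            # The last path component becomes the label, e.g. "Folder123"
            label = os.path.basename(os.path.normpath(dirpath))
            matches.append((label, os.path.join(dirpath, target_name)))
    return sorted(matches)
```

Each pair could then be read with pd.read_csv(path, header=None) and drawn with plt.plot(df[0], df[1], label=label), replacing the four hand-written file_N / dfN blocks in the question.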

Read multiple .xlsx files from a directory into separate Pandas data frames based on file name

I want to load multiple xlsx files with varying structures from a directory and assign these their own data frame based on the file name. I have 30+ files with differing structures but for brevity please consider the following:
3 excel files [wild_animals.xlsx, farm_animals.xlsx, domestic_animals.xlsx]
I want to assign each to its own data frame, so if the file name contains 'wild' it is assigned to wild_df, if 'farm' then farm_df, and if 'domestic' then dom_df. This is just the first step in a process, as the actual files contain a lot of 'noise' that needs to be cleaned depending on file type, etc. The file names will also change on a weekly basis, with only a few key markers staying the same.
My assumption is the glob module is the best way to begin to do this but in terms of taking very specific parts of the file extension and using this to assign to a specific df I become a bit lost so any help appreciated.
I asked a similar question a while back but it was part of a wider question most of which I have now solved.
I would parse them into a dictionary of DataFrames:
import os
import glob
import pandas as pd
files = glob.glob('/path/to/*.xlsx')
dfs = {}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
Then you can access them as normal dictionary elements:
dfs['wild_animals']
dfs['domestic_animals']
etc.
You need to get all the xlsx files; then, using a dict comprehension, you can access any element:
import pandas as pd
import os
import glob
path = 'Your_path'
extension = 'xlsx'
os.chdir(path)
result = [i for i in glob.glob('*.{}'.format(extension))]
{elm:pd.ExcelFile(elm) for elm in result}
For completeness, I wanted to show the solution I ended up using. It is very close to Khelili's suggestion, with a few tweaks to suit my particular code, including not creating a DataFrame at this stage:
import os
import pandas as pd
import openpyxl as excel
import glob
#setting up path
path = 'data_inputs'
extension = 'xlsx'
os.chdir(path)
files = [i for i in glob.glob('*.{}'.format(extension))]
#Grouping files - brings multiple files of same type together in a list
wild_groups = ([s for s in files if "wild" in s])
domestic_groups = ([s for s in files if "domestic" in s])
#Sets up a dictionary associated with the file groupings to be called in another module
file_names = {"WILD":wild_groups, "DOMESTIC":domestic_groups}
...
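The grouping step above can be generalised so that the markers are not hard-coded one list at a time. This is only a sketch: group_files_by_keyword is a made-up name, and it assumes a case-insensitive substring match on the file name is enough to identify each group.

```python
import os


def group_files_by_keyword(paths, keywords):
    """Map each keyword to the paths whose file name contains it (case-insensitive)."""
    groups = {keyword: [] for keyword in keywords}
    for path in paths:
        name = os.path.basename(path).lower()
        for keyword in keywords:
            if keyword.lower() in name:
                groups[keyword].append(path)
    return groups
```

With the three example files, group_files_by_keyword(files, ["wild", "farm", "domestic"]) gives one list per marker, and a file whose name contains two markers would appear in both lists.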
