How to save the sum of specific columns from diffrent files - python

I have written a code (thanks to) that groupe the column that I need to remain as it is and sum of the targeted columns:
import pandas as pd
import glob as glob
import numpy as np
#Read excel and Create DF
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
df = pd.read_excel(f,index_col=None, na_values=['NA'])
df['filename'] = f
data = all_data.append(df,ignore_index=True)
#Group and Sum
result = data.groupby(["Date"])["Families","Individuals"].agg([np.sum])
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
the problem is here :
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
the code gives me the result that I want however it only takes into account the last file that it iterates through, I need to save all the sums from different files
thank you

Simply you never change all_data in the loop since it is never re-assigned. Each loop iteration appends to the empty data frame initialized outside loop. So only the very last file is retained. A quick (non-recommended) fix would include:
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
...
all_data = all_data.append(df, ignore_index=True) # CHANGE LAST LINE IN LOOP
# USE all_data (NOT data) aggregation
result = all_data.groupby(...)
However, reconsider growing a data frame inside a loop. As #unutbu warns us: Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying. Instead, the recommended version would be to build a list of data frames to concatenate once outside the loop which you can do so with a list comprehension, even assign for filename:
# BUILD LIST OF DFs
df_list = [(pd.read_excel(f, index_col=None, na_values=['NA'])
.assign(filename = f)
) for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx')]
# CONCATENATE ALL DFs
data = pd.concat(df_list, ignore_index=True)
# AGGREGATE DATA
result = data.groupby(["Date"])["Families", "Individuals"].agg([np.sum])
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)

Related

Iterate through directory and return DataFrame with number of lines per file

I have a directory containing several excel files. I want to create a DataFrame with a list of the filenames, a count of the number of rows in each file, and a min and max column.
Example file 1:
Example file 2:
Desired result:
This is as far as I've gotten:
fileslist = os.listdir(folder)
for file in fileslist:
str = file
if not str.startswith('~$'):
df = pd.read_excel(os.path.join(folder,file), header = 0, sheet_name = 'Main', usecols=['Name','Number'])
NumMax = max(df['Number'])
NumMin = min(df['Number'])
NameCount = df['Name'].count()
From here, I can't figure out how to create the final DataFrame as shown in the above "Desired Result." I'm very new at this and would appreciate any nudge in the right direction.
You're using str wrong. It is a function in Python, but you don't need it at all. Here, you just mean to write file.startswith. Now, to store the data, at each iteration you'll want to append to a list. What you can do is use dictionaries to create the data:
import pandas as pd
fileslist = os.listdir(folder)
data = [] # store the intermediate data in the loop
for file in fileslist:
# no need to assign file to str
if not file.startswith('~$'):
df = pd.read_excel(os.path.join(folder, file), header=0,
sheet_name='Main', usecols=['Name', 'Number'])
NumMax = max(df['Number'])
NumMin = min(df['Number'])
NameCount = df['Name'].count()
data.append(
{ # the dict keys will become pandas column names
'Filename': file, # you probably want to remove the extension here
'Count': NameCount,
'MinNumber': NumMin,
'MaxNumber': NumMax
})
df = pd.DataFrame(data)
From here, you just need to write the data frame to your excel file.
First of all, I would just like to point out that you shouldn't name any variable as "str" as you did here:
str = file
This can cause issues in the future if you ever try to convert some object to a string using the str(object) as you are redefining the method. Also, this redefinition of "file" is unnecessary, so you can just take that out. You did something similar with "file" as that is also a keyword that you are redefining. A name like "file_name" would be better.
As for how to create the final dataframe, it is somewhat simple. I would recommend you use a list and dictionaries and add all the data to that, then create the dataframe. Like this:
fileslist = os.listdir(folder)
# temporary list to store data
data = []
for file_name in fileslist:
if not file_name.startswith('~$'):
df = pd.read_excel(os.path.join(folder,file_name), header = 0, sheet_name = 'Main', usecols=['Name','Number'])
NumMax = max(df['Number'])
NumMin = min(df['Number'])
NameCount = df['Name'].()
# appending row of data with appropriate column names
data.append({'Filename': file_name, 'Count': NameCount, 'MinNumber': NumMin, 'MaxNumber': NumMax})
# creating actual dataframe
df = pd.DataFrame(data)

How to use loops to adjust path names

I'm working on a program which analyses a lot of csv files.
Currently I'm declaring every item manually, but as you can see in my code I'm actually just go +1 in my paths and in the variable-names.
I guess I can simplify this with a loop, just don't know how to do this with the path-names.
My code:
import pandas as pd
import numpy as np
### declation ###
df_primes1 = pd.DataFrame()
df_primes1 = np.array(df_primes1)
df_search1 = pd.DataFrame()
df_primes2 = pd.DataFrame()
df_primes2 = np.array(df_primes2)
df_search2 = pd.DataFrame()
df_primes3 = pd.DataFrame()
df_primes3 = np.array(df_primes3)
df_search3 = pd.DataFrame()
searchterm = '322'
### reads csv in numpy array ###
df_primes1 = pd.read_csv('1/1_Primes_32.csv', delimiter=';', header=None, names='1')
df_primes2 = pd.read_csv('1/2_Primes_32.csv', delimiter=';', header=None, names='2')
df_primes3 = pd.read_csv('1/3_Primes_32.csv', delimiter=';', header=None, names='3')
### sorts prime numbers ###
#df_sorted = df_primes1.sort_values(by='n')
#print(df_sorted)
### searches for number with "searchterm" as start value ###
df_search1 = df_primes1[df_primes1['1'].astype(str).str.startswith(searchterm)]['1']
df_search2 = df_primes2[df_primes2['2'].astype(str).str.startswith(searchterm)]['2']
df_search3 = df_primes3[df_primes3['3'].astype(str).str.startswith(searchterm)]['3']
print(df_search1)
print(df_search2)
print(df_search3)
The program is working, I was just want to know how I can simplify this, because there will be 20+ more files like this.
IIUC, we can use pathlib and a dict comprehension :
from pathlib import Path
p = 'Path/to/your_csv/'
dfs = {
f"search_{i}": pd.read_csv(file, delimiter=";",
header=None,
names=str(i))
for i, file in enumerate(Path(p).glob("*Prime*.csv"), 1)
}
to break down each item,
p is the target folder that holds your csvs
i is an enumerator to loop over your files you will most likely need to add a pre-step of sorting your csvs to get the order you're after.
file is each item that is returned from the generator object. we turn each value into a dataframe.
you can filter each dataframe by your collection i.e
dfs['search_1']
this will return a dataframe.

Import multiple excel files into pandas and create a column based on name of file

I have multiple excel files in one folder which I want to read and concat together,but while concating together I want to add column based on name of the file
'D:\\156667_Report.xls',
'D:\\192059_Report.xls',
'D:\\254787_Report.xls',
'D:\\263421_Report.xls',
'D:\\273554_Report.xls',
'D:\\280163_Report.xls',
'D:\\307928_Report.xls'
I can read these files in pandas with following script
path =r'D:\' # use your path
allFiles = glob.glob(path + "/*.xls")
frame = pd.DataFrame()
list_ = []
for file_ in allFiles:
df = pd.read_excel(file_,index_col=None, header=0)
list_.append(df)
frame = pd.concat(list_)
I want to add column as Code in all the files which I read.Code will be numbers from filename e.g. 156667,192059
why not just match
foo = re.match('\.*_Report', file_)
num = foo[:6]`
df['Code']= num
Inside your loop?
One you could do this, is by using join, isdigit, inside a list comprehension.
The isdigit will get only the numbers from the file name (in a list), and the join function will join them back into 1.
To be clear, you could change your for loop to this:
for file_ in allFiles:
df = pd.read_excel(file_,index_col=None, header=0)
df['Code'] = ''.join(str(i) for i in file_ if i.isdigit())
list_.append(df)
which will add a column called Code in each df.

How to append Dataframe by rows in Python

I would like to merge (using df.append()) some python dataframes by rows.
The code below reported starts by reading all the json files that are in the input json_dir_path, it reads input_fn = json_data["accPreparedCSVFileName"] that contains the full path where the csv file is store and read it in the data frame df_i. When I try to merge df_output = df_i.append(df_output) I do not obtained the desired results.
def __merge(self, json_dir_path):
if os.path.exists(json_dir_path):
filelist = [f for f in os.listdir( json_dir_path )]
df_output = pd.DataFrame()
for json_fn in filelist:
json_full_name = os.path.join( json_dir_path, json_fn )
# print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
if os.path.exists(json_full_name):
with open(json_full_name, 'r') as in_json_file:
json_data = json.load(in_json_file)
input_fn = json_data["accPreparedCSVFileName"]
df_i = pd.read_csv(input_fn)
df_output = df_i.append(df_output)
return df_output
else:
return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
I got only 2 files are merged out of 12. What am I doing wrong?
Any help would be very appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
Also you can concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As #jpp suggested in his answer, you can load the list of dataframes and concatenate them in 1 go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate items of your list in one call. For example:
lst = []
for fn in input_fn:
lst.append(pd.read_csv(fn))
df_output = pd.concat(lst, ignore_index=True)

Reading in a list of files into a list of DataFrames

I'm trying to read a list of files into a list of Pandas DataFrames in Python. However, the code below doesn't work.
files = [file1, file2, file3]
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
dfs = [df1, df2, df3]
# Read in data files
for file,df in zip(files, dfs):
if file_exists(file):
with open(file, 'rb') as in_file:
df = pd.read_csv(in_file, low_memory=False)
print df #the file is getting read properly
print df1 #empty
print df2 #empty
print df3 #empty
How to I get the original DataFrames to update if I pass them into a for-loop as a list of DataFrames?
Try this:
dfs = [pd.read_csv(f, low_memory=False) for f in files]
if you want to check whether file exists:
import os
dfs = [pd.read_csv(f, low_memory=False) for f in files if os.path.isfile(f)]
and if you want to concatenate all of them into one data frame:
df = pd.concat([pd.read_csv(f, low_memory=False)
for f in files if os.path.isfile(f)],
ignore_index=True)
You are not working on the list elements themselves when iterating over them but you are not operating on the list.
You need to insert the elements (or append them) to the list. One possibility could be:
files = [file1, file2, file3]
dfs = [None] * 3 # Just a placeholder
# Read in data files
for i, file in enumerate(files): # Enumeration instead of zip
if file_exists(file):
with open(file, 'rb') as in_file:
dfs[i] = pd.read_csv(in_file, low_memory=False) # Setting the list element
print dfs[i] #the file is getting read properly
This updates the list elements and should work.
Your code seems over complicated you can just do:
files = [file1, file2, file3]
dfs = []
# Read in data files
for file in files:
if file_exists(file):
dfs.append(pd.read_csv(file, low_memory=False))
You will end up with a list of dfs as desired
You can try list comprehension:
files = [file1, file2, file3]
dfs = [pd.read_csv(x, low_memory=False) for x in files if file_exists(x)]
Custom-written Python function that appropriately handles both CSV & JSON files.
def generate_list_of_dfs(incoming_files):
"""
Accepts a list of csv and json file/path names.
Returns a list of DataFrames.
"""
outgoing_files = []
for filename in incoming_files:
file_extension = filename.split('.')[1]
if file_extension == 'json':
with open(filename, mode='r') as incoming_file:
outgoing_json = pd.DataFrame(json.load(incoming_file))
outgoing_files.append(outgoing_json)
if file_extension == 'csv':
outgoing_csv = pd.read_csv(filename)
outgoing_files.append(outgoing_csv)
return outgoing_files
How to Call this Function
import pandas as pd
import json
files_to_be_read = ['filename1.json', 'filename2.csv', 'filename3.json', 'filename4.csv']
dataframes_list = generate_list_of_dfs(files_to_be_read)
Here is a simple solution that avoids using a list to hold all the data frames, if you don't need them in a list.
import fnmatch
# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files
Output which is now a list of the names:
['Feedback Form Submissions 1.21-1.25.22.csv',
'Feedback Form Submissions 1.21.22.csv',
'Feedback Form Submissions 1.25-1.31.22.csv']
Now create a simple list of new names to make working with them easier:
# use a simple format
names = []
for i in range(0,len(files)):
names.append('data' + str(i))
names
['data0', 'data1', 'data2']
You can use any list of names that you want. The next step take the file names and the list of names and then assign them to the names.
# i is the incrementor for the list of names
i = 0
# iterate through the file names
for file in files:
# make an empty dataframe
df = pd.DataFrame()
# load the first file in
df = pd.read_csv(file, low_memory=False)
# get the first name from the list, this will be a string
new_name = names[i]
# assign the string to the variable and assign it to the dataframe
locals()[new_name] = df.copy()
# increment the list of names
i = i + 1
You now have 3 separate dataframes named data0, data1, data2, and do commands like
data2.info()

Categories