I would like to merge (using df.append()) some Python dataframes by rows.
The code below starts by reading all the JSON files in the input json_dir_path. For each one it reads input_fn = json_data["accPreparedCSVFileName"], which contains the full path where the CSV file is stored, and loads that CSV into the data frame df_i. When I try to merge with df_output = df_i.append(df_output), I do not obtain the desired result.
def __merge(self, json_dir_path):
    if os.path.exists(json_dir_path):
        filelist = [f for f in os.listdir(json_dir_path)]
        df_output = pd.DataFrame()
        for json_fn in filelist:
            json_full_name = os.path.join(json_dir_path, json_fn)
            # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
            if os.path.exists(json_full_name):
                with open(json_full_name, 'r') as in_json_file:
                    json_data = json.load(in_json_file)
                input_fn = json_data["accPreparedCSVFileName"]
                df_i = pd.read_csv(input_fn)
                df_output = df_i.append(df_output)
        return df_output
    else:
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
Only 2 of the 12 files end up merged. What am I doing wrong?
Any help would be greatly appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
You can also concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As @jpp suggested in his answer, you can load the dataframes into a list and concatenate them in one go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate items of your list in one call. For example:
lst = []
for fn in csv_file_names:  # the list of CSV paths read from your JSON files
    lst.append(pd.read_csv(fn))

df_output = pd.concat(lst, ignore_index=True)
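Applied to the __merge method from your question, a sketch along these lines (assuming the same JSON layout and class attributes, and at least one valid JSON file in the directory) collects every CSV first and concatenates once:

def __merge(self, json_dir_path):
    # Fall back to an empty frame when the directory is missing.
    if not os.path.exists(json_dir_path):
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
    frames = []
    for json_fn in os.listdir(json_dir_path):
        json_full_name = os.path.join(json_dir_path, json_fn)
        if not os.path.exists(json_full_name):
            continue
        with open(json_full_name, 'r') as in_json_file:
            json_data = json.load(in_json_file)
        # Each JSON file points at the CSV to load.
        frames.append(pd.read_csv(json_data["accPreparedCSVFileName"]))
    return pd.concat(frames, ignore_index=True)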
I am using a for loop to merge CSV files in a Jupyter notebook; however, my result is a list instead of a dataframe. Could someone help me and tell me what I am doing wrong? Thank you in advance.
files = ['babd_light_z1.csv', 'babd_light_z2.csv', 'babd_light_z3.csv']
data = []
for f in files:
    data.append(pd.read_csv(f))

type(data) # returns list
You can simply use pd.concat(data, axis=0, ignore_index=True) outside your loop to merge your csv files as in:
files = ['babd_light_z1.csv', 'babd_light_z2.csv', 'babd_light_z3.csv']
data = []
for f in files:
    data.append(pd.read_csv(f))

df = pd.concat(data, axis=0, ignore_index=True)
type(df) should return pandas.core.frame.DataFrame
Give this a shot:
combined = pd.concat(data, axis=0)
I have written code that groups the column that needs to remain as it is and sums the targeted columns:
import pandas as pd
import glob as glob
import numpy as np

# Read excel and create DF
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    df = pd.read_excel(f, index_col=None, na_values=['NA'])
    df['filename'] = f
    data = all_data.append(df, ignore_index=True)

# Group and sum
result = data.groupby(["Date"])["Families", "Individuals"].agg([np.sum])

# Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The problem is here:
#Save file
file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
The code gives me the result that I want; however, it only takes into account the last file it iterates through. I need to save the sums from all the files.
thank you
You never change all_data in the loop because the result of append is never re-assigned to it. Each iteration appends to the empty data frame initialized outside the loop, so only the very last file is retained. A quick (non-recommended) fix would be:
all_data = pd.DataFrame()
for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx'):
    ...
    all_data = all_data.append(df, ignore_index=True)  # CHANGE LAST LINE IN LOOP

# USE all_data (NOT data) IN AGGREGATION
result = all_data.groupby(...)
However, reconsider growing a data frame inside a loop. As @unutbu warns us: never call DataFrame.append or pd.concat inside a for-loop; it leads to quadratic copying. Instead, the recommended approach is to build a list of data frames and concatenate once outside the loop, which you can do with a list comprehension, even assigning the filename (a quick benchmark sketch follows the example below):
# BUILD LIST OF DFs
df_list = [(pd.read_excel(f, index_col=None, na_values=['NA'])
              .assign(filename=f)
           ) for f in glob.glob(r'C:\Users\Sarah\Desktop\IDPMosul\Data\2014\09\*.xlsx')]

# CONCATENATE ALL DFs
data = pd.concat(df_list, ignore_index=True)

# AGGREGATE DATA
result = data.groupby(["Date"])["Families", "Individuals"].agg([np.sum])

file_name = r'C:\Users\Sarah\Desktop\U2014.csv'
result.to_csv(file_name, index=True)
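To see the quadratic-copying effect yourself, here is a minimal benchmark sketch (the frame sizes and iteration counts are made up for illustration; absolute timings will vary by machine):

import timeit
import pandas as pd

def concat_in_loop(n=200):
    out = pd.DataFrame()
    for _ in range(n):
        # every iteration copies everything accumulated so far
        out = pd.concat([out, pd.DataFrame({'x': range(100)})], ignore_index=True)
    return out

def concat_once(n=200):
    # build all the pieces first, copy once at the end
    frames = [pd.DataFrame({'x': range(100)}) for _ in range(n)]
    return pd.concat(frames, ignore_index=True)

print(timeit.timeit(concat_in_loop, number=5))
print(timeit.timeit(concat_once, number=5))  # typically much faster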
I have multiple files, as shown below. My task is to read all those files, merge them, and create one final dataframe. However, one file (Measurement_table_sep_13th.csv) has to be summarized before being used in the merge; it is too huge, so we summarize it and then merge it.
filenames = sorted(glob.glob('*.csv'))

for f in filenames:
    print(f)
    if f == 'Measurement_table_sep_13th.csv':
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)
        df = df.groupby("person_id", "visit_occurrence_id").pivot("measurement_concept_id").agg(
            F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
            F.count(F.col("value_as_number")), F.stddev(F.col("value_as_number")),
            F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
            F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
    else:
        df = spark.read.csv(f, sep=",", inferSchema=True, header=True)

    try:
        JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
        print(JKeys)
        df_final = df_final.join(df, on=JKeys, how='left')
        print("success in try")
    except:
        df_final = df
        print("success in except")
As you can see, I am summarizing the Measurement_table_sep_13th.csv file before merging, but is there a more elegant and efficient way to write this?
If you do not want to move the one file into a different folder, you can also exclude it directly with glob (following this post: glob exclude pattern):

files = glob.glob('files_path/[!_]*')
You can use this to run glob for all the files except your measurement file and then join them, avoiding the long if-code. It would look like this (following this post: Loading multiple csv files of a folder into one dataframe):
files = glob.glob("[!M]*.csv")
dfs = [pd.read_csv(f, header=True, sep=";", inferShema=True) for f in files]
df2 = pd.concat(dfs,ignore_index=True)
df = spark.read.csv(f, sep=",",inferSchema=True, header=True)
df = df.groupby("person_id","visit_occurrence_id").pivot("measurement_concept_id").agg(F.mean(F.col("value_as_number")), F.min(F.col("value_as_number")), F.max(F.col("value_as_number")),
F.count(F.col("value_as_number")),F.stddev(F.col("value_as_number")),
F.expr('percentile_approx(value_as_number, 0.25)').alias("25_pc"),
F.expr('percentile_approx(value_as_number, 0.75)').alias("75_pc"))
JKeys = ['person_id', 'visit_occurrence_id'] if 'visit_occurrence_id' in df.columns else ['person_id']
df_final = df(df2, on=JKeys, how='left')
I have 100 dataframes (formatted exactly the same) saved on my disk as 100 pickle files. These dataframes are each roughly 250,000 rows long. I want to save all 100 dataframes in 1 dataframe which I want to save on my disk as 1 pickle file.
This is what I am doing so far:
path = '/Users/srayan/Desktop/MyData/Pickle'
df = pd.DataFrame()
for filename in glob.glob(os.path.join(path, '*.pkl')):
    newDF = pd.read_pickle(filename)
    df = df.append(newDF)

df.to_pickle("/Users/srayan/Desktop/MyData/Pickle/MergedPickle.pkl")
I understand that pickle serializes the data frame, but is it necessary for me to take each pickle file, deserialize it, append the data frame, and then serialize it again? Or is there a faster way to do this? With all the data I have, this is getting slow.
You can use a list comprehension, appending each df to a list and calling concat only once:
files = glob.glob('files/*.pkl')
df = pd.concat([pd.read_pickle(fp) for fp in files], ignore_index=True)
which is the same as:
dfs = []
for filename in glob.glob('files/*.pkl'):
    newDF = pd.read_pickle(filename)
    dfs.append(newDF)

df = pd.concat(dfs, ignore_index=True)
A more compact version in one line:
df = pd.concat(map(pd.read_pickle, glob.glob(os.path.join(path, '*.pkl'))))
I'm trying to read a list of files into a list of Pandas DataFrames in Python. However, the code below doesn't work.
files = [file1, file2, file3]
df1 = pd.DataFrame()
df2 = pd.DataFrame()
df3 = pd.DataFrame()
dfs = [df1, df2, df3]
# Read in data files
for file, df in zip(files, dfs):
    if file_exists(file):
        with open(file, 'rb') as in_file:
            df = pd.read_csv(in_file, low_memory=False)
            print df  # the file is getting read properly

print df1  # empty
print df2  # empty
print df3  # empty
How do I get the original DataFrames to update if I pass them into a for-loop as a list of DataFrames?
Try this:
dfs = [pd.read_csv(f, low_memory=False) for f in files]
If you want to check whether a file exists:
import os
dfs = [pd.read_csv(f, low_memory=False) for f in files if os.path.isfile(f)]
and if you want to concatenate all of them into one data frame:
df = pd.concat([pd.read_csv(f, low_memory=False)
for f in files if os.path.isfile(f)],
ignore_index=True)
When iterating like this, you are working on a copy of each reference, not on the list itself: reassigning the loop variable df does not change the list.
You need to assign the elements back into the list (or append them). One possibility could be:
files = [file1, file2, file3]
dfs = [None] * 3  # Just a placeholder

# Read in data files
for i, file in enumerate(files):  # Enumeration instead of zip
    if file_exists(file):
        with open(file, 'rb') as in_file:
            dfs[i] = pd.read_csv(in_file, low_memory=False)  # Setting the list element
            print dfs[i]  # the file is getting read properly
This updates the list elements and should work.
Your code seems overcomplicated; you can just do:
files = [file1, file2, file3]
dfs = []

# Read in data files
for file in files:
    if file_exists(file):
        dfs.append(pd.read_csv(file, low_memory=False))
You will end up with a list of dfs, as desired.
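If you ultimately want a single frame rather than the list, you can then concatenate once outside the loop:

df_all = pd.concat(dfs, ignore_index=True)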
You can try a list comprehension:
files = [file1, file2, file3]
dfs = [pd.read_csv(x, low_memory=False) for x in files if file_exists(x)]
Here is a custom-written Python function that handles both CSV and JSON files.
def generate_list_of_dfs(incoming_files):
    """
    Accepts a list of csv and json file/path names.
    Returns a list of DataFrames.
    """
    outgoing_files = []
    for filename in incoming_files:
        file_extension = filename.split('.')[-1]  # last segment, so dotted names still work
        if file_extension == 'json':
            with open(filename, mode='r') as incoming_file:
                outgoing_json = pd.DataFrame(json.load(incoming_file))
            outgoing_files.append(outgoing_json)
        elif file_extension == 'csv':
            outgoing_csv = pd.read_csv(filename)
            outgoing_files.append(outgoing_csv)
    return outgoing_files
How to Call this Function
import pandas as pd
import json
files_to_be_read = ['filename1.json', 'filename2.csv', 'filename3.json', 'filename4.csv']
dataframes_list = generate_list_of_dfs(files_to_be_read)
Here is a simple solution that avoids using a list to hold all the data frames, if you don't need them in a list.
import os
import fnmatch

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files
Output which is now a list of the names:
['Feedback Form Submissions 1.21-1.25.22.csv',
'Feedback Form Submissions 1.21.22.csv',
'Feedback Form Submissions 1.25-1.31.22.csv']
Now create a simple list of new names to make working with them easier:
# use a simple format
names = []
for i in range(0,len(files)):
names.append('data' + str(i))
names
['data0', 'data1', 'data2']
You can use any list of names that you want. The next step takes the file names and the list of names and then assigns the frames to those names.
# i is the incrementor for the list of names
i = 0
# iterate through the file names
for file in files:
    # load the file into a dataframe
    df = pd.read_csv(file, low_memory=False)
    # get the matching name from the list, this will be a string
    new_name = names[i]
    # assign the dataframe to a variable with that name
    locals()[new_name] = df.copy()
    # increment the index into the list of names
    i = i + 1
You now have 3 separate dataframes named data0, data1, and data2, and can run commands like
data2.info()
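If you would rather not write into locals(), a dictionary keyed by the same names is a safer sketch of the same idea:

# build a dict of dataframes instead of injecting variables into locals()
frames = {}
for i, file in enumerate(files):
    frames['data' + str(i)] = pd.read_csv(file, low_memory=False)

frames['data2'].info()  # equivalent to data2.info() above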