How to concatenate thousands of Pandas DataFrames efficiently? - python

I have a folder /data/csvs which contains ~7000 CSV files each with ~600 lines. Each CSV has a name which contains a timestamp that needs to be preserved e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21 today's date).
I need to:
1. Load each CSV.
2. Add a column in which the timestamp can be saved so I know which file the data came from. The time increases by half a second each row, so this column also shows the hour/minute/second/microseconds.
3. Combine the rows into one table which will span a month of data.
Ideally I'd like the final product to be a DataFrame.
Currently this is what I'm doing:
files = os.listdir('/data/csvs')
csv_names = []
for file_name in files:
    if file_name[-4:] == '.csv':
        csv_names.append(file_name)

to_process = len(csv_names)
for i, csv_name in enumerate(csv_names):
    df = pd.read_csv(f'{csv_folder_path}/{csv_name}')
    df = timestamp(df, csv_name)  # my own helper that adds the timestamp column
    to_process = to_process - 1
    if i == 0:
        concat_df = df
        concat_df.to_feather(path=processed_path)
    else:
        concat_df = pd.concat([concat_df, df])
        if to_process % 100 == 0:
            saved_df = pd.read_feather(path=processed_path)
            concat_df = pd.concat([saved_df, concat_df])
            concat_df.reset_index(drop=True, inplace=True)
            concat_df.to_feather(path=processed_path)
I'm loading each CSV as a DataFrame, adding the timestamp column, concatenating the CSVs 100 at a time (because I thought this would reduce memory usage), and then saving each batch of 100 to a large DataFrame feather file. This is really slow and uses loads of memory.
What is a more efficient way of doing this?

First, you could load your files more efficiently using glob. This saves you iterating over all the files and checking whether the file extension is ".csv":
import glob
import os

src = '/data/csvs'
files = glob.iglob(os.path.join(src, "*.csv"))
Then, read each file into a df inside a generator expression, assigning the basename of the file (without the extension) to a column named timestamp in the same step:
df_from_each_file = (pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0]) for f in files)
And finally, concatenate the dfs into one:
csv_data = pd.concat(df_from_each_file, ignore_index=True)
Hope this helped! I have used a process like this for large amounts of data and found it efficient enough.
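As a possible extension of the above (an untested sketch, not part of the original answer): if the half-second spacing from the question should end up as a real per-row datetime rather than just the file stamp, the file name can be parsed with pd.to_datetime and expanded with pd.to_timedelta. The '%d%m%y' format, the midnight start time, the with_timestamps helper name, and processed_path are all assumptions based on the question.
import os
import glob
import pandas as pd

src = '/data/csvs'
files = sorted(glob.iglob(os.path.join(src, "*.csv")))

def with_timestamps(path):
    # hypothetical helper: parse DDMMYY from the file name, e.g. 261121 -> 2021-11-26
    day = pd.to_datetime(os.path.basename(path).split('.')[0], format='%d%m%y')
    df = pd.read_csv(path)
    # one row every half second, starting at midnight of that day (assumption)
    df['timestamp'] = day + pd.to_timedelta([i * 0.5 for i in range(len(df))], unit='s')
    return df

csv_data = pd.concat((with_timestamps(f) for f in files), ignore_index=True)
csv_data.to_feather(processed_path)  # processed_path: the output path from the question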

Related

Merge csv files based on file names and suffix in Python

First-time poster and fairly new to Python here. I have a collection of more than 1,700 CSV files with 2 columns each. The number and labels of the rows are the same in every file. The files are named with a specific format. For example:
Species_1_OrderA_1.csv
Species_1_OrderA_2.csv
Species_1_OrderA_3.csv
Species_10_OrderB_1.csv
Species_10_OrderB_2.csv
Each imported dataframe is formatted like so:
TreeID Species_1_OrderA_2
0 Bu2_1201_1992 0
1 Bu3_1201_1998 0
2 Bu4_1201_2000 0
3 Bu5_1201_2002 0
4 Bu6_1201_2004 0
.. ... ...
307 Fi141_16101_2004 0
308 Fi142_16101_2006 0
309 Fi143_16101_2008 0
310 Fi144_16101_2010 0
311 Fi147_16101_2015 0
I would like to join the files that correspond to the same species, based on the first column. So, in the end, I would get the files Species_1_OrderA.csv and Species_10_OrderB.csv. Please note that all the species do not necessarily have the same number of files.
This is what I have tried so far.
import os
import glob
import pandas as pd

# Importing csv files from directory
path = '.'
extension = 'csv'
os.chdir(path)
files = glob.glob('*.{}'.format(extension))

# Create a dictionary to loop through each file to read its contents and create a dataframe
file_dict = {}
for file in files:
    key = file
    df = pd.read_csv(file)
    file_dict[key] = df

# Extract the name of each dataframe, convert to a list and extract the relevant
# information (before the 3rd underscore). Compare each of these values to the next and
# if they are the same, append them to a list. This list (in my head, at least) will help
# me merge them using pandas.concat
keys_list = list(file_dict.keys())
group = ''
for line in keys_list:
    type = "_".join(line.split("_")[:3])
    for i in range(len(type) - 1):
        if type[i] == type[i+1]:
            group.append(line[keys_list])
print(group)
However, the last bit is not even working, and at this point, I am not sure this is the best way to deal with my problem. Any pointers on how to solve this will be really appreciated.
--- EDIT:
This is the expected output for the files per species. Ideally, I would remove the rows that have zeros in them, but that can easily be done with awk.
TreeID,Species_1_OrderA_0,Species_1_OrderA_1,Species_1_OrderA_2
Bu2_1201_1992,0,0,0
Bu3_1201_1998,0,0,0
Bu4_1201_2000,0,0,0
Bu5_1201_2002,0,0,0
Bu6_1201_2004,0,0,0
Bu7_1201_2006,0,0,0
Bu8_1201_2008,0,0,0
Bu9_1201_2010,0,0,0
Bu10_1201_2012,0,0,0
Bu11_1201_2014,0,0,0
Bu14_1201_2016,0,0,0
Bu16_1201_2018,0,0,0
Bu18_3103_1989,0,0,0
Bu22_3103_1999,0,0,0
Bu23_3103_2001,0,0,0
Bu24_3103_2003,0,0,0
...
Fi141_16101_2004,0,0,10
Fi142_16101_2006,0,4,0
Fi143_16101_2008,0,0,0
Fi144_16101_2010,2,0,0
Fi147_16101_2015,0,7,0
Try it like this:
import os
import pandas as pd

path = "C:/Users/username"
files = [file for file in os.listdir(path) if file.endswith(".csv")]

dfs = dict()
for file in files:
    # everything before the final _ is the species name
    species = file.rsplit("_", maxsplit=1)[0]
    # read the csv to a dataframe
    df = pd.read_csv(os.path.join(path, file))
    # if you don't have a df for a species, create a new key
    if species not in dfs:
        dfs[species] = df
    # else, merge current df to existing df on the TreeID
    else:
        dfs[species] = pd.merge(dfs[species], df, on="TreeID", how="outer")

# write all dfs to their own csv files
for key in dfs:
    dfs[key].to_csv(f"{key}.csv")
If your goal is to concatenate all the CSVs for each species-order into a consolidated CSV, this is one approach. I haven't tested it so there might be a few errors. The idea is to first use glob, as you're doing, to build a dict of file paths so that all the file paths of the same species-order are grouped together. Then, for each species-order, read all the data into a single table in memory and write it out to a consolidated file.
import pandas as pd
import glob

# Create a dictionary keyed by species_order, valued by a list of files
# i.e. file_paths_by_species_order['Species_10_OrderB'] = ['Species_10_OrderB_1.csv', 'Species_10_OrderB_2.csv']
file_paths_by_species_order = {}
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])  # e.g. 'Species_10_OrderB'
    if species_order not in file_paths_by_species_order:
        file_paths_by_species_order[species_order] = [file_path]
    else:
        file_paths_by_species_order[species_order].append(file_path)

# For each species_order, concat all files and save the info into a new csv
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat(pd.read_csv(file_path) for file_path in file_paths)
    df.to_csv('consolidated_{}.csv'.format(species_order))
There are definitely improvements that can be made, such as using collections.defaultdict and writing one file at a time out to the consolidated file instead of reading them all into memory (see the sketch below).
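For the first of those suggestions, a defaultdict version of the grouping loop could look like this (an untested sketch, using the same illustrative file-name pattern as above):
import glob
from collections import defaultdict

import pandas as pd

# group file paths by species_order, e.g. 'Species_10_OrderB' -> [list of its csv paths]
file_paths_by_species_order = defaultdict(list)
for file_path in glob.glob('*.csv'):
    species_order = "_".join(file_path.split("_")[:3])
    file_paths_by_species_order[species_order].append(file_path)

# concat each group and write one consolidated csv per species_order
for species_order, file_paths in file_paths_by_species_order.items():
    df = pd.concat((pd.read_csv(fp) for fp in file_paths), ignore_index=True)
    df.to_csv('consolidated_{}.csv'.format(species_order), index=False)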

Reading, calculate and group data of several files with pandas

I'm trying to make a small script to automate something at my work. I have a ton of text files that I need to group into a large dataframe to plot after.
The files have a general structure like this:
5.013130280 4258.0
5.039390845 4198.0
... ...
49.944957015 858.0
49.971217580 833.0
What I want to do is:
1. Keep the first column as the common column of the final dataframe (as these values are the same for all files).
2. For the rest of the dataframe, extract the second column of each file, normalize it, and group everything together.
3. Use the file name as the header for the extracted column (from point 2) to use later when plotting the data.
Right now I was only able to do step 2; here is the code:
import os
import pandas as pd
import glob

path = "mypath"
extension = 'xy'
os.chdir(path)
dir = os.listdir(path)
files = glob.glob(path + "/*.xy")

li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    df['int_n'] = df['int'] / df['int'].max()
    li_norm.append(df['int_n'])

norm_files = pd.concat(li_norm, axis=1)
So, is there an easy way to solve this?
Assuming that all of your files have exactly the same length (number of rows) and the same angle values, you don't really need to make a bunch of dataframes and concatenate them all together.
If I'm understanding correctly, you just want a final dataframe with a new column for each file (named after the file) containing the 'int' data, normalized using only the values from that specific file.
On the first file, you can create a dataframe to use as your final output, then just add columns to it for each subsequent file:
for idx, file in enumerate(files):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    # get the filename by splitting the full path and removing the last 3 characters (file extension)
    filename = file.split('\\')[-1][:-3]
    # use the filename itself as the new column name
    df[filename] = df['int'] / df['int'].max()
    if idx == 0:
        # create the norm_files output dataframe on the first file
        norm_files = df[['angle', filename]]
    else:
        # add a column to norm_files for each subsequent file
        norm_files[filename] = df[filename]
You can add a calculated column quite simply, although I'm not sure if that's what you're asking.
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    col = file.split('.')[0]               # file name without the extension
    df[col] = df['int'] / df['int'].max()
    li_norm.append(df[col])
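For reference, here is one way the pieces from the snippets above could be combined into a single pass (an untested sketch; the variable names and column handling are illustrative, and it assumes every file shares the same 'angle' values):
import os
import glob

import pandas as pd

path = "mypath"   # as in the question
files = sorted(glob.glob(os.path.join(path, "*.xy")))

cols = []
for i, file in enumerate(files):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    name = os.path.splitext(os.path.basename(file))[0]   # file name without extension
    if i == 0:
        cols.append(df['angle'])                          # keep the shared angle column once
    cols.append((df['int'] / df['int'].max()).rename(name))

norm_files = pd.concat(cols, axis=1)
norm_files then has 'angle' plus one normalized column per file, named after the file, ready for plotting.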

How do I merge n CSV files (possibly 20-30 files) with 1 big CSV file horizontally (axis=1) using pandas?

I have 20-30 CSV files containing 3 columns like 'id', 'col1', 'col2', 'col3' and 1 big CSV file of 20 GB size that I want to read in chunks and merge with these smaller CSV files.
The bigger CSV file has the columns 'id', 'name', 'zipdeails'.
Both have the 'id' column in the same sequence.
A sample looks like:
'id','name','zipdeails'
1,Ravi,2031345
2,Shayam,201344
3,Priya,20134
.........
1000,Pravn,204324
chunk file 1 looks like
'id','col1','col2','col3'
1,Heat,,
2,Goa,Next,
3,,,Delhi
All the smaller CSV files are of the same length (number of rows), except for the last file, which may be shorter; each has a header. The bigger CSV file they are to be merged with can be broken into chunks whose size equals the length of these smaller files.
So the last chunk looks like:
'id','col1','col2','col3'
1000,Jaipur,Week,Trip
Now the output should look like
'id','name','zipdeails','col1','col2','col3'
1,Ravi,2031345,Heat,NAN,NAN
2,Shayam,201344,Goa,Next,NAN
3,Priya,20134,NAN,NAN,Delhi
.........
1000,Pravn,204324,Jaipur,Week,Trip
I think you need to create a list of DataFrames for all the small files, then read the big file into memory and concat everything together on an index created from the id column:
import glob
import pandas as pd

# concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

# if necessary
# df_big = df_big.set_index('id')

df_fin = pd.concat([df_big] + dfs, axis=1)
A slightly modified solution is possible if the id values are in the same order in all DataFrames, without duplicates, like 1,2,3...N: use the parameter nrows to read only the first rows of the big DataFrame, up to the max length of the smaller DataFrames:
# concat 30 files
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp, index_col=['id']) for fp in files]

len_max = max([x.index.max() for x in dfs])
df_big = pd.read_csv('big_df_file.csv', index_col=['id'], nrows=len_max)

df_fin = pd.concat([df_big] + dfs, axis=1)
EDIT:
# concat 30 files
files = glob.glob('files/*.csv')
# the order of files is important for concat values -
# in the first file id = (1,100), in the second (101, 200)...
print(files)

# set N to the max number of rows per small file
N = 100

# loop over the big file in chunks of size N
for i, x in enumerate(pd.read_csv('files/big.csv', chunksize=N, index_col=['id'])):
    # try/except to avoid errors if there is no matching file in the list
    try:
        df = pd.read_csv(files[i], index_col=['id'])
        df1 = pd.concat([x, df], axis=1)
        print(df1)
        # in the first loop, create the header in the output file
        if i == 0:
            pd.DataFrame(columns=df1.columns).to_csv('files/out.csv')
        # append data to the output file
        df1.to_csv('files/out.csv', mode='a', header=False)
    except IndexError as e:
        print('no files in list')
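One caveat to this chunked approach: glob.glob does not guarantee any particular order, so if the chunk-to-file alignment matters, it is safer to sort the file list first. The numeric-suffix sort key below is only an illustration and assumes file names ending in a number:
import re
import glob

# sort so that chunk i of the big file lines up with files[i];
# assumes names like part_1.csv, part_2.csv, ... (illustrative)
files = sorted(glob.glob('files/*.csv'),
               key=lambda p: int(re.findall(r'\d+', p)[-1]))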

Appending Pickle Files in Python

I have 100 dataframes (formatted exactly the same) saved on my disk as 100 pickle files. These dataframes are each roughly 250,000 rows long. I want to save all 100 dataframes in 1 dataframe which I want to save on my disk as 1 pickle file.
This is what I am doing so far:
import os
import glob
import pandas as pd

path = '/Users/srayan/Desktop/MyData/Pickle'
df = pd.DataFrame()
for filename in glob.glob(os.path.join(path, '*.pkl')):
    newDF = pd.read_pickle(filename)
    df = df.append(newDF)

df.to_pickle("/Users/srayan/Desktop/MyData/Pickle/MergedPickle.pkl")
I understand that pickle serializes the data frame, but is it necessary for me to take my pickle file, deserialize it, append the data frame, and then serialize it again? Or is there a faster way to do this? With all the data I have, I am getting slowed down.
You can use a list comprehension, appending each df to a list, and concat only once:
files = glob.glob('files/*.pkl')
df = pd.concat([pd.read_pickle(fp) for fp in files], ignore_index=True)
which is the same as:
dfs = []
for filename in glob.glob('files/*.pkl'):
    newDF = pd.read_pickle(filename)
    dfs.append(newDF)

df = pd.concat(dfs, ignore_index=True)
A more compact version in one line:
df = pd.concat(map(pd.read_pickle, glob.glob(os.path.join(path, '*.pkl'))))

What is an efficient way to combine hundreds of data files into a single master DataFrame?

As in the title, I have more than 800 data files (all in .csv, each ~0-5 MB in size, and each containing 10 columns with the 1st row being the header) and I want to combine all of them into a single DataFrame. I can append them one by one using Pandas, but it is very time-consuming.
Is there a way to do this faster?
My code:
fname = "CRANlogs/" + ffiles[0]
df = pandas.read_csv(fname,header=0)
for i in range(807)[1:]:
print(i)
fname = "CRANlogs/" + ffiles[i]
temp = pandas.read_csv(fname,header=0)
df = pandas.merge(df,temp,how="outer")
I usually create a list of frames and then use pandas concat()
frames = []
for i in range(807):
    fname = "CRANlogs/" + ffiles[i]
    temp = pandas.read_csv(fname, header=0)
    frames.append(temp)

# and now concat
df = pd.concat(frames)
Do you need the header of each one? If not, it may be faster to convert them all to a NumPy array, use numpy.append, and then convert the result back to a CSV file.
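If you do go the NumPy route this answer describes, a rough sketch could look like the following. It assumes every file has the same purely numeric columns and a one-line header, and it stacks the loaded arrays with np.vstack instead of repeated numpy.append calls, which amounts to the same thing in one step; the paths are illustrative.
import glob

import numpy as np

# assumes every file has the same numeric columns and a one-line header
files = sorted(glob.glob("CRANlogs/*.csv"))
arrays = [np.loadtxt(f, delimiter=",", skiprows=1) for f in files]
combined = np.vstack(arrays)                         # stack all rows into one array
np.savetxt("CRANlogs/combined.csv", combined, delimiter=",")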
