Appending Pickle Files in Python

I have 100 dataframes (formatted exactly the same) saved on my disk as 100 pickle files. These dataframes are each roughly 250,000 rows long. I want to save all 100 dataframes in 1 dataframe which I want to save on my disk as 1 pickle file.
This is what I am doing so far:
path = '/Users/srayan/Desktop/MyData/Pickle'
df = pd.DataFrame()
for filename in glob.glob(os.path.join(path, '*.pkl')):
    newDF = pd.read_pickle(filename)
    df = df.append(newDF)
df.to_pickle("/Users/srayan/Desktop/MyData/Pickle/MergedPickle.pkl")
I understand that pickle serializes the dataframe, but is it necessary for me to take my pickle file, deserialize it, append the dataframe, and then serialize it again? Or is there a faster way to do this? With all the data I have, this is getting slow.

You can use a list comprehension to read each df into a list and call concat only once:
files = glob.glob('files/*.pkl')
df = pd.concat([pd.read_pickle(fp) for fp in files], ignore_index=True)
which is the same as:
dfs = []
for filename in glob.glob('files/*.pkl'):
    newDF = pd.read_pickle(filename)
    dfs.append(newDF)
df = pd.concat(dfs, ignore_index=True)

A more compact version in one line:
df = pd.concat(map(pd.read_pickle, glob.glob(os.path.join(path, '*.pkl'))))
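To finish the job from the question, either version can be written back out with to_pickle. A minimal end-to-end sketch (paths taken from the question; note the merged file lands in the same folder, so exclude or move it before re-running the glob):
import glob
import os
import pandas as pd

path = '/Users/srayan/Desktop/MyData/Pickle'
files = glob.glob(os.path.join(path, '*.pkl'))

# Read every pickle once, concatenate once, then serialize the result once.
df = pd.concat((pd.read_pickle(f) for f in files), ignore_index=True)
df.to_pickle(os.path.join(path, 'MergedPickle.pkl'))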

Related

How to concatenate thousands of Pandas DataFrames efficiently?

I have a folder /data/csvs which contains ~7000 CSV files each with ~600 lines. Each CSV has a name which contains a timestamp that needs to be preserved e.g. /data/csvs/261121.csv, /data/csvs/261122.csv (261121 being 26/11/21 today's date).
I need to:
Load each CSV.
Add a column in which the timestamp can be saved so I know which file the data came from. The time increases by half a second each row so this row also shows the hour/minute/second/microseconds.
Combine the rows into one table which will span a month of data.
Ideally I'd like the final product to be a DataFrame.
Currently this is what I'm doing:
files = os.listdir('/data/csvs')
csv_names = []
for file_name in files:
    if file_name[-4:] == '.csv':
        csv_names.append(file_name)

to_process = len(csv_names)
for i, csv_name in enumerate(csv_names):
    df = pd.read_csv(f'{csv_folder_path}/{csv_name}')
    df = timestamp(df, csv_name)
    to_process = to_process - 1

    if i == 0:
        concat_df = df
        concat_df.to_feather(path=processed_path)
    else:
        concat_df = pd.concat([concat_df, df])
        if to_process % 100 == 0:
            saved_df = pd.read_feather(path=processed_path)
            concat_df = pd.concat([saved_df, concat_df])
            concat_df.reset_index(drop=True, inplace=True)
            concat_df.to_feather(path=processed_path)
I'm loading in each CSV as a DataFrame, adding the timestamp column and concatenating the CSVs 100 at a time (because I thought this would reduce memory usage) and then saving 100 CSVs at a time to a large DataFrame feather file. This is really slow and uses loads of memory.
What is a more efficient way of doing this?
First, you can load your files more efficiently using glob. This saves you from iterating over all the files and checking whether the file extension is ".csv".
import glob
import os

src = '/data/csvs'
files = glob.iglob(os.path.join(src, "*.csv"))
Then, read each file into a df inside a generator expression, assigning the basename of the file (without the extension) to a column named timestamp in the same step:
df_from_each_file = (pd.read_csv(f).assign(timestamp=os.path.basename(f).split('.')[0]) for f in files)
And finally concatenate the dfs into one
csv_data = pd.concat(df_from_each_file, ignore_index=True)
Hope this helped! I have used a process like this for large amounts of data and found it efficient enough.
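The question also mentions that rows advance by half a second; if the full per-row timestamp is wanted, the same idea extends to a small helper. This is only a sketch, assuming the file name encodes the date as ddmmyy and each file starts at the beginning of that day:
import glob
import os
import pandas as pd

src = '/data/csvs'

def load_one(path):
    # Date parsed from the file name (assumed format ddmmyy, e.g. 261121.csv -> 2021-11-26)
    base = pd.to_datetime(os.path.basename(path).split('.')[0], format='%d%m%y')
    df = pd.read_csv(path)
    # One row every half second, counted from the start of the file's day (an assumption)
    df['timestamp'] = base + pd.to_timedelta(df.index * 0.5, unit='s')
    return df

month = pd.concat((load_one(f) for f in glob.iglob(os.path.join(src, '*.csv'))),
                  ignore_index=True)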

Need to reduce memory usage when using pd.concat() on multiple df's

I need to read in multiple large .csv's (20k rows x 6k columns) and store them in a dataframe.
This thread has excellent examples that have worked for me in the past with smaller files.
Such as:
pd.concat((pd.read_csv(f,index_col='Unnamed: 0') for f in file_list))
Other more direct approaches that I have attempted is:
frame = pd.DataFrame()
list_ = []
for file_ in file_list:
    print(file_)
    df = pd.read_csv(file_, index_col=0)
    list_.append(df)
df = pd.concat(list_)
However all the solutions revolve around creating a list of all the csv files as individual df's and then using pd.concat() at the end over all of the df's.
As far as I can tell it's this approach which is causing a memory error when concat'ing ~20 of these df's.
How could I get past this and perhaps append each df as I go?
Example of file_list:
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_26.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_30.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_25.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_19.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_27.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_18.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_28.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_23.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_03.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_24.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_29.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_04.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_20.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_22.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_06.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_05.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_01.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_06_02.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_31.csv
/realtimedata/orderbooks/bitfinex/btcusd/bitfinex_btcusd_orderbook_2018_05_21.csv
Your CSVs are still manageably sized, so I would assume the issue is with misaligned headers.
I'd recommend reading in your DataFrames without any header, so concatenation is aligned.
list_ = []
for file_ in file_list:
    df = pd.read_csv(file_, index_col=0, skiprows=1, header=None)
    list_.append(df)
df = pd.concat(list_)
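If the headers do line up and memory is still tight, a further option is to downcast the numeric columns of each frame before concatenating, which roughly halves the footprint versus the default float64. A sketch, assuming the data columns are floats:
import pandas as pd

dfs = []
for file_ in file_list:
    df = pd.read_csv(file_, index_col=0, skiprows=1, header=None)
    # Downcast float64 columns to float32 before holding many frames in memory at once
    float_cols = df.select_dtypes('float64').columns
    df[float_cols] = df[float_cols].astype('float32')
    dfs.append(df)
df = pd.concat(dfs, copy=False)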

How to append Dataframe by rows in Python

I would like to merge (using df.append()) some python dataframes by rows.
The code reported below starts by reading all the json files in the input json_dir_path. For each one it reads input_fn = json_data["accPreparedCSVFileName"], which contains the full path where the csv file is stored, and reads that file into the data frame df_i. When I try to merge with df_output = df_i.append(df_output) I do not obtain the desired results.
def __merge(self, json_dir_path):
    if os.path.exists(json_dir_path):
        filelist = [f for f in os.listdir(json_dir_path)]
        df_output = pd.DataFrame()
        for json_fn in filelist:
            json_full_name = os.path.join(json_dir_path, json_fn)
            # print("[TrainficationWorkflow::__merge] We are merging the json file ", json_full_name)
            if os.path.exists(json_full_name):
                with open(json_full_name, 'r') as in_json_file:
                    json_data = json.load(in_json_file)
                    input_fn = json_data["accPreparedCSVFileName"]
                    df_i = pd.read_csv(input_fn)
                    df_output = df_i.append(df_output)
        return df_output
    else:
        return pd.DataFrame(data=[], columns=self.DATA_FORMAT)
I got only 2 files merged out of 12. What am I doing wrong?
Any help would be very appreciated.
Best Regards,
Carlo
You can also set ignore_index=True when appending.
df_output = df_i.append(df_output, ignore_index=True)
Also you can concatenate the dataframes:
df_output = pd.concat((df_output, df_i), axis=0, ignore_index=True)
As #jpp suggested in his answer, you can load the list of dataframes and concatenate them in 1 go.
I strongly recommend you do not concatenate dataframes in a loop.
It is much more efficient to store your dataframes in a list, then concatenate items of your list in one call. For example:
lst = []
for fn in input_fn:
    lst.append(pd.read_csv(fn))
df_output = pd.concat(lst, ignore_index=True)

What is an efficient way to combine hundreds of data files into a single master DataFrame?

As in the title, I have more than 800 data files (all in .csv) (each with size ~ 0-5MB, and each file contains 10 columns with 1st row being the header) and I want to combine all of them into a single DataFrame. I can append them one by one using Pandas data frame but it is very time consuming.
Is there a way to do this faster?
My code:
fname = "CRANlogs/" + ffiles[0]
df = pandas.read_csv(fname,header=0)
for i in range(807)[1:]:
    print(i)
    fname = "CRANlogs/" + ffiles[i]
    temp = pandas.read_csv(fname, header=0)
    df = pandas.merge(df, temp, how="outer")
I usually create a list of frames and then use pandas concat()
frames = []
for i in range(807):
fname = "CRANlogs/" + ffiles[i]
temp = pandas.read_csv(fname,header=0)
frames.append(temp)
#and now concat
df = pd.concat(frames)
Do you need the header of each one? If not, it may be faster to convert them all to numpy arrays, stack them with numpy, and then write the result back to a csv file.
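A rough sketch of that numpy route, assuming every file is purely numeric apart from one header row (ffiles being the asker's list of file names):
import numpy as np

# Load each file as a numeric array, skipping the header row
arrays = [np.loadtxt("CRANlogs/" + f, delimiter=",", skiprows=1) for f in ffiles]

# Stack them row-wise in a single call rather than appending repeatedly
combined = np.vstack(arrays)
np.savetxt("CRANlogs/combined.csv", combined, delimiter=",")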

Pandas: import multiple csv files into dataframe using a loop and hierarchical indexing

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.
Example file:
Events
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43
5,0.62,0.21,0.77,0.44,0.16
Here is what I have so far:
# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob("*.csv"):
    filelist.append(files)
# read each csv file into single dataframe and add a filename reference column
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist):
    key = "file%i" % c
    frame = pd.read_csv((my_dir + f), skiprows=1, index_col=0, names=columns)
    frame['key'] = key
    df = df.append(frame, ignore_index=True)
(the indexing isn't working properly)
Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv",
                  skiprows=1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])
I have found many related links, however I am still not able to get this to work:
Reading Multiple CSV Files into Python Pandas Dataframe
Merge of multiple data frames of different number of columns into one big data frame
Import multiple csv files into pandas and concatenate into one DataFrame
You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:
Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
Items that belong to the same row index across files are placed side by side, under their respective columns.
The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:
from pandas import *
files = !ls *.csv # IPython magic
d = concat([read_csv(f, index_col=0, header=None).T for f in files], keys=files)
Notice that each file is transposed with .T after reading, so the concatenation effectively happens along the column axis, preserving the names. If you need, you can transpose the resulting DataFrame back with d.T.
EDIT:
For a different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:
def reader(f):
    d = read_csv(f, index_col=0, header=None).T
    d.columns = range(d.shape[1])
    return d
df = concat([reader(f) for f in files], keys=files)
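With keys=files the result carries a two-level index, so the rows that came from a single source file can be pulled back out with .loc:
# Rows originating from the first file only
first = df.loc[files[0]]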
