Reduce memory usage of pandas concat for lots of dataframes - python

I have a bunch (15,000+) of small data frames that I need to concatenate column-wise to make one very large (100,000 x 1,000) data frame in pandas. There are two (obvious) concerns: speed and memory usage.
The following is one methodology I've seen highly endorsed on Stack Overflow.
dfList = [df1, df2, ..., df15000] #made by appending in a for loop
df_out = pd.concat(dfList, axis=1)
This is great for speed. It's simple code that is easy to understand. However, it uses a fairly large amount of memory. My understanding is that Pandas' concat function works by making a new big dataframe and then copying all the info over, essentially doubling the amount of memory consumed by the program.
How do I avoid this large memory overhead with minimal reduction in speed?
I tried just adding columns one by one to the first df in a for loop. Great for memory (1+1/15,000), terrible for speed.
Then I came up with the following. I replace the list with a deque and do the concatenation piecewise. It saves memory (4.1 GB vs 5.4 GB on the most recent run) at a manageable speed cost (<30 seconds added to a 5-6 minute total script), but I can't figure out why this saves memory.
import collections

dfDq = collections.deque()
# add all 15,000 dfs to the deque
while len(dfDq) > 2:
    dfDq.appendleft(pd.concat([dfDq.pop(), dfDq.pop(), dfDq.pop()], axis=1))
if len(dfDq) == 2:
    df_out = pd.concat([dfDq.pop(), dfDq.pop()], axis=1)
else:
    df_out = dfDq.pop()
The last step of this piecewise concatenation should still use 2x the memory if my understanding of pd.concat() is correct. What is making this work? While the exact speed and memory numbers quoted above are specific to that one run, the general trend has been the same over numerous runs.
In addition to trying to figure out why the above works, also open to other suggestions for methodology.

Just create the full-size DataFrame in advance:
df = pd.DataFrame(index=pd.RangeIndex(0, N), columns=[...])
Then write to it in sections:
col = 0
for path in paths:
    part = pd.read_csv(path)
    df.iloc[:, col:col + part.shape[1]] = part
    col += part.shape[1]
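A self-contained sketch of the same idea, using synthetic parts in place of CSV files (the sizes, part count, and fill values here are made up for illustration):

```python
import numpy as np
import pandas as pd

N = 1000  # rows in the final frame
# stand-ins for the 15,000 small frames, 3 columns each
parts = [pd.DataFrame(np.ones((N, 3))) for _ in range(5)]

# preallocate the full-size frame once, then fill it in place --
# no second full-size copy is ever made
total_cols = sum(p.shape[1] for p in parts)
df = pd.DataFrame(index=pd.RangeIndex(0, N), columns=range(total_cols), dtype=float)

col = 0
for part in parts:
    # .values sidesteps pandas column alignment when writing into the slice
    df.iloc[:, col:col + part.shape[1]] = part.values
    col += part.shape[1]
```

This keeps peak memory at roughly one full-size frame plus one part, rather than two full-size frames.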

Related

Improving Python code performance when comparing strings using Levenshtein in pandas

I have this code that functions properly and produces the result I am looking for:
from thefuzz import fuzz
import pandas as pd
df = pd.read_csv('/folder/folder/2011_05-rc.csv', dtype=str, lineterminator='\n')
df_compare = pd.DataFrame(
df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
for i in df_compare.index:
    for j in df_compare.columns[i:]:
        df_compare.iloc[i, j] = 0
df[df_compare.max(axis=1) < 75].to_csv('/folder/folder/2011_05-ready.csv', index=False)
print('Done did')
However, since string comparison is a very costly operation, the script is very slow and only works on relatively small CSV files with 5,000-7,000 rows. Anything larger (over 12 MB) takes days before throwing a memory-related error message. I attempted running it with modin on 32 cores with 32 GB of memory, but it did not change anything and I ended up with the same result.
import glob
from thefuzz import fuzz
import modin.pandas as pd
files = glob.glob('/folder/folder/2013/*.csv')
for file in files:
    df = pd.read_csv(file, dtype=str, lineterminator='\n')
    df_compare = pd.DataFrame(
        df['text'].apply(lambda row: [fuzz.partial_ratio(x, row) for x in df['text']]).to_list())
    for i in df_compare.index:
        for j in df_compare.columns[i:]:
            df_compare.iloc[i, j] = 0
    df[df_compare.max(axis=1) < 75].to_csv(f'{file[:-4]}-done.csv', index=False)
    print(f'{file} has been done')
It works on smaller files running as a separate job, but files are too many to do it all separately. Would there be a way to optimise this code or some other possible solution?
The data is a collection of tweets, and only one column (out of around 30) is being compared. It looks like this:
ID      Text
11213   I am going to the cinema
23213   Black is my favourite colour
35455   I am going to the cinema with you
421323  My friends think I am a good guy.
It appears that the requirement is to compare each sentence against every other sentence. Given that overall approach, I don't think there is a good answer: you are looking at n^2 comparisons, and as your row count gets large the overall processing requirements turn into a monster very quickly.
To gauge feasibility, you could run some smaller tests, calculating n^2 for each test to get an evaluated-rows-per-second metric. Then calculate n^2 for the big datasets you want to process to estimate the required processing time, assuming your memory can handle it. There may be existing work on handling n^2 problems; it might be worth looking around for something like that.
You are also doing more than twice the work you need to: you compare every pair twice, and you compare each sentence against itself. Even if you only do the combinations, though, n(n-1)/2 is still monstrous when things get large.
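The halving described above comes for free with itertools.combinations, which yields each unordered pair exactly once and never pairs an element with itself (the scoring function below is a stand-in for fuzz.partial_ratio):

```python
from itertools import combinations

texts = ["I am going to the cinema",
         "Black is my favourite colour",
         "I am going to the cinema with you",
         "My friends think I am a good guy."]

def similarity(a, b):
    # stand-in for fuzz.partial_ratio; any symmetric score works here
    return len(set(a.split()) & set(b.split()))

# n*(n-1)/2 comparisons instead of n^2
pairs = list(combinations(range(len(texts)), 2))
scores = {(i, j): similarity(texts[i], texts[j]) for i, j in pairs}
print(len(pairs))  # 6 pairs for n=4, versus 16 for the full n^2
```

This only halves the constant, of course; the growth is still quadratic.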

Apply transformation over dataframe instead of element-wise (for audio files)

I am dealing with a data frame with over 8000 rows. Each tuple consists of a path to an audio file for which a spectrogram needs to be created. One solution would be to use itertuples, iterating row-wise and converting each audio sample one at a time.
D = []
for row in valid_data.itertuples():
    y, sr = librosa.load('code/UrbanSound8K/audio/' + row.path, duration=2.97)
    ps = librosa.feature.melspectrogram(y=y, sr=sr)
    D.append((ps, row.classID))
But I'm not a fan of iterating over a dataframe. Is there a more efficient way to do this by applying an operation over all rows at the same time?
I don't know if it is going to help you, but my idea is to use threads (available in many OO languages). I don't know if your computer will be able to handle all the rows at once, but you could work on, for example, 4 rows at the same time; of course, you would have to modify the loop accordingly.
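A minimal sketch of that suggestion with concurrent.futures, processing 4 rows at a time. The load_spectrogram function is a placeholder for the librosa calls in the question, and the paths are made up:

```python
from concurrent.futures import ThreadPoolExecutor

def load_spectrogram(path):
    # placeholder for: y, sr = librosa.load(path, duration=2.97)
    #                  return librosa.feature.melspectrogram(y=y, sr=sr)
    return 'spectrogram-for-' + path

paths = ['audio/a.wav', 'audio/b.wav', 'audio/c.wav', 'audio/d.wav']

# 4 worker threads handle 4 rows concurrently; map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    D = list(pool.map(load_spectrogram, paths))
```

Threads help here mainly because audio decoding is I/O- and C-library-bound; for work that stays in pure Python, a ProcessPoolExecutor would be the analogous choice.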

(Py)Spark combineByKey mergeCombiners output type != mergeCombinerVal type

I'm trying to optimize a piece of software written in Python using pandas DataFrames. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client
This works fine BUT very few clients have really HUGE amount of data. So I need to save memory when creating their DF.
In order to do this I'm performing a groupBy() (actually a combineByKey, but logically it's a groupBy) and then for each group (aka now a single Row of an RDD) I build a list and from it, a pandas DF.
However, this makes many copies of the data (RDD rows, the list, and the pandas DF...) in a single task/node and crashes. I would like to avoid making that many copies on a single node.
I was thinking on a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking if y is a pandas DF already, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        pandasDF = pandasDF.append(y)
        return pandasDF
    else:
        x.append(y)
        return x
My question: the docs say nothing about this, but does anyone know if it's safe to assume this will work? (The return type of merging two combiners is not the same as the combiner's type.) I could control the datatype in mergeCombinerVal too if the number of "bad" calls is marginal, but appending to a pandas DF row by row would be very inefficient.
Any better idea for what I want to do?
Thanks!
PS: Right now I'm packing Spark Rows; would switching from Spark Rows to plain Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps with packing the rows, because my rows are small but there are many of them), and it also lets me group the values into a "real" Python list (groupByKey groups into some kind of Iterable, which pandas doesn't support and which forces me to create another copy of the structure, doubling memory usage and crashing). That helps with memory management when packing the values into pandas/C datatypes.
Now I can use those lists to build a dataframe directly without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have used a little less memory (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs..., so better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I process most of my clients in a very parallel, low-memory-per-executor job and the biggest ones in a not-so-parallel, high-memory-per-executor job, deciding between them based on a pre-count I perform.
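A minimal pure-Python sketch of the list-based combiner contract this answer settles on (Spark itself is not involved; this only illustrates that all three functions keep the accumulator a plain list, which pandas can then consume directly):

```python
def create_combiner(val):
    return [val]

def merge_value(acc, val):
    acc.append(val)
    return acc

def merge_combiners(a, b):
    # both sides stay plain lists, so the output type matches the input type
    a.extend(b)
    return a

# simulate one key's values arriving in two partitions
part1 = [(1, 'a'), (1, 'b')]
part2 = [(1, 'c')]

def combine(partition):
    acc = create_combiner(partition[0][1])
    for _, v in partition[1:]:
        acc = merge_value(acc, v)
    return acc

rows = merge_combiners(combine(part1), combine(part2))
# rows is a real list, ready to hand to pd.DataFrame(rows) in one shot
```

Because the accumulator type never changes, there is no reliance on undocumented combineByKey behaviour, unlike the DF-returning mergeCombiners in the question.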

csvstat takes a much longer time to run than pandas

I am trying to get the number of unique values in a specific column in a csv (~10 GB) and am looking for the fastest way to do that. I expected command line tools like csvstat to run faster than pandas, but:
def get_uniq_col_count(colname):
    df = pd.read_csv('faults_all_main_dp_1_joined__9-4-15.csv', engine='c',
                     usecols=[colname], nrows=400000)
    df.columns = ['col']
    return len(set(df.col.values)), list(set(df.col.values))

t1 = datetime.datetime.now()
count, uniq = get_uniq_col_count('model')
print(datetime.datetime.now() - t1)
# 0:00:04.386585
vs.
$ time csvcut -c model faults_all_main_dp_1_joined__9-4-15.csv | head -n 400000 | csvstat --unique
3
real 1m3.343s
user 1m3.212s
sys 0m0.237s
(I am taking only the head because when I let csvstat run on the whole dataset, I went out for lunch, came back, and it was still running. It took pandas 50 sec to finish.)
I wonder if I am doing something wrong, and in general, if there is a way to speed up the process. (There are about 5 million rows to read through for each column.)
This is not surprising. The csvkit tools are written entirely in Python and do not use any optimization tricks. In particular, csvstat will read the entire table into memory and store multiple copies of it as regular Python lists. This has huge overhead, both in memory and in garbage-collector time.
On the other hand, pandas uses numpy, which means the whole column uses only a few Python objects and has almost no memory overhead. You may be able to make your program slightly faster by using the pandas-specific unique method (Find unique values in a Pandas dataframe, irrespective of row or column location).
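For reference, the pandas-specific approach looks like this (the column name and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'model': ['A', 'B', 'A', 'C', 'B', 'A']})

# nunique()/unique() run over the numpy array directly,
# avoiding a Python set built one element at a time
count = df['model'].nunique()
uniq = df['model'].unique().tolist()
```

In the question's function, `df['model'].nunique()` would replace `len(set(df.col.values))`.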
If you are going to be doing this a lot, convert the data to a more efficient format. This page: http://matthewrocklin.com/blog/work/2015/03/16/Fast-Serialization/ shows that once you convert your data to HDF5, it will be much faster to load.

Nested Iteration of HDF5 using PyTables

I have a fairly large dataset that I store in HDF5 and access using PyTables. One operation I need to perform on this dataset is pairwise comparison between each of the elements. This requires two loops: one to iterate over each element, and an inner loop to iterate over every other element. The operation thus performs N(N-1)/2 comparisons.
For fairly small sets I found it faster to dump the contents into a multidimensional numpy array and then do my iteration. For large sets I run into memory issues and need to access each element of the dataset at run time.
Putting the elements into an array gives me about 600 comparisons per second, while operating on hdf5 data itself gives me about 300 comparisons per second.
Is there a way to speed this process up?
Example follows (this is not my real code, just an example):
Small Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    elements = np.empty((N_elements, int(1e5)))
    for ii, d in enumerate(data):
        elements[ii] = d['element']

D = np.empty((N_elements, N_elements))
for ii in xrange(N_elements):
    for jj in xrange(ii + 1, N_elements):
        D[ii, jj] = compare(elements[ii], elements[jj])
Large Set:
with tb.openFile(h5_file, 'r') as f:
    data = f.root.data
    N_elements = len(data)
    D = np.empty((N_elements, N_elements))
    for ii in xrange(N_elements):
        for jj in xrange(ii + 1, N_elements):
            D[ii, jj] = compare(data['element'][ii], data['element'][jj])
Two approaches I'd suggest here:
numpy memmap: create a memory-mapped array, put the data inside it, and then run the "Small Set" code against it. Memory maps behave almost like in-memory arrays.
Use the multiprocessing module for parallel processing: if the compare method consumes at least a noticeable amount of CPU time, you can use more than one process. Assuming you have more than one core in your CPU, this will speed things up significantly. Use
one process to read the data from the HDF5 file and put it into a queue,
one or more processes to grab items from the queue, do the comparison, and put the results into an output queue,
one process to collect the results again.
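A sketch of the memmap variant, with synthetic data standing in for the HDF5 table and a placeholder compare function (sizes and names are made up):

```python
import os
import tempfile
import numpy as np

N, width = 100, 50
path = os.path.join(tempfile.mkdtemp(), 'elements.dat')

# back the big element array with a file on disk instead of RAM
elements = np.memmap(path, dtype='float64', mode='w+', shape=(N, width))
elements[:] = np.random.rand(N, width)  # in real use, fill row by row from the HDF5 table
elements.flush()

def compare(a, b):
    # stand-in for the real comparison function
    return float(np.abs(a - b).sum())

# the "Small Set" style double loop now runs against the memmap
D = np.zeros((N, N))
for ii in range(N):
    for jj in range(ii + 1, N):
        D[ii, jj] = compare(elements[ii], elements[jj])
```

The OS pages chunks of the file in and out as the loop touches them, so the working set stays bounded even when the full array would not fit in memory.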
Before choosing an approach: "Know your enemy", i.e., use profiling! Optimizations are only worth the effort if you improve the bottlenecks, so first find out which methods consume your precious CPU time.
Your algorithm is O(n^2), which is not good for large data. Don't you see any chance to reduce this, e.g., by applying some logic? That is always the best approach.
Greetings,
Thorsten
