Pandas and Multiprocessing Memory Management: Splitting a DataFrame into Multiple Chunks - python

I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds). So I had the idea to split up the frame into chunks and process each chunk in parallel using multiprocessing. This does speed up the task, but the memory consumption is a nightmare.
Although each child process should in principle only consume a tiny chunk of the data, it needs (almost) as much memory as the original parent process that contained the original DataFrame. Even deleting the used parts in the parent process does not help.
I wrote a minimal example that replicates this behavior. The only thing it does is create a large DataFrame with random numbers, chunk it into pieces of at most 100 rows, and print some information about each chunk during multiprocessing (here via an mp.Pool of size 4).
The main function that is executed in parallel:
def just_wait_and_print_len_and_idx(df):
    """Waits for one second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate some work
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))
The helper generator to chunk a DataFrame into little pieces:
def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0  # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)
And the main routine:
def main():
    # Job parameters
    n_jobs = 4            # Pool size
    size = (10000, 1000)  # Size of DataFrame
    chunksize = 100       # Maximum size of a frame chunk

    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')

    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))

    pool.close()
    pool.join()
    print('DONE')
The standard output looks like this:
Starting MP
Preparing chunk 1
Preparing chunk 2
First idx 0, last idx 99 and len 100 from process 9913
First idx 100, last idx 199 and len 100 from process 9914
Preparing chunk 3
First idx 200, last idx 299 and len 100 from process 9915
Preparing chunk 4
...
DONE
The Problem:
The main process needs about 120 MB of memory. However, the child processes of the pool need about the same amount of memory, although they only contain 1% of the original DataFrame (chunks of size 100 vs. an original length of 10000). Why?
What can I do about it? Does Python (3) send the whole DataFrame to each child process despite my chunking? Is that a problem of pandas memory management or the fault of multiprocessing and data pickling? Thanks!
Whole script for simple copy and paste in case you want to try it yourself:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import os


def just_wait_and_print_len_and_idx(df):
    """Waits for one second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate some work
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))


def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0  # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)


def main():
    # Job parameters
    n_jobs = 4            # Pool size
    size = (10000, 1000)  # Size of DataFrame
    chunksize = 100       # Maximum size of a frame chunk

    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')

    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))

    pool.close()
    pool.join()
    print('DONE')


if __name__ == '__main__':
    main()

OK, so I figured it out after the hint from Sebastian Opałczyński in the comments.
The problem is that the child processes are forked from the parent, so all of them start with a reference to the original DataFrame. However, the frame is mutated in the parent process (dropping the already-used rows), so copy-on-write gradually forces the shared memory pages to be duplicated, and eventually the limit of the physical memory is reached.
There is a simple solution: Instead of pool = mp.Pool(n_jobs), I use the new context feature of multiprocessing:
ctx = mp.get_context('spawn')
pool = ctx.Pool(n_jobs)
This guarantees that the Pool processes are just spawned and not forked from the parent process. Accordingly, none of them has access to the original DataFrame and all of them only need a tiny fraction of the parent's memory.
Note that the mp.get_context('spawn') is only available in Python 3.4 and newer.
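For reference, this is how the main() from the question looks with the spawn context (a minimal sketch reusing the question's just_wait_and_print_len_and_idx and df_chunking; only the pool creation changes):

import multiprocessing as mp
import numpy as np
import pandas as pd


def main():
    n_jobs = 4
    size = (10000, 1000)
    chunksize = 100

    df = pd.DataFrame(np.random.rand(*size))

    # Spawned workers start with a fresh interpreter and do not inherit
    # the parent's memory, so each one only holds its own chunk.
    ctx = mp.get_context('spawn')
    pool = ctx.Pool(n_jobs)
    print('Starting MP')

    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))

    pool.close()
    pool.join()
    print('DONE')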

A better implementation is to use pandas' own chunked reading, which yields the DataFrame as a generator of chunks, and feed that directly into the pool.imap function:
pd.read_csv('<filepath>.csv', chunksize=<chunksize>)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Benefit: the whole DataFrame is never read into the main process (saving memory), and each child process receives only the chunk it needs, which solves the child memory issue.
Overhead: it requires you to save your DataFrame as a CSV first and read it back in with pd.read_csv, which costs I/O time.
Note: chunksize is not available for pd.read_pickle or other loading methods that compress the data on storage.
def main():
    # Job parameters
    n_jobs = 4       # Pool size
    chunksize = 100  # Maximum size of a frame chunk

    # Preparation
    pool = mp.Pool(n_jobs)
    print('Starting MP')

    # Read the frame lazily in chunks instead of building it in memory
    df_chunked = pd.read_csv('<filepath>.csv', chunksize=chunksize)  # modified

    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunked)  # modified

    pool.close()
    pool.join()
    print('DONE')
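If the data only exists in memory to begin with, the CSV round trip could look roughly like this (a sketch; 'data.csv' is just a placeholder path):

import numpy as np
import pandas as pd

# Write the frame to disk once, drop the in-memory copy, then stream it back in chunks
df = pd.DataFrame(np.random.rand(10000, 1000))
df.to_csv('data.csv', index=False)  # 'data.csv' is a placeholder path
del df                              # free the in-memory copy before starting the pool

for chunk in pd.read_csv('data.csv', chunksize=100):
    print(len(chunk))               # each chunk is an ordinary DataFrame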

Related

Pool.map running out of memory - when to close and join

I am parallelizing ingestion and analysis of a dataset using Pool.map. However, I keep running into an 'out of memory' error. This is despite working on a cluster with 100 GB of memory on a compressed file of ~25 GB (the uncompressed data is likely larger than 100 GB). I believe I am running out of memory because I am joining and closing processes incorrectly, i.e. all the chunks are being stored in memory. Here is my code:
import pandas as pd
from functools import reduce
from multiprocessing import Pool
from scipy import stats

## load df
df_chunk = pd.read_csv(f'{file}', sep='\t', chunksize=10000)

## parallel process
pool = Pool(16)
processed_results = pool.map(conduct_analysis, df_chunk)

## merge all results together
df_all = reduce(merge_results, processed_results)

pool.close()
pool.join()
With the following function definition for the analysis:
def conduct_analysis(df):
    ## dictionary to store results
    p_value = {}

    ## clean columns
    df = df.replace('./.', '0/0')

    ## run analysis (grouping by the position column, assumed to be named 'pos')
    for pos, group in df.groupby('pos'):
        sample = pd.DataFrame(group['age_of_onset'])
        # ANOVA test
        f_val, p_val = stats.f_oneway(*sample)
        p_value[pos] = p_val[0]
    return p_value
Note: reduce(merge_results, processed_results) just takes in the mapped results and joins them together.
I can provide output examples too, but I believe this is not necessary for my issue. I believe I am running out of memory because the results of the function calls are not released when the next one starts. Is there a way to release a chunk but still retain its output? If I run this code in a for loop, one chunk at a time, I do not get such an error.
Thanks!
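One way to keep memory bounded here (a sketch, not from the original thread) is to merge results as they arrive via imap_unordered instead of materializing all of them with map. conduct_analysis and merge_results are the question's own functions; 'data.tsv' is a placeholder path:

import pandas as pd
from multiprocessing import Pool

if __name__ == '__main__':
    chunks = pd.read_csv('data.tsv', sep='\t', chunksize=10000)
    merged = None
    with Pool(16) as pool:
        # Consume results one by one so only the running merge is kept alive
        for result in pool.imap_unordered(conduct_analysis, chunks):
            merged = result if merged is None else merge_results(merged, result)
    print(merged)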

Why is appending to a list slower when using multiprocessing?

I want to append to a list, where each element to be appended is a large DataFrame.
I tried to use the multiprocessing module to speed up the appending. My code is as follows:
import pandas as pd
import numpy as np
import time
import multiprocessing
from multiprocessing import Manager


def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df


def do_something(df_list, size, k):
    df = generate_df(size)
    df_list[k] = df


if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()

    with Manager() as manager:
        df_list = manager.list(range(num_df))
        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list, size, k,))
            p.start()
            processes.append(p)
        for process in processes:
            process.join()

        final_df = pd.concat(df_list)
        print(final_df.head())

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
The elapsed time is 7 seconds.
I also tried appending to the list without multiprocessing:
df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))
final_df = pd.concat(df_list)
But this time the elapsed time is 2 seconds! Why is appending to the list with multiprocessing slower than without it?
When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
You probably don't need all of that synchronization, and it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and its map method. The pool will handle creating and shutting down the processes, and map will call a target function with an argument from an iterable.
Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:
if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()

    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size] * num_df)
        final_df = pd.concat(df_list)
        print(final_df.head())

    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
Two points:
Starting subprocesses and retrieving data from them has a cost: the data must be transported between processes. This means that if the transport time is larger than the time it takes to compute the data, you won't see any benefit. This article explains the question in more detail.
In your implementation the bottleneck is the use of df_list. The Manager uses locks, which means the processes are not free to write their results into the list df_list concurrently.

How to increase the speed of a function being called 400 times in Python

I have a list named dfs. It contains 400 Pandas dataframes of size 700 rows x 400 columns.
I have a function like this:
def updateDataframe(i):
    global dfs
    df = dfs[i]
    df["abc"].iloc[-1] = "xyz"
    df["abc2"] = df["abc"].rolling(10).mean()
    ........  # More pandas operations like this
    dfs[i] = df

for i in range(len(dfs)):
    updateDataframe(i)
Now, this loop takes 10 seconds to execute. I have tried Python multiprocessing, but it takes the same time and sometimes even more.
Things I tried:
import multiprocessing.dummy as mp  # thread-based drop-in for multiprocessing

p = mp.Pool(8)  # Define number of workers to use
p.map(updateDataframe, range(len(dfs)))  # Apply updateDataframe to every index
p.close()  # Close the pool
p.join()
Also tried this:
from multiprocessing import Process

if __name__ == "__main__":  # confirms that the code is run as the main module
    processes = []
    for i in range(len(dfs)):
        process = Process(target=updateDataframe, args=(i,))
        processes.append(process)
        process.start()

    # complete the processes
    for i in range(len(processes)):
        processes[i].join()
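For what it's worth, a sketch (not from the original post) of how such an update could be done with a process pool that returns the modified frames instead of mutating a global list. The frames and the rolling-mean operation below are stand-ins; whether this is faster depends on how the per-frame work compares to the cost of pickling each 700x400 frame to and from the workers:

import multiprocessing as mp

import numpy as np
import pandas as pd


def update_dataframe(df):
    # Stand-in for the per-frame pandas work from the question
    df = df.copy()
    df["abc2"] = df["abc"].rolling(10).mean()
    return df


if __name__ == "__main__":
    # Hypothetical stand-in for the 400 frames named dfs in the question
    dfs = [pd.DataFrame({"abc": np.random.rand(700)}) for _ in range(400)]

    with mp.Pool() as pool:
        dfs = pool.map(update_dataframe, dfs)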

Multiprocessing Doesn't Create Any Extra Processes

I am trying to increase the speed of my program in Python using multiprocessing, but it doesn't actually create any more processes. I've watched a few tutorials but I'm not getting anywhere.
Here it is:
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.map(OSGBtoETRSfunc(data, eastcol, northcol))
p.close()
p.join()
return output
So to me, this should create 2 processes on a quadcore machine, but it doesn't. My CPU util sits around 18%...
Any insight? It looks the same as the tutorials I have watched... The p.map was not working when listing arguments in square brackets ([]) so I presumed it would need to be in the syntax it is above?
Thanks
I don't clearly understand what you want, so let's start simple. The following is a way to call the same function over the rows of a pandas DataFrame:
import pandas as pd
import numpy as np
import os
import pathos
from contextlib import closing

NUM_PROCESSES = os.cpu_count()

# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0, 100, size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over rows
# it transforms the given row independently
def foo(idx):
    # extract given row to numpy
    row = df.iloc[[idx]].values[0]
    # you can pass ranges:
    # df[2:3]

    # transform row
    # I return it as an array for simplicity of creating the dataframe
    row = np.exp(row)
    # return numpy row
    return row

# run pool over range of indexes (0, 1, ..., nrow-1)
# and close it afterwards
# there is no reason here to have more workers than the number of CPUs
with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:
    results = pool.map(foo, range(nrow))

# create new dataframe from all those numpy slices:
col_names = df.columns.values.tolist()
df_new = pd.DataFrame(np.array(results), columns=col_names)
What in your computation needs a more complicated setup?
EDIT: OK, here is how to run two functions concurrently (I am not very familiar with pandas, so I just switch to numpy):
# RUNNING TWO FUNCTIONS SIMULTANEOUSLY

import pandas as pd
import numpy as np
from multiprocessing import Process, Queue

# create some data frame 100x4
nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0, 100, size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over part1 independently
def proc_func1(q1):
    # get data from queue1
    data1 = q1.get()
    # I extract given data to numpy
    data_numpy = data1.values
    # do something
    data_numpy_new = data_numpy + 1
    # return numpy array to queue 1
    q1.put(data_numpy_new)
    return

# function to be run over part2 independently
def proc_func2(q2):
    # get data from queue2
    data2 = q2.get()
    # I extract given data to numpy
    data_numpy = data2.values
    # do something
    data_numpy_new = data_numpy - 1
    # return numpy array to queue 2
    q2.put(data_numpy_new)
    return

# instantiate queues
q1 = Queue()
q2 = Queue()

# divide data frame into two parts
part1 = df[:50]
part2 = df[50:]

# send data, so it will already be in the queues
q1.put(part1)
q2.put(part2)

# start two processes
p1 = Process(target=proc_func1, args=(q1,))
p2 = Process(target=proc_func2, args=(q2,))
p1.start()
p2.start()

# wait until they finish
p1.join()
p2.join()

# read results from the queues
res1 = q1.get()
res2 = q2.get()
if (res1 is None) or (res2 is None):
    print('Error!')

# reassemble the two results back into a single dataframe (might be inefficient)
col_names = df.columns.values.tolist()
# concatenate results along the row axis
df_new = pd.DataFrame(np.concatenate([np.array(res1), np.array(res2)], axis=0), columns=col_names)
In Python you should provide the function and the arguments separately. If not, you are executing the function OSGBtoETRSfunc at the time of creating the process. Instead, you should provide a reference to the function and a list with the arguments.
Your case is similar to the one shown on Python Docs: https://docs.python.org/3.7/library/multiprocessing.html#introduction
Anyway, I think you are using the wrong function. Pool.map() works like map: it takes a list of items and applies the same function to each item. I think your function OSGBtoETRSfunc needs all three parameters in order to work properly. So, instead of using p.map(), use p.apply():
cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.apply(OSGBtoETRSfunc, [data, eastcol, northcol])
p.close()
p.join()
return output
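Note that p.apply() runs a single call in one worker, so if the goal is to run the conversion over many argument tuples in parallel, pool.starmap is the fitting Pool method. A sketch (not part of the original answer; the helper convert_row and its argument tuples are hypothetical):

from multiprocessing import Pool


def convert_row(data, eastcol, northcol):
    # hypothetical stand-in for OSGBtoETRSfunc applied to one piece of data
    return (data, eastcol, northcol)


if __name__ == '__main__':
    tasks = [(d, 'east', 'north') for d in range(8)]  # hypothetical argument tuples
    with Pool(2) as p:
        output = p.starmap(convert_row, tasks)  # each tuple is unpacked into the call
    print(output)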

Long running loop slows down?

I have a generator function in Python which reads a dataset in chunks and yields each chunk in a loop.
On each iteration of the loop, the chunk size is the same and the data array is overwritten.
It starts off by yielding chunks every ~0.3s and this slows down to every ~3s by about the 70th iteration.
Here is the generator:
def yield_chunks(self):
    # Loop over the chunks
    for j in range(self.ny_chunks):
        for i in range(self.nx_chunks):
            dataset_no = 0
            arr = numpy.zeros([self.chunk_size_y, self.chunk_size_x, nInputs], numpy.dtype(numpy.int32))

            # Loop over the datasets we will read into a single 'chunk'
            for peril in datasets.dataset_cache.iterkeys():
                group = datasets.getDatasetGroup(peril)
                for return_period, dataset in group:
                    dataset_no += 1
                    # Compute the window of the dataset that falls into this chunk
                    dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize = self.chunk_params(i, j)
                    # Read the data
                    data = dataset[0].ReadAsArray(dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize)
                    # Compute the window of our chunk array that this data fits into
                    chunk_xoff, chunk_yoff = self.window_params(dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize)
                    # Add the data to the chunk array
                    arr[chunk_yoff:(dataset_ysize+chunk_yoff), chunk_xoff:(dataset_xsize+chunk_xoff), dataset_no] = data

            # Once we have added data from all datasets to the chunk array, yield it
            yield arr
Is it possible that memory is not being properly released after each chunk, and this is causing the loop to slow down? Any other reasons?
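One thing worth checking (a sketch under the assumption that the chunk shape never changes between iterations): allocate the chunk array once outside the loops and zero it in place each iteration, so no new large array has to be created and garbage-collected per chunk. The parameter names below are stand-ins for the question's self attributes:

import numpy as np

def yield_chunks(ny_chunks, nx_chunks, chunk_size_y, chunk_size_x, n_inputs):
    # Reuse one preallocated buffer instead of allocating a new array per chunk
    arr = np.zeros((chunk_size_y, chunk_size_x, n_inputs), dtype=np.int32)
    for j in range(ny_chunks):
        for i in range(nx_chunks):
            arr.fill(0)       # clear the buffer in place
            # ... fill arr from the datasets exactly as in the question ...
            yield arr.copy()  # copy only if the consumer keeps a reference to the chunk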
