I am parallelizing the ingestion and analysis of a dataset using Pool.map, but I keep running into an out-of-memory error. This happens even though I am working on a cluster with 100 GB of memory and a compressed file of ~25 GB (the uncompressed file is likely larger than 100 GB). I believe I am running out of memory because I am joining and closing processes incorrectly, i.e. all the chunks are being stored in memory. Here is my code:
##load df
df_chunk = pd.read_csv(f'{file}', sep = '\t' , chunksize = 10000)
## parallel process
pool = Pool(16)
processed_results = pool.map(conduct_analysis, df_chunk)
## merge all results together
df_all = reduce(merge_results, processed_results)
pool.close()
pool.join()
With the following function definition for the analysis:
def conduct_analysis(df):
    ## dictionary to store results
    p_value = {}
    ## clean columns
    df = df.replace('./.', '0/0')
    ## run analysis
    for group in df.groupby(pos):
        sample = pd.DataFrame(group[1]['age_of_onset'])
        # ANOVA test
        f_val, p_val = stats.f_oneway(*sample)
        p_value[pos] = p_val[0]
    return p_value
Note that reduce(merge_results, processed_results) just takes the mapped results and joins them together.
I can provide output examples too, but I believe they are not necessary for this issue. I believe I am running out of memory because the function calls are not being closed when a new one is opened. Is there a way to close the chunk but still retain the output? If I run this code in a for loop, one chunk at a time, I do not get this error.
Thanks!
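One possible direction, offered as a sketch under assumptions rather than something tested against the data above: Pool.map converts its iterable into a list before dispatching any work, so every chunk the reader produces is materialised up front, and all results are held until the end. Pool.imap_unordered hands results back as they complete, so each partial result can be merged and then dropped. The names analyse_chunk, merge_incrementally and run_lazily are hypothetical, plain lists of (position, value) pairs stand in for the DataFrame chunks, and the merge assumes each position appears in only one chunk.

```python
from multiprocessing import Pool


def analyse_chunk(rows):
    """Hypothetical stand-in for conduct_analysis: largest value per position."""
    result = {}
    for pos, value in rows:
        result[pos] = max(value, result.get(pos, value))
    return result


def merge_incrementally(partial_results):
    """Fold partial dicts into one, touching one partial result at a time."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


def run_lazily(chunks, processes=4):
    """Feed `chunks` to a pool and merge results as the workers finish."""
    with Pool(processes) as pool:
        return merge_incrementally(pool.imap_unordered(analyse_chunk, chunks))
```

With pandas, `chunks` would be the iterator returned by pd.read_csv(..., chunksize=...), and run_lazily would be called under an `if __name__ == '__main__':` guard. Note that imap does not strictly limit how far ahead of the workers the input is read, so this reduces peak memory rather than guaranteeing a bound.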
I want to append to a list, where each element to be appended is a large dataframe.
I tried to use the multiprocessing module to speed up building the list. My code is as follows:
import pandas as pd
import numpy as np
import time
import multiprocessing
from multiprocessing import Manager

def generate_df(size):
    df = pd.DataFrame()
    for x in list('abcdefghi'):
        df[x] = np.random.normal(size=size)
    return df

def do_something(df_list,size,k):
    df = generate_df(size)
    df_list[k] = df

if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with Manager() as manager:
        df_list = manager.list(range(num_df))
        processes = []
        for k in range(num_df):
            p = multiprocessing.Process(target=do_something, args=(df_list,size,k,))
            p.start()
            processes.append(p)
        for process in processes:
            process.join()
        final_df = pd.concat(df_list)
    print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
The elapsed time is 7 seconds.
I then tried building the list without multiprocessing:
df_list = []
for _ in range(num_df):
    df_list.append(generate_df(size))
final_df = pd.concat(df_list)
But this time the elapsed time is 2 seconds! Why is appending to the list with multiprocessing slower than without it?
When you use manager.list, you're not using a normal Python list. You're using a special list proxy object that has a whole lot of other stuff going on. Every operation on that list will involve locking and interprocess communication so that every process with access to the list will see the same data in it at all times. It's slow because it's a non-trivial problem to keep everything consistent in that way.
You probably don't need all of that synchronization; it's just slowing you down. A much more natural way to do what you're attempting is to use a process pool and its map method. The pool will handle creating and shutting down the processes, and map will call a target function with each argument from an iterable.
Try something like this, which will use a number of worker processes equal to the number of CPUs your system has:
if __name__ == '__main__':
    size = 200000
    num_df = 30
    start = time.perf_counter()
    with multiprocessing.Pool() as pool:
        df_list = pool.map(generate_df, [size]*num_df)
    final_df = pd.concat(df_list)
    print(final_df.head())
    finish = time.perf_counter()
    print(f'Finished in {round(finish-start,2)} second(s)')
    print(len(final_df))
This will still have some overhead, since the interprocess communication used to pass the dataframes back to the main process is not free. It may still be slower than running everything in a single process.
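To get a rough feel for that overhead, one can time how long it takes to serialise and deserialise a single result, since multiprocessing moves objects between processes by pickling them. The helper below is a hypothetical sketch, not part of the code above; it measures only the pickling itself and leaves out the time spent pushing bytes through the pipe, so it gives a lower bound.

```python
import pickle
import time


def pickle_roundtrip(obj):
    """Return (seconds, n_bytes) for one pickle dumps/loads round trip."""
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    pickle.loads(payload)
    return time.perf_counter() - start, len(payload)
```

Calling pickle_roundtrip(generate_df(size)) and comparing the result with the time generate_df itself takes gives an indication of whether the work per task is large enough to pay for the transfer.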
Two points:
Starting subprocesses and retrieving data from them has a cost, because the data must be transported between processes. If the transport time exceeds the time it takes to compute the data, you will not see any benefit. This article explains the issue in more detail.
In your implementation, the bottleneck is the use of df_list. The Manager uses a lock, which means the processes are not free to write their results into df_list concurrently.
Here's a simplified version of my code.
import dask
import dask.dataframe as dask_frame
from dask.distributed import Client, LocalCluster

def main():
    cluster = LocalCluster(n_workers=4, threads_per_worker=2)
    client = Client(cluster)
    csv_path_one = ""  # both have 70 columns and around 70 million rows, at a size of about 25 gigabytes
    csv_path_two = ""
    # the columns are a mix of ints, floats, datetimes and strings
    # almost all string lengths are less than 15; two of the longest string columns have a max length of 70
    left_df = dask_frame.read_csv(csv_path_one, sep="|", quotechar="+", encoding="Latin-1", dtype="object")
    right_df = dask_frame.read_csv(csv_path_two, sep=",", quotechar="\"", encoding="utf-8", dtype="object")
    cand_keys = [""]  # I have 3
    merged = dask_frame.merge(left_df, right_df, how='outer', on=cand_keys, suffixes=("_L", "_R"), indicator=True)
    missing_mask = merged._merge != 'both'
    missing_findings: dask_frame.DataFrame = merged.loc[missing_mask, cand_keys + ["_merge"]]
    print(f"Running {client}")
    missing_findings.to_csv("results/findings-*.csv")
    cluster.close()
    client.close()

if __name__ == '__main__':
    main()
This example never finishes. Dask gets to a certain point, then one or more workers suddenly exceed the memory limit; the nanny kills them and all of those workers' progress is rolled back.
Looking at the diagnostics page, the memory spikes usually happen about halfway through the shuffle-split tasks.
I'm running Dask 2.9.1 on Windows.
I can update Dask, but it's a pain with my current setup and I don't know whether it will fix my issue.
Updating to 2.15 fixed this issue.
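For readers unfamiliar with the merge-and-filter step in the code above, here is the same idea on a toy example in plain pandas rather than Dask. The frames and the column name "key" are made up for illustration; only the merge arguments mirror the original.

```python
import pandas as pd

# Two tiny made-up tables sharing a key column
left = pd.DataFrame({"key": [1, 2, 3], "value": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "value": ["b", "x", "d"]})

# indicator=True adds a "_merge" column: left_only, right_only or both
merged = left.merge(right, how="outer", on=["key"],
                    suffixes=("_L", "_R"), indicator=True)

# Keep only the keys that are present on one side but not the other
missing = merged.loc[merged["_merge"] != "both", ["key", "_merge"]]
```

Here `missing` contains key 1 (left only) and key 4 (right only). An outer merge has to bring matching keys from both inputs together, which is why the distributed version involves a shuffle.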
I have several large files (> 4 gb each). Some of them are in a fixed width format and some are pipe delimited. The files have both numeric and text data. Currently I am using the following approach:
df1 = pd.read_fwf(fwFileName, widths = [2, 3, 5, 2, 16],
                  names = columnNames, dtype = columnTypes,
                  skiprows = 1, engine = 'c',
                  keep_default_na = False)
df2 = pd.read_csv(pdFileName, sep = '|', names = columnNames,
                  dtype = columnTypes, usecols = colNumbers,
                  skiprows = 1, engine = 'c',
                  keep_default_na = False)
However, this seems to be slower than for example R's read_fwf (from readr) and fread (from data.table). Can I use some other methods that will help speed up reading these files?
I am working on a big server with several cores so memory is not an issue. I can safely load the whole files into memory. Maybe they are the same thing in this case but my goal is to optimize on time and not on resources.
Update
Based on the comments so far here are a few additional details about the data and my ultimate goal.
These files are compressed (fixed width are zip and pipe delimited are gzip). Therefore, I am not sure if things like Dask will add value for loading. Will they?
After loading these files, I plan to apply a computationally expensive function to groups of data, so I need the whole data. However, the data is sorted by group, i.e. the first x rows are group 1, the next y rows are group 2, and so on, so forming groups on the fly might be more productive. Is there an efficient way of doing that, given that I don't know how many rows to expect for each group?
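Because the rows are already sorted by group, one way to form groups on the fly without knowing their sizes is itertools.groupby, which yields a new group whenever the key changes. This is a minimal standard-library sketch; the function name iter_groups, the key column position and the sample data are assumptions made for illustration.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter


def iter_groups(lines, key_column=0, delimiter="|"):
    """Yield (key, rows) for each run of consecutive rows sharing a key."""
    reader = csv.reader(lines, delimiter=delimiter)
    for key, rows in groupby(reader, key=itemgetter(key_column)):
        yield key, list(rows)


# Made-up sample: two rows for group g1 followed by one row for g2
sample = io.StringIO("g1|1\ng1|2\ng2|3\n")
sizes = {key: len(rows) for key, rows in iter_groups(sample)}
```

For a gzip-compressed file, the `lines` argument could be the handle returned by gzip.open(path, "rt"). Only one group is held in memory at a time, so each group can be handed to the expensive function (or to a worker pool) as soon as it is complete.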
Since time is the metric here, memory size is not the main factor to look at. On the contrary, methods that use lazy loading (less memory, with objects loaded only when needed) can return much faster than loading all the data into memory at once, because the read is deferred until the data is actually used. You can check out Dask, which provides such a lazy read function: https://dask.org/
import time
import dask.dataframe

start_time = time.time()
data = dask.dataframe.read_csv('rg.csv')
duration = time.time() - start_time
print(f"Time taken {duration} seconds") # less than a second
But as I said, this won't load the data into memory; it loads only portions of the data when they are needed. You can, however, load it in full using:
data.compute()
If you want to load things into memory faster, you need good computing capabilities on your server. A good candidate that could benefit from such capabilities is ParaText: https://github.com/wiseio/paratext
You can benchmark ParaText against pd.read_csv using the following code:
# ParaText
import time
import paratext

start_time = time.time()
df = paratext.load_csv_to_pandas("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")

# pandas
import time
import pandas as pd

start_time = time.time()
df = pd.read_csv("rg.csv")
duration = time.time() - start_time
print(f"Time taken {duration} seconds")
Please note that results may be worse if you don't have enough compute power to support ParaText.
You can check out the benchmarks for ParaText loading large files here
https://deads.gitbooks.io/paratext-bench/content/results_csv_throughput.html .
I am trying to work with around 100 csv files to do a time series analysis.
To build an efficient algorithm, I've structured my data-reading function so that it reads all the files once and doesn't have to repeat the same process again and again. To explain further, here is my code:
import pandas as pd
from itertools import chain, combinations

start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000
# contains 100 symbols
usesymbols = ['']
cost_matrix = []

def data():
    dates = pd.date_range(start_date, end_date)
    df = pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp = pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)), usecols=['Date', 'Close'],
                              parse_dates=True, index_col='Date', na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df = df.join(df_temp)
    df = df.fillna(method='ffill')
    df = df.fillna(method='bfill')
    return df

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))

power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited Part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set; if I don't call the powerset function, the data is retrieved easily.
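The arithmetic behind this is worth spelling out: a set of n symbols has 2^n - 1 non-empty subsets, so list(powerset(usesymbols)) tries to hold 32,767 tuples for 15 symbols but roughly 1.27 x 10^30 tuples for 100 symbols, which no amount of RAM can store. The sketch below keeps the same powerset generator but consumes it lazily; powerset_size is a hypothetical helper added for illustration.

```python
from itertools import chain, combinations, islice


def powerset(iterable):
    """Lazily yield every non-empty subset, smallest subsets first."""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))


def powerset_size(n):
    """Number of non-empty subsets of an n-element set."""
    return 2 ** n - 1


# Take only the first few subsets instead of materialising all of them
first_three = list(islice(powerset(["A", "B", "C"]), 3))
```

Iterating lazily avoids the memory problem, but visiting every subset of 100 symbols is still infeasible in time, so in practice the search would need to be restricted, for example to subsets up to some small size.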
DataFrame.memory_usage(index=False)
Return:
sizes : Series
A series with column names as index and memory usage of columns with units of bytes.
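A small illustration of the call documented above, using a made-up two-column frame. The column names and sizes are arbitrary; the point is only that the returned Series gives bytes per column, which can be summed to estimate the footprint of a frame.

```python
import numpy as np
import pandas as pd

# 1000 rows: one 8-byte float column and one 4-byte integer column
df = pd.DataFrame({
    "a": np.zeros(1000, dtype="float64"),
    "b": np.zeros(1000, dtype="int32"),
})

usage = df.memory_usage(index=False)   # Series of bytes, indexed by column name
total_bytes = int(usage.sum())
```

For columns of Python strings (object dtype), passing deep=True makes pandas account for the string contents as well, at the cost of a slower call.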
I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds). So I had the idea to split the frame into chunks and process each chunk in parallel using multiprocessing. This does speed up the task, but the memory consumption is a nightmare.
Although each child process should in principle only consume a tiny chunk of the data, it needs (almost) as much memory as the original parent process that contained the original DataFrame. Even deleting the used parts in the parent process does not help.
I wrote a minimal example that replicates this behavior. The only thing it does is create a large DataFrame with random numbers, chunk it into little pieces of at most 100 rows, and print some information about each chunk during multiprocessing (here via an mp.Pool of size 4).
The main function that is executed in parallel:
def just_wait_and_print_len_and_idx(df):
    """Waits for 1 second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate a lengthy operation
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))
The helper generator to chunk a DataFrame into little pieces:
def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0  # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)
And the main routine:
def main():
    # Job parameters
    n_jobs = 4  # Poolsize
    size = (10000, 1000)  # Size of DataFrame
    chunksize = 100  # Maximum size of Frame Chunk
    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
    pool.close()
    pool.join()
    print('DONE')
The standard output looks like this:
Starting MP
Preparing chunk 1
Preparing chunk 2
First idx 0, last idx 99 and len 100 from process 9913
First idx 100, last idx 199 and len 100 from process 9914
Preparing chunk 3
First idx 200, last idx 299 and len 100 from process 9915
Preparing chunk 4
...
DONE
The Problem:
The main process needs about 120 MB of memory. However, the child processes of the pool need the same amount of memory, although they only contain 1% of the original DataFrame (chunks of size 100 vs. an original length of 10000). Why?
What can I do about it? Does Python (3) send the whole DataFrame to each child process despite my chunking? Is that a problem of pandas memory management or the fault of multiprocessing and data pickling? Thanks!
Whole script for simple copy and paste in case you want to try it yourself:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import os

def just_wait_and_print_len_and_idx(df):
    """Waits for 1 second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate a lengthy operation
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))

def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0  # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)

def main():
    # Job parameters
    n_jobs = 4  # Poolsize
    size = (10000, 1000)  # Size of DataFrame
    chunksize = 100  # Maximum size of Frame Chunk
    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
    pool.close()
    pool.join()
    print('DONE')

if __name__ == '__main__':
    main()
Ok, so I figured it out after the hint by Sebastian Opałczyński in the comments.
The problem is that the child processes are forked from the parent, so all of them contain a reference to the original DataFrame. However, the frame is modified in the original process, so the copy-on-write behavior duplicates the memory piece by piece, and the whole thing slowly dies once the limit of the physical memory is reached.
There is a simple solution: Instead of pool = mp.Pool(n_jobs), I use the new context feature of multiprocessing:
ctx = mp.get_context('spawn')
pool = ctx.Pool(n_jobs)
This guarantees that the Pool processes are just spawned and not forked from the parent process. Accordingly, none of them has access to the original DataFrame and all of them only need a tiny fraction of the parent's memory.
Note that the mp.get_context('spawn') is only available in Python 3.4 and newer.
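As a small companion to the snippet above, this sketch shows how to check which start methods the platform offers and confirm what a context will use. It only calls functions from the standard multiprocessing module; the variable names are arbitrary.

```python
import multiprocessing as mp

# Start methods supported on this platform; 'spawn' is available everywhere,
# while 'fork' exists only on Unix-like systems
available = mp.get_all_start_methods()

# A context bundles Pool, Process, Queue, ... bound to one start method,
# without changing the global default for the rest of the program
ctx = mp.get_context("spawn")
method = ctx.get_start_method()
```

With spawn, each worker starts a fresh interpreter and re-imports the main module, so the code that creates the pool has to sit under an `if __name__ == '__main__':` guard, and worker start-up is somewhat slower than with fork.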
A better implementation is to use pandas' own chunked reader as a generator and feed it into pool.imap:
pd.read_csv('<filepath>.csv', chunksize=<chunksize>)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Benefit: the whole DataFrame is never read into the main process (saving memory), and each child process receives only the chunk it needs, which solves the child memory issue.
Overhead: you have to save your DataFrame as a CSV first and read it in again using pd.read_csv, which costs I/O time.
Note: chunksize is not available for pd.read_pickle or other loading methods that are compressed on storage.
def main():
    # Job parameters
    n_jobs = 4  # Poolsize
    chunksize = 100  # Maximum size of Frame Chunk
    # Preparation: the DataFrame is no longer built in the main process
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel
    df_chunked = pd.read_csv('<filepath>.csv', chunksize=chunksize)  # modified
    pool.imap(just_wait_and_print_len_and_idx, df_chunked)  # modified
    pool.close()
    pool.join()
    print('DONE')