I have a generator function in Python which reads a dataset in chunks and yields each chunk in a loop.
On each iteration of the loop, the chunk size is the same and the data array is overwritten.
It starts off by yielding chunks every ~0.3s and this slows down to every ~3s by about the 70th iteration.
Here is the generator:
def yield_chunks(self):
    # Loop over the chunks
    for j in range(self.ny_chunks):
        for i in range(self.nx_chunks):
            dataset_no = 0
            arr = numpy.zeros([self.chunk_size_y, self.chunk_size_x, nInputs], numpy.dtype(numpy.int32))
            # Loop over the datasets we will read into a single 'chunk'
            for peril in datasets.dataset_cache.iterkeys():
                group = datasets.getDatasetGroup(peril)
                for return_period, dataset in group:
                    dataset_no += 1
                    # Compute the window of the dataset that falls into this chunk
                    dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize = self.chunk_params(i, j)
                    # Read the data
                    data = dataset[0].ReadAsArray(dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize)
                    # Compute the window of our chunk array that this data fits into
                    chunk_xoff, chunk_yoff = self.window_params(dataset_xoff, dataset_yoff, dataset_xsize, dataset_ysize)
                    # Add the data to the chunk array
                    arr[chunk_yoff:(dataset_ysize+chunk_yoff), chunk_xoff:(dataset_xsize+chunk_xoff), dataset_no] = data
            # Once we have added data from all datasets to the chunk array, yield it
            yield arr
Is it possible that memory is not being properly released after each chunk, and this is causing the loop to slow down? Any other reasons?
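One way to narrow this down is to measure, per chunk, how long the generator takes and how much memory Python reports as allocated. The following is only a minimal sketch: profile_chunks and fake_chunks are hypothetical names, and fake_chunks stands in for the real generator. Note that tracemalloc only sees allocations made through Python's allocators, so buffers allocated inside a C library such as GDAL may not show up.

```python
import time
import tracemalloc


def profile_chunks(chunks):
    """Yield (index, seconds_to_produce, traced_bytes, chunk) for each chunk."""
    tracemalloc.start()
    try:
        started = time.perf_counter()
        for index, chunk in enumerate(chunks):
            # Time spent inside the wrapped generator to produce this chunk
            elapsed = time.perf_counter() - started
            traced_bytes, _peak = tracemalloc.get_traced_memory()
            yield index, elapsed, traced_bytes, chunk
            # Restart the clock once the consumer asks for the next chunk
            started = time.perf_counter()
    finally:
        tracemalloc.stop()


def fake_chunks(n):
    """Hypothetical stand-in for the real chunk generator."""
    for k in range(n):
        yield [k] * 1000


report = [(i, secs, mem) for i, secs, mem, _ in profile_chunks(fake_chunks(5))]
for i, secs, mem in report:
    print(f"chunk {i}: {secs:.4f}s, {mem} bytes traced")
```

If the per-chunk time grows while the traced memory stays flat, the slowdown is more likely in the reads themselves (for example, where the window sits in the file) than in unreleased Python objects.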
I am parallelizing ingestion and analysis of a dataset using Pool.map. However, I keep running into an out-of-memory error, even though I am working on a cluster with 100 GB of memory and a compressed file of ~25 GB (the uncompressed file is likely larger than 100 GB). I believe I am running out of memory because I am joining and closing processes incorrectly, i.e. all the chunks are being stored in memory. Here is my code:
##load df
df_chunk = pd.read_csv(f'{file}', sep = '\t' , chunksize = 10000)
## parallel process
pool = Pool(16)
processed_results = pool.map(conduct_analysis, df_chunk)
## merge all results together
df_all = reduce(merge_results, processed_results)
pool.close()
pool.join()
With the following function definition for the analysis:
def conduct_analysis(df):
    ##dictionary to store results
    p_value = {}
    ##clean columns
    df = df.replace('./.', '0/0')
    ##run analysis
    for group in df.groupby(pos):
        sample = pd.DataFrame(group[1]['age_of_onset'])
        #ANOVA test
        f_val, p_val = stats.f_oneway(*sample)
        p_value[pos] = p_val[0]
    return p_value
Note that reduce(merge_results, processed_results) just takes the mapped results and joins them together.
I can provide output examples too, but I believe this is not necessary for my issue. I believe I am running out of memory because the function calls are not being closed when a new one is opened. Is there a way to close the chunk but still retain the output? If I run this code in a for loop one chunk at a time, I do not get such an error.
Thanks!
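One point worth knowing here: Pool.map turns its input iterable into a list before dispatching work, so every chunk from the reader is loaded before any result comes back. Pool.imap pulls from the iterable as it goes, and folding the results as they arrive avoids holding the full result list. The sketch below is an illustration under assumptions, not the asker's analysis: analyse, merge and make_chunks are hypothetical stand-ins, and it uses ThreadPool (which exposes the same imap call) only so that it runs anywhere. For CPU-bound work, multiprocessing.Pool would be swapped in.

```python
from functools import reduce
from multiprocessing.pool import ThreadPool


def analyse(chunk):
    """Hypothetical per-chunk analysis returning a small summary."""
    return {"rows": len(chunk), "total": sum(chunk)}


def merge(left, right):
    """Hypothetical merge step combining two summaries."""
    return {"rows": left["rows"] + right["rows"],
            "total": left["total"] + right["total"]}


def make_chunks(n_chunks, size):
    """Lazily produce chunks, like a chunked file reader would."""
    for k in range(n_chunks):
        yield list(range(k * size, (k + 1) * size))


def run(pool, chunks):
    # imap yields results in order as they complete; reduce folds them
    # one at a time, so only the running summary is kept.
    return reduce(merge, pool.imap(analyse, chunks))


with ThreadPool(4) as pool:
    summary = run(pool, make_chunks(10, 1000))

print(summary)
```

The key design choice is that the per-chunk function returns something small (a summary, not the chunk), so that what accumulates in the parent stays small too.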
I want a loop that goes over every 30th row of a (1095, 10000) array, computes scipy.stats.describe for that row (e.g. scipy.stats.describe(matrix[30])) and writes the results to a list.
I have done it manually and it works, but I'm trying to optimise my code:
stats150 = scipy.stats.describe(matrix[150])
list_for_stats +=['150:', stats150]
stats180 = scipy.stats.describe(matrix[180])
list_for_stats += ['180:', stats180]
statsOut = open("myOutputStatsFile.txt", "w")
for line in list_for_stats:
    # write line to output file
    statsOut.write(str(line))
    statsOut.write("\n")
statsOut.close()
I am looking for a for loop that is more intuitive than what I already have.
Assuming your matrix is a numpy array, this loop goes through every 30th row of a (1095, 10000) matrix of ones and stores the scipy.stats.describe results, along with the row number as a string, in a list:
import numpy as np
import scipy.stats
matrix = np.ones(shape=(1095,10000))
list_for_stats=[]
for i in range(0,matrix.shape[0],30):
    list_for_stats +=[str(i)+':', scipy.stats.describe(matrix[i])]
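As a variation on the loop above, the row number and its statistics can be kept together in a dictionary and written out with a with-block, which closes the file automatically. This is a sketch under assumptions: the matrix is filled with random numbers rather than the asker's data, and the output file is written to a temporary directory so the example does not touch the working directory.

```python
import os
import tempfile

import numpy as np
from scipy import stats

# Stand-in data with the shape from the question
rng = np.random.default_rng(0)
matrix = rng.random((1095, 10000))

# Map each selected row index to its describe() result
row_stats = {i: stats.describe(matrix[i]) for i in range(0, matrix.shape[0], 30)}

# Write one "index: result" line per selected row
out_path = os.path.join(tempfile.mkdtemp(), "myOutputStatsFile.txt")
with open(out_path, "w") as stats_out:
    for i, result in row_stats.items():
        stats_out.write(f"{i}: {result}\n")

print(len(row_stats), "rows summarised")
```

Each value is a named tuple, so individual fields such as row_stats[30].mean or row_stats[30].variance can be read back directly.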
I have a huge .csv file (over 1 million rows), that I am trying to parse using the pandas read_csv function. The file is very large because it is measurement data from a sensor with very high sampling rate, and I want to take downsampled segments from it. I tried implementing it with a lambda function and the skiprows and nrows parameter. What my code currently does is, it just reads the same segment over and over again.
segment_amt = 20 # How many segments we want from an individual measurement file
segment_length = 5 # Segment length in seconds
segment_length_idx = fs * segment_length # Segment length in indices
segment_skip_length = 10 # How many seconds between segments
segment_skip_idx = fs * segment_skip_length # The number of indices to skip between each segment
downsampling = 2 # Factor of downsampling
idx = start_idx
for i in range(segment_amt):
    cond = lambda x: (x+idx) % downsampling != 0
    data = pd.read_csv(filename, skiprows=cond, nrows = segment_length_idx/downsampling,
                       usecols=[z_component_idx],names=["z"],engine='python')
    M1_df = M1_df.append(data.T)
    idx += segment_skip_idx
This results in something like this. I assume the behaviour is due to the lambda function, but I don't know how to fix it, so that it changes the starting row each time based on idx (this is what I thought it would do currently).
Your cond lambda is wrong. You want to skip a row if x < idx or x % downsampling != 0. Just write it that way:
cond = lambda x: x < idx or x % downsampling != 0
But you should also consider passing header=None so that the first line of each segment is not processed as a header.
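The skip condition described above (skip every row before idx, then keep only every downsampling-th row) can be tried on a small in-memory file. This is a minimal sketch with assumed names: read_segment is a hypothetical helper, and the CSV is a single column of the numbers 0 to 99. It measures the downsampling phase from the segment start, which gives the same rows as x % downsampling whenever the start is a multiple of the factor.

```python
import io

import pandas as pd


def read_segment(source, start_idx, n_rows, downsampling):
    """Read n_rows rows starting at start_idx, keeping every downsampling-th row."""
    def skip(x):
        # Skip everything before the segment, then all but every n-th row
        return x < start_idx or (x - start_idx) % downsampling != 0

    return pd.read_csv(source, skiprows=skip, nrows=n_rows,
                       header=None, names=["z"])


# One value per line: 0, 1, 2, ..., 99
csv_text = "\n".join(str(v) for v in range(100)) + "\n"

segments = [read_segment(io.StringIO(csv_text), start, 5, 2)
            for start in (0, 20, 40)]

for segment in segments:
    print(segment["z"].tolist())
```

Because the start index is passed as a function argument, each call uses its own value, so successive segments begin at different rows.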
I am trying to work with around 100 CSV files to do a time series analysis.
To keep the algorithm efficient, I've structured my read_csv function so that it reads all the files once and I don't have to repeat the same process again and again. To explain further, the following is my code:
from itertools import chain, combinations
import pandas as pd
start_date = '2016-06-01'
end_date = '2017-09-02'
allocation = 170000
#contains 100 symbols
usesymbols = ['']
cost_matrix = []
def data():
    dates=pd.date_range(start_date,end_date)
    df=pd.DataFrame(index=dates)
    for symbol in usesymbols:
        df_temp=pd.read_csv('/home/furqan/Desktop/python_data/{}.csv'.format(str(symbol)),usecols=['Date','Close'],
                            parse_dates=True,index_col='Date',na_values=['nan'])
        df_temp = df_temp.rename(columns={'Close': symbol})
        df=df.join(df_temp)
    df=df.fillna(method='ffill')
    df=df.fillna(method='bfill')
    return df
def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1))
power_set = list(powerset(usesymbols))
dataframe = data()
Problem is that if I run the above code with 15 symbols it works perfectly.
But that's not sufficient, I want to use 100 symbols.
If I run the code with 100 items in usesymbols, my RAM is used up completely and the machine freezes.
Is there anything that can be done to avoid this situation?
Edited Part:
1) I have 16 GB of RAM.
2) The issue is with the variable power_set: if I don't call the powerset function, the data is retrieved easily.
DataFrame.memory_usage(index=False)
Return:
sizes : Series
A series with column names as index and memory usage of columns with units of bytes.
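Regarding the power_set variable identified in the edit above: a set of n symbols has 2**n - 1 non-empty subsets, so 15 symbols give 32767 subsets while 100 symbols give roughly 1.27e30, which no amount of RAM can hold as a list. The powerset function itself is lazy; it is the list(...) call that forces everything into memory. The sketch below, using hypothetical placeholder symbol names, counts the subsets arithmetically and takes only the first few from the iterator.

```python
from itertools import chain, combinations, islice


def powerset(iterable):
    """Lazily yield every non-empty subset, smallest subsets first."""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(1, len(s) + 1))


# Hypothetical placeholder names standing in for the 100 real symbols
symbols = [f"SYM{k}" for k in range(100)]

# Number of non-empty subsets, computed without enumerating them
n_subsets = 2 ** len(symbols) - 1

# Pull only a handful of subsets from the lazy iterator
first_five = list(islice(powerset(symbols), 5))

print(n_subsets)
print(first_five)
```

Iterating lazily keeps memory flat, but visiting all subsets of 100 symbols is still infeasible in time, so in practice the search would have to be restricted, for example to subsets of a bounded size.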
I have to process a huge pandas.DataFrame (several tens of GB) on a row-by-row basis, where each row operation is quite lengthy (a couple of tens of milliseconds). So I had the idea to split the frame into chunks and process each chunk in parallel using multiprocessing. This does speed up the task, but the memory consumption is a nightmare.
Although each child process should in principle only consume a tiny chunk of the data, it needs (almost) as much memory as the original parent process that contained the original DataFrame. Even deleting the used parts in the parent process does not help.
I wrote a minimal example that replicates this behavior. All it does is create a large DataFrame with random numbers, chunk it into little pieces of at most 100 rows, and print some information about each piece during multiprocessing (here via an mp.Pool of size 4).
The main function that is executed in parallel:
def just_wait_and_print_len_and_idx(df):
    """Waits for 1 second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate a lengthy operation
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))
The helper generator to chunk a DataFrame into little pieces:
def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0 # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)
And the main routine:
def main():
    # Job parameters
    n_jobs = 4 # Poolsize
    size = (10000, 1000) # Size of DataFrame
    chunksize = 100 # Maximum size of Frame Chunk
    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
    pool.close()
    pool.join()
    print('DONE')
The standard output looks like this:
Starting MP
Preparing chunk 1
Preparing chunk 2
First idx 0, last idx 99 and len 100 from process 9913
First idx 100, last idx 199 and len 100 from process 9914
Preparing chunk 3
First idx 200, last idx 299 and len 100 from process 9915
Preparing chunk 4
...
DONE
The Problem:
The main process needs about 120 MB of memory. However, the child processes of the pool need the same amount of memory, although they only contain 1% of the original DataFrame (chunks of size 100 vs. the original length of 10000). Why?
What can I do about it? Does Python (3) send the whole DataFrame to each child process despite my chunking? Is that a problem of pandas memory management or the fault of multiprocessing and data pickling? Thanks!
Whole script for simple copy and paste in case you want to try it yourself:
import multiprocessing as mp
import pandas as pd
import numpy as np
import time
import os

def just_wait_and_print_len_and_idx(df):
    """Waits for 1 second and prints df length and first and last index"""
    # Extract some info
    idx_values = df.index.values
    first_idx, last_idx = idx_values[0], idx_values[-1]
    length = len(df)
    pid = os.getpid()
    # Simulate a lengthy operation
    time.sleep(1)
    # Print the info
    print('First idx {}, last idx {} and len {} '
          'from process {}'.format(first_idx, last_idx, length, pid))

def df_chunking(df, chunksize):
    """Splits df into chunks, drops data of original df inplace"""
    count = 0 # Counter for chunks
    while len(df):
        count += 1
        print('Preparing chunk {}'.format(count))
        # Return df chunk
        yield df.iloc[:chunksize].copy()
        # Delete data in place because it is no longer needed
        df.drop(df.index[:chunksize], inplace=True)

def main():
    # Job parameters
    n_jobs = 4 # Poolsize
    size = (10000, 1000) # Size of DataFrame
    chunksize = 100 # Maximum size of Frame Chunk
    # Preparation
    df = pd.DataFrame(np.random.rand(*size))
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunking(df, chunksize))
    pool.close()
    pool.join()
    print('DONE')

if __name__ == '__main__':
    main()
Ok, so I figured it out after the hint by Sebastian Opałczyński in the comments.
The problem is that the child processes are forked from the parent, so each of them holds a reference to the original DataFrame. Because the frame is then modified in the parent process, copy-on-write duplicates the affected memory, so consumption grows steadily and the job eventually dies once the limit of physical memory is reached.
There is a simple solution: Instead of pool = mp.Pool(n_jobs), I use the new context feature of multiprocessing:
ctx = mp.get_context('spawn')
pool = ctx.Pool(n_jobs)
This guarantees that the Pool processes are just spawned and not forked from the parent process. Accordingly, none of them has access to the original DataFrame and all of them only need a tiny fraction of the parent's memory.
Note that the mp.get_context('spawn') is only available in Python 3.4 and newer.
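Put together, the spawn-based approach might look like the sketch below. The helper names (iter_chunks, chunk_length, run_with_spawn) are assumptions for illustration and are not part of the original script; iter_chunks also slices by position instead of dropping rows in place, which is a swapped-in simplification. With the spawn start method the worker function must live at module level in an importable file, and the call has to sit behind an if __name__ == '__main__': guard.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def iter_chunks(df, chunksize):
    """Yield independent copies of consecutive row blocks without mutating df."""
    for start in range(0, len(df), chunksize):
        yield df.iloc[start:start + chunksize].copy()


def chunk_length(chunk):
    """Hypothetical worker: return the number of rows in the chunk."""
    return len(chunk)


def run_with_spawn(df, n_jobs, chunksize):
    """Process chunks in a pool whose workers are spawned, not forked."""
    ctx = mp.get_context('spawn')
    with ctx.Pool(n_jobs) as pool:
        # Each worker only ever receives the pickled chunk it is handed
        return list(pool.imap(chunk_length, iter_chunks(df, chunksize)))
```

In a script this would be called as run_with_spawn(pd.DataFrame(np.random.rand(10000, 1000)), n_jobs=4, chunksize=100) inside the __main__ guard. Spawned workers start a fresh interpreter, so startup is slower than with fork, which is the trade-off for not inheriting the parent's memory.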
A better implementation is to let pandas produce the chunks: read_csv with the chunksize parameter returns an iterator of DataFrames, which can be fed straight into pool.imap:
pd.read_csv('<filepath>.csv', chunksize=<chunksize>)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Benefit: The main process never reads the whole DataFrame, which saves memory. Each child process receives only the chunk it needs, which solves the child memory issue.
Overhead: You have to save your DataFrame as a CSV file first and read it in again using pd.read_csv, which costs I/O time.
Note: chunksize is not available for pd.read_pickle or other loading methods that are compressed on storage.
def main():
    # Job parameters
    n_jobs = 4 # Poolsize
    chunksize = 100 # Maximum size of Frame Chunk
    # Preparation
    pool = mp.Pool(n_jobs)
    print('Starting MP')
    # Read the CSV lazily; each iteration yields a DataFrame of up to chunksize rows
    df_chunked = pd.read_csv('<filepath>.csv', chunksize=chunksize) # modified
    # Execute the wait and print function in parallel
    pool.imap(just_wait_and_print_len_and_idx, df_chunked) # modified
    pool.close()
    pool.join()
    print('DONE')
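To see what the chunked reader hands to the pool, it can be iterated directly. The sketch below uses a small in-memory CSV (an assumption made so that it runs without a file on disk) and records the length and index range of each chunk.

```python
import io

import pandas as pd

# In-memory CSV with a header line and 250 data rows
csv_text = "value\n" + "\n".join(str(v) for v in range(250)) + "\n"

# With chunksize set, read_csv returns an iterator instead of a DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=100)

chunk_summaries = [(len(chunk), int(chunk.index[0]), int(chunk.index[-1]))
                   for chunk in reader]

print(chunk_summaries)
```

The row index keeps counting across chunks, so the first and last index printed by the worker function still identify where each chunk sits in the file.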