Memory Error with Multiprocessing in Python - python

I'm trying to perform some costly scientific calculations with Python. I have to read a bunch of data stored in CSV files and process them. Since each sample takes a long time to process and I have 8 processors available, I was trying to use the Pool method from multiprocessing.
This is how I structured the multiprocessing call:
pool = Pool()
vector_components = []
for sample in range(samples):
    vector_field_x_i = vector_field_samples_x[sample]
    vector_field_y_i = vector_field_samples_y[sample]
    vector_component = pool.apply_async(vector_field_decomposer, args=(x_dim, y_dim, x_steps, y_steps,
                                                                       vector_field_x_i, vector_field_y_i))
    vector_components.append(vector_component)
pool.close()
pool.join()

vector_components = map(lambda k: k.get(), vector_components)

for vector_component in vector_components:
    CsvH.write_vector_field(vector_component, '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')
I was running a data set of 500 samples of size equal to 100 (x_dim) by 100 (y_dim).
Until then everything worked fine.
Then I receive a data set of 500 samples of 400 x 400.
When running it, I get an error when calling get() on the results.
I also tried to run a single sample of 400 x 400 and got the same error.
Traceback (most recent call last):
  File "__init__.py", line 33, in <module>
    VfD.samples_vector_field_decomposer(samples, x_dim, y_dim, x_steps, y_steps, vector_field_samples_x, vector_field_samples_y)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in samples_vector_field_decomposer
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in <lambda>
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/.pyenv/versions/2.7.5/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
MemoryError
What should I do?
Thank you in advance.

Right now you're keeping several lists in memory: vector_field_samples_x, vector_field_samples_y, vector_components, and then a separate copy of vector_components during the map call (which is when you actually run out of memory). You can avoid needing either copy of the vector_components list by using pool.imap instead of pool.apply_async with a manually created list. imap returns an iterator instead of a complete list, so you never have all the results in memory at once.
Normally, pool.map breaks the iterable passed to it into chunks, and sends those chunks to the child processes, rather than sending one element at a time. This helps improve performance. Because imap uses an iterator instead of a list, it doesn't know the complete size of the iterable you're passing to it. Without knowing the size of the iterable, it doesn't know how big to make each chunk, so it defaults to a chunksize of 1, which will work, but may not perform all that well. To avoid this, you can provide it with a good chunksize argument, since you know the iterable is samples elements long. It may not make much difference for your 500 element list, but it's worth experimenting with.
Here's some sample code that demonstrates all this:
import multiprocessing
from functools import partial

def vector_field_decomposer(x_dim, y_dim, x_steps, y_steps, vector_fields):
    vector_field_x_i = vector_fields[0]
    vector_field_y_i = vector_fields[1]
    # Do whatever is normally done here.

if __name__ == "__main__":
    num_workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_workers)
    # Calculate a good chunksize (based on the implementation of pool.map)
    chunksize, extra = divmod(samples, 4 * num_workers)
    if extra:
        chunksize += 1
    # Use partial so the constant arguments can be passed to vector_field_decomposer
    func = partial(vector_field_decomposer, x_dim, y_dim, x_steps, y_steps)
    # We use a generator expression as the iterable, so we don't create a full list.
    results = pool.imap(func,
                        ((vector_field_samples_x[s], vector_field_samples_y[s]) for s in xrange(samples)),
                        chunksize=chunksize)
    for sample, vector_component in enumerate(results):
        CsvH.write_vector_field(vector_component,
                                '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')
    pool.close()
    pool.join()
This should allow you to avoid the MemoryError issues, but if not, you could try running imap over smaller chunks of your total sample, and just do multiple passes. I don't think you'll have any issues though, because you're not building any additional lists, other than the vector_field_* lists you start with.
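If you do end up needing multiple passes, a rough, untested sketch (reusing func, pool and chunksize from the example above; pass_size is just an illustrative number) might look like:

pass_size = 100  # illustrative number of samples handled per pass
for start in xrange(0, samples, pass_size):
    stop = min(start + pass_size, samples)
    batch = ((vector_field_samples_x[s], vector_field_samples_y[s]) for s in xrange(start, stop))
    for offset, vector_component in enumerate(pool.imap(func, batch, chunksize=chunksize)):
        CsvH.write_vector_field(vector_component,
                                '../CSV/RotationalFree/rotational_free_x_' + str(start + offset) + '.csv')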

Related

Python multiple processes to read input and call an expensive model

I have a setup with 2 functions, like this.
def fun1(input_stream, output_stream):
    batch_data = []
    # read input line by line and construct a batch of size batch_size
    for line in input_stream:
        batch_data.append(process(line))
        if len(batch_data) == batch_size:
            batch_results = fun2(batch_data)
            # write results line by line to output stream
            batch_data = []

def fun2(batch_data):
    # call an expensive model and return the response
    return process(expensive_call(batch_data))
In the setup, an external caller calls fun1. fun2 is waiting to get the batch from fun1, and when the model is called, fun1 is waiting idly.
My first intuition is to see if we can use multiprocessing to separate fun1 and fun2 into 2 processes. fun1 keeps writing to a queue of max size (say, batch_size * 5), and whenever fun2 is free, it processes whatever is available in the queue (if a full batch or more is available, it reads a batch; otherwise, it reads whatever is available).
I am experienced in python but have never had to use multi-processing/multi-threading. What is the best way to do this in python? Will it be better to use multi-processing/multi-threading, and what is the difference?
Also, will it be a good idea to do the writing to the output_stream asynchronously as well?
Are there any other ways to speed it up?
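Here's a rough, untested sketch of the queue-based split I'm picturing (reusing process, fun2 and batch_size from above, assuming fun2 returns an iterable of per-line results, and pretending the streams can simply be handed to the child processes):

from multiprocessing import Process, Queue

def producer(input_stream, queue):
    batch_data = []
    for line in input_stream:
        batch_data.append(process(line))
        if len(batch_data) == batch_size:
            queue.put(batch_data)      # blocks when the queue is full
            batch_data = []
    if batch_data:
        queue.put(batch_data)          # last, possibly short, batch
    queue.put(None)                    # sentinel: no more batches

def consumer(queue, output_stream):
    while True:
        batch_data = queue.get()
        if batch_data is None:
            break
        for result in fun2(batch_data):
            output_stream.write(str(result) + '\n')

if __name__ == '__main__':
    queue = Queue(maxsize=5)           # at most 5 pending batches (~ batch_size * 5 lines)
    p1 = Process(target=producer, args=(input_stream, queue))
    p2 = Process(target=consumer, args=(queue, output_stream))
    p1.start(); p2.start()
    p1.join(); p2.join()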
I would turn fun1 into a generator function that yields its batches and can therefore be used as an iterable with either the multiprocessing.Pool.imap or multiprocessing.Pool.imap_unordered method of multiprocessing.Pool (see the code comments for the distinction). These methods allow you to do something with the results as they become available, unlike map, which will not return until all batches have been processed.
from multiprocessing import Pool

def fun1(input_stream, output_stream):
    batch_data = []
    # read input line by line and construct a batch of size batch_size
    for line in input_stream:
        batch_data.append(process_line(line))
        if len(batch_data) == batch_size:
            yield batch_data
            batch_data = []
    # The possibility exists (no?) that the input is not a multiple of batch_size, so:
    if batch_data:
        yield batch_data

def fun2(batch_data):
    # call an expensive model and return the response
    return process(expensive_call(batch_data))

def main():
    pool = Pool()
    # The iterable, i.e. the fun1 generator function, can be lazily evaluated:
    results = pool.imap(fun2, fun1(input_stream, output_stream))
    # Iterate the results from fun2 as they become available.
    # Substitute pool.imap_unordered for pool.imap if you are willing to have
    # the results returned in completion order rather than task-submission order.
    # imap_unordered can be slightly more efficient.
    for result in results:
        ...  # do something with the return value from fun2

# Required for Windows:
if __name__ == '__main__':
    main()

Multiprocessing Code not producing results

I am running the below code in Python. The code shows an example of how to use multiprocessing in Python, but it is not printing the result (line 16). Only the dataset (line 9) is getting printed, and the kernel keeps on running. I expected the result to appear almost instantly, since the code spreads the work across several CPU cores through multiprocessing. Can someone tell me what the issue is?
from multiprocessing import Pool

def square(x):
    # calculate the square of the value of x
    return x*x

if __name__ == '__main__':
    # Define the dataset
    dataset = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
    # Output the dataset
    print ('Dataset: ' + str(dataset)) # Line 9
    # Run this with a pool of 5 agents having a chunksize of 3 until finished
    agents = 5
    chunksize = 3
    with Pool(processes=agents) as pool:
        result = pool.map(square, dataset, chunksize)
    # Output the result
    print ('Result: ' + str(result)) # Line 16
You are running under Windows. Look at the console output where you started up Jupyter. You will see 5 instances (you have 5 processes in your pool) of:
AttributeError: Can't get attribute 'square' on <module '__main__' (built-in)>
You need to put your worker function in a file, for instance, workers.py, in the same directory as your Jupyter code:
workers.py
def square(x):
    # calculate the square of the value of x
    return x*x
Then remove the above function from your cell and instead add:
import workers
and then:
with Pool(processes=agents) as pool:
    result = pool.map(workers.square, dataset, chunksize)
Note
See my comment(s) to your post concerning the chunksize argument to the map method. The default chunksize for the map method, when None is specified, is calculated more or less as follows based on the iterable argument:
if not hasattr(iterable, '__len__'):
    iterable = list(iterable)
if chunksize is None:
    chunksize, extra = divmod(len(iterable), len(self._pool) * 4)
    if extra:
        chunksize += 1
So I misspoke in one detail: regardless of whether you specify a chunksize or not, map will convert your iterable to a list if it needs to in order to get its length. But essentially the default chunksize used is the ceiling of the number of jobs being submitted divided by 4 times the pool size.
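As a quick sanity check with the numbers from your code (this just mirrors the snippet above, it is not the library source):

# 14 items in the dataset, a pool of 5 processes:
chunksize, extra = divmod(14, 5 * 4)   # -> (0, 14)
if extra:
    chunksize += 1                     # default chunksize would therefore be 1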
There is a known bug with older versions of Python on Windows 10: https://bugs.python.org/issue35797
It occurs when using multiprocessing through a venv.
The bugfix was released in Python 3.7.3.

Dask delayed / dask array no response

I have a distributed dask cluster setup and I have used it to load and transform a bunch of data. Works like a charm.
I want to use it to do some processing in parallel. Here's my function:
el = 5000
n_using = 26
n_across = 6

mat = np.random.random((el, n_using, n_across))
idx = np.tril_indices(n_across*2, -n_across)

def get_vals(c1, m, el, idx):
    m1 = m[c1, :, :]
    corr_vals = np.zeros((el, (n_across//2)*(n_across+1)))
    for c2 in range(c1+1, el):
        corr = np.corrcoef(m1.T, m[c2, :, :].T)
        corr_vals[c2] = corr[idx]
    return corr_vals

lazy_get_val = dask.delayed(get_vals, pure=True)
Here is a single processor version of what I'm trying to do:
arrays = [get_vals(c1, mat, el, idx) for c1 in range(el)]
all_corr = np.stack(arrays, axis=0)
Works fine but takes a few hours.
Here's my go at doing this in dask:
lazy_list = [lazy_get_val(c1, mat, el, idx) for c1 in range(el)]
arrays = [da.from_delayed(lazy_item, dtype=float, shape=(el, 21)) for lazy_item in lazy_list]
all_corr = da.stack(arrays, axis=0)
Even if I run all_corr[1].compute(), it just sits there and doesn't respond. When I interrupt the kernel, it seems to be stuck at /distributed/utils.py:
~/.../lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    249     else:
    250         while not e.is_set():
--> 251             e.wait(10)
    252     if error[0]:
    253         six.reraise(*error[0])
Any suggestions on debugging this?
Other things:
If I run it with a smaller mat (el = 1000), it runs fine.
If I make el = 5000, it hangs.
If I interrupt the kernel and run it again with el = 1000, it hangs.
After adding imports to the example I ran things and it was very slow while building the graph. This can be improved by avoiding placing numpy arrays directly in delayed calls as follows:
# mat = np.random.random((el,n_using,n_across))
# idx = np.tril_indices(n_across*2, -n_across)
mat = dask.delayed(np.random.random)((el,n_using,n_across))
idx = dask.delayed(np.tril_indices)(n_across*2, -n_across)
Or by removing the pure=True keyword to dask.delayed (when you set pure=True it has to hash the contents of all inputs to get a unique key for them, and you're doing this 5000 times). I found this out by profiling your code with the %snakeviz magic in IPython.
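Concretely, that second option is just the following (the %snakeviz lines assume you have the snakeviz extension installed):

# Let dask generate task keys without hashing the large inputs on every call:
lazy_get_val = dask.delayed(get_vals)   # pure=True removed

# Profiling a single block in IPython/Jupyter, as mentioned above:
#   %load_ext snakeviz
#   %snakeviz all_corr[1].compute()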
I then ran all_corr[1].compute() and it was fine. I then ran all_corr.compute() and it seemed like it would progress to completion, but wasn't very fast. I suspect that either your tasks are too small so that there is too much overhead, or that your code is spending too much time in Python for loops and so is running into GIL issues. Not sure which.
The next thing I would recommend trying would be the dask.distributed scheduler, which would handle the GIL issue better but exacerbate the overhead issue. Seeing how that performs would probably help isolate the issue.
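A minimal sketch of pointing the computation at the distributed scheduler (the address below is a placeholder; Client() with no arguments starts a local cluster instead):

from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address of your scheduler
# Once a Client exists it becomes the default scheduler, so this now runs distributed:
result = all_corr[1].compute()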

Determining Memory Consumption with Python Multiprocessing and Shared Arrays

I've spent some time researching the Python multiprocessing module, its use of os.fork, and shared memory using Array in multiprocessing for a piece of code I'm writing.
The project itself boils down to this: I have several MxN arrays (let's suppose I have 3 arrays called A, B, and C) that I need to process to calculate a new MxN array (called D), where:
Dij = f(Aij, Bij, Cij)
The function f is such that standard vector operations cannot be applied. This task is what I believe is called "embarrassingly parallel". Given the overhead involved in multiprocessing, I am going to break the calculation of D into blocks. For example, if D was 8x8 and I had 4 processes, each processor would be responsible for solving a 4x4 "chunk" of D.
Now, the size of the arrays has the potential to be very big (on the order of several GB), so I want all arrays to use shared memory (even array D, which will have sub-processes writing to it). I believe I have a solution to the shared array issue using a modified version of what is presented here.
However, from an implementation perspective it'd be nice to place arrays A, B, and C into a dictionary. What is unclear to me is if doing this will cause the arrays to be copied in memory when the reference counter for the dictionary is incremented within each sub-process.
To try and answer this, I wrote a little test script (see below) and tried running it using valgrind --tool=massif to track memory usage. However, I am not quite clear how to interpret the results from it. Specifically, whether each massif.out file (where the number of files is equal to the number of sub-processes created by my test script + 1) denotes the memory used by that process (i.e. I need to sum them all up to get the total memory usage) or if I just need to consider the massif.out associated with the parent process.
On a side note: one of my shared memory arrays has the sub-processes writing to it. I know that this should be avoided, especially since I am not using locks to limit only one sub-process writing to the array at any given time. Is this a problem? My thought is that since the order in which the array is filled out is irrelevant, the calculation of any index is independent of any other index, and any given sub-process will never write to the same array index as any other process, there will not be any sort of race conditions. Is this correct?
#! /usr/bin/env python

import multiprocessing as mp
import ctypes
import numpy as np
import time
import sys
import timeit


def shared_array(shape=None, lock=False):
    """
    Form a shared memory numpy array.
    https://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-python-multiprocessing
    """
    shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)

    # Create a locked or unlocked array
    if lock:
        shared_array = np.frombuffer(shared_array_base.get_obj())
    else:
        shared_array = np.frombuffer(shared_array_base)
    shared_array = shared_array.reshape(*shape)
    return shared_array


def worker(indices=None, queue=None, data=None):
    # Loop over each index and "crush" some data
    for i in indices:
        time.sleep(0.01)
        if data is not None:
            data['sink'][i, :] = data['source'][i, :] + i
        # Place the ID of the completed index into the queue
        queue.put(i)


if __name__ == '__main__':
    # Set the start time
    begin = timeit.default_timer()

    # Size of arrays (m x n)
    m = 1000
    n = 1000

    # Number of Processors
    N = 2

    # Create a queue to use for tracking progress
    queue = mp.Queue()

    # Create dictionary and shared arrays
    data = dict()

    # Form a shared array without a lock.
    data['source'] = shared_array(shape=(m, n), lock=True)
    data['sink'] = shared_array(shape=(m, n), lock=False)

    # Create a list of the indices associated with the m direction
    indices = range(0, m)

    # Parse the indices list into range blocks; each process will get a block
    indices_blocks = [int(i) for i in np.linspace(0, 1000, N+1)]

    # Initialize a list for storing created sub-processes
    procs = []

    # Print initialization time-stamp
    print 'Time to initialize time: {}'.format(timeit.default_timer() - begin)

    # Create and start each sub-process
    for i in range(1, N+1):
        # Start of the block
        start = indices_blocks[i-1]
        # End of the block
        end = indices_blocks[i]

        # Create the sub-process
        procs.append(mp.Process(target=worker,
                                args=(indices[start:end], queue, data)))

        # Kill the sub-process if/when the parent is killed
        procs[-1].daemon = True

        # Start the sub-process
        procs[-1].start()

    # Initialize a list to store the indices that have been processed
    completed = []

    # Enter a loop dependent on whether any of the sub-processes are still alive
    while any(i.is_alive() for i in procs):
        # Read the queue, append completed indices, and print the progress
        while not queue.empty():
            done = queue.get()
            if done not in completed:
                completed.append(done)
                message = "\rCompleted {:.2%}".format(float(len(completed))/len(indices))
                sys.stdout.write(message)
                sys.stdout.flush()
    print ''

    # Join all the sub-processes
    for p in procs:
        p.join()

    # Print the run time and the modified sink array
    print 'Running time: {}'.format(timeit.default_timer() - begin)
    print data['sink']
Edit: It seems I've run into another issue; specifically, a value of n equal to 3 million will result in the kernel killing the process (I assume it's due to a memory issue). This appears to be tied to how shared_array() works (I can create np.zeros arrays of the same size and not have an issue). After playing with it a bit I get the traceback shown below. I'm not entirely sure what is causing the memory allocation error, but a quick Google search gives discussions about how mmap maps virtual address space, which I'm guessing is smaller than the amount of physical memory a machine has?
Traceback (most recent call last):
  File "./shared_array.py", line 66, in <module>
    data['source'] = shared_array(shape=(m, n), lock=True)
  File "./shared_array.py", line 17, in shared_array
    shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)
  File "/usr/apps/python/lib/python2.7/multiprocessing/__init__.py", line 260, in Array
    return Array(typecode_or_type, size_or_initializer, **kwds)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 120, in Array
    obj = RawArray(typecode_or_type, size_or_initializer)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 88, in RawArray
    obj = _new_value(type_)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 68, in _new_value
    wrapper = heap.BufferWrapper(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 243, in __init__
    block = BufferWrapper._heap.malloc(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 223, in malloc
    (arena, start, stop) = self._malloc(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 120, in _malloc
    arena = Arena(length)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 82, in __init__
    self.buffer = mmap.mmap(-1, size)
mmap.error: [Errno 12] Cannot allocate memory
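For scale, the size of the single block being requested is easy to check; the numbers below are just arithmetic on the requested shape, not measured values:

# Each shared array is m * n C doubles backed by one anonymous mmap:
m, n = 1000, 3000000
bytes_requested = m * n * 8            # 8 bytes per c_double
print bytes_requested / 1024.0 ** 3    # ~22.4 GiB per array, and the script makes two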

Fastest way to process a large file?

I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated; I am just grabbing the value in column1 into List1, column2 into List2, etc. I might need to add some column values together.
I am using python 2.7 on a linux box that has 30GB of memory. ASCII Text.
Any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment, and run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
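For example, a quick sketch of that raw-read comparison (path is whichever 3 GB file you want to test against):

import os
import time

fd = os.open(path, os.O_RDONLY)
start = time.time()
total = 0
while True:
    buf = os.read(fd, 1024 * 1024)   # pull 1 MB at a time, no line splitting at all
    if not buf:
        break
    total += len(buf)
os.close(fd)
print '%d bytes in %.1f seconds' % (total, time.time() - start)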
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
while True:
lines = infile.readlines(bufsize)
if not lines:
break
for line in lines:
process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
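For instance, manually scanning the map for newlines (rather than calling readline on it) might look roughly like this, reusing process and path from above (Python 2, to match the question):

import mmap

with open(path) as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while True:
        end = m.find('\n', start)   # scan the mapping for the next newline
        if end == -1:
            if start < len(m):
                process(m[start:])  # trailing data without a final newline
            break
        process(m[start:end])
        start = end + 1
    m.close()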
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
import os
import gc
import time
import psutil
import multiprocessing as mp


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    function to divide a large text file into chunks each having size ~= size
    so that the chunks are line aligned

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~= this
        skiplines : number of lines in the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end position of chunks in Bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if(skiplines > 0):
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1

            if chunkEnd > fileEnd:
                break
    return chunks


def parallel_apply_line_by_line_chunk(chunk_data):
    """
    function to apply a function to each line in a chunk

    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if(ret != None):
                chunk_res.append(ret)
    return chunk_res


def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    function to apply a supplied function line by line in parallel

    Params :
        input_file_path : path to input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is num of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    # maxtasksperchild - if not supplied, something weird happens and memory
    # blows up as the processes keep lingering
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i : i + num_parallel])

        for i, subl in enumerate(chunk_outputs):
            for x in subl:
                if(fout != None):
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del(chunk_outputs)
        gc.collect()

    print("All Done in time ", time.time() - t1)
    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs
Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed up of ~8 times as compared to vanilla line by line reading on a sample file of size ~20GB in which I do some moderately complicated processing on each line.
