I have a setup with 2 functions, like this.
def fun1(input_stream, output_stream):
    batch_data = []
    # read the input line by line and construct a batch of size batch_size
    for line in input_stream:
        batch_data.append(process(line))
        if len(batch_data) == batch_size:
            batch_results = fun2(batch_data)
            # write results line by line to the output stream
            batch_data = []

def fun2(batch_data):
    # call an expensive model and return the response
    return process(expensive_call(batch_data))
In this setup, an external caller calls fun1. fun2 waits for fun1 to hand it a batch, and while the model call is running, fun1 sits idle.
My first intuition is to use multiprocessing to separate fun1 and fun2 into two processes: fun1 keeps writing to a bounded queue (of max size, say, batch_size * 5), and whenever fun2 is free it processes whatever is available in the queue (a full batch if one or more is available, otherwise whatever is there).
I am experienced in Python but have never had to use multiprocessing or multithreading. What is the best way to do this in Python? Is multiprocessing or multithreading the better choice here, and what is the difference between them?
Also, will it be a good idea to do the writing to the output_stream asynchronously as well?
Are there any other ways to speed it up?
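For reference, here is a minimal sketch of the producer/consumer idea described in the question, assuming process, expensive_call, batch_size, input_stream, and output_stream exist as above, and assuming a fork-based start method so the stream objects are simply inherited by the child process. It is one possible shape, not a definitive implementation.

import multiprocessing as mp

def producer(input_stream, queue, batch_size):
    batch = []
    for line in input_stream:
        batch.append(process(line))
        if len(batch) == batch_size:
            queue.put(batch)          # blocks when the queue is full
            batch = []
    if batch:
        queue.put(batch)              # final partial batch
    queue.put(None)                   # sentinel: no more batches

def consumer(queue, output_stream):
    while True:
        batch = queue.get()
        if batch is None:
            break
        for result in process(expensive_call(batch)):
            output_stream.write(str(result) + '\n')

if __name__ == '__main__':
    queue = mp.Queue(maxsize=5)       # at most 5 batches buffered at once
    cons = mp.Process(target=consumer, args=(queue, output_stream))
    cons.start()
    producer(input_stream, queue, batch_size)
    cons.join()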
I would turn fun1 into a generator function that yields its batches and can be used as an iterable with either the multiprocessing.Pool.imap or multiprocessing.Pool.imap_unordered method of multiprocessing.Pool (see the code comments for the distinction). These methods allow you to do something with the final results as they become available, unlike map, which will not return until all batches have been processed.
from multiprocessing import Pool

def fun1(input_stream, output_stream):
    batch_data = []
    # read the input line by line and construct a batch of size batch_size
    for line in input_stream:
        batch_data.append(process_line(line))
        if len(batch_data) == batch_size:
            yield batch_data
            batch_data = []
    # The possibility exists (no?) that the input is not a multiple of batch_size, so:
    if batch_data:
        yield batch_data

def fun2(batch_data):
    # call an expensive model and return the response
    return process(expensive_call(batch_data))

def main():
    pool = Pool()
    # The iterable, i.e. the fun1 generator function, can be lazily evaluated:
    results = pool.imap(fun2, fun1(input_stream, output_stream))
    # Iterate the results from fun2 as they become available.
    # Substitute pool.imap_unordered for pool.imap if you are willing to have
    # the results returned in completion order rather than task-submission order.
    # imap_unordered can be slightly more efficient.
    for result in results:
        ...  # do something with the return value from fun2

# Required for Windows:
if __name__ == '__main__':
    main()
I've spent some time researching the Python multiprocessing module, its use of os.fork, and shared memory using multiprocessing.Array for a piece of code I'm writing.
The project itself boils down to this: I have several MxN arrays (let's suppose I have 3 arrays called A, B, and C) that I need to process to calculate a new MxN array (called D), where:
D[i,j] = f(A[i,j], B[i,j], C[i,j])
The function f is such that standard vector operations cannot be applied. This task is what I believe is called "embarrassingly parallel". Given the overhead involved in multiprocessing, I am going to break the calculation of D into blocks. For example, if D were 8x8 and I had 4 processes, each processor would be responsible for solving a 4x4 "chunk" of D.
Now, the size of the arrays has the potential to be very big (on the order of several GB), so I want all arrays to use shared memory (even array D, which will have sub-processes writing to it). I believe I have a solution to the shared array issue using a modified version of what is presented here.
However, from an implementation perspective it'd be nice to place arrays A, B, and C into a dictionary. What is unclear to me is if doing this will cause the arrays to be copied in memory when the reference counter for the dictionary is incremented within each sub-process.
To try and answer this, I wrote a little test script (see below) and tried running it using valgrind --tool=massif to track memory usage. However, I am not quite clear how to interpret the results from it. Specifically, whether each massif.out file (where the number of files is equal to the number of sub-processes created by my test script + 1) denotes the memory used by that process (i.e. I need to sum them all up to get the total memory usage) or if I just need to consider the massif.out associated with the parent process.
On a side note: one of my shared memory arrays has the sub-processes writing to it. I know this should be avoided, especially since I am not using locks to limit writing to the array to one sub-process at any given time. Is this a problem? My thinking is that since the order in which the array is filled out is irrelevant, the calculation of any index is independent of any other index, and any given sub-process will never write to the same array index as any other process, there will not be any race conditions. Is this correct?
#! /usr/bin/env python

import multiprocessing as mp
import ctypes
import numpy as np
import time
import sys
import timeit


def shared_array(shape=None, lock=False):
    """
    Form a shared memory numpy array.

    https://stackoverflow.com/questions/5549190/is-shared-readonly-data-copied-to-different-processes-for-python-multiprocessing
    """
    shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)

    # Create a locked or unlocked array
    if lock:
        shared_array = np.frombuffer(shared_array_base.get_obj())
    else:
        shared_array = np.frombuffer(shared_array_base)

    shared_array = shared_array.reshape(*shape)
    return shared_array


def worker(indices=None, queue=None, data=None):
    # Loop over each index and "crunch" some data
    for i in indices:
        time.sleep(0.01)
        if data is not None:
            data['sink'][i, :] = data['source'][i, :] + i

        # Place the ID of the completed index into the queue
        queue.put(i)


if __name__ == '__main__':
    # Set the start time
    begin = timeit.default_timer()

    # Size of arrays (m x n)
    m = 1000
    n = 1000

    # Number of processors
    N = 2

    # Create a queue to use for tracking progress
    queue = mp.Queue()

    # Create the dictionary and shared arrays
    data = dict()

    # Form the shared arrays: 'source' with a lock, 'sink' without one.
    data['source'] = shared_array(shape=(m, n), lock=True)
    data['sink'] = shared_array(shape=(m, n), lock=False)

    # Create a list of the indices associated with the m direction
    indices = range(0, m)

    # Parse the indices list into range blocks; each process will get a block
    indices_blocks = [int(i) for i in np.linspace(0, 1000, N+1)]

    # Initialize a list for storing created sub-processes
    procs = []

    # Print the initialization time-stamp
    print 'Time to initialize: {}'.format(timeit.default_timer() - begin)

    # Create and start each sub-process
    for i in range(1, N+1):

        # Start of the block
        start = indices_blocks[i-1]

        # End of the block
        end = indices_blocks[i]

        # Create the sub-process
        procs.append(mp.Process(target=worker,
                                args=(indices[start:end], queue, data)))

        # Kill the sub-process if/when the parent is killed
        procs[-1].daemon = True

        # Start the sub-process
        procs[-1].start()

    # Initialize a list to store the indices that have been processed
    completed = []

    # Loop while any of the sub-processes are still alive
    while any(i.is_alive() for i in procs):

        # Read the queue, append completed indices, and print the progress
        while not queue.empty():
            done = queue.get()
            if done not in completed:
                completed.append(done)

            message = "\rCompleted {:.2%}".format(float(len(completed))/len(indices))
            sys.stdout.write(message)
            sys.stdout.flush()
    print ''

    # Join all the sub-processes
    for p in procs:
        p.join()

    # Print the run time and the modified sink array
    print 'Running time: {}'.format(timeit.default_timer() - begin)
    print data['sink']
Edit: It seems I've run into another issue; specifically, a value of n equal to 3 million results in the kernel killing the process (I assume it's due to a memory issue). This appears to be related to how shared_array() works (I can create np.zeros arrays of the same size without an issue). After playing with it a bit I get the traceback shown below. I'm not entirely sure what is causing the memory allocation error, but a quick Google search turns up discussions about how mmap maps virtual address space, which I'm guessing is smaller than the amount of physical memory a machine has?
Traceback (most recent call last):
  File "./shared_array.py", line 66, in <module>
    data['source'] = shared_array(shape=(m, n), lock=True)
  File "./shared_array.py", line 17, in shared_array
    shared_array_base = mp.Array(ctypes.c_double, shape[0]*shape[1], lock=lock)
  File "/usr/apps/python/lib/python2.7/multiprocessing/__init__.py", line 260, in Array
    return Array(typecode_or_type, size_or_initializer, **kwds)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 120, in Array
    obj = RawArray(typecode_or_type, size_or_initializer)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 88, in RawArray
    obj = _new_value(type_)
  File "/usr/apps/python/lib/python2.7/multiprocessing/sharedctypes.py", line 68, in _new_value
    wrapper = heap.BufferWrapper(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 243, in __init__
    block = BufferWrapper._heap.malloc(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 223, in malloc
    (arena, start, stop) = self._malloc(size)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 120, in _malloc
    arena = Arena(length)
  File "/usr/apps/python/lib/python2.7/multiprocessing/heap.py", line 82, in __init__
    self.buffer = mmap.mmap(-1, size)
mmap.error: [Errno 12] Cannot allocate memory
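As a rough sanity check (assuming m stays at 1000 as in the script above), the arithmetic alone goes a long way toward explaining the failure: mp.Array is being asked for an enormous anonymous mapping.

# Each shared array of C doubles needs m * n * 8 bytes.
m, n = 1000, 3000000
bytes_needed = m * n * 8
print bytes_needed / float(1024 ** 3)   # ~22.4 GiB per array

# np.zeros of the same shape can appear to work because the OS hands out untouched
# zero pages lazily, whereas the mmap.mmap(-1, size) behind mp.Array asks the kernel
# for one contiguous ~22 GiB shared anonymous mapping, which it may refuse with ENOMEM.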
I am working with DEAP.
I am evaluating a population (currently 50 individuals) against a large dataset (400,000 columns of 200 floating-point values).
I have successfully tested the algorithm without any multiprocessing. Execution time is about 40s/generation.
I want to work with larger populations and more generations, so I try to speed up by using multiprocessing.
I guess that my question is more related to multiprocessing than to DEAP.
This question is not directly related to sharing memory/variables between processes. The main issue is how to minimise disk access.
I have started to work with Python multiprocessing module.
The code looks like this
toolbox = base.Toolbox()
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

PICKLE_SEED = 'D:\\Application Data\\Dev\\20150925173629ClustersFrame.pkl'
PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes = 2)
    toolbox.register("map", pool.map)
    data = pd.read_pickle(PICKLE_DATA).values
And then, a little bit further:
def main():
    NGEN = 10
    CXPB = 0.5
    MUTPB = 0.2

    population = toolbox.population_guess()
    fitnesses = list(toolbox.map(toolbox.evaluate, population))
    print(sorted(fitnesses, reverse = True))
    for ind, fit in zip(population, fitnesses):
        ind.fitness.values = fit
    # Begin the evolution
    for g in range(NGEN):
The evaluation function uses the global "data" variable.
and, finally:
if __name__ == "__main__":
    start = datetime.now()
    main()
    pool.close()
    stop = datetime.now()
    delta = stop-start
    print(delta.seconds)
So: the main processing loop and the pool definition are guarded by if __name__ == "__main__":.
It somehow works. Execution times are:
1 process: 398 s
2 processes: 270 s
3 processes: 272 s
4 processes: 511 s
Multiprocessing does not dramatically improve the execution time, and can even harm it.
The 4 process (lack of) performance can be explained by memory constraints. My system is basically paging instead of processing.
I guess that the other measurements can be explained by the loading of data.
My questions:
1) I understand that the file will be read and unpickled each time the module is started as a separate process. Is this correct? Does this mean it will be read each time one of the functions it contains is called by map?
2) I have tried to move the unpickling under the if __name__ == "__main__": guard, but then I get an error message saying that "data" is not defined when I call the evaluation function. Could you explain how I can read the file once and then only pass the array to the processes?
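One common pattern (a sketch, independent of DEAP specifics; evaluate here is a stand-in for your real evaluation function) is to load the pickle once per worker process through the Pool initializer, so that each worker holds its own data global instead of the file being re-read per task:

import multiprocessing
import pandas as pd

PICKLE_DATA = 'D:\\Application Data\\Dev\\20150925091456DataSample.pkl'

def init_worker(path):
    # Runs exactly once in each worker process; 'data' then lives in that worker's globals.
    global data
    data = pd.read_pickle(path).values

def evaluate(individual):
    # stand-in for the real evaluation; it can refer to the per-worker global 'data'
    return (float(len(individual) + data.shape[0]),)

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=2,
                                initializer=init_worker,
                                initargs=(PICKLE_DATA,))
    # toolbox.register("map", pool.map) would then use this pool exactly as before.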
I have multiple 3 GB tab delimited files. There are 20 million rows in each file. All the rows have to be independently processed, no relation between any two rows. My question is, what will be faster?
Reading line-by-line?
with open() as infile:
    for line in infile:
Reading the file into memory in chunks and processing it, say 250 MB at a time?
The processing is not very complicated, I am just grabbing value in column1 to List1, column2 to List2 etc. Might need to add some column values together.
I am using python 2.7 on a linux box that has 30GB of memory. ASCII Text.
Any way to speed things up in parallel? Right now I am using the former method and the process is very slow. Is using any CSVReader module going to help?
I don't have to do it in python, any other language or database use ideas are welcome.
It sounds like your code is I/O bound. This means that multiprocessing isn't going to help—if you spend 90% of your time reading from disk, having an extra 7 processes waiting on the next read isn't going to help anything.
And, while using a CSV reading module (whether the stdlib's csv or something like NumPy or Pandas) may be a good idea for simplicity, it's unlikely to make much difference in performance.
Still, it's worth checking that you really are I/O bound, instead of just guessing. Run your program and see whether your CPU usage is close to 0% or close to 100% of a core. Do what Amadan suggested in a comment, and run your program with just pass for the processing and see whether that cuts off 5% of the time or 70%. You may even want to try comparing with a loop over os.open and os.read(1024*1024) or something and see if that's any faster.
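As a rough illustration (a sketch, with path standing in for your file), a raw-read throughput check with no line handling at all might look like this:

import os
import time

fd = os.open(path, os.O_RDONLY)
t0 = time.time()
total = 0
while True:
    buf = os.read(fd, 1024 * 1024)     # 1 MiB raw reads, no decoding, no splitting
    if not buf:
        break
    total += len(buf)
os.close(fd)
print('%.1f MB/s' % (total / (1024.0 * 1024.0) / (time.time() - t0)))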
Since you're using Python 2.x, Python is relying on the C stdio library to guess how much to buffer at a time, so it might be worth forcing it to buffer more. The simplest way to do that is to use readlines(bufsize) for some large bufsize. (You can try different numbers and measure them to see where the peak is. In my experience, usually anything from 64K-8MB is about the same, but depending on your system that may be different—especially if you're, e.g., reading off a network filesystem with great throughput but horrible latency that swamps the throughput-vs.-latency of the actual physical drive and the caching the OS does.)
So, for example:
bufsize = 65536
with open(path) as infile:
    while True:
        lines = infile.readlines(bufsize)
        if not lines:
            break
        for line in lines:
            process(line)
Meanwhile, assuming you're on a 64-bit system, you may want to try using mmap instead of reading the file in the first place. This certainly isn't guaranteed to be better, but it may be better, depending on your system. For example:
import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
A Python mmap is sort of a weird object—it acts like a str and like a file at the same time, so you can, e.g., manually iterate scanning for newlines, or you can call readline on it as if it were a file. Both of those will take more processing from Python than iterating the file as lines or doing batch readlines (because a loop that would be in C is now in pure Python… although maybe you can get around that with re, or with a simple Cython extension?)… but the I/O advantage of the OS knowing what you're doing with the mapping may swamp the CPU disadvantage.
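For example, here is one way (a sketch) to walk the mapping line by line without building any intermediate list; on Python 2 the slices come back as str, so process() sees the same thing it would get from the file:

import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    start = 0
    while True:
        nl = m.find('\n', start)        # scan for the next newline in the mapping
        if nl == -1:
            if start < len(m):
                process(m[start:])      # trailing line with no final newline
            break
        process(m[start:nl + 1])
        start = nl + 1
    m.close()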
Unfortunately, Python doesn't expose the madvise call that you'd use to tweak things in an attempt to optimize this in C (e.g., explicitly setting MADV_SEQUENTIAL instead of making the kernel guess, or forcing transparent huge pages)—but you can actually ctypes the function out of libc.
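One aside for readers on newer Pythons: since 3.8, mmap objects expose madvise directly on Unix, so the hint can be given without ctypes (a sketch, not applicable to the 2.x setup discussed above):

import mmap

with open(path, 'rb') as infile:
    m = mmap.mmap(infile.fileno(), 0, access=mmap.ACCESS_READ)
    # Only on Python 3.8+ and platforms that define MADV_SEQUENTIAL; a harmless hint
    # that we intend to read the mapping front to back.
    if hasattr(m, 'madvise') and hasattr(mmap, 'MADV_SEQUENTIAL'):
        m.madvise(mmap.MADV_SEQUENTIAL)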
I know this question is old, but I wanted to do a similar thing, so I created a simple framework which helps you read and process a large file in parallel. I'm leaving what I tried as an answer.
This is the code; I give an example at the end.
import os
import time
import gc
import psutil
import multiprocessing as mp


def chunkify_file(fname, size=1024*1024*1000, skiplines=-1):
    """
    Divide a large text file into chunks, each of size ~= size, so that the chunks are line aligned.

    Params :
        fname : path to the file to be chunked
        size : size of each chunk is ~> this
        skiplines : number of lines at the beginning to skip, -1 means don't skip any lines
    Returns :
        start and end positions of the chunks in bytes
    """
    chunks = []
    fileEnd = os.path.getsize(fname)
    with open(fname, "rb") as f:
        if skiplines > 0:
            for i in range(skiplines):
                f.readline()

        chunkEnd = f.tell()
        count = 0
        while True:
            chunkStart = chunkEnd
            f.seek(f.tell() + size, os.SEEK_SET)
            f.readline()  # make this chunk line aligned
            chunkEnd = f.tell()
            chunks.append((chunkStart, chunkEnd - chunkStart, fname))
            count += 1

            if chunkEnd > fileEnd:
                break
    return chunks


def parallel_apply_line_by_line_chunk(chunk_data):
    """
    Apply a function to each line in a chunk.

    Params :
        chunk_data : the data for this chunk
    Returns :
        list of the non-None results for this chunk
    """
    chunk_start, chunk_size, file_path, func_apply = chunk_data[:4]
    func_args = chunk_data[4:]

    t1 = time.time()
    chunk_res = []
    with open(file_path, "rb") as f:
        f.seek(chunk_start)
        cont = f.read(chunk_size).decode(encoding='utf-8')
        lines = cont.splitlines()

        for i, line in enumerate(lines):
            ret = func_apply(line, *func_args)
            if ret is not None:
                chunk_res.append(ret)
    return chunk_res


def parallel_apply_line_by_line(input_file_path, chunk_size_factor, num_procs, skiplines, func_apply, func_args, fout=None):
    """
    Apply a supplied function line by line in parallel.

    Params :
        input_file_path : path to the input file
        chunk_size_factor : size of 1 chunk in MB
        num_procs : number of parallel processes to spawn, max used is number of available cores - 1
        skiplines : number of top lines to skip while processing
        func_apply : a function which expects a line and outputs None for lines we don't want processed
        func_args : arguments to function func_apply
        fout : do we want to output the processed lines to a file
    Returns :
        list of the non-None results obtained by processing each line
    """
    num_parallel = min(num_procs, psutil.cpu_count()) - 1

    jobs = chunkify_file(input_file_path, 1024 * 1024 * chunk_size_factor, skiplines)
    jobs = [list(x) + [func_apply] + func_args for x in jobs]

    print("Starting the parallel pool for {} jobs ".format(len(jobs)))

    lines_counter = 0

    # maxtasksperchild - if not supplied, something weird happens and memory blows up
    # as the processes keep lingering
    pool = mp.Pool(num_parallel, maxtasksperchild=1000)

    outputs = []
    for i in range(0, len(jobs), num_parallel):
        print("Chunk start = ", i)
        t1 = time.time()
        chunk_outputs = pool.map(parallel_apply_line_by_line_chunk, jobs[i: i + num_parallel])

        for j, subl in enumerate(chunk_outputs):
            for x in subl:
                if fout is not None:
                    print(x, file=fout)
                else:
                    outputs.append(x)
                lines_counter += 1
        del chunk_outputs
        gc.collect()
        print("All done in time ", time.time() - t1)

    print("Total lines we have = {}".format(lines_counter))

    pool.close()
    pool.terminate()
    return outputs
Say, for example, I have a file in which I want to count the number of words in each line; then the processing of each line would look like:
def count_words_line(line):
    return len(line.strip().split())
and then call the function like:
parallel_apply_line_by_line(input_file_path, 100, 8, 0, count_words_line, [], fout=None)
Using this, I get a speed-up of roughly 8x compared to vanilla line-by-line reading on a sample file of ~20 GB, where I do some moderately complicated processing on each line.
Sample records in the data file (a SAM file):
M01383 0 chr4 66439384 255 31M * 0 0 AAGAGGA GFAFHGD MD:Z:31 NM:i:0
M01382 0 chr1 241995435 255 31M * 0 0 ATCCAAG AFHTTAG MD:Z:31 NM:i:0
......
The data files are line-based.
The size of the data files varies from 1 GB to 5 GB.
I need to go through the records in the data file line by line, get a particular value (e.g. the 4th value, 66439384) from each line, and pass this value to another function for processing. Then some result counters will be updated.
The basic workflow is like this:
# global variables; the counters are updated in search_function according to the value passed
counter_a = 0
counter_b = 0
counter_c = 0

def search_function(value):
    # some condition checking:
    #     update counter_a, counter_b, or counter_c
    pass

with open(textfile) as f:
    for line in f:
        value = line.split()[3]
        search_function(value)  # this function takes quite a long time to process
With single-process code and a data file of about 1.5 GB, it took about 20 hours to run through all the records in one data file. I need much faster code because there are more than 30 data files of this kind.
I was thinking of processing the data file in N chunks in parallel, with each chunk performing the above workflow and updating the global variables (counter_a, counter_b, counter_c) simultaneously. But I don't know how to achieve this in code, or whether this will work.
I have access to a server machine with 24 processors and around 40 GB of RAM.
Could anyone help with this? Thanks very much.
The simplest approach would probably be to do all 30 files at once with your existing code -- it would still take all day, but you'd have all the files done at once. (i.e., "9 babies in 9 months" is easy, "1 baby in 1 month" is hard.)
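A minimal sketch of that first idea, assuming your existing single-file loop is wrapped in a hypothetical process_one_file(path) that returns its three counters:

import glob
import multiprocessing

def process_one_file(path):
    # hypothetical wrapper around your existing single-file workflow;
    # it should return (counter_a, counter_b, counter_c) for that file
    pass

if __name__ == '__main__':
    pool = multiprocessing.Pool(24)        # one worker per processor on your server
    files = glob.glob('*.sam')             # or however the 30 files are listed
    for path, counters in zip(files, pool.map(process_one_file, files)):
        print('{} {}'.format(path, counters))
    pool.close()
    pool.join()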
If you really want to get a single file done faster, it will depend on how your counters actually update. If almost all the work is just in analysing value you can offload that using the multiprocessing module:
import time
import multiprocessing

def slowfunc(value):
    time.sleep(0.01)
    return value**2 + 0.3*value + 1

counter_a = counter_b = counter_c = 0
def add_to_counter(res):
    global counter_a, counter_b, counter_c
    counter_a += res
    counter_b -= (res - 10)**2
    counter_c += (int(res) % 2)

pool = multiprocessing.Pool(50)
results = []

for value in range(100000):
    r = pool.apply_async(slowfunc, [value])
    results.append(r)

    # don't let the queue grow too long
    if len(results) == 1000:
        results[0].wait()

    while results and results[0].ready():
        r = results.pop(0)
        add_to_counter(r.get())

for r in results:
    r.wait()
    add_to_counter(r.get())

print counter_a, counter_b, counter_c
That will allow 50 slowfuncs to run in parallel, so instead of taking 1000 s (= 100k * 0.01 s), it takes about 20 s (= 100k / 50 * 0.01 s) to complete. If you can restructure your function into "slowfunc" and "add_to_counter" like the above, you should be able to get a factor-of-24 speedup.
Read one file at a time, use all CPUs to run search_function():
#!/usr/bin/env python
from multiprocessing import Array, Pool

def init(counters_):  # called for each child process
    global counters
    counters = counters_

def search_function(value):  # assume it is a CPU-intensive task
    # some condition checking:
    #     update counter 'a', 'b', or 'c'; result and error come from the real processing
    counters[0] += 1  # counter 'a'
    counters[1] += 1  # counter 'b'
    return value, result, error

if __name__ == '__main__':
    counters = Array('i', [0]*3)
    pool = Pool(initializer=init, initargs=[counters])
    values = (line.split()[3] for line in textfile)
    for value, result, error in pool.imap_unordered(search_function, values,
                                                    chunksize=1000):
        if error is not None:
            print('value: {value}, error: {error}'.format(**vars()))
    pool.close()
    pool.join()
    print(list(counters))
Make sure (for example, by writing wrappers) that exceptions do not escape next(values) or search_function().
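For example, a wrapper along these lines (a sketch; the tuple shape matches the return value in the code above) keeps a worker exception from killing the task and reports it through the error field instead:

def safe_search_function(value):
    # hypothetical wrapper: any exception is captured and handed back through the
    # 'error' slot instead of propagating out of the worker
    try:
        value, result, error = search_function(value)
        return value, result, error
    except Exception as exc:
        return value, None, exc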
This solution works on a set of files.
For each file, it divides it into a specified number of line-aligned chunks, solves each chunk in parallel, then combines the results.
It streams each chunk from disk; this is somewhat slower, but does not consume nearly so much memory. We depend on disk cache and buffered reads to prevent head thrashing.
Usage is like
python script.py -n 16 sam1.txt sam2.txt sam3.txt
and script.py is
import argparse
from io import SEEK_END
import multiprocessing as mp

#
# Worker process
#
def summarize(fname, start, stop):
    """
    Process file[start:stop]

    start and stop both point to the first char of a line (or EOF)
    """
    a = 0
    b = 0
    c = 0

    with open(fname, newline='') as inf:
        # jump to the start position
        pos = start
        inf.seek(pos)

        for line in inf:
            value = int(line.split()[3])

            # *** START EDIT HERE ***
            #
            # update a, b, c based on value
            #
            # *** END EDIT HERE ***

            pos += len(line)
            if pos >= stop:
                break

    return a, b, c

def main(num_workers, sam_files):
    print("{} workers".format(num_workers))
    pool = mp.Pool(processes=num_workers)

    # for each input file
    for fname in sam_files:
        print("Dividing {}".format(fname))

        # decide how to divide up the file
        with open(fname) as inf:
            # get the file length
            inf.seek(0, SEEK_END)
            f_len = inf.tell()

            # find break-points
            starts = [0]
            for n in range(1, num_workers):
                # jump to the approximate break-point
                inf.seek(n * f_len // num_workers)
                # find the start of the next full line
                inf.readline()
                # store the offset
                starts.append(inf.tell())

        # do it!
        stops = starts[1:] + [f_len]
        start_stops = zip(starts, stops)

        print("Solving {}".format(fname))
        # apply_async so the chunks actually run in parallel
        # (pool.apply would block on each call and run them one at a time)
        async_results = [pool.apply_async(summarize, args=(fname, start, stop))
                         for start, stop in start_stops]
        results = [r.get() for r in async_results]

        # collect results
        results = [sum(col) for col in zip(*results)]
        print(results)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Parallel text processor')
    parser.add_argument('--num_workers', '-n', default=8, type=int)
    parser.add_argument('sam_files', nargs='+')
    args = parser.parse_args()
    main(args.num_workers, args.sam_files)
What you don't want to do is hand files to individual CPUs. If that's the case, the file open/reads will likely cause the heads to bounce randomly all over the disk, because the files are likely to be all over the disk.
Instead, break each file into chunks and process the chunks.
Open the file with one CPU. Read the whole thing into an array Text. You want to do this as one massive read to prevent the heads from thrashing around the disk, under the assumption that your file(s) are placed on the disk in relatively large sequential chunks.
Divide its size in bytes by N, giving a (global) value K, the approximate number of bytes each CPU should process. Fork N threads, and hand each thread i its index i, and a copied handle for each file.
Each thread i starts a thread-local scan pointer p into Text at offset i*K. It scans the text, incrementing p and ignoring the text until a newline is found. At this point, it starts processing lines (incrementing p as it scans the lines). It stops after processing a line, when its index into the Text file is greater than (i+1)*K.
If the amount of work per line is about equal, your N cores will all finish about the same time.
(If you have more than one file, you can then start the next one).
If you know that the file sizes are smaller than memory, you might arrange the file reads to be pipelined, e.g., while the current file is being processed, a file-read thread is reading the next file.
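A rough sketch of that scheme follows, with one caveat: the answer says threads, but in CPython the GIL would serialize CPU-bound line processing, so this version uses processes and relies on fork's copy-on-write on Unix to share the in-memory text. process_line is a placeholder for the per-line work, and any counters would have to be returned to the parent rather than updated in place.

import multiprocessing as mp

text = ""                                  # filled in by run() before the workers fork

def worker(i, k, n):
    # find the first full line at or after offset i*k (worker 0 starts at 0)
    if i == 0:
        p = 0
    else:
        nl = text.find('\n', i * k)
        if nl == -1:
            return                         # no line starts inside this slice
        p = nl + 1
    end = len(text) if i == n - 1 else (i + 1) * k
    while p < len(text) and p <= end:      # stop once we pass our boundary
        nl = text.find('\n', p)
        if nl == -1:
            nl = len(text)
        process_line(text[p:nl])           # placeholder per-line work
        p = nl + 1

def run(path, n):
    global text
    with open(path) as f:
        text = f.read()                    # one massive sequential read
    k = len(text) // n                     # ~bytes per worker
    procs = [mp.Process(target=worker, args=(i, k, n)) for i in range(n)]
    for pr in procs:
        pr.start()
    for pr in procs:
        pr.join()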
I'm trying to perform some costly scientific calculation with Python. I have to read a bunch of data stored in CSV files and process them. Since each sample takes a long time to process and I have some 8 processors to use, I was trying to use the Pool method from multiprocessing.
This is how I structured the multiprocessing call:
pool = Pool()
vector_components = []
for sample in range(samples):
    vector_field_x_i = vector_field_samples_x[sample]
    vector_field_y_i = vector_field_samples_y[sample]

    vector_component = pool.apply_async(vector_field_decomposer, args=(x_dim, y_dim, x_steps, y_steps,
                                                                       vector_field_x_i, vector_field_y_i))
    vector_components.append(vector_component)

pool.close()
pool.join()
vector_components = map(lambda k: k.get(), vector_components)

for vector_component in vector_components:
    CsvH.write_vector_field(vector_component, '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')
I was running a data set of 500 samples of size 100 (x_dim) by 100 (y_dim), and until then everything worked fine.
Then I received a data set of 500 samples of 400 x 400.
When running it, I get an error when calling get.
I also tried to run a single sample of 400 x 400 and got the same error.
Traceback (most recent call last):
  File "__init__.py", line 33, in <module>
    VfD.samples_vector_field_decomposer(samples, x_dim, y_dim, x_steps, y_steps, vector_field_samples_x, vector_field_samples_y)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in samples_vector_field_decomposer
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/VectorFieldDecomposer/Sources/Controllers/VectorFieldDecomposerController.py", line 43, in <lambda>
    vector_components = map(lambda k: k.get(), vector_components)
  File "/export/home/pceccon/.pyenv/versions/2.7.5/lib/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
MemoryError
What should I do?
Thank you in advance.
Right now you're keeping several lists in memory - vector_field_x, vector_field_y, vector_components, and then a separate copy of vector_components during the map call (which is when you actually run out of memory). You can avoid needing either copy of the vector_components list by using pool.imap, instead of pool.apply_async along with a manually created list. imap returns an iterator instead of a complete list, so you never have all the results in memory.
Normally, pool.map breaks the iterable passed to it into chunks, and sends those chunks to the child processes, rather than sending one element at a time. This helps improve performance. Because imap uses an iterator instead of a list, it doesn't know the complete size of the iterable you're passing to it. Without knowing the size of the iterable, it doesn't know how big to make each chunk, so it defaults to a chunksize of 1, which will work, but may not perform all that well. To avoid this, you can provide it with a good chunksize argument, since you know the iterable is samples elements long. It may not make much difference for your 500 element list, but it's worth experimenting with.
Here's some sample code that demonstrates all this:
import multiprocessing
from functools import partial

def vector_field_decomposer(x_dim, y_dim, x_steps, y_steps, vector_fields):
    vector_field_x_i = vector_fields[0]
    vector_field_y_i = vector_fields[1]
    # Do whatever is normally done here.

if __name__ == "__main__":
    num_workers = multiprocessing.cpu_count()
    pool = multiprocessing.Pool(num_workers)

    # Calculate a good chunksize (based on the implementation of pool.map)
    chunksize, extra = divmod(samples, 4 * num_workers)
    if extra:
        chunksize += 1

    # Use partial so many arguments can be passed to vector_field_decomposer
    func = partial(vector_field_decomposer, x_dim, y_dim, x_steps, y_steps)

    # We use a generator expression as an iterable, so we don't create a full list.
    results = pool.imap(func,
                        ((vector_field_samples_x[s], vector_field_samples_y[s]) for s in xrange(samples)),
                        chunksize=chunksize)

    for sample, vector in enumerate(results):
        CsvH.write_vector_field(vector,
                                '../CSV/RotationalFree/rotational_free_x_'+str(sample)+'.csv')
    pool.close()
    pool.join()
This should allow you to avoid the MemoryError issues, but if not, you could try running imap over smaller chunks of your total sample, and just do multiple passes. I don't think you'll have any issues though, because you're not building any additional lists, other than the vector_field_* lists you start with.
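If you do need multiple passes, a sketch of that fallback (PASS_SIZE is a made-up knob to tune to whatever fits in memory; pool, func, samples, the vector_field_samples_* lists, and CsvH are as above) could look like:

PASS_SIZE = 100   # hypothetical; tune to what comfortably fits in memory
for lo in xrange(0, samples, PASS_SIZE):
    hi = min(lo + PASS_SIZE, samples)
    results = pool.imap(func,
                        ((vector_field_samples_x[s], vector_field_samples_y[s])
                         for s in xrange(lo, hi)))
    # write each pass's results out before starting the next one
    for sample, vector in zip(xrange(lo, hi), results):
        CsvH.write_vector_field(vector,
                                '../CSV/RotationalFree/rotational_free_x_' + str(sample) + '.csv')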