I am using multiprocessing.Pool() to parallelize some heavy computations.
The target function returns a lot of data (a huge list). I'm running out of RAM.
Without multiprocessing, I'd just change the target function into a generator, by yielding the resulting elements one after another, as they are computed.
I understand multiprocessing does not support generators -- it waits for the entire output and returns it at once, right? No yielding. Is there a way to make the Pool workers yield data as soon as they become available, without constructing the entire result array in RAM?
Simple example:
def target_fnc(arg):
result = []
for i in xrange(1000000):
result.append('dvsdbdfbngd') # <== would like to just use yield!
return result
def process_args(some_args):
pool = Pool(16)
for result in pool.imap_unordered(target_fnc, some_args):
for element in result:
yield element
This is Python 2.7.
This sounds like an ideal use case for a Queue: http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
Simply feed your results into the queue from the pooled workers and ingest them in the master.
Note that you still may run into memory pressure issues unless you drain the queue nearly as fast as the workers are populating it. You could limit the queue size (the maximum number of objects that will fit in the queue) in which case the pooled workers would block on the queue.put statements until space is available in the queue. This would put a ceiling on memory usage. But if you're doing this, it may be time to reconsider whether you require pooling at all and/or if it might make sense to use fewer workers.
From your description, it sounds like you're not so much interested in processing the data as they come in, as in avoiding passing a million-element list back.
There's a simpler way of doing that: Just put the data into a file. For example:
def target_fnc(arg):
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd) as f:
for i in xrange(1000000):
f.write('dvsdbdfbngd\n')
return path
def process_args(some_args):
pool = Pool(16)
for result in pool.imap_unordered(target_fnc, some_args):
with open(result) as f:
for element in f:
yield element
Obviously if your results can contain newlines, or aren't strings, etc., you'll want to use a csv file, a numpy, etc. instead of a simple text file, but the idea is the same.
That being said, even if this is simpler, there are usually benefits to processing the data a chunk at a time, so breaking up your tasks or using a Queue (as the other two answers suggest) may be better, if the downsides (respectively, needing a way to break the tasks up, or having to be able to consume the data as fast as they're produced) are not deal-breakers.
If your tasks can return data in chunks… can they be broken up into smaller tasks, each of which returns a single chunk? Obviously, this isn't always possible. When it isn't, you have to use some other mechanism (like a Queue, as Loren Abrams suggests). But when it is, it's probably a better solution for other reasons, as well as solving this problem.
With your example, this is certainly doable. For example:
def target_fnc(arg, low, high):
result = []
for i in xrange(low, high):
result.append('dvsdbdfbngd') # <== would like to just use yield!
return result
def process_args(some_args):
pool = Pool(16)
pool_args = []
for low in in range(0, 1000000, 10000):
pool_args.extend(args + [low, low+10000] for args in some_args)
for result in pool.imap_unordered(target_fnc, pool_args):
for element in result:
yield element
(You could of course replace the loop with a nested comprehension, or a zip and flatten, if you prefer.)
So, if some_args is [1, 2, 3], you'll get 300 tasks—[[1, 0, 10000], [2, 0, 10000], [3, 0, 10000], [1, 10000, 20000], …], each of which only returns 10000 elements instead of 1000000.
Related
I'm having a bit trouble in Python multiprocessing.Pool. I have two list of numpy array a and b, in which
a.shape=(10000,3)
and
b.shape=(1000000000,3)
Then I have a function which does some computation like
def role(array, point):
sub = array-point
return (1/(np.sqrt(np.min(np.sum(sub*sub, axis=-1)))+0.001)**2)
Next, I need to compute
[role(a, point) for point in b]
To speed it up, I try to use
cpu_num = 4
m = multiprocessing.Pool(cpu_num)
cost_list = m.starmap(role, [(a, point) for point in b])
m.close
The whole process takes around 70s, but if I set cpu_num = 1, the processing time decrease to 60s... My laptop has 6 core, for reference.
Here I have two questions:
is there sth I did wrong with multiprocessing.Pool? why the processing time increased if I set cpu_num = 4?
for task like this (each for loop is a very tiny process), should I use multiprocessing to speed up? I feel like each time, python fill in Pool takes longer than process function role...
Any suggestions is really welcome.
Multiprocessing comes with some overhead (to create new processes), which is why it's not a very good choice when you have lots of tiny tasks, where the overhead of process creation might outweigh the benefit of parallelizing.
Have you considered vectorizing your problem?
In particular, if you broadcast the variable b you get there:
sub = a - b[::,np.newaxis] # broadcast b
1./(np.sqrt(np.min(np.sum(sub**2, axis=2), axis=-1))+0.001)**2
I believe you could then still reduce the complexity of the last expression a bit, as you're creating the square of a square root, which seems redundant (note that I'm assuming the 0.001 constant value is merely there to avoid some non-sensible operation like division by zero).
If the tasks are too tiny, then tohe multiprocessing overhead will be your bottleneck and you will win nothing.
If the amount of data per task that you have to pass to a worker or that the worker has to return then you will also not win a lot (or even win nothing)
If you have 10.000 tiny tasks, then I recommend to create a list of meta tasks.
Each meta task would consist of executing for example 20 tiny tasks.
meta_tasks = []
for idx in range(0, len(tiny_tasks), 20):
meta_tasks.append(tiny_tasks[idx:idx+20])
Then pass the meta tasks to your worker pool.
Is it possible in python (maybe using dask, maybe using multiprocessing) to 'emplace' generators on cores, and then, in parallel, step through the generators and process the results?
It needs to be generators in particular (or objects with __iter__); lists of all the yielded elements the generators yield won't fit into memory.
In particular:
With pandas, I can call read_csv(...iterator=True), which gives me an iterator (TextFileReader) - I can for in it or explicitly call next multiple times. The entire csv never gets read into memory. Nice.
Every time I read a next chunk from the iterator, I also perform some expensive computation on it.
But now I have 2 such files. I would like to create 2 such generators, and 'emplace' 1 on one core and 1 on another, such that I can:
result = expensive_process(next(iterator))
on each core, in parallel, and then combine and return the result. Repeat this step until one generator or both is out of yield.
It looks like the TextFileReader is not pickleable, nor is a generator. I can't find out how to do this in dask or multiprocessing. Is there a pattern for this?
Dask's read_csv is designed to load data from multiple files in chunks, with a chunk-size that you can specify. When you operate on the resultant dataframe, you will be working chunk-wise, which is exactly the point of using Dask in the first place. There should be no need to use your iterator method.
The dask dataframe method you will want to use, most likely, is map_partitions().
If you really wanted to use the iterator idea, you should look into dask.delayed, which is able to parallelise arbitrary python functions, by sending each invocation of the function (with a different file-name for each) to your workers.
So luckily I think this problem maps nicely onto python's multiprocessing .Process and .Queue.
def data_generator(whatever):
for v in something(whatever):
yield v
def generator_constructor(whatever):
def generator(outputQueue):
for d in data_generator(whatever):
outputQueue.put(d)
outputQueue.put(None) # sentinel
return generator
def procSumGenerator():
outputQs = [Queue(size) for _ in range(NumCores)]
procs = [Process(target=generator_constructor(whatever),
args=(outputQs[i],))
for i in range(NumCores)]
for proc in procs: proc.start()
# until any output queue returns a None, collect
# from all and yield
done = False
while not done:
results = [oq.get() for oq in outputQs]
done = any(res is None for res in results)
if not done:
yield some_combination_of(results)
for proc in procs: terminate()
for v in procSumGenerator():
print(v)
Maybe this can be done better with Dask? I find that my solution fairly quickly saturates the network for large sizes of generated data - I'm manipulating csvs with pandas and returning large numpy arrays.
https://github.com/colinator/doodle_generator/blob/master/data_generator_uniform_final.ipynb
I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly selects a new population based on their scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Idealistically, I think (but open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object): # Small size: several hundred bytes per instance
def get_score():
# Returns a float
def set_score(i):
# Stores a a float
class Environment(object): # Large size: 20-50Mb per instance, but only one instance
def calculate_scores(chromosome):
# Slow calculation using attributes from chromosome and instance data
chromosome.set_score(x)
return chromosome
class Evolver(object):
def draw_chromosome_from_pool(self, max_score):
while True:
individual = np.random.choice(self.chromosome_pool)
selection_chance = np.random.uniform()
if selection_chance < individual.get_score() / max_score:
return individual
def run_evolution()
self.dask_client = Client()
self.chromosome_pool = list()
for i in range(10000):
self.chromosome_pool.append( Chromosome() )
world_data = LoadWorldData() # Returns a pandas Dataframe
self.world = Environment(world_data)
iterations = 1000
for i in range(iterations):
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
for future in as_completed(futures):
c = future.result()
highest_score = max(highest_score, c.get_score())
new_pool = set()
while len(new_pool)<self.pool_size:
mother = self.draw_chromosome_from_pool(highest_score)
# do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order that you submitted them to map, but this is not necessary for your code. You could have used wait to process when all the work was done, or simply iterate over the future.result()s (which will wait for each task to be done), and then you will retain the ordering in the chromosome_pool.
I am trying to create workers for a task that involves reading a lot of files and analyzing them.
I want something like this:
list_of_unique_keys_from_csv_file = [] # About 200mb array (10m rows)
# a list of uniquekeys for comparing inside worker processes to a set of flatfiles
I need more threads as it is going very slow, doing the comparison with one process (10 minutes per file).
I have another set of flat-files that I compare the CSV file to, to see if unique keys exist. This seems like a map reduce type of problem.
main.py:
def worker_process(directory_glob_of_flat_files, list_of_unique_keys_from_csv_file):
# Do some parallel comparisons "if not in " type stuff.
# generate an array of
# lines of text like : "this item_x was not detected in CSV list (from current_flatfile)"
if current_item not in list_of_unique_keys_from_csv_file:
all_lines_this_worker_generated.append(sometext + current_item)
return all_lines_this_worker_generated
def main():
all_results = []
pool = Pool(processes=6)
partitioned_flat_files = [] # divide files from glob by 6
results = pool.starmap(worker_process, partitioned_flat_files, {{{{i wanna pass in my read-only parameter}}}})
pool.close()
pool.join()
all_results.extend(results )
resulting_file.write(all_results)
I am using both a linux and a windows environment, so perhaps I need something cross-platform compatible (the whole fork() discussion).
Main Question: Do I need some sort of Pipe or Queue, I can't seem to find good examples of how to transfer around a big read-only string array, a copy for each worker process?
You can just split your read-only parameters and then pass them in. The multiprocessing module is cross-platform compatible, so don't worry about it.
Actually, every process, even sub-process, has its own resources, that means no matter how you pass the parameters to it, it will keep a copy of the original one instead of sharing it. In this simple case, when you pass the parameters from main process into sub-processes, Pool automatically makes a copy of your variables. Because sub-processes just have the copies of original one, so the modification cannot be shared. It doesn't matter in this case as your variables are read-only.
But be careful about your code, you need to wrap the parameters you need into an
iterable collection, for example:
def add(a, b):
return a + b
pool = Pool()
results = pool.starmap(add, [(1, 2), (3, 4)])
print(results)
# [3, 7]
I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this so there should be plenty of room. Btw, I've already experimented with queues (also worked but ran into the same MemoryError, plus looked through a bunch of very helpful SO contributions, but not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
## loads a few million company names into id: name dict
return hayCompanies
def getNeedles(*args):
## loads subset of 30K companies into id: name dict (for allocation to workers)
return needleCompanies
def findNeedle(needle, haystack):
""" Identify best match and return results with score """
results = {}
for hayID, hayCompany in haystack.iteritems():
if not isnull(haystack[hayID]):
results[hayID] = levi.setratio(needle.split(' '),
hayCompany.split(' '))
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
""" Execute findNeedle and process results for poolWorker batch"""
batch, first = args
last = first + batch
hayCompanies = getHayStack()
needleCompanies = getTargets(first, last)
needles = defaultdict(list)
current = first
for needleID, needleCompany in needleCompanies.iteritems():
current += 1
needles[targetID] = findNeedle(needleCompany, hayCompanies)
## Then store results
if __name__ == '__main__':
pool = Pool(processes = numProcesses)
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
pool.map_async(runMatch,
itertools.izip(itertools.repeat(targetsPerBatch),
xrange(0,
totalTargets,
targetsPerBatch))).get(99999999)
pool.close()
pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?
Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs work up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool, sends each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
results = pool.map_async(runMatch, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
with Pool(processes=numProcesses) as pool:
for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
chunkSize=NUMBER_TWEAKED_IN_TESTING):
# Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
totalTargets = len(getTargets('all'))
targetsPerBatch = totalTargets / numProcesses
processes = [Process(target=runMatch,
args=(targetsPerBatch,
xrange(0,
totalTargets,
targetsPerBatch)))
for _ in range(numProcesses)]
for p in processes:
p.start()
for p in processes:
p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
global _haystack
_haystack = getHayStack()
def runMatch(args):
global _haystack
# ...
hayCompanies = _haystack
# ...
if __name__ == '__main__':
pool = Pool(processes=numProcesses, initializer=initPool)
# ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
On the other hand, if your total VM usage is still low (note that getrusage doesn't directly have any way to see your total VM size—ru_maxrss is often a useful approximation, especially if ru_nswap is 0) at time the first task runs, the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.