Python multiprocess with pool workers - memory use optimization - python

I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this so there should be plenty of room. Btw, I've already experimented with queues (that also worked but ran into the same MemoryError), and I've looked through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
    ## loads a few million company names into id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads subset of 30K companies into id: name dict (for allocation to workers)
    return needleCompanies
def findNeedle(needle, haystack):
    """ Identify best match and return results with score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(haystack[hayID]):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
    """ Execute findNeedle and process results for poolWorker batch """
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getTargets(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results
if __name__ == '__main__':
    pool = Pool(processes=numProcesses)
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0,
                                         totalTargets,
                                         targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?

Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool: send each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch, needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            # Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    processes = [Process(target=runMatch,
                         args=((targetsPerBatch, first),))
                 for first in xrange(0,
                                     totalTargets,
                                     targetsPerBatch)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
    global _haystack
    _haystack = getHayStack()

def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...

if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
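To make that last idea concrete, here is a minimal sketch of the build-once / open-read-only approach combined with the getrusage logging. The helper name buildHaystackDB and the haystack.db path are made up for illustration, and shelve is just one convenient disk-backed mapping; the point is only that the expensive build happens in a process that exits and gives its memory back to the system.
import shelve
import resource
from multiprocessing import Pool, Process

HAYSTACK_DB = 'haystack.db'   # hypothetical path

def buildHaystackDB(path):
    # Runs in a throwaway child: whatever memory the parsing and
    # dict building need is returned to the OS when this process exits.
    db = shelve.open(path, flag='n')
    for hayID, hayCompany in getHayStack().iteritems():
        db[str(hayID)] = hayCompany
    db.close()

def initPool(path):
    global _haystack
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('worker RSS before opening haystack: %d' % rss)   # kB on Linux
    _haystack = shelve.open(path, flag='r')   # read-only, disk-backed, not all in RAM
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('worker RSS after opening haystack: %d' % rss)

if __name__ == '__main__':
    builder = Process(target=buildHaystackDB, args=(HAYSTACK_DB,))
    builder.start()
    builder.join()
    pool = Pool(processes=numProcesses, initializer=initPool,
                initargs=(HAYSTACK_DB,))
    # ... map the needles as before; findNeedle can iterate the shelf
    # much like a dict, only slower and without holding it all in memory.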
On the other hand, if your total VM usage is still low at the time the first task runs (note that getrusage doesn't directly have any way to see your total VM size—ru_maxrss is often a useful approximation, especially if ru_nswap is 0), the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.
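For that last test, something as small as this (reusing the question's own functions; the (0, 100) slice passed to getNeedles is an arbitrary assumption, since the question only shows getNeedles(*args)) is usually enough, and it keeps any memory-profiling output easy to read:
# Debugging sketch: same work, no multiprocessing, one process to watch.
hayCompanies = getHayStack()
for needleID, needleCompany in getNeedles(0, 100).iteritems():
    print(findNeedle(needleCompany, hayCompanies))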

Related

Python 3.7 MultiThreading strategy

I have a workload that consists of a very slow query that returns a HUGE amount of data that has to be parsed and calculated, all of that in a loop. Basically, it looks like this:
for x in lastTenYears:
    myData = DownloadData(x)                # takes about ~40-50 [sec]
    parsedData.append(ParseData(myData))    # takes another +30-60 [sec]
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
How can I achieve this parallelism of operations?
Ideally speaking, I would like to have 1 thread always downloading, and N threads doing the parsing. The download part is actually a query against a database, so it's not good to have a bunch of them running in parallel...
Details:
The parsing of the data is heavily CPU bound and consists of raw math calculations and nothing else.
Using Python 3.7.4
1) Use a threadsafe queue, queue.Queue (it is FIFO by default). At the top level define
my_queue = queue.Queue()
parsedData = []
2) On the first thread, kick off the data loading
my_queue.put(DownloadData(x))
On the second thread
if not my_queue.empty():
    myData = my_queue.get()
    parsedData.append(ParseData(myData))
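Putting those two fragments together, a minimal runnable shape could look like the sketch below. DownloadData, ParseData and lastTenYears are the names from the question, and the None sentinel is just one way to signal the end of the stream:
import queue
import threading

def downloader(years, q):
    # producer: keeps the slow queries going, one at a time
    for x in years:
        q.put(DownloadData(x))
    q.put(None)               # sentinel: nothing more to parse

def parser(q, results):
    # consumer: parses whatever has already been downloaded
    while True:
        myData = q.get()
        if myData is None:
            break
        results.append(ParseData(myData))

my_queue = queue.Queue()
parsedData = []
t_down = threading.Thread(target=downloader, args=(lastTenYears, my_queue))
t_parse = threading.Thread(target=parser, args=(my_queue, parsedData))
t_down.start()
t_parse.start()
t_down.join()
t_parse.join()
As the next answer points out, the GIL means the parsing cannot be sped up by adding more parsing threads; this only overlaps the download wait with the parsing.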
If your program is CPU bound, you will have a hard time doing anything else in other threads due to the GIL (global interpreter lock).
Here is a link to an article which might help you to understand the topic: https://opensource.com/article/17/4/grok-gil
Downloading the data in a sub-process is most likely the best approach.
It's hard to say if and how much this will actually help (as I have nothing to test...), but you might try a multiprocessing.Pool. It handles all the dirty work for you and you can customize the number of processes, chunk size, etc.
from multiprocessing import Pool

def worker(x):
    myData = DownloadData(x)
    return ParseData(myData)

if __name__ == "__main__":
    processes = None  # defaults to os.cpu_count()
    chunksize = 1
    with Pool(processes) as pool:
        parsedData = pool.map(worker, lastTenYears, chunksize)
Here for the example I use the map method, but according to your needs you might want to use imap or map_async.
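For example, if you want to start handling results as soon as each year is done rather than waiting for the whole map, an imap_unordered variant of the same loop might look like this (a sketch, reusing the worker and names from the block above):
with Pool(processes) as pool:
    parsedData = []
    for result in pool.imap_unordered(worker, lastTenYears, chunksize):
        # results arrive as each task finishes, not in input order
        parsedData.append(result)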
Q : How can I achieve this parallelism of operations?
The step number one is to realise that the use-case requested above is not [PARALLEL] code-execution, but an un-ordered, resources-use-policy-limited batch execution of a strict sequence of pairs of:
First, a remote [DB-Query] (returning (cit.) a HUGE amount of data)
Next, a local [CPU-process] (of (cit.) the HUGE amount of data just returned)
The latency of the first could be masked (if that were permitted, which it is not, due to the wish not to overload the DB-host); the latency of the second cannot (one can start the next I/O-bound DB-Query, yet only if doing so does not violate the rule of keeping the DB-machine under a mild workload).
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
It is high time to make things clear and sound:
Facts :
A )
The CPU-bound tasks will never run faster in whatever number N of threads in a python-GIL-lock controlled ecosystem (since ever and forever, as Guido van Rossum has expressed), as the GIL-lock enforces a re-[SERIAL]-isation: the more threads "work", the more threads actually wait for acquiring the GIL-lock before they "get" it, each for but a 1 / ( N + 1 )-th fraction of the resulting duration of N * ( 30 - 60 ) [sec], which the GIL-lock policing keeps pure-[SERIAL] again.
B )
The I/O-bound task makes no sense to off-load into a full process-based, concurrent execution, as the full copy of the python process (on Windows also with duplicating the whole python interpreter state, with all data, during the sub-process instantiation) makes no sense, given that there are smarter techniques for I/O-bound processing (where the GIL-lock does not hurt so much).
C )
The whole concept of N-parsing : 1-querying is principally wrong - the maximum achievable goal is to mask the latency of the I/O-process (where that makes sense), yet here each and every query takes those said ~ 40-50 [sec], so no second pack-of-data to parse will ever be present before another ~ 40-50 [sec] have passed, so no second worker will ever get anything to parse anytime before T0 + ~ 80~100 [sec] - so one could dream a wish to have N-(unbound)-workers working (yet have 'em actually waiting for data); that is possible, but awfully anti-productive (the worse for N-(GIL-MUTEX-ed)-"waiting"-agents).

Dask: How to efficiently distribute a genetic search algorithm?

I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly selects a new population based on their scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Idealistically, I think (but open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
    def set_score(self, i):
        # Stores a float

class Environment(object):  # Large size: 20-50 MB per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())
        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)
        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            highest_score = 0.0
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order that you submitted them to map, but this is not necessary for your code. You could have used wait to process when all the work was done, or simply iterate over the future.result()s (which will wait for each task to be done), and then you will retain the ordering in the chromosome_pool.
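Putting both points together, the reworked loop might look roughly like the sketch below. This is only a sketch against the classes from the question, assuming calculate_scores takes the environment as its first argument; client.gather() returns results in the same order as the futures, so position i still corresponds to chromosome_pool[i]:
future_world = self.dask_client.scatter(self.world, broadcast=True)   # once, before the loop

for i in range(iterations):
    futures = self.dask_client.map(Environment.calculate_scores,
                                   [future_world] * len(self.chromosome_pool),
                                   self.chromosome_pool)
    scored = self.dask_client.gather(futures)     # same order as self.chromosome_pool
    highest_score = max(c.get_score() for c in scored)
    # ... build the new pool from `scored` as before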

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple n-gram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000 #100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem; there's a rough sketch of explicit batching after this list.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
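A sketch of the first option (not tested against the original timings): move ngrams() into the worker so each task does real CPU work. Here rand_str is made a module-level global that the forked workers inherit on Unix; on Windows (spawn) you would pass it through a pool initializer instead. The smaller num is just to keep a quick test from exhausting memory.
rand_str = None   # filled in before the pool is created

def worker(n):
    # the heavy ngrams() call now runs in the child process
    return ngrams(rand_str, n)

def foo():
    global rand_str
    num = 100000            # keep this modest for a quick test
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    p = multiprocessing.Pool()
    res = list(p.imap_unordered(worker, xrange(1, 100)))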

How do I use subprocesses to force Python to release memory?

I was reading up on Python Memory Management and would like to reduce the memory footprint of my application. It was suggested that subprocesses would go a long way in mitigating the problem, but I'm having trouble conceptualizing what needs to be done. Could someone please provide a simple example of how to turn this...
def my_function():
    x = range(1000000)
    y = copy.deepcopy(x)
    del x
    return y

#subprocess_witchcraft
def my_function_dispatcher(*args):
    return my_function()
...into a real subprocessed function that doesn't store an extra "free-list"?
Bonus Question:
Does this "free-list" concept apply to python c-extensions as well?
The important thing about the optimization suggestion is to make sure that my_function() is only invoked in a subprocess. The deepcopy and del are irrelevant — once you create five million distinct integers in a process, holding onto all of them at the same time, it's game over. Even if you stop referring to those objects, Python will free them only onto a free list: it keeps around five million empty integer-object-sized fields in a limbo where they await reuse by the next function that wants to create five million integers. This is the free list mentioned in the other answer, and it buys blindingly fast allocation and deallocation of ints and floats. It is only fair to Python to note that this is not a memory leak, since the memory is definitely made available for further allocations. However, that memory will not get returned to the system until the process ends, nor will it be reused for anything other than allocating numbers of the same type.
Most programs don't have this problem because most programs do not create pathologically huge lists of numbers, free them, and then expect to reuse that memory for other objects. Programs using numpy are also safe because numpy stores numeric data of its arrays in tightly packed native format. For programs that do follow this usage pattern, the way to mitigate the problem is by not creating a large number of the integers at the same time in the first place, at least not in the process which needs to return memory to the system. It is unclear what exact use case you have, but a real-world solution will likely require more than a "magic decorator".
This is where subprocess come in: if the list of numbers is created in another process, then all the memory associated with the list, including but not limited to storage of ints, is both freed and returned to the system by the mere act of terminating the subprocess. Of course, you must design your program so that the list can be both created and processed in the subsystem, without requiring the transfer of all these numbers. The subprocess can receive information needed to create the data set, and can send back the information obtained from processing the list.
To illustrate the principle, let's upgrade your example so that the whole list actually needs to exist - say we're benchmarking sorting algorithms. We want to create a huge list of integers, sort it, and reliably free the memory associated with the list, so that the next benchmark can allocate memory for its own needs without worrying of running out of RAM. To spawn the subprocess and communicate, this uses the multiprocessing module:
# To run this, save it to a file that looks like a valid Python module, e.g.
# "foo.py" - multiprocessing requires being able to import the main module.
# Then run it with "python foo.py".

import multiprocessing, random, sys, os, time

def create_list(size):
    # utility function for clarity - runs in subprocess
    maxint = sys.maxint
    randrange = random.randrange
    return [randrange(maxint) for i in xrange(size)]

def run_test(state):
    # this function is run in a separate process
    size = state['list_size']
    print 'creating a list with %d random elements - this can take a while... ' % size,
    sys.stdout.flush()
    lst = create_list(size)
    print 'done'
    t0 = time.time()
    lst.sort()
    t1 = time.time()
    state['time'] = t1 - t0

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    state = manager.dict(list_size=5*1000*1000)  # shared state
    p = multiprocessing.Process(target=run_test, args=(state,))
    p.start()
    p.join()
    print 'time to sort: %.3f' % state['time']
    print 'my PID is %d, sleeping for a minute...' % os.getpid()
    time.sleep(60)
    # at this point you can inspect the running process to see that it
    # does not consume excess memory
Bonus Answer
It is hard to provide an answer to the bonus question, since the question is unclear. The "free list concept" is exactly that, a concept, an implementation strategy that needs to be explicitly coded on top of the regular Python allocator. Most Python types do not use that allocation strategy, for example it is not used for instances of classes created with the class statement. Implementing a free list is not hard, but it is fairly advanced and rarely undertaken without good reason. If some extension author has chosen to use a free list for one of its types, it can be expected that they are aware of the tradeoff a free list offers — gaining extra-fast allocation/deallocation at the cost of some additional space (for the objects on the free list and the free list itself) and inability to reuse the memory for something else.

Adding state to a function which gets called via pool.map -- how to avoid pickling errors

I've hit the common problem of getting a pickle error when using the multiprocessing module.
My exact problem is that I need to give the function I'm calling some state before I call it in the pool.map function, but in doing so, I cause the attribute lookup __builtin__.function failed error found here.
Based on the linked SO answer, it looks like the only way to use a function in pool.map is to call the defined function itself so that it is looked up outside the scope of the current function.
I feel like I explained the above poorly, so here is the issue in code. :)
Testing without pool
# Function to be called by the multiprocessing pool
def my_func(x):
    massive_list, medium_list, index1, index2 = x
    result = [massive_list[index1 + x][index2:] for x in xrange(10)]
    return result in medium_list
if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]
    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = map(my_func, to_crunch)
This works A-OK and just as expected. The only thing "wrong" with it is that it's slow.
Pool Attempt 1
# (Note: my_func() remains the same)
if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]
    pool = multiprocessing.Pool(2)
    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = pool.map(my_func, to_crunch)
This technically works, but it is a stunning 18x slower! The slow down must be coming from not only copying the two massive data structures on each call, but also pickling/unpickling them as they get passed around. The non-pool version benefits from only having to pass the reference to the massive list around, rather than the actual list.
So, having tracked down the bottleneck, I try to store the two massive lists as state inside of my_func. That way, if I understand correctly, it will only need to be copied once for each worker (in my case, 4).
Pool Attempt 2:
I wrap up my_func in a closure passing in the two lists as stored state.
def build_myfunc(m, s):
    def my_func(x):
        massive_list = m  # close the state in there
        small_list = s
        index1, index2 = x
        result = [massive_list[index1 + x][index2:] for x in xrange(10)]
        return result in medium_list
    return my_func

if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]
    modified_func = build_myfunc(data, source)
    pool = multiprocessing.Pool(2)
    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = pool.map(modified_func, to_crunch)
However, this raises the pickle error because (based on the SO question linked above) you cannot pass a function to multiprocessing unless it is defined at the top level of a module—a function defined inside another function's scope can't be pickled.
Error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
So, is there a way around this problem?
Map is a way to distribute workload. If you store the data in the function, I think you defeat its initial purpose.
Let's try to find why it is slower. It's not normal and there must be something else.
First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes so a total of 3 processes is involved. How many cores are on the system you're using? What else is running? What's the system load while crunching data?
What does the function do with the data? Does it access the disk? Or maybe it uses a DB, which means there is probably another process accessing the disk and the cores.
What about memory? Is it sufficient for storing the initial lists?
The right implementation is your Attempt 1.
Try to profile the execution using iostat for example. This way you can spot the bottlenecks.
If it stalls on the CPU, then you can try some tweaks to the code.
From another answer on Stackoverflow (by me so no problem copy and pasting it here :P ):
You're using .map(), which collects the results and then returns. So for a large dataset, you're probably stuck in the collecting phase.
You can try using .imap(), which is the iterator version of .map(), or even .imap_unordered() if the order of results is not important (as it seems from your example).
Here's the relevant documentation. Worth noting the line:
For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
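As for the original pickling error, one common workaround (essentially the initializer trick shown in the first answer on this page, sketched here with hypothetical names like init_worker) is to keep my_func at module level and hand each worker the two big lists exactly once when it starts, instead of closing over them:
_data = None
_source = None

def init_worker(d, s):
    # runs once per worker process; the big lists are pickled once per worker
    global _data, _source
    _data, _source = d, s

def my_func(args):
    index1, index2 = args
    result = [_data[index1 + x][index2:] for x in xrange(10)]
    return result in _source

if __name__ == '__main__':
    data = [...]      # the OP's "comprehension which loads a ton of state"
    source = [...]    # the OP's medium-sized comprehension
    pool = multiprocessing.Pool(2, initializer=init_worker, initargs=(data, source))
    for num in range(100):
        result = pool.map(my_func, ((num, x) for x in range(1000)))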
