I have a workload that consists of a very slow query returning a HUGE amount of data, which then has to be parsed and run through calculations, all of that in a loop. Basically, it looks like this:
for x in lastTenYears:
    myData = DownloadData(x)  # takes about ~40-50 [sec]
    parsedData.append(ParseData(myData))  # takes another +30-60 [sec]
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
How can I achieve this parallelism of operations?
Ideally, I would like to have 1 thread always downloading and N threads doing the parsing. The download part is actually a query against a database, so it's not a good idea to run a bunch of them in parallel...
Details:
The parsing of the data is heavily CPU bound and consists of raw math calculations and nothing else.
Using Python 3.7.4
1) Use a thread-safe queue. In Python 3 that is queue.Queue, which is FIFO. At the top level define
import queue
my_queue = queue.Queue()
parsedData = []
2) On the first thread, kick off the data loading
my_queue.put(DownloadData(x))
On the second thread
if not my_queue.empty():
    myData = my_queue.get()
    parsedData.append(ParseData(myData))
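Put together, the two threads could look roughly like the sketch below. It is only an illustration: download_data and parse_data are hypothetical stand-ins for DownloadData and ParseData, and the sentinel object and the bounded queue size are my own choices. The overlap works because the downloading thread spends its time waiting on I/O; the parsing itself still runs on one core.

```python
import queue
import threading
import time

SENTINEL = object()  # marks "no more batches"


def download_data(year):
    """Hypothetical stand-in for the slow database query."""
    time.sleep(0.01)  # I/O wait: the GIL is released while sleeping
    return list(range(year, year + 5))


def parse_data(batch):
    """Hypothetical stand-in for the CPU-heavy parsing step."""
    return sum(v * v for v in batch)


def run_pipeline(years):
    """One downloader thread feeds one parser thread through a FIFO queue."""
    batches = queue.Queue(maxsize=2)  # bounded, so downloads cannot run far ahead
    parsed = []

    def downloader():
        for year in years:
            batches.put(download_data(year))
        batches.put(SENTINEL)  # tell the parser to stop

    def parser():
        while True:
            batch = batches.get()  # blocks until a batch arrives
            if batch is SENTINEL:
                break
            parsed.append(parse_data(batch))

    threads = [threading.Thread(target=downloader),
               threading.Thread(target=parser)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return parsed


if __name__ == "__main__":
    print(run_pipeline(range(2010, 2020)))
```

Because get() blocks until something is available, the parser does not need to poll empty().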
If your program is CPU bound, you will have a hard time doing anything else in other threads because of the GIL (global interpreter lock).
Here is a link to an article which might help you to understand the topic: https://opensource.com/article/17/4/grok-gil
Downloading the data in a sub-process is most likely the best approach.
It's hard to say if and how much this will actually help (as I have nothing to test...), but you might try a multiprocessing.Pool. It handles all the dirty work for you and you can customize number of processes, chunk size etc.
from multiprocessing import Pool
def worker(x):
    myData = DownloadData(x)
    return ParseData(myData)
if __name__ == "__main__":
    processes = None  # defaults to os.cpu_count()
    chunksize = 1
    with Pool(processes) as pool:
        parsedData = pool.map(worker, lastTenYears, chunksize)
Here for the example I use the map method, but according to your needs you might want to use imap or map_async.
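If only one query should hit the database at a time, a variation is to keep the downloads in the main process and hand only the parsing to the pool. The following is a sketch under assumptions, not a tested solution for your data: download_data and parse_data are hypothetical stand-ins, and I use concurrent.futures instead of multiprocessing.Pool so the executor can be passed in.

```python
from concurrent.futures import ProcessPoolExecutor


def download_data(year):
    """Hypothetical stand-in for the slow, one-at-a-time database query."""
    return list(range(year, year + 1000))


def parse_data(batch):
    """Hypothetical stand-in for the CPU-bound parsing; must be picklable."""
    return sum(v * v for v in batch)


def download_then_parse(years, executor):
    """Download sequentially; parse each batch in the executor meanwhile."""
    futures = []
    for year in years:
        batch = download_data(year)  # only one query in flight at any time
        futures.append(executor.submit(parse_data, batch))  # parse in background
    return [f.result() for f in futures]  # results in submission order


if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(download_then_parse(range(2010, 2020), pool))
```

Each batch is pickled and sent to a worker process, so with really huge batches that transfer cost is worth measuring before settling on this layout.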
Q : How can I achieve this parallelism of operations?
Step number one is to realise that the use-case requested above is not a [PARALLEL] code-execution, but an un-ordered batch, limited by a resources-use policy, of strictly sequenced pairs of :
First, a remote [DB-Query] ( returning (cit.) HUGE amount of data )
Next, a local [CPU-process] ( of the (cit.) HUGE amount of data just returned )
The latency of the first could be masked ( if it were permitted, but it is not, due to the wish not to overload the DB-host ), the latency of the second could not ( while the parsing runs, the only thing that can start is the next I/O-bound DB-Query, and only if that does not violate the rule of keeping the DB-machine under a mild workload ).
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
It is high time to make things clear and sound :
Facts :
A )
CPU-bound tasks will never run faster in any number N of threads in the python GIL-lock controlled ecosystem ( a long-standing design decision, as Guido van Rossum has expressed ), as the GIL-lock enforces a re-[SERIAL]-isation. The more threads "work", the more threads actually wait to acquire the GIL-lock, each "getting" it for only a 1 / ( N + 1 )-th fraction of a total duration that, thanks to the GIL-lock policing, is again pure-[SERIAL]: N * ( 30 - 60 ) [sec]
B )
It makes no sense to off-load the I/O-bound task into a full process-based, concurrent execution, as that costs a full copy of the python process ( on Windows, the sub-process instantiation also duplicates the whole python interpreter state with all data ), while there are smarter techniques for I/O-bound processing ( where the GIL-lock does not hurt so much ).
C )
The whole concept of N-parsing : 1-querying is principally wrong. The maximum achievable goal is to mask the latency of the I/O-process ( where that makes sense ), yet here each and every query takes those said ~ 40-50 [sec], so no second pack-of-data to parse will ever be present before those ~ 40-50 [sec] have run again, so no second worker will ever get anything to parse before T0 + ~ 80-100 [sec]. One may wish to have N-(unbound)-workers working ( yet have them actually waiting for data ); that is possible, but awfully anti-productive ( the worse for N-(GIL-MUTEX-ed)-"waiting"-agents ).
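The latency-masking arithmetic can be checked with a small model. This is a sketch under simplifying assumptions of my own: every download takes the same time, every parse takes the same time, downloads run back to back, and hand-over costs are ignored.

```python
import heapq


def pipeline_makespan(download_s, parse_s, n_batches, n_parsers):
    """Finish time when one downloader feeds n_parsers parsing workers.

    Downloads run back to back; each batch is parsed by whichever
    worker becomes free first, but never before its download is done.
    """
    free_at = [0] * n_parsers  # time at which each parser is idle again
    heapq.heapify(free_at)
    finish = 0
    for i in range(1, n_batches + 1):
        ready = i * download_s  # batch i is fully downloaded
        start = max(ready, heapq.heappop(free_at))
        finish = start + parse_s
        heapq.heappush(free_at, finish)
    return finish


def serial_makespan(download_s, parse_s, n_batches):
    """Finish time of the original loop: download, then parse, repeated."""
    return n_batches * (download_s + parse_s)


if __name__ == "__main__":
    # Assumed figures: 10 batches, 45 s per download, 45 s per parse.
    print(serial_makespan(45, 45, 10))       # 900
    print(pipeline_makespan(45, 45, 10, 1))  # 495
    print(pipeline_makespan(45, 45, 10, 4))  # 495
```

With these assumed figures a single parser already hides all of the parsing behind the downloads, and three more parsers change nothing. Only when a parse takes longer than a download ( say 60 s against 40 s ) does a second parser shorten the run in this model, from 640 s to 460 s.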
Related
I have two functions. Each function runs a for loop.
def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, id=id, some_num=some_num, timestamp=timestamp)
        final_items.append(res)
    return final_items
def f2(df, id, some_num, timestamp):
    for ind, row in df.iterrows():
        filename = row['some_filename']
        dfx = reader(key=filename)  # User defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']
        if id == st_ID and some_num == st_some_num:
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue
The first function calls the second function as shown. The first loop occurs 2000 times, i.e., there are 2000 rows in the first dataframe.
The second function (the one that is called from f1()) runs 10 Million times.
My objective is to speed up f2() using parallel processing. I have tried using python packages like Multiprocessing and Ray but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
FACTS : the initial formulation asks 2E3 rows in f1() to request f2() to scan 1E7 rows in the "shared" df2, each time calling an unspecified reader()-process to receive some other data, which decides about further processing or a return
My objective is to speed up f2() using parallel processing
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
Surprise No.1 : This is NOT a use-case of parallel-processing
The problem, as formulated above, calls file-I/O operations many times, and those are never true-[PARALLEL] down on the physical storage level, are they? Never. Any and all smart file-I/O (pre-)caching and sliding-window file-I/O tricks cease to help at even moderate levels of just-[CONCURRENT] workloads, and often wreak havoc a single step beyond that principal workload ceiling, due to the physically limited memory resources, the I/O-bus width x speed, and the latency of the weakest element in the chain growing under still increasing traffic loads.
The workflow controlling iterators are pure-[SERIAL] "Work Dispatchers" that sequentially step through their domain of values, one after another, and order just another file to get ( again iteratively ) processed.
Surprise No.2 : Vectorisation will NOT help
While vectorised operations are smart for many vector/matrix/tensor processing schemes ( love using numpy + numba ), the Condicio Sine Qua Non is that the problem has to be:
"compact" - so that it gets easily expressed by vectorising syntax-tricks, which is not the case for this original [SERIAL] row-after-row-after-row search for a first, and only the first, "device_ID match" in a "remote"-file-content, followed by return None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. not a sequential "until something first happens" - vectorisation is great to "cover" the whole N-dimensional space with smart internal code that processes (best) orthogonal sub-structures uniformly "across" the whole space. Here, on the contrary, vectorised code is hard to re-sequentialise "back", so as to stop (poison) it from producing any further results right after the first occurrence was matched ( ref. the first point above: find the first and only the first occurrence, and die / return )
"memory-adequately-sized", i.e. whenever add-on logic is added to the vectorised task and the code asks the vectorisation engine to process N-dim "data" using some sort of where(...)-clause, the interim product of such where(...)-condition consumes an additional [SPACE]-footprint ( best in RAM, worse in SWAP-file-I/O ). This additional memory-footprint may soon devastate any and all benefits of the vectorised re-formulation ( not to speak of the cases where such immense additional memory-allocation needs end in a swap-file-I/O suffocation of the whole process flow ). A where(...)-clause over 10E6 rows is expensive, all the more once the global strategy is to execute it 1 < nCPUs < 2E3 many times ( as noted above, vectorisation goes uniformly "across" the whole range of data, with no sequentially beneficial shortcut to stop after the first and only the first match... )
THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck
The problem as formulated above is a just-[CONCURRENT] processing, where the actual blocking or availability of "shared" resources limits the overall processing duration. With no more than a given set of resources to use, there are no magic chances to speed up the concurrent usage patterns. Thus what matters are the "amounts" of free resources to harness and their respective response-"latencies", and surely those measured under high levels of concurrent workload, not the idealistic, unloaded response times.
If you have no profiling data, measure/benchmark at least the main characteristic durations:
a) the net f2()-per-row process latency [ min, Avg, MAX, StDev] in [us]
b) the reader()-related setup/retrieve latency [ min, Avg, MAX, StDev] in [us]
Test whether the reader()'s performance represents a bottleneck, i.e. a ceiling for any process-flow operated at increased concurrency.
If it does, you get the maximum workload it can handle, and based on this, the concurrent processing may be sped up only as far as this reader()-determined performance ceiling.
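A minimal way to collect the [ min, Avg, MAX, StDev ] figures might look like the sketch below. The helper measure_latency_us and the fake_reader stand-in are hypothetical names of mine; the real reader() and a sample of real keys would be passed in instead.

```python
import statistics
import time


def measure_latency_us(func, args_list):
    """Call func once per argument tuple; return (min, avg, max, stdev) in us."""
    samples = []
    for args in args_list:
        t0 = time.perf_counter()
        func(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return (min(samples), statistics.mean(samples),
            max(samples), statistics.pstdev(samples))


def fake_reader(key):
    """Hypothetical stand-in for the user-defined reader()."""
    time.sleep(0.001)
    return {"Id": key}


if __name__ == "__main__":
    lo, avg, hi, sd = measure_latency_us(fake_reader, [(k,) for k in range(50)])
    print(f"reader(): min={lo:.0f} avg={avg:.0f} max={hi:.0f} stdev={sd:.0f} [us]")
```

Running the same measurement from 1, 2, 4, ... concurrent callers shows where the reader() latency starts to grow, which is the ceiling discussed above.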
All the rest is elementary.
Epilogue
A right-sized concurrent-processing setup, engineered from such latency data and aware of the (un)avoidable bottlenecks, so as to achieve a maximum Latency Masking, is about the most one can expect to help here.
Given a chance to re-engineer and re-factor the global strategy, there might be much faster processing times, but those will have to come from something other than a pure-[SERIAL] tandem of sequential iterators instructing a sequence of about ~ 20.000.000.000 calls to an unknown reader()-code.
Yet, that goes way beyond the scope of this Stack Overflow MinCunVE-problem definition.
Hope this might have sparked some fresh views on how to make the results faster. Smart ideas may bring processing times from a few days down to a few minutes (!). Having gone this way a few times, you would not believe how fulfilling this hard work can be for both you and your customer(s), once you hit a right-sized solution designed for their business domain.
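One possible shape of such a re-engineering, offered only as a sketch: call reader() once per file, keep the four fields in a dictionary keyed by ( Id, some_num ), and answer each f1() row with a lookup. This assumes the file contents do not change during the run and that the index fits in memory; build_index and lookup are hypothetical names, and reader is passed in as a parameter.

```python
def build_index(filenames, reader):
    """Read every file once; keep the first file seen for each (Id, some_num)."""
    index = {}
    for filename in filenames:
        meta = reader(key=filename)
        key = (meta["Id"], meta["some_num"])
        # setdefault keeps the first match, mirroring f2()'s early return
        index.setdefault(key, (filename,
                               meta["some_first_time_variable"],
                               meta["some_last_time_variable"]))
    return index


def lookup(index, id_, some_num, timestamp):
    """Dictionary replacement for f2(): filename, or None if no usable match."""
    hit = index.get((id_, some_num))
    if hit is None:
        return None
    filename, first, last = hit
    return filename if first <= timestamp <= last else None


if __name__ == "__main__":
    # Hypothetical in-memory "files" standing in for whatever reader() loads.
    files = {
        "a.dat": {"Id": 1, "some_num": 7, "some_first_time_variable": 10,
                  "some_last_time_variable": 20},
        "b.dat": {"Id": 2, "some_num": 7, "some_first_time_variable": 0,
                  "some_last_time_variable": 5},
    }
    index = build_index(files, reader=lambda key: files[key])
    print(lookup(index, 1, 7, 15))  # a.dat
    print(lookup(index, 2, 7, 9))   # None
```

This turns about 2E3 x 1E7 reader() calls into 1E7 calls plus 2E3 dictionary lookups, and the one-off index build is then the only part left to spread over concurrent workers.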
I've implemented a genetic search algorithm and tried to parallelise it, but I'm getting terrible performance (worse than single-threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
1. Score each individual chromosome based on how it performs in a 'world'. The world remains static across iterations.
2. Randomly select a new population based on the scores calculated in the previous step.
3. Go to step 1 for n iterations.
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Ideally, I think (but am open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
    def set_score(self, i):
        # Stores a float
class Environment(object):  # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome
class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual
    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())
        world_data = LoadWorldData()  # Returns a pandas Dataframe
        self.world = Environment(world_data)
        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())
            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = self.dask_client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order that you submitted them to map, but this is not necessary for your code. You could have used wait to process when all the work was done, or simply iterate over the future.result()s (which will wait for each task to be done), and then you will retain the ordering in the chromosome_pool.
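The ordering point can be illustrated without Dask. The sketch below uses the standard library's concurrent.futures purely as a stand-in, and the Chromosome class and score_genes function are simplified, hypothetical versions of yours. The workers receive only the gene data and send back only a float, and the score is attached to the local object by iterating over the futures in submission order. The same pairing should carry over to the list of futures returned by the Dask client's map, though that is not tested here.

```python
from concurrent.futures import ThreadPoolExecutor


class Chromosome:
    """Minimal stand-in for the question's Chromosome."""
    def __init__(self, genes):
        self.genes = genes
        self.score = None


def score_genes(genes):
    """Hypothetical scoring function: takes plain data, returns only a float."""
    return float(sum(genes))


def score_pool(pool, executor):
    """Score every chromosome and attach each score to its local object."""
    futures = [executor.submit(score_genes, c.genes) for c in pool]
    # Iterating in submission order keeps futures[i] paired with pool[i].
    for chromosome, future in zip(pool, futures):
        chromosome.score = future.result()
    return max(c.score for c in pool)


if __name__ == "__main__":
    pool = [Chromosome([1, 2]), Chromosome([5, 5]), Chromosome([0, 3])]
    with ThreadPoolExecutor() as ex:
        best = score_pool(pool, ex)
    print(best, [c.score for c in pool])
```

Setting the score on the local objects matters for draw_chromosome_from_pool(): an object that comes back from a worker process is a copy, so its score is not automatically visible on the instances held in the local chromosome_pool.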
I have an arcpy process that requires doing a union on a bunch of layers, running some calculations, and writing an HTML report. Given the number of reports I need to generate (~2,100) I need this process to be as quick as possible (my target is 2 seconds per report). I've tried a number of ways to do this, including multiprocessing, when I ran across a problem, namely, that running the multi-process part essentially takes the same amount of time no matter how many cores I use.
For instance, for the same number of reports:
2 cores took ~30 seconds per round (so 40 reports takes 40/2 * 30 seconds)
4 cores took ~60 seconds (40/4 * 60)
10 cores took ~160 seconds (40/10 * 160)
and so on. It works out to the same total time because churning through twice as many at a time takes twice as long to do.
Does this mean my problem is I/O bound, rather than CPU bound? (And if so, what do I do about it?) I would have thought it was the latter, given that the large bottleneck in my timing is the union (it takes up about 50% of the processing time). Unions are often expensive in ArcGIS, so I assumed breaking it up and running 2 - 10 at once would have been 2 - 10 times faster. Or am I perhaps implementing multiprocessing incorrectly?
## Worker function just included to give some context
def worker(sub_code):
    layer = 'in_memory/lyr_{}'.format(sub_code)
    arcpy.Select_analysis(subbasinFC, layer, where_clause="SUB_CD = '{}'".format(sub_code))
    arcpy.env.extent = layer
    union_name = 'in_memory/union_' + sub_code
    arcpy.Union_analysis([fields],
                         union_name,
                         "NO_FID", "1 FEET")
    #.......Some calculations using cursors
    # Templating using Jinja
    context = {}
    context['DATE'] = now.strftime("%B %d, %Y")
    context['SUB_CD'] = sub_code
    context['SUB_ACRES'] = sum([r[0] for r in arcpy.da.SearchCursor(union, ["ACRES"], where_clause="SUB_CD = '{}'".format(sub_code))])
    # Etc
    # Then write the report out using custom function
    write_html('template.html', 'output_folder', context)
if __name__ == '__main__':
    subList = sorted({r[0] for r in arcpy.da.SearchCursor(subbasinFC, ["SUB_CD"])})
    NUM_CORES = 7
    chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
    for chunk in chunk_list:
        jobs = []
        for subbasin in chunk:
            p = multiprocessing.Process(target=worker, args=(subbasin,))
            jobs.append(p)
            p.start()
        for process in jobs:
            process.join()
There isn't much to go on here, and I have no experience with ArcGIS. So I can just note two higher-level things. First, "the usual" way to approach this would be to replace all the code below your NUM_CORES = 7 with:
pool = multiprocessing.Pool(NUM_CORES)
pool.map(worker, subList)
pool.close()
pool.join()
map() takes care of keeping all the worker processes as busy as possible. As is, you fire up 7 processes, then wait for all of them to finish. All the processes that complete before the slowest vanish, and their cores sit idle waiting for the next outer loop iteration. A Pool keeps the 7 processes alive for the duration of the job, and feeds each a new piece of work to do as soon as it finishes its last piece of work.
Second, this part ends with a logical error:
chunk_list = [subList[i:i+NUM_CORES] for i in range(0, len(subList), NUM_CORES-1)]
You want NUM_CORES there rather than NUM_CORES-1. As-is, the first time around you extract
subList[0:7]
then
subList[6:13]
then
subList[12:19]
and so on. subList[6] and subList[12] (etc) are extracted twice each. The sublists overlap.
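The overlap is easy to see on a small list. In this sketch, chunks is a hypothetical helper that wraps the same list comprehension, and list(range(20)) stands in for subList.

```python
def chunks(items, size, step):
    """Slice items into windows of `size`, advancing `step` items each time."""
    return [items[i:i + size] for i in range(0, len(items), step)]


if __name__ == "__main__":
    sub_list = list(range(20))  # hypothetical stand-in for subList
    NUM_CORES = 7
    print(chunks(sub_list, NUM_CORES, NUM_CORES - 1))  # windows overlap by one item
    print(chunks(sub_list, NUM_CORES, NUM_CORES))      # disjoint windows
```

With a step of NUM_CORES - 1, the last element of each window is repeated as the first element of the next, so those reports get generated twice.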
You don't show us quite enough to be sure what you are doing. For example, what is your env.workspace? And what is the value of subbasinFC? It seems like you're doing an analysis at the beginning of each process to filter down the data into layer. But is subbasinFC coming from disk, or from memory? If it's from disk, I'd suggest you read everything into memory before any of the processes try their filtering. That should speed things along, if you have the memory to support it. Otherwise, yeah, you're I/O bound on the input data.
Forgive my arcpy cluelessness, but why are you inserting a where clause in your sum of context['SUB_ACRES']? Didn't you already filter on sub_code at the start? (We don't know what the union is, so maybe you're unioning with something unfiltered...)
It is worth looking at how you start and join your jobs. This:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    jobs.append(p)
    p.start()
for process in jobs:
    process.join()
starts every process in the chunk and only then joins them, so the whole chunk runs concurrently. The alternative:
for subbasin in chunk:
    p = multiprocessing.Process(target=worker, args=(subbasin,))
    p.start()
    p.join()
waits for each process to terminate before spinning up the next one. That keeps the parent from ever having more than one live child, but it also means the reports are produced one at a time, so it is mainly useful as a serial baseline to compare your per-report timings against.
Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x
def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple problem of generating n-grams from a string.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random
def f(x):
    return x
def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output
def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000 #100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))
if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially larger (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only on one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
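As a rough illustration of the batching and initializer ideas combined, here is a sketch written for Python 3. The names init_worker, count_ngrams and make_tasks are made up for this example, and the worker returns only a count because I am assuming a small result is all the parent needs. The big list is handed to each worker once, every task is just three integers, and the n-grams are built inside the workers.

```python
from multiprocessing import Pool

_words = None  # per-process copy of the read-only data, set by the initializer


def init_worker(words):
    """Runs once in each worker process: stash the big list in a global."""
    global _words
    _words = words


def count_ngrams(task):
    """Task = (start, stop, n): count the n-grams that begin in [start, stop)."""
    start, stop, n = task
    last_start = len(_words) - n  # last index where an n-gram still fits
    grams = [_words[i:i + n] for i in range(start, min(stop, last_start + 1))]
    return len(grams)  # send back a small result, not the data


def make_tasks(total, batch, n):
    """Split range(total) into (start, stop, n) index batches."""
    return [(i, min(i + batch, total), n) for i in range(0, total, batch)]


if __name__ == "__main__":
    words = [str(i) for i in range(100_000)]  # hypothetical input
    tasks = make_tasks(len(words), batch=10_000, n=3)
    with Pool(initializer=init_worker, initargs=(words,)) as pool:
        print(sum(pool.imap_unordered(count_ngrams, tasks)))
```

If the real job has to return the n-grams themselves, the pickling cost comes back on the return path, so it pays to reduce the data inside the worker as far as the problem allows.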
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow. I'm fairly new to this, so there should be plenty of room. Btw, I've already experimented with queues (that also worked but ran into the same MemoryError), and I've looked through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
    ## loads a few million company names into id: name dict
    return hayCompanies
def getNeedles(*args):
    ## loads subset of 30K companies into id: name dict (for allocation to workers)
    return needleCompanies
def findNeedle(needle, haystack):
    """ Identify best match and return results with score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(haystack[hayID]):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
    """ Execute findNeedle and process results for poolWorker batch"""
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getTargets(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results
if __name__ == '__main__':
    pool = Pool(processes = numProcesses)
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0,
                                         totalTargets,
                                         targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?
Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool, send each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch, needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            # Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    # runMatch unpacks a single (batch, first) tuple, one per process
    processes = [Process(target=runMatch,
                         args=((targetsPerBatch, first),))
                 for first in xrange(0,
                                     totalTargets,
                                     targetsPerBatch)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
    global _haystack
    _haystack = getHayStack()
def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...
if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even further. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
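Going one step further, the results dict is not needed at all if you keep only the best candidate seen so far. A sketch written for Python 3, with the scorer passed in as a parameter: token_overlap is a hypothetical stand-in for levi.setratio, and the truthiness check stands in for your isnull test.

```python
def find_needle(needle, haystack, score):
    """Return [best_id, best_name, best_score] without storing every score."""
    needle_tokens = needle.split(' ')  # split once, not once per candidate
    best_id, best_score = None, float('-inf')
    for hay_id, hay_name in haystack.items():
        if not hay_name:  # skip missing names
            continue
        s = score(needle_tokens, hay_name.split(' '))
        if s > best_score:
            best_id, best_score = hay_id, s
    if best_id is None:
        return None
    return [best_id, haystack[best_id], best_score]


def token_overlap(a, b):
    """Hypothetical scorer standing in for levi.setratio: Jaccard overlap."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


if __name__ == "__main__":
    hay = {1: "acme widgets inc", 2: "acme corp", 3: None}
    print(find_needle("acme corp ltd", hay, token_overlap))
```

With a haystack of 4 million names this avoids building a 4-million-entry dict (plus two list copies of it) for every single needle.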
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
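Next, try sys.getsizeof(_haystack) and see what it says. Keep in mind that getsizeof is shallow: for a dict it reports only the container, not the keys and values it refers to, so add those up as well. If the total is, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it, e.g., by using a shelve database instead of a plain dict.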
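For a flat id: name dict, a rough total can be had by adding the sizes of the keys and values to the size of the container. The helper below is a hypothetical sketch; it goes one level deep only, which is enough for ints and strings but not for nested structures.

```python
import sys


def approx_dict_size(d):
    """Rough byte count: the dict itself plus each key and value object.

    Goes one level deep only, and shared objects are counted more than once.
    """
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
    return total


if __name__ == "__main__":
    haystack = {i: "company name %d" % i for i in range(100_000)}  # hypothetical data
    print("container only:", sys.getsizeof(haystack))
    print("with contents :", approx_dict_size(haystack))
```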
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
On the other hand, if your total VM usage is still low at the time the first task runs (note that getrusage doesn't directly have any way to see your total VM size; ru_maxrss is often a useful approximation, especially if ru_nswap is 0), the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.