I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to feed into the model, so I am running multiple backtests over a given period, trying out different combinations. The idea is to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what runs fastest. First, I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters in each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

backtesting = []
for day in days:
    backtesting_day = MyStrategy(day, single_parameter_combi)
    backtesting.append(backtesting_day)
I then tried the multiprocessing way, using a Pool.
The sample code used in multiprocessing:
import multiprocessing

class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

def single_run_backtesting(single_parameter_combi):
    backtesting = []
    for day in days:
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtesting.append(backtesting_day)
    return backtesting

def backtest_many(list_of_parameter_combinations):
    p = multiprocessing.Pool()
    result = p.map(single_run_backtesting, list_of_parameter_combinations)
    p.close()
    p.join()
    return result

if __name__ == '__main__':
    parameter_combis = [...]  # a list of parameter combinations, 10 different ones in this case
    result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

parameter_combis = [...]  # a list of parameter combinations
backtest_dict = {k: [] for k in range(len(parameter_combis))}  # make a dictionary of empty lists
for day in days:
    for j, single_parameter_combi in enumerate(parameter_combis):
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtest_dict[j].append(backtesting_day)
To my great surprise, it takes around 25 minutes with multiprocessing to go through a single day, about the same time as a single Spyder console with 10 instances of the class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-CPU machine on Windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters crossing in a genetic algorithm that should run something like 20-30 generations (a full backtest is 2 business months = 44 days).
My question is: what am I missing??? Most importantly, is this just a difference in scale?
I know that, for example, if you define a simple squaring function and run it serially 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10,000 iterations; see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advance if this is the case? Am I writing the code properly? Other ideas? Do you think it would speed things up significantly if I used multiprocessing one step "above", across the many days within a single parameter combination?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order they're in the parameter list. To achieve this, some of the workers necessarily remain idle for some time.
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this and a contrived problem (finding two random numbers 0..1 to maximize their sum) follows. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time


def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
    best_result = None
    best_metric = None
    start_time = time.time()
    n_results = 0
    with multiprocessing.Pool() as p:
        for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
            n_results += 1
            elapsed_time = time.time() - start_time
            metric = metric_func(result)
            if best_metric is None or metric > best_metric:
                print(f'{elapsed_time}: Found new best solution, metric {metric}')
                best_metric = metric
                best_result = result
            if elapsed_time >= max_time:
                print(f'{elapsed_time}: Max time reached.')
                break
    final_time = time.time() - start_time
    print(f'Searched {n_results} results in {final_time} s.')
    return best_result


# ------------

def generate_parameter():
    return {'a': random.random(), 'b': random.random()}


def generate_parameters():
    while True:
        yield generate_parameter()


def my_worker(parameters):
    return {
        'parameters': parameters,  # remember to return this too!
        'value': parameters['a'] + parameters['b'],  # our maximizable metric
    }


def my_metric(result):
    return result['value']


def main():
    result = find_best(
        param_iterable=generate_parameters(),
        worker_func=my_worker,
        metric_func=my_metric,
        max_time=5,
    )
    print(f'Best result: {result}')


if __name__ == '__main__':
    main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
Related
I have two functions. Each function runs a for loop.
def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        device_id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, device_id=device_id, some_num=some_num, timestamp=timestamp)
        final_items.append(res)
    return final_items


def f2(df, device_id, some_num, timestamp):
    for ind, row in df.iterrows():
        filename = row['some_filename']
        dfx = reader(key=filename)  # User defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']
        if device_id == st_ID and some_num == st_some_num:
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue
The first function calls the second function as shown. The first loop occurs 2000 times, i.e., there are 2000 rows in the first dataframe.
The second function (the one that is called from f1()) runs 10 Million times.
My objective is to speed up f2() using parallel processing. I have tried using Python packages like multiprocessing and Ray, but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
FACTS: the initial formulation has f1() iterate over 2E3 rows and, for each of them, asks f2() to scan 1E7 rows of the "shared" df2, calling an unspecified reader() process for each row so as to receive some other data and decide about further processing or an early return.
My objective is to speed up f2() using parallel processing
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
Surprise No. 1: This is NOT a use case for parallel processing
The problem, as formulated above, calls file-I/O operations many times, and those are never truly [PARALLEL] down at the physical storage level, are they? Never. Any and all smart file-I/O (pre-)caching and sliding-window tricks cease to help at even moderate levels of a merely [CONCURRENT] workload, and often wreak havoc a single step beyond that principal workload ceiling, because the memory resources, the I/O-bus width and speed, and the latency of the weakest element in the chain are all physically limited and degrade further under still-growing traffic loads.
The iterators controlling the workflow are pure-[SERIAL] "work dispatchers" that sequentially step through their domain of values, one after another, ordering just another file to be (again iteratively) processed.
Surprise No. 2: Vectorisation will NOT help
While vectorised operations are smart for many vector/matrix/tensor processing schemes (I love using numpy + numba), the Condicio Sine Qua Non is that the problem has to be:
"compact" - so that it is easily expressed with vectorising syntax tricks, which this original [SERIAL] row-after-row-after-row search is not: it looks for the first, and only the first, "device_id match" in "remote" file content and then does return None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. not sequential "until something first happens" - vectorisation is great at covering a whole N-dimensional space with smart internal code that processes (ideally orthogonal) sub-structures uniformly "across" the whole space. Here, on the contrary, it is hard to re-sequentialise the vectorised work "back" so that it stops (is poisoned) and produces no further results right after the first match occurs (ref. point 1 above: "find the first and only the first occurrence (and return)")
"memory-adequately-sized", i.e. whenever add-on logic asks the vectorisation engine to process the N-dim "data" with some sort of where(...) clause, the interim product of that where(...) condition consumes an additional [SPACE] footprint (best case in RAM, worst case in swap-file I/O), and this additional memory footprint may soon devastate any benefit of the vectorised re-formulation (not to mention the cases where the additional memory allocation ends in swap-file-I/O suffocation of the whole process flow). A where(...) clause over 1E7 rows is expensive, the more so once the global strategy is to execute it 1 < nCPUs < 2E3 many times (and, as noted above, vectorisation goes uniformly "across" the whole range of data, with no sequential shortcut to stop after the first and only the first match) - see the short illustration below.
THE BEST NEXT STEP: dependency graph -> latencies -> bottleneck
The problem as formulated above is just-[CONCURRENT] processing, where blocking and the availability of "shared" resources limit the overall processing duration. With no more than a given set of resources to use, there are no magic ways to speed up the concurrent usage pattern. What matters are the "amounts" of free resources you can harness and their respective response "latencies" - the latencies under high levels of concurrent workload, not the idealistic, unloaded response times.
If you have no profiling data, measure/benchmark at least the main characteristic durations:
a) the net f2()-per-row process latency [min, Avg, MAX, StDev] in [us]
b) the reader()-related setup/retrieve latency [min, Avg, MAX, StDev] in [us]
Then test whether the reader()'s performance represents a bottleneck - a ceiling for any process flow operated at increased concurrency.
If it does, you know the maximum workload it can handle, and based on that, concurrent processing can only push the speed up to this reader()-determined performance ceiling.
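A minimal sketch of such a measurement (f2() and reader() are the names from the question; the argument lists and the sample size are up to you), collecting per-call latencies so the min / avg / max / stdev figures above can be compared directly:
import time
import statistics

def profile_calls(func, args_list):
    """Collect per-call latencies in microseconds and summarise them."""
    latencies_us = []
    for args in args_list:
        t0 = time.perf_counter()
        func(*args)
        latencies_us.append((time.perf_counter() - t0) * 1e6)
    return {
        'min': min(latencies_us),
        'avg': statistics.mean(latencies_us),
        'max': max(latencies_us),
        'stdev': statistics.stdev(latencies_us) if len(latencies_us) > 1 else 0.0,
    }

# e.g. profile a few hundred f2() calls and a few hundred bare reader() calls,
# then compare the two distributions to see which one is the ceiling.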
All the rest is elementary.
Epilogue
A concurrent-processing setup engineered from such latency data - right-sized and aware of the (un)avoidable bottleneck, aiming at maximum latency masking - is about the most one can expect to help here.
Given a chance to re-engineer and refactor the global strategy, much faster processing times are possible, but they will have to come from something other than a pure-[SERIAL] tandem of sequential iterators dispatching a sequence of about ~2E10 (20,000,000,000) calls to an unknown reader() code.
Yet that goes way beyond the scope of this Stack Overflow question's MCVE problem definition.
Hope this might have sparked some fresh views on how to make the results come faster. Smart ideas may cut processing times from a few days down to a few minutes (!). Having gone down this road a few times, no one will believe how fulfilling this hard work can be for both you and your customer(s) when you hit such a solution by designing one right-sized for their business domain.
I've implemented a genetic search algorithm and tried to parallelise it, but getting terrible performance (worse than single threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly select a new population based on the scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Ideally, I think (but I am open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, and then Chromosome() instance data is distributed for scoring, with little duplicated back-and-forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float
        ...
    def set_score(self, i):
        # Stores a float
        ...

class Environment(object):  # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())

        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)

        iterations = 1000
        for i in range(iterations):
            highest_score = 0
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())

            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could instead send it to the workers just once, before the loop, with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but you don't actually need that for your code. You could use wait to process the results once all the work is done, or simply iterate over the futures and call future.result() on each (which waits for each task in turn); then you retain the ordering of chromosome_pool.
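For example, a minimal sketch of that ordered variant (reusing future_world from above and the names from the question; only one of several ways to do it):
futures = self.dask_client.map(
    lambda ch: Environment.calculate_scores(future_world, ch),
    self.chromosome_pool,
)
# Iterating in submission order keeps result i aligned with self.chromosome_pool[i],
# so no extra bookkeeping is needed to know which chromosome each score belongs to.
scored = [f.result() for f in futures]   # blocks on each task in turn
highest_score = max(c.get_score() for c in scored)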
I defined two correct ways of calculating averages in Python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)


def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        # calculate a small portion of the average for each num and add to the total
        mean += value / num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct, and is the speed loss worth it?
If yes, how can I make the second function actually run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1 + a2 + ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but the second version performs more operations (namely (n-1) more divisions), which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one run: the total before the run and the current value. In the second version we need: the mean before the run, the current value, and num_of_values. As you can see, the second version actually depends on more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and the other core the second half, i.e. compute ((a1 + a2 + ... + a(n//2)) + (a(n//2+1) + ... + an)) / n. Yes, the work of dividing by n is not split between the cores, but it's a single instruction, so we don't really care. We also need to add the left total and the right total, which we can't split, but again it's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
There's still a problem with python - normally one could use threads to do the computations, because each thread will use one core. But in that case one has to take care that your program does not run into race conditions, and the python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs in one thread at a time. A basic solution is to use multiple processes via multiprocessing.
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[:len(long_list)//2], long_list[len(long_list)//2:]])
    print(sum(results) / len(long_list))  # add subresults and divide by n
But of course multiple processes do not come for free. You need to fork, copy stuff, etc., so you will not gain the speedup of 2 that one might expect. Also, the biggest slowdown is actually using Python itself; it's not really optimized for fast numerical computation. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
which is probably much faster than the pure-Python version. I don't think numpy uses multiprocessing internally, so you could gain a further boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy alone is fast enough.
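If you do want to combine both, a minimal sketch of the idea (the chunk count and the stand-in data are arbitrary choices):
import numpy as np
from multiprocessing import Pool

def chunk_sum(chunk):
    # each worker sums its own chunk using numpy's C implementation
    return float(np.sum(chunk))

if __name__ == '__main__':
    long_list = list(range(10_000_000))            # stand-in data
    chunks = np.array_split(np.asarray(long_list), 4)
    with Pool(4) as p:
        partial_sums = p.map(chunk_sum, chunks)
    print(sum(partial_sums) / len(long_list))      # add subresults and divide by n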
I have a script in Python, but it takes more than 20 hours to run to the end.
Since my code is pretty big, I will post a simplified one.
The first part of the code:
flag = 1
mydic = {}
for i in mylist:
    mydic[flag] = myfunction(i)
    flag += 1
mylist has more than 700 entries, and each call to myfunction runs for around 20 seconds.
So I was wondering whether I can use parallel programming to split the iteration into two groups and run them simultaneously. Is that possible, and would it take roughly half the time?
The second part of the code:
mymatrix = []
for n1 in range(0, flag):
    mat = []
    for n2 in range(0, flag):
        if n1 >= n2:
            mat.append(0)
        else:
            res = myfunction2(mydic[n1], mydic[n2])
            mat.append(res)
    mymatrix.append(mat)
So, if mylist has 700 entries, I want to create a 700x700 upper-triangular matrix. But myfunction2() needs around 30 seconds per call. I don't know if I can use parallel programming here too.
I cannot simplify myfunction() and myfunction2(), since they call an external API and return the results.
Do you have any suggestions on how I can change it to make it faster?
Based on your comments, I think it's very likely that the 30 seconds per call is mostly spent in external API calls. I would add some timing code to test which portions of your code are actually responsible for the slowness.
If it is the external API calls, there are some easy fixes. The external API calls block, so you'll get a speedup if you can move to a parallel model (though 30 s of blocking sounds huge to me).
I think it would be easiest to create a quick "task list" by having the output of 2 loops be a matrix of arguments to pass into a function. Then I'd pipe them into Celery to run the tasks. That should give you a decent speedup with a minimal amount of work.
You could probably save even more time by using the threading or multiprocessing modules to run the tasks (or sections), or even by writing it all in Twisted Python - but that usually takes longer to build than a simple Celery setup.
The one caveat with the Celery approach is that you'll be dispatching a lot of work, so you'll need some way to poll for results. That could be a while loop that just sleeps for 10 seconds and repeats until Celery has a result for every task. If you do it in Twisted, you can access/track results on finish. I've never had to do something like this with multiprocessing, so I don't know how that would fit in.
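If Celery feels like too much machinery, the same task-list idea can be sketched with a thread pool, since the work is blocking API I/O rather than CPU (myfunction, myfunction2 and mylist are the names from the question; the pool size of 10 is a guess you would tune):
from multiprocessing.pool import ThreadPool

# Part 1: run the ~700 myfunction() calls concurrently.
with ThreadPool(10) as pool:
    results = pool.map(myfunction, mylist)
mydic = {i + 1: r for i, r in enumerate(results)}   # keys start at 1, as in the question
flag = len(mylist) + 1

# Part 2: build the task list for the upper-triangular entries, then map it in parallel.
tasks = [(n1, n2) for n1 in range(1, flag) for n2 in range(n1 + 1, flag)]
with ThreadPool(10) as pool:
    upper_values = pool.map(lambda t: myfunction2(mydic[t[0]], mydic[t[1]]), tasks)
# upper_values[k] corresponds to tasks[k], so the matrix can be filled back in afterwards.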
How about using a generator for the second part, instead of one of the for loops?
def fn():
    for n1 in range(0, flag):
        yield n1

generate = fn()
mymatrix = []
while True:
    try:
        a = next(generate)
    except StopIteration:
        break
    mat = []
    for n2 in range(0, flag):
        if a >= n2:
            mat.append(0)
        else:
            mat.append(myfunction2(mydic[a], mydic[n2]))
    mymatrix.append(mat)
I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this, so there should be plenty of room. Btw, I've already experimented with queues (they also worked but ran into the same MemoryError), and I've looked through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
    ## loads a few million company names into id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads subset of 30K companies into id: name dict (for allocation to workers)
    return needleCompanies

def findNeedle(needle, haystack):
    """ Identify best match and return results with score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(haystack[hayID]):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]

def runMatch(args):
    """ Execute findNeedle and process results for poolWorker batch """
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getNeedles(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results

if __name__ == '__main__':
    pool = Pool(processes=numProcesses)
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0,
                                         totalTargets,
                                         targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?
Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use the pool as a pool, send each job as its own task, and use the batching (chunksize) feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch, needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            # Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    processes = [Process(target=runMatch,
                         args=(targetsPerBatch,
                               xrange(0,
                                      totalTargets,
                                      targetsPerBatch)))
                 for _ in range(numProcesses)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
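For reference, the caching idea mentioned in passing above could look something like this (a sketch only; the initializer approach below is the cleaner fix):
_haystack_cache = None

def getHayStackCached():
    # load the big dict once per process and reuse it on every later call
    global _haystack_cache
    if _haystack_cache is None:
        _haystack_cache = getHayStack()
    return _haystack_cache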
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
    global _haystack
    _haystack = getHayStack()

def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...

if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
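A rough sketch of that build-once / open-read-only idea (the file name is made up; error handling omitted):
import shelve

def buildHayShelf(path='haystack.shelf'):
    # run this once, in a single process, before starting the pool
    db = shelve.open(path, flag='n')
    for hayID, hayCompany in getHayStack().iteritems():
        db[str(hayID)] = hayCompany
    db.close()

def initPool(path='haystack.shelf'):
    global _haystack
    # every worker opens the same on-disk shelf read-only instead of building its own dict
    _haystack = shelve.open(path, flag='r')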
On the other hand, if your total VM usage is still low (note that getrusage doesn't directly have any way to see your total VM size—ru_maxrss is often a useful approximation, especially if ru_nswap is 0) at the time the first task runs, the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.
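For that last debugging step, a small sketch (assuming the simplified runMatch above that takes a single (needleID, needleCompany) item; the 20-task cutoff is arbitrary): run a handful of tasks serially in a single process and watch the peak memory after each one.
import resource

for i, item in enumerate(needleCompanies.iteritems()):
    runMatch(item)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('task %d done, ru_maxrss = %d' % (i, peak))
    if i >= 20:
        break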