My mxnet script is likely limited by i/o of data loading into the GPU, and I am trying to speed this up by prefetching. The trouble is I can't figure out how to prefetch with a custom data iterator.
My first hypothesis/hope was that it would be enough to set the values of self.preprocess_threads and self.prefetch_buffer, as I had seen here for iterators such as mxnet.io.ImageRecordUInt8Iter. However, when I did this I saw no performance change relative to the script before I had set these variables, so clearly setting these did not work.
Then I noticed the existence of a class mx.io.PrefetchingIter, in addition to the base class mx.io.DataIter from which I had derived my child class. I found its documentation, but I have not been able to find any examples, and I am a little confused about what needs to happen where and when. For example, I see that in addition to next() it has an iter_next() method, which simply says "move to the next batch". What does this mean exactly? What does it mean to "move" to the next batch without producing it? I found the source code for this class, and based on a brief reading, it seems to take multiple iterators and create one thread per iterator. This likely would not work for my current design, as I really want multiple threads prefetching from the same iterator.
Here is what I am trying to do via a custom data iterator:
I maintain a global multiprocessing.Queue onto which data is pushed as it becomes available
I produce that data by running (via multiprocessing) a command-line script that executes a C++ binary, which writes a numpy file
I open the numpy file, load its contents into memory, process them, and put the processed pieces on the global multiprocessing.Queue
My custom iterator pulls from this queue and also kicks off more jobs to produce more data when the queue is empty.
Here is my code:
def launchJobForDate(date_str):
    ### this is a function that gets called via multiprocessing
    ### to produce new data by calling a c++ binary
    ### whenever the data queue is empty and we need to produce more data
    try:
        f = "testdata/data%s.npy" % date_str
        if not os.path.isfile(f):
            cmd = CMD % (date_str, JSON_FILE, date_str, date_str, date_str)
            # retry the binary until it succeeds
            while True:
                try:
                    output = subprocess.check_output(cmd, shell=True)
                    break
                except:
                    pass
        # retry the load until the file is fully written and readable
        while True:
            try:
                d = np.load(f)
                break
            except:
                pass
        data_queue.put((d, date_str))
    except Exception as ex:
        print("launchJobForDate: ERROR ", ex)
class ProduceDataIter(mx.io.DataIter):
    @staticmethod
    def processData(d, time_steps, num_inputs):
        try:
            ...processes data...
            return [z for z in zip(bigX, bigY, bigEvalY, dates)]
        except Exception as ex:
            print("processData: ERROR ", ex)

    def __init__(self, num_mgrs, end_date_str):
        ## iter stuff
        self.preprocess_threads = 4
        self.prefetch_buffer = 1
        ## set up internal data to preserve state
        ## and make a list of dates for which to run the binary

    @property
    def provide_data(self):
        return [mx.io.DataDesc(name='seq_var',
                               shape=(args_batch_size * GPU_COUNT,
                                      self.time_steps,
                                      self.num_inputs),
                               layout='NTC')]

    @property
    def provide_label(self):
        return [mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT,)),
                mx.io.DataDesc(name='bd_return',
                               shape=(args_batch_size * GPU_COUNT, num_y_cols)),
                mx.io.DataDesc(name='date',
                               shape=(args_batch_size * GPU_COUNT,))]

    def __next__(self):
        try:
            z = self.z.pop(0)
            data = z[0:1]
            label = z[1:]
            return mx.io.DataBatch(data, label)
        except Exception as ex:
            ### if self.z (a list) has no elements to pop we need
            ### to get more data off the queue, process it, and put it
            ### on self.z so it's ready for calls to __next__()
            while True:
                try:
                    d = data_queue.get_nowait()
                    processedData = ProduceDataIter.processData(d,
                                                                self.time_steps,
                                                                self.num_inputs)
                    self.z.extend(processedData)
                    counter_queue.put(counter_queue.get() - 1)
                    z = self.z.pop(0)
                    data = z[0:1]
                    label = z[1:]
                    return mx.io.DataBatch(data, label)
                except queue.Empty:
                    ...this is where new jobs to produce new data and put them
                    ...on the queue would happen if nothing is left on the queue
I have then tried making one of these iterators as well as a prefetch iterator like so:
mgr = ProcessMgr(2, end_date_str)
mgrOuter = mx.io.PrefetchingIter([mgr])
The problem is that mgrOuter immediately throws StopIteration the first time __next__() is called, without invoking mgr.__next__() as I thought it would.
Finally, I also noticed that gluon has a DataLoader object which seems like it might handle prefetching; however, in this case it also seems to assume that the underlying data comes from a Dataset with a finite and unchanging layout (based on the fact that it is implemented in terms of __getitem__, which takes an index). So I have not pursued this option, as it seems unpromising given the dynamic, queue-like nature of the data I am generating as training input.
My questions are:
How do I need to modify my code above so that there will be prefetching for my custom iterator?
Where might I find an example or more detailed documentation of how mx.io.PrefetchingIter works?
Are there other strategies I should be aware of for getting more performance out of my GPUs via a custom iterator? Right now they are only operating at around 50% capacity, and upping (or lowering) the batch size doesn't change this. What other knobs might I be able to turn to increase GPU use efficiency?
Thanks for any feedback and advice.
As you already mentioned, the Gluon DataLoader provides prefetching. In your custom DataIter you are using numpy arrays as input, so you could do the following:
f = "testdata/data%s.npy"%date_str
data = np.load(f)
train = gluon.data.ArrayDataset(mx.nd.array(data))
train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4, batch_size=batch_size, last_batch='rollover')
Since you are creating your data dynamically, you could try resetting the DataLoader in every epoch and loading a new numpy array.
If GPU utilization is still low, try increasing the batch_size and num_workers. Another factor is the size of your dataset: resetting the DataLoader has a cost, so with a larger dataset each epoch takes longer and the reset overhead matters proportionally less.
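For example, a rough sketch of that per-epoch reset might look like the following (the date_strings list and the body of the training loop are assumptions based on the file pattern in the question):
for date_str in date_strings:  # hypothetical list of dates to generate/train on
    data = np.load("testdata/data%s.npy" % date_str)
    train = gluon.data.ArrayDataset(mx.nd.array(data))
    train_iter = gluon.data.DataLoader(train, shuffle=True, num_workers=4,
                                       batch_size=batch_size, last_batch='rollover')
    for batch in train_iter:
        pass  # forward/backward pass on the GPU goes here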
Related
I have a folder containing 497 pandas dataframes stored as .parquet files. The folder's total size is 7.6GB.
I'm trying to develop a simple trading system, so I created 2 different classes. The main one is the Portfolio class; it creates an Asset object for every single dataframe in the data folder.
import os
import pandas as pd
from dask.delayed import delayed

class Asset:
    def __init__(self, file):
        self.data_path = 'path\\to\\data\\folder\\'
        self.data = pd.read_parquet(self.data_path + file, engine='auto')

class Portfolio:
    def __init__(self):
        self.data_path = 'path\\to\\data\\folder\\'
        self.files_list = [file for file in os.listdir(self.data_path) if file.endswith('.parquet')]
        self.assets_list = []
        self.results = None
        self.shared_data = '???'

    def assets_loading(self):
        for file in self.files_list:
            tmp = Asset(file)
            self.assets_list.append(tmp)

    def dask_delayed(self):
        for asset in self.assets_list:
            backtest = delayed(self.model)(asset)

    def dask_compute(self):
        self.results = delayed(dask_delayed)
        self.results.compute()

    def model(self, asset):
        # do shet

if __name__ == '__main__':
    portfolio = Portfolio()
    portfolio.dask_compute()
I'm doing something wrong, because it looks like the results are not processed. If I check portfolio.results, the console prints:
Out[5]: Delayed('NoneType-7512ffcc-3b10-445f-928a-f01c01bae29c')
So here are my questions:
Can you explain me what's wrong?
When I run the function assets_loading() I'm basically loading the entire data folder into memory for faster processing, but it saturates my RAM (16GB available). I didn't think that a 7.6GB folder could saturate 16GB of RAM; that's why I want to use Dask. Any solution compatible with my script's workflow?
There is another problem, and probably the bigger one. With Dask I'm trying to parallelize the model function over multiple assets at the same time, but I need shared memory (self.shared_data in the script) to store some variable values that live inside each Dask process back into the Portfolio object (for example, a single asset's yearly performance). Can you explain how I can share data between Dask delayed processes and how to store this data in a Portfolio variable?
Thanks a lot.
There are a few things wrong with the line self.results = delayed(dask_delayed):
Here you are creating a delayed function, not a delayed result; you need to call the delayed function
dask_delayed is not defined here, you probably mean self.dask_delayed
the method dask_delayed does not return anything
you call .compute() (which doesn't exist for a delayed function, only a delayed result), but don't store the output - computing doesn't happen in-place, as you seem to assume.
You probably wanted
self.result = delayed(self.dask_delayed)().compute()
Now you need to fix dask_delayed() so that it returns something. It should not be calling more delayed functions, since it is itself already delayed.
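A minimal sketch of that fix, keeping the names from the question (the body of model() is still whatever your backtest does):
def dask_delayed(self):
    # plain method: no nested delayed calls, just compute and return the results
    return [self.model(asset) for asset in self.assets_list]

def dask_compute(self):
    # wrap the whole method in a single delayed call and keep the computed output
    self.results = delayed(self.dask_delayed)().compute()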
Finally, on filling up memory with pd.read_parquet: it does not surprise me that the in-memory version of the data is bigger; compression/encoding is one of the aims of the parquet format. You could try using dask.dataframe.read_parquet, which is lazy/on-demand.
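A minimal sketch of that lazy alternative (the path is a placeholder):
import dask.dataframe as dd

# nothing is read into memory here; work only happens when a result is actually computed
ddf = dd.read_parquet('path/to/data/folder/')
print(ddf.head())  # reads just enough data to show the first rows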
I've implemented a genetic search algorithm and tried to parallelise it, but I'm getting terrible performance (worse than single-threaded). I suspect this is due to communication overhead.
I have provided pseudo-code below, but in essence the genetic algorithm creates a large pool of "Chromosome" objects, then runs many iterations of:
Score each individual chromosome based on how it performs in a 'world.' The world remains static across iterations.
Randomly select a new population based on the scores calculated in the previous step
Go to step 1 for n iterations
The scoring algorithm (step 1) is the major bottleneck, hence it seemed natural to distribute out the processing of this code.
I have run into a couple of issues I hoped I could get help with:
How can I link the calculated score with the object that was passed to the scoring function by map(), i.e. link each Future holding a score back to a Chromosome? I've done this in a very clunky way by having the calculate_scores() method return the object, but in reality all I need is to send a float back if there is a better way to maintain the link.
The parallel processing of the scoring function is working okay, though takes a long time for map() to iterate through all the objects. However, the subsequent calls to draw_chromosome_from_pool() run very slowly compared to the single-threaded version to the point that I've not yet seen it complete. I have no idea what is causing this as the method always completes quickly in the single-threaded version. Is there some IPC going on to pull the chromosomes back to the local process, even after all the futures have completed? Is the local process de-prioritised in some way?
I am worried that the overall iterative nature of building/rebuilding the pool each cycle is going to cause an enormous amount of data transmission to the workers. The question at the root of this concern: what and when does Dask actually send data back and forth to the worker pool. i.e. when does Environment() get distributed out vs. Chromosome(), and how/when do results come back? I've read the docs but either haven't found the right detail, or am too stupid to understand.
Idealistically, I think (but open to correction) what I want is a distributed architecture where each worker holds the Environment() data locally on a 'permanent' basis, then Chromosome() instance data is distributed for scoring with little duplicated back/forth of unchanged Chromosome() data between iterations.
Very long post, so if you have taken the time to read this, thank you already!
class Chromosome(object):  # Small size: several hundred bytes per instance
    def get_score(self):
        # Returns a float

    def set_score(self, i):
        # Stores a float

class Environment(object):  # Large size: 20-50Mb per instance, but only one instance
    def calculate_scores(self, chromosome):
        # Slow calculation using attributes from chromosome and instance data
        chromosome.set_score(x)
        return chromosome

class Evolver(object):
    def draw_chromosome_from_pool(self, max_score):
        while True:
            individual = np.random.choice(self.chromosome_pool)
            selection_chance = np.random.uniform()
            if selection_chance < individual.get_score() / max_score:
                return individual

    def run_evolution(self):
        self.dask_client = Client()
        self.chromosome_pool = list()
        for i in range(10000):
            self.chromosome_pool.append(Chromosome())

        world_data = LoadWorldData()  # Returns a pandas DataFrame
        self.world = Environment(world_data)

        iterations = 1000
        for i in range(iterations):
            futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
            for future in as_completed(futures):
                c = future.result()
                highest_score = max(highest_score, c.get_score())

            new_pool = set()
            while len(new_pool) < self.pool_size:
                mother = self.draw_chromosome_from_pool(highest_score)
                # do stuff to build a new pool
Yes, each time you call the line
futures = self.dask_client.map(self.world.calculate_scores, self.chromosome_pool)
you are serialising self.world, which is large. You could do this just once before the loop with
future_world = client.scatter(self.world, broadcast=True)
and then in the loop
futures = self.dask_client.map(lambda ch: Environment.calculate_scores(future_world, ch), self.chromosome_pool)
will use the copies already on the workers (or a simple function that does the same). The point is that future_world is just a pointer to stuff already distributed, but dask takes care of this for you.
On the issue of which chromosome is which: using as_completed breaks the order in which you submitted them to map, but this is not actually necessary for your code. You could have used wait to process the results once all the work was done, or simply iterated over future.result() for each future (which waits for each task to finish), and then you would retain the ordering of chromosome_pool.
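A minimal sketch of that ordered variant, using the names from the question (here the scattered world is passed as an explicit argument, which submit resolves to the copy already sitting on each worker):
future_world = self.dask_client.scatter(self.world, broadcast=True)  # ship Environment once
for i in range(iterations):
    futures = [self.dask_client.submit(Environment.calculate_scores, future_world, ch)
               for ch in self.chromosome_pool]
    # reading results in submission order keeps scored[i] aligned with chromosome_pool[i]
    scored = [f.result() for f in futures]
    highest_score = max(c.get_score() for c in scored)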
It seems there are many open questions about the usage of TensorFlow out there, and some TensorFlow developers are active here on Stack Overflow. Here is another question. I want to generate training data on-the-fly in other thread(s) using numpy or something else that does not belong to TensorFlow, but I do not want to go through re-compiling the entire TensorFlow source again and again. I am simply looking for another way. "tf.py_func" seems to be a workaround. But the
This is related to [how-to-prefetch-data-using-a-custom-python-function-in-tensorflow][1]
Here is my MnWE (minimal-not-working-example):
Update (now there is an output, but also a race condition):
import numpy as np
import tensorflow as tf
import threading
import os
import glob
import random
import matplotlib.pyplot as plt

IMAGE_ROOT = "/graphics/projects/data/mscoco2014/data/images/"
files = ["train/COCO_train2014_000000178763.jpg",
         "train/COCO_train2014_000000543841.jpg",
         "train/COCO_train2014_000000364433.jpg",
         "train/COCO_train2014_000000091123.jpg",
         "train/COCO_train2014_000000498916.jpg",
         "train/COCO_train2014_000000429865.jpg",
         "train/COCO_train2014_000000400199.jpg",
         "train/COCO_train2014_000000230367.jpg",
         "train/COCO_train2014_000000281214.jpg",
         "train/COCO_train2014_000000041920.jpg"]

# --------------------------------------------------------------------------------

def pre_process(data):
    """Pre-process image with arbitrary functions

    does not only use tf functions, but arbitrary python code
    """
    # here is the place to do some fancy stuff
    # which might be out of the scope of tf
    return data[0:81, 0, 0].flatten()

def populate_queue(sess, thread_pool, qData_enqueue_op):
    """Put stuff into the data queue

    is responsible for making sure there is always data to process
    for tensorflow
    """
    # until somebody tells me I can stop ...
    while not thread_pool.should_stop():
        # get a random image from MS COCO
        idx = random.randint(0, len(files)) - 1
        data = np.array(plt.imread(os.path.join(IMAGE_ROOT, files[idx])))
        data = pre_process(data)
        # put it into the queue
        sess.run(qData_enqueue_op, feed_dict={data_input: data})

# a simple queue for gathering data (just to keep it simple for now)
qData = tf.FIFOQueue(100, [tf.float32], shapes=[[9, 9]])
data_input = tf.placeholder(tf.float32)
qData_enqueue_op = qData.enqueue([tf.reshape(data_input, [9, 9])])
qData_dequeue_op = qData.dequeue()
init_op = tf.initialize_all_variables()

with tf.Session() as sess:
    # init all variables
    sess.run(init_op)
    # coordinator for the pool of threads
    thread_pool = tf.train.Coordinator()
    # start filling in data
    t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData_enqueue_op))
    t.start()

    # Can I use "tf.train.start_queue_runners" here?
    # How to use multiple threads?

    try:
        while not thread_pool.should_stop():
            print "iter"
            # HERE THE SILENCE BEGIN !!!!!!!!!!!
            batch = sess.run([qData_dequeue_op])
            print batch
    except tf.errors.OutOfRangeError:
        print('Done training -- no more data')
    finally:
        # When done, ask the threads to stop.
        thread_pool.request_stop()

    # now they should definitely stop
    thread_pool.request_stop()
    thread_pool.join([t])
I basically have three questions:
What's wrong with this code? It hangs in an endless loop (which is not debuggable). See the line "HERE THE SILENCE BEGIN ..."
How can I extend this code to use more threads?
Is it worth converting large datasets, or data which can be generated on the fly, to tf.Record?
You have a mistake on this line:
t = threading.Thread(target=populate_queue, args=(sess, thread_pool, qData))
It should be qData_enqueue_op instead of qData. Otherwise your enqueue operations fail, and you get stuck trying to dequeue from a queue of size 0. I saw this when trying to run your code and getting
TypeError: Fetch argument <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> of <google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue object at 0x4bc1f10> has invalid type <class 'google3.third_party.tensorflow.python.ops.data_flow_ops.FIFOQueue'>, must be a string or Tensor. (Can not convert a FIFOQueue into a Tensor or Operation.)
Regarding other questions:
You don't need to start queue runners in this example because you don't have any. Queue runners are created by input producers like string_input_producer, which is essentially a FIFO queue plus the logic to launch threads. You are replicating 50% of queue-runner functionality by launching your own threads that run the enqueue op (the other 50% is closing the queue).
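On the question of using more threads: a minimal sketch, reusing the Coordinator and populate_queue from the code above (the thread count of 4 is an arbitrary choice):
# start several producer threads that all feed the same queue
threads = [threading.Thread(target=populate_queue,
                            args=(sess, thread_pool, qData_enqueue_op))
           for _ in range(4)]
for t in threads:
    t.start()

# ... run the training/dequeue loop as before ...

thread_pool.request_stop()
thread_pool.join(threads)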
RE: converting to tf.Record -- Python has this thing called the Global Interpreter Lock, which means that two bits of Python code can't execute concurrently. In practice that's mitigated by the fact that a lot of the time is spent in numpy C++ code or in I/O ops (which release the GIL). So I think it's a matter of checking whether you are able to achieve the required parallelism using Python pre-processing pipelines.
I've hit the common problem of getting a pickle error when using the multiprocessing module.
My exact problem is that I need to give the function I'm calling some state before I call it in the pool.map function, but in doing so, I cause the attribute lookup __builtin__.function failed error found here.
Based on the linked SO answer, it looks like the only way to use a function in pool.map is to call the defined function itself so that it is looked up outside the scope of the current function.
I feel like I explained the above poorly, so here is the issue in code. :)
Testing without pool
# Function to be called by the multiprocessing pool
def my_func(x):
    massive_list, medium_list, index1, index2 = x
    result = [massive_list[index1 + x][index2:] for x in xrange(10)]
    return result in medium_list

if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]

    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = map(my_func, to_crunch)
This works A-OK and just as expected. The only thing "wrong" with it is that it's slow.
Pool Attempt 1
# (Note: my_func() remains the same)
if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]

    pool = multiprocessing.Pool(2)

    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = pool.map(my_func, to_crunch)
This technically works, but it is a stunning 18x slower! The slowdown must come not only from copying the two massive data structures on each call, but also from pickling/unpickling them as they get passed around. The non-pool version only has to pass a reference to the massive list around, rather than the actual list.
So, having tracked down the bottleneck, I tried to store the two massive lists as state inside my_func. That way, if I understand correctly, they would only need to be copied once for each worker (in my case, 4).
Pool Attempt 2:
I wrap up my_func in a closure passing in the two lists as stored state.
def build_myfunc(m, s):
    def my_func(x):
        massive_list = m  # close the state in there
        small_list = s
        index1, index2 = x
        result = [massive_list[index1 + x][index2:] for x in xrange(10)]
        return result in medium_list
    return my_func

if __name__ == '__main__':
    data = [comprehension which loads a ton of state]
    source = [comprehension which also loads a medium amount of state]

    modified_func = build_myfunc(data, source)

    pool = multiprocessing.Pool(2)

    for num in range(100):
        to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
        result = pool.map(modified_func, to_crunch)
However, this returns the pickle error, because (based on the SO question linked above) you cannot hand multiprocessing a function that is defined inside another function's scope.
Error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
So, is there a way around this problem?
Map is a way to distribute workload. If you store the data in the function, I think you defeat its initial purpose.
Let's try to find out why it is slower. It's not normal, and there must be something else going on.
First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes so a total of 3 processes is involved. How many cores are on the system you're using? What else is running? What's the system load while crunching data?
What does the function do with the data? Does it access disk? Or maybe it uses DB which means there is probably another process accessing disk and cores.
What about memory? Is it sufficient for storing the initial lists?
The right implementation is your Attempt 1.
Try to profile the execution using iostat for example. This way you can spot the bottlenecks.
If it stalls on the cpu then you can try some tweaks to the code.
From another answer on Stack Overflow (by me, so no problem copying and pasting it here :P):
You're using .map(), which collects the results and then returns them. So for a large dataset, you're probably stuck in the collecting phase.
You can try using .imap(), which is the iterator version of .map(), or even .imap_unordered() if the order of results is not important (as it seems from your example).
Here's the relevant documentation. Worth noting the line:
For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
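A minimal sketch of that suggestion applied to the example from the question (the chunksize of 50 is just a value to tune):
pool = multiprocessing.Pool(2)
for num in range(100):
    to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
    # consume results as they become available instead of collecting them all first
    for result in pool.imap_unordered(my_func, to_crunch, chunksize=50):
        pass  # handle each result here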
I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in the response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this, so there should be plenty of room. Btw, I've already experimented with queues (that also worked but ran into the same MemoryError), plus I've looked through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here's what seems most relevant in the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
    ## loads a few million company names into id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads subset of 30K companies into id: name dict (for allocation to workers)
    return needleCompanies

def findNeedle(needle, haystack):
    """ Identify best match and return results with score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(haystack[hayID]):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]
def runMatch(args):
    """ Execute findNeedle and process results for poolWorker batch """
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getTargets(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results

if __name__ == '__main__':
    pool = Pool(processes=numProcesses)
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0,
                                         totalTargets,
                                         targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?
Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool, send each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part does. Then the main program gets a lot simpler:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch, needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            # Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    # one Process per batch, each given its own (batch, first) tuple
    processes = [Process(target=runMatch,
                         args=((targetsPerBatch, first),))
                 for first in xrange(0, totalTargets, targetsPerBatch)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
    global _haystack
    _haystack = getHayStack()

def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...

if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
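A minimal sketch of that measurement inside the pool initializer (Python 2 print syntax to match the question; note that the units of ru_maxrss differ by platform):
import resource

def initPool():
    global _haystack
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    _haystack = getHayStack()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'building the haystack: maxrss went from %s to %s' % (before, after)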
On the other hand, if your total VM usage is still low at the time the first task runs (note that getrusage doesn't directly have any way to see your total VM size - ru_maxrss is often a useful approximation, especially if ru_nswap is 0), the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.