Why does joblib parallel execution make runtime much slower? - python

I want to shuffle values in a 3D numpy-array, but only when they are > 0.
When I run my function with a single core, it is much faster than with even 2 cores. The slowdown is way beyond the overhead of creating new Python processes. What am I missing?
The following code outputs:
random shuffling of markers started
time in serial execution: 1.0288s
time executing in parallel with num_cores=1: 0.9056s
time executing in parallel with num_cores=2: 273.5253s
import numpy as np
import time
from random import shuffle
from joblib import Parallel, delayed
import multiprocessing

def randomizeVoxels(V, markerLUT):
    V_rand = V.copy()
    # the x/y naming here does not match the outer convention, which will depend on the permutation
    for ix in range(V.shape[0]):
        for iy in range(V.shape[1]):
            if V[ix, iy] > 0:
                V_rand[ix, iy] = markerLUT[V[ix, iy]]
    return V_rand
V_ori = np.arange(1000000, -1000000, -1).reshape(100, 100, 200)
V_rand = V_ori.copy()

listMarkers = np.unique(V_ori)
listMarkers = [val for val in listMarkers if val > 0]

print("random shuffling of markers started\n")

reassignedMarkers = listMarkers.copy()
# random shuffling of original markers
shuffle(reassignedMarkers)

markerLUT = {}
for i, iMark in enumerate(listMarkers):
    markerLUT[iMark] = reassignedMarkers[i]

tic = time.perf_counter()
for ix in range(len(V_ori)):
    for iy in range(len(V_ori[0])):
        for iz in range(len(V_ori[0][0])):
            if V_ori[ix, iy, iz] > 0:
                V_rand[ix, iy, iz] = markerLUT[V_ori[ix, iy, iz]]
toc = time.perf_counter()

print("time in serial execution: \t\t\t{: >4.4f} s".format(toc - tic))
#######################################################################
num_cores = 1
V_rand = V_ori.copy()

tic = time.perf_counter()
results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0])
)
for i, resTuple in enumerate(results):
    V_rand[i, :, :] = resTuple
toc = time.perf_counter()

print("time executing in parallel with num_cores={}:\t{: >4.4f} s".format(num_cores, toc - tic))
num_cores = 2
V_rand = V_ori.copy()
MASK = "time executing in parallel with num_cores={}:\t {: >4.4f}s"

tic = time.perf_counter()  # ----------------------------- [PERF-me]
results = Parallel(n_jobs=num_cores)(
    delayed(randomizeVoxels)(V_ori[imSlice, :, :], markerLUT)
    for imSlice in range(V_ori.shape[0])
)
for i, resTuple in enumerate(results):
    V_rand[i, :, :] = resTuple
toc = time.perf_counter()  # ----------------------------- [PERF-me]

print(MASK.format(num_cores, toc - tic))

Q : "What am I missing?"
Most probably the memory-I/O bottleneck.
The numpy part of the processing is very shallow here: the shuffle does not compute anything, it only moves data between pairs of locations. That leaves almost no useful work behind which the CPU could hide the memory-access latencies (out-of-order execution and speculative branch prediction do not help a memory-bound, non-branching loop on a contemporary super-scalar, multi-core/many-core NUMA design), so the job is limited by how fast data can move to and from RAM, not by how fast the cores can compute.
This is most probably why even the first spawned worker process makes things worse. If it lands on the same physical CPU core, the two processes merely interleave on the same, already saturated, memory-I/O channels, with even worse chances for latency masking. If it lands on another core (or another NUMA node), every non-local memory access pays an additional cross-QPI latency, again worsening the chances for masking the memory-I/O latency.
CPU-core hopping, enforced by the colliding effects of the CPU-clock boost policy and thermal management (which migrates the process to camp on a next, colder core), invalidates all CPU-core cache benefits: the data that was already sitting in the fastest L1 data cache is not available on the next, colder core and has to be re-fetched (for array objects with larger memory footprints perhaps even across the QPI link). Harnessing more cores therefore has no trivial effect on the resulting efficiency.
;o) Numpy's high-performance, smart processing is not the one to be blamed here - the very opposite - it clearly exposes the CPU "starvation" state, known for ages to be The Very Performance Ceiling of all our modern CPUs (which is exactly why we see so many-core CPUs, trying to circumvent this bottleneck by adding more and more cores). Last but not least, the code as-is contains an immense number of opportunities to improve its performance, numpy-smart vectorisation being the first one to name, avoiding the range() loops - yet all such tips will finally bang into the very same trouble: the CPU-starvation ceiling.
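As a concrete illustration of that last point, here is a minimal sketch (not part of the original answer) of a fully vectorised replacement for the serial triple loop. It assumes, as in the example above, that all markers are non-negative integers, so a dense lookup array can be used:
import numpy as np

# dense lookup table: identity for non-markers, shuffled value for markers > 0
lut = np.arange(V_ori.max() + 1)
for iMark, iReassigned in markerLUT.items():
    lut[iMark] = iReassigned

mask = V_ori > 0
V_rand = V_ori.copy()
V_rand[mask] = lut[V_ori[mask]]      # one vectorised gather, no Python loops
This stays memory-bound, but it removes the per-voxel Python interpreter overhead entirely, which is usually a far bigger win than adding processes.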

The reason using multiple processes can be slower than a single process is that multiprocessing requires the arguments to be serialized before they are sent to the worker processes. This introduces additional overhead that scales with the amount of data that must be serialized.
The Python multiprocessing library uses pickle for this, while joblib's default loky backend ships its own serialization machinery. The documentation can be found here: https://joblib.readthedocs.io/en/latest/parallel.html#serialization-processes
The linked documentation also gives more context about why this serialization is needed.
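If you do want to keep the joblib version, one knob worth knowing about is joblib's automatic memory-mapping of large numpy arguments, which replaces per-task pickling of the array with a single shared dump on disk. A hedged sketch (randomizeSlice is a made-up wrapper, and the 1 MB threshold is only illustrative):
from joblib import Parallel, delayed

def randomizeSlice(V, imSlice, markerLUT):
    # V arrives as a read-only memmap shared by all workers
    return randomizeVoxels(V[imSlice, :, :], markerLUT)

# arrays bigger than max_nbytes (here the full V_ori) are dumped to a shared
# memmap once, instead of being pickled and shipped for every task
results = Parallel(n_jobs=2, max_nbytes='1M', mmap_mode='r')(
    delayed(randomizeSlice)(V_ori, imSlice, markerLUT)
    for imSlice in range(V_ori.shape[0])
)
Note that the markerLUT dict (one million entries in this example) is still pickled for every task, so the vectorised rewrite shown above remains the bigger win.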

Related

Python 3.7 MultiThreading strategy

I have a workload that consists of a very slow query, which returns a HUGE amount of data that has to be parsed and calculated, all of that in a loop. Basically, it looks like this:
for x in lastTenYears:
    myData = DownloadData(x)              # takes about ~40-50 [sec]
    parsedData.append(ParseData(myData))  # takes another +30-60 [sec]
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
How can I achieve this parallelism of operations?
Ideally speaking, I would like to have 1 thread always downloading, and N threads doing the parsing. The download part is actually a query against a database, so it's not good to run a bunch of them in parallel...
Details:
The parsing of the data is heavily CPU-bound, and consists of raw math calculations and nothing else.
Using Python 3.7.4
1) Use a thread-safe queue, queue.Queue (FIFO). At the top level define
my_queue = queue.Queue()
parsedData = []
2) On the first thread, kick off the data loading
my_queue.put(DownloadData(x))
On the second thread
if not my_queue.empty():
    myData = my_queue.get()
    parsedData.append(ParseData(myData))
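Filling in the details of that sketch (DownloadData, ParseData and lastTenYears are the asker's names; the sentinel object is an assumption added here), a runnable version could look like this. Note that, as the next answer explains, the CPU-bound ParseData still holds the GIL, so this mainly overlaps the database wait with one parse at a time:
import queue
import threading

my_queue = queue.Queue()
parsedData = []
DONE = object()   # sentinel marking the end of the downloads

def downloader():
    for x in lastTenYears:
        my_queue.put(DownloadData(x))   # ~40-50 s per call, runs serially
    my_queue.put(DONE)

def parser():
    while True:
        myData = my_queue.get()
        if myData is DONE:
            break
        parsedData.append(ParseData(myData))  # ~30-60 s per call

t1 = threading.Thread(target=downloader)
t2 = threading.Thread(target=parser)
t1.start(); t2.start()
t1.join(); t2.join()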
If your program is CPU-bound you will have a hard time doing anything else in other threads due to the GIL (global interpreter lock).
Here is a link to an article which might help you to understand the topic: https://opensource.com/article/17/4/grok-gil
Downloading the data in a sub-process is most likely the best approach.
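A minimal sketch of that idea (again reusing the asker's names): keep the downloads strictly serial in the main process, and hand each finished batch to a small process pool so the CPU-bound parsing runs outside the GIL. The cost is that each batch must be pickled across the process boundary, which matters if the data is really huge:
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        futures = []
        for x in lastTenYears:
            myData = DownloadData(x)                         # one DB query at a time
            futures.append(pool.submit(ParseData, myData))   # parse in a worker process
        parsedData = [fut.result() for fut in futures]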
It's hard to say if and how much this will actually help (as I have nothing to test...), but you might try a multiprocessing.Pool. It handles all the dirty work for you and you can customize number of processes, chunk size etc.
from multiprocessing import Pool

def worker(x):
    myData = DownloadData(x)
    return ParseData(myData)

if __name__ == "__main__":
    processes = None  # defaults to os.cpu_count()
    chunksize = 1
    with Pool(processes) as pool:
        parsedData = pool.map(worker, lastTenYears, chunksize)
Here for the example I use the map method, but according to your needs you might want to use imap or map_async.
Q : How can I achieve this parallelism of operations?
The first step is to realise that the use-case requested above is not a [PARALLEL] code-execution, but an un-ordered batch execution, limited by a resources-use policy, of a strict sequence of pairs:
First, a remote [DB-Query] (returning, cit., a HUGE amount of data)
Next, a local [CPU-process] (of the, cit., HUGE amount of data just returned)
The latency of the first step could in principle be masked (if it were permitted, but it is not permitted, due to the wish not to overload the DB host). The latency of the second step cannot be masked; the most one can do is start the next I/O-bound DB query during parsing, and only without violating the rule of keeping the DB machine under a mild workload.
As I believe you have noticed, if I could run the data parsing on a thread, I could download the next batch of data while the parsing happens.
It is high time to make things clear and sound :
Facts :
A )
CPU-bound tasks will never run faster in any number N of threads in the python-GIL-lock controlled ecosystem (since ever and forever, as Guido van Rossum has expressed), because the GIL-lock enforces a re-[SERIAL]-isation: the more threads "work", the more threads actually wait to acquire the GIL-lock, each one "getting" it for only about a 1 / ( N + 1 )-th fraction of the time, so the result is again a pure-[SERIAL] duration of N * ( 30 - 60 ) [sec] for the parsing.
B )
It makes no sense to off-load the I/O-bound task into full process-based concurrent execution: instantiating a sub-process means a full copy of the python process (on Windows even duplicating the whole python interpreter state with all its data), and there are smarter techniques for I/O-bound processing, where the GIL-lock does not hurt so much.
C )
The whole concept of N-parsing : 1-querying is wrong in principle. The maximum achievable goal is to mask the latency of the I/O process (where that makes sense), yet each and every query takes the said ~ 40-50 [sec], so no second pack of data to parse will ever be present before another ~ 40-50 [sec] have passed. No second worker will ever get anything to parse before T0 + ~ 80-100 [sec], so wishing for N (unbounded) workers is possible but awfully anti-productive: they would just sit waiting for data (and it is even worse for N GIL-mutexed "waiting" agents). A rough worked timeline follows below.
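To put rough numbers on fact C (the asker's figures, idealised here to 45 s per download and 45 s per parse, are assumptions):
n_batches = 10           # lastTenYears
t_download = 45          # [s], asker's estimate
t_parse = 45             # [s], asker's estimate

serial = n_batches * (t_download + t_parse)      # 900 s, the current loop
# downloads stay strictly serial; each parse hides behind the next download,
# only the very last parse has nothing left to hide behind:
pipelined = n_batches * t_download + t_parse     # 495 s

print(serial, pipelined)   # a second parsing worker would only ever sit waiting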

how to efficiently use multiprocessing to speed up huge amount of tiny tasks?

I'm having a bit of trouble with Python's multiprocessing.Pool. I have two numpy arrays a and b, in which
a.shape=(10000,3)
and
b.shape=(1000000000,3)
Then I have a function which does some computation like
def role(array, point):
    sub = array - point
    return 1 / (np.sqrt(np.min(np.sum(sub * sub, axis=-1))) + 0.001) ** 2
Next, I need to compute
[role(a, point) for point in b]
To speed it up, I try to use
cpu_num = 4
m = multiprocessing.Pool(cpu_num)
cost_list = m.starmap(role, [(a, point) for point in b])
m.close()
The whole process takes around 70 s, but if I set cpu_num = 1, the processing time decreases to 60 s... My laptop has 6 cores, for reference.
Here I have two questions:
Is there something I did wrong with multiprocessing.Pool? Why did the processing time increase when I set cpu_num = 4?
For a task like this (where each loop iteration is a very tiny piece of work), should I use multiprocessing to speed it up? I feel like the time Python spends filling the Pool with tasks is longer than the time spent in the role function itself...
Any suggestions are really welcome.
Multiprocessing comes with some overhead (to create new processes), which is why it's not a very good choice when you have lots of tiny tasks, where the overhead of process creation might outweigh the benefit of parallelizing.
Have you considered vectorizing your problem?
In particular, if you broadcast the variable b, you get:
sub = a - b[:, np.newaxis]  # broadcast b against a
1. / (np.sqrt(np.min(np.sum(sub ** 2, axis=2), axis=-1)) + 0.001) ** 2
I believe you could then still reduce the complexity of the last expression a bit, as you're creating the square of a square root, which seems redundant (note that I'm assuming the 0.001 constant value is merely there to avoid some non-sensible operation like division by zero).
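One caveat (mine, not the answerer's): with b.shape = (1e9, 3), the full broadcast would materialise an array of roughly 1e9 x 10000 x 3 floats, i.e. hundreds of terabytes, so in practice b has to be processed in chunks. A hedged sketch of a chunked version (the chunk size is an arbitrary assumption):
import numpy as np

def role_chunk(a, b_chunk):
    # (chunk, 1, 3) - (10000, 3) broadcasts to (chunk, 10000, 3)
    sub = b_chunk[:, np.newaxis, :] - a
    d2 = np.sum(sub * sub, axis=-1)                     # squared distances
    return 1.0 / (np.sqrt(d2.min(axis=-1)) + 0.001) ** 2

chunk = 1000   # keeps the temporary at ~1000*10000*3 floats (~240 MB)
cost_list = np.concatenate(
    [role_chunk(a, b[i:i + chunk]) for i in range(0, len(b), chunk)]
)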
If the tasks are too tiny, then the multiprocessing overhead will be your bottleneck and you will win nothing.
If the amount of data per task that you have to pass to a worker, or that the worker has to return, is large, then you will also not win a lot (or even win nothing).
If you have 10,000 tiny tasks, then I recommend creating a list of meta tasks.
Each meta task would consist of executing for example 20 tiny tasks.
meta_tasks = []
for idx in range(0, len(tiny_tasks), 20):
    meta_tasks.append(tiny_tasks[idx:idx + 20])
Then pass the meta tasks to your worker pool.
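A hedged sketch of how the worker side of those meta tasks could look for the role()/starmap example from the question (run_meta_task is a made-up helper; on Linux the array a is inherited by the forked workers, on Windows it would have to be passed or re-created):
from multiprocessing import Pool

def run_meta_task(point_batch):
    # one inter-process hand-off now covers 20 tiny role() calls
    return [role(a, point) for point in point_batch]

if __name__ == "__main__":
    meta_tasks = [b[idx:idx + 20] for idx in range(0, len(b), 20)]
    with Pool(4) as pool:
        cost_list = [c for batch in pool.map(run_meta_task, meta_tasks)
                       for c in batch]
(multiprocessing.Pool.map also has a chunksize argument that batches the dispatch for you; the explicit version above additionally avoids shipping a copy of a with every point.)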

Why does Dask perform so slower while multiprocessing perform so much faster?

To get a better understanding of parallelism, I am comparing a set of different pieces of code.
Here is the basic one (code_piece_1).
for loop
import time

# setup
problem_size = 1e7
items = range(9)

# serial
def counter(num=0):
    junk = 0
    for i in range(int(problem_size)):
        junk += 1
        junk -= 1
    return num

def sum_list(args):
    print("sum_list fn:", args)
    return sum(args)

start = time.time()
summed = sum_list([counter(i) for i in items])
print(summed)
print('for loop {}s'.format(time.time() - start))
This code ran a time consumer in a serial style (for loop) and got this result
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
for loop 8.7735116481781s
multiprocessing
Can the multiprocessing style be viewed as a way to implement parallel computing?
I assume yes, since the documentation says so.
Here is code_piece_2
import multiprocessing
start = time.time()
pool = multiprocessing.Pool(len(items))
num_to_sum = pool.map(counter, items)
print(sum_list(num_to_sum))
print('pool.map {}s'.format(time.time() - start))
This code ran the same time consumer in multiprocessing style and got this result
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
pool.map 1.6011056900024414s
Obviously, the multiprocessing version is faster than the serial one in this particular case.
Dask
Dask is a flexible library for parallel computing in Python.
This code (code_piece_3) ran the same time consumer with Dask (I am not sure whether I use Dask the right way.)
from dask import delayed

@delayed
def counter(num=0):
    junk = 0
    for i in range(int(problem_size)):
        junk += 1
        junk -= 1
    return num

@delayed
def sum_list(args):
    print("sum_list fn:", args)
    return sum(args)

start = time.time()
summed = sum_list([counter(i) for i in items])
print(summed.compute())
print('dask delayed {}s'.format(time.time() - start))
I got
sum_list fn: [0, 1, 2, 3, 4, 5, 6, 7, 8]
36
dask delayed 10.288054704666138s
My CPU has 6 physical cores.
Question
Why does Dask perform so slower while multiprocessing perform so much faster?
Am I using Dask the wrong way? If yes, what is the right way?
Note: please discuss this particular case or other specific, concrete cases. Please do NOT talk generally.
In your example, Dask is slower than Python multiprocessing because you don't specify a scheduler, so Dask uses the multithreading scheduler, which is the default. As mdurant has pointed out, your code does not release the GIL, therefore multithreading cannot execute the task graph in parallel.
Have a look here for a good overview over the topic: https://docs.dask.org/en/stable/scheduler-overview.html
For your code, you could switch to the multiprocessing backend by calling:
.compute(scheduler='processes').
If you use the multiprocessing backend, all communication between processes still needs to pass through the main process. You therefore might also want to check out the distributed scheduler, where worker processes can communicate with each other directly, which is beneficial especially for complex task graphs. Also, the distributed scheduler supports work-stealing to balance work between processes and has a web interface providing some diagnostic information about running tasks. It often makes sense to use the distributed scheduler rather than the multiprocessing scheduler even if you only want to compute on a local machine.
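For the code above, a minimal sketch of both options (requires the distributed package for the second one; the worker counts are illustrative assumptions):
from dask.distributed import Client, LocalCluster

# option 1: the process-based scheduler, no extra setup needed
print(summed.compute(scheduler='processes'))

# option 2: a local distributed scheduler with a diagnostics dashboard
cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster)
print(summed.compute())   # computations now run on the distributed workers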
Q : Why did parallel computing take longer than a serial one?
Because way more instructions are loaded onto the CPU to get executed ("awfully" many even before the first step of the instructed / intended block of calculations ever gets into the CPU) than in the pure-[SERIAL] case, where no add-on costs were added to the flow of execution.
These add-on operations (hidden from the source code) are paid for both in the [TIME]-domain (the duration of all such "preparations") and in the [SPACE]-domain (allocating more RAM to contain all the structures needed for the [PARALLEL]-operated code - well, most often still only a just-[CONCURRENT]-operated code, if we are pedantic and accurate in terminology). That extra RAM traffic again costs you [TIME], as each and every RAM-I/O costs you more than about 1/3 of a [us], ~ 300-380 [ns].
The result?
Unless your workload-package has a "sufficient enough" amount of work that can get executed in parallel (non-blocking, having no locks, no mutexes, no sharing, no dependencies, no I/O, ... indeed independent, with minimum RAM-I/O re-fetches), it is very easy to "pay way more than you ever get back".
For details on the add-on costs and the things that have such a strong effect on the resulting speedup, start by reading the criticism of blindly using the original, overhead-naive formulation of Amdahl's law here.
The code you have requires the GIL, so only one task is running at a time, and all you are getting is extra overhead. If you use, for example, the distributed scheduler with processes, then you get much better performance.

How to speedup python function using parallel processing?

I have two functions. Each function runs a for loop.
def f1(df1, df2):
    final_items = []
    for ind, row in df1.iterrows():
        id = row['Id']
        some_num = row['some_num']
        timestamp = row['Timestamp']
        res = f2(df=df2, id=id, some_num=some_num, timestamp=timestamp)
        final_items.append(res)
    return final_items

def f2(df, id, some_num, timestamp):
    for ind, row in df.iterrows():
        filename = row['some_filename']
        dfx = reader(key=filename)  # user defined; object reader
        # Assign variables
        st_ID = dfx["Id"]
        st_some_num = dfx["some_num"]
        st_time_first = dfx['some_first_time_variable']
        st_time_last = dfx['some_last_time_variable']
        if id == st_ID and some_num == st_some_num:
            if st_time_first <= timestamp and st_time_last >= timestamp:
                return filename
            else:
                return None
        else:
            continue
The first function calls the second function as shown. The first loop occurs 2000 times, i.e., there are 2000 rows in the first dataframe.
The second function (the one that is called from f1()) runs 10 Million times.
My objective is to speed up f2() using parallel processing. I have tried using python packages like Multiprocessing and Ray but I am new to the world of parallel processing and am running into a lot of roadblocks due to lack of experience.
Can someone help me speed up the function so that it takes considerably less time to execute for 10 million rows?
FACTS : the initial formulation asks for 2E3 rows in f1(), each requesting f2() to scan 1E7 rows of the "shared" df2, and to call an unspecified reader() process for each of those rows to receive some other data, based on which it decides about further processing or returns.
My objective is to speed up f2() using parallel processing
Can some one help me speed up the function so that it takes considerably lesser time to execute for 10 million rows?
Surprise No.1 : This is NOT a use-case of parallel-processing
The problem, as formulated above, calls file-I/O operations many times over, and these are never truly [PARALLEL] down there at the physical storage level, are they? Never. Any and all smart file-I/O (pre-)caching and sliding-window file-I/O tricks cease to help at even moderate levels of a just-[CONCURRENT] workload, and often wreak havoc when going a single step beyond that principal workload ceiling, due to the physically limited scope of memory resources, the I/O-bus width x speed, and the weakest chain element's latency increasing under still-growing traffic loads.
The iterators controlling the workflow are pure-[SERIAL] "work dispatchers", which sequentially step through their domain of values, one after another, and order just another file to get (again iteratively) processed.
Surprise No.2 : Vectorisation will NOT help
While vectorised operations are smart for many vector/matrix/tensor processing schemes (love using numpy + numba), the conditio sine qua non is that the problem has to be:
"compact" - so that it gets easily expressed by vectorising syntax-tricks, which this original [SERIAL]-row-after-row-after-row to find a first and only first "device_ID match" in a "remote"-file-content, next return None if not ( <exprA> and <exprB> ) else filename
"uniform", i.e. non-sequential "until" something first happens - the vectorisation is great to "cover" the whole N-dimensional space with smart-internal code for (best) orthogonal-sub-structures processing uniformly "across" the whole space. On the contrary here, the vectorisation is hard to re-sequentialise "back" to stop (poison) it from any further smart-producing results right after the first occurrence was matched... (ref.1 above "find first and only first occurrence ( and die / return ) )
"memory-adequately-sized", i.e. given any add-on logic is added to the vectorised task, whenever a code asks vectorisation engine to process N-dim "data" using some sort of where(...)-clause, the interim product of such where(...)-condition is consuming additional [SPACE]-footprint ( best in RAM, worse in SWAP-file-I/O ) and this additional memory-footpring may soon devastate any and all benefits from the idea of vectorised processing re-formulation ( not speaking about the cases that due to such immense additional memory-allocation needs result but in a swap-file-I/O suffocation of the whole process flow ) where(...)-clause over a 10E6 rows is expensive, the more once the global strategy is to execute that 1 < nCPUs < 2E3 many times ( as noted above, vectorisation goes uniformly "across" the whole range of data, no sequentially beneficial shortcuts to stop after a first and only the first match... )
THE BEST NEXT STEP : dependency-graph -> latencies -> bottleneck
The problem, as formulated above, is a just-[CONCURRENT] processing problem, where the actual blocking on, or availability of, "shared" resources limits the overall processing duration. With no more than a given set of resources to use, there is no magic chance to speed the concurrent usage patterns up; what matters are the "amounts" of free resources to harness and their respective response "latencies" - the latencies under high levels of concurrent workload, of course, not the idealistic, unloaded response times.
If you have no profiling data, measure/benchmark at least the main characteristic durations:
a) the net f2()-per-row process latency [ min, Avg, MAX, StDev] in [us]
b) the reader()-related setup/retrieve latency [ min, Avg, MAX, StDev] in [us]
Then test whether the reader()'s performance does or does not represent a bottleneck - a ceiling for any increased-concurrency process flow.
If it does, you know the maximum workload it can handle, and based on this the concurrent processing may gain speed, but only up to this reader()-determined performance ceiling. A minimal timing sketch follows below.
All the rest is elementary.
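A hedged sketch of such a measurement, reusing the names from the question (the sample sizes are arbitrary assumptions):
import time

# b) reader()-related setup/retrieve latency over a sample of files
reader_us = []
for filename in df2['some_filename'].head(200):
    t0 = time.perf_counter()
    reader(key=filename)
    reader_us.append((time.perf_counter() - t0) * 1e6)
print("reader() [us]  min/avg/max:",
      min(reader_us), sum(reader_us) / len(reader_us), max(reader_us))

# a) one full f2() call, i.e. the per-row cost as seen from f1()
row = df1.iloc[0]
t0 = time.perf_counter()
f2(df=df2, id=row['Id'], some_num=row['some_num'], timestamp=row['Timestamp'])
print("one f2() call [s]:", time.perf_counter() - t0)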
Epilogue
A concurrent-processing setup engineered from such latency data, aware of the (un)avoidable bottlenecks and right-sized for maximum latency masking, is about the most one can expect here to help.
Given a chance to re-engineer and re-factor the global strategy, much faster processing times might be possible, but they would have to come from something other than a pure-[SERIAL] tandem of sequential iterators instructing a sequence of about ~ 20,000,000,000 calls to an unknown reader() code.
Yet, that goes way beyond the scope of this Stack Overflow MCVE problem definition.
Hope this might have sparked some fresh views on how to make the results faster. Smart ideas may bring processing times down from a few days to a few minutes (!). Having gone this way a few times, no one will believe how fulfilling this hard work can be for both you and your customer(s), if you hit upon a right-sized solution for their business domain.

Multiprocessing Pool in Python - Only single CPU is utilized

Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple n-gram generation from a string problem.
#!/usr/bin/python
import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input) - n + 1):
        output.append(input[i:i + n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000  # 100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: ' + str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
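One way to see how lopsided that ratio is: time the pickle round-trip of one task's payload against the "work" f() actually does. A rough sketch (the payload is only a stand-in for one n-gram of the random string):
import pickle
import time

payload = ['123456789'] * 50      # stand-in for one n-gram from the example

t0 = time.time()
for _ in range(10000):
    pickle.loads(pickle.dumps(payload))   # what every task pays in IPC alone
t_pickle = time.time() - t0

t0 = time.time()
for _ in range(10000):
    f(payload)                            # the "useful work": return x
t_work = time.time() - t0

print("%.3fs pickling vs %.3fs work" % (t_pickle, t_work))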
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem.
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem (see the sketch after this list).
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
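A minimal sketch of the initializer idea mentioned above (the names here are made up for illustration):
import multiprocessing

_shared = None          # filled once per worker process

def init_worker(big_data):
    global _shared
    _shared = big_data  # shipped exactly once per process, not once per task

def lookup(i):
    return _shared[i]   # each task now carries only a small index

if __name__ == '__main__':
    big_data = range(10 ** 6)   # stand-in for the giant list of n-grams
    p = multiprocessing.Pool(initializer=init_worker, initargs=(big_data,))
    res = p.map(lookup, range(100))
    p.close()
    p.join()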
The problem is that your f() function (which is the one running in the separate processes) is doing nothing special, hence it is not putting any load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
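A hedged sketch of the second option, restructured so that rand_str is built at module level before the Pool is created (on Linux the forked workers then inherit it for free; on Windows it would be re-created in each worker), and only n travels to each task:
# built at module level, before the Pool is created
rand_list = random.sample(xrange(100000000), 100000000)
rand_str = ' '.join(str(i) for i in rand_list)

def heavy(n):
    # the expensive part now runs inside a worker process
    return ngrams(rand_str, n)

def foo():
    p = multiprocessing.Pool()
    res = list(p.imap_unordered(heavy, xrange(1, 100)))
Note that each heavy() call still returns a large list of n-grams, so the return-value pickling discussed in the first answer remains a real cost.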
