Both list comprehensions and map calls should -- at least in theory -- be relatively easy to parallelize: each calculation inside a list comprehension could be done independently of the calculation of all the other elements. For example, in the expression
[ x*x for x in range(1000) ]
each x*x calculation could (at least in theory) be done in parallel.
My question is: is there any Python module / Python implementation / Python programming trick to parallelize a list-comprehension calculation (in order to use all 16 / 32 / ... cores, or to distribute the calculation over a computer grid or a cloud)?
As Ken said, a list comprehension can't do this by itself, but with 2.6's multiprocessing module it's pretty easy to parallelize such computations.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2   # arbitrary default

def square(n):
    return n * n

pool = multiprocessing.Pool(processes=cpus)
print(pool.map(square, range(1000)))
There are also examples in the documentation that show how to do this using Managers, which should allow for distributed computations as well.
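The documentation's distributed example boils down to serving a shared object (such as a job queue) from a customised manager that workers on other machines connect to. A minimal sketch (the address, port and authkey below are placeholders, not from the answer above) might look like this:

from multiprocessing.managers import BaseManager
import queue

job_queue = queue.Queue()

class QueueManager(BaseManager):
    pass

# Expose the queue under a name that remote clients can request.
QueueManager.register('get_jobs', callable=lambda: job_queue)

if __name__ == '__main__':
    manager = QueueManager(address=('', 50000), authkey=b'secret')
    server = manager.get_server()
    server.serve_forever()  # workers connect with QueueManager(address=..., authkey=...)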
For shared-memory parallelism, I recommend joblib:
from joblib import delayed, Parallel

NUM_CPUS = 4  # or e.g. multiprocessing.cpu_count()

def square(x): return x*x

values = Parallel(n_jobs=NUM_CPUS)(delayed(square)(x) for x in range(1000))
On automatic parallelisation of list comprehensions
IMHO, effective automatic parallelisation of list comprehensions would be impossible without additional information (such as that provided by directives in OpenMP), or without limiting it to expressions that involve only built-in types/methods.
Unless there is a guarantee that the processing done on each list item has no side effects, there is a possibility that the results will be invalid (or at least different) if done out of order.
# Artificial example
counter = 0

def g(x):  # func with side-effect
    global counter
    counter = counter + 1
    return x + counter

vals = [g(i) for i in range(100)]  # diff result when not done in order
There is also the issue of task distribution. How should the problem space be decomposed?
If the processing of each element forms a task (~ task farm), then when there are many elements each involving a trivial calculation, the overheads of managing the tasks will swamp the performance gains of parallelisation.
One could also take the data decomposition approach where the problem space is divided equally among the available processes.
The fact that list comprehensions also work with generators makes this slightly tricky; however, this is probably not a show stopper if the overhead of pre-iterating the generator is acceptable. Of course, there is also the possibility of generators with side effects, which can change the outcome if subsequent items are prematurely iterated. Very unlikely, but possible.
A bigger concern would be load imbalance across processes. There is no guarantee that each element takes the same amount of time to process, so statically partitioned data may result in one process doing most of the work while the others idle their time away.
Breaking the list down into smaller chunks and handing them out as each child process becomes available is a good compromise; however, a good selection of chunk size is application dependent and hence not doable without more information from the user.
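A hedged sketch of that chunking trade-off, using multiprocessing.Pool.map (the chunk sizes below are arbitrary and would need tuning per application):

import multiprocessing

def square(n):
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool() as pool:
        fine = pool.map(square, range(1000), chunksize=1)      # many tiny tasks: high management overhead
        coarse = pool.map(square, range(1000), chunksize=250)  # few big chunks: risk of load imbalance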
Alternatives
As mentioned in several other answers, there are many approaches and parallel computing modules/frameworks to choose from, depending on one's requirements.
Having used only MPI (in C), with no experience using Python for parallel processing, I am not in a position to vouch for any (although, upon a quick scan through,
multiprocessing, jug, pp and pyro stand out).
If a requirement is to stick as close as possible to list comprehension, then jug seems to be the closest match. From the tutorial, distributing tasks across multiple instances can be as simple as:
from glob import glob
from jug.task import Task
from yourmodule import process_data

tasks = [Task(process_data, infile) for infile in glob('*.dat')]
While that does something similar to multiprocessing.Pool.map(), jug can use different backends for synchronising processes and storing intermediate results (redis, filesystem, in-memory), which means the processes can span nodes in a cluster.
As the above answers point out, this is actually pretty hard to do automatically. Then I think the question is actually how to do it in the easiest way possible. Ideally, a solution wouldn't require you to know things like "how many cores do I have". Another property that you might want is to be able to still do the list comprehension in a single readable line.
Some of the given answers already seem to have nice properties like this, but another alternative is Ray (docs), which is a framework for writing parallel Python. In Ray, you would do it like this:
import ray

# Start Ray. This creates some processes that can do work in parallel.
ray.init()

# Add this line to signify that the function can be run in parallel (as a
# "task"). Ray will load-balance different `square` tasks automatically.
@ray.remote
def square(x):
    return x * x

# Create some parallel work using a list comprehension, then block until the
# results are ready with `ray.get`.
ray.get([square.remote(x) for x in range(1000)])
Using the futures.{Thread,Process}PoolExecutor.map(func, *iterables, timeout=None) and futures.as_completed(future_instances, timeout=None) functions from the new 3.2 concurrent.futures package could help.
It's also available as a 2.6+ backport.
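If it is not clear how those two fit together, here is a minimal sketch (not from the answer above) of both patterns: executor.map() returns results in input order, while submit() plus as_completed() yields them as they finish:

from concurrent import futures

def square(n):
    return n * n

if __name__ == '__main__':
    with futures.ProcessPoolExecutor() as executor:
        ordered = list(executor.map(square, range(1000)))             # results in input order
        futs = [executor.submit(square, n) for n in range(10)]
        unordered = [f.result() for f in futures.as_completed(futs)]  # results in completion order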
No, because list comprehension itself is a sort of a C-optimized macro. If you pull it out and parallelize it, then it's not a list comprehension, it's just a good old fashioned MapReduce.
But you can easily parallelize your example. Here's a good tutorial on using MapReduce with Python's parallelization library:
http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/
Not within a list comprehension AFAIK.
You could certainly do it with a traditional for loop and the multiprocessing/threading modules.
There is a comprehensive list of parallel packages for Python here:
http://wiki.python.org/moin/ParallelProcessing
I'm not sure if any handle the splitting of a list comprehension construct directly, but it should be trivial to formulate the same problem in a non-list comprehension way that can be easily forked to a number of different processors. I'm not familiar with cloud computing parallelization, but I've had some success with mpi4py on multi-core machines and over clusters. The biggest issue that you'll have to think about is whether the communication overhead is going to kill any gains you get from parallelizing the problem.
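For what it's worth, here is a rough mpi4py sketch of the data-decomposition approach just described (run with something like mpiexec -n 4 python script.py); the splitting scheme and names are only illustrative:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1000))
    chunks = [data[i::size] for i in range(size)]  # one chunk per rank
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)     # each rank receives its chunk
local = [x * x for x in chunk]           # do the local work
gathered = comm.gather(local, root=0)    # rank 0 collects all partial results

if rank == 0:
    results = [y for part in gathered for y in part]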
Edit: The following might also be of interest:
http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/
You can use asyncio (see the asyncio documentation). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accommodate any kind of problem.
import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def your_function(argument):
    ...  # your code here
Now this function will be run in parallel whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well: the loop itself is still sequential, but every iteration runs in parallel to the main program as soon as the interpreter gets there.
For your specific case, you can do:
import asyncio
import time

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def op(x):  # Do any operation you want
    time.sleep(1)
    print(f"function called for {x=}\n", end='')
    return x*x

loop = asyncio.get_event_loop()                       # Have a new event loop
looper = asyncio.gather(*[op(i) for i in range(20)])  # Run the loop; doing 20 for a better demo
results = loop.run_until_complete(looper)             # Wait until finished
print('List comprehension has finished and results are gathered!')
print(results)
This produces following output:
function called for x=5
function called for x=4
function called for x=2
function called for x=0
function called for x=6
function called for x=1
function called for x=7
function called for x=3
function called for x=8
function called for x=9
function called for x=10
function called for x=12
function called for x=11
function called for x=15
function called for x=13
function called for x=14
function called for x=16
function called for x=17
function called for x=18
function called for x=19
List comprehension has finished and results are gathered!
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Note that all the function calls ran in parallel, hence the shuffled prints; the original order is nevertheless preserved in the resulting list.
Related
Imagine we have an iterator, say iter(range(1, 1000)). And we have two functions, each accepting an iterator as the only parameter, say sum() and max(). In SQL world we would call them aggregate functions.
Is there any way to obtain results of both without buffering the iterator output?
To do it, we would need to pause and resume aggregate function execution, in order to feed them both with the same values without storing them. Maybe there is a way to express it using async machinery, without sleeps?
Let's consider how to apply two aggregate functions to the same iterator, which we can only exhaust once. The initial attempt (which hardcodes sum and max for brevity, but is trivially generalizable to an arbitrary number of aggregate functions) might look like this:
def max_and_sum_buffer(it):
    content = list(it)
    p = sum(content)
    m = max(content)
    return p, m
This implementation has the downside that it stores all the generated elements in memory at once, despite both functions being perfectly capable of stream processing. The question anticipates this cop-out and explicitly requests the result to be produced without buffering the iterator output. Is it possible to do this?
Serial execution: itertools.tee
It certainly seems possible. After all, Python iterators are external, so every iterator is already capable of suspending itself. How hard can it be to provide an adapter that splits an iterator into two new iterators that provide the same content? Indeed, this is exactly the description of itertools.tee, which appears perfectly suited to parallel iteration:
def max_and_sum_tee(it):
    it1, it2 = itertools.tee(it)
    p = sum(it1)  # XXX
    m = max(it2)
    return p, m
The above produces the correct result, but doesn't work the way we'd like it to. The trouble is that we're not iterating in parallel. Aggregate functions like sum and max never suspend - each insists on consuming all of the iterator content before producing the result. So sum will exhaust it1 before max has had a chance to run at all. Exhausting elements of it1 while leaving it2 alone will cause those elements to be accumulated inside an internal FIFO shared between the two iterators. That's unavoidable here - since max(it2) must see the same elements, tee has no choice but to accumulate them. (For more interesting details on tee, refer to this post.)
In other words, there is no difference between this implementation and the first one, except that the first one at least makes the buffering explicit. To eliminate buffering, sum and max must run in parallel, not one after the other.
Threads: concurrent.futures
Let's see what happens if we run the aggregate functions in separate threads, still using tee to duplicate the original iterator:
def max_and_sum_threads_simple(it):
    it1, it2 = itertools.tee(it)
    with concurrent.futures.ThreadPoolExecutor(2) as executor:
        sum_future = executor.submit(lambda: sum(it1))
        max_future = executor.submit(lambda: max(it2))
    return sum_future.result(), max_future.result()
Now sum and max actually run in parallel (as much as the GIL permits), threads being managed by the excellent concurrent.futures module. It has a fatal flaw, however: for tee not to buffer the data, sum and max must process their items at exactly the same rate. If one is even a little bit faster than the other, they will drift apart, and tee will buffer all intermediate elements. Since there is no way to predict how fast each will run, the amount of buffering is both unpredictable and has the nasty worst case of buffering everything.
To ensure that no buffering occurs, tee must be replaced with a custom generator that buffers nothing and blocks until all the consumers have observed the previous value before proceeding to the next one. As before, each consumer runs in its own thread, but now the calling thread is busy running a producer, a loop that actually iterates over the source iterator and signals that a new value is available. Here is an implementation:
def max_and_sum_threads(it):
    STOP = object()
    next_val = None
    consumed = threading.Barrier(2 + 1)  # 2 consumers + 1 producer
    val_id = 0
    got_val = threading.Condition()

    def send(val):
        nonlocal next_val, val_id
        consumed.wait()
        with got_val:
            next_val = val
            val_id += 1
            got_val.notify_all()

    def produce():
        for elem in it:
            send(elem)
        send(STOP)

    def consume():
        last_val_id = -1
        while True:
            consumed.wait()
            with got_val:
                got_val.wait_for(lambda: val_id != last_val_id)
            if next_val is STOP:
                return
            yield next_val
            last_val_id = val_id

    with concurrent.futures.ThreadPoolExecutor(2) as executor:
        sum_future = executor.submit(lambda: sum(consume()))
        max_future = executor.submit(lambda: max(consume()))
        produce()
    return sum_future.result(), max_future.result()
This is quite some amount of code for something so simple conceptually, but it is necessary for correct operation.
produce() loops over the outside iterator and sends the items to the consumers, one value at a time. It uses a barrier, a convenient synchronization primitive added in Python 3.2, to wait until all consumers are done with the old value before overwriting it with the new one in next_val. Once the new value is actually ready, a condition is broadcast. consume() is a generator that transmits the produced values as they arrive, until detecting STOP. The code can be generalized to run any number of aggregate functions in parallel by creating consumers in a loop, and adjusting their number when creating the barrier.
The downside of this implementation is that it requires creation of threads (possibly alleviated by making the thread pool global) and a lot of very careful synchronization at each iteration pass. This synchronization destroys performance - this version is almost 2000 times slower than the single-threaded tee, and 475 times slower than the simple but non-deterministic threaded version.
Still, as long as threads are used, there is no avoiding synchronization in some form. To completely eliminate synchronization, we must abandon threads and switch to cooperative multi-tasking. The question is: is it possible to suspend execution of ordinary synchronous functions like sum and max in order to switch between them?
Fibers: greenlet
It turns out that the greenlet third-party extension module enables exactly that. Greenlets are an implementation of fibers, lightweight micro-threads that switch between each other explicitly. This is sort of like Python generators, which use yield to suspend, except greenlets offer a much more flexible suspension mechanism, allowing one to choose who to suspend to.
This makes it fairly easy to port the threaded version of max_and_sum to greenlets:
def max_and_sum_greenlet(it):
    STOP = object()
    consumers = None

    def send(val):
        for g in consumers:
            g.switch(val)

    def produce():
        for elem in it:
            send(elem)
        send(STOP)

    def consume():
        g_produce = greenlet.getcurrent().parent
        while True:
            val = g_produce.switch()
            if val is STOP:
                return
            yield val

    sum_result = []
    max_result = []
    gsum = greenlet.greenlet(lambda: sum_result.append(sum(consume())))
    gsum.switch()
    gmax = greenlet.greenlet(lambda: max_result.append(max(consume())))
    gmax.switch()
    consumers = (gsum, gmax)
    produce()
    return sum_result[0], max_result[0]
The logic is the same, but with less code. As before, produce produces values retrieved from the source iterator, but its send doesn't bother with synchronization, as it doesn't need to when everything is single-threaded. Instead, it explicitly switches to every consumer in turn to do its thing, with the consumer dutifully switching right back. After going through all consumers, the producer is ready for the next iteration pass.
Results are retrieved using an intermediate single-element list because greenlet doesn't provide access to the return value of the target function (and neither does threading.Thread, which is why we opted for concurrent.futures above).
There are downsides to using greenlets, though. First, they don't come with the standard library, you need to install the greenlet extension. Then, greenlet is inherently non-portable because the stack-switching code is not supported by the OS and the compiler and can be considered somewhat of a hack (although an extremely clever one). A Python targeting WebAssembly or JVM or GraalVM would be very unlikely to support greenlet. This is not a pressing issue, but it's definitely something to keep in mind for the long haul.
Coroutines: asyncio
As of Python 3.5, Python provides native coroutines. Unlike greenlets, and similar to generators, coroutines are distinct from regular functions and must be defined using async def. Coroutines can't be easily executed from synchronous code, they must instead be processed by a scheduler which drives them to completion. The scheduler is also known as an event loop because its other job is to receive IO events and pass them to appropriate callbacks and coroutines. In the standard library, this is the role of the asyncio module.
Before implementing an asyncio-based max_and_sum, we must first resolve a hurdle. Unlike greenlet, asyncio is only able to suspend execution of coroutines, not of arbitrary functions. So we need to replace sum and max with coroutines that do essentially the same thing. This is as simple as implementing them in the obvious way, only replacing for with async for, enabling the async iterator to suspend the coroutine while waiting for the next value to arrive:
async def asum(it):
    s = 0
    async for elem in it:
        s += elem
    return s

async def amax(it):
    NONE_YET = object()
    largest = NONE_YET
    async for elem in it:
        if largest is NONE_YET or elem > largest:
            largest = elem
    if largest is NONE_YET:
        raise ValueError("amax() arg is an empty sequence")
    return largest

# or, using https://github.com/vxgmichel/aiostream
#
# from aiostream.stream import accumulate
#
# def asum(it):
#     return accumulate(it, initializer=0)
#
# def amax(it):
#     return accumulate(it, max)
One could reasonably ask if providing a new pair of aggregate functions is cheating; after all, the previous solutions were careful to use existing sum and max built-ins. The answer will depend on the exact interpretation of the question, but I would argue that the new functions are allowed because they are in no way specific to the task at hand. They do the exact same thing the built-ins do, but consuming async iterators. I suspect that the only reason such functions don't already exist somewhere in the standard library is due to coroutines and async iterators being a relatively new feature.
With that out of the way, we can proceed to write max_and_sum as a coroutine:
async def max_and_sum_asyncio(it):
    loop = asyncio.get_event_loop()
    STOP = object()
    next_val = loop.create_future()
    consumed = loop.create_future()
    used_cnt = 2  # number of consumers

    async def produce():
        for elem in it:
            next_val.set_result(elem)
            await consumed
        next_val.set_result(STOP)

    async def consume():
        nonlocal next_val, consumed, used_cnt
        while True:
            val = await next_val
            if val is STOP:
                return
            yield val
            used_cnt -= 1
            if not used_cnt:
                consumed.set_result(None)
                consumed = loop.create_future()
                next_val = loop.create_future()
                used_cnt = 2
            else:
                await consumed

    s, m, _ = await asyncio.gather(asum(consume()), amax(consume()),
                                   produce())
    return s, m
Although this version is based on switching between coroutines inside a single thread, just like the one using greenlet, it looks different. asyncio doesn't provide explicit switching of coroutines, it bases task switching on the await suspension/resumption primitive. The target of await can be another coroutine, but also an abstract "future", a value placeholder which will be filled in later by some other coroutine. Once the awaited value becomes available, the event loop automatically resumes execution of the coroutine, with the await expression evaluating to the provided value. So instead of produce switching to consumers, it suspends itself by awaiting a future that will arrive once all the consumers have observed the produced value.
consume() is an asynchronous generator, which is like an ordinary generator, except it creates an async iterator, which our aggregate coroutines are already prepared to accept by using async for. An async iterator's equivalent of __next__ is called __anext__ and is a coroutine, allowing the coroutine that exhausts the async iterator to suspend while waiting for the new value to arrive. When a running async generator suspends on an await, that is observed by async for as a suspension of the implicit __anext__ invocation. consume() does exactly that when it waits for the values provided by produce and, as they become available, transmits them to aggregate coroutines like asum and amax. Waiting is realized using the next_val future, which carries the next element from it. Awaiting that future inside consume() suspends the async generator, and with it the aggregate coroutine.
The advantage of this approach compared to greenlets' explicit switching is that it makes it much easier to combine coroutines that don't know of each other into the same event loop. For example, one could have two instances of max_and_sum running in parallel (in the same thread), or run a more complex aggregate function that invoked further async code to do calculations.
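As a small sketch of that point (not part of the original answer), two independent aggregations can share a single event loop simply by gathering them:

async def two_aggregations(it1, it2):
    # Both max_and_sum_asyncio coroutines run interleaved in the same thread.
    return await asyncio.gather(max_and_sum_asyncio(it1),
                                max_and_sum_asyncio(it2))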
The following convenience function shows how to run the above from non-asyncio code:
def max_and_sum_asyncio_sync(it):
    # trivially instantiate the coroutine and execute it in the
    # default event loop
    coro = max_and_sum_asyncio(it)
    return asyncio.get_event_loop().run_until_complete(coro)
Performance
Measuring and comparing performance of these approaches to parallel execution can be misleading because sum and max do almost no processing, which over-stresses the overhead of parallelization. Treat these as you would treat any microbenchmarks, with a large grain of salt. Having said that, let's look at the numbers anyway!
Measurements were produced using Python 3.6. The functions were run only once and given range(10000), their time measured by subtracting time.time() before and after the execution. Here are the results:
max_and_sum_buffer and max_and_sum_tee: 0.66 ms - almost exact same time for both, with the tee version being a bit faster.
max_and_sum_threads_simple: 2.7 ms. This timing means very little because of non-deterministic buffering, so this might be measuring the time to start two threads and the synchronization internally performed by Python.
max_and_sum_threads: 1.29 seconds, by far the slowest option, ~2000 times slower than the fastest one. This horrible result is likely caused by a combination of the multiple synchronizations performed at each step of the iteration and their interaction with the GIL.
max_and_sum_greenlet: 25.5 ms, slow compared to the initial version, but much faster than the threaded version. With a sufficiently complex aggregate function, one can imagine using this version in production.
max_and_sum_asyncio: 351 ms, almost 14 times slower than the greenlet version. This is a disappointing result because asyncio coroutines are more lightweight than greenlets, and switching between them should be much faster than switching between fibers. It is likely that the overhead of running the coroutine scheduler and the event loop (which in this case is overkill given that the code does no IO) is destroying the performance on this micro-benchmark.
max_and_sum_asyncio using uvloop: 125 ms. This is more than twice the speed of regular asyncio, but still almost 5x slower than greenlet.
Running the examples under PyPy doesn't bring significant speedup, in fact most of the examples run slightly slower, even after running them several times to ensure JIT warmup. The asyncio function requires a rewrite not to use async generators (since PyPy as of this writing implements Python 3.5), and executes in somewhat under 100ms. This is comparable to CPython+uvloop performance, i.e. better, but not dramatic compared to greenlet.
If it holds for your aggregate functions that f(a,b,c,...) == f(a, f(b, f(c, ...))), then you could just cycle through your functions and feed them one element at a time, each time combining them with the result of the previous application, like reduce would do, e.g. like this:
def aggregate(iterator, *functions):
    first = next(iterator)
    result = [first] * len(functions)
    for item in iterator:
        for i, f in enumerate(functions):
            result[i] = f((result[i], item))
    return result
This is considerably slower (about 10-20 times) than just materializing the iterator in a list and applying the aggregate function on the list as a whole, or using itertools.tee (which basically does the same thing, internally), but it has the benefit of using no additional memory.
Note, however, that while this works well for functions like sum, min or max, it does not work for other aggregating functions, e.g. finding the mean or median element of an iterator, as mean(a, b, c) != mean(a, mean(b, c)). (For mean, you could of course just get the sum and divide it by the number of elements, but computing e.g. the median taking just one element at a time will be more challenging.)
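As a rough sketch of the sum-and-divide workaround for the mean (not part of the answer above), the same single unbuffered pass can track a total and a count alongside the maximum:

def mean_and_max(iterator):
    total = 0
    count = 0
    largest = None
    for item in iterator:
        total += item
        count += 1
        if largest is None or item > largest:
            largest = item
    if count == 0:
        raise ValueError("empty iterator")
    return total / count, largest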
I'm very new to Python (and coding in general) and I need help parallelizing the code below. I looked around and found some packages (e.g. multiprocessing and joblib) which could be useful.
However, I have trouble using them in my example. My code writes an output file and updates it while doing the loop(s). Therefore it is not directly parallelizable, so I think I need to make smaller files. After this, I could merge the files together.
I'm unable to find a way to do this, could someone be so kind and give me a decent start?
I appreciate any help,
A code newbie
Code:
def delta(graph,n,t,nx,OutExt):
    fout_ = open(OutExt+'Delta'+str(t)+'.txt','w')
    temp = nx.Graph(graph)
    for u in range(0,n):
        #print "stamp: "+str(t)+" node: "+str(u)
        for v in range(u+1,n):
            #print str(u)+"\t"+str(v)
            Stat = dict()
            temp.add_edge(u,v)
            MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
            for a in Stat:
                for b in Stat[a]:
                    fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
            if not graph.has_edge(u,v):
                temp.remove_edge(u,v)
    del temp
    fout_.close()
As a start, find the part of the code that you want to be able to execute in parallel with something (perhaps with other invocations of that very same function). Then, figure out how to make this code not share mutable state with anything else.
Mutable state is the enemy of parallel execution. If two pieces of code are executing in parallel and share mutable state, you don't know what the outcome will be (and the outcome will be different each time you run the program). This is because you don't know what order the code from the parallel executions will run in. Perhaps the first will mutate something and then the second one will compute something. Or perhaps the second one will compute something and then the first one will mutate it. Who knows? There are solutions to that problem but they involve fine-grained locking and careful reasoning about what can change and when.
After you have an algorithm with a core that doesn't share mutable state, factor it into a separate function (turning locals into parameters).
Finally, use something like the threading (if your computations are primarily in CPython extension modules with good GIL behavior) or multiprocessing (otherwise) modules to execute the algorithm core function (which you have abstracted out) at some level of parallelism.
The particular code example you've shared is a challenge because you use the NetworkX library and a lot of shared mutable state. Each iteration of your loop depends on the results of the previous one, apparently. This is not obviously something you can parallelize. However, perhaps if you think about your goals more abstractly you will be able to think of a way to do it (remember, the key is to be able to express your algorithm without using shared mutable state).
Your function is called delta. Perhaps you can split your graph into sub-graphs and compute the deltas of each (which are now no longer shared) in parallel.
If the code within your outermost loop is concurrent safe (I don't know if it is or not), you could rewrite it like this for parallel execution:
from multiprocessing import Pool
from functools import partial  # needed for partial() below

def do_one_step(nx, graph, n, t, OutExt, u):
    # Create a separate output file for this set of results.
    name = "{}Delta{}-{}.txt".format(OutExt, t, u)
    fout_ = open(name, 'w')
    temp = nx.Graph(graph)
    for v in range(u+1,n):
        Stat = dict()
        temp.add_edge(u,v)
        MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
        for a in Stat:
            for b in Stat[a]:
                fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
        if not graph.has_edge(u,v):
            temp.remove_edge(u,v)
    fout_.close()

def delta(graph,n,t,nx,OutExt):
    pool = Pool()
    pool.map(
        partial(
            do_one_step,
            nx,
            graph,
            n,
            t,
            OutExt,
        ),
        range(0,n),
    )
This supposes that all of the arguments can be serialized across processes (required for any argument you pass to a function you call with multiprocessing). I suspect that nx and graph may be problems but I don't know what they are.
And again, this assumes it's actually correct to concurrently execute the inner loop.
It's best to use pool.map. Here is a simple example that shows what you need to do and how multiprocessing works with a pool:
Single threaded, basic function:
def f(x):
    return x*x

if __name__ == '__main__':
    print(list(map(f, [1, 2, 3])))

>> [1, 4, 9]
Using multiple processors:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(3)  # a pool of 3 worker processes
    print(p.map(f, [1, 2, 3]))
Using a thread pool instead (a single process, so effectively one processor for CPU-bound work):
from multiprocessing.pool import ThreadPool as Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(3)  # a pool of 3 worker threads
    print(p.map(f, [1, 2, 3]))
When you use map you can easily get a list back from the results of your function.
I'm trying to start a variable number of threads to compute the results of functions for one of my automated trading modules. I have about 14 functions, all of which are computationally expensive. I've been calculating each function sequentially, but it takes around 3 minutes to complete, and since my platform is high frequency, I need to cut that computation time down to 1 minute or less.
I've read up on multiprocessing and multithreading, but I can't find a solution that fits my need.
What I'm trying to do is define "n" number of threads to use, then divide my list of functions into "n" groups, then compute each group of functions in a separate thread. Essentially:
functionList = [func1,func2,func3,func4]
outputList = [func1out,func2out,func3out,func4out]
argsList = [func1args,func2args,func3args,func4args]
# number of threads
n = 3
functionSplit = np.array_split(np.array(functionList),n)
outputSplit = np.array_split(np.array(outputList),n)
argSplit = np.array_split(np.array(argsList),n)
Now I'd like to start "n" seperate threads, each processing the functions according to the split lists. Then I'd like to name the output of each function according to the outputList and create a master dict of the outputs from each function. I then will loop through the output dict and create a dataframe with column ID numbers according to the information in each column (already have this part worked out, just need the multithreading).
Is there any way to do something like this? I've been looking into creating a subclass of the threading.Thread class and passing the functions, output names, and arguments into the run() method, but I don't know how to name and output the results of the functions from each thread! Nor do I know how to call functions in a list according to their corresponding arguments!
The reason that I'm doing this is to discover the optimum thread number balance between computational efficiency and time. Like I said, this will be integrated into a high frequency trading platform I'm developing where time is my major constraint!
Any ideas?
You can use the multiprocessing library as shown below:
import multiprocessing

def callfns(fnList, argList, outList, d):
    # Store each function's result in the shared dict under its output name.
    for i in range(len(fnList)):
        d[outList[i]] = fnList[i](argList[i])  # use fnList[i](*argList[i]) if each entry is a tuple of args

...

manager = multiprocessing.Manager()
d = manager.dict()
processes = []
for i in range(len(functionSplit)):
    process = multiprocessing.Process(target=callfns, args=(functionSplit[i], argSplit[i], outputSplit[i], d))
    processes.append(process)

for j in processes:
    j.start()

for j in processes:
    j.join()

# use d here
This uses a server process to share the dictionary between the worker processes. To interact with the server process you need a Manager; you can then create a dictionary in the server process with manager.dict(). Once all processes have joined back to the main process, you can use the dictionary d.
I hope this helps you solve your problem.
You should use multiprocessing instead of threading for CPU-bound tasks.
Manually creating and managing processes can be difficult and requires more effort. Do check out concurrent.futures and try the ProcessPoolExecutor for maintaining a pool of processes. You can submit tasks to it and retrieve results.
The Pool.map method from the multiprocessing module can take a function and an iterable, split the iterable into separate chunks, process those chunks in separate processes in parallel, and put the results back together.
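As a rough sketch of that submit/retrieve pattern applied to a list of named functions (the helper and its parameter names here are illustrative, not from the answer above):

from concurrent.futures import ProcessPoolExecutor

def run_all(function_list, args_list, output_names, max_workers=None):
    # Submit every function to the pool, then collect results keyed by name.
    results = {}
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = {name: executor.submit(func, *args)
                   for name, func, args in zip(output_names, function_list, args_list)}
        for name, future in futures.items():
            results[name] = future.result()
    return results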
Original Question
I am trying to use multiprocessing Pool in Python. This is my code:
def f(x):
    return x

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    for x in xrange(1, 11):
        res = list(mapper(f, bar(x)))
This code makes use of all CPUs (I have 8 CPUs) when the xrange is small, like xrange(1, 6). However, when I increase the range to xrange(1, 10), I observe that only 1 CPU is running at 100% while the rest are just idling. What could be the reason? Is it because, when I increase the range, the OS shuts down the CPUs due to overheating?
How can I resolve this problem?
minimal, complete, verifiable example
To replicate my problem, I have created this example: it's a simple ngram generation from a string.
#!/usr/bin/python

import time
import itertools
import threading
import multiprocessing
import random

def f(x):
    return x

def ngrams(input_tmp, n):
    input = input_tmp.split()
    if n > len(input):
        n = len(input)
    output = []
    for i in range(len(input)-n+1):
        output.append(input[i:i+n])
    return output

def foo():
    p = multiprocessing.Pool()
    mapper = p.imap_unordered
    num = 100000000  #100
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    for n in xrange(1, 100):
        res = list(mapper(f, ngrams(rand_str, n)))

if __name__ == '__main__':
    start = time.time()
    foo()
    print 'Total time taken: '+str(time.time() - start)
When num is small (e.g., num = 10000), I find that all 8 CPUs are utilised. However, when num is substantially large (e.g., num = 100000000), only 2 CPUs are used and the rest are idling. This is my problem.
Caution: When num is too large it may crash your system/VM.
First, ngrams itself takes a lot of time. While that's happening, it's obviously only using one core. But even when that finishes (which is very easy to test by just moving the ngrams call outside the mapper and throwing a print in before and after it), you're still only using one core. I get 1 core at 100% and the other cores all around 2%.
If you try the same thing in Python 3.4, things are a little different—I still get 1 core at 100%, but the others are at 15-25%.
So, what's happening? Well, in multiprocessing, there's always some overhead for passing parameters and returning values. And in your case, that overhead completely swamps the actual work, which is just return x.
Here's how the overhead works: The main process has to pickle the values, then put them on a queue, then wait for values on another queue and unpickle them. Each child process waits on the first queue, unpickles values, does your do-nothing work, pickles the values, and puts them on the other queue. Access to the queues has to be synchronized (by a POSIX semaphore on most non-Windows platforms, I think an NT kernel mutex on Windows).
From what I can tell, your processes are spending over 99% of their time waiting on the queue or reading or writing it.
This isn't too unexpected, given that you have a large amount of data to process, and no computation at all beyond pickling and unpickling that data.
If you look at the source for SimpleQueue in CPython 2.7, the pickling and unpickling happens with the lock held. So, pretty much all the work any of your background processes do happens with the lock held, meaning they all end up serialized on a single core.
But in CPython 3.4, the pickling and unpickling happens outside the lock. And apparently that's enough work to use up 15-25% of a core. (I believe this change happened in 3.2, but I'm too lazy to track it down.)
Still, even on 3.4, you're spending far more time waiting for access to the queue than doing anything, even the multiprocessing overhead. Which is why the cores only get up to 25%.
And of course you're spending orders of magnitude more time on the overhead than the actual work, which makes this not a great test, unless you're trying to test the maximum throughput you can get out of a particular multiprocessing implementation on your machine or something.
A few observations:
In your real code, if you can find a way to batch up larger tasks (explicitly—just relying on chunksize=1000 or the like here won't help), that would probably solve most of your problem (see the sketch after these observations).
If your giant array (or whatever) never actually changes, you may be able to pass it in the pool initializer, instead of in each task, which would pretty much eliminate the problem.
If it does change, but only from the main process side, it may be worth sharing rather than passing the data.
If you need to mutate it from the child processes, see if there's a way to partition the data so each task can own a slice without contention.
Even if you need fully-contended shared memory with explicit locking, it may still be better than passing something this huge around.
It may be worth getting a backport of the 3.2+ version of multiprocessing or one of the third-party multiprocessing libraries off PyPI (or upgrading to Python 3.x), just to move the pickling out of the lock.
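To illustrate the first observation, here is a hedged sketch (names and sizes are arbitrary, not from the answer above) of batching items explicitly, so the per-item pickling and queue traffic is amortized over a whole batch:

import multiprocessing

def process_batch(batch):
    # Stand-in for the real per-item work, applied to a whole batch at once.
    return [x * x for x in batch]

def batches(seq, size):
    # Yield consecutive slices of `seq` of length `size`.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

if __name__ == '__main__':
    data = list(range(1000000))
    with multiprocessing.Pool() as pool:
        out = [y for part in pool.map(process_batch, batches(data, 10000))
               for y in part]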
The problem is that your f() function (which is the one running on separate processes) is doing nothing special, hence it is not putting load on the CPU.
ngrams(), on the other hand, is doing some "heavy" computation, but you are calling this function on the main process, not in the pool.
To make things clearer, consider that this piece of code...
for n in xrange(1, 100):
    res = list(mapper(f, ngrams(rand_str, n)))
...is equivalent to this:
for n in xrange(1, 100):
    arg = ngrams(rand_str, n)
    res = list(mapper(f, arg))
Also the following is a CPU-intensive operation that is being performed on your main process:
num = 100000000
rand_list = random.sample(xrange(100000000), num)
You should either change your code so that sample() and ngrams() are called inside the pool, or change f() so that it does something CPU-intensive, and you'll see a high load on all of your CPUs.
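A hedged sketch of the first suggestion (moving the heavy ngrams work into the pool; num is reduced here to keep the example small) could look like this:

from functools import partial

def foo():
    p = multiprocessing.Pool()
    num = 10000
    rand_list = random.sample(xrange(100000000), num)
    rand_str = ' '.join(str(i) for i in rand_list)
    # Each worker now runs ngrams() itself, one value of n per task.
    res = list(p.imap_unordered(partial(ngrams, rand_str), xrange(1, 100)))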
I am looking for a definitive answer to MATLAB's parfor for Python (Scipy, Numpy).
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical numerical computation code that I need speeding up
import numpy as np
N = 2000
output = np.zeros([N,N])
for i in range(N):
    for j in range(N):
        output[i,j] = HeavyComputationThatIsThreadSafe(i,j)
An example of a heavy computation function is:
import scipy.optimize
def HeavyComputationThatIsThreadSafe(i,j):
    n = i * j
    return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]
The one built into Python would be multiprocessing; the docs are here. I always use multiprocessing.Pool with as many workers as processors. Then whenever I need a for-loop-like structure I use Pool.imap.
As long as the body of your function does not depend on any previous iteration then you should have near linear speed-up. This also requires that your inputs and outputs are pickle-able but this is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product
output = np.zeros((N,N))
pool = Pool() #defaults to number of available CPU's
chunksize = 20 #this may take some guessing ... take a look at the docs to decide
for ind, res in enumerate(pool.imap(Fun, product(xrange(N), xrange(N)), chunksize)):
    output.flat[ind] = res  # Fun should take an (i, j) tuple and call the heavy computation
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
Jupyter Notebook
To see an example, consider that you want to write the equivalent of this MATLAB code in Python:
matlabpool open 4
parfor n=0:9
for i=1:10000
for j=1:10000
s=j*i
end
end
n
end
disp('done')
Here is one way to write this in Python, particularly in a Jupyter notebook. You have to create a module in the working directory (I called it FunForParFor.py) which contains the following:
def func(n):
    for i in range(10000):
        for j in range(10000):
            s=j*i
    print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()
    print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray

ray.init()

# Define the function. Each remote function will be executed
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i*j
    time.sleep(0.5)  # Simulate some heavy computation.
    return n

N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e., an identifier to the
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous one has finished,
        # which leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))

# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)

# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, so the remote function's computation should take longer than it takes to invoke it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize its scheduling and startup overhead.
I've always used Parallel Python but it's not a complete analog since I believe it typically uses separate processes which can be expensive on certain operating systems. Still, if the body of your loops are chunky enough then this won't matter and can actually have some benefits.
I tried all the solutions here, but found that the simplest way and the closest equivalent to MATLAB's parfor is numba's prange.
Essentially you change a single letter in your loop, range to prange:
from numba import autojit, prange

@autojit
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
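Note that autojit has been removed from recent numba releases, and prange only actually runs in parallel under a decorator with parallel=True; a rough equivalent sketch for current numba versions (assuming a recent release) is:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum(A):
    total = 0.0
    for i in prange(A.shape[0]):
        total += A[i]  # numba treats this as a parallel reduction
    return total

print(parallel_sum(np.arange(1000000, dtype=np.float64)))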
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
instead of taking a for loop
from time import sleep
for _ in range(10):
    sleep(.2)
rewrite your operation into a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not directly evaluate the expression, but collect what should be done.
This is what the delayed method is for.
from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]
Next, instantiate Parallel with the desired number of workers (n_jobs) and have it process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool

def heavy_func(key):
    # do some heavy computation on each key
    output = key**2
    return key, output

output_data = {}     # <-- this dict will store the results
keys = [1,5,7,8,10]  # <-- compute heavy_func over all the values of keys

with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]
Now output_data is a dictionary that contains, for every key, the result of the computation on that key.
That is it.