Multiprocessing in Python for methods with multiple parameters

Multiprocessing in Python for methods with multiple parameters - python

I have around 4000 data points and I have a program that processes them. Due to the huge number of points the program is very slow, although I've applied some vectorization using numpy.arange in nested loops.
I searched for pool.map, the problem is that it takes only one argument. I see there exist some answers to this problem here, Python multiprocessing pool.map for multiple arguments. I used the last one which uses map method with a list of arguments, I have around 4 args, I put them in a list and passed in the map method a long with the function name. In the function, I've extracted each argument from the list and perform the operation, but it doesn't work. This is the code where I call map,
if __name__ == '__main__':
pool= Pool(processes=8)
p= pool.map (kriging1D, [x,v,a,n])
plt.scatter(x,v,color='red')
plt.plot(range(5227),p,color='blue')
This is the function to be parallelized,
def kriging1D(args):
x=args[0]
v=args [1]
a= args [2]
n= args [3]
#perform some operations on the args..
...
#return the result..
But, I get this error,
plt.plot(range(5227),p,color='blue')
NameError: name 'p' is not defined
Note: before adding this line,
if __name__ == '__main__':
I got this error,
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
That's why I've added the if statement.
For More Clarity: v and x are vectors each of a large size as 4000 (both have the same length). My intent is to parallelize the processing of each v[i] and x[i], so for example process multiple v and x elements at a time, instead of processing elements one by one.
Can anyone please tell me what mistake I'm doing? Or, suggest an alternative method?
Thank You.

The operation of map() and I assume consequently of pool.map() (I haven't used it myself) is as follows.
Calling map(myfunc, [1, 2, 3]) calls myfunc on each of the arguments 1, 2, 3 in turn. myfunc(1), then myfunc(2) etc.
So pool.map(kriging1D, [x,v,a,n]) is equivalent to calling kriging1D(x), then kriging1D(v), and so on, no? From your method body, it looks like that is not what you want to do. Are you sure you really want to be using pool.map and not pool.apply instead?
I apologise if I have misunderstood your question; this isn't my area of expertise but I thought I'd try and help since there are no answers yet.

The syntax you are using is appropriate for **apply*, which is a single invocation, and not batch parallel.
>>> from pathos.multiprocessing import ProcessPool as Pool
>>> p = Pool()
>>>
>>> def do_it(x,y,z):
... return x+y*z
...
>>> p.apply(do_it, [2,3,4])
14
If you want to use batch parallel, you'd need to give a list of the same length for each parameter. Here, I'm running a 3-argument function in 5-way parallel -- note the length 5 lists.
>>> p.map(do_it, range(5),range(5,10),range(0,10,2))
[0, 13, 30, 51, 76]
If you want to use this syntax, you need the multiprocessing fork called pathos (or there is also parmap) -- both are also found in the SO answer you linked in the question.
If you want to use the stdlib multiprocessing, then you should look at the other answers in the same question.
Hopefully, the above will clarify those answers, however.

Related

Parallelizing for loop in Python 2.7

I'm very new to Python (and coding in general) and I need help parallising the code below. I looked around and found some packages (eg. Multiprocessing & JobLib) which could be useful.
However, I have trouble using it in my example. My code makes an outputfile, and updates it doing the loop(s). Therefore is it not directly paralisable, so I think I need to make smaller files. After this, I could merge the files together.
I'm unable to find a way to do this, could someone be so kind and give me a decent start?
I appreciate any help,
A code newbie
Code:
def delta(graph,n,t,nx,OutExt):
fout_=open(OutExt+'Delta'+str(t)+'.txt','w')
temp=nx.Graph(graph)
for u in range(0,n):
#print "stamp: "+str(t)+" node: "+str(u)
for v in range(u+1,n):
#print str(u)+"\t"+str(v)
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
del temp
fout_.close()

As a start, find the part of the code that you want to be able to execute in parallel with something (perhaps with other invocations of that very same function). Then, figure out how to make this code not share mutable state with anything else.
Mutable state is the enemy of parallel execution. If two pieces of code are executing in parallel and share mutable state, you don't know what the outcome will be (and the outcome will be different each time you run the program). This is becaues you don't know what order the code from the parallel executions will run in. Perhaps the first will mutate something and then the second one will compute something. Or perhaps the second one will compute something and then the first one will mutate it. Who knows? There are solutions to that problem but they involve fine-grained locking and careful reasoning about what can change and when.
After you have an algorithm with a core that doesn't share mutable state, factor it into a separate function (turning locals into parameters).
Finally, use something like the threading (if your computations are primarily in CPython extension modules with good GIL behavior) or multiprocessing (otherwise) modules to execute the algorithm core function (which you have abstracted out) at some level of parallelism.
The particular code example you've shared is a challenge because you use the NetworkX library and a lot of shared mutable state. Each iteration of your loop depends on the results of the previous, apparently. This is not obviously something you can parallelize. However, perhaps if you think about your goals more abstractly you will be able to think of a way to do it (remember, the key is to be able to expressive your algorithm without using shared mutable state).
Your function is called delta. Perhaps you can split your graph into sub-graphs and compute the deltas of each (which are now no longer shared) in parallel.
If the code within your outermost loop is concurrent safe (I don't know if it is or not), you could rewrite it like this for parallel execution:
from multiprocessing import Pool
def do_one_step(nx, graph, n, t, OutExt, u):
# Create a separate output file for this set of results.
name = "{}Delta{}-{}.txt".format(OutExt, t, u)
fout_ = open(name, 'w')
temp = nx.Graph(graph)
for v in range(u+1,n):
Stat = dict()
temp.add_edge(u,v)
MineDeltaGraphletTransitionsFromDynamicNetwork(graph,temp,Stat,u,v)
for a in Stat:
for b in Stat[a]:
fout_.write(str(t)+"\t"+str(u)+"\t"+str(v)+"\t"+str(a)+"\t"+str(b)+"\t"+str(Stat[a][b])+"\n")
if not graph.has_edge(u,v):
temp.remove_edge(u,v)
fout_.close()
def delta(graph,n,t,nx,OutExt):
pool = Pool()
pool.map(
partial(
do_one_step,
nx,
graph,
n,
t,
OutExt,
),
range(0,n),
)
This supposes that all of the arguments can be serialized across processes (required for any argument you pass to a function you call with multiprocessing). I suspect that nx and graph may be problems but I don't know what they are.
And again, this assumes it's actually correct to concurrently execute the inner loop.

Best use pool.map. Here an example that shows what you need to do. Here a simple example of how multiprocessing works with pool:
Single threaded, basic function:
def f(x):
return x*x
if __name__ == '__main__':
print(map(f, [1, 2, 3]))
>> [1, 4, 9]
Using multiple processors:
from multiprocessing import Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
Using 1 processor
from multiprocessing.pool import ThreadPool as Pool
def f(x):
return x*x
if __name__ == '__main__':
p = Pool(3) # 3 parallel pools
print(p.map(f, [1, 2, 3]))
When you use map you can easily get a list back from the results of your function.

Python Using Multiprocessing

I am trying to use multiprocessing in python 3.6. I have a for loopthat runs a method with different arguments. Currently, it is running one at a time which is taking quite a bit of time so I am trying to use multiprocessing. Here is what I have:
def test(self):
for key, value in dict.items():
pool = Pool(processes=(cpu_count() - 1))
pool.apply_async(self.thread_process, args=(key,value))
pool.close()
pool.join()
def thread_process(self, key, value):
# self.__init__()
print("For", key)
I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.

You're making a pool at every iteration of the for loop. Make a pool beforehand, apply the processes you'd like to run in multiprocessing, and then join them:
from multiprocessing import Pool, cpu_count
import time
def t():
# Make a dummy dictionary
d = {k: k**2 for k in range(10)}
pool = Pool(processes=(cpu_count() - 1))
for key, value in d.items():
pool.apply_async(thread_process, args=(key, value))
pool.close()
pool.join()
def thread_process(key, value):
time.sleep(0.1) # Simulate a process taking some time to complete
print("For", key, value)
if __name__ == '__main__':
t()

You're not populating your multiprocessing.Pool with data - you're re-initializing the pool on each loop. In your case you can use Pool.map() to do all the heavy work for you:
def thread_process(args):
print(args)
def test():
pool = Pool(processes=(cpu_count() - 1))
pool.map(thread_process, your_dict.items())
pool.close()
if __name__ == "__main__": # important guard for cross-platform use
test()
Also, given all those self arguments I reckon you're snatching this off of a class instance and if so - don't, unless you know what you're doing. Since multiprocessing in Python essentially works as, well, multi-processing (unlike multi-threading) you don't get to share your memory, which means your data is pickled when exchanging between processes, which means anything that cannot be pickled (like instance methods) doesn't get called. You can read more on that problem on this answer.

I think what my code is using 3 processes to run one method but I would like to run 1 method per process but I don't know how this is done. I am using 4 cores btw.
No, you are in fact using the correct syntax here to utilize 3 cores to run an arbitrary function independently on each. You cannot magically utilize 3 cores to work together on one task with out explicitly making that a part of the algorithm itself/ coding that your self often using threads (which do not work the same in python as they do outside of the language).
You are however re-initializing the pool every loop you'll need to do something like this instead to actually perform this properly:
cpus_to_run_on = cpu_count() - 1
pool = Pool(processes=(cpus_to_run_on)
# don't call a dictionary a dict, you will not be able to use dict() any
# more after that point, that's like calling a variable len or abs, you
# can't use those functions now
pool.map(your_function, your_function_args)
pool.close()
Take a look at the python multiprocessing docs for more specific information if you'd like to get a better understanding of how it works. Under python, you cannot utilize threading to do multiprocessing with the default CPython interpreter. This is because of something called the global interpreter lock, which stops concurrent resource access from within python itself. The GIL doesn't exist in other implementations of the language, and is not something other languages like C and C++ have to deal with (and thus you can actually use threads in parallel to work together on a task, unlike CPython)
Python gets around this issue by simply making multiple interpreter instances when using the multiprocessing module, and any message passing between instances is done via copying data between processes (ie the same memory is typically not touched by both interpreter instances). This does not however happen in the misleadingly named threading module, which often actually slow processes down because of a process called context switching. Threading today has limited usefullness, but provides an easier way around non GIL locked processes like socket and file reads/writes than async python.
Beyond all this though there is a bigger problem with your multiprocessing. Your writing to standard output. You aren't going to get the gains you want. Think about it. Each of your processes "print" data, but its all being displayed in one terminal/output screen. So even if your processes are "printing" they aren't really doing that independently, and the information has to be coalesced back into another processes where the text interface lies (ie your console). So these processes write whatever they were going to to some sort of buffer, which then has to be copied (as we learned from how multiprocessing works) to another process which will then take that buffered data and output it.
Typically dummy programs use printing as a means of showing how there is no order between execution of these processes, that they can finish at different times, they aren't meant to demonstrate the performance benefits of multi core processing.

I have experimented a bit this week with multiprocessing. The fastest way that I discovered to do multiprocessing in python3 is using imap_unordered, at least in my scenario. Here is a script you can experiment with using your scenario to figure out what works best for you:
import multiprocessing
NUMBER_OF_PROCESSES = multiprocessing.cpu_count()
MP_FUNCTION = 'imap_unordered' # 'imap_unordered' or 'starmap' or 'apply_async'
def process_chunk(a_chunk):
print(f"processig mp chunk {a_chunk}")
return a_chunk
map_jobs = [1, 2, 3, 4]
result_sum = 0
if MP_FUNCTION == 'imap_unordered':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
for i in pool.imap_unordered(process_chunk, map_jobs):
result_sum += i
elif MP_FUNCTION == 'starmap':
pool = multiprocessing.Pool(processes=NUMBER_OF_PROCESSES)
try:
map_jobs = [(i, ) for i in map_jobs]
result_sum = pool.starmap(process_chunk, map_jobs)
result_sum = sum(result_sum)
finally:
pool.close()
pool.join()
elif MP_FUNCTION == 'apply_async':
with multiprocessing.Pool(processes=NUMBER_OF_PROCESSES) as pool:
result_sum = [pool.apply_async(process_chunk, [i, ]).get() for i in map_jobs]
result_sum = sum(result_sum)
print(f"result_sum is {result_sum}")
I found that starmap was not too far behind in performance, in my scenario it used more cpu and ended up being a bit slower. Hope this boilerplate helps.

Multiprocessing pool 'apply_async' only seems to call function once

I've been following the docs to try to understand multiprocessing pools. I came up with this:
import time
from multiprocessing import Pool
def f(a):
print 'f(' + str(a) + ')'
return True
t = time.time()
pool = Pool(processes=10)
result = pool.apply_async(f, (1,))
print result.get()
pool.close()
print ' [i] Time elapsed ' + str(time.time() - t)
I'm trying to use 10 processes to evaluate the function f(a). I've put a print statement in f.
This is the output I'm getting:
$ python pooltest.py
f(1)
True
[i] Time elapsed 0.0270888805389
It appears to me that the function f is only getting evaluated once.
I'm likely not using the right method but the end result I'm looking for is to run f with 10 processes simultaneously, and get the result returned by each one of those process. So I would end with a list of 10 results (which may or may not be identical).
The docs on multiprocessing are quite confusing and it's not trivial to figure out which approach I should be taking and it seems to me that f should be run 10 times in the example I provided above.

apply_async isn't meant to launch multiple processes; it's just meant to call the function with the arguments in one of the processes of the pool. You'll need to make 10 calls if you want the function to be called 10 times.
First, note the docs on apply() (emphasis added):
apply(func[, args[, kwds]])
Call func with arguments args and keyword arguments kwds. It blocks
until the result is ready. Given this blocks, apply_async() is better
suited for performing work in parallel. Additionally, func is only
executed in one of the workers of the pool.
Now, in the docs for apply_async():
apply_async(func[, args[, kwds[, callback[, error_callback]]]])
A variant of the apply() method which returns a result object.
The difference between the two is just that apply_async returns immediately. You can use map() to call a function multiple times, though if you're calling with the same inputs, then it's a little redudant to create the list of the same argument just to have a sequence of the right length.
However, if you're calling different functions with the same input, then you're really just calling a higher order function, and you could do it with map or map_async() like this:
multiprocessing.map(lambda f: f(1), functions)
except that lambda functions aren't pickleable, so you'd need to use a defined function (see How to let Pool.map take a lambda function). You can actually use the builtin apply() (not the multiprocessing one) (although it's deprecated):
multiprocessing.map(apply,[(f,1) for f in functions])
It's easy enough to write your own, too:
def apply_(f,*args,**kwargs):
return f(*args,**kwargs)
multiprocessing.map(apply_,[(f,1) for f in functions])

Each time you write pool.apply_async(...) it will delegate that function call to one of the processes that was started in the pool. If you want to call the function in multiple processes, you need to issue multiple pool.apply_async calls.
Note, there also exists a pool.map (and pool.map_async) function which will take a function and an iterable of inputs:
inputs = range(30)
results = pool.map(f, inputs)
These functions will apply the function to each input in the inputs iterable. It attempts to put "batches" into the pool so that the load gets balanced fairly evenly among all the processes in the pool.

If you want to run a single piece of code in ten processes, each of which then exits, a Pool of ten processes is probably not the right thing to use.
Instead, create ten Processes to run the code:
processes = []
for _ in range(10):
p = multiprocessing.Process(target=f, args=(1,))
p.start()
processes.append(p)
for p in processes:
p.join()
The multiprocessing.Pool class is designed to handle situations where the number of processes and the number of jobs are unrelated. Often the number of processes is selected to be the number of CPU cores you have, while the number of jobs is much larger. Thanks!

If you aren't committed to Pool for any particular reason, I've written a function around multiprocessing.Process that will probably do the trick for you. It's posted here, but I'd be happy to upload the most recent version to github if you want it.

Adding state to a function which gets called via pool.map -- how to avoid pickling errors

I've hit the common problem of getting a pickle error when using the multiprocessing module.
My exact problem is that I need to give the function I'm calling some state before I call it in the pool.map function, but in doing so, I cause the attribute lookup __builtin__.function failed error found here.
Based on the linked SO answer, it looks like the only way to use a function in pool.map is to call the defined function itself so that it is looked up outside the scope of the current function.
I feel like I explained the above poorly, so here is the issue in code. :)
Testing without pool
# Function to be called by the multiprocessing pool
def my_func(x):
massive_list, medium_list, index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = map(my_func, to_crunch)
This works A-OK and just as expected. The only thing "wrong" with it is that it's slow.
Pool Attempt 1
# (Note: my_func() remains the same)
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(my_func, to_crunch)
This technically works, but it is a stunning 18x slower! The slow down must be coming from not only copying the two massive data structures on each call, but also pickling/unpickling them as they get passed around. The non-pool version benefits from only having to pass the reference to the massive list around, rather than the actual list.
So, having tracked down the bottleneck, I try to store the two massive lists as state inside of my_func. That way, if I understand correctly, it will only need to be copied once for each worker (in my case, 4).
Pool Attempt 2:
I wrap up my_func in a closure passing in the two lists as stored state.
def build_myfunc(m,s):
def my_func(x):
massive_list = m # close the state in there
small_list = s
index1, index2 = x
result = [massive_list[index1 + x][index2:] for x in xrange(10)]
return result in medium_list
return my_func
if __name__ == '__main__':
data = [comprehension which loads a ton of state]
source = [comprehension which also loads a medium amount of state]
modified_func = build_myfunc(data, source)
pool = multiprocessing.Pool(2)
for num in range(100):
to_crunch = ((massive_list, small_list, num, x) for x in range(1000))
result = pool.map(modified_func, to_crunch)
However, this returns the pickle error as (based on the above linked SO question) you cannot call a function with multiprocessing from inside of the same scope.
Error:
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
So, is there a way around this problem?

Map is a way to distribute workload. If you store the data in the func i think you vanish the initial purpose.
Let's try to find why it is slower. It's not normal and there must be something else.
First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes so a total of 3 processes is involved. How many cores are on the system you're using? What else is running? What's the system load while crunching data?
What does the function do with the data? Does it access disk? Or maybe it uses DB which means there is probably another process accessing disk and cores.
What about memory? Is it sufficient for storing the initial lists?
The right implementation is your Attempt 1.
Try to profile the execution using iostat for example. This way you can spot the bottlenecks.
If it stalls on the cpu then you can try some tweaks to the code.
From another answer on Stackoverflow (by me so no problem copy and pasting it here :P ):
You're using .map() which collect the results and then returns. So for large dataset probably you're stuck in the collecting phase.
You can try using .imap() which is the iterator version on .map() or even the .imap_unordered() if the order of results is not important (as it seems from your example).
Here's the relevant documentation. Worth noting the line:
For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.

How to parallelize list-comprehension calculations in Python?

Both list comprehensions and map-calculations should -- at least in theory -- be relatively easy to parallelize: each calculation inside a list-comprehension could be done independent of the calculation of all the other elements. For example in the expression
[ x*x for x in range(1000) ]
each x*x-Calculation could (at least in theory) be done in parallel.
My question is: Is there any Python-Module / Python-Implementation / Python Programming-Trick to parallelize a list-comprehension calculation (in order to use all 16 / 32 / ... cores or distribute the calculation over a Computer-Grid or over a Cloud)?

As Ken said, it can't, but with 2.6's multiprocessing module, it's pretty easy to parallelize computations.
import multiprocessing
try:
cpus = multiprocessing.cpu_count()
except NotImplementedError:
cpus = 2 # arbitrary default
def square(n):
return n * n
pool = multiprocessing.Pool(processes=cpus)
print(pool.map(square, range(1000)))
There are also examples in the documentation that show how to do this using Managers, which should allow for distributed computations as well.

For shared-memory parallelism, I recommend joblib:
from joblib import delayed, Parallel
def square(x): return x*x
values = Parallel(n_jobs=NUM_CPUS)(delayed(square)(x) for x in range(1000))

On automatical parallelisation of list comprehension
IMHO, effective automatic parallisation of list comprehension would be impossible without additional information (such as those provided using directives in OpenMP), or limiting it to expressions that involve only built-in types/methods.
Unless there is a guarantee that the processing done on each list item has no side effects, there is a possibility that the results will be invalid (or at least different) if done out of order.
# Artificial example
counter = 0
def g(x): # func with side-effect
global counter
counter = counter + 1
return x + counter
vals = [g(i) for i in range(100)] # diff result when not done in order
There is also the issue of task distribution. How should the problem space be decomposed?
If the processing of each element forms a task (~ task farm), then when there are many elements each involving trivial calculation, the overheads of managing the tasks will swamps out the performance gains of parallelisation.
One could also take the data decomposition approach where the problem space is divided equally among the available processes.
The fact that list comprehension also works with generators makes this slightly tricky, however this is probably not a show stopper if the overheads of pre-iterating it is acceptable. Of course, there is also a possibility of generators with side-effects which can change the outcome if subsequent items are prematurely iterated. Very unlikely, but possible.
A bigger concern would be load imbalance across processes. There is no guarantee that each element would take the same amount of time to process, so statically partitioned data may result in one process doing most of the work while the idle your time away.
Breaking the list down to smaller chunks and handing them as each child process is available is a good compromise, however, a good selection of chunk size would be application dependent hence not doable without more information from the user.
Alternatives
As mentioned in several other answers, there are many approaches and parallel computing modules/frameworks to choose from depending on one requirements.
Having used only MPI (in C) with no experience using Python for parallel processing, I am not in a position to vouch for any (although, upon a quick scan through,
multiprocessing, jug, pp and pyro stand out).
If a requirement is to stick as close as possible to list comprehension, then jug seems to be the closest match. From the tutorial, distributing tasks across multiple instances can be as simple as:
from jug.task import Task
from yourmodule import process_data
tasks = [Task(process_data,infile) for infile in glob('*.dat')]
While that does something similar to multiprocessing.Pool.map(), jug can use different backends for synchronising process and storing intermediate results (redis, filesystem, in-memory) which means the processes can span across nodes in a cluster.

As the above answers point out, this is actually pretty hard to do automatically. Then I think the question is actually how to do it in the easiest way possible. Ideally, a solution wouldn't require you to know things like "how many cores do I have". Another property that you might want is to be able to still do the list comprehension in a single readable line.
Some of the given answers already seem to have nice properties like this, but another alternative is Ray (docs), which is a framework for writing parallel Python. In Ray, you would do it like this:
import ray
# Start Ray. This creates some processes that can do work in parallel.
ray.init()
# Add this line to signify that the function can be run in parallel (as a
# "task"). Ray will load-balance different `square` tasks automatically.
#ray.remote
def square(x):
return x * x
# Create some parallel work using a list comprehension, then block until the
# results are ready with `ray.get`.
ray.get([square.remote(x) for x in range(1000)])

Using the futures.{Thread,Process}PoolExecutor.map(func, *iterables, timeout=None) and futures.as_completed(future_instances, timeout=None) functions from the new 3.2 concurrent.futures package could help.
It's also available as a 2.6+ backport.

No, because list comprehension itself is a sort of a C-optimized macro. If you pull it out and parallelize it, then it's not a list comprehension, it's just a good old fashioned MapReduce.
But you can easily parallelize your example. Here's a good tutorial on using MapReduce with Python's parallelization library:
http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/

Not within a list comprehension AFAIK.
You could certainly do it with a traditional for loop and the multiprocessing/threading modules.

There is a comprehensive list of parallel packages for Python here:
http://wiki.python.org/moin/ParallelProcessing
I'm not sure if any handle the splitting of a list comprehension construct directly, but it should be trivial to formulate the same problem in a non-list comprehension way that can be easily forked to a number of different processors. I'm not familiar with cloud computing parallelization, but I've had some success with mpi4py on multi-core machines and over clusters. The biggest issue that you'll have to think about is whether the communication overhead is going to kill any gains you get from parallelizing the problem.
Edit: The following might also be of interest:
http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/

You can use asyncio. (Documentation can be found [here][1]). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web-servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accomodate any kind of problem.
import asyncio
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def your_function(argument):
#code
Now this function will be run in parallel whenever called without putting main program into wait state. You can use it to parallelize for loop as well. When called for a for loop, though loop is sequential but every iteration runs in parallel to the main program as soon as interpreter gets there.
For your specific case, you can do:
import asyncio
import time
def background(f):
def wrapped(*args, **kwargs):
return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
return wrapped
#background
def op(x): # Do any operation you want
time.sleep(1)
print(f"function called for {x=}\n", end='')
return x*x
loop = asyncio.get_event_loop() # Have a new event loop
looper = asyncio.gather(*[op(i) for i in range(20)]) # Run the loop; Doing for 20 for better demo
results = loop.run_until_complete(looper) # Wait until finish
print('List comprehension has finished and results are gathered!')
print(results)
This produces following output:
function called for x=5
function called for x=4
function called for x=2
function called for x=0
function called for x=6
function called for x=1
function called for x=7
function called for x=3
function called for x=8
function called for x=9
function called for x=10
function called for x=12
function called for x=11
function called for x=15
function called for x=13
function called for x=14
function called for x=16
function called for x=17
function called for x=18
function called for x=19
List comprehension has finished and results are gathered!
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Note that all function calls were in parallel thus shuffled prints however original order is preserved in the resulting list.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.