I have a Python script, but it takes more than 20 hours to run to completion.
Since my code is pretty big, I will post a simplified one.
The first part of the code:
flag = 1
mydic = {}
for i in mylist:
    mydic[flag] = myfunction(i)
    flag += 1
mylist has more than 700 entries, and each call to myfunction takes around 20 seconds.
So I was wondering whether I can use parallel programming to split the iteration into two groups and run them simultaneously. Is that possible, and will it take half the time it does now?
The second part of the code:
mymatrix = []
for n1 in range(1, flag):
    mat = []
    for n2 in range(1, flag):
        if n1 >= n2:
            mat.append(0)
        else:
            res = myfunction2(mydic[n1], mydic[n2])
            mat.append(res)
    mymatrix.append(mat)
So, if mylist has 700 entries, I want to create a 700x700 upper triangular matrix. But each call to myfunction2() takes around 30 seconds. I don't know if I can use parallel programming here too.
I cannot simplify myfunction() and myfunction2(), since they call an external API and return the results.
Do you have any suggestions for how I can change this to make it faster?
Based on your comments, I think it's very likely that the 30 seconds per call is mostly due to the external API calls. I would add some timing code to test which portions of your code are actually responsible for the slowness.
If it is the external API calls, there are some easy fixes. The external API calls block, so you'll get a speedup if you can move to a parallel model (though 30 s of blocking sounds huge to me).
I think it would be easiest to create a quick "task list" by having the output of 2 loops be a matrix of arguments to pass into a function. Then I'd pipe them into Celery to run the tasks. That should give you a decent speedup with a minimal amount of work.
You would probably save a lot more time with the threading or multiprocessing modules to run the tasks (or sections), or even by writing it all in Twisted Python - but that usually takes longer than a simple Celery function.
The one caveat with the Celery approach is that you'll be dispatching a lot of work - so you'll need some functionality to poll for results. That could be a while loop that just sleeps for 10 seconds and repeats until Celery has a result for every task. If you do it in Twisted, you can access/track results on finish. I've never had to do something like this with multiprocessing, so I don't know how that would fit in.
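If you do go the threading/multiprocessing route for the first loop, here is a minimal sketch, assuming myfunction spends its ~20 seconds blocked on the API call, so a thread pool is enough; build_mydic and the worker count are just illustrative names, not anything from your code:
from multiprocessing.pool import ThreadPool

def build_mydic(mylist, workers=8):
    # overlap the blocking API calls instead of running them one after another
    with ThreadPool(workers) as pool:
        results = pool.map(myfunction, mylist)   # result order matches mylist
    # rebuild the original 1-based keys (flag started at 1)
    return {flag: res for flag, res in enumerate(results, start=1)}

mydic = build_mydic(mylist)
The same pattern works for the second loop: build a list of (n1, n2) pairs for the upper triangle, map a small wrapper around myfunction2 over that list, and fill mymatrix from the results.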
How about using a generator for the second part instead of one of the for loops?
def fn():
    for n1 in range(1, flag):
        yield n1

generate = fn()
mymatrix = []
while True:
    try:
        a = next(generate)
    except StopIteration:
        break
    mat = []
    for n2 in range(1, flag):
        if a >= n2:
            mat.append(0)
        else:
            mat.append(myfunction2(mydic[a], mydic[n2]))
    mymatrix.append(mat)
I am running a backtest for a trading strategy, defined as a class. I am trying to select the best combination of parameters to input into the model, so I am running multiple backtests over a given period, trying out different combinations. The idea is to be able to select the first generation of a population to feed into a genetic algorithm. Seems like the perfect job for multiprocessing!
So I tried a bunch of things to see what works faster. I opened 10 Spyder consoles (yes, I tried it) and ran a single combination of parameters for each console (all running at the same time).
The sample code used for each single Spyder console:
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

backtesting = []
for day in days:
    backtesting_day = MyStrategy(day, single_parameter_combi)
    backtesting.append(backtesting_day)
I then tried the multiprocessing way, using pool.
The sample code used in multiprocessing:
import multiprocessing

class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

def single_run_backtesting(single_parameter_combi):
    backtesting = []
    for day in days:
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtesting.append(backtesting_day)
    return backtesting

def backtest_many(list_of_parameter_combinations):
    p = multiprocessing.Pool()
    result = p.map(single_run_backtesting, list_of_parameter_combinations)
    p.close()
    p.join()
    return result
if __name__ == '__main__':
    parameter_combis = [...]  # a list of parameter combinations, 10 different ones in this case
    result = backtest_many(parameter_combis)
I have also tried the following: opening 5 Spyder consoles and running 2 instances of the class in a for loop, as below, and a single Spyder console with 10 instances of the class.
class MyStrategy:
    def __init__(self, day, parameters):
        # my strategy that runs on a single day
        ...

parameter_combis = [...]  # a list of parameter combinations
backtest_dict = {k: [] for k in range(len(parameter_combis))}  # make a dictionary of empty lists
for day in days:
    for j, single_parameter_combi in enumerate(parameter_combis):
        backtesting_day = MyStrategy(day, single_parameter_combi)
        backtest_dict[j].append(backtesting_day)
To my great surprise, it takes around 25 minutes with multiprocessing to go through a single day, about the same time as a single Spyder console with 10 instances of the class in the for loop, and magically it takes only 15 minutes when I run 10 Spyder consoles at the same time. How do I process this information? It doesn't really make sense to me. I am running a 12-CPU machine on Windows 10.
Consider that I am planning to run things on AWS with a 96-core machine, with something like 100 combinations of parameters that cross in a genetic algorithm, which should run for something like 20-30 generations (a full backtest is 2 business months = 44 days).
My question is: what am I missing? Most importantly, is this just a difference in scale?
I know that for example if you define a simple squaring function and run it serially for 100 times, multiprocessing is actually slower than a for loop. You start seeing the advantage around 10000 times, see for example this: https://github.com/vprusso/youtube_tutorials/blob/master/multiprocessing_and_threading/multiprocessing/multiprocessing_pool.py
Will I see a difference in performance when I go up to 100 combinations with multiprocessing, and is there any way of knowing in advance whether this will be the case? Am I writing the code properly? Other ideas? Do you think it would speed things up significantly if I used multiprocessing one step "above", over the many days within a single parameter combination?
To expand upon my comment "Try p.imap_unordered().":
p.map() ensures that you get the results in the same order they're in the parameter list. To achieve this, some of the workers necessarily remain idle for some time.
For your use case – essentially a grid search of parameter combinations – you really don't need to have them in the same order, you just want to end up with the best option. (Additionally, quoth the documentation, "it may cause high memory usage for very long iterables. Consider using imap() or imap_unordered() with explicit chunksize option for better efficiency.")
p.imap_unordered(), by contrast, doesn't really care – it just queues things up and workers work on them as they free up.
It's also worth experimenting with the chunksize parameter – quoting the imap() documentation, "For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1." (since you spend less time queueing and synchronizing things).
Finally, for your particular use case, you might want to consider having the master process generate an infinite amount of parameter combinations using a generator function, and breaking off the loop once you find a good enough solution or enough time passes.
A simple-ish function to do this, and a contrived problem (finding two random numbers 0..1 to maximize their sum), follow. Just remember to return the original parameter set from the worker function too, otherwise you won't have access to it! :)
import random
import multiprocessing
import time

def find_best(*, param_iterable, worker_func, metric_func, max_time, chunksize=10):
    best_result = None
    best_metric = None
    start_time = time.time()
    n_results = 0
    with multiprocessing.Pool() as p:
        for result in p.imap_unordered(worker_func, param_iterable, chunksize=chunksize):
            n_results += 1
            elapsed_time = time.time() - start_time
            metric = metric_func(result)
            if best_metric is None or metric > best_metric:
                print(f'{elapsed_time}: Found new best solution, metric {metric}')
                best_metric = metric
                best_result = result
            if elapsed_time >= max_time:
                print(f'{elapsed_time}: Max time reached.')
                break
    final_time = time.time() - start_time
    print(f'Searched {n_results} results in {final_time} s.')
    return best_result

# ------------

def generate_parameter():
    return {'a': random.random(), 'b': random.random()}

def generate_parameters():
    while True:
        yield generate_parameter()

def my_worker(parameters):
    return {
        'parameters': parameters,  # remember to return this too!
        'value': parameters['a'] + parameters['b'],  # our maximizable metric
    }

def my_metric(result):
    return result['value']

def main():
    result = find_best(
        param_iterable=generate_parameters(),
        worker_func=my_worker,
        metric_func=my_metric,
        max_time=5,
    )
    print(f'Best result: {result}')

if __name__ == '__main__':
    main()
An example run:
~/Desktop $ python3 so59357979.py
0.022627830505371094: Found new best solution, metric 0.5126700311039976
0.022940874099731445: Found new best solution, metric 0.9464256914062249
0.022969961166381836: Found new best solution, metric 1.2946600313637404
0.02298712730407715: Found new best solution, metric 1.6255217652861256
0.023016929626464844: Found new best solution, metric 1.7041449687571075
0.02303481101989746: Found new best solution, metric 1.8898109980050104
0.030200958251953125: Found new best solution, metric 1.9031436071918972
0.030324935913085938: Found new best solution, metric 1.9321951916206537
0.03880715370178223: Found new best solution, metric 1.9410837287942249
0.03970479965209961: Found new best solution, metric 1.9649277383314245
0.07829880714416504: Found new best solution, metric 1.9926667738329622
0.6105098724365234: Found new best solution, metric 1.997217792614364
5.000051021575928: Max time reached.
Searched 621931 results in 5.07216 s.
Best result: {'parameters': {'a': 0.997483, 'b': 0.999734}, 'value': 1.997217}
(By the way, this is nearly 6 times slower when chunksize=1.)
I have implemented a cache-oblivious algorithm and have shown with the PAPI library that the L1/L2/L3 misses are very low. However, I would also like to see how the algorithm behaves if I reduce the available RAM and force the algorithm to start using swap space on disk. Since the algorithm is cache-oblivious, I should expect much better scaling to disk compared to other, non-cache-oblivious algorithms for the same problem.
The problem, however, is that it is very hard to predict how badly the algorithms will perform once they hit the disk; a small increase in the input size might dramatically change the time it takes for the algorithm to finish running. So if you have many algorithms to test and one takes forever to finish, the whole experiment becomes useless (I could of course sit and monitor the experiment and kill it with Ctrl+C, but I really need to sleep).
Let's say the algorithms are A,B and C. I use a different python script, one for each algorithm. For varying input size n I use subprocess.check_output to call the executable of the implementation. This executable returns some statistics that I then process and store in a suitable format that I can then use with R for example to make some nice plots.
This is an example code for algorithm A:
import subprocess
import sys

f1 = open('data.stats', 'w+', 1)

min_n = 200000
max_n = 2000000
step = 200000
iterations = 10
ns = range(min_n, max_n + 1, step)
ps = [...]  # parameter values, defined in the full script
incr = 0

f1.write('n\tp\talg\ttime\n')
for n in ns:
    i = 0
    for p in ps:
        for it in range(0, iterations):
            resA = subprocess.check_output(['/usr/bin/time', '-v', './A', str(n)],
                                           stderr=subprocess.STDOUT)
            # do something with resA
            f1.write(resA.decode() + '\n')
            incr = incr + 1
            print(incr / (len(ns) * iterations) * 100.0, '%', end="\r")
        i = i + 1
My question is: can I somehow kill the script if subprocess.check_output takes too long to return? The best thing for me would be to define a cutoff, like 10 minutes, so that if subprocess.check_output hasn't returned anything by then, the entire script is killed.
If you're using Python 3 (and the format of your call to print suggests you might be), then check_output actually already has a timeout argument that might be useful to you: https://docs.python.org/3.6/library/subprocess.html#subprocess.check_output
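A minimal sketch of how that could look in your loop, reusing the n variable and the /usr/bin/time invocation from your script; the 600-second cutoff is just an example value:
import subprocess

try:
    resA = subprocess.check_output(['/usr/bin/time', '-v', './A', str(n)],
                                   stderr=subprocess.STDOUT,
                                   timeout=600)  # raise after 10 minutes
except subprocess.TimeoutExpired:
    print('Algorithm A exceeded the cutoff for n =', n)
    # e.g. skip this input size, or sys.exit(1) to abort the whole experiment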
I defined two correct ways of calculating averages in python.
def avg_regular(values):
    total = 0
    for value in values:
        total += value
    return total / len(values)

def avg_concurrent(values):
    mean = 0
    num_of_values = len(values)
    for value in values:
        # calculate a small portion of the average for each num and add to the total
        mean += value / num_of_values
    return mean
The first function is the regular way of calculating averages, but I wrote the second one because each run of the loop doesn't depend on previous runs. So theoretically the average can be computed in parallel.
However, the "parallel" one (without running in parallel) takes about 30% more time than the regular one.
Are my assumptions correct and worth the speed loss?
If yes, how can I make the second function run in parallel?
If not, where did I go wrong?
The code you implemented is basically the difference between (a1+a2+ ... + an) / n and (a1/n + a2/n + ... + an/n). The result is the same, but in the second version there are more operations (namely (n-1) more divisions) which slows the calculation down. You claimed that in the second version each loop run is independent of the others. In the first loop we need the following information to finish one loop run: total before the run and the current value. In the second version we need the following information to finish one loop run: mean before the run, the current value and num_of_values. As you see in the second version we even depend on more values!
But how could we divide the work between cores (which is the goal of multiprocessing)? We could just give one core the first half of the values and the second core the second half, i.e. compute ((a1 + a2 + ... + a(n//2)) + (a(n//2+1) + ... + a(n))) / n. Yes, the work of dividing by n is not split between the cores, but it's a single instruction, so we don't really care. We also need to add the left total and the right total, which we can't split, but again it's only a single operation.
So the code we want to run:
def my_sum(values):
    total = 0
    for value in values:
        total += value
    return total
There's still a problem with Python - normally one could use threads to do the computations, because each thread would use one core. But then one has to take care that the program does not run into race conditions, and the Python interpreter itself also needs to take care of that. CPython decided it's not worth it and basically only runs one thread at a time. A basic solution is to use multiple processes via multiprocessing.
from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(5) as p:
        results = p.map(my_sum, [long_list[0:len(long_list)//2],
                                 long_list[len(long_list)//2:]])
    print(sum(results) / len(long_list))  # add subresults and divide by n
But of course multiple processes do not come for free. You need to fork, copy stuff, etc., so you will not gain the speedup of 2 one might expect. Also, the biggest slowdown is actually using Python itself; it's not really optimized for fast numerical computations. There are various ways around that, but using numpy is probably the simplest. Just use:
import numpy
print(numpy.mean(long_list))
This is probably much faster than the pure-Python version. I don't think numpy uses multiprocessing internally, so one could gain a boost by combining multiple processes with a fast implementation (numpy or something else written in C), but normally numpy is fast enough.
I have a fuzzy string matching script that looks for some 30K needles in a haystack of 4 million company names. While the script works fine, my attempts at speeding up things via parallel processing on an AWS h1.xlarge failed as I'm running out of memory.
Rather than trying to get more memory as explained in response to my previous question, I'd like to find out how to optimize the workflow - I'm fairly new to this, so there should be plenty of room. Btw, I've already experimented with queues (it also worked but ran into the same MemoryError), plus I've looked through a bunch of very helpful SO contributions, but I'm not quite there yet.
Here's what seems most relevant of the code. I hope it sufficiently clarifies the logic - happy to provide more info as needed:
def getHayStack():
    ## loads a few million company names into id: name dict
    return hayCompanies

def getNeedles(*args):
    ## loads subset of 30K companies into id: name dict (for allocation to workers)
    return needleCompanies

def findNeedle(needle, haystack):
    """ Identify best match and return results with score """
    results = {}
    for hayID, hayCompany in haystack.iteritems():
        if not isnull(haystack[hayID]):
            results[hayID] = levi.setratio(needle.split(' '),
                                           hayCompany.split(' '))
    scores = list(results.values())
    resultIDs = list(results.keys())
    needleID = resultIDs[scores.index(max(scores))]
    return [needleID, haystack[needleID], max(scores)]

def runMatch(args):
    """ Execute findNeedle and process results for poolWorker batch"""
    batch, first = args
    last = first + batch
    hayCompanies = getHayStack()
    needleCompanies = getTargets(first, last)
    needles = defaultdict(list)
    current = first
    for needleID, needleCompany in needleCompanies.iteritems():
        current += 1
        needles[needleID] = findNeedle(needleCompany, hayCompanies)
    ## Then store results

if __name__ == '__main__':
    pool = Pool(processes=numProcesses)
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    pool.map_async(runMatch,
                   itertools.izip(itertools.repeat(targetsPerBatch),
                                  xrange(0,
                                         totalTargets,
                                         targetsPerBatch))).get(99999999)
    pool.close()
    pool.join()
So I guess the questions are: How can I avoid loading the haystack for all workers - e.g. by sharing the data or taking a different approach like dividing the much larger haystack across workers rather than the needles? How can I otherwise improve memory usage by avoiding or eliminating clutter?
Your design is a bit confusing. You're using a pool of N workers, and then breaking your M jobs up into N tasks of size M/N. In other words, if you get that all correct, you're simulating worker processes on top of a pool built on top of worker processes. Why bother with that? If you want to use processes, just use them directly. Alternatively, use a pool as a pool: send each job as its own task, and use the batching feature to batch them up in some appropriate (and tweakable) way.
That means that runMatch just takes a single needleID and needleCompany, and all it does is call findNeedle and then do whatever that # Then store results part is. And then the main program gets a lot simpler:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        results = pool.map_async(runMatch, needleCompanies.iteritems(),
                                 chunksize=NUMBER_TWEAKED_IN_TESTING).get()
Or, if the results are small, instead of having all of the processes (presumably) fighting over some shared resulting-storing thing, just return them. Then you don't need runMatch at all, just:
if __name__ == '__main__':
    with Pool(processes=numProcesses) as pool:
        for result in pool.imap_unordered(findNeedle, needleCompanies.iteritems(),
                                          chunksize=NUMBER_TWEAKED_IN_TESTING):
            # Store result
Or, alternatively, if you do want to do exactly N batches, just create a Process for each one:
if __name__ == '__main__':
    totalTargets = len(getTargets('all'))
    targetsPerBatch = totalTargets / numProcesses
    processes = [Process(target=runMatch,
                         args=(targetsPerBatch,
                               xrange(0,
                                      totalTargets,
                                      targetsPerBatch)))
                 for _ in range(numProcesses)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
Also, you seem to be calling getHayStack() once for each task (and getNeedles as well). I'm not sure how easy it would be to end up with multiple copies of this live at the same time, but considering that it's the largest data structure you have by far, that would be the first thing I try to rule out. In fact, even if it's not a memory-usage problem, getHayStack could easily be a big performance hit, unless you're already doing some kind of caching (e.g., explicitly storing it in a global or a mutable default parameter value the first time, and then just using it), so it may be worth fixing anyway.
One way to fix both potential problems at once is to use an initializer in the Pool constructor:
def initPool():
    global _haystack
    _haystack = getHayStack()

def runMatch(args):
    global _haystack
    # ...
    hayCompanies = _haystack
    # ...

if __name__ == '__main__':
    pool = Pool(processes=numProcesses, initializer=initPool)
    # ...
Next, I notice that you're explicitly generating lists in multiple places where you don't actually need them. For example:
scores = list(results.values())
resultIDs = list(results.keys())
needleID = resultIDs[scores.index(max(scores))]
return [needleID, haystack[needleID], max(scores)]
If there's more than a handful of results, this is wasteful; just use the results.values() iterable directly. (In fact, it looks like you're using Python 2.x, in which case keys and values are already lists, so you're just making an extra copy for no good reason.)
But in this case, you can simplify the whole thing even farther. You're just looking for the key (resultID) and value (score) with the highest score, right? So:
needleID, score = max(results.items(), key=operator.itemgetter(1))
return [needleID, haystack[needleID], score]
This also eliminates all the repeated searches over score, which should save some CPU.
This may not directly solve the memory problem, but it should hopefully make it easier to debug and/or tweak.
The first thing to try is just to use much smaller batches—instead of input_size/cpu_count, try 1. Does memory usage go down? If not, we've ruled that part out.
Next, try sys.getsizeof(_haystack) and see what it says. If it's, say, 1.6GB, then you're cutting things pretty fine trying to squeeze everything else into 0.4GB, so that's the way to attack it—e.g., use a shelve database instead of a plain dict.
Also try dumping memory usage (with the resource module, getrusage(RUSAGE_SELF)) at the start and end of the initializer function. If the final haystack is only, say, 0.3GB, but you allocate another 1.3GB building it up, that's the problem to attack. For example, you might spin off a single child process to build and pickle the dict, then have the pool initializer just open it and unpickle it. Or combine the two—build a shelve db in the first child, and open it read-only in the initializer. Either way, this would also mean you're only doing the CSV-parsing/dict-building work once instead of 8 times.
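Here's a rough sketch of that build-once approach, assuming the haystack comes from a CSV; the file names and the load_rows helper are hypothetical stand-ins for your existing loading code:
# build_haystack.py - one-off: parse the CSV once and persist the id -> name mapping to disk
import shelve

def build(csv_path, db_path='haystack.db'):
    db = shelve.open(db_path, flag='n')                  # 'n' = always create a fresh db
    for hay_id, hay_company in load_rows(csv_path):      # load_rows: your existing CSV parser
        db[str(hay_id)] = hay_company                    # shelve keys must be strings
    db.close()

# in the worker module: open the prebuilt db read-only in the Pool initializer,
# so each worker reads from disk instead of holding its own giant dict
import shelve

def initPool():
    global _haystack
    _haystack = shelve.open('haystack.db', flag='r')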
On the other hand, if your total VM usage is still low (note that getrusage doesn't directly have any way to see your total VM size - ru_maxrss is often a useful approximation, especially if ru_nswap is 0) at the time the first task runs, the problem is with the tasks themselves.
First, getsizeof the arguments to the task function and the value you return. If they're large, especially if they either keep getting larger with each task or are wildly variable, it could just be pickling and unpickling that data takes too much memory, and eventually 8 of them are together big enough to hit the limit.
Otherwise, the problem is most likely in the task function itself. Either you've got a memory leak (you can only have a real leak by using a buggy C extension module or ctypes, but if you keep any references around between calls, e.g., in a global, you could just be holding onto things forever unnecessarily), or some of the tasks themselves take too much memory. Either way, this should be something you can test more easily by pulling out the multiprocessing and just running the tasks directly, which is a lot easier to debug.
Both list comprehensions and map calculations should -- at least in theory -- be relatively easy to parallelize: each calculation inside a list comprehension could be done independently of the calculation of all the other elements. For example, in the expression
[ x*x for x in range(1000) ]
each x*x calculation could (at least in theory) be done in parallel.
My question is: Is there any Python-Module / Python-Implementation / Python Programming-Trick to parallelize a list-comprehension calculation (in order to use all 16 / 32 / ... cores or distribute the calculation over a Computer-Grid or over a Cloud)?
As Ken said, it can't, but with 2.6's multiprocessing module, it's pretty easy to parallelize computations.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def square(n):
    return n * n

pool = multiprocessing.Pool(processes=cpus)
print(pool.map(square, range(1000)))
There are also examples in the documentation that show how to do this using Managers, which should allow for distributed computations as well.
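For illustration, here is a rough sketch of the remote-manager pattern from those docs, adapted to the squaring example; the address, port, and authkey are placeholders. One process serves shared queues, and workers on any machine connect, pull numbers, and push back squares:
# server.py - hypothetical sketch: expose shared queues over the network
from multiprocessing.managers import BaseManager
import queue

task_queue = queue.Queue()
result_queue = queue.Queue()

class QueueManager(BaseManager):
    pass

QueueManager.register('get_tasks', callable=lambda: task_queue)
QueueManager.register('get_results', callable=lambda: result_queue)

if __name__ == '__main__':
    for x in range(1000):
        task_queue.put(x)
    manager = QueueManager(address=('', 50000), authkey=b'change-me')
    manager.get_server().serve_forever()

# worker.py - run on any machine that can reach the server
from multiprocessing.managers import BaseManager

class QueueManager(BaseManager):
    pass

QueueManager.register('get_tasks')
QueueManager.register('get_results')

if __name__ == '__main__':
    manager = QueueManager(address=('server-host', 50000), authkey=b'change-me')
    manager.connect()
    tasks, results = manager.get_tasks(), manager.get_results()
    while not tasks.empty():
        n = tasks.get()
        results.put((n, n * n))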
For shared-memory parallelism, I recommend joblib:
from joblib import delayed, Parallel
def square(x): return x*x
values = Parallel(n_jobs=NUM_CPUS)(delayed(square)(x) for x in range(1000))
On automatic parallelisation of list comprehensions
IMHO, effective automatic parallelisation of list comprehensions would be impossible without additional information (such as that provided using directives in OpenMP), or without limiting it to expressions that involve only built-in types/methods.
Unless there is a guarantee that the processing done on each list item has no side effects, there is a possibility that the results will be invalid (or at least different) if done out of order.
# Artificial example
counter = 0

def g(x):  # func with side effect
    global counter
    counter = counter + 1
    return x + counter

vals = [g(i) for i in range(100)]  # different result when not done in order
There is also the issue of task distribution. How should the problem space be decomposed?
If the processing of each element forms a task (~ task farm), then when there are many elements each involving a trivial calculation, the overhead of managing the tasks will swamp out the performance gains of parallelisation.
One could also take the data decomposition approach where the problem space is divided equally among the available processes.
The fact that list comprehensions also work with generators makes this slightly tricky; however, this is probably not a show stopper if the overhead of pre-iterating them is acceptable. Of course, there is also the possibility of generators with side effects, which can change the outcome if subsequent items are prematurely iterated. Very unlikely, but possible.
A bigger concern would be load imbalance across processes. There is no guarantee that each element takes the same amount of time to process, so statically partitioned data may result in one process doing most of the work while the others idle their time away.
Breaking the list down to smaller chunks and handing them as each child process is available is a good compromise, however, a good selection of chunk size would be application dependent hence not doable without more information from the user.
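For instance, multiprocessing.Pool.map already exposes this compromise through its chunksize argument, which is the knob one would tune per application; a minimal illustration:
from multiprocessing import Pool

def work(x):
    return x * x

if __name__ == '__main__':
    with Pool() as pool:
        # larger chunks = less task-management overhead; smaller chunks = better load balance
        results = pool.map(work, range(100000), chunksize=1000)
    print(results[:5])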
Alternatives
As mentioned in several other answers, there are many approaches and parallel computing modules/frameworks to choose from, depending on one's requirements.
Having used only MPI (in C), with no experience using Python for parallel processing, I am not in a position to vouch for any (although, upon a quick scan through, multiprocessing, jug, pp and pyro stand out).
If a requirement is to stick as close as possible to list comprehension, then jug seems to be the closest match. From the tutorial, distributing tasks across multiple instances can be as simple as:
from glob import glob
from jug.task import Task
from yourmodule import process_data

tasks = [Task(process_data, infile) for infile in glob('*.dat')]
While that does something similar to multiprocessing.Pool.map(), jug can use different backends for synchronising process and storing intermediate results (redis, filesystem, in-memory) which means the processes can span across nodes in a cluster.
As the above answers point out, this is actually pretty hard to do automatically. Then I think the question is actually how to do it in the easiest way possible. Ideally, a solution wouldn't require you to know things like "how many cores do I have". Another property that you might want is to be able to still do the list comprehension in a single readable line.
Some of the given answers already seem to have nice properties like this, but another alternative is Ray (docs), which is a framework for writing parallel Python. In Ray, you would do it like this:
import ray

# Start Ray. This creates some processes that can do work in parallel.
ray.init()

# Add this decorator to signify that the function can be run in parallel (as a
# "task"). Ray will load-balance different `square` tasks automatically.
@ray.remote
def square(x):
    return x * x

# Create some parallel work using a list comprehension, then block until the
# results are ready with `ray.get`.
ray.get([square.remote(x) for x in range(1000)])
Using the futures.{Thread,Process}PoolExecutor.map(func, *iterables, timeout=None) and futures.as_completed(future_instances, timeout=None) functions from the new 3.2 concurrent.futures package could help.
It's also available as a 2.6+ backport.
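A minimal sketch of that approach for the squaring example (ProcessPoolExecutor, since x*x is CPU-bound; executor.map preserves the input order, much like the built-in map):
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(1000)))
    print(results[:10])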
No, because list comprehension itself is a sort of a C-optimized macro. If you pull it out and parallelize it, then it's not a list comprehension, it's just a good old fashioned MapReduce.
But you can easily parallelize your example. Here's a good tutorial on using MapReduce with Python's parallelization library:
http://mikecvet.wordpress.com/2010/07/02/parallel-mapreduce-in-python/
Not within a list comprehension AFAIK.
You could certainly do it with a traditional for loop and the multiprocessing/threading modules.
There is a comprehensive list of parallel packages for Python here:
http://wiki.python.org/moin/ParallelProcessing
I'm not sure if any handle the splitting of a list comprehension construct directly, but it should be trivial to formulate the same problem in a non-list comprehension way that can be easily forked to a number of different processors. I'm not familiar with cloud computing parallelization, but I've had some success with mpi4py on multi-core machines and over clusters. The biggest issue that you'll have to think about is whether the communication overhead is going to kill any gains you get from parallelizing the problem.
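For what it's worth, a rough mpi4py sketch of the data-decomposition idea applied to the squaring example might look like the following (run under mpiexec; the contiguous chunking scheme is just one possible choice):
# Run with e.g.: mpiexec -n 4 python square_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = list(range(1000))
    chunk_len = -(-len(data) // size)  # ceiling division: one contiguous chunk per rank
    chunks = [data[i * chunk_len:(i + 1) * chunk_len] for i in range(size)]
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)    # each rank receives its own chunk
local = [x * x for x in chunk]          # the per-rank "list comprehension"
gathered = comm.gather(local, root=0)   # root collects the partial lists

if rank == 0:
    results = [y for part in gathered for y in part]
    print(len(results), results[:5])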
Edit: The following might also be of interest:
http://www.mblondel.org/journal/2009/11/27/easy-parallelization-with-data-decomposition/
You can use asyncio. (Documentation can be found at https://docs.python.org/3/library/asyncio.html.) It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus it has both high-level and low-level APIs to accommodate any kind of problem.
import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def your_function(argument):
    # code
    ...
Now this function will be run in parallel whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well. When called in a for loop, the loop itself is sequential, but every iteration runs in parallel to the main program as soon as the interpreter gets there.
For your specific case, you can do:
import asyncio
import time

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def op(x):  # Do any operation you want
    time.sleep(1)
    print(f"function called for {x=}\n", end='')
    return x*x

loop = asyncio.get_event_loop()  # Have a new event loop
looper = asyncio.gather(*[op(i) for i in range(20)])  # Run the loop; doing 20 for a better demo
results = loop.run_until_complete(looper)  # Wait until finished

print('List comprehension has finished and results are gathered!')
print(results)
This produces the following output:
function called for x=5
function called for x=4
function called for x=2
function called for x=0
function called for x=6
function called for x=1
function called for x=7
function called for x=3
function called for x=8
function called for x=9
function called for x=10
function called for x=12
function called for x=11
function called for x=15
function called for x=13
function called for x=14
function called for x=16
function called for x=17
function called for x=18
function called for x=19
List comprehension has finished and results are gathered!
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225, 256, 289, 324, 361]
Note that all function calls ran in parallel, hence the shuffled prints; however, the original order is preserved in the resulting list.