How do I parallelize a simple Python loop?
This is probably a trivial question, but how do I parallelize the following loop in Python?
# setup output lists
output1 = list()
output2 = list()
output3 = list()

for j in range(0, 10):
    # calc individual parameter value
    parameter = j * offset
    # call the calculation
    out1, out2, out3 = calc_stuff(parameter=parameter)
    # put results into correct output list
    output1.append(out1)
    output2.append(out2)
    output3.append(out3)
I know how to start single threads in Python but I don't know how to "collect" the results.
Multiple processes would be fine too - whatever is easiest for this case. I'm currently using Linux, but the code should run on Windows and Mac as well.
What's the easiest way to parallelize this code?
Using multiple threads on CPython won't give you better performance for pure-Python code due to the global interpreter lock (GIL). I suggest using the multiprocessing module instead:
import multiprocessing

pool = multiprocessing.Pool(4)
out1, out2, out3 = zip(*pool.map(calc_stuff, range(0, 10 * offset, offset)))
Note that this won't work in the interactive interpreter.
To avoid the usual FUD around the GIL: There wouldn't be any advantage to using threads for this example anyway. You want to use processes here, not threads, because they avoid a whole bunch of problems.
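For completeness, here is a minimal, self-contained sketch of what the asker's loop could look like with multiprocessing.Pool; the body of calc_stuff and the offset value are placeholders, not the asker's real code:

import multiprocessing

def calc_stuff(parameter=None):
    # placeholder for the asker's real calculation
    return parameter, parameter ** 2, parameter ** 3

if __name__ == '__main__':
    offset = 2  # placeholder value
    parameters = [j * offset for j in range(10)]
    with multiprocessing.Pool(4) as pool:
        # pool.map preserves input order, so results line up with parameters
        results = pool.map(calc_stuff, parameters)
    # unzip the list of 3-tuples into three result lists
    output1, output2, output3 = map(list, zip(*results))
    print(output1, output2, output3)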
from joblib import Parallel, delayed
def process(i):
    return i * i
results = Parallel(n_jobs=2)(delayed(process)(i) for i in range(10))
print(results) # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
The above works beautifully on my machine (Ubuntu, package joblib was pre-installed, but can be installed via pip install joblib).
Taken from https://blog.dominodatalab.com/simple-parallelization/
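Applied to the question's loop, the same joblib pattern might look like the following sketch; the calc_stuff body and the offset value are placeholders standing in for the asker's real ones:

from joblib import Parallel, delayed

def calc_stuff(parameter=None):
    # placeholder for the asker's real calculation
    return parameter, parameter * 2, parameter * 3

offset = 2  # placeholder value

results = Parallel(n_jobs=4)(
    delayed(calc_stuff)(parameter=j * offset) for j in range(10)
)
# each result is an (out1, out2, out3) tuple; unzip into three lists
output1, output2, output3 = map(list, zip(*results))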
Edit on Mar 31, 2021: On joblib, multiprocessing, threading and asyncio
joblib in the above code uses import multiprocessing under the hood (and thus multiple processes), which is typically the best way to run CPU-bound work across cores, because of the GIL.
You can let joblib use multiple threads instead of multiple processes, but this (or using import threading directly) is only beneficial if the threads spend considerable time on I/O (e.g. reading/writing to disk, sending an HTTP request); for I/O work, the GIL does not block the execution of another thread (a minimal sketch of the thread-based option follows this list).
Since Python 3.7, as an alternative to threading, you can parallelise work with asyncio, but the same advice applies as for import threading (though in contrast to the latter, only one thread is used; on the plus side, asyncio has a lot of nice features which are helpful for async programming).
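A minimal sketch of the thread-based joblib option mentioned above, assuming the work is I/O-bound; the fetch function and the sleep are illustrative stand-ins, while prefer="threads" is the actual joblib argument:

import time
from joblib import Parallel, delayed

def fetch(i):
    time.sleep(1)  # stand-in for an I/O-bound call (HTTP request, disk read, ...)
    return i

# threads share one process; that is fine here because the GIL is released while waiting on I/O
results = Parallel(n_jobs=8, prefer="threads")(delayed(fetch)(i) for i in range(8))
print(results)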
Using multiple processes incurs overhead. Think about it: typically, each process needs to initialise/load everything you need to run your calculation. You should check for yourself whether the above code snippet improves your wall time. Here is another one, for which I confirmed that joblib produces better results:
import time
from joblib import Parallel, delayed
def countdown(n):
    while n > 0:
        n -= 1
    return n

t = time.time()
for _ in range(20):
    print(countdown(10**7), end=" ")
print(time.time() - t)
# takes ~10.5 seconds on medium sized Macbook Pro
t = time.time()
results = Parallel(n_jobs=2)(delayed(countdown)(10**7) for _ in range(20))
print(results)
print(time.time() - t)
# takes ~6.3 seconds on medium sized Macbook Pro
To parallelize a simple for loop, joblib brings a lot of value over raw use of multiprocessing: not only the short syntax, but also features like transparent batching of iterations when they are very fast (to remove the per-task overhead) and capturing the traceback of the child process for better error reporting.
Disclaimer: I am the original author of joblib.
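The batching mentioned above is exposed through Parallel's batch_size argument (it defaults to 'auto'); a small illustrative sketch with a deliberately tiny task, where batching matters most:

from joblib import Parallel, delayed

def tiny(i):
    return i + 1

# batch_size='auto' (the default) groups many fast tasks into one dispatch to cut overhead;
# an explicit value such as batch_size=100 can also be given
results = Parallel(n_jobs=2, batch_size='auto')(delayed(tiny)(i) for i in range(10000))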
This IS the easiest way to do it!
You can use asyncio. (Documentation can be found here). It is used as a foundation for multiple Python asynchronous frameworks that provide high-performance network and web servers, database connection libraries, distributed task queues, etc. Plus, it has both high-level and low-level APIs to accommodate any kind of problem.
import asyncio

def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def your_function(argument):
    # code
Now this function will run in parallel whenever it is called, without putting the main program into a wait state. You can use it to parallelize a for loop as well: the loop itself is still sequential, but every iteration runs in parallel to the main program as soon as the interpreter gets there.
1. Firing loop in parallel to main thread without any waiting
@background
def your_function(argument):
    time.sleep(5)
    print('function finished for ' + str(argument))

for i in range(10):
    your_function(i)
print('loop finished')
This produces the following output:
loop finished
function finished for 4
function finished for 8
function finished for 0
function finished for 3
function finished for 6
function finished for 2
function finished for 5
function finished for 7
function finished for 9
function finished for 1
Update: May 2022
Although this answers the original question, there are also ways to wait for the loops to finish, as requested in upvoted comments, so I am adding them here as well. The keys to these implementations are asyncio.gather() and run_until_complete(). Consider the following functions:
import asyncio
import time
def background(f):
    def wrapped(*args, **kwargs):
        return asyncio.get_event_loop().run_in_executor(None, f, *args, **kwargs)
    return wrapped

@background
def your_function(argument, other_argument):  # Added another argument
    time.sleep(5)
    print(f"function finished for {argument=} and {other_argument=}")

def code_to_run_before():
    print('This runs Before Loop!')

def code_to_run_after():
    print('This runs After Loop!')
2. Run in parallel but wait for finish
code_to_run_before() # Anything you want to run before, run here!
loop = asyncio.get_event_loop() # Have a new event loop
looper = asyncio.gather(*[your_function(i, 1) for i in range(1, 5)]) # Run the loop
results = loop.run_until_complete(looper) # Wait until finish
code_to_run_after() # Anything you want to run after, run here!
This produces the following output:
This runs Before Loop!
function finished for argument=2 and other_argument=1
function finished for argument=3 and other_argument=1
function finished for argument=1 and other_argument=1
function finished for argument=4 and other_argument=1
This runs After Loop!
3. Run multiple loops in parallel and wait for finish
code_to_run_before() # Anything you want to run before, run here!
loop = asyncio.get_event_loop() # Have a new event loop
group1 = asyncio.gather(*[your_function(i, 1) for i in range(1, 2)]) # Run all the loops you want
group2 = asyncio.gather(*[your_function(i, 2) for i in range(3, 5)]) # Run all the loops you want
group3 = asyncio.gather(*[your_function(i, 3) for i in range(6, 9)]) # Run all the loops you want
all_groups = asyncio.gather(group1, group2, group3) # Gather them all
results = loop.run_until_complete(all_groups) # Wait until finish
code_to_run_after() # Anything you want to run after, run here!
This produces the following output:
This runs Before Loop!
function finished for argument=3 and other_argument=2
function finished for argument=1 and other_argument=1
function finished for argument=6 and other_argument=3
function finished for argument=4 and other_argument=2
function finished for argument=7 and other_argument=3
function finished for argument=8 and other_argument=3
This runs After Loop!
4. Loops running sequentially but iterations of each loop running in parallel to one another
code_to_run_before()  # Anything you want to run before, run here!

for loop_number in range(3):
    loop = asyncio.get_event_loop()  # Have a new event loop
    looper = asyncio.gather(*[your_function(i, loop_number) for i in range(1, 5)])  # Run the loop
    results = loop.run_until_complete(looper)  # Wait until finish
    print(f"finished for {loop_number=}")

code_to_run_after()  # Anything you want to run after, run here!
This produces the following output:
This runs Before Loop!
function finished for argument=3 and other_argument=0
function finished for argument=4 and other_argument=0
function finished for argument=1 and other_argument=0
function finished for argument=2 and other_argument=0
finished for loop_number=0
function finished for argument=4 and other_argument=1
function finished for argument=3 and other_argument=1
function finished for argument=2 and other_argument=1
function finished for argument=1 and other_argument=1
finished for loop_number=1
function finished for argument=1 and other_argument=2
function finished for argument=4 and other_argument=2
function finished for argument=3 and other_argument=2
function finished for argument=2 and other_argument=2
finished for loop_number=2
This runs After Loop!
Update: June 2022
In its current form, this may not run on some versions of Jupyter Notebook, because Jupyter already runs its own event loop. To make it work on such Jupyter versions, nest_asyncio (which nests the event loop, as the name suggests) is the way to go. Just import and apply it at the top of the cell:
import nest_asyncio
nest_asyncio.apply()
And all the functionality discussed above should be accessible in a notebook environment as well.
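As a side note for newer Python versions (not part of the original answer): on Python 3.9+ a similar effect can be had with asyncio.to_thread and asyncio.run, which avoids handling the event loop manually. A minimal sketch:

import asyncio
import time

def your_function(argument, other_argument):
    time.sleep(5)  # blocking work runs in a worker thread via to_thread
    print(f"function finished for {argument=} and {other_argument=}")

async def main():
    # schedule every iteration as a thread-backed task and wait for all of them
    await asyncio.gather(*(asyncio.to_thread(your_function, i, 1) for i in range(1, 5)))

asyncio.run(main())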
What's the easiest way to parallelize this code?
Use a PoolExecutor from concurrent.futures. Compare the original code with this, side by side. First, the most concise way to approach this is with executor.map:
...
with ProcessPoolExecutor() as executor:
    for out1, out2, out3 in executor.map(calc_stuff, parameters):
        ...
or broken down by submitting each call individually:
...
with ThreadPoolExecutor() as executor:
    futures = []
    for parameter in parameters:
        futures.append(executor.submit(calc_stuff, parameter))
    for future in futures:
        out1, out2, out3 = future.result()  # this will block
        ...
Leaving the context signals the executor to free up resources.
You can use threads or processes and use the exact same interface.
A working example
Here is working example code that will demonstrate the value of both executor types:
Put this in a file - futuretest.py:
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from time import time
from http.client import HTTPSConnection
def processor_intensive(arg):
    def fib(n):  # recursive, processor intensive calculation (avoid n > 36)
        return fib(n-1) + fib(n-2) if n > 1 else n
    start = time()
    result = fib(arg)
    return time() - start, result

def io_bound(arg):
    start = time()
    con = HTTPSConnection(arg)
    con.request('GET', '/')
    result = con.getresponse().getcode()
    return time() - start, result

def manager(PoolExecutor, calc_stuff):
    if calc_stuff is io_bound:
        inputs = ('python.org', 'stackoverflow.com', 'stackexchange.com',
                  'noaa.gov', 'parler.com', 'aaronhall.dev')
    else:
        inputs = range(25, 32)
    timings, results = list(), list()
    start = time()
    with PoolExecutor() as executor:
        for timing, result in executor.map(calc_stuff, inputs):
            # put results into correct output list:
            timings.append(timing), results.append(result)
    finish = time()
    print(f'{calc_stuff.__name__}, {PoolExecutor.__name__}')
    print(f'wall time to execute: {finish-start}')
    print(f'total of timings for each call: {sum(timings)}')
    print(f'time saved by parallelizing: {sum(timings) - (finish-start)}')
    print(dict(zip(inputs, results)), end='\n\n')

def main():
    for computation in (processor_intensive, io_bound):
        for pool_executor in (ProcessPoolExecutor, ThreadPoolExecutor):
            manager(pool_executor, calc_stuff=computation)

if __name__ == '__main__':
    main()
And here's the output for one run of python -m futuretest:
processor_intensive, ProcessPoolExecutor
wall time to execute: 0.7326343059539795
total of timings for each call: 1.8033506870269775
time saved by parallelizing: 1.070716381072998
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}
processor_intensive, ThreadPoolExecutor
wall time to execute: 1.190223217010498
total of timings for each call: 3.3561410903930664
time saved by parallelizing: 2.1659178733825684
{25: 75025, 26: 121393, 27: 196418, 28: 317811, 29: 514229, 30: 832040, 31: 1346269}
io_bound, ProcessPoolExecutor
wall time to execute: 0.533886194229126
total of timings for each call: 1.2977914810180664
time saved by parallelizing: 0.7639052867889404
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}
io_bound, ThreadPoolExecutor
wall time to execute: 0.38941240310668945
total of timings for each call: 1.6049387454986572
time saved by parallelizing: 1.2155263423919678
{'python.org': 301, 'stackoverflow.com': 200, 'stackexchange.com': 200, 'noaa.gov': 301, 'parler.com': 200, 'aaronhall.dev': 200}
Processor-intensive analysis
When performing processor intensive calculations in Python, expect the ProcessPoolExecutor to be more performant than the ThreadPoolExecutor.
Due to the Global Interpreter Lock (a.k.a. the GIL), threads cannot use multiple processors, so expect the time for each calculation and the wall time (elapsed real time) to be greater.
IO-bound analysis
On the other hand, when performing IO bound operations, expect ThreadPoolExecutor to be more performant than ProcessPoolExecutor.
Python's threads are real OS threads. They can be put to sleep by the operating system and reawakened when their information arrives.
Final thoughts
I suspect that multiprocessing will be slower on Windows, since Windows doesn't support forking so each new process has to take time to launch.
You can nest multiple threads inside multiple processes, but it's recommended to not use multiple threads to spin off multiple processes.
If faced with a heavy processing problem in Python, you can trivially scale with additional processes - but not so much with threading.
There are a number of advantages to using Ray:
You can parallelize over multiple machines in addition to multiple cores (with the same code).
Efficient handling of numerical data through shared memory (and zero-copy serialization).
High task throughput with distributed scheduling.
Fault tolerance.
In your case, you could start Ray and define a remote function
import ray
ray.init()
@ray.remote(num_return_vals=3)
def calc_stuff(parameter=None):
    # Do something.
    return 1, 2, 3
and then invoke it in parallel
output1, output2, output3 = [], [], []

# Launch the tasks.
for j in range(10):
    id1, id2, id3 = calc_stuff.remote(parameter=j)
    output1.append(id1)
    output2.append(id2)
    output3.append(id3)
# Block until the results have finished and get the results.
output1 = ray.get(output1)
output2 = ray.get(output2)
output3 = ray.get(output3)
To run the same example on a cluster, the only line that would change would be the call to ray.init(). The relevant documentation can be found here.
Note that I'm helping to develop Ray.
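For reference, connecting to an existing cluster typically only changes the init call; a hedged sketch (the address value is illustrative):

import ray

# Connect to a running Ray cluster instead of starting a local one;
# "auto" picks up a cluster started with `ray start` on the same machine/network.
ray.init(address="auto")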
I found joblib very useful. Please see the following example:
from joblib import Parallel, delayed
def yourfunction(k):
    s = 3.14 * k * k
    print("Area of a circle with a radius ", k, " is:", s)

element_run = Parallel(n_jobs=-1)(delayed(yourfunction)(k) for k in range(1, 10))
n_jobs=-1: use all available cores
Dask futures; I'm surprised no one has mentioned it yet...
from dask.distributed import Client
client = Client(n_workers=8) # In this example I have 8 cores and processes (can also use threads if desired)
def my_function(i):
    output = <code to execute in the for loop here>
    return output

futures = []
for i in <whatever you want to loop across here>:
    future = client.submit(my_function, i)
    futures.append(future)
results = client.gather(futures)
client.close()
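As a slightly shorter variant, dask.distributed also provides client.map, which submits one task per element of an iterable; a self-contained sketch where calc_stuff and the parameter values are placeholders for the real ones:

from dask.distributed import Client

def calc_stuff(parameter):
    # placeholder for the real calculation
    return parameter, parameter * 2, parameter * 3

if __name__ == '__main__':
    client = Client(n_workers=4)
    parameters = [j * 2 for j in range(10)]
    futures = client.map(calc_stuff, parameters)   # one task per element
    results = client.gather(futures)               # list of (out1, out2, out3) tuples, in order
    output1, output2, output3 = map(list, zip(*results))
    client.close()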
Why don't you use threads, and one mutex to protect one global list?
from threading import Thread, Lock

class thread_it(Thread):
    def __init__(self, param):
        Thread.__init__(self)
        self.param = param
    def run(self):
        # the lock serialises access to the shared output list
        mutex.acquire()
        output.append(calc_stuff(self.param))
        mutex.release()

threads = []
output = []
mutex = Lock()

for j in range(0, 10):
    current = thread_it(j * offset)
    threads.append(current)
    current.start()

for t in threads:
    t.join()

# here you have the output list filled with data
Keep in mind, you will be as fast as your slowest thread.
Thanks @iuryxavier
from multiprocessing import Pool
from multiprocessing import cpu_count
def add_1(x):
    return x + 1

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(add_1, range(10**12))
    pool.close()  # 'TERM'
    pool.join()   # 'KILL'
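As a small aside, the close/join bookkeeping above can also be handled by using the pool as a context manager; a minimal sketch (the smaller range is only to keep the example quick):

from multiprocessing import Pool, cpu_count

def add_1(x):
    return x + 1

if __name__ == "__main__":
    # the with-block closes and joins the pool automatically on exit
    with Pool(cpu_count()) as pool:
        results = pool.map(add_1, range(10**6))
    print(results[:5])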
The concurrent wrappers by the tqdm library are a nice way to parallelize longer-running code. tqdm provides feedback on the current progress and remaining time through a smart progress meter, which I find very useful for long computations.
Loops can be rewritten to run as concurrent threads through a simple call to thread_map, or as concurrent multi-processes through a simple call to process_map:
from tqdm.contrib.concurrent import thread_map, process_map
def calc_stuff(num, multiplier):
    import time
    time.sleep(1)
    return num, num * multiplier

if __name__ == "__main__":
    # let's parallelize this for loop:
    # results = [calc_stuff(i, 2) for i in range(64)]
    loop_idx = range(64)
    multiplier = [2] * len(loop_idx)

    # either with threading:
    results_threading = thread_map(calc_stuff, loop_idx, multiplier)

    # or with multi-processing:
    results_processes = process_map(calc_stuff, loop_idx, multiplier)
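Both helpers also accept the usual executor keyword arguments; for example, the number of workers can be capped. A hedged, self-contained sketch (the values 4 and 2 are arbitrary):

from tqdm.contrib.concurrent import process_map

def calc_stuff(num, multiplier):
    return num, num * multiplier

if __name__ == "__main__":
    # max_workers caps the pool size; chunksize batches items per worker dispatch
    results = process_map(calc_stuff, range(64), [2] * 64, max_workers=4, chunksize=2)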
Let's say we have an async function
async def work_async(self, student_name: str, code: str, loop):
    """
    Some async function
    """
    # Do some async processing
That needs to be run over a large array. Some arguments are passed to the program and some come from a property of a dictionary element in the array.
async def process_students(self, student_name: str, loop):
    market = sys.argv[2]
    subjects = [...]  # Some large array
    batchsize = 5
    for i in range(0, len(subjects), batchsize):
        batch = subjects[i:i+batchsize]
        await asyncio.gather(*(self.work_async(student_name,
                                               sub['Code'],
                                               loop)
                               for sub in batch))
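Stripped of the class and command-line context, the batching idea looks roughly like this self-contained sketch; the work_async body and the sleep are illustrative stand-ins:

import asyncio

async def work_async(item):
    await asyncio.sleep(0.1)  # stand-in for an awaitable call (HTTP request, DB query, ...)
    return item * 2

async def process_all(items, batchsize=5):
    results = []
    # process the list in batches so only `batchsize` tasks are in flight at once
    for i in range(0, len(items), batchsize):
        batch = items[i:i + batchsize]
        results.extend(await asyncio.gather(*(work_async(x) for x in batch)))
    return results

print(asyncio.run(process_all(list(range(12)))))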
This could be useful when implementing multiprocessing and parallel/distributed computing in Python.
YouTube tutorial on using techila package
Techila is a distributed computing middleware, which integrates directly with Python using the techila package. The peach function in the package can be useful in parallelizing loop structures. (Following code snippet is from the Techila Community Forums)
techila.peach(funcname='theheavyalgorithm',   # Function that will be called on the compute nodes/Workers
              files='theheavyalgorithm.py',   # Python file that will be sourced on Workers
              jobs=jobcount                   # Number of Jobs in the Project
              )
Have a look at this:
http://docs.python.org/library/queue.html
This might not be the right way to do it, but I'd do something like the following. Actual code:
from multiprocessing import Process, JoinableQueue as Queue
from queue import Empty

class CustomWorker(Process):
    def __init__(self, workQueue, out1, out2, out3):
        Process.__init__(self)
        self.input = workQueue
        self.out1 = out1
        self.out2 = out2
        self.out3 = out3

    def run(self):
        while True:
            try:
                value = self.input.get(timeout=1)
                # value modifier
                temp1, temp2, temp3 = self.calc_stuff(value)
                self.out1.put(temp1)
                self.out2.put(temp2)
                self.out3.put(temp3)
                self.input.task_done()
            except Empty:
                # Catch things better here
                return

    def calc_stuff(self, param):
        out1 = param * 2
        out2 = param * 4
        out3 = param * 8
        return out1, out2, out3

def Main():
    inputQueue = Queue()
    for i in range(10):
        inputQueue.put(i)
    out1 = Queue()
    out2 = Queue()
    out3 = Queue()
    processes = []
    for x in range(2):
        p = CustomWorker(inputQueue, out1, out2, out3)
        p.daemon = True
        p.start()
        processes.append(p)
    inputQueue.join()
    while not out1.empty():
        print(out1.get())
        print(out2.get())
        print(out3.get())

if __name__ == '__main__':
    Main()
Hope that helps.
A very simple example of parallel processing is:
from multiprocessing import Process
output1 = list()
output2 = list()
output3 = list()

def yourfunction():
    for j in range(0, 10):
        # calc individual parameter value
        parameter = j * offset
        # call the calculation
        out1, out2, out3 = calc_stuff(parameter=parameter)
        # put results into correct output list
        # note: these lists are filled in the child process; use a Queue/Pipe or Pool
        # if the parent process needs the results back
        output1.append(out1)
        output2.append(out2)
        output3.append(out3)

if __name__ == '__main__':
    p = Process(target=yourfunction)
    p.start()
    p.join()