How do you run a function repeatedly in parallel?
For example, I have a function that takes no parameters and has a stochastic element. I want to run it multiple times, which is illustrated below using a for loop. How do I accomplish the same in parallel please?
import numpy as np
def f():
    x = np.random.uniform()
    return x*x

np.random.seed(1)
a = []
for i in range(10):
    a.append(f())
This is close to a duplicate of parallel-python-just-run-function-n-times; however, the answer there doesn't quite fit, as it passes different inputs into the function, and How do I parallelize a simple Python loop? also gives examples of passing different parameters into the function rather than repeating the same call.
I am on Windows 10 and using Jupyter.
In regards to my real use:
Does it produce a large volume of output per call?
Each iteration of the loop produces one number.
Do you need to keep the output? How long does each invocation take roughly?
Yes, I need to retain the numbers and it takes ~30 minutes per iteration.
How many times do you need to run it in total?
At least 100.
Do you want to parallelize across multiple machines or just multiple cores?
Currently just across multiple cores.
If you don't want to pass any input to your function, just use a throwaway variable _ as the argument and parallelise it as shown in the code below.
import numpy as np
from multiprocessing.pool import Pool

def f(_):
    x = np.random.uniform()
    return x*x

if __name__ == "__main__":
    processes = 5  # specify the number of processes here
    p = Pool(processes)
    p.map(f, range(10))
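Since the asker needs to retain the returned numbers, note that Pool.map gives them back as a list; a minimal sketch reusing the same f, with the call count of 100 taken from the comments above:

if __name__ == "__main__":
    with Pool(5) as p:
        results = p.map(f, range(100))  # list of 100 squared uniform draws
    print(len(results))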
Update:
To answer your updated question: if your tasks aren't too heavyweight and are just I/O bound, then I recommend you use ThreadPool (multithreading) instead of Pool (multiprocessing).
Code to create a ThreadPool:
from multiprocessing.pool import ThreadPool
threads = 5
t = ThreadPool(threads)
t.map(f, range(10))
Related
I am using the multiprocessing module in python 3.7 to call a function repeatedly in parallel. I would like to write the results out to a file every k iterations. (It can be a different file each time.)
Below is my first attempt, which basically loops over sets of function arguments, running each set in parallel and writing the results to a file before moving onto the next set. This is obviously very inefficient. In practice, the time it takes for my function to run is much longer and varies depending on the input values, so many processors sit idle between iterations of the loop.
Is there a more efficient way to achieve this?
import multiprocessing as mp
import numpy as np
import pandas as pd
def myfunction(x):  # toy example function
    return x**2

for start in np.arange(0, 500, 100):
    with mp.Pool(mp.cpu_count()) as pool:
        out = pool.map(myfunction, np.arange(start, start+100))
    pd.DataFrame(out).to_csv('filename_'+str(start//100+1)+'.csv', header=False, index=False)
My first comment is that if myfunction is as trivial as the one you have shown, then your performance will be worse using multiprocessing, because there is overhead in creating the process pool (which, by the way, you are unnecessarily creating over and over in each loop iteration) and in passing arguments from one process to another.
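For reference, a minimal sketch of just that one fix (keeping the question's filenames and batch size) creates the pool once and reuses it across iterations:

import multiprocessing as mp
import numpy as np
import pandas as pd

def myfunction(x):  # toy example function
    return x ** 2

if __name__ == '__main__':
    with mp.Pool(mp.cpu_count()) as pool:  # create the pool once, reuse it for every batch
        for start in np.arange(0, 500, 100):
            out = pool.map(myfunction, np.arange(start, start + 100))
            pd.DataFrame(out).to_csv('filename_' + str(start // 100 + 1) + '.csv',
                                     header=False, index=False)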
Assuming that myfunction is pure CPU, once map has returned 100 values there is an opportunity to overlap the writing of the CSV file with further computation, which you are not currently taking advantage of (it's not clear how much performance will be improved by concurrent disk writing; it depends on the type of drive you have, head movement, etc.). A combination of multithreading and multiprocessing could therefore be the solution.
The number of processes in your processing pool should be limited to the number of CPU cores, on the assumption that myfunction is 100% CPU, does not release the Global Interpreter Lock, and therefore cannot take advantage of a pool size greater than the number of CPUs you have. That is my assumption; if you are going to be using certain numpy functions, for example, it is an erroneous one. On the other hand, numpy uses multiprocessing for some of its own processing, in which case combining numpy with your own multiprocessing could result in worse performance.
Your current code only uses numpy for generating ranges, which seems like overkill, as there are other means of generating ranges. I have taken the liberty of generating the ranges in a slightly different fashion, by defining START and STOP values and N_SPLITS, the number of (as nearly as possible) equal divisions of that range, and generating tuples of start and stop values that can be converted into ranges. I hope this is not too confusing, but it seemed to be a more flexible approach.
In the following code both a thread pool and a processing pool are created. The tasks are submitted to the thread pool with one of the arguments being the processing pool, which is used by the worker to do the CPU-intensive calculations; when the results have been assembled, the worker writes out the CSV file.
from multiprocessing.pool import Pool, ThreadPool
from multiprocessing import cpu_count
import pandas as pd
def worker(process_pool, index, split_range):
    out = process_pool.map(myfunction, range(*split_range))
    pd.DataFrame(out).to_csv(f'filename_{index}.csv', header=False, index=False)

def myfunction(x):  # toy example function
    return x ** 2

def split(start, stop, n):
    k, m = divmod(stop - start, n)
    return [(i * k + min(i, m), (i + 1) * k + min(i + 1, m)) for i in range(n)]

def main():
    RANGE_START = 0
    RANGE_STOP = 500
    N_SPLITS = 5
    n_processes = min(N_SPLITS, cpu_count())
    split_ranges = split(RANGE_START, RANGE_STOP, N_SPLITS)  # [(0, 100), (100, 200), ... (400, 500)]
    process_pool = Pool(n_processes)
    thread_pool = ThreadPool(N_SPLITS)
    for index, split_range in enumerate(split_ranges):
        thread_pool.apply_async(worker, args=(process_pool, index, split_range))
    # wait for all threading tasks to complete:
    thread_pool.close()
    thread_pool.join()

# required for Windows:
if __name__ == '__main__':
    main()
I am attempting to run an infinite while loop that will call a function that makes an API call. Each API call can take between 9-12 seconds and I want x processes constantly running.
I've attempted some of the existing pool examples but they seem to take a function along with a list of arguments. My function generates all inputs needed.
from multiprocessing import Process
from random import randint
from time import sleep
def loop_a():
    while 1:
        wait = randint(5, 9)
        print("now waiting " + str(wait) + " seconds")
        sleep(wait)

if __name__ == '__main__':
    # Initialize two separate while loops that can call/wait for responses independently
    Process(target=loop_a).start()
    Process(target=loop_a).start()
This sample code I found from another question mostly solves my problem, but I am wondering if there is an elegant way to define how many processes to run at once. I'd like to be able to enter the number of processes as a parameter rather than defining a new line for each process. Is there a nice way to accomplish this?
This snippet seems to fix the issue I was facing.
if __name__ == '__main__':
    [Process(target=loop_a).start() for x in range(2)]
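If the idea is to make the process count a parameter instead of hard-coding range(2), a minimal sketch along the same lines (n_processes is just an illustrative name, not from the original post) could be:

if __name__ == '__main__':
    n_processes = 4  # hypothetical: could come from argparse or a config value
    processes = [Process(target=loop_a) for _ in range(n_processes)]
    for p in processes:
        p.start()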
You should use Pool
from multiprocessing import Pool
pool = Pool(processes=2) # define number of processes
results = [pool.apply_async(loop_a) for x in range(2)]  # apply_async so both calls run concurrently
I am having some trouble parallelizing a function.
I would like to use a function several times in parallel:
def f_diff_coord(vect1, vect2):
    return vect1 - vect2
This function needs to be computed several times with different vectors inside a general for loop.
So my code is written this way:
import numpy as np
from multiprocessing import Pool

def f_diff_coord(vect1, vect2):
    return vect1 - vect2, vect1 + vect2

if __name__ == '__main__':
    p = Pool(3)
    for _ in range(many_times):  # many_times: however many iterations are needed
        vect_a = np.arange(10) + 2
        vect_b = np.arange(10)
        vect_c = np.arange(10)
        vect_d = np.arange(10) + 3
        # vect_* are just examples
        r1 = p.apply_async(f_diff_coord, (vect_a, vect_b))
        r2 = p.apply_async(f_diff_coord, (vect_c, vect_d))
        data_a = r1.get()
        data_b = r2.get()
        # do something with data_*
I run this kind of code with a pool of 3 processes and it does appear to be parallelized (in my Task Manager on Windows). However, the computation time is considerably longer than in the serial code. Am I missing something, or is it just that starting up several processes takes a lot of time?
use the multiprocessing module to parallelise your calls
I've been following the docs to try to understand multiprocessing pools. I came up with this:
import time
from multiprocessing import Pool
def f(a):
    print 'f(' + str(a) + ')'
    return True
t = time.time()
pool = Pool(processes=10)
result = pool.apply_async(f, (1,))
print result.get()
pool.close()
print ' [i] Time elapsed ' + str(time.time() - t)
I'm trying to use 10 processes to evaluate the function f(a). I've put a print statement in f.
This is the output I'm getting:
$ python pooltest.py
f(1)
True
[i] Time elapsed 0.0270888805389
It appears to me that the function f is only getting evaluated once.
I'm likely not using the right method, but the end result I'm looking for is to run f with 10 processes simultaneously and get the result returned by each one of those processes. So I would end up with a list of 10 results (which may or may not be identical).
The docs on multiprocessing are quite confusing, and it's not trivial to figure out which approach I should be taking; it seems to me that f should be run 10 times in the example I provided above.
apply_async isn't meant to launch multiple processes; it's just meant to call the function with the arguments in one of the processes of the pool. You'll need to make 10 calls if you want the function to be called 10 times.
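A minimal sketch of that, reusing the pool and f from the question (only the repeated apply_async calls and the get() loop are new):

pool = Pool(processes=10)
async_results = [pool.apply_async(f, (1,)) for _ in range(10)]  # issue 10 separate calls
results = [r.get() for r in async_results]                      # collect the 10 return values
pool.close()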
First, note the docs on apply() (emphasis added):
apply(func[, args[, kwds]])
Call func with arguments args and keyword arguments kwds. It blocks
until the result is ready. Given this blocks, apply_async() is better
suited for performing work in parallel. Additionally, func is only
executed in one of the workers of the pool.
Now, in the docs for apply_async():
apply_async(func[, args[, kwds[, callback[, error_callback]]]])
A variant of the apply() method which returns a result object.
The difference between the two is just that apply_async returns immediately. You can use map() to call a function multiple times, though if you're calling with the same inputs, then it's a little redundant to create the list of the same argument just to have a sequence of the right length.
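That redundant-but-workable version would look something like this (a sketch using the question's f and a pool created as above):

results = pool.map(f, [1] * 10)  # the same argument repeated just to get 10 calls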
However, if you're calling different functions with the same input, then you're really just calling a higher order function, and you could do it with map or map_async() like this:
pool.map(lambda f: f(1), functions)
except that lambda functions aren't pickleable, so you'd need to use a defined function (see How to let Pool.map take a lambda function). You can actually use the builtin apply() (not the multiprocessing one) (although it's deprecated):
pool.map(apply, [(f, 1) for f in functions])
It's easy enough to write your own, too:
def apply_(args):
    f, x = args  # each work item is a (function, argument) tuple
    return f(x)

pool.map(apply_, [(f, 1) for f in functions])
Each time you write pool.apply_async(...) it will delegate that function call to one of the processes that was started in the pool. If you want to call the function in multiple processes, you need to issue multiple pool.apply_async calls.
Note, there also exists a pool.map (and pool.map_async) function which will take a function and an iterable of inputs:
inputs = range(30)
results = pool.map(f, inputs)
These functions will apply the function to each input in the inputs iterable. The pool hands out "batches" of inputs so that the load gets balanced fairly evenly among all the processes in the pool.
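For reference, the batch size can also be set explicitly through map's optional chunksize argument; a small sketch reusing the names above:

results = pool.map(f, inputs, chunksize=5)  # hand each worker 5 inputs at a time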
If you want to run a single piece of code in ten processes, each of which then exits, a Pool of ten processes is probably not the right thing to use.
Instead, create ten Processes to run the code:
import multiprocessing

processes = []
for _ in range(10):
    p = multiprocessing.Process(target=f, args=(1,))
    p.start()
    processes.append(p)

for p in processes:
    p.join()
The multiprocessing.Pool class is designed to handle situations where the number of processes and the number of jobs are unrelated. Often the number of processes is selected to be the number of CPU cores you have, while the number of jobs is much larger.
If you aren't committed to Pool for any particular reason, I've written a function around multiprocessing.Process that will probably do the trick for you. It's posted here, but I'd be happy to upload the most recent version to github if you want it.
I am looking for a definitive answer to MATLAB's parfor for Python (Scipy, Numpy).
Is there a solution similar to parfor? If not, what is the complication for creating one?
UPDATE: Here is a typical numerical computation code that I need speeding up
import numpy as np
N = 2000
output = np.zeros([N,N])
for i in range(N):
    for j in range(N):
        output[i,j] = HeavyComputationThatIsThreadSafe(i,j)
An example of a heavy computation function is:
import scipy.optimize
def HeavyComputationThatIsThreadSafe(i,j):
    n = i * j
    return scipy.optimize.anneal(lambda x: np.sum((x-np.arange(n)**2)), np.random.random((n,1)))[0][0,0]
The one built in to Python would be multiprocessing; the docs are here. I always use multiprocessing.Pool with as many workers as processors. Then whenever I need a for-loop-like structure I use Pool.imap.
As long as the body of your function does not depend on any previous iteration, you should have near-linear speed-up. This also requires that your inputs and outputs are pickle-able, but that is pretty easy to ensure for standard types.
UPDATE:
Some code for your updated function just to show how easy it is:
from multiprocessing import Pool
from itertools import product
import numpy as np

output = np.zeros((N, N))
pool = Pool()    # defaults to the number of available CPUs
chunksize = 20   # this may take some guessing ... take a look at the docs to decide
# Fun is assumed to be a small wrapper that unpacks an (i, j) tuple and
# calls HeavyComputationThatIsThreadSafe(i, j)
for ind, res in enumerate(pool.imap(Fun, product(xrange(N), xrange(N)), chunksize)):
    output.flat[ind] = res
There are many Python frameworks for parallel computing. The one I happen to like most is IPython, but I don't know too much about any of the others. In IPython, one analogue to parfor would be client.MultiEngineClient.map() or some of the other constructs in the documentation on quick and easy parallelism.
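The MultiEngineClient interface has since been superseded by the standalone ipyparallel package; a rough sketch of the modern equivalent, assuming a set of engines has already been started (e.g. with ipcluster start -n 4):

import ipyparallel as ipp

def f(i):
    return i * i  # stand-in for the heavy computation

rc = ipp.Client()                      # connect to the running engines
view = rc.load_balanced_view()
results = view.map_sync(f, range(10))  # blocking, parfor-style map across the engines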
Jupyter Notebook
To see an example, consider that you want to write the equivalent of this MATLAB code in Python:
matlabpool open 4
parfor n=0:9
    for i=1:10000
        for j=1:10000
            s=j*i
        end
    end
    n
end
disp('done')
Here is the way one may write this in Python, particularly in a Jupyter notebook. You have to create a file in the working directory (I called it FunForParFor.py) containing the following function:
def func(n):
    for i in range(10000):
        for j in range(10000):
            s=j*i
    print(n)
Then I go to my Jupyter notebook and write the following code
import multiprocessing
import FunForParFor
if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4)
    pool.map(FunForParFor.func, range(10))
    pool.close()
    pool.join()
    print('done')
This has worked for me! I just wanted to share it here to give you a particular example.
This can be done elegantly with Ray, a system that allows you to easily parallelize and distribute your Python code.
To parallelize your example, you'd need to define your functions with the @ray.remote decorator, and then invoke them with .remote.
import numpy as np
import time
import ray
ray.init()
# Define the function. Each remote function will be executed
# in a separate process.
@ray.remote
def HeavyComputationThatIsThreadSafe(i, j):
    n = i*j
    time.sleep(0.5)  # Simulate some heavy computation.
    return n

N = 10
output_ids = []
for i in range(N):
    for j in range(N):
        # Remote functions return a future, i.e., an identifier to the
        # result, rather than the result itself. This allows invoking
        # the next remote function before the previous one finished, which
        # leads to the remote functions being executed in parallel.
        output_ids.append(HeavyComputationThatIsThreadSafe.remote(i,j))
# Get results when ready.
output_list = ray.get(output_ids)
# Move results into an NxN numpy array.
outputs = np.array(output_list).reshape(N, N)
# This program should take approximately N*N*0.5s/p, where
# p is the number of cores on your machine, N*N
# is the number of times we invoke the remote function,
# and 0.5s is the time it takes to execute one instance
# of the remote function. For example, for two cores this
# program will take approximately 25sec.
There are a number of advantages of using Ray over the multiprocessing module. In particular, the same code will run on a single machine as well as on a cluster of machines. For more advantages of Ray see this related post.
Note: One point to keep in mind is that each remote function is executed in a separate process, possibly on a different machine, and thus the remote function's computation should take longer than the overhead of invoking it. As a rule of thumb, a remote function's computation should take at least a few tens of milliseconds to amortize the scheduling and startup overhead of a remote call.
I've always used Parallel Python but it's not a complete analog since I believe it typically uses separate processes which can be expensive on certain operating systems. Still, if the body of your loops are chunky enough then this won't matter and can actually have some benefits.
I tried all the solutions here, but found that the simplest and closest equivalent to MATLAB's parfor is Numba's prange.
Essentially you change a single letter in your loop, range to prange:
from numba import autojit, prange

@autojit
def parallel_sum(A):
    sum = 0.0
    for i in prange(A.shape[0]):
        sum += A[i]
    return sum
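Note that autojit has been removed in recent Numba releases; as far as I know, the current equivalent is njit with parallel=True:

from numba import njit, prange

@njit(parallel=True)
def parallel_sum(A):
    total = 0.0
    for i in prange(A.shape[0]):  # prange marks this loop for parallel execution
        total += A[i]
    return total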
I recommend trying joblib Parallel.
one liner
from joblib import Parallel, delayed
out = Parallel(n_jobs=2)(delayed(heavymethod)(i) for i in range(10))
instructional
instead of taking a for loop
from time import sleep
for _ in range(10):
sleep(.2)
rewrite your operation into a list comprehension
[sleep(.2) for _ in range(10)]
Now let us not directly evaluate the expression, but collect what should be done.
This is what the delayed method is for.
from joblib import delayed
[delayed(sleep)(.2) for _ in range(10)]
Next, instantiate a parallel runner with n_jobs workers and process the list.
from joblib import Parallel
r = Parallel(n_jobs=2, verbose=10)(delayed(sleep)(.2) for _ in range(10))
[Parallel(n_jobs=2)]: Done 1 tasks | elapsed: 0.6s
[Parallel(n_jobs=2)]: Done 4 tasks | elapsed: 0.8s
[Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 1.4s finished
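Applied to the parfor question above, the same pattern might look roughly like this (a sketch, assuming HeavyComputationThatIsThreadSafe is defined as in the question):

from joblib import Parallel, delayed
import numpy as np

N = 2000
results = Parallel(n_jobs=-1)(                      # n_jobs=-1 uses all available cores
    delayed(HeavyComputationThatIsThreadSafe)(i, j)
    for i in range(N) for j in range(N)
)
output = np.array(results).reshape(N, N)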
Ok, I'll also give it a go, let's see if my way is easier
from multiprocessing import Pool
def heavy_func(key):
    # do some heavy computation on each key
    output = key**2
    return key, output

output_data = {}      # <-- this dict will store the results
keys = [1,5,7,8,10]   # <-- compute heavy_func over all the values of keys

with Pool(processes=40) as pool:
    for i in pool.imap_unordered(heavy_func, keys):
        output_data[i[0]] = i[1]
Now output_data is a dictionary that contains, for every key, the result of the computation on that key.
That is it.