Returning values from multiprocessing Pool function - python

I want to run a loop in parallel using Pool and store each result from the function's return value at an index of a numpy array. I have written a basic function here; the real one is a bit more complex. Even with this basic one I am not getting the desired output: printing results at the end gives me 100 different arrays of 100 values instead of one array of 100 values. How do I solve this, or is there a better way to store the return values? I have to take the mean and std of rejects after the pool finishes.
from multiprocessing import Pool
import numpy as np

rejects = np.zeros(100)

def func(i):
    print("this is:", i)
    rejects[i] = i
    # print(rejects)
    return rejects

def main():
    l = [*range(1, 100, 1)]
    pool = Pool(3)
    results = pool.map(func, l)
    pool.close()
    pool.join()
    print(results)

if __name__ == '__main__':
    main()

You are returning the entire rejects array from every call to func, so pool.map collects one whole array per input value instead of one value. (Each worker process also operates on its own copy of rejects, so the in-place assignments never reach the parent process.) Return just the value instead; you can use the func below:
def func(i):
    print("this is:", i)
    rejects = i  # this is where I have changed
    # print(rejects)
    return rejects
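Since you need the mean and std of rejects after the pool finishes, here is a minimal sketch (my addition, using the simplified func above) of collecting the mapped values back into one numpy array in the parent process:

from multiprocessing import Pool
import numpy as np

def func(i):
    return i  # simplified func: return the value, not an array

if __name__ == '__main__':
    with Pool(3) as pool:
        results = pool.map(func, range(1, 100))
    rejects = np.array(results)  # one array of the returned values, built in the parent
    print(rejects.mean(), rejects.std())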

Related

Python Multiprocessing: Running each iteration of multiple for loops in parallel

I have two loops in Python; here is some pseudocode. I would like to run both functions, and every iteration of each function, at the same time. So in this example there would be 8 processes going on at once. I know you can use Process, but I just don't know how to incorporate an iterable. Please let me know, thanks!
import multiprocessing
from multiprocessing import freeze_support

def example1(iteration):
    print('stuff')

def example2(iteration):
    print('stuff')

if __name__ == '__main__':
    freeze_support()
    pool = multiprocessing.Pool(4)
    iteration = [1, 2, 3, 4]
    pool.map(example1, iteration)
Assuming they don't need to be kicked off at exactly the same time, I think map_async is what you want.
In the example below we can print the result from example2 before example1 has finished, even though example1 was kicked off first.
import multiprocessing
import time

def example1(iteration):
    time.sleep(1)
    return 1

def example2(iteration):
    return 2

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    iteration = [1, 2, 3, 4]
    result1 = pool.map_async(example1, iteration)
    result2 = pool.map_async(example2, iteration)
    print(result2.get())
    print(result1.get())
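If you do want all eight calls in flight at once in a single pool, here is a sketch (my variation, not from the original answer) using apply_async to submit every iteration of both functions up front:

import multiprocessing
import time

def example1(iteration):
    time.sleep(1)
    return 1

def example2(iteration):
    return 2

if __name__ == '__main__':
    pool = multiprocessing.Pool(8)
    iteration = [1, 2, 3, 4]
    # submit every iteration of both functions before collecting anything
    handles = [pool.apply_async(f, (i,))
               for f in (example1, example2)
               for i in iteration]
    print([h.get() for h in handles])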

Python multiprocessing pool doesn't take an iterable as an argument

I've read many posts here about multiprocessing.Pool, but I still don't understand where the problem in my code is.
I want to parallelize a function using a multiprocessing pool in Python. The function takes one argument and returns two values. I want this one argument to be an integer and want to iterate over it. I've tried the examples I've seen here, but it doesn't work for me (apparently I'm doing something wrong, but what?)
My code:
import multiprocessing

def function(num):
    res1 = num ** 2   # calculate something
    res2 = num + num  # calculate something
    return res1, res2

if __name__ == '__main__':
    num = 10
    pool = multiprocessing.Pool(processes=4)
    # next line works, but with [something, something, ...] as an argument
    result = pool.map(function, [1, 100, 10000])
    # next line doesn't work and I have no idea why!
    result2 = pool.map(function, range(num))
    pool.close()
    pool.join()
    print(result2)
I get TypeError: 'float' object is not subscriptable when I calculate result2.
Would be grateful for help!
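For what it's worth, Pool.map accepts any iterable, range included, so the call pattern itself is valid; here is a minimal self-contained check (my sketch, not from the thread):

from multiprocessing import Pool

def function(num):
    return num ** 2, num + num

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # map() consumes any iterable; range(10) behaves just like a list here
        print(pool.map(function, range(10)))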

Multiprocessing with large array in state

I've got a class that stores a large numpy array in its state. This is causing multiprocessing.Pool to become extremely slow. Here's an MRE:
from multiprocessing import Pool
import numpy
import time
from tqdm import tqdm

class MP(object):
    def __init__(self, mat):
        self.mat = mat

    def foo(self, x):
        time.sleep(1)
        return x * x + self.mat.shape[0]

    def bar(self, arr):
        results = []
        with Pool() as p:
            for x in tqdm(p.imap(self.foo, arr)):
                results.append(x)
        return results

if __name__ == '__main__':
    x = numpy.arange(8)
    mat = numpy.random.random((1, 1))
    h = MP(mat)
    res = h.bar(x)
    print(res)
I've got 4 CPU cores, which means this code should (and does) run in approximately 2 seconds. (The tqdm progress bar shows the 2 seconds; it's not essential to the example.) However, in the main program, if I do mat = numpy.random.random((10000, 10000)), it takes forever to run. I suspect this is because Pool is making copies of mat for each worker, but I'm not sure how that works, since mat is part of the class's state and not directly involved in the imap call. So, my questions are:
Why is this behavior happening? (i.e., how does Pool work within a Class? What exactly does it pickle? What copies are made, and what is passed by reference?)
What is a viable workaround to this problem?
Edit: Modified foo to make use of mat, which is more representative of my real problem.
If, as you say, mat is not directly involved in the imap call, I'm guessing the state of MP in general is not used in the call (if it is, comment below and I'll remove this answer). If that's the case, you should write foo as a free function instead of a method of MP. The reason mat is getting copied right now is that each execution of foo requires self to be pickled and sent to the worker, and self contains self.mat.
The following executes quickly regardless of the size of mat:
from multiprocessing import Pool
import numpy
import time
from tqdm import tqdm

class MP(object):
    def __init__(self, mat):
        self.mat = mat

    def bar(self, arr):
        results = []
        with Pool() as p:
            for x in tqdm(p.imap(foo, arr)):
                results.append(x)
        return results

def foo(x):
    time.sleep(1)
    return x * x

if __name__ == '__main__':
    x = numpy.arange(8)
    mat = numpy.random.random((10000, 10000))
    h = MP(mat)
    res = h.bar(x)
    print(res)
If foo really does need access to MP because it genuinely reads from mat, then there is no way to avoid sending mat to each worker, and your question 2 has no answer other than "you can't". But hopefully I've answered your question 1.
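If you do end up in that situation, one common mitigation (my sketch, not part of the answer above; init_worker and _mat are illustrative names) is to ship mat to each worker once, at pool startup, via initializer/initargs, instead of re-pickling self for every task:

from multiprocessing import Pool
import numpy

_mat = None  # set once in each worker by the initializer

def init_worker(mat):
    global _mat
    _mat = mat

def foo(x):
    # reads the worker's own copy; nothing large is pickled per task
    return x * x + _mat.shape[0]

if __name__ == '__main__':
    mat = numpy.random.random((10000, 10000))
    with Pool(initializer=init_worker, initargs=(mat,)) as p:
        print(p.map(foo, numpy.arange(8)))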

What would be the best way to get the index of the current output of imap_unordered in python multiprocessing

Use enumerate() on the sequence of arguments to your target function, and either change the function to return the index in addition to the result, or create a wrapper function that does that.
Simple example:
from multiprocessing import Pool
import time
import random

def func(arg):
    # real target function
    time.sleep(random.random())
    return arg ** 2

def wrapper(args):
    idx, arg = args
    return idx, func(arg)

if __name__ == '__main__':
    pool = Pool(4)
    args = range(10)  # sample args
    results = pool.imap_unordered(wrapper, enumerate(args))
    for idx, result in results:
        print(idx, result)

How to run three functions at the same time (and return values from each)?

I have three functions, each returning a list. The problem is that running each function takes around 20-30 seconds, so running the entire script ends up taking about 2 min.
I want to use multiprocessing or multithreading (whichever is easier to implement) to have all three functions running at the same time.
The other hurdle I ran into was that I'm not sure how to return the list from each of the functions.
def main():
    masterlist = get_crs_in_snow()
    noop_crs = get_noops_in_snow()
    made_crs = get_crs_in_git()
    # take the prod master list in SNOW, subtract what's been made or is in the noop list
    create_me = [obj for obj in masterlist if obj not in made_crs and obj not in noop_crs]
    print "There are {0} crs in Service Now not in Ansible".format(len(create_me))
    for cr in create_me:
        print str(cr[0]),

if __name__ == '__main__':
    main()
I figure I can get some significant improvement in run time just by multithreading or multiprocessing the following lines:
masterlist = get_crs_in_snow()
noop_crs = get_noops_in_snow()
made_crs = get_crs_in_git()
How do I have these three functions run at the same time?
This is completely untested since I don't have the rest of your code, but it may give you an idea of what can be done. I have adapted your code into the multiprocessing pattern:
from multiprocessing import Pool

def dispatcher(n):
    if n == 0:
        return get_crs_in_snow()
    if n == 1:
        return get_noops_in_snow()
    if n == 2:
        return get_crs_in_git()

def main():
    pool = Pool(processes=3)
    v = pool.map(dispatcher, range(3))
    masterlist = v[0]
    noop_crs = v[1]
    made_crs = v[2]
    # take the prod master list in SNOW, subtract what's been made or is in the noop list
    create_me = [obj for obj in masterlist if obj not in made_crs and obj not in noop_crs]
    print "There are {0} crs in Service Now not in Ansible".format(len(create_me))
    for cr in create_me:
        print str(cr[0]),

if __name__ == '__main__':
    main()
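A slightly more direct variant of the same idea (my sketch, equally untested since it relies on the question's three functions) uses apply_async, so no dispatcher function is needed:

from multiprocessing import Pool

def main():
    pool = Pool(processes=3)
    # submit each function once; get() blocks until that result is ready
    handles = [pool.apply_async(f) for f in
               (get_crs_in_snow, get_noops_in_snow, get_crs_in_git)]
    masterlist, noop_crs, made_crs = [h.get() for h in handles]
    pool.close()
    pool.join()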
Try the threading library.
import threading
threading.Thread(target=get_crs_in_snow).start()
threading.Thread(target=get_noops_in_snow).start()
threading.Thread(target=get_crs_in_git).start()
As far as getting their return values goes, you could wrap the calls to recommon in some class methods and have them save the result to a member variable. Or you could wrap the calls in local functions and pass a mutable object (a list or dictionary) into each one, having the function write its result into that object; a minimal sketch of that approach follows.
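Here is that mutable-object approach sketched out (my illustration, reusing the question's three functions; the results dict and run_and_store helper are hypothetical names):

import threading

results = {}

def run_and_store(key, fn):
    results[key] = fn()  # each thread stores its function's return value here

threads = [threading.Thread(target=run_and_store, args=(name, fn))
           for name, fn in (('master', get_crs_in_snow),
                            ('noop', get_noops_in_snow),
                            ('made', get_crs_in_git))]
for t in threads:
    t.start()
for t in threads:
    t.join()
masterlist = results['master']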
Or, as others have stated, multiprocessing may be a good way to do what you want.
