multiprocessing pool.map on a function inside another function - python

Say I have a function that returns different results for the same input and needs to be run multiple times on that input to obtain a mean (I'll sketch a trivial example, but in reality the source of randomness is train_test_split from sklearn.model_selection, if that matters):
import numpy as np

def f(a, b):
    output = []
    for i in range(b):
        output.append(np.mean(np.random.rand(a,)))
    return np.mean(output)
The arguments for this function are defined inside another function, like so (again, a trivial example, please don't mind that these are not efficient/Pythonic):
def g(c, d):
    a = c
    b = c * d
    result = f(a, b)
    return result
Instead of using a for loop, I want to use multiprocessing to speed up execution. I found that neither pool.apply nor pool.starmap does the trick (execution time goes up); only pool.map works. However, it can only take one argument (in this case, the number of iterations). I tried redefining f as follows:
def f(number_of_iterations):
    output = np.mean(np.random.rand(a,))
    return output
And then use pool.map as follows:
import multiprocessing as mp

def g(c, d):
    a = c
    b = c * d
    pool = mp.Pool(mp.cpu_count())
    temp = pool.map(f, [number_of_iterations for number_of_iterations in range(b)])
    pool.close()
    result = np.mean(temp)
    return result
Basically, a convoluted workaround to turn f into a one-argument function. The hope was that f would still pick up the argument a; however, executing g results in an error about a not being defined.
Is there any way to make pool.map work in this context?

I think functools.partial solves your issue. Here is an implementation: https://stackoverflow.com/a/25553970/9177173 And here is the documentation: https://docs.python.org/3.7/library/functools.html#functools.partial
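For illustration, here is a minimal sketch of what the functools.partial approach could look like for the g/f pair above; the reworked signature of f (trial number as a second, ignored argument) is my own assumption, not part of the linked answer:
import multiprocessing as mp
from functools import partial

import numpy as np

def f(a, iteration):
    # One trial; the iteration number only drives the mapping and is ignored.
    return np.mean(np.random.rand(a,))

def g(c, d):
    a = c
    b = c * d
    with mp.Pool(mp.cpu_count()) as pool:
        # partial bakes a into f, so pool.map only has to supply one argument.
        temp = pool.map(partial(f, a), range(b))
    return np.mean(temp)
g can then be called as before, e.g. result = g(1000, 10).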

Related

Multiprocessing: Use process_map with a function taking many arguments

I found this answer (https://stackoverflow.com/a/59905309/7462275) for displaying a progress bar very simple to use. I would like to use this simple solution for functions that take many arguments.
Following the above-mentioned answer, I wrote this code, which works:
from tqdm.contrib.concurrent import process_map
import multiprocessing as mp
import time

def _foo(my_tuple):
    my_number1, my_number2 = my_tuple
    square = my_number1 * my_number2
    time.sleep(1)
    return square

r = process_map(_foo, [(i, j) for i, j in zip(range(0, 30), range(100, 130))], max_workers=mp.cpu_count())
But I wonder whether packing the arguments into a tuple is the correct way to pass several variables to the function. Thanks for any answer.
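One possible simplification (my suggestion, based on my reading of the tqdm source: process_map forwards extra iterables straight to concurrent.futures' map) is to pass one iterable per parameter instead of packing tuples:
import time

from tqdm.contrib.concurrent import process_map

def _foo(my_number1, my_number2):
    square = my_number1 * my_number2
    time.sleep(1)
    return square

if __name__ == '__main__':
    # One iterable per argument, exactly as with executor.map.
    r = process_map(_foo, range(0, 30), range(100, 130), max_workers=4)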

Parallelizing a list comprehension in Python

someList = [x for x in someList if not isOlderThanXDays(x, XDays, DtToday)]
I have this line, and the function isOlderThanXDays makes some API calls, causing it to take a while. I would like to perform this using multiprocessing/parallel processing in Python. The order in which the list is processed doesn't matter (so asynchronous, I think).
The function isOlderThanXDays essentially returns a boolean value and everything newer than is kept in the new list using List Comprehension.
Edit:
Parameters of the function: XDays is passed in by the user, let's say 60 days, and DtToday is today's date (a datetime object). I then make API calls to look at the file's modified date in its metadata and return True if it is older, otherwise False.
I am looking for something similar to the question below. The difference is that in that question every list input produces an output, whereas mine filters the list based on the boolean value returned by the function, so I don't know how to apply it to my scenario:
How to parallelize list-comprehension calculations in Python?
This should run all of your checks in parallel, and then filter out the ones that failed the check.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def MyFilterFunction(x):
    if not isOlderThanXDays(x, XDays, DtToday):
        return x
    return None

pool = multiprocessing.Pool(processes=cpus)
parallelized = pool.map(MyFilterFunction, someList)
newList = [x for x in parallelized if x]
You can use ThreadPool:
from multiprocessing.pool import ThreadPool  # supports an async version of applying functions to arguments
from functools import partial

NUMBER_CALLS_SAME_TIME = 10  # take care to avoid throttling

# Assume that the signature is isOlderThanXDays(x, XDays, DtToday)
my_api_call_func = partial(isOlderThanXDays, XDays=XDays, DtToday=DtToday)

pool = ThreadPool(NUMBER_CALLS_SAME_TIME)
responses = pool.map(my_api_call_func, someList)
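Note that responses here is a list of booleans, so (this completion is mine, not part of the original answer) the filtered list still has to be built from it; pool.map preserves order, so the results can be zipped back onto the inputs:
# Keep only the items that are not older than XDays.
someList = [x for x, is_older in zip(someList, responses) if not is_older]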

Mocking subprocess.check_call more than once

I have a function that calls subprocess.check_call() twice. I want to test all their possible outputs. I want to be able to set the first check_call() to return 1 and the second to return 0, and to do so for all possible combinations. The below is what I have so far; I am not sure how to adjust the expected return values.
@patch('subprocess.check_call')
def test_hdfs_dir_func(mock_check_call):
    for p, d in list(itertools.product([1, 0], repeat=2)):
        if p or d:
You can assign the side_effect of your mock to an iterable and that will return the next value in the iterable each time it's called. In this case, you could do something like this:
import copy
import itertools
import subprocess
from unittest.mock import patch

@patch('subprocess.check_call')
def test_hdfs_dir_func(mock_check_call):
    return_values = itertools.product([0, 1], repeat=2)
    # Flatten the list; only one return value per call
    mock_check_call.side_effect = itertools.chain.from_iterable(copy.copy(return_values))
    for p, d in return_values:
        assert p == subprocess.check_call()
        assert d == subprocess.check_call()
Note a few things:
I don't have your original functions so I put my own calls to check_call in the loop.
I'm using copy on the original itertools.product return value because if I don't, it uses the original iterator. This exhausts that original iterator when what we want is 2 separate lists: one for the mock's side_effect and one for you to loop through in your test.
You can do other neat stuff with side_effect, not just raise exceptions. As shown above, you can change the return value across multiple calls: https://docs.python.org/3/library/unittest.mock-examples.html#side-effect-functions-and-iterables
Not only that, but you can see from the link above that you can also give it a function pointer. That allows you to do even more complex logic when keeping track of multiple mock calls.
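For illustration (my own sketch, not part of the original answer), side_effect can also be a callable that inspects its arguments and decides what to return on each call; the names below are made up:
import subprocess
from unittest.mock import patch

calls = []

def fake_check_call(*args, **kwargs):
    # Record every invocation and alternate the return value: 1, 0, 1, 0, ...
    calls.append((args, kwargs))
    return len(calls) % 2

with patch('subprocess.check_call', side_effect=fake_check_call):
    assert subprocess.check_call('first') == 1
    assert subprocess.check_call('second') == 0
assert len(calls) == 2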

How to pass a function with more than one argument to python concurrent.futures.ProcessPoolExecutor.map()?

I would like concurrent.futures.ProcessPoolExecutor.map() to call a function that takes 2 or more arguments. In the example below, I have resorted to using a lambda function and defining ref as an array of the same size as numberlist, filled with an identical value.
1st Question: Is there a better way of doing this? In the case where numberlist can be millions to billions of elements in size, ref would have to follow suit, and this approach unnecessarily takes up precious memory, which I would like to avoid. I did this because I read that the map function stops mapping once the shortest iterable is exhausted.
import concurrent.futures as cf

nmax = 10
numberlist = range(nmax)
ref = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
workers = 3

def _findmatch(listnumber, ref):
    print('def _findmatch(listnumber, ref):')
    x = ''
    listnumber = str(listnumber)
    ref = str(ref)
    print('listnumber = {0} and ref = {1}'.format(listnumber, ref))
    if ref in listnumber:
        x = listnumber
        print('x = {0}'.format(x))
    return x

a = map(lambda x, y: _findmatch(x, y), numberlist, ref)
for n in a:
    print(n)
    if str(ref[0]) in n:
        print('match')

with cf.ProcessPoolExecutor(max_workers=workers) as executor:
    #for n in executor.map(_findmatch, numberlist):
    for n in executor.map(lambda x, y: _findmatch(x, ref), numberlist, ref):
        print(type(n))
        print(n)
        if str(ref[0]) in n:
            print('match')
Running the code above, I found that the map function was able to achieve my desired outcome. However, when I transferred the same terms to concurrent.futures.ProcessPoolExecutor.map(), python3.5 failed with this error:
Traceback (most recent call last):
  File "/usr/lib/python3.5/multiprocessing/queues.py", line 241, in _feed
    obj = ForkingPickler.dumps(obj)
  File "/usr/lib/python3.5/multiprocessing/reduction.py", line 50, in dumps
    cls(buf, protocol).dump(obj)
_pickle.PicklingError: Can't pickle <function <lambda> at 0x7fd2a14db0d0>: attribute lookup <lambda> on __main__ failed
Question 2: Why did this error occur and how do I get concurrent.futures.ProcessPoolExecutor.map() to call a function with more than 1 argument?
To answer your second question first, you are getting an exception because a lambda function like the one you're using is not picklable. Since Python uses the pickle protocol to serialize the data passed between the main process and the ProcessPoolExecutor's worker processes, this is a problem. It's not clear why you are using a lambda at all. The lambda you had takes two arguments, just like the original function. You could use _findmatch directly instead of the lambda and it should work.
with cf.ProcessPoolExecutor(max_workers=workers) as executor:
    for n in executor.map(_findmatch, numberlist, ref):
        ...
As for the first issue about passing the second, constant argument without creating a giant list, you could solve this in several ways. One approach might be to use itertools.repeat to create an iterable object that repeats the same value forever when iterated on.
But a better approach would probably be to write an extra function that passes the constant argument for you. (Perhaps this is why you were trying to use a lambda function?) It should work if the function you use is accessible at the module's top-level namespace:
def _helper(x):
    return _findmatch(x, 5)

with cf.ProcessPoolExecutor(max_workers=workers) as executor:
    for n in executor.map(_helper, numberlist):
        ...
(1) No need to make a list. You can use itertools.repeat to create an iterator that just repeats the same value.
(2) You need to pass a named function to map because it will be passed to the subprocess for execution. map uses the pickle protocol to send things, lambdas can't be pickled, and therefore they can't be part of the map. But it's totally unnecessary: all your lambda did was call a 2-parameter function with 2 parameters. Remove it completely.
The working code is
import concurrent.futures as cf
import itertools

nmax = 10
numberlist = range(nmax)
workers = 3

def _findmatch(listnumber, ref):
    print('def _findmatch(listnumber, ref):')
    x = ''
    listnumber = str(listnumber)
    ref = str(ref)
    print('listnumber = {0} and ref = {1}'.format(listnumber, ref))
    if ref in listnumber:
        x = listnumber
        print('x = {0}'.format(x))
    return x

with cf.ProcessPoolExecutor(max_workers=workers) as executor:
    #for n in executor.map(_findmatch, numberlist):
    for n in executor.map(_findmatch, numberlist, itertools.repeat(5)):
        print(type(n))
        print(n)
        #if str(ref[0]) in n:
        #    print('match')
Regarding your first question: do I understand correctly that you want to pass an argument whose value is determined only at the time you call map, but is constant for all instances of the mapped function? If so, I would do the map with a function derived from a "template function" with the second argument (ref in your example) baked into it using functools.partial:
from functools import partial

refval = 5

def _findmatch(ref, listnumber):  # arguments swapped
    ...

with cf.ProcessPoolExecutor(max_workers=workers) as executor:
    for n in executor.map(partial(_findmatch, refval), numberlist):
        ...
Re. question 2, first part: I haven't found the exact piece of code that tries to pickle (serialize) the function that is then executed in parallel, but it sounds natural that this has to happen: not only the arguments but also the function has to be transferred to the workers somehow, and it likely has to be serialized for that transfer. The fact that partial functions can be pickled while lambdas cannot is mentioned elsewhere, for instance here: https://stackoverflow.com/a/19279016/6356764.
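A quick way to see that difference for yourself (my own sketch, not taken from the linked answer):
import pickle
from functools import partial

def add(a, b):
    return a + b

pickle.dumps(partial(add, 1))  # works: a partial wrapping a named function is picklable

try:
    pickle.dumps(lambda b: add(1, b))
except Exception as exc:
    print(exc)  # lambdas cannot be pickled by the standard pickle module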
Re. question 2, second part: if you wanted to call a function with more than one argument in ProcessPoolExecutor.map, you would pass it the function as the first argument, followed by an iterable of first arguments for the function, followed by an iterable of its second arguments etc. In your case:
for n in executor.map(_findmatch, numberlist, ref):
    ...

Efficient way of calling set of functions in Python

I have a set of functions:
functions=set(...)
All the functions need one parameter x.
What is the most efficient way in python of doing something similar to:
for function in functions:
    function(x)
The code you give,
for function in functions:
    function(x)
...does not appear to do anything with the result of calling function(x). If that is indeed so, meaning that these functions are called for their side-effects, then there is no more pythonic alternative. Just leave your code as it is.† The point to take home here, specifically, is
Avoid functions with side-effects in list-comprehensions.
As for efficiency: I expect that using anything else instead of your simple loop will not improve runtime. When in doubt, use timeit. For example, the following tests seem to indicate that a regular for-loop is faster than a list-comprehension (I would be reluctant to draw any general conclusions from this test, though):
>>> timeit.Timer('[f(20) for f in functions]', 'functions = [lambda n: i * n for i in range(100)]').repeat()
[44.727972984313965, 44.752119779586792, 44.577917814254761]
>>> timeit.Timer('for f in functions: f(20)', 'functions = [lambda n: i * n for i in range(100)]').repeat()
[40.320928812026978, 40.491761207580566, 40.303879022598267]
But again, even if these tests would have indicated that list-comprehensions are faster, the point remains that you should not use them when side-effects are involved, for readability's sake.
†: Well, I'd write for f in functions, so that the difference between function and functions is more pronounced. But that's not what this question is about.
If you need the output, a list comprehension would work.
[func(x) for func in functions]
I'm somewhat doubtful of how much of an impact this will have on the total running time of your program, but I guess you could do something like this:
[func(x) for func in functions]
The downside is that you will create a new list that you immediately toss away, but it should be slightly faster than just the for-loop.
In any case, make sure you profile your code to confirm that this really is a bottleneck that you need to take care of.
Edit: I redid the test using timeit
My new test code:
import timeit

def func(i):
    return i

a = b = c = d = e = f = func
functions = [a, b, c, d, e, f]

timer = timeit.Timer("[f(2) for f in functions]", "from __main__ import functions")
print(timer.repeat())

timer = timeit.Timer("map(lambda f: f(2), functions)", "from __main__ import functions")
print(timer.repeat())

timer = timeit.Timer("for f in functions: f(2)", "from __main__ import functions")
print(timer.repeat())
Here are the results from this timing:
testing list comprehension
[1.7169530391693115, 1.7683839797973633, 1.7840299606323242]
testing map(f, l)
[2.5285000801086426, 2.5957231521606445, 2.6551258563995361]
testing plain loop
[1.1665718555450439, 1.1711149215698242, 1.1652190685272217]
My original, time.time()-based timings are pretty much in line with this testing; plain for loops seem to be the most efficient.
