Parallelizing a list comprehension in Python - python

someList = [x for x in someList if not isOlderThanXDays(x, XDays, DtToday)]
I have this line, and the function isOlderThanXDays makes some API calls, causing it to take a while. I would like to perform this using multiprocessing/parallel processing in Python. The order in which the list is processed doesn't matter (so it can be asynchronous, I think).
The function isOlderThanXDays essentially returns a boolean, and everything newer than X days is kept in the new list via the list comprehension.
Edit:
Params of the function: XDays is passed in by the user, let's say 60 days, and DtToday is today's date (a datetime object). I then make API calls to read the file's modified date from its metadata; if it is older I return True, otherwise False.
I am looking for something similar to the question below. The difference is that in that question every list input produces an output, whereas mine filters the list based on the boolean returned by the function, so I don't know how to apply it to my scenario:
How to parallelize list-comprehension calculations in Python?

This should run all of your checks in parallel, and then filter out the ones that failed the check.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def MyFilterFunction(x):
    if not isOlderThanXDays(x, XDays, DtToday):
        return x
    return None

pool = multiprocessing.Pool(processes=cpus)
parallelized = pool.map(MyFilterFunction, someList)
newList = [x for x in parallelized if x]
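One caveat, not from the original answer: the final `if x` also drops elements that are merely falsy (0, an empty string, etc.). If that could matter for your data, a sketch of a safer variant is to map to booleans and pair them back up with zip (reusing cpus, isOlderThanXDays, XDays, DtToday and someList from above):
def is_old(x):
    return isOlderThanXDays(x, XDays, DtToday)

with multiprocessing.Pool(processes=cpus) as pool:
    flags = pool.map(is_old, someList)

newList = [x for x, old in zip(someList, flags) if not old]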

You can use ThreadPool:
from multiprocessing.pool import ThreadPool  # class which supports an async version of applying functions to arguments
from functools import partial

NUMBER_CALLS_SAME_TIME = 10  # take care to avoid throttling

# Assume that the signature of isOlderThanXDays is isOlderThanXDays(x, XDays, DtToday)
my_api_call_func = partial(isOlderThanXDays, XDays=XDays, DtToday=DtToday)

pool = ThreadPool(NUMBER_CALLS_SAME_TIME)
responses = pool.map(my_api_call_func, someList)
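responses is a list of booleans in the same order as someList, so (a small completion that is not part of the original answer) the filtered list can be rebuilt by pairing the two with zip:
# keep only the items the API reports as not older than X days
someList = [x for x, is_old in zip(someList, responses) if not is_old]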

Related

Python transform and filter list with for / if

Is there a way to both transform and filter in a single list comprehension, i.e.:
def transform(el):
    if some_condition(el):
        return None
    return complex_logic(el)

def main():
    transformed = [transform(el) for el in some_list if transform(el) != None]
but avoid calling transform twice? I.e. assign it to a variable, something like (in pseudo-Python):
def main():
    transformed = [transformed for el in some_list let transformed = transform(el) if transformed != None]
Since Python 3.8 you can use the walrus operator :=:
def main():
    return [res for el in some_list if (res := transform(el)) is not None]
This way the result of calling the transform function is stored in res, and you can then use it in the expression part of your list comprehension.
Replace let transformed = transform(el) with for transformed in [transform(el)].
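Spelled out, that trick looks like this (a brief sketch assuming the same transform and some_list as above):
def main():
    return [transformed
            for el in some_list
            for transformed in [transform(el)]
            if transformed is not None]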
I would approach the solution in stages, from simple through idiomatic to readable:
Simple loop with temp var
The simple but verbose for-loop can be used to "cache" the transformation result in a temporary variable t:
def transform(el):
    if some_condition(el):
        return None
    return complex_logic(el)

def main():
    transformed_list = []
    for el in some_list:
        t = transform(el)   # invoked once
        if t is not None:   # equivalent to `if transform(el) != None`
            transformed_list.append(t)
Embedded List Comprehension
Like Kelly Bundy suggests, embed the list comprehensions (a sketch follows below):
transformation of elements
filter for non-null
See also Temporary variable within list comprehension
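A sketch of the embedded form (assuming the same transform and some_list): the inner generator expression does the transformation, and the outer comprehension filters out the None results:
def main():
    return [t for t in (transform(el) for el in some_list) if t is not None]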
Decouple condition from transformation
Command–query separation (CQS) can be applied
to have a simplifying effect on a program, making its states (via queries) and state changes (via commands) more comprehensible.
Assume the two given functions (some_condition and complex_logic) were defined separately because each implements a single responsibility (SRP). Then it is natural to take advantage of this separation and reuse the two steps in a suitable composition:
Query: filter first, using the condition function as predicate (negated via itertools.filterfalse, since some_condition marks the elements to drop)
Command: afterwards transform the remaining elements with the complex logic
This way the pipeline or stream might even become more readable:
transformed = [complex_logic(el) for el in itertools.filterfalse(some_condition, some_list)]
Finally, this is close to what Samwise advised in his comment: now the SRP is followed.
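For illustration, with hypothetical implementations of the two functions (not from the original post):
from itertools import filterfalse

def some_condition(el):
    # drop odd numbers
    return el % 2 != 0

def complex_logic(el):
    return el * el

some_list = [1, 2, 3, 4, 5]
transformed = [complex_logic(el) for el in filterfalse(some_condition, some_list)]
print(transformed)  # [4, 16]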

Dask dictionary to delayed object adapter

I've been searching around but have not found a solution. I've been working with a Dask dictionary (task graph), but the team is working with delayed objects. I need to convert the last step of my dsk{} into a delayed object.
What I do now:
def add(x, y):
    return x + y

dsk = {
    'step1': (add, 1, 2),
    'step2': (add, 'step1', 3),
    'final': (add, 'step2', 'step1'),
}

dask.visualize(dsk)
client.get(dsk, 'final')
In this way of working, all my functions are normal Python functions. However, this is different from how our team works.
What the team is doing:
@dask.delayed
def add(x, y):
    return x + y

step1 = add(1, 2)
step2 = add(step1, 3)
final = add(step2, step1)

final.visualize()
client.submit(final)
They then schedule the work further using the final delayed object. How can I convert the last step final of the dsk into a delayed object?
My current thinking (not working yet)
from dask.optimization import cull
outputs = ['final']
dsk1, dependencies = cull(dsk, outputs) # remove unnecessary tasks from the graph
After that, I'm not sure how to construct a delayed object.
Thank you!
Finally, I found a workaround. The idea is to iterate through the dsk to create delayed objects and dependencies.
# Convert dsk dictionary entries to dask.delayed objects
for dsk_name, dsk_values in dsk.items():
    args = []
    dsk_function = dsk_values[0]
    dsk_arguments = dsk_values[1:]
    for arg in dsk_arguments:
        if isinstance(arg, str):
            # try to find the argument in globals and return the dependent dask object
            args.append(globals().get(arg, arg))
        else:
            args.append(arg)
    globals()[dsk_name] = dask.delayed(dsk_function)(*args)
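After the loop, each key of the dsk (including 'final' from the example above) is bound to a module-level delayed object, so (a small usage sketch, assuming the example graph above) it can be handed to the team's workflow or computed directly:
final.visualize()
print(final.compute())  # step1 = 3, step2 = 6, final = 6 + 3 = 9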
We generally recommend that people use Dask delayed. It is less error prone. Today, dictionaries are mostly used by people working on Dask itself. That said, if you want to convert a dictionary into a delayed object, I recommend looking at the dask.Delayed object.
In [1]: from dask.delayed import Delayed
In [2]: Delayed?
Init signature: Delayed(key, dsk, length=None)
Docstring:
Represents a value to be computed by dask.
Equivalent to the output from a single key in a dask graph.
File: ~/workspace/dask/dask/delayed.py
Type: type
Subclasses: DelayedLeaf, DelayedAttr
So in your case you want
value = Delayed("final", dsk)
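The resulting object behaves like any other delayed value, so (a small usage sketch based on the example dsk above) it can be visualized or computed directly:
value.visualize()
print(value.compute())  # step1 = 3, step2 = 6, final = 6 + 3 = 9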

multiprocessing pool.map on a function inside another function

Say I have a function that provides different results for the same input and needs to be run multiple times for the same input to obtain a mean (I'll sketch a trivial example, but in reality the source of randomness is train_test_split from sklearn.model_selection, if that matters):
def f(a, b):
    output = []
    for i in range(0, b):
        output.append(np.mean(np.random.rand(a,)))
    return np.mean(output)
The arguments for this function are defined inside another function, like so (again, a trivial example; please don't mind if these are not efficient/Pythonic):
def g(c, d):
    a = c
    b = c * d
    result = f(a, b)
    return result
Instead of using a for loop, I want to use multiprocessing to speed up the execution time. I found that neither pool.apply nor pool.starmap does the trick (execution time goes up); only pool.map works. However, it can only take one argument (in this case, the number of iterations). I tried redefining f as follows:
def f(number_of_iterations):
    output = np.mean(np.random.rand(a,))
    return output
And then use pool.map as follows:
import multiprocessing as mp

def g(c, d):
    temp = []
    a = c
    b = c * d
    pool = mp.Pool(mp.cpu_count())
    temp = pool.map(f, [number_of_iterations for number_of_iterations in range(b)])
    pool.close()
    result = np.mean(temp)
    return result
Basically, a convoluted workaround to make f a one-argument function. The hope was that f would still pick up the argument a; however, executing g results in an error about a not being defined.
Is there any way to make pool.map work in this context?
I think functools.partial solves your issue. Here is an implementation: https://stackoverflow.com/a/25553970/9177173 and here is the documentation: https://docs.python.org/3.7/library/functools.html#functools.partial
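A minimal sketch of that idea applied to the example above (f, g, c and d follow the question; keeping a fixed with partial and letting pool.map iterate over an iteration index is my assumption about the intent):
import multiprocessing as mp
from functools import partial
import numpy as np

def f(a, _iteration):
    # one random draw of size a; the iteration index exists only so pool.map has something to iterate over
    return np.mean(np.random.rand(a))

def g(c, d):
    a = c
    b = c * d
    with mp.Pool(mp.cpu_count()) as pool:
        temp = pool.map(partial(f, a), range(b))  # partial fixes the first argument of f to a
    return np.mean(temp)

if __name__ == "__main__":
    print(g(100, 5))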

Mocking subprocess.check_call more than once

I have a function that calls subprocess.check_call() twice. I want to test all of their possible outputs. I want to be able to set the first check_call() to return 1 and the second to return 0, and to do so for all possible combinations. Below is what I have so far; I am not sure how to adjust the expected return value.
@patch('subprocess.check_call')
def test_hdfs_dir_func(mock_check_call):
    for p, d in list(itertools.product([1, 0], repeat=2)):
        if p or d:
You can assign the side_effect of your mock to an iterable and that will return the next value in the iterable each time it's called. In this case, you could do something like this:
import copy
import itertools
import subprocess
from unittest.mock import patch
@patch('subprocess.check_call')
def test_hdfs_dir_func(mock_check_call):
    return_values = itertools.product([0, 1], repeat=2)
    # Flatten the list; only one return value per call
    mock_check_call.side_effect = itertools.chain.from_iterable(copy.copy(return_values))
    for p, d in return_values:
        assert p == subprocess.check_call()
        assert d == subprocess.check_call()
Note a few things:
I don't have your original functions so I put my own calls to check_call in the loop.
I'm using copy on the original itertools.product return value because, if I don't, the side_effect would consume the original iterator. That would exhaust it, when what we want is two separate iterators: one for the mock's side_effect and one for you to loop through in your test.
You can do other neat stuff with side_effect, not just raise exceptions. As shown above, you can change the return value for multiple calls: https://docs.python.org/3/library/unittest.mock-examples.html#side-effect-functions-and-iterables
Not only that, but you can see from the link above that you can also give it a callable. That allows you to implement more complex logic when keeping track of multiple mock calls; a sketch follows below.
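Here is a small sketch of that (not from the original answer; the command values are made up): a callable side_effect lets the mock's behaviour depend on the arguments it receives.
import subprocess
from unittest.mock import patch

def fake_check_call(cmd, *args, **kwargs):
    # hypothetical rule: only the "good" command succeeds
    if cmd == ["good"]:
        return 0
    raise subprocess.CalledProcessError(1, cmd)

@patch('subprocess.check_call', side_effect=fake_check_call)
def test_check_call_side_effect(mock_check_call):
    assert subprocess.check_call(["good"]) == 0
    mock_check_call.assert_called_once_with(["good"])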

How to find inputs of dask.delayed task?

Given a dask.delayed task, I want to get a list of all the inputs (parents) for that task.
For example,
from dask import delayed
@delayed
def inc(x):
    return x + 1

def inc_list(x):
    return [inc(n) for n in x]

task = delayed(sum)(inc_list([1, 2, 3]))
task.parents ???
This yields a task graph ending in a sum node. How could I get the parents of sum#3 so that I end up with a list like [inc#1, inc#2, inc#3]?
Delayed objects don't store references to their inputs; however, you can get these back if you're willing to dig into the task graph a bit and reconstruct Delayed objects manually.
In particular, you can index into the .dask attribute with the delayed object's key:
>>> task.dask[task.key]
(<function sum>,
['inc-9d0913ab-d76a-4eb7-a804-51278882b310',
'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'])
This shows the task definition (see Dask's graph specification)
The 'inc-...' values are other keys in the task graph. You can get the dependencies using the dask.core.get_dependencies function
>>> from dask.core import get_dependencies
>>> get_dependencies(task.dask, task.key)
{'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
'inc-9d0913ab-d76a-4eb7-a804-51278882b310',
'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'}
And from here you can make new delayed objects if you wish
>>> from dask.delayed import Delayed
>>> parents = [Delayed(key, task.dask) for key in get_dependencies(task.dask, task.key)]
[Delayed('inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'),
Delayed('inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f'),
Delayed('inc-9d0913ab-d76a-4eb7-a804-51278882b310')]
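Since these are ordinary Delayed objects, they can be computed like any other (a small usage sketch for the example above; get_dependencies returns a set, so the order of parents is not guaranteed):
>>> import dask
>>> sorted(dask.compute(*parents))
[2, 3, 4]   # the results of inc(1), inc(2), inc(3)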
