How to find inputs of dask.delayed task? - python

Given a dask.delayed task, I want to get a list of all the inputs (parents) for that task.
For example,
from dask import delayed

@delayed
def inc(x):
    return x + 1

def inc_list(x):
    return [inc(n) for n in x]

task = delayed(sum)(inc_list([1, 2, 3]))
task.parents  # ???
This yields a task graph in which sum#3 depends on inc#1, inc#2, and inc#3. How could I get the parents of sum#3 so that I end up with the list [inc#1, inc#2, inc#3]?

Delayed objects don't store references to their inputs; however, you can get these back if you're willing to dig into the task graph a bit and reconstruct the Delayed objects manually.
In particular, you can index into the .dask attribute with the delayed object's key:
>>> task.dask[task.key]
(<function sum>,
 ['inc-9d0913ab-d76a-4eb7-a804-51278882b310',
  'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
  'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'])
This shows the task definition (see Dask's graph specification).
The 'inc-...' values are other keys in the task graph. You can get the dependencies using the dask.core.get_dependencies function:
>>> from dask.core import get_dependencies
>>> get_dependencies(task.dask, task.key)
{'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
 'inc-9d0913ab-d76a-4eb7-a804-51278882b310',
 'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'}
And from here you can make new Delayed objects if you wish:
>>> from dask.delayed import Delayed
>>> parents = [Delayed(key, task.dask) for key in get_dependencies(task.dask, task.key)]
>>> parents
[Delayed('inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'),
 Delayed('inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f'),
 Delayed('inc-9d0913ab-d76a-4eb7-a804-51278882b310')]
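Putting those pieces together, here is a small helper; this is a sketch based on the steps above and assumes the graph is reachable via the .dask attribute as shown (newer Dask versions also expose it via task.__dask_graph__()):
from dask.core import get_dependencies
from dask.delayed import Delayed

def get_parents(d):
    # Return the direct inputs of a Delayed object as new Delayed objects
    graph = d.dask
    return [Delayed(key, graph) for key in get_dependencies(graph, d.key)]

parents = get_parents(task)  # three Delayed('inc-...') objects for the example above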

Related

How to pass a pandas series, a list and two constant strings as arguments for multiprocessing using multiprocessing.pool?

I am trying to parallel-process a function func1 which takes multiple parameters of different types and varying lengths. I am executing this in another function, other_func.
pandas_series_var is a pandas.Series object whose length differs from that of the list_3_elements variable, which is a list, whereas color is a str and color_value is an int.
I tried making pandas_series_var a global, but it is not visible inside func1 and causes a "pandas_series_var is not defined" error.
from itertools import repeat
import multiprocessing

def func1(list_3_elements, color, color_value):
    # pandas_series_var is not defined error during the parallel run
    # function does something
    ...

def other_func(*args):
    # All variables initialised
    global pandas_series_var
    with multiprocessing.Pool() as pool:
        result = pool.starmap(func1, zip(list_3_elements,
                                         repeat(color),
                                         repeat(color_value)))
How can I pass all the variables to func1 while still using multiprocessing, or is there an easy way to make the pandas Series available globally?
I am using Windows and Python 3.8.
Edit
I tried one more approach; the code works, but it now takes even more time.
p = [
    {"series_obj": pandas_series_var, "list_elements": list_3_elements[0], "color": color, "color_value": color_value},
    {"series_obj": pandas_series_var, "list_elements": list_3_elements[1], "color": color, "color_value": color_value},
    {"series_obj": pandas_series_var, "list_elements": list_3_elements[2], "color": color, "color_value": color_value},
]

def func1(p):
    pandas_series_var = p["series_obj"]
    list_3_elements = p["list_elements"]
    color = p["color"]
    color_value = p["color_value"]
    # function does something
    ...

def other_func(*args):
    # All variables initialised
    global pandas_series_var
    with multiprocessing.Pool() as pool:
        result = pool.map(func1, p)
This approach takes even more time, but at least I am able to pass the arguments. What I want to do is apply the 3 models held in list_3_elements to the pandas Series object. I am very new to multiprocessing; how can this be rewritten?
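One approach that is often suggested for this kind of problem (a sketch, not from the original thread; _init_worker and _worker_series are hypothetical helpers) is to ship the Series to each worker exactly once through the Pool initializer, so it is not pickled again for every task, while the per-call arguments still go through starmap:
import multiprocessing
from itertools import repeat

_worker_series = None  # set once per worker process by the initializer

def _init_worker(series):
    # runs once in each worker; keeps the Series as a module-level global there
    global _worker_series
    _worker_series = series

def func1(elements, color, color_value):
    # _worker_series is available here without being re-pickled per task
    # function does something with _worker_series, elements, color, color_value
    return len(_worker_series)

def other_func(pandas_series_var, list_3_elements, color, color_value):
    with multiprocessing.Pool(initializer=_init_worker,
                              initargs=(pandas_series_var,)) as pool:
        return pool.starmap(func1, zip(list_3_elements,
                                       repeat(color),
                                       repeat(color_value)))
On Windows (spawn start method) this also sidesteps the "not defined" error, because each worker gets its own copy set explicitly by the initializer.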

Storing objects on workers and executing methods

I have an application with a set of objects that require a lot of setting up (30 s to 1 minute per object). Once they have been set up, I want to pass in a parameter vector (small, <50 floats) and get a couple of small arrays back. Computations per object are "very fast" compared to setting up the object. I need to run this fast method many, many times, and I have a cluster I can use.
My idea is that there should be a "pool" of workers where each worker gets an object initialised with its own particular configuration (which differs per worker) that stays in memory on the worker. Subsequently, a method of this object on each worker receives a vector and returns some arrays. These are then combined by the main program (order is not important).
Some MWE that works is as follows (serial version first, dask version second):
import datetime as dt
import itertools
import numpy as np
from dask.distributed import Client, LocalCluster

# Set up demo dask on local machine
cluster = LocalCluster()
client = Client(cluster)

class Model(object):
    def __init__(self, power):
        self.power = power

    def powpow(self, x):
        return np.power(x, self.power)

    def neg(self, x):
        return -x

    def compute_me(self, x):
        return sum(self.neg(self.powpow(x)))

# Set up the objects locally
bag = [Model(power) for power in [1, 2, 3, 4, 5]]
x = [13, 23, 37]
result = [obj.compute_me(x) for obj in bag]

# Using dask
# Wrapper function to pass the local object and parameter
def wrap(obj, x):
    return obj.compute_me(x)

res = []
for obj, xx in itertools.product(bag, [x]):
    res.append(client.submit(wrap, obj, xx))
result_dask = [r.result() for r in res]

np.allclose(result, result_dask)
In my real-world case, the Model class does a lot of initialisation, pre-calculations, etc, and it takes possibly 10-50 times longer to initialise than to run the compute_me method. Basically, in my case, it'd be beneficial to have each worker have a pre-defined instance of Model locally, and have dask deliver the input to compute_me.
This post (2nd answer) suggests initialising objects and storing them in the worker namespace, but the example doesn't show how to pass different initialisation arguments to each worker. Or is there some other way of doing this?
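One pattern that should fit this use case (a sketch, not from the original thread, reusing the Model and wrap definitions from the MWE above) is to build each model on the cluster with client.submit and keep the resulting futures around; subsequent calls submitted against those futures run on the worker that already holds the corresponding model, so the expensive initialisation happens only once per object:
# Build each model once on the cluster; the futures keep the initialised
# objects resident in worker memory for as long as we hold references to them.
model_futures = [client.submit(Model, power, pure=False)
                 for power in [1, 2, 3, 4, 5]]

x = [13, 23, 37]

# Submit the fast method calls against the remote models.
result_futures = [client.submit(wrap, mf, x) for mf in model_futures]
result_dask = client.gather(result_futures)
Dask actors (client.submit(Model, power, actor=True)) are another option worth looking at for keeping stateful objects pinned to a particular worker.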

Dask dictionary to delayed object adapter

I've been searching around but have not found a solution. I've been working with Dask dictionaries (raw task graphs), but the team is working with delayed objects. I need to convert my dsk {} into a delayed object for the last step.
What I do now:
def add(x, y):
    return x + y

dsk = {
    'step1': (add, 1, 2),
    'step2': (add, 'step1', 3),
    'final': (add, 'step2', 'step1'),
}

dask.visualize(dsk)
client.get(dsk, 'final')
In this way of working, all my functions are normal Python functions. However, this is different from what the team does.
What the team is doing:
@dask.delayed
def add(x, y):
    return x + y

step1 = add(1, 2)
step2 = add(step1, 3)
final = add(step2, step1)

final.visualize()
client.submit(final)
Then they go on to schedule further work using the final-step delayed object. How can I convert the last step final of my dsk into a delayed object?
My current thinking (not working yet)
from dask.optimization import cull
outputs = ['final']
dsk1, dependencies = cull(dsk, outputs) # remove unnecessary tasks from the graph
After that, I'm not sure how to construct a delayed object.
Thank you!
Finally, I found a workaround. The idea is to iterate through the dsk to create delayed objects and dependencies.
# Convert dsk dictionary to dask.delayed objects
for dsk_name, dsk_values in dsk.items():
    args = []
    dsk_function = dsk_values[0]
    dsk_arguments = dsk_values[1:]
    for arg in dsk_arguments:
        if isinstance(arg, str):
            # try to find the argument in globals and return the dependent dask object
            args.append(globals().get(arg, arg))
        else:
            args.append(arg)
    globals()[dsk_name] = dask.delayed(dsk_function)(*args)
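A variant of the same idea that avoids writing into globals() (a sketch; like the loop above, it assumes every key in dsk is defined before it is referenced):
import dask

delayed_objs = {}
for name, (func, *task_args) in dsk.items():
    # replace string arguments with the Delayed objects built so far
    resolved = [delayed_objs.get(a, a) if isinstance(a, str) else a
                for a in task_args]
    delayed_objs[name] = dask.delayed(func)(*resolved)

final_delayed = delayed_objs['final']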
We generally recommend that people use Dask delayed. It is less error prone. Today, dictionaries are usually used mostly by people working on Dask itself. That said, if you want to convert a dictionary into a delayed object, I recommend looking at the dask.delayed.Delayed object.
In [1]: from dask.delayed import Delayed
In [2]: Delayed?
Init signature: Delayed(key, dsk, length=None)
Docstring:
Represents a value to be computed by dask.
Equivalent to the output from a single key in a dask graph.
File: ~/workspace/dask/dask/delayed.py
Type: type
Subclasses: DelayedLeaf, DelayedAttr
So in your case you want
value = Delayed("final", dsk)
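As a quick sanity check, the result follows from the dsk defined above (1 + 2 = 3, 3 + 3 = 6, 6 + 3 = 9):
>>> value = Delayed("final", dsk)
>>> value.compute()
9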

Parallelizing a list comprehension in Python

someList = [x for x in someList if not isOlderThanXDays(x, XDays, DtToday)]
I have this line, and the function isOlderThanXDays makes some API calls, causing it to take a while. I would like to perform this using multiprocessing/parallel processing in Python. The order in which the list is processed doesn't matter (so asynchronous, I think).
The function isOlderThanXDays essentially returns a boolean value, and everything newer than X days is kept in the new list via the list comprehension.
Edit:
Params of the function: XDays is for the user to pass in, let's say 60 days, and DtToday is today's date (a datetime object). I then make API calls to look at the file's modified-date metadata; if it is older I return True, otherwise False.
I am looking for something similar to the question below. The difference is that in that question every list input produces an output, whereas mine filters the list based on the boolean value returned by the function, so I don't know how to apply it to my scenario:
How to parallelize list-comprehension calculations in Python?
This should run all of your checks in parallel, and then filter out the ones that failed the check.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def MyFilterFunction(x):
    if not isOlderThanXDays(x, XDays, DtToday):
        return x
    return None

pool = multiprocessing.Pool(processes=cpus)
parallelized = pool.map(MyFilterFunction, someList)
# compare against None so falsy-but-valid items are not dropped
newList = [x for x in parallelized if x is not None]
You can use ThreadPool:
from multiprocessing.pool import ThreadPool  # supports an async version of applying functions to arguments
from functools import partial

NUMBER_CALLS_SAME_TIME = 10  # take care to avoid throttling

# Assume that the isOlderThanXDays signature is isOlderThanXDays(x, XDays, DtToday)
my_api_call_func = partial(isOlderThanXDays, XDays=XDays, DtToday=DtToday)

pool = ThreadPool(NUMBER_CALLS_SAME_TIME)
responses = pool.map(my_api_call_func, someList)
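Since the goal here is filtering rather than mapping, the boolean responses can then be zipped back against the original list (a small addition to the answer above):
# responses[i] is the boolean isOlderThanXDays returned for someList[i]
someList = [x for x, is_older in zip(someList, responses) if not is_older]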

Force a Dask Delayed object to compute all parameters before applying the function

I am really enjoying using Dask.
Is there a way that I can force a Delayed object to require all its arguments to be computed before applying the delayed function?
An easy example (the use case is more interesting with a collection):
def inc(x, y):
    return x + y

dinc = dask.delayed(inc, pure=True)

to something like

def inc(x, y):
    if hasattr(x, "compute"):
        x = x.compute()
    if hasattr(y, "compute"):
        y = y.compute()
    return x + y

dinc = dask.delayed(inc, pure=True)
In this way the function will act according to a reduce pattern.
Thanks!
Dask delayed does this automatically. Any delayed object or dask collection (array, dataframe, bag) will be computed before it enters the delayed function.
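A minimal check of that behaviour (a sketch added for illustration):
import dask

@dask.delayed
def inc(x, y):
    # by the time this runs, x and y are plain integers, not Delayed objects
    assert isinstance(x, int) and isinstance(y, int)
    return x + y

a = dask.delayed(1)  # Delayed wrapping the literal 1
b = dask.delayed(2)
total = inc(a, b)
print(total.compute())  # prints 3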
