I've been searching around but have not found a solution. I've been working with a Dask task-graph dictionary, but the team is working with delayed objects. I need to convert my dsk dictionary into the delayed object for its last step.
What I do now:
def add(x, y):
    return x + y

dsk = {
    'step1': (add, 1, 2),
    'step2': (add, 'step1', 3),
    'final': (add, 'step2', 'step1'),
}

dask.visualize(dsk)
client.get(dsk, 'final')
In this way of working, all my functions are normal Python functions. However, this is different from how our team works.
What the team is doing:
@dask.delayed
def add(x, y):
    return x + y

step1 = add(1, 2)
step2 = add(step1, 3)
final = add(step2, step1)

final.visualize()
client.submit(final)
They then schedule further work using the delayed object for the final step. How can I convert the last step final of my dsk dictionary into a delayed object?
My current thinking (not working yet)
from dask.optimization import cull
outputs = ['final']
dsk1, dependencies = cull(dsk, outputs) # remove unnecessary tasks from the graph
After that, I'm not sure how to construct a delayed object.
Thank you!
Finally, I found a workaround. The idea is to iterate through the dsk to create delayed objects and dependencies.
# Convert dsk dictionary to dask.delayed objects
for dsk_name, dsk_values in dsk.items():
    args = []
    dsk_function = dsk_values[0]
    dsk_arguments = dsk_values[1:]
    for arg in dsk_arguments:
        if isinstance(arg, str):
            # try to find the argument in globals and return the dependent dask object
            args.append(globals().get(arg, arg))
        else:
            args.append(arg)
    globals()[dsk_name] = dask.delayed(dsk_function)(*args)
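A variant of the same idea that avoids writing into globals() keeps the built delayed objects in a plain dictionary instead. This is only a sketch and assumes, as in the example above, that dsk lists tasks in dependency order and references dependencies by their key strings:

import dask

delayed_tasks = {}
for name, (func, *arguments) in dsk.items():
    # replace key strings with the Delayed objects already built for those keys
    args = [delayed_tasks.get(arg, arg) if isinstance(arg, str) else arg
            for arg in arguments]
    delayed_tasks[name] = dask.delayed(func)(*args)

final = delayed_tasks['final']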
We generally recommend that people use Dask delayed. It is less error prone. Today, dictionaries are used mostly by people working on Dask itself. That said, if you want to convert a dictionary into a delayed object, I recommend looking at the dask.Delayed object.
In [1]: from dask.delayed import Delayed
In [2]: Delayed?
Init signature: Delayed(key, dsk, length=None)
Docstring:
Represents a value to be computed by dask.
Equivalent to the output from a single key in a dask graph.
File: ~/workspace/dask/dask/delayed.py
Type: type
Subclasses: DelayedLeaf, DelayedAttr
So in your case you want
value = Delayed("final", dsk)
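For completeness, a minimal sketch of how the reconstructed object could then be used, assuming the dsk dictionary and client from the question:

from dask.delayed import Delayed

final = Delayed('final', dsk)  # wraps the 'final' key of the existing graph
final.visualize()              # same graph as dask.visualize(dsk)
final.compute()                # returns 9 for the example graph
# or hand it to the scheduler the team uses, e.g. client.compute(final)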
Related
I have an application where I have a set of objects that do a lot of setting up (this takes 30 seconds to 1 minute per object). Once they have been set up, I want to pass in a parameter vector (small, <50 floats) and get a couple of small arrays back. Computations per object are "very fast" compared to setting up the object. I need to run this fast method many, many times, and have a cluster which I can use.
My idea is that this should be a "pool" of workers, where each worker gets an object initialised with its own particular configuration (which will differ between workers) that stays in memory on the worker. Subsequently, a method of this object on each worker gets a vector and returns some arrays. These are then combined by the main program (order is not important).
Some MWE that works is as follows (serial version first, dask version second):
import datetime as dt
import itertools
import numpy as np
from dask.distributed import Client, LocalCluster

# Set up demo dask on local machine
cluster = LocalCluster()
client = Client(cluster)

class Model(object):
    def __init__(self, power):
        self.power = power

    def powpow(self, x):
        return np.power(x, self.power)

    def neg(self, x):
        return -x

    def compute_me(self, x):
        return sum(self.neg(self.powpow(x)))

# Set up the objects locally
bag = [Model(power) for power in [1, 2, 3, 4, 5]]
x = [13, 23, 37]
result = [obj.compute_me(x) for obj in bag]

# Using dask
# Wrapper function to pass the local object and parameter
def wrap(obj, x):
    return obj.compute_me(x)

res = []
for obj, xx in itertools.product(bag, [x,]):
    res.append(client.submit(wrap, obj, xx))

result_dask = [r.result() for r in res]

np.allclose(result, result_dask)
In my real-world case, the Model class does a lot of initialisation, pre-calculations, etc, and it takes possibly 10-50 times longer to initialise than to run the compute_me method. Basically, in my case, it'd be beneficial to have each worker have a pre-defined instance of Model locally, and have dask deliver the input to compute_me.
This post (2nd answer) shows initialising objects and storing them in the worker namespace, but the example doesn't show how to pass different initialisation arguments to each worker. Or is there some other way of doing this?
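One pattern that may fit, sketched here as an assumption rather than a tested answer, is to scatter the pre-built objects to the cluster once so that each one lives in some worker's memory, and then submit the cheap calls against the resulting futures (names taken from the MWE above):

# Sketch only: scatter the expensive-to-build Model objects once, then reuse them
obj_futures = client.scatter(bag)                               # each Model now lives on some worker
futures = [client.submit(wrap, fut, x) for fut in obj_futures]  # runs where the object already is
result_dask = client.gather(futures)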
I am trying to create a function which resamples time series data in pandas. I would like to have the option to specify the type of aggregation that occurs depending on what type of data I am sending through (i.e. for some data, taking the sum of each bin is appropriate, while for others, taking the mean is needed, etc.). For example data like these:
import pandas as pd
import numpy as np
dr = pd.date_range('01-01-2020', '01-03-2020', freq='1H')
df = pd.DataFrame(np.random.rand(len(dr)), index=dr)
I could have a function like this:
def process(df, freq='3H', method='sum'):
    r = df.resample(freq)
    if method == 'sum':
        r = r.sum()
    elif method == 'mean':
        r = r.mean()
    # ...
    # more options
    # ...
    return r
For a small number of aggregation methods this is fine, but it seems like it could get tedious if I wanted to select from all of the possible choices.
I was hoping to use getattr to implement something like this post (under "Putting it to work: generalizing method calls"). However, I can't find a way to do this:
def process2(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return r.foo()
# fails with:
# AttributeError: 'DatetimeIndexResampler' object has no attribute 'foo'
def process3(df, freq='3H', method='sum'):
    r = df.resample(freq)
    foo = getattr(r, method)
    return foo(r)
# fails with:
# TypeError: __init__() missing 1 required positional argument: 'obj'
I get why process2 fails (calling r.foo() looks for the method foo() of r, not the variable foo). But I don't think I get why process3 fails.
I know another approach would be to pass functions to the parameter method, and then apply those functions on r. My inclination is that this would be less efficient? And it still doesn't allow me to access the built-in Resample methods directly.
Is there a working, more concise way to achieve this? Thanks!
Try .resample().apply(method)
But unless you are planning some more computation inside the function, it will probably be easier to just hard-code this line.
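Applied to the question's function, that suggestion would look roughly like this (Resampler.apply is an alias for .agg, which accepts the aggregation name as a string):

def process(df, freq='3H', method='sum'):
    # .apply accepts a string naming the aggregation, e.g. 'sum' or 'mean'
    return df.resample(freq).apply(method)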
someList = [x for x in someList if not isOlderThanXDays(x, XDays, DtToday)]
I have this line, and the function isOlderThanXDays makes some API calls, which cause it to take a while. I would like to perform this using multiprocessing/parallel processing in Python. The order in which the list is processed doesn't matter (so asynchronous, I think).
The function isOlderThanXDays essentially returns a boolean value, and everything newer is kept in the new list via the list comprehension.
Edit:
Params of the function: XDays is passed in by the user, let's say 60 days, and DtToday is today's date (a datetime object). I then make API calls to look at the metadata for the file's modified date; if it is older I return True, otherwise False.
I am looking for something similar to the question below. The difference is that in that question every list input produces an output, whereas mine filters the list based on the boolean value returned by the function, so I don't know how to apply it to my scenario.
How to parallelize list-comprehension calculations in Python?
This should run all of your checks in parallel, and then filter out the ones that failed the check.
import multiprocessing

try:
    cpus = multiprocessing.cpu_count()
except NotImplementedError:
    cpus = 2  # arbitrary default

def MyFilterFunction(x):
    if not isOlderThanXDays(x, XDays, DtToday):
        return x
    return None

pool = multiprocessing.Pool(processes=cpus)
parallelized = pool.map(MyFilterFunction, someList)
newList = [x for x in parallelized if x]
You can use ThreadPool:

from multiprocessing.pool import ThreadPool  # supports an async version of applying functions to arguments
from functools import partial

NUMBER_CALLS_SAME_TIME = 10  # take care to avoid throttling

# Assume that isOlderThanXDays' signature is isOlderThanXDays(x, XDays, DtToday)
my_api_call_func = partial(isOlderThanXDays, XDays=XDays, DtToday=DtToday)

pool = ThreadPool(NUMBER_CALLS_SAME_TIME)
responses = pool.map(my_api_call_func, someList)
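The map above only returns the booleans, so you still need to combine them with the original items to get the filtered list; for example (a small addition on my part, relying on pool.map preserving order):

someList = [x for x, is_older in zip(someList, responses) if not is_older]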
Given a dask.delayed task, I want to get a list of all the inputs (parents) for that task.
For example,
from dask import delayed

@delayed
def inc(x):
    return x + 1

def inc_list(x):
    return [inc(n) for n in x]

task = delayed(sum)(inc_list([1,2,3]))
task.parents ???
Yields the following graph. How could I get the parents of sum#3 such that it yields a list of [inc#1, inc#2, inc#3]?
Delayed objects don't store references to their inputs; however, you can get these back if you're willing to dig into the task graph a bit and reconstruct Delayed objects manually.
In particular, you can index into the .dask attribute with the delayed object's key:
>>> task.dask[task.key]
(<function sum>,
['inc-9d0913ab-d76a-4eb7-a804-51278882b310',
'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'])
This shows the task definition (see Dask's graph specification)
The 'inc-...' values are other keys in the task graph. You can get the dependencies using the dask.core.get_dependencies function
>>> from dask.core import get_dependencies
>>> get_dependencies(task.dask, task.key)
{'inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f',
'inc-9d0913ab-d76a-4eb7-a804-51278882b310',
'inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'}
And from here you can make new delayed objects if you wish
>>> from dask.delayed import Delayed
>>> parents = [Delayed(key, task.dask) for key in get_dependencies(task.dask, task.key)]
[Delayed('inc-b72ce20f-d0c4-4c50-9a88-74e3ef926dd0'),
Delayed('inc-2f0e385e-beef-45e5-b47a-9cf5d02e2c1f'),
Delayed('inc-9d0913ab-d76a-4eb7-a804-51278882b310')]
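If needed, these reconstructed Delayed objects behave like any other, so they can for example be computed directly (a small illustrative addition, not part of the original answer; sorted only to hide the nondeterministic set ordering):

>>> import dask
>>> sorted(dask.compute(*parents))
[2, 3, 4]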
I am really enjoying using Dask.
Is there a way that I can force a Delayed object to require all its arguments to be computed before applying the delayed function?
easy example (the use-case is more interesting with a collection):
def inc(x, y):
    return x + y

dinc = dask.delayed(inc, pure=True)
to something like
def inc(x, y):
    if hasattr(x, 'compute'):
        x = x.compute()
    if hasattr(y, 'compute'):
        y = y.compute()
    return x + y

dinc = dask.delayed(inc, pure=True)
In this way the function will act according to a reduce pattern.
Thanks!
Dask.delayed automatically does this. Any delayed object or dask collection (array, dataframe, bag) will be computed before they enter the delayed function.
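A minimal illustration of that behaviour (my own example, not part of the original answer):

import dask

@dask.delayed
def inc(x, y):
    # x and y arrive here as plain values; any Delayed inputs
    # are computed by the scheduler before this function runs
    return x + y

a = inc(1, 2)       # a Delayed object
b = inc(a, 10)      # 'a' is resolved to 3 before inc executes
print(b.compute())  # 13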