How can I systematically reuse the results of delayed functions in Dask? - python

I am working on building a computation graph with Dask. Some of the intermediate values will be used multiple times, but I would like those calculations to only run once. I must be making a trivial mistake, because that's not what happens. Here is a minimal example:
In [1]: import dask
        dask.__version__
Out[1]: '1.0.0'
In [2]: class SumGenerator(object):
            def __init__(self):
                self.sources = []
            def register(self, source):
                self.sources += [source]
            def generate(self):
                return dask.delayed(sum)([s() for s in self.sources])
In [3]: sg = SumGenerator()
In [4]: #dask.delayed
def source1():
return 1.
#dask.delayed
def source2():
return 2.
#dask.delayed
def source3():
return 3.
In [5]: sg.register(source1)
        sg.register(source1)
        sg.register(source2)
        sg.register(source3)
In [6]: sg.generate().visualize()
Sadly I am unable to post the resulting graph image, but basically I see two separate nodes for the function source1 that was registered twice, so the function is called twice. I would rather have it called once, its result remembered and added twice in the sum. What would be the correct way to do that?

You need to apply the dask.delayed decorator with the pure=True argument.
From the dask delayed docs:
delayed also accepts an optional keyword pure. If False, then subsequent calls will always produce a different Delayed
If you know a function is pure (output only depends on the input, with no global state), then you can set pure=True.
So, using that:
import dask

class SumGenerator(object):
    def __init__(self):
        self.sources = []
    def register(self, source):
        self.sources += [source]
    def generate(self):
        return dask.delayed(sum)([s() for s in self.sources])

@dask.delayed(pure=True)
def source1():
    return 1.

@dask.delayed(pure=True)
def source2():
    return 2.

@dask.delayed(pure=True)
def source3():
    return 3.

sg = SumGenerator()
sg.register(source1)
sg.register(source1)
sg.register(source2)
sg.register(source3)
sg.generate().visualize()
Output and Graph
Using print(dask.compute(sg.generate())) gives (7.0,), the same result as in your version, but without the extra node, as seen in the image.
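As an aside, a sketch of an alternative (not from the answer above): instead of marking the functions as pure, you can call each delayed function once and reuse the resulting Delayed object. The same Delayed appearing twice in the graph is still a single task, so it only runs once:
import dask

@dask.delayed
def source1():
    return 1.

@dask.delayed
def source2():
    return 2.

# Reuse the same Delayed object instead of calling source1() twice;
# the task appears once in the graph and is computed once.
s1 = source1()
s2 = source2()
total = dask.delayed(sum)([s1, s1, s2])
print(dask.compute(total))  # (4.0,)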

Related

Storing objects on workers and executing methods

I have an application with a set of objects that do a lot of setting up (this takes 30 seconds to 1 minute per object). Once they have been set up, I want to pass in a parameter vector (small, <50 floats) and get a couple of small arrays back. Computations per object are "very fast" compared to setting up the object. I need to run this fast method many, many times, and I have a cluster which I can use.
My idea is that this should be a "pool" of workers where each worker gets an object initialised with its own particular configuration (which will be different for each), and stays in memory on the worker. Subsequently, a method of this object on each worker gets a vector and returns some arrays. These are then combined by the main program (order is not important).
A minimal working example is as follows (serial version first, dask version second):
import datetime as dt
import itertools
import numpy as np
from dask.distributed import Client, LocalCluster

# Set up demo dask on local machine
cluster = LocalCluster()
client = Client(cluster)

class Model(object):
    def __init__(self, power):
        self.power = power
    def powpow(self, x):
        return np.power(x, self.power)
    def neg(self, x):
        return -x
    def compute_me(self, x):
        return sum(self.neg(self.powpow(x)))

# Set up the objects locally
bag = [Model(power)
       for power in [1, 2, 3, 4, 5]
       ]
x = [13, 23, 37]
result = [obj.compute_me(x) for obj in bag]

# Using dask
# Wrapper function to pass the local object
# and parameter
def wrap(obj, x):
    return obj.compute_me(x)

res = []
for obj, xx in itertools.product(bag, [x, ]):
    res.append(client.submit(wrap, obj, xx))
result_dask = [r.result() for r in res]

np.allclose(result, result_dask)
In my real-world case, the Model class does a lot of initialisation, pre-calculations, etc, and it takes possibly 10-50 times longer to initialise than to run the compute_me method. Basically, in my case, it'd be beneficial to have each worker have a pre-defined instance of Model locally, and have dask deliver the input to compute_me.
This post (2nd answer) shows how to initialise objects and store them in the worker namespace, but the example doesn't show how to pass different initialisation arguments to each worker. Or is there some other way of doing this?
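One pattern that might fit (a sketch only, reusing client, bag, wrap, x and result from the MWE above): scatter the heavy Model objects to the cluster once, then submit work against the resulting futures. Dask schedules each call on a worker that already holds the corresponding object, so only the small parameter vector travels per call:
# Sketch: assumes client, bag, wrap, x and result from the MWE above.
obj_futures = client.scatter(bag)          # ship each Model to a worker once
res = [client.submit(wrap, fut, x) for fut in obj_futures]
result_dask_scattered = [r.result() for r in res]
np.allclose(result, result_dask_scattered)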

What is the proper way to write a custom AccumulatorParam for this task?

Context: Working in Azure Databricks, Python programming language, Spark environment.
I have an RDD, and have created a map operation.
rdd = sc.parallelize(my_collection)
mapper = rdd.map(lambda val: do_something(val))
Let's say the elements in this mapper are of type Foo. I have a global object of type Bar that is on the driver node, and has an internal collection of Foo objects that needs to be populated from the worker nodes (i.e. the elements in the mapper).
# This is what I want to do
bar_obj = Bar()

def add_to_bar(foo_obj):
    global bar_obj
    bar_obj.add_foo(foo_obj)

mapper.foreach(add_to_bar)
From my understanding of the RDD Programming Guide, this won't work due to how closures work in Spark. Instead, I should use an Accumulator to accomplish this.
I know I'm going to need to subclass AccumulatorParam somehow, but I'm unsure as to what this class looks like, and how to use it in this case.
Here is a first pass I have taken:
class FooAccumulator(AccumulatorParam):
    def zero(self, value):
        return value.bar
    def addInPlace(self, value1, value2):
        # bar is the parent Bar object for the value1 Foo instance
        value1.bar.add_foo(value2)
        return value1
But I am unsure how to proceed from here.
I'd also like to add that I have attempted to simply .collect() the results from the mapper, but this runs into the result set being larger than the maximum allowed memory on the driver node (~4G; when upped to 10G it functions but eventually times out).
I don't know whether you have tried anything so far, but I found this piece of code:
from pyspark import AccumulatorParam

class StringAccumulator(AccumulatorParam):
    def zero(self, s):
        return s
    def addInPlace(self, s1, s2):
        return s1 + s2

accumulator = sc.accumulator("", StringAccumulator())
So maybe you can try to do something like this:
from pyspark import AccumulatorParam

class FooAccumulator(AccumulatorParam):
    def zero(self, f):
        return []
    def addInPlace(self, acc, el):
        acc.extend(el)
        return acc

accumulator = sc.accumulator([], FooAccumulator())
I think this thread can also be helpful to you.
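To close the loop, a sketch of how the accumulator might then be used (my own addition, assuming the mapper RDD and FooAccumulator from above): worker-side code may only add to it, and the driver reads the combined value afterwards.
# Sketch: assumes mapper and accumulator defined above.
def add_to_acc(foo_obj):
    accumulator.add([foo_obj])   # wrap in a list so addInPlace can extend

mapper.foreach(add_to_acc)
all_foos = accumulator.value     # combined list, available only on the driver
Bear in mind that this still ships every Foo back to the driver, so it will not help if the collected result genuinely exceeds driver memory.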

Running dataframe apply in parallel with partial

I'm following the answer from this question: pandas multiprocessing apply
Usually when I run a function on rows in pandas, I do something like this
dataframe.apply(lambda row: process(row.attr1, row.attr2, ...))
...
def process(attr1, attr2, ...):
...
But I want to run this function multithreaded, so I implemented parallelize_on_rows from the question posted above. However, that solution works because the function passed in doesn't take parameters. For functions with parameters I tried to use partials, but I can't figure out how to create a partial when the function also needs arguments taken from the row, which the lambda normally gives it access to.
Here is my code
def parallelize_function_on_df(self, data, func, num_of_processes=5):
    # refers to the data being split across array sections
    data_split = np.array_split(data, num_of_processes)
    # map a specific function to the array sections
    pool = Pool(num_of_processes)
    data = pd.concat(pool.map(func, data_split))
    # must call close before join, research why
    pool.close()
    pool.join()
    return data

def run_on_df_subset(self, func, data_subset):
    return data_subset.apply(func, axis=1)

def parallelize_on_rows(self, data, func, num_of_processes=5):
    return self.parallelize_function_on_df(data, partial(self.run_on_df_subset, func), num_of_processes)

def mass_download(some_sql):
    download_table_df = pd.read_sql(some_sql, con=MYSQL.CONNECTION)
    processed_data = {}
    custom_option = True
    process_row_partial = partial(self.process_candidate_row_parallel, processed_data, custom_option)
    parallelize_on_rows(download_table_df, process_row_partial)

def process_candidate_row_parallel(row, processed_data, custom_option=False):
    if row['some_attr'] in processed_data.keys() and processed_data[row['some_attr']] == 'download_successful' and custom_option:
        do_some_other_processing()
    download_single_file(row['some_attr1'], row['some_attr2'], processed_data)
So this doesn't work because, like I said, the row attributes aren't actually being passed to the workers, since my partial just binds the function's other arguments. How can I achieve this?
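One way to make the partial line up with apply (a sketch based on the code above, assuming process_candidate_row_parallel is a plain function rather than a method) is to bind everything except the row as keyword arguments, so that apply can still supply the row as the first positional argument:
from functools import partial

# Hypothetical rework of the partial in mass_download: everything except the
# row is bound by keyword, so df.apply can pass the row as the sole positional
# argument to process_candidate_row_parallel(row, processed_data, custom_option).
process_row_partial = partial(process_candidate_row_parallel,
                              processed_data=processed_data,
                              custom_option=custom_option)
parallelize_on_rows(download_table_df, process_row_partial)
Note that because multiprocessing pickles the arguments, each worker process gets its own copy of processed_data, so updates made in the workers will not be visible in the parent dictionary.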

Functional pipeline using python with advance operators

I am following the PyData talk in https://youtu.be/R1em4C0oXo8, where the presenter shows a library for pipelining called yamal. This library is not open source, so, as a way of learning FP in Python, I tried to replicate the basics of that library.
In a nutshell, you build a series of pure functions in python (f1, f2, f3, etc) , and create a list of them as follows:
pipeline = [f1, f2, f3, f4]
Then, you can apply the function run_pipeline, and the result will be the composition:
f4(f3(f2(f1())))
The requirements on the functions are that each has one return value and that, except for f1, each takes exactly one input.
This part is easy to implement; I did it by composing the functions:
def run_pipeline(pipeline):
    get_data, *rest_of_steps = pipeline
    def compose(x):
        for f in rest_of_steps:
            y = f(x)
            x = y
        return x
    data = get_data()
    return compose(data)
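For example, with some throwaway functions (an illustrative sketch, not from the talk):
def get_data(): return 10
def double(x):  return x * 2
def inc(x):     return x + 1

# get_data() -> 10, double -> 20, inc -> 21
print(run_pipeline([get_data, double, inc]))  # 21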
The talk shows a more advanced use of this abstraction: he defines the "operators" fork and reducer. These "operators" allow you to run pipelines like the following:
pipeline1 = [ f1, fork(f2, f3), f4 ]
which is equivalent to: [ f4(f2(f1())), f4(f3(f1())) ]
and
pipeline2 = [ f1, fork(f2, f3), f4, reducer(f5) ]
which is equivalent to f5([f4(f3(f1())), f4(f2(f1()))]).
I tried to solve this using functional programming, but I simply can't. I don't know whether fork and reducer are decorators (and if so, how do I pass them the list of following functions?), whether I should transform this list into a graph using objects, or whether I should use coroutines (maybe all of this is nonsense). I am simply, utterly confused.
Could someone help me about how to frame this using python and functional programming?
NOTE: In the video he talks about observers and executors; for this exercise I don't care about them.
Although this library is intended to facilitate FP in Python, it's not clear whether the library itself should be written using lots of FP.
This is one way to implement it, using classes (based on the list type) to tell the pipe function whether it needs to fork or reduce, and whether it is dealing with a single data item or a list of items.
This makes some limited use of FP style techniques such as the recursive calls to apply_func (allowing multiple forks within a pipeline).
from functools import reduce

class Forked(list):
    """ Contains a list of data after forking """

class Fork(list):
    """ Contains a list of functions for forking """

class Reducer(object):
    """ Contains a function for reducing forked data """
    def __init__(self, func):
        self.func = func

def fork(*funcs):
    return Fork(funcs)

def reducer(func):
    """ Return a reducer form based on a function that accepts a
    Forked list as its first argument """
    return Reducer(func)

def apply_func(data, func):
    """ Apply a function to data which may be forked """
    if isinstance(data, Forked):
        return Forked(apply_func(datum, func) for datum in data)
    else:
        return func(data)

def apply_form(data, form):
    """ Apply a pipeline form (which may be a function, fork, or reducer)
    to the data """
    if callable(form):
        return apply_func(data, form)
    elif isinstance(form, Fork):
        return Forked(apply_func(data, func) for func in form)
    elif isinstance(form, Reducer):
        return form.func(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
Examples of this in use:
def double(x): return x * 2
def inc(x): return x + 1
def dec(x): return x - 1
def mult(L): return L[0] * L[1]

print(pipe(10, inc, double))                            # 22
print(pipe(10, fork(dec, inc), double))                 # [18, 22]
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396
EDIT: This can also be simplified a bit further by making fork a function that returns a function and reducer a class that creates objects mimicking a function. Then the separate Fork and Reducer classes are no longer needed.
class Forked(list):
    """ Contains a list of data after forking """

def fork(*funcs):
    """ Return a function that will take data and output a forked
    list of results of putting the data through several functions """
    def inner(data):
        return Forked(apply_form(data, func) for func in funcs)
    return inner

class reducer(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, data):
        return self.func(data)

def apply_form(data, form):
    """ Apply a function or reducer to data which may be forked """
    if isinstance(data, Forked) and not isinstance(form, reducer):
        return Forked(apply_form(datum, form) for datum in data)
    else:
        return form(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
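A quick check of the simplified version, reusing dec, inc, double and mult from the earlier examples; it gives the same results:
# Same pipeline as before, now with fork as a closure and reducer as a callable class.
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396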

Python - multiple functions - output of one to the next

I know this is super basic and I have been searching everywhere, but I am still very confused by everything I'm seeing, am not sure of the best way to do this, and am having a hard time wrapping my head around it.
I have a script where I have multiple functions. I would like the first function to pass its output to the second, then the second to pass its output to the third, etc. Each does its own step in an overall process on the starting dataset.
For example, very simplified with bad names but this is to just get the basic structure:
#!/usr/bin/python
# script called process.py

import sys

infile = sys.argv[1]

def function_one():
    do things
    return function_one_output

def function_two():
    take output from function_one, and do more things
    return function_two_output

def function_three():
    take output from function_two, do more things
    return/print function_three_output
I want this to run as one script and print the output / write it to a new file or whatever, which I know how to do. I'm just unclear on how to pass the intermediate output of each function to the next.
infile -> function_one -> (intermediate1) -> function_two -> (intermediate2) -> function_three -> final result/outfile
I know I need to use return, but I am unsure how to call this at the end to get my final output
Individually?
function_one(infile)
function_two()
function_three()
or within each other?
function_three(function_two(function_one(infile)))
or within the actual function?
def function_one():
    do things
    return function_one_output

def function_two():
    input_for_this_function = function_one()
    # etc etc etc
Thank you, friends; I am overcomplicating this and need a very simple way to understand it.
You could define a data-streaming helper function:
from functools import reduce

def flow(seed, *funcs):
    return reduce(lambda arg, func: func(arg), funcs, seed)

flow(infile, function_one, function_two, function_three)

# for example
flow('HELLO', str.lower, str.capitalize, str.swapcase)
# returns 'hELLO'
Edit: I would now suggest that a more "pythonic" way to implement the flow function above is:
def flow(seed, *funcs):
    for func in funcs:
        seed = func(seed)
    return seed
As ZdaR mentioned, you can run each function and store the result in a variable then pass it to the next function.
def function_one(file):
    do things on file
    return function_one_output

def function_two(myData):
    doThings on myData
    return function_two_output

def function_three(moreData):
    doMoreThings on moreData
    return/print function_three_output

def Main():
    firstData = function_one(infile)
    secondData = function_two(firstData)
    function_three(secondData)
This is assuming your function_three would write to a file or doesn't need to return anything. Another method, if these three functions will always run together, is to call them inside function_three. For example...
def function_three(file):
    firstStep = function_one(file)
    secondStep = function_two(firstStep)
    doThings on secondStep
    return/print to file
Then all you have to do is call function_three in your main and pass it the file.
For safety, readability and debugging ease, I would temporarily store the results of each function.
def function_one():
    do things
    return function_one_output

def function_two(function_one_output):
    take function_one_output and do more things
    return function_two_output

def function_three(function_two_output):
    take function_two_output and do more things
    return/print function_three_output
result_one = function_one()
result_two = function_two(result_one)
result_three = function_three(result_two)
The added benefit here is that you can then check that each function is correct. If the end result isn't what you expected, just print the results you're getting or perform some other check to verify them. (also if you're running on the interpreter they will stay in namespace after the script ends for you to interactively test them)
result_one = function_one()
print(result_one)
result_two = function_two(result_one)
print(result_two)
result_three = function_three(result_two)
print(result_three)
Note: I used multiple result variables, but as PM 2Ring notes in a comment, you could just reuse the name result over and over. That'd be particularly helpful if the results were large variables.
It's always better (for readability, testability and maintainability) to keep your functions as decoupled as possible, and to write them so the output only depends on the input whenever possible.
So in your case, the best way is to write each function independently, ie:
def function_one(arg):
    do_something()
    return function_one_result

def function_two(arg):
    do_something_else()
    return function_two_result

def function_three(arg):
    do_yet_something_else()
    return function_three_result
Once you're there, you can of course directly chain the calls:
result = function_three(function_two(function_one(arg)))
but you can also use intermediate variables and try/except blocks if needed for logging / debugging / error handling etc:
r1 = function_one(arg)
logger.debug("function_one returned %s", r1)
try:
    r2 = function_two(r1)
except SomePossibleException as e:
    logger.exception("function_two raised %s for %s", e, r1)
    # either return, re-raise, ask the user what to do, etc.
    return 42  # when in doubt, always return 42!
else:
    r3 = function_three(r2)
    print("Yay! result is %s" % r3)
As an extra bonus, you can now reuse these three functions anywhere, each on its own and in any order.
NB: of course there ARE cases where it just makes sense to call a function from another function... Like, if you end up writing:
result = function_three(function_two(function_one(arg)))
everywhere in your code AND it's not an accidental repetition, it might be time to wrap the whole thing in a single function:
def call_them_all(arg):
    return function_three(function_two(function_one(arg)))
Note that in this case it might be better to decompose the calls, as you'll find out when you have to debug it...
I'd do it this way:
def function_one(x):
    # do things
    output = x ** 1
    return output

def function_two(x):
    output = x ** 2
    return output

def function_three(x):
    output = x ** 3
    return output
Note that I have modified the functions to accept a single argument, x, and added a basic operation to each.
This has the advantage that each function is independent of the others (loosely coupled) which allows them to be reused in other ways. In the example above, function_two() returns the square of its argument, and function_three() the cube of its argument. Each can be called independently from elsewhere in your code, without being entangled in some hardcoded call chain such as you would have if called one function from another.
You can still call them like this:
>>> x = function_one(3)
>>> x
3
>>> x = function_two(x)
>>> x
9
>>> x = function_three(x)
>>> x
729
which lends itself to error checking, as others have pointed out.
Or like this:
>>> function_three(function_two(function_one(2)))
64
if you are sure that it's safe to do so.
And if you ever wanted to calculate the square or cube of a number, you can call function_two() or function_three() directly (but, of course, you would name the functions appropriately).
With d6tflow you can easily chain together complex data flows and execute them. You can quickly load input and output data for each task. It makes your workflow very clear and intuitive.
import d6tflow

class Function_one(d6tflow.tasks.TaskCache):
    def run(self):
        function_one_output = do_things()
        self.save(function_one_output)  # instead of return

@d6tflow.requires(Function_one)
class Function_two(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_one = self.inputLoad()  # load function input
        function_two_output = do_more_things()
        self.save(function_two_output)

@d6tflow.requires(Function_two)
class Function_three(d6tflow.tasks.TaskCache):
    def run(self):
        output_from_function_two = self.inputLoad()
        function_three_output = do_more_things()
        self.save(function_three_output)

d6tflow.run(Function_three())  # executes all functions

function_one_output = Function_one().outputLoad()  # get function output
function_three_output = Function_three().outputLoad()
It has many more useful features like parameter management, persistence, intelligent workflow management. See https://d6tflow.readthedocs.io/en/latest/
This way, function_three(function_two(function_one(infile))) would be best: you do not need global variables, and each function is completely independent of the others.
Edited to add:
I would also say that function_three should not print anything; if you want to print the returned result, use:
print(function_three(function_two(function_one(infile))))
or something like:
output = function_three(function_two(function_one(infile)))
print(output)
Use parameters to pass the values:
def function1():
    foo = do_stuff()
    return function2(foo)

def function2(foo):
    bar = do_more_stuff(foo)
    return function3(bar)

def function3(bar):
    baz = do_even_more_stuff(bar)
    return baz

def main():
    thing = function1()
    print(thing)
