Functional pipeline using Python with advanced operators - python

I am following the PyData talk at https://youtu.be/R1em4C0oXo8, where the presenter shows a library for pipelining called yamal. This library is not open source, so as a way of learning FP in Python I tried to replicate its basics.
In a nutshell, you build a series of pure functions in Python (f1, f2, f3, etc.) and create a list of them as follows:
pipeline = [f1, f2, f3, f4]
Then, you can apply the function run_pipeline, and the result will be the composition:
f4(f3(f2(f1())))
The requirements on the functions are that all have one return value and that all of them, except f1, take one input.
This part was easy to implement; I did it by composing the functions:
def run_pipeline(pipeline):
    get_data, *rest_of_steps = pipeline
    def compose(x):
        for f in rest_of_steps:
            x = f(x)
        return x
    data = get_data()
    return compose(data)
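For example, with some small illustrative functions (not from the talk), the pipeline behaves like this:
def get_data(): return 10
def inc(x): return x + 1
def double(x): return x * 2

pipeline = [get_data, inc, double]
print(run_pipeline(pipeline))   # 22, i.e. double(inc(get_data()))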
The talk shows a more advanced use of this abstraction: the presenter defines the "operators" fork and reducer. These "operators" allow you to run pipelines like the following:
pipeline1 = [ f1, fork(f2, f3), f4 ]
which is equivalent to: [ f4(f2(f1())), f4(f3(f1())) ]
and
pipeline2 = [ f1, fork(f2, f3), f4, reducer(f5) ]
which is equivalent to: f5([f4(f3(f1())), f4(f2(f1()))]).
I tried to solve this using functional programming, but I simply can't. I don't know whether fork and reducer are decorators (and if so, how do I pass them the functions that follow in the list?). Should I transform the list into a graph using objects? Coroutines? (Maybe all of this is nonsense.) I am simply, utterly confused.
Could someone help me frame this using Python and functional programming?
NOTE: In the video he also talks about observers and executors; for this exercise I don't care about them.

Although this library is intended to facilitate FP in Python, it's not clear whether the library itself should be written using lots of FP.
This is one way to implement it, using classes (based on the list type) to tell the pipe function whether it needs to fork or reduce, and whether it is dealing with a single data item or a list of items.
This makes some limited use of FP-style techniques, such as the recursive calls to apply_func (allowing multiple forks within a pipeline).
from functools import reduce  # reduce is a builtin in Python 2; the import is needed on Python 3

class Forked(list):
    """ Contains a list of data after forking """

class Fork(list):
    """ Contains a list of functions for forking """

class Reducer(object):
    """ Contains a function for reducing forked data """
    def __init__(self, func):
        self.func = func

def fork(*funcs):
    return Fork(funcs)

def reducer(func):
    """ Return a reducer form based on a function that accepts a
        Forked list as its first argument """
    return Reducer(func)

def apply_func(data, func):
    """ Apply a function to data which may be forked """
    if isinstance(data, Forked):
        return Forked(apply_func(datum, func) for datum in data)
    else:
        return func(data)

def apply_form(data, form):
    """ Apply a pipeline form (which may be a function, fork, or reducer)
        to the data """
    if callable(form):
        return apply_func(data, form)
    elif isinstance(form, Fork):
        return Forked(apply_func(data, func) for func in form)
    elif isinstance(form, Reducer):
        return form.func(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
Examples of this in use:
def double(x): return x * 2
def inc(x): return x + 1
def dec(x): return x - 1
def mult(L): return L[0] * L[1]
print(pipe(10, inc, double))                            # 22
print(pipe(10, fork(dec, inc), double))                 # [18, 22]
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396
EDIT: This can also be simplified a bit further by making fork a function that returns a function and reducer a class that creates objects mimicking a function. Then the separate Fork and Reducer classes are no longer needed.
from functools import reduce  # reduce is a builtin in Python 2; the import is needed on Python 3

class Forked(list):
    """ Contains a list of data after forking """

def fork(*funcs):
    """ Return a function that will take data and output a forked
        list of results of putting the data through several functions """
    def inner(data):
        return Forked(apply_form(data, func) for func in funcs)
    return inner

class reducer(object):
    def __init__(self, func):
        self.func = func
    def __call__(self, data):
        return self.func(data)

def apply_form(data, form):
    """ Apply a function or reducer to data which may be forked """
    if isinstance(data, Forked) and not isinstance(form, reducer):
        return Forked(apply_form(datum, form) for datum in data)
    else:
        return form(data)

def pipe(data, *forms):
    """ Apply a pipeline of function forms to data """
    return reduce(apply_form, forms, data)
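The earlier examples behave the same with this simplified version; re-running them (same small functions as above) as a quick check:
def double(x): return x * 2
def inc(x): return x + 1
def dec(x): return x - 1
def mult(L): return L[0] * L[1]

print(pipe(10, inc, double))                            # 22
print(pipe(10, fork(dec, inc), double))                 # [18, 22]
print(pipe(10, fork(dec, inc), double, reducer(mult)))  # 396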

Related

How can I systematically reuse the results of delayed functions in Dask?

I am working on building a computation graph with Dask. Some of the intermediate values will be used multiple times, but I would like those calculations to only run once. I must be making a trivial mistake, because that's not what happens. Here is a minimal example:
In [1]: import dask
        dask.__version__
Out[1]: '1.0.0'

In [2]: class SumGenerator(object):
            def __init__(self):
                self.sources = []
            def register(self, source):
                self.sources += [source]
            def generate(self):
                return dask.delayed(sum)([s() for s in self.sources])

In [3]: sg = SumGenerator()

In [4]: @dask.delayed
        def source1():
            return 1.
        @dask.delayed
        def source2():
            return 2.
        @dask.delayed
        def source3():
            return 3.

In [5]: sg.register(source1)
        sg.register(source1)
        sg.register(source2)
        sg.register(source3)

In [6]: sg.generate().visualize()
Sadly I am unable to post the resulting graph image, but basically I see two separate nodes for the function source1 that was registered twice, so the function is called twice. I would rather have it called once, its result remembered, and that result added twice in the sum. What would be the correct way to do that?
You need to call the dask.delayed decorator by passing the pure=True argument.
From the dask delayed docs
delayed also accepts an optional keyword pure. If False, then subsequent calls will always produce a different Delayed
If you know a function is pure (output only depends on the input, with no global state), then you can set pure=True.
So, using that:
import dask

class SumGenerator(object):
    def __init__(self):
        self.sources = []
    def register(self, source):
        self.sources += [source]
    def generate(self):
        return dask.delayed(sum)([s() for s in self.sources])

@dask.delayed(pure=True)
def source1():
    return 1.

@dask.delayed(pure=True)
def source2():
    return 2.

@dask.delayed(pure=True)
def source3():
    return 3.

sg = SumGenerator()
sg.register(source1)
sg.register(source1)
sg.register(source2)
sg.register(source3)
sg.generate().visualize()
Output and Graph
Using print(dask.compute(sg.generate())) gives (7.0,), which is the same result as your version, but without the extra node, as seen in the image.
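As a side check (not part of the original answer), the deduplication can also be seen directly from the task keys: with pure=True the key is derived from the function and its arguments, so repeated calls produce Delayed objects with identical keys, and the graph keeps a single source1 node.
import dask

@dask.delayed(pure=True)
def source1():
    return 1.

# Both Delayed objects share one task key, hence a single node in the graph.
print(source1().key == source1().key)   # True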

What is the proper way to write a custom AccumulatorParam for this task?

Context: Working in Azure Databricks, Python programming language, Spark environment.
I have an RDD and have created a map operation:
rdd = sc.parallelize(my_collection)
mapper = rdd.map(lambda val: do_something(val))
Let's say the elements in this mapper are of type Foo. I have a global object of type Bar that is on the driver node, and has an internal collection of Foo objects that needs to be populated from the worker nodes (i.e. the elements in the mapper).
# This is what I want to do
bar_obj = Bar()

def add_to_bar(foo_obj):
    global bar_obj
    bar_obj.add_foo(foo_obj)

mapper.foreach(add_to_bar)
From my understanding of the RDD Programming Guide, this won't work due to how closures work in Spark. Instead, I should use an Accumulator to accomplish this.
I know I'm going to need to subclass AccumulatorParam somehow, but I'm unsure as to what this class looks like, and how to use it in this case.
Here is a first pass I have taken:
class FooAccumulator(AccumulatorParam):
    def zero(self, value):
        return value.bar
    def addInPlace(self, value1, value2):
        # bar is the parent Bar object for the value1 Foo instance
        value1.bar.add_foo(value2)
        return value1
But I am unsure how to proceed from here.
I'd also like to add that I have attempted to simply .collect() the results from the mapper, but the result set is larger than the maximum allowed memory on the driver node (~4 GB; when upped to 10 GB it works but eventually times out).
I don't know if you have tried anything so far. I myself found this piece of code:
from pyspark import AccumulatorParam

class StringAccumulator(AccumulatorParam):
    def zero(self, s):
        return s
    def addInPlace(self, s1, s2):
        return s1 + s2

accumulator = sc.accumulator("", StringAccumulator())
So maybe you can try to do something like this:
from pyspark import AccumulatorParam

class FooAccumulator(AccumulatorParam):
    def zero(self, f):
        return []
    def addInPlace(self, acc, el):
        acc.extend(el)
        return acc

accumulator = sc.accumulator([], FooAccumulator())
I think this thread can also be helpful to you.
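To connect this back to the question, a possible usage sketch (bar_obj, Bar, add_foo and do_something are names taken from the question; this is untested glue code under the assumption that Foo objects are picklable):
# Accumulate Foo objects from the workers as a growing list
foo_accumulator = sc.accumulator([], FooAccumulator())

def collect_foo(val):
    foo_obj = do_something(val)
    foo_accumulator.add([foo_obj])   # wrap in a list so addInPlace can extend

rdd.foreach(collect_foo)

# Back on the driver, merge the accumulated Foos into the global Bar
bar_obj = Bar()
for foo_obj in foo_accumulator.value:
    bar_obj.add_foo(foo_obj)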

Python multiprocessing pool with shared data

I'm attempting to speed up a multivariate fixed-point iteration algorithm using multiprocessing; however, I'm running into issues dealing with shared data. My solution vector is actually a named dictionary rather than a vector of numbers. Each element of the vector is computed using a different formula. At a high level, the algorithm looks like this:
current_estimate = previous_estimate
while True:
    for state in all_states:
        current_estimate[state] = state.getValue(previous_estimate)
    if norm(current_estimate, previous_estimate) < tolerance:
        break
    else:
        previous_estimate, current_estimate = current_estimate, previous_estimate
I'm trying to parallelize the for-loop part with multiprocessing. The previous_estimate variable is read-only and each process only needs to write to one element of current_estimate. My current attempt at rewriting the for-loop is as follows:
import itertools
from multiprocessing import Manager, Pool

# Class and function definitions
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

def worker(state, in_est, out_est):
    out_est[state] = state.getValue(in_est)

def worker_star(a_b_c):
    """ Allow multiple arguments for a pool
        Taken from http://stackoverflow.com/a/5443941/3865495
    """
    return worker(*a_b_c)

# Initialize test environment
manager = Manager()
estimates = manager.dict()
all_states = []
for i in range(5):
    a = A(i)
    all_states.append(a)
    estimates[a] = 0

pool = Pool(processes=2)
prev_est = estimates
curr_est = estimates
pool.map(worker_star, itertools.izip(all_states, itertools.repeat(prev_est), itertools.repeat(curr_est)))
The issue I'm currently running into is that the elements added to the all_states array are not the same as those added to the manager.dict(). I keep getting key errors when trying to access elements of the dictionary using elements of the array. While debugging, I found that none of the elements are the same.
print map(id, estimates.keys())
>>> [19558864, 19558928, 19558992, 19559056, 19559120]
print map(id, all_states)
>>> [19416144, 19416208, 19416272, 19416336, 19416400]
This is happening because the objects you're putting into the estimates DictProxy aren't actually the same objects as those that live in the regular dict. The manager.dict() call returns a DictProxy, which is proxying access to a dict that actually lives in a completely separate manager process. When you insert things into it, they're really being copied and sent to a remote process, which means they're going to have a different identity.
To work around this, you can define your own __eq__ and __hash__ functions on A, as described in this question:
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

    def __hash__(self):
        return hash(self.__key())

    def __key(self):
        return (self.val,)

    def __eq__(x, y):
        return x.__key() == y.__key()
This means that key lookups for items in estimates will just use the value of the val attribute to establish identity and equality, rather than the id assigned by Python.
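A quick way to check that this restores the lookups (a small sketch building on the class above; not from the original answer):
from multiprocessing import Manager

a1 = A(3)
a2 = A(3)                    # a distinct object with the same val
print a1 == a2               # True
print hash(a1) == hash(a2)   # True

manager = Manager()
estimates = manager.dict()
estimates[a1] = 0
print estimates[a2]          # 0 -- the proxied copy now matches by value, not by id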

Implementing **<class>?

Edit:
This question has been marked as a duplicate, but I don't think that it is. Implementing the suggested answer, i.e. using the Mapping ABC, does not give the behaviour I would like:
from collections import Mapping

class data(Mapping):
    def __init__(self, params):
        self.params = params
    def __getitem__(self, k):
        print "getting", k
        return self.params[k]
    def __len__(self):
        return len(self.params)
    def __iter__(self):
        return (k for k in self.params.keys())

def func(*args, **kwargs):
    print "In func"
    return None

ps = data({"p1": 1., "p2": 2.})

print "\ncalling...."
func(ps)
print "\ncalling...."
func(**ps)
Output:
calling....
In func
calling....
getting p2
getting p1
In func
Which, as mentioned in the question, is not what I want.
The other solution, given in the comments, is to modify the routines that are causing problems. That will certainly work, however I was looking for a quick (lazy?) fix!
Question:
How can I implement the ** operator for a class, other than via __getitem__? For example, I would like to be able to do this:
def func(**kwargs):
    <do some clever stuff>

x = some_generic_class()
func(**x)
without an implicit call to some_generic_class.__getitem__(). In my application I have already implemented __getitem__ with some data logging, which I do not want to perform when the class is referenced as above.
If it's not possible to overload the ** operator, is it possible to detect when __getitem__ is being called as a result of the class being passed to a function, rather than explicitly?
Background:
I am working on a physics model that is built out of a set of packages which are chosen according to user input at runtime. The flexible structure of the model means that I rarely know the required parameters, so I pass a dict of parameter names and values between the models. In order to make this more user-friendly I am now trying to develop a class paramlist that overloads the dict functionality with a set of routines that do some consistency checking, set default values, etc. The idea is that I pass an instance of paramlist rather than a dict. One of the more important aims is to keep a log of which members of paramlist have been referenced by the physics packages and which ones have not. A stripped-out version is below, which aims to maintain a second dict that logs whether a parameter has been referenced:
from copy import copy

class paramlist(object):
    def __init__(self, params):
        self.params = copy(params)
        self.used = {k: False for k in self.params}

    def __getitem__(self, k):
        try:
            v = self.params[k]
        except KeyError:
            raise KeyError("Parameter {} not in parameter list".format(k))
        else:
            self.used[k] = True
            return v

    def __setitem__(self, k, v):
        self.params[k] = v
        self.used[k] = False
Which does not have the behaviour I want:
ps = paramlist( {"p1":1.} )
def donothing( *args, **kwargs ):
return None
donothing(ps)
print paramlist.used["p1"]
donothing(**ps)
print paramlist.used["p1"]
Output:
False
True
I would like the used dict to remain False in both cases, so that I can tell the user that one of their parameters was not used (implying that they screwed up and a default value has been used instead). I presume that the ** case has the effect of calling __getitem__ on every entry in the paramlist.
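That presumption is essentially right: when CPython unpacks a non-dict mapping with **, it calls keys() and then __getitem__ for each key, and there is no separate hook to override. A minimal sketch making the call sequence visible (the probe class is hypothetical, purely for illustration):
class probe(object):
    """ Hypothetical mapping that only logs what ** unpacking touches """
    def __init__(self, params):
        self.params = params
    def keys(self):
        print "keys() called"
        return self.params.keys()
    def __getitem__(self, k):
        print "__getitem__ called for", k
        return self.params[k]

def func(**kwargs):
    print "In func"

func(**probe({"p1": 1., "p2": 2.}))
# keys() called
# __getitem__ called for p1   (order may vary)
# __getitem__ called for p2
# In func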

automatic wrapper that adds an output to a function

[I am using python 2.7]
I wanted to make a little wrapper function that adds one output to a function. Something like:
def add_output(fct, value):
    return lambda *args, **kargs: (fct(*args, **kargs), value)
Example of use:
def f(a): return a+1
g = add_output(f,42)
print g(12) # print: (13,42)
This is the expected result, but it does not work if the function given to add_output returns more than one output (or if it returns no output). In those cases, the wrapped function returns two outputs: one containing all the outputs of the initial function (or None if it returns nothing), and one with the added output:
def f1(a): return a,a+1
def f2(a): pass
g1 = add_output(f1,42)
g2 = add_output(f2,42)
print g1(12) # print: ((12,13),42) instead of (12,13,42)
print g2(12) # print: (None,42) instead of 42
I can see this is related to the impossibility of distinguishing between one output of type tuple and several outputs. But it is disappointing not to be able to do something so simple with a dynamic language like Python...
Does anyone have an idea of a way to achieve this automatically and nicely enough, or am I in a dead end?
Note:
In case this changes anything, my real purpose is to wrap class (instance) methods so that they look like functions (for workflow stuff). However, self needs to be added to the output (in case its content is changed):
class C(object):
    def f(self): return 'foo', 'bar'

def wrap(method):
    return lambda self, *args, **kargs: (self, method(self, *args, **kargs))

f = wrap(C.f)
c = C()
f(c)   # returns (c, ('foo','bar')) instead of (c, 'foo', 'bar')
I am working with Python 2.7, so I want a solution for this version, or else I will abandon the idea. I am still interested (and maybe future readers are too) in comments about this issue for Python 3, though.
Your add_output() function is what is called a decorator in Python. Regardless, you can use one of the collections module's ABCs (Abstract Base Classes) to distinguish between different results from the function being wrapped. For example:
import collections

def add_output(fct, value):
    def wrapped(*args, **kwargs):
        result = fct(*args, **kwargs)
        if isinstance(result, collections.Sequence):
            return tuple(result) + (value,)
        elif result is None:
            return value
        else:  # non-None and non-sequence
            return (result, value)
    return wrapped

def f1(a): return a, a+1
def f2(a): pass

g1 = add_output(f1, 42)
g2 = add_output(f2, 42)

print g1(12)   # -> (12, 13, 42)
print g2(12)   # -> 42
Depending on what sort of functions you plan on decorating, you might need to use the collections.Iterable ABC instead of, or in addition to, collections.Sequence.
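Applied to the class-method wrapping from the question's note, the same approach flattens the method's tuple result. Here is a sketch under the same collections.Sequence assumption (one caveat: strings are Sequences too, so a method returning a string would be spread character by character unless basestring is excluded, as below):
import collections

def wrap(method):
    def wrapped(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        if isinstance(result, collections.Sequence) and not isinstance(result, basestring):
            return (self,) + tuple(result)
        elif result is None:
            return self
        else:
            return (self, result)
    return wrapped

class C(object):
    def f(self): return 'foo', 'bar'

f = wrap(C.f)
c = C()
print f(c)   # -> (c, 'foo', 'bar')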
