Here is the problem:
1) Suppose that I have some measured data (say, 1 Msample read from my electronics) and I need to push it through a processing chain.
2) This processing chain consists of different operations, which can be swapped, omitted, or given different parameters. A typical example would be to take the data, first pass it through a lookup table, then do an exponential fit, then multiply by some calibration factors.
3) Now, as I do not know which algorithm is the best, I'd like to evaluate at each stage the best possible implementation (as an example, the LUTs can be produced in 5 different ways and I want to see which one works best).
4) I'd like to daisy-chain those functions so that I construct a 'class' containing the top-level algorithm and owning (i.e. pointing to) a child class containing the lower-level algorithm.
I was thinking of using a doubly linked list and generating a sequence like:
myCaptureClass.addDataTreatment(pmCalibrationFactor(opt, pmExponentialFit(opt, pmLUT(opt))))
where myCaptureClass is the class responsible for data taking; after the data are taken, it should also trigger the top-level data processing module (pm). This processing would first go down to the bottom-most child (LUT), treat the data there, then the middle one (exponential fit), then the top one (calibration factors), and return the data to the capture class, which would return them to the requestor.
Now this has several issues:
1) Everywhere on the net it is said that one should not use doubly linked lists in Python.
2) This seems highly inefficient to me because the data vectors are huge, hence I would prefer a solution using generator functions, but I'm not sure how to provide the 'plugin-like' mechanism.
Could someone give me a hint how to solve this in a 'plugin style' with generators, so that I do not have to process the whole vector of X megabytes of data at once but can process it 'on request', as is proper when using a generator function?
thanks a lot
david
An addendum to the problem:
It seems that I did not express myself exactly, hence: the data are generated by an external HW card plugged into a VME crate. They are 'fetched' in a single block transfer into a Python tuple, which is stored in myCaptureClass.
The set of operations to be applied is in fact applied to stream data, represented by this tuple. Even the exponential fit is a stream operation (it is a set of variable-state filters applied to each sample).
The parameter 'opt' I mistakenly showed was meant to express that each of those data-processing classes carries some configuration data, which comes with it and modifies the behaviour of the method used to operate on the data.
The goal is to introduce into myCaptureClass a daisy-chained class (rather than a function), which - when the user asks for data - is used to process the 'raw' data into their final form.
In order to 'save' memory resources I thought it might be a good idea to use a generator function to provide the data.
From this perspective it seems that the closest match to what I want to do is shown in bukzor's code. I'd prefer a class implementation instead of a function, but I guess this is just cosmetic: implementing the call operator on the particular class which realizes the data operation.
This is how I imagine you would do this. I expect this is incomplete, since I don't fully understand your problem statement. Please let me know what I've done wrong :)
class ProcessingPipeline(object):
    def __init__(self, *functions, **kwargs):
        self.functions = functions
        self.data = kwargs.get('data')

    def __call__(self, data):
        # return a new pipeline bound to this data
        return ProcessingPipeline(*self.functions, data=data)

    def __iter__(self):
        # thread the data through each step; since every step is a
        # generator, nothing is computed until the result is iterated
        data = self.data
        for func in self.functions:
            data = func(data)
        return data
# a few (very simple) operators, of different kinds
class Multiplier(object):
    def __init__(self, by):
        self.by = by

    def __call__(self, data):
        for x in data:
            yield x * self.by

def add(data, y):
    for x in data:
        yield x + y
from functools import partial
by2 = Multiplier(by=2)
sub1 = partial(add, y=-1)
square = lambda data: ( x*x for x in data )
pp = ProcessingPipeline(square, sub1, by2)
print(list(pp(range(10))))
print(list(pp(range(-3, 4))))
Output:
$ python how-to-implement-daisychaining-of-pluggable-function-in-python.py
[-2, 0, 6, 16, 30, 48, 70, 96, 126, 160]
[16, 6, 0, -2, 0, 6, 16]
Get the functional module from PyPI. It has a compose function to compose two callables. With that, you can chain functions together.
Both that module and functools provide a partial function, for partial application.
You can use the composed functions in a generator expression just like any other.
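If you'd rather not pull in a dependency just for that, a compose can be hand-rolled on top of functools.reduce. This is only a sketch, with a few toy generator-based steps standing in for your processing modules:

from functools import reduce

def compose(*functions):
    """compose(f, g, h)(data) == f(g(h(data)))."""
    return reduce(lambda f, g: lambda data: f(g(data)), functions)

# toy generator-based steps, like those in the pipeline answer above
square = lambda data: (x * x for x in data)
sub1 = lambda data: (x - 1 for x in data)
by2 = lambda data: (x * 2 for x in data)

chain = compose(by2, sub1, square)   # applied right to left
print(list(chain(range(10))))        # [-2, 0, 6, 16, 30, 48, 70, 96, 126, 160]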
Not knowing exactly what you want, I feel like I should point out that you can put whatever you want inside a list comprehension:
l = [myCaptureClass.addDataTreatment(
         pmCalibrationFactor(opt, pmExponentialFit(opt, pmLUT(opt))))
     for opt in data]
will create a new list of data that has been passed through the composed functions.
Or you could create a generator expression to loop over; this won't construct a whole new list, it will just create an iterator. I don't think there's any advantage to doing it this way as opposed to just processing the data in the body of the loop, but it's kind of interesting to look at:
d = (myCaptureClass.addDataTreatment(
         pmCalibrationFactor(opt, pmExponentialFit(opt, pmLUT(opt))))
     for opt in data)

for thing in d:
    # do something
    pass
Or is opt the data?
Related
I don't have any formal training in programming, but I routinely come across this question when I am writing classes and running individual methods of a class in sequence. Which is better: saving results as class variables, or returning them and using them as inputs to subsequent method calls? For example, here is a class where the variables are returned and used as inputs:
import pandas as pd

class ProcessData:
    def __init__(self):
        pass

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        return data

    def clean_data(self, data):
        data.set_index("timestamp", inplace=True)
        data.drop_duplicates(inplace=True)
        return data

def main():
    processor = ProcessData()
    temp = processor.get_data("path/to/data")
    processed_data = processor.clean_data(temp)
And here is an example where the results are saved/used to update the class variable:
class ProcessData:
    def __init__(self):
        self.data = None

    def get_data(self, path):
        data = pd.read_csv(f"{path}/data.csv")
        self.data = data

    def clean_data(self):
        self.data.set_index("timestamp", inplace=True)
        self.data.drop_duplicates(inplace=True)

def main():
    processor = ProcessData()
    processor.get_data("path/to/data")
    processor.clean_data()
I have a suspicion that the latter method is better, but I could also see instances where the former might have its advantages. I am sure the answer to my question is "it depends", but I am curious: in general, what are the best practices?
Sketch the class based on usage, then create it
Instead of inventing classes to make your high level coding easier, tap your heels together and write the high-level code as if the classes already existed. Then create the classes with the methods and behavior that exactly fits what you need.
PEP AS AN EXAMPLE
If you look at several PEPs, you'll notice that the rationale or motivation is given before the details. The rationale and motivation show how the new Python feature is going to solve a problem and how it is going to be used, sometimes with code examples.
Example from PEP 289 – Generator Expressions:
Generator expressions are especially useful with functions like sum(),
min(), and max() that reduce an iterable input to a single value:
max(len(line) for line in file if line.strip())
Generator expressions also address some examples of functionals coded
with lambda:
reduce(lambda s, a: s + a.myattr, data, 0)
reduce(lambda s, a: s + a[3], data, 0)
These simplify to:
sum(a.myattr for a in data)
sum(a[3] for a in data)
My methodology given above is the same as describing the motivation and rationale for a class in terms of its use, because you are writing the code that is actually going to use it first.
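As a tiny illustration of that workflow, using the ProcessData example from the question above (the pandas calls are copied from there; the class and method names are made up):

# Step 1: write the calling code you wish you could write:
#
#     processor = DataProcessor("path/to/data")
#     clean = processor.clean()
#
# Step 2: only now create the class whose methods make that code true.
import pandas as pd

class DataProcessor:
    def __init__(self, path):
        self.data = pd.read_csv(f"{path}/data.csv")

    def clean(self):
        self.data.set_index("timestamp", inplace=True)
        self.data.drop_duplicates(inplace=True)
        return self.data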
I'm working on a python project that requires me to compile certain attributes of some objects into a dataset. The code I'm currently using is something like the following:
class VectorBuilder(object):
    SIZE = 5

    def __init__(self, player, frame_data):
        self.player = player
        self.fd = frame_data

    def build(self):
        self._vector = []
        self._add(self.player)
        self._add(self.fd.getSomeData())
        self._add(self.fd.getSomeOtherData())
        char = self.fd.getCharacter()
        self._add(char.getCharacterData())
        self._add(char.getMoreCharacterData())
        assert len(self._vector) == self.SIZE
        return self._vector

    def _add(self, element):
        self._vector.append(element)
However, this code is slightly unclean because adding/removing attributes to/from the dataset also requires correctly adjusting the SIZE variable. The reason I even have the SIZE variable is that the size of the dataset needs to be known at runtime before the dataset itself is created.
I've thought of instead keeping a list of all the functions used to construct the dataset as strings (as in attributes = ['getPlayer', 'fd.getSomeData', ...]) and then defining the build function as something like:
def build(self):
    self._vector = []
    for att in attributes:
        self._vector.append(getattr(self, att)())
    return self._vector
This would let me access the size as simply len(attributes) and I only ever need to edit attributes, but I don't know how to make this approach work with the chained function calls, such as self.fd.getCharacter().getCharacterData().
Is there a cleaner way to accomplish what I'm trying to do?
EDIT:
Some additional information and clarification is necessary.
I was using __ due to some bad advice I read online (essentially saying I should use _ for module-private members and __ for class-private members). I've edited them to _ attributes now.
The getters are a part of the framework I'm using.
The vector is stored as a private class member so I don't have to pass it around the construction methods, which are in actuality more numerous than the simple _add, doing some other stuff like normalisation and bool->int conversion on the elements before adding them to the vector.
SIZE, as it currently stands, is a true constant. It is only ever given a value in the first line of VectorBuilder and never changed at runtime. I realise that I did not clarify this properly in the main post, but new attributes never get added at runtime. The adjustment I was talking about would take place at programming time. For example, if I wanted to add a new attribute, I would need to add it in the build function, e.g.:
self._add(self.fd.getCharacter().getAction().getActionData().getSpeed())
, as well as change the SIZE definition to SIZE = 6.
The attributes are compiled into what is currently a simple python list (but will probably be replaced with a numpy array), then passed into a neural network as an input vector. However, the neural network itself needs to be built first, and this happens before any data is made available (i.e. before any input vectors are created). In order to be built successfully, the neural network needs to know the size of the input vectors it will be receiving, though. This is why SIZE is necessary and also the reason for the assert statement - to ascertain that the vectors I'm passing to the network are in fact the size I claimed I would be passing to it.
I'm aware the code is unpythonic, that is why I'm here - the code works, it's just ugly.
Instead of providing a list of attribute-name strings from which to create the input arguments, why don't you call the build function with the list of all the values returned by your getter functions?
Your SIZE would then simply be the length of the dynamic argument list provided to build(self, *args), for example.
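A rough sketch of that suggestion, reusing the names from the question (the getters belong to whatever framework provides player and frame_data); the chained getter calls are simply spelled out once, at the call site:

class VectorBuilder(object):
    def __init__(self, player, frame_data):
        self.player = player
        self.fd = frame_data

    def build(self, *args):
        # the vector is exactly the values the caller listed, in order,
        # so its size is simply len(args) -- no separate SIZE constant
        self._vector = list(args)
        return self._vector

# usage (illustrative only, depends on your framework objects):
# char = frame_data.getCharacter()
# vector = builder.build(builder.player,
#                        frame_data.getSomeData(),
#                        frame_data.getSomeOtherData(),
#                        char.getCharacterData(),
#                        char.getMoreCharacterData())
# size = len(vector)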
I have some arguments taken from the user and passed along from function to function (each function in a different class), until they eventually reach a function that does some processing, and then the solution is returned back up the chain. Up the chain, the functions become more and more abstract, merging results from multiple runs of the lower functions.
Where should I use *args and **kwargs?
I think *args and **kwargs can be used for every function where the function doesn't use the arguments explicitly. But the actual arguments need to be defined at the top_level so that the user knows what the function expects.
Where should I define what the inputs mean?
I think they should be defined at the top_level because that's the one the end-user might want to see the documentation for.
Where should I define what default values?
Again, I think they should be defined at the top_level because that's the one the end-user interacts with.
This is a simple example to demonstrate the passing of the arguments, where I haven't shown how the functions become more and more abstract or how they interact with different classes, as I felt it was unnecessary detail.
def top_level(a=1, b=1, c=1, d=1, e=1):
    """ Compute sum of five numbers.
    :param a: int, a
    :param b: int, b
    :param c: int, c
    :param d: int, d
    :param e: int, e
    :return: int, sum
    """
    return mid_level(a, b, c, d, e)

def mid_level(*args, **kwargs):
    return bottom_level(*args, **kwargs)

def bottom_level(a, b, c, d, e):
    return a + b + c + d + e

print(top_level(1, 2, 3))
8
Is there a Python convention for passing arguments like this?
I'm not going to answer your question because it would be like answering the question "what's the best way to use a screwdriver to tighten a nut?". I.e. I do not believe that the tools you are asking for guidance with (*args and **kwargs) are designed to solve the problem you want to solve.
Instead I'll answer this question: "how do I associate some data with a set of functions?", and the answer to that is clearly Use Classes.
Welcome to object-oriented programming. I think you're going to enjoy it!
This is a very basic example of what I mean, but it was hard to know exactly what you wanted from your example since it was so simple. The basic principle is: encapsulate your data in a class, and then operate on it using the class's methods.
You can then call between methods in the class without needing to pass loads of arguments around all the time (as in the .calculate() method below), arguments which you don't know whether the top layer or a bottom layer will need.
You can just document the parameters in one place, the __init__ method.
You can customize through subclassing transparently to the code (because if you override a method in a subclass, it can still be used by the more generic superclass), as I've done for the .reduce(x, y) method below.
Example:
class ReductionCalculator:
    def __init__(self, *args):
        self.args = args

    def calculate(self):
        # subclasses supply reduce(x, y)
        start = self.args[0]
        for arg in self.args[1:]:
            start = self.reduce(start, arg)
        return start

class Summer(ReductionCalculator):
    def reduce(self, x, y):
        return x + y

class Multiplier(ReductionCalculator):
    def reduce(self, x, y):
        return x * y

summer = Summer(1, 2, 4)
print('sum: %d' % (summer.calculate(),))

multiplier = Multiplier(1, 2, 4)
print('product: %d' % (multiplier.calculate(),))
How about this approach: create a class, call it AllInputs, that represents the collection of all the "arguments taken from the user." The only purpose of this class is to serve as a container for a set of values. One instance of this class gets initialized, of course, at the top level of the program.
class AllInputs:
    def __init__(self, a=1, b=1, c=1, d=1, e=1):
        """ Container for all of the user-supplied input values.
        :param a: int, a
        :param b: int, b
        :param c: int, c
        :param d: int, d
        :param e: int, e
        """
        self.a = a
        self.b = b
        self.c = c
        self.d = d
        self.e = e
This object, call it all_inputs, is now passed as the single argument to all of the functions in your example. If a function doesn't use any of the fields in the object, that's fine; it just passes it along to the lower-level function where the real work gets done. To refactor your example, you would now have:
def top_level(all_inputs):
    """ Compute sum of all inputs
    :return: int, sum
    """
    return mid_level(all_inputs)

def mid_level(all_inputs):
    return bottom_level(all_inputs)

def bottom_level(all_inputs):
    return (all_inputs.a + all_inputs.b + all_inputs.c +
            all_inputs.d + all_inputs.e)

all_inputs = AllInputs(1, 2, 3)
print(top_level(all_inputs))
8
I don't know if this is "Pythonic" or "non-Pythonic" and I don't care. I think it's a good programming idea to group together the data that the program will use. The initialization process, which combines default values with others taken from the user, is centralized in one place where it's easy to understand. It's reasonably self-documenting. You say the function calls are distributed across several classes, and that's no problem. The function calls are clean and the program flow is easy to follow. There is potential for optimization by placing some of the calculation inside AllInputs so you can avoid duplicating code.
What I don't like in your example (and I think you don't like it either, or you probably wouldn't have asked the question in the first place) is how it uses the *args syntax. When I see that syntax, I take it as a hint that all the arguments have the same semantic meaning, like in the standard library function os.path.join. In your application, if I understand the question, the low-level functions require the argument list to be in a specific order and have specific meanings (your example doesn't reflect that but the text suggests it). It's confusing to see arguments that get passed into a function as *args and then, at a lower level, their specific names and meanings appear once again. Grouping them into a single object makes it clear what's going on.
This isn't the most common pattern, but I've seen it for command line programs that have levels of nested commands: sub-commands, sub-sub-commands and so on. That's a model where "upper" level functions may be more or less dispatchers and not have information about what parameters are needed by the sub-functions within a given route. The purest scenario for this model is when the sub-commands are plugins and the "upper" layers have literally no information about the sub-functions, other than a calling convention the plug-ins are expected to adhere to.
In these cases, I'd argue the pythonic way is to pass parameters from higher-level to lower-level functions, and let the worker level decide which are useful. The range of possible parameters would be defined in the calling convention. This is pythonic on the basis of DRY -- don't repeat yourself. If the low-level / worker function defines what inputs are required or optional, it would often make sense to not repeat this information at the higher levels.
The same could be said for any inversion-of-control flow design, not just CLI applications w/ plug-ins. There are many application designs where I wouldn't use this approach, but it works here.
An input's meaning must be set at the topmost level it can arise in -- as an interface spec to lower levels (a convention, not programmatic). Otherwise the inputs would have no semantic meaning.
If an input can be used by multiple sub-functions, i.e. there's a chaining or pipeline concept in the control flow, then an input's default will also need to be defined at the topmost level for the input.
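A minimal sketch of that convention (all names made up): the upper layer is a pure dispatcher that forwards everything it was given, and each sub-command picks out the keyword arguments it cares about and ignores the rest.

def resize_command(width=None, height=None, **ignored):
    return ("resize", width, height)

def convert_command(fmt="png", **ignored):
    return ("convert", fmt)

PLUGINS = {"resize": resize_command, "convert": convert_command}

def run_subcommand(name, **kwargs):
    # the dispatcher knows nothing about which parameters a plug-in needs
    return PLUGINS[name](**kwargs)

print(run_subcommand("resize", width=640, height=480, fmt="png"))
print(run_subcommand("convert", fmt="jpg", width=640))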
I would argue that passing arguments down several levels of functions is not pythonic in itself.
From the Zen of Python:
Simple is better than complex
Flat is better than nested
Edit:
If there are a lot of arguments and the functions inbetween just pass them down, I would probably wrap them up in a tuple and unwrap them at the lowest level.
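For example, a sketch of that idea applied to the earlier five-argument example: the middle layer treats the arguments as one opaque tuple and only the bottom level unpacks it.

def top_level(a=1, b=1, c=1, d=1, e=1):
    return mid_level((a, b, c, d, e))

def mid_level(params):
    return bottom_level(params)

def bottom_level(params):
    a, b, c, d, e = params
    return a + b + c + d + e

print(top_level(1, 2, 3))   # 8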
What is considered to be the better programming practice when dealing with multiple objects at a time (but with the option to process just one object)?
A: LOOP INSIDE FUNCTION
The function can be called with one or more objects, and the iteration happens inside the function:
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj_list):
    if type(obj_list) != list:
        obj_list = [obj_list]
    for obj in obj_list:
        # do whatever with an object
        print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

func(obj_list)
func(obj_alone)
B: LOOP OUTSIDE FUNCTION
The function deals with one object only; when dealing with multiple objects, it must be called multiple times.
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj):
    # do whatever with an object
    print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

for obj in obj_list:
    func(obj)

func(obj_alone)
I personally like the first one (A) more, because to me it makes for cleaner code when calling the function, but maybe it's not the right approach. Is one method generally better than the other? And if not, what are the pros and cons of each?
A function should have a defined input and output and follow the single responsibility principle. You need to be able to clearly define your function in terms of "I put foo in, I get bar back". The more qualifiers you need to make in this statement to properly describe your function probably means your function is doing too much. "I put foo in and get bar back, unless I put baz in then I also get bar back, unless I put a foo-baz in then it'll error".
In this particular case, you can pass an object or a list of objects. Try to generalise that to a value or a list of values. What if you want to pass a list as a value? Now your function behaviour is ambiguous. You want the single list object to be your value, but the function treats it as multiple arguments instead.
Therefore, it's trivial to adapt a function which takes one argument to work on multiple values in practice. There's no reason to complicate the function's design by making it adaptable to multiple arguments. Write the function as simple and clearly as possible, and if you need it to work through a list of things then you can loop it through that list of things outside the function.
This might become clearer if you try to give an actual useful name to your function which describes what it does. Do you need to use plural or singular terms? foo_the_bar(bar) does something else than foo_the_bars(bars).
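A toy illustration of that ambiguity: once func() accepts "a value or a list of values", a caller whose single value happens to be a list can no longer get the single-value behaviour.

def func(values):
    if type(values) != list:
        values = [values]
    for value in values:
        print(value)

func("abc")            # one value, printed once
func(["abc", "def"])   # two values, printed separately
func([1, 2, 3])        # meant as ONE value (a list), but treated as three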
Move loops outside functions (when possible)
Generally speaking, keep loops that do nothing but iterate over the parameter outside of functions. This gives the caller maximum control and assumes the least about how the client will use the function.
The rule of thumb is to use the most minimal parameter complexity that the function needs to do its job.
For example, let's say you have a function that processes one item. You've anticipated that a client might conceivably want to process multiple items, so you changed the parameter to an iterable, baked a loop into the function, and are now returning a list. Why not? It could save the client from writing an ugly loop in the caller, you figure, and the basic functionality is still available -- and then some!
But this turns out to be a serious constraint. Now the caller needs to pack (and possibly unpack, if the function returns a list of results in addition to a list of arguments) that single item into a list just to use the function. This is confusing and potentially expensive on heap memory:
>>> def square(it): return [x ** 2 for x in it]
...
>>> square(range(6)) # you're thinking ...
[0, 1, 4, 9, 16, 25]
>>> result, = square([3]) # ... but the client just wants to square 1 number
>>> result
9
Here's a much better design for this particular function, intuitive and flexible:
>>> def square(x): return x ** 2
...
>>> square(3)
9
>>> [square(x) for x in range(6)]
[0, 1, 4, 9, 16, 25]
>>> list(map(square, range(6)))
[0, 1, 4, 9, 16, 25]
>>> (square(x) for x in range(6))
<generator object <genexpr> at 0x00000166D122CBA0>
>>> all(square(x) % 2 for x in range(6))
False
This brings me to a second problem with the functions in your code: they have a side-effect, print. I realize these functions are just for demonstration, but designing functions like this makes the example somewhat contrived. Functions typically return values rather than simply produce side-effects, and the parameters and return values are often related, as in the above example -- changing the parameter type bound us to a different return type.
When does it make sense to use an iterable argument? A good example is sort -- the smallest unit of operation for a sorting function is an iterable, so the problem of packing and unpacking in the square example above is a non-issue.
Following this logic a step further, would it make sense for a sort function to accept a list (or variable arguments) of lists? No -- if the caller wants to sort multiple lists, they should loop over them explicitly and call sort on each one, as in the second square example.
Consider variable arguments
A nice feature that bridges the gap between iterables and single arguments is support for variable arguments, which many languages offer. This sometimes gives you the best of both worlds, and some functions go so far as to accept either args or an iterable:
>>> max([1, 3, 2])
3
>>> max(1, 3, 2)
3
One reason max is nice as a variable argument function is that it's a reduction function, so you'll always get a single value as output. If it were a mapping or filtering function, the output is always a list (or generator) so the input should be as well.
To take another example, a sort routine wouldn't make much sense with varargs because it's a classically in-place algorithm that works on lists, so you'd need to unpack the list into the arguments with the * operator pretty much every time you invoke the function -- not cool.
There's no real need for a call like sort(1, 3, 4, 2) as there is with max, where the parameters are just as likely to be loose variables as they are a packed iterable. Varargs are usually used when you have a small number of arguments, or the thing you're unpacking is a small pair or tuple-type element, as often the case with zip.
There's definitely a "feel" to when to offer parameters as varargs, an iterable, or a single value (i.e. let the caller handle looping), but as long as you follow the rule of avoiding iterables unless they're essential to the function, it's hard to go wrong.
As a final tip, try to write your functions with similar contracts to the library functions in your language or the tools you use frequently. These are pretty much always designed well; mimic good design.
If you implement B then you will make it harder for yourself to achieve A.
If you implement A then it isn't too difficult to achieve B. You also have many tools already available to apply this function to a list of arguments (the loop method you described, using something like map, or even a multiprocessing approach if needed)
Therefore I would choose to implement A, and if it makes things neater or easier in a given case you can think about also implementing B (using A) so that you have both.
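For reference, a sketch of those tools applied to a single-object func like the one in option B (this assumes func is defined at module level; the multiprocessing variant also needs the usual __main__ guard on platforms that spawn worker processes):

results = [func(obj) for obj in obj_list]      # explicit loop
results = list(map(func, obj_list))            # map

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(func, obj_list)     # parallel, if the per-object work is heavy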
First, context:
As a side project, I'm building a computer algebra system in Python that yields the steps it takes to solve an equation.
So far, I've been able to parse algebraic expressions and equations into an expression tree. It's structured something like this (not the actual code; it may not run):
# Other operators and math functions are based off this.
# Numbers and symbols also have their own classes with 'parent' attributes.
class Operator(object):
    def __init__(self, *args):
        self.children = args
        for child in self.children:
            child.parent = self
# the parser does something like this:
expr = Add(1, Mult(3, 4), 5)
On top of this, I have a series of functions that operate recursively to simplify expressions. They're not purely functional, but I'm trying to avoid relying on mutability for operations, instead returning a modified copy of the node I'm working with. Each function looks something like this:
def simplify(node):
    for index, child in enumerate(node.children):
        if isinstance(child, Operator):
            node.children[index] = simplify(child)
        else:
            # perform some operations to simplify numbers and symbols
            pass
    return node
The challenge comes in the "step by step" part. I'd like my "simplification" functions to all be nested generators that "yield" the steps they take to solve something. So basically, every time each function performs an operation, I'd like to be able to do something like yield (deepcopy(node), expression, "Combined like terms.") so that whatever is relying on this library can output something like:
5x + 3*4x + 3
5x + 12x + 3 Simplified product 3*4x into 12x
17x + 3 Combined like terms 5x + 12x = 17x
However, each function only has knowledge about the node it's operating on, but has no idea what the overall expression looks like.
So this is my question: What would be the best way of maintaining the "state" of the entire expression tree so that each "step" has knowledge of the entire expression?
Here are the solutions I've come up with:
Do every operation in place and either use a global variable or an instance variable in a class to store a pointer to the equation. I don't like this because unit testing is tougher, since now I have to set up the class first. You also lose other advantages of a more functional approach.
Pass the root of the expression to every function. However, this either means I have to repeat every operation to also update the expression, or that I have to rely on mutability.
Have the top level function 'reconstruct' the expression tree based on each step I yield. For example, if I yield 5x + 4x = 9x, have the top level function find the (5x + 4x) node and replace it with '9x'. This seems like the best solution, but how best to 'reconstruct' each step?
Two final, related questions: Does any of this make sense? I have a lot of caffeine in my system right now and have no idea if I'm being clear.
Am I worrying too much about mutability? Is this a case of premature optimization?
You might be asking about tree zippers. Check: Functional Pearl: Weaving a Web and see if it applies to what you want. From reading your question, I think you're asking to do recursion on a tree structure, but be able to navigate back to the top as necessary. Zippers act as a "breadcrumb" to let you get back to the ancestors of the tree.
I have an implementation of one in JavaScript.
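For a flavour of the idea, here is a very small Python sketch (unrelated to the linked JavaScript implementation): the zipper is just the node currently in focus plus a trail of breadcrumbs recording how we got there, so a simplification step can always climb back to the root and show the whole expression.

class Zipper(object):
    """Minimal zipper sketch for an Operator-style tree with .children."""
    def __init__(self, focus, crumbs=()):
        self.focus = focus        # the node we are currently looking at
        self.crumbs = crumbs      # tuple of (parent_node, child_index)

    def down(self, index):
        # descend into the index-th child, remembering where we came from
        return Zipper(self.focus.children[index],
                      self.crumbs + ((self.focus, index),))

    def up(self):
        # climb back to the parent
        parent, _index = self.crumbs[-1]
        return Zipper(parent, self.crumbs[:-1])

    def root(self):
        # walk all the way back up, e.g. to render the full expression
        node = self
        while node.crumbs:
            node = node.up()
        return node.focus

A true functional zipper would also rebuild each parent with the updated child on the way up; with the mutable nodes from the question, simply walking back up is enough to reach the root.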
Are you using Polish notation to construct the tree?
For the step-by-step simplification you can just use a loop until no more modifications (operations) can be made to the tree.
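A sketch of that loop, assuming a hypothetical one-pass simplify_once() that returns the (possibly new) tree, a flag saying whether anything changed, and a human-readable description of the step:

from copy import deepcopy

def simplify_fully(expr):
    steps = []
    changed = True
    while changed:
        expr, changed, message = simplify_once(expr)   # one pass over the tree
        if changed:
            # this loop owns the root, so it can record the whole expression
            # at every step, which addresses the "who knows the full tree"
            # question above
            steps.append((deepcopy(expr), message))
    return expr, steps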