Python functions as optional generators

Let's say I have a function and I want to have the option to return results or not. This would be easy to code:
def foo(N, is_return=False):
    l = []
    for i in range(N):
        print(i)
        if is_return:
            l.append(i)
    if is_return:
        return l
But now let's say I want the function to be a generator. I would write something like this:
def foo_gen(N, is_return=False):
    for i in range(N):
        print(i)
        if is_return:
            yield i
So presumably, when is_return is False, foo_gen is just a function with no return value, and when is_return is True, foo_gen is a generator, for which I would like there to be two different invocations:
In [1]: list(foo_gen(3, is_return=True))
0
1
2
Out[1]: [0, 1, 2]
for when it is a generator and you have to iterate through the yielded values, and:
In [2]: foo_gen(3)
0
1
2
for when it is not a generator and it just has its side effect, so you don't have to iterate through it. However, this latter behavior doesn't work; the call just returns a generator object without executing anything. The best you can do is iterate through it and receive nothing:
In [3]: list(foo_gen(3, is_return=False))
0
1
2
Out[3]: []
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
Is there any way to get the behavior of In [2] from a function?

To do that, you would need to wrap foo_gen in another function which either returns the generator or iterates over it itself, like this:
def maybe_gen(N, is_return=False):
    real_gen = foo_gen(N)
    if is_return:
        return real_gen
    else:
        for item in real_gen:
            pass

def foo_gen(N):
    for i in range(N):
        print(i)
        yield i
>>> list(maybe_gen(3, is_return=True))
0
1
2
[0, 1, 2]
>>> maybe_gen(3)
0
1
2
>>>
The reason is that the occurrence of yield anywhere in a function makes it a generator function. There's no way to have a function that decides at call time whether it's a generator function or not. Instead, you have to have a non-generator function that decides at runtime whether to return a generator or something else.
That said, doing this is most likely not a good idea. You can see that what maybe_gen does when is_return is False is completely trivial: it just iterates over the generator without doing anything. This is especially silly since in this case the generator itself doesn't do anything except print.
It is better to have the function API be consistent: either always return a generator, or never do. A better idea would be to have two functions: foo_gen, which is the generator, and print_gen or something, which unconditionally prints the values. If you want the generator, you call foo_gen. If you just want to print the values, you call print_gen instead, rather than passing a "flag" argument to foo_gen.
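For instance, a minimal sketch of that split (print_gen is just the illustrative name suggested above):

def foo_gen(N):
    for i in range(N):
        yield i

def print_gen(N):
    # unconditionally consume the generator for its side effect
    for i in foo_gen(N):
        print(i)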
With regard to your comment at the end:
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
If the API specifies that the function returns a generator, users should expect to have to iterate over it. If the API says it doesn't return a generator, users shouldn't expect to have to iterate over it. The API should just say one or the other, which will make it clear to users what to expect. What is far more confusing is to have an awkward API that tells users they have to pass a flag to determine whether they get a generator or not, because this complicates the expectations of the user.

So presumably, when is_return is False, foo_gen is just a function with no return value, and when is_return is True, foo_gen is a generator
Your assumption is wrong: is_return does not determine whether your function is a generator or not. The mere presence of a yield expression determines that; whether the expression is ever reached during a call doesn't matter.
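You can verify this directly: even with is_return=False, calling foo_gen just produces a generator object, and inspect confirms that foo_gen is a generator function:

>>> import inspect
>>> inspect.isgeneratorfunction(foo_gen)
True
>>> foo_gen(3, is_return=False)
<generator object foo_gen at 0x...>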
So you probably want to stick to the first approach of returning a list, which in my opinion is less confusing and easier to maintain.

Related

Mixing functions and generators

So I was writing a function where a lot of processing happens in the body of a loop, and occasionally it may be of interest to the caller to have the answer to some of the computations.
Normally I would just put the results in a list and return the list, but in this case the results are too large (a few hundred MB on each loop).
I wrote this without really thinking about it, expecting Python's dynamic typing to figure things out, but the following always produces a generator.
def mixed(is_generator=False):
    for i in range(5):
        # process some stuff including writing to a file
        if is_generator:
            yield i
    return
From this I have two questions:
1) Does the presence of the yield keyword in a scope immediately turn the object it's in into a generator?
2) Is there a sensible way to obtain the behaviour I intended?
2.1) If no, what is the reasoning behind it not being possible? (In terms of how functions and generators work in Python.)
Let's go step by step:
1) Does the presence of the yield keyword in a scope immediately turn the object it's in into a generator? Yes.
2) Is there a sensible way to obtain the behaviour I intended? Yes, see the example below.
The trick is to wrap the computation and either return the generator or a list with the data from that generator:
def mixed(is_generator=False):
    # create a generator object
    gen = (compute_stuff(i) for i in range(5))
    # if we want just the generator
    if is_generator:
        return gen
    # if not, we consume it into a list and return that list
    return list(gen)
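For illustration, with a trivial placeholder for compute_stuff, both call styles then behave as intended:

def compute_stuff(i):
    return i * i  # placeholder for the real (possibly huge) computation

print(mixed())                         # [0, 1, 4, 9, 16]
print(list(mixed(is_generator=True)))  # same values, computed lazily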
Anyway, I would say this is bad practice. You should keep the concerns separated: usually you just write the generator function and apply the logic outside:
def computation():
    for i in range(5):
        # process some stuff including writing to a file
        yield i

gen = computation()
if lazy:
    for data in gen:
        use_data(data)
else:
    data = list(gen)
    use_all_data(data)

Python Function with list as argument OR multiple values as argument

I want to write a python function my_sum that extends python's built-in sum in the following way:
If a sequence is passed to my_sum it behaves like the built-in sum.
If multiple values are passed to my_sum, it returns the sum of the values.
Desired output:
my_sum([1, 2, 3]) # shall return 6 (similar to built-in sum)
my_sum(1, 2, 3) # shall return 6 as well, (sum throws TypeError)
What worked was the following.
def my_sum(*x):
    try:
        return sum(x)   # sums multiple values
    except TypeError:
        return sum(*x)  # sums a sequence of values
Is that the pythonic way to accomplish the desired behavior? For me the code looks odd.
It is pythonic. I think at least one check is required, and Python has the philosophy "ask for forgiveness, not permission" (explained here), which basically means that using try-except blocks for standard control flow is OK.
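If you prefer an explicit check to the try-except, a rough "look before you leap" alternative is to test whether a single iterable argument was passed. Note that this sketch treats any lone iterable (including a string) as a sequence to be summed:

from collections.abc import Iterable

def my_sum(*x):
    # a single iterable argument is treated as a sequence of values
    if len(x) == 1 and isinstance(x[0], Iterable):
        return sum(x[0])
    return sum(x)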
If importing an established library is pythonic and something you are allowed and willing to do, you can use numpy.sum too, as follows:
import numpy as np

def my_sum(*x):
    return np.sum(x)
With this definition, both
my_sum([1, 2, 3])
my_sum(1, 2, 3)
return 6.
I think this is not a use case for exceptions; exceptions are for exceptional cases. Instead, use another function that converts the arguments to a list and passes it to the summing function.
In addition, validation and formatting of the input should live in the outer parts of the code, not in the innermost function. Ideally the data should be validated before it reaches this function, so the sum function only has to deal with cooked data.
It's a way to:
Keep things simple
Avoid adding conditional paths
Avoid defensive programming in inner code
And thus avoid future problems.
I would have the following code.
def argsToList(*x):
return list(x)
print sum([1,2,3,4])
print sum(argsToList(1,2,3,4))
# both output 10

Understanding Python Maps()

So I am trying to figure out how Python's map() function works, as a way to speed up my program a little bit. From my basic understanding, it looks like you can use map() to replace certain instances where you'd use a for loop. What I'm curious about is whether you can change something like:
loopNum = 25
for i in range(loopNum):
    self.doSomething()
To:
loopNum = 25
map(self.doSomething(), range(loopNum))
Additionally, in the above example, would I be able to forego that loopNum variable, and in the map just have map(something, 25)?
No, you can't, as map(function, iterable) applies function to each element of the iterable. If you simply want to execute some function n times, just use a loop.
Note that iterable must be (surprise!) an iterable, not a number.
map is roughly equivalent to this for loop:
# my_map(func, *iterables)
def my_map(func, *iterables):
    for args in zip(*iterables):
        yield func(*args)
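For example:

>>> list(my_map(pow, [2, 3], [3, 2]))
[8, 9]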
What you're doing in your code is just like this:
my_map(self.doSomething(), range(loopNum))
self.doSomething() must return a function or another callable object, or this obviously doesn't work. Whatever object you pass as the func argument of my_map will be called, so in addition to being passed the right number of arguments, func must be a callable object as well :-)
Note that you must iterate over the iterable returned by map to obtain the results; otherwise you just get a lazy iterable object without any tangible work being done.
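A quick demonstration of that laziness in Python 3:

>>> m = map(print, range(3))  # nothing is printed yet
>>> list(m)                   # consuming the map object triggers the calls
0
1
2
[None, None, None]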
I want to add that map() is virtually never the right tool in Python. Our BDFL himself wanted to remove it from Python 3, together with lambda and, most forcefully, reduce.
In almost all cases where you feel tempted to use map(), take a step back and try to rewrite your code using a list comprehension instead. For your current example, that would be:
my_list = [self.doSomething() for i in range(loopNum)]
or
my_generator = (self.doSomething() for i in range(loopNum))
to make it a generator instead.
But for this to make any sense at all, self.doSomething() should probably take the variable i as an input; after all, you are trying to map the values in your range to something else.
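For example, assuming a hypothetical doSomething that accepts the loop index:

my_list = [self.doSomething(i) for i in range(loopNum)]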

Loop inside or outside a function?

What is considered better programming practice when dealing with multiple objects at a time (but with the option to process just one object)?
A: LOOP INSIDE FUNCTION
The function can be called with one or more objects, and the iteration happens inside the function:
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj_list):
    if type(obj_list) != list:
        obj_list = [obj_list]
    for obj in obj_list:
        # do whatever with an object
        print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

func(obj_list)
func(obj_alone)
B: LOOP OUTSIDE FUNCTION
The function deals with one object only, and when dealing with multiple objects it must be called multiple times.
class Object:
    def __init__(self, a, b):
        self.var_a = a
        self.var_b = b

    var_a = ""
    var_b = ""

def func(obj):
    # do whatever with an object
    print(obj.var_a, obj.var_b)

obj_list = [Object("a1", "a2"), Object("b1", "b2")]
obj_alone = Object("c1", "c2")

for obj in obj_list:
    func(obj)

func(obj_alone)
I personally like the first one (A) more, because for me it makes cleaner code when calling the function, but maybe it's not the right approach. Is there some method generally better than the other? And if not, what are the cons and pros of each method?
A function should have a defined input and output and follow the single responsibility principle. You need to be able to clearly define your function in terms of "I put foo in, I get bar back". The more qualifiers you need to make in this statement to properly describe your function probably means your function is doing too much. "I put foo in and get bar back, unless I put baz in then I also get bar back, unless I put a foo-baz in then it'll error".
In this particular case, you can pass an object or a list of objects. Try to generalise that to a value or a list of values. What if you want to pass a list as a value? Now your function behaviour is ambiguous. You want the single list object to be your value, but the function treats it as multiple arguments instead.
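A minimal illustration of that ambiguity with design A (process here stands in for whatever per-value work the function does):

def func(values):
    if type(values) != list:
        values = [values]
    for v in values:
        process(v)  # hypothetical per-value work

func([1, 2, 3])  # one value that happens to be a list,
                 # or three separate values? func cannot tell.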
Moreover, it's trivial in practice to adapt a function which takes one argument so that it works on multiple values. There's no reason to complicate the function's design by making it adaptable to multiple arguments. Write the function as simply and clearly as possible, and if you need it to work through a list of things, you can loop over that list outside the function.
This might become clearer if you try to give an actual useful name to your function that describes what it does. Do you need to use plural or singular terms? foo_the_bar(bar) does something different from foo_the_bars(bars).
Move loops outside functions (when possible)
Generally speaking, keep loops that do nothing but iterate over the parameter outside of functions. This gives the caller maximum control and assumes the least about how the client will use the function.
The rule of thumb is to use the most minimal parameter complexity that the function needs to do its job.
For example, let's say you have a function that processes one item. You've anticipated that a client might conceivably want to process multiple items, so you changed the parameter to an iterable, baked a loop into the function, and are now returning a list. Why not? It could save the client from writing an ugly loop in the caller, you figure, and the basic functionality is still available -- and then some!
But this turns out to be a serious constraint. Now the caller needs to pack (and possibly unpack, if the function returns a list of results in addition to a list of arguments) that single item into a list just to use the function. This is confusing and potentially expensive on heap memory:
>>> def square(it): return [x ** 2 for x in it]
...
>>> square(range(6)) # you're thinking ...
[0, 1, 4, 9, 16, 25]
>>> result, = square([3]) # ... but the client just wants to square 1 number
>>> result
9
Here's a much better design for this particular function, intuitive and flexible:
>>> def square(x): return x ** 2
...
>>> square(3)
9
>>> [square(x) for x in range(6)]
[0, 1, 4, 9, 16, 25]
>>> list(map(square, range(6)))
[0, 1, 4, 9, 16, 25]
>>> (square(x) for x in range(6))
<generator object <genexpr> at 0x00000166D122CBA0>
>>> all(square(x) % 2 for x in range(6))
False
This brings me to a second problem with the functions in your code: they have a side-effect, print. I realize these functions are just for demonstration, but designing functions like this makes the example somewhat contrived. Functions typically return values rather than simply produce side-effects, and the parameters and return values are often related, as in the above example -- changing the parameter type bound us to a different return type.
When does it make sense to use an iterable argument? A good example is sort -- the smallest unit of operation for a sorting function is an iterable, so the problem of packing and unpacking in the square example above is a non-issue.
Following this logic a step further, would it make sense for a sort function to accept a list (or variable arguments) of lists? No -- if the caller wants to sort multiple lists, they should loop over them explicitly and call sort on each one, as in the second square example.
Consider variable arguments
A nice feature that bridges the gap between iterables and single arguments is support for variable arguments, which many languages offer. This sometimes gives you the best of both worlds, and some functions go so far as to accept either args or an iterable:
>>> max([1, 3, 2])
3
>>> max(1, 3, 2)
3
One reason max is nice as a variable argument function is that it's a reduction function, so you'll always get a single value as output. If it were a mapping or filtering function, the output is always a list (or generator) so the input should be as well.
To take another example, a sort routine wouldn't make much sense with varargs because it's a classically in-place algorithm that works on lists, so you'd need to unpack the list into the arguments with the * operator pretty much every time you invoke the function -- not cool.
There's no real need for a call like sort(1, 3, 4, 2) as there is with max, where the parameters are just as likely to be loose variables as they are a packed iterable. Varargs are usually used when you have a small number of arguments, or the thing you're unpacking is a small pair or tuple-type element, as often the case with zip.
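For example, the common zip idiom where the * operator unpacks a list of small pairs into varargs:

>>> pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
>>> numbers, letters = zip(*pairs)
>>> numbers
(1, 2, 3)
>>> letters
('a', 'b', 'c')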
There's definitely a "feel" to when to offer parameters as varargs, an iterable, or a single value (i.e. let the caller handle looping), but as long as you follow the rule of avoiding iterables unless they're essential to the function, it's hard to go wrong.
As a final tip, try to write your functions with similar contracts to the library functions in your language or the tools you use frequently. These are pretty much always designed well; mimic good design.
If you implement A then you will make it harder for yourself to achieve B.
If you implement B then it isn't too difficult to achieve A. You also have many tools already available to apply a single-object function to a list of arguments (the loop method you described, something like map, or even a multiprocessing approach if needed).
Therefore I would choose to implement B, and if it makes things neater or easier in a given case you can think about also implementing A (using B) so that you have both.
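A minimal sketch of building A on top of B (func_many is just an illustrative name):

def func_many(obj_list):
    # the batch version (A), implemented with the single-object version (B)
    for obj in obj_list:
        func(obj)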

deep-copying a generator in python

I'm using a generator function, say:
def foo():
i=0
while (i<10):
i+=1
yield i
Now, I would like the option to copy the generator after any number of iterations, so that the new copy will retain the internal state (will have the same 'i' in the example) but will now be independent from the original (i.e. iterating over the copy should not change the original).
I've tried using copy.deepcopy but I get the error:
"TypeError: object.__new__(generator) is not safe, use generator.__new__()"
Obviously, I could solve this using regular functions with counters for example.
But I'm really looking for a solution using generators.
There are three cases I can think of:
Generator has no side effects, and you just want to be able to walk back through results you've already captured. You could consider a cached generator instead of a true generator. You can share the cached generator around as well, and if any client walks to an item you haven't been to yet, it will advance. This is similar to itertools.tee(), but does the tee functionality in the generator/cache itself instead of requiring the client to do it (see the sketch after this list).
Generator has side effects, but no history, and you want to be able to restart anywhere. Consider writing it as a coroutine, where you can pass in the value to start at any time (also sketched below).
Generator has side effects AND history, meaning that the state of the generator at G(x) depends on the results of G(x-1), and so you can't just pass x back into it to start anywhere. In this case, I think you'd need to be more specific about what you are trying to do, as the result depends not just on the generator, but on the state of other data. Probably, in this case, there is a better way to do it.
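For case 1, a minimal sketch of such a caching wrapper (CachedGenerator is a hypothetical helper written for this answer, not a standard library class):

class CachedGenerator:
    def __init__(self, gen):
        self._gen = gen
        self._cache = []

    def __iter__(self):
        # each client gets an independent iterator; the underlying
        # generator advances only when a client walks past the end
        # of the cache
        i = 0
        while True:
            if i == len(self._cache):
                try:
                    self._cache.append(next(self._gen))
                except StopIteration:
                    return
            yield self._cache[i]
            i += 1

cached = CachedGenerator(foo())
a, b = iter(cached), iter(cached)
next(a), next(a)  # a yields 1, then 2
next(b)           # b independently starts at 1

For case 2, a sketch of the coroutine idea, where the caller can reposition the internal state via send():

def counter():
    i = 0
    while i < 10:
        i += 1
        jump = yield i
        if jump is not None:
            i = jump  # the caller repositions the state

g = counter()
next(g)    # 1
g.send(5)  # 6: counting resumes from the sent value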
The comment suggesting itertools.tee was my first guess as well. Because of the warning that you shouldn't advance the original generator any further after using tee, I might write something like this to spin off a copy:
>>> from itertools import tee
>>>
>>> def foo():
...     i = 0
...     while i < 10:
...         i += 1
...         yield i
...
>>> it = foo()
>>> next(it)
1
>>> it, other = tee(it)
>>> next(it)
2
>>> next(other)
2
