Mixing functions and generators - python

So I was writing a function where a lot of processing happens in the body of a loop, and occasionally the caller may be interested in the results of some of those computations.
Normally I would just put the results in a list and return the list, but in this case the results are too large (a few hundred MB on each loop).
I wrote this without really thinking about it, expecting Python's dynamic typing to figure things out, but a call to the following always creates a generator:
def mixed(is_generator=False):
    for i in range(5):
        # process some stuff including writing to a file
        if is_generator:
            yield i
    return
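For example:
>>> mixed(False)
<generator object mixed at 0x...>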
From this I have two questions:
1) Does the presence of the yield keyword in a scope immediately turn the function it's in into a generator?
2) Is there a sensible way to obtain the behaviour I intended?
2.1) If no, what is the reasoning behind it not being possible? (In terms of how functions and generators work in Python.)

Let's go step by step:
1) Does the presence of the yield keyword in a scope immediately turn the function it's in into a generator? Yes
2) Is there a sensible way to obtain the behaviour I intended? Yes, see example below
The trick is to wrap the computation in a generator and then either return that generator or return a list built from its data:
def mixed(is_generator=False):
    # create a generator object
    gen = (compute_stuff(i) for i in range(5))
    # if we want just the generator
    if is_generator:
        return gen
    # if not, we consume it with a list and return that list
    return list(gen)
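With this wrapper, both behaviours come from the same call site (output shown assuming compute_stuff(i) simply returns i):
>>> mixed(is_generator=True)
<generator object <genexpr> at 0x...>
>>> mixed(is_generator=False)
[0, 1, 2, 3, 4]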
Anyway, I would say this is bad practice. You should keep the two concerns separate: usually, just have the generator function and put the branching logic outside:
def computation():
    for i in range(5):
        # process some stuff including writing to a file
        yield i

gen = computation()
if lazy:
    for data in gen:
        use_data(data)
else:
    data = list(gen)
    use_all_data(data)

Related

Convert callback function to generator

Suppose I am building a library function that does work incrementally and passes the results through callback functions:
def do_something(even_callback, odd_callback):
    for i in range(10):
        if i % 2 == 0:
            even_callback(i)
        else:
            odd_callback(i)
Some callers care about handling the two types of events differently, and so they use separate callback implementations. However, many callers don't care about the difference and just use the same implementation:
do_something(print, print)
For these callers, it would be more convenient to have a generator version of the function, do_something_generator:
for result in do_something_generator():
    print(result)
Leaving aside the solution of rewriting do_something to be a generator itself, or async/thread/process, how could I wrap do_something to turn it into a generator? (Also excluding "get all the results in a list and then yield them")
Per discussion in the comments, one possible solution would be to rewrite the do_something function to be a generator and then wrap it to get the callback version; however, this would require tagging the events as "even" or "odd" so they could be distinguished in the callback implementation.
In my particular case, this is infeasible because the events are frequent and carry chunks of data: adding tagging to the base implementation is a nontrivial overhead cost compared to callbacks.
Leaving aside the solution of rewriting do_something to be a generator itself, or async/thread/process, how could I wrap do_something to turn it into a generator?
Assuming that you exclude rewriting the function on the fly so that it actually yields rather than calling callbacks, this is impossible.
Within a single thread, Python has only one flow of control. What you're describing has two functions running at once, one of which (the wrapped) signals to the other (the wrapper) whenever a value is ready and then passes that value up, whereupon the wrapper yields it. This is impossible unless one of the following conditions is met:
the wrapped function is in a separate thread/process, able to run by itself
the wrapped function switches execution whilst passing a value rather than calling callbacks
Since the only two ways of switching execution whilst passing a value are yielding or async yielding, there is no way of doing this in the language if your constraints are met. (There is no way to write a callback which signals to the wrapper to yield a value and then allows the wrapped function to resume, except by placing them in different threads.)
What you're describing sounds kind of like protothreads, only for Python, by the way.
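For completeness, the thread-based escape hatch mentioned above could be sketched like this (generatorize and the _DONE sentinel are names I made up, not part of the question):
import threading
from queue import Queue

_DONE = object()  # sentinel marking the end of the stream

def generatorize(func):
    # Run func in a worker thread, feeding both callbacks into one queue,
    # and yield items from the queue in the calling thread.
    q = Queue()

    def worker():
        try:
            func(q.put, q.put)  # both callbacks feed the same queue
        finally:
            q.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _DONE:
            return
        yield item

for result in generatorize(do_something):
    print(result)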
However, there's no reason not to write it the other way round:
from collections import namedtuple

Monad = namedtuple("Monad", "is_left,val")

def do_stuff():
    for i in range(10):
        yield Monad(is_left=bool(i % 2), val=i)

def do_stuff_always():
    for monad in do_stuff():
        yield monad.val

def do_stuff_callback(left_callback, right_callback):
    for monad in do_stuff():
        if monad.is_left:
            left_callback(monad.val)
        else:
            right_callback(monad.val)
I can't resist calling these structures monads, but perhaps that's not what you're trying to encode here.
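Used like this:
>>> list(do_stuff_always())
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> do_stuff_callback(print, print)  # prints 0 through 9, one per line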

python functions as optional generators

Let's say I have a function and I want the option to return results or not. This would be easy to code:
def foo(N, is_return=False):
    l = []
    for i in range(N):
        print(i)
        if is_return:
            l.append(i)
    if is_return:
        return l
But now let's say I want the function to be a generator. I would write something like this:
def foo_gen(N, is_return=False):
    for i in range(N):
        print(i)
        if is_return:
            yield i
So presumably when is_return is False then foo_gen is just a function with no return value and when is_return is True foo_gen is a generator, for which I would like there to be two different invocations:
In [1]: list(foo_gen(3, is_return=True))
0
1
2
Out[1]: [0, 1, 2]
for when it is a generator and you have to iterate through the yielded values, and:
In [2]: foo_gen(3)
0
1
2
for when it is not a generator and just has its side effect, without anything to iterate through. However, this latter behavior doesn't work; the call just returns the generator. The closest you can get is to iterate it anyway and receive nothing:
In [3]: list(foo_gen(3, is_return=False))
0
1
2
Out[3]: []
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
Is there any way to get the behavior of In [2] with a function?
To do that, you would need to wrap foo_gen in another function which either returns the generator or iterates over it itself, like this:
def maybe_gen(N, is_return=False):
    real_gen = foo_gen(N)
    if is_return:
        for item in real_gen:
            pass
    else:
        return real_gen

def foo_gen(N):
    for i in range(N):
        print(i)
        yield i
>>> list(maybe_gen(3))
0
1
2
[0, 1, 2]
>>> maybe_gen(3, is_return=True)
0
1
2
>>>
The reason is that occurrence of yield anywhere in the function makes it a generator function. There's no way to have a function that decides at call time whether it's a generator function or not. Instead, you have to have a non-generator function that decides at runtime whether to return a generator or something else.
That said, doing this is most likely not a good idea. You can see that what maybe_gen does when is_return is True is completely trivial. It just iterates over the generator without doing anything. This is especially silly since in this case the generator itself doesn't do anything except print.
It is better to have the function API be consistent: either always return a generator, or never do. A better idea would be to have two functions: foo_gen, which is the generator, and print_gen or something similar, which unconditionally prints its values. If you want the generator, you call foo_gen. If you just want to print, you call print_gen instead, rather than passing a "flag" argument to foo_gen.
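For instance, a minimal sketch of that split (print_gen is the hypothetical name suggested above):
def foo_gen(N):
    for i in range(N):
        yield i

def print_gen(N):
    for i in foo_gen(N):
        print(i)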
With regard to your comment at the end:
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
If the API specifies that the function returns a generator, users should expect to have to iterate over it. If the API says it doesn't return a generator, users shouldn't expect to have to iterate over it. The API should just say one or the other, which will make it clear to users what to expect. What is far more confusing is to have an awkward API that tells users they have to pass a flag to determine whether they get a generator or not, because this complicates the expectations of the user.
So presumably when is_return is False then foo_gen is just a function with no return value and when is_return is True foo_gen is a generator
You have your assumptions wrong. is_return does not determine whether your function is a generator or not. The mere presence of a yield expression determines that; whether the expression is reachable on a given call doesn't matter.
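A quick demonstration:
>>> def f():
...     return
...     yield  # never reached, but its mere presence makes f a generator function
...
>>> f()
<generator object f at 0x...>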
So you probably want to stick to the first approach of returning a list, which in my opinion is less confusing and easier to maintain.

When returning an iterator in Python 3, should `yield from` be preferred over `return`?

In Python 3, we have the yield from keyword, which is useful for splicing the contents of an existing iterator between other yield statements. However, if we only want to yield all the results from an existing iterator, we can achieve the same result using either yield from iterable or simply return iterable. Obviously if you need your code to work in Python 2 as well as Python 3, you are locked into the return option. But if you are only targeting Python 3, is there a strong reason to prefer one over the other? The yield from version has the advantage of making it immediately obvious to anyone reading the code that this function is returning an iterator. On the other hand, I imagine that the naive implementation of yield from might be slower than simply returning an existing iterator, but CPython or other implementations might have optimized this case. So which of the following should be preferred?
def returns_an_iterator():
    return iter([1, 2, 3])

def yields_from_an_iterator():
    yield from iter([1, 2, 3])
Unless you have other reasons to wrap an iterator, you should return it directly.
def with_return():
    return range(10)

>>> with_return()
range(0, 10)
If you use yield from, you'll create an intermediate generator and add overhead when each value is read from the iterator. Returning the iterator directly will avoid that overhead.
def with_yield_from():
    yield from range(10)

>>> with_yield_from()
<generator object with_yield_from at 0x030DEDB0>
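If you want to check the overhead claim yourself, a quick (unscientific) comparison of the two wrappers above could look like this; numbers will vary by machine and Python version:
import timeit

print(timeit.timeit(lambda: list(with_return()), number=100000))
print(timeit.timeit(lambda: list(with_yield_from()), number=100000))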
Returning the iterator will be more efficient, removing extra indirection on each next call. However, note that the semantics are different.
If you return the iterator, your function is an ordinary function. The body will execute immediately when called, and things like context manager __exit__s and finally blocks in your function will execute before the iterator is returned and before any values are produced from the iterator.
If you yield from the iterator, your function is a generator function. The body will not execute when the function is called, and if the yield from happens inside a with or try block, __exit__ functions and finally blocks won't run until the generator is exhausted. Additionally, since the returned generator is not the iterator you yield from over, any additional functionality of the underlying iterator object will be unavailable.
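A small demonstration of that semantic difference, using variants of the two functions above with a finally block (the print strings are mine):
def with_return():
    try:
        return iter([1, 2, 3])
    finally:
        print("cleanup (return)")

def with_yield_from():
    try:
        yield from iter([1, 2, 3])
    finally:
        print("cleanup (yield from)")

it = with_return()       # prints "cleanup (return)" immediately
gen = with_yield_from()  # prints nothing yet; the body hasn't started
list(gen)                # prints "cleanup (yield from)" once exhausted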

How do Python Recursive Generators work?

In a Python tutorial, I've learned that
Like functions, generators can be recursively programmed. The following
example is a generator to create all the permutations of a given list of items.
def permutations(items):
    n = len(items)
    if n == 0:
        yield []
    else:
        for i in range(len(items)):
            for cc in permutations(items[:i] + items[i+1:]):
                yield [items[i]] + cc

for p in permutations(['r', 'e', 'd']): print(''.join(p))
for p in permutations(list("game")): print(''.join(p) + ", ", end="")
I cannot figure out how it generates the results. The recursive things and 'yield' really confused me. Could someone explain the whole process clearly?
There are two parts to this: recursion and generators. Here's the non-generator version that just uses recursion:
def permutations2(items):
    n = len(items)
    if n == 0:
        return [[]]
    else:
        l = []
        for i in range(len(items)):
            for cc in permutations2(items[:i] + items[i+1:]):
                l.append([items[i]] + cc)
        return l
l.append([items[i]] + cc) roughly translates to: the permutations of these items include every entry where items[i] is the first item, followed by a permutation of the rest of the items.
The generator version yields the permutations one at a time instead of returning the entire list of them.
When you call a function that returns, it disappears after having produced its result.
When you ask a generator for its next element, it produces it (yields it) and pauses, yielding control back to you. When asked again for the next element, it resumes its operations and runs normally until hitting a yield statement. Then it again produces a value and pauses.
Thus calling a generator with some argument causes the creation of an actual entity in memory, an object, capable of running, remembering its state and arguments, and producing values when asked.
Different calls to the same generator produce different objects in memory. The definition is a recipe for the creation of such an object. Once the recipe is defined, each invocation of it can call any other recipe it needs (or even the same one) to create the new objects it needs to produce its values.
This is a general answer, not Python-specific.
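In Python terms, you can watch this pausing and resuming by driving the generator by hand with next():
>>> gen = permutations(['r', 'e'])
>>> next(gen)  # runs until the first yield, then pauses
['r', 'e']
>>> next(gen)  # resumes right after that yield
['e', 'r']
>>> next(gen)  # nothing left to produce: StopIteration
Traceback (most recent call last):
  ...
StopIteration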
Thanks for the answers. They really helped me clear my mind, and now I want to share some useful resources about recursion and generators that I found on the internet, which are also very friendly to beginners.
To understand generators in Python, the link below is very readable and easy to understand:
What does the "yield" keyword do in Python?
To understand recursion: https://www.youtube.com/watch?v=MyzFdthuUcA. This YouTube video gives a "patented" four-step method for writing any recursive method/function, which is very clear and practical. The channel also has several videos showing how recursion works and how to trace it.
I hope it can help someone like me.

Global variable messes up my recursive function

I've just run into a tricky issue. The following code is supposed to split words into chunks of length numOfChar. The function calls itself, which makes it impossible to have the resulting list (res) inside the function. But if I keep it outside as a global variable, then every subsequent call of the function with different input values leads to a wrong result because res doesn't get cleared.
Can anyone help me out?
Here's the code
(in case you are interested, this is problem 7-23 from PySchools.com):
res = []

def splitWord(word, numOfChar):
    if len(word) > 0:
        res.append(word[:numOfChar])
        splitWord(word[numOfChar:], numOfChar)
    return res

print splitWord('google', 2)
print splitWord('google', 3)
print splitWord('apple', 1)
print splitWord('apple', 4)
A pure recursive function should not modify global state; that counts as a side effect.
Instead of appending-and-recursion, try this:
def splitWord(word, numOfChar):
    if len(word) > 0:
        return [word[:numOfChar]] + splitWord(word[numOfChar:], numOfChar)
    else:
        return []
Here, you chop the word into pieces one piece at a time, on every call while going down, and then rebuild the pieces into a list while going up.
This is a common recursive pattern. (Note that it is not tail recursion: the list concatenation happens after the recursive call returns, so each frame still has work to do on the way back up.)
P.S. As #e-satis notes, recursion is not an efficient way to do this in Python. See also #e-satis's answer for a more elaborate example of tail recursion, along with a more Pythonic way to solve the problem using generators.
Recursion is completely unnecessary here:
def splitWord(word, numOfChar):
    return [word[i:i+numOfChar] for i in xrange(0, len(word), numOfChar)]
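For the inputs from the question, this gives:
>>> splitWord('google', 2)
['go', 'og', 'le']
>>> splitWord('apple', 4)
['appl', 'e']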
If you insist on a recursive solution, it is a good idea to avoid global variables (they make it really tricky to reason about what's going on). Here is one way to do it:
def splitWord(word, numOfChar):
    if len(word) > 0:
        return [word[:numOfChar]] + splitWord(word[numOfChar:], numOfChar)
    else:
        return []
To elaborate on #Helgi's answer, here is a more performant recursive implementation. It updates a list in place instead of concatenating two lists (which creates a new object every time).
This pattern forces you to pass a list object as the third parameter.
def split_word(word, num_of_chars, tail):
    if len(word) > 0:
        tail.append(word[:num_of_chars])
        return split_word(word[num_of_chars:], num_of_chars, tail)
    return tail

res = split_word('fdjskqmfjqdsklmfjm', 3, [])
Another advantage of this form is that it allows tail-recursion optimisation. That's moot in Python, because Python doesn't perform this optimisation, but if you translate this code into Erlang or Lisp, you will get it for free.
Remember, in Python you are limited by the recursion stack, and there is no way out of it. This is why recursion is not the preferred method.
You would most likely use generators, with yield and itertools (a module to manipulate generators). Here is a very good example of a function that can split any iterable into chunks:
from itertools import chain, islice

def chunk(seq, chunksize, process=iter):
    it = iter(seq)
    while True:
        yield process(chain([it.next()], islice(it, chunksize - 1)))
Now it's a bit complicated if you start learning Python, so I'm not expecting you to fully get it now, but it's good that you can see this and know it exists. You'll come back to it later (we all did, Python iteration tools are overwhelming at first).
The benefits of this approach are:
It can chunk ANY iterable, not just strings, but also lists, dictionaries, tuples, streams, files, sets, queryset, you name it...
It accepts iterables of any length, and even one with an unknown length (think bytes stream here).
It uses very little memory, as the best thing about generators is that they produce values on the fly, one by one, and they don't store the previous results before computing the next.
It returns chunks of any nature, meaning you can have chunks of x letters, lists of x items, or even generators spitting out x items (which is the default).
It returns a generator, and therefore can be used in a flow of other generators. Piping data from one generator to the other, bash style, is a wonderful Python ability.
To get the same result as with your function, you would do:
In [17]: list(chunk('fdjskqmfjqdsklmfjm', 3, ''.join))
Out[17]: ['fdj', 'skq', 'mfj', 'qds', 'klm', 'fjm']
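Note that this answer is written for Python 2 (it.next(), like the print statements in the question). Under Python 3, next() is a builtin, and PEP 479 turns a StopIteration escaping a generator into a RuntimeError, so a direct port needs an explicit return; a rough sketch:
from itertools import chain, islice

def chunk(seq, chunksize, process=iter):
    it = iter(seq)
    while True:
        try:
            head = next(it)
        except StopIteration:
            return  # PEP 479: end the generator explicitly
        yield process(chain([head], islice(it, chunksize - 1)))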
