How to exhaust a generator without using "list()"? [duplicate]

I have some use cases in which I need to run generator functions without caring about the yielded items.
I cannot make them non-generator functions because in other use cases I certainly need the yielded values.
I am currently using a trivial self-made function to exhaust the generators.
def exhaust(generator):
    for _ in generator:
        pass
I wondered whether there is a simpler way to do this that I'm missing.
Edit
Here is a use case:
def create_tables(fail_silently=True):
    """Create the respective tables."""
    for model in MODELS:
        try:
            model.create_table(fail_silently=fail_silently)
        except Exception:
            yield (False, model)
        else:
            yield (True, model)
In some contexts, I care about the error and success values…
for success, table in create_tables():
    if success:
        print('Creation of table {} succeeded.'.format(table))
    else:
        print('Creation of table {} failed.'.format(table), file=stderr)
… and in some I just want to run the function "blindly":
exhaust(create_tables())

Setting up a for loop for this can be relatively expensive: a for loop in Python is fundamentally successive execution of simple assignment statements, so you'll be executing n assignments (where n is the number of items in the generator), only to discard the assignment targets afterwards.
You can instead feed the generator to a zero-length deque; it consumes the generator at C speed and does not use up memory the way list() and other callables that materialise iterators/generators do:
from collections import deque

def exhaust(generator):
    deque(generator, maxlen=0)
Taken from the consume itertools recipe.
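For reference, the consume recipe in the itertools documentation is essentially the following; its n parameter additionally lets you advance an iterator by only n steps:

from collections import deque
from itertools import islice

def consume(iterator, n=None):
    "Advance the iterator n steps ahead; if n is None, consume it entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)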

One very simple and possibly efficient solution could be
def exhaust(generator): all(generator)
This works if we can assume that the generator will always yield truthy values (as in your case, where a 2-tuple (success, table) is truthy even if success and table are both falsy). Use any(generator) if it will always yield falsy values, and in the "worst case", all(x or True for x in generator).
Being that short & simple, you might not even need a function for it!
Regarding the "why?" comment (I dislike these...): There are many cases where one may want to exhaust a generator. To cite just one, it's a way of doing a for loop as an expression, e.g., any(print(i,x) for i,x in enumerate(S)) - of course there are less trivial examples.

Based on your use case it's hard to imagine that there would be sufficiently many tables to create that you would need to consider performance.
Additionally, table creation is going to be much more expensive than iteration.
So the for loop that you already have would seem the simplest and most Pythonic solution - in this case.

You could just have two functions that each do one thing and call the appropriate one at the appropriate time?
def create_table(model, fail_silently=True):
    """Create the table."""
    try:
        model.create_table(fail_silently=fail_silently)
    except Exception:
        return (False, model)
    else:
        return (True, model)

def create_tables(MODELS):
    for model in MODELS:
        create_table(model)

def iter_create_tables(MODELS):
    for model in MODELS:
        yield create_table(model)
When you care about the returned values do:
for success, table in iter_create_tables(MODELS):
    if success:
        print('Creation of table {} succeeded.'.format(table))
    else:
        print('Creation of table {} failed.'.format(table), file=stderr)
When you don't, just do:
create_tables(MODELS)

Related

Mixing functions and generators

So I was writing a function where a lot of processing happens in the body of a loop, and occasionally it may be of interest to the caller to have the answer to some of the computations.
Normally I would just put the results in a list and return the list, but in this case the results are too large (a few hundred MB on each loop).
I wrote this without really thinking about it, expecting Python's dynamic typing to figure things out, but the following is always created as a generator.
def mixed(is_generator=False):
    for i in range(5):
        # process some stuff including writing to a file
        if is_generator:
            yield i
    return
From this I have two questions:
1) Does the presence of the yield keyword in a scope immediately turn the function it's in into a generator?
2) Is there a sensible way to obtain the behaviour I intended?
2.1) If no, what is the reasoning behind it not being possible? (In terms of how functions and generators work in Python.)
Let's go step by step:
1) Does the presence of the yield keyword in a scope immediately turn the function it's in into a generator? Yes.
2) Is there a sensible way to obtain the behaviour I intended? Yes, see example below
The trick is to wrap the computation and either return the generator itself or a list of its contents:
def mixed(is_generator=False):
    # create a generator object
    gen = (compute_stuff(i) for i in range(5))
    # if we want just the generator
    if is_generator:
        return gen
    # if not we consume it with a list and return that list
    return list(gen)
Anyway, I would say this is bad practice. You should keep the concerns separated: usually just have the generator function and then apply some logic outside:
def computation():
    for i in range(5):
        # process some stuff including writing to a file
        yield i

gen = computation()
if lazy:
    for data in gen:
        use_data(data)
else:
    data = list(gen)
    use_all_data(data)
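To make the answer to 1) concrete, here is a minimal sketch (with a hypothetical maybe_yield function) showing that the mere presence of yield turns the whole function into a generator function, even when the yield is unreachable for the given arguments:

def maybe_yield(flag=False):
    if flag:
        yield 1
    return

print(maybe_yield(False))        # <generator object maybe_yield at 0x...>
print(list(maybe_yield(False)))  # [] -- the body only runs on iteration
print(list(maybe_yield(True)))   # [1]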

Is for item in query() equivalent to for item in query().all() in SqlAlchemy [duplicate]

I would like to ask what's the difference between
for row in session.Query(Model1):
    pass
and
for row in session.Query(Model1).all():
    pass
Is the first somehow an iterator bombarding your DB with single queries, while the latter "eagerly" queries the whole thing as a list (like range(x) vs xrange(x))?
Nope, there is no difference in DB traffic. The difference is just that for row in session.Query(Model1) does the ORM work on each row when it is about to give it to you, while for row in session.Query(Model1).all() does the ORM work on all rows, before starting to give them to you.
Note that q.all() is just sugar for list(q), i.e. collecting everything yielded by the generator into a list. Here is the source code for it, in the Query class (find def all in the linked source):
def all(self):
    """Return the results represented by this ``Query`` as a list.

    This results in an execution of the underlying query.
    """
    return list(self)
... where self, the query object, is an iterable, i.e. has an __iter__ method.
So logically the two ways are exactly the same in terms of DB traffic; both end up calling query.__iter__() to get a row iterator, and next()ing their way through it.
The practical difference is that the former can start giving you rows as soon as their data has arrived, “streaming” the DB result set to you, with less memory use and latency. I can't state for sure that all the current engine implementations do that (I hope they do!). In any case the latter version prevents that efficiency, for no good reason.
Actually, the accepted answer is not true (or at least not anymore, if it ever was); specifically, the following statements are false:
(1) The difference is just that for row in session.Query(Model1) does the ORM work on each row when it is about to give it to you, while for row in session.Query(Model1).all() does the ORM work on all rows, before starting to give them to you.
SQLAlchemy will always ORM-map all rows, no matter which of the two options you choose. This can be seen in their source code within these lines; the loading.instances method will indeed return a generator, but one of already ORM-mapped instances. You can confirm this in the actual generator looping code:
for row in rows:  # ``rows`` here are already ORM-mapped rows
    yield row
So by the time the generator's first iteration has finished and an instance has been yielded, all instances have already been ORM-mapped. (Exemplified below in (1).)
(2) The practical difference is that the former can start giving you rows as soon as their data has arrived, “streaming” the DB result set to you, with less memory use and latency.
As explained above, this is also false, as all data retrieved from the DB will be processed/mapped before any yielding is done.
The comments are also misleading. Specifically:
The database hit may be the same, but please bear in mind that all will load the entire result set into memory. This could be gigabytes of data
The entire result set will be loaded regardless of using .all() or not. Clearly seen in this line. This is also very easy to test/verify.
The only difference I can see from the 2 options stated is that by using .all you will be looping twice through the results (if you want to process all instances/rows) as the generator is iterated/exhausted first by the call to list(self) to transform it into a list.
Since the SQLAlchemy code is not easy to digest, I wrote a short snippet to exemplify all this:
class Query:
    def all(self):
        return list(self)

    def instances(self, rows_to_fetch=5):
        """ORM instance generator"""
        mapped_rows = []
        for i in range(rows_to_fetch):
            # ORM mapping work here, as in lines 81-88 from loading.instances
            mapped_rows.append(i)
        print("orm work finished for all rows")
        for row in mapped_rows:  # same as ``yield from mapped_rows``
            print("yield row")
            yield row

    def __iter__(self):
        return self.instances()

query = Query()
print("(1) Generator scenario:")
print("First item of generator: ", next(iter(query)))
print("\n(2) List scenario:")
print("First item of list: ", query.all()[0])

"""
RESULTS:
--------
(1) Generator scenario:
orm work finished for all rows
yield row
First item of generator: 0

(2) List scenario:
orm work finished for all rows
yield row
yield row
yield row
yield row
yield row
First item of list: 0
"""
In order to have any finer control when processing large result sets one would have to use something like yield_per for example and still this would not be a one by one case scenario but rather performing the ORM mapping by batches of instances.
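As a hedged illustration (assuming a configured session and a mapped Model1 class; handle() is a hypothetical per-row callback), batch streaming with yield_per might look like this:

# rows are ORM-mapped and handed out in batches of 100 rather than all at once
for row in session.query(Model1).yield_per(100):
    handle(row)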
It's worth noting the only comment that was indeed on point and can easily be overlooked (hence this answer), which points to an explanation from Michael Bayer.

Pythonic equivalent of this function?

I have a function to port from another language, could you please help me make it "pythonic"?
Here is the function ported in a "non-pythonic" way (this is a bit of an artificial example: every task is associated with a project or None; starting from a list of tasks, we need a list of distinct projects, where distinct means no duplication of the .identifier property):
@staticmethod
def get_projects_of_tasks(task_list):
    projects = []
    project_identifiers_seen = {}
    for task in task_list:
        project = task.project
        if project is None:
            continue
        project_identifier = project.identifier
        if project_identifiers_seen.has_key(project_identifier):
            continue
        project_identifiers_seen[project_identifier] = True
        projects.append(project)
    return projects
I have specifically not even started to make it "pythonic", so as not to start off on the wrong foot (e.g. a list comprehension with "if project.identifier is not None", filter() based on a predicate that looks up the dictionary-based registry of identifiers, using set() to strip duplicates, etc.).
EDIT:
Based on the feedback, I have this:
@staticmethod
def get_projects_of_tasks(task_list):
    projects = []
    project_identifiers_seen = set()
    for task in task_list:
        project = task.project
        if project is None:
            continue
        project_identifier = project.identifier
        if project_identifier in project_identifiers_seen:
            continue
        project_identifiers_seen.add(project_identifier)
        projects.append(project)
    return projects
There's nothing massively unPythonic about this code. A couple of possible improvements:
project_identifiers_seen could be a set, rather than a dictionary.
foo.has_key(bar) is better spelled bar in foo
I'm suspicious that this is a staticmethod of a class. Usually there's no need for a class in Python unless you're actually doing data encapsulation. If this is just a normal function, make it a module-level one.
What about:
project_list = {task.project.identifier:task.project for task in task_list if task.project is not None}
return project_list.values()
For 2.6- use dict constructor instead:
return dict((x.project.id, x.project) for x in task_list if x.project).values()
def get_projects_of_tasks(task_list):
    seen = set()
    return [seen.add(task.project.identifier) or task.project  # add is always None
            for task in task_list
            if task.project is not None and task.project.identifier not in seen]
This works because (a) add returns None (and or returns the value of the last expression evaluated) and (b) the mapping clause (the first clause) is only executed if the if clause is True.
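A tiny demonstration of the trick in isolation (hypothetical values): set.add() returns None, so seen.add(x) or x evaluates to x while recording x as a side effect.

seen = set()
result = seen.add(3) or 3  # add() returns None, so ``or`` falls through to 3
print(result, seen)        # 3 {3}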
There is no reason that it has to be in a list comprehension - you could just as well set it out as a loop, and indeed you may prefer to. This way has the advantage that it is clear that you are just building a list, and what is supposed to be in it.
I've not used staticmethod because there is rarely a need for it. Either have this as a module-level function, or a classmethod.
An alternative is a generator (thanks to @delnan for pointing this out):
def get_projects_of_tasks(task_list):
    seen = set()
    for task in task_list:
        if task.project is not None and task.project.identifier not in seen:
            identifier = task.project.identifier
            seen.add(identifier)
            yield task.project
This eliminates the need for a side effect in the comprehension (which is controversial), but keeps clear what is being collected.
For the sake of avoiding another if/continue construction, I have left in two accesses to task.project.identifier. This could be conveniently eliminated with the use of a promise library.
This version uses promises to avoid repeated access to task.project.identifier without the need to include an if/continue:
from peak.util.proxies import LazyProxy, get_cache  # pip install ProxyTypes

def get_projects_of_tasks(task_list):
    seen = set()
    for task in task_list:
        identifier = LazyProxy(lambda: task.project.identifier)  # a transparent promise
        if task.project is not None and identifier not in seen:
            seen.add(identifier)
            yield task.project
This is safe from AttributeErrors because task.project.identifier is never accessed before task.project is checked.
Some say EAFP is pythonic, so:
@staticmethod
def get_projects_of_tasks(task_list):
    projects = {}
    for task in task_list:
        try:
            if task.project.identifier not in projects:
                projects[task.project.identifier] = task.project
        except AttributeError:
            pass
    return projects.values()
Of course an explicit check wouldn't be wrong either, and would be better if many tasks have no project.
And just one dict to keep track of seen identifiers and projects would be enough; if the order of the projects matters, an OrderedDict (Python 2.7+) could come in handy.
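As a hedged sketch of that suggestion (assuming the same task_list shape as above), a single OrderedDict can serve both as the registry of seen identifiers and as the ordered result container:

from collections import OrderedDict

def get_projects_of_tasks(task_list):
    projects = OrderedDict()
    for task in task_list:
        project = task.project
        # keep only the first project seen for each identifier, in order
        if project is not None and project.identifier not in projects:
            projects[project.identifier] = project
    return list(projects.values())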
There are already a lot of good answers, and, indeed, you have accepted one! But I thought I would add one more option. A number of people have seen that your code could be made more compact with generator expressions or list comprehensions. I'm going to suggest a hybrid style that uses generator expressions to do the initial filtering, while maintaining your for loop in the final filter.
The advantage of this style over the style of your original code is that it simplifies the flow of control by eliminating continue statements. The advantage of this style over a single list comprehension is that it avoids multiple accesses to task.project.identifier in a natural way. It also handles mutable state (the seen set) transparently, which I think is important.
def get_projects_of_tasks(task_list):
    projects = (task.project for task in task_list)
    ids_projects = ((p.identifier, p) for p in projects if p is not None)
    seen = set()
    unique_projects = []
    for id, p in ids_projects:
        if id not in seen:
            seen.add(id)
            unique_projects.append(p)
    return unique_projects
Because these are generator expressions (enclosed in parentheses instead of brackets), they don't build temporary lists. The first generator expression creates an iterable of projects; you could think of it as performing the project = task.project line from your original code on all the projects at once. The second generator expression creates an iterable of (project_id, project) tuples. The if clause at the end filters out the None values; (p.identifier, p) is only evaluated if p passes through the filter. Together, these two generator expressions do away with your first two if blocks. The remaining code is essentially the same as yours.
Note also the excellent suggestion from Marcin/delnan that you create a generator using yield. This cuts down further on the verbosity of your code, boiling it down to its essentials:
def get_projects_of_tasks(task_list):
    projects = (task.project for task in task_list)
    ids_projects = ((p.identifier, p) for p in projects if p is not None)
    seen = set()
    for id, p in ids_projects:
        if id not in seen:
            seen.add(id)
            yield p
The only disadvantage -- in case this isn't obvious -- is that if you want to permanently store the projects, you have to pass the resulting iterable to list.
projects_of_tasks = list(get_projects_of_tasks(task_list))

Structured programming and Python generators?

Update: What I really wanted all along were greenlets.
Note: This question mutated a bit as people answered and forced me to "raise the stakes", as my trivial examples had trivial simplifications; rather than continue to mutate it here, I will repose the question when I have it clearer in my head, as per Alex's suggestion.
Python generators are a thing of beauty, but how can I easily break one down into modules (structured programming)? I effectively want PEP 380, or at least something comparable in syntax burden, but in existing Python (e.g. 2.6)
As an (admittedly stupid) example, take the following:
def sillyGenerator():
    for i in xrange(10):
        yield i*i
    for i in xrange(12):
        yield i*i
    for i in xrange(8):
        yield i*i
Being an ardent believer in DRY, I spot the repeated pattern here and factor it out into a method:
def quadraticRange(n):
    for i in xrange(n):
        yield i*i

def sillyGenerator():
    quadraticRange(10)
    quadraticRange(12)
    quadraticRange(8)
...which of course doesn't work. The parent must call the new function in a loop, yielding the results:
def sillyGenerator():
    for i in quadraticRange(10):
        yield i
    for i in quadraticRange(12):
        yield i
    for i in quadraticRange(8):
        yield i
...which is even longer than before!
If I want to push part of a generator into a function, I always need this rather verbose, two-line wrapper to call it. It gets worse if I want to support send():
def sillyGeneratorRevisited():
    g = subgenerator()
    v = None
    try:
        while True:
            v = yield g.send(v)
    except StopIteration:
        pass
    if v < 4:
        pass  # ...
    else:
        pass  # ...
And that's not taking into account passing in exceptions. The same boilerplate every time! Yet one cannot apply DRY and factor this identical code into a function, because...you'd need the boilerplate to call it! What I want is something like:
def sillyGenerator():
    yield from quadraticRange(10)
    yield from quadraticRange(12)
    yield from quadraticRange(8)

def sillyGeneratorRevisited():
    v = yield from subgenerator()
    if v < 4:
        pass  # ...
    else:
        pass  # ...
Does anyone have a solution to this problem? I have a first-pass attempt, but I'd like to know what others have come up with. Ultimately, any solution will have to tackle examples where the main generator performs complex logic based on the result of data sent into the generator, and potentially makes a very large number of calls to sub-generators: my use-case is generators used to implement long-running, complex state machines.
However, I'd like to make my reusability criteria one notch harder: what if I need a control structure around my repeated generation?
itertools often helps even there - you need to provide concrete examples where you think it doesn't.
For instance, I might want to call a subgenerator forever with different parameters.
itertools.chain.from_iterable.
Or my subgenerators might be very costly, and I only want to start them up as and when they are reached.
Both chain and chain.from_iterable do that -- no sub-iterator is "started up" until the very instant the first item from it is needed.
Or (and this is a real desire) I might want to vary what I do next based on what my controller passes me using send().
A specific example would be greatly appreciated. Anyway, worst case, you'll be coding for x in blargh: yield x where the suspended PEP 380 would let you code yield from blargh -- about 4 extra characters (not a tragedy ;-).
And if some sophisticated coroutine-version of some itertools functionality (itertools mostly supports iterators - there's no equivalent coroutools module yet) becomes warranted, because a certain pattern of coroutine composition is often repeated in your code, then it's not too hard to code it yourself.
For example, suppose we often find ourselves doing something like: first yield a certain value; then, repeatedly, if we're sent 'foo', yield the next item from fooiter, if 'bla', from blaiter, if 'zop', from zopiter, anything else, from defiter. As soon as we spot the second occurrence of this compositional pattern, we can code:
def corou_chaiters(initsend, defiter, val2itermap):
    currentiter = iter([initsend])
    while True:
        val = yield next(currentiter)
        # look up the iterator for the value we were sent, defaulting to defiter
        currentiter = val2itermap.get(val, defiter)
and call this simple compositional function as and when needed. If we need to compose other coroutines, rather than general iterators, we'll have a slightly different composer using the send method instead of the next built-in function; and so forth.
If you can offer an example that's not easily tamed by such techniques, I suggest you do so in a separate question (specifically targeted to coroutine-like generators), as there's already a lot of material on this one that will have little to do with your other, much more complex/sophisticated, example.
You want to chain several iterators together:
from itertools import chain

def sillyGenerator(a, b, c):
    return chain(quadraticRange(a), quadraticRange(b), quadraticRange(c))
Impractical (unfortunately) answer:
from __future__ import PEP0380

def sillyGenerator():
    yield from quadraticRange(10)
    yield from quadraticRange(12)
    yield from quadraticRange(8)
Potentially practical reference: Syntax for delegating to a subgenerator
Unfortunately making this impractical: Python language moratorium
UPDATE Feb 2011:
The moratorium has been lifted, and PEP 380 is on the TODO list for Python 3.3. Hopefully this answer will be practical soon.
Read Guido's remarks on comp.python.devel
import itertools

def quadraticRange(n):
    for i in xrange(n):
        yield i*i

def sillyGenerator():
    return itertools.chain(
        quadraticRange(10),
        quadraticRange(12),
        quadraticRange(8),
    )

def sillyGenerator2():
    return itertools.chain.from_iterable(
        quadraticRange(n) for n in [10, 12, 8])
The last is useful if you want to make sure one iterator is exhausted before another starts (including its initialization code).
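A small sketch demonstrating that laziness (noisy_range is a hypothetical stand-in for a sub-generator with expensive start-up code):

import itertools

def noisy_range(n):
    print('initialising range of %d' % n)  # stands in for expensive set-up work
    for i in range(n):
        yield i * i

lazy = itertools.chain.from_iterable(noisy_range(n) for n in [2, 3])
print(next(lazy))  # only now prints "initialising range of 2", then 0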
There is a Python Enhancement Proposal for providing a yield from statement for "delegating generation". Your example would be written as:
def sillyGenerator():
    sq = lambda i: i * i
    yield from map(sq, xrange(10))
    yield from map(sq, xrange(12))
    yield from map(sq, xrange(8))
Or better, in the spirit of DRY:
def sillyGenerator():
    for i in [10, 12, 8]:
        yield from quadraticRange(i)
The proposal is in draft status and its eventual inclusion is not certain, but it shows that other developers share your thoughts about generators.
For an arbitrary number of calls to quadraticRange:
from itertools import chain

def sillyGenerator(*args):
    return chain(*map(quadraticRange, args))
This code uses map and itertools.chain. It takes an arbitrary number of arguments and passes them in order to quadraticRange. The resulting iterators are then chained.
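Hypothetical usage, equivalent to the three fixed calls in the original question:

for value in sillyGenerator(10, 12, 8):
    print(value)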
There is a pattern which I call "generator kernel" where generators don't yield directly to the user but to some "kernel" loop that treats (some of) their yields as "system calls" with special meaning.
You can apply it here with an intermediate function that accepts yielded generators and unrolls them automatically. To make it easy to use, we'll create that intermediate function as a decorator:
import functools, types

def flatten(values_or_generators):
    for x in values_or_generators:
        if isinstance(x, types.GeneratorType):
            for y in x:
                yield y
        else:
            yield x

# Better name, anyone?
def subgenerator(g):
    """Decorator making ``yield <gen>`` mean ``yield from <gen>``."""
    @functools.wraps(g)
    def flat_g(*args, **kw):
        return flatten(g(*args, **kw))
    return flat_g
and then you can just write:
def quadraticRange(n):
    for i in xrange(n):
        yield i*i

@subgenerator
def sillyGenerator():
    yield quadraticRange(10)
    yield quadraticRange(12)
    yield quadraticRange(8)
Note that subgenerator() unrolls exactly one level of the hierarchy. You could easily make it multi-level (by managing a manual stack, or just replacing the inner loop with for y in flatten(x)), but I think it's better as it is, so that every generator that wants to use this non-standard syntax has to be explicitly wrapped with @subgenerator.
Note also that detection of generators is imperfect! It will detect things written as generators, but that's an implementation detail. As a caller of a generator, all you care about is that it returns an iterator. It could be a function returning some itertools object, and then this decorator would fail.
Checking whether the object has a .next() method is too broad - you won't be able to yield strings without them being taken apart. So the most reliable way would be to check for some explicit marker, so you'd write e.g.:
@subgenerator
def sillyGenerator():
    yield 'from', quadraticRange(10)
    yield 'from', quadraticRange(12)
    yield 'from', quadraticRange(8)
Hey, that's almost like the PEP!
[credits: this answer gives a similar function - but it's deep (which I consider wrong) and is not framed as a decorator]
class Communicator:
    def __init__(self, inflow):
        self.outflow = None
        self.inflow = inflow
Then you do:
c = Communicator(something)
yield c
response = c.outflow
And instead of the boilerplate code you can simply do:
for i in run():
    something = i.inflow
    # ...
    i.outflow = value_to_return_back
It's simple enough code that it works without much mental overhead.
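A hedged, self-contained sketch of that pattern end to end (the run() generator, the integer payloads, and the multiply-by-ten reply are all hypothetical): the generator yields Communicator objects, the driving loop fills in .outflow, and the generator reads the reply once it is resumed for the next item.

class Communicator:
    def __init__(self, inflow):
        self.outflow = None
        self.inflow = inflow

def run():
    for i in range(3):
        c = Communicator(i)  # hand a value out to the caller
        yield c
        # We only get here when the caller asks for the next item,
        # by which time it has had a chance to set c.outflow.
        print('generator got back:', c.outflow)

for comm in run():
    print('caller received:', comm.inflow)
    comm.outflow = comm.inflow * 10  # reply before the generator resumes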
