Generators and files - python

When I write:
lines = (line.strip() for line in open('a_file'))
Is the file opened immediately or is the file system only accessed when I start to consume the generator expression?

open() is called immediately upon the construction of the generator, irrespective of when or whether you consume from it.
The relevant spec is PEP-289:
Early Binding versus Late Binding
After much discussion, it was
decided that the first (outermost) for-expression should be evaluated
immediately and that the remaining expressions be evaluated when the
generator is executed.
Asked to summarize the reasoning for binding the first expression,
Guido offered [5]:
Consider sum(x for x in foo()). Now suppose there's a bug in foo()
that raises an exception, and a bug in sum() that raises an exception
before it starts iterating over its argument. Which exception would
you expect to see? I'd be surprised if the one in sum() was raised
rather the one in foo(), since the call to foo() is part of the
argument to sum(), and I expect arguments to be processed before the
function is called.
OTOH, in sum(bar(x) for x in foo()), where sum() and foo() are
bugfree, but bar() raises an exception, we have no choice but to delay
the call to bar() until sum() starts iterating -- that's part of the
contract of generators. (They do nothing until their next() method is
first called.)
See the rest of that section for further discussion.
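The early/late binding rule quoted above can be observed directly. The following is a minimal sketch for Python 3; the helper names outer and inner and the calls log are made up for illustration and do not come from the PEP.

```python
calls = []  # hypothetical log of which helper ran, in order

def outer():
    # Outermost iterable: evaluated when the generator expression is created.
    calls.append("outer")
    return [1, 2]

def inner(x):
    # Element expression: evaluated only while the generator is consumed.
    calls.append("inner")
    return x * 2

gen = (inner(x) for x in outer())
calls_after_creation = list(calls)  # snapshot taken before any iteration

result = list(gen)                  # consuming the generator runs inner()
print(calls_after_creation)
print(result)
print(calls)
```

Only outer() has run by the time the snapshot is taken; inner() runs once per element during list(gen).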

It is opened immediately. You can verify this by using a filename that doesn't exist: the generator expression raises an exception right away, which shows that Python actually tried to open the file immediately.
You can also use a function that gives more feedback to see that the command is executed even before the generator is iterated over:
def somefunction(filename):
    print(filename)
    return open(filename)
lines = (line.strip() for line in somefunction('a_file')) # prints
However if you use a generator function instead of a generator expression the file is only opened when you iterate over it:
def somefunction(filename):
    print(filename)
    for line in open(filename):
        yield line.strip()
lines = somefunction('a_file') # no print!
list(lines) # prints because list iterates over the generator function.
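The missing-file check mentioned above can be sketched as follows, assuming Python 3 (where a missing file raises FileNotFoundError). The function name stripped_lines and the temporary directory are illustrative choices, not part of the original answer.

```python
import os
import tempfile

def stripped_lines(path):
    # Generator function: open() runs on the first next(), not at call time.
    with open(path) as handle:
        for line in handle:
            yield line.strip()

workdir = tempfile.mkdtemp()                 # fresh, empty directory
missing = os.path.join(workdir, "missing.txt")

# Generator expression: open() is called right here, so this fails at once.
try:
    eager = (line.strip() for line in open(missing))
    eager_failed_at_creation = False
except FileNotFoundError:
    eager_failed_at_creation = True

# Generator function: creating the generator touches nothing on disk.
lazy = stripped_lines(missing)
try:
    next(lazy)
    lazy_failed_on_first_next = False
except FileNotFoundError:
    lazy_failed_on_first_next = True

os.rmdir(workdir)
print(eager_failed_at_creation, lazy_failed_on_first_next)
```

Using with inside the generator function also closes the file once the generator is exhausted, which the bare open() in the generator expression does not guarantee.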

It is opened immediately.
Example:
def func():
    print('x')
    return [1, 2, 3]
g = (x for x in func())
Output:
x
The function needs to return an iterable object.
open() returns an open file object that is iterable.
Therefore, the file will be opened when you define the generator expression.

Related

What happens when you invoke a function that contains yield?

I read here the following example:
>>> def double_inputs():
...     while True: # Line 1
...         x = yield # Line 2
...         yield x * 2 # Line 3
...
>>> gen = double_inputs()
>>> next(gen) # Run up to the first yield
>>> gen.send(10) # goes into 'x' variable
If I understand the above correctly, it seems to imply that Python actually waits until next(gen) to "run up to" Line 2 in the body of the function. Put another way, the interpreter would not start executing the body of the function until we call next.
Is that actually correct?
To my knowledge, Python does not do AOT compilation, and it doesn't "look ahead" much except for parsing the code and making sure it's valid Python. Is this correct?
If the above are true, how would Python know when I invoke double_inputs() that it needs to wait until I call next(gen) before it even enters the loop while True?
Correct. Calling double_inputs never executes any of the body; it simply returns a generator object. The presence of the yield expression in the body, discovered when the def statement is compiled, changes the semantics of the function: calling it creates a generator object instead of running the body.
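One way to check this is with the standard inspect module. This is a small sketch, assuming Python 3; the executed list is a hypothetical log added only to show whether the body has started.

```python
import inspect

executed = []  # hypothetical log to show whether the body ran

def double_inputs():
    executed.append("body started")
    while True:
        x = yield
        yield x * 2

# The function is already marked as a generator function at definition time.
is_genfunc = inspect.isgeneratorfunction(double_inputs)

gen = double_inputs()
state_after_call = inspect.getgeneratorstate(gen)
ran_after_call = list(executed)   # still empty: no body code has run

next(gen)                         # now the body runs up to the first yield
state_after_next = inspect.getgeneratorstate(gen)
print(is_genfunc, state_after_call, ran_after_call, state_after_next)
```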
A function that contains yield is a generator function.
When you call gen = double_inputs(), you get a generator instance as the result. You need to consume this generator by calling next.
So for your first question, it is true. The first call to next runs Line 1 and pauses at the yield on Line 2; Line 3 runs only after the following send or next.
For your second question, I don't exactly get your point. When you define the function, Python knows what you are defining; it doesn't need to look ahead when running it.
For your third question, the key is the yield keyword.
A generator function is de jure a function, but de facto it is a factory for iterators, i.e. it behaves like a class (whose instances implement __next__(), __iter__(), and some other methods). In other words, it is a class disguised as a function.
It means that “calling” this function in reality creates an instance of this class, which explains why the “called function” initially does nothing. This is the answer to your 3rd question.
The answer to your 1st question is surprisingly no.
Instances always wait for their methods to be called, and the __next__() method (indirectly invoked by calling the next() built-in function) is not the only method of generators. Another method is .send(), and you may use gen.send(None) instead of your next(gen).
The answer to your 2nd question is no. The Python interpreter by no means "looks ahead", and there are no exceptions, including your
... except for parsing the code and making sure it's valid Python.
Or the answer to this question is yes, if you mean “parsing only up to the next command”. ;-)
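The remark above that gen.send(None) can stand in for next(gen) can be tried out with the generator from the question. This is only a sketch for Python 3; the variable names first, doubled and after are illustrative.

```python
def double_inputs():
    while True:
        x = yield
        yield x * 2

gen = double_inputs()
first = gen.send(None)   # same effect as next(gen): runs to the first yield
doubled = gen.send(10)   # 10 becomes the value of `x = yield`; the next yield gives 20
after = next(gen)        # resumes after `yield x * 2` and pauses at `x = yield` again
print(first, doubled, after)
```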

Why does Python allow mentioning a method without calling it?

I had some trouble finding my error: I had written
myfile.close
instead of
myfile.close()
I am surprised and somewhat unhappy that python did not object; how come? BTW the file was NOT closed.
(python 2.7 on Ubuntu)
In Python, methods are first-class objects, so you can write something like this:
my_close = myfile.close
my_close()
Since expressions don't have to be assigned to some variable
2 + 3
a simple
myfile.close
is valid, too.
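The difference between mentioning and calling the method can be sketched like this. It assumes Python 3 and uses io.StringIO as a stand-in for a real file, so nothing is written to disk; the variable names are illustrative.

```python
import io

myfile = io.StringIO("some text")  # in-memory stand-in for an open file

myfile.close                       # expression statement: method looked up, result discarded
closed_after_mention = myfile.closed

my_close = myfile.close            # bound method stored under another name
my_close()                         # the parentheses perform the call
closed_after_call = myfile.closed
print(closed_after_mention, closed_after_call)
```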
That is because myfile.close evaluates to a method object.
If you do print(myfile.close) you will get output like:
<built-in method close of file object at 0x7fb672e0f540>
A Python file object has a built-in close() method for closing the file.
In your case myfile.close returns a reference to the method object, but the method is not called.
This example may be helpful for understanding this:
>>> def test():
...     print 'test print'
...
>>> a = test
>>> print(a)
<function test at 0x7f6ff1f6ab90>
>>>
myfile.close is a member function of a file-like object. Your code just 'mentions' this function, but doesn't call it (this is why the file wasn't closed).
It's the same as doing this:
a = 5
a # OK but does nothing
def test():
    return 256
test # also not an error, but useless in this case
Now, you may say, why even allow this if it's totally useless? Well, not quite. With this you can pass functions as arguments to other functions, for example.
def MyMap(function, iterable):
    for x in iterable:
        yield function(x)
print(list(MyMap(str, range(5))))
fp.close is a method of the file object.
>>> fp.close
<built-in method close of file object at 0xb749c1d8>
fp.close() is a call of that method, made by adding ().
>>> fp.close()
You can understand this in more detail with the following demo.
We define a test function and use both test and test() to see the difference.
>>> def test():
...     return 1
...
>>> test
<function test at 0xb742bb54>
>>> test()
1
Your code does not throw an error because you don't have to assign a name to the result of an expression (edit: or statement). Python allows this because it could be that you only need the expression/statement for some sort of side effect but not the actual return value.
This is often seen when advancing an iterator without assigning a name to the yielded object, i.e. just having a line
next(myiter, None)
in the code.
If you wanted expressions with no side effects like
1 + 1
or
myfile.close
to throw an error if there is no name assigned to the result (and therefore the expression is useless), Python would have to inspect every such expression for possible side effects. Compared to the current behavior of just allowing these expressions, this seems needlessly expensive (and would also be very complex, if not impossible, for every expression).
Consider how dynamic Python is. You could override what the + operator does to certain types at runtime.
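As an illustration of that last point, here is a small sketch, assuming Python 3, in which a bare a + b expression has a side effect. The class name Logged and its log attribute are hypothetical.

```python
class Logged:
    """Hypothetical type whose + operator has a side effect."""
    log = []

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        Logged.log.append((self.value, other.value))
        return Logged(self.value + other.value)

a, b = Logged(1), Logged(2)
a + b  # result discarded, but __add__ still ran and wrote to the log
log_after_first = list(Logged.log)

# The behaviour of + can even be replaced while the program is running.
Logged.__add__ = lambda self, other: Logged(0)
a + b  # uses the replacement, which does not log
print(log_after_first, Logged.log)
```

A checker that rejected "useless" expression statements would have to know which __add__ is in effect at that moment, which in general cannot be decided before the code runs.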

Unexecuted yield statement blocks function to run?

In the below simplified code, I would like to reuse a loop to do a preparation first and yield the result.
However, the preparation (bar()) function is never executed.
Is yield statement changing the flow of the function?
def bar(*args, **kwargs):
    print("ENTER bar")
    pass
def foo(prepare=False):
    print("ENTER foo")
    for x in range(1, 10):
        if prepare:
            bar(x)
        else:
            yield x
foo(prepare=True)
r = foo(prepare=False)
for x in r:
    pass
Because the foo definition contains a yield, it won't run like a normal function even if you call it like one (e.g. foo(prepare=True) ).
Running foo() with whatever arguments will return a generator object, suitable to be iterated through. The body of the definition won't be run until you try and iterate that generator object.
The new coroutine syntax puts a keyword at the start of the definition, so that the change in nature isn't hidden inside the body of the function.
The problem is that having a yield statement changes the function to returning a generator and alters the behavior of the function.
Basically this means that each call to the generator's next method (.next() in Python 2, __next__() in Python 3) executes the function up to the next yield or to the termination of the function (in which case it raises a StopIteration exception).
Consequently what you should have done is to ensure that you iterate over it even if the yield statement won't be reached. Like:
r = foo(prepare=True)
for x in r:
    pass
In this case the loop will terminate immediately as no yield statement is being reached.
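The behaviour described above can be checked with a variant of the question's code. This is a sketch for Python 3 in which the print calls are replaced by a hypothetical bar_calls log so the effect can be inspected.

```python
bar_calls = []  # hypothetical log replacing the print in bar()

def bar(x):
    bar_calls.append(x)

def foo(prepare=False):
    for x in range(1, 10):
        if prepare:
            bar(x)
        else:
            yield x

foo(prepare=True)                   # generator created and discarded: bar never runs
calls_without_iteration = len(bar_calls)

yielded = list(foo(prepare=True))   # iterating drives the body: bar runs 9 times
calls_after_iteration = len(bar_calls)
print(calls_without_iteration, yielded, calls_after_iteration)
```

If the preparation step should run on a plain call, one option is to move it into a separate regular function that contains no yield.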
In my opinion, the actual explanation here is that:
Python runs the body of a generator function lazily!
And I'll explain:
When you call
foo(prepare=True)
just like that, nothing happens, although you might have expected bar(x) to be executed 9 times. What really happens is that no one demands any values from the generator returned by the foo(prepare=True) call, so the body (including the if) never runs; it would run if you iterated over the return value of foo.
In the second call to foo, the for loop iterates over the return value r, so Python has to run the body, and it does. I'll show that:
Case 1
r = foo(prepare=True)
for x in r:
    pass
The output here is 'ENTER bar' 9 times. This means that bar is executed 9 times.
Case 2
r = foo(prepare=False)
for x in r:
    pass
In this case no 'ENTER bar' is printed, as expected.
To sum everything up, I'll say that:
There are some cases where Python performs lazy evaluation; one of them is the body of a generator function.
Not everything is evaluated lazily in Python,
for example:
# builds a big list and immediately discards it
sum([x*x for x in xrange(2000000)])
vs.
# only keeps one value at a time in memory
sum(x*x for x in xrange(2000000))
For more about lazy and eager evaluation in Python, continue reading here.

Calling gen.send() with a new generator in Python 3.3+?

From PEP342:
Because generator-iterators begin execution at the top of the generator's function body, there is no yield expression to receive a value when the generator has just been created. Therefore, calling send() with a non-None argument is prohibited when the generator iterator has just started, ...
For example,
>>> def a():
...     for i in range(5):
...         print((yield i))
...
>>> g = a()
>>> g.send("Illegal")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't send non-None value to a just-started generator
Why is this illegal? The way I understood the use of yield here, it pauses execution of the function, and returns to that spot the next time that next() (or send()) is called. But it seems like it should be legal to print the first result of (yield i)?
Asked a different way, in what state is the generator 'g' directly after g = a(). I assumed that it had run a() up until the first yield, and since there was a yield it returned a generator, instead of a standard synchronous object return.
So why exactly is calling send with non-None argument on a new generator illegal?
Note: I've read the answer to this question, but it doesn't really get to the heart of why it's illegal to call send (with non-None) on a new generator.
Asked a different way, in what state is the generator 'g' directly after g = a(). I assumed that it had run a() up until the first yield, and since there was a yield it returned a generator, instead of a standard synchronous object return.
No. Right after g = a() it is right at the beginning of the function. It does not run up to the first yield until after you advance the generator once (by calling next(g)).
This is what it says in the quote you included in your question: "Because generator-iterators begin execution at the top of the generator's function body..." It also says it in PEP 255, which introduced generators:
When a generator function is called, the actual arguments are bound to function-local formal argument names in the usual way, but no code in the body of the function is executed.
Note that it does not matter whether the yield statement is actually executed. The mere occurrence of yield inside the function body makes the function a generator, as documented:
Using a yield expression in a function definition is sufficient to cause that definition to create a generator function instead of a normal function.
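To make the state of g concrete, here is a sketch based on the generator from the question, assuming Python 3. The print is replaced by a hypothetical received list so the sent values can be inspected.

```python
received = []  # hypothetical log replacing the print in the question

def a():
    for i in range(5):
        received.append((yield i))

g = a()
try:
    g.send("too early")       # no yield expression is waiting yet
    error = None
except TypeError as exc:
    error = str(exc)

first = next(g)               # runs to the first yield, which produces 0
second = g.send("hello")      # "hello" becomes the value of that yield; the loop yields 1
print(error)
print(first, second, received)
```

The failed send leaves the generator unstarted, so it can still be primed with next(g) afterwards.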

How yield catches StopIteration exception?

Why does the function in this example terminate:
def func(iterable):
    while True:
        val = next(iterable)
        yield val
but if I remove the yield statement, the function raises a StopIteration exception?
EDIT: Sorry for misleading you guys. I know what generators are and how to use them. Of course when I said function terminates I didn't mean eager evaluation of function. I just implied that when I use function to produce generator:
gen = func(iterable)
in case of func it works and returns the same generator, but in case of func2:
def func2(iterable):
    while True:
        val = next(iterable)
it raises StopIteration instead of None return or infinite loop.
Let me be more specific. There is a function tee in itertools which is equivalent to:
def tee(iterable, n=2):
    it = iter(iterable)
    deques = [collections.deque() for i in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque: # when the local deque is empty
                newval = next(it) # fetch a new value and
                for d in deques: # load it to all the deques
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)
There is, in fact, some magic, because the nested function gen has an infinite loop without break statements. The gen function terminates due to a StopIteration exception when there are no items left in it. But it terminates correctly (without raising exceptions), i.e. it just stops the loop. So the question is: where is StopIteration handled?
Note: This question (and the original part of my answer to it) are only really meaningful for Python versions prior to 3.7. The behavior that was asked about no longer happens in 3.7 and later, thanks to changes described in PEP 479. So this question and the original answer are only really useful as historical artifacts. After the PEP was accepted, I added an additional section at the bottom of the answer which is more relevant to modern versions of Python.
To answer your question about where the StopIteration gets caught in the gen generator created inside of itertools.tee: it doesn't. It is up to the consumer of the tee results to catch the exception as they iterate.
First off, it's important to note that a generator function (which is any function with a yield statement in it, anywhere) is fundamentally different than a normal function. Instead of running the function's code when it is called, instead, you'll just get a generator object when you call the function. Only when you iterate over the generator will you run the code.
A generator function will never finish iterating without raising StopIteration (unless it raises some other exception instead). StopIteration is the signal from the generator that it is done, and it is not optional. If you reach a return statement or the end of the generator function's code without raising anything, Python will raise StopIteration for you!
This is different from regular functions, which return None if they reach the end without returning anything else. It ties in with the different ways that generators work, as I described above.
Here's an example generator function that will make it easy to see how StopIteration gets raised:
def simple_generator():
    yield "foo"
    yield "bar"
    # StopIteration will be raised here automatically
Here's what happens when you consume it:
>>> g = simple_generator()
>>> next(g)
'foo'
>>> next(g)
'bar'
>>> next(g)
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    next(g)
StopIteration
Calling simple_generator always returns a generator object immediately (without running any of the code in the function). Each call of next on the generator object runs the code until the next yield statement, and returns the yielded value. If there is no more to get, StopIteration is raised.
Now, normally you don't see StopIteration exceptions. The reason for this is that you usually consume generators inside for loops. A for statement will automatically call next over and over until StopIteration gets raised. It will catch and suppress the StopIteration exception for you, so you don't need to mess around with try/except blocks to deal with it.
A for loop like for item in iterable: do_stuff(item) is almost exactly equivalent to this while loop (the only difference being that a real for doesn't need a temporary variable to hold the iterator):
iterator = iter(iterable)
try:
    while True:
        item = next(iterator)
        do_stuff(item)
except StopIteration:
    pass
finally:
    del iterator
The gen generator function you showed at the top is one exception. It uses the StopIteration exception produced by the iterator it is consuming as its own signal that it is done being iterated on. That is, rather than catching the StopIteration and then breaking out of the loop, it simply lets the exception go uncaught (presumably to be caught by some higher level code).
Unrelated to the main question, there is one other thing I want to point out. In your code, you're calling next on a variable called iterable. If you take that name as documentation for what type of object you will get, this is not necessarily safe.
next is part of the iterator protocol, not the iterable (or container) protocol. It may work for some kinds of iterables (such as files and generators, as those types are their own iterators), but it will fail for other iterables, such as tuples and lists. The more correct approach is to call iter on your iterable value, then call next on the iterator you receive. (Or just use for loops, which call both iter and next for you at appropriate times!)
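The distinction between iterables and iterators can be seen in a few lines. This is a minimal sketch for Python 3; the variable names are illustrative.

```python
values = [1, 2, 3]

try:
    next(values)              # a list is iterable but is not an iterator
    list_error = None
except TypeError as exc:
    list_error = type(exc).__name__

iterator = iter(values)       # ask the iterable for an iterator first
first = next(iterator)

gen = (v for v in values)
gen_is_own_iterator = iter(gen) is gen  # generators are their own iterators
print(list_error, first, gen_is_own_iterator)
```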
I just found my own answer in a Google search for a related question, and I feel I should update to point out that the answer above is not true in modern Python versions.
PEP 479 has made it an error to allow a StopIteration to bubble up uncaught from a generator function. If that happens, Python will turn it into a RuntimeError exception instead. This means that code like the examples in older versions of itertools that used a StopIteration to break out of a generator function needs to be modified. Usually you'll need to catch the exception with a try/except and then return.
Because this was a backwards incompatible change, it was phased in gradually. In Python 3.5, all code worked as before by default, but you could get the new behavior with from __future__ import generator_stop. In Python 3.6, unmodified code would still work, but it would give a warning. In Python 3.7 and later, the new behavior applies all the time.
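The change can be seen with a sketch like the following, which assumes Python 3.7 or later. The function names leaky and fixed are illustrative and not taken from itertools.

```python
def leaky(iterator):
    # Pre-PEP 479 style: relies on StopIteration escaping from next().
    while True:
        yield next(iterator)

def fixed(iterator):
    # Catch the exception and return, which ends the generator cleanly.
    while True:
        try:
            value = next(iterator)
        except StopIteration:
            return
        yield value

try:
    list(leaky(iter([1, 2, 3])))
    leaky_error = None
except RuntimeError as exc:
    leaky_error = str(exc)

fixed_result = list(fixed(iter([1, 2, 3])))
print(leaky_error)
print(fixed_result)
```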
When a function contains yield, calling it does not actually execute anything, it merely creates a generator object. Only iterating over this object will execute the code. So my guess is that you're merely calling the function, which means the function doesn't raise StopIteration because it is never being executed.
Given your function, and an iterable:
def func(iterable):
    while True:
        val = next(iterable)
        yield val
iterable = iter([1, 2, 3])
This is the wrong way to call it:
func(iterable)
This is the right way:
for item in func(iterable):
    pass # do something with item
You could also store the generator in a variable and call next() on it (or iterate over it in some other way):
gen = func(iterable)
print(next(gen)) # prints 1
print(next(gen)) # prints 2
print(next(gen)) # prints 3
print(next(gen)) # StopIteration
By the way, a better way to write your function is as follows:
def func(iterable):
    for item in iterable:
        yield item
Or in Python 3.3 and later:
def func(iterable):
    yield from iter(iterable)
Of course, real generators are rarely so trivial. :-)
Without the yield, you iterate over the entire iterable without stopping to do anything with val. The while loop does not catch the StopIteration exception. An equivalent for loop would be:
def func(iterable):
    for val in iterable:
        pass
which does catch the StopIteration, simply exits the loop, and thus returns from the function.
You can explicitly catch the exception:
def func(iterable):
    while True:
        try:
            val = next(iterable)
        except StopIteration:
            break
yield doesn't catch the StopIteration. What yield does for your function is it causes it to become a generator function rather than a regular function. Thus, the object returned from the function call is an iterable object (which calculates the next value when you ask it to with the next function (which gets called implicitly by a for loop)). If you leave the yield statement out of it, then python executes the entire while loop right away which ends up exhausting the iterable (if it is finite) and raising StopIteration right when you call it.
consider:
x = func(x for x in [])
next(x) # raises StopIteration (RuntimeError in Python 3.7+, see PEP 479)
A for loop catches the exception -- That's how it knows when to stop calling next on the iterable you gave it.
Tested on Python 3.8, chunk as lazy generator
import itertools
from typing import Iterable
def split_to_chunk(size: int, iterable: Iterable) -> Iterable[Iterable]:
    source_iter = iter(iterable)
    while True:
        batch_iter = itertools.islice(source_iter, size)
        try:
            # next() raises StopIteration once the source is exhausted
            yield itertools.chain([next(batch_iter)], batch_iter)
        except StopIteration:
            return
Why the StopIteration error has to be handled: https://www.python.org/dev/peps/pep-0479/
import pprint
import time
def sample_gen() -> Iterable[int]:
    i = 0
    while True:
        yield i
        i += 1
for chunk in split_to_chunk(7, sample_gen()):
    pprint.pprint(list(chunk))
    time.sleep(2)
Output:
[0, 1, 2, 3, 4, 5, 6]
[7, 8, 9, 10, 11, 12, 13]
[14, 15, 16, 17, 18, 19, 20]
[21, 22, 23, 24, 25, 26, 27]
............................
