Overwriting an iterator using itertools.tee - python

I would like to use itertools.tee inside a function, with the original iterator as an argument, but I am concerned that I may be reusing the old iterator when I exit the function, which one is not supposed to do when using tee.
If I call tee in the same block as the iterator, then it seems safe:
my_iter = create_some_iterator()
my_iter, my_lookahead = itertools.tee(my_iter)
because the original iterator pointed to by my_iter has (I assume) no more reference counts and my_iter now points to its duplicate, so there's no way to use the original iterator.
But is this still true if I pass it through a function?
def foo(some_iter):
    some_iter, some_lookahead = itertools.tee(some_iter)
    # Do some lookahead tasks

my_iter = create_some_iterator()
foo(my_iter)
next(my_iter)  # Which iter is this?
Does my_iter point to the copy of my_iter after leaving the function? Or does it still point to the original iterator, which I am not supposed to use?
I am concerned because most of the time this is not a problem, but there are occasions where I have been caught by this, particularly in less common implementations like PyPy.
This is what id tells me in the example above, which suggests that I cannot use iterators in this way, but I may also be misinterpreting what id means here:
import itertools

def foo(some_iter):
    print(' some_iter id:', id(some_iter))
    some_iter, some_lookahead = itertools.tee(some_iter)
    print(' new some_iter id:', id(some_iter))
    print(' some_lookahead id:', id(some_lookahead))
    # Do some lookahead tasks

my_iter = iter(range(10))
print('my_iter id:', id(my_iter))
foo(my_iter)
print('my_iter id after foo:', id(my_iter))
Output:
my_iter id: 139686651427120
some_iter id: 139686651427120
new some_iter id: 139686650411776
some_lookahead id: 139686650411712
my_iter id after foo: 139686651427120
my_iter still has its original id, not the one assigned to some_iter by tee.
UPDATE: Sorry, this was not the question I meant to ask. I more or less answer it myself in the second part.
I was more asking why it still seems to work as expected, with iterations of the copy reflected in the original, even though they have different IDs.
I was also half-trying to ask how to handle this problem, but this answer provides a solution.
I tried to scale back the question, but scaled it back too much.
I tried to close this question, but it won't let me anymore, so not sure how to handle this. Apologies to those who already answered.

Does my_iter point to the copy of my_iter after leaving the function? Or does it still point to the original iterator, which I am not supposed to use?
It still points to the original. Python is a "pass by value" language (though all its values are references, so it's a bit confusing sometimes). It is not a pass-by-reference language: assigning to a parameter is purely local to the function and invisible to the caller.
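A minimal sketch of that point (hypothetical names), showing that rebinding a parameter inside a function is invisible to the caller:

import itertools

def rebind(some_iter):
    # This assignment rebinds only the local name 'some_iter';
    # the caller's variable still refers to the original iterator.
    some_iter, lookahead = itertools.tee(some_iter)
    return lookahead

original = iter(range(3))
rebind(original)
# 'original' is unchanged by the call -- but after tee() it is
# no longer safe to advance it directly.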

In Python, passing something to a function never makes a copy.
def identity(x):
    return x

iter1 = iter(range(10))
iter2 = identity(iter1)
assert iter1 is iter2  # assert object identity
Python functions always work this way.
The behavior of itertools.tee is irrelevant. The fact that we are dealing with iterators specifically is irrelevant.
The fact that itertools.tee returns a copy is behavior specific to itertools.tee.
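A practical consequence, sketched on the question's own foo (the return value is an addition, not from the original): if the caller should end up with the safe tee'd iterator, return it and rebind at the call site:

import itertools

def foo(some_iter):
    some_iter, some_lookahead = itertools.tee(some_iter)
    # ... do some lookahead tasks with some_lookahead ...
    return some_iter  # hand the replacement back to the caller

my_iter = iter(range(10))
my_iter = foo(my_iter)  # rebind explicitly in the caller's scope
next(my_iter)           # safe: this is the tee'd copy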
If you have some free time, this talk can be very enlightening: https://www.youtube.com/watch?v=_AEJHKGk9ns

Related

Python + Iterator resulting from map sharing current state with the initial iterator

Let me illustrate this with an example my students and I came across:
>>> a_lot = (i for i in range(int(10e50)))
>>> twice_a_lot = map(lambda x: 2 * x, a_lot)
>>> next(a_lot)
0
>>> next(a_lot)
1
>>> next(a_lot)
2
>>> next(twice_a_lot)
6
So somehow these iterators share their current state, as crazy and uncomfortable as it sounds...
Any hints as to the model Python uses behind the scenes?
This may be surprising at first but upon a little reflection, it should seem obvious.
When you create an iterator from another iterator, there is no way to recover the original state of whatever underlying container you are iterating over (in this case, the range object). At least not in general.
Consider the simplest case of this: iter(something).
When something is an iterator, then according to the iterator protocol specification, iterator.__iter__ must:
Return the iterator object itself
In other words, if you've implemented the protocol correctly, then the following identity will always hold:
iter(iterator) is iterator
Of course, map could have some convention that would allow it to recover and create an independent iterator, but there is no such convention. In general, if you want to create independent iterators, you need to create them from the source.
And of course, there are iterators where this really is not possible without storing all previous results. Consider:
import random

def random_iterator():
    while True:
        yield random.random()
In which case, how should map function with the following?
iterator = random_iterator()
twice = map(lambda x: x*2, iterator)
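A small sketch pulling the answer's two points together: an iterator's __iter__ returns itself, so independence has to come from the source (or from itertools.tee when no re-iterable source exists):

import itertools

it = iter([1, 2, 3])
assert iter(it) is it          # the protocol: __iter__ returns self

src = [1, 2, 3]
i1, i2 = iter(src), iter(src)  # independent: each starts from the source
assert next(i1) == next(i2) == 1

# With no re-iterable source, tee caches values to provide independence.
a, b = itertools.tee(iter(src))
assert next(a) == next(b) == 1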
Ok, thanks to all the comments received (in less than 5 minutes!!!!) I understood two related things: if I want two independent iterators, I won't use map to compute the second from the first:
>>> a_lot = (i for i in range(int(10e50)))
>>> twice_a_lot = (2 * i for i in range(int(10e50)))
and I'll remember map is lazy, because there's no other way that could make sense.
That was a nice SO lesson.
THX

Should I ever return a list that was passed by reference and modified?

I have recently discovered that lists in Python are automatically passed by reference (unless the notation array[:] is used). For example, these two functions do the same thing:
def foo(z):
    z.append(3)

def bar(z):
    z.append(3)
    return z
x = [1, 2]
y = [1, 2]
foo(x)
bar(y)
print(x, y)
Before now, I always returned arrays that I manipulated, because I thought I had to. Now, I understand it's superfluous (and perhaps inefficient), but it seems like returning values is generally good practice for code readability. My question is, are there any issues for doing either of these methods/ what are the best practices? Is there a third option that I am missing? I'm sorry if this has been asked before but I couldn't find anything that really answers my question.
This answer works on the assumption that the decision as to whether to modify your input in-place or return a copy has already been made.
As you noted, whether or not to return a modified object is a matter of opinion, since the result is functionally equivalent. In general, it is considered good form to not return a list that is modified in-place. According to the Zen of Python (item #2):
Explicit is better than implicit.
This is borne out in the standard library. List methods are notorious for this on SO: list.append, list.insert, list.extend, list.sort, etc.
Numpy also uses this pattern frequently, since it often deals with large data sets that would be impractical to copy and return. A common example is the array method numpy.ndarray.sort, not to be confused with the top-level function numpy.sort, which returns a new copy.
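A quick demonstration of both conventions (the second half assumes numpy is installed):

import numpy as np

lst = [3, 1, 2]
print(lst.sort())          # None: the in-place method returns nothing
print(lst)                 # [1, 2, 3]

arr = np.array([3, 1, 2])
print(arr.sort())          # None: in-place, same convention
print(np.sort([3, 1, 2]))  # [1 2 3]: the top-level function returns a copy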
The idea is something that is very much a part of the Python way of thinking. Here is an excerpt from Guido's email that explains the whys and wherefors:
I find the chaining form a threat to readability; it requires that the reader must be intimately familiar with each of the methods. The second [unchained] form makes it clear that each of these calls acts on the same object, and so even if you don't know the class and its methods very well, you can understand that the second and third call are applied to x (and that all calls are made for their side-effects), and not to something else.
Python built-ins, as a rule, will not do both, to avoid confusion over whether the function/method modifies its argument in place or returns a new value. When modifying in place, no return is performed (making it implicitly return None). The exceptions are cases where a mutating function returns something other than the object mutated (e.g. dict.pop, dict.setdefault).
It's generally a good idea to follow the same pattern, to avoid confusion.
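The exceptions mentioned above, illustrated with standard dict behavior:

d = {'a': 1, 'b': 2}
print(d.pop('a'))            # 1: mutates the dict AND returns the popped value
print(d.setdefault('c', 3))  # 3: inserts the default AND returns it
print(d)                     # {'b': 2, 'c': 3}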
The "best practice" is technically to not modify the thing at all:
def baz(z):
    return z + [3]
x = [1, 2]
y = baz(x)
print(x, y)
but in general it's clearer if you restrict yourself to either returning a new object or modifying an object in-place, but not both at once.
There are examples in the standard library that both modify an object in-place and return something (the foremost example being list.pop()), but that's a special case because it's not returning the object that was modified.
There's no strict "should", of course. However, a function should either do something or return something. So you'd better either modify the list in place without returning anything, or return a new one, leaving the original unchanged.
Note: the list is not exactly passed by reference; it's the value of the reference that is actually passed. Keep that in mind if you re-assign the parameter inside the function.
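A sketch of that difference (hypothetical function names):

def mutate(z):
    z.append(3)   # visible to the caller: mutates the shared list object

def rebind(z):
    z = z + [3]   # invisible to the caller: rebinds only the local name

x = [1, 2]
mutate(x)
print(x)          # [1, 2, 3]
rebind(x)
print(x)          # still [1, 2, 3]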

Python itertools tee, clones and caching

Assumed: When using the python itertools.tee(), all duplicate iterators refer to the original iterator, and the original is cached to improve performance.
My main concern in the following inquiry is regarding MY IDEA of intended/proper caching behavior.
Edit: my idea of proper caching was based on flawed functional assumptions. Will ultimately need a little wrapper around tee (which will probably have consequences regarding caching).
Question:
Let's say I create 3 iterator clones using tee: a, b, c = itertools.tee(myiter, 3). Also assume that at this point, I remove all references to the original, myiter (meaning there is no great way for my code to refer back to the original hereafter).
At some later point in code, if I decided that I want another clone of myiter, can I just re-tee() one of my duplicates? (with proper caching back to the originally cached myiter)
In other words, at some later point, I wish I had instead used this:
a, b, c, d = itertools.tee(myiter,4).
But, since I have discarded all references to the original myiter, the best I can muster would be:
copytee, = itertools.tee(a, 1)  # 'a' is from a previous tee(); tee returns a tuple, hence the unpacking
Does tee() know what I want here? (that I REALLY want to create a clone based on the original myiter, NOT the intermediate clone a (which may be partially consumed))
There's nothing magical about tee. It's just clever ;-) At any point, tee clones the iterator passed to it. That means the cloned iterator(s) will yield the values produced by the passed-in iterator from this point on. But it's impossible for them to reproduce values that were produced before tee was invoked.
Let's show it with something much simpler than your example:
>>> it = iter(range(5))
>>> next(it)
0
0 is gone now - forever. tee() can't get it back:
>>> a, b = tee(it)
>>> next(a)
1
So a pushed it to produce its next value. It's that value that gets cached, so that other clones can reproduce it too:
>>> next(b)
1
To get that result, it wasn't touched: 1 was retrieved from the internal cache. And now that all of it, a, and b have produced 1, that 1 is gone forever too.
I don't know whether that answers your question - answering "Does tee() know what I want here?" seems to require telepathy ;-) That is, I don't know what you mean by "with proper caching". It would be most helpful if you gave an exact example of the input/output behavior you're hoping for.
Short of that, the Python docs give Python code that's equivalent to tee(), and perhaps studying that would answer your question:
import collections

def tee(iterable, n=2):
    it = iter(iterable)
    deques = [collections.deque() for i in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque:           # when the local deque is empty
                newval = next(it)     # fetch a new value and
                for d in deques:      # load it to all the deques
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)
You can see from that, for example, that nothing about an iterator's internal state is cached - all that's cached is the values produced by the passed-in iterator, starting from the time tee() is called. Each clone has its own deque (FIFO list) of the passed-in iterator's values produced so far, and that's all the clones know about the passed-in iterator. So it may be too simple for whatever you're really hoping for.
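For the "little wrapper around tee" the edit in the question mentions, here is one hypothetical sketch: keep a private, never-consumed clone and tee off it on demand. It has the caching cost the edit anticipates, since the private clone's deque grows as handed-out clones advance:

import itertools

class Teeable:
    """Hypothetical wrapper: hand out fresh clones of an iterator on demand."""

    def __init__(self, iterable):
        self._source = iter(iterable)  # private clone, never consumed directly

    def clone(self):
        # Replace the private clone with one tee half; give the caller the other.
        self._source, dup = itertools.tee(self._source)
        return dup

t = Teeable(range(5))
a = t.clone()
b = t.clone()
print(next(a), next(a))  # 0 1
print(next(b))           # 0: independent of 'a', replayed from the cache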

Why can't I reverse a list of lists in Python?

I wanted to do something like this, but this code returns a list of None (I think it's because list.reverse() reverses the list in place):
map(lambda row: row.reverse(), figure)
I tried this one, but reversed returns an iterator:
map(reversed, figure)
Finally I did something like this, which works for me, but I don't know if it's the right solution:
def reverse(row):
    """Reverse a list in place and return it."""
    row.reverse()
    return row

map(reverse, figure)
If someone has a better solution that I'm not aware of, please let me know.
Kind regards,
The mutator methods of Python's mutable containers (such as the .reverse method of lists) almost invariably return None -- a few return one useful value, e.g. the .pop method returns the popped element, but the key concept to retain is that none of those mutators returns the mutated container: rather, the container mutates in-place and the return value of the mutator method is not that container. (This is an application of the CQS principle of design -- not quite as fanatical as, say, in Eiffel, the language devised by Bertrand Meyer, who also invented CQS, but that's just because in Python "practicality beats purity", cf. import this;-).
Building a list is often costlier than just building an iterator, for the overwhelmingly common case where all you want to do is loop on the result; therefore, built-ins such as reversed (and all the wonderful building blocks in the itertools module) return iterators, not lists.
But what if you therefore have an iterator x but really truly need the equivalent list y? Piece of cake -- just do y = list(x). To make a new instance of type list, you call type list -- this is such a general Python idea that it's even more crucial to retain than the pretty-important stuff I pointed out in the first two paragraphs!-)
So, the code for your specific problem is really very easy to put together based on the crucial notions in the previous paragraphs:
[list(reversed(row)) for row in figure]
Note that I'm using a list comprehension, not map: as a rule of thumb, map should only be used as a last-ditch optimization when there is no need for a lambda to build it (if a lambda is involved then a listcomp, as well as being clearer as usual, also tends to be faster anyway!-).
Once you're a "past master of Python", if your profiling tells you that this code is a bottleneck, you can then know to try alternatives such as
[row[::-1] for row in figure]
applying a negative-step slicing (aka "Martian Smiley") to make reversed copies of the rows, knowing it's usually faster than the list(reversed(row)) approach. But -- unless your code is meant to be maintained only by yourself or somebody at least as skilled at Python -- it's a defensible position to use the simplest "code from first principles" approach except where profiling tells you to push down on the pedal. (Personally I think the "Martian Smiley" is important enough to avoid applying this good general philosophy to this specific use case, but, hey, reasonable people could differ on this very specific point!-).
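Both spellings produce the same new list, as a quick check confirms:

row = [1, 2, 3]
assert list(reversed(row)) == row[::-1] == [3, 2, 1]
assert row == [1, 2, 3]  # the original is untouched either way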
You can also use a slice to get the reversal of a single list (not in place):
>>> a = [1,2,3,4]
>>> a[::-1]
[4, 3, 2, 1]
So something like:
all_reversed = [lst[::-1] for lst in figure]
...or...
all_reversed = map(lambda x: x[::-1], figure)
...will do what you want.
reversed_lists = [list(reversed(x)) for x in figure]
map(lambda row: list(reversed(row)), figure)
You can also simply do
for row in figure:
    row.reverse()
to change each row in place.

What is the most pythonic way to have a generator expression executed?

More and more features of Python are becoming "lazily executable", like generator expressions and other kinds of iterators.
Sometimes, however, I see myself wanting to roll a one-liner "for" loop, just to perform some action.
What would be the most pythonic thing to get the loop actually executed?
For example:
a = open("numbers.txt", "w")
(a.write("%d " % i) for i in xrange(100))
a.close()
Not actual code, but you see what I mean. If I use a list comprehension instead, I have the side effect of creating an N-length list filled with None's.
Currently what I do is to use the expression as the argument in a call to "any" or to "all". But I would like to find a way that does not depend on the result of the expression performed in the loop - both "any" and "all" can stop early depending on the values they see.
To be clear, these are ways to do it that I already know about, and each one has its drawbacks:
[a.write("%d " % i) for i in xrange(100)]
any(a.write("%d " % i) for i in xrange(100))
for item in (a.write("%d " % i) for i in xrange(100)): pass
There is one obvious way to do it, and that is the way you should do it. There is no excuse for doing it a clever way.
a = open("numbers.txt", "w")
for i in xrange(100):
    a.write("%d " % i)
a.close()
Lazy execution gives you a serious benefit: It allows you to pass a sequence to another piece of code without having to hold the entire thing in memory. It is for the creation of efficient sequences as data types.
In this case, you do not want lazy execution. You want execution. You can just ... execute. With a for loop.
If I wanted to do this specific example, I'd write
for i in xrange(100): a.write('%d ' % i)
If I often needed to consume an iterator for its effect, I'd define
def for_effect(iterable):
    for _ in iterable:
        pass
There are many accumulators which have the effect of consuming the whole iterable they're given, such as min or max -- but even they don't entirely ignore the results yielded in the process (min and max, for example, will raise an exception if some of the results are complex numbers). I don't think there's a built-in accumulator that does exactly what you want -- you'll have to write (and add to your personal stash of tiny utility functions) a tiny utility function such as
def consume(iterable):
    for item in iterable:
        pass
The main reason, I guess, is that Python has a for statement and you're supposed to use it when it fits like a glove (i.e., for the cases you'd want consume for;-).
BTW, a.write returns None, which is falsish, so any will actually consume it (and a.writelines will do even better!). But I realize you were just giving that as an example;-).
It is 2019 - and this is a question from 2010 that keeps showing up. A recent thread in one of Python's mailing lists spawned over 70 e-mails on this subject, and they refused again to add a consume call to the language.
On that thread, the most efficient way to do it actually showed up, and it is far from obvious, so I am posting it as the answer here:
from collections import deque

consume = deque(maxlen=0).extend
And then use the consume callable to process generator expressions.
It turns out the deque native code in CPython is actually optimized for the maxlen=0 case, and will simply consume the iterable.
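Applied to the question's own example (a sketch, using Python 3's range in place of xrange):

from collections import deque

consume = deque(maxlen=0).extend

a = open("numbers.txt", "w")
consume(a.write("%d " % i) for i in range(100))  # runs the generator to exhaustion
a.close()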
The any and all calls I mentioned in the question should be equally as efficient, but one has to worry about the expression truthiness in order for the iterable to be consumed.
I see this still may be controversial; after all, an explicit two-line for loop can handle this. I remembered this question because I just made a commit where I create some threads, start them, and join them back - without a consume callable, that is 4 lines of mostly boilerplate, and without benefiting from cycling through the iterable in native code:
https://github.com/jsbueno/extracontext/blob/a5d24be882f9aa18eb19effe3c2cf20c42135ed8/tests/test_thread.py#L27
