Assumed: when using Python's itertools.tee(), all duplicate iterators refer to the original iterator, and the original's output is cached to improve performance.
My main concern in the following inquiry is regarding MY IDEA of intended/proper caching behavior.
Edit: my idea of proper caching was based on flawed functional assumptions. Will ultimately need a little wrapper around tee (which will probably have consequences regarding caching).
Question:
Let's say I create 3 iterator clones using tee: a, b, c = itertools.tee(myiter, 3). Also assume that at this point, I remove all references to the original, myiter (meaning there is no good way for my code to refer back to the original hereafter).
At some later point in code, if I decided that I want another clone of myiter, can I just re-tee() one of my duplicates? (with proper caching back to the originally cached myiter)
In other words, at some later point, I wish I had instead used this:
a, b, c, d = itertools.tee(myiter, 4)
But, since I have discarded all references to the original myiter, the best I can muster would be:
copytee, = itertools.tee(a, 1)  # where 'a' is from a previous tee()
Does tee() know what I want here? (that I REALLY want to create a clone based on the original myiter, NOT the intermediate clone a (which may be partially consumed))
There's nothing magical about tee. It's just clever ;-) At any point, tee clones the iterator passed to it. That means the cloned iterator(s) will yield the values produced by the passed-in iterator from this point on. But it's impossible for them to reproduce values that were produced before tee was invoked.
Let's show it with something much simpler than your example:
>>> from itertools import tee
>>> it = iter(range(5))
>>> next(it)
0
0 is gone now - forever. tee() can't get it back:
>>> a, b = tee(it)
>>> next(a)
1
So a pushed it (the underlying iterator) to produce its next value. It's that value that gets cached, so that other clones can reproduce it too:
>>> next(b)
1
To get that result, it wasn't touched - 1 was retrieved from the internal cache. And now that it, a, and b have all produced 1, 1 is gone forever too.
I don't know whether that answers your question - answering "Does tee() know what I want here?" seems to require telepathy ;-) That is, I don't know what you mean by "with proper caching". It would be most helpful if you gave an exact example of the input/output behavior you're hoping for.
Short of that, the Python docs give Python code that's equivalent to tee(), and perhaps studying that would answer your question:
import collections

def tee(iterable, n=2):
    it = iter(iterable)
    deques = [collections.deque() for i in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque:             # when the local deque is empty
                newval = next(it)       # fetch a new value and
                for d in deques:        # load it to all the deques
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)
You can see from that, for example, that nothing about an iterator's internal state is cached - all that's cached is the values produced by the passed-in iterator, starting from the time tee() is called. Each clone has its own deque (FIFO list) of the passed-in iterator's values produced so far, and that's all the clones know about the passed-in iterator. So it may be too simple for whatever you're really hoping for.
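As for the wrapper mentioned in the question's edit: since tee() only shares values produced after it is called, a wrapper has to start caching the moment it first sees the iterator. Here's a minimal sketch of one way to do that (the class and method names are made up for illustration):

class ReTee:
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = []

    def clone(self):
        # Each clone replays the shared cache from the very beginning,
        # extending it from the source iterator on demand.
        i = 0
        while True:
            if i == len(self._cache):
                try:
                    self._cache.append(next(self._it))
                except StopIteration:
                    return
            yield self._cache[i]
            i += 1

src = ReTee(range(5))
a = src.clone()
print(next(a), next(a))  # 0 1
b = src.clone()          # created later, still starts at 0
print(next(b))           # 0

Unlike tee(), this never discards values, so memory grows with the length of the whole stream - that's the price of being able to re-clone from the start.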
Related
Let me illustrate this with an example we came across with my students:
>>> a_lot = (i for i in range(10**50))
>>> twice_a_lot = map(lambda x: 2 * x, a_lot)
>>> next(a_lot)
0
>>> next(a_lot)
1
>>> next(a_lot)
2
>>> next(twice_a_lot)
6
So somehow these iterators share their current state, as crazy and uncomfortable as it sounds...
Any hints as to the model Python uses behind the scenes?
This may be surprising at first but upon a little reflection, it should seem obvious.
When you create an iterator from another iterator, there is no way to recover the original state over whatever underlying container you are iterating over (in this case, the range object). At least not in general.
Consider the simplest case of this: iter(something).
When something is an iterator, then according to the iterator protocol specification, iterator.__iter__ must:
Return the iterator object itself
In other words, if you've implemented the protocol correctly, then the following identity will always hold:
iter(iterator) is iterator
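A quick check of both halves of that, using a plain list as the source:

>>> lst = [1, 2, 3]
>>> it = iter(lst)
>>> iter(it) is it      # an iterator returns itself
True
>>> iter(lst) is lst    # a container returns a fresh, independent iterator
False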
Of course, map could have some convention that would allow it to recover and create an independent iterator, but there is no such convention. In general, if you want to create independent iterators, you need to create them from the source.
And of course, there are iterators where this really is not possible without storing all previous results. Consider:
import random

def random_iterator():
    while True:
        yield random.random()
In which case, how should map function with the following?
iterator = random_iterator()
twice = map(lambda x: x*2, iterator)
OK, thanks to all the comments received (in less than 5 minutes!) I understood two related things: if I want two independent iterators, I won't use map to compute the second from the first:
>>> a_lot = (i for i in range(10**50))
>>> twice_a_lot = (2 * i for i in range(10**50))
And I'll remember that map is lazy, because there's no other way that could make sense.
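For completeness: when the source can't simply be rebuilt like this, itertools.tee is the standard-library way to split one iterator into independent ones (as long as you stop using the original afterwards):

>>> from itertools import tee
>>> a_lot, for_twice = tee(i for i in range(10**50))
>>> twice_a_lot = map(lambda x: 2 * x, for_twice)
>>> next(a_lot), next(a_lot), next(a_lot)
(0, 1, 2)
>>> next(twice_a_lot)   # independent of a_lot's position
0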
That was a nice SO lesson. Thanks!
Is there a way to check if a generator is in use anywhere globally? Such that an active generator will bail when no one is using it.
This is mostly academic but I can think of numerous situations where it would be good to detect this. So you understand, here is an example:
def accord():
    _accord = None
    _inuse = lambda: someutilmodule.scopes_using(_accord) > 1
    def gen():
        uid = 0
        while _inuse():
            uid += 1
            yield uid
        else:
            print("I'm done, although you obviously forgot about me.")
    _accord = gen()
    return _accord
a = accord()
a.__next__()
a.__next__()
a.__next__()
a = None
"""
<<< 1
<<< 2
<<< 3
<<< I'm done, although you obviously forgot about me.
"""
The triple quote is the text I would expect to see if someutilmodule.scopes_using reported the number of uses of the variable. By uses I mean how many copies or references exist.
Note that the generator has an infinite loop, which is generally bad practice, but in cases like a unique-id generator that isn't widely or complexly used, it is often useful and won't create huge overhead. Obviously another way would simply be to expose a function or method that sets the flag the loop uses as its condition. But again, it's good to know the various ways to do things.
In this case, when you do
a = accord()
A reference counter behind the scenes keeps track of the fact that a variable is referencing that generator object. This keeps it in memory because there's a chance it may be needed in the future.
Once you do this however:
a = None
The reference to the generator is lost, and the reference counter associated with it is decremented. Once it reaches 0 (which it would, because you only had one reference to it), the system knows that nothing can ever refer to that object again, which frees the data associated with that object up for garbage collection.
This is all handled behind the scenes. There's no need for you to intervene.
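That said, if you want to observe this machinery yourself, CPython exposes some of it. Here's a rough, CPython-specific approximation of the hypothetical someutilmodule.scopes_using from the question, plus a portable way to be notified when the generator is collected:

import sys
import weakref

def gen():
    uid = 0
    while True:
        uid += 1
        yield uid

a = gen()
# getrefcount counts its own argument too, so 2 here means `a` is
# the only real reference (a CPython implementation detail).
print(sys.getrefcount(a))               # typically 2

# weakref.finalize fires once nothing refers to the generator anymore.
weakref.finalize(a, print, "collected")
a = None                                # prints: collected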
The best way to see what's going on, for better or worse, is to examine the relevant source code for CPython. Ultimately, _Py_DECREF is called when references are lost. You can see a little further down, after interpreting some convoluted logic, that once the reference count is 0, _Py_Dealloc(op); is called on PyObject *op. I can't for the life of me find the actual call to free that I'm sure ultimately results from _Py_Dealloc, though. It seems to be somewhere in the Py_TRASHCAN_END macro, but good lord. That's one of the longest rabbit holes I've ever gone down where I have nothing to show for it.
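Incidentally, if the goal is just to run cleanup when everyone has forgotten about the generator, you don't need reference inspection at all: when a generator is finalized, Python calls its close(), which raises GeneratorExit at the suspended yield. Catching that reproduces the question's expected output; a sketch:

def accord():
    uid = 0
    try:
        while True:
            uid += 1
            yield uid
    except GeneratorExit:
        print("I'm done, although you obviously forgot about me.")

a = accord()
print(next(a))  # 1
print(next(a))  # 2
print(next(a))  # 3
a = None        # generator is collected, close() runs, message prints
                # (immediately in CPython; other implementations may delay this)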
I would like to use itertools.tee inside of a function, with the original iterator as an argument, but I am concerned that I may be reusing the old iterator when I exit the function, which one is not supposed to do when using tee.
If I call tee in the same block as the iterator, then it seems safe:
my_iter = create_some_iterator()
my_iter, my_lookahead = itertools.tee(my_iter)
because the original iterator pointed to by my_iter has (I assume) no remaining references, and my_iter now points to its duplicate, so there's no way to use the original iterator.
But is this still true if I pass it through a function?
def foo(some_iter):
    some_iter, some_lookahead = itertools.tee(some_iter)
    # Do some lookahead tasks
my_iter = create_some_iterator()
foo(my_iter)
next(my_iter) # Which iter is this?
Does my_iter point to the copy of my_iter after leaving the function? Or does it still point to the original iterator, which I am not supposed to use?
I am concerned because most of the time this is not a problem, but there are occasions where I have been caught by this, particularly in less common implementations like PyPy.
This is what id tells me in the example above, which suggests that I cannot use iterators in this way, but I may also be misinterpreting what id means here:
import itertools

def foo(some_iter):
    print(' some_iter id:', id(some_iter))
    some_iter, some_lookahead = itertools.tee(some_iter)
    print(' new some_iter id:', id(some_iter))
    print(' some_lookahead id:', id(some_lookahead))
    # Do some lookahead tasks

my_iter = iter(range(10))
print('my_iter id:', id(my_iter))
foo(my_iter)
print('my_iter id after foo:', id(my_iter))
Output:
my_iter id: 139686651427120
some_iter id: 139686651427120
new some_iter id: 139686650411776
some_lookahead id: 139686650411712
my_iter id after foo: 139686651427120
my_iter still has its original id, not the one assigned to some_iter by tee.
UPDATE: Sorry, this was not the question I meant to ask. I more or less answer it myself in the second part.
I was more asking why it still seems to work as expected, with iterations in the copy reflected in the original, even though they have different IDs.
Also was half-trying to ask how to handle this problem but this answer provides a solution.
I tried to scale back the question, but scaled it back too much.
I tried to close this question, but it won't let me anymore, so not sure how to handle this. Apologies to those who already answered.
Does my_iter point to the copy of my_iter after leaving the function? Or does it still point to the original iterator, which I am not supposed to use?
It still points to the original. Python is a "pass by value" language (though all its values are references, so it's a bit confusing sometimes). It is not a pass-by-reference language: assigning to a parameter is purely local to the function and invisible to the caller.
In Python, passing something to a function never makes a copy.
def identity(x):
    return x

iter1 = iter(range(10))
iter2 = identity(iter1)
assert iter1 is iter2  # assert "object identity"
Python functions always work this way.
The behavior of itertools.tee is irrelevant. The fact that we are dealing with iterators specifically is irrelevant.
The fact that itertools.tee returns a copy is behavior specific to itertools.tee.
If you have some free time, this talk can be very enlightening: https://www.youtube.com/watch?v=_AEJHKGk9ns
I'm using a generator function, say:
def foo():
    i = 0
    while i < 10:
        i += 1
        yield i
Now, I would like the option to copy the generator after any number of iterations, so that the new copy will retain the internal state (will have the same 'i' in the example) but will now be independent from the original (i.e. iterating over the copy should not change the original).
I've tried using copy.deepcopy but I get the error:
"TypeError: object.__new__(generator) is not safe, use generator.__new__()"
Obviously, I could solve this using regular functions with counters for example.
But I'm really looking for a solution using generators.
There are three cases I can think of:
Generator has no side effects, and you just want to be able to walk back through results you've already captured. You could consider a cached generator instead of a true generator. You can share the cached generator around as well, and if any client walks to an item you haven't been to yet, it will advance. This is similar to the tee() approach, but does the tee functionality in the generator/cache itself instead of requiring the client to do it (see the sketch after this list).
Generator has side effects, but no history, and you want to be able to restart anywhere. Consider writing it as a coroutine, where you can pass in the value to start at any time.
Generator has side effects AND history, meaning that the state of the generator at G(x) depends on the results of G(x-1), and so you can't just pass x back into it to start anywhere. In this case, I think you'd need to be more specific about what you are trying to do, as the result depends not just on the generator, but on the state of other data. Probably, in this case, there is a better way to do it.
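For case 1, here's a minimal sketch of such a cached generator (all names are hypothetical): values are stored as they're produced, and each view replays the shared cache from the start:

def cached(source):
    cache = []
    def view():
        i = 0
        while True:
            if i == len(cache):
                try:
                    cache.append(next(source))
                except StopIteration:
                    return
            yield cache[i]
            i += 1
    return view

make_view = cached(iter(range(3)))
a, b = make_view(), make_view()
print(next(a), next(a))  # 0 1
print(next(b))           # 0  (replayed from the cache)
print(list(a), list(b))  # [2] [1, 2]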
The comment suggesting itertools.tee was my first guess as well. Because of the warning that you shouldn't advance the original generator any longer after using tee, I might write something like this to spin off a copy:
>>> from itertools import tee
>>>
>>> def foo():
...     i = 0
...     while i < 10:
...         i += 1
...         yield i
...
>>>
>>> it = foo()
>>> next(it)
1
>>> it, other = tee(it)
>>> next(it)
2
>>> next(other)
2
In a Python program I'm writing, I've built up a linked list using a dictionary which maps each node to its successor (with the last node mapped to None).
(Actually, the dictionary holds what Wikipedia tells me is called a spaghetti stack, which is a tree where each node is linked to its parent but not its children. This means there are many partially-overlapping paths from the leaf nodes to the root node. I only care about one of those paths, starting from a specific leaf node. This doesn't really matter for the question, other than to rule out any solution that involves iterating over all the elements in the dictionary.)
I need to pass this list to another function as an iterable. I know I can do this using a generator function (see code below), but it seems like there ought to be a built-in function to make the iterator I need in one line (or perhaps a generator expression). I've done a bit of searching through the documentation, but nothing in the itertools or functools modules seems applicable, and I'm not sure where else to look.
Here's the generator function I have now. The outer function can be eliminated (inlined) but the inner generator seems to be the only simple way to make an iterable for the data:
def makeListGenerator(nextDict, start):
    def gen(node):
        while node:
            yield node
            node = nextDict[node]
    return gen(start)
It seems like there should be a pattern for this sort of generator, but I'm not sure what it would be called. Here's a generic version:
def makeGenericGenerator(nextFunc, continueFunc, start):
    def gen(value):
        while continueFunc(value):
            yield value
            value = nextFunc(value)
    return gen(start)
I could implement the specific version on top of this one with the call:
makeGenericGenerator(lambda v: nextDict[v], bool, start)
Does something like that already exist in the Python standard library?
The essential problem you face is that every time another value is taken from your iterable, your iterable has to remember that value, so that it knows how to generate the next value. In other words, your iterable needs to maintain its own state.
That means that, whether or not there's a good answer to your question, using a generator is probably the right solution -- because that's exactly what generators were made for! The whole point of generators is that they save their state between calls to their next method, and that's exactly what you need.
Generator expressions, on the other hand, are better for stateless transformations. A lot of people try to shoehorn state into them, and it's generally kind of ugly. I actually spent a little while trying to do it for your case, and couldn't get a generator expression to work. I finally found something that sort of works. This uses the little-known callable_iterator version of iter:
>>> d = {1:2, 2:3, 3:4, 4:5, 5:None}
>>> list(iter(lambda st=[1]: st.__setitem__(0, d[st[0]]) or st[0], None))
[2, 3, 4, 5]
I present this not as an answer, but as a demonstration of why this doesn't work very well.
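For what it's worth, you can get close to a one-liner by pairing itertools.takewhile (which plays the role of continueFunc) with a tiny iterate helper. The helper isn't in the standard library, though the third-party more-itertools package provides one; a sketch:

from itertools import takewhile

def iterate(func, value):
    # yield value, func(value), func(func(value)), ...
    while True:
        yield value
        value = func(value)

d = {1: 2, 2: 3, 3: 4, 4: 5, 5: None}
print(list(takewhile(bool, iterate(d.get, 1))))  # [1, 2, 3, 4, 5]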
nodes = {
    'A': 'B',
    'B': 'C',
    'C': 'D',
    'D': None,
}

print((lambda f: lambda g: f(f, g))(lambda f, x: [x] + f(f, nodes[x]) if x else [])('A'))
# ['A', 'B', 'C', 'D']
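For readers puzzling over the lambdas: the outer one is just a device for anonymous recursion. The same traversal with a named function:

def walk(x):
    # ([x] + walk(nodes[x])) if x else [] -- recursion stops at None
    return [x] + walk(nodes[x]) if x else []

print(walk('A'))  # ['A', 'B', 'C', 'D']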