Deep-copying a generator in Python

I'm using a generator function, say:
def foo():
    i = 0
    while i < 10:
        i += 1
        yield i
Now, I would like the option to copy the generator after any number of iterations, so that the new copy will retain the internal state (will have the same 'i' in the example) but will now be independent from the original (i.e. iterating over the copy should not change the original).
I've tried using copy.deepcopy but I get the error:
"TypeError: object.__new__(generator) is not safe, use generator.__new__()"
Obviously, I could solve this using regular functions with counters for example.
But I'm really looking for a solution using generators.

There are three cases I can think of:
Generator has no side effects, and you just want to be able to walk back through results you've already captured. You could consider a cached generator instead of a true generator. You can share the cached generator around as well, and if any client walks to an item you haven't been to yet, it will advance. This is similar to tee(), but does the tee functionality in the generator/cache itself instead of requiring the client to do it (rough sketches of cases 1 and 2 follow this list).
Generator has side effects, but no history, and you want to be able to restart anywhere. Consider writing it as a coroutine, where you can pass in the value to start at any time.
Generator has side effects AND history, meaning that the state of the generator at G(x) depends on the results of G(x-1), and so you can't just pass x back into it to start anywhere. In this case, I think you'd need to be more specific about what you are trying to do, as the result depends not just on the generator, but on the state of other data. Probably, in this case, there is a better way to do it.
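As a rough sketch of case 1, here is a minimal cached iterator built around the foo() generator from the question; the class and method names are illustrative, not part of any standard API:

class CachedIter:
    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = []                  # every value produced so far

    def get(self, index):
        # Advance the underlying iterator until the cache covers `index`.
        while len(self._cache) <= index:
            self._cache.append(next(self._it))
        return self._cache[index]

shared = CachedIter(foo())
print(shared.get(2))   # 3 -- advances the generator and caches 1, 2, 3
print(shared.get(0))   # 1 -- replayed from the cache, nothing recomputed

And for case 2, a hedged sketch of the coroutine idea, where the caller can send a new starting point back into the generator at any time:

def foo_coro():
    i = 0
    while i < 10:
        i += 1
        jump = yield i        # the caller may send a new position back in
        if jump is not None:
            i = jump

c = foo_coro()
print(next(c))        # 1
print(c.send(5))      # 6 -- resumed from i = 5
print(next(c))        # 7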

The itertools.tee suggestion in the comments was my first guess as well. Because of the warning that you shouldn't advance the original generator any longer after using tee, I might write something like this to spin off a copy:
>>> from itertools import tee
>>>
>>> def foo():
...     i = 0
...     while i < 10:
...         i += 1
...         yield i
...
>>>
>>> it = foo()
>>> next(it)
1
>>> it, other = tee(it)
>>> next(it)
2
>>> next(other)
2
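If you need to do this more than once, the reassignment pattern can be wrapped in a tiny helper; copy_gen is my own name for it, not anything from itertools, and foo() is the generator from the example above:

from itertools import tee

def copy_gen(g):
    # Returns (replacement_for_original, independent_copy). Per the tee()
    # docs, stop using the iterator you passed in and use the first
    # element of the result instead.
    return tee(g, 2)

it = foo()
next(it)              # 1
it, other = copy_gen(it)
print(next(it))       # 2
print(next(other))    # 2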

Related

Mixing functions and generators

So I was writing a function where a lot of processing happens in the body of a loop, and occasionally it may be of interest to the caller to have the answer to some of the computations.
Normally I would just put the results in a list and return the list, but in this case the results are too large (a few hundred MB on each loop).
I wrote this without really thinking about it, expecting Python's dynamic typing to figure things out, but the following is always created as a generator.
def mixed(is_generator=False):
    for i in range(5):
        # process some stuff including writing to a file
        if is_generator:
            yield i
    return
From this I have two questions:
1) Does the presence of the yield keyword in a scope immediately turn the object it's in into a generator?
2) Is there a sensible way to obtain the behaviour I intended?
2.1) If no, what is the reasoning behind it not being possible? (In terms of how functions and generators work in Python.)
Let's go step by step:
1) Does the presence of the yield keyword in a scope immediately turn the object it's in into a generator? Yes
2) Is there a sensible way to obtain the behaviour I intended? Yes, see example below
The trick is to wrap the computation and then either return the generator itself or a list built from it:
def mixed(is_generator=False):
    # create a generator object
    gen = (compute_stuff(i) for i in range(5))
    # if we want just the generator
    if is_generator:
        return gen
    # if not, we consume it with a list and return that list
    return list(gen)
Anyway, I would say this is bad practice. You should keep the two concerns separate: usually just write the generator function and put the logic outside:
def computation():
    for i in range(5):
        # process some stuff including writing to a file
        yield i

gen = computation()
if lazy:
    for data in gen:
        use_data(data)
else:
    data = list(gen)
    use_all_data(data)

python functions as optional generators

Let's say I have a function and I want to have the option to return results or not. This would be easy to code:
def foo(N, is_return=False):
    l = []
    for i in range(N):
        print(i)
        if is_return:
            l.append(i)
    if is_return:
        return l
But now let's say I want the function to be a generator. I would write something like this:
def foo_gen(N, is_return=False):
    for i in range(N):
        print(i)
        if is_return:
            yield i
So presumably, when is_return is False, foo_gen is just a function with no return value, and when is_return is True, foo_gen is a generator, for which I would like there to be two different invocations:
In [1]: list(foo_gen(3, is_return=True))
0
1
2
Out[1]: [0, 1, 2]
for when it is a generator and you have to iterate through the yielded values, and:
In [2]: foo_gen(3)
0
1
2
for when it is not a generator and it just has its side effect, so you don't have to iterate through it. However, this latter behavior doesn't work; the call just returns a generator object that does nothing until iterated. The best you can do is iterate over it and receive nothing:
In [3]: list(foo_gen(3, is_return=False))
0
1
2
Out[3]: []
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
Is there any way to get the behavior of In [2] from a function?
To do that, you would need to wrap foo_gen in another function which either returns the generator or iterates over it itself, like this:
def maybe_gen(N, is_return=False):
    real_gen = foo_gen(N)
    if is_return:
        for item in real_gen:
            pass
    else:
        return real_gen

def foo_gen(N):
    for i in range(N):
        print(i)
        yield i
>>> list(maybe_gen(3))
0
1
2
[0, 1, 2]
>>> maybe_gen(3, is_return=True)
0
1
2
>>>
The reason is that the occurrence of yield anywhere in the function body makes it a generator function. There's no way to have a function that decides at call time whether it's a generator function or not. Instead, you have to have a non-generator function that decides at runtime whether to return a generator or something else.
That said, doing this is most likely not a good idea. You can see that what maybe_gen does when is_return is True is completely trivial. It just iterates over the generator without doing anything. This is especially silly since in this case the generator itself doesn't do anything except print.
It is better to have the function's API be consistent: either always return a generator, or never do. A better idea would be to just have two functions: foo_gen, which is the generator, and print_gen or something, which unconditionally prints it. If you want the generator, you call foo_gen. If you just want to print it, you call print_gen instead, rather than passing a "flag" argument to foo_gen.
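A minimal sketch of that two-function API; print_gen is a hypothetical name, and foo_gen is the generator already shown above:

def print_gen(N):
    # Consume foo_gen purely for its printing side effect.
    for _ in foo_gen(N):
        pass

values = list(foo_gen(3))   # caller explicitly wants the yielded values
print_gen(3)                # caller only wants the printing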
With regard to your comment at the end:
But this isn't as nice and is confusing for users of an API who aren't expecting to have to iterate through anything to make the side-effects occur.
If the API specifies that the function returns a generator, users should expect to have to iterate over it. If the API says it doesn't return a generator, users shouldn't expect to have to iterate over it. The API should just say one or the other, which will make it clear to users what to expect. What is far more confusing is to have an awkward API that tells users they have to pass a flag to determine whether they get a generator or not, because this complicates the expectations of the user.
So presumably when is_return is False then foo_gen is just a function with no return value and when is_return is True foo_gen is a generator
Your assumption is wrong: is_return does not determine whether your function is a generator or not. The mere presence of a yield expression determines that; whether the expression is reachable on a given call doesn't matter.
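A quick demonstration of that point, using a throwaway function in which the yield is unreachable by default:

def looks_like_a_function(flag=False):
    if flag:
        yield 1

print(looks_like_a_function())        # <generator object ...>, not None
print(list(looks_like_a_function()))  # [] -- the body only runs when iterated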
So you probably want to stick to the first approach of returning a list which in my opinion is less confusing and easier to maintain.

What is more efficient? Using map or a for loop in Python? [duplicate]

I have a "best practices" question here. I am using map in a way that it may not be intended to be used - using the elements of a list to change the state of a different object. the final list output is not actually changed. Is this appropriate?
For example:
class ToBeChanged(object):
    def __init__(self):
        self.foo_lst = [1, 2, 3, 4]

    def mapfunc(self, arg):
        if arg in ['foo', 'bar']:
            self.foo_lst.append(arg)
        else:
            pass

test = ToBeChanged()
list_to_map = [1, 2, 37, 'foo']
map(lambda x: test.mapfunc(x), list_to_map)
It is not appropriate. In Python 2, you'll be creating a new list of the same length as list_to_map and immediately discarding it; waste! And the lambda makes it even more complicated.
Better to use a for loop:
for x in list_to_map:
    test.mapfunc(x)
More concise and readable.
And if you're still thinking of using this in Python 3 (by forcing the lazy object to be evaluated in some way), consider those who will maintain your code; map gives the impression you want to create a new iterable from the list.
map is the worst option here: if you tried to run the code in Python 3, it wouldn't even perform the calls, since in Python 3 map is lazy.
In any case, both a map call and a list comprehension are expressions, and expressions should be as side-effect-free as possible; their purpose is to return a value.
So if you don't have a value to return, you should just use a plain statement, i.e. an explicit for loop.
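As a concrete illustration of that point, reusing the ToBeChanged class and list_to_map from the question (a sketch, not from either answer):

test = ToBeChanged()
m = map(test.mapfunc, list_to_map)   # Python 3: map is lazy, nothing has run yet
print(test.foo_lst)                  # [1, 2, 3, 4]

for x in list_to_map:                # the explicit for loop runs eagerly
    test.mapfunc(x)
print(test.foo_lst)                  # [1, 2, 3, 4, 'foo']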

Python itertools tee, clones and caching

Assumed: When using the python itertools.tee(), all duplicate iterators refer to the original iterator, and the original is cached to improve performance.
My main concern in the following inquiry is regarding MY IDEA of intended/proper caching behavior.
Edit: my idea of proper caching was based on flawed functional assumptions. Will ultimately need a little wrapper around tee (which will probably have consequences regarding caching).
Question:
Let's say I create 3 iterator clones using tee: a, b, c = itertools.tee(myiter, 3). Also assume that at this point I remove all references to the original, myiter (meaning, there is no great way for my code to refer back to the original hereafter).
At some later point in code, if I decided that I want another clone of myiter, can I just re-tee() one of my duplicates? (with proper caching back to the originally cached myiter)
In other words, at some later point, I wish I had instead used this:
a, b, c, d = itertools.tee(myiter,4).
But, since I have discarded all references to the original myiter, the best I can muster would be:
copytee = itertools.tee(a, 1)  # where 'a' is from a previous tee()
Does tee() know what I want here? (that I REALLY want to create a clone based on the original myiter, NOT the intermediate clone a (which may be partially consumed))
There's nothing magical about tee. It's just clever ;-) At any point, tee clones the iterator passed to it. That means the cloned iterator(s) will yield the values produced by the passed-in iterator from this point on. But it's impossible for them to reproduce values that were produced before tee was invoked.
Let's show it with something much simpler than your example:
>>> from itertools import tee
>>> it = iter(range(5))
>>> next(it)
0
0 is gone now - forever. tee() can't get it back:
>>> a, b = tee(it)
>>> next(a)
1
So advancing a pushed the underlying iterator it to produce its next value. It's that value that gets cached, so that the other clones can reproduce it too:
>>> next(b)
1
To get that result, the underlying iterator it wasn't touched; 1 was retrieved from the internal cache. And now that it, a, and b have all produced 1, 1 is gone forever too.
I don't know whether that answers your question - answering "Does tee() know what I want here?" seems to require telepathy ;-) That is, I don't know what you mean by "with proper caching". It would be most helpful if you gave an exact example of the input/output behavior you're hoping for.
Short of that, the Python docs give Python code that's equivalent to tee(), and perhaps studying that would answer your question:
def tee(iterable, n=2):
    it = iter(iterable)
    deques = [collections.deque() for i in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque:           # when the local deque is empty
                newval = next(it)     # fetch a new value and
                for d in deques:      # load it to all the deques
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)
You can see from that, for example, that nothing about an iterator's internal state is cached - all that's cached is the values produced by the passed-in iterator, starting from the time tee() is called. Each clone has its own deque (FIFO list) of the passed-in iterator's values produced so far, and that's all the clones know about the passed-in iterator. So it may be too simple for whatever you're really hoping for.
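To make that concrete for the question's scenario, here is a small sketch (mine, not from the answer) showing that tee()-ing a partially consumed clone only shares values from that clone's current position onward:

from itertools import tee

myiter = iter(range(5))
a, b, c = tee(myiter, 3)

next(a)                  # a has produced 0; b and c can still replay it
copytee, = tee(a, 1)     # a new clone based on a's *current* position
print(next(copytee))     # 1 -- the 0 that a already produced is not replayed
print(next(b))           # 0 -- b still gets 0 from tee's internal cache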

Capture-and-yield in a list comprehension

I'm writing a generator function. I want to know if there's a better (read: more pythonic, ideally with a list comprehension) way to implement something like this:
generator = gen()
directories = []
for _ in xrange(x):
    foo = next(generator)
    directories.append(foo['name'])
    yield foo
The key here is that I don't want to capture the WHOLE yield; the dictionary returned by gen() is large, which is why I'm using a generator. I do need to capture all of the 'name' values, though. I feel like there's a way to do this with a list comprehension, but I'm just not seeing it. Thoughts?
There is another / shorter way to do this, but I wouldn't call it more Pythonic:
generator = gen()
directories = []
generator_wrapper = (directories.append(foo['name']) or foo
                     for foo in generator)
This takes advantage of the fact that append, like most in-place mutation methods in Python, returns None, so .append(...) or foo will always evaluate to foo.
That way the whole dictionary is still the result of the generator expression, and you still get lazy evaluation, but the name still gets saved to the directories list.
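Assuming gen() yields dictionaries with a 'name' key, as in the question, a quick sanity check (a sketch, not from the original answer) that the wrapper stays lazy while recording names as a side effect:

print(directories)                 # [] -- nothing has been consumed yet
first = next(generator_wrapper)    # pull one item through the wrapper
print(directories)                 # [first['name']] -- filled in as items are pulled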
You could also use this method in an explicit for loop:
for foo in generator:
    yield directories.append(foo['name']) or foo
or even just simplify your loop a bit:
for foo in generator:
    directories.append(foo['name'])
    yield foo
as there is no reason to use an xrange just to iterate over the generator (unless you actually want to only iterate some known number of steps in).
You want the first x many elements of the generator? Use itertools.islice:
directories = [item['name'] for item in itertools.islice(gen(), x)]
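If you also still need to yield each item (as in the original loop) while recording the names, a hedged variation that combines islice with a plain loop might look like this; wrapper is a hypothetical name, and gen() is assumed to behave as in the question:

import itertools

def wrapper(x, directories):
    # Yield the first x items from gen(), recording each item's name.
    for foo in itertools.islice(gen(), x):
        directories.append(foo['name'])
        yield foo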
