Shared python generator - python

I am trying to reproduce the reactive extensions "shared" observable concept with Python generators.
Say I have an API that gives me an infinite stream that I can use like this:
def my_generator():
    for elem in the_infinite_stream():
        yield elem
I could use this generator multiple times like so:
stream1 = my_generator()
stream2 = my_generator()
And the_infinite_stream() will be called twice (once for each generator).
Now say that the_infinite_stream() is an expensive operation. Is there a way to "share" the generator between multiple clients? It seems like tee would do that, but I have to know in advance how many independent generators I want.
The idea is that in other languages (Java, Swift) using the reactive extensions (RxJava, RxSwift) "shared" streams, I can conveniently duplicate the stream on the client side. I am wondering how to do that in Python.
Note: I am using asyncio

I took the tee implementation and modified it so that you can create any number of generators from infinite_stream:
import collections

def generators_factory(iterable):
    it = iter(iterable)
    deques = []
    already_gone = []

    def new_generator():
        new_deque = collections.deque()
        new_deque.extend(already_gone)
        deques.append(new_deque)

        def gen(mydeque):
            while True:
                if not mydeque:             # when the local deque is empty
                    newval = next(it)       # fetch a new value and
                    already_gone.append(newval)
                    for d in deques:        # load it to all the deques
                        d.append(newval)
                yield mydeque.popleft()

        return gen(new_deque)

    return new_generator
# test it:
infinite_stream = [1, 2, 3, 4, 5]
factory = generators_factory(infinite_stream)
gen1 = factory()
gen2 = factory()
print(next(gen1)) # 1
print(next(gen2)) # 1 even after it was produced by gen1
print(list(gen1)) # [2, 3, 4, 5] # the rest after 1
To cache only a limited number of values, change already_gone = [] into already_gone = collections.deque(maxlen=size) and add a size=None parameter to generators_factory, as sketched below.
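For example, a minimal sketch of that variant (only the signature and the already_gone line change; generators created later will only replay the cached tail):

import collections

def generators_factory(iterable, size=None):
    it = iter(iterable)
    deques = []
    # Keep only the last `size` values for late subscribers (size=None keeps everything).
    already_gone = collections.deque(maxlen=size)

    def new_generator():
        new_deque = collections.deque()
        new_deque.extend(already_gone)
        deques.append(new_deque)

        def gen(mydeque):
            while True:
                if not mydeque:             # when the local deque is empty
                    newval = next(it)       # fetch a new value and
                    already_gone.append(newval)
                    for d in deques:        # load it to all the deques
                        d.append(newval)
                yield mydeque.popleft()

        return gen(new_deque)

    return new_generator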

Consider simple class attributes.
Given
def infinite_stream():
    """Yield a number from a (semi-)infinite iterator."""
    # Alternatively, `yield from itertools.count()`
    yield from iter(range(100000000))

# Helper
def get_data(iterable):
    """Print the state of `data` per stream."""
    return ", ".join([f"{x.__name__}: {x.data}" for x in iterable])
Code
class SharedIterator:
    """Share the state of an iterator with subclasses."""
    _gen = infinite_stream()
    data = None

    @staticmethod
    def modify():
        """Advance the shared iterator + assign new data."""
        cls = SharedIterator
        cls.data = next(cls._gen)
Demo
Given a tuple of client streams (A, B and C),
# Streams
class A(SharedIterator): pass
class B(SharedIterator): pass
class C(SharedIterator): pass
streams = A, B, C
let us modify and print the state of one iterator shared between them:
# Observe changed state in subclasses
A.modify()
print("1st access:", get_data(streams))
B.modify()
print("2nd access:", get_data(streams))
C.modify()
print("3rd access:", get_data(streams))
Output
1st access: A: 0, B: 0, C: 0
2nd access: A: 1, B: 1, C: 1
3rd access: A: 2, B: 2, C: 2
Although any stream can modify the iterator, the class attribute is shared between sub-classes.
See Also
Docs on asyncio.Queue - an async alternative to a shared container (see the sketch below)
Post on the Observer Pattern + asyncio
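Since the question notes asyncio, here is a minimal sketch of the asyncio.Queue alternative referenced above; the names (subscribe, broadcast, and the stand-in the_infinite_stream) are illustrative, not from any library:

import asyncio

subscriber_queues = []

def subscribe():
    """Create a private queue that receives every broadcast item."""
    q = asyncio.Queue()
    subscriber_queues.append(q)
    return q

async def broadcast(agen):
    """Pull once from the single expensive stream and fan items out to all subscribers."""
    async for item in agen:
        for q in subscriber_queues:
            await q.put(item)

async def consumer(name, q):
    while True:
        item = await q.get()
        print(name, "got", item)

async def the_infinite_stream():
    # Stand-in for the real expensive stream.
    for i in range(5):
        await asyncio.sleep(0)
        yield i

async def main():
    c1 = asyncio.create_task(consumer("c1", subscribe()))
    c2 = asyncio.create_task(consumer("c2", subscribe()))
    await broadcast(the_infinite_stream())
    await asyncio.sleep(0.1)          # give consumers time to drain their queues
    c1.cancel(); c2.cancel()
    await asyncio.gather(c1, c2, return_exceptions=True)

asyncio.run(main())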

You can call "tee" repeatedly to create multiple iterators as needed.
import itertools
import random

it = iter([random.random() for i in range(100)])
base, it_cp = itertools.tee(it)
_, it_cp2 = itertools.tee(base)
_, it_cp3 = itertools.tee(base)
Sample: http://tpcg.io/ZGc6l5.

You can use single generator and "subscriber generators":
subscribed_generators = []

def my_generator():
    while True:
        elem = yield
        do_something(elem)  # or yield do_something(elem) depending on your actual use

def publishing_generator():
    for elem in the_infinite_stream():
        for generator in subscribed_generators:
            generator.send(elem)
        yield elem

subscribed_generators.extend([my_generator(), my_generator()])
# Prime the subscribers so they are paused at their first `yield` and can accept send()
for generator in subscribed_generators:
    next(generator)

# Next is just an example that forces iteration over `the_infinite_stream`
for elem in publishing_generator():
    pass
Instead of a generator function you can also create a class with __next__, __iter__, send, and throw methods. That way you can make MyGenerator.__init__ automatically add new instances to subscribed_generators (see the sketch below).
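A rough sketch of such a class, reusing the subscribed_generators list from the snippet above (the buffering behaviour shown here is just one possible choice):

class MyGenerator:
    """Subscriber object that registers itself on creation (illustrative sketch)."""
    def __init__(self):
        self._received = []
        subscribed_generators.append(self)   # auto-subscribe, as suggested above

    def __iter__(self):
        return self

    def __next__(self):
        # Hand back buffered elements; a real implementation might block or keep waiting.
        if not self._received:
            raise StopIteration
        return self._received.pop(0)

    def send(self, elem):
        # Called by publishing_generator; could call do_something(elem) directly instead.
        self._received.append(elem)

    def throw(self, exc):
        raise exc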
This is somewhat similar to an event-based approach, with a "dumb" implementation:
for elem in the_infinite_stream is similar to emitting an event
for generator ...: generator.send is similar to sending that event to each subscriber.
So one way to implement a "more complex but structured" solution would be to use an event-based approach:
For example you can use asyncio.Event
Or some third-party solution like aiopubsub
For any of those approaches you should emit an event for each element from the_infinite_stream, and your instances of my_generator should be subscribed to those events.
Other approaches can also be used, and the best choice depends on the details of your task and on how you are using the event loop in asyncio. For example:
You can implement the_infinite_stream (or a wrapper for it) as a class with "cursors" (objects that track the current position in the stream for different subscribers); each my_generator then registers a new cursor and uses it to get the next item in the infinite stream. In this approach the event loop will not automatically revisit my_generator instances, which might be required if those instances "are not equal" (for example, if they need some "priority balancing").
An intermediate generator calling all the instances of my_generator (as described earlier). In this approach each instance of my_generator is automatically revisited by the event loop. Most likely this approach is thread-safe.
Event-based approaches:
using asyncio.Event. Similar to use of an intermediate generator. Not thread-safe.
aiopubsub.
something that uses the Observer pattern.
Make the_infinite_generator (or a wrapper for it) a "singleton" that "caches" the latest event. Some approaches were described in other answers. Other "caching" solutions can also be used:
Emit the same element once for each instance of the_infinite_generator (use a class with a custom __new__ method that tracks instances, or use a single instance of a class with a method returning a "shifted" iterator over the_infinite_stream) until someone calls a special method on an instance of the_infinite_generator (or on the class): infinite_gen.next_cycle. In this case there should always be some "last finalizing generator/processor" that, at the end of each event-loop cycle, does the_infinite_generator().next_cycle().
Similar to the previous, but the same event is allowed to fire multiple times in the same my_generator instance (so subscribers should watch for this case). In this approach the_infinite_generator().next_cycle() can be called "periodically" with loop.call_later or loop.call_at. This approach might be needed if "subscribers" should be able to handle/analyze delays, rate limits, timeouts between events, etc.
Many other solutions are possible. It's hard to propose something specific without looking at your current implementation and without knowing the desired behavior of the generators that use the_infinite_stream.
If I understand your description of "shared" streams correctly, then you really need one the_infinite_stream generator and a "handler" for it. An example that tries to do this:
class StreamHandler:
    def __init__(self):
        self.__real_stream = the_infinite_stream()
        self.__sub_streams = []

    def get_stream(self):
        sub_stream = []  # or better use some Queue/deque object. Using a list just to show the basic principle
        self.__sub_streams.append(sub_stream)
        while True:
            while sub_stream:
                yield sub_stream.pop(0)
            next(self)

    def __next__(self):
        next_item = next(self.__real_stream)
        for sub_stream in self.__sub_streams:
            sub_stream.append(next_item)

some_global_variable = StreamHandler()
# Or you can change StreamHandler.__new__ to make it a singleton, or create the instance at the point where the event loop is created

def my_generator():
    for elem in some_global_variable.get_stream():
        yield elem
But if all your my_generator objects are initialized at the same point of the infinite stream and are iterated "equally" inside the loop, then this approach introduces "unnecessary" memory overhead for each sub_stream (used as a queue). Unnecessary because those queues will always hold the same items (though that can be optimized: if there is an existing "empty" sub_stream, it can be re-used for new sub_streams with some changes to the "pop" logic). Many other implementations and nuances could be discussed.

If you have a single generator, you can use one queue per "subscriber" and route events to each subscriber as the primary generator produces results.
This has the advantage of allowing the subscribers to move at their own pace, and it can be dropped into existing code with very few changes to the original source.
For example:
def my_gen():
    ...

m1 = Muxer(my_gen)
m2 = Muxer(my_gen)

consumer1(m1).start()
consumer2(m2).start()
As items are pulled from the primary generator they are inserted into queues for each listener. Listeners can subscribe any time by constructing a new Muxer():
import queue
from threading import Lock
from collections import namedtuple

class Muxer():
    Entry = namedtuple('Entry', 'genref listeners, lock')

    already = {}
    top_lock = Lock()

    def __init__(self, func, restart=False):
        self.restart = restart
        self.func = func
        self.queue = queue.Queue()

        with self.top_lock:
            if func not in self.already:
                self.already[func] = self.Entry([func()], [], Lock())
            ent = self.already[func]

            self.genref = ent.genref
            self.lock = ent.lock
            self.listeners = ent.listeners

            self.listeners.append(self)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            e = self.queue.get_nowait()
        except queue.Empty:
            with self.lock:
                try:
                    e = self.queue.get_nowait()
                except queue.Empty:
                    try:
                        e = next(self.genref[0])
                        for other in self.listeners:
                            if other is not self:
                                other.queue.put(e)
                    except StopIteration:
                        if self.restart:
                            self.genref[0] = self.func()
                        raise
        return e
Original source code, including test suite:
https://gist.github.com/earonesty/cafa4626a2def6766acf5098331157b3
The unit tests run many threads concurrently processing the same generated events in sequence. The code is order preserving, with a lock acquired during the single generator's access.
Caveats: the version here uses a singleton to gate access, otherwise it would be possible to accidentally evade its control over the contained generators. It also allows the contained generators to be "restartable", which was a useful feature for me at the time. There is no "close()" feature, simply because I didn't need it. This is an appropriate use case for __del__, however, since the last reference to a listener is the right time to clean up.
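For illustration, such a __del__ might look roughly like this (not part of the original gist; a best-effort sketch only):

# A possible addition inside the Muxer class:
    def __del__(self):
        # When the last reference to a listener disappears, stop routing items to it.
        lock = getattr(self, "lock", None)           # guard against partially-built instances
        listeners = getattr(self, "listeners", None)
        if lock is None or listeners is None:
            return
        with lock:
            try:
                listeners.remove(self)
            except ValueError:
                pass                                 # already removed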

Related

Is there a way to copy an arbitrary generator in Python?

Among the best-known features of functional programming are lazy evaluation and infinite lists. In Python, one generally implements these features with generators. But one of the precepts of functional programming is immutability, and generators are not immutable. Just the opposite. Every time one calls next() on a generator, it changes its internal state.
A possible work-around would be to copy a generator before calling next() on it. That works for some generators such as count(). (Perhaps count() is not a generator?)
from copy import copy
from itertools import count

count_gen = count()
count_gen_copy = copy(count_gen)
print(next(count_gen), next(count_gen), next(count_gen))                 # => 0 1 2
print(next(count_gen_copy), next(count_gen_copy), next(count_gen_copy))  # => 0 1 2
But if I define my own generator, e.g., my_count(), I can't copy it.
def my_count(n=0):
    while True:
        yield n
        n += 1
my_count_gen = my_count()
my_count_gen_copy = copy(my_count_gen)
print(next(my_count_gen), next(my_count_gen), next(my_count_gen))
print(next(my_count_gen_copy), next(my_count_gen_copy), next(my_count_gen_copy))
I get an error message when I attempt to execute copy(my_count_gen): TypeError: can't pickle generator objects.
Is there a way around this, or is there some other approach?
Perhaps another way to ask this is: what is copy() copying when it copies count_gen?
Thanks.
P.S. If I use __iter__() rather than copy(), the __iter__() version acts like the original.
my_count_gen = my_count()
my_count_gen_i = my_count_gen.__iter__()
print(next(my_count_gen), next(my_count_gen), next(my_count_gen)) # => 0 1 2
print(next(my_count_gen_i), next(my_count_gen_i), next(my_count_gen_i)) # => 3 4 5
There's no way to copy arbitrary generators in Python. The operation just doesn't make sense. A generator could depend on all sorts of other uncopyable resources, like file handles, database connections, locks, worker processes, etc. If a generator is holding a lock and you copied it, what would happen to the lock? If a generator is in the middle of a database transaction and you copy it, what would happen to the transaction?
The things you thought were copyable generators aren't generators at all. They're instances of other iterator classes. If you want to write your own iterator class, you can:
class MyCount:
    def __init__(self, n=0):
        self._n = n

    def __iter__(self):
        return self

    def __next__(self):
        retval = self._n
        self._n += 1
        return retval
Some iterators you write that way might even be reasonably copyable. For others, copy.copy will do something completely unreasonable and useless.
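For example, the MyCount class above happens to copy sensibly, unlike a real generator:

import copy

a = MyCount()
print(next(a), next(a), next(a))   # 0 1 2
b = copy.copy(a)                   # copies the plain instance attribute _n
print(next(b), next(b), next(b))   # 3 4 5
print(next(a))                     # 3 -- the two copies advance independently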
While copy doesn't make sense on a generator, you can effectively "copy" an iterator so that you can iterate it many times. The easiest way is to use tee from the itertools module.
import itertools

def my_count(n=0):
    while True:
        yield n
        n += 1

a, b, c = itertools.tee(my_count(), 3)
# now use a, b, c ...
This uses memory to cache the iterator's results and pass them on.
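For example, advancing one branch does not consume the other; tee caches whatever one branch has seen but the other has not:

import itertools

a, b = itertools.tee(my_count(), 2)
print(list(itertools.islice(a, 5)))   # [0, 1, 2, 3, 4]
print(list(itertools.islice(b, 5)))   # [0, 1, 2, 3, 4], served from tee's internal cache
print(next(a))                        # 5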

Efficient growing pools of objects

Is there an established module, or good practice, to work efficiently with large object pools in Python 3?
What I mean by "object pool" is some class capable of:
fetching new instances of a specified type, while dynamically extending the memory allocation under the hood when necessary;
maintaining a consistent indexing for previously fetched objects.
Here is a basic example:
class Value:
    __slots__ = ('a', 'b')
    def __init__(self, a=None, b=None):
        self.a = a
        self.b = b

class BasicPool:
    def __init__(self):
        self.data = []
    def __getitem__(self, k):
        return self.data[k]
    def fetch(self):
        v = Value()
        self.data.append(v)
        return v

class BlockPool:
    def __init__(self, bsize=100):
        self.bsize = bsize
        self.next = bsize
        self.data = []
    def __getitem__(self, k):
        b, k = divmod(k, self.bsize)
        return self.data[b][k]
    def fetch(self):
        self.next += 1
        if self.next >= self.bsize:
            self.data.append([Value() for _ in range(self.bsize)])
            self.next = 0
        return self.data[-1][self.next]
The BasicPool doesn't do anything smart: whenever a new instance is requested, it is instantiated and appended to an underlying list. On the other hand, the BlockPool grows a list of pre-allocated blocks of instances. Surprisingly though, it seems that preallocation is not beneficial in practice:
from timeit import default_timer as timer

def benchmark(P):
    N = int(1e6)
    start = timer()
    for _ in range(N): P.fetch()
    print(timer() - start)

print('Basic pool:')
for _ in range(5): benchmark(BasicPool())
# Basic pool:
# 1.2352294209995307
# 0.5003506309985823
# 0.48115064000012353
# 0.48508202800076106
# 1.1760561199989752

print('Block pool:')
for _ in range(5): benchmark(BlockPool())
# Block pool:
# 0.7272855400005938
# 1.4875716509995982
# 0.726611527003115
# 0.7369502859983186
# 1.4867010340021807
As you can see, the BasicPool is always faster than the BlockPool (I also don't know the cause of these large variations). Pools of objects must be a fairly common need in Python; is the best approach really to use the builtin list.append? Are there smarter containers that can be used to further improve runtime performance, or is this dominated by the instantiation time anyway?
The whole point of the geometric growth of the array underlying a list is to reduce the reallocation overhead to a constant factor. That constant can easily be smaller than that for manually making blocks (principally because of the slow, interpreted manipulation of self.next and self.data in the latter). (Asymptotically, the cost of BlockPool.fetch is still the append, of course.) Moreover, your benchmark doesn’t include the additional cost of destroying the blocks, nor that of the two-step indexing on read.
So list is surely as good as it gets (without writing your own C code). You can improve BasicPool a bit by inheriting from list rather than containing one, eliminating a dictionary lookup per fetch and the interpreted __getitem__ wrapper entirely.
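A minimal sketch of that suggestion (the class name ListPool is made up here, reusing the Value class from the question):

class ListPool(list):
    """A pool that *is* a list: indexing goes straight to the C-level list __getitem__."""
    def fetch(self):
        v = Value()
        self.append(v)   # no intermediate self.data attribute lookup
        return v

pool = ListPool()
v = pool.fetch()
print(pool[0] is v)   # True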

How to collect function return values while multithreading without using globals?

So I'm trying to work out a generic solution that will collect all values from a function and append them to a list that is later accessible. This is to be used during concurrent.futures or threading type tasks. Here is a solution I have using a global master_list:
from concurrent.futures import ThreadPoolExecutor

master_list = []

def return_from_multithreaded(func):
    # master_list = []
    def wrapper(*args, **kwargs):
        # nonlocal master_list
        global master_list
        master_list += func(*args, **kwargs)
    return wrapper

@return_from_multithreaded
def f(n):
    return [n]

with ThreadPoolExecutor(max_workers=20) as exec:
    exec.map(f, range(1, 100))

print(master_list)
I would like to find a solution that does not include globals, and perhaps can return the commented out master_list that is stored as a closure?
If you don't want to use globals, don't discard the results of map. map is giving you back the values returned by each function, you just ignored them. This code could be made much simpler by using map for its intended purpose:
def f(n):
    return n  # No need to wrap in list

with ThreadPoolExecutor(max_workers=20) as exec:
    master_list = list(exec.map(f, range(1, 100)))
print(master_list)
If you need a master_list that shows the results computed so far (maybe some other thread is watching it), you just make the loop explicit:
def f(n):
    return n  # No need to wrap in list

master_list = []
with ThreadPoolExecutor(max_workers=20) as exec:
    for result in exec.map(f, range(1, 100)):
        master_list.append(result)
print(master_list)
This is what the Executor model is designed for; normal threads aren't intended to return values, but Executors provided a channel for returning values under the covers so you don't have to manage it yourself. Internally, this is using Queues of some form or another, with additional metadata to keep the results in order, but you don't need to deal with that complexity; from your perspective, it's equivalent to the regular map function, it just happens to parallelize the work.
Update to cover dealing with exceptions:
map will raise any exceptions raised in the workers when the result is hit. Thus, as written, the first set of code will not store anything if any of the tasks fail (the list will be partially constructed, but thrown away when the exception raises). The second example will only keep results before the first exception is thrown, with the rest discarded (you'd have to store the map iterator and use some awkward code to avoid it). If you need to store all successful results, ignoring failures (or just logging them in some way), it's easiest to use submit to create a list of Future objects, then wait on them, either serially or in order of completion, wrapping the .result() calls in try/except to avoid throwing away good results. For example, to store results in order of submission, you'd do:
master_list = []
with ThreadPoolExecutor(max_workers=20) as exec:
    futures = [exec.submit(f, i) for i in range(1, 100)]
    exec.shutdown(False)  # Optional: workers terminate as soon as all futures finish,
                          # rather than waiting for all results to be processed
    for fut in futures:
        try:
            master_list.append(fut.result())
        except Exception:
            ...  # log error here
For more efficient code, you can retrieve results in order of completion, not submission, using concurrent.futures.as_completed to eagerly retrieve results as they finish. The only change from the previous code is that:
for fut in futures:
becomes:
for fut in concurrent.futures.as_completed(futures):
where as_completed does the work of yielding completed/cancelled futures as soon as they complete, instead of delaying until all futures submitted earlier complete and get handled.
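Putting that together, the completion-ordered variant would look roughly like this (using the same f as earlier; results then land in master_list in completion order rather than submission order):

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

master_list = []
with ThreadPoolExecutor(max_workers=20) as exec:
    futures = [exec.submit(f, i) for i in range(1, 100)]
    for fut in concurrent.futures.as_completed(futures):
        try:
            master_list.append(fut.result())
        except Exception:
            ...  # log error here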
There are more complicated options involving using add_done_callback so the main thread isn't involved in explicitly handling the results at all, but that's usually unnecessary, and often confusing, so it's best avoided if possible.
I have faced this issue in the past: running multiple asynchronous functions and getting the returned value of each function. This was my approach:
from multiprocessing import Process, Queue

def async_call(func_list):
    """
    Runs the list of functions asynchronously.
    :param func_list: Expects a list of lists of the format
                      [[func1, args1, kwargs1], [func2, args2, kwargs2], ...]
    :return: List of the outputs of the functions
             [output1, output2, ...]
    """
    def worker(function, f_args, f_kwargs, queue, index):
        """
        Runs the function and puts the output on the queue, or the Exception in case of error
        """
        response = {
            'index': index,  # For tracking the index of each function in the actual list.
                             # Since this function is called asynchronously, the order in the
                             # queue may differ
            'data': None,
            'error': None
        }

        # Handle error in the function call
        try:
            response['data'] = function(*f_args, **f_kwargs)
        except Exception as e:
            response['error'] = e  # send back the exception along with the queue

        queue.put(response)

    queue = Queue()
    processes = [Process(target=worker, args=(func, args, kwargs, queue, i)) \
                 for i, (func, args, kwargs) in enumerate(func_list)]

    for process in processes:
        process.start()

    response_list = []
    for process in processes:
        # Wait for process to finish
        process.join()

        # Get back the response from the queue
        response = queue.get()
        if response['error']:
            raise response['error']  # Raise exception if the function call failed
        response_list.append(response)

    return [content['data'] for content in sorted(response_list, key=lambda x: x['index'])]
Sample run:
def my_sum(x, y):
    return x + y

def your_mul(x, y):
    return x * y

my_func_list = [[my_sum, [1], {'y': 2}], [your_mul, [], {'x': 1, 'y': 2}]]

async_call(my_func_list)
# Value returned: [3, 2]

Python multiprocessing pool with shared data

I'm attempting to speed up a multivariate fixed-point iteration algorithm using multiprocessing; however, I'm running into issues dealing with shared data. My solution vector is actually a named dictionary rather than a vector of numbers. Each element of the vector is computed using a different formula. At a high level, I have an algorithm like this:
current_estimate = previous_estimate
while True:
    for state in all_states:
        current_estimate[state] = state.getValue(previous_estimate)
    if norm(current_estimate, previous_estimate) < tolerance:
        break
    else:
        previous_estimate, current_estimate = current_estimate, previous_estimate
I'm trying to parallelize the for-loop part with multiprocessing. The previous_estimate variable is read-only and each process only needs to write to one element of current_estimate. My current attempt at rewriting the for-loop is as follows:
import itertools
from multiprocessing import Manager, Pool

# Class and function definitions
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

def worker(state, in_est, out_est):
    out_est[state] = state.getValue(in_est)

def worker_star(a_b_c):
    """ Allow multiple arguments for a pool
        Taken from http://stackoverflow.com/a/5443941/3865495
    """
    return worker(*a_b_c)

# Initialize test environment
manager = Manager()
estimates = manager.dict()
all_states = []
for i in range(5):
    a = A(i)
    all_states.append(a)
    estimates[a] = 0

pool = Pool(processes=2)
prev_est = estimates
curr_est = estimates
pool.map(worker_star, itertools.izip(all_states, itertools.repeat(prev_est), itertools.repeat(curr_est)))
The issue I'm currently running into is that the elements added to the all_states array are not the same as those added to the manager.dict(). I keep getting key errors when trying to access elements of the dictionary using elements of the array. While debugging, I found that none of the elements are the same.
print map(id, estimates.keys())
>>> [19558864, 19558928, 19558992, 19559056, 19559120]
print map(id, all_states)
>>> [19416144, 19416208, 19416272, 19416336, 19416400]
This is happening because the objects you're putting into the estimates DictProxy aren't actually the same objects as those that live in the regular dict. The manager.dict() call returns a DictProxy, which is proxying access to a dict that actually lives in a completely separate manager process. When you insert things into it, they're really being copied and sent to a remote process, which means they're going to have a different identity.
To work around this, you can define your own __eq__ and __hash__ functions on A, as described in this question:
class A(object):
    def __init__(self, val):
        self.val = val

    # representative getValue function
    def getValue(self, est):
        return est[self] + self.val

    def __hash__(self):
        return hash(self.__key())

    def __key(self):
        return (self.val,)

    def __eq__(x, y):
        return x.__key() == y.__key()
This means that key lookups for items in estimates will just use the value of the val attribute to establish identity and equality, rather than the id assigned by Python.
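As a quick sanity check of that fix, a sketch assuming the modified class A above (run under if __name__ == '__main__' because the Manager spawns a helper process):

from multiprocessing import Manager

if __name__ == '__main__':
    manager = Manager()
    estimates = manager.dict()

    a = A(3)
    estimates[a] = 0
    # The DictProxy stored a copy of `a`, but keys now compare by value,
    # so both the original object and an equal one find the entry.
    print(estimates[a])      # 0
    print(estimates[A(3)])   # 0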

Ability to use generator like functions without using yield? (Python3.x)

There are some cases where it's convenient to use a generator with yield to pass data back to the caller over an extended period. Is there a way to do something similar to yield, without having to make the function into a generator?
The reason for this is that in some cases I end up having to make all callees into generators, even when those nested functions may have useful return values.
# currently this works fine, but requires a return arg
def nested(return_store):
    return_store[0] = some_test()
    yield from some_generator()

def do_stuff(return_store):
    yield some_data
    for more_data in data:
        yield more_data

    # Annoying workaround!
    return_store = [None]
    yield from nested(return_store)
    if return_store[0]:
        pass  # do anything

def main():
    return Reply(do_stuff())
Instead I'd like to pass an object as an argument, which I can then pass values to (instead of using yield):
# is something like this possible?
def nested(iter_obj):
    iter_obj.yield_replacement(some_generator())
    return some_test()

def do_stuff(iter_obj):
    iter_obj.yield_replacement(some_data)
    for more_data in data:
        iter_obj.yield_replacement(more_data)

    # No annoying workaround
    if nested(iter_obj):
        pass  # do anything

def main():
    iter_obj = yield_replacement_object(consumer=print)
    # sets up the generator (Reply should consume iter_obj)
    do_stuff(iter_obj)
    return Reply(iter_obj)
Generators are just one form of iterators. Anything that implements the iterator protocol will do.
This means you can replace your nested function with an object with more attributes:
class Nested():
    def __iter__(self):
        self.some_flag = some_test()
        yield data
I even implemented the __iter__ method as a generator function.
Then use the object in your generator at will:
n = Nested()
yield from n

if n.some_flag:
    # ...
Another method is to throw exceptions; if you are trying to communicate some out-of-band state change, throw an exception and catch it in the parent generator.
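A rough sketch of that exception-based approach (the StateChange class and the condition here are made up for illustration):

class StateChange(Exception):
    """Out-of-band signal carried up to the parent generator."""
    def __init__(self, value):
        super().__init__(value)
        self.value = value

def nested(items):
    for item in items:
        if item < 0:                    # hypothetical out-of-band condition
            raise StateChange(item)     # stop yielding and signal the parent
        yield item

def do_stuff(items):
    try:
        yield from nested(items)
    except StateChange as change:
        # react to the out-of-band state change
        print("state change:", change.value)

print(list(do_stuff([1, 2, -3, 4])))    # prints "state change: -3", then [1, 2]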
