I have a Python script that generates (str, float) tuples which are then indexed into ElasticSearch using a custom function which eventually calls helpers.streaming_bulk().
This is how the generator is implemented:
doc_ids: List[str] = [...]
docs = ((doc_id, get_value(doc_id)) for doc_id in doc_ids)
get_value() calls a remote service that computes a float value per document id.
Next, these tuples are passed on to update_page_quality_bulk():
for success, item in update_page_quality_bulk(
    islice(doc_qualities, size)
):
    total_success += success
    if not success:
        logging.error(item)
Internally, update_page_quality_bulk() creates the ElasticSearch requests.
One of the advantages of using a generator here is that the first size elements can be fed into update_page_quality_bulk() through islice().
In order to make the entire process faster, I would like to parallelize the get_value() calls. As mentioned, these are remote calls so the local compute cost is negligible, but the duration is significant.
The order of the tuples does not matter, nor does it matter which elements are passed into update_page_quality_bulk(). On a high level, I would like to make the get_value() calls (up to x in parallel) for any n tuples and pass on whichever ones are finished first.
My naive attempt was to define get_value() as asynchronous:
async def get_value(doc_id):
    ...
and await the call in the generator:
docs = ((doc_id, await get_value(doc_id)) for doc_id in doc_ids)
However, this raises an error in the subsequent islice() call:
TypeError: 'async_generator' object is not iterable
Removing the islice call and passing the unmodified docs generator to update_page_quality_bulk() causes the same error to be raised when looping over the tuples to convert them into ElasticSearch requests.
I am aware that the ElasticSearch client provides asynchronous helpers, but they don't seem applicable here because I need to generate the actions first.
According to this answer, it seems like I have to change the implementation to using a queue.
This answer implies that it cannot be done without using multiprocessing due to Python GIL, but that answer is not marked as correct and is quite old too.
Generally, I am looking for a way to change the current logic as little as possible while parallelizing the get_value() calls.
So, you want to pass a "synchronous-looking" generator to a call that expects a normal lazy generator, such as islice, and keep getting the results for it in parallel.
It sounds like a job for asyncio.as_completed: you use your plain generator to create tasks; these are run in parallel by the asyncio machinery, and the results are made available as the tasks are completed (d'oh!).
However, since update_page_quality_bulk is not asyncio-aware, it will never yield control to the asyncio loop, so the loop cannot complete the tasks whose results have arrived. This would likely block.
Calling update_page_quality_bulk in another thread probably won't work either. I did not try it here, but I'd say you can't just iterate over docs in a different thread than the one in which it (and its tasks) were created.
So, first things first: the "generator expression" syntax does not work when you want some terms of the generator to be calculated asynchronously, as you found out. We refactor that so that the tuples are created in a coroutine function, and we wrap all calls to it in tasks (some of the asyncio functions do the wrapping in a task automatically).
Then we can use the asyncio machinery to schedule all the calls and call update_page_quality_bulk as these results arrive. The problem is that as_completed, as stated above, can't be passed directly to a non-async function: the asyncio loop would never get control back. Instead, we keep picking the results of tasks in the main thread, and call the sync function in another thread, using a queue.Queue to pass the fetched results. And finally, so that the results can be consumed as they are made available inside update_page_quality_bulk, we create a small wrapper class around the queue, so that it can be consumed as an iterator. This is transparent for the code consuming the iterator.
# example code: untested
import asyncio
import logging
import queue
import threading
async def get_doc_values(doc_id):
    loop = asyncio.get_running_loop()
    # run_in_executor runs the synchronous function in parallel in a thread pool.
    # Check the docs - you might want to pass a custom executor with more than
    # the default number of workers, instead of None:
    return doc_id, await loop.run_in_executor(None, get_value, doc_id)
def update_es(iterator):
    # this function runs in a separate thread
    total_success = 0
    for success, item in update_page_quality_bulk(iterator):
        total_success += success
        if not success:
            logging.error(item)
sentinel = Ellipsis  # ... : Python ellipsis - a nice sentinel that also works for multiprocessing
class Iterator:
    """This allows the queue, fed in the main thread by the tasks as they are completed,
    to behave like an ordinary iterator, which can be consumed by "update_page_quality_bulk" in another thread
    """
    def __init__(self, source_queue):
        self.source = source_queue
    def __iter__(self):
        # needed so that "for" loops and islice accept this object
        return self
    def __next__(self):
        value = self.source.get()
        if value is sentinel:
            raise StopIteration()
        return value
async def main():
    # "await" is only valid inside a coroutine, so the driving code lives here
    results_queue = queue.Queue()
    iterator = Iterator(results_queue)
    es_worker = threading.Thread(target=update_es, args=(iterator,))
    es_worker.start()
    for doc_value_task in asyncio.as_completed([get_doc_values(doc_id) for doc_id in doc_ids]):
        doc_value = await doc_value_task
        results_queue.put(doc_value)
    # tell the worker thread that nothing more is coming, otherwise it blocks forever
    results_queue.put(sentinel)
    es_worker.join()
asyncio.run(main())
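To try the queue-backed iterator pattern without Elasticsearch or a remote service, here is a minimal self-contained sketch. `fake_get_value`, `fake_bulk_update`, `QueueIterator` and `run_demo` are hypothetical stand-ins invented for this illustration; only the asyncio, queue and threading calls are real standard-library APIs.

```python
import asyncio
import queue
import threading
import time

SENTINEL = object()  # unique end-of-stream marker


def fake_get_value(doc_id):
    """Hypothetical stand-in for the slow, blocking remote call."""
    time.sleep(0.05)
    return len(doc_id) / 10


def fake_bulk_update(pairs, results):
    """Hypothetical stand-in for update_page_quality_bulk: a plain blocking consumer."""
    for doc_id, value in pairs:
        results.append((doc_id, value))


class QueueIterator:
    """Expose a queue.Queue as a blocking iterator that stops at SENTINEL."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return self

    def __next__(self):
        value = self.source.get()
        if value is SENTINEL:
            raise StopIteration
        return value


async def produce(doc_ids, out_queue):
    loop = asyncio.get_running_loop()

    async def fetch(doc_id):
        # Run the blocking call in the default thread pool.
        return doc_id, await loop.run_in_executor(None, fake_get_value, doc_id)

    for finished in asyncio.as_completed([fetch(d) for d in doc_ids]):
        out_queue.put(await finished)
    out_queue.put(SENTINEL)  # without this the consumer thread never finishes


def run_demo(doc_ids):
    results = []
    q = queue.Queue()
    consumer = threading.Thread(target=fake_bulk_update, args=(QueueIterator(q), results))
    consumer.start()
    asyncio.run(produce(doc_ids, q))
    consumer.join()
    return results


results = run_demo(["a", "bb", "ccc"])
```

The results arrive in completion order, so compare them as a set or sort them before checking. Because `QueueIterator` defines both `__iter__` and `__next__`, it can also be wrapped in `islice()` like any other iterator.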
Environment: cooperative RTOS in C and micropython virtual machine is one of the tasks.
To make the VM not block the other RTOS tasks, I insert RTOS_sleep() in vm.c:DISPATCH() so that after every bytecode is executed, the VM relinquishes control to the next RTOS task.
I created a uPy interface to asynchronously obtain data from a physical data bus - could be CAN, SPI, ethernet - using producer-consumer design pattern.
Usage in uPy:
can_q = CANbus.queue()
message = can_q.get()
The implementation in C is such that can_q.get() does NOT block the RTOS: it polls a C-queue and if a message has not been received, it calls RTOS_sleep() to give another task the chance to fill the queue. Things are synchronized because the C-queue is only updated by another RTOS task, and RTOS tasks only switch when RTOS_sleep() is called, i.e. scheduling is cooperative.
The C-implementation is basically:
// gives chance for c-queue to be filled by other RTOS task
while(c_queue_empty() == true) RTOS_sleep();
return c_queue_get_message();
Although the Python statement can_q.get() does not block the RTOS, it does block the uPy script.
I'd like to rewrite it so I can use it with async def i.e. coroutine and have it not block the uPy script.
Not sure of the syntax but something like this:
can_q = CANbus.queue()
message = await can_q.get()
QUESTION
How do I write a C-function so I can await on it?
I would prefer a CPython and micropython answer but I would accept a CPython-only answer.
Note: this answer covers CPython and the asyncio framework. The concepts, however, should apply to other Python implementations as well as other async frameworks.
How do I write a C-function so I can await on it?
The simplest way to write a C function whose result can be awaited is by having it return an already made awaitable object, such as an asyncio.Future. Before returning the Future, the code must arrange for the future's result to be set by some asynchronous mechanism. All of these coroutine-based approaches assume that your program is running under some event loop that knows how to schedule the coroutines.
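Expressed in Python rather than C, the "return an already made awaitable" approach can be sketched as follows. `delayed_value` is a hypothetical name used only for this illustration; it is a plain function (not an `async def`) that assumes it is called while an asyncio event loop is running.

```python
import asyncio


def delayed_value(value, delay):
    """Plain function whose return value can be awaited.

    It creates a Future and arranges for the event loop to set its
    result later; the caller simply awaits the returned Future.
    """
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    # The "asynchronous mechanism" here is just a timer on the loop.
    loop.call_later(delay, future.set_result, value)
    return future


async def main():
    # Looks exactly like awaiting a coroutine function.
    return await delayed_value(42, 0.01)


result = asyncio.run(main())
```

A C extension function would do the same thing through the C API: create the future, register whatever will eventually call `set_result`, and return the future object.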
But returning a future isn't always enough - maybe we'd like to define an object with an arbitrary number of suspension points. Returning a future suspends only once (if the returned future is not complete), resumes once the future is completed, and that's it. An awaitable object equivalent to an async def that contains more than one await cannot be implemented by returning a future; it has to implement the protocol that coroutines normally implement. This is somewhat like an iterator implementing a custom __next__ and being used instead of a generator.
Defining a custom awaitable
To define our own awaitable type, we can turn to PEP 492, which specifies exactly which objects can be passed to await. Other than Python functions defined with async def, user-defined types can make objects awaitable by defining the __await__ special method, which Python/C maps to the tp_as_async.am_await part of the PyTypeObject struct.
What this means is that in Python/C, you must do the following:
specify a non-NULL value for the tp_as_async field of your extension type.
have its am_await member point to a C function that accepts an instance of your type and returns an instance of another extension type that implements the iterator protocol, i.e. defines tp_iter (trivially defined as PyIter_Self) and tp_iternext.
the iterator's tp_iternext must advance the coroutine's state machine. Each non-exceptional return from tp_iternext corresponds to a suspension, and the final StopIteration exception signifies the final return from the coroutine. The return value is stored in the value property of StopIteration.
For the coroutine to be useful, it must also be able to communicate with the event loop that drives it, so that it can specify when it is to be resumed after it has suspended. Most coroutines defined by asyncio expect to be running under the asyncio event loop, and internally use asyncio.get_event_loop() (and/or accept an explicit loop argument) to obtain its services.
Example coroutine
To illustrate what the Python/C code needs to implement, let's consider a simple coroutine expressed as a Python async def, such as this equivalent of asyncio.sleep():
async def my_sleep(n):
    loop = asyncio.get_event_loop()
    future = loop.create_future()
    loop.call_later(n, future.set_result, None)
    await future
    # we get back here after the timeout has elapsed, and
    # immediately return
my_sleep creates a Future, arranges for it to complete (its result to become set) in n seconds, and suspends itself until the future completes. The last part uses await, where await x means "allow x to decide whether we will now suspend or keep executing". An incomplete future always decides to suspend, and the asyncio Task coroutine driver special-cases yielded futures to suspend them indefinitely and connects their completion to resuming the task. Suspension mechanisms of other event loops (curio etc) can differ in details, but the underlying idea is the same: await is an optional suspension of execution.
__await__() that returns a generator
To translate this to C, we have to get rid of the magic async def function definition, as well as of the await suspension point. Removing the async def is fairly simple: the equivalent ordinary function simply needs to return an object that implements __await__:
def my_sleep(n):
    return _MySleep(n)
class _MySleep:
    def __init__(self, n):
        self.n = n
    def __await__(self):
        return _MySleepIter(self.n)
The __await__ method of the _MySleep object returned by my_sleep() will be automatically called by the await operator to convert an awaitable object (anything passed to await) to an iterator. This iterator will be used to ask the awaited object whether it chooses to suspend or to provide a value. This is much like how the for o in x statement calls x.__iter__() to convert the iterable x to a concrete iterator.
When the returned iterator chooses to suspend, it simply needs to produce a value. The meaning of the value, if any, will be interpreted by the coroutine driver, typically part of an event loop. When the iterator chooses to stop executing and return from await, it needs to stop iterating. Using a generator as a convenience iterator implementation, _MySleepIter would look like this:
def _MySleepIter(n):
    loop = asyncio.get_event_loop()
    future = loop.create_future()
    loop.call_later(n, future.set_result, None)
    # yield from future.__await__()
    for x in future.__await__():
        yield x
As await x maps to yield from x.__await__(), our generator must exhaust the iterator returned by future.__await__(). The iterator returned by Future.__await__ will yield if the future is incomplete, and return the future's result (which we here ignore, but yield from actually provides) otherwise.
__await__() that returns a custom iterator
The final obstacle for a C implementation of my_sleep is the use of a generator for _MySleepIter. Fortunately, any generator can be translated to a stateful iterator whose __next__ executes the piece of code up to the next await or return. __next__ implements a state machine version of the generator code, where yield is expressed by returning a value, and return by raising StopIteration. For example:
class _MySleepIter:
    def __init__(self, n):
        self.n = n
        self.state = 0
    def __iter__(self):  # an iterator has to define __iter__
        return self
    def __next__(self):
        if self.state == 0:
            loop = asyncio.get_event_loop()
            self.future = loop.create_future()
            loop.call_later(self.n, self.future.set_result, None)
            self.state = 1
        if self.state == 1:
            if not self.future.done():
                return next(iter(self.future))
            self.state = 2
        if self.state == 2:
            raise StopIteration
        raise AssertionError("invalid state")
Translation to C
The above is quite some typing, but it works, and only uses constructs that can be defined with native Python/C functions.
Actually translating the two classes to C is quite straightforward, but beyond the scope of this answer.
I come from the land of Twisted/Klein. I come in peace and to ask for Tornado help. I'm investigating Tornado and how its take on async differs from Twisted. Twisted has something similar to gen.coroutine which is defer.inlineCallbacks and I'm able to write async code like this:
kleinsample.py
@app.route('/endpoint/<int:n>')
@defer.inlineCallbacks
def myRoute(request, n):
    jsonlist = []
    for i in range(n):
        yield jsonlist.append({'id': i})
    return json.dumps(jsonlist)
curl cmd:
curl localhost:9000/json/2000
This endpoint will create a JSON string with n number of elements. n can be small or very big. I'm able to break it up in Twisted such that the event loop won't block using yield. Now here's how I tried to convert this into Tornado:
tornadosample.py
async def get(self, n):
    jsonlist = []
    for i in range(n):
        await gen.Task(jsonlist.append, {'id': i})  # exception here
    self.write(json.dumps(jsonlist))
The traceback:
TypeError: append() takes no keyword arguments
I'm confused about what I'm supposed to do to properly iterate each element in the loop so that the event loop doesn't get blocked. Does anyone know the "Tornado" way of doing this?
You cannot and must not await append, since it isn't a coroutine and doesn't return a Future. If you want to occasionally yield to allow other coroutines to proceed using Tornado's event loop, await gen.moment.
from tornado import gen
async def get(self, n):
    jsonlist = []
    for i in range(n):
        jsonlist.append({'id': i})
        if not i % 1000:  # Yield control for a moment every 1k ops
            await gen.moment
    return json.dumps(jsonlist)
That said, unless this function is extremely CPU-intensive and requires hundreds of milliseconds or more to complete, you're probably better off just doing all your computation at once instead of taking multiple trips through the event loop before your function returns.
list.append() returns None, so it's a little misleading that your Klein sample looks like it's yielding some object. This is equivalent to jsonlist.append(...); yield as two separate statements. The tornado equivalent would be to do await gen.moment in place of the bare yield.
Also note that in Tornado, handlers produce their responses by calling self.write(), not by returning values, so the return statement should be self.write(json.dumps(jsonlist)).
Let's have a look at gen.Task docs:
Adapts a callback-based asynchronous function for use in coroutines.
Takes a function (and optional additional arguments) and runs it with those arguments plus a callback keyword argument. The argument passed to the callback is returned as the result of the yield expression.
Since append doesn't accept keyword arguments, it doesn't know what to do with that callback kwarg and raises that exception.
What you could do is wrap append with your own function that does accept a callback kwarg, or use the approach shown in this answer.
Python 3.3 and later support a new syntax for writing generator functions:
yield from <expression>
I have made a quick try for this by
>>> def g():
...     yield from [1,2,3,4]
...
>>> for i in g():
...     print(i)
...
1
2
3
4
>>>
It seems simple to use, but the PEP document is complex. My question is: are there any other differences compared to the previous yield statement? Thanks.
For most applications, yield from just yields everything from the iterable on its right, in order:
def iterable1():
    yield 1
    yield 2
def iterable2():
    yield from iterable1()
    yield 3
assert list(iterable2()) == [1, 2, 3]
For 90% of users who see this post, I'm guessing that this will be explanation enough for them. yield from simply delegates to the iterable on the right hand side.
Coroutines
However, there are some more esoteric generator circumstances that also have importance here. A less known fact about Generators is that they can be used as co-routines. This isn't super common, but you can send data to a generator if you want:
def coroutine():
    x = yield None
    yield 'You sent: %s' % x
c = coroutine()
next(c)
print(c.send('Hello world'))
Aside: You might be wondering what the use-case is for this (and you're not alone). One example is the contextlib.contextmanager decorator. Co-routines can also be used to parallelize certain tasks. I don't know too many places where this is taken advantage of, but google app-engine's ndb datastore API uses it for asynchronous operations in a pretty nifty way.
Now, let's assume you send data to a generator that is yielding data from another generator ... How does the original generator get notified? The answer is that it doesn't in Python 2.x, where you need to wrap the generator yourself:
def python2_generator_wrapper():
    for item in some_wrapped_generator():
        yield item
At least not without a whole lot of pain:
def python2_coroutine_wrapper():
    """This doesn't work. Somebody smarter than me needs to fix it. . .
    Pain. Misery. Death lurks here :-("""
    # See https://www.python.org/dev/peps/pep-0380/#formal-semantics for actual working implementation :-)
    g = some_wrapped_generator()
    for item in g:
        try:
            val = yield item
        except Exception as forward_exception:  # What exceptions should I not catch again?
            g.throw(forward_exception)
        else:
            if val is not None:
                g.send(val)  # Oops, we just consumed another cycle of g ... How do we handle that properly ...
This all becomes trivial with yield from:
def coroutine_wrapper():
    yield from coroutine()
Because yield from truly delegates (everything!) to the underlying generator.
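To see the delegation in action, here is a small sketch in which both `send()` and `throw()` on the outer generator reach the inner one. The names `inner` and `outer` are made up for this illustration.

```python
def inner():
    try:
        received = yield "ready"
        while True:
            # Values passed to outer.send() arrive here.
            received = yield f"inner got {received!r}"
    except ValueError as exc:
        # Exceptions passed to outer.throw() are raised here.
        yield f"inner caught {exc}"


def outer():
    # No manual forwarding of sent values or exceptions is needed.
    yield from inner()


g = outer()
first = next(g)                      # runs inner() up to its first yield
second = g.send("hello")             # forwarded to inner()
third = g.send(123)                  # forwarded again
fourth = g.throw(ValueError("boom")) # raised inside inner()
```

Writing the same forwarding by hand is what the Python 2 attempt above struggles with; the formal semantics section of PEP 380 spells out the full equivalent.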
Return semantics
Note that the PEP in question also changes the return semantics. While not directly in OP's question, it's worth a quick digression if you are up for it. In python2.x, you can't do the following:
def iterable():
    yield 'foo'
    return 'done'
It's a SyntaxError. With the update to yield, the above function is now legal. Again, the primary use-case is with coroutines (see above). You can send data to the generator and it can do its work magically (maybe using threads?) while the rest of the program does other things. When flow control passes back to the generator, StopIteration will be raised (as is normal for the end of a generator), but now the StopIteration will have a data payload. It is the same thing as if a programmer instead wrote:
raise StopIteration('done')
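Now the caller can catch that exception and do something with the data payload to benefit the rest of humanity. (Note that since Python 3.7, per PEP 479, an explicit raise StopIteration inside a generator body is converted to a RuntimeError, so in practice you write return 'done'.)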
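A short sketch of both ways to get at that payload in Python 3: catching StopIteration by hand and reading its `value` attribute, or letting `yield from` hand it over as the value of the expression. The names `worker` and `delegator` are invented for this example.

```python
def worker():
    yield "foo"
    return "done"  # becomes StopIteration.value


def delegator():
    # yield from evaluates to the return value of the inner generator
    result = yield from worker()
    yield f"worker returned {result}"


# 1. Manual iteration: the payload rides on the exception.
gen = worker()
next(gen)  # consumes 'foo'
try:
    next(gen)
except StopIteration as stop:
    payload = stop.value

# 2. Delegation: the payload is captured by the assignment.
items = list(delegator())
```

Note that a plain `for` loop or `list()` over `worker()` silently discards the return value; only manual iteration or `yield from` exposes it.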
At first sight, yield from is an algorithmic shortcut for:
def generator1():
    for item in generator2():
        yield item
    # do more things in this generator
# do more things in this generator
Which is then mostly equivalent to just:
def generator1():
    yield from generator2()
    # more things on this generator
In English: when used inside a generator, yield from yields each element of another iterable, as if that item were coming from the first generator, from the point of view of the code calling the first generator.
The main reason for its creation is to allow easy refactoring of code that relies heavily on iterators. Code which uses ordinary functions always could, at very little extra cost, have blocks of one function refactored into other functions, which are then called. That divides tasks, simplifies reading and maintaining the code, and allows for more reusability of small code snippets.
So, large functions like this:
def func1():
    # some calculation
    for i in somesequence:
        # complex calculation using i
        # ...
        # ...
        # ...
    # some more code to wrap up results
    # finalizing
    # ...
Can become code like this, without drawbacks:
def func2(i):
    # complex calculation using i
    # ...
    # ...
    # ...
    return calculated_value
def func1():
    # some calculation
    for i in somesequence:
        func2(i)
    # some more code to wrap up results
    # finalizing
    # ...
When getting to iterators however, the form
def generator1():
    for item in generator2():
        yield item
    # do more things in this generator
for item in generator1():
    # do things
requires that for each item consumed from generator2, the running context be first switched to generator1, nothing is done in that context, and the context has to be switched to generator2; when that one yields a value, there is another intermediate context switch to generator1, before getting the value to the actual code consuming those values.
With yield from these intermediate context switches are avoided, which can save quite some resources if there are a lot of iterators chained: the context switches straight from the context consuming the outermost generator to the innermost generator, skipping the context of the intermediate generators altogether, until the inner ones are exhausted.
Later on, the language took advantage of this "tunneling" through intermediate contexts to use these generators as co-routines: functions that can make asynchronous calls. With the proper framework in place, as described in https://www.python.org/dev/peps/pep-3156/ , these co-routines are written in such a way that when they call a function that would take a long time to resolve (due to a network operation, or a CPU-intensive operation that can be offloaded to another thread), that call is made with a yield from statement. The framework main loop then arranges for the called expensive function to be properly scheduled, and retakes execution (the framework main loop is always the code calling the co-routines themselves). When the expensive result is ready, the framework makes the called co-routine behave like an exhausted generator, and execution of the first co-routine resumes.
From the programmer's point of view it is as if the code was running straight forward, with no interruptions. From the process point of view, the co-routine was paused at the point of the expensive call, and other (possibly parallel calls to the same co-routine) continued running.
So, one might write as part of a web crawler some code along:
@asyncio.coroutine
def crawler(url):
    page_content = yield from async_http_fetch(url)
    urls = parse(page_content)
    ...
Which could fetch tens of html pages concurrently when called from the asyncio loop.
Python 3.4 added the asyncio module to the stdlib as the default provider for this kind of functionality. It worked so well, that in Python 3.5 several new keywords were added to the language to distinguish co-routines and asynchronous calls from the generator usage, described above. These are described in https://www.python.org/dev/peps/pep-0492/
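For comparison, here is a rough sketch of how the crawler fragment above might look with the PEP 492 keywords. `fake_http_fetch` and `crawl_all` are hypothetical names for this illustration, and `asyncio.sleep` merely simulates network latency instead of doing a real HTTP request.

```python
import asyncio


async def fake_http_fetch(url):
    """Hypothetical stand-in for a real asynchronous HTTP client call."""
    await asyncio.sleep(0.01)  # suspends this coroutine, the loop runs others
    return f"<html>{url}</html>"


async def crawler(url):
    # "await" replaces the "yield from" of the generator-based version
    page_content = await fake_http_fetch(url)
    return len(page_content)


async def crawl_all(urls):
    # gather() runs the coroutines concurrently and preserves input order
    return await asyncio.gather(*(crawler(u) for u in urls))


sizes = asyncio.run(crawl_all(["a", "bb"]))
```

The structure is unchanged from the `yield from` version; the dedicated keywords just make it impossible to confuse a coroutine with an ordinary generator.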
Here is an example that illustrates it:
>>> def g():
...     yield from range(5)
...
>>> list(g())
[0, 1, 2, 3, 4]
>>> def g():
...     yield range(5)
...
>>> list(g())
[range(0, 5)]
>>>
yield from yields each item of the iterable, but yield yields the iterable itself.
The difference is simple:
yield:
[extra info, if you know the working of generator you can skip that]
yield is used to produce a single value from the generator function. Calling the generator function returns a generator object; when that generator is advanced and a yield statement is encountered, it temporarily suspends the execution of the function, returns the value to the caller, and saves its current state. The next time the generator is advanced (for example with next()), it resumes execution from where it left off, and continues until it hits the next yield statement.
In the example below, generator1 and generator2 each yield plain values. combined_generator is also a generator, but its values come from those two nested generators; to pass their values straight through to the caller, it uses yield from:
class Gen:
    def generator1(self):
        yield 1
        yield 2
        yield 3
    def generator2(self):
        yield 'a'
        yield 'b'
        yield 'c'
    def combined_generator(self):
        """
        This function delegates to two other generators, so we use
        `yield from` so that our end function can directly consume the values.
        """
        yield from self.generator1()
        yield from self.generator2()
    def run(self):
        print("Gen running ...")
        for item in self.combined_generator():
            print(item)
g = Gen()
g.run()
The output of above is:
Gen running ...
1
2
3
a
b
c
I want to read and process some data from an external service. I ask the service if there is any data, if something was returned I process it and ask again (so data can be processed immediately when it's available) and otherwise I wait for a notification that data is available. This can be written as an infinite loop:
def loop(self):
    while True:
        data = yield self.get_data_nonblocking()
        if data is not None:
            yield self.process_data(data)
        else:
            yield self.data_available
def on_data_available(self):
    self.data_available.fire()
How can data_available be implemented here? It could be a Deferred but a Deferred cannot be reset, only recreated. Are there better options?
Can this loop be integrated into the Twisted event loop? I can read and process data right in on_data_available and write some code instead of the loop checking get_data_nonblocking but I feel like then I'll need some locks to make sure data is processed in the same order it arrives (the code above enforces it because it's the only place where it's processed). Is this a good idea at all?
Consider the case of a TCP connection. The receiver buffer for a TCP connection can either have data in it or not. You can get that data, or get nothing, without blocking by using the non-blocking socket API:
data = socket.recv(1024)
if data:
    self.process_data(data)
You can wait for data to be available using select() (or any of the basically equivalent APIs):
socket.setblocking(False)
while True:
    data = socket.recv(1024)
    if data:
        self.process_data(data)
    else:
        select([socket], [], [])
Of these, only select() is particularly Twisted-unfriendly (though the Twisted idiom is certainly not to make your own socket.recv calls). You could replace the select call with a Twisted-friendly version though (implement a Protocol with a dataReceived method that fires a Deferred - sort of like your on_data_available method - toss in some yields and make this whole thing an inlineCallbacks generator).
But though that's one way you can get data from a TCP connection, that's not the API that Twisted encourages you to use to do so. Instead, the API is:
class SomeProtocol(Protocol):
    def dataReceived(self, data):
        # Your logic here
I don't see how your case is substantially different. What if, instead of the loop you wrote, you did something like this:
class YourDataProcessor(object):
    def process_data(self, data):
        # Your logic here
class SomeDataGetter(object):
    def __init__(self, processor):
        self.processor = processor
    def on_available_data(self):
        data = self.get_data_nonblocking()
        if data is not None:
            self.processor.process_data(data)
Now there are no Deferreds at all (except perhaps in whatever implements on_available_data or get_data_nonblocking but I can't see that code).
If you leave this roughly as-is, you are guaranteed of in-ordered execution because Twisted is single-threaded (except in a couple places that are very clearly marked) and in a single-threaded program, an earlier call to process_data must complete before any later call to process_data could be made (excepting, of course, the case where process_data reentrantly invokes itself - but that's another story).
If you switch this back to using inlineCallbacks (or any equivalent "coroutine" flavored drink mix) then you are probably introducing the possibility of out-of-order execution.
For example, if get_data_nonblocking returns a Deferred and you write something like this:
@inlineCallbacks
def on_available_data(self):
    data = yield self.get_data_nonblocking()
    if data is not None:
        self.processor.process_data(data)
Then you have changed on_available_data to say that a context switch is allowed when calling get_data_nonblocking. In this case, depending on your implementation of get_data_nonblocking and on_available_data, it's entirely possible that:
on_available_data is called
get_data_nonblocking is called and returns a Deferred
on_available_data tells execution to switch to another context (via yield / inlineCallbacks)
on_available_data is called again
get_data_nonblocking is called again and returns a Deferred (perhaps the same one! perhaps a new one! depends on how it's implemented)
The second invocation of on_available_data tells execution to switch to another context (same reason)
The reactor spins around for a while and eventually an event arrives that causes the Deferred returned by the second invocation of get_data_nonblocking to fire.
Execution switches back to the second on_available_data frame
process_data is called with whatever data the second get_data_nonblocking call returned
Eventually the same things happen to the first set of objects and process_data is called again with whatever data the first get_data_nonblocking call returned
Now perhaps you've processed data out of order - again, this depends on more details of other parts of your system.
If so, you can always re-impose order. There are a lot of different possible approaches to this. Twisted itself doesn't come with any APIs that are explicitly in support of this operation so the solution involves writing some new code. Here's one idea (untested) for an approach - a queue-like class that knows about object sequence numbers:
class SequencedQueue(object):
    """
    A queue-like type which guarantees objects come out of the queue in the order
    defined by a sequence number associated with the objects when they are put into
    the queue.
    Application code manages sequence number assignment so that sequence numbers don't
    have to have the same order as `put` calls on this type.
    """
    def __init__(self):
        # The sequence number of the object that should be given out
        # by the next call to `get`
        self._next_sequence = 0
        # The sequence number of the next result that needs to be provided.
        self._next_result = 0
        # A holding area for objects past _next_sequence
        self._queue = {}
        # A holding area for Deferreds given out by `get` that have not fired yet
        self._waiting = {}
    def put(self, sequence, object):
        """
        Put an object into the queue at a particular point in the sequence.
        """
        if sequence < self._next_sequence:
            # Programming error. The sequence number
            # of the object being put has already been used.
            raise ...
        self._queue[sequence] = object
        self._check_waiters()
    def get(self):
        """
        Get an object from the queue which has the next sequence number
        following whatever was previously gotten.
        """
        result = self._waiting[self._next_sequence] = Deferred()
        self._next_sequence += 1
        self._check_waiters()
        return result
    def _check_waiters(self):
        """
        Find any Deferreds previously given out by get calls which can now be given
        their results and give them to them.
        """
        while True:
            seq = self._next_result
            if seq in self._queue and seq in self._waiting:
                self._next_result += 1
                # XXX Probably a re-entrancy bug here. If a callback calls back in to
                # put then this loop might run recursively
                self._waiting.pop(seq).callback(self._queue.pop(seq))
            else:
                break
The expected behavior (modulo any bugs I accidentally added) is something like:
q = SequencedQueue()
d1 = q.get()
d2 = q.get()
# Nothing in particular happens
q.put(1, "second result")
# d1 fires with "first result" and afterwards d2 fires with "second result"
q.put(0, "first result")
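For readers on asyncio rather than Twisted, here is a rough sketch of the same idea using asyncio futures in place of Deferreds. The class name AsyncSequencedQueue and the choice of ValueError are my own assumptions for illustration, not part of either library. Because asyncio schedules future callbacks on the event loop instead of running them inline, the re-entrancy concern noted above should not arise in this variant.

```python
import asyncio


class AsyncSequencedQueue:
    """asyncio analogue of the sequenced-queue idea (hypothetical, illustrative only)."""

    def __init__(self):
        self._next_sequence = 0  # sequence number handed to the next get()
        self._next_result = 0    # sequence number of the next result to deliver
        self._queue = {}         # results that have arrived, keyed by sequence number
        self._waiting = {}       # futures handed out by get(), keyed by sequence number

    def put(self, sequence, obj):
        # Reject sequence numbers that were already delivered or already stored.
        if sequence < self._next_result or sequence in self._queue:
            raise ValueError(f"sequence number {sequence} already used")
        self._queue[sequence] = obj
        self._check_waiters()

    def get(self):
        # Must be called while an event loop is running.
        future = asyncio.get_running_loop().create_future()
        self._waiting[self._next_sequence] = future
        self._next_sequence += 1
        self._check_waiters()
        return future

    def _check_waiters(self):
        # Deliver results strictly in sequence order, as far as possible.
        while self._next_result in self._queue and self._next_result in self._waiting:
            seq = self._next_result
            self._next_result += 1
            self._waiting.pop(seq).set_result(self._queue.pop(seq))


async def demo():
    q = AsyncSequencedQueue()
    f1 = q.get()
    f2 = q.get()
    q.put(1, "second result")   # nothing is delivered yet: sequence 0 is missing
    early = (f1.done(), f2.done())
    q.put(0, "first result")    # now both futures resolve, in sequence order
    return early, [await f1, await f2]


early, results = asyncio.run(demo())
```

Here `early` records that neither future is done after the out-of-order put, and `results` holds the two values in sequence order.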
Using this, just make sure you assign sequence numbers in the order you want data dispatched rather than the order it actually shows up somewhere. For example:
@inlineCallbacks
def on_available_data(self):
    sequence = self._process_order
    data = yield self.get_data_nonblocking()
    if data is not None:
        self._process_order += 1
        self.sequenced_queue.put(sequence, data)
Elsewhere, some code can consume the queue sort of like:
@inlineCallbacks
def queue_consumer(self):
    while True:
        yield self.process_data(yield self.sequenced_queue.get())
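If you only need "assign a sequence number at dispatch time, process in that order" and are not tied to Twisted, the pattern can be sketched with plain asyncio. This is only an illustration under my own assumptions: fetch is a hypothetical stand-in for the non-blocking data source, and the sleeps simulate results arriving out of order.

```python
import asyncio


async def fetch(sequence, delay):
    # Hypothetical stand-in for a non-blocking read; the delay simulates latency.
    await asyncio.sleep(delay)
    return sequence, f"data-{sequence}"


async def process_in_order(jobs):
    buffered = {}       # results that arrived early, keyed by sequence number
    next_sequence = 0   # the sequence number we are allowed to process next
    processed = []
    for finished in asyncio.as_completed(jobs):
        sequence, data = await finished
        buffered[sequence] = data
        # Drain everything that is now contiguous with what was already processed.
        while next_sequence in buffered:
            processed.append(buffered.pop(next_sequence))
            next_sequence += 1
    return processed


# Sequence numbers follow dispatch order; later jobs are given shorter delays
# so that they tend to finish first.
delays = [0.03, 0.02, 0.01]
ordered = asyncio.run(process_in_order([fetch(i, d) for i, d in enumerate(delays)]))
```

Whatever order the jobs complete in, `ordered` follows the sequence numbers, because early arrivals sit in the buffer until their turn comes.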
I'm implementing a utility library that acts as a sort of task manager, intended to run within the distributed environment of the Google App Engine cloud computing service (it uses a combination of task queues and memcache to execute background processing). I plan to use generators to control the execution of tasks, essentially enforcing a non-preemptive "concurrency" via the use of yield in the user's code.
The trivial example - processing a bunch of database entities - could be something like the following:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield
As we know, yield is a two-way communication channel: it lets a generator pass values to the code that drives it. This makes it possible to simulate a "preemptive API" such as the SLEEP call below:
def run(self):
    for e in self.entity_query:
        do_something_with(e)
        yield Worker.SLEEP, timedelta(seconds=1)
But this is ugly. It would be great to hide the yield within a separate function which could be invoked in a simple way:
self.sleep(timedelta(seconds=1))
The problem is that putting yield in the sleep function turns it into a generator function. The call above would therefore just return another generator. Only after adding .next() and yielding the value back again would we obtain the previous result:
yield self.sleep(timedelta(seconds=1)).next()
which is of course even uglier and more verbose than before.
Hence my question: Is there a way to put yield into a function without turning it into a generator function, while still letting other generators yield the values it computes?
You seem to be missing the obvious:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield self.sleep(timedelta(seconds=1))

    def sleep(self, wait):
        return Worker.SLEEP, wait
It's the yield that turns functions into generators, so it's impossible to leave it out.
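For what it's worth, on Python 3.3 and later (not the Python 2 runtime this question targets) `yield from` lets a helper contain the yield while the caller still has to be a generator. A minimal sketch, where SLEEP is a hypothetical stand-in for Worker.SLEEP and the function names are illustrative:

```python
from datetime import timedelta

SLEEP = "SLEEP"  # hypothetical stand-in for Worker.SLEEP


def sleep(wait):
    # The helper is itself a generator; it yields the command to the scheduler.
    yield SLEEP, wait


def run(entities):
    for e in entities:
        # ... do something with e ...
        # Delegate to the helper: whatever it yields is passed up to our caller.
        yield from sleep(timedelta(seconds=1))


commands = list(run(["a", "b"]))
```

The caller still needs a yield-like keyword at the call site, so this only moves the tuple construction into the helper; it does not remove the yield from run.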
To hide the yield you need a higher order function, in your example it's map:
from itertools import imap

def slowmap(f, sleep, *iters):
    for row in imap(f, self.entity_query):
        yield Worker.SLEEP, wait

def run():
    return slowmap(do_something_with,
                   (Worker.SLEEP, timedelta(seconds=1)),
                   self.entity_query)
Alas, this won't work. But a "middle-way" could be fine:
def sleepjob(self, *a, **k):
    if a:
        return Worker.SLEEP, a[0]
    else:
        return Worker.SLEEP, timedelta(**k)
So
yield self.sleepjob(timedelta(seconds=1))
yield self.sleepjob(seconds=1)
looks OK to me.
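To make the "middle way" concrete, here is a small self-contained sketch of how a driver might consume such commands. Everything outside sleepjob is an assumption on my part: SLEEP stands in for Worker.SLEEP, and drive is a hypothetical scheduler loop that just adds up the requested sleep time instead of actually sleeping.

```python
from datetime import timedelta

SLEEP = "SLEEP"  # hypothetical stand-in for Worker.SLEEP


def sleepjob(*a, **k):
    # Accept either a ready-made timedelta or timedelta keyword arguments.
    if a:
        return SLEEP, a[0]
    return SLEEP, timedelta(**k)


def run(entities, processed):
    for e in entities:
        processed.append(e)        # stand-in for do_something_with(e)
        yield sleepjob(seconds=1)  # hand a command back to the driver


def drive(worker):
    # Hypothetical driver: interpret each yielded command; a bare yield is ignored.
    total_sleep = timedelta()
    for command in worker:
        if command is None:
            continue
        kind, wait = command
        if kind == SLEEP:
            total_sleep += wait    # a real scheduler would reschedule the task here
    return total_sleep


processed = []
total = drive(run([1, 2, 3], processed))
```

Both call styles produce the same command tuple, so the driver does not need to care which one the worker used.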
I would suggest you have a look at ndb. It uses generators as coroutines (as you are proposing here), allowing you to write programs that work with RPCs asynchronously.
The API does this by wrapping the generator with another function that "primes" the generator (it calls .next() immediately so that the code begins execution). The tasklets are also designed to work with App Engine's RPC infrastructure, making it possible to use any of the existing asynchronous API calls.
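To illustrate what "priming" means in general, here is a generic sketch in Python 3 syntax. It is not ndb's actual implementation; the decorator name primed and the accumulator example are made up for the demonstration.

```python
import functools


def primed(genfunc):
    """Wrap a generator function so that its body starts running immediately."""
    @functools.wraps(genfunc)
    def wrapper(*args, **kwargs):
        gen = genfunc(*args, **kwargs)
        next(gen)  # advance to the first yield; its yielded value is discarded
        return gen
    return wrapper


@primed
def accumulator(log):
    log.append("started")  # runs at call time because of the priming
    total = 0
    while True:
        value = yield total
        total += value


log = []
acc = accumulator(log)
started_immediately = list(log)  # snapshot taken before anything is sent
first = acc.send(5)
second = acc.send(2)
```

Without the decorator, nothing in the body of accumulator would run until the first next() or send(None) call.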
With the concurrency model used in ndb, you yield either a future object (similar to what is described in PEP 3148) or an App Engine RPC object. When that RPC has completed, the function that yielded the object is allowed to continue executing.
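The yield-a-future model can be sketched with the standard library's PEP 3148 futures. This is only an illustration of the control flow and not how ndb is implemented: run_tasklet is a hypothetical driver that simply blocks on each yielded future, whereas a real event loop would switch to other tasklets while waiting.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tasklet(gen):
    """Drive a generator that yields futures, resuming it with each result."""
    try:
        future = next(gen)  # run up to the first yielded future
        while True:
            # Wait for the future, then send its result back into the generator.
            future = gen.send(future.result())
    except StopIteration as stop:
        return stop.value   # the generator's return value (Python 3.3+)


def tasklet(pool):
    a = yield pool.submit(pow, 2, 5)  # stands in for an asynchronous RPC
    b = yield pool.submit(pow, 3, 2)
    return a + b


with ThreadPoolExecutor(max_workers=2) as pool:
    result = run_tasklet(tasklet(pool))
```

From the tasklet's point of view each `yield` looks like a blocking call that returns the result, which is what makes this style pleasant to write.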
If you are using a model derived from ndb.model.Model then the following will allow you to asynchronously iterate over a query:
from ndb import tasklets

@tasklets.tasklet
def run():
    it = iter(Entity.query())
    # Other tasklets will be allowed to run if the next call has to wait for an rpc.
    while (yield it.has_next_async()):
        entity = it.next()
        do_something_with(entity)
Although ndb is still considered experimental (some of its error handling code still needs some work), I would recommend you have a look at it. I have used it in my last 2 projects and found it to be an excellent library.
Make sure you read through the documentation linked from the main page, and also the companion documentation for the tasklet stuff.