I'm implementing a utility library that is a sort of task manager intended to run within the distributed environment of the Google App Engine cloud computing service. (It uses a combination of task queues and memcache to execute background processing.) I plan to use generators to control the execution of tasks, essentially enforcing a non-preemptive "concurrency" via the use of yield in the user's code.
The trivial example - processing a bunch of database entities - could be something like the following:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield
As we know, yield is a two-way communication channel, allowing values to be passed to the code that drives the generator. This makes it possible to simulate a "preemptive API" such as the SLEEP call below:
def run(self):
    for e in self.entity_query:
        do_something_with(e)
        yield Worker.SLEEP, timedelta(seconds=1)
But this is ugly. It would be great to hide the yield inside a separate function that could be invoked in a simple way:
self.sleep(timedelta(seconds=1))
The problem is that putting yield in the sleep function turns it into a generator function. The call above would therefore just return another generator. Only after adding .next() and yield back again would we obtain the previous result:
yield self.sleep(timedelta(seconds=1)).next()
which is of course even more ugly and unnecessarily verbose than before.
Hence my question: is there a way to put yield into a function without turning it into a generator function, while still letting other generators use it to yield the values it computes?
You seem to be missing the obvious:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield self.sleep(timedelta(seconds=1))

    def sleep(self, wait):
        return Worker.SLEEP, wait
It's the yield itself that turns a function into a generator; it's impossible to leave it out.
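A minimal illustration of that point (a rough sketch, with a stand-in Worker class defined only for this demo):

from datetime import timedelta

class Worker:
    SLEEP = "sleep"  # stand-in constant, just for this demo

def sleep(wait):
    yield Worker.SLEEP, wait

g = sleep(timedelta(seconds=1))
print(type(g))   # <class 'generator'> -- calling sleep() did not run its body
print(next(g))   # (Worker.SLEEP, timedelta(seconds=1)) -- only next() executes the body

Calling a function that contains yield never executes the body; it only builds a generator object.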
To hide the yield you need a higher-order function; in your example it's map:
from itertools import imap

def slowmap(f, sleep, *iters):
    for row in imap(f, *iters):
        yield sleep

def run(self):
    return slowmap(do_something_with,
                   (Worker.SLEEP, timedelta(seconds=1)),
                   self.entity_query)
Alas, this won't work. But a "middle-way" could be fine:
def sleepjob(*a, **k):
    if a:
        return Worker.SLEEP, a[0]
    else:
        return Worker.SLEEP, timedelta(**k)
So
yield self.sleepjob(timedelta(seconds=1))
yield self.sleepjob(seconds=1)
look OK to me.
I would suggest you have a look at the ndb library. It uses generators as coroutines (as you are proposing here), allowing you to write programs that work with RPCs asynchronously.
The API does this by wrapping the generator with another function that 'primes' the generator (it calls .next() immediately so that the code begins execution). The tasklets are also designed to work with App Engine's RPC infrastructure, making it possible to use any of the existing asynchronous API calls.
With the concurrency model used in ndb, you yield either a future object (similar to what is described in PEP 3148) or an App Engine RPC object. When that RPC has completed, execution in the function that yielded the object is allowed to continue.
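For example, a tasklet that waits on the future returned by an asynchronous key lookup could look roughly like this (my sketch of the ndb tasklet style, not code from the question):

from ndb import tasklets

@tasklets.tasklet
def get_entity_async(key):
    # Yielding the future suspends this tasklet until the underlying RPC
    # completes; the yield expression then evaluates to the RPC's result.
    entity = yield key.get_async()
    raise tasklets.Return(entity)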
If you are using a model derived from ndb.model.Model then the following will allow you to asynchronously iterate over a query:
from ndb import tasklets

@tasklets.tasklet
def run():
    it = iter(Entity.query())
    # Other tasklets will be allowed to run if the next call has to wait for an RPC.
    while (yield it.has_next_async()):
        entity = it.next()
        do_something_with(entity)
Although ndb is still considered experimental (some of its error handling code still needs some work), I would recommend you have a look at it. I have used it in my last 2 projects and found it to be an excellent library.
Make sure you read through the documentation linked from the main page, and also the companion documentation for the tasklet stuff.
Related
I am trying to organize my code by removing a lot of repetitive logic. I felt like this would be a great use case for a context manager.
When making certain updates in the system, four things always happen:
We lock the resource to prevent concurrent updates
We wrap our logic in a database transaction
We validate the data making sure the update is permissible
After the function executes we add history rows to the database
I want a wrapper to encapsulate this like below
from contextlib import contextmanager

@contextmanager
def my_manager(resource_id):
    try:
        with lock(resource_id), transaction.atomic():
            validate_data(resource_id)
            resources = yield
            create_history_objects(resources)
    except LockError:
        raise CustomError

def update(resource_id):
    with my_manager(resource_id):
        _update(resource_id)

def _update(resource_id):
    # do something
    return resource
Everything works as expected except my ability to access resources in the context manager, which are None. The resources are returned from the function that is called during the yield statement.
What is the proper way to access those resources through yield, or perhaps with another utility? Thanks
I have a Python script that generates [str, float] tuples which are then indexed into ElasticSearch using a custom function which eventually calls helper.streaming_bulk().
This is how the generator is implemented:
doc_ids: List[str] = [...]
docs = ((doc_id, get_value(doc_id)) for doc_id in doc_ids)
get_value() calls a remote service that computes a float value per document id.
Next, these tuples are passed on to update_page_quality_bulk():
for success, item in update_page_quality_bulk(
        islice(doc_qualities, size)
):
    total_success += success
    if not success:
        logging.error(item)
Internally, update_page_quality_bulk() creates the ElasticSearch requests.
One of the advantages of using a generator here is that the first size elements can be fed into update_page_quality_bulk() through islice().
In order to make the entire process faster, I would like to parallelize the get_value() calls. As mentioned, these are remote calls, so the local compute cost is negligible, but the duration is significant.
The order of the tuples does not matter, neither which elements are passed into update_page_quality_bulk(). On a high level, I would like to make the get_value() calls (up to x in parallel) for any n tuples and pass on whichever ones are finished first.
My naive attempt was to define get_value() as asynchronous:
async def get_value():
...
and await the call in the generator:
docs = ((doc_id, await get_value(doc_id)) for doc_id in doc_ids)
However, this raises an error in the subsequent islice() call:
TypeError: 'async_generator' object is not iterable
Removing the islice call and passing the unmodified docs generator to update_page_quality_bulk() causes the same error to be raised when looping over the tuples to convert them into ElasticSearch requests.
I am aware that the ElasticSearch client provides asynchronous helpers, but they don't seem applicable here because I need to generate the actions first.
According to this answer, it seems like I have to change the implementation to using a queue.
This answer implies that it cannot be done without multiprocessing because of the Python GIL, but that answer is not marked as correct and is quite old, too.
Generally, I am looking for a way to change the current logic as little as possible while parallelizing the get_value() calls.
So, you want to pass a "synchronous-looking" generator to a call that expects a normal lazy generator, such as islice, and keep getting the results for it in parallel.
It sounds like a job for asyncio.as_completed: you use your plain generator to create tasks, these are run in parallel by the asyncio machinery, and the results are made available as the tasks are completed (d'oh!).
However, since update_page_quality_bulk is not asyncio-aware, it will never yield control back to the asyncio loop, so the loop cannot complete the tasks that have their results ready. This would likely block.
Calling update_page_quality_bulk in another thread probably won't work either. I did not try it here, but I'd say you can't just iterate over docs in a thread other than the one where it (and its tasks) were created.
So, first things first: the generator-expression syntax does not work when you want some terms of the generator to be computed asynchronously, as you found out. We refactor that so the tuples are created in a coroutine function, and we wrap all calls to it in tasks (some of the asyncio functions do the wrapping into a task automatically).
Then we can use the asyncio machinery to schedule all the calls and invoke update_page_quality_bulk as the results arrive. The problem is that as_completed, as stated above, can't be passed directly to a non-async function: the asyncio loop would never get control back. Instead, we keep picking up the results of the tasks in the main thread and call the sync function in another thread, using a queue to pass the fetched results. And finally, so that the results can be consumed as they become available inside update_page_quality_bulk, we create a small wrapper class around queue.Queue so that it can be consumed as an iterator; this is transparent to the code consuming the iterator.
# example code: untested
import asyncio
import logging
import queue
import threading

async def get_doc_values(doc_id):
    loop = asyncio.get_running_loop()
    # run_in_executor runs the synchronous function in parallel in a thread pool.
    # Check the docs - you might want to pass a custom executor with more than
    # the default number of workers, instead of None:
    return doc_id, await loop.run_in_executor(None, get_value, doc_id)

def update_es(iterator):
    # this function runs in a separate thread
    total_success = 0
    for success, item in update_page_quality_bulk(iterator):
        total_success += success
        if not success:
            logging.error(item)

sentinel = Ellipsis  # "...": a nice sentinel that also works for multiprocessing

class Iterator:
    """Lets the queue, fed in the main thread as the tasks are completed,
    behave like an ordinary iterator, which can be consumed by
    update_page_quality_bulk in another thread.
    """
    def __init__(self, source_queue):
        self.source = source_queue

    def __iter__(self):
        return self

    def __next__(self):
        value = self.source.get()
        if value is sentinel:
            raise StopIteration()
        return value

async def main():
    results_queue = queue.Queue()
    iterator = Iterator(results_queue)
    es_worker = threading.Thread(target=update_es, args=(iterator,))
    es_worker.start()
    for doc_value_task in asyncio.as_completed(
            [get_doc_values(doc_id) for doc_id in doc_ids]):
        doc_value = await doc_value_task
        results_queue.put(doc_value)
    results_queue.put(sentinel)  # signal the consumer thread that we are done
    es_worker.join()

asyncio.run(main())
Question:
Essentially, I want to return a unique result from the database every time a view is called (until I run out of unique objects and have to start over). I was thinking that a simple and elegant solution would be to use a generator to handle this. Is this possible, and if so, how can it be approached with regard to pulling values from the ORM?
Note:
I think sessions or utilizing a design pattern like Memento may be a solution here, but I'm really curious to see if and how Python generators could be used in this context.
As Django is synchronous WSGI, you have to process each request as stand-alone; your Python environment can be killed or switched to another one at any time.
Still, if you have no fear and a single process, you can make a file-scope dictionary of session ids and iterators that you consume each time:
from django.shortcuts import render
from collections import defaultdict
import uuid

def iterator():
    for item in DatabaseTable.objects.all():
        yield item

sessions_current_iterators = defaultdict(iterator)

def my_view(request):
    iterator_id = request.session.get("iterator_id")
    if iterator_id is None:
        iterator_id = str(uuid.uuid4())
        request.session["iterator_id"] = iterator_id
    try:
        return render(request, "item_template.html",
                      {"item": next(sessions_current_iterators[iterator_id])})
    except StopIteration:
        request.session.pop("iterator_id")
        return render(request, "end_template.html", {})
But: NEVER USE THIS IN A PRODUCTION ENVIRONMENT!
Generators are great for reducing memory consumption while computing a request, and they can be good for a Tornado web service, but Django clearly should not share data between requests in local variables.
You can always use yield where you can use return (since these are Python features, not Django features). The only caveat is that the same function is called for every request, so the continuation after the yield may serve another client instead of the one you intend.

You can get around this problem by using a higher-level function (a generator here). Basically, the function keeps a dictionary of generators indexed by unique keys derived from the requests. Every time the function is called, it checks whether an entry already exists for the request in the dictionary; if not, it adds a new generator for that request. It then invokes the generator for the given request, stores whatever is yielded or returned by it, and yields that stored value itself so the dictionary stays in memory.

Finally, so that the dictionary is not cleared every time the main function is called, the function body starts by initializing the dictionary to an empty dictionary and then wraps everything else in an infinite while loop. This ensures that the main function, which is also a generator, never really exits. When called the first time, the dictionary is initialized and the while loop starts. Inside the loop, the function creates and stores a generator in the dictionary if no entry already exists for the given request, then invokes the generator for that request and yields whatever it produces at the bottom of the loop. When called again, the main function resumes at the top of the while loop. The code is like so:
def main_func(request, *args):
    funcs = {}
    while True:
        request_key = make_key(request)
        if request_key not in funcs:
            def generator_func():
                # your generator code here...
                # remember to delete the funcs entry before returning...
                yield
            funcs[request_key] = generator_func()  # store the generator object
        yield next(funcs[request_key])

def make_key(request):
    # quick and dirty impl
    return str(request.session)
I want to make sure I understand how to create tasklets and asynchronous methods. What I have is a method that returns a list. I want it to be called from somewhere and to immediately allow other calls to be made. So I have this:
future_1 = get_updates_for_user(userKey, aDate)
future_2 = get_updates_for_user(anotherUserKey, aDate)
somelist.extend(future_1)
somelist.extend(future_2)
....

@ndb.tasklet
def get_updates_for_user(userKey, lastSyncDate):
    noteQuery = ndb.GqlQuery('SELECT * FROM Comments WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastSyncDate)
    note_list = list()
    qit = noteQuery.iter()
    while (yield qit.has_next_async()):
        note = qit.next()
        noteDic = note.to_dict()
        note_list.append(noteDic)
    raise ndb.Return(note_list)
Is this code doing what I'd expect it to do? Namely, will the two calls run asynchronously? Am I using futures correctly?
Edit: Well, after testing, the code does produce the desired results. I'm a newbie to Python - what are some ways to test whether the methods are running asynchronously?
It's pretty hard to verify for yourself that the methods are running concurrently -- you'd have to put copious logging in. Also, in the dev appserver it'll be even harder, since it doesn't really run RPCs in parallel.
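If you do go the logging route, a rough sketch (my own illustration, not part of the original answer) is to log timestamps at the start and end of each tasklet and look for interleaving between the two calls:

import logging
import time

@ndb.tasklet
def get_updates_for_user_async(userKey, lastSyncDate):
    logging.info("start %s at %.3f", userKey, time.time())
    note_list = []
    qit = ndb.GqlQuery('SELECT * FROM Comments WHERE ANCESTOR IS :1 AND modifiedDate > :2',
                       userKey, lastSyncDate).iter()
    while (yield qit.has_next_async()):
        note_list.append(qit.next().to_dict())
    logging.info("end %s at %.3f", userKey, time.time())
    raise ndb.Return(note_list)

If the calls overlap, the log will show the "start" line for the second user before the "end" line for the first.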
Your code looks okay; it uses yield in the right place.
My only recommendation is to name your function get_updates_for_user_async() -- that matches the convention NDB itself uses and is a hint to the reader of your code that the function returns a Future and should be yielded to get the actual result.
An alternative way to do this is to use the map_async() method on the Query object; it would let you write a callback that just contains the to_dict() call:
@ndb.tasklet
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    note_list = yield noteQuery.map_async(lambda note: note.to_dict())
    raise ndb.Return(note_list)
Advanced tip: you can simplify this even more by dropping the #ndb.tasklet decorator and just returning the Future returned by map_async():
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    return noteQuery.map_async(lambda note: note.to_dict())
This is a general slight optimization for async functions that contain only one yield and immediately return the yielded value. (If you don't immediately get this, you're in good company, and it runs the risk of being broken by a future maintainer who doesn't get it either. :-)
I'm trying to convert a synchronous library to use an internal asynchronous IO framework. I have several methods that look like this:
def foo():
    ....
    sync_call_1()  # synchronous blocking call
    ....
    sync_call_2()  # synchronous blocking call
    ....
    return bar
For each of the synchronous functions (sync_call_*), I have written a corresponding async function that takes a callback. E.g.
def async_call_1(callback=None):
    # do the I/O
    callback()
Now for the Python newbie question -- what's the easiest way to translate the existing methods to use these new async methods instead? That is, the method foo() above needs to become:
def async_foo(callback):
    # Do the foo() stuff using async_call_*
    callback()
One obvious choice is to pass a callback into each async method that effectively "resumes" the calling foo function, and then call the final callback at the very end of the method. However, that makes the code brittle and ugly, and I would need to add a new callback for every call to an async_call_* method.
Is there an easy way to do that using a python idiom, such as a generator or coroutine?
UPDATE: take this with a grain of salt, as I'm out of touch with modern Python async developments, including gevent and asyncio, and don't actually have serious experience with async code.
There are three common approaches to thread-less async coding in Python:
Callbacks - ugly but workable; Twisted does this well.
Generators - nice, but they require all your code to follow the style.
Using a Python implementation with real tasklets - Stackless (RIP) and greenlet.
Unfortunately, ideally the whole program should use one style, or things become complicated. If you are OK with your library exposing a fully synchronous interface, you are probably fine, but if you want several calls to your library to work in parallel, especially in parallel with other async code, then you need a common event "reactor" that can work with all the code.
So if you have (or expect the user to have) other async code in the application, adopting the same model is probably smart.
If you don't want to understand the whole mess, consider using bad old threads. They are also ugly, but work with everything else.
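As a rough illustration of the thread route (my own sketch; call_1_in_thread is a hypothetical wrapper name around the question's sync_call_1), a blocking call can be given a callback-style interface like this:

import threading

def call_1_in_thread(callback):
    # Run the blocking sync_call_1 in a worker thread and hand its
    # result to the callback when it finishes.
    def worker():
        callback(sync_call_1())
    threading.Thread(target=worker).start()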
If you do want to understand how coroutines might help you - and how they might complicate things - David Beazley's "A Curious Course on Coroutines and Concurrency" is good stuff.
Greenlets might actually be the cleanest way if you can use the extension. I don't have any experience with them, so I can't say much.
There are several ways to multiplex tasks. We can't say which is best for your case without deeper knowledge of what you are doing. Probably the easiest and most universal way is to use threads. Take a look at this question for some ideas.
You need to make the function foo async as well. How about this approach?
@make_async
def foo(somearg, callback):
    # This function is now async. Expect a callback argument.
    ...

    # change
    # x = sync_call1(somearg, some_other_arg)
    # to the following:
    x = yield async_call1, somearg, some_other_arg

    ...

    # same transformation again
    y = yield async_call2, x

    ...

    # change
    # return bar
    # to a callback call
    callback(bar)
And make_async can be defined like this:
def make_async(f):
    """Decorator to convert a sync function to async
    using the above mentioned transformations."""
    def g(*a, **kw):
        async_call(f(*a, **kw))
    return g
def async_call(it, value=None):
    # This function is the core of the async transformation.
    try:
        # send the current value to the iterator and
        # expect the function to call and the args to pass to it
        x = it.send(value)
    except StopIteration:
        return
    func = x[0]
    args = list(x[1:])
    # define a callback and append it to args
    # (assuming that the callback is always the last argument)
    callback = lambda new_value: async_call(it, new_value)
    args.append(callback)
    func(*args)
CAUTION: I haven't tested this
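For what it's worth, driving the decorated function would look roughly like this (equally untested; done() and some_argument are hypothetical names, not part of the original code):

def done(result):
    # final callback; receives whatever foo passes to callback(bar)
    print("foo finished with: %s" % result)

# Because of @make_async, this call starts the generator-driven chain:
# each yield hands control to an async_call* helper, and done() runs at the end.
foo(some_argument, done)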