Creating an asynchronous method with Google App Engine's NDB - python

I want to make sure I understand how to create tasklets and asynchronous methods. What I have is a method that returns a list. I want it to be called from somewhere, and immediately allow other calls to be made. So I have this:
future_1 = get_updates_for_user(userKey, aDate)
future_2 = get_updates_for_user(anotherUserKey, aDate)
somelist.extend(future_1)
somelist.extend(future_2)
....
@ndb.tasklet
def get_updates_for_user(userKey, lastSyncDate):
    noteQuery = ndb.GqlQuery('SELECT * FROM Comments WHERE ANCESTOR IS :1 AND modifiedDate > :2', userKey, lastSyncDate)
    note_list = list()
    qit = noteQuery.iter()
    while (yield qit.has_next_async()):
        note = qit.next()
        noteDic = note.to_dict()
        note_list.append(noteDic)
    raise ndb.Return(note_list)
Is this code doing what I'd expect it to do? Namely, will the two calls run asynchronously? Am I using futures correctly?
Edit: Well after testing, the code does produce the desired results. I'm a newbie to Python - what are some ways to test to see if the methods are running async?

It's pretty hard to verify for yourself that the methods are running concurrently -- you'd have to put copious logging in. Also in the dev appserver it'll be even harder as it doesn't really run RPCs in parallel.
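One crude check is to log the wall-clock time around the pair of calls and compare it with the time the two queries take when run back to back; if the RPCs overlap, the combined time should be noticeably smaller. A minimal sketch (get_result() blocks until the corresponding future is done):

    import logging
    import time

    start = time.time()
    future_1 = get_updates_for_user(userKey, aDate)
    future_2 = get_updates_for_user(anotherUserKey, aDate)
    # Both queries are now in flight; get_result() waits for each one.
    somelist = future_1.get_result() + future_2.get_result()
    logging.info("both queries finished in %.3f seconds", time.time() - start)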
Your code looks okay, it uses yield in the right place.
My only recommendation is to name your function get_updates_for_user_async() -- that matches the convention NDB itself uses and is a hint to the reader of your code that the function returns a Future and should be yielded to get the actual result.
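For example, a caller that is itself a tasklet can kick off both queries and yield the two futures together, letting their RPCs overlap (the wrapper name below is just for illustration):

    @ndb.tasklet
    def get_updates_for_both_async(userKey, anotherUserKey, aDate):
        future_1 = get_updates_for_user_async(userKey, aDate)
        future_2 = get_updates_for_user_async(anotherUserKey, aDate)
        # Yielding a tuple of futures waits for all of them and gives back their results.
        list_1, list_2 = yield future_1, future_2
        raise ndb.Return(list_1 + list_2)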
An alternative way to do this is to use the map_async() method on the Query object; it would let you write a callback that just contains the to_dict() call:
@ndb.tasklet
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    note_list = yield noteQuery.map_async(lambda note: note.to_dict())
    raise ndb.Return(note_list)
Advanced tip: you can simplify this even more by dropping the @ndb.tasklet decorator and just returning the Future returned by map_async():
def get_updates_for_user_async(userKey, lastSyncDate):
    noteQuery = ndb.gql('...')
    return noteQuery.map_async(lambda note: note.to_dict())
This is a general slight optimization for async functions that contain only one yield and immediately return the value yielded. (If you don't immediately get this you're in good company, and it runs the risk of being broken by a future maintainer who doesn't either. :-)

Related

Mocking a function call within a function in Python

This is my first time building out unit tests, and I'm not quite sure how to proceed here. Here's the function I'd like to test; it's a method in a class that accepts one argument, url, and returns one string, task_id:
def url_request(self, url):
    conn = self.endpoint_request()
    authorization = conn.authorization
    response = requests.get(url, authorization)
    return response["task_id"]
The method starts out by calling another method within the same class to obtain a token to connect to an API endpoint. Should I be mocking the output of that call (self.endpoint_request())?
If I do have to mock it, and my test function looks like this, how do I pass a fake token/auth endpoint_request response?
@patch("common.DataGetter.endpoint_request")
def test_url_request(mock_endpoint_request):
    mock_endpoint_request.return_value = {"Auth": "123456"}
    # How do I pass the fake token/auth to this?
    task_id = DataGetter.url_request(url)
The code you have shown is strongly dominated by interactions, which means there will most likely be no bugs to find with unit-testing. The potential bugs are at the interaction level: you access conn.authorization - but is this the proper member? Does it already have the proper representation in the way you need it further on? Is requests.get the right method for the job? Is the argument order as you expect it? Is the return value as you expect it? Is task_id spelled correctly?
These are (some of) the potential bugs in your code. But with unit-testing you will not be able to find them: when you replace the depended-on components with mocks (which you create or configure yourself), your unit-tests will just succeed. Let's assume that you have a misconception about the return value of requests.get, namely that task_id is spelled wrongly and should rather be spelled taskId. If you mock requests.get, you would implement the mock based on your own misconception. That is, your mock would return a map with the (misspelled) key task_id. Then the unit-test would succeed despite the bug.
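To make that concrete, here is a sketch of such a test (module and class names are taken from the question; imagine the real service actually returns taskId). It passes even though the code under test reads the wrong key:

    from unittest.mock import patch
    from common import DataGetter  # assuming the module layout implied by the question

    @patch("common.requests.get")
    @patch("common.DataGetter.endpoint_request")
    def test_url_request_passes_despite_bug(mock_endpoint_request, mock_get):
        mock_endpoint_request.return_value.authorization = {"Auth": "123456"}
        # The mock encodes the same misconception as the code under test:
        # suppose the real service returns {"taskId": ...}, not {"task_id": ...}.
        mock_get.return_value = {"task_id": "42"}
        # The test passes, and the misspelling goes unnoticed.
        assert DataGetter().url_request("https://example.com/job") == "42"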
You will only find that bug with integration testing, where you bring your component and the depended-on components together. Only then can you test the assumptions made in your component against the reality of the other components.

Python recursive tree example

I'm analyzing a bit of code in one of the Jenkins builds which uses recursion to get Jenkins' downstream jobs' URLs:
def get_all_downstream_jobs_urls(ds_jobs: set = None):
    global JENKINS_JOBS
    if not ds_jobs:
        ds_jobs = set(); ds_jobs.update(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    temp = ds_jobs
    for _ in ds_jobs.copy():
        result = extract_ds_job_url(get_ds_jobs(_))  # <--- jenkins rest api call
        if result: temp.update(result); JENKINS_JOBS.update(temp);
        else: return temp
    return get_all_downstream_jobs_urls(temp)
This works OK for a project which has downstream jobs, although it makes too many calls to the Jenkins REST API, but if a project doesn't have downstream jobs it gets stuck in infinite recursion. Could you help me figure out where the issue is?
If extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)) returns an empty set, you are forever calling get_all_downstream_jobs_urls(temp). That's because the for loop is not going to do anything.
The test at the top should check for None instead:
if ds_jobs is None:
and a separate test for ds_jobs being empty should end the recursion:
if not ds_jobs:
    # no downstream jobs to process
    return set()
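Put together, the top of the function would then look something like this (the rest of the body stays as in the question):

    def get_all_downstream_jobs_urls(ds_jobs: set = None):
        if ds_jobs is None:
            # first call: seed the set from the base job
            ds_jobs = set(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
        if not ds_jobs:
            # no downstream jobs to process, stop recursing
            return set()
        ...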
I can't vouch for the rest of the logic, but there are certainly many style errors in the code. I'd refactor it to at least get rid of some of those errors:
JENKINS_JOBS is never rebound, so global JENKINS_JOBS is redundant and confusing and should be removed.
It's not clear why the function is updating a global and is returning the result set. It should do one, or the other, not both.
_ is, by convention, a throw-away variable. It signals that the value is not going to be used. Yet here the code does use it. It should be renamed job_url instead.
You really never should use ; in production code. Put the code on separate lines.
ds_jobs = set() then ds_jobs.update(...) is a way too verbose spelling of ds_jobs = set(...).
temp is not a good variable name, updated might be a better name. It should be made a copy when assigned, so updated = set(ds_jobs), and the .copy() call can be removed from the for loop.
The return when the first job URL doesn't have downstream URLs is probably not what you want either.
If you really want a tree of downstream URLs, the recursive call should not try to pass in all job URLs collected so far! It's just as likely to call the Jenkins API again and again for a job URL that was already checked.
The following code removes the recursion by using a stack instead, and is guaranteed to call the Jenkins API for each job URL just once:
def get_all_downstream_jobs_urls():
    ds_jobs = set()
    stack = list(extract_ds_job_url(get_ds_jobs(BASE_JOB_URL)))
    while stack:
        job_url = stack.pop()
        if job_url in ds_jobs:
            # already seen before, skip
            continue
        ds_jobs.add(job_url)
        # add downstream jobs to the stack for further processing
        stack.extend(extract_ds_job_url(get_ds_jobs(job_url)))
    return ds_jobs
Last but not least, I strongly suspect that using a third-party library like the jenkinsapi package would make this all even simpler; the Jenkins API probably lets you query this information in just one call, and a library would make such calls for you and give you readily parsed Python objects for the information.

Overriding db.Model.all() in python google app engine

I'm trying to start using memcache in my Google App Engine app. Instead of creating a function that checks memcache and then maybe queries the database, I decided to just override my Model's all() class method. Here's my code so far:
def all(cls, order=None):
    result = memcache.get("allitems")
    if not result or not memcache.get("updateitems"):
        logging.info(list(super(Item, cls.all())))
        result = list(super(Item, cls).all()).sort(key=lambda x: getattr(x, order) if order else str(x))
        memcache.set("allitems", result)
        memcache.set("updateitems", True)
        logging.info("DB Query for items")
    return result
I had figured this would work. But instead I get a RuntimeError saying that recursion depth was exceeded. I think this comes from a misunderstanding of the super() method. Sorry for cluttering the code up with the ordering thing. But maybe the problem lies somewhere in there too. One place I found said that the super method should be called like this:
super(supercls, cls_or_self)
But this wouldn't work with GAE's API:
super(db.Model, cls)
This wouldn't know which model to query. Can someone please tell me what I'm doing wrong, and maybe give me a better understanding of super()?
EDIT: Thanks to @Matthew, the problem turned out to be a misplaced parenthesis in the first logging.info() call. Now I have another problem: the method is just returning None. I don't know if that means that the super implementation of all() returns None (maybe it doesn't know which entity is calling it?) or there is just some other bug in my code.
I think the error might be here:
logging.info(list(super(Item, cls.all())))
With the parenthesis in that position, cls.all() is called and its result passed as the second argument to super(), rather than all() being called on the super object:
logging.info(list(super(Item, cls).all()))
Since cls.all() is your overridden method, calling it inside its own body still meets the logging branch condition, which calls all() again, and so on until you hit the recursion limit.
The other possible problem is that Model.all() returns a Query object, and I'm not sure if list(query) works. It also provides its own sorting, so you might be able to use this instead:
query = super(Item, cls).all()
query.order(order)
...
return list(query)
Or just return query, as it's already iterable.
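Putting those pieces together, a rough (untested) sketch of the override could look like this. Note, too, that list.sort() sorts in place and returns None, so the list(...).sort(...) line in your version would also explain the None result mentioned in the edit:

    @classmethod
    def all(cls, order=None):
        result = memcache.get("allitems")
        if result is None or not memcache.get("updateitems"):
            query = super(Item, cls).all()
            if order:
                query.order(order)
            result = list(query)  # materialize the Query; list.sort() would have returned None
            memcache.set("allitems", result)
            memcache.set("updateitems", True)
            logging.info("DB Query for items")
        return result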

Python generator's 'yield' in separate function

I'm implementing a utility library which is a sort of task manager intended to run within the distributed environment of the Google App Engine cloud computing service. (It uses a combination of task queues and memcache to execute background processing.) I plan to use generators to control the execution of tasks, essentially enforcing a non-preemptive "concurrency" via the use of yield in the user's code.
The trivial example - processing a bunch of database entities - could be something like the following:
class EntityWorker(Worker):
    def setup():
        self.entity_query = Entity.all()

    def run():
        for e in self.entity_query:
            do_something_with(e)
            yield
As we know, yield is a two-way communication channel, allowing values to be passed to the code that uses generators. This allows a "preemptive API" to be simulated, such as the SLEEP call below:
def run():
    for e in self.entity_query:
        do_something_with(e)
        yield Worker.SLEEP, timedelta(seconds=1)
But this is ugly. It would be great to hide the yield within a separate function which could be invoked in a simple way:
self.sleep(timedelta(seconds=1))
The problem is that putting yield in the function sleep turns it into a generator function. The call above would therefore just return another generator. Only after adding .next() and yield back again would we obtain the previous result:
yield self.sleep(timedelta(seconds=1)).next()
which is of course even more ugly and unnecessarily verbose than before.
Hence my question: is there a way to put yield into a function without turning it into a generator function, while still making it usable by other generators to yield values computed by it?
You seem to be missing the obvious:
class EntityWorker(Worker):
    def setup(self):
        self.entity_query = Entity.all()

    def run(self):
        for e in self.entity_query:
            do_something_with(e)
            yield self.sleep(timedelta(seconds=1))

    def sleep(self, wait):
        return Worker.SLEEP, wait
It's the yield that turns functions into generators, it's impossible to leave it out.
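A quick way to convince yourself: any function whose body contains yield returns a generator when called, regardless of what it looks like it returns:

    def f():
        yield 42

    print(f())          # <generator object f at 0x...>, not 42
    print(next(f()))    # 42 -- the body only runs once the generator is advanced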
To hide the yield you need a higher-order function; in your example it's map:
from itertools import imap

def slowmap(f, sleep, *iters):
    for row in imap(f, *iters):
        yield sleep

def run():
    return slowmap(do_something_with,
                   (Worker.SLEEP, timedelta(seconds=1)),
                   self.entity_query)
Alas, this won't work. But a "middle-way" could be fine:
def sleepjob(*a, **k):
    if a:
        return Worker.SLEEP, a[0]
    else:
        return Worker.SLEEP, timedelta(**k)
So
yield self.sleepjob(timedelta(seconds=1))
yield self.sleepjob(seconds=1)
look OK to me.
I would suggest you have a look at ndb. It uses generators as coroutines (as you are proposing here), allowing you to write programs that work with RPCs asynchronously.
The API does this by wrapping the generator with another function that 'primes' the generator (it calls .next() immediately so that the code begins execution). The tasklets are also designed to work with App Engine's RPC infrastructure, making it possible to use any of the existing asynchronous API calls.
With the concurrency model used in ndb, you yield either a future object (similar to what is described in pep-3148) or an App Engine RPC object. When that RPC has completed, execution in the function that yielded the object is allowed to continue.
If you are using a model derived from ndb.model.Model then the following will allow you to asynchronously iterate over a query:
from ndb import tasklets

@tasklets.tasklet
def run():
    it = iter(Entity.query())
    # Other tasklets will be allowed to run if the next call has to wait for an rpc.
    while (yield it.has_next_async()):
        entity = it.next()
        do_something_with(entity)
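Calling run() returns a future straight away; outside of another tasklet you can block on it explicitly:

    future = run()
    future.get_result()  # drives the event loop until the tasklet (and its RPCs) has finished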
Although ndb is still considered experimental (some of its error handling code still needs some work), I would recommend you have a look at it. I have used it in my last 2 projects and found it to be an excellent library.
Make sure you read through the documentation linked from the main page, and also the companion documentation for the tasklet stuff.

Python asynchronous callbacks and generators

I'm trying to convert a synchronous library to use an internal asynchronous IO framework. I have several methods that look like this:
def foo():
    ....
    sync_call_1()  # synchronous blocking call
    ....
    sync_call_2()  # synchronous blocking call
    ....
    return bar
For each of the synchronous functions (sync_call_*), I have written a corresponding async function that takes a callback. E.g.
def async_call_1(callback=None):
    # do the I/O
    callback()
Now for the Python newbie question -- what's the easiest way to translate the existing methods to use these new async methods instead? That is, the method foo() above now needs to be:
def async_foo(callback):
    # Do the foo() stuff using async_call_*
    callback()
One obvious choice is to pass a callback into each async method which effectively "resumes" the calling "foo" function, and then call the callback at the very end of the method. However, that makes the code brittle and ugly, and I would need to add a new callback for every call to an async_call_* method.
Is there an easy way to do that using a python idiom, such as a generator or coroutine?
UPDATE: take this with a grain of salt, as I'm out of touch with modern python async developments, including gevent and asyncio and don't actually have serious experience with async code.
There are 3 common approaches to thread-less async coding in Python:
Callbacks - ugly but workable, Twisted does this well.
Generators - nice but require all your code to follow the style.
Use Python implementation with real tasklets - Stackless (RIP) and greenlet.
Unfortunately, ideally the whole program should use one style, or things become complicated. If you are OK with your library exposing a fully synchronous interface, you are probably OK, but if you want several calls to your library to work in parallel, especially in parallel with other async code, then you need a common event "reactor" that can work with all the code.
So if you have (or expect the user to have) other async code in the application, adopting the same model is probably smart.
If you don't want to understand the whole mess, consider using bad old threads. They are also ugly, but work with everything else.
If you do want to understand how coroutines might help you - and how they might complicate you, David Beazley's "A Curious Course on Coroutines and Concurrency" is good stuff.
Greenlets might actually be the cleanest way if you can use the extension. I don't have any experience with them, so can't say much.
There are several ways of multiplexing tasks. We can't say what is best for your case without deeper knowledge of what you are doing. Probably the easiest and most universal way is to use threads. Take a look at this question for some ideas.
You need to make the function foo async as well. How about this approach?
@make_async
def foo(somearg, callback):
    # This function is now async. Expect a callback argument.
    ...
    # change
    # x = sync_call1(somearg, some_other_arg)
    # to the following:
    x = yield async_call1, somearg, some_other_arg
    ...
    # same transformation again
    y = yield async_call2, x
    ...
    # change
    # return bar
    # to a callback call
    callback(bar)
And make_async can be defined like this:
def make_async(f):
    """Decorator to convert sync function to async
    using the above mentioned transformations"""
    def g(*a, **kw):
        async_call(f(*a, **kw))
    return g

def async_call(it, value=None):
    # This function is the core of async transformation.
    try:
        # send the current value to the iterator and
        # expect function to call and args to pass to it
        x = it.send(value)
    except StopIteration:
        return
    func = x[0]
    args = list(x[1:])
    # define callback and append it to args
    # (assuming that callback is always the last argument)
    callback = lambda new_value: async_call(it, new_value)
    args.append(callback)
    func(*args)
CAUTION: I haven't tested this
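As a rough sanity check of the idea, here is a tiny self-contained run with stand-in async calls that invoke their callback immediately (the helpers below are placeholders, not a real I/O framework):

    def async_call1(arg, callback):
        callback(arg + 1)   # pretend the I/O finished instantly

    def async_call2(arg, callback):
        callback(arg * 2)

    @make_async
    def foo(somearg, callback):
        x = yield async_call1, somearg
        y = yield async_call2, x
        callback(y)

    def done(result):
        print("result: %s" % result)

    foo(10, done)   # prints "result: 22", i.e. (10 + 1) * 2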
