Tornado mysteriously hanging after an async operation -- how can I debug?

Our app uses a fair few network calls (it's built on top of a third-party REST API), so we're using a lot of asynchronous operations to keep the system responsive. (Using Swirl to stay sane, since the app was written before tornado.gen came about). So when the need arose to do a little geocoding, we figured it would be trivial -- throw in a couple of async calls to another external API, and we'd be golden.
Somehow, our async code is mysteriously hanging Tornado -- the process is still running, but it won't respond to requests or output anything to the logs. Worse, when we take the third-party server out of the equation entirely, it still hangs -- it seems to lock up some arbitrary period after the async request returns.
Here's some stub code that replicates the problem:
import time
from tornado.ioloop import IOLoop

def async_geocode(lat, lon, callback, fields=('city', 'country')):
    '''Translates lat and lon into human-readable geographic info'''
    iol = IOLoop.instance()
    iol.add_timeout(time.time() + 1, lambda: callback("(unknown)"))
And here's the test that usually (but not always -- that's how it got to production in the first place) catches it:
import time

import tornado.ioloop
import tornado.testing


class UtilTest(tornado.testing.AsyncTestCase):
    def get_new_ioloop(self):
        '''Ensure that any test code uses the right IOLoop, since the code
        it tests will use the singleton.'''
        return tornado.ioloop.IOLoop.instance()

    def test_async_geocode(self):
        # Yahoo gives (-122.419644, 37.777125) for SF, so we expect it to
        # reverse geocode to SF too...
        async_geocode(lat=37.777, lon=-122.419, callback=self.stop,
                      fields=('city', 'country'))
        result = self.wait(timeout=4)
        self.assertEquals(result, u"San Francisco, United States")
        # Now test if it's hanging (or has hung) the IOLoop on finding London
        async_geocode(lat=51.506, lon=-0.127, callback=self.stop,
                      fields=('city',))
        result = self.wait(timeout=5)
        self.assertEquals(result, u"London")
        # Test it fails gracefully
        async_geocode(lat=0.00, lon=0.00, callback=self.stop,
                      fields=('city',))
        result = self.wait(timeout=6)
        self.assertEquals(result, u"(unknown)")

    def test_async_geocode2(self):
        async_geocode(lat=37.777, lon=-122.419, callback=self.stop,
                      fields=('city', 'state', 'country'))
        result = self.wait(timeout=7)
        self.assertEquals(result, u"San Francisco, California, United States")
        async_geocode(lat=51.506325, lon=-0.127144, callback=self.stop,
                      fields=('city', 'state', 'country'))
        result = self.wait(timeout=8)
        self.io_loop.add_timeout(time.time() + 8, lambda: self.stop(True))
        still_running = self.wait(timeout=9)
        self.assert_(still_running)
Note that the first test almost always passes, and it's the second test (and its call to async_geocode) that usually fails.
Edited to add: Note also that we have lots of similarly asynchronous calls to our other third-party API which are working absolutely fine.
(For completeness, here's the full implementation of async_geocode and its helper class (although the stub above replicates the problem)):
import urllib

from tornado.escape import json_decode
from tornado.httpclient import AsyncHTTPClient
from tornado.ioloop import IOLoop


def async_geocode(lat, lon, callback, fields=('city', 'country')):
    '''Use AsyncGeocoder to do the work.'''
    geo = AsyncGeocoder(lat, lon, callback, fields)
    geo.geocode()


class AsyncGeocoder(object):
    '''
    Reverse-geocode to as specific a level as possible.

    Calls Yahoo! PlaceFinder for reverse geocoding. Takes a lat, lon, and
    callback function (to call with the result string when the request
    completes), and optionally a sequence of fields to return, in decreasing
    order of specificity (e.g. street, neighborhood, city, country).

    NB: Does not do anything intelligent with the geocoded data -- just returns
    the first result found.
    '''
    url = "http://where.yahooapis.com/geocode"

    def __init__(self, lat, lon, callback, fields, ioloop=None):
        self.lat, self.lon = lat, lon
        self.callback = callback
        self.fields = fields
        self.io_loop = ioloop or IOLoop.instance()
        self._client = AsyncHTTPClient(io_loop=self.io_loop)

    def geocode(self):
        params = urllib.urlencode({
            'q': '{0}, {1}'.format(self.lat, self.lon),
            'flags': 'J', 'gflags': 'R'
        })
        tgt_url = self.url + "?" + params
        self._client.fetch(tgt_url, self.geocode_cb)

    def geocode_cb(self, response):
        geodata = json_decode(response.body)
        try:
            geodata = geodata['ResultSet']['Results'][0]
        except IndexError:
            # Response didn't contain anything
            result_string = ""
        else:
            results = []
            for f in self.fields:
                val = geodata.get(f, None)
                if val:
                    results.append(val)
            result_string = ", ".join(results)
        if result_string == '':
            # This can happen if the response was empty _or_ if
            # the requested fields weren't in it. Regardless,
            # the user needs to see *something*
            result_string = "(unknown)"
        self.io_loop.add_callback(lambda: self.callback(result_string))
Edit: So after quite a bit of tedious debugging and logging the situations in which the system fails over a few days, it turns out that, as the accepted answer points out, my test was failing for unrelated reasons. It also turns out that the reason it was hanging was nothing to do with the IOLoop, but rather that one of the coroutines in question was immediately hanging waiting for a database lock.
Sorry for the mis-targeted question, and thank you all for your patience.

Your second test appears to be failing because of this part:
self.io_loop.add_timeout(time.time() + 8, lambda: self.stop(True))
still_running = self.wait(timeout=9)
self.assert_(still_running)
When you add a timeout to the IOLoop through self.wait, that timeout is not cleared when self.stop is called, as far as I can tell. That is, the timeout from your earlier wait() call is still pending, and when you sleep the IOLoop for 8 seconds it fires.
I doubt any of that is related to your original problem.


Python Tornado Async Fetching of URLs

In the following code example I have a function do_async_thing which appears to return a Future, even though I'm not sure why.
import tornado.gen
import tornado.httpclient
import tornado.ioloop
import tornado.web


@tornado.gen.coroutine
def do_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    response = yield http.fetch("http://www.google.com/")
    return response.body


class MainHandler(tornado.web.RequestHandler):
    def get(self):
        x = do_async_thing()
        print(x)  # <tornado.concurrent.Future object at 0x10753a6a0>

        self.set_header("Content-Type", "application/json")
        self.write('{"foo":"bar"}')
        self.finish()


if __name__ == "__main__":
    app = tornado.web.Application([
        (r"/foo/?", MainHandler),
    ])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
You'll see that I yield the call to fetch and in doing so I should have forced the value to be realised (and subsequently been able to access the body field of the response).
What's more interesting is how I can even access the body field on a Future and not have it error (as far as I know a Future has no such field/property/method)
So does anyone know how I can:
Resolve the Future so I get the actual value
Modify this example so the function do_async_thing makes multiple async url fetches
Now it's worth noting that because I was still getting a Future back I thought I would try adding a yield to prefix the call to do_async_thing() (e.g. x = yield do_async_thing()) but that gave me back the following error:
tornado.gen.BadYieldError: yielded unknown object <generator object get at 0x1023bc308>
I also looked at doing something like this for the second point:
def do_another_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    a = http.fetch("http://www.google.com/")
    b = http.fetch("http://www.github.com/")
    return a, b

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        y = do_another_async_thing()
        print(y)
But again this returns:
<tornado.concurrent.Future object at 0x102b966d8>
Whereas I would've expected at least a tuple of Futures. At this point I'm unable to resolve these Futures without getting an error such as:
tornado.gen.BadYieldError: yielded unknown object <generator object get at 0x1091ac360>
Update
Below is an example that works (as per the answer by A. Jesse Jiryu Davis).
But I've also added another example whereby a new function do_another_async_thing makes two async HTTP requests (evaluating their values is a little more involved, as you'll see):
def do_another_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    a = http.fetch("http://www.google.com/")
    b = http.fetch("http://www.github.com/")
    return a, b


@tornado.gen.coroutine
def do_async_thing():
    http = tornado.httpclient.AsyncHTTPClient()
    response = yield http.fetch("http://www.google.com/")
    return response.body


class MainHandler(tornado.web.RequestHandler):
    @tornado.gen.coroutine
    def get(self):
        x = yield do_async_thing()
        print(x)  # displays HTML response

        fa, fb = do_another_async_thing()
        fa = yield fa
        fb = yield fb
        print(fa.body, fb.body)  # displays HTML response for each
It's worth clarifying: you might expect the two yield statements for do_another_async_thing to block twice, one after the other. Here is a breakdown of what actually happens:
do_another_async_thing immediately returns a tuple of two Futures
we yield the first future, which suspends the coroutine until its value is realised
the value is realised, so we move on to the next line
we yield again, suspending until the second value is realised
but as both futures were created at the same time and run concurrently, the second yield returns practically instantly
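As an aside (an assumption on my part, not from the original post): inside a gen.coroutine you can also yield a list of futures, and Tornado resolves them concurrently with a single yield. A minimal sketch, using a hypothetical fetch_both coroutine:
@tornado.gen.coroutine
def fetch_both():
    http = tornado.httpclient.AsyncHTTPClient()
    # Both fetches start immediately; the single yield waits for both.
    ra, rb = yield [http.fetch("http://www.google.com/"),
                    http.fetch("http://www.github.com/")]
    return ra.body, rb.body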
Coroutines return futures. To wait for the coroutine to complete, the caller must also be a coroutine, and must yield the future. So:
@gen.coroutine
def get(self):
    x = yield do_async_thing()
For more info see Refactoring Tornado Coroutines.

Run function after a certain type of model is committed

I want to run a function when instances of the Post model are committed. I want to run it any time they are committed, so I'd rather not explicitly call the function everywhere. How can I do this?
def notify_subscribers(post):
    """ send email to subscribers """
    ...

post = Post("Hello World", "This is my first blog entry.")
session.commit()  # How to run notify_subscribers with post as argument
                  # as soon as post is committed successfully?
post.title = "Hello World!!1"
session.commit()  # Run notify_subscribers once again.
No matter which option you choose below, SQLAlchemy comes with a big warning about the after_commit event (which is when both approaches send the signal).
The Session is not in an active transaction when the after_commit() event is invoked, and therefore can not emit SQL.
If your callback needs to query or commit to the database, it may have unexpected issues. In this case, you could use a task queue such as Celery to execute this in a background thread (with a separate session). This is probably the right way to go anyway, since sending emails takes a long time and you don't want your view to wait to return while it's happening.
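As a rough sketch of that background-task idea (the app name, broker URL, and send_new_post_email task are assumptions, not part of the question; Post.query assumes Flask-SQLAlchemy):
from celery import Celery

# Assumed names: adjust the app name and broker URL to your project.
celery_app = Celery("blog", broker="redis://localhost:6379/0")

@celery_app.task
def send_new_post_email(post_id):
    # Runs in a worker process with its own database session, so it is
    # safe to query here, unlike inside the after_commit() hook itself.
    post = Post.query.get(post_id)
    ...  # build and send the email to subscribers

# Inside the commit-time callback you would only enqueue the task:
#     send_new_post_email.delay(post.id)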
Flask-SQLAlchemy provides a signal you can listen to that sends all the insert/update/delete ops. It needs to be enabled by setting app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = True because tracking modifications is expensive and not needed in most cases.
Then listen for the signal:
from flask_sqlalchemy import models_committed

def notify_subscribers(app, changes):
    new_posts = [target for target, op in changes if isinstance(target, Post) and op in ('insert', 'update')]
    # notify about the new and updated posts

models_committed.connect(notify_subscribers, app)
app.config["SQLALCHEMY_TRACK_MODIFICATIONS"] = True
You can also implement this yourself (mostly by copying the code from Flask-SQLAlchemy). It's slightly tricky, because model changes occur on flush, not on commit, so you need to record all changes as flushes occur, then use them after the commit.
from sqlalchemy import event, inspect


class ModelChangeEvent(object):
    def __init__(self, session, *callbacks):
        self.model_changes = {}
        self.callbacks = list(callbacks)  # list, so more callbacks can be appended later
        event.listen(session, 'before_flush', self.record_ops)
        event.listen(session, 'before_commit', self.record_ops)
        event.listen(session, 'after_commit', self.after_commit)
        event.listen(session, 'after_rollback', self.after_rollback)

    def record_ops(self, session, flush_context=None, instances=None):
        for targets, operation in ((session.new, 'insert'), (session.dirty, 'update'), (session.deleted, 'delete')):
            for target in targets:
                state = inspect(target)
                key = state.identity_key if state.has_identity else id(target)
                self.model_changes[key] = (target, operation)

    def after_commit(self, session):
        if self.model_changes:
            changes = list(self.model_changes.values())
            for callback in self.callbacks:
                callback(changes=changes)
            self.model_changes.clear()

    def after_rollback(self, session):
        self.model_changes.clear()


def notify_subscribers(changes):
    new_posts = [target for target, op in changes if isinstance(target, Post) and op in ('insert', 'update')]
    # notify about new and updated posts


# pass all the callbacks (if you have more than notify_subscribers)
mce = ModelChangeEvent(db.session, notify_subscribers)
# or you can append more callbacks
mce.callbacks.append(my_other_callback)

Keeping context-manager object alive through function calls

I am running into a bit of an issue with keeping a context manager open through function calls. Here is what I mean:
There is a context-manager defined in a module which I use to open SSH connections to network devices. The "setup" code handles opening the SSH sessions and handling any issues, and the teardown code deals with gracefully closing the SSH session. I normally use it as follows:
from manager import manager

def do_stuff(device):
    with manager(device) as conn:
        output = conn.send_command("show ip route")
        #process output...
        return processed_output
In order to keep the SSH session open and not have to re-establish it across function calls, I would like to add an argument to "do_stuff" which can optionally return the SSH session along with the data returned from the SSH session, as follows:
def do_stuff(device, return_handle=False):
    with manager(device) as conn:
        output = conn.send_command("show ip route")
        #process output...
        if return_handle:
            return (processed_output, conn)
        else:
            return processed_output
I would like to be able to call this function "do_stuff" from another function, as follows, such that it signals to "do_stuff" that the SSH handle should be returned along with the output.
def do_more_stuff(device):
    data, conn = do_stuff(device, return_handle=True)
    output = conn.send_command("show users")
    #process output...
    return processed_output
However the issue that I am running into is that the SSH session is closed, due to the do_stuff function "returning" and triggering the teardown code in the context-manager (which gracefully closes the SSH session).
I have tried converting "do_stuff" into a generator, so that its state is suspended and the context-manager might stay open:
def do_stuff(device, return_handle=False):
    with manager(device) as conn:
        output = conn.send_command("show ip route")
        #process output...
        if return_handle:
            yield (processed_output, conn)
        else:
            yield processed_output
And calling it as such:
def do_more_stuff(device):
    gen = do_stuff(device, return_handle=True)
    data, conn = next(gen)
    output = conn.send_command("show users")
    #process output...
    return processed_output
However this approach does not seem to be working in my case, as the context-manager gets closed, and I get back a closed socket.
Is there a better way to approach this problem? Maybe my generator needs some more work...I think using a generator to hold state is the most "obvious" way that comes to mind, but overall should I be looking into another way of keeping the session open across function calls?
Thanks
I found this question because I was looking for a solution to an analogous problem where the object I wanted to keep alive was a pyvirtualdisplay.display.Display instance with selenium.webdriver.Firefox instances in it.
I also wanted any opened resources to die if an exception were raised during the display/browser instance creations.
I imagine the same could be applied to your database connection.
I recognize this is probably only a partial solution and contains less-than-best practices. Help is appreciated.
This answer is the result of an ad lib spike using the following resources to patch together my solution:
https://docs.python.org/3/library/contextlib.html#contextlib.ContextDecorator
http://www.wefearchange.org/2013/05/resource-management-in-python-33-or.html
(I do not yet fully grok what is described here though I appreciate the potential. The second link above eventually proved to be the most helpful by providing analogous situations.)
from pyvirtualdisplay.display import Display
from selenium.webdriver import Firefox
from contextlib import contextmanager, ExitStack

RFBPORT = 5904

def acquire_desktop_display(rfbport=RFBPORT):
    display_kwargs = {'backend': 'xvnc', 'rfbport': rfbport}
    display = Display(**display_kwargs)
    return display

def release_desktop_display(self):
    print("Stopping the display.")
    # browsers apparently die with the display so no need to call quits on them
    self.display.stop()

def check_desktop_display_ok(desktop_display):
    print("Some checking going on here.")
    return True

class XvncDesktopManager:
    max_browser_count = 1

    def __init__(self, check_desktop_display_ok=None, **kwargs):
        self.rfbport = kwargs.get('rfbport', RFBPORT)
        self.acquire_desktop_display = acquire_desktop_display
        self.release_desktop_display = release_desktop_display
        self.check_desktop_display_ok = check_desktop_display_ok \
            if check_desktop_display_ok is None else check_desktop_display_ok

    @contextmanager
    def _cleanup_on_error(self):
        with ExitStack() as stack:
            """push adds a context manager’s __exit__() method
            to stack's callback stack."""
            stack.push(self)
            yield
            # The validation check passed and didn't raise an exception
            # Accordingly, we want to keep the resource, and pass it
            # back to our caller
            stack.pop_all()

    def __enter__(self):
        url = 'http://stackoverflow.com/questions/30905121/'\
              'keeping-context-manager-object-alive-through-function-calls'
        self.display = self.acquire_desktop_display(self.rfbport)
        with ExitStack() as stack:
            # add XvncDesktopManager instance's exit method to callback stack
            stack.push(self)
            self.display.start()
            self.browser_resources = [
                Firefox() for x in range(self.max_browser_count)
            ]
            for browser_resource in self.browser_resources:
                for url in (url, ):
                    browser_resource.get(url)
            """This is the last bit of magic.
            ExitStacks have a .close() method which unwinds
            all the registered context managers and callbacks
            and invokes their exit functionality."""
            # capture the function that calls all the exits
            # will be called later outside the context in which it was captured
            self.close_all = stack.pop_all().close

        # if something fails in this context in enter, cleanup
        with self._cleanup_on_error() as stack:
            if not self.check_desktop_display_ok(self):
                msg = "Failed validation for {!r}"
                raise RuntimeError(msg.format(self.display))

        # self is assigned to variable after "as",
        # manually call close_all to unwind callback stack
        return self

    def __exit__(self, *exc_details):
        # had to comment this out, unable to add this to callback stack
        # self.release_desktop_display(self)
        pass
I had a semi-expected result with the following:
kwargs = {
    'rfbport': 5904,
}
_desktop_manager = XvncDesktopManager(check_desktop_display_ok=check_desktop_display_ok, **kwargs)

with ExitStack() as stack:
    # context entered and what is inside the __enter__ method is executed
    # desktop_manager will have an attribute "close_all" that can be called
    # explicitly to unwind the callback stack
    desktop_manager = stack.enter_context(_desktop_manager)

# I was able to manipulate the browsers inside of the display
# and outside of the context
# before calling desktop_manager.close_all()
browser, = desktop_manager.browser_resources
browser.get(url)

# close everything down when finished with resource
desktop_manager.close_all()  # does nothing, not in callback stack

# this functioned as expected
desktop_manager.release_desktop_display(desktop_manager)
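For comparison, here is a much smaller sketch of the same pop_all() idea applied directly to the question's SSH manager (do_stuff, do_more_stuff, manager, and conn are the question's names; everything else is an assumption on my part, not tested against that library):
from contextlib import ExitStack

def do_stuff(device, return_handle=False):
    with ExitStack() as stack:
        conn = stack.enter_context(manager(device))
        output = conn.send_command("show ip route")
        processed_output = output  # process output...
        if return_handle:
            # Hand ownership of the open session to the caller:
            # pop_all() stops this ExitStack from closing it here.
            return processed_output, conn, stack.pop_all()
        return processed_output

def do_more_stuff(device):
    data, conn, handle = do_stuff(device, return_handle=True)
    with handle:  # closes the SSH session when this block exits
        output = conn.send_command("show users")
        return output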

How to make the spdylay module work like httplib/http.client?

I have to test a server based on Jetty. This server can work with its own protocol, HTTP, HTTPS, and lately it has started to support SPDY. I have some stress tests based on httplib/http.client -- each thread starts with a similar URL (some data in the query string is variable), adds its execution time to a global variable, and every few seconds shows some statistics. The code looks like:
t_start = time.time()
connection.request("GET", path)
resp = connection.getresponse()
t_stop = time.time()
check_response(resp)
QRY_TIMES.append(t_stop - t_start)
The client for the native protocol shares the httplib API, so the connection may be a native connection, an HTTPConnection, or an HTTPSConnection.
Now I want to add a SPDY test using the spdylay module. But its interface is opaque and I don't know how to turn that opaqueness into something similar to the httplib interface. I have made a test client based on the example, but since the 2nd argument to spdylay.urlfetch() is a class name and not an object, I do not know how to use it with my tests. I have already added my checks to the on_close() method of my class which extends spdylay.BaseSPDYStreamHandler, but that is not compatible with the other tests. If it were an instance, I could use it outside of the spdylay.urlfetch() call.
How can I use spdylay in code that is built around the httplib interface?
My only idea is to use a global dictionary where the url is the key and the handler object is the value. It is not ideal because:
new queries with the same url will overwrite the previous response
it is easy to forget to free the handler from the global dictionary
But it works!
import sys

import spdylay

CLIENT_RESULTS = {}

class MyStreamHandler(spdylay.BaseSPDYStreamHandler):
    def __init__(self, url, fetcher):
        super().__init__(url, fetcher)
        self.headers = []
        self.whole_data = []

    def on_header(self, nv):
        self.headers.append(nv)

    def on_data(self, data):
        self.whole_data.append(data)

    def get_response(self, charset='UTF8'):
        return (b''.join(self.whole_data)).decode(charset)

    def on_close(self, status_code):
        CLIENT_RESULTS[self.url] = self

def spdy_simply_get(url):
    spdylay.urlfetch(url, MyStreamHandler)
    data_handler = CLIENT_RESULTS[url]
    result = data_handler.get_response()
    del CLIENT_RESULTS[url]
    return result

if __name__ == '__main__':
    if '--test' in sys.argv:
        spdy_response = spdy_simply_get('https://localhost:8443/test_spdy/get_ver_xml.hdb')
I hope somebody can do spdy_simply_get(url) better.

Waiting on events in other requests in Twisted

I have a simple Twisted server that handles requests like this (obviously, asynchronously)
global SomeSharedMemory
if SomeSharedMemory is None:
    SomeSharedMemory = LoadSharedMemory()
return PickSomething(SomeSharedMemory)
Where SomeSharedMemory is loaded from a database.
I want to avoid loading SomeSharedMemory from the database multiple times. Specifically, when the server first starts, and we get two concurrent incoming requests, we might see something like this:
Request 1: Check for SomeSharedMemory, don't find it
Request 1: Issue database query to load SSM
Request 2: Check for SSM, don't find it
Request 2: Issue database query to load SSM
Request 1: Query returns, store SSM
Request 1: Return result
Request 2: Query returns, store SSM
Request 2: Return result
With more concurrent requests, the database gets hammered. I'd like to do something like this (see http://docs.python.org/library/threading.html#event-objects):
global SomeSharedMemory, SSMEvent
if SomeSharedMemory is None:
    if not SSMEvent.isSet():
        SSMEvent.wait()
    else:
        # assumes that the event is initialized "set"
        SSMEvent.clear()
        SomeSharedMemory = LoadSharedMemory()
        SSMEvent.set()
return PickSomething(SomeSharedMemory)
Such that if one request is loading the shared memory, other requests will wait politely until the query is complete rather than issue their own duplicate database queries.
Is this possible in Twisted?
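As an aside (not part of the original question or the answer below): twisted.internet.defer.DeferredLock gives roughly the Event-style behaviour sketched above without blocking the reactor. A minimal sketch, assuming LoadSharedMemory returns a Deferred:
from twisted.internet import defer

SomeSharedMemory = None
_ssm_lock = defer.DeferredLock()

@defer.inlineCallbacks
def getSharedMemory():
    global SomeSharedMemory
    yield _ssm_lock.acquire()   # later requests queue here instead of re-querying
    try:
        if SomeSharedMemory is None:
            SomeSharedMemory = yield LoadSharedMemory()
    finally:
        _ssm_lock.release()
    defer.returnValue(SomeSharedMemory)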
The way your example is set up, it's hard to see how you could actually have the problem you're describing. If a second request comes in to your Twisted server before the call to LoadSharedMemory issued by the first has returned, then the second request will just wait before being processed. When it is finally handled, SomeSharedMemory will be initialized and there will be no duplication.
However, I suppose maybe it is the case that LoadSharedMemory is asynchronous and returns a Deferred, so that your code really looks more like this:
def handleRequest(request):
    if SomeSharedMemory is None:
        d = initSharedMemory()
        d.addCallback(lambda ignored: handleRequest(request))
    else:
        d = PickSomething(SomeSharedMemory)
    return d
In this case, it's entirely possible that a second request might arrive while initSharedMemory is off doing its thing. Then you would indeed end up with two tasks trying to initialize that state.
The thing to do, of course, is notice this third state that you have. There is not only un-initialized and initializ-ed, but also initializ-ing. So represent that state as well. I'll hide it inside the initSharedMemory function to keep the request handler as simple as it already is:
from twisted.internet.defer import Deferred

initInProgress = None

def initSharedMemory():
    global initInProgress
    if initInProgress is None:
        # _reallyInit is the actual asynchronous initialization and
        # returns a Deferred.
        initInProgress = _reallyInit()

        def initialized(result):
            global initInProgress, SomeSharedMemory
            initInProgress = None
            SomeSharedMemory = result

        initInProgress.addCallback(initialized)

    d = Deferred()
    initInProgress.chainDeferred(d)
    return d
This is a little gross because of the globals everywhere. Here's a slightly cleaner version:
from twisted.internet.defer import Deferred, succeed

class SharedResource(object):
    def __init__(self, initializer):
        self._initializer = initializer
        self._value = None
        self._state = "UNINITIALIZED"
        self._waiting = []

    def get(self):
        if self._state == "INITIALIZED":
            # Return the already computed value
            return succeed(self._value)

        # Create a Deferred for the caller to wait on
        d = Deferred()
        self._waiting.append(d)

        if self._state == "UNINITIALIZED":
            # Once, run the setup
            self._initializer().addCallback(self._initialized)
            self._state = "INITIALIZING"

        # Initialized or initializing state here
        return d

    def _initialized(self, value):
        # Save the value, transition to the new state, and tell
        # all the previous callers of get what the result is.
        self._value = value
        self._state = "INITIALIZED"
        waiting, self._waiting = self._waiting, None
        for d in waiting:
            d.callback(value)


SomeSharedMemory = SharedResource(initializeSharedMemory)

def handleRequest(request):
    return SomeSharedMemory.get().addCallback(PickSomething)
Three states, nice explicit transitions between them, no global state to update (at least if you give SomeSharedMemory some non-global scope), and handleRequest doesn't know about any of this, it just asks for a value and then uses it.
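A quick usage sketch of SharedResource (my own illustration, not from the answer), showing that concurrent get() calls share a single initializer run; task.deferLater stands in for the real database load:
from twisted.internet import reactor, task

calls = []

def initializeSharedMemory():
    calls.append(1)
    # pretend the database load takes one second
    return task.deferLater(reactor, 1.0, lambda: {"loaded": True})

resource = SharedResource(initializeSharedMemory)

def report(value, label):
    print(label, value, "initializer runs:", len(calls))  # always 1

resource.get().addCallback(report, "first")
resource.get().addCallback(report, "second")  # reuses the in-flight load

reactor.callLater(2, reactor.stop)
reactor.run()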
