Tornado memory leak on dropped connections - python

I've got a setup where Tornado is used as a kind of pass-through for workers. A request is received by Tornado, which forwards it to N workers, aggregates their results, and sends the response back to the client. This works fine, except when a timeout occurs for some reason; then I've got a memory leak.
My setup is similar to this pseudocode:
workers = ["http://worker1.example.com:1234/",
           "http://worker2.example.com:1234/",
           "http://worker3.example.com:1234/", ...]

class MyHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def post(self):
        responses = []

        def __callback(response):
            responses.append(response)
            if len(responses) == len(workers):
                self._finish_req(responses)

        for url in workers:
            async_client = tornado.httpclient.AsyncHTTPClient()
            request = tornado.httpclient.HTTPRequest(url, method=self.request.method,
                                                     body=self.request.body)
            async_client.fetch(request, __callback)

    def _finish_req(self, responses):
        good_responses = [r for r in responses if not r.error]
        if not good_responses:
            raise tornado.web.HTTPError(500, "\n".join(str(r.error) for r in responses))
        results = aggregate_results(good_responses)
        self.set_header("Content-Type", "application/json")
        self.write(json.dumps(results))
        self.finish()

application = tornado.web.Application([
    (r"/", MyHandler),
])

if __name__ == "__main__":
    # .. some locking code
    application.listen()
    tornado.ioloop.IOLoop.instance().start()
What am I doing wrong? Where does the memory leak come from?

I don't know the source of the problem, and it seems gc should be able to take care of it, but there are two things you can try.
The first method would be to simplify some of the references (it looks like there may still be references to responses when the RequestHandler completes):
class MyHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def post(self):
        self.responses = []
        for url in workers:
            async_client = tornado.httpclient.AsyncHTTPClient()
            request = tornado.httpclient.HTTPRequest(url, method=self.request.method,
                                                     body=self.request.body)
            async_client.fetch(request, self._handle_worker_response)

    def _handle_worker_response(self, response):
        self.responses.append(response)
        if len(self.responses) == len(workers):
            self._finish_req()

    def _finish_req(self):
        ....
If that doesn't work, you can always invoke garbage collection manually:
import gc

class MyHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def post(self):
        ....

    def _finish_req(self):
        ....

    def on_connection_close(self):
        gc.collect()
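Since the leak only shows up when clients drop the connection, it may also be worth combining the two ideas: drop the per-request references in on_connection_close before collecting. A small sketch, assuming the self.responses attribute from the first variant:

    def on_connection_close(self):
        # Sketch: clear the per-request references first so the collector
        # has nothing left to chase, then force a collection pass.
        self.responses = []
        gc.collect()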

The code looks good. The leak is probably inside Tornado.
I only stumbled over this line:
async_client = tornado.httpclient.AsyncHTTPClient()
Are you aware of the instantiation magic in this constructor?
From the docs:
"""
The constructor for this class is magic in several respects: It actually
creates an instance of an implementation-specific subclass, and instances
are reused as a kind of pseudo-singleton (one per IOLoop). The keyword
argument force_instance=True can be used to suppress this singleton
behavior. Constructor arguments other than io_loop and force_instance
are deprecated. The implementation subclass as well as arguments to
its constructor can be set with the static method configure()
"""
So actually, you don't need to do this inside the loop. (On the other
hand, it should not do any harm.) But which implementation are you
using, CurlAsyncHTTPClient or SimpleAsyncHTTPClient?
If it is SimpleAsyncHTTPClient, be aware of this comment in the code:
"""
This class has not been tested extensively in production and
should be considered somewhat experimental as of the release of
tornado 1.2.
"""
You can try switching to CurlAsyncHTTPClient. Or follow
Nikolay Fominyh's suggestion and trace the calls to __callback().
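For completeness, selecting the implementation is done once with the static configure() method rather than per-instance; a minimal sketch (the curl-based client requires pycurl to be installed):

    import tornado.httpclient

    # Choose the libcurl-backed implementation for all AsyncHTTPClient()
    # instances created afterwards on this IOLoop.
    tornado.httpclient.AsyncHTTPClient.configure(
        "tornado.curl_httpclient.CurlAsyncHTTPClient")

    # Thanks to the pseudo-singleton behavior, this returns the shared,
    # curl-backed client; creating it inside the loop is harmless.
    async_client = tornado.httpclient.AsyncHTTPClient()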


Call to async endpoint gets blocked by another thread

I have a Tornado web service that is going to serve around 500 requests per minute. All these requests are going to hit one specific endpoint. There is a C++ program that I have compiled using Cython and use inside the Tornado service as my processing engine. Each request that goes to /check/ will trigger a function call in the C++ program (I will call it handler), and the return value will get sent to the user as the response.
This is how I wrap the handler class. One important point is that I do not instantiate the handler in __init__; there is another route in my Tornado code (e.g. /reload/) that starts loading the data structure after an authorized request hits it.
executors = ThreadPoolExecutor(max_workers=4)

class CheckerInstance(object):
    def __init__(self, *args, **kwargs):
        self.handler = None
        self.is_loading = False
        self.is_live = False

    def init(self):
        if not self.handler:
            self.handler = pDataStructureHandler()
            self.handler.add_words_from_file(self.data_file_name)
            self.end_loading()
            self.go_live()

    def renew(self):
        self.handler = None
        self.init()

class CheckHandler(tornado.web.RequestHandler):
    async def get(self):
        query = self.get_argument("q", None).encode('utf-8')
        answer = query
        if not checker_instance.is_live:
            self.write(dict(answer=self.get_argument("q", None), confidence=100))
            return
        checker_response = await checker_instance.get_response(query)
        answer = checker_response[0]
        confidence = checker_response[1]
        if self.request.connection.stream.closed():
            return
        self.write(dict(correct=answer, confidence=confidence, is_cache=is_cache))

    def on_connection_close(self):
        self.wait_future.cancel()

class InstanceReloadHandler(BasicAuthMixin, tornado.web.RequestHandler):
    def prepare(self):
        self.get_authenticated_user(check_credentials_func=credentials.get, realm='Protected')

    def new_file_exists(self):
        return True

    def can_reload(self):
        return not checker_instance.is_loading

    def get(self):
        error = False
        message = None
        if not self.can_reload():
            error = True
            message = 'another job is being processed!'
        else:
            if not self.new_file_exists():
                error = True
                message = 'no new file found!'
            else:
                checker_instance.go_fake()
                checker_instance.start_loading()
                tornado.ioloop.IOLoop.current().run_in_executor(executors, checker_instance.renew)
                message = 'job started!'
        if self.request.connection.stream.closed():
            return
        self.write(dict(
            success=not error, message=message
        ))

    def on_connection_close(self):
        self.wait_future.cancel()

def main():
    app = tornado.web.Application(
        [
            (r"/", MainHandler),
            (r"/check", CheckHandler),
            (r"/reload", InstanceReloadHandler),
            (r"/health", HealthHandler),
            (r"/log-event", SubmitLogHandler),
        ],
        debug=options.debug,
    )
    checker_instance = CheckerInstance()
I want this service to keep responding after checker_instance.renew starts running in another thread. But this is not what happens. When I hit the /reload/ endpoint and the renew function starts working, any request to /check/ halts and waits for the reloading process to finish, and then it starts working again. While the data structure is being loaded, the service should be in fake mode and respond to people with the same query they send as input.
I have tested this code in my development environment with an i5 CPU (4 CPU cores) and it works just fine! But in the production environment (3 dual-thread CPU cores) the /check/ endpoint halts requests.
It is difficult to fully trace the events being handled because you have clipped out some of the code for brevity. For instance, I don't see a get_response implementation here, so I don't know if it is awaiting something itself that could be dependent on the state of checker_instance.
One area I would explore is the thread-safety (or seeming absence of it) in passing checker_instance.renew to run_in_executor. This feels questionable to me because you are mutating the state of a single instance of CheckerInstance from a separate thread. While it might not break things explicitly, it does seem like this could be introducing odd race conditions or unanticipated copies of memory that might explain the unexpected behavior you are experiencing.
If possible, I would make whatever load behavior you want to offload to a thread completely self-contained, and when the data is loaded, return it as the function result, which can then be fed back into your checker_instance. If you were to do this with the code as-is, you would want to await the run_in_executor call for its result and then update the checker_instance. This would mean the reload GET request would wait until the data was loaded. Alternatively, in your reload GET request, you could use IOLoop.spawn_callback to run a function that triggers the run_in_executor in this manner, allowing the reload request to complete instead of waiting; see the sketch below.
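Here is a rough sketch of that last option, assuming the loading work can be pulled out of renew() into a self-contained function (load_data_structure is a hypothetical name for it):

    async def reload_in_background():
        # Run the hypothetical self-contained loader in the thread pool; it
        # builds and returns the new handler without touching shared state.
        new_handler = await tornado.ioloop.IOLoop.current().run_in_executor(
            executors, load_data_structure)
        # Back on the IOLoop thread: swap the state in one place.
        checker_instance.handler = new_handler
        checker_instance.end_loading()
        checker_instance.go_live()

    # In InstanceReloadHandler.get(), fire and forget, so the /reload
    # request can respond immediately:
    tornado.ioloop.IOLoop.current().spawn_callback(reload_in_background)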

Using request level context in Tornado

I'm looking for a way to set request level context in Tornado.
This is useful for logging purposes, to print some request attributes (like user_id) with every log line.
I'd like to populate the context in web.RequestHandler and then access it in other coroutines that this request called.
class WebRequestHandler(web.RequestHandler):
    @gen.coroutine
    def post(self):
        RequestContext.test_mode = self.application.settings.get('test_mode', False)
        RequestContext.corr_id = self.request.headers.get('X-Request-ID')
        result = yield some_func()
        self.write(result)

@gen.coroutine
def some_func():
    if RequestContext.test_mode:
        print "In test mode"
    # do more async calls
Currently I pass a context object (a dict with values) to every async function call downstream, so that every part of the code can do monitoring and logging with the right context.
I'm looking for a cleaner/simpler solution.
Thanks,
Alex
The concept of request context doesn't really hold well in async frameworks (especially if you have high volume traffic) for the simple fact that there could potentially be hundreds of concurrent requests and it becomes difficult to determine which "context" to use. This works for sequential frameworks like Flask, Falcon, Django, etc. because requests are handled one by one and it's simple to determine which request you're dealing with.
The preferred method of handling functionality between a request start and end is to override prepare and on_finish respectively.
class WebRequestHandler(web.RequestHandler):
    def prepare(self):
        print('Logging...prepare')
        if self.application.settings.get('test_mode', False):
            print("In test mode")
        print('X-Request-ID: {0}'.format(self.request.headers.get('X-Request-ID')))

    @gen.coroutine
    def post(self):
        result = yield some_func()
        self.write(result)

    def on_finish(self):
        print('Logging...on_finish')
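If the goal is to have those values available to later log lines, one option with this approach is to stash them on the handler in prepare(); each request gets its own RequestHandler instance, so nothing is shared between concurrent requests. A sketch, reusing the corr_id name from the question (log_line is a hypothetical helper):

    class WebRequestHandler(web.RequestHandler):
        def prepare(self):
            # Per-request state lives on this handler instance itself.
            self.corr_id = self.request.headers.get('X-Request-ID')

        def log_line(self, msg):
            # Hypothetical helper: prefix every log line with the request id.
            print('[{0}] {1}'.format(self.corr_id, msg))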
The simple solution would be to create an object that represents the context of your request and pass that into your log function. Example:
class RequestContext(object):
    """
    Hold request context
    """

class WebRequestHandler(web.RequestHandler):
    @gen.coroutine
    def post(self):
        # create a new context object and fill it with the necessary parameters
        request_context = RequestContext()
        request_context.test_mode = self.application.settings.get('test_mode', False)
        request_context.corr_id = self.request.headers.get('X-Request-ID')
        # pass the context object into the coroutine
        result = yield some_func(request_context)
        self.write(result)

@gen.coroutine
def some_func(request_context):
    if request_context.test_mode:
        print "In test mode"
    # do more async calls

What is the benefit of adding a @tornado.web.asynchronous decorator?

class IndexHandler(tornado.web.RequestHandler):
    @tornado.web.asynchronous
    def get(request):
        request.render("../resouce/index.html")
I often see Tornado code like the above and am confused about the purpose of adding this decorator. I know that with this decorator we have to call self.finish() manually, but what are the benefits of doing that?
Thanks!
Normally, finish() is called for you when the handler method returns, but if your handler depends on the result of an asynchronous computation (like an HTTP request), then it won't have finished by the time the method returns. Instead it should finish in some callback.
The example from the documentation is instructive:
class MyRequestHandler(web.RequestHandler):
    @web.asynchronous
    def get(self):
        http = httpclient.AsyncHTTPClient()
        http.fetch("http://friendfeed.com/", self._on_download)

    def _on_download(self, response):
        self.write("Downloaded!")
        self.finish()
Without the decorator, by the time _on_download is entered the request would have finished already.
If your handler isn't doing anything async, then there's no benefit to adding the decorator.
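For comparison, the coroutine style available in later Tornado versions gets the same non-blocking behavior without a manual finish(); a minimal sketch of the equivalent handler:

    from tornado import gen, httpclient, web

    class MyRequestHandler(web.RequestHandler):
        @gen.coroutine
        def get(self):
            http = httpclient.AsyncHTTPClient()
            # The coroutine suspends here without blocking the IOLoop;
            # Tornado calls finish() automatically when it returns.
            response = yield http.fetch("http://friendfeed.com/")
            self.write("Downloaded!")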

Correct use of coroutine in Tornado web server

I'm trying to convert a simple synchronous server to an asynchronous version. The server receives POST requests and retrieves the response from an external web service (Amazon SQS). Here's the synchronous code:
def post(self):
    zoom_level = self.get_argument('zoom_level')
    neLat = self.get_argument('neLat')
    neLon = self.get_argument('neLon')
    swLat = self.get_argument('swLat')
    swLon = self.get_argument('swLon')
    data = self._create_request_message(zoom_level, neLat, neLon, swLat, swLon)
    self._send_parking_spots_request(data)
    # ....other stuff

def _send_parking_spots_request(self, data):
    msg = Message()
    msg.set_body(json.dumps(data))
    self._sqs_send_queue.write(msg)
Reading the Tornado documentation and some threads here, I ended up with this code using coroutines:
def post(self):
    zoom_level = self.get_argument('zoom_level')
    neLat = self.get_argument('neLat')
    neLon = self.get_argument('neLon')
    swLat = self.get_argument('swLat')
    swLon = self.get_argument('swLon')
    data = self._create_request_message(zoom_level, neLat, neLon, swLat, swLon)
    self._send_parking_spots_request(data)
    self.finish()

@gen.coroutine
def _send_parking_spots_request(self, data):
    msg = Message()
    msg.set_body(json.dumps(data))
    yield gen.Task(write_msg, self._sqs_send_queue, msg)

def write_msg(queue, msg, callback=None):
    queue.write(msg)
Comparing the performance using siege, I find that the second version is even worse than the original one, so there's probably something about coroutines and Tornado asynchronous programming that I haven't understood at all.
Could you please help me with this?
Edit: self._sqs_send_queue is a queue object retrieved from the boto interface, and queue.write(msg) returns the message that was written to the queue.
Tornado relies on you converting all your I/O to be non-blocking. Simply sticking the same code you were using before inside of a gen.Task will not improve performance at all, because the I/O itself is still going to block the event loop. Additionally, you need to make your post method a coroutine and call _send_parking_spots_request using yield for the code to behave properly. So, a "correct" solution would look something like this:
@gen.coroutine
def post(self):
    ...
    # wait (without blocking the event loop) until the method is done
    yield self._send_parking_spots_request(data)
    self.finish()

@gen.coroutine
def _send_parking_spots_request(self, data):
    msg = Message()
    msg.set_body(json.dumps(data))
    yield gen.Task(write_msg, self._sqs_send_queue, msg)

def write_msg(queue, msg, callback=None):
    queue.write(msg, callback=callback)  # This has to do non-blocking I/O.
In this example, queue.write would need to be some API that sends your request using non-blocking I/O, and executes callback when a response is received. Without knowing exactly what queue in your original example is, I can't specify exactly how that can be implemented in your case.
Edit: Assuming you're using boto, you may want to check out bototornado, which implements the exact same API I described above:
def write(self, message, callback=None):
    """
    Add a single message to the queue.

    :type message: Message
    :param message: The message to be written to the queue

    :rtype: :class:`boto.sqs.message.Message`
    :return: The :class:`boto.sqs.message.Message` object that was written.
    """
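Since bototornado's write() already takes a callback, the extra write_msg wrapper becomes unnecessary; a sketch of how _send_parking_spots_request could drive it directly (assuming self._sqs_send_queue is a bototornado queue object):

    @gen.coroutine
    def _send_parking_spots_request(self, data):
        msg = Message()
        msg.set_body(json.dumps(data))
        # gen.Task supplies the callback argument and resumes the
        # coroutine when bototornado invokes it.
        yield gen.Task(self._sqs_send_queue.write, msg)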

Twisted web - write() called after finish()?

I have the following Resource to handle http POST request with twisted web:
class RootResource(Resource):
    isLeaf = True

    def errback(self, failure):
        print "Request finished with error: %s" % str(failure.value)
        return failure

    def write_response_happy(self, result):
        self.request.write('HAPPY!')
        self.request.finish()

    def write_response_unhappy(self, result):
        self.request.write('UNHAPPY!')
        self.request.finish()

    @defer.inlineCallbacks
    def method_1(self):
        # IRL I have many more queries to MySQL, cassandra and memcache to get
        # the final result; this is why I use inlineCallbacks to keep the code clean.
        res = yield dbpool.runQuery('SELECT something FROM table')
        # Now I make a decision based on the result of the queries:
        if res:  # Doesn't make much sense but that's only an example
            # self.d is already available after yield, so this looks OK?
            self.d.addCallback(self.write_response_happy)
        else:
            self.d.addCallback(self.write_response_unhappy)
        returnValue(None)

    def render_POST(self, request):
        self.request = request
        self.d = self.method_1()
        self.d.addErrback(self.errback)
        return server.NOT_DONE_YET

root = RootResource()
site = server.Site(root)
reactor.listenTCP(8002, site)
dbpool = adbapi.ConnectionPool('MySQLdb', host='localhost', db='mydb',
                               user='myuser', passwd='mypass', cp_reconnect=True)
print "Serving on 8002"
reactor.run()
I've used the ab tool (from apache utils) to test 5 POST requests one after another:
ab -n 5 -p sample_post.txt http://127.0.0.1:8002/
Works fine!
Then I tried to run the same 5 POST requests simultaneously:
ab -n 5 -c 5 -p sample_post.txt http://127.0.0.1:8002/
Here I'm getting errors: exceptions.RuntimeError: Request.write called on a request after Request.finish was called. What am I doing wrong?
As Mualig suggested in his comments, you have only one instance of RootResource. When you assign to self.request and self.d in render_POST, you overwrite whatever value those attributes already had. If two requests arrive at around the same time, then this is a problem. The first Request and Deferred are discarded and replaced by new ones associated with the request that arrives second. Later, when your database operation finishes, the second request gets both results and the first one gets none at all.
This is an example of a general mistake in concurrent programming. Your per-request state is kept where it is shared between multiple requests. When multiple requests are handled concurrently, that sharing turns into a fight, and (at least) one request has to lose.
Try keeping your per-request state where it won't be shared between multiple requests. For example, try keeping it on the Deferred:
class RootResource(Resource):
    isLeaf = True

    def errback(self, failure):
        print "Request finished with error: %s" % str(failure.value)
        # You just handled the error; don't return the failure.
        # Nothing later in the callback chain is doing anything with it.
        # return failure

    def write_response(self, result, request):
        # No "self.request" anymore, just use the argument
        request.write(result)
        request.finish()

    @defer.inlineCallbacks
    def method_1(self):
        # IRL I have many more queries to MySQL, cassandra and memcache to get
        # the final result; this is why I use inlineCallbacks to keep the code clean.
        res = yield dbpool.runQuery('SELECT something FROM table')
        # Now I make a decision based on the result of the queries:
        if res:  # Doesn't make much sense but that's only an example
            # No "self.d" anymore, just produce a result. No shared state to confuse.
            returnValue("HAPPY!")
        else:
            returnValue("UNHAPPY!")

    def render_POST(self, request):
        # No more attributes on self. Just start the operation.
        d = self.method_1()
        # Push the request object into the Deferred. It'll be passed to
        # write_response, which is what needs it. Each call to method_1
        # returns a new Deferred, so there is no shared state here.
        d.addCallback(self.write_response, request)
        d.addErrback(self.errback)
        return server.NOT_DONE_YET
