Fully loaded multi-tenant Django application with thousands of WebSockets using Daphne/Channels, running fine for a few months, until suddenly tenants all started calling the support line saying the application was running slow or outright hanging. I narrowed it down to WebSockets, as HTTP REST API hits came through fast and error free.
None of the application logs or OS logs indicate any issue, so the only thing to go on is the exception noted below. It happened over and over again, here and there, throughout 2 days.
I don't expect any deep debugging help, just some off-the-cuff advice on possibilities.
AWS Linux 1
Python 3.6.4
Elasticache Redis 5.0
channels==2.4.0
channels-redis==2.4.2
daphne==2.5.0
Django==2.2.13
Split configuration: HTTP served by uWSGI, ASGI (WebSockets) served by Daphne, Nginx in front
May 10 08:08:16 prod-b-web1: [pid 15053] [version 119.5.10.5086] [tenant_id -] [domain_name -] [pathname /opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/daphne/server.py] [lineno 288] [priority ERROR] [funcname application_checker] [request_path -] [request_method -] [request_data -] [request_user -] [request_stack -] Exception inside application: Lock is not acquired.
Traceback (most recent call last):
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels_redis/core.py", line 435, in receive
real_channel
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels_redis/core.py", line 484, in receive_single
await self.receive_clean_locks.acquire(channel_key)
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels_redis/core.py", line 152, in acquire
return await self.locks[channel].acquire()
File "/opt/python3.6/lib/python3.6/asyncio/locks.py", line 176, in acquire
yield from fut
concurrent.futures._base.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels/sessions.py", line 183, in __call__
return await self.inner(receive, self.send)
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels/middleware.py", line 41, in coroutine_call
await inner_instance(receive, send)
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels/consumer.py", line 59, in __call__
[receive, self.channel_receive], self.dispatch
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels/utils.py", line 58, in await_many_dispatch
await task
File "/opt/releases/r119.5.10.5086/env/lib/python3.6/site-packages/channels_redis/core.py", line 447, in receive
self.receive_lock.release()
File "/opt/python3.6/lib/python3.6/asyncio/locks.py", line 201, in release
raise RuntimeError('Lock is not acquired.')
RuntimeError: Lock is not acquired.
First, let's have a look at the source of the RuntimeError: Lock is not acquired. error. As given by the traceback, the release() method in the file /opt/python3.6/lib/python3.6/asyncio/locks.py is defined like so:
def release(self):
    """Release a lock.

    When the lock is locked, reset it to unlocked, and return.
    If any other coroutines are blocked waiting for the lock to become
    unlocked, allow exactly one of them to proceed.

    When invoked on an unlocked lock, a RuntimeError is raised.

    There is no return value.
    """
    if self._locked:
        self._locked = False
        self._wake_up_first()
    else:
        raise RuntimeError('Lock is not acquired.')
A primitive lock is a synchronization primitive that is not owned by a particular coroutine when locked. Calling release() on an unlocked lock raises the RuntimeError, since the method should only be called in the locked state; calling it in the locked state resets the lock to unlocked.
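You can reproduce the error in isolation; here is a minimal sketch using only the stdlib primitive (nothing channels_redis-specific):

import asyncio

async def main():
    lock = asyncio.Lock()
    try:
        lock.release()  # never acquired, so the lock is still unlocked
    except RuntimeError as e:
        print(e)  # -> Lock is not acquired.

asyncio.get_event_loop().run_until_complete(main())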
Now for the first exception, raised in the acquire() method of the same file. acquire() is defined like so:
async def acquire(self):
    """Acquire a lock.

    This method blocks until the lock is unlocked, then sets it to
    locked and returns True.
    """
    if (not self._locked and (self._waiters is None or
            all(w.cancelled() for w in self._waiters))):
        self._locked = True
        return True

    if self._waiters is None:
        self._waiters = collections.deque()
    fut = self._loop.create_future()
    self._waiters.append(fut)

    # Finally block should be called before the CancelledError
    # handling as we don't want CancelledError to call
    # _wake_up_first() and attempt to wake up itself.
    try:
        try:
            await fut
        finally:
            self._waiters.remove(fut)
    except exceptions.CancelledError:
        if not self._locked:
            self._wake_up_first()
        raise

    self._locked = True
    return True
So in order for the concurrent.futures._base.CancelledError you're getting to be raised, the future being waited on at await fut must have been cancelled.
To fix it, you can have a look at Awaiting an asyncio.Future raises concurrent.futures._base.CancelledError instead of waiting for a value/exception to be set
Basically, you might have an awaitable in your code that you didn't await; by not awaiting it, you never handed control back to the event loop nor stored the awaitable, so it was immediately cleaned up, completely cancelling it (and all of the awaitables it controlled).
Simply make sure you await the results of the awaitables in your code, finding any you missed.
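To see the mechanics in isolation (an illustrative sketch, not the channels_redis code itself): a task cancelled while suspended on an await has CancelledError raised at exactly that await point:

import asyncio

async def waiter(fut):
    await fut  # a cancelled task sees CancelledError raised right here

async def main():
    loop = asyncio.get_event_loop()
    fut = loop.create_future()
    task = asyncio.ensure_future(waiter(fut))
    await asyncio.sleep(0)  # let the task start and block on the future
    task.cancel()           # cancellation is delivered at the `await fut`
    try:
        await task
    except asyncio.CancelledError:
        print('waiter was cancelled mid-await')

asyncio.get_event_loop().run_until_complete(main())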
Update: asyncio simply does what it's told and you can handle these exceptions just fine - see my follow-up answer that I've marked as the solution to this question. Original question below, with slightly modified example to clarify the issue and its solution.
I've been trying to debug a library that I'm working on that relies heavily on asyncio. While working on some example code, I realised that performing a keyboard interrupt (CTRL-C) sometimes (rarely!) triggered the dreaded...
Task exception was never retrieved
I've tried hard to make sure that all tasks that I spin off handle asyncio.CancelledError gracefully, and after having spent way too many hours debugging this I realised that I only end up with this error message if one of the asyncio tasks is stuck on a blocking operation.
Blocking? You really shouldn't perform blocking work in tasks - that's why asyncio is kind enough to warn you about this. Run the below code...
import asyncio
from time import sleep


async def possibly_dangerous_sleep(i: int, use_blocking_sleep: bool = True):
    try:
        print(f"Sleep #{i}: Fine to cancel me within the next 2 seconds")
        await asyncio.sleep(2)
        if use_blocking_sleep:
            print(
                f"Sleep #{i}: Not fine to cancel me within the next 10 seconds UNLESS someone is"
                " awaiting me, e.g. asyncio.gather()"
            )
            sleep(10)
        else:
            print(f"Sleep #{i}: Will sleep using asyncio.sleep(), nothing to see here")
            await asyncio.sleep(10)
        print(f"Sleep #{i}: Fine to cancel me now")
        await asyncio.sleep(2)
    except asyncio.CancelledError:
        print(f"Sleep #{i}: So, I got cancelled...")
        raise


def done_cb(task: asyncio.Task):
    name = task.get_name()
    try:
        task.exception()
    except asyncio.CancelledError:
        print(f"Done: Task {name} was cancelled")
    except Exception as e:
        print(f"Done: Task {name} didn't handle exception { e }")
    else:
        print(f"Done: Task {name} is simply done")


async def start_doing_stuff(collect_exceptions_when_gathering: bool = False):
    tasks = []
    for i in range(1, 7):
        task = asyncio.create_task(
            possibly_dangerous_sleep(i, use_blocking_sleep=True), name=str(i)
        )
        task.add_done_callback(done_cb)
        tasks.append(task)
    # await asyncio.sleep(3600)
    results = await asyncio.gather(*tasks, return_exceptions=collect_exceptions_when_gathering)


if __name__ == "__main__":
    try:
        asyncio.run(start_doing_stuff(collect_exceptions_when_gathering=False), debug=True)
    except KeyboardInterrupt:
        print("User aborted through keyboard")
...and the debug console will tell you something along the lines of:
Executing <Task finished name='Task-2' coro=<possibly_dangerous_sleep() done, defined at ~/src/hej.py:5> result=None created at ~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/tasks.py:337> took 10.005 seconds
Rest assured that the above call to sleep(10) isn't the culprit in the library I'm working on, but it illustrates the issue I'm running into: if I try to interrupt the above test application within the first 2 to 12 seconds of it running, the debug console will end up with a hefty source traceback:
Fine to cancel me within the next 2 seconds
Not fine to cancel me within the next 10 seconds UNLESS someone is awaiting me, e.g. asyncio.gather()
^CDone with: <Task finished name='Task-2' coro=<possibly_dangerous_sleep() done, defined at ~/src/hej.py:5> exception=KeyboardInterrupt() created at ~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/tasks.py:337>
User aborted through keyboard
Task exception was never retrieved
future: <Task finished name='Task-2' coro=<dangerous_sleep() done, defined at ~/src/hej.py:5> exception=KeyboardInterrupt() created at ~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/tasks.py:337>
source_traceback: Object created at (most recent call last):
File "~/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "~/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "~/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
cli.main()
File "~/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
run()
File "~/.vscode/extensions/ms-python.python-2021.12.1559732655/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
File "~/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 269, in run_path
return _run_module_code(code, init_globals, run_name,
File "~/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 96, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "~/.pyenv/versions/3.10.0/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "~/src/hej.py", line 37, in <module>
asyncio.run(start_doing_stuff(), debug=True)
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 628, in run_until_complete
self.run_forever()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 595, in run_forever
self._run_once()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 1873, in _run_once
handle._run()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "~/src/hej.py", line 28, in start_doing_stuff
task = asyncio.create_task(dangerous_sleep())
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/tasks.py", line 337, in create_task
task = loop.create_task(coro)
Traceback (most recent call last):
File "~/src/hej.py", line 37, in <module>
asyncio.run(start_doing_stuff(), debug=True)
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 628, in run_until_complete
self.run_forever()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 595, in run_forever
self._run_once()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/base_events.py", line 1873, in _run_once
handle._run()
File "~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "~/src/hej.py", line 14, in dangerous_sleep
sleep(10)
KeyboardInterrupt
If I replace await asyncio.sleep(3600) with await asyncio.gather(task) (see the example code) and invoke CTRL-C, I instead get a very neat shutdown sequence in my debug console:
Fine to cancel me within the next 2 seconds
Not fine to cancel me within the next 10 seconds UNLESS someone is awaiting me, e.g. asyncio.gather()
^CDone with: <Task finished name='Task-2' coro=<possibly_dangerous_sleep() done, defined at ~/src/hej.py:5> exception=KeyboardInterrupt() created at ~/.pyenv/versions/3.10.0/lib/python3.10/asyncio/tasks.py:337>
User aborted through keyboard
Can someone explain to me if this is by design? I was expecting all asyncio tasks to be cancelled for me when asyncio.run() was interrupted (while cleaning up after itself).
Summary: You need to handle your exceptions, or asyncio will complain.
For background tasks (i.e. tasks that you don't explicitly wait for using gather())
You might think that trying to catch cancellation using except asyncio.CancelledError (and re-raising it) within your task would handle all types of cancellation. That's not the case. If your task is performing blocking work while being cancelled, you won't be able to catch the exception (e.g. KeyboardInterrupt) within the task itself. The safe bet here is to register a done callback using add_done_callback on your asyncio.Task. In this callback, check if there was an exception (see the updated example code in the question). If your task was stuck on blocking work while being cancelled, the done callback will tell you that the task was done (vs cancelled).
For a bunch of tasks that you await using gather()
If you use gather, you don't need to add done callbacks. Instead, ask it to return any exceptions and it will handle KeyboardInterrupt just fine. If you don't do this, the first exception being raised within any of its awaitables is immediately propagated to the task that awaits on gather(). In the case of a KeyboardInterrupt inside a task that's stuck doing blocking work, KeyboardInterrupt will be re-raised and you'll need to handle it. Alternatively, use try/except to handle any exceptions raised. Please try this yourself by setting the collect_exceptions_when_gathering variable in the example code.
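Here is a stripped-down illustration of the difference, using throwaway coroutines rather than the example code above:

import asyncio

async def fails():
    raise ValueError("boom")

async def succeeds():
    return 42

async def main():
    # With return_exceptions=True, exceptions are collected as results
    # instead of being propagated to whoever awaits gather().
    results = await asyncio.gather(fails(), succeeds(), return_exceptions=True)
    print(results)  # [ValueError('boom'), 42]

    # Without it, the first exception propagates and must be handled here.
    try:
        await asyncio.gather(fails(), succeeds())
    except ValueError as e:
        print(f"gather() re-raised: {e}")

asyncio.run(main())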
Finally: the only thing I don't understand now is that I don't see any exception being raised if one calls gather() with a single task, not asking it to return exceptions. Try to modify the example code to have its range be range(1,2) and you won't get a messy stack trace on CTRL-C...?
I'm using the Firebase Realtime Database listener to listen to changes on a database path.
My program recently crashed because of the following 503 error that seems to be raised by the underlying requests library:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/threading.py", line 917, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.7/threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/site-packages/firebase_admin/db.py", line 123, in _start_listen
for sse_event in self._sse:
File "/usr/local/lib/python3.7/site-packages/firebase_admin/_sseclient.py", line 128, in __next__
self._connect()
File "/usr/local/lib/python3.7/site-packages/firebase_admin/_sseclient.py", line 112, in _connect
self.resp.raise_for_status()
File "/usr/local/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Unavailable for url: https://database_url...
My listener initialization is wrapped in a try statement, so I'm unsure why this wasn't caught, swallowed, and retried as I expected it to be:
def init_listener():
    try:
        listener = firebase_admin.db.reference(db_path).listen(handle_change)
    except Exception as e:
        time.sleep(1)  # Retry in one second.
        init_listener()
I'd like to handle future 503 errors, but I'm not sure how to go about doing this.
Additionally, I'm using except Exception as e above for demo/debugging purposes, but I'm also not sure if requests.exceptions.HTTPError will be specific enough to catch only 500-level errors (though I don't know what other errors can be raised).
From the firebase_admin reference docs:
This API is based on the event streaming support available in the
Firebase REST API. Each call to listen() starts a new HTTP connection
and a background thread. This is an experimental feature.
The key here is that this all runs in a background thread. Therefore, wrapping the call to listen() in a try/except will not catch exceptions thrown in the thread. There is no simple way to catch the exceptions happening in the background thread.
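A quick way to convince yourself of this is a generic sketch, unrelated to firebase_admin internals:

import threading
import time

def worker():
    raise RuntimeError("raised in the background thread")

try:
    t = threading.Thread(target=worker)
    t.start()  # returns immediately; the exception fires later, in the thread
    t.join()   # join() does not propagate the worker's exception either
except RuntimeError:
    print("never reached")  # nothing from the thread lands in this except

time.sleep(0.1)  # the traceback is printed by the threading machinery instead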
To solve your issue, you will probably need to know more about why the database is returning an HTTP 503 status. Or you will need to switch to some other firebase_admin API that will allow you to catch and ignore these exceptions.
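If you do find a spot where you can catch the exception yourself (for example by polling with get() instead of using listen()), you can filter on the status code carried by the HTTPError. A hypothetical retry helper, purely as a sketch:

import time

import requests

def call_with_retry(fn, retries=5):
    # Hypothetical helper: retry fn() on 5xx responses, re-raise anything else.
    for attempt in range(retries):
        try:
            return fn()
        except requests.exceptions.HTTPError as e:
            status = e.response.status_code if e.response is not None else None
            if status is not None and 500 <= status < 600:
                time.sleep(2 ** attempt)  # back off before retrying
                continue
            raise  # 4xx and anything else is not retryable
    raise RuntimeError('gave up after {} retries'.format(retries))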
Making a Discord bot using discord.py; this is the first time I've worked with asyncio, and probably the first time I've encountered something this frustrating in Python.
The point of this question isn't to teach me how to use asyncio, but instead to teach me how to avoid using it, even if it's not the right way to do things.
So I needed to run the discord client coroutines from regular def functions. After hours of searching I found this: asyncio.get_event_loop().run_until_complete(...). I set up a small script to test it out:
import asyncio

async def test():
    print('Success')

asyncio.get_event_loop().run_until_complete(test())
And it worked perfectly. So I went ahead and tried to use it in a discord bot:
import discord
import asyncio

client = discord.Client()

@client.event
async def on_ready():
    test()

def test():
    asyncio.get_event_loop().run_until_complete(run())

async def run():
    print('Success')

client.run('TOKEN_HERE')
And I got an error... Stacktrace/Output:
Success
Ignoring exception in on_ready
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\site-packages\discord\client.py", line 307, in _run_event
yield from getattr(self, event)(*args, **kwargs)
File "C:/Users/OverclockedSanic/PyCharm Projects/asyncio test/test.py", line 8, in on_ready
test()
File "C:/Users/OverclockedSanic/PyCharm Projects/asyncio test/test.py", line 11, in test
asyncio.get_event_loop().run_until_complete(run())
File "C:\Program Files\Python36\lib\asyncio\base_events.py", line 454, in run_until_complete
self.run_forever()
File "C:\Program Files\Python36\lib\asyncio\base_events.py", line 408, in run_forever
raise RuntimeError('This event loop is already running')
RuntimeError: This event loop is already running
What's weird is that "Success" part at the end... I tried some other tests to see if I could return data from the coroutine or execute more stuff, but I couldn't.
I even tried replacing asyncio.get_event_loop() with client.loop, which didn't work either.
I looked for like 2 days, still no solution. Any ideas?
EDIT: Replacing get_event_loop() with new_event_loop() as suggested in the comments raised this:
Ignoring exception in on_ready
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\site-packages\discord\client.py", line 307, in _run_event
yield from getattr(self, event)(*args, **kwargs)
File "C:/Users/USER/PyCharm Projects/asyncio test/test.py", line 8, in on_ready
test()
File "C:/Users/USER/PyCharm Projects/asyncio test/test.py", line 11, in test
asyncio.new_event_loop().run_until_complete(run())
File "C:\Program Files\Python36\lib\asyncio\base_events.py", line 454, in run_until_complete
self.run_forever()
File "C:\Program Files\Python36\lib\asyncio\base_events.py", line 411, in run_forever
'Cannot run the event loop while another loop is running')
RuntimeError: Cannot run the event loop while another loop is running
Your problem seems to essentially be about mixing synchronous and asynchronous code. There are two possibilities:
1) If your non-async routines don't need to block, just to schedule some async task (e.g. send_message) to be run later, then they can simply call get_event_loop().create_task(). You can even use add_done_callback on the returned task if you want some other (non-async) routine to be called when the asynchronous operation is complete. (If the routine to be run is also non-async, then use get_event_loop().call_soon().)
2) If your non-async routines absolutely must block (which includes possibly awaiting an asynchronous routine), and cannot schedule the blocking operation for later, then you should not run them from the same thread as the main event loop. You can create a thread pool with concurrent.futures.ThreadPoolExecutor, and use loop.run_in_executor() to schedule your non-async routines, then await the result. And if they in turn need to call async routines, then run_until_complete() should work, because now you're not running in a thread that already has an event loop. (But beware of thread-safety issues. You may need something like run_coroutine_threadsafe if you need to wait for something to run in the main event loop.) A minimal sketch of this option follows.
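Here is that sketch, assuming a hypothetical blocking_io() routine:

import asyncio
import concurrent.futures
import time

def blocking_io():
    # plain synchronous, blocking work; must not run on the event loop thread
    time.sleep(1)
    return 'done'

async def main():
    loop = asyncio.get_event_loop()
    with concurrent.futures.ThreadPoolExecutor() as pool:
        # schedule the blocking routine on the pool and await its result
        result = await loop.run_in_executor(pool, blocking_io)
        print(result)

asyncio.get_event_loop().run_until_complete(main())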
If it helps, the asgiref package contains routines that can simplify this for you. They're designed for a slightly different purpose (web servers), but may also work for you. You can use await asgiref.sync.sync_to_async(func)(args) when you want to call a non-async routine from an async routine, which will run the routine in a thread pool, then use asgiref.sync.async_to_sync(func)(args) when you want to call an async routine from a non-async routine that's running inside that thread pool.
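A small sketch of those asgiref helpers; the function names here are made up for illustration:

import asyncio

from asgiref.sync import async_to_sync, sync_to_async

def blocking_work(x):
    return x * 2  # stand-in for a routine that must block

async def async_work(x):
    await asyncio.sleep(0.1)
    return x + 1

async def main():
    # sync_to_async runs blocking_work in a thread pool, so the
    # event loop stays free while it executes
    print(await sync_to_async(blocking_work)(21))  # 42

if __name__ == '__main__':
    # async_to_sync drives a coroutine to completion from plain sync
    # code running in a thread with no event loop of its own
    print(async_to_sync(async_work)(41))  # 42
    asyncio.get_event_loop().run_until_complete(main())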
I am trying to establish a long running Pull subscription to a Google Cloud PubSub topic.
I am using code very similar to the example given in the documentation here, i.e.:
import time

from google.cloud import pubsub_v1


def receive_messages(project, subscription_name):
    """Receives messages from a pull subscription."""
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(
        project, subscription_name)

    def callback(message):
        print('Received message: {}'.format(message))
        message.ack()

    subscriber.subscribe(subscription_path, callback=callback)

    # The subscriber is non-blocking, so we must keep the main thread from
    # exiting to allow it to process messages in the background.
    print('Listening for messages on {}'.format(subscription_path))
    while True:
        time.sleep(60)
The problem is that I'm receiving the following traceback sometimes:
Exception in thread Consumer helper: consume bidirectional stream:
Traceback (most recent call last):
File "/usr/lib/python3.5/threading.py", line 914, in _bootstrap_inner
self.run()
File "/usr/lib/python3.5/threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "/path/to/google/cloud/pubsub_v1/subscriber/_consumer.py", line 248, in _blocking_consume
self._policy.on_exception(exc)
File "/path/to/google/cloud/pubsub_v1/subscriber/policy/thread.py", line 135, in on_exception
raise exception
File "/path/to/google/cloud/pubsub_v1/subscriber/_consumer.py", line 234, in _blocking_consume
for response in response_generator:
File "/path/to/grpc/_channel.py", line 348, in __next__
return self._next()
File "/path/to/grpc/_channel.py", line 342, in _next
raise self
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, The service was unable to fulfill your request. Please try again. [code=8a75])>
I saw that this was referenced in another question, but here I am asking how to handle it properly in Python. I have tried to wrap the request in a try/except, but the subscriber seems to run in the background and I am not able to retry in case of that error.
A somewhat hacky approach that is working for me is a custom policy_class. The default one has an on_exception function that ignores DEADLINE_EXCEEDED. You can make a class that inherits the default and also ignores UNAVAILABLE. Mine looks like this:
from google.cloud import pubsub
from google.cloud.pubsub_v1.subscriber.policy import thread
import grpc


class AvailablePolicy(thread.Policy):

    def on_exception(self, exception):
        """The parent ignores DEADLINE_EXCEEDED. Let's also ignore UNAVAILABLE.

        I'm not sure what triggers that error, but if you ignore it, your
        subscriber seems to work just fine. It's probably an intermittent
        thing and it reconnects later if you just give it a chance.
        """
        # If this is UNAVAILABLE, then we want to retry.
        # That entails just returning None.
        unavailable = grpc.StatusCode.UNAVAILABLE
        if getattr(exception, 'code', lambda: None)() == unavailable:
            return
        # For anything else, fallback on super.
        super(AvailablePolicy, self).on_exception(exception)


subscriber = pubsub.SubscriberClient(policy_class=AvailablePolicy)
# Continue to set up as normal.
It works a lot like the original on_exception; it just ignores a different error. If you want, you can add some logging whenever the exception is thrown and verify that everything still works. Future messages will still come through.
Is there a way to stop the Twisted reactor from automatically swallowing exceptions (e.g. NameError)? I just want it to stop execution and give me a stack trace in the console.
There's even a FAQ question about it, but to say the least, it's not very helpful.
Currently, in every errback I do this:
def errback(value):
    import traceback
    trace = traceback.format_exc()
    # rest of the errback...
but that feels clunky, and there has to be a better way?
Update
In response to Jean-Paul's answer, I've tried running the following code (with Twisted 11.1 and 12.0):
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet import protocol, reactor

class Broken(protocol.Protocol):
    def connectionMade(self):
        buggy_user_code()

e = TCP4ClientEndpoint(reactor, "127.0.0.1", 22)
f = protocol.Factory()
f.protocol = Broken
e.connect(f)
reactor.run()
After running it, it just hangs there, so I have to Ctrl-C it:
> python2.7 tx-example.py
^CUnhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
Let's explore "swallow" a little bit. What does it mean to "swallow" an exception?
Here's the most direct and, I think, faithful interpretation:
try:
    user_code()
except:
    pass
Here any exceptions from the call to user code are caught and then discarded with no action taken. If you look through Twisted, I don't think you'll find this pattern anywhere. If you do, it's a terrible mistake and a bug, and you would be helping out the project by filing a bug pointing it out.
What else might lead to "swallowing exceptions"? One possibility is that the exception is coming from application code that isn't supposed to be raising exceptions at all. This is typically dealt with in Twisted by logging the exception and then moving on, perhaps after disconnecting the application code from whatever event source it was connected to. Consider this buggy application:
from twisted.internet.endpoints import TCP4ClientEndpoint
from twisted.internet import protocol, reactor

class Broken(protocol.Protocol):
    def connectionMade(self):
        buggy_user_code()

e = TCP4ClientEndpoint(reactor, "127.0.0.1", 22)
f = protocol.Factory()
f.protocol = Broken
e.connect(f)
reactor.run()
When run (if you have a server running on localhost:22, so the connection succeeds and connectionMade actually gets called), the output produced is:
Unhandled Error
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 84, in callWithLogger
return callWithContext({"system": lp}, func, *args, **kw)
File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 69, in callWithContext
return context.call({ILogContext: newCtx}, func, *args, **kw)
File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
return self.currentContext().callWithContext(ctx, func, *args, **kw)
File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
return func(*args,**kw)
--- <exception caught here> ---
File "/usr/lib/python2.7/dist-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
why = getattr(selectable, method)()
File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 674, in doConnect
self._connectDone()
File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 681, in _connectDone
self.protocol.makeConnection(self)
File "/usr/lib/python2.7/dist-packages/twisted/internet/protocol.py", line 461, in makeConnection
self.connectionMade()
File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 64, in connectionMade
self._wrappedProtocol.makeConnection(self.transport)
File "/usr/lib/python2.7/dist-packages/twisted/internet/protocol.py", line 461, in makeConnection
self.connectionMade()
File "proderr.py", line 6, in connectionMade
buggy_user_code()
exceptions.NameError: global name 'buggy_user_code' is not defined
This error clearly isn't swallowed. Even though the logging system hasn't been initialized in any particular way by this application, the logged error still shows up. If the logging system had been initialized in a way that caused errors to go elsewhere - say some log file, or /dev/null - then the error might not be as apparent. You would have to go out of your way to cause this to happen though, and presumably if you direct your logging system at /dev/null then you won't be surprised if you don't see any errors logged.
In general there is no way to change this behavior in Twisted. Each exception handler is implemented separately, at the call site where application code is invoked, and each one is implemented separately to do the same thing - log the error.
One more case worth inspecting is how exceptions interact with the Deferred class. Since you mentioned errbacks I'm guessing this is the case that's biting you.
A Deferred can have a success result or a failure result. When it has any result at all and more callbacks or errbacks, it will try to pass the result to either the next callback or errback. The result of the Deferred then becomes the result of the call to one of those functions. As soon as the Deferred has gone through all of its callbacks and errbacks, it holds on to its result in case more callbacks or errbacks are added to it.
If the Deferred ends up with a failure result and no more errbacks, then it just sits on that failure. If it gets garbage collected before an errback which handles that failure is added to it, then it will log the exception. This is why you should always have errbacks on your Deferreds, at least so that you can log unexpected exceptions in a timely manner (rather than being subject to the whims of the garbage collector).
If we revisit the previous example and consider the behavior when there is no server listening on localhost:22 (or change the example to connect to a different address, where no server is listening), then what we get is exactly a Deferred with a failure result and no errback to handle it.
e.connect(f)
This call returns a Deferred, but the calling code just discards it. Hence, it has no callbacks or errbacks. When it gets its failure result, there's no code to handle it. The error is only logged when the Deferred is garbage collected, which happens at an unpredictable time. Often, particularly for very simple examples, the garbage collection won't happen until you try to shut down the program (eg via Control-C). The result is something like this:
$ python someprog.py
... wait ...
... wait ...
... wait ...
<Control C>
Unhandled error in Deferred:
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
If you've accidentally written a large program and fallen into this trap somewhere, but you're not exactly sure where, then twisted.internet.defer.setDebugging might be helpful. If the example is changed to use it to enable Deferred debugging:
from twisted.internet.defer import setDebugging
setDebugging(True)
Then the output is somewhat more informative:
exarkun#top:/tmp$ python proderr.py
... wait ...
... wait ...
... wait ...
<Control C>
Unhandled error in Deferred:
(debug: C: Deferred was created:
C: File "proderr.py", line 15, in <module>
C: e.connect(f)
C: File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 240, in connect
C: wf = _WrappingFactory(protocolFactory, _canceller)
C: File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 121, in __init__
C: self._onConnection = defer.Deferred(canceller=canceller)
I: First Invoker was:
I: File "proderr.py", line 16, in <module>
I: reactor.run()
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1162, in run
I: self.mainLoop()
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1174, in mainLoop
I: self.doIteration(t)
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/selectreactor.py", line 140, in doSelect
I: _logrun(selectable, _drdw, selectable, method, dict)
I: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 84, in callWithLogger
I: return callWithContext({"system": lp}, func, *args, **kw)
I: File "/usr/lib/python2.7/dist-packages/twisted/python/log.py", line 69, in callWithContext
I: return context.call({ILogContext: newCtx}, func, *args, **kw)
I: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
I: return self.currentContext().callWithContext(ctx, func, *args, **kw)
I: File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
I: return func(*args,**kw)
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/selectreactor.py", line 146, in _doReadOrWrite
I: why = getattr(selectable, method)()
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 638, in doConnect
I: self.failIfNotConnected(error.getConnectError((err, strerror(err))))
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 592, in failIfNotConnected
I: self.connector.connectionFailed(failure.Failure(err))
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1048, in connectionFailed
I: self.factory.clientConnectionFailed(self, reason)
I: File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 144, in clientConnectionFailed
I: self._onConnection.errback(reason)
)
Unhandled Error
Traceback (most recent call last):
Failure: twisted.internet.error.ConnectionRefusedError: Connection was refused by other side: 111: Connection refused.
Notice near the top, where the e.connect(f) line is given as the origin of this Deferred - telling you a likely place where you should be adding an errback.
However, the code should have been written to add an errback to this Deferred in the first place, at least to log the error.
There are shorter (and more correct) ways to display exceptions than the one you've given, though. For example, consider:
d = e.connect(f)

def errback(reason):
    reason.printTraceback()

d.addErrback(errback)
Or, even more succinctly:
from twisted.python.log import err
d = e.connect(f)
d.addErrback(err, "Problem fetching the foo from the bar")
This error handling behavior is somewhat fundamental to the idea of Deferred and so also isn't very likely to change.
If you have a Deferred, errors from which really are fatal and must stop your application, then you can define a suitable errback and attach it to that Deferred:
d = e.connect(f)

def fatalError(reason):
    err(reason, "Absolutely needed the foo, could not get it")
    reactor.stop()

d.addErrback(fatalError)
What you could do as a workaround is register a log listener and stop the reactor whenever you see a critical error! This is a twisted (verb) approach, but luckily all "Unhandled errors" are raised with LogLevel.critical.
from twisted.internet import reactor
from twisted.logger import globalLogPublisher
from twisted.logger._levels import LogLevel

def analyze(event):
    if event.get("log_level") == LogLevel.critical:
        print "Stopping for: ", event
        reactor.stop()

globalLogPublisher.addObserver(analyze)