Celery ignores retry_backoff and instead retries after 180 seconds repeatedly - python

Celery version number: 4.4.5
I have a function decorated like so:
@app.task(bind=True, retry_backoff=5, retry_jitter=False, retry_kwargs={"max_retries": 5})
def foo(self):
    try:
        ...  # work
    except Exception:
        try:
            _log.info("retrying task")
            self.retry()
        except MaxRetriesExceededError:
            _log.error("Permanent failure")
I would expect this to retry after 5 seconds, then again after 10, then again after 20, then 40, then 80.
Instead, celery logs 'retrying task after 180 seconds', which it does. It then repeats the same procedure twice to make three retries in total, before giving up.
From what I've read in the docs, this seems to be the correct way to do it. Am I doing something wrong?

The retry_backoff option applies only to autoretries, i.e. the ones you specify with the autoretry_for task decorator parameter:
A boolean, or a number. If this option is set to True, autoretries will be delayed following the rules of exponential backoff.
In your case, you are calling self.retry() yourself so the retry backoff doesn't apply.
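Since retry_backoff doesn't apply to manual self.retry() calls, one way to get the same effect is to compute the countdown yourself. Below is a minimal sketch, reusing the question's foo and _log and assuming a base delay of 5 seconds; the exponent comes from self.request.retries, which counts how many times the task has already been retried:
from celery.exceptions import MaxRetriesExceededError

@app.task(bind=True, max_retries=5)
def foo(self):
    try:
        ...  # work
    except Exception as exc:
        try:
            # manual exponential backoff: 5, 10, 20, 40, 80 seconds
            delay = 5 * (2 ** self.request.retries)
            _log.info("retrying task in %s seconds", delay)
            self.retry(exc=exc, countdown=delay)
        except MaxRetriesExceededError:
            _log.error("Permanent failure")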
EDIT:
To handle the cleanup actions after failure, consider this example:
from celery import Celery
from celery.utils.log import get_task_logger
app = Celery(broker='pyamqp://')
logger = get_task_logger(__name__)
def cleanup(self, exc, task_id, args, kwargs, einfo):
    logger.error('An error has occurred, cleaning up...')

@app.task(autoretry_for=(ZeroDivisionError,), retry_kwargs={'max_retries': 3},
          retry_backoff=True, on_failure=cleanup)
def fail():
    return 1/0
When you call the fail task, it will fail 3 times and then raise the ZeroDivisionError exception. It will also call the cleanup function to do the cleanup. So you don't care whether the task gets retried; you react to the fact that the task failed and handle it accordingly in the on_failure callback. If your actions should depend on which exception occurred, you can use the arguments that cleanup is called with.
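To see this in action, assuming a worker is consuming from the same broker, you only need to send the task:
# after 3 autoretries with exponential backoff the task ends in the
# FAILURE state and cleanup() runs via the on_failure handler
result = fail.delay()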

Related

How to protect against submitting too many jobs to a ThreadPoolExecutor?

I recently came across the ThreadPoolExecutor class and have been using it in a toy project. ThreadPoolExecutor has a _work_queue field - which, when I submit more tasks than the number of workers assigned to the executor, starts filling up:
>>> from concurrent.futures import ThreadPoolExecutor
>>> from time import sleep
>>> def wait():
...     sleep(1000)
...
>>> tpe = ThreadPoolExecutor(max_workers=10)
>>> tpe._work_queue.qsize()
0
>>> for i in range(11):
...     tpe.submit(wait)
...
<Future at 0x10cf63908 state=running>
[...snip...]
<Future at 0x10d20b278 state=pending>
>>> tpe._work_queue.qsize()
1
I notice that the _work_queue has a method full(), which, presumably, indicates that the queue cannot take any more tasks. I would expect an exception to be thrown if I submit more tasks than the queue can hold - however, I don't see that behaviour referenced anywhere in the documentation, and I wasn't able to replicate it even after adding more than 100,000 tasks to my Executor.
Right now, I'm defending against this behaviour with:
for task_to_do in my_tasks:
    if tpe._work_queue.full():
        sleep(0.1)
    tpe.submit(task_to_do)
which feels hacky because of the reference to the "private" queue - I guess it would be more pythonic to do:
for task_to_do in my_tasks:
    task_added = False
    while not task_added:
        try:
            tpe.submit(task_to_do)
            task_added = True
        except SomeExceptionWhoseNameIDoNotKnowYet as e:
            pass
but, in order to do so, I need to know what kind of Exception would be thrown (or, I guess, just catch Exception)
Override the submit method of ThreadPoolExecutor:
from concurrent.futures import ThreadPoolExecutor, _base
# NOTE: these are CPython private internals (Python 3.9+) and may change
# between versions; the body below mirrors the stock submit() plus one check
from concurrent.futures.thread import (BrokenThreadPool, _WorkItem,
                                       _global_shutdown_lock, _shutdown)

class YourThreadPoolExecutor(ThreadPoolExecutor):
    def submit(self, fn, /, *args, **kwargs):
        # reject new work once the backlog exceeds the worker count
        if self._work_queue.qsize() > self._max_workers:
            raise Exception('Too many submits')
        with self._shutdown_lock, _global_shutdown_lock:
            if self._broken:
                raise BrokenThreadPool(self._broken)
            if self._shutdown:
                raise RuntimeError('cannot schedule new futures after shutdown')
            if _shutdown:
                raise RuntimeError('cannot schedule new futures after '
                                   'interpreter shutdown')
            f = _base.Future()
            w = _WorkItem(f, fn, args, kwargs)
            self._work_queue.put(w)
            self._adjust_thread_count()
            return f
Then always use YourThreadPoolExecutor instead of ThreadPoolExecutor.
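A short usage sketch under the same assumptions (my_tasks comes from the question; the bare Exception is whatever the overridden submit raises when the queue check trips, so in real code you would define a dedicated exception class and catch only that):
from time import sleep

tpe = YourThreadPoolExecutor(max_workers=10)
for task_to_do in my_tasks:
    submitted = False
    while not submitted:
        try:
            tpe.submit(task_to_do)
            submitted = True
        except Exception:
            # the queue is considered full; back off briefly and retry
            sleep(0.1)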

Terminating a program within python

Hi, I embedded a time constraint into my Python code, which runs a Fortran code via a function. However, I realized that the function that puts a time constraint on the other function doesn't terminate the code; it just leaves it in the background and skips it instead. What I want to do is terminate the code when it exceeds the constrained time. Here is the code that I'm using to constrain time, which is taken from here.
def timeout(func, args=(), kwargs={}, timeout_duration=15, default=1):
    import signal

    class TimeoutError(Exception):
        pass

    def handler(signum, frame):
        raise TimeoutError()

    # set the timeout handler
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout_duration)
    try:
        result = func(*args, **kwargs)
    except TimeoutError as exc:
        result = default
    finally:
        signal.alarm(0)
    return result
I looked up popen.terminate(), sys.exit() and atexit.register(), but couldn't figure out how they would work with this piece of code. I tried to add them in the part below, at the line marked with a comment.
...
    result = func(*args, **kwargs)
except TimeoutError as exc:
    result = default
    # add the terminator
finally:
...
NOTE: The function is inside a for loop, so I don't want to kill the whole Python session; I just want to kill the program that this function runs (a Fortran code) and skip to the next element in the for loop.
Part below added after some comments and answers:
I tried to add SIGTERM with popen.terminate(); however, it terminated the whole Python session, whereas I just want to terminate the currently running program and skip to the next element in the for loop. What I did is as follows:
...
signal.signal(signal.SIGTERM, handler)
signal.alarm(timeout_duration)
try:
    result = func(*args, **kwargs)
except TimeoutError as exc:
    result = default
    popen.terminate()
...
You cannot expect an exception raised by the signal handler to propagate through the call stack; the handler is invoked in a different context.
popen.terminate() will generate SIGTERM on posix systems so you should have a signal handler for SIGTERM and not SIGALRM.
Instead of raising an exception in your signal handler, you should set some variable that you periodically check in order to halt activity.
Alternatively if you don't have a signal handler for SIGTERM the default handler will probably generate a KeyboardInterrupt exception.
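A minimal sketch of that flag-based approach, assuming Python 3 on a POSIX system (signal.alarm is Unix-only), that the Fortran program is started with subprocess.Popen, and that run_with_limit and cmd are hypothetical names:
import signal
import subprocess

def run_with_limit(cmd, timeout_duration=15):
    """Run an external command, terminate it if it runs too long, then move on."""
    proc = subprocess.Popen(cmd)
    timed_out = []                       # the variable we check afterwards

    def handler(signum, frame):
        timed_out.append(True)
        proc.terminate()                 # kills only the child, not Python

    signal.signal(signal.SIGALRM, handler)
    signal.alarm(timeout_duration)
    try:
        proc.wait()                      # returns once the child exits
    finally:
        signal.alarm(0)
    return None if timed_out else proc.returncode
The for loop driving the Fortran runs keeps going either way; a None return just marks the runs that were cut off.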

How to abort App Engine pipelines gracefully?

Problem
I have a chain of pipelines:
class PipelineA(base_handler.PipelineBase):
    def run(self, *args):
        # do something
        ...

class PipelineB(base_handler.PipelineBase):
    def run(self, *args):
        # do something
        ...

class EntryPipeline(base_handler.PipelineBase):
    def run(self):
        if some_condition():
            self.abort("Condition failed. Pipeline aborted!")
        yield PipelineA()
        mr_output = yield mapreduce_pipeline.MapreducePipeline(
            # mapreduce configs here
            # ...
        )
        yield PipelineB(mr_output)

p = EntryPipeline()
p.start()
In EntryPipeline, I am testing some conditions before starting PipelineA, MapreducePipeline and PipelineB. If the condition fails, I want to abort EntryPipeline and all subsequent pipelines.
Questions
What is a graceful pipeline abortion? Is self.abort() the correct way to do it or do I need sys.exit()?
What if I want to abort from inside PipelineA? E.g. PipelineA kicks off successfully, but I want to prevent the subsequent pipelines (MapreducePipeline and PipelineB) from starting.
Edit:
I ended up moving the condition check outside of EntryPipeline, so the whole thing only starts if the condition is true. Otherwise, I think Nick's answer is correct.
Since the docs currently say "TODO: Talk about explicit abort and retry"
we'll have to read the source:
https://github.com/GoogleCloudPlatform/appengine-pipelines/blob/master/python/src/pipeline/pipeline.py#L703
def abort(self, abort_message=''):
    """Mark the entire pipeline up to the root as aborted.

    Note this should only be called from *outside* the context of a running
    pipeline. Synchronous and generator pipelines should raise the 'Abort'
    exception to cause this behavior during execution.

    Args:
      abort_message: Optional message explaining why the abort happened.

    Returns:
      True if the abort signal was sent successfully; False if the pipeline
      could not be aborted for any reason.
    """
So if you have a handle to some_pipeline that isn't self, you can call some_pipeline.abort()... but if you want to abort yourself you need to raise Abort() ... and that will bubble up to the top and kill the whole tree
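For the second question, here is a sketch (not verified against the library) of aborting from inside PipelineA by raising that Abort exception; the import path and something_went_wrong() are assumptions:
from pipeline import pipeline  # appengine-pipelines

class PipelineA(base_handler.PipelineBase):
    def run(self, *args):
        if something_went_wrong():  # hypothetical check
            # raising Abort from inside a running pipeline bubbles up to the
            # root, so MapreducePipeline and PipelineB never start
            raise pipeline.Abort("PipelineA hit a fatal condition")
        # otherwise do the real work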

Manually return an error result and status failure for a celery task?

I've created celery tasks to run some various jobs that were written in javascript by way of nodejs. The task is basically a subprocess.popen that invokes nodejs.
The nodejs job will return a non-zero status when exiting, along with error information written to stderr.
When this occurs, I want to take the stderr, and return those as "results" to celery, along with a FAILURE status, that way my jobs monitor can reflect that the job failed.
How can I do this?
This is my task
@app.task
def badcommand():
    try:
        output = subprocess.check_output('ls foobar', stderr=subprocess.STDOUT, shell=True)
        return output
    except subprocess.CalledProcessError as er:
        # What do I do here to return er.output, and set the status to fail?
        ...
If I don't catch the subprocess exception, the job properly fails, but the result is empty and I get a traceback instead.
If I catch the exception and return er.output, the job completes as a success.
You can use the celery.app.task.Task.update_state method to update the current task's state.
@app.task(bind=True)
def badcommand(self):
    try:
        output = subprocess.check_output('ls foobar', stderr=subprocess.STDOUT, shell=True)
        return output
    except subprocess.CalledProcessError as er:
        self.update_state(state='FAILURE', meta={'exc': er})
Note that the bind argument of the app.task decorator was introduced in Celery 3.1. If you're still using an older version, I think you can call the update_state task method this way:
@app.task
def badcommand():
    ...
    except subprocess.CalledProcessError as er:
        badcommand.update_state(state='FAILURE', meta={'exc': er})
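One caveat: if the task simply returns after update_state, the result backend can still end up recording SUCCESS. A sketch of the usual workaround, assuming Celery's Ignore exception (you may need to adjust the meta shape to whatever your result backend expects):
import subprocess
from celery.exceptions import Ignore

@app.task(bind=True)
def badcommand(self):
    try:
        return subprocess.check_output('ls foobar', stderr=subprocess.STDOUT, shell=True)
    except subprocess.CalledProcessError as er:
        # expose the command output as the "result" and mark the task failed
        self.update_state(state='FAILURE',
                          meta={'exc': str(er),
                                'output': er.output.decode(errors='replace')})
        # stop Celery from overwriting the state with SUCCESS on return
        raise Ignore()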
You can use a base Task class that specifies what to do on success and on failure.
class YourBase(Task):
    def on_success(self, retval, task_id, args, kwargs):
        print "Success"

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        print "Failure"

@app.task(base=YourBase)
def badcommand():
    output = subprocess.check_output('ls foobar', stderr=subprocess.STDOUT, shell=True)
    return output
These are the handlers that your base class can use: http://celery.readthedocs.org/en/latest/userguide/tasks.html#handlers

Error when using twisted and greenlets

I'm trying to use twisted with greenlets, so I can write synchronous looking code in twisted without using inlineCallbacks.
Here is my code:
import time, functools
from twisted.internet import reactor, threads
from twisted.internet.defer import Deferred
from functools import wraps
import greenlet

def make_async(func):
    @wraps(func)
    def wrapper(*pos, **kwds):
        d = Deferred()
        def greenlet_func():
            try:
                rc = func(*pos, **kwds)
                d.callback(rc)
            except Exception, ex:
                print ex
                d.errback(ex)
        g = greenlet.greenlet(greenlet_func)
        g.switch()
        return d
    return wrapper

def sleep(t):
    print "sleep(): greenelet:", greenlet.getcurrent()
    g = greenlet.getcurrent()
    reactor.callLater(t, g.switch)
    g.parent.switch()

def wait_one(d):
    print "wait_one(): greenelet:", greenlet.getcurrent()
    g = greenlet.getcurrent()
    active = True
    def callback(result):
        if not active:
            g.switch(result)
        else:
            reactor.callLater(0, g.switch, result)
    def errback(failure):
        if not active:
            g.throw(failure)
        else:
            reactor.callLater(0, g.throw, failure)
    d.addCallback(callback)
    d.addErrback(errback)
    active = False
    rc = g.parent.switch()
    return rc

@make_async
def inner():
    print "inner(): greenelet:", greenlet.getcurrent()
    import random, time
    interval = random.random()
    print "Sleeping for %s seconds..." % interval
    sleep(interval)
    print "done"
    return interval

@make_async
def outer():
    print "outer(): greenelet:", greenlet.getcurrent()
    print wait_one(inner())
    print "Here"

reactor.callLater(0, outer)
reactor.run()
There are 5 main parts:
A sleep function, that starts a timer, then switches back to the parent greenlet. When the timer goes off, it switches back to the greenlet that is sleeping.
A make_async decorator. This takes some synchronous-looking code and runs it in a greenlet. It also returns a deferred so the caller can register callbacks for when the code completes.
A wait_one function, which blocks the greenlet until the deferred being waited on resolves.
The inner function, which (when wrapped) returns a deferred, sleeps for a random time, and then passes the time it slept for to the deferred.
The outer function, which calls inner(), waits for it to return, then prints the return value.
When I run this code I get this output (Note the error on the last two lines):
outer(): greenelet: <greenlet.greenlet object at 0xb729cc5c>
inner(): greenelet: <greenlet.greenlet object at 0xb729ce3c>
Sleeping for 0.545666723422 seconds...
sleep(): greenelet: <greenlet.greenlet object at 0xb729ce3c>
wait_one(): greenelet: <greenlet.greenlet object at 0xb729cc5c>
done
0.545666723422
Here
Exception twisted.python.failure.Failure: <twisted.python.failure.Failure <class 'greenlet.GreenletExit'>> in <greenlet.greenlet object at 0xb729ce3c> ignored
GreenletExit did not kill <greenlet.greenlet object at 0xb729ce3c>
Doing a bit of research I've found that:
The last line is logged by greenlet.c
The previous line is logged by Python itself, as it's ignoring an exception raised in a __del__ method.
I'm having real trouble debugging this as I can't access the GreenletExit or twisted.python.failure.Failure exceptions to get their stack traces.
Does anyone have any ideas what I'm doing wrong, or how I get debug the exceptions that are being thrown?
One other data point: If I hack wait_one() to just return immediately (and not to register anything on the deferred it is passed), the errors go away. :-/
Rewrite your error callback in wait_one like this:
def errback(failure):
    ## new code
    if g.dead:
        return
    ##
    if not active:
        g.throw(failure)
    else:
        reactor.callLater(0, g.throw, failure)
If the greenlet is dead (it has finished running), there is no point in throwing exceptions into it.
mguijarr's answer fixed the problem, but I wanted to write up how I got into this situation.
I have three greenlets:
{main}, which is running the reactor.
{outer}, which is running outer().
{inner}, which is running inner().
When the sleep finishes, {main} switches to {inner}, which switches to {outer}. {outer} then returns and raises GreenletExit in {inner}. This propagates back to twisted: it sees an exception being raised from callback(), and so invokes errback(). This tries to throw the exception into {outer} (which has already exited), and I hit the error.
