Is it possible to deadlock a worker by accumulating scheduled tasks?


By code:
@celery.task()
def some_recursive_task():
    # Do some stuff and schedule it to run again later.
    # Note that the next run is not scheduled on a fixed basis, like crontabs,
    # but based on the history of some object.
    # The actual task is found here:
    # https://github.com/rafaelsierra/cheddar/blob/master/src/feeds/tasks.py#L39
    # Then it calls itself again.
    countdown = bla.get_countdown()
    some_recursive_task.apply_async(countdown=countdown)
This task will run again somewhere between 10 minutes and 12 hours later, but it also calls other tasks that should run immediately: one to download stuff and another to parse it.
The main function is called for every single record in the database. Let's assume a few hundred tasks; considering that each task runs on average every few hours, the number of tasks is not a big deal.
The problem starts when I try to run this with a single worker. When I start the worker, I set it to consume all queues with a concurrency of 8. It starts and begins to acknowledge the tasks, but it seems that, no matter how far in the future a task is scheduled, a worker process will take it and wait for its scheduled run, meaning that this worker process is locked until then.
I know that I can just split the two other functions into different queues, which I already did, but my concern is that workers will acknowledge tasks scheduled to run 12 hours ahead and will not run the ones that should run in 30 minutes.
Shouldn't workers ignore scheduled tasks until they are due, and run the ones that were simply queued without a time?
I don't think periodic tasks are a solution, or at least I don't know how they would be.

See the points 5 & 6 there. Please, keep in mind that countdown is no different from eta argument of the task.
In short, you're right. A single worker (or any number of workers) should not block on scheduled (eta or countdown) tasks.
How can you tell that workers are locked? The scheduled tasks are prefetched from the queue, but not acknowledged until they are executed.
Also, please keep in mind all scheduled tasks are kept in RAM until they're executed. You would like them to be as light as possible. From what I understand the scheduled task doesn't pass around big chunks of data, probably only some URI, so this shouldn't be a problem.
The links you've pasted return 404. Are you sure cheddar isn't a private repository?
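To make the behaviour described above concrete, here is a toy model of a worker that holds prefetched ETA tasks in an in-memory heap and only runs them once they are due, while tasks without a countdown run right away. It is only a sketch of the idea and not Celery's actual implementation; `ToyWorker`, its methods, and the task names are hypothetical.

```python
import heapq
import itertools


class ToyWorker:
    """Toy model (not Celery code): ETA tasks wait in a heap, others run at once."""

    def __init__(self):
        self._scheduled = []            # heap of (eta, seq, name), held in memory
        self._seq = itertools.count()   # tie-breaker so equal ETAs keep arrival order
        self.executed = []              # names of tasks that have run, in order

    def receive(self, name, now, countdown=0):
        """Accept a task from the queue; defer it only if it has a countdown."""
        if countdown > 0:
            heapq.heappush(self._scheduled, (now + countdown, next(self._seq), name))
        else:
            self.executed.append(name)

    def tick(self, now):
        """Run every scheduled task whose ETA has passed."""
        while self._scheduled and self._scheduled[0][0] <= now:
            _, _, name = heapq.heappop(self._scheduled)
            self.executed.append(name)

    @property
    def pending(self):
        """Number of scheduled tasks still waiting in memory."""
        return len(self._scheduled)


worker = ToyWorker()
worker.receive("refresh_feed_a", now=0, countdown=12 * 3600)  # due in 12 hours
worker.receive("refresh_feed_b", now=0, countdown=30 * 60)    # due in 30 minutes
worker.receive("download", now=0)                             # no countdown: runs now
worker.tick(now=30 * 60)                                      # 30 minutes later
print(worker.executed)  # ['download', 'refresh_feed_b']
print(worker.pending)   # 1
```

In this model the 12-hour task costs only a heap entry while it waits, which is also why keeping scheduled task arguments small matters.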

Related

Rate limit a celery task without blocking other tasks

I am trying to limit the rate of one celery task. Here is how I am doing it:
from project.celery import app
app.control.rate_limit('task_a', '10/m')
It is working well. However, there is a catch. Other tasks that this worker is responsible for are being blocked as well.
Let's say, 100 of task_a have been scheduled. As it is rate-limited, it will take 10 minutes to execute all of them. During this time, task_b has been scheduled as well. It will not be executed until task_a is done.
Is it possible to not block task_b?
By the looks of it, this is just how it works. I just didn't get that impression after reading the documentation.
Other options include:
Separate worker and queue only for this task
Adding an eta to task_a so that all of them are scheduled to run during the night
What is the best practice in such cases?

This should be part of the task declaration to work on a per-task basis. Doing it via control is probably why it has this side effect on other tasks.
@task(rate_limit='10/m')
def task_a():
    ...
After more reading:
Note that this is a per worker instance rate limit, and not a global rate limit. To enforce a global rate limit (e.g., for an API with a maximum number of requests per second), you must restrict to a given queue.
You will probably have to do this in a separate queue.

The easiest (no coding required) way is separating the task into its own queue and running a dedicated worker just for this purpose.
There's no shame in that, it is totally fine to have many Celery queues and workers, each dedicated just for a specific type of work. As an added bonus you may get some more control over the execution, you can easily turn workers ON/OFF to pause certain processes if needed, etc.
On the other hand, having lots of specialized workers idle most of the time (waiting for a specific job to be queued) is not particularly memory-efficient.
Thus, in case you need to rate limit more tasks and expect the specific workers to be idle most of the time, you may consider increasing the efficiency and implement a Token Bucket. With that all your workers can be generic-purpose and you can scale them naturally as your overall load increases, knowing that the work distribution will not be crippled by a single task's rate limit anymore.
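A minimal, single-process sketch of a token bucket is shown below to illustrate the idea. The class name, the non-blocking `try_acquire` method, and the injectable `clock` are assumptions made for this example; sharing the bucket between several workers would need shared storage, which is not shown here.

```python
import time


class TokenBucket:
    """Minimal single-process token bucket (illustrative sketch)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self._clock = clock               # injectable so the demo can fake time
        self._tokens = float(capacity)    # the bucket starts full
        self._last = clock()

    def try_acquire(self, tokens=1):
        """Take tokens if available; return False instead of blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill according to the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False


# Demo with a fake clock: 10 tokens per minute, no bursting.
fake_now = [0.0]
bucket = TokenBucket(rate=10 / 60, capacity=1, clock=lambda: fake_now[0])
first = bucket.try_acquire()     # True: the bucket starts full
second = bucket.try_acquire()    # False: empty, the caller should retry later
fake_now[0] = 7.0                # a little over six seconds later
third = bucket.try_acquire()     # True: one token has been refilled
print(first, second, third)      # True False True
```

A rate-limited task could call `try_acquire()` at its start and reschedule itself when it returns False, so the worker stays free for other tasks in the meantime.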

Use Airflow for frequent tasks

We have been using Airflow for a while, it is just great.
Now we are considering moving some of our very frequent tasks into our airflow server too.
Let's say I have a script running every second.
What's the best practice to schedule it with airflow:
1. Run this script in a DAG that is scheduled every second. I highly doubt this will be the solution, since there is significant overhead for a DAG run.
2. Run this script in a while loop that stops after 6 hours, then schedule it on Airflow to run every 6 hours.
3. Create a DAG with no schedule and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error.
Any other suggestions?
Or is this kind of task just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler?
Cheers!

What's the best practice to schedule it
... this kind of task is just not suitable for Airflow?
It is not suitable.
In particular, your airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And likely you would need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.
The six-hour suggestion (2.) is not crazy. I will view it as a sixty-minute @hourly task instead, since overheads are similar. Exiting after an hour and letting Airflow respawn has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long". And coordinating overlapping tasks, or a gap between tasks, at the hour boundary may pose some issues.
Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
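Whichever supervisor restarts the process, the script itself can be a loop that does its work at a fixed interval and exits after a bounded time. The sketch below is one way to write such a loop; `run_for` and `FakeClock` are hypothetical names, and the clock and sleep functions are injectable only so the demo runs instantly.

```python
import time


def run_for(duration, interval, work, clock=time.monotonic, sleep=time.sleep):
    """Call work() every `interval` seconds until `duration` seconds have passed.

    Returns the number of calls made.
    """
    start = clock()
    next_run = start
    calls = 0
    while next_run - start < duration:
        delay = next_run - clock()
        if delay > 0:
            sleep(delay)
        work()
        calls += 1
        next_run += interval   # schedule from the planned time to avoid drift
    return calls


class FakeClock:
    """Stand-in clock so the demo does not really sleep."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
ticks = []
count = run_for(duration=5, interval=1, work=lambda: ticks.append(fake.time()),
                clock=fake.time, sleep=fake.sleep)
print(count, ticks)  # 5 [0.0, 1.0, 2.0, 3.0, 4.0]
```

With the real clock, `run_for(duration=3600, interval=1, work=my_job)` would run for about an hour and then return, leaving the restart to the scheduler or supervisor; `my_job` stands for whatever the script does each second.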

Avoiding duplicate tasks (or dealing with them) in task queue in Google App Engine

I have a service I've developed running on GAE. The application needs to 'tick' every 3 seconds to perform a bunch of calculations. It is a simulation-type game.
I have a manually scaled instance that I start which uses the deferred API and task queue like so (some error handling etc, removed for clarity):
@app.route('/_ah/start')
def start():
    log.info('Ticker instance started')
    return tick()

@app.route('/tick')
def tick():
    _do_tick()
    deferred.defer(tick, _countdown=3)
    return 'Tick!', 200
The problem is that sometimes I end up with this being scheduled twice for some reason (likely a transient error/timeout causing the task to be rescheduled) and I end up with multiple tasks in the task queue, and the game ticking multiple times per 3-second period.
Any ideas how best to deal with this?
As far as I can see you can't ask a queue 'Are there tasks of X already there?' or 'How many items are on the queue at the moment?'.
I understand that this uses a push queue, and one idea might be to switch instead to a pull queue and have the ticker lease items off the queue, grouped by tag, which would get all of them, including duplicates. Would that be better?
In essence what I really want is just a cron-like scheduler to schedule something every 3 seconds, but I know that the scheduler on GAE likely doesn't run to that resolution.
I could just move everything into the startup handler, e.g.:
@app.route('/_ah/start')
def start():
    log.info('Ticker instance started')
    while True:
        _do_tick()
        sleep(3)
    return 200
But from what I see, the logs won't update as I do this, as it is perceived to be a single request that never completes. This makes it a bit harder to see in the logs what is going on. Currently I see each individual tick as a separate request log entry.
Also if the above gets killed, I'd need to get it to reschedule itself anyway. Which might not be too much of a hassle as I know there are exceptions you can catch when the instance is about to be shut down and I could then fire off a deferred task to start it again.
Or is there a better way to handle this on GAE?

I can't see a way to detect/eliminate duplicates, but have worked around it now using a different mechanism. Rather than rely on the task queue as a scheduler, I run my own scheduler loop in a manually scaled instance:
TICKINTERVAL = 3

@app.route('/_ah/start')
def scheduler():
    log.info('Ticker instance started')
    while True:
        if game.is_running():
            task = taskqueue.add(
                url='/v1/game/tick',
                queue_name='tickqueue',
                method='PUT',
                target='tickworker',
            )
        else:
            log.info('Tick skipped as game stopped')
        db_session.rollback()
        sleep(TICKINTERVAL)
I have defined my own queue, tickqueue in queue.yaml
queue:
- name: tickqueue
  rate: 5/s
  max_concurrent_requests: 1
  retry_parameters:
    task_retry_limit: 0
    task_age_limit: 1m
The queue doesn't retry tasks, and any tasks left on it for longer than a minute get cancelled. I set the max concurrency to 1 so that it only attempts to process one item at a time.
If an occasional 'tick' takes longer than 3 seconds then it will back up on the queue, but the queue should clear if it speeds up again. If ticks end up taking longer than 3s on average then the tasks that have been on the queue longer than a minute will get discarded.
The advantage is that I get a log entry for each tick (and it is called /v1/game/tick, so it is easy to spot, as opposed to /_ah/deferred). The downside is that I need one instance for the scheduler and one for the worker: the scheduler instance cannot process requests, because it won't do so until /_ah/start completes, which it never does here.

You can set task_retry_limit to 0 in the _retry_options optional argument, as mentioned in https://stackoverflow.com/a/36621588/4495081.
The trouble is that if a valid reason for a failure exists then the ticking job stops forever. You may want to also keep track of the last time the job executed and have a cron-based sanity-check job to periodically check that ticking is still running and restart it if not.
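The sanity check itself can be very small. The sketch below assumes the tick handler records a timestamp somewhere the cron job can read it; the function name, the storage, and the 30-second threshold are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def ticker_is_stalled(last_tick, now, max_gap=timedelta(seconds=30)):
    """Return True if no tick has been recorded within `max_gap`.

    `last_tick` is None when the ticker has never run.
    """
    if last_tick is None:
        return True
    return now - last_tick > max_gap


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
recent = ticker_is_stalled(now - timedelta(seconds=3), now)
old = ticker_is_stalled(now - timedelta(minutes=5), now)
never = ticker_is_stalled(None, now)
print(recent, old, never)  # False True True
```

The cron handler would read the stored timestamp, call this check, and defer a new tick only when it returns True, so a healthy ticker is never started twice.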

Given a task_id, execute the task

I'm creating a Celery task in a situation where there are more task producers than consumers (workers). Since my queues are filling up and the workers consume in FCFS order, can I execute a specific task (given a task_id) instantly?
For example:
My tasks are queued in the following fashion: [1,2,3,4,5,6,7,8,9,0]. The tasks are fetched from the zeroth index. Now a situation arises where I want to execute task 8 before all the others. How can I do this?
The worker need not execute that task (because there can be situations where a worker is already occupied). It can be run directly from the application. And when the task is completed (either by the worker or directly from the application), it should be deleted from the queue.
I know how to forcefully revoke a task (given a task_id), but how can I execute a task given an id?

how can I execute a task given an id ?
The short answer is you can't. Celery workers pull tasks off the broker backend as they become available.
Why not?
Note that's not a limitation of Celery as such; rather, it is a characteristic of message queuing systems (MQS) in general. The point of an MQS is to decouple an application's components so that the producer can go on to do other work while workers execute the tasks asynchronously. In other words, once a task has been sent off it cannot be modified (but it can be removed as long as it has not been started yet).
What options are there?
Celery offers you several options to deal with lower vs. higher priority or short- and long-running tasks, at task submission time:
Routing - tasks can be routed to different workers. So if your tasks [0 .. 9] are all long-running, except for task 8, you could route task 8 to a worker, or a set of workers, that deal with short-running tasks.
Timed execution - specify a countdown or estimated time of arrival (eta) for each task. That's a good option if you know that some tasks can be delayed for later execution i.e. when the system will be less busy. This leaves workers ready for those tasks that need to be executed immediately.
Task expiry - specify an expire countdown or time with a callback. This way the task will be revoked if it didn't execute within the time allotted to it and the callback can start an alternative course of action.
Check on task results periodically, revoke a task if it didn't start executing within some time. Note this is different from task expiry where the revoking only happens once a worker has fetched the task from the queue - if the queue is full the revoking may happen too late for your use case. Checking results periodically means you have another component in your system that does this and determines an alternate course of action.
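The last option needs a separate monitoring component. A minimal sketch of its core decision is shown below, assuming the application records when each task was submitted and which ones have started; the function name and the data shapes are hypothetical.

```python
def stale_task_ids(submitted_at, started, now, max_wait):
    """Return ids of tasks submitted more than `max_wait` seconds ago
    that have not started yet, oldest first."""
    waiting = [
        (ts, task_id)
        for task_id, ts in submitted_at.items()
        if task_id not in started and now - ts > max_wait
    ]
    return [task_id for ts, task_id in sorted(waiting)]


submitted_at = {"task-7": 100.0, "task-8": 130.0, "task-9": 190.0}
started = {"task-7"}
stale = stale_task_ids(submitted_at, started, now=200.0, max_wait=60.0)
print(stale)  # ['task-8']
```

The monitor could then revoke each returned id (for example with `app.control.revoke(task_id)`) and trigger the alternative course of action from the application.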

Can celery assign task to specify worker

Celery sends tasks to idle workers.
I have a task that runs every 5 seconds, and I want it to be sent to only one specific worker.
Other tasks can share the remaining workers.
Can Celery do this?
I also want to know what this parameter is: CELERY_TASK_RESULT_EXPIRES
Does it mean that the task will not be sent to a worker in the queue?
Or does it stop the task if it runs too long?

Sure, you can. The best way to do it is to separate Celery workers using different queues. You just need to make sure that the task goes to a separate queue and that your worker is listening on that particular queue.
Long story for this: http://docs.celeryproject.org/en/latest/userguide/routing.html

Just to answer your second question: CELERY_TASK_RESULT_EXPIRES is the time in seconds that the result of the task is persisted. So after a task is over, its result is saved into your result backend. The result is kept there for the amount of time specified by that parameter. That is used when a task result might be accessed by different callers.
This has probably nothing to do with your problem. As for the first solution, as already stated you have to use multiple queues. However be aware that you cannot assign the task to a specific Worker Process, just to a specific Worker which will then assign it to a specific Worker Process.
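As a sketch of the multiple-queue setup, a configuration module could route the 5-second task to its own queue. The module name, task name, and queue names below are hypothetical, and the lowercase setting names assume Celery 4 or later (older versions used uppercase names such as CELERY_ROUTES).

```python
# celeryconfig.py (a sketch; all task and queue names here are hypothetical)

# Send the 5-second task to its own queue.
task_routes = {
    "proj.tasks.tick": {"queue": "ticker"},
}

# Every task without an explicit route goes to this queue.
task_default_queue = "default"
```

One worker would then be started with `celery -A proj worker -Q ticker` to consume only that queue, and the remaining workers with `-Q default`.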
