In Celery I have three types of tasks. The first runs every 3 minutes and takes almost 1 minute to complete. The second is periodic, runs every Monday, and takes almost 10 minutes to complete. The third sends users emails for registration and forgotten passwords. I'm confused about how many workers and Celery beat instances I should use. Can anyone help me out, please?
Usually you'll have only one Celery beat instance to schedule your tasks. If you run more than one, each task will be scheduled once per instance, i.e. as many times as the number of Celery beat instances you have.
There's no hard and fast rule for how many Celery workers you should have. Start with a handful, e.g. three, and keep an eye on the metrics (you can use something like https://flower.readthedocs.io/en/latest/index.html to create dashboards).
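As a rough illustration, the three task types from the question could be laid out in a Celery settings module like the sketch below. The task names (`tasks.frequent_job`, `tasks.weekly_job`, `tasks.send_email`) and the `emails` queue are hypothetical; `beat_schedule` and `task_routes` are standard Celery setting names. Giving the email task its own queue is an assumption, made so that a 10-minute weekly run cannot delay a password-reset email.

```python
from datetime import timedelta

# Hypothetical task names; replace them with your own.
# A single beat process reads this schedule and enqueues the tasks.
beat_schedule = {
    "frequent-job": {
        "task": "tasks.frequent_job",
        "schedule": timedelta(minutes=3),  # every 3 minutes
    },
    "weekly-job": {
        "task": "tasks.weekly_job",
        # timedelta(weeks=1) fires every 7 days relative to the previous run,
        # not on a fixed weekday. To pin the run to Mondays, use
        # celery.schedules.crontab instead, e.g.
        # crontab(minute=0, hour=3, day_of_week="mon").
        "schedule": timedelta(weeks=1),
    },
}

# Emails are sent on demand (not by beat), so they only need routing.
task_routes = {
    "tasks.send_email": {"queue": "emails"},
}
```

A worker started with `-Q emails` would then consume only the email queue, while another worker handles the default queue.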
I have tasks that do a get request to an API.
I have around 70,000 requests to make, and I want to spread them out over 24 hours so that they don't all run at, for example, 10 AM.
How would I do that with Celery in Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that needs to be refreshed. Currently I have a cron that creates tasks every hour. But is it better to create a task for every game and make it repeat every hour?
The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). Execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: it is a common way of not letting every worker start executing the newly created tasks. This goes beyond basic Celery usage. You need to use celery multi for this use case, or manually create two or more separate Celery workers with different queues.
If you do not want to over-complicate things, you can keep your current setup but give these tasks the lowest priority, so that any new, more important task gets executed first. The problem with this approach is that, as far as I know, only the Redis and RabbitMQ brokers support priorities.
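To make the dedicated-queue idea concrete, here is a minimal sketch. The task name `games.tasks.refresh_game` and the queue name `refresh` are hypothetical, and `drain_hours` is only back-of-the-envelope arithmetic that ignores broker and network overhead. It shows that the number of worker processes on the queue, not the moment the tasks are sent, determines how long the backlog takes.

```python
# Route the refresh task to its own queue (hypothetical names).
# Only workers started with "-Q refresh" will consume from it.
task_routes = {
    "games.tasks.refresh_game": {"queue": "refresh"},
}


def drain_hours(task_count: int, seconds_per_task: float, processes: int) -> float:
    """Rough time in hours to work through a backlog with a given pool size."""
    return task_count * seconds_per_task / processes / 3600


if __name__ == "__main__":
    # Assumes each request takes about one second; adjust to your API.
    for processes in (1, 2, 4):
        print(processes, "process(es):", round(drain_hours(70_000, 1.0, processes), 1), "hours")
```

Under the one-second assumption, a single process would need about 19.4 hours for 70,000 tasks and four processes about 4.9 hours, so the pool size on that queue is the knob to turn.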
We have been using Airflow for a while, it is just great.
Now we are considering moving some of our very frequent tasks into our airflow server too.
Let's say I have a script running every second.
What's the best practice to schedule it with airflow:
1. Run this script in a DAG that is scheduled every second. I highly doubt this will be the solution, since there is significant overhead for a DAG run.
2. Run this script in a while loop that stops after 6 hours, then schedule it in Airflow to run every 6 hours.
3. Create a DAG with no schedule and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error.
Any other suggestions?
Or this kind of task is just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler instead?
Cheers!
What's the best practice to schedule it
... this kind of task is just not suitable for Airflow?
It is not suitable.
In particular, your airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And likely you would need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.
The six-hour suggestion (2.) is not crazy. I will view it as a sixty-minute @hourly task instead, since the overheads are similar. Exiting after an hour and letting Airflow respawn has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long". And coordinating overlapping tasks, or a gap between tasks, at the hour boundary may pose some issues.
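The bounded-loop suggestion could look roughly like the sketch below: a loop that calls the job once per interval and exits after a fixed duration, so that the scheduler can start a fresh copy. `run_for` and `job` are hypothetical names, and the short durations are for demonstration only; a real hourly run would use something like `run_for(3600, 1.0, job)`.

```python
import time
from typing import Callable


def run_for(duration_s: float, interval_s: float, job: Callable[[], None]) -> int:
    """Call job on a fixed tick of interval_s seconds for every tick inside duration_s.

    Returns the number of times job was called.
    """
    start = time.monotonic()
    deadline = start + duration_s
    runs = 0
    while True:
        next_run = start + runs * interval_s
        if next_run >= deadline:
            break
        # Sleep until the next tick; if the job overran, start immediately.
        delay = next_run - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        job()
        runs += 1
    return runs


if __name__ == "__main__":
    # Demonstration values only: four ticks within a quarter of a second.
    ticks = []
    count = run_for(0.25, 0.0625, lambda: ticks.append(time.monotonic()))
    print(count, "runs")
```

Ticks are computed from the start time rather than from the end of the previous call, so a slow job does not make the schedule drift.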
Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
I have a question about CELERYD_CONCURRENCY and CELERYD_PREFETCH_MULTIPLIER.
My English is not good enough to fully understand the description on the official site,
so I want to make sure I have it right.
I set CELERYD_CONCURRENCY=40.
I think it will use 40 workers to do things,
but I usually see INFO/MainProcess and seldom see INFO/Worker-n.
Is it because the tasks are fast, so they don't have to be assigned to a worker?
Here is the task architecture:
period_task is a Celery periodic task, and mail_it is a normal Celery task.
from celery import shared_task

@shared_task
def period_task():
    do_something()
    # ....
    for mail in mail_list:
        # Queue one mail_it task per message
        mail_it.delay(mail)
My second question is about CELERYD_PREFETCH_MULTIPLIER, whose default value is 4.
Does it mean that each worker can fetch 4 tasks from the queue at a time? So with 40 workers, can I get 40*4 tasks?
My understanding:
CELERYD_CONCURRENCY:
This is the number of processes (or threads/green threads, depending on the pool) a given worker will have. This is the number of tasks a single worker can execute in parallel. I believe Celery creates this number PLUS ONE internally, and that the additional one is for managing and assigning work to the others (in your case, 40 of them!). In my experience, you likely don't need or want 40 (closer to 1 or 2 per CPU), but your mileage may vary.
CELERYD_PREFETCH_MULTIPLIER:
Prefetch is how many tasks are reserved per "process" according to the docs. It's a bit like a mini-queue just for that specific thread. This would indeed mean that your ONE started worker would potentially 'reserve' 40 * 4 tasks to do. Keep in mind that these reserved tasks cannot be "stolen" or sent to another worker or thread, so if they are long running you may wish to disable this feature to allow faster workers to pick up the slack of slower ones.
If this isn't clear in your current setup, I might suggest adding a sleep() to your task to be able to observe it.
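The arithmetic above can be written out as a small sketch. `worker_concurrency` and `worker_prefetch_multiplier` are the lowercase setting names that replaced `CELERYD_CONCURRENCY` and `CELERYD_PREFETCH_MULTIPLIER` in Celery 4; `max_reserved` is a hypothetical helper, not part of Celery.

```python
# Settings as they might appear in a Celery configuration module.
worker_concurrency = 40         # pool size of one worker
worker_prefetch_multiplier = 4  # Celery's default


def max_reserved(concurrency: int, prefetch_multiplier: int) -> int:
    """Upper bound on messages one worker reserves from the broker at once."""
    return concurrency * prefetch_multiplier


print(max_reserved(worker_concurrency, worker_prefetch_multiplier))  # 160
print(max_reserved(worker_concurrency, 1))  # 40: one message per process
```

Setting the multiplier to 1 is the usual choice when tasks are long-running, since each process then reserves only the message it is about to work on.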
Is there a good way to measure the amount of time a celery message spends on the queue? I know I could send a timestamp to the celery task and measure the difference between the beginning of the task and the timestamp sent, but I was hoping there might be a more general way to do it so I don't have to add timestamps to every single one of my celery tasks.
You might want to check out Celery Flower. It provides extensive reports on the broker, queues, and tasks, and it's easy to run (just a couple of commands).
Here you go: http://docs.celeryproject.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
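If a dashboard is not enough, the timestamp approach from the question can at least be factored out so that individual task bodies stay unchanged. This is a plain-Python sketch, not a Celery feature: `track_queue_time`, the `enqueued_at` keyword, `queue_waits`, and `send_report` are all hypothetical names, and it assumes the sender's and the worker's clocks are in sync.

```python
import functools
import time

# Seconds each call spent waiting; a real setup would log or export this.
queue_waits = []


def track_queue_time(func):
    """Strip the hypothetical `enqueued_at` kwarg and record how long the call waited."""
    @functools.wraps(func)
    def wrapper(*args, enqueued_at=None, **kwargs):
        if enqueued_at is not None:
            queue_waits.append(time.time() - enqueued_at)
        return func(*args, **kwargs)
    return wrapper


@track_queue_time
def send_report(user_id):
    # Stand-in for a task body; with Celery, the task decorator would sit
    # above @track_queue_time.
    return f"report for {user_id}"


# The sender stamps the call when queuing it. Here we pretend it waited 2 seconds.
result = send_report(42, enqueued_at=time.time() - 2)
print(result, "waited about", round(queue_waits[-1]), "seconds")
```

The sender still has to pass the timestamp, but the measuring logic lives in one decorator instead of in every task.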
By code:
@celery.task()
def some_recursive_task():
    # Do some stuff and schedule it to run again later.
    # Note that the next run is not scheduled on a fixed basis, like crontabs,
    # but based on the history of some object.
    # Actual task is found here:
    # https://github.com/rafaelsierra/cheddar/blob/master/src/feeds/tasks.py#L39
    # Then it calls itself again
    countdown = bla.get_countdown()
    some_recursive_task.apply_async(countdown=countdown)
This task will run again somewhere between 10 minutes and 12 hours later, but it also calls other tasks that should run now: one to download stuff and another to parse it.
The problem is that the main function is called for every single record in the database. Let's assume a few hundred tasks are running; considering that those tasks run every few hours on average, the number of tasks is not a big deal.
The problem starts when I try to run this with a single worker. When I start the worker, I set it to consume all queues with a concurrency of 8. It starts and begins to acknowledge the tasks, but it seems that, no matter how far in the future a task is scheduled, a worker will take it and wait for its scheduled run, meaning that this worker is locked until then.
I know that I can just split the two other functions into different queues, which I already did, but my concern is that workers will acknowledge tasks due 12 hours ahead and will not run the ones that are due in 30 minutes.
Shouldn't workers ignore scheduled tasks until their time comes and run the ones that were queued without a delay?
I don't think periodic tasks are a solution, or at least I don't know how they would be.
See the points 5 & 6 there. Please, keep in mind that countdown is no different from eta argument of the task.
In short, you're right. A single worker (or any number of workers) should not block on scheduled (eta or countdown) tasks.
How can you tell that workers are locked? The scheduled tasks are prefetched from the queue, but not acknowledged until they are executed.
Also, please keep in mind all scheduled tasks are kept in RAM until they're executed. You would like them to be as light as possible. From what I understand the scheduled task doesn't pass around big chunks of data, probably only some URI, so this shouldn't be a problem.
The links you've pasted return 404. Are you sure cheddar isn't a private repository?
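The point that countdown is no different from eta can be illustrated without Celery at all. The sketch below assumes, as stated above, that a countdown is simply turned into an absolute time; `countdown_to_eta` is a hypothetical helper, and the fixed `now` is only there to make the example reproducible.

```python
from datetime import datetime, timedelta, timezone


def countdown_to_eta(countdown_s: float, now: datetime) -> datetime:
    """Translate a relative countdown in seconds into an absolute eta."""
    return now + timedelta(seconds=countdown_s)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
eta = countdown_to_eta(600, now)
print(eta.isoformat())  # 2024-01-01T12:10:00+00:00
```

Either way, the worker holds the message until the absolute time arrives, which is why both forms behave identically with respect to prefetching and memory.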