Is there a good way to measure the amount of time a celery message spends on the queue? I know I could send a timestamp to the celery task and measure the difference between the beginning of the task and the timestamp sent, but I was hoping there might be a more general way to do it so I don't have to add timestamps to every single one of my celery tasks.
You might want to check out Celery Flower; it provides very extensive reporting on the broker, queues, and tasks, and it's easy to run (just a couple of commands).
Here you go: http://docs.celeryproject.org/en/latest/userguide/monitoring.html#flower-real-time-celery-web-monitor
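If you'd rather have a number than a dashboard, one fairly generic pattern (a sketch, not tested against your setup) is to stamp every outgoing task with a publish time via Celery's before_task_publish signal and read it back in task_prerun, so no individual task has to change. The header name x_publish_time below is an arbitrary choice, and this relies on custom headers being exposed as attributes on task.request (task protocol v2):

```python
import time
from celery.signals import before_task_publish, task_prerun

@before_task_publish.connect
def stamp_publish_time(sender=None, headers=None, **kwargs):
    # runs in the publishing process for every task sent
    if headers is not None:
        headers['x_publish_time'] = time.time()

@task_prerun.connect
def log_queue_time(task=None, **kwargs):
    # runs in the worker just before the task body executes;
    # custom headers show up as attributes on task.request
    sent = getattr(task.request, 'x_publish_time', None)
    if sent is not None:
        print('%s spent %.3fs on the queue' % (task.name, time.time() - sent))
```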
Related
In Celery I have 3 types of tasks: the first executes every 3 minutes and takes almost 1 minute to complete; the second is periodic, runs every Monday, and takes almost 10 minutes to complete; the third and last one sends users emails for registration/forgotten passwords. I'm confused about how many workers / Celery beat instances I should use. Can anyone help me out, please?
Usually you'll have only one Celery beat instance to schedule your tasks. If you have more than one instance, tasks will be scheduled as many times as the number of Celery beat instances you have.
There's no hard and fast rule for how many Celery workers you should have. Start with a handful, e.g. three, and keep an eye on the metrics (you can use something like https://flower.readthedocs.io/en/latest/index.html to create dashboards).
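For the beat side, here is a minimal sketch of what the schedule for your first two tasks could look like (module and task names here are made up; the email task needs no schedule at all, since it is queued on demand from your views):

```python
from celery import Celery
from celery.schedules import crontab

app = Celery('myapp', broker='redis://localhost:6379/0')

app.conf.beat_schedule = {
    'refresh-every-3-minutes': {
        'task': 'myapp.tasks.refresh',     # the ~1 minute task
        'schedule': 180.0,                 # seconds
    },
    'weekly-monday-job': {
        'task': 'myapp.tasks.weekly_job',  # the ~10 minute Monday task
        'schedule': crontab(minute=0, hour=0, day_of_week='mon'),
    },
}

# the email task is simply called on demand, e.g. from a view:
#   send_password_email.delay(user_id)
```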
Suppose we have the following web service. Its main function is taking screenshots of a given website URL. There is a REST API and a user interface for entering URLs. For each new URL, a task is created in Celery. For the frontend UI it is important that the screenshots for a URL arrive within a reasonable time, like 10 seconds.
Now a user, intentionally or through a software error, enters a few hundred URLs. This bloats the task queue, and other users must wait until all those tasks are done.
So the request here is to:
Running tasks in some fair order. The simplest solution is to run one task per user at a time, like: user1 task, user2 task, user1 task, user2 task, and so on.
Having some priorities on tasks, such that tasks of priority 1 are always done before tasks of priority 2.
Currently, we use our own handcrafted module. It stores tasks in Redis and pushes them in fair order to Celery. So as not to depend on Celery's ordering, it pushes only as many tasks as there are free Celery workers available, checking the Celery queue for free workers every 100 milliseconds.
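Roughly, the module works like the following sketch (heavily simplified; the per-user Redis lists like tasks:<user_id> and all names here are illustrative, not our actual code):

```python
import time

import redis
from celery import Celery

app = Celery('screenshots', broker='redis://localhost:6379/0')

@app.task
def capture(url):
    ...  # take the screenshot

r = redis.Redis()

def free_worker_count(pool_size=4):
    # ask the workers what they are doing; active() returns a dict
    # mapping worker name -> list of currently executing tasks
    active = app.control.inspect().active() or {}
    busy = sum(len(tasks) for tasks in active.values())
    return max(0, pool_size - busy)

def dispatch_loop():
    while True:
        # one pass = at most one task per user, i.e. round-robin fairness
        for user in r.smembers('users_with_pending_tasks'):
            if free_worker_count() <= 0:
                break
            url = r.lpop('tasks:%s' % user.decode())
            if url is not None:
                capture.delay(url.decode())
        time.sleep(0.1)  # re-check for free workers every 100 ms
```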
Are there any libraries or services which meet my requirements?
How many tasks do you have?
How many users do you have?
Sounds like you need a per-user rate-limiting mechanism in your webserver.
For your question, there are several options (a sketch follows this list):
You can use the Celery task router to assign different tasks to different queues (and then consume those queues with different workers).
Celery supports task priority; you can read about it here.
You can rate-limit per task in Celery; again, it depends on your usage.
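A minimal sketch of those three options together (module, queue, and task names are assumptions, not from your code):

```python
from celery import Celery

app = Celery('screenshots', broker='redis://localhost:6379/0')

# 1. route screenshot tasks to their own queue, consumed by dedicated workers
app.conf.task_routes = {
    'tasks.capture': {'queue': 'screenshots'},
}

# 2. enable message priorities on the Redis transport
# (RabbitMQ supports priorities natively via queue arguments)
app.conf.broker_transport_options = {'priority_steps': list(range(10))}

@app.task(rate_limit='10/m')  # 3. at most 10 executions per minute per worker
def capture(url):
    ...

# callers can then prioritize individual messages, e.g.:
#   capture.apply_async(args=[url], priority=0)  # 0 = highest on Redis
```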
EDIT:
@uhbif19 I described those features since you asked for them: you wanted a way to achieve priorities, and with these you can send tasks with a specific priority.
In your current architecture you might want to decrease the priority of abusers to avoid starving other users.
A better way to tackle this problem, IMO, is to add a rate-limiting mechanism in the gateway and ensure that a single user cannot abuse the system and starve all the others.
Good luck!
I have tasks that do a GET request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests run at, for example, 10 AM.
How would I do that in Celery with Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that need to be refreshed. Currently I have a cron job that creates the tasks every hour. But would it be better to create a task for every game and make it repeat every hour?
The typical approach is to send tasks whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: that is a common way of not allowing all workers to start executing the newly created tasks. This goes beyond basic Celery usage. You need to use celery multi for this use case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things, you can use your current setup, but give these tasks the lowest priority, so that if any new, more important task gets created, it will be executed first. The problem with this approach is that only the Redis and RabbitMQ brokers support priorities, as far as I know.
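Here is a minimal sketch of the dedicated-queue idea combined with Celery's per-worker rate_limit option, which happens to line up nicely with your numbers (50/minute x 60 x 24 = 72,000 per day); the module and task names are made up:

```python
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

# send refresh tasks to their own queue so they don't crowd out other work
app.conf.task_routes = {'tasks.refresh_game': {'queue': 'refresh'}}

# 50 per minute per worker spreads ~70k requests roughly evenly over 24 hours
@app.task(rate_limit='50/m')
def refresh_game(game_id):
    ...  # do the GET request for this game here

# run one dedicated worker for that queue, e.g.:
#   celery -A tasks worker -Q refresh --concurrency=1
```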
My App Engine app receives notifications from SendGrid for processing email deliveries, opens, etc. SendGrid doesn't do much batching of these notifications, so I could receive several per second.
I'd like to do batch processing of the incoming notifications, such as processing all of the notifications received in the last minute (my processing includes transactions, so I need to combine them to avoid contention). There seem to be several ways of doing this...
For storing the incoming notifications, I could:
add an entity to the datastore or
create a pull queue task.
For triggering processing, I could:
Run a cron job every minute (is this a good idea?), or
Have the handler that processes the incoming SendGrid requests trigger processing of notifications, but only if the last trigger was more than a minute ago (I could store the last trigger date in memcache).
I'd love to hear pros and cons of the above or other approaches.
After a couple of days, I've come up with an implementation that works pretty well.
For storing incoming notifications, I'm storing the data in a pull queue task. I didn't know at the time of my question that you can actually store any raw data you want in a task; the task doesn't have to be the execution of a function. You could probably store the incoming data in the datastore, but then you'd essentially be creating your own pull tasks, so you might as well use the pull tasks provided by GAE.
For triggering a worker to process tasks in the pull queue, I came across this excellent blog post about on-demand cron jobs by a former GAE developer. I don't want to repeat that entire post here, but the basic idea is that each time you add a task to the pull queue, you also create a worker task (on a regular push queue) to process the tasks in the pull queue. For the worker task, you use a task name corresponding to a time interval, to make sure you only have one worker task per interval. This gives you the benefit of a 1-minute cron job, with the added bonus that it only runs when needed, so you don't have a cron job running when there's nothing to do.
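For illustration, the add-side of that pattern looks roughly like this with the old GAE Python taskqueue API (the queue name and worker URL are my own placeholders, not from the blog post):

```python
import time

from google.appengine.api import taskqueue

def store_notification(payload):
    # stash the raw SendGrid data in a pull queue task; no handler runs
    taskqueue.Queue('notifications-pull').add(
        taskqueue.Task(payload=payload, method='PULL'))

    # schedule at most one worker per one-minute window: task names are
    # de-duplicated, so a second add in the same window raises an error
    window = int(time.time() // 60)
    try:
        taskqueue.add(url='/worker/process_notifications',
                      name='notify-worker-%d' % window,
                      countdown=60)
    except (taskqueue.TaskAlreadyExistsError,
            taskqueue.TombstonedTaskError):
        pass  # this window's worker is already scheduled
```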
I'm doing some metric analysis on my web app, which makes extensive use of Celery. I have one metric which measures the full trip from a post_save signal, through a Celery task (which itself calls a number of different Celery tasks), to the end of that task. I've been hitting the server with up to 100 requests in 5 seconds.
What I find interesting is that when I hit the server with hundreds of requests (which entails thousands of Celery tasks being queued), the time it takes for the trip from post_save to the end of the main Celery task increases significantly, even though I never do any additional database calls, and none of the Celery tasks should be blocking the main task.
Could the fact that there are so many Celery tasks in the queue when I make a bunch of requests really quickly be slowing down the logic in my post_save function and main Celery task? That is, could the processing associated with getting the sub-tasks that the main Celery task creates onto a crowded queue be having a significant impact on the time it takes to reach the end of the main Celery task?
It's impossible to really answer your question without an in-depth analysis of your actual code AND benchmark protocol, and while I have some working experience with Python, Django and Celery, I wouldn't be able to do such an in-depth analysis here. Still, there are a couple of very obvious points:
if your workers are running on the same computer as your Django instance, they will compete with Django process(es) for CPU, RAM and IO.
if the benchmark "client" is also running on the same computer, then you have a "heisenbench" case: bombing a server with hundreds of HTTP requests per second also uses a serious amount of resources...
To make a long story short: concurrent / parallel programming won't give you more processing power, it will only allow you to (more or less) easily scale horizontally.
I'm not sure about slowing down, but it can cause your application to hang. I've had this problem where one application would back up several other queues that had no workers. My application could then no longer queue messages.
If you open up a Django shell and try to queue a task, then hit Ctrl+C, you'll see a stack trace. I can't quite remember what it should look like, but if you post it here I can confirm it.