I am currently using celery to run my background tasks.
Let's say I want to run these background tasks on a weekly basis.
Is it a good idea to throw 20 tasks under one worker?
Each task would make at least 800 web requests.
In the Celery docs it says:
A single Celery process can process millions of tasks a minute, with sub-millisecond round-trip latency (using RabbitMQ, librabbitmq, and optimized settings).
So basically, one task could be for one user. I would need to run at least 50 different tasks, each of them making about 800 web requests. I thought I might need a new worker for each task, but reviewing the docs it doesn't seem like I need multiple workers; instead I can throw them all at one worker and it would be just fine. I don't feel confident about that, though. What should I do in my case: if I am making 800 web requests per task, do I need multiple workers, or should I just do everything under one worker?
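For concreteness, a minimal sketch of the setup in question, assuming a Redis broker; `fetch_user_urls` is a made-up helper standing in for wherever the ~800 URLs per user come from:

```python
import requests
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")

def fetch_user_urls(user_id):
    # made-up helper: would return this user's ~800 URLs
    return [f"https://api.example.com/items/{user_id}/{i}" for i in range(800)]

@app.task(rate_limit="10/s")  # optional throttle on outbound requests
def refresh_user(user_id):
    for url in fetch_user_urls(user_id):  # ~800 requests, done sequentially
        requests.get(url, timeout=10)

if __name__ == "__main__":
    # Enqueue one task per user, all at once; the broker holds the backlog
    # and the worker's --concurrency setting decides how many run in parallel.
    for user_id in range(50):
        refresh_user.delay(user_id)
```

Under this sketch a single worker is fine: the queue absorbs all 50 tasks, and the pool size (not the number of workers) controls how many are in flight at a time.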
Related
I have a couple of workers deployed in Kubernetes. I want to write a custom exporter for Prometheus, so I need to check all workers' availability.
I have some huge tasks in one queue, which take 200 seconds each (for example). The workers for this queue run with the eventlet pool and a concurrency of 1000, deployed in a workload with 2 pods.
Because of the huge tasks, light tasks sometimes get stuck in these workers and are not processed until the huge tasks are done (I have another queue for light tasks, but I have to keep some light tasks in this queue).
How can I check all workers' performance and availability?
I came across bootsteps in Celery, but I do not know whether they help me here, because I want a task that runs on every worker (and queue), and I want it to run in between the huge tasks, not separately.
For more detail: I want to save this data in Redis and read it in my exporter.
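A minimal sketch of such a check, assuming a local Redis; it uses Celery's remote-control ping, which broadcasts to every worker and collects the replies:

```python
import json

import redis
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")
store = redis.Redis(host="localhost", port=6379, db=1)

def record_worker_liveness():
    # ping() broadcasts to all workers; each reply looks like
    # {"celery@worker-pod-1": {"ok": "pong"}}
    replies = app.control.ping(timeout=2.0) or []
    alive = sorted(name for reply in replies for name in reply)
    store.set("celery:alive_workers", json.dumps(alive))
```

Running this periodically (from a small loop or a cron-style task) keeps a fresh list of responsive workers in Redis for the exporter to read.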
I have tasks that make a GET request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests run at, for example, 10 AM.
How would I do that with Celery and Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that need to be refreshed. Currently I have a cron job that creates tasks every hour, but is it better to create a task for every game and have it repeat every hour?
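One simple way to do the spreading is the `countdown` argument to `apply_async`. A minimal sketch, with a made-up `fetch_game` task and a Redis broker:

```python
from celery import Celery

app = Celery("games", broker="redis://localhost:6379/0")

@app.task(rate_limit="1/s")  # an alternative smoothing knob, on the worker side
def fetch_game(game_id):
    ...  # the GET request to the API

def schedule_requests(game_ids):  # the ~70 000 ids from the database
    spacing = 24 * 3600 / len(game_ids)  # roughly 1.2 s between deliveries
    for i, game_id in enumerate(game_ids):
        # countdown only delays delivery; with this many ETA tasks the
        # workers will prefetch and hold them in memory, so enqueueing one
        # hourly chunk from cron/beat is another option at this scale
        fetch_game.apply_async(args=[game_id], countdown=i * spacing)
```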
The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue: it is a common way of not letting every worker start executing the newly created tasks. This goes beyond basic Celery usage; you need to use celery multi for this use case, or manually create two or more separate Celery workers with different queues.
If you do not want to over-complicate things, you can use your current setup but give these tasks the lowest priority, so that if any new, more important task gets created, it will be executed first. The problem with this approach is that only the Redis and RabbitMQ brokers support priorities, as far as I know.
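For instance, a minimal sketch of the dedicated-queue and priority setup, with made-up task and queue names:

```python
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")

# Route the bulk task to its own queue; only workers started with
#   celery -A tasks worker -Q bulk --concurrency=4
# (or one node of a `celery multi` setup) will consume it.
app.conf.task_routes = {"tasks.refresh_game": {"queue": "bulk"}}

# RabbitMQ needs queues declared with a max priority before `priority=`
# has any effect; Redis emulates priorities. Note the semantics differ
# between brokers (RabbitMQ treats larger numbers as more urgent).
app.conf.task_queue_max_priority = 10

@app.task
def refresh_game(game_id):
    ...

refresh_game.apply_async(args=[42], priority=1)  # low priority on RabbitMQ
```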
I have an app running on Heroku and I'm using Celery together with a worker dyno to process background work.
I'm running tasks that use quite a lot of memory. These tasks get started at roughly the same time, but I want only one or two of them running at once; the others must wait in the queue. How can I achieve that?
If they run at the same time I run out of memory and the system gets restarted. I know why it's using a lot of memory, and I'm not looking to decrease that.
Quite simply: limit your concurrency (the number of Celery worker processes) to the number of tasks that can safely run in parallel on this server.
Note that if you have tasks with wildly different resource needs (i.e. one task that eats a lot of RAM and takes minutes to complete, and a couple of others that are fast and don't require many resources at all), you might be better off using two distinct nodes to serve them (one for the heavy tasks and the other for the light ones) so heavy tasks don't block light ones. You can use queues to route tasks to different Celery nodes, as sketched below.
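A minimal configuration sketch of that split, with made-up task and queue names:

```python
# celeryconfig.py -- a minimal sketch. Route heavy and light tasks to
# separate queues, then start one worker per queue with a pool size
# matching what each node can safely hold in RAM:
#   celery -A proj worker -Q heavy --concurrency=2
#   celery -A proj worker -Q light --concurrency=8
task_routes = {
    "proj.tasks.heavy_task": {"queue": "heavy"},
    "proj.tasks.light_task": {"queue": "light"},
}
```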
I'm doing some metric analysis on my web app, which makes extensive use of Celery. I have one metric which measures the full trip from a post_save signal through a Celery task (which itself calls a number of different Celery tasks) to the end of that task. I've been hitting the server with up to 100 requests in 5 seconds.
What I find interesting is that when I hit the server with hundreds of requests (which entails thousands of Celery tasks being queued), the time it takes for the trip from post_save to the end of the main Celery task increases significantly, even though I never make any additional database calls, and none of the Celery tasks should be blocking the main task.
Could the fact that there are so many Celery tasks in the queue when I make a bunch of requests really quickly be slowing down the logic in my post_save function and main Celery task? That is, could the processing involved in getting the sub-tasks that the main Celery task creates onto a crowded queue have a significant impact on the time it takes to reach the end of the main Celery task?
It's impossible to really answer your question without an in-depth analysis of your actual code AND benchmark protocol, and while I have some working experience with Python, Django and Celery, I wouldn't be able to do such an analysis here. That said, there are a couple of very obvious points:
- If your workers are running on the same computer as your Django instance, they will compete with the Django process(es) for CPU, RAM and IO.
- If the benchmark "client" is also running on the same computer, then you have a "heisenbench" case: bombing a server with hundreds of HTTP requests per second also uses a serious amount of resources...
To make a long story short: concurrent / parallel programming won't give you more processing power, it will only allow you to (more or less) easily scale horizontally.
I'm not sure about slowing down, but it can cause your application to hang. I've had this problem where one application backed up several other queues that had no workers. My application could then no longer queue messages.
If you open up a Django shell, try to queue a task, and then hit Ctrl+C, you should see a stack trace. I can't quite remember exactly what it should be, but if you post it here I could confirm it.
I am working on a project in Django, to be deployed on Heroku, which has around 12 update functions. They each take around 15 minutes to run. Let's call them update1(), update2(), ..., update12().
I am deploying with one worker dyno on Heroku, and I would like to run n or more of these at once (they are not really computationally intensive, they are all HTML parsers, but the data is time-sensitive, so I would like them to be called as often as possible).
I've read a lot of the Celery and APScheduler documentation, but I'm not really sure which is the best/easiest for me. Do scheduled tasks run concurrently if their times overlap (e.g. if I run one every 2 minutes and another every 3 minutes), or do they wait until each one finishes?
Is there any way I can queue these functions so that at least a few of them run at once? What is the suggested number of simultaneous calls for this use case?
Based on your use-case description you do not need a scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (your worker dyno). I have not worked with Celery, but it looks like a match for your requirements. How many worker threads or worker dynos you will need is difficult to determine from your description; you will also need to specify how many update requests you need to process per second, and whether the requests are CPU-bound or IO-bound.
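For illustration, a minimal sketch of that pattern in Celery, with made-up names; the web dyno enqueues the update requests and the worker dyno's pool size determines how many run at once:

```python
from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

@app.task
def run_update(n):
    # Dispatch to the matching updateN(); these are IO-bound HTML parsers,
    # so a worker started with --concurrency=12 can run them side by side.
    ...

def enqueue_all_updates():
    # The "control process" side: put one request per update on the queue.
    # Call this from the web dyno (or a clock process) as often as needed.
    for n in range(1, 13):
        run_update.delay(n)
```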