Spread Celery Django tasks out over 24 hours - python

I have tasks that do a get request to an API.
I have around 70 000 requests that I need to do, and I want to spread them out in 24 hours. So not all 70k requests are run at for example 10AM.
How would I do that in celery django? I have been searching for hours but cant find a good simple solution.
The database has a list of games that needs to be refreshed. Currently I have a cron that creates tasks every hour. But is it better to create a task for every game and make it repeat every hour?

The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution however is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue - that is a common way of not allowing all workers start executing the newly created tasks. This goes beyond the basic Celery usage. You need to use celery multi for this use-case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things you can use your current setup, but make these tasks with lowest priority, so if any new, more important, task gets created, it will be executed first. Problem with this approach is that only Redis and RabbitMQ backends support priorities as far as I know.

Related

Fair scheduling of task execution time between web service users

Suppose we have the following web service. The main function is doing screenshots for the given website URL. There is REST API and user interface for entering URLs. For each new URL is a task in Celery is created. For frontend UI is important that screens for some URL will follow in a reasonable time, like 10 seconds.
Now a user, intensionally or by a software error, enters few hundreds URLs. This bloats task queue and other users must wait until all those tasks will be done.
So the request here is to:
Running tasks in some fair order. The simplest solution is to run one task for each user in one time. Like: user1 task, user2 task, user1 task, user2 task, and so on.
Having some priorities on tasks. Like tasks of priority 1 is always done before tasks of priority 2.
Currently, we utilize our handcrafted module. It stores tasks in Redis and pushes them in fair order to Celery. To not depend on Celery ordering it pushes only as many tasks as there are free Celery workers available, and checking Celery queue for free workers every 100 milliseconds.
Are there any libraries or services which meet my requirements?
How many tasks do you have?
How many users you have?
Sounds like you need rate-limiting mechanism in your webserver per user.
For your question, there are serval options:
you can use celery router and assign different tasks for different queues (and then consume from those queues by different workers.
Celery support tasks priority, you can read about it here.
You can rate-limit per task in Celery - again, depends on your usage.
EDIT:
#uhbif19 I described those features since you asked for them - you wanted a way to achieve priority and you send tasks with a specific priority.
In your current architecture you might want to decrease priority to abusers and avoid starvation of other users.
A better way to tackle this problem IMO is to add a rate-limiting mechanism in the gateway and ensure that a single user won't be able to abuse the system and make starvation for all others.
Good luck!

Fair task processing order with Celery+Rabbitmq

I have a Django applications serving multiple users. Each user can submit resource-intensive tasks (minutes to hours) to be executed. I want to execute the tasks based on a fair distribution of the resources. The backend uses Celery and RabbitMQ for task execution.
I have looked extensively and haven't been able to find any solution for my particular case (or haven't been able to piece it together.) As far as I can tell, there isn't any build-in features able to do this in Celery and RabbitMQ. Is it possible to have custom code to handle the order of execution of the tasks? This would allow to calculate priorities based on user data and chose which task should be executed next.
Related: How can Celery distribute users' tasks in a fair way?
The AMPQ queues are FIFO. So it is impossible to grab items from the middle of the queue to execute. The two solutions that come to mind are:
a.) As mentioned in the other post, use a lock to limit resources by user.
b.) Have 2 queues; a submission queue and an execution queue. The submission queue keeps the execution queue full of work based on whatever algorithm you choose to implement. This will likely be more complex, but may be more along the lines of what you are looking for.

Understanding celery task load balance

I am currently using celery to run my background tasks.
Let's say I want to run these background tasks on weekly basis.
Is it a good idea to throw 20 tasks under one worker?
Each tasks would make about at least 800 web requests.
In the celery docs it says :
A single Celery process can process millions of tasks a minute, with sub-millisecond round-trip latency (using RabbitMQ, librabbitmq, and optimized settings).
So basically; one task could be for one user; I would need to run at least 50 different tasks; each of them making about 800 web request. I thought; maybe I would need new worker for each tasks; but reviewing the doc it doesn't seem like I need multiple workers for each tasks; instead I can throw all at once; and would be just fine. I don't feel confident about that though; What should I do in my case; if I am making 800 web requests per task; should I need multiple workers? Or should I just do everything under one worker.

Scheduling update functions in Django and Heroku?

I am working on a project, to be deployed on Heroku in Django, which has around 12 update functions. They take around 15 minutes to run each. Let's call them update1(), update2()...update10().
I am deploying with one worker dyno on Heroku, and I would like to run up to n or more of these at once (They are not really computationally intensive, they are all HTML parsers, but the data is time-sensitive, so I would like them to be called as often as possible).
I've read a lot of Celery and APScheduler documentation, but I'm not really sure which is the best/easiest for me. Do scheduled tasks run concurrently if the times overlap with one another (ie. if I run one every 2 minutes, and another every 3 minutes, or do they wait until each one finishes?)
Any way I can queue these functions, so at least a few of them are running at once? What is the suggested number of simultaneous calls for this use-case?
Based on you use case description you do not need a Scheduler, so APScheduler will not match your requirements well.
Do you have a web dyno besides your worker dyno? The usual design pattern for this type of processing is to set up a control thread or control process (your web dyno) that accepts requests. These requests are then placed on a request queue.
This queue is read by one or more worker threads or worker processes (you worker dyno). I have not worked with Celery, but it looks like a match with your requirements. How many worker threads or worker dyno's you will need is difficult to determine based on your description. You will need to specify also how many requests for updates you will need to process per second. Also, you will need to specify if the request is CPU bound or IO bound.

What is the optimal way to organize infinitely looped work queue?

I have about 1000-10000 jobs which I need to run on a constant basis each minute or so. Sometimes new job comes in or other needs to be cancelled but it's rare event. Jobs are tagged and must be disturbed among workers each of them processes only jobs of specific kind.
For now I want to use cron and load whole database of jobs in some broker -- RabbitMQ or beanstalkd (haven't decided which one to use though).
But this approach seems ugly to me (using timer to simulate infinity, loading the whole database, etc) and has the disadvantage: for example if some kind of jobs are processed slower than added into the queue it may be overwhelmed and message broker will eat all ram, swap and then just halt.
Is there any other possibilities? Am I not using right patterns for a job? (May be I don't need queue or something..?)
p.s. I'm using python if this is important.
You create your initial batch of jobs and add them to the queue.
You have n-consumers of the queue each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.
You can use asynchronous framework, e.g. Twisted
I don't think either it's a good idea to run script by cron daemon each minute (and you mentioned reasons), so I offer you Twisted. It doesn't give you benefit with scheduling, but you get flexibility in process management and memory sharing

Categories