I have a Django application serving multiple users. Each user can submit resource-intensive tasks (minutes to hours) to be executed. I want to execute the tasks based on a fair distribution of the resources. The backend uses Celery and RabbitMQ for task execution.
I have looked extensively and haven't been able to find any solution for my particular case (or haven't been able to piece one together). As far as I can tell, there aren't any built-in features in Celery or RabbitMQ able to do this. Is it possible to have custom code handle the order of execution of the tasks? That would allow priorities to be calculated from user data and the next task to execute to be chosen accordingly.
Related: How can Celery distribute users' tasks in a fair way?
AMQP queues are FIFO, so it is impossible to grab items from the middle of the queue to execute. The two solutions that come to mind are:
a.) As mentioned in the other post, use a lock to limit resources by user.
b.) Have two queues: a submission queue and an execution queue. The submission queue keeps the execution queue full of work based on whatever algorithm you choose to implement. This will likely be more complex, but may be more along the lines of what you are looking for.
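A minimal sketch of option b, assuming submitted work is staged in per-user Redis lists and a hypothetical compute_priority() encodes whatever fairness policy you choose. A small periodic dispatch task keeps the execution queue shallow, so the ordering decisions stay in your code rather than in the FIFO broker queue:

```python
# Sketch of option b: a periodic dispatcher drains per-user submission
# lists into a short execution queue. compute_priority(), the Redis key
# names, and the queue names are assumptions, not Celery built-ins.
import json
import redis
from celery import Celery

app = Celery("tasks", broker="amqp://localhost")
store = redis.Redis(decode_responses=True)

@app.task
def execute(payload):
    ...  # the actual resource-intensive work

def compute_priority(user_id):
    # Hypothetical policy: users with fewer queued tasks go first.
    return store.llen(f"submitted:{user_id}")

@app.task
def dispatch(batch=10):
    # Run this every few seconds (e.g. via celery beat). It pushes at most
    # `batch` tasks, chosen across users in priority order, so the FIFO
    # execution queue never holds more than a handful of tasks at once.
    users = sorted(store.smembers("users_with_work"), key=compute_priority)
    for user_id in users[:batch]:
        raw = store.lpop(f"submitted:{user_id}")
        if raw:
            execute.apply_async(args=[json.loads(raw)], queue="execution")
```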
Related
Suppose we have the following web service. Its main function is taking screenshots of a given website URL. There is a REST API and a user interface for entering URLs. For each new URL, a task is created in Celery. For the frontend UI it is important that the screenshots for a URL arrive in a reasonable time, like 10 seconds.
Now a user, intentionally or through a software error, enters a few hundred URLs. This bloats the task queue, and other users must wait until all those tasks are done.
So the requirements here are:
Running tasks in some fair order. The simplest solution is to run one task per user at a time, alternating: user1 task, user2 task, user1 task, user2 task, and so on.
Having priorities on tasks, so that tasks of priority 1 are always done before tasks of priority 2.
Currently we use our own handcrafted module. It stores tasks in Redis and pushes them to Celery in fair order. So as not to depend on Celery's ordering, it pushes only as many tasks as there are free Celery workers available, checking the Celery queue for free workers every 100 milliseconds.
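For illustration, the core of such a module might look roughly like this (a sketch under the design described above; the Redis key names and the take_screenshot task are placeholders, and free worker slots are derived from Celery's inspect API):

```python
# Sketch of the handcrafted fair scheduler described above: per-user URL
# lists in Redis, submitting to Celery only while free worker slots exist,
# polling every 100 ms. Key names and take_screenshot are placeholders.
import time
import redis
from celery import Celery

app = Celery("shots", broker="amqp://localhost")
store = redis.Redis(decode_responses=True)

@app.task
def take_screenshot(url):
    ...

def free_worker_slots():
    inspect = app.control.inspect()
    stats = inspect.stats() or {}
    active = inspect.active() or {}
    total = sum(s["pool"]["max-concurrency"] for s in stats.values())
    busy = sum(len(tasks) for tasks in active.values())
    return total - busy

def run_forever():
    while True:
        slots = free_worker_slots()
        for user in sorted(store.smembers("users")):  # one task per user, in turn
            if slots <= 0:
                break
            url = store.lpop(f"urls:{user}")
            if url:
                take_screenshot.delay(url)
                slots -= 1
        time.sleep(0.1)  # re-check the workers every 100 ms
```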
Are there any libraries or services which meet my requirements?
How many tasks do you have?
How many users do you have?
Sounds like you need a per-user rate-limiting mechanism in your web server.
For your question, there are several options (sketches below):
You can use the Celery router and assign different tasks to different queues (and then consume those queues with different workers).
Celery supports task priority; you can read about it here.
You can rate-limit per task in Celery - again, this depends on your usage.
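Rough sketches of the three options, using standard Celery settings (the task and queue names are just examples):

```python
# Sketches of the three options above; names are examples only.
from celery import Celery

app = Celery("app", broker="amqp://localhost")

# 1. Routing: send screenshot tasks to their own queue, consumed by a
#    dedicated worker:  celery -A app worker -Q screenshots
app.conf.task_routes = {
    "app.tasks.take_screenshot": {"queue": "screenshots"},
}

# 2. Priorities (supported on RabbitMQ): declare a maximum priority for
#    the queues, then submit tasks with a per-task priority.
app.conf.task_queue_max_priority = 10
# take_screenshot.apply_async(args=[url], priority=9)

# 3. Rate limiting per task type:
@app.task(rate_limit="10/m")  # at most 10 executions per minute, per worker
def take_screenshot(url):
    ...
```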
EDIT:
#uhbif19 I described those features since you asked for them - you wanted a way to achieve priorities, and sending tasks with a specific priority is how you do that.
In your current architecture you might want to decrease the priority of abusers to avoid starving other users.
A better way to tackle this problem, IMO, is to add a rate-limiting mechanism in the gateway, ensuring that a single user won't be able to abuse the system and starve all the others.
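As an illustration of that gateway-level rate limiting, here is a minimal sketch using Django's cache; the limit, window, key name, and view body are assumptions to adapt:

```python
# Minimal per-user rate-limit sketch at the gateway/view layer. The limit,
# window, key name, and view body are example values, not a fixed recipe.
from django.core.cache import cache
from django.http import JsonResponse

MAX_SUBMISSIONS = 20   # allowed submissions per window, per user (example)
WINDOW_SECONDS = 60

def submit_url(request):
    key = f"submissions:{request.user.pk}"
    count = cache.get_or_set(key, 0, timeout=WINDOW_SECONDS)
    if count >= MAX_SUBMISSIONS:
        return JsonResponse({"error": "rate limit exceeded"}, status=429)
    cache.incr(key)
    # ... validate the URL and enqueue the Celery task here ...
    return JsonResponse({"status": "queued"})
```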
Good luck!
I have tasks that do a GET request to an API.
I have around 70,000 requests that I need to make, and I want to spread them out over 24 hours, so that not all 70k requests run at, for example, 10 AM.
How would I do that with Celery in Django? I have been searching for hours but can't find a good, simple solution.
The database has a list of games that need to be refreshed. Currently I have a cron job that creates the tasks every hour. But would it be better to create a task for every game and make it repeat every hour?
The typical approach is to send them whenever you need some work done, no matter how many there are (even hundreds of thousands). The execution, however, is controlled by how many workers (and worker processes) you have subscribed to a dedicated queue. The key here is the dedicated queue - that is a common way of not letting all workers start executing the newly created tasks. This goes beyond basic Celery usage: you need to use celery multi for this use-case, or create two or more separate Celery workers manually with different queues.
If you do not want to over-complicate things you can use your current setup, but give these tasks the lowest priority, so that if any new, more important task gets created, it will be executed first. The problem with this approach is that, as far as I know, only the Redis and RabbitMQ brokers support priorities.
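For the original goal of spreading ~70k requests over 24 hours, the simplest trick is to stagger each task's eligibility time with a countdown when queuing. A sketch, where refresh_game and game_ids stand in for your own task and data:

```python
# Spread N tasks evenly across 24 hours via apply_async(countdown=...).
# refresh_game and game_ids are placeholders for your own task and data.
from celery import Celery

app = Celery("games", broker="amqp://localhost")

@app.task
def refresh_game(game_id):
    ...  # placeholder: the actual GET request / refresh work

DAY_SECONDS = 24 * 60 * 60

def schedule_all(game_ids):
    spacing = DAY_SECONDS / len(game_ids)  # ~1.2 s apart for 70k tasks
    for i, game_id in enumerate(game_ids):
        refresh_game.apply_async(args=[game_id], countdown=i * spacing)
```

One caveat: tasks with a countdown are held by workers as unacknowledged messages until they are due, so for very large batches it can be gentler to have an hourly cron enqueue only that hour's slice.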
I am using distributed to schedule lots of interdependent tasks, running on Google Compute Engine. When I start an extra instance with workers halfway through the run, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway through running the task tree, all remaining tasks depend on the results of tasks which have already run. So, if I interpret the above quote correctly, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on an 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. by setting a 'scheduler policy')? Or maybe even to have a data-vs-idleness tradeoff coefficient which could be tuned?
Update after comment #1:
Complicating factor: every task uses the resources framework to make sure it runs either on the set of workers for CPU-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading the up/download servers and to restrict up/download tasks to a certain maximum, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases. Is there a way to allow task stealing while keeping the resource restrictions?
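For context, the resource-restricted setup described here looks roughly like the following; CPU and NET are arbitrary labels, and crunch and download are placeholder functions:

```python
# Workers are started with resource tags, e.g.:
#   dask-worker scheduler-host:8786 --resources "CPU=1"
#   dask-worker scheduler-host:8786 --resources "NET=1"
# and tasks are then pinned to matching workers when submitted.
from dask.distributed import Client

def crunch(data):
    ...  # placeholder: CPU-bound work

def download(url):
    ...  # placeholder: network-bound work

client = Client("scheduler-host:8786")

cpu_future = client.submit(crunch, 42, resources={"CPU": 1})
net_future = client.submit(download, "http://example.com", resources={"NET": 1})
```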
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?
While Dask prefers to schedule work to reduce communication, it also acknowledges that this isn't always best. Generally Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html
I have some independent tasks which I am currently putting into different/independent workers.
To make this easy to understand, I will walk you through an example. Let's say I have three independent tasks, namely sleep, eat, and smile. A task may need to run under a different Celery configuration, so I think it is better to separate each of these tasks into different directories with different workers. Some tasks may be required to run on different servers.
I am planning to add some more tasks in the future, and each of them will be implemented by a different developer.
Given these conditions, there is more than one worker associated with each individual task.
Now, here is the problem and my question.
When I start three smile tasks, one of them will be fetched by smile's worker and carried out. But the next task may be fetched by eat's worker and never carried out.
So, what is the accepted, most common pattern? Should I send each task to a different queue, with each worker listening to its own queue?
The answer depends on a couple of things that should be taken into consideration:
Does the order of commands need to be preserved?
If so, the best approach is to use some sort of command pattern, with each command as a serialized message, so that each fetched/consumed message can be executed in order in a single place in your application.
If it's not an issue for you, you can play with a topic exchange, publishing different message types to a single exchange and having different workers receive the messages by a predefined pattern (see the sketch below). This, by the way, will let you easily add another task, let's say "drink", without changing a line in the already-existing transport topology or workers.
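A sketch of that topic-exchange layout with Celery and kombu; the queue names and routing keys are examples:

```python
# Topic-exchange sketch: each queue binds to the "tasks" exchange with its
# own routing pattern, so a new task type needs only a new queue binding.
from celery import Celery
from kombu import Exchange, Queue

app = Celery("app", broker="amqp://localhost")
tasks_exchange = Exchange("tasks", type="topic")

app.conf.task_queues = (
    Queue("smile", tasks_exchange, routing_key="task.smile.#"),
    Queue("eat", tasks_exchange, routing_key="task.eat.#"),
)

# Publishing picks the queue(s) via the routing key, e.g.:
# smile.apply_async(exchange="tasks", routing_key="task.smile.normal")
```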
Are you planning to scale queues across different machines to increase throughput?
In case you have very intense task traffic (in terms of frequency), it may be worth creating a different queue for each task type, so that later, as you grow, you can place each one on a different node in a RabbitMQ cluster.
In a similar setup, I decided to go with specific queues for different tasks; I can then decide which worker listens on which queue (and this can also be changed dynamically!).
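Concretely, that can be as little as a routing table plus pinned workers, with consumption changed at runtime through Celery's remote-control API; the task module paths and worker names below are examples:

```python
# One queue per task family; workers are pinned to queues but can be
# re-pointed at runtime. Task module paths and worker names are examples.
from celery import Celery

app = Celery("app", broker="amqp://localhost")
app.conf.task_routes = {
    "tasks.smile": {"queue": "smile"},
    "tasks.eat": {"queue": "eat"},
    "tasks.sleep": {"queue": "sleep"},
}

# Start a worker on one queue:
#   celery -A app worker -Q smile -n smile_worker@%h
# Change what it consumes dynamically, without a restart:
#   app.control.add_consumer("eat", destination=["smile_worker@myhost"])
#   app.control.cancel_consumer("eat", destination=["smile_worker@myhost"])
```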
I have about 1000-10000 jobs which I need to run on a constant basis, each minute or so. Sometimes a new job comes in or another needs to be cancelled, but that's a rare event. Jobs are tagged and must be distributed among workers, each of which processes only jobs of a specific kind.
For now I want to use cron and load the whole database of jobs into some broker - RabbitMQ or beanstalkd (I haven't decided which one to use yet).
But this approach seems ugly to me (using a timer to simulate infinity, loading the whole database, etc.) and has a disadvantage: if some kind of job is processed more slowly than it is added to the queue, the queue may be overwhelmed and the message broker will eat all the RAM and swap, and then just halt.
Are there any other possibilities? Am I not using the right pattern for the job? (Maybe I don't need a queue at all?)
P.S. I'm using Python, if that's important.
You create your initial batch of jobs and add them to the queue.
You have n consumers of the queue, each running jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue (see the sketch below). This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here; it's the pattern, not the language, that you need to nail first.
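A sketch of the self-resubmitting part of this pattern with Celery; refresh and cancelled are placeholders for your own work and cancellation check:

```python
# Self-resubmitting job: each run does its work, then re-queues itself
# with a delay, so the queue never grows beyond the initial batch.
from celery import Celery

app = Celery("jobs", broker="amqp://localhost")

def refresh(job_id):
    ...  # placeholder: the actual work for this job

def cancelled(job_id):
    return False  # placeholder: your cancellation check

@app.task
def run_job(job_id):
    refresh(job_id)
    if not cancelled(job_id):
        run_job.apply_async(args=[job_id], countdown=60)  # again in ~1 min

# Initial batch, submitted once:
# for job_id in all_job_ids:
#     run_job.delay(job_id)
```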
You can use an asynchronous framework, e.g. Twisted.
I also don't think it's a good idea to run a script via the cron daemon each minute (and you mentioned the reasons), so I suggest Twisted. It doesn't give you any benefit with scheduling, but you get flexibility in process management and memory sharing.
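A minimal Twisted sketch of that idea: one long-lived process with a LoopingCall per job kind, instead of cron restarting a script every minute. run_jobs_of_kind and the tags are placeholders:

```python
# One long-lived Twisted process; a LoopingCall per job kind replaces the
# per-minute cron. run_jobs_of_kind and the tags are placeholders.
from twisted.internet import reactor
from twisted.internet.task import LoopingCall

def run_jobs_of_kind(kind):
    ...  # fetch and process the jobs tagged with `kind`

for kind in ("fast", "slow"):                      # example tags
    LoopingCall(run_jobs_of_kind, kind).start(60)  # run every 60 seconds

reactor.run()
```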