I have some independent tasks which I am currently putting into different/independent workers.
To make this easier to follow, I will walk through an example. Let's say I have three independent tasks: sleep, eat, and smile. A task may need to run under a different Celery configuration, so I think it is better to put each task in its own directory with its own worker. Some tasks may also need to run on different servers.
I am planning to add more tasks in the future, and each of them will be implemented by a different developer.
Given these conditions, there are several workers, each tied to an individual task.
Now, here is the problem and my question.
When I start three smile tasks, one of them is fetched by smile's worker and carried out. But the next one is fetched by eat's worker and is never carried out.
So, what is the accepted, most common pattern? Should I send each task to a different queue, with each worker listening only to its own queue?
The answer depends on a couple of things that should be taken into consideration:
Should the order of commands be preserved?
If so, the best approach is to use some sort of command pattern, with each command sent as a serialized message, so that every consumed message is executed in order in a single place in your application.
If ordering is not an issue for you, you can use a topic exchange: publish the different message types to a single exchange and have different workers receive messages by a predefined pattern. This, by the way, lets you easily add another task, let's say "drink", without changing a line in the existing transport topology or the existing workers.
Are you planning to scale queues across different machines to increase throughput?
If you have very intense task traffic (in terms of frequency), it may be worth creating a separate queue for each task type, so that later, as you grow, you can place each one on a different node in the RabbitMQ cluster.
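To make the topic-exchange idea a bit more concrete, here is a minimal pure-Python sketch of how AMQP topic patterns match routing keys (`*` matches exactly one dot-separated word, `#` matches zero or more words). The routing keys, queue names, and bindings below are made up for illustration, and the matcher only models what the broker does; in a real setup the exchange and bindings would be declared on RabbitMQ itself.

```python
def topic_matches(pattern, routing_key):
    """Return True if an AMQP-style topic pattern matches a routing key."""
    p_words = pattern.split(".")
    k_words = routing_key.split(".")

    def match(i, j):
        if i == len(p_words):
            # Pattern exhausted: the key must be exhausted too.
            return j == len(k_words)
        if p_words[i] == "#":
            # '#' may swallow zero or more of the remaining words.
            return any(match(i + 1, nxt) for nxt in range(j, len(k_words) + 1))
        if j == len(k_words):
            return False
        if p_words[i] in ("*", k_words[j]):
            # '*' matches exactly one word; otherwise words must be equal.
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Hypothetical bindings: queue name -> pattern the queue is bound with.
BINDINGS = {
    "smile_queue": "task.smile",
    "eat_queue": "task.eat",
    "audit_queue": "task.#",
}


def queues_for(routing_key):
    """List the queues that would receive a message with this routing key."""
    return sorted(q for q, pat in BINDINGS.items() if topic_matches(pat, routing_key))
```

Adding a "drink" task would then only mean one more entry such as `"drink_queue": "task.drink"`; the existing bindings stay untouched.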
In a similar setup, I decided to go with specific queues for different tasks, so that I can decide which worker listens on which queue (and this can also be changed dynamically!).
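As a rough sketch of the one-queue-per-task idea: the mapping below has the same shape as Celery's `task_routes` setting (task name or glob pattern mapped to routing options). The task module names are hypothetical, and `queue_for` is just a stand-in written here to show which queue a task name would resolve to; it is not Celery's own router.

```python
from fnmatch import fnmatch

# Hypothetical task names. With Celery, a dict of this shape could be
# assigned to app.conf.task_routes.
TASK_ROUTES = {
    "tasks.sleep.*": {"queue": "sleep"},
    "tasks.eat.*": {"queue": "eat"},
    "tasks.smile.*": {"queue": "smile"},
}

# Celery's default queue is named "celery" unless configured otherwise.
DEFAULT_QUEUE = "celery"


def queue_for(task_name):
    """Resolve a task name to a queue using the first matching glob pattern."""
    for pattern, route in TASK_ROUTES.items():
        if fnmatch(task_name, pattern):
            return route["queue"]
    return DEFAULT_QUEUE
```

Each worker is then started so that it consumes only its own queue, for example with the worker's `-Q smile` option, which means a smile task can no longer be picked up by eat's worker.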
Related
I am a bit confused if Celery can fit our use case.
We have thousands of pieces of network equipment that we fetch information from and send information to. Those operations can be quite time-consuming, hence Celery.
We have X number of Celery workers.
However, this equipment does not accept concurrent connections (sometimes no more than one connection at a time). When multiple tasks are queued for the same piece of equipment, they end up on different workers and we get failures.
The equipment is deduced from the arguments of the tasks.
Two ideas:
Dynamically create queues, one for each piece of equipment. Each queue runs on a single worker. We would end up with thousands of tiny queues, and they would need to be deleted automatically once all their tasks have completed.
Or a single queue, but tasks are "on hold" if a worker is currently dealing with the equipment.
Any inputs warmly welcomed :)
PS: We use Django, and are considering Redis or RabbitMQ
Both of your proposed solutions could work depending on the specifics of your use case. Here are some considerations for each:
Dynamically create queues, one for each equipment:
This solution would ensure that tasks for a specific equipment are always processed by the same worker, which would prevent concurrent connections to the equipment. However, as you mentioned, this could result in a very large number of queues, which could be difficult to manage. Additionally, you would need a way to dynamically create and delete queues as needed, which could add some complexity to your system.
Single queue, but tasks are "on hold" if a worker is currently dealing with the equipment:
This solution would use a single queue, but would require some additional logic to ensure that tasks for a specific equipment are not executed concurrently. One way to implement this could be to use a lock to prevent multiple workers from processing tasks for the same equipment at the same time. This would ensure that each equipment is only being accessed by one worker at a time, but it could potentially lead to some tasks being delayed if a worker is already processing a task for the same equipment.
Overall, both solutions have pros and cons, and the best approach will depend on the specific requirements of your use case. If you choose to use a single queue, you may want to consider using a lock or some other mechanism to prevent concurrent access to the equipment. If you choose to use multiple queues, you may want to consider a way to manage the large number of queues that could be created.
Finally, both Redis and RabbitMQ are reasonable choices for use with Celery, and each has its own strengths and weaknesses. Redis is often simpler to set up and use, while RabbitMQ is a purpose-built message broker with more advanced messaging features, such as exchange-based routing and durable queues with persistent messages.
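For the single-queue option, the locking logic could look roughly like the sketch below. It is only an in-memory illustration: `EquipmentLocks`, `run_on_equipment`, and the `"retry"` return value are names invented here. With several worker processes the dictionary would have to be replaced by a store that all workers share (one common approach is a Redis `SET` with the `NX` option and an expiry, so that a crashed worker cannot hold a lock forever).

```python
import threading


class EquipmentLocks:
    """In-memory stand-in for a lock store shared by all workers."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}  # equipment id -> id of the task holding it

    def try_acquire(self, equipment_id, owner):
        """Take the lock without blocking; return False if it is held."""
        with self._guard:
            if equipment_id in self._owners:
                return False
            self._owners[equipment_id] = owner
            return True

    def release(self, equipment_id, owner):
        """Release the lock, but only if `owner` is the current holder."""
        with self._guard:
            if self._owners.get(equipment_id) == owner:
                del self._owners[equipment_id]


locks = EquipmentLocks()


def run_on_equipment(equipment_id, task_id, action):
    """Run action(equipment_id) only if no other task holds the device.

    Returns ("done", result), or ("retry", None) when the device is busy.
    A real Celery task would typically re-queue itself in the busy case.
    """
    if not locks.try_acquire(equipment_id, task_id):
        return ("retry", None)
    try:
        return ("done", action(equipment_id))
    finally:
        locks.release(equipment_id, task_id)
```

The trade-off mentioned above shows up directly: a task that hits a busy device is not lost, but it is delayed until a later retry.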
I have a Django application serving multiple users. Each user can submit resource-intensive tasks (minutes to hours) to be executed. I want to execute the tasks based on a fair distribution of the resources. The backend uses Celery and RabbitMQ for task execution.
I have looked extensively and haven't been able to find a solution for my particular case (or haven't been able to piece one together). As far as I can tell, there isn't any built-in feature able to do this in Celery and RabbitMQ. Is it possible to have custom code handle the order of execution of the tasks? This would allow me to calculate priorities based on user data and choose which task should be executed next.
Related: How can Celery distribute users' tasks in a fair way?
AMQP queues are FIFO, so it is impossible to grab items from the middle of the queue to execute. The two solutions that come to mind are:
a.) As mentioned in the other post, use a lock to limit resources by user.
b.) Have 2 queues; a submission queue and an execution queue. The submission queue keeps the execution queue full of work based on whatever algorithm you choose to implement. This will likely be more complex, but may be more along the lines of what you are looking for.
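One possible shape for option b.), assuming a simple round-robin over users as the fairness algorithm (any other policy could be swapped in). Everything here is an in-memory sketch with invented names; in practice the execution side would be the real Celery/RabbitMQ queue, kept short on purpose so that ordering decisions stay in your own code.

```python
from collections import OrderedDict, deque


class FairSubmissionQueue:
    """Holds pending tasks per user and hands them out round-robin."""

    def __init__(self):
        self._per_user = OrderedDict()  # user -> deque of pending tasks

    def submit(self, user, task):
        self._per_user.setdefault(user, deque()).append(task)

    def next_task(self):
        """Pop one task from the user at the front, then move that user to the back."""
        if not self._per_user:
            return None
        user, pending = next(iter(self._per_user.items()))
        task = pending.popleft()
        del self._per_user[user]
        if pending:
            self._per_user[user] = pending  # re-inserted at the back
        return user, task


def refill(submission, execution, capacity):
    """Top the execution queue up to `capacity` items from the submission queue."""
    while len(execution) < capacity:
        item = submission.next_task()
        if item is None:
            break
        execution.append(item)
```

With this, a user who submits a hundred long tasks cannot starve a user who submits one: the second user's task is released after at most one task from each other waiting user.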
I'm using distributed to schedule lots of interdependent tasks, running on Google Compute Engine. When I start an extra instance with workers halfway through, no tasks get scheduled to it (though it registers fine with the scheduler). I presume this is because (from http://distributed.readthedocs.io/en/latest/scheduling-state.html#distributed.scheduler.decide_worker):
"If the task requires data communication, then we choose to minimize the number of bytes sent between workers. This takes precedence over worker occupancy."
Once I'm halfway through running the task tree, all remaining tasks depend on the results of tasks which have already run. So, if I interpret the above quote right, nothing will ever be scheduled on the new workers, no matter how idle they are, as the dependent data is never already there but always on an 'old' worker.
However, I do make sure the amount of data to transfer is fairly minimal, usually just a small string. So in this case it would make much more sense to let idleness prevail over data communication. Would it be possible to allow this (e.g. by setting a 'scheduler policy')? Or maybe even have a data-vs-idleness tradeoff coefficient which could be tuned?
Update after comment #1:
Complicating factor: every task is using the resources framework to make sure it either runs on the set of workers for cpu-bound tasks ("CPU=1") or on the set of workers for network-bound tasks ("NET=1"). This separation was made to avoid overloading up/download servers and restrict up/download tasks to a certain max, while still being able to scale the other tasks. However, according to http://distributed.readthedocs.io/en/latest/work-stealing.html, task stealing will not happen in these cases? Is there a way to allow task stealing while keeping the resource restrictions?
Update 2: I see there is an open issue for that: https://github.com/dask/distributed/issues/1389. Are there plans to implement this?
While Dask prefers to schedule work to reduce communication it also acknowledges that this isn't always best. Generally Dask runs a task on the machine where it will finish first, taking into account both communication costs and existing task backlogs on overloaded workers.
For more information on load balancing you might consider reading this documentation page:
http://distributed.readthedocs.io/en/latest/work-stealing.html
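To illustrate the "runs where it will finish first" idea, here is a toy cost model. This is not Dask's actual code: the function names, the single-dependency simplification, and all the numbers are assumptions made for the example. It only shows why a tiny dependency should not keep work off an idle worker, while a large one can.

```python
def estimated_finish(occupancy_s, bytes_to_fetch, bandwidth_bps, task_duration_s):
    """Seconds until the task would finish on a worker (toy estimate)."""
    return occupancy_s + bytes_to_fetch / bandwidth_bps + task_duration_s


def pick_worker(workers, dep_nbytes, dep_holder, bandwidth_bps, task_duration_s):
    """Pick the worker with the earliest estimated finish time.

    workers: dict of worker name -> seconds of work already queued on it.
    dep_holder: the worker that already holds the task's input data.
    """
    def cost(name):
        to_fetch = 0 if name == dep_holder else dep_nbytes
        return estimated_finish(workers[name], to_fetch, bandwidth_bps, task_duration_s)

    return min(workers, key=cost)
```

With a 30-second backlog on the old worker and a 100-byte input, the new idle worker wins; with a 10 GB input over a 100 MB/s link, the transfer (about 100 seconds) outweighs the backlog and the old worker wins.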
Are there substantial performance implications that I should keep in mind when Celery workers are pulling from multiple (or perhaps many) queues? For example, would there be a significant performance penalty if my system were designed so that workers pulled from 10 to 15 queues rather than just 1 or 2? As a follow-up, what if some of those queues are sometimes empty?
The short answer to your question on queue limits is:
Don't worry: having multiple queues will not be worse or better, since brokers are designed to handle huge numbers of them. Of course, in most use cases you don't need that many, except in really advanced ones. Empty queues don't create any problem; they just take a tiny amount of memory on the broker.
Also don't forget that there are other things such as exchanges and bindings. There are no real limits there either, but it is better to understand the performance implications of each before using it (a topic exchange will use more CPU than a direct one, for example).
To give you a more complete answer let's look at the performance topic from a more generic point of view.
When looking at a distributed system based on message passing like Celery there are 2 main topics to analyze from the point of view of performance:
The workers number and concurrency factor.
As you probably already know, each Celery worker has a concurrency parameter that sets how many tasks can be executed at the same time. This should be set in relation to the server capacity (CPU, RAM, I/O) and, of course, the type of tasks that the specific consumer will execute (which depends on the queue it consumes from).
Of course, depending on the total number of tasks you need to execute in a certain time window, you will need to decide how many workers/servers to have up and running.
The broker, the single point of failure in this architecture style.
The broker, especially RabbitMQ, is designed to manage millions of messages without any problem. However, the more messages it needs to store, the more memory it will use, and the more messages it has to route, the more CPU it will use.
This machine should be well tuned too and, if possible, set up for high availability.
Of course, the main thing to avoid is messages being consumed at a lower rate than they are produced; otherwise your queue will keep growing and your RabbitMQ will explode. Here you can find some hints.
There are cases where you may also need to increase the number of tasks executed in a certain time frame, but only in response to peaks of requests. The nice thing about this architecture is that you can monitor the size of the queues, and when you see one growing too fast you can create new machines on the fly with a Celery worker already configured, and then turn them off when they are no longer needed. This is a quite cost-saving and efficient approach.
One hint: remember not to store Celery task results in RabbitMQ.
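The "consume at least as fast as you produce" rule can be turned into a rough back-of-the-envelope calculation for sizing or autoscaling. The function and parameter names below are hypothetical, and the rates are something you would have to measure yourself; a worker's throughput here means its whole capacity (roughly its concurrency times the tasks each process finishes per second).

```python
import math


def workers_needed(arrival_per_s, per_worker_per_s, backlog=0, drain_s=60.0):
    """Workers required to keep up with arrivals and drain `backlog` in `drain_s` seconds."""
    # Rate needed = steady arrivals + the extra rate required to clear the backlog.
    required_rate = arrival_per_s + backlog / drain_s
    # Always keep at least one worker running.
    return max(1, math.ceil(required_rate / per_worker_per_s))
```

For example, at 10 tasks per second with workers that each handle 2 per second, 5 workers just keep the queue from growing; if 600 messages have already piled up and should be cleared within a minute, the estimate rises to 10.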
I have about 1000-10000 jobs which I need to run on a constant basis, each minute or so. Sometimes a new job comes in or another needs to be cancelled, but that is a rare event. Jobs are tagged and must be distributed among workers, each of which processes only jobs of a specific kind.
For now I want to use cron and load the whole database of jobs into some broker, RabbitMQ or beanstalkd (I haven't decided which one to use yet).
But this approach seems ugly to me (using a timer to simulate infinity, loading the whole database, etc.) and has a disadvantage: for example, if some kind of job is processed more slowly than it is added to the queue, the queue may be overwhelmed and the message broker will eat all the RAM and swap and then just halt.
Are there any other possibilities? Am I not using the right patterns for the job? (Maybe I don't need a queue or something?)
P.S. I'm using Python, if this is important.
You create your initial batch of jobs and add them to the queue.
You have n-consumers of the queue each running the jobs. Adding consumers to the queue simply round-robins the distribution of jobs to each listening consumer, giving you arbitrary horizontal scalability.
Each job can, upon completion, be responsible for resubmitting itself back to the queue. This means that your job queue won't grow beyond the length that it was when you initialised it.
The master job can, if need be, spawn sub-jobs and add them to the queue.
For different types of jobs it is probably a good idea to use different queues. That way you can balance the load more effectively by having different quantities/horsepower of workers running the jobs from the different queues.
The fact that you are running Python isn't important here, it's the pattern, not the language that you need to nail first.
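Since you are on Python anyway, here is a small in-memory sketch of the self-resubmitting pattern described above. The job dictionaries, the `cancelled` flag, and `run_cycles` are made up for the example, and a `deque` stands in for the real broker queue; the point is only that the queue never grows beyond its initial length.

```python
from collections import deque


def run_cycles(queue, handler, steps):
    """Take up to `steps` jobs off the queue; each handled job re-enqueues itself."""
    processed = []
    for _ in range(steps):
        if not queue:
            break
        job = queue.popleft()
        if job.get("cancelled"):
            # A cancelled job is simply not resubmitted, so it drops out.
            continue
        handler(job)
        processed.append(job["id"])
        queue.append(job)  # resubmit for the next round
    return processed
```

New jobs are added by appending them to the queue once, and a slow job type can never flood the broker, because each job exists in the queue at most once at any time.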
You can use an asynchronous framework, e.g. Twisted.
I don't think it's a good idea either to run a script from the cron daemon each minute (and you mentioned the reasons), so I suggest Twisted. It doesn't give you any benefit for scheduling, but you get flexibility in process management and memory sharing.