I am using Celery to run some tasks that take a long time to complete. There
is an initial task that needs to complete before two sub-tasks can run. The tasks that I created are file system operations and don't return a result.
I would like the subtasks to run at the same time, but when I use a group for these tasks they run sequentially and not in parallel.
I have tried:
g = group([secondary_task(), secondary_tasks2()])
chain(initial_task(),g)
I've also tried running the group directly in the first task, but that doesn't seem to work either.
Is what I'm trying to accomplish doable with Celery?
        First Task
        /        \
Second Task    Third Task

Not:

First Task
    |
Second Task
    |
Third Task
A chain is definitely the right approach, but it needs to be built from task signatures rather than direct calls.
I would expect this to work: chain(initial_task.s(), g)()
Also, do you have more than one Celery worker running (or a worker with concurrency greater than one), so that more than one task can actually execute at the same time?
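For what it's worth, here is a rough sketch of that approach spelled out (my reading, not the original poster's code): both the chain and the group need task signatures, and since the question says the tasks don't return results, immutable signatures (.si()) keep the chain from passing initial_task's return value into the sub-tasks:

from celery import chain, group

# Signatures are built with .si() / .s(); calling the tasks directly, as in
# the question, executes them eagerly and sequentially in the calling process.
workflow = chain(
    initial_task.si(),
    group(secondary_task.si(), secondary_tasks2.si()),
)
workflow.apply_async()

The two grouped tasks can only run at the same time if the worker side has capacity for it, e.g. a single worker started with --concurrency=2 (or higher), or two separate workers.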
Related
I'm using APScheduler to schedule tasks in Python, and these tasks need to run independently and concurrently with other tasks.
The main rule is that these tasks have to be executed at the exact moment they were scheduled and cannot be blocked or delayed by another task.
The tasks are dynamically scheduled by the users of my application.
For that, when the task execution time arrives, I start a new sub-process to execute it:
from multiprocessing import Process

def _initialize_order_process(user, order):
    p = Process(target=do_scheduled_order, args=(user, order))
    p.start()
    p.join()
It's important to know that each subprocess starts a connection to a server.
And I'm scheduling my tasks like this:
scheduler.add_job(_initialize_order_process, 'date', run_date=start_time, args=[user, order], id=job_id)
My problem is that when a large number of tasks are scheduled for the same time, the sheer number of processes spawned crashes the server.
So, I need this application to be scalable to support many users.
Does anyone know how to create a scalable solution for my use case?
One solution would be to scale horizontally by adding more hardware (more servers): you add requests to a task queue (backed by Redis, for example), delegate the tasks to Celery workers, and run as many workers in parallel as needed to pick up the workload (a rough sketch of this follows below).
Another solution would be to set up a cluster for Apache Airflow and run the tasks through it.
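A minimal sketch of the Celery variant, assuming a Redis broker and a task named do_scheduled_order_task (the module and names are placeholders, not from the question):

from celery import Celery
from orders import do_scheduled_order  # your existing function; module name assumed

app = Celery('orders', broker='redis://localhost:6379/0')

@app.task
def do_scheduled_order_task(user_id, order_id):
    # Runs inside a worker process; the per-order server connection is opened
    # here instead of in a freshly spawned OS process per order.
    do_scheduled_order(user_id, order_id)

# Where the user schedules an order, enqueue it with an ETA instead of
# starting a local subprocess via APScheduler:
do_scheduled_order_task.apply_async(args=[user.id, order.id], eta=start_time)

The worker pool size then caps how many orders run at once, so a burst of orders scheduled for the same timestamp queues up instead of exhausting the server.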
"cannot be blocked or delay execution because of another task"
Unfortunately, that's not how task scheduling works. Eventually you will have jobs that depend on each other, and so you'll end up with a DAG of job processes.
I have a Django application that needs to run an optimization algorithm. This algorithm is composed of two parts: the first part is an evolutionary algorithm, and this algorithm calls a certain number of tasks of the second part, which is a simulated annealing algorithm.
The problem is that Celery doesn't allow a task to call an asynchronous task.
I have tried this code below:
sa_list = []
for cromossomo in self._populacao:
    sa_list.append(simulated_annealing_t.s(cromossomo.to_JSON(), self._NR, self._T, get_mutacao_str(self._mutacao_SA), self._argumentos))
job = group(sa_list)
result = job.apply_async()
resultados = result.get()
This code is part of the evolutionary algorithm which is a celery task.
When I try to run it, Celery shows this message:
[2015-12-02 16:20:15,970: WARNING/Worker-1] /home/arthur/django-user/local/lib/python2.7/site-packages/celery/result.py:45: RuntimeWarning: Never call result.get() within a task!
See http://docs.celeryq.org/en/latest/userguide/tasks.html#task-synchronous-subtasks
In Celery 3.2 this will result in an exception being
raised instead of just being a warning.
Despite this being just a warning, Celery seems to fill up with tasks and lock up.
I searched for a lot of solutions but none of them worked.
One way to deal with this is to use a two-stage pipeline:
def first_task():
    sa_list = []
    for cromossomo in self._populacao:
        sa_list.append(simulated_annealing_t.s(cromossomo.to_JSON(), self._NR, self._T, get_mutacao_str(self._mutacao_SA), self._argumentos))
    job = group(sa_list)
    result = job.apply_async()
    result.save()  # requires a result backend that can store group results
    return result.id
Then call it like this:
from path.to.tasks import app, first_task
result_1 = first_task.apply_async()
result_2_id = result_1.get()
result_2 = app.GroupResult.restore(result_2_id)
resultados = result_2.get()
There are other ways to do this that involve more work; for example, you could use a chord to gather the results of the group, as sketched below.
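A rough sketch of the chord variant, assuming a callback task (called process_results here, which is made up for the example) that receives the list of results:

from celery import chord

# The chord runs the group and then hands the list of results to the
# callback task, so no task ever blocks on result.get().
header = [
    simulated_annealing_t.s(cromossomo.to_JSON(), self._NR, self._T,
                            get_mutacao_str(self._mutacao_SA), self._argumentos)
    for cromossomo in self._populacao
]
chord(header)(process_results.s())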
The problem is not that Celery doesn't allow the execution of async tasks in your example, but that you'll run into a deadlock, hence the warning:
Let's assume you have a task A that spawns a number of subtasks B through apply_async(). Every one of those tasks is executed by a worker. The problem is that if the number of B tasks is larger than the number of available workers, task A is still waiting for their results (in your example at least; it's not like that by default). While task A is still running, the workers that have executed a B task will not execute another one; they are blocked until task A is finished. (I don't know exactly why, but I had this problem just a few weeks ago.)
This means that Celery can't execute anything until you manually shut down the workers.
Solutions
This depends entirely on what you want to do with your task results. If you need them in order to execute the next subtask, you can chain them through Linking with callbacks or by hardcoding it into the respective tasks (so that you call the first, which calls the second, and so on); a small example follows below.
If you only need to see whether they were executed and whether they succeeded, you can use Flower to monitor your tasks.
If you need to process the output of all the subtasks further, I recommend writing the results to an XML file: have task A call all the B tasks, and once they are done, execute a task C that processes the results. There may be more elegant solutions, but this avoids the deadlock for sure.
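As a small illustration of the callback-linking option (task_b and task_c are hypothetical names, not from the question):

# task_c receives task_b's return value as its first argument once task_b
# has finished, so nothing waits on a result inside a running task.
task_b.apply_async(args=[some_arg], link=task_c.s())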
How can I configure Celery so that one worker always runs the same task, and after the task finishes, starts it again on the same worker?
It looks like you will need to take two steps:
1. Create a separate queue for this task and route the task to that queue.
2a. Create an infinite loop that calls your particular task, such as this answer,
OR
2b. Have a recursive task that calls itself on completion (this could get messy); a rough sketch of this follows below.
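A minimal sketch of step 1 plus option 2b; the queue name, task name, and do_the_work() are made up for illustration, and the exact config key names depend on your Celery version:

from celery import Celery

app = Celery('proj', broker='redis://localhost:6379/0')

# Route the task to its own queue and start a dedicated worker for it:
#   celery -A proj worker -Q repeat_queue --concurrency=1
app.conf.task_routes = {'proj.repeating_task': {'queue': 'repeat_queue'}}

@app.task(name='proj.repeating_task')
def repeating_task():
    do_the_work()                 # the actual job goes here
    repeating_task.apply_async()  # re-enqueue itself once it has finished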
I have a task that I want to run once every 10 seconds.
At the same time, only one instance of this task may run; it must not run again while it is already running.
I found two ways:
1) http://celery.readthedocs.org/en/latest/tutorials/task-cookbook.html#cookbook-task-serial
2) http://engineroom.trackmaven.com/blog/announcing-celery-once/
I don't know which of these two ways is best. Do you have any good ideas?
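For reference, the core of the cookbook approach (the first link) is a lock around the task body; a minimal sketch assuming a Django cache backend, with my_task and do_work() as placeholder names:

from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60  # seconds; should comfortably exceed one run of the task

@shared_task
def my_task():
    # cache.add only succeeds if the key does not exist yet, so a second
    # invocation started while one is running simply returns immediately.
    if not cache.add('my_task_lock', 'locked', LOCK_EXPIRE):
        return 'skipped: already running'
    try:
        do_work()  # the real task body
    finally:
        cache.delete('my_task_lock')

The every-10-seconds part would then be handled separately, for example by a celery beat schedule entry.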
From reading the Celery documentation, it looks like I should be able to use the following Python code to list tasks on the queue that have not yet been picked up:
from celery.task.control import inspect
i = inspect()
tasks = i.reserved()
However, when running this code, the list of tasks is empty, even if there are items waiting in the queue (I have verified they are definitely in the queue using django-admin). The same is true for using the command line equivalent:
$ celeryctl inspect reserved
So I'm guessing this is not in fact what this command is for? If not, what is the accepted way to retrieve a list of tasks that have not yet started? Do I have to maintain my own list of task IDs in the code in order to query them?
The reason I ask is because I am trying to handle a situation where two tasks are queued which perform a write operation on the same object in the database. If both tasks execute in parallel and task 1 takes longer than task 2, it will overwrite the output from task 2, but I want the output from the most recent task i.e. task 2. So my plan was to cancel any pending tasks that operate on an object each time a new task is added which will write to the same object.
Thanks
Tom
You can see pending tasks using scheduled instead of reserved.
$ celeryctl inspect scheduled
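The Python equivalent, using the same (older) import path as the question, would look roughly like this:

from celery.task.control import inspect

i = inspect()
print(i.scheduled())  # tasks with an ETA/countdown that are not yet due
print(i.reserved())   # tasks prefetched by workers but not yet started
print(i.active())     # tasks currently executing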