I have a task that I want to run once every 10 seconds.
At the same time, only one instance of this task may run at a time; runs must never overlap.
I found two ways:
1) http://celery.readthedocs.org/en/latest/tutorials/task-cookbook.html#cookbook-task-serial
2) http://engineroom.trackmaven.com/blog/announcing-celery-once/
I don't know which of these two ways is better. Do you have any other good ideas?
Related
In Celery I have three types of tasks. The first runs every 3 minutes and takes almost 1 minute to complete. The second is periodic, runs every Monday, and takes almost 10 minutes to complete. The third sends users emails for registration and forgotten passwords. I'm confused about how many workers and Celery beat instances I should use. Can anyone help me out, please?
Usually you'll have only one Celery beat instance to schedule your tasks. If you run more than one, each task will be scheduled once per beat instance, so with N beat instances every task is queued N times.
There's no hard and fast rule for how many Celery workers you should have. Start with a handful, e.g. three, and keep an eye on the metrics (you can use something like https://flower.readthedocs.io/en/latest/index.html to create dashboards).
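To make the "one beat, a few workers" advice concrete, here is a rough sketch of a beat schedule for the two periodic tasks described in the question. The app name, broker URL, task names, and the Monday 00:00 run time are all assumptions for illustration; the email task is not scheduled at all because it is triggered on demand.

```python
from datetime import timedelta

from celery import Celery
from celery.schedules import crontab

# Hypothetical app name and broker URL; substitute your own.
app = Celery("sketch", broker="redis://localhost:6379/0")

app.conf.beat_schedule = {
    "short-job-every-3-minutes": {
        "task": "tasks.short_job",  # hypothetical task name
        "schedule": timedelta(minutes=3),
    },
    "weekly-job-on-monday": {
        "task": "tasks.weekly_job",  # hypothetical task name
        # day_of_week=1 is Monday (0 is Sunday); midnight is an assumed time.
        "schedule": crontab(minute=0, hour=0, day_of_week=1),
    },
}

# The email task is not listed here: it would be queued from application
# code with something like send_email.delay(...) whenever a user registers.
```

With a config like this you would start exactly one `celery -A <module> beat` process and, separately, one or more `celery -A <module> worker` processes.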
I have an application with several luigi tasks (I did not write that app). Now I want to introduce another task, in the middle of the process, which will monitor some AWS instances. Once started, this task should run until the end, and it must run in parallel with the other tasks. See the picture in the link for a better understanding.
Link to the schema
I looked in the documentation but could not find a solution. I am new to luigi and I probably missed something.
I don't think you missed anything. I don't think that luigi covers that use case. However, one thing you could do is have Task 3 require only Task 2, have Task 4 require task 2 instead of task 3, and have Task 3 continually run some code and monitor the output of Task 5 to know when it should close. It's not the prettiest, but it should work.
However, there are a couple of problems I can foresee (which is probably why it isn't supported by luigi). If you have enough Task 3's running, you might never complete the workflow, as Task 4 never gets run. That's why this isn't recommended: you are essentially creating hidden requirements that the dependency graph doesn't know about. Another issue is that Task 3 might never run until you are done with all of the Task 5's, in which case it's useless.
One last idea I have is that instead of having Task 3 at all, at the end of Task 2 or beginning of task 4 you start a process on the scheduler node (using simply luigi.Task instead of an extension to make the work run on another node on the cluster). Then at the end of Task 5 you remove the process. There are some other edge cases you'll need to consider though, to make sure the process doesn't run too short or too long.
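A minimal sketch of that start-then-stop idea, outside of luigi itself. It uses a background thread rather than a separate OS process to keep the example short, and `BackgroundMonitor` and its `check` callback are hypothetical names, not part of luigi. You would call `start()` at the end of Task 2 (or the beginning of Task 4) and `stop()` at the end of Task 5.

```python
import threading


class BackgroundMonitor:
    """Runs a check function repeatedly until stop() is called."""

    def __init__(self, check, interval_s=30.0):
        self._check = check            # e.g. a function that polls AWS instances
        self._interval_s = interval_s
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop_event.is_set():
            self._check()
            # wait() returns early as soon as stop() sets the event.
            self._stop_event.wait(self._interval_s)

    def start(self):
        self._thread.start()

    def stop(self, timeout_s=5.0):
        self._stop_event.set()
        self._thread.join(timeout_s)

    def is_running(self):
        return self._thread.is_alive()
```

This only works as written if the code that starts the monitor and the code that stops it live in the same Python process; across nodes you would need a real process plus some shared signal, which is where the edge cases come in.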
Good luck!
We have been using Airflow for a while, and it is just great.
Now we are considering moving some of our very frequent tasks into our airflow server too.
Let's say I have a script running every second.
What's the best practice to schedule it with airflow:
1. Run this script in a DAG that is scheduled every second. I highly doubt this will be the solution, since there is significant overhead for a DAGRUN.
2. Run this script in a while loop that stops after 6 hours, then schedule it on Airflow to be run every 6 hours?
3. Create a DAG with no schedule and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error.
Any other suggestions?
Or is this kind of task just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler instead?
Cheers!
What's the best practice to schedule it
... this kind of task is just not suitable for Airflow?
It is not suitable.
In particular, your airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And likely you would need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.
The six-hour suggestion (2.) is not crazy. I will view it as a sixty-minute @hourly task instead, since overheads are similar. Exiting after an hour and letting airflow respawn has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long". And coordinating overlapping tasks, or a gap between tasks, at the hour boundary may pose some issues.
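For what it's worth, the "loop for an hour, then exit and let the scheduler respawn" pattern could look roughly like this. `run_until_deadline` is a hypothetical helper, not an Airflow API; the function you pass as `job` is whatever your script does each second.

```python
import time


def run_until_deadline(job, interval_s=1.0, max_runtime_s=3600.0):
    """Call job() every interval_s seconds, then return after max_runtime_s.

    Returns the number of times job() was called.
    """
    start = time.monotonic()
    next_run = start
    runs = 0
    while time.monotonic() - start < max_runtime_s:
        job()
        runs += 1
        # Schedule against the original timeline so small delays do not
        # accumulate into drift; skip the sleep if the job ran long.
        next_run += interval_s
        delay = next_run - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return runs
```

Picking `max_runtime_s` slightly under the schedule interval leaves a small gap at the boundary instead of an overlap; which of the two is less harmful depends on the job.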
Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
I am using Celery to run some tasks that take a long time to complete. There is an initial task that needs to complete before two sub-tasks can run. The tasks that I created are file system operations and don't return a result.
I would like the subtasks to run at the same time, but when I use a group for these tasks they run sequentially and not in parallel.
I have tried:
g = group([secondary_task(), secondary_tasks2()])
chain(initial_task(),g)
I've also tried running the group directly in the first task, but that doesn't seem to work either.
Is what I'm trying to accomplish doable with Celery?
      First Task
       /      \
Second Task  Third Task
Not:
First Task
    |
Second Task
    |
Third Task
The chain is definitely the right approach.
I would expect this to work: chain(initial_task.s(), g)()
Do you have more than one celery worker running to be able to run more than one task at the same time?
I have a task which I execute once a minute using celerybeat. It works fine. Sometimes, though, the task takes a few seconds more than a minute to run, so two instances of the task end up running at once. This leads to some race conditions that mess things up.
I can (and probably should) fix my task to work properly but I wanted to know if celery has any builtin ways to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
If you are using a cron schedule or a time interval to run periodic tasks, you will still have the problem. You can always use a lock mechanism backed by a db, a cache, or even the filesystem. You could also schedule the next task from the previous one, though that may not be the best approach.
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
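As a rough illustration of the filesystem variant of the lock, here is a sketch that relies on `os.open` with `O_CREAT | O_EXCL`, which fails if the file already exists, so creating the lock file is a single atomic step. The `file_lock` helper and the lock path are made up for this example, and it assumes all task instances run on the same host with a local filesystem.

```python
import os
from contextlib import contextmanager


@contextmanager
def file_lock(path):
    """Yield True if this caller got the lock, False if someone else holds it."""
    try:
        # O_EXCL makes creation fail when the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield True
    finally:
        os.remove(path)  # release the lock
```

Inside the task you would write `with file_lock("/tmp/my_task.lock") as acquired:` and return early when `acquired` is false. One caveat: if the process is killed before the `finally` block runs, the stale file blocks every later run until someone deletes it, which is one reason cache-based locks with an expiry are popular.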
You can try adding a class field to the object that holds the function you're running, and use that field as a "someone else is already working" flag.
The lock is a good way with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to get a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When celery starts up with 4 processes, all 4 of them grab a task, and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.
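This is not the decorator from that post, just a small sketch of the same general idea: take the lock with one atomic "set if not exists" call instead of a separate get followed by a set, which is what lets two workers both think they got it. The decorator expects a client whose `set` accepts `nx=True` and `ex=<seconds>`, as redis-py's `Redis.set` does. `InMemoryStore` is a hypothetical stand-in so the example runs without a Redis server, and `single_instance` is a made-up name.

```python
import functools
import threading
import time
import uuid


class InMemoryStore:
    """Hypothetical stand-in that mimics set(nx=..., ex=...), get and delete."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()

    def _live(self, name):
        entry = self._data.get(name)
        if entry is not None and entry[1] is not None and entry[1] <= time.monotonic():
            del self._data[name]  # entry has expired
            return None
        return entry

    def set(self, name, value, nx=False, ex=None):
        with self._guard:  # check and write happen as one atomic step
            if nx and self._live(name) is not None:
                return None
            expires = time.monotonic() + ex if ex is not None else None
            self._data[name] = (value, expires)
            return True

    def get(self, name):
        with self._guard:
            entry = self._live(name)
            return None if entry is None else entry[0]

    def delete(self, name):
        with self._guard:
            return 1 if self._data.pop(name, None) is not None else 0


def single_instance(store, key, ttl_s=60):
    """Skip the call (returning None) when another run holds the lock."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            token = uuid.uuid4().hex
            if not store.set(key, token, nx=True, ex=ttl_s):
                return None
            try:
                return func(*args, **kwargs)
            finally:
                # Only release our own lock. With a real Redis client this
                # get-then-delete is not atomic, and get() returns bytes
                # unless decode_responses=True is set on the client.
                if store.get(key) == token:
                    store.delete(key)
        return wrapper
    return decorator
```

The expiry (`ttl_s`) should be longer than the task ever takes; otherwise the lock can lapse mid-run and a second instance can start anyway.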