Run luigi tasks in parallel - python

I have an application with several Luigi tasks (I did not write that app). Now I want to introduce another task in the middle of the process, which will monitor some AWS instances. This task, once started, should run until the end of the workflow, and it must run in parallel with the other tasks. See the picture at the link below for a better understanding.
Link to the schema
I looked in the documentation but could not find a solution. I am new to Luigi, so I have probably missed something.

I don't think you missed anything, and I don't think Luigi covers that use case. However, one thing you could do is have Task 3 require only Task 2, have Task 4 require Task 2 instead of Task 3, and have Task 3 continually run some code and monitor the output of Task 5 to know when it should close. It's not the prettiest, but it should work.
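A rough sketch of how that could look; Task2 is assumed to come from your existing pipeline, and the target file names, the polling interval, and monitor_aws_instances() are made-up placeholders:

import time

import luigi


def monitor_aws_instances():
    # Placeholder for your actual AWS monitoring code.
    pass


class Task3(luigi.Task):
    # The monitoring task: requires only Task2, then polls until Task5's output exists.

    def requires(self):
        return Task2()  # Task2 comes from the existing pipeline

    def output(self):
        return luigi.LocalTarget('task3_done.flag')  # made-up completion marker

    def run(self):
        final_output = luigi.LocalTarget('task5_result.txt')  # Task5's (made-up) output
        while not final_output.exists():
            monitor_aws_instances()
            time.sleep(60)
        with self.output().open('w') as f:
            f.write('done')


class Task4(luigi.Task):
    def requires(self):
        return Task2()  # require Task2 instead of Task3, so Task4 is not blocked by the monitor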
However, there are a couple of problems I can foresee (which is why it probably isn't supported by Luigi). If you have enough Task 3s running, you might never complete the workflow, as Task 4 never gets run. That's why this isn't recommended: you are essentially creating hidden requirements that the dependency graph doesn't know about. Another issue is that Task 3 might never run until you are done with all of the Task 5s, in which case it's useless.
One last idea: instead of having Task 3 at all, at the end of Task 2 or the beginning of Task 4 you start a process on the scheduler node (using a plain luigi.Task rather than one of the extensions that make the work run on another node of the cluster). Then, at the end of Task 5, you stop that process. There are some other edge cases you'll need to consider, though, to make sure the process doesn't run too short or too long.
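For that last idea, a minimal sketch of starting and stopping a side process from within the tasks themselves; monitor_aws.py and the pid-file path are invented names, and requires()/output() are omitted since they stay as in your existing pipeline:

import os
import signal
import subprocess

import luigi


class Task4(luigi.Task):
    def run(self):
        # Start the monitor as a side process on this node and remember its pid.
        proc = subprocess.Popen(['python', 'monitor_aws.py'])  # made-up script
        with open('/tmp/aws_monitor.pid', 'w') as f:
            f.write(str(proc.pid))
        # ... Task 4's real work ...


class Task5(luigi.Task):
    def run(self):
        # ... Task 5's real work ...
        # Stop the monitor that Task 4 started.
        with open('/tmp/aws_monitor.pid') as f:
            os.kill(int(f.read().strip()), signal.SIGTERM)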
Good luck!

Related

Use Airflow for frequent tasks

We have been using Airflow for a while, and it is just great.
Now we are considering moving some of our very frequent tasks onto our Airflow server too.
Let's say I have a script running every second.
What's the best practice to schedule it with airflow:
1. Run this script in a DAG that is scheduled every second. I highly doubt this is the solution; there is significant overhead for a DAG run.
2. Run this script in a while loop that stops after 6 hours, then schedule it on Airflow to run every 6 hours?
3. Create a DAG with no schedule, and put the task in a while True loop with a proper sleep time, so the task never terminates unless there is an error.
Any other suggestions?
Or is this kind of task just not suitable for Airflow? Should I do it with a Lambda function and an AWS scheduler instead?
Cheers!
"What's the best practice to schedule it ... this kind of task is just not suitable for Airflow?"
It is not suitable.
In particular, your Airflow is probably configured to re-examine the set of DAGs every 5 seconds, which doesn't sound like a good fit for a 1-second task. Plus, the ratio of scheduling overhead to work performed would not be attractive. I suppose you could schedule five simultaneous tasks, twelve times per minute, and have them sleep zero to four seconds, but that's just crazy. And you would likely need to "lock against yourself" to avoid having simultaneous sibling tasks step on each other's toes.
The six-hour suggestion (2.) is not crazy. I would view it as a sixty-minute @hourly task instead, since the overheads are similar. Exiting after an hour and letting Airflow respawn it has several benefits. Log rolling happens at regular intervals. If your program crashes, it will be restarted before too long. If your host reboots, again, your program is restarted before too long. The downside is that your business need may view "more than a minute" as "much too long". And coordinating overlapping tasks, or gaps between tasks, at the hour boundary may pose some issues.
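If you go the hourly route, the task itself can be a plain script that loops for a while and then exits so Airflow can respawn it; a minimal sketch, where the 55-minute budget and do_work() are assumptions:

import time

RUN_FOR_SECONDS = 55 * 60  # exit a bit before the next scheduled run starts
SLEEP_SECONDS = 1


def do_work():
    # Placeholder for the job you want to run every second.
    pass


def main():
    deadline = time.monotonic() + RUN_FOR_SECONDS
    while time.monotonic() < deadline:
        started = time.monotonic()
        do_work()
        # Sleep whatever is left of the one-second budget.
        time.sleep(max(0.0, SLEEP_SECONDS - (time.monotonic() - started)))


if __name__ == '__main__':
    main()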
Your stated needs exactly match the problem that Supervisor addresses. Just use that. You will always have exactly one copy of your event loop running, even if the app crashes, even if the host crashes. Log rolling and other administrative details have already been addressed. The code base is mature and lots of folks have beat on it and incorporated their feature requests. It fits what you want.
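For reference, a minimal supervisord program entry for such a looping script could look like the following; the program name, command, and log paths are all made up:

[program:every_second_job]
command=/usr/bin/python /opt/myapp/every_second_job.py
autostart=true
autorestart=true             ; respawn the loop if it ever exits or crashes
startsecs=5
stdout_logfile=/var/log/every_second_job.out.log
stderr_logfile=/var/log/every_second_job.err.log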

Where is completion state of task instance stored in Luigi

I am starting with Luigi, and I wonder how Luigi knows that it shouldn't re-run a task that was already run successfully with the same parameters. I read through the docs but didn't find the answer.
Hypotheses:
Does Luigi store the state (task instances and their results) in memory rather than in a database? So when I restart the scheduler, does it forget everything and re-run all tasks?
Or does Luigi always call task.complete for every scheduled task to see whether it should be run, which would mean that the complete handler should be really quick?
Or does it work in a different way?
Thanks for help!
Aha, I found this in the documentation of Task.output:
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.
So it means that complete() or output().exists() should be really, really fast.
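So completion is determined by checking targets, not by any state the scheduler persists. A minimal example of how that looks in practice (the file path is made up):

import luigi


class MakeReport(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # The default complete() just checks whether this target exists,
        # so exists() should be cheap (here it is a single filesystem stat).
        return luigi.LocalTarget('/tmp/report-{}.csv'.format(self.date))

    def run(self):
        with self.output().open('w') as f:
            f.write('col1,col2\n')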

Is it possible to manually restart a Celery task

I am using Celery 3 with Django, and Flower to monitor tasks.
Is there any way that, if my task fails, I can fix the code, get the task ID, and then restart that task?
Is that possible?
Or is there even a way to manually place any failed task in another queue so that it can be processed again after the cause of the failure is fixed?
A bit of a hack, but what works for me is creating a new task instance with the same task ID. For example, a task with ID 'abc' runs and fails. I then "restart" the task by running:
my_task.apply_async(args=('whatever',), task_id='abc')
In reality it is less of a "restart" and more just a replacement of the original task result, but it gets the job done. Definitely open to better suggestions here as it does feel a bit clumsy.
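Building on that, one could wrap it in a small helper; note that you have to supply the original arguments yourself (the result backend does not store them by default), and myapp.tasks, my_task, and the 'retries' queue name are all made up:

from celery.result import AsyncResult

from myapp.tasks import my_task  # hypothetical task module


def retry_failed(task_id, args=(), kwargs=None, queue='retries'):
    # Resubmit a failed task under its original ID, optionally on another queue.
    result = AsyncResult(task_id)
    if result.state == 'FAILURE':
        my_task.apply_async(
            args=args,
            kwargs=kwargs or {},
            task_id=task_id,  # reuse the original ID so the old failure record is replaced
            queue=queue,      # route the retry to a separate queue if desired
        )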

Parallel processing within a queue (using Pool within Celery)

I'm using Celery to queue jobs from a CGI application I made. The way I've set it up, Celery makes each job run one- or two-at-a-time by setting CELERYD_CONCURRENCY = 1 or = 2 (so they don't crowd the processor or thrash from memory consumption). The queue works great, thanks to advice I got on StackOverflow.
Each of these jobs takes a fair amount of time (~30 minutes serially), but is embarrassingly parallel. For this reason, I was using Pool.map to split the work and do it in parallel. It worked great from the command line, and I got runtimes around 5 minutes using a new many-cored chip.
Unfortunately, there is a limitation that does not allow daemonic processes to have child processes, and when I run the fancy parallelized code within the CGI queue, I get this error:
AssertionError: daemonic processes are not allowed to have children
I noticed other people have had similar questions, but I can't find an answer that wouldn't require abandoning Pool.map altogether, and making more complicated thread code.
What is the appropriate design choice here? I can easily run my serial jobs using my Celery queue. I can also run my much faster parallelized jobs without a queue. How should I approach this, and is it possible to get what I want (both the queue and the per-job parallelization)?
A couple of ideas I've had (some are quite hacky):
The job sent to the Celery queue simply calls the command-line program. That program can use Pool as it pleases and then saves the result figures and data to a file (just as it does now); see the sketch after this list. Downside: I won't be able to check on the status of the job or see if it terminated successfully. Also, system calls from CGI may cause security issues.
Obviously, if the queue is very full of jobs, I can make use of the CPU resources (by setting CELERYD_CONCURRENCY = 6 or so); this will allow many people to be "at the front of the queue" at once. Downside: each job will spend a lot of time at the front of the queue; if the queue isn't full, there will be no speedup. Also, many partially finished jobs will be stored in memory at the same time, using much more RAM.
Use Celery's @task to parallelize within sub-jobs. Then, instead of setting CELERYD_CONCURRENCY = 1, I would set it to 6 (or however many sub-jobs I'd like to allow in memory at a time). Downside: first of all, I'm not sure whether this will successfully avoid the "task-within-task" problem. But also, the notion of queue position may be lost, and many partially finished jobs may end up in memory at once.
Perhaps there is a way to call Pool.map and specify that the worker processes are non-daemonic? Or perhaps there is something more lightweight I can use instead of Pool.map? This is similar to an approach taken in another open StackOverflow question. Also, I should note that the parallelization I exploit via Pool.map is similar to linear algebra, and there is no inter-process communication (each worker just runs independently and returns its result without talking to the others).
Throw away Celery and use multiprocessing.Queue. Then maybe there'd be some way to use the same "thread depth" for every thread I use (i.e. maybe all of the threads could use the same Pool, avoiding nesting)?
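A sketch of idea 1 from the list above, with the Celery task just shelling out to the existing Pool-based script; the app name, broker URL, and bigjob.py are placeholders. Because the child is a freshly exec'd interpreter rather than a multiprocessing child, the "daemonic processes are not allowed to have children" restriction does not apply to the Pool the script creates for itself, and a non-zero exit code still marks the Celery task as failed:

import subprocess

from celery import Celery

app = Celery('jobs', broker='amqp://localhost')  # hypothetical broker URL


@app.task
def run_big_job(input_path, output_path):
    # 'bigjob.py' stands in for the existing Pool.map-based command-line program.
    # check_call raises CalledProcessError on a non-zero exit status, so the
    # Celery task is recorded as failed if the job does not finish cleanly.
    subprocess.check_call(['python', 'bigjob.py', input_path, '-o', output_path])
    return output_path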
Thanks a lot in advance.
What you need is a workflow management system (WFMS) that manages
task concurrency
task dependency
task nesting
among other things.
From a very high level view, a WFMS sits on top of a task pool like celery, and submits the tasks which are ready to execute to the pool. It is also responsible for opening up a nest and submitting the tasks in the nest accordingly.
I've developed a system to do just that. It's called pomsets. Try it out, and feel free to send me any questions.
I use multiprocessing daemons based on Twisted with forking, and dispatch the jobs through Gearman; that works fine for me.
Take a look at Gearman.

Making only one task run at a time in celerybeat

I have a task which I execute once a minute using celerybeat. It works fine. Sometimes, though, the task takes a few seconds more than a minute to run, because of which two instances of the task end up running at once. This leads to race conditions that mess things up.
I can (and probably should) fix my task to behave properly, but I wanted to know if Celery has any built-in way to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
If you are using a cron schedule or a time interval to run periodic tasks, you will still have the problem. You can always use a locking mechanism based on a DB, a cache, or even the filesystem, or schedule the next task from the previous one, though that is maybe not the best approach.
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
You can try adding a class field to the object that holds the function you are running, and use that field as a "someone else is already working on this or not" flag.
A lock is a good approach with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to get a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When Celery starts up with 4 processes, all 4 of them grab a task, and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX, or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.
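For reference, a minimal version of that lock idea using redis-py's built-in lock might look like this; the broker URL, lock name, timeout, and do_the_actual_work() are placeholders, and the timeout should comfortably exceed the task's worst-case runtime:

import redis

from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # hypothetical broker
redis_client = redis.StrictRedis(host='localhost', port=6379, db=0)


def do_the_actual_work():
    # Placeholder for the real task body.
    pass


@app.task
def every_minute_job():
    # Take a short-lived lock so overlapping beat runs do not execute concurrently.
    lock = redis_client.lock('lock:every_minute_job', timeout=120)
    if not lock.acquire(blocking=False):
        return 'skipped: previous run still in progress'
    try:
        do_the_actual_work()
    finally:
        lock.release()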
