I am starting with Luigi, and I wonder how Luigi knows that it shouldn't re-run a task because it has already run successfully with the same parameters. I read through the docs, but didn't find the answer.
Hypotheses:
Does Luigi store the state (task instances and their results) in memory (it doesn't use a DB)? So, when I restart the scheduler, it forgets everything and re-runs all tasks?
Or, does Luigi always run task.complete for any scheduled task to see if the task should be run? Which would mean that the complete handler should be really quick?
Or, does it work in a different way?
Thanks for help!
Aha, found this in task.output:
The output of the Task determines if the Task needs to be run–the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.
So it means that complete (or output.exists) should be really fast.
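A minimal sketch of that idea in plain Python, without importing Luigi: the "memory" of a finished task is simply the existence of its output, so it survives a scheduler restart. The class names FileTarget and WriteGreeting and the helper run_if_needed are hypothetical stand-ins, not Luigi's own classes.

```python
import os
import tempfile


class FileTarget:
    """Hypothetical stand-in for a file-backed target (not Luigi's class)."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return os.path.exists(self.path)


class WriteGreeting:
    """Hypothetical task that counts as complete once its output file exists."""

    def __init__(self, path):
        self.path = path

    def output(self):
        return FileTarget(self.path)

    def complete(self):
        # Completeness is derived from the output, not from stored state.
        return self.output().exists()

    def run(self):
        with open(self.path, "w") as handle:
            handle.write("hello\n")


def run_if_needed(task):
    """Run the task only when its output is missing; return True if it ran."""
    if task.complete():
        return False
    task.run()
    return True


workdir = tempfile.mkdtemp()
task = WriteGreeting(os.path.join(workdir, "greeting.txt"))
first_run = run_if_needed(task)   # output missing, so the task runs
second_run = run_if_needed(task)  # output exists, so the task is skipped
```

Because the check runs for every scheduled task, an exists() that hits a slow remote store would be paid on each scheduling pass.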
I have set up a testing environment where I have Celery workers actually running in other processes, so that the full functionality of my system with Celery can be tested. This way, tasks actually run in a worker process and communicate back to the test runner, and so I don't need CELERY_ALWAYS_EAGER to test this functionality.
That being said, in some situations I have tasks which trigger off other tasks without caring when they finish, and I'd like to create tests which do - that is, to wait for those subtasks to finish. In these cases, the simplest approach seems to be to run just these tests eagerly (i.e. with CELERY_ALWAYS_EAGER set to true).
However, I don't see a straightforward way to change the config after Celery is initialized... and indeed, from a glance at the source code, it seems that it assumes the config won't change once the app starts.
This makes sense for a lot of options, since the worker would have to actually see the change, and changing it from the main program wouldn't do anything. But in the case of CELERY_ALWAYS_EAGER, it makes sense for the main program to be able to change it.
Is there any straightforward/well-supported way to do this? If not, what's a preferably not-too-hacky way to do this?
Another option is to make the task in question return the task ids it started off, so that the test can then wait on them... but I don't like the idea of changing my API for the sole purpose of making it runnable in a unit test.
Simply changing variables on Celery's .conf object (an instance of Settings) works:
app.conf.CELERY_ALWAYS_EAGER = True
Although conf is indeed a @cached_property of Celery (in version 3.1.22 anyway), this caches the instance returned, not all the values - so the configuration is indeed dynamically updatable.
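To illustrate why caching the instance does not freeze its values, here is a small sketch using the standard library's functools.cached_property as a stand-in for the decorator Celery uses. The Settings and App classes and the ALWAYS_EAGER attribute are hypothetical and only mimic the shape of the real objects.

```python
from functools import cached_property


class Settings:
    """Hypothetical settings holder with a single flag."""

    def __init__(self):
        self.ALWAYS_EAGER = False


class App:
    """Hypothetical app whose conf object is built once and then cached."""

    @cached_property
    def conf(self):
        return Settings()


app = App()
same_object = app.conf is app.conf  # the Settings instance itself is cached
app.conf.ALWAYS_EAGER = True        # mutating the cached instance still works
flag_after_change = app.conf.ALWAYS_EAGER
```

Every later read of app.conf returns the same object, so the changed flag is visible to any code in the same process that reads it afterwards.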
I would like to know when some tasks have finished executing, something I can achieve in celery with:
task.ready()
I really don't care about the actual results, I only need to know if the tasks are completed.
Storing the results is not an option, because they are complex objects from an external library, and they are not serializable.
So, is it possible to know when a task is ready, without having to store the results?
I need a framework which will allow me to do the following:
Allow to dynamically define tasks (I'll read an external configuration file and create the tasks/jobs; task=spawn an external command for instance)
Provide a way of specifying dependencies on existing tasks (e.g. task A will be run after task B is finished)
Be able to run tasks in parallel in multiple processes if the execution order allows it (i.e. no task interdependencies)
Allow a task to depend on some external event (don't know exactly how to describe this, but some tasks finish and they will produce results after a while, like a background running job; I need to specify some of the tasks to depend on this background-job-completed event)
Undo/Rollback support: if one task fails, try to undo everything that has been executed before (I don't expect this to be implemented in any framework, but I guess it's worth asking).
So, obviously, this looks more or less like a build system, but I don't seem to be able to find something that will allow me to dynamically create tasks; most things I've seen already have them defined in the "Makefile".
Any ideas?
I've been doing a little more research and I've stumbled upon doit which provides the core functionality I need, without being overkill (not saying that Celery wouldn't have solved the job, but this does it better for my use case).
Another option is to use make.
Write a Makefile manually or let a python script write it
use meaningful intermediate output file stages
Run make, which should then call out the processes. The processes would be a python (build) script with parameters that tell it which files to work on and what task to do.
parallel execution is supported with -j
it can also delete output files if tasks fail (GNU make does this when the special target .DELETE_ON_ERROR is declared)
This circumvents some of the python parallelisation problems (GIL, serialisation).
Obviously only straightforward on *nix platforms.
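The "let a python script write it" step could look roughly like the sketch below. The function render_makefile, the script name build.py, its flags and the file names are all made up for illustration. .DELETE_ON_ERROR is GNU make's special target that removes a target file when its recipe fails.

```python
def render_makefile(rules):
    """Render rules as Makefile text.

    rules maps a target file to (list of prerequisite files, shell command).
    """
    lines = [".DELETE_ON_ERROR:", ""]
    for target, (prerequisites, command) in rules.items():
        lines.append(f"{target}: {' '.join(prerequisites)}".rstrip())
        lines.append(f"\t{command}")  # recipe lines must start with a tab
        lines.append("")
    return "\n".join(lines)


# Hypothetical pipeline: build.py, its flags and the file names are made up.
rules = {
    "clean.csv": (
        ["raw.csv"],
        "python build.py --task clean --out clean.csv raw.csv",
    ),
    "report.txt": (
        ["clean.csv"],
        "python build.py --task report --out report.txt clean.csv",
    ),
}
makefile_text = render_makefile(rules)
```

Writing makefile_text to a file named Makefile and running make -j4 report.txt would then let make work out the order and run independent steps in parallel.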
AFAIK, there is no framework in Python which does exactly what you describe. So your options are either building something of your own or bending some of your requirements and modelling them with an existing tool, which smells like Celery.
You may have a Celery task which reads a configuration file containing some Python functions' source code, then use exec to execute it (eval only handles single expressions, and ast.literal_eval only parses literals and cannot run code).
Celery provides a way to define subtasks (dependencies between tasks), so if you are aware of your dependencies, you can model them accordingly.
Provided that you know the execution order of your tasks you can route them to as many worker machines as you want.
You can periodically poll this background job's result and then start your tasks that are dependent on it.
Undo/Rollback: this might be tricky and depends on what you want to undo; results? state?
I'm running a Django website where I use Celery to implement preventive caching - that is, I calculate and cache results even before they are requested by the user.
However, one of my Celery tasks could, in some situations, be called a lot (I'd say slightly quicker than it completes on average, actually). I'd like to rate_limit it so that it doesn't consume a lot of resources when it's actually not that useful.
However, I'd like first to understand how Celery's celery.task.base.Task.rate_limit attribute is enforced. Are tasks refused? Are they delayed and executed later?
Thanks in advance!
Rate limited tasks are never dropped, they are queued internally in the worker so that they execute as soon as they are allowed to run.
The token bucket algorithm does not specify anything about dropping packets (it is an option, but Celery does not do that).
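To make the "delayed, not dropped" behaviour concrete, here is a small token bucket sketch in plain Python. It is not Celery's implementation. The class TokenBucket and its method delay_for_next are hypothetical names, and time is passed in explicitly so the example needs no sleeping.

```python
class TokenBucket:
    """Token bucket that tells a caller how long to wait (it never drops)."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.tokens = capacity
        self.updated = 0.0        # time up to which tokens are accounted for

    def delay_for_next(self, now):
        """Reserve one token for a task arriving at `now`; return its wait."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        # Not enough tokens: book the moment the next token becomes available.
        self.updated += (1 - self.tokens) / self.rate
        self.tokens = 0.0
        return self.updated - now


bucket = TokenBucket(rate=1.0, capacity=1)
# Three tasks arrive at the same instant; each gets a start delay, none is lost.
delays = [bucket.delay_for_next(now=0.0) for _ in range(3)]
```

With one token per second, the three simultaneous tasks are spaced one second apart instead of being rejected.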
I have a task which I execute once a minute using celerybeat. It works fine. Sometimes though, the task takes a few seconds more than a minute to run because of which two instances of the task run. This leads to some race conditions that mess things up.
I can (and probably should) fix my task to work properly but I wanted to know if celery has any builtin ways to ensure this. My cursory Google searches and RTFMs yielded no results.
You could add a lock, using something like memcached or just your db.
If you are using a cron schedule or a time interval to run periodic tasks, you will still have the problem. You can always use a lock mechanism based on a db, a cache or even the filesystem, or schedule the next task from the previous one, though that is maybe not the best approach.
This question can probably help you:
django celery: how to set task to run at specific interval programmatically
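A minimal sketch of the filesystem variant of such a lock, assuming all runs see the same local filesystem. The helper names try_acquire, release and run_exclusively are hypothetical. The lock relies on os.open with O_CREAT | O_EXCL, which creates the file only if it does not exist yet, so two overlapping runs cannot both succeed.

```python
import os
import tempfile


def try_acquire(lock_path):
    """Atomically create the lock file; return True only for the creator."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release(lock_path):
    """Remove the lock file so the next run can acquire it."""
    os.remove(lock_path)


def run_exclusively(lock_path, job):
    """Run job() unless another run holds the lock; return True if it ran."""
    if not try_acquire(lock_path):
        return False
    try:
        job()
    finally:
        release(lock_path)
    return True


lock_path = os.path.join(tempfile.mkdtemp(), "minute_task.lock")
overlapping_attempts = []


def job():
    # While this job runs, a second run started in the meantime is turned away.
    overlapping_attempts.append(run_exclusively(lock_path, lambda: None))


ran = run_exclusively(lock_path, job)
```

One caveat of this design: if the process is killed before release runs, the lock file stays behind, so a real setup would also need some expiry or cleanup rule.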
You can try adding a class field to the object that holds the function you're running, and use that field as a "some other guy is working or not" flag.
The lock is a good way with either beat or a cron.
But, be aware that beat jobs run at worker start time, not at beat run time.
This was causing me to get a race condition even with a lock. Let's say the worker is off and beat throws 10 jobs into the queue. When celery starts up with 4 processes, all 4 of them grab a task and in my case 1 or 2 would get and set the lock at the same time.
Solution one is to use a cron with a lock, as a cron will execute at that time, not at worker start time.
Solution two is to use a slightly more advanced locking mechanism that handles race conditions. For Redis, look into SETNX, or the newer Redlock.
This blog post is really good, and includes a decorator pattern that uses redis-py's locking mechanism: http://loose-bits.com/2010/10/distributed-task-locking-in-celery.html.