I'm using Celery in Python to run background tasks and couldn't find a definitive answer to the question of whether I can split the Celery task definition from the task implementation.
For example, take the really simple task below:
@celery_app.task
def add_numbers(num1, num2):
    return num1 + num2
The definition and implementation are in the same file, i.e. when the caller imports this module to call add_numbers, both the definition and the implementation are imported.
In this case, not so bad. But my tasks are a bit more complex, importing multiple modules and packages that the caller certainly doesn't need and I'd like to keep out of the caller.
So, does Celery provide a way to do this? Or am I going against the framework? Is this even a problem?
I have seen this question: Celery dynamic tasks / hiding Celery implementation behind an interface. But it is well over two years old - more than enough time for a lot to change.
There's a feature called signatures which allows calling tasks without importing them. You will need the Celery app instance to be available:
sig = celery_app.signature('myapp.add_numbers', args=(1,2))
sig.delay()
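Note that this only spares the caller from importing the implementation module: the worker still has to import it so the task gets registered, but the caller only needs the app instance and the task's registered name ('myapp.add_numbers').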
I have been trying to learn more about Celery, but it's difficult to understand what "workers" and the "queue" literally are and what they actually mean in terms of programming. I apologize if this question is very basic, but I can't seem to find a straight answer in simple terms.
I have a Flask/Python app that I want to add task assignment functionality to. For example, when one user completes a task, I want them to be able to flag it for their team members to check. Additionally, I'd like to be able to schedule task assignments, for example have a user complete 10 tasks of a given category per week.
Celery seems like a good way to queue tasks and ensure that they are being completed, but the focus of this system seems to be scheduling resource-intensive processes for asynchronous processing, not literal task assignment and queueing.
My question boils down to: is Celery the proper tool for assigning tasks to users, even if it's not for the purpose of resource saving? Have I misunderstood what the capabilities of Celery are? If so, what would be the tools to use to implement this feature?
Thank you!
Celery is meant for handling the queuing of automated tasks on worker processes. I recommend against using it for assigning tasks to your users.
If you're looking for a lightweight solution for storing queues of tasks for users, you could use lists in Redis: https://redis.io/topics/data-types
You could also probably use any database that you're already using.
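For illustration, a minimal sketch using the redis-py client (the key names and fields are made up for this example):

import json
import redis

r = redis.Redis()

# Push a new assignment onto a per-user list (key name is illustrative).
task = {"task": "review report", "flagged_by": "alice"}
r.rpush("user:bob:tasks", json.dumps(task))

# Later, pop the oldest assignment off that user's list.
raw = r.lpop("user:bob:tasks")
next_task = json.loads(raw) if raw else None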
I have a heavy function (a lot of calculations are done) which outputs an individual number for each user in my Django project. This number changes only a little over time, so to minimize server load I thought about running the function once a day, saving the output, and just referencing the saved output. I know this kind of thing is usually handled with Celery, but the package requires a lot of site-packages and extra modules, so I thought about writing a simple function like:
from datetime import datetime, timedelta

x0 = None  # last time the function was called

def whatever():
    global x0
    x1 = datetime.now()
    if x0 is None or x1 - x0 > timedelta(days=1):
        ...  # run the heavy calculations and save the output
        x0 = datetime.now()
    return ...  # return the saved output
I like to keep my code clean and avoid installing packages that aren't really required, so I would like to know whether there are any downsides to "just" using Python, or any gain from doing it with Celery. The task does not need to be asynchronous, so I don't care about that.
Is there a clear use case for when Celery should be used and when not? Is there a performance loss/gain?
I hope somebody can explain that properly.
Celery is a clear winner but I would like to explain this with pros and cons.
Pros:
You can control Celery from Django very easily. Running a Celery task, cancelling a task, and checking the state/progress of a task can all be done from within Django.
Running a periodic task with Celery is very simple: just register the task from Django, run the Celery worker, and voila, you are done (a minimal sketch is shown at the end of this answer). No need to mess around with crontab or background processes.
Celery is very easy to set up and run. You might already know that if you have gone through the Celery introduction.
Cons
One of the cons is that you need at least one broker (and possibly a result backend), such as Redis or RabbitMQ, running alongside Celery for queuing purposes. Although RabbitMQ is not heavy, it is one more thing you need to install and keep running.
Another is that the Celery worker itself takes some memory; that won't be an issue on a server, but on a local machine the memory consumption might seem high to you.
I would suggest Celery because it gives you more control over your task than a simple background process would.
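For reference, here is a minimal sketch of the periodic-task setup mentioned above (assuming Celery 4+; the module name, broker URL, and schedule are illustrative, not taken from the question):

# tasks.py - a standalone sketch; in a Django project you would wire the
# app into Django via Celery's Django integration instead
from celery import Celery
from celery.schedules import crontab

app = Celery('myproject', broker='redis://localhost:6379/0')

@app.task
def recalculate_user_numbers():
    ...  # run the heavy calculations and store one number per user

# Ask celery beat to queue the task once a day at midnight.
app.conf.beat_schedule = {
    'daily-recalculation': {
        'task': 'tasks.recalculate_user_numbers',
        'schedule': crontab(minute=0, hour=0),
    },
}

The worker (with the beat scheduler) is then started with something like celery -A tasks worker -B.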
If I understood the tutorial correctly, Celery subtask supports almost the same API as task, but has the additional advantage that it can be passed around to other functions or processes.
Clearly, if that was the case, Celery would have simply replaced tasks with subtasks instead of keeping both (e.g., the @app.task decorator would have converted a function to a subtask instead of to a task, etc.). So I must be misunderstanding something.
What can a task do that a subtask can't?
The Celery API has changed quite a bit; my question is specific to version 3.1 (currently the latest).
Edit:
I know the docs say subtasks are intended to be called from other tasks. My question is what prevents Celery from getting rid of tasks completely and using subtasks everywhere? They seem to be strictly more flexible/powerful than tasks:
# tasks.py
from celery import Celery

app = Celery(backend='rpc://')

@app.task
def add(x, y):
    # just print out a log line for testing purposes
    print(x, y)

# client.py
from tasks import add

add_subtask = add.subtask()

# in this context, it seems the following two lines do the same thing
add.delay(2, 2)
add_subtask.delay(2, 2)

# when we need to pass arguments to other tasks, we must use add_subtask,
# so it seems add_subtask is strictly better than add
The difference matters once you start using complex workflows with Celery.
A signature() wraps the arguments, keyword arguments, and execution
options of a single task invocation in a way such that it can be
passed to functions or even serialized and sent across the wire.
Signatures are often nicknamed “subtasks” because they describe a task
to be called within a task.
Also:
subtasks are objects used to pass around the signature of a task
invocation (for example to send it over the network)
A task is just a function definition wrapped with the decorator, while a subtask is a task with its parameters already bound but not yet started. You may transfer the subtask serialized over the network or, more commonly, call it within a group/chain/chord.
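As a minimal sketch (reusing the add task from the question), this is the kind of workflow where signatures/subtasks are needed, since plain tasks cannot be passed around like this:

from celery import chain, group
from tasks import add

# A chain needs signatures: the result of each step is passed
# as the first argument of the next step's signature.
workflow = chain(add.s(2, 2), add.s(4), add.s(8))
workflow.delay()  # computes ((2 + 2) + 4) + 8 on the workers

# A group runs several signatures in parallel.
group(add.s(i, i) for i in range(5)).delay()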
I have a program which opens a lot of URLs and downloads pictures.
One function of the program manages opening the links and downloading the pictures; it contains a for loop and performs some operations on a priority queue. I want to run this function, but for no longer than a set time period. For example, if this function runs for longer than 1 hour, I want to terminate it and run the rest of the program (other functions).
I was trying to find some solutions and found two questions here on Stack Overflow.
The first solution uses only the time module: First solution
The second also uses the multiprocessing module: Second solution
Can someone suggest which one will be more appropriate to use in my program? Here is pseudocode of my function:
def fun():
    for link in linkList:
        if link not in queue:
            queue.push(link)
        else:
            queue.updatePriority(link)
    if queue:
        top = queue.pop()
        fun(top)
This function is called from another function:
def run(startLink):
    fun(startLink)
And the run() function is called from another module.
Which method is better to use with a program which contains a lot of modules and performs a lot of operations?
The asyncio module is ideal for this task.
You can create a future, then use asyncio.wait which supports a timeout parameter.
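A minimal sketch of that idea, assuming fun() can be rewritten as a coroutine (all names here are illustrative):

import asyncio

async def fun():
    ...  # open links, download pictures, work the priority queue

async def main():
    task = asyncio.ensure_future(fun())
    # Wait at most one hour for fun() to finish.
    done, pending = await asyncio.wait({task}, timeout=3600)
    for t in pending:
        t.cancel()  # fun() ran past the limit, so stop it
    ...  # continue with the rest of the program here

asyncio.run(main())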
Using multiprocessing here would be a little bit tricky, because fun is consuming a priority queue (I'm assuming a Queue.PriorityQueue) that comes from some other part of the program. That queue cannot easily be passed between processes - you would need to create a custom multiprocessing.BaseManager subclass, register the Queue.PriorityQueue class with it, start up the Manager server, instantiate a PriorityQueue on the server, and use a Proxy to that instance everywhere you interact with the queue. That's a lot of overhead, and it also hurts performance a bit.
Since it appears you don't actually want any concurrency here - you want the rest of the program to stop while fun is running - I don't think there's a compelling reason to use multiprocessing. Instead, I think using the time-based solution makes more sense.
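For completeness, a rough sketch of the time-based approach, reusing the names from the question's pseudocode (linkList and queue come from elsewhere in the program):

import time

def fun(start_link, deadline):
    # Give up once the time budget has been used.
    if time.monotonic() > deadline:
        return
    for link in linkList:
        if link not in queue:
            queue.push(link)
        else:
            queue.updatePriority(link)
    if queue:
        fun(queue.pop(), deadline)

def run(startLink):
    fun(startLink, deadline=time.monotonic() + 3600)  # 1 hour limit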
I need a framework which will allow me to do the following:
Allow to dynamically define tasks (I'll read an external configuration file and create the tasks/jobs; task=spawn an external command for instance)
Provide a way of specifying dependencies on existing tasks (e.g. task A will be run after task B is finished)
Be able to run tasks in parallel in multiple processes if the execution order allows it (i.e. no task interdependencies)
Allow a task to depend on some external event (don't know exactly how to describe this, but some tasks finish and they will produce results after a while, like a background running job; I need to specify some of the tasks to depend on this background-job-completed event)
Undo/rollback support: if one task fails, try to undo everything that has been executed before (I don't expect this to be implemented in any framework, but I guess it's worth asking...)
So, obviously, this looks more or less like a build system, but I don't seem to be able to find something that will allow me to dynamically create tasks; most things I've seen already have them defined in the "Makefile".
Any ideas?
I've been doing a little more research and I've stumbled upon doit, which provides the core functionality I need without being overkill (not saying that Celery wouldn't have solved the job, but doit does it better for my use case).
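For illustration, a minimal doit sketch of dynamically created tasks with dependencies (the configuration file name and its fields are assumptions for the example):

# dodo.py - doit collects task_* functions from this file
import json

def task_jobs():
    # Create tasks dynamically from an external configuration file.
    with open('jobs.json') as f:
        jobs = json.load(f)
    for job in jobs:
        yield {
            'name': job['name'],
            'actions': [job['command']],  # e.g. an external command to spawn
            # other sub-tasks this one must wait for (doit names them jobs:<name>)
            'task_dep': ['jobs:%s' % dep for dep in job.get('depends_on', [])],
        }

Running doit -n 4 then executes independent tasks in parallel.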
Another option is to use make.
Write a Makefile manually or let a python script write it
use meaningful intermediate output file stages
Run make, which should then call out the processes. The processes would be a python (build) script with parameters that tell it which files to work on and what task to do.
Parallel execution is supported with -j
It also deletes output files if tasks fail
This circumvents some of the python parallelisation problems (GIL, serialisation).
Obviously only straightforward on *nix platforms.
AFAIK, there is no framework in Python which does exactly what you describe. So your options are either building something on your own, or hacking around some of your requirements and modeling them with an existing tool - which smells like Celery.
You could have a Celery task which reads a configuration file containing some Python functions' source code, then uses eval or ast.literal_eval to execute them.
Celery provides a way to define subtasks (dependencies between tasks), so if you are aware of your dependencies, you can model them accordingly.
Provided that you know the execution order of your tasks, you can route them to as many worker machines as you want.
You can periodically poll this background job's result and then start your tasks that are dependent on it.
Undo/Rollback: this might be tricky and depends on what you want to undo; results? state?