I have a project that uses Celery to process tasks, and a second project, an API, that may need to enqueue tasks to be processed by the Celery workers.
However, these two projects are separate, and I can't import the tasks into the API project.
I've used Sidekiq, Celery's equivalent in Ruby, in the past; there it is possible to push jobs from other languages/apps/processes by writing data to Redis in the same format/payload.
Is something similar possible with Celery? I couldn't find anything related.
Yes, this is possible in Celery using send_task or signatures. Assuming fetch_data is the task in the separate code base, you can invoke it with either of the methods below.
send_task
celery_app.send_task('fetch_data', kwargs={'url': request.json['url']})
app.signature
celery_app.signature('fetch_data', kwargs={'url': request.json['url']}).delay()
You just specify the function name as a string and do not need to import it into your codebase.
You can read about this in more detail from https://www.distributedpython.com/2018/06/19/call-celery-task-outside-codebase/
Related
I am trying to introduce dynamic workflows into my landscape that involve multiple steps of model inference, where the output of one model is fed into another. Currently we have a few Celery workers spread across hosts to manage the inference chain. As the complexity increases, we are attempting to build workflows on the fly, and for that purpose I got a dynamic DAG setup working with the CeleryExecutor.

Now, is there a way I can retain the current Celery setup and route Airflow-driven tasks to the same workers? I understand that these workers need access to the same DAG folders and environment as the Airflow server. What I want to know is how the Celery workers need to be started on those servers so that Airflow can route the same tasks that used to be run by the manual workflow from a Python application. If I start the workers with "airflow celery worker", I cannot access my application's tasks. If I start Celery the way it currently runs, i.e. "celery -A proj", Airflow has nothing to do with it. Looking for ideas to make it work.
Thanks @DejanLekic. I got it working (though the DAG task-scheduling latency was so high that I dropped the approach). If someone is looking to see how this was accomplished, here are the few things I did to get it working.
Change the airflow.cfg to change the executor, queue and result-backend settings (obvious).
If we have to use a Celery worker spawned outside the Airflow umbrella, change the celery_app_name setting to celery.execute instead of airflow.executors.celery_execute and change the executor to "LocalExecutor". I have not tested this, but it may even be possible to avoid switching to the Celery executor by registering Airflow's task in the project's Celery app.
Each task will now call send_task(); the AsyncResult object returned is then stored either in XCom (implicitly or explicitly) or in Redis (implicitly pushed to the queue), and the child task will then gather the AsyncResult (an implicit call to get the value from XCom or Redis) and call .get() to obtain the result of the previous step.
Note: It is not necessary to split the send_task() and .get() between two tasks of the DAG. By splitting them between parent and child, I was trying to take advantage of the lag between tasks. But in my case, the Celery execution of tasks completed faster than Airflow's inherent latency in scheduling dependent tasks.
I know it's not best practice to use threads in a Django project, but I have a project that is using threads:
threading.Thread(target=save_data, args=(dp, conv_handler)).start()
I want to replace this code with Celery, i.e. run a worker with the function
save_data(dispatcher, conversion)
Inside save_data I have an infinite loop, and in this loop I save the states of dispatcher and conversation to a file on disk with pickle.
I want to know whether I can use Celery for such work.
Can the worker see changes of state in dispatcher and conversation?
I personally don't like long-running tasks in Celery. Normally you will have a maximum task time, and if your task takes too long it can time out. The best tasks for Celery are quick and stateless.
Notice that Celery task arguments are serialized when you launch a task, and passing a Python object as a task argument is tricky (not recommended).
I would need more info about the problem you are trying to solve, but if dispatcher and conversion are Django objects I would do something like:
def save_data(dispatcher_id, conversion_id):
    dispatcher = Dispatcher.objects.get(id=dispatcher_id)
    conversion = Conversion.objects.get(id=conversion_id)
And you should avoid that infinite loop in a Celery task. You can work around it by calling this save task periodically, but I encourage you to find a solution that matches Celery better (aim for quick, stateless tasks).
Is Celery mostly just a high-level interface for message queues like RabbitMQ? I am trying to set up a system with multiple scheduled workers doing concurrent HTTP requests, but I am not sure if I need either of them. Another thing I am wondering is where you write the actual task code for the workers to complete, if I am using Celery or RabbitMQ?
RabbitMQ is indeed a message queue, and Celery uses it to send messages to and from workers. Celery is more than just an interface for RabbitMQ. Celery is what you use to create workers, kick off tasks, and define your tasks. It sounds like your use case makes sense for Celery/RabbitMQ. You create a task using the @app.task decorator. Check the docs for more info. In previous projects, I've set up a module for Celery, where I define any tasks I need. Then you can pull in functions from other modules to use in your tasks.
Celery is the task-management framework: the API you use to schedule jobs, the code that gets those jobs started, and the management tools (e.g. Flower) you use to monitor what's going on.
RabbitMQ is one of several "backends" for Celery. It's an oversimplification to say that Celery is a high-level interface to RabbitMQ. RabbitMQ is not actually required for Celery to run and do its job properly. But, in practice, they are often paired together, and Celery is a higher-level way of accomplishing some things that you could do at a lower level with just RabbitMQ (or another queue or message delivery backend).
In the Celery docs, the standard way to set the task schedule is documented as hardcoding it into the config file.
However, it also hints that this can be replaced with a custom backend. I see there is a dynamic, database driven option for Django but I'm using a simple Flask app to define my tasks.
Does anyone have a way to load the schedule dynamically, either by pulling it from a database or by reloading it from a text file on a regular basis, so that the celery beat worker does not need to be restarted? Is it as simple as putting a reload() call around a schedule kept in a text file, perhaps even as its own scheduled Celery task?
CELERYBEAT_SCHEDULE is just init/config sugar, and the object is available from within a bound task at:
self.app.conf['CELERYBEAT_SCHEDULE']
You might write a periodic task that pulls down new values from some back end.
I am using Celery + Kombu with Amazon SQS.
The goal is to be able to remove a task already scheduled for some specific datetime.
I've tried
from celery.task.control import revoke
revoke(task_id)
but that didn't change anything. Is revoke not implemented for SQS transport? Is there some design decision behind it or it's just a lacking feature that should be implemented by some "DeleteMessage" line of code?
Unless you're using RabbitMQ, it's better to come up with a custom solution for revoking tasks. E.g. instead of enqueuing tasks directly, build a system of two components: a table of potential tasks, and a scheduler task that scans the table and executes tasks when their time comes. There is no need to revoke; you can simply decide not to execute a task when needed.
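A sketch of the scan-and-skip idea, with an in-memory list standing in for the table (in practice this would be a database query run from a periodic scheduler task):

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for the "table of potential tasks".
jobs = [
    {'id': 1, 'run_at': datetime(2020, 1, 1, tzinfo=timezone.utc), 'cancelled': False},
    {'id': 2, 'run_at': datetime(2020, 1, 1, tzinfo=timezone.utc), 'cancelled': True},
]

def due_jobs(now):
    # "Revoking" becomes flipping cancelled=True in the table; the periodic
    # scheduler task simply skips those rows, so no SQS message ever has to
    # be deleted.
    return [j for j in jobs if j['run_at'] <= now and not j['cancelled']]
```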