How to use airflow with Celery - python

I'm new to Airflow and Celery. I have finished writing my DAG, but now I want to run its tasks on two computers that are in the same subnet, and I'd like to know how to modify airflow.cfg to do that. Some examples would be helpful. Thanks for any answers.

The Airflow documentation covers this quite nicely:
First, you will need a Celery backend. This can be, for example, Redis or RabbitMQ. Then, the executor parameter in your airflow.cfg should be set to CeleryExecutor.
Then, in the [celery] section of airflow.cfg, set broker_url to point to your Celery backend (e.g. redis://your_redis_host:your_redis_port/1).
Point celery_result_backend to a SQL database (you can use the same database as your main Airflow DB).
Then, on each of your worker machines, simply run airflow worker, and your jobs should start on the two machines.
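Putting those steps together, the relevant parts of airflow.cfg would look roughly like this. The host names, ports, and credentials below are placeholders, and note that the result-backend key is spelled celery_result_backend in older Airflow versions and result_backend in newer ones:

```ini
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@your_db_host:5432/airflow

[celery]
broker_url = redis://your_redis_host:6379/1
celery_result_backend = db+postgresql://airflow:airflow@your_db_host:5432/airflow
```

Every machine (scheduler, webserver, workers) needs the same configuration, and the workers need network access to both the broker and the database.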

Related

Apache Airflow without celery or kubernetes

Is there any way to run workflows without using Celery or Kubernetes? The docs specify only these two ways to run Airflow in multi-node mode. Can't I just add multiple EC2 instances to run my workers for computation (without using Celery or Kubernetes)?
Let's assume you have a number of EC2 instances. How would you manage them from Airflow? How would you distribute the load among those EC2 instances? Celery or Kubernetes take care exactly of these tasks.
If, for some reason, you cannot use Celery or Kubernetes, you can install Airflow on a single instance and scale up its resources as needed.
The only way to accomplish what you want is to write your own Executor (EC2Executor?) that fulfils your requirements.
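To give a feel for what such an executor would be responsible for, here is a minimal, purely illustrative sketch in plain Python. The EC2Dispatcher class and its host list are invented for this example and use no Airflow APIs; a real executor would subclass Airflow's BaseExecutor and actually run each command on the remote host (e.g. over SSH):

```python
from itertools import cycle

class EC2Dispatcher:
    """Toy stand-in for a custom executor: assigns task commands
    to a fixed pool of EC2 hosts in round-robin order."""

    def __init__(self, hosts):
        self._hosts = cycle(hosts)
        self.assignments = []  # (host, command) pairs, newest last

    def execute_async(self, command):
        # A real executor would ship `command` to the chosen host
        # (SSH, SSM, a work queue, ...) and later report its state back.
        host = next(self._hosts)
        self.assignments.append((host, command))
        return host

dispatcher = EC2Dispatcher(["10.0.0.11", "10.0.0.12"])
for task_id in ["extract", "transform", "load"]:
    dispatcher.execute_async(["airflow", "tasks", "run", task_id])
```

The hard parts a real executor must add are exactly what Celery/Kubernetes give you for free: health-checking hosts, retrying failed dispatches, and reporting task state back to the scheduler.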

How to execute plain Celery tasks on Airflow workers

I currently have Airflow set up and working correctly using the CeleryExecutor as a backend to provide horizontal scaling. This works remarkably well, especially with the worker nodes sitting in an autoscaling group on EC2.
In addition to Airflow, I use plain Celery to handle simple asynchronous tasks (that don't need a whole pipeline) coming from Flask/Python. Until now, these plain Celery tasks were very low volume and I just ran the plain Celery worker on the same machine as Flask. There is now a requirement to run a massive number of plain Celery tasks in the system, so I need to scale my plain Celery as well.
One way to do this would be to run the plain Celery worker service on the Airflow worker servers as well (to benefit from the autoscaling etc.), but this doesn't seem like an elegant solution, since it creates two different "types" of Celery worker on the same machine. My question is whether there is some combination of configuration settings I can pass to my plain Celery app that will cause @celery.task-decorated functions to be executed directly on my Airflow worker cluster as plain Celery tasks, completely bypassing the Airflow middleware.
Thanks for the help.
The application is airflow.executors.celery_executor.app, if I remember correctly. To test this against your current Airflow infrastructure, try celery -A airflow.executors.celery_executor.app inspect active. However, I suggest you do not do this, because your Celery tasks may affect the execution of Airflow DAGs, and that may affect your SLAs.
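Conceptually, a plain Celery app will deliver tasks to Airflow's workers if it shares the broker, result backend, and queue those workers consume from. A hedged sketch of that shared configuration is below; the URLs are placeholders to be copied from your airflow.cfg [celery] section, and the queue name assumes Airflow's stock default queue, which is called default:

```python
# Settings a plain Celery app would have to share with Airflow's own
# Celery app for its tasks to land on the Airflow workers.
# All URLs here are placeholders; copy the real values from the
# [celery] section of your airflow.cfg.
shared_celery_conf = {
    "broker_url": "redis://your_redis_host:6379/1",           # same as broker_url
    "result_backend": "db+postgresql://airflow@db/airflow",   # same as celery_result_backend
    "task_default_queue": "default",                          # queue the Airflow workers listen on
}

# With Celery installed, this would be applied roughly as:
#   app = Celery("plain_tasks")
#   app.conf.update(shared_celery_conf)
```

As the answer warns, doing this makes your plain tasks compete with Airflow task instances for the same worker slots.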
What we do in the company I work for is exactly what you suggested - we maintain a large Celery cluster, and we sometimes offload execution of some Airflow tasks to our Celery cluster, depending on the use-case. This is particularly handy when a task in our Airflow DAG actually triggers tens of thousands of small jobs. Our Celery cluster runs 8 million tasks on a busy day.

Celery on a different server [duplicate]

I am new to Celery. I know how to install it and run workers on one server, but I need to distribute tasks to multiple machines.
My project uses Celery to assign user requests coming in through a web framework to different machines and then return the result.
I read the documentation, but it doesn't mention how to set up multiple machines.
What am I missing?
My understanding is that your app will push requests into a queueing system (e.g. RabbitMQ), and then you can start any number of workers on different machines (with access to the same code as the app which submitted the task). They will pick tasks off the message queue and get to work on them. Once they're done, they will update the tombstone database.
The upshot of this is that you don't have to do anything special to start multiple workers. Just start them on separate identical (same source tree) machines.
The server which has the message queue need not be the same as the one with the workers and needn't be the same as the machines which submit jobs. You just need to put the location of the message queue in your celeryconfig.py and all the workers on all the machines can pick up jobs from the queue to perform tasks.
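For example, a minimal celeryconfig.py might contain nothing more than the broker location; every worker machine and the submitting app use the same file. The host name, vhost, and credentials below are placeholders:

```python
# celeryconfig.py -- shared by every worker machine and by the app
# that submits tasks. Only the broker location is strictly required;
# the user, password, host, and vhost here are placeholders.
broker_url = "amqp://myuser:mypassword@your_rabbit_host:5672/myvhost"

# Optional: where to store task results ("tombstones") so the
# submitting app can read them back.
result_backend = "rpc://"
```

Starting a worker on any machine with this file and the task code on its path (celery -A yourapp worker) is then enough for it to join the pool.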
The way I deployed it is like this:
clone your Django project on a Heroku instance (this will run the frontend)
add RabbitMQ as an add-on and configure it
clone your Django project into another Heroku instance (call it, say, worker) where you will run the Celery tasks

Running celery in django not as an external process?

I want to give Celery a try. I'm interested in a simple way to schedule crontab-like tasks, similar to Spring's Quartz.
I see from Celery's documentation that it requires running celeryd as a daemon process. Is there a way to avoid running another external process and simply run this embedded in my Django instance? Since I'm not interested in distributing the work at the moment, I'd rather keep it simple.
Add the CELERY_ALWAYS_EAGER = True option to your Django settings file and all your tasks will be executed locally, synchronously, in the calling process. It seems that for periodic tasks you still have to run celery beat as well.
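As a sketch, the relevant settings would look like this in settings.py. Note that CELERY_ALWAYS_EAGER is the old-style setting name from before Celery 4; newer Celery versions spell the equivalent app setting task_always_eager:

```python
# settings.py (Django) -- run Celery tasks inline, in-process.
# CELERY_ALWAYS_EAGER is the pre-Celery-4 setting name; in Celery 4+
# the equivalent lowercase app setting is task_always_eager.
CELERY_ALWAYS_EAGER = True

# Re-raise task exceptions in the caller instead of storing them,
# so eager tasks fail loudly during development.
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True
```

With these set, calling mytask.delay() runs the task body immediately and returns its result, which is convenient for development but defeats the point of a task queue in production.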
