Apache Airflow without Celery or Kubernetes - Python

Is there any way to run workflows without using Celery or Kubernetes? The docs specify only those two ways to run Airflow in multi-node mode. Can't I just add multiple EC2 instances to run my workers for the computations (without using Celery or Kubernetes)?

Let's assume you have a number of EC2 instances. How would you manage them from Airflow? How would you distribute the load among those EC2 instances? This is exactly what Celery or Kubernetes takes care of.
If, for some reason, you cannot use Celery or Kubernetes, you can install Airflow on a single instance and scale up its resources as needed.

The only way to accomplish what you want is to write your own Executor (EC2Executor?) that fulfils your requirements.
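For illustration, here is a minimal sketch of what such an executor could look like, assuming Airflow's BaseExecutor interface (start / execute_async / sync / end). The fixed worker IP list and the SSH-based dispatch are placeholder assumptions, not a production design.

    # Sketch of a hypothetical EC2Executor: ship the "airflow tasks run ..."
    # command to a fixed pool of EC2 hosts over SSH and poll for completion.
    import subprocess

    from airflow.executors.base_executor import BaseExecutor


    class EC2Executor(BaseExecutor):
        WORKERS = ["10.0.0.11", "10.0.0.12"]  # assumed worker instance IPs

        def start(self):
            self.procs = {}  # task-instance key -> Popen handle

        def execute_async(self, key, command, queue=None, executor_config=None):
            # Naive round-robin placement of the task command onto a host.
            host = self.WORKERS[hash(key) % len(self.WORKERS)]
            self.procs[key] = subprocess.Popen(["ssh", host] + list(command))

        def sync(self):
            # Report finished remote processes back to the scheduler.
            for key, proc in list(self.procs.items()):
                rc = proc.poll()
                if rc is None:
                    continue
                (self.success if rc == 0 else self.fail)(key)
                del self.procs[key]

        def end(self):
            # Drain outstanding work before shutting down.
            while self.procs:
                self.sync()

You would then point the executor setting in airflow.cfg at this class. Worker discovery, load distribution, and recovering from lost hosts are exactly the parts that Celery or Kubernetes would otherwise handle for you.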

Related

How can I properly kill a celery task in a kubernetes environment?

How can I properly kill celery tasks running on containers inside a kubernetes environment? The structure of the whole application (all written in Python) is as follows:
An SDK that makes requests to our API;
A Kubernetes structure with one pod running the API and other pods running Celery containers that handle some long-running tasks triggered by the API. These Celery containers autoscale.
Suppose we call an SDK method that in turn makes a request to the API, which triggers a task to be run in a Celery container. What would be the correct/graceful way to kill this task if need be? I am aware that Celery tasks have a revoke() method, but I tried this approach and it did not work, even with terminate=True and signal=signal.SIGKILL (maybe this has something to do with the fact that I am using Azure Service Bus as the broker?)
Perhaps a mapping between a Celery task and its corresponding container name would help, but I could not find a way to get that information either.
Any help and/or ideas would be deeply appreciated.
The solution I found was to write to a file shared by both the API and the Celery containers. Whenever an interruption is captured, a flag is set to true in this file. Inside the Celery containers I periodically check the contents of that file; if the flag is set to true, I gracefully clean things up and raise an error.
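A minimal sketch of that shared-file approach, assuming both the API pod and the Celery pods mount a common volume at /shared/flags; the path, the per-task file naming, and process_chunk are illustrative assumptions.

    import os

    FLAG_DIR = "/shared/flags"  # assumed shared volume mounted in both pods


    def request_cancel(task_id: str) -> None:
        # Called on the API side when an interruption is captured.
        os.makedirs(FLAG_DIR, exist_ok=True)
        with open(os.path.join(FLAG_DIR, task_id), "w") as f:
            f.write("cancel")


    def cancel_requested(task_id: str) -> bool:
        # Checked periodically inside the Celery container.
        return os.path.exists(os.path.join(FLAG_DIR, task_id))


    def long_running_task(task_id: str, chunks):
        for chunk in chunks:
            if cancel_requested(task_id):
                # Gracefully clear things up, then surface the interruption.
                raise RuntimeError(f"task {task_id} was cancelled")
            process_chunk(chunk)  # hypothetical unit of work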

How to execute plain Celery tasks on Airflow workers

I currently have Airflow set up and working correctly using the CeleryExecutor as a backend to provide horizontal scaling. This works remarkably well, especially with the worker nodes sitting in an autoscaling group on EC2.
In addition to Airflow, I use plain Celery to handle simple asynchronous tasks (that don't need a whole pipeline) coming from Flask/Python. Until now, these plain Celery tasks were very low volume and I just ran the plain Celery worker on the same machine as Flask. There is now a requirement to run a massive number of plain Celery tasks in the system, so I need to scale my plain Celery as well.
One way to do this would be to run the plain Celery worker service on the Airflow worker servers as well (to benefit from the autoscaling etc.), but this doesn't seem like an elegant solution since it creates two different "types" of Celery worker on the same machine. My question is whether there is some combination of configuration settings I can pass to my plain Celery app that will cause @celery.task-decorated functions to be executed directly on my Airflow worker cluster as plain Celery tasks, completely bypassing the Airflow middleware.
Thanks for the help.
The application is airflow.executors.celery_executor.app, if I remember correctly. Try celery -A airflow.executors.celery_executor.app inspect active against your current Airflow infrastructure to test it. However, I suggest you do not do this, because your Celery tasks may affect the execution of Airflow DAGs, and that may affect your SLAs.
What we do in the company I work for is exactly what you suggested - we maintain a large Celery cluster, and we sometimes offload execution of some Airflow tasks to our Celery cluster, depending on the use-case. This is particularly handy when a task in our Airflow DAG actually triggers tens of thousands of small jobs. Our Celery cluster runs 8 million tasks on a busy day.
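If you do want to experiment with this, a minimal sketch is to reuse Airflow's own Celery application object, as below. Whether plain_task is actually registered on every Airflow worker depends on its module being importable there, which is an assumption here.

    # Sketch only: reuse the Celery app that Airflow's CeleryExecutor runs on
    # its workers; plain_task and its module are hypothetical.
    from airflow.executors.celery_executor import app


    @app.task
    def plain_task(x, y):
        # Executed by the Airflow Celery workers, bypassing DAG scheduling.
        return x + y


    if __name__ == "__main__":
        # Equivalent to: celery -A airflow.executors.celery_executor.app inspect active
        print(app.control.inspect().active())
        plain_task.delay(1, 2)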

How to set up celery workers on separate machines?

I am new to Celery. I know how to install and run workers on one server, but I need to distribute tasks to multiple machines.
My project uses Celery to assign user requests coming into a web framework to different machines and then return the result.
I read the documentation, but it doesn't mention how to set up multiple machines.
What am I missing?
My understanding is that your app will push requests into a queueing system (e.g. RabbitMQ), and then you can start any number of workers on different machines (with access to the same code as the app which submitted the task). They will pick tasks off the message queue and get to work on them. Once they're done, they will update the tombstone database (the result store).
The upshot of this is that you don't have to do anything special to start multiple workers. Just start them on separate identical (same source tree) machines.
The server which has the message queue need not be the same as the one with the workers and needn't be the same as the machines which submit jobs. You just need to put the location of the message queue in your celeryconfig.py and all the workers on all the machines can pick up jobs from the queue to perform tasks.
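As a concrete illustration, a minimal celeryconfig.py could look like the sketch below; the broker host, credentials, and result backend are assumptions to replace with your own (newer Celery versions use these lowercase setting names, older ones use BROKER_URL / CELERY_RESULT_BACKEND).

    # celeryconfig.py -- shared by the web app and every worker machine.
    # Everything points at the same message broker, so any worker can pick up jobs.
    broker_url = "amqp://myuser:mypassword@broker.example.com:5672/myvhost"  # assumed RabbitMQ host
    result_backend = "rpc://"  # assumed store for task results ("tombstones")

    # On each worker machine (same source tree as the app), start a worker with:
    #   celery -A myapp worker --loglevel=info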
The way I deployed it is like this:
Clone your Django project onto a Heroku instance (this will run the frontend)
add RabbitMQ as an add-on and configure it
clone your Django project onto another Heroku instance (call it something like "worker"), where you will run the Celery tasks

Running celery in django not as an external process?

I want to give Celery a try. I'm interested in a simple way to schedule crontab-like tasks, similar to Spring's Quartz.
I see from Celery's documentation that it requires running celeryd as a daemon process. Is there a way to refrain from running another external process and simply run this embedded in my Django instance? Since I'm not interested in distributing the work at the moment, I'd rather keep it simple.
Add the CELERY_ALWAYS_EAGER = True option to your Django settings file and all your tasks will be executed locally, synchronously in the same process. It seems that for periodic tasks you still have to run celery beat as well.
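A minimal sketch of that setting in a Django settings.py; the second line is optional and an assumption on my part, so that eagerly executed tasks re-raise their exceptions directly (newer Celery versions use the lowercase task_always_eager / task_eager_propagates equivalents).

    # settings.py -- run Celery tasks in-process instead of sending them to a worker.
    CELERY_ALWAYS_EAGER = True
    # Optional: re-raise task exceptions immediately instead of storing them.
    CELERY_EAGER_PROPAGATES_EXCEPTIONS = True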
