Python tasks and DAGs with different conda environments - python

Say that most of my DAGs and tasks in Airflow are supposed to run Python code on the same machine as the Airflow server.
Can I have different DAGs use different conda environments? If so, how should I do it? For example, can I use the PythonOperator for that, or would that restrict me to using the same conda environment that I used to install Airflow?
More generally, where/how should I ideally activate the desired conda environment for each DAG or task?

The Python interpreter running the Airflow worker code is the one whose environment will be used to execute your task code.
What you can do is set up separate named queues for the separate execution environments, each served by different workers, so that only a specific machine or group of machines will execute a certain DAG.
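For example, if a worker is started from each conda environment and listens on its own queue (this assumes the CeleryExecutor, where workers can be started with something like airflow worker -q py36_env; the queue name and DAG below are hypothetical), a DAG can pin a task to the matching worker via the operator's queue argument. A minimal sketch:

# Sketch only: "py36_env" is a hypothetical queue served by a worker that was
# started from the desired conda environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def my_callable():
    # Runs with whatever Python/conda environment the worker serving the
    # "py36_env" queue was started from.
    import sys
    print(sys.executable)


dag = DAG("conda_env_example", start_date=datetime(2018, 1, 1), schedule_interval=None)

task = PythonOperator(
    task_id="run_in_py36_env",
    python_callable=my_callable,
    queue="py36_env",  # routes this task to the worker(s) listening on that queue
    dag=dag,
)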

Related

Should Airflow and Jupyter be decoupled into two Docker containers

Airflow is used to schedule Python and Jupyter jobs.
There are environment settings, directories, and installed Python and Linux packages used by the Python code.
Should Airflow be installed in a separate Docker container or in the same one?
If it is in a separate container, how can the environment settings, directories, and installed packages be shared with Airflow?
Ideally yes; that way you can scale/restart Airflow and Jupyter independently of each other, without having to take down everything.
For environment variables and packages, you will need to set these on both containers. To avoid duplication, you might want to look at e.g. an .env file so that you don't have to define the same variables twice. See e.g. https://docs.docker.com/compose/environment-variables/#using-the---env-file--option
Files can be shared via a shared volume. How to set this up depends on your container management system, e.g. Docker Compose or Kubernetes.

Make virtual environment setup faster

I have some automated tests on Jenkins where part of the build steps is to set up a virtual environment (Virtualenv Builder option).
This step performs a pip install for about 6 libraries maintained externally.
The problem I have is that this process takes time.
Is there a way of loading an image of the Python virtual environment?
That way I could create another Jenkins job which generates this image, and this job would only be executed if there are significant changes to the libraries. My original automated test jobs would then simply fetch the artifact from this external job at the start of the tests and load the virtual environment from there.

Apache Airflow Continuous Integration Workflow and Dependency Management

I'm thinking of starting to use Apache Airflow for a project and am wondering how people manage continuous integration and dependencies with Airflow. More specifically:
Say I have the following setup:
3 Airflow servers: dev, staging, and production.
I have two Python DAGs whose source code I want to keep in separate repos.
The DAGs themselves are simple: they basically just use a PythonOperator to call main(*args, **kwargs). However, the actual code that's run by main is very large and stretches across several files/modules.
Each Python code base has different dependencies;
for example,
Dag1 uses Python 2.7, pandas==0.18.1, requests==2.13.0
Dag2 uses Python 3.6, pandas==0.20.0 and Numba==0.27, as well as some Cythonized code that needs to be compiled
How do I manage Airflow running these two DAGs with completely different dependencies?
Also, how do I manage the continuous integration of the code for both these DAGs into each Airflow environment (dev, staging, prod)? Do I just get Jenkins or something to SSH to the Airflow server and do something like git pull origin BRANCH?
Hopefully this question isn't too vague and people see the problems I'm having.
We use Docker to run the code with the different dependencies, together with the DockerOperator in the Airflow DAG, which can run Docker containers, also on remote machines (with the Docker daemon already running). We actually have only one Airflow server to run jobs, but several more machines with a Docker daemon running, which the Airflow executors call.
For continuous integration we use GitLab CI with the GitLab Container Registry for each repository. This should be easily doable with Jenkins.
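As an illustration of that setup (the image names, registry, and docker_url values below are made up, and each code base is assumed to be baked into its own image), a DAG along these lines runs each code base in its own container, optionally against a remote Docker daemon. In practice the two code bases would likely live in two separate DAG files, one per repo.

# Sketch only: images, registry, and Docker hosts are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

dag = DAG("docker_deps_example", start_date=datetime(2018, 1, 1), schedule_interval="@daily")

# Dag1's code base, built into an image with Python 2.7, pandas 0.18.1, requests 2.13.0
run_dag1 = DockerOperator(
    task_id="run_dag1_code",
    image="registry.example.com/team/dag1:latest",
    command="python main.py",
    docker_url="tcp://docker-host-1:2375",  # remote machine with a running Docker daemon
    dag=dag,
)

# Dag2's code base, built into an image with Python 3.6, pandas 0.20.0, Numba and compiled Cython code
run_dag2 = DockerOperator(
    task_id="run_dag2_code",
    image="registry.example.com/team/dag2:latest",
    command="python main.py",
    docker_url="tcp://docker-host-2:2375",
    dag=dag,
)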

Python conda deployment on a server

Let's say I have two projects that I develop on my personal machine. I use conda to manage my Python dependencies, and I created environments to manage these projects. When I'm done with development, I want to export them to a remote machine that will regularly run these two projects at the same time. How should I manage this deployment?
After some research, I came up with this:
Clone your environments as described in conda's docs.
Export your environment file to the server along with your project.
Import the environment into the server's conda.
Create a bash script like this:
#!/bin/bash
source activate my_environment
python ~/my_project/src/code.py
Set up cron as usual, calling the previous bash script.

How can I use multiple Python virtual environments on the same server

How can I deploy and host multiple Python projects with different dependencies on the same server at the same time?
It's not true, of course, that only one virtualenv can be activated at once. Yes, only one can be active in a shell session at a time, but your sites are not deployed via shell sessions. Each WSGI process, for example, will create its own environment: so all you need to do is ensure that each WSGI script activates the correct virtualenv, as is (in the case of mod_wsgi at least) well documented.
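For instance, one commonly documented pattern for mod_wsgi uses the activate_this.py script that the classic virtualenv tool places in the environment's bin directory (the stdlib venv module does not ship it); the project name and paths below are made up:

# myproject.wsgi -- sketch only; paths and module names are hypothetical.
# Assumes the environment was created with `virtualenv`, which provides
# bin/activate_this.py.
activate_this = "/srv/envs/myproject/bin/activate_this.py"
with open(activate_this) as f:
    exec(f.read(), {"__file__": activate_this})

# Import the application only after the virtualenv is active, so that the
# environment's packages (not the system ones) are used.
from myproject.app import application  # noqa: E402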
Use virtualenv for Python. You can install any other version of Python/packages in it, if required.
