I'm thinking of starting to use Apache Airflow for a project and am wondering how people manage continuous integration and dependencies with Airflow. More specifically:
Say I have the following setup:
3 Airflow servers: dev, staging, and production.
I have two Python DAGs whose source code I want to keep in separate repos.
The DAGs themselves are simple: they basically just use a PythonOperator to call main(*args, **kwargs). However, the actual code that main runs is very large and spans several files/modules.
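Each DAG looks roughly like this (a sketch with made-up package and callable names; the import path differs between Airflow versions):

from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # path differs in Airflow 2+

from my_large_package import main  # placeholder for the real code base

with DAG(dag_id="dag1", start_date=datetime(2017, 1, 1), schedule_interval="@daily") as dag:
    PythonOperator(
        task_id="run_main",
        python_callable=main,
        op_kwargs={"some_arg": "some_value"},  # placeholder kwargs
    )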
Each Python code base has different dependencies. For example:
Dag1 uses Python 2.7, pandas==0.18.1, requests==2.13.0
Dag2 uses Python 3.6, pandas==0.20.0, and numba==0.27, as well as some Cythonized code that needs to be compiled
How do I manage Airflow running these two DAGs with completely different dependencies?
Also, how do I manage the continuous integration of the code for both of these DAGs into each Airflow environment (dev, staging, prod)? Do I just get Jenkins or something to SSH into the Airflow server and do something like git pull origin BRANCH?
Hopefully this question isn't too vague and people see the problems I'm having.
We use Docker to run the code with different dependencies, and the DockerOperator in the Airflow DAG, which can run Docker containers, also on remote machines (with a Docker daemon already running). We actually have only one Airflow server to run jobs, but more machines with a Docker daemon running, which the Airflow executors call.
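A minimal sketch of what such a DAG can look like (the image name, registry, and docker_url are placeholders, and the DockerOperator import path depends on your Airflow version):

from datetime import datetime

from airflow import DAG
from airflow.operators.docker_operator import DockerOperator

with DAG(dag_id="dag1_dockerized", start_date=datetime(2017, 1, 1), schedule_interval="@daily") as dag:
    DockerOperator(
        task_id="run_main",
        image="registry.example.com/dag1:latest",  # built by CI with Dag1's pinned dependencies
        command="python -m my_large_package.main",  # placeholder entry point
        docker_url="tcp://docker-host.example.com:2375",  # remote Docker daemon
    )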
For continuous integration we use GitLab CI with the GitLab Container Registry for each repository. This should be easily doable with Jenkins as well.
I want to be able to test my jobs and code on my local machine before executing on a remote cluster. Ideally this will not require a lot of setup on my end. Is this possible?
Yes, this is possible. A common development pattern with the Iguazio platform is to utilize a local version of MLRun and Nuclio on a laptop/workstation and move/execute jobs on the cluster at a later point.
There are two main options for installing MLRun and Nuclio on a local environment:
docker-compose - Simpler and easier to get up and running, but restricted to running jobs within the environment it was executed in (i.e. Jupyter or an IDE). This means you cannot specify resources like CPU/MEM/GPU for a particular job. This approach is great for quickly getting up and running. Instructions can be found here.
Kubernetes - More complex to get up and running, but allows for running jobs in their own containers with specified CPU/MEM/GPU resources. This approach is better for emulating the capabilities of the Iguazio platform in a local environment. Instructions can be found here.
Once you have installed MLRun and Nuclio using one of the above options and have created a job/function you can test it locally as well as deploy to the Iguazio cluster directly from your local development environment:
To run your job locally, utilize the local=True flag when specifying your MLRun function like in the Quick-Start guide.
To run your job remotely, specify the required environment files to allow connectivity to the Iguazio cluster as described in this guide, and run your job with local=False (see the sketch below).
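A rough sketch of what that toggle looks like in code, assuming your code lives in a file like job.py with a handler function (the names and parameters here are illustrative):

import mlrun

# Wrap existing code (hypothetical job.py with a handler function) as an MLRun job
fn = mlrun.code_to_function(
    name="my-job",
    filename="job.py",
    kind="job",
    image="mlrun/mlrun",
    handler="handler",
)

# Run inside the local environment
fn.run(params={"limit": 10}, local=True)

# Run on the cluster, assuming the environment files above point at it
fn.run(params={"limit": 10}, local=False)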
I have a Python script that is supposed to run once every few days to annotate some data in a remote database.
Which PaaS services (GAE, Heroku, etc.) allow for a stand-alone Python script to be deployed and executed via some sort of cron scheduler?
GAE has a cron service for scheduled jobs (configured via cron.yaml) and Heroku has Heroku Scheduler. Both are fairly easy to use and configure; you can check the documentation of both. As I don't have any other information on what you want to do, I can't say whether one would be more suitable for you than the other.
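Either way the scheduler just needs a plain entry point it can invoke on a schedule. A minimal sketch of such a script (the name annotate.py and its contents are made up; your existing script probably already looks like this):

# annotate.py - hypothetical entry point for Heroku Scheduler, or the body
# of the handler a GAE cron entry would hit over HTTP.
import logging

def main():
    logging.basicConfig(level=logging.INFO)
    logging.info("starting annotation run")
    # connect to the remote database and annotate the data here

if __name__ == "__main__":
    main()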
I develop a distributed application based on RabbitMQ and multiple Python applications. The system is pretty complex, so it is very likely that we will need to update the deployed solution multiple times. The customer wants us to use his servers, which run Windows. So the question is how to deploy and update the Python part of this system. And as a sub-question: is it better to deploy sources, or to use PyInstaller to get executables and deploy those? On my test server I just use git pull when I have some changes, which is probably not an option for a production system.
I was in a similar position and I combine PyInstaller with Fabric. So I build a "compiled" version of the project and, with Fabric, I deploy it the way the client wants.
Fabric supports role definitions and several configurations for several clients.
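A rough sketch of the kind of fabfile I mean (Fabric 1.x style; the hostnames, paths, and service name are placeholders):

from fabric.api import env, put, run, roles, task

env.roledefs = {
    "staging": ["deploy@staging.example.com"],
    "production": ["deploy@prod1.example.com", "deploy@prod2.example.com"],
}

@task
@roles("production")
def deploy():
    # Upload the PyInstaller build and restart the Windows service
    put("dist/myapp.zip", "C:/apps/myapp.zip")
    run("powershell Expand-Archive -Force C:/apps/myapp.zip C:/apps/myapp")
    run("powershell Restart-Service MyAppService")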
I currently have a handful of small Python scripts on my laptop that are set to run every 1-15 minutes, depending on the script in question. They perform various tasks for me like checking for new data on a certain API, manipulating it, and then posting it to another service, etc.
I have a NAS/personal server (unRAID) and was thinking about moving the scripts to there via Docker, but since I'm relatively new to Docker I wasn't sure about the best approach.
Would it be correct to take something like the Phusion baseimage, which includes cron, package my scripts and crontab as dependencies of the image, and write the Dockerfile to initialize all of this? Or would it be a more canonical approach to modify the scripts so that they are threaded with recursive timers and just run each script individually in its own official Python image?
No dude, just install Python in the Docker container/image, move your scripts over, and run them as normal.
You may have to expose some ports or add a firewall exception, but your container can behave like a native Linux environment.
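If you'd rather avoid cron inside the container entirely, the "timers in the script" idea from your question can be as small as a loop like this (a sketch; the script names and intervals are placeholders):

import subprocess
import time

# script -> interval in seconds (placeholders)
JOBS = {
    "check_api.py": 60,
    "sync_service.py": 15 * 60,
}

def main():
    last_run = {script: 0.0 for script in JOBS}
    while True:
        now = time.time()
        for script, interval in JOBS.items():
            if now - last_run[script] >= interval:
                subprocess.run(["python", script], check=False)
                last_run[script] = now
        time.sleep(5)

if __name__ == "__main__":
    main()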
So I have a web service (flask + MySQL + celery) and I'm trying to figure out the proper way to deploy it on Elastic Beanstalk into separate Web Server and Worker environments/tiers. I currently have it working by launching the worker (using this answer) on the same instance as the web server, but obviously I want to have the worker(s) running in a separately auto-scaled environment. Note that the celery tasks rely on the main server code (e.g. making queries, etc) so they cannot be separated. Essentially it's an app with two entry points.
The only way I can think to do this is by having the code/config-script examine some env variable (e.g. ENV_TYPE = "worker" or "server") to determine whether to launch the standard flask app, or the celery worker.
The other caveat here is that I would have to "eb deploy" my code to two separate environments (server and worker), when I'd like/expect them to be deployed simultaneously since both use the same code base.
Apologies if this has been asked before, but I've looked around a lot and couldn't find anything, which I find surprising since this seems like a common use case.
Edit: Just found this answer, which addresses my concern for deploying twice (I guess it's technically deploy once and then update two environments, easily scriptable). But my question regarding how to bootstrap the application into server vs worker mode still stands.
Regarding the bootstrapping: if you set up an environment variable for an Elastic Beanstalk environment (docs here), then you never have to touch it again when you re-deploy your code with your script. You only need to add the environment variable when you create a new environment.
Thus when starting up, you can just check in Python for that ENV variable and then bootstrap from there and load what you need.
My preference, instead of creating an enum by specifying "worker" or "server", is to just use a boolean env variable like ENV_WORKER=1 or something. It removes the possibility of typos and is easier to read.
import os

if os.environ.get('ENV_WORKER') is not None:
    # Bootstrap worker stuff here
    pass
else:
    # Specific stuff for server here
    pass
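On the Elastic Beanstalk side you only set the variable once per environment, e.g. with the EB CLI something like eb setenv ENV_WORKER=1 against the worker environment, and your deploy script never needs to touch it again.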