Should Airflow and Jupyter be decoupled into 2 Docker containers - python

Airflow is used to schedule Python and Jupyter jobs.
The Python code relies on environment settings, directories, and installed Python and Linux packages.
Should Airflow be installed in a separate Docker container or in the same one?
If it runs in a separate container, how can the environment variables, directories, and installed packages be shared with Airflow?

Ideally yes, that way you can scale/restart Airflow and Jupyter independently of each other, without having to take down everything.
For environment variables and packages, you will need to set these in both containers. To avoid duplicating configuration, you can use e.g. a .env file so that you don't have to define the same variables twice. See e.g. https://docs.docker.com/compose/environment-variables/#using-the---env-file--option
Files can be shared through a shared volume. How to set this up depends on your container management system, e.g. Docker Compose or Kubernetes.
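For illustration, here is a minimal Docker Compose sketch that injects the same .env file into both containers and mounts a shared volume; the image tags, the volume name and the mount path are assumptions, not taken from the question.
# docker-compose.yml -- minimal sketch; image tags, volume name and
# mount path are assumptions
version: "3.8"
services:
  airflow:
    image: apache/airflow:2.7.3
    env_file: .env                  # same variables injected into both containers
    volumes:
      - shared-data:/opt/shared     # directories shared between the services
  jupyter:
    image: jupyter/base-notebook:latest
    env_file: .env
    volumes:
      - shared-data:/opt/shared
volumes:
  shared-data: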

Related

How can I develop locally when using the Iguazio platform?

I want to be able to test my jobs and code on my local machine before executing on a remote cluster. Ideally this will not require a lot of setup on my end. Is this possible?
Yes, this is possible. A common development pattern with the Iguazio platform is to utilize a local version of MLRun and Nuclio on a laptop/workstation and move/execute jobs on the cluster at a later point.
There are two main options for installing MLRun and Nuclio on a local environment:
docker-compose - Simpler and easier to get up and running, but restricted to running jobs within the environment it was started from (i.e. Jupyter or your IDE). This means you cannot specify resources like CPU/MEM/GPU for a particular job. This approach is great for quickly getting up and running. Instructions can be found here.
Kubernetes - More complex to get up and running, but allows running jobs in their own containers with specified CPU/MEM/GPU resources. This approach is better for emulating the capabilities of the Iguazio platform in a local environment. Instructions can be found here.
Once you have installed MLRun and Nuclio using one of the above options and have created a job/function you can test it locally as well as deploy to the Iguazio cluster directly from your local development environment:
To run your job locally, utilize the local=True flag when specifying your MLRun function like in the Quick-Start guide.
To run your job remotely, specify the required environment files to allow connectivity to the Iguazio cluster as described in this guide, and run your job with local=False.
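A minimal sketch of that local/remote pattern, assuming MLRun is installed; the file name "trainer.py", the "train" handler and the image are placeholders:
# Minimal sketch of the local-vs-remote pattern; names and image are assumptions
import mlrun

# Wrap an existing Python file as an MLRun job function
fn = mlrun.code_to_function(
    name="trainer",
    filename="trainer.py",   # hypothetical script containing a train() handler
    kind="job",
    image="mlrun/mlrun",
    handler="train",
)

# Test on the local machine (runs in the current process/environment)
local_run = fn.run(local=True)

# Execute on the remote cluster, assuming the environment files pointing at
# the Iguazio cluster have been configured as described in the guide
remote_run = fn.run(local=False)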

Why should a python google cloud function not contain a Pipfile?

According to the documentation here
Dependency specification using the Pipfile/Pipfile.lock standard is currently not supported. Your project should not include these files.
I use Pipfile to manage my dependencies and create a requirements.txt file through
pipenv lock --requirements
So far everything works and my gcloud function is up and running. So why should a Python Google Cloud Function not contain a Pipfile?
If it shouldn't contain one, what is the suggested way to manage an isolated environment?
When you deploy your function, it is deployed into its own environment. You don't have to manage several environments, because each Cloud Function deployment is dedicated to one and only one piece of code.
That's why a virtual environment is pointless in a single-purpose environment. You could use Cloud Run if you want to customize your build and runtime environment, but here again it's unnecessary: you won't have concurrent environments inside the same container, so it doesn't make sense.
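For illustration, the workflow the question describes looks roughly like this; the function name, runtime and entry point are assumptions:
# Freeze Pipfile.lock into the requirements.txt that Cloud Functions expects;
# only requirements.txt (not Pipfile/Pipfile.lock) ships with the source
pipenv lock --requirements > requirements.txt

# Deploy the function; name, runtime and entry point are placeholders
gcloud functions deploy my-function \
  --runtime python310 \
  --trigger-http \
  --entry-point main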

Python tasks and DAGs with different conda environments

Say that most of my DAGs and tasks in Airflow are supposed to run Python code on the same machine as the Airflow server.
Can I have different DAGs use different conda environments? If so, how should I do it? For example, can I use the PythonOperator for that? Or would that restrict me to the same conda environment that I used to install Airflow?
More generally, where/how should I ideally activate the desired conda environment for each DAG or task?
The Python interpreter that runs the Airflow worker code is the one whose environment will be used to execute your tasks.
What you can do is define separate named queues for separate execution environments on different workers, so that only a specific machine or group of machines executes a certain DAG.
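A minimal sketch of routing a task to such a queue, assuming a CeleryExecutor setup where a worker started with airflow celery worker --queues conda_env_a runs inside the desired conda environment; the queue name, DAG id and script path are assumptions:
# Minimal sketch; queue name, DAG id and script path are assumptions
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="conda_env_a_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Routed to the worker whose interpreter lives in the desired conda env,
    # e.g. a worker started with: airflow celery worker --queues conda_env_a
    run_script = BashOperator(
        task_id="run_in_env_a",
        bash_command="python /opt/jobs/task.py",
        queue="conda_env_a",
    )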

Should I activate my Python virtual environment before running my app in upstart?

I am working through the process of installing and configuring the Superset application. (A Flask app that allows real-time slicing and analysis of business data.)
When it comes to the Python virtual environment, I have read a number of articles and how-to guides and understand the concept of how it allows you to install packages into the virtual environment to keep things neatly contained for my application.
Now that I am preparing this application for (internal) production use, do I need to be activating the virtual environment before launching gunicorn in my upstart script? Or is the virtual environment more just for development and installing/updating packages for my application? (In which case I can just launch gunicorn without the extra step of activating the virtualenv.)
You should activate a virtualenv on the production server the same way as you do on the development machine. It allows you to run multiple Python applications on the same machine in a controlled environment. There's no need to worry that an update of packages in one virtualenv will cause an issue in another one.
If I may suggest something. I really enjoy using virtualenvwrapper to simplify the use of virtualenvs even more. It allows you to define hooks, e.g.: preactivate, postactivate, predeactivate and postdeactivate using the scripts in $VIRTUAL_ENV/bin/. It's a good place for setting up environmental variables that your Python application can utilize.
And a good and simple tool for process control is supervisord.
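If it helps, here is a hedged upstart sketch; the paths, port and gunicorn app target are assumptions, and calling gunicorn through the virtualenv's bin/ directory is equivalent to activating the environment first:
# /etc/init/superset.conf -- minimal upstart sketch; paths, port and the
# gunicorn app target are assumptions
description "Superset served by gunicorn from its virtualenv"
start on runlevel [2345]
stop on runlevel [!2345]
respawn

# Using the virtualenv's own gunicorn binary is equivalent to activating
# the environment before launching the app
exec /opt/superset/venv/bin/gunicorn --workers 4 --bind 0.0.0.0:8088 superset:app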

python conda deployment on server

Let's say I have two projects that I develop on my personal machine. I use conda to manage my Python dependencies and created environments for these projects. When I'm done with development, I want to export them to a remote machine that will run both projects regularly, at the same time. How should I manage this deployment?
After some research, I came up with this:
clone/export your environments as described in the conda documentation.
copy the exported environment file to the server along with your project.
import the environment into the server's conda installation.
create a bash script like this:
#!/bin/bash
source activate my_environment
python ~/my_project/src/code.py
set up cron as usual to call the bash script above.
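As a hedged end-to-end sketch of those steps; the environment name, paths and cron schedule are assumptions:
# On the development machine: export the environment definition
conda env export --name my_environment > my_environment.yml

# On the server: recreate the environment from the exported file
conda env create --file my_environment.yml

# Schedule the wrapper script via cron, e.g. daily at 02:00
(crontab -l 2>/dev/null; echo "0 2 * * * /home/user/my_project/run_my_project.sh") | crontab -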
