Value of Airflow variable becomes invalid on restarting Docker container - python

There is a long list of DAGs and associated Airflow variables on a remote Airflow instance, a copy of which is running on my local system. All the variables from the remote Airflow instance are imported into my local Airflow instance.
I have installed an Airflow image with Docker and then started the container. Everything works fine and I can access the Airflow UI from my local system.
Problem:
Whenever I restart the Airflow container, all the variables that were imported during the previous container run become invalid.
Work Around
Import the variables again to fix the variable related error.
However, it's really frustrating to import the variables every time the container starts. There must be a smarter way of achieving this. Please help me understand what I am doing wrong.

A new encryption key is generated when the Docker container is restarted.
To ensure that the same encryption key is used, you will have to either hardcode a FERNET_KEY in the config file or pass an environment variable when the container is first run:
docker run -it -p 8888:8080 -v D:\dev\Dags:/usr/local/airflow/dags -e FERNET_KEY=81HqDtbqAywKSOumSha3BhWNOdQ26slT6K0YaZeZyPs= --name my_airflow_dags airflow_image
The Fernet key can be any valid Fernet key (32 url-safe base64-encoded bytes). Once it is provided this way, the same key is reused every time the container is restarted.
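If you need a valid key to put into FERNET_KEY, one can be generated with the cryptography package (which Airflow itself depends on); a minimal sketch:
# Prints a url-safe base64-encoded 32-byte key suitable for FERNET_KEY.
from cryptography.fernet import Fernet

print(Fernet.generate_key().decode())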

The root cause of this problem is Airflow's encryption mechanism for the key-value Variables.
When you import your variables manually, their is_encrypted attributes are automatically set to True.
Whenever you restart the container, a new encryption key is generated, so the previously encrypted values become invalid.
You have 3 options:
Set the fernet_key explicitly in airflow.cfg
Set the AIRFLOW__CORE__FERNET_KEY Environment Variable in docker-compose.yml
Set the is_encrypted attribute to False (via the Admin UI, the CLI, an UPDATE SQL query, ...)
I personally chose the second one, so my docker-compose.yml file looks like this:
environment:
  - LOAD_EX=n
  - EXECUTOR=Local
  # Do not wrap the key in quotes here; they would become part of the value.
  - AIRFLOW__CORE__FERNET_KEY=81HqDtbqAywKSOumSha3BhWNOdQ26slT6K0YaZeZyPs=
thanks to wittfabian

Related

How can a Python app find out that it's running inside a Kubernetes pod?

I have a Python script that should behave slightly differently if it's running inside a Kubernetes pod.
But how can I find out whether I'm running inside Kubernetes or not?
An easy way I use (and it's not Python specific) is to check whether kubernetes.default.svc.cluster.local resolves to an IP address (you don't need to try to access it, just see whether it resolves successfully).
If it does, the script/program is running inside a cluster. If it doesn't, proceed on the assumption that it's not running inside a cluster.
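A minimal sketch of that check in Python, assuming the standard-library socket module is enough for the DNS lookup (the helper name running_in_kubernetes is just for illustration):
import socket

def running_in_kubernetes() -> bool:
    # The cluster-internal DNS name only resolves from inside a cluster.
    try:
        socket.gethostbyname("kubernetes.default.svc.cluster.local")
        return True
    except socket.gaierror:
        return False

print("in cluster" if running_in_kubernetes() else "not in cluster")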
You can create and set an environment variable to some value when deploying your script to the cluster, then in your script you just check if the environment variable is present and set.
THE ENV FILE DEPLOYED TO CLUSTER:
KUBERNETES_POD=TRUE
SCRIPT:
import os
from dotenv import load_dotenv  # needed for load_dotenv(); provided by the python-dotenv package

load_dotenv()
is_kube = os.getenv('KUBERNETES_POD', default=False)
if is_kube:
    print('in cluster')
Then just make sure you do not have the environment variable set locally, so that is_kube falls back to False.
If you are running a Django application, you can get the host name from the request parameter passed to each view with the get_host method:
def function_view(request):
    host = request.get_host()
    if host == 'your-cluster_ip.com':
        print('in cluster')

Docker with Python and PostgreSQL

I have some questions about Docker. I have very little knowledge about it, so kindly bear with me.
I have a Python script that does something and writes into a PostgreSQL DB. Both run on Docker. Python uses python:3.8.8-buster and PostgreSQL uses postgres:13. Using docker-compose up, I am able to start both these services, and I see the items inserted into the PostgreSQL table. When I run docker-compose down, the services shut down as expected. Here are the questions I have:
When I run the container of the PostgreSQL service by itself (not using docker-compose up, but docker run and then docker exec) and then log into the database using psql, it doesn't use the DB name mentioned in the docker-compose.yml file. It uses localhost instead, but with the username from the docker-compose.yml file. It also doesn't ask me for the password, although it's mentioned in the Dockerfile itself (not in docker-compose.yml; for each of the services I have a Dockerfile that I build in the docker-compose.yml). Is that expected? If so, why?
After I've logged in, when I run SELECT * FROM DB_NAME; it displays 0 records. So basically it doesn't display the records written to the DB in the previous run. Why is that? How can I see the contents of the DB when it's not up? When the container is running (when I docker-compose up), I know I can see the records from pgAdmin (which, by the way, is also part of my docker-compose.yml file, and I have it only to make it easier to see whether the records have been written to the DB).
So after my script runs and writes into the DB, it stops. Is there a way to restart it without docker-compose down and then docker-compose up? (In VS Code) when I simply run the script while docker-compose is still up, it says it cannot find the DB (the one mentioned in the docker-compose.yml file). So I have to go back and change the DB name in the script to point to localhost; this circles back to question #1.
I am new to docker, and I am trying my best to wrap my head around all this.
This behavior depends on your specific setup. I will have to see the Dockerfile(s) and docker-compose.yaml in order to give a suitable answer.
This is probably caused by mounting an anonymous volume to your postgres service instead of a named volume. An anonymous volume is not reattached to the new container that docker-compose up creates, whereas a named volume is.
Here's a docker-compose.yaml example of how to mount a named volume called database:
version: '3.8'

# Defining the named volume
volumes:
  database:

services:
  database:
    image: 'postgres:latest'
    restart: 'always'
    environment:
      POSTGRES_USER: 'admin'
      POSTGRES_PASSWORD: 'admin'
      POSTGRES_DB: 'app'
    volumes:
      # Mounting the named volume
      - 'database:/var/lib/postgresql/data/'
    ports:
      - '5432:5432'
I assume this depends more on the contents of your script than on the way you configured your Docker postgres service. Postgres does not shut down after simply having data written to it. But again, I would have to see the Dockerfile(s), the docker-compose.yaml, and the script in order to provide a more suitable answer.
If you docker run an image, it always creates a new container, and it never looks at the docker-compose.yml. If you, for example, run
docker run --name postgres postgres
docker exec -it postgres ...
that starts a new container based on the postgres:latest image, with no particular storage or networking setup. That's why you can't use the Compose host name of the container or see any of the data that your Compose setup would normally have.
You can use docker-compose up to start a specific service and its dependencies, though:
docker-compose up -d postgres
Once you do this, you can use ordinary tools like psql to connect to the database through its published ports:
psql -h localhost -p 5433 my_db
You should not normally need debugging tools like docker exec; if you do, there is a Compose variant that knows about the Compose service names:
# For debugging use only -- not the primary way to interact with the DB
docker-compose exec postgres psql my_db
After my script runs, and it writes into the db, it stops. Is there a way to restart it?
Several options:
Make your script not stop, in whatever way. Frequently a Docker container will have something like an HTTP service that can accept requests and act on them.
Re-running docker-compose up -d (without explicitly down first) will restart anything that's stopped or anything whose Compose configuration has changed.
You can run a one-off script directly on the host, with configuration pointing at your database's published ports.
It's relevant here that "in the Compose environment" and "directly on a developer system" are different environments, and you will need a mechanism like environment variables to communicate the difference. In the Compose environment you will need the database container's name and the default PostgreSQL port 5432; on a developer system you will typically use localhost as the host name and whichever port has been published. You cannot hard-code this configuration in your application.
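As a minimal sketch of that idea, assuming psycopg2 is installed and using hypothetical variable names (DB_HOST, DB_PORT, and so on) that are not part of the original setup; the defaults mirror the compose example above:
import os
import psycopg2  # assumes the psycopg2-binary package is available

# Inside the Compose network you would set DB_HOST to the service name
# (e.g. "database"); on the developer machine, set DB_PORT to whichever
# port is published and let the localhost default apply.
conn = psycopg2.connect(
    host=os.getenv("DB_HOST", "localhost"),
    port=int(os.getenv("DB_PORT", "5432")),
    dbname=os.getenv("DB_NAME", "app"),
    user=os.getenv("DB_USER", "admin"),
    password=os.getenv("DB_PASSWORD", "admin"),
)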

Dask Worker can't find module on AWS

So I have a distributed setup of Dask with a scheduler running in one container and a worker running in another. I have a similar setup on AWS, where the scheduler is running on an EC2 instance and the worker is in a Docker container on another EC2 machine.
I want a Python file to be available to the worker. I don't want to install it as a dependency of the worker just yet, but want to manually copy the file to the worker so that it's available in the Python environment that the worker uses. To achieve this, I added this to the Dockerfile:
# syntax=docker/dockerfile:experimental
FROM daskdev/dask:2020.12.0
WORKDIR /src/
COPY ./python_file.py /src/python_file.py
Basically I want the dask worker to be able to run a method inside python_file.py. So I submit the method like:
client.submit(python_file.some_method, arg1, arg2)
This works fine on my local setup of dask and the worker is able to deserialize this call and run the method. Somehow this doesn't work on the AWS setup. The worker keeps complaining:
ModuleNotFoundError: No module named 'python_file'
To debug:
I logged into the EC2 machine and I see that the container is alive.
I entered the container and I see that the file also exists where I want it to (exactly like my local).
I ran python and tried importing the module and that works too.
I ran pickle.loads(b'\x80\x04\x95#\x00\x00\x00\x00\x00\x00\x00\x8c\python_file\x94\x8c\x0esome_method\x94\x93\x94.') and that returns the desired method.
If all this works, what else could be the reason that the worker still complains of the module not existing? Has anybody else faced something similar?
This sounds like a PYTHONPATH issue - on your local machine the python file is probably present in the current working directory (which is on your PYTHONPATH). Can you confirm that src is on your PYTHONPATH? If not, I would add that as an ENV in your docker image.
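One quick way to check from inside the running worker container (a diagnostic sketch, not Dask-specific; the /src path is taken from the Dockerfile above):
import os
import sys

# Show what the worker's interpreter will actually search when importing.
print("PYTHONPATH:", os.environ.get("PYTHONPATH", "<not set>"))
print("/src on sys.path:", "/src" in sys.path)
print("cwd:", os.getcwd())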

How do I solve the security problems caused by pg_cron?

I am using pg_cron to schedule a task which should be repeated every hour.
I have installed it and am using it inside a Docker environment, inside the Postgres container.
And I am calling the query to create this job using Python from a different container.
I can see that the job is created successfully but is not being executed, due to lack of permission, since pg_hba.conf is not set to trust and there is no .pgpass file.
But if I enable either of those, anyone can enter the database by using docker exec and running psql in the container.
Is there any way to avoid this security issue? In a production environment it should not be possible for anyone to enter the database without a password.
Either keep people from running docker exec on the container, or use something other than pg_cron.
I would feel nervous if random people were allowed to run docker exec on the container with my database or my job scheduler in it.

Openshift 2.1 cannot set OPENSHIFT_PYTHON_WSGI_APPLICATION using action hooks

I am trying to deploy a Django app on OpenShift (Python 3.3, Django 1.7, OpenShift 2.1).
I need to set the OPENSHIFT_PYTHON_WSGI_APPLICATION to point to an alternative wsgi.py location.
I have tried using the pre_build script to set the variable, using the following commands:
export OPENSHIFT_PYTHON_WSGI_APPLICATION="$OPENSHIFT_REPO_DIR"geartest4/wsgi.py
echo "-------> $OPENSHIFT_PYTHON_WSGI_APPLICATION"
I can see during the git push that the pre_build script sets the variable correctly. The echo shows the correct path as expected. However, wsgi.py does not launch and I get:
CLIENT_ERROR: WSGI application was not found
When I immediately SSH into the gear and check the environment variable, I see OPENSHIFT_PYTHON_WSGI_APPLICATION="", i.e. the variable is not set.
If I set the variable manually from my workstation using rhc set-env OPENSHIFT_PYTHON_WSGI_APPLICATION=/var/lib/openshift/gear_name/bla/bla, then the variable sticks, the WSGI server launches, and the app works fine.
The problem is that I don't want to use rhc set-env, because that means I have to hardwire the gear name into the path. This becomes a problem when I want to scale with multiple gears.
Does anyone have any ideas on how to set the variable and make it stick?
The environment variable OPENSHIFT_PYTHON_WSGI_APPLICATION can be set to a relative path like this:
rhc env set OPENSHIFT_PYTHON_WSGI_APPLICATION=wsgi/wsgi.py
The openshift cartridge openshift-django17 by jfmatth uses this approach, too.
