I'm writing an Airflow DAG using the KubernetesPodOperator. A Python process running in the container must open a file with sensitive data:
with open('credentials/jira_credentials.json', 'r') as f:
    creds = json.load(f)
and a CloudStorage client must be authenticated:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "credentials/cloud_storage_credentials.json"
According to security best practices, I don't package a container image with sensitive data. Instead I use Kubernetes Secrets. Using the Python API for Kubernetes, I'm trying to mount them as a volume, but with no success. The credentials/ directory exists in the container but it's empty. What should I do to make the files jira_credentials.json and cloud_storage_credentials.json accessible in the container?
My DAG's code:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.kubernetes.secret import Secret
from airflow.kubernetes.volume import Volume
from airflow.kubernetes.volume_mount import VolumeMount
from airflow.operators.dummy_operator import DummyOperator
from kubernetes.client import models as k8s
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.utcnow(),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retry_delay': timedelta(minutes=5)
}
volume = Volume(name="volume-credentials", configs={})
volume_mnt = VolumeMount(mount_path="/credentials", name="volume-credentials", sub_path="", read_only=True)
secret_jira_user = Secret(deploy_type="volume",
deploy_target="/credentials",
secret="jira-user-secret",
key="jira_credentials.json")
secret_storage_credentials = Secret(deploy_type="volume",
deploy_target="/credentials",
secret="jira-trans-projects-cloud-storage-creds",
key="cloud_storage_credentials.json")
dag = DAG(
dag_id="jira_translations_project",
schedule_interval="0 1 * * MON",
start_date=datetime(2021, 9, 5, 0, 0, 0),
max_active_runs=1,
default_args=default_args
)
start = DummyOperator(task_id='START', dag=dag)
passing = KubernetesPodOperator(namespace='default',
image="eu.gcr.io/data-engineering/jira_downloader:v0.18",
cmds=["/usr/local/bin/run_process.sh"],
name="jira-translation-projects-01",
task_id="jira-translation-projects-01",
get_logs=True,
dag=dag,
volumes=[volume],
volume_mounts=[volume_mnt],
secrets=[
secret_jira_user,
secret_storage_credentials],
env_vars={'MIGRATION_DATETIME': '2021-01-02T03:04:05'},
)
start >> passing
According to this example, Secret is a special class that handles creating volume mounts automatically. Looking at your code, it seems that your own volume mounted at /credentials is overriding the /credentials mount created by the Secret objects, and because you provide an empty configs={}, that mount is empty as well.
Try supplying just secrets=[secret_jira_user, secret_storage_credentials] and removing the manual volumes and volume_mounts arguments.
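For illustration, here is a minimal sketch of the operator with the manual volume and volume mount removed, reusing the Secret objects and image from your DAG; the assumption is that each Secret then surfaces its key as a file under /credentials:

passing = KubernetesPodOperator(
    namespace='default',
    image="eu.gcr.io/data-engineering/jira_downloader:v0.18",
    cmds=["/usr/local/bin/run_process.sh"],
    name="jira-translation-projects-01",
    task_id="jira-translation-projects-01",
    get_logs=True,
    dag=dag,
    # No volumes/volume_mounts here: the Secret objects defined above create
    # the secret volume and mount jira_credentials.json and
    # cloud_storage_credentials.json under /credentials.
    secrets=[secret_jira_user, secret_storage_credentials],
    env_vars={'MIGRATION_DATETIME': '2021-01-02T03:04:05'},
)

If the pod is then rejected because two secret volumes target the same mount path, a possible workaround is to store both JSON files as keys of a single Kubernetes Secret (or to give each Secret a distinct deploy_target).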
See also: the code that generates secret volume mounts under the hood.
I am new to Airflow. I have a simple Python script, my_python_script.py, located inside a GCP bucket. I would like to trigger this Python script with program arguments using Airflow.
My Airflow Python code looks somewhat like this:
import pytz
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from pytz import timezone
from helper import help
import pendulum
config = help.loadJSON("batch/path/to/json")
executor_config = config["executor"]
common_task_args = {
'owner': 'my_name',
'depends_on_past': False,
'email': config["mailingList"],
'email_on_failure': True,
'start_date': pendulum.datetime(2022, 4, 10, tz=time_zone),
'executor_config': executor_config,
'gcp_conn_id': config["connectionId"],
'project_id': config["projectId"],
'location': config["location"]
}
dag = DAG('my_dag',
default_args=common_task_args,
is_paused_upon_creation=True,
catchup=False,
schedule_interval=None)
simple_python_task = {
"reference": {"project_id": config["projectId"]},
"placement": {"cluster_name": config["clusterName"]},
<TODO: initialise my_python_script.py script located on GCP bucket with program arguments>
}
job_to_be_triggered = DataprocSubmitJobOperator(
task_id="simple_python_task",
job=simple_python_task,
dag=dag
)
job_to_be_triggered
What am I supposed to do in the TODO section of the code snippet above? The idea is to trigger the my_python_script.py script located in a GCP Dataproc bucket [gs://path/to/python/script.py] with program arguments [passed to the Python script].
PS: It is important for the Python script to be on GCP.
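For reference, a minimal sketch of what the TODO block might look like, assuming the script is submitted as a PySpark job through the Dataproc Jobs API; the GCS URI and the arguments below are placeholders:

simple_python_task = {
    "reference": {"project_id": config["projectId"]},
    "placement": {"cluster_name": config["clusterName"]},
    # Assumption: a PySpark job; point main_python_file_uri at your script in
    # GCS and pass the program arguments through "args".
    "pyspark_job": {
        "main_python_file_uri": "gs://path/to/python/script.py",
        "args": ["--my_arg", "my_value"],
    },
}

The same dict is then passed unchanged to DataprocSubmitJobOperator via its job parameter.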
I am completely new to Apache Airflow and I have a situation. The code I have used is:
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
default_args = {
'owner': 'john',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'tutorial',
default_args = default_args,
description='A simple tutorial DAG',
schedule_interval=None)
bash_tutorial = """
echo "Execute shell file: /A/B/server/tutorial.ksh"
echo "{{macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y"}}"
source /home/johnbs/.profile
/A/B/server/tutorial.ksh
"""
t1 = SSHOperator(
    ssh_conn_id='dev',
    task_id='tutorial.ksh',
    command=bash_tutorial,
    dag=dag
)
Using Airflow, I want to trigger the ksh script on different servers, e.g. the dev and test servers:
tutorial.ksh is present on the dev server (conn_id 'dev') at the path /A/B/C/tutorial.ksh and on the test server (conn_id 'test') at the path /A/B/D/tutorial.ksh. Note the C folder on dev and the D folder on test. Which part of the code should I update?
Each instance of SSHOperator executes a command on a single server.
You will need to define each connection separately as explained in the docs and then you can do:
from airflow.operators.dummy_operator import DummyOperator  # needed for the start task

server_connection_ids = ['dev', 'test']
start_op = DummyOperator(task_id="start_task", dag=dag)

for conn in server_connection_ids:
    # Escape the Jinja braces ({{{{ ... }}}}) inside the f-string so the rendered
    # command still contains {{ ... }} for Airflow's templating, while {conn}
    # is filled in per server.
    bash_tutorial = f"""
echo "Execute shell file: /A/B/{conn}/tutorial.ksh"
echo "{{{{macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y")}}}}"
source /home/johnbs/.profile
/A/B/{conn}/tutorial.ksh
"""
    ssh_op = SSHOperator(
        ssh_conn_id=f'{conn}',
        task_id=f'ssh_{conn}_task',
        command=bash_tutorial,
        dag=dag
    )
    start_op >> ssh_op
This will create a task per server.
I'm trying to make a DAG that has two operators that are created dynamically, depending on the number of "pipelines" in a JSON config file. This file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and a task called success at the end that sends a message to Slack saying that the DAG has finished. The other two tasks, which are Python operators, are generated dynamically and run in parallel. I'm using Composer; when I put the DAG in the bucket it appears in the webserver UI, but when I click to see the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.'. If I test the tasks directly via the CLI on the Kubernetes cluster, it works! But I can't make it appear in the web UI. As some people here on SO suggested, I tried to restart the webserver by installing a Python package; I tried three times but without success. Does anyone know what this can be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
'owner':'lucas',
'start_date': '2021-01-10',
'email': ['xxxx'],
'email_on_failure': False,
'email_on_success': False,
'retries': 3,
'retry_delay': timedelta(minutes=2),
'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args=default_args, schedule_interval="30 23 * * *", template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:

    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )

    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )

    for pipeline, args in dag_datafusion_args.items():
        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )

        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )

        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components, so the webserver UI doesn't have complete access to your project's resources, because it doesn't run inside your project's environment, and it can't read the config JSON file via return_datafusion_config_file. The best workaround is to expose that configuration through an environment variable instead of a file.
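For example, a minimal sketch of that workaround, assuming the JSON configuration is exposed through a hypothetical DATAFUSION_CONFIG environment variable set on the Composer environment:

import json
import os

# Assumption: the pipeline configuration JSON is stored in an environment
# variable (hypothetical name DATAFUSION_CONFIG) rather than read from a file,
# so the webserver can parse the DAG without access to your project's storage.
dag_datafusion_args = json.loads(os.environ.get('DATAFUSION_CONFIG', '{}'))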
I am new to Python and Airflow DAGs.
I am following the link below and the code mentioned in its answer section:
How to pass dynamic arguments Airflow operator?
I am facing an issue reading a YAML file; the YAML file contains some configuration-related arguments:
configs:
  cluster_name: "test-cluster"
  project_id: "t***********"
  zone: "europe-west1-c"
  num_workers: 2
  worker_machine_type: "n1-standard-1"
  master_machine_type: "n1-standard-1"
In the DAG script I have created one task which will create the cluster. Before executing this task we need all the arguments that have to be passed to its default_args parameter, like cluster_name, project_id, etc. For reading those parameters I have created a readYML method; see the code below.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from zipfile import ZipFile
from airflow.contrib.operators import dataproc_operator
from airflow.models import Variable
import yaml
def readYML():
    print("inside readYML")
    global cfg
    file_name = "/home/airflow/gcs/data/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        cfg = yaml.load(ymlfile)
    print(cfg['configs']['cluster_name'])
# Default Arguments
readYML()
dag_name = Variable.get("dag_name")
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.now(),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5),
#'cluster_name': cfg['configs']['cluster_name'],
}
# Instantiate a DAG
dag = DAG(dag_id='read_yml', default_args=default_args,
schedule_interval=timedelta(days=1))
# Creating Tasks
Task1 = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_cluster',
    dag=dag
)
There is no error in this code. When I upload it to the GCP Composer environment, no error notification is shown, but the DAG is not runnable: there is no Run button.
See attached screen shot.
I am using python 3 & airflow composer-1.7.2-airflow-1.10.2 version.
According to the Data Stored in Cloud Storage page in the Cloud Composer docs:
To avoid a webserver error, make sure that data the webserver needs to parse a DAG (not run) is available in the dags/ folder. Otherwise, the webserver can't access the data or load the Airflow web interface.
Your DAG is attempting to open the YAML file under /home/airflow/gcs/data, which isn't present on the webserver. Put the file under the dags/ folder in your GCS bucket, and it will be accessible to the scheduler, workers, and webserver, and the DAG will work in the Web UI.
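For example, a minimal sketch of readYML with the adjusted path, assuming cluster_config.yml has been moved into the dags/ folder of the bucket (which Composer mounts at /home/airflow/gcs/dags):

import yaml

def readYML():
    global cfg
    # Assumption: the file now lives in the dags/ folder, which is synced to
    # the scheduler, workers, and webserver alike.
    file_name = "/home/airflow/gcs/dags/cluster_config.yml"
    with open(file_name, 'r') as ymlfile:
        cfg = yaml.load(ymlfile)
    print(cfg['configs']['cluster_name'])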
Below is the Airflow DAG code. It runs perfectly both when Airflow is hosted locally and on Cloud Composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer as linked in this question. My problem is similar.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys
#copy this package to dag directory in GCP composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap
#change these paths to point to GCP Composer data directory
## cluster config
clusterConfig= loadSchema("somePath/jobConfig/cluster.yaml","cluster")
##per job yaml config
autoLoanCsvToParquetConfig= loadSchema("somePath/jobConfig/job.yaml","job")
default_args= {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2019, 1, 1),
'retries': 1,
'retry_delay': timedelta(minutes=3)
}
dag= DAG('usr_job', default_args=default_args, schedule_interval=None)
t1= DummyOperator(task_id= "start", dag=dag)
t2= DataprocClusterCreateOperator(
task_id= "CreateCluster",
cluster_name= clusterConfig["cluster"]["cluster_name"],
project_id= clusterConfig["project_id"],
num_workers= clusterConfig["cluster"]["worker_config"]["num_instances"],
image_version= clusterConfig["cluster"]["dataproc_img"],
master_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
worker_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
zone= clusterConfig["region"],
dag=dag
)
t3= DataProcSparkOperator(
task_id= "csvToParquet",
main_class= autoLoanCsvToParquetConfig["job"]["main_class"],
arguments= autoLoanCsvToParquetConfig["job"]["args"],
cluster_name= clusterConfig["cluster"]["cluster_name"],
dataproc_spark_jars= autoLoanCsvToParquetConfig["job"]["jarPath"],
dataproc_spark_properties= sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
dag=dag
)
t4= DataprocClusterDeleteOperator(
task_id= "deleteCluster",
cluster_name= clusterConfig["cluster"]["cluster_name"],
project_id= clusterConfig["project_id"],
dag= dag
)
t5= DummyOperator(task_id= "stop", dag=dag)
t1>>t2>>t3>>t4>>t5
The UI gives this error - "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, I could see from the log files that it ran successfully.
The issue was with the path that was being provided for picking up the configuration files. I was pointing at the data folder in GCS. As per the Google documentation, only the dags folder is synced to all nodes, not the data folder.
Needless to say, it was an issue encountered at DAG parsing time, hence it did not appear correctly in the UI. More interestingly, these debug messages were not exposed in Composer 1.5 and earlier; now they are available to the end user to help with debugging. Thanks anyway to everyone who helped.
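For reference, a minimal sketch of the corrected config loading, assuming the YAML files are copied next to the DAG file inside the dags/ folder (the jobConfig/ subdirectory is illustrative):

import os
from schemas.schemaValidator import loadSchema

# Assumption: jobConfig/cluster.yaml and jobConfig/job.yaml sit next to this
# DAG file inside the dags/ folder, which is the only folder synced to every
# Composer component.
DAG_DIR = os.path.dirname(os.path.abspath(__file__))

clusterConfig = loadSchema(os.path.join(DAG_DIR, "jobConfig", "cluster.yaml"), "cluster")
autoLoanCsvToParquetConfig = loadSchema(os.path.join(DAG_DIR, "jobConfig", "job.yaml"), "job")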