I am completely new to Apache Airflow. I have a situation. The code I have used is
import airflow
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.contrib.hooks.ssh_hook import SSHHook
default_args = {
'owner': 'john',
'depends_on_past': False,
'email': [''],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'tutorial',
default_args = default_args,
description='A simple tutorial DAG',
schedule_interval=None)
bash_tutorial = """
echo "Execute shell file: /A/B/server/tutorial.ksh"
echo "{{macros.ds_format(ds, "%Y-%m-%d", "%m-%d-%Y"}}"
source /home/johnbs/.profile
/A/B/server/tutorial.ksh
"""
t1 = SSHOperator(
    ssh_conn_id='dev',
    task_id='tutorial.ksh',
    command=bash_tutorial,
    dag=dag
)
Using Airflow, I want to trigger the ksh script on different servers, such as the dev and test servers, i.e.
tutorial.ksh is present on the dev server (conn_id 'dev') at the path /A/B/C/tutorial.ksh and on the test server (conn_id 'test') at the path /A/B/D/tutorial.ksh. Note the C folder on dev versus the D folder on test. Which part of the code should I update?
Each instance of SSHOperator executes a command on a single server.
You will need to define each connection separately, as explained in the docs, and then you can do:
from airflow.operators.dummy_operator import DummyOperator

server_connection_ids = ['dev', 'test']

start_op = DummyOperator(task_id="start_task", dag=dag)

for conn in server_connection_ids:
    bash_tutorial = f"""
echo "Execute shell file: /A/B/{conn}/tutorial.ksh"
echo "{{{{ macros.ds_format(ds, '%Y-%m-%d', '%m-%d-%Y') }}}}"
source /home/johnbs/.profile
/A/B/{conn}/tutorial.ksh
"""
    ssh_op = SSHOperator(
        ssh_conn_id=f'{conn}',
        task_id=f'ssh_{conn}_task',
        command=bash_tutorial,
        dag=dag
    )
    start_op >> ssh_op
This will create one SSH task per server, each running downstream of start_task.
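As for defining the 'dev' and 'test' connections themselves, one option (besides the Airflow UI or CLI) is an AIRFLOW_CONN_<CONN_ID> environment variable per connection. A minimal sketch, with placeholder hosts and credentials:
import os

# Airflow resolves an ssh_conn_id from an AIRFLOW_CONN_<CONN_ID> environment variable
# (normally exported in the scheduler/worker environment rather than set in code).
# The hosts, user and password below are placeholders.
os.environ["AIRFLOW_CONN_DEV"] = "ssh://john:secret@dev-host:22"
os.environ["AIRFLOW_CONN_TEST"] = "ssh://john:secret@test-host:22"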
Related
I am new to Airflow. I have a simple Python script my_python_script.py located inside a GCP bucket. I would like to trigger this Python script with program arguments using Airflow.
My Airflow code looks somewhat like this:
import pytz
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
from pytz import timezone
from helper import help
import pendulum
config = help.loadJSON("batch/path/to/json")
executor_config = config["executor"]
common_task_args = {
'owner': 'my_name',
'depends_on_past': False,
'email': config["mailingList"],
'email_on_failure': True,
'start_date': pendulum.datetime(2022, 4, 10, tz=time_zone),
'executor_config': executor_config,
'gcp_conn_id': config["connectionId"],
'project_id': config["projectId"],
'location': config["location"]
}
dag = DAG('my_dag',
default_args=common_task_args,
is_paused_upon_creation=True,
catchup=False,
schedule_interval=None)
simple_python_task = {
"reference": {"project_id": config["projectId"]},
"placement": {"cluster_name": config["clusterName"]},
<TODO: initialise my_python_script.py script located on GCP bucket with program arguments>
}
job_to_be_triggered = DataprocSubmitJobOperator(
task_id="simple_python_task",
job=simple_python_task,
dag=dag
)
job_to_be_triggered
What am I supposed to do in the TODO section of the code snippet above? The idea is to trigger the my_python_script.py script located in a GCP Dataproc bucket [gs://path/to/python/script.py] with program arguments [to the Python script].
PS: It is important for the python script to be on GCP.
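For reference, DataprocSubmitJobOperator takes a Dataproc job dict, and a PySpark-style job typically points at the GCS URI of the script via main_python_file_uri and passes program arguments via args. A minimal sketch of what the job could look like; the URI and arguments below are placeholders:
simple_python_task = {
    "reference": {"project_id": config["projectId"]},
    "placement": {"cluster_name": config["clusterName"]},
    # The script stays on GCS; program arguments are passed as a list of strings
    "pyspark_job": {
        "main_python_file_uri": "gs://path/to/python/script.py",
        "args": ["--my_arg", "my_value"],  # placeholder program arguments
    },
}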
I'm writing an Airflow DAG using the KubernetesPodOperator. A Python process running in the container must open a file with sensitive data:
with open('credentials/jira_credentials.json', 'r') as f:
creds = json.load(f)
and a Cloud Storage client must be authenticated:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "credentials/cloud_storage_credentials.json"
According to best security practices, I don't package a container image with sensitive data. Instead I use Kubernetes Secrets. Using the Python API for Kubernetes, I'm trying to mount them as a volume, but with no success. The credentials/ directory exists in the container but it's empty. What should I do to make the files jira_credentials.json and cloud_storage_credentials.json accessible in the container?
My DAG's code:
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.kubernetes.secret import Secret
from airflow.kubernetes.volume import Volume
from airflow.kubernetes.volume_mount import VolumeMount
from airflow.operators.dummy_operator import DummyOperator
from kubernetes.client import models as k8s
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime.utcnow(),
'email': ['airflow@example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retry_delay': timedelta(minutes=5)
}
volume = Volume(name="volume-credentials", configs={})
volume_mnt = VolumeMount(mount_path="/credentials", name="volume-credentials", sub_path="", read_only=True)
secret_jira_user = Secret(deploy_type="volume",
deploy_target="/credentials",
secret="jira-user-secret",
key="jira_credentials.json")
secret_storage_credentials = Secret(deploy_type="volume",
deploy_target="/credentials",
secret="jira-trans-projects-cloud-storage-creds",
key="cloud_storage_credentials.json")
dag = DAG(
dag_id="jira_translations_project",
schedule_interval="0 1 * * MON",
start_date=datetime(2021, 9, 5, 0, 0, 0),
max_active_runs=1,
default_args=default_args
)
start = DummyOperator(task_id='START', dag=dag)
passing = KubernetesPodOperator(namespace='default',
image="eu.gcr.io/data-engineering/jira_downloader:v0.18",
cmds=["/usr/local/bin/run_process.sh"],
name="jira-translation-projects-01",
task_id="jira-translation-projects-01",
get_logs=True,
dag=dag,
volumes=[volume],
volume_mounts=[volume_mnt],
secrets=[
secret_jira_user,
secret_storage_credentials],
env_vars={'MIGRATION_DATETIME': '2021-01-02T03:04:05'},
)
start >> passing
According to this example, Secret is a special class that handles creating volume mounts automatically. Looking at your code, it seems that your own volume mounted at /credentials is overriding the /credentials mount created by Secret, and because you provide an empty configs={}, that mount is empty as well.
Try supplying just secrets=[secret_jira_user, secret_storage_credentials] and removing the manual volume and volume_mounts arguments.
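For illustration, that would leave the operator from the DAG above looking roughly like this:
passing = KubernetesPodOperator(
    namespace='default',
    image="eu.gcr.io/data-engineering/jira_downloader:v0.18",
    cmds=["/usr/local/bin/run_process.sh"],
    name="jira-translation-projects-01",
    task_id="jira-translation-projects-01",
    get_logs=True,
    dag=dag,
    # no volumes / volume_mounts here: each Secret below creates its own
    # volume mount for its key under /credentials
    secrets=[secret_jira_user, secret_storage_credentials],
    env_vars={'MIGRATION_DATETIME': '2021-01-02T03:04:05'},
)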
Code that generates secret volume mounts under the hood
I am working on an Airflow POC and have written a DAG which runs a script over SSH on one server. It sends an alert if the script fails to execute, but if the script executes and a task inside the script fails, it does not send any mail.
Example: I'm running a script which takes a backup of a DB2 database. If the instance is down and unable to take the backup, the backup command fails, but we do not get any alert because the script itself executed successfully.
from airflow.models import DAG, Variable
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime, timedelta
import airflow
import os
import logging
import airflow.settings
from airflow.utils.dates import days_ago
from airflow.operators.email_operator import EmailOperator
from airflow.models import TaskInstance
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
START_DATE = airflow.utils.dates.days_ago(1)
SCHEDULE_INTERVAL = "5,10,15,20,30,40 * * * * *"
log = logging.getLogger(__name__)
# Use this to grab the pushed value and determine your path
def determine_branch(**kwargs):
    """Define the pathway of the branch operator based on the return code pushed by the SSHOperator."""
    return_code = kwargs['ti'].xcom_pull(task_ids='SSHTest')
    print("From Kwargs: ", return_code)
    if return_code == '1':
        return 't4'
    return 't3'
# DAG for airflow task
dag_email_recipient = ["mailid"]
# These args will get passed on to the SSH operator
default_args = {
'owner': 'afuser',
'depends_on_past': False,
'email': ['mailid'],
'email_on_failure': True,
'email_on_retry': False,
'start_date': START_DATE,
'retries': 1,
'retry_delay': timedelta(minutes=1),
}
with DAG(
'DB2_SSH_TEST',
default_args=default_args,
description='How to use the SSH Operator?',
schedule_interval=SCHEDULE_INTERVAL,
start_date=START_DATE,
catchup=False,
) as dag:
    # Be sure to add '; echo $?' to the end of your bash script for this to work.
    t1 = SSHOperator(
        ssh_conn_id='DB2_Jabxusr_SSH',
        task_id='SSHTest',
        xcom_push=True,
        command='/path/backup_script',
        provide_context=True,
    )
    # Function defined above called here
    t2 = BranchPythonOperator(
        task_id='check_ssh_output',
        python_callable=determine_branch,
        provide_context=True,
    )
    # If we don't want to send email
    t3 = DummyOperator(
        task_id='t3'
    )
    # If we do
    t4 = EmailOperator(
        task_id='t4',
        to=dag_email_recipient,
        subject='Airflow Success: process incoming files',
        files=None,
        html_content='',
        #...
    )
    t1 >> t2
    t2 >> t3
    t2 >> t4
You can configure email notifications per task. First you need to configure your email provider, and then when you create a task you can set the "email_on_retry" and "email_on_failure" flags, or you can even write a custom "on_failure" callback with your own logic deciding when and how to notify.
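For illustration, a per-task setup could look roughly like this, reusing the SSH task from the question (the callback body and recipient are placeholders):
def notify_failure(context):
    # Custom notification logic: inspect the failed task instance, post to chat, page someone, etc.
    print(f"Task {context['task_instance'].task_id} failed")

t1 = SSHOperator(
    ssh_conn_id='DB2_Jabxusr_SSH',
    task_id='SSHTest',
    command='/path/backup_script',
    email=['mailid'],
    email_on_failure=True,               # built-in email on task failure
    on_failure_callback=notify_failure,  # custom "on failure" logic
)
Note that any of these only fire when the task itself fails, i.e. when the remote command exits with a non-zero status, so the backup script needs to propagate the failure exit code.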
Here is a very nice Astronomer article explaining all the ins and outs of notifications: https://www.astronomer.io/guides/error-notifications-in-airflow
I am trying to run a simple DAG in Airflow to execute a Python file, and it is throwing the error can't open the file '/User/....'.
Below is the script I am using:
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime,timedelta
default_args = {
'owner': 'airflow',
'start_date': datetime(2021,3,2),
'depends_on_past': False,
'retries': 0
}
dag = DAG(dag_id='DummyDag', default_args=default_args, catchup=False, schedule_interval='@once')
command_load='python /usr/local/airflow/dags/dummy.py '
#load=BashOperator(task_id='loadfile',bash_command='python /Users/<user-id>/MyDrive/DataEngineeringAssignment/dummydata.py',dag=dag)
start=DummyOperator(task_id='Start',dag=dag)
end=DummyOperator(task_id='End',dag=dag)
dir >> start >> end
Where am I going wrong?
Option 1:
The file location (dummydata.py) is resolved relative to the directory containing the pipeline (DAG) file.
dag=DAG(
dag_id='DummyDag',
...
)
load=BashOperator(task_id='loadfile',
bash_command='python dummydata.py',
dag=dag
)
Option 2:
Define template_searchpath in the DAG constructor call to point to any folder location(s) containing the file.
dag=DAG(
dag_id='DummyDag',
...,
template_searchpath=['/Users/<user-id>/MyDrive/DataEngineeringAssignment/']
)
load=BashOperator(task_id='loadfile',
# "dummydata.py" is a file under "/Users/<user-id>/MyDrive/DataEngineeringAssignment/"
bash_command='python dummydata.py ', # Note: Space is important!
dag=dag
)
For more information, you can read about it in the docs.
I'm running Airflow 1.9.0 with LocalExecutor and PostgreSQL database in a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to #once, the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
default_args=default_args,
start_date = dt.datetime(2018, 7, 24),
schedule_interval='None',
catchup=False
) as dag:
    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the '', or simply do not specify schedule_interval in your DAG. It is set to None as a default. More information on that here: airflow docs -- search for schedule_interval
Set orchestration for your tasks at the bottom of the dag.
Like so:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
default_args=default_args,
start_date = datetime(2018, 7, 24),
schedule_interval=None,
catchup=False
) as dag:
    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )
    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )
    #####ORCHESTRATION#####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG