Can anyone please point me to an example of how to use the Airflow FileSensor?
I've googled and haven't found anything yet. Any example would be sufficient. My use case is quite simple:
Wait for a scheduled DAG to drop a file in a path, have a FileSensor task pick it up, read the content, and process it.
From the documentation & source code:
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dummy_operator import DummyOperator
import datetime
import airflow

# https://airflow.apache.org/code.html#airflow.models.BaseOperator
default_args = {
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
    "retries": 1,
    "retry_delay": datetime.timedelta(hours=5),
}

with airflow.DAG(
    "file_sensor_test_v1",
    default_args=default_args,
    schedule_interval="*/5 * * * *",
) as dag:
    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(
        task_id="my_file_sensor_task",
        poke_interval=30,
        fs_conn_id="<conn_id>",  # name of a File (path) connection; defaults to "fs_default"
        filepath="<file or directory name>",
    )

    start_task >> sensor_task >> stop_task
A simple example of a FileSensor task:
second_task = FileSensor(
    task_id="file_sensor_task_id",
    filepath="{{ task_instance.xcom_pull(task_ids='get_filepath_task') }}",
    # fs_conn_id="fs_default",  # default one, commented because not needed
    poke_interval=20,
    dag=dag,
)
Here I'm passing as filepath the value returned by a previous PythonOperator task (task_id get_filepath_task), pulled with xcom_pull.
But it can be any string pointing to a file or directory whose existence you want to check.
The fs_conn_id parameter is the name of a connection defined in the UI under Admin/Connections.
The default value of fs_conn_id is "fs_default" (you can see it in the code of the FileSensor class). Check Admin/Connections in the UI and you will find it.
You can skip fs_conn_id and pass only filepath if you want to check whether a file or directory exists locally.
poke_interval is inherited from BaseSensorOperator and indicates the time in seconds the job should wait between pokes. The default value is 60 seconds.
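To cover the original use case end to end (wait for the file, then read and process it), here is a minimal sketch that chains the sensor with a PythonOperator. The path /tmp/incoming/data.csv and the process_file callable are illustrative assumptions, not part of the answers above:

from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

FILE_PATH = "/tmp/incoming/data.csv"  # assumed example path

def process_file():
    # Read the file the sensor waited for and process its content.
    with open(FILE_PATH) as f:
        content = f.read()
    print("Processing %d characters" % len(content))

with DAG(
    "file_sensor_then_process",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath=FILE_PATH,   # fs_conn_id defaults to "fs_default"
        poke_interval=30,
    )
    process = PythonOperator(
        task_id="process_file",
        python_callable=process_file,
    )
    wait_for_file >> process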
Related
I am working on an Airflow POC and have written a DAG that runs a script over SSH on another server. It sends an alert if the script fails to execute, but if the script runs and a command inside the script fails, no mail is sent.
Example: I'm running a script that takes a backup of a DB2 database. If the instance is down, the backup command fails, but we don't get any alert because the script itself executed successfully.
from airflow.models import DAG, Variable
from airflow.contrib.operators.ssh_operator import SSHOperator
from datetime import datetime, timedelta
import airflow
import os
import logging
import airflow.settings
from airflow.utils.dates import days_ago
from airflow.operators.email_operator import EmailOperator
from airflow.models import TaskInstance
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
START_DATE = airflow.utils.dates.days_ago(1)
SCHEDULE_INTERVAL = "5,10,15,20,30,40 * * * * *"

log = logging.getLogger(__name__)

# Use this to grab the pushed value and determine your path
def determine_branch(**kwargs):
    """Define the pathway of the branch operator based on the return code pushed by the SSHOperator."""
    return_code = kwargs['ti'].xcom_pull(task_ids='SSHTest')
    print("From Kwargs: ", return_code)
    if return_code == '1':
        return 'send_email'  # task_id of the EmailOperator below
    return 'end'             # task_id of the DummyOperator below
# DAG for airflow task
dag_email_recipient = ["mailid"]

# These args will get passed on to the SSH operator
default_args = {
    'owner': 'afuser',
    'depends_on_past': False,
    'email': ['mailid'],
    'email_on_failure': True,
    'email_on_retry': False,
    'start_date': START_DATE,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}
with DAG(
    'DB2_SSH_TEST',
    default_args=default_args,
    description='How to use the SSH Operator?',
    schedule_interval=SCHEDULE_INTERVAL,
    start_date=START_DATE,
    catchup=False,
) as dag:

    # Be sure to add '; echo $?' to the end of your bash script for this to work.
    t1 = SSHOperator(
        ssh_conn_id='DB2_Jabxusr_SSH',
        task_id='SSHTest',
        xcom_push=True,
        command='/path/backup_script',
    )

    # Function defined above called here
    t2 = BranchPythonOperator(
        task_id='check_ssh_output',
        python_callable=determine_branch,
        provide_context=True,  # needed in Airflow 1.x so the callable receives the context kwargs
    )

    # If we don't want to send email
    t3 = DummyOperator(
        task_id='end'
    )

    # If we do
    t4 = EmailOperator(
        task_id='send_email',  # a task_id is required and must match the value returned by determine_branch
        to=dag_email_recipient,
        subject='Airflow Success: process incoming files',
        files=None,
        html_content='',
        # ...
    )

    t1 >> t2
    t2 >> t3
    t2 >> t4
You can configure email notifications per task. First you need to configure your email provider, and then when you create your task you can set the email_on_retry and email_on_failure flags, or you can write your own custom on_failure_callback hook in which you code your own logic for deciding when and how to notify.
Here is a very nice Astronomer article explaining all the ins and outs of notifications: https://www.astronomer.io/guides/error-notifications-in-airflow
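A rough sketch of the custom-hook approach mentioned above: the notify_failure function and its body are purely illustrative (not a library API), it reuses Airflow's send_email helper, so the SMTP backend must already be configured, and the callback fires only when the Airflow task itself fails:

from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.utils.email import send_email

def notify_failure(context):
    # Airflow calls this with the task context when the task fails.
    ti = context["task_instance"]
    subject = "Airflow failure: %s.%s" % (ti.dag_id, ti.task_id)
    body = "Execution date: %s<br>Log URL: %s" % (context["execution_date"], ti.log_url)
    send_email(to=["mailid"], subject=subject, html_content=body)

backup_task = SSHOperator(
    ssh_conn_id="DB2_Jabxusr_SSH",
    task_id="SSHTest_with_callback",
    command="/path/backup_script",
    on_failure_callback=notify_failure,  # fires only when the task itself fails
    dag=dag,
)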
I'm currently facing a challenge with parsing nested macros. Below is my DAG file:
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.models import Variable
from apty.utils.date import date_ref_now
default_args = {
    "owner": "Akhil",
    "depends_on_past": False,
    "start_date": days_ago(0),
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 0,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    "user_sample",
    default_args=default_args,
    description="test",
    schedule_interval=None,
    catchup=False,
)

def sample_app(hello=None):
    return hello

extra_attrs = {
    "date_stamp": "{{ds}}",
    "foo": "bar",
}

start = DummyOperator(task_id="start", dag=dag)

python = PythonOperator(
    python_callable=sample_app,
    task_id="mid",
    dag=dag,
    params={"date_stamp": extra_attrs["date_stamp"]},
    op_kwargs={"hello": "{{params.date_stamp}}"},
)

start >> python
I have a scenario where I need to pass {{ds}} as one of the parameters to my operator, after which I'll use that parameter however I wish, passing it via op_kwargs or op_args. (I have used PythonOperator as an example, but I would be using my own custom operator.)
Here I would like to make it clear that {{ds}} is passed as a parameter value only; I don't want to write it anywhere else, i.e. in op_kwargs as in this example.
When I run it, the return value of the Python callable is the literal string {{ds}}, not the current date stamp.
Please help me out.
Template or macro variables are only available for parameters that are specified as template_fields on the operator class in use. This depends on the specific version and implementation of Airflow you're using, but here's the latest https://github.com/apache/airflow/blob/98896e4e327f256fd04087a49a13e16a246022c9/airflow/operators/python.py#L72 for the PythonOperator. Since, as you say, you control the operator in question, you can specify any fields you want on the class definition's template_fields. (This all assumes your class inherits from BaseOperator.)
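As a minimal sketch (MyOperator and my_param are hypothetical names, not taken from the question's codebase), an operator that lists the field in template_fields will have "{{ ds }}" rendered before execute() runs; the task at the bottom attaches to the dag defined in the question above:

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

class MyOperator(BaseOperator):
    # Fields listed here are rendered with Jinja before execute() is called.
    template_fields = ("my_param",)

    @apply_defaults
    def __init__(self, my_param=None, *args, **kwargs):
        super(MyOperator, self).__init__(*args, **kwargs)
        self.my_param = my_param

    def execute(self, context):
        # By the time execute() runs, "{{ ds }}" has been replaced with the execution date.
        self.log.info("Rendered value: %s", self.my_param)
        return self.my_param

render_ds = MyOperator(
    task_id="render_ds",
    my_param="{{ ds }}",  # rendered because my_param is listed in template_fields
    dag=dag,
)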
I am using Apache Airflow 1.10.9 (based on the puckel/docker-airflow Docker image) to run several Python scripts in a DAG via the BashOperator. The logs are currently written to /usr/local/airflow/logs.
Is it possible to configure Airflow to:
1. also write the logs to another directory like /home/foo/logs,
2. have the logs contain only the stdout from the Python scripts, and
3. store the logs in the following directory/filename format:
/home/foo/logs/[execution-date]-[dag-id]-[task-id].log
Retries should be appended to the same .log file, if possible. Otherwise, we can have the naming convention:
/home/foo/logs/[execution-date]-[dag-id]-[task-id]-[retry-number].log
Thanks everyone!
Example DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = { ... }

dag = DAG(
    'mydag',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# Log to /home/foo/logs/2020-05-12-mydag-hello_world.log
t1 = BashOperator(
    task_id='hello_world',
    bash_command='/path/to/env/bin/python /path/to/scripts/hello_world.py',
    dag=dag,
)

# Log to /home/foo/logs/2020-05-12-mydag-hey_there.log
t2 = BashOperator(
    task_id='hey_there',
    bash_command='/path/to/env/bin/python /path/to/scripts/hey_there.py',
    dag=dag,
)

t1 >> t2
https://bcb.github.io/airflow/run-dag-and-watch-logs
This link has an answer.
Set the log filename template setting:
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.dag_id }}.log"
Or edit the log_filename_template variable in the airflow.cfg file; you can use any Airflow template variables there.
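For the filename layout asked about above, a sketch of such a template (assuming the ti and try_number variables used by the default log_filename_template, plus the standard ds macro, are available in the rendering context) might look like this:

export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ds }}-{{ ti.dag_id }}-{{ ti.task_id }}-{{ try_number }}.log"

The directory itself comes from base_log_folder (AIRFLOW__CORE__BASE_LOG_FOLDER in 1.10), which you would point at /home/foo/logs. Note that Airflow's task log files also contain operator and scheduler lines, not only the script's stdout; if you strictly need the stdout alone, redirecting it inside the bash_command is a simpler route.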
I'm running Airflow 1.9.0 with the LocalExecutor and a PostgreSQL database on a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to '@once', the webserver's tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:

    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes; the string 'None' is not a valid cron expression, which is what croniter is complaining about. (Note that if you omit schedule_interval entirely, the DAG falls back to a daily schedule rather than None.) More information on that here: airflow docs -- search for schedule_interval
Set the orchestration for your tasks at the bottom of the DAG.
Like so:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:

    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )

    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )

    ##### ORCHESTRATION #####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
I have the following DAG, which executes the various methods of a class dedicated to a data preprocessing routine:
from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    from table_builder import OnlineOfflinePreprocess
else:
    print('Define MARKETING_PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
    'start_date': datetime.now(),
    'max_active_runs': 1,
    'concurrency': 4
}

worker = OnlineOfflinePreprocess()

dag = DAG(
    dag_id='marketing_data_preproc',
    default_args=default_args,
    start_date=datetime.today()
)

import_online_data = PythonOperator(
    task_id='import_online_data',
    python_callable=worker.import_online_data,
    dag=dag)

import_offline_data = PythonOperator(
    task_id='import_offline_data',
    python_callable=worker.import_offline_data,
    dag=dag)

merge_aurum_to_sherlock = PythonOperator(
    task_id='merge_aurum_to_sherlock',
    python_callable=worker.merge_aurum_to_sherlock,
    dag=dag)

merge_sherlock_to_aurum = PythonOperator(
    task_id='merge_sherlock_to_aurum',
    python_callable=worker.merge_sherlock_to_aurum,
    dag=dag)

upload_au_to_sh = PythonOperator(
    task_id='upload_au_to_sh',
    python_callable=worker.upload_table,
    op_args=['aurum_to_sherlock'],
    dag=dag)

upload_sh_to_au = PythonOperator(
    task_id='upload_sh_to_au',
    python_callable=worker.upload_table,
    op_args=['sherlock_to_aurum'],
    dag=dag)

import_online_data >> merge_aurum_to_sherlock
import_offline_data >> merge_aurum_to_sherlock
merge_aurum_to_sherlock >> merge_sherlock_to_aurum
merge_aurum_to_sherlock >> upload_au_to_sh
merge_sherlock_to_aurum >> upload_sh_to_au
This produces the following error:
[2017-09-07 19:32:09,587] {base_task_runner.py:97} INFO - Subtask: AttributeError: 'OnlineOfflinePreprocess' object has no attribute 'online_info'
Which is actually pretty obvious given how Airflow works: the outputs of the class methods being called aren't stored on the global class object initialized at the top of the graph.
Can I solve this with XCom? More generally, what is the thinking on how to blend the coherence of OOP with Airflow?
It's less of an issue about OOP with Airflow and more about state with Airflow.
Any state that needs to be passed between tasks needs to be stored persistently. This is because each Airflow task is an independent process (which could even be running on a different machine!), so in-memory communication is not possible.
You are correct that you can use XCom to pass this state (if it's small, since it gets stored in the Airflow database). If it's large, you probably want to store it somewhere else, maybe a filesystem, S3, HDFS, or a specialized database.
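A rough sketch of that XCom approach, using plain functions in place of the bound methods of OnlineOfflinePreprocess (the task and function names below are placeholders, not the real API of that class):

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id="xcom_state_example",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
)

def import_online_data(**context):
    # Each task runs in its own process, so return whatever state the
    # downstream task needs; the return value is pushed to XCom automatically.
    online_info = {"rows": 123}  # placeholder for the real import result
    return online_info

def merge_aurum_to_sherlock(**context):
    # Pull the state pushed by the upstream task instead of reading it
    # from an in-memory attribute of a shared object.
    online_info = context["ti"].xcom_pull(task_ids="import_online_data")
    print("Merging with:", online_info)

import_online = PythonOperator(
    task_id="import_online_data",
    python_callable=import_online_data,
    provide_context=True,  # Airflow 1.x: pass the context as **kwargs
    dag=dag,
)

merge = PythonOperator(
    task_id="merge_aurum_to_sherlock",
    python_callable=merge_aurum_to_sherlock,
    provide_context=True,
    dag=dag,
)

import_online >> merge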