I have a problem with a DAG in Airflow: I have twice changed the start_date to a week before today, and it still doesn't run. The schedule interval is set to '5 9 * * *'.
Here is the code:
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
# ... code ...
default_args = {
    'owner': 'Lucas',
    'email': ['//email'],
    'email_on_failure': True,
    'start_date': datetime(2021, 7, 9),
    'retry_delay': timedelta(minutes=5)
}
with DAG('instagram', default_args=default_args, schedule_interval='5 9 * * *', catchup=False) as dag:
    token = get_token()
    # ... code ...
It is really strange because the problem is not in the DAG itself: I can trigger the DAG manually without any error, and the start_date and schedule_interval seem fine. Any ideas?
The solution was to delete the DAG and create another one with a different name and the same code. I still have no idea what problem in Airflow caused this.
Related
I want to get the email mentioned in this DAG's default args from another DAG in Airflow. How can I do that? Please help, I am new to Airflow!
from airflow.models import DagRun
from airflow.operators.python_operator import PythonOperator
from datetime import datetime, timedelta
from airflow import DAG
def first_function(**context):
    print("hello")
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['example@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
with DAG(
    'main',
    default_args=default_args,
    description='Sample DAG',
    schedule_interval=timedelta(days=1),
    start_date=datetime(2022, 6, 10),
    catchup=False
) as dag:
    first_function = PythonOperator(
        task_id="first_function",
        python_callable=first_function,
    )
    first_function
You can use a custom module in Airflow to share config options/Operators/or any arbitrary Python code across DAGs.
Typically you would create a directory inside your DAGs directory (which by default is {AIRFLOW_HOME}/dags).
To share default_args between DAGs, you could create the following layout:
Create {AIRFLOW_HOME}/dags/custom/__init__.py
Create {AIRFLOW_HOME}/dags/custom/shared_config.py
Create {AIRFLOW_HOME}/dags/.airflowignore
Add the directory name custom to the first line of .airflowignore.
Cut and paste your default_args dictionary from your DAG into {AIRFLOW_HOME}/dags/custom/shared_config.py
You can see this layout suggested in the Airflow documentation here.
The .airflowignore file tells the scheduler to skip the custom directory when it parses your DAGs (which by default happens every 30s). Because the custom directory contains Python but never any DAGs, skipping these files avoids adding unnecessary load to the scheduler. This is explained in the documentation link above.
You need to add an __init__.py to the custom directory. Airflow requires it even though Python 3 itself does not, thanks to implicit namespace packages (again, this is explained in the same link above).
From your dag you can then import as needed:
from custom.shared_config import default_args
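As a rough, Airflow-free sketch of why that import resolves: Airflow puts the DAGs folder on sys.path, so custom becomes an importable package. The snippet below rebuilds the layout in a temporary directory (a stand-in for {AIRFLOW_HOME}/dags, used here only for illustration) and reads the email from the shared default_args. The file contents and the address someone@example.com are placeholders, not taken from the question.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for {AIRFLOW_HOME}/dags; a temporary directory is an assumption for this demo.
dags_dir = Path(tempfile.mkdtemp())
package_dir = dags_dir / "custom"
package_dir.mkdir()

# An empty __init__.py marks `custom` as a regular package.
(package_dir / "__init__.py").write_text("")

# shared_config.py holds the dictionary that several DAG files can reuse.
(package_dir / "shared_config.py").write_text(
    "default_args = {\n"
    "    'owner': 'airflow',\n"
    "    'email': ['someone@example.com'],\n"
    "}\n"
)

# Ask the DAG parser to skip the helper package.
(dags_dir / ".airflowignore").write_text("custom\n")

# Airflow adds the DAGs folder to sys.path; here it is done by hand.
sys.path.insert(0, str(dags_dir))
importlib.invalidate_caches()

from custom.shared_config import default_args

shared_email = default_args["email"]
print(shared_email)
```

Any DAG file in the same DAGs folder can run the same import and read default_args["email"], so the address lives in one place.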
I'm a newbie in Apache Airflow.
There are a lot of examples of basic DAGs on the Internet.
Unfortunately, I didn't find any examples of single-task DAGs.
Most DAG examples contain a bitshift operator at the end of the .py script, which defines the task order.
For example:
# ...our DAG's code...
task1 >> task2 >> task3
But what if my DAG has just a single task at the moment?
My question is: do I need to put this single task's name at the end of the Python file?
Or, if there is only one task in scope, will Airflow handle it by itself, making the last line of code below redundant?
from datetime import timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
}
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
    )
    t1  # IS THIS LINE OF CODE NECESSARY?
The answer is NO, you don't need to include the last line. You could also skip the assignment to the variable t1, leaving the DAG like this:
with DAG(
    'tutorial',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule_interval=timedelta(days=1),
    start_date=days_ago(2),
    tags=['example'],
) as dag:
    BashOperator(
        task_id='print_date',
        bash_command='date',
    )
The reason to assign an instance of an Operator (such as BashOperator) to a variable (called a Task in this scope) is similar to that for any other object in OOP. In your example no other "operation" is performed on the t1 variable (you are not reading it or calling any method on it), so there is no reason to declare it.
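To see why neither the variable nor the trailing bare name matters, here is a toy model of the mechanism. It assumes that a task attaches itself to the DAG whose with block is open at the moment the task is constructed. MiniDag and MiniTask are hypothetical classes written for this illustration; they are not Airflow's implementation.

```python
# Toy model only: MiniDag and MiniTask are hypothetical, NOT Airflow classes.
class MiniDag:
    _active = None  # the DAG whose `with` block is currently open

    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = {}

    def __enter__(self):
        MiniDag._active = self
        return self

    def __exit__(self, exc_type, exc, tb):
        MiniDag._active = None
        return False


class MiniTask:
    def __init__(self, task_id):
        self.task_id = task_id
        # Registration happens at construction time, inside the `with` block.
        if MiniDag._active is not None:
            MiniDag._active.tasks[task_id] = self


with MiniDag("tutorial") as dag:
    MiniTask("print_date")  # no variable, no trailing bare name

with MiniDag("tutorial_with_name") as dag_named:
    t1 = MiniTask("print_date")
    t1  # a bare name is an expression statement with no effect

registered = sorted(dag.tasks)
registered_named = sorted(dag_named.tasks)
print(registered, registered_named)
```

Both DAGs end up with the same single task, which is the behaviour described above: the variable only matters once you need to refer to the task again, for example in task1 >> task2.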
When starting with Airflow, I think it is very helpful to use the DebugExecutor to run quick tests like this and understand how everything works. If you are using VS Code you can find an example config file here.
I am trying to get the Airflow ExternalTaskSensor to work, but so far I have not been able to get it to complete. It always seems to get stuck running and never finishes, so the DAG cannot move on to the next task.
Here is the code I am using to test:
# Imports assumed for Airflow 1.10.x (not shown in the original post); module paths differ in Airflow 2
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor
DEFAULT_ARGS = {
    'owner': 'NAME',
    'depends_on_past': False,
    'start_date': datetime(2019, 9, 9),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
external_watch_dag = DAG(
    'DAG-External_watcher-Test',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
start_op = DummyOperator(
    task_id='start_op',
    dag=external_watch_dag
)
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_delta=timedelta(minutes=-1),
    # execution_date_fn=datetime(2019, 9, 25),
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
end_op = DummyOperator(
    task_id='end_op',
    dag=external_watch_dag
)
start_op >> trigger_external >> external_watch_op >> end_op
# start_op >> [external_watch_op, trigger_external]
# external_watch_op >> end_op
# Below is the setup for the dummy DAG that is called above by the Trigger and watched by the TaskSensor
dummy_dag = DAG(
    'DAG-Dummy',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=1),
    schedule_interval=None
)
dummy_task = BashOperator(
    task_id='dummy_task',
    bash_command='sleep 10',
    dag=dummy_dag
)
I have tried tweaking this code in a number of ways but have not had any success with the ExternalTaskSensor.
Does anyone know how to solve this problem and get the ExternalTaskSensor to work properly? I have also read that issues can arise from scheduling intervals when using the ExternalTaskSensor. Is it possible that part of the issue is that both DAGs have schedule_interval=None?
I did get this to work with both DAGs set to the exact same schedule_interval, but that will not work in production. The goal is to have the main DAG, external-watch-dag, run on a regular schedule and trigger DAG-Dummy during its run, with DAG-Dummy itself having schedule_interval=None.
Any help is greatly appreciated.
By default, the ExternalTaskSensor monitors the run of external_dag_id that has the same execution date as the sensor DAG. With execution_delta you can set a time delta between the sensor DAG and the external DAG so that the sensor looks for the correct execution_date. This works well when both DAGs run on a schedule, because you know exactly what this timedelta is.
The problem: when a DAG is triggered manually or by another DAG, you cannot know for sure the exact execution date of either of these two DAGs.
The solution: because you are using the TriggerDagRunOperator, you can set its execution_date parameter. This ensures that your DAG and the external DAG have the same execution date. From the docs:
execution_date (str or datetime.datetime) – Execution date for the dag (templated)
So your code will look like this:
trigger_external = TriggerDagRunOperator(
    task_id='trigger_external',
    trigger_dag_id='DAG-Dummy',
    dag=external_watch_dag,
    execution_date="{{ execution_date }}",  # Use the template to get the current execution date
)
external_watch_op = ExternalTaskSensor(
    task_id='external_watch_op',
    external_dag_id='DAG-Dummy',
    external_task_id='dummy_task',
    check_existence=True,
    execution_timeout=timedelta(minutes=30),
    dag=external_watch_dag
)
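The date matching described above can be sketched without Airflow. The helper sensor_target_date below is a simplified, hypothetical stand-in for what the sensor does (it subtracts execution_delta from its own execution date), and the timestamps are invented for illustration. It assumes that a triggered run with no explicit execution_date is stamped with roughly the moment the trigger ran.

```python
from datetime import datetime, timedelta


def sensor_target_date(sensor_execution_date, execution_delta=None):
    """Execution date the sensor looks for in the external DAG (simplified)."""
    if execution_delta is not None:
        return sensor_execution_date - execution_delta
    return sensor_execution_date


# Hypothetical timestamps for illustration.
watcher_execution_date = datetime(2019, 9, 25, 12, 0, 0)

# Original setup: the triggered run gets "whenever the trigger task ran",
# while execution_delta=timedelta(minutes=-1) makes the sensor look one minute ahead.
triggered_run_date = datetime(2019, 9, 25, 12, 0, 37)
original_target = sensor_target_date(watcher_execution_date, timedelta(minutes=-1))
original_match = original_target == triggered_run_date

# Fixed setup: the trigger passes "{{ execution_date }}", so the triggered run
# carries the watcher's own execution date, and the sensor uses no delta.
triggered_run_date_fixed = watcher_execution_date
fixed_target = sensor_target_date(watcher_execution_date)
fixed_match = fixed_target == triggered_run_date_fixed

print(original_target, original_match)
print(fixed_target, fixed_match)
```

In the original setup the sensor waits for a run dated 12:01:00 that never exists, which is consistent with it appearing stuck; in the fixed setup both dates are identical by construction.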
I have a cronjob that runs on the cron schedule 05 */1 * * 1-5 or, as Crontab Guru puts it, “At minute 5 past every hour on every day-of-week from Monday through Friday” (in EST instead of UTC).
How can I convert this into an 'America/New_York' timezone-aware Airflow DAG that will run in exactly the same way?
I asked a previous question on timezone-aware DAGs in Airflow, but it is not apparent to me from the answer or from the Airflow documentation how to get from a DAG that has a start_date with tzinfo to one whose schedule_interval mimics a cronjob.
I am currently trying to use a DAG with the my_dag.py file as follows:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
import pendulum
local_tz = pendulum.timezone("America/New_York")
default_args = dict(
    owner='airflow',
    start_date=datetime(2018, 11, 7, 13, 5, tzinfo=local_tz),  # 1:05 PM on Nov 7
    schedule_interval=timedelta(hours=1),
)
dag = DAG('my_test_dag', catchup=False, default_args=default_args)
op = BashOperator(
    task_id='my_test_dag',
    bash_command="bash -i /home/user/shell_script.sh",
    dag=dag
)
However, the DAG never gets scheduled. What am I doing wrong here?
Airflow supports cron expressions. schedule_interval is a DAG argument that accepts, preferably, a cron expression as a str, or a datetime.timedelta object. Alternatively, you can use one of these cron “presets”: None, @once, @hourly, @daily, @weekly, @monthly, @yearly.
As far as I can see, the timezone awareness is correct, but the schedule interval should be changed: pass it to the DAG itself as a cron string rather than as a timedelta inside default_args.
args = dict(
    owner='airflow',
    start_date=datetime(2018, 11, 7, 13, 5, tzinfo=local_tz),  # 1:05 PM on Nov 7
)
dag = DAG(
    dag_id="dagname_here",
    default_args=args,
    schedule_interval='05 */1 * * 1-5',  # should be a string
)
NOTE: Please be reminded that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
For reference: Airflow Scheduling
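The note about runs starting once their period has ended can be made concrete with a small stdlib sketch for this schedule. It is a simplification under stated assumptions: next_slot is a hypothetical helper hard-coded for '5 */1 * * 1-5', it expects a stamp that already sits on minute 5, and it uses naive datetimes read as New York wall-clock time, so daylight-saving transitions are ignored.

```python
from datetime import datetime, timedelta


def next_slot(stamp):
    """Next slot of '5 */1 * * 1-5' after `stamp` (stamp must be at minute 5)."""
    candidate = stamp + timedelta(hours=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(hours=1)
    return candidate


def trigger_time(execution_date):
    """A run starts once the period it covers has ended, i.e. at the next slot."""
    return next_slot(execution_date)


wednesday_run = datetime(2018, 11, 7, 13, 5)    # stamped 1:05 PM, Wed Nov 7
friday_last_run = datetime(2018, 11, 9, 23, 5)  # last slot of the working week

print(trigger_time(wednesday_run))    # 2018-11-07 14:05:00
print(trigger_time(friday_last_run))  # 2018-11-12 00:05:00
```

So the run stamped 1:05 PM would be expected to start at about 2:05 PM, and the run stamped Friday 11:05 PM would wait until the first slot on Monday.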
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database on a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to @once, the webserver tree view crashes with the following error (I only show the last call):
  File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
    raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:
    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes, or simply do not specify schedule_interval in your DAG. It is set to None by default. More information on that here: airflow docs -- search for schedule_interval
Set the orchestration (the dependencies between your tasks) at the bottom of the DAG.
Like so:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'me'
}
bash_command = """
echo "this is a test task"
"""
with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:
    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )
    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )
#####ORCHESTRATION#####
## It is saying that in order for t2 to run, t1 must be done.
t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
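A short sketch of why the quotes matter: the string 'None' is not the Python value None, so it ends up being read as a cron expression, and a single word cannot satisfy the "5 or 6 columns" rule from the error message. classify_schedule below is a simplified, hypothetical check written for illustration; it is not Airflow's or croniter's real code, and it assumes the presets are recognised before cron parsing.

```python
# Presets as listed in the Airflow docs; assumed to be handled before cron parsing.
PRESETS = {"@once", "@hourly", "@daily", "@weekly", "@monthly", "@yearly"}


def classify_schedule(schedule_interval):
    """Simplified, hypothetical check; not Airflow's or croniter's real code."""
    if schedule_interval is None:
        return "manual only"
    if schedule_interval in PRESETS:
        return "preset"
    fields = schedule_interval.split()
    if len(fields) not in (5, 6):
        raise ValueError(
            f"{schedule_interval!r} is not a cron expression: "
            f"expected 5 or 6 fields, got {len(fields)}"
        )
    return "cron"


outcomes = []
for value in (None, "@once", "5 9 * * *", "None"):
    try:
        outcomes.append((value, classify_schedule(value)))
    except ValueError as err:
        outcomes.append((value, f"error: {err}"))

for value, outcome in outcomes:
    print(repr(value), "->", outcome)
```

Only the quoted 'None' falls through to the field count and fails, which mirrors the CroniterBadCronError in the question.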