I am working in $AIRFLOW_HOME/dags. I have created the following files:
- common
  |- __init__.py  # empty
  |- common.py    # common code
- foo_v1.py       # DAG instantiation
In common.py:
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = ...

def create_dag(project, version):
    dag_id = project + '_' + version
    dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
    print('creating DAG ' + dag_id)
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)
    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)
    t2.set_upstream(t1)
In foo_v1.py:
from common.common import create_dag
create_dag('foo', 'v1')
When testing the script with python, it looks OK:
$ python foo_v1.py
[2018-10-29 17:08:37,016] {__init__.py:57} INFO - Using executor SequentialExecutor
creating DAG pgrandjean_pgrandjean_spark2.1.0_hadoop2.6.0
I then launch the webserver and the scheduler locally. My problem is that I don't see any DAG with id foo_v1, and no .pyc file is being created. What am I doing wrong? Why isn't the code in foo_v1.py being executed?
To be found by Airflow, the DAG object returned by create_dag() must be in the global namespace of the foo_v1.py module. One way to place a DAG in the global namespace is simply to assign it to a module-level variable:
from common.common import create_dag
dag = create_dag('foo', 'v1')
Another way is to update the global namespace using globals():
globals()['foo_v1'] = create_dag('foo', 'v1')
The latter may look like overkill, but it is useful for creating multiple DAGs dynamically, for example in a for-loop:
for i in range(10):
    globals()[f'foo_v{i}'] = create_dag('foo', f'v{i}')
Note: Any *.py file placed in $AIRFLOW_HOME/dags (even in sub-directories, such as common in your case) will be parsed by Airflow. If you do not want this you can use .airflowignore or packaged DAGs.
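For example, a minimal .airflowignore placed in $AIRFLOW_HOME/dags could keep the scheduler from parsing the common package as DAG files. Each line is treated as a regular expression matched against file paths under the dags folder; the pattern below is only a sketch for the layout above:
# $AIRFLOW_HOME/dags/.airflowignore
common
Files under common/ can still be imported by foo_v1.py; the entry only excludes them from DAG parsing.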
You need to assign the DAG to an exported variable in the module. If the DAG isn't in the module's __dict__, Airflow's DagBag processor won't pick it up.
Check out the source here: https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L428
As mentioned here, you must return the DAG after creating it:
default_args = ...

def create_dag(project, version):
    dag_id = project + '_' + version
    dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
    print('creating DAG ' + dag_id)
    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)
    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)
    t2.set_upstream(t1)
    return dag  # Add this line to your code!
Related
I am using Apache Airflow 1.10.9 (based on puckel/docker-airflow docker image) to run several Python scripts in a DAG via the BashOperator. The logs are currently written to /usr/local/airflow/logs.
Is it possible to configure Airflow so that:
- the logs are also written to another directory like /home/foo/logs
- the logs only contain the stdout from the python scripts
- the logs are stored in the following directory/filename format:
  /home/foo/logs/[execution-date]-[dag-id]-[task-id].log
- retries are appended to the same .log file, if possible. Otherwise, we can have the naming convention:
  /home/foo/logs/[execution-date]-[dag-id]-[task-id]-[retry-number].log
Thanks everyone!
Example DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = { ... }

dag = DAG(
    'mydag',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# Log to /home/foo/logs/2020-05-12-mydag-hello_world.log
t1 = BashOperator(
    task_id='hello_world',
    bash_command='/path/to/env/bin/python /path/to/scripts/hello_world.py',
    dag=dag,
)

# Log to /home/foo/logs/2020-05-12-mydag-hey_there.log
t2 = BashOperator(
    task_id='hey_there',
    bash_command='/path/to/env/bin/python /path/to/scripts/hey_there.py',
    dag=dag,
)

t1 >> t2
This link has an answer: https://bcb.github.io/airflow/run-dag-and-watch-logs
Set the log filename template setting:
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.dag_id }}.log"
Alternatively, you can edit the log_filename_template variable in the airflow.cfg file; any Airflow-related settings can be configured there.
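For example, a template approximating the requested naming scheme might look like the following. This is only a sketch: the exact Jinja variables available in the template depend on your Airflow version, the strftime formatting of ti.execution_date is an assumption, and the files are written under base_log_folder, so that setting is changed here as well:
export AIRFLOW__CORE__BASE_LOG_FOLDER="/home/foo/logs"
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.execution_date.strftime('%Y-%m-%d') }}-{{ ti.dag_id }}-{{ ti.task_id }}-{{ try_number }}.log"
Note that this only renames Airflow's own task log files; capturing just the stdout of the Python scripts would still be up to the scripts or the bash_command (e.g. redirecting output), not this template.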
Can anyone please point me to an example of how to use the Airflow FileSensor?
I've googled and haven't found anything yet. Any example would be sufficient. My use case is quite simple:
Wait for a scheduled DAG to drop a file in a path; a FileSensor task picks it up, reads the content, and processes it.
From the documentation & source code:
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dummy_operator import DummyOperator
import datetime
import airflow

# https://airflow.apache.org/code.html#airflow.models.BaseOperator
default_args = {
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
    "retries": 1,
    "retry_delay": datetime.timedelta(hours=5),
}

with airflow.DAG("file_sensor_test_v1", default_args=default_args, schedule_interval="*/5 * * * *") as dag:
    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task", poke_interval=30, fs_conn_id=<path>, filepath=<file or directory name>)

    start_task >> sensor_task >> stop_task
A simple example of a FileSensor task:
second_task = FileSensor(
    task_id="file_sensor_task_id",
    filepath="{{ task_instance.xcom_pull(task_ids='get_filepath_task') }}",
    # fs_conn_id="fs_default",  # default one, commented because not needed
    poke_interval=20,
    dag=dag
)
Here I'm passing as filepath the value returned by the previous PythonOperator task (named get_filepath_task), pulled via xcom_pull.
But it can be any string pointing to a file path or directory whose existence you want to check.
The fs_conn_id parameter is the string name of a connection you have available in the UI Admin/Connections section.
The default value of fs_conn_id is "fs_default" (you can see it in the code of the FileSensor class). Check the UI Admin/Connections and you will find it.
You can skip fs_conn_id and just pass the filepath parameter if you want to check whether a file or directory exists locally.
The poke_interval is inherited from BaseSensorOperator and it indicates the time in seconds that the job should wait between tries. The default value is 60 seconds.
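For context, a minimal sketch of the upstream task referenced above; the task id get_filepath_task comes from the snippet, while the callable name and the returned path are purely illustrative:

from airflow.operators.python_operator import PythonOperator

def get_filepath(**kwargs):
    # The return value is pushed to XCom and pulled by the FileSensor's
    # templated filepath argument.
    return '/tmp/data/input.csv'  # hypothetical path

first_task = PythonOperator(
    task_id='get_filepath_task',
    python_callable=get_filepath,
    provide_context=True,
    dag=dag,
)

first_task >> second_task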
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database in a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to @once, the webserver tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:

    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes, i.e. the Python None value rather than the string 'None'. More information on that here: airflow docs -- search for schedule_interval
Set up the orchestration for your tasks at the bottom of the DAG.
Like so:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:

    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )

    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )

    ##### ORCHESTRATION #####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
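With schedule_interval=None in place, Airflow will not schedule the DAG on its own; you can trigger it manually from the UI or from the CLI, for example:
airflow trigger_dag schedule_test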
I am new to Airflow. I wrote some simple code to save a list in a txt file, as below:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime

DAG = DAG(
    dag_id='example_dag',
    start_date=datetime.datetime.now(),
    schedule_interval='@once'
)

def push_function(**kwargs):
    ls = ['a', 'b', 'c']
    return ls

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    with open('test.txt', 'w') as out:
        out.write(ls)
        out.close()

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)

push_task >> pull_task
When I use the web server interface I see my DAG. I also see my DAG when I run airflow list_dags in the CLI.
I also ran my code using python code.py, and the result was as below, without any error:
[2017-12-16 14:21:30,609] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-12-16 14:21:30,709] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-12-16 14:21:30,741] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
I tried to run the DAG both from the UI and with the command airflow trigger_dag Mydag.
However, I cannot see my txt result file after running. There is no error in the log file either.
How can I find my txt file?
I would try again with an absolute file path, or you can log the current working directory inside the method with os.getcwd() to help locate your file.
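A minimal sketch of the pull side with an absolute path (the /home/foo/output directory is just an assumption; note also that xcom_pull returns the list itself, so it has to be turned into a string before writing):

import os

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    # Log where the worker process is actually running from, to help find relative paths.
    print('current working directory: ' + os.getcwd())
    # Write to an absolute path instead of a path relative to the worker's cwd.
    with open('/home/foo/output/test.txt', 'w') as out:  # hypothetical directory
        out.write(','.join(ls))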
I have the following DAG, which executes the different methods of a class dedicated to a data preprocessing routine:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    from table_builder import OnlineOfflinePreprocess
else:
    print('Define MARKETING_PREPROC_PATH value in environmental variables')
    sys.exit(1)
default_args = {
    'start_date': datetime.now(),
    'max_active_runs': 1,
    'concurrency': 4
}

worker = OnlineOfflinePreprocess()

DAG = DAG(
    dag_id='marketing_data_preproc',
    default_args=default_args,
    start_date=datetime.today()
)
import_online_data = PythonOperator(
    task_id='import_online_data',
    python_callable=worker.import_online_data,
    dag=DAG)

import_offline_data = PythonOperator(
    task_id='import_offline_data',
    python_callable=worker.import_offline_data,
    dag=DAG)

merge_aurum_to_sherlock = PythonOperator(
    task_id='merge_aurum_to_sherlock',
    python_callable=worker.merge_aurum_to_sherlock,
    dag=DAG)

merge_sherlock_to_aurum = PythonOperator(
    task_id='merge_sherlock_to_aurum',
    python_callable=worker.merge_sherlock_to_aurum,
    dag=DAG)

upload_au_to_sh = PythonOperator(
    task_id='upload_au_to_sh',
    python_callable=worker.upload_table,
    op_args='aurum_to_sherlock',
    dag=DAG)

upload_sh_to_au = PythonOperator(
    task_id='upload_sh_to_au',
    python_callable=worker.upload_table,
    op_args='sherlock_to_aurum',
    dag=DAG)
import_online_data >> merge_aurum_to_sherlock
import_offline_data >> merge_aurum_to_sherlock
merge_aurum_to_sherlock >> merge_sherlock_to_aurum
merge_aurum_to_sherlock >> upload_au_to_sh
merge_sherlock_to_aurum >> upload_sh_to_au
This produces the following error:
[2017-09-07 19:32:09,587] {base_task_runner.py:97} INFO - Subtask: AttributeError: 'OnlineOfflinePreprocess' object has no attribute 'online_info'
This is actually pretty obvious given how Airflow works: the outputs of the different class methods aren't stored on the class object initialized at the top of the graph.
Can I solve this with XCom? Overall, what is the thinking about how to blend the coherence of OOP with Airflow?
It's less of an issue about OOP with Airflow and more about state with Airflow.
Any state that needs to be passed between tasks needs to be stored persistently. This is because each Airflow task is an independent process (which could even be running on a different machine!), so in-memory communication is not possible.
You are correct that you can use XCom to pass this state (if it's small, since it gets stored in the Airflow database). If it's large you probably want to store it somewhere else, maybe a filesystem, S3, HDFS, or a specialized database.
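As a minimal sketch of that idea (the task and callable names below are illustrative, not taken from the original DAG), each task returns its small result so the next task pulls it from XCom instead of relying on in-memory attributes of a shared class instance:

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='xcom_state_example',
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
)

def import_online_data(**kwargs):
    # Return a *small* payload; the return value is stored as an XCom in the metadata database.
    return {'rows': 42}

def merge_data(**kwargs):
    ti = kwargs['ti']
    online_info = ti.xcom_pull(task_ids='import_online_data')
    print('merging with: ' + str(online_info))

import_task = PythonOperator(
    task_id='import_online_data',
    python_callable=import_online_data,
    provide_context=True,
    dag=dag,
)

merge_task = PythonOperator(
    task_id='merge_data',
    python_callable=merge_data,
    provide_context=True,
    dag=dag,
)

import_task >> merge_task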