I have the following DAG, which executes the different methods of a class dedicated to a data preprocessing routine:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('MARKETING_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    from table_builder import OnlineOfflinePreprocess
else:
    print('Define MARKETING_PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
    'start_date': datetime.now(),
    'max_active_runs': 1,
    'concurrency': 4
}
worker = OnlineOfflinePreprocess()
DAG = DAG(
    dag_id='marketing_data_preproc',
    default_args=default_args,
    start_date=datetime.today()
)
import_online_data = PythonOperator(
    task_id='import_online_data',
    python_callable=worker.import_online_data,
    dag=DAG)

import_offline_data = PythonOperator(
    task_id='import_offline_data',
    python_callable=worker.import_offline_data,
    dag=DAG)

merge_aurum_to_sherlock = PythonOperator(
    task_id='merge_aurum_to_sherlock',
    python_callable=worker.merge_aurum_to_sherlock,
    dag=DAG)

merge_sherlock_to_aurum = PythonOperator(
    task_id='merge_sherlock_to_aurum',
    python_callable=worker.merge_sherlock_to_aurum,
    dag=DAG)
upload_au_to_sh = PythonOperator(
    task_id='upload_au_to_sh',
    python_callable=worker.upload_table,
    op_args=['aurum_to_sherlock'],  # op_args must be a list/tuple, not a bare string
    dag=DAG)

upload_sh_to_au = PythonOperator(
    task_id='upload_sh_to_au',
    python_callable=worker.upload_table,
    op_args=['sherlock_to_aurum'],
    dag=DAG)
import_online_data >> merge_aurum_to_sherlock
import_offline_data >> merge_aurum_to_sherlock
merge_aurum_to_sherlock >> merge_sherlock_to_aurum
merge_aurum_to_sherlock >> upload_au_to_sh
merge_sherlock_to_aurum >> upload_sh_to_au
This produces the following error:
[2017-09-07 19:32:09,587] {base_task_runner.py:97} INFO - Subtask: AttributeError: 'OnlineOfflinePreprocess' object has no attribute 'online_info'
Which is actually pretty obvious given how Airflow works: the outputs of the different class methods aren't stored on the class object initialized at the top of the graph.
Can I solve this with XCom? Overall, what is the thinking about how to blend the coherence of OOP with Airflow?
It's less of an issue about OOP with Airflow and more about state with Airflow.
Any state that needs to be passed between tasks needs to be stored persistently. This is because each Airflow task is an independent process (which could even be running on a different machine!) and thus in-memory communication is not possible.
You are correct that you can use XCom to pass this state (if it's small, since it gets stored in the Airflow database). If it's large you probably want to store it somewhere else, such as a filesystem, S3, HDFS, or a specialized database.
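As a minimal sketch of the XCom approach for the DAG in the question (the function bodies and the dag_id below are illustrative assumptions, not the asker's actual class): the upstream task returns its result, which Airflow pushes to XCom, and the downstream task pulls it instead of reading an instance attribute.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(
    dag_id='marketing_data_preproc_xcom',   # hypothetical dag_id for this sketch
    start_date=datetime(2017, 9, 7),
    schedule_interval=None)

def import_online_data():
    online_info = {'rows': 123}  # stand-in for the real import logic
    return online_info           # returned values are pushed to XCom as 'return_value'

def merge_aurum_to_sherlock(**context):
    # Pull the upstream task's output from XCom instead of reading self.online_info.
    online_info = context['ti'].xcom_pull(task_ids='import_online_data')
    print('merging', online_info)

import_online = PythonOperator(
    task_id='import_online_data',
    python_callable=import_online_data,
    dag=dag)

merge = PythonOperator(
    task_id='merge_aurum_to_sherlock',
    python_callable=merge_aurum_to_sherlock,
    provide_context=True,   # needed in Airflow 1.x so **context is passed to the callable
    dag=dag)

import_online >> merge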
Related
(First-time user of Cloud Composer.) All the examples I have seen define very simple Python functions within the DAG.
I have multiple lengthy python scripts I want to run. Can I put these inside a task?
If so, is it then better to use the PythonOperator or call them from the BashOperator?
E.g. something like
default_dag_args = {}

with models.DAG('jobname', schedule_interval=datetime.timedelta(days=1), default_args=default_dag_args) as dag:
    do_stuff1 = python_operator.PythonOperator(
        task_id='task_1',
        python_callable=myscript1.py)
    do_stuff2 = python_operator.PythonOperator(
        task_id='task_2',
        python_callable=myscript2.py)
If you put your python scripts into separate files, you can actually use both PythonOperator and BashOperator to execute the scripts.
Let's assume you place your python scripts under the following folder structure.
dags/
    my_dag.py
    tasks/
        myscript1.py
        myscript2.py
Using PythonOperator in my_dag.py
from datetime import timedelta
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from tasks import myscript1, myscript2
default_dag_args = {}
with DAG(
    "jobname",
    schedule_interval=timedelta(days=1),
    default_args=default_dag_args,
) as dag:
    do_stuff1 = PythonOperator(
        task_id="task_1",
        python_callable=myscript1.main,  # assume entrypoint is main()
    )
    do_stuff2 = PythonOperator(
        task_id="task_2",
        python_callable=myscript2.main,  # assume entrypoint is main()
    )
Using BashOperator in my_dag.py
from datetime import timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
default_dag_args = {}
with DAG(
    "jobname",
    schedule_interval=timedelta(days=1),
    default_args=default_dag_args,
) as dag:
    do_stuff1 = BashOperator(
        task_id="task_1",
        bash_command="python /path/to/myscript1.py",
    )
    do_stuff2 = BashOperator(
        task_id="task_2",
        bash_command="python /path/to/myscript2.py",
    )
Can someone please point me to an example of how to use the Airflow FileSensor?
I've googled and haven't found anything yet. Any example would be sufficient. My use case is quite simple:
Wait for a scheduled DAG to drop a file in a path, then have a FileSensor task pick it up, read the content, and process it.
From the documentation & source code:
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dummy_operator import DummyOperator
import datetime
import airflow

# https://airflow.apache.org/code.html#airflow.models.BaseOperator
default_args = {
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
    "retries": 1,
    "retry_delay": datetime.timedelta(hours=5),
}

with airflow.DAG("file_sensor_test_v1", default_args=default_args, schedule_interval="*/5 * * * *") as dag:
    start_task = DummyOperator(task_id="start")
    stop_task = DummyOperator(task_id="stop")
    sensor_task = FileSensor(task_id="my_file_sensor_task", poke_interval=30, fs_conn_id=<connection id>, filepath=<file or directory name>)

    start_task >> sensor_task >> stop_task
A simple example of a FileSensor task:
second_task = FileSensor(
    task_id="file_sensor_task_id",
    filepath="{{ task_instance.xcom_pull(task_ids='get_filepath_task') }}",
    # fs_conn_id="fs_default",  # default one, commented because not needed
    poke_interval=20,
    dag=dag
)
Here I'm passing as filepath the value returned by the previous PythonOperator task (with task_id get_filepath_task), pulled via xcom_pull.
But it can be any string of a filepath or directory whose existence you are checking.
The fs_conn_id parameter is the string name of a connection you have available in the UI Admin/Connections section.
The default value of fs_conn_id is "fs_default" (you can see it in the code of the FileSensor class operator). Check the UI Admin/Connections and you will find it.
You can skip passing fs_conn_id and just pass the filepath parameter if you want to check whether a file or a directory exists locally.
The poke_interval is inherited from BaseSensorOperator and it indicates the time in seconds that the job should wait between tries. The default value is 60 seconds.
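Putting the pieces above together, here is a hedged end-to-end sketch for the use case in the question; apart from get_filepath_task and the templated filepath, the task names, the path, and the 5-minute schedule are assumptions. An upstream PythonOperator returns a filepath, the FileSensor waits for that file, and a downstream task reads and processes it.
from airflow import DAG
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

default_args = {
    "start_date": days_ago(1),
}

def get_filepath():
    return "/tmp/incoming/data.csv"  # stand-in for whatever decides where the file lands

def process_file(**context):
    # Read back the same path the sensor waited on and process its content.
    path = context["ti"].xcom_pull(task_ids="get_filepath_task")
    with open(path) as f:
        print(f.read())

with DAG("file_sensor_example", default_args=default_args,
         schedule_interval="*/5 * * * *") as dag:
    get_filepath_task = PythonOperator(
        task_id="get_filepath_task",
        python_callable=get_filepath)

    wait_for_file = FileSensor(
        task_id="file_sensor_task_id",
        filepath="{{ task_instance.xcom_pull(task_ids='get_filepath_task') }}",
        fs_conn_id="fs_default",   # default connection; points at the local filesystem
        poke_interval=20)

    process_task = PythonOperator(
        task_id="process_file",
        python_callable=process_file,
        provide_context=True)

    get_filepath_task >> wait_for_file >> process_task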
I am working in $AIRFLOW_HOME/dags. I have created the following files:
- common
  |- __init__.py  # empty
  |- common.py    # common code
- foo_v1.py       # dag instantiation
In common.py:
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

default_args = ...

def create_dag(project, version):
    dag_id = project + '_' + version
    dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
    print('creating DAG ' + dag_id)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)

    t2.set_upstream(t1)
In foo_v1.py:
from common.common import create_dag
create_dag('foo', 'v1')
When testing the script with python, it looks OK:
$ python foo_v1.py
[2018-10-29 17:08:37,016] {__init__.py:57} INFO - Using executor SequentialExecutor
creating DAG pgrandjean_pgrandjean_spark2.1.0_hadoop2.6.0
I then launch the webserver and the scheduler locally. My problem is that I don't see any DAG with id foo_v1 in the web UI. No pyc file is being created. What am I doing wrong? Why isn't the code in foo_v1.py being executed?
To be found by Airflow, the DAG object returned by create_dag() must be in the global namespace of the foo_v1.py module. One way to place a DAG in the global namespace is simply to assign it to a module level variable:
from common.common import create_dag
dag = create_dag('foo', 'v1')
Another way is to update the global namespace using globals():
globals()['foo_v1'] = create_dag('foo', 'v1')
The latter may look like overkill, but it is useful for creating multiple DAGs dynamically, for example in a for-loop:
for i in range(10):
    globals()[f'foo_v{i}'] = create_dag('foo', f'v{i}')
Note: Any *.py file placed in $AIRFLOW_HOME/dags (even in sub-directories, such as common in your case) will be parsed by Airflow. If you do not want this you can use .airflowignore or packaged DAGs.
You need to assign the dag to an exported variable in the module. If the dag isn't in the module's __dict__, Airflow's DagBag processor won't pick it up.
Check out the source here: https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L428
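A tiny illustrative sketch of that rule (the dag_ids and schedule below are made up): a DAG assigned to a module-level variable is discovered, while one created inside a function whose return value is discarded never reaches the module's __dict__ and is ignored.
from datetime import datetime
from airflow.models import DAG

# Module-level assignment: this DAG ends up in the module's __dict__ and is discovered.
dag = DAG('will_be_found', start_date=datetime(2018, 10, 1), schedule_interval='*/10 * * * *')

def build():
    return DAG('will_not_be_found', start_date=datetime(2018, 10, 1))

build()  # return value is discarded, so this DAG never reaches globals() and is ignored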
As mentioned here, you must return the dag after creating it!
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

default_args = ...

def create_dag(project, version):
    dag_id = project + '_' + version
    dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
    print('creating DAG ' + dag_id)

    t1 = BashOperator(
        task_id='print_date',
        bash_command='date',
        dag=dag)

    t2 = BashOperator(
        task_id='sleep',
        bash_command='sleep 5',
        retries=3,
        dag=dag)

    t2.set_upstream(t1)

    return dag  # Add this line to your code!
I'm running Airflow 1.9.0 with LocalExecutor and a PostgreSQL database on a Linux AMI. I want to manually trigger DAGs, but whenever I create a DAG that has schedule_interval set to None or to @once, the webserver's tree view crashes with the following error (I only show the last call):
File "/usr/local/lib/python2.7/site-packages/croniter/croniter.py", line 467, in expand
raise CroniterBadCronError(cls.bad_length)
CroniterBadCronError: Exactly 5 or 6 columns has to be specified for iteratorexpression.
Furthermore, when I manually trigger the DAG, a DAG run starts but the tasks themselves are never scheduled. I've looked around, but it seems that I'm the only one with this type of error. Has anyone encountered this error before and found a fix?
Minimal example triggering the problem:
import datetime as dt
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=dt.datetime(2018, 7, 24),
         schedule_interval='None',
         catchup=False
         ) as dag:
    first_task = BashOperator(task_id="first_task", bash_command=bash_command)
Try this:
Set your schedule_interval to None without the quotes, or simply do not specify any schedule_interval in your DAG. More information on that here: airflow docs -- search for schedule_interval
Set orchestration for your tasks at the bottom of the dag.
Like so:
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

default_args = {
    'owner': 'me'
}

bash_command = """
echo "this is a test task"
"""

with DAG('schedule_test',
         default_args=default_args,
         start_date=datetime(2018, 7, 24),
         schedule_interval=None,
         catchup=False
         ) as dag:

    t1 = DummyOperator(
        task_id='extract_data',
        dag=dag
    )

    t2 = BashOperator(
        task_id="first_task",
        bash_command=bash_command
    )

    ##### ORCHESTRATION #####
    ## It is saying that in order for t2 to run, t1 must be done.
    t2.set_upstream(t1)
The None value should not be in quotes.
It should be like this:
schedule_interval=None
Here is the documentation link: https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html#:~:text=Note%3A%20Use%20schedule_interval%3DNone%20and%20not%20schedule_interval%3D%27None%27%20when%20you%20don%E2%80%99t%20want%20to%20schedule%20your%20DAG
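To make that contrast concrete, here is a minimal sketch (the dag_id and start_date mirror the question; everything else is stripped down): the quoted string 'None' ends up being treated as a cron expression, producing the CroniterBadCronError seen above, while the Python literal None disables scheduling so the DAG only runs when triggered manually.
from datetime import datetime
from airflow import DAG

# Wrong: the string 'None' is later parsed as a cron expression and raises CroniterBadCronError.
# broken_dag = DAG('schedule_test', start_date=datetime(2018, 7, 24), schedule_interval='None')

# Right: the literal None means "no schedule"; the DAG only runs on manual triggers.
dag = DAG('schedule_test', start_date=datetime(2018, 7, 24), schedule_interval=None)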
Consider the following example of a DAG where the first task, get_id_creds, extracts a list of credentials from a database. This operation tells me what users in my database I am able to run further data preprocessing on and it writes those ids to the file /tmp/ids.txt. I then scan those ids into my DAG and use them to generate a list of upload_transaction tasks that can be run in parallel.
My question is: Is there a more idiomatically correct, dynamic way to do this with Airflow? What I have here feels clumsy and brittle. How can I directly pass a list of valid IDs from one process to the one that defines the subsequent downstream processes?
from datetime import datetime, timedelta
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
    'start_date': datetime.now(),
    'schedule_interval': None
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

get_id_creds = PythonOperator(
    task_id='get_id_creds',
    python_callable=dash_workers.get_id_creds,
    provide_context=True,
    dag=DAG)

with open('/tmp/ids.txt', 'r') as infile:
    uids = infile.read().splitlines()

for uid in uids:
    upload_transactions = PythonOperator(
        task_id=uid,
        python_callable=dash_workers.upload_transactions,
        op_args=[uid],
        dag=DAG)
    upload_transactions.set_upstream(get_id_creds)  # get_id_creds runs first
Per @Juan Riza's suggestion I checked out this link: Proper way to create dynamic workflows in Airflow. This was pretty much the answer, although I was able to simplify the solution enough that I thought I would offer my own modified version of the implementation here:
from datetime import datetime
import os
import sys
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import ds_dependencies
SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')
if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

ENV = os.environ

default_args = {
    # 'start_date': datetime.now(),
    'start_date': datetime(2017, 7, 18)
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=DAG)

def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)

for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)
clear_tables cleans the database that will be re-built as a result of the process. id_worker is a function that dynamically generates new preprocessing tasks, based on the array of ID values returned from get_id_creds. The task ID is just the corresponding user ID, though it could easily have been an index, i, as in the example mentioned above.
NOTE: The bitshift operator (>>) looked backwards to me at first, as the clear_tables task should come first, but clear_tables >> id_worker(uid) sets clear_tables upstream of each worker task, and it's what works in this case.
Consider that Apache Airflow is a workflow management tool, i.e. it determines the dependencies between the tasks that the user defines, in contrast to (for example) Apache NiFi, which is a dataflow management tool, i.e. there the dependencies are data that are transferred through the tasks.
That said, I think your approach is quite right (my comment is based on the code posted), but Airflow offers a concept called XCom. It allows tasks to "cross-communicate" by passing some data between them. How big can the passed data be? It is up to you to test! But generally it should not be very big. I think it takes the form of key/value pairs and it gets stored in the Airflow meta-database, i.e. you can't pass files, for example, but a list of ids could work.
Like I said, you should test it yourself. I would be very happy to know about your experience. Here is an example dag which demonstrates the use of XCom and here is the necessary documentation. Cheers!
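As a deliberately simplified sketch of that XCom suggestion (the function bodies below are stand-ins for the real dash_workers logic): get_id_creds returns the list of ids, which Airflow pushes to XCom, and a single downstream task pulls it. Note that this collapses the per-uid fan-out from the original DAG into one task, because XCom values are only available at run time, not at DAG-parsing time.
from datetime import datetime
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id='dash_preproc_xcom', start_date=datetime(2017, 7, 18), schedule_interval=None)

def get_id_creds():
    uids = ['user_1', 'user_2', 'user_3']  # stand-in for the real credentials query
    return uids                            # pushed to XCom as 'return_value'

def upload_transactions(**context):
    # Pull the id list pushed by the upstream task and process each id in turn.
    uids = context['ti'].xcom_pull(task_ids='get_id_creds')
    for uid in uids:
        print('uploading transactions for', uid)

get_creds = PythonOperator(
    task_id='get_id_creds',
    python_callable=get_id_creds,
    dag=dag)

upload = PythonOperator(
    task_id='upload_transactions',
    python_callable=upload_transactions,
    provide_context=True,
    dag=dag)

get_creds >> upload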