I am using Apache Airflow 1.10.9 (based on puckel/docker-airflow docker image) to run several Python scripts in a DAG via the BashOperator. The logs are currently written to /usr/local/airflow/logs.
Is it possible to configure Airflow to
also write the logs to another directory like /home/foo/logs
The logs should only contain the stdout from the python scripts
The logs should be stored in the following directory/filename format:
/home/foo/logs/[execution-date]-[dag-id]-[task-id].log
Retries should be appended to the same .log file, if possible. Otherwise, we can have the naming convention:
/home/foo/logs/[execution-date]-[dag-id]-[task-id]-[retry-number].log
Thanks everyone!
Example DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = { ... }
dag = DAG(
'mydag',
default_args=default_args,
schedule_interval='*/10 * * * *',
)
# Log to /home/foo/logs/2020-05-12-mydag-hello_world.log
t1 = BashOperator(
task_id='hello_world',
bash_command='/path/to/env/bin/python /path/to/scripts/hello_world.py',
dag=dag,
)
# Log to /home/foo/logs/2020-05-12-mydag-hey_there.log
t2 = BashOperator(
task_id='hey_there',
bash_command='/path/to/env/bin/python /path/to/scripts/hey_there.py',
dag=dag,
)
t1 >> t2
https://bcb.github.io/airflow/run-dag-and-watch-logs
This link has an answer.
Set the FILENAME_TEMPLATE setting.
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.dag_id }}.log"
or
you can edit the airflow.cfg file
log_filename_template variable
add any airflow related variables there.
Related
I'm trying to make a dag that has 2 operators that are created dynamically, depending on the number of "pipelines" that a json config file has. this file is stored in the variable dag_datafusion_args. Then I have a standard bash operator, and I have a task called success at the end that sends a message to the slack saying that the dag is over. the other 2 tasks that are python operators are generated dynamically and run in parallel. I'm using the composer, when I put the dag in the bucket it appears on the webserver ui, but when I click to see the dag the following message appears'DAG "dag_lucas4" seems to be missing. ', If I test the tasks directly by CLI on the kubernetes cluster it works! But I can't seem to make the web UI appear. I tried to do as a suggestion of some people here in SO to restart the webserver by installing a python package, I tried 3x but without success. Does anyone know what can it be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX=1
default_args = {
'owner':'lucas',
'start_date': '2021-01-10',
'email': ['xxxx'],
'email_on_failure': False,
'email_on_success': False,
'retries': 3,
'retry_delay': timedelta(minutes=2),
'on_failure_callback': post_message_fail_to_slack
}
dag_datafusion_args=return_datafusion_config_file('med')
with DAG('dag_lucas4', default_args = default_args, schedule_interval="30 23 * * *", template_searchpath = [TEMPLATE_SEARCH_PATH]) as dag:
extract_ftp_csv_files_load_in_gcs = BashOperator(
task_id='extract_ftp_csv_files_load_in_gcs',
bash_command='aux_sh_files/med/script.sh'
)
success = PythonOperator(
task_id='success',
python_callable=post_message_success_to_slack,
op_kwargs={'dag_name':'dag_lucas2'}
)
for pipeline,args in dag_datafusion_args.items():
configure_pipeline=PythonOperator(
task_id=f'configure_pipeline{str(INDEX)}',
python_callable=setPipelineArguments,
op_kwargs={'dag_name':'med', 'pipeline_name':pipeline},
provide_context=True
)
start_pipeline = PythonOperator(
task_id= f'start_pipeline{str(INDEX)}',
python_callable=start_pipeline_wrapper,
op_kwargs={'configure_pipeline_task':f'configure_pipeline{str(INDEX)}'},
retries=3,
provide_context=True
)
[extract_ftp_csv_files_load_in_gcs,configure_pipeline] >> start_pipeline >> success
INDEX += 1
Appears that The Airflow-Webserver in Cloud Composer runs in the tenant project, the worker and scheduler runs in the customer project. Tenant project is nothing but its google side managed environment for some part of airflow components. So the Webserver UI doesn't have complete access to your project resources. As it doesn't run under your project's environment. So I can read my config json file with return_datafusion_config_file . Best way is create an ENV variable with that file.
Below is the airflow DAG code. It runs perfectly both when airflow is hosted locally, and on cloud composer. However, the DAG itself isn't clickable in the Composer UI.
I found a similar question and tried the accepted answer as linked in this question. My problem is similar.
import airflow
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from airflow.operators.mysql_operator import MySqlOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterCreateOperator
from airflow.contrib.operators.dataproc_operator import DataprocClusterDeleteOperator
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
from datetime import datetime, timedelta
import sys
#copy this package to dag directory in GCP composer bucket
from schemas.schemaValidator import loadSchema
from schemas.schemaValidator import sparkArgListToMap
#change these paths to point to GCP Composer data directory
## cluster config
clusterConfig= loadSchema("somePath/jobConfig/cluster.yaml","cluster")
##per job yaml config
autoLoanCsvToParquetConfig= loadSchema("somePath/jobConfig/job.yaml","job")
default_args= {
'owner': 'airflow',
'depends_on_past': False,
'start_date': datetime(2019, 1, 1),
'retries': 1,
'retry_delay': timedelta(minutes=3)
}
dag= DAG('usr_job', default_args=default_args, schedule_interval=None)
t1= DummyOperator(task_id= "start", dag=dag)
t2= DataprocClusterCreateOperator(
task_id= "CreateCluster",
cluster_name= clusterConfig["cluster"]["cluster_name"],
project_id= clusterConfig["project_id"],
num_workers= clusterConfig["cluster"]["worker_config"]["num_instances"],
image_version= clusterConfig["cluster"]["dataproc_img"],
master_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
worker_machine_type= clusterConfig["cluster"]["worker_config"]["machine_type"],
zone= clusterConfig["region"],
dag=dag
)
t3= DataProcSparkOperator(
task_id= "csvToParquet",
main_class= autoLoanCsvToParquetConfig["job"]["main_class"],
arguments= autoLoanCsvToParquetConfig["job"]["args"],
cluster_name= clusterConfig["cluster"]["cluster_name"],
dataproc_spark_jars= autoLoanCsvToParquetConfig["job"]["jarPath"],
dataproc_spark_properties= sparkArgListToMap(autoLoanCsvToParquetConfig["spark_params"]),
dag=dag
)
t4= DataprocClusterDeleteOperator(
task_id= "deleteCluster",
cluster_name= clusterConfig["cluster"]["cluster_name"],
project_id= clusterConfig["project_id"],
dag= dag
)
t5= DummyOperator(task_id= "stop", dag=dag)
t1>>t2>>t3>>t4>>t5
The UI gives this error - "This DAG isn't available in the webserver DAG bag object. It shows up in this list because the scheduler marked it as active in the metadata database."
And yet, when I triggered the DAG manually on Composer, I found it ran successfully through the log files.
The issue was with the path which was being provided for picking up the configuration files. I was giving path for the data folder in GCS. As per Google documentation, only dags folder is synced to all nodes, and not the data folder.
Needless to say, it was a issue encountered during dag parsing time, hence, it did not appear correctly on the UI. More interestingly, these debug messages were not exposed to Composer 1.5 and earlier. Now they are available to the end user to help in debugging. Thanks anyway to everyone who helped.
I am working in $AIRFLOW_HOME/dags. I have created the following files:
- common
|- __init__.py # empty
|- common.py # common code
- foo_v1.py # dag instanciation
In common.py:
default_args = ...
def create_dag(project, version):
dag_id = project + '_' + version
dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
print('creating DAG ' + dag_id)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
t2.set_upstream(t1)
In foo_v1.py:
from common.common import create_dag
create_dag('foo', 'v1')
When testing the script with python, it looks OK:
$ python foo_v1.py
[2018-10-29 17:08:37,016] {__init__.py:57} INFO - Using executor SequentialExecutor
creating DAG pgrandjean_pgrandjean_spark2.1.0_hadoop2.6.0
I then launch the webserver and the scheduler locally. My problem is that I don't see any DAG with id foo_v1. There is no pyc file being created. What is being done wrong? Why isn't the code in foo_v1.py being executed?
To be found by Airflow, the DAG object returned by create_dag() must be in the global namespace of the foo_v1.py module. One way to place a DAG in the global namespace is simply to assign it to a module level variable:
from common.common import create_dag
dag = create_dag('foo', 'v1')
Another way is to update the global namespace using globals():
globals()['foo_v1'] = create_dag('foo', 'v1')
The later may look like an overkill, but it is useful for creating multiple DAGs dynamically. For example, in a for-loop:
for i in range(10):
globals()[f'foo_v{i}'] = create_dag('foo', f'v{i}')
Note: Any *.py file placed in $AIRFLOW_HOME/dags (even in sub-directories, such as common in your case) will be parsed by Airflow. If you do not want this you can use .airflowignore or packaged DAGs.
You need to assign the dag to an exported variable in the module. If the dag isn't in the module __dict__ airflow's DagBag processor won't pick it up.
Check out the source here: https://github.com/apache/incubator-airflow/blob/master/airflow/models.py#L428
As it is mentioned in here, you must return the dag after creating it!
default_args = ...
def create_dag(project, version):
dag_id = project + '_' + version
dag = DAG(dag_id, default_args=default_args, schedule_interval='*/10 * * * *', catchup=False)
print('creating DAG ' + dag_id)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
t2.set_upstream(t1)
return dag # Add this line to your code!
I am new to Airflow. I wrote a simple code to save a list in a txt file as below:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime
DAG = DAG(
dag_id='example_dag',
start_date=datetime.datetime.now(),
schedule_interval='#once'
)
def push_function(**kwargs):
ls = ['a', 'b', 'c']
return ls
push_task = PythonOperator(
task_id='push_task',
python_callable=push_function,
provide_context=True,
dag=DAG)
def pull_function(**kwargs):
ti = kwargs['ti']
ls = ti.xcom_pull(task_ids='push_task')
with open('test.txt','w') as out:
out.write(ls)
out.close()
pull_task = PythonOperator(
task_id='pull_task',
python_callable=pull_function,
provide_context=True,
dag=DAG)
push_task >> pull_task
when I use web server interface I see my dag. Also, I see my dag when I wrote airflow list_dags in CLI.
I also compiled my code using python code.py, and the result was like below without any error:
[2017-12-16 14:21:30,609] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-12-16 14:21:30,709] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-12-16 14:21:30,741] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
I both tried to run the dag with UI and with command airflow trigger_dag Mydag
However, I cannot see my txt result file after running. there is no error in log file either.
How can I find my txt file?
I would try again with an absolute file path or you can try logging the current working directory inside the method with os.getcwd() to help locate your file.
I am new to airflow .In my company for ETL pipeline currently we are using Crontab and custom Scheduler(developed in-house) .Now we are planning to implement apache airflow for our all Data Pipe-line scenarios .For that while exploring the features not able to find unique_id for each Task Instances/Dag .When I searched most of the solutions ended up in macros and template .But none of them are not providing a uniqueID for a task .But I am able to see incremental uniqueID in the UI for each tasks .Is there any way to easily access those variables inside my python method .The main use case is I need to pass those ID's as a parameter to out Python/ruby/Pentaho jobs which is called as scripts/Methods .
For Example
my shell script 'test.sh ' need two arguments one is run_id and other is collection_id. Currently we are generating this unique run_id from a centralised Database and passing it to the jobs .If it is already present in the airflow context we are going to use that
from airflow.operators.bash_operator import BashOperator
from datetime import date, datetime, timedelta
from airflow import DAG
shell_command = "/data2/test.sh -r run_id -c collection_id"
putfiles_s3 = BashOperator(
task_id='putfiles_s3',
bash_command=shell_command,
dag=dag)
Looking for a unique run_id(Either Dag level/task level) for each run while executing this Dag(scheduled/manual)
Note: This is a sample task .There will be multiple dependant task to this Dag .
Attaching Job_Id screenshot from airflow UI
Thanks
Anoop R
{{ ti.job_id }} is what you want:
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow import DAG
dag = DAG(
"job_id",
start_date=datetime(2018, 1, 1),
)
with dag:
BashOperator(
task_id='unique_id',
bash_command="echo {{ ti.job_id }}",
)
This will be valid at runtime. A log from this execution looks like:
[2018-01-03 10:28:37,523] {bash_operator.py:80} INFO - Temporary script location: /tmp/airflowtmpcj0omuts//tmp/airflowtmpcj0omuts/unique_iddq7kw0yj
[2018-01-03 10:28:37,524] {bash_operator.py:88} INFO - Running command: echo 4
[2018-01-03 10:28:37,621] {bash_operator.py:97} INFO - Output:
[2018-01-03 10:28:37,648] {bash_operator.py:101} INFO - 4
Note that this will only be valid at runtime, so the "Rendered Template" view in the webui will show None instead of a number.