I am new to Airflow. I wrote a simple code to save a list in a txt file as below:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime
dag = DAG(
    dag_id='example_dag',
    start_date=datetime.datetime.now(),
    schedule_interval='@once'
)

def push_function(**kwargs):
    ls = ['a', 'b', 'c']
    return ls

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=dag)

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    with open('test.txt', 'w') as out:
        out.write(ls)

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=dag)

push_task >> pull_task
When I use the web server interface I see my DAG, and it also shows up when I run airflow list_dags in the CLI.
I also ran the file with python code.py, and the output was as below, without any error:
[2017-12-16 14:21:30,609] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-12-16 14:21:30,709] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-12-16 14:21:30,741] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
I tried to run the DAG both from the UI and with the command airflow trigger_dag Mydag.
However, I cannot find the txt result file after running, and there is no error in the log file either.
How can I find my txt file?
I would try again with an absolute file path, or log the current working directory inside the method with os.getcwd() to help locate your file.
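For instance, a hedged rewrite of pull_function along those lines (the /tmp output path and the str() conversion are my assumptions, not from the original post; note that write() expects a string, not a list):

```python
import os

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    # Log the worker's working directory so relative paths can be located
    print("cwd:", os.getcwd())
    # An absolute path makes the output location independent of the cwd
    with open('/tmp/test.txt', 'w') as out:
        out.write(str(ls))  # write() needs a string, not a list
```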
Related
I am running Airflow in a Docker container on my local machine, with a test DAG doing 3 tasks. The three tasks run fine; however, the last task, with the BashOperator, is stuck in a loop, as seen in the picture at the bottom. Looking at the log file, an entry is only generated for the first execution of the bash Python script, then nothing, but the Python file keeps getting executed. Any suggestions as to what the issue could be?
Thanks,
Richard
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator

def creating_dataframe(ti):
    import pandas as pd
    import os
    loc = r'/opt/airflow/dags/'
    filename = r'demo.csv'
    df_location = loc + filename
    ti.xcom_push(key='df_location', value=df_location)
    if os.path.exists(loc + filename):
        print("if exists")
        return df_location
    else:
        df = pd.DataFrame({'GIA_AIRFLOW_DEMO': ['First entry']},
                          index=[pd.Timestamp.now()])
        df.to_csv(loc + filename, sep=';')
        print("does not exist")
        return df_location

def adding_row_to_dataframe(ti):
    import pandas as pd
    fetched_location = ti.xcom_pull(key='df_location', task_ids=['creating_dataframe'])[0]
    df = pd.read_csv(fetched_location, index_col=0, sep=';')
    new_df = pd.DataFrame({'GIA_AIRFLOW_DEMO': ['adding entry to demo file']},
                          index=[pd.Timestamp.now()])
    df2 = pd.concat([df, new_df])
    df2.to_csv(fetched_location, sep=';')
    print("second function")

with DAG(
    dag_id="richards_airflow_demo",
    schedule_interval="@once",
    start_date=datetime(2022, 2, 17),
    catchup=False,
    tags=["this is a demo of airflow", "adding row"],
) as dag:
    task1 = PythonOperator(
        task_id="creating_dataframe",
        python_callable=creating_dataframe,
        do_xcom_push=True
    )
    task2 = PythonOperator(
        task_id='adding_row_to_dataframe',
        python_callable=adding_row_to_dataframe
    )
    task3 = BashOperator(
        task_id='python_bash_script',
        bash_command=r"echo 'python /opt/scripts/test.py'"
    )
    task1 >> task2 >> task3
Bash python script:
import pandas as pd

df = pd.read_csv('/opt/airflow/dags/demo.csv', index_col=0, sep=';')
new_df = pd.DataFrame({'GIA_AIRFLOW_DEMO': ['adding entry with bash python script']},
                      index=[pd.Timestamp.now()])
df2 = pd.concat([df, new_df])
df2.to_csv('/opt/airflow/dags/demo.csv', sep=';')
Example of issue
Log file for bashoperator
All right, I didn't research why this is the case, but it seems that if I create a scripts folder inside the dags folder, the Python script inside (test_dontputthescripthere.py) is executed even though the BashOperator isn't telling it to execute. As you can see, the BashOperator is executing the test.py file perfectly, and it adds the following line to the csv:
2022-02-21 15:11:53.923284;adding entry with bash python script
The test_dontputthescripthere.py is executed in a loop, without the BashOperator executing the file. These are all the "- and this is wrong" entries in the demo.csv file.
I suspect some kind of refresh is going on inside Airflow, forcing it to execute the Python file.
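One likely explanation for the loop: the scheduler periodically re-parses every .py file under the dags/ folder, executing its top-level code, so a script placed there runs on every parse cycle even though no operator calls it. A hedged workaround is to keep scripts out of dags/ entirely, or to exclude the folder with an .airflowignore file in the dags/ root (the folder name scripts below matches the layout described above):

```
# /opt/airflow/dags/.airflowignore
scripts
```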
I'm trying to make a DAG that has 2 operators created dynamically, depending on the number of "pipelines" a JSON config file has. This file is stored in the variable dag_datafusion_args. Then I have a standard BashOperator, and a task called success at the end that sends a message to Slack saying the DAG is over. The other 2 tasks, which are PythonOperators, are generated dynamically and run in parallel.
I'm using Composer. When I put the DAG in the bucket it appears in the webserver UI, but when I click to see the DAG the following message appears: 'DAG "dag_lucas4" seems to be missing.' If I test the tasks directly via the CLI on the Kubernetes cluster, it works, but I can't get the web UI to show the DAG. As some people here on SO suggested, I tried to restart the webserver by installing a Python package; I tried 3 times, without success. Does anyone know what it could be?
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from aux_py_files.med.med_airflow_functions import *
from google.cloud import storage
from datetime import timedelta
TEMPLATE_SEARCH_PATH = '/home/airflow/gcs/plugins/'
INDEX = 1

default_args = {
    'owner': 'lucas',
    'start_date': '2021-01-10',
    'email': ['xxxx'],
    'email_on_failure': False,
    'email_on_success': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=2),
    'on_failure_callback': post_message_fail_to_slack
}

dag_datafusion_args = return_datafusion_config_file('med')

with DAG('dag_lucas4', default_args=default_args, schedule_interval='30 23 * * *', template_searchpath=[TEMPLATE_SEARCH_PATH]) as dag:
    extract_ftp_csv_files_load_in_gcs = BashOperator(
        task_id='extract_ftp_csv_files_load_in_gcs',
        bash_command='aux_sh_files/med/script.sh'
    )
    success = PythonOperator(
        task_id='success',
        python_callable=post_message_success_to_slack,
        op_kwargs={'dag_name': 'dag_lucas2'}
    )
    for pipeline, args in dag_datafusion_args.items():
        configure_pipeline = PythonOperator(
            task_id=f'configure_pipeline{str(INDEX)}',
            python_callable=setPipelineArguments,
            op_kwargs={'dag_name': 'med', 'pipeline_name': pipeline},
            provide_context=True
        )
        start_pipeline = PythonOperator(
            task_id=f'start_pipeline{str(INDEX)}',
            python_callable=start_pipeline_wrapper,
            op_kwargs={'configure_pipeline_task': f'configure_pipeline{str(INDEX)}'},
            retries=3,
            provide_context=True
        )
        [extract_ftp_csv_files_load_in_gcs, configure_pipeline] >> start_pipeline >> success
        INDEX += 1
It appears that the Airflow webserver in Cloud Composer runs in the tenant project, while the workers and scheduler run in the customer project. The tenant project is simply a Google-managed environment for some of the Airflow components, so the webserver UI does not have complete access to your project's resources, since it does not run inside your project's environment. That would explain why it cannot read your config JSON file via return_datafusion_config_file. The best way is to put that file's contents in an ENV variable.
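A hedged sketch of that suggestion (the variable name DATAFUSION_CONFIG and the JSON encoding are my assumptions): store the pipeline config in an environment variable so the DAG file can be parsed anywhere, without touching the filesystem.

```python
import json
import os

def load_datafusion_config(env_name="DATAFUSION_CONFIG"):
    """Return the pipeline config as a dict from an environment variable."""
    return json.loads(os.environ.get(env_name, "{}"))

# At parse time the DAG file reads its config without filesystem access:
dag_datafusion_args = load_datafusion_config()
```

With this in place, the for loop over dag_datafusion_args.items() above works unchanged.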
I am using Apache Airflow 1.10.9 (based on the puckel/docker-airflow Docker image) to run several Python scripts in a DAG via the BashOperator. The logs are currently written to /usr/local/airflow/logs.
Is it possible to configure Airflow to:
also write the logs to another directory like /home/foo/logs?
The logs should only contain the stdout from the Python scripts.
The logs should be stored in the following directory/filename format:
/home/foo/logs/[execution-date]-[dag-id]-[task-id].log
Retries should be appended to the same .log file, if possible. Otherwise, we can have the naming convention:
/home/foo/logs/[execution-date]-[dag-id]-[task-id]-[retry-number].log
Thanks everyone!
Example DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
default_args = { ... }
dag = DAG(
    'mydag',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# Log to /home/foo/logs/2020-05-12-mydag-hello_world.log
t1 = BashOperator(
    task_id='hello_world',
    bash_command='/path/to/env/bin/python /path/to/scripts/hello_world.py',
    dag=dag,
)

# Log to /home/foo/logs/2020-05-12-mydag-hey_there.log
t2 = BashOperator(
    task_id='hey_there',
    bash_command='/path/to/env/bin/python /path/to/scripts/hey_there.py',
    dag=dag,
)

t1 >> t2
This link has an answer: https://bcb.github.io/airflow/run-dag-and-watch-logs
Set the log filename template, either via an environment variable:
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.dag_id }}.log"
or by editing the log_filename_template variable in the airflow.cfg file, where you can set any Airflow-related configuration.
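To get closer to the filename format asked for above, a hedged sketch (the Jinja fields available to log_filename_template vary by Airflow version, so verify ts, ti.dag_id, ti.task_id and try_number against your installation):

```shell
# Equivalent to setting log_filename_template under [core] in airflow.cfg
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ts }}-{{ ti.dag_id }}-{{ ti.task_id }}-{{ try_number }}.log"
```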
For my data-extraction project I have gone for Apache Airflow, with GCP Composer and bucket storage.
I have several modules in a package in my repo on GitHub that my DAG file needs to access.
For now I'm using BashOperator to check if it works:
# dag.py
dag = DAG(
    dag_id='my_example_DAG',
    start_date=datetime(2019, 10, 17, 8, 25),
    schedule_interval=timedelta(minutes=15),
    default_args=default_args,
)

t1 = BashOperator(
    task_id='example_task',
    bash_command='python /home/airflow/gcs/data/my_example_maindir/main.py ',
    dag=dag)
t1
# main.py
def run_main(path_name):
    # Reads YML file into yml_info
    extractor_pool(yml_info)

def extractor_pool(yml_info):
    # do work
    ...

if __name__ == "__main__":
    test_path = 'Example/path/for/test.yml'
    run_main(test_path)
And it works: it starts main.py with test_path. But I want to use the function run_main to pass the correct path with the correct YML file for the task.
I have tried to sys.path.insert the dir inside my storage bucket where my modules are, but I get an import error.
Directories:
dir for my DAG files (cloned from my git repo) = Buckets/europe-west1-eep-envxxxxxxx-bucket/dags
dir for my scripts/packages = Buckets/europe-west1-eep-envxxxxxxx-bucket/data
# dag.py
import sys
sys.path.insert(0, "/home/airflow/gcs/data/Example/")
from Example import main

dag = DAG(
    dag_id='task_1_dag',
    start_date=datetime(2019, 10, 13),
    schedule_interval=timedelta(minutes=10),
    default_args=default_args,
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=main.run_main,
    op_args={'path_name': "project_output_0184_Storgaten_33"},
    dag=dag
)
t1
This results in a "module not found" error and does not work.
I have done some reading in GCP and found this:
Installing a Python dependency from a private repository:
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
It says I need to place it in the directory path /config/pip/,
for example: gs://us-central1-b1-6efannnn-bucket/config/pip/pip.conf
But in my GCP storage bucket I have no directory named config.
I have tried to retrace my steps from when I created the bucket and environment, but I can't figure out what I have done wrong.
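One thing worth checking before the pip.conf route: for from Example import main to resolve, sys.path must contain the parent directory of the package, but the code above inserts .../data/Example/ itself. A self-contained sketch of the rule (the temporary package below is illustrative, not your real bucket layout):

```python
import os
import sys
import tempfile

# Build a throwaway package: <root>/Example/{__init__.py, main.py}
root = tempfile.mkdtemp()
pkg = os.path.join(root, "Example")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "main.py"), "w") as f:
    f.write("def run_main(path_name):\n    return 'ran ' + path_name\n")

# Insert the PARENT of the package (root), not the package dir itself
sys.path.insert(0, root)
from Example import main

print(main.run_main("demo.yml"))  # -> ran demo.yml
```

Applied to the bucket layout above, that would suggest inserting /home/airflow/gcs/data rather than /home/airflow/gcs/data/Example/.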
GCS has no true notion of folders or directories; what you actually have is a series of blobs whose names may contain slashes and give the appearance of directories.
The instructions are a bit unclear in asking you to put it in a directory: what you actually want to do is create a file and give it the name prefix config/pip/pip.conf.
With gsutil you'd do something like:
gsutil cp my-local-pip.conf gs://[DESTINATION_BUCKET_NAME]/config/pip/pip.conf
I am new to Airflow. In my company we currently use crontab and a custom in-house scheduler for our ETL pipelines, and we are now planning to adopt Apache Airflow for all our data pipeline scenarios. While exploring the features, I was not able to find a unique ID for each task instance/DAG. Most of the solutions I found ended up in macros and templates, but none of them provide a unique ID for a task. Yet I can see an incremental unique ID in the UI for each task. Is there any way to easily access those variables inside my Python method? The main use case is that I need to pass those IDs as parameters to our Python/Ruby/Pentaho jobs, which are called as scripts/methods.
For Example
My shell script test.sh needs two arguments: one is run_id and the other is collection_id. Currently we generate this unique run_id from a centralised database and pass it to the jobs. If it is already present in the Airflow context, we would like to use that.
from airflow.operators.bash_operator import BashOperator
from datetime import date, datetime, timedelta
from airflow import DAG
shell_command = "/data2/test.sh -r run_id -c collection_id"
putfiles_s3 = BashOperator(
    task_id='putfiles_s3',
    bash_command=shell_command,
    dag=dag)
I am looking for a unique run_id (either DAG-level or task-level) for each run while executing this DAG (scheduled/manual).
Note: this is a sample task; there will be multiple dependent tasks in this DAG.
Attaching Job_Id screenshot from airflow UI
Thanks
Anoop R
{{ ti.job_id }} is what you want:
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator
from airflow import DAG
dag = DAG(
    "job_id",
    start_date=datetime(2018, 1, 1),
)

with dag:
    BashOperator(
        task_id='unique_id',
        bash_command="echo {{ ti.job_id }}",
    )
This will be valid at runtime. A log from this execution looks like:
[2018-01-03 10:28:37,523] {bash_operator.py:80} INFO - Temporary script location: /tmp/airflowtmpcj0omuts//tmp/airflowtmpcj0omuts/unique_iddq7kw0yj
[2018-01-03 10:28:37,524] {bash_operator.py:88} INFO - Running command: echo 4
[2018-01-03 10:28:37,621] {bash_operator.py:97} INFO - Output:
[2018-01-03 10:28:37,648] {bash_operator.py:101} INFO - 4
Note that this will only be valid at runtime, so the "Rendered Template" view in the web UI will show None instead of a number.
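If you need the value inside a Python callable rather than a Jinja template, the same task instance is available in the Airflow context; a hedged sketch (the helper name get_job_id is mine):

```python
def get_job_id(**context):
    """Return the incremental job id from the task instance in the context."""
    ti = context['ti']
    return ti.job_id

# Usage sketch (Airflow 1.x):
# PythonOperator(task_id='show_id', python_callable=get_job_id, provide_context=True)
```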