For my data extraction project I have gone with Apache Airflow, using GCP Composer and bucket storage.
I have several modules in a package in my GitHub repo that my DAG file needs to access.
For now I'm using BashOperator to check that it works:
#dag.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='my_example_DAG',
    start_date=datetime(2019, 10, 17, 8, 25),
    schedule_interval=timedelta(minutes=15),
    default_args=default_args,
)

t1 = BashOperator(
    task_id='example_task',
    bash_command='python /home/airflow/gcs/data/my_example_maindir/main.py ',
    dag=dag)
t1
#main.py
def run_main(path_name):
    # Reads the YML file at path_name into yml_info
    extractor_pool(yml_info)

def extractor_pool(yml_info):
    # do work
    ...

if __name__ == "__main__":
    test_path = "Example/path/for/test.yml"
    run_main(test_path)
And it works: it starts main.py with the test_path. But I want to call the function run_main with the correct path to the correct YML file for each task.
I have tried to sys.path.insert the dir inside my storage bucket where my modules are, but I get an import error.
Dirs:
dir for my DAG files (cloned from my Git repo) = Buckets/europe-west1-eep-envxxxxxxx-bucket/dags
dir for my scripts/packages = Buckets/europe-west1-eep-envxxxxxxx-bucket/data
#dag.py
import sys
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

sys.path.insert(0, "/home/airflow/gcs/data/Example/")
from Example import main

dag = DAG(
    dag_id='task_1_dag',
    start_date=datetime(2019, 10, 13),
    schedule_interval=timedelta(minutes=10),
    default_args=default_args,
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=main.run_main,
    op_args={'path_name': "project_output_0184_Storgaten_33"},
    dag=dag
)
t1
This results in a "module not found" error and does not work.
I have done some reading on GCP and found this:
Installing a Python dependency from a private repository:
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
It says I need to place it in the directory path /config/pip/,
example: gs://us-central1-b1-6efannnn-bucket/config/pip/pip.conf
But in my GCP storage bucket I have no directory named config.
I have tried to retrace my steps from when I created the bucket and environment, but I can't figure out what I have done wrong.
GCS has no true notion of folders or directories; what you actually have is a series of blobs whose names may contain slashes, which gives the appearance of a directory.
The instructions are a bit unclear in asking you to put it in a directory: what you actually want to do is create a file and give it the name config/pip/pip.conf.
With gsutil you'd do something like:
gsutil cp my-local-pip.conf gs://[DESTINATION_BUCKET_NAME]/config/pip/pip.conf
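If you prefer to do the same thing from Python rather than gsutil, a minimal sketch using the google-cloud-storage client (the bucket name is a placeholder) could look like:
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("DESTINATION_BUCKET_NAME")  # e.g. your Composer environment's bucket
blob = bucket.blob("config/pip/pip.conf")          # the blob name carries the whole "path"
blob.upload_from_filename("my-local-pip.conf")     # upload the local pip.conf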
When creating a Pipeline with the Python SDK v2 for Azure ML, all contents of my current working directory are uploaded. Can I blacklist some files from being uploaded? E.g. I use load_env(".env") in order to read some credentials, but I don't want it to be uploaded.
Directory content:
./src
utilities.py # contains helper function to get Azure credentials
.env # contains credentials
conda.yaml
script.py
A minimal pipeline example:
import mldesigner
import mlflow
from azure.ai.ml import MLClient
from azure.ai.ml.dsl import pipeline

from src.utilities import get_credential

credential = get_credential()  # calls load_env(".env") locally

ml_client = MLClient(
    credential=credential,
    subscription_id="foo",
    resource_group_name="bar",
    workspace_name="foofoo",
)

@mldesigner.command_component(
    name="testcomponent",
    display_name="Test Component",
    description="Test Component description.",
    environment=dict(
        conda_file="./conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
    ),
)
def test_component():
    mlflow.log_metric("metric", 0)

cluster_name = "foobar"

@pipeline(default_compute=cluster_name)
def pipe():
    test_component()

pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
After running python script.py the pipeline job is created and runs in Azure ML. If I look at the pipeline in the Azure ML UI, inspect Test Component and open the Code tab, I find all source files, including .env.
How can I prevent this file from being uploaded when creating a pipeline job via the SDK?
You can use a .gitignore or .amlignore file in your working directory to specify files and directories to ignore; by default, these files will not be included when you run the pipeline.
Here is the documentation on preventing unnecessary files from being included.
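For instance, an .amlignore file in the working directory containing just the following pattern should keep the credentials file out of the uploaded snapshot (it uses the same pattern syntax as .gitignore):
.env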
or

import os

# Get all files in the current working directory
all_files = os.listdir()
# Remove the ".env" file from the list of files
all_files.remove(".env")

@pipeline(default_compute=cluster_name, files=all_files)
def pipe():
    test_component()

pipeline_job = pipe()
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
I am using Apache Airflow 1.10.9 (based on puckel/docker-airflow docker image) to run several Python scripts in a DAG via the BashOperator. The logs are currently written to /usr/local/airflow/logs.
Is it possible to configure Airflow so that:
1. The logs are also written to another directory like /home/foo/logs
2. The logs only contain the stdout from the Python scripts
3. The logs are stored in the following directory/filename format:
/home/foo/logs/[execution-date]-[dag-id]-[task-id].log
Retries should be appended to the same .log file, if possible. Otherwise, we can use the naming convention:
/home/foo/logs/[execution-date]-[dag-id]-[task-id]-[retry-number].log
Thanks everyone!
Example DAG
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = { ... }

dag = DAG(
    'mydag',
    default_args=default_args,
    schedule_interval='*/10 * * * *',
)

# Log to /home/foo/logs/2020-05-12-mydag-hello_world.log
t1 = BashOperator(
    task_id='hello_world',
    bash_command='/path/to/env/bin/python /path/to/scripts/hello_world.py',
    dag=dag,
)

# Log to /home/foo/logs/2020-05-12-mydag-hey_there.log
t2 = BashOperator(
    task_id='hey_there',
    bash_command='/path/to/env/bin/python /path/to/scripts/hey_there.py',
    dag=dag,
)

t1 >> t2
https://bcb.github.io/airflow/run-dag-and-watch-logs
This link has an answer.
Set the FILENAME_TEMPLATE setting.
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ti.dag_id }}.log"
or you can edit the log_filename_template variable in the airflow.cfg file; any Airflow-related settings can be set there as well.
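For the naming scheme asked for above, the template can combine the fields available on the task instance; a hedged sketch in the same environment-variable style (field availability may vary by Airflow version, and note that task logs will still contain Airflow's own log lines, not only the script's stdout):
export AIRFLOW__CORE__BASE_LOG_FOLDER="/home/foo/logs"
export AIRFLOW__CORE__LOG_FILENAME_TEMPLATE="{{ ts }}-{{ ti.dag_id }}-{{ ti.task_id }}-{{ try_number }}.log"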
I am trying to create a DAG that generates tasks dynamically based on a JSON file located in storage. I followed this guide step-by-step:
https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
But the DAG gets stuck with the following message:
Is it possible to read an external file and use it to create tasks dynamically in Composer? I can do this when I read data only from an Airflow Variable, but when I read an external file, the DAG gets stuck in the "isn't available in the web server's DagBag object" state. I need to read from an external file because the contents of the JSON will change with every execution.
I am using composer-1.8.2-airflow-1.10.2.
I read this answer to a similar question:
Dynamic task definition in Airflow
But I am not trying to create the tasks based on a separate task, only based on the external file.
This is my second approach, which also gets stuck in that error state:
import datetime

import airflow
from airflow.operators import bash_operator
from airflow.operators.dummy_operator import DummyOperator
from airflow.models import Variable
import json
import os

products = json.loads(Variable.get("products"))

default_args = {
    'owner': 'Composer Example',
    'depends_on_past': False,
    'email': [''],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': datetime.timedelta(minutes=5),
    'start_date': datetime.datetime(2020, 1, 10),
}

with airflow.DAG(
        'json_test2',
        default_args=default_args,
        # Not scheduled, trigger only
        schedule_interval=None) as dag:

    # Print the dag_run's configuration, which includes information about the
    # Cloud Storage object change.
    def read_json_file(file_path):
        if os.path.exists(file_path):
            with open(file_path, 'r') as f:
                return json.load(f)

    def get_run_list(files):
        run_list = []
        # The file is uploaded in the storage bucket used as a volume by Composer
        last_exec_json = read_json_file("/home/airflow/gcs/data/last_execution.json")
        date = last_exec_json["date"]
        hour = last_exec_json["hour"]
        for file in files:
            # Testing by adding just date and hour
            name = file['name'] + f'_{date}_{hour}'
            run_list.append(name)
        return run_list

    rl = get_run_list(products)

    start = DummyOperator(task_id='start', dag=dag)
    end = DummyOperator(task_id='end', dag=dag)

    for name in rl:
        tsk = DummyOperator(task_id=name, dag=dag)
        start >> tsk >> end
It is possible to create a DAG that generates tasks dynamically based on a JSON file located in a Cloud Storage bucket. I followed the guide that you provided, and it works perfectly in my case.
Firstly, you need to upload your JSON configuration file to the $AIRFLOW_HOME/dags directory, and then the DAG Python file to the same path (you can find the path in the airflow.cfg file, which is located in the bucket).
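For example, with gsutil (the bucket name and the DAG file name are placeholders following the question's example):
gsutil cp last_execution.json gs://[BUCKET_NAME]/dags/
gsutil cp my_dynamic_dag.py gs://[BUCKET_NAME]/dags/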
Later on, you will be able to see the DAG in the Airflow UI.
As for the message "DAG isn't available in the web server's DagBag object": it means the DAG isn't available on the Airflow web server. However, the DAG can still be scheduled as active, because the Airflow scheduler works independently of the Airflow web server.
When a lot of DAGs are loaded at once into a Composer environment, it may overload the environment. As the Airflow web server runs in a Google-managed project, only certain types of updates will cause the web server container to be restarted, such as adding or upgrading one of the PyPI packages or changing an Airflow setting. The workaround is to add a dummy environment variable:
Open the Composer instance in GCP, go to the ENVIRONMENT VARIABLES tab, click Edit, then add an environment variable and Submit.
Alternatively, you can use the following commands to restart it:
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --update-airflow-configs=core-dummy=true
gcloud composer environments update ${ENVIRONMENT_NAME} --location=${ENV_LOCATION} --remove-airflow-configs=core-dummy
I hope you find the above pieces of information useful.
We're setting up an Airflow framework in which multiple data scientist teams can orchestrate their data processing pipelines. We've developed a Python code-base to help them implement the DAGs, which includes functions and classes (Operator sub-classes as well) in various packages and modules.
Every team will have their own DAG, packaged in a ZIP file together with the functions and classes in packages. For example, the first ZIP file would contain:
ZIP1:
main_dag_teamA.py
subfolder1: package1-with-generic-functions + __init__.py
subfolder2: package2-with-generic-operators + __init__.py
And another ZIP file would contain
ZIP2:
main_dag_teamB.py
subfolder1: package1-with-generic-functions + __init__.py
subfolder2: package2-with-generic-operators + __init__.py
Please note that in both ZIP files subfolder1 and subfolder2 will usually be exactly the same, meaning the exact same files with the same functions and classes.
But over time, as new versions of the packages become available, the package contents will start to deviate across the DAG packages.
With this setup we bump into the following problem: it seems that Airflow does not handle same-named packages very well when the contents of the packages/subfolders start deviating across the ZIPs.
When I run "airflow list_dags" it shows errors like:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in > from subfolder1.functions1 import function1
ImportError: No module named 'subfolder1.functions1'
The problem can be reproduced with the following code, where two small DAGs are in their own ZIP files together with the package my_functions, which has the same name but different contents.
DAG package ZIP 1:
program1.py:
from my_functions.functions1 import function1

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def do_it():
    print('program1')

dag = DAG(
    'program1',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program1_task1', python_callable=do_it, dag=dag)

my_functions/functions1.py:
def function1():
    print('function1')
DAG package ZIP 2:
program2.py:
from my_functions.functions2 import function2

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def do_it():
    print('program2')

dag = DAG(
    'program2',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program2_task2', python_callable=do_it, dag=dag)

my_functions/functions2.py:
def function2():
    print('function2')
With these two ZIP files, when I run "airflow list_dags" it shows an error:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in <module>
    from subfolder1.functions1 import function1
ImportError: No module named 'subfolder1.functions1'
When the contents of the subfolders in the ZIPs are the same, no error occurs.
My question: how can I prevent this clash of subfolders in the ZIPs? I really would like to have fully code-independent DAGs, each with its own version of the packages.
Solved by doing the following at the top of the DAGs (program1.py and program2.py), before the
from my_functions.functions1 import function1
and
from my_functions.functions2 import function2
Code:
import sys

# Clean up the already imported my_functions modules so that each DAG parse
# re-imports its own copy from its own ZIP
cleanup_mods = []
for mod in sys.modules:
    if mod.startswith("my_functions"):
        cleanup_mods.append(mod)
for mod in cleanup_mods:
    del sys.modules[mod]
This makes sure that on every parse of a DAG, the previously imported modules are cleaned up, so each DAG re-imports its own copy.
I am new to Airflow. I wrote some simple code to save a list to a txt file, as below:
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
import datetime

DAG = DAG(
    dag_id='example_dag',
    start_date=datetime.datetime.now(),
    schedule_interval='@once'
)

def push_function(**kwargs):
    ls = ['a', 'b', 'c']
    return ls

push_task = PythonOperator(
    task_id='push_task',
    python_callable=push_function,
    provide_context=True,
    dag=DAG)

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    with open('test.txt', 'w') as out:
        out.write(ls)
        out.close()

pull_task = PythonOperator(
    task_id='pull_task',
    python_callable=pull_function,
    provide_context=True,
    dag=DAG)

push_task >> pull_task
When I use the web server interface I see my DAG. Also, I see my DAG when I run airflow list_dags in the CLI.
I also ran my code with python code.py, and the result was as below, without any error:
[2017-12-16 14:21:30,609] {__init__.py:57} INFO - Using executor SequentialExecutor
[2017-12-16 14:21:30,709] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2017-12-16 14:21:30,741] {driver.py:123} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
I tried to run the DAG both from the UI and with the command airflow trigger_dag Mydag.
However, I cannot see my txt result file after running, and there is no error in the log file either.
How can I find my txt file?
I would try again with an absolute file path, or you can try logging the current working directory inside the method with os.getcwd() to help locate your file.
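For example, a minimal sketch of the pull task along those lines (the /tmp path and the str() conversion are just illustrative choices, not requirements):
import os

def pull_function(**kwargs):
    ti = kwargs['ti']
    ls = ti.xcom_pull(task_ids='push_task')
    print("current working directory:", os.getcwd())  # shows where a relative path would land
    with open('/tmp/test.txt', 'w') as out:            # absolute path instead of the relative 'test.txt'
        out.write(str(ls))                             # write() expects a string, so convert the list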