We're setting up an Airflow framework in which multiple data scientist teams can orchestrate their data processing pipelines. We've developed a Python code-base to help them implement the DAGs, which includes functions and classes (Operator sub-classes as well) in various packages and modules.
Every team will have its own DAG packaged in a ZIP file together with the functions and classes in those packages. For example, the first ZIP file would contain:
ZIP1:
main_dag_teamA.py
subfolder1: package1-with-generic-functions + __init__.py
subfolder2: package2-with-generic-operators + __init__.py
And another ZIP file would contain:
ZIP2:
main_dag_teamB.py
subfolder1: package1-with-generic-functions + __init__.py
subfolder2: package2-with-generic-operators + __init__.py
Please note that in both ZIP files subfolder1 and subfolder2 will usually be exactly the same, meaning exact same files with same functions and classes.
But over time, as new versions of the packages become available, the package contents will start to deviate across the DAG packages.
With this setup we run into the following problem: Airflow does not seem to handle same-named packages well once the contents of the packages/subfolders start to deviate across the ZIPs.
Because when I run "airflow list_dags" it shows errors like:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in > from subfolder1.functions1 import function1
ImportError: No module named 'subfolder1.functions1'
The problem can be reproduced with the following code, where two small DAGs sit in their own ZIP files together with a package my_functions that has the same name in both ZIPs but different contents.
DAG package ZIP 1:
program1.py:

from my_functions.functions1 import function1

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def do_it():
    print('program1')


dag = DAG(
    'program1',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program1_task1', python_callable=do_it, dag=dag)

my_functions/functions1.py:

def function1():
    print('function1')
DAG package ZIP 2:
program2.py:

from my_functions.functions2 import function2

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def do_it():
    print('program2')


dag = DAG(
    'program2',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program2_task2', python_callable=do_it, dag=dag)

my_functions/functions2.py:

def function2():
    print('function2')
With these two ZIP files when I run "airflow list_dags" it shows an error:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in
from subfolder1.functions1 import function1 ImportError: No module named 'subfolder1.functions1'
When the contents of the subfolders in the ZIPs are the same, no error occurs.
My question: how can I prevent this clash of same-named packages across ZIPs? I really would like to have fully code-independent DAGs, each with its own version of the packages.
Solved by adding the following at the top of the DAG files (program1.py and program2.py), before the
from my_functions.functions1 import function1
and
from my_functions.functions2 import function2
imports:

import sys

# Clean up the already imported my_functions modules
cleanup_mods = []
for mod in sys.modules:
    if mod.startswith("my_functions"):
        cleanup_mods.append(mod)
for mod in cleanup_mods:
    del sys.modules[mod]
This makes sure that on every parse of a DAG file, the previously imported copies of these modules are removed, so each ZIP loads its own version.
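For reference, the same cleanup can be wrapped in a small helper that each team's DAG calls once before its package imports. This is just a sketch of the idea above; the helper name purge_modules is my own:

import sys

def purge_modules(prefix):
    """Remove already-imported modules whose names start with prefix,
    so the next import resolves against this ZIP's own copy."""
    for name in [mod for mod in sys.modules if mod.startswith(prefix)]:
        del sys.modules[name]

# at the top of program1.py / program2.py, before the package imports:
purge_modules("my_functions")
from my_functions.functions1 import function1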
Related
I am new to Airflow, and I am trying to create a Python pipeline scheduling automation process. My project youtubecollection01 utilizes custom created modules, so when I run the DAG it fails with ModuleNotFoundError: No module named 'Authentication'.
This is how my project is structured:
This is my dag file:
# This is to initialize the file as a DAG file
from airflow import DAG
from datetime import datetime, timedelta
from airflow.operators.python import PythonOperator
# from airflow.utils.dates import days_ago
from youtubecollectiontier01.src.__main__ import main

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    # 'start_date': days_ago(1),
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

# curate dag
with DAG('collect_layer_01', start_date=datetime(2022, 7, 25),
         schedule_interval='@daily', catchup=False, default_args=default_args) as dag:

    curate = PythonOperator(
        task_id='collect_tier_01',  # name of the task you would like to execute
        python_callable=main,       # the name of your python function
        provide_context=True,
        dag=dag)
I am importing the main function from __main__.py; however, inside main I am importing other classes such as Authentication.py, ChannelClass.py, and Common.py, and that is where Airflow fails to resolve the imports.
Why is it failing on these imports? Is it a directory issue or an Airflow issue? I tried moving the project under plugins and running it, but it did not work. Any feedback would be highly appreciated!
Thank you!
Up until the last part, you have everything set up according to the tutorials! Also, thank you for a well-documented question.
If you have not changed the PYTHONPATH for Airflow, you can check the defaults with:
$ airflow info
In the paths info part, you get "airflow_home", "system_path", "python_path" and "airflow_on_path".
Now within the "python_path", you'll basically see that, airflow is set up so that it will check everything inside /dags, /plugins and /config folder.
More about this topic in documents called "Module Management"
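As a rough illustration of what that means in practice (the file and function names here are hypothetical), anything reachable relative to those folders can be imported directly, without touching sys.path:

# <airflow_home>/dags/my_helpers/util.py  (plus an empty my_helpers/__init__.py)
def greet():
    return "hello from the dags folder"

# <airflow_home>/dags/some_dag.py
from my_helpers.util import greet  # resolves because dags/ is on python_path
print(greet())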
Now, I think, the problem with your code can be fixed with a little change.
In your main code you import:
from Authentication import Authentication
In a default setup, Airflow doesn't know where that is!
If you import it this way:
from youtubecollectiontier01.src.Authentication import Authentication
just like the one you did in the DAG file, I believe it will work. The same goes for the other classes you have: ChannelClass, Common, etc., as sketched below.
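For illustration, here is a sketch of how the imports inside __main__.py could then look; I'm assuming each class is named after its file, which may not match your code exactly:

# youtubecollectiontier01/src/__main__.py
from youtubecollectiontier01.src.Authentication import Authentication
from youtubecollectiontier01.src.ChannelClass import ChannelClass  # assumed class name
from youtubecollectiontier01.src.Common import Common              # assumed class name

def main(**kwargs):
    auth = Authentication()
    # ... rest of the collection pipeline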
Waiting to hear from you!
I am new to Airflow and I'm trying to run a DAG that references a custom operator (my_operators.py) in Airflow v1.10.14.
Issue: I'm getting the following error in the Airflow UI:
Broken DAG: [/opt/airflow/dags/test_operator.py] No module named 'operators.my_operators'
Directory structure:
airflow
|-- dags
|-- test_operator.py
|-- requirements.txt
|-- __init__.py
|-- plugins
|--__init__.py
|-- operators
|-- my_operators.py
|-- __init__.py
|-- airflow.cfg
I am able to successfully reference and import when the operator file (my_operators.py) is directly in the "plugins" folder using
from my_operators import MyFirstOperator
or when it is under the "dags/operators/" directory using
from operators.my_operators import MyFirstOperator
But not when it's in the "plugins/operators/" directory. It seems Airflow cannot detect the "operators" folder in the "plugins" directory, but it does in the "dags" directory.
What am I doing wrong?
Additional Context:
Dag file content:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from operators.my_operators import MyFirstOperator

dag = DAG('my_test_dag', description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2019, 5, 29), catchup=False)

dummy_task = DummyOperator(task_id='dummy_task', dag=dag)

operator_task = MyFirstOperator(my_operator_param='This is a test.',
                                task_id='my_first_operator_task', dag=dag)

dummy_task >> operator_task
Custom operator file content:
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)


class MyFirstOperator(BaseOperator):

    @apply_defaults
    def __init__(self, my_operator_param, *args, **kwargs):
        self.operator_param = my_operator_param
        super(MyFirstOperator, self).__init__(*args, **kwargs)

    def execute(self, context):
        log.info("Hello World!")
        log.info('operator_param: %s', self.operator_param)
requirements.txt content:
flask-bcrypt==0.7.1
apache-airflow==1.10.14
All "init.py" files are empty
I tried following along with the answer provided in the following post with no success:
Can't import Airflow plugins
I think you're confused on the {AIRFLOW_HOME}/plugins directory.
Plugins don't function like they would if you placed your custom operator in {AIRFLOW_HOME}/dags or {AIRFLOW_HOME}/data.
When you place custom code in either of these two directories, you can declare any arbitrary Python code that can be shared between DAGs. This could be an operator, a default default_args dictionary that you might want multiple DAGs to share etc.
The documentation for Airflow 1 for this is here (in Airflow 2 the documentation has been changed to make it much clearer how Airflow uses these directories when you want to add custom code).
Your plugin needs to define the AirflowPlugin class. When you implement this class your Operator will be integrated into Airflow - its import path will be (assuming you define the plugin name as my_custom_plugin in AirflowPlugin):
from airflow.operators.my_custom_plugin import MyFirstOperator
You cannot declare arbitrary Python code to share between DAGs when using plugins - it has to implement this class and implement all the required methods for your custom Airflow plugin (whether it's a Hook, Sensor, Operator etc).
Check out the documentation for Plugins in Airflow 1 here - this example shows you exactly what you need to implement.
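A minimal sketch of what that could look like for the operator in this question, placed in a new file such as plugins/my_custom_plugin.py (the file and plugin class names are my own; see the Airflow 1 plugin docs for the full interface):

from airflow.plugins_manager import AirflowPlugin

# the operator from plugins/operators/my_operators.py in the question
from operators.my_operators import MyFirstOperator


class MyCustomPlugin(AirflowPlugin):
    name = "my_custom_plugin"
    operators = [MyFirstOperator]

# a DAG would then import it as:
#   from airflow.operators.my_custom_plugin import MyFirstOperator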
It's up to you whether you want to go to the trouble of implementing a Plugin. This functionality is used if you are going to write an Operator that you want to share and publish for people to use. If the Operator is just for internal use in the overwhelming majority of cases (at least I've seen) people just use {AIRFLOW_HOME}/dags or {AIRFLOW_HOME}/data.
You should move the "plugins" folder into the "dags" folder.
For my data extraction project I have gone with Apache Airflow, using GCP Composer and bucket storage.
I have several modules in a package in my GitHub repo that my DAG file needs to access.
For now I'm using BashOperator to check that it works:
#dag.py
dag = DAG(
    dag_id='my_example_DAG',
    start_date=datetime(2019, 10, 17, 8, 25),
    schedule_interval=timedelta(minutes=15),
    default_args=default_args,
)

t1 = BashOperator(
    task_id='example_task',
    bash_command='python /home/airflow/gcs/data/my_example_maindir/main.py ',
    dag=dag)

t1
#main.py
def run_main(path_name):
    # Reads YML file into yml_info
    ...
    extractor_pool(yml_info)

def extractor_pool(yml_info):
    # do work
    ...

if __name__ == "__main__":
    test_path = 'Example/path/for/test.yml'
    run_main(test_path)
And it works: it starts main.py with the test_path. But I want to use the function run_main and pass it the correct path to the correct YML file for the task.
I have tried to sys.path.insert the directory inside my storage bucket where my modules are, but I get an import error.
Directories:
dir for my DAG files (cloned from my Git repo) = Buckets/europe-west1-eep-envxxxxxxx-bucket/dags
dir for my scripts/packages = Buckets/europe-west1-eep-envxxxxxxx-bucket/data
#dag.py
import sys

sys.path.insert(0, "/home/airflow/gcs/data/Example/")
from Example import main

dag = DAG(
    dag_id='task_1_dag',
    start_date=datetime(2019, 10, 13),
    schedule_interval=timedelta(minutes=10),
    default_args=default_args,
)

t1 = PythonOperator(
    task_id='task_1',
    provide_context=True,
    python_callable=main.run_main,
    op_args={'path_name': "project_output_0184_Storgaten_33"},
    dag=dag
)

t1
This results in a "module not found" error and does not work.
I have done some reading on GCP and found this:
Installing a Python dependency from private repository:
https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
It says I need to place it in the directory path /config/pip/, for
example: gs://us-central1-b1-6efannnn-bucket/config/pip/pip.conf
But in my GCP storage bucket I have no directory named config.
I have tried to retrace my steps from when I created the bucket and environment, but I can't figure out what I have done wrong.
GCS has no true notion of folders or directories, what you actually have is a series of blobs that have names which may contain slashes and give the appearance of a directory.
The instructions are a bit unclear in asking you to put it in a directory; what you actually want to do is create an object and give it the name config/pip/pip.conf, where config/pip/ acts as the "directory" prefix.
With gsutil you'd do something like:
gsutil cp my-local-pip.conf gs://[DESTINATION_BUCKET_NAME]/config/pip/pip.conf
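If you would rather do the same thing from Python instead of gsutil, something along these lines should work with the google-cloud-storage client (the bucket name and local path are placeholders):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("DESTINATION_BUCKET_NAME")
# the object name itself carries the "directory" prefix
blob = bucket.blob("config/pip/pip.conf")
blob.upload_from_filename("my-local-pip.conf")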
The Airflow docs suggest that a basic sanity check for a DAG file is to interpret it, i.e.:
$ python ~/path/to/my/dag.py
I've found this to be useful. However, now I've created a plugin, MordorOperator under $AIRFLOW_HOME/plugins:
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from airflow.operators import BaseOperator
from airflow.exceptions import AirflowException
import pika
import json


class MordorOperator(BaseOperator):
    JOB_QUEUE_MAPPING = {"testing": "testing"}

    @apply_defaults
    def __init__(self, job, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # stuff

    def execute(self, context):
        # stuff


class MordorPlugin(AirflowPlugin):
    name = "MordorPlugin"
    operators = [MordorOperator]
I can import the plugin and see it work in a sample DAG:
from airflow import DAG
from airflow.operators import MordorOperator
from datetime import datetime
dag = DAG('mordor_dag', description='DAG with a single task', start_date=datetime.today(), catchup=False)
hello_operator = MordorOperator(job="testing", task_id='run_single_task', dag=dag)
However, when I try to interpret this file I get failures which I suspect I shouldn't get, since the plugin runs successfully. My suspicion is that some dynamic code generation happens at runtime which isn't available when a DAG file is interpreted by itself. I also find that PyCharm can't perform any autocompletion when importing the plugin.
(venv) 3:54PM /Users/paymahn/solvvy/scheduler mordor.operator ✱
❮❮❮ python dags/mordor_test.py
section/key [core/airflow-home] not found in config
Traceback (most recent call last):
  File "dags/mordor_test.py", line 2, in <module>
    from airflow.operators import MordorOperator
ImportError: cannot import name 'MordorOperator'
How can a DAG using a plugin be sanity tested? Is it possible to get PyCharm to give autocompletion for the custom operator?
I'm running Airflow in a Docker container and have a script which runs as the container's entry point. It turned out that the plugins folder wasn't available to my container when I was running my tests, so I had to add a symlink in the container as part of the setup script. The solution to my problem is highly specific to me; if someone else stumbles upon this, I don't have a good answer for your situation other than: make sure your plugins folder is correctly available.
Consider the following example of a DAG where the first task, get_id_creds, extracts a list of credentials from a database. This operation tells me what users in my database I am able to run further data preprocessing on and it writes those ids to the file /tmp/ids.txt. I then scan those ids into my DAG and use them to generate a list of upload_transaction tasks that can be run in parallel.
My question is: is there a more idiomatically correct, dynamic way to do this with Airflow? What I have here feels clumsy and brittle. How can I directly pass a list of valid IDs from one process to the one that defines the subsequent downstream processes?
from datetime import datetime, timedelta
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
    'start_date': datetime.now(),
    'schedule_interval': None
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

get_id_creds = PythonOperator(
    task_id='get_id_creds',
    python_callable=dash_workers.get_id_creds,
    provide_context=True,
    dag=DAG)

with open('/tmp/ids.txt', 'r') as infile:
    ids = infile.read().splitlines()

for uid in ids:
    upload_transactions = PythonOperator(
        task_id=uid,
        python_callable=dash_workers.upload_transactions,
        op_args=[uid],
        dag=DAG)
    upload_transactions.set_downstream(get_id_creds)
Per @Juan Riza's suggestion I checked out this link: Proper way to create dynamic workflows in Airflow. This was pretty much the answer, although I was able to simplify the solution enough that I thought I would offer my own modified version of the implementation here:
from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('DASH_PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import dash_workers
else:
    print('Define DASH_PREPROC_PATH value in environmental variables')
    sys.exit(1)

ENV = os.environ

default_args = {
    # 'start_date': datetime.now(),
    'start_date': datetime(2017, 7, 18)
}

DAG = DAG(
    dag_id='dash_preproc',
    default_args=default_args
)

clear_tables = PythonOperator(
    task_id='clear_tables',
    python_callable=dash_workers.clear_db,
    dag=DAG)


def id_worker(uid):
    return PythonOperator(
        task_id=uid,
        python_callable=dash_workers.main_preprocess,
        op_args=[uid],
        dag=DAG)


for uid in dash_workers.get_id_creds():
    clear_tables >> id_worker(uid)
clear_tables cleans the database that will be re-built as a result of the process. id_worker is a function that dynamically generates new preprocessing tasks, based on the array of ID values returned from get_id_creds. The task ID is just the corresponding user ID, though it could easily have been an index, i, as in the example mentioned above.
NOTE: That bitshift operator (>>) looks backwards to me, as the clear_tables task should come first, but it's what seems to be working in this case.
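For reference, the bitshift operators are just shorthand for set_downstream/set_upstream, so the following lines (reusing the names from the code above) all express the same dependency, with clear_tables running first:

task = id_worker(uid)

clear_tables >> task               # clear_tables runs before task
task << clear_tables               # same dependency, written from the other side
clear_tables.set_downstream(task)  # equivalent method call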
Apache Airflow is a workflow management tool, i.e. it determines the dependencies between the tasks that the user defines, in contrast with (as an example) Apache NiFi, which is a dataflow management tool, i.e. the dependencies there are the data transferred through the tasks.
That said, I think your approach is quite right (my comment is based on the code posted), but Airflow also offers a concept called XCom. It allows tasks to "cross-communicate" by passing some data between them. How big should the passed data be? It is up to you to test! But generally it should not be very big. It is stored as key/value pairs in the Airflow metadata database, i.e. you can't pass files, for example, but a list of ids could work.
Like I said, you should test it yourself. I would be very happy to know your experience. Here is an example dag which demonstrates the use of XCom and here is the necessary documentation. Cheers!
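To make the XCom suggestion concrete, here is a minimal sketch in the same Airflow 1 style as the question (task and function names are illustrative): the first task returns the list of ids, which Airflow stores as an XCom, and the downstream task pulls it instead of reading /tmp/ids.txt.

from datetime import datetime

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG(dag_id='xcom_example', start_date=datetime(2017, 7, 18), schedule_interval=None)


def push_ids(**context):
    # the return value is automatically pushed as an XCom (key 'return_value')
    return ['user_1', 'user_2', 'user_3']


def pull_ids(**context):
    ids = context['ti'].xcom_pull(task_ids='push_ids')
    print('received ids: %s' % ids)


push_task = PythonOperator(task_id='push_ids', python_callable=push_ids,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id='pull_ids', python_callable=pull_ids,
                           provide_context=True, dag=dag)

push_task >> pull_task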