How to validate an Airflow DAG with a custom operator? - python

The Airflow docs suggest that a basic sanity check for a DAG file is to interpret it, i.e.:
$ python ~/path/to/my/dag.py
I've found this to be useful. However, now I've created a plugin, MordorOperator under $AIRFLOW_HOME/plugins:
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from airflow.operators import BaseOperator
from airflow.exceptions import AirflowException
import pika
import json
class MordorOperator(BaseOperator):
    JOB_QUEUE_MAPPING = {"testing": "testing"}

    @apply_defaults
    def __init__(self, job, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # stuff

    def execute(self, context):
        # stuff

class MordorPlugin(AirflowPlugin):
    name = "MordorPlugin"
    operators = [MordorOperator]
I can import the plugin and see it work in a sample DAG:
from airflow import DAG
from airflow.operators import MordorOperator
from datetime import datetime
dag = DAG('mordor_dag', description='DAG with a single task', start_date=datetime.today(), catchup=False)
hello_operator = MordorOperator(job="testing", task_id='run_single_task', dag=dag)
However, when I try to interpret this file I get failures that I suspect I shouldn't get, since the plugin runs successfully. My suspicion is that some dynamic code generation happens at runtime which isn't available when a DAG file is interpreted on its own. I also find that PyCharm can't offer any autocompletion when importing the plugin.
(venv) 3:54PM /Users/paymahn/solvvy/scheduler mordor.operator ✱
❮❮❮ python dags/mordor_test.py
section/key [core/airflow-home] not found in config
Traceback (most recent call last):
  File "dags/mordor_test.py", line 2, in <module>
    from airflow.operators import MordorOperator
ImportError: cannot import name 'MordorOperator'
How can a DAG using a plugin be sanity tested? Is it possible to get PyCharm to give autocompletion for the custom operator?

I'm running Airflow in a Docker container and have a script which runs as the container's entry point. It turns out that the plugins folder wasn't available to my container when I was running my tests, so I had to add a symlink in the container as part of the setup script. The solution to my problem is highly specific to my setup; if someone else stumbles upon this, I don't have a good answer for your situation other than: make sure your plugins folder is actually available.
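As a more general sanity check than interpreting the DAG file directly, loading the DAG folder through a DagBag also exercises Airflow's plugin machinery, so plugin-provided operators resolve. A minimal sketch, assuming Airflow is installed in the test environment and AIRFLOW_HOME points at a home whose plugins folder is actually reachable:

from airflow.models import DagBag

# Parse every DAG file under dags/; plugin and import failures are collected
# in import_errors instead of only surfacing inside the scheduler.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)

assert not dag_bag.import_errors, dag_bag.import_errors
assert "mordor_dag" in dag_bag.dags

This can be dropped into a pytest test so the same check runs in CI.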

Related

Is there a way to force MonkeyType to use a specific import for a type?

I am currently trying to automatically add type annotations to a project. This works great for most cases, but I have one issue: when working with SQLAlchemy's async session, MonkeyType does not work.
Let's say we have the following function:
async def foo(session):
    await session.commit()
and we will call this function like this:
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
engine = create_async_engine()
async_session = AsyncSession(engine)
...
foo(async_session)
Let's say all of this is in a file "test.py" and we now run:
monkeytype run test.py
This runs without problems, but when running this:
monkeytype -v stub test
MonkeyType will show the following error:
WARNING: Failed decoding trace: Module 'sqlalchemy.orm.session' has no attribute 'AsyncSession'
This happens because type(async_session) returns sqlalchemy.orm.session.AsyncSession, but sqlalchemy.orm.session has no AsyncSession attribute; AsyncSession actually lives in sqlalchemy.ext.asyncio.
Is there a way to force MonkeyType to use this specific path?
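One possible, untested workaround (an assumption, not a documented MonkeyType option) is to alias AsyncSession onto the module path recorded in the traces so that decoding succeeds, for example from a monkeytype_config.py that the monkeytype CLI picks up if it is on the path:

# monkeytype_config.py -- untested sketch: make the attribute MonkeyType looks up
# ("sqlalchemy.orm.session.AsyncSession") resolvable by aliasing the real class.
import sqlalchemy.orm.session
from sqlalchemy.ext.asyncio import AsyncSession
from monkeytype.config import DefaultConfig

if not hasattr(sqlalchemy.orm.session, "AsyncSession"):
    sqlalchemy.orm.session.AsyncSession = AsyncSession  # monkey-patched alias

CONFIG = DefaultConfig()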

Unable to import custom airflow operator from plugins/operator folder (Airflow v1.10.14)

I am new to Airflow and I'm trying to run a DAG that references a custom operator (my_operators.py) in Airflow v1.10.14.
Issue: I'm getting the following error in the Airflow UI:
Broken DAG: [/opt/airflow/dags/test_operator.py] No module named 'operators.my_operators'
Directory structure:
airflow
|-- dags
|   |-- test_operator.py
|   |-- requirements.txt
|   |-- __init__.py
|-- plugins
|   |-- __init__.py
|   |-- operators
|   |   |-- my_operators.py
|   |   |-- __init__.py
|-- airflow.cfg
I am able to successfully reference and import when the operator file (my_operators.py) is directly in the "plugins" folder using
from my_operators import MyFirstOperator
or when it is under the "dags/operators/" directory using
from operators.my_operators import MyFirstOperator
But not when it's in the "plugins/operators/" directory. It seems like Airflow cannot detect the "operators" folder in the "plugins" directory, but it does in the "dags" directory.
What am I doing wrong?
Additional Context:
Dag file content:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from operators.my_operators import MyFirstOperator

dag = DAG('my_test_dag', description='Another tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2019, 5, 29), catchup=False)

dummy_task = DummyOperator(task_id='dummy_task', dag=dag)

operator_task = MyFirstOperator(my_operator_param='This is a test.',
                                task_id='my_first_operator_task', dag=dag)

dummy_task >> operator_task
Custom operator file content:
import logging

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults

log = logging.getLogger(__name__)

class MyFirstOperator(BaseOperator):

    @apply_defaults
    def __init__(self, my_operator_param, *args, **kwargs):
        self.operator_param = my_operator_param
        super(MyFirstOperator, self).__init__(*args, **kwargs)

    def execute(self, context):
        log.info("Hello World!")
        log.info('operator_param: %s', self.operator_param)
requirements.txt content:
flask-bcrypt==0.7.1
apache-airflow==1.10.14
All "init.py" files are empty
I tried following along with the answer provided in the following post with no success:
Can't import Airflow plugins
I think you're confused about the {AIRFLOW_HOME}/plugins directory.
Plugins don't behave the way custom code does when you place it in {AIRFLOW_HOME}/dags or {AIRFLOW_HOME}/data.
When you place custom code in either of those two directories, you can declare any arbitrary Python code to be shared between DAGs. This could be an operator, a default_args dictionary that you might want multiple DAGs to share, etc.
The documentation for Airflow 1 for this is here (in Airflow 2 the documentation has been changed to make it much clearer how Airflow uses these directories when you want to add custom code).
Your plugin needs to define the AirflowPlugin class. When you implement this class, your Operator is integrated into Airflow, and its import path will be the following (assuming you set the plugin name to my_custom_plugin in AirflowPlugin):
from airflow.operators.my_custom_plugin import MyFirstOperator
You cannot declare arbitrary Python code to share between DAGs when using plugins; the code has to implement this class and all the required attributes for your custom Airflow plugin (whether it's a Hook, Sensor, Operator, etc.).
Check out the documentation for Plugins in Airflow 1 here; the example there shows exactly what you need to implement.
It's up to you whether you want to go to the trouble of implementing a Plugin. This functionality is intended for Operators you want to share and publish for other people to use. If the Operator is just for internal use, in the overwhelming majority of cases (at least that I've seen) people just use {AIRFLOW_HOME}/dags or {AIRFLOW_HOME}/data.
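For concreteness, a minimal sketch of what the plugin module could look like under plugins/ in Airflow 1.10.x; the file and class names here are illustrative, and it assumes the plugins folder is importable when Airflow parses plugin files:

# plugins/my_custom_plugin.py -- illustrative sketch for Airflow 1.10.x
from airflow.plugins_manager import AirflowPlugin
from operators.my_operators import MyFirstOperator  # assumes plugins/ is on sys.path

class MyCustomPlugin(AirflowPlugin):
    # the name becomes part of the import path: airflow.operators.my_custom_plugin
    name = "my_custom_plugin"
    operators = [MyFirstOperator]

The DAG would then import the operator via the airflow.operators.my_custom_plugin path shown above.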
You should move the "plugins" folder into the "dags" folder.

Apache Airflow giving broken DAG error cannot import __builtin__ for speedtest.py

This is a weird error I'm coming across. In my Python 3.7 environment I have installed Airflow 2, speedtest-cli and a few other things using pip, and I keep seeing this error pop up in the Airflow UI:
Broken DAG: [/env/app/airflow/dags/my_dag.py] Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 156, in <module>
    import __builtin__
ModuleNotFoundError: No module named '__builtin__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 179, in <module>
    _py3_utf8_stdout = _Py3Utf8Output(sys.stdout)
  File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 166, in __init__
    buf = FileIO(f.fileno(), 'w')
AttributeError: 'StreamLogWriter' object has no attribute 'fileno'
For sanity checks I did run the following and saw no problems:
~# python airflow/dags/my_dag.py
/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py:94 DeprecationWarning: provide_context is deprecated as of 2.0 and is no longer required
~# airflow dags list
dag_id | filepath | owner | paused
===========+===============+=========+=======
my_dag | my_dag.py | rafay | False
~# airflow tasks list my_dag
[2021-03-08 16:46:26,950] {dagbag.py:448} INFO - Filling up the DagBag from /env/app/airflow/dags
/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py:94 DeprecationWarning: provide_context is deprecated as of 2.0 and is no longer required
Start_backup
get_configs
get_targets
push_targets
So nothing out of the ordinary, and testing each of the tasks does not cause problems either. Furthermore, running the speedtest-cli script independently, outside of Airflow, does not raise any errors. The script goes something like this:
import speedtest
from airflow.exceptions import AirflowException  # needed for the error path below

def get_upload_speed():
    """
    Calculates the upload speed of the internet using the speedtest api.

    Returns:
        Upload speed in Mbps
    """
    try:
        s = speedtest.Speedtest()
        upload = s.upload()
    except speedtest.SpeedtestException as e:
        raise AirflowException("Failed to check network bandwidth make sure internet is available.\nException: {}".format(e))
    return round(upload / (1024**2), 2)
I even went to the exact line of speedtest.py mentioned in the Broken DAG error (line 156); it looks fine and runs fine when I put it in the Python interpreter.
try:
    import __builtin__
except ImportError:
    import builtins
    from io import TextIOWrapper, FileIO
So, how do I diagnose this? It seems like a package import problem of some sort.
Edit: If it helps, here is my directory and import structure for my_dag.py:
- airflow
  - dags
    - tasks
      - get_configs.py
      - get_targets.py
      - push_targets.py (speedtest is imported here)
    - my_dag.py
The import sequence of tasks in the dag file are as follows:
from datetime import timedelta
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator
from tasks.get_configs import get_configs
from tasks.get_targets import get_targets
from tasks.push_targets import push_targets
...
The Airflow StreamLogWriter (and other log-related facilities) does not implement the fileno method expected by "standard" Python I/O clients (confirmed by a TODO comment in the code). The problem here also happens when enabling the faulthandler standard library in an Airflow task.
So what to do at this point? Aside from opening an issue or sending a PR to Airflow, it is really case by case. In the speedtest-cli situation, it may be necessary to isolate the function calling fileno and try to "replace" it (e.g. by forking the library, changing the function if it can be isolated and injected, or perhaps choosing a configuration that does not use that part of the code).
In my particular case, there was no way to bypass the code, and a fork was the most straightforward method.
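For the speedtest-cli import failure specifically, one rough, untested sketch of the "replace it" route (an assumption, not something the library documents) is to perform the import while sys.stdout and sys.stderr temporarily point at the real process streams, which do have fileno():

import sys

def import_speedtest_with_real_streams():
    # Swap in the original process streams so speedtest's module-level
    # FileIO(f.fileno(), 'w') calls see objects that actually have fileno(),
    # then restore Airflow's StreamLogWriter afterwards.
    saved_out, saved_err = sys.stdout, sys.stderr
    try:
        sys.stdout, sys.stderr = sys.__stdout__, sys.__stderr__
        import speedtest  # module-level stream setup runs here
    finally:
        sys.stdout, sys.stderr = saved_out, saved_err
    return speedtest

push_targets.py would then call this helper inside the task callable instead of importing speedtest at module level, so the DAG file itself parses cleanly.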

Airflow Packaged Dags (zipped) clash when subfolders have same name

We're setting up an Airflow framework in which multiple data scientist teams can orchestrate their data processing pipelines. We've developed a Python code-base to help them implement the DAGs, which includes functions and classes (Operator sub-classes as well) in various packages and modules.
Every team will have their own DAG packaged in a ZIP file together with the functions and classes in packages. For example, the first ZIP file would contain:
ZIP1:
  main_dag_teamA.py
  subfolder1: package1-with-generic-functions + __init__.py
  subfolder2: package2-with-generic-operators + __init__.py
And another ZIP file would contain:
ZIP2:
  main_dag_teamB.py
  subfolder1: package1-with-generic-functions + __init__.py
  subfolder2: package2-with-generic-operators + __init__.py
Please note that in both ZIP files subfolder1 and subfolder2 will usually be exactly the same, meaning the exact same files with the same functions and classes.
But over time, as new versions of the packages become available, the package contents will start deviating across the DAG packages.
With this setup we bump into the following problem: it seems that Airflow does not handle same-named packages very well when the contents of the packages/subfolders start deviating across the ZIPs.
Because when I run "airflow list_dags" it shows errors like:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in > from subfolder1.functions1 import function1
ImportError: No module named 'subfolder1.functions1'
The problem can be reproduced with the following code, where two small DAGs sit in their own ZIP files together with a package my_functions, which has the same name in both ZIPs but different content.
DAG package ZIP 1:
program1.py:
from my_functions.functions1 import function1

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def do_it():
    print('program1')

dag = DAG(
    'program1',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program1_task1', python_callable=do_it, dag=dag)

my_functions/functions1.py:
def function1():
    print('function1')
DAG package ZIP 2:
program2.py:
from my_functions.functions2 import function2

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def do_it():
    print('program1')

dag = DAG(
    'program1',
    schedule_interval=None,
    catchup=False,
    start_date=datetime(2019, 6, 23)
)

hello_operator = PythonOperator(task_id='program2_task2', python_callable=do_it, dag=dag)

my_functions/functions2.py:
def function2():
    print('function2')
With these two ZIP files when I run "airflow list_dags" it shows an error:
File "/data/share/airflow/dags/program1/program1.zip/program1.py", line 1, in
from subfolder1.functions1 import function1 ImportError: No module named 'subfolder1.functions1'
When the contents of the subfolders in the ZIPs are the same, no error occurs.
My question: how can I prevent this clash of subfolders in the ZIPs? I really would like to have fully code-independent DAGs, each with their own version of the packages.
Solved by adding the following at the top of the DAGs (program1.py and program2.py), before the
from my_functions.functions1 import function1
and
from my_functions.functions2 import function2
Code:
import sys

# Clean up the already imported function modules
cleanup_mods = []
for mod in sys.modules:
    if mod.startswith("function"):
        cleanup_mods.append(mod)

for mod in cleanup_mods:
    del sys.modules[mod]
This makes sure that on every parse of a DAG, the previously imported modules are cleaned up first.
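For illustration, a sketch of how this could sit at the top of program1.py; note that the prefix here is an assumption based on the my_functions package name from the question, rather than the "function" prefix used in the snippet above:

# program1.py (sketch) -- drop any copy of my_functions imported from another ZIP
# before importing this ZIP's own copy.
import sys

cleanup_mods = [mod for mod in sys.modules if mod.startswith("my_functions")]
for mod in cleanup_mods:
    del sys.modules[mod]

from my_functions.functions1 import function1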

Deferred tasks creates new instances that can't access some python modules

I am using the latest version of GAE with automated scaling, endpoints API, and deferred.defer() tasks.
The problem is that, since adding the API, some instances that spin up automatically always throw permanent task failures:
Permanent failure attempting to execute task
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 310, in post
    self.run_from_request()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 305, in run_from_request
    run(self.request.body)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 145, in run
    raise PermanentTaskFailure(e)
PermanentTaskFailure: No module named app.Report
The permanent task failures are confined to a single instance though: every deferred task on that instance fails with the same error, even though those tasks don't use the Api.py module. On other instances, the same deferred tasks run just fine as long as they aren't routed to a failing instance.
The app.yaml handlers looks like this:
handlers:
# Api Handler
- url: /_ah/api/.*
  script: main.api
- url: /_ah/spi/.*
  script: main.api
# All other traffic
- url: .*
  script: main.app

builtins:
- deferred: on
The main.py looks like:
import Api, endpoints, webapp2

api = endpoints.api_server([Api.AppApi])

app = webapp2.WSGIApplication(
    [(misc routes)],
    debug=True)
The Api.py looks like :
import endpoints
from protorpc import messages
from protorpc import message_types
from protorpc import remote
from google.appengine.ext import deferred

from app.Report import ETLScheduler

@endpoints.api(...)
class AppApi(remote.Service):

    @endpoints.method(...)
    def reportExtract(self, request):
        deferred.defer(
            ETLScheduler,
            params
        )
I'm not doing any path modification, so I'm curious why the new instance is having trouble finding the python modules for the API, even though the deferred tasks are in another module using other functions. Why would it throw these errors for that instance only?
Edit:
So after looking at some other SO issues, I tried doing path modification in appengine_config.py. I moved all my folders to a lib directory, and added this to the config file:
import os,sys
sys.path.append(os.path.join(os.path.dirname(__file__), 'lib'))
Now the error I get on the failing instance is:
Permanent failure attempting to execute task
Traceback (most recent call last):
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 310, in post
    self.run_from_request()
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 305, in run_from_request
    run(self.request.body)
  File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/ext/deferred/deferred.py", line 145, in run
    raise PermanentTaskFailure(e)
PermanentTaskFailure: cannot import name ETLScheduler
So it seems to be finding the module, but same as before, none of the deferred tasks on the instance can import the method.
So I figured out a way to make it work, but am not sure why it works.
By importing the entire module, rather than a method from the module, the new instances that spin up for deferred tasks no longer throw the PermanentTaskFailure: cannot import name ETLScheduler error.
I tried importing the whole module instead of the method, so that the Api.py looks like this:
import endpoints
from protorpc import messages
from protorpc import message_types
from protorpc import remote
from google.appengine.ext import deferred

# Import the module instead of the method
# from app.Report import ETLScheduler
import app.Report

@endpoints.api(...)
class AppApi(remote.Service):

    @endpoints.method(...)
    def reportExtract(self, request):
        deferred.defer(
            app.Report.ETLScheduler,
            params
        )
Now I am no longer getting instances that throw the PermanentTaskFailure: cannot import name ETLScheduler error. It might be a circular dependency caused by importing Api.py in main.py (I'm not sure), but at least it works now.
You're missing the _target kwarg in your defer invocation if you're trying to run something in a specific module.
deferred.defer(
    app.Report.ETLScheduler,
    params,
    _target="modulename"
)
