Getting an exception when using 'DataProcSparkOperator' in Airflow DAGs - python

I am very new to Apache Airflow, currently using Airflow 1.10.4 with Python 2.7 support.
I need to trigger a Spark job via an Airflow DAG, so I am using 'DataProcSparkOperator'. But I am facing this exception:
AttributeError: 'DataProcSparkOperator' object has no attribute 'dataproc_spark_jars'
Code snippet:
from airflow.contrib.operators.dataproc_operator import DataProcSparkOperator
.
.
.
data_t1 = DataProcSparkOperator(
    task_id='data_job',
    job_name='extract_data',
    cluster_name='cluster-a',
    arguments=["{{ task_instance.xcom_pull(task_ids='puller') }}", "gs://data-bucket/dailydata"],
    main_jar='gs://data-bucket/spark_jar1/spark-read-5.0-SNAPSHOT-jar-with-dependencies.jar',
    region="us-central",
    dag=dag
)
I tried with both the main_jar and the dataproc_spark_jars attributes (all possible ways).
I also tried the other suggested fixes (as airflow.contrib.operators.dataproc_operator is deprecated in some versions), hence I used the import below:
from airflow.gcp.operators.dataproc import DataProcSparkOperator
Again I am facing an error:
Import Error: No module gcp.operators.dataproc..
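One quick diagnostic (a sketch, not a confirmed fix) is to check which Airflow build and which dataproc_operator module file are actually being imported, and whether that module's source mentions dataproc_spark_jars at all, since a mismatch between the installed operator code and the documented constructor arguments commonly produces this kind of AttributeError:
import inspect

import airflow
from airflow.contrib.operators import dataproc_operator

# Confirm which Airflow version and which operator module are actually loaded.
print(airflow.__version__)
print(dataproc_operator.__file__)

# Check whether the installed operator's source even mentions the attribute
# named in the error.
print('dataproc_spark_jars' in inspect.getsource(dataproc_operator.DataProcSparkOperator))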

Related

Apache Airflow giving broken DAG error cannot import __builtin__ for speedtest.py

This is a weird error I'm coming across. In my Python 3.7 environment I have installed Airflow 2, speedtest-cli and a few other things using pip, and I keep seeing this error pop up in the Airflow UI:
Broken DAG: [/env/app/airflow/dags/my_dag.py] Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 156, in <module>
import __builtin__
ModuleNotFoundError: No module named '__builtin__'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 179, in <module>
_py3_utf8_stdout = _Py3Utf8Output(sys.stdout)
File "/usr/local/lib/python3.7/site-packages/speedtest.py", line 166, in __init__
buf = FileIO(f.fileno(), 'w')
AttributeError: 'StreamLogWriter' object has no attribute 'fileno'
For sanity checks, I ran the following and saw no problems:
~# python airflow/dags/my_dag.py
/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py:94 DeprecationWarning: provide_context is deprecated as of 2.0 and is no longer required
~# airflow dags list
dag_id | filepath | owner | paused
===========+===============+=========+=======
my_dag | my_dag.py | rafay | False
~# airflow tasks list my_dag
[2021-03-08 16:46:26,950] {dagbag.py:448} INFO - Filling up the DagBag from /env/app/airflow/dags
/usr/local/lib/python3.7/site-packages/airflow/utils/decorators.py:94 DeprecationWarning: provide_context is deprecated as of 2.0 and is no longer required
Start_backup
get_configs
get_targets
push_targets
So nothing out of the ordinary, and testing each of the tasks does not cause problems either. Furthermore, running the speedtest-cli script independently outside of Airflow does not raise any errors. The script goes something like this:
import speedtest
from airflow.exceptions import AirflowException  # needed for the AirflowException raised below

def get_upload_speed():
    """
    Calculates the upload speed of the internet using the speedtest api

    Returns:
        Returns upload speed in Mbps
    """
    try:
        s = speedtest.Speedtest()
        upload = s.upload()
    except speedtest.SpeedtestException as e:
        raise AirflowException("Failed to check network bandwidth, make sure internet is available.\nException: {}".format(e))
    return round(upload / (1024 ** 2), 2)
I even went to the exact line of speedtest.py mentioned in the Broken DAG error, line 156; it seems fine and runs fine when I put it in the Python interpreter.
try:
    import __builtin__
except ImportError:
    import builtins
from io import TextIOWrapper, FileIO
So, how do I diagnose this? It seems like a package import problem of some sort.
Edit: If it helps, here is my directory and import structure for my_dag.py:
- airflow
  - dags
    - tasks
      - get_configs.py
      - get_targets.py
      - push_targets.py (speedtest is imported here)
    - my_dag.py
The import sequence of tasks in the dag file is as follows:
from datetime import timedelta
# The DAG object; we'll need this to instantiate a DAG
from airflow import DAG
# Operators; we need this to operate!
from airflow.operators.python import PythonOperator
from airflow.operators.dummy import DummyOperator
from tasks.get_configs import get_configs
from tasks.get_targets import get_targets
from tasks.push_targets import push_targets
...
The Airflow StreamLogWriter (and other log-related facilities) do not implement the fileno method expected by "standard" Python I/O clients (confirmed by a todo comment in the Airflow source). The problem also happens when enabling the faulthandler standard library in an Airflow task.
So what to do at this point? Aside from opening an issue or sending a PR to Airflow, it is really case by case. In the speedtest-cli situation, it may be necessary to isolate the function calling fileno and try to "replace" it (e.g. forking the library, changing the function if it can be isolated and injected, or perhaps choosing a configuration that does not use that part of the code).
In my particular case, there is no way to bypass the code, and a fork was the most straightforward method.
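As an illustration of the "isolate and replace" idea, here is a minimal sketch, assuming (as the traceback above shows) that the failing fileno call happens only at import time; it is not the fork that was ultimately used. The idea is to import speedtest while sys.stdout temporarily points at a real file object that does implement fileno:
import os
import sys

def import_speedtest_with_real_stdout():
    # Airflow swaps sys.stdout for a StreamLogWriter that has no fileno(),
    # while speedtest.py calls sys.stdout.fileno() at import time. Point
    # sys.stdout at a real stream just for the import, then restore it.
    original_stdout = sys.stdout
    real_stream = open(os.devnull, 'w')
    try:
        sys.stdout = real_stream
        import speedtest  # module-level _Py3Utf8Output(sys.stdout) now sees a real stream
    finally:
        sys.stdout = original_stdout
        real_stream.close()
    return speedtest
Whether this is sufficient depends on how speedtest later writes its output; if the module keeps a reference to the temporary stream, further wrapping would be needed.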

Unable to import module 'lambda_function': No module named 'error'

I have a simple Python script that uses the Elasticsearch module "curator" to make snapshots.
I've tested my code locally and it works.
Now I want to run it in an AWS Lambda, but I get this error:
Unable to import module 'lambda_function': No module named 'error'
Here is how I proceeded:
I manually created a Lambda and gave it an "AISA-BasicLambdaExecutionRole" role. Then I created my package with my function and the dependencies, which I installed with the command:
pip install elasticsearch-curator -t /<path>/myRepository
I zipped the content (not the folder) and uploaded it to my Lambda.
I changed the Handler name to "lambda_function.lambda_handler" (my function's file is "lambda_function.py").
Did I miss something? This is my first time working with Lambda and Python.
I've seen the other questions about this error :
"errorMessage": "Unable to import module 'lambda_function'"
But nothing works for me.
EDIT:
Here is my lambda_function:
from __future__ import print_function
import curator
import time
from curator.exceptions import NoIndices
from elasticsearch import Elasticsearch
def lambda_handler(event, context):
    es = Elasticsearch()
    index_list = curator.IndexList(es)
    index_list.filter_by_regex(kind='prefix', value="logstash-")
    Number = 1
    try:
        while Number <= 3:
            Name = "snapshotLmbd_n_" + str(Number) + ""
            curator.Snapshot(index_list, repository="s3-backup", name=Name, wait_for_completion=True).do_action()
            Number += 1
            print('Just taking a nap ! will be back soon')
            time.sleep(30)
    except KeyboardInterrupt:
        print('My bad ! I interrupted this')
        return
Thank you for your time.
Ok, since you have everything else correct, check the permissions of the Python script.
It must have executable permissions (755).
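A small sketch of how one might normalize those permissions across the whole deployment package before zipping it (package_dir is illustrative and reuses the placeholder path from the question):
import os

package_dir = '/<path>/myRepository'  # illustrative: the folder pip installed into

for root, dirs, files in os.walk(package_dir):
    for name in dirs + files:
        # rwxr-xr-x so Lambda can read (and traverse/execute) everything in the package
        os.chmod(os.path.join(root, name), 0o755)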

How to validate an airflow DAG with a custom operator?

The airflow docs suggest that a basic sanity check for a DAG file is to interpret it, i.e.:
$ python ~/path/to/my/dag.py
I've found this to be useful. However, now I've created a plugin, MordorOperator, under $AIRFLOW_HOME/plugins:
from airflow.plugins_manager import AirflowPlugin
from airflow.utils.decorators import apply_defaults
from airflow.operators import BaseOperator
from airflow.exceptions import AirflowException
import pika
import json
class MordorOperator(BaseOperator):
    JOB_QUEUE_MAPPING = {"testing": "testing"}

    @apply_defaults
    def __init__(self, job, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # stuff

    def execute(self, context):
        # stuff

class MordorPlugin(AirflowPlugin):
    name = "MordorPlugin"
    operators = [MordorOperator]
I can import the plugin and see it work in a sample DAG:
from airflow import DAG
from airflow.operators import MordorOperator
from datetime import datetime
dag = DAG('mordor_dag', description='DAG with a single task', start_date=datetime.today(), catchup=False)
hello_operator = MordorOperator(job="testing", task_id='run_single_task', dag=dag)
However, when I try to interpret this file, I get failures which I suspect I shouldn't get, since the plugin successfully runs. My suspicion is that there is some dynamic code generation happening at runtime which isn't available when a DAG is interpreted by itself. I also find that PyCharm can't perform any autocompletion when importing the plugin.
(venv) 3:54PM /Users/paymahn/solvvy/scheduler mordor.operator ✱
❮❮❮ python dags/mordor_test.py
section/key [core/airflow-home] not found in config
Traceback (most recent call last):
File "dags/mordor_test.py", line 2, in
from airflow.operators import MordorOperator
ImportError: cannot import name 'MordorOperator'
How can a DAG using a plugin be sanity tested? Is it possible to get PyCharm to give autocompletion for the custom operator?
I'm running airflow in a Docker container and have a script which runs as the container's entry point. It turns out that the plugins folder wasn't available to my container when I was running my tests, and I had to add a symlink in the container as part of the setup script. The solution to my problem is highly specific to me, and if someone else stumbles upon this, I don't have a good answer for your situation other than: make sure your plugins folder is correctly available.
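For the original question of sanity-testing a DAG that uses a plugin, one option (a sketch, assuming the plugins folder is visible to Airflow) is to load the DAG through Airflow's DagBag instead of interpreting the file directly, since DagBag loading goes through the same machinery the scheduler uses and lets the plugin register before the DAG file is parsed:
from airflow.models import DagBag

# dag_folder is illustrative; point it at the folder containing mordor_test.py.
bag = DagBag(dag_folder='dags/', include_examples=False)
print(bag.import_errors)            # an empty dict means every DAG file parsed cleanly
assert 'mordor_dag' in bag.dags     # the DAG defined in the sample file above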

AttributeError: 'module' object has no attribute 'vtkvmtkVesselEnhancingDiffusion3DImageFilter'

I get the following error when I try to run the script for the VEDm filter in vmtk:
[location] line 149, in ApplyVEDManniesing
vesselness = vtkvmtk.vtkvmtkVesselEnhancingDiffusion3DImageFilter()
AttributeError: 'module' object has no attribute 'vtkvmtkVesselEnhancingDiffusion3DImageFilter'
I tried to run the code in my Python 3.6 environment instead of the 2.7 environment I currently work in, but it gave the same error.
The other filters (Sato, Frangi, VED) completed successfully and are implemented in the same Python file, vmtkimagevesselenhancement.py.
Can someone help me to find the problem?
Edit: this is what the beginning of the code looks like:
from __future__ import absolute_import #NEEDS TO STAY AS TOP LEVEL MODULE FOR Py2-3 COMPATIBILITY
import vtk
import sys
import pypes
import vtkvmtk
The file in which line 149 fails is vmtkimagevesselenhancement.py, and it is included with the vmtk download.
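One way to check whether the missing class is simply absent from the installed vtkvmtk build (a diagnostic sketch, not a fix) is to list which vessel-enhancing names the module actually exposes:
import vtkvmtk

# If vtkvmtkVesselEnhancingDiffusion3DImageFilter is not in this list,
# the installed build does not ship that filter at all.
print([name for name in dir(vtkvmtk) if 'VesselEnhancing' in name])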

While using jenkins API, getting a failure on reconfig_job

I am using the Jenkins REST API to recurse through jobs and then reconfigure this one. All methods work except one. Here is my code:
def get_server_instance():
    jenkins_url = 'xxxx'
    #server = Jenkins(jenkins_url, username = '', password = '')
    # Connect to instance - username and password are optional
    server = jenkins.Jenkins(jenkins_url, username = '', password = '')
    return server

def get_job_details():
    # Refer Example #1 for definition of function 'get_server_instance'
    server = get_server_instance()
    for job in server.get_jobs_list():
        if job == "GithubMigration":
            configuration = server.get_job(job).get_config().encode('utf-8')
            #server.reconfig_job(job, configuration)
            if server.has_job("GithubMigration"):
                server.reconfig_job('GithubMigration', config_xml)
It gets my configuration.xml and finds the job as well, but it fails on server.reconfig_job('GithubMigration', config_xml) with the error AttributeError: 'Jenkins' object has no attribute 'reconfig_job',
when obviously this function exists in the Jenkins REST API, and yes, I'm importing jenkins: from jenkinsapi.jenkins import Jenkins.
Edit 1 - I uninstalled Jenkinsapi and have only the python-jenkins module, and now it fails even earlier, saying
AttributeError: 'module' object has no attribute 'Jenkins'
Any ideas?
Edit 2:
I tried solely the python-jenkins API and tried their own example, as you can see here: http://python-jenkins.readthedocs.org/en/latest/example.html
import jenkins
j = jenkins.Jenkins('http://your_url_here', 'username', 'password')
j.get_jobs()
j.create_job('empty', jenkins.EMPTY_CONFIG_XML)
j.disable_job('empty')
j.copy_job('empty', 'empty_copy')
j.enable_job('empty_copy')
j.reconfig_job('empty_copy', jenkins.RECONFIG_XML)
Even this fails at jenkins.Jenkins with an attribute error on Jenkins - no module.
I am pretty sure the API is broken.
Your script is probably importing the wrong module. You can check it as follows:
import jenkins
print jenkins.__file__
If the printed path is different from the installation path of the jenkins module (e.g. C:\Python27_32\lib\site-packages\jenkins\__init__.pyc), then you should check the Python path:
import sys
print sys.path
A common problem is the existence of a Python script with the same name as the imported module in the current directory, which is at the first place in the search path ('').
For more info on import order, see module search path.
Following @Chemik's answer, I realized that the script I wrote was named jenkins.py and it was conflicting with the python-jenkins import.
The library isn't broken. Check your script name.
I had to add another solution. While running the same command
server = jenkins.Jenkins(jenkins_url, username = '', password = '')
I got the error:
'jenkins' has no attribute 'Jenkins'
My mistake was made when installing the package: I installed the package "jenkins", while the package I needed is "python-jenkins".
Docs can be found here:
python-jenkins docs
So what I had to do was just:
pip install python-jenkins
