I created a very simple DAG to execute a Python file using PythonOperator. I'm using a Docker image to run Airflow, but it doesn't recognize the module where I have my .py file.
The structure is like this:
main_dag.py
plugins/__init__.py
plugins/njtransit_scrapper.py
plugins/sql_queries.py
plugins/config/config.cfg
Command to run the docker airflow image:
docker run -p 8080:8080 -v /My/Path/To/Dags:/usr/local/airflow/dags puckel/docker-airflow webserver
I already tried airflow initdb and restarting the web server, but it keeps showing the error: ModuleNotFoundError: No module named 'plugins'
For the import statement I'm using:
from plugins import njtransit_scrapper
This is my PythonOperator:
tweets_load = PythonOperator(
    task_id='Tweets_load',
    python_callable=njtransit_scrapper.main,
    dag=dag
)
My njtransit_scrapper.py file just collects all tweets for a Twitter account and saves the result to a Postgres database.
If I remove the PythonOperator code and its imports, the DAG works fine. I have already tested almost everything, but I'm not quite sure whether this is a bug or something else.
Is it possible that when I created the volume for the Docker image, it only imported the main DAG and stopped there, so the entire package was never imported?
To help others who might land on this page and get this error because of the same mistake I made, I will record it here.
I had an unnecessary __init__.py file in the dags/ folder.
Removing it solved the problem, and allowed all the dags to find their dependency modules.
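For reference, a minimal sketch of a DAG file that imports cleanly once the stray __init__.py is gone (the dag_id, start date, and schedule here are hypothetical; plugins/ is assumed to be on Airflow's path as in the question):
from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from plugins import njtransit_scrapper

dag = DAG(
    dag_id='njtransit_tweets',        # hypothetical dag_id
    start_date=datetime(2020, 1, 1),  # hypothetical start date
    schedule_interval='@daily'        # hypothetical schedule
)

tweets_load = PythonOperator(
    task_id='Tweets_load',
    python_callable=njtransit_scrapper.main,
    dag=dag
)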
Related
I'm trying to submit my pyspark code through a cron job. When I run it manually, it works fine; through cron, it doesn't.
Here is the project structure I have:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
The main code lies in execute_metrics.py under src/jobs. I'm using get_spark_session.py in execute_metrics.py via from src.utils import get_spark_session.
I created a shell script execute_metric.sh with the content below for executing the cron job:
#!/bin/bash
PATH=<included entire path here>
spark-submit <included required options> src/jobs/execute_metrics.py
The project structure now looks like:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
When I run this shell script using ./execute_metric.sh, I'm able to see the results.
Now I need the job to run every minute, so I created a cron file with the content below and copied it into the same directory:
* * * * * ./execute_metric.sh > execute_metric_log.log
The project structure is now:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
|--execute_cron.crontab
The cron job runs every minute, but gives me the error:
ModuleNotFoundError: No module named 'src'
Can someone please tell me what went wrong here?
Thanks in advance
Your module directories are not getting onto the Python path. Try one of the following:
Explicitly set PYTHONPATH:
#!/bin/bash
PATH=<included entire path here>
export PYTHONPATH=somewhere/my-project
spark-submit <included required options> src/jobs/execute_metrics.py
(PYTHONPATH has to point at the directory that contains src, since the import is from src.utils, and it has to be exported so that spark-submit inherits it.)
Run spark-submit from your project directory:
#!/bin/bash
PATH=<included entire path here>
cd somewhere/my-project
spark-submit <included required options> src/jobs/execute_metrics.py
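Either way, keep in mind that cron starts jobs in the user's home directory, not in your project directory, so the crontab entry itself also needs absolute paths. A sketch, with /somewhere standing in for the real location:
* * * * * cd /somewhere/my-project && ./execute_metric.sh > /somewhere/my-project/execute_metric_log.log 2>&1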
I got it fixed by adding a main.py file in the project directory and changing my cron job to execute main.py. The project structure now looks like:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
|--execute_cron.crontab
|--main.py
In main.py, I'm invoking the functions of execute_metrics.py.
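A minimal sketch of what that main.py can look like (the entry-point function name run is an assumption; use whatever execute_metrics.py actually exposes):
# main.py: thin entry point at the project root.
# spark-submit puts the main script's directory on sys.path,
# so the 'src' package becomes importable from here.
from src.jobs import execute_metrics

if __name__ == '__main__':
    execute_metrics.run()  # hypothetical entry-point function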
Importing packages that were installed subsequently (not present by default in the Python distribution for RHEL 7.6) does not work when run as a cron job
Hi Team,
I have a Python (2.7) script which imports the paramiko package. The script can import paramiko successfully when run as a user (root or ftpuser) after logging in, but it cannot import it when run from a cron job. I have tried various options provided in brilliant Stack Overflow pages like the one below, but unfortunately couldn't resolve the issue.
1) Crontab not running my python script
I have provided the path to the paramiko package and verified, by logging it, that it is successfully received at the script end when run as a cron job. I have also given chmod -R 777 permission to the paramiko folder in the /opt/rh/python27/root/usr/lib/python2.7/site-packages location. Still, the import does not work when run as a cron job.
I created a shell script, tried to invoke the Python script from within it, and configured the shell script in the cron job, but it seems the Python script was not invoked.
I have verified that there is only one Python installation present on the server, so I am using the correct path.
I have disabled SELinux and tried after rebooting, but the issue still persists.
Please note the issue exists not just for the paramiko package but also for other packages that were installed subsequently, like mysql.connector, etc.
Update 1
It has to be something to do with the way I installed the paramiko package, because the script can import other packages in the same path as paramiko, and the permissions for both look identical. The only difference is that the former comes with the Python distribution deployed using the URL https://access.redhat.com/solutions/1519803. I cannot figure out what is wrong with the installation steps, as I install it as root after doing sudo su and then setting umask to 0022. I do a pip install of paramiko and python-crontab as described on their sites, and both have the same issue.
Another interesting thing: although I have code to log an exception around the failing import statement, it never logs the exception; the script seems to halt/hang at the import statement.
Please help to resolve this issue...
PYTHON CODE
#!/usr/bin/env python
import sys
import logging
import os

def InitLog():
    logging.basicConfig(
        level=logging.DEBUG,
        format='%(asctime)s %(levelname)s %(message)s',
        filename=os.path.dirname(os.path.abspath(__file__)) + '/test_paramiko.log',
        filemode='a'
    )
    logging.info('***************start logging****************')

InitLog()
logging.info('before import')
logging.info(sys.path)
try:
    sys.path.append("/opt/rh/python27/root/usr/lib/python2.7/site-packages")
    logging.info("sys path appended before import")
    import paramiko
except ImportError:
    logging.error("Exception occurred during import")
logging.info('after import')
CRONTAB Entry
SHELL=/bin/bash
PATH=/opt/rh/python27/root/usr/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ftpuser/.local/bin:/home/ftpuser/bin
PYTHONPATH=/opt/rh/python27/root/usr/lib64/python27.zip:/opt/rh/python27/root/usr/lib64/python2.7:/opt/rh/python27/root/usr/lib64/python2.7/plat-linux2:/opt/rh/python27/root/usr/lib64/python2.7/lib-tk:/opt/rh/python27/root/usr/lib64/python2.7/lib-old:/opt/rh/python27/root/usr/lib64/python2.7/lib-dynload:/opt/rh/python27/root/usr/lib64/python2.7/site-packages:/opt/rh/python27/root/usr/lib/python2.7/site-packages
*/1 * * * * /opt/rh/python27/root/usr/bin/python /home/ftpuser/Ganesh/test_paramiko.py
#*/1 * * * * /home/ftpuser/Ganesh/test_cron.sh >> /home/ftpuser/Ganesh/tes_cron.txt 2>&1
#*/1 * * * * /home/ftpuser/Ganesh/test_cron.sh
Shell script
#!/bin/bash
export PATH=$PATH:/opt/rh/python27/root/usr/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/ftpuser/.local/bin:/home/ftpuser/bin
export PYTHONPATH=$PYTHONPATH:/opt/rh/python27/root/usr/lib64/python27.zip:/opt/rh/python27/root/usr/lib64/python2.7:/opt/rh/python27/root/usr/lib64/python2.7/plat-linux2:/opt/rh/python27/root/usr/lib64/python2.7/lib-tk:/opt/rh/python27/root/usr/lib64/python2.7/lib-old:/opt/rh/python27/root/usr/lib64/python2.7/lib-dynload:/opt/rh/python27/root/usr/lib64/python2.7/site-packages:/opt/rh/python27/root/usr/lib/python2.7/site-packages
python /home/ftpuser/Ganesh/test_paramiko.py
The expected result from my Python script is to log the "after import" string.
But currently it prints only up to "sys path appended before import", which also shows that the standard Python packages are being imported successfully.
This seems to be working now after adding one more environment variable to the crontab, as below:
LD_LIBRARY_PATH=/opt/rh/python27/root/usr/lib64
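For completeness, the crontab header then carries the extra variable alongside the existing ones (PATH and PYTHONPATH abbreviated here; they are exactly as in the question):
SHELL=/bin/bash
PATH=/opt/rh/python27/root/usr/bin:...          (as in the question)
PYTHONPATH=/opt/rh/python27/root/usr/lib64/...  (as in the question)
LD_LIBRARY_PATH=/opt/rh/python27/root/usr/lib64
*/1 * * * * /opt/rh/python27/root/usr/bin/python /home/ftpuser/Ganesh/test_paramiko.py
This fits the symptoms: paramiko pulls in compiled extensions built against the software collection's shared libraries in /opt/rh/python27/root/usr/lib64, and cron's minimal environment never sources the SCL enable script that would normally export LD_LIBRARY_PATH.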
I have a problem running a pyspark script through oozie, using hue. I can run the same code included in a script through a notebook or with spark-submit without error, leading me to suspect that something in my oozie workflow is misconfigured. The Spark action part of the XML generated for my workflow is:
<action name="spark-51d9">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>MySpark</name>
<jar>myapp.py</jar>
<file>/path/to/local/spark/hue-oozie-1511868018.89/lib/MyScript.py#MyScript.py</file>
</spark>
<ok to="hive2-07c2"/>
<error to="Kill"/>
</action>
The only message I find in my logs is:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]
This is what I have tried so far without solving the problem:
I have tried running it in both yarn client and cluster modes. I have also tried using paths both to a separate directory and to the lib directory of the hue-generated oozie workflow directory, which is where my script lives. I think it can find the script, because if I specify another directory I get a message that it is not found. Any help with this is greatly appreciated.
The way this works for me is:
First you create an sh file that will run your Python script. The file should have the spark-submit command, then all the flags you need, and your script name at the end. Cleaned up, something like:
spark-submit \
    --master yarn-cluster \
    --executor-cores 3 \
    --conf spark.executor.extraClassPath=jar1.jar:jar2.jar \
    --driver-class-path jar1.jar:jar2.jar:jar3.jar \
    my_pyspark_script.py
Then you create a workflow, choose the shell option, and add your sh file both as the "shell command" and in "files".
From here it's a bit of work to make sure everything is connected properly.
For example, I had to add "export" statements in my sh file so that my "spark/conf" would be picked up properly; see the sketch below.
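As a hedged sketch of that kind of sh file (the install and config paths below are assumptions; substitute your cluster's locations):
#!/bin/bash
# oozie shell actions start with a minimal environment, so export
# whatever spark-submit needs in order to find its configuration
export SPARK_HOME=/opt/spark             # assumed install location
export SPARK_CONF_DIR=$SPARK_HOME/conf   # the "spark/conf" mentioned above
export HADOOP_CONF_DIR=/etc/hadoop/conf  # assumed config location
$SPARK_HOME/bin/spark-submit --master yarn-cluster my_pyspark_script.py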
I have a project I'm trying to test and run on Jenkins. On my machine it works fine, but when I try to run it in Jenkins, it fails to find a module in the workspace.
In the main workspace directory, I run the command:
python xtests/app_verify_auto.py
And get the error:
+ python /home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py
Traceback (most recent call last):
File "/home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py", line 19, in <module>
import exit103.data.db as db
ImportError: No module named exit103.data.db
Build step 'Execute shell' marked build as failure
Finished: FAILURE
The directory exit103/data exists in the workspace and is a correct path, but python can't seem to find it.
This error exists both with and without virtualenv.
It may be caused by your PATH setting not being right in the Jenkins environment. In fact, the environments for your default user and the jenkins user are not the same.
You can check what PATH and PYTHONPATH are in your jenkins-user environment.
Try running shell commands in Jenkins, such as echo $PATH, to see what they are.
Most of the time, you need to set the PATH yourself.
You may reference this answer.
Jenkins: putting my Python module on the PYTHONPATH
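For example, in the job's Execute shell step (WORKSPACE is an environment variable Jenkins sets to the job's workspace root; the layout is the one from the question):
# run from the workspace root and put it on Python's module search path,
# so that 'import exit103.data.db' resolves
cd "$WORKSPACE"
export PYTHONPATH="$WORKSPACE${PYTHONPATH:+:$PYTHONPATH}"
python xtests/app_verify_auto.py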
Faced the same issue.
For others who are reading this: run the build on your master node; that fixed the problem for me.
Running the build on a slave node didn't give the workspace proper access to all the Python modules and other commands such as jq.
I feel I set everything up correctly. I followed these instructions and installed from the tar file.
My home directory has a folder "gsutil" now. I ran through the configuration to set my app up for oauth2, and am able to call gsutil from the command line. To use gsutil and Google App Engine, I added the following lines to the .bashrc file in my Home directory and sourced it:
export PATH=$PATH:$HOME/google_appengine
export PATH=${PATH}:$HOME/gsutil
export PYTHONPATH=${PYTHONPATH}:$HOME/gsutil/third_party/boto:$HOME/gsutil
However, when I try to import in my python script by either:
import gsutil
Or something like this (straight from the documentation).
from gslib.third_party.oauth2_plugin import oauth2_plugin
I get errors like:
ImportError: No module named gslib.third_party.oauth2_plugin
Did I miss a step somewhere? Thanks
EDIT:
Here is the output of (','.join(sys.path)):
import sys; print(', '.join(sys.path))
, /usr/local/lib/python2.7/dist-packages/setuptools-1.4.1-py2.7.egg, /usr/local/lib/python2.7/dist-packages/pip-1.4.1-py2.7.egg, /usr/local/lib/python2.7/dist-packages/gsutil-3.40-py2.7.egg, /home/[myname], /home/[myname]/gsutil/third_party/boto, /home/[myname]/gsutil, /usr/lib/python2.7, /usr/lib/python2.7/plat-linux2, /usr/lib/python2.7/lib-tk, /usr/lib/python2.7/lib-old, /usr/lib/python2.7/lib-dynload, /usr/local/lib/python2.7/dist-packages, /usr/lib/python2.7/dist-packages, /usr/lib/python2.7/dist-packages/PIL, /usr/lib/python2.7/dist-packages/gst-0.10, /usr/lib/python2.7/dist-packages/gtk-2.0, /usr/lib/python2.7/dist-packages/ubuntu-sso-client, /usr/lib/python2.7/dist-packages/ubuntuone-client, /usr/lib/python2.7/dist-packages/ubuntuone-control-panel, /usr/lib/python2.7/dist-packages/ubuntuone-couch, /usr/lib/python2.7/dist-packages/ubuntuone-installer, /usr/lib/python2.7/dist-packages/ubuntuone-storage-protocol
EDIT 2:
I can import the module from the command line, but can't from within my Google App Engine app.
Here is the first line of the output using python -v:
import gsutil
/home/adrian/gsutil/gsutil.pyc matches /home/adrian/gsutil/gsutil.py
But when I try to import it from an app, I get this message:
import gsutil
ImportError: No module named gsutil
gsutil is intended to be used only from the command line. If you want to interact with Cloud Storage from within an App Engine application, you should use the Cloud Storage client library: https://developers.google.com/appengine/docs/java/googlecloudstorageclient/
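For the Python runtime, the equivalent is the GoogleAppEngineCloudStorageClient package; a minimal sketch (the bucket and object names below are hypothetical):
import cloudstorage as gcs

# object paths take the form /bucket/object
with gcs.open('/my-bucket/tweets.txt') as f:  # hypothetical object, read mode
    contents = f.read()

with gcs.open('/my-bucket/out.txt', 'w', content_type='text/plain') as f:
    f.write('hello')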