I have a problem running pyspark script through oozie, using hue. I can run the same code included in a script through a notebook or with spark-submit without error, leading me to suspect that something in my oozie workflow is misconfigured. The spark action part of generated for my workflow xml is:
<action name="spark-51d9">
<spark xmlns="uri:oozie:spark-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<master>yarn</master>
<mode>cluster</mode>
<name>MySpark</name>
<jar>myapp.py</jar>
<file>/path/to/local/spark/hue-oozie-1511868018.89/lib/MyScript.py#MyScript.py</file>
</spark>
<ok to="hive2-07c2"/>
<error to="Kill"/>
</action>
The only message I find in my logs is:
Launcher ERROR, reason: Main class [org.apache.oozie.action.hadoop.SparkMain], exit code [1]
This is what I have tried so far without solving the problem:
I have tried running it both in yarn client and cluster modes. I have also both tried using paths to a separate directory, and to the hue-generated oozie workflow directory's lib directory in which I have my script. I think that it can find the script, because if I specify another directory I get a message that it is not found. Any help with this is greatly appreciated.
The way this works for me is:
First you create an sh file that will run your python script.
The file should have the sumbit command:
....spark-submit
then all the flags you need:
--master yarn-cluster......--conf executer-cores 3 .......conf spark.executor.extraClassPath=jar1.jar:jar2.jar --driver-class-path jar1.jar:jar2.jar:jar3.jar
and at the end:
..... my_pyspark_script.py
Then you create a workflow and you choose the shell option and add your sh file as the "shell command" and in "files"
From here it's a bit of work to make sure everything is connected properly.
For example I had to add "export" in my sh file so that my "spark/conf" will be properly added.
Related
I'm trying to submit my pyspark code through cron job. When I run manually, its working fine. Through cron its not working.
Here is the project structure I have:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
The main code lies in execute_metrics.py from src/jobs. I'm using get_spark_session.py
in execute_metrics.py using from src.utils import get_spark_session.
I created a shell script execute_metric.sh with below content for executing the cron job
#!/bin/bash
PATH=<included entire path here>
spark-submit <included required options> src/jobs/execute_metrics.py
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
When I run this shell script using ./execute_metric.sh, I'm able to see the results.
Now, I need this to run the job every minute. So, I created a cron file with below content and copied in the same directory
* * * * * ./execute_metric.sh > execute_metric_log.log
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
|--execute_cron.crontab
This cron is running for every minute, but giving me the error:
ModuleNotFoundError: No module named 'src'
Can someone please tell me what went wrong here?
Thanks in advance
Your module directories are not getting into the python path. Try one of the following:
Explicitly set the PYTHONPATH:
#!/bin/bash
PATH=<included entire path here>
PYTHONPATH=somewhere/my-project/src
spark-submit <included required options> src/jobs/execute_metrics.py
Invoke the spark shell from your project directory:
#!/bin/bash
PATH=<included entire path here>
cd somewhere/my-project/src
spark-submit <included required options> execute_metrics.py
I got it fixed by adding a main.py file in the project directory and changed my cron to execute main.py. The project structure now looks like:
my-project
|
|--src
|----jobs
|------execute_metrics.py
|----utils
|------get_spark_session.py
|--execute_metric.sh
|--execute_cron.crontab
|--main.py
In main.py, I'm invoking the functions of execute_metrics.py.
#!/usr/bin/python
import requests, zipfile, StringIO, sys
extractDir = "myfolder"
zip_file_url = "download url"
response = requests.get(zip_file_url)
zipDocument = zipfile.ZipFile(StringIO.StringIO(response.content))
zipinfos = zipDocument.infolist()
for zipinfo in zipinfos:
extrat = zipDocument.extract(zipinfo,path=extractDir)
System configuration
Ubuntu OS 16.04
Python 2.7.12
$ python extract.py
when I run the code on Terminal with above command, it works properly and create the folder and extract the file into it.
Similarly, when I create a cron job using sodu rights the code executes but don't create any folder or extracts the files.
crontab command:-
40 10 * * * /usr/bin/sudo /usr/bin/python /home/ubuntu/demo/directory.py > /home/ubuntu/demo/logmyshit.log 2>&1
also tried
40 10 * * * /usr/bin/python /home/ubuntu/demo/directory.py > /home/ubuntu/demo/logmyshit.log 2>&1
Notes :
I check the syslog, it says the cron is running successfully
The above code gives no errors
also made the python program executable by chmod +x filename.py
Please help where am I going wrong.
Oups, there is nothing really wrong in running a Python script in crontab, but many bad things can happen because the environment is not the one you are used to.
When you type in an interactive shell python directory.py, the PATH and all required PYTHON environment variable have been set as part of login and interactive shell initialization, and the current directory is your home directory by default or anywhere you currently are.
When the same command is run from crontab, the current directory is not specified (but may not be what you expect), PATH is only /bin:/usr/bin and python environment variables are not set. That means that you will have to tweak environment variables in crontab file until you get a correct Python environment, and set the current directory.
I had a very similar problem and it turned out cron didn’t like importing matplotlib, I ended up having to specify Agg backend. I figured it out by putting log statements after each line to see how far the program got before it crapped out. Of course, my log was empty which tipped me off that it crashed on imports.
TLDR: log each line inside the script
I'm trying to install VATIC Video Annotation Tool on Linux. I followed the instructions in README file twice, always failing to execute command:
$ turkic setup --database
which gives these two error messages:
No handlers could be found for logger "turkic.geolocation"
Error: Unknown action setup
Other turkic commands, e.g. turkic status --verify give the same error messages (for a given action name).
I also noticed that source file ~/vatic/public/index.html contains links to stylesheets and scripts in turkic folder src="/turkic/file_name", which can't be reached. Their true location is in ~/turkic/turkic/public.
Any ideas what can be wrong?
You should go into vatic folder when executing any commands starting with turkic.
Only inside vatic folder "actions" will be recognized.
Make sure that you issue the symbolic link command:
$ turkic setup --public-symlink
I have a project I'm trying to test and run on Jenkins. On my machine it works fine, but when I try to run it in Jenkins, it fails to find a module in the workspace.
In the main workspace directory, I run the command:
python xtests/app_verify_auto.py
And get the error:
+ python /home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py
Traceback (most recent call last):
File "/home/tomcat7/.jenkins/jobs/exit103/workspace/xtests/app_verify_auto.py", line 19, in <module>
import exit103.data.db as db
ImportError: No module named exit103.data.db
Build step 'Execute shell' marked build as failure
Finished: FAILURE
The directory exit103/data exists in the workspace and is a correct path, but python can't seem to find it.
This error exists both with and without virtualenv.
It's may caused by your PATH setting not right in jenkins environment.In fact , the environments for your default user and jenkins-user are not the same.
You may try to find what are the PATH and PYTHONPATH in your jenkins-user environments .
Try to run "shell commands" in jenkins "echo $path" and so on to see what's them are.
In most of time , you need to set the PATH by yourself.
You may reference this answer.
Jenkins: putting my Python module on the PYTHONPATH
Faced the same issue.
For others who are reading this, Run the build in your master node. It fixed the problem for me.
Running the build in the slave node doesn't give proper access to all the python modules and other commands such as jq to the workspace.
I'm trying to get a response from Nagios by using the following Python code and instructions:
http://skipperkongen.dk/2011/12/06/hello-world-plugin-for-nagios-in-python/
From some reason I never get to have OK from Nagios and it's always comes back with the message: Return code 126 is out of bounds - plugin may be missing
I installed nagiosplugin 1.0.0, and still nothing seems to be working
In parallel I have some other services (not python files) that work e.g. http check, current users, and SSH
What am I doing wrong? I'm trying to solve that for few days already
Getting Nagios to utilize your new plug-in is quite easy. You should make changes to three files and restart Nagios — that’s all it takes.
The first file is /etc/nagios/command-plugins.cfg (leave comment please if you know path to this file or analog in ubuntu). Assumed that plugin file is placed in /usr/lib/nagios/plugins/ directory:
command[check_hello_world]=/usr/lib/nagios/plugins/check_helloworld.py -m 'some message'
Drop down one directory to /etc/nagios/objects/commands.cfg (for ubuntu user should create cfg file in that dir /etc/nagios-plugins/config/):
define command {
command_name check_hello_world
command_line $USER1$/check_hello_world.py -m 'some message'
}
Save the file and open up /etc/nagios/objects/localhost.cfg (in ubuntu path to service definition files located in /etc/nagios3/nagios.cfg and by default cfg_dir=/etc/nagios3/conf.d. So, to define new service in ubuntu user should create cfg file in that dir, for example hello.cfg). Locate this section:
#
# SERVICE DEFINITIONS
#
and add new entry:
define service {
use local-service ; Name of service template to use
host_name localhost
service_description Check using the hello world plugin (always returns OK)
check_command check_hello_world
}
All that remains is to restart Nagios and to verify that plug-in is working. Restart Nagios by issuing the following command:
/etc/init.d/nagios restart
http://www.linux-mag.com/id/7706/
ubuntuforums.org - Thread: My Notes for Installing Nagios on Ubuntu Server 12.04 LTS
I had to prepend the path to python2.7 even though the shebang in the file specified it.
In the command definition I had this:
command_line /usr/local/bin/python2.7 $USER1$/check_rabbit_queues.py --host $HOSTADDRESS$ --password $ARG1$
Even though the top of the actual python file had:
#!/usr/bin/env python2.7
Even though the script executed and returned just fine from the command line without specifying the interpreter.
Nothing else I tried seemed to work.