ImportError: No module named numpy on spark workers - python

Launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. The import numpy in the shell works fine, but it fails in KMeans. My feeling is that the executors do not have numpy installed. I didn't find a good solution anywhere for letting the workers know about numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.
import numpy
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape
features_rdd = sc.parallelize(features, 5000)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")
Stack trace
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

To use Spark in Yarn client mode, you'll need to install any dependencies to the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.
spark-submit --master yarn-cluster --py-files my_dependency.zip my_script.py
However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed, you won't be able to distribute numpy in this fashion.

numpy is not installed on the worker (virtual) machines. If you use Anaconda, it's very convenient to upload such Python dependencies when deploying the application in cluster mode. (There is then no need to install numpy or other modules on each machine; instead, they must be installed in your Anaconda environment.)
First, zip your Anaconda installation and put the zip file on the cluster; then you can submit a job using the following script.
spark-submit \
--master yarn \
--deploy-mode cluster \
--archives hdfs://host/path/to/anaconda.zip#python-env \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
app_main.py
Yarn will copy anaconda.zip from the HDFS path to each worker and use python-env/anaconda/bin/python to execute tasks.
Running PySpark with Virtualenv may provide more information.
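The command above can also be assembled programmatically, which keeps the archive alias and the interpreter path in sync. This is only a sketch: the function name build_submit_command and its parameters are hypothetical, and it uses only the options shown in the command above.

```python
def build_submit_command(archive_uri, alias, python_in_archive, app):
    """Return the spark-submit argument list for a zipped Python environment.

    archive_uri:       location of the zipped environment, e.g. an HDFS URI
    alias:             directory name the archive is unpacked under (the part after '#')
    python_in_archive: interpreter path relative to the archive root
    app:               the application script to run
    """
    python_path = f"{alias}/{python_in_archive}"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--archives", f"{archive_uri}#{alias}",
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python_path}",
        app,
    ]


cmd = build_submit_command(
    "hdfs://host/path/to/anaconda.zip",  # placeholder URI from the example above
    "python-env",
    "anaconda/bin/python",
    "app_main.py",
)
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where spark-submit is on the PATH.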

What solved it for me (on Mac) was this guide, which also explains how to run Python through Jupyter Notebooks:
https://medium.com/#yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735
In a nutshell:
(Assuming you installed Spark with brew install apache-spark)
Find the SPARK_PATH using brew info apache-spark
Add these lines to your ~/.bash_profile:
# Spark and Python
######
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, You have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
######
You should be able to open Jupyter Notebook simply by calling:
pyspark
Remember that you don't need to create the SparkContext yourself; simply call:
sc = SparkContext.getOrCreate()

For me, the environment variable PYSPARK_PYTHON was not set, so I edited the /etc/environment file and set the variable to the path of my Python environment:
PYSPARK_PYTHON=/home/venv/python3
Afterwards, the error was gone.

I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machine (with apt-get or yum). The error will also tell you on which machine the import failed.

You need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).
Also make sure to run the pip install numpy command from a root account (sudo does not suffice) after forcing the umask to 022 (umask 022), so that the permissions cascade to the Spark (or Zeppelin) user.

A few things to check:
Install the required packages on the worker nodes with sudo permission so that they are available to all users.
If you have multiple versions of Python on the worker nodes, make sure to install the packages for the Python used by Spark (usually set by PYSPARK_PYTHON).
Finally, to pass custom modules (.py files), use --py-files when starting the session with spark-submit or pyspark.
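To see which interpreter a node actually uses and whether it can import a package, a small probe can help. The sketch below is an illustration only: probe_module is a hypothetical helper name, and the Spark usage in the comments assumes the sc variable provided by the pyspark shell.

```python
import importlib
import socket
import sys


def probe_module(module_name="numpy"):
    """Report whether module_name can be imported by the current interpreter."""
    try:
        importlib.import_module(module_name)
        available = True
    except ImportError:
        available = False
    # Host name and interpreter path show where and with which Python the check ran.
    return socket.gethostname(), sys.executable, available


# Local check on the driver.
print(probe_module("numpy"))

# On a cluster, the same function could be run inside tasks, for example:
#   sc.parallelize(range(100), 100).map(lambda _: probe_module("numpy")).distinct().collect()
# The task count of 100 is arbitrary and does not guarantee that every
# executor is reached.
```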

I had the same issue. Try installing numpy with pip3 if you're using Python 3:
pip3 install numpy

Related

git executable not found in python

I was trying to clone a git repo with an access key, but when I run it, it throws an exception saying the git executable was not found.
But I have installed git, and in_it.py shows the correct path "C:\Program Files\Git\bin". I have also installed gitpython to use the library in Python.
Here's my code:
import git
git.Git("D:/madhav/myrep/").clone("#github.com:myrepo/scripts")
It throws the following exception:
Traceback (most recent call last):
File "C:\Users\1096506\Desktop\gitclone.py", line 1, in <module>
from git import Repo
File "C:\Users\1096506\AppData\Local\Programs\Python\Python36-32\lib\site-packages\git\__init__.py", line 84, in <module>
refresh()
File "C:\Users\1096506\AppData\Local\Programs\Python\Python36-32\lib\site-packages\git\__init__.py", line 73, in refresh
if not Git.refresh(path=path):
File "C:\Users\1096506\AppData\Local\Programs\Python\Python36-32\lib\site-packages\git\cmd.py", line 293, in refresh
raise ImportError(err)
ImportError: Bad git executable. The git executable must be specified in one of the following ways:
- be included in your $PATH
- be set via $GIT_PYTHON_GIT_EXECUTABLE
- explicitly set via git.refresh()
All git commands will error until this is rectified.
This initial warning can be silenced or aggravated in the future by setting the
$GIT_PYTHON_REFRESH environment variable. Use one of the following values:
- quiet|q|silence|s|none|n|0: for no warning or exception
- warn|w|warning|1: for a printed warning
- error|e|raise|r|2: for a raised exception
Example:
export GIT_PYTHON_REFRESH=quiet
I had the same issue. What I did:
I went to System Properties -> Environment Variables.
In the System Variables section I selected Path, clicked Edit, and used Move Up to move the Git entry up two places from the bottom.
The error occurs because git is not on the PATH, so GitPython cannot find the git executable when the git module is imported.
There are a couple of ways to resolve it:
As suggested above, add the git binary path to the PATH environment variable.
If git is not used directly in your module, and the exception is thrown only by a dependent module's import, add the following before importing git:
os.environ["GIT_PYTHON_REFRESH"] = "quiet"
Import git after this line; this suppresses the error raised on import.
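The two options can be combined in a short sketch: look for git on the PATH, fall back to a known install location, and only silence the check if nothing is found. The helper name locate_git and the fallback path are assumptions for illustration; the fallback is based on the install directory mentioned in the question.

```python
import os
import shutil


def locate_git(candidates=()):
    """Return a path to a git executable, or None if none is found.

    Looks on PATH first, then tries each path in `candidates` in order.
    """
    found = shutil.which("git")
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Hypothetical fallback location; adjust to your own install.
git_path = locate_git([r"C:\Program Files\Git\bin\git.exe"])

if git_path:
    # GitPython reads this variable when it is imported.
    os.environ["GIT_PYTHON_GIT_EXECUTABLE"] = git_path
else:
    # No git found: silence the import-time check instead.
    os.environ["GIT_PYTHON_REFRESH"] = "quiet"

# import git  # import GitPython only after the environment is prepared
```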
I had the same problem and got it to work thanks to Muthukumaran's answer. To make that answer clearer,
follow these steps:
import os
os.environ["GIT_PYTHON_REFRESH"] = "quiet"
import git
Check whether Git is installed on the OS.
If not, install git first; this will solve your error.
CentOS
sudo yum -y install git
Ubuntu/Debian
sudo apt-get install git
macOS (Homebrew should not be run with sudo)
brew install git
This solved my problem.
Make sure you're not in an inaccessible directory on *nix. For example, if you've just been root and then run su username,
you may still be in root's home folder, and that will trigger this error (assuming you have the correct environment variables set and have sourced .profile or .bashrc, e.g. with source ~/.bashrc).
which git
/usr/bin/git
I was getting this error even after setting the environment
in ~/.bashrc:
# for bench
PATH=$PATH:/usr/bin/git
export PATH
GIT_PYTHON_GIT_EXECUTABLE=/usr/bin/git
export GIT_PYTHON_GIT_EXECUTABLE
cd
After running cd to return to my own home directory, it works:
$ bench --version
WARN: Command not being executed in bench directory
5.3.0
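A script can guard against this situation by checking the working directory before anything else runs. This is a minimal sketch with hypothetical helper names; falling back to the home directory mirrors what a bare cd does in the shell.

```python
import os


def directory_is_accessible(path):
    """Return True if `path` is an existing directory the current user can read and enter."""
    return os.path.isdir(path) and os.access(path, os.R_OK | os.X_OK)


def current_directory_is_accessible():
    """Check the working directory; os.getcwd() itself can fail if it was removed."""
    try:
        return directory_is_accessible(os.getcwd())
    except OSError:
        return False


if not current_directory_is_accessible():
    # Fall back to the home directory, as a bare `cd` would.
    os.chdir(os.path.expanduser("~"))
```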
I came across a similar problem recently; installing git and then restarting the Windows PowerShell command line solved it. Hope this helps.
For those using a Lambda layer with it: as the comment above says, adding GIT_PYTHON_REFRESH=quiet as an environment variable worked.
Run export GIT_PYTHON_REFRESH=quiet in your terminal and then try to run the code.

Apache Spark with Python: error

New to Spark. Everything downloaded fine, but when I run pyspark I get the following errors:
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/05 20:46:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\bin\..\python\pyspark\shell.py", line 43, in <module>
spark = SparkSession.builder\
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
Also, when I try (as recommended by http://spark.apache.org/docs/latest/quick-start.html)
textFile = sc.textFile("README.md")
I get:
NameError: name 'sc' is not defined
Any advice? Thank you!
If you are doing this from the pyspark console, it may be because your installation did not work.
If not, it's because most examples assume you are testing code in the pyspark console, where a default variable 'sc' exists.
You can create a SparkContext yourself at the beginning of your script using the following code:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
It looks like you've found the answer to the second part of your question in the above answer, but for future users getting here via the 'org.apache.spark.sql.hive.HiveSessionState' error, this class is found in the spark-hive jar file, which does not come bundled with Spark if it isn't built with Hive.
You can get this jar at:
http://central.maven.org/maven2/org/apache/spark/spark-hive_${SCALA_VERSION}/${SPARK_VERSION}/spark-hive_${SCALA_VERSION}-${SPARK_VERSION}.jar
You'll have to put it into your SPARK_HOME/jars folder, and then Spark should be able to find all of the Hive classes required.
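Before downloading anything, it may be worth checking whether the jar is already present. The sketch below follows the file naming pattern from the URL above; the helper names are hypothetical, and the version numbers in the example are illustrative (Spark 2.1.0 as in the question, Scala 2.11 assumed).

```python
import os


def spark_hive_jar_name(scala_version, spark_version):
    """Build the jar file name following the naming pattern in the URL above."""
    return f"spark-hive_{scala_version}-{spark_version}.jar"


def spark_hive_jar_present(spark_home, scala_version, spark_version):
    """Return True if the expected jar already sits in SPARK_HOME/jars."""
    jar = os.path.join(spark_home, "jars", spark_hive_jar_name(scala_version, spark_version))
    return os.path.isfile(jar)


# Example values only; substitute the versions of your own installation.
print(spark_hive_jar_name("2.11", "2.1.0"))  # spark-hive_2.11-2.1.0.jar
```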
I also encountered this issue on Windows 7 with pre-built Spark 2.2. Here is a possible solution for Windows users:
Make sure all the environment variables are set correctly, including SPARK_PATH, HADOOP_HOME, etc.
Get the correct version of winutils.exe for the prebuilt Spark-Hadoop package.
Then open a cmd prompt as Administrator and run this command:
winutils chmod 777 C:\tmp\hive
Note: The drive might be different depending on where you invoke pyspark or spark-shell
This link should take the credit: see the answer by timesking
If you're on a Mac and you've installed Spark (and eventually Hive) through Homebrew, the answers from @Eric Pettijohn and @user7772046 will not work. The former because Homebrew's Spark already contains the aforementioned jar file; the latter because, trivially, it is a purely Windows-based solution.
Inspired by this link and the permission issues hint, I've come up with the following simple solution: launch pyspark using sudo. No more Hive-related errors.
I deleted the metastore_db directory and then things worked. I'm doing some light development on a MacBook -- I had run PyCharm to sync my directory with the server, and I think it picked up that Spark-specific directory and messed it up. For me, the error message came when I was trying to start an interactive ipython pyspark shell.
I had a similar problem. Because I had set Hadoop to YARN mode, my solution was to start HDFS and YARN:
start-dfs.sh
start-yarn.sh
I came across the error:
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
This was because I was already running ./bin/spark-shell.
So just kill that spark-shell and re-run ./bin/pyspark.
You need a compatible winutils.exe in the Hadoop bin directory.

import pymongo_spark doesn't work when executing with spark-submit

I am running into a problem running my script with spark-submit. The main script won't even run because import pymongo_spark raises ImportError: No module named pymongo_spark
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and an online guide as closely as possible and ended up with
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Then I used this piece of code, from the first thread I linked, to test:
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)
if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing things, but I'm not sure where/what I'm missing. I'm running everything locally on Mac OSX El Capitan. Almost sure this doesn't matter, but just wanna add it in.
EDIT:
I also used another jar file mongo-hadoop-1.5.0-SNAPSHOT.jar, the same problem remains
my command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it available, you have to add the directory containing the Python package to PYTHONPATH as well. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is a directory with Spark Connector source.
See also Getting Spark, Python, and MongoDB to work together
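Whether the package directory is actually visible to Python can be checked before calling spark-submit. The sketch below (Python 3) is an illustration: module_available is a hypothetical helper, and it assumes MONGO_SPARK_SRC has been exported as an environment variable pointing at the connector source.

```python
import importlib
import importlib.util
import os
import sys


def module_available(name, extra_dir=None):
    """Return True if `name` can be found, optionally after adding `extra_dir` to sys.path."""
    if extra_dir and extra_dir not in sys.path:
        sys.path.append(extra_dir)
        importlib.invalidate_caches()
    return importlib.util.find_spec(name) is not None


# MONGO_SPARK_SRC is assumed to be exported in the environment.
src_root = os.environ.get("MONGO_SPARK_SRC")
package_dir = os.path.join(src_root, "src", "main", "python") if src_root else None
print(module_available("pymongo_spark", package_dir))
```

If this prints False, the PYTHONPATH export above is either missing or points at the wrong directory.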

Configure Ipython/Jupyter notebook with Pyspark on AWS EMR v4.0.0

I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:
Installing Ipython notebook with pyspark 1.4 on AWS
and
Configuring IPython notebook support for Pyspark
After finishing the configuration, the following is the code in the related files:
1. ipython_notebook_config.py
c=get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser =False
c.NotebookApp.port = 8193
2. 00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I also added the following two lines to my .bash_profile:
export SPARK_HOME='home/hadoop/sparl'
source ~/.bash_profile
However, when I run
ipython notebook --profile=pyspark
it shows the message: unrecognized alias '--profile=pyspark' it will probably have no effect
It seems that the notebook is not configured with pyspark successfully.
Does anyone know how to solve this? Thank you very much.
The following are the software versions:
ipython/Jupyter: 4.0.0
spark 1.4.0
AWS EMR: 4.0.0
python: 2.7.9
By the way, I have read the following, but it doesn't work:
IPython notebook won't read the configuration file
Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is e.g.:
JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook
See also issue jupyter/notebook#309, where you'll find a comment describing how to set up Jupyter notebook with PySpark without profiles or kernels.
This worked for me...
Update ~/.bashrc with:
export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
(Look up the pyspark docs for those arguments.)
Then create a new ipython profile, e.g. pyspark:
ipython profile create pyspark
Then create and add the following lines in ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
(update versions of py4j and spark to suit your case)
Then mkdir -p ~/.ipython/kernels/pyspark and then create and add following lines in the file ~/.ipython/kernels/pyspark/kernel.json
{
 "display_name": "pySpark (Spark 1.6.1)",
 "language": "python",
 "argv": [
  "/usr/bin/python",
  "-m",
  "IPython.kernel",
  "--profile=pyspark",
  "-f",
  "{connection_file}"
 ]
}
Now you should see this kernel, pySpark (Spark 1.6.1), under jupyter's new notebook option. You can test by executing sc and should see your spark context.
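The kernel file can also be generated from Python, which avoids hand-editing JSON. This is a sketch under assumptions: write_pyspark_kernel_spec is a hypothetical helper, and it substitutes the current interpreter (sys.executable) for the hard-coded /usr/bin/python shown above.

```python
import json
import os
import sys


def write_pyspark_kernel_spec(kernel_dir, profile="pyspark",
                              display_name="pySpark (Spark 1.6.1)"):
    """Write a kernel.json like the one shown above and return its path."""
    spec = {
        "display_name": display_name,
        "language": "python",
        "argv": [
            sys.executable,  # current interpreter instead of a fixed path
            "-m", "IPython.kernel",
            "--profile=" + profile,
            "-f", "{connection_file}",
        ],
    }
    os.makedirs(kernel_dir, exist_ok=True)
    path = os.path.join(kernel_dir, "kernel.json")
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=1)
    return path


# Example call (not executed here, since it writes into the home directory):
# write_pyspark_kernel_spec(os.path.expanduser("~/.ipython/kernels/pyspark"))
```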
I tried many ways to solve this 4.0 version problem, and finally I decided to install version 3.2.3 of IPython:
conda install 'ipython<4'
It's amazing! I hope this helps you all.
ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA
As people commented, in Jupyter you don't need profiles. All you need to do is export the variables for jupyter to find your spark install (I use zsh but it's the same for bash)
emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
It is important to include pyspark-shell in PYSPARK_SUBMIT_ARGS.
I found this guide useful but not fully accurate.
My config is local, but it should work if you set PYSPARK_SUBMIT_ARGS to the values you need.
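The same path setup can be done from inside Python without hard-coding the py4j version, which differs between Spark releases. The sketch below is illustrative: the helper names are hypothetical, and it assumes the usual layout of a Spark distribution (python/ and python/lib/py4j-*-src.zip under SPARK_HOME).

```python
import glob
import os
import sys


def pyspark_import_paths(spark_home):
    """Return the directory and zip files that make `import pyspark` work."""
    python_dir = os.path.join(spark_home, "python")
    # Match whichever py4j version ships with this Spark build.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_pyspark_to_path(spark_home):
    """Prepend the PySpark paths to sys.path, skipping entries already present."""
    for entry in reversed(pyspark_import_paths(spark_home)):
        if entry not in sys.path:
            sys.path.insert(0, entry)


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    add_pyspark_to_path(spark_home)
```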
I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to ipython 3.2.1, you will be able to specify the profile again.

Include package in Spark local mode

I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests with Spark in local mode:
conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)
My question is, since pytest isn't using spark-submit to run my code, how can I provide my spark-csv dependency to the python process?
You can use the spark.driver.extraClassPath property in your config file to sort out the problem.
Open spark-defaults.conf
and add the property
spark.driver.extraClassPath /Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/spark-csv_2.11-1.1.0.jar:/Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/commons-csv-1.1.jar
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are important, as spark-csv depends on the Apache commons-csv jar. You can either build the spark-csv jar or download it from the mvn site.
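Another approach, not taken from the answer above, is to set PYSPARK_SUBMIT_ARGS in the test process before the SparkContext is created, since PySpark consults that variable when it launches the JVM. Treat the following as a sketch: set_submit_packages is a hypothetical helper, and the package coordinate is the one from the question.

```python
import os


def set_submit_packages(*coordinates):
    """Put a --packages option into PYSPARK_SUBMIT_ARGS for a locally created SparkContext."""
    # The trailing pyspark-shell token is required for the arguments to be accepted.
    args = "--packages {} pyspark-shell".format(",".join(coordinates))
    os.environ["PYSPARK_SUBMIT_ARGS"] = args
    return args


# Must run before the first SparkContext is created in the process,
# for example at the top of a pytest conftest.py.
set_submit_packages("com.databricks:spark-csv_2.10:1.0.3")

# from pyspark import SparkConf, SparkContext
# conf = SparkConf().setAppName('myapp').setMaster('local[1]')
# sc = SparkContext(conf=conf)
```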
