I'm trying to set up a standalone instance of Spark locally on a Mac and use the Python 3 API. To do this I've done the following:
1. I've downloaded and installed Scala and Spark.
2. I've set up the following environment variables,
#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin
#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#Python
alias python="python3"
alias pip="pip3"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
Now when I run the command
pyspark --master local[2]
and type sc in the notebook, I get the following:
SparkContext
Spark UI
Version: v2.2.1
Master: local[2]
AppName: PySparkShell
Clearly my SparkContext is not initialized. I'm expecting to see an initialized SparkContext object.
What am I doing wrong here?
Well, as I have argued elsewhere, setting PYSPARK_DRIVER_PYTHON to jupyter (or ipython) is a really bad and plain wrong practice, which can lead to unforeseen outcomes downstream, such as when you try to use spark-submit with the above settings...
There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list command to get the list of any already available kernels on your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
python2 /usr/lib/python2.7/site-packages/ipykernel/resources
caffe /usr/local/share/jupyter/kernels/caffe
ir /usr/local/share/jupyter/kernels/ir
pyspark /usr/local/share/jupyter/kernels/pyspark
pyspark2 /usr/local/share/jupyter/kernels/pyspark2
tensorflow /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run jupyter kernelspec list again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - haven't tried it, but have a look at this SO answer.
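For reference, such an install command usually looks roughly like the following (untested here; the source directory and the --name value are placeholders):
jupyter kernelspec install /path/to/pyspark_kernel_dir --user --name pyspark2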
If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Finally, don't forget to remove all the PySpark/Jupyter-related environment variables from your bash profile (leaving only SPARK_HOME and PYSPARK_PYTHON should be OK).
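After that cleanup, the Spark-related part of your bash profile would be reduced to something like the following (paths taken from the question; adjust to your own installation):
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_PYTHON=python3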
Another possibility could be to use Apache Toree, but I haven't tried it myself yet.
The documentation seems to say that environment variables are read from a certain file and not as shell environment variables:
Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed
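A hedged sketch of what such a file could contain (the value just mirrors the question's setting; if the file does not exist yet, copy conf/spark-env.sh.template first):
# $SPARK_HOME/conf/spark-env.sh
export PYSPARK_PYTHON=python3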
What is the standard development process, involving some kind of IDE, for Spark with Python, for
Data exploration on the cluster
Application development?
I found the following answers, which do not satisfy me:
a) Zeppelin/Jupyter notebooks running "on the cluster"
b)
Install Spark and PyCharm locally,
use some local files containing dummy data to develop locally,
change references in the code to some real files on the cluster,
execute script using spark-submit in the console on the cluster.
source: https://de.hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-python/
I would love to do a) and b) using some locally installed IDE which communicates with the cluster directly, because I dislike the idea of creating local dummy files and changing the code before running it on the cluster. I would also prefer an IDE over a notebook. Is there a standard way to do this, or are my answers above already "best practice"?
You should be able to use any IDE with PySpark. Here are some instructions for Eclipse and PyDev:
set HADOOP_HOME variable referencing location of winutils.exe
set SPARK_HOME variable referencing your local spark folder
set SPARK_CONF_DIR to the folder where you have actual cluster config copied (spark-defaults and log4j)
add %SPARK_HOME%/python/lib/pyspark.zip and
%SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes you can add code like the following (the master URL is a placeholder for your cluster's master node):
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()
With the proper configuration file in SPARK_CONF_DIR, it should work with just SparkSession.builder.getOrCreate(). Alternatively you could set up your run configurations to use spark-submit directly. Some websites with similar instructions for other IDEs include:
PyCharm
Spyder
PyCharm & Spark
Jupyter Notebook
PySpark
I tried to set up a simple standalone Spark cluster with an interface to Spyder. There have been several remarks on the Spark mailing list and elsewhere which give a guideline on how to do this.
This does not work for my setup though. Once I submit the script to spark-submit, I get the following error:
File "/home/philip/Programme/anaconda2/bin/spyder.py", line 4, in <module> import spyder.app.start
ImportError: No module named app.start
From my understanding, this has something to do with the $PYTHONPATH variable. I already changed the path to the py4j module (in the current Spark version 2.1.0 it is py4j-0.10.4 instead of the one listed there).
My .bashrc file looks currently like this:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export JRE_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre
export SPARK_HOME=~/Programme/spark-2.1.0-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
export PATH=$PATHusr/bin/spyder
export PYTHONPATH=${PYTHONPATH}home/philip/Programme/anaconda2/bin/
# added by Anaconda2 4.3.0 installer
export PATH=/home/philip/Programme/anaconda2/bin:$PATH
If somebody has encountered a similar issue, help is greatly appreciated!
I encountered a similar error. The reason in my case was that I had not set PYTHONPATH. You should try setting this to your installation of python.
So instead of:
export PYTHONPATH=${PYTHONPATH}home/philip/Programme/anaconda2/bin/
Try
export PYTHONPATH=/home/philip/Programme/anaconda2/bin/python2.7
I was able to get my Spyder setup going by using the following code in the Spyder editor window:
import os
import sys

if 'SPARK_HOME' not in os.environ:
    os.environ['SPARK_HOME'] = '/home/ramius/spark-2.1.1-bin-hadoop2.7'
SPARK_HOME = os.environ['SPARK_HOME']

if 'PYTHONPATH' not in os.environ:
    os.environ['PYTHONPATH'] = '/home/ramius/anaconda2/bin/python2.7'
PYTHONPATH = os.environ['PYTHONPATH']

sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "pyspark.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.4-src.zip"))

from pyspark import SparkContext
Hope that helps.
I want to read a Spark Avro file in a Jupyter notebook.
I have got spark-avro built.
When I go to my directory and do the following
pyspark --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
It opens a Jupyter notebook in the browser, and I can then run the following commands and the file is read properly.
sdf_entities = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
sdf_entities.cache().take(1)
However, I don't want to pass the packages option every time I open a pyspark notebook. For example, if I have to use the spark-csv package, I just run
pyspark
in the terminal and it opens up a Jupyter notebook with the spark-csv package available. I don't have to specify the packages option for spark-csv there.
But this doesn't seem to work for spark-avro.
Note:
1) I have configured the IPython/Jupyter notebook command as "pyspark" in the configuration settings, so whenever pyspark is called in the terminal it opens up a Jupyter notebook automatically.
2) I have also added both spark-csv and spark-avro to the spark-defaults.conf file in my spark/conf folder. Here is how the spark-defaults.conf file looks:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 12g
spark.executor.memory 3g
spark.driver.maxResultSize 3g
spark.rdd.compress false
spark.storage.memoryFraction 0.5
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
spark-jars.packages com.databricks:spark-avro_2.10:2.0.1
Any help?
The correct property name is spark.jars.packages (not spark-jars.packages) and multiple packages should be provided as a single, comma separated list, same as the command line argument.
You should also use the Scala artifact that matches the Scala version used to build the Spark binaries. For example, with Scala 2.10 (the default in Spark 1.x):
spark.jars.packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.5.0
I have a problem: Jupyter can't see the env variables in my bashrc file. Is there a way to load these variables in Jupyter, or to add custom variables to it?
To set an env variable in a Jupyter notebook, just use the %env or %set_env magic command, e.g., %env MY_VAR=MY_VALUE or %env MY_VAR MY_VALUE. (Use %env by itself to print out the current environment variables.)
See: http://ipython.readthedocs.io/en/stable/interactive/magics.html
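A minimal notebook cell illustrating this (the variable name and value are placeholders):
%env MY_VAR=MY_VALUE
%env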
You can also set the variables in your kernel.json file:
My solution is useful if you need the same environment variables every time you start a jupyter kernel, especially if you have multiple sets of environment variables for different tasks.
To create a new ipython kernel with your environment variables, do the following:
Read the documentation at https://jupyter-client.readthedocs.io/en/stable/kernels.html#kernel-specs
Run jupyter kernelspec list to see a list with installed kernels and where the files are stored.
Copy the directory that contains the kernel.json (e.g. named python2) to a new directory (e.g. python2_myENV).
Change the display_name in the new kernel.json file.
Add an env dictionary defining the environment variables.
Your kernel json could look like this (I did not modify anything from the installed kernel.json except display_name and env):
{
  "display_name": "Python 2 with environment",
  "language": "python",
  "argv": [
    "/usr/bin/python2",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {"LD_LIBRARY_PATH": ""}
}
Use cases and advantages of this approach
In my use case, I wanted to set the variable LD_LIBRARY_PATH, which affects how compiled modules (e.g. written in C) are loaded. Setting this variable using %set_env did not work.
I can have multiple python kernels with different environments.
To change the environment, I only have to switch/restart the kernel, but I do not have to restart the Jupyter instance (useful if I do not want to lose the variables set in another notebook). See, however, https://github.com/jupyter/notebook/issues/2647
If you're using Python, you can define your environment variables in a .env file and load them from within a Jupyter notebook using python-dotenv.
Install python-dotenv:
pip install python-dotenv
Load the .env file in a Jupyter notebook:
%load_ext dotenv
%dotenv
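The .env file itself is just plain KEY=VALUE lines, for example (names and values are hypothetical):
# .env
MY_DB_HOST=localhost
MY_DB_PASSWORD=secret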
You can set up environment variables in your code as follows:
import sys,os,os.path
sys.path.append(os.path.expanduser('~/code/eol_hsrl_python'))
os.environ['HSRL_INSTRUMENT']='gvhsrl'
os.environ['HSRL_CONFIG']=os.path.expanduser('~/hsrl_config')
This is of course a temporary fix; to get a permanent one, you probably need to export the variables into your ~/.profile. More information can be found here.
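A hedged sketch of the permanent version, reusing the variables from the snippet above:
# ~/.profile
export HSRL_INSTRUMENT=gvhsrl
export HSRL_CONFIG=$HOME/hsrl_config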
A gotcha I ran into: The following two commands are equivalent. Note the first cannot use quotes. Somewhat counterintuitively, quoting the string when using %env VAR ... will result in the quotes being included as part of the variable's value, which is probably not what you want.
%env MYPATH=C:/Folder Name/file.txt
and
import os
os.environ['MYPATH'] = "C:/Folder Name/file.txt"
If you need the variable set before you start the notebook, the only solution that worked for me was env VARIABLE=$VARIABLE jupyter notebook, with export VARIABLE=value in .bashrc.
In my case, TensorFlow needs the exported variable in order to be imported successfully in a notebook.
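A minimal sketch of that combination (VARIABLE and value are placeholders):
# in ~/.bashrc
export VARIABLE=value
# then start the notebook so the variable is passed through explicitly
env VARIABLE=$VARIABLE jupyter notebook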
A related (short-term) solution is to store your environment variables in a single file, with a predictable format, that can be sourced when starting a terminal and/or read into the notebook. For example, I have a file, .env, that has my environment variable definitions in the format VARIABLE_NAME=VARIABLE_VALUE (no blank lines or extra spaces). You can source this file in the .bashrc or .bash_profile files when beginning a new terminal session and you can read this into a notebook with something like,
import os

env_vars = !cat ../script/.env
for var in env_vars:
    key, value = var.split('=')
    os.environ[key] = value
I used a relative path to show that this .env file can live anywhere and be referenced relative to the directory containing the notebook file. This also has the advantage of not displaying the variable values within your code anywhere.
If your notebook is being spawned by a Jupyter Hub, you might need to configure (in jupyterhub_config.py) the list of environment variables that are allowed to be carried over from the JupyterHub process environment to the Notebook environment by setting
c.Spawner.env_keep = ['VAR1', 'VAR2', ...]
(https://jupyterhub.readthedocs.io/en/stable/api/spawner.html#jupyterhub.spawner.Spawner.env_keep)
See also: Spawner.environment
If you are using systemd, I just found out that you seem to have to add them to the systemd unit file. This was on Ubuntu 16. Putting them into .profile and .bashrc (even /etc/profile) resulted in the env vars not being available in the Jupyter notebooks.
I had to edit:
/lib/systemd/system/jupyter-notebook.service
and put the variable I wanted to read into the unit file, like:
Environment=MYOWN_VAR=theVar
and only then could I read it from within the Jupyter notebook.
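For context, a hedged sketch of where that line sits inside the unit file (the rest of the service definition will differ per installation):
[Service]
Environment=MYOWN_VAR=theVar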
You can run Jupyter notebook with Docker and don't have to manage dependency leaks.
docker run -p 8888:8888 -v /home/mee/myfolder:/home/jovyan --name notebook1 jupyter/notebook
docker exec -it notebook1 /bin/bash
then ask Jupyter about the open notebooks:
jupyter notebook list
http://0.0.0.0:8888/?token=012456788997977a6eb11e45fffff
The URL can be copy-pasted; verify the port if you have changed it.
Create a notebook and paste the following into it:
!pip install python-dotenv
import dotenv
%load_ext dotenv
%dotenv
I have some third-party database client libraries in Java. I want to access them through
java_gateway.py
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear where to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:
Py4jError: Trying to call a package
Also, comparing to Hive: the Hive JAR files are not loaded via compute-classpath.sh, which makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.
You could add the path to the jar file using the Spark configuration at runtime.
Here is an example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
Refer to the documentation for more information.
You can add external jars as arguments to pyspark
pyspark --jars file1.jar,file2.jar
You could add --jars xxx.jar when using spark-submit
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the environment variable SPARK_CLASSPATH
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
where your_spark_script.py was written using the PySpark API.
None of the above answers worked for me.
What I had to do with pyspark was
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
spark = (SparkSession
    .builder
    .appName("Spark_Test")
    .master('yarn-client')
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .config("spark.executor.cores", "4")
    .config("spark.executor.instances", "2")
    .config("spark.sql.shuffle.partitions", "8")
    .enableHiveSupport()
    .getOrCreate())
# Do this
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it:
https://github.com/graphframes/graphframes/issues/104
Extract the downloaded jar file.
Edit the system environment variables.
Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.
E.g.: if you have extracted the jar file to the C drive in a folder named sparkts,
its value should be C:\sparkts (see the example command after this list).
Restart your cluster
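One hedged way to set that variable persistently from a Windows Command Prompt, using the example path above:
setx SPARK_CLASSPATH "C:\sparkts"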
Apart from the accepted answer, you also have the options below:
if you are in a virtual environment, then you can place it in,
e.g., lib/python3.7/site-packages/pyspark/jars
if you want Java to discover it, then you can place it under the ext/ directory of where your JRE is installed
One more thing you can do is add the jar to the pyspark jars folder where pyspark is installed, usually /python3.6/site-packages/pyspark/jars.
Be careful, if you are using a virtual environment, that the jar goes to the pyspark installation inside the virtual environment.
This way you can use the jar without passing it on the command line or loading it in your code.
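A small Python sketch to locate that jars folder for whichever pyspark installation is active (it only prints the path):
import os
import pyspark

print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))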
I've worked around this by dropping the jars into a drivers directory and then creating a spark-defaults.conf file in the conf folder. Steps to follow:
To get the conf path:
cd ${SPARK_HOME}/conf
vi spark-defaults.conf
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*
Run your Jupyter notebook.
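A hedged way to check from the notebook that the driver actually picked up the setting:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.getConf().get("spark.driver.extraClassPath", "not set"))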
For Java/Scala libs from PySpark, neither --jars nor spark.jars works in version 2.4.0 and earlier (I didn't check newer versions). I'm surprised how many people claim that it works.
The main problem is that for a classloader retrieved in the following way:
from pyspark.sql import SparkSession

jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.SomeClass  # placeholder class name; bare "class" is a reserved word in Python
# or
clazz = jvm.java.lang.Class.forName('my.scala.SomeClass')
it works only when you copy jar files to ${SPARK_HOME}/jars (this one works for me).
But when your only option is to use --jars or spark.jars, a different classloader is used (a child classloader), which is set on the current thread. So your Python code needs to look like:
clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
Hope it explains your troubles. Give me a shout if not.
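Putting the pieces above together, a minimal sketch (the jar path and class name are placeholders; the trailing $ targets a Scala object, as in the snippet above):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.jars", "/path/to/my-lib.jar")
    .getOrCreate())
jvm = spark._jvm
loader = jvm.java.lang.Thread.currentThread().getContextClassLoader()
clazz = loader.loadClass("my.scala.SomeObject$")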