My objective is to use Jupyter Notebook (IPython) with Apache Spark, via Apache Toree. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seems fine.
When I run the command below, a Jupyter browser opens: ipython notebook --profile=pyspark
When I choose Apache Toree - PySpark in the drop-down menu, I can't write code in my notebook, and I get this view (Python 2 works fine):
The red button gives:
What's wrong? Help, please?
Not really an answer, but if you're not tied to Toree and just need a local Spark installation for learning and experimenting, you could download a copy of Spark, unzip it, and use this at the beginning of your notebook:
import os
import sys

# Point SPARK_HOME at the directory where you extracted Spark
os.environ['SPARK_HOME'] = "<path where you have extracted the spark file>"

# Make the PySpark and py4j packages importable
# (adjust the py4j version to match your Spark distribution)
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql

sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
My method of using pyspark is to always run the code below in Jupyter. Is this method always necessary?
import findspark
findspark.init('/opt/spark2.4')
import pyspark
sc = pyspark.SparkContext()
If you want to remove the findspark dependency, you can just make sure you have these variables in your .bashrc:
export SPARK_HOME='/opt/spark2.4'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
Change the directories according to your environment, and the Spark version as well. Without those variables, findspark has to stay in your code so that your Python interpreter can find the Spark directory.
If you get it working, you can run pip uninstall findspark
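With those variables in place (and the notebook started via the pyspark command, which the driver variables above turn into a Jupyter launcher), the notebook cell shrinks to roughly this; a minimal sketch:

import pyspark

# No findspark needed: SPARK_HOME and PYTHONPATH from .bashrc make pyspark importable
sc = pyspark.SparkContext()
print(sc.version)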
EDIT:
Pure Python solution: add this code at the top of your Jupyter notebook (maybe in the first cell):
import os
import sys

# Adjust these paths (and the py4j version) to match your installation
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
Source: Anaconda docs
I believe you only need to call this once; it edits your .bashrc file and sets the environment variables there:
findspark.init('/path/to/spark_home', edit_rc=True)
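As a sketch of the full one-off cell (the path is a placeholder for your Spark directory):

import findspark

# Run once: with edit_rc=True, findspark writes the variables into your rc file,
# as described above
findspark.init('/path/to/spark_home', edit_rc=True)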
I want to query a PostgreSQL database with pyspark from within a Jupyter notebook. I have browsed a lot of questions on Stack Overflow, but none of them worked for me, mainly because the answers seemed outdated. Here's my minimal code:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Running this from a notebook raises the following error:
Py4JJavaError: An error occurred while calling o69.jdbc.
: java.sql.SQLException: No suitable driver
at java.sql.DriverManager.getDriver(DriverManager.java:315)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:85)
at scala.Option.getOrElse(Option.scala:121)...
The main tips I found are summed up in the link below, but unfortunately I can't get them to work in my notebook:
Pyspark connection to Postgres database in ipython notebook
Note: I am using Spark 2.3.1 and Python 3.6.3, and I am able to connect to the database from the pyspark shell if I specify the jar location:
pyspark --driver-class-path /home/.../postgresql.jar --jars /home/.../jars/postgresql.jar
Thanks to anyone who can help me on this one.
EDIT
The answers from How to load jar dependenices in IPython Notebook are already listed in the link I shared myself, and do not work for me. I already tried to configure the environment variable from the notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql.jar --jars /path/to/postgresql.jar'
There's nothing wrong with the file path or the file itself since it works fine when I specify it and run the pyspark-shell.
Using the config method worked for me:
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
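Equivalently, once the driver jar is on the class path, the same read can be written with the generic DataFrameReader options API (just another way of passing the same settings; the URL, table name, and credentials here are placeholders):

df = (spark.read
      .format('jdbc')
      .option('url', 'jdbc:postgresql://host/dbname')
      .option('dbtable', 'tablename')
      .option('user', 'username')
      .option('password', 'pwd')
      .load())
df.printSchema()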
I am running into a problem running my script with spark-submit. The main script won't even run, because import pymongo_spark returns ImportError: No module named pymongo_spark
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and online guides as closely as possible to get:
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Then I used this piece of code from the first thread I linked to test:
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)

if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing something, but I'm not sure where or what. I'm running everything locally on Mac OS X El Capitan. I'm almost sure this doesn't matter, but I wanted to mention it anyway.
EDIT:
I also tried another jar file, mongo-hadoop-1.5.0-SNAPSHOT.jar; the same problem remains.
My command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only starting with mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it importable you also have to add the directory containing the Python package to PYTHONPATH. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is the directory containing the Spark connector source.
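If you'd rather not touch PYTHONPATH, a sketch of the same idea from inside the script itself (the path below is a placeholder for wherever your connector source lives):

import sys

# Hypothetical location of the mongo-hadoop Spark connector's Python package
sys.path.append('/path/to/mongo-hadoop/spark/src/main/python')

import pymongo_spark
pymongo_spark.activate()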
See also Getting Spark, Python, and MongoDB to work together
I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:
Installing Ipython notebook with pyspark 1.4 on AWS
and
Configuring IPython notebook support for Pyspark
After finishing the configuration, here is some of the relevant code in the related files:
1. ipython_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8193
2. 00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I also added the following two lines to my .bash_profile:
export SPARK_HOME='home/hadoop/sparl'
source ~/.bash_profile
However, when I run
ipython notebook --profile=pyspark
it shows the message: unrecognized alias '--profile=pyspark', it will probably have no effect
It seems that the notebook isn't configured with pyspark successfully.
Does anyone know how to solve it? Thank you very much.
The following are the software versions:
IPython/Jupyter: 4.0.0
Spark: 1.4.0
AWS EMR: 4.0.0
Python: 2.7.9
By the way, I have read the following, but it doesn't work:
IPython notebook won't read the configuration file
Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is e.g.:
JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook
See also issue jupyter/notebook#309, where you'll find a comment describing how to set up Jupyter notebook with PySpark without profiles or kernels.
This worked for me...
Update ~/.bashrc with:
export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
(Look up the pyspark docs for those arguments.)
Then create a new ipython profile, e.g. pyspark:
ipython profile create pyspark
Then create ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py and add the following lines:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

# Make sure pyspark-shell is in PYSPARK_SUBMIT_ARGS (needed for Spark 1.6)
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
(update versions of py4j and spark to suit your case)
Then run mkdir -p ~/.ipython/kernels/pyspark and create the file ~/.ipython/kernels/pyspark/kernel.json with the following contents:
{
    "display_name": "pySpark (Spark 1.6.1)",
    "language": "python",
    "argv": [
        "/usr/bin/python",
        "-m",
        "IPython.kernel",
        "--profile=pyspark",
        "-f",
        "{connection_file}"
    ]
}
Now you should see this kernel, pySpark (Spark 1.6.1), under Jupyter's new-notebook options. You can test it by executing sc; you should see your Spark context.
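For example, a quick smoke test in a fresh notebook using the new kernel (just a minimal check of my own, not part of the original setup):

print(sc)                                # something like <pyspark.context.SparkContext ...>
print(sc.parallelize(range(100)).sum())  # 4950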
I have tried so many ways to solve this 4.0 version problem, and finally I decided to install version 3.2.3 of IPython:
conda install 'ipython<4'
It's amazing! I hope it helps all of you!
ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA
As people commented, in Jupyter you don't need profiles. All you need to do is export the variables so that Jupyter finds your Spark installation (I use zsh, but it's the same for bash):
emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
It is important to include pyspark-shell in PYSPARK_SUBMIT_ARGS.
I found this guide useful but not fully accurate.
My config is local, but it should work if you change PYSPARK_SUBMIT_ARGS to the arguments you need.
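With those exports loaded and jupyter notebook started from that shell, a plain Python cell should already be able to create a context; a minimal sketch of the check:

import pyspark

# PYTHONPATH and PYSPARK_SUBMIT_ARGS from the shell do the heavy lifting here
sc = pyspark.SparkContext(appName="sanity-check")
print(sc.version)
sc.stop()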
I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to IPython 3.2.1, you will be able to specify the profile again.
I followed this link http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/ in order to create a PySpark profile for IPython.
00-pyspark-setup.py
# Configure the necessary Spark environment
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "\python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, '\python\lib\py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, '\python\pyspark\shell.py'))
My problem: when I type sc in the IPython notebook, I get '' when I should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
Any idea how to resolve this?
I was trying to do the same, but had problems. Now, I use findspark (https://github.com/minrk/findspark) instead. You can install it with pip (see https://pypi.python.org/pypi/findspark/):
$ pip install findspark
And then, inside a notebook:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
If you want to avoid this boilerplate, you can put the above 4 lines in 00-pyspark-setup.py.
(Right now I have Spark 1.4.1 and findspark 0.0.5.)
Please try to set a proper value for the SPARK_LOCAL_IP variable, e.g.:
export SPARK_LOCAL_IP="$(hostname -f)"
before you run ipython notebook --profile=pyspark.
If this doesn't help, try to debug your environment by executing the setup script:
python 00-pyspark-setup.py
Maybe you will find some error lines that way and can debug them.
Are you on Windows? I've been dealing with the same thing, and a couple of things helped.
In 00-pyspark-setup.py, change this block (matching the path to your Spark folder):
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    print('environment spark not set')
    os.environ['SPARK_HOME'] = 'C:/spark-1.4.1-bin-hadoop2.6'
I am sure you added a new environment variable; if not, this will set it manually.
The next thing I noticed is that if you use IPython 4 (the latest), the config files don't work the same way you see in all the tutorials. You can check whether your config files are actually called by adding a print statement or just breaking them so an error gets thrown.
I am using a lower version of IPython (3) and I call it using:
ipython notebook --profile=pyspark
Change the 00-pyspark-setup.py to:
# Configure the necessary Spark environment
import os
import sys

# Spark home
spark_home = os.environ.get("SPARK_HOME")

######## CODE ADDED ########
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"
######## END OF ADDED CODE #########

sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Basically, the added code sets the PYSPARK_SUBMIT_ARGS environment variable to
--master local[2] pyspark-shell, which works for Spark 1.6 standalone.
Now run ipython notebook again. Check os.environ["PYSPARK_SUBMIT_ARGS"] to verify that its value is set correctly. If so, typing sc should give you the expected output, like <pyspark.context.SparkContext at 0x1097e8e90>.
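For example, the check described above could look like this in a notebook cell (the values in the comments are simply what you'd expect with this setup):

import os

print(os.environ["PYSPARK_SUBMIT_ARGS"])  # expected: --master local[2] pyspark-shell
print(sc)                                 # expected: <pyspark.context.SparkContext ...>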