I want to read a Spark Avro file in a Jupyter notebook.
I have got spark-avro built.
When I go to my directory and do the following
pyspark --packages org.apache.avro:avro-mapred:1.7.7,com.databricks:spark-avro_2.10:2.0.1
It opens a Jupyter notebook in the browser, and I can then run the following command, which reads the file properly.
sdf_entities = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
sdf_entities.cache().take(1)
However, I don't want to give the --packages option every time I open a pyspark notebook. For example, if I have to use the spark-csv package I just run
pyspark
in the terminal and it opens a Jupyter notebook with the spark-csv package loaded. I don't have to specifically give the --packages option for spark-csv there.
But this doesn't seem to work for spark-avro.
Note:
1) I have configured the IPython/Jupyter notebook command as "pyspark" in the configuration settings, so whenever pyspark is called in the terminal it opens a Jupyter notebook automatically.
2) I have also added both the spark-csv and spark-avro packages to the spark-defaults.conf file in my spark/conf folder. Here is how the spark-defaults.conf file looks:
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 12g
spark.executor.memory 3g
spark.driver.maxResultSize 3g
spark.rdd.compress false
spark.storage.memoryFraction 0.5
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value
spark.jars.packages com.databricks:spark-csv_2.11:1.4.0
spark-jars.packages com.databricks:spark-avro_2.10:2.0.1
Any help?
The correct property name is spark.jars.packages (not spark-jars.packages), and multiple packages should be provided as a single, comma-separated list, the same as with the command-line argument.
You should also use Scala artifacts that match the Scala version used to build your Spark binaries. For example, with Scala 2.10 (the default in Spark 1.x):
spark.jars.packages com.databricks:spark-avro_2.10:2.0.1,com.databricks:spark-csv_2.10:1.5.0
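With that single corrected line in spark-defaults.conf, launching a plain pyspark (no --packages flag) should resolve both connectors. A minimal sketch of the reads that should then work, where the Avro file name comes from the question and the CSV file name is just illustrative:

sdf_entities = sqlContext.read.format("com.databricks.spark.avro").load("learning_entity.avro")
sdf_csv = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("some_file.csv")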
Related
I have read about .py and .ipy, as well as the difference between python, ipython, and notebook.
But the question is: what is the real difference between a .py and an .ipynb file?
Is an .ipynb file just more convenient to run in a Jupyter notebook, or is there anything more? I am wondering because I am thinking about which format to use for publishing on GitHub.
Thanks
.py is a regular python file. It's plain text and contains just your code.
.ipynb is a Python notebook; it contains the notebook code, the execution results, and other internal settings in a specific format. You run an .ipynb file in the Jupyter environment.
A better way to understand the difference: open each file in a regular text editor like Notepad (on Windows) or gedit (on Linux).
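For a quick illustration (a minimal sketch; the notebook file name is just an example), an .ipynb file is plain JSON under the hood, so you can also inspect it from Python:

import json

with open("example.ipynb") as f:        # any notebook file you have on disk
    nb = json.load(f)

print(list(nb.keys()))                  # typically: cells, metadata, nbformat, nbformat_minor
print(nb["cells"][0]["cell_type"])      # each cell records its type, source, and outputs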
Save the .ipynb in git if you want to show the results of your script, for didactic purposes for example. But if you are going to run your code on a server, just save the .py.
Adding to @Josir's answer, the information below is very useful for opening an .ipynb file using PyCharm.
Create a new Python project in PyCharm.
Specify a virtual environment, and install the jupyter package (pip install jupyterlab).
Run the server using the jupyter-lab command.
The browser will open the Jupyter notebook interface, where you can execute the .ipynb file.
Here is the documentation: https://jupyterlab.readthedocs.io/en/latest/
py means PYthon
ipynb means Interactive PYthon NoteBook - which is now known as Jupyter notebook.
The latter is merely a Python script with descriptive contents: you describe what your data is doing by means of Python code and some accompanying text. That's pretty much it. Also, you need a specific editor, e.g. PyCharm or Google Colab, to open and run it.
I think the answer here might help you: https://stackoverflow.com/a/32029027/11924650
.ipy indicates that it's an IPython script. The only difference between IPython scripts and normal Python scripts is that IPython scripts can use IPython magics, e.g. %timeit, and run system commands such as !echo Hi.
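A tiny illustrative .ipy script, using only the two features mentioned above (it runs with ipython script.ipy; a plain python interpreter would reject the magic and shell lines):

x = list(range(1000))
%timeit sum(x)      # IPython line magic
!echo Hi            # shell command escape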
What is the standard development process involving some kind of IDE for Spark with Python, for
Data exploration on the cluster
Application development?
I found the following answers, which do not satisfy me:
a) Zeppelin/Jupyter notebooks running "on the cluster"
b)
Install Spark and PyCharm locally,
use some local files containing dummy data to develop locally,
change references in the code to some real files on the cluster,
execute script using spark-submit in the console on the cluster.
source: https://de.hortonworks.com/tutorial/setting-up-a-spark-development-environment-with-python/
I would love to do a) and b) using some locally installed IDE which communicates with the cluster directly, because I dislike the idea of creating local dummy files and changing the code before running it on the cluster. I would also prefer an IDE over a notebook. Is there a standard way to do this, or are my answers above already "best practice"?
You should be able to use any IDE with PySpark. Here are some instructions for Eclipse and PyDev:
set HADOOP_HOME variable referencing location of winutils.exe
set SPARK_HOME variable referencing your local spark folder
set SPARK_CONF_DIR to the folder where you have actual cluster config copied (spark-defaults and log4j)
add %SPARK_HOME%/python/lib/pyspark.zip and
%SPARK_HOME%/python/lib/py4j-xx.x.zip to the PYTHONPATH of the interpreter
For testing purposes you can add code like:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://my-cluster-master-node:7077").getOrCreate()
With the proper configuration file in SPARK_CONF_DIR, it should work with just SparkSession.builder.getOrCreate() (see the sketch after the list of links below). Alternatively you could set up your run configurations to use spark-submit directly. Some websites with similar instructions for other IDEs include:
PyCharm
Spyder
PyCharm & Spark
Jupyter Notebook
PySpark
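Returning to the SPARK_CONF_DIR point above, here is a minimal sketch, assuming SPARK_CONF_DIR points at a copy of the real cluster configuration so nothing cluster-specific is hard-coded in the script:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ide-session").getOrCreate()
print(spark.sparkContext.master)   # should report the cluster master taken from spark-defaults.conf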
I'm trying to set up a standalone instance of Spark locally on a Mac and use the Python 3 API. To do this I've done the following:
1. I've downloaded and installed Scala and Spark.
2. I've set up the following environment variables,
#Scala
export SCALA_HOME=$HOME/scala/scala-2.12.4
export PATH=$PATH:$SCALA_HOME/bin
#Spark
export SPARK_HOME=$HOME/spark/spark-2.2.1-bin-hadoop2.7
export PATH=$PATH:$SPARK_HOME/bin
#Jupyter Python
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=ipython3
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#Python
alias python="python3"
alias pip="pip3"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PYTHONPATH
Now when I run the command
pyspark --master local[2]
and type sc in the notebook, I get the following:
SparkContext
Spark UI
Version: v2.2.1
Master: local[2]
AppName: PySparkShell
Clearly my SparkContext is not initialized. I'm expecting to see an initialized SparkContext object.
What am I doing wrong here?
Well, as I have argued elsewhere, setting PYSPARK_DRIVER_PYTHON to jupyter (or ipython) is a really bad and plain wrong practice, which can lead to unforeseen outcomes downstream, such as when you try to use spark-submit with the above settings...
There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.
The first thing to do is run a jupyter kernelspec list command, to get the list of the kernels already available on your machine; here is the result in my case (Ubuntu):
$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow
The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.
The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:
{
  "display_name": "PySpark (Spark 2.0)",
  "language": "python",
  "argv": [
    "/opt/intel/intelpython27/bin/python2",
    "-m",
    "ipykernel",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
    "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
    "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
    "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
  }
}
Now, the easiest way for you would be to manually make the necessary changes (paths only) to the kernel shown above and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run the jupyter kernelspec list command again). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):
However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.
If you don't already have a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - I haven't tried it, but have a look at this SO answer.
If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:
"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"
Finally, don't forget to remove all the PySpark/Jupyter-related environment variables from your bash profile (leaving only SPARK_HOME and PYSPARK_PYTHON should be OK).
Another possibility could be to use Apache Toree, but I haven't tried it myself yet.
The documentation seems to say that environment variables are read from a certain file and not as shell environment variables:
Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed
I'm trying to set a custom starting directory in Jupyter Notebook. I have edited jupyter_notebook_config.py.
I removed the # from the line "c.NotebookApp.notebook_dir =" and added the parameter:
c.NotebookApp.notebook_dir = u'c:\\my\\chosen\\directory'.
But it still doesn't work: the console comes up with an error, and Jupyter starts in the default home directory.
I'm using Windows Server 2008. According to the manuals, it should work.
Does anyone have a suggestion about my problem?
The following steps work perfectly for me on Windows:
First find which directory Jupyter is looking in for your config file:
jupyter --config-dir
If there is no jupyter_notebook_config.py file in that directory, generate one by typing:
jupyter notebook --generate-config
Then edit the jupyter_notebook_config.py file and add something like:
## The directory to use for notebooks and kernels.
c.NotebookApp.notebook_dir = 'c:\\users\\rsignell\\documents\\github'
Then start your jupyter notebook from any directory:
jupyter notebook
and it will start in the directory you specified.
For more info see: http://jupyter-notebook.readthedocs.io/en/latest/config.html
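As a side note, jupyter_notebook_config.py is ordinary Python, so a raw string also works and avoids backslash-escaping mistakes on Windows; the path below is just the example from above:

c.NotebookApp.notebook_dir = r'c:\users\rsignell\documents\github'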
Microsoft Windows
Open a DOS command line by typing cmd in the Windows Explorer address bar. This opens a command prompt with the current path set to the current folder. Type jupyter notebook --notebook-dir=%CD% on the command line to start Jupyter notebook (IPython notebook) with the current directory as the notebook's starting directory.
I also had this problem, and editing the Jupyter configuration file didn't work either.
My workaround is to make a batch file that goes to a specified directory, then start jupyter notebook from that directory.
You can use Notepad to write the batch file; just save it with the file type set to "All Files" and specify the extension as .bat.
An easy way is also available from the DOS prompt using copy con. First, open a command prompt (usually by typing "cmd" and pressing Enter). Then:
copy con startjupyter.bat
after that you can specify your directory and start the notebook from there. For example, if your directory is D:\python_codes:
d:
cd python_codes
jupyter notebook
After that, save the file using CTRL+Z and Enter.
You can run the batch file by calling its name (startjupyter) or by clicking it. For the latter, maybe put it on your desktop for easy access.
I also had problems with the solutions given here. My solution was quick and dirty, but it works on Windows. I made a batch file:
cd C:\[starting Directory]
jupyter notebook
stop
You can start Jupyter in different directories by using different batch files. For example:
cd C:\datascience
or
cd C:\browsergame
I have some third-party database client libraries in Java. I want to access them through
java_gateway.py
E.g.: to make the client class (not a JDBC driver!) available to the Python client via the Java gateway:
java_import(gateway.jvm, "org.mydatabase.MyDBClient")
It is not clear how to add the third-party libraries to the JVM classpath. I tried adding them to compute-classpath.sh, but that did not seem to work. I get:
Py4JError: Trying to call a package
Also, when comparing with Hive: the Hive JAR files are not loaded via compute-classpath.sh, so that makes me suspicious. There seems to be some other mechanism for setting up the JVM-side classpath.
You could add the path to the jar file using the Spark configuration at runtime.
Here is an example:
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
sc = SparkContext(conf=conf)
Refer to the documentation for more information.
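A minimal sketch of how the class from the question could then be used through the Py4J gateway; the class name is the hypothetical one from the question, and a public no-argument constructor is assumed:

from py4j.java_gateway import java_import

java_import(sc._jvm, "org.mydatabase.MyDBClient")   # sc._jvm is the gateway's JVM view
client = sc._jvm.MyDBClient()                       # now resolvable by its short name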
You can add external jars as arguments to pyspark
pyspark --jars file1.jar,file2.jar
You could add --jars xxx.jar when using spark-submit
./bin/spark-submit --jars xxx.jar your_spark_script.py
or set the environment variable SPARK_CLASSPATH
SPARK_CLASSPATH='/path/xxx.jar:/path/xx2.jar' your_spark_script.py
your_spark_script.py was written using the PySpark API
None of the above answers worked for me.
What I had to do with pyspark was:
pyspark --py-files /path/to/jar/xxxx.jar
For Jupyter Notebook:
spark = (SparkSession
.builder
.appName("Spark_Test")
.master('yarn-client')
.config("spark.sql.warehouse.dir", "/user/hive/warehouse")
.config("spark.executor.cores", "4")
.config("spark.executor.instances", "2")
.config("spark.sql.shuffle.partitions","8")
.enableHiveSupport()
.getOrCreate())
# Do this
spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")
Link to the source where I found it:
https://github.com/graphframes/graphframes/issues/104
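For instance, assuming the jar added above is the graphframes assembly jar discussed in the linked issue, the Python module bundled inside it becomes importable afterwards:

from graphframes import GraphFrame   # works only after the addPyFile() call above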
Extract the downloaded jar file.
Edit the system environment variables.
Add a variable named SPARK_CLASSPATH and set its value to \path\to\the\extracted\jar\file.
E.g.: if you have extracted the jar file on the C drive in a folder named sparkts,
its value should be: C:\sparkts
Restart your cluster
Apart from the accepted answer, you also have the options below:
if you are in a virtual environment then you can place it in,
e.g., lib/python3.7/site-packages/pyspark/jars
if you want Java to discover it then you can place it under the ext/ directory of your JRE installation
One more thing you can do is to add the jar to the pyspark jars folder where pyspark is installed, usually /python3.6/site-packages/pyspark/jars.
Be careful, if you are using a virtual environment, that the jar goes to the pyspark installation inside the virtual environment.
This way you can use the jar without passing it on the command line or loading it in your code.
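A quick way to locate that jars folder for whichever environment is active (virtualenv or not) is a small sketch like:

import os
import pyspark

print(os.path.join(os.path.dirname(pyspark.__file__), "jars"))   # copy the jar into this folder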
I've worked around this by dropping the jars into a directory called drivers and then creating a spark-defaults.conf file in the conf folder. Steps to follow:
To get the conf path:
cd ${SPARK_HOME}/conf
vi spark-defaults.conf
spark.driver.extraClassPath /Users/xxx/Documents/spark_project/drivers/*
Run your Jupyter notebook.
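Optionally, you can check from the notebook that the setting was picked up; this assumes a running SparkSession named spark:

print(spark.sparkContext.getConf().get("spark.driver.extraClassPath"))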
For Java/Scala libraries used from pyspark, both --jars and spark.jars do not work in version 2.4.0 and earlier (I didn't check newer versions). I'm surprised how many people claim that it is working.
The main problem is that for a classloader retrieved in the following way:
jvm = SparkSession.builder.getOrCreate()._jvm
clazz = jvm.my.scala.MyClass                             # placeholder class name
# or
clazz = jvm.java.lang.Class.forName('my.scala.MyClass')
it works only when you copy jar files to ${SPARK_HOME}/jars (this one works for me).
But when your only option is using --jars or spark.jars, another classloader is used (a child classloader), which is set on the current thread. So your Python code needs to look like:
clazz = jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass(f"{object_name}$")
Hope this explains your troubles. Give me a shout if not.