I am trying to use IPython notebook with Apache Spark 1.4.0. I have followed the two tutorials below to set up my configuration:
Installing Ipython notebook with pyspark 1.4 on AWS
and
Configuring IPython notebook support for Pyspark
After finishing the configuration, here is the relevant code from the related files:
1. ipython_notebook_config.py
c = get_config()
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8193
2. 00-pyspark-setup.py
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
I also added the following two lines to my .bash_profile:
export SPARK_HOME='home/hadoop/sparl'
source ~/.bash_profile
However, when I run
ipython notebook --profile=pyspark
it shows the message: unrecognized alias '--profile=pyspark' it will probably have no effect.
It seems that the notebook isn't configured with pyspark successfully.
Does anyone know how to solve this? Thank you very much.
Here are the relevant software versions:
ipython/Jupyter: 4.0.0
spark 1.4.0
AWS EMR: 4.0.0
python: 2.7.9
By the way, I have read the following, but it doesn't work:
IPython notebook won't read the configuration file
Jupyter notebooks don't have the concept of profiles (as IPython did). The recommended way of launching with a different configuration is e.g.:
JUPYTER_CONFIG_DIR=~/alternative_jupyter_config_dir jupyter notebook
See also issue jupyter/notebook#309, where you'll find a comment describing how to set up Jupyter notebook with PySpark without profiles or kernels.
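For reference, a minimal sketch of that profile-free approach could look like this in the first cell of a plain Jupyter notebook (the paths and the py4j version here are assumptions; adjust them to your install):
import os
import sys
# Assumed install location; point this at your real Spark directory
spark_home = os.environ.get("SPARK_HOME", "/home/hadoop/spark")
sys.path.insert(0, os.path.join(spark_home, "python"))
# The py4j version is a guess; check $SPARK_HOME/python/lib for the actual file name
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.8.2.1-src.zip"))
from pyspark import SparkContext
sc = SparkContext("local[2]", "notebook-test")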
This worked for me...
Update ~/.bashrc with:
export SPARK_HOME="<your location of spark>"
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
(Lookup pyspark docs for those arguments)
Then create a new IPython profile, e.g. pyspark:
ipython profile create pyspark
Then create ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py and add the following lines:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
spark_release_file = spark_home + "/RELEASE"
if os.path.exists(spark_release_file) and "Spark 1.6" in open(spark_release_file).read():
    # Make sure pyspark-shell is present in PYSPARK_SUBMIT_ARGS
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args
(update versions of py4j and spark to suit your case)
Then run mkdir -p ~/.ipython/kernels/pyspark, create the file ~/.ipython/kernels/pyspark/kernel.json, and add the following lines:
{
  "display_name": "pySpark (Spark 1.6.1)",
  "language": "python",
  "argv": [
    "/usr/bin/python",
    "-m",
    "IPython.kernel",
    "--profile=pyspark",
    "-f",
    "{connection_file}"
  ]
}
Now you should see this kernel, pySpark (Spark 1.6.1), under jupyter's new notebook option. You can test by executing sc and should see your spark context.
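As a quick sanity check (my own addition), run something like this in a new notebook using that kernel; sc already exists because shell.py ran at kernel startup:
# sc is predefined by pyspark/shell.py
rdd = sc.parallelize(range(100))
print(sc.version)   # e.g. 1.6.1
print(rdd.sum())    # 4950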
I have tried so many ways to solve this 4.0 version problem, and finally I decided to install IPython version 3.2.3:
conda install 'ipython<4'
It's amazing! I hope this helps all of you!
ref: https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/ace9F4dWZTA
As people commented, in Jupyter you don't need profiles. All you need to do is export the variables for Jupyter to find your Spark install (I use zsh, but it's the same for bash):
emacs ~/.zshrc
export PATH="/Users/hcorona/anaconda/bin:$PATH"
export SPARK_HOME="$HOME/spark"
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_SUBMIT_ARGS="--master local[*,8] pyspark-shell"
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH
It is important to add pyspark-shell in PYSPARK_SUBMIT_ARGS.
I found this guide useful but not fully accurate.
My config is local, but it should work for other setups if you change PYSPARK_SUBMIT_ARGS to the arguments you need.
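As a quick check (my own addition, not part of the guide), you can confirm from a notebook cell that Jupyter actually picked up these variables before creating a context:
import os
# Both should be non-empty if the exports above were picked up
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))  # must end with "pyspark-shell"
import pyspark
sc = pyspark.SparkContext()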
I am having the same problem specifying the --profile argument. It seems to be a general problem with the new version, not related to Spark. If you downgrade to IPython 3.2.1 you will be able to specify the profile again.
Related
I am following an Azure ML course on Udemy and cannot get around the following error:
Execution failed in operation 'to_pandas_dataframe' for Dataset(id='id', name='Loan Applications Using SDK', version=1, error_code=None, exception_type=PandasImportError)
Here is the code for Submitting the Script:
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
ws = Workspace.from_config(path="./config")
new_experiment = Experiment(workspace=ws,
name="Loan_Script")
script_config = ScriptRunConfig(source_directory=".",
script="180 - Script to Run.py")
script_config.framework = "python"
script_config.environment = Environment("conda_env")
new_run = new_experiment.submit(config=script_config)
Here is the Script being run:
from azureml.core import Workspace, Datastore, Dataset, Experiment
from azureml.core import Run
ws = Workspace.from_config(path="./config")
az_store = Datastore.get(ws, "bencouser_sdk_blob01")
az_dataset = Dataset.get_by_name(ws, name='Loan Applications Using SDK')
az_default_store = ws.get_default_datastore()
#%%----------------------------------------------------
# Get context of the run
#------------------------------------------------------
new_run = Run.get_context()
#%%----------------------------------------------------
# Stuff that will be logged
#------------------------------------------------------
df = az_dataset.to_pandas_dataframe()
total_observations = len(df)
nulldf = df.isnull().sum()
#%%----------------------------------------------------
# Complete the Experiment
#------------------------------------------------------
new_run.log("Total Observations:", total_observations)
for columns in df.columns:
new_run.log(columns, nulldf[columns])
new_run.complete()
I have run the .to_pandas_dataframe() part outside of an experiment and it worked without error. I have also tried the following (which was recommended in the driver log):
InnerException Could not import pandas. Ensure a compatible version is installed by running: pip install azureml-dataprep[pandas]
I have seen people come across this before, but I cannot find a solution; any help is appreciated.
When the experiment ran, a new Azure environment was created without pandas installed. To install pandas (if using Anaconda Navigator): go to Environments in the Anaconda Navigator window, click the Azure env, go to uninstalled packages, search for pandas, and click install. It worked once this was done.
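If you prefer to declare the dependency in code rather than through the Anaconda UI, a sketch along these lines (using the SDK's CondaDependencies class; the environment name conda_env is taken from the question) should make pandas available to the submitted run:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
# Attach pandas (and the dataprep extra the error message asks for) to the run environment
env = Environment("conda_env")
env.python.conda_dependencies = CondaDependencies.create(
    pip_packages=["pandas", "azureml-dataprep[pandas]"]
)
script_config.run_config.environment = env  # then submit as before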
For VS code users:
conda info --envs
You will have an environment with a name starting with azureml_f3f7e6c5xxxxxxx. Activate this environment:
conda activate azureml_f3f7e6c5xxxxxxx
Then install pandas in the environment
pip install pandas
My method of using pyspark is to always run the code below in Jupyter. Is this method always necessary?
import findspark
findspark.init('/opt/spark2.4')
import pyspark
sc = pyspark.SparkContext()
If you want to remove the findspark dependency, you can just make sure you have these variables in your .bashrc:
export SPARK_HOME='/opt/spark2.4'
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME:$PATH:~/.local/bin:$JAVA_HOME/bin:$JAVA_HOME/jre/bin
Change the directories according to your environment, and the Spark version as well. Otherwise, findspark has to be in your code for your Python interpreter to find the Spark directory.
If you get it working, you can run pip uninstall findspark
EDIT:
Pure Python solution: add this code at the top of your Jupyter notebook (maybe in the first cell):
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/opt/spark2.4"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")
Source : Anaconda docs
I believe you only need to call this once; what it does is edit your .bashrc file and set the environment variables there:
findspark.init('/path/to/spark_home', edit_rc=True)
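A usage sketch (the path is an assumption; use your own Spark directory):
import findspark
# One-time call: also persists the Spark variables in your shell rc file
findspark.init('/opt/spark2.4', edit_rc=True)
import pyspark
sc = pyspark.SparkContext()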
I have installed pyspark recently. It was installed correctly. But when I run the following simple program in Python, I get an error:
>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1,1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()
While running the last line, I get an error whose key line seems to be:
[Stage 0:> (0 + 0) / 4]18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
I have the following variables in .bashrc
export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3
I am using Python 3.
By the way, if you use PyCharm, you can add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the run/debug configurations, per the image below.
You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:
export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python
If spark-env.sh doesn't exist, you can rename spark-env.sh.template
This may also happen if you're working within a virtual environment. In that case, it may be harder to retrieve the correct path to the Python executable (and in any case I think it's not a good idea to hardcode the path if you want to share it with others).
If you run the following lines at the beginning of your script/notebook (at least before you create the SparkSession/SparkContext), the problem is solved:
import os
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
The os module allows you to set environment variables; sys.executable gives the absolute path of the executable binary for the Python interpreter.
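As a follow-up sketch (my addition): with both variables pointing at sys.executable, the driver and the workers use the same interpreter, so starting a session and checking the worker Python version should succeed:
from pyspark.sql import SparkSession
# Driver and executors now use the same interpreter, so the versions match
spark = SparkSession.builder.master("local[*]").appName("version-check").getOrCreate()
print(spark.sparkContext.pythonVer)  # e.g. 3.5, matching the driver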
I got the same issue, and I set both variables in .bash_profile:
export PYSPARK_PYTHON=/usr/local/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3
But my problem was still there.
Then I found out that the problem was that my default Python version was 2.7, by typing python --version.
So I solved the problem by following the page below:
How to set Python's default version to 3.x on OS X?
Just run the code below at the very beginning of your code. I am using Python 3.7. You might need to run locate python3.7 to get your Python path.
import os
os.environ["PYSPARK_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/Library/Frameworks/Python.framework/Versions/3.7/bin/python3.7"
I'm using Jupyter Notebook to study PySpark, and that's what worked for me.
Find where python3 is installed by running this in a terminal:
which python3
Here it points to /usr/bin/python3.
Now, at the beginning of the notebook (or .py script), do:
import os
# Set spark environments
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'
Restart your notebook session and it should work!
Apache-Spark 2.4.3 on Archlinux
I've just installed Apache-Spark-2.3.4 from the Apache-Spark website. I'm using the Archlinux distribution, a simple and lightweight distribution. I've installed it and put the apache-spark directory in /opt/apache-spark/, and now it's time to export our environment variables. Remember, I'm using Archlinux, so keep in mind to use your own $JAVA_HOME, for example.
Importing environment variables
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/user/.bashrc
echo 'export SPARK_HOME=/opt/apache-spark' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/user/.bashrc
echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/user/.bashrc
source ~/.bashrc
Testing
emanuel@hinton ~ $ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk/jre' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export SPARK_HOME=/opt/apache-spark' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ echo 'export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH' >> /home/emanuel/.bashrc
emanuel@hinton ~ $ source .bashrc
emanuel@hinton ~ $ python
Python 3.7.3 (default, Jun 24 2019, 04:54:02)
[GCC 9.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>>
Everything works fine as long as you have correctly exported the environment variables for SparkContext.
Using Apache-Spark on Archlinux via DockerImage
For my own use I've created a Docker image with Python, jupyter-notebook, and apache-spark-2.3.4.
Running the image:
docker run -ti -p 8888:8888 emanuelfontelles/spark-jupyter
just go to your browser and type
http://localhost:8888/tree
and you will be prompted with an authentication page; go back to the terminal, copy the token, and voilà, you will have an Archlinux container running an Apache-Spark distribution.
If you are using PyCharm, go to Run -> Edit Configurations and click on Environment variables to add them as below (basically, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON should point to the same version of Python). This solution worked for me. Thanks to the above posts.
To make it easier for people to see: instead of having to set a specific path like /usr/bin/python3, you can do this:
I put these lines in my ~/.zshrc:
export PYSPARK_PYTHON=python3.8
export PYSPARK_DRIVER_PYTHON=python3.8
When I type python3.8 in my terminal I get Python 3.8 running. I think it's because I installed pipenv.
Another good website to reference to get your SPARK_HOME is https://towardsdatascience.com/how-to-use-pyspark-on-your-computer-9c7180075617
(for permission denied issues use sudo mv)
1. Download and install Java (JRE).
2. You have two options; choose one of the following solutions:
## -------- Temporary Solution -------- ##
Just put the paths in your Jupyter notebook in the following code and RUN IT EVERY TIME:
import os
os.environ["PYSPARK_PYTHON"] = r"C:\Users\LAPTOP0534\miniconda3\envs\pyspark_v3.3.0"
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Users\LAPTOP0534\miniconda3\envs\pyspark_v3.3.0"
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_333"
----OR----
## -------- Permanent Solution -------- ##
Set the above 3 variables in your environment variables.
I have gone through many answers, but nothing else worked for me.
Both of these resolutions, however, worked for me and resolved my error.
Thanks
import os
os.environ["JAVA_HOME"] = "C:\Program Files\Java\jdk-19"
os.environ["SPARK_HOME"] = "C:\Program Files\Spark\spark-3.3.1-bin-hadoop2"
import sys
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
This worked for me in a Jupyter notebook, as the os library makes it easy to set up the environment variables. Make sure to run this cell before creating the SparkSession.
I tried two methods for this question; the method in the picture works.
Add the environment variables:
PYSPARK_PYTHON=/usr/local/bin/python3.7;PYSPARK_DRIVER_PYTHON=/usr/local/bin/python3.7;PYTHONUNBUFFERED=1
I am launching pyspark in client mode: bin/pyspark --master yarn-client --num-executors 60. Importing numpy in the shell goes fine, but it fails in the KMeans. My feeling is that somehow the executors do not have numpy installed. I didn't find any good solution anywhere to let the workers know about numpy. I tried setting PYSPARK_PYTHON, but that didn't work either.
import numpy
features = numpy.load(open("combined_features.npz"))
features = features['arr_0']
features.shape
features_rdd = sc.parallelize(features, 5000)
from pyspark.mllib.clustering import KMeans, KMeansModel
from numpy import array
from math import sqrt
clusters = KMeans.train(features_rdd, 2, maxIterations=10, runs=10, initializationMode="random")
Stack trace
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/hadoop/3/scratch/local/usercache/ajkale/appcache/application_1451301880705_525011/container_1451301880705_525011_01_000011/pyspark.zip/pyspark/mllib/__init__.py", line 25, in <module>
ImportError: No module named numpy
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:99)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
To use Spark in Yarn client mode, you'll need to install any dependencies on the machines on which Yarn starts the executors. That's the only surefire way to make this work.
Using Spark with Yarn cluster mode is a different story. You can distribute python dependencies with spark-submit.
spark-submit --master yarn-cluster my_script.py --py-files my_dependency.zip
However, the situation with numpy is complicated by the same thing that makes it so fast: the fact that it does the heavy lifting in C. Because of the way it is installed, you won't be able to distribute numpy in this fashion.
numpy is not installed on the worker (virtual) machines. If you use Anaconda, it's very convenient to upload such Python dependencies when deploying the application in cluster mode (so there is no need to install numpy or other modules on each machine; instead, they must be in your Anaconda distribution).
First, zip your Anaconda installation and put the zip file on the cluster; then you can submit a job using the following script:
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives hdfs://host/path/to/anaconda.zip#python-env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=python-env/anaconda/bin/python \
  app_main.py
Yarn will copy anaconda.zip from the HDFS path to each worker and use python-env/anaconda/bin/python to execute tasks.
Refer to Running PySpark with Virtualenv for more information.
What solved it for me (on Mac) was actually this guide (which also explains how to run Python through Jupyter Notebooks):
https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735
In a nutshell:
(Assuming you installed Spark with brew install apache-spark)
Find the SPARK_PATH using - brew info apache-spark
Add those lines to your ~/.bash_profile
# Spark and Python
######
export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.1
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
#For python 3, You have to add the line below or you will get an error
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'
######
You should be able to open Jupyter Notebook simply by calling:
pyspark
And just remember that you don't need to create the Spark context yourself; instead, simply call:
sc = SparkContext.getOrCreate()
For me, the environment variable PYSPARK_PYTHON was not set, so I edited the /etc/environment file and added the Python environment path to the variable:
PYSPARK_PYTHON=/home/venv/python3
Afterwards, no such error.
I had a similar issue, but I don't think you need to set PYSPARK_PYTHON; instead, just install numpy on the worker machines (apt-get or yum). The error will also tell you on which machine the import was missing.
You have to be aware that you need to have numpy installed on each and every worker, and even on the master itself (depending on your component placement).
Also make sure to launch the pip install numpy command from a root account (sudo does not suffice) after forcing the umask to 022 (umask 022) so it cascades the rights to the Spark (or Zeppelin) user.
A few things to check:
Install the required packages on the worker nodes with sudo permission so that they are available to all users.
If you have multiple versions of Python on the worker nodes, make sure to install packages for the Python used by Spark (usually set by PYSPARK_PYTHON).
Finally, to pass custom modules (.py files), use --py-files while starting the session with spark-submit or pyspark.
I had the same issue. Try installing numpy with pip3 if you're using Python 3:
pip3 install numpy
I followed this link http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/ in order to create a PySpark profile for IPython.
00-pyspark-setup.py
# Configure the necessary Spark environment
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "\python")
# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, '\python\lib\py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, '\python\pyspark\shell.py'))
My problem: when I type sc in the IPython notebook, I get '' whereas I should see output similar to <pyspark.context.SparkContext at 0x1097e8e90>.
Any idea how to resolve this?
I was trying to do the same, but had problems. Now I use findspark (https://github.com/minrk/findspark) instead. You can install it with pip (see https://pypi.python.org/pypi/findspark/):
$ pip install findspark
And then, inside a notebook:
import findspark
findspark.init()
import pyspark
sc = pyspark.SparkContext(appName="myAppName")
If you want to avoid this boilerplate, you can put the above 4 lines in 00-pyspark-setup.py.
(Right now I have Spark 1.4.1. and findspark 0.0.5.)
Please try to set a proper value for the SPARK_LOCAL_IP variable, e.g.:
export SPARK_LOCAL_IP="$(hostname -f)"
before you run ipython notebook --profile=pyspark.
If this doesn't help, try to debug your environment by executing the setup script:
python 00-pyspark-setup.py
Maybe you can find some error lines in that way and debug them.
Are you on Windows? I am dealing with the same things, and a couple of things helped.
In 00-pyspark-setup.py, change this block (matching the path to your Spark folder):
# Configure the environment
if 'SPARK_HOME' not in os.environ:
    print 'environment spark not set'
    os.environ['SPARK_HOME'] = 'C:/spark-1.4.1-bin-hadoop2.6'
I am sure you added a new environment variable; if not, this will set it manually.
The next thing I noticed is that if you use IPython 4 (the latest), the config files don't work the same way you see in all the tutorials. You can check whether your config files get called by adding a print statement or just breaking them so that an error gets thrown.
I am using a lower version of iPython (3) and I call it using
ipython notebook --profile=pyspark
Change the 00-pyspark-setup.py to:
# Configure the necessary Spark environment
import os
import sys
# Spark home
spark_home = os.environ.get("SPARK_HOME")
######## CODE ADDED ########
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"
######## END OF ADDED CODE #########
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Basically, the added code sets the PYSPARK_SUBMIT_ARGS environment variable to
--master local[2] pyspark-shell, which works for Spark 1.6 standalone.
Now run ipython notebook again. Run os.environ["PYSPARK_SUBMIT_ARGS"] to check whether its value is correctly set. If so, then typing sc should give you the expected output, like <pyspark.context.SparkContext at 0x1097e8e90>.