Connect local JupyterHub to Azure Databricks Spark Cluster - python

I have an Azure Databricks cluster. Although it provides notebooks, my team is more familiar with JupyterLab, where they can upload offline CSV files and install Python packages.
I want to set up a JupyterLab instance that can connect to the Spark cluster.
Databricks does allow accessing the cluster through a remote kernel - https://databricks.com/blog/2019/12/03/jupyterlab-databricks-integration-bridge-local-and-remote-workflows.html - but that kernel can't read local files in JupyterLab.
Is there any way to use the Spark cluster from a local JupyterLab, like https://medium.com/ibm-data-ai/connect-to-remote-kerberized-hive-from-a-local-jupyter-notebook-to-run-sql-queries-83d5e548d82c?
Many thanks

If you prefix a magic command with %%, it takes the rest of the cell as its argument; this is how the %%local magic is used to send data to the Spark cluster from the local instance (see the sketch below).
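These cell magics come from sparkmagic (the library behind the HDInsight kernels referenced at the end of this answer) rather than from databrickslabs_jupyterlab, so take the following as a hedged sketch and check the flags against your installed sparkmagic version; the file name and variable names are only placeholders.
In one cell, read the local file with pandas:
%%local
# runs on the local JupyterLab machine, so local files are visible here
import pandas as pd
local_df = pd.read_csv("offline_data.csv")
Then, in a second cell, push it to the cluster:
%%send_to_spark -i local_df -t df -n cluster_df
# cluster_df is now available as a Spark DataFrame in cells that run on the cluster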
Install databrickslabs_jupyterlab locally:
(base)$ conda create -n dj python=3.8 # you might need to add "pywin32" if you are on Windows
(base)$ conda activate dj
(dj)$ pip install --upgrade databrickslabs-jupyterlab[cli]==2.2.1
(dj)$ dj $PROFILE -k   # creates a Jupyter kernel for the remote cluster; $PROFILE is your Databricks CLI profile
Start JupyterLab:
(dj)$ dj $PROFILE -l
Test the Spark access:
import socket
from databrickslabs_jupyterlab import is_remote

# `sc` (the SparkContext) is provided by the remote Databricks kernel
result = sc.range(10000).repartition(100).map(lambda x: x).sum()
print(socket.gethostname(), is_remote())  # hostname of the cluster driver and whether the kernel is remote
print(result)
For more details, you can refer to Install Jupyter Notebook on your computer and connect to Apache Spark on HDInsight, Kernels for Jupyter Notebook on Apache Spark clusters in Azure HDInsight, and Sending data to Spark cluster from Local instance.

Related

Connect Google colab to a runtime on a Google Compute Engine instance

I am trying to connect a Jupyter notebook in Google Colab to a runtime on a GCP Compute Engine instance. I followed the instructions in this Colab doc (link).
Steps taken:
Set up a Jupyter server on my local machine:
pip install jupyter_http_over_ws && jupyter serverextension enable --py jupyter_http_over_ws
jupyter notebook \
--NotebookApp.allow_origin='https://colab.research.google.com' \
--port=8888 \
--NotebookApp.port_retries=0
Create and start a Compute Engine instance on GCP
SSH into the instance and forward a local port using:
gcloud beta compute ssh --zone "europe-west2-c" "<ec2-instance-name>" --project "<project-name>" -- -L 8888:localhost:8888
Error Message from trying to forward the port:
bind [127.0.0.1]:8888: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 8888
Could not request local forwarding.
I also tried connecting the Compute Engine instance directly to Colab, but I was unable to. For the final step, I am supposed to copy the Jupyter URL into the Colab local runtime. How can I fix this?
I figured it out. The bind error came from the local Jupyter server started in step 1 already listening on port 8888; the fix is to run Jupyter only on the remote instance and forward the port.
Steps:
Start instance
Connect to the instance and forward a port on the remote instance to your local machine:
gcloud beta compute ssh --zone "<zone>" "<ec2-instance-name>" --project "<project-name>" -- -L 8888:localhost:8888
Install jupyter notebook and jupyter_http_over_ws on the remote instance if you don't have them installed already.
Then enable jupyter_http_over_ws:
jupyter serverextension enable --py jupyter_http_over_ws
Start Jupyter server on remote instance
jupyter notebook \
--NotebookApp.allow_origin='https://colab.research.google.com' \
--port=8888 \
--NotebookApp.port_retries=0
Copy the server URL into Colab.

Connect to remote python kernel from python code

I have been using Papermill to execute my Python notebooks periodically. To execute a compute-intensive notebook, I need to connect to a remote kernel running in my EMR cluster.
In the case of Jupyter Notebook, I can do that by starting the Jupyter server with jupyter notebook --gateway-url=http://my-gateway-server:8888, and I am able to execute my code on the remote kernel. But how do I let my local Python code (run through Papermill) use a remote kernel? What changes do I have to make in the KernelManager to connect to a remote kernel?
One related SO answer I could find is here. It suggests port forwarding to the remote server and initializing a KernelManager with the connection file from the server. I am not able to do this because BlockingKernelManager is no longer in IPython.zmq, and I would also prefer an HTTP connection, like Jupyter uses.
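(For reference, the modern equivalent of that connection-file approach lives in jupyter_client rather than IPython.zmq. A hedged sketch, assuming the kernel's connection file has been copied locally and its ports are forwarded to the server; the file name is a placeholder:)
from jupyter_client import BlockingKernelClient

# kernel-remote.json stands in for the connection file copied from the server
client = BlockingKernelClient()
client.load_connection_file("kernel-remote.json")
client.start_channels()
client.wait_for_ready(timeout=60)

# run code on the remote kernel and print the reply status
msg_id = client.execute("print('hello from the remote kernel')")
reply = client.get_shell_msg(timeout=60)
print(reply["content"]["status"])
client.stop_channels()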
Hacky approach - set up a shell script to do the following:
Create a Python environment on your EMR master node using the hadoop user
Install sparkmagic in that environment and configure all kernels as described in sparkmagic's README.md
Copy your notebook to the master node, or use it directly from its S3 location
Run it with papermill:
papermill s3://path/to/notebook/input.ipynb s3://path/to/notebook/output.ipynb -p param=1
Steps 1 and 2 are a one-time requirement if your cluster's master node is the same every time.
A slightly better approach:
Set up a remote kernel in your Jupyter itself: REMOTE KERNEL
Execute the notebook with papermill as a normal notebook by selecting this remote kernel
I am using both approaches for different use cases, and they work fine for now.
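If you drive papermill from Python rather than the CLI, the second approach looks roughly like this (a hedged sketch: the kernel name remote_spark and the notebook paths are placeholders and assume such a remote kernelspec has already been registered):
import papermill as pm

# execute the notebook on the kernel selected by name; s3:// paths as in the CLI example also work
pm.execute_notebook(
    "input.ipynb",
    "output.ipynb",
    parameters={"param": 1},
    kernel_name="remote_spark",  # hypothetical kernelspec pointing at the EMR cluster
)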

How to use pycharm to run an application in remote spark cluster

I have installed PyCharm on my local system and configured it to run Spark applications in local mode on Windows.
My Spark cluster is on a remote Ubuntu box.
How can I run a Spark application on the remote Spark cluster (on Ubuntu) from PyCharm installed locally on Windows?
My goal is to run the application on the remote cluster, so workarounds are also welcome.
PyCharm is already set up for this. You want to configure a deployment and a remote interpreter, ideally via SSH.
This lets you upload your codebase to the cluster (so that the PySpark driver has access to it) while running it from your laptop. The remote interpreter then takes care of resolving dependencies on the cluster.
Have a look here https://www.jetbrains.com/help/pycharm/configuring-remote-interpreters-via-ssh.html and here https://www.jetbrains.com/help/pycharm/creating-a-remote-server-configuration.html.
NB: before you start configuring the remote interpreter, it's better to install venv or conda on your cluster and create a virtual environment, so that you don't have dependency conflicts or outdated packages. You then point the remote interpreter config to the Python binary of that environment, such as /app/anaconda3/envs/my_env/bin/python.
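Once the remote interpreter is configured, a minimal PySpark script run through it is enough to verify the setup (a sketch; it assumes pyspark is installed in the remote environment and that the master URL or YARN configuration is handled on the cluster side):
from pyspark.sql import SparkSession

# builds a session against whatever master the remote environment is configured for
spark = SparkSession.builder.appName("pycharm-remote-test").getOrCreate()
print(spark.sparkContext.master)   # shows which master the driver connected to
print(spark.range(1000).count())   # trivial job to confirm executors respond
spark.stop()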

Installing Anaconda on Server

I have a Unix server where I have Python 3 installed. I ssh into the server from my Mac.
I was wondering if it is possible to install Anaconda and Jupyter (which comes with Anaconda) on the server, so that I can just start Jupyter from the server terminal and run code in Jupyter running on the server.
Is it possible? And if yes, could someone guide me to the right link?
In a terminal on your remote server:
#download anaconda (change version if you want)
wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-x86_64.sh
# run the installer
bash Anaconda3-2018.12-Linux-x86_64.sh
# so changes to your PATH take effect in your current session:
source ~/.bashrc
# To run a remote notebook, replace XXXX with a port of your choice, for example 9191
jupyter notebook --no-browser --port=XXXX
#copy the url that you get as a result
Then in your local machine, open up a terminal and write:
#XXXX is the port you specified in the previous step, YYYY is a local port, for example 9999 to keep it simple
ssh -f [USER]@[SERVER] -L YYYY:localhost:XXXX -N
Then copy the URL from the previous step and paste it into a browser; if you chose the same local port (YYYY equal to XXXX), you don't have to change anything in the URL, otherwise replace the port with YYYY.
You can download Anaconda using: wget https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
and install it using: bash Anaconda3-5.1.0-Linux-x86_64.sh
After that, just source the Anaconda path in your .bashrc file and it should work.
To access Jupyter Notebook, you can use SSH and run the notebook in a browser on your host. The steps are described in this link.
Yes, you can install Anaconda on your Linux machine (server) and manage the Python environment with it. But if you just need Jupyter hosted on a server, install only Jupyter and start its service, which will serve Jupyter Notebook; you can then access it with a browser from any other PC.
Search for how to install Anaconda on your Linux machine (CentOS/Ubuntu, etc.).
After installation, run the following command:
conda info
and then configure Jupyter and run it.
Simple way (Install Jupyter on a server): Install, Run, and Connect to Jupyter Notebook on a Remote Server

How to start a standalone cluster using pyspark?

I am using pyspark under Ubuntu with Python 2.7.
I installed it using
pip install pyspark --user
and I am trying to follow the instructions to set up a Spark cluster.
I can't find the start-master.sh script.
I assume that has to do with the fact that I installed pyspark and not regular Spark.
I found here that I can connect a worker node to the master via pyspark, but how do I start the master node with pyspark?
https://pypi.python.org/pypi/pyspark
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
Well, I did a bit of a mix-up in the OP.
You need to get Spark on the machine that should run as master.
You can download it here.
After extracting it, you have a spark/sbin folder containing the start-master.sh script; you need to start it with the -h argument.
Please note that you need to create a spark-env file, as explained here, and define the Spark local and master variables; this is important on the master machine.
After that, on the worker nodes, use the start-slave.sh script to start them.
And you are good to go: you can use a Spark context inside Python to use the cluster!
If you are already using pyspark through a conda / pip installation, there's no need to install Spark and set up environment variables again for the cluster setup.
A conda / pip pyspark installation is only missing the 'conf', 'sbin', 'kubernetes' and 'yarn' folders; you can simply download Spark and move those folders into the folder where pyspark is located (usually the site-packages folder inside your Python installation).
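To find that folder, you can ask the installed package itself (a small helper, not part of pyspark's documented setup):
import os
import pyspark

# directory of the pip/conda-installed pyspark package, typically .../site-packages/pyspark;
# the downloaded 'conf', 'sbin', 'kubernetes' and 'yarn' folders would go in here
print(os.path.dirname(pyspark.__file__))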
After you have installed pyspark via pip install pyspark, you can start the Spark standalone cluster master process using this command:
spark-class org.apache.spark.deploy.master.Master -h 127.0.0.1
Then you can add some workers (executors), which will process the jobs:
spark-class org.apache.spark.deploy.worker.Worker \
spark://127.0.0.1:7077 \
-c 4 -m 8G
The -c and -m flags specify the number of CPU cores and the amount of memory provided by the worker.
The 127.0.0.1 local address is used here for security reasons (it isn't good if anyone who copies and pastes these lines exposes an "arbitrary code execution service" on their network). For a distributed standalone Spark cluster, a different address should be used (e.g. a private IP address in an isolated network available only to the cluster nodes and their intended users), and the official Spark security guide should be read.
The spark-class script is included in the pyspark Python package; it is a wrapper that loads the environment variables from spark-env.sh and adds the corresponding Spark jar locations to the -cp flag of the java command.
If you need to configure the environment, consult the official Spark docs, but the setup above also works with default parameters and may be suitable for regular usage. Also, see the available flags for the master/worker commands via their --help.
This is an example of how to connect to this standalone cluster using the pyspark script with an ipython shell:
PYSPARK_DRIVER_PYTHON=ipython \
pyspark --master spark://127.0.0.1:7077 \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 4G
The code for instantiating a Spark session manually, e.g. in Jupyter:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://127.0.0.1:7077")
    # the number of executors this job needs
    .config("spark.executor.instances", 2)
    # the CPU cores and memory this job needs from each executor;
    # they are reserved on the worker
    .config("spark.executor.cores", "2")
    .config("spark.executor.memory", "4G")
    .getOrCreate()
)
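A quick sanity check continuing from the block above (a hedged example; it just confirms the configured executors can run a trivial job):
# sum the numbers 0..999999 on the cluster and show the result
spark.range(1000000).selectExpr("sum(id) AS total").show()
spark.stop()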
