Papermill PySpark support - python

I'm looking for a way to easily execute parametrized runs of Jupyter Notebooks, and I've found the Papermill project (https://github.com/nteract/papermill/).
This tool seems to match my requirements, but I can't find any reference to PySpark kernel support.
Are PySpark kernels supported by papermill executions?
If they are, is there some configuration to be done to connect them to the Spark cluster used by Jupyter?
Thanks in advance for the support, Mattia

Papermill will work with PySpark kernels, so long as they implement Jupyter's kernel spec.
Configuring your kernel will depend on the kernel in question. Usually these read from spark.conf and/or spark.properties files to configure cluster and launch-time settings for Spark.
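For example, you can point a parametrized run at a specific kernel with the kernel_name argument of papermill.execute_notebook. A minimal sketch, in which the kernel name "pysparkkernel", the notebook paths, and the parameter names are assumptions that depend on your installation (list the kernels you actually have with jupyter kernelspec list):

import papermill as pm

# Execute a parametrized run of a notebook against an assumed PySpark kernel.
pm.execute_notebook(
    "input_notebook.ipynb",    # hypothetical source notebook
    "output_notebook.ipynb",   # executed copy with the injected parameters
    parameters={"table_name": "events", "sample_fraction": 0.1},  # illustrative
    kernel_name="pysparkkernel",  # assumed name; use whatever your PySpark kernel registers
)

Cluster configuration itself stays in the kernel's own Spark configuration files, as described above; papermill only selects the kernel and injects the parameters.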

Related

Start PySpark cluster with Jupyter notebook

I'm building a PySpark app using a Jupyter notebook; so far I'm using it in standalone mode.
Now I have at my disposal 3 virtual machines with Spark on them, and I want to start PySpark on a cluster.
Here is my code to start it in standalone mode:
(I'm using Spark 3.1.2 with Hadoop 3.2.)
I've searched for ways to do this and didn't get anywhere, and there are some articles saying that PySpark doesn't work on clusters, so if you know how I can change this code and launch my session on a cluster, please help.
Thank you.
You must have a cluster of some sort.
I use Kubernetes and https://github.com/bjornjorgensen/jlpyk8s
This way I have a notebook that I run PySpark on interactively.
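If you would rather stay on the three VMs without Kubernetes, the usual route is to start Spark's standalone master and workers on them and point the notebook's session at the master URL. A minimal sketch, where the host name, port, and memory setting are assumptions for your environment:

from pyspark.sql import SparkSession

# Point the session at the standalone master instead of local mode.
# "spark-master-host" is a placeholder for the VM running the master;
# 7077 is the default standalone master port.
spark = (
    SparkSession.builder
    .appName("notebook-on-cluster")
    .master("spark://spark-master-host:7077")
    .config("spark.executor.memory", "2g")  # tune for your VMs
    .getOrCreate()
)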

Jupyter notebooks as Kedro node

How can I use a Jupyter notebook as a node in a Kedro pipeline? This is different from converting functions from Jupyter notebooks into Kedro nodes. What I want to do is use the full notebook as the node.
Although this is technically possible (via nbconvert, for example), this is strongly discouraged for multiple reasons including the lack of testability and reproducibility of the notebooks among others.
The best practice is usually to keep your pipeline node functions pure (where applicable), meaning that they don't incur any side effects. The way notebooks work generally contradicts that principle.
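If you still want to try it despite those caveats, one possible workaround is to wrap the notebook execution in an ordinary function and register that function as a node. This is only a sketch, not an officially supported Kedro pattern; the notebook paths, parameter entry, and dataset names are purely illustrative:

import papermill as pm
from kedro.pipeline import node

def run_analysis_notebook(parameters: dict) -> str:
    """Execute the whole notebook with papermill and return the executed copy's path."""
    output_path = "data/08_reporting/analysis_executed.ipynb"  # hypothetical location
    pm.execute_notebook(
        "notebooks/analysis.ipynb",  # hypothetical notebook used as the "node"
        output_path,
        parameters=parameters,
    )
    return output_path

# Register the wrapper as a Kedro node; the input/output names are placeholders.
notebook_node = node(
    run_analysis_notebook,
    inputs="params:analysis",
    outputs="analysis_notebook_path",
)

Keep in mind that the testability and reproducibility concerns above still apply; the wrapper only hides the notebook behind a function signature.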
AFAIK Kedro doesn't support this but Ploomber does (disclaimer: I'm the author). Tasks can be notebooks, scripts, functions, or any combination of them. You can run locally, Airflow, or Kubernetes (using Argo workflows).
If using a notebook or script as a pipeline task, Ploomber creates a copy whenever you run the pipeline. For example, you can create functions to pre-process your data and add a final task that trains a model in a notebook; this way you can leverage the .ipynb format to generate reports for your model training procedure.
This is what a pipeline declaration looks like:
tasks:
  - source: notebook.ipynb
    product:
      nb: output.html
      data: output.csv
  - source: another.ipynb
    product:
      nb: another.html
      data: another.csv
Resources:
Repository
Exporting to Airflow
Exporting to Kubernetes
Sample pipelines

How to develop with PYSPARK locally and run on Spark Cluster?

I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark, because I've got more experience using Python than Scala.
OK. Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
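As a concrete illustration (a sketch, with placeholder host names and paths), you can keep the master out of the code and let spark-submit decide whether the module runs locally on the laptop or against the standalone master on the Ubuntu server:

# my_job.py
# Run locally:        spark-submit --master local[*] my_job.py
# Run on the cluster: spark-submit --master spark://your-server:7077 my_job.py
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # No .master() call here: the master comes from the spark-submit command line.
    spark = SparkSession.builder.appName("dev-to-cluster-example").getOrCreate()
    df = spark.range(1000)
    print(df.count())
    spark.stop()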
Give it a try and let us know how it goes.

How to find Spark RDD created in different cores of a computer

I just wanted to educate myself more on Spark, so I wanted to ask this question.
I currently have Spark installed on my local machine. It's a Mac with 16 GB.
I have connected a Jupyter notebook running with PySpark.
So now when I do any coding in that notebook, like reading data and converting it into a Spark DataFrame, I wanted to check:
1) Where is the dataset distributed on the local machine? Does it get distributed across the different CPU cores, etc.? Is there a way to find that out?
2) Is running code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? For example, does the first one just use one core of the machine and run on a single thread, while a Jupyter notebook with PySpark runs the code and computation on different CPU cores with multi-threading/multi-processing? Is this understanding correct?
Is there a way to check these?
Thanks
Jupyter has mainly three parts: the Jupyter Notebook, Jupyter clients, and kernels. http://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html
Here is a short note on kernels from the Jupyter homepage:
Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
Jupyter's job is to communicate between the kernel (e.g. a Python kernel or a Spark kernel) and the web interface (your notebook). So under the hood Spark is running drivers and executors for you; Jupyter helps in communicating with the Spark driver.
1) Where is the dataset distributed on the local machine? Does it get distributed across the different CPU cores, etc.? Is there a way to find that out?
Spark will spawn n executors (processes responsible for executing tasks) as specified using --num-executors; the executors are managed by a Spark driver (the program/process responsible for running the job on the Spark engine). So the number of executors is specified when you run a Spark program, and you will find this in the kernel's conf directory.
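From inside the notebook you can also inspect this yourself. A small sketch, assuming an active SparkSession named spark and a placeholder file path:

# Assuming an active SparkSession named `spark` in the notebook.
df = spark.read.csv("/path/to/data.csv", header=True)  # placeholder path

# How many partitions the DataFrame's underlying RDD was split into.
print(df.rdd.getNumPartitions())

# Default parallelism, i.e. how many tasks Spark will run concurrently;
# with a local[*] master this follows the number of CPU cores.
print(spark.sparkContext.defaultParallelism)

# The master the session is running against, e.g. local[*].
print(spark.sparkContext.master)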
2) Is running code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with PySpark? For example, does the first one just use one core of the machine and run on a single thread, while a Jupyter notebook with PySpark runs the code and computation on different CPU cores with multi-threading/multi-processing? Is this understanding correct?
Yes, as I explained, Jupyter is merely an interface letting you run code. Under the hood, Jupyter connects to kernels, be it plain Python or Apache Spark.
Spark has its own UI for monitoring jobs; by default the application UI is served by the driver on port 4040. In your case it will be available at http://localhost:4040.

How can I reference libraries for ApacheSpark using IPython Notebook only?

I'm currently playing around with the Apache Spark service in IBM Bluemix. There is a quick-start composite application (boilerplate) consisting of the Spark service itself, an OpenStack Swift service for the data, and an IPython/Jupyter notebook.
I want to add some 3rd-party libraries to the system and I'm wondering how this could be achieved. Using a Python import statement doesn't really help, since the libraries are then expected to be located on the Spark worker nodes.
Is there a way of loading Python libraries in Spark from an external source during job runtime (e.g. a Swift or FTP source)?
Thanks a lot!
You cannot add 3rd party libraries at this point in the beta. This will most certainly be coming later in the beta as it's a popular requirement ;-)
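Outside of that beta restriction, the generic Spark mechanism for this is SparkContext.addPyFile, which accepts local, HDFS, HTTP(S), and FTP paths and ships the file to the executors at runtime. A minimal sketch, where the URL, archive name, and module contents are assumptions:

# Assuming `sc` is the notebook's SparkContext.
sc.addPyFile("https://example.com/deps/mylib.zip")  # placeholder URL; FTP also works

import mylib  # hypothetical module packaged inside mylib.zip; importable after addPyFile

rdd = sc.parallelize(range(10))
print(rdd.map(mylib.transform).collect())  # mylib.transform is illustrative

Whether this works on a hosted offering still depends on what the service lets the driver and workers reach over the network.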
