Create a Spark standalone setup with PySpark installed via PyPI - python

I am trying to set up a single Spark standalone node that can be used locally, without a cluster, to run Spark programs on that instance.
From the articles I have read, there are multiple approaches to getting PySpark onto an instance; I am listing just these two:
PySpark installed via PyPI (pip install pyspark) - runs with either
python file.py
or
spark-submit file.py
The full Apache Spark download, installed manually - runs with
spark-submit file.py
Both can run PySpark. However, what confuses me is whether the pip installation of PySpark sets up the node so it can be used as a standalone cluster with a single node, and what the difference between the two is.
As written in the docs:
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
What is the recommended way to run a Spark application on a single node for a relatively small dataset, say about 50 GB? And how can we set up a cluster on a single node?
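For what it's worth, the pip-installed PySpark already covers this single-node case: it includes local mode, which runs the driver and executors in one JVM on the current machine with no cluster daemons at all. A minimal sketch (the input path and column name below are placeholders, not from the question):

from pyspark.sql import SparkSession

# Local mode: everything runs in one process on this machine, using all cores ("local[*]").
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("single-node-job") \
    .getOrCreate()

# Placeholder input; a ~50 GB dataset is fine because Spark processes it partition by partition.
df = spark.read.parquet("data/")
df.groupBy("some_column").count().show()

spark.stop()

Such a file can be launched with either python file.py or spark-submit file.py; a real standalone cluster (separate master and worker daemons, whose setup scripts the pip package does not ship) is only needed when more than one machine, or submissions from other machines, are involved.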

Related

Develop and test Python code to connect kafka streams on local machine

I am new to using Python on a local machine; until now I have only coded in Azure Databricks. I want to create and deploy libraries that connect to Confluent Kafka and save the data to a Delta table.
I am confused about the following:
Do I need to connect to Databricks Delta from my local machine using Python to store the streams to Delta,
OR
should I store the streams to a local Delta table (I am able to create a Delta table) by setting up a session like the one below,
import pyspark.sql

# Local SparkSession with the Delta Lake package and SQL extensions enabled
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
and then deploy the library to Databricks, where at run time it will point to Databricks Delta?
I also want to use the DBFS FileStore when connecting to Kafka:
.option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks") \
I am new to this. Please share the details of how to create a streaming application in Python, and how to deploy it to Databricks.
To execute Python code on Databricks without notebooks, you need to configure a job. As OneCricketeer mentioned, egg is the file format for libraries; you will also need a Python file that serves as the entry point for the job - it initializes the Spark session and then calls your libraries.
A job can be configured in several ways (in every case you will also need to upload your libraries):
via the UI - it is limited to configuring notebooks and JARs, not Python code, but you can still run Python code using the spark-submit option.
via the REST API - with it, you can create a job that executes Python code directly (see the sketch after this list).
via the command line (which uses the REST API under the hood) - you will need to create the JSON yourself, the same way as for the REST API.
via the Databricks Terraform Provider - it also uses the REST API, but can make it easier to configure everything in one place: upload the libraries, upload the file to DBFS, and create/modify the job.
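A rough sketch of the REST API route, assuming the Jobs API 2.0 (the host, token, DBFS paths, spark_version and node_type_id are placeholders to replace with your own values):

import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                                   # placeholder token

job_spec = {
    "name": "kafka-to-delta",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",   # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder node type
        "num_workers": 1,
    },
    # your packaged library, uploaded to DBFS beforehand
    "libraries": [{"egg": "dbfs:/FileStore/libs/my_streaming_lib.egg"}],
    # the entry-point file that creates the SparkSession and calls the library
    "spark_python_task": {"python_file": "dbfs:/FileStore/jobs/entry_point.py"},
}

resp = requests.post(
    DATABRICKS_HOST + "/api/2.0/jobs/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=job_spec,
)
print(resp.json())  # contains the job_id on success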
On Databricks, Delta is already pre-installed, so you don't need to set the options, specify the Maven coordinates, and so on; your initialization code reduces to:
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.getOrCreate()
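And a rough sketch of what the streaming entry point itself could look like (the broker address, topic name, Delta path and checkpoint location are placeholders; the truststore path is the one from the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Read from Confluent Kafka over SSL (placeholder broker and topic)
stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "my_topic") \
    .option("kafka.security.protocol", "SSL") \
    .option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks") \
    .load()

# Write the raw key/value pairs to a Delta table (placeholder paths)
stream.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("delta") \
    .option("checkpointLocation", "/delta/events/_checkpoints") \
    .outputMode("append") \
    .start("/delta/events")

spark.streams.awaitAnyTermination()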

How to Kickstart Kubeflow Pipeline development in Python

I have been studying Kubeflow and trying to grasp how to write my first hello-world program in it and run it locally on my Mac. I have kfp and kubectl installed locally on my machine. For testing purposes I want to write a simple pipeline with two functions: get_data() and add_data(). The docs are so overwhelming that I am not clear how to program locally without k8s installed, connect to a remote GCP machine, and debug locally before creating a zip and uploading it - or is there a way to execute the code locally and see how it runs on Google Cloud?
Currently you need Kubernetes to run KFP pipelines.
The easiest way to deploy KFP is through the Google Cloud Marketplace.
Alternatively, you can install Docker Desktop locally (which includes Kubernetes) and install the standalone version of KFP on it.
After that you can try this tutorial: Data passing in Python components
Actually, you can install a reduced version of Kubeflow with MiniKF. More info: https://www.kubeflow.org/docs/distributions/minikf/minikf-vagrant/
Check whether you are using Kubeflow Pipelines from the Google Cloud Marketplace or a custom Kubernetes cluster. If you are using the managed one, you can watch your pipeline run through the Kubeflow Pipelines management console.
For details about how to create components based on functions, see https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#getting-started-with-python-function-based-components
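For reference, a minimal sketch of the two-function pipeline from the question, assuming the KFP v1 SDK (the function bodies are only illustrative):

import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def get_data() -> str:
    # illustrative: produce some data
    return "1,2,3"

def add_data(data: str) -> str:
    # illustrative: extend the incoming data
    return data + ",4"

# Turn the plain Python functions into pipeline components
get_data_op = create_component_from_func(get_data)
add_data_op = create_component_from_func(add_data)

@dsl.pipeline(name="hello-world-pipeline")
def my_pipeline():
    data_task = get_data_op()
    add_data_op(data_task.output)

# Compile to a package that can be uploaded through the KFP UI or submitted with kfp.Client
kfp.compiler.Compiler().compile(my_pipeline, "pipeline.yaml")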

Is it possible to run Spark UDF functions (mainly Python) under Docker?

I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.
This works fine for general Python applications (non-Spark) and for the Spark driver (calling spark-submit from within a Docker image).
However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker, or just the UDF functions).
EDIT
I found a solution with the beta EMR version; if there is an alternative with the current (5.x) EMR versions, this question is still relevant.
Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
It is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/
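For completeness, the feature is driven by Spark configuration once the cluster's YARN Docker runtime is enabled; a rough sketch of the relevant keys (the ECR image name is a placeholder, and the appMaster settings are normally passed with --conf at spark-submit time rather than set in code):

from pyspark.sql import SparkSession

DOCKER_IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-pyspark-deps:latest"  # placeholder

# With these set, YARN launches the executors (and therefore the Python UDF workers)
# inside the given Docker image.
spark = SparkSession.builder \
    .appName("docker-executors") \
    .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker") \
    .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", DOCKER_IMAGE) \
    .config("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker") \
    .config("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", DOCKER_IMAGE) \
    .getOrCreate()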

How to Connect Python to Spark Session and Keep RDDs Alive

How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started PySpark. In a separate script in Visual Studio, I used this code for the SparkContext. I loaded a text file into an RDD named RDDFromFilename, but I couldn't access that RDD in the PySpark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts, particularly on Windows?
There is no solution for this in Spark itself. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep a persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy (see the sketch after this list)
mist - https://github.com/Hydrospheredata/mist
To share a context with notebooks:
Apache Zeppelin
I think that, out of these, only Zeppelin officially supports Windows.
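To illustrate the shared-context idea, here is a rough sketch against Livy's REST API (assuming a Livy server on its default port 8998; the file path in the submitted code is a placeholder):

import time
import requests

LIVY_URL = "http://localhost:8998"   # assumption: local Livy server

# Create a long-lived PySpark session that outlives any single client script
session_id = requests.post(LIVY_URL + "/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait until the session is ready
while requests.get(LIVY_URL + "/sessions/" + str(session_id)).json()["state"] != "idle":
    time.sleep(1)

# Any script can now run statements in that session; cached RDDs survive between calls
code = "rdd = sc.textFile('data.txt').cache()\nprint(rdd.count())"
requests.post(LIVY_URL + "/sessions/" + str(session_id) + "/statements", json={"code": code})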
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.

Zeppelin and BigQuery

I'm looking for a visualisation and analytics notebook engine for BigQuery and am interested in Apache Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require the installation of a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with 0.6.1, a native BigQuery Interpreter for Apache Zeppelin is available.
It allows you to process and analyze datasets stored in Google BigQuery by running SQL against them directly from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, as was previously the only way.
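If you would rather stay in Python inside the notebook instead of using the BigQuery interpreter, a minimal sketch with the google-cloud-bigquery client library would look roughly like this (the project, dataset and table names are placeholders, and application default credentials are assumed on the Zeppelin host):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # placeholder project id

sql = """
    SELECT name, COUNT(*) AS n
    FROM `my-project.my_dataset.my_table`        -- placeholder table
    GROUP BY name
    ORDER BY n DESC
    LIMIT 10
"""

# Returns a pandas DataFrame, convenient for Zeppelin's plotting and further Python analysis
df = client.query(sql).to_dataframe()
print(df.head())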
