How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started Pyspark. Then, in a separate script in Visual Studio, I used that code to create a SparkContext and loaded a text file into an RDD named RDDFromFilename. But I couldn't access that RDD from the Pyspark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?
There is no built-in solution for this in Spark. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy
mist - https://github.com/Hydrospheredata/mist
To share a context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.
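To make the shared-context option concrete: with Livy, each short-lived script hands its code to one long-lived interactive session over REST, so an RDD created by one script stays visible to the next. A minimal sketch using only the Python standard library (port 8998 is Livy's default; the data.txt path is a hypothetical placeholder):

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # Livy's default REST endpoint

def session_payload(kind="pyspark"):
    """Body for POST /sessions: asks Livy to start a shared interpreter."""
    return {"kind": kind}

def statement_payload(code):
    """Body for POST /sessions/<id>/statements: runs code in that session."""
    return {"code": code}

def post(path, payload):
    """Send a JSON POST to the Livy server and decode the JSON reply."""
    req = urllib.request.Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Against a running Livy server, the flow would look like:
#   session = post("/sessions", session_payload())
#   path = "/sessions/%d/statements" % session["id"]
#   post(path, statement_payload("rdd = sc.textFile('data.txt')"))  # script 1
#   post(path, statement_payload("print(rdd.count())"))             # script 2 reuses rdd
```

The session (and every RDD defined in it) lives until you DELETE /sessions/&lt;id&gt;, which is what makes the context persistent across scripts.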
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of the GemFire in-memory database with a Spark cluster colocated in the same JVM, so (once I get decent at managing it) I can do large tasks without single-machine bottlenecks piping data in and out of Spark, or even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.
Related
I am trying to set up a standalone Spark node which can be used locally without a cluster, and run Spark programs on that instance.
From the articles I have read, there are multiple approaches to getting pyspark onto an instance; I am listing just two:
PySpark from PyPI - runs with either:
python file.py
spark-submit
Install the full Apache Spark distribution - runs with:
spark-submit
Both can run pyspark. However, what I am confused about is whether the pip installation of pyspark will set up the node to be used as a standalone single-node cluster, and what the difference between the two is.
As is written in the docs:
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
What is the way one can execute spark application on a single node for a relatively small dataset, say about 50Gb? How can we setup a cluster for a single node?
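For a single node there is no cluster to set up at all: pointing the master at local[*] runs the driver and executor threads in one JVM, and the pip-installed PySpark is enough for that. A minimal sketch (the helper names and the 8g driver-memory figure are illustrative assumptions; tune the memory settings to your ~50 GB dataset and available RAM):

```python
def local_master(cores=None):
    """Master URL for single-node mode: 'local[*]' uses every core."""
    return "local[%s]" % (cores if cores else "*")

def make_session(app_name, cores=None, driver_memory="8g"):
    # Imported lazily so local_master stays usable without PySpark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master(local_master(cores))
            .appName(app_name)
            .config("spark.driver.memory", driver_memory)
            .getOrCreate())

# Usage (requires `pip install pyspark`):
#   spark = make_session("single-node-50gb", driver_memory="12g")
#   df = spark.read.csv("data.csv", header=True)
#   print(df.count())
#   spark.stop()
```

Because the driver also holds the executors' memory in this mode, raising spark.driver.memory is usually the main knob for a larger-than-toy dataset on one machine.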
I am deploying a Jupyter notebook (using a Python 2.7 kernel) on the client side, which accesses data on a remote machine and does processing on a remote standalone Spark cluster (using the pyspark library). I am deploying the Spark application in client mode. The client machine does not have any Spark worker nodes.
The client does not have much memory (RAM). I wanted to know: if I perform a Spark action on a dataframe, like df.count(), on the client machine, will the dataframe be stored in the client's RAM or in the Spark workers' memory?
If I understand correctly, then what you will get on the client side is an int. At least it should be, if set up correctly. So the answer is no, the DataFrame is not going to hit your local RAM.
You are interacting with the cluster via a SparkSession (a SparkContext in earlier versions). Even though you are developing (i.e. writing code) on the client machine, the actual computation of Spark operations (i.e. running pyspark code) will not be performed on your local machine.
I just wanted to educate myself more on Spark. So wanted to ask this question.
I have Spark currently installed on my local machine, a Mac with 16 GB of RAM.
I have connected a Jupyter notebook running with Pyspark.
So now when I do any coding in that notebook, like reading the data and converting the data into Spark DataFrame, I wanted to check:
1) Where is the dataset distributed on the local machine? Does it distribute the dataset using different CPU cores, etc.? Is there a way to find that out?
2) Is running code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with Pyspark? As in, does the first just use one core of the machine with a single thread, while a Jupyter notebook with Pyspark runs the code and computation on different CPU cores with multi-threading/processing? Is this understanding correct?
Is there a way to check these?
Thanks
Jupyter has mainly three parts: the Jupyter Notebook, Jupyter clients, and kernels. http://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html
Here is a short note on kernels from the Jupyter homepage.
Kernels are processes that run interactive code in a particular programming language and return output to the user. Kernels also respond to tab completion and introspection requests.
Jupyter's job is to communicate between the kernel (like a Python kernel, Spark kernel, etc.) and the web interface (your notebook). So under the hood Spark is running drivers and executors for you, and Jupyter helps in communicating with the Spark driver.
1) Where is the dataset distributed on the local machine? Does it distribute the dataset using different CPU cores, etc.? Is there a way to find that out?
Spark will spawn n executors (processes responsible for executing tasks) as specified using --num-executors; the executors are managed by the Spark driver (the program/process responsible for running the job on the Spark engine). So the number of executors is specified when you run a Spark program; you will find this in the kernel conf directory.
2) Is running code and computation in a plain Jupyter notebook without Spark different from running a Jupyter notebook with Pyspark? As in, does the first just use one core of the machine with a single thread, while a Jupyter notebook with Pyspark runs the code and computation on different CPU cores with multi-threading/processing? Is this understanding correct?
Yes. As I explained, Jupyter is merely an interface letting you run code. Under the hood, Jupyter connects to kernels, be it normal Python or Apache Spark.
Spark has its own good UI for monitoring jobs; by default it runs on the Spark master server on port 4040. In your case, it will be available at http://localhost:4040
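One concrete way to check both points from the notebook itself: in local[*] mode Spark runs executor tasks as threads in a single JVM, one task slot per core, so its default parallelism should line up with the machine's core count. A small sketch (the attribute checks in the comments assume a live SparkContext named sc, which is not created here):

```python
import os

def expected_local_parallelism():
    """In local[*] mode, Spark's default parallelism is the CPU core count."""
    return os.cpu_count()

# With a live PySpark session, compare against the real values:
#   sc.master                  # 'local[*]' -> multi-threaded, single JVM
#   sc.defaultParallelism      # typically equals os.cpu_count() in local[*]
#   sc.parallelize(range(100)).getNumPartitions()
```

If sc.defaultParallelism is greater than 1, the dataset's partitions really are being processed on multiple cores, which answers question 1 and confirms the multi-threading guess in question 2.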
I'm looking for a visualisation and analytical notebook engine for BigQuery and am interested in Apache/Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require installing a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with 0.6.1, there is a native BigQuery Interpreter for Apache Zeppelin available.
It allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which was previously the only way.
I have installed Hadoop 2.6.0 on my machine and started all the services.
When I compare with my old version, this version does not start the JobTracker and TaskTracker; instead it starts the NodeManager and ResourceManager.
Question:
I believe this version of Hadoop uses YARN for running jobs. Can't I run a MapReduce job anymore?
Should I write a job that's tailored to fit the YARN ResourceManager and ApplicationMaster?
Is there a sample Python job that I can submit?
I believe this version of Hadoop uses YARN for running jobs. Can't I run a MapReduce job anymore?
It's still fine to run MapReduce jobs. YARN is a rearchitecture of the cluster computing internals of a Hadoop cluster, but that rearchitecture maintained public API compatibility with classic Hadoop 1.x MapReduce. The Apache Hadoop documentation on Apache Hadoop NextGen MapReduce (YARN) discusses the rearchitecture in more detail. There is a relevant quote at the end of the document:
MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a recompile.
Should I write a job that's tailored to fit the YARN ResourceManager and ApplicationMaster?
If you're already accustomed to writing MapReduce jobs or higher-level abstractions like Pig scripts and Hive queries, then you don't need to change anything you're doing as the end user. API compatibility as per above means that all of those things continue to work fine. You are welcome to write custom distributed applications that specifically target the YARN framework, but this is more advanced usage that isn't required if you just want to stick to Hadoop 1.x-style data processing jobs. The Apache Hadoop documentation contains a page on Writing YARN Applications if you're interested in exploring this.
Is there a sample Python job that I can submit?
I recommend taking a look at the Apache Hadoop documentation on Hadoop Streaming. Hadoop Streaming allows you to write MapReduce jobs simply by reading from stdin and writing to stdout. This is a very general paradigm, so you can code in pretty much anything you want, including Python.
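As an illustration of that paradigm, here is a sketch of the classic streaming word count: the mapper and reducer are plain functions over lines of text, which Hadoop Streaming would feed via stdin/stdout (the wordcount.py file name and the CLI wiring in the comments are my assumptions, not taken from the docs above):

```python
def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_pairs(pairs):
    """Reducer: sum counts per word; streaming delivers pairs sorted by key."""
    current, total = None, 0
    for pair in pairs:
        word, count = pair.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield "%s\t%d" % (current, total)

# In the real job, Hadoop Streaming invokes the same file twice, roughly:
#   hadoop jar hadoop-streaming.jar \
#     -files wordcount.py \
#     -mapper  "python wordcount.py map" \
#     -reducer "python wordcount.py reduce" \
#     -input /in -output /out
# where the map role prints map_lines(sys.stdin) and the reduce role
# prints reduce_pairs(sys.stdin); the framework sorts between the two.
```

The key detail is that the framework, not your code, does the sort-and-shuffle between mapper and reducer, which is why the reducer can assume its input pairs arrive grouped by key.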
In general, it sounds like you would benefit from exploring the Apache Hadoop documentation site. There is a lot of helpful information there.