Is it possible to run Spark UDF functions (mainly Python) under Docker?

I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.
This works fine for general Python applications (non-Spark) and for the Spark driver (calling spark-submit from within a Docker image).
However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker or just the UDF functions).
EDIT
Found a solution with a beta EMR version; if there's an alternative with the current (5.*) EMR versions, this is still relevant.

Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
and it is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/
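For reference, the approach in the EMR 6 beta post is to point the YARN containers at a Docker image through the container runtime environment variables, so the executors (and hence the Python UDFs) run inside the image. A rough sketch of the submit command, with the image name and script as placeholders:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-pyspark:latest \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-pyspark:latest \
  my_udf_job.py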

Related

Create Spark Standalone setup with PySpark installation via PyPI

I am trying to set up a standalone Spark node that can be used locally, without a cluster, to run Spark programs on that instance.
From the articles I have read, there are multiple approaches to having PySpark on an instance; I am listing just these two:
PySpark from PyPI - runs with either
python file.py
spark-submit
A full Apache Spark install - runs with
spark-submit
Both can run PySpark; however, what I am confused about is whether the pip-installed PySpark will set up the node to be used as a single-node standalone cluster, and what the difference between the two is.
As is written in the docs:
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
What is the way to execute a Spark application on a single node for a relatively small dataset, say about 50 GB? How can we set up a cluster for a single node?
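Not an authoritative answer, but as an illustration: for a single machine the pip-installed pyspark is usually enough, because local mode runs the driver and executors inside one process with no cluster manager. A minimal sketch (the path and memory setting are placeholders to size to your machine):
from pyspark.sql import SparkSession

# Local mode: no standalone master or YARN needed; "local[*]" uses every core on the box.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("single-node-job")
         .config("spark.driver.memory", "16g")  # illustrative; in local mode the driver also does the executor work
         .getOrCreate())

df = spark.read.parquet("/data/my_50gb_dataset")  # placeholder path
df.groupBy("some_column").count().show()
A real standalone cluster (started with the sbin/ scripts that only ship with the full download) becomes necessary when you want separate worker processes or more than one machine.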

How to develop with PYSPARK locally and run on Spark Cluster?

I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server, and it runs well so far. Now I would like to start developing with PySpark because I have more experience using Python than Scala.
OK. Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
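To make that concrete, a small sketch of the two submit modes (the server address and file name are placeholders for your setup):
# develop and test on the local machine
spark-submit --master local[*] my_module.py

# submit the same script to the standalone cluster on the Ubuntu server
spark-submit --master spark://your-server:7077 my_module.py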

How to Connect Python to Spark Session and Keep RDDs Alive

How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf=conf)
I started PySpark. In a separate script in Visual Studio, I used this code for SparkContext. I loaded a text file into an RDD named RDDFromFilename. But I couldn't access that RDD in the PySpark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?
There is no solution in Spark. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy
mist - https://github.com/Hydrospheredata/mist
To share context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.
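As an illustration of the shared-context route, a rough sketch of talking to Livy's REST API from a small Python script; the host and the code string are placeholders, and you would normally poll the statement URL for its result:
import requests

LIVY = "http://livy-host:8998"  # placeholder Livy endpoint

# Open a long-lived PySpark session on the cluster
session = requests.post(LIVY + "/sessions", json={"kind": "pyspark"}).json()

# Run code inside that session; anything cached there stays alive
# for later statements sent by other scripts.
requests.post(
    LIVY + "/sessions/%d/statements" % session["id"],
    json={"code": "rdd = sc.textFile('data.txt').cache(); rdd.count()"},
)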
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.

Best way to implement Spark + AWS + Caffe/CUDA?

I am looking to deploy an application that already has a trained caffemodel file, and I need to deploy it to a Spark cluster on AWS for processing because of the GPU compute power needed (20K patches per image). From my research it seems that the best way to do this is to use Spark to create an AWS cluster, which then runs a Docker image or an Amazon AMI to install the project dependencies automatically. Once everything is installed, the job can run on the cluster through Spark.
What I am wondering is how to do this from start to finish. I have seen several guides and have taken some online courses on Spark (BerkeleyX, Udemy) and Docker (Udemy); however, almost all the information I have seen consists of examples of how to implement the simplest application, with little to no heavy software dependencies (CUDA drivers, cuDNN, Caffe, DIGITS). I have deployed Spark clusters on AWS and run simple examples that had no dependencies, but have found little to no information on running an application that requires even a small dependency such as numpy. I would like to ask the group whether anyone has experience with such an implementation and can point me in the right direction or offer some help/suggestions.
Here are some things I have looked into:
Docker+NVIDIA: https://github.com/NVIDIA/nvidia-docker
bitfusion AMI: https://aws.amazon.com/marketplace/pp/B01DJ93C7Q/ref=sp_mpg_product_title?ie=UTF8&sr=0-13
My question is about how to implement a small sample application from start to finish, with the Spark cluster being created automatically while the needed dependencies are installed through either Docker or one of the AMIs above.
Notes:
Platform: Ubuntu 14.04
Language: Python
Dependencies: CUDA 7.5, caffe-nv, libcudnn4, NVIDIA graphics driver (346-352)
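Not a full recipe, but to sketch only the Spark side once the Docker image or AMI has put CUDA and Caffe on every node: a hypothetical PySpark job that loads the model once per partition and scores patches on the executors (the model file names and the scoring details are placeholders):
from pyspark import SparkContext

def score_partition(patch_paths):
    # Imported inside the function so it runs on the executor,
    # where the image/AMI has installed pycaffe and the CUDA stack.
    import caffe
    caffe.set_mode_gpu()
    net = caffe.Net("deploy.prototxt", "weights.caffemodel", caffe.TEST)  # placeholder model files
    for path in patch_paths:
        # load the patch, run net.forward(), yield the prediction (details omitted)
        yield (path, "prediction")

sc = SparkContext()
paths = sc.parallelize(["patch_000.png", "patch_001.png"], numSlices=2)
print(paths.mapPartitions(score_partition).collect())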

How to run a PySpark job (with custom modules) on Amazon EMR?

I want to run a PySpark program that runs perfectly well on my (local) machine.
I have an Amazon Elastic Map Reduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).
Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
--py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py
However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:
INFO yarn.Client: Uploading resource s3://bucket/custom_module.py ->
hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py
INFO s3n.S3NativeFileSystem: Opening
's3://bucket/custom_module.py' for reading
This looks like an awfully basic question, but the web is quite mute on this, including the official documentation (the Spark documentation seems to imply the command above).
This is a bug in Spark 1.3.0.
The workaround is to define SPARK_HOME for YARN, even though this should not be necessary:
spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
--conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
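As a side note (not part of the original answer), another way to ship the module if the --py-files route keeps failing is to add it from inside the program itself; a sketch reusing the bucket path from the question:
from pyspark import SparkContext

sc = SparkContext()
# Distribute the file to the driver and every executor at runtime,
# so it can be imported inside transformations and UDFs.
sc.addPyFile("s3://bucket/custom_module.py")

import custom_module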
