How to run a PySpark job (with custom modules) on Amazon EMR? - python

I want to run a PySpark program that runs perfectly well on my (local) machine.
I have an Amazon Elastic Map Reduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).
Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
--py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py
However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:
INFO yarn.Client: Uploading resource s3://bucket/custom_module.py ->
hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py
INFO s3n.S3NativeFileSystem: Opening
's3://bucket/custom_module.py' for reading
This looks like an awfully basic question, but the web is quite silent on it, including the official documentation (the Spark documentation seems to imply the command above).

This is a bug in Spark 1.3.0.
The workaround is to define SPARK_HOME for YARN, even though this should be unnecessary:
spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
--conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
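For completeness, here is a minimal sketch of what the submitted pyspark_program.py itself can look like once custom_module.py has been shipped with --py-files; the some_transform helper is a placeholder, not something from the question:
from pyspark import SparkContext

import custom_module  # importable because --py-files distributes it to the driver and executors

sc = SparkContext(appName="pyspark_program")
rdd = sc.parallelize(range(10))
# some_transform is an assumed helper inside custom_module, called here inside a Spark action
result = rdd.map(custom_module.some_transform).collect()
print(result)
sc.stop()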

Related

Autoscaler: launching a simple Python script on an AWS Ray cluster with Docker (examples)

I am finding a serious lack of documentation for Ray's autoscaling, and I cannot get anything to work.
Does anyone know of any basic examples of autoscaling with AWS that I can build on? i.e. a Dockerfile (or without Docker, I'm not fussy at this point), a config.yaml, and a simple_ray_script.py,
or anything at all that I can just download and run.
The examples I have tried with minimal.yaml are too simple: any change to the config, e.g. applying a conda env, stops workers from being initiated, among a multitude of other issues. The examples in the Ray project don't work for me either.
That would be great.
So far I have found pretty much nothing that works. I just want to run a simple Ray Python script WITH dependencies that will also launch and run on all workers, NOT just on the head node.
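As a rough sketch only (assuming a recent Ray version where runtime_env can ship pip dependencies to workers), a script like the following runs its remote tasks on all workers rather than just the head node; the package list and URL are placeholders:
import ray

# assumes the cluster was started with `ray up config.yaml` and this script runs on the head node;
# runtime_env installs the listed pip packages on every worker that executes a task
ray.init(address="auto", runtime_env={"pip": ["requests"]})

@ray.remote
def fetch_status(url):
    import requests  # available on the worker because of runtime_env
    return requests.get(url).status_code

print(ray.get([fetch_status.remote("https://example.com") for _ in range(4)]))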

How to submit a tar.gz file in pyspark

I am in client deploy mode and I would like to submit an application consisting of a tar.gz that contains the runtime, code, and libraries.
The purpose is to not depend on the Spark cluster for a specific Python runtime (e.g. the cluster has Python 3.5 and my code needs 3.7) or for a library that is not installed on the cluster.
I found it is possible to submit a Python file as well as a .jar file.
Use venv to run the PySpark job with a virtual environment's Python.
Command once your venv is set up:
spark-submit --master yarn-client \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=<requirementsFile> \
--conf spark.pyspark.virtualenv.bin.path=<virtualenv_path> \
--conf spark.pyspark.python=<python_path> \
<pyspark_file>
Have a look at: Using VirtualEnv with PySpark
Simply use this within Python
spark.sparkContext.addPyFile("module.zip")
Or you could do
spark-submit --py-files module.zip yourapp.py
See also SparkContext.addPyFile in the Spark API documentation.
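A minimal sketch of the addPyFile route, assuming module.zip contains a package named mymodule exposing a transform function (both names are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addpyfile-demo").getOrCreate()
# ships the archive to every executor and puts it on their PYTHONPATH
spark.sparkContext.addPyFile("module.zip")

def apply_transform(x):
    import mymodule  # imported inside the function so it resolves on the executor
    return mymodule.transform(x)

print(spark.sparkContext.parallelize(range(5)).map(apply_transform).collect())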

is it possible to run spark udf functions (mainly python) under docker?

I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.
This works fine for general Python applications (non-Spark) and for the Spark driver (calling spark-submit from within a Docker image).
However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker, or just the UDFs).
EDIT
I found a solution with the beta EMR version; if there's an alternative with the current (5.*) EMR versions, the question is still relevant.
Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
and it is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/
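For reference, a hedged sketch of how this is typically wired up on EMR 6 in client mode: the YARN_CONTAINER_RUNTIME_* environment variables from the YARN docs linked above are passed to the executors through Spark, and the ECR image URI below is a placeholder:
from pyspark.sql import SparkSession

# tells YARN 3.2+ to launch each executor container with the Docker runtime;
# replace the image URI with your own ECR repository
spark = (SparkSession.builder
         .appName("docker-executors")
         .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
         .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE",
                 "123456789012.dkr.ecr.us-east-1.amazonaws.com/pyspark-deps:latest")
         .getOrCreate())

print(spark.range(100).selectExpr("sum(id)").collect())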

How to develop with PYSPARK locally and run on Spark Cluster?

I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark because I've got more experience using Python than Scala.
Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
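A minimal sketch of a script that works in both contexts, provided the master is chosen at submit time rather than hard-coded:
from pyspark.sql import SparkSession

# pass the master on the command line instead of hard-coding it, e.g.
#   spark-submit --master local[*] app.py              (local development)
#   spark-submit --master spark://<server>:7077 app.py (standalone cluster)
spark = SparkSession.builder.appName("dev-to-cluster-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])
df.groupBy("letter").count().show()

spark.stop()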

Command glossary for dataflow?

I'm experimenting with the Dataflow Python SDK and would like some sort of reference as to what the various commands do, their required args and their recommended syntax.
So after
import google.cloud.dataflow as df
Where can I read up on df.Create, df.Write, df.FlatMap, df.CombinePerKey, etc.? Has anybody put together such a reference?
Is there any place (link please) where all the possible Apache Beam / Dataflow commands are collected and explained?
There is not yet a pydoc server running for Dataflow Python. However, you can easily run your own in order to browse: https://github.com/GoogleCloudPlatform/DataflowPythonSDK#a-quick-tour-of-the-source-code
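The Dataflow Python SDK has since been folded into Apache Beam, so the same transforms now live under the apache_beam package; a tiny pipeline exercising the ones named in the question:
import apache_beam as beam

# word-count-style pipeline using Create, FlatMap, Map and CombinePerKey
with beam.Pipeline() as p:
    (p
     | "Create" >> beam.Create(["a b a", "b c"])
     | "Split" >> beam.FlatMap(lambda line: line.split())
     | "Pair" >> beam.Map(lambda word: (word, 1))
     | "Count" >> beam.CombinePerKey(sum)
     | "Print" >> beam.Map(print))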