How to submit a tar.gz file in pyspark - python

I'm in client deploy mode and I would like to submit an application consisting of a tar.gz that contains the runtime, code and libraries.
The purpose is not to depend on the Spark cluster for a specific Python runtime (e.g. the Spark cluster has Python 3.5 and my code needs 3.7) or for a library that is not installed on the cluster.
I found it is possible to submit a Python file as well as a .jar file.

Use venv to run the PySpark job with a virtual-environment version of Python.
Once your venv is set up, the command is:
spark-submit --master yarn-client --conf spark.pyspark.virtualenv.enabled=true --conf spark.pyspark.virtualenv.type=native --conf spark.pyspark.virtualenv.requirements=<requirementsFile> --conf spark.pyspark.virtualenv.bin.path=<virtualenv_path> --conf spark.pyspark.python=<python_path> <pyspark_file>
Have a look at: Using VirtualEnv with PySpark
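If you specifically want to ship a self-contained tar.gz (runtime plus libraries), a related approach is to pack the virtual environment into an archive and distribute it with --archives. A minimal sketch, assuming venv-pack is installed and the job runs on YARN; the archive, requirements file and application names are illustrative:
# build the environment with the Python version your job needs
python3.7 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements.txt venv-pack
venv-pack -o pyspark_venv.tar.gz
# ship the archive; YARN unpacks it next to each executor as ./environment
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --master yarn --deploy-mode client \
  --archives pyspark_venv.tar.gz#environment \
  your_app.py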

Simply use this within Python
spark.sparkContext.addPyFile("module.zip")
Or you could do
spark-submit --py-files module.zip yourapp.py
See also the Spark API here
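Once module.zip has been distributed with either approach, the code inside it can be imported and used in tasks. A minimal sketch, assuming module.zip contains a hypothetical package mymodule with a transform function:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("addPyFileDemo").getOrCreate()
# ship the zipped package to the driver and every executor
spark.sparkContext.addPyFile("module.zip")

import mymodule  # hypothetical package inside module.zip, importable after addPyFile

def apply_transform(x):
    # runs on the executors, where module.zip is on the Python path
    return mymodule.transform(x)  # hypothetical function

result = spark.sparkContext.parallelize([1, 2, 3]).map(apply_transform).collect()
print(result)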

Related

Create Spark Standalone setup with PySpark installation via PyPI

I am trying to set up a standalone Spark node that can be used locally without a cluster, and to run Spark programs on that instance.
From what I have read, there are multiple approaches to getting PySpark on an instance; I am listing just these two:
PySpark with PyPI - Runs with either
python file.py
spark-submit
Install with Apache Spark - Runs with
spark-submit
Both can run PySpark; however, what I am confused about is whether PySpark installed with pip sets the node up to be used as a standalone cluster with a single node, and what the difference between the two is.
As is written in Docs:
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
What is the way to execute a Spark application on a single node for a relatively small dataset, say about 50 GB? How can we set up a cluster for a single node?
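For context, a pip-installed PySpark can already run a job on a single machine in local mode, with no cluster manager involved. A minimal sketch, with an illustrative dataset path and memory setting:
from pyspark.sql import SparkSession

# local[*] runs the driver and executors in one JVM on this machine,
# using all available cores; no standalone master or worker is needed
spark = (SparkSession.builder
         .appName("single-node-job")
         .master("local[*]")
         .config("spark.driver.memory", "8g")  # size to the machine; the data is processed partition by partition
         .getOrCreate())

df = spark.read.parquet("/data/my_dataset")  # hypothetical path
print(df.count())
spark.stop()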

How to install jars related to spark-redis in a Databricks cluster?

I am trying to connect to Azure Cache for Redis from Databricks.
I have installed the package com.redislabs:spark-redis:2.3.0 from Maven in Databricks. I have created a Spark session with the code below:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.redis.host", "my host") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "passwd") \
    .getOrCreate()
But when I run df.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "name").save()
I get the error below:
Py4JJavaError: An error occurred while calling o390.save.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
Could you please let me know the detailed steps to install all necessary libraries/jars to access redis in databricks.
I have seen the code below in the spark-redis Python docs, but I don't know how to run it in Databricks.
$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
Also, please let me know what the latest spark-redis version is.
Redis has a Spark package that you can download and attach to your cluster.
The following notebook shows how to use Redis with Apache Spark in Azure Databricks.
For more details, refer to Azure Databricks - Redis.
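Once the spark-redis library is attached to the cluster (for example via its Maven coordinate in the Databricks Libraries UI), a quick way to verify the setup is a small write/read round trip. A minimal sketch, assuming the Redis connection settings from the session above:
# write a tiny DataFrame to Redis, then read it back
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])
(df.write.format("org.apache.spark.sql.redis")
   .option("table", "people")
   .option("key.column", "name")
   .mode("overwrite")
   .save())

people = (spark.read.format("org.apache.spark.sql.redis")
          .option("table", "people")
          .option("key.column", "name")
          .load())
people.show()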

How to install the spark-bigquery-connector on a GCP VM?

I have a Hadoop cluster with Spark installed on GCP VM instances, but it is not a Dataproc cluster.
Can I install the Spark BigQuery connector without using Dataproc?
If yes, how can I do it?
I found this link to download the connector
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
If I understand correctly, your cluster is not a native Dataproc cluster. You created a set of instances (at least one) and manually installed a Hadoop cluster. This scenario is more like an on-prem Hadoop installation (using GCP infrastructure). Unfortunately, the BigQuery connector documentation doesn't specify whether it can be used outside Dataproc, but I would think it should, since the connector is just a jar file. In the section Downloading and Using the Connector, you can download the latest version, or a different one in case that one doesn't work.
To install the connector on a GCP VM instance, you need to include it in the Java classpath of your application:
Include it in a Spark directory already on the Java classpath, or add a new entry (this change only lasts until your shell session ends):
export CLASSPATH=</path/to/bigquery-connector.jar>:$CLASSPATH
Use the --jars option when submitting your spark application.
The options above will let you run Spark jobs locally. To submit your jobs to your Hadoop cluster, you should ensure that the connector is included in its classpath as well; I recommend using HADOOP_CLASSPATH. This thread has more details about it. A sketch of both options is shown below.
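The following shell sketch puts the options together; the jar path is illustrative:
# option 1: extend the classpath for the current shell session
export CLASSPATH=/opt/connectors/spark-bigquery-with-dependencies_2.11-0.18.0.jar:$CLASSPATH

# option 2: hand the jar to spark-submit directly
spark-submit --jars /opt/connectors/spark-bigquery-with-dependencies_2.11-0.18.0.jar my_job.py

# for jobs submitted to the Hadoop cluster, expose the jar to Hadoop as well
export HADOOP_CLASSPATH=/opt/connectors/spark-bigquery-with-dependencies_2.11-0.18.0.jar:$HADOOP_CLASSPATH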
Yes you can download it from the GitHub site and install in your spark cluster. Alternatively, you can add the --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0 to your spark command (spark-submit/pyspark/spark-shell).
Edit
There are a few options:
When you run your Spark app, run pyspark <params> --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0. The same goes for running spark-submit or spark-shell.
Download the jar from the repository and copy it to the /usr/lib/spark/jars/ directory. Usually this is done with a script after the cluster is available (using an init action).
Download the jars at runtime, as you mentioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0") \
.getOrCreate()
Documentation
You need to follow the Installing the connector document.
Download the connector jar file
Download the appropriate jar for the Scala version your Spark was compiled with, and note the /path/to/jar.
Scala 2.11: gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar (HTTP link)
Scala 2.12: gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (HTTP link)
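For example, on a VM that has the Cloud SDK installed, the jar can be copied from the bucket above; the destination directory is illustrative:
gsutil cp gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar /opt/connectors/spark-bigquery-latest_2.11.jar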
Cloud Storage access
Because your VM is in GCP, follow the instruction in the Installing the connector document.
Google Cloud Platform - Each Google Compute Engine VM must be configured to have access to the Cloud Storage scope you intend to use the connector for. When running inside Google Compute Engine VMs, including Dataproc clusters, google.cloud.auth.service.account.enable is set to false by default, which means you don't need to manually configure a service account for the connector; it will automatically get the service account credential from the metadata server of the VM. But you do need to make sure the VM service account has permission to access the GCS bucket.
Spark property
To tell Spark (both the driver and the executors) where to load the jar file from, set the Spark property. Note that I am using Spark on YARN, so please adjust according to your cluster configuration.
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()
For a non-GCP environment
OAuth 2.0 private key
Run the GCP SDK command and generate the application_default_credentials.json file.
gcloud auth application-default login
Place the key file where both the Spark submit account and the executor account can access and read it.
Spark properties to set
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile <path/to/keyfile.json>
No need to set the spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS property.
Spark Session
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()
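With a session like the one above, a quick way to check that the connector is on the classpath is to read a table; a minimal sketch, using a public dataset as an example (depending on your credentials you may also need the parentProject option):
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      # .option("parentProject", "your-billing-project")  # hypothetical project id, if required
      .load())
df.printSchema()
df.show(5)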

is it possible to run spark udf functions (mainly python) under docker?

I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.
This works fine for general (non-Spark) Python applications, and for the Spark driver (calling spark-submit from within a Docker image).
However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker, or just the UDF functions).
EDIT
Found a solution with a beta EMR version; if there's an alternative with current (5.*) EMR versions, it's still relevant.
Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
and it is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/
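For reference, on EMR 6 / YARN 3.2 the application master and executors can be placed inside a Docker image by passing the YARN container runtime settings as Spark confs, along the lines of the blog post above. A sketch, with an illustrative ECR image name:
DOCKER_IMAGE=123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE \
  my_udf_job.py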

How to run a PySpark job (with custom modules) on Amazon EMR?

I want to run a PySpark program that runs perfectly well on my (local) machine.
I have an Amazon Elastic Map Reduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).
Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day, now, to no avail. The best command I have found so far is:
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
--py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py
However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:
INFO yarn.Client: Uploading resource s3://bucket/custom_module.py ->
hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py
INFO s3n.S3NativeFileSystem: Opening
's3://bucket/custom_module.py' for reading
This looks like an awfully basic question, but the web is quite mute on this, including the official documentation (the Spark documentation seems to imply the command above).
This is a bug in Spark 1.3.0.
The workaround consists in defining SPARK_HOME for YARN, even though this should be unnecessary:
spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
--conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
