How to install jars related to spark-redis in a Databricks cluster? - python

I am trying to connect to Azure Cache for Redis from Databricks.
I have installed the package com.redislabs:spark-redis:2.3.0 from Maven in Databricks and created a Spark session with the code below:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.redis.host", "my host") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "passwd") \
    .getOrCreate()
But when I run df.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "name").save(),
I get the error below:
Py4JJavaError: An error occurred while calling o390.save.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
Could you please let me know the detailed steps to install all the necessary libraries/jars to access Redis from Databricks?
I have seen the command below in the spark-redis Python docs, but I don't know how to run it in Databricks.
$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
Also, please let me know what the latest spark-redis version is.

Redis provides a Spark package that you can download and attach to your cluster.
A sample notebook shows how to use Redis with Apache Spark in Azure Databricks; for more details, refer to Azure Databricks - Redis.
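As a minimal sketch of that approach (assuming the com.redislabs:spark-redis Maven coordinate is attached to the cluster as a library, and using placeholder host and password values), a write and read through the spark-redis data source would look roughly like this; on Databricks you may prefer to set the spark.redis.* values as cluster Spark configs rather than on the builder:

from pyspark.sql import SparkSession

# Assumes the spark-redis library is already attached to the cluster;
# the host and password below are placeholders
spark = SparkSession.builder \
    .appName("myApp") \
    .config("spark.redis.host", "<your-cache>.redis.cache.windows.net") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "<password>") \
    .getOrCreate()

df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], ["name", "age"])

# Write the DataFrame to Redis hashes keyed by the "name" column
df.write.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .mode("overwrite") \
    .save()

# Read it back through the same data source
people = spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .load()
people.show()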

Related

How to install the spark-bigquery-connector on a GCP VM?

I have a Hadoop cluster with Spark installed on GCP VM instances, but it is not a Dataproc cluster.
Can I install the spark-bigquery-connector without using Dataproc?
If yes, how can I do it?
I found this link to download the connector:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
If I understand correctly, your cluster is not a native Dataproc cluster. You created a set of instances (at least one) and manually installed a Hadoop cluster. This scenario is more like an on-prem Hadoop installation (using GCP infrastructure); unfortunately, the BigQuery connector documentation doesn't specify whether it can be used outside Dataproc, but I would think it should work, since the connector is just a jar file. In the section Downloading and Using the Connector, you can download the latest version, or a different one in case that one doesn't work.
To install the connector on a GCP VM instance, you need to include it in the Java classpath of your application:
Include it in a Spark directory that is already on the Java classpath, or add a new classpath entry (this change only lasts for the current shell session):
export CLASSPATH=</path/to/bigquery-connector.jar>:$CLASSPATH
Use the --jars option when submitting your spark application.
The options above will allow you to run Spark jobs locally. To submit jobs to your Hadoop cluster, you should ensure that the connector is included in its classpath as well; I recommend using HADOOP_CLASSPATH. This thread has more details about it.
Yes, you can download it from the GitHub site and install it in your Spark cluster. Alternatively, you can add --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0 to your Spark command (spark-submit/pyspark/spark-shell).
Edit
There are a few options:
When you run your Spark app, run pyspark <params> --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0. The same goes for spark-submit or spark-shell.
Download the jar from the repository and copy it to the /usr/lib/spark/jars/ directory. Usually this is done with a script after the cluster is available (using an init action).
Download the jars at runtime, as you mentioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0") \
.getOrCreate()
Documentation
You need to follow the Installing the connector document.
Download the connector jar file
Download the appropriate jar for the Scala version your Spark was compiled with, and note its /path/to/jar.
Scala 2.11: gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar (HTTP link)
Scala 2.12: gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (HTTP link)
Cloud Storage access
Because your VM is in GCP, follow the instructions in the Installing the connector document.
Google Cloud Platform - Each Google Compute Engine VM must be configured to have access to the Cloud Storage scope you intend to use the connector for. When running inside of Google Compute Engine VMs, including Dataproc clusters, google.cloud.auth.service.account.enable is set to false by default, which means you don't need to manually configure a service account for the connector; it will automatically get the service account credential from the metadata server of the VM. But you must make sure the VM service account has permission to access the GCS bucket.
Spark property
To tell Spark (both driver and executors) where to load the jar file from, set the spark.jars property. Note that I am using Spark on YARN, so please adjust according to your cluster configuration.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark-bigquery-demo') \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", "/path/to/jar") \
    .getOrCreate()
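With a session like the one above, reading from BigQuery should look roughly like the following sketch; the public Shakespeare sample table is only an example:

# Read a BigQuery table through the connector; the table name is an example
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.samples.shakespeare") \
    .load()
df.show(5)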
For a non-GCP environment
OAuth 2.0 private key
Run the Cloud SDK command below to generate the application_default_credentials.json file:
gcloud auth application-default login
Place the key file where both the Spark submit account and the executor account can read it.
Spark properties to set
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile <path/to/keyfile.json>
No need to set the spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS property.
Spark Session
from pyspark.sql import SparkSession

# GOOGLE_APPLICATION_CREDENTIALS holds the path to keyfile.json
spark = SparkSession.builder \
    .appName('spark-bigquery-demo') \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
    .config("spark.jars", "/path/to/jar") \
    .getOrCreate()
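As a usage sketch, writing a DataFrame back to BigQuery typically also requires a temporary GCS bucket for staging; the dataset, table, and bucket names below are placeholders:

# Write to BigQuery; the connector stages the data in a temporary GCS bucket first
df.write.format("bigquery") \
    .option("table", "my_dataset.my_table") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .mode("append") \
    .save()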

Accessing delta lake through Pyspark on EMR notebooks

I have a question about using external libraries like delta-core with AWS EMR notebooks. Currently there isn't any mechanism for installing the delta-core libraries through PyPI packages. The available options include:
Launching the PySpark kernel with the --packages option.
Changing the packages option in the Python script through the os configuration, but I don't see that it downloads the packages, and I still get an import error on the import delta.tables library.
Downloading the JARs manually, but there doesn't appear to be any option for that on EMR notebooks.
Has anyone tried this out before?
You can download the jars while creating the EMR cluster using bootstrap scripts.
Alternatively, you can place the jars in S3 and pass them to PySpark with the --jars option.
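Another hedged option is to pull delta-core from Maven at session start via spark.jars.packages; the coordinate below is only an example and must match the Scala/Spark version of your EMR release, and on EMR notebooks the configuration has to be applied before the Spark session is created (for example through the notebook's session configuration) rather than on an already-running session:

from pyspark.sql import SparkSession

# Sketch: resolve delta-core from Maven when the session starts;
# the coordinate/version is an example and must match your cluster's Scala/Spark build
spark = SparkSession.builder \
    .appName("delta-on-emr") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import DeltaTable  # should resolve once the package is on the classpath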

Connect GCP with PySpark without using Dataproc

I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc (self-hosted Spark in house). The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup runs on Docker. Thanks.
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here for how to create the JSON credentials file if needed.
The BigQuery connector is publicly available as a jar file, spark-bigquery-connector. Then you can:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. There are some other possibilities that can impact your app; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is added to the classpath, you can check the two BigQuery connector examples; one of them was already provided by David Rabinowitz.
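For a self-hosted setup (for example, the Docker case in the question), a minimal sketch that pulls the connector from Maven and authenticates with a service account key file might look like this; the version, key file path, and table name are placeholders/examples:

from pyspark.sql import SparkSession

# Sketch: resolve the connector from Maven and authenticate with a key file;
# the version, key file path, and table below are examples, not fixed values
spark = SparkSession.builder \
    .appName("bigquery-self-hosted") \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0") \
    .getOrCreate()

df = spark.read.format("bigquery") \
    .option("credentialsFile", "/path/to/key/file.json") \
    .option("table", "bigquery-public-data.samples.shakespeare") \
    .load()
df.printSchema()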

How to develop with PYSPARK locally and run on Spark Cluster?

I'm new to Spark. I installed Spark 2.3.0 in stand-alone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark, because I have more experience using Python than Scala.
Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
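As a minimal sketch of pointing local code at the standalone cluster (the master URL is a placeholder, and the laptop needs network access to the master and workers plus matching Spark and Python versions), a session could be built like this:

from pyspark.sql import SparkSession

# Connect from the local development machine to the standalone master;
# the spark:// URL is a placeholder for your server
spark = SparkSession.builder \
    .appName("local-dev-remote-cluster") \
    .master("spark://your-server:7077") \
    .getOrCreate()

spark.range(10).show()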

Cannot save Spark dataframe to Google Cloud Storage from PySpark

I have a Spark DataFrame that I'm trying to save to a Google Cloud Storage bucket with the line
df.write.format("com.databricks.spark.csv").save('gs://some-test-bucket-delete-me')
But PySpark raises the following exception:
Py4JJavaError: An error occurred while calling o55.save.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
What I've tried:
The typical solution to this exception is to make sure that the environment variable HADOOP_CLASSPATH points at the gcs-connector-latest-hadoop2.jar file, which it does. I've tried using the Hadoop version 1 as well as version 2 jars in case that was the problem. I've tried explicitly pointing at it within a Jupyter notebook with
sc._jsc.hadoopConfiguration().set("spark.driver.extraClassPath", "/home/****/hadoop-2.8.2/share/hadoop/common/lib/gcs-connector-latest-hadoop1.jar")
to no avail.
If I run hadoop fs -ls gs://some-test-bucket-delete-me from bash, the command works fine, which is supposed to indicate that the Google Cloud Storage connector works, but for some reason I can't seem to get this functionality to work in PySpark.
Things that may be important:
Spark Version 2.2.0
Python 3.6.1 :: Anaconda custom (64-bit)
I'm running PySpark locally
You should run gcloud init first
Then try df.write.csv('gs://some-test-bucket-delete-me/file_name')
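If you are running locally outside GCP, another hedged option is to wire the GCS connector and a service account key directly into the Spark session; the jar path, key file path, and bucket name below are placeholders:

from pyspark.sql import SparkSession

# Sketch: register the GCS connector and service account credentials explicitly;
# the jar path, key file, and bucket name are placeholders
spark = SparkSession.builder \
    .appName("gcs-write-demo") \
    .config("spark.jars", "/path/to/gcs-connector-hadoop2-latest.jar") \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile.json") \
    .getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.write.mode("overwrite").csv("gs://some-test-bucket-delete-me/demo")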
