Connect GCP with PySpark without using Dataproc - python

I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc (self-hosted Spark in house). The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup runs on Docker. Thanks

Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here for instructions on how to create the JSON credentials file, if needed.
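As a hedged, more complete sketch (not from the original answer), assuming the connector jar is already available to the session; the key file path and table name are placeholders:
from pyspark.sql import SparkSession

# The connector jar must already be available, e.g. via --jars or spark.jars.packages
spark = SparkSession.builder.appName("bigquery-read").getOrCreate()

# Placeholders: key file path and table name
df = spark.read.format("bigquery") \
    .option("credentialsFile", "/path/to/key/file.json") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.show()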

The BigQuery connector is publicly available as a jar file (spark-bigquery-connector). Then you can either:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. There are some other possibilities that can impact your app; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is added to the classpath you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.
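As a hedged illustration (not from the original answer), here is roughly what a read and a write could look like once the connector jar is on the application classpath; the table names and the temporaryGcsBucket value are placeholders, and the exact write options depend on your connector version, so check the connector docs:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-on-prem").getOrCreate()

# Read: table name is a placeholder
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.source_table") \
    .load()

# Write: the indirect write path stages data in a GCS bucket (placeholder name)
df.write.format("bigquery") \
    .option("table", "my-project.my_dataset.target_table") \
    .option("temporaryGcsBucket", "my-staging-bucket") \
    .mode("append") \
    .save()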

Related

How to mock and test Databricks Pyspark notebooks Locally

How can I mock existing Azure Databricks PySpark code from a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP and just joined a Databricks project, and I need to run the cells one by one to see the result and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
Now, to use Databricks Utilities, you would in fact need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Accessing a cloud storage account can be done locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of this storage account using its secrets.
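As a minimal sketch of the local setup (pip install pyspark is assumed; the data is a placeholder):
# pip install pyspark
from pyspark.sql import SparkSession

# Local session for testing notebook logic outside Databricks
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("local-test") \
    .getOrCreate()

# Exercise the same transformations your notebook uses, but on local test data
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
Databricks-specific objects such as dbutils will still need to be stubbed or mocked locally, as noted above.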

How to install jars related to spark-redis in databricks cluster?

I am trying to connect to Azure Cache for Redis from Databricks.
I have installed the package com.redislabs:spark-redis:2.3.0 from Maven in Databricks and created a Spark session with the code below:
SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.redis.host", "my host") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "passwd") \
    .getOrCreate()
But when I ran df.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "name").save()
I got the error below:
Py4JJavaError: An error occurred while calling o390.save.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
Could you please let me know the detailed steps to install all the necessary libraries/jars to access Redis from Databricks?
I have seen the code below in the spark-redis Python docs, but I don't know how to run it in Databricks:
$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
Also, please let me know what the latest spark-redis version is.
Redis has a Spark Package that you can download and attach to your cluster.
The following notebook shows how to use Redis with Apache Spark in Azure Databricks.
For more details, refer to Azure Databricks - Redis.
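Once the com.redislabs:spark-redis Maven package is attached to the cluster, a minimal write/read sketch could look like the following; the host, port, and password are placeholders and must match your Azure Cache for Redis instance (they can also be set in the cluster's Spark config instead of the builder):
from pyspark.sql import SparkSession

# Placeholders: replace with your Azure Cache for Redis host/port/access key
spark = SparkSession.builder \
    .appName("redis-demo") \
    .config("spark.redis.host", "<your-cache>.redis.cache.windows.net") \
    .config("spark.redis.port", "6379") \
    .config("spark.redis.auth", "<access-key>") \
    .getOrCreate()

df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])

# Write to Redis using the spark-redis data source options
df.write.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .mode("overwrite") \
    .save()

# Read it back
spark.read.format("org.apache.spark.sql.redis") \
    .option("table", "people") \
    .option("key.column", "name") \
    .load() \
    .show()
Note that Azure Cache for Redis uses TLS on port 6380 and disables the non-TLS port 6379 by default, so you may need to enable the non-TLS port or configure the spark-redis SSL setting (spark.redis.ssl) accordingly.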

Sharepoint OnPremise Integration with Azure

Is there any way to integrate on-premise SharePoint with Azure ADF? There is a SharePoint connector in Azure ADF, but it only works with SharePoint Online.
Is there any other way to download from on-premise SharePoint, using Python or Scala?
Thanks
As of now, Data Factory doesn't support on-premise SharePoint as a connector.
About your other question, we can't tell you whether there is one or not; you must test it yourself. One workaround is to use an Azure Databricks Python notebook. I searched and found this document that may be helpful: Connect your Azure Databricks workspace to your on-premises network.
There isn't an existing tutorial/example for this. In the end, you need to build/design the code logic/library to connect to your on-premise SharePoint yourself; a hedged sketch of one possible approach follows below.
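As a heavily hedged sketch of such custom logic, assuming your on-premise SharePoint exposes its REST API over NTLM authentication and that the requests and requests_ntlm packages are available; the site URL, file path, and credentials are all placeholders:
import requests
from requests_ntlm import HttpNtlmAuth  # assumes NTLM auth is enabled on the farm

# All values below are placeholders
site_url = "https://sharepoint.mycompany.local/sites/mysite"
file_path = "/sites/mysite/Shared Documents/report.xlsx"
auth = HttpNtlmAuth("DOMAIN\\user", "password")

# SharePoint REST endpoint that returns the raw file contents
endpoint = f"{site_url}/_api/web/GetFileByServerRelativeUrl('{file_path}')/$value"
resp = requests.get(endpoint, auth=auth)
resp.raise_for_status()

with open("report.xlsx", "wb") as f:
    f.write(resp.content)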

How to install spark-bigquery-connector in the VM GCP?

I have a Hadoop cluster with Spark installed on a VM image in GCP, but it's not Dataproc.
Can I install the spark-bigquery-connector without using Dataproc?
If yes, how can I do it?
I found this link to download the connector:
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
If I understand correctly, your cluster is not a native Dataproc cluster. You created a set of instances (at least one) and manually installed a Hadoop cluster. This scenario is more like an on-prem Hadoop installation (using GCP infrastructure); unfortunately, the BigQuery connector documentation doesn't specify whether it can be used outside Dataproc, but I would think it should, since the connector is a jar file. In the section Downloading and Using the Connector, you can download the latest version, or a different one in case that one doesn't work.
To install the connector on a GCP VM instance, you need to include it in the Java classpath of your application:
Include it in a Spark directory that is already on the Java classpath, or add a new entry (this change lasts only until your shell session ends):
export CLASSPATH=</path/to/bigquery-connector.jar>:$CLASSPATH
Use the --jars option when submitting your Spark application; see the sketch below.
The options above will allow you to run Spark jobs locally. To submit your jobs to your Hadoop cluster, you should ensure that the connector is included in its classpath as well; I recommend using HADOOP_CLASSPATH. This thread has more details about it.
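A hedged sketch of the --jars route, where the jar path, script name, and table name are placeholders:
# Submitted with, for example:
#   spark-submit --jars /path/to/spark-bigquery-connector.jar read_bq.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read").getOrCreate()

# Placeholder table name
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.printSchema()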
Yes, you can download it from the GitHub site and install it in your Spark cluster. Alternatively, you can add --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0 to your Spark command (spark-submit/pyspark/spark-shell).
Edit
There are a few options:
When you run your Spark app, run pyspark <params> --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0. The same goes for running spark-submit or spark-shell.
Download the jar from the repository and copy it to the /usr/lib/spark/jars/ directory. Usually this is done with a script after the cluster is available (using an init action).
Download the jars at runtime, as you mentioned:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0") \
    .getOrCreate()
Documentation
You need to follow the Installing the connector document.
Download the connector jar file
Download the appropriate jar for the Scala version your Spark was compiled with, and note the /path/to/jar.
Scala 2.11: gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar (HTTP link)
Scala 2.12: gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (HTTP link)
Cloud Storage access
Because your VM is in GCP, follow the instructions in the Installing the connector document.
Google Cloud Platform - Each Google Compute Engine VM must be configured to have access to the Cloud Storage scope you intend to use the connector for. When running inside Google Compute Engine VMs, including Dataproc clusters, google.cloud.auth.service.account.enable is set to false by default, which means you don't need to manually configure a service account for the connector; it will automatically get the service account credentials from the metadata server of the VM. But you must make sure the VM service account has permission to access the GCS bucket.
Spark property
To tell Spark (both driver and executor) where to load the jar file, set the Spark property. Note that I am using Spark on YARN, so please adjust according to your cluster configuration.
spark = SparkSession.builder \
    .appName('spark-bigquery-demo') \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.jars", "/path/to/jar") \
    .getOrCreate()
For a non-GCP environment
OAuth 2.0 private key
Run the GCP SDK command to generate the application_default_credentials.json file:
gcloud auth application-default login
Place the keyfile where the Spark submit account and executor account can access and read it.
Spark properties to set
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile <path/to/keyfile.json>
No need to set the spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS property.
Spark Session
spark = SparkSession.builder \
    .appName('spark-bigquery-demo') \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
    .config("spark.jars", "/path/to/jar") \
    .getOrCreate()
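With the session configured as above, a read could then look like this sketch; the table and column names are placeholders:
# Placeholder table name
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

# Column pruning and filters are pushed down to BigQuery by the connector
df.select("some_column").where("some_column IS NOT NULL").show(10)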

Zeppelin and BigQuery

I'm looking for a visualisation and analytical notebook engine for BigQuery and am interested in Apache/Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require the installation of a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with version 0.6.1, there is a native BigQuery Interpreter for Apache Zeppelin available.
It allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which was the only way before.
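If you prefer to stay in Python inside the notebook rather than the SQL interpreter, a hedged alternative is the google-cloud-bigquery client library; this assumes the library is installed and application default credentials are configured, and the query is a placeholder against a public dataset:
from google.cloud import bigquery

# Uses application default credentials; the project can also be passed explicitly
client = bigquery.Client()

# Placeholder query against a public dataset
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name ORDER BY total DESC LIMIT 10
"""

# to_dataframe() requires pandas
df = client.query(sql).to_dataframe()
print(df)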
