I have a Hadoop cluster with Spark installed on VM instances in GCP, but it is not a Dataproc cluster.
Can I install the Spark BigQuery connector without using Dataproc?
If yes, how can I do it?
I found this link to download the connector
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
If I understand correctly, your cluster is not a native Dataproc cluster. You created a set of instances (at least one) and manually installed a Hadoop cluster. This scenario is more like an on-prem Hadoop installation (using GCP infrastructure). Unfortunately, the BigQuery connector documentation doesn't specify whether it can be used outside Dataproc, but I would expect it to work, since the connector is just a jar file. In the section Downloading and Using the Connector, you can download the latest version, or a different one in case that one doesn't work.
To install the connector on a GCP VM instance, you need to include it in the Java classpath of your application:
Place it in a Spark directory that is already on the Java classpath, or add a new classpath entry (this change only lasts until your shell session ends):
export CLASSPATH=</path/to/bigquery-connector.jar>:$CLASSPATH
Use the --jars option when submitting your Spark application.
The options above will allow you to run Spark jobs locally. To submit jobs to your Hadoop cluster, you should ensure that the connector is on its classpath as well; I recommend using HADOOP_CLASSPATH. This thread has more details about it. A minimal read example is sketched below.
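As a rough sketch of what the application code looks like once the connector jar is on the classpath (or passed via --jars); the table name below is only a placeholder, not something from the original post:
from pyspark.sql import SparkSession

# Minimal sketch, assuming the spark-bigquery connector jar is already on the
# classpath (or passed with --jars). The table name is a placeholder.
spark = SparkSession.builder.appName("bigquery-read-test").getOrCreate()

df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")  # hypothetical table
      .load())
df.printSchema()
df.show(5)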
Yes, you can download it from the GitHub site and install it on your Spark cluster. Alternatively, you can add --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0 to your Spark command (spark-submit/pyspark/spark-shell).
Edit
There are a few options:
When you run your Spark app, run pyspark <params> --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0. The same goes for spark-submit and spark-shell.
Download the jar from the repository and copy it to the /usr/lib/spark/jars/ directory. Usually this is done with a script that runs after the cluster is available (an init action).
Download the jars at runtime, as you mentioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0") \
.getOrCreate()
Documentation
You need to follow the Installing the connector document.
Download the connector jar file
Download the appropriate jar for the Scala version your Spark build was compiled with, and note the /path/to/jar. (A small snippet for checking the Scala version follows the table below.)
Scala 2.11: gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar (HTTP link)
Scala 2.12: gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (HTTP link)
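If you are not sure which Scala version your Spark build uses, one quick check (a sketch, assuming SPARK_HOME points at your Spark installation) is to look at the Scala library jar bundled with the distribution:
import glob
import os

# Sketch: infer the Scala version Spark was built with from the bundled
# scala-library jar (e.g. scala-library-2.12.10.jar). Assumes SPARK_HOME is set.
spark_home = os.environ["SPARK_HOME"]
print(glob.glob(os.path.join(spark_home, "jars", "scala-library-*.jar")))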
Cloud Storage access
Because your VM is in GCP, follow the instructions in the Installing the connector document.
Google Cloud Platform - Each Google Compute Engine VM must be configured to have access to the Cloud Storage scope you intend to use the connector for. When running inside Google Compute Engine VMs, including Dataproc clusters, google.cloud.auth.service.account.enable is set to false by default, which means you don't need to manually configure a service account for the connector; it will automatically obtain the service account credentials from the VM's metadata server. But you must make sure the VM's service account has permission to access the GCS bucket.
Spark property
To tell Spark (both the driver and the executors) where to load the jar file from, set the spark.jars property. Note that I am using Spark on YARN, so please adjust according to your cluster configuration. A short usage sketch follows the code below.
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()
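As a follow-up usage sketch for this session (the table and bucket names are placeholders, not values from the original post), a read and an indirect write look like this:
# Sketch only: read a BigQuery table and write one back. The table and bucket
# names are placeholders. Indirect writes stage data through a GCS bucket.
df = (spark.read.format("bigquery")
      .option("table", "my-project.my_dataset.my_table")
      .load())

(df.write.format("bigquery")
   .option("table", "my-project.my_dataset.my_output_table")
   .option("temporaryGcsBucket", "my-temp-bucket")
   .save())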
For a non-GCP environment
OAuth 2.0 private key
Run the Google Cloud SDK command below to generate the application_default_credentials.json file.
gcloud auth application-default login
Place the key file where both the Spark submit account and the executor account can access and read it.
Spark properties to set
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile <path/to/keyfile.json>
There is no need to set the spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS property.
Spark Session
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()
Related
I am trying to set up a Spark standalone node that can be used locally without a cluster, and to run Spark programs on that instance.
From the articles I have read, there are multiple approaches to getting PySpark on an instance; I am listing just these two:
PySpark from PyPI - runs with either
python file.py
spark-submit
Install with Apache Spark - Runs with
spark-submit
Both can run PySpark; however, what I am confused about is whether the pip installation of PySpark sets the node up as a single-node standalone cluster, and what the difference between the two is.
As is written in the docs:
The Python packaging for Spark is not intended to replace all ... use cases. This Python packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos) - but does not contain the tools required to setup your own standalone Spark cluster. You can download the full version of Spark from the Apache Spark downloads page.
What is the way to execute a Spark application on a single node for a relatively small dataset, say about 50 GB? How can we set up a cluster on a single node?
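For what it's worth, a minimal single-node sketch (no cluster manager at all) just uses the local master; this works with both the pip-installed PySpark and a full Spark download:
from pyspark.sql import SparkSession

# Minimal single-node sketch: local[*] runs Spark in-process using all cores,
# with no standalone master/worker daemons required.
spark = (SparkSession.builder
         .appName("local-single-node")
         .master("local[*]")
         .getOrCreate())

df = spark.range(1000)
print(df.count())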
I am trying to connect to Azure Cache for Redis from Databricks.
I have installed the package com.redislabs:spark-redis:2.3.0 from Maven in Databricks, and I have created a Spark session with the code below:
spark = SparkSession\
.builder\
.appName("myApp")\
.config("spark.redis.host", "my host")\
.config("spark.redis.port", "6379")\
.config("spark.redis.auth", "passwd")\
.getOrCreate()
But when I ran df.write.format("org.apache.spark.sql.redis").option("table", "people").option("key.column", "name").save()
I am getting the error below.
Py4JJavaError: An error occurred while calling o390.save.
: java.lang.ClassNotFoundException:
Failed to find data source: org.apache.spark.sql.redis. Please find packages at
http://spark.apache.org/third-party-projects.html
Could you please let me know the detailed steps to install all the necessary libraries/jars to access Redis in Databricks?
I have seen the code below in the spark-redis Python docs, but I don't know how to run it in Databricks.
$ ./bin/pyspark --jars <path-to>/spark-redis-<version>-jar-with-dependencies.jar
Also, please let me know what the latest spark-redis version is.
Redis has a Spark package that you can download and attach to your cluster.
The following notebook shows how to use Redis with Apache Spark in Azure Databricks.
For more details, refer to Azure Databricks - Redis.
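Once the spark-redis package (the jar with dependencies) is attached to the cluster and the spark.redis.* settings are configured as in the question, a quick way to verify the setup is to write a small DataFrame and read it back; this is only a sketch:
# Sketch: assumes the spark-redis package is attached to the cluster and
# spark.redis.host/port/auth are already configured on the Spark session.
df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])

(df.write.format("org.apache.spark.sql.redis")
   .option("table", "people")
   .option("key.column", "name")
   .mode("overwrite")
   .save())

people = (spark.read.format("org.apache.spark.sql.redis")
          .option("table", "people")
          .option("key.column", "name")
          .load())
people.show()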
I am new to Python on a local machine; until now I have only coded in Azure Databricks. I want to create and deploy libraries that connect to Confluent Kafka and save data to a Delta table.
I am confused:
1] Do I need to connect to Databricks Delta from my local machine using Python to store the streams to Delta,
OR
2] store the streams to a local Delta table (I am able to create a Delta table) by setting up Spark like below
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
and then deploy the library into Databricks, so that when it runs it will point to Databricks Delta?
I also want to use the DBFS FileStore when connecting to Kafka:
.option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks") \
I am new to this. Please share the details of how to create a streaming application in Python, and how to deploy it to Databricks.
To execute Python code on Databricks without notebooks, you need to configure a job. As was mentioned by OneCricketeer, the egg is the file format for libraries; you will also need a Python file that serves as the entry point for the job - it will initialize the Spark session and then call your libraries.
The job could be configured in several ways (you will also need to upload your libraries):
via the UI, but it's limited to configuring notebooks and jars, not Python code directly. You'll still be able to run Python code using the spark-submit option.
via the REST API - with it, you can create a job that executes Python code directly (a rough sketch follows this list).
via the command line (it uses the REST API under the hood), and you'll need to create the JSON yourself, the same way as for the REST API.
via the Databricks Terraform Provider - it also uses the REST API, but it can be easier to configure everything in one place: upload libraries, upload the file to DBFS, and create/modify the job.
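Here is a rough sketch of the REST API route, assuming the Jobs API 2.0 create endpoint; the workspace URL, token, DBFS paths, and cluster spec are placeholders, not values from the original post:
import requests

# Sketch: create a job that runs a Python entry-point file via the Databricks
# Jobs API. Host, token, DBFS paths, and cluster spec are placeholders.
host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

job_spec = {
    "name": "my-python-job",
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
    "libraries": [{"egg": "dbfs:/FileStore/libs/my_lib.egg"}],
    "spark_python_task": {
        "python_file": "dbfs:/FileStore/jobs/entry_point.py",
        "parameters": [],
    },
}

resp = requests.post(
    f"{host}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
print(resp.json())  # contains the new job_id on success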
On Databricks, Delta is already pre-installed, so you don't need to set extra options, specify Maven coordinates, and so on; your initialization code will simply be:
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.getOrCreate()
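For the streaming part of the question, here is a minimal sketch of reading from Kafka and appending to a Delta table; the broker, topic, truststore password, and output paths are placeholders (the truststore path is the one mentioned in the question):
# Sketch only: read from Kafka over SSL and append to a Delta table.
# Broker, topic, password, and paths are placeholders, not working values.
raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9093")
       .option("subscribe", "my_topic")
       .option("kafka.security.protocol", "SSL")
       .option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks")
       .option("kafka.ssl.truststore.password", "<truststore-password>")
       .load())

events = raw.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

query = (events.writeStream.format("delta")
         .option("checkpointLocation", "/delta/events/_checkpoints")
         .outputMode("append")
         .start("/delta/events"))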
I am trying to deploy Dask Gateway on Google Kubernetes Engine. There are no issues with the deployment itself. However, I am experiencing issues when using a custom dask-gateway Dockerfile that inherits from the default Docker image on Docker Hub; the image is then pushed to Google Container Registry (GCR). It seems to result in the following PermissionError.
PermissionError: [Errno 13] Permission denied: '/home/dask/dask-worker-space
(The full stack trace was provided as a screenshot.)
The intriguing part is that the dask workers start up without any issue when they use the Docker image directly from Docker Hub instead of GCR. I need a custom Dockerfile to add a few more Python packages to the dask workers, but other than that, there are no configuration changes. It's as though pushing the Docker image to GCR does something funky to the permissions.
Here is the dockerfile I am using for the dask workers:
FROM daskgateway/dask-gateway:0.9.0
RUN pip --no-cache-dir install --upgrade cloudpickle dask-ml scikit-learn \
nltk gensim spacy keras asyncio google-cloud-storage SQLAlchemy snowflake-sqlalchemy google-api-core gcsfs pyarrow mlflow \
tensorflow prefect hvac aiofile google-cloud-logging
Any help would be greatly appreciated because I have no idea how to debug.
As you are using a GKE cluster, make sure that the service account you set for the cluster has the correct permissions on Container Registry.
You are creating an image and pushing it to Container Registry, so you will need write permissions there. The process is different depending on whether you are using the default service account or a custom one.
If you are using the default service account, you will need at least the Storage read and write scopes for this action. (GKE clusters are created by default with only the read scope.)
If you have a running cluster, you will need to change the scopes on every node pool:
gcloud container node-pools create [new pool name] \
--cluster [cluster name] \
--machine-type [your desired machine type] \
--num-nodes [the same amount of nodes you have] \
--scopes [your new set of scopes]
(All the possible options can be found on the command gcloud container node-pools create --help)
After you have done that, you will need to drain the nodes (kubectl drain [node]) and delete the old node pool:
gcloud container node-pools delete [POOL_NAME] \
--cluster [CLUSTER_NAME]
If you don't have a cluster yet, you can edit the scopes in the console while creating it or, if you create it with gcloud, pass the scopes you want (full list).
If you are using a custom service account, make sure it has the role "roles/storage.admin" granted. (source)
I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc (self-hosted Spark in house). The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark and Hadoop setup runs on Docker. Thanks.
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here for how to create the JSON credentials file, if needed.
The BigQuery connector is available as a publicly available jar file, spark-bigquery-connector. Then you can:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. Regarding this, there are some other possibilities that can impact your app; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is added to the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.
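Putting both pieces together, here is a minimal sketch for a self-hosted (non-Dataproc) setup; the connector version, key file path, and table name are placeholders:
from pyspark.sql import SparkSession

# Sketch for a self-hosted cluster: pull the connector at startup via
# spark.jars.packages and authenticate with a service-account key file.
# The version, path, and table name are placeholders.
spark = (SparkSession.builder
         .appName("bigquery-on-selfhosted-spark")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")
         .getOrCreate())

df = (spark.read.format("bigquery")
      .option("credentialsFile", "/path/to/key/file.json")
      .option("table", "my-project.my_dataset.my_table")
      .load())
df.show(5)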