Develop and test Python code to connect Kafka streams on a local machine - python

I am new to running Python on my local machine; until now I have only coded in Azure Databricks. I want to create and deploy libraries that connect to Confluent Kafka and save data to a Delta table.
I am confused about the following:
1] Do I need to connect to Databricks Delta from my local machine using Python to store the streams to Delta,
OR
2] store the streams to a local Delta table (I am able to create a Delta table) by setting it up like below,
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
and then deploy the library to Databricks, so that when it runs there it points to Databricks Delta?
I also want to use the DBFS FileStore when connecting to Kafka:
.option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks") \
I am new to this, so please share the details: how do I create a streaming application in Python, and how do I deploy it to Databricks?

To execute Python code on Databricks without notebooks, you need to configure a job. As OneCricketeer mentioned, the egg is the file format for libraries; you will also need a Python file that serves as the entry point for the job - it initializes the Spark session and then calls your libraries.
The job can be configured in any of the following ways (you will also need to upload your libraries):
via the UI - it is limited to configuring notebooks and jars, not Python code, but you can still run Python code using the spark-submit option.
via the REST API - with it, you can create a job that executes Python code directly (see the sketch after this list).
via the command line (it uses the REST API under the hood) - you will need to create the JSON yourself, the same way as for the REST API.
via the Databricks Terraform Provider - it also uses the REST API, but it can be easier to configure everything in one place: upload the libraries, upload the file to DBFS, and create/modify the job.
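As a rough illustration of the REST API route, here is a minimal sketch that creates a job running a Python entry point from DBFS. The workspace URL, token, DBFS paths, and cluster settings are placeholders you would replace with your own:
import requests

# Placeholders: your workspace URL, a personal access token, and the DBFS
# paths where the entry-point file and the packaged library were uploaded.
HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "kafka-to-delta-job",
    "new_cluster": {
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 1,
    },
    "libraries": [{"egg": "dbfs:/FileStore/libs/my_streaming_lib.egg"}],
    "spark_python_task": {"python_file": "dbfs:/FileStore/jobs/entry_point.py"},
}

resp = requests.post(
    f"{HOST}/api/2.0/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job:", resp.json()["job_id"])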
On Databricks, Delta is already pre-installed, so you don't need to set those options, specify Maven coordinates, or anything else; your initialization code will simply be:
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.getOrCreate()
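For the streaming part itself, a minimal sketch could look like the following; the broker address, topic, truststore password, and output/checkpoint paths are placeholders to adapt to your Confluent setup:
# Read from Kafka over SSL, using the truststore stored in DBFS.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<broker-host>:9092")
      .option("subscribe", "<topic-name>")
      .option("kafka.security.protocol", "SSL")
      .option("kafka.ssl.truststore.location", "/dbfs/FileStore/tables/test.jks")
      .option("kafka.ssl.truststore.password", "<truststore-password>")
      .option("startingOffsets", "latest")
      .load())

# Kafka delivers key/value as binary, so cast them to strings before storing.
parsed = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")

# Append the stream to a Delta table, with a checkpoint for fault tolerance.
(parsed.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/delta/checkpoints/my_stream")
       .start("/delta/tables/my_stream"))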

Related

BigQuery emulator with local instance of the Apache Airflow

I'm working on a project for integrating different data sources into Google BigQuery.
It is a batch approach.
We are using Apache Airflow for orchestration.
Simplified flow is: create raw tables (predefined DDL) -> call python code to do batch insert (via bq python client) -> trigger different SQL for transformations -> end
For testing purposes, we're using the GCP dev project.
But, recently, I found the BigQuery Emulator.
The Python client example works just fine (see BigQuery Emulator: Call endpoint from python client).
I'm curious about how to configure the local instance of Airflow to use this emulator.
I didn't find a way to point BigQueryExecuteQueryOperator to use the emulator. We are using this operator to trigger all our SQL.
I tried to set 'gcp_conn_id' but it always fails with "Could not automatically determine credentials. Please set GOOGLE_APPLICATION_CREDENTIALS or explicitly create credentials and re-run the application. For more information, please see https://cloud.google.com/docs/authentication/getting-started"
As the connection type I tried both HTTP and Google BigQuery, with no difference.
Airflow version: 2.3.4
bigquery-emulator: 0.1.11
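For reference, the client-side pattern that works against the emulator looks roughly like this (the project name and port are assumptions based on the emulator's defaults):
from google.api_core.client_options import ClientOptions
from google.auth.credentials import AnonymousCredentials
from google.cloud import bigquery

# Assumed local endpoint (bigquery-emulator's default HTTP port) and a dummy project.
client = bigquery.Client(
    project="test-project",
    credentials=AnonymousCredentials(),
    client_options=ClientOptions(api_endpoint="http://0.0.0.0:9050"),
)
print(list(client.query("SELECT 1 AS x").result()))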

How to mock and test Databricks Pyspark notebooks Locally

How can I mock the existing Azure Databricks PySpark code of a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP; I just joined a Databricks project and need to run the cells one by one to see the results and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
To use the Databricks Utilities, however, you would need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Accessing a cloud storage account can be done either locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of this storage account using its secrets.
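As a minimal sketch of the local setup (after pip install pyspark), assuming you just want to run and inspect the transformations:
from pyspark.sql import SparkSession

# A local session for testing; "local[*]" uses all cores of your machine.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("local-testing")
         .getOrCreate())

# Hypothetical example: replace with the project's own DataFrame logic.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()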

How to install spark-bigquery-connector in the VM GCP?

I have a Hadoop cluster with Spark installed on GCP VM instances, but it is not a Dataproc cluster.
Can I install the spark-bigquery-connector without using Dataproc?
If yes, how can I do it?
I found this link to download the connector
https://github.com/GoogleCloudDataproc/spark-bigquery-connector
If I understand correctly, your cluster is not a native Dataproc cluster. You created a set of instances (at least one) and installed a Hadoop cluster manually. This scenario is more like an on-prem Hadoop installation (using GCP infrastructure); unfortunately, the BigQuery connector documentation doesn't specify whether it can be used outside Dataproc, but I would think it should, since the connector is just a jar file. In the section Downloading and Using the Connector, you can download the latest version, or a different one in case that one doesn't work.
To install the connector on a GCP VM instance, you need to include it in the Java classpath of your application:
Include it in a Spark directory that is already on the Java classpath, or add a new entry (this change is lost when your shell session ends):
export CLASSPATH=</path/to/bigquery-connector.jar>:$CLASSPATH
Use the --jars option when submitting your spark application.
The options above will allow you to run Spark jobs locally. To submit your jobs to your Hadoop cluster, you should ensure that the connector is included in its classpath as well; I recommend using HADOOP_CLASSPATH. This thread has more details about it.
Yes, you can download it from the GitHub site and install it in your Spark cluster. Alternatively, you can add --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0 to your Spark command (spark-submit/pyspark/spark-shell).
Edit
There are a few options:
When you run your Spark app, run pyspark <params> --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0. The same goes for running spark-submit or spark-shell.
Download the jar from the repository and copy it to the /usr/lib/spark/jars/ directory. Usually this is done with a script after the cluster is available (using an init action).
Download the jars at runtime, as you mentioned:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.config("spark.jars.packages", "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.18.0") \
.getOrCreate()
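Whichever option you choose, a quick sanity check that the connector is on the classpath is to read a public table (this is just a sketch using the public Shakespeare sample dataset):
df = (spark.read.format("bigquery")
      .option("table", "bigquery-public-data.samples.shakespeare")
      .load())
df.printSchema()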
Documentation
You need to follow the Installing the connector document.
Download the connector jar file
Download the appropriate jar for the Scala version your Spark was compiled with, and note the /path/to/jar.
Version     Link
Scala 2.11  gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar (HTTP link)
Scala 2.12  gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar (HTTP link)
Cloud Storage access
Because your VM is in GCP, follow the instructions in the Installing the connector document.
Google Cloud Platform - Each Google Compute Engine VM must be configured to have access to the Cloud Storage scope you intend to use the connector for. When running inside Google Compute Engine VMs, including Dataproc clusters, google.cloud.auth.service.account.enable is set to false by default, which means you don't need to manually configure a service account for the connector; it will automatically get the service account credential from the metadata server of the VM. But you must make sure the VM service account has permission to access the GCS bucket.
Spark property
To tell Spark (both the driver and the executors) where to load the jar file from, set the Spark property. Note that I am using Spark on YARN, so please adjust it according to your cluster configuration.
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()
For non-GCP environments
OAuth 2.0 private key
Run the GCP SDK command and generate the application_default_credentials.json file.
gcloud auth application-default login
Place the keyfile where the Spark submit account and the executor accounts can access and read it.
Spark properties to set
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.google.cloud.auth.service.account.json.keyfile <path/to/keyfile.json>
No need to set the spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS property.
Spark Session
spark = SparkSession.builder\
.appName('spark-bigquery-demo') \
.master('yarn') \
.config('spark.submit.deployMode', 'client') \
.config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
.config("spark.jars", "/path/to/jar") \
.getOrCreate()

Connect GCP with PySpark without using Dataproc

I'm trying to connect GCP (Google BigQuery) with Spark (using pyspark) without using Dataproc (self-hosted Spark in house); the official Google documentation covers only Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup is on Docker. Thanks.
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here on how to create the json credentials file if needed.
The BigQuery connector is available as a jar file (spark-bigquery-connector), and it is publicly available. Then you can:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. Regarding this, there are some other possibilities that can impact your app; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is added to the classpath, you can check the two BigQuery connector examples, one of which was already provided by @David Rabinowitz.
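Putting the two points together, a minimal sketch for a self-hosted (non-Dataproc) setup might look like this; the connector version, key file path, and table name are placeholders to adapt:
from pyspark.sql import SparkSession

# Pull the connector at startup instead of managing the jar manually
# (pick the artifact matching your Scala version).
spark = (SparkSession.builder
         .appName("bigquery-on-selfhosted-spark")
         .config("spark.jars.packages",
                 "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0")
         .getOrCreate())

# Authenticate with a service account key file and read a table.
df = (spark.read.format("bigquery")
      .option("credentialsFile", "/path/to/key/file.json")
      .option("table", "my-project.my_dataset.my_table")
      .load())
df.show()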

How to store data in GCS while accessing it from GAE and 'GCE' locally

There's a GAE project using GCS to store/retrieve files. These files also need to be read by code that will run on GCE (it needs C++ libraries and therefore cannot run on GAE).
In production, deployed on the actual GAE > GCS < GCE, this setup works fine.
However, testing and developing locally is a different story that I'm trying to figure out.
As recommended, I'm running GAE's dev_appserver with GoogleAppEngineCloudStorageClient to access the (simulated) GCS. Files are put in the local blobstore. Great for testing GAE.
Since there is no GCE SDK to run a VM locally, whenever I refer to the local 'GCE', it's just my local development machine running Linux.
On the local GCE side I'm just using the default boto library (https://developers.google.com/storage/docs/gspythonlibrary) with a Python 2.x runtime to interface with the C++ code and retrieve files from GCS. However, in development, these files are inaccessible from boto because they're stored in the dev_appserver's blobstore.
Is there a way to properly connect the local GAE and GCE to a local GCS?
For now, I gave up on the local GCS part and tried using the real GCS. The GCE part with boto is easy. The GAE GCS client can also use the real GCS instead of the local blobstore by setting an access token:
cloudstorage.common.set_access_token(access_token)
According to the docs:
access_token: you can get one by run 'gsutil -d ls' and copy the str after 'Bearer'.
That token works for a limited amount of time, so that's not ideal. Is there a way to set a more permanent access_token?
There is a convenient option to access Google Cloud Storage from a development environment: use the client library provided with the Google Cloud SDK. After executing gcloud init locally, you get access to your resources.
As shown in the Client library authentication examples:
# Get the application default credentials. When running locally, these are
# available after running `gcloud init`. When running on compute
# engine, these are available from the environment.
credentials = GoogleCredentials.get_application_default()
# Construct the service object for interacting with the Cloud Storage API -
# the 'storage' service, at version 'v1'.
# You can browse other available api services and versions here:
# https://developers.google.com/api-client-library/python/apis/
service = discovery.build('storage', 'v1', credentials=credentials)
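A hypothetical follow-up call on that service object, listing the objects in a placeholder bucket, would look like this:
# List the objects in a bucket (the bucket name is a placeholder).
objects = service.objects().list(bucket='my-example-bucket').execute()
for item in objects.get('items', []):
    print(item['name'])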
Google libraries come and go like tourists in a train station. Today (2020) google-cloud-storage should work on GCE and GAE Standard Environment with Python 3.
On GAE and GCE it picks up access credentials from the environment, and locally you can provide it with a service account JSON file like this:
GOOGLE_APPLICATION_CREDENTIALS=../sa-b0af54dea5e.json
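A minimal sketch with google-cloud-storage (the bucket and object names are placeholders):
from google.cloud import storage

# Locally this relies on GOOGLE_APPLICATION_CREDENTIALS; on GCE/GAE the
# credentials are picked up from the environment.
client = storage.Client()
bucket = client.bucket("my-example-bucket")

# Upload and download a file (placeholder names).
bucket.blob("path/in/bucket.txt").upload_from_filename("local-file.txt")
bucket.blob("path/in/bucket.txt").download_to_filename("downloaded-copy.txt")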
If you're always using "real" remote GCS, the newer gcloud is probably the best library: http://googlecloudplatform.github.io/gcloud-python/
It's really confusing how many storage client libraries there are for Python. Some are for AE only, but they often force (or at least default to) using the local mock Blobstore when running with dev_appserver.py.
Seems like gcloud is always using the real GCS, which is what I want.
It also "magically" fixes authentication when running locally.
It looks like appengine-gcs-client for Python is now only useful for production App Engine and inside dev_appserver.py, and the local examples for it have been removed from the developer docs in favor of Boto :( If you are deciding not to use the local GCS emulation, it's probably best to stick with Boto for both local testing and GCE.
If you still want to use 'google.appengine.ext.cloudstorage' though, access tokens always expire, so you'll need to refresh them manually. Given your setup, honestly the easiest thing to do is just call 'gsutil -d ls' from Python and parse the output to get a new token from your local credentials. You could use the API Client Library to get a token in a more 'correct' fashion, but at that point things would be getting so roundabout you might as well just be using Boto.
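A rough sketch of that gsutil-parsing workaround might look like the following; treat it as the hack it is, and note that the function name is just for illustration:
import re
import subprocess

from cloudstorage import common

def refresh_access_token():
    # Run gsutil in debug mode and scrape the bearer token from its output.
    output = subprocess.check_output(['gsutil', '-d', 'ls'], stderr=subprocess.STDOUT)
    match = re.search(r'Bearer\s+(\S+)', output.decode('utf-8', 'ignore'))
    if not match:
        raise RuntimeError('Could not find a Bearer token in gsutil output')
    common.set_access_token(match.group(1))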
There is a Google Cloud Storage local / development server for this purpose: https://developers.google.com/datastore/docs/tools/devserver
Once you have set it up, create a dataset and start the GCS development server
gcd.sh create [options] <dataset-directory>
gcd.sh start [options] <dataset-directory>
Export the environment variables
export DATASTORE_HOST=http://yourmachine:8080
export DATASTORE_DATASET=<dataset_id>
Then you should be able to use the datastore connection in your code, locally.
