How to mock and test Databricks PySpark notebooks locally - python

How can I mock the existing Azure Databricks PySpark code of a project (written by others) and run it locally on a Windows machine/Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python & GCP; I just joined a Databricks project and need to run the cells one by one to see the results and modify them if required.
Thanks

You can test/run PySpark code from your IDE by installing PySpark on your local computer.
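For example, here is a minimal sketch of a local session you can paste notebook cells into (after pip install pyspark; the app name and sample data are placeholders, not from the original project):

# Minimal sketch: a local SparkSession for running notebook-style PySpark code.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")  # run Spark locally, using all available cores
    .appName("local-notebook-testing")
    .getOrCreate()
)

# Placeholder data standing in for whatever the notebook cells actually read.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.filter(df.id > 1).show()

On Windows you may additionally need winutils.exe and a HADOOP_HOME environment variable for some filesystem operations.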
Now, to use Databricks Utilities you would in fact need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Accessing a cloud storage account can be done locally from your computer or from your own Databricks instance. In both cases you will have to set up the endpoint of this storage account using its secrets.
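As an illustration, a hedged sketch of what that configuration can look like for an ADLS Gen2 account accessed with an account key (the account, container and key below are placeholders, and running this locally also requires the hadoop-azure libraries on the classpath):

# Sketch only: configure ADLS Gen2 access with an account key. <storage-account>,
# <container> and <account-key> are placeholders; on Databricks you would normally
# read the key from a secret scope (dbutils.secrets.get) instead of hard-coding it.
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<account-key>",
)

df = spark.read.csv(
    "abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/data.csv",
    header=True,
)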

Related

How to Kickstart Kubeflow Pipeline development in Python

I have been studying Kubeflow and trying to grasp how to write my first hello-world program in it and run it locally on my Mac. I have kfp and kubectl installed locally on my machine. For testing purposes I want to write a simple pipeline with two functions: get_data() and add_data(). The docs are so overwhelming that I am not clear how to program locally without k8s installed, connect to a remote GCP machine and debug locally before creating the zip and uploading it, or whether there is a way to execute code locally and see how it runs on Google Cloud.
Currently you need Kubernetes to run KFP pipelines.
The easiest way to deploy KFP is through the Google Cloud Marketplace.
Alternatively, you can locally install Docker Desktop, which includes Kubernetes, and install the standalone version of KFP on it.
After that you can try this tutorial: Data passing in python components
Actually, you can install a reduced version of Kubeflow with MiniKF. More info: https://www.kubeflow.org/docs/distributions/minikf/minikf-vagrant/
Check whether you are using Kubeflow Pipelines from the Google Cloud Marketplace or a custom Kubernetes cluster. If you are using the managed one, you can see your pipeline running through the Kubeflow Pipelines management console.
For details about how to create components based on functions, you can check https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#getting-started-with-python-function-based-components
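As a rough sketch of that function-based style, assuming the KFP v1 SDK and using made-up get_data()/add_data() bodies just to mirror the question:

# Sketch assuming the KFP v1 SDK; the component logic is illustrative only.
from kfp import compiler, dsl
from kfp.components import create_component_from_func


def get_data() -> int:
    """Pretend to fetch a number from somewhere."""
    return 40


def add_data(value: int, increment: int = 2) -> int:
    """Add an increment to the fetched value."""
    return value + increment


get_data_op = create_component_from_func(get_data)
add_data_op = create_component_from_func(add_data)


@dsl.pipeline(name="hello-world-pipeline")
def pipeline():
    data = get_data_op()
    add_data_op(value=data.output)


if __name__ == "__main__":
    # Compile to a file you can upload through the Kubeflow Pipelines UI.
    compiler.Compiler().compile(pipeline, "pipeline.yaml")

The compiled file can then be uploaded to the Kubeflow Pipelines UI of a Marketplace deployment or a standalone install.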

Is there a way to directly import or open local files from Microsoft Azure Notebooks without uploading them to the Azure cloud?

I am trying to find a way to run commands that either open applications/files from my local machine or import files from my local machine into Microsoft Azure Notebooks, without uploading them directly to the Azure cloud. Does anyone know if this is possible, or who/where would be a better place to ask?
No, you cannot. The Jupyter notebook runs on a micro-server, which is hosted on Azure. You can only use the data on the Azure cloud. Been there, wished that, but no.

Connect GCP with PySpark without using Dataproc

I'm trying to connect GCP (Google BigQuery) with Spark (using PySpark) without using Dataproc (self-hosted Spark in house); the official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark & Hadoop setup is on Docker. Thanks
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer to the documentation on how to create the JSON credentials file if needed.
The BigQuery connector is publicly available as a jar file, spark-bigquery-connector. Then you can either:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. Regarding this, there are some other possibilities that can impact your app; to learn more, please check Add jars to a Spark Job - spark-submit.
Once the jar is added to the classpath you can check the two BigQuery connector examples; one of them was already provided by David Rabinowitz above.
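Putting the two answers together, a hedged sketch for a self-hosted/Docker setup; the connector version, credentials path and table below are assumptions, so pick the artifact matching your Spark/Scala build:

# Sketch only: pull the spark-bigquery connector via spark.jars.packages and read a
# public table; the connector version, table and credentials path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("bigquery-without-dataproc")
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.32.2",
    )
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("credentialsFile", "/path/to/service-account.json")
    .option("table", "bigquery-public-data.samples.shakespeare")
    .load()
)
df.show(5)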

Creating Airflow DAGs on GCP Composer

I just learnt about GCP Composer and am trying to move the DAGs from my local Airflow instance to the cloud, and I have a couple of questions about the transition.
In my local instance I used the HiveOperator to read data from Hive, create tables, and write it back into Hive. If I had to do this in GCP, how would it be possible? Would I have to upload my data to a Google Cloud Storage bucket, and does the HiveOperator work in GCP?
I have a DAG which uses a sensor to check if another DAG is complete; is that possible on Composer?
Yes, Cloud Composer is just managed Apache Airflow so you can do that.
Make sure that you use the same version of Airflow that you used locally. Cloud Composer supports Airflow 1.9.0 and 1.10.0 currently.
Composer has a connection store. See the menu Admin --> Connections and check the connection types available.
Sensors are available.
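For the cross-DAG check, here is a minimal sketch using ExternalTaskSensor with Airflow 1.10-style imports; the DAG ids, task ids and schedule are placeholders:

# Sketch assuming Airflow 1.10-style imports; DAG/task ids and schedule are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.sensors.external_task_sensor import ExternalTaskSensor

default_args = {"owner": "airflow", "start_date": datetime(2019, 1, 1)}

with DAG("downstream_dag", default_args=default_args, schedule_interval="@daily") as dag:
    # Wait for a task in another DAG that runs on the same schedule interval.
    wait_for_upstream = ExternalTaskSensor(
        task_id="wait_for_upstream",
        external_dag_id="upstream_dag",
        external_task_id="final_task",
        timeout=60 * 60,
    )
    done = DummyOperator(task_id="continue_processing")

    wait_for_upstream >> done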

Zeppelin and BigQuery

I'm looking for a visualisation and analytical notebook engine for BigQuery and am interested in Apache/Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require the installation of a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with version 0.6.1, there is a native BigQuery Interpreter for Apache Zeppelin available.
It allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which was the only way before.
