I'm looking for a visualisation and analytical notebook engine for BigQuery and am interested in Apache Zeppelin.
We have internal capability in Python and R and want to use this with our BigQuery back end.
All the installation scenarios I've seen so far (e.g. https://cloud.google.com/blog/big-data/2016/09/analyzing-bigquery-datasets-using-bigquery-interpreter-for-apache-zeppelin) seem to require installing a fairly hefty Scala/Spark cluster, which I don't see the need for (and which would cost a lot).
Is it possible to install Zeppelin without the cluster in Google Cloud?
Starting with version 0.6.1, a native BigQuery interpreter is available for Apache Zeppelin.
It allows you to process and analyze datasets stored in Google BigQuery by directly running SQL against it from within an Apache Zeppelin notebook.
So you no longer need to query BigQuery through Apache Spark, which used to be the only way.
How can I mock existing Azure Databricks PySpark code from a project (written by others) and run it locally on a Windows machine with Anaconda to test and practice?
Is it possible to mock the code, or do I need to create a new cluster on Databricks for my own testing purposes?
How can I connect to the storage account, use the Databricks Utilities, etc.? I only have experience with Python and GCP, have just joined a Databricks project, and need to run the cells one by one to see the results and modify them if required.
Thanks
You can test/run PySpark code from your IDE by installing PySpark on your local computer.
To use Databricks Utilities, however, you need a Databricks instance; they are not available locally. You can try Databricks Community Edition for free, but with some limitations.
Access to a cloud storage account can be set up either locally from your computer or from your own Databricks instance. In both cases you will have to configure the endpoint of the storage account using its secrets.
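For local runs, one possible workaround is a small stand-in object for the parts of dbutils that the notebook code touches. The sketch below is only an assumption about how that might look: LocalSecrets, make_local_dbutils and storage_conf are hypothetical names, the environment-variable naming scheme is invented for illustration, and only dbutils.secrets.get is emulated. The configuration key follows the account-key pattern of the Hadoop ABFS connector; it would need adjusting if the project authenticates differently.

```python
import os
from types import SimpleNamespace


class LocalSecrets:
    """Hypothetical stand-in for dbutils.secrets that reads environment variables."""

    def get(self, scope, key):
        # Map (scope, key) to an environment variable, e.g. MYSCOPE_STORAGE_KEY.
        env_name = f"{scope}_{key}".upper().replace("-", "_")
        return os.environ[env_name]


def make_local_dbutils():
    # Only the secrets part is emulated; dbutils.fs, widgets, etc. are not.
    return SimpleNamespace(secrets=LocalSecrets())


def storage_conf(account, account_key):
    # Account-key setting in the form used by the Hadoop ABFS connector.
    return {f"fs.azure.account.key.{account}.dfs.core.windows.net": account_key}
```

Each entry returned by storage_conf could then be passed to spark.conf.set(name, value) on a local SparkSession, so notebook cells that call dbutils.secrets.get keep working unchanged.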
I'm trying to connect to GCP (Google BigQuery) from Spark (using PySpark) without using Dataproc (Spark is self-hosted in-house). The official Google documentation only covers Dataproc: https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example. Any suggestions? Note: my Spark and Hadoop setup runs on Docker. Thanks
Please have a look at the project page on GitHub - it details how to reference the GCP credentials from the code.
In short, you should run
spark.read.format("bigquery").option("credentialsFile", "</path/to/key/file>").option("table", "<table>").load()
Please refer here on how to create the json credentials file if needed.
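As a slightly fuller sketch of the call above, the read can be wrapped in a helper that takes an existing SparkSession. The function name read_bigquery_table is hypothetical; the format name and the two options are the ones quoted in this answer, and the connector jar is assumed to be on the classpath already.

```python
def read_bigquery_table(spark, table, credentials_file):
    """Load a BigQuery table as a DataFrame through the spark-bigquery connector."""
    return (
        spark.read.format("bigquery")
        .option("credentialsFile", credentials_file)  # path to the JSON key file
        .option("table", table)  # table to read
        .load()
    )


# Typical use, assuming PySpark is installed and the connector jar is available:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("bq-read-sketch").getOrCreate()
#   df = read_bigquery_table(spark, "<table>", "</path/to/key/file>")
#   df.printSchema()
```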
The BigQuery connector is publicly available as a jar file named spark-bigquery-connector. You can then either:
Add it to the classpath on your on-premise/self-hosted cluster, so your applications can reach the BigQuery API.
Add the connector only to your Spark applications, for example with the --jars option. There are other ways to do this that can affect your application; for details, see "Add jars to a Spark Job - spark-submit".
Once the jar is on the classpath, you can check the two BigQuery connector examples; one of them was already provided by @David Rabinowitz.
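The second option (shipping the jar with a single application) could be scripted roughly as follows. This is a sketch only: spark_submit_with_jar is a hypothetical helper, and the jar and script paths are placeholders rather than real locations.

```python
import shlex


def spark_submit_with_jar(app_path, connector_jar):
    # --jars ships the listed jar with this application only,
    # without changing the cluster-wide classpath.
    return ["spark-submit", "--jars", connector_jar, app_path]


# Placeholder paths; substitute the real jar and script locations.
cmd = spark_submit_with_jar("my_job.py", "/opt/jars/spark-bigquery-connector.jar")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) on a machine where spark-submit is on the PATH.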
Could someone tell me if the current API version of BigQuery to use with Google Dataflow in python is 0.25?
When I install apache-beam using:
pip install apache-beam[gcp]
The version I get is: 0.25
But the latest version of the BigQuery API client for Python is 1.7.x.
Why is this?
Apache Beam uses google-cloud-bigquery only for testing purposes. Its connections to BigQuery (or any other GCP tool) use a proprietary client based on apitools, since the code must prioritize performance as much as possible.
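To see which versions pip actually resolved in a given environment, the installed distributions can be inspected with the standard library. This is a minimal sketch: installed_version is a hypothetical helper, and the package names listed are the PyPI distribution names as I understand them.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("apache-beam", "google-cloud-bigquery", "google-apitools"):
    print(name, installed_version(name))
```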
I'm new to Spark. I installed Spark 2.3.0 in standalone mode on an Ubuntu 16.04.3 server. That runs well so far. Now I would like to start developing with PySpark because I have more experience with Python than with Scala.
Even after using Google for a while, I'm not sure how I should set up my development environment. My local machine is a Windows 10 laptop with Eclipse Neon and PyDev configured. What are the necessary steps to set it up so that I can develop in a local context and submit my modules to the Spark cluster on my server?
Thanks for helping.
Use spark-submit to run locally or on a cluster. There are many online tutorials for this. I like the AWS documentation, which explains the architecture, has sample Spark code, and gives examples of local and remote commands. Even if you are not using AWS EMR, the basics are the same.
Give it a try and let us know how it goes.
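The difference between a local run and a cluster run comes down to the --master argument. A minimal sketch, assuming a standalone master reachable under the hypothetical host name my-server on the default standalone port 7077, and a script called my_module.py:

```python
def spark_submit_command(app_path, master):
    # The same script is submitted either way; only the master URL changes.
    return ["spark-submit", "--master", master, app_path]


# Run on the development machine, using all local cores.
local_cmd = spark_submit_command("my_module.py", "local[*]")

# Run on the standalone cluster (host name is a placeholder).
cluster_cmd = spark_submit_command("my_module.py", "spark://my-server:7077")
```

Either list can be passed to subprocess.run(cmd, check=True), or typed on the command line as is.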
How do I get a small Python script to hook into an existing instance of Spark and do operations on existing RDDs?
I'm in the early stages of working with Spark on Windows 10, trying scripts on a "Local" instance. I'm working with the latest stable build of Spark (Spark 2.0.1 for Hadoop 2.7). I've installed and set environment variables for Hadoop 2.7.3. I'm experimenting with both the Pyspark shell and Visual Studio 2015 Community with Python.
I'm trying to build a large engine, on which I'll run individual scripts to load, massage, format, and access the data. I'm sure there's a normal way to do that; isn't that the point of Spark?
Anyway, here's the experience I have so far. This is generally to be expected. When I build a small Spark script in Python and run it using Visual Studio, the script runs, does its job, and exits. In the process of exiting, it also exits the Spark Context it was using.
So I had the following thought: What if I started a persistent Spark Context in Pyspark and then set my SparkConf and SparkContext in each Python script to connect to that Spark Context? So, looking up online what the defaults are for Pyspark, I tried the following:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("PySparkShell")
sc = SparkContext(conf = conf)
I started PySpark. In a separate script in Visual Studio, I used this code for the SparkContext. I loaded a text file into an RDD named RDDFromFilename, but I couldn't access that RDD in the PySpark shell once the script had run.
How do I start a persistent Spark Context, create an RDD in it in one Python script, and access that RDD from subsequent Python scripts? Particularly in Windows?
There is no solution in Spark. You may consider:
To keep persistent RDDs:
Apache Ignite
To keep persistent shared context:
spark-jobserver
livy - https://github.com/cloudera/livy
mist - https://github.com/Hydrospheredata/mist
To share context with notebooks:
Apache Zeppelin
I think that out of these only Zeppelin officially supports Windows.
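To illustrate what a persistent shared context looks like in practice, here is a rough sketch of talking to Livy over its REST interface from plain Python. It assumes a Livy server is already running at the address in LIVY_URL (8998 is Livy's default port); the helper names are hypothetical, and the code only builds the requests without sending them.

```python
import json
from urllib import request

LIVY_URL = "http://localhost:8998"  # assumed address of a running Livy server


def _post(url, payload):
    # Build a JSON POST request; the caller decides when to send it.
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(base_url=LIVY_URL):
    # A "pyspark" session keeps one SparkContext alive on the server.
    return _post(f"{base_url}/sessions", {"kind": "pyspark"})


def run_statement_request(session_id, code, base_url=LIVY_URL):
    # Statements sent to the same session run against the same context.
    return _post(f"{base_url}/sessions/{session_id}/statements", {"code": code})
```

A request is sent with request.urlopen(create_session_request()). Because statements in one session share state, an RDD defined by one script's statement should still be there when a later script sends another statement to the same session id.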
For those who may follow: I've recently discovered SnappyData.
SnappyData is still fairly young and there's a bit of a learning curve, but what it promises to do is make a persistent mutable SQL collection that can be shared between multiple Spark jobs and can be accessed natively as RDDs and DataFrames. It has a job server that you can dump concurrent jobs onto.
It's essentially a combination of a GemFire in-memory database with Spark clusters that are local in the same JVM, so (when I get decent at managing it) I can do large tasks without single-machine bottlenecks to pipe data in and out of Spark, or I can even do live data manipulation while another Spark program is running on the same data.
I know this is my own answer, but I'm probably not going to mark it as the answer until I get sophisticated enough to have opinions on how well it solves my problems.