Loading Data from Google BigQuery into Spark (on Databricks) - python

I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3 but not for Google.
What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this, or is it only valid for files hosted on Google Cloud Storage?

The BigQuery Connector is a client-side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage and takes advantage of file-creation ordering to start Hadoop processing early, which increases overall throughput.
This code should work wherever you happen to locate your Hadoop cluster.
That said, if you are running over large data you might find network bandwidth to be the bottleneck (how good is your network connection to Google?), and since you are reading data out of Google's network, GCS network egress costs will apply.
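For illustration, this is roughly how the connector is wired up from PySpark through the Hadoop input-format API. The class name and the mapred.bq.* keys below follow the connector's documentation, but treat this as a sketch and verify them against the connector version you actually install:

from pyspark import SparkContext

sc = SparkContext()

conf = {
    # Staging location used for the intermediate BigQuery export to GCS
    "mapred.bq.project.id": "<your-project-id>",
    "mapred.bq.gcs.bucket": "<staging-bucket>",
    "mapred.bq.temp.gcs.path": "gs://<staging-bucket>/tmp/bq_export",
    # Table to read
    "mapred.bq.input.project.id": "bigquery-public-data",
    "mapred.bq.input.dataset.id": "usa_names",
    "mapred.bq.input.table.id": "usa_1910_current",
}

# Each record arrives as a (record key, JSON string of the row) pair
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf,
)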

Databricks has now documented how to use Google BigQuery from Spark here.
Set the following Spark config in the cluster settings:
credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
In PySpark use:
df = spark.read.format("bigquery") \
    .option("table", table) \
    .option("project", "<project-id>") \
    .option("parentProject", "<parent-project-id>") \
    .load()
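From there the result is an ordinary Spark DataFrame, so a quick sanity check or a write to a (hypothetical) Databricks table looks like any other Spark workload:

df.show(5)
df.write.mode("overwrite").saveAsTable("my_database.bq_copy")  # hypothetical target table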

Related

Load data from MySQL to BigQuery using Dataflow

I want to load data from MySQL to BigQuery using Cloud Dataflow. Can anyone share an article or work experience about loading data from MySQL to BigQuery using Cloud Dataflow with Python?
Thank you
You can use apache_beam.io.jdbc to read from your MySQL database, and the BigQuery I/O to write to BigQuery.
Some Beam knowledge is expected, so I recommend looking at the Apache Beam Programming Guide first.
If you are looking for something pre-built, there is the JDBC to BigQuery Google-provided template, which is open source (here), but it is written in Java.
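If you want to stay in Python, a minimal pipeline sketch might look like the following. The JDBC URL, credentials, table names and schema are placeholders, the exact ReadFromJdbc parameters should be checked against your Beam version, and note that ReadFromJdbc is a cross-language transform, so it needs Java available to expand it:

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="<project-id>",
    region="<region>",
    temp_location="gs://<bucket>/tmp",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromMySQL" >> ReadFromJdbc(
            table_name="my_table",
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://<host>:3306/<database>",
            username="<user>",
            password="<password>",
        )
        # Rows come back as named tuples; convert them to dicts for BigQuery
        | "ToDict" >> beam.Map(lambda row: row._asdict())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "<project-id>:<dataset>.<table>",
            schema="id:INTEGER,name:STRING",  # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )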
If you only want to copy data from MySQL to BigQuery, you can first export your MySQL data to Cloud Storage and then load that file into a BigQuery table.
I think there is no need to use Dataflow in this case because you don't have complex transformations or business logic; it is only a copy.
Export the MySQL data to Cloud Storage via a SQL query and the gcloud CLI:
gcloud sql export csv INSTANCE_NAME gs://BUCKET_NAME/FILE_NAME \
--database=DATABASE_NAME \
--offload \
--query=SELECT_QUERY \
--quote="22" \
--escape="5C" \
--fields-terminated-by="2C" \
--lines-terminated-by="0A"
(The --quote, --escape, --fields-terminated-by and --lines-terminated-by values are hex ASCII codes: 0x22 is a double quote, 0x5C a backslash, 0x2C a comma and 0x0A a newline.)
Load the CSV file into a BigQuery table with the bq CLI:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
./myschema.json is the BigQuery table schema.
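If you prefer to trigger the load from Python rather than the bq CLI, a rough equivalent with the google-cloud-bigquery client library could look like this (bucket, table and schema below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    schema=[
        bigquery.SchemaField("id", "INTEGER"),   # placeholder schema
        bigquery.SchemaField("name", "STRING"),
    ],
)

# Start the load job from the exported CSV in Cloud Storage and wait for it
load_job = client.load_table_from_uri(
    "gs://mybucket/mydata.csv",
    "mydataset.mytable",
    job_config=job_config,
)
load_job.result()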

BigQuery Storage API: Is it possible to stream / save AVRO files directly to Google Cloud Storage?

I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, BigQuery Storage API (beta) should be the way to go due to export size quotas (e.g., ExtractBytesPerDay) associated with other methods.
The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Notebook running on GCP, which runs partitions (in parallel) through this script adapted from the docs.
from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
)  # I am using my private table instead of this one.

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

parent = "projects/{}".format(project_id)  # project_id is my GCP project ID
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)

# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)
Is it possible to save data from the stream directly to Google Cloud Storage?
I tried saving tables as AVRO files to my AI instance using fastavro and later uploading them to GCS using Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.
I cannot decode the whole stream to memory and use Blob.upload_from_string because I don't have over ~300 GB of RAM.
I spent the last two days parsing GCP documentation, but couldn't find anything, so I would appreciate your help, preferably with a code snippet, if at all possible. (If working with another file format is easier, I am all for it.)
Thank you!
Is it possible to save data from the stream directly to Google Cloud Storage?
By itself, the BigQuery Storage API is not capable of writing directly to GCS; you'll need to pair the API with code to parse the data, write it to local storage, and subsequently upload to GCS. This could be code that you write manually, or code from a framework of some kind.
It looks like the code snippet that you've shared processes each partition in a single-threaded fashion, which caps your throughput at the throughput of a single read stream. The storage API is designed to achieve high throughput through parallelism, so it's meant to be used with a parallel processing framework such as Google Cloud Dataflow or Apache Spark. If you'd like to use Dataflow, there's a Google-provided template you can start from; for Spark, you can use the code snippets that David has already shared.
An easy way to do that would be to use Spark with the spark-bigquery-connector. It uses the BigQuery Storage API to read the table directly into a Spark DataFrame. You can create a Spark cluster on Dataproc, which is located in the same data centers as BigQuery and GCS, making the read and write speeds much faster.
A code example will look like this:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.write.format("avro").save("gs://bucket/path")
You can also filter the data and work on each partition separately:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .option("filter", "the_date='2020-05-12'") \
    .load()

# OR, in case you don't need to give the partition at load
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.where("the_date='2020-05-12'").write....
Please note that in order to read large amounts of data you would need a sufficiently large cluster.

How to download, transform and upload multiple files in parallel using Google Kubernetes Engine?

I have a large collection of data stored in google storage bucket with the following structure:
gs://project_garden/plant_logs/2019/01/01/humidity/plant001/hour.gz. What I want is to make a Kubernetes Job which downloads all of it, parses it, and uploads the parsed files to BigQuery in parallel. So far I've managed to do this locally, without any parallelism, by writing Python code that takes a date interval as input and loops over each of the plants, executing gsutil -m cp -r for download, gunzip for extraction and pandas for transformation. I want to do the same thing, but in parallel for each plant, using Kubernetes. Is it possible to parallelise the process by defining a job that passes a different plant ID to each pod, which then downloads the files for that plant?
A direct upload from Kubernetes to BigQuery is not possible; you can only load data into BigQuery [1] with the following methods:
From Cloud Storage
From other Google services, such as Google Ad Manager and Google Ads
From a readable data source (such as your local machine)
By inserting individual records using streaming inserts
Using DML statements to perform bulk inserts
Using a BigQuery I/O transform in a Cloud Dataflow pipeline to write data to BigQuery
As mentioned in the previous comment, the easiest solution would be to upload the data using Dataflow; you can find a template to upload text from Google Cloud Storage (GCS) to BigQuery in link [2].
If you have to use Google Kubernetes Engine (GKE), you will need to perform the following steps (a rough sketch follows the links below):
Read the data from GCS with GKE. You can find an example of how to mount a bucket in your containers in link [3]
Parse the data with your code, as mentioned in your question
Upload the data from GCS to BigQuery, more info in link [4]
[1] https://cloud.google.com/bigquery/docs/loading-data
[2] https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#gcstexttobigquerystream
[3] https://github.com/maciekrb/gcs-fuse-sample
[4] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage
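If you do go the GKE route, a rough sketch of the per-pod work (steps 2 and 3), assuming each pod receives a PLANT_ID environment variable and using hypothetical bucket, table and file-format choices, might look like this:

import os
import pandas as pd
from google.cloud import bigquery, storage

plant_id = os.environ["PLANT_ID"]              # e.g. "plant001", injected per pod
bucket_name = "project_garden"                 # hypothetical bucket name
table_id = "my_project.plant_logs.humidity"    # hypothetical destination table

storage_client = storage.Client()
bq_client = bigquery.Client()

# Download every log for this plant, transform it, and load it into BigQuery.
for blob in storage_client.list_blobs(bucket_name, prefix="plant_logs/"):
    if f"/{plant_id}/" not in blob.name:
        continue
    local_path = "/tmp/" + blob.name.replace("/", "_")
    blob.download_to_filename(local_path)

    # Assuming the gzipped files are CSV-like; adjust parsing to your format
    df = pd.read_csv(local_path, compression="gzip")
    # ... apply your pandas transformation here ...

    job = bq_client.load_table_from_dataframe(df, table_id)
    job.result()  # wait for each load job to finish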

Process multiple objects in Google cloud

I have a large number of files (about 100,000) in a Google Storage bucket. The file sizes are about 2-10 MB. I need to apply a simple Python function (just a data transformation) to each of these files: read from one bucket, transform in parallel, and store the results in another bucket. I am thinking of a simple Hadoop or Spark cluster to do this. I previously used concurrent threads on a single instance, but I need a more robust approach. What is the best way to accomplish this?
You can use the recently-announced Google Cloud Dataproc (in beta as of 5 Oct 2015), which provides a managed Hadoop or Spark cluster for you. It is integrated with Google Cloud Storage so you can read and write data from your bucket.
You can submit jobs via gcloud, the console, or via SSH to a machine in your cluster.
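As an illustration, a PySpark job on Dataproc could distribute the per-file transformation across the cluster. The bucket names and the transform function below are placeholders, and the google-cloud-storage package is assumed to be installed on the workers:

from pyspark import SparkContext
from google.cloud import storage

def transform(content):
    # Placeholder for your data transformation
    return content.upper()

def process(path_and_content):
    path, content = path_and_content
    # For simplicity a client is created per record; batching with
    # mapPartitions would be more efficient for ~100,000 files.
    client = storage.Client()
    name = path.rsplit("/", 1)[-1]
    client.bucket("destination-bucket").blob(name).upload_from_string(transform(content))

sc = SparkContext()
# Each element is a (gs:// path, file content) pair
sc.wholeTextFiles("gs://source-bucket/input/*").foreach(process)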

How to import a dataset from S3 into Cassandra?

I launched a Spark/Cassandra cluster with DataStax DSE in the AWS cloud, so my dataset is stored in S3. But I don't know how to transfer the data from S3 to my Cassandra cluster. Please help.
The details depend on your file format and C* data model but it might look something like this:
Read the file from s3 into an RDD
val rdd = sc.textFile("s3n://mybucket/path/filename.txt.gz")
Manipulate the RDD as needed
Write the RDD to a Cassandra table:
rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))
What @phact described uses the Spark API that comes with DataStax Enterprise and can be very useful if there's ETL work that needs to be done along with the loading.
For loading only, you can use the sstableloader bulk loading capability. Here's a tutorial to get you started.
