BigQuery to Bigtable data transfer using Google Dataflow in Python

We are using a Dataflow pipeline written in Java for transferring data from BigQuery to Bigtable.
e.g. https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/blob/master/java/dataflow-connector-examples/src/main/java/com/google/cloud/bigtable/dataflow/example/BigQueryBigtableTransfer.java
I am trying to write the same pipeline in Python, but I cannot find a Bigtable Dataflow connector for Python. Any clue how this can be done?

Just so the question isn't left unanswered, as Graham Polley commented: "The Python SDK doesn't have support for Bigtable yet." I see that an engineer from Bigtable is already involved, but if you want, you can also create a feature request in the Public Issue Tracker.
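Until native support arrives, one hedged workaround (not an official connector) is to read from BigQuery with the Python SDK and write the rows yourself in a ParDo using the google-cloud-bigtable client library; recent Beam releases also ship apache_beam.io.gcp.bigtableio.WriteToBigTable. A rough sketch, with project, instance, table, query, and column names as placeholders:

```python
# Hedged sketch: write BigQuery rows into Bigtable from a DoFn using the
# google-cloud-bigtable client. All IDs, the query, and the column names are
# placeholders; runner options (project, temp_location, ...) are omitted.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class WriteRowToBigtable(beam.DoFn):
    def __init__(self, project_id, instance_id, table_id):
        self.project_id = project_id
        self.instance_id = instance_id
        self.table_id = table_id

    def setup(self):
        # Create the client once per worker instead of once per element.
        from google.cloud import bigtable
        self.table = (bigtable.Client(project=self.project_id)
                      .instance(self.instance_id)
                      .table(self.table_id))

    def process(self, element):
        # element is a dict produced by ReadFromBigQuery; 'id' and 'payload'
        # are assumed field names.
        row = self.table.direct_row(str(element['id']).encode('utf-8'))
        row.set_cell('cf1', b'payload', str(element['payload']).encode('utf-8'))
        row.commit()


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (pipeline
     | 'ReadFromBQ' >> beam.io.ReadFromBigQuery(
           query='SELECT id, payload FROM `my-project.my_dataset.my_table`',
           use_standard_sql=True)
     | 'WriteToBT' >> beam.ParDo(
           WriteRowToBigtable('my-project', 'my-instance', 'my-table')))
```

Writes happen one row at a time here; table.mutate_rows could batch them, at the cost of handling partial failures yourself.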

Related

ETL to BigQuery using Airflow without permission to Cloud Storage / Cloud SQL

I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This approach is easy but not worth it because it takes a lot of time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract data from the data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
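As a rough illustration of steps 2-4 with the Python client libraries (bucket, dataset, and table names are placeholders, and the dump is assumed to be a CSV file):

```python
# Hedged sketch: upload the extracted dump to Cloud Storage and run a free
# BigQuery load job on it. All names and paths are placeholders.
from google.cloud import storage, bigquery

# Steps 2-3: load the extracted data into a file and drop it into Cloud Storage.
storage.Client().bucket('my-staging-bucket') \
       .blob('dumps/orders_2023-01-01.csv') \
       .upload_from_filename('/tmp/orders_2023-01-01.csv')

# Step 4: run a BigQuery load job on that file.
bq = bigquery.Client()
job = bq.load_table_from_uri(
    'gs://my-staging-bucket/dumps/orders_2023-01-01.csv',
    'my_project.my_dataset.orders',
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
job.result()  # waits for the load job to finish
```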
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex, way is possible: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
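A minimal sketch of those three steps, assuming mysql-connector-python and the BigQuery streaming API; the connection details, query, destination table, and chunk size are all placeholders:

```python
# Hedged sketch of the streaming variant: query MySQL, fetch in chunks, and
# stream-write each chunk into BigQuery.
import mysql.connector
from google.cloud import bigquery

bq = bigquery.Client()
conn = mysql.connector.connect(host='10.0.0.5', user='etl',
                               password='***', database='shop')
cursor = conn.cursor(dictionary=True)          # rows come back as dicts
cursor.execute('SELECT id, amount, updated_at FROM orders')

CHUNK_SIZE = 500                               # choose wisely (see caveats below)
while True:
    rows = cursor.fetchmany(CHUNK_SIZE)
    if not rows:
        break
    # Make values JSON-serializable (e.g. datetimes) before streaming them.
    rows = [{k: str(v) for k, v in r.items()} for r in rows]
    errors = bq.insert_rows_json('my_project.my_dataset.orders', rows)
    if errors:
        raise RuntimeError(f'Streaming insert failed: {errors}')

cursor.close()
conn.close()
```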
Described like this, it does not seem very complex, but:
You have to maintain the connection to both the source and the destination for the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance. The chunk size has to be chosen wisely.
Airflow bonus: you have to define and write your own custom operator to do this (a skeleton is sketched below).
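A hedged skeleton of what such a custom operator could look like (Airflow 2.x style; the parameters and body are placeholders, not a tested implementation):

```python
# Hedged skeleton of a custom Airflow operator that would stream a MySQL
# query result into BigQuery. Parameters and logic are placeholders.
from airflow.models import BaseOperator


class MySqlToBigQueryStreamOperator(BaseOperator):
    """Streams the result of a MySQL query into a BigQuery table."""

    def __init__(self, *, sql, destination_table, chunk_size=500, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.destination_table = destination_table
        self.chunk_size = chunk_size

    def execute(self, context):
        # Re-use the chunked read / stream-write loop sketched above here,
        # adding the retry and checkpoint logic from the caveats list.
        raise NotImplementedError('placeholder for the streaming logic')
```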
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query a Cloud SQL database directly (federated queries). If you still need your MySQL database (to keep some reference data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL reference data.
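As a hedged illustration of such a federated join (it assumes a Cloud SQL connection resource, here named my-connection in region us, has already been created in BigQuery; all table and column names are placeholders):

```python
# Hedged sketch: join a BigQuery table with a Cloud SQL table via EXTERNAL_QUERY.
from google.cloud import bigquery

sql = """
SELECT w.customer_id, w.total, ref.customer_name
FROM `my_project.my_dataset.warehouse_table` AS w
JOIN EXTERNAL_QUERY(
       'my_project.us.my-connection',
       'SELECT customer_id, customer_name FROM customers') AS ref
USING (customer_id)
"""
for row in bigquery.Client().query(sql).result():
    print(dict(row))
```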
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections configured for your Airflow DAG workflow.
Also, define which columns from MySQL you would like to pull and load into BigQuery, and choose how to load your data: incrementally or as a full refresh. Be sure to also formulate a technique for eliminating duplicate copies of data (de-duplication).
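One common de-duplication technique, sketched here with placeholder table and column names, is to load increments into a staging table and then rebuild the target keeping only the most recent row per key:

```python
# Hedged sketch: keep only the newest row per id across the target table and
# the freshly loaded staging table. Table and column names are placeholders.
from google.cloud import bigquery

dedup_sql = """
CREATE OR REPLACE TABLE `my_project.my_dataset.orders` AS
SELECT * EXCEPT (rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM (
    SELECT * FROM `my_project.my_dataset.orders`
    UNION ALL
    SELECT * FROM `my_project.my_dataset.orders_staging`
  )
)
WHERE rn = 1
"""
bigquery.Client().query(dedup_sql).result()
```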
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/)
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow

How to set up GCP infrastructure to perform a quick search over a massive set of JSON data?

I have about 100 million JSON files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant JSON files. They're all currently stored on Google Cloud Storage. Normally, for a smaller number of files, I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into BigQuery and run your query from there (a rough sketch follows below).
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE and install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there. (Presto doesn't have a direct GCS connector yet, AFAIK.) This option seems more elaborate.
Hope it helps!
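A hedged sketch of the BigQuery option above, assuming the objects can be read as newline-delimited JSON; it queries the GCS files through a temporary external table so the matching file names can be returned via the _FILE_NAME pseudo-column (bucket, field, and search string are placeholders):

```python
# Hedged sketch: substring search over GCS JSON files with a temporary
# BigQuery external table. Bucket, field, and needle are placeholders.
from google.cloud import bigquery

bq = bigquery.Client()

# Temporary external table over the GCS objects (no load step required).
external = bigquery.ExternalConfig('NEWLINE_DELIMITED_JSON')
external.source_uris = ['gs://my-json-bucket/*.json']
external.autodetect = True

sql = """
SELECT DISTINCT _FILE_NAME AS matching_file
FROM docs
WHERE STRPOS(text_field, 'needle') > 0
"""
job_config = bigquery.QueryJobConfig(table_definitions={'docs': external})
for row in bq.query(sql, job_config=job_config).result():
    print(row.matching_file)
```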

Error streaming from Pub/Sub into BigQuery in Python

I am having trouble creating a DataflowRunner job that connects a Pub/Sub source to a BigQuery sink by plugging these two:
apache_beam.io.gcp.pubsub.PubSubSource
apache_beam.io.gcp.bigquery.BigQuerySink
into lines 59 and 74 respectively of the beam/sdks/python/apache_beam/examples/streaming_wordcount.py (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/streaming_wordcount.py) example on GitHub. After removing lines 61-70 and specifying the correct Pub/Sub and BigQuery arguments, the script runs without errors but without actually building the pipeline.
Side note: the script mentions that streaming pipeline support isn't available for use in Python. However, the Beam docs mention that apache_beam.io.gcp.pubsub.PubSubSource is only available for streaming
(1st sentence underneath the "apache_beam.io.gcp.pubsub module" heading: https://beam.apache.org/documentation/sdks/pydoc/2.0.0/apache_beam.io.gcp.html#module-apache_beam.io.gcp.pubsub)
You can't stream on Python Dataflow - for now.
Monitor this changelog to find out when it lands:
https://cloud.google.com/dataflow/release-notes/release-notes-python
(soon!)
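Streaming support did land in later Python SDK releases; a rough sketch of how the pipeline from the question could then be written (topic, table, and schema are placeholders):

```python
# Hedged sketch: Pub/Sub -> BigQuery streaming pipeline in the Python SDK,
# once streaming became available. All names are placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # or pass --streaming on the command line

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadPubSub' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/my-topic')
     | 'Decode' >> beam.Map(lambda msg: {'line': msg.decode('utf-8')})
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'my-project:my_dataset.lines',
           schema='line:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```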

Process multiple objects in Google Cloud

I have about 100,000 files in a Google Cloud Storage bucket. The file sizes are about 2-10 MB. I need to apply a simple Python function (just a data transformation) to each of these files: read from one bucket, transform (with the Python function) in parallel, and store in another bucket. I am thinking of a simple Hadoop or Spark cluster to do this. I previously used concurrent threads on a single instance, but I need a more robust approach. What is the best way to accomplish this?
You can use the recently-announced Google Cloud Dataproc (in beta as of 5 Oct 2015), which provides a managed Hadoop or Spark cluster for you. It is integrated with Google Cloud Storage so you can read and write data from your bucket.
You can submit jobs via gcloud, the console, or via SSH to a machine in your cluster.
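A hedged PySpark sketch of such a Dataproc job, with bucket names and the transformation as placeholders (note that the output is written as Spark part-files rather than one object per input):

```python
# Hedged PySpark sketch for Dataproc: read every object from the source bucket,
# apply a per-file Python transformation, and write the results to a
# destination bucket. Bucket names and the transform are placeholders.
from pyspark import SparkContext


def transform(text):
    # placeholder for the real data transformation
    return text.upper()


sc = SparkContext()
(sc.wholeTextFiles('gs://source-bucket/*')          # (path, file contents) pairs
   .mapValues(transform)
   .map(lambda kv: kv[1])
   .saveAsTextFile('gs://destination-bucket/transformed/'))
```

It could be submitted to the cluster with gcloud dataproc jobs submit pyspark.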

How to use HBase as a source for Hadoop streaming jobs

Is there any way to use an HBase table as a source for a Hadoop streaming job? Specifically, I want to run a Hadoop streaming job written in Python. This works well when the input is specified as a folder on HDFS, but I haven't been able to find any documentation about reading data from an HBase table.
Is this supported? Or will I have to go through the ordeal of writing Java code to get the data from HBase to HDFS first and then run the streaming job?
I'm using HBase 0.94 from Cloudera.
(There is a similar question already present here, but it points to a third-party solution that is not actively maintained. I was hoping that this would be supported in HBase.)
I would use Pig to load the data and then feed it into a streaming Python application (a sketch of the Python side follows the links).
See here:
http://pig.apache.org/docs/r0.12.0/func.html#HBaseStorage
http://pig.apache.org/docs/r0.12.0/basic.html#stream
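A hedged sketch of the Python side of that setup: Pig loads the table with HBaseStorage and pipes each tuple through the script via STREAM, so the fields arrive tab-separated on stdin, one tuple per line, and results are written back to stdout (the two-field row-key/value layout below is an assumption):

```python
# Hedged sketch of a script Pig could STREAM HBase tuples through.
# Fields arrive tab-separated on stdin; results go back on stdout.
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    row_key = fields[0]
    value = fields[1] if len(fields) > 1 else ''
    # placeholder processing: emit the row key and the value's length
    sys.stdout.write('%s\t%d\n' % (row_key, len(value)))
```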
