I want to load data from MySQL to BigQuery using Cloud Dataflow. Can anyone share an article or work experience about loading data from MySQL to BigQuery using Cloud Dataflow with Python?
Thank you
You can use apache_beam.io.jdbc to read from your MySQL database, and the BigQuery I/O to write to BigQuery.
Beam knowledge is expected, so I recommend looking at the Apache Beam Programming Guide first.
If you are looking for something pre-built, we have the JDBC to BigQuery Google-provided template, which is open-source (here), but it is written in Java.
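For reference, here is a minimal sketch of what such a pipeline could look like in Python. The table names, JDBC URL, credentials, and schema are placeholders, and the JDBC transform is a cross-language transform, so Java and the MySQL JDBC driver must be available to the runner:

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add --runner=DataflowRunner, --project, etc.

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromMySQL" >> ReadFromJdbc(
            table_name="my_table",                       # placeholder
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://HOST:3306/DATABASE",  # placeholder
            username="USERNAME",
            password="PASSWORD",
        )
        # Convert the schema-aware rows (NamedTuple-like) to dicts for BigQuery.
        | "ToDict" >> beam.Map(lambda row: row._asdict())
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my_project:my_dataset.my_table",            # placeholder
            schema="id:INTEGER,name:STRING",             # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )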
If you only want to copy data from MySQL to BigQuery, you can first export your MySQL data to Cloud Storage, then load that file into a BigQuery table.
There is no need for Dataflow in this case, because you don't have complex transformations or business logic; it is only a copy.
Export the MySQL data to Cloud Storage with a SQL query and the gcloud CLI:
gcloud sql export csv INSTANCE_NAME gs://BUCKET_NAME/FILE_NAME \
--database=DATABASE_NAME \
--offload \
--query=SELECT_QUERY \
--quote="22" \
--escape="5C" \
--fields-terminated-by="2C" \
--lines-terminated-by="0A"
Load the CSV file into a BigQuery table with the bq CLI:
bq load \
--source_format=CSV \
mydataset.mytable \
gs://mybucket/mydata.csv \
./myschema.json
./myschema.json is the BigQuery table schema.
What is the efficient way to insert streaming records into MySQL from Google Dataflow using Python? Is there an I/O connector, as there is for BigQuery? I see that BigQuery has beam.io.WriteToBigQuery. How can we use a similar I/O connector for Cloud SQL for MySQL?
You can use JdbcIO to read and write data from/to any JDBC-compliant database.
You can find the details here: testWrite
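In Python, the corresponding cross-language transform is apache_beam.io.jdbc.WriteToJdbc. A rough sketch, where the row type, table name, JDBC URL, and credentials are placeholders:

import typing
import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc

# Hypothetical row type; field names and types must match the target table.
class OrderRow(typing.NamedTuple):
    id: int
    amount: float

coders.registry.register_coder(OrderRow, coders.RowCoder)

with beam.Pipeline() as p:
    (
        p
        | "Create" >> beam.Create([OrderRow(id=1, amount=9.99)])
        | "WriteToMySQL" >> WriteToJdbc(
            table_name="orders",                          # placeholder table
            driver_class_name="com.mysql.cj.jdbc.Driver",
            jdbc_url="jdbc:mysql://HOST:3306/DATABASE",   # placeholder URL
            username="USERNAME",
            password="PASSWORD",
        )
    )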
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy but not worth it because it takes too much time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to (a Python sketch follows the list):
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
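A minimal sketch of steps 3 and 4 with the google-cloud-storage and google-cloud-bigquery clients; the bucket, blob, and table names are hypothetical:

from google.cloud import bigquery, storage

BUCKET = "my-bucket"                       # placeholder
BLOB = "exports/mydata.csv"                # placeholder
TABLE = "my_project.mydataset.mytable"     # placeholder

# 3. Drop the exported file into Cloud Storage.
storage.Client().bucket(BUCKET).blob(BLOB).upload_from_filename("mydata.csv")

# 4. Run a BigQuery load job on that file (load jobs are free).
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # or provide an explicit schema
)
load_job = client.load_table_from_uri(
    f"gs://{BUCKET}/{BLOB}", TABLE, job_config=job_config
)
load_job.result()  # wait for completion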
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex, way is possible: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream the result into BigQuery (streaming inserts are not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart at the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance. The chunk size has to be chosen wisely (see the sketch after this list).
Airflow bonus: you have to define and write your own custom operator to do this.
By the way, I strongly recommend following the first solution.
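If you do go the streaming route, a minimal sketch of the bulk streaming step with the google-cloud-bigquery client could look like the following; the table name and chunk size are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "my_project.my_dataset.my_table"  # placeholder
CHUNK_SIZE = 500                          # chunk size has to be chosen wisely

def stream_rows(rows):
    """Stream rows (a list of dicts) into BigQuery in chunks."""
    for i in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[i:i + CHUNK_SIZE]
        # Streaming inserts are not free on BigQuery.
        errors = client.insert_rows_json(TABLE, chunk)
        if errors:
            # Handle/log errors and be able to restart from this chunk.
            raise RuntimeError(f"Streaming insert failed at chunk {i}: {errors}")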
Additional tip: BigQuery can now query a Cloud SQL database directly (federated queries). If you still need your MySQL database (to keep some referential data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL referential data.
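A sketch of such a federated join run from Python; the connection ID (which must be created in BigQuery first, format "<project>.<location>.<connection_id>") and all table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT dw.*, ref.label
FROM `my_project.my_dataset.fact_table` AS dw
JOIN EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT id, label FROM referential_table'
) AS ref
USING (id)
"""

for row in client.query(sql).result():
    print(dict(row))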
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections in your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method of loading your data: incrementally or fully? Be sure to also formulate a technique for eliminating duplicate copies of data (de-duplication).
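If Cloud Storage is available to you, one common shape for such a DAG uses the Google provider's transfer operators. A rough sketch, with hypothetical connection IDs, bucket, dataset, and incremental filter:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG("mysql_to_bigquery", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:

    extract = MySQLToGCSOperator(
        task_id="mysql_to_gcs",
        mysql_conn_id="mysql_default",
        sql="SELECT id, name, updated_at FROM customers "
            "WHERE updated_at >= '{{ ds }}'",            # incremental load by date
        bucket="my-bucket",
        filename="exports/customers_{{ ds }}.json",
        export_format="json",
    )

    load = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-bucket",
        source_objects=["exports/customers_{{ ds }}.json"],
        destination_project_dataset_table="my_project.my_dataset.customers",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",  # de-duplicate downstream, e.g. with a MERGE
        autodetect=True,
    )

    extract >> load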
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at Stitch (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL data to Google BigQuery in minutes and keep it up to date without you having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won't be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
What is the best possible way to copy a table (with millions of rows) from one type of database to another using pandas or Python?
I have a table in a PostgreSQL database consisting of millions of rows, and I want to move it to Amazon Redshift. What is the best possible way to achieve that using pandas or Python?
The Amazon Database Migration Service (DMS) can handle:
Using a PostgreSQL Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
Alternatively, if you wish to do it yourself:
Export the data from PostgreSQL into CSV files (they can be gzip compressed)
Upload the files to Amazon S3
Create the destination tables in Amazon Redshift
Use the COPY command in Amazon Redshift to load the CSV files into Redshift (sketched below)
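A minimal Python sketch of the final COPY step using psycopg2; the cluster endpoint, credentials, table, S3 path, and IAM role are all placeholders:

import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="USERNAME", password="PASSWORD",
)
with conn, conn.cursor() as cur:
    # COPY loads the gzip-compressed CSV files from S3 in parallel.
    cur.execute("""
        COPY myschema.mytable
        FROM 's3://my-bucket/exports/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        CSV GZIP IGNOREHEADER 1;
    """)
conn.close()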
If you're using AWS services, it might be good to use AWS Glue. It uses Python scripts for its ETL operations and works well for Dynamo --> Redshift, for example.
If you're not using only AWS services, try exporting your data as CSV (I did this for millions of rows) and create a migration tool using C# or whatever to read the CSV file and insert your rows after converting them. Check whether the database technology you're using can ingest the CSV directly, so you can avoid doing the migration yourself.
I have been working with Glue since January, and have worked on multiple POCs and production data lakes using AWS Glue / Databricks / EMR, etc. I have used AWS Glue to read data from S3 and perform ETL before loading to Redshift, Aurora, etc.
I now need to read data from a source table on SQL Server, fetch the data, and write it to an S3 bucket as a custom (user-defined) CSV file, say employee.csv.
I am looking for some pointers on how to do this, please.
Thanks
You can connect using JDBC by specifying connection_type="sqlserver" to get a dynamic frame connected to SQL Server. See here for the GlueContext docs.
dynF = glueContext.getSource(connection_type="sqlserver", url=..., dbtable=..., user=..., password=...)
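A rough end-to-end sketch using the more commonly documented create_dynamic_frame.from_options / write_dynamic_frame.from_options API; the JDBC URL, table, and bucket are placeholders:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the source table from SQL Server over JDBC.
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="sqlserver",
    connection_options={
        "url": "jdbc:sqlserver://HOST:1433;databaseName=DATABASE",  # placeholder
        "dbtable": "dbo.employee",                                  # placeholder
        "user": "USERNAME",
        "password": "PASSWORD",
    },
)

# Write as CSV to S3. Glue produces part files, so repartition(1) if you want a
# single output file (rename it to employee.csv afterwards if needed).
glueContext.write_dynamic_frame.from_options(
    frame=dyf.repartition(1),
    connection_type="s3",
    connection_options={"path": "s3://MY_BUCKET/exports/employee/"},  # placeholder
    format="csv",
    format_options={"writeHeader": True},
)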
This task fits the AWS DMS (Database Migration Service) use case. DMS is designed to either migrate data from one data store to another or keep them in sync. It can certainly keep them in sync as well as transform your source (i.e., MSSQL) into your target (i.e., S3).
There is one non-negligible constraint in your case, though: ongoing sync with an MSSQL source only works if your license is the Enterprise or Developer Edition, and for versions 2016-2019.
I want to load data into Spark (on Databricks) from Google BigQuery. I notice that Databricks offers a lot of support for Amazon S3 but not for Google.
What is the best way to load data into Spark (on Databricks) from Google BigQuery? Would the BigQuery connector allow me to do this, or is it only valid for files hosted on Google Cloud Storage?
The BigQuery Connector is a client-side library that uses the public BigQuery API: it runs BigQuery export jobs to Google Cloud Storage and takes advantage of file creation ordering to start Hadoop processing early and increase overall throughput.
This code should work wherever you happen to locate your Hadoop cluster.
That said, if you are running over large data, you might find network bandwidth to be a problem (how good is your network connection to Google?), and since you are reading data out of Google's network, GCS network egress costs will apply.
Databricks has now documented how to use Google BigQuery via Spark here.
Set the Spark config in the cluster settings:
credentials <base64-keys>
spark.hadoop.google.cloud.auth.service.account.enable true
spark.hadoop.fs.gs.auth.service.account.email <client_email>
spark.hadoop.fs.gs.project.id <project_id>
spark.hadoop.fs.gs.auth.service.account.private.key <private_key>
spark.hadoop.fs.gs.auth.service.account.private.key.id <private_key_id>
In pyspark use:
df = spark.read.format("bigquery") \
.option("table", table) \
.option("project", <project-id>) \
.option("parentProject", <parent-project-id>) \
.load()