ELT Pipeline - AWS RDS to BigQuery - python

I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery, but since the team is moving the RDS instance to a private subnet, it would not be accessible to third-party tools, which require a publicly reachable endpoint (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values, e.g. when a customer's status changes in a table? Would BigQuery automatically handle such changes? I would also want to set up regular daily data transfers. For this, I think a cron job could run the Python script, but would that be a costly approach given that there are quite a few large tables (extraction, conversion to a dataframe/CSV, then uploading to BQ)? As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main considerations are cost, setup time, and data loading duration.
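Roughly, what I have in mind for the extraction/load script is something like this (just a sketch with hypothetical connection details and table names, using pandas, SQLAlchemy and the google-cloud-bigquery client):

import pandas as pd
import sqlalchemy
from google.cloud import bigquery

# Connect to the RDS PostgreSQL instance (hypothetical host/credentials; the
# Lambda/VM running this would need network access to the private subnet).
engine = sqlalchemy.create_engine("postgresql+psycopg2://etl_user:***@rds-host:5432/appdb")

# Extract one table into a dataframe (fine for smaller tables; large ones
# would need chunking or an export to CSV instead).
df = pd.read_sql("SELECT * FROM customers", engine)

# Load into BigQuery, overwriting the destination table on each run.
client = bigquery.Client()
job = client.load_table_from_dataframe(
    df,
    "my_project.analytics.customers",  # hypothetical project.dataset.table
    job_config=bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
)
job.result()  # wait for the load job to finish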

Related

How can I optimize an alert system that processes 10k requests / job?

I'm building a Match Service that receives real-time data from a third-party provider over MQTT. We save this data in an RDS cluster.
In another service, our users can create a filter called a Strategy. A cron job runs every 5 minutes, and all the Strategy records in the database are sent to a Kafka topic to be processed by the Match Service.
My design is event-based: for each new Strategy record on the topic, the Match Service performs a query against the database to check whether any Match crosses that Strategy's threshold. If the threshold is passed, it sends a new message to the broker.
The API processes about 10k Strategies in each job, and it's taking a long time (about 250 s per job).
So my question is: is there a better way to design this system? I was thinking of adding a Redis layer to avoid database round trips.
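Roughly what I had in mind (a sketch only, with hypothetical key and field names, assuming redis-py):

import json
import redis

r = redis.Redis(host="cache.internal", port=6379)  # hypothetical Redis host

def on_mqtt_message(device_id: str, payload: dict) -> None:
    # Called for every real-time reading: keep the latest value in Redis
    # (in addition to persisting it to RDS), with a short expiry.
    r.set(f"latest:{device_id}", json.dumps(payload), ex=600)

def strategy_matches(strategy: dict) -> bool:
    # Check a Strategy against the cached reading instead of querying RDS.
    raw = r.get(f"latest:{strategy['device_id']}")
    if raw is None:
        return False
    reading = json.loads(raw)
    return reading["value"] >= strategy["threshold"]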
All suggestions welcome!
Think long and hard about your relational data store. If you really need it to be relational, then it may absolutely make sense, but if not, a relational database is often a horrible place to dump things like time-series and IoT output. It's a great place to put normalized, structured data for reporting, but a lousy place for dump/load workloads and real-time matching.
Look more at something like Amazon Redshift, Elasticsearch, or some other NoSQL-style solution that can ingest and match things at orders of magnitude higher scale.

ETL to BigQuery using Airflow without permission for Cloud Storage / Cloud SQL

I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by the last date. This way is easy, but it isn't worth it because it takes so much time. Is it possible to do ETL from MySQL/Mongo to BigQuery using Airflow, without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free; a sketch of steps 3-4 follows below)
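A minimal sketch of steps 3 and 4 (hypothetical bucket, file and table names, assuming the google-cloud-storage and google-cloud-bigquery clients):

from google.cloud import bigquery, storage

# 3) Drop the extracted file into Cloud Storage.
storage.Client().bucket("my-staging-bucket").blob("exports/orders.csv").upload_from_filename("orders.csv")

# 4) Run a (free) BigQuery load job on that file.
client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-staging-bucket/exports/orders.csv",
    "my_project.warehouse.orders",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    ),
)
job.result()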
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex approach is possible: streaming the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connections to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk streaming writes into BigQuery to optimize performance; the chunk size has to be chosen wisely.
Airflow bonus: you have to define and write your own custom operator to do this.
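A skeleton of such an operator (a rough sketch only, with a hypothetical class name and connection ID; it streams MySQL rows into BigQuery in chunks with insert_rows_json):

import json

from airflow.models import BaseOperator
from airflow.providers.mysql.hooks.mysql import MySqlHook
from google.cloud import bigquery

class MySqlToBigQueryStreamOperator(BaseOperator):
    # Hypothetical custom operator: query MySQL and stream the result into BigQuery.

    def __init__(self, sql, destination_table, mysql_conn_id="mysql_default", chunk_size=500, **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.destination_table = destination_table
        self.mysql_conn_id = mysql_conn_id
        self.chunk_size = chunk_size

    def execute(self, context):
        # Fetch the source rows and serialize them to JSON-friendly values.
        df = MySqlHook(mysql_conn_id=self.mysql_conn_id).get_pandas_df(self.sql)
        rows = json.loads(df.to_json(orient="records", date_format="iso"))
        client = bigquery.Client()
        # Stream in bulk chunks rather than row by row; the chunk size here is a guess.
        for i in range(0, len(rows), self.chunk_size):
            errors = client.insert_rows_json(self.destination_table, rows[i:i + self.chunk_size])
            if errors:
                raise RuntimeError(f"BigQuery streaming insert failed: {errors}")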
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query Cloud SQL databases directly (federated queries). If you still need your MySQL database (for keeping some reference data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL reference data.
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections set up for your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method of loading your data: would you want it loaded incrementally or fully? Be sure to formulate a technique for eliminating duplicate copies of data (de-duplication) as well.
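One way to combine incremental loads with de-duplication (a rough sketch with hypothetical project, dataset, table and column names) is to load only the new or changed rows into a staging table and then MERGE them into the target table:

from google.cloud import bigquery

client = bigquery.Client()

# Assume the incremental extract (rows changed since the last watermark) has
# already been loaded into a staging table via a load job.
merge_sql = """
MERGE `my_project.warehouse.customers` AS target
USING `my_project.staging.customers_delta` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.status = source.status, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, status, updated_at)
  VALUES (source.customer_id, source.status, source.updated_at)
"""
client.query(merge_sql).result()  # upsert: update changed rows, insert new ones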
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL data to Google BigQuery in minutes and keep it up to date without you having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won't be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow

AWS Glue - read from a SQL Server table and write to S3 as a custom CSV file

I have been working with Glue since January and have built multiple POCs and production data lakes using AWS Glue / Databricks / EMR, etc. I have used AWS Glue to read data from S3 and perform ETL before loading it into Redshift, Aurora, etc.
I now need to read data from a source table on SQL Server and write it to an S3 bucket as a custom (user-defined) CSV file, say employee.csv.
I am looking for some pointers on how to do this, please.
Thanks
You can connect using JDBC by specifying connection_type="sqlserver" to get a dynamic frame connected to SQL Server. See here for the GlueContext docs.
src = glueContext.getSource(connection_type="sqlserver", url=..., dbtable=..., user=..., password=...)
dynF = src.getFrame()
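From there, to end up with a single, user-named CSV in S3 (Glue's native writers produce part files rather than one named file), one option is to convert to pandas and upload with boto3. A rough sketch with hypothetical bucket/key names, only sensible if the table fits in the driver's memory:

import boto3

# Convert the DynamicFrame to a pandas DataFrame (only for tables that
# comfortably fit in the driver's memory).
df = dynF.toDF().toPandas()

# Write a single CSV locally, then upload it under the exact key we want.
local_path = "/tmp/employee.csv"
df.to_csv(local_path, index=False)

s3 = boto3.client("s3")
s3.upload_file(local_path, "my-target-bucket", "exports/employee.csv")  # hypothetical bucket/key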
This task fits the AWS DMS (Database Migration Service) use case. DMS is designed to either migrate data from one data store to another or keep them in sync. It can certainly keep your source (i.e., MSSQL) in sync with your target (i.e., S3), as well as transform the data along the way.
There is one non-negligible constraint in your case, though: ongoing sync with an MSSQL source only works if your license is the Enterprise or Developer edition, and only for versions 2016-2019.

Speed up my data load operation

Please pardon my ignorance if this question sounds silly to the expert audience here.
Currently, as per my use case:
I am performing some analysis on the data present in AWS Redshift tables and saving the results as CSV files in S3 buckets (the operation is somewhat similar to a pivot of the Redshift data), and after that I am loading the data back into the Redshift DB using the COPY command.
Currently, after the analysis (which is done in Python 3), about 200 CSV files are generated, which are saved into 200 different tables in Redshift. The number of CSVs will keep increasing over time.
Currently the whole process takes about 50-60 minutes to complete:
25 minutes to generate the roughly 200 CSVs and upload them to S3 buckets
25 minutes to load the roughly 200 CSVs into 200 AWS Redshift tables
The size of the CSVs varies from a few MB to 1 GB.
I was looking for tools or AWS technologies which can help me reduce this time.
Additional info:
The structure of the CSVs keeps changing, so I have to drop and recreate the tables each time.
This is a repetitive task and will be executed every 6 hours.
You can achieve a significant speed-up by:
Using multipart upload of the CSVs to S3: instead of waiting for a single file to upload sequentially, multipart upload sends its parts to S3 in parallel, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
Copying data into Redshift from S3 in parallel: if you split your file into multiple parts and then run the COPY command against their common prefix, the data will be loaded from the multiple files in parallel, instead of waiting for a single 1 GB file to load, which might be really slow. Read more about it here. A rough sketch of both ideas follows after this list.
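A minimal sketch of both ideas, assuming boto3 and psycopg2 and using hypothetical bucket, file, table and IAM role names:

import os

import boto3
from boto3.s3.transfer import TransferConfig
import psycopg2

# 1) Multipart, parallel upload to S3 (boto3 switches to multipart automatically
#    once the file size exceeds multipart_threshold).
config = TransferConfig(multipart_threshold=64 * 1024 * 1024, max_concurrency=8)
s3 = boto3.client("s3")
s3.upload_file("output/part_000.csv", "my-staging-bucket", "pivot/part_000.csv", Config=config)

# 2) Parallel COPY into Redshift: point COPY at the common prefix so all part
#    files under it are loaded in parallel across the cluster slices.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="loader", password=os.environ["REDSHIFT_PASSWORD"],
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_pivot_table
        FROM 's3://my-staging-bucket/pivot/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV;
    """)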
Hope this helps.
You should explore Athena. It's a tool within the AWS suite that gives you the flexibility to query CSV (or even gzipped) files.
It'll save you the time you spend manually copying the data into Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly in an S3 bucket.
However, it is still relatively young, so you'll have to spend some time with it, as it's not very user friendly. A syntax error in your query logs you out of your AWS session rather than throwing a syntax error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.
Athena charges you based on the data your query scans and is thus more pocket friendly. If a query fails to execute, Amazon doesn't charge you.
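As a rough illustration (hypothetical bucket, database and column names; it assumes the CSVs share a stable schema, which may not hold here since the structure keeps changing):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")
OUTPUT = "s3://my-staging-bucket/athena-results/"  # hypothetical results location

def run(sql):
    # Submit a query and block until Athena reports a terminal state.
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "analysis"},
        ResultConfiguration={"OutputLocation": OUTPUT},
    )["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(2)

# Define an external table over the CSVs already sitting in S3 ...
run("""
CREATE EXTERNAL TABLE IF NOT EXISTS analysis.pivot_results (
    customer_id string,
    status string,
    amount double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-staging-bucket/pivot/'
TBLPROPERTIES ('skip.header.line.count' = '1')
""")

# ... then query it in place, without copying anything into Redshift.
run("SELECT status, count(*) AS n FROM analysis.pivot_results GROUP BY status")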

Saving Data on GAE: logging vs. datastore

I have a Google App Engine app that has to deal with a lot of data collection. The data I gather amounts to around millions of records per day. As I see it, there are two simple approaches to dealing with this in order to be able to analyze the data:
1. Use the logging API to generate App Engine logs, and then try to load these into BigQuery (or, more simply, export to CSV and do the analysis with Excel).
2. Save the data in the App Engine datastore (ndb), and then download that data later / try to load it into BigQuery.
Is there any preferable method of doing this?
Thanks!
BigQuery has a new Streaming API, which they claim was designed for high-volume real-time data collection.
Advice from practice: we are currently logging 20M+ multi-event records a day via method 1 as described above. It works pretty well, except when the batch uploader is not called (it normally runs every 5 minutes); then we need to detect this and re-run the importer.
Also, we are currently in the process of migrating to the new Streaming API, but it is not yet in production, so I can't say how reliable it is.
