Copy tables from multiple RDS Postgres instances to one RDS Postgres instance - Python

I have 8 RDS Postgres instances, each supporting the databases for one of our 8 solutions.
I would like to bring all the data from the 8 RDS Postgres instances into a single RDS Postgres instance.
This is needed for analytics.
Here is what I have tried so far:
I tried downloading each table to a CSV file and uploading it to the target RDS instance (using Python here too). The COPY command fails for a table that is 50 GB in size, so I cannot migrate all the data in one shot and then set up incremental loads. The full table load fails for large tables.
I thought of using DMS, but for an RDS Postgres to RDS Postgres migration the recommendation is to use pg_dump and pg_restore instead of DMS. For pg_dump and pg_restore I would need to store the password, and my organization does not let me do that.
I looked at Debezium, but setting it up is a long process and, honestly, I need this up and working very soon.
The ideal situation is one where all 8 databases in the 8 RDS instances are replicated into the one RDS Postgres instance, and the replication is continuous. I am okay with batch processing for the time being (the data can be stale by a day). But if streaming (for example, Debezium) is the only way, then I am okay with that too.
Any thoughts on this please?
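One way to get past the 50 GB full load failing is to copy the table in bounded primary-key ranges, so that each batch is small and a failed batch can be retried on its own. Below is a minimal sketch, assuming the large table has an integer key. `key_ranges` and `chunk_query` are hypothetical helper names (not part of any library), and the `%s` placeholders assume a psycopg2-style driver.

```python
def key_ranges(min_key, max_key, chunk_size):
    """Yield inclusive (start, end) key ranges that cover min_key..max_key."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    start = min_key
    while start <= max_key:
        end = min(start + chunk_size - 1, max_key)
        yield start, end
        start = end + 1


def chunk_query(table, key_column, start, end):
    """Return (sql, params) selecting one chunk.

    Assumption: table and key_column come from trusted configuration,
    because identifiers cannot be passed as query parameters.
    """
    sql = f"SELECT * FROM {table} WHERE {key_column} BETWEEN %s AND %s ORDER BY {key_column}"
    return sql, (start, end)
```

Each `(sql, params)` pair can be run against the source and the rows written to the target before moving on to the next range. Recording the last completed range gives a restart point, and the same key can later serve as a watermark for the daily incremental load.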

Related

ELT Pipeline - AWS RDS to BigQuery

I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery but since the team is shifting the RDS instance to a private subnet, it would not be possible to access using third party tools which would require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values e.g. a customer's status changes in a table. Would BigQuery automatically handle such changes? Plus, I would want to set up regular daily data transfers. For this, I think a cron job can be set up with the Python script to transfer data but would this be a costly approach considering that there are a bunch of large tables (extraction, conversion to dataframe/CSV then uploading to BQ)? As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main factors to consider for a solution are mostly cost, time to set up and data loading durations.

AWS Aurora: bulk upsert of records using pre-formed SQL Statements

Is there a way of doing a batch insert/update of records into AWS Aurora using "pre-formed" Postgresql statements, using Python?
My scenario: I have an AWS lambda that receives data changes (insert/modify/remove) from DynamoDB via Kinesis, which then needs to apply them to an instance of Postgres in AWS Aurora.
All I've managed to find doing an Internet search is the use of Boto3 via the "batch_execute_statement" command in the RDS Data Service client, where one needs to populate a list of parameters for each individual record.
If possible, I would like a mechanism where I can supply many "pre-formed" INSERT/UPDATE/DELETE Postgresql statements to the database in a batch operation.
Many thanks in advance for any assistance.
I used psycopg2 and a SQLAlchemy engine's raw connection (instead of Boto3) and looped through my list of SQL statements, executing each one in turn.
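The loop described above might look roughly like the sketch below. It is written against the generic Python DB-API so that it runs as-is, with an in-memory SQLite database standing in for Aurora. `apply_statements` is a hypothetical helper name, and with psycopg2 the `conn` argument would be the raw connection instead.

```python
import sqlite3


def apply_statements(conn, statements):
    """Execute pre-formed SQL statements in order inside one transaction.

    Commits only if every statement succeeds; otherwise rolls back and re-raises.
    """
    cur = conn.cursor()
    applied = 0
    try:
        for stmt in statements:
            cur.execute(stmt)
            applied += 1
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
    return applied


# Stand-in database for the demonstration (assumption: not the real target).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, status TEXT)")
conn.commit()

applied = apply_statements(conn, [
    "INSERT INTO items (id, status) VALUES (1, 'new')",
    "INSERT INTO items (id, status) VALUES (2, 'new')",
    "UPDATE items SET status = 'modified' WHERE id = 1",
    "DELETE FROM items WHERE id = 2",
])
rows = conn.execute("SELECT id, status FROM items ORDER BY id").fetchall()
```

Running the whole batch in one transaction means a failure part-way through leaves the table unchanged. Pre-formed statements are only safe when they are built from trusted data.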

Export scraped .csv file from AWS EC2 to AWS MySQL database

I have a Python Scraper that I run periodically in my free tier AWS EC2 instance using Cron that outputs a csv file every day containing around 4-5000 rows with 8 columns. I have been ssh-ing into it from my home Ubuntu OS and adding the new data to a SQLite database which I can then use to extract the data I want.
Now I would like to try the free tier AWS MySQL database so I can have the database in the Cloud and pull data from it from my terminal on my home PC. I have searched around and found no direct tutorial on how this could be done. It would be great if anyone that has done this could give me a conceptual idea of the steps I would need to take. Ideally I would like to automate the updating of the database as soon as my EC2 instance updates with a new csv table. I can do all the de-duping once the table is in the aws MySQL database.
Any advice or link to tutorials on this most welcome. As I stated, I have searched quite a bit for guides but haven't found anything on this. Perhaps the concept is completely wrong and there is an entirely different way of doing it that I am not seeing?
The problem is that you don't have access to the RDS filesystem, so you cannot upload the CSV there and import it on the server side.
Modify your Python Scraper to connect to DB directly and insert data there.
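As a rough illustration of inserting directly from the scraper, the sketch below reads CSV rows and sends them through a DB-API connection with `executemany`. `insert_csv` is a hypothetical helper name and the table and columns are made up. SQLite stands in for the MySQL instance so that the example runs on its own; common MySQL drivers use `%s` as the placeholder instead of `?`.

```python
import csv
import io
import sqlite3


def insert_csv(conn, csv_file, table, placeholder="%s"):
    """Insert every data row of a CSV file (with a header line) into table.

    Assumption: the header names match the table's column names and the
    table name comes from trusted configuration.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(header)
    marks = ", ".join([placeholder] * len(header))
    sql = f"INSERT INTO {table} ({columns}) VALUES ({marks})"
    rows = [tuple(row) for row in reader if row]  # skip blank lines
    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    cur.close()
    return len(rows)


# Stand-in database and sample data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (name TEXT, price TEXT)")
daily_csv = io.StringIO("name,price\nwidget,9.50\ngadget,12.00\n")
inserted = insert_csv(conn, daily_csv, "scraped", placeholder="?")
stored = conn.execute("SELECT name, price FROM scraped ORDER BY name").fetchall()
```

The same function could be called at the end of the cron job, right after the scraper writes its daily file, or the scraper could pass its rows to `executemany` without writing a CSV at all.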
Did you consider using AWS Lambda to run your scraper?
Take a look at this AWS tutorial which will help you configure a Lambda Function to access an Amazon RDS database.

Remote Postgres to Postgres data

I am working on a project now where I need to load daily data from one psql database into another one (both databases are on separate remote machines).
The Postgres version I'm using is 9.5, and due to our infrastructure, I am currently doing this using python scripts, which works fine for now, although I was wondering:
Is it possible to do this using psql commands that I can easily schedule? Or is Python a flexible enough approach for future developments?
EDIT:
The main database contains a backend connected directly to a website and the other contains an analytics system which basically only needs to read the main db's data and store future transformations of it.
The latency is not very important, what is important is the reliability and simplicity.
Sure, you can use psql and an SSH connection if you want.
This approach (or using pg_dump) can be useful as a way to reduce the effects of latency.
However, note that the SQL INSERT ... VALUES command can insert several rows in a single command. When I use Python scripts to migrate data, I build INSERT commands that insert up to 1000 rows each, which cuts the number of round trips (and so the latency cost) by a factor of up to 1000.
Another approach worth considering is dblink which allows postgres to query a remote postgres directly, so you could do a select from the remote database and insert the result into a local table.
postgres_fdw may be worth a look too.
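The multi-row INSERT batching mentioned above can be sketched as follows. This is only an illustration: `batches` and `multi_row_insert` are hypothetical helper names, the table is made up, and SQLite (placeholder `?`) stands in for Postgres, where psycopg2 would use `%s`. The batch size is kept at 100 here because older SQLite builds limit the number of bound parameters per statement.

```python
import sqlite3


def batches(rows, size):
    """Yield lists of at most `size` rows from any iterable of rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def multi_row_insert(table, columns, batch, placeholder="%s"):
    """Build one INSERT ... VALUES (...), (...), ... statement for a batch."""
    row_marks = "(" + ", ".join([placeholder] * len(columns)) + ")"
    all_marks = ", ".join([row_marks] * len(batch))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {all_marks}"
    params = [value for row in batch for value in row]
    return sql, params


# Stand-in target database and sample data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
source_rows = [(i, f"row-{i}") for i in range(250)]

statements_sent = 0
for batch in batches(source_rows, 100):
    sql, params = multi_row_insert("events", ["id", "label"], batch, placeholder="?")
    conn.execute(sql, params)
    statements_sent += 1
conn.commit()
total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

Here 250 rows go over in 3 statements instead of 250. Values are passed as parameters rather than formatted into the SQL text, which avoids quoting problems.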

Amazon EC2 & S3 When using Python / SQLite?

Suppose that I have a huge SQLite file (say, 500[MB]) stored in Amazon S3.
Can a Python script that runs on a small EC2 instance directly access and modify that SQLite file? Or must I first copy the file to the EC2 instance, change it there and then copy it back to S3?
Will the I/O be efficient?
Here's what I am trying to do. As I wrote, I have a 500[MB] SQLite file in S3. I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file). Then, once processing is done, every instance will update only the subset of the data it dealt with (as explained, there will be no overlap of data among processes).
For example, suppose that the SQLite file has say 1M rows:
instance 1 will deal with (and update) rows 0 - 100000
instance 2 will deal with (and update) rows 100001 - 200000
.........................
instance 10 will deal with (and update) rows 900001 - 1000000
Is it at all possible? Does it sound OK? any suggestions / ideas are welcome.
I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file)
You cannot do this with SQLite, on Amazon infrastructure or otherwise. SQLite performs database-level write locking. Unless all ten nodes are performing reads exclusively, you will not attain any kind of concurrency. Even the SQLite website says so.
Situations Where Another RDBMS May Work Better
Client/Server Applications
High-volume Websites
Very large datasets
High Concurrency
Have you considered PostgreSQL?
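The database-level write lock described above is easy to observe with two connections to the same file. This is a small sketch using only the standard library; `second_writer_is_blocked` is a hypothetical helper name, and `timeout=0` makes the second writer fail immediately instead of waiting for the lock.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Return True if a second connection cannot start a write transaction
    while the first connection holds one."""
    # isolation_level=None: transactions are controlled explicitly below.
    writer_a = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    writer_b = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        writer_a.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, val TEXT)")
        writer_a.execute("BEGIN IMMEDIATE")  # takes the database write lock
        writer_a.execute("INSERT INTO items (val) VALUES ('from A')")
        try:
            writer_b.execute("BEGIN IMMEDIATE")  # needs the same lock
        except sqlite3.OperationalError as exc:
            return "locked" in str(exc)
        writer_b.execute("ROLLBACK")
        return False
    finally:
        if writer_a.in_transaction:
            writer_a.execute("ROLLBACK")
        writer_a.close()
        writer_b.close()


with tempfile.TemporaryDirectory() as tmp_dir:
    blocked = second_writer_is_blocked(os.path.join(tmp_dir, "demo.db"))
```

The two writers here touch different rows, just like the ten instances in the question, and the second one is still refused because the lock covers the whole database file.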
Since S3 cannot be directly mounted, your best bet is to create an EBS volume containing the SQLite file and work directly with the EBS volume from another (controller) instance. You can then create snapshots of the volume, and archive it into S3. Using a tool like boto (Python API), you can automate the creation of snapshots and the process of moving the backups into S3.
You can mount an S3 bucket on your Linux machine. See below:
s3fs - http://code.google.com/p/s3fs/wiki/InstallationNotes - this did work for me. It uses a FUSE file system + rsync to sync the files in S3. It keeps a copy of all filenames on the local system and makes them look like files/folders.
This is good if the system is already in place and running with a huge collection of data. But if you are building this from scratch, I would suggest keeping the SQLite file on an EBS volume and using this script to create a snapshot of the volume:
https://github.com/rakesh-sankar/Tools/blob/master/AmazonAWS/EBS/ebs-snapshot.sh
If your DB structure is simple, why not just use Amazon SimpleDB? Or run MySQL (or another DB) on one of your instances.
Amazon EFS can be shared among EC2 instances. It's a managed NFS share. SQLite will still lock the whole DB on write.
The SQLite website does not recommend NFS shares, though. But depending on the application, you can share the DB read-only among several EC2 instances, store the results of your processing somewhere else, and then concatenate the results in the next step.
