I have a Python scraper that runs periodically via cron on my free-tier AWS EC2 instance and outputs a CSV file every day containing around 4,000-5,000 rows with 8 columns. I have been SSH-ing into the instance from my home Ubuntu machine and adding the new data to a SQLite database, which I can then query for the data I want.
Now I would like to try the free-tier AWS MySQL database so that I can have the database in the cloud and pull data from it from the terminal on my home PC. I have searched around and found no direct tutorial on how this could be done. It would be great if anyone who has done this could give me a conceptual idea of the steps I would need to take. Ideally I would like to automate updating the database as soon as my EC2 instance produces a new CSV file. I can do all the de-duping once the data is in the AWS MySQL database.
Any advice or links to tutorials on this would be most welcome. As I stated, I have searched quite a bit for guides but haven't found anything on this. Perhaps the concept is completely wrong and there is an entirely different way of doing it that I am not seeing?
The problem is that you don't have access to the RDS filesystem, so you cannot upload the CSV file there or import it from there.
Modify your Python scraper to connect to the database directly and insert the data there.
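As a rough sketch of what that could look like, the function below reads a CSV file and inserts its rows through any DB-API connection. The names `load_csv`, `scraped` and the endpoint in the usage comment are made up for illustration, and the use of `pymysql` as the MySQL driver is an assumption (any DB-API driver should behave similarly). Table and column names are interpolated into the SQL, so they must come from a trusted source such as your own CSV header.

```python
import csv


def load_csv(conn, csv_path, table, placeholder="%s"):
    """Insert every data row of csv_path into table and return the row count.

    conn is any DB-API connection. The first CSV line is assumed to hold
    column names that match the table. placeholder is the driver's parameter
    marker: "%s" for MySQL drivers such as pymysql, "?" for sqlite3.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [tuple(row) for row in reader]

    columns = ", ".join(header)
    marks = ", ".join([placeholder] * len(header))
    # Only the values are parameterised; table/column names are trusted input.
    sql = f"INSERT INTO {table} ({columns}) VALUES ({marks})"

    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    cur.close()
    return len(rows)


# Hypothetical usage at the end of the scraper (all values are placeholders):
#   import pymysql
#   conn = pymysql.connect(host="<your-rds-endpoint>", user="<user>",
#                          password="<password>", database="<database>")
#   load_csv(conn, "today.csv", "scraped")
```

Calling this at the end of each scraper run would remove the manual SSH step. The RDS security group also has to allow inbound MySQL traffic from the EC2 instance.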
Did you consider using AWS Lambda to run your scraper?
Take a look at this AWS tutorial which will help you configure a Lambda Function to access an Amazon RDS database.
Related
I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery but since the team is shifting the RDS instance to a private subnet, it would not be possible to access using third party tools which would require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values e.g. a customer's status changes in a table. Would BigQuery automatically handle such changes? Plus, I would want to set up regular daily data transfers. For this, I think a cron job can be set up with the Python script to transfer data but would this be a costly approach considering that there are a bunch of large tables (extraction, conversion to dataframe/CSV then uploading to BQ)? As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main factors to consider for a solution are mostly cost, time to set up and data loading durations.
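On the upsert point: BigQuery's SQL dialect has a `MERGE` statement, so one possible pattern is to load each day's extract into a staging table and merge it into the target. The sketch below only builds such a statement. The function name and the table and column names in the usage comment are hypothetical, and running it through the `google-cloud-bigquery` client is an assumption about the setup.

```python
def build_merge_sql(target, staging, key_columns, update_columns):
    """Build a MERGE statement that upserts rows from staging into target.

    Rows are matched on key_columns. Matched rows get update_columns
    overwritten; unmatched rows are inserted. All names are trusted input.
    """
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_columns)
    set_clause = ", ".join(f"{c} = S.{c}" for c in update_columns)
    all_columns = list(key_columns) + list(update_columns)
    insert_cols = ", ".join(all_columns)
    insert_vals = ", ".join(f"S.{c}" for c in all_columns)
    return (
        f"MERGE `{target}` T\n"
        f"USING `{staging}` S\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical usage (table names are placeholders):
#   from google.cloud import bigquery
#   sql = build_merge_sql("analytics.customers", "analytics.customers_staging",
#                         ["id"], ["status"])
#   bigquery.Client().query(sql).result()
```

This way a changed customer status would overwrite the old value in the warehouse rather than producing a duplicate row.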
I have 8 RDS Postgres instances, each supporting the databases for one of our 8 solutions.
I would like to bring all the data from these 8 RDS Postgres instances into a single RDS Postgres instance.
This is needed for analytics.
Here is my work on this until now:
Tried to download each table into a CSV file and upload it to the target RDS instance (using Python here too). The copy command fails for a table that is 50 GB in size. I am unable to migrate all the data in one shot and then set up an incremental load; it fails at the full table load for large tables.
Thought of using DMS, but for an RDS Postgres to RDS Postgres migration the recommendation is to use pg_dump and pg_restore instead of DMS. For pg_dump and pg_restore, however, I need to store the password, and my organization does not let me do that.
Using Debezium, but this is a long process and I honestly need this up and working very soon.
The ideal situation is one where all 8 databases in the 8 RDS instances are replicated in the single RDS Postgres instance, and this should be continuous replication. I am okay with batch processing for the time being (the data can be stale by a day). But if streaming (for example, Debezium) is the only way, then I am okay with that too.
Any thoughts on this please?
I am attempting to write a Python script that will run in AWS Lambda, back up a PostgreSQL database table hosted in Amazon RDS, and then dump the resulting .bak file or similar to S3.
I'm able to connect to the database and make changes to it, but I'm not quite sure how to go about the next steps. How do I actually back up the DB and write it to a backup file in the S3 bucket?
Depending on how large your database is, Lambda may not be the best solution. Lambda functions have limits of 512 MB of /tmp disk space, a 15-minute timeout, and 3008 MB of memory. Maxing out these limits may also be more expensive than other options.
Using EC2 or Fargate along with boto or the AWS CLI may be a better solution. Here is a blog entry that walks through a solution:
https://francescoboffa.com/using-s3-to-store-your-mysql-or-postgresql-backups
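For a rough idea of the EC2/boto route, here is a minimal sketch (not taken from the linked post). It assumes `pg_dump` is installed on the machine, `boto3` is available, and the database password comes from the `PGPASSWORD` environment variable or a `~/.pgpass` file. The function names and the default output path are made up for illustration.

```python
import os
import subprocess


def build_pg_dump_command(host, dbname, user, table, out_path):
    """Return the pg_dump argument list for dumping one table to out_path."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--dbname", dbname,
        "--table", table,
        "--format", "custom",   # compressed archive, restorable with pg_restore
        "--file", out_path,
        "--no-password",        # never prompt; rely on PGPASSWORD or ~/.pgpass
    ]


def backup_table_to_s3(host, dbname, user, table, bucket, key,
                       out_path="/tmp/table_backup.dump"):
    """Dump one table with pg_dump, upload the file to S3, then delete it."""
    import boto3  # third-party; imported here so the helper above has no deps

    subprocess.run(
        build_pg_dump_command(host, dbname, user, table, out_path),
        check=True,  # raise if pg_dump exits with a non-zero status
    )
    boto3.client("s3").upload_file(out_path, bucket, key)
    os.remove(out_path)
```

The same code could run inside Lambda for a small table, provided a `pg_dump` binary is packaged with the function and the dump fits in the available /tmp space.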
The method that worked for me was to create an AWS Data Pipeline to back up the database to CSV.
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
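For reference, the per-file check that each worker would run is roughly the sketch below, whatever ends up scheduling it. It assumes the files have already been copied to local disk and that the text lives in a top-level field. The function names are hypothetical, and a thread pool stands in for the multiprocessing setup.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def file_matches(path, field, needle):
    """Return True if the JSON file at path has needle inside doc[field]."""
    try:
        with open(path, encoding="utf-8") as f:
            doc = json.load(f)
    except (OSError, ValueError):
        # Unreadable or malformed files are treated as non-matching.
        return False
    value = doc.get(field, "") if isinstance(doc, dict) else ""
    return isinstance(value, str) and needle in value


def search(paths, field, needle, workers=8):
    """Return the subset of paths whose field contains needle, in input order."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = pool.map(lambda p: file_matches(p, field, needle), paths)
        return [p for p, hit in zip(paths, flags) if hit]
```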
The easiest option would be to load the GCS data into BigQuery and run your query from there.
Alternatively, send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE and install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there. (Presto doesn't have a direct GCS connector yet, as far as I know.) This option seems more elaborate.
Hope it helps!
So I have a Google Sheet that maintains a lot of data. I also have a MySQL DB with a huge chunk of data. There is a vital piece of information in the Sheet that is also present in the DB, and both need to be in sync. The information always enters the Sheet first. Until now, I had a Python script with MySQL queries that I ran separately to update my database.
Now the workflow has changed. Data will enter the Sheet, and whenever that happens the database has to be updated automatically.
After some research, I found that by using the onEdit function of Google Apps Script (I learned from here.), I could pick up when the file has changed.
The next step is to fetch the data from the relevant cell, which I can do using this.
Now I need to connect to the DB and send some queries. This is where I am stuck.
Approach 1:
Have a Python web app running live and send the data to it via UrlFetchApp. I have yet to try this.
Approach 2:
Connect to MySQL remotely through Apps Script. But after 2-3 hours of reading the docs, I am not sure this is possible.
So this is my scenario. Any viable solution you can think of or a better approach?
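To make Approach 1 concrete, the receiving side could be as small as the sketch below, which uses only the standard library's `http.server`. The `records` table, its `record_key` and `value` columns, and the JSON payload shape are all assumptions made for illustration. Authentication is left out and would be needed before exposing this publicly.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def apply_update(conn, payload, placeholder="%s"):
    """Write one edited cell value to the database; return rows affected.

    payload is expected to look like {"key": ..., "value": ...}. placeholder
    is the DB-API parameter marker ("%s" for MySQL drivers, "?" for sqlite3).
    """
    sql = (f"UPDATE records SET value = {placeholder} "
           f"WHERE record_key = {placeholder}")
    cur = conn.cursor()
    cur.execute(sql, (payload["value"], payload["key"]))
    conn.commit()
    affected = cur.rowcount
    cur.close()
    return affected


def make_handler(conn, placeholder="%s"):
    """Build a request handler class bound to one database connection."""

    class SheetUpdateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            try:
                payload = json.loads(self.rfile.read(length))
                affected = apply_update(conn, payload, placeholder)
            except (ValueError, KeyError, TypeError):
                # Malformed JSON or a payload without "key"/"value".
                self.send_response(400)
                self.end_headers()
                return
            body = json.dumps({"updated": affected}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return SheetUpdateHandler


# Hypothetical usage, with conn being an open MySQL connection:
#   HTTPServer(("0.0.0.0", 8080), make_handler(conn)).serve_forever()
```

The onEdit trigger would then POST a JSON body such as {"key": ..., "value": ...} to this endpoint with UrlFetchApp.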
Connect directly to MySQL. You likely missed reading this part: https://developers.google.com/apps-script/guides/jdbc
Using JDBC within Apps Script will work if you have the time to build this yourself.
If you don't want to roll your own solution, check out SeekWell. It allows you to connect to databases and write SQL queries directly in Sheets. You can create a “Run Sheet” that will run multiple queries at once, and schedule those queries to run without you even opening the Sheet.
Disclaimer: I made this.