I am attempting to write a Python script that will run in AWS Lambda, back up a PostgreSQL database table hosted in Amazon RDS, and then dump the resulting .bak file (or similar) to S3.
I'm able to connect to the database and make changes to it, but I'm not quite sure how to go about the next steps. How do I actually back up the DB and write it to a backup file in the S3 bucket?
Depending on how large your database is, Lambda may not be the best solution. Lambdas have limits of 512 MB of /tmp disk space, a 15-minute timeout, and 3,008 MB of memory. Maxing out these limits may also be more expensive than other options.
Using EC2 or Fargate along with boto or the AWS CLI may be a better solution. Here is a blog entry that walks through a solution:
https://francescoboffa.com/using-s3-to-store-your-mysql-or-postgresql-backups
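Whichever compute you pick, the core of the job is usually just a pg_dump followed by an S3 upload. A rough sketch (all host, credential, table, and bucket names below are placeholders, and it assumes the pg_dump binary is available on the host):

import os
import subprocess
import boto3

dump_file = "/tmp/my_table.dump"
subprocess.run(
    ["pg_dump",
     "-h", "mydb.xxxxxx.us-east-1.rds.amazonaws.com",  # placeholder RDS endpoint
     "-U", "my_user",
     "-d", "my_database",
     "-t", "my_table",
     "-Fc",                      # custom-format archive, restorable with pg_restore
     "-f", dump_file],
    check=True,
    env={**os.environ, "PGPASSWORD": "my_password"},   # placeholder credentials
)
boto3.client("s3").upload_file(dump_file, "my-backup-bucket", "backups/my_table.dump")

A custom-format dump (-Fc) can later be restored with pg_restore.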
The method that worked for me was to create an AWS Data Pipeline to back up the database to CSV.
I am currently working on implementing Facebook Prophet for a live production environment. I haven't done that before, so I wanted to present my plan here and hope you can give me some feedback on whether that's an okay solution or if you have any suggestions.
Within Django, I create a .csv export of relevant data which I need for my predictions. These .csv exports will be uploaded to an AWS S3 bucket.
From there I can access this S3 bucket with an AWS Lambda Function where the "heavy" calculations are happening.
Once done, I take the forecasts from step 2 and save them again in a forecast.csv export.
Now my Django application can access the forecast.csv on S3 and get the respective predictions.
I am especially curious whether an AWS Lambda Function is the right tool in this case. The exports could probably also be saved in DynamoDB (?), but I am trying to keep my v1 simple, hence .csv. There is still some effort involved in installing the right layers/packages for AWS Lambda, so I want to make sure I am walking in the right direction before diving deep into its documentation.
I am a little concerned about using AWS Lambda for the "heavy" calculations. There are a few reasons.
Binary size limit: AWS Lambda has a 250 MB size limit for the (unzipped) deployment package. This was the biggest limitation we faced, as you will not be able to include all of the libraries like numpy, pandas, matplotlib, etc. in that package.
Disk size limit: AWS only provides a maximum of 512 MB of /tmp disk space for a Lambda execution, which can become a problem if you want to save intermediate results to disk.
The cost can skyrocket: If your Lambda is going to run for a long time instead of as multiple small invocations, you will end up paying a lot of money. In that case, I think you will be better off with something like EC2 and ECS.
You can evaluate linking the S3 bucket to an SQS queue, with a process running on an EC2 machine that listens to the queue and performs all the calculations.
AWS Lambda Limits.
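For the S3-to-SQS-to-EC2 worker idea above, a rough sketch of the polling side could look like this (the queue URL is a placeholder, and it assumes the bucket is configured to send S3 event notifications to the queue):

import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/forecast-jobs"  # placeholder

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # S3 event notifications carry the bucket and key of the newly uploaded .csv export
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # ... run the heavy Prophet forecast here and write forecast.csv back to S3 ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])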
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into BigQuery and run your query from there.
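As a rough sketch of what the query side could look like once the JSON is loaded into a BigQuery table (the project, dataset, table, and field names are made up, and it assumes the source object name was recorded in a filename column during loading):

from google.cloud import bigquery

client = bigquery.Client()
# Made-up project/dataset/table/field names; `filename` assumes the source object name
# was stored as a column when the JSON was loaded.
query = """
    SELECT filename
    FROM `my_project.my_dataset.json_files`
    WHERE text_field LIKE '%needle%'
"""
for row in client.query(query).result():
    print(row.filename)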
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto in it with a lot of workers, use a Hive metastore backed by GCS, and query from there. (Presto doesn't have a direct GCS connector yet, AFAIK.) This option seems more elaborate.
Hope it helps!
I have a Python Scraper that I run periodically in my free tier AWS EC2 instance using Cron that outputs a csv file every day containing around 4-5000 rows with 8 columns. I have been ssh-ing into it from my home Ubuntu OS and adding the new data to a SQLite database which I can then use to extract the data I want.
Now I would like to try the free tier AWS MySQL database so I can have the database in the Cloud and pull data from it from my terminal on my home PC. I have searched around and found no direct tutorial on how this could be done. It would be great if anyone that has done this could give me a conceptual idea of the steps I would need to take. Ideally I would like to automate the updating of the database as soon as my EC2 instance updates with a new csv table. I can do all the de-duping once the table is in the aws MySQL database.
Any advice or link to tutorials on this most welcome. As I stated, I have searched quite a bit for guides but haven't found anything on this. Perhaps the concept is completely wrong and there is an entirely different way of doing it that I am not seeing?
The problem is that you don't have access to the RDS filesystem, so you cannot upload the CSV there and import it from the server's local disk.
Modify your Python scraper to connect to the database directly and insert the data there.
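A bare-bones sketch of that idea using the pymysql package (the endpoint, credentials, file, and table names are placeholders; the eight %s placeholders match the eight CSV columns you mentioned):

import csv
import pymysql

# Placeholder endpoint, credentials, file, and table names
conn = pymysql.connect(host="mydb.xxxxxx.us-east-1.rds.amazonaws.com",
                       user="admin", password="secret", database="scraper")
cur = conn.cursor()
with open("daily_scrape.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        cur.execute(
            "INSERT INTO scraped_data VALUES (%s, %s, %s, %s, %s, %s, %s, %s)",
            row,
        )
conn.commit()
cur.close()
conn.close()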
Did you consider using AWS Lambda to run your scraper?
Take a look at this AWS tutorial which will help you configure a Lambda Function to access an Amazon RDS database.
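If you do go the Lambda route, the handler mostly just needs the same database connection; a bare-bones sketch (the endpoint, credentials, and table are placeholders, and pymysql would need to be packaged with the function or provided as a layer):

import pymysql

def lambda_handler(event, context):
    # Placeholder RDS endpoint and credentials; in practice read these from environment variables
    conn = pymysql.connect(host="mydb.xxxxxx.us-east-1.rds.amazonaws.com",
                           user="admin", password="secret", database="scraper",
                           connect_timeout=5)
    try:
        with conn.cursor() as cur:
            # ... run the scraper and INSERT the new rows here, as in the sketch above ...
            cur.execute("SELECT COUNT(*) FROM scraped_data")
            (row_count,) = cur.fetchone()
        conn.commit()
        return {"rows": row_count}
    finally:
        conn.close()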
I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked
source = boto3.client('s3')
destination = boto3.client('s3')
response = source.get_object(Bucket='source-bucket', Key='key')
destination.put_object(Bucket='destination-bucket', Key='key', Body=response['Body'].read())
Basically I am fetching data from GET and pasting that with PUT in another account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('source-bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('destination-bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow: it takes around 15-20 seconds to copy 1 GB of data, and I have to copy 100+ GB.
I tried Python multithreading, where each thread does one copy operation. The performance was worse, as it took 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I am getting the same result as a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB of RAM. The network speed in my environment is 10 Gbps.
Most of the search results cover copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach wastes resources.
boto3.client.copy will do the job better than this.
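A minimal sketch of that call (bucket and key names are placeholders; it assumes your credentials can read the source bucket and write to the destination bucket):

import boto3

s3 = boto3.client("s3")

# Server-side copy: S3 moves the bytes internally, nothing is downloaded to your machine.
copy_source = {"Bucket": "source-bucket", "Key": "path/to/object"}
s3.copy(copy_source, "destination-bucket", "path/to/object")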
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is one huge virtual object store shared by everyone, which is why bucket names MUST be globally unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replicating, copying, and moving objects in the backend, without involving network traffic on your side.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server. Users can copy files to each other without a "download/upload": one just creates a folder with write permission for all, and the copy of another user's file happens entirely inside the file server, at raw disk I/O speed. You don't need a powerful instance or a high-performance network when you use the backend S3 copy API.
Your method is like attempting to FTP-download a file from another user of the same file server and upload it back, which creates unwanted network traffic.
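As a rough illustration of that permission setup (a sketch only: the account ID, bucket name, and policy scope are placeholders, and your exact policy will depend on which account the copying credentials live in), the destination bucket's owner could grant the other account write access with a bucket policy:

import json
import boto3

# Placeholder account ID and bucket name; adjust the actions/resources to your needs.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowCrossAccountCopy",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:root"},  # the other account
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::destination-bucket/*",
    }],
}

s3 = boto3.client("s3")  # credentials of the account that owns destination-bucket
s3.put_bucket_policy(Bucket="destination-bucket", Policy=json.dumps(policy))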
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
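If you need to tune how that managed transfer behaves, the same call accepts a TransferConfig; the numbers below are purely illustrative:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")
# Illustrative values only: larger parts and more worker threads for big objects
config = TransferConfig(multipart_threshold=64 * 1024 * 1024,
                        multipart_chunksize=64 * 1024 * 1024,
                        max_concurrency=20)
s3.upload_file("tmp.txt", "bucket-name", "key-name", Config=config)

The client's copy method accepts the same Config argument, so this tuning also applies to the server-side copy suggested in the other answer.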
Suppose that I have a huge SQLite file (say, 500[MB]) stored in Amazon S3.
Can a python script that is run on a small EC2 instance directly access and modify that SQLite file? or must I first copy the file to the EC2 instance, change it there and then copy over to S3?
Will the I/O be efficient?
Here's what I am trying to do. As I wrote, I have a 500[MB] SQLite file in S3. I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file). Then, once processing is done, every instance will update only the subset of the data it dealt with (as explained, there will be no overlap of data among processes).
For example, suppose that the SQLite file has say 1M rows:
instance 1 will deal with (and update) rows 0 - 100000
instance 2 will deal with (and update) rows 100001 - 200000
.........................
instance 10 will deal with (and update) rows 900001 - 1000000
Is it at all possible? Does it sound OK? Any suggestions/ideas are welcome.
I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file)
You cannot do this with SQLite, on Amazon infrastructure or otherwise. SQLite performs database-level write locking; unless all ten nodes are performing reads exclusively, you will not attain any kind of concurrency. Even the SQLite website says so:
Situations Where Another RDBMS May Work Better
Client/Server Applications
High-volume Websites
Very large datasets
High Concurrency
Have you considered PostgreSQL?
Since S3 cannot be directly mounted, your best bet is to create an EBS volume containing the SQLite file and work directly with the EBS volume from another (controller) instance. You can then create snapshots of the volume and archive them to S3. Using a tool like boto (Python API), you can automate the creation of snapshots and the process of moving the backups into S3.
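For the snapshot automation piece, a rough sketch with the current boto3 API (the volume ID is a placeholder) might look like this:

import boto3

ec2 = boto3.client("ec2")
# "vol-0123456789abcdef0" is a placeholder for the EBS volume that holds the SQLite file
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Snapshot of the SQLite data volume",
)
print(snapshot["SnapshotId"])

Amazon stores EBS snapshots in S3 behind the scenes, which covers the archiving step.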
You can mount an S3 bucket on your Linux machine with s3fs (http://code.google.com/p/s3fs/wiki/InstallationNotes). This did work for me. It uses a FUSE file system plus rsync to sync the files in S3, and it keeps a copy of all the filenames in the local system, making the bucket look like a local file/folder.
This is good if the system is already in place and running with a huge collection of data. But if you are building this from scratch, I would suggest you put the SQLite file on an EBS volume and use this script to create snapshots of your EBS volume:
https://github.com/rakesh-sankar/Tools/blob/master/AmazonAWS/EBS/ebs-snapshot.sh
If your DB structure is simple, why not just use AWS SimpleDB? Or run MySQL (or another DB) on one of your instances.
Amazon EFS can be shared among EC2 instances. It is a managed NFS share. SQLite will still lock the whole DB on write.
The SQLite website does not recommend NFS shares, though. But depending on the application, you can share the DB read-only among several EC2 instances, store the results of your processing somewhere else, and then concatenate the results in the next step.
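For the read-only sharing variant, each instance could open the database through a read-only URI (a sketch; the EFS mount path and table name are made up):

import sqlite3

# Made-up EFS mount path; mode=ro opens the database read-only
conn = sqlite3.connect("file:/mnt/efs/data.sqlite?mode=ro", uri=True)
row_count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
conn.close()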