Architecture for syncing s3/cloudfront with database

I'm building a Django app. The app allows the user to upload files, and have them served publicly to other users.
I'm thinking of using S3 or CloudFront to manage and serve these files. (Let's call it S3 for the sake of discussion.) What bugs me is that S3 is going to hold a lot of state. My Python code will create, rename and delete files on S3 based on user actions, but we already have all that state in our database. Having state in two separate datastores could lead to synchronization problems and confusion. In other words, the two are "not supposed to" go out of sync, yet nothing enforces that. For example, if someone deletes a record in the database from the Django admin, the file on S3 will be left orphaned. (I could write code to deal with that scenario, but I can't catch every scenario.)
So what I'm wondering is: is there a solution that keeps S3 automatically in sync with the data in a Postgres database? (I have no problem storing the files as blobs in the database, since they're not big, as long as they're not served directly from there.) I'm talking about an active program that always maintains sync between the two, so that if someone deletes a record in the database, the corresponding file in S3 gets deleted, and if someone deletes a file from the S3 interface, it gets recreated from the database. That way my mind could rest at ease regarding synchronization issues.
Is there something like that? Most preferably in Python.

I ran into the same problem in the past. This may not be the best advice, but here's what I did.
I put the upload/modify/remove logic for S3 in the models and used model signals to keep S3 updated. For example, you can use the post_delete signal to delete the image from S3 and avoid orphans.
I also had a management command that checked whether everything was synchronized and fixed any problems it found. Unfortunately I wrote this for a client and cannot share it.
Edit: I found django-cb-storage-s3 and django-s3sync; they could be helpful.
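The core of a synchronization check like the management command described above can be sketched as a set comparison. This is only an illustration, not the answerer's code: plan_sync is a hypothetical helper, and it assumes you have already collected the file keys stored in the database and the keys listed from the bucket.

```python
def plan_sync(db_keys, s3_keys):
    """Compare the keys the database expects with the keys present in storage."""
    db_keys, s3_keys = set(db_keys), set(s3_keys)
    return {
        # Present in S3 but with no database record: candidates for deletion.
        "orphaned": sorted(s3_keys - db_keys),
        # Recorded in the database but absent from S3: candidates for re-upload.
        "missing": sorted(db_keys - s3_keys),
    }


plan = plan_sync(
    db_keys=["uploads/a.png", "uploads/b.png"],
    s3_keys=["uploads/b.png", "uploads/c.png"],
)
print(plan)  # {'orphaned': ['uploads/c.png'], 'missing': ['uploads/a.png']}
```

Treating the database as the source of truth keeps the command idempotent: running it twice in a row yields an empty plan the second time.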

Related

Heroku - CSV Files and TXT Logfiles

I want to deploy a python bot to heroku.
The bot writes all logging data to txt files and also exports a CSV file to the filesystem. The CSV holds data that is needed for the next run of the bot and also makes it possible to track the bot's past performance.
I know that it is not possible to store files persistently on a Heroku dyno, so the question is: how and where should I store the data?
A database is not suitable for the CSV data because I sometimes have to edit the file between two runs, and doing that via a database would be too much effort for me.
Any suggestions?

You need to save the file(s) on external storage such as S3, Dropbox, GitHub, etc.
Check Files on Heroku to see options and examples.
You can either read/write the files directly from the storage (i.e. no local copy), or keep the files locally and make sure they are saved to the storage at some point (every 10 minutes, before every restart).

I want to trigger a python script using a cloud function whenever a specified file is created on the google cloud storage

One CSV file is uploaded to Cloud Storage every day around 02:00, but sometimes, due to a job failure or system crash, the upload happens much later. So I want to create a Cloud Function that triggers my Python bq load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/

The question does not describe the desired use case or any issues the OP has faced in enough detail. However, here are a few possible approaches that you might choose from depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
The hard way: App Engine with a few tricks.
Host a basic Flask application on GAE (Standard or Flex) with an endpoint dedicated to checking that the file exists, downloading the object, manipulating it and then doing something with it.
This route can act as a custom HTTP-triggered function: it runs whenever it receives a request (from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function).
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you avoid cold starts and the risk of the request timing out before the job is done.
The brutal/overkill way: Cloud Run.
A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and handle the other minor things that apply to building any application on Cloud Run.
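For the first approach, a storage-triggered background function could look roughly like the sketch below. It assumes a first-generation Python Cloud Function deployed with a Cloud Storage "finalize" trigger, where the event dict carries the bucket and object name. run_bq_load is a hypothetical placeholder for the existing load script, and the file name pattern is an assumption based on the seller_data_{date} name in the question.

```python
import re

# Matches the daily file; the exact date format after the prefix is an assumption.
FILE_PATTERN = re.compile(r"^seller_data_.+")


def run_bq_load(bucket, name):
    """Hypothetical placeholder for the existing BigQuery load script."""
    return f"gs://{bucket}/{name}"


def on_file_uploaded(event, context):
    """Entry point for a Cloud Storage 'finalize' event (object created)."""
    name = event["name"]
    if not FILE_PATTERN.match(name):
        return None  # ignore unrelated uploads to the same bucket
    # The platform ignores the return value; it is returned here only so the
    # function is easy to exercise locally.
    return run_bq_load(event["bucket"], name)
```

Because the function fires on the upload itself, a late upload simply triggers a late load, with no polling or fixed schedule involved.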
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is the directory both GAE and CF provide for temporary files. Cloud Run is a bit different here, but let's not go deep into it as it's overkill by itself.
However, keep in mind that a large file might cause high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, always open files using with open(...) as ..., which makes sure files are not left open.
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (always paying close attention to memory usage) is this: when I create the object that I upload to the bucket, I get the current time and use a regex to transform it into a name like results_22_6.
As a result, when I list the objects from my other script, they are already in ascending order, so the last element in the list is the latest object.
What I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket listing. If it is, do nothing; if not, delete the old file and download the latest one from the bucket.
This might not be optimal, but for me it's kind of preferable.
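One caveat with the naming trick: Cloud Storage lists objects in lexicographic order of their names, so un-padded numbers such as results_22_6 do not sort chronologically as strings. A minimal sketch of one way around this, assuming you are free to choose the naming scheme, is to embed an ISO date (the helper names below are illustrative):

```python
from datetime import date


def object_name(day):
    """Build a name whose lexicographic order matches date order."""
    return f"results_{day.isoformat()}"  # e.g. results_2020-06-22


def latest_object(names):
    """Return the newest name, or None when the listing is empty."""
    return max(names, default=None)


names = [
    object_name(date(2020, 6, 9)),
    object_name(date(2020, 10, 1)),
    object_name(date(2020, 6, 22)),
]
print(latest_object(names))  # results_2020-10-01
```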

parallel copy of buckets/keys from boto3 or boto api between 2 different accounts/connections

I want to copy keys from buckets between 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked
import boto3
source = boto3.client('s3')
destination = boto3.client('s3')
# get_object returns a dict; the object data is the streaming 'Body' entry
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching the data with GET and writing it with PUT in another account.
Along similar lines, with the boto API I have done the following
from boto.s3.connection import S3Connection
from boto.s3.key import Key
source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)
destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But it is really slow. It takes around 15-20 seconds to copy 1 GB of data, and I have to copy more than 100 GB.
I tried Python multithreading, with each thread doing a copy operation. The performance was bad: it took 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I also tried multiprocessing and got the same result as with a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB of RAM. The network speed in my environment is 10 Gbps.
Most search results are about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?

Yes, it is the wrong approach.
You shouldn't download the file. You are on AWS infrastructure, so you should let the AWS backend do the copy for you. Your approach wastes resources.
The copy method of the boto3 S3 client will do the job better than this.
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, you may find that you don't even need a server for this: S3 bucket event triggers, Lambda, etc. can all run the copy job without a server.
To copy files between two different AWS accounts, check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store shared by everyone, which is why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work much like a file server, e.g. replicate, copy and move files in the backend, without the data passing through your network.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is much like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and a file copy from another user is done entirely within the file server, with the fastest raw disk I/O performance. With the backend S3 copy API you need neither a powerful instance nor a high-performance network.
Your method is like downloading a file over FTP from another user on the same file server, which creates unwanted network traffic.
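A server-side copy along these lines might look like the sketch below. It is not a complete cross-account setup: it assumes the destination credentials have already been granted read access to the source bucket (for example through a bucket policy), and the bucket names, key and profile name in the usage comment are hypothetical.

```python
def copy_object_server_side(dest_client, src_bucket, dest_bucket, key):
    """Ask S3 to copy the object itself instead of downloading and re-uploading it."""
    copy_source = {"Bucket": src_bucket, "Key": key}
    # The S3 client's copy() is a managed transfer: the bytes are copied inside
    # S3, and large objects are split into multipart copies automatically.
    dest_client.copy(CopySource=copy_source, Bucket=dest_bucket, Key=key)


# Usage (requires boto3 and valid credentials; all names are hypothetical):
#   import boto3
#   dest_client = boto3.Session(profile_name="destination").client("s3")
#   copy_object_server_side(dest_client, "source-bucket", "destination-bucket", "key")
```

Passing the client in as a parameter keeps the helper independent of how credentials are obtained, and makes it easy to call from a Lambda handler or a thread pool when many keys need copying.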

You should check out the TransferManager in boto3. It automatically handles the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")

Linux program to take newest ftp file and send to other ftp server

I was wondering if it is possible to take the newest files uploaded to an FTP server and send them to another FTP server. BUT, every file may only be sent once. If this can be done in Python that would be nice; I know intermediate Python. EXAMPLE:
2:14 PM file.txt is uploaded to the server. The program takes the file and sends it to another server.
2:15 PM example.txt is uploaded to the server. The program takes just that file and sends it to another server.
I have searched online for this but can't find anything. Please help!

Since you said that you already know Python, I will give you some conceptual hints. Basically, you are looking for a one-way synchronisation. The main problem with this task is making your program detect new files. The simplest way to do this is to create a database (note that by database I mean any way of storing data, not necessarily a specialized database), for example a text file. Every file that has been sent is recorded in this database. Periodically, compare the database with the current files on the server (a basic ls or something similar will do). If new files appear (meaning there are files that are not in the database), upload them.
This is the basic idea. You can improve it with multithreading, checks for whether a file has been modified, and so on.
EDIT: This is the programming way. As has been suggested in the comments, there are also software solutions that will do this for you.
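The bookkeeping part of that idea can be sketched with a plain text file as the "database". The function names and record file are illustrative, and the actual transfer is left out; the current listing could come from something like ftplib's FTP.nlst().

```python
import os


def load_seen(path):
    """Read the names already sent, one per line; empty set if no record exists yet."""
    if not os.path.exists(path):
        return set()
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}


def find_new_files(listing, seen):
    """Names in the current listing that have not been sent before."""
    return sorted(set(listing) - seen)


def mark_sent(path, names):
    """Append the transferred names so they are never sent twice."""
    with open(path, "a") as fh:
        for name in names:
            fh.write(name + "\n")
```

A periodic job would call load_seen, pass the server listing to find_new_files, transfer each returned file, and only then call mark_sent, so that a failed transfer is retried on the next run.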

Does local GAE read and write to a local datastore file on the hard drive while it's running?

I have just noticed that while an instance of my GAE application is running, nothing happens to the datastore file when I add or remove entries from Python code or in the admin console. I can even remove the file and still have all the data safe and sound in the admin area and accessible from code. But when I restart my application, all the data obviously goes away and I have a blank datastore. So, the question: does GAE read all the data from the file only when it starts, then work with it in memory and save it after I stop the application? Does it make any requests to the datastore file while the application is running? If it doesn't save anything to the file while it's running, could data be lost if the application stops unexpectedly? Please explain how this works if you know.

How the datastore reads and writes its underlying files varies - the standard datastore is read on startup, and written progressively, journal-style, as the app modifies data. The SQLite backend uses a SQLite database.
You shouldn't have to care, though - neither backend is designed for robustness in the face of failure, as they're development backends. You shouldn't be modifying or deleting the underlying files, either.

By default the dev_appserver stores its data in a temporary location (which is why it disappears and you can't see anything changing).
If you don't want your data to disappear on restart, set --datastore_path when running your dev server, like this:
dev_appserver.py --datastore_path /path/to/app/myapp.db /path/to/app
As Nick said, the dev server is not built to be bulletproof; it's designed to help you develop your app quickly. The production setup is very different and will not do anything unexpected when you are dealing with exceptional circumstances.
