I want to create a minimal webpage where concurrent users can upload a file, and I then process the file (which is expected to take some hours) and email the results back to the user later on.
Since I am hosting this on AWS, I was thinking of invoking some background process once I receive the file, so that even if the user closes the browser window, the processing keeps going and I am able to send the results after a few hours, all through some pre-written scripts.
Can you please help me with the logistics of how I should do this?
Here's what it might look like (hosting-agnostic):
A user uploads a file on the web server
The file is saved in a storage that can be accessed later by the background jobs
Some metadata (location in the storage, user's email etc) about the file is saved in a DB/message broker
Background jobs watching the DB/message broker pick up the metadata and start handling the file (which is why the storage in step 2 needs to be accessible to them), then notify the user
More specifically, in the case of Python/Django + AWS you might use the following stack:
Let's assume you're using Python + Django
You can save the uploaded files in a private AWS S3 bucket
The metadata can be saved in your DB, or you can use Celery + AWS SQS, SQS directly, or bring up something like RabbitMQ or Redis (+ pub/sub)
Have Python code handling the job - the details depend on what you opt for in step 3. The only requirement is that it can pull data from your S3 bucket. After the job is done, notify the user via AWS SES
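A minimal sketch of how those pieces might fit together, assuming a Celery worker is already configured (e.g. against SQS); the bucket name, sender address and the process_file() stub are placeholders for illustration, not anything prescribed above:

```python
# views.py -- rough sketch; bucket name and "email" form field are assumptions
import uuid

import boto3
from django.http import JsonResponse

from .tasks import handle_upload

def upload(request):
    f = request.FILES["file"]
    key = "uploads/{}-{}".format(uuid.uuid4(), f.name)
    # Put the raw file in a private S3 bucket so background workers can reach it later
    boto3.client("s3").upload_fileobj(f, "my-private-bucket", key)
    # Enqueue only the metadata; the worker pulls the file itself
    handle_upload.delay(key, request.POST["email"])
    return JsonResponse({"status": "queued"})


# tasks.py -- rough sketch; process_file() stands in for the hours-long job
import boto3
from celery import shared_task

def process_file(path):
    ...  # placeholder for the actual processing
    return "Your file has been processed."

@shared_task
def handle_upload(key, email):
    s3 = boto3.client("s3")
    s3.download_file("my-private-bucket", key, "/tmp/input")
    result = process_file("/tmp/input")
    boto3.client("ses").send_email(
        Source="noreply@example.com",
        Destination={"ToAddresses": [email]},
        Message={
            "Subject": {"Data": "Your results are ready"},
            "Body": {"Text": {"Data": result}},
        },
    )
```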
The simplest single-server setup that doesn't require any intermediate components:
A Python script that simply saves the file in a folder and gives it a name like someuser#yahoo.com-f9619ff-8b86-d011-b42d-00cf4fc964ff
A cron job that looks for files in this folder, handles any it finds, and notifies the user. Note that if you need multiple background jobs running in parallel, you'll need to complicate the scheme slightly to avoid race conditions (e.g. rename the file being processed so that only a single job handles it); see the sketch below
In a production app you'll likely need something in between, depending on your needs
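For illustration, a minimal sketch of that cron-driven worker, assuming the folder path, the "#"-for-"@" filename convention from the example above, and a process() stub; the rename-before-processing step is what keeps two parallel jobs from grabbing the same file:

```python
#!/usr/bin/env python3
# Minimal sketch of the cron worker; folder path, filename convention and
# process() are assumptions for illustration only.
import os
import smtplib
from email.message import EmailMessage

INBOX = "/var/uploads"

def process(path):
    ...                                    # the actual long-running work
    return "Your file has been processed."

for name in os.listdir(INBOX):
    if name.endswith(".processing"):
        continue                           # someone is already working on it
    src = os.path.join(INBOX, name)
    claimed = src + ".processing"
    try:
        os.rename(src, claimed)            # atomic claim -> no race between jobs
    except FileNotFoundError:
        continue                           # another cron run grabbed it first
    email = name.split("-", 1)[0].replace("#", "@")   # email encoded in the filename
    body = process(claimed)
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = "noreply@example.com", email, "Done"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)
    os.remove(claimed)
```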
I'm fairly new to Django. I am creating an application where I post images from a Flutter application to a Django REST API. I need to run a Python script that takes the image posted to the API as its input.
Does anyone have any idea about this?
The best way to handle this is a job management system (e.g. Slurm, Torque, or Oracle Grid Engine): you can create and submit a job for every uploaded image, send the response back to the user right away, and the job management system will do the processing independently of the request. Celery can also work if the job won't take much time.
A simple implementation that scales well:
Upload the images to a directory uploaded
Have your script run as a daemon (controlled by systemd) looking for new files in the directory uploaded
Whenever it finds a new file, it moves (mv) it to a directory working (that way, you can run multiple instances of your script in parallel to scale up; see the sketch below)
Once your script is done with the image, it moves it to a directory finished (or wherever you need the finished images).
That setup is very simple, and works both for a small one-machine setup with low traffic and for multi-machine setups with dedicated storage and multiple worker machines handling the image transform jobs.
It also decouples your image processing from your web backend.
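A minimal sketch of such a daemon loop (directory names and process_image() are placeholders); the atomic rename into working is what lets several instances run side by side without picking up the same file:

```python
# Minimal sketch of the directory-watching daemon (run under systemd);
# directory names and process_image() are placeholders.
import os
import shutil
import time

UPLOADED, WORKING, FINISHED = "uploaded", "working", "finished"

def process_image(path):
    ...   # the actual image transform

while True:
    for name in os.listdir(UPLOADED):
        src = os.path.join(UPLOADED, name)
        dst = os.path.join(WORKING, name)
        try:
            os.rename(src, dst)            # atomic claim: only one instance wins
        except (FileNotFoundError, OSError):
            continue                       # another instance already moved it
        process_image(dst)
        shutil.move(dst, os.path.join(FINISHED, name))
    time.sleep(1)                          # simple polling; inotify also works
```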
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the upload happens very late. So I want to create a Cloud Function that triggers my Python BigQuery load script whenever the file is uploaded to the bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a full description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
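As a rough sketch (the dataset/table name is an assumption), a background Cloud Function subscribed to the bucket's finalize event could hand the new CSV straight to a BigQuery load job:

```python
# main.py -- rough sketch; the dataset/table name is a placeholder
from google.cloud import bigquery

def load_to_bq(event, context):
    """Background Cloud Function fired by google.storage.object.finalize."""
    uri = "gs://{}/{}".format(event["bucket"], event["name"])
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    # Load the newly uploaded CSV into the target table and wait for completion
    client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config).result()
```

It would be deployed with something along the lines of gcloud functions deploy load_to_bq --runtime python39 --trigger-resource sale_bucket --trigger-event google.storage.object.finalize.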
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically for this: check that the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function: once it receives a request (which could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function), it downloads the object into the /tmp dir, processes it and then does whatever you need.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you avoid cold starts and the risk of the request timing out before the job is done.
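A rough sketch of what that GAE/Flask endpoint could look like (the bucket name, the name query parameter and do_something() are placeholders): it downloads the object to /tmp, processes it, and cleans up afterwards:

```python
# Rough sketch of the GAE endpoint; bucket name, "name" parameter and
# do_something() are placeholders.
import os

from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)

def do_something(fileobj):
    ...   # your actual processing

@app.route("/process", methods=["GET", "POST"])
def process():
    name = request.args.get("name", "seller_data.csv")
    tmp_path = os.path.join("/tmp", os.path.basename(name))
    # /tmp is the writable location on GAE/CF instances
    storage.Client().bucket("sale_bucket").blob(name).download_to_filename(tmp_path)
    try:
        with open(tmp_path) as f:      # "with open" so the file never stays open
            do_something(f)
    finally:
        os.remove(tmp_path)            # always clean /tmp when done
    return "ok", 200
```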
The brutal/overkill way: Cloud Run.
A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run scales down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
Some additional things you might want to achieve are the same for all of the above approaches:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's where both GAE and CF store temporary files. Cloud Run is a bit different here, but let's not go deep into it, as it's overkill by itself.
However, keep in mind that if your file is large you might see high memory usage, since temporary files count against the instance's memory.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., which makes sure the file doesn't stay open.
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (while always paying close attention to memory usage) is this: when creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So what I do then is check whether the filename I have in /tmp is the same as the name of the last object (object[list.length]) in the bucket. If yes, do nothing; if no, delete the old one and download the latest one from the bucket.
This might not be optimal, but it's what I prefer.
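As a rough illustration of that check (the bucket name is a placeholder), listing the objects and taking the last one by name might look like this; sorting on blob.updated is an alternative that doesn't depend on the naming scheme:

```python
# Rough sketch; the bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs("sale_bucket"))
latest = sorted(blobs, key=lambda b: b.name)[-1]   # relies on names sorting chronologically
# latest = max(blobs, key=lambda b: b.updated)     # alternative: by upload timestamp
print(latest.name)
```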
I have a Flask application where I query and filter large datasets from S3 in a Celery task. I want to serve the filtered data as a CSV to the user. The size of the CSV is up to 100 MB. My initial thought was to have Celery save the dataset as a CSV to disk and then use send_file in the Flask route, BUT I am using Heroku for deployment, which has an ephemeral file system, so a file saved by the Celery worker won't be visible to the web worker. I also played with having the S3 query directly in the Flask route and then sending the file without saving it to the server, but the query takes time, up to 30 seconds, so I want to keep it as a background job in Celery.
The other thought was to upload the filtered data to S3 and remove it after the download. However, this seems inefficient because it means downloading, filtering, and re-uploading.
Is there any way to do this efficiently, or should I move off Heroku to something where I have SSD space? Thanks!
I need help with a technology choice for the following problem:
Data files come into the system to be processed. Some of them are self-contained (.a) and can be processed immediately, and some (.b) need to wait for more files to form a full set. I'm loading everything that arrives into a DB, assigning package IDs, and I can send a message on the MQ.
What I need here is a component that connects to that queue and listens for those messages. When it receives a message that a file has arrived, it needs to do more or less the following:
If a file name is taskA.a then create a request to workerA.
If a file name is taskB.a then create a request to workerB.
If a file name is taskA.b and we got taskA.c before, then send a request to workerA with both ids.
Which technology should I use? There are a few, like Celery, but it's hard to find the proper one just by reading the docs.
I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities of a particular kind? Ideally it would be based on entity attributes like date or client ID, but any method would work. I've even tried a regular full download and then arbitrarily killed the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default utilities for downloading from / uploading to the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save to a local file) and execute it locally against the remote GAE datastore via the remote API shell
write an admin web function for select + export + zip: a new URL in a handler, uploaded to GAE, and called over HTTP (see the sketch below)
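A rough sketch of the second option for a first-generation (Python 2.7, webapp2 + ndb) app; the Record kind, its fields and the /admin/export URL are made up for illustration, and the route should be protected with login: admin in app.yaml:

```python
# Rough sketch of the "admin web function" option; Record and its fields are
# stand-ins for whatever kind you want to sample.
import webapp2
from google.appengine.ext import ndb

class Record(ndb.Model):                    # placeholder kind
    client_id = ndb.StringProperty()
    created = ndb.DateTimeProperty()

class ExportHandler(webapp2.RequestHandler):
    def get(self):
        # Filter on an attribute (client id here) and cap the result size
        q = Record.query(Record.client_id == self.request.get("client"))
        rows = ["%s,%s" % (r.client_id, r.created) for r in q.fetch(1000)]
        self.response.headers["Content-Type"] = "text/csv"
        self.response.headers["Content-Disposition"] = "attachment; filename=subset.csv"
        self.response.write("\n".join(rows))

app = webapp2.WSGIApplication([("/admin/export", ExportHandler)])
```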