Bringing machine learning to live production with an AWS Lambda function - Python

I am currently working on implementing Facebook Prophet for a live production environment. I haven't done that before, so I want to present my plan here and hope you can give me some feedback on whether that's an okay solution, or any suggestions you may have.
1. Within Django, I create a .csv export of the relevant data I need for my predictions. These .csv exports are uploaded to an AWS S3 bucket.
2. From there I can access the S3 bucket with an AWS Lambda function, where the "heavy" calculations happen.
3. Once done, I take the forecasts from step 2 and save them again in a forecast.csv export.
4. Now my Django application can access forecast.csv on S3 and get the respective predictions.
I am especially curious whether an AWS Lambda function is the right tool in this case. The exports could probably also be saved in DynamoDB (?), but I am trying to keep my v1 simple, hence the .csv files. There is still some effort involved in installing the right layers/packages for AWS Lambda, so I want to make sure I am heading in the right direction before diving deep into its documentation.
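The plan above could be sketched roughly as follows. This is a minimal, untested sketch: the bucket layout, the forecasts/ output prefix, the 30-day horizon, and the assumption that the export already has Prophet's ds/y columns are all my own choices, not part of the question.

```python
import io
import os


def forecast_key_for(input_key):
    """Derive the output key from the input key, e.g.
    'exports/sales.csv' -> 'forecasts/sales_forecast.csv'.
    The naming scheme is an assumption for illustration."""
    base = os.path.splitext(os.path.basename(input_key))[0]
    return "forecasts/{}_forecast.csv".format(base)


def handler(event, context):
    # Heavy dependencies are imported lazily inside the handler; they
    # must be provided via a Lambda layer or container image.
    import boto3
    import pandas as pd
    from prophet import Prophet

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]  # S3 "object created" notification
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    # Assumes the export already has Prophet's required 'ds' and 'y' columns
    df = pd.read_csv(io.BytesIO(obj["Body"].read()))

    m = Prophet()
    m.fit(df)
    future = m.make_future_dataframe(periods=30)  # horizon is arbitrary here
    forecast = m.predict(future)

    out = forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].to_csv(index=False)
    s3.put_object(Bucket=bucket, Key=forecast_key_for(key), Body=out.encode("utf-8"))
```

Wiring the bucket's "object created" event to this function means step 2 runs automatically whenever Django uploads a new export.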

I am a little concerned about using AWS Lambda for the "heavy" calculations. There are a few reasons.
Package size limit: AWS Lambda has a 250 MB size limit on the unzipped deployment package. This is the biggest limitation we faced, as you may not be able to fit all the libraries you need (numpy, pandas, matplotlib, etc.) within it.
Disk size limit: AWS only provides a maximum of 512 MB of /tmp disk space for a Lambda execution, which can become a problem if you want to save intermediate results to disk.
The cost can skyrocket: if your Lambda runs for a long time instead of as many small invocations, you will end up paying a lot of money. In that case, I think you will be better off with something like EC2 or ECS.
You could evaluate linking the S3 bucket to an SQS queue, with a process running on an EC2 machine that listens to the queue and performs all the calculations.
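That S3-to-SQS-to-EC2 pattern might look something like the sketch below. The heavy-work callback is injected so the loop itself has no AWS dependency; the queue URL and handler are placeholders.

```python
import json


def parse_s3_event(message_body):
    """Extract (bucket, key) pairs from an S3 event notification
    delivered through SQS. Pure function, so it is easy to test."""
    event = json.loads(message_body)
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in event.get("Records", [])
    ]


def poll_forever(queue_url, handle):
    """handle(bucket, key) does the heavy calculation work; it is
    injected so this loop stays testable without AWS access."""
    import boto3  # lazy import: only needed on the EC2 worker

    sqs = boto3.client("sqs")
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            for bucket, key in parse_s3_event(msg["Body"]):
                handle(bucket, key)
            # Delete only after successful processing, so a crash lets
            # the message reappear after the visibility timeout
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

Unlike Lambda, this worker has no 15-minute cap, so long Prophet fits are not a problem.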
See the AWS Lambda limits documentation.

Related

I want to trigger a Python script using a Cloud Function whenever a specified file is created in Google Cloud Storage

One csv file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or system crash, the file upload happens very late. So I want to create a Cloud Function that can trigger my Python BQ load script whenever the file is uploaded to storage.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks a full description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
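A storage-triggered function for the OP's case might look like the sketch below. The dataset/table name, the assumed ISO date format in seller_data_{date} (the question doesn't specify it), and the load-job settings are all illustrative assumptions.

```python
import re


def is_seller_file(name):
    """The question's files look like 'seller_data_{date}'. The date
    format is not given, so an ISO date (YYYY-MM-DD) is assumed here;
    the filter keeps unrelated uploads from triggering a load."""
    return re.fullmatch(r"seller_data_\d{4}-\d{2}-\d{2}", name) is not None


def on_upload(event, context):
    """Entry point for a Cloud Function with a
    google.storage.object.finalize trigger on the bucket."""
    name = event["name"]
    if not is_seller_file(name):
        return

    from google.cloud import bigquery  # lazy import: only in the real function

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumes a header row
        autodetect=True,
    )
    uri = "gs://{}/{}".format(event["bucket"], name)
    # 'my_dataset.seller_data' is a placeholder destination table
    client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config).result()
```

Because the trigger fires on upload, it doesn't matter whether the file arrives at 0200 or hours late.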
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically to handle this check: see whether the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function. Once it receives a GET (or POST) request (which could come from a simple curl request, a visit from the browser, a Pub/Sub event, or even another Cloud Function), it downloads the object into the /tmp dir, processes it and then does something with it.
The small benefit of GAE over CF is that you can set a minimum of one instance to stay alive at all times, which means you will not have cold starts or risk the request timing out before the job is done.
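Keeping one instance warm is a small scaling setting in app.yaml; the runtime and values below are just an example sketch:

```yaml
# app.yaml - keep at least one instance alive to avoid cold starts
runtime: python39
automatic_scaling:
  min_instances: 1
```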
The brutal/overkill way: Cloud Run.
A similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and handle the other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are handled the same way:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that's the directory for both GAE and CF to store temporary files. Cloud Run is a bit different here, but let's not get deep into it as it's overkill by itself.
However, keep in mind that if your file is large you might cause a high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open ..., as it will make sure files are not left open.
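Both pieces of advice together, as a minimal sketch (the processing step is a stand-in):

```python
import os


def process(path):
    # 'with open' guarantees the handle is closed even if parsing raises
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()
    result = len(data)  # stand-in for the real processing
    os.remove(path)     # always clean /tmp so a reused instance doesn't fill up
    return result
```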
b) Downloading the latest object in the bucket:
This is a bit tricky and needs some extra custom code. There are many ways to achieve it, but the one I use (though always paying close attention to memory usage) is this: upon creation of the object I upload to the bucket, I get the current time and use a regex to transform it into a name like results_22_6.
What happens then is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do is check whether the filename I have in /tmp is the same as the name of the last object in the bucket. If yes, I do nothing; if no, I delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's preferable.
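One caveat worth sketching: listing order is lexicographic, so a name like results_22_6 does not generally sort chronologically (results_9_6 sorts after results_10_6). A zero-padded timestamp avoids that; the prefix and format below are my own choices.

```python
from datetime import datetime, timezone


def timestamped_name(prefix="results"):
    # Zero-padded UTC timestamp: lexicographic order == chronological order
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    return "{}_{}".format(prefix, stamp)


def latest_object(names):
    """Return the lexicographically greatest name, which is the newest
    upload only when the names embed a zero-padded timestamp."""
    return max(names) if names else None
```

In the real script, names would come from listing the bucket's objects rather than a plain list.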

Large Scale Processing of S3 Images

I have roughly 80tb of images hosted in an S3 bucket which I need to send to an API for image classification. Once the images are classified, the API will forward the results to another endpoint.
Currently, I am thinking of using boto to interact with S3 and perhaps Apache airflow to download these images in batches and forward them to the classification API, which will forward the results of the classification to a web app for display.
In the future I want to automatically send any new image added to the S3 bucket to the API for classification. To achieve this I am hoping to use AWS Lambda and S3 notifications to trigger the function.
Would this be the best practice for such a solution?
Thank you.
For your future scenario, yes, that approach would be sensible:
Configure Amazon S3 Events to trigger an AWS Lambda function when a new object is created
The Lambda function can download the object (to /tmp/) and call the remote API
Make sure the Lambda function deletes the temporary file before exiting, since the Lambda container might be reused and there is a 512 MB storage limit
Please note that the Lambda function will trigger on a single object, rather than in batches.
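The three bullets above might be sketched as the handler below. The API endpoint is a placeholder, and sending raw bytes as application/octet-stream is an assumption about the classification API, not something the question specifies.

```python
import os
import urllib.request

API_URL = "https://api.example.com/classify"  # hypothetical endpoint


def build_classify_request(path, api_url=API_URL):
    """Build the POST request that sends the image bytes to the API."""
    with open(path, "rb") as f:
        data = f.read()
    return urllib.request.Request(
        api_url,
        data=data,
        headers={"Content-Type": "application/octet-stream"},
    )


def handler(event, context):
    import boto3  # provided in the Lambda Python runtime

    s3 = boto3.client("s3")
    for record in event["Records"]:  # one record per created object
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        local = "/tmp/" + os.path.basename(key)
        s3.download_file(bucket, key, local)
        try:
            urllib.request.urlopen(build_classify_request(local))
        finally:
            os.remove(local)  # /tmp persists across warm invocations
```

The try/finally guarantees the temp file is removed even if the API call fails, which matters given the 512 MB /tmp limit.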

Backing up a postgresql database table with python in lambda

I am attempting to write a python script which will run in AWS Lambda, back up a PostgreSQL database table which is hosted in Amazon RDS, then dump a resulting .bak file or similar to S3.
I'm able to connect to the database and make changes to it, but I'm not quite sure how to go about the next steps. How do I actually back up the DB and write it to a backup file in the S3 bucket?
Depending on how large your database is, Lambda may not be the best solution. Lambdas have limits of 512 MB of /tmp disk space, a 15-minute timeout, and 3008 MB of memory. Maxing out these limits may also be more expensive than other options.
Using EC2 or Fargate along with boto or the AWS CLI may be a better solution. Here is a blog entry that walks through a solution:
https://francescoboffa.com/using-s3-to-store-your-mysql-or-postgresql-backups
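On EC2 or Fargate, the usual approach is to shell out to pg_dump and upload the result, roughly as below. Host, user, and table names are placeholders, and the password is assumed to arrive via the PGPASSWORD environment variable or a secrets manager rather than the command line.

```python
import subprocess


def dump_command(host, user, dbname, table, outfile):
    """Build a pg_dump invocation for a single table (plain SQL format).
    Connection details are placeholders; PGPASSWORD supplies the password."""
    return ["pg_dump", "-h", host, "-U", user, "-t", table, "-f", outfile, dbname]


def backup_table_to_s3(host, user, dbname, table, bucket, key):
    outfile = "/tmp/{}.sql".format(table)
    subprocess.run(dump_command(host, user, dbname, table, outfile), check=True)
    import boto3  # lazy import so the command builder stays testable anywhere

    boto3.client("s3").upload_file(outfile, bucket, key)
```

Keeping the command builder separate from the upload makes the risky part (argument order) easy to unit-test without a database.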
The method that worked for me was to create an AWS Data Pipeline to back up the database to CSV.

Is it possible to perform real-time communication with a Google Compute Engine instance?

I would like to run a program on my laptop (Gazebo simulator) and send a stream of image data to a GCE instance, where it will be run through an object-detection network and sent back to my laptop in near real-time. Is such a set-up possible?
My best idea right now is, for each image:
Save the image as a JPEG on my personal machine
Stream the JPEG to a Cloud Storage bucket
Access the storage bucket from my GCE instance and transfer the file to the instance
In my python script, convert the JPEG image to numpy array and run through the object detection network
Save the detection results in a text file and transfer to the Cloud Storage bucket
Access the storage bucket from my laptop and download the detection results file
Convert the detection results file to a numpy array for further processing
This seems like a lot of steps, and I am curious if there are ways to speed it up, such as reducing the number of save and load operations or transporting the image in a better format.
If your question is "is it possible to set up such a system and do those actions in real time?" then I think the answer is yes. If your question is "how can I reduce the number of steps?" then I am not sure I can help, will defer to one of the experts on here, and can't wait to hear the answer!
I have implemented a system that I think is similar to what you describe, for research on Forex trading algorithms (e.g. upload data to storage from my laptop, Compute Engine workers pull the data and work on it, post results back to storage, and I download the compiled results back to my laptop).
I used the Google Pub/Sub architecture - apologies if you have already read up on this. It allows near-real-time messaging between programs. For example, you can have code looping on your laptop that scans a folder and looks out for new images. When they appear, it automatically uploads the files to a bucket, and once they're in the bucket it can send a message to the instance(s) telling them that there are new files there to process; alternatively, you can use the "change notification" feature of Google Storage buckets. The instances can do the work, send the results back to storage and send a notification to the code running on your laptop that the work is done and the results are available for pick-up.
Note that I set this up for my project above and encountered problems to the point that I gave up on Pub/Sub. The reason was that the Python client library for Pub/Sub only supports 'asynchronous' message pulls, which seems to mean that the subscribers will pull multiple messages from the queue and process them in parallel. There are some features to help manage 'flow control' of messages built into the API, but even with them implemented I couldn't get it to work the way I wanted. For my particular application I wanted to process everything in order, one file at a time, because it was important to me to be clear about what the instance is doing and the order it's doing it in. There are several threads on Google Search, Stack Overflow and Google Groups that discuss workarounds for this using queues, classes, allocating specific tasks to specific instances, etc., which I tried, but even these presented problems for me. Some of these links are:
"Run synchronous pull in PubSub using Python client API" and "pubsub problems pulling one message at a time" - and there are plenty more if you would like them!
You may find that if the processing of an image is relatively quick, order isn't too important and you don't mind an instance working on multiple things in parallel that my problems don't really apply to your case.
FYI, I ended up just making a simple loop on my 'worker instances' that scans the 'task list' bucket every 30 seconds or whatever to look for new files to process, but obviously this isn't quite the real-time approach that you were originally looking for. Good luck!
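That fallback polling loop can be sketched like this. The listing and processing callbacks are injected so the loop carries no cloud dependency; in production list_bucket would wrap a Storage client's blob listing.

```python
import time


def find_new(listed, seen):
    """Return names present in the bucket listing but not processed yet."""
    return sorted(set(listed) - set(seen))


def worker_loop(list_bucket, process, poll_seconds=30, max_iterations=None):
    """Poll the 'task list' bucket and process new files as they appear.
    max_iterations exists only to make the loop testable; the real
    worker runs with the default (forever)."""
    seen = set()
    i = 0
    while max_iterations is None or i < max_iterations:
        for name in find_new(list_bucket(), seen):
            process(name)
            seen.add(name)
        i += 1
        if max_iterations is None or i < max_iterations:
            time.sleep(poll_seconds)
```

As the answer notes, polling every 30 seconds trades the near-real-time latency of Pub/Sub for a loop whose ordering behavior is completely predictable.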

Working around AWS Lambda space limitations

I am doing an example of a simple linear regression in Python and I want to make use of Lambda functions to make it work on AWS, so that I can interface it with Alexa. The problem is my Python package is 114 MB. I have tried to separate the package and the code so that I have two Lambda functions, but to no avail. I have tried every possible way on the internet.
Is there any way I could upload the packages to S3 and read them from there, like how we read csv's from S3 using the boto3 client?
Yes, you can upload the package to S3. There is a limit for that as well: the unzipped package must still be under 250 MB currently. https://docs.aws.amazon.com/lambda/latest/dg/limits.html
Here's a simple command to do that.
aws lambda update-function-code --function-name FuncName --zip-file fileb://path/to/zip/file --region us-west-2
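When the zip is too large to upload directly, you can stage it in S3 and point update-function-code at the object instead; the bucket and key names below are placeholders:

```shell
# Stage the deployment package in S3 (bucket/key are placeholders)
aws s3 cp function.zip s3://my-deploy-bucket/function.zip

# Update the function from the S3 object instead of uploading directly
aws lambda update-function-code --function-name FuncName \
    --s3-bucket my-deploy-bucket --s3-key function.zip --region us-west-2
```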
There are certain limitations when using AWS Lambda.
1) The total size of your uncompressed code and dependencies should be less than 250 MB.
2) The size of your zipped code and dependencies should be less than 50 MB for direct uploads (larger zips must go through S3).
3) The total size of all function packages in a region should not exceed 75 GB.
If you are exceeding the limits, try finding smaller libraries with fewer dependencies, or break your functionality into multiple microservices rather than building code which does all the work for you. That way you don't have to include every library in each function. Hope this helps.
