I have roughly 80 TB of images hosted in an S3 bucket which I need to send to an API for image classification. Once the images are classified, the API will forward the results to another endpoint.
Currently, I am thinking of using boto to interact with S3 and perhaps Apache Airflow to download these images in batches and forward them to the classification API, which will forward the classification results to a web app for display.
In the future I want to automatically send any new image added to the S3 bucket to the API for classification. To achieve this I am hoping to use AWS lambda and S3 notifications to trigger this function.
Would this be the best practice for such a solution?
Thank you.
For your future scenarios, yes, that approach would be sensible:
Configure Amazon S3 Events to trigger an AWS Lambda function when a new object is created
The Lambda function can download the object (to /tmp/) and call the remote API
Make sure the Lambda function deletes the temporary file before exiting, since the execution environment might be reused and /tmp has a default storage limit of 512 MB
Please note that the Lambda function will trigger on a single object, rather than in batches.
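A minimal sketch of such a trigger function, assuming a hypothetical classification endpoint that accepts a multipart image upload (the requests library is not in the Lambda runtime and would need to be packaged with the function):

import os
import urllib.parse

import boto3
import requests  # must be bundled with the deployment package or a layer

s3 = boto3.client("s3")
CLASSIFY_URL = "https://example.com/classify"  # hypothetical API endpoint

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 notifications are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)

        try:
            with open(local_path, "rb") as f:
                requests.post(CLASSIFY_URL, files={"image": f}, timeout=30)
        finally:
            # Clean up /tmp because the execution environment may be reused
            os.remove(local_path)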
I'm trying to do the following:
when I upload a file to my S3 storage, the Lambda picks up this JSON file and converts it into a CSV file.
How can I specify in the Lambda code which file it must pick up?
example of my code in local:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
In this example, I provide the name of the file... but how can I manage that in a Lambda?
I think I don't understand how Lambda works... could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage with a default size of 512 MB.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
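For example, a minimal sketch of that in-memory approach (the bucket and key names here are placeholders, and pandas must be provided via a layer or the deployment package):

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Read the JSON object straight from S3 into a DataFrame; nothing touches /tmp
    obj = s3.get_object(Bucket="my-bucket", Key="movies.json")
    df = pd.read_json(io.BytesIO(obj["Body"].read()))

    # Write the CSV back to S3 directly from memory
    csv_bytes = df.to_csv(index=False).encode("utf-8")
    s3.put_object(Bucket="my-bucket", Key="csv-movies.csv", Body=csv_bytes)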
In order to achieve this, you will need to use S3 as the event source for your Lambda. There's a useful tutorial for this provided by AWS themselves, with some sample Python code to assist you; you can view it here.
To break it down slightly further and answer how you get the name of the file: the Lambda handler will look similar to the following:
def lambda_handler(event, context):
What is important here is the event object. When your event source is the S3 bucket, you will be given the name of the bucket and the S3 key in that object, which is effectively the path to the file in the bucket. With this information you can apply some logic to decide whether you want to download the file from that path. If you do, you can use the S3 get_object() API call as shown in the tutorial.
Once this file is downloaded, it can be used like any other file you would have on your local machine, so you can then proceed to convert the JSON to a CSV. Once it is converted, you will presumably want to put it back in S3; for this you can use the S3 put_object() call and reuse the information in the event object to specify the path.
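Putting those pieces together, a minimal sketch of such a handler (the "csv/" output prefix is an assumption, and pandas must be provided via a layer or the deployment package):

import os
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")

def lambda_handler(event, context):
    for record in event["Records"]:
        # Bucket name and object key come from the S3 event notification
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Download the JSON to ephemeral storage
        json_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, json_path)

        # Same conversion as in the local example
        df = pd.read_json(json_path)
        csv_path = json_path.rsplit(".", 1)[0] + ".csv"
        df.to_csv(csv_path, index=False)

        # Upload the result so it outlives the execution environment
        s3.upload_file(csv_path, bucket, "csv/" + os.path.basename(csv_path))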
I have CSV files coming into a folder in a Cloud Storage bucket and I want to create a Cloud Function that opens the CSV, adds a new column to it and then saves the result to a new bucket as a new.csv file.
Is there a way to do that using a Python Cloud Function?
Thanks in advance.
The idea that you’re trying to implement is totally possible and can be achieved using Google Cloud Functions.
For that, you would need to create a storage-triggered Cloud Function. More specifically, you can create your function in a way that it will respond to change notifications emerging from your Google Cloud Storage.
These notifications can be configured to respond to various events inside a bucket: object creation, deletion, archiving and metadata updates.
For the situation described, you will need to use the trigger google.storage.object.finalize.
This event is sent when a new object is created in the bucket or an existing object is overwritten, and a new generation of that object is created.
Here you can find a sample code of a storage-triggered Cloud Function written in Python, while this tutorial will give you a more detailed overview of the usage of storage-triggered functions.
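A minimal sketch of such a function, using the 1st-gen background-function signature (the destination bucket name and the added column are assumptions; pandas and google-cloud-storage would go in requirements.txt):

import io

import pandas as pd
from google.cloud import storage

DEST_BUCKET = "my-destination-bucket"  # hypothetical target bucket

def add_column(event, context):
    """Triggered by google.storage.object.finalize."""
    client = storage.Client()
    source_blob = client.bucket(event["bucket"]).blob(event["name"])

    # Read the uploaded CSV into a DataFrame and add a column
    df = pd.read_csv(io.BytesIO(source_blob.download_as_bytes()))
    df["processed"] = True  # placeholder for whatever the new column should hold

    # Save the result as new.csv in the destination bucket
    out = io.StringIO()
    df.to_csv(out, index=False)
    client.bucket(DEST_BUCKET).blob("new.csv").upload_from_string(
        out.getvalue(), content_type="text/csv"
    )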
I am currently working on implementing Facebook Prophet for a live production environment. I haven't done that before so I wanted to present you my plan here and hope you can give me some feedback whether that's an okay solution or if you have any suggestions.
1. Within Django, I create a .csv export of relevant data which I need for my predictions. These .csv exports will be uploaded to an AWS S3 bucket.
2. From there I can access this S3 bucket with an AWS Lambda function where the "heavy" calculations happen.
3. Once done, I take the forecasts from 2. and save them again in a forecast.csv export.
4. Now my Django application can access the forecast.csv on S3 and get the respective predictions.
I am especially curious whether an AWS Lambda function is the right tool in this case. The exports could probably also be saved in DynamoDB (?), but I am trying to keep my v1 simple, hence .csv. There is still some effort involved in installing the right layers/packages for AWS Lambda, so I want to make sure I am walking in the right direction before diving deep into its documentation.
I am a little concerned about using AWS Lambda for the "heavy" calculations. There are a few reasons.
Binary size limit: AWS Lambda has a size limit of 250 MB (unzipped) for the deployment package. This was the biggest limitation we faced, as you will not be able to include all the libraries like numpy, pandas, matplotlib, etc. in that package.
Disk size limit: AWS only provides a maximum of 512 MB of disk space by default for Lambda execution, which can become a problem if you want to save intermediate results to disk.
The cost can skyrocket: If your Lambda is going to run for a long time instead of as multiple small invocations, you will end up paying a lot of money. In that case, I think you will be better off with something like EC2 or ECS.
You can evaluate linking the S3 bucket to an SQS queue and a process running on an EC2 machine which listens to the queue and performs all the calculations.
AWS Lambda Limits.
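A minimal sketch of that SQS-polling consumer on EC2 (the queue name is an assumption, and the bucket would need to publish its notifications to that queue):

import json

import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="forecast-jobs")["QueueUrl"]  # hypothetical queue

while True:
    # Long-poll for S3 notification messages
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            # ... run the heavy Prophet calculation on s3://bucket/key here ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])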
New JSON files are dumped into an S3 bucket daily. I have to create a solution which picks up the latest file when it arrives, parses the JSON and loads it into a Snowflake data warehouse. Can someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process it as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
edit: Here is a more detailed explanation of how to create events from S3 and trigger Lambda functions. The documentation is provided by Snowflake.
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to be considered, such as: is it batch or streaming data? Do you want to retry loading the file in case the data or format is wrong? Do you want to make it a generic process able to handle different file formats/types (CSV/JSON) and stages?
In our case we have built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the Snowpipe REST endpoint insertFiles.
Phasing:
detect the S3 change
get the S3 object path
parse the content and transform it to CSV in S3 (same bucket, or another one that Snowpipe can connect to)
call the Snowpipe REST API
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
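A minimal sketch of the final step, assuming the snowflake-ingest Python SDK (which handles signing the JWT for you); the account, host, user, pipe name, key path and file path are all placeholders:

from snowflake.ingest import SimpleIngestManager, StagedFile

# Private key of the Snowflake user that was registered with a public key
with open("/path/to/rsa_key.pem") as f:
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account="my_account",
    host="my_account.snowflakecomputing.com",
    user="INGEST_USER",
    pipe="MYDB.MYSCHEMA.MYPIPE",
    private_key=private_key,
)

# Path of the CSV that was written to the stage's S3 location
response = ingest_manager.ingest_files([StagedFile("path/to/file.csv", None)])
print(response)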
I also tried with a Talend job with TOS BigData.
Hope it helps.
I want to copy keys between buckets in 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked:
import boto3

source = boto3.client('s3')
destination = boto3.client('s3')
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching the data with GET and writing it with PUT into the other account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is really very slow. It takes around 15-20 seconds to copy 1 GB of data, and I have to copy more than 100 GB.
I tried Python multithreading, where each thread performs a copy operation. The performance was bad, as it took 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I am getting the same result as a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results talk about copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend calls to do the work. Your approach is wasting resources.
boto3's client.copy will do the job better than this.
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copy job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why the bucket name MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replication, copying and moving files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is almost like a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and the file copy from another user is done entirely within the file server, with the fastest raw disk I/O performance. You don't need a powerful instance or a high-performance network to use the backend S3 copy API.
Your method is like attempting to download a file from another user over FTP while using the same file server, which creates unwanted network traffic.
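Putting that together, a minimal sketch of the server-side copy with boto3's managed copy (bucket names and the key are placeholders; the credentials used must have read access to the source bucket, e.g. via a bucket policy, and write access to the destination bucket):

import boto3

# Client running with the destination account's credentials
s3 = boto3.client('s3')

copy_source = {'Bucket': 'source-bucket', 'Key': 'path/to/object'}

# Managed copy: S3 copies the object in the backend (multipart for large objects),
# so the data never passes through this machine
s3.copy(copy_source, 'destination-bucket', 'path/to/object')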
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")