I want to write images to AWS S3. As a video plays, I am trying to process frames through some functions, and when that is done I want to store each result at a specific path. imageio checks the extension in the file name and writes the image in the appropriate format.
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
obj = bucket.Object(filepath+'/'+second+'.jpg')
img.imwrite(obj)
If I write the image to a local file first and then upload that file to S3, it works, but is there a better way to store it in S3 without having to write it locally?
Any help is appreciated.
You can use BytesIO from Python's io package to create the file object in memory and pass that to boto3, like this:
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
in_memory_file = BytesIO()
# No file name means no extension to infer the format from, so if img is the imageio
# module, pass the image and format explicitly: img.imwrite(in_memory_file, image, format='jpg')
img.imwrite(in_memory_file)
in_memory_file.seek(0)  # rewind the buffer before uploading
obj = bucket.Object(filepath + '/' + second + '.jpg')
obj.upload_fileobj(in_memory_file)
This should solve the problem, without having to write the file to disk.
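For context, here is a minimal sketch of what the whole frame-processing loop could look like. It assumes the classic imageio v2 API with the ffmpeg plugin available for reading video; the bucket name, video path, and process_frame function are placeholders for whatever you already have.
import boto3
import imageio
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')  # placeholder bucket name

reader = imageio.get_reader('video.mp4')  # placeholder video file; needs the ffmpeg plugin
for frame_number, frame in enumerate(reader):
    processed = process_frame(frame)  # placeholder for your own processing functions

    buffer = BytesIO()
    imageio.imwrite(buffer, processed, format='jpg')  # format is explicit: no extension to infer
    buffer.seek(0)

    bucket.Object('frames/{}.jpg'.format(frame_number)).upload_fileobj(buffer)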
It is my first time using Azure Storage and ORC.
Here is what I have learned so far: I am able to download an ORC blob from Azure Storage and save it to disk. Once the download completes, I can iterate over the ORC file using the pyorc library in Python. The files are mostly small and easily fit into memory. My question is: instead of writing to a file, I would like to keep the blob in memory and iterate over it, avoiding the write to disk. I can download the blob into a stream, but I am not sure how to use pyorc with that stream, and I cannot find documentation for it.
I appreciate any help and best practices for Azure Storage downloads.
Regarding the issue, please refer to the following steps
import io

import pyorc
from azure.storage.blob import BlobClient

key = 'account key'
blob_client = BlobClient(account_url='https://<accountname>.blob.core.windows.net',
                         container_name='test',
                         blob_name='my.orc',
                         credential=key)

with io.BytesIO() as f:
    blob_client.download_blob().readinto(f)
    f.seek(0)  # rewind the buffer so pyorc reads from the start
    reader = pyorc.Reader(f)
    print(next(reader))
I want to thank Jim Xu for his solution. I slightly modified it to fit my needs, in case anyone is interested:
from io import BytesIO

import pyorc
from azure.storage.blob import ContainerClient

containerClient = ContainerClient.from_connection_string(azureConnString, container_name=azureContainer)
blobList = containerClient.list_blobs(azureBlobFolder)

for fileNo, blob in enumerate(blobList):
    blobClient = containerClient.get_blob_client(blob=blob.name)
    with BytesIO() as f:
        blobClient.download_blob().readinto(f)
        f.seek(0)  # rewind before handing the buffer to pyorc
        reader = pyorc.Reader(f)
        print(next(reader))
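As a side note, since the ORC blobs fit in memory anyway, the same thing can be done by materializing the bytes first. This is just a sketch that assumes the v12 azure-storage-blob SDK, where download_blob() returns a downloader with a readall() method, and reuses blobClient from the snippet above.
# Alternative: pull the whole blob into bytes, then wrap it in BytesIO for pyorc
data = blobClient.download_blob().readall()
with BytesIO(data) as f:
    reader = pyorc.Reader(f)
    for row in reader:
        print(row)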
I have a file called data.parquet.gzip in my S3 bucket, and I can't figure out how to read it. I've normally worked with StringIO, but I don't know how to apply that here. I want to load it from S3 into my Python Jupyter notebook session using pandas and boto3.
The solution is actually quite straightforward.
import boto3              # For reading from / pushing to the S3 bucket
import pandas as pd       # Reading parquet files
from io import BytesIO    # Wrapping raw bytes in a file-like object
import pyarrow            # Parquet engine used by pandas
# Set up your S3 client
# Ideally your Access Key and Secret Access Key are stored in a credentials file already,
# so you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)
# Fetch the object from S3
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)
# Read the response body, i.e. convert it from a stream to bytes using .read()
parquet_bytes = s3_response_object['Body'].read()
# Wrap the bytes in BytesIO and let pandas read the parquet
df = pd.read_parquet(BytesIO(parquet_bytes))
If you are using an IDE on your laptop/PC to connect to AWS S3, you may refer to Corey's first solution:
import io

import boto3
import pandas as pd

s3 = boto3.resource(service_name='s3', region_name='XXXX',
                    aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')

buffer = io.BytesIO()
obj = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')
obj.download_fileobj(buffer)
df = pd.read_parquet(buffer)
If you are using a Glue job, you may refer to Corey's second solution in the Glue script (pandas reads s3:// paths through the s3fs package):
df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')
In case you want to read a .json file (using an IDE on your laptop/PC):
json_str = s3.Object(bucket_name='bucket_name',
                     key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
df = pd.read_json(json_str, lines=True)
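One small caveat: newer pandas versions warn when read_json is given a literal JSON string, so a defensive variant of the line above wraps the decoded text in StringIO, which pandas always accepts as a file-like source.
from io import StringIO

# Same read as above, with the JSON text wrapped in a file-like object
df = pd.read_json(StringIO(json_str), lines=True)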
I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the byte stream into pandas; however, I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an UnsupportedOperation: write error.
import pandas as pd
from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    # service_account_credentials_path is set elsewhere in my code
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob

fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes as argument a file path string, not a file handler/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're on the 2nd generation standard environment or the flexible environment, you could try using a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas, one that supports a file object/descriptor. The general approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
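For what it's worth, one such method that does accept a file object is the standard library pickle module; a minimal sketch, assuming the DataFrame was written with pandas' default, uncompressed pickle format:
import pickle

# get_byte_fileobj() is the helper from the question; it returns a rewound BytesIO
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
df = pickle.load(fileobj)  # pickle.load reads from any binary file object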
I'm a noob to AWS and Lambda, so I apologize if this is a dumb question. What I would like to be able to do is upload a spreadsheet to an S3 bucket, trigger a Lambda function based on that upload, have the Lambda load the CSV into pandas and do stuff with it, then write the dataframe back out as a CSV to a second S3 bucket.
I've read a lot about zipping a Python script with all its libraries and dependencies and uploading that, and that's a separate question. I've also figured out how to trigger Lambda when a file is uploaded to an S3 bucket and how to automatically copy that file to a second S3 bucket.
The part I'm having trouble finding any information on is the middle part: loading the file into pandas and manipulating it within pandas, all inside the Lambda function.
First question: Is something like that even possible?
Second question: how do I "grab" the file from the S3 bucket and load it into pandas? Would it be something like this?
import pandas as pd
import boto3
import json

s3 = boto3.resource('s3')

def handler(event, context):
    dest_bucket = s3.Bucket('my-destination-bucket')
    df = pd.read_csv(event['Records'][0]['s3']['object']['key'])
    # stuff to do with dataframe goes here
    s3.Object(dest_bucket.name, <code for file key>).copy_from(CopySource=df)
I really have no idea if that's even close to right; it's a complete shot in the dark. Any and all help would be really appreciated, because I'm pretty obviously out of my element!
This Lambda function is triggered by a PUT to the bucket; it then GETs the object and PUTs it into another bucket:
from __future__ import print_function
import os
import time
import json
from urllib.parse import quote

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    # Note: S3 event keys arrive URL-encoded; AWS's own blueprint uses unquote_plus here
    key = quote(event['Records'][0]['s3']['object']['key'].encode('utf8'))
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # end_path is defined in the larger script this was taken from
        s3_upload_article(response['Body'].read(), bucket, end_path)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

def s3_upload_article(html, bucket, end_path):
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')
I broke this code out from a more complicated Lambda script I have written; however, I hope it shows some of what you need to do. The PUT of the object only triggers the script. Any other actions that occur after the event is triggered are up to you to code into the script.
bucket = event['Records'][0]['s3']['bucket']['name']
key = quote(event['Records'][0]['s3']['object']['key'].encode('utf8'))
Bucket and key in the first few lines are the bucket and key of the object that triggered the event. Everything else is up to you.
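To tie this back to the pandas part of the question, a minimal sketch of that middle step might look like the following. The destination bucket name and the transformation step are placeholders, and it assumes pandas is packaged with (or layered onto) the Lambda function.
import io
from urllib.parse import unquote_plus

import boto3
import pandas as pd

s3 = boto3.client('s3')
DEST_BUCKET = 'my-destination-bucket'  # placeholder

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Load the uploaded CSV straight from the GET response into pandas
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))

    # ... manipulate df here ...

    # Write the result back out as CSV, entirely in memory, and PUT it to the second bucket
    out_buffer = io.StringIO()
    df.to_csv(out_buffer, index=False)
    s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=out_buffer.getvalue())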
I want to save BIG pandas dataframes to s3 using boto3.
Here is what I am doing now:
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so the file is owned by the user that executes the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Another thing: has anyone tried this method with big dataframes (millions of rows), and does it perform well?
How does it compare to just saving the file locally and using the boto3 copy file method?
Thanks a lot.
AWS S3 requires multipart upload for files above 5 GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig

# multipart_threshold is the size (in bytes) above which boto3 switches to multipart upload;
# its default value is 8 MB. The size of each part is controlled by multipart_chunksize.
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)

# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
There are also some limits documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
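Applied to the in-memory CSV from the question, a hedged sketch might look like this: the CSV text is encoded to bytes, wrapped in BytesIO, and handed to upload_fileobj, which honors the same TransferConfig. The bucket and key names are placeholders, and df is the dataframe from the question.
import io

import boto3
from boto3.s3.transfer import TransferConfig

config = TransferConfig(multipart_threshold=8 * 1024 * 1024)  # multipart above 8 MB
s3 = boto3.client('s3')

csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)

# upload_fileobj needs a binary file object, so encode the CSV text to bytes first
bytes_buffer = io.BytesIO(csv_buffer.getvalue().encode('utf-8'))
s3.upload_fileobj(bytes_buffer, 'bucket-name', 'file-name', Config=config)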