Save pandas dataframe to s3 - python

I want to save BIG pandas dataframes to s3 using boto3.
Here is what I am doing now:
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket="bucket-name", Key="file-name", Body=csv_buffer.getvalue())
This generates a file with the following permissions:
---------- 1 root root file-name
How can I change that so the file is owned by the user who runs the script, i.e. user "ubuntu" on an AWS instance?
This is what I want:
-rw-rw-r-- 1 ubuntu ubuntu file-name
Also, has anyone tried this method with big dataframes (millions of rows), and does it perform well?
How does it compare to saving the file locally and then using the boto3 copy file method?
Thanks a lot.

Amazon S3 requires multipart upload for objects larger than 5 GB.
You can implement it with boto3 using a reusable configuration object from the TransferConfig class:
import boto3
from boto3.s3.transfer import TransferConfig
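# multipart_threshold is the object size (in bytes) above which boto3 switches to multipart upload; the default is 8 MB.
# The size of each part is set separately by multipart_chunksize.
GB = 1024 ** 3
config = TransferConfig(multipart_threshold=5 * GB)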
# Perform the transfer
s3 = boto3.client('s3')
s3.upload_file('FILE_NAME', 'BUCKET_NAME', 'OBJECT_NAME', Config=config)
Boto3 documentation here: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/customizations/s3.html
There are also some limits, listed here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
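To make the part limits concrete, here is a small arithmetic sketch. It assumes the commonly documented S3 limits of at most 10,000 parts per multipart upload and a 5 MiB minimum part size, plus boto3's default 8 MB chunk size. The helper names are hypothetical and not part of boto3.

```python
import math

MIB = 1024 ** 2
GIB = 1024 ** 3
MAX_PARTS = 10_000        # assumed S3 limit on parts per multipart upload
MIN_PART_SIZE = 5 * MIB   # assumed S3 minimum part size (the last part may be smaller)


def part_count(object_size, chunk_size):
    """Number of parts needed to upload object_size bytes in chunk_size pieces."""
    return max(1, math.ceil(object_size / chunk_size))


def chunk_size_for(object_size, preferred=8 * MIB):
    """Smallest chunk size >= preferred that keeps the upload within MAX_PARTS."""
    needed = math.ceil(object_size / MAX_PARTS)
    return max(preferred, MIN_PART_SIZE, needed)


# A 20 GiB file fits comfortably with the default 8 MiB chunks
parts_20g = part_count(20 * GIB, chunk_size_for(20 * GIB))

# A 100 GiB file needs larger chunks to stay at or below 10,000 parts
chunk_100g = chunk_size_for(100 * GIB)
parts_100g = part_count(100 * GIB, chunk_100g)
```

The resulting chunk size could then be passed as multipart_chunksize to TransferConfig, next to multipart_threshold.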

Related

Using openpyxl with lambda

Python rookie here. I have a requirement that I have been researching for a couple of days now. It goes as follows.
I have an S3 location with a few Excel sheets containing unformatted data. I am writing a Lambda function to format them and convert them to CSV. I already have code for this, but it works on my local machine, where I pick up the Excel files from a local directory, format/transform them, and put them in a target folder. We use the openpyxl package for the transformation. Now I am migrating this to AWS, and that is where the problem arises: instead of local directories, the source and target will be S3 locations.
The data transformation logic is lengthy and I really don't want to rewrite it.
Is there a way I can handle these Excel files just as we do on a local machine?
For instance,
wb = openpyxl.load_workbook(r'C:\User\test.xlsx', data_only=True)
How can I recreate this statement or what it does in lambda with python?
You can do this with BytesIO like so:
from io import BytesIO
import openpyxl
file = readS3('test.xlsx')  # load the file contents (bytes) from S3 with boto3
wb = openpyxl.load_workbook(BytesIO(file), data_only=True)
With readS3() being implemented for example like this:
import boto3
bucket = 'BUCKET_NAME'  # replace with your bucket name
def readS3(file):
    s3 = boto3.client('s3')
    s3_data = s3.get_object(Bucket=bucket, Key=file)
    return s3_data['Body'].read()  # raw bytes of the object
Configure Boto3 like so: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
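Since the target is also an S3 location, the converted rows can be serialized in memory too. The following is a minimal sketch using only the standard library; rows_to_csv_bytes is a hypothetical helper, and sample_rows stands in for what ws.iter_rows(values_only=True) would yield from an openpyxl worksheet.

```python
import csv
import io


def rows_to_csv_bytes(rows):
    """Serialize an iterable of row tuples to UTF-8 encoded CSV bytes."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    for row in rows:
        # csv.writer writes None (an empty cell) as an empty field
        writer.writerow(row)
    return buffer.getvalue().encode("utf-8")


# Stand-in for ws.iter_rows(values_only=True) on an openpyxl worksheet
sample_rows = [("id", "name"), (1, "alice"), (2, None)]
csv_bytes = rows_to_csv_bytes(sample_rows)
```

The resulting bytes could then be written to the target with s3.put_object(Bucket=..., Key=..., Body=csv_bytes), so nothing has to touch the Lambda filesystem.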

How do I read a gzipped parquet file from S3 into Python using Boto3?

I have a file called data.parquet.gzip on my S3 bucket. I can't figure out what's the problem in reading it. Normally I've worked with StringIO but I don't know how to fix it. I want to import it from S3 into my Python jupyter notebook session using pandas and boto3.
The solution is actually quite straightforward.
import boto3 # For read+push to S3 bucket
import pandas as pd # Reading parquets
from io import BytesIO # Converting bytes to bytes input file
import pyarrow # Fast reading of parquets
# Set up your S3 client
# Ideally your Access Key and Secret Access Key are stored in a file already
# So you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)
# Get the path to the file
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)
# Read your file, i.e. convert it from a stream to bytes using .read()
parquet_bytes = s3_response_object['Body'].read()
# Wrap the bytes in BytesIO so pandas can read them like a file
df = pd.read_parquet(BytesIO(parquet_bytes))
If you are using an IDE on your laptop/PC to connect to AWS S3, you may refer to Corey's first solution:
import boto3
import pandas as pd
import io
s3 = boto3.resource(service_name='s3', region_name='XXXX',
aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')
buffer = io.BytesIO()
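s3_object = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')  # avoid shadowing the built-in "object"
s3_object.download_fileobj(buffer)
buffer.seek(0)  # rewind the buffer before handing it to pandas
df = pd.read_parquet(buffer)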
If you are using a Glue job, you may refer to Corey's second solution in the Glue script:
df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')
In case you want to read a .json file (using an IDE on your laptop/PC):
json_text = s3.Object(bucket_name='bucket_name',
                      key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
# Wrap the text in StringIO; newer pandas versions deprecate passing a raw JSON string
df = pd.read_json(io.StringIO(json_text), lines=True)

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the bytestream into pandas, however I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an
UnsupportedOperation: write
error
from io import BytesIO
import pandas as pd
from google.cloud import storage
from google.oauth2 import service_account
def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream
def _get_blob(bucket_name, path, project):
    # service_account_credentials_path is defined elsewhere in my code
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
In the pandas version documented here, pandas.read_pickle() takes a file path string as its argument, not a file handle/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard or the flexible environment you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas, which supports a file object/descriptor. In general the approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (context is different, but same general idea)
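One loading method that does accept a file object is the standard library's pickle.load(), which reads from any binary stream. The sketch below simulates the downloaded blob with an in-memory buffer; load_pickle_from_stream is a hypothetical helper name, and the payload is a plain dict rather than a DataFrame so the example runs without pandas.

```python
import pickle
from io import BytesIO


def load_pickle_from_stream(stream):
    """Rewind a binary stream and unpickle the single object stored in it."""
    stream.seek(0)
    return pickle.load(stream)


# Simulate a downloaded blob: pickle an object into an in-memory buffer
payload = {"matches": [1, 2, 3]}
byte_stream = BytesIO()
pickle.dump(payload, byte_stream)  # the position is now at the end of the buffer

restored = load_pickle_from_stream(byte_stream)
```

With the get_byte_fileobj() helper from the question, the equivalent call would be pickle.load(fileobj). Unpickling a DataFrame this way requires pandas to be importable in the same process, and only data from a trusted source should be unpickled.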

Loading data from S3 to dask dataframe

I can load the data only if I change the "anon" parameter to True after making the file public.
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'anon':False})
This is not recommended for obvious reasons. How do I load the data from S3 securely?
The backend which loads the data from s3 is s3fs, and it has a section on credentials here, which mostly points you to boto3's documentation.
The short answer is, there are a number of ways of providing S3 credentials, some of which are automatic (a file in the right place, or environment variables - which must be accessible to all workers, or cluster metadata service).
Alternatively, you can provide your key/secret directly in the call, but that of course means you must trust your execution platform and the communication between workers:
df = dd.read_csv('s3://mybucket/some-big.csv', storage_options = {'key': mykey, 'secret': mysecret})
The set of parameters you can pass in storage_options when using s3fs can be found in the API docs.
General reference http://docs.dask.org/en/latest/remote-data-services.html
If you're within your virtual private cloud (VPC) s3 will likely already be credentialed and you can read the file in without a key:
import dask.dataframe as dd
df = dd.read_csv('s3://<bucket>/<path to file>.csv')
If you aren't credentialed, you can use the storage_options parameter and pass a key pair (key and secret):
import dask.dataframe as dd
storage_options = {'key': <s3 key>, 'secret': <s3 secret>}
df = dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=storage_options)
Full documentation from dask can be found here
Under the hood, dask reads from S3 through s3fs, which relies on the same credential mechanisms as boto3, so you can set up your keys in pretty much any way boto3 supports, e.g. role-based via export AWS_PROFILE=xxxx, or by explicitly exporting the access key and secret as environment variables. I would advise against hard-coding your keys, lest you expose them to the public by mistake.
$ export AWS_PROFILE=your_aws_cli_profile_name
or
https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-credentials.html
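If the keys do need to be passed explicitly, they can still be kept out of the source by reading them from the environment. This is a minimal sketch assuming the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variable names; storage_options_from_env is a hypothetical helper, not part of dask or s3fs.

```python
import os


def storage_options_from_env(env=None):
    """Build a storage_options dict from AWS environment variables.

    Returns an empty dict when either variable is missing, which leaves
    credential lookup to the library's own defaults.
    """
    env = os.environ if env is None else env
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    return {}


# Example with an explicit mapping instead of the real environment
options = storage_options_from_env(
    {"AWS_ACCESS_KEY_ID": "example-key", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
```

The result would be passed as dd.read_csv('s3://<bucket>/<path to file>.csv', storage_options=options).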
For S3, you can use a wildcard match to fetch multiple chunked files:
import dask.dataframe as dd
# Given N number of csv files located inside s3 read and compute total record len
s3_url = 's3://<bucket_name>/dask-tutorial/data/accounts.*.csv'
df = dd.read_csv(s3_url)
print(df.head())
print(len(df))
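To see roughly which object keys such a pattern selects, the sketch below uses the standard library's fnmatch as a stand-in. This is an approximation for illustration only: dask resolves globs through its own filesystem layer, and unlike a path glob, fnmatch lets * match across "/" as well. The key names are made up.

```python
from fnmatch import fnmatchcase

pattern = "dask-tutorial/data/accounts.*.csv"

# Hypothetical object keys in the bucket
keys = [
    "dask-tutorial/data/accounts.0.csv",
    "dask-tutorial/data/accounts.1.csv",
    "dask-tutorial/data/accounts.csv",        # no chunk index between the dots
    "dask-tutorial/data/transactions.0.csv",  # different file prefix
]

matched = [key for key in keys if fnmatchcase(key, pattern)]
```

Only the two accounts.<n>.csv keys end up in matched.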

Writing to s3 using imageio and boto3

I want to write images to AWS S3. As a video plays, I am trying to process images through some functions, and when that is done I wish to store each image at a specific path. imageio checks the extension in the name directly and writes the image in the appropriate file format.
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
obj = bucket.Object(filepath+'/'+second+'.jpg')
img.imwrite(obj)
If I write this to a local location first and then upload it to S3, it works, but is there a better way to store it in S3 without having to write it locally?
Any help is appreciated.
You can use something like BytesIO from python's io package, to create the file object in memory, and pass that to the boto3 client, like this:
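from io import BytesIO
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
in_memory_file = BytesIO()
img.imwrite(in_memory_file)  # a buffer has no file extension, so the writer may need an explicit format
in_memory_file.seek(0)  # rewind so the upload starts at the first byte
obj = bucket.Object(filepath+'/'+second+'.jpg')
obj.upload_fileobj(in_memory_file)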
This should solve the problem, without having to write the file to disk.
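One detail worth checking with in-memory buffers: after writing, the stream position sits at the end, so a consumer that reads from the current position sees no data unless the buffer is rewound first. A minimal standard-library sketch of that behaviour, with placeholder bytes standing in for the encoded image:

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"fake image bytes")  # placeholder for the encoded image

# Reading now starts at the end of the buffer, so nothing is returned
at_end = buffer.read()

# After rewinding, the full contents are available again
buffer.seek(0)
from_start = buffer.read()
```

The same applies to any file-like object handed to an uploader: call seek(0) after writing and before uploading.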
