s3 - auto compression before transfer - python

I want to know:
Does any available tool compress a file by default and transfer it to S3,
or does a tool have an option to compress and transfer a file to S3,
or do I have to call Python libraries to compress the file and then transfer it to S3?

You can do this with a few lines of Python if you want to.
import gzip
import json
import boto3
# Data to compress
data = [{'name': 'test'}, {'name': 'test2'}]
# Convert the data to a JSON string
json_str = json.dumps(data) + "\n"
# Encode the string as bytes
json_bytes = json_str.encode('utf-8')
jsonfilename = "s3_compressed_file.json.gz"
# Compress to a local gzip file
with gzip.GzipFile(jsonfilename, 'w') as f:
    f.write(json_bytes)
# Upload to S3
s3BucketName = 'mybucket'
s3_resource = boto3.resource('s3')
# Key name to use in S3, if you want to rename the file while uploading
file_name = 's3_compressed_file-1.json.gz'
# Arguments: local file path, bucket name, destination key
s3_response = s3_resource.meta.client.upload_file(jsonfilename, s3BucketName,
                                                  'destination_dir/{}'.format(file_name))
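If the local .gz file is not needed afterwards, the same upload can probably be done without touching disk. The sketch below is one way to do it, assuming the payload fits in memory. The helper name gzip_json_to_s3 is made up for illustration, and s3_client is assumed to be a boto3.client('s3') (any object with a compatible put_object method works, which keeps the function easy to test).

```python
import gzip
import json


def gzip_json_to_s3(s3_client, records, bucket, key):
    """Serialise records as JSON, gzip the bytes in memory, and upload them."""
    payload = (json.dumps(records) + "\n").encode("utf-8")
    compressed = gzip.compress(payload)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=compressed,
        ContentType="application/json",
        ContentEncoding="gzip",  # stored as object metadata so readers know the body is gzipped
    )
    # Return both sizes so the caller can log the compression ratio.
    return len(payload), len(compressed)
```

A call would look like gzip_json_to_s3(boto3.client('s3'), data, 'mybucket', 'destination_dir/s3_compressed_file-1.json.gz').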

Related

save a zip file downloaded in AWS EC2 to a gzip file in S3, using python boto3 in memory

I appreciate this question is quite specific, but I believe it should be a common problem. I've solved parts of it but not the entire chain.
Input:
On an AWS EC2 instance, I download a zip-compressed file from the internet.
Output:
I save the gzip-compressed file to an S3 bucket.
I see two ways of doing this:
saving temporary files on EC2 and then copying them to S3, or
converting the data in memory on EC2 and saving it directly to S3.
I know how to do the first option, but because of resource constraints, and because I need to download a lot of files, I would like to try the second option. This is what I have so far:
import requests, boto3, gzip
zip_data = requests.get(url).content
# I can save a temp zip file in EC2 like this, but I would like to avoid it
with open(zip_temp, 'wb') as w:
    w.write(zip_data)
# missing line that decompresses the zipped file in memory and returns a bytes object, I think?
# like: data = SOMETHING(zip_data)
gz_data = gzip.compress(data)
client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
Besides, are there any general considerations I should think about when deciding which option to go for?
It turns out it was quite simple:
import requests, boto3, gzip
from zipfile import ZipFile
from io import BytesIO
zip_data = requests.get(url).content
with ZipFile(BytesIO(zip_data)) as myzip:
    with myzip.open('zip_file_inside.csv') as mycsv:
        gz_data = gzip.compress(mycsv.read())
client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
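On the question of which option to choose: the in-memory route holds the downloaded zip, the extracted member, and the gzip output all at once, so peak memory is roughly their sum. A possible middle ground, sketched below under the assumption that the zip itself fits in memory, is to stream the member through the compressor in chunks so that the uncompressed data is never fully materialised. The function name and the chunk size are illustrative choices, not part of the original answer.

```python
import gzip
import io
import shutil
import zipfile


def zip_member_to_gzip(zip_bytes, member_name, chunk_size=1024 * 1024):
    """Re-compress one member of an in-memory zip archive as gzip bytes."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open(member_name) as src:
            # Closing the GzipFile writes the gzip trailer but leaves `out` open.
            with gzip.GzipFile(fileobj=out, mode="wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
    return out.getvalue()
```

The returned bytes can be passed as Body to client.put_object in the same way as gz_data above.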

How to zip files on s3 using lambda and python

I need to archive multiple files that exist on S3 and then upload the archive back to S3.
I am trying to use Lambda and Python. As some of the files are larger than 500 MB, downloading them to '/tmp' is not an option. Is there any way to stream the files one by one and add them to the archive?
Do not write to disk, stream to and from S3
Stream the Zip file from the source bucket and read and write its contents on the fly using Python back to another S3 bucket.
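This method uses no disk space, so it is not limited by the size of /tmp. The zip file is held in memory, however, so it still has to fit within the function's memory allocation.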
The basic steps are:
Read the zip file from S3 using the Boto3 S3 resource Object into a BytesIO buffer object
Open the object using the zipfile module
Iterate over each file in the zip file using the namelist method
Write the file back to another bucket in S3 using the resource meta.client.upload_fileobj method
The Code
Python 3.6 using Boto3
import zipfile
from io import BytesIO
import boto3
s3_resource = boto3.resource('s3')
# zip_key and bucket are placeholders: the key of the source zip and the destination bucket name
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=filename
    )
Note: AWS Lambda has a maximum execution time of 15 minutes, so can your largest files be processed within that limit? You can only find out by testing.
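One way to stay inside the time limit is to check how much time is left before starting each file and to stop early. The sketch below assumes the standard Lambda context object, whose get_remaining_time_in_millis() method reports the time left in the invocation. The names process_until_deadline and handle_one and the 10-second safety margin are hypothetical.

```python
def process_until_deadline(keys, handle_one, context, safety_margin_ms=10_000):
    """Process keys in order and return the ones left over when time runs short."""
    remaining = list(keys)
    while remaining:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break  # leave the rest for a follow-up invocation
        handle_one(remaining.pop(0))
    return remaining
```

What to do with the returned keys (for example, handing them to another invocation) is a separate design decision that this sketch does not cover.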
AWS Lambda code: create a zip from the files with a given extension under bucket/filePath.
import zipfile
from io import BytesIO
import boto3
s3 = boto3.resource('s3')
def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt, createUrl=False):
    response = {}
    bucket = s3.Bucket(bucketName)
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath).all()
    archive = BytesIO()
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())
    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()
    response['fileUrl'] = None
    if createUrl is True:
        s3Client = boto3.client('s3')
        response['fileUrl'] = s3Client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucketName, 'Key': bucketFilePath + '/' + jobKey + '.zip'},
            ExpiresIn=3600)
    return response
The /tmp directory of an AWS Lambda function is limited to 512 MB by default (the function's ephemeral storage setting can raise this limit).
If you search Stack Overflow, you'll see some code from people who have created zip files on the fly without saving files to disk. It becomes pretty complicated.
An alternative would be to attach an EFS filesystem to the Lambda function. It takes a bit of effort to set up, but the cost would be practically zero if you delete the files after use, and you'll have plenty of disk space, so your code will be more reliable and easier to maintain.
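For comparison, here is a rough sketch of the disk-free approach mentioned above. It relies on two things: zipfile can write to an object that only supports write(), and S3 multipart uploads accept the archive piece by piece, so neither the source files nor the finished zip has to fit in memory at once. MultipartUploadWriter and zip_s3_objects are hypothetical names; s3_client is assumed to be a boto3.client('s3'). The error handling is minimal, and this has not been tuned for production use.

```python
import shutil
import zipfile


class MultipartUploadWriter:
    """Write-only file-like object that ships buffered bytes to S3 as multipart parts."""

    def __init__(self, s3_client, bucket, key, part_size=8 * 1024 * 1024):
        self._client, self._bucket, self._key = s3_client, bucket, key
        self._part_size = part_size  # S3 requires at least 5 MiB for every part but the last
        self._buffer = bytearray()
        self._parts = []
        response = s3_client.create_multipart_upload(Bucket=bucket, Key=key)
        self._upload_id = response["UploadId"]

    def write(self, data):
        self._buffer += data
        while len(self._buffer) >= self._part_size:
            self._send(bytes(self._buffer[:self._part_size]))
            del self._buffer[:self._part_size]
        return len(data)

    def flush(self):
        pass  # parts are only sent when full, or from close()

    def _send(self, body):
        number = len(self._parts) + 1
        response = self._client.upload_part(
            Bucket=self._bucket, Key=self._key, UploadId=self._upload_id,
            PartNumber=number, Body=body)
        self._parts.append({"PartNumber": number, "ETag": response["ETag"]})

    def close(self):
        if self._buffer:
            self._send(bytes(self._buffer))
            self._buffer.clear()
        self._client.complete_multipart_upload(
            Bucket=self._bucket, Key=self._key, UploadId=self._upload_id,
            MultipartUpload={"Parts": self._parts})

    def abort(self):
        self._client.abort_multipart_upload(
            Bucket=self._bucket, Key=self._key, UploadId=self._upload_id)


def zip_s3_objects(s3_client, bucket, source_keys, dest_key, part_size=8 * 1024 * 1024):
    """Stream each source object into a zip that is uploaded while it is being built."""
    writer = MultipartUploadWriter(s3_client, bucket, dest_key, part_size)
    try:
        with zipfile.ZipFile(writer, "w", zipfile.ZIP_DEFLATED) as archive:
            for key in source_keys:
                body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
                # Members larger than about 2 GiB would also need force_zip64=True here.
                with archive.open(key, "w") as member:
                    shutil.copyfileobj(body, member, 1024 * 1024)
    except Exception:
        writer.abort()  # avoid leaving an unfinished upload behind
        raise
    writer.close()
```

Memory use stays at roughly one part plus one copy chunk, at the cost of noticeably more code than the BytesIO versions above, which is the trade-off the answer describes.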
The code below worked for me in a Glue job for a single file: it takes one .txt file from S3, zips it, and uploads the archive back to S3.
import boto3
import zipfile
from io import BytesIO
import logging
logger = logging.getLogger()
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
# Call to the zip function (made from another method of the same class)
self._createZipFileStream(
    bucketName="My_AWS_S3_bucket_name",
    bucketFilePath="My_txt_object_prefix",
    bucketfileobject="My_txt_Object_prefix + txt_file_name",
    zipKey="My_zip_file_prefix")
# Zip function definition (a method of that class)
def _createZipFileStream(self, bucketName: str, bucketFilePath: str, bucketfileobject: str, zipKey: str) -> None:
    try:
        obj = s3_resource.Object(bucket_name=bucketName, key=bucketfileobject)
        archive = BytesIO()
        with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
            with zip_archive.open(zipKey, 'w') as file1:
                file1.write(obj.get()['Body'].read())
        archive.seek(0)
        s3_client.upload_fileobj(archive, bucketName, bucketFilePath + '/' + zipKey + '.zip')
        archive.close()
        # To delete the .txt object from S3 once it has been zipped:
        self._delete_object(bucket=bucketName, key=bucketfileobject)
    except Exception as e:
        logger.error(f"Failed to zip the txt file for {bucketName}/{bucketfileobject}: {e}")
# Delete function definition (also a method of that class)
def _delete_object(self, bucket: str, key: str) -> None:
    try:
        logger.info(f"Deleting: {bucket}/{key}")
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
    except Exception as e:
        logger.error(f"Failed to delete {bucket}/{key}: {e}")

Reading doc, docx files from s3 within lambda

TL;DR: I want to read .doc and .docx files stored on S3 from my AWS Lambda function.
On my local machine I just use textract.process(file_path) to read both .doc and .docx files.
So the intuitive way to do the same on Lambda is to download the file from S3 to the function's local storage (/tmp) and then process the temporary files like I do on my local machine.
That's not cost-effective...
Is there a way to build a pipeline from the S3 object straight into some parser like textract that will convert the .doc/.docx files into a readable object such as a string?
This is my code so far for reading files such as .txt:
import boto3
print('Loading function')
def lambda_handler(event, context):
    try:  # Read the S3 file
        bucket_name = "appsresults"
        download_path = 'Folder1/file1.txt'
        filename = download_path
        s3 = boto3.resource('s3')
        content_object = s3.Object(bucket_name, filename)
        file_content = content_object.get()['Body'].read().decode('utf-8')
        print(file_content)
    except Exception as e:
        print("Couldn't read the file from S3 because:\n {0}".format(e))
    return event  # return event
This answer solves half of the problem.
textract.process currently doesn't support reading file-like objects. If it did, you could have loaded the file from S3 into memory and passed it directly to the process function.
An older version of textract used the python-docx package internally to read .docx files, and python-docx does support reading file-like objects. You can use the code below to achieve your goal, at least for .docx files.
import boto3
import io
from docx import Document
s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
s3_object = bucket.Object('/files/resume.docx')
file_stream = io.BytesIO()
s3_object.download_fileobj(file_stream)
file_stream.seek(0)  # rewind the buffer before parsing
document = Document(file_stream)
If you're reading the .docx file from S3, you don't have a local path to pass to the Document() constructor. Instead, you can read the object's bytes, wrap them in a BytesIO buffer, and pass that to the constructor like this:
import io
import boto3
from docx import Document
s3_client = boto3.client('s3')
def parseDocx(data):
    # Wrap the raw bytes in a file-like object that Document() can read
    stream = io.BytesIO(data)
    document = Document(docx=stream)
    content = ''
    for para in document.paragraphs:
        content += para.text
    return content
Key = "acb.docx"
Bucket = "xyz"
obj_ = s3_client.get_object(Bucket=Bucket, Key=Key)
if Key.endswith('.docx'):
    fs = obj_['Body'].read()
    sentence = str(parseDocx(fs))
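If adding python-docx to the Lambda package is undesirable, the plain text can also be pulled out with the standard library alone, because a .docx file is a ZIP archive whose main body lives in word/document.xml. This swaps in a different technique from the answers above and is only a rough sketch: it ignores headers, footers, tabs, and explicit line breaks, and it does nothing for the older binary .doc format. The function name docx_bytes_to_text is made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data):
    """Return the paragraph text of a .docx file given its raw bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

With the S3 response from the snippet above, the call would be docx_bytes_to_text(obj_['Body'].read()).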

Read h5 file using AWS boto3

I am trying to read h5 file from AWS S3 using boto3.
client = boto3.client('s3',key ='key')
result = client.get_object(Bucket='bucket', Key='file')
with h5py.File(result['Body'], 'r') as f:
    data = f
TypeError: expected str, bytes or os.PathLike object, not StreamingBody
Any idea?
h5py version is 2.10, boto3 version is 1.7.58
The same question was here, but no answer...
The h5py.File() call expects a path to a local file on disk (newer h5py releases also accept a seekable Python file-like object). You are passing it a StreamingBody, which is neither.
You can download the file with:
import boto3
import h5py
s3_client = boto3.client('s3')
s3_client.download_file('bucket', 'key', 'filename')
with h5py.File('filename', 'r') as f:
    data = f  # read what you need inside this block; f is closed afterwards
A working solution using tempfile for temporary storage.
This downloads the model data from your S3 bucket into temporary storage and loads it into a variable.
import tempfile
from keras import models
import boto3
# Create the low-level client
client = boto3.client(
    's3',
    aws_access_key_id='ACCESS_KEY_ID',
    aws_secret_access_key='ACCESS_SECRET_KEY',
    region_name='us-east-1'
)
# Fetch the S3 object
response_data = client.get_object(
    Bucket='bucket-name',
    Key='model/model.h5'
)
model_name = 'model.h5'
response_data = response_data['Body']
response_data = response_data.read()
# Save the bytes to temporary storage
with tempfile.TemporaryDirectory() as tempdir:
    with open(f"{tempdir}/{model_name}", 'wb') as my_data_file:
        my_data_file.write(response_data)
    # Load the model while the temporary directory still exists
    gotten_model = models.load_model(f"{tempdir}/{model_name}")
print(gotten_model.summary())
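The same download-then-load pattern can be wrapped in a small helper so that it works for any library that insists on a file path. This is a sketch under assumptions: load_from_s3 and loader are hypothetical names, s3_client is assumed to be a boto3.client('s3'), and loader stands for whatever path-based function you need (for example keras models.load_model, or a function that opens the file with h5py and copies out the datasets).

```python
import os
import tempfile


def load_from_s3(s3_client, bucket, key, loader):
    """Download an object to a temporary file and return loader(path)."""
    with tempfile.TemporaryDirectory() as tempdir:
        # Keep the original file name so extension-based loaders still work.
        local_path = os.path.join(tempdir, os.path.basename(key) or "object")
        with open(local_path, "wb") as fh:
            s3_client.download_fileobj(bucket, key, fh)
        # The loader must finish reading before the directory is removed.
        return loader(local_path)
```

Because download_fileobj writes to the file in chunks, the object is not read into memory in full before it reaches disk.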

How to write a file or data to an S3 object using boto3

In boto 2, you can write to an S3 object using these methods:
Key.set_contents_from_string()
Key.set_contents_from_file()
Key.set_contents_from_filename()
Key.set_contents_from_stream()
Is there a boto 3 equivalent? What is the boto3 method for saving data to an object stored on S3?
In boto 3, the 'Key.set_contents_from_' methods were replaced by
Object.put()
Client.put_object()
For example:
import boto3
some_binary_data = b'Here we have some data'
more_binary_data = b'Here we have some more data'
# Method 1: Object.put()
s3 = boto3.resource('s3')
object = s3.Object('my_bucket_name', 'my/key/including/filename.txt')
object.put(Body=some_binary_data)
# Method 2: Client.put_object()
client = boto3.client('s3')
client.put_object(Body=more_binary_data, Bucket='my_bucket_name', Key='my/key/including/anotherfilename.txt')
Alternatively, the binary data can come from reading a file, as described in the official docs comparing boto 2 and boto 3:
Storing Data
Storing data from a file, stream, or string is easy:
# Boto 2.x
from boto.s3.key import Key
key = Key('hello.txt')
key.set_contents_from_file('/tmp/hello.txt')
# Boto 3
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
boto3 also has a method for uploading a file directly:
s3 = boto3.resource('s3')
s3.Bucket('bucketname').upload_file('/local/file/here.txt','folder/sub/path/to/s3key')
http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Bucket.upload_file
You no longer have to convert the contents to binary before writing to the file in S3. The following example creates a new text file (called newfile.txt) in an S3 bucket with string contents:
import boto3
s3 = boto3.resource(
    's3',
    region_name='us-east-1',
    aws_access_key_id=KEY_ID,
    aws_secret_access_key=ACCESS_KEY
)
content="String content to write to a new S3 file"
s3.Object('my-bucket-name', 'newfile.txt').put(Body=content)
Here's a nice trick to read JSON from s3:
import json, boto3
s3 = boto3.resource("s3").Bucket("bucket")
json.load_s3 = lambda f: json.load(s3.Object(key=f).get()["Body"])
json.dump_s3 = lambda obj, f: s3.Object(key=f).put(Body=json.dumps(obj))
Now you can use json.load_s3 and json.dump_s3 with the same API as load and dump:
data = {"test":0}
json.dump_s3(data, "key") # saves json to s3://bucket/key
data = json.load_s3("key") # read json from s3://bucket/key
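Attaching new attributes to the json module changes it for every other module that imports json, which can be surprising. A possible alternative is a pair of plain helper functions that do the same thing. The names load_json and dump_json are hypothetical, and bucket is assumed to be a boto3 Bucket resource such as boto3.resource("s3").Bucket("bucket").

```python
import json


def load_json(bucket, key):
    """Read and parse a JSON object stored under key in the given bucket."""
    return json.load(bucket.Object(key).get()["Body"])


def dump_json(bucket, obj, key):
    """Serialise obj as JSON and store it under key in the given bucket."""
    bucket.Object(key).put(Body=json.dumps(obj).encode("utf-8"))
```

Usage mirrors the snippet above: dump_json(bucket, {"test": 0}, "key") followed by data = load_json(bucket, "key").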
A cleaner, more concise version that I use to upload files on the fly to a given S3 bucket and sub-folder:
import boto3
BUCKET_NAME = 'sample_bucket_name'
PREFIX = 'sub-folder/'
s3 = boto3.resource('s3')
# Creating an empty file called "_DONE" and putting it in the S3 bucket
s3.Object(BUCKET_NAME, PREFIX + '_DONE').put(Body="")
Note: You should ALWAYS keep your AWS credentials (aws_access_key_id and aws_secret_access_key) in a separate file, for example ~/.aws/credentials, rather than in your code.
After some research, I found this. It can be done with a simple csv writer: the code below writes a list of dictionaries as CSV directly to an S3 bucket.
e.g. data_dict = [{"Key1": "value1", "Key2": "value2"}, {"Key1": "value4", "Key2": "value3"}]
This assumes that all the dictionaries share the same keys.
import csv
from io import StringIO
import boto3
# Sample input: a list of dictionaries
data_dict = [{"Key1": "value1", "Key2": "value2"}, {"Key1": "value4", "Key2": "value3"}]
data_dict_keys = data_dict[0].keys()
# Create an in-memory text buffer
file_buff = StringIO()
# Write the CSV data to the buffer
writer = csv.DictWriter(file_buff, fieldnames=data_dict_keys)
writer.writeheader()
for data in data_dict:
    writer.writerow(data)
# Create the S3 client
client = boto3.client('s3')
# Upload to S3; file_buff.getvalue() is the CSV body of the object
client.put_object(Body=file_buff.getvalue(), Bucket='my_bucket_name', Key='my/key/including/anotherfilename.txt')
It is worth mentioning smart-open, which uses boto3 as a back end.
smart-open is a drop-in replacement for Python's open that can open files from S3, as well as FTP, HTTP, and many other protocols.
For example:
from smart_open import open
import json
with open("s3://your_bucket/your_key.json", 'r') as f:
    data = json.load(f)
The aws credentials are loaded via boto3 credentials, usually a file in the ~/.aws/ dir or an environment variable.
You can use the code below to write, for example, an image to S3 (as of 2019). To be able to connect to S3, install the AWS CLI with pip install awscli and then enter your credentials with aws configure:
import boto3
import urllib3
import uuid
from pathlib import Path
from io import BytesIO
from errors import custom_exceptions as cex  # project-specific exception classes, not a public package
BUCKET_NAME = "xxx.yyy.zzz"
POSTERS_BASE_PATH = "assets/wallcontent"
CLOUDFRONT_BASE_URL = "https://xxx.cloudfront.net/"
class S3(object):
    def __init__(self):
        self.client = boto3.client('s3')
        self.bucket_name = BUCKET_NAME
        self.posters_base_path = POSTERS_BASE_PATH
    def __download_image(self, url):
        manager = urllib3.PoolManager()
        try:
            res = manager.request('GET', url)
        except Exception:
            print("Could not download the image from URL: ", url)
            raise cex.ImageDownloadFailed
        return BytesIO(res.data)  # any file-like object that implements read()
    def upload_image(self, url):
        try:
            image_file = self.__download_image(url)
        except cex.ImageDownloadFailed:
            raise cex.ImageUploadFailed
        extension = Path(url).suffix
        id = uuid.uuid1().hex + extension
        final_path = self.posters_base_path + "/" + id
        try:
            self.client.upload_fileobj(image_file,
                                       self.bucket_name,
                                       final_path
                                       )
        except Exception:
            print("Image Upload Error for URL: ", url)
            raise cex.ImageUploadFailed
        return CLOUDFRONT_BASE_URL + id
