How to zip files on S3 using Lambda and Python

I need to archive multiple files that exist on S3 and then upload the archive back to S3.
I am trying to use Lambda and Python. Since some of the files are larger than 500 MB, downloading them into '/tmp' is not an option. Is there any way to stream the files one by one and add them to an archive?

Do not write to disk; stream to and from S3
Stream the zip file from the source bucket, and read and write its contents on the fly, back to another S3 bucket, using Python.
This method does not use up disk space and therefore is not limited by size.
The basic steps are:
Read the zip file from S3 using the Boto3 S3 resource Object into a BytesIO buffer object
Open the object using the zipfile module
Iterate over each file in the zip file using the namelist method
Write the file back to another bucket in S3 using the resource meta.client.upload_fileobj method
The Code
Python 3.6 using Boto3
import boto3
import zipfile
from io import BytesIO

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    # stream each member straight back to S3 without touching disk
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=f'{filename}'
    )
Note: the AWS Lambda execution time limit is 15 minutes, so whether you can process your huge files within that window is something you can only find out by testing.

AWS Lambda code: create a zip from the files with a given extension under bucket/filePath.
import boto3
import zipfile
from io import BytesIO

s3 = boto3.resource('s3')

def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt, createUrl=False):
    response = {}
    bucket = s3.Bucket(bucketName)
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath).all()

    # build the zip entirely in memory
    archive = BytesIO()
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())

    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()

    response['fileUrl'] = None
    if createUrl is True:
        s3Client = boto3.client('s3')
        response['fileUrl'] = s3Client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucketName, 'Key': bucketFilePath + '/' + jobKey + '.zip'},
            ExpiresIn=3600)

    return response
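For reference, a call like the following (the bucket name and prefix are just hypothetical placeholders) would zip every .csv object under the prefix and return a presigned download URL:

result = createZipFileStream(
    bucketName='my-bucket',           # hypothetical bucket name
    bucketFilePath='exports/2021',    # hypothetical prefix containing the files
    jobKey='job-123',                 # archive is written as exports/2021/job-123.zip
    fileExt='csv',
    createUrl=True)
print(result['fileUrl'])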

The /tmp/ directory is limited to 512MB for AWS Lambda functions.
If you search StackOverflow, you'll see some code from people who have created Zip files on-the-fly without saving files to disk. It becomes pretty complicated.
An alternative would be to attach an EFS filesystem to the Lambda function. It takes a bit of effort to set up, but the cost is practically zero if you delete the files after use, and you'll have plenty of disk space, so your code will be more reliable and easier to maintain. A sketch of this approach is shown below.
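To give an idea of what the EFS route looks like, here is a minimal sketch; the mount path /mnt/data and the zip_via_efs helper are assumptions for illustration, not an official recipe. The function downloads each object under a prefix onto the EFS volume, builds the archive there, uploads it, and cleans up.

import os
import zipfile
import boto3

s3 = boto3.client('s3')
EFS_DIR = '/mnt/data'  # hypothetical EFS mount path configured on the function

def zip_via_efs(bucket, prefix, zip_key):
    # download every object under the prefix onto the EFS volume
    local_paths = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            if obj['Key'].endswith('/'):
                continue  # skip folder placeholder objects
            local_path = os.path.join(EFS_DIR, os.path.basename(obj['Key']))
            s3.download_file(bucket, obj['Key'], local_path)
            local_paths.append(local_path)

    # build the archive on EFS, then upload it and clean up
    zip_path = os.path.join(EFS_DIR, 'archive.zip')
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in local_paths:
            zf.write(path, arcname=os.path.basename(path))
    s3.upload_file(zip_path, bucket, zip_key)

    for path in local_paths + [zip_path]:
        os.remove(path)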

For me, the code below worked in a Glue job to take a single .txt file from AWS S3, zip it, and upload it back to AWS S3.
import boto3
import zipfile
from io import BytesIO
import logging

logger = logging.getLogger()
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# ZipFileStream function definition
def _createZipFileStream(bucketName: str, bucketFilePath: str, bucketfileobject: str, zipKey: str) -> None:
    try:
        obj = s3_resource.Object(bucket_name=bucketName, key=bucketfileobject)
        archive = BytesIO()
        with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
            with zip_archive.open(zipKey, 'w') as file1:
                file1.write(obj.get()['Body'].read())
        archive.seek(0)
        s3_client.upload_fileobj(archive, bucketName, bucketFilePath + '/' + zipKey + '.zip')
        archive.close()
        # If you would like to delete the source .txt from AWS S3 after zipping, the call below will do it.
        _delete_object(bucket=bucketName, key=bucketfileobject)
    except Exception as e:
        logger.error(f"Failed to zip the txt file for {bucketName}/{bucketfileobject}: {e}")

# Delete AWS S3 object function definition
def _delete_object(bucket: str, key: str) -> None:
    try:
        logger.info(f"Deleting: {bucket}/{key}")
        s3_client.delete_object(Bucket=bucket, Key=key)
    except Exception as e:
        logger.error(f"Failed to delete {bucket}/{key}: {e}")

# ZipFileStream function call
_createZipFileStream(
    bucketName="My_AWS_S3_bucket_name",
    bucketFilePath="My_txt_object_prefix",
    bucketfileobject="My_txt_Object_prefix + txt_file_name",
    zipKey="My_zip_file_prefix")

Related

save a zip file downloaded in AWS EC2 to a gzip file in S3, using python boto3 in memory

I appreciate this question is quite specific, but I believe it should be a common problem. I've solved parts of it but not the entire chain.
Input:
on an AWS EC2 instance, I download a zip-compressed file from the internet
Output:
I save the gzip-compressed file to an S3 bucket
I see 2 ways of doing this:
saving temporary files in EC2, and then copying them to S3
converting the data in memory in EC2, and saving directly to S3
I know how to do the first option, but because of resource constraints, and because I need to download a lot of files, I would like to try the second option. This is what I have so far:
import requests, boto3, gzip

zip_data = requests.get(url).content

# I can save a temp zip file in EC2 like this, but I would like to avoid it
with open(zip_temp, 'wb') as w:
    w.write(zip_data)

# missing line that decompresses the zipped file in memory and returns a bytes object, I think?
# like: data = SOMETHING(zip_data)
gz_data = gzip.compress(data)

client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
Besides, are there any general considerations I should think about when deciding which option to go for?
turns out it was quite simple:
import requests, boto3, gzip
from zipfile import ZipFile
from io import BytesIO

zip_data = requests.get(url).content

with ZipFile(BytesIO(zip_data)) as myzip:
    with myzip.open('zip_file_inside.csv') as mycsv:
        gz_data = gzip.compress(mycsv.read())

client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
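If the downloaded archive contains more than one file, a small variation on the snippet above (sketch only; the output key naming is an assumption, and the client from above is reused) walks namelist() and gzips each member separately:

with ZipFile(BytesIO(zip_data)) as myzip:
    for member in myzip.namelist():
        with myzip.open(member) as f:
            gz_data = gzip.compress(f.read())
        # hypothetical key layout: keep the member name and append .gz
        client.put_object(Bucket='my-bucket', Body=gz_data, Key=member + '.gz')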

AWS Lambda: How to read CSV files in S3 bucket then upload it to another S3 bucket?

I'm doing a project where I read files from an S3 bucket, get rid of all NA values, then upload them to a different S3 bucket. I've been watching a Lambda tutorial and example code, but I have a hard time understanding how it really works.
My goal is to read any file in the S3 bucket and, using the Lambda function, drop all the NA values, then upload the result to a different S3 bucket. But I don't really understand what is going on. I read the documentation, but it wasn't very helpful for me.
How can I make the code below read CSV files from the S3 bucket, drop all NA values, and then upload the result to the new S3 bucket?
import json
import os
import boto3
import csv

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        file_key = record['s3']['object']['key']
        s3 = boto3.client('s3')
        csv_file = s3.get_object(Bucket=bucket, Key=file_key)
        csv_content = csv_file['Body'].read().split(b'\n')
        csv_data = csv.DictReader(csv_content)
Any links to documentation or videos, and any advice, will be appreciated.
Uploading files
import logging
import boto3
from botocore.exceptions import ClientError

def upload_file(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        response = s3_client.upload_file(file_name, bucket, object_name)
    except ClientError as e:
        logging.error(e)
        return False
    return True
s3 download_file
import boto3
s3 = boto3.resource('s3')
s3.meta.client.download_file('mybucket', 'hello.txt', '/tmp/hello.txt')
Now you simply put these calls together however you like to process your CSV files; how to process and upload to S3 efficiently is a completely different topic.
There are plenty of answers in this post: How to upload a file to directory in S3 bucket using boto
You can check this one as well if curious; it gives some idea of how to process larger files:
Step 4: Create the Lambda function that splits input data
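Putting those pieces together, a minimal sketch of the whole handler could look like the following; the destination bucket name and the rule for what counts as an NA value (an empty field or the literal string 'NA') are assumptions for illustration.

import csv
import io
import boto3

s3 = boto3.client('s3')
DEST_BUCKET = 'my-cleaned-bucket'  # hypothetical destination bucket

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        # read the CSV from the source bucket into memory
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        reader = csv.DictReader(io.StringIO(body))

        # keep only rows where no field is empty or 'NA' (assumed NA convention)
        clean_rows = [row for row in reader
                      if all(v not in ('', 'NA') for v in row.values())]

        # write the cleaned rows back out and upload to the destination bucket
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(clean_rows)
        s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=out.getvalue().encode('utf-8'))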

Reading doc, docx files from s3 within lambda

TL;DR: reading doc and docx files stored on S3 from my AWS Lambda.
On my local machine I just use textract.process(file_path) to read both doc and docx files.
So the intuitive way to do the same on Lambda is to download the file from S3 to the Lambda's local storage (/tmp) and then process the tmp files like I do on my local machine.
That's not cost-effective...
Is there a way to make a pipeline from the S3 object straight into some parser like textract that'll just convert the doc/docx files into a readable object like string?
My code so far for reading files like txt.
import boto3

print('Loading function')

def lambda_handler(event, context):
    try:  # Read s3 file
        bucket_name = "appsresults"
        download_path = 'Folder1/file1.txt'
        filename = download_path
        s3 = boto3.resource('s3')
        content_object = s3.Object(bucket_name, filename)
        file_content = content_object.get()['Body'].read().decode('utf-8')
        print(file_content)
    except Exception as e:
        print("Couldnt read the file from s3 because:\n {0}".format(e))
    return event  # return event
This answer solves half of the problem
textract.process currently doesn't support reading file-like objects. If it did, you could have loaded the file from S3 directly into memory and passed it to the process function.
Older versions of textract internally used the python-docx package for reading .docx files. python-docx supports reading file-like objects. You can use the code below to achieve your goal, at least for .docx files.
import boto3
import io
from docx import Document

s3 = boto3.resource('s3')
bucket = s3.Bucket('my-bucket')
object = bucket.Object('/files/resume.docx')

file_stream = io.BytesIO()
object.download_fileobj(file_stream)
document = Document(file_stream)
If you're reading the docx file from S3: the Document() constructor expects a path to the file. Instead, you can read the file as bytes, wrap it in a BytesIO, and call the constructor like this.
import io
import boto3
from docx import Document

s3_client = boto3.client('s3')

def parseDocx(data):
    data = io.BytesIO(data)
    document = Document(docx=data)
    content = ''
    for para in document.paragraphs:
        data = para.text
        content += data
    return content

Key = "acb.docx"
Bucket = "xyz"
obj_ = s3_client.get_object(Bucket=Bucket, Key=Key)
if Key.endswith('.docx'):
    fs = obj_['Body'].read()
    sentence = str(parseDocx(fs))

AWS S3 Download and Upload using TemporaryFile

I need to download all content (including versions) of an Amazon S3 bucket and upload it to another Amazon S3 bucket. Don't tell me to use the AWS CLI; I just can't use it.
I use tempfile.TemporaryFile for this. It apparently works: the print shows that the file object has the right content, but the uploaded files are empty (zero bytes).
with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    # next 2 lines were just to check the content of the file
    data.seek(0)
    print(data.read())
    destinationUser.upload_fileobj(data, 'destination-bucket', key)
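The likely cause is that the debug read() leaves the file pointer at the end of the temporary file, so upload_fileobj then reads zero bytes. A minimal sketch of the fix, assuming sourceUser and destinationUser are boto3 S3 clients as in the snippet above, is simply to rewind before uploading:

import tempfile

with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    # rewind to the start so upload_fileobj reads the whole file
    data.seek(0)
    destinationUser.upload_fileobj(data, 'destination-bucket', key)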
I have the same requirement. How do I pass the NamedTemporaryFile to the S3 upload?
I'm not sure how to pass the NamedTemporaryFile's name to output=f'{file_name}.gpg' and to the load_file function (filename=f_source.name).
with tempfile.NamedTemporaryFile("wb") as f_source:
    s3_client.download_fileobj(s3_bucket, s3_key, f_source)
    logger.info(f'{s3_key} file downloaded successfully to local {f_source}')
    f_source.flush()

    file_name = s3_key.split('/')[-1]
    gpg = gnupg.GPG()
    key_data = open(key_path).read()
    import_result = gpg.import_keys(key_data)

    f_source.seek(0)
    with open(f_source.name, 'r+b') as f:
        status = gpg.encrypt_file(
            file=f,
            recipients=[recipient],
            output=f'{file_name}.gpg',
        )

    s3_hook.load_file(
        filename=f_source.name,
        key=s3_key,
        bucket_name=s3_bucket,
        replace=True
    )
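One way to wire this up (a sketch only, reusing the names from the snippet above; uploading the encrypted copy under s3_key + '.gpg' and naming the output file after the temp file are my assumptions): the NamedTemporaryFile's on-disk path is f_source.name, so the gpg output can be written next to it and that path passed to load_file.

import tempfile
import gnupg

with tempfile.NamedTemporaryFile('wb') as f_source:
    s3_client.download_fileobj(s3_bucket, s3_key, f_source)
    f_source.flush()

    gpg = gnupg.GPG()
    gpg.import_keys(open(key_path).read())

    # f_source.name is the path of the temp file on disk;
    # write the encrypted copy alongside it (hypothetical naming)
    encrypted_path = f_source.name + '.gpg'
    with open(f_source.name, 'rb') as f:
        gpg.encrypt_file(f, recipients=[recipient], output=encrypted_path)

    # upload the encrypted file by its path
    s3_hook.load_file(
        filename=encrypted_path,
        key=s3_key + '.gpg',
        bucket_name=s3_bucket,
        replace=True)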

Writing a file to S3 using Lambda in Python with AWS

In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried to use wb and w instead of rb also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream which won't work.
Write the file, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
