AWS S3 Download and Upload using TemporaryFile - python

I need to download all content (including versions) of an Amazon S3 bucket and upload it to another Amazon S3 bucket. Please don't tell me to use the AWS CLI; I simply can't use it.
I use tempfile.TemporaryFile for this. It apparently works: the print shows that the file object has the right content inside, but the uploaded files are empty (zero bytes).
with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    # the next two lines were just to check the content of the file
    data.seek(0)
    print(data.read())
    destinationUser.upload_fileobj(data, 'destination-bucket', key)
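The debug read() leaves the file pointer at the end of the temporary file, and upload_fileobj uploads from the current position, so there is nothing left to send. Rewinding with seek(0) just before the upload should fix the empty objects; a minimal sketch using the same objects as above:

import tempfile

with tempfile.TemporaryFile() as data:
    sourceUser.download_fileobj('source-bucket', key, data)
    # rewind: download_fileobj (and any debug read()) leaves the pointer at EOF
    data.seek(0)
    destinationUser.upload_fileobj(data, 'destination-bucket', key)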

I have the same requirement: how do I pass the NamedTemporaryFile to the S3 upload?
I am not sure how to pass the NamedTemporaryFile's name to output=f'{file_name}.gpg' and to the load_file call (filename=f_source.name).
with tempfile.NamedTemporaryFile("wb") as f_source:
    s3_client.download_fileobj(s3_bucket, s3_key, f_source)
    logger.info(f'{s3_key} file downloaded successfully to local {f_source}')
    f_source.flush()
    file_name = self.s3_key.split('/')[-1]
    gpg = gnupg.GPG()
    key_data = open(key_path).read()
    import_result = gpg.import_keys(key_data)
    f_source.seek(0)
    with open(f_source.name, 'r+b') as f:
        status = gpg.encrypt_file(
            file=f,
            recipients=[recipient],
            output=f'{file_name}.gpg',
        )
    s3_hook.load_file(
        filename=f_source.name,
        key=s3_key,
        bucket_name=s3_bucket,
        replace=True
    )


save a zip file downloaded in AWS EC2 to a gzip file in S3, using python boto3 in memory

I appreciate this question is quite specific, but I believe it should be a common problem. I've solved parts of it but not the entire chain.
Input:
on an AWS EC2 instance, I download a zip-compressed file from the internet
Output:
I save the gzip-compressed file to an S3 bucket
I see two ways of doing this:
saving temporary files on EC2, then copying them to S3
converting the data in memory on EC2 and saving directly to S3
I know how to do the first option, but because of resource constraints, and because I need to download a lot of files, I would like to try the second option. This is what I have so far:
import requests, boto3, gzip

zip_data = requests.get(url).content

# I can save a temp zip file in EC2 like this, but I would like to avoid it
with open(zip_temp, 'wb') as w:
    w.write(zip_data)

# missing line that decompresses the zipped file in memory and returns a byte-object, I think?
# like: data = SOMETHING(zip_data)
gz_data = gzip.compress(data)

client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
Besides, are there any general considerations I should think about when deciding which option to go for?
Turns out it was quite simple:
import requests, boto3, gzip
from zipfile import ZipFile
from io import BytesIO

zip_data = requests.get(url).content

with ZipFile(BytesIO(zip_data)) as myzip:
    with myzip.open('zip_file_inside.csv') as mycsv:
        gz_data = gzip.compress(mycsv.read())

client = boto3.client('s3')
output = client.put_object(
    Bucket='my-bucket',
    Body=gz_data,
    Key=filename)
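Since the question mentions downloading a lot of files, the same idea generalizes to zips whose member names are not known in advance by iterating namelist(); a sketch (the function name and the bucket/prefix arguments are hypothetical):

import gzip
from io import BytesIO
from zipfile import ZipFile

import boto3
import requests

def mirror_zip_as_gzip(url, bucket, prefix):
    client = boto3.client('s3')
    zip_data = requests.get(url).content          # whole zip held in memory
    with ZipFile(BytesIO(zip_data)) as myzip:
        for member in myzip.namelist():           # every file inside the zip
            with myzip.open(member) as f:
                client.put_object(
                    Bucket=bucket,
                    Key=f'{prefix}/{member}.gz',
                    Body=gzip.compress(f.read()),  # gzip each member in memory
                )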

How to zip files on s3 using lambda and python

I need to archive multiple files that exist on S3 and then upload the archive back to S3.
I am trying to use Lambda and Python. As some of the files are more than 500 MB, downloading to '/tmp' is not an option. Is there any way to stream the files one by one and put them into an archive?
Do not write to disk; stream to and from S3.
Stream the zip file from the source bucket and read and write its contents on the fly, using Python, back to another S3 bucket.
This method does not use disk space and is therefore not limited by size.
The basic steps are:
Read the zip file from S3 using the Boto3 S3 resource Object into a BytesIO buffer object
Open the object using the zipfile module
Iterate over each file in the zip file using the namelist method
Write the file back to another bucket in S3 using the resource meta.client.upload_fileobj method
The Code
Python 3.6 using Boto3
import boto3
import zipfile
from io import BytesIO

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="bucket_name_here", key=zip_key)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key=f'{filename}'
    )
Note: the AWS Lambda execution time limit is a maximum of 15 minutes, so can you process your huge files in that amount of time? You can only know by testing.
AWS Lambda code: create a zip from the files with a given extension under bucket/filePath.
import boto3
import zipfile
from io import BytesIO

s3 = boto3.resource('s3')

def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt, createUrl=False):
    response = {}
    bucket = s3.Bucket(bucketName)
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath).all()

    archive = BytesIO()
    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())

    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()

    response['fileUrl'] = None
    if createUrl is True:
        s3Client = boto3.client('s3')
        response['fileUrl'] = s3Client.generate_presigned_url(
            'get_object',
            Params={'Bucket': bucketName, 'Key': bucketFilePath + '/' + jobKey + '.zip'},
            ExpiresIn=3600)

    return response
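Hypothetical invocation (the bucket name, prefix, job key and extension below are placeholders):

result = createZipFileStream(
    bucketName='my-bucket',
    bucketFilePath='exports/2021-01',
    jobKey='job-123',
    fileExt='csv',
    createUrl=True,
)
print(result['fileUrl'])  # presigned URL, valid for one hour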
The /tmp/ directory is limited to 512 MB by default for AWS Lambda functions.
If you search Stack Overflow, you'll see some code from people who have created zip files on the fly without saving files to disk. It becomes pretty complicated.
An alternative would be to attach an EFS filesystem to the Lambda function. It takes a bit of effort to set up, but the cost would be practically zero if you delete the files after use, and you'll have plenty of disk space, so your code will be more reliable and easier to maintain.
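A minimal sketch of the EFS route, assuming the function already has an EFS access point mounted at /mnt/data (the mount path, event fields and key names below are hypothetical):

import os
import zipfile

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    bucket = event['bucket']
    keys = event['keys']                    # S3 objects to archive
    archive_path = '/mnt/data/archive.zip'  # EFS path instead of the small /tmp

    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for key in keys:
            local = os.path.join('/mnt/data', os.path.basename(key))
            s3.download_file(bucket, key, local)
            zf.write(local, arcname=os.path.basename(key))
            os.remove(local)                # free space as we go

    s3.upload_file(archive_path, bucket, 'archives/archive.zip')
    os.remove(archive_path)                 # delete after use to keep EFS cost near zero
    return {'archive': 'archives/archive.zip'}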
# For me, the code below worked in a Glue job to take a single .txt file from AWS S3, zip it, and upload it back to AWS S3.
import boto3
import zipfile
from io import BytesIO
import logging

logger = logging.getLogger()
s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')

# ZipFileStream function call
self._createZipFileStream(
    bucketName="My_AWS_S3_bucket_name",
    bucketFilePath="My_txt_object_prefix",
    bucketfileobject="My_txt_Object_prefix + txt_file_name",
    zipKey="My_zip_file_prefix")

# ZipFileStream function definition
def _createZipFileStream(self, bucketName: str, bucketFilePath: str, bucketfileobject: str, zipKey: str) -> None:
    try:
        obj = s3_resource.Object(bucket_name=bucketName, key=bucketfileobject)

        archive = BytesIO()
        with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
            with zip_archive.open(zipKey, 'w') as file1:
                file1.write(obj.get()['Body'].read())

        archive.seek(0)
        s3_client.upload_fileobj(archive, bucketName, bucketFilePath + '/' + zipKey + '.zip')
        archive.close()

        # If you would like to delete the .txt from AWS S3 after zipping, the code below will work.
        self._delete_object(bucket=bucketName, key=bucketfileobject)
    except Exception as e:
        logger.error(f"Failed to zip the txt file for {bucketName}/{bucketfileobject}: {e}")

# Delete AWS S3 object function definition
def _delete_object(bucket: str, key: str) -> None:
    try:
        logger.info(f"Deleting: {bucket}/{key}")
        s3_client.delete_object(
            Bucket=bucket,
            Key=key
        )
    except Exception as e:
        logger.error(f"Failed to delete {bucket}/{key}: {e}")

provide a url to s3.upload_file()

Currently I have the process of uploading files to S3 broken down into two parts:
download the file locally:
# downloads file from a url to local
def download_url(url, save_path):
    r = requests.get(url, stream=True)
    with open(save_path, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            fd.write(chunk)
upload to s3:
def create_aws_resource(aws_access_key_id, aws_secret_access_key, aws_default_region):
    s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
    return s3

def save_to_s3(bucket, path, save_path, link_save_path, aws_access_key_id, aws_secret_access_key, aws_default_region):
    s3 = create_aws_resource(aws_access_key_id, aws_secret_access_key, aws_default_region)
    os.chdir(save_path)
    filename = link_save_path.split('/')[-1]
    s3_path = path + '/' + filename
    s3.upload_file(link_save_path, bucket, s3_path)
This process needs to be run on a server and hence I want to eliminate the first step of downloading locally.
Is there a way to directly store files (any files - .xlsx, .pdf, .ods, .xls, .html) to s3?
Example: I want to upload the file at http://seriestoriche.istat.it/fileadmin/documenti/Tavola_17.1.xls to S3 directly.
Any help would be highly appreciated!
This worked like a charm for me!

def create_aws_resource(aws_access_key_id, aws_secret_access_key):
    s3 = boto3.client('s3', aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
    return s3

def save_to_s3(doc_url, doc_name, bucket, s3_path, aws_access_key_id, aws_secret_access_key):
    s3 = create_aws_resource(aws_access_key_id, aws_secret_access_key)
    try:
        # gets the file as an in-memory response
        response = requests.get(doc_url)
    except Exception:
        print(f'File not downloaded from {doc_url}')
        return
    try:
        # uploads the file to s3
        s3.upload_fileobj(io.BytesIO(response.content), bucket, f'{s3_path}/{doc_name}')
        print(f'File downloaded from {doc_url} and uploaded to {s3_path}/{doc_name}')
    except Exception:
        print(f'File not uploaded to {s3_path}')
doc_url is the document URL;
doc_name is the name you want the file saved as on S3;
s3_path is where you want to save the file.
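Hypothetical usage with the spreadsheet linked in the question (the bucket name, S3 path and credentials below are placeholders):

save_to_s3(
    doc_url='http://seriestoriche.istat.it/fileadmin/documenti/Tavola_17.1.xls',
    doc_name='Tavola_17.1.xls',
    bucket='my-bucket',
    s3_path='istat/raw',
    aws_access_key_id='YOUR_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_KEY',
)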

Writing a file to S3 using Lambda in Python with AWS

In AWS, I'm trying to save a file to S3 in Python using a Lambda function. While this works on my local computer, I am unable to get it to work in Lambda. I've been working on this problem for most of the day and would appreciate help. Thank you.
def pdfToTable(PDFfilename, apiKey, fileExt, bucket, key):

    # parsing a PDF using an API
    fileData = (PDFfilename, open(PDFfilename, "rb"))
    files = {"f": fileData}
    postUrl = "https://pdftables.com/api?key={0}&format={1}".format(apiKey, fileExt)
    response = requests.post(postUrl, files=files)
    response.raise_for_status()

    # this code is probably the problem!
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('transportation.manifests.parsed')
    with open('/tmp/output2.csv', 'rb') as data:
        data.write(response.content)
        key = 'csv/' + key
        bucket.upload_fileobj(data, key)

    # FYI, on my own computer, this saves the file
    with open('output.csv', "wb") as f:
        f.write(response.content)
In S3, there is a bucket transportation.manifests.parsed containing the folder csv where the file should be saved.
The type of response.content is bytes.
From AWS, the error from the current set-up above is [Errno 2] No such file or directory: '/tmp/output2.csv': FileNotFoundError. In fact, my goal is to save the file to the csv folder under a unique name, so tmp/output2.csv might not be the best approach. Any guidance?
In addition, I've tried using wb and w instead of rb, also to no avail. The error with wb is Input <_io.BufferedWriter name='/tmp/output2.csv'> of type: <class '_io.BufferedWriter'> is not supported. The documentation suggests that using 'rb' is the recommended usage, but I do not understand why that would be the case.
Also, I've tried s3_client.put_object(Key=key, Body=response.content, Bucket=bucket) but receive An error occurred (404) when calling the HeadObject operation: Not Found.
Assuming Python 3.6: the way I usually do this is to wrap the bytes content in a BytesIO wrapper to create a file-like object. And, per the boto3 docs, you can use the transfer manager for a managed transfer:
from io import BytesIO
import boto3
s3 = boto3.client('s3')
fileobj = BytesIO(response.content)
s3.upload_fileobj(fileobj, 'mybucket', 'mykey')
If that doesn't work I'd double check all IAM permissions are correct.
You have a writable stream that you're asking boto3 to use as a readable stream, which won't work.
Write the file, and then simply use bucket.upload_file() afterwards, like so:
s3 = boto3.resource('s3')
bucket = s3.Bucket('transportation.manifests.parsed')

with open('/tmp/output2.csv', 'wb') as data:
    data.write(response.content)

key = 'csv/' + key
bucket.upload_file('/tmp/output2.csv', key)
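For the unique-name requirement mentioned in the question, one option (a sketch that reuses response.content from the code above; the uuid-based naming is just one choice) is to skip /tmp entirely and upload the bytes under a generated key:

import uuid
from io import BytesIO

import boto3

s3 = boto3.client('s3')
key = 'csv/' + str(uuid.uuid4()) + '.csv'  # unique key under the csv/ folder
s3.upload_fileobj(BytesIO(response.content), 'transportation.manifests.parsed', key)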

Error 500 while Uploading CSV file to S3 bucket using boto3 and python flask

I have looked at just about all the possible options.
I am using boto3 and Python 3.6 to upload a file to an S3 bucket. The funny thing is that while JSON and even .py files get uploaded fine, it throws Error 500 while uploading a CSV. On a successful upload I return a JSON to check all the values.
import boto3
from botocore.client import Config

@app.route("/upload", methods=['POST', 'GET'])
def upload():
    if request.method == 'POST':
        file = request.files['file']
        filename = secure_filename(file.filename)
        s3 = boto3.resource('s3', aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'), aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'), config=Config(signature_version='s3v4'))
        s3.Bucket(os.environ.get('S3_BUCKET')).put_object(Key=filename, Body=open(filename, 'rb'), ContentEncoding='text/csv')
        return jsonify({'successful upload': filename, 'S3_BUCKET': os.environ.get('S3_BUCKET'), 'ke': os.environ.get('AWS_ACCESS_KEY_ID'), 'sec': os.environ.get('AWS_SECRET_ACCESS_KEY'), 'filepath': "https://s3.us-east-2.amazonaws.com/" + os.environ.get('S3_BUCKET') + "/" + filename})
Please help!!
You are getting a FileNotFoundError for file xyz.csv because the file does not exist.
This could be because the code in upload() does not actually save the uploaded file, it merely obtains a safe name for it and immediately tries to open it - which fails.
That it works for other files is probably due to the fact that those files already exist, perhaps left over from testing, so there is no problem.
Try saving the file to the file system using save() after obtaining the safe filename:
upload_file = request.files['file']
filename = secure_filename(upload_file.filename)
upload_file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
and then uploading it (assuming that you've configured an UPLOAD_FOLDER):
with open(os.path.join(app.config['UPLOAD_FOLDER'], filename), 'rb') as f:
    s3.Bucket(os.environ.get('S3_BUCKET')).put_object(Key=filename, Body=f, ContentEncoding='text/csv')

return jsonify({...})
There is no need to actually save the file to the file system; it can be streamed directly to your S3 bucket using the stream attribute of the upload_file object:
upload_file = request.files['file']
filename = secure_filename(upload_file.filename)
s3 = boto3.resource('s3', aws_access_key_id='key', aws_secret_access_key='secret')
s3.Bucket('bucket').put_object(Key=filename, Body=upload_file.stream, ContentType=upload_file.content_type)
To make this more generic you should use the content_type attribute of the uploaded file as shown above.
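Putting the streaming approach together into a complete route, as a sketch (the route, bucket environment variable and credential handling below are placeholders):

import os

import boto3
from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
s3 = boto3.resource(
    's3',
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID'),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY'),
)

@app.route('/upload', methods=['POST'])
def upload():
    upload_file = request.files['file']
    filename = secure_filename(upload_file.filename)
    s3.Bucket(os.environ.get('S3_BUCKET')).put_object(
        Key=filename,
        Body=upload_file.stream,               # stream straight to S3, no temp file
        ContentType=upload_file.content_type,  # preserve the client's MIME type
    )
    return jsonify({'successful upload': filename})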
