Writing bytes stream to S3 using Python

I have a zip file loaded into memory (it is not persisted on disk). The zip file contains jpg images. I am trying to upload each jpg to S3 but am getting an error.
# already have an opened zipfile stored in zip_file
# already connected to s3
files = zip_file.namelist()
for f in files:
    im = io.BytesIO(zip_file.read(f))
    s3_key.key = f
    s3_key.set_contents_from_stream(im)
I get the following error:
BotoClientError: BotoClientError: s3 does not support chunked transfer
What am I doing wrong?

Here is the solution. I was overthinking the problem.
files = zip_file.namelist()
for f in files:
    data = zip_file.read(f)
    s3_key._key.key = f
    s3_key._key.set_contents_from_string(data)
That's all it took.

Boto supports other storage services, such as Google Cloud Storage, in addition to S3. The set_contents_from_stream method only works for services that support chunked transfer (see https://codereview.appspot.com/4515170). S3 does not support that (See their Technical FAQs at http://aws.amazon.com/articles/1109.)
It's unfortunate, but you can't upload from a stream to S3.
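For what it's worth, the newer boto3 SDK (used in the later threads below) exposes upload_fileobj, which does accept a file-like object. A minimal sketch along those lines, assuming the same in-memory zip_file; the bucket name is a placeholder, not from the question:
# Hedged sketch using boto3 rather than the boto API from the question.
# 'my-target-bucket' is a placeholder.
import io
import boto3

s3 = boto3.client('s3')
for name in zip_file.namelist():
    buf = io.BytesIO(zip_file.read(name))  # read the member fully into memory
    s3.upload_fileobj(buf, 'my-target-bucket', name)  # accepts any file-like object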

Related

Create a zip file on S3 from CSV files on S3 using Lambda

Around 60 CSV files are generated daily in my S3 bucket. The average size of each file is around 500 MB. I want to zip all these files through a Lambda function on the fly (without downloading the files inside the Lambda execution) and upload the zipped files to another S3 bucket. I came across these solutions 1 and 2, but I am still getting issues with the implementation. Right now, I am trying to stream CSV file data into a zipped file (this zip file is being created in Lambda's tmp directory) and then upload it to S3, but I am getting this error message while writing into the zip file:
[Errno 36] File name too long
This is my test Lambda function where I am just trying with one file, but in the actual case I need to zip 50-60 CSV files individually:
import boto3
import zipfile

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    iterator = s3.Object('bucket-name', 'file-name').get()['Body'].iter_lines()
    my_zip = zipfile.ZipFile('/tmp/test.zip', 'w')
    for line in iterator:
        my_zip.write(line)
    s3_resource.meta.client.upload_fileobj(file-name, "another-bucket-name", "object-name")
Also, is there a way I can stream data from my CSV file, zip it, and upload it to another S3 bucket without saving a full zip file in Lambda memory?
After a lot of research and trials, I was able to make it work. I used the smart_open library for my issue and managed to zip a 550 MB file with just 150 MB of memory usage in my Lambda. To use an external library, I had to use Layers in Lambda. Here is my code:
from smart_open import open, register_compressor
import lzma, os

def lambda_handler(event, context):
    with open('s3://bucket-name-where-large-file/file-key-name') as fin:
        with open('s3://bucket-name-to-put-zip-file/zip-file-key-name', 'w') as fout:
            for line in fin:
                fout.write(line)
Please note, smart_open supports .gz and .bz2 compression out of the box. If you want to compress files in other formats, you can create your own compressor using the register_compressor method of this library, as in the sketch below.
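A minimal sketch of register_compressor, closely following the pattern in smart_open's documentation; the .xz handler and the bucket/key names here are illustrative assumptions, not part of the original answer:
# Hedged sketch: register an .xz handler so keys ending in .xz are
# (de)compressed transparently. Bucket and key names are placeholders.
import lzma
from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # Wrap the underlying S3 stream with an LZMAFile for transparent (de)compression.
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

with open('s3://source-bucket/big-file.csv') as fin:
    with open('s3://destination-bucket/big-file.csv.xz', 'w') as fout:
        for line in fin:
            fout.write(line)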

Extract 7z files on the fly in S3 with boto3

I have a really large 7z file in an S3 bucket, say s3://tempbucket1/Test_For7zip.7z, that runs into several tens of GB. I do not want to download it, unzip it, and re-upload it back to S3. I want to use Boto3 to unzip it on the fly and save it into S3.
I tried to solve this using the lzma package, based on a previous SO answer which dealt with on-the-fly unzipping of *.zip files using the fileobj option of gzip.GzipFile.
from io import BytesIO
import gzip
import lzma
import boto3

# setup constants
bucket = 'tempbucket1'
gzipped_key = 'Test_For7zip.7z'
uncompressed_key = 'Test_Unzip7zip'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)

s3.upload_fileobj(  # upload a new obj to s3
    Fileobj=lzma.LZMAFile(
        BytesIO(s3.get_object(Bucket=bucket,
                              Key=gzipped_key)['Body'].read()),
        'rb'),  # read binary
    Bucket=bucket,  # target bucket, writing to
    Key=uncompressed_key)  # target key, writing to
However, this throws the following error:
LZMAError: Input format not supported by decoder
Is there a Python package that can decode 7z files from a BytesIO object, or is there a better way of achieving this?
I never tried this, but Googling gave me this as a possible solution. Please reach out through this post if this solves your problem.
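For completeness, one third-party package that can read 7z archives from a file-like object is py7zr; this was not part of the original answer, so treat the following as a rough sketch that assumes the archive and its members fit in memory:
# Hedged sketch: decode a 7z archive held in memory with py7zr (pip install py7zr)
# and write each member back to S3. Assumes everything fits in memory, which may
# not hold for an archive of several tens of GB.
from io import BytesIO
import boto3
import py7zr

s3 = boto3.client('s3')
body = s3.get_object(Bucket='tempbucket1', Key='Test_For7zip.7z')['Body'].read()

with py7zr.SevenZipFile(BytesIO(body), mode='r') as archive:
    for name, member in archive.readall().items():  # dict of member name -> BytesIO
        member.seek(0)
        s3.upload_fileobj(member, 'tempbucket1', 'Test_Unzip7zip/' + name)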

How to get .stl file from Amazon S3 by using boto3?

I have a Django web application and I deployed it to an Elastic Beanstalk environment. I also have the numpy-stl package. I'm trying to get a .stl file from an Amazon S3 bucket and use it with one of the stl package's functions, but I'm getting the error 'bytes' object has no attribute 'get_mass_properties'.
My code is:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket_name, Key=key)
body = obj['Body'].read()
volume, cog, inertia = body.get_mass_properties()
How can I get the .stl file and use it?
Assuming that you are talking about this STL file format, once you read it into Python from S3, you need some Python library to open it.
Quick search returns numpy-stl:
Simple library to make working with STL files (and 3D objects in general) fast and easy.
Thus you can install that library and attempt to use it on the file you are downloading.
In case you run your code on Lambda (not mentioned in your question?), you would have to bundle the library with your deployment package or construct a custom Lambda layer for it.
I have fixed it as below.
import stl
import boto3
import tempfile

s3 = boto3.resource('s3', region_name=region)
bucket = s3.Bucket(bucket)
obj = bucket.Object(uploadedVolume)
tmp = tempfile.NamedTemporaryFile()
with open(tmp.name, 'wb') as f:
    obj.download_fileobj(f)
stlMesh = stl.mesh.Mesh.from_file(tmp.name)
volume, cog, inertia = stlMesh.get_mass_properties()
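If you would rather avoid the temporary file entirely, numpy-stl's from_file appears to also accept an already-open file handle via its fh argument; the following is only a sketch, and the fh parameter is an assumption worth verifying against your numpy-stl version:
# Hedged sketch: load the STL directly from memory. The fh= argument is assumed
# to accept a binary file-like object; 'model.stl' is only a label here.
import io
import boto3
import stl

s3 = boto3.client('s3')
body = s3.get_object(Bucket=bucket_name, Key=key)['Body'].read()

stl_mesh = stl.mesh.Mesh.from_file('model.stl', fh=io.BytesIO(body))
volume, cog, inertia = stl_mesh.get_mass_properties()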

zipfile write doesn't find files in gcloud

I'm trying to zip a few files from Google Cloud Storage.
Python's zipfile doesn't find the files in gcloud, only in the project.
How can I make my code find the files in gcloud?
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
    for revenue in revenues:
        # queryset with a lot of files, so add each file to the zip
        t = tempfile.NamedTemporaryFile()
        t.write(revenue.revenue.name)
        if revenue.revenue.name:
            t.seek(0)
            with default_storage.open(revenue.revenue.name, "r") as file_data:
                zip_file.write(file_data.name, compress_type=zipfile.ZIP_DEFLATED)
                # the code doesn't get past this part
        t.close()
response = HttpResponse(content_type='application/x-zip-compressed')
response['Content-Disposition'] = 'attachment; filename=my_zip.zip'
response.write(zip_buffer.getvalue())
return response
In this part, I write the file that I opened from gcloud, but it stops inside this function:
def write(self, filename, arcname=None, compress_type=None):
    """Put the bytes from filename into the archive under the name
    arcname."""
    if not self.fp:
        raise RuntimeError(
            "Attempt to write to ZIP archive that was already closed")
    st = os.stat(filename)
    # when I try to find the file, os.stat searches in the project, not in gcloud
the "os.stat(filename)" search for a file in project, how can I do for find in the gcloud?
I will post my findings as an answer, since I would like to comment about few things.
I have understood:
You are using the Python zipfile library to work with ZIP files.
You are looking for files locally and adding them one by one into the ZIP file.
You would like to do the same for files located in a Google Cloud Storage bucket, but it is failing to find the files.
If I have misunderstood the use-case scenario, please elaborate further in a comment.
However, if this is exactly what you are trying to do, then it is not supported. In the StackOverflow question Compress files saved in Google cloud storage, it is stated that compressing files that are already in Google Cloud Storage is not possible. The solution in that question is to subscribe to newly created files, download them locally, compress them, and overwrite them in GCS. As you can see, you can list the files or iterate through the files stored in GCS, but you first need to download them to be able to process them.
Workaround
Therefore, in your use-case scenario, I would recommend the following workaround, by using the Python client API:
You can use Listing objects Python API, to get all the objects from GCS.
Then you can use Downloading objects Python API, to download the objects locally.
As soon as the objects are located in local directory, you can use the zipfile Python library to ZIP them together, as you are already doing it.
Once the objects are zipped, if you no longer need the downloaded copies, you can delete them with os.remove("downloaded_file.txt").
In case you need to have the compressed ZIP file in the Google Cloud Storage bucket, then you can use the Uploading objects Python API to upload the ZIP file in the GCS bucket.
As I have mentioned above, processing files (e.g. adding them to a ZIP file) directly in a Google Cloud Storage bucket is not supported. You first need to download them locally in order to do so (see the sketch below). I hope this workaround is helpful to you.
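A minimal sketch of that workaround using the google-cloud-storage client; the bucket name and local paths are placeholders, not taken from the original post:
# Hedged sketch of the list -> download -> zip -> (optionally) upload workaround.
# BUCKET_NAME and the local paths are placeholders.
import os
import zipfile
from google.cloud import storage

BUCKET_NAME = 'my-bucket'
TEMP_DIR = '/tmp/gcs_downloads'

client = storage.Client()
os.makedirs(TEMP_DIR, exist_ok=True)

local_paths = []
for blob in client.list_blobs(BUCKET_NAME):  # 1. list the objects
    if blob.name.endswith('/'):              # skip "directory" placeholders
        continue
    local_path = os.path.join(TEMP_DIR, os.path.basename(blob.name))
    blob.download_to_filename(local_path)    # 2. download each one locally
    local_paths.append(local_path)

with zipfile.ZipFile('/tmp/archive.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
    for path in local_paths:                 # 3. zip the local copies
        zf.write(path, arcname=os.path.basename(path))

for path in local_paths:                     # 4. clean up the downloads
    os.remove(path)

# 5. optionally upload the ZIP back to the bucket
client.bucket(BUCKET_NAME).blob('archive.zip').upload_from_filename('/tmp/archive.zip')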
UPDATE
As I have mentioned above, zipping files while they are in a GCS bucket is not supported. Therefore I have prepared a working example in Python of how to use the workaround.
NOTE: As I am not a professional at operating os commands with the Python library and I am not familiar with the zipfile library, there is probably a better and more efficient way of achieving this. However, the code that can be found in this GitHub link performs the procedure described below.
Under the # Public variables: section, change BUCKET_NAME to your corresponding bucket name and execute the Python script in Google Cloud Shell.
Now my bucket structure is as follows:
gs://my-bucket/test.txt
gs://my-bucket/test1.txt
gs://my-bucket/test2.txt
gs://my-bucket/directory/test4.txt
When you execute the script, the app does the following:
It will get the path from where the script is executed, e.g. /home/username/myapp.
It will create a temporary directory within this directory, e.g. /home/username/myapp/temp.
It will iterate through all the files located in the bucket that you have specified and download them locally into that temp directory.
NOTE: If a file in the bucket is under a directory, it will simply download the file instead of recreating that sub-directory. You can modify the code later to make it work as you desire.
So the new downloaded files will look like this:
/home/username/myapp/temp/test.txt
/home/username/myapp/temp/test1.txt
/home/username/myapp/temp/test2.txt
/home/username/myapp/temp/test4.txt
After that, the code will zip all those files into a new zipedFile.zip that will be located in the same directory as the main.py script that you executed.
When this step is done as well, the script will delete the directory /home/username/myapp/temp/ with all of its contents.
As mentioned above, after executing the script locally, you should be able to see main.py and a zipedFile.zip file containing all the zipped files from the GCS bucket. Now you can take the idea of the implementation and modify it according to your project's needs.
The final code:
zip_buffer = io.BytesIO()
base_path = '/home/everton/compressedfiles/'
fiscal_compentecy_month = datetime.date(int(year), int(month), 1)
revenues = CompanyRevenue.objects.filter(company__pk=company_id, fiscal_compentecy_month=fiscal_compentecy_month)
if revenues.count() > 0:
    path = base_path + str(revenues.first().company.user.pk) + "/"
    zip_name = "{}-{}-{}-{}".format(revenues.first().company.external_id, revenues.first().company.external_name, month, year)
    for revenue in revenues:
        filename = revenue.revenue.name.split('revenues/')[1]
        if not os.path.exists(path):
            os.makedirs(path)
        with open(path + filename, 'wb+') as file:
            file.write(revenue.revenue.read())
            file.close()
    with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
        for file in os.listdir(path):
            zip_file.write(path + file, compress_type=zipfile.ZIP_DEFLATED)
        zip_file.close()
    response = HttpResponse(content_type='application/x-zip-compressed')
    response['Content-Disposition'] = 'attachment; filename={}.zip'.format(zip_name)
    response.write(zip_buffer.getvalue())
    shutil.rmtree(path)
    return response

How to upload in-memory zip file to S3-bucket

I am trying to upload an in-memory zip file to an S3 bucket (in order to avoid temporary files on my server).
Here's my code:
import zipfile
import io
import boto3

zip_bytes_io = io.BytesIO()
zip_archive = zipfile.ZipFile(zip_bytes_io, mode='a',
                              compression=zipfile.ZIP_DEFLATED)
zip_archive.writestr('test.txt', b'My string')

s3_client = boto3.client('s3')
zip_bytes_io.seek(0)  # So that bytes are read from beginning
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')
When I run the above code, it does upload the file to an S3 bucket. However, when I download the file, I am unable to unzip it. I get the error: "Error 1 - operation not permitted."
I assume this is because I'm not uploading the file correctly in the first place. Any help appreciated :)
When looking online for the Error 1 - operation not permitted error, it looks like it might be a macOS issue with opening zip files?
Try to open the file differently?
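One additional observation, not taken from the answer above: the ZipFile in the question is never closed before the upload, so the archive's central directory may never get written, which would also make the downloaded file unopenable. A hedged sketch that finalizes the archive first (the bucket name is a placeholder):
# Hedged sketch (an assumption, not the accepted answer): close the archive so the
# zip central directory is written before uploading the buffer.
import io
import zipfile
import boto3

zip_bytes_io = io.BytesIO()
with zipfile.ZipFile(zip_bytes_io, mode='w', compression=zipfile.ZIP_DEFLATED) as zip_archive:
    zip_archive.writestr('test.txt', b'My string')
# leaving the with-block calls close(), finalizing the archive

zip_bytes_io.seek(0)  # read from the beginning
s3_client = boto3.client('s3')
s3_client.upload_fileobj(zip_bytes_io, 'test-bucket', 'test.zip')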
