I have a really large 7z file in an S3 bucket, say s3://tempbucket1/Test_For7zip.7z, that runs into several tens of GB. I do not want to download it, unzip it, and re-upload it back to S3. I want to use Boto3 to unzip it on the fly and save it back into S3.
I tried to solve this using the lzma package, based on a previous SO answer that dealt with on-the-fly unzipping of *.gz files using the fileobj option of gzip.GzipFile.
from io import BytesIO
import gzip
import lzma

import boto3

# setup constants
bucket = 'tempbucket1'
gzipped_key = 'Test_For7zip.7z'
uncompressed_key = 'Test_Unzip7zip'

# initialize the S3 client; this depends on your AWS config being done
s3 = boto3.client('s3', use_ssl=False)

s3.upload_fileobj(                     # upload a new object to S3
    Fileobj=lzma.LZMAFile(
        BytesIO(s3.get_object(Bucket=bucket,
                              Key=gzipped_key)['Body'].read()),
        'rb'),                         # read binary
    Bucket=bucket,                     # target bucket, writing to
    Key=uncompressed_key)              # target key, writing to
However, this throws the following error:
LZMAError: Input format not supported by decoder
Is there a Python package that can decode 7z files from a BytesIO object, or is there a better way of achieving this?
I have never tried this, but Googling turned up this as a possible solution. Please report back through this post if it solves your problem.
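If that pointer was to py7zr (an assumption on my part, since the link is not included here), a minimal sketch of the idea might look like the following; the bucket and key come from the question, the target prefix is made up, and the whole archive still has to be buffered in memory because 7z decoding needs a seekable source, so for tens of GB this may not be practical:

import io

import boto3
import py7zr  # third-party package: pip install py7zr

bucket = 'tempbucket1'
archive_key = 'Test_For7zip.7z'

s3 = boto3.client('s3')

# py7zr needs a seekable file object, so the streaming body is buffered in memory first
archive_bytes = io.BytesIO(
    s3.get_object(Bucket=bucket, Key=archive_key)['Body'].read())

with py7zr.SevenZipFile(archive_bytes, mode='r') as archive:
    # readall() returns a dict mapping member names to BytesIO buffers
    for name, member in archive.readall().items():
        member.seek(0)
        s3.upload_fileobj(member, Bucket=bucket, Key=f'unzipped/{name}')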
Related
Around 60 CSV files are generated daily in my S3 bucket. The average size of each file is around 500 MB. I want to zip all these files through a Lambda function on the fly (without downloading a file inside the Lambda execution) and upload the zipped files to another S3 bucket. I came across these solutions 1 and 2, but I am still having issues with the implementation. Right now, I am trying to stream CSV file data into a zipped file (this zip file is being created in Lambda's tmp directory) and then upload it to S3. But I am getting this error message while writing into the zip file:
[Errno 36] File name too long
This is my test Lambda function where I am just trying with one file, but in the actual case I need to zip 50-60 CSV files individually:
import boto3
import zipfile

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    iterator = s3.Object('bucket-name', 'file-name').get()['Body'].iter_lines()
    my_zip = zipfile.ZipFile('/tmp/test.zip', 'w')
    for line in iterator:
        # ZipFile.write() expects a *filename*, so each CSV line is being
        # treated as a path, which is what triggers "File name too long"
        my_zip.write(line)
    s3_resource.meta.client.upload_fileobj(file-name, "another-bucket-name", "object-name")
Also, is there a way I can stream data from my CSV file, zip it, and upload it to another S3 bucket without actually saving a full zip file in Lambda's memory?
After a lot of research and trials, I was able to make it work. I used the smart_open library for my issue and managed to zip a 550 MB file with just 150 MB of memory usage in my Lambda. To use the external library, I had to use Layers in Lambda. Here is my code:
from smart_open import open, register_compressor
import lzma, os

def lambda_handler(event, context):
    with open('s3://bucket-name-where-large-file/file-key-name') as fin:
        with open('s3://bucket-name-to-put-zip-file/zip-file-key-name', 'w') as fout:
            for line in fin:
                fout.write(line)
Please note that smart_open supports .gz and .bz2 compression out of the box. If you want to compress files in other formats, you can register your own compressor using the library's register_compressor method.
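For example, here is a rough sketch of registering an .xz handler, following the pattern from smart_open's documentation; the extension and handler name are just illustrative choices:

import lzma

from smart_open import open, register_compressor

def _handle_xz(file_obj, mode):
    # wrap the underlying S3 stream so keys ending in .xz are (de)compressed transparently
    return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)

register_compressor('.xz', _handle_xz)

# after registration, .xz keys behave like the built-in .gz/.bz2 support:
# with open('s3://bucket-name/file-key-name.xz', 'w') as fout:
#     fout.write(line)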
I am trying to upload an in-memory zip file to an S3 bucket (in order to avoid temporary files on my server).
Here's my code:
import io
import zipfile

import boto3

zip_bytes_io = io.BytesIO()
zip_archive = zipfile.ZipFile(zip_bytes_io, mode='a',
                              compression=zipfile.ZIP_DEFLATED)
zip_archive.writestr('test.txt', b'My string')

s3_client = boto3.client('s3')
zip_bytes_io.seek(0)  # so that bytes are read from the beginning
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')
When I run the above code, it does upload the file to the S3 bucket. However, when I download the file, I am unable to unzip it. I get the error: "Error 1 - operation not permitted."
I assume this is because I'm not uploading the file correctly in the first place. Any help appreciated :)
When searching for the "Error 1 - operation not permitted" error online, it looks like it might be a macOS issue with opening zip files? Try opening the file differently?
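For what it's worth, one thing worth double-checking on the upload side is that the archive is finalized before it is sent: zipfile only writes the zip's central directory when the ZipFile is closed, so uploading the buffer before closing produces an archive many tools refuse to open. A minimal sketch of that pattern, with a placeholder bucket name:

import io
import zipfile

import boto3

zip_bytes_io = io.BytesIO()
# using the ZipFile as a context manager guarantees close() runs,
# which writes the central directory into the buffer
with zipfile.ZipFile(zip_bytes_io, mode='w',
                     compression=zipfile.ZIP_DEFLATED) as zip_archive:
    zip_archive.writestr('test.txt', b'My string')

zip_bytes_io.seek(0)  # rewind so upload_fileobj reads from the start
s3_client = boto3.client('s3')
s3_client.upload_fileobj(zip_bytes_io, 'test-bucket', 'test.zip')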
I have a Python function that prints some output. I am going to put that into AWS Lambda and would like to print this output directly to a file in an S3 bucket.
So, essentially, something to redirect stdout to a file in an S3 bucket.
This is how I call the function:
recurse_for_values(top_vault_prefix, top_level_keys)
Almost the same as Thomas L., but with io so that you don't actually write any file locally (everything is in memory).
import io
import tinys3
f=io.StringIO(u"some initial text data")
conn = tinys3.Connection('yourAccessKey', 'yourSecretKey', tls=True)
conn.upload('yourS3Key', f, 'yourBucketName')
f.close()
OR
you could use something like S3FS (https://github.com/s3fs-fuse/s3fs-fuse) to mount your S3 bucket as a disk and then simply redirect the output as you would for a local disk, but I heavily discourage this option. S3 is definitely not a filesystem and shouldn't be used as such. For example, it generates loads of requests to S3 (and thus costs) and may not be that reliable...
You can write your logs to a local file and then synchronise it with AWS S3 at the end of your script.
import tinys3
log_file_object = open("logfile", "w")
log_file_object.write("Some logs...")
log_file_object.write("Some other logs...")
log_file_object.close()

conn = tinys3.Connection('S3_ACCESS_KEY', 'S3_SECRET_KEY', tls=True)
conn.upload('log_file_DDMMYYYY', open("logfile", 'rb'), 'my_bucket')
You can also use boto3 to upload your file, but tinys3 is easier for tiny use cases.
Hope it helps.
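If you would rather stay with boto3, a rough sketch of capturing the prints in memory and writing them to S3 in one call could look like this; the function call is the one from the question, and the bucket and key names are placeholders:

import io
from contextlib import redirect_stdout

import boto3

buffer = io.StringIO()
with redirect_stdout(buffer):
    # anything printed inside this block goes into the in-memory buffer
    recurse_for_values(top_vault_prefix, top_level_keys)

s3 = boto3.client('s3')
s3.put_object(Bucket='my_bucket',
              Key='log_file_DDMMYYYY',
              Body=buffer.getvalue().encode('utf-8'))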
I have a zipfile on my Google Drive. In that zipfile is an XML file, which I want to parse, extract specific information from, and save on my local computer (or wherever).
My goal is to use Python and the Google Drive API (with the help of PyDrive) to achieve this. The workflow could be as follows:
1. Connect to my Google Drive via the Google Drive API (PyDrive)
2. Get my zipfile id
3. Load my zipfile to memory
4. Unzip, obtain the XML file
5. Parse the XML, extract the desired information
6. Save it as a csv on my local computer
Right now, I am able to do steps 1, 2, 4, 5, and 6, but I don't know how to load the zipfile into memory (step 3) without writing it to my local HDD first.
The following PyDrive code will obtain the zipfile and place it on my local HDD, which is not exactly what I want:
toUnzip = drive.CreateFile({'id':'MY_FILE_ID'})
toUnzip.GetContentFile('zipstuff.zip')
I guess one solution could be as follows:
I could read the zipfile as a string with some encoding:
toUnzip = drive.CreateFile({'id':'MY_FILE_ID'})
zipAsString = toUnzip.GetContentString(encoding='??')
and then I could somehow (no idea how, perhaps StringIO could be useful) read this string with the Python zipfile library. Is this solution even possible? Is there a better way?
You could try StringIO (or io.BytesIO in Python 3, since zip data is binary); these objects emulate files but reside in memory.
Here is the code from a related SO post:
# get_zip_data() gets a zip archive containing 'foo.txt', reading 'hey, foo'
import io
import zipfile

zipdata = io.BytesIO()
zipdata.write(get_zip_data())
myzipfile = zipfile.ZipFile(zipdata)
foofile = myzipfile.open('foo.txt')
print(foofile.read())
# output: "hey, foo"
or using a URL:
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

url = urlopen("http://www.test.com/file.zip")
zf = ZipFile(BytesIO(url.read()))
Hope this helps.
Eventually, I solved it using BytesIO and the cp862 encoding:
from io import BytesIO
import zipfile

toUnzipStringContent = toUnzip.GetContentString(encoding='cp862')
toUnzipBytesContent = BytesIO(toUnzipStringContent.encode('cp862'))
readZipfile = zipfile.ZipFile(toUnzipBytesContent, "r")
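From there, steps 4 to 6 can also stay in memory; a rough sketch, assuming the archive holds a single XML member and using made-up 'record' and 'value' names for the elements being extracted:

import csv
import xml.etree.ElementTree as ET

# readZipfile is the in-memory ZipFile built above
xml_name = readZipfile.namelist()[0]  # assume the first member is the XML file
with readZipfile.open(xml_name) as xml_member:
    root = ET.parse(xml_member).getroot()

# 'record' and 'value' are placeholders for the real element and attribute names
with open('extracted.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for record in root.iter('record'):
        writer.writerow([record.get('value')])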
I have a zip file loaded into memory (it is not persisted on disk). The zip file contains jpg images. I am trying to upload each jpg to S3, but I am getting an error.
# already have an opened zipfile stored in zip_file
# already connected to s3
files = zip_file.namelist()
for f in files:
    im = io.BytesIO(zip_file.read(f))
    s3_key.key = f
    s3_key.set_contents_from_stream(im)
I get the following error:
BotoClientError: BotoClientError: s3 does not support chunked transfer
What am I doing wrong?
Here is the solution. I was overthinking the problem.
files = zip_file.namelist()
for f in files:
    data = zip_file.read(f)
    s3_key._key.key = f
    s3_key._key.set_contents_from_string(data)
That's all it took.
Boto supports storage services other than S3, such as Google Cloud Storage. The set_contents_from_stream method only works for services that support chunked transfer (see https://codereview.appspot.com/4515170), and S3 does not support that (see the Technical FAQs at http://aws.amazon.com/articles/1109).
It's unfortunate, but you can't upload from a stream to S3.
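If you are on boto3 rather than boto 2, a rough sketch of the same per-member approach would be to read each member fully and hand the bytes to upload_fileobj; the bucket name here is a placeholder:

import io

import boto3

s3 = boto3.client('s3')

for name in zip_file.namelist():
    # read() returns the whole member as bytes; wrapping it in BytesIO
    # gives boto3 the seekable file object that upload_fileobj expects
    s3.upload_fileobj(io.BytesIO(zip_file.read(name)),
                      Bucket='my-image-bucket',
                      Key=name)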