Using an AWS Lambda function, I download a zipped file from S3 and unzip it. For now I do this with extractall; upon unzipping, all files are saved in the /tmp/ folder.
import zipfile

import boto3

s3 = boto3.client('s3')
s3.download_file('test', '10000838.zip', '/tmp/10000838.zip')

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)
After unzipping, I want to gzip each file and place it in another S3 bucket. How can I read all the files from the tmp folder again and gzip each of them? Each gzipped file should be named like this:
$item.csv.gz
I see the gzip module (https://docs.python.org/3/library/gzip.html) but I am not sure which function to use. If it's the compress function, how exactly do I use it? I read in this answer, gzip a file in Python, that I can use gzip.open('', 'wb') to gzip a file, but I couldn't figure out how to apply it to my case. In the open function, do I specify the target location or the source location? And where do I save the gzipped files so that I can later upload them to S3?
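For context, a minimal sketch of the gzip.open approach (the paths and destination bucket name below are placeholders, not from the question): gzip.open is given the target path, the source bytes are copied into it, and the resulting .gz file is then uploaded with boto3.

import gzip
import shutil

import boto3

# placeholder paths and bucket, only for illustration
source_path = "/tmp/example.csv"
gzipped_path = "/tmp/example.csv.gz"

# gzip.open() takes the *target* path; the source is opened with plain open()
with open(source_path, "rb") as src, gzip.open(gzipped_path, "wb") as dst:
    shutil.copyfileobj(src, dst)  # stream the source bytes through the gzip writer

# upload the compressed file to the destination bucket
s3 = boto3.client("s3")
s3.upload_file(gzipped_path, "my-destination-bucket", "example.csv.gz")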
Alternative Option:
Instead of loading everything into the tmp folder, I read that I can also open an output stream, wrap the output stream in a gzip wrapper, and then copy from one stream to the other:
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    testList = []
    for i in zip_ref.namelist():
        if not i.startswith("__MACOSX/"):
            testList.append(i)

    for i in testList:
        zip_ref.open(i, 'r')
but then I am not sure how to continue inside the for loop: how do I open the stream, gzip it, and write the result out?
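One hedged sketch of how that loop could continue (the destination bucket name and key pattern are placeholders, and this assumes each member is small enough to compress in memory): read each member from the zip, compress the bytes with gzip.compress, and upload them with boto3 without touching /tmp.

import gzip
import zipfile

import boto3

s3 = boto3.client('s3')

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    members = [m for m in zip_ref.namelist()
               if not m.startswith("__MACOSX/") and not m.endswith("/")]
    for name in members:
        with zip_ref.open(name, 'r') as member:
            data = gzip.compress(member.read())  # gzip the member in memory
        # 'my-destination-bucket' and the key naming are placeholders
        s3.put_object(Bucket='my-destination-bucket',
                      Key=name + '.csv.gz',
                      Body=data)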
Depending on the sizes of the files, I would skip writing the .gz file(s) to disk. Perhaps something based on s3fs | boto and gzip.
import contextlib
import gzip

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file))

    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)
Note: I have not tested this so if it does not work, let me know.
Related
Suppose I have a bzip2-compressed tar archive x.tar.bz2 stored in S3. I would like to decompress it and place the contents back on S3. This can be achieved by:
import tarfile

from s3fs import S3FileSystem

fs = S3FileSystem()

with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    r = f.read()

with open('x.tar.bz2', mode='wb') as localfile:
    localfile.write(r)

tar = tarfile.open('x.tar.bz2', "r:bz2")
tar.extractall(path='extraction/path')
tar.close()

fs.put('extraction/path', f's3://path_to_destination/x', recursive=True)
With the solution above, I am saving the file content to my local disk twice (the downloaded archive plus the extracted files). I have the following questions (a Python-based solution is expected):
Is it possible (using the tarfile module) to load the data directly from S3 and also extract it back to S3, avoiding storing any data on the local drive?
Is it possible to do this job in a streaming mode, without needing to hold the whole x.tar.bz2 (or at least the uncompressed archive x.tar) in memory?
tarfile.open accepts a file-like object as the fileobj argument, so you can pass it the file object you get from S3FileSystem.open. You can then iterate through the TarInfo objects in the tar object, and open the corresponding path in S3 for writing:
with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            with fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                writer.write(tar.extractfile(entry).read())
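To also address the streaming concern from the question, the per-member read()/write() could be replaced with a chunked copy via shutil.copyfileobj; a sketch under the same assumptions, which also skips non-file entries (for which extractfile returns None). The compressed archive is still read sequentially from S3, but no member is held fully in memory:

import shutil
import tarfile

from s3fs import S3FileSystem

fs = S3FileSystem()

with fs.open('s3://path_to_source/x.tar.bz2', 'rb') as f:
    with tarfile.open(fileobj=f, mode='r:bz2') as tar:
        for entry in tar:
            if not entry.isfile():
                continue  # skip directories and special entries
            with tar.extractfile(entry) as reader, \
                 fs.open(f'path_to_destination/{entry.name}', mode='wb') as writer:
                shutil.copyfileobj(reader, writer)  # copy in fixed-size chunks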
I download a zip file from AWS S3 and unzip it. Upon unzipping, all files are saved in the tmp/ folder.
import zipfile

import boto3

s3 = boto3.client('s3')
s3.download_file('testunzipping', 'DataPump_10000838.zip', '/tmp/DataPump_10000838.zip')

with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp/')
    lstNEW = zip_ref.namelist()
The output of lstNEW is something like this:
['DataPump_10000838/', '__MACOSX/._DataPump_10000838', 'DataPump_10000838/DockBooking', '__MACOSX/DataPump_10000838/._DockBooking', 'DataPump_10000838/LoadEquipment', '__MACOSX/DataPump_10000838/._LoadEquipment', ....]
LoadEquipment and DockBooking are files, but the rest are not. Is it possible to unzip the archive without creating those extra entries? Or is it possible to filter out the real files? Later, I need to take the correct files and gzip each of them, named like this:
$item_$unixepochtimestamp.csv.gz
Do I use the compress function?
To only extract certain files, you can pass a list to extractall:
with zipfile.ZipFile('/tmp/DataPump_10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)
The files are not temporary files; they are macOS's way of representing resource forks in zip archives, which don't normally support them.
I have zip files uploaded by clients through a web server that sometimes contain pesky __MACOSX directories inside that gum things up. How can I remove these?
I thought of using ZipFile, but this answer says that isn't possible and gives this suggestion:
Read out the rest of the archive and write it to a new zip file.
How can I do this with ZipFile? Another Python based alternative like shutil or something similar would also be fine.
The examples below are designed to determine whether a '__MACOSX' entry is contained within a zip file. If this pesky entry exists, a new zip archive is created and all the files that are not __MACOSX files are written to this new archive. The code can be extended to cover .DS_Store files as well. Please let me know if you need to delete the old zip file and replace it with the new clean one.
Hopefully, these answers help you solve your issue.
Example One
from zipfile import ZipFile

original_zip = ZipFile('original.zip', 'r')
new_zip = ZipFile('new_archive.zip', 'w')

for item in original_zip.infolist():
    buffer = original_zip.read(item.filename)
    if not str(item.filename).startswith('__MACOSX/'):
        new_zip.writestr(item, buffer)

new_zip.close()
original_zip.close()
Example Two
import os
from zipfile import ZipFile

def check_archive_for_bad_filename(file):
    zip_file = ZipFile(file, 'r')
    for filename in zip_file.namelist():
        print(filename)
        if filename.startswith('__MACOSX/'):
            return True
    return False

def remove_bad_filename_from_archive(original_file, temporary_file):
    zip_file = ZipFile(original_file, 'r')
    for item in zip_file.namelist():
        buffer = zip_file.read(item)
        if not item.startswith('__MACOSX/'):
            if not os.path.exists(temporary_file):
                new_zip = ZipFile(temporary_file, 'w')
                new_zip.writestr(item, buffer)
                new_zip.close()
            else:
                append_zip = ZipFile(temporary_file, 'a')
                append_zip.writestr(item, buffer)
                append_zip.close()
    zip_file.close()

archive_filename = 'old.zip'
temp_filename = 'new.zip'

results = check_archive_for_bad_filename(archive_filename)
if results:
    print('Removing MACOSX file from archive.')
    remove_bad_filename_from_archive(archive_filename, temp_filename)
else:
    print('No MACOSX file in archive.')
The idea would be to use ZipFile to extract the contents into some defined folder, then remove the __MACOSX entry (os.rmdir, os.remove), and then compress it again.
Depending on whether you have the zip command on your OS, you might be able to skip the re-compressing part, since zip can delete entries from an existing archive in place. You can also control this command from Python using os.system or the subprocess module, as sketched below.
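A minimal sketch of that shortcut, assuming the Info-ZIP zip CLI is installed and on PATH; its -d option deletes matching entries from an existing archive in place:

import subprocess

# delete all __MACOSX entries from the archive in place
# (assumes the `zip` CLI is installed and on PATH)
subprocess.run(['zip', '-d', 'original.zip', '__MACOSX/*'],
               check=False)  # zip exits non-zero if no matching entries were found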
I have trouble using MemoryFS (https://docs.pyfilesystem.org/en/latest/reference/memoryfs.html).
I'm trying to extract a tar archive inside a MemoryFS, but I can't simply hand mem_fs to tarfile, because it is an object and I can't get a real / in-memory path...
import tarfile

import fs
from fs import open_fs, copy

mem_fs = open_fs('mem://')
print(mem_fs.isempty('.'))

fs.copy.copy_file('//TEST_FS', 'test.tar', mem_fs, 'test.tar')
print(mem_fs.listdir('/'))

with mem_fs.open('test.tar') as tar_file:
    print(tar_file.read())
    tar = tarfile.open(tar_file)            # I can't create the tar ...
    tar.extractall(mem_fs + 'Extract_Dir')  # Can't extract it either ...
Can someone help me? Is it possible to do that?
The first argument to tarfile.open is a filename. You're (a) passing it an open file object, and (b) even if you were to pass in a filename, tarfile doesn't know anything about your in-memory filesystem and so wouldn't be able to find the file.
Fortunately, tarfile.open has a fileobj argument that accepts an open file object, so you can write:
with mem_fs.open('test.tar', 'rb') as tar_file:
    tar = tarfile.open(fileobj=tar_file)
    tar.list()
Note that you need to open the file in binary mode (rb).
Of course, now you have a second problem: while you can open and read the archive, the tarfile module still doesn't know about your in-memory filesystem, so attempting to extract files will simply extract them to your local filesystem, which is probably not what you want.
To extract into your in-memory filesystem, you're going to need to read the data from the tar archive member and write it yourself. Here's one option for doing that:
import os
import pathlib
import tarfile

import fs
import fs.copy  # explicit import of the copy helpers

mem_fs = fs.open_fs('mem://')
fs.copy.copy_file('/', '{}/example.tar.gz'.format(os.getcwd()),
                  mem_fs, 'example.tar.gz')

with mem_fs.open('example.tar.gz', 'rb') as fd:
    tar = tarfile.open(fileobj=fd)

    # iterate over list of members
    for member in tar.getmembers():
        # if the member is a file
        if member.isfile():
            # create any necessary directories
            p = pathlib.Path(member.path)
            mem_fs.makedirs(str(p.parent), recreate=True)

            # open the archive member
            with mem_fs.open(member.path, 'wb') as memfd, \
                 tar.extractfile(member.path) as tarfd:
                # and write the data into the memory fs
                memfd.write(tarfd.read())
The tarfile.TarFile.extractfile method returns an open file object to a tar archive member, rather than extracting the file to disk.
Note that the above isn't an optimal solution if you're working with large files (since it reads the entire archive member into memory before writing it out).
I wanted to collect the comment data from multiple zip files (the optional comment you see on the side when opening a Zip or a Rar file), but now I realize that they are not Zip but Rar files. What do I need to change in order for this to work on a Rar file?
import os
import unicodedata
from zipfile import ZipFile

rootFolder = u"C:/Users/user/Desktop/archives/"
zipfiles = [os.path.join(rootFolder, f) for f in os.listdir(rootFolder)]

for zfile in zipfiles:
    print("Opening: {}".format(zfile))
    with ZipFile(zfile, 'r') as testzip:
        print(testzip.comment)  # comment for entire zip
        l = testzip.infolist()  # list all files in archive
        for finfo in l:
            # per file/directory comments
            print("{}:{}".format(finfo.filename, finfo.comment))
You need to use the rarfile module. ZipFile.comment can only read the comment from a ZIP file.
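A hedged sketch of the equivalent loop using the third-party rarfile package, which mirrors the zipfile interface; whether finfo.comment is populated depends on the archive and the installed unrar backend, so treat this as untested:

import os

import rarfile  # third-party: pip install rarfile (needs an unrar backend installed)

rootFolder = u"C:/Users/user/Desktop/archives/"
rarfiles = [os.path.join(rootFolder, f) for f in os.listdir(rootFolder)]

for rfile in rarfiles:
    print("Opening: {}".format(rfile))
    testrar = rarfile.RarFile(rfile)
    print(testrar.comment)            # comment for the entire archive
    for finfo in testrar.infolist():  # per-file info objects
        print("{}:{}".format(finfo.filename, finfo.comment))
    testrar.close()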