Collecting comment data from multiple Rar files without unzipping - python

I wanted to collect the comment data from multiple zip files (the optional comment you see in the side panel when opening a Zip or Rar archive),
but I now realize the archives are not Zip but Rar files. What do I need to change for this to work on Rar files?
import os
import unicodedata
from zipfile import ZipFile

rootFolder = u"C:/Users/user/Desktop/archives/"
zipfiles = [os.path.join(rootFolder, f) for f in os.listdir(rootFolder)]

for zfile in zipfiles:
    print("Opening: {}".format(zfile))
    with ZipFile(zfile, 'r') as testzip:
        print(testzip.comment)  # comment for entire zip
        l = testzip.infolist()  # list all files in archive
        for finfo in l:
            # per file/directory comments
            print("{}:{}".format(finfo.filename, finfo.comment))

You need to use the rarfile module. ZipFile can only read comments from ZIP archives; it cannot open RAR files at all.
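A rough, untested sketch of the same loop with the third-party rarfile package (pip install rarfile; it may also need an unrar backend installed on the system) could look like this. The rarfile API mirrors zipfile, and the archive-level comment is exposed as RarFile.comment; the folder path is taken from the question, and the per-file part is only an assumption:

import os
import rarfile  # third-party package; may require an unrar backend to be installed

rootFolder = u"C:/Users/user/Desktop/archives/"
rarfiles = [os.path.join(rootFolder, f) for f in os.listdir(rootFolder)
            if f.lower().endswith(".rar")]

for rfile in rarfiles:
    print("Opening: {}".format(rfile))
    with rarfile.RarFile(rfile) as testrar:
        print(testrar.comment)  # comment for the entire archive (None if absent)
        for finfo in testrar.infolist():
            # rarfile does not reliably expose per-file comments,
            # so only the member names are printed here
            print(finfo.filename)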

Related

Extract a list of files with a certain criteria within subdirectory of zip archive in python

I want to access some .jp2 image files inside a zip file and create a list of their paths. The zip file contains a folder named S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE, and I am currently reading the files using glob, after having extracted the folder.
I don't want to have to extract the contents of the zip file first. I read that I cannot use glob within a zip directory, nor can I use wildcards to access files within it, so I am wondering what my options are, apart from extracting to a temporary directory.
The way I am currently getting the list is this:
dirr = r'C:\path-to-folder\S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE'
jp2_files = glob.glob(dirr + '/**/IMG_DATA/**/R60m/*B??_??m.jp2', recursive=True)
There are additional different .jp2 files in the directory, for which reason I am using the glob wildcards to filter the ones I need.
I am hoping to make this work so that I can automate it for many different zip directories. Any help is highly appreciated.
I made it work with zipfile and fnmatch:
from zipfile import ZipFile
import fnmatch

zip_path = 'path_to_zip.zip'  # path to the zip archive

with ZipFile(zip_path, 'r') as zipObj:
    file_list = zipObj.namelist()
    pattern = '*/R60m/*B???60m.jp2'
    filtered_list = []
    for file in file_list:
        if fnmatch.fnmatch(file, pattern):
            filtered_list.append(file)
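As a side note, fnmatch.filter does the same filtering in a single call, given the same zipObj and pattern as above:

filtered_list = fnmatch.filter(zipObj.namelist(), pattern)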

how to gzip files in tmp folder

Using an AWS Lambda function, I download an S3 zipped file and unzip it.
For now I do it using extractall. Upon unzipping, all files are saved in the tmp/ folder.
s3.download_file('test', '10000838.zip', '/tmp/10000838.zip')
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)
After unzipping, I want to gzip the files and place them in another S3 bucket.
Now, how can I read all the files from the tmp folder again and gzip each one so that each ends up as $item.csv.gz?
I see this (https://docs.python.org/3/library/gzip.html) but I am not sure which function to use.
If it's the compress function, how exactly do I use it? I read in this answer (gzip a file in Python) that I can use gzip.open('', 'wb') to gzip a file, but I couldn't figure out how to apply it in my case. In the open call, do I specify the target location or the source location? And where do I save the gzipped files so that I can later upload them to S3?
Alternative Option:
Instead of loading everything into the tmp folder, I read that I can also open an output stream, wrap it in a gzip wrapper, and then copy from one stream to the other:
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    testList = []
    for i in zip_ref.namelist():
        if not i.startswith("__MACOSX/"):
            testList.append(i)
    for i in testList:
        zip_ref.open(i, 'r')
but then again I am not sure how to continue inside the for loop, open the stream, and convert the files there.
Depending on the sizes of the files, I would skip writing the .gz file(s) to disk. Perhaps something based on s3fs | boto and gzip.
import contextlib
import gzip

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file))
    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)
Note: I have not tested this so if it does not work, let me know.
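If you would rather write the .gz files to /tmp first and upload them afterwards, here is a minimal, untested sketch of the gzip.open approach asked about above: you pass the target path to gzip.open, while the source file is opened normally. The /tmp/*.csv glob, bucket name, and s3 client are placeholders:

import glob
import gzip
import os
import shutil

for path in glob.glob('/tmp/*.csv'):  # every extracted csv
    gz_path = path + '.gz'
    # source is opened plain, target through gzip, which compresses on write
    with open(path, 'rb') as src, gzip.open(gz_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)
    # afterwards upload the compressed file, e.g.:
    # s3.upload_file(gz_path, 'target-bucket', os.path.basename(gz_path))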

Remove auto-generated __MACOSX folder from inside a zip file in Python

I have zip files uploaded by clients through a web server that sometimes contain pesky __MACOSX directories inside that gum things up. How can I remove these?
I thought of using ZipFile, but this answer says that isn't possible and gives this suggestion:
Read out the rest of the archive and write it to a new zip file.
How can I do this with ZipFile? Another Python based alternative like shutil or something similar would also be fine.
The examples below are designed to determine whether a '__MACOSX' entry is contained within a zip file. If this pesky entry exists, a new zip archive is created and all the files that are not __MACOSX files are written to it. This code can be extended to cover .DS_Store files as well. Please let me know if you need to delete the old zip file and replace it with the new clean one.
Hopefully, these answers help you solve your issue.
Example One
from zipfile import ZipFile

original_zip = ZipFile('original.zip', 'r')
new_zip = ZipFile('new_archive.zip', 'w')
for item in original_zip.infolist():
    buffer = original_zip.read(item.filename)
    if not str(item.filename).startswith('__MACOSX/'):
        new_zip.writestr(item, buffer)
new_zip.close()
original_zip.close()
Example Two
import os
from zipfile import ZipFile

def check_archive_for_bad_filename(file):
    zip_file = ZipFile(file, 'r')
    for filename in zip_file.namelist():
        print(filename)
        if filename.startswith('__MACOSX/'):
            return True
    return False

def remove_bad_filename_from_archive(original_file, temporary_file):
    zip_file = ZipFile(original_file, 'r')
    for item in zip_file.namelist():
        buffer = zip_file.read(item)
        if not item.startswith('__MACOSX/'):
            if not os.path.exists(temporary_file):
                new_zip = ZipFile(temporary_file, 'w')
                new_zip.writestr(item, buffer)
                new_zip.close()
            else:
                append_zip = ZipFile(temporary_file, 'a')
                append_zip.writestr(item, buffer)
                append_zip.close()
    zip_file.close()

archive_filename = 'old.zip'
temp_filename = 'new.zip'

results = check_archive_for_bad_filename(archive_filename)
if results:
    print('Removing MACOSX file from archive.')
    remove_bad_filename_from_archive(archive_filename, temp_filename)
else:
    print('No MACOSX file in archive.')
The idea would be to use ZipFile to extract the contents into some defined folder, then remove the __MACOSX entry (os.rmdir, os.remove), and then compress it again.
Depending on whether you have the zip command available on your OS, you might be able to skip the re-compressing part. You could also drive that command from Python with os.system or the subprocess module, as sketched below.
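For example, a minimal sketch assuming the Info-ZIP zip command is installed on the system (its -d flag deletes entries from an existing archive in place):

import subprocess

# zip expands the wildcard itself, so no shell is needed;
# note that zip exits non-zero if no entry matches, which makes check=True raise
subprocess.run(['zip', '-d', 'old.zip', '__MACOSX/*'], check=True)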

How to read csv files from multiple zip folders located in the same directory using python?

I have multiple zip folders named zip_folder_1, zip_folder_2, ..., zip_folder_n.
All these zip folders are located in the same directory. Each one of these zip folders contains a csv file named "selected_file.csv".
I need to read each "selected_file.csv" from each of the zip folders and concatenate them into a single file.
Could someone give me a hint on the required python code to solve this problem? I appreciate your help!
This should produce concatenated_data.csv in your working directory, and assumes that all files in my_data_dir are zip files with data in them.
import os
import zipfile

import numpy as np

def add_data_to_file(new_data, file_name):
    if os.path.isfile(file_name):
        mode = 'ab'
    else:
        mode = 'wb'
    with open(file_name, mode) as f:
        np.savetxt(f, np.array([new_data]), delimiter=',')

my_data_dir = 'C:/my/zip/data/dir/'
data_files = os.listdir(my_data_dir)

for data_file in data_files:
    full_path = os.path.join(my_data_dir, data_file)
    with zipfile.ZipFile(full_path, 'r', zipfile.ZIP_DEFLATED) as zip_file:
        with zip_file.open('selected_file.csv', 'r') as selected_file:
            data = np.loadtxt(selected_file, delimiter=",")
            add_data_to_file(data, 'concatenated_data.csv')
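Note that np.loadtxt/np.savetxt only cope with purely numeric CSVs and drop any header row. If the files have headers or mixed column types, a hedged alternative using pandas (assuming it is installed and the same 'selected_file.csv' name inside each archive) would be:

import os
import zipfile

import pandas as pd

my_data_dir = 'C:/my/zip/data/dir/'

frames = []
for data_file in os.listdir(my_data_dir):
    full_path = os.path.join(my_data_dir, data_file)
    with zipfile.ZipFile(full_path, 'r') as zip_file:
        # read_csv accepts the file-like object returned by ZipFile.open
        with zip_file.open('selected_file.csv') as selected_file:
            frames.append(pd.read_csv(selected_file))

pd.concat(frames, ignore_index=True).to_csv('concatenated_data.csv', index=False)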

search within an unextracted .zip file

I'm trying to use python to search within a .zip file without actually extracting the zipped file. I understand re.search can do searches within files, but will it do the same for files that have yet to be extracted?
The way to do this is with the zipfile module; it allows reading the name list and other meta info from the zip file prior to extracting the content.
import zipfile

zf = zipfile.ZipFile('example.zip', 'r')
print(zf.namelist())
You can read more in the zipfile documentation.
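To actually search the contents without extracting anything to disk, you can read each member into memory and run re.search over the decoded text. A rough sketch, with the pattern and encoding as placeholders:

import re
import zipfile

pattern = re.compile(r'some pattern')  # placeholder pattern

with zipfile.ZipFile('example.zip', 'r') as zf:
    for name in zf.namelist():
        text = zf.read(name).decode('utf-8', errors='ignore')
        if pattern.search(text):
            print('match in', name)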
