search within an unextracted .zip file - python

I'm trying to use python to search within a .zip file without actually extracting the zipped file. I understand re.search can do searches within files, but will it do the same for files that have yet to be extracted?

The way to do this is with the zipfile module, which lets you read the name list and other metadata from the zip file without extracting the contents.
import zipfile

zf = zipfile.ZipFile('example.zip', 'r')
print(zf.namelist())
You can read more in the official documentation for the zipfile module.
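Since the question mentions re.search, here is a minimal sketch of searching member contents without extracting anything to disk. The archive and its contents are fabricated in memory purely for demonstration; with a real file you would pass its path to ZipFile instead.

```python
import io
import re
import zipfile

# Build a small in-memory archive purely for demonstration (hypothetical names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('notes.txt', 'hello zip world')

# Read each member's contents and search it, without extracting to disk.
with zipfile.ZipFile(buf, 'r') as zf:
    for name in zf.namelist():
        text = zf.read(name).decode('utf-8')
        match = re.search(r'zip \w+', text)
        if match:
            print(name, match.group())  # notes.txt zip world
```

Note that zf.read() returns bytes, so you either decode them first or use a bytes pattern with re.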

Related

Extract a list of files with a certain criteria within subdirectory of zip archive in python

I want to access some .jp2 image files inside a zip file and create a list of their paths. The zip file contains a directory named S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE, and I am currently reading the files using glob, after having extracted the folder.
I don't want to have to extract the contents of the zip file first. I read that I cannot use glob within a zip directory, nor can I use wildcards to access files within it, so I am wondering what my options are, apart from extracting to a temporary directory.
The way I am currently getting the list is this:
dirr = r'C:\path-to-folder\S2A_MSIL2A_20170420T103021_N0204_R108_T32UNB_20170420T103454.SAFE'
jp2_files = glob.glob(dirr + '/**/IMG_DATA/**/R60m/*B??_??m.jp2', recursive=True)
There are additional different .jp2 files in the directory, for which reason I am using the glob wildcards to filter the ones I need.
I am hoping to make this work so that I can automate it for many different zip directories. Any help is highly appreciated.
I made it work with zipfile and fnmatch:
from zipfile import ZipFile
import fnmatch
zip_path = 'path_to_zip.zip'
with ZipFile(zip_path, 'r') as zip_obj:
    file_list = zip_obj.namelist()

pattern = '*/R60m/*B???60m.jp2'
filtered_list = []
for file in file_list:
    if fnmatch.fnmatch(file, pattern):
        filtered_list.append(file)
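The filtering loop can also be collapsed with fnmatch.filter, which applies the glob-style pattern to the whole name list at once. A self-contained sketch follows; the member names here are made up stand-ins for the real Sentinel-2 archive layout.

```python
import fnmatch
import io
import zipfile

# In-memory stand-in for the Sentinel-2 archive (member names are made up).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('GRANULE/IMG_DATA/R60m/T32UNB_B02_60m.jp2', b'')
    zf.writestr('GRANULE/IMG_DATA/R60m/readme.txt', b'')

with zipfile.ZipFile(buf, 'r') as zip_obj:
    # fnmatch.filter matches glob-style patterns against the full member list.
    filtered_list = fnmatch.filter(zip_obj.namelist(), '*/R60m/*B??_60m.jp2')

print(filtered_list)  # ['GRANULE/IMG_DATA/R60m/T32UNB_B02_60m.jp2']
```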

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates a document's XML file. My approach is the following: first convert the DOCX document into a zip archive, then extract the contents of that archive in order to have access to the document.xml file, and finally convert the XML to a txt in order to work with it.
So I did all of the above on a document and everything worked perfectly, but when I decided to use a different document, the zipfile library doesn't extract the content of the new ZIP archive. Instead it somehow extracts the contents of the old document that I processed before, and converts the document.xml file into document.txt without me even running the block of code that converts the XML into txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the content of that particular document when it's not even in the path.
This is the code I am using in Jupyter notebook.
import os
import shutil
import zipfile

# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('data/zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('data/extracted/word/document.xml', 'data/extracted/word/document.txt')
# Read the txt file
with open('data/extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/document.txt
The 2nd document's directory tree should be as follows:
data/
    docx/
    zip/
    extracted/
        customXml/
        docProps/
        _rels/
        [Content_Types].xml
        word/document.xml
But zipfile is extracting the contents of the first document even though its DOCX file is no longer in the directory. I am also using Ubuntu 20.04, so I am not sure if it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
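The advice above (absolute paths plus cleanup of leftovers) can be sketched as follows. The layout is fabricated in a temp directory here so the example runs on its own; in the real project, base would point at the data directory.

```python
import os
import shutil
import tempfile
import zipfile

# Self-contained demo in a temp dir; 'document.zip' is fabricated here.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, 'zip'))
zip_path = os.path.join(base, 'zip', 'document.zip')
with zipfile.ZipFile(zip_path, 'w') as zf:
    zf.writestr('word/document.xml', '<w:document/>')

# Always build absolute paths, and clear leftovers before extracting,
# so a previous document's files can never be mistaken for the new one's.
extract_dir = os.path.join(base, 'extracted')
if os.path.isdir(extract_dir):
    shutil.rmtree(extract_dir)
with zipfile.ZipFile(zip_path) as zip_ref:
    zip_ref.extractall(extract_dir)

print(sorted(os.listdir(extract_dir)))  # ['word']
```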

Zipfile namelist() missing members from archive

I'm currently trying to open an .xlsx file with zipfile in Python, listing all files with namelist(), then using .count() to find all images in .png format within the archive.
My problem is that the list returned by the namelist() function contains only 1,680 elements.
After saving the xlsx file as an html, I am able to view all images contained in the excel spreadsheet, and the total file count is 3,352 files.
I checked documentation for zipfile and exhausted the best Google searches I could muster. I appreciate any hints or advice!
Here's the snippet of code I'm using:
import zipfile as zf
xlsx = 'myfile.xlsx'
xlsx_file = zf.ZipFile(xlsx)
fileList = xlsx_file.namelist()
Maybe convert it to a wheel file? Wheel works well for me.
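For the counting itself, a minimal sketch is below. The workbook archive is a fabricated in-memory stand-in; a real .xlsx keeps its images under xl/media/. Note that an xlsx typically stores each distinct image file only once even if it appears in many cells, which may explain why the archive holds fewer files than an HTML export produces.

```python
import io
import zipfile

# Stand-in workbook; a real .xlsx keeps its images under xl/media/.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('xl/media/image1.png', b'')
    zf.writestr('xl/media/image2.png', b'')
    zf.writestr('xl/workbook.xml', b'')

with zipfile.ZipFile(buf) as xlsx_file:
    names = xlsx_file.namelist()
    # Count members by extension rather than list.count(), which needs exact names.
    png_count = sum(1 for name in names if name.lower().endswith('.png'))

print(png_count)  # 2
```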

Python NLTK Make corpus from zip files

I'm trying to create my own corpus in NLTK from ca. 200k text files, each stored in its own zip file. The layout looks like the following:
parent_dir/
    text1.zip
        text1.txt
I'm using the following code to try to access all the text files from the parent directory:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus_path="parent_dir"
corpus=PlaintextCorpusReader(corpus_path,".*")
file_ids=corpus.fileids()
print(file_ids)
But Python just returns an empty list because it probably can't access the text files due to the zipping. Is there an easy way to fix this? Unfortunately, I can't unzip the files because of the size of the files.
If all you're trying to do is get the fileIDs just use the 'glob' module, which doesn't care about file types.
Import the module (glob is part of the standard library, so there is nothing to install):
from glob import glob
Get your directory, using * as a wildcard to match everything in it:
directory = glob('/path/to/your/corpus/*')
The glob() function returns a list of strings (which are file paths, in this case).
You can simply iterate over these to print the file name:
for file in directory:
print(file)
This article looks like an answer to your question about reading the contents of a zipped file: How to read text files in a zipped folder in Python
I think a combination of these methods makes an answer to your problem.
Good luck!
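Putting the two pieces together, glob to find the zips and zipfile to read inside them, might look like the sketch below. The parent_dir layout is fabricated in a temp directory here so the example is self-contained; NLTK itself would then be fed the collected strings rather than the zip paths.

```python
import glob
import os
import tempfile
import zipfile

# Fabricated layout matching the question: each .txt lives in its own zip.
parent_dir = tempfile.mkdtemp()
with zipfile.ZipFile(os.path.join(parent_dir, 'text1.zip'), 'w') as zf:
    zf.writestr('text1.txt', 'some corpus text')

# Collect the raw text of every .txt member without unzipping to disk.
texts = {}
for zip_path in glob.glob(os.path.join(parent_dir, '*.zip')):
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith('.txt'):
                texts[name] = zf.read(name).decode('utf-8')

print(texts)  # {'text1.txt': 'some corpus text'}
```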

Read all files in .zip archive in python

I'm trying to read all files in a .zip archive named data1.zip using the glob() method.
import glob
from zipfile import ZipFile
archive = ZipFile('data1.zip','r')
files = archive.read(glob.glob('*.jpg'))
Error Message:
TypeError: unhashable type: 'list'
The solution to the problem I'm using is:
files = [archive.read(str(i+1)+'.jpg') for i in range(100)]
This is bad because I'm assuming my files are named 1.jpg, 2.jpg, etc.
Is there a better way, using Python best practices, to do this? It doesn't necessarily need to use glob().
glob doesn't look inside your archive, it'll just give you a list of jpg files in your current working directory.
ZipFile already has methods for returning information about the files in the archive: namelist returns names, and infolist returns ZipInfo objects which include metadata as well.
Are you just looking for:
archive = ZipFile('data1.zip', 'r')
files = archive.namelist()
Or if you only want .jpg files:
files = [name for name in archive.namelist() if name.endswith('.jpg')]
Or if you want to read all the contents of each file:
files = [archive.read(name) for name in archive.namelist()]
Although I'd probably rather make a dict mapping names to contents:
files = {name: archive.read(name) for name in archive.namelist()}
That way you can access contents like so:
files['1.jpg']
Or get a list of the files present using files.keys(), etc.
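If you do want glob-style patterns rather than endswith(), the stdlib fnmatch module can filter the member names directly. A self-contained sketch, using a tiny in-memory archive in place of data1.zip:

```python
import fnmatch
import io
import zipfile

# Tiny demo archive; real code would open 'data1.zip' instead.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('1.jpg', b'jpg-bytes')
    zf.writestr('notes.txt', b'')

with zipfile.ZipFile(buf) as archive:
    # fnmatch.filter gives glob-like matching over the member names.
    jpg_names = fnmatch.filter(archive.namelist(), '*.jpg')
    files = {name: archive.read(name) for name in jpg_names}

print(jpg_names)  # ['1.jpg']
```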
