Python NLTK Make corpus from zip files - python

I'm trying to create my own corpus in NLTK from ca. 200k text files each stored in it's own zip folder. It looks like the following:
Parent_dir
text1.zip
text1.txt
I'm using the following code and try to access all the text files from the parent directory:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus_path="parent_dir"
corpus=PlaintextCorpusReader(corpus_path,".*")
file_ids=corpus.fileids()
print(file_ids)
But Python just returns an empty list because it probably can't access the text files due to the zipping. Is there an easy way to fix this? Unfortunately, I can't unzip the files because of the size of the files.

If all you're trying to do is get the fileIDs just use the 'glob' module, which doesn't care about file types.
Import the module (if you don't have it go ahead and pip install glob):
from glob import glob
Get your directory use * as a wildcard to get everything in the directory:
directory = glob('/path/to/your/corpus/*')
The glob() method returns a list of strings (which are file paths, in this case).
You can simply iterate over these to print the file name:
for file in directory:
print(file)
This article looks like an answer to your question about reading the contents of a zipped file: How to read text files in a zipped folder in Python
I think a combination of these methods makes an answer to your problem.
Good luck!

Related

Python, Inconsistent zip file extraction

I am trying to extract zip files using the zipfile module's extractall method.
My code snippet is
import zipfile
file_path = '/something/airway.zip'
dir_path = 'something/'
with zipfile.ZipFile(file_path, "r") as zip_ref:
zip_ref.extractall(dir_path)
I have two zip files named, test (1.1 mb) and airway (520 mb).
For test.zip the folder contains all the files but for airway.zip, it creates another folder inside my target folder named Airway, and then extracts all the files there. Even after renaming the airway.zip to any garbage name, the result was same.
Is there some workaround to get only the files extracted in my target folder? It is critical for me as I'm doing this extraction automated from django
Python version: 3.9.6;
Django version: 2.2
I ran your code and it seems to be only a problem of the zipfile itself. If you create a zipfile by selecting only the elements you get the result you got with test.zip. If you create it by selecting a folder holding the elements the folder will be there if you extract it again, no matter what you name your zip file.
I have two articles related to this:
https://www.kite.com/python/docs/zipfile.ZipFile.extractall
https://www.geeksforgeeks.org/working-zip-files-python/
Even if both of these articles do not solve your problem then I think that instead of zipping the files in the folder you just zipped the folder itself so try by zipping the files inside the folder.

zipfile.ZipFile extracts the wrong file

I am working on a project that manipulates with a document's xml file. My approach is like the following. First convert the DOCX document into a zip archive, then extract the contents of that archive in order to have access to the document.xml file, and finally convert the XML to a txt in order to work with it.
So i did all the above on a document, and everything worked perfectly, but when i decided to use a different document, the Zipfile library doesnt extract the content of the new ZIP archive, however it somehow extracts the contents of the old document that i processed before, and converts the document.xml file into document.txt without even me even running that block of code that converts the XML into txt.
The worst part is the old document is not even in the directory anymore, so i have no idea how Zipfile is extracting the content of that particular document when its not even in the path.
This is the code I am using in Jupyter notebook.
import shutil
import zipfile
# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')
# Extract the ZIP
with zipfile.ZipFile('zip/document.zip', 'r') as zip_ref:
zip_ref.extractall('data/extracted/')
# Convert "document.xml" to txt
os.rename('extracted/word/document.xml', 'extracted/word/document.txt')
# Read the txt file
with open('extracted/word/document.txt') as intxt:
data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.txt
The 2nd document's directory tree should be as following
data -
1-docx
2-zip
3-extracted/-
1-customXml/
2-docProps/
3-_rels
4-[Content_Types].xml
5-word/-document.xml
But Zipfile is extracting the contents of the first document even when the DOCX file is not in the directory.I am also using Ubuntu 20.04 so i am not sure if it has to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.

How to save os.listdir output as list

I am trying to save all entries of os.listdir("./oldcsv") separately in a list but I don't know how to manipulate the output before it is processed.
What I am trying to do is generate a list containing the absolute pathnames of all *.csv files in a folder, which can later be used to easily manipulate those files' contents. I don't want to put lots of hardcoded pathnames in the script, as it is annoying and hard to read.
import os
for file in os.listdir("./oldcsv"):
if file.endswith(".csv"):
print(os.path.join("/oldcsv", file))
Normally I would use a loop with .append but in this case I cannot do so, since os.listdir just seems to create a "blob" of content. Probably there is an easy solution out there, but my brain won't think of it.
There's a glob module in the standard library that can solve your problem with a single function call:
import glob
csv_files = glob.glob("./*.csv") # get all .csv files from the working dir
assert isinstance(csv_files, list)

How can I recursively find the *directories* containing a file of a certain type?

I have a set of .bam, files scattered inside a tree of folders. Not every directory contains such a file. I know how to recursively get the path of the files themselves using glob, but not the directory containing them.
import glob2
bam_files = glob2.glob('/data2/**/*.bam')
print bam_files
The above code gives the .bam files, but I want just the folders. Wondering if there is a direct way to do this using glob without regular expressions.
Use a set and os.path.dirname() [https://docs.python.org/2/library/os.path.html#os.path.dirname]:
import glob2
import os
bam_dirs = {os.path.dirname(p) for p in glob2.glob('/data2/**/*.bam')}
print bam_dirs

How to copy all subfolders into an existing folder in Python 3?

I am sure I am missing this, as it has to be easy, but I have looked in Google, The Docs, and SO and I can not find a simple way to copy a bunch of directories into another directory that already exists from Python 3?
All of the answers I find recommend using shutil.copytree but as I said, I need to copy a bunch of folders, and all their contents, into an existing folder (and keep any folders or that already existed in the destination)
Is there a way to do this on windows?
I would look into using the os module built into python. Specifically os.walk which returns all the files and subdirectories within the directory you point it to. I will not just tell you exactly how to do it, however here is some sample code that i use to backup my files. I use the os and zipfile modules.
import zipfile
import time
import os
def main():
date = time.strftime("%m.%d.%Y")
zf = zipfile.ZipFile('/media/morpheous/PORTEUS/' + date, 'w')
for root, j, files in os.walk('/home/morpheous/Documents'):
for i in files:
zf.write(os.path.join(root, i))
zf.close()
Hope this helps. Also it should be pointed out that i do this in linux, however with a few changes to the file paths it should work the same

Categories