I am working on a project that manipulates a DOCX document's XML file. My approach is the following: first convert the DOCX document into a ZIP archive, then extract the contents of that archive in order to get access to the document.xml file, and finally convert the XML to a txt file in order to work with it.
So I did all of the above on one document and everything worked perfectly. But when I decided to use a different document, the zipfile library doesn't extract the contents of the new ZIP archive; instead it somehow extracts the contents of the old document that I processed before, and converts that document.xml file into document.txt without me even running the block of code that converts the XML into txt.
The worst part is that the old document is not even in the directory anymore, so I have no idea how zipfile is extracting the contents of that particular document when it's not even in the path.
This is the code I am using in a Jupyter notebook:
import os
import shutil
import zipfile

# Convert the DOCX to ZIP
shutil.copyfile('data/docx/input.docx', 'data/zip/document.zip')

# Extract the ZIP
with zipfile.ZipFile('data/zip/document.zip', 'r') as zip_ref:
    zip_ref.extractall('data/extracted/')

# Convert "document.xml" to txt
os.rename('data/extracted/word/document.xml', 'data/extracted/word/document.txt')

# Read the txt file
with open('data/extracted/word/document.txt') as intxt:
    data = intxt.read()
This is the directory tree for the extracted zip archive for the first document.
data/
  docx/
  zip/
  extracted/
    customXml/
    docProps/
    _rels/
    [Content_Types].xml
    word/
      document.txt
The second document's directory tree should be as follows:
data/
  docx/
  zip/
  extracted/
    customXml/
    docProps/
    _rels/
    [Content_Types].xml
    word/
      document.xml
But zipfile is extracting the contents of the first document even though that DOCX file is not in the directory anymore. I am also using Ubuntu 20.04, so I am not sure whether it has something to do with my OS.
I suspect that you are having issues with relative paths, as unzipping any Word document will create the same file/directory structure. I'd suggest using absolute paths to avoid this. What you may also want to do is, after you are done manipulating and processing the extracted files and directories, delete them. That way you won't encounter any issues with lingering files.
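For illustration, here is a minimal sketch of that idea, reusing the data/ layout from the question (the exact paths are placeholders):
import os
import shutil
import zipfile

# Build absolute paths so nothing depends on the notebook's current working directory
base = os.path.abspath('data')
docx_path = os.path.join(base, 'docx', 'input.docx')
zip_path = os.path.join(base, 'zip', 'document.zip')
extract_dir = os.path.join(base, 'extracted')

# Remove whatever a previous run left behind before extracting the new document
if os.path.isdir(extract_dir):
    shutil.rmtree(extract_dir)

shutil.copyfile(docx_path, zip_path)
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Read the document XML directly; renaming it to .txt is not strictly necessary
with open(os.path.join(extract_dir, 'word', 'document.xml')) as intxt:
    data = intxt.read()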
Related
I'm currently working with Semantically Enriched Wikipedia.
The resource is inside a 7.5 GB tar.gz archive, and each file inside it is an XML file whose schema is:
<text>
Plain text
</text>
<annotation>
Annotation for plain text
</annotation>
The current task is to extract each file and then parse the content inside the tags.
The first thing I did was to use the tarfile module and its extractall() method, but during the extraction I got this error:
OSError: [Errno 22] Invalid argument: '.\\sew_conservative\\wiki384\\Live_%3F%21*%40_Like_a_Suicide.xml'
while part of it is correctly extracted (I thought the error was due to the unicode chars inside the XML file name, but I'm now seeing that every file has them).
So I planned to work on each file inside the archive using some of the API's methods and the code below.
Unfortunately, the TarInfo object which wraps each file doesn't give access to the file content, and extracting the archive file by file takes too much time.
import tarfile
from pathlib import Path

def parse_sew():
    sew_path = Path("C:/Users/beppe/Desktop/Tesi/Semantically Enriched Wikipedia/sew_conservative.tar.gz")
    with tarfile.open(sew_path, mode='r') as t:
        for item in t:
            # extraction
            pass
Is extraction mandatory in order to parse and use the content of the XML files, or is it possible to read the archive content on-the-fly (without extracting anything) and then parse it?
UPDATE: I'm extracting the files via the tar -xvzf filename.tar.gz command and everything is going well, but after 15 minutes I was able to process only 500 MB of the hundred GB.
I would suggest you use 7zip for extraction. You can start the 7zip extraction from Python and then, while it is extracting, read the files that have already been extracted. This will save quite a lot of time. You can implement this using threads.
Secondly, don't use forward slashes when giving a Windows path. You can use \\ in place of /.
You can also try using shutil as follows:
shutil.unpack_archive('path_to_your_archive', 'path_to_extract')
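On the on-the-fly part of the question: as far as I know, the standard tarfile module can also read a member's bytes without writing anything to disk, via TarFile.extractfile(). A rough sketch, assuming the archive name from the question and UTF-8 encoded XML files:
import tarfile

# Stream each member's content straight out of the compressed archive,
# without extracting anything to disk
with tarfile.open("sew_conservative.tar.gz", mode="r:gz") as t:
    for member in t:
        if not member.isfile():
            continue
        f = t.extractfile(member)   # file-like object over the member's bytes
        if f is None:
            continue
        xml_text = f.read().decode("utf-8")
        # ... parse xml_text here ...
Gzip still has to be decompressed sequentially, so this avoids the disk writes but not the decompression time.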
I'm currently trying to open an .xlsx file with zipfile in Python, finding all files with namelist(), then using .count() to find all images in .png format within the archive.
My problem currently is that the list returned by the namelist() function contains only 1,680 elements.
After saving the xlsx file as HTML, I am able to view all the images contained in the Excel spreadsheet, and the total file count is 3,352 files.
I checked documentation for zipfile and exhausted the best Google searches I could muster. I appreciate any hints or advice!
Here's the snippet of code I'm using:
import zipfile as zf
xlsx = 'myfile.xlsx'
xlsx_file = zf.ZipFile(xlsx)
fileList = xlsx_file.namelist()
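The counting step described above isn't shown in the snippet; presumably it is something along these lines, where matching on the '.png' suffix is my assumption:
# Count the entries in the archive listing that look like PNG images
png_count = sum(1 for name in fileList if name.lower().endswith('.png'))
print(png_count)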
Maybe convert it to a wheel file? Wheel works well for me.
I'm trying to create my own corpus in NLTK from ca. 200k text files, each stored in its own zip folder. It looks like the following:
Parent_dir/
  text1.zip
    text1.txt
I'm using the following code to try to access all the text files from the parent directory:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
corpus_path="parent_dir"
corpus=PlaintextCorpusReader(corpus_path,".*")
file_ids=corpus.fileids()
print(file_ids)
But Python just returns an empty list because it probably can't access the text files due to the zipping. Is there an easy way to fix this? Unfortunately, I can't unzip the files because of the size of the files.
If all you're trying to do is get the file IDs, just use the glob module, which doesn't care about file types.
Import it (glob is part of the standard library, so there is nothing to install):
from glob import glob
Get your directory; use * as a wildcard to get everything in the directory:
directory = glob('/path/to/your/corpus/*')
The glob() function returns a list of strings (which are file paths, in this case).
You can simply iterate over these to print the file name:
for file in directory:
    print(file)
This article looks like an answer to your question about reading the contents of a zipped file: How to read text files in a zipped folder in Python
I think a combination of these methods, sketched below, makes an answer to your problem.
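For instance, a rough sketch of that combination, assuming each zip archive holds a single .txt file as in the layout shown in the question:
import zipfile
from glob import glob

corpus_path = "parent_dir"
texts = {}

# Walk every zip archive under the parent directory and read the .txt file(s) inside
for zip_path in glob(corpus_path + "/*.zip"):
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".txt"):
                with zf.open(name) as f:
                    texts[name] = f.read().decode("utf-8")
The decoded strings could then be handed to NLTK directly instead of going through PlaintextCorpusReader.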
Good luck!
I'm working with zipped files in python for the first time, and I'm stumped.
I read the documentation for zipfile, but I'm not sure what would be the best way to do what I'm trying to do. I have a zipped folder with CSV files inside, and I'd like to be able to open the zip file, and retrieve certain values from the csv files inside.
Do I use zipfile.extract(file name here) to bring it to the current working directory? And if I do that, do I just use the file name to work with the file, or does this index or list them differently?
Currently, I manually extract all files in the zipped folder to the current working directory for my project, and then use the csv module to read them. All I'm really trying to do is remove that step.
Any and all help would be greatly appreciated!
You are looking to avoid extracting to disk. In the zipfile docs for Python there is ZipFile.open(), which gives you a file-like object, i.e. an object that mostly behaves like a regular file on disk but is held in memory. It gives bytes when read, at least in Python 3.
Something like this...
from zipfile import ZipFile
import csv
import io

with ZipFile('abc.zip') as myzip:
    print(myzip.filelist)
    for mf in myzip.filelist:
        with myzip.open(mf.filename) as myfile:
            mc = myfile.read()                         # bytes content of the member
            c = csv.reader(io.StringIO(mc.decode()))   # decode, then parse as CSV
            for row in c:
                print(row)
The documentation of Python is actually quite good once one has learned how to find things as well as some of the basic programming terms/descriptions used in the documentation.
The csv module does not read a bytes stream directly, hence the extra step of decoding and wrapping the text in io.StringIO.
I have a *.tar.gz compressed file that I would like to read in with Python 2.7. The file contains multiple h5 formatted files as well as a few text files. I'm a novice with Python. Here is the code I'm trying to adapt:
subset_path='c:\data\grant\files'
f=gzip.open(filename,'subset_full.tar.gz')
subset_data_path=os.path.join(subset_path,'f')
The first statement identifies the path to the folder with the data. The second statement tells Python to open a specific compressed file and the third statement (hopefully) executes a join of the prior two statements.
Several lines below this code I get an error when Python tries to use the 'subset_data_path' assignment.
What's going on?
The gzip module will only open a single file that has been compressed, i.e. my_file.gz. You have a tar archive of multiple files that are also compressed. This needs to be both untarred and uncompressed.
Try using the tarfile module instead, see https://docs.python.org/2/library/tarfile.html#examples
edit: To add a bit more information on what has happened, you have successfully opened the zipped tarball into a gzip file object, which will work almost the same as a standard file object. For instance you could call f.readlines() as if f was a normal file object and it would return the uncompressed lines.
However, this did not actually unpack the archive into new files in the filesystem. You did not create a subdirectory 'c:\data\grant\files\f', and so when you try to use the path subset_data_path you are looking for a directory that does not exist.
The following ought to work:
import os
import tarfile

subset_path = r'c:\data\grant\files'   # raw string so the backslashes are not treated as escapes
tar = tarfile.open("subset_full.tar.gz")
tar.extractall(subset_path)
tar.close()
subset_data_path = os.path.join(subset_path, 'subset_full')