I am writing a parser for .doc and .docx files and need to obtain various metadata from documents in these formats. For example, for .docx I need to get the underlying XML and then work with its tags. What solutions would help with this? Libraries like python-docx are not suitable, because they only work with text.
If you need the raw .docx data, you will probably have to work with it at a low level: a .docx file is a ZIP archive, so open it with zipfile and read the metadata with xml.etree.
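A minimal sketch of that low-level approach, using only the standard library. It assumes a regular .docx that contains the usual docProps/core.xml part; the helper names read_part and read_core_properties are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by the core-properties part (docProps/core.xml).
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}


def read_part(docx, part_name):
    """Return the parsed XML root of one part inside the .docx archive."""
    with zipfile.ZipFile(docx) as archive:
        return ET.fromstring(archive.read(part_name))


def read_core_properties(docx):
    """Collect a few common metadata fields; missing ones map to None."""
    root = read_part(docx, "docProps/core.xml")
    fields = {
        "title": "dc:title",
        "creator": "dc:creator",
        "created": "dcterms:created",
        "modified": "dcterms:modified",
        "last_modified_by": "cp:lastModifiedBy",
    }
    return {
        name: root.findtext(path, namespaces=NS)
        for name, path in fields.items()
    }
```

Calling read_part(path, "word/document.xml") gives you the root of the main document body, so you can walk its tags with the usual ElementTree methods. Both helpers accept a file path or an open binary file object.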
I'm pretty much stuck right now.
I wrote a parser in Python 3 using the python-docx library to extract all tables found in an existing .docx and store them in a Python data structure.
So far so good. Works as it should.
Now I have the problem that these tables contain hyperlinks, which I definitely need. Because of the underlying XML structure, the docx library doesn't catch them: neither the URL nor the display text is returned. I found many people with similar concerns, but most didn't seem to have exactly this dilemma.
I thought about unpacking the .docx, scanning the _rels document for the corresponding 'rId', and filling in the data I already have with the links found in the _rels XML.
Either way, it seems seriously tedious to do it that way, so I was wondering if there is a more Pythonic way, or if somebody has good advice on how to tackle this problem.
You can extract the links by parsing the XML of the .docx file.
You can extract all text from the document by using document.element.iter() (older code calls the deprecated getiterator() instead).
Iterate over all the XML tags and extract their text. That gives you the data python-docx failed to extract.
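Here is a rough sketch of the XML route for hyperlinks, standard library only. It assumes the links are stored as w:hyperlink elements whose r:id points into word/_rels/document.xml.rels; links that Word stored as HYPERLINK field codes would not be found this way. The function name extract_hyperlinks is invented for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def extract_hyperlinks(docx):
    """Return (display_text, url) pairs for hyperlinks in the document body."""
    with zipfile.ZipFile(docx) as archive:
        rels = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (rId1, rId2, ...) to their targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{REL}}}Relationship")
    }

    links = []
    # iter() walks the whole tree, so links inside table cells are included.
    for link in body.iter(f"{{{W}}}hyperlink"):
        rid = link.get(f"{{{R}}}id")
        if rid is None:
            continue  # internal bookmark link, no relationship entry
        text = "".join(t.text or "" for t in link.iter(f"{{{W}}}t"))
        links.append((text, targets.get(rid)))
    return links
```

To tie a link back to a specific table cell you would iterate over the w:tc elements first and run the inner loop per cell, but the lookup against the relationships part stays the same.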
I am trying to extract information from different types of files in Python (.pdf, .doc, .docx) and convert it to .txt, but while processing different files I get spaces and newlines where none are required, along with many other issues. I have tried PyPDF2 and PDF manager. Please suggest something I can use to extract information from these files.
EDIT
Currently looking for something which can help me extract exact text from .pdf files. I have tried PyPDF, PDFMiner and PDF Manager and I am getting issues with some pdfs in all of them.
Personally, I think pdfminer is the best Python module for extracting information from PDFs.
I think you can refer to this link
for corresponding file formats.
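Whichever extractor you end up with, stray spaces and blank lines can be cleaned up afterwards. A small sketch of such a post-processing step follows; the function name tidy_extracted_text and the exact rules (collapse spaces, trim lines, keep at most one blank line) are my own assumptions, not something any of the libraries above provide.

```python
import re


def tidy_extracted_text(raw):
    """Normalize whitespace in text pulled out of a PDF or Word file."""
    # Use "\n" as the only line ending.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs inside a line.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces at the start and end of each line.
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

This will not repair text that the extractor returned in the wrong order; it only deals with the whitespace noise.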
I have been given a series of folders with large numbers of Word documents in .xml format. They each contain some VBA code, but the code in all of them has already been run, so I don't need to keep it.
I need to print all of the files in each folder, but due to constraints on XML files on the network, I can't simply mass-print them from Windows Explorer, so I need to convert them to .docx (or .doc) first.
How can I go about doing this? I tried a simple python script using python-docx:
import os
from docx import Document
folderPath=<folderpath>
fileNamesList=os.listdir(folderPath)
for xmlFileName in fileNamesList:
    currentDoc=Document(os.path.join(folderPath,xmlFileName))
    docxFileName=xmlFileName.replace('.xml','.docx')
    currentDoc.save(os.path.join(folderPath,docxFileName))
    currentDoc.close()
This gives:
docx.opc.exceptions.PackageNotFoundError: Package not found at <first file name>.xml
I'm guessing this is because python-docx isn't meant to open .xml files, but that's a pretty uneducated guess. Searching around for this error, all I can find are problems with it not being installed properly (which it is, as far as I can tell) or using .doc files instead of .docx.
Am I simply using python-docx incorrectly? If not, is there a more suitable package or technique I should be using?
It's not clear just what sort of files those .xml files are, but I suspect they are in a transitional format used, I think, in Word 2003. That format was XML-based, but it is not the Open Packaging Conventions (OPC) format used for Word documents since Word 2007.
python-docx is not going to read those, now or ever, so you'll either need to convert them to .docx format or parse the XML directly.
If I had Windows available, I suppose I would use VBA to write a short conversion script and then work with the .docx files using python-docx. I would start by seeing if Word could load the .xml file and go from there.
You might be able to find a utility to do a bulk conversion, but I didn't find any on a quick search.
If all you're interested in is a one-time print, and Word will load the files, then a VBA script that prints them without the conversion step might be a good option. python-docx doesn't print .docx files; it only reads and writes them.
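To find out what kind of .xml files these are before choosing a route, you can look at the root element. The sketch below assumes two likely candidates: Word 2003 XML (root w:wordDocument) and the newer single-file "Flat OPC" form (root pkg:package). The function name sniff_word_xml and the returned labels are invented for the example.

```python
import xml.etree.ElementTree as ET

WORD_2003_NS = "http://schemas.microsoft.com/office/word/2003/wordml"
FLAT_OPC_NS = "http://schemas.microsoft.com/office/2006/xmlPackage"


def sniff_word_xml(handle):
    """Classify an XML document by its root element.

    `handle` is a file object opened in binary mode. Only the start of the
    file is parsed, because the first "start" event is always the root.
    """
    _event, root = next(ET.iterparse(handle, events=("start",)))
    if root.tag == f"{{{WORD_2003_NS}}}wordDocument":
        return "word-2003-xml"
    if root.tag == f"{{{FLAT_OPC_NS}}}package":
        return "flat-opc"
    return "unknown"
```

Typical use would be `with open(path, "rb") as handle: kind = sniff_word_xml(handle)`. Neither kind can be opened by python-docx directly, so this only tells you which conversion you are facing.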
I'm parsing docx files using the python-docx library. I need to read the header of each document as well as its paragraphs, but I can't find anything about reading document headers in the documentation. There is documentation on writing a header to a new file but none on reading one. Is there a way to do this?
I had the same problem.
Instead of the "python-docx" package, I used a separate package called python-docx2txt
(https://github.com/ankushshah89/python-docx2txt), which extracts the text, including the headers, in one line of code.
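If you would rather not add a dependency, the header text can also be read straight from the archive. This is only a sketch: it assumes the header parts follow the conventional names word/header1.xml, word/header2.xml and so on, and the function name read_header_text is made up. More recent python-docx releases also expose headers through section.header, which may be simpler if you already use that library.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_header_text(docx):
    """Map each header part (word/header1.xml, ...) to its paragraph texts."""
    headers = {}
    with zipfile.ZipFile(docx) as archive:
        for name in sorted(archive.namelist()):
            if not re.fullmatch(r"word/header\d*\.xml", name):
                continue
            root = ET.fromstring(archive.read(name))
            paragraphs = []
            for para in root.iter(f"{{{W}}}p"):
                # A paragraph's text is split across runs; join the w:t pieces.
                text = "".join(t.text or "" for t in para.iter(f"{{{W}}}t"))
                paragraphs.append(text)
            headers[name] = paragraphs
    return headers
```

A document can have several header parts (first page, even pages, one per section), which is why the result is keyed by part name.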
I need some help extracting information from an .sgm file using Python. Is there a library suited to this specific kind of file, or will libraries used to extract information from .xml files work as well? If no such library is available, can you please suggest a good module I can download to work with .sgm files?
And above all, can you please explain the difference, if any, between an .xml and an .sgm file?
Thank you!
Here are a few libraries which can be used to parse .sgm files:
SGMLtools-lite
BeautifulSoup
The major difference between SGML and XML is that SGML permits the following:
Unclosed start-tags
Unclosed end-tags
Empty start-tags
Empty end-tags
References
Comparison of SGML and XML
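Because SGML allows tags that a strict XML parser rejects, a lenient parser is the safer choice. As a dependency-free illustration, here is a sketch built on the standard library's html.parser rather than on BeautifulSoup; the class and function names are invented, and it assumes the tag you are collecting has explicit end tags. Note that HTMLParser lowercases tag names and treats a few HTML element names such as script and style specially.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag.lower()  # HTMLParser reports tag names in lowercase
        self.depth = 0          # > 0 while we are inside the wanted tag
        self.chunks = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.chunks).strip())
                self.chunks = []

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def collect_tag_text(sgml_text, tag):
    """Return the text content of each <tag>...</tag> in the input."""
    parser = TagTextCollector(tag)
    parser.feed(sgml_text)
    parser.close()
    return parser.results
```

Other tags that are left unclosed, or attributes without quotes, do not stop the parser, which is the property you want for .sgm input.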