I need to use python to pre-process docx (Word) documents, so that pandoc can properly convert them into markdown. One of the key requirements is that the styles of the docx document should be "cleaned up", in particular that the numbering of headings (Heading 1, Heading 2, etc.) should be removed.
Restrictions: I know how to do that using VBA (and likely could do it from python using PyWin32 or such). But it is a requirement that it must be implemented without Microsoft Windows and without LibreOffice/UNO.
How can I use the python-docx package to do that? I have looked at the documentation and there does not seem to be any proper way to do it (actually, the heading numbering style does not seem to be implemented). Did I miss something?
Or should I use another method, such as applying a different Word template to the docx document, with the main styles correctly predefined according to my requirements? Could that be done with an available Python package?
Code in VBA
This is the code in VBA that got the job done:
Sub RemoveHeaderNos()
    ' Remove the header nos
    Debug.Print "Removing header numbers and formatting..."
    For Each s In ActiveDocument.Styles
        s.LinkToListTemplate ListTemplate:=Nothing
    Next
End Sub
On terminology, I'm understanding you to mean "the numbering of heading paragraphs" as opposed to something like page numbers in the page headers, have I got that right? The two terms "heading" and "header" are unfortunately close and mean quite different things, in Word parlance anyway :)
I'm assuming your heading paragraphs are numbered, so that the 'Heading 1' style causes the next sequential integer to be prefixed to the heading paragraph text, like '9. Ninth section heading' (and likewise for Heading 2 -> 9.1, 9.2, etc.).
You're correct that this hasn't been implemented in python-docx yet. You would need to get as close to the XML element in question (perhaps the <w:style> element for Heading 1 for example) as possible using the python-docx API, and then use lxml calls to manipulate the XML under that.
You'd need to start with a strategy for what XML changes you need to make. opc-diag is handy for that. You can change the .docx manually (preferably a radically stripped-down, super-short document) using Word to make it look the way you want, then compare the XML before and after to discover what changes you need to make to the XML.
Then you can validate your strategy by extracting a .docx (using opc-diag), manually updating the XML with the minimum required changes, repackaging it (also using opc-diag), and loading it in Word to make sure it behaves as expected.
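If opc-diag is not at hand, the before/after comparison can be approximated with the standard library alone. This is only a rough sketch and not a replacement for opc-diag: the function names are made up for illustration, and it assumes the part of interest is word/styles.xml, which is where style definitions normally live.

```python
import difflib
import zipfile
from xml.dom import minidom


def pretty_part(docx_file, part_name):
    """Read one XML part from a .docx package and return it as indented lines."""
    with zipfile.ZipFile(docx_file) as package:
        raw = package.read(part_name)
    # Word writes each part as one long line, so pretty-print before diffing.
    return minidom.parseString(raw).toprettyxml(indent="  ").splitlines()


def diff_part(before, after, part_name="word/styles.xml"):
    """Return a unified diff of one part between two .docx files."""
    return list(
        difflib.unified_diff(
            pretty_part(before, part_name),
            pretty_part(after, part_name),
            "before",
            "after",
            lineterm="",
        )
    )
```

Calling diff_part("before.docx", "after.docx") (hypothetical file names) on a copy saved before and after switching the numbering off in Word should show which elements Word removed.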
I suspect there's a way to "disconnect" the "Heading 1" style from a numbering definition in the styles.xml part that would accomplish what you're after and be a fairly straightforward handful of element changes.
Anyway, that's where I would start.
This issue was solved in version 1.17 of pandoc, released on 20 March 2016 ("Don’t turn numbered headers into lists"). If other people meet the same issue, the best thing at this stage would be to upgrade to that version or a later one.
Nevertheless, the exploration of the various solutions with python-docx was interesting, because it indicated a point of possible improvement.
Related
There is a case in my job where I have to remove a specific section (Glossary) from thousands of PDF documents.
The text I want to remove has a different font from the other parts:
Example:
"Floor" the lower surface of a room, on which one may walk.
"exchange" an act of giving one thing and receiving another (especially of the same type or value) in return.
Can you please suggest a way how to do it faster?
One possible way to solve this problem is to find the section you want to delete using a regex, then use one of the Python libraries for PDF editing to delete that section.
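The regex half of that suggestion might look like the sketch below. It assumes the text has already been extracted from the PDF into a plain string by some other tool, and that every glossary entry starts with a term in straight double quotes, as in the two examples above. It only works on the extracted text; deleting the matching content from the PDF itself still needs a PDF editing library. The pattern and function names are illustrative.

```python
import re

# A line that starts with a quoted term followed by its definition.
GLOSSARY_ENTRY = re.compile(r'^[ \t]*"[^"\n]+"\s+\S.*$', re.MULTILINE)


def find_glossary_entries(text):
    """Return every line that looks like a glossary entry."""
    return [match.group(0).strip() for match in GLOSSARY_ENTRY.finditer(text)]


def drop_glossary_entries(text):
    """Return the text with glossary-looking lines removed."""
    kept = [line for line in text.splitlines() if not GLOSSARY_ENTRY.match(line)]
    return "\n".join(kept)
```

Ordinary sentences that happen to begin with a quoted phrase would match too, so the pattern likely needs tightening against real documents.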
I've been trying to find a way to change all the bullet lists in a docx file to number list using python-docx. So far, I've tried using paragraph.style attribute. Doing something like this:
if paragraph.style.name == 'Bullet List':
    paragraph.style = styles['List Number']
It works on basic docx files but sometimes, with more complex documents, paragraph.style.name returns something like 'Body Text' even though the given paragraph appears as a bullet list in the document. I was just wondering if it's possible to achieve this using python-docx library or I might have to look for something else. Thank you.
The short answer is no. What you're seeing are bullets applied to a paragraph using the toolbar button, which applies the bullet formatting directly to the paragraph. Any formatting applied at the paragraph level (the lowest level) overrides formatting inherited from a style.
What you'd need to do to fix this is to remove the manual paragraph formatting (possibly by selecting paragraphs and pressing Ctrl-Q) as described here and in other web resources as well I'm sure: https://www.okbar.org/lpt_articles/removing-formatting-from-word-documents/
After those "overrides" are removed, the style should be free to do its work.
There is no python-docx API counterpart to "remove-all-formatting". If you wanted to do it programmatically you would need to manipulate the XML yourself. python-docx can get you to the paragraph element with p = paragraph._p and then print(p.xml) can show you what the XML looks like, but from there you're on your own to manipulate that XML subtree with lxml calls. Search on python-docx workaround function for some ideas on what that looks like.
I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.
I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.
For debugging I have tried:
from docx import Document

doc = Document(input_file)
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))
Which only seems to print out about half of the file's contents.
There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?
Edit: the missing text is contained within a mail merge field if that makes any difference
The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.
This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).
The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx can be handy, like Paragraph, etc. once you've gotten hold of the elements you want.
I am trying to create a program in Python that can find a specific word in a .docx file and return the page number that it occurred on. So far, in looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just Python? Or if not, what would be the best way to do this?
Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which do this (although I expect Word itself does) and how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something to iterate through to a specific page, but just thinking it through a bit it's going to be some detailed development there to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
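For the "just Python" route, the w:lastRenderedPageBreak idea can be sketched with the standard library only. This assumes the file was last saved by a client that writes those markers, that the main part is word/document.xml (the usual location), and that the marker is also written after hard page breaks, so only the marker is counted. The function names are made up, and a word split across several runs will be missed.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(path):
    """Return the raw main document part of a .docx file."""
    with zipfile.ZipFile(path) as package:
        return package.read("word/document.xml")


def pages_containing(document_xml, word):
    """Return the sorted 1-based page numbers whose text contains word."""
    root = ET.fromstring(document_xml)
    page = 1
    hits = set()
    # iter() walks the tree in document order, so markers are seen in sequence.
    for node in root.iter():
        if node.tag == W + "lastRenderedPageBreak":
            page += 1
        elif node.tag == W + "t" and node.text and word in node.text:
            hits.add(page)
    return sorted(hits)
```

The page numbers are only as good as the last rendering that was saved, so treat them as approximate.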
Using Python-docx: identify a page break in paragraph
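import re

from docx import Document

# python-docx only reads the .docx format, whatever the file is called
fn = '1.doc'
document = Document(fn)
pn = 1  # current page number, counted from explicit page breaks only
for p in document.paragraphs:
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        # a hard page break is stored as <w:br w:type="page"/> inside a run
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)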
I'm working on a project that is going to extract specified text from a pdf document. I have no experience with this type of extraction. One issue is that we don't just want a dump of all the text in the document. Rather, is there a way to extract only certain fields in the pdf? Is there a notion of pdf templates that could be used for something like this?
I'm trying to use Apple's Automator - this is able to get all the text but not specified text. Ideally, I would like someone in Pages to have for example 30 discrete rows of text and have 20 of those rows be specified as 'catalog item' and have our Automator script take ONLY those twenty lines.
Any ideas on best workflow / extraction tools for this? I would prefer only consumer level items be used such as Apple Pages, Automator, and ruby or python as a scripting language.
thx
edit #1
looks like tagged PDFs might be one way to do this - not sure how well supported on Apple Pages this is
With python, the best choice would probably be PDFMiner. It can extract the coordinates for every text string, so you can work out the rectangles in your form on your own and pick out what falls within them. It's all pretty low level, but PDF is unfortunately a pretty low level format.
Be warned that unless you already know a lot about the structure of PDF, you'll find the API and documentation rather scanty. Look around for usage examples, including here on SO.
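As a rough illustration of the coordinate approach, the sketch below uses pdfminer.six, the maintained fork of PDFMiner, which is an assumption about which version you would install. The function names and the region tuple are made up for the example. Coordinates are in PDF points with the origin at the bottom-left of the page.

```python
def inside(bbox, region):
    """True if bbox (x0, y0, x1, y1) lies entirely within region."""
    x0, y0, x1, y1 = bbox
    rx0, ry0, rx1, ry1 = region
    return x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1


def text_in_region(pdf_path, region, page_number=0):
    """Return the text blocks on one page that fall inside region."""
    # Imported here so the geometry helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    found = []
    # page_numbers is zero-based in pdfminer.six.
    for layout in extract_pages(pdf_path, page_numbers=[page_number]):
        for element in layout:
            if isinstance(element, LTTextContainer) and inside(element.bbox, region):
                found.append(element.get_text().strip())
    return found
```

You would still have to work out the rectangle for each field by hand, for example by printing every element's bbox once and reading the values off.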
For Ruby you might try pdf-reader for parsing a PDF and accessing both metadata and content. Extracting the specific items you're interested in is another story, but how to go about doing that depends highly on what format of data you're expecting.
You can use Origami in Ruby, a framework designed to parse, analyze, and forge PDF documents, or the Python equivalent: Origapy, a simple Python interface for the Ruby-based Origami.