Missing document text when using python-docx - python

I am using python-docx 0.8.6 and Python 3.6 to perform a simple search/replace operation.
I'm having a problem where not all of the document's text appears when iterating over doc.paragraphs.
For debugging I have tried:
from docx import Document

doc = Document(input_file)
fullText = []
for para in doc.paragraphs:
    fullText.append(para.text)
print('\n'.join(fullText))
Which only seems to print out about half of the file's contents.
There are no tables or special formatting in the file. Is there any reason why so much of the document's contents cannot be read by python-docx?
Edit: the missing text is contained within a mail merge field, if that makes any difference.

The mail merge field does make a difference. Unfortunately, python-docx is not sophisticated enough to know which "container" elements hold displayable text and which do not. So it only reports paragraphs (and tables) that are at the "top" level.
This is also a limitation when it comes to revision marks, for example, which have two or more pieces of text of which only one appears, depending on the revision marks setting (show original, show latest after edits, etc.).
The only way around it with python-docx is to navigate the XML yourself, although some of the domain objects in python-docx can be handy, like Paragraph, etc. once you've gotten hold of the elements you want.

Related

How to copy-paste docx (containing text, image and formatting) to another docx template in a specific spot?

I'm currently struggling with the following problem:
I work on a project that requires me to generate documents (docx) for auto-documentation purposes; they contain text, images, and formatting. Each part of my documentation (there are almost 15 parts) is generated as its own separate docx.
I have a global documentation template (docx) that expects to be filled with these separate documents at specific locations (for technical reasons, I can't generate a complete document directly).
Do you know a way to copy-paste these separate documents into the global documentation template?
Thank you!
I tried using QuickParts from Word together with the MailMerge package, but it seems that I can only fill these QuickParts with text.
I found ways to copy-paste one docx into another wholesale, but I didn't find a way to paste at a specific spot.

Is there a way to replace all the bullet lists with numbered list in docx file using python-docx?

I've been trying to find a way to change all the bullet lists in a docx file to numbered lists using python-docx. So far, I've tried using the paragraph.style attribute, doing something like this:
if paragraph.style.name == 'Bullet List':
    paragraph.style = styles['List Number']
It works on basic docx files but sometimes, with more complex documents, paragraph.style.name returns something like 'Body Text' even though the given paragraph appears as a bullet list in the document. I was just wondering if it's possible to achieve this using python-docx library or I might have to look for something else. Thank you.
The short answer is no. What you're seeing are bullets applied to a paragraph using the toolbar button, which applies the bullet formatting directly to the paragraph. Any formatting applied at the paragraph level (the lowest level) overrides formatting inherited from a style.
What you'd need to do to fix this is to remove the manual paragraph formatting (possibly by selecting paragraphs and pressing Ctrl-Q) as described here and in other web resources as well I'm sure: https://www.okbar.org/lpt_articles/removing-formatting-from-word-documents/
After those "overrides" are removed, the style should be free to do its work.
There is no python-docx API counterpart to "remove-all-formatting". If you wanted to do it programmatically you would need to manipulate the XML yourself. python-docx can get you to the paragraph element with p = paragraph._p and then print(p.xml) can show you what the XML looks like, but from there you're on your own to manipulate that XML subtree with lxml calls. Search on python-docx workaround function for some ideas on what that looks like.

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Is there a library that can do this?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
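A minimal sketch of the two parts, assuming the pdfminer.six distribution of PDFMiner is installed: first_page_text pulls the text of page one, and summarize is a deliberately naive word-frequency summarizer written for this example. It is a stand-in for a real summarization library, not a recommendation, and both function names are made up.

```python
import re
from collections import Counter


def first_page_text(pdf_path):
    """Return the text of the first page of a PDF using pdfminer.six."""
    # Imported here so summarize() stays usable without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=[0])  # page numbers are zero-based


def summarize(text, max_sentences=3):
    """Naive extractive summary: keep the sentences richest in frequent words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore very short words as a crude substitute for a stop-word list.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in top)


# Usage (placeholder file name):
# print(summarize(first_page_text("report.pdf")))
```

Because text in a PDF may not come out in reading order, the sentence splitting here can misfire on multi-column pages.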
If you're using macOS, there is a built-in text summarizing Service. Right-click on any selected text and click "Summarize" to activate it, though it seems hard to incorporate this into any automated process.

Removing heading numbers from docx

I need to use python to pre-process docx (Word) documents, so that pandoc can properly convert them into markdown. One of the key requirements is that the styles of the docx document should be "cleaned up", in particular that the numbering of headings (Heading 1, Heading 2, etc.) should be removed.
Restrictions: I know how to do that using VBA (and likely could do it from python using PyWin32 or such). But it is a requirement that it must be implemented without Microsoft Windows and without LibreOffice/UNO.
How can I use the python-docx package to do that? I have looked at the documentation and there does not seem to be any proper way to do it (actually the heading numbering style does not seem to be implemented). Did I miss something?
Unless I should use another method, such as applying a different Word template to the docx document, with the main styles correctly predefined according to my requirements? Could that be done through an available python package?
Code in VBA
This is the code in VBA that got the job done:
Sub RemoveHeaderNos()
    ' Remove the header nos
    Debug.Print "Removing header numbers and formatting..."
    For Each s In ActiveDocument.Styles
        s.LinkToListTemplate ListTemplate:=Nothing
    Next
End Sub
On terminology, I'm understanding you to mean "the numbering of heading paragraphs" as opposed to something like page numbers in the page headers, have I got that right? The two terms "heading" and "header" are unfortunately close and mean quite different things, in Word parlance anyway :)
I'm assuming your paragraph headings are numbered, so that the 'Heading 1' style causes the next sequential integer to be prefixed to the heading paragraph text, like '9. Ninth section heading' (and likewise for Heading 2 -> 9.1, 9.2, etc.).
You're correct that this hasn't been implemented in python-docx yet. You would need to get as close to the XML element in question (perhaps the <w:style> element for Heading 1 for example) as possible using the python-docx API, and then use lxml calls to manipulate the XML under that.
You'd need to start with a strategy for what XML changes you need to make. opc-diag is handy for that. You can change the .docx manually (preferably a radically stripped-down, super-short document) using Word to make it look the way you want, then compare the XML before and after to discover what changes you need to make to the XML.
Then you can validate your strategy by extracting a .docx (using opc-diag), manually updating the XML with the minimum required changes, repackage it (also using opc-diag), and load it in Word to make sure it behaves as expected.
I suspect there's a way to "disconnect" the "Heading 1" style from a numbering definition in the styles.xml part that would accomplish what you're after and be a fairly straightforward handful of element changes.
Anyway, that's where I would start.
This issue was solved in version 1.17 of pandoc, released on 20 March 2016 ("Don’t turn numbered headers into lists"). If other people meet the same issue, the best thing at this stage would be to upgrade to that version or a later one.
Nevertheless, exploring the various solutions with python-docx was interesting, because it pointed to a possible improvement.

how to create docx files with python

I am trying to take my data and put it in tables in either Microsoft Word or LibreOffice Writer.
I need to be able to change the background of cells within the table, and I need to be able to change the page orientation to 'landscape'.
I have been looking for a library with simple code (I am a beginner in coding), but I did not find one that does what I need.
Do you know of anything that would work for me? Examples of how to use it would make it easier for me to learn.
Check out this project
And here is a great quick-start guide
It's pretty simple to use. I haven't tested this, but it should work:
from docx import Document

document = Document()
r = 2  # number of rows you want
c = 2  # number of columns you want
table = document.add_table(rows=r, cols=c)
table.style = 'Light Shading Accent 1'  # table style name; see the documentation for others
for y in range(r):
    for x in range(c):
        cell = table.cell(y, x)  # look up the cell at row y, column x
        cell.text = 'text goes here'
document.save('demo.docx')  # save the document
I don't think you can set the page orientation property with this library, but what you could do is create a blank Word document that is in landscape yourself, store it in the working directory, and make a copy of it every time you generate this document.
The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
1. Create a *.docx document with marker text where you want your headings, table cells, etc.
2. Use lxml to find those markers and copy their parent elements (along with their formatting).
3. Use those found elements to create templates.
4. Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
5. Zip your xml into a new *.docx file.
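A stripped-down sketch of the first and last steps, using only the standard library: it copies every part of a .docx archive and swaps marker strings in word/document.xml for data. It skips the element-copying steps entirely and is not how the linked project works internally. The function name fill_markers and the {{NAME}} marker style are assumptions made for this example.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_markers(template, output, replacements):
    """Copy a .docx archive, replacing marker text in word/document.xml.

    template and output may be paths or binary file-like objects.
    """
    with zipfile.ZipFile(template) as src, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")  # assumes a UTF-8 encoded part
                for marker, value in replacements.items():
                    # Escape &, < and > so the data cannot break the XML.
                    xml = xml.replace(marker, escape(value))
                data = xml.encode("utf-8")
            dst.writestr(item.filename, data)


# Usage (placeholder file names and marker):
# fill_markers("template.docx", "invoice.docx", {"{{NAME}}": "Acme & Co."})
```

This only works when each marker sits inside a single run. Word often splits typed text across several runs, which is one reason to locate markers with lxml and work on their parent elements instead.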
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
