Read ms document table properties in python-docx - python

I have a Word document with multiple tables, and I want to check whether all the table contents are left-aligned. I would also like to know the font size used in the document's tables. How can I check that using python-docx? Please let me know if any alternative package can do the same in a Python environment.
Thanks in advance
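One possible approach, sketched with only the standard library rather than python-docx: read the main document part from the .docx (a zip archive) and look at the `w:jc` (alignment) and `w:sz` (font size, in half-points) elements inside each table. The function names here are made up for illustration, and `word/document.xml` is assumed to be the main part, which is the usual location. A paragraph or run that sets nothing directly inherits from its style, so `None` or an empty list means "inherited", not "unknown".

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(path):
    """Return the raw XML of the main document part (assumed location)."""
    with zipfile.ZipFile(path) as zf:
        return zf.read("word/document.xml")


def table_formatting(document_xml):
    """Yield (table_index, alignment, font_sizes_pt) per paragraph in a table.

    alignment is the w:jc value, or None when the paragraph sets none.
    font_sizes_pt lists only sizes set directly on runs.
    Nested tables are visited again as tables of their own.
    """
    root = ET.fromstring(document_xml)
    for t_idx, tbl in enumerate(root.iter(f"{{{W}}}tbl")):
        for p in tbl.iter(f"{{{W}}}p"):
            jc = p.find("w:pPr/w:jc", NS)
            align = jc.get(f"{{{W}}}val") if jc is not None else None
            sizes = sorted({
                int(sz.get(f"{{{W}}}val")) / 2  # w:sz is stored in half-points
                for sz in p.iterfind("w:r/w:rPr/w:sz", NS)
            })
            yield t_idx, align, sizes


def all_left_aligned(document_xml):
    """True when no table paragraph asks for a non-left alignment directly."""
    return all(a in (None, "left", "start")
               for _, a, _ in table_formatting(document_xml))
```

Something like `all_left_aligned(read_document_xml("report.docx"))` would then answer the first half of the question for direct formatting; alignment or size coming from a style would need the styles part as well.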

Related

python-docx cannot read a table inserted from excel

I need to deal with tables in many Word files. Some of them were created as native Word tables, which can be read using python-docx.
However, some of them were inserted from Excel, and I don't know why python-docx cannot read them. Here is the piece of code I wrote as a test. As you can see in the terminal, there is nothing in the list variable 'tables'.
import docx
from docx import Document
docFile = 'a.docx'
document = Document(docFile)
tables = document.tables
print(tables)
Can anyone help? Thanks a lot!
I'm fighting the same issue using Pages on OSX to create a .docx template. I've found that Format > Arrange > Object Placement needs to be set to "Move with text" for the table; changing it to have any alignment or formatting causes the tables to disappear in Python and be read as paragraphs that contain nothing. Having looked at the XML of both and at the python-docx code, I'm suspicious of w:tblInd, but I'm not clued up enough to go much further. I see recent GitHub issues covering this, so hopefully it will get sorted.
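A small diagnostic that may help narrow this down, using only the standard library. My understanding is that `document.tables` reports only `w:tbl` elements that sit directly under `w:body`, so a table wrapped in some other container would be missed. The sketch below compares the two counts; `count_tables` is a name invented here, and the input is the XML of `word/document.xml` read out of the .docx archive.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_tables(document_xml):
    """Return (top_level, anywhere) counts of w:tbl elements."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W}}}body")
    # Tables that are direct children of the body
    top_level = len(body.findall(f"{{{W}}}tbl")) if body is not None else 0
    # Tables at any depth (inside content controls, text boxes, other tables)
    anywhere = sum(1 for _ in root.iter(f"{{{W}}}tbl"))
    return top_level, anywhere
```

If the second number is larger than the first, the table exists but is nested somewhere; if both are zero, the content was probably stored as an embedded object or picture rather than as a Word table, which is only a guess for the Excel case.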

Removing heading numbers from docx

I need to use python to pre-process docx (Word) documents, so that pandoc can properly convert them into markdown. One of the key requirements is that the styles of the docx document should be "cleaned up", in particular that the numbering of headings (Heading 1, Heading 2, etc.) should be removed.
Restrictions: I know how to do that using VBA (and likely could do it from python using PyWin32 or such). But it is a requirement that it must be implemented without Microsoft Windows and without LibreOffice/UNO.
How can I use the python-docx package to do that? I have looked at the documentation and there does not seem to be any proper way to do it (in fact, heading numbering does not seem to be implemented). Did I miss something?
Unless I should use another method, such as applying a different Word template to the docx document, with the main styles correctly predefined according to my requirements? Could that be done through an available python package?
Code in VBA
This is the code in VBA that got the job done:
Sub RemoveHeaderNos()
    ' Remove the header nos
    Debug.Print "Removing header numbers and formatting..."
    For Each s In ActiveDocument.Styles
        s.LinkToListTemplate ListTemplate:=Nothing
    Next
End Sub
On terminology, I'm understanding you to mean "the numbering of heading paragraphs" as opposed to something like page numbers in the page headers, have I got that right? The two terms "heading" and "header" are unfortunately close and mean quite different things, in Word parlance anyway :)
I'm assuming your paragraph headings are numbered, so that the 'Heading 1' style causes the next sequential integer to be prefixed to the heading paragraph text, like '9. Ninth section heading' (and likewise for Heading 2 -> 9.1, 9.2, etc.).
You're correct that this hasn't been implemented in python-docx yet. You would need to get as close to the XML element in question (perhaps the <w:style> element for Heading 1 for example) as possible using the python-docx API, and then use lxml calls to manipulate the XML under that.
You'd need to start with a strategy for what XML changes you need to make. opc-diag is handy for that. You can change the .docx manually (preferably a radically stripped-down, super-short document) using Word to make it look the way you want, then compare the XML before and after to discover what changes you need to make to the XML.
Then you can validate your strategy by extracting a .docx (using opc-diag), manually updating the XML with the minimum required changes, repackaging it (also using opc-diag), and loading it in Word to make sure it behaves as expected.
I suspect there's a way to "disconnect" the "Heading 1" style from a numbering definition in the styles.xml part that would accomplish what you're after and be a fairly straightforward handful of element changes.
Anyway, that's where I would start.
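To make the "disconnect" idea concrete, here is an untested-against-Word sketch of one candidate change: dropping the `w:numPr` element from the paragraph properties of each heading style in styles.xml. Whether that alone is enough is an assumption; the before/after comparison described above is what would confirm it. The function name is hypothetical, and the sketch works on an already-parsed tree. For writing the part back, an lxml tree (which python-docx exposes) tends to preserve the original namespace prefixes more faithfully than the standard library serializer.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def unlink_heading_numbering(styles_root):
    """Remove w:numPr from styles whose name starts with 'heading'.

    Returns the style IDs that were changed.
    """
    changed = []
    for style in styles_root.iterfind("w:style", NS):
        name = style.find("w:name", NS)
        label = (name.get(f"{{{W}}}val") if name is not None else "") or ""
        # Style names are often stored in lowercase ("heading 1") in the XML
        if not label.lower().startswith("heading"):
            continue
        ppr = style.find("w:pPr", NS)
        num_pr = ppr.find("w:numPr", NS) if ppr is not None else None
        if num_pr is not None:
            ppr.remove(num_pr)
            changed.append(style.get(f"{{{W}}}styleId"))
    return changed
```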
This issue was solved in version 1.17 of pandoc, released on 20 March 2016 ("Don’t turn numbered headers into lists"). If other people meet the same issue, the best thing at this stage would be to upgrade to that version or a later one.
Nevertheless, exploring the various solutions with python-docx was interesting, because it indicated a point of possible improvement.

Page number python-docx

I am trying to create a program in Python that can find a specific word in a .docx file and return the page number that it occurred on. So far, in looking through the python-docx documentation, I have been unable to find how to access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just Python? If not, what would be the best way to do this?
Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.
However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.
I don't know which clients do this (although I expect Word itself does) or how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something similar to iterate through to a specific page, but just thinking it through a bit, it's going to take some detailed development to get that working.
If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.
If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.
If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.
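As a rough illustration of the `w:lastRenderedPageBreak` direction, the sketch below walks the paragraphs of `word/document.xml` with the standard library, bumps a page counter at each recorded break, and notes the page on which a word appears. The function name is invented for this example. Explicit `w:br` page breaks are deliberately ignored, on the assumption that a client which records rendered breaks records one after them too. The result is only as good as the last rendering that was saved, and a file never rendered by such a client reports everything on page 1.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
P = f"{{{W}}}p"
T = f"{{{W}}}t"
RENDERED_BREAK = f"{{{W}}}lastRenderedPageBreak"


def pages_containing(document_xml, word):
    """Return sorted page numbers (1-based, estimated) where word occurs.

    Text is joined across runs within a paragraph, so a word split over
    several runs is still found. Text boxes nested inside paragraphs are
    not handled specially.
    """
    root = ET.fromstring(document_xml)
    page = 1
    hits = set()
    for paragraph in root.iter(P):
        segment = []  # text of this paragraph that lies on the current page
        for el in paragraph.iter():
            if el.tag == RENDERED_BREAK:
                if word in "".join(segment):
                    hits.add(page)
                segment = []
                page += 1
            elif el.tag == T:
                segment.append(el.text or "")
        if word in "".join(segment):
            hits.add(page)
    return sorted(hits)
```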
Using Python-docx: identify a page break in paragraph
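from docx import Document
import re

fn = '1.docx'  # python-docx opens .docx files, not legacy .doc
document = Document(fn)
pn = 1  # current page number, counting explicit page breaks only
for p in document.paragraphs:
    # Report each chapter heading with the page it falls on
    r = re.match(r'Chapter \d+', p.text)
    if r:
        print(r.group(), pn)
    for run in p.runs:
        # A run containing <w:br w:type="page"/> marks a hard page break
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn += 1
            print('!!', '=' * 50, pn)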

How to add text to existing PDF file with Python

I am looking for a Python module and some examples on how to add text to an existing PDF. The PDF file is a one-page PDF, and I would need to add the info at a predetermined position. The text can be added as part of the document or as a comment.
I would also need to read the comments that are in this document.
What is the best Python module that I can use for this? The environment is Windows 7 and Python 2.7 x64.
I have tried to compile poppler, but it is a nightmare.
The other libraries that I have looked at are pyPDF2 and PDF1.0, but I could not locate the objects and the methods that I need to achieve my task. My level is "beginner", so if I have overlooked anything, that is why.
This question has been asked, and was very thoroughly answered. Check it out here! The first answer is nicely general, and walks you through each step of the process. The second answer is instead straight code that you can run. Both are valuable and well-written; choose whichever works best for you. (Or both!)

how to create docx files with python

I am trying to take my data and put it in tables in either Microsoft Word or LibreOffice Writer.
I need to be able to change the background of cells within the table, and I need to be able to change the page property to 'landscape'.
I have been looking for a library with simple code (I am a beginner in coding), but I did not find one that does what I need.
Have you heard of anything that would suit me? If there are examples of how to use it, that would make it easier for me to learn.
Check out this project
And here is a great quick-start guide
It's pretty simple to use. I haven't tested this, but it should work:
from docx import Document
document = Document()
r = 2  # Number of rows you want
c = 2  # Number of columns you want
table = document.add_table(rows=r, cols=c)
table.style = 'LightShading-Accent1'  # set your style, look at the help documentation for more help
for y in range(r):
    for x in range(c):
        # Look up the cell at row y, column x before setting its text
        table.cell(y, x).text = 'text goes here'
document.save('demo.docx')  # Save document
I don't think you can set the page orientation property with this library, but what you could do is create a blank Word document that is in landscape yourself, store it in the working directory, and make a copy of it every time you generate this document.
The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
Create a *.docx document with marker text where you want your headings, table cells, etc.
Use lxml to find those markers and copy their parent elements (along with their formatting).
Use those found elements to create templates.
Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
Zip your xml into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
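A deliberately simplified sketch of the first and last steps of that workflow, using only the standard library: it copies every part of the source .docx into a new archive and swaps marker strings for data in one XML part. This is plain text substitution, not the lxml element copying described above, and it is not how the linked project does it. The function name, the `{{NAME}}` marker style, and the UTF-8 assumption are all mine. It only works when Word has kept each marker inside a single run, which is worth checking in the XML.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_markers(src, dst, values, part="word/document.xml"):
    """Copy the .docx at src to dst, replacing marker text in one XML part.

    src and dst may be paths or binary file objects. values maps marker
    strings to replacement text, which is XML-escaped before insertion.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == part:
                xml = data.decode("utf-8")  # assumes the part is UTF-8
                for marker, value in values.items():
                    xml = xml.replace(marker, escape(value))
                data = xml.encode("utf-8")
            # Every other part is copied through unchanged
            zout.writestr(item.filename, data)
```

For example, `fill_markers("template.docx", "invoice.docx", {"{{NAME}}": "Smith & Co"})` would produce a new file with the marker filled in, with both file names being placeholders.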
