Python: Create a "Table Of Contents" with python-docx/lxml - python

I'm trying to automate the creation of .docx files (WordML) with the help of python-docx (https://github.com/mikemaccana/python-docx). My current script creates the ToC manually with following loop:
for chapter in myChapters:
body.append(paragraph(chapter.text, style='ListNumber'))
Does anyone know of a way to use the "word built-in" ToC-function, which adds the index automatically and also creates paragraph-links to the individual chapters?
Thanks a lot!

The key challenge is that a rendered ToC depends on pagination to know what page number to put for each heading. Pagination is a function provided by the layout engine, a very complex piece of software built into the Word client. Writing a page layout engine in Python is probably not a good idea, definitely not a project I'm planning to undertake anytime soon :)
The ToC is composed of two parts:
the element that specifies the ToC placement and things like which heading levels to include.
the actual visible ToC content, headings and page numbers with dotted lines connecting them.
Creating the element is pretty straightforward and relatively low-effort. Creating the actual visible content, at least if you want the page numbers included, requires the Word layout engine.
These are the options:
Just add the tag and a few other bits to signal Word the ToC needs to be updated. When the document is first opened, a dialog box appears saying links need to be refreshed. The user clicks Yes and Bob's your uncle. If the user clicks No, the ToC title appears with no content below it and the ToC can be updated manually.
Add the tag and then engage a Word client, by means of C# or Visual Basic against the Word Automation library, to open and save the file; all the fields (including the ToC field) get updated.
Do the same thing server-side if you have a SharePoint instance or whatever that can do it with Word Automation Services.
Create an AutoOpen macro in the document that automatically runs the field update when the document is opened. Probably won't pass a lot of virus checkers and won't work on locked-down Windows builds common in a corporate setting.
Here's a very nice set of screencasts by Eric White that explain all the hairy details

Sorry for adding comments to an old post, but I think it may be helpful.
This is not my solution, but it has been found there: https://github.com/python-openxml/python-docx/issues/36
Thanks to https://github.com/mustash and https://github.com/scanny
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
paragraph = self.document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p

Please see explanations in the code comments.
# First set directory where you want to save the file
import os
os.chdir("D:/")
# Now import required packages
import docx
from docx import Document
from docx.oxml.ns import qn
from docx.oxml import OxmlElement
# Initialising document to make word file using python
document = Document()
# Code for making Table of Contents
paragraph = document.add_paragraph()
run = paragraph.add_run()
fldChar = OxmlElement('w:fldChar') # creates a new element
fldChar.set(qn('w:fldCharType'), 'begin') # sets attribute on element
instrText = OxmlElement('w:instrText')
instrText.set(qn('xml:space'), 'preserve') # sets attribute on element
instrText.text = 'TOC \\o "1-3" \\h \\z \\u' # change 1-3 depending on heading levels you need
fldChar2 = OxmlElement('w:fldChar')
fldChar2.set(qn('w:fldCharType'), 'separate')
fldChar3 = OxmlElement('w:t')
fldChar3.text = "Right-click to update field."
fldChar2.append(fldChar3)
fldChar4 = OxmlElement('w:fldChar')
fldChar4.set(qn('w:fldCharType'), 'end')
r_element = run._r
r_element.append(fldChar)
r_element.append(instrText)
r_element.append(fldChar2)
r_element.append(fldChar4)
p_element = paragraph._p
# Giving headings that need to be included in Table of contents
document.add_heading("Network Connectivity")
document.add_heading("Weather Stations")
# Saving the word file by giving name to the file
name = "mdh2"
document.save(name+".docx")
# Now check word file which got created
# Select "Right-click to update field text"
# Now right click and then select update field option
# and then click on update entire table
# Now,You will find Automatic Table of Contents

#Mawg // Updating ToC
Had the same issue to update the ToC and googled for it. Not my code, but it works:
word = win32com.client.DispatchEx("Word.Application")
doc = word.Documents.Open(input_file_name)
doc.TablesOfContents(1).Update()
doc.Close(SaveChanges=True)
word.Quit()

Related

Covert Rect location from pymupdf to a page number

Covert Rect location from pymupdf to a page number
If I get the locations of certain text like "exam" and get the rectangle location. I then highlight the text in the pdfs with that location. I now want to delete all other pages that do not have this text in it so I use the doc.select()
function to select the pages I want to keep before making a save of the new pdf with the pages with highlighted text on only.
The Issue
You have to pass a dictionary to the doc.select() function with the page numbers I want to keep.
So what I tried to do was to pass the dictionary with the rectangle coordinates to this function but I got the following error
<br>
ValueError: bad page number(s)
<br>
I know understand that I must be able to convert the coordinates of the rectangles to page numbers.
But I don not know how to do this and it is not mentioned anywhere in the docs (Correct me if I am wrong) .
<br>
Current code
from pathlib import Path
import fitz
directory = "pdfs"
# iterate over files in
# that directory
files = Path(directory).glob('*')
for file in files:
doc = fitz.open(file)
for page in doc:
### SEARCH
text = "Exam"
text_instances = page.search_for(text)
### HIGHLIGHT
for inst in text_instances:
highlight = page.add_highlight_annot(inst)
highlight.update()
### OUTPUT
doc.select(text_instances)
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Pdf that I used for testing purposes:
pdf
I know understand that I must be able to convert the coordinates of the rectangles to page numbers. But I don not know how to do this and it is not mentioned anywhere in the docs (Correct me if I am wrong).
That is completely wrong!
The rectangles returned by text searches are locations on the current page and have nothing to do with page numbers.
You already are iterating over pages. If your search text has been found on some page, put that page's number in a list, then do your highlights.
When finished with a document, select() with the pages memorized, close document, empty the page selection, then continue with next document.
Something like that:
for filename in filenamelist:
select_pages = []
doc = fitz.open(filename)
for page in doc:
hits = page.search_for(text)
if hits == []:
continue
select_pages.append(page.number)
for rect in hits:
page.add_highlight_annot(rect)
doc.select(select_pages)
doc.save(...)
doc.close()

How to put two pictures on the same level in a word file (.docx) python

I want to put the logo and the barcode on the same level and not one on top of the other.The logo should stay at the very left and the barcode at the very right of the word file.Here is my code , thank you:
import uuid
import pandas as pd
import pyqrcode
from docx import Document
from docx.shared import Inches
qr=pyqrcode.create(str(uuid.uuid4()).replace('-',''))
qr.png('somecode.png')
df=pd.DataFrame(pd.read_excel('Komplette-
GastAccountFHZugangsdatenFertig.xlsx'))
Attributes=['Name', 'Vorname ', 'Geschlecht', 'Adresse (in Marokko)',
'Telefonnummer', 'E-Mailadresse', 'Studiengang', 'Semester']
document = Document()
document.add_heading('Informationen.',level=0)
document.add_picture('Informatik_Logo.png',height=Inches(1.0))
p = document.add_paragraph()
r = p.add_run()
p_format=p.paragraph_format
p_format.left_indent=Inches(4.5)
r.add_picture('somecode.png',width=Inches(1.0))
table=document.add_table(len(Attributes),2,style='LightGrid-Accent1')
for i in range(len(Attributes)):
row=table.rows[i]
row.cells[0].text=Attributes[i]
Infos=df[Attributes[i]]
string=str(Infos[49])
row.cells[1].text=string
document.save('sample.docx')
From what I see in the documentation, python-docx only currently supports inline pictures, not floating pictures, which means that you can only get the look you currently have. From the docs:
At the time of writing, python-docx only supports inline pictures. Floating pictures can be added. If you have an active use case, submit a feature request on the issue tracker. The Document.add_picture() method adds a specified picture to the end of the document in a paragraph of its own.
Based on that last sentence, I think what you're trying to do is currently impossible. A workaround might be to insert a table with one row and two columns, and insert an image in each cell.
i can help you with that,just add one sentence into your code
r.add_picture('somecode.png',width=Inches(2.0))
that is all. just write the sentence one more time.and if you want more picture in the same level, all you need to do is that add the sentence. i tried and it works to me.
there is my test code
dc = Document()
run = dc.add_paragraph().add_run()
run.add_picture("./picture/danilise.jpg", width=Inches(1.1))
run.add_picture("./picture/danilise.jpg", width=Inches(1.3))
run.add_picture("./picture/danilise.jpg", width=Inches(1.5))
dc.save("test1.docx")

How to create PDFOutline programmatically

I'm having trouble trying to create Table of Contents objects in a PDF file. I'm not sure whether I've understood the process from Apple's limited documentation.
I'm using python, but cogent examples in any language are welcome to explain how it's supposed to work. The code creates a new PDF document, but there's no outline item visible in Preview. I've tried just using myOutline as the root object, but that doesn't work either.
pdfURL = NSURL.fileURLWithPath_(infile)
myPDF = Quartz.PDFDocument.alloc().initWithURL_(pdfURL)
if myPDF:
# Create Destination
myPage = myPDF.pageAtIndex_(1)
pagePoint = Quartz.CGPointMake(0,0)
myDestination = Quartz.PDFDestination.alloc().initWithPage_atPoint_(myPage, pagePoint)
# Create Outline
myOutline = Quartz.PDFOutline.alloc().init()
myOutline.setLabel_("Interesting")
myOutline.setDestination_(myDestination)
# Create a root Outline and add the first outline as a child
rootOutline = Quartz.PDFOutline.alloc().init()
rootOutline.insertChild_atIndex_(myOutline, 0)
# Add the root outline to the document and save
myPDF.setOutlineRoot_(rootOutline)
myPDF.writeToFile_(outfile)
EDIT: Actually, the outline IS getting saved to the new file: I can read it programmatically, and it appears in Acrobat as a Bookmark; however, it doesn't show up in Preview's Table of Contents (yes, I checked for the "Hide" thing). If I add another Bookmark in Acrobat, then both show up in Preview.
So I guess that either I'm still doing something wrong which doesn't quite 'finish' the PDFOutline data properly, and Acrobat is being kind; or there's a massive bug in PDFKit that means you can't write PDFOutlines properly. I get the same behaviour on Mountain Lion, FWIW.
This does appear to be a bug in Preview. It will not list the Table of Contents if it contains ONLY ONE child entry.
If I add more Outlines with the code above, then all of them appear in Preview. If use other software to remove all but one entries in the Table of Contents, then Preview will not show any.

Extract image position from .docx file using python-docx

I'm trying to get the image index from the .docx file using python-docx library. I'm able to extract the name of the image, image height and width. But not the index where it is in the word file
import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
print (s.height.cm,s.width.cm,s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)
output
21.228 15.920 IMG_20160910_220903848.jpg
In fact I would like to know if there is any simpler way to get the image name , like s.height.cm fetched me the height in cm. My primary requirement is to get to know where the image is in the document, because I need to extract the image and do some work on it and then again put the image back to the same location
This operation is not directly supported by the API.
However, if you're willing to dig into the internals a bit and use the underlying lxml API it's possible.
The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).
This specimen XML might be helpful:
http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
From the inline shape containing the picture, you get the <a:blip> element with this:
blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
The relationship id (r:id generally, but r:embed in this case) is available at:
rId = blip.embed
Then you can get the image part from the document part
document_part = document.part
image_part = document_part.related_parts[rId]
And then the binary image is available for read and write on ._blob.
If you write a new blob, it will replace the prior image when saved.
You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.
There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.
Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)
You can also inspect paragraphs with a simple loop, and check which xml contains an image (for example if an xml contains "graphicData"), that is which is an image container (you can do the same with runs):
from docx import Document
image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
if 'graphicData' in par._p.xml:
image_paragraphs.append(par)
Than you unzip docx file, images are in the "images" folder, and they are in the same order as they will be in the image_paragraphs list. On every paragraph element you have many options how to change it. If you want to extract img process it and than insert it in the same place, than
paragraph.clear()
paragraph.add_run('your description, if needed')
run = paragraph.runs[0]
run.add_picture(path_to_pic, width, height)
So, I've never really written any answers here, but i think this might be the solution to your problem. With this little code you can see the position of your images given all the paragraphs. Hope it helps.
import docx
doc = docx.Document(filename)
paraGr = []
index = []
par = doc.paragraphs
for i in range(len(par)):
paraGr.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
index.append(i)
If you are using Python 3
pip install python-docx
import docx
doc = docx.Document(document_path)
P = []
I = []
par = doc.paragraphs
for i in range(len(par)):
P.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
I.append(i)
print(I)
#returns list of index(Image_Reference)

python-docx add horizontal line

Is it possible to add a single horizontal line spanning the entire page width after a paragraph?
By using the add_header such a line is created after the header:
document.add_heading('Test', 0)
I'd need such a line between two paragraphs in the document, so is it possible to just add the line by itself?
This is possible through the python-docx API using low-level functions. This is taken from a GitHub comment
from docx.oxml.shared import OxmlElement
from docx.oxml.ns import qn
def insertHR(paragraph):
p = paragraph._p # p is the <w:p> XML element
pPr = p.get_or_add_pPr()
pBdr = OxmlElement('w:pBdr')
pPr.insert_element_before(pBdr,
'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
'w:pPrChange'
)
bottom = OxmlElement('w:bottom')
bottom.set(qn('w:val'), 'single')
bottom.set(qn('w:sz'), '6')
bottom.set(qn('w:space'), '1')
bottom.set(qn('w:color'), 'auto')
pBdr.append(bottom)
This will add a horizontal rule to a paragraph.
This is not supported directly yet by the python-docx API.
However, you can achieve this effect by creating a paragraph style that has this border setting in the "template" document you open to create a new document, e.g.
document = Document('template-with-style.docx')
Then you can apply that style to a new paragraph in that document using python-docx.
Just add
document.add_paragraph("_____________________________________________")
length of this line may not be enough but you can add it ,i hope you have solved your problems this is for new comers

Categories