Word & Python - Create Table of Contents

Word & Python - Create Table of Contents - python

I'm using the pywin32.client extension for python and building a Word document. I have tried a pretty good host of methods to generate a ToC but all have failed.
I think what I want to do is call the ActiveDocument object and create one with something like this example from the MSDN page:
Set myRange = ActiveDocument.Range(Start:=0, End:=0)
ActiveDocument.TablesOfContents.Add Range:=myRange, _
UseFields:=False, UseHeadingStyles:=True, _
LowerHeadingLevel:=3, _
UpperHeadingLevel:=1
Except in Python it would be something like:
wordObject.ActiveDocument.TableOfContents.Add(Range=???,UseFiles=False, UseHeadingStyles=True, LowerHeadingLevel=3, UpperHeadingLevel=1)
I've built everything so far using the 'Selection' object (example below) and wish to add this ToC after the first page break.
Here's a sample of what the document looks like:
objWord = win32com.client.Dispatch("Word.Application")
objDoc = objWord.Documents.Open('pathtotemplate.docx') #
objSel = objWord.Selection
#These seem to work but I don't know why...
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.Add(1,True)
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.NumberStyle = 57
objSel.Style = objWord.ActiveDocument.Styles("Heading 1")
objSel.TypeText("TITLE PAGE AND STUFF")
objSel.InsertParagraph()
objSel.TypeText("Some data or another"
objSel.TypeParagraph()
objWord.Selection.InsertBreak()
####INSERT TOC HERE####
Any help would be greatly appreciated! In a perfect world I'd use the default first option which is available from the Word GUI but that seems to point to a file and be harder to access (something about templates).
Thanks

Manually, edit your template in Word, add the ToC (which will be empty initially) any intro stuff, header/footers etc., then at where you want your text content inserted (i.e. after the ToC) put a uniquely named bookmark. Then in your code, create a new document based on the template (or open the template then save it to a different name), search for the bookmark and insert your content there. Save to a different filename.
This approach has all sorts of advantages - you can format your template in Word rather than by writing all the code details, and so you can very easily edit your template to update styles when someone says they want the Normal font to be bigger/smaller/pink you can do it just by editing the template. Make sure to use styles in your code and only apply formatting when it is specifically different from the default style.
Not sure how you make sure the ToC is actually generated, might be automatically updated on every save.

Related

How to filter PDF text by font?

PDF example
A PDF may contain multiple fonts, how can I only keep 1 font with the most words with Python?

disclaimer: I am the author of borb (the library I will use in this example)
Oddly enough, there is an fairly close match example in the borb examples repository for filtering by font. You can find that example here.
In this example, we extract all the text in a particular font in the PDF (e.g. all text written in Courier).
You can easily base yourself on this code to build something that checks the number of characters for each particular font (and at a later stage, return only the font with the most characters).
I'll repeat the example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.font_name_filter import FontNameFilter
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction
def main():
# create FontNameFilter
l0: FontNameFilter = FontNameFilter("Courier")
# filtered text just gets passed to SimpleTextExtraction
l1: SimpleTextExtraction = SimpleTextExtraction()
l0.add_listener(l1)
# read the Document
doc: typing.Optional[Document] = None
with open("output.pdf", "rb") as in_file_handle:
doc = PDF.loads(in_file_handle, [l0])
# check whether we have read a Document
assert doc is not None
# print the names of the Fonts
print(l1.get_text_for_page(0))
if __name__ == "__main__":
main()
Aside from the imports, everything is quite straightforward. You specify the string of the font you want to filter on. This filter object will process the parsing/rendering of the PDF, and will only push events to its children if they are relevant (if the font information matches).
We add SimpleTextExtraction as its child, and so doing only get the text which is rendered in the desired font.
After we've set up this entire thing, we need to actually process (parse) the Document which is what happens in the next lines.
Some caveats:
PDF documents might contain so-called 'subset fonts'. This is when a font is artificially made smaller by throwing out unused letters. ie if a PDF never uses the 'uppercase X' letter then the font does not need to store information on how to render it. Typically, the names of subset fonts are not the same as those of their original font. You might get something like Courier+AEOKFF.
If this happens to be the case, check out the code of FontNameFilter and make another version that only checks the name using startswith, which out to do the trick.

Python Docx Module merges tables when added subsequently to document

I'm using the python-docx module and python 3.9.0 to create word docx files with python. The problem I have is the following:
A) I defined a table style named my_table_style
B) I open my template, add one table of that style to my document object and then I store the created file with the following code:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
When I now open test.docx, it's all good, there's one table with one row saying "hello".
NOW, when I use this syntax to create two of these tables:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
Instead of getting two tables, each with one row saying "hello", I get one single table with two rows, each saying "hello". The formatting is however correct, according to my_table_style, so it seems that python-docx merges two subsequently added tables of the same table style. Is this normal behavior? How can I avoid that?
Cheers!
HINTS:
When I use print(len(my_file.tables)) to print the amount of tables present in my_file, I actually get "2"! Also, when I change the style used in the second add_table line it works all good, so this seems to be related to the fact of using the same style. Any ideas, anyone?

Alright, so I figured it out, it seems to be default behaviour by Word to do what's described above. I manually created a table style my_custom_style in the template.docx file where I customized the table border lines etc. to have the format I want to have as if I would have two tables.
Instead of then using two add_table() statements, I used
new_table = my_file.add_table(1,1,style = "my_custom_style")
first_row = new_table.rows[-1]
second_row = new_table.add_row()
(you can actually access table styles defined in your template via python-docx, simply by using the table style name you used to manually create your table style in your word template file used to open your Document object. Just make sure you tick the "add this table style to the word template" option upon saving the style in Word and it should all work). Everything working now.

Can python-docx preserve font color and styles when importing documents?

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
doc=docx.Document(f)
fullText=[]
for para in doc.paragraphs:
fullText.append(para.text) #generates a long text list
# finaldoc.styles = doc.styles
for l in fullText:
# if l=='u\'\\n\'':
if '#' in l:
print('We got here!')
if '#1 ' not in l: #check last two characters to see if this is the first question
finaldoc.add_section() #only add a page break between questions
finaldoc.add_paragraph(l)
# finaldoc.add_page_break
# finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and they aren't in this method since it just gets the raw text and dumps it. I can't find anything on the python-docx documentation about preserving text styles or importing in something other than raw text. Does anyone know how to go about this?

Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You first should create a new blank paragraph, then you should go and add_run() one by one with your text from your original. For each of these you can define a style attribute but it would have to be a named style as described in the first link. You cannot apply a stlye object directly as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
p = doc_out.add_paragraph()
for run in para.runs:
r = p.add_run(run.text)
if run.bold:
r.bold = True
if run.italic:
r.italic = True
# etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.

SolrClient python update document

I'm currently trying to create a small python program using SolrClient to index some files.
My need is that I want to index some file content and then add some attributes to enrich the document.
I used the post command line tool to index the files. Then I use a python program trying to enrich documents, something like this:
doc = solr.get('collection', id)
doc['new_attribute'] = 'value'
solr.index_json('collection',json.dumps([doc]))
solr.commit(openSearcher=True)
Problem is that I have the feeling that we lost file content index. If I run a query with a word present in all attributes of the doc, I find it.
If I run a query with a word only in the file, it does not work (it works indexing only the file with post without my update tentative).
I'm not sure to understand how to update the doc keeping the index created by the post command.
I hope I'm clear enough, maybe I misunderstood the way it works...
thanks a lot

If I understand correctly, you want to modify an existing record. You should be able to do something like this without using a solr.get:
doc = [{'id': 'value', 'new_attribute':{'set': 'value'}}]
solr.index_json('collection',json.dumps([doc]))
See also:
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents

It has worked for me in this way, it can be useful for someone
from SolrClient import SolrClient
solrConect = SolrClient("http://xx.xx.xxx.xxx:8983/solr/")
doc = [{'id': 'my_id', 'count_related_like':{'set': 10}}]
solrConect.index_json("my_collection", json.dumps(doc) )
solrConect.commit("my_collection", softCommit=True)

Trying with Curl did not change anything. I did it differently so now it works. Instead of adding the file with the post command and trying to modify it afterwards, I read the file in a string and index in a "content" field. It means every document is added in one shot.
The content field is defined as not stored, so I just index it.
It works fine and suits my needs. It's also more simple since it removes many attributes set by post command that I don't need.
If I find some time, I'll try again the partial update and update the post.
Thanks
Rémi

How can I get a list of all the strings that Babel knows about?

I have an application in which the main strings are in English and then various translations are made in various .po/.mo files, as usual (using Flask and Flask-Babel). Is it possible to get a list of all the English strings somewhere within my Python code? Specifically, I'd like to have an admin interface on the website which lets someone log in and choose an arbitrary phrase to be used in a certain place without having to poke at actual Python code or .po/.mo files. This phrase might change over time but needs to be translated, so it needs to be something Babel knows about.
I do have access to the actual .pot file, so I could just parse that, but I was hoping for a cleaner method if possible.

You can use polib for this.
This section of the documentation shows examples of how to iterate over the contents of a .po file. Here is one taken from that page:
import polib
po = polib.pofile('path/to/catalog.po')
for entry in po:
print entry.msgid, entry.msgstr

If you alredy use babel you can get all items from po file:
from babel.messages.pofile import read_po
catalog = read_po(open(full_file_name))
for message in catalog:
print message.id, message.string
See http://babel.edgewall.org/browser/trunk/babel/messages/pofile.py.
You alredy can try get items from mo file:
from babel.messages.mofile import read_mo
catalog = read_po(open(full_file_name))
for message in catalog:
print message.id, message.string
But when I try use it last time it's not was availible. See http://babel.edgewall.org/browser/trunk/babel/messages/mofile.py.
You can use polib as #Miguel wrote.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.