MS Word document structure, COM calls and Python

I am using comtypes to call functions and create an MS Word document. As this is the first time I have written such a program, there is something I don't understand. What I want to do is:
Create sections in the document and call them A, B, ...
In each section, create paragraphs that contain text. For section A, call the paragraphs a1, a2, a3, ...
Add formatting to each paragraph in each section; the formatting may differ from paragraph to paragraph.
Below are some code fragments in VBA. I use VBA because the translation to comtypes is almost direct and there are more VBA examples on the net.
Set myRange = ActiveDocument.Range(Start:= ...., End:= ...) ' start and end can be anything
ActiveDocument.Sections.Add Range:=myRange ' Section A
Set newRange = ActiveDocument.Range(Start:= ...., End:= ...) ' start and end can be anything
newRange.Paragraphs.Add
I am stuck on selecting paragraph a1 and setting its text. What I am missing is a function that says: get the collection of paragraphs in section A.

The following VBA, based on the code in the question, illustrates getting a Document object, adding a Section, getting the Paragraphs of that Section or of any given Section in a document, and getting the first or any other Paragraph from a Paragraphs collection. Note that a Section exposes its paragraphs through its Range.
Set doc = ActiveDocument ' More efficient if the Document object will be used more than once
Set section1 = doc.Sections.Add(Range:=myRange) ' Section A | Type Word.Section
Set section1Paras = section1.Range.Paragraphs ' Type Word.Paragraphs
' OR
Set sectionParas = doc.Sections(n).Range.Paragraphs ' where n = section index number
Set para = sectionParas.First ' OR = sectionParas(n) where n = paragraph index number
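For the Python side, a minimal sketch of the same lookup might look like the following. It assumes `doc` is a Word Document COM object obtained elsewhere (for example through comtypes or pywin32); the function name `set_paragraph_text` is made up for illustration. It goes through `Section.Range.Paragraphs` and uses the explicit `Item` method, which should behave the same under either COM wrapper.

```python
WD_CHARACTER = 1  # Word constant wdCharacter


def set_paragraph_text(doc, section_index, paragraph_index, text):
    """Replace the text of one paragraph in one section (indexes are 1-based).

    `doc` is assumed to be a Word Document COM object.
    """
    paragraphs = doc.Sections.Item(section_index).Range.Paragraphs
    rng = paragraphs.Item(paragraph_index).Range
    # A paragraph's Range includes its trailing paragraph mark. Pull the end
    # back by one character so the mark (and the paragraph itself) survives.
    rng.MoveEnd(WD_CHARACTER, -1)
    rng.Text = text
    return rng


# Hypothetical usage on Windows with comtypes installed:
#   from comtypes.client import CreateObject
#   word = CreateObject("Word.Application")
#   doc = word.Documents.Add()
#   set_paragraph_text(doc, 1, 1, "text for paragraph a1")
```

Formatting can then be applied through the returned range (for example its `Font` or `ParagraphFormat` members), one paragraph at a time.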

Related

Pywin32 how to replace text in entire word document

I am having trouble replacing words in the entire text document. The following code works to replace words in the main paragraphs but not when text is present in text boxes.
wdFindContinue = 1
wdReplaceAll = 2
word.Selection.Find.Execute(find_str, True, True, False, False, False, True, wdFindContinue, False, replace_str, wdReplaceAll)
Is there a way to replace words in the entire document?

See the article by Word MVPs on Using a macro to replace text wherever it appears in a document.
A collaborative effort of MVP’s Doug Robbins and Greg Maxey with
enhancements by Peter Hewett and Jonathan West
Using the Find or Replace utility on the Edit menu you can find or
replace text "almost" anywhere it appears in the document. If you
record that action however, the scope or "range" of the resulting
recorded macro will only act on the text contained in the body of the
document (or more accurately, it will only act on the part of the
document that contains the insertion point). This means that if the
insertion point is located in the main body of the document when your
macro is executed it will have no effect on text that is in the
headers or footers of the document, for example, or in a textbox,
footnotes, or any other area that is outside the main body of the
document.
Even the Find and Replace utility has a shortcoming. For example, text
in a textbox located in a header or footer is outside the scope of the
Find and Replace utility search range.
***
To use a macro to find or replace text anywhere in a document, it is
necessary to loop through each individual part of the document. In
VBA, these parts are called StoryRanges. Each StoryRange is identified
by a unique wdStoryType constant.
There are eleven different wdStoryType constants that can form the
StoryRanges (or parts) of a document (ok, a few more in later versions
of Word, but they have no bearing in this discussion). Simple
documents may contain only one or two StoryRanges, while more complex
documents may contain more. The wdStoryTypes that have a role in find
and replace are:
wdCommentsStory, wdEndnotesStory, wdEvenPagesFooterStory,
wdEvenPagesHeaderStory, wdFirstPageFooterStory,
wdFirstPageHeaderStory, wdFootnotesStory, wdMainTextStory,
wdPrimaryFooterStory, wdPrimaryHeaderStory, and wdTextFrameStory.
The complete code to find or replace text anywhere is a bit complex.
Accordingly, let’s take it a step at a time to better illustrate the
process. In many cases the simpler code is sufficient for getting the
job done.
Step 1
The following code loops through each StoryRange in the active
document and replaces the specified .Text with .Replacement.Text:
Sub FindAndReplaceFirstStoryOfEachType()
    Dim rngStory As Range
    For Each rngStory In ActiveDocument.StoryRanges
        With rngStory.Find
            .Text = "find text"
            .Replacement.Text = "I'm found"
            .Wrap = wdFindContinue
            .Execute Replace:=wdReplaceAll
        End With
    Next rngStory
End Sub
(Note for those already familiar with VBA: whereas if you use
Selection.Find, you have to specify all of the Find and Replace
parameters, such as .Forward = True, because the settings are
otherwise taken from the Find and Replace dialog's current settings,
which are “sticky”, this is not necessary if using [Range].Find –
where the parameters use their default values if you don't specify
their values in your code).
The simple macro above has shortcomings. It only acts on the "first"
StoryRange of each of the eleven StoryTypes (i.e., the first header,
the first textbox, and so on). While a document only has one
wdMainTextStory StoryRange, it can have multiple StoryRanges in some
of the other StoryTypes. If, for example, the document contains
sections with un-linked headers and footers, or if it contains
multiple textboxes, there will be multiple StoryRanges for those
StoryTypes and the code will not act upon the second and subsequent
StoryRanges. To even further complicate matters, if your document
contains unlinked headers or footers and one of the headers or footers
are empty then VBA can have trouble "jumping" that empty header or
footer and process subsequent headers and footers.
The page has more, but the above should help.
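Translated to Python, the loop over StoryRanges could be sketched roughly as below. This is not the article's full macro: it is an assumption-laden sketch that reuses the positional `Find.Execute` arguments from the question and follows each story's `NextStoryRange` so that second and later text boxes or unlinked headers are also visited. It assumes `doc` is a Word Document object from pywin32, where an exhausted `NextStoryRange` comes back as `None`; the function name `replace_everywhere` is made up.

```python
WD_FIND_CONTINUE = 1  # Word constant wdFindContinue
WD_REPLACE_ALL = 2    # Word constant wdReplaceAll


def replace_everywhere(doc, find_str, replace_str):
    """Run a replace-all in every story of `doc`; return how many were visited."""
    visited = 0
    for story in doc.StoryRanges:
        # StoryRanges yields only the first story of each type. NextStoryRange
        # moves on to the next story of that same type until none is left.
        while story is not None:
            story.Find.Execute(
                find_str, True, True, False, False, False, True,
                WD_FIND_CONTINUE, False, replace_str, WD_REPLACE_ALL,
            )
            visited += 1
            story = story.NextStoryRange
    return visited
```

Because this uses `Range.Find` instead of `Selection.Find`, it does not depend on where the insertion point happens to be. The empty-header problem the article mentions is not handled here.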

How to search through a paragraph of text for a list of words, and once found, add to a dictionary

I am using a GET request to receive XML information back from a website. I have parsed the information and have reached a point where I want to set up a new scenario.
The scenario is: when I run a for loop through the XML (now turned into text in a list []), how can I:
Search the whole text for a variety of words? For example, say I know I'm looking for "value", "Lastupdatedate", and "Createdby" and want to grab that information every time I see it.
If I find one of those words, grab the word and everything in between (this is XML, so if I find "value" I want to grab the information in between, like (value - 5 - value)). The final information would then be value: 5.
finalvalues = []
MASTER = [This is the text I'm searching through]
variables = ["value", "Lastupdatedate", "Createdby"]
for i in MASTER:
    #need something here to grab the actual name that it found
    tst = re.findall(variables(.*?)variables,i)
    finalvalues.append(tst)
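One way this could be done with `re` is sketched below. It assumes the text contains plain tags of the form `<value>5</value>` with no attributes; the sample string in `MASTER` and the helper name `extract_fields` are invented for illustration. A backreference makes the closing tag match whichever name was found, and the captured name comes back alongside the value.

```python
import re


def extract_fields(text, names):
    """Return (name, value) pairs for every <name>value</name> found in text."""
    alternatives = "|".join(re.escape(name) for name in names)
    # \1 forces the closing tag to repeat the name captured by the first group.
    pattern = re.compile(rf"<({alternatives})>(.*?)</\1>", re.DOTALL)
    return pattern.findall(text)


# Hypothetical sample data standing in for the parsed response.
MASTER = ["<row><value>5</value><Createdby>ann</Createdby></row>"]
variables = ["value", "Lastupdatedate", "Createdby"]

finalvalues = []
for chunk in MASTER:
    finalvalues.extend(extract_fields(chunk, variables))

print(finalvalues)  # [('value', '5'), ('Createdby', 'ann')]
```

If the XML has attributes or nesting, a real parser such as `xml.etree.ElementTree` is likely to be more robust than a regular expression.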

Can python-docx preserve font color and styles when importing documents?

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
    doc=docx.Document(f)
    fullText=[]
    for para in doc.paragraphs:
        fullText.append(para.text) #generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l: #check last two characters to see if this is the first question
                finaldoc.add_section() #only add a page break between questions
        finaldoc.add_paragraph(l)
    # finaldoc.add_page_break
    # finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and they aren't in this method since it just gets the raw text and dumps it. I can't find anything on the python-docx documentation about preserving text styles or importing in something other than raw text. Does anyone know how to go about this?

Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You should first create a new blank paragraph, then add_run() one by one with the text from your original. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly, as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
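# doc is the source docx.Document opened earlier
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        # copy the text run by run, then re-apply the formatting you care about
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc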
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
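The "etc" part can be folded into a small helper. The sketch below is only one possible selection of attributes (bold, italic, underline, font name, font size and RGB color); the helper name `copy_run_format` is made up, and it only touches direct formatting on the run, not anything inherited from a named style.

```python
def copy_run_format(src, dst):
    """Copy a handful of direct formatting attributes from one run to another."""
    # Tri-state run properties: True, False, or None (inherit from the style).
    for name in ("bold", "italic", "underline"):
        setattr(dst, name, getattr(src, name))
    # Font properties are None when nothing is set directly on the run.
    for name in ("name", "size"):
        setattr(dst.font, name, getattr(src.font, name))
    # Only copy an explicit RGB color; leave the target alone otherwise.
    rgb = src.font.color.rgb
    if rgb is not None:
        dst.font.color.rgb = rgb


# Inside the loop above this would be used as:
#   r = p.add_run(run.text)
#   copy_run_format(run, r)
```

Assigning the tri-state values directly, instead of testing `if run.bold:`, also carries over an explicit `False`.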
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.

Word & Python - Create Table of Contents

I'm using the win32com.client module (pywin32) for Python and building a Word document. I have tried a fair number of methods to generate a ToC, but all have failed.
I think what I want to do is call the ActiveDocument object and create one with something like this example from the MSDN page:
Set myRange = ActiveDocument.Range(Start:=0, End:=0)
ActiveDocument.TablesOfContents.Add Range:=myRange, _
UseFields:=False, UseHeadingStyles:=True, _
LowerHeadingLevel:=3, _
UpperHeadingLevel:=1
Except in Python it would be something like:
wordObject.ActiveDocument.TablesOfContents.Add(Range=???, UseFields=False, UseHeadingStyles=True, LowerHeadingLevel=3, UpperHeadingLevel=1)
I've built everything so far using the 'Selection' object (example below) and wish to add this ToC after the first page break.
Here's a sample of what the document looks like:
import win32com.client
objWord = win32com.client.Dispatch("Word.Application")
objDoc = objWord.Documents.Open('pathtotemplate.docx')
objSel = objWord.Selection
#These seem to work but I don't know why...
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.Add(1,True)
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.NumberStyle = 57
objSel.Style = objWord.ActiveDocument.Styles("Heading 1")
objSel.TypeText("TITLE PAGE AND STUFF")
objSel.InsertParagraph()
objSel.TypeText("Some data or another")
objSel.TypeParagraph()
objWord.Selection.InsertBreak()
####INSERT TOC HERE####
Any help would be greatly appreciated! In a perfect world I'd use the default first option which is available from the Word GUI but that seems to point to a file and be harder to access (something about templates).
Thanks

Edit your template manually in Word: add the ToC (which will be empty initially), any intro material, headers/footers and so on. Then, at the point where you want your text content inserted (i.e. after the ToC), put a uniquely named bookmark. In your code, create a new document based on the template (or open the template and save it under a different name), search for the bookmark and insert your content there. Save to a different filename.
This approach has all sorts of advantages. You can format your template in Word rather than writing all the formatting details in code, and you can easily edit the template to update styles: when someone says they want the Normal font to be bigger/smaller/pink, you can do it just by editing the template. Make sure to use styles in your code and only apply formatting when it specifically differs from the default style.
I am not sure how to make sure the ToC is actually generated; it might be updated automatically on every save.
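A rough sketch of the bookmark approach follows. It assumes `doc` is a Word Document COM object created from the template, and that the template contains a bookmark; the bookmark name, file paths and the function name `fill_template` are placeholders. Calling `Update` on each table of contents is one way to avoid relying on Word refreshing it by itself.

```python
def fill_template(doc, bookmark_name, text):
    """Insert text at a named bookmark, then refresh every table of contents."""
    if not doc.Bookmarks.Exists(bookmark_name):
        raise KeyError("bookmark not found: %s" % bookmark_name)
    doc.Bookmarks.Item(bookmark_name).Range.InsertAfter(text)
    # Rebuild each ToC explicitly so new headings and page numbers show up.
    for toc in doc.TablesOfContents:
        toc.Update()


# Hypothetical usage on Windows with pywin32 (paths and names are placeholders):
#   import win32com.client
#   word = win32com.client.Dispatch("Word.Application")
#   doc = word.Documents.Add(r"C:\path\to\template.dotx")
#   fill_template(doc, "content_start", "Some data or another")
#   doc.SaveAs(r"C:\path\to\report.docx")
```

Text inserted this way picks up whatever style is in effect at the bookmark, so heading styles still need to be applied to the ranges that should appear in the ToC.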

Removing Paragraph From Cell In Python-Docx

I am attempting to create a table with a two row header that uses a simple template format for all of the styling. The two row header is required because I have headers that are the same under two primary categories. It appears that the only way to handle this within Word so that a document will format and flow with repeating header across pages is to nest a two row table into the header row of a main content table.
In Python-DocX a table cell is always created with a single empty paragraph element. For my use case I need to be able to remove this empty paragraph element entirely, not simply clear it with an empty string. Otherwise I have a line break above my nested table that ruins my illusion of a single table.
So the question is how do you remove the empty paragraph?
If you know of a better way to handle the two row header implementation... that would also be appreciated info.

While Paragraph.delete() is not implemented yet in python-docx, there is a workaround function documented here: https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
Note that a table cell must always end with a paragraph. So you'll need to add an empty one after your table otherwise I believe you'll get a so-called "repair-step" error when you try to load the document.
Probably worth a try without the extra paragraph just to confirm; I expect it would look better without it, but last time I tried that I got the error.

As @scanny said above, you can delete the current paragraph by passing it to a self-defined delete function.
I just want to add a supplement in case you want to delete multiple paragraphs.
import docx
def delete_paragraph(paragraph):
    # detach the underlying paragraph element from its parent
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None
def remove_multiple_para(doc):
    i = 0
    while i < len(doc.paragraphs):
        if 'xxxx' in doc.paragraphs[i].text:
            # delete the related 4 lines (i-1 .. i+2), clamped to the document
            last = min(i + 2, len(doc.paragraphs) - 1)
            for j in range(last, max(i - 2, -1), -1):
                delete_paragraph(doc.paragraphs[j])
        i += 1
    doc.save('outputDoc.docx')
doc = docx.Document('./inputDoc.docx')
remove_multiple_para(doc)
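Applied to the original question (a nested table at the very top of a cell), the same helper might be used as sketched below. This assumes `cell` is a python-docx table cell and that `cell.add_table()` leaves a paragraph after the new table, which is my understanding of python-docx but worth verifying for your version. The guard keeps the leading paragraph if it would otherwise be the cell's only one. The function name `nest_table_flush` is made up.

```python
def delete_paragraph(paragraph):
    """Detach a paragraph's XML element from its parent (workaround helper)."""
    element = paragraph._element
    element.getparent().remove(element)
    paragraph._p = paragraph._element = None


def nest_table_flush(cell, rows, cols):
    """Add a nested table to `cell` and drop the empty paragraph above it."""
    leading = cell.paragraphs[0]
    table = cell.add_table(rows, cols)
    # A cell must still end with a paragraph, so only remove the leading one
    # when another paragraph remains in the cell.
    if len(cell.paragraphs) > 1:
        delete_paragraph(leading)
    return table


# Hypothetical usage:
#   header_cell = main_table.cell(0, 0)
#   inner = nest_table_flush(header_cell, 2, 3)
```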
