Pywin32: how to replace text in an entire Word document - python

I am having trouble replacing words in the entire text document. The following code works to replace words in the main paragraphs but not when text is present in text boxes.
wdFindContinue = 1
wdReplaceAll = 2
word.Selection.Find.Execute(find_str, True, True, False, False, False, True, wdFindContinue, False, replace_str, wdReplaceAll)
Is there a way to replace words in the entire document?

See the article by Word MVPs, "Using a macro to replace text wherever it appears in a document".
A collaborative effort of MVP’s Doug Robbins and Greg Maxey with
enhancements by Peter Hewett and Jonathan West
Using the Find or Replace utility on the Edit menu you can find or
replace text "almost" anywhere it appears in the document. If you
record that action however, the scope or "range" of the resulting
recorded macro will only act on the text contained in the body of the
document (or more accurately, it will only act on the part of the
document that contains the insertion point). This means that if the
insertion point is located in the main body of the document when your
macro is executed it will have no effect on text that is in the
headers or footers of the document, for example, or in a textbox,
footnotes, or any other area that is outside the main body of the
document.
Even the Find and Replace utility has a shortcoming. For example, text
in a textbox located in a header or footer is outside the scope of the
Find and Replace utility search range.
***
To use a macro to find or replace text anywhere in a document, it is
necessary to loop through each individual part of the document. In
VBA, these parts are called StoryRanges. Each StoryRange is identified
by a unique wdStoryType constant.
There are eleven different wdStoryType constants that can form the
StoryRanges (or parts) of a document (ok, a few more in later versions
of Word, but they have no bearing in this discussion). Simple
documents may contain only one or two StoryRanges, while more complex
documents may contain more. The wdStoryTypes that have a role in find
and replace are:
wdCommentsStory, wdEndnotesStory, wdEvenPagesFooterStory,
wdEvenPagesHeaderStory, wdFirstPageFooterStory,
wdFirstPageHeaderStory, wdFootnotesStory, wdMainTextStory,
wdPrimaryFooterStory, wdPrimaryHeaderStory, and wdTextFrameStory.
The complete code to find or replace text anywhere is a bit complex.
Accordingly, let’s take it a step at a time to better illustrate the
process. In many cases the simpler code is sufficient for getting the
job done.
Step 1
The following code loops through each StoryRange in the active
document and replaces the specified .Text with .Replacement.Text:
Sub FindAndReplaceFirstStoryOfEachType()
    Dim rngStory As Range
    For Each rngStory In ActiveDocument.StoryRanges
        With rngStory.Find
            .Text = "find text"
            .Replacement.Text = "I'm found"
            .Wrap = wdFindContinue
            .Execute Replace:=wdReplaceAll
        End With
    Next rngStory
End Sub
(Note for those already familiar with VBA: whereas if you use
Selection.Find, you have to specify all of the Find and Replace
parameters, such as .Forward = True, because the settings are
otherwise taken from the Find and Replace dialog's current settings,
which are “sticky”, this is not necessary if using [Range].Find –
where the parameters use their default values if you don't specify
their values in your code).
The simple macro above has shortcomings. It only acts on the "first"
StoryRange of each of the eleven StoryTypes (i.e., the first header,
the first textbox, and so on). While a document only has one
wdMainTextStory StoryRange, it can have multiple StoryRanges in some
of the other StoryTypes. If, for example, the document contains
sections with un-linked headers and footers, or if it contains
multiple textboxes, there will be multiple StoryRanges for those
StoryTypes and the code will not act upon the second and subsequent
StoryRanges. To even further complicate matters, if your document
contains unlinked headers or footers and one of the headers or footers
is empty, then VBA can have trouble "jumping" that empty header or
footer to process subsequent headers and footers.
The page has more, but the above should help.
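For the pywin32 code in the question, the Step 1 loop could be translated roughly as below. This is a minimal sketch, not the article's full macro: it assumes `doc` is a Word Document COM object, reuses the question's Execute arguments, and follows each range's NextStoryRange property to reach the second and later ranges of a story type. The function names `replace_everywhere` and `replace_in_file` are made up for illustration, and the empty-header problem mentioned above is not handled.

```python
WD_FIND_CONTINUE = 1
WD_REPLACE_ALL = 2


def replace_everywhere(doc, find_str, replace_str):
    """Run Find/Replace on every story range of `doc`.

    Returns the number of ranges that were searched.
    """
    searched = 0
    for story in doc.StoryRanges:
        rng = story
        # Walk the chain of linked ranges of this story type
        # (for example the second and third text boxes).
        while rng is not None:
            rng.Find.Execute(
                find_str, True, True, False, False, False, True,
                WD_FIND_CONTINUE, False, replace_str, WD_REPLACE_ALL,
            )
            searched += 1
            rng = rng.NextStoryRange  # None once the chain ends
    return searched


def replace_in_file(path, find_str, replace_str):
    """Hypothetical wrapper: needs Windows, Word and pywin32."""
    import win32com.client

    word = win32com.client.Dispatch("Word.Application")
    try:
        doc = word.Documents.Open(path)
        replace_everywhere(doc, find_str, replace_str)
        doc.Save()
        doc.Close()
    finally:
        word.Quit()
```

Using Range.Find instead of Selection.Find also means the insertion point no longer decides which part of the document is searched.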

Related

How to filter PDF text by font?

PDF example
A PDF may contain multiple fonts, how can I only keep 1 font with the most words with Python?
disclaimer: I am the author of borb (the library I will use in this example)
Oddly enough, there is a fairly close example in the borb examples repository for filtering by font. You can find that example here.
In this example, we extract all the text in a particular font in the PDF (e.g. all text written in Courier).
You can easily base yourself on this code to build something that checks the number of characters for each particular font (and at a later stage, return only the font with the most characters).
I'll repeat the example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.font_name_filter import FontNameFilter
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction
def main():
    # create FontNameFilter
    l0: FontNameFilter = FontNameFilter("Courier")
    # filtered text just gets passed to SimpleTextExtraction
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l0])
    # check whether we have read a Document
    assert doc is not None
    # print the text on the first page that is rendered in the filtered font
    print(l1.get_text_for_page(0))
if __name__ == "__main__":
    main()
Aside from the imports, everything is quite straightforward. You specify the string of the font you want to filter on. This filter object will process the parsing/rendering of the PDF, and will only push events to its children if they are relevant (if the font information matches).
We add SimpleTextExtraction as its child, and in doing so only get the text that is rendered in the desired font.
After we've set up this entire thing, we need to actually process (parse) the Document which is what happens in the next lines.
Some caveats:
PDF documents might contain so-called 'subset fonts'. This is when a font is artificially made smaller by throwing out unused letters, i.e. if a PDF never uses the 'uppercase X' letter then the font does not need to store information on how to render it. Typically, the names of subset fonts are not the same as those of their original font. You might get something like AEOKFF+Courier (a six-letter tag, a plus sign, then the original name).
If this happens to be the case, check out the code of FontNameFilter and make another version that only checks the name using endswith, which ought to do the trick.
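The counting step the question asks for (keep the font with the most words) could be sketched in plain Python as below. It does not use borb at all: `chunks` is a hypothetical list of (font name, text) pairs that you would have to collect yourself during extraction, and `base_font_name` and `dominant_font` are illustrative names. The sketch assumes subset fonts carry the prefix the PDF specification describes, six uppercase letters followed by a plus sign.

```python
import re
from collections import Counter

# Subset tag as described in the PDF specification, e.g. "AEOKFF+Courier".
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def base_font_name(font_name):
    """Strip a subset tag so all subsets of one font count together."""
    return SUBSET_TAG.sub("", font_name)


def dominant_font(chunks):
    """Return the font name that covers the most words, or None.

    `chunks` is an iterable of (font_name, text) pairs.
    """
    words_per_font = Counter()
    for font_name, text in chunks:
        words_per_font[base_font_name(font_name)] += len(text.split())
    if not words_per_font:
        return None
    return words_per_font.most_common(1)[0][0]


if __name__ == "__main__":
    sample = [
        ("AEOKFF+Courier", "three short words"),
        ("Helvetica", "two words"),
        ("Courier", "more"),
    ]
    print(dominant_font(sample))
```

The name it returns could then be passed to a font filter like the one in the example above.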

How do "runs" work in python-docx

I'm having difficulty understanding the "run" object. As described, a run identifies a continuous stretch of text with the same style. However, when I iterate over the runs of a paragraph in which all the words have the same style, "runs" still returns more than one piece of text.
To make sure I didn't miss any possible style issues, I created a new Word document and typed the following:
Hjkhkuhu joiuiuouoiuo iouiouououoi iouiououiuiuiui hhvvhgh hgjjhjhhh hjhjhjhjhjhj hjhjhj, jjkjkjk jkjkjkjkiuio uiouiouoo! jkjkjlkjlk
And I ran the code below:
from docx import Document
doc = Document('test.docx')
for p in doc.paragraphs:
    for run in p.runs:
        print(run.text)
And here is the result I got:
Hjkhkuhu
joiuiuouoiuo
iouiouououoi
iouiououiuiuiui
hhvvhgh
hgjjhjhhh
hjhjhjhjhjhj
hjhjhj
jjkjkjk
jkjkjkjkiuio
uiouiouoo ! jkjkjlkjlk
Why is this the case? Did I miss anything?
Having spent a few days tussling with docx runs now...
Paragraphs <w:p> contain one or more "style" runs <w:r>, each containing one or more texts <w:t>.
But docx runs are very easy to break, and when broken they can hide it very well.
Just having two pieces of text in the same format isn't necessarily enough to put them in the same run. Runs don't automatically join, and changing the format of text in a run and then changing it back is enough to give you two separate but identically formatted runs.
(Greater experts than I dig more into runs and text in the question "DOCX w:t (text) elements crossing multiple w:r (run) elements?")
This caused me a lot of problems with a 'tag substitution within runs' task, leading me to conclude that the only way to guarantee text is all in one run, is to enter it yourself (with unchanging format) in one fell swoop.
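To make the <w:p> / <w:r> / <w:t> structure concrete, here is a small sketch that uses only the standard library. The XML is a hand-written stand-in for what Word might store, not output taken from a real file, and `run_texts` is an illustrative helper. The three runs carry no formatting at all, yet they remain three separate runs.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hand-written example paragraph: identical (empty) formatting, three runs.
PARAGRAPH_XML = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t>Hjkhkuhu</w:t></w:r>
  <w:r><w:t xml:space="preserve"> joiuiuouoiuo</w:t></w:r>
  <w:r><w:t xml:space="preserve"> iouiouououoi</w:t></w:r>
</w:p>
"""


def run_texts(paragraph):
    """Return the text of each <w:r> run in a <w:p> element."""
    return [
        "".join(t.text or "" for t in run.findall("w:t", NS))
        for run in paragraph.findall("w:r", NS)
    ]


paragraph = ET.fromstring(PARAGRAPH_XML)
texts = run_texts(paragraph)
print(len(texts))      # number of runs
print("".join(texts))  # the paragraph text, rejoined across runs
```

Joining the run texts gives back the paragraph text, which is why code that needs the whole sentence usually reads the paragraph's text rather than individual runs.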

Can python-docx preserve font color and styles when importing documents?

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
    doc=docx.Document(f)
    fullText=[]
    for para in doc.paragraphs:
        fullText.append(para.text) #generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l: #check last two characters to see if this is the first question
                finaldoc.add_section() #only add a page break between questions
        finaldoc.add_paragraph(l)
    # finaldoc.add_page_break
    # finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and this method loses them since it just gets the raw text and dumps it. I can't find anything in the python-docx documentation about preserving text styles or importing anything other than raw text. Does anyone know how to go about this?
Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You should first create a new blank paragraph, then add_run() one by one with the text from your original. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly, as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
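The "# etc" part could be filled in along these lines. This is a sketch under assumptions rather than a complete solution: it copies only a handful of python-docx font attributes (name, size, bold, italic, underline and the RGB color), skips anything that is unset on the source run, and ignores formatting inherited from named styles. `copy_run_format` and `copy_paragraphs` are illustrative names, not part of python-docx.

```python
# Font attributes to carry over; extend the tuple as needed.
FONT_ATTRS = ("name", "size", "bold", "italic", "underline")


def copy_run_format(src_run, dst_run):
    """Copy directly applied font settings from one run to another."""
    for attr in FONT_ATTRS:
        value = getattr(src_run.font, attr)
        if value is not None:  # None means "not set on this run"
            setattr(dst_run.font, attr, value)
    rgb = src_run.font.color.rgb
    if rgb is not None:
        dst_run.font.color.rgb = rgb


def copy_paragraphs(src_doc, dst_doc):
    """Re-create every paragraph of src_doc in dst_doc, run by run."""
    for para in src_doc.paragraphs:
        new_para = dst_doc.add_paragraph()
        for run in para.runs:
            copy_run_format(run, new_para.add_run(run.text))
```

With python-docx installed, `copy_paragraphs(docx.Document("in.docx"), doc_out)` would be the intended call, where "in.docx" is a placeholder file name.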
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.

MS Word document structure, COM calls and Python

I am using comtypes to call functions and create an MS Word document. As this is the first time I have written such a program, there is something I don't understand correctly. What I want to do is:
Create sections in the document and call them A, B, ...
In each section, create paragraphs that contain text. For section A, call the paragraphs a1, a2, a3, ...
Add formatting to each paragraph in each section; the formatting may be different for each paragraph
Below are some code fragments in VBA. VBA is used since the translation to comtypes is almost direct and there are more examples on the net for VBA.
Set myRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
ActiveDocument.Sections.Add Range:=myRange 'Section A
Set newRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
newRange.Paragraphs.Add
I get stuck when trying to select paragraph a1 and set its text. What I am missing is a function that gets the collection of paragraphs in section A.
The following VBA, based on the code in the question, illustrates getting a Document object, adding a Section, getting the Paragraphs of that Section, getting the Paragraphs of any given Section in a document, getting the first or any Paragraph from a Paragraphs collection.
Set doc = ActiveDocument 'More efficient if the Document object will be used more than once
Set section1 = doc.Sections.Add(Range:=myRange) 'Section A | Type Word.Section
Set section1Paras = section1.Range.Paragraphs 'Type Word.Paragraphs (reached through the Section's Range)
'OR
Set sectionParas = doc.Sections(n).Range.Paragraphs 'where n = section index number
Set para = sectionParas.First 'OR = sectionParas(n) where n = paragraph index number
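Translated to Python, the lookup could be sketched as below. It assumes `doc` is a Word Document COM object obtained through comtypes or pywin32, and it reaches the paragraphs through the section's Range. The helper names `paragraph_in_section` and `prepend_text` are invented for this example; Item and InsertBefore are Word object model members.

```python
def paragraph_in_section(doc, section_index, paragraph_index=1):
    """Return one paragraph of one section (both indexes are 1-based)."""
    section = doc.Sections.Item(section_index)
    return section.Range.Paragraphs.Item(paragraph_index)


def prepend_text(paragraph, text):
    """Insert text at the start of the paragraph.

    InsertBefore keeps the paragraph mark in place, so the paragraph
    is not merged with the one that follows it.
    """
    paragraph.Range.InsertBefore(text)


# Intended use against a live document, for example:
#     a1 = paragraph_in_section(doc, 1, 1)   # section A, paragraph a1
#     prepend_text(a1, "Text for a1")
```

Formatting can then be applied per paragraph through the returned object, which matches the per-paragraph formatting goal in the question.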

How to retain the original font properties (bold, italics, etc.) when replacing text with python-docx

I am using python-docx for an automation tool. My issue is that after I run the code that replaces certain words from one list with the corresponding words from another list, it removes all the properties of the text in the paragraphs and tables (such as font size, font name, bold or italics on part of the text, and bookmarks), and the result is plain text in "Calibri" with a font size of 12.
The code that I used is:
from docx import Document
from docx.document import Document as _Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph
wrongWord = "xyz"
correctWord = "abcd"
def iter_block_items(parent):
    # yield each paragraph and table under parent, in document order
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError("something's not right")
    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)
document = Document(r"F:\python\documentSample.docx")
for block in iter_block_items(document):
    if isinstance(block, Paragraph):
        if wrongWord in block.text:
            block.text = block.text.replace(wrongWord, correctWord)
    else:
        for row in block.rows:
            for cell in row.cells:
                if wrongWord in cell.text:
                    cell.text = cell.text.replace(wrongWord, correctWord)
document.save(r"F:\python\documentSampleAfterChanges.docx")
Could you help me keep the same font size, font name and other associated properties from the original file after the text replacement?
Search and replace is a hard problem in the general case, which is the main reason that feature hasn't been added yet.
What's happening here is that assigning to the .text attribute on the cell is removing all the existing runs and the font-related attributes are removed with those runs.
Font information (e.g. bold, italic, typeface, size) is stored at the run level (a paragraph is composed of zero or more runs). Assigning to the .text attribute removes all the runs and replaces them with a single new run containing the assigned text.
So the challenge is to find the text within the multiple runs somewhere, and preserve as much of the font formatting settings as possible.
This is a hard problem because Word breaks paragraph text into separate runs for many reasons, and runs tend to proliferate. There's no guarantee at all that your search term will be completely enclosed in a single run or start at a run boundary. So perhaps you start to see the challenge of a general-case solution.
One thing you can do that might work in your case is something like this:
# ---replace text of first run with new cell value---
runs = table_cell.paragraphs[0].runs
runs[0].text = replacement_text
# ---delete all remaining runs---
for run in runs[1:]:
    r = run._element
    r.getparent().remove(r)
Basically this replaces the text of the first run and deletes any remaining runs. Since the first run often contains the formatting you want, this can often work. If the first word is formatted differently though, say bold, then all the replacement text will be bold too. You'll have to see how this approach works in your specific case.
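A different, more conservative sketch replaces the word inside each run separately, so every run keeps its own formatting. It only works when the search term sits entirely inside one run, which, as noted above, is not guaranteed. `replace_in_runs` is an illustrative helper rather than a python-docx function, and it expects a python-docx Paragraph (or anything else with a `runs` list whose items have a `text` attribute).

```python
def replace_in_runs(paragraph, old, new):
    """Replace `old` with `new` inside each run of the paragraph.

    Returns the number of replacements made. Occurrences that are
    split across two or more runs are NOT found.
    """
    replaced = 0
    for run in paragraph.runs:
        if old in run.text:
            replaced += run.text.count(old)
            # Assigning to run.text changes the text of this run only.
            run.text = run.text.replace(old, new)
    return replaced
```

If the function returns 0 even though the word appears in the paragraph's text, the word is split across runs and the first-run approach above may be the practical fallback.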
