Replace string in paragraph while keeping style docx library - python

I am replacing strings in the tables and paragraphs of a Word document. However, the styles change. How can I keep the original style formatting?
import json
from docx import Document

with open(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges\VariableNames.json") as p:
    data = json.load(p)
document = Document(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges_update\F10352-JB117-FMXXX Pile XXXX As-built Memo GAIA Auto trial_v6.docx")
# replace keys in body paragraphs
for key, value in data.items():
    for paragraph in document.paragraphs:
        if key in paragraph.text:
            paragraph.text = paragraph.text.replace(str(key), str(value))
# replace keys in paragraphs inside table cells
for key, value in data.items():
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if key in paragraph.text:
                        paragraph.text = paragraph.text.replace(str(key), str(value))
There was a similar post, however it did not help me (maybe I did something wrong).

This should meet your needs. Requires docx2python 2.0.0+
from docx2python.utilities import replace_docx_text
replace_docx_text(
    input_filename,
    output_filename,
    ("Apples", "Bananas"),  # replace Apples with Bananas
    ("Pears", "Apples"),  # replace Pears with Apples
    ("Bananas", "Pears"),  # replace Bananas with Pears
    html=True,
)
You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.
To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False, in which case strings will be replaced regardless of format, and some formatting will be lost.
[1] The following will be preserved:
italic
bold
underline
strike
superscript
subscript
small caps
all caps
highlighted
font size
colored text
(some others, but not guaranteed)
Edit for follow-up question, how do I replace marker text in tables?
My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.
This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).
https://github.com/ShayHill/replace_docx_tables

@property
def text(self):
    """
    String formed by concatenating the text of each run in the paragraph.
    Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
    characters respectively.
    Assigning text to this property causes all existing paragraph content
    to be replaced with a single run containing the assigned text.
    A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
    and each ``\\n`` or ``\\r`` character is mapped to a line break.
    Paragraph-level formatting, such as style, is preserved. All
    run-level formatting, such as bold or italic, is removed.
    """
    text = ''
    for run in self.runs:
        text += run.text
    return text
From the documentation, it looks like the paragraph style should stay the same; however, bold/italic formatting can be removed.
If this is the formatting you are trying to preserve, you may need to identify which run the key is in first, then modify that run.
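One complication is that Word often splits what looks like one word across several runs, so the key may not sit inside a single run. Here is a minimal sketch of one way to handle that. The helper name replace_across_runs is hypothetical (it is not part of python-docx), and it only assumes the paragraph has a .runs list whose items have a writable .text, which is what python-docx paragraphs provide.

```python
def replace_across_runs(paragraph, key, value):
    """Replace every occurrence of key in a paragraph, even if it spans runs.

    The replacement is written into the run where the match starts, so it
    takes on that run's character formatting. Matched characters are removed
    from any later runs. Returns the number of replacements made.
    """
    if not key:
        return 0
    count = 0
    search_from = 0
    while True:
        runs = paragraph.runs
        full_text = "".join(run.text for run in runs)
        start = full_text.find(key, search_from)
        if start == -1:
            return count
        end = start + len(key)
        offset = 0  # position of the current run's first character in full_text
        for run in runs:
            run_text = run.text
            run_start, run_end = offset, offset + len(run_text)
            offset = run_end
            if run_end <= start or run_start >= end:
                continue  # this run does not overlap the match
            local_start = max(start - run_start, 0)
            local_end = min(end - run_start, len(run_text))
            before, after = run_text[:local_start], run_text[local_end:]
            if run_start <= start < run_end:
                run.text = before + value + after  # run where the match begins
            else:
                run.text = before + after  # later run: just drop matched characters
        count += 1
        # continue after the inserted value so a value containing key cannot loop forever
        search_from = start + len(value)
```

You would call it once per key for each paragraph, for example replace_across_runs(paragraph, str(key), str(value)). If the key spans runs with different formatting, the whole replacement ends up formatted like the first of those runs.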

In the docx library documentation located at
https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects, it states the following regarding assigning a value to paragraph.text :
"Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed. "
Are the changes in style you are observing consistent with that?
If so, then perhaps you are losing the "run" objects, with their specific styling, that are children of the paragraph object. In that case, you might be better off adding another level to your loop to iterate through all the paragraph.runs and replace the text on those individually.
For example, once you have the paragraph, then
for run in paragraph.runs:
    if key in run.text:
        run.text = run.text.replace(str(key), str(value))
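Putting that together for both body paragraphs and table cells might look like the sketch below. The names iter_paragraphs and replace_in_runs are made up for illustration; the code only relies on the document.paragraphs, document.tables, table.rows, row.cells, cell.paragraphs and paragraph.runs attributes already used in the question. It does not visit nested tables, and it will miss a key that Word has split across two runs.

```python
def iter_paragraphs(document):
    """Yield body paragraphs, then the paragraphs of every table cell."""
    yield from document.paragraphs
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


def replace_in_runs(document, replacements):
    """Apply each key -> value replacement run by run.

    Only run.text is reassigned, so each run keeps its character formatting.
    Returns the number of runs that were changed.
    """
    changed = 0
    for paragraph in iter_paragraphs(document):
        for run in paragraph.runs:
            new_text = run.text
            for key, value in replacements.items():
                new_text = new_text.replace(str(key), str(value))
            if new_text != run.text:
                run.text = new_text
                changed += 1
    return changed
```

With the question's variables, the call would be replace_in_runs(document, data), followed by document.save(...) with whatever output path you use.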

Related

How can I add a new paragraph at the end of a word document?

I would like to add a last paragraph after everything in the word doc that I have. I tried using this code, but the text is appended before my last table.
How can I make sure the text is always appended at the very end?
from docx import Document
document = Document('Summary_output.docx')
paragraphs = document.paragraphs
#Store content of second paragraph
text = paragraphs[1].text
#Clear content
paragraphs[1]._p.clear()
#Recreate second paragraph
paragraphs[1].add_run('Appended part ' + text)
document.save("Summary_output.docx")
Short answer: use document.add_paragraph().
new_last_paragraph = document.add_paragraph("Appended part %s" % text)
It's important to understand the distinction in Word between paragraphs and runs. A paragraph is a "block" item (as is a table). A block item fits between the margins, is vertically entirely below the prior block item and entirely above the following block item. Intuitively, it is a full-width block in the stack of full-width blocks appearing in the "column" bounded on each side by the margins.
A run is an inline item, a sequence of characters that all share the same character formatting. A run always appears within a paragraph and in general a paragraph contains multiple runs. Using runs is how you make, for example, a single word bold or a phrase within a paragraph italic or red. Runs are flowed within the paragraph by line-wrapping.
So in your code, you were just extending an existing paragraph (by adding a run) rather than creating a new one, which explains why its position did not change.

Changing select sections of text to different colors in a tkinter text box?

First, I would like to say I am not looking to take specific words and change every word matching a tag. I am trying to do something akin to highlighting text in Word and just changing the text color.
I have a program that outputs the string stored in a dictionary by calling the key. I want to change the color of sections of text within the string, but not necessarily every occurrence of that text.
I don't think tagging words would do the job for me, as it would take a lot of time to define all the different words I want to recolor, and I do not want all words to be changed.
Ideally, I could highlight the words and click a button to color the text, and the text would retain its color when I save it to the library. That is more along the lines of what I am trying to do.
Furthermore, is it possible to create some kind of rule that looks for a set of characters and changes the color of everything inside those characters?
Could I write something that would read the data inside the dictionary and look for an identifier, let's call it clrGr*, then look for a closing identifier, let's call it clrGr**, and turn any text between the identifiers green by calling a tag or something similar?
clrGr* Just random string of text between the identifiers clrGr**
What I have currently only lets me change the color of all the text.
Now what I want to do is have something like the following display in my tkinter text box.
Applying tags via regular expression
The text widget supports finding ranges of text specified by a regular expression.
Note: the expression must follow Tcl regular expression syntax, which has some slight differences from Python regular expressions.
For example:
import tkinter as tk
root = tk.Tk()
text = tk.Text(root)
text.pack(fill="both", expand=True)
text.tag_configure("highlight", foreground="green")
text.insert("1.0", '''
blah blah
blah clrGr* Just random string of text between the identifiers clrGr** blah
blah blah
''')
# this just highlights the first match, but you can put this code
# in a loop to highlight the whole file
char_count = tk.IntVar()
index = text.search(r'(?:clrGr\*).*(?:clrGr\*\*)', "1.0", "end", count=char_count, regexp=True)
# we have to adjust the character indexes to skip over the identifiers
if index != "":
    start = "%s + 6 chars" % index
    end = "%s + %d chars" % (index, char_count.get()-7)
    text.tag_add("highlight", start, end)
root.mainloop()
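To highlight every marked section rather than just the first, one option is sketched below. Note that it swaps in a different technique from the code above: instead of looping over text.search with a Tcl expression, it reads the widget content once and uses Python's re module, then converts character offsets into "1.0 + N chars" indexes. The function names are hypothetical, and the offset arithmetic assumes the widget holds only text (no embedded images or windows).

```python
import re

# Non-greedy, so two marked sections on the same line are matched separately.
MARKER_RE = re.compile(r"clrGr\*(.*?)clrGr\*\*", re.DOTALL)


def find_marked_spans(content):
    """Return (start, end) character offsets of the text between each marker pair."""
    return [match.span(1) for match in MARKER_RE.finditer(content)]


def highlight_all(text_widget, tag="highlight"):
    """Add `tag` to every marked section of a tkinter Text widget."""
    content = text_widget.get("1.0", "end-1c")
    for start, end in find_marked_spans(content):
        # offsets count characters (newlines included) from the start of the widget
        text_widget.tag_add(tag, "1.0 + %d chars" % start, "1.0 + %d chars" % end)
```

In the example above you would call highlight_all(text) after the insert and before root.mainloop().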
Saving the color information
You can call the dump method on a text widget to get a list of tuples that describe the content of the text widget. You can use this information either to write directly to a file (in a format that only your app will understand) or to convert the data to a known format.
The best description of what the dump method returns is in the Tcl/Tk man pages: http://tcl.tk/man/tcl8.6/TkCmd/text.htm#M108
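As a rough illustration of converting the dump data, the sketch below turns the (key, value, index) tuples into a list of (text, tags) pairs that you could then save, for instance as JSON. The function name dump_to_segments is made up, and it only looks at the "text", "tagon" and "tagoff" entries; marks, images and windows are ignored.

```python
def dump_to_segments(dump):
    """Convert Text.dump() output into a list of (text, tags) pairs.

    `dump` is the list of (key, value, index) tuples returned by e.g.
    text.dump("1.0", "end-1c", text=True, tag=True).
    """
    active = []    # tags currently switched on, in the order they were opened
    segments = []
    for key, value, _index in dump:
        if key == "tagon":
            active.append(value)
        elif key == "tagoff":
            if value in active:
                active.remove(value)
        elif key == "text":
            segments.append((value, tuple(active)))
    return segments
```

To restore the content, you could insert each segment back with text.insert("end", segment_text, segment_tags), since insert accepts a tag list as its third argument.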

Can I change text in MS Word using python-docx, without losing characteristics?

I have an English Word document, and I want to change its text into Chinese using Python. I've been using Python 3.4 and have installed python-docx. Here's my code:
from docx import Document
document = Document(*some MS Word file*)
# I only change the texts of the first two paragraphs
document.paragraphs[0].text = '带有消毒模式的地板清洁机'
document.paragraphs[1].text = '背景'
document.save(*save_file_path*)
The first two lines did turn into Chinese characters, but characteristics like font and bold are all gone.
Is there any way I could alter the text without losing the original characteristics?
It depends on how the characteristics are applied. There is a thing called the style hierarchy, and text characteristics can be applied anywhere from directly to a run of text, a style, or a document default, and levels in-between.
There are two main classes of characteristic: paragraph properties and run properties. Paragraph properties are things like justification, space before and after, etc. Everything having to do with character-level formatting, like size, typeface, color, subscript, italic, bold, etc. is a run property, also loosely known as a font.
So if you want to preserve the font of a run of text, you need to operate at the run level. An operation like this will preserve font formatting:
run.text = "New text"
An operation like this will preserve paragraph formatting, but remove any character level formatting not applied by the paragraph style:
paragraph.text = "New paragraph text"
You'll need to decide for your application whether you modify individual runs (which may be tricky to identify) or whether you work perhaps with distinct paragraphs and apply different styles to each. I recommend the latter. So in your example, "FLOOR CLEANING MACHINE ...", "BACKGROUND", and "[0001]..." would each become distinct paragraphs. In your screenshot they appear as separate runs in a single paragraph, separated by a line break.
You can get the style of the existing paragraphs and apply it to your new paragraphs - beware that the existing paragraphs might specify a font that does not support Chinese.
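If you do go the run-level route, a common shortcut is to put the new text into the first run and empty the others, so the paragraph takes on the first run's font. This is only a sketch under that assumption (the first run carries the formatting you care about), and the helper name is hypothetical, not part of python-docx.

```python
def set_text_keep_first_run_format(paragraph, new_text):
    """Replace a paragraph's text while keeping the first run's character formatting."""
    runs = paragraph.runs
    if not runs:
        # no runs, so there is no run-level formatting to preserve
        paragraph.text = new_text
        return
    runs[0].text = new_text
    for run in runs[1:]:
        run.text = ""  # empties the run, including any line breaks it held
```

For the question's document this would be called as set_text_keep_first_run_format(document.paragraphs[0], '带有消毒模式的地板清洁机'). Formatting that differed between the original runs is not preserved.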

Checking for particular style using python-docx

from docx import Document
document = Document('ABC.docx')
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.style == 'Strong':
            print(run.text)
This is the code I am using to open a docx file and check whether there is bold text, but I am not getting any result. If I remove the if statement, the entire file is printed without any formatting or styles. Can you please let me know how to identify text in a particular style, like bold or italics, using python-docx?
Thank you
Although bold and the style Strong appear the same when rendered, they use two different mechanisms. The first applies bold directly and the second applies a character style that can include any other number of font characteristics.
To identify all occurrences of text that appears bold, you may need to do both.
But to just find the text having bold applied you would do something like this:
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.bold:
            print(run.text)
Note there are ways this can miss text that appears bold, like text that appears in a paragraph whose font formatting is bold for the entire paragraph (Heading1 for example). But I think this is the property you were looking for.
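A rough way to catch some of those cases is to fall back to the run's character style and then the paragraph style when bold is not set directly. The sketch below is a simplification, not a full implementation of Word's rules: it ignores base styles, document defaults, and the other ways Word combines style levels. The function name appears_bold is made up; it relies on run.bold being True, False, or None (None meaning inherited) and on style.font.bold being available on both character and paragraph styles.

```python
def appears_bold(run, paragraph):
    """Best-effort bold check: direct formatting, then character style, then paragraph style."""
    for bold in (run.bold, run.style.font.bold, paragraph.style.font.bold):
        if bold is not None:  # None means "not specified at this level"
            return bold
    return False
```

You would use it inside the same loop as above, as in "if appears_bold(run, paragraph):".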
To check for a particular style, you could use the name property that is available on _ParagraphStyle and _CharacterStyle objects.
For example:
for paragraph in document.paragraphs:
    if 'List Paragraph' == paragraph.style.name:
        print(paragraph.text)
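The same idea works at the run level, which is closer to what the question attempted: compare the name of the run's character style instead of comparing the style object itself to a string. A minimal sketch, with a hypothetical helper name:

```python
def runs_with_character_style(document, style_name):
    """Return the text of every run whose character style has the given name."""
    return [
        run.text
        for paragraph in document.paragraphs
        for run in paragraph.runs
        if run.style is not None and run.style.name == style_name
    ]
```

For example, runs_with_character_style(document, 'Strong') would collect the text of the runs that have the Strong character style applied, but not runs that are bold only through direct formatting.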

New lines/tabulators turn into spaces in generated document

I have a problem with \n and \t characters. When I open a generated .docx in OpenOffice, everything looks fine, but when I open the same document in Microsoft Word, I only get the last two tabulators in the "Surname" section, and spaces instead of newlines/tabulators in the other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)
In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)
