I built a script that modifies some words in an imported Word file.
from docx import Document
document = Document('filename.docx')
paragraphs = document.paragraphs
dic = {'XxwordxX': modifiedword}
for p in document.paragraphs:
    for key in dic.keys():
        if key in p.text:
            p.text = p.text.replace(key,dic[key])
document.save('new.docx')
Unfortunately, the paragraph where the word has been modified isn't bold anymore (it was bold in filename.docx).
And I cannot find out how to make it bold again.
I tried this:
from docx import Document
document = Document('filename.docx')
paragraphs = document.paragraphs
dic = {'XxwordxX': modifiedword}
for p in document.paragraphs:
    for key in dic.keys():
        if key in p.text:
            p.text = p.text.replace(key,dic[key])
para = document.paragraphs[2]
run.bold = True
document.save('new.docx')
But even though I don't get any error, the paragraph isn't bold in the output.
Any ideas how to do that? The only tips I found on the internet are for making a manually added paragraph bold.
I think you're looking for a solution like the paragraph_replace_text() function here:
https://github.com/python-openxml/python-docx/issues/30#issuecomment-879593691
The problem is that assigning to Paragraph.text removes all the runs in the paragraph and replaces them with a single run with default formatting. That's often convenient, but not sufficient in the "replace text while preserving formatting" case.
The documentation on that code snippet is pretty thorough and the comment following that one provides some further explanation. You may need to tweak the paragraph_replace_text() function to suit your use case, so best to study it closely enough to understand basically how it works.
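To make the idea concrete, here is a minimal sketch of a replacement that edits run text in place instead of assigning to Paragraph.text. It is not the linked paragraph_replace_text() function, only a simplified illustration; the name replace_across_runs is made up for this example, and the demo uses stand-in objects rather than a real document. It assumes each run exposes a writable .text attribute, as python-docx runs do.

```python
from types import SimpleNamespace


def replace_across_runs(runs, old, new):
    """Replace the first occurrence of `old` in the concatenated run text.

    `runs` is any sequence of objects with a writable `.text` attribute
    (hypothetical helper; with python-docx you would pass paragraph.runs).
    Returns True if a replacement was made.
    """
    if not old:
        return False
    full_text = "".join(run.text for run in runs)
    start = full_text.find(old)
    if start == -1:
        return False
    end = start + len(old)

    pos = 0           # offset of the current run within full_text
    inserted = False  # whether `new` has been written yet
    for run in runs:
        run_start, run_end = pos, pos + len(run.text)
        pos = run_end
        if run_end <= start or run_start >= end:
            continue  # this run lies entirely outside the match
        # Parts of this run that fall before and after the match.
        before = run.text[: max(start - run_start, 0)]
        after = run.text[max(end - run_start, 0):]
        if not inserted:
            # The run where the match begins receives the new text,
            # so the replacement keeps that run's formatting.
            run.text = before + new + after
            inserted = True
        else:
            run.text = before + after
    return True


# Stand-in runs: the marker is split over three runs.
demo_runs = [
    SimpleNamespace(text="Hello Xxwo", bold=True),
    SimpleNamespace(text="rd", bold=False),
    SimpleNamespace(text="xX!", bold=True),
]
replaced = replace_across_runs(demo_runs, "XxwordxX", "world")
```

With a real document you would call it as replace_across_runs(p.runs, key, dic[key]) inside the loop. Because no run is removed or recreated, bold and other run-level formatting stay where they were.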
Related
I am replacing strings in the tables and paragraphs of a Word document. However, the styles change. How can I keep the original style and formatting?
import json
from docx import Document

with open(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges\VariableNames.json") as p:
    data = json.load(p)
document = Document(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges_update\F10352-JB117-FMXXX Pile XXXX As-built Memo GAIA Auto trial_v6.docx")
# Replace in body paragraphs
for key, value in data.items():
    for paragraph in document.paragraphs:
        if key in paragraph.text:
            paragraph.text = paragraph.text.replace(str(key), str(value))
# Replace in table cells
for key, value in data.items():
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if key in paragraph.text:
                        paragraph.text = paragraph.text.replace(str(key),str(value))
There was a similar post, however it did not help me (maybe I did something wrong).
This should meet your needs. Requires docx2python 2.0.0+
from docx2python.utilities import replace_docx_text
replace_docx_text(
    input_filename,
    output_filename,
    ("Apples", "Bananas"),  # replace Apples with Bananas
    ("Pears", "Apples"),  # replace Pears with Apples
    ("Bananas", "Pears"),  # replace Bananas with Pears
    html=True,
)
You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.
To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False, in which case strings will be replaced regardless of format, and some formatting will be lost.
[1] The following will be preserved:
italic
bold
underline
strike
superscript
subscript
small caps
all caps
highlighted
font size
colored text
(some others, but not guaranteed)
Edit for follow-up question, how do I replace marker text in tables?
My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.
This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).
https://github.com/ShayHill/replace_docx_tables
@property
def text(self):
    """
    String formed by concatenating the text of each run in the paragraph.
    Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
    characters respectively.
    Assigning text to this property causes all existing paragraph content
    to be replaced with a single run containing the assigned text.
    A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
    and each ``\\n`` or ``\\r`` character is mapped to a line break.
    Paragraph-level formatting, such as style, is preserved. All
    run-level formatting, such as bold or italic, is removed.
    """
    text = ''
    for run in self.runs:
        text += run.text
    return text
From the documentation, it looks like paragraph styles should stay the same; however, run-level formatting such as bold or italic is removed.
If this is the formatting you are trying to preserve, you may need to identify what run the key is in first then modify it.
In the docx library documentation located at
https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects, it states the following regarding assigning a value to paragraph.text :
"Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed. "
Are the changes in style you are observing consistent with that?
If so, then perhaps you are losing the "run" objects, with their specific styling, that are children of the paragraph object. In that case, you might be better off adding another level to your loop to iterate through all the paragraph.runs and replace the text on those individually.
For example, once you have the paragraph, then
for run in paragraph.runs:
    if key in run.text:
        run.text = run.text.replace(str(key), str(value))
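One caveat with the per-run loop is that Word may split a key over several runs, in which case no single run contains it. The sketch below wraps the loop so it reports such keys, and adds a generator that walks body paragraphs and table cells in one pass. The names iter_paragraphs and replace_in_runs are made up for this example, nested tables are not handled, and the demo uses stand-in objects instead of a real document.

```python
from types import SimpleNamespace


def iter_paragraphs(document):
    """Yield body paragraphs, then the paragraphs of every table cell."""
    yield from document.paragraphs
    for table in document.tables:
        for row in table.rows:
            for cell in row.cells:
                yield from cell.paragraphs


def replace_in_runs(paragraph, mapping):
    """Replace each key of `mapping` inside individual runs.

    Returns the keys still found in the paragraph text afterwards,
    which are probably split across more than one run.
    """
    for run in paragraph.runs:
        for key, value in mapping.items():
            if key in run.text:
                run.text = run.text.replace(str(key), str(value))
    remaining = "".join(run.text for run in paragraph.runs)
    return [key for key in mapping if key in remaining]


# Stand-in paragraph: "CODE" is split over the last two runs.
demo_para = SimpleNamespace(runs=[
    SimpleNamespace(text="Dear NAME, "),
    SimpleNamespace(text="ref CO"),
    SimpleNamespace(text="DE"),
])
missed = replace_in_runs(demo_para, {"NAME": "Ada", "CODE": "42"})
```

With a real document this would be used as: for paragraph in iter_paragraphs(document): replace_in_runs(paragraph, data). Any key returned in the list needs a replacement that works across run boundaries.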
I am using Python docx to remove blank lines from documents containing text and images. Using the paragraph.clear() and paragraph.run.clear() works to a point, but the outputted file still has blank lines which only have a paragraph mark shown in Word. Is there a way of searching directly for paragraph marks? Or is there a better way of clearing the lines?
# code snippet
for paragraphs in document.paragraphs:
    if paragraphs.text == "\n":
        paragraphs.clear()
Empty lines are not marked by "\n" but by empty string "".
Plus, clear() removes text but not the paragraph itself.
Try to test len(paragraph.text)==0 for each paragraph.
This removed all the empty lines for me in my document file
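for paragraph in doc.paragraphs:
    if len(paragraph.text) == 0:
        # Detach the underlying <w:p> element from the document body
        p = paragraph._element
        p.getparent().remove(p)
        # Invalidate the proxy object so it is not used by mistake afterwards
        paragraph._p = paragraph._element = None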
Using len(paragraph.text)==1 helps as opposed to using len(paragraph.text)==0 as new line is also a character.
I just wanted to copy the lines except the blank lines to a new document so it gave me the output.
When I used paragraph.text = paragraph.text.strip('\n'), the font style, bold, underline and italic formatting were removed. So checking the length of each paragraph and clearing that paragraph does the trick.
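The answers above disagree on whether a blank paragraph has length 0 or 1. A whitespace-insensitive test sidesteps the question. This is only a sketch: is_blank and non_blank are made-up helper names, and the demo uses stand-in objects with a .text attribute. One caution, as far as I know: a paragraph that holds only a picture also reports empty text, so in a document with images it is worth checking for that before deleting anything.

```python
from types import SimpleNamespace


def is_blank(paragraph):
    """True when the paragraph text is empty or only whitespace."""
    return paragraph.text.strip() == ""


def non_blank(paragraphs):
    """Return the paragraphs that contain visible text."""
    return [p for p in paragraphs if not is_blank(p)]


# Stand-in paragraphs covering "", "\n" and whitespace-only cases.
demo = [SimpleNamespace(text=t) for t in ["Title", "", "\n", "  \t", "Body"]]
kept = [p.text for p in non_blank(demo)]
```

Here kept ends up as ["Title", "Body"], so the same test covers both the len(...) == 0 and the len(...) == 1 situations.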
I am trying to batch-process many Microsoft Word documents in .docx format from Python.
The following code accomplishes what I need, except it loses special characters I would like to preserve, like the right arrow symbol and bullets.
import docx
def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return fullText
getText('example.docx')
The Paragraph.text property in python-docx returns the plain text in the paragraph as a string. This is a very common requirement.
Bullets, or numbered lists in general (of which bullets are a type) are not reflected in the text of a paragraph, even though it may appear that way on-screen. This sort of thing will be an additional property of the paragraph.
One way bullets can be applied is using the 'List Bullet' style. The paragraph style is available on Paragraph.style.
The documentation here is your friend for this and other details, in particular the 11 topics in the User Guide section:
http://python-docx.readthedocs.io/en/latest/
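As a rough illustration of the 'List Bullet' check mentioned above, the sketch below prefixes a marker to paragraphs whose style name starts with "List Bullet" (which also matches "List Bullet 2" and "List Bullet 3"). The function name get_text_with_bullets and the marker string are assumptions made for this example. Bullets applied through other styles or through direct numbering would not be detected, and the demo uses stand-in objects rather than a real document.

```python
from types import SimpleNamespace


def get_text_with_bullets(paragraphs, marker="\u2022 "):
    """Return paragraph texts, marking paragraphs in a bullet list style."""
    lines = []
    for para in paragraphs:
        style_name = para.style.name if para.style is not None else ""
        if style_name.startswith("List Bullet"):
            lines.append(marker + para.text)
        else:
            lines.append(para.text)
    return lines


def _para(text, style_name):
    # Stand-in for a python-docx paragraph with .text and .style.name
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style_name))


demo = [
    _para("Intro", "Normal"),
    _para("first", "List Bullet"),
    _para("second", "List Bullet 2"),
]
lines = get_text_with_bullets(demo)
```

With a real file you would pass doc.paragraphs in place of the demo list.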
from docx import *
document = Document('ABC.docx')
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.style == 'Strong':
            print(run.text)
This is the code I am using to open a docx file and check whether it contains bold text, but I am not getting any result. If I remove the if statement, the entire file is printed without any formatting or styles. Can you please let me know how to identify text in a particular style, like bold or italics, using python-docx?
Thank you
Although bold and the style Strong appear the same when rendered, they use two different mechanisms. The first applies bold directly and the second applies a character style that can include any other number of font characteristics.
To identify all occurrences of text that appears bold, you may need to do both.
But to just find the text having bold applied you would do something like this:
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        if run.bold:
            print(run.text)
Note there are ways this can miss text that appears bold, like text that appears in a paragraph whose font formatting is bold for the entire paragraph (Heading1 for example). But I think this is the property you were looking for.
To check for a particular style you could use the name property that is available in _ParagraphStyle objects or _CharacterStyle objects
example:
for paragraph in document.paragraphs:
    if 'List Paragraph' == paragraph.style.name:
        print(paragraph.text)
I have a problem with \n and \t characters. When I open the generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I only get the last two tabs in the "Surname" section, and spaces instead of newlines/tabs in the other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)
In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)