Using python-docx to to read .docx, preserving special characters, bullets

Using python-docx to to read .docx, preserving special characters, bullets - python

I am trying to batch manipulate many microsoft word documents in .docx format within python.
The following code accomplishes what I need, except it loses special characters I would like to preserve, like the right arrow symbol and bullets.
import docx
def getText(filename):
doc = docx.Document(filename)
fullText = []
for para in doc.paragraphs:
fullText.append(para.text)
return fullText
getText('example.docx')

The Paragraph.text property in python-pptx returns the plain text in the paragraph as a string. This is a very common requirement.
Bullets, or numbered lists in general (of which bullets are a type) are not reflected in the text of a paragraph, even though it may appear that way on-screen. This sort of thing will be an additional property of the paragraph.
One way bullets can be applied is using the 'List Bullet' style. The paragraph style is available on Paragraph.style.
The documentation here is your friend for this and other details, in particular the 11 topics in the User Guide section:
http://python-docx.readthedocs.io/en/latest/

Related

Replace string in paragraph while keeping style docx library

I am replacing the strings in tables and paragraphs of word document. However the styles change. How can I keep original style format?
with open(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges\VariableNames.json") as p:
data = json.load(p)
document = Document(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges_update\F10352-JB117-FMXXX Pile XXXX As-built Memo GAIA Auto trial_v6.docx")
for key, value in data.items():
for paragraph in document.paragraphs:
if key in paragraph.text:
paragraph.text = paragraph.text.replace(str(key), str(value))
for key, value in data.items():
for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if key in paragraph.text:
paragraph.text = paragraph.text.replace(str(key),str(value))
There was a similar post, however it did not help me (maybe I did something wrong).

This should meet your needs. Requires docx2python 2.0.0+
from docx2python.utilities import replace_docx_text
replace_docx_text(
input_filename,
output_filename,
("Apples", "Bananas"), # replace Apples with Bananas
("Pears", "Apples"), # replace Pears with Apples
("Bananas", "Pears"), # replace Bananas with Pears
html=True,
)
You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.
To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False, in which case strings will be replaced regardless of format, and some formatting will be lost.
[1] The following will be preserved:
italic
bold
underline
strike
superscript
subscript
small caps
all caps
highlighted
font size
colored text
(some others, but not guaranteed)
Edit for follow-up question, how do I replace marker text in tables?
My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.
This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).
https://github.com/ShayHill/replace_docx_tables

#property
def text(self):
"""
String formed by concatenating the text of each run in the paragraph.
Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
characters respectively.
Assigning text to this property causes all existing paragraph content
to be replaced with a single run containing the assigned text.
A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
and each ``\\n`` or ``\\r`` character is mapped to a line break.
Paragraph-level formatting, such as style, is preserved. All
run-level formatting, such as bold or italic, is removed.
"""
text = ''
for run in self.runs:
text += run.text
return text
From the documentation it looks like the Styles should stay the same however; bold/italic formatting can be removed.
If this is the formatting you are trying to preserve, you may need to identify what run the key is in first then modify it.

In the docx library documentation located at
https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects, it states the following regarding assigning a value to paragraph.text :
"Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed. "
Are the changes in style you are observing consistent with that?
If so, then perhaps you are loosing the "run" objects with their specific styling that are children of the paragraph object. In that case, you might be better of adding another level to your loop to iterate through all the paragraph.runs and replace the text on those individually.
For example, once you have the paragraph, then
for run in paragraph.runs:
if key in run.text:
run.text = run.text.replace(str(key), str(value))

How to get to know paragraph color docx python?

I have document where some lines are highlighted. I have managed to open it and get text line by line:
doc = Document("/Users/an/PycharmProjects/projects/test.docx")
for para in doc.paragraphs:
print(para.text)
but I don't know how to check which color has each line. Maybe this library has some method for such task?

A run can have a .highlight_color, which is perhaps what you're looking for:
for p in doc.paragraphs:
for r in p.runs:
print(r.font.highlight_color)
There is no concept of "line" in Word. There are paragraphs, within which there are runs. A run is a sequence of characters that shares the same character formatting (aka. "font").
Note that breaks between runs occur arbitrarily in text and there's nothing stopping a "line" of highlighted text being broken into several runs. So you may need to check for adjacent runs with the same .highlight_color value to "assemble" those into the same sequence of text in the "apparent" highlighted passage.

How to perform a tag-agnostic text string search in an html file?

I'm using LanguageTool (LT) with the --xmlfilter option enabled to spell-check HTML files. This forces LanguageTool to strip all tags before running the spell check.
This also means that all reported character positions are off because LT doesn't "see" the tags.
For example, if I check the following HTML fragment:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
LanguageTool will treat it as a plain text sentence:
This is kind of a stupid question.
and returns the following message:
<error category="Grammar" categoryid="GRAMMAR" context=" This is kind of a stupid question. " contextoffset="24" errorlength="9" fromx="8" fromy="8" locqualityissuetype="grammar" msg="Don't include 'a' after a classification term. Use simply 'kind of'." offset="24" replacements="kind of" ruleId="KIND_OF_A" shortmsg="Grammatical problem" subId="1" tox="17" toy="8"/>
(In this particular example, LT has flagged "kind of a.")
Since the search string might be wrapped in tags and might occur multiple times I can't do a simple index search.
What would be the most efficient Python solution to reliably locate any given text string in an HTML file? (LT returns an approximate character position, which might be off by 10-30% depending on the number of tags, as well as the words before and after the flagged word(s).)
I.e. I'd need to do a search that ignores all tags, but includes them in the character position count.
In this particular example, I'd have to locate "kind of a" and find the location of the letter k in:
kin<b>d</b> o<i>f</i>a

This may not be the speediest way to go, but pyparsing will recognize HTML tags in most forms. The following code inverts the typical scan, creating a scanner that will match any single character, and then configuring the scanner to skip over HTML open and close tags, and also common HTML '&xxx;' entities. pyparsing's scanString method returns a generator that yields the matched tokens, the starting, and the ending location of each match, so it is easy to build a list that maps every character outside of a tag to its original location. From there, the rest is pretty much just ''.join and indexing into the list. See the comments in the code below:
test = "<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>"
from pyparsing import Word, printables, anyOpenTag, anyCloseTag, commonHTMLEntity
non_tag_text = Word(printables+' ', exact=1).leaveWhitespace()
non_tag_text.ignore(anyOpenTag | anyCloseTag | commonHTMLEntity)
# use scanString to get all characters outside of tags, and build list
# of (char,loc) tuples
char_locs = [(t[0], loc) for t,loc,endloc in non_tag_text.scanString(test)]
# imagine a world without HTML tags...
untagged = ''.join(ch for ch, loc in char_locs)
# look for our string in the untagged text, then index into the char,loc list
# to find the original location
search_str = 'kind of a'
orig_loc = char_locs[untagged.find(search_str)][1]
# print the test string, and mark where we found the matching text
print(test)
print(' '*orig_loc + '^')
"""
Should look like this:
<p>This is kin<b>d</b> o<i>f</i> a <b>stupid</b> question.</p>
^
"""

The --xmlfilter option is deprecated because of issues like this. The proper solution is to remove the tags yourself but keep the positions so you have a mapping to correct the results that come back from LT. When using LT from Java, this is supported by AnnotatedText, but the algorithm should be simple enough to port it. (full disclosure: I'm the maintainer of LT)

Can I change text in MS Word using python-docx, without losing characteristics?

I now have a English word document in MS Word and I want to change its texts into Chinese using python. I've been using Python 3.4 and installed python-docx. Here's my code:
from docx import Document
document = Document(*some MS Word file*)
# I only change the texts of the first two paragraphs
document.paragraphs[0].text = '带有消毒模式的地板清洁机'
document.paragraphs[1].text = '背景'
document.save(*save_file_path*)
The first two lines did turn into Chinese characters, but characteristics like font and bold are all gone:
Is there anyway I could alter text without losing the original characteristics?

It depends on how the characteristics are applied. There is a thing called the style hierarchy, and text characteristics can be applied anywhere from directly to a run of text, a style, or a document default, and levels in-between.
There are two main classes of characteristic: paragraph properties and run properties. Paragraph properties are things like justification, space before and after, etc. Everything having to do with character-level formatting, like size, typeface, color, subscript, italic, bold, etc. is a run property, also loosely known as a font.
So if you want to preserve the font of a run of text, you need to operate at the run level. An operation like this will preserve font formatting:
run.text = "New text"
An operation like this will preserve paragraph formatting, but remove any character level formatting not applied by the paragraph style:
paragraph.text = "New paragraph text"
You'll need to decide for your application whether you modify individual runs (which may be tricky to identify) or whether you work perhaps with distinct paragraphs and apply different styles to each. I recommend the latter. So in your example, "FLOOR CLEANING MACHINE ...", "BACKGROUND", and "[0001]..." would each become distinct paragraphs. In your screenshot they appear as separate runs in a single paragraph, separated by a line break.

You can get the style of the existing paragraphs and apply it to your new paragraphs - beware that the existing paragraphs might specify a font that does not support Chinese.

New lines/tabulators turn into spaces in generated document

I have problem with \n and \t tags. When I am opening a generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I just get the last two tabulators in section "Surname" and spaces instead of newlines/tabulators in other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)

In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Using python-docx to to read .docx, preserving special characters, bullets - python

Related

Replace string in paragraph while keeping style docx library

How to get to know paragraph color docx python?

How to perform a tag-agnostic text string search in an html file?

Can I change text in MS Word using python-docx, without losing characteristics?

New lines/tabulators turn into spaces in generated document

Categories

Resources