I am using Python docx to remove blank lines from documents containing text and images. Using the paragraph.clear() and paragraph.run.clear() works to a point, but the outputted file still has blank lines which only have a paragraph mark shown in Word. Is there a way of searching directly for paragraph marks? Or is there a better way of clearing the lines?
# code snippet
for paragraphs in document.paragraphs:
if paragraphs.text == "\n":
paragraphs.clear()
Empty lines are not marked by "\n" but by empty string "".
Plus, clear() removes text but not the paragraph itself.
Try to test len(paragraph.text)==0 for each paragraph.
This removed all the empty lines for me in my document file
for paragraph in doc.paragraphs:
if len(paragraph.text) == 0:
p = paragraph._element
p.getparent().remove(p)
p._p = p._element = None
Using len(paragraph.text)==1 helps as opposed to using len(paragraph.text)==0 as new line is also a character.
I just wanted to copy the lines except the blank lines to a new document so it gave me the output.
When I used paragraph.text=paragraph.strip('\n') the font style,bold,underlined and italic were removed.So checking the length of each paragraph and clearing that paragraph does the trick.
Related
I would like to add a last paragraph after everything in the word doc that I have. I tried using this code, but the text is appended before my last table.
How can I make sure the text is always appended at the very end?
from docx import Document
document = Document('Summary_output.docx')
paragraphs = document.paragraphs
#Store content of second paragraph
text = paragraphs[1].text
#Clear content
paragraphs[1]._p.clear()
#Recreate second paragraph
paragraphs[1].add_run('Appended part ' + text)
document.save("Summary_output.docx")
Short answer: use document.add_paragraph().
new_last_paragraph = document.add_paragraph("Appended part %s" % text)
It's important to understand the distinction in Word between paragraphs and runs. A paragraph is a "block" item (as is a table). A block item fits between the margins, is vertically entirely below the prior block item and entirely above the following block item. Intuitively, it is a full-width block in the stack of full-width blocks appearing in the "column" bounded on each side by the margins.
A run is an inline item, a sequence of characters that all share the same character formatting. A run always appears within a paragraph and in general a paragraph contains multiple runs. Using runs is how you make, for example, a single word bold or a phrase within a paragraph italic or red. Runs are flowed within the paragraph by line-wrapping.
So in your code, you were just extending an existing paragraph (by adding a run) rather than creating a new one, which explains why its position did not change.
I have document where some lines are highlighted. I have managed to open it and get text line by line:
doc = Document("/Users/an/PycharmProjects/projects/test.docx")
for para in doc.paragraphs:
print(para.text)
but I don't know how to check which color has each line. Maybe this library has some method for such task?
A run can have a .highlight_color, which is perhaps what you're looking for:
for p in doc.paragraphs:
for r in p.runs:
print(r.font.highlight_color)
There is no concept of "line" in Word. There are paragraphs, within which there are runs. A run is a sequence of characters that shares the same character formatting (aka. "font").
Note that breaks between runs occur arbitrarily in text and there's nothing stopping a "line" of highlighted text being broken into several runs. So you may need to check for adjacent runs with the same .highlight_color value to "assemble" those into the same sequence of text in the "apparent" highlighted passage.
I need to replace the text on the paragraphs while keeping the figures that are inline with text in place while also keeping the old text's style.
The input consists of a line of text with a picture in the middle, for example:
OLD TEXT (picture.jpg) OLD TEXT
The output should be, for example:
NEW TEXT (picture.jpg) NEW TEXT
So far I've gotten to:
for para in doc.paragraphs:
for run in para.runs:
if (run.text != ""):
run.text = "NEW TEXT"
But it adds an extra NEW TEXT string for the picture.jpg object and doesn't keep the style (bold/italic, text identation, font, etc.).
Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as in the page -- no HTML divisions or any other way to obviously find paragraphs/tabbed in sections.
Is there a way to automatically identify and remove sections that are tabbed in from the raw text?
One thing I notice is that when I encode the text as text = unicode(raw_text).encode("utf-8"), I can then see a bunch of \n's for line skips. But no \t's. (This might be not a useful direction to think, but just an idea).
Edit: The following works
text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if " " not in x]
final = " ".join(y)
Well, after looking at the page, they are 'tabbed' in with spaces rather than the tab character; looking for tabs would not be useful. It looks like the section is tabbed in with 5 spaces.
raw_text.replace(' ','')
To replace all occurances of 5 spaces...
from re import sub
...
raw_text = sub(r' .*\n', '', raw_text)
I have problem with \n and \t tags. When I am opening a generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I just get the last two tabulators in section "Surname" and spaces instead of newlines/tabulators in other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)
In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)