Using Python Docx to remove blank lines

I am using python-docx to remove blank lines from documents containing text and images. Calling paragraph.clear() and paragraph.run.clear() works up to a point, but the output file still has blank lines that show only a paragraph mark in Word. Is there a way to search directly for paragraph marks, or a better way to clear these lines?
# code snippet
for paragraphs in document.paragraphs:
    if paragraphs.text == "\n":
        paragraphs.clear()

Empty lines are not marked by "\n" but by the empty string "".
Also, clear() removes the text but not the paragraph itself.
Try testing len(paragraph.text) == 0 for each paragraph.

This removed all the empty lines in my document:
for paragraph in doc.paragraphs:
    if len(paragraph.text) == 0:
        p = paragraph._element
        p.getparent().remove(p)
        # Drop the proxy object's references to the removed XML element
        paragraph._p = paragraph._element = None

Testing len(paragraph.text) == 1 rather than len(paragraph.text) == 0 helps, because a newline also counts as a character.
I just wanted to copy every line except the blank ones to a new document, and this gave me that output.
When I used paragraph.text = paragraph.text.strip('\n'), the font style, bold, underline and italic formatting were removed, so checking the length of each paragraph and clearing that paragraph does the trick.
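The answers above can be combined into one helper. This is only a sketch, not a definitive implementation: the names is_blank and remove_blank_paragraphs are made up for illustration, it relies on python-docx's private _element attribute, and it assumes that inline pictures appear as w:drawing (or legacy w:pict) elements inside the paragraph. Stripping whitespace covers both the "" and the "\n" case, and the picture check keeps image-only paragraphs, which also have empty text.

```python
def is_blank(paragraph):
    """Return True when a paragraph has no visible text and no picture."""
    if paragraph.text.strip():
        return False
    # Assumption: pictures are stored as <w:drawing> or legacy <w:pict>.
    return not paragraph._element.xpath(".//w:drawing | .//w:pict")


def remove_blank_paragraphs(document):
    """Delete blank body paragraphs and return how many were removed."""
    removed = 0
    # Copy the list so removals cannot disturb the iteration.
    for paragraph in list(document.paragraphs):
        if is_blank(paragraph):
            element = paragraph._element
            element.getparent().remove(element)
            removed += 1
    return removed


# Usage, assuming python-docx is installed and "input.docx" exists:
#   from docx import Document
#   document = Document("input.docx")
#   remove_blank_paragraphs(document)
#   document.save("output.docx")
```

Note that document.paragraphs only covers top-level body paragraphs, so blank paragraphs inside table cells are left untouched by this sketch.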

Related

How can I add a new paragraph at the end of a word document?

I would like to add a last paragraph after everything in the word doc that I have. I tried using this code, but the text is appended before my last table.
How can I make sure the text is always appended at the very end?
from docx import Document
document = Document('Summary_output.docx')
paragraphs = document.paragraphs
#Store content of second paragraph
text = paragraphs[1].text
#Clear content
paragraphs[1]._p.clear()
#Recreate second paragraph
paragraphs[1].add_run('Appended part ' + text)
document.save("Summary_output.docx")

Short answer: use document.add_paragraph().
new_last_paragraph = document.add_paragraph("Appended part %s" % text)
It's important to understand the distinction in Word between paragraphs and runs. A paragraph is a "block" item (as is a table). A block item fits between the margins, is vertically entirely below the prior block item and entirely above the following block item. Intuitively, it is a full-width block in the stack of full-width blocks appearing in the "column" bounded on each side by the margins.
A run is an inline item, a sequence of characters that all share the same character formatting. A run always appears within a paragraph and in general a paragraph contains multiple runs. Using runs is how you make, for example, a single word bold or a phrase within a paragraph italic or red. Runs are flowed within the paragraph by line-wrapping.
So in your code, you were just extending an existing paragraph (by adding a run) rather than creating a new one, which explains why its position did not change.

How to get to know paragraph color docx python?

I have document where some lines are highlighted. I have managed to open it and get text line by line:
doc = Document("/Users/an/PycharmProjects/projects/test.docx")
for para in doc.paragraphs:
    print(para.text)
but I don't know how to check which color each line has. Does this library have a method for that?

A run can have a .highlight_color, which is perhaps what you're looking for:
for p in doc.paragraphs:
    for r in p.runs:
        print(r.font.highlight_color)
There is no concept of "line" in Word. There are paragraphs, within which there are runs. A run is a sequence of characters that shares the same character formatting (aka. "font").
Note that breaks between runs occur arbitrarily in text and there's nothing stopping a "line" of highlighted text being broken into several runs. So you may need to check for adjacent runs with the same .highlight_color value to "assemble" those into the same sequence of text in the "apparent" highlighted passage.
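One way to "assemble" adjacent runs, sketched here with a hypothetical helper named highlighted_passages: group consecutive runs by their .highlight_color and join the text of each group. It assumes only that each run exposes .text and .font.highlight_color, as in the loop above, and it skips stretches with no highlight.

```python
from itertools import groupby


def highlighted_passages(paragraph):
    """Yield (highlight_color, text) for each stretch of adjacent runs
    that share the same non-empty highlight color."""
    runs_by_color = groupby(paragraph.runs,
                            key=lambda run: run.font.highlight_color)
    for color, runs in runs_by_color:
        text = "".join(run.text for run in runs)
        if color is not None and text:
            yield color, text


# Usage, assuming doc is an opened python-docx Document:
#   for para in doc.paragraphs:
#       for color, text in highlighted_passages(para):
#           print(color, repr(text))
```

The grouping is done per paragraph, so a highlighted passage that spans two paragraphs comes back as two separate pieces.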

Skip figures while replacing text

I need to replace the text on the paragraphs while keeping the figures that are inline with text in place while also keeping the old text's style.
The input consists of a line of text with a picture in the middle, for example:
OLD TEXT (picture.jpg) OLD TEXT
The output should be, for example:
NEW TEXT (picture.jpg) NEW TEXT
So far I've gotten to:
for para in doc.paragraphs:
    for run in para.runs:
        if run.text != "":
            run.text = "NEW TEXT"
But it adds an extra NEW TEXT string for the picture.jpg object and doesn't keep the style (bold/italic, text indentation, font, etc.).

Identifying sections tabbed in from raw text

Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as in the page -- no HTML divisions or any other way to obviously find paragraphs/tabbed in sections.
Is there a way to automatically identify and remove sections that are tabbed in from the raw text?
One thing I notice is that when I encode the text as text = unicode(raw_text).encode("utf-8"), I can see a bunch of \n's for line breaks, but no \t's. (This might not be a useful direction, but it's just an idea.)
Edit: The following works
text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if "     " not in x]
final = " ".join(y)

Well, after looking at the page, the sections are "tabbed" in with spaces rather than the tab character, so looking for tabs would not be useful. It looks like each section is indented with 5 spaces.
raw_text = raw_text.replace('     ', '')
To replace all occurrences of 5 spaces...
from re import sub
...
raw_text = sub(r' {5}.*\n', '', raw_text)
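A slightly stricter variant, offered as a sketch: anchor the pattern to the start of each line so that five spaces in the middle of a sentence do not trigger a removal. The function name drop_indented_lines and the indent parameter are made up for illustration, and the default of 5 is taken from the observation above.

```python
import re


def drop_indented_lines(raw_text, indent=5):
    """Remove every line that starts with at least `indent` spaces."""
    # MULTILINE makes ^ match at the start of every line, not just the text.
    pattern = re.compile(r"^ {%d,}.*\n?" % indent, flags=re.MULTILINE)
    return pattern.sub("", raw_text)


sample = "Intro\n     quoted line\n     more quoted\nOutro\n"
print(drop_indented_lines(sample))
```

Lines indented by fewer than indent spaces are kept as they are.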

New lines/tabulators turn into spaces in generated document

I have a problem with \n and \t characters. When I open a generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I get only the last two tabs in the "Surname" section, and spaces instead of newlines and tabs in the other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)

In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)
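The space-after suggestion can be sketched the same way, without the empty spacer paragraph. The helper name add_labelled_line and its parameters are hypothetical. The only assumption about the document is that it behaves like a python-docx Document, whose paragraphs expose paragraph_format.space_after.

```python
def add_labelled_line(document, label, value, space_after=None):
    """Append one paragraph made of a bold label, two tabs and a value."""
    paragraph = document.add_paragraph()
    paragraph.add_run(label + "\t\t").bold = True
    paragraph.add_run(value)
    if space_after is not None:
        # space_after is a length such as Pt(12) from docx.shared
        paragraph.paragraph_format.space_after = space_after
    return paragraph


# Usage, assuming python-docx is installed:
#   from docx import Document
#   from docx.shared import Pt
#   document = Document()
#   add_labelled_line(document, "Name:", name, space_after=Pt(12))
#   add_labelled_line(document, "Surname:", surname)
```

With spacing set on the paragraph, the gap between the two lines no longer depends on an empty paragraph in between.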
