New lines/tabulators turn into spaces in generated document

New lines/tabulators turn into spaces in generated document - python

I have problem with \n and \t tags. When I am opening a generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I just get the last two tabulators in section "Surname" and spaces instead of newlines/tabulators in other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)

In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)

Related

Replace string in paragraph while keeping style docx library

I am replacing the strings in tables and paragraphs of word document. However the styles change. How can I keep original style format?
with open(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges\VariableNames.json") as p:
data = json.load(p)
document = Document(r"C:\Users\y.Israfilbayov\Desktop\testfiles\test_namedranges_update\F10352-JB117-FMXXX Pile XXXX As-built Memo GAIA Auto trial_v6.docx")
for key, value in data.items():
for paragraph in document.paragraphs:
if key in paragraph.text:
paragraph.text = paragraph.text.replace(str(key), str(value))
for key, value in data.items():
for table in document.tables:
for row in table.rows:
for cell in row.cells:
for paragraph in cell.paragraphs:
if key in paragraph.text:
paragraph.text = paragraph.text.replace(str(key),str(value))
There was a similar post, however it did not help me (maybe I did something wrong).

This should meet your needs. Requires docx2python 2.0.0+
from docx2python.utilities import replace_docx_text
replace_docx_text(
input_filename,
output_filename,
("Apples", "Bananas"), # replace Apples with Bananas
("Pears", "Apples"), # replace Pears with Apples
("Bananas", "Pears"), # replace Bananas with Pears
html=True,
)
You may have a problem if your replacement strings include tabs or symbols, but "regular" text replacement will work and preserve most[1] formatting.
To allow this, docx2python will not replace text strings where formatting changes, e.g., "part of this string is bold", unless you specify html=False, in which case strings will be replaced regardless of format, and some formatting will be lost.
[1] The following will be preserved:
italic
bold
underline
strike
superscript
subscript
small caps
all caps
highlighted
font size
colored text
(some others, but not guaranteed)
Edit for follow-up question, how do I replace marker text in tables?
My workflow for doing this is to keep all formatting in Word. That is, I create a template in Word, slice out the context I need, then put everything back together like a puzzle.
This github "project" is an example (one file) of how I replace text in tables (where the tables can be any size).
https://github.com/ShayHill/replace_docx_tables

#property
def text(self):
"""
String formed by concatenating the text of each run in the paragraph.
Tabs and line breaks in the XML are mapped to ``\\t`` and ``\\n``
characters respectively.
Assigning text to this property causes all existing paragraph content
to be replaced with a single run containing the assigned text.
A ``\\t`` character in the text is mapped to a ``<w:tab/>`` element
and each ``\\n`` or ``\\r`` character is mapped to a line break.
Paragraph-level formatting, such as style, is preserved. All
run-level formatting, such as bold or italic, is removed.
"""
text = ''
for run in self.runs:
text += run.text
return text
From the documentation it looks like the Styles should stay the same however; bold/italic formatting can be removed.
If this is the formatting you are trying to preserve, you may need to identify what run the key is in first then modify it.

In the docx library documentation located at
https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects, it states the following regarding assigning a value to paragraph.text :
"Assigning text to this property causes all existing paragraph content to be replaced with a single run containing the assigned text. ... Paragraph-level formatting, such as style, is preserved. All run-level formatting, such as bold or italic, is removed. "
Are the changes in style you are observing consistent with that?
If so, then perhaps you are loosing the "run" objects with their specific styling that are children of the paragraph object. In that case, you might be better of adding another level to your loop to iterate through all the paragraph.runs and replace the text on those individually.
For example, once you have the paragraph, then
for run in paragraph.runs:
if key in run.text:
run.text = run.text.replace(str(key), str(value))

How do I retrieve the text of a webpage without sentences being broken by newlines?

I would like to retrieve the text from a webpage - my preferred language is Python - so that sentences are not broken mid-sentence by newlines, like this:
and then the community
decided to invest in
public parks for the
benefit of the citizens.
I have tried dumping web pages from lynx and w3m, but it breaks the sentences into lines.
I just tried using Beautiful Soup's .get_text() method, which should pull out unbroken strings from elements containing text, such as <p> tags, but to my surprise, it still breaks sentences into newlines. Maybe this has something to do with newlines already being there in the HTML, or the text having tags like links embedded, breaking the flow of text. (I tried it on this webpage.)
I can open the webpage in a browser, select all and copy and paste into a text file, and this preserves the sentences as single lines, but this is not a programmatic solution.
I will try to show GPT-3 an example of the way I would like it to join broken sentences but not lines of code and see if it can replicate the example, but this is fixing the extraction afterwards, rather than before.
What would be a simple, programmatic way to obtain unbroken sentences from the source HTML?
I would really appreciate anybody who can help me with this.

Here ya go:
def cleanMe(html):
soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
for script in soup(["script", "style"]): # remove all javascript and stylesheet code
script.extract()
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
return text
And you can optionally add in another line that removes all \n \r and \t following the same join pattern. For example:
text = text.replace("\n", " ")
can be added right before the return text line in the function. Adding that line for me causes the text in your example to render as: and then the community decided to invest in public parks for the benefit of the citizens. which may also be a use case you want.

How can I add a new paragraph at the end of a word document?

I would like to add a last paragraph after everything in the word doc that I have. I tried using this code, but the text is appended before my last table.
How can I make sure the text is always appended at the very end?
from docx import Document
document = Document('Summary_output.docx')
paragraphs = document.paragraphs
#Store content of second paragraph
text = paragraphs[1].text
#Clear content
paragraphs[1]._p.clear()
#Recreate second paragraph
paragraphs[1].add_run('Appended part ' + text)
document.save("Summary_output.docx")

Short answer: use document.add_paragraph().
new_last_paragraph = document.add_paragraph("Appended part %s" % text)
It's important to understand the distinction in Word between paragraphs and runs. A paragraph is a "block" item (as is a table). A block item fits between the margins, is vertically entirely below the prior block item and entirely above the following block item. Intuitively, it is a full-width block in the stack of full-width blocks appearing in the "column" bounded on each side by the margins.
A run is an inline item, a sequence of characters that all share the same character formatting. A run always appears within a paragraph and in general a paragraph contains multiple runs. Using runs is how you make, for example, a single word bold or a phrase within a paragraph italic or red. Runs are flowed within the paragraph by line-wrapping.
So in your code, you were just extending an existing paragraph (by adding a run) rather than creating a new one, which explains why its position did not change.

Join runs from python-docx for the purpose of applying regex to group of runs

I am using Python-Docx to read through docx files, find a particular string (e.g. a date), and replace it with another string (e.g. a new date).
Here are the two functions I am using:
def docx_replace_regex(doc_obj, regex , replace):
for p in doc_obj.paragraphs:
if regex.search(p.text):
inline = p.runs
# Loop added to work with runs (strings with same style)
for i in range(len(inline)):
if regex.search(inline[i].text):
text = regex.sub(replace, inline[i].text)
inline[i].text = text
for table in doc_obj.tables:
for row in table.rows:
for cell in row.cells:
docx_replace_regex(cell, regex , replace)
def replace_date(folder,replaceDate,*date):
docs = [y for x in os.walk(folder) for y in glob(os.path.join(x[0], '*.docx'))]
for doc in docs:
if date: #Date is optional date to replace
regex = re.compile(r+date)
else: #If no date provided, replace all dates
regex = re.compile(r"(\w{3,12}\s\d{1,2}\,?\s?[0-9]{4})|((the\s)?\d{1,2}[th]{0,2}\sday\sof\s\w{3,12}\,\s?\d{4})")
docObj = Document(doc)
docx_replace_regex(docObj,regex,replaceDate)
docObj.save(doc)
The first function is essentially a find and replace function to use python with a docx file. The second file recursively searches through a file path to find docx files to search. The details of the regex aren't relevant (I think). It essentially searches for different date formats. It works as I want it to and shouldn't impact on my issue.
When a document is passed to docx_replace_regex that function iterates through paragraphs, then runs and searches the runs for my regex. The issue is that the runs sometimes break up a single line of text so that if the doc were in plaintext the regex would capture the text, but because the runs break up the text, the text isn't captured.
For example, if my paragraph is "10th day of May, 2020", the inline array may be ['1','0th day of May,',' 2020'].
Initially, I joined the inline array so that it would be equal to "10th day of May, 2020" but then I can't replace the run with the new text because my inline variable is a string, not a run object. Even if I kept inline as a run object it would still replace only one part of the text I'm looking for.
Looking for any ideas on how to properly replace the portion of text captured by my regex. Alternatively, why the sentence is being broken up into separate runs as it is.

This is not a simple problem, as it looks like you're starting to realize :)
The simplest possible approach is to search and replace in paragraph.text, like:
paragraph.text = my_replace_function(paragraph.text, ...)
This works, but all character formatting is lost. A more sophisticated approach finds the offset of the search phrase, maps that to runs, and then splits and rejoins runs as necessary to change only those runs containing the search phrase.
It looks like there's a working solution here: https://stackoverflow.com/a/55733040/1902513, which shows by its length just how much is involved.
It's come up quite a few times before, so if you search here in SO on [python-docx] replace you'll find more on the nature of the problem.

Identifying sections tabbed in from raw text

Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as in the page -- no HTML divisions or any other way to obviously find paragraphs/tabbed in sections.
Is there a way to automatically identify and remove sections that are tabbed in from the raw text?
One thing I notice is that when I encode the text as text = unicode(raw_text).encode("utf-8"), I can then see a bunch of \n's for line skips. But no \t's. (This might be not a useful direction to think, but just an idea).
Edit: The following works
text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if " " not in x]
final = " ".join(y)

Well, after looking at the page, they are 'tabbed' in with spaces rather than the tab character; looking for tabs would not be useful. It looks like the section is tabbed in with 5 spaces.
raw_text.replace(' ','')
To replace all occurances of 5 spaces...
from re import sub
...
raw_text = sub(r' .*\n', '', raw_text)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

New lines/tabulators turn into spaces in generated document - python

Related

Replace string in paragraph while keeping style docx library

How do I retrieve the text of a webpage without sentences being broken by newlines?

How can I add a new paragraph at the end of a word document?

Join runs from python-docx for the purpose of applying regex to group of runs

Identifying sections tabbed in from raw text

Categories

Resources