Why does python-docx read only the first paragraph of a docx?

Good day!
I'm trying to read the text of a docx file using python-docx. The XML of the problematic part looks like this:
<w:t xml:space="preserve">Some text
<w:br />
more text
<w:br />
<w:br />
a bit more text
</w:t>
When I read it with python-docx, the "more text" and "a bit more text" parts are simply missing. I assume this happens because the lines are not each wrapped in their own <w:t> and </w:t>; instead, a single <w:t> element also contains the line breaks.
This is the code I'm using:
import docx

doc = docx.Document('File.docx')
for para in doc.paragraphs:
    print(para.text)
Output:
Some text
Wanted output:
Some text
more text
a bit more text
Does anybody know how to make this work using python-docx or some other Python library?
Thank you.
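A possible workaround, sketched here with the standard library only rather than python-docx: in ElementTree-style parsers, text that follows a nested <w:br/> inside a <w:t> is stored as that child's tail, so code that reads only the element's own text stops at "Some text", which would explain the output above. The helper names t_text, paragraph_text and docx_paragraph_texts are hypothetical, and the sketch assumes the document body is stored in word/document.xml inside the .docx archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by <w:p>, <w:r>, <w:t>, <w:br>
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def t_text(t_elem):
    """Text of one <w:t>, including text stored after nested <w:br/> children."""
    parts = [t_elem.text or ""]
    for child in t_elem:
        if child.tag == W + "br":
            parts.append("\n")
        # Text following a child element lives in that child's tail
        parts.append(child.tail or "")
    return "".join(parts)


def _collect(elem, parts):
    """Walk the children of elem in document order and gather text pieces."""
    for child in elem:
        if child.tag == W + "t":
            parts.append(t_text(child))
        elif child.tag == W + "br":
            parts.append("\n")
        else:
            _collect(child, parts)


def paragraph_text(p_elem):
    """All text under one <w:p>, with every <w:br/> turned into a newline."""
    parts = []
    _collect(p_elem, parts)
    return "".join(parts)


def docx_paragraph_texts(path):
    """Return one string per <w:p> found in word/document.xml."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return [paragraph_text(p) for p in root.iter(W + "p")]
```

Calling docx_paragraph_texts('File.docx') would then return a list of strings, one per paragraph, where an empty line appears wherever two <w:br/> elements follow each other.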

Related

Python read XML from standard input

I'm trying to read XML input from the command line in Python 3. I have tried various methods so far; the following is my code for reading the XML:
import sys
import xml.dom.minidom
try:
    # Python 3: read raw bytes from standard input
    input = sys.stdin.buffer
except AttributeError:
    # Python 2: sys.stdin has no .buffer attribute
    input = sys.stdin
xmlString = input.read()
But this keeps waiting for more input. Can someone please tell me how to stop reading once the whole XML file has been entered?
My XML file is:
<response>
<article>
<title>A Novel Approach to Image Classification, in a Cloud Computing Environment stability.</title>
<publicationtitle>IEEE Transactions on Cloud Computing</publicationtitle>
<abstract>Classification of items within PDF documents has always been challenging. This stability document will discuss a simple classification algorithm for indexing images within a PDF.</abstract>
</article>
<body>
<sec>
<label>I.</label>
<p>Should Haven't That is a bunch of text pattern these classification and cyrptography. These paragraphs are nothing but nonsense. What is the statbility of your program to find neural nets. Throw in some numbers to see if you get the word count correct this is a classification this in my nd and rd words. What the heck throw in cryptography.</p>
<p>I bet diseases you can't find probability twice. Here it is a again probability. Just to fool you I added it three times probability. Does this make any pattern classification? pattern classification! pattern classification.</p>
<p>
<fig>
<label>FIGURE.</label>
<caption>This is a figure representing convolutional neural nets.</caption>
</fig>
</p>
</sec>
</body>
</response>
Since this spans a number of lines, I can't enter it the conventional way using input().

input() reads a single line from the console, so it only works when the XML fits on one line:
import xml.dom.minidom
xmlString = input()
For more details on sys.stdin, take a look at this SO post.
Edit: If you want to read multiple lines from the console, use sys.stdin.read(), which returns everything up to end-of-file as one string (sys.stdin.readlines() returns a list of lines instead). The user terminates multi-line input with CTRL+D on Linux and macOS, or CTRL+Z followed by Enter on Windows. Alternatively, you can have the user write the XML to a file and parse that file (easier, but maybe not desirable).
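As a rough sketch of the read-until-end-of-file approach: the stream is read in one call and handed to xml.dom.minidom.parseString. The function names parse_xml_stream and article_title are hypothetical, and the in-memory sample stands in for real console input.

```python
import io
import sys
import xml.dom.minidom


def parse_xml_stream(stream):
    """Read the stream until end-of-file and parse the result as XML."""
    xml_string = stream.read()
    return xml.dom.minidom.parseString(xml_string)


def article_title(doc):
    """Return the text of the first <title> element, or None if there is none."""
    titles = doc.getElementsByTagName("title")
    if not titles or titles[0].firstChild is None:
        return None
    return titles[0].firstChild.data


# Demo with an in-memory stream; for console input, pass sys.stdin instead:
#     doc = parse_xml_stream(sys.stdin)
sample = io.StringIO(
    "<response>\n<article>\n<title>Demo title</title>\n</article>\n</response>"
)
doc = parse_xml_stream(sample)
print(article_title(doc))
```

When the script reads from sys.stdin, redirecting a file (python script.py < input.xml) supplies the end-of-file automatically, so no key combination is needed.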

get only xml data from text file using python

I have a text file where I have some XML data and some HTML data. Both start with "<". Now I want to extract only XML data and save it in another file. How can I do it?
File example:
xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data
Note: This file is in .txt format.

I would treat your whole input not as XML, but as an HTML fragment. HTML can contain non-standard elements, so <note> etc. is fine.
For convenience I suggest pyquery to deal with HTML. It works pretty much the same way as jQuery, so if you've worked with that before, it should be familiar.
It's pretty straightforward. Load your data, wrap it in "<html></html>", parse it, query it.
from pyquery import PyQuery as pq
data = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data"""
doc = pq(f"<html><body>{data}</body></html>")
note = doc.find("note")
print(note.find("body").text())
which prints "Don't forget me this weekend!".

PyQt5 How to save textEdit text as rich text

Hi, I'm coding a rich text editor and I want to save the text of my textEdit field as a rich text file. Saving works, but when I write text with different font colors, sizes, and bold formatting and save it as an .rtf file, all the formatting is gone. (I currently write toPlainText(), so I need a different method.)
How can I save my text with its formatting (fonts, sizes, colors)?
def savefl(self):
    try:
        # getSaveFileName returns a (file name, selected filter) tuple
        filey = QtWidgets.QFileDialog.getSaveFileName(self, "Save", "", "Rich Text File (*.rtf);;Text File(*.txt);;All Files (*.*)")
        with open(filey[0], "w", encoding="utf-8") as file2:
            # toPlainText() discards all formatting
            file2.write(self.textEdit.toPlainText())
    except (FileNotFoundError, FileExistsError):
        pass

Rich text and the Rich Text Format (RTF) are not necessarily the same thing. Microsoft Word documents (.doc), Markdown (.md), and LibreOffice documents (.odt) are all rich text file formats.
So is HTML, which is how Qt lets you get the rich text, using the toHtml method. There's no way to get RTF out of Qt; you'll have to convert the HTML to RTF.
If HTML can suffice for you, use it. As has been written before, RTF is an ancient format and its age is showing more and more. If you absolutely need RTF, you'll need to do a conversion. I'd recommend pandoc if you can call an external program; if not, you'll have to use a library like PyRTF and manually parse the HTML and create a document with PyRTF.
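If HTML is acceptable, the save routine could look roughly like the sketch below. It only relies on the text edit offering toHtml() and toPlainText(), as QTextEdit does; the helper names editor_contents and save_editor and the idea of picking the format by file extension are assumptions for illustration, not part of Qt.

```python
import os

# Extensions for which the formatted (HTML) representation is written
HTML_EXTENSIONS = {".html", ".htm"}


def editor_contents(text_edit, path):
    """Pick the representation to write based on the target file extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext in HTML_EXTENSIONS:
        return text_edit.toHtml()      # keeps fonts, sizes, colors
    return text_edit.toPlainText()     # formatting is dropped


def save_editor(text_edit, path):
    """Write the editor contents to path; return False if no path was chosen."""
    if not path:  # the file dialog returns an empty name when cancelled
        return False
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(editor_contents(text_edit, path))
    return True


# Inside the editor class this might be used as (hypothetical filter string):
#     path, _ = QtWidgets.QFileDialog.getSaveFileName(
#         self, "Save", "", "HTML File (*.html);;Text File (*.txt)")
#     save_editor(self.textEdit, path)
```

The resulting .html file can then be passed to an external converter such as pandoc if RTF is really required.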

Html rich text to Microsoft Word

Right now I have this format:
"This is a bold word, this is in italic, this is regular"
Which translates to:
<p>This is a bold <strong>word</strong>, <em>this is in italic</em>, this is regular.</p>
Is there a Python library that turns the markup above into Microsoft Word format? I couldn't find one. I only found pandoc and its Python wrapper pypandoc, which read HTML but can only produce .docx output by saving a .docx file, and that isn't helpful here.
I wanted to ask here before putting work into writing a parser to do this.
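One way such a parser might look, as a minimal sketch: the standard library's HTMLParser splits the fragment into (text, bold, italic) runs, and python-docx (assumed to be installed) turns those runs into a document held in memory. The names RunCollector, html_to_runs and runs_to_docx_bytes are hypothetical, and only <strong>/<b> and <em>/<i> are handled.

```python
import io
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Collect (text, bold, italic) tuples from a simple HTML fragment."""

    def __init__(self):
        super().__init__()
        self.bold = 0      # nesting depth of <strong>/<b>
        self.italic = 0    # nesting depth of <em>/<i>
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self.bold += 1
        elif tag in ("em", "i"):
            self.italic += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.bold = max(0, self.bold - 1)
        elif tag in ("em", "i"):
            self.italic = max(0, self.italic - 1)

    def handle_data(self, data):
        if data:
            self.runs.append((data, self.bold > 0, self.italic > 0))


def html_to_runs(html):
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.runs


def runs_to_docx_bytes(runs):
    """Build a one-paragraph .docx in memory; requires python-docx."""
    import docx  # imported lazily so the parser above works without it

    document = docx.Document()
    paragraph = document.add_paragraph()
    for text, bold, italic in runs:
        run = paragraph.add_run(text)
        # None leaves the property inherited from the paragraph style
        run.bold = True if bold else None
        run.italic = True if italic else None
    buffer = io.BytesIO()
    document.save(buffer)  # save() also accepts a file-like object
    return buffer.getvalue()
```

Because the document is saved to a BytesIO buffer, no .docx file has to be written to disk; the returned bytes can be sent or stored wherever they are needed.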

Docx content and formatting extraction in python

I am trying to parse a docx file and extract specific elements based on whether or not a certain word is bold. If this is the text in the document:
Foo: Hello
Boo:
Blah Blah
•Blah
•Blah
Choo: Hello
I want to scan line by line and take all the text after a bold word until the next bold word.
Right now I am using an XML parser that splits on newline characters. I cannot find anything in the zip file or in the individual lines that would give me metadata like that.
Is it possible to do this?

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.
One library that looks up to the task is python-docx.
If you're using Jython, Apache POI HWPF is another option.
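With python-docx, each paragraph exposes its runs, and each run has a text and a bold attribute. The sketch below groups the text that follows each bold label; the function names split_on_bold and iter_docx_runs are hypothetical, and the sketch assumes bold is set directly on the run rather than inherited from a style (in which case run.bold is None).

```python
def split_on_bold(runs):
    """Group (text, is_bold) pairs into (bold label, following text) tuples."""
    sections = []       # list of [label, body]
    prev_bold = False
    for text, is_bold in runs:
        if is_bold:
            if prev_bold and sections:
                # Word often splits one bold label across several runs
                sections[-1][0] += text
            else:
                sections.append([text, ""])
        elif sections:
            sections[-1][1] += text
        # Non-bold text before the first bold label is ignored
        prev_bold = is_bold
    return [(label.strip(), body.strip()) for label, body in sections]


def iter_docx_runs(path):
    """Yield (text, is_bold) for every run in the document; requires python-docx."""
    import docx  # imported lazily so split_on_bold works without it

    for paragraph in docx.Document(path).paragraphs:
        for run in paragraph.runs:
            yield run.text, bool(run.bold)
        yield "\n", False  # keep paragraph boundaries as newlines


# Possible usage with a real file (the file name is a placeholder):
#     for label, body in split_on_bold(iter_docx_runs("File.docx")):
#         print(label, "->", body)
```

For the example document this would be expected to give one entry each for Foo:, Boo: and Choo:, provided those labels are the only bold runs.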
