Html rich text to Microsoft Word

Right now I have this format:
"This is a bold word, this is in italic, this is regular"
which translates to:
<p>This is a bold <strong>word</strong>, <em>this is in italic</em>, this is regular.</p>
Is there a Python library that turns the above markup into Microsoft Word format? I couldn't find one. I only found pandoc and its wrapper pypandoc, which read HTML but can only produce .docx output by saving a .docx file, and that isn't helpful.
I figured I'd ask here before putting work into writing a parser to do this.
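
If a small parser does turn out to be necessary, a minimal sketch could look like the following. It assumes a single paragraph that only uses <strong>/<b> and <em>/<i>, and it assumes the third-party python-docx package for building the Word document in memory. The names RunCollector, html_to_runs and runs_to_docx_bytes are made up for illustration.

```python
import io
from html.parser import HTMLParser


class RunCollector(HTMLParser):
    """Collect (text, bold, italic) tuples from simple inline HTML."""

    def __init__(self):
        super().__init__()
        self.runs = []
        self._bold = 0    # nesting depth of <strong>/<b>
        self._italic = 0  # nesting depth of <em>/<i>

    def handle_starttag(self, tag, attrs):
        if tag in ("strong", "b"):
            self._bold += 1
        elif tag in ("em", "i"):
            self._italic += 1

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self._bold = max(0, self._bold - 1)
        elif tag in ("em", "i"):
            self._italic = max(0, self._italic - 1)

    def handle_data(self, data):
        if data:
            self.runs.append((data, self._bold > 0, self._italic > 0))


def html_to_runs(html):
    """Parse the HTML and return the list of formatted text runs."""
    parser = RunCollector()
    parser.feed(html)
    parser.close()
    return parser.runs


def runs_to_docx_bytes(runs):
    """Build a one-paragraph .docx in memory (assumes python-docx is installed)."""
    from docx import Document  # third-party: pip install python-docx

    document = Document()
    paragraph = document.add_paragraph()
    for text, bold, italic in runs:
        run = paragraph.add_run(text)
        run.bold = True if bold else None      # None keeps the style default
        run.italic = True if italic else None
    buffer = io.BytesIO()
    document.save(buffer)  # save() accepts a file-like object, so no file on disk
    return buffer.getvalue()
```

Calling html_to_runs() on the paragraph above gives five runs, of which only "word" is bold and only "this is in italic" is italic. runs_to_docx_bytes() then returns the document as bytes without writing a file.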

Related

Why python-docx reads only first paragraph of a docx?

Good day!
I'm currently trying to read text from a docx file using python-docx. The XML of the problematic part looks like this:
<w:t xml:space="preserve">Some text
<w:br />
more text
<w:br />
<w:br />
a bit more text
</w:t>
So when reading it with python-docx, it simply does not see the "more text" and "a bit more text" parts. I assume this happens because each line is not wrapped in its own <w:t> and </w:t>; instead, the whole run also includes the line breaks.
Here is my code:
import docx

doc = docx.Document('File.docx')
for para in doc.paragraphs:
    print(para.text)
Output:
Some text
Wanted output:
Some text
more text
a bit more text
Does anybody know how to make this work using python-docx or some other Python library?
Thank you.
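
One way to look at this without python-docx is to read word/document.xml straight from the .docx archive with the standard library and treat every <w:br/> as a line separator. The sketch below does that. It assumes the structure shown in the question (breaks nested inside <w:t>), and it strips whitespace and drops empty lines to match the wanted output, which may not suit every document. The function names are illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def text_with_breaks(element):
    """Return the text under `element`, one line per <w:br/>-separated fragment."""
    parts = []
    for node in element.iter():  # document order
        if node.tag == W + "t" and node.text:
            parts.append(node.text)
        elif node.tag == W + "br":
            parts.append("\n")
            # Text that follows a <w:br/> nested inside <w:t> is stored as its tail.
            if node.tail:
                parts.append(node.tail)
    lines = [line.strip() for line in "".join(parts).split("\n")]
    return "\n".join(line for line in lines if line)


def docx_paragraph_texts(path):
    """Yield the text of every paragraph in the .docx file at `path`."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for paragraph in root.iter(W + "p"):
        yield text_with_breaks(paragraph)
```

With the file from the question, the loop would become: for text in docx_paragraph_texts('File.docx'): print(text).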

PyQt5 How to save textEdit text as rich text

Hi, I'm coding a rich text editor and I want to save the text of my textEdit field as a rich text file. Saving itself works. However, when I write rich text with different font colors, sizes, and bold, and then save it as an .rtf file, all the formatting is gone. (I write toPlainText(); I probably need a different method.)
How can I save my text with its formatting (fonts, sizes, colors)?
def savefl(self):
    try:
        filey = QtWidgets.QFileDialog.getSaveFileName(self, "Save", "", "Rich Text File (*.rtf);;Text File(*.txt);;All Files (*.*)")
        with open(filey[0], "w", encoding="utf-8") as file2:
            file2.write(self.textEdit.toPlainText())
    except (FileNotFoundError, FileExistsError):
        pass

Rich text and the Rich Text Format, RTF, are not necessarily the same thing. Microsoft Word documents (.doc), Markdown (.md), and LibreOffice documents (.odt) are all rich text file formats.
So is HTML, which is how Qt lets you get the rich text, via the toHtml method. Qt cannot export RTF directly; you'll have to convert the HTML to RTF.
If HTML can suffice for you, use it. As has been written before, RTF is an ancient format and its age is showing more and more. If you absolutely need RTF, you'll need to do a conversion. I'd recommend pandoc if you can call an external program; if not, you'll have to use a library like PyRTF and manually parse the HTML and create a document with PyRTF.
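
As a rough sketch of the pandoc route, assuming the pandoc executable is installed and on the PATH: the helper below pipes the HTML string into pandoc and lets it write a standalone RTF file. The function names are made up for illustration. In the question's savefl method, the HTML would come from self.textEdit.toHtml().

```python
import subprocess


def pandoc_rtf_command(rtf_path):
    """Build the pandoc command line that reads HTML from stdin and writes RTF."""
    return [
        "pandoc",
        "--from", "html",
        "--to", "rtf",
        "--standalone",        # emit a complete RTF document, not just a fragment
        "--output", rtf_path,
    ]


def save_html_as_rtf(html, rtf_path):
    """Convert an HTML string to an RTF file by calling the external pandoc program."""
    subprocess.run(
        pandoc_rtf_command(rtf_path),
        input=html.encode("utf-8"),  # pandoc reads stdin when no input file is given
        check=True,                  # raise CalledProcessError if pandoc fails
    )
```

If plain HTML is good enough, the conversion can be skipped entirely by writing the toHtml() result to an .html file.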

How can I extract HTML embedded in RTF using Python?

I'm trying to extract the HTML email bodies from Outlook msg files. I've successfully converted them to eml/standard RFC 822 files using email-outlook-message-perl, but the bodies of the emails are HTML wrapped in RTF. Here's an example snippet:
{\*\htmltag96 <div class="EduText" style="padding:2px;border-width:1px;background-color:#DEE5ED;border-color:##FAFAFA;border-style:solid;">}\htmlrtf {\htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf{\f4\fs24\htmlrtf0 \'cd\'d5\'e0\'c1\'c5\'b9\'d5\'e9\'ca\'e8\'a7\'e4\'bb\'b7\'d5\'e8 john.smith\htmlrtf\f0}\htmlrtf0
{\*\htmltag116 <br>}\htmlrtf \line
\htmlrtf0
Is there a way to get the HTML content, without all of the RTF crud?

This thread is a few years old, but this might help someone who is new to TNEF and in a similar situation.
If you are a Linux user, you can extract the HTML content from the RTF file using the command-line tool unrtf:
unrtf message.rtf
This prints the HTML content to standard output.
If you want to redirect it into a file, you could try:
unrtf message.rtf > message.html
Hope this helps...
-Suresh

Microsoft uses TNEF (Transport Neutral Encapsulation Format), so I think you need to look for a Python TNEF implementation, such as:
tnefparse

Get text from Gtk3 TextView/TextBuffer with formatting tags included in Python

I'm working on a Python 3 project that uses the Gtk3 TextView/TextBuffer to get a user's input, and I've got it working to where I can have the user typing in rich text and able to format it as Bold/Italic/Underline/Combination of these.
However, I'm stuck on trying to figure out how to get the text from the TextBuffer with those flags included so I can use the formatting flags to convert the text to properly formatted HTML when I need to.
Calling textbuffer.get_text(start, end, True) simply returns the text without any flags.
Here's the code and the editor.glade file. Save them both in the same directory.
How can I get the text with the flags included? Or, alternatively, is there a way to get the user's input formatted as HTML automatically in another variable?

That's not very easy. Here is a link to some code that I once wrote to do the same thing for RTF output. You can probably adapt it to produce HTML output. If you manage to do so, I'd possibly integrate it into that library's successor.
Alternatively, if you prefer text processing to the above, you can export the rich text in GtkTextBuffer's internal serialization format and convert it to HTML yourself later:
format = textbuffer.register_serialize_tagset('my-tagset')
exported = textbuffer.serialize(textbuffer, format, start, end)
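
A third possibility is to walk the buffer from one tag toggle to the next and emit HTML directly. The sketch below is untested against the editor in the question. It assumes the tags were created with the names "bold", "italic" and "underline" (hypothetical; the real names live in your code or the .glade file), and the two function names are illustrative.

```python
from html import escape

# Hypothetical mapping from TextTag names to HTML elements; adjust to your tag names.
TAG_TO_HTML = {"bold": "strong", "italic": "em", "underline": "u"}


def segments_to_html(segments):
    """Turn (text, set_of_tag_names) pairs into an HTML string."""
    out = []
    for text, tag_names in segments:
        piece = escape(text)
        for name in sorted(tag_names):  # sorted for a stable nesting order
            element = TAG_TO_HTML.get(name)
            if element:
                piece = "<%s>%s</%s>" % (element, piece, element)
        out.append(piece)
    return "".join(out)


def buffer_segments(textbuffer):
    """Yield (text, tag names) for each uniformly tagged stretch of a Gtk.TextBuffer."""
    it = textbuffer.get_start_iter()
    while not it.is_end():
        nxt = it.copy()
        # Passing None moves to the next position where any tag is switched on or off.
        if not nxt.forward_to_tag_toggle(None):
            nxt = textbuffer.get_end_iter()
        names = {tag.get_property("name") for tag in it.get_tags()}
        yield textbuffer.get_text(it, nxt, True), names
        it = nxt
```

Usage would then be html = segments_to_html(buffer_segments(textbuffer)). Only segments_to_html can be exercised without a running Gtk application.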

Docx content and formatting extraction in python

I am trying to parse a docx file and extract specific elements based on whether or not a certain word is bolded. If this is the text in the document:
Foo: Hello
Boo:
Blah Blah
•Blah
•Blah
Choo: Hello
I would want to scan, line by line, and take all the text after the bolded word until the next bolded word.
As of right now I am using an XML parser that parses based on newline characters. I cannot find anything in the zip file or the individual lines that would give me metadata like that.
Is it possible to do this?

I'd use a higher-level library that supports reading docx files rather than parsing the XML document.
One library that looks up to the task is python-docx.
If you're using Jython, Apache POI is another option (its XWPF component handles .docx; HWPF is for the older binary .doc format).
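
A minimal sketch of the python-docx approach, assuming the bold words are marked directly on their runs rather than through a style: read every run with its bold flag, then group the text that follows each bold heading until the next one. The helper names docx_runs and group_by_bold are made up for illustration.

```python
def docx_runs(path):
    """Yield (text, is_bold) for every run in the document (needs python-docx)."""
    import docx  # third-party: pip install python-docx

    for paragraph in docx.Document(path).paragraphs:
        for run in paragraph.runs:
            # run.bold is True, False, or None (None means inherited from the style).
            yield run.text, bool(run.bold)
        yield "\n", False  # mark the paragraph boundary


def group_by_bold(runs):
    """Return [(bold_heading, following_text), ...] from (text, is_bold) pairs."""
    sections = []
    heading, body = None, []
    for text, is_bold in runs:
        if is_bold and text.strip():
            if heading is not None and not body:
                heading += text  # Word often splits one bold phrase into several runs
            else:
                if heading is not None:
                    sections.append((heading.strip(), "".join(body).strip()))
                heading, body = text, []
        elif heading is not None:
            body.append(text)
    if heading is not None:
        sections.append((heading.strip(), "".join(body).strip()))
    return sections
```

For a real file this would be called as group_by_bold(docx_runs("file.docx")), where the file name is a placeholder.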
