How can I extract HTML embedded in RTF using Python?

I'm trying to extract the HTML email bodies from Outlook msg files. I've successfully converted them to eml/standard RFC 822 files using email-outlook-message-perl, but the email bodies are HTML wrapped in RTF. Here's an example snippet:
{\*\htmltag96 <div class="EduText" style="padding:2px;border-width:1px;background-color:#DEE5ED;border-color:##FAFAFA;border-style:solid;">}\htmlrtf {\htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf{\f4\fs24\htmlrtf0 \'cd\'d5\'e0\'c1\'c5\'b9\'d5\'e9\'ca\'e8\'a7\'e4\'bb\'b7\'d5\'e8 john.smith\htmlrtf\f0}\htmlrtf0
{\*\htmltag116 <br>}\htmlrtf \line
\htmlrtf0
Is there a way to get the HTML content, without all of the RTF crud?

This thread is a few years old, but this might help someone who is new to TNEF and in a similar situation.
If you are a Linux user, you can extract the HTML content from the RTF file with the command-line tool unrtf:
unrtf message.rtf
This prints the HTML content to standard output.
To redirect it into a file, try:
unrtf message.rtf > message.html
Hope this helps...
-Suresh
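
Since the question asks about Python, the same unrtf call can be wrapped in a small script. This is only a sketch: it assumes unrtf is installed and on PATH, and the helper names unrtf_command and rtf_to_html are made up for illustration.

```python
import shutil
import subprocess


def unrtf_command(rtf_path):
    """Build the argument list for converting an RTF file to HTML with unrtf."""
    # --html selects HTML output, which is also unrtf's default format.
    return ["unrtf", "--html", rtf_path]


def rtf_to_html(rtf_path):
    """Run unrtf on rtf_path and return whatever it prints, as bytes."""
    if shutil.which("unrtf") is None:
        raise RuntimeError("unrtf is not installed or not on PATH")
    result = subprocess.run(
        unrtf_command(rtf_path), stdout=subprocess.PIPE, check=True
    )
    return result.stdout
```

Calling rtf_to_html("message.rtf") would then return the same bytes that the shell command above prints.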

Microsoft uses TNEF (Transport Neutral Encapsulation Format), so I think you need to look for a Python TNEF implementation such as:
tnefparse
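
If pulling in a library is not an option, a rough pure-Python pass over a fragment like the one in the question might look like the sketch below. It is not a full de-encapsulator: it assumes the input is a body fragment (it does not skip header groups such as font or colour tables), it treats \htmlrtf as a simple on/off switch, it ignores \uN Unicode escapes, and the function name deencapsulate_html and the default code page are my own choices.

```python
import re

# One RTF token: control word, hex escape, control symbol, brace, or plain text.
TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"  # control word, optional number, optional delimiter space
    r"|\\'([0-9a-fA-F]{2})"     # \'hh hex-escaped byte
    r"|\\([^a-zA-Z'])"          # control symbol such as \\ \{ \} \*
    r"|([{}])"                  # group delimiters
    r"|([^\\{}]+)"              # run of plain text
)


def deencapsulate_html(rtf, encoding="cp1252"):
    """Return the HTML hidden in an RTF fragment, dropping the RTF-only parts."""
    out = bytearray()
    suppressed = False  # True while inside an \htmlrtf ... \htmlrtf0 region
    for match in TOKEN.finditer(rtf):
        word, param, hexbyte, symbol, brace, text = match.groups()
        if word == "htmlrtf":
            # \htmlrtf switches suppression on, \htmlrtf0 switches it off.
            suppressed = param != "0"
        elif suppressed or brace:
            continue  # RTF-only content and group braces are not HTML
        elif word in ("par", "line"):
            out += b"\r\n"
        elif word == "tab":
            out += b"\t"
        elif hexbyte:
            out.append(int(hexbyte, 16))
        elif symbol in ("\\", "{", "}"):
            out += symbol.encode("ascii")
        elif text:
            # Raw line breaks in RTF source are not part of the text.
            cleaned = text.replace("\r", "").replace("\n", "")
            out += cleaned.encode(encoding, "replace")
    return out.decode(encoding, "replace")
```

The \'hh bytes in the question's snippet look like Thai text, so for that message something like deencapsulate_html(body, encoding="cp874") is probably closer than the default; the right code page normally comes from the \ansicpg value in the RTF header.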

Related

create OleFile type file from uploaded file

So I am using Message class from this repo to parse .msg file. I have a test file that works with that class.
I am trying to use that class in custom parser I am writing for my Django rest framework app.
But when I read stream.body, it additionally adds the following content
----------------------------488071469102781097692083
Content-Disposition: form-data; name="file"; filename="email_test.msg"
Content-Type: application/vnd.ms-outlook
< actual content here >
----------------------------488071469102781097692083--
and I suspect that this additional content is why the Message class throws the following error:
not an OLE2 structured storage file
Is my suspicion right? How do I solve this?

msg-extractor is for .msg files from MS Outlook, which have a binary format called "OLE2" or "CFB". They start with "D0CF" when you open them in a hex viewer.
The snippet in your question looks like a MIME-encoded e-mail, which is text.
Are you sure the file you are trying to parse is a MS Outlook MSG file?
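
A quick way to check is to look at the first eight bytes of whatever you hand to the parser. The sketch below compares them against the standard OLE2/CFB signature; the helper name looks_like_ole2 is made up for illustration.

```python
# Signature at the start of every OLE2 / Compound File Binary document.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data):
    """Return True if the byte string starts with the OLE2 signature."""
    return data[:8] == OLE2_MAGIC
```

A raw multipart request body starts with the boundary dashes, so this check fails for it. The uploaded file's own bytes have to be separated from the form-data envelope before they are passed to the Message class.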

Is it possible to call xsl Apache FOP without providing an input file but instead passing a string

I am trying to generate a PDF using FOP. To do this I take a template file, fill in its values with Jinja2, and then pass it to fop with a system call.
Is it possible to do a subprocess call to FOP without passing through an input file but instead a string containing the XML directly? And if so how would I go about doing so?
I was hoping for something like this
fop -fo "XML here" -pdf output.pdf

Yes, it turned out to be possible.
Using Python, I loaded the XML from the file with lxml.etree:
tree = etree.parse('FOP_PARENT.fo.xml')
And then by using the etree to parse the include tags:
tree.xinclude()
Then it was a simple case of converting the xml back into unicode:
xml = etree.tounicode(tree)
This is how I got the templates to work. Hopefully this helps someone who has the same issue!
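
For the subprocess part of the question, one option is to feed the XML string to FOP on standard input. This sketch assumes your FOP version accepts "-" as the input file name to mean stdin (check the usage text that fop prints), and the helper names fop_command and render_pdf are hypothetical.

```python
import subprocess


def fop_command(output_path, fop_cmd="fop"):
    """Build a FOP command line that reads XSL-FO from stdin."""
    # "-" in place of the input file is assumed to mean "read from stdin".
    return [fop_cmd, "-fo", "-", "-pdf", output_path]


def render_pdf(fo_xml, output_path, fop_cmd="fop"):
    """Pipe an XSL-FO string into FOP and write the PDF to output_path."""
    subprocess.run(
        fop_command(output_path, fop_cmd),
        input=fo_xml.encode("utf-8"),
        check=True,
    )
```

With the string from the previous step, the call would be render_pdf(xml, "output.pdf"), so no intermediate input file is needed.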

Read outlook mail in html format

I receive a mail in Microsoft Outlook that contains a html table. I would like to parse this in to a pandas dataframe.
I have already written a script that uses beautiful soup to parse the html text in to the dataframe. But I am struggling with reading the email in html in the first place.
Having found the message, I am using the code below to write it to a text file. But it writes the text as a \n-separated string rather than as HTML markup, as I was expecting, which means that I can't use Beautiful Soup to get it into a dataframe.
I have found lots of examples of how to write and send a html mail but not how to read one in html format. Any ideas?
contents = msg.Body.encode('ascii', 'ignore').decode('ascii')
# Raw string so that "\b" in the path is not read as a backspace escape.
with open(r"U:\body.txt", "w") as contents_file:
    contents_file.write(contents)

I found the answer myself: I should use msg.HTMLBody rather than msg.Body.
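
For anyone landing here later, a minimal sketch of saving that property to disk follows. It assumes msg is an Outlook mail item object (for example one obtained through win32com) that exposes HTMLBody, and the helper name save_html_body is made up for illustration.

```python
def save_html_body(msg, path):
    """Write the message's HTML markup to path as UTF-8 text."""
    # HTMLBody holds the markup; Body holds only the plain-text rendering.
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(msg.HTMLBody)
```

Writing as UTF-8 keeps non-ASCII characters that the encode('ascii', 'ignore') step in the question would silently drop.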

Get text from Gtk3 TextView/TextBuffer with formatting tags included in Python

I'm working on a Python 3 project that uses the Gtk3 TextView/TextBuffer to get a user's input, and I've got it working to where I can have the user typing in rich text and able to format it as Bold/Italic/Underline/Combination of these.
However, I'm stuck on trying to figure out how to get the text from the TextBuffer with those flags included so I can use the formatting flags to convert the text to properly formatted HTML when I need to.
Calling textbuffer.get_text(start, end, True) simply returns the text without any flags.
Here's the code and the editor.glade file. Save them both in the same directory.
How can I get the text with the flags included? Or, alternatively, is there a way to get the user's input automatically formatted as HTML in another variable?

That's not very easy. Here is a link to some code that I once wrote to do the same thing for RTF output. You can probably adapt it to produce HTML output. If you manage to do so, I'd possibly integrate it into that library's successor.
Alternatively, if you prefer text processing to the above, you can export the rich text in GtkTextBuffer's internal serialization format and convert it to HTML yourself later:
format = textbuffer.register_serialize_tagset('my-tagset')
exported = textbuffer.serialize(textbuffer, format, start, end)

Creating pdfs in Python with Pisa / xhtml2pdf

I know there are a lot of questions based on pdf creation in Python but I haven't seen anything based on creating pdfs with Pisa or xhtml2pdf.
Here is my code.
pisa.pisaDocument(cStringIO.StringIO(a).encode('utf-8'),file('mypdf.pdf','wb'))
and then
pisa.startViewer('mypdf.pdf')
I assembled this over a couple different tutorials and examples but every single thing that I've tried always results in the pdf being corrupted and I get this message when trying to open the pdf.
"Adobe Reader could not open 'awesomer.pdf' because it is either not a supported file type or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."
This message occurs even when I don't use the .encode('utf-8') on the string.
What am I doing wrong? Does the encoding on my Mac have to do with this?

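I'd suggest closing the file manually; I had a similar problem. Note also that the string has to be encoded before it is wrapped in StringIO, since the StringIO object itself has no encode method. Try this:
f = file('mypdf.pdf', 'wb')
pisa.pisaDocument(cStringIO.StringIO(a.encode('utf-8')), f)
f.close()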

I recommend doing the following:
pdf = pisa.pisaDocument(cStringIO.StringIO(a.encode('utf-8')), file('mypdf.pdf', 'wb'))
if pdf.err:
    print "*** %d ERRORS OCCURRED" % pdf.err
And then see what the error output is.
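I'm not sure what string you are encoding, but if a needs to be HTML-escaped first, this might also help (cgi.escape is the Python 2 standard-library helper and needs import cgi):
pdf = pisa.pisaDocument(cStringIO.StringIO(cgi.escape(a).encode('utf-8')), file('mypdf.pdf', 'wb'))
Whether this is needed depends on whether a contains raw text or ready-made HTML.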
