I receive a mail in Microsoft Outlook that contains an HTML table. I would like to parse this into a pandas dataframe.
I have already written a script that uses Beautiful Soup to parse the HTML text into the dataframe, but I am struggling with reading the email as HTML in the first place.
Having found the message, I am using the code below to read it into a text file. But it is writing the text as a \n-separated string rather than the HTML markup I was expecting, which means that I then can't use Beautiful Soup to get this into a dataframe.
I have found lots of examples of how to write and send an HTML mail, but not how to read one in HTML format. Any ideas?
# Body returns the plain-text body; encoding to ASCII also drops non-ASCII characters
contents = msg.Body.encode('ascii', 'ignore').decode('ascii')
# use a raw string so the backslash in the path isn't treated as an escape (\b is backspace)
contents_file = open(r"U:\body.txt", "w")
contents_file.write(contents)
contents_file.close()
Found the answer myself: I should use msg.HTMLBody rather than msg.Body.
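For reference, a minimal sketch of how that fix slots into the rest of the workflow (msg is assumed to be the Outlook MailItem already located, and the table is assumed simple enough for pandas.read_html):

import pandas as pd
from io import StringIO
from bs4 import BeautifulSoup

# HTMLBody keeps the markup, unlike Body, which is plain text
html = msg.HTMLBody

# isolate the first <table> in the body and hand it to pandas
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')
df = pd.read_html(StringIO(str(table)))[0]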
I am trying to read the contents of an HTML file with BeautifulSoup, but I'm receiving a UnicodeDecodeError.
I also tried changing the parser to html.parser instead of lxml, but it doesn't work.
However, if I use the requests library to request the URL, it works; it fails only when I read the HTML file locally.
Answer:
I needed to specify the encoding when opening the file, so it should be something like this: with open('lap.html', encoding="utf8") as html_file:
You are passing a file object to BeautifulSoup; instead, you have to pass the contents of the file.
Try:
content = html_file.read()
source = BeautifulSoup(content, 'lxml')
First of all, fix the typo soruce to source, then put spaces around the equals sign, and then find out what can't be encoded by the encoding standard you use, because that error refers to a character which can't be decoded/encoded.
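Putting both fixes together, a minimal sketch using the filename from the question:

from bs4 import BeautifulSoup

# open with an explicit encoding to avoid the UnicodeDecodeError
with open('lap.html', encoding='utf8') as html_file:
    # pass the file's contents, not the file object, to BeautifulSoup
    content = html_file.read()

source = BeautifulSoup(content, 'lxml')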
I'm trying to extract the HTML email bodies from Outlook msg files. I've successfully converted them to eml/standard RFC 822 files using email-outlook-message-perl, but the bodies of the emails are HTML wrapped in RTF. Here's an example snippet:
{\*\htmltag96 <div class="EduText" style="padding:2px;border-width:1px;background-color:#DEE5ED;border-color:##FAFAFA;border-style:solid;">}\htmlrtf {\htmlrtf0 {\*\htmltag64}\htmlrtf {\htmlrtf0 \htmlrtf{\f4\fs24\htmlrtf0 \'cd\'d5\'e0\'c1\'c5\'b9\'d5\'e9\'ca\'e8\'a7\'e4\'bb\'b7\'d5\'e8 john.smith\htmlrtf\f0}\htmlrtf0
{\*\htmltag116 <br>}\htmlrtf \line
\htmlrtf0
Is there a way to get the HTML content without all of the RTF crud?
This thread is a few years old, but this might be helpful for anyone who is new to TNEF and in a similar situation...
If you are a Linux user, you can extract the HTML content from the RTF file using the command-line tool unrtf:
unrtf message.rtf
This prints the HTML content to standard output.
If you want to redirect it into a file, you could try:
unrtf message.rtf > message.html
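If you'd rather drive it from Python, a minimal sketch using subprocess (assuming unrtf is on your PATH, with the filenames from above):

import subprocess

# unrtf writes the converted HTML to stdout, so capture it
result = subprocess.run(['unrtf', 'message.rtf'],
                        capture_output=True, text=True, check=True)

with open('message.html', 'w') as f:
    f.write(result.stdout)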
Hope this helps...
-Suresh
Microsoft is using TNEF (Transport Neutral Encapsulation Format), so I think you need to search for a TNEF Python implementation like:
tnefparse
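If you go that route, a minimal sketch of what using tnefparse might look like; the rtfbody attribute is my assumption from the library's docs, so verify it against the version you install:

from tnefparse import TNEF

# parse a TNEF blob (e.g. winmail.dat); the constructor takes raw bytes
with open('winmail.dat', 'rb') as f:
    tnef = TNEF(f.read())

# assumed attribute: recent tnefparse versions expose the RTF-wrapped
# body, which you can then feed through unrtf as shown above
rtf = tnef.rtfbody
with open('message.rtf', 'wb') as out:
    out.write(rtf)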
I'm working on a Python 3 project that uses the Gtk3 TextView/TextBuffer to get a user's input, and I've gotten it to the point where the user can type rich text and format it as bold, italic, underline, or a combination of these.
However, I'm stuck trying to figure out how to get the text from the TextBuffer with those formatting flags included, so I can use them to convert the text to properly formatted HTML when I need to.
Calling textbuffer.get_text(start, end, True) simply returns the text without any flags.
Here's the code and the editor.glade file. Save them both in the same directory.
How can I get the text with the flags included? Alternatively, is there a way to automatically get the user's input formatted as HTML in another variable?
That's not very easy. Here is a link to some code that I once wrote to do the same thing for RTF output. You can probably adapt it to produce HTML output. If you manage to do so, I'd possibly integrate it into that library's successor.
Alternatively, if you prefer text processing to the above, you can export the rich text in GtkTextBuffer's internal serialization format and convert it to HTML yourself later:
# register GtkTextBuffer's internal rich-text serialization format...
format = textbuffer.register_serialize_tagset('my-tagset')
# ...and export the given range with all tags preserved
exported = textbuffer.serialize(textbuffer, format, start, end)
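Here start and end are GtkTextIters delimiting the range to export; to cover the whole buffer, something like this should work (note that serialize returns the tagged content as bytes):

# iterators spanning the entire buffer
start, end = textbuffer.get_bounds()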
I receive an email when a system in my company generates an error. This email contains XML all crammed onto a single line.
I wrote a Notepad++ Python script that parses out everything except the XML and pretty-prints it. Unfortunately, some of the emails contain too much XML data and it gets truncated. In general, the truncated data isn't that important to me. I would like to be able to just auto-close any open tags so that my Python script works. It doesn't need to be smart or correct; it just needs to make the XML well-formed enough that the script runs. Is there a way to do this?
I am open to Python scripts, online apps, downloadable apps, etc.
I realize that the right solution is to get the non-truncated XML, but pulling the right lever to get that done would be far more work than just dealing with it.
Use Beautiful Soup
>>> import bs4
>>> s = bs4.BeautifulSoup("<asd><xyz>asd</xyz>", "lxml")
>>> s
<html><head></head><body><asd><xyz>asd</xyz></asd></body></html>
>>>
>>> s.body.contents[0]
<asd><xyz>asd</xyz></asd>
Notice that it closed the "asd" tag automagically.
To create a Notepad++ script to handle this:
Download the tarball and extract the files.
Copy the bs4 directory to your PythonScript/scripts folder.
In Notepad++, add the following code to your Python script:
# import Beautiful Soup
import bs4
# get the text in the document
text = editor.getText()
# soupify it to fix the XML (open tags get auto-closed)
soup = bs4.BeautifulSoup(text)
# convert the soup object back to a string
text = str(soup)
# clear the editor and replace the bad XML with the fixed XML
editor.clearAll()
editor.addText(text)
# change the language to XML
notepad.menuCommand(MENUCOMMAND.LANG_XML)
# soup has its own prettify, but I like the XML Tools version better
notepad.runMenuCommand('XML Tools', 'Pretty print (XML only - with line breaks)', 1)
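If you don't have the XML Tools plugin, soup's own pretty-printer (mentioned in the comment above) can stand in; swap the str(soup) line for:

# BeautifulSoup's built-in pretty-printer returns an indented string
text = soup.prettify()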
If you have BeautifulSoup and lxml installed, it's straightforward:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <?xml version="1.0" encoding="utf-8"?>
... <a>
... <b>foo</b>
... <c>bar</""", "xml")
>>> soup
<?xml version="1.0" encoding="utf-8"?>
<a>
<b>foo</b>
<c>bar</c></a>
Note the second "xml" argument to the constructor to avoid the XML being interpreted as HTML.
I am writing code in Python that not only reads an XML file but also sends the results of that parsing as an email. Right now I am having trouble just reading the XML file. I made a simple Python script that I thought would at least read the file, which I could then try to email from Python, but I am getting a Syntax Error on line 4.
root.tag is 'log'.
Anyway, here is the code I have written so far:
import xml.etree.cElementTree as etree
tree = etree.parse('C:/opidea.xml')
response = tree.getroot()
log = response.find('log').text
logentry = response.find('logentry').text
author = response.find('author').text
date = response.find('date').text
msg = [i.text for i in response.find('msg')]
Now, the XML file has this type of formatting:
<log>
<logentry
revision="12345">
<author>glv</author>
<date>2012-08-09T13:16:24.488462Z</date>
<paths>
<path
action="M"
kind="file">/trunk/build.xml</path>
</paths>
<msg>BUG_NUMBER:N/A
FEATURE_AFFECTED:N/A
OVERVIEW:Example</msg>
</logentry>
</log>
I want to be able to send an email of this XML file, but for now I am just trying to get the Python code to read it.
response.find('log') won't find anything, because:
find(self, path, namespaces=None)
Finds the first matching subelement, by tag name or path.
In your case log is not a subelement, but rather the root element itself. You can get its text directly, though: response.text. But in your example the log element doesn't have any text in it, anyway.
EDIT: Sorry, that quote actually comes from the lxml.etree documentation, rather than from xml.etree.
I'm not sure about the reason, but all the other calls to find also return None (you can check this by printing response.find('date') and so on). With lxml you can use xpath instead:
author = response.xpath('//author')[0].text
msg = [i.text for i in response.xpath('//msg')]
In any case, your use of find is not correct for msg, because find always returns a single element, not a list of them.
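For completeness, a minimal sketch of reading the sample file with the stdlib xml.etree alone, drilling down level by level since find only searches direct children:

import xml.etree.ElementTree as etree

tree = etree.parse('C:/opidea.xml')  # path from the question
log = tree.getroot()                 # <log> is the root element itself

for logentry in log.findall('logentry'):
    author = logentry.find('author').text
    date = logentry.find('date').text
    msg = logentry.find('msg').text
    print(author, date, msg)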