Parsing Outlook .msg files with Python

I looked around and couldn't find a satisfactory answer. Does anyone know how to parse .msg files from Outlook with Python?
I've tried using mimetools and email.parser with no luck. Help would be greatly appreciated!

This works for me:
import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
msg = outlook.OpenSharedItem(r"C:\test_msg.msg")
print(msg.SenderName)
print(msg.SenderEmailAddress)
print(msg.SentOn)
print(msg.To)
print(msg.CC)
print(msg.BCC)
print(msg.Subject)
print(msg.Body)
count_attachments = msg.Attachments.Count
if count_attachments > 0:
    for item in range(count_attachments):
        # Outlook collections are 1-based, hence item + 1
        print(msg.Attachments.Item(item + 1).Filename)
del outlook, msg
See the linked post for ways to get the email addresses, and not just the display names (e.g. "John Doe"), from the To, CC and BCC properties.

I succeeded in extracting the relevant fields from MS Outlook files (.msg) using the msg-extractor utility by Matt Walker.
Prerequisites
pip install extract-msg
Note that additional modules may need to be installed; in my case, imapclient was required:
pip install imapclient
Usage
import extract_msg
f = r'MS_Outlook_file.msg' # Replace with yours
msg = extract_msg.Message(f)
msg_sender = msg.sender
msg_date = msg.date
msg_subj = msg.subject
msg_message = msg.body
print('Sender: {}'.format(msg_sender))
print('Sent On: {}'.format(msg_date))
print('Subject: {}'.format(msg_subj))
print('Body: {}'.format(msg_message))
The MsgExtractor utility has many other goodies to explore, but this is a good start.
Note
I had to comment out lines 3 to 8 within the file C:\Anaconda3\Scripts\ExtractMsg.py:
#"""
#ExtractMsg:
# Extracts emails and attachments saved in Microsoft Outlook's .msg files
#
#https://github.com/mattgwwalker/msg-extractor
#"""
Error message was:
line 3
ExtractMsg:
^
SyntaxError: invalid syntax
After commenting out those lines, the error message disappeared and the code worked just fine.

Even though this is an old thread, I hope this information helps someone looking for a solution to exactly what the thread subject says. I strongly advise using mattgwwalker's solution on GitHub, which requires the OleFileIO_PL module to be installed separately.

I was able to parse the files in a similar way to what Vladimir described above. However, I needed to make a small change by adding a for loop: glob.glob(r'c:\test_email\*.msg') returns a list, whereas Message(f) expects a file or a str.
import glob
import ExtractMsg  # older module name; newer releases are imported as extract_msg
f = glob.glob(r'c:\test_email\*.msg')
for filename in f:
    msg = ExtractMsg.Message(filename)
    msg_sender = msg.sender
    msg_date = msg.date
    msg_subj = msg.subject
    msg_message = msg.body

I found a module on the net called MSG PY.
It is a Microsoft Outlook .msg file module for Python.
The module lets you easily create, read, parse, and convert Outlook .msg files.
It does not require Microsoft Outlook, or any other third-party application or library, to be installed on the machine.
For example:
from independentsoft.msg import Message
appointment = Message("e:\\appointment.msg")
print("subject: " + str(appointment.subject))
print("start_time: " + str(appointment.appointment_start_time))
print("end_time: " + str(appointment.appointment_end_time))
print("location: " + str(appointment.location))
print("is_reminder_set: " + str(appointment.is_reminder_set))
print("sender_name: " + str(appointment.sender_name))
print("sender_email_address: " + str(appointment.sender_email_address))
print("display_to: " + str(appointment.display_to))
print("display_cc: " + str(appointment.display_cc))
print("body: " + str(appointment.body))

I've tried the python email module and sometimes that doesn't successfully parse the msg file.
So, in this case, if you are only after text or html, the following code worked for me.
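start_text = b"<html>"
end_text = b"</html>"
def parse_msg(msg_file, start_text, end_text):
    # .msg files are binary, so read bytes rather than text
    with open(msg_file, "rb") as f:
        b = f.read()
    return b[b.find(start_text):b.find(end_text) + len(end_text)]
path_to_msg_file = r"C:\test_msg.msg"  # placeholder, replace with your file
print(parse_msg(path_to_msg_file, start_text, end_text))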
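One weak spot in the slice above is that find() returns -1 when a marker is missing, so the result can be a meaningless fragment. A slightly more defensive sketch is below. The function name and the sample bytes are made up for illustration, and the approach only helps when the HTML body happens to be stored contiguously as plain single-byte text inside the file, which is not guaranteed.

```python
def extract_between(data: bytes, start: bytes = b"<html", end: bytes = b"</html>") -> bytes:
    """Return the first start..end span in data (inclusive), or b"" if absent."""
    lowered = data.lower()  # match the markers case-insensitively
    begin = lowered.find(start)
    if begin == -1:
        return b""
    finish = lowered.find(end, begin)
    if finish == -1:
        return b""
    return data[begin:finish + len(end)]


def parse_msg_html(path: str) -> bytes:
    # Binary mode: a .msg file is not a text file.
    with open(path, "rb") as f:
        return extract_between(f.read())


# Illustrative data only, not a real .msg file.
sample = b"\x00\x01junk<HTML><body>Hi</body></html>\x00more"
print(extract_between(sample))
```

Searching for "<html" without the closing bracket also catches opening tags that carry attributes.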

The extract-msg Python module (pip install extract-msg) is also extremely useful because it allows quick access to the full headers from the message, something that Outlook makes much harder than necessary to get hold of.
My modification of Vladimir's code that shows full headers is:
#!/usr/bin/env python3
import extract_msg
import sys
msg = extract_msg.Message(sys.argv[1])
msg_sender = msg.sender
msg_date = msg.date
msg_subj = msg.subject
print('Sender: {}'.format(msg_sender))
print('Sent On: {}'.format(msg_date))
print('Subject: {}'.format(msg_subj))
print("=== Details ===")
for k, v in msg.header.items():
    print("{}: {}".format(k, v))
print(msg.body)

Related

How to add .htm to email body using win32com

I need to use win32com.client to make an email where I add a signature with the .htm extension to the mail.HtmlBody. However, each time I do this, I get UnicodeDecodeError.
In other words, how do I correct the UnicodeDecodeError problem and add my string & htm file to the HtmlBody?
self.mail = win32.Dispatch('outlook.application').CreateItem(0)
self.curText = str(self.email.currentText())
self.projectNameT = ' '.join(self.curText.split(' ')[7:])
self.mail.To = 'ABC@XYZ.com'
self.mail.Subject = "Subject: " + str(self.projectNameT)
self.someStr = 'Hello '
self.html_url = open("SomePath//Signature.htm",encoding = 'utf16')
self.data = self.html_url.read()
self.mail.HtmlBody = self.someStr + ('<p>self.data</p>')
If you want to insert a signature fully programmatically from Python, Redemption exposes the RDOSignature object, which implements an ApplyTo method (it deals with signature image files and merges HTML styles). Since the Outlook security patch, a lot of this can't be done through Outlook's own object model, so you have to work around it before you can proceed as normal.
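Independently of Redemption, two things in the question's snippet may be worth a second look. First, '<p>self.data</p>' is a string literal, so it inserts the text "self.data" rather than the file's contents. Second, a UnicodeDecodeError on read usually means the file is not in the encoding passed to open(). The sketch below is one possible way to handle both; the helper names are hypothetical, and the cp1252 fallback is an assumption about how the signature file was saved.

```python
import codecs
import html
import os
import tempfile


def read_signature(path, fallback="cp1252"):
    """Read an .htm file, picking the decoder from its byte-order mark if any."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Assumed fallback; replace undecodable bytes instead of raising.
        return raw.decode(fallback, errors="replace")


def build_html_body(greeting, signature_html):
    # Concatenate the variable's value, not the literal text "self.data".
    return "<p>" + html.escape(greeting) + "</p>" + signature_html


# Offline demo with a throwaway UTF-16 signature file.
with tempfile.TemporaryDirectory() as tmp:
    sig_path = os.path.join(tmp, "Signature.htm")
    with open(sig_path, "w", encoding="utf-16") as f:
        f.write("<div>Regards, Zo\u00eb</div>")
    body = build_html_body("Hello ", read_signature(sig_path))
print(body)
```

In the question's code the result would then be assigned with self.mail.HTMLBody = body.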

How to use python to read and extract data from .msg files on linux?

I'm trying to extract attachments from .msg files using the following code, as suggested here. Following is just a part of the code to test whether the attachments are read
import email
with open('input/message.msg', 'rb') as fp:
    msg = email.message_from_binary_file(fp)
for part in msg.walk():
    print(part.get_content_type())
    print(part.get_filename())
    print(part.get_content_maintype())
I would expect that some of those print statements would output something similar to image/png, but even if that email message has numerous attachments, the output is simply
text/plain
None
text
Do you have any hint of what I'm doing wrong? I'm working on a Linux machine with python 3.7.3.
Thanks
Edit
I didn't investigate too much but I ended up using the python module msg-extractor which, using the following code, works without any problem
import extract_msg
msg = extract_msg.Message("input/email.msg")
for attachment in msg.attachments:
    print(attachment.save())
The attachment class with all the available methods is implemented here; I just needed to store the attachments.
I'll keep the question open hoping for a more relevant answer.
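A likely reason the email module reports a single text/plain part: an Outlook .msg file is an OLE2 compound document, not an RFC 822/MIME text message, so the parser finds no headers and treats everything as one plain-text body. A quick way to check what kind of file you have is to look at its first eight bytes. This is only a sketch; the function names are made up, and the signature check says nothing about whether the file is a valid message.

```python
import email

# Signature at the start of every OLE2 compound file (the container .msg uses).
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """True if data starts with the OLE2 compound-file signature."""
    return data.startswith(OLE2_MAGIC)


def sniff_file(path: str) -> bool:
    with open(path, "rb") as f:
        return looks_like_ole2(f.read(len(OLE2_MAGIC)))


# Illustrative bytes only, not real messages.
fake_msg = OLE2_MAGIC + b"\x00" * 16
fake_eml = b"From: alice@example.com\r\nSubject: hi\r\n\r\nbody\r\n"

print(looks_like_ole2(fake_msg), looks_like_ole2(fake_eml))
# The MIME parser does not fail on OLE2 data, it just falls back to its default type.
print(email.message_from_bytes(fake_msg).get_content_type())
```

If the check returns True, a library that understands the OLE2 container (such as the msg-extractor module used above) is the right tool rather than the email package.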
Have you had the opportunity to try the MSG PY module from Independentsoft? Here's an example:
from independentsoft.msg import Message
appointment = Message("e:\\appointment.msg")
print("subject: " + str(appointment.subject))
print("start_time: " + str(appointment.appointment_start_time))
print("end_time: " + str(appointment.appointment_end_time))
print("location: " + str(appointment.location))
print("is_reminder_set: " + str(appointment.is_reminder_set))
print("sender_name: " + str(appointment.sender_name))
print("sender_email_address: " + str(appointment.sender_email_address))
print("display_to: " + str(appointment.display_to))
print("display_cc: " + str(appointment.display_cc))
print("body: " + str(appointment.body))

importing gmail text content to a text file with python returns nothing

I was looking for a solution to get all my text messages/content (not attachments) in my gmail's email-account to a text file and I found a piece of code which promises to do so. I have python 3.4 and 2.7 both installed on win 7. I know php a bit but python I am zero. Right now, I have the code copied into a notepad text and saved it as test.py. Here is the complete code.
import imaplib
import email
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('myemailid@gmail.com', 'mypassword')
mail.list()
mail.select('inbox')
result, data = mail.uid('search', None, "ALL")
i = len(data[0].split())
for x in range(i):
    latest_email_uid = data[0].split()[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]
    raw_email_string = raw_email.decode('utf-8')
    email_message = email.message_from_string(raw_email_string)
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True)
            save_string = r"F:\result\email_" + str(x) + ".txt"
            myfile = open(save_string, 'a')
            myfile.write(body.decode('utf-8'))
            myfile.close()
        else:
            continue
ISSUE : The code when run gives nothing in return.
UPDATE : Actually I have been going through a lot of threads here. Closest to what I am asking are here and here but no one has a unified answer and the solutions are scattered in bits and pieces, or it is too specific so it is very difficult for me to make sense out of it. Trust me, I am testing and trying. Even a Similar question remains unanswered here.
For those who may find it useful: the script runs fine on Python 3.4 with these two changes.
save_string = "F:\\result\\email_" + str(x) + ".txt"
myfile.write(str(body))
Just change the directory and the folder name according to your need. Thanks to the original code provider.
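One caveat with myfile.write(str(body)) on Python 3: body is a bytes object there, so str(body) writes its repr (text starting with b'...') rather than the message text. A hedged alternative is to decode each text/plain part with the charset the part declares. The sketch below works on raw message bytes, so it can be tried without a Gmail connection; the function name and the sample message are illustrative only.

```python
import email
from email.message import EmailMessage


def plain_text_parts(raw_email: bytes):
    """Return the decoded text of every text/plain part in a raw message."""
    message = email.message_from_bytes(raw_email)
    texts = []
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "utf-8"  # assumed default
        texts.append(payload.decode(charset, errors="replace"))
    return texts


# Build a small two-part message to stand in for email_data[0][1].
sample = EmailMessage()
sample["Subject"] = "demo"
sample.set_content("plain body, caf\u00e9\n")
sample.add_alternative("<p>html body</p>", subtype="html")

texts = plain_text_parts(sample.as_bytes())
print(texts)
```

In the script above, raw_email would be passed in directly, and each returned string could be written with open(save_string, 'a', encoding='utf-8').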

Get the Gmail attachment filename without downloading it

I'm trying to get all the messages from a Gmail account that may contain some large attachments (about 30MB). I just need the names, not the whole files. I found a piece of code to get a message and the attachment's name, but it downloads the file and then reads its name:
import imaplib, email
#log in and select the inbox
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login('username', 'password')
mail.select('inbox')
#get uids of all messages
result, data = mail.uid('search', None, 'ALL')
uids = data[0].split()
#read the latest message
result, data = mail.uid('fetch', uids[-1], '(RFC822)')
m = email.message_from_string(data[0][1])
if m.get_content_maintype() == 'multipart': #multipart messages only
    for part in m.walk():
        #find the attachment part
        if part.get_content_maintype() == 'multipart': continue
        if part.get('Content-Disposition') is None: continue
        #save the attachment in the program directory
        filename = part.get_filename()
        fp = open(filename, 'wb')
        fp.write(part.get_payload(decode=True))
        fp.close()
        print '%s saved!' % filename
I have to do this once a minute, so I can't download hundreds of MB of data. I am a newbie into the web scripting, so could anyone help me? I don't actually need to use imaplib, any python lib will be ok for me.
Best regards
Rather than fetch RFC822, which is the full content, you could specify BODYSTRUCTURE.
The resulting data structure from imaplib is pretty confusing, but you should be able to find the filename, content-type and sizes of each part of the message without downloading the entire thing.
If you know something about the file name, you can use the X-GM-RAW gmail extensions for imap SEARCH command. These extensions let you use any gmail advanced search query to filter the messages. This way you can restrict the downloads to the matching messages, or exclude some messages you don't want.
mail.uid('search', None, 'X-GM-RAW',
         'has:attachment filename:pdf in:inbox -label:parsed')
The above searches for messages with PDF attachments in INBOX that are not labeled "parsed".
Some pro tips:
label the messages you have already parsed, so you don't need to fetch them again (the -label:parsed filter in the above example)
always use the uid version instead of the standard sequential ids (you are already doing this)
unfortunately MIME is messy: there are a lot of clients that do weird (or plain wrong) things. You could try to download and parse only the headers, but is it worth the trouble?
[edit]
If you label a message after parsing it, you can skip the messages you have parsed already. This should be reasonable enough to monitor your class mailbox.
Perhaps you live in a corner of the world where internet bandwidth is more expensive than programmer time; in this case, you can fetch only the headers and look for "Content-disposition" == "attachment; filename=somefilename.ext".
A FETCH of the RFC822 message data item is functionally equivalent to BODY[]. IMAP4 supports other message data items, listed in section 6.4.5 of RFC 3501.
Try requesting a different set of message data items to get just the information that you need. For example, you could try RFC822.HEADER or maybe BODY.PEEK[MIME].
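As a rough illustration of fetching less than the full message, the sketch below requests only RFC822.HEADER and parses the returned header block. fetch_header_fields and FakeMail are hypothetical names; FakeMail merely imitates the shape of an imaplib response so the example runs without a server, and its contents are invented.

```python
import email


def fetch_header_fields(mail, uid, fields=("From", "Subject", "Content-Type")):
    """Fetch only the header block of one message and return selected fields."""
    typ, data = mail.uid("fetch", uid, "(RFC822.HEADER)")
    if typ != "OK":
        raise RuntimeError("FETCH failed: %r" % (data,))
    for item in data:
        # imaplib hands back literals as (envelope, payload) tuples
        if isinstance(item, tuple):
            headers = email.message_from_bytes(item[1])
            return {name: headers.get(name) for name in fields}
    return {}


class FakeMail:
    """Hypothetical stand-in for imaplib.IMAP4_SSL, for offline use only."""

    def uid(self, command, *args):
        raw = (b"From: alice@example.com\r\n"
               b"Subject: report\r\n"
               b"Content-Type: multipart/mixed; boundary=x\r\n\r\n")
        return "OK", [(b"1 (UID 7 RFC822.HEADER {87}", raw), b")"]


info = fetch_header_fields(FakeMail(), "7")
print(info)
```

With a real connection, mail would be the logged-in imaplib.IMAP4_SSL object from the question. A top-level Content-Type of multipart/mixed is only a hint that attachments may be present; the file names themselves come from BODYSTRUCTURE.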
Old question, but just wanted to share the solution to this I came up with today. Searches for all emails with attachments and outputs the uid, sender, subject, and a formatted list of attachments. Edited relevant code to show how to format BODYSTRUCTURE:
data = mailobj.uid('fetch', mail_uid, '(BODYSTRUCTURE)')[1]
struct = data[0].split()
filenames = [] #holds list of attachment filenames
for j, k in enumerate(struct):
    if k == '("FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-3] != '"':
            count += 1
            val += " " + struct[j + count]
        filenames.append(val[1:-3])
    elif k == '"FILENAME"':
        count = 1
        val = struct[j + count]
        while val[-1] != '"':
            count += 1
            val += " " + struct[j + count]
        filenames.append(val[1:-1])
I've also published it on GitHub.
EDIT
The above solution is good, but the logic that extracts the attachment file name from the payload is not robust. It fails when the file name contains a space and the first word has only two characters,
for example: "ad cde gh.png".
Try this:
import re # Somewhere at the top
result, data = mailobj.uid("fetch", mail_uid, "BODYSTRUCTURE")
itr = re.finditer(r'("FILENAME" "([^\/:*?"<>|]+)")', data[0].decode("ascii"))
for match in itr:
    print(f"File name: {match.group(2)}")
Test Regex here.
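To see the regex approach on concrete input, here is a small self-contained sketch run against a hand-written BODYSTRUCTURE response that includes the problematic name from the example. The sample response is invented for illustration and the pattern is a simplified variant of the one above. It assumes the server sends the name as a quoted string, so names sent as IMAP literals or in RFC 2231 encoded form would be missed.

```python
import re

# Simplified pattern: the quoted value that follows a "FILENAME" key.
FILENAME_RE = re.compile(r'"FILENAME" "([^"]+)"', re.IGNORECASE)


def attachment_names(bodystructure):
    """Return the quoted FILENAME values found in a BODYSTRUCTURE response."""
    if isinstance(bodystructure, bytes):
        bodystructure = bodystructure.decode("ascii", errors="replace")
    return FILENAME_RE.findall(bodystructure)


# Hand-written sample, shaped like data[0] from an imaplib fetch.
sample = (
    b'1 (UID 7 BODYSTRUCTURE (("TEXT" "PLAIN" ("CHARSET" "UTF-8") '
    b'NIL NIL "7BIT" 12 1 NIL NIL NIL)'
    b'("IMAGE" "PNG" ("NAME" "ad cde gh.png") NIL NIL "BASE64" 3456 NIL '
    b'("ATTACHMENT" ("FILENAME" "ad cde gh.png")) NIL) '
    b'"MIXED" ("BOUNDARY" "x") NIL NIL))'
)
print(attachment_names(sample))
```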

Python newbie - Input strings, return a value to a web page

I've got a program I would like to use to input a password and one or multiple strings from a web page. The program takes the strings and outputs them to a time-datestamped text file, but only if the password matches the set MD5 hash.
The problems I'm having here are that
I don't know how to get this code on the web. I have a server, but is it as easy as throwing pytext.py onto my server?
I don't know how to write a form for the input to this script and how to get the HTML to work with this program. If possible, it would be nice to make it a multi-line input box... but it's not necessary.
I want to return a value to a web page to let the user know if the password authenticated successfully or failed.
dtest
import sys
import time
import getopt
import hashlib
h = hashlib.new('md5')
var = sys.argv[1]
print "Password: ", var
h.update(var)
print h.hexdigest()
trial = h.hexdigest()
check = "86fe2288ac154c500983a8b89dbcf288"
if trial == check:
    print "Password success"
    time_stamp = time.strftime('%Y-%m-%d_%H-%M-%S', (time.localtime(time.time())))
    strFile = "txt_" + str(time_stamp) + ".txt"
    print "File created: txt_" + str(time_stamp) + ".txt"
    #print 'The command line arguments are:'
    #for i in sys.argv:
    #print i
    text_file = open(strFile, "w")
    text_file.write(str(time_stamp) + "\n")
    for i in range(2, len(sys.argv)):
        text_file.write(sys.argv[i] + "\n")
        #print 'Debug to file:', sys.argv[i]
    text_file.close()
else:
    print "Password failure"
You'll need to read up on mod_python (if you're using Apache) and the Python CGI module.
Take a look at django. It's an excellent web framework that can accomplish exactly what you are asking. It also has an authentication module that handles password hashing and logins for you.
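Neither answer shows code, so here is a rough sketch of the shape such a handler could take. It uses a plain WSGI callable from the standard library, which is a swapped-in technique and not mod_python, the cgi module, or Django. The form field names (password, text), the placeholder password, and the output directory are all assumptions made for the example.

```python
import hashlib
import hmac
import io
import os
import tempfile
import time
from urllib.parse import parse_qs

CHECK = hashlib.md5(b"example-password").hexdigest()  # placeholder hash
OUT_DIR = tempfile.gettempdir()  # assumed location for the text files


def save_lines(lines, out_dir=OUT_DIR):
    """Write the lines to a time-stamped file and return its path."""
    stamp = time.strftime("%Y-%m-%d_%H-%M-%S")
    path = os.path.join(out_dir, "txt_" + stamp + ".txt")
    with open(path, "w") as f:
        f.write(stamp + "\n")
        for line in lines:
            f.write(line + "\n")
    return path


def app(environ, start_response):
    """WSGI entry point; expects POST fields 'password' and 'text'."""
    size = int(environ.get("CONTENT_LENGTH") or 0)
    form = parse_qs(environ["wsgi.input"].read(size).decode("utf-8"))
    password = form.get("password", [""])[0]
    trial = hashlib.md5(password.encode("utf-8")).hexdigest()
    if hmac.compare_digest(trial, CHECK):
        save_lines(form.get("text", [""])[0].splitlines())
        status, message = "200 OK", "Password success"
    else:
        status, message = "403 Forbidden", "Password failure"
    start_response(status, [("Content-Type", "text/plain; charset=utf-8")])
    return [message.encode("utf-8")]


# Offline demo: call the app the way a WSGI server would.
body = b"password=example-password&text=first+line%0Asecond+line"
environ = {"CONTENT_LENGTH": str(len(body)), "wsgi.input": io.BytesIO(body)}
statuses = []
reply = app(environ, lambda status, headers: statuses.append(status))
print(statuses[0], reply[0].decode("utf-8"))
```

For local testing the callable can be served with wsgiref.simple_server.make_server, and an HTML form with a password input and a multi-line textarea named text, posting to that URL, would supply the fields. For real logins, the password hashing that Django's authentication module provides is a better fit than a bare MD5 comparison.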
