Parsing email with Python - python

I'm writing a Python script to process emails returned from Procmail. As suggested in this question, I'm using the following Procmail config:
:0:
|$HOME/process_mail.py
My process_mail.py script is receiving an email via stdin like this:
From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
for <username#domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB#mail.gmail.com>
Subject: TEST 12
From: Full Name <username#sender.com>
To: username#domain.com
Content-Type: text/plain; charset=ISO-8859-1
ONE
TWO
THREE
I'm trying to parse the message in this way:
>>> import email
>>> msg = email.message_from_string(full_message)
I want to get message fields like 'From', 'To' and 'Subject'. However, the message object does not contain any of these fields.
What am I doing wrong?

You must ensure that the lines are not accidentally broken (as they are above, though it's hard to say if that was a copy-paste problem) -- with an intact message such as:
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44) by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3 for <username#domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB#mail.gmail.com>
Subject: TEST 12
From: Full Name <username#sender.com>
To: username#domain.com
Content-Type: text/plain; charset=ISO-8859-1
ONE
TWO
THREE
then
import email
msg = email.message_from_string(msgtxt)
print(msg['Subject'])
prints TEST 12 as desired.
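Since Procmail pipes the message to the script on stdin, the same lookup can be done directly on a byte stream. The following is a minimal Python 3 sketch under that assumption; the helper name `parse_piped_message` and the sample addresses are made up for illustration, and the `BytesIO` object only stands in for the real input.

```python
import email
import io


def parse_piped_message(stream):
    """Parse one message from a binary stream and return a few headers."""
    msg = email.message_from_binary_file(stream)
    return {name: msg[name] for name in ("From", "To", "Subject")}


# Stand-in for what Procmail would pipe in. In process_mail.py itself,
# pass sys.stdin.buffer (after `import sys`) instead of this BytesIO object.
sample = io.BytesIO(
    b"From hostname Tue Jun 15 21:43:30 2010\n"
    b"Subject: TEST 12\n"
    b"From: Full Name <sender@example.org>\n"
    b"To: username@example.com\n"
    b"\n"
    b"ONE\nTWO\nTHREE\n"
)
fields = parse_piped_message(sample)
print(fields["Subject"])
```

Reading bytes rather than text avoids decoding errors on messages whose bodies are not valid in the locale's default encoding.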

It looks like the continuation lines of your folded headers start without any leading whitespace, which is not valid according to RFC 2822 §2.2.3:
Each header field is logically a single line of characters comprising
the field name, the colon, and the field body. For convenience
however, and to deal with the 998/78 character limitations per line,
the field body portion of a header field can be split into a multiple
line representation; this is called "folding". The general rule is
that wherever this standard allows for folding white space (not
simply WSP characters), a CRLF may be inserted before any WSP. For
example, the header field:
Subject: This is a test
can be represented as:
Subject: This
 is a test
It should look something like this:
From hostname Tue Jun 15 21:43:30 2010
Received: (qmail 8580 invoked from network); 15 Jun 2010 21:43:22 -0400
Received: from mail-fx0-f44.google.com (209.85.161.44)
    by ip-73-187-35-131.ip.secureserver.net with SMTP; 15 Jun 2010 21:43:22 -0400
Received: by fxm19 with SMTP id 19so170709fxm.3
    for <username#domain.com>; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.103.84.1 with SMTP id m1mr2774225mul.26.1276652853684; Tue, 15
    Jun 2010 18:47:33 -0700 (PDT)
Received: by 10.123.143.4 with HTTP; Tue, 15 Jun 2010 18:47:33 -0700 (PDT)
Date: Tue, 15 Jun 2010 20:47:33 -0500
Message-ID: <AANLkTikFsIjJ3KYW1HJWcAqQlGXNiXE2YMzrj39I0tdB#mail.gmail.com>
Subject: TEST 12
From: Full Name <username#sender.com>
To: username#domain.com
Content-Type: text/plain; charset=ISO-8859-1
ONE
TWO
THREE
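The effect of the missing whitespace can be reproduced with a short sketch. The two messages below are invented for illustration (the host names are placeholders) and differ only in the leading space on the continuation line:

```python
import email

# Continuation line starts in column 0: the parser treats it as the
# start of the body, so every header after it is lost.
broken = (
    "Received: from mail.example.org\n"
    "by mx.example.com with SMTP\n"
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)

# Same message with the continuation line properly folded (leading space).
folded = (
    "Received: from mail.example.org\n"
    " by mx.example.com with SMTP\n"
    "Subject: TEST 12\n"
    "\n"
    "ONE\n"
)

broken_msg = email.message_from_string(broken)
folded_msg = email.message_from_string(folded)

print(broken_msg["Subject"])  # None
print(folded_msg["Subject"])  # TEST 12
```

This matches the symptom in the question: parsing stops at the first unfolded continuation line, so 'From', 'To' and 'Subject' never make it into the message object.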

Answering my own question:
I found a bug in the code that builds the messages. It was inserting extra line breaks between some lines, which prevented the parser from working properly.

Related

Multiple response parsing in python

I am using the curl command to access Hadoop (WebHDFS), and I am using Python to parse the HTTP response.
However, when I run the curl command, multiple responses are returned:
curl -i "http://host:50070/webhdfs/v1/user/hduser/pigtest?op=GETFILESTATUS"
HTTP/1.1 401 Authentication required
Cache-Control: no-cache
Expires: Thu, 14 Jan 2016 10:04:23 GMT
Date: Thu, 14 Jan 2016 10:04:23 GMT
Pragma: no-cache
Expires: Thu, 14 Jan 2016 10:04:23 GMT
Date: Thu, 14 Jan 2016 10:04:23 GMT
Pragma: no-cache
Content-Type: plain/text
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Thu, 14 Jan 2016 10:04:23 GMT
Date: Thu, 14 Jan 2016 10:04:23 GMT
Pragma: no-cache
Expires: Thu, 14 Jan 2016 10:04:23 GMT
Date: Thu, 14 Jan 2016 10:04:23 GMT
Pragma: no-cache
Content-Type: application/json
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)
{"FileStatus":{"accessTime":1452062206193,"blockSize":134217728,"childrenNum":0,"fileId":39295,"group":"hdfs","length":753,"modificationTime":1452062206392,"owner":"hduser","pathSuffix":"","permission":"644","replication":2,"storagePolicy":0,"type":"FILE"}}
How do I parse these multiple responses in Python?
Thanks
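One possible approach, sketched here under some assumptions: the raw `curl -i` output has been captured as a string, each header block ends with a blank line (as HTTP requires), and only the final response carries a body. The function name `split_responses` and the shortened sample data are illustrative only.

```python
import json
from email.parser import Parser


def split_responses(raw):
    """Split `curl -i` output into (status_line, headers, body) tuples.

    Assumes every response except the last has an empty body.
    """
    text = raw.replace("\r\n", "\n")
    responses = []
    while text.startswith("HTTP/"):
        # Everything up to the first blank line is one header block.
        head, _, text = text.partition("\n\n")
        status_line, _, header_block = head.partition("\n")
        headers = Parser().parsestr(header_block, headersonly=True)
        responses.append([status_line, headers, ""])
    if responses:
        # Whatever remains after the last header block is its body.
        responses[-1][2] = text
    return [tuple(r) for r in responses]


raw = (
    "HTTP/1.1 401 Authentication required\r\n"
    "Content-Type: plain/text\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"FileStatus":{"length":753,"owner":"hduser","type":"FILE"}}'
)

status_line, headers, body = split_responses(raw)[-1]
status_code = int(status_line.split()[1])
file_status = json.loads(body)["FileStatus"]
print(status_code, headers["Content-Type"], file_status["owner"])
```

Taking the last tuple skips the intermediate 401 challenge and works with the final response only.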

Email flagged as SPAM with correct authentication

I'm using Python with the Django framework. I am sending registration emails from my website (when a user registers).
Using this snippet, I authenticate my email with DKIM (the DNS is correctly configured).
I also added an SPF record to my DNS.
On both Gmail and Hotmail, I see spf=pass and dkim=pass.
But my email is still flagged as spam. I made sure to use appropriate vocabulary; it's a text email with only one link (for registering). I am using no-reply#mydomain.com in the From field of my email.
EDIT: After a few changes I managed to get a "proper" header for my email. This is what it looks like (received on my Hotmail account, still flagged as spam; I replaced my domain name with mydomain.com and the IP address with stars, but they are correct):
x-store-info:4r51+eLowCe79NzwdU2kRyU+pBy2R9QCQ99fuVSCLVOS47rfbRPiE7iaYhO1ERiggdK+K18l1xsWi4P40pG/T41xqL9zUAoU17o0RrecEQY1EuSFAsrgi0P9JxG/GRiKRWTxOOBRX7E=
Authentication-Results: hotmail.com; spf=pass (sender IP is ***.***.***.***) smtp.mailfrom=no-reply#mydomain.com; dkim=pass header.d=mydomain.com; x-hmca=pass header.id=no-reply#mydomain.com
X-SID-PRA: no-reply#mydomain.com
X-AUTH-Result: PASS
X-SID-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM9MDtsPTA7YT0wO0Q9MjtHRD0yO1NDTD02
X-Message-Info: 11chDOWqoTn7F4e7hHYwxaXv9iZKZZyIKj/+21TGh6QZKczxEHQs4rb60Cxfdi09jTLkRJAecG6MEZoumj8BxQZCAkaW+YvuWguCAySgqkkiNyD1AL4MyP3BFzgaoF2ZXtaGotKTc8c/ChQJkPtnUkHdes5iALGuXQjNzKRE6CJjxAGItrK/tX2h6cQRePYbs40w9kwlyrSKjnMd0tsAss5uWWZc2J8a
Received: from mydomain.com ([***.***.***.***]) by BAY004-MC3F39.hotmail.com over TLS secured channel with Microsoft SMTPSVC(7.5.7601.22712);
Wed, 9 Jul 2014 08:18:05 -0700
Received: from mydomain.com (localhost.localdomain [127.0.0.1])
by mydomain.com (8.14.4/8.14.4/Debian-4) with ESMTP id s69FI3wS030630
for <*********#hotmail.fr>; Wed, 9 Jul 2014 17:18:03 +0200
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=mydomain.com;
i=#mydomain.com; q=dns/txt; s=selector; t=1404919083; h=MIME-Version
: Content-Type : Content-Transfer-Encoding : Subject : From : To : Date
: Message-ID; bh=k7X+9bPwn6CQYmdYxiU1/FA763QwNClj01j8KmwLN2k=; b=Xg53TzAVYu7/7hnSJpH0NPsXhR2xasyW/Oo37XNSdWGOmZFP95way23mFMgT370IGv/rlTf+LJgYuH1grPRoVgR9Oif89uwLf9FIWx0CTwNlG9ONvKgTX3I91J8lAn/5KaMHW3sF/6C6CYhu9+nP8bh1JcuiuHq3zlYZLv2zQQQ=
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Subject: Activation de votre compte Mydomain
From: Mydomain <no-reply#mydomain.com>
To: *********#hotmail.fr
Date: Wed, 09 Jul 2014 15:18:03 -0000
Message-ID: <20140709151803.30554.31146#mydomain.com>
Return-Path: no-reply#mydomain.com
X-OriginalArrivalTime: 09 Jul 2014 15:18:05.0604 (UTC) FILETIME=[FB999E40:01CF9B88]
Now I really don't understand what causes the email to be flagged as spam. I also checked blacklists; the domain isn't blacklisted.
I also did a test here, and the results are the same: DKIM detected and check PASS, SPF PASS, SpamAssassin Score: -2.011 "Message is NOT marked as spam". The only empty box is "DomainKeys Information : Message does not contain a DomainKeys Signature" (I can't find anything explaining how that differs from DKIM).
NB: After goncalopp's comment, I wondered whether this question should be on Serverfault instead of here. Should I remove it and ask there?
So after changing a few settings I managed to get this header (IP address and domain masked for confidentiality). It seems to be a clean header and passes all authentication tests:
Delivered-To: **********#gmail.com
Received: by 10.140.103.77 with SMTP id x71csp25213qge;
Thu, 17 Jul 2014 07:12:51 -0700 (PDT)
X-Received: by 10.180.109.168 with SMTP id ht8mr22242453wib.68.1405606370624;
Thu, 17 Jul 2014 07:12:50 -0700 (PDT)
Return-Path: <no-reply#**********.com>
Received: from mail.**********.com (**********com. [**********])
by mx.google.com with ESMTP id r8si9159599wia.83.2014.07.17.07.12.48
for <**********#gmail.com>;
Thu, 17 Jul 2014 07:12:50 -0700 (PDT)
Received-SPF: pass (google.com: domain of no-reply#**********.com designates ********** as permitted sender) client-ip=**********;
Authentication-Results: mx.google.com;
spf=pass (google.com: domain of no-reply#**********.com designates ********** as permitted sender) smtp.mail=no-reply#**********.com;
dkim=pass header.i=#**********.com
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=**********.com;
i=#**********.com; q=dns/txt; s=selector; t=1405606368; h=MIME-Version
: Content-Type : Content-Transfer-Encoding : Subject : From : To : Date
: Message-ID; bh=PblNSkQvil33DWRvqe8DinhP7RB+k1OiDCBjgpR7DuE=; b=T4ti1yJsxUE2Uav6UYr+WznqZFrDVvAIoUN8G6voMWr4hUGVdC7u+QkR+d87SY4cN0nklbTWBXJ7gSOhR6r1d0NQZbg3jmRZzYxofPwayMRicYfUw1brWnrSnCUDQ98aUPv4qi9okb2/8vuu5yCKk5irarGrNQk+smnhVEFbqbA=
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
Subject: Activation de votre compte **********
From: ********** <no-reply#**********.com>
To: **********#gmail.com
Date: Thu, 17 Jul 2014 14:12:48 -0000
Message-ID: <20140717141248.2687.75060#**********.com>
It is still going straight to the spam folder. From what I have read, it seems that my domain has to gain "trust" before its mail is treated as non-spam (i.e. users have to flag it as "not spam", after which my domain should be recognized better).
If anyone has any more suggestions, I would gladly take them :)
Hotmail/Outlook has SNDS (Smart Network Data Service), where you can register your IP and check its reputation, mail volume, bounces, and spam traps. Maybe your IP has a bad reputation.
https://postmaster.live.com/snds/

Python : How to determine if the raw-email-source contains HTML or TEXT email

I placed the raw email into a string named a.
I would like Python to tell me whether this email is TEXT or HTML.
a = """From root#a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo#a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root#localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root#a1.local.tld
Subject: oooooooooooooooo
To: ooo#a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861#a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.
--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooooooooooooooooooo
--bound1374805739--"""
import email
b = email.message_from_string(a)
Content-Type: text/plain means plain text
Content-Type: text/html means HTML
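Because the top-level Content-Type here is multipart/mixed, the text/plain or text/html label sits on the individual parts, not on the message itself. A minimal sketch of one way to check, assuming it is enough to collect the content types of all text parts; the function name `body_types` and the sample message are invented for illustration:

```python
import email


def body_types(raw_message):
    """Return the set of text/* content types found anywhere in the message."""
    msg = email.message_from_string(raw_message)
    return {
        part.get_content_type()
        for part in msg.walk()  # walk() visits the message and all subparts
        if part.get_content_maintype() == "text"
    }


raw = (
    "From: root@example.com\n"
    "To: ooo@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b1"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "hello\n"
    "--b1--\n"
)

types = body_types(raw)
if "text/html" in types:
    print("HTML")
elif "text/plain" in types:
    print("TEXT")
else:
    print("unknown")
```

A message that ships both variants (typically multipart/alternative) would yield both types, so the caller has to decide which one takes precedence; this sketch arbitrarily reports HTML first.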

Python : How to parse things such as : from, to, body, from a raw email source w/Python [duplicate]

The raw email usually looks something like this
From root#a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
for <ooo#a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root#localhost)
by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
Thu, 25 Jul 2013 19:28:59 -0700
From: root#a1.local.tld
Subject: ooooooooooooooooooooooo
To: ooo#a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861#a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"
This is a multi-part message in MIME format.
--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
--bound1374805739--
So if I wanted to write a Python script to get the
From
To
Subject
Body
Is this the code I should build on, or is there a better method?
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
a1 = re.findall(r'<(title)>(.*?)<(/title)>', a)
I don't really understand what your final code snippet has to do with anything - you haven't mentioned anything about HTML until that point, so I don't know why you would suddenly be giving an example of parsing HTML (which you should never do with a regex anyway).
In any case, to answer your original question about getting the headers from an email message, Python includes code to do that in the standard library:
import email
msg = email.message_from_string(email_string)
msg['from'] # 'root#a1.local.tld'
msg['to'] # 'ooo#a1.local.tld'
Fortunately Python makes this simpler: http://docs.python.org/2.7/library/email.parser.html#email.parser.Parser
from email.parser import Parser
parser = Parser()
email_text = """PUT THE RAW TEXT OF YOUR EMAIL HERE"""
# Named msg rather than email so the result does not shadow the email package.
msg = parser.parsestr(email_text)
print(msg.get('From'))
print(msg.get('To'))
print(msg.get('Subject'))
The body is trickier. Call msg.is_multipart(). If that's false, you can get the body by calling msg.get_payload(). However, if it's true, msg.get_payload() returns a list of messages, so you'll have to call get_payload() on each of those.
if msg.is_multipart():
    # Each item is itself a message object representing one MIME part.
    for part in msg.get_payload():
        print(part.get_payload())
else:
    print(msg.get_payload())
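On Python 3.6 or later, the multipart handling described above can also be delegated to the newer email API. This is only a sketch: `policy.default` and `get_body()` come from the standard library, while the function name `extract_body` and the sample message are made up for illustration.

```python
from email import policy
from email.parser import Parser


def extract_body(raw_message):
    """Return the decoded text body, preferring text/plain over text/html."""
    msg = Parser(policy=policy.default).parsestr(raw_message)
    part = msg.get_body(preferencelist=("plain", "html"))
    # get_body() returns None when no suitable text part exists.
    return part.get_content() if part is not None else None


raw = (
    "From: root@example.com\n"
    "To: ooo@example.com\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="b1"\n'
    "\n"
    "This is a multi-part message in MIME format.\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "Content-Transfer-Encoding: 7bit\n"
    "\n"
    "line one\n"
    "line two\n"
    "--b1--\n"
)

body = extract_body(raw)
print(body)
```

Unlike a bare get_payload(), get_content() also undoes the Content-Transfer-Encoding and decodes the text using the part's declared charset.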
"Body" is not present in your sample email
You can use the email module:
import email
msg = email.message_from_string(email_message_as_text)
Then use:
print(msg['To'])
print(msg['From'])
and so on for the other headers.
You should probably use email.parser
# The backslash keeps the string from starting with an empty line,
# which the parser would otherwise read as the end of the headers.
s = """\
From root#a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo#a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root#localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root#a1.local.tld
Subject: ooooooooooooooooooooooo
To: ooo#a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861#a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.
--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
--bound1374805739--
"""
import email.parser
msg = email.parser.Parser().parsestr(s)
# Lists the methods available on the parsed message object.
help(msg)
You could write that raw content to a file and then read the file like this:
with open('in.txt', 'r') as file:
    raw = file.readlines()
get_list = ['From:', 'To:', 'Subject:']
info_list = []
for line in raw:
    for word in get_list:
        if line.startswith(word):
            # Strip the trailing newline that readlines() keeps.
            info_list.append(line.strip())
Now info_list will be:
['From: root#a1.local.tld', 'Subject: ooooooooooooooooooooooo', 'To: ooo#a1.local.tld']
I don't see a Body: field in your raw content.
