I have printer status reports sent to my email. I would like to download them, process them one by one, and put all the information in a database for further processing.
I would like to use Python 3, as I am starting to learn it.
I have this code:
import getpass
import poplib
server = poplib.POP3('pop3.mailserver.com' )
server.user('report#mailserver.com')
server.pass_('pswd')
numMessages = 1 #len(server.list()[1])
emails, total_bytes = server.stat()
print("{0} emails in the inbox, {1} bytes total".format(emails, total_bytes))
for i in range(numMessages):
    for msg in server.retr(i+1)[1]:
        print(msg)
What I get is the whole email message (with headers and body) in this format:
b'Return-Path: <"tever">'
b'Delivered-To: reportc#mailserver.com'
b'Received: (qmail 13193 invoked by uid 89); 23 May 2012 08:44:51 -0000'
b'Received: by simscan 1.2.0 ppid: 13156, pid: 13164, t: 0.1620s'
b' scanners: clamav: 0.97-exp/m:53 spam: 3.3.1'
b'X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mxavas16.ad.aruba.it'
b'X-Spam-Level: *'
b'X-Spam-Status: No, score=1.4 required=5.0 tests=FH_FROMEML_NOTLD,INVALID_MSGID,'
b'\tT_FILL_THIS_FORM_SHORT autolearn=disabled version=3.3.1'
b'Received: from unknown (HELO smtplq02.aruba.it) (62.149.158.35)'
b' by mxavas16.ad.aruba.it with SMTP; 23 May 2012 08:44:51 -0000'
b'Received: (qmail 30750 invoked by uid 89); 23 May 2012 08:44:51 -0000'
b'Received: from unknown (HELO smtp8.aruba.it) (62.149.158.228)'
b' by smtplq02.aruba.it with SMTP; 23 May 2012 08:44:51 -0000'
b'Received: (qmail 30979 invoked by uid 89); 23 May 2012 08:44:51 -0000'
b'Received: from unknown (HELO NM7ACD31) (email#server.it#83.xxx.xxx.xxx)'
b' by smtp8.ad.aruba.it with SMTP; 23 May 2012 08:44:51 -0000'
b'Date: Wed, 23 May 2012 10:46:34 +0200'
b'From: tever'
b'Subject: QEQ1313212'
b'To: report#mailserver.com'
b'Message-Id: <201205231046340001d806.TEVER>'
b'Mime-Version: 1.0'
b'Content-Type: text/plain; charset="utf-8"'
b'Content-Transfer-Encoding: base64'
b''
b'RXF1aXBtZW50IElEOiAgICAgICAgICAgICANCk1vZGVsIE5hbWU6ICAgICAgICAg'
b'ICAgICAgQ0RDIDE3MjVfRENDIDI3MjUNClNlcmlhbCBOdW1iZXI6ICAgICAgICAg'
b'ICAgUUVRMTMxMzIxMg0KTWV0ZXJEYXRlOiAgICAgICAgICAgICAgICBXZWQgMjMg'
b'TWF5IDIwMTIgMTA6NDY6MzQNCkNvdW50ZXJzIGJ5IEZ1bmN0aW9uDQogUHJpbnRl'
b'ZCBQYWdlcw0KICBDb3BpZXI6ICAgICAgICAgICAgICAgICAyMjE1ICAgIA0KICBQ'
b'cmludGVyOiAgICAgICAgICAgICAgICAxMTEyMDQgIA0KICBGQVg6ICAgICAgICAg'
b'ICAgICAgICAgICA5MzIgICAgIA0KICBUb3RhbDogICAgICAgICAgICAgICAgICAx'
b'MTQzNTEgIA0KIFNjYW5uZWQgUGFnZXMNCiAgQ29waWVyOiAgICAgICAgICAgICAg'
b'ICAgMTkxOSAgICANCiAgRkFYOiAgICAgICAgICAgICAgICAgICAgMjIwNyAgICAN'
b'CiAgT3RoZXI6ICAgICAgICAgICAgICAgICAgMTgyMiAgICANCiAgVG90YWw6ICAg'
b'ICAgICAgICAgICAgICAgNTk0OCAgICANCkNvdW50ZXJzIGJ5IFBhcGVyIFNpemUN'
b'Ck1vbm9jaHJvbWUNCiAgQTM6ICAgICAgICAgICAgICAgICAgICAgNDU0ICAgICAN'
b'CiAgQjQ6ICAgICAgICAgICAgICAgICAgICAgMCAgICAgICANCiAgQTQ6ICAgICAg'
b'ICAgICAgICAgICAgICAgMTA4MDQ4ICANCiAgQjU6ICAgICAgICAgICAgICAgICAg'
b'ICAgNDI3ICAgICANCiAgQTU6ICAgICAgICAgICAgICAgICAgICAgMCAgICAgICAN'
b'CiAgRm9saW86ICAgICAgICAgICAgICAgICAgMSAgICAgICANCiAgTGVkZ2VyOiAg'
b'ICAgICAgICAgICAgICAgMCAgICAgICANCiAgTGVnYWw6ICAgICAgICAgICAgICAg'
b'ICAgMCAgICAgICANCiAgTGV0dGVyOiAgICAgICAgICAgICAgICAgMCAgICAgICAN'
b'CiAgU3RhdGVtZW50OiAgICAgICAgICAgICAgMCAgICAgICANCiAgT3RoZXIxOiAg'
b'ICAgICAgICAgICAgICAgMCAgICAgICANCiAgT3RoZXIyOiAgICAgICAgICAgICAg'
b'ICAgMiAgICAgICANCk1vbm8gQ29sb3INCiAgQTM6ICAgICAgICAgICAgICAgICAg'
b'ICAgMCAgICAgICANCiAgQjQ6ICAgICAgICAgICAgICAgICAgICAgMCAgICAgICAN'
b'CiAgQTQ6ICAgICAgICAgICAgICAgICAgICAgMCAgICAgICANCiAgQjU6ICAgICAg'
b'IE90aGVyIEVycm9ycw0KDQo8V2VkIDIzIE1heSAyMDEyIDEwOjQxOjU0Pg0KICBb'
b'IF0gQWxsIE90aGVyIEVycm9ycw0KDQo8V2VkIDIzIE1heSAyMDEyIDEwOjQ1OjIx'
b'Pg0KICBbKl0gQWRkIFBhcGVyDQoNCi0tLS0tLS0tLS0tLS0tLS0tLS0NCkNEQyAx'
b'NzI1X0RDQyAyNzI1DQpbMDA6YzA6ZWU6N2E6Y2Q6MzFdDQotLS0tLS0tLS0tLS0t'
b'LS0tLS0t'
b'DQo='
b''
What I need is to process the body content line by line and, if a message matches, delete it from the server.
Any tips on how to do it?
Many thanks,
Gerard
Parsing the message would be a good start:
# ... get your message ...
# msg = [b'Return-Path: <"tever">',
#        b'Delivered-To: reportc#mailserver.com', ... ]
import email
# decode simple non-multipart message
message = email.message_from_bytes(b'\n'.join(msg))
payload = message.get_payload(decode=True)
payload = payload.decode(message.get_content_charset())
print(payload)
Then you can do whatever you need with the payload.
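Putting the pieces together, a minimal sketch of the whole loop could look like the following. It assumes the report bodies consist of "Name: value" lines as in the sample above. The names decode_body, parse_fields, process_mailbox and wanted_key are hypothetical, made up for illustration, and the database step is left as a comment.

```python
import email


def decode_body(raw_lines):
    """Join the byte lines returned by POP3.retr() and return the decoded text body."""
    message = email.message_from_bytes(b"\r\n".join(raw_lines))
    payload = message.get_payload(decode=True)  # undoes base64 / quoted-printable
    charset = message.get_content_charset() or "utf-8"
    return payload.decode(charset, errors="replace")


def parse_fields(body):
    """Turn 'Name:   value' lines into a dict; lines without a value are skipped.

    Repeated names (for example 'Copier' under two sections) overwrite each
    other here, so a real parser would need to track the section heading too.
    """
    fields = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def process_mailbox(server, wanted_key="Serial Number"):
    """Hypothetical driver: `server` is an already logged-in poplib.POP3 object."""
    rows = []
    count, _total_bytes = server.stat()
    for number in range(1, count + 1):  # POP3 message numbers start at 1
        fields = parse_fields(decode_body(server.retr(number)[1]))
        if wanted_key in fields:
            rows.append(fields)   # assumption: insert into your database here
            server.dele(number)   # only marks the message for deletion
    server.quit()                 # deletions take effect when the session ends
    return rows
```

Note that POP3 only removes the marked messages when the session is closed with quit(), so a crash halfway through leaves the mailbox untouched.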
I am working on a piece of code to get a value from a Gmail message, but the email itself is HTML, so the code returns HTML inside a list, which I am unable to parse.
My Code:
import imaplib
ORG_EMAIL = "comapnyname.com"
FROM_EMAIL = "automation#companyname.co"
FROM_PWD = "password123!"
SMTP_SERVER = "imap.gmail.com"
def read_email_from_gmail():
    mail = imaplib.IMAP4_SSL(SMTP_SERVER)
    mail.login(FROM_EMAIL, FROM_PWD)
    mail.select("inbox")
    email_type, data = mail.search(None, "ALL")
    mail_ids = data[0]
    id_list = mail_ids.split()
    latest_email_id = int(id_list[-1])
    email_type, data = mail.fetch(str.encode(str(latest_email_id)), "(RFC822)")
    string_data = str(data)
    print('MAIL Data: ')
    print(string_data)

read_email_from_gmail()
This code returns a long list that contains HTML:
[(b'1 (RFC822 {54624}', b'Delivered-To: automation+qa1#spekit.co\r\nReceived: by 2002:a4a:6f04:0:0:0:0:0 with SMTP id h4csp1519301ooc;\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nX-Google-Smtp-Source: ABdhPJy/7yOn17HKdn+QjP0XHEOK2fu8LDL8tz4jDmDKemms2GVyykqDCDUfppmRbV4DUi7ckRRg\r\nX-Received: by 2002:a25:d7cd:: with SMTP id o196mr14075369ybg.91.1599754722247;\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nARC-Seal: i=1; a=rsa-sha256; t=1599754722; cv=none;\r\n d=google.com; s=arc-20160816;\r\n b=KzNg7bsmLaNcrRMihkN+AwlTp8ybj5D65K+Z21Ddl/lgd2LN90InAWhj+guhrmzHtB\r\n vw83T4AlJ8u2jpAs5qYUbxgd/R5COLhlRDqR/dE4wljRgIq2W6sVCJo/fGuZruFjob4Z\r\n h1acPat0xa3h83lJzzbH576KggTqdScMwCbLsujPr/FclnHNjkqxQuFQlV23nAGgvWX8\r\n raiIW+6wC070tmQaaz3feIVfo7r7cmQBGokOmy8B3of0/kqIyMVuaEkmk2kno8VFvILF\r\n i8YPq7bOHVNpre7KwiG4r69PdaDRXIcd/ETtuyusfNXOrGJ0QhC44j2eLUpxlRltOGgL\r\n NAeA==\r\nARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;\r\n h=mime-version:date:message-id:to:subject:from:dkim-signature\r\n :dkim-signature;\r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=;\r\n b=WjnIWwVX2oWrl3aZoKlzck1GAoy/gT5/cbNP+tnmdypfjvAUTyuZ3OO5xXlZB/CiF9\r\n PkYZFEzJQSxradr3ky5T7tLmV2qKnHfaIp3G3STUs5f9vhSfp6qknV7ouLBGwCWyp2gp\r\n e14Aek7M5ciVC1GIjxlr7AXZne4eHSwCb7u8j91Yt8B2getEQ9lyQlChwjYf38Kau5lL\r\n wPmMtAM0DDOqlNff2gTBEFgAX1s0Wk+g8mKS31tzBMIQvayR+a3PHX+S3zhtC2i1XsLm\r\n NOWSMsI0ZEEk/mjA36DVWhEN0d9llOwiDfFonXxIkcPZLlNR3zGfA61apTeud7i24vYn\r\n bfCw==\r\nARC-Authentication-Results: i=1; mx.google.com;\r\n dkim=pass header.i=#spekit.co header.s=mandrill header.b=RhjFdk+T;\r\n dkim=pass header.i=#mandrillapp.com header.s=mandrill header.b=SusUoY2S;\r\n spf=pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) smtp.mailfrom=bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com;\r\n dmarc=pass (p=NONE sp=NONE dis=NONE) 
header.from=spekit.co\r\nReturn-Path: <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>\r\nReceived: from mail180-17.suw31.mandrillapp.com (mail180-17.suw31.mandrillapp.com. [198.2.180.17])\r\n by mx.google.com with ESMTPS id t10si6240908ybl.463.2020.09.10.09.18.42\r\n for <automation+qa1#spekit.co>\r\n (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) client-ip=198.2.180.17;\r\nAuthentication-Results: mx.google.com;\r\n dkim=pass header.i=#spekit.co header.s=mandrill header.b=RhjFdk+T;\r\n dkim=pass header.i=#mandrillapp.com header.s=mandrill header.b=SusUoY2S;\r\n spf=pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) smtp.mailfrom=bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com;\r\n dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=spekit.co\r\nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=spekit.co;\r\n h=From:Subject:To:Message-Id:Date:MIME-Version:Content-Type; i=support#spekit.co;\r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=;\r\n b=RhjFdk+Tvr3HP43qJoKzVowGAs1SYJFfpq8MK4firz5tcpBYn3UEP/Z5cF+IBA74/PTmCahgTnXi\r\n /EPSbY2b+20ERj4s4VUnwNZw8t4L98gSQiM6o3mF4iVI2JIgABU2Tn2nmB68kGZyxeSOs4bWtE+s\r\n MXleLzg+uTftETJoUhM=\r\nReceived: from pmta03.mandrill.prod.suw01.rsglab.com (127.0.0.1) by mail180-17.suw31.mandrillapp.com id hb98u422sc0h for <automation+qa1#spekit.co>; Thu, 10 Sep 2020 16:18:42 +0000 (envelope-from <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>)\r\nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; \r\n i=#mandrillapp.com; q=dns/txt; s=mandrill; t=1599754721; h=From : \r\n 
Subject : To : Message-Id : Date : MIME-Version : Content-Type : From : \r\n Subject : Date : X-Mandrill-User : List-Unsubscribe; \r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=; \r\n b=SusUoY2SOQosSQrzHafHGf7Pto1Ol3PDGU067dNsjT1ZIOuSP0Dz7DJwqgFn6NpwAV7X7e\r\n pzQQPyDJoAqQCjCdSqG9mp80hAEGwQC89GNu78a8o0NRC+BPRTGaNKV/jX06cXsgp+A4KXfY\r\n 13x1BInjKraTnCYz9TnzDUChIm3pg=\r\nFrom: Support <support#spekit.co>\r\nSubject: Your Spekit Login PIN\r\nReturn-Path: <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>\r\nReceived: from [3.128.246.0] by mandrillapp.com id 8084cafe0c6c4aeca73fef8bdaf5b70b; Thu, 10 Sep 2020 16:18:41 +0000\r\nTo: Automation <automation+qa1#spekit.co>\r\nX-Report-Abuse: Please forward a copy of this message, including all headers, to abuse#mandrill.com\r\nX-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31064008.8084cafe0c6c4aeca73fef8bdaf5b70b\r\nX-Mandrill-User: md_31064008\r\nMessage-Id: <31064008.20200910161841.5f5a51e1e2be13.10518479#mail180-17.suw31.mandrillapp.com>\r\nDate: Thu, 10 Sep 2020 16:18:41 +0000\r\nMIME-Version: 1.0\r\nContent-Type: multipart/alternative; boundary="_av-l5kOy35rlKJaV18wYlOHPA"\r\n\r\n--_av-l5kOy35rlKJaV18wYlOHPA\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n Your Spekit Login PIN \r\n Hi Automation, Someone (hopefully you) just\r\nlogged into your Spekit account with the email *automation+qa1#spekit.co*. \r\n \r\n If this was you, please use the code below to log-in, otherwise please\r\ncontact your admin and reset your password ASAP.\r\n =3D *952681* =3D\r\n\r\n Enter PIN <https://app.spekit.co/verifypin>\r\n<http://www.twitter.com/spekitapp>\r\n<https://www.linkedin.com/company/spekit/> <https://medium.com/spekit>\r\n<https://spekit.co/> \r\nQuestions? Contact us. <mailto:support#spekit.co>\r\n Copyright =C2=A9 2018 Spekit, Inc. 
All rights reserved.\r\n\r\n--_av-l5kOy35rlKJaV18wYlOHPA\r\nContent-Type: text/html; charset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<!doctype html>\r\n<html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:schemas-microso=\r\nft-com:vml" xmlns:o=3D"urn:schemas-microsoft-com:office:office">\r\n <head>\r\n <!-- NAME: 1 COLUMN - FULL WIDTH -->\r\n <!--[if gte mso 15]>\r\n <xml>\r\n <o:OfficeDocumentSettings>\r\n <o:AllowPNG/>\r\n <o:PixelsPerInch>96</o:PixelsPerInch>\r\n </o:OfficeDocumentSettings>\r\n </xml>\r\n <![endif]-->\r\n <meta charset=3D"UTF-8">\r\n <meta http-equiv=3D"X-UA-Compatible" content=3D"IE=3Dedge">\r\n <meta name=3D"viewport" content=3D"width=3Ddevice-width, initial-sc=\r\nale=3D1">\r\n <title>Your Spekit Login PIN</title>\r\n \r\n <style type=3D"text/css">\r\n=09=09p{\r\n=09=09=09margin:10px 0;\r\n=09=09=09padding:0;\r\n=09=09}\r\n=09=09table{\r\n=09=09=09border-</tbody></table> ')']
I need to get the value '952681', which appears twice in the message. Can someone help me with that?
If the format of the email stays the same, you can use a regex to parse the returned HTML string:
import re
pattern = r'\*([\s\S]*?)\*'
res = re.findall(pattern, your_email_text)
The variable res contains your number at the second position:
['automation+qa1#spekit.co', '952681']
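An alternative that does not depend on the position in the result list is to let the email module pick out the text/plain part and search only that. The sketch below assumes the PIN is always six digits between asterisks, as in the sample. The function name extract_pin is made up for illustration. With imaplib, the raw message bytes are in data[0][1] of the fetch result rather than in str(data).

```python
import email
import re

# Assumption: the PIN is six digits wrapped in asterisks, e.g. *952681*
PIN_PATTERN = re.compile(r"\*(\d{6})\*")


def extract_pin(raw_bytes):
    """Return the PIN found in the text/plain part of a raw message, or None."""
    message = email.message_from_bytes(raw_bytes)
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        charset = part.get_content_charset() or "utf-8"
        # decode=True undoes the quoted-printable transfer encoding
        text = part.get_payload(decode=True).decode(charset, errors="replace")
        match = PIN_PATTERN.search(text)
        if match:
            return match.group(1)
    return None
```

In the question's function this would be called as extract_pin(data[0][1]) right after mail.fetch(...).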
I am using this regular expression for SIP (Session Initiation Protocol) URIs to extract the different internal variables.
_syntax = re.compile('^(?P<scheme>[a-zA-Z][a-zA-Z0-9\+\-\.]*):' # scheme
+ '(?:(?:(?P<user>[a-zA-Z0-9\-\_\.\!\~\*\'\(\)&=\+\$,;\?\/\%]+)' # user
+ '(?::(?P<password>[^:#;\?]+))?)#)?' # password
+ '(?:(?:(?P<host>[^;\?:]*)(?::(?P<port>[\d]+))?))' # host, port
+ '(?:;(?P<params>[^\?]*))?' # parameters
+ '(?:\?(?P<headers>.*))?$') # headers
m = URI._syntax.match(value)
if m:
    self.scheme, self.user, self.password, self.host, self.port, params, headers = m.groups()
I also want to extract specific headers such as Via, branch, Contact, Call-ID or CSeq.
The general form of a sip message is:
OPTIONS sip:172.16.18.35:5060 SIP/2.0
Content-Length: 0
Via: SIP/2.0/UDP 172.16.18.90:5060
From: "fake" <sip:fake#172.16.18.90>
Supported: replaces, timer
User-Agent: SIPPing
To: <sip:172.16.18.35:5060>
Contact: <sip:fake#172.16.18.90:5060>
CSeq: 1 OPTIONS
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH
Call-ID: fake-id#172.16.18.90
Date: Thu, 25 Apr 2013 003024 +0000
Max-Forwards: 70
I would suggest taking advantage of the intentional similarities between SIP header format and RFC822.
from email.parser import Parser
msg = Parser().parsestr(m.group('headers'))
...thereafter:
>>> msg.keys()
['Content-Length', 'Via', 'From', 'Supported', 'User-Agent', 'To', 'Contact', 'CSeq', 'Allow', 'Call-ID', 'Date', 'Max-Forwards']
>>> msg['To']
'<sip:172.16.18.35:5060>'
>>> msg['Date']
'Thu, 25 Apr 2013 003024 +0000'
...etc. See the documentation for the Python standard-library email module for more details.
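For a complete SIP message like the one in the question, the request line has to be split off first, because it is not a header. A minimal sketch, assuming the message arrives as a single string. The helper name parse_sip_message is made up for illustration.

```python
from email.parser import Parser


def parse_sip_message(text):
    """Split a SIP message into its request line and a header mapping."""
    # Everything up to the first newline is the request (or status) line
    request_line, _, header_block = text.partition("\n")
    # headersonly=True stops the parser from looking for a body structure
    headers = Parser().parsestr(header_block, headersonly=True)
    return request_line.strip(), headers


sample = (
    "OPTIONS sip:172.16.18.35:5060 SIP/2.0\r\n"
    "Content-Length: 0\r\n"
    "Via: SIP/2.0/UDP 172.16.18.90:5060\r\n"
    "CSeq: 1 OPTIONS\r\n"
    "Call-ID: fake-id#172.16.18.90\r\n"
    "\r\n"
)
request_line, headers = parse_sip_message(sample)
print(request_line)
print(headers["CSeq"], headers["Call-ID"])
```

Header lookup is case-insensitive, so headers["cseq"] and headers["CSeq"] return the same value.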
I'm hoping this is just something simple. I'm trying to determine whether or not an email is already encrypted.
# Read e-mail from stdin
raw = sys.stdin.read()
raw_message = email.message_from_string( raw )
I took the example from http://docs.python.org/2/howto/regex.html on doing a simple test for match.
p = re.compile(r'-----BEGIN\sPGP\sMESSAGE-----')
m = p.match(raw)
if m:
    log = open(cfg['logging']['file'], 'a')
    log.write("THIS IS ENCRYPTED")
    log.close()
else:
    log = open(cfg['logging']['file'], 'a')
    log.write("NOT ENCRYPTED:")
    log.close()
The email is read. The log file is written to but it always comes back no match. I've written raw to a logfile and that string is present.
Not sure where to go next.
UPDATE:
Here is the output from a raw ( a simple test message )
Sending email to: <bruce#packetaddiction.com>
Received: from localhost (localhost [127.0.0.1])
by mail2.packetaddiction.com (Postfix) with ESMTP id 5FE5D22A65
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 16:19:12 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at mail2.packetaddiction.com
Received: from mail2.packetaddiction.com ([127.0.0.1])
by localhost (mail2.packetaddiction.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id cc3zZ_izEb1j for <bruce#packetaddiction.com>;
Tue, 10 Sep 2013 16:19:06 +0000 (UTC)
Received: from mail.secryption.com (mail.secryption.com [178.18.24.223])
by mail2.packetaddiction.com (Postfix) with ESMTPS id 9CA3C22A5B
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 16:19:06 +0000 (UTC)
Received: from localhost (localhost.localdomain [127.0.0.1])
by mail.secryption.com (Postfix) with ESMTP id 9994E1421F81
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 12:19:19 -0400 (EDT)
X-Virus-Scanned: Debian amavisd-new at mail.secryption.com
Received: from mail.secryption.com ([127.0.0.1])
by localhost (mail.secryption.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id WbkVn_cowG6q for <bruce#packetaddiction.com>;
Tue, 10 Sep 2013 12:19:18 -0400 (EDT)
Received: from dennis.cng.int (mail.compassnetworkgroup.com [173.163.129.21])
(using TLSv1 with cipher RC4-MD5 (128/128 bits))
(No client certificate requested)
by mail.secryption.com (Postfix) with ESMTPSA id 5B4191421F80
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 12:19:18 -0400 (EDT)
User-Agent: K-9 Mail for Android
MIME-Version: 1.0
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: Message
From: Bruce Markey <bruce#secryption.com>
Date: Tue, 10 Sep 2013 12:19:00 -0400
To: "bruce#packetaddiction.com" <bruce#packetaddiction.com>
Message-ID: <36615ed6-a1a9-49ac-ac85-31905916d478#email.android.com>
-----BEGIN PGP MESSAGE-----
Version: APG v1.0.8
hQEMAwPNxvNWsisWAQgAuOTLkiitYzhGJydOzN4sBoGjhRm9JeJMfmxKxKTKcV2W
ZBuN0z+nS1KxnXrIlahhwLtpiFvp5apI8wAyAiLC2BhFieFttOl1/xLVJbd1nI1o
KQE1RUXhPURejJ3eH9g/LmkhtFQcnsuHGTGnLi6dugBNhWLqgnLUBX+VLt6moz2C
84lDuQ1y7B/JFOctKRScUqmxDd8b2peZJOnVT/p0tSYNfN9QGH3W02FZShE4KKBl
HpezK8KC6cZdf34Eao+ep+fP5DuKx/4j3ksCbFKyQ3gd+yxK/xnhkijDsYCfFRiF
ElAGDvXu4RXqrKRpBxq1bRhU8YqS7j5593MTUViWitLAGgH1DV0UeA/B5LMUDRyz
4ZfDqd0kDYsPUy2Cg20HdXHaobkzdvHLzfqQq0Owc1nTcvu4nzCbIMhTAlZjn8ZA
aODTlKcvnFBWEtNERPm0x6nkbhMo3GeysejaJSRod3aGqhuhga4iIrrew1W03297
aalwY8RKeNoV15VItsyrbbT+HvDNSaFFCPUAs+KcLHCOez5/woozjlqKdBI6yHCe
gqpYJPP07qFsVviltfDO63xS48f2HCPe4iyXCy6Usp0+jM7zAzH7KH1O854GH46Q
r0A01DLo9REmDr4U
=pBQZ
-----END PGP MESSAGE-----
re.match only finds a match at the beginning of the string. Your marker sits after the headers, so you want to use re.search instead:
raw = """Sending email to: <bruce#packetaddiction.com>...
...
-----BEGIN PGP MESSAGE-----
...
"""
>>> p = re.compile(r'-----BEGIN\sPGP\sMESSAGE-----')
>>> m = p.search(raw)
>>> m
<_sre.SRE_Match object at 0x0000000002E02510>
>>> m.group()
'-----BEGIN PGP MESSAGE-----'
>>> m = p.match(raw)
>>> print(m)
None
Although, as noted, regex is likely overkill for this problem as the matching text is static.
Regular expressions are used when you want a "fuzzy" match - that is, you aren't sure if the string you are looking for will be identical every time.
Here the string you are looking for appears to be exactly -----BEGIN PGP MESSAGE-----, so the str.find method (or the in operator) will be simpler to use and faster to boot.
>>> a = "This is a PGP encrypted email. -----BEGIN PGP MESSAGE----- !##$%^..."
>>> b = "This is not encrypted. My hovercraft is full of eels." #example strings
>>> a.find("-----BEGIN PGP MESSAGE-----")
31 # Return value 31 means that the search string was found at index 31 of the source string
>>> b.find("-----BEGIN PGP MESSAGE-----")
-1 # -1 means 'not found in the source string'
>>>
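When only a yes/no answer is needed, the in operator reads even more simply than find. A small sketch of how the check from the question might look; the names is_pgp_encrypted and log_status are made up for illustration, and the log path is passed in rather than read from the question's cfg dictionary.

```python
PGP_MARKER = "-----BEGIN PGP MESSAGE-----"


def is_pgp_encrypted(raw):
    """Return True if the armored PGP marker appears anywhere in the raw text."""
    return PGP_MARKER in raw


def log_status(raw, log_path):
    """Append one status line to the log file and return what was written."""
    status = "THIS IS ENCRYPTED" if is_pgp_encrypted(raw) else "NOT ENCRYPTED"
    # 'with' closes the file even if the write fails
    with open(log_path, "a") as log:
        log.write(status + "\n")
    return status
```

This only detects inline ASCII-armored messages; it makes no attempt to recognise other ways of encrypting mail.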
Let's execute the script:
python b.wsgi
The result is:
None
None
That is the problem. Here is the full script, b.wsgi:
aaa = """
From root#a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
for <ooo#a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root#localhost)
by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
Thu, 25 Jul 2013 19:28:59 -0700
From: root#a1.local.tld
Subject: oooooooooooooooo
To: ooo#a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861#a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"
This is a multi-part message in MIME format.
--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
ooooooooooooooooooooooooooooooooooooooooooooooo
--bound1374805739--
"""
import email
msg = email.message_from_string(aaa)
print(msg['From'])
print(msg['To'])
I tried changing it to
print(msg['from'])
print(msg['to'])
Same problem.
What might be the issue here?
Is it possible Python knows this "raw" string was edited by hand?
Very sneaky stuff going on here.
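The \n at the beginning of the string is causing the problem: the parser sees an empty line first, concludes that the header section is empty, and treats everything after it as the body. Stripping the surrounding whitespace fixes it. Try this: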
>>> msg = email.message_from_string(aaa.strip())
>>> msg.keys()
['Received', 'Received', 'From', 'Subject', 'To', 'Cc', 'X-Originating-IP', 'X-Mailer', 'Message-Id', 'Date', 'MIME-Version', 'Content-Type']
>>> msg['From']
'root#a1.local.tld'
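The effect is easy to reproduce with a much shorter message. The sketch below uses a made-up three-header message and compares parsing it with and without the leading newline.

```python
import email

raw = """
From: root#a1.local.tld
To: ooo#a1.local.tld
Subject: test

body text
"""

broken = email.message_from_string(raw)          # leading "\n" ends the headers at once
fixed = email.message_from_string(raw.lstrip())  # headers now start on the first line

print(broken.keys())   # no headers were recognised
print(broken["From"])
print(fixed.keys())
print(fixed["From"])
```

In the broken case the header lines are not lost; they end up at the start of broken.get_payload().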
I'm a Python beginner trying to extract data from email headers. I have thousands of email messages in a single text file, and from each message I want to extract the sender's address, the recipients' addresses, and the date, and write them to a single, semicolon-delimited line in a new file.
this is ugly, but it's what I've come up with:
import re
emails = open("demo_text.txt","r") #opens the file to analyze
results = open("results.txt","w") #creates new file for search results
resultsList = []
for line in emails:
    if "From - " in line: #recognizes the beginning of an email message and adds a linebreak
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+#[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line: #avoids confusion with 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+#[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")
for result in resultsList:
    results.writelines(result)
emails.close()
results.close()
and here's my 'demo_text.txt':
From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1#hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody#hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody#hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50#phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody#hotmail.com]
X-Sender: nobody#hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
<nobody#hotmail.com>
To: somebody_1#hotmail.com, somebody_2#gmail.com, 3_nobodies#yahoo.com.ar
Bcc:
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody#hotmail.com
The output is:
somebody_1#hotmail.com;somebody_2#gmail.com;3_nobodies#yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
This output would be fine except there's a line break in the 'From:' field in my demo_text.txt (line 24), and so I miss 'nobody#hotmail.com'.
I'm not sure how to tell my code to skip line break and still find email address in the From: tag.
More generally, I'm sure there are many more sensible ways to go about this task. If anyone could point me in the right direction, I'd sure appreciate it.
Your demo text is practically in the mbox format, which can be processed directly with the appropriate class in the mailbox module:
from mailbox import mbox
import re
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\#[0-9A-Za-z._-]+")
mymbox = mbox("demo.txt")
for message in mymbox.values():
    # a missing header comes back as None, so fall back to an empty string
    from_address = PAT_EMAIL.findall(message["from"] or "")
    to_address = PAT_EMAIL.findall(message["to"] or "")
    date = [message["date"] or ""]
    print(";".join(from_address + to_address + date))
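To get the one-line-per-message output the question asks for, the same idea can be wrapped in a function that returns the lines instead of printing them. This is only a sketch: the name summarize_mbox is made up, and the address pattern accepts both @ and # as the separator so that it matches the sample data exactly as shown above.

```python
import mailbox
import re

# Assumption: '#' is accepted next to '@' only to match the sample data above
ADDRESS = re.compile(r"[\w.-]+[@#][\w.-]+")


def summarize_mbox(path):
    """Return one 'from;to;...;date' line per message in the mbox file."""
    lines = []
    box = mailbox.mbox(path, create=False)  # fail instead of creating an empty file
    try:
        for message in box:
            # A folded header is returned as one value, so the line break
            # inside the From: field is no longer a problem
            senders = ADDRESS.findall(str(message["From"] or ""))
            recipients = ADDRESS.findall(str(message["To"] or ""))
            date = str(message["Date"] or "")
            lines.append(";".join(senders + recipients + [date]))
    finally:
        box.close()
    return lines
```

Writing the result is then a single call such as open("results.txt", "w").write("\n".join(summarize_mbox("demo_text.txt"))).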
In order to skip newlines, you can't read it line by line. You can try loading in your file, and using your keywords (From, To, etc.) as boundaries. So when you search for 'From -', you use the rest of your keywords as boundaries so they won't be included in the portion of the list.
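If you would rather keep a line-based approach, one way to deal with the break is to unfold the headers first: a header line that continues on the next line always starts that continuation with a space or tab. A minimal sketch, with unfold_headers as a made-up helper name:

```python
import re


def unfold_headers(text):
    """Join continuation lines (starting with space or tab) onto the previous line."""
    return re.sub(r"\r?\n[ \t]+", " ", text)


sample = (
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    " <nobody#hotmail.com>\n"
    "To: somebody_1#hotmail.com\n"
)
for line in unfold_headers(sample).splitlines():
    print(line)
```

After unfolding, the From: header and its address sit on one line, so the original per-line search finds them. This should be applied to the header section only, since body text may legitimately start lines with whitespace.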
Also, I'm mentioning this because you said you're a beginner:
The "Pythonic" way of naming your non-class variables is with underscores. So resultsList should be results_list.