Parsing email headers with regular expressions in python - python

I'm a Python beginner trying to extract data from email headers. I have thousands of email messages in a single text file, and from each message I want to extract the sender's address, the recipients' addresses, and the date, and write them to a single, semicolon-delimited line in a new file.
This is ugly, but it's what I've come up with:
import re
emails = open("demo_text.txt","r") #opens the file to analyze
results = open("results.txt","w") #creates new file for search results
resultsList = []
for line in emails:
    if "From - " in line: #recognizes the beginning of an email message and adds a linebreak
        newMessage = re.findall(r'\w\w\w\s\w\w\w.*', line)
        if newMessage:
            resultsList.append("\n")
    if "From: " in line:
        address = re.findall(r'[\w.-]+#[\w.-]+', line)
        if address:
            resultsList.append(address)
            resultsList.append(";")
    if "To: " in line:
        if "Delivered-To:" not in line: #avoids confusion with 'Delivered-To:' tag
            address = re.findall(r'[\w.-]+#[\w.-]+', line)
            if address:
                for person in address:
                    resultsList.append(person)
                    resultsList.append(";")
    if "Date: " in line:
        date = re.findall(r'\w\w\w\,.*', line)
        resultsList.append(date)
        resultsList.append(";")
for result in resultsList:
    results.writelines(result)
emails.close()
results.close()
and here's my 'demo_text.txt':
From - Sun Jan 06 19:08:49 2013
X-Mozilla-Status: 0001
X-Mozilla-Status2: 00000000
Delivered-To: somebody_1#hotmail.com
Received: by 10.48.48.3 with SMTP id v3cs417003nfv;
        Mon, 15 Jan 2007 10:14:19 -0800 (PST)
Received: by 10.65.211.13 with SMTP id n13mr5741660qbq.1168884841872;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Return-Path: <nobody#hotmail.com>
Received: from bay0-omc3-s21.bay0.hotmail.com (bay0-omc3-s21.bay0.hotmail.com [65.54.246.221])
        by mx.google.com with ESMTP id e13si6347910qbe.2007.01.15.10.13.58;
        Mon, 15 Jan 2007 10:14:01 -0800 (PST)
Received-SPF: pass (google.com: domain of nobody#hotmail.com designates 65.54.246.221 as permitted sender)
Received: from hotmail.com ([65.54.250.22]) by bay0-omc3-s21.bay0.hotmail.com with Microsoft SMTPSVC(6.0.3790.2668);
        Mon, 15 Jan 2007 10:13:48 -0800
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
        Mon, 15 Jan 2007 10:13:47 -0800
Message-ID: <BAY115-F12E4E575FF2272CF577605A1B50#phx.gbl>
Received: from 65.54.250.200 by by115fd.bay115.hotmail.msn.com with HTTP;
        Mon, 15 Jan 2007 18:13:43 GMT
X-Originating-IP: [200.122.47.165]
X-Originating-Email: [nobody#hotmail.com]
X-Sender: nobody#hotmail.com
From: =?iso-8859-1?B?UGF1bGEgTWFy7WEgTGlkaWEgRmxvcmVuemE=?=
        <nobody#hotmail.com>
To: somebody_1#hotmail.com, somebody_2#gmail.com, 3_nobodies#yahoo.com.ar
Bcc:
Subject: fotos
Date: Mon, 15 Jan 2007 18:13:43 +0000
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="----=_NextPart_000_d98_1c4f_3aa9"
X-OriginalArrivalTime: 15 Jan 2007 18:13:47.0572 (UTC) FILETIME=[E68D4740:01C738D0]
Return-Path: nobody#hotmail.com
The output is:
somebody_1#hotmail.com;somebody_2#gmail.com;3_nobodies#yahoo.com.ar;Mon, 15 Jan 2007 18:13:43 +0000;
This output would be fine, except that there's a line break in the 'From:' field in my demo_text.txt (line 24), so I miss 'nobody#hotmail.com'.
I'm not sure how to tell my code to skip the line break and still find the email address in the From: field.
More generally, I'm sure there are many more sensible ways to go about this task. If anyone could point me in the right direction, I'd sure appreciate it.

Your demo text is practically in the mbox format, which can be processed with the appropriate object in the mailbox module:
from mailbox import mbox
import re
PAT_EMAIL = re.compile(r"[0-9A-Za-z._-]+\#[0-9A-Za-z._-]+")
mymbox = mbox("demo_text.txt")
for email in mymbox.values():
    # The message object returns the whole header value, even if it was folded over several lines
    from_address = PAT_EMAIL.findall(email["from"])
    to_address = PAT_EMAIL.findall(email["to"])
    date = [email["date"]]
    # print is a function in Python 3
    print(";".join(from_address + to_address + date))
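As a variation on the same idea, the address regex can be swapped for email.utils.getaddresses from the standard library, which already understands display names and folded headers. This is only a minimal sketch: SAMPLE_MBOX, its addresses, and the summarize_mbox helper are made up for illustration, and it assumes the addresses use the usual @ separator and that continuation lines start with whitespace.

```python
import mailbox
import os
import tempfile
from email.utils import getaddresses

# Hypothetical one-message mbox sample; the From: header is folded over two lines.
SAMPLE_MBOX = (
    "From - Sun Jan 06 19:08:49 2013\n"
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    "\t<nobody@example.com>\n"
    "To: somebody_1@example.com, somebody_2@example.org\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
    "\n"
    "body text\n"
)


def summarize_mbox(path):
    """Return one 'sender;recipients;date' line per message in the mbox file."""
    lines = []
    box = mailbox.mbox(path)
    try:
        for message in box:
            # get_all() falls back to [] when the header is missing.
            senders = [addr for _, addr in getaddresses(message.get_all("From", []))]
            recipients = [addr for _, addr in getaddresses(message.get_all("To", []))]
            date = message.get("Date", "")
            lines.append(";".join(senders + recipients + [date]))
    finally:
        box.close()
    return lines


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.mbox")
        with open(path, "w", newline="\n") as handle:
            handle.write(SAMPLE_MBOX)
        for line in summarize_mbox(path):
            print(line)
```

Each returned string can be written straight to the results file, one message per line.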

To handle a field that continues on the next line, you can't process the file strictly line by line. You can try loading the whole file and using your keywords (From, To, etc.) as boundaries. So when you search for 'From -', use the other keywords as boundaries, so that text belonging to them isn't included in the portion you extract.
Also, I'm mentioning this because you said you're a beginner:
The "Pythonic" way of naming your non-class variables is with lowercase words separated by underscores. So resultsList should be results_list.
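One way to sketch the "read the whole text" idea is to unfold the headers first, that is, to join every continuation line onto the header above it, and only then search each field. This assumes continuation lines begin with a space or tab, which is how folded headers normally look; SAMPLE, unfold_headers and header_values are illustrative names, and the addresses are made up and use the usual @ separator.

```python
import re

# Hypothetical header block; the From: value continues on a second line.
SAMPLE = (
    "Delivered-To: somebody_1@example.com\n"
    "From: =?iso-8859-1?B?UGF1bGE=?=\n"
    "\t<nobody@example.com>\n"
    "To: somebody_1@example.com, somebody_2@example.org\n"
    "Date: Mon, 15 Jan 2007 18:13:43 +0000\n"
)

ADDRESS = re.compile(r"[\w.+-]+@[\w.-]+")


def unfold_headers(text):
    """Join each continuation line (starting with space or tab) to the header above it."""
    return re.sub(r"\r?\n[ \t]+", " ", text)


def header_values(text, name):
    """Return the values of all headers called `name` in already-unfolded text."""
    # ^ with MULTILINE anchors at a line start, so "To" does not match "Delivered-To".
    pattern = re.compile(r"^%s:[ \t]*(.*)$" % re.escape(name), re.MULTILINE)
    return pattern.findall(text)


unfolded = unfold_headers(SAMPLE)
from_addresses = ADDRESS.findall(" ".join(header_values(unfolded, "From")))
to_addresses = ADDRESS.findall(" ".join(header_values(unfolded, "To")))
date_values = header_values(unfolded, "Date")
print(";".join(from_addresses + to_addresses + date_values))
```

Anchoring the header name at the start of a line also removes the need for the separate 'Delivered-To:' check.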

Related

How to get access of value within long text present within list in Python?

I am working on a piece of code to get a value from a Gmail message, but the email itself is HTML, so the code returns the HTML source inside a list, and I am unable to parse the data out of it.
My Code:
import imaplib
ORG_EMAIL = "comapnyname.com"
FROM_EMAIL = "automation#companyname.co"
FROM_PWD = "password123!"
SMTP_SERVER = "imap.gmail.com"
def read_email_from_gmail():
    mail = imaplib.IMAP4_SSL(SMTP_SERVER)
    mail.login(FROM_EMAIL, FROM_PWD)
    mail.select("inbox")
    email_type, data = mail.search(None, "ALL")
    mail_ids = data[0]
    id_list = mail_ids.split()
    latest_email_id = int(id_list[-1])
    email_type, data = mail.fetch(str.encode(str(latest_email_id)), "(RFC822)")
    string_data = str(data)
    print('MAIL Data: ')
    print(string_data)
read_email_from_gmail()
Now this code returns a long list which contains HTML:
[(b'1 (RFC822 {54624}', b'Delivered-To: automation+qa1#spekit.co\r\nReceived: by 2002:a4a:6f04:0:0:0:0:0 with SMTP id h4csp1519301ooc;\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nX-Google-Smtp-Source: ABdhPJy/7yOn17HKdn+QjP0XHEOK2fu8LDL8tz4jDmDKemms2GVyykqDCDUfppmRbV4DUi7ckRRg\r\nX-Received: by 2002:a25:d7cd:: with SMTP id o196mr14075369ybg.91.1599754722247;\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nARC-Seal: i=1; a=rsa-sha256; t=1599754722; cv=none;\r\n d=google.com; s=arc-20160816;\r\n b=KzNg7bsmLaNcrRMihkN+AwlTp8ybj5D65K+Z21Ddl/lgd2LN90InAWhj+guhrmzHtB\r\n vw83T4AlJ8u2jpAs5qYUbxgd/R5COLhlRDqR/dE4wljRgIq2W6sVCJo/fGuZruFjob4Z\r\n h1acPat0xa3h83lJzzbH576KggTqdScMwCbLsujPr/FclnHNjkqxQuFQlV23nAGgvWX8\r\n raiIW+6wC070tmQaaz3feIVfo7r7cmQBGokOmy8B3of0/kqIyMVuaEkmk2kno8VFvILF\r\n i8YPq7bOHVNpre7KwiG4r69PdaDRXIcd/ETtuyusfNXOrGJ0QhC44j2eLUpxlRltOGgL\r\n NAeA==\r\nARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;\r\n h=mime-version:date:message-id:to:subject:from:dkim-signature\r\n :dkim-signature;\r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=;\r\n b=WjnIWwVX2oWrl3aZoKlzck1GAoy/gT5/cbNP+tnmdypfjvAUTyuZ3OO5xXlZB/CiF9\r\n PkYZFEzJQSxradr3ky5T7tLmV2qKnHfaIp3G3STUs5f9vhSfp6qknV7ouLBGwCWyp2gp\r\n e14Aek7M5ciVC1GIjxlr7AXZne4eHSwCb7u8j91Yt8B2getEQ9lyQlChwjYf38Kau5lL\r\n wPmMtAM0DDOqlNff2gTBEFgAX1s0Wk+g8mKS31tzBMIQvayR+a3PHX+S3zhtC2i1XsLm\r\n NOWSMsI0ZEEk/mjA36DVWhEN0d9llOwiDfFonXxIkcPZLlNR3zGfA61apTeud7i24vYn\r\n bfCw==\r\nARC-Authentication-Results: i=1; mx.google.com;\r\n dkim=pass header.i=#spekit.co header.s=mandrill header.b=RhjFdk+T;\r\n dkim=pass header.i=#mandrillapp.com header.s=mandrill header.b=SusUoY2S;\r\n spf=pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) smtp.mailfrom=bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com;\r\n dmarc=pass (p=NONE sp=NONE dis=NONE) 
header.from=spekit.co\r\nReturn-Path: <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>\r\nReceived: from mail180-17.suw31.mandrillapp.com (mail180-17.suw31.mandrillapp.com. [198.2.180.17])\r\n by mx.google.com with ESMTPS id t10si6240908ybl.463.2020.09.10.09.18.42\r\n for <automation+qa1#spekit.co>\r\n (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);\r\n Thu, 10 Sep 2020 09:18:42 -0700 (PDT)\r\nReceived-SPF: pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) client-ip=198.2.180.17;\r\nAuthentication-Results: mx.google.com;\r\n dkim=pass header.i=#spekit.co header.s=mandrill header.b=RhjFdk+T;\r\n dkim=pass header.i=#mandrillapp.com header.s=mandrill header.b=SusUoY2S;\r\n spf=pass (google.com: domain of bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com designates 198.2.180.17 as permitted sender) smtp.mailfrom=bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com;\r\n dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=spekit.co\r\nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; s=mandrill; d=spekit.co;\r\n h=From:Subject:To:Message-Id:Date:MIME-Version:Content-Type; i=support#spekit.co;\r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=;\r\n b=RhjFdk+Tvr3HP43qJoKzVowGAs1SYJFfpq8MK4firz5tcpBYn3UEP/Z5cF+IBA74/PTmCahgTnXi\r\n /EPSbY2b+20ERj4s4VUnwNZw8t4L98gSQiM6o3mF4iVI2JIgABU2Tn2nmB68kGZyxeSOs4bWtE+s\r\n MXleLzg+uTftETJoUhM=\r\nReceived: from pmta03.mandrill.prod.suw01.rsglab.com (127.0.0.1) by mail180-17.suw31.mandrillapp.com id hb98u422sc0h for <automation+qa1#spekit.co>; Thu, 10 Sep 2020 16:18:42 +0000 (envelope-from <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>)\r\nDKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mandrillapp.com; \r\n i=#mandrillapp.com; q=dns/txt; s=mandrill; t=1599754721; h=From : \r\n 
Subject : To : Message-Id : Date : MIME-Version : Content-Type : From : \r\n Subject : Date : X-Mandrill-User : List-Unsubscribe; \r\n bh=ZNxh0gTg5kVpAZyTHGJ2jWADa5UGAoPCP3GFX1DUu94=; \r\n b=SusUoY2SOQosSQrzHafHGf7Pto1Ol3PDGU067dNsjT1ZIOuSP0Dz7DJwqgFn6NpwAV7X7e\r\n pzQQPyDJoAqQCjCdSqG9mp80hAEGwQC89GNu78a8o0NRC+BPRTGaNKV/jX06cXsgp+A4KXfY\r\n 13x1BInjKraTnCYz9TnzDUChIm3pg=\r\nFrom: Support <support#spekit.co>\r\nSubject: Your Spekit Login PIN\r\nReturn-Path: <bounce-md_31064008.5f5a51e1.v1-8084cafe0c6c4aeca73fef8bdaf5b70b#mandrillapp.com>\r\nReceived: from [3.128.246.0] by mandrillapp.com id 8084cafe0c6c4aeca73fef8bdaf5b70b; Thu, 10 Sep 2020 16:18:41 +0000\r\nTo: Automation <automation+qa1#spekit.co>\r\nX-Report-Abuse: Please forward a copy of this message, including all headers, to abuse#mandrill.com\r\nX-Report-Abuse: You can also report abuse here: http://mandrillapp.com/contact/abuse?id=31064008.8084cafe0c6c4aeca73fef8bdaf5b70b\r\nX-Mandrill-User: md_31064008\r\nMessage-Id: <31064008.20200910161841.5f5a51e1e2be13.10518479#mail180-17.suw31.mandrillapp.com>\r\nDate: Thu, 10 Sep 2020 16:18:41 +0000\r\nMIME-Version: 1.0\r\nContent-Type: multipart/alternative; boundary="_av-l5kOy35rlKJaV18wYlOHPA"\r\n\r\n--_av-l5kOy35rlKJaV18wYlOHPA\r\nContent-Type: text/plain; charset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n Your Spekit Login PIN \r\n Hi Automation, Someone (hopefully you) just\r\nlogged into your Spekit account with the email *automation+qa1#spekit.co*. \r\n \r\n If this was you, please use the code below to log-in, otherwise please\r\ncontact your admin and reset your password ASAP.\r\n =3D *952681* =3D\r\n\r\n Enter PIN <https://app.spekit.co/verifypin>\r\n<http://www.twitter.com/spekitapp>\r\n<https://www.linkedin.com/company/spekit/> <https://medium.com/spekit>\r\n<https://spekit.co/> \r\nQuestions? Contact us. <mailto:support#spekit.co>\r\n Copyright =C2=A9 2018 Spekit, Inc. 
All rights reserved.\r\n\r\n--_av-l5kOy35rlKJaV18wYlOHPA\r\nContent-Type: text/html; charset=utf-8\r\nContent-Transfer-Encoding: quoted-printable\r\n\r\n<!doctype html>\r\n<html xmlns=3D"http://www.w3.org/1999/xhtml" xmlns:v=3D"urn:schemas-microso=\r\nft-com:vml" xmlns:o=3D"urn:schemas-microsoft-com:office:office">\r\n <head>\r\n <!-- NAME: 1 COLUMN - FULL WIDTH -->\r\n <!--[if gte mso 15]>\r\n <xml>\r\n <o:OfficeDocumentSettings>\r\n <o:AllowPNG/>\r\n <o:PixelsPerInch>96</o:PixelsPerInch>\r\n </o:OfficeDocumentSettings>\r\n </xml>\r\n <![endif]-->\r\n <meta charset=3D"UTF-8">\r\n <meta http-equiv=3D"X-UA-Compatible" content=3D"IE=3Dedge">\r\n <meta name=3D"viewport" content=3D"width=3Ddevice-width, initial-sc=\r\nale=3D1">\r\n <title>Your Spekit Login PIN</title>\r\n \r\n <style type=3D"text/css">\r\n=09=09p{\r\n=09=09=09margin:10px 0;\r\n=09=09=09padding:0;\r\n=09=09}\r\n=09=09table{\r\n=09=09=09border-</tbody></table> ')']
I need to get the value '952681', which appears twice in the message. Can someone help me with that?
If the format of the email stays the same, you can use a regex to parse the returned string:
import re
# [\s\S]*? lazily matches any text, including newlines, between two asterisks
pattern = r'\*([\s\S]*?)\*'
res = re.findall(pattern, your_email_text)
The variable res contains your number at the second position:
['automation+qa1#spekit.co', '952681']
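A regex over str(data) also has to cope with the quoted-printable escapes (such as =3D) and with everything in the HTML part. A possible alternative, sketched here under the assumption that the raw message bytes are available (in the question they would be the second element of the fetched tuple), is to parse them with the standard email package and search only the text/plain part. build_sample_message and extract_pin are illustrative names, and the sample message is made up.

```python
import email
import re
from email import policy
from email.message import EmailMessage


def build_sample_message():
    """Build raw bytes for a made-up multipart message (a stand-in for the fetched bytes)."""
    msg = EmailMessage()
    msg["From"] = "Support <support@example.com>"
    msg["To"] = "Automation <automation@example.com>"
    msg["Subject"] = "Your Login PIN"
    msg.set_content(
        "Someone logged in with the email *automation@example.com*.\n"
        "Please use the code below to log in.\n"
        "= *952681* =\n"
    )
    msg.add_alternative("<p>Your PIN is <b>952681</b></p>", subtype="html")
    return msg.as_bytes()


def extract_pin(raw_bytes):
    """Return the first run of digits wrapped in asterisks in the plain-text part, or None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return None
    # get_content() undoes the transfer encoding (e.g. quoted-printable).
    text = body.get_content()
    match = re.search(r"\*(\d+)\*", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_pin(build_sample_message()))
```

Restricting the pattern to digits means the email address between asterisks is skipped, so there is no need to pick the second match by position.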

email.message_from_string can't parse outlook original source message

I am trying to parse a multipart email using its original source message from Outlook. The email has 2 parts: plain text and html. email.message_from_string doesn't parse the raw email correctly. It doesn't return 2 parts and also _payload includes everything except for the first 2 lines.
I used email.message_from_string(raw_email) to parse the raw original source message and it didn't parse it correctly.
Note: I cut off most of the email to keep it short.
Original source message from Outlook:
Received: from SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
(2603:10b6:300:d4::32) by CO2PR01MB1959.prod.exchangelabs.com with HTTPS via
MWHPR19CA0022.NAMPRD19.PROD.OUTLOOK.COM; Wed, 31 Jul 2019 19:52:30 +0000
Received: from SN1NAM04FT005.eop-NAM04.prod.protection.outlook.com
(10.152.88.55) by SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
(10.152.89.14) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2115.10; Wed, 31 Jul
2019 19:52:29 +0000
Authentication-Results: spf=pass (sender IP is 50.31.51.89)
smtp.mailfrom=sendgrid.blabla.com; windowslive.com; dkim=pass (signature was
verified) header.d=blabla.com;windowslive.com; dmarc=pass action=none
header.from=blabla.com;
Received-SPF: Pass (protection.outlook.com: domain of sendgrid.blabla.com
designates 50.31.51.89 as permitted sender) receiver=protection.outlook.com;
client-ip=50.31.51.89; helo=o1.email-sg.blabla.com;
Received: from o1.email-sg.blabla.com (50.31.51.89) by
SN1NAM04FT005.mail.protection.outlook.com (10.152.88.160) with Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.2136.14 via Frontend Transport; Wed, 31 Jul 2019 19:52:29 +0000
X-IncomingTopHeaderMarker: OriginalChecksum:0A3835CC8F7E76F92E22A1986408E34F6CB0EE38219063E844D0BB1572B82825;UpperCasedChecksum:3B51CEDA7CBD6FB06905BA9CCFA3417B571F394F0412206B12B87927F8C8FE0B;SizeAsReceived:1804;Count:15
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=blabla.com;
h=from:sender:to:subject:mime-version:content-type; s=s1;
bh=wEOiatvA5BWHjVFwDPHy3RC5ur4=; b=Y/sBR8/uaU5y+7GN3GanXk7dlsUId
bQjsB7HfZp6fdDuVo9EIKrFn9uffrsqJpXO6DFqX5rWWCvgTMYPnsM8Iy3ekU0sD
psxBZ186ROAoalowdniEsGZ/fTMan4JEXEWhlKKpGHxGR102lz1qylqRazxFlOEY
5yhWp6dJjLegIg=
Received: by filter0246p1iad2.sendgrid.net with SMTP id filter0246p1iad2-24721-5D41F17C-8
2019-07-31 19:52:28.625630772 +0000 UTC m=+518882.143553260
Received: from iad1gmta02.localdomain (unknown [192.88.178.20])
by ismtpd0002p1iad2.sendgrid.net (SG) with ESMTP id GrDTvaa3R6yukBfilmmfMw
for <y#windowslive.com>; Wed, 31 Jul 2019 19:52:28.454 +0000 (UTC)
Received: from iad1gbos.localdomain (unknown [10.3.65.145])
by iad1gmta02.localdomain (Postfix) with ESMTP id 62A8812AEA4D
for <y#windowslive.com>; Wed, 31 Jul 2019 15:52:28 -0400 (EDT)
Received: from iad1gbos.ecom.blabla.com (localhost [127.0.0.1])
by iad1gbos.localdomain (Postfix) with ESMTP id 57CD51013601
for <y#windowslive.com>; Wed, 31 Jul 2019 15:52:28 -0400 (EDT)
From: "blabla.com" <service#blabla.com>
Sender: "blabla.com" <service#blabla.com>
To: yc#windowslive.com
Message-ID: <1943845105.133098.1564602748358#localhost>
Subject: Thanks for your blabla order!
Content-Type: multipart/alternative;
boundary="----=_Part_12654_1135590884.1564602743147"
Date: Wed, 31 Jul 2019 19:52:28 +0000
X-SG-EID: KlhL5+04rpq9b+lNnUQSSXSv/U/Agrwcy5kw6hHCP8rbih+DKKzTjpaizOf9gI4jfUbeoQFtkwaLeA
Q5VJW+s2G92MVJdOKnwbJCcJQrsVc4oiuZgDCBS8dpWhU6KfIM6V5wL2yNP0pKKCugS+b4cgX4K5CX
GndIFYXJXa1LTZLPblTMNhH8QH5+kLY4Wtg9po8FuNUzEJaPXsJgnMHYzKZOIvAvnevqNIcyYVL2Yc
0=
X-SG-ID: DT9Vw4eifUpKg3EkHbNxgoJgjlm7TnFJRHcoaVv1UYo=
X-IncomingHeaderCount: 15
Return-Path: bounces+266386-caec-yc=windowslive.com#sendgrid.blabla.com
X-MS-Exchange-Organization-ExpirationStartTime: 31 Jul 2019 19:52:29.3384
(UTC)
X-MS-Exchange-Organization-ExpirationStartTimeReason: OriginalSubmit
X-MS-Exchange-Organization-ExpirationInterval: 1:00:00:00.0000000
X-MS-Exchange-Organization-ExpirationIntervalReason: OriginalSubmit
X-MS-Exchange-Organization-Network-Message-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-EOPAttributedMessage: 0
X-EOPTenantAttributedMessage: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa:0
X-MS-Exchange-Organization-MessageDirectionality: Incoming
X-Forefront-Antispam-Report: EFV:NLI;
X-MS-Exchange-Organization-AuthSource:
SN1NAM04FT005.eop-NAM04.prod.protection.outlook.com
X-MS-Exchange-Organization-AuthAs: Anonymous
X-MS-PublicTrafficType: Email
X-MS-UserLastLogonTime: 7/31/2019 7:47:36 PM
X-MS-Office365-Filtering-Correlation-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-Microsoft-Antispam:
BCL:6;PCL:0;RULEID:(2390118)(5000188)(711020)(4605104)(610169)(650170)(651021)(8291501072);SRVR:SN1NAM04HT187;
X-MS-TrafficTypeDiagnostic: SN1NAM04HT187:
X-MS-Exchange-PUrlCount: 24
X-MS-Exchange-EOPDirect: true
X-Sender-IP: 50.31.51.89
X-SID-PRA: SERVICE#blabla.COM
X-SID-Result: PASS
X-MS-Exchange-Organization-PCL: 2
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 31 Jul 2019 19:52:29.0952
(UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-FromEntityHeader: Internet
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg:
00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN1NAM04HT187
X-MS-Exchange-Transport-EndToEndLatency: 00:00:01.1679704
X-MS-Exchange-Processed-By-BccFoldering: 15.20.2115.005
X-Microsoft-Antispam-Mailbox-Delivery:
abwl:0;wl:0;pcwl:0;kl:0;iwl:0;ijl:0;dwl:0;dkl:0;rwl:0;ucf:0;jmr:0;ex:0;auth:1;dest:I;ENG:(5062000261)(5061607266)(5061608174)(4900115)(4920090)(6515079)(4950130)(4990090);
X-Message-Info:
5vMbyqxGkdee9CWP6GN6k7SiFatHA5tOJthXLYGApF09RV+2VIwDSv7TpFIyyuwbdpElZ/0OfDQ0pBW79cd9agnpjGQw7b7v7zA6S/RBHvx/Foariz/CKmDCPiOrCScSKVW9YaM/CqKL76WFalT2LUf8VJBR8M4LupokoBm/WazuStfNPUu2PvSCEzFxbvn/ptMxVl+4wEXDPQivJ1nuMw==
X-Message-Delivery: Vj0xLjE7dXM9MDtsPTA7YT0xO0Q9MTtHRD0xO1NDTD0z
X-Microsoft-Antispam-Message-Info:
1PVG+UKcd/9R+HkjIXvLA0AQVREwyDmXR6XmnL6oX6xks/yw3ZQRARX4fYngU4vXeDhoJr9kyTA6Bpm3OE5FZn8+JPH3p24pamcQhTiI/RdyRHAOx7q7YHb9PzM3EkY2hOb6qF/QxCZPdshlewXqGe+azoh4Sr9CPUL8x7gtZS0XVBYBQMtHRW0NsS0ULp/4e4+lbGoyQcXdMGoy4Cf6ACU783dlOjDyZNkz2Frk3vm3Za/P3avHn46xf8WzHrDbbfOiVc+HXFAQxBOWbQPD0rkXNssXlOszegvDX7nPq9hdj8UadbqhECjiizH890bNwqKIa2sWd/d1HBfojK2FDsEOPwDSsIS/1ApF038jELtjpzkzSadz319o6VohzYUHm7CtRF9sqJTgLVKePBo/i8FrLeoCq2rXydXj6a9MS7SqLDfny7NlP/qId5z5GXFs63K/QXu3InYLIf5zcl/kgsvg+W0cHDZ4/IdBPvHGeaQn7hdf62IKftys3CspYBbSlt0Eus97CCddOX/EBaJpJ7nEpHxIL3pxKVY0kKWqaUqrvvC3mvCffBe3igaAq2LiHgvT0pIU+j0y41VwEn7X8rL8gyWpbBF64+wf8NAe8JM2N8aWudElAkIeA5GHJodGcXt+jdyhYzh3EZs1BWyrF+k6MPp4kU/9yVxAxBimBx1aje5geHD7NggWqFJAD6fv0XsSjxku4Tap6Zs+NEkoD/MHNyT6TMnu9cqGgEoznr2mTssEZkz0JRPgcb2YbZabybkxBRJjVi/aroSjtOj2V2JHo9m2F8bA4XhLrLwgJzOs1ZelyYFKZ3OMaUAS9yNRlSPHBSKm/WUzIMNkcHLCcg==
MIME-Version: 1.0
------=_Part_12654_1135590884.1564602743147
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Hi c y,
Thanks for your order! Your order information appears below.
Item(s) Subtotal: $1.79
Shipping: $4.95
----
Total Before Tax: $6.74
Estimated Tax: $0.15
----
Order Total: $6.89
Shipping Address:
blabla.com
https://www.blabla.com
1-800-672-4399
------=_Part_12654_1135590884.1564602743147
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit
<!--[if !mso]><!--><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><link href="http://cms.blabla.com/fonts/roboto/email-font.css" rel="stylesheet" type="text/css">
<style type="text/css">
#font-face {
font-family: 'Roboto';
src: url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.eot');
src: url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.eot?#iefix') format('embedded-opentype'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.woff') format('woff'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.ttf') format('truetype'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.svg#robotoregular') format('svg');
font-weight:400;
font-style:normal;
}
</style>
<div width="100%" style="display:none;font-size:0px;color:#eeeeee;line-height:1px;text-align:center;opacity:0;overflow:hidden;max-height:0px;max-width:0px;">
Questions? Call us any time 24/7 at 1-800-672-4399 or simply reply to this email | blabla.com
</div>
------=_Part_12654_1135590884.1564602743147--
Result:
{'_charset': None,
'_default_type': 'text/plain',
'_headers': [('Received',
'from SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com'),
('(2603',
'10b6:300:d4::32) by CO2PR01MB1959.prod.exchangelabs.com with HTTPS via')],
'_payload': 'Received: from SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
(2603:10b6:300:d4::32) by CO2PR01MB1959.prod.exchangelabs.com with HTTPS via
MWHPR19CA0022.NAMPRD19.PROD.OUTLOOK.COM; Wed, 31 Jul 2019 19:52:30 +0000
Received: from SN1NAM04FT005.eop-NAM04.prod.protection.outlook.com
(10.152.88.55) by SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
(10.152.89.14) with Microsoft SMTP Server (version=TLS1_2,
cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2115.10; Wed, 31 Jul
2019 19:52:29 +0000
Authentication-Results: spf=pass (sender IP is 50.31.51.89)
smtp.mailfrom=sendgrid.blabla.com; windowslive.com; dkim=pass (signature was
verified) header.d=blabla.com;windowslive.com; dmarc=pass action=none
header.from=blabla.com;
Received-SPF: Pass (protection.outlook.com: domain of sendgrid.blabla.com
designates 50.31.51.89 as permitted sender) receiver=protection.outlook.com;
client-ip=50.31.51.89; helo=o1.email-sg.blabla.com;
Received: from o1.email-sg.blabla.com (50.31.51.89) by
SN1NAM04FT005.mail.protection.outlook.com (10.152.88.160) with Microsoft SMTP
Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
15.20.2136.14 via Frontend Transport; Wed, 31 Jul 2019 19:52:29 +0000
X-IncomingTopHeaderMarker: OriginalChecksum:0A3835CC8F7E76F92E22A1986408E34F6CB0EE38219063E844D0BB1572B82825;UpperCasedChecksum:3B51CEDA7CBD6FB06905BA9CCFA3417B571F394F0412206B12B87927F8C8FE0B;SizeAsReceived:1804;Count:15
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed/relaxed; d=blabla.com;
h=from:sender:to:subject:mime-version:content-type; s=s1;
bh=wEOiatvA5BWHjVFwDPHy3RC5ur4=; b=Y/sBR8/uaU5y+7GN3GanXk7dlsUId
bQjsB7HfZp6fdDuVo9EIKrFn9uffrsqJpXO6DFqX5rWWCvgTMYPnsM8Iy3ekU0sD
psxBZ186ROAoalowdniEsGZ/fTMan4JEXEWhlKKpGHxGR102lz1qylqRazxFlOEY
5yhWp6dJjLegIg=
Received: by filter0246p1iad2.sendgrid.net with SMTP id filter0246p1iad2-24721-5D41F17C-8
2019-07-31 19:52:28.625630772 +0000 UTC m=+518882.143553260
Received: from iad1gmta02.localdomain (unknown [192.88.178.20])
by ismtpd0002p1iad2.sendgrid.net (SG) with ESMTP id GrDTvaa3R6yukBfilmmfMw
for <y#windowslive.com>; Wed, 31 Jul 2019 19:52:28.454 +0000 (UTC)
Received: from iad1gbos.localdomain (unknown [10.3.65.145])
by iad1gmta02.localdomain (Postfix) with ESMTP id 62A8812AEA4D
for <y#windowslive.com>; Wed, 31 Jul 2019 15:52:28 -0400 (EDT)
Received: from iad1gbos.ecom.blabla.com (localhost [127.0.0.1])
by iad1gbos.localdomain (Postfix) with ESMTP id 57CD51013601
for <y#windowslive.com>; Wed, 31 Jul 2019 15:52:28 -0400 (EDT)
From: "blabla.com" <service#blabla.com>
Sender: "blabla.com" <service#blabla.com>
To: yc#windowslive.com
Message-ID: <1943845105.133098.1564602748358#localhost>
Subject: Thanks for your blabla order!
Content-Type: multipart/alternative;
boundary="----=_Part_12654_1135590884.1564602743147"
Date: Wed, 31 Jul 2019 19:52:28 +0000
X-SG-EID: KlhL5+04rpq9b+lNnUQSSXSv/U/Agrwcy5kw6hHCP8rbih+DKKzTjpaizOf9gI4jfUbeoQFtkwaLeA
Q5VJW+s2G92MVJdOKnwbJCcJQrsVc4oiuZgDCBS8dpWhU6KfIM6V5wL2yNP0pKKCugS+b4cgX4K5CX
GndIFYXJXa1LTZLPblTMNhH8QH5+kLY4Wtg9po8FuNUzEJaPXsJgnMHYzKZOIvAvnevqNIcyYVL2Yc
0=
X-SG-ID: DT9Vw4eifUpKg3EkHbNxgoJgjlm7TnFJRHcoaVv1UYo=
X-IncomingHeaderCount: 15
Return-Path: bounces+266386-caec-yc=windowslive.com#sendgrid.blabla.com
X-MS-Exchange-Organization-ExpirationStartTime: 31 Jul 2019 19:52:29.3384
(UTC)
X-MS-Exchange-Organization-ExpirationStartTimeReason: OriginalSubmit
X-MS-Exchange-Organization-ExpirationInterval: 1:00:00:00.0000000
X-MS-Exchange-Organization-ExpirationIntervalReason: OriginalSubmit
X-MS-Exchange-Organization-Network-Message-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-EOPAttributedMessage: 0
X-EOPTenantAttributedMessage: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa:0
X-MS-Exchange-Organization-MessageDirectionality: Incoming
X-Forefront-Antispam-Report: EFV:NLI;
X-MS-Exchange-Organization-AuthSource:
SN1NAM04FT005.eop-NAM04.prod.protection.outlook.com
X-MS-Exchange-Organization-AuthAs: Anonymous
X-MS-PublicTrafficType: Email
X-MS-UserLastLogonTime: 7/31/2019 7:47:36 PM
X-MS-Office365-Filtering-Correlation-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-Microsoft-Antispam:
BCL:6;PCL:0;RULEID:(2390118)(5000188)(711020)(4605104)(610169)(650170)(651021)(8291501072);SRVR:SN1NAM04HT187;
X-MS-TrafficTypeDiagnostic: SN1NAM04HT187:
X-MS-Exchange-PUrlCount: 24
X-MS-Exchange-EOPDirect: true
X-Sender-IP: 50.31.51.89
X-SID-PRA: SERVICE#blabla.COM
X-SID-Result: PASS
X-MS-Exchange-Organization-PCL: 2
X-OriginatorOrg: outlook.com
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 31 Jul 2019 19:52:29.0952
(UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 0748a39e-bdb3-4241-2271-08d715f09e99
X-MS-Exchange-CrossTenant-Id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa
X-MS-Exchange-CrossTenant-FromEntityHeader: Internet
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg:
00000000-0000-0000-0000-000000000000
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN1NAM04HT187
X-MS-Exchange-Transport-EndToEndLatency: 00:00:01.1679704
X-MS-Exchange-Processed-By-BccFoldering: 15.20.2115.005
X-Microsoft-Antispam-Mailbox-Delivery:
abwl:0;wl:0;pcwl:0;kl:0;iwl:0;ijl:0;dwl:0;dkl:0;rwl:0;ucf:0;jmr:0;ex:0;auth:1;dest:I;ENG:(5062000261)(5061607266)(5061608174)(4900115)(4920090)(6515079)(4950130)(4990090);
X-Message-Info:
5vMbyqxGkdee9CWP6GN6k7SiFatHA5tOJthXLYGApF09RV+2VIwDSv7TpFIyyuwbdpElZ/0OfDQ0pBW79cd9agnpjGQw7b7v7zA6S/RBHvx/Foariz/CKmDCPiOrCScSKVW9YaM/CqKL76WFalT2LUf8VJBR8M4LupokoBm/WazuStfNPUu2PvSCEzFxbvn/ptMxVl+4wEXDPQivJ1nuMw==
X-Message-Delivery: Vj0xLjE7dXM9MDtsPTA7YT0xO0Q9MTtHRD0xO1NDTD0z
X-Microsoft-Antispam-Message-Info:
1PVG+UKcd/9R+HkjIXvLA0AQVREwyDmXR6XmnL6oX6xks/yw3ZQRARX4fYngU4vXeDhoJr9kyTA6Bpm3OE5FZn8+JPH3p24pamcQhTiI/RdyRHAOx7q7YHb9PzM3EkY2hOb6qF/QxCZPdshlewXqGe+azoh4Sr9CPUL8x7gtZS0XVBYBQMtHRW0NsS0ULp/4e4+lbGoyQcXdMGoy4Cf6ACU783dlOjDyZNkz2Frk3vm3Za/P3avHn46xf8WzHrDbbfOiVc+HXFAQxBOWbQPD0rkXNssXlOszegvDX7nPq9hdj8UadbqhECjiizH890bNwqKIa2sWd/d1HBfojK2FDsEOPwDSsIS/1ApF038jELtjpzkzSadz319o6VohzYUHm7CtRF9sqJTgLVKePBo/i8FrLeoCq2rXydXj6a9MS7SqLDfny7NlP/qId5z5GXFs63K/QXu3InYLIf5zcl/kgsvg+W0cHDZ4/IdBPvHGeaQn7hdf62IKftys3CspYBbSlt0Eus97CCddOX/EBaJpJ7nEpHxIL3pxKVY0kKWqaUqrvvC3mvCffBe3igaAq2LiHgvT0pIU+j0y41VwEn7X8rL8gyWpbBF64+wf8NAe8JM2N8aWudElAkIeA5GHJodGcXt+jdyhYzh3EZs1BWyrF+k6MPp4kU/9yVxAxBimBx1aje5geHD7NggWqFJAD6fv0XsSjxku4Tap6Zs+NEkoD/MHNyT6TMnu9cqGgEoznr2mTssEZkz0JRPgcb2YbZabybkxBRJjVi/aroSjtOj2V2JHo9m2F8bA4XhLrLwgJzOs1ZelyYFKZ3OMaUAS9yNRlSPHBSKm/WUzIMNkcHLCcg==
MIME-Version: 1.0
------=_Part_12654_1135590884.1564602743147
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Hi c y,
Thanks for your order! Your order information appears below.
Item(s) Subtotal: $1.79
Shipping: $4.95
----
Total Before Tax: $6.74
Estimated Tax: $0.15
----
Order Total: $6.89
Shipping Address:
blabla.com
https://www.blabla.com
1-800-672-4399
------=_Part_12654_1135590884.1564602743147
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit
<!--[if !mso]><!--><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><link href="http://cms.blabla.com/fonts/roboto/email-font.css" rel="stylesheet" type="text/css">
<style type="text/css">
#font-face {
font-family: 'Roboto';
src: url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.eot');
src: url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.eot?#iefix') format('embedded-opentype'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.woff') format('woff'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.ttf') format('truetype'),
url('http://cms.blabla.com/fonts/roboto/Roboto-Regular-webfont.svg#robotoregular') format('svg');
font-weight:400;
font-style:normal;
}
</style>
<div width="100%" style="display:none;font-size:0px;color:#eeeeee;line-height:1px;text-align:center;opacity:0;overflow:hidden;max-height:0px;max-width:0px;">
Questions? Call us any time 24/7 at 1-800-672-4399 or simply reply to this email | blabla.com
</div>
------=_Part_12654_1135590884.1564602743147--',
'_unixfrom': None,
'defects': [],
'epilogue': None,
'preamble': None}
As you can see, the result contains the complete original email source as the payload, except for the first 2 lines. The email should have 2 parts, one text/plain and the other text/html. Lines before MIME-Version: 1.0 should not be included in the payload. Thanks!
The problem was the formatting of the email source message. When I copy-pasted it from the Outlook client, the formatting was broken, so I had to fix it manually for it to be parsed correctly. I had to put tabs before the continuation lines, as you can see below:
Received: from SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
    (2603:10b6:300:d4::32) by CO2PR01MB1959.prod.exchangelabs.com with
    HTTPS via
    MWHPR19CA0022.NAMPRD19.PROD.OUTLOOK.COM; Wed, 31 Jul 2019 19:52:30 +0000
Received: from SN1NAM04FT005.eop-NAM04.prod.protection.outlook.com
    (10.152.88.55) by SN1NAM04HT187.eop-NAM04.prod.protection.outlook.com
    (10.152.89.14) with Microsoft SMTP Server (version=TLS1_2,
    cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.2115.10; Wed, 31
    Jul 2019 19:52:29 +0000
Authentication-Results: spf=pass (sender IP is 50.31.51.89)
    smtp.mailfrom=sendgrid.blabla.com; windowslive.com;
    dkim=pass (signature was
    verified) header.d=blabla.com;windowslive.com; dmarc=pass
    action=none
    header.from=blabla.com;
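The effect of the leading whitespace can be reproduced with a much smaller message. The sketch below uses a made-up message (RAW and its addresses are illustrative, not taken from the original mail): with the boundary parameter on an indented continuation line, and a blank line between headers and body, email.message_from_string finds both parts; with the indentation stripped, the parser stops reading headers at that line and no longer sees a boundary.

```python
import email

# Made-up message: the Content-Type header is folded, and its continuation
# line starts with a tab. A blank line separates the headers from the body.
RAW = (
    'From: "shop" <service@example.com>\n'
    "To: someone@example.com\n"
    "Subject: Thanks for your order!\n"
    "Content-Type: multipart/alternative;\n"
    '\tboundary="----=_Part_1"\n'
    "MIME-Version: 1.0\n"
    "\n"
    "------=_Part_1\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "Order Total: $6.89\n"
    "------=_Part_1\n"
    "Content-Type: text/html; charset=UTF-8\n"
    "\n"
    "<p>Order Total: $6.89</p>\n"
    "------=_Part_1--\n"
)

# Well-formed input: two sub-parts are found.
msg = email.message_from_string(RAW)
part_types = [part.get_content_type() for part in msg.get_payload()]

# Same text with the tab removed, as happens after a lossy copy-paste:
# the boundary line is no longer part of the Content-Type header.
broken = email.message_from_string(RAW.replace("\n\tboundary", "\nboundary"))

print(msg.is_multipart(), part_types)
print(broken.is_multipart())
```

Saving the message to a file from the mail client, rather than copying the source view, would usually avoid the problem, since the file keeps the original whitespace.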

Splitting a list of twitter data

I have a file full of hundreds of un-separated tweets all formatted like so:
{"text": "Just posted a photo # Navarre Conference Center", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}
I am trying to split them up so I can assign each part to a variable.
The text
The timestamp
The location coordinates
I was able to split the tweets up using .split('{}') but I don't really know how to split the rest into the three things that I want.
My basic idea that didn't work:
file = open('tweets_with_time.json' , 'r')
line = file.readline()
for line in file:
    line = line.split(',')
    message = (line[0])
    timestamp = (line[1])
    position = (line[2])
    #just to test if it's working
    print(position)
Thanks!
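Splitting on ',' can't work reliably on lines like this, because both the tweet text and the coordinates list may contain commas themselves. Since each line looks like a JSON object, one possible sketch (assuming one tweet per line; SAMPLE_LINE and parse_tweet are illustrative names, and the sample text is made up) is to decode the line with json and turn the timestamp into a datetime:

```python
import json
from datetime import datetime

# Made-up line in the same shape as the tweets; note the comma inside the text.
SAMPLE_LINE = (
    '{"text": "Just posted a photo, finally", '
    '"created_at": "Sun Nov 13 01:52:03 +0000 2016", '
    '"coordinates": [-86.8586, 30.40299]}'
)

# Format of the created_at field as it appears in the sample line.
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_tweet(line):
    """Return (text, created_at as an aware datetime, coordinates as a tuple)."""
    data = json.loads(line)
    created = datetime.strptime(data["created_at"], TIMESTAMP_FORMAT)
    position = tuple(data["coordinates"])
    return data["text"], created, position


if __name__ == "__main__":
    text, created, position = parse_tweet(SAMPLE_LINE)
    print(text)
    print(created.isoformat())
    print(position)
```

The three returned values correspond to the text, the timestamp, and the location coordinates asked about above.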
I just downloaded your file, it's not as bad as you said. Each tweet is on a separate line. It would be nicer if the file was a JSON list, but we can still parse it fairly easily, line by line. Here's an example that extracts the 1st 10 tweets.
import json
fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line into a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        #Only print the first 10 tweets
        if i == 10:
            break
Unfortunately, I can't show the output of this script: Stack Exchange won't allow me to put those shortened URLs into a post.
Here's a modified version that cuts off each message at the URL.
import json
fname = 'tweets_with_time.json'
with open(fname) as f:
    for i, line in enumerate(f, 1):
        # Convert this JSON line to a Python dict
        data = json.loads(line)
        # Extract the data
        message = data['text']
        timestamp = data['created_at']
        position = data['coordinates']
        # Remove the URL from the message
        idx = message.find('https://')
        if idx != -1:
            message = message[:idx]
        # Print it
        print(i)
        print('Message:', message)
        print('Timestamp:', timestamp)
        print('Position:', position)
        print()
        # Only print the first 10 tweets
        if i == 10:
            break
output
1
Message: Just posted a photo # Navarre Conference Center
Timestamp: Sun Nov 13 01:52:03 +0000 2016
Position: [-86.8586, 30.40299]
2
Message: I don't usually drink #coffee, but I do love a good #Vietnamese drip coffee with condense milk…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-123.04437109, 49.26211779]
3
Message: #bestcurry #johanvanaarde #kauai #rugby #surfing…
Timestamp: Sun Nov 13 01:52:04 +0000 2016
Position: [-159.4958861, 22.20321232]
4
Message: #thatonePerezwedding # Scenic Springs
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-98.68685568, 29.62182898]
5
Message: Miami trends now: Heat, Wade, VeteransDay, OneLetterOffBands and TheyMightBeACatfishIf.
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-80.19240081, 25.78111669]
6
Message: Thank you family for supporting my efforts. I love you all!…
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-117.83012, 33.65558157]
7
Message: If you're looking for work in #HONOLULU, HI, check out this #job:
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-157.7973653, 21.2868901]
8
Message: Drinking a L'Brett d'Apricot by #CrookedStave # FOBAB
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-87.6455, 41.8671]
9
Message: Can you recommend anyone for this #job? Barista (US) -
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-121.9766823, 38.350109]
10
Message: He makes me happy # Frank and Bank
Timestamp: Sun Nov 13 01:52:05 +0000 2016
Position: [-75.69360487, 45.41268776]
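The line-by-line approach above can also be wrapped in a small generator that skips blank lines and turns the created_at string into a datetime object. This is only a sketch: the function name read_tweets, the format constant, and the in-memory sample are made up for illustration, and it assumes every line has the three keys shown in the question and that each coordinate pair is [longitude, latitude], which matches the sample values. The %a and %b directives are locale-dependent, so this assumes an English or C locale.

```python
import io
import json
from datetime import datetime

# Layout of the "created_at" field, e.g. "Sun Nov 13 01:52:03 +0000 2016"
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def read_tweets(lines):
    """Yield (text, timestamp, coordinates) for each non-blank JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing in json.loads
        data = json.loads(line)
        timestamp = datetime.strptime(data["created_at"], CREATED_AT_FORMAT)
        yield data["text"], timestamp, data["coordinates"]

# Hypothetical in-memory sample standing in for tweets_with_time.json
sample = io.StringIO(
    '{"text": "Just posted a photo", "created_at": "Sun Nov 13 01:52:03 +0000 2016", "coordinates": [-86.8586, 30.40299]}\n'
    '\n'
    '{"text": "He makes me happy", "created_at": "Sun Nov 13 01:52:05 +0000 2016", "coordinates": [-75.69360487, 45.41268776]}\n'
)

tweets = list(read_tweets(sample))
for text, timestamp, (longitude, latitude) in tweets:
    print(text, timestamp.isoformat(), longitude, latitude)
```

To run it on the real file, pass an open file object instead of the sample, for example read_tweets(open('tweets_with_time.json', encoding='utf-8')).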
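It looks like well-formatted JSON data. If the whole file is a single JSON document (for example, one list of tweet objects), try the following:
import json
from pprint import pprint
# json.load() expects the file to contain exactly one JSON document;
# it raises an error if there is one object per line instead
with open('tweets_with_time.json', 'r') as file_ptr:
    data = json.load(file_ptr)
pprint(data)
In that case it parses your data into a list of Python dictionaries. You can access the elements by index and name like:
# Return the first 'coordinates' data point as a list of floats
data[0]["coordinates"]
# Return the 5th 'text' data point as a string
data[4]["text"]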

Python regex re.match() not returning any results

I'm hoping this is just something simple. I'm trying to determine whether or not an email is already encrypted.
# Read e-mail from stdin
raw = sys.stdin.read()
raw_message = email.message_from_string( raw )
I took the example from http://docs.python.org/2/howto/regex.html on doing a simple test for match.
p = re.compile('-----BEGIN\sPGP\sMESSAGE-----')
m = p.match(raw)
if m:
    log = open(cfg['logging']['file'], 'a')
    log.write("THIS IS ENCRYPTED")
    log.close()
else:
    log = open(cfg['logging']['file'], 'a')
    log.write("NOT ENCRYPTED:")
    log.close()
The email is read. The log file is written to but it always comes back no match. I've written raw to a logfile and that string is present.
Not sure where to go next.
UPDATE:
Here is the content of raw (a simple test message):
Sending email to: <bruce#packetaddiction.com>
Received: from localhost (localhost [127.0.0.1])
by mail2.packetaddiction.com (Postfix) with ESMTP id 5FE5D22A65
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 16:19:12 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at mail2.packetaddiction.com
Received: from mail2.packetaddiction.com ([127.0.0.1])
by localhost (mail2.packetaddiction.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id cc3zZ_izEb1j for <bruce#packetaddiction.com>;
Tue, 10 Sep 2013 16:19:06 +0000 (UTC)
Received: from mail.secryption.com (mail.secryption.com [178.18.24.223])
by mail2.packetaddiction.com (Postfix) with ESMTPS id 9CA3C22A5B
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 16:19:06 +0000 (UTC)
Received: from localhost (localhost.localdomain [127.0.0.1])
by mail.secryption.com (Postfix) with ESMTP id 9994E1421F81
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 12:19:19 -0400 (EDT)
X-Virus-Scanned: Debian amavisd-new at mail.secryption.com
Received: from mail.secryption.com ([127.0.0.1])
by localhost (mail.secryption.com [127.0.0.1]) (amavisd-new, port 10024)
with ESMTP id WbkVn_cowG6q for <bruce#packetaddiction.com>;
Tue, 10 Sep 2013 12:19:18 -0400 (EDT)
Received: from dennis.cng.int (mail.compassnetworkgroup.com [173.163.129.21])
(using TLSv1 with cipher RC4-MD5 (128/128 bits))
(No client certificate requested)
by mail.secryption.com (Postfix) with ESMTPSA id 5B4191421F80
for <bruce#packetaddiction.com>; Tue, 10 Sep 2013 12:19:18 -0400 (EDT)
User-Agent: K-9 Mail for Android
MIME-Version: 1.0
Content-Type: text/plain;
charset=UTF-8
Content-Transfer-Encoding: 8bit
Subject: Message
From: Bruce Markey <bruce#secryption.com>
Date: Tue, 10 Sep 2013 12:19:00 -0400
To: "bruce#packetaddiction.com" <bruce#packetaddiction.com>
Message-ID: <36615ed6-a1a9-49ac-ac85-31905916d478#email.android.com>
-----BEGIN PGP MESSAGE-----
Version: APG v1.0.8
hQEMAwPNxvNWsisWAQgAuOTLkiitYzhGJydOzN4sBoGjhRm9JeJMfmxKxKTKcV2W
ZBuN0z+nS1KxnXrIlahhwLtpiFvp5apI8wAyAiLC2BhFieFttOl1/xLVJbd1nI1o
KQE1RUXhPURejJ3eH9g/LmkhtFQcnsuHGTGnLi6dugBNhWLqgnLUBX+VLt6moz2C
84lDuQ1y7B/JFOctKRScUqmxDd8b2peZJOnVT/p0tSYNfN9QGH3W02FZShE4KKBl
HpezK8KC6cZdf34Eao+ep+fP5DuKx/4j3ksCbFKyQ3gd+yxK/xnhkijDsYCfFRiF
ElAGDvXu4RXqrKRpBxq1bRhU8YqS7j5593MTUViWitLAGgH1DV0UeA/B5LMUDRyz
4ZfDqd0kDYsPUy2Cg20HdXHaobkzdvHLzfqQq0Owc1nTcvu4nzCbIMhTAlZjn8ZA
aODTlKcvnFBWEtNERPm0x6nkbhMo3GeysejaJSRod3aGqhuhga4iIrrew1W03297
aalwY8RKeNoV15VItsyrbbT+HvDNSaFFCPUAs+KcLHCOez5/woozjlqKdBI6yHCe
gqpYJPP07qFsVviltfDO63xS48f2HCPe4iyXCy6Usp0+jM7zAzH7KH1O854GH46Q
r0A01DLo9REmDr4U
=pBQZ
-----END PGP MESSAGE-----
re.match only finds a match at the beginning of the string, as noted in the re module documentation. Here the PGP header comes after the mail headers, so you want re.search, which scans the whole string:
raw = """Sending email to: <bruce#packetaddiction.com>...
...
-----BEGIN PGP MESSAGE-----
...
"""
>>> p = re.compile('-----BEGIN\sPGP\sMESSAGE-----')
>>> m = p.search(raw)
>>> m
<_sre.SRE_Match object at 0x0000000002E02510>
>>> m.group()
'-----BEGIN PGP MESSAGE-----'
>>> m = p.match(raw)
>>> print m
None
Although, as noted, regex is likely overkill for this problem as the matching text is static.
Regular expressions are used when you want a "fuzzy" match - that is, you aren't sure if the string you are looking for will be identical every time.
Here, the string you are looking for appears to be exactly -----BEGIN PGP MESSAGE-----. For a fixed string like that, the str.find method is simpler to use and faster to boot.
>>> a = "This is a PGP encrypted email. -----BEGIN PGP MESSAGE----- !##$%^..."
>>> b = "This is not encrypted. My hovercraft is full of eels." #example strings
>>> a.find("-----BEGIN PGP MESSAGE-----")
31 # Return value '31' means that the search string was found at index 31 of source string
>>> b.find("-----BEGIN PGP MESSAGE-----")
-1 # -1 means 'not found in the source string'
>>>
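Putting the two answers together, here is a minimal Python 3 sketch of the check. The names PGP_HEADER, ARMOR_LINE, is_pgp_encrypted and has_armor_line are made up for illustration, and the stricter variant assumes you only want to count the header when it sits on a line of its own.

```python
import re

PGP_HEADER = "-----BEGIN PGP MESSAGE-----"

# Stricter variant: the header must occupy a whole line (optional trailing
# spaces, tabs, or a carriage return are tolerated).
ARMOR_LINE = re.compile(r"^-----BEGIN PGP MESSAGE-----[ \t]*\r?$", re.MULTILINE)

def is_pgp_encrypted(raw):
    """Plain substring test: True if the header appears anywhere in raw."""
    return PGP_HEADER in raw

def has_armor_line(raw):
    """True only if the header appears on a line of its own."""
    return ARMOR_LINE.search(raw) is not None

raw = "Subject: Message\n\n-----BEGIN PGP MESSAGE-----\nVersion: APG v1.0.8\n"
quoted = "He wrote -----BEGIN PGP MESSAGE----- in the middle of a sentence."

# re.match anchors at position 0, so it misses a header further down
print(re.match(re.escape(PGP_HEADER), raw))
# re.search scans the whole string
print(re.search(re.escape(PGP_HEADER), raw) is not None)

print(is_pgp_encrypted(raw), has_armor_line(raw))
print(is_pgp_encrypted(quoted), has_armor_line(quoted))
```

The substring test is enough for a quick yes/no; the anchored pattern avoids false positives when someone merely quotes the header inside ordinary text.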

Python : ( msg = email.message_from_string(aaa) ) values are returning ( None ) when trying to parse stuff from raw e-mail source

let's execute the script
python b.wsgi
result is:
None
None
that is the problem and here is the full script b.wsgi
aaa = """
From root#a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
for <ooo#a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root#localhost)
by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
Thu, 25 Jul 2013 19:28:59 -0700
From: root#a1.local.tld
Subject: oooooooooooooooo
To: ooo#a1.local.tld
Cc:
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861#a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"
This is a multi-part message in MIME format.
--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
ooooooooooooooooooooooooooooooooooooooooooooooo
--bound1374805739--
"""
import email
msg = email.message_from_string(aaa)
print msg['From']
print msg['To']
I tried changing it to
print msg['from']
print msg['to']
Same problem.
What might be the issue here?
Is it possible Python knows this "raw" string was edited by hand?
Very sneaky stuff going on here.
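The \n at the beginning of the string is causing the problem: the parser treats that leading blank line as the end of the header block, so no headers are found and everything after it ends up in the body. Stripping the surrounding whitespace fixes it. Try this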
>>> msg = email.message_from_string(aaa.strip())
>>> msg.keys()
['Received', 'Received', 'From', 'Subject', 'To', 'Cc', 'X-Originating-IP', 'X-Mailer', 'Message-Id', 'Date', 'MIME-Version', 'Content-Type']
>>> msg['From']
'root#a1.local.tld'
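The effect is easy to reproduce with a much smaller message. The following is a Python 3 sketch with a made-up three-header message and placeholder addresses; it uses lstrip() because only the leading newline matters here.

```python
import email

# Triple-quoted string that, like the one in the question, starts with "\n"
raw = """
From: root@a1.example
To: ooo@a1.example
Subject: test

body text
"""

broken = email.message_from_string(raw)          # leading blank line kept
fixed = email.message_from_string(raw.lstrip())  # leading blank line removed

print(broken.keys())   # no headers were recognised
print(broken["From"])  # None
print(fixed.keys())
print(fixed["From"])
```

In the broken case the header lines are not lost; they simply show up at the start of broken.get_payload() because the parser considered them part of the body.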
