I wrote a Python script to fetch all of my Gmail. I have hundreds of thousands of old emails, of which about 10,000 were unread.
After successfully fetching all of my email, I found that Gmail had marked every fetched email as "read". This is disastrous for me, since I need to go through the unread emails only.
How can I recover the information about which emails were unread? I dumped each mail object to a file; the core of my code is shown below:
m = imaplib.IMAP4_SSL("imap.gmail.com")
m.select("[Gmail]/All Mail")
resp, items = m.uid('search', None, 'ALL')
uids = items[0].split()
for uid in uids:
    resp, data = m.uid('fetch', uid, "(RFC822)")
    email_body = data[0][1]
    mail = email.message_from_string(email_body)
    dumbobj(uid, mail)
I am hoping there is either an option to undo this in gmail, or a member inside the stored mail objects reflecting the seen-state information.
For anyone looking to prevent this headache, consider this answer here. This does not work for me, however, since the damage has already been done.
I have written the following function to recursively "grep" all strings in an object, and applied it to a dumped email object using the following keywords:
regex = "(?i)((marked)|(seen)|(unread)|(read)|(flag)|(delivered)|(status)|(sate))"
So far, no results (only an unrelated "Delivered-To"). Which other keywords could I try?
def grep_object (obj, regex , cycle = set(), matched = set()):
    import re
    if id(obj) in cycle:
        return matched
    cycle.add(id(obj))
    if isinstance(obj, basestring):
        if re.search(regex, obj):
            matched.add(obj)
    def grep_dict (adict ):
        try:
            [ [ grep_object(a, regex, cycle, matched ) for a in ab ] for ab in adict.iteritems() ]
        except: pass
    grep_dict(obj)
    try:
        grep_dict(obj.__dict__)
    except: pass
    try:
        [ grep_object(elm, regex, cycle, matched ) for elm in obj ]
    except: pass
    return matched
grep_object(mail_object, regex)
I'm having a similar problem (not with Gmail), and the hardest part for me was building a reproducible test case; I finally managed to produce one (see below).
In terms of the Seen flag, I now gather it goes like this:
If a message is new/unseen, IMAP fetch for \Seen flag will return empty (i.e. it will not be present, as related to the email message).
If you do an IMAP SELECT on a mailbox (INBOX), the server may return an UNSEEN response code giving the sequence number of the first message in that folder that is new (does not have the \Seen flag); to list all such messages, use SEARCH UNSEEN
In my test case, if you fetch say headers for a message with BODY.PEEK, then \Seen on a message is not set; if you fetch them with BODY, then \Seen is set
In my test case, also fetching (RFC822) doesn't set \Seen (unlike your case with Gmail)
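The flag behaviour described above can be sketched in Python as follows. This is a minimal sketch, not code from the test case: `is_seen` and `fetch_without_marking_read` are hypothetical helper names, `conn` is assumed to be a logged-in `imaplib` connection with a mailbox already selected, and the sample response lines only mimic the shape of what the test server prints.

```python
import imaplib


def is_seen(flags_line):
    """Return True if a FETCH (FLAGS) response line carries the \\Seen flag."""
    # imaplib.ParseFlags extracts the flag list from a raw FLAGS response.
    return b"\\Seen" in imaplib.ParseFlags(flags_line)


def fetch_without_marking_read(conn, uid):
    """Fetch a full message by UID without setting \\Seen (hypothetical helper).

    BODY.PEEK[] returns the same data as BODY[] but leaves the flags untouched.
    """
    typ, data = conn.uid("fetch", uid, "(BODY.PEEK[])")
    if typ != "OK":
        raise RuntimeError("fetch failed for UID {0!r}".format(uid))
    return data[0][1]


# Response lines in the shape the test server prints:
print(is_seen(b"1 (FLAGS ())"))         # a new/unseen message
print(is_seen(b"1 (FLAGS (\\Seen))"))   # a message already marked as read
```

Opening the mailbox with `conn.select(mailbox, readonly=True)` is another way to keep a session from changing flags, at the cost of not being able to store flags in that session.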
In the test case, I call pprint.pprint(inspect.getmembers(mail)) (in lieu of your dumpobj(uid, mail)), but only after I'm certain \Seen has been set. The output I get is posted in mail_object_inspect.txt and, as far as I can see, there is no mention of 'new/read/seen' etc. in any of the readable fields; furthermore, mail.as_string() prints:
'From: jesse@example.com\nTo: user@example.com\nSubject: This is a test message!\n\nHello. I am executive assistant to the director of\nBear Stearns, a failed investment Bank. I have\naccess to USD6,000,000. ...\n'
Even worse, there is no mention of "fields" anywhere in the imaplib code (below filenames are printed if they do not contain case-insensitive "field" anywhere):
$ grep -L -i field /usr/lib/python{2.7,3.2}/imaplib.py
... so I guess that information was not saved with your dumps.
Here is a bit on reconstructing the test case. The hardest part was finding a small IMAP server that can be run quickly with some arbitrary users and emails, without having to install a ton of stuff on your system. Finally I found one: trivial-server.pl, the example file of Perl's Net::IMAP::Server; tested on Ubuntu 11.04.
The test case is pasted in this gist, with two files (with many comments) that I'll try to post abridged:
trivial-serverB.pl - Perl (v5.10.1) Net::IMAP::Server server (has a terminal output paste at end of file with a telnet client session)
testimap.py - Python 2.7/3.2 imaplib client (has a terminal output paste at end of file, of itself operating with the server)
First, make sure you have Net::IMAP::Server - note, it has many dependencies, so the below command may take a while to install:
sudo perl -MCPAN -e 'install Net::IMAP::Server'
Then, in the directory where you got trivial-serverB.pl, create a subdirectory with SSL certificates:
mkdir certs
openssl req \
-x509 -nodes -days 365 \
-subj '/C=US/ST=Oregon/L=Portland/CN=localhost' \
-newkey rsa:1024 -keyout certs/server-key.pem -out certs/server-cert.pem
Finally, run the server with administrative privileges:
sudo perl trivial-serverB.pl
Note that trivial-serverB.pl has a hack which lets a client connect without SSL. Here is trivial-serverB.pl:
use v5.10.1;
use feature qw(say);
use Net::IMAP::Server;
package Demo::IMAP::Hack;
$INC{'Demo/IMAP/Hack.pm'} = 1;
sub capabilityb {
    my $self = shift;
    print STDERR "Capabilitin'\n";
    my $base = $self->server->capability;
    my @words = split " ", $base;
    @words = grep {$_ ne "STARTTLS"} @words
        if $self->is_encrypted;
    unless ($self->auth) {
        my $auth = $self->auth || $self->server->auth_class->new;
        my @auth = $auth->sasl_provides;
        # hack:
        #unless ($self->is_encrypted) {
        #    # Lack of encryption makes us turn off all plaintext auth
        #    push @words, "LOGINDISABLED";
        #    @auth = grep {$_ ne "PLAIN"} @auth;
        #}
        push @words, map {"AUTH=$_"} @auth;
    }
    return join(" ", @words);
}
package Demo::IMAP::Auth;
$INC{'Demo/IMAP/Auth.pm'} = 1;
use base 'Net::IMAP::Server::DefaultAuth';
sub auth_plain {
    my ( $self, $user, $pass ) = @_;
    return 1;
}
package Demo::IMAP::Model;
$INC{'Demo/IMAP/Model.pm'} = 1;
use base 'Net::IMAP::Server::DefaultModel';
sub init {
my $self = shift;
$self->root( Demo::IMAP::Mailbox->new() );
$self->root->add_child( name => "INBOX" );
package Demo::IMAP::Mailbox;
use base qw/Net::IMAP::Server::Mailbox/;
use Data::Dumper;
my $data = <<'EOF';
From: jesse@example.com
To: user@example.com
Subject: This is a test message!

Hello. I am executive assistant to the director of
Bear Stearns, a failed investment Bank. I have
access to USD6,000,000. ...
EOF
my $msg = Net::IMAP::Server::Message->new($data);
sub load_data {
my $self = shift;
my %ports = ( port => 143, ssl_port => 993 );
$ports{$_} *= 10 for grep {$> > 0} keys %ports;
$myserv = Net::IMAP::Server->new(
auth_class => "Demo::IMAP::Auth",
model_class => "Demo::IMAP::Model",
user => 'nobody',
log_level => 3, # at least 3 to output 'CONNECT TCP Peer: ...' message; 4 to output IMAP commands too
# apparently, this overload MUST be after the new?! here:
no strict 'refs';
*Net::IMAP::Server::Connection::capability = \&Demo::IMAP::Hack::capabilityb;
# https://stackoverflow.com/questions/27206371/printing-addresses-of-perl-object-methods
say " -", $myserv->can('validate'), " -", $myserv->can('capability'), " -", \&Net::IMAP::Server::Connection::capability, " -", \&Demo::IMAP::Hack::capabilityb;
With the server above running in one terminal, in another terminal you can just do:
python testimap.py
The code simply reads the flags and content of the one (and only) message the server above presents, and finally restores the message to unseen by removing the \Seen flag.
import sys
if sys.version_info[0] < 3: # python 2.7
    def uttc(x):
        return x
else: # python 3+
    def uttc(x):
        return x.decode("utf-8")
import imaplib
import email
import pprint,inspect
imap_user = 'nobody'
imap_password = 'whatever'
imap_server = 'localhost'
conn = imaplib.IMAP4(imap_server)
conn.debug = 3
(retcode, capabilities) = conn.login(imap_user, imap_password)
# not conn.select(readonly=1), else we cannot modify the \Seen flag later
conn.select() # Select inbox or default namespace
(retcode, messages) = conn.search(None, '(UNSEEN)')
if retcode == 'OK':
    for num in uttc(messages[0]).split(' '):
        if not(num):
            print("No messages available: num is `{0}`!".format(num))
            break
        print('Processing message: {0}'.format(num))
        typ, data = conn.fetch(num,'(FLAGS)')
        isSeen = ( "Seen" in uttc(data[0]) )
        print('Got flags: {2}: {0} .. {1}'.format(typ,data, # NEW: OK .. ['1 (FLAGS ())']
            "Seen" if isSeen else "NEW"))
        print('Peeking headers, message: {0} '.format(num))
        typ, data = conn.fetch(num,'(BODY.PEEK[HEADER])')
        typ, data = conn.fetch(num,'(FLAGS)')
        isSeen = ( "Seen" in uttc(data[0]) )
        print('Got flags: {2}: {0} .. {1}'.format(typ,data, # NEW: OK .. ['1 (FLAGS ())']
            "Seen" if isSeen else "NEW"))
        print('Get RFC822 body, message: {0} '.format(num))
        typ, data = conn.fetch(num,'(RFC822)')
        mail = email.message_from_string(uttc(data[0][1]))
        typ, data = conn.fetch(num,'(FLAGS)')
        isSeen = ( "Seen" in uttc(data[0]) )
        print('Got flags: {2}: {0} .. {1}'.format(typ,data, # NEW: OK .. ['1 (FLAGS ())']
            "Seen" if isSeen else "NEW"))
        print('Get headers, message: {0} '.format(num))
        typ, data = conn.fetch(num,'(BODY[HEADER])') # note, FLAGS (\\Seen) is now in data, even if not explicitly requested!
        print('Get RFC822 body, message: {0} '.format(num))
        typ, data = conn.fetch(num,'(RFC822)')
        mail = email.message_from_string(uttc(data[0][1]))
        pprint.pprint(inspect.getmembers(mail)) # this is in mail_object_inspect.txt
        typ, data = conn.fetch(num,'(FLAGS)')
        isSeen = ( "Seen" in uttc(data[0]) )
        print('Got flags: {2}: {0} .. {1}'.format(typ,data, # Seen: OK .. ['1 (FLAGS (\\Seen))']
            "Seen" if isSeen else "NEW"))
        conn.select() # select again, to see flags server side
        # * OK [UNSEEN 0] # no more unseen messages (if there was only one msg in folder)
        print('Restoring flag to unseen/new, message: {0} '.format(num))
        ret, data = conn.store(num,'-FLAGS','\\Seen')
        if ret == 'OK':
            print("Set back to unseen; Got OK: {0}{1}{2}".format(data,'\n',30*'-'))
        typ, data = conn.fetch(num,'(FLAGS)')
        isSeen = ( "Seen" in uttc(data[0]) )
        print('Got flags: {2}: {0} .. {1}'.format(typ,data, # NEW: OK .. [b'1 (FLAGS ())']
            "Seen" if isSeen else "NEW"))
How do I mock an IMAP server in Python, despite extreme laziness?
Get only NEW Emails imaplib and python
Undoing "marked as read" status of emails fetched with imaplib
I have a method that sends an email and I want to write a parametrized pytest test for it, but when I run my test it doesn't run, and the log shows that no tests ran. What's wrong with my test?
pytest scenario:
import pytest
import new_version
@pytest.mark.parametrize("email", ['my_email@gmail.com'])
def send_email(email):
    new_version.send_mail(email)
method that sends email:
# Imports
import smtplib
import time
# Language
error_title = "ERROR"
send_error = "I'm waiting {idle} and I will try again."
send_error_mail = "Mail invalid"
send_error_char = "Invalid character"
send_error_connection_1_2 = "Connection problem..."
send_error_connection_2_2 = "Gmail server is down or internet connection is instabil."
login_browser = "Please log in via your web browser and then try again."
login_browser_info = "That browser and this software shood have same IP connection first time."
# Gmaild ID fro login
fromMail = "myemail@gmail.com"
fromPass = "pass"
# Some configurations
mailDelay = 15
exceptionDelay = 180
def send_mail(email, thisSubject="Just a subject",
thisMessage="This is just a simple message..."):
# To ho to send mails
mailTo = [
# If still have mails to send
while len(mailTo) != 0:
sendItTo = mailTo[0] # Memorise what mail will be send it (debug purpose)
# Connect to the server
server = smtplib.SMTP("smtp.gmail.com:587")
# Sign In
server.login(fromMail, fromPass)
# Set the message
message = f"Subject: {thisSubject}\n{thisMessage}"
# Send one mail
server.sendmail(fromMail, mailTo.pop(0), message)
# Sign Out
# If is a problem
except Exception as e:
# Convert error in a string for som checks
e = str(e)
# Show me if...
if "The recipient address" in e and "is not a valid" in e:
print(f"\n>>> {send_error_mail} [//> {sendItTo}\n")
elif "'ascii'" in e and "code can't encode characters" in e:
print(f"\n>>> {send_error_char} [//> {sendItTo}\n")
elif "Please" in e and "log in via your web browser" in e:
print(f"\n>>> {login_browser}\n>>> - {login_browser_info}")
elif "[WinError 10060]" in e:
if "{idle}" in send_error:
se = send_error.split("{idle}");
seMsg = f"{se[0]}{exceptionDelay} sec.{se[1]}"
seMsg = send_error
print(f"\n>>> {send_error_connection_1_2}\n>>> {send_error_connection_2_2}")
print(f">>> {seMsg}\n")
# Wait 5 minutes
waitTime = exceptionDelay - mailDelay
if waitTime <= 0:
waitTime = exceptionDelay
if "{idle}" in send_error:
se = send_error.split("{idle}");
seMsg = f"{se[0]}{exceptionDelay} sec.{se[1]}"
seMsg = send_error
print(f">>> {error_title} <<<", e)
print(f">>> {seMsg}\n")
# Wait 5 minutes
# If are still mails wait before to send another one
if len(mailTo) != 0:
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.8.1, pluggy-0.13.1 -- /Users/nikolai/Documents/Python/gmail/venv/bin/python
cachedir: .pytest_cache
rootdir: /Users/nikolai/Documents/Python/gmail
collected 0 items
Unless you change the discovery prefixes for Pytest, test functions must be named test_*:
import pytest
import new_version
@pytest.mark.parametrize("email", ['my_email@gmail.com'])
def test_send_email(email):
    new_version.send_mail(email)
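Once the test is collected, it will try to talk to a real SMTP server. One possible way to avoid that, sketched here under assumptions, is to replace `smtplib.SMTP` with a mock while the test runs. `send_one` is a hypothetical, simplified stand-in for the `send_mail` function above, and the sender address and password are placeholders; only the `test_` prefix is what pytest needs for discovery.

```python
import smtplib
from unittest import mock


def send_one(recipient, subject="Just a subject", body="This is just a simple message..."):
    """Hypothetical, simplified stand-in for new_version.send_mail."""
    server = smtplib.SMTP("smtp.gmail.com", 587)
    server.starttls()
    server.login("sender@example.com", "placeholder-password")
    message = "Subject: {0}\n\n{1}".format(subject, body)
    server.sendmail("sender@example.com", recipient, message)
    server.quit()


def test_send_one_goes_through_smtp():
    # The name starts with test_, so default pytest discovery collects it.
    with mock.patch("smtplib.SMTP") as smtp_cls:
        send_one("my_email@gmail.com")
    server = smtp_cls.return_value
    smtp_cls.assert_called_once_with("smtp.gmail.com", 587)
    server.login.assert_called_once()
    # sendmail(from_addr, to_addr, message): check the recipient argument.
    assert server.sendmail.call_args[0][1] == "my_email@gmail.com"
    server.quit.assert_called_once()
```

Because the mock stands in for the connection, the test checks only that the function drives `smtplib` as expected, not that Gmail accepts the message.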
I've been using the following code, taken from the pynetdicom library, to query a remote server (SCP) and store some images on my machine (SCU).
Query/Retrieve SCU AE example.
This demonstrates a simple application entity that support the Patient
Root Find and Move SOP Classes as SCU. In order to receive retrieved
datasets, this application entity must support the CT Image Storage
SOP Class as SCP as well. For this example to work, there must be an
SCP listening on the specified host and port.
For help on usage,
python qrscu.py -h
import argparse
from netdicom.applicationentity import AE
from netdicom.SOPclass import *
from dicom.dataset import Dataset, FileDataset
from dicom.UID import ExplicitVRLittleEndian, ImplicitVRLittleEndian, \
import netdicom
# netdicom.debug(True)
import tempfile
# parse commandline
parser = argparse.ArgumentParser(description='storage SCU example')
parser.add_argument('remoteport', type=int)
parser.add_argument('-p', help='local server port', type=int, default=9999)
parser.add_argument('-aet', help='calling AE title', default='PYNETDICOM')
parser.add_argument('-aec', help='called AE title', default='REMOTESCU')
parser.add_argument('-implicit', action='store_true',
help='negociate implicit transfer syntax only',
parser.add_argument('-explicit', action='store_true',
help='negociate explicit transfer syntax only',
args = parser.parse_args()
if args.implicit:
ts = [ImplicitVRLittleEndian]
elif args.explicit:
ts = [ExplicitVRLittleEndian]
ts = [
# call back
def OnAssociateResponse(association):
print "Association response received"
def OnAssociateRequest(association):
print "Association resquested"
return True
def OnReceiveStore(SOPClass, DS):
print "Received C-STORE", DS.PatientName
# do something with dataset. For instance, store it.
file_meta = Dataset()
file_meta.MediaStorageSOPClassUID = '1.2.840.10008.'
# !! Need valid UID here
file_meta.MediaStorageSOPInstanceUID = "1.2.3"
# !!! Need valid UIDs here
file_meta.ImplementationClassUID = ""
filename = '%s/%s.dcm' % (tempfile.gettempdir(), DS.SOPInstanceUID)
ds = FileDataset(filename, {},
file_meta=file_meta, preamble="\0" * 128)
#ds.is_little_endian = True
#ds.is_implicit_VR = True
print "File %s written" % filename
# must return appropriate status
return SOPClass.Success
# create application entity
MyAE = AE(args.aet, args.p, [PatientRootFindSOPClass,
VerificationSOPClass], [StorageSOPClass], ts)
MyAE.OnAssociateResponse = OnAssociateResponse
MyAE.OnAssociateRequest = OnAssociateRequest
MyAE.OnReceiveStore = OnReceiveStore
# remote application entity
RemoteAE = dict(Address=args.remotehost, Port=args.remoteport, AET=args.aec)
# create association with remote AE
print "Request association"
assoc = MyAE.RequestAssociation(RemoteAE)
# perform a DICOM ECHO
print "DICOM Echo ... ",
st = assoc.VerificationSOPClass.SCU(1)
print 'done with status "%s"' % st
print "DICOM FindSCU ... ",
d = Dataset()
d.PatientsName = args.searchstring
d.QueryRetrieveLevel = "PATIENT"
d.PatientID = "*"
st = assoc.PatientRootFindSOPClass.SCU(d, 1)
print 'done with status "%s"' % st
for ss in st:
if not ss[1]:
# print ss[1]
d.PatientID = ss[1].PatientID
print "Moving"
print d
assoc2 = MyAE.RequestAssociation(RemoteAE)
gen = assoc2.PatientRootMoveSOPClass.SCU(d, 'SAMTEST', 1)
for gg in gen:
print gg
print "QWEQWE"
print "Release association"
# done
Running the program, I get the following output:
Request association
Association response received
DICOM Echo ... done with status "Success "
DICOM FindSCU ... done with status "<generator object SCU at 0x106014c30>"
(0008, 0052) Query/Retrieve Level CS: 'PATIENT'
(0010, 0010) Patient's Name PN: 'P*'
(0010, 0020) Patient ID LO: 'Pat00001563'
Association response received
(0008, 0052) Query/Retrieve Level CS: 'PATIENT'
(0010, 0010) Patient's Name PN: 'P*'
(0010, 0020) Patient ID LO: 'Pat00002021'
Association response received
Release association
Echo works so I know my associations are working and I am able to query and see the files on the server as suggested by the output. As you can see however, OnReceiveStore does not seem to be called. I am pretty new to DICOM and I was wondering what might be the case. Correct me if I am wrong but I think the line gen = assoc2.PatientRootMoveSOPClass.SCU(d, 'SAMTEST', 1) is supposed to invoke OnReceiveStore. If not, some insight on how C-STORE should be called is appreciated.
DICOM C-MOVE is a complicated operation if you are not very familiar with DICOM.
You do not call OnReceiveStore yourself; it is called when the SCP starts sending instances to your application. You issue a C-MOVE command to the SCP with your own AE title, where you want to receive the images (in your case SAMTEST). Since the AE title is the only destination parameter of the C-MOVE command, the SCP needs to be configured beforehand to know the IP address and port of the SAMTEST AE. Have you done this?
I don't know why the code is not printing the SCP's responses to the C-MOVE command; it looks like it should. Those responses could give a good indication of what is happening. You can also check the logs on the SCP side to see what's going on.
Also, here is some quite good reading about C-MOVE operation.
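To make the point about the move destination concrete, here is a small conceptual sketch of the lookup an SCP has to perform when it receives a C-MOVE. It does not use pynetdicom; `resolve_move_destination` and `scp_config` are hypothetical names, and the address is an example value. The status code 0xA801 is the one DICOM defines for "Refused: Move Destination unknown".

```python
# DICOM C-MOVE status codes used in this sketch.
SUCCESS = 0x0000
MOVE_DESTINATION_UNKNOWN = 0xA801  # "Refused: Move Destination unknown"


def resolve_move_destination(ae_title, known_aes):
    """Return (status, (host, port)) for the AE title named in a C-MOVE.

    known_aes is a hypothetical stand-in for the SCP's configuration:
    a dict mapping AE titles to (host, port) pairs.
    """
    # AE titles may be padded with spaces, so compare the stripped form.
    destination = known_aes.get(ae_title.strip())
    if destination is None:
        return MOVE_DESTINATION_UNKNOWN, None
    return SUCCESS, destination


# Example configuration on the SCP side (address and port are made up).
scp_config = {"SAMTEST": ("192.0.2.10", 9999)}
print(resolve_move_destination("SAMTEST", scp_config))
print(resolve_move_destination("OTHER", scp_config))
```

If the SCP has no entry for SAMTEST, it has nowhere to open the C-STORE association to, so OnReceiveStore on the requesting side is never triggered.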
I am new to Python and my networking knowledge is at a beginner level. I have an HTTP server running in a VM, and when I curl it from a different terminal on the same machine, I get the expected response. I would like to get the same response on my mobile device when I type the IP and port into the browser. My mobile device is connected to the same WiFi. Here's the server code:
import socket
MAX_PACKET = 32768
def recv_all(sock):
r'''Receive everything from `sock`, until timeout occurs, meaning sender
is exhausted, return result as string.'''
# dirty hack to simplify this stuff - you should really use zero timeout,
# deal with async socket and implement finite automata to handle incoming data
prev_timeout = sock.gettimeout()
rdata = []
while True:
except socket.timeout:
return ''.join(rdata)
# unreachable
def normalize_line_endings(s):
    r'''Convert string containing various line endings like \n, \r or \r\n,
    to uniform \n.'''
    return ''.join((line + '\n') for line in s.splitlines())
def run():
r'''Main loop'''
# Create TCP socket listening on 10000 port for all connections,
# with connection queue of length 1
server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM, \
while True:
# accept connection
client_sock, client_addr = server_sock.accept()
# headers and body are divided with \n\n (or \r\n\r\n - that's why we
# normalize endings). In real application usage, you should handle
# all variations of line endings not to screw request body
request = normalize_line_endings(recv_all(client_sock)) # hack again
request_head, request_body = request.split('\n\n', 1)
# first line is request headline, and others are headers
request_head = request_head.splitlines()
request_headline = request_head[0]
# headers have their name up to first ': '. In real world uses, they
# could duplicate, and dict drops duplicates by default, so
# be aware of this.
request_headers = dict(x.split(': ', 1) for x in request_head[1:])
# headline has form of "POST /can/i/haz/requests HTTP/1.0"
request_method, request_uri, request_proto = request_headline.split(' ', 3)
response_body = [
'<html><body><h1>Hello, world!</h1>',
'<p>This page is in location %(request_uri)r, was requested ' % locals(),
'using %(request_method)r, and with %(request_proto)r.</p>' % locals(),
'<p>Request body is %(request_body)r</p>' % locals(),
'<p>Actual set of headers received:</p>',
for request_header_name, request_header_value in request_headers.iteritems():
response_body.append('<li><b>%r</b> == %r</li>' % (request_header_name, \
response_body_raw = ''.join(response_body)
# Clearly state that connection will be closed after this response,
# and specify length of response body
response_headers = {
'Content-Type': 'text/html; encoding=utf8',
'Content-Length': len(response_body_raw),
'Connection': 'close',
response_headers_raw = ''.join('%s: %s\n' % (k, v) for k, v in \
# Reply as HTTP/1.1 server, saying "HTTP OK" (code 200).
response_proto = 'HTTP/1.1'
response_status = '200'
response_status_text = 'OK' # this can be random
# sending all this stuff
client_sock.send('%s %s %s' % (response_proto, response_status, \
client_sock.send('\n') # to separate headers from body
# and closing connection, as we stated before
Here's the response when I run curl from a different terminal on the same VM.
I want to ping it from my mobile device connected to the same WiFi. Thank you!
I was looking for a fuzzing library and happened to come across "boofuzz", though there are no examples of how to use the library for HTTP fuzzing.
This is the only code I see in their github page, but they say it was taken from sulley (an old fuzzing library):
import sys
sys.path.insert(0, '../')
from boofuzz.primitives import String, Static, Delim
class Group(object):
blocks = []
def __init__(self, name, definition=None):
self.name = name
if definition:
self.definition = definition
def add_definition(self, definition):
assert isinstance(definition, (list, tuple)), "Definition must be a list or a tuple!"
self.definition = definition
def render(self):
return "".join([x.value for x in self.definition])
def exhaust(self):
for item in self.definition:
while item.mutate():
current_value = item.value
recv_data = self.send_buffer(current_value)
def __repr__(self):
return '<%s [%s items]>' % (self.__class__.__name__, len(self.definition))
# noinspection PyMethodMayBeStatic
def send_buffer(self, current_value):
return "Sent %s!" % current_value
def log_send(self, current_value):
def log_recv(self, recv_data):
s_static = Static
s_delim = Delim
s_string = String
CloseHeader = Group(
"HTTP Close Header",
# GET / HTTP/1.1\r\n
s_static("GET / HTTP/1.1\r\n"),
# Connection: close
s_static("Connection"), s_delim(":"), s_delim(" "), s_string("close"),
OpenHeader = Group(
"HTTP Open Header",
# GET / HTTP/1.1\r\n
Static("GET / HTTP/1.1\r\n"),
# Connection: close
Static("Connection"), Delim(":"), Delim(" "), String("open"),
# CloseHeader = Group("HTTP Close Header")
# CloseHeader.add_definition([
# # GET / HTTP/1.1\r\n
# s_static("GET / HTTP/1.1\r\n"),
# # Connection: close
# s_static("Connection"), s_delim(":"), s_delim(" "), s_string("close"),
# s_static("\r\n\r\n")
# ])
Why would they post it if it's another library's code? And is there a good explanation of how to work with the boofuzz library?
If you Google "http protocol format", the first result right now is this HTTP tutorial. If you read a few pages there, you can get a pretty good description of the protocol format. Based on that, I wrote the following fuzz script, source code here:
#!/usr/bin/env python
# Designed for use with boofuzz v0.0.9
from boofuzz import *


def main():
    session = Session(
        target=Target(
            connection=SocketConnection("", 80, proto='tcp')
        ),
    )

    s_initialize(name="Request")
    with s_block("Request-Line"):
        s_group("Method", ['GET', 'HEAD', 'POST', 'PUT', 'DELETE', 'CONNECT', 'OPTIONS', 'TRACE'])
        s_delim(" ", name='space-1')
        s_string("/index.html", name='Request-URI')
        s_delim(" ", name='space-2')
        s_string('HTTP/1.1', name='HTTP-Version')
        s_static("\r\n", name="Request-Line-CRLF")
    s_static("\r\n", "Request-CRLF")

    session.connect(s_get("Request"))
    session.fuzz()


if __name__ == "__main__":
    main()
I got tripped up for a while because I only had one CRLF. After checking RFC 2616 (Section 5), it's pretty clear that this example should end with two CRLFs:
Request       = Request-Line              ; Section 5.1
                *(( general-header        ; Section 4.5
                 | request-header         ; Section 5.3
                 | entity-header ) CRLF)  ; Section 7.1
                CRLF
                [ message-body ]          ; Section 4.3
Request-Line  = Method SP Request-URI SP HTTP-Version CRLF
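To see the two CRLFs outside of boofuzz, here is a minimal sketch that serializes a body-less request according to the grammar above. `build_request` is a hypothetical helper written for illustration, not part of boofuzz.

```python
def build_request(method, uri, version="HTTP/1.1", headers=None):
    """Serialize a request without a body, following RFC 2616 Section 5."""
    lines = ["{0} {1} {2}".format(method, uri, version)]  # Request-Line
    for name, value in (headers or {}).items():
        lines.append("{0}: {1}".format(name, value))       # header lines
    # Every line ends with CRLF; one more CRLF terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


print(repr(build_request("GET", "/index.html")))
print(repr(build_request("GET", "/", headers={"Connection": "close"})))
```

With no headers at all, the Request-Line's own CRLF is immediately followed by the terminating CRLF, which is why the fuzz script needs both the Request-Line-CRLF and the Request-CRLF primitives.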
Obviously, this fuzz script doesn't come close to covering the whole protocol. Just a few things that could be added:
HTTP Headers (there are a lot)
Specialized formats for each HTTP Method
A message body (e.g. on POST)
Some way to choose valid URIs for the particular target server
Report warnings based on server response (could get noisy, but server errors do tend to indicate... errors)
I have an Outlook PST file, and I'd like to get a json of the emails, e.g. something like
{"emails": [
{"from": "alice@example.com",
"to": "bob@example.com",
"bcc": "eve@example.com",
"subject": "mitm",
"content": "be careful!"
}, ...]}
I've thought of using readpst to convert to MH format and then scanning it with a Ruby/Python/Bash script. Is there a better way?
Unfortunately the ruby-msg gem doesn't work on my PST files (and looks like it wasn't updated since 2014).
I found a way to do it in 2 stages, first convert to mbox and then to json:
# requires installing libpst
pst2json my.pst
# or you can specify a custom output dir and an outlook mail folder,
# e.g. Inbox, Sent, etc.
pst2json -o email/ -f Inbox my.pst
Where pst2json is my script and mbox2json is slightly modified from Mining the Social Web.
#!/usr/bin/env bash
echo "usage: $(basename $0) [-o <output-dir>] [-f <folder>] <pst-file>"
echo "default output-dir: email/mbox-all/<pst-file>"
echo "default folder: Inbox"
exit 1
which readpst || { echo "Error: libpst not installed"; exit 1; }
while (( $# > 0 )); do
[[ -n "$pst_file" ]] && usage
case "$1" in
if [[ -n "$2" ]]; then
shift 2
if [[ -n "$2" ]]; then
shift 2
default_out_dir="email/mbox-all/$(basename $pst_file)"
mkdir -p "$out_dir"
readpst -o "$out_dir" "$pst_file"
[[ -f "$out_dir/$folder" ]] || { echo "Error: folder $folder is missing or empty."; exit 1; }
mbox2json "$out_dir/$folder" "$res" && echo "Success: result saved to $res"
mbox2json (python 2.7):
# -*- coding: utf-8 -*-
import sys
import mailbox
import email
import quopri
import json
from BeautifulSoup import BeautifulSoup
MBOX = sys.argv[1]
OUT_FILE = sys.argv[2]
def cleanContent(msg):
    # Decode message from "quoted printable" format
    msg = quopri.decodestring(msg)
    # Strip out HTML tags, if any are present
    soup = BeautifulSoup(msg)
    return ''.join(soup.findAll(text=True))
def jsonifyMessage(msg):
json_msg = {'parts': []}
for (k, v) in msg.items():
json_msg[k] = v.decode('utf-8', 'ignore')
# The To, CC, and Bcc fields, if present, could have multiple items
# Note that not all of these fields are necessarily defined
for k in ['To', 'Cc', 'Bcc']:
if not json_msg.get(k):
json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r'
, '').replace(' ', '').decode('utf-8', 'ignore').split(',')
for part in msg.walk():
json_part = {}
if part.get_content_maintype() == 'multipart':
type = part.get_content_type()
if SKIP_HTML and type == 'text/html':
json_part['contentType'] = type
content = part.get_payload(decode=False).decode('utf-8', 'ignore')
json_part['content'] = cleanContent(content)
except Exception, e:
sys.stderr.write('Skipping message - error encountered (%s)\n' % (str(e), ))
return json_msg
# There's a lot of data to process, so use a generator to do it. See http://wiki.python.org/moin/Generators
# Using a generator requires a trivial custom encoder be passed to json for serialization of objects
class Encoder(json.JSONEncoder):
    def default(self, o):
        return {'emails': list(o)}
# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)
mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)
json.dump(gen_json_msgs(mbox),open(OUT_FILE, 'wb'), indent=4, cls=Encoder)
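As an aside, the script above relies on Python 2 only APIs (`mailbox.UnixMailbox`, the old `BeautifulSoup` module). A rough Python 3 sketch of the same idea, using only the standard library, might look like the following. `mbox_to_dict` and `mbox_to_json` are hypothetical names, HTML stripping is left out, and headers that occur several times (such as Received) keep only their last value.

```python
import json
import mailbox


def mbox_to_dict(path):
    """Read an mbox file and return {'emails': [...]} (illustrative sketch)."""
    emails = []
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            record = {name: str(value) for name, value in msg.items()}
            record["parts"] = []
            for part in msg.walk():
                if part.get_content_maintype() == "multipart":
                    continue  # containers carry no content of their own
                payload = part.get_payload(decode=True) or b""
                charset = part.get_content_charset() or "utf-8"
                try:
                    text = payload.decode(charset, "ignore")
                except LookupError:  # unknown charset name in the message
                    text = payload.decode("utf-8", "ignore")
                record["parts"].append({
                    "contentType": part.get_content_type(),
                    "content": text,
                })
            emails.append(record)
    finally:
        box.close()
    return {"emails": emails}


def mbox_to_json(mbox_path, out_path):
    """Write the converted mailbox to out_path as indented JSON."""
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(mbox_to_dict(mbox_path), out, indent=4)
```

Building the whole list in memory is simpler than the generator plus custom encoder used above, but it may not suit very large mailboxes.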
Now, it's possible to process the file easily. E.g. to get just the contents of the emails:
jq '.emails[] | .parts[] | .content' < out/Inbox.json
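If jq is not available, roughly the same extraction can be done in Python. This is a minimal sketch: `iter_contents` is a hypothetical helper, and the inline sample only imitates the shape of the file produced above, with made-up values.

```python
import json


def iter_contents(doc):
    """Yield the content of every part of every email, roughly like the
    jq filter: .emails[] | .parts[] | .content"""
    for message in doc.get("emails", []):
        for part in message.get("parts", []):
            if "content" in part:
                yield part["content"]


# Inline sample in the shape of the generated JSON (values are made up).
sample = json.loads(
    '{"emails": [{"Subject": "mitm", "parts": '
    '[{"contentType": "text/plain", "content": "be careful!"}]}]}'
)
print(list(iter_contents(sample)))
```

For a real file, the same function can be applied to the result of `json.load` on the output of the conversion step. Unlike the jq filter, which prints null for a part without content, this sketch skips such parts.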