I have a string of raw HTTP and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?
'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'
Update: It’s 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.
There are excellent tools in the Standard Library both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we are breaking it across several lines for readability) that we can feed to my examples:
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
As @TryPyPy points out, you can use Python’s email message library to parse the headers — though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it:
from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)
print(len(headers)) # -> "3"
print(headers.keys()) # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host']) # -> "cm.bell-labs.com"
But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure — a problem with the whole suite of HTTP and URL tools in the Standard Library — all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).
So here is our specialization of the Standard Library class:
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to properly call it, but what can you do? Here is how you would use this simple class:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print(request.error_code) # None (check this first)
print(request.command) # "GET"
print(request.path) # "/who/ken/trust.html"
print(request.request_version) # "HTTP/1.1"
print(len(request.headers)) # 3
print(request.headers.keys()) # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host']) # "cm.bell-labs.com"
If there is an error during parsing, the error_code will not be None:
# Parsing can result in an error code and message
request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')
print(request.error_code) # 400
print(request.error_message) # "Bad request syntax ('GET')"
I prefer using the Standard Library like this because I suspect that they have already encountered and resolved any edge cases that might bite me if I try re-implementing an Internet specification myself with regular expressions.
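One concrete example of such an edge case: the parser copes with the obsolete "folded" continuation lines that RFC 822 permits, which a hand-rolled split on '\r\n' would mangle, and its header lookups are case-insensitive. A minimal sketch reusing the HTTPRequest class above (the X-Comment header is made up for illustration):

folded = HTTPRequest(
    b'GET / HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'X-Comment: first half,\r\n'
    b' second half\r\n'          # continuation line: starts with whitespace
    b'\r\n'
)
print(folded.error_code)               # None
print(folded.headers['x-comment'])     # both halves, rejoined across the fold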
Old Python 2 code
Here’s the original code for this answer, back when I first wrote it:
request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
)
And:
# Ignore the request line and parse only the headers
from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))
print len(headers) # -> "3"
print headers.keys() # -> ['accept-charset', 'host', 'accept']
print headers['Host'] # -> "cm.bell-labs.com"
And:
from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
And:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print request.error_code # None (check this first)
print request.command # "GET"
print request.path # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers) # 3
print request.headers.keys() # ['accept-charset', 'host', 'accept']
print request.headers['host'] # "cm.bell-labs.com"
And:
# Parsing can result in an error code and message
request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')
print request.error_code # 400
print request.error_message # "Bad request syntax ('GET')"
mimetools has been deprecated since Python 2.3 and has been removed entirely from Python 3.
Here is how you do it in Python 3:
import email
import io
import pprint
# […]
request_line, headers_alone = request_text.split('\r\n', 1)
message = email.message_from_file(io.StringIO(headers_alone))
headers = dict(message.items())
pprint.pprint(headers, width=160)
This seems to work fine if you strip the GET line:
import mimetools
from StringIO import StringIO
he = "Host: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n"
m = mimetools.Message(StringIO(he))
print m.headers
A way to parse your example and add information from the first line to the object would be:
import mimetools
from StringIO import StringIO
he = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'
# Pop the first line for further processing
request, he = he.split('\r\n', 1)
# Get the headers
m = mimetools.Message(StringIO(he))
# Add request information
m.dict['method'], m.dict['path'], m.dict['http-version'] = request.split()
print m['method'], m['path'], m['http-version']
print m['Connection']
print m.headers
print m.dict
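For reference, here is a rough Python 3 equivalent of the snippet above. It is a sketch only: email.message.Message has no .dict attribute, so the request-line fields go into a plain dict built from the parsed headers.

from email import message_from_string

he = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'
# Pop the first line for further processing
request, he = he.split('\r\n', 1)
# Parse the headers, then copy them into an ordinary dict
m = dict(message_from_string(he).items())
# Add request information
m['method'], m['path'], m['http-version'] = request.split()
print(m['method'], m['path'], m['http-version'])
print(m['Connection'])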
Using Python 3.7, urllib3.HTTPResponse, and http.client.parse_headers (the curl flags: -i includes the response headers in the output, -L follows redirects):
curl -i -L -X GET "http://httpbin.org/relative-redirect/3" | python -c '
import sys
from io import BytesIO
from urllib3 import HTTPResponse
from http.client import parse_headers

rawresponse = sys.stdin.read().encode("utf8")
redirects = []

# peel off the header block of each intermediate (redirect) response
while True:
    header, body = rawresponse.split(b"\r\n\r\n", 1)
    if body[:4] == b"HTTP":
        redirects.append(header)
        rawresponse = body
    else:
        break

f = BytesIO(header)
# read one line for the status line: HTTP/x.x STATUSCODE MESSAGE
requestline = f.readline().split(b" ")
protocol, status = requestline[:2]

headers = parse_headers(f)
resp = HTTPResponse(body, headers=headers)
resp.status = int(status)

print("headers")
print(resp.headers)
print("redirects")
print(redirects)
'
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 215 100 215 0 0 435 0 --:--:-- --:--:-- --:--:-- 435
headers
HTTPHeaderDict({'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Thu, 20 Sep 2018 05:39:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '215', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'})
redirects
[b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/2\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/1\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /get\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur']
notes:
https://urllib3.readthedocs.io/en/latest/reference/#urllib3.response.HTTPResponse
parse_headers()
In a pythonic way
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
print({k: v.strip() for k, v in [line.split(":", 1)
       for line in request_text.decode().splitlines() if ":" in line]})
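One caveat worth noting: the comprehension above only survives the request line because that particular line happens to contain no ":". A request target such as /a:b would sneak in as a bogus header, so a variant that drops the first line explicitly may be safer (same idea, just defensive):

lines = request_text.decode().splitlines()
print({k: v.strip() for k, v in
       (line.split(":", 1) for line in lines[1:] if ":" in line)})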
There is another, simpler and safer way to handle headers: more object-oriented, with no need for manual parsing.
Short demo.
1. Parse them
From str, bytes, fp, dict, requests.Response, email.Message, httpx.Response, urllib3.HTTPResponse.
from requests import get
from kiss_headers import parse_it
response = get('https://www.google.fr')
headers = parse_it(response)
headers.content_type.charset # output: ISO-8859-1
# It's the same as
headers["content-type"]["charset"] # output: ISO-8859-1
2. Build them
This
from kiss_headers import *
headers = (
    Host("developer.mozilla.org")
    + UserAgent(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    + Accept("text/html")
    + Accept("application/xhtml+xml")
    + Accept("application/xml", qualifier=0.9)
    + Accept(qualifier=0.8)
    + AcceptLanguage("en-US")
    + AcceptLanguage("en", qualifier=0.5)
    + AcceptEncoding("gzip")
    + AcceptEncoding("deflate")
    + AcceptEncoding("br")
    + Referer("https://developer.mozilla.org/testpage.html")
    + Connection(should_keep_alive=True)
    + UpgradeInsecureRequests()
    + IfModifiedSince("Mon, 18 Jul 2016 02:36:04 GMT")
    + IfNoneMatch("c561c68d0ba92bbeb8b0fff2a9199f722e3a621a")
    + CacheControl(max_age=0)
)
raw_headers = str(headers)
Will become
Host: developer.mozilla.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html, application/xhtml+xml, application/xml; q="0.9", */*; q="0.8"
Accept-Language: en-US, en; q="0.5"
Accept-Encoding: gzip, deflate, br
Referer: https://developer.mozilla.org/testpage.html
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 18 Jul 2016 02:36:04 GMT
If-None-Match: "c561c68d0ba92bbeb8b0fff2a9199f722e3a621a"
Cache-Control: max-age="0"
Documentation for the kiss-headers library.
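Since the list above says parse_it also accepts plain str/bytes, the raw header block from the question should be parseable directly as well; a sketch, not tested against every kiss-headers version:

from kiss_headers import parse_it

raw = (
    'Host: www.google.com\r\n'
    'Connection: keep-alive\r\n'
    'Accept-Language: en-US,en;q=0.8\r\n'
)
headers = parse_it(raw)
print(headers.host)  # www.google.com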
Is there any way to parse the individual headers from an HTTP string?
I wrote a simple function that returns a dictionary object; hope it can help you. ^_^
Python 3
def parse_request(request):
    raw_list = request.split("\r\n")
    request = {}
    for index in range(1, len(raw_list)):
        # split(":", 1) so that header values containing ":" stay intact
        item = raw_list[index].split(":", 1)
        if len(item) == 2:
            request.update({item[0].lstrip(' '): item[1].lstrip(' ')})
    return request
raw_request = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'
request = parse_request(raw_request)
print(request)
print('\n')
print(request.keys())
Output:
{'Host': 'www.google.com', 'Connection': 'keep-alive', 'Accept': 'application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': 'gzip,deflate,sdch', 'Avail-Dictionary': 'GeNLY2f-', 'Accept-Language': 'en-US,en;q=0.8'}
dict_keys(['Host', 'Connection', 'Accept', 'User-Agent', 'Accept-Encoding', 'Avail-Dictionary', 'Accept-Language'])
In Python 3:
from email import message_from_string

data = socket.recv(4096)  # "socket" here is an already-connected socket object
headers = message_from_string(str(data, 'ASCII').split('\r\n', 1)[1])
print(headers['Host'])
From this question: How to parse raw HTTP request in Python 3?
Here are some Python packages aimed at proper HTTP protocol parsing:
https://dpkt.readthedocs.io/en/latest/api/api_auto.html#module-dpkt.http
https://h11.readthedocs.io/en/latest/ (see the sketch after this list)
https://github.com/benoitc/http-parser/ (C backend)
https://github.com/MagicStack/httptools (based on NodeJS's C backend)
https://github.com/silentsignal/netlib-offline (shameless plug)
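As a taste of what these look like in use, here is a minimal sketch with h11, the event-based parser linked above; it emits a Request event rather than populating a handler object:

import h11

conn = h11.Connection(our_role=h11.SERVER)
conn.receive_data(
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'\r\n'
)
event = conn.next_event()   # an h11.Request event
print(event.method)         # b'GET'
print(event.target)         # b'/who/ken/trust.html'
print(dict(event.headers))  # {b'host': b'cm.bell-labs.com'}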
Related
I have written my own implementation of a websocket in Python to teach myself their inner workings. I was going to be sending large repetitive JSON objects over the websocket, so I am trying to implement permessage-deflate. The compression works in the client -> server direction, but not in the server -> client direction.
This is the header exchange:
Request
Host: awebsite.com:port
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Upgrade: websocket
Origin: http://awebsite.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: JItmF32mfGXXKYyhcEoW/A==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Response
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Extensions: permessage-deflate
Sec-WebSocket-Accept: zYQKJ6gvwlTU/j2xw1Kf0BErg9c=
When I do this, I get compressed data from the client as expected and it inflates as expected.
When I send an uncompressed message, I get a normal response on the client, i.e. I send "hello" and I get "hello".
When I try to deflate my message using this simple python function:
def deflate(self, data, B64_encode=False):
    # [2:-4] strips the 2-byte zlib header and 4-byte checksum,
    # leaving a raw DEFLATE stream
    data = zlib.compress(data)
    if B64_encode:
        return base64.b64encode(data[2:-4])
    else:
        return data[2:-4]
I get an error message about the characters not being UTF-8, and when I base64 encode the compressed message, I just get the base64 encoded string. I also tried sending the data as Binary over the websocket, and get a blob at the other end. I've been scouring the internet for a while now and haven't heard of this happening. My guess is that I am compressing the data at the wrong step. Below is the function I use to send the data. So far I've been feeding the compressed message into the send() function, because from what I've read, permessage compression happens on the message level, and all the other data remains uncompressed.
def send(self, string, TYPE="TEXT"):
import struct
conn = self.conn
datatypes = {
"TEXT": 0x01,
"BINARY": 0x02,
"CLOSE": 0X08,
"PING": 0x09,
"PONG": 0x0A}
b1 = 0x80
b2 = 0
message = ""
if TYPE == "TEXT":
if type(string) == unicode:
b1 |= datatypes["TEXT"]
payload = string.encode("UTF8")
elif type(string) == str:
b1 |= datatypes["TEXT"]
payload = string
message += chr(b1)
else:
b1 |= datatypes[TYPE]
payload = string
message += chr(b1)
length = len(payload)
if length < 126:
b2 |= length
message += chr(b2)
elif length < (2 ** 16) - 1:
b2 |= 126
message += chr(b2)
l = struct.pack(">H", length)
message += l
else:
l = struct.pack(">Q", length)
b2 |= 127
message += chr(b2)
message += l
message += payload
try:
conn.send(str(message))
except socket.error:
traceback.print_exc()
conn.close()
if TYPE == "CLOSE":
self.Die = True
conn.shutdown(2)
conn.close()
print self.myid,"Closed"
After a lot of sleuthing, I found out my problem was a case of "RTFM". In the third paragraph of the per-message compression spec (RFC 7692), it says:
A WebSocket client may offer multiple PMCEs during the WebSocket
opening handshake. A peer WebSocket server received those offers may
choose and accept preferred one or decline all of them. PMCEs use
the RSV1 bit of the WebSocket frame header to indicate whether a
message is compressed or not, so that an endpoint can choose not to
compress messages with incompressible contents.
I didn't know what the rsv bits did when I first set this up, and had them set to 0 by default. My code now allows for compression to be set in the send() function in my program. It nicely shrinks my messages from 30200 bytes to 149 bytes.
My Modified code now looks like this:
def deflate2(self, data):
    data = zlib.compress(data)
    data = data[2:-4]
    return data
def send(self, string, TYPE="TEXT",deflate=False):
import struct
if (deflate):
string=self.deflate(string)
conn = self.conn
datatypes = {
"TEXT": 0x01,
"BINARY": 0x02,
"CLOSE": 0X08,
"PING": 0x09,
"PONG": 0x0A}
b1 = 0x80 #0b100000000
if(deflate): b1=0xC0 #0b110000000 sets RSV1 to 1 for compression
b2 = 0
message = ""
if TYPE == "TEXT":
if type(string) == unicode:
b1 |= datatypes["TEXT"]
payload = string.encode("UTF8")
elif type(string) == str:
b1 |= datatypes["TEXT"]
payload = string
message += chr(b1)
else:
b1 |= datatypes[TYPE]
payload = string
message += chr(b1)
length = len(payload)
if length < 126:
b2 |= length
message += chr(b2)
elif length < (2 ** 16) - 1:
b2 |= 126
message += chr(b2)
l = struct.pack(">H", length)
message += l
else:
l = struct.pack(">Q", length)
b2 |= 127
message += chr(b2)
message += l
message += payload
try:
if self.debug:
for x in message: print("S>: ",x,hex(ord(x)),ord(x))
conn.send(str(message))
except socket.error:
traceback.print_exc()
conn.close()
if TYPE == "CLOSE":
self.Die = True
conn.shutdown(2)
conn.close()
print self.myid,"Closed"
I am new to Python and I need to parse status codes. I have a task to parse an HTTP log file:
Group the logged requests by IP address or HTTP status code
(selected by the user).
Calculate one of the following (selected by the user) for each group:
Request count
Request count percentage of all logged requests
The total number of bytes transferred.
I have made the calculation of requests and percentages. Now I do not know how to count the transferred bytes (3rd task).
The example of a log file (bytes are shown here after status code: 6146, 52315, 12251, 54662):
93.114.45.13 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
66.249.73.135 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"83.149.9.216 - - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
My code :
import re
import sys
from collections import Counter
def getBytes(filename):
    with open(filename, 'r') as logfile:
        for line in logfile:
            newlist = line.split(" ")
            print(newlist[0] + " " + newlist[9])

def countIp(filename):
    print("hey")
    myregex = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    with open(filename) as f:
        log = f.read()
        my_iplist = re.findall(myregex, log)
        ipcount = Counter(my_iplist)
        s = sum(ipcount.values())
        for k, v in ipcount.items():
            pct = v * 100.0 / s
            print("IP Address " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

def countStatusCode(filename):
    print("hey")
    myregex = r'\b[2-5]\d\d\s'
    with open(filename) as f:
        log = f.read()
        my_iplist = re.findall(myregex, log)
        ipcount = Counter(my_iplist)
        s = sum(ipcount.values())
        for k, v in ipcount.items():
            pct = v * 100.0 / s
            print("IP Address " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

if __name__ == '__main__':
    filename = sys.argv[1]
    val = input("Enter your choice (ip or status code): ")
    if val in ["ip", "IP", "Ip", "iP"]:
        countIp(filename)
    elif val in ["statusCode", "code", "sc", "Status Code", "status"]:
        countStatusCode(filename)
To get the number of bytes transferred from your logfile:

import re

def getBytes(filename):
    with open(filename, 'r') as logfile:
        for line in logfile:
            regex = r'\/.+?\sHTTP\/1\..\"\s.{3}\s(.+?)\s'
            bytesCount = re.search(regex, line)[1]
            print("Bytes transferred: " + bytesCount)
Rather than using multiple regular expressions making multiple passes through the log, this solution uses one regular expression to pull out all the relevant values in one pass.
Function process_log is passed the text of the log as a single string. In the following demo program, a test string is used. An actual implementation would call this function with the results of reading the actual log file.
To keep track of IP address/status pairs, a defaultdict using a list as the default_factory is used. The number of items in the list counts the number of times the IP address/status combination has been seen, and each list item is the number of bytes transferred for that HTTP request. For example, a key/value pair of the ip_counter dictionary might be:
key: ('123.12.11.9', '200') value: [6213, 9876, 376]
The interpretation of the above is that there were 3 instances of status code '200' for ip address '123.12.11.9' discovered. The bytes transferred for these 3 instances were 6213, 9876 and 376.
Explanation of the regex:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - This is essentially the regex used by the OP to recognize an IP address so it shouldn't require much of an explanation. I have followed it with - - to provide additional following context just in case there are other instances of similar looking strings. The IP address is captured in Capture Group 1.
.*? This will match non-greedily 0 or more non-newline characters until the following.
HTTP/1.1" Matches the string HTTP/1.1" to provide left-hand context for the following.
(\d+) Matches one or more digit in Capture Group 2 (the status).
' ' Matches the single literal space between the two groups.
(\d+) Matches one or more digit in Capture Group 3 (the transferred bytes).
See Regex Demo
In other words, I just want to make sure I am picking up the right fields from the right place by matching what I expect to find next to the fields I am looking for. When your regex returns multiple groups, finditer is often more convenient than findall: finditer returns an iterator yielding a match object for each iteration. I have added code to produce statistics both for ip/status code/transferred bytes and for just status code. You need only one or the other, depending on what the user wants.
The code:
import re
from collections import defaultdict
def process_log(log):
    ip_counter = defaultdict(list)
    status_counter = defaultdict(int)
    total_count = 0
    for m in re.finditer(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)', log):
        total_count += 1
        ip = m[1]
        status = m[2]
        bytes = int(m[3])
        ip_counter[(ip, status)].append(bytes)
        status_counter[status] += 1
    for k, v in ip_counter.items():
        count = len(v)
        percentage = count / total_count
        total_bytes = sum(v)
        ip = k[0]
        status = k[1]
        print(f"IP Address => {ip}, status => {status}, Count => {count}, Percentage => {percentage}, Total Bytes Transferred => {total_bytes}")
    for k, v in status_counter.items():
        count = v
        percentage = count / total_count
        print(f"Status Code => {k}, Percentage => {percentage}")
log = """93.114.45.13 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
66.249.73.135 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"83.149.9.216 - - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
"""
process_log(log)
Prints:
IP Address => 93.114.45.13, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 58461
IP Address => 66.249.73.135, status => 200, Count => 1, Percentage => 0.2, Total Bytes Transferred => 12251
IP Address => 83.149.9.216, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 62688
Status Code => 200, Percentage => 1.0
See Python Demo
We are trying to add HTTPS support to a web server virtual host scanning tool. Said tool uses the python3 requests library, which uses urllib3 under the hood.
We need a way to provide our own SNI hostname so are attempting to monkey patch the _ssl_wrap_socket function of urllib3 to control server_hostname but aren't having much success.
Here is the full code:
from urllib3.util import ssl_

_target_host = None

_orig_wrap_socket = ssl_.ssl_wrap_socket

def _ssl_wrap_socket(sock, keyfile=None, certfile=None, cert_reqs=None,
                     ca_certs=None, server_hostname=None,
                     ssl_version=None, ciphers=None, ssl_context=None,
                     ca_cert_dir=None):
    _orig_wrap_socket(sock, keyfile=keyfile, certfile=certfile,
                      cert_reqs=cert_reqs, ca_certs=ca_certs,
                      server_hostname=_target_host, ssl_version=ssl_version,
                      ciphers=ciphers, ssl_context=ssl_context,
                      ca_cert_dir=ca_cert_dir)

ssl_.ssl_wrap_socket = _ssl_wrap_socket
We then call requests.get() further down in the code. The full context can be found on GitHub.
Unfortunately this isn't working as our code never appears to be reached, and we're not sure why. Is there something obvious that we're missing or a better way to approach this issue?
Further Explanation
The following is the full class:
import os
import random
import requests
import hashlib
import pandas as pd
import time

from lib.core.discovered_host import *
import urllib3

DEFAULT_USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '\
                     'AppleWebKit/537.36 (KHTML, like Gecko) '\
                     'Chrome/61.0.3163.100 Safari/537.36'

urllib3.disable_warnings()
from urllib3.util import ssl_
class virtual_host_scanner(object):
    """Virtual host scanning class

    Virtual host scanner has the following properties:

    Attributes:
        wordlist: location to a wordlist file to use with scans
        target: the target for scanning
        port: the port to scan. Defaults to 80
        ignore_http_codes: comma separated list of http codes to ignore
        ignore_content_length: integer value of content length to ignore
        output: folder to write output file to
    """
    def __init__(self, target, wordlist, **kwargs):
        self.target = target
        self.wordlist = wordlist
        self.base_host = kwargs.get('base_host')
        self.rate_limit = int(kwargs.get('rate_limit', 0))
        self.port = int(kwargs.get('port', 80))
        self.real_port = int(kwargs.get('real_port', 80))
        self.ssl = kwargs.get('ssl', False)
        self.fuzzy_logic = kwargs.get('fuzzy_logic', False)
        self.unique_depth = int(kwargs.get('unique_depth', 1))
        self.ignore_http_codes = kwargs.get('ignore_http_codes', '404')
        self.first_hit = kwargs.get('first_hit')
        self.ignore_content_length = int(
            kwargs.get('ignore_content_length', 0)
        )
        self.add_waf_bypass_headers = kwargs.get(
            'add_waf_bypass_headers',
            False
        )

        # this can be made redundant in future with better exceptions
        self.completed_scan = False

        # this is maintained until likely-matches is refactored to use
        # new class
        self.results = []

        # store associated data for discovered hosts
        # in array for oN, oJ, etc'
        self.hosts = []

        # available user-agents
        self.user_agents = list(kwargs.get('user_agents')) \
            or [DEFAULT_USER_AGENT]
    @property
    def ignore_http_codes(self):
        return self._ignore_http_codes

    @ignore_http_codes.setter
    def ignore_http_codes(self, codes):
        self._ignore_http_codes = [
            int(code) for code in codes.replace(' ', '').split(',')
        ]
    _target_host = None
    _orig_wrap_socket = ssl_.ssl_wrap_socket

    def _ssl_wrap_socket(sock, keyfile=None, certfile=None, cert_reqs=None,
                         ca_certs=None, server_hostname=None,
                         ssl_version=None, ciphers=None, ssl_context=None,
                         ca_cert_dir=None):
        print('SHOULD BE PRINTED')
        _orig_wrap_socket(sock, keyfile=keyfile, certfile=certfile,
                          cert_reqs=cert_reqs, ca_certs=ca_certs,
                          server_hostname=_target_host, ssl_version=ssl_version,
                          ciphers=ciphers, ssl_context=ssl_context,
                          ca_cert_dir=ca_cert_dir)
    def scan(self):
        print('fdsa')
        ssl_.ssl_wrap_socket = self._ssl_wrap_socket
        if not self.base_host:
            self.base_host = self.target
        if not self.real_port:
            self.real_port = self.port
        for virtual_host in self.wordlist:
            hostname = virtual_host.replace('%s', self.base_host)
            if self.real_port == 80:
                host_header = hostname
            else:
                host_header = '{}:{}'.format(hostname, self.real_port)
            headers = {
                'User-Agent': random.choice(self.user_agents),
                'Host': host_header,
                'Accept': '*/*'
            }
            if self.add_waf_bypass_headers:
                headers.update({
                    'X-Originating-IP': '127.0.0.1',
                    'X-Forwarded-For': '127.0.0.1',
                    'X-Remote-IP': '127.0.0.1',
                    'X-Remote-Addr': '127.0.0.1'
                })
            dest_url = '{}://{}:{}/'.format(
                'https' if self.ssl else 'http',
                self.target,
                self.port
            )
            _target_host = hostname
            try:
                res = requests.get(dest_url, headers=headers, verify=False)
            except requests.exceptions.RequestException:
                continue
            if res.status_code in self.ignore_http_codes:
                continue
            response_length = int(res.headers.get('content-length', 0))
            if self.ignore_content_length and \
                    self.ignore_content_length == response_length:
                continue
            # hash the page results to aid in identifying unique content
            page_hash = hashlib.sha256(res.text.encode('utf-8')).hexdigest()
            self.hosts.append(self.create_host(res, hostname, page_hash))
            # add url and hash into array for likely matches
            self.results.append(hostname + ',' + page_hash)
            if len(self.hosts) >= 1 and self.first_hit:
                break
            # rate limit the connection, if the int is 0 it is ignored
            time.sleep(self.rate_limit)
        self.completed_scan = True
    def likely_matches(self):
        if self.completed_scan is False:
            print("[!] Likely matches cannot be printed "
                  "as a scan has not yet been run.")
            return
        # segment results from previous scan into usable results
        segmented_data = {}
        for item in self.results:
            result = item.split(",")
            segmented_data[result[0]] = result[1]
        dataframe = pd.DataFrame([
            [key, value] for key, value in segmented_data.items()],
            columns=["key_col", "val_col"]
        )
        segmented_data = dataframe.groupby("val_col").filter(
            lambda x: len(x) <= self.unique_depth
        )
        return segmented_data["key_col"].values.tolist()
    def create_host(self, response, hostname, page_hash):
        """
        Creates a host using the response and the hash.
        Prints current result in real time.
        """
        output = '[#] Found: {} (code: {}, length: {}, hash: {})\n'.format(
            hostname,
            response.status_code,
            response.headers.get('content-length'),
            page_hash
        )
        host = discovered_host()
        host.hostname = hostname
        host.response_code = response.status_code
        host.hash = page_hash
        host.contnet = response.content
        for key, val in response.headers.items():
            output += '  {}: {}\n'.format(key, val)
            host.keys.append('{}: {}'.format(key, val))
        print(output)
        return host
In this case the following line is never being hit:
print('SHOULD BE PRINTED')
This also results in the following log entry on the web server:
[Wed Oct 25 16:37:23.654321 2017] [ssl:error] [pid 1355] AH02032:
Hostname provided via SNI and hostname test.test provided via
HTTP are different
This also indicates the patched code was never run.
Edit-1: No reload needed
Thanks to @MartijnPieters for helping me enhance the answer. There is no reload needed if we directly patch urllib3.connection. But the requests package has had some changes in recent versions, which made the original answer not work on some versions of requests.
Here is an updated version of the code, which handles all these things:
import requests
try:
    assert requests.__version__ != "2.18.0"
    import requests.packages.urllib3.util.ssl_ as ssl_
    import requests.packages.urllib3.connection as connection
except (ImportError, AssertionError, AttributeError):
    import urllib3.util.ssl_ as ssl_
    import urllib3.connection as connection

print("Using " + requests.__version__)

def _ssl_wrap_socket(sock, keyfile=None, certfile=None, cert_reqs=None,
                     ca_certs=None, server_hostname=None,
                     ssl_version=None, ciphers=None, ssl_context=None,
                     ca_cert_dir=None):
    print('SHOULD BE PRINTED')
    return ssl_.ssl_wrap_socket(sock, keyfile=keyfile, certfile=certfile,
                                cert_reqs=cert_reqs, ca_certs=ca_certs,
                                server_hostname=server_hostname, ssl_version=ssl_version,
                                ciphers=ciphers, ssl_context=ssl_context,
                                ca_cert_dir=ca_cert_dir)

connection.ssl_wrap_socket = _ssl_wrap_socket

res = requests.get("https://www.google.com", verify=True)
res = requests.get("https://www.google.com", verify=True)
The code is also available on
https://github.com/tarunlalwani/monkey-patch-ssl_wrap_socket
Original Answer
So two issues in your code.
requests doesn't actually directly import urllib3. It does it through its own context using requests.packages
So the function you want to overwrite is
requests.packages.urllib3.util.ssl_.ssl_wrap_socket
Next, if you look at urllib3/connection.py:
from .util.ssl_ import (
    resolve_cert_reqs,
    resolve_ssl_version,
    ssl_wrap_socket,
    assert_fingerprint,
)
This is a local import, so the name cannot be overridden after the fact: it is already bound by the time we use import requests. You can easily confirm that by putting a breakpoint there and checking the stack trace back to the parent file.
So for the monkey patch to work, we need to reload the module once the patching is done, so that it picks up our patched function.
Below is minimal code showing that interception works this way
try:
    reload  # Python 2.7
except NameError:
    try:
        from importlib import reload  # Python 3.4+
    except ImportError:
        from imp import reload  # Python 3.0 - 3.3

_target_host = None  # the SNI hostname we want to inject

def _ssl_wrap_socket(sock, keyfile=None, certfile=None, cert_reqs=None,
                     ca_certs=None, server_hostname=None,
                     ssl_version=None, ciphers=None, ssl_context=None,
                     ca_cert_dir=None):
    print('SHOULD BE PRINTED')
    _orig_wrap_socket(sock, keyfile=keyfile, certfile=certfile,
                      cert_reqs=cert_reqs, ca_certs=ca_certs,
                      server_hostname=_target_host, ssl_version=ssl_version,
                      ciphers=ciphers, ssl_context=ssl_context,
                      ca_cert_dir=ca_cert_dir)

import requests

_orig_wrap_socket = requests.packages.urllib3.util.ssl_.ssl_wrap_socket
requests.packages.urllib3.util.ssl_.ssl_wrap_socket = _ssl_wrap_socket
reload(requests.packages.urllib3.connection)

res = requests.get("https://www.google.com", verify=True)
I have a script written in Python, shown below:
#!/bin/env python2.7
# Run around 1059 as early as 1055.
# Polling times vary pick something nice.
# Ghost checkout timer can be changed by
# adjusting for loop range near bottom.
# Fill out personal data in checkout payload dict.
import sys, json, time, requests, urllib2
from datetime import datetime
qty='1'
def UTCtoEST():
    current = datetime.now()
    return str(current) + ' EST'
print
poll=raw_input("Polling interval? ")
poll=int(poll)
keyword=raw_input("Product name? ").title() # hardwire here by declaring keyword as a string
color=raw_input("Color? ").title() # hardwire here by declaring keyword as a string
sz=raw_input("Size? ").title() # hardwire here by declaring keyword as a string
print
print UTCtoEST(),':: Parsing page...'
def main():
    global ID
    global variant
    global cw
    req = urllib2.Request('http://www.supremenewyork.com/mobile_stock.json')
    req.add_header('User-Agent', "User-Agent','Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25")
    resp = urllib2.urlopen(req)
    data = json.loads(resp.read())
    ID = 0
    for i in range(len(data[u'products_and_categories'].values())):
        for j in range(len(data[u'products_and_categories'].values()[i])):
            item = data[u'products_and_categories'].values()[i][j]
            name = str(item[u'name'].encode('ascii', 'ignore'))
            # SEARCH WORDS HERE
            # if string1 in name or string2 in name:
            if keyword in name:
                # match/(es) detected!
                # can return multiple matches but you're
                # probably buying for resell so it doesn't matter
                myproduct = name
                ID = str(item[u'id'])
                print UTCtoEST(), '::', name, ID, 'found ( MATCHING ITEM DETECTED )'
    if (ID == 0):
        # variant flag unchanged - nothing found - rerun
        time.sleep(poll)
        print UTCtoEST(), ':: Reloading and reparsing page...'
        main()
    else:
        print UTCtoEST(), ':: Selecting', str(myproduct), '(', str(ID), ')'
        jsonurl = 'http://www.supremenewyork.com/shop/' + str(ID) + '.json'
        req = urllib2.Request(jsonurl)
        req.add_header('User-Agent', "User-Agent','Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25")
        resp = urllib2.urlopen(req)
        data = json.loads(resp.read())
        found = 0
        for numCW in data['styles']:
            # COLORWAY TERMS HERE
            # if string1 in numCW['name'] or string2 in numCW['name']:
            if color in numCW['name'].title():
                for sizes in numCW['sizes']:
                    # SIZE TERMS HERE
                    if str(sizes['name'].title()) == sz:  # Medium
                        found = 1
                        variant = str(sizes['id'])
                        cw = numCW['name']
                        print UTCtoEST(), ':: Selecting size:', sizes['name'], '(', numCW['name'], ')', '(', str(sizes['id']), ')'
        if found == 0:
            # DEFAULT CASE NEEDED HERE - EITHER COLORWAY NOT FOUND OR SIZE NOT IN RUN OF PRODUCT
            # PICKING FIRST COLORWAY AND LAST SIZE OPTION
            print UTCtoEST(), ':: Selecting default colorway:', data['styles'][0]['name']
            sizeName = str(data['styles'][0]['sizes'][len(data['styles'][0]['sizes']) - 1]['name'])
            variant = str(data['styles'][0]['sizes'][len(data['styles'][0]['sizes']) - 1]['id'])
            cw = data['styles'][0]['name']
            print UTCtoEST(), ':: Selecting default size:', sizeName, '(', variant, ')'
main()

session = requests.Session()
addUrl = 'http://www.supremenewyork.com/shop/' + str(ID) + '/add.json'
addHeaders = {
    'Host': 'www.supremenewyork.com',
    'Accept': 'application/json',
    'Proxy-Connection': 'keep-alive',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-us',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'http://www.supremenewyork.com',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257',
    'Referer': 'http://www.supremenewyork.com/mobile'
}
addPayload = {
    'size': str(variant),
    'qty': '1'
}
print UTCtoEST() + ' :: Adding product to cart...'
addResp = session.post(addUrl, data=addPayload, headers=addHeaders)
print UTCtoEST() + ' :: Checking status code of response...'
if addResp.status_code != 200:
    print UTCtoEST() + ' ::', addResp.status_code, 'Error \nExiting...'
    print
    sys.exit()
else:
    if addResp.json() == []:
        print UTCtoEST() + ' :: Response Empty! - Problem Adding to Cart\nExiting...'
        print
        sys.exit()
    print UTCtoEST() + ' :: ' + str(cw) + ' - ' + addResp.json()[0]['name'] + ' - ' + addResp.json()[0]['size_name'] + ' added to cart!'
checkoutUrl = 'https://www.supremenewyork.com/checkout.json'
checkoutHeaders = {
    'host': 'www.supremenewyork.com',
    'If-None-Match': '"*"',
    'Accept': 'application/json',
    'Proxy-Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-us',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'http://www.supremenewyork.com',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257',
    'Referer': 'http://www.supremenewyork.com/mobile'
}
#################################
# FILL OUT THESE FIELDS AS NEEDED
#################################
checkoutPayload = {
    'store_credit_id': '',
    'from_mobile': '1',
    'cookie-sub': '%7B%22' + str(variant) + '%22%3A1%7D',  # cookie-sub: eg. {"VARIANT":1} urlencoded
    'same_as_billing_address': '1',
    'order[billing_name]': 'anon mous',  # FirstName LastName
    'order[email]': 'anon@mailinator.com',  # email@domain.com
    'order[tel]': '999-999-9999',  # phone-number-here
    'order[billing_address]': '123 Seurat lane',  # your address
    'order[billing_address_2]': '',
    'order[billing_zip]': '90210',  # zip code
    'order[billing_city]': 'Beverly Hills',  # city
    'order[billing_state]': 'CA',  # state
    'order[billing_country]': 'USA',  # country
    'store_address': '1',
    'credit_card[type]': 'visa',  # master or visa
    'credit_card[cnb]': '9999 9999 9999 9999',  # credit card number
    'credit_card[month]': '01',  # expiration month
    'credit_card[year]': '2026',  # expiration year
    'credit_card[vval]': '123',  # cvc/cvv
    'order[terms]': '0',
    'order[terms]': '1'  # duplicate key: in a dict literal the later value wins
}
# GHOST CHECKOUT PREVENTION WITH ROLLING PRINT
for i in range(5):
    sys.stdout.write("\r" + UTCtoEST() + ' :: Sleeping for ' + str(5 - i) + ' seconds to avoid ghost checkout...')
    sys.stdout.flush()
    time.sleep(1)
print
print UTCtoEST() + ' :: Firing checkout request!'
checkoutResp = session.post(checkoutUrl, data=checkoutPayload, headers=checkoutHeaders)
try:
    print UTCtoEST() + ' :: Checkout', checkoutResp.json()['status'].title() + '!'
except:
    print UTCtoEST() + ':: Error reading status key of response!'
    print checkoutResp.json()
print
print checkoutResp.json()
if checkoutResp.json()['status'] == 'failed':
    print
    print '!!!ERROR!!! ::', checkoutResp.json()['errors']
    print
When I run it, everything goes correctly, but at the end it gives me this error:
Traceback (most recent call last):
File "/Users/"USERNAME"/Desktop/supreme.py", line 167, in <module>
checkoutResp=session.post(checkoutUrl,data=checkoutPayload,headers=checkoutHeaders)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 522, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
resp = self.send(prep, **send_kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
r = adapter.send(request, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/adapters.py", line 497, in send
raise SSLError(e, request=request)
SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)
As can be seen from the SSLLabs report, the server supports only TLS 1.2 and only very few ciphers. Given a path like /Users/... in your error output, I assume that you are using Mac OS X. The version of OpenSSL shipped with Mac OS X (and probably used by your Python) is 0.9.8, which is too old to support TLS 1.2. This means that the server will not accept the SSL 3.0 or TLS 1.0 request from your client.
For information on how to fix this problem by updating your OpenSSL version see Updating openssl in python 2.7.
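A quick way to confirm which OpenSSL your Python is linked against (anything in the 0.9.8 line cannot speak TLS 1.2):

import ssl
print(ssl.OPENSSL_VERSION)  # e.g. 'OpenSSL 0.9.8zh 14 Jan 2016' on a stock Mac OS X Python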
I am using this regular expression for SIP (Session Initiation Protocol) URIs to extract the different internal variables.
_syntax = re.compile('^(?P<scheme>[a-zA-Z][a-zA-Z0-9\+\-\.]*):'  # scheme
    + '(?:(?:(?P<user>[a-zA-Z0-9\-\_\.\!\~\*\'\(\)&=\+\$,;\?\/\%]+)'  # user
    + '(?::(?P<password>[^:@;\?]+))?)@)?'  # password
    + '(?:(?:(?P<host>[^;\?:]*)(?::(?P<port>[\d]+))?))'  # host, port
    + '(?:;(?P<params>[^\?]*))?'  # parameters
    + '(?:\?(?P<headers>.*))?$')  # headers

m = URI._syntax.match(value)
if m:
    self.scheme, self.user, self.password, self.host, self.port, params, headers = m.groups()
and I want to extract specific headers, like Via (and its branch parameter), Contact, Call-ID, or CSeq.
The general form of a SIP message is:
OPTIONS sip:172.16.18.35:5060 SIP/2.0
Content-Length: 0
Via: SIP/2.0/UDP 172.16.18.90:5060
From: "fake" <sip:fake#172.16.18.90>
Supported: replaces, timer
User-Agent: SIPPing
To: <sip:172.16.18.35:5060>
Contact: <sip:fake@172.16.18.90:5060>
CSeq: 1 OPTIONS
Allow: INVITE, ACK, CANCEL, OPTIONS, BYE, REFER, SUBSCRIBE, NOTIFY, INFO, PUBLISH
Call-ID: fake-id@172.16.18.90
Date: Thu, 25 Apr 2013 003024 +0000
Max-Forwards: 70
I would suggest taking advantage of the intentional similarities between SIP header format and RFC822.
from email.parser import Parser
msg = Parser().parsestr(m.group('headers'))
...thereafter:
>>> msg.keys()
['Content-Length', 'Via', 'From', 'Supported', 'User-Agent', 'To', 'Contact', 'CSeq', 'Allow', 'Call-ID', 'Date', 'Max-Forwards']
>>> msg['To']
'<sip:172.16.18.35:5060>'
>>> msg['Date']
'Thu, 25 Apr 2013 003024 +0000'
...etc. See the documentation for the Python standard-library email module for more details.
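For parameters such as branch, which live inside the Via header value rather than as headers of their own, a little extra splitting is needed. A sketch (note the sample message's Via happens to carry no branch, so the lookup returns None here):

via = msg['Via']  # 'SIP/2.0/UDP 172.16.18.90:5060'
params = dict(p.split('=', 1) for p in via.split(';')[1:] if '=' in p)
print(params.get('branch'))  # None for this particular capture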