Counting bytes transferred from HTTP logs - python

I am new to Python and I need to parse status codes. My task is to parse an HTTP log file:
Group the logged requests by IP address or HTTP status code
(selected by the user).
Calculate one of the following (selected by the user) for each group:
Request count
Request count percentage of all logged requests
The total number of bytes transferred.
I have implemented the request counts and percentages, but I do not know how to count the transferred bytes (the third calculation).
An example of the log file (the byte counts appear right after the status code: 6146, 52315, 12251, 54662):
93.114.45.13 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
66.249.73.135 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"83.149.9.216 - - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
My code:
import re
import sys
from collections import Counter

def getBytes(filename):
    with open(filename, 'r') as logfile:
        for line in logfile:
            newlist = line.split(" ")
            print(newlist[0] + " " + newlist[9])

def countIp(filename):
    print("hey")
    myregex = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    with open(filename) as f:
        log = f.read()
        my_iplist = re.findall(myregex, log)
        ipcount = Counter(my_iplist)
        s = sum(ipcount.values())
        for k, v in ipcount.items():
            pct = v * 100.0 / s
            print("IP Address " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

def countStatusCode(filename):
    print("hey")
    myregex = r'\b[2-5]\d\d\s'
    with open(filename) as f:
        log = f.read()
        my_codelist = re.findall(myregex, log)
        codecount = Counter(my_codelist)
        s = sum(codecount.values())
        for k, v in codecount.items():
            pct = v * 100.0 / s
            print("Status Code " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

if __name__ == '__main__':
    filename = sys.argv[1]
    val = input("Enter your choice (ip or status code): ")
    if val in ["ip", "IP", "Ip", "iP"]:
        countIp(filename)
    elif val in ["statusCode", "code", "sc", "Status Code", "status"]:
        countStatusCode(filename)

To get the number of bytes transferred from your logfile:
import re

def getBytes(filename):
    # The capture group grabs the field right after the 3-digit status code.
    regex = r'\/.+?\sHTTP\/1\..\"\s.{3}\s(.+?)\s'
    with open(filename, 'r') as logfile:
        for line in logfile:
            match = re.search(regex, line)
            if match:  # skip lines the pattern does not match
                print("Bytes transferred: " + match[1])

Rather than using multiple regular expressions making multiple passes through the log, this solution uses one regular expression to pull out all the relevant values in one pass.
Function process_log is passed the text of the log as a single string. In the following demo program, a test string is used. An actual implementation would call this function with the results of reading the actual log file.
To keep track of IP address/status pairs, a defaultdict with list as its default_factory is used. The number of items in the list counts the number of times the IP address/status combination has been seen, and each list item is the number of bytes transferred for that HTTP request. For example, a key/value pair of the ip_counter dictionary might be:
key: ('123.12.11.9', '200') value: [6213, 9876, 376]
The interpretation of the above is that there were 3 instances of status code '200' for ip address '123.12.11.9' discovered. The bytes transferred for these 3 instances were 6213, 9876 and 376.
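From such a value, the statistics the question asks for fall out directly; for example:

value = [6213, 9876, 376]
count = len(value)        # 3 requests for this (ip, status) pair
total_bytes = sum(value)  # 16465 bytes transferred in total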
Explanation of the regex:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - This is essentially the regex used by the OP to recognize an IP address so it shouldn't require much of an explanation. I have followed it with - - to provide additional following context just in case there are other instances of similar looking strings. The IP address is captured in Capture Group 1.
.*? This will match non-greedily 0 or more non-newline characters until the following.
HTTP/1.1" Matches the string HTTP/1.1" to provide left-hand context for the following.
(\d+) Matches one or more digit in Capture Group 2 (the status).
A literal space then matches the single space separating the status from the byte count.
(\d+) Matches one or more digit in Capture Group 3 (the transferred bytes).
In other words, I just want to make sure I am picking up the right fields from the right place by matching what I expect to find next to the fields I am looking for. When your regex returns multiple groups, finditer is often more convenient than findall: it returns an iterator yielding a match object for each match. I have added code to produce statistics both for ip/status code/transferred bytes and for status code alone; you need only one or the other depending on what the user wants.
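As a quick standalone illustration of that difference (not part of the solution itself):

import re

for m in re.finditer(r'(\d+)-(\d+)', '10-20 30-40'):
    print(m[1], m[2])                             # match objects: groups are indexable
print(re.findall(r'(\d+)-(\d+)', '10-20 30-40'))  # plain tuples: [('10', '20'), ('30', '40')]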
The code:
import re
from collections import defaultdict

def process_log(log):
    ip_counter = defaultdict(list)
    status_counter = defaultdict(int)
    total_count = 0
    for m in re.finditer(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)', log):
        total_count += 1
        ip = m[1]
        status = m[2]
        nbytes = int(m[3])
        ip_counter[(ip, status)].append(nbytes)
        status_counter[status] += 1
    for k, v in ip_counter.items():
        count = len(v)
        percentage = count/total_count
        total_bytes = sum(v)
        ip = k[0]
        status = k[1]
        print(f"IP Address => {ip}, status => {status}, Count => {count}, Percentage => {percentage}, Total Bytes Transferred => {total_bytes}")
    for k, v in status_counter.items():
        count = v
        percentage = count/total_count
        print(f"Status Code => {k}, Percentage => {percentage}")
log = """93.114.45.13 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
93.114.45.13 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
66.249.73.135 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)"83.149.9.216 - - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
83.149.9.216 - - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
"""
process_log(log)
Prints:
IP Address => 93.114.45.13, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 58461
IP Address => 66.249.73.135, status => 200, Count => 1, Percentage => 0.2, Total Bytes Transferred => 12251
IP Address => 83.149.9.216, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 62688
Status Code => 200, Percentage => 1.0

Related

Websocket Permessage-deflate not occurring in server -> client direction

I have written my own implementation of a websocket in Python to teach myself its inner workings. I was going to be sending large repetitive JSON objects over the websocket, so I am trying to implement permessage-deflate. The compression works in the client -> server direction, but not in the server -> client direction.
This is the header exchange:
Request
Host: awebsite.com:port
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36
Upgrade: websocket
Origin: http://awebsite.com
Sec-WebSocket-Version: 13
Accept-Encoding: gzip, deflate
Accept-Language: en-US,en;q=0.9
Sec-WebSocket-Key: JItmF32mfGXXKYyhcEoW/A==
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Response
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Extensions: permessage-deflate
Sec-WebSocket-Accept: zYQKJ6gvwlTU/j2xw1Kf0BErg9c=
When I do this, I get compressed data from the client as expected and it inflates as expected.
When I send an uncompressed message, I get a normal response on the client; i.e., I send "hello" and I get "hello".
When I try to deflate my message using this simple python function:
def deflate(self,data,B64_encode=False):
    data=zlib.compress(data)
    if B64_encode:
        return base64.b64encode(data[2:-4])
    else:
        return data[2:-4]
I get an error message about the characters not being utf-8, and when I base64-encode the compressed message, I just get the base64-encoded string at the other end. I also tried sending the data as binary over the websocket, and got a blob on the client. I've been scouring the internet for a while now and haven't heard of this happening. My guess is that I am compressing the data at the wrong step. Below is the function I use to send the data. So far I've been feeding the compressed message into the send() function, because from what I've read, permessage compression happens at the message level and all the other data remains uncompressed.
def send(self, string, TYPE="TEXT"):
import struct
conn = self.conn
datatypes = {
"TEXT": 0x01,
"BINARY": 0x02,
"CLOSE": 0X08,
"PING": 0x09,
"PONG": 0x0A}
b1 = 0x80
b2 = 0
message = ""
if TYPE == "TEXT":
if type(string) == unicode:
b1 |= datatypes["TEXT"]
payload = string.encode("UTF8")
elif type(string) == str:
b1 |= datatypes["TEXT"]
payload = string
message += chr(b1)
else:
b1 |= datatypes[TYPE]
payload = string
message += chr(b1)
length = len(payload)
if length < 126:
b2 |= length
message += chr(b2)
elif length < (2 ** 16) - 1:
b2 |= 126
message += chr(b2)
l = struct.pack(">H", length)
message += l
else:
l = struct.pack(">Q", length)
b2 |= 127
message += chr(b2)
message += l
message += payload
try:
conn.send(str(message))
except socket.error:
traceback.print_exc()
conn.close()
if TYPE == "CLOSE":
self.Die = True
conn.shutdown(2)
conn.close()
print self.myid,"Closed"
After a lot of sleuthing, I found out my problem was a case of "RTFM". In the third paragraph of (a?) the manual on perMessage Compression it says
A WebSocket client may offer multiple PMCEs during the WebSocket
opening handshake. A peer WebSocket server received those offers may
choose and accept preferred one or decline all of them. PMCEs use
the RSV1 bit of the WebSocket frame header to indicate whether a
message is compressed or not, so that an endpoint can choose not to
compress messages with incompressible contents.
I didn't know what the rsv bits did when I first set this up, and had them set to 0 by default. My code now allows for compression to be set in the send() function in my program. It nicely shrinks my messages from 30200 bytes to 149 bytes.
My modified code now looks like this:
def deflate2(self,data):
    data=zlib.compress(data)
    data=data[2:-4]
    return data

def send(self, string, TYPE="TEXT", deflate=False):
    import struct
    if (deflate):
        string = self.deflate(string)
    conn = self.conn
    datatypes = {
        "TEXT": 0x01,
        "BINARY": 0x02,
        "CLOSE": 0X08,
        "PING": 0x09,
        "PONG": 0x0A}
    b1 = 0x80  # 0b10000000
    if (deflate): b1 = 0xC0  # 0b11000000 sets RSV1 to 1 for compression
    b2 = 0
    message = ""
    if TYPE == "TEXT":
        if type(string) == unicode:
            b1 |= datatypes["TEXT"]
            payload = string.encode("UTF8")
        elif type(string) == str:
            b1 |= datatypes["TEXT"]
            payload = string
        message += chr(b1)
    else:
        b1 |= datatypes[TYPE]
        payload = string
        message += chr(b1)
    length = len(payload)
    if length < 126:
        b2 |= length
        message += chr(b2)
    elif length < (2 ** 16) - 1:
        b2 |= 126
        message += chr(b2)
        l = struct.pack(">H", length)
        message += l
    else:
        l = struct.pack(">Q", length)
        b2 |= 127
        message += chr(b2)
        message += l
    message += payload
    try:
        if self.debug:
            for x in message: print("S>: ", x, hex(ord(x)), ord(x))
        conn.send(str(message))
    except socket.error:
        traceback.print_exc()
        conn.close()
    if TYPE == "CLOSE":
        self.Die = True
        conn.shutdown(2)
        conn.close()
        print self.myid, "Closed"

call a python function from excel

I have a Python function that gets exchange rates from a website:
import requests
import time

def myFunction(myCCY,myDate):
    url=f'https://www1.oanda.com/currency/converter/updatebase_currency_0=GBP&quote_currency={myCCY}&end_date={myDate}&view=details&id=5&action=C&'
    headers={
        'accept':'text/javascript, text/html, application/xml, text/xml,*/*',
        'referer':'https://www1.oanda.com/currency/converter/',
        'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
        'x-prototype-version':'1.7',
        'x-requested-with':'XMLHttpRequest'
    }
    resp=requests.get(url=url,headers=headers)
    time.sleep(2.73)
    if resp.status_code==200:
        jsonReturn=resp.json()
        myRate=jsonReturn['data']['bid_ask_data']['bid']
        return myRate
    else:
        return 0
I am trying to call it from Excel, where I have a list of currencies:
Sub getRates()
    Dim i As Integer
    Dim shell As Object
    Dim exePath As String
    Dim scriptPath As String
    Dim command As String
    Dim myDate As String
    Set shell = VBA.CreateObject("Wscript.Shell")
    exePath = """C:/Program Files (x86)/Microsoft Visual Studio/Shared/Python37_64/python.exe"""
    scriptPath = "C:/Users/Philip/Documents/python2.myFunction"
    myDate = "2020-05-25"
    Stop
    For i = 1 To Worksheets("Sheet1").Range("rates").Rows.Count
        command = exePath & " " & scriptPath & " " & Worksheets("Sheet1").Range("rates").Range("A1").Offset(i - 1, 0).Value & " " & myDate
        Debug.Print shell.Run(command, vbHide)
    Next i
End Sub
but it doesn't seem to work.
Is it obvious what I am doing wrong?
Thanks
Phil
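One thing worth checking: the shell command passes the currency and date as command-line arguments, but the script shown never reads them. A minimal entry point for the Python side, sketched here as an assumption about how the two pieces are meant to connect:

import sys

if __name__ == '__main__':
    # sys.argv[1] is the currency cell value, sys.argv[2] is myDate from VBA
    print(myFunction(sys.argv[1], sys.argv[2]))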

requests.exceptions.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)

I have a script that's made in python as below
#!/bin/env python2.7
# Run around 1059 as early as 1055.
# Polling times vary pick something nice.
# Ghost checkout timer can be changed by
# adjusting for loop range near bottom.
# Fill out personal data in checkout payload dict.
import sys, json, time, requests, urllib2
from datetime import datetime
qty='1'
def UTCtoEST():
    current=datetime.now()
    return str(current) + ' EST'

print
poll=raw_input("Polling interval? ")
poll=int(poll)
keyword=raw_input("Product name? ").title() # hardwire here by declaring keyword as a string
color=raw_input("Color? ").title() # hardwire here by declaring keyword as a string
sz=raw_input("Size? ").title() # hardwire here by declaring keyword as a string
print
print UTCtoEST(),':: Parsing page...'

def main():
    global ID
    global variant
    global cw
    req = urllib2.Request('http://www.supremenewyork.com/mobile_stock.json')
    req.add_header('User-Agent', "User-Agent','Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25")
    resp = urllib2.urlopen(req)
    data = json.loads(resp.read())
    ID=0
    for i in range(len(data[u'products_and_categories'].values())):
        for j in range(len(data[u'products_and_categories'].values()[i])):
            item=data[u'products_and_categories'].values()[i][j]
            name=str(item[u'name'].encode('ascii','ignore'))
            # SEARCH WORDS HERE
            # if string1 in name or string2 in name:
            if keyword in name:
                # match/(es) detected!
                # can return multiple matches but you're
                # probably buying for resell so it doesn't matter
                myproduct=name
                ID=str(item[u'id'])
                print UTCtoEST(),'::',name, ID, 'found ( MATCHING ITEM DETECTED )'
    if (ID == 0):
        # variant flag unchanged - nothing found - rerun
        time.sleep(poll)
        print UTCtoEST(),':: Reloading and reparsing page...'
        main()
    else:
        print UTCtoEST(),':: Selecting',str(myproduct),'(',str(ID),')'
        jsonurl = 'http://www.supremenewyork.com/shop/'+str(ID)+'.json'
        req = urllib2.Request(jsonurl)
        req.add_header('User-Agent', "User-Agent','Mozilla/5.0 (iPhone; CPU iPhone OS 6_1_4 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10B350 Safari/8536.25")
        resp = urllib2.urlopen(req)
        data = json.loads(resp.read())
        found=0
        for numCW in data['styles']:
            # COLORWAY TERMS HERE
            # if string1 in numCW['name'] or string2 in numCW['name']:
            if color in numCW['name'].title():
                for sizes in numCW['sizes']:
                    # SIZE TERMS HERE
                    if str(sizes['name'].title()) == sz: # Medium
                        found=1
                        variant=str(sizes['id'])
                        cw=numCW['name']
                        print UTCtoEST(),':: Selecting size:', sizes['name'],'(',numCW['name'],')','(',str(sizes['id']),')'
        if found == 0:
            # DEFAULT CASE NEEDED HERE - EITHER COLORWAY NOT FOUND OR SIZE NOT IN RUN OF PRODUCT
            # PICKING FIRST COLORWAY AND LAST SIZE OPTION
            print UTCtoEST(),':: Selecting default colorway:',data['styles'][0]['name']
            sizeName=str(data['styles'][0]['sizes'][len(data['styles'][0]['sizes'])-1]['name'])
            variant=str(data['styles'][0]['sizes'][len(data['styles'][0]['sizes'])-1]['id'])
            cw=data['styles'][0]['name']
            print UTCtoEST(),':: Selecting default size:',sizeName,'(',variant,')'

main()
session=requests.Session()
addUrl='http://www.supremenewyork.com/shop/'+str(ID)+'/add.json'
addHeaders={
    'Host': 'www.supremenewyork.com',
    'Accept': 'application/json',
    'Proxy-Connection': 'keep-alive',
    'X-Requested-With': 'XMLHttpRequest',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-us',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'http://www.supremenewyork.com',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257',
    'Referer': 'http://www.supremenewyork.com/mobile'
}
addPayload={
    'size': str(variant),
    'qty': '1'
}
print UTCtoEST() + ' :: Adding product to cart...'
addResp=session.post(addUrl,data=addPayload,headers=addHeaders)
print UTCtoEST() + ' :: Checking status code of response...'
if addResp.status_code != 200:
    print UTCtoEST() + ' ::',addResp.status_code,'Error \nExiting...'
    print
    sys.exit()
else:
    if addResp.json() == []:
        print UTCtoEST() + ' :: Response Empty! - Problem Adding to Cart\nExiting...'
        print
        sys.exit()
    print UTCtoEST() + ' :: '+str(cw)+' - '+addResp.json()[0]['name']+' - '+addResp.json()[0]['size_name']+' added to cart!'
checkoutUrl='https://www.supremenewyork.com/checkout.json'
checkoutHeaders={
    'host': 'www.supremenewyork.com',
    'If-None-Match': '"*"',
    'Accept': 'application/json',
    'Proxy-Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-us',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Origin': 'http://www.supremenewyork.com',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Mobile/11D257',
    'Referer': 'http://www.supremenewyork.com/mobile'
}
#################################
# FILL OUT THESE FIELDS AS NEEDED
#################################
checkoutPayload={
    'store_credit_id': '',
    'from_mobile': '1',
    'cookie-sub': '%7B%22'+str(variant)+'%22%3A1%7D', # cookie-sub: eg. {"VARIANT":1} urlencoded
    'same_as_billing_address': '1',
    'order[billing_name]': 'anon mous', # FirstName LastName
    'order[email]': 'anon#mailinator.com', # email#domain.com
    'order[tel]': '999-999-9999', # phone-number-here
    'order[billing_address]': '123 Seurat lane', # your address
    'order[billing_address_2]': '',
    'order[billing_zip]': '90210', # zip code
    'order[billing_city]': 'Beverly Hills', # city
    'order[billing_state]': 'CA', # state
    'order[billing_country]': 'USA', # country
    'store_address': '1',
    'credit_card[type]': 'visa', # master or visa
    'credit_card[cnb]': '9999 9999 9999 9999', # credit card number
    'credit_card[month]': '01', # expiration month
    'credit_card[year]': '2026', # expiration year
    'credit_card[vval]': '123', # cvc/cvv
    'order[terms]': '0',
    'order[terms]': '1'
}
# GHOST CHECKOUT PREVENTION WITH ROLLING PRINT
for i in range(5):
    sys.stdout.write("\r" + UTCtoEST() + ' :: Sleeping for '+str(5-i)+' seconds to avoid ghost checkout...')
    sys.stdout.flush()
    time.sleep(1)
print
print UTCtoEST() + ' :: Firing checkout request!'
checkoutResp=session.post(checkoutUrl,data=checkoutPayload,headers=checkoutHeaders)
try:
    print UTCtoEST() + ' :: Checkout',checkoutResp.json()['status'].title()+'!'
except:
    print UTCtoEST() + ':: Error reading status key of response!'
    print checkoutResp.json()
print
print checkoutResp.json()
if checkoutResp.json()['status'] == 'failed':
    print
    print '!!!ERROR!!! ::',checkoutResp.json()['errors']
    print
When I run it, everything goes correctly, but at the end it gives me this error:
Traceback (most recent call last):
  File "/Users/"USERNAME"/Desktop/supreme.py", line 167, in <module>
    checkoutResp=session.post(checkoutUrl,data=checkoutPayload,headers=checkoutHeaders)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 522, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/sessions.py", line 596, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/adapters.py", line 497, in send
    raise SSLError(e, request=request)
SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:590)
As can be seen from the SSLLabs report, the server supports only TLS 1.2, and only very few ciphers. Given a path like /Users/... in your error output, I assume that you are using Mac OS X. The version of OpenSSL shipped with Mac OS X (and probably used by your Python) is 0.9.8, which is too old to support TLS 1.2. This means that the server will not accept the SSL 3.0 or TLS 1.0 request from your client.
For information on how to fix this problem by updating your OpenSSL version see Updating openssl in python 2.7.
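To confirm which OpenSSL your Python is actually linked against (anything in the 0.9.8 line lacks TLS 1.2 support), you can check from Python itself:

import ssl
print(ssl.OPENSSL_VERSION)  # e.g. 'OpenSSL 0.9.8zh 14 Jan 2016' on an affected system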

Handling web-log files as a CSV with Python

I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple.
I have no control over the log file structures, and there are varying types.
The delimiter is a space; the problem is that some log file formats place a space in the timestamp, as in Logfile 2 below. The CSV reader then reads the date/time stamp as two fields.
Logfile 1
73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
Logfile 2
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"
The log files typically have the timestamp within square brackets, but I cannot find a way of handling them as "quotes". On top of that, square brackets are not always used as quotes within the logs either (see the [2] later in the logs).
I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.
How can I handle this situation automatically?
This will do it; you need to use a regex in place of sep.
For example, this will parse Nginx log files into a pandas.DataFrame:
import pandas as pd

df = pd.read_csv(log_file,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python',
                 usecols=[0, 3, 4, 5, 6, 7, 8],
                 names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
                 na_values='-',
                 header=None)
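To see what that separator regex is doing, here is a small standalone check with re.split on the Logfile 2 sample (unlike the pandas reader, re.split keeps the surrounding quotes):

import re

sep = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'
line = '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"'
print(re.split(sep, line))
# ['13', '58', '224', '[22/Jul/2016:06:51:06 +0000]', '2[2]', '"GET /test HTTP/1.1"']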
Edit:
import re

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
print(re.match(regex, line).groups())
The output would be a tuple with 6 pieces of information
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
Bruteforce! I concatenate the 2 time fields so the timestamp can be interpreted if needed.
data.csv =
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
import csv

with open("data.csv") as f:
    c = csv.reader(f, delimiter=" ")
    for row in c:
        if len(row) == 7:
            r = row[:3] + [row[3] + row[4]] + row[5:]
            print(r)
        else:
            print(row)
['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']
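If the concatenated timestamp then needs to be interpreted, it parses directly with strptime (a sketch; the %z directive requires Python 3):

from datetime import datetime

ts = datetime.strptime('[22/Jul/2016:06:51:06+0000]', '[%d/%b/%Y:%H:%M:%S%z]')
print(ts)  # 2016-07-22 06:51:06+00:00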

Parse raw HTTP Headers

I have a string of raw HTTP and I would like to represent the fields in an object. Is there any way to parse the individual headers from an HTTP string?
'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n
[...]'
Update: It’s 2019, so I have rewritten this answer for Python 3, following a confused comment from a programmer trying to use the code. The original Python 2 code is now down at the bottom of the answer.
There are excellent tools in the Standard Library both for parsing RFC 822 headers, and also for parsing entire HTTP requests. Here is an example request string (note that Python treats it as one big string, even though we are breaking it across several lines for readability) that we can feed to my examples:
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)
As @TryPyPy points out, you can use Python’s email message library to parse the headers — though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it:
from email.parser import BytesParser
request_line, headers_alone = request_text.split(b'\r\n', 1)
headers = BytesParser().parsebytes(headers_alone)
print(len(headers)) # -> "3"
print(headers.keys()) # -> ['Host', 'Accept-Charset', 'Accept']
print(headers['Host']) # -> "cm.bell-labs.com"
But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Though its documentation is a bit obscure — a problem with the whole suite of HTTP and URL tools in the Standard Library — all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we do not have one!).
So here is our specialization of the Standard Library class:
from http.server import BaseHTTPRequestHandler
from io import BytesIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to properly call it, but what can you do? Here is how you would use this simple class:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print(request.error_code) # None (check this first)
print(request.command) # "GET"
print(request.path) # "/who/ken/trust.html"
print(request.request_version) # "HTTP/1.1"
print(len(request.headers)) # 3
print(request.headers.keys()) # ['Host', 'Accept-Charset', 'Accept']
print(request.headers['host']) # "cm.bell-labs.com"
If there is an error during parsing, the error_code will not be None:
# Parsing can result in an error code and message
request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')
print(request.error_code) # 400
print(request.error_message) # "Bad request syntax ('GET')"
I prefer using the Standard Library like this because I suspect that they have already encountered and resolved any edge cases that might bite me if I try re-implementing an Internet specification myself with regular expressions.
Old Python 2 code
Here’s the original code for this answer, back when I first wrote it:
request_text = (
    'GET /who/ken/trust.html HTTP/1.1\r\n'
    'Host: cm.bell-labs.com\r\n'
    'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    'Accept: text/html;q=0.9,text/plain\r\n'
    '\r\n'
)
And:
# Ignore the request line and parse only the headers
from mimetools import Message
from StringIO import StringIO
request_line, headers_alone = request_text.split('\r\n', 1)
headers = Message(StringIO(headers_alone))
print len(headers) # -> "3"
print headers.keys() # -> ['accept-charset', 'host', 'accept']
print headers['Host'] # -> "cm.bell-labs.com"
And:
from BaseHTTPServer import BaseHTTPRequestHandler
from StringIO import StringIO

class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = StringIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message):
        self.error_code = code
        self.error_message = message
And:
# Using this new class is really easy!
request = HTTPRequest(request_text)
print request.error_code # None (check this first)
print request.command # "GET"
print request.path # "/who/ken/trust.html"
print request.request_version # "HTTP/1.1"
print len(request.headers) # 3
print request.headers.keys() # ['accept-charset', 'host', 'accept']
print request.headers['host'] # "cm.bell-labs.com"
And:
# Parsing can result in an error code and message
request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')
print request.error_code # 400
print request.error_message # "Bad request syntax ('GET')"
mimetools has been deprecated since Python 2.3 and removed entirely from Python 3.
Here is how you can do it in Python 3:
import email
import io
import pprint
# […]
request_line, headers_alone = request_text.split('\r\n', 1)
message = email.message_from_file(io.StringIO(headers_alone))
headers = dict(message.items())
pprint.pprint(headers, width=160)
This seems to work fine if you strip the GET line:
import mimetools
from StringIO import StringIO
he = "Host: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n"
m = mimetools.Message(StringIO(he))
print m.headers
A way to parse your example and add information from the first line to the object would be:
import mimetools
from StringIO import StringIO
he = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\n'
# Pop the first line for further processing
request, he = he.split('\r\n', 1)
# Get the headers
m = mimetools.Message(StringIO(he))
# Add request information
m.dict['method'], m.dict['path'], m.dict['http-version'] = request.split()
print m['method'], m['path'], m['http-version']
print m['Connection']
print m.headers
print m.dict
Using Python 3.7, urllib3.HTTPResponse, and http.client.parse_headers, with curl producing the raw response (-i includes the response headers, -L follows redirects):
curl -i -L -X GET "http://httpbin.org/relative-redirect/3" | python -c '
import sys
from io import BytesIO
from urllib3 import HTTPResponse
from http.client import parse_headers

rawresponse = sys.stdin.read().encode("utf8")
redirects = []

while True:
    header, body = rawresponse.split(b"\r\n\r\n", 1)
    if body[:4] == b"HTTP":
        redirects.append(header)
        rawresponse = body
    else:
        break

f = BytesIO(header)
# read one line for HTTP/2 STATUSCODE MESSAGE
requestline = f.readline().split(b" ")
protocol, status = requestline[:2]
headers = parse_headers(f)

resp = HTTPResponse(body, headers=headers)
resp.status = int(status)

print("headers")
print(resp.headers)
print("redirects")
print(redirects)
'
Output:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 215 100 215 0 0 435 0 --:--:-- --:--:-- --:--:-- 435
headers
HTTPHeaderDict({'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Date': 'Thu, 20 Sep 2018 05:39:25 GMT', 'Content-Type': 'application/json', 'Content-Length': '215', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true', 'Via': '1.1 vegur'})
redirects
[b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/2\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /relative-redirect/1\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur',
b'HTTP/1.1 302 FOUND\r\nConnection: keep-alive\r\nServer: gunicorn/19.9.0\r\nDate: Thu, 20 Sep 2018 05:39:24 GMT\r\nContent-Type: text/html; charset=utf-8\r\nContent-Length: 0\r\nLocation: /get\r\nAccess-Control-Allow-Origin: *\r\nAccess-Control-Allow-Credentials: true\r\nVia: 1.1 vegur']
notes:
https://urllib3.readthedocs.io/en/latest/reference/#urllib3.response.HTTPResponse
parse_headers()
In a Pythonic way:
request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)

print({k: v.strip() for k, v in [line.split(":", 1)
      for line in request_text.decode().splitlines() if ":" in line]})
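Run against the request_text above, this prints:

{'Host': 'cm.bell-labs.com', 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3', 'Accept': 'text/html;q=0.9,text/plain'}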
There is another, simpler and safer way to handle headers: more object-oriented, with no need for manual parsing.
Short demo.
1. Parse them
From str, bytes, fp, dict, requests.Response, email.Message, httpx.Response, urllib3.HTTPResponse.
from requests import get
from kiss_headers import parse_it
response = get('https://www.google.fr')
headers = parse_it(response)
headers.content_type.charset # output: ISO-8859-1
# It's the same as
headers["content-type"]["charset"] # output: ISO-8859-1
2. Build them
This
from kiss_headers import *

headers = (
    Host("developer.mozilla.org")
    + UserAgent(
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0"
    )
    + Accept("text/html")
    + Accept("application/xhtml+xml")
    + Accept("application/xml", qualifier=0.9)
    + Accept(qualifier=0.8)
    + AcceptLanguage("en-US")
    + AcceptLanguage("en", qualifier=0.5)
    + AcceptEncoding("gzip")
    + AcceptEncoding("deflate")
    + AcceptEncoding("br")
    + Referer("https://developer.mozilla.org/testpage.html")
    + Connection(should_keep_alive=True)
    + UpgradeInsecureRequests()
    + IfModifiedSince("Mon, 18 Jul 2016 02:36:04 GMT")
    + IfNoneMatch("c561c68d0ba92bbeb8b0fff2a9199f722e3a621a")
    + CacheControl(max_age=0)
)

raw_headers = str(headers)
Will become
Host: developer.mozilla.org
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:50.0) Gecko/20100101 Firefox/50.0
Accept: text/html, application/xhtml+xml, application/xml; q="0.9", */*; q="0.8"
Accept-Language: en-US, en; q="0.5"
Accept-Encoding: gzip, deflate, br
Referer: https://developer.mozilla.org/testpage.html
Connection: keep-alive
Upgrade-Insecure-Requests: 1
If-Modified-Since: Mon, 18 Jul 2016 02:36:04 GMT
If-None-Match: "c561c68d0ba92bbeb8b0fff2a9199f722e3a621a"
Cache-Control: max-age="0"
Documentation for the kiss-headers library.
Is there any way to parse the individual headers from an HTTP string?
I wrote a simple function that can return a dictionary object, hope it can help you. ^_^
Python 3
def parse_request(request):
    raw_list = request.split("\r\n")
    request = {}
    for index in range(1, len(raw_list)):
        item = raw_list[index].split(":", 1)  # split on the first colon only
        if len(item) == 2:
            request.update({item[0].lstrip(' '): item[1].lstrip(' ')})
    return request
raw_request = 'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n'
request = parse_request(raw_request)
print(request)
print('\n')
print(request.keys())
Output:
{'Host': 'www.google.com', 'Connection': 'keep-alive', 'Accept': 'application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5',
'User-Agent': 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': 'gzip,deflate,sdch', 'Avail-Dictionary': 'GeNLY2f-', 'Accept-Language': 'en-US,en;q=0.8'}
dict_keys(['Host', 'Connection', 'Accept', 'User-Agent', 'Accept-Encoding', 'Avail-Dictionary', 'Accept-Language'])
In Python 3:
from email import message_from_string

data = sock.recv(4096)  # sock is an already-connected socket object
headers = message_from_string(str(data, 'ASCII').split('\r\n', 1)[1])
print(headers['Host'])
From this question: How to parse raw HTTP request in Python 3?
Here are some Python packages aimed at proper HTTP protocol parsing:
https://dpkt.readthedocs.io/en/latest/api/api_auto.html#module-dpkt.http
https://h11.readthedocs.io/en/latest/
https://github.com/benoitc/http-parser/ (C backend)
https://github.com/MagicStack/httptools (based on NodeJS's C backend)
https://github.com/silentsignal/netlib-offline (shameless plug)
