I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple.
I have no control over the log file structures, and there are varying types.
The delimiter is a space ( ). The problem is that some log file formats place a space in the timestamp, as in Logfile 2 below, so the CSV reader reads the date/time stamp as two fields.
Logfile 1
73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
Logfile 2
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"
The log files typically have the timestamp within square brackets, but I cannot find a way of treating those as "quotes". On top of that, square brackets are not always used as quotes within the logs (see the [2] later in each line).
I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.
How can I handle this situation automatically?
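You need to use a regex in place of a single-character sep.
For example, this will parse NGINX log files into a pandas.DataFrame:
import pandas as pd
df = pd.read_csv(log_file,
                 # assumed regex separator: whitespace that is outside "..." and outside [...]
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 engine='python',  # regex separators need the Python parsing engine
                 usecols=[0, 3, 4, 5, 6, 7, 8],
                 names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'])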
Edit:
import re
line = ' - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
# [\d.]* rather than [\d.]+ so that an empty address field, as in this sample line, still matches
regex = r'([\d.]*) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
print(re.match(regex, line).groups())
The output is a tuple with 6 pieces of information:
('', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
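The idea of splitting on a regular expression instead of a single character can also be tried with the standard library alone, on the two log formats from the question. This is only a sketch: the pattern is an assumption (it splits on whitespace that is neither inside double quotes nor inside a [...] pair), and the helper name split_log_line is made up for illustration.

```python
import re

# Assumed pattern: a whitespace character that is
#  - followed by an even number of double quotes (so it is not inside "..."), and
#  - not followed by a "]" that has no "[" before it (so it is not inside [...]).
FIELD_SEP = re.compile(r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])')

def split_log_line(line):
    """Split one log line into fields, keeping [...] and "..." groups whole."""
    return FIELD_SEP.split(line.strip())

for sample in (
    '73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"',
    '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"',
):
    print(split_log_line(sample))
```

Both lines come out with six fields, and the 2[2] field is left alone because its brackets do not enclose any whitespace. The quotes stay attached to the request field, so strip them afterwards if needed.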
Brute force! I concatenate the two time fields so the timestamp can be interpreted if needed.
data.csv =
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
import csv
with open("data.csv") as f:
    c = csv.reader(f, delimiter=" ")
    for row in c:
        if len(row) == 7:
            # the timestamp was split in two: glue fields 3 and 4 back together
            r = row[:3] + [row[3] + row[4]] + row[5:]
        else:
            r = row
        print(r)
Output:
['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']
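A slightly more general variant of the same brute-force idea, as a sketch: rejoin whichever fields lie between one that starts with [ and the next one that ends with ], then load the row into a namedtuple as the question asks. The field names in LogEntry and the helper merge_bracketed are made up for illustration, since the posts do not say what the columns mean.

```python
import csv
import io
from collections import namedtuple

# Hypothetical field names; the original post does not name the columns.
LogEntry = namedtuple("LogEntry", "col1 col2 col3 timestamp col5 request")

def merge_bracketed(row):
    """Rejoin fields that csv split inside a [...] group."""
    merged, buffer = [], None
    for field in row:
        if buffer is not None:
            buffer.append(field)
            if field.endswith("]"):
                merged.append(" ".join(buffer))
                buffer = None
        elif field.startswith("[") and not field.endswith("]"):
            buffer = [field]
        else:
            merged.append(field)
    if buffer is not None:  # unterminated "[": keep the pieces as they were
        merged.extend(buffer)
    return merged

data = io.StringIO(
    '13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"\n'
    '13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"\n'
)
entries = [LogEntry(*merge_bracketed(row)) for row in csv.reader(data, delimiter=" ")]
for entry in entries:
    print(entry.timestamp, entry.request)
```

Joining with a space keeps the timestamp in its original form, so it can still be parsed later. A field such as 2[2] is untouched because it does not start with a bracket.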
I have a flask-socketio server and several react socket.io-client namespaces connected to it.
The clients are on the same IP and port but in different namespaces, such as /finalResponse, /algoSignal etc.
Client app
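componentDidMount() {
  // Connecting to backend api for data
  const { dispatch } = this.props;
  socket = io(`http://${IP.clientOrderFeedFlaskIP}:6050/algoSignal`);
  // ask the server for data every 2 seconds
  var interval = setInterval(() => {
    socket.emit("getdata");
  }, 2000);
  socket.on("algo_signal_data", (res) => {
    // ... handle the incoming data (the rest of the handler is not shown)
  });
}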
Here it sets up an interval that pings the server with 'getdata' every 2 seconds to request data. There is an event bucket that handles incoming data in 'algo_signal_data'.
The 'update' bucket fetches data on refresh.
Similar to this I have around 8-9 clients in different namespaces.
Backend server
@socket.on('getdata', namespace = '/algoSignal')
def algoSignal():
    # global algosig
    lastOrderUpdated = json.loads(pingConn.get('lastOrderUpdated'))
    if lastOrderUpdated != '0':
        print('---------------------sent algo signal data ---------------------')
        algosig = SymphonyOrderRaw(mongoIp).algoSignal.to_json(orient = 'records')
        emit('algo_signal_data', algosig, broadcast = True)
        pingConn.set('lastOrderUpdated', json.dumps('0'))

@socket.on('getdata', namespace='/finalResponse')
def getfinalResponse():
    # global finalres
    lastOrderUpdated = json.loads(pingConn.get('lastOrderUpdated'))
    if lastOrderUpdated != '0':
        finalres = FinalResponse(mongoIp).finalResponse.to_json(orient = 'records')
        print('--------------------sent final Response data----------------')
        # print(finalres)
        emit('response_data', finalres, broadcast = True)
        pingConn.set('lastOrderUpdated', json.dumps('0'))
I have event buckets like these which receive the 'getdata' prompt and check whether the database was updated. If it was, the handler sends the data; otherwise it sends a status code.
The expected result is that each client page receives data when the database is updated.
The problem is that when I open multiple client pages in separate tabs, the server sends the data to only one page. When the other pages are refreshed, the data is updated in those as well.
Update bucket
@socket.on('update', namespace = '/algoSignal')
def update2():
    algosig = SymphonyOrderRaw(mongoIp).algoSignal.to_json(orient = 'records')
    emit('algo_signal_data', algosig, broadcast = True)
This is the output at the server terminal
(flask-venv) mockAdmin@python-react:~/RMS-git/FlaskServerCodes/serverFlask$ python3 main.py
* Restarting with stat
* Debugger is active!
* Debugger PIN: 113-048-226
(59454) wsgi starting up on
(59454) accepted ('', 65522)
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vTi HTTP/1.1" 200 401 0.000859
algo signal client connected
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vVA&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.000986
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vVh&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 249 0.000353
(59454) accepted ('', 65523)
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vWP&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.001710
(59454) accepted ('', 65524)
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vWQ&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000171
(59454) accepted ('', 65525)
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vWZ HTTP/1.1" 200 401 0.000345
(59454) accepted ('', 65526)
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vX6&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000255
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vX5&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.005536
final response client connected
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vXg&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.000881
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vXl&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000180
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vY4&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 252 0.000208
(59454) accepted ('', 65527)
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vYk&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 260 0.000265
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vYj&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.002399
- - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vZL&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 235 0.000189
- - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vZM&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.005291
-----------------------------sent final Response data-----------------------
How can I send data to all the open client pages at the same time from a single server?
I am new at python and I need to parse status codes. I have a task to parse the HTTP log file:
Group the logged requests by IP address or HTTP status code
(selected by the user).
Calculate one of the following (selected by the user) for each group:
Request count
Request count percentage of all logged requests
The total number of bytes transferred.
I have made the calculation of requests and percentages. Now I do not know how to count the transferred bytes (3rd task).
An example of the log file (the bytes are shown after the status code: 6146, 52315, 12251, 54662):
- - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
- - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
- - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
- - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
My code :
import re
import sys
from collections import Counter

def getBytes(filename):
    with open(filename, 'r') as logfile:
        for line in logfile:
            newlist = line.split(" ")
            print(newlist[0] + " " + newlist[9])

def countIp(filename):
    print("hey")
    myregex = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    with open(filename) as f:
        log = f.read()
    my_iplist = re.findall(myregex, log)
    ipcount = Counter(my_iplist)
    s = sum(ipcount.values())
    for k, v in ipcount.items():
        pct = v * 100.0 / s
        print("IP Address " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

def countStatusCode(filename):
    print("hey")
    myregex = r'\b[2-5]\d\d\s'
    with open(filename) as f:
        log = f.read()
    my_iplist = re.findall(myregex, log)
    ipcount = Counter(my_iplist)
    s = sum(ipcount.values())
    for k, v in ipcount.items():
        pct = v * 100.0 / s
        print("Status Code " + "=> " + str(k) + " " + "Count " + "=> " + str(v) + " Percentage " + "=> " + str(pct) + "%")

if __name__ == '__main__':
    val = input("Enter your choice (ip or status code): ")
    # assumes the log file path is passed as the first command-line argument
    if val in ["ip", "IP", "Ip", "iP"]:
        countIp(sys.argv[1])
    elif val in ["statusCode", "code", "sc", "Status Code", "status"]:
        countStatusCode(sys.argv[1])
To get the number of bytes transferred from your log file:
import re

def getBytes(filename):
    with open(filename, 'r') as logfile:
        for line in logfile:
            regex = r'\/.+?\sHTTP\/1\..\"\s.{3}\s(.+?)\s'
            match = re.search(regex, line)
            if match:  # skip lines that do not look like a request
                bytesCount = match[1]
                print("Bytes transferred: " + bytesCount)
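To go from printing each value to the total per group that the task asks for, the sizes can be summed in a dictionary. The following is a minimal sketch, assuming the combined log format where a whitespace split puts the status at index 8 and the size at index 9 (the same index the question uses). The function name total_bytes_by and the sample addresses (from the 192.0.2.0/24 documentation range) are placeholders, not taken from the posts.

```python
from collections import defaultdict

def total_bytes_by(lines, group="ip"):
    """Sum the transferred bytes per IP address or per status code."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        if len(fields) < 10:
            continue  # not a complete log record
        ip, status, size = fields[0], fields[8], fields[9]
        key = ip if group == "ip" else status
        # The size column is "-" when no body was sent; count that as 0.
        totals[key] += int(size) if size.isdigit() else 0
    return dict(totals)

# Sample records; the IP addresses are placeholders.
sample = [
    '192.0.2.1 - - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 404 - "-" "Mozilla/5.0"',
]
print(total_bytes_by(sample, "ip"))
print(total_bytes_by(sample, "status"))
```

Splitting on whitespace only works while the request line has exactly three parts (method, path, protocol); a regex is the safer choice for logs with unusual requests.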
Rather than using multiple regular expressions making multiple passes through the log, this solution uses one regular expression to pull out all the relevant values in one pass.
Function process_log is passed the text of the log as a single string. In the following demo program, a test string is used. An actual implementation would call this function with the results of reading the actual log file.
To keep track of IP address/status pairs, a defaultdict using a list as the default_factory is used. The number of items in the list counts the number of times the IP address/status combination has been seen, and each list item is the number of bytes transferred for that HTTP request. For example, a key/value pair of the ip_counter dictionary might be:
key: ('', '200') value: [6213, 9876, 376]
The interpretation of the above is that there were 3 instances of status code '200' for ip address '' discovered. The bytes transferred for these 3 instances were 6213, 9876 and 376.
Explanation of the regex:
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)
(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - This is essentially the regex used by the OP to recognize an IP address so it shouldn't require much of an explanation. I have followed it with - - to provide additional following context just in case there are other instances of similar looking strings. The IP address is captured in Capture Group 1.
.*? This will match non-greedily 0 or more non-newline characters until the following.
HTTP/1.1" Matches the string HTTP/1.1" to provide left-hand context for the following.
(\d+) Matches one or more digits in Capture Group 2 (the status).
(a literal space) Matches a single space.
(\d+) Matches one or more digits in Capture Group 3 (the transferred bytes).
In other words, I just want to make sure I am picking up the right fields from the right place by matching what I expect to find next to the fields I am looking for. When your regex returns multiple groups, finditer is often more convenient than findall. finditer returns an iterator yielding a match object for each iteration. I have added code to produce statistics for both ip/status code/transferred bytes and just status code. You need only one or the other, depending on what the user wants.
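The difference between the two calls can be seen in a small sketch. The two-line text and its addresses (from the 192.0.2.0/24 documentation range) are placeholders rather than data from the question, and the pattern is a shortened form of the one explained above.

```python
import re

# Placeholder text with documentation-range addresses.
text = (
    '192.0.2.1 - - "GET / HTTP/1.1" 200 512\n'
    '192.0.2.7 - - "GET /a HTTP/1.1" 404 0\n'
)
pattern = re.compile(r'(\d{1,3}(?:\.\d{1,3}){3}) - - .*?HTTP/1\.1" (\d+) (\d+)')

# findall returns plain tuples, one string per capture group.
rows = pattern.findall(text)
print(rows)

# finditer yields match objects, so groups, spans and the full match are available.
for m in pattern.finditer(text):
    print(m.group(1), m.group(2), int(m.group(3)), m.span())
```

With findall you only get the captured strings; with finditer you can also ask each match where it was found, which helps when debugging a pattern against a long log.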
The code:
import re
from collections import defaultdict

def process_log(log):
    ip_counter = defaultdict(list)
    status_counter = defaultdict(int)
    total_count = 0
    for m in re.finditer(r'(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - - .*?HTTP/1.1" (\d+) (\d+)', log):
        total_count += 1
        ip = m[1]
        status = m[2]
        num_bytes = int(m[3])
        ip_counter[(ip, status)].append(num_bytes)
        status_counter[status] += 1
    for k, v in ip_counter.items():
        count = len(v)
        percentage = count/total_count
        total_bytes = sum(v)
        ip = k[0]
        status = k[1]
        print(f"IP Address => {ip}, status => {status}, Count => {count}, Percentage => {percentage}, Total Bytes Transferred => {total_bytes}")
    for k, v in status_counter.items():
        count = v
        percentage = count/total_count
        print(f"Status Code => {k}, Percentage => {percentage}")

log = """- - [17/May/2015:10:05:17 +0000] "GET /images/jordan-80.png HTTP/1.1" 200 6146 "http://www.semicomplete.com/articles/dynamic-dns-with-dhcp/" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
- - [17/May/2015:10:05:21 +0000] "GET /images/web/2009/banner.png HTTP/1.1" 200 52315 "http://www.semicomplete.com/style2.css" "Mozilla/5.0 (X11; Linux x86_64; rv:25.0) Gecko/20100101 Firefox/25.0"
- - [17/May/2015:10:05:40 +0000] "GET /blog/tags/ipv6 HTTP/1.1" 200 12251 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
- - [17/May/2015:10:05:25 +0000] "GET /presentations/logstash-monitorama-2013/images/elasticsearch.png HTTP/1.1" 200 8026 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
- - [17/May/2015:10:05:59 +0000] "GET /presentations/logstash-monitorama-2013/images/logstashbook.png HTTP/1.1" 200 54662 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"
"""

process_log(log)
Prints:
IP Address =>, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 58461
IP Address =>, status => 200, Count => 1, Percentage => 0.2, Total Bytes Transferred => 12251
IP Address =>, status => 200, Count => 2, Percentage => 0.4, Total Bytes Transferred => 62688
Status Code => 200, Percentage => 1.0
I'm trying to parse web access logs using the following regex:
import re

pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+) \s+ # host %h
    \S+ \s+ # indent %l (unused)
    (?P<remote_user>\S+) \s+ # user %u
    \[(?P<time_received>.*?)\] \s+ # time %t
    "(?P<request>.*?)" \s+ # request "%r"
    (?P<status>[0-9]+) \s+ # status %>s
    (?P<response_bytes_clf>\S+) (?:\s+ # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
    "(?P<user_agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
    "[^"]*" )?)? # optional argument (unused)
$""")

def get_structured_access_log(access_log):
    return pattern.match(access_log).groupdict()
But some of the log lines contain malicious requests such as these:
- - [21/Dec/2011:05:47:03 +0000] "GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:04 +0000] "GET /gnu/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 271 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:04 +0000] "GET /phpgwapi/setup/tables_update.inc.php?appdir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 286 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:05 +0000] "GET /forum/install.php?phpbb_root_dir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 274 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:06 +0000] "GET /includes/calendar.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:06 +0000] "GET /includes/setup.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:07 +0000] "GET /inc/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:07 +0000] "GET /include/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 278 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:08 +0000] "GET /index.php?nic=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"
- - [21/Dec/2011:05:47:11 +0000] "GET /index.php?sec=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"
These requests fail to parse with the above regex, while other, normal web requests are parsed successfully.
Here are some access log lines that are parsed successfully:
- - [28/Apr/2012:08:12:57 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
- - [28/Apr/2012:10:23:02 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
- - [28/Apr/2012:10:23:02 +0100] "GET / HTTP/1.1" 200 4399 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
- - [28/Apr/2012:11:57:26 +0100] "GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"
The exception error message is:
'NoneType' object has no attribute 'groupdict'
How can I fix the regex so it can parse these complex requests as well?
re.match returns a match object, or None if the string does not match the pattern. Calling .groupdict() on None raises the error you see.
In the first set of example lines, the last field contains escaped double quotes: "<?php system(\"id\"); ?>"
The negated character class [^"]* matches anything except a double quote, so it cannot get past the first inner double quote (the one in \"id). Once the pattern also has to reach the end of the string, the match fails.
You could fix your pattern by replacing the negated character class in "(?P<user_agent>[^"]*)" with .*?, which matches any character except a newline, as few times as possible.
Your pattern might look like:
(?P<remote_host>\S+) \s+ # host %h
\S+ \s+ # indent %l (unused)
(?P<remote_user>\S+) \s+ # user %u
\[(?P<time_received>.*?)\] \s+ # time %t
"(?P<request>.*?)" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<response_bytes_clf>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<user_agent>.*? )" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )?)? # optional argument (unused)
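A quick way to check the change is to run the adjusted pattern against one of the failing lines. This is a sketch under a few assumptions: the host 192.0.2.1 is a placeholder, the pattern is anchored with $ at the end, the trailing optional argument is left out for brevity, and the helper parse_line is a hypothetical name that returns None instead of raising when a line does not match.

```python
import re

# Adjusted pattern; the "$" end anchor is an assumption.
pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+) \s+              # host %h
    \S+ \s+                               # %l (unused)
    (?P<remote_user>\S+) \s+              # user %u
    \[(?P<time_received>.*?)\] \s+        # time %t
    "(?P<request>.*?)" \s+                # request
    (?P<status>[0-9]+) \s+                # status
    (?P<response_bytes_clf>\S+) (?:\s+    # size, can be '-'
    "(?P<referrer>[^"?\s]*[^"]*)" \s+     # referrer
    "(?P<user_agent>.*?)" )?              # user agent, may hold escaped quotes
$""")

def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    m = pattern.match(line)
    return m.groupdict() if m else None

# One of the failing lines, with a placeholder host in front.
sample = r'192.0.2.1 - - [21/Dec/2011:05:47:03 +0000] "GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"'
fields = parse_line(sample)
print(fields["status"], fields["user_agent"])
```

The lazy .*? only works as intended because of the end anchor: without it, the match would stop at the first inner quote and silently truncate the user agent.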
I'm having trouble importing a .log file into my notebook. It works with a sample string, but when I try to import the file I get an invalid syntax error.
file_data = """ - - [07/Mar/2004:16:05:49 -0800] "GET
HTTP/1.1" 401 12846 - - [07/Mar/2004:19:03:58 -0800] "GET
HTTP/1.1" 401 12846
206-15-133-154.dialup.ziplink.net - - [11/Mar/2004:16:33:23 -0800] "HEAD
/twiki/bin/view/Main/SpamAssassinDeleting HTTP/1.1" 200 0"""
df = pd.read_csv(pd.compat.StringIO(file_data), names=[0, 'hour', 2, 3], sep=':', engine='python')
However, when I try to import a .log file, I get a syntax error.
df = pd.read_csv('/Users/john/Desktop/data_log.log'), names=[0, 'hour', 2, 3], sep=':', engine='python')
How can I solve it so I can count the 10 most common hours in the .log file?
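The syntax error in the second call comes from the closing parenthesis right after the file path: it ends the call before the keyword arguments are reached. For the counting itself, one possible approach that does not need pandas is sketched below. It assumes every line carries a timestamp of the form [07/Mar/2004:16:05:49 -0800]; the name most_common_hours and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Captures the hour in a timestamp such as [07/Mar/2004:16:05:49 -0800].
HOUR_RE = re.compile(r'\[\d{2}/\w{3}/\d{4}:(\d{2}):\d{2}:\d{2}')

def most_common_hours(lines, n=10):
    """Return the n most frequent request hours as (hour, count) pairs."""
    hours = Counter()
    for line in lines:
        m = HOUR_RE.search(line)
        if m:
            hours[m.group(1)] += 1
    return hours.most_common(n)

# Illustrative lines in the same timestamp format as the question.
sample = [
    '- - [07/Mar/2004:16:05:49 -0800] "GET / HTTP/1.1" 401 12846',
    '- - [07/Mar/2004:19:03:58 -0800] "GET / HTTP/1.1" 401 12846',
    '- - [11/Mar/2004:16:33:23 -0800] "HEAD / HTTP/1.1" 200 0',
]
top = most_common_hours(sample)
print(top)
```

With a real file, pass the open file object instead of the list, for example `with open(path) as f: most_common_hours(f)`, where path is the location of your .log file.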
I have a Python script that extracts the unique IP addresses from a log file and counts how many times each of those IPs pinged the server. The code is as follows.
import sys

def extract_ip(line):
    return line.split()[0]

def increase_count(ip_dict, ip_addr):
    if ip_addr in ip_dict:
        ip_dict[ip_addr] += 1
    else:
        ip_dict[ip_addr] = 1

def read_ips(infilename):
    res_dict = {}
    with open(infilename) as log_file:
        for line in log_file:
            if line.isspace():
                continue  # skip blank lines
            ip_addr = extract_ip(line)
            increase_count(res_dict, ip_addr)
    return res_dict

def write_ips(outfilename, ip_dict):
    with open(outfilename, "w") as out_file:
        for ip_addr, count in ip_dict.items():
            out_file.write("%5d\t%s\n" % (count, ip_addr))

def parse_cmd_line_args():
    if len(sys.argv) != 3:
        print("Usage: %s [infilename] [outfilename]" % sys.argv[0])
        sys.exit(1)
    return sys.argv[1], sys.argv[2]

def main():
    infilename, outfilename = parse_cmd_line_args()
    ip_dict = read_ips(infilename)
    write_ips(outfilename, ip_dict)

if __name__ == "__main__":
    main()
The log file is in the following format with 2L lines. These are the first 30 lines of the log file:
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /welcome.html HTTP/1.1" 204 212 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /css/epic.css HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:00 -0800] "GET /js/flash_detect_min.js HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:00 -0800] "GET /images/home-page-bottom.jpg HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:00 -0800] "GET /images/Facebook_Like.png HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:00 -0800] "GET /images/Twitter_Follow.png HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:00 -0800] "GET /images/home-page-top.jpg HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
- - [06/Mar/2012:00:00:01 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:01 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
- - [06/Mar/2012:00:00:01 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
- - [06/Mar/2012:00:00:01 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
the output of the file is in the following format
Number of Times IPS
Here the ip - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - - is pinged into epic 158 times totally.
Now I want to add functionality to the code so that if I pass a particular URL, it returns how many times that URL was accessed by each IP address (taking the IP addresses either from the log file or from the output file).
E.g. if I pass the url as input: http://www.epicbrowser.com/hrefadd.xml
the output should be in the following format 4 6
I assume you only want the IPs that requested one given URL. In that case you just have to add a filter to the program that skips the unwanted lines. The structure of the program can stay unchanged.
Because the log file does not contain the host, you have to pass only the path part of the URL as the third parameter, for example "/hrefadd.xml".
#!/usr/bin/env python3
# Counts the IP addresses of a log file.
# Assumption: the IP address is logged in the first column.
# Example line: - - [06/Mar/2012:00:00:00 -0800] \
# "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
import sys

def urlcheck(line, url):
    '''Checks if the url is part of the log line.'''
    lsplit = line.split()
    if len(lsplit) < 7:
        return False
    return url == lsplit[6]

def extract_ip(line):
    '''Extracts the IP address from the line.
    Currently it is assumed that the IP address is logged in
    the first column and the columns are space separated.'''
    return line.split()[0]

def increase_count(ip_dict, ip_addr):
    '''Increases the count of the IP address.
    If an IP address is not in the given dictionary,
    it is initially created and the count is set to 1.'''
    if ip_addr in ip_dict:
        ip_dict[ip_addr] += 1
    else:
        ip_dict[ip_addr] = 1

def read_ips(infilename, url):
    '''Read the IP addresses from the file and store (count)
    them in a dictionary - returns the dictionary.'''
    res_dict = {}
    with open(infilename) as log_file:
        for line in log_file:
            if line.isspace():
                continue  # skip blank lines
            if not urlcheck(line, url):
                continue  # skip requests for other paths
            ip_addr = extract_ip(line)
            increase_count(res_dict, ip_addr)
    return res_dict

def write_ips(outfilename, ip_dict):
    '''Write out the count and the IP addresses.'''
    with open(outfilename, "w") as out_file:
        for ip_addr, count in ip_dict.items():
            out_file.write("%s\t%5d\n" % (ip_addr, count))

def parse_cmd_line_args():
    '''Return the in and out file name and the url.
    If there are more or less than three parameters,
    an error is logged and the program exits.'''
    if len(sys.argv) != 4:
        print("Usage: %s [infilename] [outfilename] [url]" % sys.argv[0])
        sys.exit(1)
    return sys.argv[1], sys.argv[2], sys.argv[3]

def main():
    infilename, outfilename, url = parse_cmd_line_args()
    ip_dict = read_ips(infilename, url)
    write_ips(outfilename, ip_dict)

if __name__ == "__main__":
    main()
IMHO it would be helpful if also the original post was referenced.
IMHO you should leave the comments in place.
Instead of using a database (which might be a better solution in the long run) you can use a dictionary of dictionaries.
urls = {}
def increase_path_count(path_dict, path, ip_addr):
    # path_dict maps each path to its own {ip: count} dictionary
    if path not in path_dict:
        path_dict[path] = {}
    increase_count(path_dict[path], ip_addr)
You have to parse the actual contents of the logfile to get the path. This can be done with the regular expression module. A good regular expression to start with might be this:
r'GET (?P<path>/[\w.]+)'
Since you only have the paths in the logfile, you need to extract the path from the URL in the command line argument. This can be done with urlparse (the urlparse module in Python 2, urllib.parse in Python 3).
Edit 2
import re
# ....
def read_ips_and_paths(infilename, url):
    '''Read the IP addresses and paths from the file and store (count)
    them in a dictionary - returns the dictionary.'''
    res_dict = {}
    with open(infilename) as log_file:
        for line in log_file:
            if line.isspace():
                continue  # skip blank lines
            # Get the ip address for the log entry
            ip_addr = extract_ip(line)
            # Get the path from the log entry
            match = re.search(r'GET (?P<path>/[\w.]+)', line)
            if match is None:
                continue  # no GET request on this line
            path = match.group('path')
            increase_path_count(res_dict, path, ip_addr)
    return res_dict
Now when you want to get all IP addresses and counts for a specific path, you use urlparse to get the path part of the URL supplied from the command line:
from urllib.parse import urlparse  # Python 3; on Python 2 use: from urlparse import urlparse
# ....
url_path = urlparse(complete_url).path
Now you use the path to print the requested data:
for i in url_dict[url_path].items():
    print("ip address: %r - %d" % (i[0], i[1]))
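Put together, the dictionary-of-dictionaries idea might look like the sketch below. It is not the answer's exact code: it uses a nested defaultdict instead of the helper functions, and a looser path pattern (/\S*) so that paths with subdirectories such as /css/epic.css are captured whole. The function name count_ips_per_path and the sample addresses (from the 192.0.2.0/24 documentation range) are placeholders.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Looser than /[\w.]+ so that nested paths are captured completely.
PATH_RE = re.compile(r'"GET (?P<path>/\S*)')

def count_ips_per_path(lines):
    """Map each requested path to a dict of {ip: hit count}."""
    hits = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        match = PATH_RE.search(line)
        if match is None:
            continue  # not a GET request, or malformed
        hits[match.group("path")][line.split()[0]] += 1
    return hits

# Sample records with placeholder addresses.
sample = [
    '192.0.2.1 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -',
    '192.0.2.2 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -',
    '192.0.2.1 - - [06/Mar/2012:00:00:01 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -',
    '192.0.2.3 - - [06/Mar/2012:00:00:01 -0800] "GET /add.txt HTTP/1.1" 204 214 - -',
]
url_path = urlparse("http://www.epicbrowser.com/hrefadd.xml").path
per_path = count_ips_per_path(sample)
for ip, count in per_path[url_path].items():
    print(ip, count)
```

Building the whole table once is useful when several URLs are queried in one run; for a single URL, filtering lines while reading uses less memory.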
Your problem cries out for the use of a relational database.
Using a database will let you answer questions like "how many hits did I get from each IP?" with SQL queries such as SELECT ip, COUNT(ip) as hits FROM requests GROUP BY ip. The database will then take care of looping through the data and counting things.
Complete solution using an in-memory SQLite database given below. I have tested this and it works. 'logfile.txt' should be a file of precisely the format you gave in your example above.
Edit: Revised to work with imprecisely specified data format - the only requirements now are that each row must consist of at least seven whitespace-separated fields, of which the first field must be an IP in dotted quad format, and the seventh field must be a path starting with '/'.
(Note the use of defensive programming techniques - check that the data you're getting looks the way you expect it to look, and raise an error if the data is malformed. This prevents the bad data from causing your entire program to blow up later.)
import sqlite3, re

fh = open('logfile.txt','r')
db = sqlite3.connect(':memory:') #create temporary SQLite database in memory
db.execute("""
CREATE TABLE requests (
    ip TEXT,
    url TEXT
)
""")

for line in fh:
    line_split = line.split()
    if len(line_split) < 7:
        raise ValueError ("Not enough fields - need at least seven.")
    ip = line_split[0]
    url = line_split[6]
    # Check that the 'ip' variable really contains four sets of number separated by dots.
    if re.match(r'\d+\.\d+\.\d+\.\d+', ip) is None:
        errmsg = "The value %s found in the first column was not an IP address." % ip
        raise ValueError (errmsg)
    # check that the 'url' variable contains a string starting with /
    if not url.startswith("/"):
        errmsg = "The value %s found in the 7th column was not a path beginning with /" % url
        raise ValueError ( errmsg )
    #if len(line_split) != 12:
    #    print (line_split)
    #    raise ValueError("Malformatted line - must have 10 fields")
    db.execute("INSERT INTO requests VALUES (?,?)",(ip,url) )
db.commit() #save data

# print what's in the database
print("\nData in the database\n")
results = db.execute("SELECT * FROM requests")
for row in results:
    print(row)

# Count hits from each IP
print ("\nNumber of hits from each IP\n")
results = db.execute("""
    SELECT ip, COUNT(ip) AS hits
    FROM requests
    GROUP BY ip""")
for row in results:
    print(row)

# Count hits from each IP for the particular URL '/mysidebars/newtab.html'
target_url = '/mysidebars/newtab.html'
print("\nNumber of hits from each IP for url %s" % target_url)
results = db.execute("""
    SELECT ip, COUNT(ip) AS hits
    FROM requests
    WHERE url=?
    GROUP BY ip
    """, [target_url])
for row in results:
    print(row)
The output is:
Data in the database
(u'', u'/mysidebars/newtab.html')
(u'', u'/hrefadd.xml')
(u'', u'/dbupdates2.xml')
(u'', u'/mysidebars/newtab.html')
(u'', u'/hrefadd.xml')
(u'', u'/add.txt')
(u'', u'/mysidebars/newtab.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/welcome.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/hrefadd.xml')
(u'', u'/dbupdates2.xml')
(u'', u'/mysidebars/newtab.html')
(u'', u'/mysidebars/newtab.html')
(u'', u'/css/epic.css')
(u'', u'/add.txt')
(u'', u'/hrefadd.xml')
(u'', u'/mysidebars/newtab.html')
(u'', u'/js/flash_detect_min.js')
(u'', u'/images/home-page-bottom.jpg')
(u'', u'/images/Facebook_Like.png')
(u'', u'/images/Twitter_Follow.png')
(u'', u'/images/home-page-top.jpg')
(u'', u'/dbupdates2.xml')
(u'', u'/mysidebars/newtab.html')
(u'', u'/hrefadd.xml')
(u'', u'/mysidebars/newtab.html')
Number of hits from each IP
(u'', 1)
(u'', 5)
(u'', 1)
(u'', 3)
(u'', 7)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 2)
(u'', 2)
(u'', 2)
(u'', 2)
Number of hits from each IP for url /mysidebars/newtab.html
(u'', 1)
(u'', 3)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 1)
(u'', 2)
(u'', 1)
Sidenote: Your existing code is not a good solution to your problem (SQL is a much better way of dealing with "tabular" data.) But if you ever need to count occurrences of a repeated value for another purpose, collections.Counter from the standard library is easier to use and faster than your increase_count() function.
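As a rough illustration of that sidenote, the counting loop shrinks to a single expression with collections.Counter. The function name count_ips and the sample lines (with addresses from the 192.0.2.0/24 documentation range) are placeholders.

```python
from collections import Counter

def count_ips(lines):
    """Count how often each first column (the IP address) occurs."""
    return Counter(line.split()[0] for line in lines if line.strip())

# Sample lines with placeholder addresses; the blank line is skipped.
sample = [
    '192.0.2.1 - - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -\n',
    '\n',
    '192.0.2.2 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -\n',
    '192.0.2.1 - - [06/Mar/2012:00:00:01 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -\n',
]
counts = count_ips(sample)
for ip, hits in counts.most_common():
    print("%5d\t%s" % (hits, ip))
```

most_common() also returns the entries sorted by count, which the hand-written dictionary version would need an extra sort for.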