python parsing web access logs

python parsing web access logs - python

I'm trying to parse a web access logs using the following regex
pattern = re.compile(r"""(?x)^
(?P<remote_host>\S+) \s+ # host %h
\S+ \s+ # indent %l (unused)
(?P<remote_user>\S+) \s+ # user %u
\[(?P<time_received>.*?)\] \s+ # time %t
"(?P<request>.*?)" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<response_bytes_clf>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<user_agent>[^"]*)" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )?)? # optional argument (unused)
$""")
def get_structured_access_log(access_log):
return pattern.match(access_log).groupdict()
But some of the logs lines contains malicious requests as this ones:
190.2.7.178 - - [21/Dec/2011:05:47:03 +0000] "GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /gnu/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 271 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /phpgwapi/setup/tables_update.inc.php?appdir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 286 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:05 +0000] "GET /forum/install.php?phpbb_root_dir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 274 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/calendar.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/setup.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /inc/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /include/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 278 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:08 +0000] "GET /index.php?nic=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:11 +0000] "GET /index.php?sec=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"
these request failed to be parsed with the above regex, other normal web requests are parsed successfully.
here are some access logs that is successfully parsed:
123.125.71.79 - - [28/Apr/2012:08:12:57 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET / HTTP/1.1" 200 4399 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
110.75.173.193 - - [28/Apr/2012:11:57:26 +0100] "GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"
exception error message :
'NoneType' object has no attribute 'groupdict'
How can I fix the regex so it can parse these complex requests as well?

Using re.match will return a corresponding match object or will return None if the string does not match the pattern.
In the first example data, this is the last part containing an escaped double quote "<?php system(\"id\"); ?>"
If you use a negated character class matching not a double quote and want to assert the end of the string, then [^"]* will not get past the first double quote in (\"id
You could fix your pattern by replacing the negated character class to match not a double quote in this part "(?P<user_agent>[^"]*)" to match any character except a new line .*?
Your pattern might look like:
(?x)^
(?P<remote_host>\S+) \s+ # host %h
\S+ \s+ # indent %l (unused)
(?P<remote_user>\S+) \s+ # user %u
\[(?P<time_received>.*?)\] \s+ # time %t
"(?P<request>.*?)" \s+ # request "%r"
(?P<status>[0-9]+) \s+ # status %>s
(?P<response_bytes_clf>\S+) (?:\s+ # size %b (careful, can be '-')
"(?P<referrer>[^"?\s]*[^"]*)" \s+ # referrer "%{Referer}i"
"(?P<user_agent>.*? )" (?:\s+ # user agent "%{User-agent}i"
"[^"]*" )?)? # optional argument (unused)
$
Regex demo

Related

Why is a flask app.route causing a class method to loop?

I have a HTML Page with multiple divs containing links. When I click on the link, flask should run the class method and then render that same page again.
#app.route("/<drinkName>/")
def action(drinkName:str):
goodDrinks = initBartender(drink_list)
for drink in goodDrinks:
drink:MenuItem
if drink.name == drinkName:
Bartender().makeDrink(drink, drink.attributes['ingredients'])
return render_template('./index.html')
def makeDrink(self, drink:MenuItem, ingredients: dict[str, int]):
self.running = True
print('Making your drink...')
maxTime = 0
pumpThreads = []
for ing in ingredients.keys():
for pump in self.pump_configuration.keys():
if ing == self.pump_configuration[pump]['value']:
waitTime = ingredients[ing] * FLOW_RATE
if waitTime > maxTime:
maxTime = waitTime
pump_t = threading.Thread(target=self.pour, args=(self.pump_configuration[pump]['pin'], waitTime))
pumpThreads.append(pump_t)
print(f"Making {drink.name}!")
for thread in pumpThreads:
thread:threading.Thread
thread.start()
thread.join()
self.running = False
I tried swapping the return render template for a redirect to the base url, I tried indenting the render_template to directly below the class method. This is my console log.
127.0.0.1 - - [28/Mar/2022 09:46:07] "GET /Screwdriver HTTP/1.1" 308 -
Making your drink...
Making Screwdriver!
Making your drink...
Making Screwdriver!
127.0.0.1 - - [28/Mar/2022 09:50:07] "GET /Screwdriver/ HTTP/1.1" 200 -
127.0.0.1 - - [28/Mar/2022 09:50:08] "GET /static/styles.css HTTP/1.1" 304 -
127.0.0.1 - - [28/Mar/2022 09:50:08] "GET /static/drinkList.json HTTP/1.1" 304 -
As you can see, it runs the class method twice and takes four minutes to render index.html. I want to make it so that it runs the class method once, then (with as little delay as possible) render index.html

flask socketio send different messages to different clients at the same time

I have a flask-socketio server and several react socket.io-client namespaces connected to it.
Clients are in the same ip and port but in different namespaces like /finalResponse, /algoSignal etc.
Client app
componentDidMount() {
// Connecting to backend api for data
const { dispatch } = this.props;
socket = io(`http://${IP.clientOrderFeedFlaskIP}:6050/algoSignal`);
dispatch(loadClientOrdersDataSocket(socket));
var interval = setInterval(()=>{
socket.emit('getdata')
console.log(socket.connected)
},2000)
this.setState({interval:interval})
socket.emit('update')
socket.on("algo_signal_data", (res) => {
console.log(JSON.parse(res))
dispatch(ClientOrdersData(res));
});
}
Here it sends up an interval to ping the server with 'getdata' to request every 2 seconds. There is an event bucket to handle incoming data in 'algo_signal_data'.
'update' bucket is to get data on refresh.
Similar to this I have around 8-9 clients in different namespaces.
Backend server
#socket.on('getdata', namespace = '/algoSignal')
def algoSignal():
# global algosig
lastOrderUpdated = json.loads(pingConn.get('lastOrderUpdated'))
if lastOrderUpdated != '0':
print('---------------------sent algo signal data ---------------------')
algosig = SymphonyOrderRaw(mongoIp).algoSignal.to_json(orient = 'records')
emit('algo_signal_data', algosig, broadcast = True)
pingConn.set('lastOrderUpdated', json.dumps('0'))
else:
emit(json.dumps(200))
#socket.on('getdata', namespace='/finalResponse')
def getfinalResponse():
# global finalres
lastOrderUpdated = json.loads(pingConn.get('lastOrderUpdated'))
if lastOrderUpdated != '0':
finalres = FinalResponse(mongoIp).finalResponse.to_json(orient = 'records')
print('--------------------sent final Response data----------------')
# print(finalres)
emit('response_data', finalres, broadcast = True)
pingConn.set('lastOrderUpdated', json.dumps('0'))
else:
emit(json.dumps(200))
I have event buckets like these who recieve the 'getdata' prompt and check if the database was updated. If yes, it sends the data else it sends status code.
The expected result is that each client page should recieve data when db is updated.
The problem is that when I open multiple client pages in seperate tabs, The server only sends the data to one page. But when the other pages is refreshed, then data is updated in those also.
Update bucket
#socket.on('update', namespace = '/algoSignal')
def update2():
algosig = SymphonyOrderRaw(mongoIp).algoSignal.to_json(orient = 'records')
emit('algo_signal_data', algosig, broadcast = True)
This is the output at the server terminal
(flask-venv) mockAdmin#python-react:~/RMS-git/FlaskServerCodes/serverFlask$ python3 main.py
Starting
* Restarting with stat
Starting
* Debugger is active!
* Debugger PIN: 113-048-226
(59454) wsgi starting up on http://10.160.0.2:6050
(59454) accepted ('103.62.42.27', 65522)
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vTi HTTP/1.1" 200 401 0.000859
algo signal client connected
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vVA&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.000986
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vVh&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 249 0.000353
(59454) accepted ('103.62.42.27', 65523)
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vWP&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.001710
(59454) accepted ('103.62.42.27', 65524)
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vWQ&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000171
(59454) accepted ('103.62.42.27', 65525)
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vWZ HTTP/1.1" 200 401 0.000345
(59454) accepted ('103.62.42.27', 65526)
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vX6&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000255
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vX5&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 219 0.005536
final response client connected
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vXg&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.000881
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vXl&sid=65d328eeffb943a5a2ad80ee4216c856 HTTP/1.1" 200 235 0.000180
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vY4&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 252 0.000208
(59454) accepted ('103.62.42.27', 65527)
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vYk&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 260 0.000265
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vYj&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.002399
103.62.42.27 - - [31/Dec/2021 11:22:05] "GET /socket.io/?EIO=3&transport=polling&t=NuF2vZL&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 235 0.000189
103.62.42.27 - - [31/Dec/2021 11:22:05] "POST /socket.io/?EIO=3&transport=polling&t=NuF2vZM&sid=8e278beb8e5a4f77b5b203e28af9b596 HTTP/1.1" 200 219 0.005291
-----------------------------sent final Response data-----------------------
How can I send data to all the open client pages at the same time from a single server?

Import a .log file into a Jupyter notebook

I have some problems to import a .log file to my notebook. It works with just some text but then when I try to import the file will I get invalid syntax.
Works:
file_data = """
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET
/twiki/bin/edit/Main/ouble_bounce_sender?topicparent=Main.ConfigurationVariable
HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:19:03:58 -0800] "GET
/twiki/bin/edit/Main/Message_size_limit?topicparent=Main.ConfigurationVariable
HTTP/1.1" 401 12846
206-15-133-154.dialup.ziplink.net - - [11/Mar/2004:16:33:23 -0800] "HEAD
/twiki/bin/view/Main/SpamAssassinDeleting HTTP/1.1" 200 0"""
df = pd.read_csv(pd.compat.StringIO(file_data), names=[0, 'hour', 2, 3], sep=':', engine='python')
df['hour'].value_counts()
However when I try to import a .log file I will get a syntax error.
df = pd.read_csv('/Users/john/Desktop/data_log.log'), names=[0, 'hour', 2, 3], sep=':', engine='python')
How can I solve it so I can count the 10 most common hours in the .log file?

Handling web-log files as a CSV with Python

I'm using the Python 3 CSV Reader to read some web-log files into a namedtuple.
I have no control over the log file structures, and there are varying types.
The delimiter is a space ( ), the problem is that some log file formats place a space in the timestamp, as Logfile 2 below. The CSV reader then reads the date/time stamp as two fields.
Logfile 1
73 58 2993 [22/Jul/2016:06:51:06.299] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
Logfile 2
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
153 38 224 [22/Jul/2016:06:51:07 +0000] 2[2] "GET /test HTTP/1.1"
The log files typically have the timestamp within square quotes, but I cannot find a way of handling them as "quotes". On top of that, square brackets are not always used as quotes within the logs either (see the [2] later in the logs).
I've read through the Python 3 CSV Reader documentation, including about dialects, but there doesn't seem to be anything for handling enclosing square brackets.
How can I handle this situation automatically?

This will do, you need to use a regex in place of sep.
This for example will parse NGinx log files into a pandas.Dataframe:
import pandas as pd
df = pd.read_csv(log_file,
sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
engine='python',
usecols=[0, 3, 4, 5, 6, 7, 8],
names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
na_values='-',
header=None
)
Edit :
line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = '([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
import re
print re.match(regex, line).groups()
The output would be a tuple with 6 pieces of information
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')

Bruteforce! I concatenate the 2 time fields so the timestamp can be interpreted if needed.
data.csv =
13 58 224 [22/Jul/2016:06:51:06.399] 2[2] "GET /example HTTP/1.1"
13 58 224 [22/Jul/2016:06:51:06 +0000] 2[2] "GET /test HTTP/1.1"
import csv
with open("data.csv") as f:
c = csv.reader(f, delimiter=" ")
for row in c:
if len(row) == 7:
r = row[:3] + [row[3] + row[4]] + row[5:]
print(r)
else:
print(row)
['13', '58', '224', '[22/Jul/2016:06:51:06.399]', '2[2]', 'GET /example HTTP/1.1']
['13', '58', '224', '[22/Jul/2016:06:51:06+0000]', '2[2]', 'GET /test HTTP/1.1']

How to find how many times a particular ip is pinged to the url?

I have a python script which extracts unique IP addresses from a log file and displays their count of how many times those IPs are pinged the code is as follows.
import sys
def extract_ip(line):
return line.split()[0]
def increase_count(ip_dict, ip_addr):
if ip_addr in ip_dict:
ip_dict[ip_addr] += 1
else:
ip_dict[ip_addr] = 1
def read_ips(infilename):
res_dict = {}
log_file = file(infilename)
for line in log_file:
if line.isspace():
continue
ip_addr = extract_ip(line)
increase_count(res_dict, ip_addr)
return res_dict
def write_ips(outfilename, ip_dict):
out_file = file(outfilename, "w")
for ip_addr, count in ip_dict.iteritems():
out_file.write("%5d\t%s\n" % (count, ip_addr))
out_file.close()
def parse_cmd_line_args():
if len(sys.argv)!=3:
print("Usage: %s [infilename] [outfilename]" % sys.argv[0])
sys.exit(1)
return sys.argv[1], sys.argv[2]
def main():
infilename, outfilename = parse_cmd_line_args()
ip_dict = read_ips(infilename)
write_ips(outfilename, ip_dict)
if __name__ == "__main__":
main()
The log file is in the following format with 2L lines. These are the first 30 lines of the log file
220.227.40.118 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
220.227.40.118 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
59.95.13.217 - - [06/Mar/2012:00:00:00 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
111.92.9.222 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
120.56.236.46 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
49.138.106.21 - - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -
117.195.185.130 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
122.160.166.220 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /welcome.html HTTP/1.1" 204 212 - -
117.18.231.5 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.18.231.5 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
122.169.136.211 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
203.217.145.10 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.18.231.5 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
59.95.13.217 - - [06/Mar/2012:00:00:00 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
203.217.145.10 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.206.70.4 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /css/epic.css HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
117.206.70.4 - - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -
117.206.70.4 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
118.97.38.130 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /js/flash_detect_min.js HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /images/home-page-bottom.jpg HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /images/Facebook_Like.png HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /images/Twitter_Follow.png HTTP/1.1" 204 214 "http://www.epicbrowser.com/welcome.html" -
117.214.20.28 - - [06/Mar/2012:00:00:00 -0800] "GET /images/home-page-top.jpg HTTP/1.1" 304 0 "http://www.epicbrowser.com/welcome.html" -
49.138.106.21 - - [06/Mar/2012:00:00:01 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
117.18.231.5 - - [06/Mar/2012:00:00:01 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
117.18.231.5 - - [06/Mar/2012:00:00:01 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
120.61.182.186 - - [06/Mar/2012:00:00:01 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
the output of the file is in the following format
Number of Times IPS
158 111.92.9.222
11 58.97.187.231
30 212.57.209.41
5 119.235.51.66
3 122.168.134.106
5 180.234.220.75
13 115.252.223.243
Here the ip 111.92.9.222 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - - is pinged into epic 158 times totally.
Now i want to add a functionality to the code so that if i pass a particular URL, it should return how many times the URL was accessed by which IP addresses(IP address either from log file or from output file).
E.g. if I pass the url as input: http://www.epicbrowser.com/hrefadd.xml
the output should be in the following format
10.10.128.134 4
10.134.222.232 6

I assume your requirement that you want only IPs of one given URL is true. In this case you just have to add an additional filter to the program which filters out the unwanted lines. The structure of the program can be unchanged.
Because the log files do not know anything about hosts, you have to specify only the path part of the URL as the third parameter; example: "/hrefadd.xml"
#!/usr/bin/env python
#
# Counts the IP addresses of a log file.
#
# Assumption: the IP address is logged in the first column.
# Example line: 117.195.185.130 - - [06/Mar/2012:00:00:00 -0800] \
# "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
#
import sys
def urlcheck(line, url):
'''Checks if the url is part of the log line.'''
lsplit = line.split()
if len(lsplit)<7:
return False
return url==lsplit[6]
def extract_ip(line):
'''Extracts the IP address from the line.
Currently it is assumed, that the IP address is logged in
the first column and the columns are space separated.'''
return line.split()[0]
def increase_count(ip_dict, ip_addr):
'''Increases the count of the IP address.
If an IP address is not in the given dictionary,
it is initially created and the count is set to 1.'''
if ip_addr in ip_dict:
ip_dict[ip_addr] += 1
else:
ip_dict[ip_addr] = 1
def read_ips(infilename, url):
'''Read the IP addresses from the file and store (count)
them in a dictionary - returns the dictionary.'''
res_dict = {}
log_file = file(infilename)
for line in log_file:
if line.isspace():
continue
if not urlcheck(line, url):
continue
ip_addr = extract_ip(line)
increase_count(res_dict, ip_addr)
return res_dict
def write_ips(outfilename, ip_dict):
'''Write out the count and the IP addresses.'''
out_file = file(outfilename, "w")
for ip_addr, count in ip_dict.iteritems():
out_file.write("%s\t%5d\n" % (ip_addr, count))
out_file.close()
def parse_cmd_line_args():
'''Return the in and out file name.
If there are more or less than two parameters,
an error is logged in the program is exited.'''
if len(sys.argv)!=4:
print("Usage: %s [infilename] [outfilename] [url]" % sys.argv[0])
sys.exit(1)
return sys.argv[1], sys.argv[2], sys.argv[3]
def main():
infilename, outfilename, url = parse_cmd_line_args()
ip_dict = read_ips(infilename, url)
write_ips(outfilename, ip_dict)
if __name__ == "__main__":
main()
IMHO it would be helpful if also the original post was referenced.
IMHO you should leave the comments in place.

Instead of using a database (which might be a better solution in the long run) you can use a dictionary of dictionaries.
urls = {}
def increase_path_count(dict, path, ip_addr):
if path not in dict:
dict[path] = {}
increase_count(dict[path], ip_addr)
Edit
You have to parse the actual contents of the logfile to get the path. This can be done with the regular expression module. A good regular expression to start with might be this:
'GET (?P<path>/[\w.]+)'
Since you only have the paths in the logfile, you need to extract the path from the URL in the command line argument. This can be done with the urlparse module.
Edit 2
import re
# ....
def read_ips_and_paths(infilename, url):
'''Read the IP addresses and paths from the file and store (count)
them in a dictionary - returns the dictionary.'''
res_dict = {}
log_file = file(infilename)
for line in log_file:
if line.isspace():
continue
# Get the ip address for the log entry
ip_addr = extract_ip(line)
# Get the path from the log entry
match = re.search('GET (?P<path>/[\w.]+)', line);
path = match.group('path')
increase_path_count(res_dict, path, ip_addr)
return res_dict
Now when you want to get all IP addresses and counts for a specific path, you use urlparse to get the path part of the URL supplied from the command line:
from urlparse import urlparse
# ....
url_path = urlparse(complete_url).path
Not you use the path to print the requested data:
for i in url_dict[url_path].items():
print "ip address: %r - %d" % (i[0], i[1])

Your problem cries out for the use of a relational database.
Using a database will let you construct queries like "how many hits did I get from each URL?" as SQL queries like SELECT ip, COUNT(ip) as hits FROM requests GROUP BY ip. The database will then take care of looping through the data and counting things.
Complete solution using an in-memory SQLite database given below. I have tested this and it works. 'logfile.txt' should be a file of precisely the format you gave in your example above.
Edit: Revised to work with imprecisely specified data format - the only requirements now are that each row must consist of at least seven whitespace-separated fields, of which the first field must be an IP in dotted quad format, and the seventh field must be a path starting with '/'.
(Note the use of defensive programming techniques - check that the data you're getting looks the way you expect it to look, and raise an error if the data is malformed. This prevents the bad data from causing your entire program to blow up later.)
import os, sqlite3, re
fh = open('logfile.txt','r')
db = sqlite3.connect(':memory:') #create temporary SQLite database in memory
db.execute("""
CREATE TABLE requests (
ip TEXT,
url TEXT
)
""")
for line in fh:
line_split = line.split()
if len(line_split) < 7:
raise ValueError ("Not enough fields - need at least seven.")
ip = line_split[0]
url = line_split[6]
# Check that the 'ip' variable really contains four sets of number separated by dots.
if (re.match(r'\d+\.\d+\.\d+\.\d+', ip) == None):
errmsg = "The value %s found in the first column was not an IP address." % ip
raise ValueError (errmsg)
# check that the 'url' variable contains a string starting with /
if (url.startswith("/") == False):
errmsg = "The value %s found in the 7th column was not a path beginning with /" % url
raise ValueError ( errmsg )
#if len(line_split) != 12:
# print (line_split)
# raise ValueError("Malformatted line - must have 10 fields")
db.execute("INSERT INTO requests VALUES (?,?)",(ip,url) )
db.commit() #save data
# print what's in the database
print("\nData in the database\n")
results = db.execute("SELECT * FROM requests")
for row in results:
print row
# Count hits from each IP
print ("\nNumber of hits from each IP\n")
results = db.execute("""
SELECT ip, COUNT(ip) AS hits
FROM requests
GROUP BY ip""")
for row in results:
print(row)
# Count hits from each IP for the particular URL '/mysidebars/newtab.html'
print("\nNumber of hits from each IP for url %s" % url)
target_url = '/mysidebars/newtab.html'
results = db.execute("""
SELECT ip, COUNT(ip) AS hits
FROM requests
WHERE url=?
GROUP BY ip
""", [target_url])
for row in results:
print(row)
The output is:
Data in the database
(u'220.227.40.118', u'/mysidebars/newtab.html')
(u'220.227.40.118', u'/hrefadd.xml')
(u'59.95.13.217', u'/dbupdates2.xml')
(u'111.92.9.222', u'/mysidebars/newtab.html')
(u'120.56.236.46', u'/hrefadd.xml')
(u'49.138.106.21', u'/add.txt')
(u'117.195.185.130', u'/mysidebars/newtab.html')
(u'122.160.166.220', u'/mysidebars/newtab.html')
(u'117.214.20.28', u'/welcome.html')
(u'117.18.231.5', u'/mysidebars/newtab.html')
(u'117.18.231.5', u'/mysidebars/newtab.html')
(u'122.169.136.211', u'/mysidebars/newtab.html')
(u'203.217.145.10', u'/mysidebars/newtab.html')
(u'117.18.231.5', u'/hrefadd.xml')
(u'59.95.13.217', u'/dbupdates2.xml')
(u'203.217.145.10', u'/mysidebars/newtab.html')
(u'117.206.70.4', u'/mysidebars/newtab.html')
(u'117.214.20.28', u'/css/epic.css')
(u'117.206.70.4', u'/add.txt')
(u'117.206.70.4', u'/hrefadd.xml')
(u'118.97.38.130', u'/mysidebars/newtab.html')
(u'117.214.20.28', u'/js/flash_detect_min.js')
(u'117.214.20.28', u'/images/home-page-bottom.jpg')
(u'117.214.20.28', u'/images/Facebook_Like.png')
(u'117.214.20.28', u'/images/Twitter_Follow.png')
(u'117.214.20.28', u'/images/home-page-top.jpg')
(u'49.138.106.21', u'/dbupdates2.xml')
(u'117.18.231.5', u'/mysidebars/newtab.html')
(u'117.18.231.5', u'/hrefadd.xml')
(u'120.61.182.186', u'/mysidebars/newtab.html')
Number of hits from each IP
(u'111.92.9.222', 1)
(u'117.18.231.5', 5)
(u'117.195.185.130', 1)
(u'117.206.70.4', 3)
(u'117.214.20.28', 7)
(u'118.97.38.130', 1)
(u'120.56.236.46', 1)
(u'120.61.182.186', 1)
(u'122.160.166.220', 1)
(u'122.169.136.211', 1)
(u'203.217.145.10', 2)
(u'220.227.40.118', 2)
(u'49.138.106.21', 2)
(u'59.95.13.217', 2)
Number of hits from each IP for url /mysidebars/newtab.html
(u'111.92.9.222', 1)
(u'117.18.231.5', 3)
(u'117.195.185.130', 1)
(u'117.206.70.4', 1)
(u'118.97.38.130', 1)
(u'120.61.182.186', 1)
(u'122.160.166.220', 1)
(u'122.169.136.211', 1)
(u'203.217.145.10', 2)
(u'220.227.40.118', 1)
Sidenote: Your existing code is not a good solution to your problem (SQL is a much better way of dealing with "tabular" data.) But if you ever need to count occurrences of a repeated value for another purpose, collections.Counter from the standard library is easier to use and faster than your increase_count() function.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

python parsing web access logs - python

Related

Why is a flask app.route causing a class method to loop?

flask socketio send different messages to different clients at the same time

Import a .log file into a Jupyter notebook

Handling web-log files as a CSV with Python

How to find how many times a particular ip is pinged to the url?

Categories

Resources