I need a tool that reads the last 10 minutes of entries in my log file and prints some text if certain words were logged.
log file:
23.07.2014 09:22:11 INFO Logging.LogEvent 0 Failed login test#test.com
23.07.2014 09:29:02 INFO Logging.LogEvent 0 login test#test.com
23.07.2014 09:31:55 INFO Logging.LogEvent 0 login test#test.com
23.07.2014 09:44:14 INFO Logging.LogEvent 0 Failed login test#test.com
If during the last 10 minutes some entry contains "Failed", print ALARM.
All I did so far is find the 'Failed' match, but I have no idea how to check the last 10 minutes of my log file. Any ideas?
from sys import argv
from datetime import datetime, timedelta

with open('log_test.log', 'r') as f:
    for line in f:
        try:
            e = line.index("Failed")
        except ValueError:
            pass
        else:
            print(line)
Your format %d.%m.%Y is worse than a year-first format such as %Y.%m.%d, which sorts lexicographically and can therefore be used in string comparison.
We also do not know whether the log is big and whether it is sorted. If it is not sorted (which is common for multithreaded applications), you will have to analyze each line and convert it into a datetime:
import datetime

def get_dt_from_line(s):
    # the timestamp is the first 19 characters, e.g. '23.07.2014 09:22:11'
    return datetime.datetime.strptime(s[:19], '%d.%m.%Y %H:%M:%S')
Then use it as a filter (for small files, assuming TXT holds the whole file contents):
MAX_CHECK_TIMEDELTA = datetime.timedelta(minutes=10)
LOG_START_ANALYZE_DATETIME = datetime.datetime.today() - MAX_CHECK_TIMEDELTA

lines = [s for s in TXT.split('\n')
         if 'Failed' in s and get_dt_from_line(s) >= LOG_START_ANALYZE_DATETIME]
print('\n'.join(lines))
For big files you can read the file line by line instead.
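A minimal sketch of that (reusing get_dt_from_line from above; it assumes every line starts with a timestamp):

with open('log_test.log') as f:
    for line in f:  # one line at a time, so memory use stays flat
        if 'Failed' in line and get_dt_from_line(line) >= LOG_START_ANALYZE_DATETIME:
            print('ALARM:', line.strip())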
If your log file is just for one day you can use string comparison instead of datetime comparison:
LOG_START_ANALYZE_DATETIME = (datetime.datetime.today()
                              - datetime.timedelta(minutes=10)).strftime('%d.%m.%Y %H:%M:%S')
lines = [s for s in TXT.split('\n')
         if 'Failed' in s and s >= LOG_START_ANALYZE_DATETIME]
If I were you, I would look through the file line by line, take the timestamp of the first line, and then iterate until the difference between the first date and the current one is more than 10 minutes, counting occurrences of the word "Failed" along the way.
I think you'll sort something out by splitting each line on spaces. But be careful: if your log format ever changes, your script is likely to stop working too. A rough sketch of the idea follows.
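Something like this (my own illustration, not tested against your exact format; it compares each line's timestamp against the current time and parses the first two space-separated fields):

from datetime import datetime, timedelta

cutoff = datetime.now() - timedelta(minutes=10)
failed = 0

with open('log_test.log') as f:
    for line in f:
        # the first two whitespace-separated fields are the date and the time
        date_part, time_part = line.split()[:2]
        stamp = datetime.strptime(date_part + ' ' + time_part, '%d.%m.%Y %H:%M:%S')
        if stamp >= cutoff and 'Failed' in line:
            failed += 1

if failed:
    print('ALARM')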
I'm trying to parse dates from a text file, but executing the script throws an "incorrect data format" error even though the format is correct.
The file is a .txt file with the following structure:
2018/02/15 05:00:13 - somestring - anotherstring
2018/02/15 05:00:14 - somestring - anotherstring
2018/02/15 05:00:15 - somestring - anotherstring
... etc
The script splits the file into lines, and each line into fields, one of which is a date and time. I split the date and the time into two separate fields; the time gets converted fine, so the problem is in the date.
This is what I get on execution:
ValueError: time data "b'2018/02/15" does not match format '%Y/%m/%d'
I noticed it prints the string with a "b" in front of it, which, if I'm not mistaken, means it's a bytes literal. I've tried using decode("utf-8") on it, but it throws an exception saying "string" has no method decode.
# the file is in one long string as I get it from a 'cat' bash command via ssh
file = str(stdout.read())  # reads the cat output into a long string
strings = file.split("\\n")  # splits the string into lines

for string in strings:
    fields = string.split(" - ")
    if len(fields) >= 3:
        #dates.append(datetime.strptime(fields[0], "%Y/%m/%d %H:%M:%S")) # Wrong format
        datentime = fields[0].split()
        dates.append(datetime.strptime(datentime[0], "%Y/%m/%d"))  # Wrong format
        print(datentime[1])
        dates.append(datetime.strptime(datentime[1], "%H:%M:%S"))  # WORKS
I can't figure out why that is happening with the code you gave, so I can't offer a fix for it, but I tried testing and this worked for me:
datetime.strptime(str(datentime[0])[2:-1], "%Y/%m/%d")
It removes the b and the surrounding quotes from the string. If you still have problems with that, please post how you got that string; maybe there was some error along the way.
Use try and except:
import datetime

def convertDate(d):
    strptime = datetime.datetime.strptime
    try:
        return strptime(d, "%Y/%m/%d")
    except TypeError:
        return strptime(d.decode("utf-8"), "%Y/%m/%d")

print(convertDate(b'2018/02/15'))
print(convertDate('2018/02/15'))
I have three files: two .gz files and one .log file. These files are pretty big. Below is a sample of my original data. I want to extract the entries that correspond to the last 24 hours.
a.log.1.gz
2018/03/25-00:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/25-10:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/25-20:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/25-24:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/26-00:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/26-10:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/26-15:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log.2.gz
2018/03/26-20:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/26-24:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
2018/03/27-00:08:50.486601 1.5M 7FE9D3D41706 qojfcmqcacaeia
2018/03/27-10:08:50.980519 16K 7FE9BD1AF707 user: number is 93823004
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
a.log
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
**Desired Result**
result.txt
2018/03/27-20:08:50.981908 1389 7FE9BDC2B707 user 7fb31ecfa700
2018/03/27-24:08:51.066967 0 7FE9BDC91700 Exit Status = 0x0
2018/03/28-00:08:51.066968 1 7FE9BDC91700 std:ZMD:
2018/03/28-10:08:48.638553 508 7FF4A8F3D704 snononsonfvnosnovoosr
2018/03/28-20:08:48.985053 346K 7FE9D2D51706 ahelooa afoaona woom
I am not sure how to get the entries that cover the last 24 hours. I then want to run the function below on those last 24 hours of data:
def _clean_logs(line):
    # noinspection SpellCheckingInspection
    lemmatizer = WordNetLemmatizer()
    clean_line = line.strip()
    clean_line = clean_line.lstrip('0123456789.- ')
    cleaned_log = " ".join(
        [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(clean_line) if
         word not in Stopwords.ENGLISH_STOP_WORDS and 2 < len(word) <= 30 and not word.startswith('_')])
    cleaned_log = cleaned_log.replace('"', ' ')
    return cleaned_log
Something like this should work.
from datetime import datetime, timedelta
import glob
import gzip
from pathlib import Path
import shutil

def open_file(path):
    if Path(path).suffix == '.gz':
        return gzip.open(path, mode='rt', encoding='utf-8')
    else:
        return open(path, encoding='utf-8')

def parsed_entries(lines):
    for line in lines:
        # yield the timestamp (the first field) together with the full line,
        # so matching lines can be written out unmodified
        yield line.split(' ', maxsplit=1)[0], line

def earlier():
    return (datetime.now() - timedelta(hours=24)).strftime('%Y/%m/%d-%H:%M:%S')

def get_files():
    return ['a.log'] + list(reversed(sorted(glob.glob('a.log.*'))))
output = open('output.log', 'w', encoding='utf-8')

files = get_files()
cutoff = earlier()

for i, path in enumerate(files):
    with open_file(path) as f:
        lines = parsed_entries(f)
        # Assumes that your files are not empty
        date, line = next(lines)
        if cutoff <= date:
            # Skip files that can just be appended to the output later
            continue
        for date, line in lines:
            if cutoff <= date:
                # We've reached the first entry of our file that should be
                # included
                output.write(line)
                break
        # Copies from the current position to the end of the file
        shutil.copyfileobj(f, output)
        break
else:
    # In case ALL the files are within the last 24 hours
    i = len(files)

for path in reversed(files[:i]):
    with open_file(path) as f:
        # Assumes that your files have trailing newlines.
        shutil.copyfileobj(f, output)

# Cleanup, it would get closed anyway when garbage collected or process exits.
output.close()
Then if we make some test log files:
#!/bin/sh
echo "2019/01/15-00:00:00.000000 hi" > a.log.1
echo "2019/01/31-00:00:00.000000 hi2" > a.log.2
echo "2019/01/31-19:00:00.000000 hi3" > a.log
gzip a.log.1 a.log.2
and run our script, it outputs the expected result (for this point in time):
2019/01/31-00:00:00.000000 hi2
2019/01/31-19:00:00.000000 hi3
Working with log files often involves pretty large amounts of data, so reading in ascending order and reading everything every time is undesirable since it wastes a lot of resources.
The fastest way to accomplish your goal that immediately came to my mind (better approaches certainly exist) is a very simple random search: we search through the logfile(s) in reverse order, beginning with the newest lines. Instead of visiting every line, you arbitrarily choose some stepsize and only look at a few lines each stepsize. This way, you can search through gigabytes of data in a very short time.
Additionally, this approach does not require storing every line of a file in memory, only some lines and the final result.
When a.log is the current log file, we begin searching here:
with open("a.log", "rb+") as fh:
Since we are only interested in the last 24 hours, we first jump to the file's end and save the timestamp to search for:
timestamp = datetime.datetime.now() - datetime.timedelta(days=1) # last 24h
# jump to logfile's end
fh.seek(0, 2) # <-- '2': search relative to file's end
index = fh.tell() # current position in file; here: logfile's *last* byte
Now we can begin our random search. Your lines appear to be about 65 characters long on average, hence we move by a multiple of that.
average_line_length = 65
stepsize = 1000

found = False
while not found:
    # we move one step back; note that `index` is an absolute position,
    # so we seek from the file's start (a negative target is clamped to 0)
    index = max(0, index - average_line_length * stepsize)
    fh.seek(index)
    # we try to read a "line" (multiply avg. line length times a number
    # large enough to cover even large lines. Ignore the largest lines here,
    # since this is an edge case ruining our runtime. We rather skip
    # one iteration of the loop then)
    r = fh.read(average_line_length * 10)
    # our result now contains (on average) multiple lines, so we
    # split first
    lines = r.split(b"\n")
    # now we check for our timestring
    for l in lines:
        # your timestamps are formatted like '2018/03/28-20:08:48.985053'
        # I ignore minutes, seconds, ... here, just for the sake of simplicity
        timestr = l.split(b":")  # this gives us b'2018/03/28-20' in timestr[0]
        try:
            # next we convert this to a datetime
            # (strptime wants str, not bytes, hence the decode)
            found_time = datetime.datetime.strptime(timestr[0].decode(), "%Y/%m/%d-%H")
        except (ValueError, UnicodeDecodeError):
            # most likely a partial line at a read boundary; skip it
            continue
        # finally, we check whether the found time is outside our 24-hour margin
        if found_time < timestamp:
            found = True
            break
    if index == 0:
        # we reached the file's beginning without leaving the time window
        break
With this code we only end up inspecting a few lines per stepsize (here: 1000 lines) as long as we are inside our last 24 hours. Once we have left the 24 hours, we know that, at most, we went exactly stepsize * average_line_length bytes too far up in the file.
Filtering this "went too far" becomes very easy then:
# rewind to our last probe position so no lines in between are lost,
# then read the file's contents from there to the end
fh.seek(index)
contents = fh.read()

# split into lines
lines_of_contents = contents.split(b"\n")

# helper function for removing all lines older than 24 hours
def check_line(line):
    if not line:
        # e.g. the empty string following a trailing newline
        return False
    # split to extract the date string
    tstr = line.split(b":")
    # convert this to a datetime (again decoding bytes to str for strptime)
    ftime = datetime.datetime.strptime(tstr[0].decode(), "%Y/%m/%d-%H")
    return ftime > timestamp

# remove all lines that are older than 24 hours
final_result = list(filter(check_line, lines_of_contents))
Since contents covers all of the remaining contents of our file (and lines_of_contents all of its lines, which is simply contents split at the linebreaks \n), we can easily use filter to get our desired result.
Each line in lines_of_contents is fed to check_line, which returns True if the line's time is newer than timestamp, where timestamp is our datetime object describing exactly now - 1 day. This means that check_line returns False for all lines older than timestamp, and filter removes those lines.
Obviously, this is far from optimal, but it is easy to understand and easily extendable to filtering on minutes, seconds, and so on.
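For instance, a full-precision variant of check_line might look like this (a sketch; it assumes the timestamp is always the first space-separated field of the line):

def check_line_precise(line):
    if not line:
        return False
    # take the whole timestamp, e.g. b'2018/03/28-20:08:48.985053'
    stamp = line.split(b" ", 1)[0].decode()
    ftime = datetime.datetime.strptime(stamp, "%Y/%m/%d-%H:%M:%S.%f")
    return ftime > timestamp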
Additionally, covering multiple files is also easy: you just need glob.glob to find all candidate files, start with the newest file, and add another loop: you search through the files until the while loop fails for the first time, then break and read all remaining contents of the current file plus all contents of the files visited before.
Roughly, something like this:
final_lines = []

for file in logfiles:
    # our while-loop
    while True:
        ...
    # if the while-loop did not break, all of the current logfile's content
    # is <24 hours of age
    with open(file, "rb") as fh:
        final_lines.extend(fh.readlines())
This way you simply store all lines of a logfile if all of its lines are less than 24 hours old. If the loop breaks at some point, i.e. we have found a logfile containing a line older than 24 hours, extend final_lines by final_result, since that covers only the lines less than 24 hours old.
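To make that concrete, here is a simplified, self-contained sketch of the multi-file orchestration (it drops the random-search optimization for clarity, assumes uncompressed files ordered newest first, and reuses the timestamp layout from above):

import datetime

def line_time(line):
    # parse the leading 'YYYY/mm/dd-HH' portion of a bytes log line
    return datetime.datetime.strptime(line.split(b":")[0].decode(), "%Y/%m/%d-%H")

def collect_recent(logfiles, timestamp):
    # `logfiles` must be ordered newest first
    final_lines = []
    for path in logfiles:
        with open(path, "rb") as fh:
            lines = [l for l in fh.read().split(b"\n") if l]
        recent = [l for l in lines if line_time(l) > timestamp]
        final_lines = recent + final_lines  # older files go in front
        if len(recent) < len(lines):
            # this file already contains lines older than the cutoff,
            # so every older file is entirely out of range: stop here
            break
    return final_lines

cutoff = datetime.datetime.now() - datetime.timedelta(days=1)
for line in collect_recent(["a.log", "a.log.2", "a.log.1"], cutoff):
    print(line.decode())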
I am making a script in Python to read the last 5 minutes of a log file. This is my code so far:
from datetime import datetime, timedelta

now = datetime.now()
before = timedelta(minutes=5)
now = now.replace(microsecond=0)
before = (now - before)
now = (now.strftime("%b %d %X"))
before = (before.strftime("%b %d %X"))
print(before)
print(now)

with open('user.log', 'r') as f:
    for line in f:
        if before in line:
            break
    for line in f:
        if now in line:
            break
        print(line.strip())
The output is "Sep 03 11:47:25 Sep 03 11:52:25", which is just the prints to check that the times are correct. Nearly 100 lines in the log contain those times, but it prints nothing. If I take the ifs out, it prints all the lines, which proves the problem is in the ifs... Any ideas?
Here is an example of my log file content:
Sep 03 10:18:47 bni..........teagagfaesa.....
Sep 03 10:18:48 bni..........teagagfaesa.....2
I managed to find a Python even older than yours.
#!/usr/bin/env python
from __future__ import with_statement
from datetime import datetime, timedelta

before = timedelta(minutes=5)
now = datetime.now().replace(microsecond=0, year=1900)
before = (now - before)

with open('user.log', 'r') as f:
    for line in f:
        if datetime.strptime(line[0:15], '%b %d %X') < before:
            continue
        print line.strip()
The change compared to your code is that we convert each timestamp from the file into a datetime object; then we can trivially compare these properly machine-readable representations the way you'd expect. (Without parsing the dates it can't work, except by chance: "Sep" comes after "Aug", but "Sep" also comes after "Oct", so it seems to work if you run it in a suitable month and then breaks the next month!)
The year=1900 hack is there because strptime() defaults to year 1900 for inputs that don't include a year.
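You can see that default year directly in an interactive session (with the usual C locale, where %X means HH:MM:SS):

>>> from datetime import datetime
>>> datetime.strptime('Sep 03 10:18:47', '%b %d %X')
datetime.datetime(1900, 9, 3, 10, 18, 47)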
I am a beginner at Python and trying to solve the following:
I have a text file where each line starts like this:
<18:12:53.972>
<18:12:53.975>
<18:12:53.975>
<18:12:53.975>
<18:12:54.008>
etc
Instead of the above, I would like to prepend the elapsed time in seconds to each line, but only if the line starts with '<'.
<0.0><18:12:53.972>
<0.003><18:12:53.975>
<0.003><18:12:53.975>
<0.003><18:12:53.975>
<0.036><18:12:54.008>
etc
Here comes a try :-)
#import datetime
from datetime import timedelta
from sys import argv

#get filename as argument
run, input, output = argv

#get number of lines for textfile
nr_of_lines = sum(1 for line in open(input))

#read in file
f = open(input)
lines = f.readlines()
f.close

#declarations
do_once = True
time = []
delta_to_list = []
i = 0

#read in and translate all timevalues from logfile to delta time.
while i < nr_of_lines:
    i += 1
    if lines[i-1].startswith('<'):
        get_lines = lines[i-1] #get one line
        get_time = (get_lines[1:13]) #get the time from that line
        h = int(get_time[0:2])
        m = int(get_time[3:5])
        s = int(get_time[6:8])
        ms = int(get_time[9:13])
        time = timedelta(hours = h, minutes = m, seconds = s, microseconds = 0, milliseconds = ms)
        sec_time = time.seconds + (ms/1000)
        if do_once:
            start_value = sec_time
            do_once = False
        delta = float("{0:.3f}".format(sec_time - start_value))
        delta_to_list.append(delta)

#write back values to logfile.
k=0
s = str(delta_to_list[k])
with open(output, 'w') as out_file:
    with open(input, 'r') as in_file:
        for line in in_file:
            if line.startswith('<'):
                s = str(delta_to_list[k])
                out_file.write("<" + s + ">" + line)
            else:
                out_file.write(line)
            k += 1
As it is now it mostly works, but the last two lines are not written to the new file; it says: s = str(delta_to_list[k]) IndexError: list index out of range.
First I would like to get my code working, and second I'd like suggestions for improvements. Thank you!
First point: never read a full file into memory when you don't have to (and especially when you don't know whether you have enough free memory).
Second point: learn to use Python's for loop and iteration protocol. The way to iterate over a list or any other iterable is:
for item in some_iterable:
do_something_with(item)
This avoids messing with indexes and getting it wrong ;)
One of the nice things with Python file objects is that they actually are iterables, so to iterate over a file lines, the simplest way is:
for line in my_opened_file:
do_something_with(line)
Here's a simple yet working and mostly pythonic (nb: python 2.7.x) way to write your program:
# -*- coding: utf-8 -*-
import os
import sys
import datetime
import re
import tempfile

def totime(timestr):
    """ returns a datetime object for a "HH:MM:SS" string """
    # we actually need datetime objects for subtraction
    # so let's use the first available bogus date
    # notes:
    # `timestr.split(":")` will return a list `["HH", "MM", "SS"]`
    # `map(int, ...)` will apply `int()` on each item
    # of the sequence (second argument) and return
    # the resulting list, ie
    # `map(int, ["01", "02", "03"])` => `[1, 2, 3]`
    return datetime.datetime(1900, 1, 1, *map(int, timestr.split(":")))

def process(instream, outstream):
    # some may consider that regexps are not that pythonic
    # but as far as I'm concerned it seems like a sensible
    # use case.
    time_re = re.compile(r"^<(?P<time>\d{2}:\d{2}:\d{2})\.")
    first = None
    # iterate over our input stream lines
    for line in instream:
        # should we handle this line at all ?
        # (nb a bit redundant but faster than re.match)
        if not line.startswith("<"):
            continue
        # looks like a candidate, let's try and
        # extract the 'time' value from it
        match = time_re.search(line)
        if not match:
            # starts with '<' BUT not followed by 'HH:MM:SS.' ?
            # unexpected from the sample source but well, we
            # can't do much about it either
            continue
        # retrieve the captured "time" (HH:MM:SS) part
        current = totime(match.group("time"))
        # store the first occurrence so we can
        # compute the elapsed time
        if first is None:
            first = current
        # `(current - first)` yields a `timedelta` object
        # we now just have to retrieve its `seconds` attribute
        seconds = (current - first).seconds
        # inject the seconds before the line
        # and write the whole thing to our output stream
        newline = "<{}>{}".format(seconds, line)
        outstream.write(newline)

def usage(err=None):
    if err:
        print >> sys.stderr, err
    print >> sys.stderr, "usage: python retime.py <filename>"
    # unix standards process exit codes
    return 2 if err else 0

def main(*args):
    # our entry point...
    # gets the source filename, process it
    # (storing the results in a temporary file),
    # and if everything's ok replace the source file
    # by the temporary file.
    try:
        sourcename = args[0]
    except IndexError:
        return usage("missing <filename> argument")
    # `delete=False` prevents the tmp file from being
    # deleted on closing.
    dest = tempfile.NamedTemporaryFile(delete=False)
    with open(sourcename) as source:
        try:
            process(source, dest)
        except Exception:
            dest.close()
            os.remove(dest.name)
            raise
    # ok done
    dest.close()
    os.rename(dest.name, sourcename)
    return 0

if __name__ == "__main__":
    # only execute main() if we are called as a script
    # (so we can also import this file as a module)
    sys.exit(main(*sys.argv[1:]))
It gives the expected results on your sample data (running on Linux, but it should be OK on any other supported OS AFAICT).
Note that I wrote it to work like your original code (replacing the source file with the processed one), but if it were my code I would instead either explicitly provide a destination filename or write to sys.stdout by default (and redirect stdout to another file). The process function can deal with any of those solutions FWIW; it's only a matter of a couple of edits in main().
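For illustration, a stdout-writing variant of main() could be as small as this (a sketch, keeping the same usage() helper):

def main(*args):
    try:
        sourcename = args[0]
    except IndexError:
        return usage("missing <filename> argument")
    with open(sourcename) as source:
        # redirect in the shell to keep the original file untouched:
        #   python retime.py input.log > output.log
        process(source, sys.stdout)
    return 0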
I am writing a script in Python that searches for strings and supposedly does different things when it encounters them.
import re, datetime
from datetime import *

f = open(raw_input('Name of file to search: '))
strToSearch = ''
for line in f:
    strToSearch += line

patFinder = re.compile('\d{2}\/\d{2}\/\d{4}\sA\d{3}\sB\d{3}')
findPat1 = re.findall(patFinder, strToSearch)

# search only dates
datFinder = re.compile('\d{2}\/\d{2}\/\d{4}')
findDat = re.findall(datFinder, strToSearch)

nowDate = date.today()

fileLst = open('cels.txt', 'w')
ntrdLst = open('not_ready.txt', 'w')

for i in findPat1:
    for Date in findDat:
        Date = datetime.strptime(Date, '%d/%m/%Y')
        Date = Date.date()
        endDate = Date + timedelta(days=731)
    if endDate < nowDate:
        fileLst.write(i)
    else:
        ntrdLst.write(i)

f.close()
fileLst.close()
ntrdLst.close()
toClose = raw_input('File was modified, press enter to close: ')
So basically it searches for strings with dates and numbers, then the same list but dates only, converts the dates, adds two years to each, and compares: if the date surpasses today's date, the entry goes to ntrdLst; if not, to fileLst.
My problem is that it writes the same list item (i) multiple times and doesn't do the sorting.
I am fairly new to Python and programming, so I am asking for your help. Thanks in advance.
Edit:
The normal output was this (without the date handling and the if statement):
27/01/2009 A448 B448
22/10/2001 A434 B434
06/09/2007 A825 B825
06/09/2007 A434 B434
06/05/2010 A826 B826
What I would like is: if a date is after date.today(), say 27/01/2016, write the entry to the other file. What I keep getting is the script printing the list 30 times, or not taking the if statement into account.
(Sorry, the if was indeed indented inside the last loop; I went wrong while putting it in here.)
You're computing endDate in a loop, once for each date... but not doing anything with it in the loop. So, after the loop is over, you have the very last endDate, and you use only that one to decide which file to write to.
I'm not sure what your logic is supposed to be, but I'm pretty sure you want to put the if statement with the writes inside the inner loop.
If you do that, then if you have, say, 100 pattern matches and 25 dates, you'll end up writing 2500 strings--some to one file, some to the other. Is that what you wanted?
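If what you actually want is for each matched line to be judged by its own date, a sketch along these lines (my illustration; 'input.txt' and the file handling are assumptions) would do it:

import re
from datetime import datetime, date, timedelta

pat = re.compile(r'(\d{2}/\d{2}/\d{4})\s(A\d{3}\sB\d{3})')
today = date.today()

with open('input.txt') as f, \
     open('cels.txt', 'w') as fileLst, \
     open('not_ready.txt', 'w') as ntrdLst:
    for line in f:
        m = pat.search(line)
        if not m:
            continue
        # each match carries its own date: parse it and add two years
        d = datetime.strptime(m.group(1), '%d/%m/%Y').date()
        if d + timedelta(days=731) < today:
            fileLst.write(m.group() + '\n')
        else:
            ntrdLst.write(m.group() + '\n')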
SOLVED
I gave it a little (A LOT) of thought and put it all together in one piece. I knew there were too many for loops, but now I've got it. Thanks anyway to everyone who reached out a helping hand. I leave the code here for anyone with a similar problem.
nowDate = date.today()

for line in sourceFile:
    s = re.compile('(\d{2}\/\d{2}\/\d{4})\s(C\d{3}\sS\d{3})')
    s1 = re.search(s, line)
    if s1:
        date = s1.group(1)
        date = datetime.strptime(date, '%d/%m/%Y')
        date = date.date()
        endDate = date + timedelta(days=731)
        if endDate <= nowDate:
            fileLst.write(s1.group())
            fileLst.write('\n')
        else:
            print ('not ready: ', date.strftime('%d-%m-%Y'))
            ntrdLst.write(s1.group(1))
            ntrdLst.write('\n')