Reading a datetime from a log file using regex in Python

I have a log file which has text that looks like this.
Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)
There are two time formats in the file. I need to sort this log file based on the date time format enclosed in [].
This is the regex I am trying to use. But it does not return anything.
t_pat = re.compile(r".*\[\d+/\D+/.*\]")
I want to go over each line in file, be able to apply this pattern and sort the lines based on the date & time.
Can someone help me on this? Thanks!

You have a space in there that needs to be added to the regular expression
text = "Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]^Iat com/avc/abc/magr/service/find.something(abc/1235/locator/abc;Ljava/lang/String;)Labc/abc/abcd/abcd;(bytecode:7)"
matches = re.findall(r"\[\s*(\d+/\D+/.*?)\]", text)
print(matches)
['1/Jul/2013 03:27:12.818']
Next, parse the time using time.strptime:
http://docs.python.org/2/library/time.html#time.strptime
Finally use this as a key into a dict, and the line as the value, and sort these entries based on the key.
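The three steps (extract, parse, sort) can be sketched as follows; the two sample lines are shortened, hypothetical variants of the question's log format:

```python
import re
from datetime import datetime

lines = [
    "Jul 1 03:29:01 syslog: [m_java][ 1/Jul/2013 03:29:01.001][j:[SessionThread <]",
    "Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]",
]

t_pat = re.compile(r"\[\s*(\d+/\D+/.*?)\]")

# Key each line by its parsed timestamp, then emit the lines in time order.
by_time = {}
for line in lines:
    stamp = datetime.strptime(t_pat.search(line).group(1), "%d/%b/%Y %H:%M:%S.%f")
    by_time[stamp] = line

sorted_lines = [by_time[k] for k in sorted(by_time)]
```

(If two lines can carry the same timestamp, a dict will silently keep only one of them; sorting the lines directly with a key function is then the safer approach.)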

You are not matching the initial space; you also want to group the date for easy extraction, and make the \D and .* patterns non-greedy:
t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
Demo:
>>> re.compile(r".*\[\s?(\d+/\D+?/.*?)\]").search(line).group(1)
'1/Jul/2013 03:27:12.818'
You can narrow down the pattern some more; you only need to match 3 letters for the month for example:
t_pat = re.compile(r".*\[\s?(\d{1,2}/[A-Z][a-z]{2}/\d{4} \d{2}:\d{2}:[\d.]{2,})\]")
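As a quick check, the tighter pattern still extracts the timestamp from the question's sample line (abbreviated here):

```python
import re

line = "Jul 1 03:27:12 syslog: [m_java][ 1/Jul/2013 03:27:12.818][j:[SessionThread <]"
t_pat = re.compile(r".*\[\s?(\d{1,2}/[A-Z][a-z]{2}/\d{4} \d{2}:\d{2}:[\d.]{2,})\]")
print(t_pat.search(line).group(1))  # 1/Jul/2013 03:27:12.818
```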

Read all the lines of the file, then use the sort function, passing in a key function that parses out the date and uses it for sorting:
import re
import datetime

def parse_date_from_log_line(line):
    t_pat = re.compile(r".*\[\s?(\d+/\D+?/.*?)\]")
    date_string = t_pat.search(line).group(1)
    format = '%d/%b/%Y %H:%M:%S.%f'
    return datetime.datetime.strptime(date_string, format)

log_path = 'mylog.txt'
with open(log_path) as log_file:
    lines = log_file.readlines()
lines.sort(key=parse_date_from_log_line)

Related

Load files that have name patterns and clean data using Python [duplicate]

I am trying to find all file names in a folder which follows this pattern: 'index_YYYYMMDD.csv'. The 'YYYYMMDD' part represents the date of the data file. Some of the files names are listed below:
'index_20091101.csv',
'index_20091102.csv',
'index_20091103.csv',
'index_20091104.csv',
'index_20091105.csv',
'index_20091106.csv',
'index_20091107.csv',
'index_20091108.csv',
Given a startDate and endDate, I would like to find all file names, the date part of which is between the startDate and endDate. For example, for the above file list, if the startDate=20091104 and endDate=20091107, the file names I would like to find should be:
'index_20091104.csv',
'index_20091105.csv',
'index_20091106.csv',
'index_20091107.csv'
I've tried the os.listdir function, which gives me all the file names. To filter out the unwanted files, I think I need to use a regular expression, but could not work it out.
Could anyone help me with this? Thanks!
import glob
glob.glob('index_[0-9]*.csv')
This will match filenames where the part after index_ starts with a digit.
John's solution matches exactly 8 digits.
I would take the following approach. You can define a simple file filter factory.
import time

def make_time_filter(start, end, time_format, file_format='index_{time_format:}.csv'):
    t_start = time.strptime(start, time_format)
    t_end = time.strptime(end, time_format)
    ft_fmt = file_format.format(time_format=time_format)
    def filt(fname):
        try:
            return t_start <= time.strptime(fname, ft_fmt) <= t_end
        except ValueError:
            return False
    return filt
Now you can simply make a predicate to select the date range you want:
time_filt = make_time_filter('20091101', '20091201', '%Y%m%d')
Then pass this to filter
filter(time_filt, os.listdir(your_dir))
Or put it in a comprehension of some sort:
(fname for fname in os.listdir(your_dir) if time_filt(fname))
A regex will be more general, but you don't need one in your case since your file names all follow a simple pattern which you know must contain a date. For more on the time module see the docs.
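The factory above works because struct_time values compare chronologically (they are tuples of year, month, day, ...), so strptime results can be range-checked directly; a minimal illustration, with the filename format taken from the question:

```python
import time

# struct_time compares field by field (year, month, day, ...),
# which orders timestamps chronologically.
a = time.strptime('index_20091104.csv', 'index_%Y%m%d.csv')
b = time.strptime('index_20091107.csv', 'index_%Y%m%d.csv')
start = time.strptime('20091101', '%Y%m%d')
end = time.strptime('20091201', '%Y%m%d')
in_range = start <= a <= end and start <= b <= end
```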
If you want to match exactly 8 digits with glob you need to write them all out like this
import glob
glob.glob('index_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9].csv')
Help on function glob in module glob:
glob(pathname)
Return a list of paths matching a pathname pattern.
The pattern may contain simple shell-style wildcards a la
fnmatch. However, unlike fnmatch, filenames starting with a
dot are special cases that are not matched by '*' and '?'
patterns.
If you want real regex, use os.listdir and filter the result
[x for x in os.listdir('.') if re.match(r'index_[0-9]*\.csv', x)]
This will get you where you want to be and allows you to provide start and end dates:
import os
import re
import datetime

start_date = datetime.datetime.strptime('20071102', '%Y%m%d')
end_date = datetime.datetime.strptime('20071103', '%Y%m%d')
files = os.listdir('.')
files_in_range = []
for fl in files:
    if re.match(r'index_\d+\.csv', fl):
        date = re.match(r'index_(\d+)\.csv', fl).group(1)
        date = datetime.datetime.strptime(date, '%Y%m%d')
        if start_date <= date <= end_date:
            files_in_range.append(fl)
print(files_in_range)

Python's regex .match() function works inconsistently with multiline strings

I'm writing a script in python that takes a directory containing journal entries in the form of markdown files and processes each file in order to create an object from it. These objects are appended into a list of journal entry objects. The object contains 3 fields: title, date and body.
In order to create this list of entry objects, I loop over each file in a directory and append to the list the return value of a function called entry_create_object, which takes the file text as an input.
import os

def load_entries(directory):
    entries = []
    for filename in os.listdir(directory):
        filepath = os.path.join(directory, filename)
        with open(filepath, 'r') as f:
            text = f.read()
        entry_object = entry_create_object(text)
        if entry_object: entries.append(entry_object)
        else: print(f"Couldn't read {filepath}")
    return entries
In order to create the object, I use regular expressions to find the information I need for the title and date fields. The body is just the file contents. The function returns None if it doesn't match title and date. The following code is what I use for this:
def entry_create_object(ugly_entry):
    title = re.match('^# (.*)', ugly_entry)
    date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
    body = ugly_entry
    if not (title and date and body):
        return
    entry_object = {}
    entry_object['title'], entry_object['date'], entry_object['body'] = title, date, body
    return entry_object
For some reason I can't understand, my regular expression for dates works for some files but not for others, even though I've been able to successfully match what I wanted by testing my regex pattern in an online regular-expression webapp such as Regexr. The title regex pattern works fine for all files.
I've found in my testing that re.match is very inconsistent with multiline strings overall, but I haven't been able to find a way of fixing it.
I can't see anything wrong with my pattern.
Example of a file that successfully matches both title and date:
# Time tracker
Created at: Oct 21, 2020 4:16 PM
Date: Oct 21, 2020
[...]
Example of a file that fails to match the date:
# Bad habits
Created at: Dec 6, 2020 4:24 PM
Date: Dec 6, 2020
[...]
Thank you for your time.
Let's decode the regex.
date = re.match('(Date:|Created at:) (\w{3}) (\d{2}), (\d{4})', ugly_entry)
That's three letters, followed by a space, followed by 2 digits, followed by comma space, followed by 4 digits. Given that description, can you see why the following string does not match?
Created at: Dec 6, 2020 4:24 PM
I shouldn't spoil the surprise, but you want (\d{1,2}),
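A quick check of the fix; note that re.search is used here, which (unlike re.match) is not anchored to the very start of the string:

```python
import re

# \d{1,2} accepts one- or two-digit days; the original \d{2} rejected "Dec 6".
date_pat = re.compile(r'(Date:|Created at:) (\w{3}) (\d{1,2}), (\d{4})')

two_digit = date_pat.search("Created at: Oct 21, 2020 4:16 PM")
one_digit = date_pat.search("Created at: Dec 6, 2020 4:24 PM")
```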

Counting date occurrences in python?

I'm currently trying to count the number of times a date occurs within a chat log for example the file I'm reading from may look something like this:
*username* (mm/dd/yyyy hh:mm:ss): *message here*
However, I need to split the date from the time, as I currently treat them as one. I'm struggling to solve my problem, so any help is appreciated. Below is some sample code that I'm using to try to get the date count working. I'm currently using a Counter, but I'm wondering if there are other ways to count dates.
filename = tkFileDialog.askopenfile(filetypes=(("Text files", "*.txt"),))
mtxtr = filename.read()
date = []
number = []
occurences = Counter(date)
mtxtformat = mtxtr.split("\r\n")
print('The Dates in the chat are as follows')
print("--------------------------------------------")
for mtxtf in mtxtformat:
    participant = mtxtf.split("(")[0]
    date1 = mtxtf.split("(")[-1]
    message = date1.split(")")[0]
    date.append(date1.strip())
for item in date:
    if item not in number:
        number.append(item)
for item in number:
    occurences = date.count(item)
    print("Date Occurences " + " is: " + str(occurences))
The easiest way would be to use a regex and count occurrences of the date pattern you have in the log file. It would be faster, too.
If you know the date and time are going to be enclosed in parentheses at the start of the message (i.e. no parentheses (...): will be seen before the one containing the date and time):
*username* (mm/dd/yyyy hh:mm:ss): *message here*
Then you can extract based on the parens:
import re
...
parens = re.compile(r'\((.+)\)')
for mtxtf in mtxtformat:
    match = parens.search(mtxtf)
    date.append(match.group(1).split(' ')[0])
...
Note: If the message itself contains parens, this may match more than just the needed (mm/dd/yyyy hh:mm:ss). Doing match.group(1).split(' ')[0] would still give you the information you are looking for assuming there is no information enclosed in parens before your date-time information (for the current line).
Note2: Ideally enclose this in a try-except to continue on to the next line if the current line doesn't contain useful information.
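Putting the pieces together, the extracted dates can be tallied with collections.Counter; the sample lines below are made-up stand-ins for the question's format:

```python
import re
from collections import Counter

log = [
    "alice (01/02/2020 10:00:00): hi there",
    "bob (01/02/2020 10:01:30): hello",
    "alice (02/02/2020 09:15:00): morning",
]

parens = re.compile(r'\((.+?)\)')  # non-greedy, in case a message contains parens

dates = []
for line in log:
    match = parens.search(line)
    if match:  # skip lines with no (date time) group
        dates.append(match.group(1).split(' ')[0])

occurrences = Counter(dates)
for date, count in occurrences.items():
    print("Date %s occurs %d time(s)" % (date, count))
```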

How to reformat a timestamp in Python

I have some text below that needs to be handled. The timestamp is currently listed and then the value. The format of the timestamp is yyyymmdd, and I want to be able to alter it to yyyy-mm-dd or some other variation: yyyy/mm/dd, etc. I can't seem to find a string method that inserts characters into a string, so I'm unsure of the best way to go about this. I'm looking for efficiency here, and for general advice on slicing and dicing text in Python. Thanks in advance!
19800101,0.76
19800102,0.00
19800103,0.51
19800104,0.00
19800105,1.52
19800106,2.54
19800107,0.00
19800108,0.00
19800109,0.00
19800110,0.76
19800111,0.25
19800112,0.00
19800113,6.10
19800114,0.00
19800115,0.00
19800116,2.03
19800117,0.00
19800118,0.00
19800119,0.25
19800120,0.25
19800121,0.00
19800122,0.00
19800123,0.00
19800124,0.00
19800125,0.00
19800126,0.00
19800127,0.00
19800128,0.00
19800129,0.00
19800130,7.11
19800131,0.25
19800201,.510
19800202,0.00
19800203,0.00
19800204,0.00
I'd do something like this:
#!/usr/bin/env python
from datetime import datetime

with open("stuff.txt", "r") as f:
    for line in f:
        # Remove initial or ending whitespace (like line endings)
        line = line.strip()
        # Split the timestamp and value
        raw_timestamp, value = line.split(",")
        # Make the timestamp an actual datetime object
        timestamp = datetime.strptime(raw_timestamp, "%Y%m%d")
        # Print the timestamp separated by -'s. Replace - with / or whatever.
        print("%s,%s" % (timestamp.strftime("%Y-%m-%d"), value))
This lets you import or print the timestamp using any format allowed by strftime.
general advice on slicing and dicing text in python
The slice operator:
s = '19800101,0.76'
print('{0}-{1}-{2}'.format(s[:4], s[4:6], s[6:]))
Read: strings (look for the part on slices), and string formatting.
Strings are not mutable so inserting characters into strings won't work. Try this:
date = '19800131'
print('-'.join([date[:4], date[4:6], date[6:]]))