Matching multiple possibilities with regex in Python

I am trying to process a log file using Python and extract the date, time and log message of each entry and store it in a list of dicts. I am using the re.search() and group() methods for this purpose.
The problem is that the date/time can take various formats, such as:
dd/mm/yy, hh:mm AM - logs
dd/mm/yyyy, hh:mm a.m. - logs
dd/mm/yy HH:mm - logs
My program looks something like this:
import re
infile=open('logfile.txt', 'r')
loglist=[]
logdict={}
for aline in infile.readlines():
    line=re.search(r'^(\d?\d/\d?\d/\d\d), (\d?\d:\d?\d \w\w) - (.*?)',aline)
    if line:
        logdict['date'] = line.group(1)
        logdict['time'] = line.group(2)
        logdict['logmsg'] = line.group(3)
        loglist.append(logdict)
However, this matches only the first of the above-mentioned formats.
How can I match the other formats as well and also maintain the groups? Or is there an easier method of doing this?

You can use {m,n} after a pattern to indicate that there can be between m and n repetitions. So use \d{1,2} to indicate 1 or 2 digits. And you can use an alternation to indicate multiple possibilities, e.g. \d{2}|\d{4} for 2- or 4-digit years.
So the regexp can be:
^(\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? (\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - (.*)
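As a minimal sketch of how this pattern could slot into the loop from the question, the version below uses named groups, (?P<name>...), so that groupdict() builds each entry directly. The sample lines and the parse_log_lines helper name are made up for illustration.

```python
import re

# Same pattern as above, with named groups instead of numbered ones.
LOG_LINE = re.compile(
    r'^(?P<date>\d{1,2}/\d{1,2}/(?:\d{2}|\d{4})),? '
    r'(?P<time>\d{1,2}:\d{1,2}(?: [AaPp]\.?[Mm]\.?)?) - '
    r'(?P<logmsg>.*)'
)


def parse_log_lines(lines):
    """Return one dict per line that matches; other lines are skipped."""
    entries = []
    for line in lines:
        match = LOG_LINE.search(line.rstrip('\n'))
        if match:
            # groupdict() creates a fresh dict for every entry.
            entries.append(match.groupdict())
    return entries


# Hypothetical sample lines covering the three formats in the question.
sample = [
    "12/03/19, 9:05 AM - first message\n",
    "12/03/2019, 9:06 a.m. - second message\n",
    "12/03/19 21:07 - third message\n",
    "not a log line\n",
]
loglist = parse_log_lines(sample)
for entry in loglist:
    print(entry)
```

Building a new dict per line also avoids appending the same dict object to the list over and over.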

I would first extract the data with a regex and then validate it separately, rather than using one regex for both validation and extraction.
For clarity, I would also give each sub-pattern a name and make sure each one returns a single atom, such as a time, a date or an AM/PM marker, and then string them together to form the full pattern.
Note: I have not assigned names to the groups. Python supports this with the (?P<name>...) syntax, but the code here uses group numbers.
In the end you get your date_time and can split it, e.g. date_time.split("/"), which returns the day, month and year for you to validate or use.
import re
log_records = ["10/10/1960, 10:50 AM - logs",
               "5/15/2001, 23:11 a.m. - logs",
               "50/100/1069 300:100 - logs"]
parsed_records = []
date_month_year_ptrn = r"((\d+/){2,2}\d+)"
time_ptrn = r"(\d+:\d+)"
morning_evening_ptrn = r"((\w+\.?)+)?"
everything_else_ptrn = r"(.*)"
# Raw string, so that \s reaches the regex engine unchanged.
log_record_ptrn = r"^{date_ptrn},?\s+{time_ptrn}\s+{morn_even_ptrn}\s*-\s+{log_msg}$"
log_record_ptrn = log_record_ptrn.format(date_ptrn=date_month_year_ptrn,
                                         time_ptrn=time_ptrn,
                                         morn_even_ptrn=morning_evening_ptrn,
                                         log_msg=everything_else_ptrn)
def extract_log_record_from_match(log_record_match):
    if log_record_match:
        # Groups 2 and 5 are the inner repeated groups, so they are skipped.
        # Named groups, (?P<name>...), would avoid counting parentheses.
        date_time = log_record_match.group(1)
        time_ = log_record_match.group(3)
        am_pm = log_record_match.group(4)
        log_message = log_record_match.group(6)
        return date_time, time_, am_pm, log_message
    return None
def print_records(records):
    for record in records:
        if record:
            print(record)
for log_record in log_records:
    log_record_match = re.search(log_record_ptrn, log_record, re.IGNORECASE)
    parsed_records.append(extract_log_record_from_match(log_record_match))
print_records(parsed_records)
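To illustrate the "extract first, validate afterwards" idea, here is a small sketch that hands the extracted strings to datetime.strptime. The validate_date_time helper and the list of accepted formats are assumptions made for illustration (day-first dates, as in the question, and an English locale for %p); they are not part of the answer above.

```python
from datetime import datetime

# Assumed formats: day/month/year with an optional 12-hour marker.
CANDIDATE_FORMATS = [
    "%d/%m/%Y %I:%M %p",   # 10/10/1960 10:50 AM
    "%d/%m/%y %I:%M %p",   # 10/10/60 10:50 AM
    "%d/%m/%Y %H:%M",      # 10/10/1960 22:50
    "%d/%m/%y %H:%M",      # 10/10/60 22:50
]


def validate_date_time(date, time_, am_pm=None):
    """Return a datetime if the extracted parts form a real date, else None."""
    text = f"{date} {time_}"
    if am_pm:
        # Normalise 'a.m.' / 'p.m.' to 'AM' / 'PM' so that %p can parse it.
        text += " " + am_pm.replace(".", "").upper()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


print(validate_date_time("10/10/1960", "10:50", "AM"))
print(validate_date_time("5/15/2001", "23:11", "a.m."))   # month 15 is invalid
print(validate_date_time("50/100/1069", "300:100"))       # nothing here is valid
```

Keeping the regex permissive and letting strptime reject impossible values means the pattern does not have to encode calendar rules.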

Related

Reading from file and formatting into two dimensional array

I'm attempting to read data from multiple text files and move the data into a two-dimensional array. The data needs to remain in a specific order.
Could regex assist with this?
If you have any insight on how to improve this section of the code please let me know.
The datetime module provides (almost) everything date-related:
from datetime import datetime
date = "Sat 30-Mar-1996 7:40 PM"
fmt = "%a %d-%b-%Y %I:%M %p"
a = datetime.strptime(date, fmt)
print(a.year)  # prints 1996
You can parse the date-time string very easily by splitting it into its components and using iterable unpacking, e.g.,
def parse_date(d):
    day_of_week, date, hhmm, ampm = d.split()
    day_of_month, month, year = date.split('-')
    hour, minute = hhmm.split(':')
    # Convert to a 24-hour value: 12 AM becomes 0 and 12 PM stays 12.
    hour = str(int(hour) % 12 + (12 if ampm == 'PM' else 0))
    return (year, month, day_of_month, hour, minute, day_of_week)
and later, in the body of the loop
year, month, dom, h, minute, dow = parse_date(fields[-1].strip())
or, if you are interested only in the year
year, *_ = parse_date(fields[-1].strip())
You're probably looking for regular expressions (regex for short), which are a very powerful way to analyze and extract data from strings. For an intro into them, I'd check out this site or the Python docs. In your case, I think something like r'\| ([a-zA-Z]*) ([0-9]*)-([a-zA-Z]*)-([0-9]*) ([0-9:]* [a-zA-Z]*) \|' would work; the pipes are escaped because an unescaped | means alternation. A more specific description of the time format is necessary for a 100% correct regex.
To use regex in Python, you want the re module. First, compile the pattern with matcher = re.compile(your_regex_string_here). Then find the match with result = matcher.search(file_contents); use matcher.match() instead if the pattern must match at the very start of the string. (You could also just do result = re.search(regex_string, file_contents).) Whatever your regex, anything surrounded by parentheses is known as a "capturing group", which can be extracted from the result with result.group(): result.group(0) returns the full match, and result.group(n) returns the contents of the nth capturing group, that is, the nth set of parentheses. In the above example, result.group(4) would return the year, and groups 1-5 give the day of the week, day, month, year, and time respectively.
The datetime module, as mentioned in another answer, is also a great tool.
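A short sketch of the compile / search / group workflow described above. The pipe-delimited sample line is an assumption, since the real file layout is not shown in the question; the pipes in the pattern are escaped so that they match literal | characters.

```python
import re

# Escaped pipes (\|) match literal '|' characters; unescaped, '|' means "or".
FIELD = re.compile(
    r'\| ([a-zA-Z]+) ([0-9]+)-([a-zA-Z]+)-([0-9]+) ([0-9:]+ [a-zA-Z]+) \|'
)

# Hypothetical pipe-delimited line; the real file layout is not shown.
line = "some field | Sat 30-Mar-1996 7:40 PM | another field"

# search() scans the whole line; match() would only try position 0.
result = FIELD.search(line)
if result:
    day_of_week, day, month, year, time_ = result.groups()
    print(result.group(0))  # the whole matched text, pipes included
    print(year)             # same value as result.group(4)
```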

Extract date from file name with import re in python

My file name looks as follows:
show_data_paris_12112019.xlsx
I want to extract the date only and I have tried this script:
date = os.path.basename(xls)
pattern = r'(?<=_)_*(?=\.xlsx)'
re.search(pattern, date).group(0)
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime (event_date, '%Y%m%d')
but it gives me errors. How can I fix this?
Thank you.
It looks to me like the regex you're using is also at fault: it finds no match, so re.search() returns None and calling group(0) on it fails.
Assuming all your dates are stored as digits, the following regex seems to work quite well.
(?!.+_)\d+(?=\.xlsx)
The next issue is the format string passed to strptime. To me, 12112019 looks like 12/11/2019 (day, month, year), although it could also be month, day, year. Either way, the format has to list the fields in the order they appear in the file name.
So for the day / month / year format we would use
# %d%m%Y
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
And we would simply swap %d and %m for the month / day / year format. So your complete code would look something like this:
import os
import re
from datetime import datetime
date = os.path.basename(xls)
pattern = r"(?!.+_)\d+(?=\.xlsx)"
event_date = re.search(pattern, date).group(0)
event_date_obj = datetime.strptime(event_date, '%d%m%Y')
For further information on the format codes, which strptime shares with strftime, see https://strftime.org/.
_* matches a sequence of zero or more underscore characters.
(?<=_) means that it has to be preceded by an underscore.
(?=\.xlsx) means that it has to be followed by .xlsx.
So this will match the ___ in foo____.xlsx. But it doesn't match your filename, because the date digits sit between the underscore and .xlsx, and _* cannot match digits.
You should match \d+ rather than _* between the lookbehind and lookahead.
pattern = r'(?<=_)\d+(?=\.xlsx)'
And if the data is always 8 digits, use \d{8} to be more specific.
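Putting the pieces together, a minimal sketch might look like the following. The date_from_filename helper, its default day-month-year format and the /data directory are assumptions for illustration; the None check avoids calling group() when nothing matches.

```python
import os
import re
from datetime import datetime

# Exactly eight digits between an underscore and the .xlsx extension.
DATE_IN_NAME = re.compile(r'(?<=_)\d{8}(?=\.xlsx$)')


def date_from_filename(path, fmt='%d%m%Y'):
    """Return the date embedded in the file name, or None if there is none."""
    match = DATE_IN_NAME.search(os.path.basename(path))
    if match is None:
        return None  # avoids calling .group() on None
    return datetime.strptime(match.group(0), fmt)


# The directory here is hypothetical; only the base name matters.
print(date_from_filename('/data/show_data_paris_12112019.xlsx'))
print(date_from_filename('/data/readme.txt'))
```

Passing fmt='%m%d%Y' would read the same digits as month, day, year instead.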

How to detect dash or underscore in datetime string to use in strptime?

I have several thousand files which feature datetime in their file name.
Sadly, the divider between the datetime blocks is not always the same.
Example:
Data_trul-100A1-Berlin_2019-01-31_150430.dat
Data_tral-2000B2-Frankf-2018_02_27-190200.dat
Data_bash-300003_Hambrg_2017-04-12_210500.dat
I managed to find the datetime part in the string with a regular expression
import re
strings = ['Data_trul-100A1-Berlin_2019-01-31_150430.dat',
           'Data_tral-2000B2-Frankf-2018_02_27-190200.dat',
           'Data_bash-300003_Hambrg_2017-04-12_210500.dat']
for part_string in strings:
    match = re.search(r'\d{4}[-_]\d{2}[-_]\d{2}[-_]\d{6}', part_string)
    print(match.group())
However, now I am stuck converting the group to a datetime
from datetime import datetime
date = datetime.strptime(match.group(), "%Y-%m-%d_%H%M%S")
because I need to specify dashes or underscores.
I came up with the following solution of just replacing the dashes, but that feels like cheating.
for part_string in strings:
    part_string = part_string.replace('-', "_")
    match = re.search(r'\d{4}_\d{2}_\d{2}_\d{6}', part_string)
    date = datetime.strptime(match.group(), "%Y_%m_%d_%H%M%S")
    print(date)
Is there a more elegant way? Using regex to find the divider and pass it on to strptime?
You could change your regular expression to capture 4 separate elements
match = re.search(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{6})', part_string)
Then combine them into one standard string format
fixedstring = "{}_{}_{}_{}".format(*match.groups())
date = datetime.strptime(fixedstring, "%Y_%m_%d_%H%M%S")
Of course, at this point you could just split the HHMMSS part of the time into its own elements and build the datetime object directly,
m = re.search(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{2})(\d{2})(\d{2})', part_string)
# Groups are numbered from 1 (group 0 is the whole match) and hold strings.
date = datetime(year=int(m.group(1)),
                month=int(m.group(2)),
                day=int(m.group(3)),
                hour=int(m.group(4)),
                minute=int(m.group(5)),
                second=int(m.group(6)))
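A self-contained sketch of the direct-construction approach, run against the three file names from the question. The stamp_from_name helper name is hypothetical; the captured strings are converted to int before being passed to datetime.

```python
import re
from datetime import datetime

STAMP = re.compile(r'(\d{4})[-_](\d{2})[-_](\d{2})[-_](\d{2})(\d{2})(\d{2})')


def stamp_from_name(name):
    """Return the datetime embedded in a file name, or None if none is found."""
    m = STAMP.search(name)
    if m is None:
        return None
    # groups() yields six strings: year, month, day, hour, minute, second.
    return datetime(*map(int, m.groups()))


names = ['Data_trul-100A1-Berlin_2019-01-31_150430.dat',
         'Data_tral-2000B2-Frankf-2018_02_27-190200.dat',
         'Data_bash-300003_Hambrg_2017-04-12_210500.dat']
for name in names:
    print(stamp_from_name(name))
```

Because the separators are never captured, it does not matter whether a given name uses dashes or underscores.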

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
weekday_rx = re.compile(<EXPRESSION FOR TYPES LIKE AED=FRI>)
date_rx = re.compile(<EXPRESSION FOR TYPES LIKE AED=20160816>)
lines = infile.read().splitlines()
weekdays = [weekday_rx.match(line) for line in lines]
dates = [date_rx.match(line) for line in lines]
r'\S*\d$'
matches a run of non-whitespace characters that ends in a digit at the end of the line, so it will match
AED=20180823
r'\S*[a-zA-Z]$'
matches a run of non-whitespace characters that ends in a letter, so it will match
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both kinds of line:
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurrences you expect will be safer. The (a|b) construct means "match a or match b".
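As a sketch of the "explicit counts" suggestion, the two patterns can be tightened and used to sort lines into the two lists the question asks for. The exact shapes (eight-digit dates, three-letter day codes separated by colons) are assumptions read off the sample lines.

```python
import re

# Stricter variants of the two patterns: explicit counts instead of '*'.
DATE_RX = re.compile(r'^[A-Z]{3}=\d{8}$')
WEEKDAY_RX = re.compile(r'^[A-Z]{3}=[A-Z]{3}(?::[A-Z]{3})*$')

# Sample lines taken from the question.
lines = [
    "[HEADER: BUSINESS DATE=20160831]",
    "AED=FRI",
    "AFN=FRI:SAT",
    "AED=20180823",
    "AMD=20150914",
]

weekdays = [line for line in lines if WEEKDAY_RX.match(line)]
dates = [line for line in lines if DATE_RX.match(line)]
print(weekdays)
print(dates)
```

The header line falls into neither list, so it is discarded without a third pattern.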
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print(m.group(1))
print(m.group(2))  # only the first day: [A-Z]+ stops at the colon
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print(m.group(1))
print(m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parentheses to group results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
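For illustration, here is one way named groups could be applied to the lines from the question. The group names (code, date, days) and the classify helper are hypothetical, and the pattern assumes eight-digit dates and colon-separated three-letter day codes.

```python
import re

# One pattern, three named groups; only one of 'date' / 'days' can match.
ENTRY = re.compile(
    r'^(?P<code>[A-Z]{3})='
    r'(?:(?P<date>\d{8})|(?P<days>[A-Z]{3}(?::[A-Z]{3})*))$'
)


def classify(line):
    """Return (code, kind, value), or None for headers and other lines."""
    m = ENTRY.match(line)
    if m is None:
        return None
    kind = m.lastgroup  # name of the last capturing group that matched
    return m.group('code'), kind, m.group(kind)


for line in ["AED=FRI", "AFN=FRI:SAT", "AMD=20150921",
             "[HEADER: BUSINESS DATE=20160831]"]:
    print(classify(line))
```

Reading m.lastgroup tells the two line types apart without checking each group for None.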

Get any character except digits

I'm trying to search for a string that has 6 digits, but no more; other chars may follow. This is the regex I use: \d{6}[^\d]. For some reason it doesn't catch the digits which \d{6} does catch.
Update
Now I'm using the regex (\d{6}\D*)$, which does make sense, but I can't get it to work anyway.
Update 2 - solution
I should of course have grouped the \d{6} with parentheses. Doh! Otherwise the match includes the non-digit, and the script tries to make a date with that.
End of update
What I'm trying to achieve (as a rather dirty hack) is to find a date string in the header of an OpenOffice document in any of the following formats: YYMMDD, YYYY-MM-DD or YYYYMMDD. If it finds one of these (and only one), it sets the mtime and atime of that file to that date. Try creating an odt file in /tmp with 100101 in the header and running this script (sample file to download: http://db.tt/9aBaIqqa). In my tests it does not change the mtime/atime, but it does change them if you remove the \D in the script below.
This is all of my source:
import zipfile
import re
import glob
import time
import os
class OdfExtractor:
    def __init__(self, filename):
        """
        Open an ODF file.
        """
        self._odf = zipfile.ZipFile(filename)
    def getcontent(self):
        # Read file with header; decoded as UTF-8 (assumed) so str patterns apply
        return self._odf.read('styles.xml').decode('utf-8')
if __name__ == '__main__':
    filepattern = '/tmp/*.odt'
    # Possible date formats I've used
    patterns = [(r'\d{6}\D', '%y%m%d'), (r'\d{4}-\d\d-\d\d', '%Y-%m-%d'), (r'\d{8}', '%Y%m%d')]
    # go thru all those files
    for f in glob.glob(filepattern):
        # Extract data
        odf = OdfExtractor(f)
        # Create a list for all dates that will be found
        findings = []
        # Try finding date matches
        contents = odf.getcontent()
        for p in patterns:
            matches = re.findall(p[0], contents)
            for m in matches:
                try:
                    # Collect regexp matches that really are dates
                    findings.append(time.strptime(m, p[1]))
                except ValueError:
                    pass
        print(f)
        if len(findings) == 1:  # Don't change if multiple dates were found in file
            print('changing to:', findings[0])
            newtime = time.mktime(findings[0])
            os.utime(f, (newtime, newtime))
        print('-' * 8)
You can use \D (capital D) to match any non-digit character.
regex:
\d{6}\D
raw string: (are you sure you are escaping the string correctly?)
ex = r"\d{6}\D"
regular string, with the backslashes doubled:
ex = '\\d{6}\\D'
Try this instead:
r'(\d{6}\D*)$'
(six digits followed by 0 or more non-digits).
Edit: added a "must match to end of string" qualifier.
Edit2: Oh, for Pete's sake:
import re
test_strings = [
    ("12345", False),
    ("123456", True),
    ("1234567", False),
    ("123456abc", True),
    ("123456ab9", False)
]
outp = [
    " good, matched",
    "FALSE POSITIVE",
    "FALSE NEGATIVE",
    " good, no match"
]
pattern = re.compile(r'(\d{6}\D*)$')
for s, expected in test_strings:
    res = pattern.match(s)
    # Index 0 or 3 means the result agrees with the expectation.
    print(outp[2*(res is None) + (expected is False)])
returns
good, no match
good, matched
good, no match
good, matched
good, no match
I was pretty stupid. If I add a \D to the end of the pattern, the search will of course return that non-digit as well, which I didn't want. I had to add parentheses around the part I really wanted. I feel pretty stupid for not catching this with a simple print statement after the loop. I really need to code more frequently.
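A small sketch of the difference, using a made-up snippet of header text. It also shows a lookaround variant, which is a different technique from the grouping fix described above: it requires that no digit sits on either side of the six digits, so an eight-digit run is not picked up as a six-digit date.

```python
import re
import time

# Hypothetical header text containing a 6-digit and an 8-digit date.
text = '<text:p>100101</text:p> and <text:p>20100101</text:p>'

# Without a group, findall() returns the trailing non-digit as well.
print(re.findall(r'\d{6}\D', text))

# With a group, findall() returns only the captured digits, but it still
# picks up the last six digits of the longer run.
print(re.findall(r'(\d{6})\D', text))

# Lookarounds require that no digit sits on either side of the six digits.
exact = re.findall(r'(?<!\d)\d{6}(?!\d)', text)
print(exact)

parsed = time.strptime(exact[0], '%y%m%d')
print(parsed.tm_year, parsed.tm_mon, parsed.tm_mday)
```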
