Parsing timestamps with Python regular expressions: ':' character not found

I am teaching myself Python and trying to use a regular expression to extract a timestamp from an application log file (I normally use grep, cut and awk for this).
My log files contain many lines that start with a date followed by a time:
18.12.19 14:03:16 [ ..... # message error
18.12.19 14:03:16 [
:
I normally use a simple grep command, grep "14\:03\:16" mytext, and the plain expression "14:03:16" works as well. After some research I came up with the regex below, where res is one of the lines above:
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.match(res)
This does not find anything whereas
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
Captures the date.
Why is the character : not found? I have also tried \: and that does not work either. Thanks in advance.

re.match tries to match the regex from the beginning of the string.
From the docs:
If zero or more characters at the beginning of string match the
regular expression pattern, return a corresponding match object.
Return None if the string does not match the pattern; note that this
is different from a zero-length match.
When you did
datap = re.compile(r'(\d{2}).(\d{2}).(\d{2})')
m = datap.match(res)
the regex actually matched the date, not the time (because it is at the beginning of the string):
print(m)
# <re.Match object; span=(0, 8), match='18.12.19'>
If you use re.search then you will get the expected output:
import re
res = '18.12.19 14:03:16 [ ..... # message error'
datap = re.compile(r'(\d{2}):(\d{2}):(\d{2})')
m = datap.search(res)
print(m)
# <re.Match object; span=(9, 17), match='14:03:16'>
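To make the difference concrete, here is a minimal sketch using the sample line from the question. It also shows a second option that is not in the answer above: anchoring one pattern on the whole "date time" prefix so that match works. The names time_re and line_re are made up for illustration. Note that the unescaped . in the question's second pattern matches any character, which is why it happily matched the date; the sketch escapes the dots.

```python
import re

res = '18.12.19 14:03:16 [ ..... # message error'

time_re = re.compile(r'(\d{2}):(\d{2}):(\d{2})')

# match() only tries position 0, where the date sits, so it fails.
print(time_re.match(res))           # None

# search() scans the whole string and finds the time.
m = time_re.search(res)
hours, minutes, seconds = m.groups()
print(hours, minutes, seconds)      # 14 03 16

# Alternative (an assumption about what you want): describe the whole
# prefix, with the dots escaped, so that match() succeeds at position 0.
line_re = re.compile(r'(\d{2})\.(\d{2})\.(\d{2}) (\d{2}):(\d{2}):(\d{2})')
m2 = line_re.match(res)
print(m2.groups())                  # ('18', '12', '19', '14', '03', '16')
```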

Related

Parsing Regex with "|" (OR) in python

I'm trying to read a file in Python using the regex pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+') and get matches for both alternatives, but it only returns the first one.
Here is my code:
import regex
filename = "file.log"
pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
matchvalues = []
new_output = []
comm = 'comm="nom-http"'
i = 0
with open(filename, 'r') as audit:
    lines = audit.readlines()
    for line in lines:
        match = regex.search(pattern, line)
        if match:
            new_line = match.group()
            print(new_line)
            matchvalues.append(new_line)
matchvalues_size = len(matchvalues)
print(matchvalues)
Can you guys help me please?
Normally \K is not used within lookbehinds since its meaning is to make the match succeed at the current position and one usually does not want lookbehinds to be part of the current match. So I don't know why you are using variable-length lookbehinds, which require the regex package, to begin with. That said, I did not have a problem matching 'comm="nom-http"' with your regex:
>>> pattern = (r'(?x)((?<=\Kauid=)|(?<=\Kcomm="nom))[\S]+')
>>> comm = 'comm="nom-http"'
>>> regex.search(pattern, comm)
<regex.Match object; span=(0, 15), match='comm="nom-http"'>
Note that comm="nom is part of the match due to the presence of \K in the regex.
But simpler would be to use:
pattern = r'((?:auid=|comm="nom)\S+)'
So what problem are you having? When you say "it only returns the first one", do you mean the first occurrence on the line rather than the first alternative of the pattern, because a line may contain several occurrences? If so, use regex.findall instead of regex.search; it returns a list of the matched strings.
import re
pattern = r'((?:auid=|comm="nom)\S+)'
matches = re.findall(pattern, line)
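As a rough illustration of the findall suggestion, here is a small self-contained sketch. The log line is invented for the example (the question does not show the real file contents), and the pattern is the simplified one from above without the outer capturing group.

```python
import re

# Hypothetical audit-style line, made up for illustration only.
line = 'type=SYSCALL auid=1000 uid=0 comm="nom-http" exe="/usr/bin/nom"'

pattern = r'(?:auid=|comm="nom)\S+'

# search() stops at the first hit on the line...
first = re.search(pattern, line).group()
print(first)      # auid=1000

# ...while findall() returns every non-overlapping hit.
matches = re.findall(pattern, line)
print(matches)    # ['auid=1000', 'comm="nom-http"']
```

Since the simplified pattern needs neither \K nor lookbehinds, the standard re module is enough here.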

Preventing python to escape regexp pattern while inserting into list

I am trying to create a list of regexp patterns that I can use for pattern matching, like the one below:
REGEXES = [
    'port .\d+',
    'te\d+-\d+ \d+ [#]?\d+',
    'te\d+.-\d+'
]
Now when I check its output, it shows
['port .\\d+', 'te\\d+-\\d+ \\d+ [#]?\\d+', 'te\\d+.-\\d+']
And using the code below
msg = "Aborting Test: checkDutPort: Invalid dutBladeAndPort: te3932-213 0 #4, not found in global ::dutPortMap"
combined = "(" + ")|(".join(REGEXES) + ")"
re.match(combined, msg)
it is not able to match the pattern.
I checked, and even with raw strings Python escapes the "\".
How can I prevent this?
From the docs:
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object.
None of your patterns can be found at the beginning of msg, so it returns None.
If instead you use re.search, it will find the part of the string I assume you're looking for:
>>> re.search(combined, msg)
<_sre.SRE_Match object; span=(54, 69), match='te3932-213 0 #4'>
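On the escaping worry itself: the doubled backslashes are only how the list's repr displays a single backslash, so nothing is being changed in the pattern. A minimal sketch, using raw strings for the patterns (my addition, to avoid invalid-escape warnings on recent Python versions):

```python
import re

REGEXES = [
    r'port .\d+',
    r'te\d+-\d+ \d+ [#]?\d+',
    r'te\d+.-\d+',
]

# repr() doubles the backslash for display; the string holds just one.
print(repr(REGEXES[0]))   # 'port .\\d+'
print(REGEXES[0])         # port .\d+
print(len(REGEXES[0]))    # 9

msg = ("Aborting Test: checkDutPort: Invalid dutBladeAndPort: "
       "te3932-213 0 #4, not found in global ::dutPortMap")
combined = "(" + ")|(".join(REGEXES) + ")"

# match() is anchored at position 0; search() scans the whole message.
print(re.match(combined, msg))    # None
found = re.search(combined, msg)
print(found.group())              # te3932-213 0 #4
```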

Can't get re.search() to work in Python

I have a string of type "animal cat dog" and I am trying to extract animal from it.
Following this example, I tried using re.search (and also re.match later), however that didn't produce the result I expected. The if statement would go through but groups() would be empty.
The code I had:
string = "fox cat dog"
regex = "\S+ cat dog\s*"
m = re.search(regex, string)
if m:
    temp = m.group(1)
I tried printing out m and m.groups() and they had the following values:
m: <_sre.SRE_Match object at 0x000000002009A920>
m.groups(): ()
I found a way around the problem by using substrings and .find() but I am very curious what was wrong with my original code.
Any help would be appreciated. Thank you!
You just need to add parentheses around the part you want to capture, like so:
import re
string = "fox cat dog"
regex = r"(\S+) cat dog\s*"
#         ^^^^^ Note the parentheses there
m = re.search(regex, string)
if m:
    temp = m.group(1)
You may want to check the documentation for more information:
(...) Matches whatever regular expression is inside the parentheses,
and indicates the start and end of a group; the contents of a group
can be retrieved after a match has been performed, and can be matched
later in the string with the \number special sequence, described
below. To match the literals '(' or ')', use \( or \), or enclose them
inside a character class: [(] [)].
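A short sketch of what the match object holds with and without the parentheses may help. The variable names m_plain and m_grouped are mine, chosen only for the example.

```python
import re

string = "fox cat dog"

# Without parentheses the pattern matches, but defines no groups.
m_plain = re.search(r"\S+ cat dog\s*", string)
print(m_plain.group(0))   # fox cat dog  (group 0 is always the whole match)
print(m_plain.groups())   # ()
try:
    m_plain.group(1)
except IndexError as exc:
    # Asking for a group the pattern does not define raises IndexError.
    print("IndexError:", exc)

# With parentheses, group 1 holds the captured word.
m_grouped = re.search(r"(\S+) cat dog\s*", string)
print(m_grouped.group(1))   # fox
print(m_grouped.groups())   # ('fox',)
```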

How to print regex match results in python 3?

I was in IDLE, and decided to use regex to sort out a string. But when I typed in what the online tutorial told me to, all it would do was print:
<_sre.SRE_Match object at 0x00000000031D7E68>
Full program:
import re
reg = re.compile("[a-z]+8?")
str = "ccc8"
print(reg.match(str))
result:
<_sre.SRE_Match object at 0x00000000031D7ED0>
Could anybody tell me how to actually print the result?
You need to call .group() on the result of match to print the matched string; printing the result directly only shows the match object itself (or None if nothing matched). To print the text captured by a capturing group, pass the corresponding group index to .group().
>>> import re
>>> reg = re.compile("[a-z]+8?")
>>> str = "ccc8"
>>> print(reg.match(str).group())
ccc8
Regex with capturing group.
>>> reg = re.compile("([a-z]+)8?")
>>> print(reg.match(str).group(1))
ccc
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
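The MULTILINE remark is easy to check with a two-line string. This is only a sketch; the sample text is invented for the example.

```python
import re

# Invented two-line input: the interesting token starts the second line.
text = "first line\nccc8 second line"

# match() is tied to the start of the string, even with re.MULTILINE.
print(re.match(r"[a-z]+8", text, re.MULTILINE))   # None

# search() with ^ in MULTILINE mode can match at the start of any line.
m = re.search(r"^[a-z]+8", text, re.MULTILINE)
print(m.group())   # ccc8
print(m.start())   # 11
```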
If you need to get the whole match value, you should use
m = re.match(r"[a-z]+8?", text)
if m:  # Always check if a match occurred to avoid NoneType issues
    print(m.group())  # Print the match string
If you need to extract a part of the regex match, you need to use capturing groups in your regular expression. Enclose those patterns with a pair of unescaped parentheses.
To only print captured group results, use Match.groups:
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
So, to get ccc and 8 and display only those, you may use
import re
reg = re.compile("([a-z]+)(8?)")
s = "ccc8"
m = reg.match(s)
if m:
    print(m.groups())  # => ('ccc', '8')

Regular expression find and replace multiple

I am trying to write a regular expression that will match all cases of
[[any text or char here]]
in a series of text.
Eg:
My name is [[Sean]]
There is a [[new and cool]] thing here.
This all works fine using my regex.
data = "this is my tes string [[ that does some matching ]] then returns."
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)
The problem is when I have multiple instances of the match occurring: [[hello]] and [[bye]]
Eg:
data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
p = re.compile("\[\[(.*)\]\]")
data = p.sub('STAR', data)
This will match the opening bracket of hello and the closing bracket of bye. I want it to replace them both.
.* is greedy and matches as much text as it can, including ]] and [[, so it plows on through your "tag" boundaries.
A quick solution is to make the star lazy by adding a ?:
p = re.compile(r"\[\[(.*?)\]\]")
A better (more robust and explicit but slightly slower) solution is to make it clear that we cannot match across tag boundaries:
p = re.compile(r"\[\[((?:(?!\]\]).)*)\]\]")
Explanation:
\[\[            # Match [[
(               # Match and capture...
  (?:           # ...the following regex:
    (?!\]\])    # (only if we're not at the start of the sequence ]])
    .           # any character
  )*            # Repeat any number of times
)               # End of capturing group
\]\]            # Match ]]
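To compare the three variants side by side, here is a small sketch on the string from the question. Writing the commented pattern with re.VERBOSE is my choice for the example, so that the explanation above can be run as is.

```python
import re

data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"

greedy = re.compile(r"\[\[(.*)\]\]")
lazy = re.compile(r"\[\[(.*?)\]\]")
tempered = re.compile(r"""
    \[\[            # Match [[
    (               # Match and capture...
      (?:           # ...the following regex:
        (?!\]\])    # only if we're not at the start of the sequence ]]
        .           # any character
      )*            # Repeat any number of times
    )               # End of capturing group
    \]\]            # Match ]]
""", re.VERBOSE)

greedy_result = greedy.sub("STAR", data)
lazy_result = lazy.sub("STAR", data)
tempered_result = tempered.sub("STAR", data)

print(greedy_result)
# this is my new string it contains STAR and nothing else
print(lazy_result)
# this is my new string it contains STAR and STAR and nothing else
print(tempered_result == lazy_result)   # True
print(lazy.findall(data))               # ['hello', 'bye']
```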
Use non-greedy matching, .*?. A ? after a + or * makes the quantifier match as few characters as possible. The default is greedy, which consumes as many characters as possible.
p = re.compile(r"\[\[(.*?)\]\]")
You can use this:
p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = "this is my new string it contains [[hello]] and [[bye]] and nothing else"
>>> p = re.compile(r"\[\[[^\]]+\]\]")
>>> data = p.sub('STAR', data)
>>> data
'this is my new string it contains STAR and STAR and nothing else'
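One caveat with the negated character class, shown as a quick sketch: [^\]]+ refuses any ] inside the tag, so a tag that contains a single closing bracket is skipped, while the lazy version still matches it. The sample string is made up for the example.

```python
import re

# Made-up input with a single ] inside the double brackets.
tricky = "see [[a]b]] here"

negated = re.compile(r"\[\[[^\]]+\]\]")
lazy = re.compile(r"\[\[(.*?)\]\]")

negated_result = negated.sub("STAR", tricky)
lazy_result = lazy.sub("STAR", tricky)

print(negated_result)   # see [[a]b]] here   (no match, nothing replaced)
print(lazy_result)      # see STAR here
```

If your tags can never contain ], the negated class is fine; it also requires at least one character between the brackets because of the +.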
