Best regex to match IEEE Time Stamp format in Python - python

I did some searching but didn't find this specifically, and I'm sure it's going to be a quick answer.
I have a python script parsing IEEE date and time stamps out of strings, but I think I'm using python's match objects wrong.
import re
stir = "foo_2015-07-07-17-58-26.log"
timestamp = re.search("([0-9]+-){5}[0-9]+", stir).groups()
print(timestamp)
Produces
('58-',)
When my intent is to get
2015-07-07-17-58-26
Is there a pre-canned regex that would work better here? Am I getting tripped up on re's capture groups? Why is the length of the groups() tuple only 1?
Edit
I was misinterpreting the way capture groups work in python's re module - there is only one set of parentheses in the statement, so the re module returned the most recently grabbed capture group - the "58-".
The way I ended up doing it was by referencing group(0), as Dawg suggests below.
timestamp = re.search("([0-9]+-){5}[0-9]+", stir)
print(timestamp.group(0))
2015-07-07-17-58-26
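To make the point from the edit concrete, here is a minimal sketch using the same string and pattern as the question. It shows that a repeated capture group keeps only its last repetition, while group(0) is always the whole match.

```python
import re

stir = "foo_2015-07-07-17-58-26.log"
m = re.search(r"([0-9]+-){5}[0-9]+", stir)

whole = m.group(0)    # the entire matched text
captured = m.groups() # one entry per capture group in the pattern

print(whole)     # 2015-07-07-17-58-26
print(captured)  # ('58-',)  only the fifth repetition of group 1 survives
```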

You need a single capture group around the whole timestamp:
(\d\d\d\d-\d\d-\d\d-\d\d-\d\d-\d\d)
Or, nest a non-capturing group inside the capture group:
>>> re.search(r'(\d{4}(?:-\d{2}){5})', 'foo_2015-07-07-17-58-26.log')
<_sre.SRE_Match object at 0x100b49dc8>
>>> _.group(1)
'2015-07-07-17-58-26'
Or, you can use your pattern and just use group(0) instead of groups():
>>> re.search("([0-9]+-){5}[0-9]+", "foo_2015-07-07-17-58-26.log").group(0)
'2015-07-07-17-58-26'
Or, use findall with one outer capture group and make the repeated inner group non-capturing:
>>> re.findall("((?:[0-9]+-){5}[0-9]+)", 'foo_2015-07-07-17-58-26.log')
['2015-07-07-17-58-26']
But that pattern will also match any other run of six hyphen-separated numbers, not only timestamps.
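As a rough illustration of that caveat, the sketch below compares the loose pattern with a fixed-width one and then validates the result with datetime.strptime. The file name and the format string "%Y-%m-%d-%H-%M-%S" are assumptions made up for this example.

```python
import re
from datetime import datetime

# Hypothetical file name containing a second run of hyphen-separated digits.
text = "run-1-2-3-4-5-6_2015-07-07-17-58-26.log"

loose = re.findall(r"((?:[0-9]+-){5}[0-9]+)", text)
strict = re.findall(r"\d{4}(?:-\d{2}){5}", text)

# strptime raises ValueError for impossible values such as month 13.
stamps = [datetime.strptime(s, "%Y-%m-%d-%H-%M-%S") for s in strict]

print(loose)   # ['1-2-3-4-5-6', '2015-07-07-17-58-26']
print(strict)  # ['2015-07-07-17-58-26']
print(stamps[0])
```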

If you want the timestamp as the whole match, I think this should work:
\d{4}(?:-\d{2}){5}
Then use group() or group(0).
Also, match.groups() returns a tuple of the captured groups, not the whole match, so you should use .group() instead. A repeated group such as ([0-9]+-){5} keeps only its last repetition, which is why groups() gave you just '58-'.

I'd use the pattern below:
_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.
The _ and \. mark the start and the end.
import re
r = r'_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2})\.'
s = 'some string'
lst = re.findall(r, s)  # pattern first, then the string to search

You might want
re.findall(r"([0-9-]+)", stir)
>>> import re
>>> stir = "foo_2015-07-07-17-58-26.log"
>>> re.findall(r"([0-9-]+)", stir)
['2015-07-07-17-58-26']

Related

Regex: define a particular group and avoid repeating it each time [duplicate]

Consider this (very simplified) example string:
1aw2,5cx7
As you can see, it is two digit/letter/letter/digit values separated by a comma.
Now, I could match this with the following:
>>> from re import match
>>> match("\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>
The problem is though, I have to write \d\w\w\d twice. With small patterns, this isn't so bad but, with more complex Regexes, writing the exact same thing twice makes the end pattern enormous and cumbersome to work with. It also seems redundant.
I tried using a named capture group:
>>> from re import match
>>> match("(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>
But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used later on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?
No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.
You can always do so by re-using Python variables, of course:
digit_letter_letter_digit = r'\d\w\w\d'
then use string formatting to build the larger pattern:
match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)
or, using Python 3.6+ f-strings:
dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)
I often do use this technique to compose larger, more complex patterns from re-usable sub-patterns.
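One thing to watch when composing patterns this way is that alternation or quantifiers inside a sub-pattern can leak into the surrounding pattern. A minimal sketch of one way to guard against that, using a hypothetical helper named wrap (not part of re):

```python
import re

def wrap(sub_pattern):
    # Hypothetical helper: enclose the sub-pattern in a non-capturing group
    # so it behaves as a single unit wherever it is pasted.
    return "(?:" + sub_pattern + ")"

dlld = wrap(r"\d\w\w\d")
pair = re.compile(dlld + "," + dlld)
print(bool(pair.fullmatch("1aw2,5cx7")))    # True

# Without wrap, "cat|dog,cat|dog" would parse as cat | dog,cat | dog.
animal = wrap(r"cat|dog")
animals = re.compile(animal + "," + animal)
print(bool(animals.fullmatch("cat,dog")))   # True
```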
If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?N), where N is a group number such as (?1), re-uses the pattern of an already used (implicitly numbered) capturing group:
(\d\w\w\d),(?1)
^........^ ^..^
|           \
|             re-use pattern of capturing group 1
\
  capturing group 1
You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern matched by groupname (the latter two forms are alternatives for compatibility with other engines).
And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:
(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
          ^...............^ ^......^ ^......^
          |                     \       /
 creates 'dlld' pattern      uses 'dlld' pattern twice
Just to be explicit: the standard library re module does not support subroutine patterns.
Note: this works with the PyPI regex module, not with the re module.
You could use the notation (?group-number), in your case:
(\d\w\w\d),(?1)
it is equivalent to:
(\d\w\w\d),(\d\w\w\d)
Be aware that \w also matches digits (and the underscore). If you want letters only, the regex becomes:
(\d[a-zA-Z]{2}\d),(?1)
I was troubled with the same problem and wrote this snippet
import nre
my_regex=nre.from_string('''
a=\d\w\w\d
b={{a}},{{a}}
c=(?P<id>{{a}}),(?P=id)
''')
my_regex["b"].match("1aw2,5cx7")
For lack of a more descriptive name, I named the partial regexes as a,b and c.
Accessing them is as easy as {{a}}
import re
digit_letter_letter_digit = re.compile(r"\d\w\w\d")  # compile the pattern once so it can be reused
all_finds = digit_letter_letter_digit.findall("1aw2,5cx7")  # finditer would yield match objects instead
for value in all_finds:
    print(digit_letter_letter_digit.match(value))
Since you're already using re, why not use string processing to manage the pattern repetition as well:
pattern = "P,P".replace("P",r"\d\w\w\d")
re.match(pattern, "1aw2,5cx7")
OR
P = r"\d\w\w\d"
re.match(f"{P},{P}", "1aw2,5cx7")
Try a backreference. Note that it repeats the text captured by the group, not the group's pattern, so
(\d\w\w\d),\1
matches 1aw2,1aw2 but not 1aw2,5cx7.
See http://www.regular-expressions.info/backref.html for reference.
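A quick check of what a backreference actually matches, as a small sketch with the strings from the question:

```python
import re

# \1 must equal the exact text captured by group 1.
same_text = re.compile(r"(\d\w\w\d),\1")

repeated = same_text.fullmatch("1aw2,1aw2")
different = same_text.fullmatch("1aw2,5cx7")

print(repeated is not None)   # True
print(different is not None)  # False
```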

Python regular expression findall *

I am not able to understand the following code behavior.
>>> import re
>>> text = 'been'
>>> r = re.compile(r'b(e)*')
>>> r.search(text).group()
'bee' #makes sense
>>> r.findall(text)
['e'] #makes no sense
I have read some existing questions and answers about capturing groups, but I am still confused. Could someone please explain this to me?
The Regex HOWTO gives a simplified answer.
As it explains, group() returns the string matched by the regular expression:
group() returns the substring that was matched by the RE.
But the behaviour of findall is explained in the documentation:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group
So you are getting the matched part of the capture group.
An experiment:
>>> r = re.compile(r'(b)(e)*')
>>> r.findall(text)
[('b', 'e')]
Here the regex has two capturing groups, so the returned values are a list of matched groups (in tuples)
When a pattern contains a capture group, findall returns only the content of the capture group and no longer the whole match.
Although this behaviour may look strange, it is very useful for easily extracting parts of a string in a particular context (a substring before or after something), especially since Python's re module doesn't support variable-length lookbehinds.
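The difference can be seen side by side in a short sketch that reuses the 'been' example from the question:

```python
import re

text = "been"

with_group = re.findall(r"b(e)*", text)       # content of group 1 only
without_group = re.findall(r"b(?:e)*", text)  # no capture group: whole match
whole_matches = [m.group() for m in re.finditer(r"b(e)*", text)]

print(with_group)     # ['e']
print(without_group)  # ['bee']
print(whole_matches)  # ['bee']
```

finditer is a reasonable choice when you want to keep the capture group and still get at the whole match.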

Regex pattern for illegal regex groups `\g<...>`

Given a replacement template such as r"\g<NAME>\w+", I would like to detect that a group named NAME must be available for the replacement.
Which regex matches wrong uses of \g<...>?
For example, the following code finds any unescaped group references.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
    print(m.group(1))
But there is one last problem to solve. How can I handle cases such as \g<WRONGUSE\> and \g\<WRONGUSE>?
As far as I am aware, the only restriction on named capture groups is that you can't put metacharacters in there, such as . \, etc...
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)" is only illegal because you referred to a backreference without it being declared earlier in the regex string. If you want to make a named capture group, it is (?P<NAME>regex)
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the regex string itself: put the regex into a string variable and then match something like \(\?P<(.*?)> against it, which would give you the named capture group's name.
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> dict(regex.groupindex)
{'group_name': 1}
As you see, groupindex is a mapping from the group names to their positions in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
... list(regex.groupindex)
['group_name']
>>> # The string of your group name:
... list(regex.groupindex)[0]
'group_name'
Don't know if that is what you were looking for...
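Building on groupindex, one possible way to check a replacement template against a pattern is sketched below. The helper name missing_groups is made up for this example, and the backslash handling is a simplification: it only skips references directly preceded by a single backslash.

```python
import re

def missing_groups(pattern, template):
    # Hypothetical helper: names referenced as \g<name> in `template`
    # that `pattern` does not define as named groups.
    compiled = re.compile(pattern)
    # (?<!\\) skips references that are preceded by a backslash.
    used = re.findall(r"(?<!\\)\\g<(\w+)>", template)
    # Numeric references such as \g<1> are ignored here.
    return [name for name in used
            if not name.isdigit() and name not in compiled.groupindex]

result = missing_groups(r"(?P<greeting>sup) bro", r"\g<greeting> \g<name>")
print(result)  # ['name']
```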
Use a negative lookahead?
\\g(?!<\w+>)
This searches for any \g not followed by <…>, that is, a "wrong use".
Thanks to all the comments, I have this solution.
# Good uses.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print(m.group(1))
# Bad uses.
p = re.compile(r"(?:[^\\])\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print("Wrong use !")
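One caveat with (?:[^\\]) is that it needs a character before the backslash, so a \g<...> at the very start of the string is missed. A possible variant, sketched here with a negative lookbehind and a made-up sample string, avoids that. It still treats any backslash directly before \g as an escape, which is a simplification.

```python
import re

# (?<!\\) asserts "not preceded by a backslash" without consuming a character.
GOOD = re.compile(r"(?<!\\)\\g<(\w+)>")
BAD = re.compile(r"(?<!\\)\\g(?!<\w+>)")

# Hypothetical sample: one good use, one escaped use, one wrong use.
sample = r"\g<NAME>\w+\\g<ESCAPED>\g<WRONGUSE\>"

good_names = GOOD.findall(sample)
has_bad = BAD.search(sample) is not None

print(good_names)  # ['NAME']
print(has_bad)     # True
```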

Python regular expression issue

I'm trying to use the re module so that it returns a run of characters, stopping before any character that is immediately followed by a particular string. The re documentation seems to indicate that I can use (?!...) to accomplish this. The example that I'm currently wrestling with:
str_to_search = 'abababsonab, etc'
first = re.search(r'(ab)+(?!son)', str_to_search)
second = re.search(r'.+(?!son)', str_to_search)
first.group() is 'abab', which is what I'm aiming for. However, second.group() returns the entire str_to_search string, despite the fact that I'm trying to make it stop at 'ababa', as the subsequent 'b' is immediately followed by 'son'. Where am I going wrong?
It's not the simplest thing, but you can match a repeating sequence of "a character not followed by 'son'". This repeated expression should be in a non-capturing group, (?: ... ), so it doesn't mess with your match results (otherwise you'd end up with an extra match group).
Try this:
import re
str_to_search = 'abababsonab, etc'
second = re.search(r'(?:.(?!son))+', str_to_search)
print(second.group())
Output:
ababa
See it here: http://ideone.com/6DhLgN
This should work:
second = re.search(r'(.(?!son))+', str_to_search)
#output: 'ababa'
I'm not sure what you are trying to do, but:
check out str.partition
'.+?' is the minimal (non-greedy) matcher; plain '.+' is greedy and grabs everything
read the docs for group(...) and groups(...), especially about passing a group number
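To compare the suggestions above, here is a small sketch on the string from the question. The lazy pattern .+?(?=.son) is an extra variant added for illustration and does not come from the answers.

```python
import re

s = 'abababsonab, etc'

# Greedy .+ runs to the end of the string, where (?!son) trivially succeeds.
greedy = re.search(r'.+(?!son)', s).group()

# Each character is checked individually: stop before one followed by 'son'.
tempered = re.search(r'(?:.(?!son))+', s).group()

# Lazy variant: expand until "one character, then 'son'" lies ahead.
lazy = re.search(r'.+?(?=.son)', s).group()

# str.partition splits at 'son' itself, so the 'b' before it is kept.
before = s.partition('son')[0]

print(greedy)    # abababsonab, etc
print(tempered)  # ababa
print(lazy)      # ababa
print(before)    # ababab
```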

Can't make regex work with Python

I need to extract the date in format of: dd Month yyyy (20 August 2013).
I tried the following regex:
\d{2} (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}
It works in regex testers (checked in several with the text "Monday, 19 August 2013"), but it seems that Python doesn't understand it. The output I get is:
>>>
['August']
>>>
Can somebody please explain to me why that is happening?
Thank you !
Did you use re.findall? By default, if there's at least one capture group in the pattern, re.findall will return only the captured parts of the expression.
You can avoid this by removing every capture group, causing re.findall to return the entire match:
\d{2} (?:January|February|...|December) \d{4}
or by making a single big capture group:
(\d{2} (?:January|February|...|December) \d{4})
or, possibly more conveniently, by making every component a capture group:
(\d{2}) (January|February|...|December) (\d{4})
This latter form is more useful if you will need to process the individual day/month/year components.
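The three variants can be compared directly. This is a minimal sketch using the sample text from the question; the MONTHS variable is introduced only to keep the patterns short.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
text = "Monday, 19 August 2013"

month_only = re.findall(r"\d{2} (%s) \d{4}" % MONTHS, text)
whole_date = re.findall(r"\d{2} (?:%s) \d{4}" % MONTHS, text)
components = re.findall(r"(\d{2}) (%s) (\d{4})" % MONTHS, text)

print(month_only)  # ['August']
print(whole_date)  # ['19 August 2013']
print(components)  # [('19', 'August', '2013')]
```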
It looks like you are only getting the data from the capture group, try this:
(\d{2} (?:January|February|March|April|May|June|July|August|September|October|November|December) \d{4})
I put a capture group around the entire thing and made the month a non-capture group. Now whatever was giving you "August" should give you the entire thing.
I just looked at an example from the Python regex documentation:
>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Seeing this, I'm guessing (since you didn't show how you were actually using this regex) that you were doing group(1) which will now work with the regex I supplied above.
It also looks like you could have used group(0) to get the whole thing (if I am correct in the assumption that this is what you were doing). This would work in your original regex as well as my modified version.
