Regex: define a particular group and avoid repeating it each time [duplicate] - python

Consider this (very simplified) example string:
1aw2,5cx7
As you can see, it is two digit/letter/letter/digit values separated by a comma.
Now, I could match this with the following:
>>> from re import match
>>> match("\d\w\w\d,\d\w\w\d", "1aw2,5cx7")
<_sre.SRE_Match object at 0x01749D40>
>>>
The problem is though, I have to write \d\w\w\d twice. With small patterns, this isn't so bad but, with more complex Regexes, writing the exact same thing twice makes the end pattern enormous and cumbersome to work with. It also seems redundant.
I tried using a named capture group:
>>> from re import match
>>> match("(?P<id>\d\w\w\d),(?P=id)", "1aw2,5cx7")
>>>
But it didn't work because it was looking for two occurrences of 1aw2, not digit/letter/letter/digit.
Is there any way to save part of a pattern, such as \d\w\w\d, so it can be used latter on in the same pattern? In other words, can I reuse a sub-pattern in a pattern?

No, when using the standard library re module, regular expression patterns cannot be 'symbolized'.
You can always do so by re-using Python variables, of course:
digit_letter_letter_digit = r'\d\w\w\d'
then use string formatting to build the larger pattern:
match(r"{0},{0}".format(digit_letter_letter_digit), inputtext)
or, using Python 3.6+ f-strings:
dlld = r'\d\w\w\d'
match(fr"{dlld},{dlld}", inputtext)
I often do use this technique to compose larger, more complex patterns from re-usable sub-patterns.
If you are prepared to install an external library, then the regex project can solve this problem with a regex subroutine call. The syntax (?<digit>) re-uses the pattern of an already used (implicitly numbered) capturing group:
(\d\w\w\d),(?1)
^........^ ^..^
| \
| re-use pattern of capturing group 1
\
capturing group 1
You can do the same with named capturing groups, where (?<groupname>...) is the named group groupname, and (?&groupname), (?P&groupname) or (?P>groupname) re-use the pattern matched by groupname (the latter two forms are alternatives for compatibility with other engines).
And finally, regex supports the (?(DEFINE)...) block to 'define' subroutine patterns without them actually matching anything at that stage. You can put multiple (..) and (?<name>...) capturing groups in that construct to then later refer to them in the actual pattern:
(?(DEFINE)(?<dlld>\d\w\w\d))(?&dlld),(?&dlld)
^...............^ ^......^ ^......^
| \ /
creates 'dlld' pattern uses 'dlld' pattern twice
Just to be explicit: the standard library re module does not support subroutine patterns.

Note: this will work with PyPi regex module, not with re module.
You could use the notation (?group-number), in your case:
(\d\w\w\d),(?1)
it is equivalent to:
(\d\w\w\d),(\d\w\w\d)
Be aware that \w includes \d. The regex will be:
(\d[a-zA-Z]{2}\d),(?1)

I was troubled with the same problem and wrote this snippet
import nre
my_regex=nre.from_string('''
a=\d\w\w\d
b={{a}},{{a}}
c=?P<id>{{a}}),(?P=id)
''')
my_regex["b"].match("1aw2,5cx7")
For lack of a more descriptive name, I named the partial regexes as a,b and c.
Accessing them is as easy as {{a}}

import re
digit_letter_letter_digit = re.compile("\d\w\w\d") # we compile pattern so that we can reuse it later
all_finds = re.findall(digit_letter_letter_digit, "1aw2,5cx7") # finditer instead of findall
for value in all_finds:
print(re.match(digit_letter_letter_digit, value))

Since you're already using re, why not use string processing to manage the pattern repetition as well:
pattern = "P,P".replace("P",r"\d\w\w\d")
re.match(pattern, "1aw2,5cx7")
OR
P = r"\d\w\w\d"
re.match(f"{P},{P}", "1aw2,5cx7")

Try using back referencing, i believe it works something like below to match
1aw2,5cx7
You could use
(\d\w\w\d),\1
See here for reference http://www.regular-expressions.info/backref.html

Related

How to use regular expression to remove all math expression in latex file

Suppose I have a string which consists of a part of latex file. How can I use python re module to remove any math expression in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Click for Demo
Code
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character that does not start with what was captured in Group 1
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by #torek, we should not match 3 or more consecutive $, hence changing the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job—while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code —you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.

Best regex to match IEEE Time Stamp format in Python

I did some searching but didn't find this specifically, and I'm sure it's going to be a quick answer.
I have a python script parsing IEEE date and time stamps out of strings, but I think I'm using python's match objects wrong.
import re
stir = "foo_2015-07-07-17-58-26.log"
timestamp = re.search("([0-9]+-){5}[0-9]+", stir).groups()
print timestamp
Produces
58-
When my intent is to get
2015-07-07-17-58-26
Is there a pre-canned regex that would work better here? Am I getting tripped up on re's capture groups? Why is the length of the groups() tuple only 1?
Edit
I was misinterpreting the way capture groups work in python's re module - there is only one set of parentheses in the statement, so the re module returned the most recently grabbed capture group - the "58-".
The way I ended up doing it was by referencing group(0), as Dawg suggests below.
timestamp = re.search("([0-9]+-){5}[0-9]+", stir)
print timestamp.group(0)
2015-07-07-17-58-26
You need a single capture group or groups:
(\d\d\d\d-\d\d-\d\d-\d\d-\d\d-\d\d)
Demo
Or, use nested capture groups:
>>> re.search(r'(\d{4}(?:-\d{2}){5})', 'foo_2015-07-07-17-58-26.log')
<_sre.SRE_Match object at 0x100b49dc8>
>>> _.group(1)
'2015-07-07-17-58-26'
Or, you can use your pattern and just use group(0) instead of groups():
>>> re.search("([0-9]+-){5}[0-9]+", "foo_2015-07-07-17-58-26.log").group(0)
'2015-07-07-17-58-26'
Or, use findall with an additional capture group (and the other a non capture group):
>>> re.findall("((?:[0-9]+-){5}[0-9]+)", 'foo_2015-07-07-17-58-26.log')
['2015-07-07-17-58-26']
But that will find the digits that are not part of the timestamp.
if you want the timestamp in one match object, i think this should work
\d{4}(?:\d{2}){5}
then use group() or group(0)
also, match.groups actually returns the number of group objects, you should try .group() instead (your code would still not work though because you grouped the 5 sets of numbers in and the final -58 would be omitted
I'd use below:
_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}).
_ and . to mark the starting and the end.
import re
r = r'_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}).'
s = 'some string'
lst = re.findall(s,r)
link
You might want
re.findall(r"([0-9-]+)", stir)
>>> import re
>>> stir = "foo_2015-07-07-17-58-26.log"
>>> re.findall(r"([0-9-]+)", stir)
['2015-07-07-17-58-26']

Python regular expression expansion

I am really bad with regular expressions, and stuck up to generate all the possible combinations for a regular expression.
When the regular expression is abc-defghi00[1-24,2]-[1-20,23].walmart.com, it should generate all its possible combinations.
The text before the braces can be anything and the pattern inside the braces is optional.
Need all the python experts to help me with the code.
Sample output
Here is the expected output:
abc-defghi001-1.walmart.com
.........
abc-defghi001-20.walmart.com
abc-defghi001-23.walmart.com
..............
abc-defghi002-1.walmart.com
Repeat this from 1-24 and 2.
Regex tried
([a-z]+)(-)([a-z]+)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(-)(\[)(\d)(-)(\d+)(,?)(\d?)(\])(.*)
Lets say we would like to match against abc-defghi001-1.walmart.com. Now, if we write the following regex, it does the job.
s = 'abc-defghi001-1.walmart.com'
re.match ('.*[1-24]-[1-20|23]\.walmart\.com',s)
and the output:
<_sre.SRE_Match object at 0x029462C0>
So, its found. If you want to match to 27 in the first bracket, you simply replace it by [1-24|27], or if you want to match to 0 to 29, you simply replace it by [1-29]. And ofcourse, you know that you have to write import re, before all the above commands.
Edit1: As far as I understand, you want to generate all instances of a regular expression and store them in a list.
Use the exrex python library to do so. You can find further information about it here. Then, you have to limit the regex you use.
import re
s = 'abc-defghi001-1.walmart.com'
obj=re.match(r'^\w{3}-\w{6}00(1|2)-([1-20]|23)\.walmart\.com$',s)
print(obj.group())
The above regex will match the template you're looking for I hope!

Regex pattern for illegal regex groups `\g<...>`

In the following regex r"\g<NAME>\w+", I would like to know that a group named NAME must be used for replacements corresponding to a match.
Which regex matches the wrong use of \g<...> ?
For example, the following code finds any not escaped groups.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
print(m.group(1))
But there is a last problem to solve. How can I manage cases of \g<WRONGUSE\> and\g\<WRONGUSE> ?
As far as I am aware, the only restriction on named capture groups is that you can't put metacharacters in there, such as . \, etc...
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)" is only illegal because you referred to a backreference without it being declared earlier in the regex string. If you want to make a named capture group, it is (?P<NAME>regex)
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the string itself, so, you would put the regex into a string variable and then match something like \(\?P<(.*?)>\) which would give you the named capture group's name.
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> regex.groupindex
{'group_name': 1}
As you see, groupindex returns a dictionary mapping the group names and their position in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
... regex.groupindex.keys()
['group_name']
>>> # The string of your group name:
... regex.groupindex.keys()[0]
'group_name'
Don't know if that is what you were looking for...
Use a negative lookahead?
\\g(?!<\w+>)
This search for any g not followed by <…>, thus a "wrong use".
Thanks to all the comments, I have this solution.
# Good uses.
p = re.compile(r"(?:[^\\])\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
print(m.group(1))
# Bad uses.
p = re.compile(r"(?:[^\\])\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
print("Wrong use !")

Named non-capturing group in python?

Is it possible to have named non-capturing group in python? For example I want to match string in this pattern (including the quotes):
"a=b"
'bird=angel'
I can do the following:
s = '"bird=angel"'
myre = re.compile(r'(?P<quote>[\'"])(\w+)=(\w+)(?P=quote)')
m = myre.search(s)
m.groups()
# ('"', 'bird', 'angel')
The result captures the quote group, which is not desirable here.
No, named groups are always capturing groups. From the documentation of the re module:
Extensions usually do not create a new group; (?P<name>...) is the
only exception to this rule.
And regarding the named group extension:
Similar to regular parentheses, but the substring matched by the group
is accessible within the rest of the regular expression via the
symbolic group name name
Where regular parentheses means (...), in contrast with (?:...).
You do need a capturing group in order to match the same quote: there is no other mechanism in re that allows you to do this, short of explicitly distinguishing the two quotes:
myre = re.compile('"{0}"' "|'{0}'" .format('(\w+)=(\w+)'))
(which has the downside of giving you four groups, two for each style of quotes).
Note that one does not need to give a name to the quotes, though:
myre = re.compile(r'([\'"])(\w+)=(\w+)\1')
works as well.
In conclusion, you are better off using groups()[1:] in order to get only what you need, if at all possible.

Categories