Compiling user defined regex object and searching/matching in Python [duplicate] - python

This question already has answers here:
How to create raw string from string variable in python?
(3 answers)
Closed 4 years ago.
I am writing an online regex checker, which takes input from the user in the form of a pattern, and flags, and uses that to compile a regex object. The regex object is then used to check if the test string matches within the format provided by the regex pattern or not. As of this moment, the compile function looks like this:
import re
class RegexObject:
    ...
    def compile(self):
        # Start from 0 (no flags) so that |= works whichever flags are set
        flags = 0
        if self.multiline_flag:
            flags |= re.M
        if self.dotall_flag:
            flags |= re.S
        if self.verbose_flag:
            flags |= re.X
        if self.ignorecase_flag:
            flags |= re.I
        if self.unicode_flag:
            flags |= re.U
        regex = re.compile(self.pattern, flags)
        return regex
Please note, the self.pattern and all the flags are class attributes defined by the user using a simple form. However, one thing I noticed in the docs is that there is usually an r before the pattern in the compile functions, like this:
re.compile(r'(?<=abc)def')
How do I place that r in my code before my variable name? Also, if I want to tell the user if the test input is valid or not, should I be using the match method, or the search method?
Thanks for any help.
Edit: This question is not a duplicate of this one, because that question has nothing to do with regular expressions.

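Don't worry about the r, you don't need it here. The prefix only affects how a string literal in your source code is parsed; a pattern that arrives from a form already contains its backslashes as ordinary characters.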
The r stands for "raw", not "regex". In an r string, you can put backslashes without escaping them. R strings are often used in regexes because there are often many backslashes in regexes. Escaping the backslashes can be annoying. See this shell output:
>>> s = r"\a"
>>> s2 = "\a"
>>> s2
'\x07'
>>> s
'\\a'
And you should use search, as match only looks at the start of the string. Look at the docs.
re.search(pattern, string, flags=0)
Scan through string looking for
the first location where the regular expression pattern produces a
match, and return a corresponding match object. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
re.match(pattern, string, flags=0)
If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding match object. Return None if the string does not match
the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the
beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead
(see also search() vs. match()).
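A minimal sketch of how the checker could combine the flags and run the search. The names build_regex and check and their keyword parameters are hypothetical and not part of the original class; only re.compile, the flag constants, and re.error come from the standard library. The pattern is built with chr(92) to imitate text typed into a form, where the backslash is just a character and no r prefix is involved.

```python
import re


def build_regex(pattern, multiline=False, dotall=False, verbose=False,
                ignorecase=False):
    """Compile a user-supplied pattern (hypothetical helper)."""
    flags = 0  # start with no flags so |= works in every branch
    if multiline:
        flags |= re.MULTILINE
    if dotall:
        flags |= re.DOTALL
    if verbose:
        flags |= re.VERBOSE
    if ignorecase:
        flags |= re.IGNORECASE
    return re.compile(pattern, flags)


def check(pattern, text, **flag_options):
    """Return True if the pattern is found anywhere in text, False if it
    is not, and None if the pattern itself is not a valid regex."""
    try:
        regex = build_regex(pattern, **flag_options)
    except re.error:
        return None
    return regex.search(text) is not None


# Simulated form input: a backslash followed by "d+", i.e. the text \d+
user_pattern = chr(92) + "d+"
print(check(user_pattern, "abc 123"))
print(check("(unclosed", "abc"))
```

Returning None for an invalid pattern is only one possible design; a web form might prefer to show the re.error message to the user instead.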

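You do not need the r prefix. Use re.escape only if the user's input should be matched as literal text rather than interpreted as a pattern. Whether to use match or search should, again, be the user's choice.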

Related

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
    print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as the one used at regex101.com, i.e. you need literal backslashes in the Python code, hence the raw string literals for both the input text and the regex. In the non-raw pattern, "\\(" reaches the regex engine as \(, an escaped literal parenthesis, so the later ) has no opening partner, which is what the "unbalanced parenthesis" error reports.
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+.
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
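The points above can be checked with a short sketch. The variable seen_by_engine is my own reconstruction of what the non-raw literal from the question evaluates to, written as a raw string so the text is visible as is; the warnings filter is there only because recent Python versions may emit a FutureWarning for the "[[" sequence.

```python
import re
import warnings

# Reconstruction (an assumption spelled out above) of the pattern text the
# regex engine receives from the non-raw literal in the question.
seen_by_engine = r"\d+\([[:alpha:]]+)\\d+"

error_position = None
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide a possible nested-set warning
        re.compile(seen_by_engine)
except re.error as exc:
    error_position = exc.pos  # index of the offending character
    print("error:", exc)

# Raw literals for both the text and the pattern, and no POSIX class.
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([^\W\d_]+)\\\d+", s)
month = m.group(1) if m else None
print(month)
```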
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

re.fullmatch equivalent in pandas text handling [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However that regex also fits "78.46.92.168:8000" for some reason, which I do not want, a little bit of code:
class Foo():
    _rex = re.compile(r"\d+")
    def bar(self, string):
        m = self._rex.match(string)
        if m is not None:
            doStuff()
And doStuff() is called when the IP address is entered. I'm kind of confused, how does "." or ":" match "\d"?
\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use
re.match(r'\d+$', s) # s is the string to test; re.match anchors the match at the start of the string, so $ is what remains to add
or - to avoid matching before the final \n in the string:
re.match(r'\d+\Z', s) # \Z will only match at the very end of the string
Or the same as above with the re.search method, which requires the ^ / \A start-of-string anchor because it does not anchor the match at the start of the string:
re.search(r'^\d+$', s)
re.search(r'\A\d+\Z', s)
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
Python 3
All the cases described in the section above, plus one more useful method, re.fullmatch (also present in the PyPI regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile(r"\d+")
if _rex.fullmatch(s):
    doStuff()
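A small sketch comparing the whole-string options on the inputs discussed here. The helper name is_all_digits is made up for illustration; fullmatch requires Python 3.4 or later.

```python
import re

digits = re.compile(r"\d+")


def is_all_digits(text):
    """True only when the entire text consists of digits (illustrative)."""
    return digits.fullmatch(text) is not None


print(is_all_digits("1234"))
print(is_all_digits("78.46.92.168:8000"))

# $ tolerates a single trailing newline; \Z and fullmatch do not.
dollar_accepts_newline = re.match(r"\d+$", "1234\n") is not None
z_accepts_newline = re.match(r"\d+\Z", "1234\n") is not None
print(dollar_accepts_newline, z_accepts_newline, is_all_digits("1234\n"))
```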
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: re.match(r"\d+$", string) would work.
To be more explicit, you could also use re.match(r"^\d+$", string) (where the ^ is redundant) or drop re.match() altogether and use re.search(r"^\d+$", string).
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$

Python conditional using regex with multiline string [duplicate]

I need help with a very simple question: a conditional using a regex with a multiline string. It makes no sense to me why this does not work:
if re.match(r"\w", " \n\n\n aaaaaaaaaaaa\n\n", re.MULTILINE):
    print('ok')
else:
    print('fail')
fail
I expected the result to be ok, but nothing matches. I tried it at https://regex101.com/r/BsdymE/1, where it works, but in my code it does not.
re.match will only return a match if the pattern matches at the beginning of the string.
https://docs.python.org/3/library/re.html#re.match
re.match(pattern, string, flags=0)
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Try using re.search(pattern, string, flags=0) instead
From pydoc re.match:
Try to apply the pattern at the start of the string, returning a
Match object, or None if no match was found.
(emphasis mine). So, the problem is not with the string being multiline, but rather with it not beginning with a word-class character. If you want to check if a string contains something anywhere, use re.search instead.
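To make the difference concrete, here is a short sketch using the string from the question. The variable names are mine; the re.MULTILINE flag is kept on the match call only to show that it does not change where match starts.

```python
import re

text = " \n\n\n aaaaaaaaaaaa\n\n"

# match tries position 0 only, where the character is a space.
at_start = re.match(r"\w", text, re.MULTILINE)

# search scans forward until it finds the first word character.
anywhere = re.search(r"\w", text)

print(at_start)
print(anywhere.group(), anywhere.start())
```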

How to match an underscore using Python's regex? [duplicate]

I'm having trouble matching the underscore character in Python using regular expressions. Just playing around in the shell, I get:
>>> import re
>>> re.match(r'a', 'abc')
<_sre.SRE_Match object at 0xb746a368>
>>> re.match(r'_', 'ab_c')
>>> re.match(r'[_]', 'ab_c')
>>> re.match(r'\_', 'ab_c')
I would have expected at least one of these to return a match object. Am I doing something wrong?
Use re.search instead of re.match if the pattern you are looking for is not at the start of the search string.
re.match(pattern, string, flags=0)
Try to apply the pattern at the start of the string, returning a match
object, or None if no match was found.
re.search(pattern, string, flags=0)
Scan through string looking for a match to the pattern, returning a
match object, or None if no match was found.
You don't need to escape _ or even use raw string.
>>> re.search('_', 'ab_c')
Out[4]: <_sre.SRE_Match object; span=(2, 3), match='_'>
Try the following:
re.search(r'\_', 'ab_c')
Escaping the underscore character is harmless, though not required; the missing piece was search.
Mind that you can only use match for the beginning of strings, as is also clear from the documentation (https://docs.python.org/2/library/re.html):
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
You should use search in this case:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
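As a quick check, the three patterns tried in the question all behave the same way once search is used. This is just a sketch with my own variable names:

```python
import re

text = "ab_c"
patterns = [r"_", r"[_]", r"\_"]  # plain, bracketed, and escaped underscore

# match fails for all three because the string starts with "a".
match_results = [re.match(p, text) for p in patterns]

# search finds the underscore at the same place for all three.
search_spans = [re.search(p, text).span() for p in patterns]

print(match_results)
print(search_spans)
```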

re.match vs re.search

If I do this
import re
m = re.compile("[0-9]{1,}Y")
res = m.search("AUD3M25Y_EOD2")
if res:
    print(res.group(0)[:-1])
I will get 25 as an answer.
However, if I do
import re
m = re.compile(".*([0-9]{1,})Y.*")
res = m.match("AUD3M25Y_EOD2")
if res:
    print(res.group(1))
I will get only 5.
Why the difference?
Does it have anything to do with 'global' option? (much like s///g in vi)
In your match, the first .* is greedy, it is matching as much as it can, including numbers.
If you make it less greedy, it will work:
.*?([0-9]{1,})Y.*
(PS I think this greedy issue doesn't make it a fair comparison of re.search and re.match)
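A brief sketch of the greedy versus lazy behaviour described above, using the string from the question. The variable names are mine, and [0-9]+ is used as an equivalent shorthand for [0-9]{1,}.

```python
import re

s = "AUD3M25Y_EOD2"

# Greedy .* grabs as much as it can and gives back only one digit.
greedy = re.match(r".*([0-9]+)Y.*", s).group(1)

# Lazy .*? stops as early as possible, leaving both digits to the group.
lazy = re.match(r".*?([0-9]+)Y.*", s).group(1)

# search with no leading .* finds the digits directly.
searched = re.search(r"[0-9]+Y", s).group(0)[:-1]

print(greedy, lazy, searched)
```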
Please read the documentation first. As you should expect, it has the answers.
re.search:
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
re.match:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
Also, on the same page, Matching vs. Searching:
Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by default).
