insert string at beginning of regex in python - python

In below code, the string "Graph" is replacing the matched regex:
htmlText = re.sub("[0-9]*/index.html", 'Graph', htmlText, re.MULTILINE|re.DOTALL)
But the problem is, I want to prepend 'Graph' to the beginning of the matched '[0-9]*/index.html' expression, not replace it.

You want to capture the match (by surrounding your regex with parens), then backreference it (via \1), using a raw string (via r before the replacement string) to prevent the backslash from being treated as an escape character:
In [1]: import re
In [2]: htmlText = "5/index.html"
In [3]: re.sub("([0-9]*/index.html)", r'Graph\g<1>', htmlText, re.MULTILINE|re.DOTALL)
Out[3]: 'Graph5/index.html'
Edit: Changed r'Graph\1' to r'Graph\g<1>' above, since that's more reliable in case someone uses this answer in a context where the backreference is followed by another number -- see docs https://docs.python.org/2/library/re.html#re.sub which cite:
\g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0
Note: Example above uses Python 2.7.6.

Related

Regular expression error: unbalanced parenthesis at position n

I have been meaning to extract the month name from the following string with regex and despite the fact that my regex works on a platform like regex101, I can't seem to be able to extract the word "August".
import re
s = "word\anyword\2021\August\202108_filename.csv"
re.findall("\d+\\([[:alpha:]]+)\\\d+", s)
Which results in the following error:
error: unbalanced parenthesis at position 17
I also tried using re.compile, re.escape as per suggestions of the previous posts dealing with the same error but none of them seems to work.
Any help and also a little explanation on why this isn't working is greatly appreciated.
You can use
import re
s = r"word\anyword\2021\August\202108_filename.csv"
m = re.search(r"\d+\\([a-zA-Z]+)\\\d+", s)
if m:
print(m.group(1))
See the Python demo.
There are three main problems here:
The input string should be the same as used at regex101.com, i.e. you need to make sure you are using literal backslashes in the Python code, hence the use of raw string literals for both the input text and regex
The POSIX character classes are not supported by Python re, so [[:alpha:]]+ should be replaced with some equivalent pattern, say, [A-Za-z]+ or [^\W\d_]+
Since it seems like you only expect a single match (there is only one August (month) name in the string), you do not need re.findall, you can use re.search. Only use re.findall when you need to extract multiple matches from a string.
Also, see these posts:
Python regex - r prefix
What does the "r" in pythons re.compile(r' pattern flags') mean?
What exactly do "u" and "r" string flags do, and what are raw string literals?

re.fullmatch equivalent in pandas text handling [duplicate]

I'm trying to check if a string is a number, so the regex "\d+" seemed good. However that regex also fits "78.46.92.168:8000" for some reason, which I do not want, a little bit of code:
class Foo():
_rex = re.compile("\d+")
def bar(self, string):
m = _rex.match(string)
if m != None:
doStuff()
And doStuff() is called when the ip adress is entered. I'm kind of confused, how does "." or ":" match "\d"?
\d+ matches any positive number of digits within your string, so it matches the first 78 and succeeds.
Use ^\d+$.
Or, even better: "78.46.92.168:8000".isdigit()
There are a couple of options in Python to match an entire input with a regex.
Python 2 and 3
In Python 2 and 3, you may use
re.match(r'\d+$') # re.match anchors the match at the start of the string, so $ is what remains to add
or - to avoid matching before the final \n in the string:
re.match(r'\d+\Z') # \Z will only match at the very end of the string
Or the same as above with re.search method requiring the use of ^ / \A start-of-string anchor as it does not anchor the match at the start of the string:
re.search(r'^\d+$')
re.search(r'\A\d+\Z')
Note that \A is an unambiguous string start anchor, its behavior cannot be redefined with any modifiers (re.M / re.MULTILINE can only redefine the ^ and $ behavior).
Python 3
All those cases described in the above section and one more useful method, re.fullmatch (also present in the PyPi regex module):
If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
So, after you compile the regex, just use the appropriate method:
_rex = re.compile("\d+")
if _rex.fullmatch(s):
doStuff()
re.match() always matches from the start of the string (unlike re.search()) but allows the match to end before the end of the string.
Therefore, you need an anchor: _rex.match(r"\d+$") would work.
To be more explicit, you could also use _rex.match(r"^\d+$") (which is redundant) or just drop re.match() altogether and just use _rex.search(r"^\d+$").
\Z matches the end of the string while $ matches the end of the string or just before the newline at the end of the string, and exhibits different behaviour in re.MULTILINE. See the syntax documentation for detailed information.
>>> s="1234\n"
>>> re.search("^\d+\Z",s)
>>> s="1234"
>>> re.search("^\d+\Z",s)
<_sre.SRE_Match object at 0xb762ed40>
Change it from \d+ to ^\d+$

Compiling user defined regex object and searching/matching in Python [duplicate]

This question already has answers here:
How to create raw string from string variable in python?
(3 answers)
Closed 4 years ago.
I am writing an online regex checker, which takes input from the user in the form of a pattern, and flags, and uses that to compile a regex object. The regex object is then used to check if the test string matches within the format provided by the regex pattern or not. As of this moment, the compile function looks like this:
class RegexObject:
...
def compile(self):
flags = ''
if self.multiline_flag:
flags = re.M
if self.dotall_flag:
flags |= re.S
if self.verbose_flag:
flags |= re.X
if self.ignorecase_flag:
flags |= re.I
if self.unicode_flag:
flags |= re.U
regex = re.compile(self.pattern, flags)
return regex
Please note, the self.pattern and all the flags are class attributes defined by the user using a simple form. However, one thing I noticed in the docs is that there is usually an r before the pattern in the compile functions, like this:
re.compile(r'(?<=abc)def')
How do I place that r in my code before my variable name? Also, if I want to tell the user if the test input is valid or not, should I be using the match method, or the search method?
Thanks for any help.
Edit: This question is not a duplicate of this one, because that question has nothing to do with regular expressions.
Don't worry about the r, you don't need it here.
The r stands for "raw", not "regex". In an r string, you can put backslashes without escaping them. R strings are often used in regexes because there are often many backslashes in regexes. Escaping the backslashes can be annoying. See this shell output:
>>> s = r"\a"
>>> s2 = "\a"
>>> s2
'\x07'
>>> s
'\\a'
And you should use search, as match only looks at the start of the string. Look at the docs.
re.search(pattern, string, flags=0)
Scan through string looking for
the first location where the regular expression pattern produces a
match, and return a corresponding match object. Return None if no
position in the string matches the pattern; note that this is
different from finding a zero-length match at some point in the
string.
re.match(pattern, string, flags=0)
If zero or more characters at the
beginning of string match the regular expression pattern, return a
corresponding match object. Return None if the string does not match
the pattern; note that this is different from a zero-length match.
Note that even in MULTILINE mode, re.match() will only match at the
beginning of the string and not at the beginning of each line.
If you want to locate a match anywhere in string, use search() instead
(see also search() vs. match()).
You need not use r.Instead you should use re.escape.match or search again should be user input.

How does re.search match raw strings?

re.search(r'c\.t', 'c.t abc') matches successfully to c.t. But the pattern being matched is c\.t, how is c.t matching to c\.t? What happened to the backslash?
Inside a regular expression, the dot character has a special meaning, which is that it can match any character at all other than a newline (unless the re.S/re.DOTALL flag is used). In this case, the backslash has the effect of escaping the dot from its special meaning and letting the regular expression engine interpret it as literally matching only a dot (and no other character). Consider if the backslash is not there:
>>> re.search(r'c.t', 'c.t abc')
<_sre.SRE_Match object at 0x7fe7378d8370>
The original string you provided as input still matches. But now the following will also match:
>>> re.search(r'c.t', 'I saw a cat')
<_sre.SRE_Match object at 0x7fe7378d83d8>
Because the a in cat qualifies as any non-newline character, which is what . will match if unescaped with a backslash. You can see that if we add the backslash back in, it no longer matches.
>>> print(re.search(r'c\.t', 'I saw a cat'))
None
More on Python's implementation of regular expressions here:
Python 2.7.x: https://docs.python.org/2/library/re.html
Python 3.4.x: https://docs.python.org/3/library/re.html
Edited to reflect #cdarke's excellent point about newlines

How to ignore \n in regular expressions in python?

So i have a regex telling if a number is integer.
regex = '^(0|[1-9][0-9]*)$'
import re
bool(re.search(regex, '42\n'))
returns True, and it is not supposed to?
Where does the problem come from ?
From the documentation:
'$'
Matches the end of the string or just before the newline at the end of the string
Try \Z instead.
Also, any time you find yourself writing a regular expression that starts with ^ or \A and ends with $ or \Z, if your intent is to only match the entire string, you should probably use re.fullmatch() instead of re.search() (and omit the boundary markers from the regex). Or if you're using a version of Python that's too old to have re.fullmatch(), (you really need to upgrade but) you can use re.match() and omit the beginning-of-string boundary marker.
regex ahould be regex = '\b^(0|[1-9][0-9]*)$\b'
The regex in the question matches ->start of line, numbers and end of line. And the given string matches that, thats why it is returning true. If you want it to return False when there is a number present, you can use "!" to indicate NOT.
Refer https://docs.python.org/2/library/re.html
regex = '!(0|[1-9][0-9]*)$'
bool(re.search(regex, '42\n')) => (Returns false)
Yeah, that $ matching one \n before the end is kind of trap/inconsistency. Check out my list of regex traps for python: http://www.cofoh.com/advanced-regex-tutorial-python/traps

Categories