Python Regular Expression Why Quantifier (+) is not greedy

Input: asjkd http://www.as.com/as/g/ff askl
Expected output: http://www.as.com/as/g/ff
When I try the code below, I get the expected output:
import re
pattern = re.compile(r'http[\w./:]+')
print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
Why isn't the + quantifier greedy here? I expected it to be greedy and consume more of the string. In this case, not being greedy is actually what seems to produce the right answer.

It is greedy. It stops matching when it hits the space because [\w./:] doesn't match a space. A space isn't a word character (alphanumeric or underscore), dot, slash, or colon.
Change + to +? and you can see what happens when it's non-greedy.
Greedy
>>> pattern=re.compile(r'http[\w./:]+')
>>> print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
<re.Match object; span=(6, 31), match='http://www.as.com/as/g/ff'>
Non-greedy
>>> pattern=re.compile(r'http[\w./:]+?')
>>> print(pattern.search("asjkd http://www.as.com/as/g/ff askl"))
<re.Match object; span=(6, 11), match='http:'>
The non-greedy version matches only a single character after http: the colon.
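To make the point concrete, here is a small sketch that runs the greedy and lazy patterns side by side, plus a third pattern whose character class also allows a space. That third pattern is an assumption added for illustration only; it shows that the greedy + keeps going as long as the class permits.

```python
import re

text = "asjkd http://www.as.com/as/g/ff askl"

# Greedy: + takes as many class characters as it can, then stops at the space.
greedy = re.search(r'http[\w./:]+', text)
# Lazy: +? takes as few as it can, which is exactly one character here.
lazy = re.search(r'http[\w./:]+?', text)
# Greedy with a class that also allows spaces (illustrative assumption):
# the match now runs to the end of the input.
wide = re.search(r'http[\w./: ]+', text)

print(greedy.group())  # http://www.as.com/as/g/ff
print(lazy.group())    # http:
print(wide.group())    # http://www.as.com/as/g/ff askl
```

The quantifier is greedy in all but the second case; what limits the first match is the character class, not the quantifier.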

Related

Python regex re.sub maintain double quotes in input string [duplicate]

Say I want to match the presence of the phrase Sortes\index[persons]{Sortes} in the phrase test Sortes\index[persons]{Sortes} text.
Using python re I could do this:
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>
This works, but I want to prevent the search pattern Sortes from giving a positive result on the phrase test Sortes\index[persons]{Sortes} text.
>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>
So I use the \b pattern, like this:
search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
match = r'test Sortes\index[persons]{Sortes} text'
re.search(search, match)
Now, I don't get a match.
If the search pattern does not contain any of the characters []{}, it works. E.g.:
>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>
Also, if I remove the final r'\b', it works:
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>
Furthermore, the documentation says about \b
Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.
So I tried replacing the final \b with (\W|$):
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>
Lo and behold, it works!
What is going on here? What am I missing?

See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
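In your pattern, }\b only matches if there is a word character (a letter, digit, or _) right after }, because } itself is not a word character. In your input, } is followed by a space, so the match fails.
When you use (\W|$), you explicitly require a non-word character or the end of the string, which is why that version matches.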
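A quick way to check this rule is to test \b directly after } and after a letter. The sample strings below are made up for illustration.

```python
import re

# '}' is not a word character, so '\b' after it needs a word character next.
after_brace_space = re.search(r'\}\b', 'a} b')
after_brace_letter = re.search(r'\}\b', 'a}b')
# After a word character, '\b' is satisfied by a following space.
after_letter_space = re.search(r's\b', 'Sortes text')

print(after_brace_space)               # None
print(after_brace_letter is not None)  # True
print(after_letter_space is not None)  # True
```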
A solution is adaptive word boundaries:
re.search(r'(?:(?!\w)|\b(?=\w)){}(?:(?<=\w)\b|(?<!\w))'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Or, equivalently:
re.search(r'(?!\B\w){}(?<!\w\B)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
The adaptive (dynamic) word boundaries used here mean the following:
(?:(?!\w)|\b(?=\w)) (equal to (?!\B\w)) - a left-hand boundary, making sure the current position is at the word boundary if the next char is a word char, or no context restriction is applied if the next char is not a word char (note that you will need to use (?:\B(?!\w)|\b(?=\w)) if you want to disallow a word char immediately on the left if the next char is not a word char)
(?:(?<=\w)\b|(?<!\w)) (equal to (?<!\w\B)) - a right-hand boundary, making sure the current position is at the word boundary if the previous char is a word char, or no context restriction is applied if the previous char is not a word char (note that you will need to use (?:(?<=\w)\b|\B(?<!\w)) if you want to disallow a word char immediately on the right if the preceding char is not a word char).
You might also consider using unambiguous word boundaries based on negative lookarounds in these cases:
re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Here, (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.
Which to choose? Adaptive word boundaries are more lenient than unambiguous word boundaries: the latter require that no word character be adjacent to either end of the match, while the former let a pattern's leading and trailing non-word characters match in any context.
Note: It is easy to customize these lookaround patterns further. For example, to fail the match only when there are letters around the pattern, use [^\W\d_] instead of \w; to allow matches only between whitespace, use the whitespace boundaries (?<!\S) and (?!\S).
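As a rough sketch, the two boundary styles described above can be wrapped in small helpers and compared. The function names adaptive and unambiguous are made up for this illustration and are not part of the re module, and the inputs 'x{Sortes}y' and 'xSortes' are assumed samples.

```python
import re

def adaptive(term):
    # Hypothetical helper: adaptive word boundaries around an escaped literal.
    return r'(?!\B\w)' + re.escape(term) + r'(?<!\w\B)'

def unambiguous(term):
    # Hypothetical helper: no word character allowed on either side.
    return r'(?<!\w)' + re.escape(term) + r'(?!\w)'

term = r'Sortes\index[persons]{Sortes}'
text = r'test Sortes\index[persons]{Sortes} test'

# Both styles find the full phrase when it is surrounded by spaces.
print(re.search(adaptive(term), text).span())     # (5, 34)
print(re.search(unambiguous(term), text).span())  # (5, 34)

# They differ when a non-word edge of the pattern touches a word character.
glued = 'x{Sortes}y'  # assumed sample input
print(re.search(adaptive('{Sortes}'), glued) is not None)  # True
print(re.search(unambiguous('{Sortes}'), glued))            # None

# Both reject a word-character edge that sits inside a longer word.
print(re.search(adaptive('Sortes'), 'xSortes'))     # None
print(re.search(unambiguous('Sortes'), 'xSortes'))  # None
```

Note that adaptive('Sortes') still matches the leading Sortes in the question's text, just as \bSortes\b would, because a backslash follows it and a backslash is not a word character.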

I think this is what you're running into:
\b matches only at a boundary between a \w and a \W character. In the example that doesn't work, the position after '{Sortes}' lies between two \W characters: the '}', which is not in [a-zA-Z0-9_] (the ordinary set for \w), and the space that follows it. So the final \b cannot match there.

Regular expressions: How to make my code match the '+' character OR digits

I've just started on regex.
I'm trying to search through a short list of 'phrases' to find UK mobile numbers (starting with +44 or 07, sometimes with the number broken up by one space). I'm having trouble getting it to return numbers starting +44.
This is what I've written:
for snippet in phrases:
    match = re.search("\\b(\+44|07)\\d+\\s?\\d+\\b", snippet)
    if match:
        numbers.append(match)
        print(match)
which prints
<_sre.SRE_Match object; span=(19, 31), match='07700 900432'>
<_sre.SRE_Match object; span=(20, 31), match='07700930710'>
and misses out the number +44770090999 which is in 'phrases.'
I tried with and without the brackets. Without the brackets it would also print the +44 in sums like '10+44=54.' Is the backslash before the +44 necessary? Any ideas on what I'm missing?
Thanks to all!
EDIT: Some of my input:
phrases = ["You can call me on 07700 900432.",
           "My mobile number is 07700930710",
           "My date of birth is 07.08.92",
           "Why not phone me on 202-555-0136?",
           "There are around 7600000000 people on Earth",
           "If you're from overseas, call +44 7700 900190",
           "Try calling +447700900999 now!",
           "56+44=100."]

In your regex the word boundary \b does not match between a whitespace and a plus sign.
Instead, you could match either 07 or +44, then one or more digits or spaces with [\d ]+, then a single digit \d so that the match cannot end in a space, and finally a word boundary \b:
(?:07|\+44)[\d ]+\d\b
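As an illustrative check, the pattern can be run over the sample phrases from the question. The list below repeats that input with a comma after the fourth entry, which is assumed to be what was intended.

```python
import re

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

pattern = re.compile(r'(?:07|\+44)[\d ]+\d\b')

# Keep the matched text for every phrase that contains a number.
found = [m.group() for m in map(pattern.search, phrases) if m]
print(found)
# ['07700 900432', '07700930710', '+44 7700 900190', '+447700900999']
```

The date, the US-style number, the population figure, and the sum are all skipped because the character after 07 or +44 is neither a digit nor a space, or because the prefix never appears.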

The problem with your regex is that the first \b matches the word boundary between the + and the 4. The position between a space and a + is not a word boundary. This means that the regex can't find +44 after the \b, because the + is on the left of the \b and only 44 is on its right.
To fix this, you can use a negative lookbehind to make sure there is no word character immediately before +44. Remember to put it inside the capturing group, because it should only apply when the +44 option is chosen. You still want to match a word boundary when the number starts with 07.
((?<!\w)\+44|\b07)\d+\s?\d+\b
You can put the regex in an r"" (raw) string. This way you don't have to write that many backslashes:
r"((?<!\w)\+44|\b07)\d+\s?\d+\b"
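A minimal sketch comparing the question's pattern with a version that uses a negative lookbehind, written as (?<!\w), in front of +44. The three sample strings come from the question's input.

```python
import re

original = re.compile(r'\b(\+44|07)\d+\s?\d+\b')
fixed = re.compile(r'((?<!\w)\+44|\b07)\d+\s?\d+\b')

plus_number = "Try calling +447700900999 now!"
zero_number = "You can call me on 07700 900432."
arithmetic = "56+44=100."

# The leading \b can never sit between a space and '+', so +44 is missed.
print(original.search(plus_number))       # None
# The lookbehind only checks that no word character precedes the '+'.
print(fixed.search(plus_number).group())  # +447700900999
print(fixed.search(zero_number).group())  # 07700 900432
# In the sum, '6' precedes the '+', so the lookbehind rejects it.
print(fixed.search(arithmetic))           # None
```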

This should help.
import re
phrases = ["Hello +4407700 900432 World", "Hello +44770090999 World"]
for snippet in phrases:
    match = re.search(r"(?P<num>(\+44|07)\d+\s?\d+)", snippet)
    if match:
        print(match.group('num'))
Output:
+4407700 900432
+44770090999

You should be able to cover all cases by first removing the expected "noisy characters" from the string and then simplifying your regex to just "(07|\D44)\d{9}", where:
(07|\D44) matches a number starting with 07, or with 44 preceded by a non-digit character.
\d{9} matches the remaining 9 digits.
Your code should look like this:
cleansnippet = snippet.replace("-","").replace(" ","").replace("(0)","")...
re.search(r"(07|\D44)\d{9}", cleansnippet)
Applying this to your input retrieves this:
<_sre.SRE_Match object; span=(14, 25), match='07700900432'>
<_sre.SRE_Match object; span=(16, 27), match='07700930710'>
<_sre.SRE_Match object; span=(25, 37), match='+44770090019'>
<_sre.SRE_Match object; span=(10, 22), match='+44770090099'>
Hope that helps.
P.S.:
The \ before the + means that you are looking for a literal + sign, rather than "1 or more" of the previous element.
The only reason I propose \D44 instead of \+44 is that it could be safer for you, as people might forget to type the + before their number. :)
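A small sketch of the clean-then-match idea, under the assumption that only the three replacements shown explicitly above are applied (the elided ones are left out). The helper name find_mobile is hypothetical.

```python
import re

def find_mobile(snippet):
    # Hypothetical helper: strip the separators named above, then match.
    cleaned = snippet.replace("-", "").replace(" ", "").replace("(0)", "")
    return re.search(r"(07|\D44)\d{9}", cleaned)

first = find_mobile("You can call me on 07700 900432.")
overseas = find_mobile("If you're from overseas, call +44 7700 900190")
arithmetic = find_mobile("56+44=100.")

print(first.group(), first.span())        # 07700900432 (14, 25)
print(overseas.group(), overseas.span())  # +44770090019 (25, 37)
print(arithmetic)                         # None
```

As in the output listed above, the +44 match stops one digit short of the number in the input, because \d{9} takes only nine digits after 44 while the input has ten.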

Regex: don't match string ending with newline (\n) with end-of-line anchor ($)

I can't figure out how to match a string but not if it has a trailing newline character (\n), which seems automatically stripped:
import re
print(re.match(r'^foobar$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>
print(re.match(r'^foobar$', 'foobar\n'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>
print(re.match(r'^foobar$', 'foobar\n\n'))
# None
For me, the second case should also return None.
When we set the end of a pattern with $, like ^foobar$, it should only match a string like foobar, not foobar\n.
What am I missing?

You most likely want \Z rather than $:
>>> print(re.match(r'^foobar\Z', 'foobar\n'))
None
\Z matches only at the end of the string.
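For comparison, here is a short sketch that runs both anchors against the three inputs from the question and collects the results.

```python
import re

inputs = ['foobar', 'foobar\n', 'foobar\n\n']

# For each input, record whether $ and \Z allow a match.
results = {
    text: (bool(re.match(r'^foobar$', text)), bool(re.match(r'^foobar\Z', text)))
    for text in inputs
}

for text, (dollar, end_z) in results.items():
    print(repr(text), dollar, end_z)
# 'foobar' True True
# 'foobar\n' True False
# 'foobar\n\n' False False
```

Only the single trailing newline case differs: $ tolerates it, \Z does not.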

The documentation says this about the $ character:
Matches the end of the string or just before the newline at the end of
the string, and in MULTILINE mode also matches before a newline.
So, without the MULTILINE option, it matches the first two strings you tried, 'foobar' and 'foobar\n', but not 'foobar\n\n': there the newline that follows foobar is not the one at the very end of the string.
On the other hand, if you choose MULTILINE option, it will match the end of any line:
>>> re.match(r'^foobar$', 'foobar\n\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
Of course, this will also match in the following case, which may or may not be what you want:
>>> re.match(r'^foobar$', 'foobar\nanother line\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
In order to NOT match the ending newline, use the negative lookahead as DeepSpace wrote.

This is the defined behavior of $, as can be read in the docs that @zvone linked to or even on https://regex101.com:
$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
You can use an explicit negative lookahead to counter this behavior:
import re
print(re.match(r'^foobar(?!\n)$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>
print(re.match(r'^foobar(?!\n)$', 'foobar\n'))
# None
print(re.match(r'^foobar(?!\n)$', 'foobar\n\n'))
# None
