Python regex re.sub maintain double quotes in input string [duplicate] - python

Say I want to match the presence of the phrase Sortes\index[persons]{Sortes} in the phrase test Sortes\index[persons]{Sortes} text.
Using python re I could do this:
>>> import re
>>> search = re.escape(r'Sortes\index[persons]{Sortes}')
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>
This works, but I want to prevent the bare search pattern Sortes from giving a positive result on the phrase test Sortes\index[persons]{Sortes} text.
>>> re.search(re.escape('Sortes'), match)
<_sre.SRE_Match object; span=(5, 11), match='Sortes'>
So I use the \b pattern, like this:
>>> search = r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'\b'
>>> match = r'test Sortes\index[persons]{Sortes} text'
>>> re.search(search, match)
Now, I don't get a match.
If the search pattern does not contain any of the characters []{}, it works. E.g.:
>>> re.search(r'\b' + re.escape(r'Sortes\index') + r'\b', r'test Sortes\index test')
<_sre.SRE_Match object; span=(5, 17), match='Sortes\\index'>
If I remove the final r'\b', it also works:
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}'), r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 34), match='Sortes\\index[persons]{Sortes}'>
Furthermore, the documentation says this about \b:
Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string.
So I tried replacing the final \b with (\W|$):
>>> re.search(r'\b' + re.escape(r'Sortes\index[persons]{Sortes}') + r'(\W|$)', r'test Sortes\index[persons]{Sortes} test')
<_sre.SRE_Match object; span=(5, 35), match='Sortes\\index[persons]{Sortes} '>
Lo and behold, it works!
What is going on here? What am I missing?

See what a word boundary matches:
A word boundary can occur in one of three positions:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
In your pattern, }\b only matches if there is a word char (a letter, digit or _) immediately after }, because } itself is a non-word char.
When you use (\W|$) you require a non-word or end of string explicitly.
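The point above can be checked directly. The following is a minimal sketch using the strings from the question; the variable names are only illustrative. It shows that the trailing \b fails when } is followed by a space, and succeeds when } is followed by a word char.

```python
import re

text = r'test Sortes\index[persons]{Sortes} text'
needle = re.escape(r'Sortes\index[persons]{Sortes}')

# The trailing \b sits between '}' and ' ', both non-word chars, so it cannot match.
with_trailing_b = re.search(r'\b' + needle + r'\b', text)

# If a word char directly follows '}', the same trailing \b does match.
followed_by_word = re.search(r'\b' + needle + r'\b', text.replace('} ', '}x'))

print(with_trailing_b)          # None
print(followed_by_word.span())  # (5, 34)
```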
A solution is adaptive word boundaries:
re.search(r'(?:(?!\w)|\b(?=\w)){}(?:(?<=\w)\b|(?<!\w))'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Or, equivalently:
re.search(r'(?!\B\w){}(?<!\w\B)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Here, adaptive (dynamic) word boundaries are used. They mean the following:
(?:(?!\w)|\b(?=\w)) (equal to (?!\B\w)) - a left-hand boundary, making sure the current position is at the word boundary if the next char is a word char, or no context restriction is applied if the next char is not a word char (note that you will need to use (?:\B(?!\w)|\b(?=\w)) if you want to disallow a word char immediately on the left if the next char is not a word char)
(?:(?<=\w)\b|(?<!\w)) (equal to (?<!\w\B)) - a right-hand boundary, making sure the current position is at the word boundary if the previous char is a word char, or no context restriction is applied if the previous char is not a word char (note that you will need to use (?:(?<=\w)\b|\B(?<!\w)) if you want to disallow a word char immediately on the right if the preceding char is not a word char).
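As a rough sketch of how the two adaptive boundaries could be packaged for reuse, the helper below (adaptive_boundary_pattern is a hypothetical name, not part of the re module) wraps an escaped literal in the shorter (?!\B\w) and (?<!\w\B) forms.

```python
import re

def adaptive_boundary_pattern(literal):
    """Wrap an escaped literal in adaptive word boundaries (hypothetical helper)."""
    return r'(?!\B\w)' + re.escape(literal) + r'(?<!\w\B)'

pattern = re.compile(adaptive_boundary_pattern(r'Sortes\index[persons]{Sortes}'))

# The literal starts with a word char, so the left side behaves like \b.
hit = pattern.search(r'test Sortes\index[persons]{Sortes} text')

# Here the literal is glued to "test" on the left, so the left boundary rejects it.
miss = pattern.search(r'testSortes\index[persons]{Sortes} text')

print(hit.span())  # (5, 34)
print(miss)        # None
```

Building the pattern by concatenation rather than str.format avoids any clash between the {} placeholder and braces in the regex itself.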
You might also consider using unambiguous word boundaries based on negative lookarounds in these cases:
re.search(r'(?<!\w){}(?!\w)'.format(re.escape(r'Sortes\index[persons]{Sortes}')), r'test Sortes\index[persons]{Sortes} test')
Here, (?<!\w) negative lookbehind will fail the match if there is a word char immediately to the left of the current location, and (?!\w) negative lookahead will fail the match if there is a word char immediately to the right of the current location.
Which to choose? Adaptive word boundaries are more lenient than unambiguous word boundaries: the latter require that there be no word chars on either side of the match, while the former still match when the pattern starts or ends with a non-word char, whatever the surrounding context.
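The difference shows up when a word char directly follows the closing }. A small sketch under that assumption (the test string is made up for illustration):

```python
import re

literal = re.escape(r'Sortes\index[persons]{Sortes}')
adaptive = re.compile(r'(?!\B\w)' + literal + r'(?<!\w\B)')
unambiguous = re.compile(r'(?<!\w)' + literal + r'(?!\w)')

# A word char ("t") immediately follows the closing '}'.
glued = r'test Sortes\index[persons]{Sortes}text'

adaptive_hit = adaptive.search(glued)        # the match ends in '}', so no restriction on the right
unambiguous_hit = unambiguous.search(glued)  # (?!\w) rejects the following 't'

print(adaptive_hit.span())  # (5, 34)
print(unambiguous_hit)      # None
```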
Note: It is easy to customize these lookaround patterns further. For example, to fail the match only if there are letters around the pattern, use [^\W\d_] instead of \w; to allow matches only between whitespace (or at the start/end of the string), use the whitespace boundaries (?<!\S) and (?!\S).
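A brief sketch of both customizations, with made-up test strings, might look like this:

```python
import re

literal = re.escape(r'Sortes\index[persons]{Sortes}')

# Whitespace boundaries: the match must be surrounded by whitespace or string edges.
ws_bounded = re.compile(r'(?<!\S)' + literal + r'(?!\S)')
ws_ok = ws_bounded.search(r'test Sortes\index[persons]{Sortes} text')
ws_rejected = ws_bounded.search(r'test (Sortes\index[persons]{Sortes}) text')

# Letter boundaries: only letters next to the match make it fail; digits are allowed.
letter_bounded = re.compile(r'(?<![^\W\d_])' + literal + r'(?![^\W\d_])')
digits_ok = letter_bounded.search(r'1Sortes\index[persons]{Sortes}2')
letters_rejected = letter_bounded.search(r'aSortes\index[persons]{Sortes}b')

print(bool(ws_ok), bool(ws_rejected))           # True False
print(bool(digits_ok), bool(letters_rejected))  # True False
```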

I think this is what you're running into:
\b matches only on the boundary between a \w and a \W character. In the example that doesn't work, the \b after '{Sortes}' sits between '}' and a space, which are both \W: '}' doesn't match [a-zA-Z0-9_], the ordinary set for \w, so there is no boundary there.

Related

How can you make \b accept words that start with '+' using regex in python?

re.search(r"\b\+359\b","Is your phone number +359 887438?")
Why is this regex not finding +359, and how can I make \b consider words starting with +?
You can't alter \b's behaviour. You'd have to use a different anchor, such as \B, which matches at any position that is not a word boundary; it is the inverse of \b:
\B\+359\b
This matches only if there is no word character preceding +, which is itself a non-word character. Where \b can only match between a word and a non-word character (so WORD\bNONWORD or NONWORD\bWORD), \B needs two non-word or two word characters to match (so WORD\BWORD or NONWORD\BNONWORD). As + is a non-word character, whatever comes before + must also be a non-word character.
Alternatively, you can use a negative look-behind:
(?<!\w)\+359\b
The (?<!\w) negative look-behind assertion only matches a position where there is no word character preceding the position.
Demo:
>>> import re
>>> re.search(r"\b\+359\b","Is your phone number +359 887438?")
>>> re.search(r"\B\+359\b","Is your phone number +359 887438?")
<_sre.SRE_Match object at 0x10cf03850>
>>> re.search(r"(?<!\w)\+359\b","Is your phone number +359 887438?")
<_sre.SRE_Match object at 0x10cf03d30>
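The look-behind approach generalizes to arbitrary tokens. The sketch below assumes a hypothetical helper, find_token, which escapes the token and requires that no word character touch it on either side; it is one possible way to package the idea, not the only one.

```python
import re

def find_token(token, text):
    """Search for `token` with no word character on either side (hypothetical helper)."""
    return re.search(r'(?<!\w)' + re.escape(token) + r'(?!\w)', text)

found = find_token('+359', 'Is your phone number +359 887438?')
rejected = find_token('+359', 'code+3590')  # word chars touch the token on both sides

print(found.group(), found.start())  # +359 21
print(rejected)                      # None
```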
