Is there a version of Python's re.sub() that acts like str.rfind and begins searching from the last match occurrence?
I want to do a regex substitution on the last match in a string, but there doesn't seem to be an out-of-the-box solution in the stdlib.
If you mean it literally, no. That's not how the regex engine works.
You can either reverse the string and apply re.sub(pattern, sub, string, count=1) with a reversed pattern, as one of the comments suggested.
Or you can construct a regex that matches only the last occurrence, like below:
>>> import re
>>> s = "hello hello hello hello hello world"
>>> re.sub(r"hello(?!.*hello.*$)", "hay", s)
'hello hello hello hello hay world'
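The reversal idea mentioned above can be sketched as follows for a literal search string. This is only a minimal sketch: reversing an arbitrary regex is not mechanical, so the helper replace_last_literal (a name made up for this example) handles plain text only.

```python
import re

def replace_last_literal(text, old, new):
    """Replace the last occurrence of the literal string `old` (hypothetical helper)."""
    # Reverse the text and the literal, replace the first hit, then reverse back.
    reversed_pattern = re.escape(old[::-1])
    # A callable replacement avoids backslash processing in the replacement text.
    reversed_text = re.sub(reversed_pattern, lambda m: new[::-1], text[::-1], count=1)
    return reversed_text[::-1]

result = replace_last_literal("hello hello hello hello hello world", "hello", "hay")
print(result)  # hello hello hello hello hay world
```

If the search string does not occur at all, the text comes back unchanged.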
You could use re.sub in the ordinary way but start the regexp with a (.*) to match as much of the string as possible, and then in the replacement you could use \1 to include unchanged the part that the .* matched.
>>> re.sub("(.*)a", r"\1A", "bananas")
'bananAs'
(Note here the r to ensure that the \ is passed verbatim to re.sub and not treated as starting an escape sequence.)
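One caveat with the (.*) approach: . does not match newlines by default, so on multi-line input the greedy prefix is applied line by line. A small sketch, assuming you want only the very last match in the whole string (the sample text is made up):

```python
import re

text = "banana\nbananas"

# Without DOTALL, "." stops at the newline, so the last "a" on each line is replaced.
per_line = re.sub(r"(.*)a", r"\1A", text)

# With DOTALL, "(.*)" can span lines, so only the final "a" in the string is replaced.
last_only = re.sub(r"(.*)a", r"\1A", text, flags=re.DOTALL)

print(repr(per_line))   # 'bananA\nbananAs'
print(repr(last_only))  # 'banana\nbananAs'
```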
I'm trying to locate the symbol " in a large text, but only where it is immediately preceded and followed by a word character. I then want to replace it with this symbol without changing the character before and after it: '
I tried this:
text7 = re.sub(r'(\w)"(\w)', r"$1\'$2", text6)
For the word "it"s", all I get now is i$1'$2. What I want is "it's".
Any suggestions?
Use a lookbehind and a lookahead; these only assert what surrounds the quote without consuming it, so the neighbouring characters are left untouched by the replacement:
text7 = re.sub(r'(?<=\w)"(?=\w)', "'", text6)
For help with the re module, I recommend running help(re) in your interpreter (or pydoc re from the command line). It's laid out really conveniently and I find it easier to follow than the online documentation.
Solution:
>>> import re
>>> text6 = 'it"s'
>>> print(re.sub(r'(\w)"(\w)', r"\1'\2", text6))
it's
You used $1 to reference group 1, but in Python it's \1. Also, you had an extra \ in front of the single quote in the replacement string.
You can just match \b"\b and replace it with '. \b is a word boundary: it matches, without consuming characters, at any position where one of the following would match: ^\w|\w$|\W\w|\w\W.
import re
print(re.sub(r'\b"\b', "'", 'it"s'))
P.S. In Python, \1 or \g<1> is used to reference a capture group; $1 is not a group reference and is inserted literally. See Python's re.sub() documentation for more information.
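To illustrate the two reference styles, here is a short sketch. The second input string is made up for the example; it shows the case where \g<1> is required because a literal digit follows the reference.

```python
import re

# \1 and \g<1> are interchangeable when nothing ambiguous follows the group number.
fixed = re.sub(r'(\w)"(\w)', r"\g<1>'\g<2>", 'it"s')
print(fixed)  # it's

# \g<1> is needed when a literal digit follows the reference:
# r"\10" would be read as a reference to group 10, not group 1 followed by "0".
padded = re.sub(r"(\d)x", r"\g<1>0", "5x")
print(padded)  # 50
```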
Given the following string as input:
[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0
I'm trying to match the value of subj, i.e. in the above case the expected output would be cli.
I don't understand why my regex is not working:
subj = re.match(r"(.*)subj=(.*?)|(.*)", line).group(2)
From what I can tell, the second group in here should be cli but I'm getting an empty result.
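The | has a special meaning in regex (it creates an alternation), so your pattern is read as (.*)subj=(.*?) or (.*), and the lazy (.*?) is satisfied by an empty string. Escape the pipe to match it literally:
>>> re.match(r"(.*)subj=(.*?)\|", line).group(2)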
'cli'
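To see what the unescaped pipe does, it helps to print all the groups. A small sketch using a shortened, made-up version of the log line:

```python
import re

line = "mod=syn|subj=cli|os=Windows 7 or 8"

# Unescaped: "|" splits the pattern into "(.*)subj=(.*?)" OR "(.*)".
# The first alternative succeeds, and the lazy (.*?) is satisfied with nothing.
unescaped = re.match(r"(.*)subj=(.*?)|(.*)", line)
print(unescaped.groups())  # ('mod=syn|', '', None)

# Escaped: the lazy group must now stretch up to the literal pipe.
escaped = re.match(r"(.*)subj=(.*?)\|", line)
print(escaped.group(2))  # cli
```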
Another Solution
You can use re.search() instead, which lets you drop the group before subj= and the one after the |.
Example
>>> re.search(r"subj=(.*?)\|", line).group(1)
'cli'
Here we use group(1), since only one group is captured instead of the several in the previous version.
Read about the differences between search and match
Complex version
You can even get rid of all the capturing if you use lookarounds:
>>> re.search(r"(?<=subj=).*?(?=\|)", line).group(0)
'cli'
(?<=subj=) Checks that the string matched by .*? is preceded by subj=.
.*? Matches anything, non-greedily.
(?=\|) Checks that the match is followed by a |.
I'd recommend the following regex, which should perform better thanks to two changes:
anchoring the pattern at the beginning of the line with ^
using the negated character class [^\|]* instead of the lazy (.*?)
Code
subj = re.match(r"^.*\|subj=([^\|]*)", line).group(1)
regex:
^.*\|subj=([^\|]*)
You need to escape the |. Use the following:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", line).group(2)
                                ^
The pipe sign | needs to be escaped, like so:
subj = re.match(r"(.*)subj=(.*?)\|(.*)", s).group(2)
I would use a negated class [^|]* with re.search for better performance:
import re
p = re.compile(r'^(.*)subj=([^|]*)\|(.*)$')
test_str = "[2015/06/09 14:21:59] mod=syn|cli=192.168.1.99/49244|srv=192.168.1.100/80|subj=cli|os=Windows 7 or 8|dist=0|params=none|raw_sig=4:128+0:0:1460:8192,8:mss,nop,ws,nop,nop,sok:df,id+:0"
print(p.search(test_str).group(2))
Note that I am not mixing lazy and greedy quantifiers in the regex (mixing them is usually not advisable).
The pipe symbol must be escaped to be treated as a literal | symbol.
REGEX EXPLANATION:
^ - Start of string
(.*) - The first capturing group, matching any characters from the beginning of the string up to subj=
subj= - A literal string subj=
([^|]*) - The second capturing group matching any characters other than a literal pipe (inside a character class, it does not need escaping)
\| - A literal pipe (must be escaped)
(.*) - The third capturing group (in case you need the rest of the string, up to the end)
$ - End of string
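The explanation above can also live inside the pattern itself by compiling with re.VERBOSE, which ignores whitespace and allows # comments. A sketch, using a shortened, made-up version of the log line:

```python
import re

pattern = re.compile(r"""
    ^          # start of string
    (.*)       # group 1: everything before "subj="
    subj=      # the literal text "subj="
    ([^|]*)    # group 2: any run of characters other than a pipe
    \|         # a literal pipe
    (.*)       # group 3: the rest of the string
    $          # end of string
    """, re.VERBOSE)

sample = "mod=syn|subj=cli|os=Windows 7 or 8"
m = pattern.match(sample)
print(m.group(2))  # cli
```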
Suppose I want to prepend all occurrences of a particular expression with a character such as \.
In sed, it would look like this.
echo '__^^^%%%__FooBar' | sed 's/[_^%]/\\&/g'
Note that the & character is used to represent the original matched expression.
I have looked through the regex docs and the regex howto, but I do not see an equivalent to the & character that can be used to substitute in the matched expression.
The only workaround I have found is to use an extra set of () to group the expression and then reference the group, as follows.
import re
line = "__^^^%%%__FooBar"
print(re.sub("([_%^$])", r"\\\1", line))
Is there a clean way to reference the entire matched expression without the extra group creation?
From the docs:
The backreference \g<0> substitutes in the entire substring matched by the RE.
Example:
>>> print(re.sub("[_%^$]", r"\\\g<0>", line))
\_\_\^\^\^\%\%\%\_\_FooBar
You could also get the same result by using a positive lookahead.
>>> print(re.sub("(?=[_%^$])", r"\\", line))
\_\_\^\^\^\%\%\%\_\_FooBar
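A third option, not mentioned in the answers above, is to pass a function as the replacement; it receives the match object, and group(0) is the whole match. A minimal sketch:

```python
import re

line = "__^^^%%%__FooBar"

# A callable replacement is called once per match and returns the new text.
escaped = re.sub(r"[_%^$]", lambda m: "\\" + m.group(0), line)
print(escaped)  # \_\_\^\^\^\%\%\%\_\_FooBar
```

This avoids any backslash escaping rules in the replacement string, at the cost of a function call per match.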
I'm trying to find strings that have trailing whitespace, i.e. 'foo ' as opposed to 'foo'.
In Perl, I would use:
$str = 'foo ';
print "Match\n" if ($str =~ /\s+$/) ;
When I try this in Python 2.6, e.g.:
import re
str = 'foo '
if re.match('\s+$', str):
    print 'Match'
it doesn't match. I feel like I'm missing something obvious but I can't figure out what I'm doing wrong.
Use re.search() instead; re.match() only matches at the start of a string. Quoting the re.match() documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.
Emphasis mine.
In other words, re.match() is the equivalent of an anchored match such as m/\A.../ in Perl, while re.search() is the same as a plain /.../.
It fails because re.match(r'\s+$', str) is equivalent to re.search(r'\A\s+$', str). Use re.search instead.
From docs:
re.match() checks for a match only at the beginning of the string,
while re.search() checks for a match anywhere in the string (this is
what Perl does by default).
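The difference is easy to check side by side. A minimal sketch; the helper name has_trailing_whitespace is made up for this example:

```python
import re

def has_trailing_whitespace(s):
    # re.search scans the whole string, so the pattern can match at the end.
    return re.search(r"\s+$", s) is not None

print(re.match(r"\s+$", "foo "))       # None: "f" at position 0 is not whitespace
print(has_trailing_whitespace("foo "))  # True
print(has_trailing_whitespace("foo"))   # False
```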
I have this string:
'Is?"they'
I want to find the question mark (?) in the string, and put it at the end of the string. The output should look like this:
'Is"they?'
I am using the following regular expression in python 2.7. I don't know why my regex is not working.
import re
regs = re.sub('(\w*)(\?)(\w*)', '\\1\\3\\2', 'Is?"they')
print regs
Is?"they # this is the output of my regex.
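Your regex does match, but " is not in the \w character class, so the third group captures an empty string and the replacement leaves the text unchanged. You would need to let the third group accept " as well, with something like:
regs = re.sub(r'(\w*)(\?)(["\w]*)', '\\1\\3\\2', 'Is?"they')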
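Printing the captured groups shows why the original pattern left the string as it was. A small sketch comparing the original third group with one that also accepts the double quote:

```python
import re

text = 'Is?"they'

# Original pattern: \w* cannot cross the double quote, so group 3 is empty.
original = re.search(r'(\w*)(\?)(\w*)', text)
print(original.groups())  # ('Is', '?', '')

# Widened third group: the character class accepts " as well as word characters.
widened = re.search(r'(\w*)(\?)(["\w]*)', text)
print(widened.groups())  # ('Is', '?', '"they')
```

With an empty third group, \1\3\2 rebuilds exactly the text that was matched, so nothing appears to change.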
The " character is not matched by \w. Hence, it would probably be best to just use a .:
>>> import re
>>> re.sub(r"(.*)(\?)(.*)", r'\1\3\2', 'Is?"they')
'Is"they?'
. matches any character in a regex (except newlines, unless re.DOTALL is set).
Also, you'll notice that I used a raw-string for the second argument of re.sub. Doing so is cleaner than having all those backslashes.
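As a quick check of the raw-string point, the two spellings of the replacement are the same string, so they behave identically in re.sub. A minimal sketch:

```python
import re

raw = r'\1\3\2'
escaped = '\\1\\3\\2'
print(raw == escaped)  # True

moved = re.sub(r"(.*)(\?)(.*)", raw, 'Is?"they')
print(moved)  # Is"they?
```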