Non-greedy match of .* with ^ - Python

Given the string:
s = "Why did you foo bar a <b>^f('y')[f('x').get()]^? and ^f('barbar')^</b>"
How do I replace the ^f('y')[f('x').get()]^ and ^f('barbar')^ with a string, e.g. PLACEXHOLDER?
The desired output is:
Why did you foo bar a <b>PLACEXHOLDER? and PLACEXHOLDER</b>
I've tried re.sub('\^.*\^', 'PLACEXHOLDER', s), but the .* is greedy, so it matches ^f('y')[f('x').get()]^? and ^f('barbar')^ and outputs:
Why did you foo bar a PLACEXHOLDER
There can be an unknown number of substrings delimited by ^, so hardcoding the count like this is not desired:
re.sub('(\^.+\^).*(\^.*\^)', 'PLACEXHOLDER', s)

If you add a question mark after the star, it will make it non-greedy.
\^.*?\^
http://www.regexpal.com/?fam=97647
Why did you foo bar a <b>^f('y')[f('x').get()]^? and ^f('barbar')^</b>
properly becomes
Why did you foo bar a <b>PLACEXHOLDER? and PLACEXHOLDER</b>
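In Python, the non-greedy version looks like this (a quick sketch using the question's string s):
import re
s = "Why did you foo bar a <b>^f('y')[f('x').get()]^? and ^f('barbar')^</b>"
print(re.sub(r'\^.*?\^', 'PLACEXHOLDER', s))
# Why did you foo bar a <b>PLACEXHOLDER? and PLACEXHOLDER</b>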

Related

Replace all variations of a string regardless of comma position Python

I have a data frame with multiple rows that contain different variations of a string separated by commas. Rather than constantly writing variations of code such as df.replace('Word,', ''), I am looking for a simpler way to replace variations of a string in Python. I have heard about regex but am having a difficult time understanding it.
One such example that I am looking into is df.column.str.replace('Word,?', ''), which would replace all variations of "Word" regardless of comma position. However, I am unsure how this works. Any help in understanding replacing with regex would be greatly appreciated. Thank you in advance.
Example:
'Word, foo, bar'
'Word'
'foo, bar, Word'
'foo, Word, bar'
Desired Output:
'foo, bar'
''
'foo, bar'
'foo, bar'
df.replace(to_replace='Word, |(, )?Word', value='', regex=True)
This way the .replace() method will do the required work.
to_replace is our regular expression, passed as a string.
'Word, ' matches 'Word' followed by a comma and a space, i.e. occurrences at the start or in the middle of the string.
It does not cover 'Word' at the end of the string (which appears as ', Word') or 'Word' on its own, so we add a second alternative with "|" (or): '(, )?Word'. Here ? matches 0 or 1 occurrence of ', ' (comma and one space), so both a trailing ', Word' and a bare 'Word' are matched.
value='' : the replacement string.
regex=True : tells pandas to treat the to_replace parameter as a regular expression.
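As a quick sanity check of that pattern with plain re, outside pandas:
import re
pattern = 'Word, |(, )?Word'
for s in ['Word, foo, bar', 'Word', 'foo, bar, Word', 'foo, Word, bar']:
    print(repr(re.sub(pattern, '', s)))
# 'foo, bar'
# ''
# 'foo, bar'
# 'foo, bar'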
You can do it as below
Input
import pandas as pd

df = pd.DataFrame([[1, 'Word, foo, bar'],
                   [2, 'Word'],
                   [3, 'foo, bar, Word'],
                   [4, 'foo, Word, bar']], columns=['id', 'text'])
id text
1 Word, foo, bar
2 Word
3 foo, bar, Word
4 foo, Word, bar
Code to replace the text 'Word' and the following comma and space, if any:
df['text'] = df['text'].replace(r'Word(,\s)|(,\s)?Word', '', regex=True)
What is happening in the code:
Word(,\s) : matches the text 'Word' followed by a comma and a whitespace character (\s), i.e. occurrences at the start or in the middle of the string.
(,\s)?Word : matches 'Word' with an optional leading comma and space. The ? matches the ',\s' if it is there; if no comma and space precede, just the text 'Word' is matched, so the ? is pretty important here.
| : picks one of the two alternatives (needed for row 3, where 'Word' is preceded by a comma and a space).
You can see a detailed explanation in the Regex Demo.
Output
id text
1 foo, bar
2
3 foo, bar
4 foo, bar

Putting words in parentheses with regex in Python

How can I put brackets/parentheses around the words that follow another word in Python?
For 2 words it looks like:
>>> p=re.compile(r"foo\s(\w+)\s(\w+)")
>>> p.sub( r"[\1] [\2]", "foo bar baz")
'[bar] [baz]'
I want this for an undefined number of words. I came up with the following, but it doesn't seem to work.
>>> p=re.compile(r"foo(\s(\w+))*")
>>> p.sub( r"[\2] [\2] [\2]", "foo bar baz bax")
'[bax] [bax] [bax]'
The desired result in this case would be
'[bar] [baz] [bax]'
You may use a solution like
import re
p = re.compile(r"(foo\s+)([\w\s]+)")
r = re.compile(r"\w+")
s = "foo bar baz"
print( p.sub( lambda x: "{}{}".format(x.group(1), r.sub(r"[\g<0>]", x.group(2))), s) )
See the Python demo
The first (foo\s+)([\w\s]+) pattern matches and captures foo followed with 1+ whitespaces into Group 1 and then captures 1+ word and whitespace chars into Group 2.
Then, inside re.sub, the replacement argument is a lambda expression in which each run of 1+ word characters is wrapped in square brackets using the second, simple \w+ regex (this preserves the original whitespace between the words; otherwise it could be done without a regex).
Note that [\g<0>] replacement pattern inserts [, the whole match value (\g<0>) and then ].
I suggest the following simple solution:
import re
s = "foo bar baz bu bi porte"
p = re.compile(r"foo\s([\w\s]+)")
p = p.match(s)
# Here: p.group(1) is "bar baz bu bi porte"
# p.group(1).split is ['bar', 'baz' ,'bu' ,'bi', 'porte']
print(' '.join([f'[{i}]' for i in p.group(1).split()])) # for Python 3.6+ (due to f-strings)
# [bar] [baz] [bu] [bi] [porte]
print(' '.join(['[' + i + ']' for i in p.group(1).split()])) # for other Python versions
# [bar] [baz] [bu] [bi] [porte]
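A minimal sketch wrapping that idea in a reusable function (the name bracketize is mine, not from the answer):
import re

def bracketize(s):
    # Wrap every word that follows the leading "foo " in square brackets.
    m = re.match(r"foo\s([\w\s]+)", s)
    if m is None:
        return s  # no leading "foo": leave the string untouched
    return ' '.join('[' + word + ']' for word in m.group(1).split())

print(bracketize("foo bar baz bax"))  # [bar] [baz] [bax]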

Python regex: Replace individual characters in a match

I'm trying to substitute every individual character found in a regex match but I can't seem to make it work.
I have a string containing parenthesized expressions that have to be replaced.
For example, foo bar (baz) should become foo bar (***)
Here is what I came up with: re.sub(r"(\(.*?).(.*?\))", r"\1*\2", "foo bar (baz)"). Unfortunately, I can't seem to apply the substitution to every character between the parentheses. Is there any way to make this work?
How about something like this?
>>> import re
>>> s = 'foo bar (baz)'
>>> re.sub(r'(?<=\().*?(?=\))', lambda m: '*'*len(m.group()), s)
'foo bar (***)'
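The same pattern also handles several parenthesized chunks in one string, and the number of asterisks tracks the length of each match; for example:
>>> re.sub(r'(?<=\().*?(?=\))', lambda m: '*'*len(m.group()), 'foo (bar) baz (quux)')
'foo (***) baz (****)'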

Python regular expression to match either a quoted or unquoted string

I am trying to write a regular expression in Python that will match either a quoted string with spaces or an unquoted string without spaces. For example, given the string term:foo the result would be foo, and given the string term:"foo bar" the result would be foo bar. So far I've come up with the following regular expression:
r = re.compile(r'''term:([^ "]+)|term:"([^"]+)"''')
The problem is that the match can come in either group(1) or group(2) so I have to do something like this:
m = r.match(search_string)
term = m.group(1) or m.group(2)
Is there a way I can do this all in one step?
Avoid grouping, and instead use lookahead/lookbehind assertions to eliminate the parts that are not needed:
s = 'term:foo term:"foo bar" term:bar foo term:"foo term:'
re.findall(r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")', s)
Gives:
['foo', 'foo bar', 'bar']
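For a single search string, the same pattern also works with re.search, so the value comes straight from group() with no group(1)/group(2) juggling; for example:
pattern = r'(?<=term:)[^" ]+|(?<=term:")[^"]+(?=")'
re.search(pattern, 'term:foo').group()        # 'foo'
re.search(pattern, 'term:"foo bar"').group()  # 'foo bar'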
It doesn't seem that you really want re.match here. Your regex is almost right, but you're grouping too much. How about this?
>>> s
('xyz term:abc 123 foo', 'foo term:"abc 123 "foo')
>>> re.findall(r'term:([^ "]+|"[^"]+")', '\n'.join(s))
['abc', '"abc 123 "']

Is there a single Python regex that can change all "foo" to "bar" on lines starting with "#"?

Is it possible to write a single Python regular expression that can be applied to a multi-line string and change all occurrences of "foo" to "bar", but only on lines beginning with "#"?
I was able to get this working in Perl, using Perl's \G regular expression sigil, which matches the end of the previous match. However, Python doesn't appear to support this.
Here's the Perl solution, in case it helps:
my $x =<<EOF;
# foo
foo
# foo foo
EOF
$x =~ s{
    (              # begin capture
      (?:\G|^\#)   # last match or start of string plus hash
      .*?          # followed by anything, non-greedily
    )              # end capture
    foo
}
{$1bar}xmg;
print $x;
The proper output, of course, is:
# bar
foo
# bar bar
Can this be done in Python?
Edit: Yes, I know that it's possible to split the string into individual lines and test each line and then decide whether to apply the transformation, but please take my word that doing so would be non-trivial in this case. I really do need to do it with a single regular expression.
lines = mystring.split('\n')
for i, line in enumerate(lines):
    if line.startswith('#'):
        lines[i] = line.replace('foo', 'bar')
mystring = '\n'.join(lines)
No need for a regex.
It looked pretty easy to do with a regular expression:
>>> import re
>>> text = """line 1
... line 2
... Barney Rubble Cutherbert Dribble and foo
... line 4
... # Flobalob, bing, bong, foo and brian
... line 6"""
>>> regexp = re.compile('^(#.+)foo', re.MULTILINE)
>>> print(re.sub(regexp, r'\g<1>bar', text))
line 1
line 2
Barney Rubble Cutherbert Dribble and foo
line 4
# Flobalob, bing, bong, bar and brian
line 6
But then trying it on your example text is not so good:
>>> text = """# foo
... foo
... # foo foo"""
>>> regexp = re.compile('^(#.+)foo', re.MULTILINE)
>>> print(re.sub(regexp, r'\g<1>bar', text))
# bar
foo
# foo bar
So, try this:
>>> regexp = re.compile(r'(^#|\g.+)foo', re.MULTILINE)
>>> print(re.sub(regexp, r'\g<1>bar', text))
# foo
foo
# foo foo
That seemed to work, but I can't find \g in the documentation!
Moral: don't try to code after a couple of beers.
\g works in Python just as it does in Perl, and it is in the docs:
"In addition to character escapes and backreferences as described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn't ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE."
