Python Regex Sub - Use Match as Dict Key in Substitution

I'm translating a program from Perl to Python (3.3). I'm fairly new with Python. In Perl, I can do crafty regex substitutions, such as:
$string =~ s/<(\w+)>/$params->{$1}/g;
This will search through $string, and for each group of word characters enclosed in <>, a substitution from the $params hashref will occur, using the regex match as the hash key.
What is the best (Pythonic) way to concisely replicate this behavior? I've come up with something along these lines:
string = re.sub(r'<(\w+)>', (what here?), string)
It might be nice if I could pass a function that maps regex matches to a dict. Is that possible?
Thanks for the help.

You can pass a callable to re.sub to tell it what to do with the match object.
s = re.sub(r'<(\w+)>', lambda m: replacement_dict.get(m.group(1)), s)
Using dict.get lets you provide a fallback if the word isn't in the replacement dict, e.g.
lambda m: replacement_dict.get(m.group(1), m.group())
# fall back to leaving the original text in place if we don't have a replacement
I'll note that when using re.sub (and related functions, e.g. re.split) with a pattern that includes context around the text you want to substitute, it's often cleaner to use lookaround assertions so that the surrounding context isn't consumed by the match. So if you want to keep the brackets, I'd write the regex like
r'(?<=<)(\w+)(?=>)'
Otherwise you have to splice the brackets out and back in inside your lambda. To make clear what I'm talking about, here is an example:
s = "<sometag>this is stuff<othertag>this is other stuff<closetag>"
d = {'othertag': 'blah'}
#this doesn't work because `group` returns the whole match, including non-groups
re.sub(r'<(\w+)>', lambda m: d.get(m.group(), m.group()), s)
Out[23]: '<sometag>this is stuff<othertag>this is other stuff<closetag>'
#this output isn't exactly ideal...
re.sub(r'<(\w+)>', lambda m: d.get(m.group(1), m.group(1)), s)
Out[24]: 'sometagthis is stuffblahthis is other stuffclosetag'
#this works, but is ugly and hard to maintain
re.sub(r'<(\w+)>', lambda m: '<{}>'.format(d.get(m.group(1), m.group(1))), s)
Out[26]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
#lookbehind/lookahead makes this nicer.
re.sub(r'(?<=<)(\w+)(?=>)', lambda m: d.get(m.group(), m.group()), s)
Out[27]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
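The pieces above can be put together in a small helper. This is only a minimal sketch, not the answerer's code: the function name fill_placeholders and the sample params dict are made up for illustration, and it assumes that unknown keys should be left in place, brackets included.

```python
import re

def fill_placeholders(text, params):
    """Replace each <key> in text with params[key]; leave unknown keys untouched."""
    # m.group(1) is the word inside the brackets; m.group(0) is the whole "<word>".
    return re.sub(r'<(\w+)>', lambda m: str(params.get(m.group(1), m.group(0))), text)

params = {'name': 'World', 'count': 3}  # hypothetical sample data
result = fill_placeholders('Hello <name>, you have <count> items and <unknown>.', params)
print(result)  # Hello World, you have 3 items and <unknown>.
```

Wrapping the looked-up value in str() is a design choice for this sketch, so that non-string dict values such as the integer above can be substituted as well.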

Related

Replace involving original string

I am trying to replace some text with something else that depends on the original text in Python 3. For example, say I have "[[procedural programming|procedural programming languages]]"; I need to replace that with the latter text, so just procedural programming languages.
In general, I need a function which takes a string and a function and applies the function to the string and then replaces it. For example, reversing a string could be done like so:
text = "123456 123456 84708467 11235"
new_text = special_replace(text, lambda x: x[::-1])
>>> 654321 654321 84708467 11235
Or the previous example:
text = "[[procedural programming|procedural programming languages]] [meow|woof]"
new_text = special_replace(text, lambda x: x.replace("[[", "").replace("]]","").split("|")[1])
>>> procedural programming languages [meow|woof]

You can create a regular expression and use re.sub with group reference to replace them:
>>> text = "[[procedural programming|procedural programming languages]] [meow|woof]"
>>> p = r"\[\[.*?\|(.*?)\]\]"
>>> re.findall(p, text)
['procedural programming languages']
>>> re.sub(p, r"\1", text)
'procedural programming languages [meow|woof]'
Note that [, |, and ] all have to be escaped. Here (.*?) is a capturing group for the second term, and \1 references that group in the replacement string.
For more complex stuff, like also reversing the group, you can use a callback function:
>>> re.sub(p, lambda m: m.group(1)[::-1], text)
'segaugnal gnimmargorp larudecorp [meow|woof]'
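The special_replace function asked for in the question could be built on the same callback mechanism. The following is a rough sketch under one assumption that is not in the question: the substrings to transform are selected by a regex, passed as a third pattern argument that defaults to double-bracketed spans.

```python
import re

def special_replace(text, func, pattern=r'\[\[.*?\]\]'):
    """Apply func to every substring matching pattern and splice the result back in."""
    # The default pattern is an assumption: it selects [[...]] spans only.
    return re.sub(pattern, lambda m: func(m.group(0)), text)

text = "[[procedural programming|procedural programming languages]] [meow|woof]"
# Strip the surrounding [[ ]] and keep the part after the pipe.
new_text = special_replace(text, lambda x: x[2:-2].split("|")[1])
print(new_text)  # procedural programming languages [meow|woof]
```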

Is there a better way to swap string without a placeholder

I have a string:
>>> s = 'Y/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<X/PROPN/pobj_>,/PUNCT/punct'
And the aim is to swap the positions of Y/ and X/, i.e. something like:
>>> s.replace('X/', '##').replace('Y/', 'X/').replace('##', 'Y/')
'X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct'
This assumes that there will be no conflict when doing the replacement, i.e. X/ and Y/ are unique and each occurs only once in the original string.
Is there a way to do the replacement without the placeholder? Currently, I'm swapping their positions by using the ## placeholder.

In Python, an easy way to do this with a regex is to pass a lambda as the replacement argument of re.sub; it can inspect the text captured by the capturing groups and choose the appropriate replacement.
So (X|Y)/ should work (I assume X and Y stand for potentially multi-character strings; otherwise use ([XY])):
import re
s = 'Y/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<X/PROPN/pobj_>,/PUNCT/punct'
print(s)
print(re.sub(r"(X|Y)/", lambda m: "Y/" if m.group(1) == 'X' else 'X/' , s))
Output:
Y/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<X/PROPN/pobj_>,/PUNCT/punct
X/NOUN/dobj>_hold/VERB/ROOT_<membership/NOUN/dobj_<with/ADP/prep_<Y/PROPN/pobj_>,/PUNCT/punct
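The same idea generalizes to arbitrary token pairs by looking the match up in a dict. This is a sketch only; the helper name swap_tokens is hypothetical, and the sample string is a shortened version of the one in the question.

```python
import re

def swap_tokens(text, a, b):
    """Swap every occurrence of a and b in text in a single pass."""
    mapping = {a: b, b: a}
    # re.escape guards against regex metacharacters in the tokens;
    # longer tokens go first so a shorter one cannot shadow them.
    pattern = '|'.join(re.escape(t) for t in sorted(mapping, key=len, reverse=True))
    return re.sub(pattern, lambda m: mapping[m.group(0)], text)

s = 'Y/NOUN/dobj>_hold/VERB/ROOT_<X/PROPN/pobj'
print(swap_tokens(s, 'X/', 'Y/'))  # X/NOUN/dobj>_hold/VERB/ROOT_<Y/PROPN/pobj
```

Because each match is replaced exactly once as the string is scanned, no placeholder is needed.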

Getting captured group in one line

There is a known "pattern" to get the captured group value or an empty string if no match:
match = re.search('regex', 'text')
if match:
    value = match.group(1)
else:
    value = ""
or:
match = re.search('regex', 'text')
value = match.group(1) if match else ''
Is there a simple and pythonic way to do this in one line?
In other words, can I provide a default for a capturing group in case it's not found?
For example, I need to extract all alphanumeric characters (and _) from the text after the key= string:
>>> import re
>>> PATTERN = re.compile(r'key=(\w+)')
>>> def find_text(text):
...     match = PATTERN.search(text)
...     return match.group(1) if match else ''
...
>>> find_text('foo=bar,key=value,beer=pub')
'value'
>>> find_text('no match here')
''
Is it possible for find_text() to be a one-liner?
It is just an example, I'm looking for a generic approach.

Quoting from the MatchObjects docs,
Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:
match = re.search(pattern, string)
if match:
    process(match)
Since there is no other option, and as you use a function, I would like to present this alternative
def find_text(text, matches = lambda x: x.group(1) if x else ''):
    return matches(PATTERN.search(text))
assert find_text('foo=bar,key=value,beer=pub') == 'value'
assert find_text('no match here') == ''
It is exactly the same thing, except that the check you need to do has been moved into a default parameter.
Thinking of @Kevin's solution and @devnull's suggestions in the comments, you can do something like this
def find_text(text):
    return next((item.group(1) for item in PATTERN.finditer(text)), "")
This takes advantage of the fact that next accepts a default value to return as its second argument. But it has the overhead of creating a generator expression on every call, so I would stick to the first version.

You can play with the pattern, using an empty alternative at the end of the string in the capture group:
>>> re.search(r'((?<=key=)\w+|$)', 'foo=bar,key=value').group(1)
'value'
>>> re.search(r'((?<=key=)\w+|$)', 'no match here').group(1)
''

It's possible to refer to the result of a function call twice in a single one-liner: create a lambda expression and call the function in the arguments.
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))
However, I don't consider this especially readable. Code responsibly - if you're going to write tricky code, leave a descriptive comment!

One-line version:
if re.findall(pattern,string): pass
The issue here is that you want to prepare for multiple matches or ensure that your pattern only hits once. Expanded version:
# matches is a list
matches = re.findall(pattern,string)
# condition on the list fails when list is empty
if matches:
    pass
So for your example "extract all alphanumeric characters (and _) from the text after the key= string":
# Returns the first match, or '' if there is none
def find_text(text):
    return (re.findall(r"(?<=key=)[a-zA-Z0-9_]*", text) or [""])[0]

One line for you, although not quite Pythonic.
find_text = lambda text: (lambda m: m and m.group(1) or '')(PATTERN.search(text))
Indeed, in Scheme programming language, all local variable constructs can be derived from lambda function applications.

Re: "Is there a simple and pythonic way to do this in one line?" The answer is no. Any means to get this to work in one line (without defining your own wrapper), is going to be uglier to read than the ways you've already presented. But defining your own wrapper is perfectly Pythonic, as is using two quite readable lines instead of a single difficult-to-read line.
Update for Python 3.8+: The new "walrus operator" introduced with PEP 572 does allow this to be a one-liner without convoluted tricks:
value = match.group(1) if (match := re.search('regex', 'text')) else ''
Many would consider this Pythonic, particularly those who supported the PEP. However, it should be noted that there was fierce opposition to it as well. The conflict was so intense that Guido van Rossum stepped down from his role as Python's BDFL the day after announcing his acceptance of the PEP.
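The "define your own wrapper" option mentioned in this answer could look roughly like the following. The name first_group and its group and default parameters are illustrative choices, not an existing re API.

```python
import re

def first_group(pattern, text, group=1, default=''):
    """Return the requested group of the first match, or default if there is no match."""
    match = re.search(pattern, text)
    return match.group(group) if match else default

print(first_group(r'key=(\w+)', 'foo=bar,key=value,beer=pub'))  # value
print(repr(first_group(r'key=(\w+)', 'no match here')))         # ''
```

Each call site then stays a single readable line, and the None check lives in one place.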

You can do it as:
value = re.search('regex', 'text').group(1) if re.search('regex', 'text') else ''
Although it's not terribly efficient, since you run the regex twice.
Or, to run it only once, as @Kevin suggested:
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))

One liners, one liners... Why can't you write it on 2 lines?
getattr(re.search('regex', 'text'), 'group', lambda x: '')(1)
Your second solution is fine. Make a function from it if you wish. My solution is for demonstration purposes only and is in no way Pythonic.

Starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), we can name the result of pattern.search(text) in order to both check whether there is a match (pattern.search(text) returns either None or a re.Match object) and use it to extract the matching group:
# pattern = re.compile(r'key=(\w+)')
match.group(1) if (match := pattern.search('foo=bar,key=value,beer=pub')) else ''
# 'value'
match.group(1) if (match := pattern.search('no match here')) else ''
# ''

How do I write a regex to replace a word but keep its case in Python?

Is this even possible?
Basically, I want to turn these two calls to sub into a single call:
re.sub(r'\bAword\b', 'Bword', mystring)
re.sub(r'\baword\b', 'bword', mystring)
What I'd really like is some sort of conditional substitution notation like:
re.sub(r'\b([Aa])word\b', '(?1=A:B,a:b)word')
I only care about the capitalization of the first character. None of the others.

You can have functions to parse every match:
>>> def f(match):
...     return chr(ord(match.group(0)[0]) + 1) + match.group(0)[1:]
...
>>> re.sub(r'\b[aA]word\b', f, 'aword Aword')
'bword Bword'

OK, here's the solution I came up with, thanks to the suggestions to use a replace function.
re.sub(r'\b[Aa]word\b', lambda x: ('B' if x.group()[0].isupper() else 'b') + 'word', 'Aword aword.')

You can pass a lambda function which uses the Match object as a parameter as the replacement function:
import re
re.sub(r'\baword\b',
       lambda m: m.group(0)[0].lower() == m.group(0)[0] and 'bword' or 'Bword',
       'Aword aword',
       flags=re.I)
# returns: 'Bword bword'

Use capture groups (r'\1'):
re.sub(r'\b([Aa])word\b', r'\1word', "hello Aword")
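A plain backreference can only copy the captured letter back unchanged, so on its own it does not turn A into B. One way to get the conditional behaviour the question asks for is to combine the capture group with a dict lookup in a replacement function. This is a sketch; the FIRST_LETTER mapping and the function name are made up for this example.

```python
import re

FIRST_LETTER = {'A': 'B', 'a': 'b'}  # hypothetical mapping for this example

def swap_first_letter(text):
    """Turn Aword into Bword and aword into bword in one pass."""
    # Group 1 captures the first letter; the dict picks the replacement with the same case.
    return re.sub(r'\b([Aa])word\b', lambda m: FIRST_LETTER[m.group(1)] + 'word', text)

print(swap_first_letter('Aword aword sword'))  # Bword bword sword
```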

Python RE question - proper state initial formatting

I have a string that I need to edit. It looks something like this:
string = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
Notice that the state abbreviation "Mn" is not properly formatted. I'm trying to use a regular expression to change this:
re.sub("[A-Z][a-z],", "[A-Z][A-Z],", string)
However, re.sub treats the second part as a literal and will change Mn, to [A-Z][A-Z],. How would I use re.sub (or something similar and simple) to properly change Mn, to MN, in this string?
Thank you in advance!

Your re.sub might also modify parts of the string that you don't want to change. Try processing the relevant field explicitly instead:
line = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
elems = line.split(',')
elems[3] = elems[3].upper()  # the state is the fourth comma-separated field
output = ','.join(elems)
which gives
'Idaho Ave N,,Crystal,MN,55427-1463,US,,610839124763,Expedited'

You can pass a function as the replacement parameter to re.sub to generate the replacement string from the match object, e.g.:
import re
s = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"
def upcase(match):
    return match.group().upper()
print(re.sub(r"[A-Z][a-z],", upcase, s))
(This is ignoring the concern of whether you're genuinely finding state initials with this method.)
See the documentation for re.sub in the re module.

sub(pattern, repl, string, count=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
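# \b keeps the pattern from matching the start of longer words such as Idaho
re.sub(r"\b[A-Z][a-z]\b", lambda m: m.group(0).upper(), myString)
I would avoid calling your variable string, since that is also the name of a standard library module.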

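You create a group by surrounding part of your regex in parentheses, then refer to it by its group number. Because the captured text has to be upper-cased, read the group inside a replacement function:
re.sub(r"([A-Z][a-z]),", lambda m: m.group(1).upper() + ",", string)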
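To see why the case conversion has to happen inside a function rather than on the replacement template, the two variants can be compared side by side. This is a small sketch using the string from the question; the variable names are illustrative.

```python
import re

s = "Idaho Ave N,,Crystal,Mn,55427-1463,US,,610839124763,Expedited"

# The template r"\1," contains no letters, so upper() changes nothing
# and the substitution simply re-inserts the captured text as it was.
unchanged = re.sub(r"([A-Z][a-z]),", r"\1,".upper(), s)

# A function receives the match object, so it can upper-case the captured text.
fixed = re.sub(r"([A-Z][a-z]),", lambda m: m.group(1).upper() + ",", s)

print(unchanged == s)  # True
print(fixed)           # Idaho Ave N,,Crystal,MN,55427-1463,US,,610839124763,Expedited
```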
