Getting captured group in one line - python

There is a known "pattern" to get the captured group value or an empty string if no match:
match = re.search('regex', 'text')
if match:
    value = match.group(1)
else:
    value = ""
or:
match = re.search('regex', 'text')
value = match.group(1) if match else ''
Is there a simple and pythonic way to do this in one line?
In other words, can I provide a default for a capturing group in case it's not found?
For example, I need to extract all alphanumeric characters (and _) from the text after the key= string:
>>> import re
>>> PATTERN = re.compile(r'key=(\w+)')
>>> def find_text(text):
...     match = PATTERN.search(text)
...     return match.group(1) if match else ''
...
>>> find_text('foo=bar,key=value,beer=pub')
'value'
>>> find_text('no match here')
''
Is it possible for find_text() to be a one-liner?
It is just an example, I'm looking for a generic approach.

Quoting from the MatchObjects docs,
Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:
match = re.search(pattern, string)
if match:
    process(match)
Since there is no other option, and as you use a function, I would like to present this alternative
def find_text(text, matches=lambda x: x.group(1) if x else ''):
    return matches(PATTERN.search(text))
assert find_text('foo=bar,key=value,beer=pub') == 'value'
assert find_text('no match here') == ''
It does exactly the same thing; only the check you need to do has been moved into a default parameter.
Thinking of @Kevin's solution and @devnull's suggestions in the comments, you can do something like this:
def find_text(text):
    return next((item.group(1) for item in PATTERN.finditer(text)), "")
This takes advantage of the fact that next() accepts a default value to return when the iterator is exhausted. But it has the overhead of creating a generator expression on every call, so I would stick to the first version.

You can play with the pattern, using an empty alternative at the end of the string in the capture group:
>>> re.search(r'((?<=key=)\w+|$)', 'foo=bar,key=value').group(1)
'value'
>>> re.search(r'((?<=key=)\w+|$)', 'no match here').group(1)
''
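To check that the empty-alternative pattern behaves like the original find_text(), here is a minimal sketch that wraps it in a function. The names FALLBACK_PATTERN and find_text_alt are made up for this illustration.

```python
import re

# Hypothetical names, for illustration only.
# Group 1 is either \w+ directly preceded by "key=", or (failing that)
# the empty string at the end of the text, so search() never returns None.
FALLBACK_PATTERN = re.compile(r'((?<=key=)\w+|$)')

def find_text_alt(text):
    return FALLBACK_PATTERN.search(text).group(1)

print(repr(find_text_alt('foo=bar,key=value,beer=pub')))  # 'value'
print(repr(find_text_alt('no match here')))               # ''
```

The price is that the default is baked into the pattern: it can only ever be the empty string, and the regex itself becomes harder to read.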

It's possible to refer to the result of a function call twice in a single one-liner: create a lambda expression and call the function in the arguments.
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))
However, I don't consider this especially readable. Code responsibly - if you're going to write tricky code, leave a descriptive comment!

One-line version:
if re.findall(pattern,string): pass
The issue here is that you want to prepare for multiple matches or ensure that your pattern only hits once. Expanded version:
# matches is a list
matches = re.findall(pattern,string)
# condition on the list fails when list is empty
if matches:
    pass
So for your example "extract all alphanumeric characters (and _) from the text after the key= string":
# Returns the first match, or '' when there is no match
def find_text(text):
    return (re.findall("(?<=key=)[a-zA-Z0-9_]*", text) or [''])[0]

One line for you, although not quite Pythonic.
find_text = lambda text: (lambda m: m and m.group(1) or '')(PATTERN.search(text))
Indeed, in the Scheme programming language, all local variable constructs can be derived from lambda function applications.

Re: "Is there a simple and pythonic way to do this in one line?" The answer is no. Any means to get this to work in one line (without defining your own wrapper), is going to be uglier to read than the ways you've already presented. But defining your own wrapper is perfectly Pythonic, as is using two quite readable lines instead of a single difficult-to-read line.
Update for Python 3.8+: The new "walrus operator" introduced with PEP 572 does allow this to be a one-liner without convoluted tricks:
value = match.group(1) if (match := re.search('regex', 'text')) else ''
Many would consider this Pythonic, particularly those who supported the PEP. However, it should be noted that there was fierce opposition to it as well. The conflict was so intense that Guido van Rossum stepped down from his role as Python's BDFL the day after announcing his acceptance of the PEP.
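For versions before 3.8, the "own wrapper" mentioned in this answer could look roughly like the sketch below. The name search_group and its parameters are hypothetical, not part of the re module.

```python
import re

def search_group(pattern, text, group=1, default=''):
    """Return the given group of the first match, or `default` if no match.

    Hypothetical helper; `pattern` may be a string or a compiled pattern.
    """
    match = re.search(pattern, text)
    return match.group(group) if match else default

value = search_group(r'key=(\w+)', 'foo=bar,key=value,beer=pub')
missing = search_group(r'key=(\w+)', 'no match here')
print(repr(value), repr(missing))
```

Each call site then stays a single readable line. Note that the default only covers "no match at all"; a group that exists in the pattern but did not participate in the match still yields None.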

You can do it as:
value = re.search('regex', 'text').group(1) if re.search('regex', 'text') else ''
Although it's not terribly efficient considering the fact that you run the regex twice.
Or to run it only once as @Kevin suggested:
value = (lambda match: match.group(1) if match else '')(re.search(regex,text))

One liners, one liners... Why can't you write it on 2 lines?
getattr(re.search('regex', 'text'), 'group', lambda x: '')(1)
Your second solution is fine. Make a function from it if you wish. My solution is for demonstration purposes only and is in no way Pythonic.

Starting with Python 3.8 and the introduction of assignment expressions (PEP 572, the := operator), we can name the result of pattern.search(text) in order to both check whether there is a match (pattern.search(text) returns either None or a re.Match object) and use it to extract the matching group:
# pattern = re.compile(r'key=(\w+)')
match.group(1) if (match := pattern.search('foo=bar,key=value,beer=pub')) else ''
# 'value'
match.group(1) if (match := pattern.search('no match here')) else ''
# ''
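Applied to the question's find_text(), the same idea might look like the sketch below (Python 3.8+ only). The list comprehension at the end is an extra illustration of the operator and the sample lines are made up.

```python
import re

PATTERN = re.compile(r'key=(\w+)')

def find_text(text):
    # One-line body: bind the match and test it in the same expression.
    return m.group(1) if (m := PATTERN.search(text)) else ''

# The walrus operator also works as a filter inside a comprehension.
lines = ['foo=bar,key=value', 'no match here', 'key=other']
values = [m.group(1) for line in lines if (m := PATTERN.search(line))]
print(values)  # ['value', 'other']
```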

Related

I need help formulating a specific regex

I do not consider myself a newbie in regex, but I seem to have found a problem that stumped me (it's also Friday evening, so brain not at peak performance).
I am trying to substitute a place-holder inside a string with some other value. I am having great difficulty getting a syntax that behaves the way I want.
My place-holder has this format: {swap}
I want it to capture and replace these:
{swap} # NewValue
x{swap}x # xNewValuex
{swap}x # NewValuex
x{swap} # xNewValue
But I want it to NOT match these:
{{swap}} # NOT {NewValue}
x{{swap}}x # NOT x{NewValue}x
{{swap}}x # NOT {NewValue}x
x{{swap}} # NOT x{NewValue}
In all of the above, x can be any string, of any length, be it "word" or not.
I'm trying to do this using python3's re.sub() but anytime I satisfy one subset of criteria I lose another in the process. I'm starting to think it might not be possible to do in a single command.
Cheers!
If you're able to use the newer regex module, you can use (*SKIP)(*FAIL):
{{.*?}}(*SKIP)(*FAIL)|{.*?}
See a demo on regex101.com.
Broken down, this says:
{{.*?}}(*SKIP)(*FAIL) # match any {{...}} and "throw them away"
| # or ...
{.*?} # match your desired pattern
In Python this would be:
import regex as re
rx = re.compile(r'{{.*?}}(*SKIP)(*FAIL)|{.*?}')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""
string = rx.sub('NewValue', string)
print(string)
This yields:
NewValue
xNewValuex
NewValuex
xNewValue
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}
For the sake of completeness, you can also achieve this with Python's own re module but here, you'll need a slightly adjusted pattern as well as a replacement function:
import re
rx = re.compile(r'{{.*?}}|({.*?})')
string = """
{swap}
x{swap}x
{swap}x
x{swap}
{{swap}}
x{{swap}}x
{{swap}}x
x{{swap}}"""
def repl(match):
    if match.group(1) is not None:
        return "NewValue"
    else:
        return match.group(0)
string = rx.sub(repl, string)
print(string)
Use negative lookahead and lookbehind:
s1 = "x{swap}x"
s2 = "x{{swap}}x"
pattern = r"(?<!\{)\{[^}]+\}(?!})"
re.sub(pattern, "foo", s1)
#'xfoox'
re.sub(pattern, "foo", s2)
#'x{{swap}}x'
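As a quick check, the lookaround pattern can be run against all eight cases from the question using only the standard re module. This is a sketch; the names PLACEHOLDER and cases are made up here.

```python
import re

# A "{...}" that is not preceded by "{" and not followed by "}".
PLACEHOLDER = re.compile(r'(?<!\{)\{[^}]+\}(?!\})')

# Input -> expected output, taken from the question.
cases = {
    '{swap}': 'NewValue',
    'x{swap}x': 'xNewValuex',
    '{swap}x': 'NewValuex',
    'x{swap}': 'xNewValue',
    '{{swap}}': '{{swap}}',
    'x{{swap}}x': 'x{{swap}}x',
    '{{swap}}x': '{{swap}}x',
    'x{{swap}}': 'x{{swap}}',
}

results = {text: PLACEHOLDER.sub('NewValue', text) for text in cases}
for text, expected in cases.items():
    assert results[text] == expected, (text, results[text])
print('all cases pass')
```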

simple regex pattern not matching [duplicate]

>>> import re
>>> s = 'this is a test'
>>> reg1 = re.compile('test$')
>>> match1 = reg1.match(s)
>>> print match1
None
In Kiki, that pattern matches the "test" at the end of s. What am I missing? (I tried re.compile(r'test$') as well.)
Use
match1 = reg1.search(s)
instead. The match function only matches at the start of the string ... see the documentation here:
Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).
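A short sketch of the difference, using the string from the question. fullmatch() (available since Python 3.4) is included for comparison and is not part of the quoted documentation.

```python
import re

s = 'this is a test'
reg1 = re.compile('test$')

at_start = reg1.match(s)      # only tries at position 0
anywhere = reg1.search(s)     # scans the whole string
whole = reg1.fullmatch(s)     # pattern must cover the entire string

print(at_start)         # None
print(anywhere.span())  # (10, 14)
print(whole)            # None
```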
Your regex does not match the full string. You can use search instead as Useless mentioned, or you can change your regex to match the full string:
'^this is a test$'
Or somewhat harder to read but somewhat less useless:
'^t[^t]*test$'
It depends on what you're trying to do.
That is because the match method returns None if it cannot find the expected pattern; if it finds the pattern, it returns a match object (shown as _sre.SRE_Match in older Python versions).
So, if you want a Boolean (True or False) result from match, you must check whether the result is None.
You could test whether a text matches like this:
string_to_evaluate = "Your text that needs to be examined"
expected_pattern = "pattern"
if re.match(expected_pattern, string_to_evaluate) is not None:
    print("The text is as you expected!")
else:
    print("The text is not as you expected!")

Python Regex Sub - Use Match as Dict Key in Substitution

I'm translating a program from Perl to Python (3.3). I'm fairly new with Python. In Perl, I can do crafty regex substitutions, such as:
$string =~ s/<(\w+)>/$params->{$1}/g;
This will search through $string, and for each group of word characters enclosed in <>, a substitution from the $params hashref will occur, using the regex match as the hash key.
What is the best (Pythonic) way to concisely replicate this behavior? I've come up with something along these lines:
string = re.sub(r'<(\w+)>', (what here?), string)
It might be nice if I could pass a function that maps regex matches to a dict. Is that possible?
Thanks for the help.
You can pass a callable to re.sub to tell it what to do with the match object.
s = re.sub(r'<(\w+)>', lambda m: replacement_dict.get(m.group()), s)
Using dict.get allows you to provide a "fallback" if the word isn't in the replacement dict, i.e.
lambda m: replacement_dict.get(m.group(), m.group())
# fallback to just leaving the word there if we don't have a replacement
I'll note that when using re.sub (and family, i.e. re.split) and specifying text that surrounds your wanted substitution, it's often cleaner to use lookaround expressions so that the text around your match doesn't get substituted out. So in this case I'd write your regex like
r'(?<=<)(\w+)(?=>)'
Otherwise you have to do some splicing out/back in of the brackets in your lambda. To be clear what I'm talking about, an example:
s = "<sometag>this is stuff<othertag>this is other stuff<closetag>"
d = {'othertag': 'blah'}
#this doesn't work because `group` returns the whole match, including non-groups
re.sub(r'<(\w+)>', lambda m: d.get(m.group(), m.group()), s)
Out[23]: '<sometag>this is stuff<othertag>this is other stuff<closetag>'
#this output isn't exactly ideal...
re.sub(r'<(\w+)>', lambda m: d.get(m.group(1), m.group(1)), s)
Out[24]: 'sometagthis is stuffblahthis is other stuffclosetag'
#this works, but is ugly and hard to maintain
re.sub(r'<(\w+)>', lambda m: '<{}>'.format(d.get(m.group(1), m.group(1))), s)
Out[26]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
#lookbehind/lookahead makes this nicer.
re.sub(r'(?<=<)(\w+)(?=>)', lambda m: d.get(m.group(), m.group()), s)
Out[27]: '<sometag>this is stuff<blah>this is other stuff<closetag>'
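For the exact behaviour of the Perl one-liner, where the whole <key> including the brackets is replaced by the looked-up value, a sketch along these lines might do. The function name fill and the sample data are hypothetical.

```python
import re

def fill(template, params):
    """Replace each <key> with params[key]; leave unknown <key> untouched."""
    # m.group(1) is the key without brackets, m.group(0) is the whole "<key>".
    return re.sub(r'<(\w+)>',
                  lambda m: str(params.get(m.group(1), m.group(0))),
                  template)

params = {'name': 'Ann', 'place': 'Oslo'}
result = fill('Hello <name>, welcome to <place>! <unknown>', params)
print(result)  # Hello Ann, welcome to Oslo! <unknown>
```

Falling back to m.group(0) is a design choice made here; Perl would substitute an empty string for a missing key, which corresponds to params.get(m.group(1), '').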

python re, find expression containing an optional group

I have an expression that can take either of these forms:
(src://path/to/foldernames canhave spaces/file.xzy)
(src://path/to/foldernames canhave spaces/file.xzy "optional string")
These expressions occur within a much longer string (they are not individual strings). I am having trouble matching both expressions when using re.search or re.findall (as there may be multiple expression in the string).
It's straightforward enough to match either individually but how can I go about matching either case so that two groups are returned, the first with src://path/... and the second with the optional string if it exists or None if not?
I am thinking that I need to somehow specify OR groups---for instance, consider:
The pattern \((.*)( ".*")\) matches the second instance but not the first because it does not contain "...".
r = re.search(r'\((.*)( ".*")\)', '(src://path/to/foldernames canhave spaces/file.xzy)')
r.groups() # Nothing found
AttributeError: 'NoneType' object has no attribute 'groups'
While \((.*)( ".*")?\) matches the first group but does not individually identify the "optional string" as a group in the second instance.
r = re.search(r'\((.*)( ".*")?\)', '(src://path/to/foldernames canhave spaces/file.xzy "optional string")')
r.groups()
('src://path/to/foldernames canhave spaces/file.xzy "optional string"', None)
Any thoughts, ye' masters of expressions (of the regular variety)?
The simplest way is to make the first * non-greedy:
>>> import re
>>> string = "(src://path/to/foldernames canhave spaces/file.xzy)"
>>> string2 = \
... '(src://path/to/foldernames canhave spaces/file.xzy "optional string")'
>>> re.findall(r'\((.*?)( ".*")?\)', string2)
[('src://path/to/foldernames canhave spaces/file.xzy', ' "optional string"')]
>>> re.findall(r'\((.*?)( ".*")?\)', string)
[('src://path/to/foldernames canhave spaces/file.xzy', '')]
Since " aren't usually allowed to appear in file names, you can simply exclude them from the first group:
r = re.search(r'\(([^"]*)( ".*")?\)', input)
This is generally the preferred alternative to ungreedy repetition, because it tends to be a lot more efficient. If your file names can actually contain quotes for some reason, then ungreedy repetition (as in agf's answer) is your best bet.
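Since the question mentions several expressions inside one longer string and wants None for a missing optional part, here is a sketch using finditer(). The pattern is a stricter variant of the ones above, not either answerer's exact regex: it assumes paths start with src:// and contain no quotes or parentheses, and the sample text is made up.

```python
import re

# Group 1: the path. Group 2: the optional string without its quotes.
LINK = re.compile(r'\((src://[^"()]*?)(?: "([^"]*)")?\)')

text = 'see (src://a/b c/file.xzy) and (src://d/e.xzy "note") end'

# m.groups() reports a non-participating group as None,
# whereas findall() would give '' instead.
found = [m.groups() for m in LINK.finditer(text)]
print(found)
# [('src://a/b c/file.xzy', None), ('src://d/e.xzy', 'note')]
```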

How fill a regex string with parameters

I would like to fill the named groups of a regex with strings.
import re
hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
hMatch = hReg.match("/robert/delete/")
args = hMatch.groupdict()
The args variable is now a dict: {"action": "delete"}.
How can I reverse this process? Given the args dict and the regex pattern, how can I obtain the string "/robert/delete/"?
Is it possible to have a function like this?
def reverse(pattern, dictArgs):
Thank you
This function should do it
def reverse(regex, dict):
    replacer_regex = re.compile(r'''
        \(\?P\<        # Match the opening
        (.+?)          # Match the group name into group 1
        \>\(.*?\)\)    # Match the rest
        '''
        , re.VERBOSE)
    return replacer_regex.sub(lambda m: dict[m.group(1)], regex)
You basically match the (?P<name>...) block and replace it with a value from the dict.
EDIT: regex is the regex string in my example. You can get it from a compiled pattern via
regex_compiled.pattern
EDIT2: verbose regex added
Actually, I think it's doable for some narrow cases, but it gets pretty complex in the general case.
You'll need to write some sort of finite state machine that parses your regex string, splits it into parts, and takes the appropriate action for each part.
For regular symbols: simply put the symbols "as is" into the result string.
For named groups: put the values from dictArgs in their place.
For optional blocks: put in some of their values.
And so on.
One regular expression can often match a big (or even infinite) set of strings, so this "reverse" function wouldn't be very useful.
Building upon @Dimitri's answer, more sanitisation is possible.
retype = type(re.compile('hello, world'))
def reverse(ptn, dict):
    if isinstance(ptn, retype):
        ptn = ptn.pattern
    ptn = ptn.replace(r'\.','.')
    replacer_regex = re.compile(r'''
        \(\?P         # Match the opening
        \<(.+?)\>
        (.*?)
        \)            # Match the rest
        '''
        , re.VERBOSE)
    # return replacer_regex.findall(ptn)
    res = replacer_regex.sub(lambda m: dict[m.group(1)], ptn)
    return res
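A round-trip check for the question's example could be sketched as follows. This is a narrow illustration rather than a general solution: the name reverse_pattern is hypothetical, a named group's body is assumed to contain at most one level of nested parentheses, and only a leading ^ and a trailing $ are stripped; escapes such as \. are left as they are.

```python
import re

# Assumption: the body of a named group nests parentheses at most one deep.
NAMED_GROUP = re.compile(r'\(\?P<(\w+)>(?:[^()]|\([^()]*\))*\)')

def reverse_pattern(pattern, args):
    """Fill the named groups of `pattern` with the values in `args`."""
    source = getattr(pattern, 'pattern', pattern)  # accept compiled patterns too
    if source.startswith('^'):
        source = source[1:]
    if source.endswith('$') and not source.endswith('\\$'):
        source = source[:-1]
    return NAMED_GROUP.sub(lambda m: str(args[m.group(1)]), source)

h_reg = re.compile(r"/robert/(?P<action>([a-zA-Z0-9]*))/$")
url = reverse_pattern(h_reg, {"action": "delete"})
print(url)  # /robert/delete/

# Round trip: the generated string matches the original pattern again.
assert h_reg.match(url).groupdict() == {"action": "delete"}
```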
