I'm trying to get the following to work. I've looked at the Python documentation, and I still don't know how to fix it. I'm getting an AttributeError. What am I doing wrong?
import re
text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
m = re.match(r'(?P<name1>[A-Z][A-Za-z]*) (?P<name2>[A-Z][A-Za-z]*)', text)
m.group('name1')
If the above is incorrect, how do I get it to output the following?
>>> m.group('name1')
'Mitch'
You're forgetting to check that the regex actually matched anything. If it doesn't, both re.match() and re.search() return None, and calling .group() on None is what raises the AttributeError.
It may be that the named group you are trying to reference was not actually matched in that string for that pattern.
Try to call groups on the returned value and you will see a tuple of all matched groups.
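A minimal sketch of that check, using the string from the question. The one-group pattern and the variable names are assumptions made for illustration, not the asker's code. It also shows that re.match only tries the start of the string, while re.search scans forward:

```python
import re

text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('
# Simplified one-group pattern, assumed here for illustration only.
pattern = re.compile(r'(?P<name1>[A-Z][A-Za-z]*)')

anchored = pattern.match(text)   # only tries position 0, where '>' is not a capital
found = pattern.search(text)     # scans forward for the first match

for label, m in (('match', anchored), ('search', found)):
    if m is None:
        print(label, '-> no match, so calling .group() would raise AttributeError')
    else:
        print(label, '->', m.group('name1'), m.groups())
```

Testing the result against None before calling .group() or .groups() avoids the error in both cases.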
Two things stop your pattern from matching this string. First, re.match only matches at the very beginning of the string, so use re.search to find a match anywhere. Second, you have _, not a space, after the name, so I suggest using a [\s_] character class to match both alternatives. The word after the separator is also not capitalized here, so the second group needs an explicit empty alternative (the trailing | inside the group) to be really optional and let the overall match succeed.
So, in your case, you can do it like this:
(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)
                         ^^^^^                        ^^
Sample code:
import re
# Python 3: use a plain raw string (the ur'' prefix is Python 2 only)
p = re.compile(r'(?P<name1>[A-Z][A-Za-z]*)[\s_](?P<name2>[A-Z][A-Za-z]*|)')
test_str = ">:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)("
match = p.search(test_str)
if match:
    print(match.group("name1"))  # => Mitch
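If the goal is simply to pull every capitalized word out of the string, a sketch with re.findall may be enough. This is an assumption about what the asker ultimately wants, and the variable names are illustrative:

```python
import re

text = '>:{abcd|}+)_(#)_#_Mitch_(#<$)_)*zersx!)Pamela#(_+)('

# One capital letter followed by any number of ASCII letters.
names = re.findall(r'[A-Z][A-Za-z]*', text)
print(names)  # ['Mitch', 'Pamela']
```

findall returns an empty list when nothing matches, so there is no None to check for in this variant.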
So if I create a program in Python (3.7) that looks like this:
import re
regx = re.compile("test")
print(regx.findall("testest"))
and run it, then I will get:
['test']
Even though there are two instances of "test", it only shows me one. I think this is because a letter from the first "test" is also used in the second "test". How can I make a program that gives me ['test', 'test'] instead?
You will want to use a capturing group with a lookahead (?=(regex_here)):
import re
regx = re.compile("(?=(test))")
print(regx.findall("testest"))
# ['test', 'test']
re.findall and re.finditer return non-overlapping matches. After each match, scanning resumes at the end of that match, so a character that has been consumed is not examined again and overlapping occurrences are not found.
To do this you need to use a feature of Python regular expressions called a lookahead assertion. You will look for instances of the character t where it is followed by est. The lookahead does not consume parts of the string.
import re
regx = re.compile('t(?=est)')
print([m.start() for m in regx.finditer('testest')])
[0, 3]
More details on this page: https://docs.python.org/3/howto/regex.html
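The two lookahead variants above can be combined into one small helper. This is only a sketch: find_overlapping is a hypothetical name, and it assumes the pattern passed in is a plain regex source string:

```python
import re

def find_overlapping(pattern, text):
    """Return (start, matched_text) for every match, including overlapping ones."""
    # Wrapping the pattern in (?=(...)) makes each match zero-width, so the
    # scan advances one character at a time instead of jumping past the match.
    lookahead = re.compile('(?=(' + pattern + '))')
    return [(m.start(), m.group(1)) for m in lookahead.finditer(text)]

hits = find_overlapping('test', 'testest')
print(hits)  # [(0, 'test'), (3, 'test')]
```

Group 1 is always the outer wrapper, so any groups inside the original pattern are shifted up by one.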
I have tried to research answers to this question online, but nothing seems to describe the problem I have here. If I missed something, please close the question and redirect it to where it has already been answered.
That being said, my Python regex doesn't seem to want to recognize a pattern if it overlaps another captured pattern. I tried to run the code and here are the results:
>>> import re
>>> string = 'NNTSY'
>>> m = re.findall('N[^P][ST][^P]',string)
>>> m
['NNTS']
I don't understand why it didn't yield this output:
>>> m
['NNTS','NTSY']
Thanks!
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
https://docs.python.org/3/library/re.html#re.findall
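The part of that description about groups is easy to miss. A short sketch of how the return value changes with the number of groups (the sample string and variable names are made up for illustration):

```python
import re

sample = 'a1 b2 c3'

no_groups = re.findall(r'[a-z]\d', sample)       # whole matches
one_group = re.findall(r'([a-z])\d', sample)     # only the group's text
two_groups = re.findall(r'([a-z])(\d)', sample)  # one tuple per match

print(no_groups)   # ['a1', 'b2', 'c3']
print(one_group)   # ['a', 'b', 'c']
print(two_groups)  # [('a', '1'), ('b', '2'), ('c', '3')]
```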
If you're not just trying to understand why, but actually need to get overlapping matches, you can use lookahead with a capturing group as described in this question's answers.
This is in fact possible, using a lookahead assertion.
(?=pattern)
will match at any position directly followed by pattern without consuming the string, and
(?=(pattern))
will capture the group that matched.
import re
string = 'NNTSY'
m = re.findall(r'(?=(N[^P][ST][^P]))',string)
print(m)
#['NNTS', 'NTSY']
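An alternative sketch that avoids lookahead altogether: try the compiled pattern at every start position with Pattern.match(string, pos). The variable names are illustrative, and this yields at most one match per start position, just like the lookahead version:

```python
import re

pattern = re.compile(r'N[^P][ST][^P]')
string = 'NNTSY'

overlapping = []
for start in range(len(string)):
    # match() anchors the attempt at `start` instead of scanning ahead
    m = pattern.match(string, start)
    if m:
        overlapping.append(m.group())

print(overlapping)  # ['NNTS', 'NTSY']
```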
I am trying to parse the following string with Python's regex library:
recipe_a = 'run_list[sm_collectd::default#1.0.0]'
Using http://pythex.org/, I am trying out the following regex:
\[(.*)::(.*)#(.*)\]
Which yields:
Match 1
1. sm_collectd
2. default
3. 1.0.0
Here is the problem:
recipe_a matches this regex, but the match fails when the #version is no longer specified in the string. The following example does not match the pattern:
recipe_b = 'run_list[sm_collectd::default]'
\[(.*)::(.*)#(.*)\] fails in this case because # was never matched. I could have Python logic that tries \[(.*)::(.*)#(.*)\] first and then falls back to \[(.*)::(.*)\], but that would be silly. It would be nice if I could accomplish this with one regex pattern.
I have tried solving this with a conditional regex statement. The general syntax I have tried is as follows:
(?(?=regex)then|else)
The (?=regex) part is a lookahead assertion: it matches without consuming. So we could have a conditional match on the # symbol:
if the # is matched, then do \[(.*)::(.*)#(.*)\], else do \[(.*)::(.*)\].
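For reference, Python's re module does not accept a lookahead as the condition. Its conditional form is (?(id/name)yes|no), which tests whether an earlier group took part in the match. A hedged sketch of the idea under that constraint follows; the group names, the sep marker group, and parse_item are all hypothetical choices made for this example:

```python
import re

# 'sep' records whether the '#' separator was present; the conditional then
# requires a version only when it was.
run_list_item = re.compile(
    r'\[(?P<cookbook>.*)::(?P<recipe>[^#\]]*)'
    r'(?P<sep>#)?(?(sep)(?P<version>[0-9.]+))\]'
)

def parse_item(text):
    m = run_list_item.search(text)
    return m.groupdict() if m else None

with_version = parse_item('run_list[sm_collectd::default#1.0.0]')
without_version = parse_item('run_list[sm_collectd::default]')
print(with_version['version'])     # 1.0.0
print(without_version['version'])  # None
```

A plain optional group is usually simpler than a conditional for this case; the conditional is shown only because it is the closest Python equivalent of the syntax tried above.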
PROGRAMMATIC SOLUTION
import re
kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
# Slice off the 'recipe[' prefix and the trailing ']'. str.strip('recipe[') would
# strip any of those characters from both ends, not the literal prefix.
item = kitchen_recipe[len('recipe['):-1]
if '#' in item:
    cookbook, recipe, cookbook_version = re.split('::|#', item)
else:
    cookbook, recipe = re.split('::', item)
    cookbook_version = None  # no version specified
REGEX SOLUTION
import re
kitchen_recipe = 'recipe[my_cookbook::default#0.1.0]'
run_list_pattern = r'\[(.*)::([^#]*)#?([0-9.]*)\]'
# cookbook_version is '' (not None) when no version is given
cookbook, recipe, cookbook_version = re.search(run_list_pattern, kitchen_recipe).groups()
A couple of small changes should work for your purposes; try \[(.*)::([^#]*)(?:#(.*))?\]
run_list[sm_collectd::default] yields
sm_collectd
default
None
run_list[sm_collectd::default#1.0.0] yields
sm_collectd
default
1.0.0
Explanation:
I changed the final #(.*) into (?:#(.*))?. The additional (?:...) is a non-capturing group, and the following ? means "match 0 or 1 times".
I also changed the second (.*) into ([^#]*), which means match anything but a # as many times as possible, so it won't treat all of default#1.0.0 as the chef recipe name.
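Edit: As @gregory points out, you can avoid the non-capturing group. You can use \[(.*)::([^#]*)#?(.*)\]. Note that with this variant the third group is an empty string rather than None when no version is present.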
If you'd like to limit the version number to only digits and .s: \[(.*)::([^#]*)#?([0-9.]*)\].
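A quick side-by-side sketch of the two variants from this answer, run against both inputs from the question. The variable names are illustrative; the only behavioral difference is what the third group holds when the version is missing:

```python
import re

optional_group = re.compile(r'\[(.*)::([^#]*)(?:#(.*))?\]')  # version group may be unset
optional_hash = re.compile(r'\[(.*)::([^#]*)#?(.*)\]')       # version group always set

items = ['run_list[sm_collectd::default]', 'run_list[sm_collectd::default#1.0.0]']
results = [
    (optional_group.search(item).groups(), optional_hash.search(item).groups())
    for item in items
]
for item, (with_group, with_hash) in zip(items, results):
    print(item, with_group, with_hash)
```

Which one is preferable depends on whether the calling code wants to test the version with "is None" or with a plain truthiness check.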
Given a string such as r"\g<NAME>\w+", I would like to detect that a group named NAME must be used for replacements corresponding to a match.
Which regex matches wrong uses of \g<...>?
For example, the following code finds any unescaped group references.
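import re
# (?<!\\) requires that \g is not preceded by a backslash; unlike [^\\], it
# also matches at the very start of the string.
p = re.compile(r"(?<!\\)\\g<(\w+)>")
for m in p.finditer(r"\g<NAME>\w+\\g<ESCAPED>"):
    print(m.group(1))  # NAME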
But there is one last problem to solve. How can I handle cases like \g<WRONGUSE\> and \g\<WRONGUSE>?
As far as I am aware, the only restriction on named capture groups is that the name must be a valid Python identifier, so you can't put metacharacters such as . or \ in there.
Have you come across some kind of problem with named capture groups?
The regex you used, r"illegal|(\g<NAME>\w+)", is illegal because \g<NAME> is replacement-string syntax: Python's re module accepts it only in the replacement argument of re.sub, not inside a pattern. If you want to make a named capture group, it is (?P<NAME>regex), and a backreference to it inside the pattern is (?P=NAME).
Like this:
>>> import re
>>> string = "sup bro"
>>> re.sub(r"(?P<greeting>sup) bro", r"\g<greeting> mate", string)
'sup mate'
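To make the difference between the two syntaxes concrete, here is a small sketch that uses both at once: (?P=word) as a backreference inside the pattern and \g<word> in the replacement. The sentence and the names are made up for illustration:

```python
import re

# Pattern side: (?P=word) must repeat whatever the group named 'word' matched.
doubled = re.compile(r'\b(?P<word>\w+) (?P=word)\b')

# Replacement side: \g<word> inserts the text captured by that group.
result = doubled.sub(r'\g<word>', 'the the cat sat sat down')
print(result)  # the cat sat down
```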
If you wanted to do some kind of analysis on the actual regex string in use, I don't think there is anything inside the re module which can do this natively.
You would need to run another match on the regex string itself: put the regex into a string variable and then match something like \(\?P<(.*?)> against it, which gives you the named capture group's name.
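A rough sketch of that idea. The pattern source below is a made-up example with two named groups, and the extraction regex uses \w+ for the name. It is a text-level scan, so it can be fooled by things like an escaped parenthesis or a character class that happens to contain (?P<:

```python
import re

# Hypothetical regex source to inspect.
pattern_source = r"illegal|(?P<group_name>\w+)|(?P<other>\d+)"

# Find every "(?P<name>" opener and capture the name.
names = re.findall(r'\(\?P<(\w+)>', pattern_source)
print(names)  # ['group_name', 'other']
```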
I hope that is what you are asking for... Let me know.
So, what you want is to get the string of the group name, right?
Maybe you can get it by doing this:
>>> regex = re.compile(r"illegal|(?P<group_name>\w+)")
>>> dict(regex.groupindex)
{'group_name': 1}
As you see, groupindex is a mapping from the group names to their positions in the regex. Having that, it is easy to retrieve the string:
>>> # A list of the group names in your regex:
>>> list(regex.groupindex)
['group_name']
>>> # The string of your group name:
>>> list(regex.groupindex)[0]
'group_name'
Don't know if that is what you were looking for...
Use a negative lookahead?
\\g(?!<\w+>)
This searches for any \g not followed by <…>, thus a "wrong use".
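A quick sketch of that lookahead against the cases from the question. The sample list and variable names are illustrative, and the pattern does not yet account for an escaped backslash in front of \g:

```python
import re

wrong_use = re.compile(r'\\g(?!<\w+>)')

samples = [r'\g<NAME>', r'\g<WRONGUSE\>', r'\g\<WRONGUSE>']
flags = [bool(wrong_use.search(s)) for s in samples]

for s, flagged in zip(samples, flags):
    print(s, '-> wrong use' if flagged else '-> ok')
```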
Thanks to all the comments, I have this solution.
import re
# Good uses: \g<name> not preceded by a backslash.
p = re.compile(r"(?<!\\)\\g<(\w+)>")
for m in p.finditer(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print(m.group(1))
# Bad uses: an unescaped \g that is not followed by <name>.
p = re.compile(r"(?<!\\)\\g(?!<\w+>)")
if p.search(r"</\g\<at__tribut1>\\g<notattribut>>"):
    print("Wrong use !")
I'm trying to use the re module so that it returns a run of characters up to the point where a particular string follows an individual character. The re documentation seems to indicate that I can use (?!...) to accomplish this. The example that I'm currently wrestling with:
import re
str_to_search = 'abababsonab, etc'
first = re.search(r'(ab)+(?!son)', str_to_search)
second = re.search(r'.+(?!son)', str_to_search)
first.group() is 'abab', which is what I'm aiming for. However, second.group() returns the entire str_to_search string, despite the fact that I'm trying to make it stop at 'ababa', as the subsequent 'b' is immediately followed by 'son'. Where am I going wrong?
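The .+ in your second pattern is greedy: it runs to the end of the string, where nothing follows, so (?!son) is trivially satisfied and the whole string matches. It's not the simplest thing, but you can instead match a repeating sequence of "a character not followed by 'son'". This repeated expression should be in a non-capturing group, (?: ... ), so it doesn't mess with your match results. (Otherwise you'd end up with an extra match group.)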
Try this:
import re
str_to_search = 'abababsonab, etc'
second = re.search(r'(?:.(?!son))+', str_to_search)
print(second.group())
Output:
ababa
See it here: http://ideone.com/6DhLgN
This should work:
second = re.search(r'(.(?!son))+', str_to_search)
#output: 'ababa'
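The two versions return the same overall match; the only difference is the extra capture group mentioned earlier. A short sketch comparing them (variable names are illustrative):

```python
import re

str_to_search = 'abababsonab, etc'

non_capturing = re.search(r'(?:.(?!son))+', str_to_search)
capturing = re.search(r'(.(?!son))+', str_to_search)

print(non_capturing.group(), non_capturing.groups())  # ababa ()
print(capturing.group(), capturing.groups())          # ababa ('a',)
```

With the capturing version, group 1 holds only the last repetition, which is rarely what the caller wants.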
Not sure what you are trying to do, but:
check out str.partition
'.+?' is the minimal (non-greedy) matcher; otherwise '.+' is greedy and gets it all
read the docs for group(...) and groups(...), especially when passing a group number
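A sketch of the first two suggestions, assuming 'son' is the marker to stop at. The lookahead (?=.son) is an addition made for this example so that the lazy match stops one character before the marker. Note that str.partition splits exactly at 'son', so it keeps one character more than the regex result:

```python
import re

str_to_search = 'abababsonab, etc'

# Lazy match: expand one character at a time until "<any char>son" follows.
lazy = re.search(r'.+?(?=.son)', str_to_search)
print(lazy.group())  # ababa

# No regex at all: split around the first occurrence of 'son'.
before, sep, after = str_to_search.partition('son')
print(before)  # ababab
```

If 'son' does not occur in the string, the regex returns None while partition returns the whole string with two empty strings, so the two approaches also differ in how a missing marker is reported.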