Regex with repeating groups

I've been trying to match a phrase between hyphens. I realise that I can easily just split on the hyphen and get out the phrases but my equivalent regex for this is not working as expected and I want to understand why:
([^-,]+(?:(?: - )|$))+
[^-,]+ is just my definition of a phrase
(?: - ) is just the non capturing space delimited hyphen
so (?:(?: - )|$) matches (without capturing) either a space-delimited hyphen or the end of the line
Finally, the whole thing surrounded in parentheses with a + quantifier matches more than one.
What I get if I perform regex.match("A - B - C").groups() is ('C',)
I've also tried the much simpler regex ([^,-]+)+ with similar results
I'm using re.match because I wanted to use pandas.Series.str.extract to apply this to a very long list.
To reiterate: I'm now using an easy split on a hyphen but why isn't this regex returning multiple groups?
Thanks

Regular expression capturing groups are “named” statically by their appearance in the expression. Each capturing group gets its own number, and matches are assigned to that group regardless of how often a single group captures something.
If a group captured something before and later does so again, the later result overwrites what was captured before. There is no way to collect all of a group's captured values in a single ordinary match.
If you want to find multiple values, you will need to match only a single group and repeat matching on the remainder of the string. This is commonly done by re.findall or re.finditer:
>>> re.findall(r'\s*([^-,]+?)\s*', 'A - B - C')
['A', 'B', 'C']
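To make the overwriting visible next to the repeated-search approach, here is a minimal sketch. It assumes the input from the question; the single-group pattern `([^-,]+)` combined with `strip()` is my own illustrative choice, not the only way to write it.

```python
import re

text = "A - B - C"

# A repeated capturing group keeps only the text of its last iteration.
repeated = re.match(r"([^-,]+(?: - |$))+", text)
last_only = repeated.groups()

# Matching one phrase per match and repeating the search collects them all.
phrases = [m.group(1).strip() for m in re.finditer(r"([^-,]+)", text)]

print(last_only)  # ('C',)
print(phrases)    # ['A', 'B', 'C']
```

Unlike the lazy pattern above, this variant also keeps multi-character phrases such as "foo bar" in one piece.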

Related

Understanding the (\D\d)+ Regex pattern in Python

I'm spending some time trying to understand the Regex terminology in Python 3 and I can't figure out how (\D\d)+ works here.
I know that \D represents a nondigit character and that \d represents a digit character, and also that the plus sign + represents one or more repetitions of the preceding expression. But when I try the following code, I simply can't wrap my head around the result.
Input:
import re
text = "a1 b2 c3 d4e5f6"
regex = re.findall(r'(\D\d)+',text)
print(regex)
Output:
['a1', 'b2', 'c3', 'f6']
Since the regex includes a plus sign, shouldn't it also output d4e5f6, as that is a sequence of nondigit and digit characters?

You aren't directly repeating the \D\d subpattern with the +, you are repeating a capturing group (indicated by parentheses) that contains that subpattern. The final match is indeed of the text d4e5f6, but it does so as three instances of the capturing group, each one of which overwrites the last. And the behavior of Python's re.findall() in the presence of capturing groups is that it returns THEM (as a tuple, if there's more than one) instead of the overall match.
There is a third-party regex module (available on PyPI, not part of the standard library) that is capable of returning multiple captures for a single capturing group, although I'm not exactly sure how that interacts with findall(). You could also write the expression as (?:\D\d)+ instead: (?: starts a non-capturing group, so findall() will give you the entire match as you expect.
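A short sketch of the three variants on the string from the question, using only the standard re module. The variable names are just for illustration.

```python
import re

text = "a1 b2 c3 d4e5f6"

# With a capturing group, findall() returns the group's last capture per match.
capturing = re.findall(r"(\D\d)+", text)

# With a non-capturing group, findall() returns the whole match.
non_capturing = re.findall(r"(?:\D\d)+", text)

# finditer() gives match objects, so group(0) is available even with a capturing group.
whole = [m.group(0) for m in re.finditer(r"(\D\d)+", text)]

print(capturing)      # ['a1', 'b2', 'c3', 'f6']
print(non_capturing)  # ['a1', 'b2', 'c3', 'd4e5f6']
print(whole)          # ['a1', 'b2', 'c3', 'd4e5f6']
```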

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using python regex, and I'm trying to match all consecutively repeated words, such as the bold in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?

[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
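As a quick check of how far a quantifier reaches, here is a small sketch; the test strings are made up for illustration.

```python
import re

# Without a group, + applies only to \d: one x, then one or more digits.
plain_ok = re.fullmatch(r"x\d+", "x123") is not None
plain_repeat = re.fullmatch(r"x\d+", "x1x2") is not None

# With a (non-capturing) group, + applies to the whole "x followed by a digit" unit.
grouped_repeat = re.fullmatch(r"(?:x\d)+", "x1x2") is not None

print(plain_ok, plain_repeat, grouped_repeat)  # True False True
```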
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.
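To see the difference between "any two words" and "the same word twice", here is a rough sketch on the sentence from the question. It uses \s+ between the words, which is the optional variation mentioned above.

```python
import re

s = "I am struggling to to make this this work"

# Grouping alone repeats the pattern, not the text: any two words match.
any_two = re.search(r"([A-Za-z0-9]* ){2}", s).group(0)

# A backreference requires the same text again.
doubled = [m.group(0) for m in re.finditer(r"\b(\w+)\s+\1\b", s)]

# The word boundaries keep the pattern from matching inside longer words.
inside_fooo = re.search(r"\b(\w+)\s\1\b", "foo fooo")
inside_oo = re.search(r"\b(\w+)\s\1\b", "foo oo")

print(repr(any_two))           # 'I am '
print(doubled)                 # ['to to', 'this this']
print(inside_fooo, inside_oo)  # None None
```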

You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
>>> ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
>>> [('to', 'to'), ('this', 'this')]

At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.

This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']

Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'
def find_duplicates_in_string(words):
    """ Takes in a string and returns any duplicate words
    i.e. "this this"
    """
    duplicates = []
    words = words.split()
    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates
print(find_duplicates_in_string(sentence))

re.sub part of string: (?: ...) mystery [duplicate]

I have a character string:
temp = '4424396.6\t1\tk__Bacteria\tp__Firmicutes\tc__Erysipelotrichi\to__Erysipelotrichales'
And I need to get rid of tabulations only in between taxonomy terms.
I tried
re.sub(r'(?:\D{1})\t', ',', temp)
It came quite close, but also replaced the letter before tabs:
'4424396.6\t1\tk__Bacteri,p__Firmicute,c__Erysipelotrich,o__Erysipelotrichales'
I am confused as re documentation for (?:...) goes:
...the substring matched by the group cannot be retrieved after
performing a match or referenced later in the pattern.
The last letter was within the parenthesis, so how could it be replaced?
PS
I used re.sub(r'(?<=\D{1})(\t)', ',', temp) and it works perfectly fine, but I can't understand what's wrong with the first regexp

The text matched by (?:...) does not form a capture group, unlike the text matched by (...), and therefore cannot be referred to later with a backreference such as \1. However, it's still part of the overall match, and so it is part of the text that re.sub() will replace.
The point of non-capturing groups is that they are slightly more efficient, and may be required in uses such as re.split() where the mere existence of capturing groups will affect the output.
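The re.split() case can be illustrated with a short sketch; the input string is borrowed from an unrelated example and is only an assumption for demonstration.

```python
import re

text = "A - B - C"

# A capturing group in the separator makes split() return the separators too.
with_capture = re.split(r"( - )", text)

# A non-capturing group groups the separator without changing the output.
without_capture = re.split(r"(?: - )", text)

print(with_capture)     # ['A', ' - ', 'B', ' - ', 'C']
print(without_capture)  # ['A', 'B', 'C']
```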

According to the documentation, (?:...) specifies a non-capturing group. It explains:
Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents.
What this means is that anything that matches the ... expression (in your case, the preceding letter) will not be captured as a group but will still be part of the match. The only thing special about this is that you won't be able to access the part of the input captured by this group using match.group:
Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group
In contrast, (?<=...) is a positive lookbehind assertion; the regular expression will check to make sure any matches are preceded by text matching ..., but that text is not consumed and does not become part of the match, so re.sub() leaves it untouched.
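Here is a small sketch comparing the approaches on a shortened version of the string from the question (the shortening is mine). The third variant, which captures the letter and writes it back with \1, is an extra alternative not mentioned above.

```python
import re

temp = "6.6\t1\tk__Bacteria\tp__Firmicutes"

# Non-capturing group: the letter is still part of the match, so it is replaced.
consumed = re.sub(r"(?:\D)\t", ",", temp)

# Lookbehind: the letter is only checked, not matched, so it survives.
lookbehind = re.sub(r"(?<=\D)\t", ",", temp)

# Capturing group: the letter is matched, then put back via \1.
put_back = re.sub(r"(\D)\t", r"\1,", temp)

print(repr(consumed))    # '6.6\t1\tk__Bacteri,p__Firmicutes'
print(repr(lookbehind))  # '6.6\t1\tk__Bacteria,p__Firmicutes'
print(repr(put_back))    # '6.6\t1\tk__Bacteria,p__Firmicutes'
```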

Capturing groups and greediness in Python

Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print(m.group()) # abcdefghi
print(m.groups()) # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print(m.group()) # abc
print(m.groups()) # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print(re.findall(r"(\w{3})+", S)) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print(re.findall(r"(\w{3})+?", S)) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.

You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+", you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. As with the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the \w{3} block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).
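The behavior described above can be checked with a short sketch. The spans come from finditer(), and the trailing \w variants are the "repeat not at the end" patterns mentioned at the start of this answer.

```python
import re

S = "abcdefghi"

# Greedy at the end of the pattern: one match that consumes everything.
greedy_spans = [m.span() for m in re.finditer(r"(\w{3})+", S)]

# Non-greedy at the end: each match stops after one repetition, then findall/finditer loops.
lazy_spans = [m.span() for m in re.finditer(r"(\w{3})+?", S)]

# With something after the repeat, the rest of the regex must still be satisfied.
greedy_then_w = re.search(r"(\w{3})+\w", S).group(0)
lazy_then_w = re.search(r"(\w{3})+?\w", S).group(0)

print(greedy_spans)   # [(0, 9)]
print(lazy_spans)     # [(0, 3), (3, 6), (6, 9)]
print(greedy_then_w)  # abcdefg
print(lazy_then_w)    # abcd
```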

Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with the PyPI regex module, you can. Its match object has a captures method. With re, you can just match them with re.findall(r'\w{3}', S). In that case, however, you will match all chunks of three word characters anywhere in the string, not just the consecutive ones. With the regex module, you can get all the consecutive 3-character chunks from the beginning of the string with the help of the \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.
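If installing the regex module is not an option, one possible workaround with plain re is to match the whole consecutive run first and then cut it into chunks. This is only a sketch of that idea; the sample string with the "--jkl" tail is made up to show the difference from a bare findall.

```python
import re

S = "abcdefghi--jkl"

# Step 1: match the consecutive run of 3-character blocks at the start.
run = re.match(r"(?:\w{3})+", S)

# Step 2: split only that run into its 3-character chunks.
chunks = re.findall(r"\w{3}", run.group(0)) if run else []

# For comparison: findall on the whole string also picks up the later, separate chunk.
everywhere = re.findall(r"\w{3}", S)

print(chunks)      # ['abc', 'def', 'ghi']
print(everywhere)  # ['abc', 'def', 'ghi', 'jkl']
```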

difference between two regular expressions: [abc]+ and ([abc])+

In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']
In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']
Confused by the grouped one. How does it make difference?

There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.
In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.
But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.
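A minimal sketch of what the match object holds compared with what findall() hands back:

```python
import re

m = re.search(r"([abc])+", "abc")

overall = m.group(0)  # the whole match
last = m.group(1)     # only the last iteration of the group

found = re.findall(r"([abc])+", "abc")

print(overall, last)  # abc c
print(found)          # ['c']
```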

In the first example you have a repeated capturing group, which only captures the last iteration. Here, that is c.
([abc])+
In the second example you are matching a single character from the list between one and unlimited times.
[abc]+

Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get several captured groups. What ends up happening, at least for Python's regex and most implementations, is that the "+" keeps iterating and each iteration overwrites the group, so at the end the capture group only contains the last match.
If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).
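For illustration, here is the effect of swapping the parentheses and the "+"; the input string is an arbitrary example.

```python
import re

text = "abcxcab"

# Quantifier outside the group: each match reports only its last character.
outside = re.findall(r"([abc])+", text)

# Quantifier inside the group: the group captures the whole repeated run.
inside = re.findall(r"([abc]+)", text)

print(outside)  # ['c', 'b']
print(inside)   # ['abc', 'cab']
```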

input "abc"
[abc]
match a single character => "a"
[abc]+
+ Between one and unlimited times, as many times as possible => "abc"
([abc])
Capturing group ([abc]) => "a"
([abc])+
+ A repeated capturing group will only capture the last iteration => "c"

Grouping changes how the pattern is evaluated.
([abc])+ => Find one character from the selection, then repeat that group one or more times. This breaks the regex into two stages, and the group holds one character at a time.
The ungrouped [abc]+, by contrast, is treated as a whole.
