Capturing groups and greediness in Python

Capturing groups and greediness in Python - python

Recently I have been playing around with regex expressions in Python and encountered a problem with r"(\w{3})+" and with its non-greedy equivalent r"(\w{3})+?".
Please let's take a look at the following example:
S = "abcdefghi" # string used for all the cases below
1. Greedy search
m = re.search(r"(\w{3})+", S)
print m.group() # abcdefghi
print m.groups() # ('ghi',)
m.group is exactly as I expected - just whole match.
Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right? If yes, then can I capture all overwritten groups as well? Of course, for this particular string I could just write m = re.search(r"(\w{3})(\w{3})(\w{3})", S) but I am looking for a more general way to capture groups not knowing how many of them I can expect, thus metacharacter +.
2. Non-greedy search
m = re.search(r"(\w{3})+?", S)
print m.group() # abc
print m.groups() # ('abc',)
Now we are not greedy so only abc was found - exactly as I expected.
Regarding m.groups(), the engine stopped when it found abc so I understand that this is the only found group here.
3. Greedy findall
print re.findall(r"(\w{3})+", S) # ['ghi']
Now I am truly perplexed, I always thought that function re.findall finds all substrings where the RE matches and returns them as a list. Here, we have only one match abcdefghi (according to common sense and bullet 1), so I expected to have a list containing this one item. Why only ghi was returned?
4. Non-greedy findall
print re.findall(r"(\w{3})+?", S) # ['abc', 'def', 'ghi']
Here, in turn, I expected to have abc only, but maybe having bullet 3 explained will help me understand this as well. Maybe this is even the answer for my question from bullet 1 (about capturing overwritten groups), but I would really like to understand what is happening here.

You should think about the greedy/non-greedy behavior in the context of your regex (r"(\w{3})+") versus a regex where the repeating pattern was not at the end: (r"(\w{3})+\w")
It's important because the default behavior of regex matching is:
The entire regex must match
Starting as early in the target string as possible
Matching as much of the target string as possible (greedy)
If you have a "repeat" operator - either * or + - in your regex, then the default behavior is for that to match as much as it can, so long as the rest of the regex is satisfied.
When the repeat operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as much as it can.
If you have a repeat operator with a non-greedy qualifier - *? or +? - in your regex, then the behavior is to match as little as it can, so long as the rest of the regex is satisfied.
When the repeat-nongreedy operator is at the end of the pattern, there is no rest of the regex, so the behavior becomes match as little as it can.
All that is in just one match. You are mixing re.findall() in as well, which will then repeat the match, if possible.
The first time you run re.findall, with r"(\w{3})+" you are using a greedy match at the end of the pattern. Thus, it will try to apply that last block as many times as possible in a single match. You have the case where, like the call to re.search, the single match consumes the entire string. As part of consuming the entire string, the w3 block gets repeated, and the group buffer is overwritten several times.
The second time you run re.findall, with r"(\w{3})+?" you are using a non-greedy match at the end of the pattern. Thus, it will try to apply that last block as few times as possible in a single match. Since the operator is +, that would be 1. Now you have a case where the match can stop without consuming the entire string. And now, the group buffer only gets filled one time, and not overwritten. Which means that findall can return that result (abc), then loop for a different result (def), then loop for a final result (ghi).

Regarding m.groups please confirm: ghi is printed because it has overwritten previous captured groups of def and abc, am I right?
Right. Only the last captured text is stored in the group memory buffer.
can I capture all overwritten groups as well?
Not with re, but with PyPi regex, you can. Its match object has a captures method. However, with re, you can just match them with re.findall(r'\w{3}', S). However, in this case, you will match all 3-word character chunks from the string, not just those consecutive ones. With the regex module, you can get all the 3-character consecutive chunks from the beginning of the string with the help of \G operator: regex.findall(r"\G\w{3}", "abcdefghi") (result: abc, def, ghi).
Why only ghi was returned with re.findall(r"(\w{3})+", S)?
Because there is only one match that is equal to the whole abcdefghi string, and Capture group 1 contains just the last three characters. re.findall only returns the captured values if capturing groups are defined in the pattern.

Related

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using python regex, and I'm trying to match all consecutively repeated words, such as the bold in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?

[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.

You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
>>> ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
>>> [('to', 'to'), ('this', 'this')]

At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.

This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']

Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'
def find_duplicates_in_string(words):
""" Takes in a string and returns any duplicate words
i.e. "this this"
"""
duplicates = []
words = words.split()
for i in range(len(words) - 1):
prev_word = words[i]
word = words[i + 1]
if word == prev_word:
duplicates.append(word)
return duplicates
print(find_duplicates_in_string(sentence))

Add multiplication signs (*) between coefficients

I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.
For example:
3x-> 3*x
4(x+5) -> 4*(x+5)
sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4)
Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. not complex regexes, not because they're pythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are ok. )
I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.

A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.
To accomplish your goal, use the following statement:
import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)
I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.
As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.
A more in-depth analysis of the regex follows:
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
(?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
\d : Matches any numerical digit once. A regex of \d all its own will match, separately, "1", "2", and "3" of "123".
+ : Matches the previous element one or more times. In this case, the previous element is \d, any number. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
| : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
(?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
[a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
\w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
* : Matches the previous element 0 or more times. While initially seeming unnecessary, the star character effectively makes an element optional. In this case, our modified element is \w, but a function doesn't technically need any more than one character; A() is a valid function name. A would be matched by [a-zA-Z], making \w unnecessary. On the other end of the spectrum, there could be any number of characters following the first letter, which is why we need this modifier.
\( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as parentheses, +, or * with a backslash, it uses it like a normal character. \( matches an opening parenthesis, for the actual function call part of the function.
\w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
\) : Like \(, but matches a closing parenthesis
((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
(?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. However, note that this is in a non-capturing group: this is important, because of the way the OR checks. The OR immediately following this looks at this group as a whole. If this was not grouped, the "last object matched" would be \w*, which would not be sufficient for what we want. It would say: "match one letter followed by more letters OR one letter followed by a parenthesis". Putting this element in a non-capturing group allows us to control what the OR registers.
| : Or character. Matches (?:[a-zA-Z]\w*) or \(.
\( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.
Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"
I think that about sums it up!

Python Regex Behaviour

I'm trying to parse a text document with data in the following format: 24036 -977. I need to separate the numbers into separate values, and the way I've done that is with the following steps.
values = re.search("(.*?)\s(.*)")
x = values.group(1)
y = values.gropu(2)
This does the job, however I was curious about why using (.*?) in the second group causes the regex to fail? I tested it in the online regex tester(https://regex101.com/r/bM2nK1/1), and adding the ? in causes the second group to return nothing. Now as far as I know .*? means to take any value unlimited times, as few times as possible, and the .* is just the greedy version of that. What I'm confused about is why the non greedy version.*? takes that definition to mean capturing nothing?

Because it means to match the previous token, the *, as few times as possible, which is 0 times. If you would it to extend to the end of the string, add a $, which matches the end of string. If you would like it to match at least one, use + instead of *.
The reason the first group .*? matches 24036 is because you have the \s token after it, so the fewest amount of characters the .*? could match and be followed by a \s is 24036.

#iobender has pointed out the answer to your question.
But I think it's worth mentioning that if the numbers are separated by space, you can just use split:
>>> '24036 -977'.split()
['24036', '-977']
This is simpler, easier to understand and often faster than regex.

How does regex {m,n}? work in Python?

From the Python documentation of the re module:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
I'm confused about how this works. How is this any different from {m}? I do not see how there could ever be a case where the pattern could match more than m repetitions. If there are m+1 repetitions in a row, then there are also m. What am I missing?

Whereas, it is true that a regex solely containing a{3,5}? and one with the pattern: a{3} will match the same thing (i.e. re.match(r'a{3,5}?', 'aaaaa').group(0) and re.match(r'a{3}', 'aaaaa').group(0)
will both return 'aaa'), the differences between the patterns becomes clear when you look at patterns containing these two elements. Say your pattern is a{3,5}?b, then aaab, aaaab, and aaaaab will be matched. If you just used a{3}b then only aaab would get matched. aaaab and aaaaab would not.
Look to Shashank's answer for examples that flush out this difference a little more, or test your own. I've found that this site is a good resource to use to test out python regular expressions.

I think the way to see the difference between the two is through the following examples:
>>> re.findall(r'ab{3,5}?', 'abbbbb')
['abbb']
>>> re.findall(r'ab{3}', 'abbbbb')
['abbb']
Those two runs give the same results as expected, but let's see some differences.
Difference 1: A range quantifier on a subpattern lets you match a large range of patterns containing that subpattern. This lets you find matches where there normally wouldn't be any if you used an exact quantifier:
>>> re.findall(r'ab{3,5}?c', 'abbbbbc')
['abbbbbc']
>>> re.findall(r'ab{3}c', 'abbbbbc')
[]
Difference 2: Greedy doesn't necessarily mean "match the shortest subpattern possible". It's actually a bit more like "match the shortest subpattern possible starting from the leftmost unmatched index that can possibly start off a match":
>>> re.findall(r'b{3,5}?c', 'bbbbbc')
['bbbbbc']
>>> re.findall(r'b{3}c', 'bbbbbc')
['bbbc']
The way I think of regex is as a construct that scans the string from left to right with two iterators that point to indices in the string. The first iterator marks the beginning of the next possible pattern. The second iterator goes through the suffix of the substring starting from the first iterator and tries to complete the pattern. The first iterator only advances when the construct determines that the regex pattern cannot possibly match a string starting from that index. Thus, defining a range for your quantifier will make it so that the first iterator will keep matching sub-patterns beyond the minimum value specified even if the quantifier is non-greedy.
A non-greedy regex will stop its second iterator as soon as the pattern can stop, but a greedy regex will "save" the position of a matched pattern and keep searching for a longer one. If a longer pattern is found, then it uses that one instead, if it's not found, then it uses the shorter one that it saved in memory earlier.
That's why you see the possibly surprising result with 'b{3,5}?c' and 'bbbbbc'. Although the regex is greedy, it will still never advance its first iterator until the pattern match fails, and that's why the substring with 5 'b' characters is matched by the non-greedy regex even though its not the shortest pattern matchable.

SwankSwashbucklers's answer describes the greedy version. The ? makes it non-greedy, which means it will try to match as few items as possible, which means that
`re.match('a{3,5}?b', 'aaaab').group(0)` # returns `'aaaab'`
but
`re.match('a{3,5}?', 'aaaa').group(0)` # returns `'aaa'`

let say we have a string to be searched is:
str ="aaaaa"
Now we have patter = a{3,5}
The string which it matches are :{aaa,aaaa,aaaaa}
But here we have string as "aaaaa" since we have only one option.
Now lets say we have pattern = a{3,5}?
in this case it matches only "aaa" not "aaaaa".
Thus it takes the minimum items as possible,being non greedy.
please try using online regular Expression at :https://pythex.org/
It will be great help and we check immediately what it matches and what it does not

Don’t understand lazy regex

Say we have a string 1abcd1efg1hjk1lmn1 and want to find stuff between 1-s. What we do is
re.findall('1.*?1','1abcd1efg1hjk1lmn1')
and get two results
['1abcd1', '1hjk1']
ok I get that. But if we do
re.findall('1.*?1hj','1abcd1efg1hjk1lmn1')
why does it grab TWO intervals between 1s instead of one? Why do we get ['1abcd1efg1hj'] instead of ['1efg1hj']? Isn’t this what laziness is supposed to do?

Regex always tries to match the input string from left to right. Consider your '1.*?1hj' regex. 1 in your regex matches the first one and the following .*? matches all the characters upto the 1hj sub-string non-greedily. So that you got ['1abcd1efg1hj'] instead of ['1efg1hj']
To get ['1efg1hj'] as output, you need use a negated class as 1[^1]*1hj
>>> s = "1abcd1efg1hjk1lmn1"
>>> re.findall(r'1.*?1hj', s)
['1abcd1efg1hj']
>>> re.findall(r'1[^1]*1hj', s)
['1efg1hj']

['1abcd1efg1hj']
You get this because this satisfies your regex. 1.*?1hj essentially means start from 1 then move lazily till you find 1 followed by hj. The 1 in between if followed by ef so that will not match but . will consume all. You don't get ['1efg1hj'] because that string has already been consumed by the first match.Use lookahead to see that both satisfy the conditions. See demo.
A lookahead does not consume string so you get both the match,
https://regex101.com/r/aQ3zJ3/5

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.