Python regex issue from Kuchling's tutorial - python

Referring to http://docs.python.org/howto/regex.html section Non-capturing and named groups, I saw an example whose output is not obvious to me.
>>> import re
>>> m = re.match(r"([abc])+", "abc")
>>> m.groups()
('c',)
Here I am not able to understand why group(1) is 'c', and also why I see a dangling comma at the end. Can somebody help?

I don't know about the dangling comma, but your first group is c because you let the group repeat itself by putting the + after the group, and not after the character class inside it.
This way, the regex first matches a, which is assigned to group 1. Then, it matches b, which is also assigned to group 1. And at last, it matches c, assigned to group 1 and finishes; thus leaving your group 1 to be c.
If you write ([abc]+), your group 1 will be abc.
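A minimal sketch of the two placements of +, using only the standard re module. It also shows where the "dangling comma" comes from: groups() returns a tuple, and Python prints a one-element tuple with a trailing comma. The variable names are illustrative only.

```python
import re

# Quantifier outside the group: group 1 is overwritten on every repetition,
# so only the last character matched is kept.
last_only = re.match(r"([abc])+", "abc").groups()

# Quantifier inside the group: group 1 spans the whole run.
whole_run = re.match(r"([abc]+)", "abc").groups()

print(last_only)       # ('c',)   <- one-element tuple, hence the trailing comma
print(whole_run)       # ('abc',)
print(len(last_only))  # 1
```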

Related

python regex where a set of options can occur at most once in a list, in any order

I'm wondering if there's any way in python or perl to build a regex where you can define a set of options can appear at most once in any order. So for example I would like a derivative of foo(?: [abc])*, where a, b, c could only appear once. So:
foo a b c
foo b c a
foo a b
foo b
would all be valid, but
foo b b
would not be
You may use this regex with a capture group and a negative lookahead:
For Perl, you can use this variant with forward referencing:
^foo((?!.*\1) [abc])+$
RegEx Demo
RegEx Details:
^: Start
foo: Match foo
(: Start a capture group #1
(?!.*\1): Negative lookahead to assert that we don't match what we have in capture group #1 anywhere in input
 [abc]: Match a space followed by a or b or c (note the leading space in the pattern)
)+: End capture group #1. Repeat this group 1+ times
$: End
As mentioned earlier, this regex is using a feature called Forward Referencing which is a back-reference to a group that appears later in the regex pattern. JGsoft, .NET, Java, Perl, PCRE, PHP, Delphi, and Ruby allow forward references but Python doesn't.
Here is a workaround for Python that expresses the same check without forward referencing:
^foo(?!.* ([abc]).*\1)(?: [abc])+$
Here we use a negative lookahead before repeated group to check and fail the match if there is any repeat of allowed substrings i.e. [abc].
RegEx Demo 2
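As a quick check, the Python-compatible pattern can be run against the inputs from the question. This is only a sketch; the names NO_REPEAT, samples and results are made up for the illustration.

```python
import re

# Python-compatible pattern from the answer above (no forward reference).
NO_REPEAT = re.compile(r"^foo(?!.* ([abc]).*\1)(?: [abc])+$")

samples = ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b"]
results = {text: bool(NO_REPEAT.match(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {matched}")
```

Only the last input, which repeats b, should be rejected.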
You can assert that there is no match for a second match for a space and a letter at the right:
foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*
foo Match literally
(?! Negative lookahead
(?: [abc])* Match optional repetitions of a space and a, b or c
( [abc]) Capture group 1: a space and a, b or c, compared later with the backreference
(?: [abc])* Match again optional repetitions of a space and a, b or c
\1 Backreference to group 1
) Close lookahead
(?: [abc])* Match optional repetitions of a space and a, b or c
Regex demo
If you don't want to match only foo, you can change the quantifier to 1 or more (?: [abc])+
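The difference between the * and + quantifiers at the end can be seen with re.fullmatch. A small sketch, assuming the whole input should be matched; the constant and variable names are hypothetical.

```python
import re

# Pattern from the answer above; group 1 is a space plus a letter.
AT_MOST_ONCE = re.compile(r"foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])*")
# Variant that requires at least one option after "foo".
AT_LEAST_ONE = re.compile(r"foo(?!(?: [abc])*( [abc])(?: [abc])*\1)(?: [abc])+")

bare_foo_star = AT_MOST_ONCE.fullmatch("foo") is not None   # bare "foo" allowed
bare_foo_plus = AT_LEAST_ONE.fullmatch("foo") is not None   # bare "foo" rejected
repeat_rejected = AT_MOST_ONCE.fullmatch("foo b b") is None
all_three = AT_MOST_ONCE.fullmatch("foo c a b") is not None

print(bare_foo_star, bare_foo_plus, repeat_rejected, all_three)
```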
A variant in perl reusing the first subpattern using (?1) which refers to the capture group ([abc])
^foo ([abc])(?: (?!\1)((?1))(?: (?!\1|\2)(?1))?)?$
Regex demo
If it doesn't have to be a regex:
import collections

# python >=3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and (collections.Counter(words) <= collections.Counter(['foo', 'a', 'b', 'c']))
    )

# python <3.10
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and not (collections.Counter(words) - collections.Counter(['foo', 'a', 'b', 'c']))
    )
# TESTING
#foo a b c True
#foo b c a True
#foo a b True
#foo b True
#foo b b False
Or with a set and the walrus operator:
def is_a_match(sentence):
    words = sentence.split()
    return (
        (len(words) > 0)
        and (words[0] == 'foo')
        and (
            (s := set(words[1:])) <= set(['a', 'b', 'c'])
            and len(s) == len(words) - 1
        )
    )
You can do it using references to previously captured groups.
foo(?: ([abc]))?(?: (?!\1)([abc]))?(?: (?!\1|\2)([abc]))?$
This gets quite long with many options. Such a regex can be generated dynamically, if necessary.
def match_sequence_without_repeats(options, separator):
    def prevent_previous(n):
        if n == 0:
            return ""
        # Join with "|" so the lookahead rejects ANY earlier capture,
        # e.g. (?!\1|\2), matching the hand-written pattern above.
        groups = "|".join(rf"\{i}" for i in range(1, n + 1))
        return f"(?!{groups})"

    return "".join(
        f"(?:{separator}{prevent_previous(i)}([{options}]))?"
        for i in range(len(options))
    )

print(f"foo{match_sequence_without_repeats('abc', ' ')}$")
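The generated pattern can be checked against the hand-written one and a few inputs. The sketch below is self-contained, so it rebuilds the pattern with a hypothetical helper, build_no_repeat_pattern, and assumes the earlier groups are joined with | inside the lookahead, as in the hand-written regex.

```python
import re

def build_no_repeat_pattern(options, separator):
    """Hypothetical helper: one optional group per option, each guarded by a
    lookahead that rejects anything captured by an earlier group."""
    parts = []
    for i in range(len(options)):
        earlier = "|".join(f"\\{n}" for n in range(1, i + 1))
        guard = f"(?!{earlier})" if earlier else ""
        parts.append(f"(?:{separator}{guard}([{options}]))?")
    return "".join(parts)

pattern = f"foo{build_no_repeat_pattern('abc', ' ')}$"
print(pattern)

checks = {
    text: bool(re.match(pattern, text))
    for text in ["foo", "foo a b c", "foo c a", "foo b b"]
}
print(checks)
```

Python's re treats a backreference to a group that has not participated as a failed match, which is why the negative lookaheads still pass when an earlier optional group was skipped.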
Here is a modified version of anubhava's answer, using a backreference (which works in Python, and is easier to understand at least for me) instead of a forward reference.
Match using [abc] inside a capturing group, then check that the text matched by the capturing group does not appear again anywhere after it:
^foo(?:( [abc])(?!.*\1))+$
regex demo
^: Start
foo: Match foo
(?:: Start non-capturing group (?:( [abc])(?!.*\1))
( [abc]): Capturing Group 1, matching a space followed by either a, b, or c
(?!.*\1): Negative lookahead, failing to match if the text matched by the first capturing group occurs after zero or more characters matched by .
)+: End non-capturing group and match it 1 or more times
$: End
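A short sketch running the backreference version in Python. The sample list and the names BACKREF and accepted are illustrative, not part of the original answer.

```python
import re

# Each captured " x" must not appear again anywhere later in the input.
BACKREF = re.compile(r"^foo(?:( [abc])(?!.*\1))+$")

samples = ["foo a b c", "foo b c a", "foo a b", "foo b", "foo b b", "foo a b a"]
accepted = [text for text in samples if BACKREF.match(text)]

print(accepted)
```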
I have assumed that the elements of the string can be in any order and appear any number of times. For example, 'a foo' should match and 'a foo b foo' should not.
You can do that with a series of alternations employing lookaheads, one for each substring of interest, but it becomes a bit of a dog's breakfast when there are many strings to consider. Let's suppose you wanted to match zero or one "foo"'s and/or zero or one "a"'s. You could use the following regular expression:
^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
Start your engine!
This matches, for example, 'foofoo', 'aa' and 'afooa'. If they are not to be matched, remove the word boundaries (\b).
Notice that this expression begins by asserting the start of the string (^) followed by two positive lookaheads, one for 'foo' and one for 'a'. To also check for, say, 'c' one would tack on
(?:(?!.*\bc\b)|(?=(?:(?!\bc\b).)*\bc\b(?!(.*\bc\b))))
which is the same as
(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))
with \ba\b changed to \bc\b.
It would be nice to be able to use back-references but I don't see how that could be done.
By hovering over the regular expression in the link an explanation is provided for each element of the expression. (If this is not clear I am referring to the cursor.)
Note that
(?!\bfoo\b).
matches a character provided it does not begin the word 'foo'. Therefore
(?:(?!\bfoo\b).)*
matches a substring that does not contain 'foo' and does not end with 'f' followed by 'oo'.
Would I advocate this approach in practice, as opposed to using simple string methods? Let me ponder that.
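For what it is worth, the expression can be tried in Python as written. A sketch with a hypothetical wrapper, at_most_once; the pattern only asserts (it consumes nothing), so a successful match is an empty match at the start of the string.

```python
import re

# The expression from above, split across two literals for readability.
AT_MOST_ONE_FOO_AND_A = re.compile(
    r"^(?:(?!.*\bfoo\b)|(?=(?:(?!\bfoo\b).)*\bfoo\b(?!(.*\bfoo\b))))"
    r"(?:(?!.*\ba\b)|(?=(?:(?!\ba\b).)*\ba\b(?!(.*\ba\b))))"
)

def at_most_once(text):
    """True if the words 'foo' and 'a' each occur at most once in text."""
    return AT_MOST_ONE_FOO_AND_A.match(text) is not None

for text in ["a foo", "a foo b foo", "a a", "foofoo", "b c"]:
    print(f"{text!r}: {at_most_once(text)}")
```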
If the order of the strings doesn't matter, and you want to make sure every string occurs only once, you can turn the list into a set in Python:
my_lst = ['a', 'a', 'b', 'c']
my_set = set(my_lst)
print(my_set)
# {'a', 'c', 'b'}
There is not much to add to the above answers except here is a regex that does not use back or forward references. Instead it uses 3 separate negative lookahead assertions to ensure that the input does not contain 2 occurrences of either a or b or c. The regex also allows for liberal uses of spaces.
^foo(?![^a]*a[^a]*a)(?![^b]*b[^b]*b)(?![^c]*c[^c]*c)( +[abc])* *$
See Regex Demo
^ - Matches start of string
(?![^a]*a[^a]*a) - Negative lookahead assertion that what follows does not contain two occurrences of a
(?![^b]*b[^b]*b) - Negative lookahead assertion that what follows does not contain two occurrences of b
(?![^c]*c[^c]*c) - Negative lookahead assertion that what follows does not contain two occurrences of c
( +[abc])* - Matches 0 or more occurrences of: 1 or more spaces followed by an a or b or c
 * - Matches 0 or more occurrences of a space (the pattern is a space followed by *)
$ - Matches the end of the string
The regex looks "clunky" but is very straightforward. With input foo a b c the successful match is done in 35 steps and with input foo b b the unsuccessful match is done in 13 steps. This compares favorably with the other answers.
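A brief sketch of the three-lookahead pattern in Python, including an input with extra spaces. The names LOOKAHEADS and outcomes are illustrative only.

```python
import re

# One negative lookahead per letter: reject input containing that letter twice.
LOOKAHEADS = re.compile(
    r"^foo(?![^a]*a[^a]*a)(?![^b]*b[^b]*b)(?![^c]*c[^c]*c)( +[abc])* *$"
)

outcomes = {
    text: bool(LOOKAHEADS.match(text))
    for text in ["foo a b c", "foo   c  a ", "foo", "foo b b"]
}
print(outcomes)
```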

Python regex match number + unit [duplicate]

I have the code:
import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)
This returns ['aa']
If we have
import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)
we get ['aa','aa']
Why is there a difference and why (for the first) do we not get ['aa','bb','aa']?
Thanks!
The unwanted behaviour comes down to the way you formulate the regular expression:
rexp=re.compile("(aa|bb)+")
The parentheses (aa|bb) form a group.
And if we look at the docs of findall we will see this:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
As you formed a group, it matched first aa, then bb, then aa again (because of the + quantifier). So this group holds aa in the end. And findall returns this value in the list ['aa'] (as there is only one match, aabbaa, of the whole expression, the list contains only one element, the aa saved in the group).
From the code you gave, you seemed to want to do this:
>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']
(?: ...) doesn't create a capturing group, so findall returns the match of the whole expression.
At the end of your question you show the desired output. This is achieved by just looking for aa or bb. No quantifiers (+ or *) are needed. Just do it the way it is done in Inbar Rose's answer:
>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']
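The variants discussed so far can be put side by side. A minimal sketch using only re.findall; the variable names are made up for the comparison.

```python
import re

sequence = "aabbaa"

# One match (the whole string); findall reports group 1, i.e. its last repetition.
capturing = re.findall(r"(aa|bb)+", sequence)

# Same single match, but with no capturing group findall reports the full match.
non_capturing = re.findall(r"(?:aa|bb)+", sequence)

# No quantifier: every aa or bb is its own match.
no_quantifier = re.findall(r"aa|bb", sequence)

# The pattern from the question with cc: bb breaks the run, giving two matches.
with_cc = re.findall(r"(aa|cc)+", sequence)

print(capturing)      # ['aa']
print(non_capturing)  # ['aabbaa']
print(no_quantifier)  # ['aa', 'bb', 'aa']
print(with_cc)        # ['aa', 'aa']
```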
Let me explain what you are doing:
regex = re.compile("(aa|bb)+")
You are creating a regex which looks for aa or bb, then checks whether more aa or bb follow, and keeps going until it doesn't find any. Since your capturing group holds only one aa or bb at a time, you only get the last one captured.
However, if you have a string like aaxaabbxaa you will get aa, bb, aa: the regex first finds aa, looks for more, and finds only an x, so the first match ends with aa in the group. Then it finds another aa followed by bb and then an x, so it stops, and the second match leaves bb in the group. Then it finds the final aa. So your result is aa, bb, aa.
I hope this explains what you are doing, and it is behaving as expected. To get every aa or bb you need to remove the +, which tells the regex to consume repeated groups before returning a match, and just have the regex return each match of aa or bb.
So your regex should be:
regex = re.compile("(aa|bb)")
Cheers.
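The aaxaabbxaa example can be made visible with re.finditer, which exposes both the full match and the group for each match. A small sketch; pairs is just an illustrative name.

```python
import re

# For each match, record (full match, last value captured by group 1).
pairs = [
    (m.group(0), m.group(1))
    for m in re.finditer(r"(aa|bb)+", "aaxaabbxaa")
]

print(pairs)  # [('aa', 'aa'), ('aabb', 'bb'), ('aa', 'aa')]
```

findall on the same input returns only the second element of each pair, which is why it yields aa, bb, aa.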
your pattern
rexp=re.compile("(aa|bb)+")
matches the whole string aabbaa. to clarify just look at this
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(0)
'aabbaa'
Since the whole string is consumed, there are no other substrings left to match, and group 1 holds only the last repetition:
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(1)
'aa'
so a findall will return the one substring only
>>> re.findall(re.compile("(aa|bb)+"),"aabbaa")
['aa']
I do not understand why you use +. It means one or more occurrences of the preceding group, so the whole of aabbaa is consumed as a single match and the group keeps only its last repetition. Without it, each aa or bb is a separate match:
>>> re.findall(r'(aa|bb)', 'aabbaa')
['aa', 'bb', 'aa']
This works as expected.

Python Regular Expression Groups

Why does this regex print ('c',) and ()?
I thought "([abc])+" === "([abc])([abc])([abc])..."
>>> import re
>>> m = re.match("([abc])+", "abc")
>>> print m.groups()
('c',)
>>> m.groups(0)
('c',)
>>> m = re.match("[abc]+", "abc")
>>> m.groups()
()
>>> m.groups(0)
()
From documentation about groups
Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
In the first regex ([abc])+, it is matching character a or b or c but will store only the last match
([abc])+
<----->
Matches a or b or c
Observe carefully. Capturing groups are surrounding only the character class
So, only one character from the matched character class can be stored in capturing group.
If you want to capture string abc in a capturing group use
([abc]+)
Above will find string composed of a or b or c and will store it in capturing group.
In second regex [abc]+, there are no capturing groups, so an empty result is shown.
When you put plus after the parenthesis you are matching one character at a time, many times. So it matches a then b then c, each time overwriting the previous. Move the + inside ([abc]+) to match one or more.
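To contrast the assumption in the question with what actually happens, here is a sketch in Python 3 syntax. It shows that a pattern has exactly as many groups as it has pairs of capturing parentheses, regardless of how often a group repeats; the variable names are illustrative.

```python
import re

# One pair of parentheses -> one group, holding the last repetition.
repeated = re.match(r"([abc])+", "abc").groups()

# Three pairs of parentheses -> three groups, one per character.
spelled_out = re.match(r"([abc])([abc])([abc])", "abc").groups()

# No parentheses -> no groups at all.
no_groups = re.match(r"[abc]+", "abc").groups()

# To collect every character separately, match them one at a time instead.
each_char = re.findall(r"[abc]", "abc")

print(repeated)     # ('c',)
print(spelled_out)  # ('a', 'b', 'c')
print(no_groups)    # ()
print(each_char)    # ['a', 'b', 'c']
```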

difference between regular expression with and without group '( )'?

There are two different codes which produce two different result but I don't know how those differences arise.
>>> re.findall('[a-z]+','abc')
['abc']
and this one with group:
>>> re.findall('([a-z])+','abc')
['c']
Why does the second code yield the character c?
In your last regex pattern (([a-z])+), you are repeating a capturing group (()). Doing this keeps only the last iteration, so you get the last letter, which is c.
But in your first pattern ([a-z]+), you are repeating a character class ([]), and there is no capturing group, so findall returns the whole match rather than only the last iteration.

Compare two regexs [ab]* and ([ab])\1

What makes me wonder is that
Why does [ab]* repeat the pattern [ab] rather than the matched part? In other words, why is it not the same as either a* or b*?
Why does ([ab])\1 repeat the matched part rather than the pattern [ab]? In other words, why can it only match aa and bb, but not ab and ba?
Is it because the priority of () is lower than [], while the priority of * is higher than []? I wonder if thinking of these as operators might not be appropriate. Thanks.
They both are entirely different.
When you say [ab]*, it means that either a or b for zero or more times. So, it will match "", "a", "b", and any combination of a and b.
But ([ab])\1 means that either a or b will be matched and then it is captured. \1 is called a backreference. It refers to the already captured group in the RegEx. In our case, ([ab]). So, if a was captured, then it will match only a again. If it is b then it will match only b again. It can match only aa and bb.
[ab]*
This will also match nothing, a, b, aaa, bbb, and a string of any length. The match isn't constrained by length, and since there are no capturing groups, it's stating to match a string of any length consisting only of a and b characters.
([ab])\1
In which case it forces the matched string to be two characters since there is no repetition. First, it must match what's inside the parens (for capturing group one), then it must match what it captured in group 1, which implicitly forces the match to be two characters long with both characters being identical.
Let's look at each of your expressions, then we'll add one interesting twist that may resolve any outstanding confusion.
[ab]* is equivalent to (?:a|b)*, in other words match a or b any number of times, for instance abbbaab.
[ab] is equivalent to (?:a|b), in other words match a or b once, for instance a.
a* means match a any number of times (for instance, aaaa)
b* means match b any number of times (for instance, bb)
You say that ([ab])\1 can only match aa or bb. That is correct, because
([ab])\1 means match a or b once, capturing it to Group 1, then match Group 1 again, i.e. a if we had a, or b if we had b.
One more variation (Perl, PCRE)
([ab])(?1) means match a or b once, capturing it to Group 1, then match the expression specified inside Group 1, therefore, match [ab] again. This would match aa, ab, ba or bb. Therefore,
([ab])(?1)* can match the same as [ab]+, and ([ab]*)(?1)* can match the same as [ab]*
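The difference between repeating the pattern and repeating the matched text can be checked in Python. This sketch covers only [ab]* and ([ab])\1, because the (?1) subroutine syntax is not supported by Python's standard re module; the candidate list is made up for the illustration.

```python
import re

candidates = ["", "a", "aa", "ab", "ba", "bb", "abba"]

# \1 must repeat the exact text captured by group 1.
backref_hits = [s for s in candidates if re.fullmatch(r"([ab])\1", s)]

# * repeats the character class itself, so each position may be a or b.
class_hits = [s for s in candidates if re.fullmatch(r"[ab]*", s)]

print(backref_hits)  # ['aa', 'bb']
print(class_hits)    # ['', 'a', 'aa', 'ab', 'ba', 'bb', 'abba']
```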
