Compare two regexes [ab]* and ([ab])\1 - python

What makes me wonder is:
Why does [ab]* not repeat the matched part, but repeat [ab] instead? In other words, why is it not the same as either a* or b*?
Why does ([ab])\1 repeat the matched part, rather than repeat [ab]? In other words, why can it only match aa and bb, but not ab and ba?
Is it because the priority of () is lower than that of [], while the priority of * is higher than that of []? I wonder if thinking of these as operators might not be appropriate. Thanks.

The two are entirely different.
When you say [ab]*, it means either a or b, repeated zero or more times. So it will match "", "a", "b", and any combination of a's and b's.
But ([ab])\1 means that either a or b is matched and then captured. \1 is called a backreference. It refers to the text already captured by a group in the RegEx, in our case ([ab]). So, if a was captured, it will match only a again. If b was captured, it will match only b again. It can match only aa and bb.
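A minimal sketch of the difference, using Python's built-in re module. The sample strings and variable names are chosen only for illustration; fullmatch is used so that each string has to be matched in its entirety.

```python
import re

star = re.compile(r'[ab]*')        # the character class is re-evaluated on every repetition
backref = re.compile(r'([ab])\1')  # \1 repeats the text that group 1 captured

samples = ['', 'a', 'abba', 'aa', 'bb', 'ab', 'ba']

# Every sample consists only of a's and b's, so [ab]* accepts all of them.
star_hits = [s for s in samples if star.fullmatch(s)]

# The backreference only accepts a doubled character.
backref_hits = [s for s in samples if backref.fullmatch(s)]

print(star_hits)
print(backref_hits)
```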

[ab]*
This will match nothing, a, b, aaa, bbb, and a string of any length. The match isn't constrained by length, and since there are no capturing groups, it simply says to match a string of any length consisting entirely of a and b characters.
([ab])\1
This forces the matched string to be exactly two characters, since there is no repetition. First, it must match what's inside the parens (capturing group 1), then it must match what it captured in group 1, which implicitly forces the match to be two characters long with both characters being identical.

Let's look at each of your expressions, then we'll add one interesting twist that may resolve any outstanding confusion.
[ab]* is equivalent to (?:a|b)*, in other words match a or b any number of times, for instance abbbaab.
[ab] is equivalent to (?:a|b), in other words match a or b once, for instance a.
a* means match a any number of times (for instance, aaaa).
b* means match b any number of times (for instance, bb).
You say that ([ab])\1 can only match aa or bb. That is correct, because
([ab])\1 means match a or b once, capturing it to Group 1, then match Group 1 again, i.e. a if we had a, or b if we had b.
One more variation (Perl, PCRE)
([ab])(?1) means match a or b once, capturing it to Group 1, then match the expression specified inside Group 1, therefore, match [ab] again. This would match aa, ab, ba or bb. Therefore,
([ab])(?1)* can match the same as [ab]+, and ([ab]*)(?1)* can match the same as [ab]*
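Python's built-in re module has no (?1) subroutine syntax, so the snippet below is only a sketch of the same idea: it reuses the sub-pattern text by string formatting instead. The names unit, reuse_pattern and backref_pattern are made up for this illustration.

```python
import re

# The (?1) syntax is a Perl/PCRE feature; the built-in re module rejects it.
try:
    re.compile(r'([ab])(?1)')
    supports_subroutine = True
except re.error:
    supports_subroutine = False

unit = r'[ab]'

# Reusing the *expression* (like (?1)): the second character is matched afresh.
reuse_pattern = re.compile(f'({unit}){unit}')

# Reusing the *captured text* (like \1): the second character must be identical.
backref_pattern = re.compile(r'([ab])\1')

samples = ['aa', 'ab', 'ba', 'bb']
reuse_hits = [s for s in samples if reuse_pattern.fullmatch(s)]
backref_hits = [s for s in samples if backref_pattern.fullmatch(s)]

print(supports_subroutine, reuse_hits, backref_hits)
```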

Related

Matching strings where multiple capture groups must be different in regex

I am trying to create a regular expression that picks out a boolean algebra identity, specifically ((A+B).(A+C)), where A, B and C are different strings consisting of characters [A-Z].
I am running into problems getting the regular expression to recognise that, in the string I am looking for, A != B != C.
Here is what I have tried:
\(\(([A-Z]+)\+([A-Z])\)\.\(\1\+([A-Z])\)\)
However, even though I have put every string that I want to be different in a capturing group, that doesn't stop it from treating strings B and C as the same. The regular expression matches all three of the following strings:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
while I only want it to match the first one.
You can use negative lookahead to make sure that group 2 is not the same as group 1, and that group 3 is not the same as either group 1 or group 2.
\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)
Split up for readability:
\(\(
([A-Z]+)
\+
(?!\1)([A-Z])
\)\.\(
\1
\+
(?!\1)(?!\2)([A-Z])
\)\)
Inputs:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
((A+B).(A+B))
Matches:
((A+B).(A+C))
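The same check can be reproduced with Python's re module. This is a small sketch; the variable names are illustrative, and the four inputs are the ones listed above.

```python
import re

# (?!\1) rejects a second operand equal to group 1;
# (?!\1)(?!\2) rejects a third operand equal to group 1 or group 2.
identity = re.compile(
    r'\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)'
)

inputs = [
    '((A+B).(A+C))',
    '((A+B).(A+A))',
    '((A+A).(A+A))',
    '((A+B).(A+B))',
]

matches = [s for s in inputs if identity.fullmatch(s)]
print(matches)
```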

Regex for matching adjacent words where order doesn't matter

I'm using regex with python and trying to figure out the best way to match a pattern where the order of two words I'm searching for doesn't matter, but they must be adjacent. So for example, I'm searching for either the phrase "fat cat lasagna co" or "cat fat lasagna co", and I have to imagine there's a better way than just r"\b(fat cat|cat fat) lasagna co\b"
I read this question which addressed a similar problem but the words didn't have to be adjacent and couldn't figure out how to apply it to my problem.
There is no strictly better solution, but there's an alternative.
Now, if you have two normal words like "fat" and "cat", then (fat cat|cat fat) is undoubtedly the best solution. But what if you have 5 words? Or if you have more complex patterns than just fat and cat that you don't want to type twice?
Say instead of fat and cat you have 3 regex patterns A, B and C, and instead of the space between fat and cat you have the regex pattern S. In that case, you could use this recipe:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}
If you don't have an S, this can be simplified to
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}
(Note: (?:X) can be simplified to X if X doesn't contain an alternation |.)
Example
If we set A = fat, B = cat and S = space, we get:
(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}
Explanation
In essence, we're using capture groups to "remember" which patterns have already matched. To do so, we use this little pattern here:
(?!\1)()some_pattern
What does this do? It's a regex that matches exactly once. Once it has matched, it won't ever match again. If you try to add a quantifier around that pattern like (?:(?!\1)()some_pattern)* it'll match either once or won't match at all.
The trick there is the usage of a backreference to a capture group before that group has even been defined. Because capture groups are initialized with a "failed to match" state, the negative lookahead (?!\1) will match successfully - but only the first time. Because right afterwards, the capture group () matches and captures the empty string. From this point forward, the negative lookahead (?!\1) will never match again.
With this as a building block, we can create a regex that matches fatcat and catfat while only containing the words fat and cat once:
(?:(?!\1)()fat|(?!\2)()cat){2}
Because of the negative lookaheads, each word can only match at most once. Adding a {2} quantifier at the end guarantees that each of the two words matches exactly once, or the entire match fails.
Now we just need to find a way to match a space between fat and cat. Well, that's just a slight variation of the same pattern:
(?:(?!\1)()|\1 )
This pattern will match the empty string on its first match, and on each subsequent match it'll match a space.
Put it all together, and voilà:
(?:(?:(?!\1)()|\1 )(?:(?!\2)()fat|(?!\3)()cat)){2}
Templates (for the lazy)
2 patterns A and B, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B))){2}
3 patterns A, B and C, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C))){3}
4 patterns A, B, C and D, with separator S:
(?:(?:(?!\1)()|\1(?:S))(?:(?!\2)()(?:A)|(?!\3)()(?:B)|(?!\4)()(?:C)|(?!\5)()(?:D))){4}
2 patterns A and B, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)){2}
3 patterns A, B and C, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)){3}
4 patterns A, B, C and D, without S:
(?:(?!\1)()(?:A)|(?!\2)()(?:B)|(?!\3)()(?:C)|(?!\4)()(?:D)){4}
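For comparison, here is a sketch of a different technique: generating the plain alternation, such as (fat cat|cat fat), programmatically, so that nothing has to be typed twice. It is not the capture-group trick described above, and it only handles literal words, not arbitrary sub-patterns. The helpers any_order and compiles are hypothetical names. As far as I know, Python's built-in re module rejects a backreference to a group that has not been defined yet, so the templates may need an engine that accepts forward references; the compiles call lets you check this on your own setup.

```python
import re
from itertools import permutations

def any_order(words, sep=' '):
    """Build an alternation matching the literal words in every order, joined by sep."""
    orders = (
        re.escape(sep).join(re.escape(w) for w in order)
        for order in permutations(words)
    )
    return '(?:' + '|'.join(orders) + ')'

def compiles(pattern):
    """Report whether the built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

pattern = re.compile(r'\b' + any_order(['fat', 'cat']) + r' lasagna co\b')

print(bool(pattern.search('the fat cat lasagna co')))
print(bool(pattern.search('the cat fat lasagna co')))

# Check whether the forward-reference template is accepted by this engine.
print(compiles(r'(?:(?!\1)()fat|(?!\2)()cat){2}'))
```

The alternation grows factorially with the number of words, so this is only practical for a handful of them.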

Regex for pattern that starts with ABC, then B's and/or C's, and ends with CBA

Let's say I have a string that can contain only A's, B's and C's.
I have substrings of a certain pattern that I want to extract: they start with ABC, continue with a combination of B's and C's, and end with CBA.
The naive solution is to use ABC[BC]*CBA.
However, that will not cover the ABCBA string. Is there a "pythonic" way to address this, other than using | to look for two possible RE's?
You can use lookarounds:
AB(?=C)[BC]*(?<=C)BA
I.e. make sure AB is followed by C and BA is preceded by C, even if they are the same C.
You do not need lookarounds; use an optional group:
ABC(?:[BC]*C)?BA
Details
ABC - an ABC substring
(?:[BC]*C)? - an optional non-capturing group matching 0 or more occurrences of B or C chars followed by a C letter
BA - a BA substring.
This will effectively match AB that can only be followed by C, then any number of B or C letters (this streak of chars is optional), followed by CBA.
Note that depending on what you are doing with the pattern, a capturing group will also do, ABC([BC]*C)?BA.
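A quick sketch comparing the naive pattern with the two suggestions in Python. The sample strings are made up for illustration, and fullmatch is used so that the whole string has to fit the pattern.

```python
import re

naive = re.compile(r'ABC[BC]*CBA')
lookaround = re.compile(r'AB(?=C)[BC]*(?<=C)BA')
optional_group = re.compile(r'ABC(?:[BC]*C)?BA')

samples = ['ABCBA', 'ABCCBA', 'ABCBCBA', 'ABBA', 'ABCBBA']

def hits(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if pattern.fullmatch(s)]

print(hits(naive))           # misses ABCBA, where the two C's overlap
print(hits(lookaround))
print(hits(optional_group))
```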

difference between regular expression with and without group '( )'?

There are two different pieces of code which produce two different results, but I don't know how those differences arise.
>>>re.findall('[a-z]+','abc')
['abc']
and this one with group:
>>> re.findall('([a-z])+','abc')
['c']
Why does the second code yield the character c?
In your second pattern, ([a-z])+, you are repeating a capturing group. A repeated capturing group keeps only its last iteration, and re.findall returns the content of the group when the pattern contains one. So you get the last letter, which is c.
But in your first pattern, [a-z]+, you are repeating a character class, and this doesn't behave the same as a capturing group. There is no group, so re.findall returns the whole match.
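A short sketch of ways to get the whole match back while keeping the parentheses. The variable names are illustrative only.

```python
import re

text = 'abc'

# No group: findall returns the whole match.
whole = re.findall(r'[a-z]+', text)

# One capturing group: findall returns the group, which holds its last iteration.
last_only = re.findall(r'([a-z])+', text)

# A non-capturing group keeps the grouping without changing what findall returns.
non_capturing = re.findall(r'(?:[a-z])+', text)

# finditer exposes the match object, so group(0) is the full match either way.
full_matches = [m.group(0) for m in re.finditer(r'([a-z])+', text)]

print(whole, last_only, non_capturing, full_matches)
```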

How to add additional criteria to re.findall ... Python 2.7?

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence) #thanks to #Martin Pieters and #nneonneo
I have a line of code that finds any instance of A|G followed by 2 characters and then ATG, which is then followed by either TAA|TAG|TGA when read in units of 3. It only works when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or greater.
I want to add a criterion:
I need the ATG to be followed by a G,
so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG #at least 30 elements long
example:
GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work
GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because it is an (A|G) followed by only one value (not 2) before the ATG, and there is no G following the A|G-xx-ATG
I hope this makes sense.
I tried
ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)
but it seemed like it was using a window size of 3 after the last G of ATGG.
Basically I need that code, where the first occurrence is A|G-xx-ATG and the second occurrence is (G-xx).
It'll be easier if you use a character class, [AG]; there is no need to group the two 'free' characters:
ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Alternatively, you need to group the A|G:
ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Applying the first form to your examples:
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]
In your attempt, the expression matches either an A, or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything that precedes or follows it within the same group. As it is not grouped, it applies to the whole expression instead:
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
If you need to match a certain amount of characters in your whole match, you need to tailor those 3 character (?:...) groups to match a minimum number of times:
ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
would match A or G followed by 2 characters, followed by ATGG with another 2 characters, then at least 7 times 3 characters (total 21), followed by a specific pattern of 3 more (TAA, TAG or TGA) for a total of at least 33 characters from the first to the last character. The extra .. make up the pattern of 3 after ATG and matches your example from your comment:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']
as well as correctly handling the examples given in your question:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
To ensure you get at least 30 characters, use the {n,} quantifier:
r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'
This ensures that you read at least 9 triplets (27 characters) between the ATG opening and the TAA|TGA|TAG terminator.
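To see how the {n,} quantifier sets the minimum length, here is a sketch that builds test sequences of a known size instead of counting characters by hand. The helper make_seq and the filler letters are hypothetical and only serve to control the number of triplets.

```python
import re

# [AG].. + ATG + G.. + at least 7 triplets + stop codon = at least 33 characters
orf = re.compile(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)')

def make_seq(filler_triplets):
    """Hypothetical helper: prefix, ATG, a G-led triplet, TTT filler, then TGA."""
    return 'GCC' + 'ATG' + 'GTT' + 'TTT' * filler_triplets + 'TGA'

long_enough = make_seq(7)  # 3 + 3 + 3 + 21 + 3 characters
too_short = make_seq(6)    # one triplet short of the minimum

print(len(long_enough), orf.findall(long_enough))
print(len(too_short), orf.findall(too_short))
```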
