Matching strings where multiple capture groups must be different in regex

I am trying to create a regular expression that picks out a boolean algebra identity, specifically ((A+B).(A+C)), where A, B and C are different strings consisting of characters [A-Z].
I am running into problems getting the regular expression to recognise that, in the string I am looking for, A, B and C must all be different.
Here is what I have tried:
\(\(([A-Z]+)\+([A-Z])\)\.\(\1\+([A-Z])\)\)
however, even though I have put every string that I want to be different in a capturing group, it doesn't stop it from thinking that strings B and C are the same. This is because the regular expression matches for all three of the following strings:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
while I only want it to match the first one.

You can use negative lookaheads to make sure that group 2 is not the same as group 1, and that group 3 is not the same as either group 1 or group 2.
\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)
Split up for readability:
\(\(
([A-Z]+)
\+
(?!\1)([A-Z])
\)\.\(
\1
\+
(?!\1)(?!\2)([A-Z])
\)\)
Inputs:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
((A+B).(A+B))
Matches:
((A+B).(A+C))
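As a quick check, the pattern above can be run against the four inputs with Python's re module. This is only a minimal sketch: the names PATTERN and candidates are illustrative, and fullmatch is used on the assumption that each input holds exactly one expression. Note that B and C are single letters here, as in the pattern itself.

```python
import re

# Pattern copied from the answer: group 2 must differ from group 1,
# and group 3 must differ from both group 1 and group 2.
PATTERN = re.compile(
    r"\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)"
)

candidates = [
    "((A+B).(A+C))",
    "((A+B).(A+A))",
    "((A+A).(A+A))",
    "((A+B).(A+B))",
]

# Keep only the inputs that the whole pattern accepts.
matches = [s for s in candidates if PATTERN.fullmatch(s)]
print(matches)  # ['((A+B).(A+C))']
```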

Related

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or
uppercase, which is then followed by a number of hexadecimal
characters (meaning it can contain the letters A to F and numbers
from 1 to 9, with no zeros included).
After those hexadecimal
characters comes a letter P, also either in lowercase or uppercase
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the strings that come in between the letters C and P as well as the string that comes after the letter P and concatenate them into a single string, while discarding the letters C and P
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you

To match both groups before and after P or p
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
([1-9a-fA-F]+$)) - Capture group for one or more hexadecimal characters following the p or P, up to the end of the line; the final ) closes the lookahead
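With this pattern the part before P is the overall match and the part after P is capture group 1, so the two still have to be joined by hand. A minimal sketch, assuming Python's re module; the function name extract is hypothetical and not part of the answer:

```python
import re

# Pattern copied from the answer above.
HEX_PARTS = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")


def extract(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    m = HEX_PARTS.search(text)
    if m is None:
        return None
    # group(0) is the part between C and P, group(1) is the part after P.
    return m.group(0) + m.group(1)


for sample in ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA", "BCA6C22pAAA", "c56Bp", "c45AF0P2"]:
    print(sample, "->", extract(sample))
```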

Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you specifically need \A (which matches only at the very start of the input, even in multi-line mode), prefer ^: it is easier to read, and with the multi-line flag it matches at the start of every line, which is what many situations and tools expect. I've used ^.
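The capture-group version can be used from Python roughly as follows. This is a sketch rather than a definitive implementation: the function name join_hex_parts is hypothetical, and re.fullmatch is an assumption added here so that trailing characters (for example a 0 after the second part) reject the whole string, since the pattern itself has no end anchor.

```python
import re

# Pattern copied from the answer; (?i) makes it case insensitive.
HEX_GROUPS = re.compile(r"(?i)^c([a-f1-9]*)p([a-f1-9]+)")


def join_hex_parts(text):
    """Concatenate the hex parts around P, or return None if text is invalid."""
    # fullmatch (an assumption, see above) requires the entire string to match.
    m = HEX_GROUPS.fullmatch(text)
    if m is None:
        return None
    return m.group(1) + m.group(2)


print(join_hex_parts("c45AFP2"))     # 45AF2
print(join_hex_parts("CA6C22pAAA"))  # A6C22AAA
print(join_hex_parts("c45AF0P2"))    # None
```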

Use regex to identify 4 to 5 numbers that are (consecutive, i.e no whitespace or special characters included), without including preceding 0's

I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below is working effectively in all cases unless there are consecutive 0's preceding a one, two or 3 digit number. I don't want '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000' to all be matches. Is there a good way to implement this using regular expressions? Here is my current code that works for most cases except when there are preceding 0's to a series of digits less than 4 or 5 characters in length.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Since you can simply use a capturing group, you may also use an equivalent regex:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
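Applied to the script from the question, the same pattern could be dropped in like this. It is a minimal sketch; the helper name first_number is hypothetical, and it keeps the question's convention of reporting 0 when nothing matches.

```python
import re

# Pattern from the answer: a 4 or 5 digit number that starts with a
# non-zero digit and has no digit directly before or after it.
NUMBER = re.compile(r"(?<!\d)[1-9]\d{3,4}(?!\d)")


def first_number(line):
    """Return the first qualifying number in line as an int, or 0 if none."""
    m = NUMBER.search(line)
    return int(m.group(0)) if m else 0


print(first_number("US Machine Operations | 0054"))   # 0
print(first_number("US Machine Operations | 10354"))  # 10354
```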

How to match length of variable size backreference but not content

Currently I'm trying to write a regular expression (using Python's re module) that will find occurrences of 'a' in a string of a given length. There are a few different patterns I'm trying to match, but the ones that are giving me trouble look like this:
a.a.a
a..a..a
a...a...a
Basically I'm trying to find matches that contain at least three occurrences of 'a', but they must be equally spaced apart. So far I've tried regexes:
regex1 = r'a(.|..|...)a\1a'
regex2 = r'a(.{1,3})a\1a'
But the problem I'm having is that the backreference repeats the matched text. So, for example, my regex will match #1 but not #2,
1. aoooaoooa
2. aoooabbba
when in actuality I don't care about the content between occurrences of 'a', simply the distance.
I know backreferences can be used to match the same unknown text multiple times, but I suppose I don't know enough to tell whether there's just a different way to use them, or whether I should be using some other method/pattern entirely. Tips?
Thanks in advance!

If you install the regex module from PyPI, you can use its subpattern-call feature. Just wrap the repeating part in a capture group, and then use (?n), where n is the capture group number, to reuse that group's pattern.
>>> import regex
>>> a = "aoooaoooa"
>>> b = "aoooabbba"
>>> rx = r"a(.{1,3})a(?1)a"
>>> print(regex.search(rx, a).group(0))
aoooaoooa
>>> print(regex.search(rx, b).group(0))
aoooabbba
>>> print(regex.search(rx, "abacca").group(0))
abacca
Explanation:
a - matches a literal a
(.{1,3}) - matches and captures into Group 1 one to three characters other than a newline
a - matches a literal a
(?1) - a subpattern call telling the regex engine to reuse the pattern of Group 1 (i.e. .{1,3}) rather than the text that the group matched
a - matches a literal a
The PyPI regex module does not support balancing groups (.NET does), so you will have to add more code to check that the matched groups are of equal length. Fortunately, the regex module keeps all captured submatches, which are available through the match object's captures() method. So, all you need to do to exclude abacca from the valid matches is:
c = "abacca"
m = regex.search(rx, c)
# Compare the longest and shortest captures by length (key=len), not alphabetically.
if m and len(max(m.captures(1), key=len)) == len(min(m.captures(1), key=len)):  # all of equal length?
    print(m.group(0))
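If installing a third-party module is not an option and the gap is limited to a few lengths, one alternative (a different technique from the subpattern call above) is to spell out each gap length as its own alternative using the standard re module. A minimal sketch, where the function name equally_spaced and the max_gap parameter are hypothetical:

```python
import re


def equally_spaced(max_gap=3):
    """Build a pattern for 'a', gap, 'a', gap, 'a' with equal-length gaps."""
    # One alternative per gap length, e.g. .{2}a.{2}a for a gap of two.
    alternatives = "|".join(
        r".{%d}a.{%d}a" % (n, n) for n in range(1, max_gap + 1)
    )
    return re.compile("a(?:" + alternatives + ")")


rx_stdlib = equally_spaced()
for text in ["aoooaoooa", "aoooabbba", "abacca"]:
    m = rx_stdlib.search(text)
    print(text, "->", m.group(0) if m else None)
```

Because each alternative fixes the gap length, no post-match length check is needed; the cost is a pattern that grows with the number of allowed gap lengths.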

regex: matching a repeating sequence

I'm trying to construct a regular expression that will match a repeating DNA sequence of 2 characters. These characters can be the same.
The regex should match a sequence of 2 characters that repeats at least 3 times. Here are some examples:
regex should match on:
ATATAT
GAGAGAGA
CCCCCC
and should not match on:
ACAC
ACGTACGT
So far I've come up with the following regular expressions:
[ACGT]{2}
this captures any sequence consisting of exactly two characters (A, C, G or T). Now I want to repeat this pattern at least three times, so I tried the following regular expressions:
[ACGT]{2}{3,}
([ACGT]{2}){3,}
Unfortunately, the first one raises a 'multiple repeat' error (Python), while the second one will simply match any sequence with 6 characters consisting of A, C, G and T.
Is there anyone that can help me out with this regular expression?
Thanks in advance.

You could perhaps make use of backreferences.
([ATGC]{2})\1{2,}
\1 is a backreference to the first capture group: it matches the same text that the group captured, so \1{2,} requires at least two more copies of the pair (three in total).
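The backreference pattern can be checked against the examples from the question with a short script. This is a sketch only; the helper name has_repeat is hypothetical, and re.search is used on the assumption that the repeat may appear anywhere inside a longer sequence.

```python
import re

# Pattern from the answer: a pair of bases followed by at least two
# more copies of that same pair.
REPEAT = re.compile(r"([ATGC]{2})\1{2,}")


def has_repeat(seq):
    """True if seq contains a 2-base unit repeated at least 3 times in a row."""
    return REPEAT.search(seq) is not None


for seq in ["ATATAT", "GAGAGAGA", "CCCCCC", "ACAC", "ACGTACGT"]:
    print(seq, has_repeat(seq))
```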

One:
(AT){3}
Two:
(GA){4}
Three:
C{6}
Combining them:
(C{6}|(GA){4}|(AT){3})

How to add additional criteria to re.findall ... Python 2.7?

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence) #thanks to #Martin Pieters and #nneonneo
I have a line of code that finds any instance of A|G followed by 2 characters and then ATG that is then followed by either a TAA|TAG|TGA when read in units of 3. only works when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or greater
i want to add a criteria
i need the ATG to be followed by a G
so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG #at least 30 elements long
example:
GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work
GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because it is an (A|G) followed by only one value (not 2) before the ATG and there is not a G following the A|G-xx-ATG
i hope this makes sense
I tried
ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)
but it seemed like it was using window size 3 after last G of ATGG
basically I need that code, where the first occurrence is A|G-xx-ATG and the second occurrence is (G-xx)

It'll be easier if you use a character class, [AG]; there is no need to group the two 'free' characters:
ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
or you need to group the A|G:
ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Applying the first form to your examples:
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]
In your attempt, the expression matches either a lone A, or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything that precedes or follows it within the same group. As it is not grouped, it applies to the whole expression instead:
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
If you need to match a certain amount of characters in your whole match, you need to tailor those 3 character (?:...) groups to match a minimum number of times:
ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
would match A or G followed by 2 characters, followed by ATGG with another 2 characters, then at least 7 times 3 characters (total 21), followed by a specific pattern of 3 more (TAA, TAG or TGA) for a total of at least 33 characters from the first to the last character. The extra .. make up the pattern of 3 after ATG and matches your example from your comment:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']
as well as correctly handling the examples given in your question:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]

To ensure you get at least 30 characters, use the {n,} quantifier:
r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'
This ensures that you read at least 9 triplets (27 characters) between the ATG opening and the TAA|TGA|TAG terminator.
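The effect of the {9,} lower bound can be seen with sequences built to a known length. This is a minimal sketch; the helper make_sequence and its fixed GCC prefix and TGA terminator are illustrative assumptions, not data from the question.

```python
import re

# Pattern from the answer: at least 9 triplets between ATG and the stop codon.
ORF = re.compile(r"[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)")


def make_sequence(n_triplets):
    """Hypothetical test sequence: GCC + ATG + n TTT triplets + TGA."""
    return "GCC" + "ATG" + "TTT" * n_triplets + "TGA"


long_enough = ORF.findall(make_sequence(9))  # 27 characters between ATG and TGA
too_short = ORF.findall(make_sequence(8))    # only 24 characters in between

print(len(long_enough), len(too_short))  # 1 0
```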
