regex: matching a repeating sequence - python

I'm trying to construct a regular expression that will match a repeating DNA sequence of 2 characters. These characters can be the same.
The regex should match a 2-character unit that is repeated at least 3 times. Here are some examples.
regex should match on:
ATATAT
GAGAGAGA
CCCCCC
and should not match on:
ACAC
ACGTACGT
So far I've come up with the following regular expressions:
[ACGT]{2}
This matches any sequence of exactly two characters (A, C, G or T). Now I want to repeat this pattern at least three times, so I tried the following regular expressions:
[ACGT]{2}{3,}
([ACGT]{2}){3,}
Unfortunately, the first one raises a 'multiple repeat' error (Python), while the second one will simply match any sequence with 6 characters consisting of A, C, G and T.
Is there anyone that can help me out with this regular expression?
Thanks in advance.

You could perhaps make use of backreferences.
([ATGC]{2})\1{2,}
\1 is a backreference to the first capture group: it matches the exact text that group captured, so \1{2,} requires the same two characters to appear at least twice more.
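A minimal sketch of the backreference approach in Python. The helper name `is_dinucleotide_repeat` is made up for illustration, and `fullmatch` is an assumption on my part so that the whole string has to consist of the repeated unit:

```python
import re

# Group 1 captures a two-letter unit; \1{2,} demands the same text at least twice more.
PATTERN = re.compile(r'([ACGT]{2})\1{2,}')

def is_dinucleotide_repeat(seq):
    """Return True if seq is one 2-letter unit repeated 3 or more times."""
    return PATTERN.fullmatch(seq) is not None

for seq in ["ATATAT", "GAGAGAGA", "CCCCCC", "ACAC", "ACGTACGT"]:
    print(seq, is_dinucleotide_repeat(seq))
```

To locate such repeats inside a longer sequence, `PATTERN.search` or `PATTERN.finditer` could be used instead of `fullmatch`.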

One:
(AT){3}
Two:
(GA){4}
Three:
C{6}
Combining them!
(C{6}|(GA){4}|(AT){3})

Related

Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?

I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:
It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
After those hexadecimal characters comes a letter P, also either in lowercase or uppercase.
And then some more hexadecimal characters (again, excluding 0).
Meaning I want to capture the string that comes in between the letters C and P as well as the string that comes after the letter P, and concatenate them into a single string, while discarding the letters C and P.
Examples of valid strings would be:
c45AFP2
CAPF
c56Bp26
CA6C22pAAA
For the above examples what I want would be to extract the following, in the same order:
45AF2 # Original string: c45AFP2
AF # Original string: CAPF
56B26 # Original string: c56Bp26
A6C22AAA # Original string: CA6C22pAAA
Examples of invalid strings would be:
BCA6C22pAAA # It doesn't begin with C
c56Bp # There aren't any characters after P
c45AF0P2 # Contains a zero
I'm using Python and I want a regex to extract the two strings: the one between the characters C and P, and the one after P.
So far I've come up with this:
(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*
A breakdown would be:
(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]
[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times
But with the above regex I can't match any of the strings!
When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.
Meaning the below regex works:
(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*
I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?
But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.
That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?
I still need it to:
Not be a match if the string contains the number 0
Only be a match if ALL conditions are met
Thank you
To match both groups, before and after P or p:
(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
(?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
[1-9a-fA-F]+ - Matches hexadecimal characters one or more times
(?=[Pp] - Positive Lookahead for case insensitive p or P
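([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P, up to the end of the line. The part between C and P is the overall match, and the part after P ends up in group 1.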
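For what it's worth, here is a rough sketch of how the lookaround pattern could be used from Python. The function name `extract_with_lookarounds` is hypothetical, and joining the overall match with group 1 is my reading of what the question asks for:

```python
import re

# Overall match: hex run between C and P. Group 1: hex run after P.
PATTERN = re.compile(r'(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))')

def extract_with_lookarounds(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # group(0) is the text before P, group(1) the text after it.
    return m.group(0) + m.group(1)

for s in ["c45AFP2", "CAPF", "c56Bp", "BCA6C22pAAA", "c45AF0P2"]:
    print(s, extract_with_lookarounds(s))
```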
Your main problem is you're using a look behind (?<=[pP]) for something ahead, which will never work: You need a look ahead (?=...).
Also, the final quantifier should be + not * because you require at least one trailing character after the p.
The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.
If you use the case insensitive flag, it makes the regex much smaller and easier to read.
A working regex that captures the 2 hex parts in groups 1 and 2 is:
(?i)^c([a-f1-9]*)p([a-f1-9]+)
Unless you need to use \A, prefer ^ (start of input) over \A (start of all input in multi line scenario) because ^ is easier to read and \A won't match at the start of every line, which is what many situations and tools expect. I've used ^.
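A small sketch of the capture-group approach with the case-insensitive flag, run against the examples from the question. The helper name `extract_hex_parts` is made up, and using `fullmatch` (rather than only anchoring the start) is an assumption on my part so that trailing junk is rejected too:

```python
import re

# Case-insensitive: C, optional hex run (group 1), P, required hex run (group 2).
PATTERN = re.compile(r'c([a-f1-9]*)p([a-f1-9]+)', re.IGNORECASE)

def extract_hex_parts(text):
    """Return group 1 + group 2, or None when the string is not valid."""
    m = PATTERN.fullmatch(text)
    return None if m is None else m.group(1) + m.group(2)

valid = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]
invalid = ["BCA6C22pAAA", "c56Bp", "c45AF0P2"]
for s in valid + invalid:
    print(s, extract_hex_parts(s))
```

Because p is not a hexadecimal letter, the first group can never swallow the separator, so no lookarounds are needed.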

Matching strings where multiple capture groups must be different in regex

I am trying to create a regular expression that picks out a boolean algebra identity, specifically ((A+B).(A+C)), where A, B and C are different strings consisting of characters [A-Z].
I am running into problems getting the regular expression to recognise that in the string I am looking for A != B != C.
Here is what I have tried:
\(\(([A-Z]+)\+([A-Z])\)\.\(\1\+([A-Z])\)\)
However, even though I have put every string that I want to be different in a capturing group, it doesn't stop it from thinking that strings B and C are the same. This is because the regular expression matches all three of the following strings:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
while I only want it to match the first one.
You can use negative lookaheads to make sure that group 2 is not the same as group 1, and that group 3 is not the same as either group 1 or group 2.
\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)
Split up for readability:
\(\(
([A-Z]+)
\+
(?!\1)([A-Z])
\)\.\(
\1
\+
(?!\1)(?!\2)([A-Z])
\)\)
Inputs:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
((A+B).(A+B))
Matches:
((A+B).(A+C))
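The same check can be sketched in Python. The variable names are arbitrary, and `fullmatch` is assumed so that each input has to be exactly the identity and nothing else:

```python
import re

# (?!\1) rejects group 2 equal to group 1; (?!\1)(?!\2) rejects group 3 equal to either.
PATTERN = re.compile(
    r'\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)'
)

inputs = [
    "((A+B).(A+C))",
    "((A+B).(A+A))",
    "((A+A).(A+A))",
    "((A+B).(A+B))",
]

matches = [s for s in inputs if PATTERN.fullmatch(s)]
print(matches)
```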

Why does this regex to find repeated characters fail?

I'm trying to build a regex to match any occurrence of two or more repeated alphanumeric characters. The following regex fails:
import re
s = '__commit__'
m = re.search(r'([a-zA-Z0-9])\1\1', s)
But when I change it to this it works:
m = re.search(r'([a-zA-Z0-9])\1+', s)
I'm pretty baffled as to why this is the way it is. Can anyone provide some insight?
Look at this line.
m = re.search(r'([a-zA-Z0-9])\1\1', s)
You are using a capture group followed by two backreferences (each one refers to the text the group has already matched). So it will match only when the same character appears at least three times in a row. You can do:
m = re.search(r'([a-zA-Z0-9])\1', s)
This will match when the same character appears at least twice in a row.
However, the following one is much better.
m = re.search(r'([a-zA-Z0-9])\1+', s)
That's because \1+ now matches one or more backreferences, so the pattern needs a minimum of two consecutive identical characters and also picks up longer runs.
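To see the difference between the three patterns side by side, here is a short sketch. The second test string `'xaaaay'` is just something I made up to show a longer run:

```python
import re

s = '__commit__'

# Group plus two backreferences: needs the same character 3 times in a row.
three = re.search(r'([a-zA-Z0-9])\1\1', s)
# Group plus one backreference: needs the same character 2 times in a row.
two = re.search(r'([a-zA-Z0-9])\1', s)
# Group plus one or more backreferences: 2 in a row at minimum, whole run if longer.
run = re.search(r'([a-zA-Z0-9])\1+', 'xaaaay')

print(three)         # no match, since 'mm' is only two characters
print(two.group(0))
print(run.group(0))
```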
\1 is a back-reference to the first capture group. So the original regex that does not work for you essentially means:
Match three consecutive occurrences of the character captured by the group. In this case the group ([a-zA-Z0-9]) captures a single character a-z, A-Z or 0-9. You then have \1 twice in your regex, which accounts for two back-references to that captured character.
In the second regex the back-reference \1 is followed by a +, which means match at least one more occurrence of the captured character, so a string conforming to this pattern has to contain at least 2 identical consecutive characters.
Hope this helps.

Matching both possible solutions in Regex

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
import re
a = "aaab"
b = re.match('aa', a)  # returns only a single match, at the start of the string
You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not consume any characters, the regex engine advances by only one position after each match, so the same text can be scanned many times, which enables overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
To generalize @stribizhev's solution to match one or more of the character a: (?=(a{1,}))
For three or more: (?=(a{3,})), etc.
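A quick sketch of what these look-ahead patterns return on the string from the question. Using `re.finditer` to get the start positions is my own addition, not part of the answers above:

```python
import re

text = "aaab"

# Each match is empty (the look-ahead consumes nothing); group 1 holds the text.
pairs = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(a{2}))', text)]

# With a{1,} the group is greedy, so every position yields the longest run from there.
runs = re.findall(r'(?=(a{1,}))', text)

print(pairs)
print(runs)
```

Note that the generalized pattern reports one result per starting position, so a run of three a's gives three overlapping results of decreasing length.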

repeating a regular expression pattern

I am attempting to write a regex to match numbers in a given string. The one below manages to retrieve the first number within the string, but it stops there. I would like it to match all numbers within the text.
Thanks in advance.
regular expression :
([^\s+\w+\n\r]*(\d))+
string :
hi there this is 1
yes this is 2
actual match : 1
ideal match : 1,2
On the site regex101.com/#python, type g in the box to the right of your expression. This box is called the modifier. And as others mention in the comments, use re.findall(pattern, your_string) in Python. Notice also that you are actually capturing two substrings, since you have two pairs of parentheses in your regexp.
"([\d]+)"g
Sample
test 13231 test 123123
123 asdfasdf
1a2a3 a
will match
MATCH 1
1. [5-10] `13231`
MATCH 2
1. [16-22] `123123`
MATCH 3
1. [23-26] `123`
MATCH 4
1. [37-38] `1`
MATCH 5
1. [39-40] `2`
MATCH 6
1. [41-42] `3`
and the explanation
"([\d]+)"g
1st Capturing group ([\d]+)
[\d]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\d match a digit [0-9]
g modifier: global. All matches (don't return on first match)
Why don't you simply use \d+?
see demo on regex101.com/#python
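In Python the global behaviour comes from `re.findall`, as mentioned above. A minimal sketch, using the string from the question and the sample text from the other answer (the variable names are arbitrary):

```python
import re

question_text = "hi there this is 1\nyes this is 2"
sample_text = "test 13231 test 123123\n123 asdfasdf\n1a2a3 a"

# findall returns every non-overlapping match, not just the first one.
numbers = re.findall(r'\d+', question_text)
sample_numbers = re.findall(r'\d+', sample_text)

print(numbers)
print(sample_numbers)
```

If the numbers are needed as integers, something like `[int(n) for n in numbers]` would convert them.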
