Regex of sequences surrounded by specific pattern with overlapping problem

I am new to Python's re.finditer and am trying to match a fairly complex pattern, the G-quadruplex motif, described below.
The sequence starts with at least 3 g's, followed by any run of a/t/g/c until the next group of ggg+ appears. This repeats 3 times, giving a pattern like ggg+...ggg+...ggg+...ggg+
Case should be ignored, and matches may overlap in the original sequence: ggg+...ggg+...ggg+...ggg+...ggg+...ggg+ should return 3 such patterns.
I have struggled with this for some time and the only way I have found is:
re.finditer(r"(?=(([Gg]{3,})([AaTtCcGg]+?(?=[Gg]{3,})[Gg]{3,}){3}))", seq)
and then keeping only the matches whose start position coincides with a match of
re.finditer(r"([Gg]{3,})", seq)
Is there a better way to extract this type of sequence? No for loops please, since I have millions of rows like this.
Thank you very much!
PS: an example:
ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC
1.ggggggcgggggggACGCTCggctcAAGGGCTCCGGG
2.gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg
3.GGGCTCCGGGCCCCgggggggACgcgcgAAGGG

You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.
To match both upper and lowercase chars, you can make the pattern case insensitive using re.I
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
( Capture group 1
g{3,} Match 3 or more g chars to start with
(?: Non capture group
[atc](?:g{0,2}[atc])* Match one of a, t or c, then optionally repeat 0, 1 or 2 g chars followed by one of a, t or c, so the spacer never crosses a ggg run
g{3,} Match 3 or more g chars to end with
){3} Close non capture group and repeat 3 times
) Close group 1
) Close lookahead
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
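Since the question used re.finditer, the same pattern can also report where each motif starts. The following is a minimal sketch under that assumption; the helper name find_g4 is hypothetical and not part of the answer above. Compiling the pattern once may help when it is applied to many rows.

```python
import re

# Pattern from the answer above, compiled once so it can be reused across rows.
G4 = re.compile(r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))", re.I)

def find_g4(seq):
    """Return (start, motif) pairs; find_g4 is a hypothetical helper name."""
    # The overall match is empty (lookahead only), so read capture group 1.
    return [(m.start(1), m.group(1)) for m in G4.finditer(seq)]

seq = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"
hits = find_g4(seq)
for start, motif in hits:
    print(start, motif)
```

Each start offset indexes into the original sequence, so seq[start:start + len(motif)] gives back the motif.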

import regex as re  # third-party "regex" module (pip install regex), not the standard-library re
seq = 'ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC'
for match in re.finditer(r'(?<=(^|[^g]))g{3,}([atc](g{,2}[atc]+)*g{3,}){3}', seq, overlapped=True, flags=re.I):
    print(match[0])
With overlapped=True, the search restarts from the character after the start of the current match rather than from its end. On its own, this would give you a bunch of essentially duplicate results, each just dropping one more leading G. To stop that, use a lookbehind to require that the match is not preceded by a G:
(?<=(^|[^g]))
The middle section needs to be a bit more complicated: it must contain at least one A, T or C, which prevents the run of seven G's from being split up and matched as separate G groups. So require one A/T/C, then allow any number of runs of fewer than three G's, each followed by more A/T/C's, repeating as needed:
[atc](g{,2}[atc]+)*
The rest is just the Gs and the repeating.
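The effect of the guard can be checked with the standard library alone. This sketch is only an analogous illustration: it uses a lookahead with re.finditer in place of the regex module's overlapped=True, and it borrows the spacer pattern from the first answer. It compares how many results come back with and without the check for a preceding G.

```python
import re

seq = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"
core = r"g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}"

# Without a guard, several positions inside a leading G run can start a match.
unguarded = [m.group(1) for m in re.finditer(r"(?=(" + core + r"))", seq, re.I)]
# With the guard, only the first G of each run can start a match.
guarded = [m.group(1) for m in re.finditer(r"(?<!g)(?=(" + core + r"))", seq, re.I)]

print(len(unguarded), len(guarded))
# Every unguarded result is a guarded result with leading G's trimmed off.
print(all(any(g.endswith(u) for g in guarded) for u in unguarded))
```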

Related

What is a regex expression that can prune down repeating identical characters down to a maximum of two repeats?

I feel I am having the most difficulty explaining this well enough for a search engine to pick up on what I'm looking for. The behavior is essentially this:
string = "aaaaaaaaare yooooooooou okkkkkk"
would become "aare yoou okk", where the maximum number of repeats for any given character is two.
Matching the excess duplicates, and then re.sub -ing it seems to me the approach to take, but I can't figure out the regex statement I need.
The only attempt I feel is even worth posting is this - (\w)\1{3,0}
Which matched only the first instance of a character repeating more than three times - so only one match, and the whole block of repeated characters, not just the ones exceeding the max of 2. Any help is appreciated!

The regexp should be (\w)\1{2,} to match a character followed by at least 2 repetitions. That's 3 or more when you include the initial character.
The replacement is then \1\1 to replace with just two repetitions.
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
new_string = re.sub(r'(\w)\1{2,}', r'\1\1', string)
# new_string is now "aare yoou okk"
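The same idea extends to other limits. Below is a minimal sketch, assuming you want the maximum run length as a parameter; the helper limit_repeats and its max_repeats argument are hypothetical names, not part of the answer.

```python
import re

def limit_repeats(text, max_repeats=2):
    """Collapse runs of one word character longer than max_repeats (hypothetical helper)."""
    # (\w) captures the character; \1{n,} then needs at least n more copies,
    # i.e. a run of n + 1 or more characters in total.
    pattern = re.compile(r"(\w)\1{%d,}" % max_repeats)
    return pattern.sub(lambda m: m.group(1) * max_repeats, text)

print(limit_repeats("aaaaaaaaare yooooooooou okkkkkk"))     # aare yoou okk
print(limit_repeats("aaaaaaaaare yooooooooou okkkkkk", 1))  # are you ok
```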

You could write
import re
string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r'(\w)\1*(?=\1\1)'
re.sub(rgx, '', string)
#=> "aare yoou okk"
The regular expression can be broken down as follows.
(\w) # match one word character and save it to capture group 1
\1* # match the content of capture group 1 zero or more times
(?= # begin a positive lookahead
\1\1 # match the content of capture group 1 twice
) # end the positive lookahead
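To see what this pattern actually removes, it can help to list the matches before substituting. A small sketch using the same pattern: each match is the surplus part of a run, and the lookahead keeps the last two copies out of the match.

```python
import re

string = "aaaaaaaaare yooooooooou okkkkkk"
rgx = r"(\w)\1*(?=\1\1)"

# Each match is the surplus part of a run; the final two copies of the
# character are only looked at, not consumed.
excess = [m.group() for m in re.finditer(rgx, string)]
print(excess)
print(re.sub(rgx, "", string))  # aare yoou okk
```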

How to capture a group only if occurs twice in a line

import re
text = """
Tumble Trouble Twwixt Two Towns!
Was the Moon soon in the Sea
Or soon in the sky?
Nobody really knows YET.
"""
How should I make the match happen only when the occurrence is found twice in a line?
I need a regular expression that highlights two 'o's that appear beside each other only if another pair of adjacent 'o's appears later in the same line.

You can match one or more word characters followed by a backreference to them, and wrap that in another group.
The word character will become group 2 as the groups are nested, then the outer group will be group 1.
Then you can assert group 1 using a positive lookahead again in the line.
((\w+)\2)(?=.*?\1)
The pattern matches:
( Capture group 1
(\w+)\2 Match 1+ word chars in capture group 2 followed by a backreference to group 2 to match the same again
) Close group 1
(?=.*?\1) Positive lookahead to assert the captured value of group 1 in the line
Example
print(re.compile(r"((\w+)\2)(?=.*?\1)").sub(r'{\g<1>}', text.rstrip()))
Output
Tumble Trouble Twwixt Two Towns!
Was the M{oo}n soon in the Sea
Or soon in the sky?
Nobody really knows YET.
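If you only need to know which lines qualify, the same pattern can be used as a filter. A minimal sketch, assuming the text from the question; the variable name matching is illustrative.

```python
import re

text = """
Tumble Trouble Twwixt Two Towns!
Was the Moon soon in the Sea
Or soon in the sky?
Nobody really knows YET.
"""

pattern = re.compile(r"((\w+)\2)(?=.*?\1)")

# Keep only the lines where a doubled group occurs again later in the line.
matching = [line for line in text.splitlines() if pattern.search(line)]
print(matching)  # ['Was the Moon soon in the Sea']
```

Because . does not match a newline by default, the lookahead cannot see past the end of a line even when the pattern is run over the whole text.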

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0m with g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
033[33m\033[1m
Group1: 33
Group2: 1
Match 2
033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.

You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(set(re.findall(r"\[(\d+)", m.group())))+'m', text)
The (?:\\033\[\d+m){2,} pattern matches two or more consecutive chunks of \033[ + one or more digits + m. Each match is then passed to the lambda expression, which builds the output from: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with set and joined with ;, and then 3) m. Note that a set does not preserve order, so the codes may come out as 1;33 rather than 33;1.
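If the order of the codes matters, an order-preserving variant is possible. This is a sketch rather than the answer's own code: it swaps set for dict.fromkeys, and merge_codes and fuse are hypothetical names. Like the question, it assumes the text contains the literal characters \033, not an actual escape byte.

```python
import re

def merge_codes(text):
    """Fuse adjacent escape sequences into one (hypothetical helper name)."""
    def fuse(m):
        codes = re.findall(r"\[(\d+)", m.group())
        # dict.fromkeys drops duplicates but keeps first-seen order,
        # which a plain set does not guarantee.
        return r"\033[" + ";".join(dict.fromkeys(codes)) + "m"
    return re.sub(r"(?:\\033\[\d+m){2,}", fuse, text)

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
print(merge_codes(text))
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
```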

The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)"
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
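As a rough sketch of how the two captures might be combined, assuming the input string from the question: a function is used as the replacement so that the backslashes in the captured text are kept verbatim. Note that this regex is anchored with ^, so it only merges the pair at the start of the line.

```python
import re

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
rgx = r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)"

# Join capture group 1 and capture group 3 with a semicolon.
result = re.sub(rgx, lambda m: m.group(1) + ";" + m.group(3), text)
print(result)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
```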

This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
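A short sketch of this pattern applied to the string from the question, showing the two groups per match and one possible way to fuse them. The fusing step is an assumption about what the asker wants; unlike a deduplicating approach, it turns the trailing pair into 0;0.

```python
import re

text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"
pair = re.compile(r"\\033\[(\d+)m\\033\[(\d+)m")

# findall returns one (group 1, group 2) tuple per adjacent pair.
print(pair.findall(text))  # [('33', '1'), ('0', '0')]

# Fuse each pair into a single sequence with both codes.
fused = pair.sub(lambda m: r"\033[" + m.group(1) + ";" + m.group(2) + "m", text)
print(fused)
```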

Python Regular expression to match if any number appears more than a certain amount of times

I'm in need of a regular expression for python that is able to match all strings where any number appears a certain amount of times (4 times in a 5 digit number is my desired result in this example).
For example, consider this list:
["11211", "23424", "22323", "99991", "88988", "11122"]
I would like a regEx that returns
["11211", "99991", "88988"]
because in these three cases, there is a digit that appears at least 4 times in the number.
I am not even sure if this is easily doable with just one single regEx, apart from hardcoding the digits from 0-9, which does not seem to be an elegant solution.
Here is a regEx that matches four 1's in a list of 5 number strings:
four1 = re.compile(".*1.*1.*1.*1.*")
But is there a more elegant solution than these two to not only search for four 1's, but four of any kind, as long as they are four times the same number?
four1 = re.compile("(.*1.*1.*1.*1.*)|(.*2.*2.*2.*2.*)| ...")
or
four1 = re.compile(".*1.*1.*1.*1.*")
four2 = re.compile(".*2.*2.*2.*2.*")
...
Thank you for your help.

You may use this regex with a capture group and a back-reference:
(\d)(?:\d*?\1){3}
RegEx Description:
(\d): Match a single digit and capture in group #1
(?:: Start non-capture group
\d*?: Match 0 or more digits
\1: Back-reference to capture group #1 to make sure we match repeating digits of capture group #1
): End non-capture group
{3}: Match 3 instances of above non-capture group
Code:
import re
arr = ["11211", "23424", "22323", "99991", "88988", "11122"]
reg = re.compile(r'(\d)(?:\d*?\1){3}')
for s in arr:
    if reg.search(s):
        print(s)
output:
11211
99991
88988
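The same filter can be written as a list comprehension and cross-checked against a plain count. This is a sketch: the Counter comparison is an added sanity check, not part of the answer, and the variable names are illustrative.

```python
import re
from collections import Counter

arr = ["11211", "23424", "22323", "99991", "88988", "11122"]
four_same = re.compile(r"(\d)(?:\d*?\1){3}")

by_regex = [s for s in arr if four_same.search(s)]
# Cross-check without a regex: does any digit occur at least four times?
by_count = [s for s in arr if max(Counter(s).values()) >= 4]

print(by_regex)              # ['11211', '99991', '88988']
print(by_regex == by_count)  # True
```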

Use regex to identify 4 to 5 numbers that are (consecutive, i.e no whitespace or special characters included), without including preceding 0's

I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below is working effectively in all cases unless there are consecutive 0's preceding a one, two or 3 digit number. I don't want '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000' to all be matches. Is there a good way to implement this using regular expressions? Here is my current code that works for most cases except when there are preceding 0's to a series of digits less than 4 or 5 characters in length.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))

You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Because you can simply use a capturing group here, an equivalent regex without the lookbehind also works:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
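Applied to the code from the question, the pattern might be used as follows. This is a minimal sketch; first_number is a hypothetical helper that keeps the question's convention of returning 0 when nothing matches.

```python
import re

NUM = re.compile(r"(?<!\d)[1-9]\d{3,4}(?!\d)")

def first_number(line):
    """Return the first 4-5 digit number with no leading zero, or 0 (hypothetical helper)."""
    match = NUM.search(line)
    return int(match[0]) if match else 0

print(first_number("US Machine Operations | 0054"))   # 0
print(first_number("US Machine Operations | 10354"))  # 10354
print(first_number("US Machine Operations | 9000"))   # 9000
```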
