How to add additional criteria to re.findall ... Python 2.7? - python

ORF_sequences = re.findall(r'ATG(?:...){9,}?(?:TAA|TAG|TGA)',sequence) #thanks to #Martin Pieters and #nneonneo
I have a line of code that finds any instance of A|G followed by 2 characters and then ATG, which is then followed by TAA|TAG|TGA when read in units of 3. It only matches when A|G-xx-ATG-xxx-TAA|TAG|TGA is 30 elements or longer.
I want to add a criterion:
I need the ATG to be followed by a G,
so A|G-xx-ATG-Gxx-xxx-TAA|TGA|TAG #at least 30 elements long
example:
GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would work
GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA
^ would not work because the (A|G) is followed by only one character (not 2) before the ATG, and there is no G following the A|G-xx-ATG
I hope this makes sense.
I tried
ORF_sequences = re.findall(r'ATGG(?:...){9,}?(?:TAA|TAG|TGA)',sequence)
but it seemed to apply the window of 3 after the last G of ATGG.
Basically, I need that code, where the first occurrence is A|G-xx-ATG and the second occurrence is (G-xx).

It'll be easier if you use a character class, [AG]; there is no need to group the two 'free' characters:
ORF_sequences2 = re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
or, if you keep the alternation, you need to group the A|G:
ORF_sequences2 = re.findall(r'(?:A|G)..ATG(?:...)*?(?:TAA|TAG|TGA)',fdna)
Applying the first form to your examples:
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
>>> re.findall(r'[AG]..ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCATGGGGTTTTGA')
[]
In your attempt, the expression matches either an A, or the expression G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA), because the | symbol applies to everything that precedes or follows it within the same group. As it is not grouped, it applies to the whole expression instead:
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'A')
['A']
>>> re.findall(r'A|G(?:..)ATG(?:...)*?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTGA')
['GCCATGGGGTTTTGA']
If you need to match a certain amount of characters in your whole match, you need to tailor those 3 character (?:...) groups to match a minimum number of times:
ORF_sequences2 = re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)',fdna)
would match A or G followed by 2 characters, followed by ATGG with another 2 characters, then at least 7 times 3 characters (total 21), followed by a specific pattern of 3 more (TAA, TAG or TGA) for a total of at least 33 characters from the first to the last character. The extra .. make up the pattern of 3 after ATG and matches your example from your comment:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGA']
as well as correctly handling the examples given in your question:
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
['GCCATGGGGTTTTTTTTTTTTTTTTTTTTTTTTTGA']
>>> re.findall(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)', 'GCATGAGGTTTTTTTTTTTTTTTTTTTTTTTTTGA')
[]
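The length requirement can also be checked without counting characters by hand. The following is a minimal sketch, assuming the ATGG pattern from this answer; the helper name find_orfs and the synthetic sequences are made up for illustration, and the snippet is written for Python 3 (the pattern itself is the same in Python 2.7).

```python
import re

# Pattern from the answer above: [AG], two free characters, ATG followed by G
# and two more characters, at least 7 further triplets, then a stop codon.
ORF_PATTERN = re.compile(r'[AG]..ATGG..(?:...){7,}?(?:TAA|TAG|TGA)')


def find_orfs(sequence):
    """Return all non-overlapping matches of ORF_PATTERN (hypothetical helper)."""
    return ORF_PATTERN.findall(sequence)


# Synthetic sequences assembled from triplets so their lengths are explicit.
long_enough = "GCC" + "ATG" + "GGG" + "TTT" * 7 + "TGA"     # 3 + 3 + 3 + 21 + 3 = 33
too_short = "GCC" + "ATG" + "GGG" + "TTT" * 6 + "TGA"       # 30: one triplet short
no_g_after_atg = "GCC" + "ATG" + "AGG" + "TTT" * 7 + "TGA"  # ATG is followed by A

print(find_orfs(long_enough) == [long_enough])  # True
print(find_orfs(too_short))                     # []
print(find_orfs(no_g_after_atg))                # []
```

Building the inputs from triplets makes it easy to see which lower bound in {7,} corresponds to which total length.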

To ensure you get at least 30 characters, use the {n,} quantifier:
r'[AG]..ATG(?:...){9,}?(?:TAA|TAG|TGA)'
This ensures that you read at least 9 triplets (27 characters) between the ATG opening and the TAA|TGA|TAG terminator.

Related

Regex of sequences surrounded by specific pattern with overlapping problem

I am very new to using Python's re.finditer to find a regex pattern, but I am trying to match a complex pattern: the G-quadruplex motif, described below.
The sequence starts with at least 3 g, followed by a/t/g/c multiple times until the next group of ggg+ shows up. This repeats 3 times, resulting in a pattern like ggg+...ggg+...ggg+...ggg+
Case should be ignored, and matches can overlap in the original sequence, so ggg+...ggg+...ggg+...ggg+...ggg+...ggg+ should return 3 such patterns.
I have struggled for some time and can only find a way like:
re.finditer(r"(?=(([Gg]{3,})([AaTtCcGg]+?(?=[Gg]{3,})[Gg]{3,}){3}))", seq)
and then filtering out the matches that do not share a start position with the matches of
re.finditer(r"([Gg]{3,})", seq)
Is there any better way to extract this type of sequence? And no for loops please, since I have millions of rows like this.
Thank you very much!
PS: an example can be like this
ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC
1.ggggggcgggggggACGCTCggctcAAGGGCTCCGGG
2.gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg
3.GGGCTCCGGGCCCCgggggggACgcgcgAAGGG
You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.
To match both upper and lowercase chars, you can make the pattern case insensitive using re.I
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
( Capture group 1
g{3,} Match 3 or more g chars to start with
(?: Non capture group
[atc](?:g{0,2}[atc])* Match one of a, t, c, then optionally repeat 0, 1 or 2 g chars followed by one of a, t, c, so the gap never contains ggg
g{3,} Match 3 or more g chars to end with
){3} Close non capture group and repeat 3 times
) Close group 1
) Close lookahead
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = ("ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n")
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
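Since the question asks about re.finditer, here is a small sketch of the same pattern used that way. The overall match is zero-width (everything sits inside the lookahead), so both the text and its position have to be read from group 1. The helper name find_g4 and the toy sequence are assumptions for illustration only.

```python
import re

# Same pattern as above, compiled once with the case-insensitive flag.
G4_PATTERN = re.compile(
    r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))", re.I
)


def find_g4(seq):
    """Return (start, motif) pairs for every overlapping motif (hypothetical helper)."""
    # m.start(1) / m.group(1) refer to the capture group inside the lookahead.
    return [(m.start(1), m.group(1)) for m in G4_PATTERN.finditer(seq)]


# Toy sequence: five runs of three G's separated by single a's.
toy = "gggaGGGagggaGGGaggg"
for start, motif in find_g4(toy):
    print(start, motif)
# 0 gggaGGGagggaGGG
# 4 GGGagggaGGGaggg
```

Five G-runs give two overlapping motifs of four runs each; the run starting at index 8 has only two runs after it, so it does not start a match.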
import regex as re
seq = 'ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC'
for match in re.finditer(r'(?<=(^|[^g]))g{3,}([atc](g{,2}[atc]+)*g{3,}){3}', seq, overlapped=True, flags=re.I):
    print(match[0])
Overlapped works by restarting the search from the character after the start of the current match rather than the end of it. This would give you a bunch of essentially duplicate results, each just removing an extra leading G. To stop that, check for a preceding G with a lookbehind:
(?<=(^|[^g]))
The middle section needs to be a bit more complicated: it must require an A, T or C, which prevents a run of seven G's from being split into separate ggg groups. So require one, then allow any number of runs of fewer than three G's, each followed by more ATCs, repeating as needed:
[atc](g{,2}[atc]+)*
The rest is just the Gs and the repeating.

Matching strings where multiple capture groups must be different in regex

I am trying to create a regular expression that picks out a boolean algebra identity, specifically ((A+B).(A+C)), where A, B and C are different strings consisting of characters [A-Z].
I am running into problems getting the regular expression to recognise that, in the string I am looking for, A != B != C.
Here is what I have tried:
\(\(([A-Z]+)\+([A-Z])\)\.\(\1\+([A-Z])\)\)
however, even though I have put every string that I want to be different in a capturing group, it doesn't stop it from thinking that strings B and C are the same. This is because the regular expression matches for all three of the following strings:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
while I only want it to match the first one.
You can use negative lookaheads to make sure that group 2 is not the same as group 1, and that group 3 is not the same as either group 1 or group 2.
\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)
Split up for readability:
\(\(
([A-Z]+)
\+
(?!\1)([A-Z])
\)\.\(
\1
\+
(?!\1)(?!\2)([A-Z])
\)\)
Inputs:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
((A+B).(A+B))
Matches:
((A+B).(A+C))
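The same check can be reproduced in Python. This is a minimal sketch, assuming the four inputs listed above; the names DISTINCT_TERMS and candidates are illustrative.

```python
import re

# Pattern from the answer: the lookaheads (?!\1) and (?!\2) reject a term
# that repeats an earlier captured term.
DISTINCT_TERMS = re.compile(
    r"\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)"
)

candidates = [
    "((A+B).(A+C))",
    "((A+B).(A+A))",
    "((A+A).(A+A))",
    "((A+B).(A+B))",
]

# Keep only the strings that the pattern matches in full.
matches = [s for s in candidates if DISTINCT_TERMS.fullmatch(s)]
print(matches)  # ['((A+B).(A+C))']
```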

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, besides many other non-working solutions:
(\\033\[([\d]+)m){2,} => captures \033[33m\033[1m with g1 '\033[1m' and g2 '1', and \033[0m\033[0m with g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not OK
(?:(\\033\[([\d]+)m)+) many results, not OK, although this is the recommended way for repeated patterns if I understood correctly
and others.
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.
You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(dict.fromkeys(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern matches two or more consecutive chunks of \033[ + one or more digits + m. Each match is then passed to the lambda expression, whose output is: 1) \033[, 2) all the numbers after [ extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with dict.fromkeys (which, unlike set, keeps them in their original order on Python 3.7+) and joined with ;, and then 3) m.
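A runnable sketch of this substitution on the string from the question follows. It de-duplicates with dict.fromkeys rather than set so the codes keep their original order; that choice is an assumption made here so the result is deterministic (it relies on Python 3.7+ dict ordering). The function name merge_codes is illustrative.

```python
import re

# Raw strings: the backslash is a literal character here, as in the question.
text = (r"\033[33m\033[1mThis is a warning\033[0m: "
        r"Virus manager \033[4mfailed\033[0m\033[0m")


def merge_codes(match):
    """Join the numbers of adjacent escape chunks, dropping repeats in order."""
    codes = re.findall(r"\[(\d+)", match.group())
    unique = list(dict.fromkeys(codes))  # order-preserving de-duplication
    return r"\033[" + ";".join(unique) + "m"


# Only runs of two or more adjacent chunks are rewritten; lone chunks stay as-is.
merged = re.sub(r"(?:\\033\[\d+m){2,}", merge_codes, text)
print(merged)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
```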
The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\033[
I've assumed that m must be one lowercase letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
This is pretty straightforward for the cases you describe here: simply write out from left to right what you want to match and capture. Repeated capturing groups won't help you here, because only the most recently captured value of each group is returned.
\\033\[(\d+)m\\033\[(\d+)m
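As a sketch of how this explicit two-group pattern behaves on the question's string (the names PAIR and fused are illustrative): each adjacent pair is matched once, with the two numbers in groups 1 and 2. Note that a plain template substitution does not drop repeated codes, so the trailing pair becomes 0;0 rather than 0.

```python
import re

# Raw string: the backslash is a literal character, as in the question.
text = (r"\033[33m\033[1mThis is a warning\033[0m: "
        r"Virus manager \033[4mfailed\033[0m\033[0m")

# Two chunks written out explicitly, one capture group per number.
PAIR = re.compile(r"\\033\[(\d+)m\\033\[(\d+)m")

for m in PAIR.finditer(text):
    print(m.group(0), m.group(1), m.group(2))
# \033[33m\033[1m 33 1
# \033[0m\033[0m 0 0

# In the replacement template, \\ is a literal backslash and \1, \2 are the groups.
fused = PAIR.sub(r"\\033[\1;\2m", text)
print(fused)
# \033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0;0m
```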

Use regex to identify 4 to 5 numbers that are (consecutive, i.e no whitespace or special characters included), without including preceding 0's

I am trying to use regular expressions to identify 4 to 5 digit numbers. The code below works in all cases except when consecutive 0's precede a one-, two- or three-digit number. I don't want '0054', '0008' or '0009' to be a match, but I do want '10354', '10032', '9005' and '9000' to all be matches. Is there a good way to implement this using regular expressions? Here is my current code, which works for most cases except when 0's precede a series of digits shorter than 4 characters.
import re
line = 'US Machine Operations | 0054'
match = re.search(r'\d{4,5}', line)
if match is None:
    print(0)
else:
    print(int(match[0]))
You may use
(?<!\d)[1-9]\d{3,4}(?!\d)
See the regex demo.
NOTE: In Pandas str.extract, you must wrap the part you want to be returned with a capturing group, a pair of unescaped parentheses. So, you need to use
(?<!\d)([1-9]\d{3,4})(?!\d)
       ^            ^
Example:
df2['num_col'] = df2.Warehouse.str.extract(r'(?<!\d)([1-9]\d{3,4})(?!\d)', expand = False).astype(float)
Since you can simply use a capturing group, you may also use an equivalent regex without the lookbehind:
(?:^|\D)([1-9]\d{3,4})(?!\d)
Details
(?<!\d) - no digit immediately to the left
or (?:^|\D) - start of string or non-digit char (a non-capturing group is used so that only 1 capturing group could be accommodated in the pattern and let str.extract only extract what needs extracting)
[1-9] - a non-zero digit
\d{3,4} - three or four digits
(?!\d) - no digit immediately to the right is allowed
Python demo:
import re
s = "US Machine Operations | 0054 '0054','0008',or '0009' to be a match, but i would want '10354' or '10032', or '9005', or '9000'"
print(re.findall(r'(?<!\d)[1-9]\d{3,4}(?!\d)', s))
# => ['10354', '10032', '9005', '9000']
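Adapting the code from the question to this pattern could look like the sketch below. The function name extract_code and the sample lines other than the first are assumptions for illustration; as in the original code, 0 is returned when nothing matches.

```python
import re

# A non-zero digit followed by three or four digits, with no digit on either side.
CODE_PATTERN = re.compile(r"(?<!\d)[1-9]\d{3,4}(?!\d)")


def extract_code(line):
    """Return the first 4-5 digit number without leading zeros, or 0 if none."""
    match = CODE_PATTERN.search(line)
    return int(match.group()) if match else 0


lines = [
    "US Machine Operations | 0054",   # leading zeros: no match
    "US Machine Operations | 10354",  # five digits
    "Unit 9000",                      # four digits
    "Serial 123456",                  # six digits: too long, no match
]
for line in lines:
    print(extract_code(line))
# 0
# 10354
# 9000
# 0
```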

repeating a regular expression pattern

I am attempting to write a regex to match numbers in a given string. The one below manages to retrieve the first number within the string, but it stops there; I would like it to match all numbers within the file.
Thanks in advance.
regular expression :
([^\s+\w+\n\r]*(\d))+
string :
hi there this is 1
yes this is 2
actual match : 1
ideal match : 1,2
On regex101.com/#python, type g in the box to the right of your expression; this box holds the modifiers. As others mention in the comments, use re.findall(pattern, your_string) in Python. Notice also that you are actually capturing two substrings: you have two pairs of parentheses in your regexp.
"([\d]+)"g
Sample
test 13231 test 123123
123 asdfasdf
1a2a3 a
will match
MATCH 1
1. [5-10] `13231`
MATCH 2
1. [16-22] `123123`
MATCH 3
1. [23-26] `123`
MATCH 4
1. [37-38] `1`
MATCH 5
1. [39-40] `2`
MATCH 6
1. [41-42] `3`
and the explanation
"([\d]+)"g
1st Capturing group ([\d]+)
[\d]+ match a single character present in the list below
Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
\d match a digit [0-9]
g modifier: global. All matches (don't return on first match)
Why don't you simply use \d+?
see demo on regex101.com/#python
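In Python, the equivalent of the g modifier is re.findall, which returns every match instead of stopping at the first. A minimal sketch using the string from the question and the sample above (the variable names are illustrative):

```python
import re

# String from the question: one number per line.
text = "hi there this is 1\nyes this is 2"
print(re.findall(r"\d+", text))
# ['1', '2']

# Sample from the answer above: digits separated by letters match individually.
sample = "test 13231 test 123123\n123 asdfasdf\n1a2a3 a"
print(re.findall(r"\d+", sample))
# ['13231', '123123', '123', '1', '2', '3']
```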
