Finding matching motifs on sequence and their positions - python

I am trying to find matching motifs in a sequence, together with the position of each motif, and then write the result to a FASTA file. The code below shows that the motif [L**L*L] is present in the sequence: when I run it, it returns "YES", but I do not know where the motif is positioned.
The ** inside the square brackets shows that any amino acid is permitted there.
This is the code I used to check whether the motif is present in the sequence, and it worked because it returned "YES".
peptide1= "MKFSNEVVHKSMNITEDCSALTGALLKYSTDKSNMNFETLYRDAAVESPQHEVSNESGSTLKEHDYFGLSEVSSSNSSSGKQPEKCCREELNLNESATTLQLGPPAAVKPSGHADGADAHDEGAGPENPAKRPAHHMQQESLADGRKAAAEMGSFKIQRKNILEEFRAMKAQAHMTKSPKPVHTMQHNMHASFSGAQMAFGGAKNNGVKRVFSEAVGGNHIAASGVGVGVREGNDDVSRCEEMNGTEQLDLKVHLPKGMGMARMAPVSGGQNGSAWRNLSFDNMQGPLNPFFRKSLVSKMPVPDGGDSSANASNDCANRKGMVASPSVQPPPAQNQTVGWPPVKNFNKMNTPAPPASTPARACPSVQRKGASTSSSGNLVKIYMDGVPFGRKVDLKTNDSYDKLYSMLEDMFQQYISGQYCGGRSSSSGESHWVASSRKLNFLEGSEYVLIYEDHEGDSMLVGDVPWELFVNAVKRLRIMKGSEQVNLAPKNADPTKVQVAVG"
if re.search(r"L*L*L", peptide1):
    print("YES")
else:
    print("NO")
The code that I wrote to find the position is below, but when I run it, it says invalid syntax. Could you please assist? I have no clue whether I am on the right track, as I am still new to the field and to Python.
for position in range(len(s)):
    if peptide[position:].startswith(r"L*L*L"):
        print(position+1)
I am expecting to see the positions of these motifs identified; for example, the output should state whether the motif is found at positions [2, 10] or at any other numbers. These are just random positions that I chose, since I don't know where the motif is located.

You can use re.finditer() to search for multiple regex pattern matches within a string. Your peptide1 example does not contain an "L*L*L" motif, so I designated a random simple string as a demo.
import re

simple_demo_string = "ABCLXLYLZLABC"  # use a simple string to demonstrate code
The demo string contains two overlapping motifs. Normally, regex matches do not account for overlap.
Example 1
simple_regex = "L.L.L"  # in regex, periods are match-any wildcards
for x in re.finditer(simple_regex, simple_demo_string):
    print(x.start(), x.end(), x.group())
# Output: 3 8 LXLYL
However, if you use a capturing group inside a lookahead, you'll be able to get everything even if there's overlap.
Example 2
lookahead_regex = "(?=(L.L.L))"
for x in re.finditer(lookahead_regex, simple_demo_string):
    # note - x.end() becomes same as x.start() due to lookahead
    # but can be corrected by simply adding length of match
    print(x.start(), x.start() + len(x.group(1)), x.group(1))
# Output: 3 8 LXLYL
#         5 10 LYLZL
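Since the question asks for positions counted from 1 (the asker prints position+1), the lookahead approach can be wrapped in a small helper. This is only a sketch: the function name find_motif_positions is made up for illustration, and the demo reuses the "L.L.L" pattern and demo string from the examples above rather than the real peptide.

```python
import re

def find_motif_positions(sequence, motif_regex):
    """Return (start, end, match) tuples using 1-based, inclusive positions.

    The motif is wrapped in a lookahead so overlapping matches are reported too.
    """
    lookahead = re.compile(f"(?=({motif_regex}))")
    hits = []
    for m in lookahead.finditer(sequence):
        matched = m.group(1)
        start = m.start() + 1            # 0-based index -> 1-based position
        end = m.start() + len(matched)   # 1-based position of the last residue
        hits.append((start, end, matched))
    return hits

demo = "ABCLXLYLZLABC"
for start, end, matched in find_motif_positions(demo, "L.L.L"):
    print(start, end, matched)
# Output: 4 8 LXLYL
#         6 10 LYLZL
```

An empty list means the motif does not occur, which replaces the separate YES/NO check.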

Related

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string. But making it a raw string means we
# don't need to escape the backslashes
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
You wrote: "Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace it in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
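Because the spacer and tail are defined only by their lengths, the pattern can be built from two numbers. The sketch below assumes the forward orientation only; the function name highlight and its parameter names are hypothetical, and re.escape is added so that to_find is always treated as literal text.

```python
import re

def highlight(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len characters and the following
    tail_len characters in the three span tags from the question."""
    pattern = re.compile(
        f"({re.escape(to_find)})(.{{{spacer_len}}})(.{{{tail_len}}})"
    )
    repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
    return pattern.sub(repl, sequence)

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
print(highlight(sequence, 'AATCGA', 1, 4))
print(highlight(sequence, 'AATCGA', 3, 2))
```

The first call reproduces the expected output from the question; the second shows a three-character spacer followed by a two-character tail.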

How would I implement a rfind in Lua with the arguments?

For example, I would like to do something like this in lua:
s = "Hey\n There And Yea\n"
print(s.rfind("\n", 0, 5))
I've tried making this in lua with the string.find function:
local s = "Hey\n There And Yea\n"
local _, p = s:find(".*\n", -5)
print(p)
But these aren't producing the same results. What am I doing wrong, and how can I fix this to make it behave the same as rfind?
Lua has a little-known function, string.reverse, that reverses all characters of a string. While this is rarely needed, the function can typically be used to perform a reverse search inside a string.
So to implement rfind, you search for the reversed pattern inside the reversed original string, and finally do some arithmetic to obtain the offset in the original string.
Here is the code that mimics Python rfind:
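function rfind(subject, tofind, startIdx, endIdx)
    startIdx = startIdx or 0
    endIdx = endIdx or #subject
    -- work on the reversed slice so that the first hit is the rightmost one
    subject = subject:sub(startIdx+1, endIdx):reverse()
    tofind = tofind:reverse()
    -- plain find (4th argument true) so magic characters in tofind are taken literally
    local idx = subject:find(tofind, 1, true)
    return idx and #subject - #tofind - idx + startIdx + 1 or -1
end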
print(rfind("Hello World", "H")) --> 0
print(rfind("Hello World", "l")) --> 9
print(rfind("foo foo foo", "foo")) --> 8
print(rfind("Hello World", "Toto")) --> -1
print(rfind("Hello World", "l", 1, 4)) --> 3
Note that this version of rfind uses the Python index convention, starting at 0 and returning -1 if the string is not found. It would be more consistent in Lua to use 1-based indexes and to return nil when there is no match. The modification would be trivial.
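To check the index arithmetic, the same reversal idea can be written in Python and compared against the built-in str.rfind. This is a sketch only: rfind_by_reversal is a made-up name, and it assumes non-negative start and end values.

```python
def rfind_by_reversal(subject, tofind, start=0, end=None):
    """Mimic str.rfind by searching the reversed needle in the reversed slice."""
    if end is None:
        end = len(subject)
    window = subject[start:end][::-1]     # reversed slice of the subject
    idx = window.find(tofind[::-1])       # 0-based offset inside the reversed slice
    if idx == -1:
        return -1
    # translate the offset back into an index of the original string
    return len(window) - len(tofind) - idx + start

s = "Hey\n There And Yea\n"
print(rfind_by_reversal(s, "\n", 0, 5), s.rfind("\n", 0, 5))   # 3 3
print(rfind_by_reversal("foo foo foo", "foo"))                  # 8
```

The Python formula has no trailing + 1 because str.find already returns a 0-based offset, whereas Lua's string.find is 1-based.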
The pattern I have written will only work for single-char substrings like the one the asker used as a test case. Skip ahead to the next header ("The answer") to see that answer, or read on for an explanation of some of the things that went wrong in the attempt. Skip to the final header for a general, inefficient solution for multi-char substrings.
I have tried to recreate the output of Python's mystring.rfind with Lua's mystring:find; it only works for single-character substrings. Later I will show you a function that handles all cases but relies on a rather inefficient loop.
As a recap (to address what you're doing wrong), let's talk about mystringvar:find("pattern", index), which is sugar for string.find(mystringvar, "pattern", index). This returns the start and stop indexes of the match.
The optional index sets the start of the search, not the end. A negative index counts backwards from the end of the string, so the search runs from that position to the end of the string (an index of -1 only evaluates the last character, -2 the last 2). This is not the desired behavior.
Instead of trying to use the index to create a substring, you should create a substring like this:
mystringvar:sub(start, end) will extract and return the substring from start to end (1 indexed, inclusive end). So to recreate Python's 0-5 (0 indexed, exclusive end), use 1-5.
Now note that these methods can be chained into string:sub(x, y):find("") but I will break it up for ease of reading. Without further ado, I present you:
The answer
local s = "Hey\n There And Yea\n"
local substr = s:sub(1,5)
local start, fin = substr:find("\n[^\n]-$")
print(start, ",", fin)
I had a few half measure solutions, but to make sure what I was writing would work for multiple substring instances (the 1-5 substring only contains 1), I tested with the substring and the whole string. Observe:
output with sub(1, 5): 4 , 5
output with sub(1, 19) (the whole length): 19 , 19
These both correctly report the beginning of the rightmost substring, but note that the "fin" index goes to the end of the string; I will explain why in a second. I hope this is fine: rfind only returns the starting index anyway, so this should be an appropriate replacement.
Let's reread the code to see how it works:
sub I've already explained
There is no longer a need for index in string.find
Alright, what's this pattern "\n[^\n]-$"?
$ - anchor to the end of the string
[^x] - match "not x"
- - as few matches as possible (even 0) of the previous character or set (in this case, [^\n]). This means that it will still work if the string ends with your substring.
It begins with \n, so all together it means: "Find me a line break that is followed by no other line breaks, up to the end of the string." This means that even though your substring only contains 1 instance of \n, if you were to use this pattern on a string with multiple occurrences, you would still get the highest index, as rfind does.
Note that the indexes returned by string.find always span the whole match, even when the pattern contains groups (()), so it would be in vain to wrap the \n in a group. As a consequence, I cannot stop the end anchor $ from extending the fin variable to the end of the string.
I hope this works well for you.
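For comparison, the same pattern idea can be checked in Python, where the expected rfind result is directly available. This is a rough sketch: last_newline is a hypothetical helper, the lazy quantifier *? stands in for Lua's -, and \Z is used as the end-of-string anchor because Python's $ also matches just before a trailing newline.

```python
import re

s = "Hey\n There And Yea\n"

def last_newline(text):
    # a line break followed by no other line breaks up to the end of the string
    m = re.search(r"\n[^\n]*?\Z", text)
    return m.start() if m else -1

print(last_newline(s[:5]), s.rfind("\n", 0, 5))   # 3 3
print(last_newline(s), s.rfind("\n"))             # 18 18
```

The Python results are 0-based, so 3 and 18 correspond to the Lua outputs 4 and 19 shown above.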
Function to do this for substrings of any length
I will not be explaining this one.
function string.rfind(str, substr, plain) --plain is included for you to pass to find if you wish to ignore patterns
    assert(substr ~= "") --An empty substring would cause an endless loop. Bad!
    local plain = plain or false --default plain to false if not included
    local index = 0
    --[[
    Watch closely... we continually shift the starting point after each found index until nothing is left.
    At that point, we find the difference between the original string's length and the new string's length, to see how many characters we cut out.
    ]]--
    while true do
        local new_start, _ = string.find(str, substr, index, plain) --index will continually push up the string to after whenever the last index was.
        if new_start == nil then --no match is found
            if index == 0 then return nil end --if no match is found and the index was never changed, return nil (there was no match)
            return #str - #str:sub(index) --if no match is found and we have some index, do math.
        end
        --print("new start", new_start)
        index = new_start + 1 --ok, there was some kind of match. set our index to whatever that was, and add 1 so that we don't get stuck in a loop of rematching the start of our substring.
    end
end
If you'd like to see my entire "test suite" for this...

Why does this regex not match the second binary gap?

Trying a solution for the problem listed here in python, I thought I'd try a nice little regex to capture the maximum "binary gap" (chains of zeroes in the binary representation of a number).
The function I wrote for the problem is below:
import re

def solution(N):
    max_gap = 0
    binary_N = format(N, 'b')
    # "gaps" instead of "list", so the built-in list type is not shadowed
    gaps = re.findall(r'1(0+)1', binary_N)
    for element in gaps:
        if len(element) > max_gap:
            max_gap = len(element)
    return max_gap
And it works pretty well. However... for some reason, it does not match the second set of zeroes in 10000010000000001 (binary representation of 66561). The 9 zeroes don't appear in the list of matches so it must be a problem with the regex - but I can't see where it is as it matches every other example given!
The same bit can't be included in two matches. Your regex matches a 1 followed by one or more 0s and ends with another 1. Once the first match has been found you are left with 0000000001 which doesn't start with a 1 so isn't matched by your regex.
As mentioned by @JoachimIsaksson, if you want to match both sets of 0s, you can use a lookahead so that the final 1 is checked but isn't included in the match: r'1(0+)(?=1)'.
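A minimal sketch of the asker's function with the lookahead applied. Using max() with default=0 in place of the explicit loop is a simplification of mine, not part of the original code.

```python
import re

def solution(N):
    binary_N = format(N, 'b')
    # The closing 1 is only looked at, not consumed, so it can open the next gap.
    gaps = re.findall(r'1(0+)(?=1)', binary_N)
    return max((len(gap) for gap in gaps), default=0)

print(format(66561, 'b'))  # 10000010000000001
print(solution(66561))     # 9
```

Trailing zeros, as in 32 (100000), are still ignored because no 1 follows them.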

Creating fuzzy matching exceptions with Python's new regex module

I'm testing the new python regex module, which allows for fuzzy string matching, and have been impressed with its capabilities so far. However, I've been having trouble making certain exceptions with fuzzy matching. The following is a case in point. I want ST LOUIS, and all variations of ST LOUIS within an edit distance of 1 to match ref. However, I want to make one exception to this rule: the edit cannot consist of an insertion to the left of the leftmost character containing the letters N, S, E, or W. With the following example, I want inputs 1 - 3 to match ref, and input 4 to fail. However, using the following ref causes it to match to all four inputs. Does anyone who is familiar with the new regex module know of a possible workaround?
input1 = 'ST LOUIS'
input2 = 'AST LOUIS'
input3 = 'ST LOUS'
input4 = 'NST LOUIS'
ref = '([^NSEW]|(?<=^))(ST LOUIS){e<=1}'
match = regex.fullmatch(ref,input1)
match
<_regex.Match object at 0x1006c6030>
match = regex.fullmatch(ref,input2)
match
<_regex.Match object at 0x1006c6120>
match = regex.fullmatch(ref,input3)
match
<_regex.Match object at 0x1006c6030>
match = regex.fullmatch(ref,input4)
match
<_regex.Match object at 0x1006c6120>
Try a negative lookahead instead:
(?![NEW]|SS)(ST LOUIS){e<=1}
(ST LOUIS){e<=1} matches a string meeting the fuzzy conditions placed on it.
You want to prevent it from starting with [NSEW]. A negative lookahead does that for you (?![NSEW]).
But your desired string starts with an S already, you only want to exclude the strings starting with an S added to the beginning of your string.
Such a string would start with SS, and that's why it's added to the negative lookahead.
Note that if you allow errors > 1, this probably wouldn't work as desired.

How to select only certain Substrings

From a string, say dna = 'ATAGGGATAGGGAGAGAGCGATCGAGCTAG',
I got substrings, say dna.format = 'ATAGGGATAG','GGGAGAGAG'.
I only want to print the substrings whose length is divisible by 3.
How do I do that? I'm using modulo but it's not working!
import re
if mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
print re.findall("ATA"(.*?)"AGA" , mydna)
if len(mydna)%3 == 0
print mydna
corrected code
import re
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
re.findall("ATA"(.*?)"AGA" , mydna.format)
if len(mydna.format)%3 == 0:
print mydna.format
This still doesn't give me the substrings with length divisible by three. Any idea what's wrong?
I'm expecting only substrings whose length is divisible by three to be printed.
To include overlapping substrings, I have the following lengthy version. The idea is to find all start and end markers and calculate the distance between them.
import re
mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start() and (end.start()-start.start())%3 == 0]
['ATAGGGATAGGGAGA', 'ATAGGGAGA']
Show all substrings, including overlapping ones:
[mydna[start.start():end.start()+3] for start in re.finditer('(?=ATA)',mydna) for end in re.finditer('(?=AGA)',mydna) if end.start()>start.start()]
['ATAGGGATAGGGAGA', 'ATAGGGATAGGGAGAGA', 'ATAGGGATAGGGAGAGAGCAGA', 'ATAGGGAGA', 'ATAGGGAGAGA', 'ATAGGGAGAGAGCAGA']
You can also use the regular expression for that:
re.findall('ATA((...)*?)AGA', mydna)
The inner parentheses match 3 letters at a time.
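Because that pattern has two capturing groups, re.findall returns tuples rather than strings. A hedged variant with a non-capturing group, (?:...), returns the text between the markers directly; the variable names inner and full are only illustrative.

```python
import re

mydna = 'ATAGGGATAGGGAGAGAGCAGATCGAGCTAG'

# (?:...) consumes three letters per repetition without creating a group,
# so findall returns the single remaining group as plain strings.
inner = re.findall(r'ATA((?:...)*?)AGA', mydna)
print(inner)  # ['GGGATAGGG']

# The whole match, including the ATA and AGA markers.
full = [m.group(0) for m in re.finditer(r'ATA(?:...)*?AGA', mydna)]
print(full)   # ['ATAGGGATAGGGAGA']
```

Both results have a length divisible by 3, and matches found this way do not overlap.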
Using modulo is the correct procedure. If it's not working, you're doing it wrong. Please provide an example of your code in order to debug it.
re.findall() returns a list of matching strings. You need to iterate over them and apply the modulo test to the length of each one to achieve what you want.
