DNA sequencing using python


Using loops, how can I write a function in Python that finds the longest valid chain in a DNA sequence? The function should return the longest substring that consists only of the characters 'A', 'C', 'G', and 'T' when these are mixed up with other characters. For example, for the sequence 'ACCGXXCXXGTTACTGGGCXTTGT' it returns 'GTTACTGGGC'.

If the data is provided as a string you could simply split it by the character 'X' and thereby get a list.
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
Then looping over the list while checking for the length of the element would give you the right result:
# Initialize placeholders for comparison
temp_max_string = ''
temp_max_length = 0
# Loop over each string in the list
for i in array:
    # Check if the current substring is longer than the longest found so far
    if len(i) > temp_max_length:
        # Replace the placeholders if it is longer
        temp_max_length = len(i)
        temp_max_string = i
print(temp_max_string)  # or 'print temp_max_string' if you are using python2.
You could also use the Python built-ins to get your result more concisely:
Sorting by descending length (list.sort()):
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
array.sort(key=len, reverse=True)
print(array[0]) #print the longest since we sorted for descending lengths
print(len(array[0])) # Would give you the length of the longest substring
Only get the longest substring (max()):
startstring = 'ACCGXXCXXGTTACTGGGCXTTGT'
array = startstring.split('X')
longest = max(array, key=len)
print(longest) # gives the longest substring
print(len(longest)) # gives you the length of the longest substring
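Splitting on 'X' only works when 'X' is the only unwanted character. As a minimal sketch using a plain loop, as the question asks, the function below treats any character outside 'ACGT' as a separator. The name longest_dna_run and its valid parameter are illustrative and not part of the original answer.

```python
def longest_dna_run(sequence, valid="ACGT"):
    """Return the longest run of characters from `valid` (hypothetical helper)."""
    best = ""      # longest run seen so far
    current = ""   # run currently being built
    for ch in sequence:
        if ch in valid:
            current += ch
            # Strictly longer, so the first of several tied runs is kept
            if len(current) > len(best):
                best = current
        else:
            # Any other character ends the current run
            current = ""
    return best


print(longest_dna_run('ACCGXXCXXGTTACTGGGCXTTGT'))  # GTTACTGGGC
```

Because the comparison is strict, ties are resolved in favour of the run that appears first in the sequence.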

Related

Keep one of two consecutive chars in a string

I want to repeat a word n times in my function, but without duplicating the characters where consecutive copies meet.
For example, repete("amanha", 2) should return "amanhamanha".
My function:
def repete(palavra,n):
    a = []
    b=""
    for n in range (0,n):
        a.append(palavra)
    b = b.join(a)
    return b

The first step is to determine the longest overlap between the start and end of the word. The next() function can be used to get the number of characters to skip by getting the first match starting from the longest substring down to the shortest and defaulting to zero if there is no overlap. Then the repetition can be performed on the remaining part of the word (i.e. skipping the length of the common part)
def repeat(w,n):
    skip = next((i for i in range(len(w)-1,0,-1) if w[:i]==w[-i:]),0)
    return w + (n-1)*w[skip:]
print(repeat("amanha",2)) # amanhamanha
print(repeat("abc",2)) # abcabc
print(repeat("abcdab",2)) # abcdabcdab
You could also use the max() function to get the length to skip (not as efficient as next() but shorter to write):
def repeat(w,n):
    skip = max(range(len(w)),key=lambda i:i*(w[:i]==w[-i:]))
    return w + (n-1)*w[skip:]

Find multiple longest common leading substrings with length >= 4

In Python I am trying to extract all the longest common leading substrings that contain at least 4 characters from a list. For example, in the list called "data" below, the 2 longest common leading substrings that fit my criteria are "johnjack" and "detc". I know how to find the single longest common leading substring with the code below, which returns nothing (as expected) because no prefix is shared by every item. But I am struggling to build a script that can detect multiple common leading substrings within a list, where each of them must have a length of 4 or more.
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
def ls(data):
    if len(data)==0:
        prefix = ''
    else:
        prefix = data[0]
    for i in data:
        while not i.startswith(prefix) and len(prefix) > 0:
            prefix = prefix[:-1]
    print(prefix)
ls(data)

Here's one, but I think it's probably not the fastest or most efficient. Let's start with just the data and a container for our answer:
data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh', 'chunganh']
substrings = []
Note that I added a duplicate of chunganh; that's a common edge case we should handle.
See "How do I find the duplicates in a list and create another list with them?"
So, to capture the duplicates in the data:
seen = {}
dupes = []
for x in data:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1
for dupe in dupes:
    substrings.append(dupe)
Now let's record the unique values in the data as-is
# Capture the unique values in the data
last = set(data)
From here, we can loop, trimming one more character off the end of each value on every pass and collecting the results in a set. If the size of the set changes between passes, at least two values have collapsed into the same prefix, so we've found a common leading substring.
# Handle strings up to 10000 characters long
for k in [0-b for b in range(1, 10000)]:
    # Use negative indexing to start from the longest
    last, middle = set([i[:k] for i in data]), last
    # The set shrank: some values now share a prefix
    if len(last) != len(middle):
        for prefix in last:
            count = 0
            for word in middle:
                # startswith (rather than `in`) so only leading substrings match
                if word.startswith(prefix):
                    count += 1
            if count > 1:
                substrings.append(prefix)
    # Early stopping
    if len(last) == 1:
        break
Finally, you mentioned needing only substrings of length 4 or more.
list(filter(lambda x: len(x) >= 4, substrings))
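As an alternative sketch (a different technique from the answer above, not a drop-in replacement), the strings can be grouped by their first 4 characters and os.path.commonprefix applied to every group with more than one member. The function name common_leading_substrings and the min_len parameter are made up for illustration. It assumes that all strings sharing their first min_len characters belong to the same group, so it reports the prefix common to the whole group.

```python
import os.path
from collections import defaultdict


def common_leading_substrings(data, min_len=4):
    """Hypothetical helper: longest shared prefix of each group of strings
    that agree on their first `min_len` characters."""
    groups = defaultdict(list)
    for word in data:
        if len(word) >= min_len:
            groups[word[:min_len]].append(word)

    result = []
    for words in groups.values():
        # A prefix is only "common" if at least two strings share it
        if len(words) > 1:
            # commonprefix compares character by character, so it works on any strings
            result.append(os.path.commonprefix(words))
    return result


data = ['johnjack1', 'johnjack2', 'detc22', 'detc32', 'chunganh']
print(common_leading_substrings(data))  # ['johnjack', 'detc']
```

Exact duplicates such as a second 'chunganh' form a group of two, so the full string is reported as their common prefix.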

Time limit exceeded error. Word Ladder leetcode

I am trying to solve this LeetCode problem (https://leetcode.com/problems/word-ladder/description/):
Given two words (beginWord and endWord), and a dictionary's word list, find the length of shortest transformation sequence from beginWord to endWord, such that:
Only one letter can be changed at a time.
Each transformed word must exist in the word list. Note that beginWord is not a transformed word.
Note:
Return 0 if there is no such transformation sequence.
All words have the same length.
All words contain only lowercase alphabetic characters.
You may assume no duplicates in the word list.
You may assume beginWord and endWord are non-empty and are not the same.
Input:
beginWord = "hit",
endWord = "cog",
wordList = ["hot","dot","dog","lot","log","cog"]
Output:
5
Explanation:
As one shortest transformation is "hit" -> "hot" -> "dot" -> "dog" -> "cog", return its length 5.
import queue
class Solution:
    def isadjacent(self,a, b):
        count = 0
        n = len(a)
        for i in range(n):
            if a[i] != b[i]:
                count += 1
                if count > 1:
                    return False
        if count == 1:
            return True
    def ladderLength(self,beginWord, endWord, wordList):
        word_queue = queue.Queue(maxsize=0)
        word_queue.put((beginWord,1))
        while word_queue.qsize() > 0:
            queue_last = word_queue.get()
            index = 0
            while index != len(wordList):
                if self.isadjacent(queue_last[0],wordList[index]):
                    new_len = queue_last[1]+1
                    if wordList[index] == endWord:
                        return new_len
                    word_queue.put((wordList[index],new_len))
                    wordList.pop(index)
                    index-=1
                index+=1
        return 0
Can someone suggest how to optimise it and avoid the time limit exceeded error?

The basic idea is to find the adjacent words faster. Instead of considering every word in the list (even one that has already been filtered by word length), construct each possible neighbor string and check whether it is in the dictionary. To make those lookups fast, make sure the word list is stored in something like a set that supports fast membership tests.
To go even faster, you could store two sorted word lists, one sorted by the reverse of each word. Then look for possibilities involving changing a letter in the first half in the reversed list and for the latter half in the normal list. All the existing neighbors can then be found without making any non-word strings. This can even be extended to n lists, each sorted by omitting one letter from all the words.
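The first suggestion above (generate each possible neighbor and test it against a set) could look roughly like the following. This is only a sketch: the function name ladder_length is illustrative, it uses collections.deque rather than queue.Queue, and it relies on the problem's guarantee that all words are lowercase and of equal length.

```python
from collections import deque
from string import ascii_lowercase


def ladder_length(begin_word, end_word, word_list):
    """Breadth-first search that builds candidate neighbors instead of scanning the list."""
    words = set(word_list)           # fast membership tests
    if end_word not in words:
        return 0
    words.discard(begin_word)        # never step back onto the start word

    queue = deque([(begin_word, 1)])
    while queue:
        word, length = queue.popleft()
        # Try every single-letter change of the current word
        for i in range(len(word)):
            for ch in ascii_lowercase:
                candidate = word[:i] + ch + word[i + 1:]
                if candidate not in words:
                    continue
                if candidate == end_word:
                    return length + 1
                # Remove on first visit so each word is enqueued at most once
                words.remove(candidate)
                queue.append((candidate, length + 1))
    return 0


print(ladder_length("hit", "cog", ["hot", "dot", "dog", "lot", "log", "cog"]))  # 5
```

Each dequeued word costs 26 times its length in set lookups, independent of the size of the word list, which is where the speed-up over comparing against every list entry comes from.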

CodeEval Hard Challenge 6 - LONGEST COMMON SUBSEQUENCE - python

I am trying to solve the Longest Common Subsequence challenge in Python. My solution works on the sample input, but after submitting it the grader reports it as only 50% (partially) complete. I'm not sure what I'm missing here; any help is appreciated.
CHALLENGE DESCRIPTION:
You are given two sequences. Write a program to determine the longest common subsequence between the two strings (each string can have a maximum length of 50 characters). NOTE: This subsequence need not be contiguous. The input file may contain empty lines, these need to be ignored.
INPUT SAMPLE:
The first argument will be a path to a filename that contains two strings per line, semicolon delimited. You can assume that there is only one unique subsequence per test case. E.g.:
XMJYAUZ;MZJAWXU
OUTPUT SAMPLE:
The longest common subsequence. Ensure that there are no trailing empty spaces on each line you print. E.g.:
MJAU
My code is
# LONGEST COMMON SUBSEQUENCE
import argparse
def get_longest_common_subsequence(strings):
    # here we will store the subsequence list
    subsequences_list = list()
    # split the strings in 2 different variables and limit them to 50 characters
    first = strings[0]
    second = strings[1]
    startpos = 0
    # we need to start from each index in the first string so we can find the longest subsequence
    # therefore we do a loop with the length of the first string, incrementing the start every time
    for start in range(len(first)):
        # here we will store the current subsequence
        subsequence = ''
        # store the index of the found character
        idx = -1
        # loop through all the characters in the first string, starting at the 'start' position
        for i in first[start:50]:
            # search for the current character in the second string
            pos = second[0:50].find(i)
            # if the character was found and is in the correct sequence add it to the subsequence and update the index
            if pos > idx:
                subsequence += i
                idx = pos
        # if we have a subsequence, add it to the subsequences list
        if len(subsequence) > 0:
            subsequences_list.append(subsequence)
        # increment the start
        startpos += 1
    # sort the list of subsequences with the longest at the top
    subsequences_list.sort(key=len, reverse=True)
    # return the longest subsequence
    return subsequences_list[0]
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('filename')
    args = parser.parse_args()
    # read file as the first argument
    with open(args.filename) as f:
        # loop through each line
        for line in f:
            # if the line is empty it means it's not valid. otherwise print the common subsequence
            if line.strip() not in ['\n', '\r\n', '']:
                strings = line.replace('\n', '').split(';')
                if len(strings[0]) > 50 or len(strings[1]) > 50:
                    break
                print(get_longest_common_subsequence(strings))
    return 0
if __name__ == '__main__':
    main()

The following solution prints the characters that both strings of each semicolon-separated pair have in common, in no particular order. Note that this is the set of shared characters rather than an ordered longest common subsequence. If a string from the pair is longer than 50 characters, then the pair is skipped (it's not difficult to trim it to length 50 if that is desired).
Note: if sorting/ordering is desired it can be implemented (alphabetic order, the order of the first string, or the order of the second string).
with open('filename.txt') as f:
    for line in f:
        line = line.strip()
        if line and ';' in line and len(line) <= 101:
            a, b = line.split(';')
            a = set(a.strip())
            b = set(b.strip())
            common = a & b  # intersection
            if common:
                print(''.join(common))
Also note: if the strings contain whitespace in common (e.g. ABC DE; ZM YCA), it will be part of the output because internal whitespace is not stripped. If that is not desired, replace the line a = set(a.strip()) with a = {char for char in a if char.strip()}, and likewise for b.

def lcs_recursive(xlist,ylist):
    if not xlist or not ylist:
        return []
    x,xs,y,ys, = xlist[0],xlist[1:],ylist[0],ylist[1:]
    if x == y:
        return [x] + lcs_recursive(xs,ys)
    else:
        return max(lcs_recursive(xlist,ys),lcs_recursive(xs,ylist),key=len)
s1 = 'XMJYAUZ'
s2 = 'MZJAWXU'
# the function returns a list of characters, so join them before printing
print(''.join(lcs_recursive(s1,s2)))
This gives the correct answer, MJAU. X and Z occur in both strings but cannot be added to MJAU without breaking the relative order that a subsequence must preserve in both strings.
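The plain recursion above recomputes the same sub-problems many times, which can become slow as the strings approach the 50-character limit. A common alternative is the standard dynamic-programming table. The sketch below is one way to write it; the function name lcs is illustrative and the approach is a substitute for, not a description of, the recursive answer.

```python
def lcs(a, b):
    """Longest common subsequence of two strings via a bottom-up table."""
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to rebuild one longest subsequence
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return ''.join(result)


print(lcs('XMJYAUZ', 'MZJAWXU'))  # MJAU
```

The table has (len(a) + 1) by (len(b) + 1) entries, so the running time grows with the product of the two lengths rather than exponentially.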

Finding longest word error?

I'm trying to find the longest word in Python but I'm getting the wrong result. The code below was run in interactive mode. I should've received 'merchandise' (11 characters) as the longest word/string but I got 'welcome' (7 characters) instead.
str = 'welcome to the merchandise store' #string of words
longest = [] #empty list
longest = str.split() #put strings into list
max(longest) #find the longest string
'welcome' #Should've been 'merchandise?'

It's comparing the strings alphabetically, not by length. What you need is:
max(longest, key=len)
Let me clarify a little further. In Python the default comparison for strings is alphabetical (lexicographic). This means "aa" compares as less than "abc" (cmp in Python 2, < in Python 2 and 3). If you call max on the list without a key, it compares the strings using that default ordering. If you pass a key function, it compares the key of each value instead of the value itself. Another option (Python 2 only) is a cmp argument, which sorted and list.sort accept (max itself only takes key) and which takes a function like cmp. I don't suggest this method because it ends up being something like cmp=lambda x,y: len(x)-len(y), which is much less readable than just key=len and isn't supported in Python 3.
If you have any more questions about using key, I'd suggest reading the documentation, specifically note (7), which covers cmp and key for list.sort, which uses them in the same manner.
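To make the difference concrete, here is a small sketch that runs both calls on the sentence from the question (the variable names are illustrative):

```python
words = 'welcome to the merchandise store'.split()

# Default comparison is alphabetical, so the word starting with 'w' wins
alphabetical_max = max(words)

# With key=len, the words are compared by their lengths instead
longest = max(words, key=len)

print(alphabetical_max)  # welcome
print(longest)           # merchandise
```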

You can also do this:
str = 'welcome to the merchandise store'
sorted(str.split(), key=len)[-1]
Split it, sort them by length, then take the longest (last one).

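Change your str.split() to str.split(" ")
Then,
# collect every word that ties for the longest length
longest_words = []
for x in str.split(" "):
    # the first check avoids indexing an empty list on the first word
    if not longest_words or len(x) > len(longest_words[0]):
        longest_words = [x]
    elif len(x) == len(longest_words[0]):
        longest_words.append(x)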

This is one way to do it without the lambda or key=len just using a for loop and list comprehension:
str = 'welcome to the merchandise store'
longest = []
longest = str.split()
lst = []
for i in longest:
    x = len(i)
    lst.append(x)
y = [a for a in longest if max(lst)== len(a)]
print y[0]
Output:
merchandise
This is in Python 2, in Python 3 print (y[0])
