Rosalind: overlap graphs - python

I have come across a problem on Rosalind that I think I have solved correctly, yet I get told my answer is incorrect. The problem can be found here: http://rosalind.info/problems/grph/
It's basic graph theory, more specifically it deals with returning an adjacency list of overlapping DNA strings.
"For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O_k in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O_3. You may return edges in any order."
So, if you've got:
Rosalind_0498
AAATAAA
Rosalind_2391
AAATTTT
Rosalind_2323
TTTTCCC
Rosalind_0442
AAATCCC
Rosalind_5013
GGGTGGG
you must return:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
My python code, after having parsed the FASTA file containing the DNA strings, is as follows:
listTitle = []
listContent = []
#SPLIT is the parsed list of DNA strings
#here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings ('>Rosalind_' has been removed, because it is what I split the file with)
while i < len(SPLIT):
    curr = SPLIT[i]
    title = curr[0:4:1]
    listTitle.append(title)
    content = curr[4::1]
    listContent.append(content)
    i+=1
start = []
end = []
#now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string's index will be the same in both lists, as well as in the title list
for item in listContent:
    start.append(item[0:3:1])
    end.append(item[len(item)-3:len(item):1])
list = []
#then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list
p=0
while p<len(end):
    iterator=0
    while iterator<len(start):
        if p!=iterator:
            if end[p] == start[iterator]:
                one=listTitle[p]
                two=listTitle[iterator]
                list.append(one)
                list.append(two)
        iterator+=1
    p+=1
#finally I print the list in the format that they require for the answer
listInc=0
while listInc < len(list):
    print "Rosalind_"+list[listInc]+' '+"Rosalind_"+list[listInc+1]
    listInc+=2
Where am I going wrong? Sorry that the code is a bit tedious; I have had very little training in Python.

I'm not sure what is wrong with your code, but here is an approach that might be considered more "pythonic".
I'll suppose that you've read your data into a dictionary mapping names to DNA strings:
{'Rosalind_0442': 'AAATCCC',
'Rosalind_0498': 'AAATAAA',
'Rosalind_2323': 'TTTTCCC',
'Rosalind_2391': 'AAATTTT',
'Rosalind_5013': 'GGGTGGG'}
We define a simple function that checks whether a string s1 has a k-suffix matching the k-prefix of a string s2:
def is_k_overlap(s1, s2, k):
    return s1[-k:] == s2[:k]
Then we look at all combinations of DNA sequences to find those that match. This is made easy by itertools.combinations:
import itertools
def k_edges(data, k):
    edges = []
    for u,v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]
        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u,v))
        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v,u))
    return edges
For example, on the data above we get:
>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
('Rosalind_0498', 'Rosalind_2391'),
('Rosalind_0498', 'Rosalind_0442')]
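For completeness, here is a minimal end-to-end sketch in Python 3 that reads FASTA text and prints the adjacency list. The names parse_fasta and overlap_edges are hypothetical helpers, not part of the question or of any library. The parser strips whitespace from every line, on the unconfirmed assumption that stray newline characters left in the sequences are one plausible reason a hand-rolled split produces suffixes that never match.

```python
import itertools


def parse_fasta(text):
    """Return a dict mapping each FASTA label to its sequence (hypothetical helper)."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()  # drop trailing newlines, carriage returns and spaces
        if not line:
            continue
        if line.startswith(">"):
            label = line[1:]
            records[label] = ""
        elif label is not None:
            # A sequence may be wrapped over several lines, so concatenate.
            records[label] += line
    return records


def overlap_edges(records, k):
    """Return every ordered pair (u, v), u != v, where u's k-suffix equals v's k-prefix."""
    return [
        (u, v)
        for u, v in itertools.permutations(records, 2)
        if records[u][-k:] == records[v][:k]
    ]


sample = """>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG
"""

edges = overlap_edges(parse_fasta(sample), 3)
for u, v in edges:
    print(u, v)
```

itertools.permutations yields each ordered pair of distinct labels once, so no second check in the reverse direction is needed.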

Related

Find matching elements of two unordered Python lists of different sizes

I'm getting this error: index out of range, in if largerList[j] == smallerList[i]. I'm working on an assignment about binary search trees; I put the trees into lists and I'm just trying to compare the two lists:
def matchList(largerList, smallerList) :
    matches = []
    for i in smallerList:
        for j in largerList:
            if largerList[j] == smallerList[i] :
                matches[i] = smallerList[i]
    return matches
I'm assuming the nested for loops iterate over every element of each list, and since smallerList is the smaller of the two, it shouldn't make largerList go out of bounds. The inner for loop should iterate over the entire larger list, comparing each value to each element of the smaller list. Why doesn't it work?
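You can't set a list value with matches[i] if that index does not exist in matches.
Try appending instead:
Also, for i in smallerList yields the elements themselves, not their positions, so indexing with i and j is what raises the index error. Compare the elements directly (if j == i) and replace matches[i] = smallerList[i] with matches.append(i). Do not assign the result of append back to matches: it modifies the list in place and returns None.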
Trying to find matching elements in lists like this is rather inefficient. One thing you could improve to make it arguably more pythonic is to use a list comprehension:
matches = [i for i in largerList if i in smallerList]
The more mathematically sensible approach, though, is to recognise that we have two sets of elements and want their intersection, so we can write:
matches = set(largerList).intersection(smallerList)
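As a rough sketch of how the appending advice and the set idea can be combined: the function below (match_list is a hypothetical name) loops over the elements of the smaller list directly and uses a set only for fast membership tests, so the result keeps the order of the smaller list. It assumes the elements are hashable, which holds for numbers and strings.

```python
def match_list(larger_list, smaller_list):
    """Return the elements of smaller_list that also occur in larger_list."""
    lookup = set(larger_list)  # assumption: elements are hashable
    matches = []
    for item in smaller_list:  # item is the element itself, not an index
        if item in lookup:
            matches.append(item)  # append mutates the list in place
    return matches


print(match_list([5, 3, 8, 1, 9], [9, 2, 3]))  # [9, 3]
```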

Finding common string in list and displaying them

I am trying to create a function compare(lst1,lst2) which compares each element of two lists, returns every common element in a new list, and shows the percentage of elements they have in common. All the elements in the list are going to be strings. For example the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
1. If character i in A is identical to character i in B, yield that character
2. Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when position i exists in one string but lies past the end of the other. You might represent such positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
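The steps above can be sketched roughly as follows, using only a for loop over indices and no set or zip. The function name compare comes from the question; the placeholder parameter and the choice to treat positions past the end of the shorter string as mismatches are assumptions made for illustration.

```python
def compare(a, b, placeholder="x"):
    """Return (common_strand, similarity) for two strings, position by position."""
    longer = max(len(a), len(b))
    strand = ""
    matches = 0
    for i in range(longer):
        # A position counts as a match only if it exists in both strings.
        if i < len(a) and i < len(b) and a[i] == b[i]:
            strand += a[i]
            matches += 1
        else:
            strand += placeholder
    # Similarity is measured relative to the longer string; empty input gives 0.0.
    similarity = matches / longer if longer else 0.0
    return strand, similarity


strand, similarity = compare("AAAAABBBBBCCCCCDDDD", "ABCABCABCABCABCABCA")
print(strand)  # AxxAxxxBxxxCxxCxxxx
print("{:.0%}".format(similarity))  # 26%
```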

My function is inefficient and takes up a lot of time- Python

The problem I am trying to solve:
Given a string, split the string into two substrings at every possible point. The rightmost substring is a suffix. The beginning of the string is the prefix. Determine the length of the common prefix between each suffix and the original string. Sum and return the lengths of the common prefixes. Return an array where each element i is the sum for string i.
My Solution:
def commonPrefix(l):
    size=len(l) #Finding number of strings
    sumlist=[] #Initializing list that'll store sums
    for i in range(size):
        total=0 #Initializing total common prefixes for every string
        string=l[i] #Storing each string in string
        strsize=len(string) #Finding length of each string
        for j in range(strsize):
            suff=string[j:] #Calculating every suffix for a string
            suffsize=len(suff) #Finding length of each suffix
            common=0 #Size of common prefix for each length
            for k in range(suffsize):
                if(string[k]==suff[k]):
                    common=common+1
                else:
                    break #If characters at equal positions are not equal, break
            total=total+common #Update total for every suffix
        sumlist.append(total) #Add each total in list
    return sumlist #Return list
My solution takes a lot of time, and I need help optimizing it.
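One standard way to speed this up is the Z-algorithm, which computes the length of the common prefix between a string and each of its suffixes in linear time overall, instead of the quadratic time of the nested loops and slicing. The sketch below is an illustration under that assumption, not code from the thread; z_array and common_prefix_sums are hypothetical names.

```python
def z_array(s):
    """z[i] is the length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # the whole string matches itself
    left, right = 0, 0  # s[left:right] is the rightmost segment known to match a prefix
    for i in range(1, n):
        if i < right:
            # Reuse a value computed earlier, capped at the segment boundary.
            z[i] = min(right - i, z[i - left])
        # Extend the match character by character past what is already known.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def common_prefix_sums(strings):
    """Return, for each string, the sum of common-prefix lengths over all its suffixes."""
    return [sum(z_array(s)) for s in strings]


print(common_prefix_sums(["abcabcd", "ababaa", "aa"]))  # [10, 11, 3]
```

For "abcabcd" the Z-array is [7, 0, 0, 3, 0, 0, 0], which sums to 10, the same total the triple loop produces.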

Removing list elements that repeats itself- python

I have an array that looks like
A=[0,0,1,1,2,5,6,3,7,7,0,0,1,1,2,5,6,3,7,7]
Since the "0,0,1,1,2,5,6,3,7,7" part repeats itself, I don't need the second part, so for a given array it should give me
A=[0,0,1,1,2,5,6,3,7,7]
I can't use the set() function and I don't know what else I can use in this case. Is there a function which can do this operation?
I defined the following two methods:
def check_repeat(l):
    for i in range(1,len(l)//2+1):
        pattern=l[:i]
        sub_lists=create_sublists(l, i, True)
        if all([x==pattern for x in sub_lists]):
            print("Found pattern {} with length {}".format(pattern, i))
            return pattern
def create_sublists(l, n, only_full=False):
    sub_lists=[]
    for j in range(n, len(l),n):
        if only_full and len(l[j:j+n])<n:
            continue
        sub_lists.append(l[j:j+n])
    return sub_lists
It works as follows: with check_repeat(your_list) the input list is checked for patterns of arbitrary length. A pattern of length i is found, if the first i entries of your_list equal all following sub_lists of length i. The sublists are created by the method create_sublists, which returns a list of lists. If only_full is set to True, only sublists with length n are allowed. This handles the case where your list looks for example like this A=[1,2,3,1,2,3,1,2] and you want to accept [1,2,3] as a valid pattern and ignore the remaining sublist [1,2].
Each sublist is then compared to the current pattern, i.e. the first i entries of your input list. If all sublists equal the current pattern, a true pattern is found.
In contrast to the solution in "Python finding repeating sequence in list of integers?", I do not only check whether the next i entries match the current pattern, but all of the following sublists of length i.
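The same chunk-comparison idea can also be written as a single self-contained function. This is only a sketch: shortest_repeating_block is a hypothetical name, and unlike the code above it assumes the list must consist of whole repetitions (the block length has to divide the list length) and returns a copy of the input when nothing repeats.

```python
def shortest_repeating_block(items):
    """Return the shortest leading block whose repetition rebuilds the whole list."""
    n = len(items)
    for size in range(1, n // 2 + 1):
        if n % size != 0:
            continue  # assumption: only whole repetitions count
        block = items[:size]
        # Compare every following chunk of the same length to the first one.
        if all(items[start:start + size] == block for start in range(size, n, size)):
            return block
    return list(items)  # no repetition found


A = [0, 0, 1, 1, 2, 5, 6, 3, 7, 7, 0, 0, 1, 1, 2, 5, 6, 3, 7, 7]
print(shortest_repeating_block(A))  # [0, 0, 1, 1, 2, 5, 6, 3, 7, 7]
```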
Edit: Oooops, didn't see that you don't want to use a set. NVM, sorry.
What I would recommend doing is converting your list to a set and then back, like this:
A=[0,0,1,1,2,5,6,3,7,7,0,0,1,1,2,5,6,3,7,7]
S = set(A)
If you want to do it iteratively (because you might want to add an extra condition):
S = set()
for item in A:
    S.add(item)
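And then you can convert the set back to a list like this:
A = list(S)
print(A)
Note that a set drops every duplicate, so for the list above this prints only the unique values, for example [0, 1, 2, 3, 5, 6, 7] (set order is not guaranteed), rather than [0,0,1,1,2,5,6,3,7,7].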

Counting the characters in strings within a list, Python

I am having difficulty with one of my homework questions.
Basically, I am to create a function using a for loop to count the characters used in strings which are inside of a list. I am able to get the length of the strings with len(), but I can only get one of the strings and I can't seem to get it in list form.
My Code:
def method1(input1):
    total = 0
    a = []
    for word in input1:
        total = len(word)
        a = [total]
    return a
This returns [3] with the input ['seven', 'one']
Any help would be appreciated. I am very new to Python.
Well, here is one way of doing it. You have a list containing different strings. You can simply iterate through the list, adding up the length of each string.
def method1(input1):
    l = list(input1)
    total = sum(len(i) for i in l)
    return int(total)
print(method1(["hello", "bye"]))
What you are doing here is receiving an input and converting it to a list. Then, for each value inside the list, you calculate its length. sum adds up those lengths, and finally you return total.
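If the goal is instead one count per string "in list form", as the question seems to ask, a for loop that appends each length would do it. This is a sketch under that reading of the question; string_lengths is a hypothetical name.

```python
def string_lengths(words):
    """Return a list holding the character count of each string in words."""
    lengths = []
    for word in words:
        # append adds to the list; writing a = [total] would replace it each time
        lengths.append(len(word))
    return lengths


print(string_lengths(["seven", "one"]))  # [5, 3]
```

The original attempt returned [3] because a = [total] rebuilt the list on every pass, leaving only the length of the last word.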
