My function is inefficient and takes a lot of time (Python)

The problem I am trying to solve:
Given a string, split it into two substrings at every possible point. The left-hand substring is a prefix and the right-hand substring is a suffix. Determine the length of the common prefix between each suffix and the original string, then sum those lengths. The input is a list of strings; return an array where element i is the sum for string i.
My Solution:
def commonPrefix(l):
    size=len(l) #Finding number of strings
    sumlist=[] #Initializing list that'll store sums
    for i in range(size):
        total=0 #Initializing total common prefixes for every string
        string=l[i] #Storing each string in string
        strsize=len(string) #Finding length of each string
        for j in range(strsize):
            suff=string[j:] #Calculating every suffix for a string
            suffsize=len(suff) #Finding length of each suffix
            common=0 #Size of common prefix for each length
            for k in range(suffsize):
                if(string[k]==suff[k]):
                    common=common+1
                else:
                    break #If characters at equal positions are not equal, break
            total=total+common #Update total for every suffix
        sumlist.append(total) #Add each total in list
    return sumlist #Return list
My Solution takes up a lot of time, I need help with optimizing it.
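One possible way to speed this up, offered as a sketch rather than a definitive answer, is the Z-algorithm. It computes, for every position j, the length of the common prefix between the string and its suffix starting at j, in time linear in the string length instead of quadratic. The names z_array and common_prefix_sums below are illustrative and do not come from the original code.

```python
def z_array(s):
    """Return z, where z[j] is the length of the common prefix of s and s[j:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n  # the whole string matches itself
    left, right = 0, 0  # [left, right) is the rightmost matched window found so far
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def common_prefix_sums(strings):
    """For each string, sum the common-prefix lengths over all of its suffixes."""
    return [sum(z_array(s)) for s in strings]


print(common_prefix_sums(["abcabcd", "ababaa", "aa"]))
```

For "ababaa" the per-suffix lengths are 6, 0, 3, 0, 1, 1, so its sum is 11, which is what the triple loop in the question also produces.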

Related

Finding common string in list and displaying them

I am trying to create a function compare(lst1, lst2) that compares two lists element by element, returns every common element in a new list, and shows how similar they are as a percentage. All the elements in the lists are strings. For example, the function should return:
lst1 = AAAAABBBBBCCCCCDDDD
lst2 = ABCABCABCABCABCABCA
common strand = AxxAxxxBxxxCxxCxxxx
similarity = 25%
The parts of the list which are not similar will simply be returned as x.
I am having trouble in completing this function without the python set and zip method. I am not allowed to use them for this task and I have to achieve this using while and for loops. Kindly guide me as to how I can achieve this.
This is what I came up with.
lst1 = 'AAAAABBBBBCCCCCDDDD'
lst2 = 'ABCABCABCABCABCABCA'
common_strand = ''
score = 0
for i in range(len(lst1)):
    if lst1[i] == lst2[i]:
        common_strand = common_strand + str(lst1[i])
        score += 1
    else:
        common_strand = common_strand + 'x'
print('Common Strand: ', common_strand)
print('Similarity Score: ', score/len(lst1))
Output:
Common Strand: AxxAxxxBxxxCxxCxxxx
Similarity Score: 0.2631578947368421
You have two strings A and B. Strings are ordered sequences of characters.
Suppose both A and B have equal length (the same number of characters). Choose some position i < len(A), len(B) (remember Python sequences are 0-indexed). Your problem statement requires:
(1) If character i in A is identical to character i in B, yield that character
(2) Otherwise, yield some placeholder to denote the mismatch
How do you find the ith character in some string A? Take a look at Python's string methods. Remember: strings are sequences of characters, so Python strings also implement several sequence-specific operations.
If len(A) != len(B), you need to decide what to do when position i exists in one string but lies past the end of the shorter one. You might represent these positions with the same placeholder as in (2).
If you know how to iterate the result of zip, you know how to use for loops. All you need is a way to iterate over the sequence of indices. Check out the language built-in functions.
Finally, for your measure of similarity: if you've compared n characters and found that N <= n are mismatched, you can define 1 - (N / n) as your measure of similarity. This works well for equally-long strings (for two strings with different lengths, you're always going to be calculating the proportion relative to the longer string).
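The steps above can be sketched as follows, using only range, len, and indexing (no zip or set). This is a minimal illustration under assumptions: the function name compare matches the question, but the placeholder parameter and the choice to treat positions past the end of the shorter string as mismatches are decisions made here, not requirements from the task.

```python
def compare(a, b, placeholder="x"):
    """Compare two strings position by position without zip or set."""
    longer = max(len(a), len(b))
    strand = ""
    mismatches = 0
    for i in range(longer):
        # A position only matches if it exists in both strings and the characters agree.
        if i < len(a) and i < len(b) and a[i] == b[i]:
            strand += a[i]
        else:
            strand += placeholder
            mismatches += 1
    # Similarity is 1 - (N / n), measured against the longer string.
    similarity = 1 - mismatches / longer if longer else 1.0
    return strand, similarity


strand, similarity = compare("AAAAABBBBBCCCCCDDDD", "ABCABCABCABCABCABCA")
print("Common Strand:", strand)
print("Similarity: {:.1%}".format(similarity))
```

With the strings from the question, 5 of the 19 positions match, so the similarity comes out at roughly 26%.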

Find the longest common prefix string amongst an array of strings. If there is no common prefix, return an empty string ""

find the longest common prefix string amongst an array of strings.
If there is no common prefix, return an empty string ""
I have tried to code this problem.
Here is the code:
class Solution:
    def longestCommonPrefix(self, strs: List[str]) -> str:
        a=list(list(zip(*strs))[0])
        b=list(list(zip(*strs))[1])
        c=list(list(zip(*strs))[2])
        a1=""
        i=0
        while(len(strs)):
            if(a[i]==b[i]==c[i]):
                a1+=a[i]
        return a1
I tried to solve it by extracting elements from the list and then comparing them with the other elements.
I don't know where it got stuck; no output is showing.
Please help!
Because len(strs) never changes, the while condition is always true (and i is never incremented), so your program is stuck in an infinite loop.
Another problem is that you only extract the first three characters of each string in the list, but the longest common prefix may be longer than three characters.
The simplest way possible would be to start building a dict.
Iterate over all the strings and capture the first character of all the strings
Add them to the dict with count 1 and key as first character
If duplicates are found increment the count on the dict key
Now find the largest number in values of dict and store in a variable
Now repeat the procedure for two characters: build a dict with two-character keys, each starting at 1, increment the count whenever a duplicate two-character prefix is found in the list of strings, and replace the previous variable's value with the highest value from the new dict
Repeat the procedure by incrementing the number of characters to be checked until you get 1 as the highest value in dict, and that value's key is the longest common prefix
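For comparison, here is a minimal sketch of a different and simpler technique than the dict-counting procedure above: scan the characters of the shortest string and stop at the first position where any string disagrees. The function name longest_common_prefix is illustrative, and the sketch assumes the input is a plain list of strings.

```python
def longest_common_prefix(strs):
    """Return the longest prefix shared by every string in strs."""
    if not strs:
        return ""
    # The common prefix can never be longer than the shortest string.
    shortest = min(strs, key=len)
    for i, ch in enumerate(shortest):
        for s in strs:
            if s[i] != ch:
                # First disagreement: everything before position i is common.
                return shortest[:i]
    return shortest


print(longest_common_prefix(["flower", "flow", "flight"]))
```

The loop always terminates because i advances over a finite string, which avoids the infinite loop in the original attempt, and it is not limited to the first three characters.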

Longest Subsequence problem if the lengths are different

Let the input sequences be X[0..m-1] and Y[0..n-1] of lengths m and n respectively. And let L(X[0..m-1], Y[0..n-1]) be the length of LCS of the two sequences X and Y. Following is the recursive definition of L(X[0..m-1], Y[0..n-1]).
If last characters of both sequences match (or X[m-1] == Y[n-1]) then
L(X[0..m-1], Y[0..n-1]) = 1 + L(X[0..m-2], Y[0..n-2])
If last characters of both sequences do not match (or X[m-1] != Y[n-1]) then
L(X[0..m-1], Y[0..n-1]) = MAX ( L(X[0..m-2], Y[0..n-1]), L(X[0..m-1], Y[0..n-2]) )
How do I solve the problem if the lengths are different, and how do I print the respective sequences?
It doesn't matter whether the input strings have the same length; this is taken care of by the base case of the recursion.
if (m == 0 || n == 0)
    return 0;
If we reach the end of either string, the recursion stops and unwinds from there.
Also, consider the example you mentioned in the comment:
ABCEFG and ABXDE
First we compare the last characters of both strings. In this case, they are not the same.
So we try two cases:
Remove last character from first string and compare it with second.
Remove last character from second string and compare it with first.
And return the max from both cases.
(As a side note, if the last character had matched, we would add 1 to our answer and remove the last character from both strings)
This process continues until either string reaches its end, at which point the base case of the recursion is satisfied and the recursion returns.
So it doesn't matter whether the original strings have the same length.
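Since the question also asks how to print the sequence, here is a rough sketch in Python. It swaps the plain recursion for a bottom-up table holding the same L values, then walks the table backwards to recover one longest common subsequence. The function name lcs and the tie-breaking rule during the backward walk are choices made for this illustration; when several subsequences share the maximum length, only one of them is returned.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y (lengths may differ)."""
    m, n = len(x), len(y)
    # table[i][j] holds the LCS length of x[:i] and y[:j]; row 0 and column 0
    # are the base case (one string is empty, so the length is 0).
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to collect the matched characters.
    chars = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            chars.append(x[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


print(lcs("ABCEFG", "ABXDE"))
```

For the example ABCEFG and ABXDE this yields ABE, even though the inputs have lengths 6 and 5.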

Counting the characters in strings within a list, Python

I am having difficulty with one of my homework questions.
Basically, I am to create a function using a for loop to count the characters used in strings which are inside of a list. I am able to get the length of the strings with len(), but I can only get one of the strings and I can't seem to get it in list form.
My Code:
def method1(input1):
    total = 0
    a = []
    for word in input1:
        total = len(word)
        a = [total]
    return a
This returns [3] with the input ['seven', 'one']
Any help would be appreciated. I am very new to Python.
Well, here is one way of doing it. You have a list containing different strings. You can simply iterate through the list, counting the length of each string.
def method1(input1):
    l = list(input1)
    total = sum(len(i) for i in l)
    return int(total)
print(method1(["hello", "bye"]))
What you are doing here is receiving an input and converting it to a list. Then, for each value inside the list, you calculate its length. The sum adds up those lengths, and finally you return the total.
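If the goal is instead a list with one length per string, which is how the question's "in list form" might be read, a for loop with append would do it. This is a small sketch under that assumption; the name lengths_per_word is made up for the example.

```python
def lengths_per_word(words):
    """Return a list holding the length of each string in words."""
    lengths = []
    for word in words:
        # append adds to the list; assigning a = [total] would replace it each time.
        lengths.append(len(word))
    return lengths


print(lengths_per_word(["seven", "one"]))
```

The original attempt returned [3] because each pass through the loop overwrote the list with a new one-element list, leaving only the last word's length.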

Rosalind: overlap graphs

I have come across a problem on Rosalind that I think I have solved correctly, yet I am told my answer is incorrect. The problem can be found here: http://rosalind.info/problems/grph/
It's basic graph theory, more specifically it deals with returning an adjacency list of overlapping DNA strings.
"For a collection of strings and a positive integer k, the overlap graph for the strings is a directed graph O_k in which each string is represented by a node, and string s is connected to string t with a directed edge when there is a length k suffix of s that matches a length k prefix of t, as long as s≠t; we demand s≠t to prevent directed loops in the overlap graph (although directed cycles may be present).
Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O_3. You may return edges in any order."
So, if you've got:
Rosalind_0498
AAATAAA
Rosalind_2391
AAATTTT
Rosalind_2323
TTTTCCC
Rosalind_0442
AAATCCC
Rosalind_5013
GGGTGGG
you must return:
Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323
My python code, after having parsed the FASTA file containing the DNA strings, is as follows:
listTitle = []
listContent = []
#SPLIT is the parsed list of DNA strings
#here i create two new lists, one (listTitle) containing the four numbers identifying a particular string, and the second (listContent) containing the actual strings ('>Rosalind_' has been removed, because it is what I split the file with)
while i < len(SPLIT):
    curr = SPLIT[i]
    title = curr[0:4:1]
    listTitle.append(title)
    content = curr[4::1]
    listContent.append(content)
    i+=1
start = []
end = []
#now I create two new lists, one containing the first three chars of the string and the second containing the last three chars, a particular string's index will be the same in both lists, as well as in the title list
for item in listContent:
    start.append(item[0:3:1])
    end.append(item[len(item)-3:len(item):1])
list = []
#then I iterate through both lists, checking if the suffix and prefix are equal, but not originating from the same string, and append their titles to a last list
p=0
while p<len(end):
    iterator=0
    while iterator<len(start):
        if p!=iterator:
            if end[p] == start[iterator]:
                one=listTitle[p]
                two=listTitle[iterator]
                list.append(one)
                list.append(two)
        iterator+=1
    p+=1
#finally I print the list in the format that they require for the answer
listInc=0
while listInc < len(list):
    print "Rosalind_"+list[listInc]+' '+"Rosalind_"+list[listInc+1]
    listInc+=2
Where am I going wrong? Sorry that the code is a bit tedious; I have had very little training in Python.
I'm not sure what is wrong with your code, but here is an approach that might be considered more "pythonic".
I'll suppose that you've read your data into a dictionary mapping names to DNA strings:
{'Rosalind_0442': 'AAATCCC',
'Rosalind_0498': 'AAATAAA',
'Rosalind_2323': 'TTTTCCC',
'Rosalind_2391': 'AAATTTT',
'Rosalind_5013': 'GGGTGGG'}
We define a simple function that checks whether a string s1 has a k-suffix matching the k-prefix of a string s2:
def is_k_overlap(s1, s2, k):
    return s1[-k:] == s2[:k]
Then we look at all combinations of DNA sequences to find those that match. This is made easy by itertools.combinations:
import itertools
def k_edges(data, k):
    edges = []
    for u,v in itertools.combinations(data, 2):
        u_dna, v_dna = data[u], data[v]
        if is_k_overlap(u_dna, v_dna, k):
            edges.append((u,v))
        if is_k_overlap(v_dna, u_dna, k):
            edges.append((v,u))
    return edges
For example, on the data above we get:
>>> k_edges(data, 3)
[('Rosalind_2391', 'Rosalind_2323'),
('Rosalind_0498', 'Rosalind_2391'),
('Rosalind_0498', 'Rosalind_0442')]
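The answer above assumes the data is already in a dictionary. As a rough sketch of how that dictionary might be built, the code below parses FASTA text in which each header line starts with ">" and a sequence may wrap over several lines. The names parse_fasta and overlap_edges are hypothetical, and itertools.permutations is used here as a variation on the combinations loop so that each ordered pair is visited once.

```python
from itertools import permutations


def parse_fasta(text):
    """Map each FASTA header (without the leading '>') to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()  # drops trailing newlines that would break suffix checks
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line  # join sequences wrapped across lines
    return records


def overlap_edges(records, k):
    """List (u, v) pairs where the k-suffix of u's string equals the k-prefix of v's."""
    return [(u, v) for u, v in permutations(records, 2)
            if records[u][-k:] == records[v][:k]]


sample = """>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG"""

for u, v in overlap_edges(parse_fasta(sample), 3):
    print(u, v)
```

Stripping each line before storing it matters: if a newline stays attached to a sequence, the last three characters are no longer the true suffix, which is one way an otherwise reasonable solution could give wrong edges.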
