Finding longest sequence of consecutive repeats of a substring within a string - python

My code for the function is really messy, and I cannot figure out why it returns a list of 1's. A solution would obviously be great, but I'd also be happy with advice on how to make the code better.
def cont_cons_repeats(ADN, STR, pos):
    slong = 0
    # Find start of sequence
    for i in range(len(ADN[pos:])):
        if ADN[pos + i:i + len(STR)] == STR:
            slong = 1
            pos = i + pos
            break
    if slong == 0:
        return 0
    # First run
    for i in range(len(ADN[pos:])):
        i += len(STR) - 1
        if ADN[pos + i + 1:pos + i + len(STR)] == STR:
            slong += 1
        else:
            pos = i + pos
            break
    # Every other run
    while True:
        pslong = cont_cons_repets(ADN, STR, pos)
        if pslong > slong:
            slong = pslong
        if pslong == 0:
            break
    return slong
(slong stands for size of longest sequence, pslong for potential slong, and pos for position)

Assuming you pass in pos because you want to ignore the start of the string you're searching up to pos:
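def longest_run(text, part, pos):
    m = 0  # longest run seen so far
    n = 0  # length of the run currently being counted
    while pos < len(text):
        if text[pos:pos+len(part)] == part:
            n += 1
            pos += len(part)
        else:
            m = max(n, m)
            n = 0
            pos += 1
    # a run that reaches the end of text never hits the else branch, so count it here
    return max(n, m)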
You say your function returns a list of 1s, but that doesn't seem to match what your code is doing. Your provided code has some syntax errors, including a misspelled call to your function cont_cons_repets, so it's impossible to say why you're getting that result.
You mentioned in the comments that you thought a recursive solution was required. You could definitely make it work as a recursive function, but in many cases where a recursive function works, you should consider a non-recursive function to save on resources. Recursive functions can be very elegant and easy to read, but remember that any recursive function can also be written as a non-recursive function. It's never required, often more resource-intensive, but sometimes just a very clean and easy to maintain solution.
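For comparison, a recursive variant of the same scan could look like the sketch below. The name longest_run_recursive and the extra current/best parameters are my own choices for illustration, not something from the question. It makes the same greedy left-to-right pass as the loop above, so a part that overlaps with itself (such as "ABA") can be undercounted, and every step adds a stack frame, so long inputs can exceed Python's default recursion limit.

```python
def longest_run_recursive(text, part, pos=0, current=0, best=0):
    """Count the longest run of back-to-back copies of `part` in text[pos:].

    `current` is the run being counted, `best` the longest finished run.
    Both are hypothetical helper parameters carried through the recursion.
    """
    # Base case: nothing left to scan (or nothing to search for).
    if not part or pos >= len(text):
        return max(best, current)
    if text.startswith(part, pos):
        # Extend the current run and jump past the matched copy.
        return longest_run_recursive(text, part, pos + len(part), current + 1, best)
    # Run broken: remember it, reset the counter, move one character on.
    return longest_run_recursive(text, part, pos + 1, 0, max(best, current))


print(longest_run_recursive("AGATCAGATCAGATCTT", "AGATC"))
```

The loop version is usually the safer choice for long DNA strings; the recursion is mainly useful if the exercise explicitly asks for it.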

Need some help on a function

Write a function named one_frame that takes one argument seq and performs the tasks specified below. The argument seq is to be a string that contains information for the bases of a DNA sequence.
a → The function searches given DNA string from left to right in multiples of three nucleotides (in a single reading frame).
b → When it hits a start codon ATG it calls get_orf on the slice of the string beginning at that start codon.
c → The ORF returned by get_orf is added to a list of ORFs.
d → The function skips ahead in the DNA string to the point right after the ORF that we just found and starts looking for the next ORF.
e → Steps a through d are repeated until we have traversed the entire DNA string.
The function should return a list of all ORFs it has found.
def one_frame(seq):
    start_codon = 'ATG'
    list_of_codons = []
    y = 0
    while y < len(seq):
        subORF = seq[y:y + 3]
        if start_codon in subORF:
            list_of_codons.append(get_orf(seq))
            return list_of_codons
        else:
            y += 3
one_frame('ATGAGATGAACCATGGGGTAA')
The one_frame at the very bottom is a test case. It is supposed to be equal to ['ATGAGA', 'ATGGGG'], however my code only returns the first item in the list.
How could I fix my function to also return the other part of that list?
You have several problems:
You have return list_of_codons inside the loop. So you return as soon as you find the first match and only return that one. Put that at the end of the function, not inside the loop.
You have y += 3 in the else: block. So you won't increment y when you find a matching codon, and you'll be stuck in a loop.
You need to call get_orf() on the slice of the string starting at y, not the whole string (task b).
Task d says you have to skip to the point after the ORF that was returned in task b, not just continue at the next codon.
def one_frame(seq):
    start_codon = 'ATG'
    list_of_orfs = []
    y = 0
    while y < len(seq):
        subORF = seq[y:y + 3]
        if start_codon == subORF:
            # task b: search from the start codon, not from the beginning of seq
            orf = get_orf(seq[y:])
            list_of_orfs.append(orf)
            # task d: skip past the ORF that was just found
            y += len(orf)
        else:
            y += 3
    return list_of_orfs
one_frame('ATGAGATGAACCATGGGGTAA')
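The question does not show get_orf, so none of the snippets here can be run as posted. The sketch below fills that gap with a hypothetical get_orf, assuming it returns everything from the start codon up to, but not including, the first in-frame stop codon (TAA, TAG or TGA), or the whole remainder if there is no stop codon. That assumption matches the expected output in the question, but your real get_orf may behave differently.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def get_orf(dna):
    """Hypothetical stand-in: return dna up to the first in-frame stop codon."""
    for i in range(0, len(dna), 3):
        if dna[i:i + 3] in STOP_CODONS:
            return dna[:i]
    return dna  # assumption: no stop codon means the ORF runs to the end


def one_frame(seq):
    orfs = []
    y = 0
    while y < len(seq):
        if seq[y:y + 3] == "ATG":
            orf = get_orf(seq[y:])  # slice starts at the start codon
            orfs.append(orf)
            y += len(orf)           # continue right after this ORF
        else:
            y += 3                  # stay in the same reading frame
    return orfs


print(one_frame("ATGAGATGAACCATGGGGTAA"))  # ['ATGAGA', 'ATGGGG']
```

Because an ORF found this way always begins with ATG, len(orf) is at least 3, so the loop always makes progress.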
You have a number of problems in this code, as identified in the comments. I think this does what you are actually supposed to do:
def one_frame(seq):
    start_codon = 'ATG'
    list_of_codons = []
    y = 0
    while y < len(seq):
        if seq[y:y+3] == start_codon:
            orf = get_orf(seq[y:])
            list_of_codons.append(orf)
            y += len(orf)
        else:
            y += 3
    return list_of_codons
one_frame('ATGAGATGAACCATGGGGTAA')
Try splitting seq into codons instead:
def one_frame(seq):
    shift = 3
    codons = [seq[i:i+shift] for i in range(0, len(seq), shift)]
    start_codon = "ATG"
    orf_list = []
    for k, codon in enumerate(codons):
        if codon == start_codon:
            # pass the slice starting at this codon, not just the codon itself
            orf_list += [get_orf(seq[k * shift:])]
    return orf_list
seq = 'ATGAGATGAACCATGGGGTAA'
one_frame(seq)
Slightly different approach but as I know nothing about DNA sequencing this may not make sense. Here goes anyway:
def one_frame(seq):
    start_codon = 'ATG'
    list_of_codons = []
    offset = 0
    while (i := seq[offset:].find(start_codon)) >= 0:
        offset += i
        list_of_codons.append(get_orf(seq[offset:]))
        offset += len(list_of_codons[-1])
    return list_of_codons
In this way the find() starts searching from the beginning of the sequence initially but subsequently only from the end of any previous codon

LeetCode 5: Longest Palindromic Substring

I have been working on the LeetCode problem 5. Longest Palindromic Substring:
Given a string s, return the longest palindromic substring in s.
But I kept getting time limit exceeded on large test cases.
I used dynamic programming as follows:
dp[(i, j)] = True implies that s[i] to s[j] is a palindrome. So if s[i] == s[j] and dp[(i+1, j-1)] is set to True, that means s[i] to s[j] is also a palindrome.
How can I improve the performance of this implementation?
class Solution:
    def longestPalindrome(self, s: str) -> str:
        dp = {}
        res = ""
        for i in range(len(s)):
            # single character is always a palindrome
            dp[(i, i)] = True
            res = s[i]
        # fill in the table diagonally
        for x in range(len(s) - 1):
            i = 0
            j = x + 1
            while j <= len(s)-1:
                if s[i] == s[j] and (j - i == 1 or dp[(i+1, j-1)] == True):
                    dp[(i, j)] = True
                    if(j-i+1) > len(res):
                        res = s[i:j+1]
                else:
                    dp[(i, j)] = False
                i += 1
                j += 1
        return res
I think the time limit for this problem is rather tight; it took some time to make it pass. Improved version:
class Solution:
    def longestPalindrome(self, s: str) -> str:
        dp = {}
        res = ""
        for i in range(len(s)):
            dp[(i, i)] = True
            res = s[i]
        for x in range(len(s)):  # iterate till the end of the string
            for i in range(x):  # iterate up to the current state (less work) and for loop looks better here
                if s[i] == s[x] and (dp.get((i + 1, x - 1), False) or x - i == 1):
                    dp[(i, x)] = True
                    if x - i + 1 > len(res):
                        res = s[i:x + 1]
        return res
Here is another idea to improve the performance:
The nested loop will check over many cases where the DP value is already False for smaller ranges. We can avoid looking at large spans, by looking for palindromes from inside-out and stop extending the span as soon as it no longer is a palindrome. This process should be repeated at every offset in the source string, but this could still save some processing.
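In its plain form, before any further optimisation, the inside-out idea is usually called "expand around center". A minimal sketch, written as a standalone function with names of my own choosing (on LeetCode you would wrap it in the Solution class):

```python
def longest_palindrome(s: str) -> str:
    best_start, best_len = 0, 0
    for center in range(len(s)):
        # Try an odd-length center (one char) and an even-length center (two chars).
        for lo, hi in ((center, center), (center, center + 1)):
            # Grow outwards while the two ends still match.
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            # The loop overshoots by one on each side.
            length = hi - lo - 1
            if length > best_len:
                best_start, best_len = lo + 1, length
    return s[best_start:best_start + best_len]


print(longest_palindrome("babad"))  # bab
```

This is still quadratic in the worst case, but it needs no table and stops each expansion at the first mismatch.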
The inputs on which the most time is wasted are those with many identical letters in a row, like "aaaaaaabcaaaaaaa". These lead to many iterations: each "a" or "aa" could be the center of a palindrome, but "growing" each of them is a waste of time. We should just consider all consecutive "a" together from the start and expand from there onwards.
You can deal with these cases specifically by first grouping consecutive letters which are the same. So the above example would be turned into 4 groups: a(7)b(1)c(1)a(7)
Then let each group in turn be taken as the center of a palindrome. For each group, "fan out" to potentially include one or more neighboring groups at both sides in "tandem". Continue fanning out until either the outside groups do not hold the same letter, or they have a different group size. From that result you can derive what the largest palindrome is around that center. In particular, when the letters of the outer groups are the same but their sizes are not, you still include that letter at the outside of the palindrome, but with a repetition count equal to the smaller of the two mismatching group sizes.
Here is an implementation. I used named tuples to make it more readable:
from itertools import groupby
from collections import namedtuple

Group = namedtuple("Group", "letter,size,end")

class Solution:
    def longestPalindrome(self, s: str) -> str:
        longest = ""
        x = 0
        groups = [Group(group[0], len(group), x := x + len(group)) for group in
                  ("".join(group[1]) for group in groupby(s))]
        for i in range(len(groups)):
            for j in range(0, min(i+1, len(groups) - i)):
                if groups[i - j].letter != groups[i + j].letter:
                    break
                left = groups[i - j]
                right = groups[i + j]
                if left.size != right.size:
                    break
            size = right.end - (left.end - left.size) - abs(left.size - right.size)
            if size > len(longest):
                x = left.end - left.size + max(0, left.size - right.size)
                longest = s[x:x+size]
        return longest
Alternatively, you can try this approach; it seems to be faster than 96% of Python submissions.
def longestPalindrome(self, s: str) -> str:
    N = len(s)
    if N == 0:
        return ""  # the return type is str, so return an empty string, not 0
    max_len, start = 1, 0
    for i in range(N):
        df = i - max_len
        # can the best palindrome so far grow by two characters, ending at i?
        if df >= 1 and s[df-1: i+1] == s[df-1: i+1][::-1]:
            start = df - 1
            max_len += 2
            continue
        # otherwise, can it grow by one character?
        if df >= 0 and s[df: i+1] == s[df: i+1][::-1]:
            start = df
            max_len += 1
    return s[start: start + max_len]
If you want to improve the performance, you should create a variable for len(s) at the beginning of the function and use it. That way instead of calling len(s) 3 times, you would do it just once.
Also, I see no reason to create a class for this function. A simple function will outrun a class method, albeit very slightly.

Python 3 recursive method returning last value not current value

I am a little stumped on this one. I am doing a LeetCode problem for removing adjacent duplicate chars from a string. I took a recursive approach to this problem; however, my output is not following the expected behavior and I'm not too sure why. Hoping someone on here could explain what I'm missing.
input: s = "azxxzy"
expected output: "ay"
Explanation: When looking at azxxzy, we first remove the adjacent xx chars, which leaves azzy. Then we can remove zz, leaving only ay, with no other duplicate characters adjacent to each other.
The code I wrote to accomplish this.
def removeDuplicates(s: str):
    i = 0
    while i < len(s) - 1:
        if s[i] == s[i+1]:
            s = s.replace(s[i] + s[i+1], '')
            removeDuplicates(s)
        i += 1
    return s
This is returning "azzy" as the result.
However, if I put in some print statements to track the value of s, it appears to be working properly until the return statement.
def removeDuplicates(self, s: str) -> str:
    i = 0
    print(s)
    while i < len(s) - 1:
        if s[i] == s[i+1]:
            s = s.replace(s[i] + s[i+1], '')
            self.removeDuplicates(s)
        i += 1
    print(s)
    return s
returns:
azxxzy - Starting Value
azzy - Value after first recursive call
ay - Value after second recursive call
ay - No duplicates found, so it just prints the value of s
ay - No duplicates found, so it just prints the value of s
azzy - Python decides it wants to grab the last value of s for some reason!?!
Thanks to j1-lee for the explanation. I was not returning the result of the recursive function call. The code below is now working as expected.
def removeDuplicates(self, s: str) -> str:
    i = 0
    while i < len(s) - 1:
        if s[i] == s[i+1]:
            s = s.replace(s[i] + s[i+1], '')
            # keep the result of the recursive call; as a method it is called via self
            s = self.removeDuplicates(s)
        i += 1
    return s
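The reason the first version returned "azzy" is that Python strings are immutable: s = s.replace(...) inside the recursive call only rebinds that call's own local s, so the caller never sees the change unless the result is returned and assigned. If recursion is not a requirement, a stack-based single pass is a common alternative. A minimal sketch (the function name is mine, not LeetCode's):

```python
def remove_adjacent_duplicates(s: str) -> str:
    stack = []
    for ch in s:
        if stack and stack[-1] == ch:
            # ch pairs up with the previous surviving character: drop both.
            stack.pop()
        else:
            stack.append(ch)
    return "".join(stack)


print(remove_adjacent_duplicates("azxxzy"))  # ay
```

Each character is pushed and popped at most once, so this runs in linear time and cannot hit the recursion limit.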

Why is this code not running fully? It doesn't run line 53

I made myself an exercise with Python since I am new. I wanted to make a reverse LCM (least common multiple) calculator, but for some reason something as simple as a print in a loop doesn't seem to work for me. I would appreciate some help since I have been stuck on this weird issue for 20 minutes now. Here is the code:
import random
import sys

def print_list():
    count_4_print = 0
    while count_4_print < len(values):
        print(values[count_4_print])
        count_4_print += 1

def lcm(x, y):
    if x > y:
        greater = x
    else:
        greater = y
    while True:
        if (greater % x == 0) and (greater % y == 0):
            lcm1 = greater
            break
        greater += 1
    return lcm1

def guess(index, first_guess, second_guess):
    num = 1
    while lcm(first_guess, second_guess) != values[num - 1]:
        first_guess = random.randrange(1, 1000000)
        second_guess = random.randrange(1, 1000000)
        num += 1
    num = 1
    if lcm(first_guess, second_guess) == values[num - 1]:
        return first_guess, second_guess
        num += 1

lineN = int(input())
values = []
count_4_add = 0
count_4_guess = 0
for x in range(lineN):
    values.append(int(input()))
    count_4_add += 1
    if count_4_add >= lineN:
        break
print_list()
for x in range(lineN + 1):
    first, second = guess(count_4_guess, 1, 1)
    count_4_guess += 1
    print(first + second)
    # this ^^^ doesn't work for some reason
Line 57 is in the while loop with count_4_guess. Right above this text, it says print(first_guess + second_guess)
Edit: The code is supposed to take in an int x and then prompt for x values. The outputs are the inputs without x and LMC(output1, output2) where the "LMC" is one of the values. This is done for each of the values, x times. What it actually does is just the first part. It takes the x and prompts for x outputs and then prints them but doesn't process the data (or it just doesn't print it)
Note: From looking at your comments and edits it seems that you are lacking some basic knowledge and/or understanding of things. I strongly encourage you to study more programming, computer science and python before attempting to create entire programs like this.
It is tough to answer your question properly since many aspects are unclear, so I will update my answer to reflect any relevant changes in your post.
Now, onto my answer. First, I will go over some of your code and attempt to give feedback on what could be improved. Then, I will present a few ways to compute the least common multiple (LCM) in Python.
Code review
Code:
def print_list():
    count_4_print = 0
    while count_4_print < len(values):
        print(values[count_4_print])
        count_4_print += 1
Notes:
Where are the parameters? print_list reads the global values instead of taking the list as an argument. It was already mentioned in a few comments, but the importance of this cannot be stressed enough! (see the note at the beginning of my answer)
It appears that you are trying to print each element of a list on a new line. You can do that with print(*my_list, sep='\n').
That while loop is not how you should iterate over the elements of a list. Instead, use a for loop: for element in my_list:.
Code:
def lcm(x, y):
    if x > y:
        greater = x
    else:
        greater = y
    while True:
        if (greater % x == 0) and (greater % y == 0):
            lcm1 = greater
            break
        greater += 1
    return lcm1
Notes:
This is not a robust algorithm for the LCM, since it raises ZeroDivisionError when either number is 0.
The comparison of x and y can be replaced with greater = max(x, y).
See the solution I posted below for a different way of writing this same algorithm.
Code:
def guess(index, first_guess, second_guess):
    num = 1
    while lcm(first_guess, second_guess) != values[num - 1]:
        first_guess = random.randrange(1, 1000000)
        second_guess = random.randrange(1, 1000000)
        num += 1
    num = 1
    if lcm(first_guess, second_guess) == values[num - 1]:
        return first_guess, second_guess
        num += 1
Notes:
The line num += 1 comes immediately after return first_guess, second_guess, which means it is never executed. Somehow the mistakes cancel each other out since, as far as I can tell, it wouldn't do anything anyway if it were executed.
if lcm(first_guess, second_guess) == values[num - 1]: is completely redundant, since the while loop above checks the exact same condition.
In fact, not only is it redundant it is also fundamentally broken, as mentioned in this comment by user b_c.
Unfortunately I cannot say much more on this function since it is too difficult for me to understand its purpose.
Code:
lineN = int(input())
values = []
count_4_add = 0
count_4_guess = 0
for x in range(lineN):
    values.append(int(input()))
    count_4_add += 1
    if count_4_add >= lineN:
        break
print_list()
Notes:
As explained previously, print_list() should not be a thing.
lineN should be changed to line_n, or even better, something like num_in_vals.
count_4_add will always be equal to lineN at the end of your for loop.
Building on the previous point, the check if count_4_add >= lineN is useless.
In conclusion, count_4_add and count_4_guess are completely unnecessary and detrimental to the program.
The for loop produces values in the variable x which is never used. You can replace an unused variable with _: for _ in range(10):.
Since your input code is simple you could probably get away with something like in_vals = [int(input(f'Enter value number {i}: ')) for i in range(1, num_in_vals+1)]. Again, this depends on what it is you're actually trying to do.
LCM Implementations
According to the Wikipedia article referenced earlier, the best way to calculate the LCM is using the greatest common divisor (GCD).
import math

def lcm(a: int, b: int) -> int:
    if a == b:
        # also covers a == b == 0, where dividing by gcd(0, 0) == 0 would fail
        res = abs(a)
    else:
        res = abs(a * b) // math.gcd(a, b)
    return res
This second method is one possible brute force solution, which is similar to how the one you are currently using should be written.
def lcm(a, b):
    if a == b:
        res = a
    else:
        max_mult = a * b
        res = max_mult
        great = max(a, b)
        small = min(a, b)
        # only multiples of the larger number can be common multiples
        for i in range(great, max_mult, great):
            if i % small == 0:
                res = i
                break
    return res
This final method works for any number of inputs.
import math
import functools

def lcm_simp(a: int, b: int) -> int:
    if a == b:
        res = abs(a)
    else:
        res = abs(a * b) // math.gcd(a, b)
    return res

def lcm(*args: int) -> int:
    # fold the two-argument LCM over all inputs, left to right
    return functools.reduce(lcm_simp, args)
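On Python 3.9 and later, the standard library already provides math.lcm, which accepts any number of integer arguments. The sketch below prefers it and falls back to a gcd-based reduction on older versions; lcm_many is a hypothetical name I picked for illustration.

```python
import math
from functools import reduce


def lcm_many(*values: int) -> int:
    """LCM of any number of integers; hypothetical wrapper for illustration."""
    if hasattr(math, "lcm"):  # available since Python 3.9
        return math.lcm(*values)

    def lcm_pair(a: int, b: int) -> int:
        # The LCM is defined as 0 as soon as one argument is 0.
        if a == 0 or b == 0:
            return 0
        return abs(a * b) // math.gcd(a, b)

    # Start from 1, the neutral element, so an empty call returns 1 like math.lcm().
    return reduce(lcm_pair, values, 1)


print(lcm_many(4, 6, 10))  # 60
```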
Oof, that ended up being way longer than I expected. Anyway, let me know if anything is unclear, if I've made a mistake, or if you have any further questions! :)

Finding the edit distance of two strings with recursion

I need to use recursion to find the edit distance of two strings, i.e. I give the function two arguments (each a different string), and the function finds the least number of changes required to turn s1 into s2. This is what I have so far:
def edit_distance(s1,s2):
    split1 = list(s1)
    split2 = list(s2)
    count = 0
    pos = 0
    if split1[pos] == split2[pos]:
        pos += 1
    else:
        pos +=1
        count += 1
    edit_distance(s1, s2)
return count #This should be the minimum amount required to match the two strings
I annotated your code to show you the code flow. I hope you understand now why you get the error:
def edit_distance(s1,s2):
    split1 = list(s1) # Split strings into characters
    split2 = list(s2)
    count = 0 # This variable is local, it is not shared through calls to the function!
    pos = 0 # Same
    if split1[pos] == split2[pos]: # pos is always 0 here!
        pos += 1 # pos is incremented anyway, in if but also in else !
    else:
        pos +=1 # See above
        count += 1 # count is incremented, now it is 1
    edit_distance(s1, s2) # recursive call, but with the identical arguments as before! The next function call will do exactly the same as this one, resulting in infinite recursion!
return count # Wrong indentation here
Your function does not do what you want. In case you are talking about Hamming distance, which is not really clear to me still, here is a sample implementation assuming the lengths of both strings are equal:
# Notice that pos is passed between calls and initially set to zero
def hamming(s1, s2, pos=0):
    # Are we after the last character already?
    if pos < len(s1):
        # Return one if the current position differs and add the result for the following positions (starting at pos+1) to that
        return (s1[pos] != s2[pos]) + hamming(s1, s2, pos+1)
    else:
        # If the end is already reached, the remaining distance is 0
        return 0
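If what you actually mean is the Levenshtein distance (the fewest insertions, deletions and substitutions, with strings of possibly different lengths), a recursive sketch could look like the following. This is my reading of "edit distance", not something the question confirms. The inner helper dist and the use of functools.lru_cache are my own choices: the cache stores each (i, j) pair so the recursion does not recompute the same suffixes over and over.

```python
from functools import lru_cache


def edit_distance(s1: str, s2: str) -> int:
    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        # Distance between the suffixes s1[i:] and s2[j:].
        if i == len(s1):
            return len(s2) - j          # insert the rest of s2
        if j == len(s2):
            return len(s1) - i          # delete the rest of s1
        if s1[i] == s2[j]:
            return dist(i + 1, j + 1)   # characters match, no edit needed
        return 1 + min(
            dist(i + 1, j),             # delete s1[i]
            dist(i, j + 1),             # insert s2[j]
            dist(i + 1, j + 1),         # substitute s1[i] with s2[j]
        )

    return dist(0, 0)


print(edit_distance("kitten", "sitting"))  # 3
```

As with any recursive approach in Python, very long strings can exceed the default recursion limit; an iterative table avoids that.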
