Longest Common Subsequence of three strings - python

I've written these functions (which work) to find the longest common subsequence of two strings.
from collections import defaultdict
def lcs_grid(xs, ys):
    # grid[i][j] = (LCS length of xs[:i+1] and ys[:j+1], move to follow when backtracking)
    grid = defaultdict(lambda: defaultdict(lambda: (0,"")))
    for i,x in enumerate(xs):
        for j,y in enumerate(ys):
            if x == y:
                grid[i][j] = (grid[i-1][j-1][0]+1,'\\')
            else:
                if grid[i-1][j][0] > grid[i][j-1][0]:
                    grid[i][j] = (grid[i-1][j][0],'<')
                else:
                    grid[i][j] = (grid[i][j-1][0],'^')
    return grid
def lcs(xs,ys):
    grid = lcs_grid(xs,ys)
    i, j = len(xs) - 1, len(ys) - 1
    best = []
    length,move = grid[i][j]
    # walk back from the bottom-right cell, collecting matched characters
    while length:
        if move == '\\':
            best.append(xs[i])
            i -= 1
            j -= 1
        elif move == '^':
            j -= 1
        elif move == '<':
            i -= 1
        length,move = grid[i][j]
    best.reverse()
    return best
Does anybody have a suggestion for modifying these functions so that they can print the longest common subsequence of three strings? That is, the function call would be: lcs(str1, str2, str3)
Until now I have managed it with reduce, but I'd like a function that really prints out the subsequence without using reduce.

To find the longest common substring of D strings, you cannot simply use reduce, since the longest common substring of three strings does not have to be a substring of the LCS of any two of them. Counterexample:
a = "aaabb"
b = "aaajbb"
c = "cccbb"
In the example, LCS(a,b) = "aaa" and LCS(a, b, c) = "bb". As you can see, "bb" is not a substring of "aaa".
In your case, since you implemented the dynamic programming version, you have to build a D-dimensional grid and adjust the algorithm accordingly.
You might want to look at suffix trees, which should make things faster; see Wikipedia. Also look at this Stack Overflow question.
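For the subsequence problem in the question, the D-dimensional grid with D = 3 could look roughly like the sketch below. The name lcs3 and the list-based table are my own choices, not part of the original functions, and the table uses O(n*m*k) memory, so this only suits short inputs.

```python
def lcs3(xs, ys, zs):
    """Longest common subsequence of three sequences (illustrative sketch)."""
    n, m, k = len(xs), len(ys), len(zs)
    # dp[i][j][l] = LCS length of the prefixes xs[:i], ys[:j], zs[:l]
    dp = [[[0] * (k + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for l in range(1, k + 1):
                if xs[i - 1] == ys[j - 1] == zs[l - 1]:
                    dp[i][j][l] = dp[i - 1][j - 1][l - 1] + 1
                else:
                    dp[i][j][l] = max(dp[i - 1][j][l],
                                      dp[i][j - 1][l],
                                      dp[i][j][l - 1])
    # Walk back from the far corner to recover one optimal subsequence.
    best = []
    i, j, l = n, m, k
    while i and j and l:
        if xs[i - 1] == ys[j - 1] == zs[l - 1]:
            best.append(xs[i - 1])
            i, j, l = i - 1, j - 1, l - 1
        elif dp[i - 1][j][l] == dp[i][j][l]:
            i -= 1
        elif dp[i][j - 1][l] == dp[i][j][l]:
            j -= 1
        else:
            l -= 1
    best.reverse()
    return best
```

Like the two-string lcs above, it returns a list of characters, so "".join(lcs3(a, b, c)) gives the string form.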

Related

Find occurrence of a string in another string

Details:
There are two strings x and y.
Count the number of occurrences of y in x as follows:
Length of y is 3.
Increment the "count" value when y == x[i] x[i+2] x[i+4]
Example:
x = "aabbcc"
y = "abc"
output: 2
My Code:
def solution(x, y):
    i, count = 0, 0
    j = i + 2
    k = i + 4
    while i+4 < len(x):
        cur = x[i]
        while i < len(x) and i != j:
            i += 1
        while i < len(x) and i != k:
            i += 1
        count += 1
    return count
solution(x, y)
I am getting count = 1. It should give count = 2
There's a couple of logic errors in your code.
The problem happens here:
while i < len(x) and i != j:
    i += 1
res.append(x[i])
You keep increasing i until it is either len(x) or greater, or until it is the same as j. But since you set j to be 2 at the start (and never update it), it will simply end up setting i to len(x). And x[i] will thus fail, since x[len(x)] tries to index an element just outside x.
However, there's a few more remarks to make:
you collect what you find in res, but really only want a number (e.g. 2) as a result
you define count but don't use it
you track the coordinates in the string in three separate variables (i, j, k) and have a lot of logic to increment the first, but really all you need is to step through the string one position at a time, and look at the offsets directly
Given all that and the problem description, you were probably going for something like this:
x = "aabbcc"
y = "abc"
def solution(x, y):
    i, count = 0, 0
    while i + 4 < len(x):
        if (x[i], x[i+2], x[i+4]) == (y[0], y[1], y[2]):
            count += 1
        i += 1
    return count
print(solution(x, y))
However, Python has some cleverness that would make it even simpler (or at least shorter):
def solution(x, y):
    count = 0
    for i in range(len(x)-4):
        if x[i:i+5:2] == y:  # slicing with a stride of two, instead of direct indexing
            count += 1
    return count
Or even:
def solution(x, y):
    return len([x for i in range(len(x)-4) if x[i:i+5:2] == y])
But that's favouring brevity over readability a bit too much, I feel.
A generator expression solution, taking advantage of True/False == 1/0 in a numeric context:
def solution(x, y):
    return sum(y == x[i:i+5:2] for i in range(len(x)-4))
Increment the "count" value when y == x[i] x[i+2] x[i+4]
This is the same as simply creating the string consisting of x[0], x[2], x[4]... (every even-numbered character) and the string consisting of x[1], x[3], x[5]... (every odd-numbered character); counting the occurrences of y in each; and adding those two results together.
Creating the strings is trivial, and a common duplicate. Counting occurrences of a substring is also well-trodden ground. Putting these tools together:
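def spread_substrings(needle, haystack):
    def count_overlapping(text):
        # str.count() skips overlapping matches, so test every start position
        return sum(text.startswith(needle, i) for i in range(len(text) - len(needle) + 1))
    even_haystack = haystack[::2]
    odd_haystack = haystack[1::2]
    return count_overlapping(even_haystack) + count_overlapping(odd_haystack)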

LeetCode 5: Longest Palindromic Substring

I have been working on the LeetCode problem 5. Longest Palindromic Substring:
Given a string s, return the longest palindromic substring in s.
But I kept getting time limit exceeded on large test cases.
I used dynamic programming as follows:
dp[(i, j)] = True implies that s[i] to s[j] is a palindrome. So if s[i] == s[j] and dp[(i+1, j-1)] is set to True, that means s[i] to s[j] is also a palindrome.
How can I improve the performance of this implementation?
class Solution:
    def longestPalindrome(self, s: str) -> str:
        dp = {}
        res = ""
        for i in range(len(s)):
            # single character is always a palindrome
            dp[(i, i)] = True
            res = s[i]
        # fill in the table diagonally
        for x in range(len(s) - 1):
            i = 0
            j = x + 1
            while j <= len(s)-1:
                if s[i] == s[j] and (j - i == 1 or dp[(i+1, j-1)] == True):
                    dp[(i, j)] = True
                    if(j-i+1) > len(res):
                        res = s[i:j+1]
                else:
                    dp[(i, j)] = False
                i += 1
                j += 1
        return res
I think the time limit for this problem is rather tight; it took some time to make it pass. Here is an improved version:
class Solution:
    def longestPalindrome(self, s: str) -> str:
        dp = {}
        res = ""
        for i in range(len(s)):
            dp[(i, i)] = True
            res = s[i]
        for x in range(len(s)):  # iterate till the end of the string
            for i in range(x):  # iterate up to the current state (less work) and for loop looks better here
                if s[i] == s[x] and (dp.get((i + 1, x - 1), False) or x - i == 1):
                    dp[(i, x)] = True
                    if x - i + 1 > len(res):
                        res = s[i:x + 1]
        return res
Here is another idea to improve the performance:
The nested loop will check over many cases where the DP value is already False for smaller ranges. We can avoid looking at large spans, by looking for palindromes from inside-out and stop extending the span as soon as it no longer is a palindrome. This process should be repeated at every offset in the source string, but this could still save some processing.
The inputs on which the most time is wasted are those with long runs of the same letter, like "aaaaaaabcaaaaaaa". These lead to many iterations: each "a" or "aa" could be the center of a palindrome, but "growing" each of them is a waste of time. We should just consider all consecutive "a" together from the start and expand from there onwards.
You can deal with these cases specifically by first grouping consecutive identical letters. The above example would be turned into 4 groups: a(7)b(1)c(1)a(7)
Then let each group in turn be taken as the center of a palindrome. For each group, "fan out" to potentially include one or more neighboring groups on both sides in tandem. Continue fanning out until the outer groups either hold different letters or have different sizes. From that result you can derive the largest palindrome around that center. In particular, when the letters of the outer groups are the same but their sizes are not, you still include that letter at the outside of the palindrome, but with a repetition count equal to the smaller of the two mismatching group sizes.
Here is an implementation. I used named tuples to make it more readable:
from itertools import groupby
from collections import namedtuple
Group = namedtuple("Group", "letter,size,end")
class Solution:
    def longestPalindrome(self, s: str) -> str:
        longest = ""
        x = 0
        groups = [Group(group[0], len(group), x := x + len(group)) for group in
                  ("".join(group[1]) for group in groupby(s))]
        for i in range(len(groups)):
            for j in range(0, min(i+1, len(groups) - i)):
                if groups[i - j].letter != groups[i + j].letter:
                    break
                left = groups[i - j]
                right = groups[i + j]
                if left.size != right.size:
                    break
            size = right.end - (left.end - left.size) - abs(left.size - right.size)
            if size > len(longest):
                x = left.end - left.size + max(0, left.size - right.size)
                longest = s[x:x+size]
        return longest
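For comparison, the plain inside-out idea without the grouping step can be sketched as follows. This is only an illustration: the function name longest_palindrome_expand is made up here, and it is still quadratic on inputs with long runs of one letter, which is exactly the case the grouping above is meant to speed up.

```python
def longest_palindrome_expand(s: str) -> str:
    """Expand around every possible center and keep the widest palindrome."""
    best_start, best_len = 0, 0
    for center in range(len(s)):
        # (center, center) covers odd lengths, (center, center + 1) even lengths
        for lo, hi in ((center, center), (center, center + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            # the loop stops one step past the palindrome on each side
            length = hi - lo - 1
            if length > best_len:
                best_start, best_len = lo + 1, length
    return s[best_start:best_start + best_len]
```

It needs no dictionary or table at all, which already avoids most of the overhead of the DP versions.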
Alternatively, you can try this approach; it seems to be faster than 96% of Python submissions.
def longestPalindrome(self, s: str) -> str:
    N = len(s)
    if N == 0:
        return ""  # the return type is str, so an empty input yields an empty string
    max_len, start = 1, 0
    for i in range(N):
        df = i - max_len
        # try to extend the best length by two, ending at index i
        if df >= 1 and s[df-1: i+1] == s[df-1: i+1][::-1]:
            start = df - 1
            max_len += 2
            continue
        # otherwise try to extend it by one
        if df >= 0 and s[df: i+1] == s[df: i+1][::-1]:
            start = df
            max_len += 1
    return s[start: start + max_len]
If you want to improve the performance, you should create a variable for len(s) at the beginning of the function and use it. That way instead of calling len(s) 3 times, you would do it just once.
Also, I see no reason to create a class for this function. A simple function will outrun a class method, albeit very slightly.

Finding longest sequence of consecutive repeats of a substring within a string

My code for this function is really messy, and I cannot work out why it returns a list of 1s. A solution would obviously be great, but I'd also be happy with advice on how to make the code better.
def cont_cons_repeats(ADN, STR, pos):
    slong = 0
    # Find start of sequence
    for i in range(len(ADN[pos:])):
        if ADN[pos + i:i + len(STR)] == STR:
            slong = 1
            pos = i + pos
            break
    if slong == 0:
        return 0
    # First run
    for i in range(len(ADN[pos:])):
        i += len(STR) - 1
        if ADN[pos + i + 1:pos + i + len(STR)] == STR:
            slong += 1
        else:
            pos = i + pos
            break
    # Every other run
    while True:
        pslong = cont_cons_repets(ADN, STR, pos)
        if pslong > slong:
            slong = pslong
        if pslong == 0:
            break
    return slong
(slong stands for size of longest sequence, pslong for potential slong, and pos for position)
Assuming you pass in pos because you want to ignore the start of the string you're searching up to pos:
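def longest_run(text, part, pos):
    m = 0  # longest run seen so far
    n = 0  # length of the current run
    while pos < len(text):
        if text[pos:pos+len(part)] == part:
            n += 1
            pos += len(part)
        else:
            m = max(n, m)
            n = 0
            pos += 1
    # a run that reaches the end of text never hits the else branch
    return max(n, m)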
You say your function returns a list of 1s, but that doesn't seem to match what your code is doing. Your provided code has some errors, including a misspelled call to your own function (cont_cons_repets instead of cont_cons_repeats), so it's impossible to say why you're getting that result.
You mentioned in the comments that you thought a recursive solution was required. You could definitely make it work as a recursive function, but in many cases where a recursive function works, you should consider a non-recursive function to save on resources. Recursive functions can be very elegant and easy to read, but remember that any recursive function can also be written as a non-recursive function. It's never required, often more resource-intensive, but sometimes just a very clean and easy to maintain solution.
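One caveat with a left-to-right greedy scan: if the part can overlap with itself, a run may begin in the middle of an earlier match and be skipped. A sketch that avoids this records, for every index, how many copies of part start exactly there, filling the table from the right. The name longest_run_dp and the default pos=0 are assumptions for illustration only.

```python
def longest_run_dp(text: str, part: str, pos: int = 0) -> int:
    """Longest run of back-to-back copies of part in text[pos:] (sketch)."""
    if not part:
        return 0
    step = len(part)
    # run[i] = number of consecutive copies of part starting exactly at index i
    run = [0] * (len(text) + 1)
    for i in range(len(text) - step, pos - 1, -1):
        if text.startswith(part, i):
            run[i] = 1 + run[i + step]
    return max(run)
```

For text "ababaaba" and part "aba", a greedy scan that consumes the match at index 0 reports 1, while this version finds the two copies starting at index 2.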

Longest non-decreasing subsequence with minimal sum

I am trying to find an algorithm for the problem in the title and am stuck. I adapted the code given in zzz's answer here, which implements the Longest Increasing Subsequence (LIS) algorithm, to get the Longest Non-Decreasing Subsequence (LNDS). What I want is an LNDS with minimal sum (MSLNDS), and I don't know whether that is what I have. As far as I can tell, the original LIS algorithm as presented on Wikipedia does find a minimal-sum LIS. The docstring of that code says the LIS algorithm guarantees that if multiple increasing subsequences exist, the one that ends with the smallest value is preferred, and if multiple occurrences of that value can end the sequence, then the earliest occurrence is preferred. I don't know what "earliest occurrence" means here, but I would love not to have to generate every LNDS just to find my MSLNDS. It seems to me that the clever transformation given by templatetypedef could be used to show that a unique MSLIS transforms to an MSLNDS, but I don't have a proof. So:
a) Will LIS algorithm as given on wikipedia always output minimal sum LIS?
b) If LIS algorithm is adopted this way, will LNDS algorithm retain this property?
def NDS(X):
    n = len(X)
    X = [0.0] + X
    M = [None]*(n+1)
    P = [None]*(n+1)
    L = 0
    for i in range(1,n+1):
        #########################################
        # for LIS algorithm, this line would be
        # if L == 0 or X[M[1]] >= X[i]:
        #########################################
        if L == 0 or X[M[1]] > X[i]:
            j = 0
        else:
            lo = 1
            hi = L+1
            while lo < hi - 1:
                mid = (lo + hi)//2
                #########################################
                # for LIS algorithm, this line would be
                # if X[M[mid]] < X[i]:
                #########################################
                if X[M[mid]] <= X[i]:
                    lo = mid
                else:
                    hi = mid
            j = lo
        P[i] = M[j]
        if j == L or X[i] < X[M[j+1]]:
            M[j+1] = i
            L = max(L,j+1)
    output = []
    pos = M[L]
    while L > 0:
        output.append(X[pos])
        pos = P[pos]
        L -= 1
    output.reverse()
    return output
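This does not settle (a) or (b), but one way to probe them without enumerating every LNDS is to compare NDS against a slow reference that is minimal-sum by construction. Below is a quadratic sketch of such a reference. The name min_sum_lnds is hypothetical, and ties between equally good subsequences are broken arbitrarily, so it is safer to compare sums than the exact lists.

```python
def min_sum_lnds(xs):
    """O(n^2) reference: a longest non-decreasing subsequence of minimal sum."""
    n = len(xs)
    if n == 0:
        return []
    # best[i] = (length, total) of the best subsequence ending at index i
    best = [(1, x) for x in xs]
    prev = [None] * n
    for i in range(n):
        for j in range(i):
            if xs[j] <= xs[i]:
                length, total = best[j][0] + 1, best[j][1] + xs[i]
                # longer wins; on equal length the smaller sum wins
                if length > best[i][0] or (length == best[i][0] and total < best[i][1]):
                    best[i] = (length, total)
                    prev[i] = j
    # pick the end index with maximal length, then minimal sum
    end = max(range(n), key=lambda i: (best[i][0], -best[i][1]))
    out = []
    while end is not None:
        out.append(xs[end])
        end = prev[end]
    out.reverse()
    return out
```

Running both functions on many short random lists and comparing len() and sum() of their outputs would quickly expose a counterexample if one exists.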

more efficient method of substring calculation for advice

My code works, but I am looking for smarter ideas to make it more efficient.
The similarity of two strings is defined as the length of their longest common prefix;
for example, the similarity of "abc" and "abd" is 2, and that of "aaa" and "aaab" is 3.
The problem is to calculate the sum of the similarities of a string S with all of its suffixes,
including S itself as the first suffix.
For example, for S="ababaa" the suffixes are "ababaa", "babaa", "abaa", "baa", "aa"
and "a", and the similarities sum to 6+0+3+0+1+1=11.
# Complete the function below.
from collections import defaultdict
class TrieNode:
    def __init__(self):
        self.children=defaultdict(TrieNode)
        self.isEnd=False
class TrieTree:
    def __init__(self):
        self.root=TrieNode()
    def insert(self, word):
        node = self.root
        for w in word:
            node = node.children[w]
        node.isEnd = True
    def search(self, word):
        # length of the longest prefix of word that is stored in the trie
        node = self.root
        count = 0
        for w in word:
            node = node.children.get(w)
            if not node:
                break
            else:
                count += 1
        return count
def StringSimilarity(inputs):
    resultFormat=[]
    for word in inputs:
        # build Trie tree
        index = TrieTree()
        index.insert(word)
        result = 0
        # search for suffix
        for i in range(len(word)):
            result += index.search(word[i:])
        print(result)
        resultFormat.append(result)
    return resultFormat
def similarity(s, t):
    """ assumes len(t) <= len(s), which is easily doable"""
    i = 0
    while i < len(t) and s[i] == t[i]:
        i += 1
    return i
def selfSimilarity(s):
    return sum(similarity(s, s[i:]) for i in range(len(s)))
selfSimilarity("ababaa")
# 11
Here are 3 efficient approaches you may wish to consider:
Suffix Tree
Compute the suffix tree of the original string. Then descend down the principal path through the suffix tree, counting how many paths depart from the principal at each stage.
Suffix Array
Compute the suffix array and the longest common prefix array.
These arrays can be used to compute the longest common prefix of any pair of suffixes, and in particular the longest common prefix between the original string and each suffix.
Z function
The output you are trying to construct is known as the Z function.
It can be computed directly in linear time as shown here (Not Python code obviously):
vector<int> z_function(string s) {
    int n = (int) s.length();
    vector<int> z(n);
    for (int i = 1, l = 0, r = 0; i < n; ++i) {
        if (i <= r)
            z[i] = min (r - i + 1, z[i - l]);
        while (i + z[i] < n && s[z[i]] == s[i + z[i]])
            ++z[i];
        if (i + z[i] - 1 > r)
            l = i, r = i + z[i] - 1;
    }
    return z;
}
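A direct Python transcription of the routine above might look like the following sketch. The wrapper self_similarity is an added name, not from the question; it adds len(s) because this Z function leaves z[0] at 0, whereas the problem counts the whole string as its own first suffix.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] = 0."""
    n = len(s)
    z = [0] * n
    l = r = 0  # [l, r] is the rightmost segment known to match a prefix of s
    for i in range(1, n):
        if i <= r:
            z[i] = min(r - i + 1, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    return z


def self_similarity(s):
    # the first suffix is s itself, which contributes len(s)
    return len(s) + sum(z_function(s))
```

For the example from the question, self_similarity("ababaa") gives 11, matching 6+0+3+0+1+1.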
It takes a lot of work to build the TrieTree object. Skip that. Just do a double loop over all possible starting points of a match, and all possible offsets where you might still be matching.
Building complex objects like that only makes sense if you'll be querying your data structure many times. But here you aren't so it doesn't pay off.
