Find next highest lexicographic permutation of a string [duplicate] - python

Given a string W, I want to find the next lexicographically greater string that can be formed from its characters.
For example:
givenstring = "hefg"
nexthighest = "hegf"
Here is what I have tried so far:
from itertools import permutations
q = int(input())
for i in range(q):
    s = input()
    if s == s[::-1]:
        print("no answer")
    else:
        x = ["".join(p) for p in list(permutations(s))]
        x.sort()
        index = x.index(s)
        print(x[index+1])
This is not an efficient way to solve the problem. Can you please suggest a better approach?

Here is another way to solve this problem:
def NextHighestWord(string):
    S = [ord(i) for i in string]
    # find the non-increasing suffix, scanning from the end
    i = len(S) - 1
    while i > 0 and S[i-1] >= S[i]:
        i = i - 1
    if i <= 0:
        return False
    # the element just before the suffix is the pivot;
    # find the rightmost element greater than the pivot and swap them
    j = len(S) - 1
    while S[j] <= S[i -1]:
        j = j - 1
    S[i-1],S[j] = S[j],S[i-1]
    # reverse the suffix
    S[i:] = S[len(S) - 1 : i-1 : -1]
    ans = [chr(i) for i in S]
    ans = "".join(ans)
    print(ans)
    return True
test = int(input())
for i in range(test):
    s = input()
    val = NextHighestWord(s)
    if val:
        continue
    else:
        print("no answer")

One classic algorithm to generate the next permutation is:
Step 1: Find the largest index k such that A[k] < A[k + 1].
If no such index exists, this is the last permutation. (The snippet below just reverses the vector and returns in that case.)
Step 2: Find the largest index l such that A[l] > A[k] and l > k.
Step 3: Swap A[k] and A[l].
Step 4: Reverse A[k + 1] to the end.
Here is my C++ snippet of the above algorithm. Though it's not Python, the syntax is simple and close to pseudocode, so I hope you will get the idea.
void nextPermutation(vector<int> &num) {
    int k = -1;
    int l;
    //step1
    for (int i = num.size() - 1; i > 0; --i) {
        if (num[i - 1] < num[i]) {
            k = i - 1;
            break;
        }
    }
    if (k == -1) {
        reverse(num.begin(), num.end());
        return;
    }
    //step2
    for (int i = num.size() - 1; i > k; --i) {
        if (num[i] > num[k]) {
            l = i;
            break;
        }
    }
    //step3
    swap(num[l], num[k]);
    //step4
    reverse(num.begin() + k + 1, num.end());
}
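Since the question is about Python strings, the four steps above can be sketched in Python roughly as follows. This is only an illustration: the function name next_permutation is my own, and returning None for the last permutation (instead of reversing, as the C++ snippet does) is an assumption chosen to match the "no answer" case in the question.

```python
def next_permutation(word):
    """Return the next lexicographically greater arrangement of word, or None."""
    chars = list(word)
    # Step 1: find the largest k with chars[k] < chars[k + 1]
    k = len(chars) - 2
    while k >= 0 and chars[k] >= chars[k + 1]:
        k -= 1
    if k < 0:
        return None  # word is already the last permutation
    # Step 2: find the largest l > k with chars[l] > chars[k]
    l = len(chars) - 1
    while chars[l] <= chars[k]:
        l -= 1
    # Step 3: swap the two elements
    chars[k], chars[l] = chars[l], chars[k]
    # Step 4: reverse everything after position k
    chars[k + 1:] = reversed(chars[k + 1:])
    return "".join(chars)


if __name__ == "__main__":
    for w in ("hefg", "dkhc", "bb"):
        print(w, "->", next_permutation(w) or "no answer")
```

Each call does a constant number of linear passes over the string, so it avoids generating and sorting all permutations.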


Minimum jumps to reach the end

Question statement:
Given array: [1,0,1,0,1,1,1,1,1,0,1,1,1,0]
Output: the minimum number of steps required to reach the end
Conditions:
Stepping on a 0 is an exit (you cannot continue from there).
You can take either 1 step or 2 steps at a time.
I have solved it without DP. Is there a DP solution for this problem?
My code:
def minjump(arr):
    n = len(arr)
    if n <= 0 or arr[0] == 0:
        return 0
    index = jump = 0
    while index < n:
        if index == n-1:
            return jump
        if arr[index] == 0 and arr[index+1] == 0:
            return 0
        if arr[index] == 1:
            if index < n-2 and arr[index+2] == 1:
                jump +=1
                index +=2
            else:
                jump += 1
                index += 1
    return jump
A naïve solution with no memoization simply recurses through the list, taking either one or two steps at each position and keeping the minimum number of steps needed:
def min_steps(array, current_count, current_position):
    if current_position >= len(array): # Terminal condition if you've reached the end
        return current_count
    return (
        min(
            min_steps(array, current_count + 1, current_position + 1),
            min_steps(array, current_count + 1, current_position + 2),
        ) # minimum count after taking one step or two
        if array[current_position] # if current step is valid (non-zero)
        else float("inf") # else, return float.infinity (fine since we're not imposing types on this prototype)
    )
def minjump(arr):
    result = min_steps(arr, 0, 0)
    return 0 if result == float("inf") else result
You can solve a more general problem using DP. In the pseudocode below, k represents the maximum number of steps you can jump:
int n = arr.Length
int dp[n]            // dp[i] = minimum number of jumps needed to reach index i
fill dp with INF     // INF marks an index that has not been reached
dp[0] = 0
for (i = 1; i < n; i++) {
    int lowerBound = max(i - k, 0)
    for (j = i - 1; j >= lowerBound; j--) {
        if (arr[j] == 1 && dp[j] != INF) {
            dp[i] = min(dp[i], 1 + dp[j])
        }
    }
}
return dp[n - 1]     // INF here means the end cannot be reached
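For reference, a minimal Python sketch of the same DP idea could look like this. The name min_jumps_dp, the default k=2, and the choice to return -1 when the end is unreachable are my own assumptions (the question's code returns 0 in that case), so adjust them to your needs.

```python
def min_jumps_dp(arr, k=2):
    """Minimum jumps from index 0 to the last index, jumping 1..k cells at a time.

    A jump may only start from a cell holding 1. Returns -1 if the last
    index cannot be reached (an assumed convention, not from the question).
    """
    n = len(arr)
    if n == 0 or arr[0] == 0:
        return -1
    INF = float("inf")
    dp = [INF] * n  # dp[i] = fewest jumps needed to reach index i
    dp[0] = 0
    for i in range(1, n):
        # look back at the (up to) k cells we could have jumped from
        for j in range(max(i - k, 0), i):
            if arr[j] == 1 and dp[j] + 1 < dp[i]:
                dp[i] = dp[j] + 1
    return dp[-1] if dp[-1] != INF else -1


if __name__ == "__main__":
    print(min_jumps_dp([1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0]))
```

The table has n entries and each entry looks back at most k cells, so the running time is O(n * k).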

Target sum dp algorithm when element can be zero

Target sum prompt:
You are given a set of positive numbers and a target sum ‘S’. Each number should be assigned either a ‘+’ or ‘-’ sign. We need to find the total ways to assign symbols to make the sum of the numbers equal to the target ‘S’.
Input: {1, 1, 2, 3}, S=1
Output: 3
Explanation: The given set has '3' ways to make a sum of '1': {+1-1-2+3} & {-1+1-2+3} & {+1+1+2-3}
Let Sum(s1) denote the total of the numbers that get a '+' sign, and Sum(s2) the total of the numbers that get a '-' sign.
The problem can then be reduced to the subset sum problem with target (target + sum(nums)) / 2:
sum(s1) - sum(s2) = target
sum(s1) + sum(s2) = sum(nums)
2 * sum(s1) = target + sum(nums)
sum(s1) = (target + sum(nums)) / 2
def findTargetSumWays(nums, S):
    """
    :type nums: List[int]
    :type S: int
    :rtype: int
    """
    if (sum(nums) + S) % 2 == 1 or sum(nums) < S:
        return 0
    ssum = (sum(nums) + S) // 2
    dp = [[0 for _ in range(ssum + 1)] for _ in range(len(nums))]
    # col == 0
    for i in range(len(nums)):
        # [] or [0]
        if i == 0 and nums[i] == 0:
            dp[i][0] = 2
        # [] or [0] from previous
        elif nums[i] == 0:
            dp[i][0] = 2 * dp[i-1][0]
        else: # empty set only
            dp[i][0] = 1
    # take 1st element nums[0] in s == nums[0]
    for s in range(1, ssum + 1):
        if nums[0] == s:
            dp[0][s] = 1
    for i in range(1, len(nums)):
        for s in range(1, ssum + 1):
            if nums[i] != 0:
                # skip element at i
                dp[i][s] = dp[i - 1][s]
                # include element at i
                if s >= nums[i]:
                    dp[i][s] += dp[i - 1][s - nums[i]]
            else: # nums[i] = 0
                dp[i][s] = dp[i-1][s] * 2
    return dp[len(nums) - 1][ssum]
I've spent a few hours on this prompt but still can't pass the following example:
[7,0,3,9,9,9,1,7,2,3]
6
expected: 50
output: 43 (using my algorithm)
I've also looked through other people's answers here. They all make sense, but I just want to know what I could have missed in my algorithm.
You can do it like this:
from itertools import product
import numpy as np
def findTargetSumWays(nums, S):
    a = [1,-1]
    result=[np.multiply(nums,i) for i in list(product(a, repeat=len(nums))) if sum(np.multiply(nums,i))==S]
    return(len(result))
>>> inputs = [7,0,3,9,9,9,1,7,2,3]
>>> findTargetSumWays(inputs,6)
50
Basically, I generate every possible tuple of -1 and 1 with the same length as the input, multiply each tuple element-wise with the input, and count the tuples whose products sum to S. This is a brute-force approach that checks all 2^n sign assignments.
I ran into this same issue when handling zeroes, but I did this in C++, where I handled zeroes separately.
Make sure that the knapsack approach skips zeroes, i.e.
if(a[i-1] == 0)
    dp[i][j] = dp[i-1][j];
We can handle zeroes separately by simply counting the zero occurrences, since each zero can be put in either S1 or S2. So each zero doubles the answer, and for n zeroes it is 2^n * (answer), i.e.
answer = pow(2, num_zero) * answer;
Also, don't forget to return zero if sum(nums) + target is odd (S1 can't be fractional) or if target is greater than sum(nums), i.e.
if(sum < target || (sum+target)%2 == 1)
    return 0;
The overall code looks like this:
#include <cmath>
#include <vector>
using namespace std;
// a is passed as a vector so that a.size() is available
int subsetSum(const vector<int>& a, int n, int sum) {
    // dp[i][j] = number of subsets of the first i elements that sum to j
    vector<vector<int>> dp(n + 1, vector<int>(sum + 1, 0));
    for(int i = 0; i<n+1; i++)
        dp[i][0] = 1;
    for(int i = 1; i<n+1; i++) {
        for(int j = 1; j<sum+1; j++) {
            if(a[i-1] == 0)
                dp[i][j] = dp[i-1][j];
            else if(a[i-1]<=j)
                dp[i][j] = dp[i-1][j-a[i-1]] + dp[i-1][j];
            else
                dp[i][j] = dp[i-1][j];
        }
    }
    return dp[n][sum];
}
int findTargetSumWays(const vector<int>& a, int target) {
    int sum = 0;
    int num_zero = 0;
    for(int i = 0; i<(int)a.size(); i++) {
        sum += a[i];
        if(a[i] == 0)
            num_zero++;
    }
    if(sum < target || (sum+target)%2 == 1)
        return 0;
    int ans = subsetSum(a, a.size(), (sum + target)/2);
    return pow(2, num_zero) * ans;
}
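Since the question is in Python, here is a rough Python sketch of the same "count the zeroes separately" idea. The name count_target_sum_ways is hypothetical, it uses a one-dimensional table instead of the two-dimensional one above, and the abs(target) guard for negative targets is my own addition rather than part of the C++ answer.

```python
def count_target_sum_ways(nums, target):
    """Count +/- sign assignments over nums whose signed sum equals target."""
    total = sum(nums)
    # abs() guards against negative targets (an assumption beyond the answer above)
    if total < abs(target) or (total + target) % 2 != 0:
        return 0
    subset_target = (total + target) // 2
    zeros = nums.count(0)
    # dp[s] = number of subsets of the non-zero values seen so far that sum to s
    dp = [1] + [0] * subset_target
    for num in nums:
        if num == 0:
            continue  # zeroes are accounted for by the 2 ** zeros factor below
        for s in range(subset_target, num - 1, -1):
            dp[s] += dp[s - num]
    # each zero can take either sign without changing the sum
    return dp[subset_target] * 2 ** zeros


if __name__ == "__main__":
    print(count_target_sum_ways([1, 1, 2, 3], 1))
    print(count_target_sum_ways([7, 0, 3, 9, 9, 9, 1, 7, 2, 3], 6))
```

Iterating s downwards keeps each number from being used twice within the same pass.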
The source of the problem is this part, initializing col == 0:
# col == 0
for i in range(len(nums)):
    # [] or [0]
    if i == 0 and nums[i] == 0:
        dp[i][0] = 2
    # [] or [0] from previous
    elif nums[i] == 0:
        dp[i][0] = 2 * dp[i-1][0]
    else: # empty set only
        dp[i][0] = 1
This code treats zeros differently depending on how the list is ordered (it resets the value to 1 if it hits a nonzero value). It should instead look like this:
# col == 0
for i in range(len(nums)):
    # [] or [0]
    if i == 0 and nums[i] == 0:
        dp[i][0] = 2
    elif i == 0:
        dp[i][0] = 1
    # [] or [0] from previous
    elif nums[i] == 0:
        dp[i][0] = 2 * dp[i-1][0]
    else: # empty set only
        dp[i][0] = dp[i - 1][0]
This way, the first value is set to either 2 or 1 depending on whether or not it's zero, and nonzero values later in the list don't reset the value to 1. This outputs 50 in your sample case.
You can also remove room for error by giving simpler initial conditions:
def findTargetSumWays(nums, S):
    """
    :type nums: List[int]
    :type S: int
    :rtype: int
    """
    if (sum(nums) + S) % 2 == 1 or sum(nums) < S:
        return 0
    ssum = (sum(nums) + S) // 2
    dp = [[0 for _ in range(ssum + 1)] for _ in range(len(nums) + 1)]
    # col == 0
    dp[0][0] = 1
    for i in range(len(nums)):
        for s in range(ssum + 1):
            dp[i + 1][s] = dp[i][s]
            if s >= nums[i]:
                dp[i + 1][s] += dp[i][s - nums[i]]
    return dp[len(nums)][ssum]
This adds an additional row to represent the state before you add any numbers (just a 1 in the top left corner), and it runs your algorithm on the rest of the rows. You don't need to initialize anything else or treat zeros differently, and this way it should be easier to reason about the code.
The issue with your function is related to the way you manage zero values in the list. Perhaps a simpler way for you to handle the zero values would be to exclude them from the process and then multiply your resulting count by 2**Z where Z is the number of zero values.
While trying to find the problem, I did a bit of simplification on your code and ended up with this: (which gives the right answer, even with zeroes in the list).
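def findTargetSumWays(nums, S):
    # same guard as in your original function
    if (sum(nums) + S) % 2 == 1 or sum(nums) < S:
        return 0
    ssum = (sum(nums) + S) // 2
    dp = [1]+[0]*ssum # number of sets that produce each sum from 0 to ssum
    for num in nums:
        for s in reversed(range(num,ssum + 1)):
            dp[s] += dp[s-num]
    return dp[ssum]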
What I did was:
Eliminate a dimension in dp, because you don't need to keep all the previous set counts, only the current and next one. Actually it can work using only the current set counts if you process the sum values backwards from ssum down to zero (which I did).
The condition s >= nums[i] was eliminated by starting the s range from the current num value, so that the index s - num can never be negative.
With that done, there was no need for an index on nums, I could simply go through the values directly.
Then I got rid of all the special conditions on zero values by initializing dp with 1 for the zero sum (i.e. initially an empty set is the one solution to obtain a sum of zero, then increments proceed from there).
Starting with the empty set baseline allows the progressive accumulation of set counts to produce the right result for all values without requiring any special treatment of zeroes. When num is zero it will naturally double all the current set counts because dp[s] += dp[s-0] is the same as dp[s] = 2 * dp[s]. If the list starts out with a zero then the set count for a sum of zero (dp[0]) will be doubled and all subsequent num values will have a larger starting count (because they start out from the dp[0] value initialized with 1 for the empty set).
With that last change, the function started to give the right result.
My assertion is that, because your solution was not starting from the "empty set" baseline, the zero handling logic was interfering with the natural progression of set counts. I didn't try to fine tune the zero conditions because they weren't needed and it seemed pointless to get them to arrive at the same states that a mere initialization "one step earlier" would produce
From there, the logic can be further optimized by avoiding assignments to dp[s] outside the range of minimum and maximum possible sums (which "slides" forward as we progress through the nums list):
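def findTargetSumWays(nums, S):
    # same guard as in your original function
    if (sum(nums) + S) % 2 == 1 or sum(nums) < S:
        return 0
    ssum = (sum(nums) + S) // 2
    dp = [1]+[0]*ssum
    maxSum = 0
    minSum = S - ssum # equivalent to: ssum - sum(nums)
    for num in nums:
        maxSum += num
        minSum += num
        for s in reversed(range(max(num,minSum),min(ssum,maxSum)+1)):
            dp[s] += dp[s-num]
    return dp[ssum]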

Speeding up Python code that has to go through entire list

I have a problem where I need to (pretty sure at least) go through the entire list to solve. The question is to figure out the largest number of consecutive numbers in a list that add up to another (greater) element in that list. If there aren't any then we just take the largest value in the list as the candidate summation and 1 as the largest consecutive number of elements.
My general code works, but not too well for large lists (>500,000 elements). I am just looking for tips as to how I could approach the problem differently. My current approach:
L = [1,2,3,4,5,6,7,8,9,10]
candidate_sum = L[-1]
largest_count = 1
N = len(L)
i = 0
while i < N - 1:
    s = L[i]
    j = 0
    while s <= (N - L[i + j + 1]):
        j += 1
        s += L[i+j]
        if s in L and (j+1) > largest_count:
            largest_count = j+1
            candidate_sum = s
    i+=1
In this case, the answer would be [1,2,3,4] as they add up to 10 and the length is 4 (obviously this example L is a very simple example).
I then made it faster by changing the initial while loop condition to:
while i < (N-1)/largest_count
Not a great assumption, but basic thinking that the distribution of numbers is somewhat uniform, so two numbers on the second half of the list are on average bigger than the final number in the list, and therefore are disqualified.
I'm just looking for:
possible bottlenecks
suggestions as to different approaches to try
This answer assumes the list is:
Strictly ascending: no duplication of elements or subsequences, single possible solution
Arbitrarily spaced: no arithmetical shortcuts, has to operate brute-force
Efficient C implementation using pointer arithmetic, quasi-polymorphic over numeric types:
#define TYPE int
int max_subsum(TYPE arr [], int size) {
    int max_length = 1;
    TYPE arr_fst = * arr;
    TYPE* num_ptr = arr;
    while (size --) {
        TYPE num = * num_ptr++;
        TYPE* lower = arr;
        TYPE* upper = arr;
        TYPE sum = arr_fst;
        int length = 1;
        for (;;) {
            if (sum > num) {
                sum -= * lower++;
                -- length;
            }
            else if (sum < num) {
                sum += * ++upper;
                ++ length;
            }
            else {
                if (length > max_length) {
                    max_length = length;
                }
                break;
            }
        }
    }
    return max_length;
}
The main loop over the nums is parallelizable. Relatively straight-forward translation into Python 3 using the dynamic-array list type for arr and the for each loop:
def max_subsum(arr):
    max_len = 1
    arr_fst = arr[0]
    for n in arr:
        lower = 0
        upper = 0
        sum = arr_fst
        while True:
            if sum > n:
                sum -= arr[lower]
                lower += 1
            elif sum < n:
                upper += 1
                sum += arr[upper]
            else:
                sum_len = upper - lower + 1
                if sum_len > max_len:
                    max_len = sum_len
                break
    return max_len
This max_subsum is a partial function; Python lists can be empty. The algorithm is appropriate for C-like compiled imperative languages offering fast indexing and statically typed arithmetic. Both are comparatively expensive in Python. A (totally defined) algorithm rather similar to yours, using the set data type for more performant universal quantification, and avoiding Python's dynamically typed arithmetic, can be more efficiently interpreted:
def max_subsum(arr):
    size = len(arr)
    max_len = 0
    arr_set = set(arr)
    for i in range(size):
        sum = 0
        sum_len = 0
        for j in range(i, size):
            sum_mem = sum + arr[j]
            if sum_mem not in arr_set:
                break
            sum = sum_mem
            sum_len += 1
        if sum_len > max_len:
            max_len = sum_len
    return max_len
I'm going to ignore the possibility of a changing target value, and let you figure that out, but to answer your question "is there a faster way to do it?" Yes: by using cumulative sums and some math to eliminate one of your loops.
import numpy as np
L = np.random.randint(0,100,100)
L.sort()
cum_sum = np.cumsum(L)
start = 0
end = 0
target = 200
while 1:
    # sum of the window L[start:end], taken from the cumulative sums
    # (an empty prefix contributes 0, so index -1 is never read by accident)
    total = (cum_sum[end-1] if end else 0) - (cum_sum[start-1] if start else 0)
    if total == target:
        break
    elif total < target:
        end += 1
    elif total > target:
        start += 1
    if end > len(L):
        raise ValueError('something informative')
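To connect the cumulative-sum idea back to the original question, one possible sketch is shown below. It assumes all values are positive (so a running total only grows and the inner loop can stop early), and the function name and the (length, sum) return shape are my own choices, not something from the question.

```python
from itertools import accumulate


def longest_run_summing_to_element(values):
    """Return (length, total) of the longest run of consecutive elements
    whose sum equals some element of values.

    Falls back to (1, max(values)) when no run of two or more elements
    qualifies. Assumes every value is positive.
    """
    if not values:
        return 0, None
    members = set(values)   # O(1) membership test instead of "s in L"
    largest = max(values)
    prefix = [0] + list(accumulate(values))  # prefix[i] = sum(values[:i])
    best_len, best_sum = 1, largest
    n = len(values)
    for start in range(n):
        # only runs longer than the current best can improve the answer
        for end in range(start + best_len + 1, n + 1):
            total = prefix[end] - prefix[start]
            if total > largest:
                break  # totals only grow from here because values are positive
            if total in members:
                best_len, best_sum = end - start, total
    return best_len, best_sum


if __name__ == "__main__":
    print(longest_run_summing_to_element([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

The two main savings over the loop in the question are the set lookup and skipping every run that is not longer than the best one found so far.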

3SUM (finding all unique triplets in a list that equal 0)

I am working on the 3SUM problem (taken from leetcode), which takes a list as input and finds all unique triplets in the lists such that a+b+c=0. I am not really sure what my code is doing wrong, but it currently returns an empty list for this list [-1, 0, 1, 2, -1, -4], so it is not recognizing any triplets that sum to 0. I would appreciate any suggestions or improved code.
Here's my code:
result = []
nums.sort()
l = 0
r=len(nums)-1
for i in range(len(nums)-2):
    while (l < r):
        sum = nums[i] + nums[l] + nums[r]
        if (sum < 0):
            l = l + 1
        if (sum > 0):
            r = r - 1
        if (sum == 0):
            result.append([nums[i],nums[l],nums[r]])
print(result)
A couple things to note.
Don't use sum as a variable name because that's a built-in function.
Your indexing is a little problematic since you initialize l = 0 and have i begin at 0 as well.
Don't rest on your laurels: increment the value of l when you find a successful combination. It's really easy to forget this step!
An edited version of your code below.
nums = [-1, 0, 1, 2, -1, -4]
result = []
nums.sort()
for i in range(len(nums)-2):
    l = i + 1 # we don't want l and i to be the same value.
              # for each value of i, l starts one greater
              # and increments from there.
    r = len(nums) - 1 # r also has to start from the end again for each i,
                      # otherwise triplets such as [-2, -1, 3] in
                      # [-3, -2, -1, 1, 3] are missed
    while (l < r):
        sum_ = nums[i] + nums[l] + nums[r]
        if (sum_ < 0):
            l = l + 1
        if (sum_ > 0):
            r = r - 1
        if not sum_: # 0 is False in a boolean context
            result.append([nums[i],nums[l],nums[r]])
            l = l + 1 # increment l when we find a combination that works
>>> result
[[-1, -1, 2], [-1, 0, 1], [-1, 0, 1]]
If you wish, you can omit the repeats from the list.
unique_lst = []
[unique_lst.append(sublst) for sublst in result if not unique_lst.count(sublst)]
>>> unique_lst
[[-1, -1, 2], [-1, 0, 1]]
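The list comprehension above rescans unique_lst for every triplet. As an alternative sketch, a set of tuples can do the duplicate check in constant time per triplet; the helper name unique_triplets is hypothetical, and sorting each triplet before hashing is an assumption that triplets differing only in order should count as the same.

```python
def unique_triplets(triplets):
    """Drop repeated triplets while keeping the first occurrence of each."""
    seen = set()
    unique = []
    for triplet in triplets:
        key = tuple(sorted(triplet))  # lists are unhashable, tuples are not
        if key not in seen:
            seen.add(key)
            unique.append(list(triplet))
    return unique


if __name__ == "__main__":
    print(unique_triplets([[-1, -1, 2], [-1, 0, 1], [-1, 0, 1]]))
```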
Another approach uses itertools.combinations. This doesn't require a sorted list.
from itertools import combinations
result = []
for lst in combinations(nums, 3):
    if sum(lst) == 0:
        result.append(lst)
A nested for loop version. Not a big fan of this approach, but it's basically the brute-force version of the itertools.combinations solution. Since it's the same approach as above, no sort is needed.
result = []
for i in range(0, len(nums)-2):
    for j in range(i + 1, len(nums)-1):
        for k in range(j + 1, len(nums)):
            if not sum([nums[i], nums[j], nums[k]]): # 0 is False
                result.append([nums[i], nums[j], nums[k]])
Uncomment print statement from my solution:
class Solution:
    def threeSum(self, nums):
        """
        :type nums: List[int]
        :rtype: List[List[int]]
        """
        # print('Input: {}'.format(nums))
        nums.sort() # inplace sorting, using only indexes
        N, result = len(nums), []
        # print('Sorted Input: {}'.format(nums))
        for i in range(N):
            if i > 0 and nums[i] == nums[i-1]:
                # print("Duplicate found(when 'i' iterate ) at index: {}, current: {}, prev: {}, so JUMP this iteration------".format(i,nums[i], nums[i-1]))
                continue
            target = nums[i]*-1
            s,e = i+1, N-1
            # print('~'*50)
            # print("Target: {} at index: {} & s: {} e: {} {}".format(target,i, s, e, '----'*2))
            while s<e: # for each target squeeze in s & e
                if nums[s]+nums[e] == target:
                    result.append([nums[i], nums[s], nums[e]])
                    # print(' {} + {} == {}, with s: {} < e: {}, Triplet: {}, MOVING --> R'.format(nums[s], nums[e], target,s, e,result))
                    s = s+1
                    while s<e and nums[s] == nums[s-1]: # duplicate
                        # print("Duplicate found(when 's' iterates) at s: {} < e: {}, WILL KEEP MOVING ---> R (s: {}) == (s-1: {})".format(s, e, nums[s], nums[s - 1]))
                        s = s+1
                elif nums[s] + nums[e] < target:
                    # print(' {} + {} < {}, with s: {} e: {}, MOVING ---> R'.format(nums[s], nums[e], target,s, e))
                    s = s+1
                else:
                    # print(' {} + {} > {}, with s: {} e: {}, MOVING <--- L'.format(nums[s], nums[e], target,s, e))
                    e = e-1
        return result
It will help you understand the algorithm better. Also, this algorithm is considerably faster than the options above: it takes ~892.18 ms, compared to ~4216.98 ms for the alternative above. The overhead is due to the additional duplicate-removal logic.
I did a similar approach as 3novak, but I added in the case where the number list is less than three integers returning an empty list.
class Solution:
    def threeSum(self, nums):
        """
        :type nums: List[int]
        :rtype: List[List[int]]
        """
        # if less than three numbers, don't bother searching
        if len(nums) < 3:
            return []
        # sort nums and use current sum to see if should have larger number or smaller number
        nums = sorted(nums)
        triplets = []
        for i in range(len(nums)-2): # i point to first number to sum in list
            j = i + 1 # j points to middle number to sum in list
            k = len(nums) - 1 # k points to last number to sum in list
            while j < k:
                currSum = nums[i] + nums[j] + nums[k]
                if currSum == 0:
                    tmpList = sorted([nums[i], nums[j], nums[k]])
                    if tmpList not in triplets:
                        triplets.append(tmpList)
                    j += 1 # go to next number to avoid infinite loop
                # sum too large, so move k to smaller number
                elif currSum > 0:
                    k -= 1
                # sum too small so move j to larger number
                elif currSum < 0:
                    j += 1
        return triplets
I'm doing the same problem at leetcode, but still have a runtime error. This may be able to be done by using a binary search tree-like algorithm to find the third result, as well.
Using two pointer approach:
First sort the list.
Iterate from left to right. Say the current position is i; set the left position to i+1 and the right position to the end of the list, N-1.
If the sum is greater than 0, decrease the right end by 1.
Else, if the sum is less than 0, increase the left end by 1.
Else, check the uniqueness of the new entry and, if it is unique, add it to the answer list. Continue searching for more entries with leftEnd++ and rightEnd--.
Java Code:
public ArrayList<ArrayList<Integer>> threeSum(ArrayList<Integer> A) {
    ArrayList<ArrayList<Integer>> ans = new ArrayList<ArrayList<Integer>>();
    Collections.sort(A); // takes O(nlogn)
    if (A.size() < 3) return ans;
    ArrayList<Integer> triplet = new ArrayList<>();
    // i must run up to size-3 inclusive, otherwise the last possible triplet is skipped
    for(int i = 0; i < A.size()-2; i++){ // takes O(n^2)
        // compare Integer objects with equals(), not ==
        if (i > 0 && A.get(i).equals(A.get(i-1))) continue; // to maintain unique entries
        int r = A.size()-1;
        int l = i+1;
        while (l < r){
            int s = sumOfThree(A, i, l, r);
            if (s == 0){
                if (ans.size() == 0 || !bTripletExists(A, i, l, r, triplet)){
                    triplet = getNewTriplet(A, i, l, r); // to be matched against next triplet
                    ans.add(triplet);
                }
                l++;
                r--;
            }else if (s > 0){
                r--;
            }else {
                l++;
            }
        }
    }
    return ans;
}
public int sumOfThree(ArrayList<Integer> A, int i, int j, int k){
    return A.get(i)+A.get(j)+A.get(k);
}
public ArrayList<Integer> getNewTriplet(ArrayList<Integer> A, int i, int j, int k){
    ArrayList<Integer> newTriplet = new ArrayList<>();
    newTriplet.add(A.get(i));
    newTriplet.add(A.get(j));
    newTriplet.add(A.get(k));
    return newTriplet;
}
public boolean bTripletExists(ArrayList<Integer> A, int i, int j, int k, ArrayList<Integer> triplet){
    if (A.get(i).equals(triplet.get(0)) &&
        A.get(j).equals(triplet.get(1)) &&
        A.get(k).equals(triplet.get(2)))
        return true;
    return false;
}
Most of the answers given above are great but fail some edge cases on leetcode.
I added a few more checks to pass all the test cases:
from typing import List
class Solution:
    def threeSum(self, nums: List[int]) -> List[List[int]]:
        # if the list has less than 3 elements
        if len(nums)<3:
            return []
        # if nums is just zeroes, return a single all-zero triplet
        elif sum([i**2 for i in nums]) == 0:
            return [[0,0,0]]
        nums.sort()
        result = []
        for i in range(len(nums)):
            # duplicate, skip it
            if i > 0 and nums[i]== nums[i-1]:
                continue
            # left pointer starts next to current i item
            l = i+1
            r = len(nums)-1
            while l< r:
                summ = nums[l] + nums[r]
                # if we find 2 numbers that sum up to -item
                if summ == -nums[i]:
                    result.append([nums[i],nums[l],nums[r]])
                    l +=1
                    # duplicate, skip it
                    while l<r and nums[l] == nums[l-1]:
                        l +=1
                # if the sum is smaller than 0 we move left pointer forward
                elif summ + nums[i] < 0:
                    l +=1
                # if the sum is bigger than 0 move the right pointer backward
                else:
                    r -=1
        return result

Longest common subsequence of 3+ strings

I am trying to find the longest common subsequence of 3 or more strings. The Wikipedia article has a great description of how to do this for 2 strings, but I'm a little unsure of how to extend this to 3 or more strings.
There are plenty of libraries for finding the LCS of 2 strings, so I'd like to use one of them if possible. If I have 3 strings A, B and C, is it valid to find the LCS of A and B as X, and then find the LCS of X and C, or is this the wrong way to do it?
I've implemented it in Python as follows:
import difflib
from functools import reduce  # reduce is no longer a builtin in Python 3
def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)
print(reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa']))
This outputs "ba", however it should be "baa".
Just generalize the recurrence relation.
For three strings:
dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1]                                if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1])     otherwise
Should be easy to generalize to more strings from this.
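As a rough illustration of that generalization, here is a minimal Python sketch that applies the same recurrence to any number of strings. The names lcs_length and solve are hypothetical, it returns only the length, and it relies on recursion with functools.lru_cache, so it is meant for short inputs (the table has one dimension per string).

```python
from functools import lru_cache


def lcs_length(*strings):
    """Length of the longest common subsequence of all given strings."""
    if not strings or any(len(s) == 0 for s in strings):
        return 0

    @lru_cache(maxsize=None)
    def solve(idx):
        # idx holds the index of the last character still considered in each string
        if any(i < 0 for i in idx):
            return 0
        last_chars = {s[i] for s, i in zip(strings, idx)}
        if len(last_chars) == 1:
            # all strings end in the same character: take it and step back everywhere
            return 1 + solve(tuple(i - 1 for i in idx))
        # otherwise drop the last character of one string at a time
        return max(
            solve(idx[:d] + (idx[d] - 1,) + idx[d + 1:])
            for d in range(len(idx))
        )

    return solve(tuple(len(s) - 1 for s in strings))


if __name__ == "__main__":
    print(lcs_length("abacbdab", "bdcaba", "cbacaa"))
```

For the three strings from the question this gives 3, which matches the expected "baa".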
I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.
The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.
So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.
Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.
def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k],
                                              subs[i][j+1][k+1],
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1
    return lcs[::-1]
To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally, as shown in the link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that value have already been solved. That is why you traverse the 2-dimensional array diagonally.
This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate over an array of N dimensions such that any element is reached only after all the sub-problems it depends on have been solved.
Instead of iterating over the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:
string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}
My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.
On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"
The LCS returned is "54442"
The code below can find the longest common subsequence of N strings. It uses itertools to generate the required index combinations and then uses these indexes to find the common subsequence.
Example Execution:
Input:
Enter the number of sequences: 3
Enter sequence 1 : 83217
Enter sequence 2 : 8213897
Enter sequence 3 : 683147
Output:
837
from itertools import product
import numpy as np
import pdb
def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))
def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs)
    lengths = np.array([len(sequence) for sequence in sqs])
    incrLengths = lengths + 1
    lengths = tuple(lengths)
    inverseDistances = np.zeros(incrLengths)
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]]
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1]
        neighborIndexes = list(neighbors(tupleIndex))
        operationsWithMisMatch = np.array([])
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex])
        operationsWithMatch = np.copy(operationsWithMisMatch)
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)]
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch)
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch)
    # pdb.set_trace()
    subString = ""
    mainTupleIndex = lengths
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex))
        anyOperation = False
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex]
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex
                anyOperation = True
                break
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1])
            mainTupleIndex = neighborsIndexes[-1]
    return subString[::-1]
numberOfSequences = int(input("Enter the number of sequences: "))
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)]
print(longestCommonSubSequenceOfN(sequences))
Here is a solution that returns only the length of the LCS of three strings. For the example below, the output is "Length of LCS is 2".
# Python program to find
# LCS of three strings
# Returns length of LCS
# for X[0..m-1], Y[0..n-1]
# and Z[0..o-1]
def lcsOf3(X, Y, Z, m, n, o):
    L = [[[0 for i in range(o+1)] for j in range(n+1)]
         for k in range(m+1)]
    ''' Following steps build L[m+1][n+1][o+1] in
    bottom up fashion. Note that L[i][j][k]
    contains length of LCS of X[0..i-1] and
    Y[0..j-1] and Z[0.....k-1] '''
    for i in range(m+1):
        for j in range(n+1):
            for k in range(o+1):
                if (i == 0 or j == 0 or k == 0):
                    L[i][j][k] = 0
                elif (X[i-1] == Y[j-1] and
                      X[i-1] == Z[k-1]):
                    L[i][j][k] = L[i-1][j-1][k-1] + 1
                else:
                    L[i][j][k] = max(max(L[i-1][j][k],
                                         L[i][j-1][k]),
                                     L[i][j][k-1])
    # L[m][n][o] contains length of LCS for
    # X[0..m-1] and Y[0..n-1] and Z[0..o-1]
    return L[m][n][o]
# Driver program to test above function
X = 'AGGT12'
Y = '12TXAYB'
Z = '12XBA'
m = len(X)
n = len(Y)
o = len(Z)
print('Length of LCS is', lcsOf3(X, Y, Z, m, n, o))
# This code is contributed by Soumen Ghosh.
