Algorithm problem for Deletion Distance between 2 strings

I was solving this problem on Pramp and had trouble figuring out the algorithm. I'll paste the problem description and the solution I ended up with. The solution is correct; it is similar to the edit distance algorithm and uses the same approach. I just want to see what other ways there are to solve this problem.
The deletion distance of two strings is the minimum number of characters you need to delete in the two strings in order to get the same string. For instance, the deletion distance between "heat" and "hit" is 3:
By deleting 'e' and 'a' in "heat", and 'i' in "hit", we get the string "ht" in both cases.
We cannot get the same string from both strings by deleting 2 letters or fewer.
Given the strings str1 and str2, write an efficient function deletionDistance that returns the deletion distance between them. Explain how your function works, and analyze its time and space complexities.
Examples:
input: str1 = "dog", str2 = "frog"
output: 3
input: str1 = "some", str2 = "some"
output: 0
input: str1 = "some", str2 = "thing"
output: 9
input: str1 = "", str2 = ""
output: 0
In this solution I use dynamic programming to compute opt(str1Len, str2Len), i.e. the deletion distance for the two strings. To do that I calculate opt(i, j), the deletion distance between the first i characters of str1 and the first j characters of str2, for all 0 ≤ i ≤ str1Len, 0 ≤ j ≤ str2Len, saving previous values so that each one is computed only once. This takes O(str1Len * str2Len) time and space.
def deletion_distance(s1, s2):
    m = [[0 for j in range(len(s2) + 1)] for i in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        for j in range(len(s2) + 1):
            if i == 0:
                m[i][j] = j
            elif j == 0:
                m[i][j] = i
            elif s1[i-1] == s2[j-1]:
                m[i][j] = m[i-1][j-1]
            else:
                m[i][j] = 1 + min(m[i-1][j], m[i][j-1])
    return m[len(s1)][len(s2)]
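As one other way to look at it, the deletion distance can be derived from the longest common subsequence (LCS): every character outside the LCS has to be deleted, so the distance is len(s1) + len(s2) - 2 * LCS. The sketch below assumes that identity and keeps only two rows of the table, so it still takes O(n * m) time but only O(min(n, m)) extra space. The name deletion_distance_lcs is made up for this example.

```python
def deletion_distance_lcs(s1, s2):
    # Keep the shorter string on the inner axis so each row stays small.
    if len(s2) > len(s1):
        s1, s2 = s2, s1
    prev = [0] * (len(s2) + 1)  # LCS lengths for the previous row
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    # Everything that is not part of the LCS must be deleted.
    return len(s1) + len(s2) - 2 * lcs


print(deletion_distance_lcs("heat", "hit"))  # 3
```

It should agree with the table-based version on the examples from the problem statement.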

Related

Fast python way to check if subsequence exists in a string and allow for 1-3 mismatches?

I am wondering if there is a fast Python way to check if a subsequence exists in a string, while allowing for 1-3 mismatches.
For example the string: "ATGCTGCTGA"
The subsequence "ATGCC" would be acceptable and return true. Note that there is 1 mismatch (the final C, where the string has a T).
I have tried to use the pairwise2 functions from the Bio package but it is very slow. I have 1,000 strings, and for each string I want to test 10,000 subsequences.
Computation speed would be prioritized here.
Note: I don't mean gaps, but one nucleotide (the letter A, T, C, G) being substituted for another one.

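You can zip the two strings, compare the paired characters, and count how many comparisons are False.
Here that count is just 1. It should not be slow. Note that zip only compares the subsequence against the start of the string.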
st = "ATGCTGCTGA"
s = "ATGCC"
[ x==y for (x,y) in zip(st,s)].count(False)
1

Try:
ALLOWED_MISMATCHES = 3
s = "ATGCTGCTGA"
subsequence = "ATGCC"
for i in range(len(s) - len(subsequence) + 1):
    if sum(a != b for a, b in zip(s[i:], subsequence)) <= ALLOWED_MISMATCHES:
        print("Match")
        break
else:
    print("No Match")
Prints:
Match
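Since the question has many strings and many subsequences to test, the same sliding comparison can be wrapped in a function that stops counting as soon as a window exceeds the allowed number of mismatches. This is only a pure-Python sketch; the name has_approx_match and its parameters are assumptions made for the example, not part of any library.

```python
def has_approx_match(text, pattern, max_mismatches=3):
    """Return True if some window of text differs from pattern
    in at most max_mismatches positions (substitutions only, no gaps)."""
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window is already too different
        else:
            # The inner loop finished without breaking: window is acceptable.
            return True
    return False


print(has_approx_match("ATGCTGCTGA", "ATGCC", max_mismatches=1))  # True
```

The early break mostly helps when max_mismatches is small compared to the pattern length.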

This is possible with regular expressions if you use PyPI's regex module, which supports fuzzy matching:
(?:ATGCC){s<=3:[ATGC]}
(?:ATGCC) - Look for 'ATGCC' as a group, so that the fuzzy constraint applies to the whole pattern and not only to the last character.
{s<=3:[ATGC]} - Allow up to three substitutions, each of which must be a nucleotide character from the class [ATGC].
For example:
import regex as re
s = 'ATGCTGCTGA'
print(bool(re.search(r'(?:ATGCC){s<=3:[ATGC]}', s)))
Prints:
True

Especially if your subsequences all have the same length, you could try k-mer matching with Biotite, a package I am a developer of.
To allow mismatches, you can generate similar subsequences in the matching process using a SimilarityRule. In this example the rule and substitution matrix are set up so that all subsequences with up to MAX_MISMATCH mismatches are enumerated. Since the k-mer matching is implemented in C, it should run quite fast.
import numpy as np
import biotite.sequence as seq
import biotite.sequence.align as align

K = 5
MAX_MISMATCH = 1

database_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCTGCTGA",
        # ...
    ]
]
query_sequences = [
    seq.NucleotideSequence(seq_str) for seq_str in [
        "ATGCC",
        # ...
    ]
]

kmer_table = align.KmerTable.from_sequences(K, database_sequences)

# The alphabet for the substitution matrix
# should not contain ambiguous symbols, such as 'N'
alphabet = seq.NucleotideSequence.alphabet_unamb
matrix_entries = np.full((len(alphabet), len(alphabet)), -1)
np.fill_diagonal(matrix_entries, 0)
matrix = align.SubstitutionMatrix(alphabet, alphabet, matrix_entries)
print(matrix)
print()

# The similarity rule will allow up to MAX_MISMATCH mismatches
similarity_rule = align.ScoreThresholdRule(matrix, -MAX_MISMATCH)

for i, sequence in enumerate(query_sequences):
    matches = kmer_table.match(sequence, similarity_rule)
    # Columns:
    # 0. Sequence position of match in query (first position of k-mer)
    # 1. Index of DB sequence in list
    # 2. Sequence position of match in DB sequence (first position of k-mer)
    print(matches)
    print()
    index_of_matches = np.unique(matches[:, 1])
    for j in index_of_matches:
        print(f"Match of subsequence {i} to sequence {j}")
    print()
Output:
A C G T
A 0 -1 -1 -1
C -1 0 -1 -1
G -1 -1 0 -1
T -1 -1 -1 0
[[0 0 0]]
Match of subsequence 0 to sequence 0
If your subsequences have different lengths but are all quite short (up to about length 10), you could create a KmerTable for each subsequence length (K = length) and match each table against the subsequences of the respective length. If your subsequences are much longer than that, this approach probably will not work due to the memory requirements of a KmerTable.

How to get an array of all possible binary numbers given a binary number of the same length

Goal: I have a string that usually looks like "010", and I need to replace the zeros with 1 in all possible ways, like this: ["010", "110", "111", "011"].
Problem: when I replace the zeros with 1s, I iterate through the characters of the string from left to right and then from right to left, as you can see in the code where I do number = number[::-1]. This method does not actually cover all the possibilities.
I may also need to start from the middle, or use a permutation method, but I'm not sure how to apply that in Python.
Mathematically there should be something like factorial of the number of places/(2)!
A = '0111011110000'
B = '010101'
C = '10000010000001101'
my_list = [A,B,C]
for number in [A,B,C]:
    number = number[::-1]
    for i , n in enumerate(number):
        number = list(number)
        number[i] = '1'
        number = ''.join(number)
        if number not in my_list: my_list.append(number)
for number in [A,B,C]:
    for i , n in enumerate(number):
        number = list(number)
        number[i] = '1'
        number = ''.join(number)
        if number not in my_list: my_list.append(number)
print(len(my_list))
print(my_list)

You can separate out the zeros and then use itertools.product:
from itertools import product
x = '0011'
perm_elements = [('0', '1') if digit == '0' else ('1', ) for digit in x]
print([''.join(x) for x in product(*perm_elements)])
['0011', '0111', '1011', '1111']
If you only need the number of such combinations, and not the list itself - that should just be 2 ** x.count('0')

Well, you will definitely get other answers with traditional implementations of combinations with fixed indexes, but since we're working with just "0" and "1", you can use the following hack:
source = "010100100001100011"
pattern = source.replace("0", "{}")
count = source.count("0")
combinations = [pattern.format(*f"{i:0{count}b}") for i in range(1 << count)]
Basically, we count the zeros in source, then iterate over a range whose limit is 1 << count (one more than the number with that many set bits), and unpack the binary digits of each number into the pattern.
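To make the steps concrete, here is a small trace of the same idea on the short string "010" from the question. The variable name rows is introduced only for this illustration.

```python
source = "010"
pattern = source.replace("0", "{}")  # "{}1{}"
count = source.count("0")            # 2

rows = []
for i in range(1 << count):          # 0, 1, 2, 3
    bits = f"{i:0{count}b}"          # "00", "01", "10", "11"
    rows.append(pattern.format(*bits))
    print(i, bits, rows[-1])

print(rows)  # ['010', '011', '110', '111']
```

Each binary digit of i fills one "{}" placeholder, so every zero position gets every 0/1 choice exactly once.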
It should be slightly faster if we also predefine the format string for the binary conversion:
source = "010100100001100011"
pattern = source.replace("0", "{}")
count = source.count("0")
fmt = f"{{:0{count}b}}"
result = [pattern.format(*fmt.format(i)) for i in range(1 << count)]
Upd. It's not clear whether you need to generate all possible combinations or just count them, so I originally provided code to generate them. If you look closely at my method, I get the number of possible combinations with 1 << count, where count is the number of '0' characters in the source string. So if you only need the number, the code is:
source = "010100100001100011"
number_of_combinations = 1 << source.count("0")
Alternatively, you can use 2 ** source.count("0"), which gives the same number. A binary shift is usually a little cheaper than a power, so I'd still recommend the option I originally suggested, although for a single call the difference is negligible.

We can also use a recursive solution for this problem: we iterate over the string, and whenever we see a "0" we change it to "1" and start another branch on the new string:
s = "010100100001100011"
def perm(s, i=0, result=None):
    # Avoid a shared mutable default argument
    if result is None:
        result = []
    if i < len(s):
        if s[i] == "0":
            t = s[:i]+"1"+s[i+1:]
            result.append(t)
            perm(t, i+1, result)
        perm(s, i+1, result)
    return result
res = [s]
perm(s, 0, res)
print(res)

For each position in the string that has a zero, you can either replace it with a 1 or not. This creates the combinations. So you can progressively build the resulting list of strings by adding the replacements of each '0' position with a '1' based on the previous replacement results:
def zeroTo1(S):
    result = [S]                   # start with no replacement
    for i,b in enumerate(S):
        if b != '0': continue      # only for '0' positions
        result += [r[:i]+'1'+r[i+1:] for r in result]  # add replacements
    return result
print(zeroTo1('010'))
['010', '110', '011', '111']
If you're allowed to use libraries, the product function from itertools can be used to combine the zero replacements directly for you:
from itertools import product
def zeroTo1(S):
    return [*map("".join,product(*("01"[int(b):] for b in S)))]
The tuples of 1s and 0s generated by the product function are assembled into individual strings by mapping the string join function onto its output.

Based on your objective, you can do this to obtain the expected results. The code below sets one position at a time to 1 in each string:
A = '0111011110000'
B = '010'
C = '10000010000001101'
my_list = [A, B, C]
new_list = []
for key, number in enumerate(my_list):
    for key_item, num in enumerate(number):
        item_list = [i for i in number]
        item_list[key_item] = "1"
        new_list.append(''.join(item_list))
print(len(new_list))
print(new_list)

Time complexity of a sliding window question

I'm working on the following problem:
Given a string and a list of words, find all the starting indices of substrings in the given string that are a concatenation of all the given words exactly once without any overlapping of words. It is given that all words are of the same length. For example:
Input: String = "catfoxcat", Words = ["cat", "fox"]
Output: [0, 3]
Explanation: The two substring containing both the words are "catfox" & "foxcat".
My solution is:
def find_word_concatenation(str, words):
    result_indices = []
    period = len(words[0])
    startIndex = 0
    wordCount = {}
    matched = 0
    for w in words:
        if w not in wordCount:
            wordCount[w] = 1
        else:
            wordCount[w] += 1
    for endIndex in range(0, len(str) - period + 1, period):
        rightWord = str[endIndex: endIndex + period]
        if rightWord in wordCount:
            wordCount[rightWord] -= 1
            if wordCount[rightWord] == 0:
                matched += 1
        while matched == len(wordCount):
            if endIndex + period - startIndex == len(words)*period:
                result_indices.append(startIndex)
            leftWord = str[startIndex: startIndex + period]
            if leftWord in wordCount:
                wordCount[leftWord] += 1
                if wordCount[leftWord] > 0:
                    matched -= 1
            startIndex += period
    return result_indices
Can anyone help me figure out its time complexity please?

We should start by drawing a distinction between the time complexity of your code vs what you might actually be looking for.
In your case, you have a set of nested loops (a for and a while). Big O is based on the worst case, so as an upper bound, assume each run of the while loop takes up to n steps. The outer loop also runs up to n times.
O(n) * O(n) = O(n^2)
Which is not very good. Now, while not really so bad with this example, imagine if you were looking for "what a piece of work is man" in all of the Library of Congress or even in the collected works of Shakespeare.
On the plus side, you can refactor your code and get it down quite a bit.
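One way to sanity-check an estimate like this is to count how often each loop body actually runs. The sketch below is an instrumented copy of the function from the question; the name count_loop_iterations, the renamed variables and the two counters are additions made for illustration. Because the start index only ever moves forward by one word per pass of the while loop, the inner count summed over the whole run should stay at or below the outer count, which suggests the n * n product is a loose upper bound rather than the typical cost.

```python
from collections import Counter


def count_loop_iterations(text, words):
    """Sliding window from the question, returning
    (indices, outer_iters, inner_iters) instead of just the indices."""
    period = len(words[0])
    word_count = Counter(words)
    distinct = len(word_count)  # keys never change, so compute once
    result, start, matched = [], 0, 0
    outer_iters = inner_iters = 0
    for end in range(0, len(text) - period + 1, period):
        outer_iters += 1
        right = text[end:end + period]
        if right in word_count:
            word_count[right] -= 1
            if word_count[right] == 0:
                matched += 1
        while matched == distinct:
            inner_iters += 1
            if end + period - start == len(words) * period:
                result.append(start)
            left = text[start:start + period]
            if left in word_count:
                word_count[left] += 1
                if word_count[left] > 0:
                    matched -= 1
            start += period
    return result, outer_iters, inner_iters


print(count_loop_iterations("catfoxcat", ["cat", "fox"]))  # ([0, 3], 3, 2)
```

Each iteration also slices one word of length period, so that per-step cost would need to be included in a full analysis.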

Infinite string

We are given N words, each of length at most 50. All words consist of lowercase letters and digits, and we concatenate all N words to form a bigger string A. An infinite string S is built by performing infinitely many steps on A recursively: in the i-th step, A is concatenated with '$' repeated i times, followed by the reverse of A. For example, let N be 3 and the words be '1', '2' and '3'. After concatenating we get A = 123, and the reverse of A is 321. After the first step we get
A = 123$321, after the second step A = 123$321$$123$321, and so on. The infinite string thus obtained is S. After the i-th step we have to find the character at a given index k. The number of steps can be as large as pow(10,4), and N can also be as large as pow(10,4) with each word at most 50 characters long, so in the worst case the starting string has a length of 5*(10**5). That is huge, so actually building the string step by step won't work.
What I came up with is that the string is a palindrome after the first step, so if I can calculate the positions of the '$' runs, I can calculate any index, since the string before and after each run is a palindrome. I came up with a pattern that
looks like this:
string='123'
k=len(string)
recursion=100
lis=[]
for i in range(1,recursion+1):
    x=(2**(i-1))
    y=x*(k+1)+(x-i)
    lis.append(y)
print(lis[:10])
Output:
[4, 8, 17, 36, 75, 154, 313, 632, 1271, 2550]
Now I have two problems with it. First, I also want to add the positions of the adjacent '$' characters to the list: at position 8, which comes from the 2nd step, there is (2-1) = 1 more '$' at position 9, and likewise at position 17, which comes from the 3rd step, there are (3-1) = 2 more '$' at positions 18 and 19. This continues up to the i-th step, and for that I would have to add a while loop, which would make my algorithm exceed the time limit (TLE):
string='123'
k=len(string)
recursion=100
lis=[]
for i in range(1,recursion+1):
    x=(2**(i-1))
    y=x*(k+1)+(x-i)
    lis.append(y)
    count=1
    while(count<i):
        y=y+1
        lis.append(y)
        count+=1
print(lis[:10])
Output: [4, 8, 9, 17, 18, 19, 36, 37, 38, 39]
The idea behind finding the positions of '$' is that the string before and after it is a palindrome: if the index of the '$' is odd, the characters just before and after it are the last character of the string, and if it is even, they are the first character of the string.

The number of dollar signs in each successive group in S follows this sequence:
1 2 1 3 1 2 1 4 1 2 1 ...
For the i-th group (counting from 1), this corresponds to the number of trailing zeroes that i has in its binary representation, plus one:
bin(i) | dollar signs
--------+-------------
00001 | 1
00010 | 2
00011 | 1
00100 | 3
00101 | 1
00110 | 2
... ...
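The observation in the table can be checked against a brute-force construction for a few steps. In this sketch the helper names dollars_in_group and build_prefix are made up for illustration, and the bit trick (i & -i).bit_length() is just one possible way to compute "trailing zeroes plus one".

```python
import re


def dollars_in_group(i):
    # i & -i isolates the lowest set bit of i; its bit length equals
    # (number of trailing zeroes of i) + 1.
    return (i & -i).bit_length()


def build_prefix(a, steps):
    # Brute-force construction, only feasible for a handful of steps.
    s = a
    for step in range(1, steps + 1):
        s = s + "$" * step + s[::-1]
    return s


observed = [len(run) for run in re.findall(r"\$+", build_prefix("123", 4))]
predicted = [dollars_in_group(i) for i in range(1, len(observed) + 1)]
print(observed)
print(observed == predicted)  # True
```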
With that information you can use a loop that subtracts from k the size of the original words and then subtracts the number of dollars according to the above observation. This way you can detect whether k points at a dollar or within a word.
Once k has been "normalised" to an index within the limits of the original total words length, there only remains a check to see whether the characters are in their normal order or reversed. This depends on the number of iterations done in the above loop, and corresponds to i, i.e. whether it is odd or even.
This leads to this code:
def getCharAt(words, k):
    size = sum([len(word) for word in words])  # sum up the word sizes
    i = 0
    while k >= size:
        i += 1
        # Determine number of dollars: corresponds to one more than the
        # number of trailing zeroes in the binary representation of i
        b = bin(i)
        dollars = len(b) - b.rindex("1")
        k -= size + dollars
    if k < 0:
        return '$'
    if i%2:  # if i is odd, then look in reversed order
        k = size - 1 - k
    # Get the character at the k-th index
    for word in words:
        if k < len(word):
            return word[k]
        k -= len(word)
You would call it like so:
print (getCharAt(['1','2','3'], 13)) # outputs 3
Generator Version
When you need to request multiple characters like that, it might be more interesting to create a generator, which just keeps producing the next character as long as you keep iterating:
def getCharacters(words):
    i = 0
    while True:
        i += 1
        if i%2:
            for word in words:
                yield from word
        else:
            for word in reversed(words):
                yield from reversed(word)
        b = bin(i)
        dollars = len(b) - b.rindex("1")
        yield from "$" * dollars
If for instance you want the first 80 characters from the infinite string that would be built from "a", "b" and "cd", then call it like this:
import itertools
print ("".join(itertools.islice(getCharacters(['a', 'b', 'cd']), 80)))
Output:
abcd$dcba$$abcd$dcba$$$abcd$dcba$$abcd$dcba$$$$abcd$dcba$$abcd$dcba$$$abcd$dcba$

Here is my solution to the problem (the index starts at 1 for findIndex). I am basically counting recursively to find the value of the element at findIndex.
def findInd(k,n,findIndex,orientation):
    temp = k  # no. of characters covered.
    tempRec = n  # no. of dollars to be added
    add_dollars = True  # keeps track of if dollar or reverse of string is to be added.
    while temp < findIndex:
        if add_dollars:
            temp += tempRec
            tempRec += 1
            add_dollars = not add_dollars
        else:
            temp += temp - (tempRec - 1)
            add_dollars = not add_dollars
    # print(temp,findIndex)
    if add_dollars:
        if findIndex <= k:
            if orientation:  # checks if string must be reversed.
                return A[findIndex - 1]
            else:
                return A[::-1][findIndex - 1]  # the string reverses when there is a single dollar so this is necessary
        else:
            # Integer division keeps findIndex an int (temp+tempRec-1 is always even)
            if tempRec-1 == 1:
                return findInd(k,1,findIndex - (temp+tempRec-1)//2,False)  # we send a false for orientation as we want a reverse in case we encounter a single dollar sign.
            else:
                return findInd(k,1,findIndex - (temp+tempRec-1)//2,True)
    else:
        return "$"
A = "123"  # change to suit your need
findIndex = 24  # the index to be found # change to suit your need
k = len(A)  # length of the string.
print(findInd(k,1,findIndex,True))
I think this will satisfy your time constraint also as I do not go through each element.

Checking the same elements in a list : python

Hey there, I'm new to coding and I want to make a program that compares the elements of two lists and returns the elements they have in common.
So far I have written this code, but I'm having a problem with the algorithm: intersection is a set operation, so I can't find the actual common elements with it.
In my code I want to look at each string and find their similarity.
What I've tried to do is:
input="AGA"
input1="ACA"
input_a=input
if len(input1) == len(input):
    i = 0
    while i < len(input1):
        j = 0
        while j < len(input_a):
            input_length = list(input_a)
            if input1[i] != input_a[j]:
                if input1[i] in input_a:
                    print "1st %s" % input_length
                    print "2nd %s" % set(input1)
                    intersection = set(input_length).intersection(set(input1))
                    print intersection
                    total = len(intersection)
                    print (float(total) / float(
                        len(input1))) * 100, "is the similarity percentage"
                    break
                input_length.remove(input_a[i])
            j = j + 1
        break
What is wrong with my code is, I guess, the intersection part.
I want to see the common elements that appear in both input and input1, which should be A, A (two A's in both); however, I get just one A.
How can I improve this code so that it evaluates the common elements as two A's, not one? I really need your help.

I would define similarity as the Hamming distance between the words (which I think is what you want), plus the difference in length when the words are not equally long:
word1 = "AGA"
word2 = "ACAT"
score = sum(a != b for a, b in zip(word1, word2)) + abs(len(word1) - len(word2))

If you just need to find the intersecting elements of 2 flat lists, do:
a = "AGA"
b = "ACA"
c = set(a) & set(b)
print(c)
> {'A'}

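Since a set drops duplicates, the two A's asked about in the question cannot come out of a plain set intersection. A possible alternative, sketched here under the assumption that "common elements" means a multiset intersection, is collections.Counter, whose & operator keeps the minimum count of each letter. The variable names are chosen only for this example.

```python
from collections import Counter

a = "AGA"
b = "ACA"

# Multiset intersection: for each letter, keep the smaller of the two counts.
common = Counter(a) & Counter(b)
common_letters = sorted(common.elements())
print(common_letters)  # ['A', 'A']

# Similarity percentage as in the question: shared letters / length of the input.
similarity = 100 * len(common_letters) / len(b)
print(round(similarity, 2), "is the similarity percentage")
```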