Is this an acceptable algorithm? - python

I've designed an algorithm to find the longest common subsequence. these are steps:
Pick the first letter in the first string.
Look for it in the second string and if its found, Add that letter to
common_subsequence and store its position in index, Otherwise
compare the length of common_subsequence with the length of lcs
and if its greater, asign its value to lcs.
Return to the first string and pick the next letter and repeat the
previous step again, But this time start searching from indexth letter
Repeat this process until there is no letter in the first string to
pick. At the end the value of lcs is the Longest Common
Subsequence.
This is an example:
‫‪
X=A, B, C, B, D, A, B‬‬
‫‪Y=B, D, C, A, B, A‬‬
Pick A in the first string.
Look for A in Y.
Now that there is an A in the second string, append it to common_subsequence.
Return to the first string and pick the next letter that is B.
Look for B in the second string this time starting from the position of A.
There is a B after A so append B to common_subsequence.
Now pick the next letter in the first string that is C. There isn't a C next to B in the second string. So assign the value of common_subsequence to lcs because its length is greater than the length of lcs.
repeat the previous steps until reaching the end of the first string. In the end the value of lcs is the Longest Common Subsequence.
The complexity of this algorithm is theta(n*m).
Here is my implementations:
First algorithm:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed

If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:
You turned in that program.
You were able to explain why there's no possible O(N) or O(N log M) alternative.
You were able to participate in a reasonable discussion about other algorithms that might have a better lower bound (or significantly lower constants, etc.), and the time/space tradeoffs, etc., even if you didn't know the outcome of that discussion in advance.
On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard LP algorithm. It's a standard algorithm for a reason—which you probably want to read up on until you understand. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)
Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.
Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.

maxLength = 0
foundString = ""
for start in xrange(len(str1)-1):
for end in xrange(start+1, len(str1)):
str1Temp = str1[start:end]
maxLengthTemp = len(str1Temp)
if(str2.find(str1Temp)):
if(maxLengthTemp>maxLength):
maxLength = maxLengthTemp
foundString = str1Temp
print maxLength
print foundString

Related

Finding longest alphabetical substring - understanding the concepts in Python

I am completing the Introduction to Computer Science and Programming Using Python Course and am stuck on Week 1: Python Basics - Problem Set 1 - Problem 3.
The problem asks:
Assume s is a string of lower case characters.
Write a program that prints the longest substring of s in which the
letters occur in alphabetical order. For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring. For example, if s = 'abcbcd', then your program should print*
Longest substring in alphabetical order is: abc
There are many posts on stack overflow where people are just chasing or giving the code as the answer. I am looking to understand the concept behind the code as I am new to programming and want gain a better understanding of the basics
I found the following code that seems to answer the question. I understand the basic concept of the for loop, I am having trouble understanding how to use them (for loops) to find alphabetical sequences in a string
Can someone please help me understand the concept of using the for loops in this way.
s = 'cyqfjhcclkbxpbojgkar'
lstring = s[0]
slen = 1
for i in range(len(s)):
for j in range(i,len(s)-1):
if s[j+1] >= s[j]:
if (j+1)-i+1 > slen:
lstring = s[i:(j+1)+1]
slen = (j+1)-i+1
else:
break
print("Longest substring in alphabetical order is: " + lstring)
Let's go through your code step by step.
First we assume that the first character forms the longest sequence. What we will do is try improving this guess.
s = 'cyqfjhcclkbxpbojgkar'
lstring = s[0]
slen = 1
The first loop then picks some index i, it will be the start of a sequence. From there, we will check all existing sequences starting from i by looping over the possible end of a sequence with the nested loop.
for i in range(len(s)): # This loops over the whole string indices
for j in range(i,len(s)-1): # This loops over indices following i
This nested loops will allow us to check every subsequence by picking every combination of i and j.
The first if statement intends to check if that sequence is still an increasing one. If it is not we break the inner loop as we are not interested in that sequence.
if s[j+1] >= s[j]:
...
else:
break
We finally need to check if the current sequence we are looking at is better than our current guess by comparing its length to slen, which is our best guess.
if (j+1)-i+1 > slen:
lstring = s[i:(j+1)+1]
slen = (j+1)-i+1
Improvements
Note that this code is not optimal as it needlessly traverses your string multiple times. You could implement a more efficient approach that traverses the string only once to recover all increasing substrings and then uses max to pick the longuest one.
s = 'cyqfjhcclkbxpbojgkar'
substrings = []
start = 0
end = 1
while end < len(s):
if s[end - 1] > s[end]:
substrings.append(s[start:end])
start = end + 1
end = start + 1
else:
end += 1
lstring = max(substrings, key=len)
print("Longest substring in alphabetical order is: " + lstring)
The list substrings looks like this after the while-loop: ['cy', 'fj', 'ccl', 'bx', 'bo', 'gk']
From these, max(..., key=len) picks the longuest one.

how to make an imputed string to a list, change it to a palindrome(if it isn't already) and reverse it as a string back

A string is palindrome if it reads the same forward and backward. Given a string that contains only lower case English alphabets, you are required to create a new palindrome string from the given string following the rules gives below:
1. You can reduce (but not increase) any character in a string by one; for example you can reduce the character h to g but not from g to h
2. In order to achieve your goal, if you have to then you can reduce a character of a string repeatedly until it becomes the letter a; but once it becomes a, you cannot reduce it any further.
Each reduction operation is counted as one. So you need to count as well how many reductions you make. Write a Python program that reads a string from a user input (using raw_input statement), creates a palindrome string from the given string with the minimum possible number of operations and then prints the palindrome string created and the number of operations needed to create the new palindrome string.
I tried to convert the string to a list first, then modify the list so that should any string be given, if its not a palindrome, it automatically edits it to a palindrome and then prints the result.after modifying the list, convert it back to a string.
c=raw_input("enter a string ")
x=list(c)
y = ""
i = 0
j = len(x)-1
a = 0
while i < j:
if x[i] < x[j]:
a += ord(x[j]) - ord(x[i])
x[j] = x[i]
print x
else:
a += ord(x[i]) - ord(x[j])
x [i] = x[j]
print x
i = i + 1
j = (len(x)-1)-1
print "The number of operations is ",a print "The palindrome created is",( ''.join(x) )
Am i approaching it the right way or is there something I'm not adding up?
Since only reduction is allowed, it is clear that the number of reductions for each pair will be the difference between them. For example, consider the string 'abcd'.
Here the pairs to check are (a,d) and (b,c).
Now difference between 'a' and 'd' is 3, which is obtained by (ord('d')-ord('a')).
I am using absolute value to avoid checking which alphabet has higher ASCII value.
I hope this approach will help.
s=input()
l=len(s)
count=0
m=0
n=l-1
while m<n:
count+=abs(ord(s[m])-ord(s[n]))
m+=1
n-=1
print(count)
This is a common "homework" or competition question. The basic concept here is that you have to find a way to get to minimum values with as few reduction operations as possible. The trick here is to utilize string manipulation to keep that number low. For this particular problem, there are two very simple things to remember: 1) you have to split the string, and 2) you have to apply a bit of symmetry.
First, split the string in half. The following function should do it.
def split_string_to_halves(string):
half, rem = divmod(len(string), 2)
a, b, c = '', '', ''
a, b = string[:half], string[half:]
if rem > 0:
b, c = string[half + 1:], string[rem + 1]
return (a, b, c)
The above should recreate the string if you do a + c + b. Next is you have to convert a and b to lists and map the ord function on each half. Leave the remainder alone, if any.
def convert_to_ord_list(string):
return map(ord, list(string))
Since you just have to do a one-way operation (only reduction, no need for addition), you can assume that for each pair of elements in the two converted lists, the higher value less the lower value is the number of operations needed. Easier shown than said:
def convert_to_palindrome(string):
halfone, halftwo, rem = split_string_to_halves(string)
if halfone == halftwo[::-1]:
return halfone + halftwo + rem, 0
halftwo = halftwo[::-1]
zipped = zip(convert_to_ord_list(halfone), convert_to_ord_list(halftwo))
counter = sum([max(x) - min(x) for x in zipped])
floors = [min(x) for x in zipped]
res = "".join(map(chr, floors))
res += rem + res[::-1]
return res, counter
Finally, some tests:
target = 'ideal'
print convert_to_palindrome(target) # ('iaeai', 6)
target = 'euler'
print convert_to_palindrome(target) # ('eelee', 29)
target = 'ohmygodthisisinsane'
print convert_to_palindrome(target) # ('ehasgidihmhidigsahe', 84)
I'm not sure if this is optimized nor if I covered all bases. But I think this pretty much covers the general concept of the approach needed. Compared to your code, this is clearer and actually works (yours does not). Good luck and let us know how this works for you.

Recursive multi shift caesar cipher - Type error (can't conc None)

First time posting here, newbie to python but really enjoying what I'm doing with it. Working through the MIT Open courseware problems at the moment. Any suggestions of other similar resources?
My problem is with returning a recursive function that's meant to build a list of multi layer shifts as tuples where each tuple is (start location of shift,magnitude of shift). A shift of 5 would change a to f, a-b-c-d-e-f.
Code below for reference but you shouldn't need to read through it all.
text is the multi layered scrambled input, eg: 'grrkxmdffi jwyxechants idchdgyqapufeulij'
def find_best_shifts_rec(wordlist, text, start):
### TODO.
"""Shifts stay in place at least until a space, so shift can only start
with a new word
Base case? all remaining words are real
Need to find the base case which goes at the start of the function to
ensure it's all cleanly returned
Base case could be empty string if true words are removed?
Base case when start = end
take recursive out of loop
use other functions to simplify
"""
shifts = []
shift = 0
for a in range(27):
shift += 1
#creates a string and only shifts from the start point
"""text[:start] + optional add in not sure how it'd help"""
testtext = apply_shift(text[start:],-shift)
testlist = testtext.split()
#counts how many real words were made by the current shift
realwords = 0
for word in testlist:
if word in wordlist:
realwords += 1
else:
#as soon as a non valid word is found i know the shift is not valid
break
if a == 27 and realwords == 0:
print 'here\'s the prob'
#if one or more words are real
if realwords > 0:
#add the location and magnitude of shift
shifts = [(start,shift)]
#recursive call - start needs to be the end of the last valid word
start += testtext.find(testlist[realwords - 1]) + len(testlist[realwords - 1]) + 1
if start >= len(text):
#Base case
return shifts
else:
return shifts + find_best_shifts_rec(wordlist,text,start)
This frequently returns the correct list of tuples, but sometimes, with this text input for example, I get the error:
return shifts + find_best_shifts_rec(wordlist,text,start)
TypeError: can only concatenate list (not "NoneType") to list
this error is for the following at the very bottom of my code
else:
return shifts + find_best_shifts_rec(wordlist,text,start)
From what I gather, one of the recursive calls returns the None value, and then trying to conc this with the list throws up the error. How can I get around this?
EDIT:
By adding to the end:
elif a == 26:
return [()]
I can prevent the type error when it can't find a correct shift. How can I get the entire function to return none?
Below is my attempt at reworking your code to do what you want. Specific changes: dropped the range from 27 to 26 to let the loop exit naturally and return the empty shifts array; combined a and shift and started them at zero so an unencoded text will return [(0, 0)]; the .find() logic will mess up if the same word appears in the list twice, changed it to rindex() as a bandaid (i.e. the last correctly decoded one is where you want to start, not the first).
def find_best_shifts_rec(wordlist, text, start=0):
shifts = []
for shift in range(26):
# creates a string and only shifts from the start point
testtext = apply_shift(text[start:], -shift)
testlist = testtext.split()
# words made by the current shift
realwords = []
for word in testlist:
if word in wordlist:
realwords.append(word)
else: # as soon as an invalid word is found I know the shift is invalid
break
if realwords: # if one or more words are real
# add the location and magnitude of shift
shifts = [(start, shift)]
# recursive call - start needs to be the end of the last valid word
realword = realwords[-1]
start += testtext.rindex(realword) + len(realword) + 1
if start >= len(text):
return shifts # base case
return shifts + find_best_shifts_rec(wordlist, text, start)
return shifts
'lbh fwj hlzkv tbizljb'

Algorithm to compute edit set for transforming one string into another?

I'd like to compute the edits required to transform one string, A, into another string B using only inserts and deletions, with the minimum number of operations required.
So something like "kitten" -> "sitting" would yield a list of operations something like ("delete at 0", "insert 's' at 0", "delete at 4", "insert 'i' at 3", "insert 'g' at 6")
Is there an algorithm to do this, note that I don't want the edit distance, I want the actual edits.
I had an assignment similar to this at one point. Try using an A* variant. Construct a graph of possible 'neighbors' for a given word and search outward using A* with the distance heuristic being the number of letter needed to change in the current word to reach the target. It should be clear as to why this is a good heuristic-it's always going to underestimate accurately. You could think of a neighbor as a word that can be reached from the current word only using one operation. It should be clear that this algorithm will correctly solve your problem optimally with slight modification.
I tried to make something that works, at least for your precise case.
word_before = "kitten"
word_after = "sitting"
# If the strings aren't the same length, we stuff the smallest one with spaces
if len(word_before) > len(word_after):
word_after += " "*(len(word_before)-len(word_after))
elif len(word_before) < len(word_after):
word_before += " "*(len(word_after)-len(word_before))
operations = []
for idx, char in enumerate(word_before):
if char != word_after[idx]:
if char != " ":
operations += ["delete at "+str(idx)]
operations += ["insert '"+word_after[idx]+"' at "+str(idx)]
print(operations)
This should be what you're looking for, using itertools.zip_longest to zip the lists together and iterate over them in pairs compares them and applies the correct operation, it appends the operation to a list at the end of each operation, it compares the lists if they match and breaks out or continues if they don't
from itertools import zip_longest
a = "kitten"
b = "sitting"
def transform(a, b):
ops = []
for i, j in zip_longest(a, b, fillvalue=''):
if i == j:
pass
else:
index = a.index(i)
print(a, b)
ops.append('delete {} '.format(i)) if i != '' else ''
a = a.replace(i, '')
if a == b:
break
ops[-1] += 'insert {} at {},'.format(j, index if i not in b else b.index(j))
return ops
result = transform(a, b)
print(result, ' {} operation(s) was carried out'.format(len(result)))
Since you only have delete and insert operations, this is an instance of the Longest Common Subsequence Problem : https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Indeed, there is a common subsequence of length k in two strings S and T, S of length n and T of length m, if and only only you can transform S into T with m+n-2k insert and delete operations. Think about this as intuition : the order of the letters is preserved both when adding and deleting letters, as well as when taking a subsequence.
EDIT : since you asked for the list of edits, a possible way to do the edits is to first remove all the characters of S not in the common subsequence, and then insert all the characters of T that are not the in common subsequence.

Word segmentation using dynamic programming

So first off I'm very new to Python so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:
We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:
iterative bottom-up version
recursive memorized version
Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.
So I'm having two main issues:
I know that this can and should be done in O(N^2) and I don't think mine is
The lookup table isn't adding all the words it seems such that it can reduce the time complexity
What I'd like:
Any kind of input (better way to do it, something you see wrong in the code, how I can get the lookup table working, how to use the table of booleans to build a sequence of valid words)
Some idea on how to tackle the recursive version although I feel once I am able to solve the iterative solution I will be able to engineer the recursive one from it.
As always thanks for any time and or effort anyone gives this, it is always appreciated.
Here's my attempt:
#dictionary function returns True if word is found in dictionary false otherwise
def dictW(s):
diction = open("diction10k.txt",'r')
for x in diction:
x = x.strip("\n \r")
if s == x:
return True
return False
def iterativeSplit(s):
n = len(s)
i = j = k = 0
A = [-1] * n
word = [""] * n
booly = False
for i in range(0, n):
for j in range(0, i+1):
prefix = s[j:i+1]
for k in range(0, n):
if word[k] == prefix:
#booly = True
A[k] = 1
#print "Array below at index k %d and word = %s"%(k,word[k])
#print A
# print prefix, A[i]
if(((A[i] == -1) or (A[i] == 0))):
if (dictW(prefix)):
A[i] = 1
word[i] = prefix
#print word[i], i
else:
A[i] = 0
for i in range(0, n):
print A[i]
For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.
In particular, segment illustrates the memoization approach:
def segment(text):
"Return a list of words that is the best segmenation of `text`."
memo = dict()
def search(text, prev='<s>'):
if text == '':
return 0.0, []
def candidates():
for prefix, suffix in divide(text):
prefix_score = log10(score(prefix, prev))
pair = (suffix, prefix)
if pair not in memo:
memo[pair] = search(suffix, prefix)
suffix_score, suffix_words = memo[pair]
yield (prefix_score + suffix_score, [prefix] + suffix_words)
return max(candidates())
result_score, result_words = search(clean(text))
return result_words
If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.
Here is the solution in C++. Read and understand the concept, and then implement.
This video is very helpful for understanding DP approach.
One more approach which I feel can help is Trie data structure. It is a better way to solve the above problem.

Categories