Need help understanding this Python Viterbi algorithm - python

I'm trying to convert a Python implementation of the Viterbi algorithm found in this Stack Overflow answer into Ruby. The full script can be found at the bottom of this question with my comments.
Unfortunately I know very little about Python so the translation is is proving more difficult than I'd like. Still, I have made some progress. Right now, the only line which is totally melting my brain is this one:
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
Could someone please explain what it is doing?
Here is the full Python script:
import re
from itertools import groupby
# text will be a compound word such as 'wickedweather'.
def viterbi_segment(text):
probs, lasts = [1.0], [0]
# Iterate over the letters in the compound.
# eg. [w, ickedweather], [wi, ckedweather], and so on.
for i in range(1, len(text) + 1):
# I've no idea what this line is doing and I can't figure out how to split it up?
prob_k, k = max((probs[j] * word_prob(text[j:i]), j) for j in range(max(0, i - max_word_length), i))
# Append values to arrays.
probs.append(prob_k)
lasts.append(k)
words = []
i = len(text)
while 0 < i:
words.append(text[lasts[i]:i])
i = lasts[i]
words.reverse()
return words, probs[-1]
# Calc the probability of a word based on occurrences in the dictionary.
def word_prob(word):
# dictionary.get(key) will return the value for the specified key.
# In this case, thats the number of occurances of thw word in the
# dictionary. The second argument is a default value to return if
# the word is not found.
return dictionary.get(word, 0) / total
# This ensures we ony deal with full words rather than each
# individual letter. Normalize the words basically.
def words(text):
return re.findall('[a-z]+', text.lower())
# This gets us a hash where the keys are words and the values are the
# number of ocurrances in the dictionary.
dictionary = dict((w, len(list(ws)))
# /usr/share/dixt/words is a file of newline delimitated words.
for w, ws in groupby(sorted(words(open('/usr/share/dict/words').read()))))
# Assign the length of the longest word in the dictionary.
max_word_length = max(map(len, dictionary))
# Assign the total number of words in the dictionary. It's a float
# because we're going to divide by it later on.
total = float(sum(dictionary.values()))
# Run the algo over a file of newline delimited compound words.
compounds = words(open('compounds.txt').read())
for comp in compounds:
print comp, ": ", viterbi_segment(comp)

You are looking at a list comprehension.
The expanded version looks something like this:
all_probs = []
for j in range(max(0, i - max_word_length), i):
all_probs.append((probs[j] * word_prob(text[j:i]), j))
prob_k, k = max(all_probs)
I hope that helps explain it. If it doesn't, feel free to edit your question and point to statements that you don't understand.

Here's a working ruby implementation just in case anybody else has some use for it. I translated the list comprehension discussed above in what I believe is the appropriate level of idiomatically unreadable ruby.
def viterbi(text)
probabilities = [1.0]
lasts = [0]
# Iterate over the letters in the compound.
# eg. [h ellodarkness],[he llodarkness],...
(1..(text.length + 1)).each do |i|
prob_k, k = ([0, i - maximum_word_length].max...i).map { |j| [probabilities[j] * word_probability(text[j...i]), j] }.map { |s| s }.max_by(&:first)
probabilities << prob_k
lasts << k
end
words = []
i = text.length
while i.positive?
words << text[lasts[i]...i]
i = lasts[i]
end
words.reverse!
[words, probabilities.last]
end
def word_probability(word)
word_counts[word].to_f / word_counts_sum.to_f
end
def word_counts_sum
#word_counts_sum ||= word_counts.values.sum.to_f
end
def maximum_word_length
#maximum_word_length ||= word_counts.keys.map(&:length).max
end
def word_counts
return #word_counts if #word_counts
#word_counts = {"hello" => 12, "darkness" => 6, "friend" => 79, "my" => 1, "old" => 5}
#word_counts.default = 0
#word_counts
end
puts "Best split is %s with probability %.6f" % viterbi("hellodarknessmyoldfriend")
=> Best split is ["hello", "darkness", "my", "old", "friend"] with probability 0.000002
The main annoyance were the different range definitions in python and ruby (open/closed interval). The algorithm is extremely fast.
It might be advantageous to work with likelihoods instead of probabilities, as the repeated multiplication could lead to underflows and/or accumulating float inaccuracies with longer words.

Related

Python Hamming distance rewrite countless for cycles into recursion

I have created a code generating strings which have hamming distance n from given binary string. Though I'm not able to rewrite this in a simple recursive function. There are several sequences (edit: actually only one, the length change) in the for loops logic but I don't know how to write it into the recursive way (the input for the function is string and distance (int), but in my code the distance is represented by the count of nested for cycles. Could you please help me?
(e.g. for string '00100' and distance 4, code returns ['11010', '11001', '11111', '10011', '01011'],
for string '00100' and distance 3, code returns ['11000', '11110', '11101', '10010', '10001', '10111', '01010', '01001', '01111', '00011'])
def change(string, i):
if string[i] == '1':
return string[:i] + '0' + string[i+1:]
else: return string[:i] + '1' + string[i+1:] #'0' on input
def hamming_distance(number):
array = []
for i in range(len(number)-3): #change first bit
a = number
a = change(a, i) #change bit on index i
for j in range(i+1, len(number)-2): #change second bit
b = a
b = change(b, j)
for k in range(j+1, len(number)-1): #change third bit
c = b
c = change(c, k)
for l in range(k+1, len(number)): #change fourth bit
d = c
d = change(d, l)
array.append(d)
return array
print(hamming_distance('00100'))
Thank you!
Very briefly, you have three base cases:
len(string) == 0: # return; you've made all the needed changes
dist == 0 # return; no more changes to make
len(string) == dist # change all bits and return (no choice remaining)
... and two recursion cases; with and without the change:
ham1 = [str(1-int(string[0])) + alter
for alter in change(string[1:], dist-1) ]
ham2 = [str[0] + alter for alter in change(string[1:], dist) ]
From each call, you return a list of strings that are dist from the input string. On each return, you have to append the initial character to each item in that list.
Is that clear?
CLARIFICATION
The above approach also generates only those that change the string. "Without" the change refers to only the first character. For instance, given input string="000", dist=2, the algorithm will carry out two operations:
'1' + change("00", 2-1) # for each returned string, "10" and "01"
'0' + change("00", 2) # for the only returned string, "11"
Those two ham lines go in the recursion part of your routine. Are you familiar with the structure of such a function? It consists of base cases and recursion cases.

how to make an imputed string to a list, change it to a palindrome(if it isn't already) and reverse it as a string back

A string is palindrome if it reads the same forward and backward. Given a string that contains only lower case English alphabets, you are required to create a new palindrome string from the given string following the rules gives below:
1. You can reduce (but not increase) any character in a string by one; for example you can reduce the character h to g but not from g to h
2. In order to achieve your goal, if you have to then you can reduce a character of a string repeatedly until it becomes the letter a; but once it becomes a, you cannot reduce it any further.
Each reduction operation is counted as one. So you need to count as well how many reductions you make. Write a Python program that reads a string from a user input (using raw_input statement), creates a palindrome string from the given string with the minimum possible number of operations and then prints the palindrome string created and the number of operations needed to create the new palindrome string.
I tried to convert the string to a list first, then modify the list so that should any string be given, if its not a palindrome, it automatically edits it to a palindrome and then prints the result.after modifying the list, convert it back to a string.
c=raw_input("enter a string ")
x=list(c)
y = ""
i = 0
j = len(x)-1
a = 0
while i < j:
if x[i] < x[j]:
a += ord(x[j]) - ord(x[i])
x[j] = x[i]
print x
else:
a += ord(x[i]) - ord(x[j])
x [i] = x[j]
print x
i = i + 1
j = (len(x)-1)-1
print "The number of operations is ",a print "The palindrome created is",( ''.join(x) )
Am i approaching it the right way or is there something I'm not adding up?
Since only reduction is allowed, it is clear that the number of reductions for each pair will be the difference between them. For example, consider the string 'abcd'.
Here the pairs to check are (a,d) and (b,c).
Now difference between 'a' and 'd' is 3, which is obtained by (ord('d')-ord('a')).
I am using absolute value to avoid checking which alphabet has higher ASCII value.
I hope this approach will help.
s=input()
l=len(s)
count=0
m=0
n=l-1
while m<n:
count+=abs(ord(s[m])-ord(s[n]))
m+=1
n-=1
print(count)
This is a common "homework" or competition question. The basic concept here is that you have to find a way to get to minimum values with as few reduction operations as possible. The trick here is to utilize string manipulation to keep that number low. For this particular problem, there are two very simple things to remember: 1) you have to split the string, and 2) you have to apply a bit of symmetry.
First, split the string in half. The following function should do it.
def split_string_to_halves(string):
half, rem = divmod(len(string), 2)
a, b, c = '', '', ''
a, b = string[:half], string[half:]
if rem > 0:
b, c = string[half + 1:], string[rem + 1]
return (a, b, c)
The above should recreate the string if you do a + c + b. Next is you have to convert a and b to lists and map the ord function on each half. Leave the remainder alone, if any.
def convert_to_ord_list(string):
return map(ord, list(string))
Since you just have to do a one-way operation (only reduction, no need for addition), you can assume that for each pair of elements in the two converted lists, the higher value less the lower value is the number of operations needed. Easier shown than said:
def convert_to_palindrome(string):
halfone, halftwo, rem = split_string_to_halves(string)
if halfone == halftwo[::-1]:
return halfone + halftwo + rem, 0
halftwo = halftwo[::-1]
zipped = zip(convert_to_ord_list(halfone), convert_to_ord_list(halftwo))
counter = sum([max(x) - min(x) for x in zipped])
floors = [min(x) for x in zipped]
res = "".join(map(chr, floors))
res += rem + res[::-1]
return res, counter
Finally, some tests:
target = 'ideal'
print convert_to_palindrome(target) # ('iaeai', 6)
target = 'euler'
print convert_to_palindrome(target) # ('eelee', 29)
target = 'ohmygodthisisinsane'
print convert_to_palindrome(target) # ('ehasgidihmhidigsahe', 84)
I'm not sure if this is optimized nor if I covered all bases. But I think this pretty much covers the general concept of the approach needed. Compared to your code, this is clearer and actually works (yours does not). Good luck and let us know how this works for you.

Algorithm to compute edit set for transforming one string into another?

I'd like to compute the edits required to transform one string, A, into another string B using only inserts and deletions, with the minimum number of operations required.
So something like "kitten" -> "sitting" would yield a list of operations something like ("delete at 0", "insert 's' at 0", "delete at 4", "insert 'i' at 3", "insert 'g' at 6")
Is there an algorithm to do this, note that I don't want the edit distance, I want the actual edits.
I had an assignment similar to this at one point. Try using an A* variant. Construct a graph of possible 'neighbors' for a given word and search outward using A* with the distance heuristic being the number of letter needed to change in the current word to reach the target. It should be clear as to why this is a good heuristic-it's always going to underestimate accurately. You could think of a neighbor as a word that can be reached from the current word only using one operation. It should be clear that this algorithm will correctly solve your problem optimally with slight modification.
I tried to make something that works, at least for your precise case.
word_before = "kitten"
word_after = "sitting"
# If the strings aren't the same length, we stuff the smallest one with spaces
if len(word_before) > len(word_after):
word_after += " "*(len(word_before)-len(word_after))
elif len(word_before) < len(word_after):
word_before += " "*(len(word_after)-len(word_before))
operations = []
for idx, char in enumerate(word_before):
if char != word_after[idx]:
if char != " ":
operations += ["delete at "+str(idx)]
operations += ["insert '"+word_after[idx]+"' at "+str(idx)]
print(operations)
This should be what you're looking for, using itertools.zip_longest to zip the lists together and iterate over them in pairs compares them and applies the correct operation, it appends the operation to a list at the end of each operation, it compares the lists if they match and breaks out or continues if they don't
from itertools import zip_longest
a = "kitten"
b = "sitting"
def transform(a, b):
ops = []
for i, j in zip_longest(a, b, fillvalue=''):
if i == j:
pass
else:
index = a.index(i)
print(a, b)
ops.append('delete {} '.format(i)) if i != '' else ''
a = a.replace(i, '')
if a == b:
break
ops[-1] += 'insert {} at {},'.format(j, index if i not in b else b.index(j))
return ops
result = transform(a, b)
print(result, ' {} operation(s) was carried out'.format(len(result)))
Since you only have delete and insert operations, this is an instance of the Longest Common Subsequence Problem : https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
Indeed, there is a common subsequence of length k in two strings S and T, S of length n and T of length m, if and only only you can transform S into T with m+n-2k insert and delete operations. Think about this as intuition : the order of the letters is preserved both when adding and deleting letters, as well as when taking a subsequence.
EDIT : since you asked for the list of edits, a possible way to do the edits is to first remove all the characters of S not in the common subsequence, and then insert all the characters of T that are not the in common subsequence.

Word segmentation using dynamic programming

So first off I'm very new to Python so if I'm doing something awful I'm prefacing this post with a sorry. I've been assigned this problem:
We want to devise a dynamic programming solution to the following problem: there is a string of characters which might have been a sequence of words with all the spaces removed, and we want to find a way, if any, in which to insert spaces that separate valid English words. For example, theyouthevent could be from “the you the vent”, “the youth event” or “they out he vent”. If the input is theeaglehaslande, then there’s no such way. Your task is to implement a dynamic programming solution in two separate ways:
iterative bottom-up version
recursive memorized version
Assume that the original sequence of words had no other punctuation (such as periods), no capital letters, and no proper names - all the words will be available in a dictionary file that will be provided to you.
So I'm having two main issues:
I know that this can and should be done in O(N^2) and I don't think mine is
The lookup table isn't adding all the words it seems such that it can reduce the time complexity
What I'd like:
Any kind of input (better way to do it, something you see wrong in the code, how I can get the lookup table working, how to use the table of booleans to build a sequence of valid words)
Some idea on how to tackle the recursive version although I feel once I am able to solve the iterative solution I will be able to engineer the recursive one from it.
As always thanks for any time and or effort anyone gives this, it is always appreciated.
Here's my attempt:
#dictionary function returns True if word is found in dictionary false otherwise
def dictW(s):
diction = open("diction10k.txt",'r')
for x in diction:
x = x.strip("\n \r")
if s == x:
return True
return False
def iterativeSplit(s):
n = len(s)
i = j = k = 0
A = [-1] * n
word = [""] * n
booly = False
for i in range(0, n):
for j in range(0, i+1):
prefix = s[j:i+1]
for k in range(0, n):
if word[k] == prefix:
#booly = True
A[k] = 1
#print "Array below at index k %d and word = %s"%(k,word[k])
#print A
# print prefix, A[i]
if(((A[i] == -1) or (A[i] == 0))):
if (dictW(prefix)):
A[i] = 1
word[i] = prefix
#print word[i], i
else:
A[i] = 0
for i in range(0, n):
print A[i]
For another real-world example of how to do English word segmentation, look at the source of the Python wordsegment module. It's a little more sophisticated because it uses word and phrase frequency tables but it illustrates the memoization approach.
In particular, segment illustrates the memoization approach:
def segment(text):
"Return a list of words that is the best segmenation of `text`."
memo = dict()
def search(text, prev='<s>'):
if text == '':
return 0.0, []
def candidates():
for prefix, suffix in divide(text):
prefix_score = log10(score(prefix, prev))
pair = (suffix, prefix)
if pair not in memo:
memo[pair] = search(suffix, prefix)
suffix_score, suffix_words = memo[pair]
yield (prefix_score + suffix_score, [prefix] + suffix_words)
return max(candidates())
result_score, result_words = search(clean(text))
return result_words
If you replaced the score function so that it returned "1" for a word in your dictionary and "0" if not then you would simply enumerate all positively scored candidates for your answer.
Here is the solution in C++. Read and understand the concept, and then implement.
This video is very helpful for understanding DP approach.
One more approach which I feel can help is Trie data structure. It is a better way to solve the above problem.

Is this an acceptable algorithm?

I've designed an algorithm to find the longest common subsequence. these are steps:
Pick the first letter in the first string.
Look for it in the second string and if its found, Add that letter to
common_subsequence and store its position in index, Otherwise
compare the length of common_subsequence with the length of lcs
and if its greater, asign its value to lcs.
Return to the first string and pick the next letter and repeat the
previous step again, But this time start searching from indexth letter
Repeat this process until there is no letter in the first string to
pick. At the end the value of lcs is the Longest Common
Subsequence.
This is an example:
‫‪
X=A, B, C, B, D, A, B‬‬
‫‪Y=B, D, C, A, B, A‬‬
Pick A in the first string.
Look for A in Y.
Now that there is an A in the second string, append it to common_subsequence.
Return to the first string and pick the next letter that is B.
Look for B in the second string this time starting from the position of A.
There is a B after A so append B to common_subsequence.
Now pick the next letter in the first string that is C. There isn't a C next to B in the second string. So assign the value of common_subsequence to lcs because its length is greater than the length of lcs.
repeat the previous steps until reaching the end of the first string. In the end the value of lcs is the Longest Common Subsequence.
The complexity of this algorithm is theta(n*m).
Here is my implementations:
First algorithm:
import time
def lcs(xstr, ystr):
if not (xstr and ystr): return # if string is empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
for i in xrange(len(xstr)):
cs = '' # common subsequence
start = 0 # start position in ystr
for item in xstr[i:]:
index = ystr.find(item, start) # position at the common letter
if index != -1: # if common letter has found
cs += item # add common letter to the cs
start = index + 1
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
The same algorithm using hash table:
import time
from collections import defaultdict
def lcs(xstr, ystr):
if not (xstr and ystr): return # if strings are empty
lcs = [''] # longest common subsequence
lcslen = 0 # length of longest common subsequence so far
location = defaultdict(list) # keeps track of items in the ystr
i = 0
for k in ystr:
location[k].append(i)
i += 1
for i in xrange(len(xstr)):
cs = '' # common subsequence
index = -1
reached_index = defaultdict(int)
for item in xstr[i:]:
for new_index in location[item][reached_index[item]:]:
reached_index[item] += 1
if index < new_index:
cs += item # add item to the cs
index = new_index
break
if index == len(ystr) - 1: break # if reached end of the ystr
# update lcs and lcslen if found better cs
if len(cs) > lcslen: lcs, lcslen = [cs], len(cs)
elif len(cs) == lcslen: lcs.append(cs)
return lcs
file1 = open('/home/saji/file1')
file2 = open('/home/saji/file2')
xstr = file1.read()
ystr = file2.read()
start = time.time()
lcss = lcs(xstr, ystr)
elapsed = (time.time() - start)
print elapsed
If your professor wants you to invent your own LCS algorithm, you're done. Your algorithm is not the most optimal one ever created, but it's in the right complexity class, you clearly understand it, and you clearly didn't copy your implementation from the internet. You might want to be prepared to defend your algorithm, or discuss alternatives. If I were your prof, I'd give you an A if:
You turned in that program.
You were able to explain why there's no possible O(N) or O(N log M) alternative.
You were able to participate in a reasonable discussion about other algorithms that might have a better lower bound (or significantly lower constants, etc.), and the time/space tradeoffs, etc., even if you didn't know the outcome of that discussion in advance.
On the other hand, if your professor wants you to pick one of the well-known algorithms and write your own implementation, you probably want to use the standard LP algorithm. It's a standard algorithm for a reason—which you probably want to read up on until you understand. (Even if it isn't going to be on the test, you're taking this class to learn, not just to impress the prof, right?)
Wikipedia has pseudocode for a basic implementation, then English-language descriptions of common optimizations. I'm pretty sure that writing your own Python code based on what's on that page wouldn't count as plagiarism, or even as a trivial port, especially if you can demonstrate that you understand what your code is doing, and why, and why it's a good algorithm. Plus, you're writing it in Python, which has much better ways to memoize than what's demonstrated in that article, so if you understand how it works, your code should actually be substantially better than what Wikipedia gives you.
Either way, as I suggested in the comments, I'd read A survey of longest common subsequence algorithms by Bergroth, Hakonen, and Raita, and search for similar papers online.
maxLength = 0
foundString = ""
for start in xrange(len(str1)-1):
for end in xrange(start+1, len(str1)):
str1Temp = str1[start:end]
maxLengthTemp = len(str1Temp)
if(str2.find(str1Temp)):
if(maxLengthTemp>maxLength):
maxLength = maxLengthTemp
foundString = str1Temp
print maxLength
print foundString

Categories