I am trying to search a list (DB) for possible matches of fragments of text. For instance, I have a DB containing the text "evilman". I want to take user input, search for any possible matches in the DB, and report each match with a confidence. If the user inputs "hello", there are no possible matches. If the user inputs "evil", the possible match is "evilman" with a confidence of 57% (4 out of 7 characters match), and so on.
However, I also want a way to match input text such as "evxxman". 5 out of 7 characters of "evxxman" match the text "evilman" in the DB, but a simple check in Python reports no match, since the in operator only finds text that matches consecutively. I hope that makes sense. Thanks.
Following is my code:
db = []
possible_signs = []
count = 0  # was incremented below without being initialized
db.append("evilman")
text = raw_input()
for s in db:
    if text in s:
        if len(text) >= len(s)/2:
            possible_signs.append(s)
            count += 1
            confidence = (float(len(text)) / float(len(s))) * 100
            print "Confidence:", '%.2f' %(confidence), "<possible match:>", possible_signs[0]
This first version seems to comply with your examples. It makes the strings "slide" against each other and counts the number of identical characters at each offset.
The ratio is computed by dividing the character count by the reference string length; the function returns the maximum over all offsets.
Call it for each string in your DB.
def commonChars(txt, ref):
    txtLen = len(txt)
    refLen = len(ref)
    r = 0
    for i in range(refLen + (txtLen - 1)):
        rStart = abs(min(0, txtLen - i - 1))
        tStart = txtLen -i - 1 if i < txtLen else 0
        l = min(txtLen - tStart, refLen - rStart)
        c = 0
        for j in range(l):
            if txt[tStart + j] == ref[rStart + j]:
                c += 1
        r = max(r, c / refLen)
    return r
print(commonChars('evxxman', 'evilman')) # 0.7142857142857143
print(commonChars('evil', 'evilman')) # 0.5714285714285714
print(commonChars('man', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'man')) # 1.0
This second version produces the same results, but uses the difflib module mentioned in other answers.
It computes the matching blocks, sums their lengths, and divides the sum by the reference length.
import difflib
def commonBlocks(txt, ref):
    matcher = difflib.SequenceMatcher(a=txt, b=ref)
    matchingBlocks = matcher.get_matching_blocks()
    matchingCount = sum([b.size for b in matchingBlocks])
    return matchingCount / len(ref)
print(commonBlocks('evxxman', 'evilman')) # 0.7142857142857143
print(commonBlocks('evxxxxman', 'evilman')) # 0.7142857142857143
As shown by the calls above, the behavior is slightly different: "holes" between matching blocks are ignored and do not change the final ratio.
For finding matches with a quality-estimation, have a look at difflib.SequenceMatcher.ratio and friends - these functions might not be the fastest match-checkers but they are easy to use.
Example copied from the difflib docs:
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
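To connect this to the original question, here is a minimal sketch (Python 3) that scores every DB entry with SequenceMatcher.ratio and keeps the ones above a threshold. The function name rank_matches and the cutoff parameter are made up for illustration. Note that ratio() is 2*M/T (M matched characters, T total length of both strings), so the scores are not the same as the "matched characters / DB entry length" confidence described in the question.

```python
from difflib import SequenceMatcher


def rank_matches(query, db, cutoff=0.5):
    """Return (score, entry) pairs with score >= cutoff, best first."""
    scored = []
    for entry in db:
        # autojunk=False turns off difflib's automatic junk heuristic, which
        # only kicks in for sequences of 200 or more items.
        score = SequenceMatcher(None, query, entry, autojunk=False).ratio()
        if score >= cutoff:
            scored.append((score, entry))
    return sorted(scored, reverse=True)


db = ["evilman", "hello"]
for score, entry in rank_matches("evxxman", db):
    print("%.2f %s" % (score, entry))
```

With this sketch "evxxman" still finds "evilman", while "hello" falls below the cutoff.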
Based on your description and examples, it seems to me that you're actually looking for something like the Levenshtein (or edit) distance. Note that it does not quite give the scores you specify, but I think it gives the scores you actually want.
There are several packages implementing this efficiently, e.g., distance:
In [1]: import distance
In [2]: distance.levenshtein('evilman', 'hello')
Out[2]: 6L
In [3]: distance.levenshtein('evilman', 'evil')
Out[3]: 3L
In [4]: distance.levenshtein('evilman', 'evxxman')
Out[4]: 2L
Note that the library contains several measures of similarity, e.g., jaccard and sorensen return a normalized value per default:
>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143
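If an external package is not an option, the distance can be turned into a confidence by hand. The following is only a sketch under my own assumptions (Python 3; the names levenshtein and similarity are illustrative and are not the API of the distance package): it normalizes the edit distance by the longer of the two lengths. For the examples in the question this happens to reproduce the requested scores, e.g. 4/7 for "evil" and 5/7 for "evxxman".

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute (free if equal)
        prev = cur
    return prev[-1]


def similarity(a, b):
    """1.0 for identical strings, 0.0 when every position must be edited."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / float(longest)


for user_input in ("hello", "evil", "evxxman"):
    print(user_input, round(similarity(user_input, "evilman"), 2))
```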
Create a while loop and track two iterators, one for your key word ("evil") and one for your query word ("evilman"). Here is some pseudocode:
key = "evil"
query = "evilman"
key_iterator = 0
query_iterator = 0
confidence_score = 0
while( key_iterator < key.length && query_iterator < query.length ) {
    if (key[key_iterator] == query[query_iterator]) {
        confidence_score++
        key_iterator++
    }
    query_iterator++
}
// If we didn't reach the end of the key
if (key_iterator != key.length) {
    confidence_score = 0
}
print ("Confidence: " + confidence_score + " out of " + query.length)
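The pseudocode above translates fairly directly to Python. This is a sketch of that translation (Python 3, with the hypothetical name subsequence_confidence); it returns the score as a fraction instead of printing it. As in the pseudocode, the score is 0 whenever the key is not found as an in-order subsequence, so an input like "evxxman" scores 0 against "evilman".

```python
def subsequence_confidence(key, query):
    """Fraction of `query` matched when `key` occurs in it as a subsequence.

    Returns 0.0 if some character of `key` cannot be matched in order.
    """
    if not query:
        return 0.0
    key_pos = 0
    for ch in query:
        # Advance in the key only when the current characters agree.
        if key_pos < len(key) and key[key_pos] == ch:
            key_pos += 1
    if key_pos != len(key):
        return 0.0  # the end of the key was never reached
    return key_pos / float(len(query))


print(subsequence_confidence("evil", "evilman"))
print(subsequence_confidence("evxxman", "evilman"))
```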
I want to calculate the largest covering of a string from many sets of substrings.
All strings in this problem are lowercased, and contain no whitespace or unicode strangeness.
So, given a string: abcdef, and two groups of strings: ['abc', 'bc'], ['abc', 'd'], the second group (['abc', 'd']) covers more of the original string. Order matters for exact matches, so the term group ['fe', 'cba'] would not match the original string.
I have a large collection of strings, and a large collection of terms-groups. So I would like a bit faster implementation if possible.
I've tried the following in Python for an example. I've used Pandas and Numpy because I thought it may speed it up a bit. I'm also running into an over-counting problem as you'll see below.
import re
import pandas as pd
import numpy as np
my_strings = pd.Series(['foobar', 'foofoobar0', 'apple'])
term_sets = pd.Series([['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']])
# For each string, calculate best proportion of coverage:
# Try 1: Create a function for each string.
def calc_coverage(mystr, term_sets):
    # Total length of string
    total_chars = len(mystr)
    # For each term set, sum up length of any match. Problem: this over counts when matches overlap.
    total_coverage = term_sets.apply(lambda x: np.sum([len(term) if re.search(term, mystr) else 0 for term in x]))
    # Fraction of String covered. Note the above over-counting can result in fractions > 1.0.
    coverage_proportion = total_coverage/total_chars
    return coverage_proportion.argmax(), coverage_proportion.max()
my_strings.apply(lambda x: calc_coverage(x, term_sets))
This results in:
0 (0, 0.8333333333333334)
1 (0, 0.5)
2 (2, 1.2)
This presents some problems. The biggest problem I see is that overlapping terms are counted separately, which results in the 1.2, or 120%, coverage.
I think the ideal output would be:
0 (0, 0.8333333333333334)
1 (0, 0.8)
2 (3, 1.0)
I think I can write a double for loop and brute force it. But this problem feels like there's a more optimal solution. Or a small change on what I've done so far to get it to work.
Note: If there is a tie- returning the first is fine. I'm not too interested in returning all best matches.
Ok, this is not optimized but let's start fixing the results. I believe you have two issues: one is the over-counting in apple; the other is the under-counting in foofoobar0.
Solving the second issue when the term set is composed of two non-overlapping terms (or just one term), is easy:
sum([s.count(t)*len(t) for t in ts])
will do the job.
Similarly, when we have two overlapping terms, we will just take the "best" one:
max([s.count(t)*len(t) for t in ts])
So we are left with the problem of recognizing when the two terms overlap. I don't even consider term sets with more than two terms, because the solution will already be painfully slow with two :(
Let's define a function to test for overlapping:
def terms_overlap(s, ts):
    if ts[0] not in s or ts[1] not in s:
        return False
    # Look for an occurrence of ts[1] that starts inside an occurrence of ts[0].
    start = 0
    while (pos_0 := s.find(ts[0], start)) > -1:
        if (pos_1 := s.find(ts[1], pos_0)) > -1:
            if pos_0 <= pos_1 < pos_0 + len(ts[0]):
                return True
        start = pos_0 + 1  # resume the search just after this occurrence's start
    # Same check with the roles of the two terms swapped.
    start = 0
    while (pos_1 := s.find(ts[1], start)) > -1:
        if (pos_0 := s.find(ts[0], pos_1)) > -1:
            if pos_1 <= pos_0 < pos_1 + len(ts[1]):
                return True
        start = pos_1 + 1
    return False
With that function ready we can finally do:
def calc_coverage(strings, tsets):
    for xs, s in enumerate(strings):
        best_cover = 0
        best_ts = 0
        for xts, ts in enumerate(tsets):
            if len(ts) == 1:
                cover = s.count(ts[0])*len(ts[0])
            elif len(ts) == 2:
                if terms_overlap(s, ts):
                    cover = max([s.count(t)*len(t) for t in ts])
                else:
                    cover = sum([s.count(t)*len(t) for t in ts])
            else:
                raise ValueError('Cannot handle term sets of more than two terms')
            if cover > best_cover:
                best_cover = cover
                best_ts = xts
        print(f'{xs}: {s:15} {best_cover:2d} / {len(s):2d} = {best_cover/len(s):8.3f} ({best_ts}: {tsets[best_ts]})')
>>> calc_coverage(my_strings, term_sets)
0: foobar 5 / 6 = 0.833 (0: ['foo', 'ba'])
1: foofoobar0 8 / 10 = 0.800 (0: ['foo', 'ba'])
2: apple 5 / 5 = 1.000 (3: ['apple'])
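A different way to avoid both the over-counting and the two-term limit is to mark which characters of the string are covered and count the marks at the end. This is only a sketch (covered_fraction and best_term_set are names I made up, and it is not tuned for speed): it handles any number of terms per set and counts overlapping occurrences once.

```python
def covered_fraction(s, terms):
    """Fraction of characters of `s` covered by at least one term occurrence."""
    if not s:
        return 0.0
    covered = [False] * len(s)
    for term in terms:
        if not term:
            continue
        start = s.find(term)
        while start != -1:
            for k in range(start, start + len(term)):
                covered[k] = True
            # Step by one so overlapping occurrences of the same term are found.
            start = s.find(term, start + 1)
    return sum(covered) / float(len(s))


def best_term_set(s, term_sets):
    """Return (index, fraction) of the first term set with the highest coverage."""
    fractions = [covered_fraction(s, ts) for ts in term_sets]
    best = max(range(len(fractions)), key=fractions.__getitem__)
    return best, fractions[best]


term_sets = [['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']]
for s in ['foobar', 'foofoobar0', 'apple']:
    print(s, best_term_set(s, term_sets))
```

On the sample data this gives the "ideal output" from the question: term set 0 for the first two strings and term set 3 for 'apple'. Ties resolve to the first term set because max returns the first maximal index.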
** I modified the entire question **
I have the example lists specified below. I want to find out whether two values come from the same list and, if so, which list both values come from.
list1 = ['a','b','c','d','e']
list2 = ['f','g','h','i','j']
c = 'b'
d = 'e'
I used a for loop to check whether the values exist in the lists, but I am not sure how to find out which list a value actually comes from.
for x,y in zip(list1,list2):
    if c and d in x or y:
        print(True)
Please advise if there is a workaround.
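One thing to note about the loop above: the condition "c and d in x or y" is parsed as "(c and (d in x)) or y", and zip pairs up single elements of the two lists, so the test is true for any non-empty y. A minimal sketch of one way to get the desired answer, assuming the lists are kept in a dict keyed by name (find_common_list and the dict layout are my own illustration, not something from the question):

```python
def find_common_list(named_lists, first, second):
    """Return the name of the first list containing both values, else None."""
    for name, values in named_lists.items():
        if first in values and second in values:
            return name
    return None


lists = {
    "list1": ['a', 'b', 'c', 'd', 'e'],
    "list2": ['f', 'g', 'h', 'i', 'j'],
}
print(find_common_list(lists, 'b', 'e'))  # list1
print(find_common_list(lists, 'b', 'f'))  # None
```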
First, you might want to inspect the distribution of the size values to see where you can improve the result with the least effort, like this:
df_inspect = df.copy()
df_inspect["size.value"] = df_inspect["size.value"].map(lambda x: ''.join(y.upper() for y in x if y.isalpha()))
df_inspect = df_inspect.groupby("size.value").size().sort_values(ascending=False)
Then create a solution for the most frequently occurring size category, here "Wide":
long = "adasda, 9.5 W US"
short = "9.5 Wide"
def get_intersection(s1, s2):
    # Return the longest substring of s1 that also occurs in s2.
    res = ''
    l_s1 = len(s1)
    for i in range(l_s1):
        for j in range(i + 1, l_s1 + 1):  # + 1 so that slices can reach the end of s1
            t = s1[i:j]
            if t in s2 and len(t) > len(res):
                res = t
    return res
print(len(get_intersection(long, short)) / len(short) >= 0.6)
Then apply the solution to the dataframe, row by row:
df["defective_attributes"] = df.apply(lambda x: len(get_intersection(x["item_name.value"], x["size.value"])) / len(x["size.value"]) >= 0.6, axis=1)
Basically, get_intersection searches for the longest common substring of the item name and the size value. The code then takes the length of that substring and treats the row as not defective if at least 60% of the characters of size.value also appear, contiguously, in item_name.value.
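The standard library can find the same longest common substring. The sketch below uses difflib.SequenceMatcher.find_longest_match; the wrapper name longest_shared_run is hypothetical, and autojunk=False is my choice so that no characters are treated as junk on long inputs.

```python
from difflib import SequenceMatcher


def longest_shared_run(s1, s2):
    """Longest contiguous block of characters that appears in both strings."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    match = matcher.find_longest_match(0, len(s1), 0, len(s2))
    return s1[match.a:match.a + match.size]


item_name = "adasda, 9.5 W US"
size_value = "9.5 Wide"
run = longest_shared_run(item_name, size_value)
print(repr(run), len(run) / float(len(size_value)) >= 0.6)
```

For the example strings the shared run is '9.5 W', which is 5 of the 8 characters of the size value.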
I want to get the length of a string including a part of the string that represents its own length without padding or using structs or anything like that that forces fixed lengths.
So for example I want to be able to take this string as input:
"A string|"
And return this:
"A string|11"
On the basis of the OP tolerating such an approach (and to provide an implementation technique for the eventual python answer), here's a solution in Java.
final String s = "A String|";
int n = s.length(); // `length()` returns the length of the string.
String t; // the result
do {
    t = s + n; // append the stringified n to the original string
    if (n == t.length()){
        return t; // string length no longer changing; we're good.
    }
    n = t.length(); // n must hold the total length
} while (true); // round again
The problem, of course, is that appending n changes the string length. But luckily, the length only ever increases or stays the same, so it converges very quickly, due to the logarithmic nature of the length of n. In this particular case, the attempted values of n are 9, 10, and 11. And that's a pernicious case.
A simple solution is:
def addlength(string):
    n1=len(string)
    n2=len(str(n1))+n1
    n2 += len(str(n2))-len(str(n1)) # a carry can arise
    return string+str(n2)
This works because a possible carry increases the length by at most one.
Examples :
In [2]: addlength('a'*8)
Out[2]: 'aaaaaaaa9'
In [3]: addlength('a'*9)
Out[3]: 'aaaaaaaaa11'
In [4]: addlength('a'*99)
Out[4]: 'aaaaa...aaa102'
In [5]: addlength('a'*999)
Out[5]: 'aaaa...aaa1003'
Here is a simple Python port of Bathsheba's answer:
def str_len(s):
    n = len(s)
    t = ''
    while True:
        t = s + str(n)
        if n == len(t):
            return t
        n = len(t)
This is a much more clever and simple approach than anything I was thinking of trying!
Suppose you had s = 'abcdefgh|'. On the first pass, n = 9 and t = 'abcdefgh|9'.
Since n != len(t) (which is now 10), n becomes 10 and the loop goes around again: t = 'abcdefgh|' + str(n) with str(n) = '10', giving 'abcdefgh|10', which is still not quite right because its length is 11. So n becomes 11, t = 'abcdefgh|11', and now n == len(t), so that is the answer. Pretty clever solution!
It is a tricky one, but I think I've figured it out.
Done in a hurry in Python 2.7, please fully test - this should handle strings up to 996 characters (from 997 characters on, the appended number needs four digits):
import sys
orig = sys.argv[1]
origLen = len(orig)
if (origLen >= 98):
    extra = str(origLen + 3)
elif (origLen >= 8):
    extra = str(origLen + 2)
else:
    extra = str(origLen + 1)
final = orig + extra
print final
Results of very brief testing
C:\Users\PH\Desktop>python test.py "tiny|"
tiny|6
C:\Users\PH\Desktop>python test.py "myString|"
myString|11
C:\Users\PH\Desktop>python test.py "myStringWith98Characters.........................................................................|"
myStringWith98Characters.........................................................................|101
Just find the length of the string. Then iterate through each value of the number of digits the length of the resulting string can possibly have. While iterating, check if the sum of the number of digits to be appended and the initial string length is equal to the length of the resulting string.
def get_length(s):
    s = s + "|"
    result = ""
    len_s = len(s)
    i = 1  # number of digits to try for the appended length
    while True:
        candidate = len_s + i
        if len(str(candidate)) == i:
            result = s + str(len_s + i)
            break
        i += 1
    return result
This code gives the result.
I used a few variables, but in the end it produces the output you want:
def len_s(s):
    s = s + '|'
    b = len(s)
    z = s + str(b)
    length = len(z)
    new_s = s + str(length)
    new_len = len(new_s)
    return s + str(new_len)
s = "A string"
print len_s(s)
Here's a direct equation for this (so it's not necessary to construct the string). If s is the string, then the length of the string including the length of the appended length will be:
L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
The idea here is that a direct calculation is only problematic when the appended length will push the length past a power of ten; that is, at 9, 98, 99, 997, 998, 999, 9996, etc. To work this through, 1 + int(log10(len(s))) is the number of digits in the length of s. If we add that to len(s), then 9->10, 98->100, 99->101, etc, but still 8->9, 97->99, etc, so we can push past the power of ten exactly as needed. That is, adding this produces a number with the correct number of digits after the addition. Then do the log again to find the length of that number and that's the answer.
To test this:
from math import log10
def find_length(s):
    L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
    return L1
# test, just looking at lengths around 10**n
for i in range(9):
    for j in range(30):
        L = abs(10**i - j + 10) + 1
        s = "a"*L
        x0 = find_length(s)
        new0 = s+`x0`
        if len(new0)!=x0:
            print "error", len(s), x0, log10(len(s)), log10(x0)
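The same formula can be restated with integer arithmetic only, because 1 + int(log10(x)) is the number of decimal digits of x for x >= 1, which is also len(str(x)). The sketch below is my restatement in Python 3, not the answer's code (total_length and total_length_iterative are hypothetical names). It takes the length n of the string instead of the string itself, and cross-checks the closed form against the iterative fixed-point approach from the other answers.

```python
def total_length(n):
    """Closed form: final length for a string of length n plus its own length."""
    digits = len(str(n))              # same as 1 + int(log10(n)) for n >= 1
    return n + len(str(n + digits))   # digits of the length after the first bump


def total_length_iterative(n):
    """Fixed-point search: grow the guess until it accounts for its own digits."""
    guess = n
    while n + len(str(guess)) != guess:
        guess = n + len(str(guess))
    return guess


s = "A string|"
print(s + str(total_length(len(s))))  # A string|11

mismatches = [n for n in range(20000) if total_length(n) != total_length_iterative(n)]
print(mismatches)  # []
```

Using len(str(...)) also avoids log10(0) for the empty string and any floating-point rounding near powers of ten.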
From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using a suffix array:
example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step very slow. I think it's slow because Python needs to pass the substring by value instead of by pointer (as the C code above does).
The tested file can be downloaded from here
The C code needs only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never finishes on my computer (I waited for 10 minutes and killed it).
Does anyone have ideas on how to make the code efficient (for example, finishing in less than 10 seconds)?
My solution is based on suffix arrays. The suffix array is constructed by prefix doubling, and a longest common prefix (LCP) array is then computed from it. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. to search for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter
def longest_common_substring(text):
    """Get the longest common substrings and their positions.

    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of length _step are first pre-sorted. The results are
    then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]

    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip already resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over more complicated O(n log n) because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than necessary linear time operations in the method from that article, that should be O(n) under very special presumptions of random strings together with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that worst-case O(n log n) of my algorithm can be in practice faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix
def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)
def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
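The code above is Python 2 (imap, izip, xrange). For readers on Python 3, here is a sketch of the same sorted-suffixes idea. It is my own restatement, not the answer's code: it uses a hand-written common_prefix helper (a hypothetical name) and pairs neighbours with zip. Like the original, it materializes every suffix, so it is only suitable for small inputs.

```python
def common_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def longest_duplicate_small(data):
    """Longest substring that occurs at least twice (occurrences may overlap)."""
    suffixes = sorted(data[i:] for i in range(len(data)))  # O(n*n) memory
    # After sorting, the longest repeat is a common prefix of two neighbours.
    neighbours = zip(suffixes, suffixes[1:])
    return max((common_prefix(a, b) for a, b in neighbours), key=len, default="")


print(longest_duplicate_small("banana"))  # ana
```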
buffer() makes it possible to get a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j): # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
Note: *_memoryview() is deprecated by *_buffer() version
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))
def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b
def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>
int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 secs on my circa-2007 desktop using totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()
chains = dict()
# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()
    chains[s].append(a)
def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains :
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains
# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...
I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.
I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.
I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.
What you are looking for is called an edit distance, and there is a nice explanation on Wikipedia. There are many ways to define a distance between two words; the one you want is called the Levenshtein distance, and here is a DP (dynamic programming) implementation in Python.
def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
And a couple of more implementations are here.
difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.
From the docs
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
Here is my version of the Levenshtein distance:
def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1
    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)
    return tbl[i,j]
print(edit_distance("Helloworld", "HalloWorld"))
# this calculates the edit distance with unit cost for insertion, deletion and substitution
word1="rice"
word2="ice"
len_1=len(word1)
len_2=len(word2)
x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance
for i in range(0,len_1+1): #initialization of base case values
    x[i][0]=i
for j in range(0,len_2+1):
    x[0][j]=j
for i in range (1,len_1+1):
    for j in range(1,len_2+1):
        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1]
        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1
print x[i][j]
Using the SequenceMatcher from Python built-in difflib is another way of doing it, but (as correctly pointed out in the comments), the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).
from difflib import SequenceMatcher
a = 'kitten'
b = 'sitting'
required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
# # (tag, i1, i2, j1, j2)
# ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
# ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
# ('insert', 6, 6, 6, 7), # insert b[6:7]="g" after a[6:6]="n"
# ]
# the edit distance:
len(required_edits) # == 3
I would recommend not creating this kind of code on your own. There are libraries for that.
For instance the Levenshtein library.
In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3
In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6
In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3
In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3
In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11
In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4
Similar to Santoshi's solution above but I made three changes:
One line initialization instead of five
No need to define cost alone (just use int(boolean) 0 or 1)
Instead of double for loop use product, (this last one is only cosmetic, double loop seems unavoidable)
from itertools import product
def edit_distance(s1,s2):
    d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
    for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)):
        d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
    # index by the lengths: i and j are unbound when either string is empty
    return d[len(s1),len(s2)]
Instead of computing the Levenshtein distance against every dictionary word, use a BK-tree or a trie: these structures let you skip most of the dictionary, so far fewer distance computations are needed. A good browse through these topics will give a detailed description.
This link will help you more about spell checking.
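To make the BK-tree suggestion concrete, here is a minimal sketch in Python 3. Everything in it (the BKTree class, its add and search methods, the node layout as (word, children) tuples) is my own illustrative choice and not taken from a particular library. Each child edge is labelled with the distance to its parent, and the triangle inequality lets a search skip every subtree whose label is further than max_dist from the query's distance to the current node.

```python
def levenshtein(a, b):
    """Edit distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    def __init__(self, words, distance=levenshtein):
        self.distance = distance
        self.root = None  # a node is a (word, {edge_distance: child_node}) tuple
        for word in words:
            self.add(word)

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node_word, children = self.root
        while True:
            d = self.distance(word, node_word)
            if d == 0:
                return  # word already stored
            if d not in children:
                children[d] = (word, {})
                return
            node_word, children = children[d]  # descend along the edge labelled d

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs with distance <= max_dist."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # Only edges labelled within [d - max_dist, d + max_dist] can lead
            # to a match, by the triangle inequality.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(['ape', 'apple', 'peach', 'puppy'])
print(tree.search('appel', 2))
```

For the spellchecker in the question, the tree would be built once from the dictionary and then queried with max_dist=2 for each invalid word.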
You need Minimum Edit Distance for this task.
Following is my version of MED a.k.a Levenshtein Distance.
def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)
    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1
    # zero matrix with one extra row and column for the empty prefix
    accumulator = [[0 for x in range(len2+1)] for y in range(len1+1)]
    # initializing the base cases
    for i in range(0,len1+1):
        accumulator[i][0] = i
    for i in range(0,len2+1):
        accumulator[0][i] = i
    # we take the accumulator and iterate through it row by row.
    for i in range(1,len1+1):
        char1=str1[i-1]
        for j in range(1,len2+1):
            char2=str2[j-1]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution (use 1 for the standard Levenshtein distance)
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )
    cost=accumulator[len1][len2]
    return cost
Fine-tuned code based on the version from @Santosh, which should address the issue brought up by @Artur Krajewski. The biggest difference is replacing the dictionary with an actual 2D matrix (a list of lists):
def edit_distance(s1, s2):
    # add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
    # launch a matrix
    tbl = [[0] * n for i in range(m)]
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j
    for i in range(1, m):
        for j in range(1, n):
            #if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
            #find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    # the bottom-right cell holds the distance between the full strings
    return tbl[m-1][n-1]
edit_distance("birthday", "Birthdayyy")
Following up on @krassowski's answer:
from difflib import SequenceMatcher
def sequence_matcher_edits(word_a, word_b):
    required_edits = [code for code in (
        SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
        )
        if code[0] != 'equal'
    ]
    return len(required_edits)
print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3