I'm a Python n00b and I'd like some suggestions on how to improve the performance of this method, which computes the Jaro-Winkler distance of two names.
def winklerCompareP(str1, str2):
    """Return approximate string comparator measure (between 0.0 and 1.0)

    USAGE:
      score = winkler(str1, str2)

    ARGUMENTS:
      str1  The first string
      str2  The second string

    DESCRIPTION:
      As described in 'An Application of the Fellegi-Sunter Model of
      Record Linkage to the 1990 U.S. Decennial Census' by William E. Winkler
      and Yves Thibaudeau.

      Based on the 'jaro' string comparator, but modifies it according to whether
      the first few characters are the same or not.
    """

    # Quick check if the strings are the same - - - - - - - - - - - - - - - - - -
    #
    jaro_winkler_marker_char = chr(1)
    if (str1 == str2):
        return 1.0

    len1 = len(str1)
    len2 = len(str2)
    halflen = max(len1,len2) / 2 - 1

    ass1 = ''  # Characters assigned in str1
    ass2 = ''  # Characters assigned in str2

    workstr1 = str1
    workstr2 = str2

    common1 = 0  # Number of common characters
    common2 = 0

    # Analyse the first string - - - - - - - - - - - - - - - - - - - - - - - - -
    #
    for i in range(len1):
        start = max(0,i-halflen)
        end = min(i+halflen+1,len2)
        index = workstr2.find(str1[i],start,end)
        if (index > -1):  # Found common character
            common1 += 1
            ass1 = ass1 + str1[i]
            workstr2 = workstr2[:index]+jaro_winkler_marker_char+workstr2[index+1:]

    # Analyse the second string - - - - - - - - - - - - - - - - - - - - - - - - -
    #
    for i in range(len2):
        start = max(0,i-halflen)
        end = min(i+halflen+1,len1)
        index = workstr1.find(str2[i],start,end)
        if (index > -1):  # Found common character
            common2 += 1
            ass2 = ass2 + str2[i]
            workstr1 = workstr1[:index]+jaro_winkler_marker_char+workstr1[index+1:]

    if (common1 != common2):
        print('Winkler: Wrong common values for strings "%s" and "%s"' % \
              (str1, str2) + ', common1: %i, common2: %i' % (common1, common2) + \
              ', common should be the same.')
        common1 = float(common1+common2) / 2.0  ##### This is just a fix #####

    if (common1 == 0):
        return 0.0

    # Compute number of transpositions - - - - - - - - - - - - - - - - - - - - -
    #
    transposition = 0
    for i in range(len(ass1)):
        if (ass1[i] != ass2[i]):
            transposition += 1
    transposition = transposition / 2.0

    # Now compute how many characters are common at beginning - - - - - - - - - -
    #
    minlen = min(len1,len2)
    for same in range(minlen+1):
        if (str1[:same] != str2[:same]):
            break
    same -= 1
    if (same > 4):
        same = 4

    common1 = float(common1)
    w = 1./3.*(common1 / float(len1) + common1 / float(len2) + (common1-transposition) / common1)
    wn = w + same*0.1 * (1.0 - w)
    return wn
Example output
ZIMMERMANN ARMIENTO 0.814583333
ZIMMERMANN ZIMMERMANN 1
ZIMMERMANN CANNONS 0.766666667
CANNONS AKKER 0.8
CANNONS ALDERSON 0.845833333
CANNONS ALLANBY 0.833333333
I focused more on optimizing to get more out of Python than on optimizing the algorithm because I don't think that there is much of an algorithmic improvement to be had here. Here are some Python optimizations that I came up with.
(1). Since you appear to be using Python 2.x, change all of the range() calls to xrange(). range() generates the full list of numbers before iterating over them, while xrange() generates them as needed.
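A quick illustration of the difference (Python 2 only; the loop count here is just an arbitrary example, not from your code):

n = 10**6
sum(range(n))    # builds a list of one million ints before summing
sum(xrange(n))   # same result, but the numbers are produced on demand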
(2). Make the following substitutions for max and min:
start = max(0,i-halflen)
with
start = i - halflen if i > halflen else 0
and
end = min(i+halflen+1,len2)
with
end = i+halflen+1 if i+halflen+1 < len2 else len2
in the first loop, and similar ones for the second loop. There's also another min() farther down and a max() near the beginning of the function, so do the same with those. Replacing the min() and max() calls really helped to reduce the time. They are convenient functions, but more costly than the conditional expressions I've replaced them with.
(3). Use common1 instead of len(ass1). You've kept track of the length of ass1 in common1, so use it rather than calling len() again to recompute it.
(4). Replace the following code:
minlen = min(len1,len2)
for same in xrange(minlen+1):
if (str1[:same] != str2[:same]):
break
same -= 1
with
for same in xrange(minlen):
if str1[same] != str2[same]:
break
The reason for this is mainly that str1[:same] creates a new string every time through the loop, and you end up re-checking parts that you've already checked. Also, there's no need to compare '' != '' and then decrement same afterwards if we don't have to.
(5). Use psyco, a just-in-time compiler of sorts. Once you've downloaded it and installed it, just add the lines
import psyco
psyco.full()
at the top of the file to use it. Don't use psyco unless you do the other changes that I've mentioned. For some reason, when I ran it on your original code it actually slowed it down.
Using timeit, I found that I was getting a decrease in time of about 20% or so with the first 4 changes. However, when I add psyco along with those changes, the code is about 3x to 4x faster than the original.
If you want more speed
A fair amount of the remaining time is in the string's find() method. I decided to try replacing this with my own. For the first loop, I replaced
index = workstr2.find(str1[i],start,end)
with
index = -1
for j in xrange(start,end):
if workstr2[j] == str1[i]:
index = j
break
and a similar form for the second loop. Without psyco, this slows down the code, but with psyco, it speeds it up quite a lot. With this final change the code is about 8x to 9x faster than the original.
If that isn't fast enough
Then you should probably turn to making a C module.
Good luck!
I imagine you could do even better if you used the PyLevenshtein module. It's C and quite fast for most use cases. It includes a jaro-winkler function that gives the same output, but on my machine it's 63 times faster.
In [1]: import jw
In [2]: jw.winklerCompareP('ZIMMERMANN', 'CANNONS')
Out[2]: 0.41428571428571426
In [3]: timeit jw.winklerCompareP('ZIMMERMANN', 'CANNONS')
10000 loops, best of 3: 28.2 us per loop
In [4]: import Levenshtein
In [5]: Levenshtein.jaro_winkler('ZIMMERMANN', 'CANNONS')
Out[5]: 0.41428571428571431
In [6]: timeit Levenshtein.jaro_winkler('ZIMMERMANN', 'CANNONS')
1000000 loops, best of 3: 442 ns per loop
In addition to everything that Justin says, concatenating strings is expensive: Python has to allocate memory for the new string and then copy both strings into it.
So this is bad:
ass1 = ''
for i in range(len1):
    ...
    if (index > -1):  # Found common character
        ...
        ass1 = ass1 + str1[i]
It will probably be faster to make ass1 and ass2 lists of characters and use ass1.append(str1[i]). As far as I can see from my quick read of the code, the only thing you do with ass1 and ass2 afterwards is to iterate through them character by character, so they do not need to be strings. If you did need to use them as strings later, you could convert them with ''.join(ass1).
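For example, here is a rough sketch of what the first loop could look like with that change (the second loop is analogous; this also folds in Justin's xrange and min/max substitutions, and is illustrative rather than a tested patch):

ass1 = []  # collect assigned characters in a list instead of a string
for i in xrange(len1):
    start = i - halflen if i > halflen else 0
    end = i + halflen + 1 if i + halflen + 1 < len2 else len2
    index = workstr2.find(str1[i], start, end)
    if index > -1:  # found a common character
        common1 += 1
        ass1.append(str1[i])  # cheap append; no new string object per character
        workstr2 = workstr2[:index] + jaro_winkler_marker_char + workstr2[index+1:]
# if a real string is ever needed later: ''.join(ass1)

The transposition loop further down works unchanged, since indexing ass1[i] behaves the same on a list.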
Related
I just wrote up code for problem 1.6 String Compression from Cracking the Coding Interview. I am wondering how I can condense this code to make it more efficient. Also, I want to make sure that this code is O(n) because I am not concatenating to a new string.
The problem states:
Implement a method to perform basic string compression using the counts of repeated characters. For example, the string 'aabcccccaaa' would become a2b1c5a3. If the "compressed" string would not become smaller than the original string, your method should return the original string. You can assume the string has only uppercase and lowercase letters (a - z).
My code works. My first if statement after the else checks whether the count for the character is 1, and if it is, it just appends the character without a count. I do this so that the comparison between the length of the end result and the length of the original string (which decides which one to return) isn't inflated by unnecessary 1s.
import string

def stringcompress(str1):
    res = []
    d = dict.fromkeys(string.ascii_letters, 0)
    main = str1[0]
    for char in range(len(str1)):
        if str1[char] == main:
            d[main] += 1
        else:
            if d[main] == 1:
                res.append(main)
                d[main] = 0
                main = str1[char]
                d[main] += 1
            else:
                res.append(main + str(d[main]))
                d[main] = 0
                main = str1[char]
                d[main] += 1
    res.append(main + str(d[main]))
    return min(''.join(res), str1, key=len)
Again, my code works as expected and does what the question asks. I just want to see if there are certain lines of code I can take out to make the program more efficient.
I messed around testing different variations with the timeit module. Your variation worked fantastically when I generated test data that did not repeat often, but for short strings, my stringcompress_using_string was the fastest method. As the strings grow longer everything flips upside down, and your method of doing things becomes the fastest, and stringcompress_using_string is the slowest.
This just goes to show the importance of testing under different circumstances. My initial conclusions were incomplete, and having more test data showed the true story about the effectiveness of these three methods.
import string
import timeit
import random
def stringcompress_original(str1):
    res = []
    d = dict.fromkeys(string.ascii_letters, 0)
    main = str1[0]
    for char in range(len(str1)):
        if str1[char] == main:
            d[main] += 1
        else:
            if d[main] == 1:
                res.append(main)
                d[main] = 0
                main = str1[char]
                d[main] += 1
            else:
                res.append(main + str(d[main]))
                d[main] = 0
                main = str1[char]
                d[main] += 1
    res.append(main + str(d[main]))
    return min(''.join(res), str1, key=len)
def stringcompress_using_list(str1):
    res = []
    count = 0
    for i in range(1, len(str1)):
        count += 1
        if str1[i] == str1[i-1]:
            continue
        res.append(str1[i-1])
        res.append(str(count))
        count = 0
    res.append(str1[i] + str(count+1))
    return min(''.join(res), str1, key=len)
def stringcompress_using_string(str1):
    res = ''
    count = 0
    # we can start at 1 because we already know the first letter is not a repetition of any previous letters
    for i in range(1, len(str1)):
        count += 1
        # we keep going through the for loop until a character does not repeat the previous one
        if str1[i] == str1[i-1]:
            continue
        # add the character along with the number of times it repeated to the final string,
        # reset the count,
        # and start all over with the next character
        res += str1[i-1] + str(count)
        count = 0
    # add the final character + count
    res += str1[i] + str(count+1)
    return min(res, str1, key=len)
def generate_test_data(min_length=3, max_length=300, iterations=3000, repeat_chance=.66):
    assert repeat_chance > 0 and repeat_chance < 1
    data = []
    chr = 'a'
    for i in range(iterations):
        the_str = ''
        # create a random string with a random length between min_length and max_length
        for j in range( random.randrange(min_length, max_length+1) ):
            # if we've decided to not repeat by randomization, then grab a new character,
            # otherwise we will continue to use (repeat) the character that was chosen last time
            if random.random() > repeat_chance:
                chr = random.choice(string.ascii_letters)
            the_str += chr
        data.append(the_str)
    return data
# generate test data beforehand to make sure all of our tests use the same test data
test_data = generate_test_data()
#make sure all of our test functions are doing the algorithm correctly
print('showing that the algorithms all produce the correct output')
print('stringcompress_original: ', stringcompress_original('aabcccccaaa'))
print('stringcompress_using_list: ', stringcompress_using_list('aabcccccaaa'))
print('stringcompress_using_string: ', stringcompress_using_string('aabcccccaaa'))
print()
print('stringcompress_original took', timeit.timeit("[stringcompress_original(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_list took', timeit.timeit("[stringcompress_using_list(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
print('stringcompress_using_string took', timeit.timeit("[stringcompress_using_string(x) for x in test_data]", number=10, globals=globals()), ' seconds' )
The following results were all taken on an Intel i7-5700HQ CPU @ 2.70GHz, a quad-core processor. Compare the different functions within each blockquote, but don't try to cross-compare results from one blockquote to another, because the size of the test data is different.
Using long strings
Test data generated with generate_test_data(10000, 50000, 100, .66)
stringcompress_original took 7.346990528497378 seconds
stringcompress_using_list took 7.589927956366313 seconds
stringcompress_using_string took 7.713812443264496 seconds
Using short strings
Test data generated with generate_test_data(2, 5, 10000, .66)
stringcompress_original took 0.40272931026355685 seconds
stringcompress_using_list took 0.1525574881739265 seconds
stringcompress_using_string took 0.13842854253813164 seconds
10% chance of repeating characters
Test data generated with generate_test_data(10, 300, 10000, .10)
stringcompress_original took 4.675965586924492 seconds
stringcompress_using_list took 6.081609410376534 seconds
stringcompress_using_string took 5.887430301813865 seconds
90% chance of repeating characters
Test data generated with generate_test_data(10, 300, 10000, .90)
stringcompress_original took 2.6049783549783547 seconds
stringcompress_using_list took 1.9739111725413099 seconds
stringcompress_using_string took 1.9460854974553605 seconds
It's important to create a little framework like this that you can use to test changes to your algorithm. Often changes that don't seem useful will make your code go much faster, so the key to the game when optimizing for performance is to try out different things and time the results. I'm sure there are more discoveries to be made if you play around with different changes, but it really depends on the type of data you want to optimize for -- compressing short strings vs long strings vs strings that don't repeat as often vs those that do.
I am trying to search a list (DB) for possible matches of fragments of text. For instance, I have a DB with the text "evilman". I want to use user input to search for any possible matches in the DB and give the answer with a confidence. If the user inputs "hello", then there are no possible matches. If the user inputs "evil", then the possible match is evilman with a confidence of 57% (4 out of 7 characters match), and so on.
However, I also want a way to match input text such as "evxxman". 5 out of 7 characters of evxxman match the text "evilman" in the DB, but a simple check in Python will say there is no match, since it only finds text that matches consecutively. I hope that makes sense. Thanks.
Following is my code:
db = []
possible_signs = []
db.append("evilman")
text = raw_input()
count = 0
for s in db:
    if text in s:
        if len(text) >= len(s)/2:
            possible_signs.append(s)
            count += 1
            confidence = (float(len(text)) / float(len(s))) * 100
            print "Confidence:", '%.2f' %(confidence), "<possible match:>", possible_signs[0]
This first version seems to comply with your examples. It makes the strings "slide" against each other and counts the number of identical characters.
The ratio is obtained by dividing the character count by the reference string length. Take the max over all offsets and voilà.
Call it for each string in your DB.
def commonChars(txt, ref):
    txtLen = len(txt)
    refLen = len(ref)
    r = 0
    for i in range(refLen + (txtLen - 1)):
        rStart = abs(min(0, txtLen - i - 1))
        tStart = txtLen - i - 1 if i < txtLen else 0
        l = min(txtLen - tStart, refLen - rStart)
        c = 0
        for j in range(l):
            if txt[tStart + j] == ref[rStart + j]:
                c += 1
        r = max(r, c / refLen)
    return r
print(commonChars('evxxman', 'evilman')) # 0.7142857142857143
print(commonChars('evil', 'evilman')) # 0.5714285714285714
print(commonChars('man', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'evilman')) # 0.42857142857142855
print(commonChars('batman', 'man')) # 1.0
This second version produces the same results, but using the difflib mentioned in other answers.
It computes the matching blocks, sums their lengths, and computes the ratio against the reference length.
import difflib

def commonBlocks(txt, ref):
    matcher = difflib.SequenceMatcher(a=txt, b=ref)
    matchingBlocks = matcher.get_matching_blocks()
    matchingCount = sum([b.size for b in matchingBlocks])
    return matchingCount / len(ref)
print(commonBlocks('evxxman', 'evilman')) # 0.7142857142857143
print(commonBlocks('evxxxxman', 'evilman')) # 0.7142857142857143
As shown by the calls above, the behavior is slightly different. "holes" between matching blocks are ignored, and do not change the final ratio.
For finding matches with a quality-estimation, have a look at difflib.SequenceMatcher.ratio and friends - these functions might not be the fastest match-checkers but they are easy to use.
Example copied from difflib docs
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
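As a rough sketch of how this could plug into the DB search from the question (the 0.5 cut-off is just an illustrative choice, not part of difflib):

import difflib

db = ["evilman"]
text = raw_input()
for s in db:
    ratio = difflib.SequenceMatcher(None, text, s).ratio()
    if ratio >= 0.5:  # illustrative threshold for "possible match"
        print "Confidence: %.2f" % (ratio * 100), "<possible match:>", s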
Based on your description and examples, it seems to me that you're actually looking for something like the Levenshtein (or edit) distance. Note that it does not quite give the scores you specify, but I think it gives the scores you actually want.
There are several packages implementing this efficiently, e.g., distance:
In [1]: import distance
In [2]: distance.levenshtein('evilman', 'hello')
Out[2]: 6L
In [3]: distance.levenshtein('evilman', 'evil')
Out[3]: 3L
In [4]: distance.levenshtein('evilman', 'evxxman')
Out[4]: 2L
Note that the library contains several measures of similarity, e.g., jaccard and sorensen return a normalized value by default:
>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143
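If you want a percentage-style confidence like in the question, one common approach (just a sketch, not part of the distance API) is to normalize the Levenshtein distance by the length of the longer string:

import distance

def confidence(query, candidate):
    # 1.0 = identical, 0.0 = nothing in common (rough normalization)
    d = distance.levenshtein(query, candidate)
    return 1.0 - float(d) / max(len(query), len(candidate))

print confidence('evil', 'evilman')     # ~0.57, i.e. the 57% from the question
print confidence('evxxman', 'evilman')  # ~0.71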
Create a while loop and track two iterators, one for your key word ("evil") and one for your query word ("evilman"). Here is some pseudocode:
key = "evil"
query = "evilman"
key_iterator = 0
query_iterator = 0
confidence_score = 0
while( key_iterator < key.length && query_iterator < query.length ) {
if (key[key_iterator] == query[query_iterator]) {
confidence_score++
key_iterator++
}
query_iterator++
}
// If we didnt reach the end of the key
if (key_iterator != key.length) {
confidence_score = 0
}
print ("Confidence: " + confidence_score + " out of " + query.length)
I have written a simple implementation of the Sieve of Eratosthenes, and I would like to know if there is a more efficient way to perform one of the steps.
def eratosthenes(n):
    primes = [2]
    is_prime = [False] + ((n - 1)/2)*[True]
    for i in xrange(len(is_prime)):
        if is_prime[i]:
            p = 2*i + 1
            primes.append(p)
            is_prime[i*p + i::p] = [False]*len(is_prime[i*p + i::p])
    return primes
I am using Python's list slicing to update my list of booleans is_prime. Each element is_prime[i] corresponds to an odd number 2*i + 1.
is_prime[i*p + i::p] = [False]*len(is_prime[i*p + i::p])
When I find a prime p, I can mark all elements corresponding to multiples of that prime False, and since all multiples smaller than p**2 are also multiples of smaller primes, I can skip marking those. The index of p**2 is i*p + i.
I'm worried about the cost of computing [False]*len(is_prime[i*p + i::p]) and I have tried to compare it to two other strategies that I couldn't get to work.
For some reason, the formula (len(is_prime) - (i*p + i))/p (if positive) is not always equal to len(is_prime[i*p + i::p]). Is it because I've calculated the length of the slice wrong, or is there something subtle about slicing that I haven't caught?
When I use the following lines in my function:
print len(is_prime[i*p + i::p]), ((len(is_prime) - (i*p + i))/p)
is_prime[i*p + i::p] = [False]*((len(is_prime) - (i*p + i))/p)
I get the following output (case n = 50):
>>> eratosthenes2(50)
7 7
3 2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 9, in eratosthenes2
ValueError: attempt to assign sequence of size 2 to extended slice of size 3
I also tried replacing the bulk updating line with the following:
for j in xrange(i*p + i, len(is_prime), p):
    is_prime[j] = False
But this fails for large values of n because xrange doesn't take anything bigger than a long. I gave up on trying to wrestle itertools.count into what I needed.
Are there faster and more elegant ways to bulk-update the list slice? Is there anything I can do to fix the other strategies that I tried, so that I can compare them to the working one? Thanks!
Use itertools.repeat():
is_prime[i*p + i::p] = itertools.repeat(False, len(is_prime[i*p + i::p]))
The slicing syntax will iterate over whatever you put on the right-hand side; it doesn't need to be a full-blown sequence.
So let's fix that formula. I'll just borrow the Python 3 formula since we know that works:
1 + (hi - 1 - lo) / step
Since step > 0, hi = stop and lo = start, so we have:
1 + (len(is_prime) - 1 - (i*p + i))//p
(// is integer division; this future-proofs our code for Python 3, but requires 2.7 to run).
Now, put it all together:
slice_len = 1 + (len(is_prime) - 1 - (i*p + i))//p
is_prime[i*p + i::p] = itertools.repeat(False, slice_len)
Python 3 users: Please do not use this formula directly. Instead, just write len(range(start, stop, step)). That gives the same result with similar performance (i.e. it's O(1)) and is much easier to read.
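Putting the two suggestions together, here is a sketch of the whole function (Python 2.7, as in the question); the max(0, ...) guard only handles the case where p*p is already past the end of the list:

import itertools

def eratosthenes(n):
    primes = [2]
    is_prime = [False] + ((n - 1) // 2) * [True]
    size = len(is_prime)
    for i in xrange(size):
        if is_prime[i]:
            p = 2 * i + 1
            primes.append(p)
            start = i * p + i                                # index of p*p
            slice_len = max(0, 1 + (size - 1 - start) // p)  # length of the slice
            is_prime[start::p] = itertools.repeat(False, slice_len)
    return primes

print eratosthenes(50)  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]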
I have a file of short strings, which I have loaded in a list short (there are 1.5 million short strings of length 150). I want to find the number of these short strings that are present in a longer string (of length ~ 5 million) which is seq in the code. I use the following obvious implementation. However, this seems to take a long time (around a day) to run.
count1 = count2 = 0
for line in short:
    count1 += 1
    if line in seq:
        count2 += 1
print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
Is there a way I can do this more efficiently?
If the short strings are a constant length (you indicated they were 150 long), you can preprocess the long string to extract all the short strings, then just do set lookups (which are constant time in expectation):
shortlen = 150
shortset = set()
for i in xrange(len(seq)-shortlen+1):
    shortset.add(seq[i:i+shortlen])

count1 = count2 = 0
for line in short:
    count1 += 1
    if line in shortset:
        count2 += 1
The running time for this is probably going to be dominated by the preprocessing step (because it inserts nearly 5M strings of length 150 each), but that should still be faster than 1.5M searches in a 5M character string.
Do profiling, and try different options. You will not get around iterating through your sequence of "test" strings, so for line in short is something you will most probably keep. The test if line in seq is, I believe, quite efficiently implemented in CPython, but it is not optimized for searching a small needle in a laaaaarge haystack. Your requirements are a bit extreme, and I guess it is exactly this test that takes quite a while and is the bottleneck of your code. You might want to try, just as a comparison, the regex module for searching the needle in the haystack.
Edit:
A rudimentary benchmark (no repetitions, no scaling behavior investigated, no profile module used), for comparing the methods discussed in this thread:
import string
import random
import time

def genstring(N):
    return ''.join(random.choice(string.ascii_uppercase) for _ in xrange(N))
t0 = time.time()
length_longstring = 10**6
length_shortstring = 7
nbr_shortstrings = 3*10**6
shortstrings = [genstring(length_shortstring) for _ in xrange(nbr_shortstrings)]
longstring = genstring(length_longstring)
duration = time.time() - t0
print "Setup duration: %.1f s" % duration
def method_1():
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in longstring:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
#t0 = time.time()
#method_1()
#duration = time.time() - t0
#print "M1 duration: %.1f s" % duration
def method_2():
    shortset = set()
    for i in xrange(len(longstring)-length_shortstring+1):
        shortset.add(longstring[i:i+length_shortstring])
    count1 = 0
    count2 = 0
    for ss in shortstrings:
        count1 += 1
        if ss in shortset:
            count2 += 1
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
t0 = time.time()
method_2()
duration = time.time() - t0
print "M2 duration: %.1f s" % duration
def method_3():
    shortset = set(
        longstring[i:i+length_shortstring] for i in xrange(
            len(longstring)-length_shortstring+1))
    count1 = len(shortstrings)
    count2 = sum(1 for ss in shortstrings if ss in shortset)
    print str(count2) + ' of ' + str(count1) + ' strings are in long string.'
t0 = time.time()
method_3()
duration = time.time() - t0
print "M3 duration: %.1f s" % duration
Output:
$ python test.py
Setup duration: 23.3 s
364 of 3000000 strings are in long string.
M2 duration: 1.4 s
364 of 3000000 strings are in long string.
M3 duration: 1.2 s
(This is Python 2.7.3 on Linux, on an E5-2650 0 @ 2.00GHz)
There is a slight difference between the method proposed by nneonneo and the improvements suggested by chepner. Under these conditions, it is already no fun to execute the original code. Under slightly less extreme conditions we can make a comparison among all three methods:
length_longstring = 10**6
length_shortstring = 5
nbr_shortstrings = 10**5
->
$ python test1.py
Setup duration: 1.4 s
8121 of 100000 strings are in long string.
M1 duration: 95.0 s
8121 of 100000 strings are in long string.
M2 duration: 0.4 s
8121 of 100000 strings are in long string.
M3 duration: 0.4 s
Ok, I know you've already accepted another answer which works well, but just for the sake of completeness, here's a filled out version of what RedX suggested in the comments (I think)
import itertools

PREFIXLEN = 50  # This will need to be adjusted for efficiency, consider doing a sensitivity study

commonpres = itertools.groupby(sorted(short), lambda x: x[0:PREFIXLEN])

survivors = []
precount = 0
for pres in commonpres:
    precount += 1
    if pres[0] in seq:
        survivors.extend(pres[1])

postcount = len(survivors)

actcount = 0
for survivor in survivors:
    if survivor in seq:
        actcount += 1

print "{} of {} strings are in long string.".format(actcount, len(short))
print "{} short strings ruled out by groups".format(len(short) - len(survivors))
print "{} total comparisons done".format(len(survivors) + precount)
The idea here is to rule out as many common prefixes as possible before running through all the survivors of those checks. In an extreme example, suppose your 1.5 million short strings fit into 10 common prefixes. For simplicity, let's also suppose that they are evenly divided (150,000 per prefix). If we can eliminate two of those prefixes with 10 checks, then we save 300,000 checks later. This is why PREFIXLEN needs to be "tuned." If it's too low, you'll have too many common prefixes and you won't save any checks (a prefix of length one = 1.5 million checks), whereas a PREFIXLEN that is too high will give you no gains from eliminating prefixes, since the number of eliminations will be small. I arbitrarily picked 50; that may or may not help you.
As I said before, this answer is pretty much academic, so if anyone sees anything to be improved, please comment or just edit.
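For the sensitivity study mentioned in the code comment, a rough sketch (assuming short and seq are already loaded as in the question; the candidate lengths are arbitrary) could simply time the prefix-filtering pass for a few values and report how many survivors each leaves:

import itertools
import time

def count_survivors(prefixlen):
    # count the short strings whose common prefix actually occurs in seq
    groups = itertools.groupby(sorted(short), lambda x: x[:prefixlen])
    return sum(len(list(grp)) for pre, grp in groups if pre in seq)

for prefixlen in (10, 25, 50, 100):
    t0 = time.time()
    survivors = count_survivors(prefixlen)
    print "PREFIXLEN={}: {} survivors in {:.1f} s".format(prefixlen, survivors, time.time() - t0)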
From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using a suffix array:
example = open("iliad10.txt").read()
def comlen(p, q):
i = 0
for x in zip(p, q):
if x[0] == x[1]:
i += 1
else:
break
return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
this_len = comlen(example[idx[i]:], example[idx[i+1]:])
print this_len
if this_len > max_len:
max_len = this_len
maxi = i
I found the idx.sort step very slow. I think it's slow because Python needs to copy the substrings by value instead of passing them by pointer (as the C code above does).
The tested file can be downloaded from here
The C code needs only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and then killed it).
Does anyone have ideas on how to make the code efficient? (For example, less than 10 seconds.)
My solution is based on suffix arrays, constructed by prefix doubling, with the longest common prefix (LCP) array computed alongside. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching for the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of length _step are first pre-sorted. The results are then
    repeatedly merged so that the guaranteed number of compared characters is
    doubled in every iteration until all substrings are sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value: (tuple)
        (sa, rsa, lcp)
        sa:  Suffix array                for i in range(1, size):
                 assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array        for i in range(size):
                 assert rsa[sa[i]] == i
        lcp: Longest common prefix       for i in range(1, size):
                 assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
                 if sa[i-1] + lcp[i] < len(text):
                     assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]

    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over more complicated O(n log n) algorithms because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the supposedly linear-time operations in the method from that article, which is O(n) only under very special assumptions about random strings with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable):  # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data)))  # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
buffer() allows getting a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))  # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
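For illustration, once the sa and lcp arrays are available (e.g. from the suffix_array() function in the answer above), extracting the longest duplicate takes only a few lines; a sketch:

def longest_duplicate_from_sa(text):
    sa, rsa, lcp = suffix_array(text)  # helper from the earlier answer
    h = max(lcp)                       # length of the longest repeated substring
    i = lcp.index(h)                   # adjacent suffix pair that achieves it
    return text[sa[i]:sa[i] + h]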
Note: the *_memoryview() version is superseded by the *_buffer() version above.
More memory efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces the wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead to get a reference instead of a copy. When I did this, the program hung after the idx.sort function (which was very fast).
I'm sure with a little work, you can get the rest working.
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive";
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of how it affects either chunk of code. I realized this just as I shut my laptop and left a wifi zone... I really should double-check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop using a totally different algorithm:
#!/usr/bin/env python

ex = open("iliad.mb.txt").read()
chains = dict()

# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains :
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on and so forth...