Python string comparison doesn't short circuit?

The usual saying is that string comparison must be done in constant time when checking things like passwords or hashes, and thus it is recommended to avoid a == b.
However, I ran the following script, and the results don't support the hypothesis that a == b short-circuits on the first non-identical character.
from time import perf_counter_ns
import random

def timed_cmp(a, b):
    start = perf_counter_ns()
    a == b
    end = perf_counter_ns()
    return end - start

def n_timed_cmp(n, a, b):
    "average time for a==b done n times"
    ts = [timed_cmp(a, b) for _ in range(n)]
    return sum(ts) / len(ts)

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 2 ** 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differ from the original string
    # by one character, at a different position
    # only do that for the first 50 chars, it's enough to get data
    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
    timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Here is the result of a run; re-running the script gives slightly different results, but nothing satisfactory.
# ran with cpython 3.8.3
6 78.051700
1 78.203200
15 78.222700
14 78.384800
11 78.396300
12 78.441800
9 78.476900
13 78.519000
8 78.586200
3 78.631500
---
0 80.691100
1 78.203200
I would've expected the fastest comparison to be the one where the first differing character is at the beginning of the string, but that's not what I get.
Any idea what's going on?

There's a difference, you just don't see it on such small strings. Here's a small patch to apply to your code: it uses longer strings, and it runs the check with the A placed at evenly spaced positions in the original string, from the beginning to the end, like this:
A_______________________________________________________________
______A_________________________________________________________
____________A___________________________________________________
__________________A_____________________________________________
________________________A_______________________________________
______________________________A_________________________________
____________________________________A___________________________
__________________________________________A_____________________
________________________________________________A_______________
______________________________________________________A_________
____________________________________________________________A___
@@ -15,13 +15,13 @@ def n_timed_cmp(n, a, b):

 def check_cmp_time():
     random.seed(123)
     # generate a random string of n characters
-    n = 2 ** 8
+    n = 2 ** 16
     s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in range(0, n, n // 10)]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
and you'll get:
0 122.621000
1 213.465700
2 380.214100
3 460.422000
5 694.278700
4 722.010000
7 894.630300
6 1020.722100
9 1149.473000
8 1341.754500
---
0 122.621000
1 213.465700
Note that with your example, with only 2**8 characters, it's already noticeable; apply this patch:
@@ -21,7 +21,7 @@ def check_cmp_time():
     # generate a list of strings, which all differs from the original string
     # by one character, at a different position
     # only do that for the first 50 char, it's enough to get data
-    diffs = [s[:i] + "A" + s[i+1:] for i in range(min(50, n))]
+    diffs = [s[:i] + "A" + s[i+1:] for i in [0, n - 1]]
     timed = [(i, n_timed_cmp(10000, s, d)) for (i, d) in enumerate(diffs)]
     sorted_timed = sorted(timed, key=lambda t: t[1])
to only keep the two extreme cases (first letter change vs last letter change) and you'll get:
$ python3 cmp.py
0 124.131800
1 135.566000
Numbers may vary, but most of the time test 0 is a tad faster than test 1.
Isolating more precisely which character is modified is possible, but only as long as memcmp works character by character, i.e. as long as it does not use integer comparisons: typically on the last characters if the buffers get misaligned, or on really short strings, like the 8-character string I demo here:
from time import perf_counter_ns
from statistics import median
import random

def check_cmp_time():
    random.seed(123)
    # generate a random string of n characters
    n = 8
    s = "".join([chr(random.randint(ord("a"), ord("z"))) for _ in range(n)])
    # generate a list of strings, which all differ from the original string
    # by one character, at a different position
    diffs = [s[:i] + "A" + s[i + 1 :] for i in range(n)]
    values = {x: [] for x in range(n)}
    for _ in range(10_000_000):
        for i, diff in enumerate(diffs):
            start = perf_counter_ns()
            s == diff
            values[i].append(perf_counter_ns() - start)
    timed = [[k, median(v)] for k, v in values.items()]
    sorted_timed = sorted(timed, key=lambda t: t[1])
    # print the 10 fastest
    for x in sorted_timed[:10]:
        i, t = x
        print("{}\t{:3f}".format(i, t))
    print("---")
    i, t = timed[0]
    print("{}\t{:3f}".format(i, t))
    i, t = timed[1]
    print("{}\t{:3f}".format(i, t))

if __name__ == "__main__":
    check_cmp_time()
Which gives me:
1 221.000000
2 222.000000
3 223.000000
4 223.000000
5 223.000000
6 223.000000
7 223.000000
0 241.000000
The differences are so small, Python and perf_counter_ns may no longer be the right tools here.
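As an aside: if the underlying concern is comparing secrets without leaking timing, the standard library already ships a comparison designed for that. A minimal sketch (the sample values are made up; hmac.compare_digest itself is real, available since Python 3.3):

import hmac

stored = b"5f4dcc3b5aa765d61d8327deb882cf99"     # e.g. a stored hash
candidate = b"5f4dcc3b5aa765d61d8327deb882cf98"  # value being checked

# compare_digest is designed not to reveal where the first mismatch
# occurs, so use it instead of == when comparing secrets
if hmac.compare_digest(stored, candidate):
    print("match")
else:
    print("no match")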

See, to know why it doesn't short circuit, you'll have to do some digging. The simple answer is, of course, that it doesn't short circuit because the standard doesn't specify so. But you might think, "Why wouldn't the implementations choose to short circuit? Surely it must be faster!". Not quite.
Let's take a look at CPython, for obvious reasons. Look at the code for the unicode_compare_eq function defined in unicodeobject.c:
static int
unicode_compare_eq(PyObject *str1, PyObject *str2)
{
    int kind;
    void *data1, *data2;
    Py_ssize_t len;
    int cmp;

    len = PyUnicode_GET_LENGTH(str1);
    if (PyUnicode_GET_LENGTH(str2) != len)
        return 0;
    kind = PyUnicode_KIND(str1);
    if (PyUnicode_KIND(str2) != kind)
        return 0;
    data1 = PyUnicode_DATA(str1);
    data2 = PyUnicode_DATA(str2);

    cmp = memcmp(data1, data2, len * kind);
    return (cmp == 0);
}
(Note: this function is only called after deducing that str1 and str2 are not the same object; if they are, the comparison is just a simple True immediately.)
Focus on this line specifically:
cmp = memcmp(data1, data2, len * kind);
Ahh, we're back at another crossroads. Does memcmp short circuit? The C standard does not specify such a requirement, as seen in the Open Group docs and also in Section 7.24.4.1 of the C Standard draft:
7.24.4.1 The memcmp function
Synopsis
#include <string.h>
int memcmp(const void *s1, const void *s2, size_t n);
Description
The memcmp function compares the first n characters of the object pointed to by s1 to
the first n characters of the object pointed to by s2.
Returns
The memcmp function returns an integer greater than, equal to, or less than zero,
accordingly as the object pointed to by s1 is greater than, equal to, or less than the object pointed to by s2.
Some C implementations (including glibc) choose not to short circuit. But why? Are we missing something? Why would you not short circuit?
Because the comparison they use might not be as naive as a byte-by-byte check. The standard does not require the objects to be compared byte by byte. Therein lies the chance of optimization.
What glibc does is compare elements of type unsigned long int instead of just singular bytes represented by unsigned char. Check out the implementation.
There's a lot more going on under the hood, a discussion far outside the scope of this question; after all, this isn't even tagged as a C question ;). Though I found that this answer may be worth a look. But just know, the optimization is there, just in a much different form than the approach that may come to mind at first glance.
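For intuition only, here is a rough Python sketch of that word-wise idea: compare fixed-size chunks so one comparison covers 8 bytes at a time. This is not glibc's actual code, just an illustration of the technique (the function name is mine):

def chunked_equal(a: bytes, b: bytes, word: int = 8) -> bool:
    """Equality check that walks the data one 8-byte word at a time."""
    if len(a) != len(b):
        return False
    tail = len(a) - len(a) % word
    for i in range(0, tail, word):
        # one comparison handles a whole word instead of a single byte
        if a[i:i + word] != b[i:i + word]:
            return False
    return a[tail:] == b[tail:]  # compare the leftover bytes

print(chunked_equal(b"hello world!!!..", b"hello world!!!.."))  # True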
Edit: Fixed wrong function link
Edit: As @Konrad Rudolph has stated, glibc memcmp does apparently short circuit. I've been misinformed.


Given a string of a million numbers, return all repeating 3 digit numbers

I had an interview with a hedge fund company in New York a few months ago and unfortunately, I did not get the internship offer as a data/software engineer. (They also asked for the solution to be in Python.)
I pretty much screwed up on the first interview problem...
Question: Given a string of a million numbers (Pi for example), write
a function/program that returns all repeating 3 digit numbers and number of
repetition greater than 1
For example: if the string was: 123412345123456 then the function/program would return:
123 - 3 times
234 - 3 times
345 - 2 times
They did not give me the solution after I failed the interview, but they did tell me that the time complexity of the solution was a constant 1000, since all the possible outcomes lie between:
000 --> 999
Now that I'm thinking about it, I don't think it's possible to come up with a constant time algorithm. Is it?
You got off lightly, you probably don't want to be working for a hedge fund where the quants don't understand basic algorithms :-)
There is no way to process an arbitrarily-sized data structure in O(1) if, as in this case, you need to visit every element at least once. The best you can hope for is O(n) in this case, where n is the length of the string.
Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.
It appears to me you could have impressed them in a number of ways.
First, by informing them that it's not possible to do it in O(1), unless you use the "suspect" reasoning given above.
Second, by showing your elite skills by providing Pythonic code such as:
inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])
This outputs:
[(123, 3), (234, 3), (345, 2)]
though you could, of course, modify the output format to anything you desire.
And, finally, by telling them there's almost certainly no problem with an O(n) solution, since the code above delivers results for a one-million-digit string in well under half a second. It seems to scale quite linearly as well, since a 10,000,000-character string takes 3.5 seconds and a 100,000,000-character one takes 36 seconds.
And, if they need better than that, there are ways to parallelise this sort of stuff that can greatly speed it up.
Not within a single Python interpreter of course, due to the GIL, but you could split the string into something like the following, with overlap (indicated by vv) required to allow proper processing of the boundary areas:

    vv  vv
123412
    123451
        5123456
You can farm these out to separate workers and combine the results afterwards.
The splitting of input and combining of output are likely to swamp any saving with small strings (and possibly even million-digit strings) but, for much larger data sets, it may well make a difference. My usual mantra of "measure, don't guess" applies here, of course.
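For illustration, a sketch of that split-and-merge idea using multiprocessing to sidestep the GIL (the helper names and chunking are mine; this is a sketch, not a tuned implementation):

from multiprocessing import Pool

def histogram(chunk):
    """Count 3-digit windows in one chunk."""
    freq = [0] * 1000
    for pos in range(len(chunk) - 2):
        freq[int(chunk[pos:pos + 3])] += 1
    return freq

def parallel_count(s, workers=4):
    step = max(1, len(s) // workers)
    # each chunk carries 2 extra characters so every boundary triple is
    # counted exactly once (by the chunk in which it starts)
    chunks = [s[i:i + step + 2] for i in range(0, len(s), step)]
    with Pool(workers) as pool:
        parts = pool.map(histogram, chunks)
    return [sum(col) for col in zip(*parts)]  # merge the histograms

if __name__ == '__main__':
    freq = parallel_count('123412345123456', workers=3)
    print([(n, c) for n, c in enumerate(freq) if c > 1])
    # -> [(123, 3), (234, 3), (345, 2)]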
This mantra also applies to other possibilities, such as bypassing Python altogether and using a different language which may be faster.
For example, the following C code, running on the same hardware as the earlier Python code, handles a hundred million digits in 0.6 seconds, roughly the same amount of time as the Python code took to process one million. In other words, much faster:
#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.
    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '\0';

    // Need at least three digits to do anything useful.
    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.
    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '\0') {
        // Remove hundreds, add next digit as units, adjust table.
        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.
    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}
Constant time isn't possible. All 1 million digits need to be looked at at least once, so that is a time complexity of O(n), where n = 1 million in this case.
For a simple O(n) solution, create an array of size 1000 that represents the number of occurrences of each possible 3 digit number. Advance 1 digit at a time, first index == 0, last index == 999997, and increment array[3 digit number] to create a histogram (count of occurrences for each possible 3 digit number). Then output the content of the array with counts > 1.
A million is small for the answer I give below. Expecting only that you have to be able to run the solution in the interview without a pause, the following works in less than two seconds and gives the required result:
from collections import Counter

def triple_counter(s):
    # range end is len(s) + 1 so the final triple is included
    c = Counter(s[n-3: n] for n in range(3, len(s) + 1))
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random
    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)
Hopefully the interviewer would be looking for use of the standard library's collections.Counter class.
Parallel execution version
I wrote a blog post on this with more explanation.
The simple O(n) solution would be to count each 3-digit number:
for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)
This would search through all 1 million digits 1000 times.
Traversing the digits only once:
counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)
Timing shows that iterating only once over the index is twice as fast as using count.
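A quick way to reproduce that comparison yourself (the harness below is mine; absolute numbers will vary by machine):

import random
import timeit

text = ''.join(random.choice('0123456789') for _ in range(1_000_000))

def with_count():
    # 1000 scans of the full string, each at C speed
    return [(nr, text.count('%03d' % nr)) for nr in range(1000)]

def one_pass():
    # a single pass over the indices
    counts = [0] * 1000
    for idx in range(len(text) - 2):
        counts[int(text[idx:idx + 3])] += 1
    return counts

print('count():  %.2fs' % timeit.timeit(with_count, number=1))
print('one pass: %.2fs' % timeit.timeit(one_pass, number=1))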
Here is a NumPy implementation of the "consensus" O(n) algorithm: walk through all triplets and bin as you go. The binning is done by, upon encountering say "385", adding one to bins[3, 8, 5], which is an O(1) operation. Bins are arranged in a 10x10x10 cube. As the binning is fully vectorized, there is no loop in the code.
def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit

for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))
Unsurprisingly, NumPy is a bit faster than @Daniel's pure Python solution on large data sets. Sample output:
# n = 10
# np 0.03481400 ms
# py 0.00669330 ms
# n = 1000
# np 0.11215360 ms
# py 0.34836530 ms
# n = 1000000
# np 82.46765980 ms
# py 360.51235450 ms
I would solve the problem as follows:
def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    # range(len(str_num) - 2) so the final triple is included
    for idx in range(len(str_num) - 2):
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict
Applied to your example string, this yields:
>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}
This solution runs in O(n) for n being the length of the provided string, and is, I guess, the best you can get.
As per my understanding, you cannot have a solution in constant time. It will take at least one pass over the million-digit number (assuming it's a string). You can use a 3-digit rolling iteration over the digits of the million-length number, and increase the value of a hash key by 1 if it already exists, or create a new hash key (initialized to 1) if it doesn't exist already in the dictionary.
The code will look something like this:
def calc_repeating_digits(number):
    hash = {}
    number = str(number)  # make sure we can slice it
    for i in range(len(number) - 2):
        current_three_digits = number[i:i+3]
        if current_three_digits in hash.keys():
            hash[current_three_digits] += 1
        else:
            hash[current_three_digits] = 1
    return hash
You can filter down to the keys which have item value greater than 1.
As mentioned in another answer, you cannot do this algorithm in constant time, because you must look at at least n digits. Linear time is the fastest you can get.
However, the algorithm can be done in O(1) space. You only need to store the counts of each 3 digit number, so you need an array of 1000 entries. You can then stream the number in.
My guess is that either the interviewer misspoke when they gave you the solution, or you misheard "constant time" when they said "constant space."
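A minimal sketch of that constant-space streaming idea (mirroring the rolling-value trick from the C answer above; the function name is mine):

def count_triples_stream(digits):
    """Consume digit characters one at a time; O(1) extra space."""
    freq = [0] * 1000
    val = 0
    for n, ch in enumerate(digits):
        # drop the hundreds digit, shift, and append the new digit
        val = (val % 100) * 10 + int(ch)
        if n >= 2:  # a full 3-digit window exists from the third digit on
            freq[val] += 1
    return [(num, c) for num, c in enumerate(freq) if c > 1]

print(count_triples_stream("123412345123456"))
# -> [(123, 3), (234, 3), (345, 2)]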
Here's my answer:
from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)

for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))
The array lookup method is very fast (even faster than @Paul Panzer's NumPy method!). Of course, it cheats, since it isn't technically finished after it completes, because it's returning a generator. It also doesn't have to check on every iteration whether the value already exists, which is likely to help a lot.
n = 10
counter 0.10595780 ms
dict 0.01070654 ms
array 0.00135370 ms
f_counter : []
f_dict : []
f_array : []
n = 1000
counter 2.89462101 ms
dict 0.40434612 ms
array 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter 2849.00500992 ms
dict 438.44007806 ms
array 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
Image as answer (not reproduced here): it looks like a sliding window.
Here is my solution:
from collections import defaultdict

string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}
With a bit of creativity in the for loop (and an additional lookup list with True/False/None, for example) you should be able to get rid of the last line, as you only want to create keys in the dict for triples seen more than once up to that point.
Hope it helps :)
Telling from the perspective of C:
- You can have an int 3-d array: results[10][10][10];
- Go from the 0th location to the (n-4)th location, where n is the size of the string array.
- At each location, check the current, next and next's next characters.
- Increment the counter as results[current][next][next's next]++;
- Print the values of
results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]
- It is O(n) time; there are no comparisons involved.
- You can run some parallel stuff here by partitioning the array and calculating the matches around the partitions.
inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print count

Binary mask with shift operation without cycle

We have some large binary number N (large means millions of digits). We also have a binary mask M, where a 1 means that we must remove the digit at that position in N and move all higher bits one position to the right.
Example:
N   = 100011101110
M   = 000010001000
Res = 1000110110
Is it possible to solve this problem without a loop, with some set of logical or arithmetic operations? We can assume that we have access to bignum arithmetic in Python.
It feels like it should be something like this:
Res = N - (N xor M)
But it doesn't work.
UPD: My current solution with a loop is the following:
def prepare_reduced_arrays(dict_of_N, mask):
    '''
    mask: string '0000011000'
    each element of dict_of_N - big python integer
    '''
    capacity = len(mask)
    answer = dict()
    for el in dict_of_N:
        answer[el] = 0
    new_capacity = 0
    for i in range(capacity - 1, -1, -1):
        if mask[i] == '1':
            continue
        cap2 = (1 << new_capacity)
        pos = (capacity - i - 1)
        for el in dict_of_N:
            current_bit = (dict_of_N[el] >> pos) & 1
            if current_bit:
                answer[el] |= cap2
        new_capacity += 1
    return answer, new_capacity
While this may not be possible without a loop in Python, it can be made extremely fast with numba and just-in-time compilation. I went on the assumption that your inputs could be easily represented as boolean arrays, which would be very simple to construct from a binary file using struct. The method I have implemented involves iterating over a few different objects; however, these iterations were chosen carefully to make sure they were compiler-optimized and never do the same work twice. The first iteration uses np.where to locate the indices of all the bits to delete. This specific function (among many others) is optimized by the numba compiler. I then use this list of bit indices to build the slice indices for the slices of bits to keep. The final loop copies these slices to an empty output array.
import numpy as np
from numba import jit
from time import time

def binary_mask(num, mask):
    num_nbits = num.shape[0]          # how many bits are in our big num
    mask_bits = np.where(mask)[0]     # which bits are we deleting
    mask_n_bits = mask_bits.shape[0]  # how many bits are we deleting

    start = np.empty(mask_n_bits + 1, dtype=int)  # preallocate array for slice start indexes
    start[0] = 0                # first slice starts at 0
    start[1:] = mask_bits + 1   # subsequent slices start 1 after each True bit in mask

    end = np.empty(mask_n_bits + 1, dtype=int)  # preallocate array for slice end indexes
    end[:mask_n_bits] = mask_bits    # each slice ends on (but does not include) True bits in the mask
    end[mask_n_bits] = num_nbits + 1 # last slice goes all the way to the end

    out = np.empty(num_nbits - mask_n_bits, dtype=np.uint8)  # preallocate return array

    for i in range(mask_n_bits + 1):  # for each slice
        a = start[i]  # use local variables to reduce number of lookups
        b = end[i]
        c = a - i
        d = b - i
        out[c:d] = num[a:b]  # copy slices
    return out

jit_binary_mask = jit("b1[:](b1[:], b1[:])")(binary_mask)  # decorator without syntax sugar

###################### Benchmark ########################
bignum = np.random.randint(0, 2, 1000000, dtype=bool)            # 1 million random bits
bigmask = np.random.randint(0, 10, 1000000, dtype=np.uint8) == 9 # delete about 1 in 10 bits

t = time()
for _ in range(10):  # 10 cycles of just numpy implementation
    out = binary_mask(bignum, bigmask)
print(f"non-jit: {time()-t} seconds")

t = time()
out = jit_binary_mask(bignum, bigmask)  # once ahead of time to compile
compile_and_run = time() - t

t = time()
for _ in range(10):  # 10 cycles of compiled numpy implementation
    out = jit_binary_mask(bignum, bigmask)
jit_runtime = time() - t
print(f"jit: {jit_runtime} seconds")
print(f"estimated compile_time: {compile_and_run - jit_runtime/10}")
In this example, I execute the benchmark on a boolean array of length 1,000,000 a total of 10 times, for both the compiled and un-compiled versions. On my laptop, the output is:
non-jit: 1.865583896636963 seconds
jit: 0.06370806694030762 seconds
estimated compile_time: 0.1652850866317749
As you can see, with a simple algorithm like this, very significant performance gains can be had from compilation (in my case, about a 20-30x speedup).
As far as I know, this can be done without the use of loops if and only if M is a power of 2.
Let's take your example, and modify M so that it is a power of 2:
N = 0b100011101110 = 2286
M = 0b000000001000 = 8
Removing the fourth lowest bit from N and shifting the higher bits to the right would result in:
N = 0b10001110110 = 1142
We achieved this using the following algorithm:
Begin with N = 0b100011101110 = 2286
Iterate from the most-significant bit to the least-significant bit in M.
If the current bit in M is set to 1, then store the lower bits in some variable, x:
x = 0b1101110
Then, subtract every bit up to and including the current bit in M from N, so that we end up with the following:
N - (0b10000000 + x) = N - (0b10000000 + 0b1101110) = 0b100011101110 - 0b11101110 = 0b100000000000
This step can also be achieved by and-ing the bits with 0, which may be more efficient.
Next, we shift the result once to the right:
0b100000000000 >> 1 = 0b10000000000
Finally, we add back x to the shifted result:
0b10000000000 + x = 0b10000000000 + 0b1101110 = 0b10001101110 = 1142
There may be a way to somehow do this without loops, but it is actually efficient to simply iterate over M (from the most-significant bit to the least-significant bit) and perform this process on every set bit, as the time complexity would be O(M.bit_length()).
I wrote up the code for this algorithm as well, and I believe it's relatively efficient, but I don't have any big binary numbers to test it with:
def remove_bits(N, M):
    bit = 2 ** (M.bit_length() - 1)
    while bit != 0:
        if M & bit:
            ones = bit - 1
            # Store lower `bit` bits.
            temp = N & ones
            # Clear lower bits, up to and including the current bit
            # (the current bit must be cleared too, as described above,
            # or its value leaks into the position below after the shift).
            N &= ~(ones | bit)
            # Shift once to the right.
            N >>= 1
            # Set stored lower `bit` bits.
            N |= temp
        bit >>= 1
    return N

if __name__ == '__main__':
    N = 0b100011101110
    M = 0b000010001000
    print(bin(remove_bits(N, M)))
Using your example, this returns your result: 0b1000110110
I don't think there's any way to do this in a constant number of calls to the built-in bitwise operators. Python would have to provide something like PEXT for that to be possible.
For literally millions of digits, you may actually get the best performance by working with sequences of bits, sacrificing the space advantages of Python ints and the time advantages of bitwise operations in favor of more flexibility in the operations you can perform. I don't know where the break-even point would be:
import itertools
bits = bin(N)[2:]
maskbits = bin(M)[2:].zfill(len(bits))
bits = bits.zfill(len(maskbits))
chosenbits = itertools.compress(bits, map('0'.__eq__, maskbits))
result = int(''.join(chosenbits), 2)

Find length of a string that includes its own length?

I want to get the length of a string, including the part of the string that represents its own length, without padding or using structs or anything like that which forces fixed lengths.
So for example I want to be able to take this string as input:
"A string|"
And return this:
"A string|11"
On the basis of the OP tolerating such an approach (and to provide an implementation technique for the eventual python answer), here's a solution in Java.
final String s = "A String|";
int n = s.length(); // `length()` returns the length of the string.
String t;           // the result

do {
    t = s + n; // append the stringified n to the original string
    if (n == t.length()) {
        return t; // string length no longer changing; we're good.
    }
    n = t.length(); // n must hold the total length
} while (true); // round again
The problem, of course, is that in appending n, the string length changes. But luckily, the length only ever increases or stays the same, so it converges very quickly, due to the logarithmic nature of the length of n. In this particular case, the attempted values of n are 9, 10, and 11. And that's a pernicious case.
A simple solution is:

def addlength(string):
    n1 = len(string)
    n2 = len(str(n1)) + n1
    n2 += len(str(n2)) - len(str(n1))  # a carry can arise
    return string + str(n2)

A possible carry will increase the length by at most one unit, hence the single adjustment.
Examples:
In [2]: addlength('a'*8)
Out[2]: 'aaaaaaaa9'
In [3]: addlength('a'*9)
Out[3]: 'aaaaaaaaa11'
In [4]: addlength('a'*99)
Out[4]: 'aaaaa...aaa102'
In [5]: addlength('a'*999)
Out[5]: 'aaaa...aaa1003'
Here is a simple Python port of Bathsheba's answer:

def str_len(s):
    n = len(s)
    t = ''
    while True:
        t = s + str(n)
        if n == len(t):
            return t
        n = len(t)
This is a much more clever and simple way than anything I was thinking of trying!
Suppose you had s = 'abcdefgh|'. On the first pass through, t = 'abcdefgh|9'. Since n != len(t) (which is now 10), it goes through again: t = 'abcdefgh|' + str(n), and str(n) = '10', so you have 'abcdefgh|10', which is still not quite right! Now n = len(t), which is finally 11, and you get it right then. Pretty clever solution!
It is a tricky one, but I think I've figured it out.
Done in a hurry in Python 2.7, please fully test - this should handle strings up to 998 characters:
import sys

orig = sys.argv[1]
origLen = len(orig)

if (origLen >= 98):
    extra = str(origLen + 3)
elif (origLen >= 8):
    extra = str(origLen + 2)
else:
    extra = str(origLen + 1)

final = orig + extra
print final
Results of very brief testing
C:\Users\PH\Desktop>python test.py "tiny|"
tiny|6
C:\Users\PH\Desktop>python test.py "myString|"
myString|11
C:\Users\PH\Desktop>python test.py "myStringWith98Characters.........................................................................|"
myStringWith98Characters.........................................................................|101
Just find the length of the string. Then iterate through each possible value of the number of digits the length of the resulting string could have. While iterating, check whether the sum of the number of digits to be appended and the initial string length equals the length of the resulting string.
def get_length(s):
    s = s + "|"
    result = ""
    len_s = len(s)
    i = 1
    while True:
        candidate = len_s + i
        if len(str(candidate)) == i:
            result = s + str(len_s + i)
            break
        i += 1
    return result  # return was missing in the original
This code gives the result.
I used a few vars, but at the end it shows the output you want:

def len_s(s):
    s = s + '|'
    b = len(s)
    z = s + str(b)
    length = len(z)
    new_s = s + str(length)
    new_len = len(new_s)
    return s + str(new_len)

s = "A string"
print len_s(s)
Here's a direct equation for this (so it's not necessary to construct the string). If s is the string, then the length of the string including the length of the appended length will be:
L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
The idea here is that a direct calculation is only problematic when the appended length will push the length past a power of ten; that is, at 9, 98, 99, 997, 998, 999, 9996, etc. To work this through, 1 + int(log10(len(s))) is the number of digits in the length of s. If we add that to len(s), then 9->10, 98->100, 99->101, etc, but still 8->9, 97->99, etc, so we can push past the power of ten exactly as needed. That is, adding this produces a number with the correct number of digits after the addition. Then do the log again to find the length of that number and that's the answer.
To test this:
from math import log10

def find_length(s):
    L1 = len(s) + 1 + int(log10(len(s) + 1 + int(log10(len(s)))))
    return L1

# test, just looking at lengths around 10**n
for i in range(9):
    for j in range(30):
        L = abs(10**i - j + 10) + 1
        s = "a"*L
        x0 = find_length(s)
        new0 = s+`x0`
        if len(new0)!=x0:
            print "error", len(s), x0, log10(len(s)), log10(x0)

Generate equation with the result value closest to the requested one, have speed problems

I am writing a quiz game and need the computer to solve one game in the quiz if the players fail to solve it.
Given data:
List of 6 numbers to use, for example 4, 8, 6, 2, 15, 50.
Targeted value, where 0 < value < 1000, for example 590.
Available operations are addition, subtraction, multiplication and division.
Parentheses can be used.
Generate a mathematical expression whose evaluation is equal, or as close as possible, to the target value. For example, for the numbers given above, the expression could be: (6 + 4) * 50 + 15 * (8 - 2) = 590
My algorithm is as follows:
Generate all permutations of all the subsets of the given numbers from (1) above
For each permutation generate all parenthesis and operator combinations
Track the closest value as algorithm runs
I can not think of any smart optimization to the brute-force algorithm above which would speed it up by an order of magnitude. Also, I must optimize for the worst case, because many quiz games will be run simultaneously on the server.
The code written today to solve this problem (relevant stuff extracted from the project):

from operator import add, sub, mul, div
import itertools

ops = ['+', '-', '/', '*']
op_map = {'+': add, '-': sub, '/': div, '*': mul}

# iterate over 1 permutation and generate parentheses and operator combinations
def iter_combinations(seq):
    if len(seq) == 1:
        yield seq[0], str(seq[0])
    else:
        for i in range(len(seq)):
            left, right = seq[:i], seq[i:]  # split input list at i`th place
            # generate cartesian product
            for l, l_str in iter_combinations(left):
                for r, r_str in iter_combinations(right):
                    for op in ops:
                        if op_map[op] is div and r == 0:  # cant divide by zero
                            continue
                        else:
                            yield op_map[op](float(l), r), \
                                  ('(' + l_str + op + r_str + ')')

numbers = [4, 8, 6, 2, 15, 50]
target = best_value = 590
best_item = None

for i in range(len(numbers)):
    for current in itertools.permutations(numbers, i+1):  # generate perms
        for value, item in iter_combinations(list(current)):
            if value < 0:
                continue
            if abs(target - value) < best_value:
                best_value = abs(target - value)
                best_item = item

print best_item
It prints ((((4*6)+50)*8)-2). I tested it a little with different values and it seems to work correctly. Also, I have a function to remove unnecessary parentheses, but it is not relevant to the question, so it is not posted.
The problem is that this runs very slowly, because of all the permutations, combinations and evaluations. On my MacBook Air it runs for a few minutes on one example. I would like to make it run in a few seconds tops on the same machine, because many quiz game instances will be run at the same time on the server. So the questions are:
Can I speed up current algorithm somehow (by orders of magnitude)?
Am I missing on some other algorithm for this problem which would run much faster?
You can build all the possible expression trees with the given numbers and evaluate them. You don't need to keep them all in memory, just print them when the target number is found:
First we need a class to hold the expression. It is better to design it to be immutable, so its value can be precomputed. Something like this:
class Expr:
    '''An Expr can be built with two different calls:
       -Expr(number) to build a literal expression
       -Expr(a, op, b) to build a complex expression.
        There a and b will be of type Expr,
        and op will be one of ('+','-', '*', '/').
    '''
    def __init__(self, *args):
        if len(args) == 1:
            self.left = self.right = self.op = None
            self.value = args[0]
        else:
            self.left = args[0]
            self.right = args[2]
            self.op = args[1]
            if self.op == '+':
                self.value = self.left.value + self.right.value
            elif self.op == '-':
                self.value = self.left.value - self.right.value
            elif self.op == '*':
                self.value = self.left.value * self.right.value
            elif self.op == '/':
                self.value = self.left.value // self.right.value

    def __str__(self):
        '''It can be done smarter not to print redundant parentheses,
           but that is out of the scope of this problem.
        '''
        if self.op:
            return "({0}{1}{2})".format(self.left, self.op, self.right)
        else:
            return "{0}".format(self.value)
Now we can write a recursive function that builds all the possible expression trees with a given set of expressions, and prints the ones that equal our target value. We will use the itertools module; that's always fun.
We can use itertools.combinations() or itertools.permutations(); the difference is in the order. Some of our operations are commutative and some are not, so we can use permutations() and accept that we will get many very similar solutions, or we can use combinations() and manually reorder the values when the operation is not commutative.
import itertools

OPS = ('+', '-', '*', '/')

def SearchTrees(current, target):
    ''' current is the current set of expressions.
        target is the target number.
    '''
    for a, b in itertools.combinations(current, 2):
        current.remove(a)
        current.remove(b)
        for o in OPS:
            # This checks whether this operation is commutative
            if o == '-' or o == '/':
                conmut = ((a, b), (b, a))
            else:
                conmut = ((a, b),)

            for aa, bb in conmut:
                # You do not specify what to do with the division.
                # I'm assuming that only integer divisions are allowed.
                if o == '/' and (bb.value == 0 or aa.value % bb.value != 0):
                    continue
                e = Expr(aa, o, bb)
                # If a solution is found, print it
                if e.value == target:
                    print(e.value, '=', e)
                current.add(e)
                # Recursive call!
                SearchTrees(current, target)
                # Do not forget to leave the set as it were before
                current.remove(e)
        # Ditto
        current.add(b)
        current.add(a)
And then the main call:
NUMBERS = [4, 8, 6, 2, 15, 50]
TARGET = 590
initial = set(map(Expr, NUMBERS))
SearchTrees(initial, TARGET)
And done! With these data I'm getting 719 different solutions in just over 21 seconds! Of course many of them are trivial variations of the same expression.
The 24 game is 4 numbers to target 24; your game is 6 numbers to target x (0 < x < 1000).
That's very similar.
Here is a quick solution: it gets all results and prints just one, in about 1-3 s on my rMBP. I think printing one solution is OK in this game :). I will explain it later:
def mrange(mask):
    # twice as fast, from Evgeny Kluev
    x = 0
    while x != mask:
        x = (x - mask) & mask
        yield x

def f(i):
    global s
    if s[i]:
        # get cached group
        return s[i]
    for x in mrange(i & (i - 1)):
        # when x & i == x,
        # x is a child group in group i
        # and i-x is also a child group in group i
        fk = fork(f(x), f(i - x))
        s[i] = merge(s[i], fk)
    return s[i]

def merge(s1, s2):
    if not s1:
        return s2
    if not s2:
        return s1
    for i in s2:
        # print just one way quickly
        s1[i] = s2[i]
        # combine all ways, slowly
        # if i in s1 :
        #     s1[i].update(s2[i])
        # else :
        #     s1[i] = s2[i]
    return s1

def fork(s1, s2):
    d = {}
    # fork s1 s2
    for i in s1:
        for j in s2:
            if not i + j in d:
                d[i + j] = getExp(s1[i], s2[j], "+")
            if not i - j in d:
                d[i - j] = getExp(s1[i], s2[j], "-")
            if not j - i in d:
                d[j - i] = getExp(s2[j], s1[i], "-")
            if not i * j in d:
                d[i * j] = getExp(s1[i], s2[j], "*")
            if j != 0 and not i / j in d:
                d[i / j] = getExp(s1[i], s2[j], "/")
            if i != 0 and not j / i in d:
                d[j / i] = getExp(s2[j], s1[i], "/")
    return d

def getExp(s1, s2, op):
    exp = {}
    for i in s1:
        for j in s2:
            exp['(' + i + op + j + ')'] = 1
            # just print one way
            break
        # just print one way
        break
    return exp

def check(s):
    num = 0
    for i in xrange(target, 0, -1):
        if i in s:
            if i == target:
                print numbers, target, "\nFind ", len(s[i]), 'ways'
                for exp in s[i]:
                    print exp, ' = ', i
            else:
                print numbers, target, "\nFind nearest ", i, 'in', len(s[i]), 'ways'
                for exp in s[i]:
                    print exp, ' = ', i
            break
    print '\n'

def game(numbers, target):
    global s
    s = [None]*(2**len(numbers))
    for i in xrange(0, len(numbers)):
        numbers[i] = float(numbers[i])
    n = len(numbers)
    for i in xrange(0, n):
        s[2**i] = { numbers[i]: {str(numbers[i]): 1} }
    for i in xrange(1, 2**n):
        # we will get the f(numbers) in s[2**n - 1]
        s[i] = f(i)
    check(s[2**n - 1])

numbers = [4, 8, 6, 2, 2, 5]
s = [None]*(2**len(numbers))
target = 590
game(numbers, target)

numbers = [1, 2, 3, 4, 5, 6]
target = 590
game(numbers, target)
Assume A is your list of 6 numbers.
We define f(A) as the set of all results that can be calculated using all the numbers in A; if we search f(A), we will find whether the target is in it, and get either the answer or the closest answer.
We can split A into two real child groups, A1 and A-A1 (A1 not empty and not equal to A), which cuts the problem from f(A) down to f(A1) and f(A-A1), because f(A) = Union( a+b, a-b, b-a, a*b, a/b (b!=0), b/a (a!=0) ), where a is in f(A1) and b is in f(A-A1).
We write f(A) = Union( fork(A1, A-A1) ) for such a process. We can remove all duplicate values in fork(), which cuts the range and makes the program faster.
So, if A = [1,2,3,4,5,6], then f(A) = fork( f([1]), f([2,3,4,5,6]) ) U ... U fork( f([1,2,3]), f([4,5,6]) ) U ..., where U stands for Union.
We will see that f([2,3,4,5,6]) = fork( f([2,3]), f([4,5,6]) ) U ..., and f([3,4,5,6]) = fork( f([3]), f([4,5,6]) ) U ...; f([4,5,6]) is used in both.
So if we can cache every f([...]), the program can be faster.
There are 2^len(A) - 2 possible pairs (A1, A-A1) in A. We can use binary to represent them.
For example: if A = [1,2,3,4,5,6] and A1 = [1,2,3], then the binary 000111 (7) stands for A1. For A2 = [1,3,5], the binary 010101 (21) stands for A2. For A3 = [1], the binary 000001 (1) stands for A3...
So we get a way to represent every group within A; we can cache them and make the whole process faster!
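For illustration, here is a minimal sketch of this bitmask encoding (the helper names are mine): bit i set in a mask means numbers[i] belongs to the group, and every proper split (A1, A-A1) of group i is then a pair (x, i - x) where x runs over the submasks of i.

numbers = [1, 2, 3, 4, 5, 6]

def group(mask):
    """Return the sub-list of numbers selected by the bitmask."""
    return [n for i, n in enumerate(numbers) if mask & (1 << i)]

print(group(0b000111))  # [1, 2, 3], the A1 example
print(group(0b010101))  # [1, 3, 5], the A2 example

# enumerate every proper, non-empty submask x of group i,
# i.e. every split (x, i - x) that f() forks on
i = 0b000111
x = (i - 1) & i
while x:
    print(bin(x), bin(i - x))
    x = (x - 1) & i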
All combinations for six numbers, four operations and parentheses add up to at least 5 * 9!. So I think you should use some AI algorithm. Using genetic programming or optimization seems to be the path to follow.
In the book Programming Collective Intelligence, in chapter 11, Evolving Intelligence, you will find exactly what you want and much more. That chapter explains how to find a mathematical function combining operations and numbers (as you want) to match a result. You will be surprised how easy such a task is.
P.S.: The examples are written in Python.
I would try using an AST; at least it will make your expression generation part easier (no need to mess with brackets).
http://en.wikipedia.org/wiki/Abstract_syntax_tree
1) Generate some tree with N nodes (N = the count of numbers you have). I've read before how many of those there are; their count is serious as N grows. By serious I mean more than polynomial, to say the least.
2) Now just start changing the operations in the non-leaf nodes and keep evaluating the result.
But this is again backtracking and too much degree of freedom.
This is a computationally complex task you're posing. I believe if you ask the question as you did ("let's generate a number K on the output such that |K-V| is minimal", where V is the pre-defined desired result, i.e. 590 in your example), then I guess this problem is even NP-complete. Somebody please correct me if my intuition is lying to me.
So I think even the generation of all possible ASTs (assuming only 1 operation is allowed) is NP-complete, as their count is not polynomial. Not to mention that more than 1 operation is allowed here, and not to mention the minimal-difference requirement (between the result and the desired result).
1. Fast entirely online algorithm
The idea is to search not for a single expression for the target value, but for an equation where the target value is included in one part of the equation and both parts have almost equal numbers of operations (2 and 3).
Since each part of the equation is relatively small, it does not take much time to generate all possible expressions for the given input values.
After both parts of the equation are generated, it is possible to scan a pair of sorted arrays containing the values of these expressions and find a pair of equal (or at least best matching) values in them. After two matching values are found, we can get the corresponding expressions and join them into a single expression (in other words, solve the equation).
To join two expression trees together, we can descend from the root of one tree to the "target" leaf, for each node on this path invert the corresponding operation ('*' to '/', '/' to '*' or '/', '+' to '-', '-' to '+' or '-'), and move the "inverted" root node to the other tree (also as root node).
This algorithm is faster and easier to implement when all operations are invertible. So it is best used with floating-point division (as in my implementation) or with rational division. Truncating integer division is the most difficult case because it produces the same result for different inputs (42/25=1 and 25/25 is also 1). With zero-remainder integer division this algorithm gives a result almost instantly when an exact result is available, but needs some modifications to work correctly when an approximate result is needed.
See implementation on Ideone.
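For concreteness, here is a hedged sketch of just the matching step described above: given the values reachable from each half (generating those, together with their expressions, is the bulk of the algorithm), find the best-matching pair across the two lists. The names and the binary-search formulation are mine:

import bisect

def best_match(left_vals, right_vals):
    """Return (a, b), a from left_vals, b from right_vals, minimizing |a - b|."""
    right_sorted = sorted(right_vals)
    best = None
    for a in left_vals:
        i = bisect.bisect_left(right_sorted, a)
        for j in (i - 1, i):  # only the neighbours of the insertion point matter
            if 0 <= j < len(right_sorted):
                b = right_sorted[j]
                if best is None or abs(a - b) < abs(best[0] - best[1]):
                    best = (a, b)
    return best

print(best_match([3.0, 10.5, 40.0], [1.0, 9.0, 42.5]))  # -> (10.5, 9.0)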
2. Even faster approach with off-line pre-processing
As noticed by @WolframH, there are not so many possible input number combinations: only 3*3*C(9+4-1, 4) = 4455 if repetitions are possible, or 3*3*C(9, 4) = 1134 without duplicates. This allows us to pre-process all possible inputs off-line, store the results in compact form, and, when some particular result is needed, quickly unpack one of the pre-processed values.
The pre-processing program should take an array of 6 numbers and generate values for all possible expressions. Then it should drop out-of-range values and find the nearest result for all cases where there is no exact match. All of this can be performed by the algorithm proposed by @Tim; his code needs only minimal modifications to do it. It is also the fastest alternative (yet).
Since pre-processing is off-line, we can use something better than interpreted Python. One alternative is PyPy; another is to use some fast interpreted language. Pre-processing all possible inputs should not take more than several minutes.
Speaking about the memory needed to store all pre-processed values, the only problem is the resulting expressions. If stored in string form, they will take up to 4455*999*30 bytes, or 120Mb. But each expression can be compressed. It may be represented in postfix notation like this: arg1 arg2 + arg3 arg4 + *. To store this we need 10 bits for all the arguments' permutations, 10 bits for the 5 operations, and 8 bits to specify how arguments and operations are interleaved (6 arguments + 5 operations - 3 pre-defined positions: the first two are always arguments, the last one is always an operation). That is 28 bits per tree, or 4 bytes, which means only 20Mb for the entire data set with duplicates, or 5Mb without them.
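As a hedged sketch of that packing (the field layout below is one plausible concrete choice, not necessarily the author's): 10 bits for the permutation index of the 6 arguments, 2 bits for each of the 5 operations, and 8 bits for the interleaving pattern.

OPS = '+-*/'

def pack(perm_index, ops, interleave):
    """perm_index < 720 (6! permutations), ops: 5 chars, interleave < 256."""
    packed = perm_index                           # bits 0-9
    for k, op in enumerate(ops):                  # bits 10-19, 2 bits each
        packed |= OPS.index(op) << (10 + 2 * k)
    packed |= interleave << 20                    # bits 20-27
    return packed                                 # fits in 28 bits

def unpack(packed):
    perm_index = packed & 0x3FF
    ops = ''.join(OPS[(packed >> (10 + 2 * k)) & 3] for k in range(5))
    interleave = (packed >> 20) & 0xFF
    return perm_index, ops, interleave

assert unpack(pack(719, '+*-/+', 0b10110100)) == (719, '+*-/+', 0b10110100)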
3. Slow entirely online algorithm
There are some ways to speed up the algorithm in the OP:
- The greatest speed improvement may be achieved if we avoid trying each commutative operation twice and make the recursion tree less branchy.
- Some optimization is possible by removing all branches where the result of the division operation is zero.
- Memoization (dynamic programming) cannot give a significant speed boost here, but still it may be useful.
After enhancing the OP's approach with these ideas, approximately a 30x speedup is achieved:
from itertools import combinations

numbers = [4, 8, 6, 2, 15, 50]
target = best_value = 590
best_item = None
subsets = {}

def get_best(value, item):
    global best_value, target, best_item
    if value >= 0 and abs(target - value) < best_value:
        best_value = abs(target - value)
        best_item = item
    return value, item

def compare_one(value, op, left, right):
    item = ('(' + left + op + right + ')')
    return get_best(value, item)

def apply_one(left, right):
    yield compare_one(left[0] + right[0], '+', left[1], right[1])
    yield compare_one(left[0] * right[0], '*', left[1], right[1])
    yield compare_one(left[0] - right[0], '-', left[1], right[1])
    yield compare_one(right[0] - left[0], '-', right[1], left[1])
    if right[0] != 0 and left[0] >= right[0]:
        yield compare_one(left[0] / right[0], '/', left[1], right[1])
    if left[0] != 0 and right[0] >= left[0]:
        yield compare_one(right[0] / left[0], '/', right[1], left[1])

def memorize(seq):
    fs = frozenset(seq)
    if fs in subsets:
        for x in subsets[fs].items():
            yield x
    else:
        subsets[fs] = {}
        for value, item in try_all(seq):
            subsets[fs][value] = item
            yield value, item

def apply_all(left, right):
    for l in memorize(left):
        for r in memorize(right):
            for x in apply_one(l, r):
                yield x

def try_all(seq):
    if len(seq) == 1:
        yield get_best(numbers[seq[0]], str(numbers[seq[0]]))
    for length in range(1, len(seq)):
        for x in combinations(seq[1:], length):
            for value, item in apply_all(list(x), list(set(seq) - set(x))):
                yield value, item

for x, y in try_all([0, 1, 2, 3, 4, 5]): pass
print best_item
More speed improvements are possible if you add some constraints to the problem:
- Integer division is only allowed when the remainder is zero.
- All intermediate results are to be non-negative and/or below 1000.
Well, I won't give up. Following the line of all the answers to your question, I came up with another algorithm. This algorithm gives the solution in an average time of 3 milliseconds.
#! -*- coding: utf-8 -*-
import copy

numbers = [4, 8, 6, 2, 15, 50]
target = 590

operations = {
    '+': lambda x, y: x + y,
    '-': lambda x, y: x - y,
    '*': lambda x, y: x * y,
    '/': lambda x, y: y == 0 and 1e30 or x / y  # Handle zero division
}

def chain_op(target, numbers, result=None, expression=""):
    if len(numbers) == 0:
        return (expression, result)
    else:
        for choosen_number in numbers:
            remaining_numbers = copy.copy(numbers)
            remaining_numbers.remove(choosen_number)
            if result is None:
                return chain_op(target, remaining_numbers, choosen_number, str(choosen_number))
            else:
                incomming_results = []
                for key, op in operations.items():
                    new_result = op(result, choosen_number)
                    new_expression = "%s%s%d" % (expression, key, choosen_number)
                    incomming_results.append(chain_op(target, remaining_numbers, new_result, new_expression))
                diff = 1e30
                selected = None
                for exp_result in incomming_results:
                    exp, res = exp_result
                    if abs(res - target) < diff:
                        diff = abs(res - target)
                        selected = exp_result
                    if diff == 0:
                        break
                return selected

if __name__ == '__main__':
    print chain_op(target, numbers)
Erratum: This algorithm does not include the solutions containing parentheses, so it does not always hit the target or the closest result; my bad. Still, it is pretty fast. It can be adapted to support parentheses without much work.
Actually there are two things that you can do to speed up the time to milliseconds.
You are trying to find a solution for a given quiz by generating the numbers and the target number. Instead, you can generate the solution and just remove the operations. You can build something smart that will generate several quizzes and choose the most interesting one; however, in this case you lose the "as close as possible" option.
Another way to go is pre-calculation. Solve 100 quizzes, use them as built-ins in your application, and generate new ones on the fly, trying to keep your quiz stack at 100; also try to give the user only the new quizzes. I had the same problem in my bible games, and I used this method to speed things up. Instead of 10 seconds per question it takes me milliseconds, as I am generating new questions in the background and always keeping my stack at 100.
What about dynamic programming, since you need the same intermediate results to calculate other options?

Efficient way to find longest duplicate string for Python (From Programming Pearls)

From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using a suffix array:
example = open("iliad10.txt").read()

def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)

idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  # VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found it very slow at the idx.sort step. I think it's slow because Python needs to copy the substrings by value instead of passing a pointer (as the C code above does).
The tested file can be downloaded from here
The C codes need only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and killed it).
Does anyone have ideas for how to make the code efficient? (For example, in less than 10 seconds.)
My solution is based on suffix arrays, constructed by prefix doubling the longest common prefix. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter
def longest_common_substring(text):
"""Get the longest common substrings and their positions.
>>> longest_common_substring('banana')
{'ana': [1, 3]}
>>> text = "not so Agamemnon, who spoke fiercely to "
>>> sorted(longest_common_substring(text).items())
[(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]
This function can be easy modified for any criteria, e.g. for searching ten
longest non overlapping repeated substrings.
"""
sa, rsa, lcp = suffix_array(text)
maxlen = max(lcp)
result = {}
for i in range(1, len(text)):
if lcp[i] == maxlen:
j1, j2, h = sa[i - 1], sa[i], lcp[i]
assert text[j1:j1 + h] == text[j2:j2 + h]
substring = text[j1:j1 + h]
if not substring in result:
result[substring] = [j1]
result[substring].append(j2)
return dict((k, sorted(v)) for k, v in result.items())
def suffix_array(text, _step=16):
"""Analyze all common strings in the text.
Short substrings of the length _step a are first pre-sorted. The are the
results repeatedly merged so that the garanteed number of compared
characters bytes is doubled in every iteration until all substrings are
sorted exactly.
Arguments:
text: The text to be analyzed.
_step: Is only for optimization and testing. It is the optimal length
of substrings used for initial pre-sorting. The bigger value is
faster if there is enough memory. Memory requirements are
approximately (estimate for 32 bit Python 3.3):
len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB
Return value: (tuple)
(sa, rsa, lcp)
sa: Suffix array for i in range(1, size):
assert text[sa[i-1]:] < text[sa[i]:]
rsa: Reverse suffix array for i in range(size):
assert rsa[sa[i]] == i
lcp: Longest common prefix for i in range(1, size):
assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
if sa[i-1] + lcp[i] < len(text):
assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
>>> suffix_array(text='banana')
([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
"""
tx = text
size = len(tx)
step = min(max(_step, 1), len(tx))
sa = list(range(len(tx)))
sa.sort(key=lambda i: tx[i:i + step])
grpstart = size * [False] + [True] # a boolean map for iteration speedup.
# It helps to skip yet resolved values. The last value True is a sentinel.
rsa = size * [None]
stgrp, igrp = '', 0
for i, pos in enumerate(sa):
st = tx[pos:pos + step]
if st != stgrp:
grpstart[igrp] = (igrp < i - 1)
stgrp = st
igrp = i
rsa[pos] = igrp
sa[i] = pos
grpstart[igrp] = (igrp < size - 1 or size == 0)
while grpstart.index(True) < size:
# assert step <= size
nextgr = grpstart.index(True)
while nextgr < size:
igrp = nextgr
nextgr = grpstart.index(True, igrp + 1)
glist = []
for ig in range(igrp, nextgr):
pos = sa[ig]
if rsa[pos] != igrp:
break
newgr = rsa[pos + step] if pos + step < size else -1
glist.append((newgr, pos))
glist.sort()
for ig, g in groupby(glist, key=itemgetter(0)):
g = [x[1] for x in g]
sa[igrp:igrp + len(g)] = g
grpstart[igrp] = (len(g) > 1)
for pos in g:
rsa[pos] = igrp
igrp += len(g)
step *= 2
del grpstart
# create LCP array
lcp = size * [None]
h = 0
for i in range(size):
if rsa[i] > 0:
j = sa[rsa[i] - 1]
while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
h += 1
lcp[rsa[i]] = h
if h > 0:
h -= 1
if size > 0:
lcp[0] = 0
return sa, rsa, lcp
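For reference, a minimal driver over the file from the question might look like this (a sketch; the output format is my own):

with open("iliad.mb.txt") as f:
    text = f.read()

for substring, positions in longest_common_substring(text).items():
    print(len(substring), positions)  # length of the duplicate and where it occurs
    print(substring[:80])             # preview the first 80 characters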
I prefer this solution over a more complicated O(n log n) one because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the linear-time operations required by the method from that article, which is O(n) only under very special assumptions: random strings over a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question, if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent e.g. in "independent" school homework collections. The program should not freeze on it.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
A straightforward translation of the algorithm into Python (Python 2, hence imap, izip and xrange):
from itertools import imap, izip, starmap, tee
from os.path import commonprefix
def pairwise(iterable): # itertools recipe
a, b = tee(iterable)
next(b, None)
return izip(a, b)
def longest_duplicate_small(data):
suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
return max(imap(commonprefix, pairwise(suffixes)), key=len)
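(On Python 3, where imap, izip and xrange are gone, a direct port might look like this; the _py3 names are mine:)

from itertools import tee
from os.path import commonprefix

def pairwise_py3(iterable):  # same itertools recipe, Python 3 flavour
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def longest_duplicate_small_py3(data):
    suffixes = sorted(data[i:] for i in range(len(data)))  # O(n*n) in memory
    return max(map(commonprefix, pairwise_py3(suffixes)), key=len)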
buffer() allows getting a substring without copying:
def longest_duplicate_buffer(data):
n = len(data)
sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
def lcp_item(i, j): # find longest common prefix array item
start = i
while i < n and data[i] == data[i + j - start]:
i += 1
return i - start, start
size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
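For instance, with the sa and lcp arrays returned by the suffix_array() function from the answer above, extracting one longest duplicate takes only a few extra lines (a sketch; the function name is mine, and empty input is not handled):

def longest_duplicate_sa(text):
    sa, rsa, lcp = suffix_array(text)
    h = max(lcp)               # length of the longest repeated substring
    i = lcp.index(h)           # adjacent suffix pair that shares it
    return text[sa[i]:sa[i] + h]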
Note: the *_memoryview() version below has been superseded by the *_buffer() version above.
A more memory-efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
for x, y in izip(a, b):
if x < y:
return -1
elif x > y:
return 1
return cmp(len(a), len(b))
def common_prefix_memoryview((a, b)):
for i, (x, y) in enumerate(izip(a, b)):
if x != y:
return a[:i]
return a if len(a) < len(b) else b
def longest_duplicate(data):
mv = memoryview(data)
suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces a wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
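(On Python 3 the same ordering comparison raises instead; session sketch on recent CPython:)

>>> s = b"abc"
>>> memoryview(s[0:]) < memoryview(s[1:])
Traceback (most recent call last):
  ...
TypeError: '<' not supported between instances of 'memoryview' and 'memoryview'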
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead, to get a reference rather than a copy. When I did this, the program hung after idx.sort (which was very fast).
I'm sure with a little work, you can get the rest working.
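To illustrate the copy-versus-reference point, a minimal sketch (the file name is the one from the question):

data = open("iliad.mb.txt", "rb").read()
mv = memoryview(data)
tail_view = mv[1000:]    # O(1): a view into the same buffer, no copy
tail_copy = data[1000:]  # O(n): slicing bytes copies the whole tail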
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>
int main() {
char* test1 = "ovided by The Internet Classics Archive";
char* test2 = "rovided by The Internet Classics Archive.";
printf("%d\n", strcmp(test1, test2));
}
And compare the result to this python:
test1 = "ovided by The Internet Classics Archive"
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine, while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort, after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed plenty of values outside that range (3, -31 and 5 were the first three).
To make sure -3 wasn't some error code: if we reverse test1 and test2, we get 3.
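(Side note: cmp() itself was removed in Python 3; a sign-only equivalent, which is all a qsort-style comparator needs, is:)

def cmp(a, b):
    # matches Python 2 semantics: negative, zero or positive
    return (a > b) - (a < b)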
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunk of code. I realized this just as I shut my laptop and left the wifi zone... I really should double-check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:
#!/usr/bin/env python
ex = open("iliad.mb.txt").read()
chains = dict()
# populate initial chains dictionary
for (a, b) in enumerate(zip(ex, ex[1:])):
    s = ''.join(b)
    if s not in chains:
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains):
    new_chains = dict()
    for (string, pos) in chains:
        offset = len(string)
        for p in pos:
            if p + offset >= len(ex):
                break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains:
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1:
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i, chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings together with the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on and so forth...
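A tiny demonstration of one populate-and-filter step on 'banana' (Python 3 here; the names are mine):

ex = "banana"
chains = {}
for pos, pair in enumerate(zip(ex, ex[1:])):
    chains.setdefault("".join(pair), []).append(pos)
# keep only the 2-character substrings that occur more than once
print({s: p for s, p in chains.items() if len(p) > 1})
# -> {'an': [1, 3], 'na': [2, 4]}; these would be grown to 3 characters next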
