I am currently using Python and NumPy to compute correlations between two lists, data_0 and data_1. Each list contains sorted times, t0 and t1 respectively.
I want to find all the events where 0 < t1 - t0 < t_max.
for time_0 in np.nditer(data_0):
    delta_time = np.subtract(data_1, np.full(data_1.size, time_0))
    delta_time = delta_time[delta_time >= 0]
    delta_time = delta_time[delta_time < time_max]
Because the lists are sorted, this selects a subarray of data_1 of the form data_1[index_min: index_max].
So in fact I only need to find two indices to get what I want.
And what's interesting is that when I move on to the next time_0, since data_0 is also sorted, I just need to find the new index_min / index_max such that new_index_min >= index_min and new_index_max >= index_max.
This means I don't need to scan all of data_1 again from scratch for each time_0.
I have implemented such a solution without the NumPy methods (just with while loops) and it gives me the same results as before, but much more slowly (15 times longer!).
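Roughly, that while-loop version might look like this (a reconstruction, since the original loop code was not posted; it reuses data_0, data_1 and time_max from above):

index_min = index_max = 0
events = []
for time_0 in data_0:
    # advance the lower bound to the first t1 with t1 - time_0 >= 0
    while index_min < len(data_1) and data_1[index_min] < time_0:
        index_min += 1
    # advance the upper bound to the first t1 with t1 - time_0 >= time_max
    while index_max < len(data_1) and data_1[index_max] < time_0 + time_max:
        index_max += 1
    events.append(data_1[index_min:index_max])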
Since this approach normally requires fewer calculations, I think there should be a way to make it faster using NumPy methods, but I don't know how to do it.
Does anyone have an idea?
I am not sure if I am being clear, so if you have any questions, do not hesitate to ask.
Thank you in advance,
Paul
Here is a vectorized approach using argsort. It uses a strategy similar to your avoid-full-scan idea:
import numpy as np

def find_gt(ref, data, incl=True):
    out = np.empty(len(ref) + len(data) + 1, int)
    total = (data, ref) if incl else (ref, data)
    out[1:] = np.argsort(np.concatenate(total), kind='mergesort')
    out[0] = -1
    split = (out < len(data)) if incl else (out >= len(ref))
    if incl:
        out[~split] -= len(data)
    split[0] = False
    return np.maximum.accumulate(np.where(split, -1, out))[split] + 1

def find_intervals(ref, data, span, incl=(True, True)):
    index_min = find_gt(ref, data, incl[0])
    index_max = len(ref) - find_gt(-ref[::-1], -span - data[::-1], incl[1])[::-1]
    return index_min, index_max

ref = np.sort(np.random.randint(0, 20000, (10000,)))
data = np.sort(np.random.randint(0, 20000, (10000,)))
span = 2
idmn, idmx = find_intervals(ref, data, span, (True, True))

print('checking')
for d, mn, mx in zip(data, idmn, idmx):
    assert mn == len(ref) or ref[mn] >= d
    assert mn == 0 or ref[mn-1] < d
    assert mx == len(ref) or ref[mx] > d + span
    assert mx == 0 or ref[mx-1] <= d + span
print('ok')
It works by:
- indirectly sorting both sets together
- finding, for each time in one set, the preceding time in the other (this is done using np.maximum.accumulate)
- applying the preceding steps twice, the second time with the times in one set shifted by span
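Because both arrays are sorted, np.searchsorted offers an independent sanity check of the same bounds (my addition, not part of the original approach; it reuses ref, data, span, idmn and idmx from the snippet above):

import numpy as np

# For the inclusive case incl=(True, True) demonstrated above, the interval
# bounds must agree with plain binary searches on the sorted ref array.
assert np.array_equal(idmn, np.searchsorted(ref, data, side='left'))
assert np.array_equal(idmx, np.searchsorted(ref, data + span, side='right'))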
I am trying to search a list (DB) for possible matches of fragments of text. For instance, I have a DB with the text "evilman". I want to use user input to search for any possible matches in the DB and give the answer with a confidence. If the user inputs "hello", there are no possible matches. If the user inputs "evil", the possible match is evilman with a confidence of 57% (4 out of 7 characters match), and so on.
However, I also want a way to match input text such as "evxxman". 5 out of 7 characters of evxxman match the text "evilman" in the DB, but a simple check in Python will say no match, since it only finds text that matches consecutively. I hope it makes sense. Thanks
Following is my code:
db = []
possible_signs = []
db.append("evilman")
text = raw_input()
count = 0

for s in db:
    if text in s:
        if len(text) >= len(s)/2:
            possible_signs.append(s)
            count += 1
            confidence = (float(len(text)) / float(len(s))) * 100
            print "Confidence:", '%.2f' % (confidence), "<possible match:>", possible_signs[0]
This first version seems to match your examples. It makes the strings "slide" against each other and counts the number of identical characters.
The ratio is computed by dividing the character count by the reference string length. Take the max and voilà.
Call it for each string in your DB.
def commonChars(txt, ref):
    txtLen = len(txt)
    refLen = len(ref)
    r = 0
    # slide txt against ref; i indexes each possible alignment
    for i in range(refLen + (txtLen - 1)):
        rStart = abs(min(0, txtLen - i - 1))
        tStart = txtLen - i - 1 if i < txtLen else 0
        l = min(txtLen - tStart, refLen - rStart)  # length of the overlap
        c = 0
        for j in range(l):
            if txt[tStart + j] == ref[rStart + j]:
                c += 1
        r = max(r, c / refLen)
    return r

print(commonChars('evxxman', 'evilman'))  # 0.7142857142857143
print(commonChars('evil', 'evilman'))     # 0.5714285714285714
print(commonChars('man', 'evilman'))      # 0.42857142857142855
print(commonChars('batman', 'evilman'))   # 0.42857142857142855
print(commonChars('batman', 'man'))       # 1.0
This second version produces the same results, but uses the difflib module mentioned in other answers.
It computes the matching blocks, sums their lengths, and computes the ratio against the reference length.
import difflib

def commonBlocks(txt, ref):
    matcher = difflib.SequenceMatcher(a=txt, b=ref)
    matchingBlocks = matcher.get_matching_blocks()
    matchingCount = sum(b.size for b in matchingBlocks)
    return matchingCount / len(ref)

print(commonBlocks('evxxman', 'evilman'))    # 0.7142857142857143
print(commonBlocks('evxxxxman', 'evilman'))  # 0.7142857142857143
As shown by the calls above, the behavior is slightly different: "holes" between matching blocks are ignored and do not change the final ratio.
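To make the "holes" remark concrete, this is what get_matching_blocks() returns for the first example (illustration added here, not part of the original answer):

import difflib

m = difflib.SequenceMatcher(a='evxxman', b='evilman')
print(m.get_matching_blocks())
# [Match(a=0, b=0, size=2), Match(a=4, b=4, size=3), Match(a=7, b=7, size=0)]
# The sizes sum to 5, hence the 5/7 ratio; the 'xx'/'il' hole is skipped.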
For finding matches with a quality-estimation, have a look at difflib.SequenceMatcher.ratio and friends - these functions might not be the fastest match-checkers but they are easy to use.
Example copied from difflib docs
>>> from difflib import SequenceMatcher
>>> s = SequenceMatcher(None, "abcd", "bcde")
>>> s.ratio()
0.75
>>> s.quick_ratio()
0.75
>>> s.real_quick_ratio()
1.0
Based on your description and examples, it seems to me that you're actually looking for something like the Levenshtein (or edit) distance. Note that it does not quite give the scores you specify, but I think it gives the scores you actually want.
There are several packages implementing this efficiently, e.g., distance:
In [1]: import distance
In [2]: distance.levenshtein('evilman', 'hello')
Out[2]: 6L
In [3]: distance.levenshtein('evilman', 'evil')
Out[3]: 3L
In [4]: distance.levenshtein('evilman', 'evxxman')
Out[4]: 2L
Note that the library contains several measures of similarity, e.g., jaccard and sorensen, which return a normalized value by default:
>>> distance.sorensen("decide", "resize")
0.5555555555555556
>>> distance.jaccard("decide", "resize")
0.7142857142857143
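If you want the confidence percentage from the question, one possible normalization of the edit distance (my suggestion, not part of the distance library) is:

import distance

def confidence(text, candidate):
    # normalize by the longer string so the result lies in [0, 100]
    dist = distance.levenshtein(text, candidate)
    return (1.0 - float(dist) / max(len(text), len(candidate))) * 100

print(confidence('evxxman', 'evilman'))  # 71.42... (5 of 7 characters match)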
Create a while loop and track two iterators, one for your key word ("evil") and one for your query word ("evilman"). Here is a sketch in Python:
key = "evil"
query = "evilman"
key_iterator = 0
query_iterator = 0
confidence_score = 0
while( key_iterator < key.length && query_iterator < query.length ) {
if (key[key_iterator] == query[query_iterator]) {
confidence_score++
key_iterator++
}
query_iterator++
}
// If we didnt reach the end of the key
if (key_iterator != key.length) {
confidence_score = 0
}
print ("Confidence: " + confidence_score + " out of " + query.length)
I am writing a quiz game and need the computer to solve one game in the quiz if the players fail to solve it.
Given data:
A list of 6 numbers to use, for example 4, 8, 6, 2, 15, 50.
A target value, where 0 < value < 1000, for example 590.
Available operations: addition, subtraction, multiplication and division.
Parentheses can be used.
Generate a mathematical expression whose evaluation is equal, or as close as possible, to the target value. For example, for the numbers given above, the expression could be: (6 + 4) * 50 + 15 * (8 - 2) = 590
My algorithm is as follows:
Generate all permutations of all the subsets of the given numbers from (1) above
For each permutation generate all parentheses and operator combinations
Track the closest value as the algorithm runs
I cannot think of any smart optimization to the brute-force algorithm above that would speed it up by an order of magnitude. I also must optimize for the worst case, because many quiz games will run simultaneously on the server.
The code written today to solve this problem (relevant parts extracted from the project):
from operator import add, sub, mul, div
import itertools

ops = ['+', '-', '/', '*']
op_map = {'+': add, '-': sub, '/': div, '*': mul}

# iterate over one permutation and generate parentheses and operator combinations
def iter_combinations(seq):
    if len(seq) == 1:
        yield seq[0], str(seq[0])
    else:
        for i in range(len(seq)):
            left, right = seq[:i], seq[i:]  # split input list at i-th place
            # generate cartesian product
            for l, l_str in iter_combinations(left):
                for r, r_str in iter_combinations(right):
                    for op in ops:
                        if op_map[op] is div and r == 0:  # can't divide by zero
                            continue
                        else:
                            yield op_map[op](float(l), r), \
                                ('(' + l_str + op + r_str + ')')

numbers = [4, 8, 6, 2, 15, 50]
target = best_value = 590
best_item = None

for i in range(len(numbers)):
    for current in itertools.permutations(numbers, i + 1):  # generate perms
        for value, item in iter_combinations(list(current)):
            if value < 0:
                continue
            if abs(target - value) < best_value:
                best_value = abs(target - value)
                best_item = item

print best_item
It prints: ((((4*6)+50)*8)-2). I tested it a little with different values and it seems to work correctly. I also have a function to remove unnecessary parentheses, but it is not relevant to the question, so it is not posted.
The problem is that this runs very slowly because of all the permutations, combinations and evaluations. On my MacBook Air it runs for a few minutes on one example. I would like it to run in a few seconds at most on the same machine, because many quiz game instances will run at the same time on the server. So the questions are:
Can I speed up the current algorithm somehow (by orders of magnitude)?
Am I missing some other algorithm for this problem that would run much faster?
You can build all the possible expression trees with the given numbers and evaluate them. You don't need to keep them all in memory, just print them when the target number is found:
First we need a class to hold the expression. It is better to design it to be immutable, so its value can be precomputed. Something like this:
class Expr:
    '''An Expr can be built with two different calls:
       -Expr(number) to build a literal expression
       -Expr(a, op, b) to build a complex expression.
        There a and b will be of type Expr,
        and op will be one of ('+','-', '*', '/').
    '''
    def __init__(self, *args):
        if len(args) == 1:
            self.left = self.right = self.op = None
            self.value = args[0]
        else:
            self.left = args[0]
            self.right = args[2]
            self.op = args[1]
            if self.op == '+':
                self.value = self.left.value + self.right.value
            elif self.op == '-':
                self.value = self.left.value - self.right.value
            elif self.op == '*':
                self.value = self.left.value * self.right.value
            elif self.op == '/':
                self.value = self.left.value // self.right.value

    def __str__(self):
        '''It can be done smarter not to print redundant parentheses,
           but that is out of the scope of this problem.
        '''
        if self.op:
            return "({0}{1}{2})".format(self.left, self.op, self.right)
        else:
            return "{0}".format(self.value)
Now we can write a recursive function that builds all the possible expression trees with a given set of expressions and prints the ones that equal our target value. We will use the itertools module; that's always fun.
We could use itertools.combinations() or itertools.permutations(); the difference is in the order. Some of our operations are commutative and some are not, so we can use permutations() and accept that we will get many very similar solutions, or we can use combinations() and manually reorder the values when the operation is not commutative.
import itertools

OPS = ('+', '-', '*', '/')

def SearchTrees(current, target):
    ''' current is the current set of expressions.
        target is the target number.
    '''
    for a, b in itertools.combinations(current, 2):
        current.remove(a)
        current.remove(b)
        for o in OPS:
            # This checks whether this operation is commutative
            if o == '-' or o == '/':
                conmut = ((a, b), (b, a))
            else:
                conmut = ((a, b),)
            for aa, bb in conmut:
                # You do not specify what to do with the division.
                # I'm assuming that only integer divisions are allowed.
                if o == '/' and (bb.value == 0 or aa.value % bb.value != 0):
                    continue
                e = Expr(aa, o, bb)
                # If a solution is found, print it
                if e.value == target:
                    print(e.value, '=', e)
                current.add(e)
                # Recursive call!
                SearchTrees(current, target)
                # Do not forget to leave the set as it was before
                current.remove(e)
        # Ditto
        current.add(b)
        current.add(a)
And then the main call:
NUMBERS = [4, 8, 6, 2, 15, 50]
TARGET = 590
initial = set(map(Expr, NUMBERS))
SearchTrees(initial, TARGET)
And done! With these data I'm getting 719 different solutions in just over 21 seconds! Of course many of them are trivial variations of the same expression.
The 24 game is 4 numbers to a target of 24; your game is 6 numbers to a target x (0 < x < 1000).
They are very similar.
Here is a quick solution: it gets all results and prints just one, in about 1-3 s on my rMBP. I think printing one solution is fine for this game :). I will explain it afterwards:
def mrange(mask):
    # enumerate all non-empty submasks of mask
    # (twice faster; from Evgeny Kluev)
    x = 0
    while x != mask:
        x = (x - mask) & mask
        yield x

def f(i):
    global s
    if s[i]:
        # get cached group
        return s[i]
    for x in mrange(i & (i - 1)):
        # when x & i == x,
        # x is a child group in group i
        # and i-x is also a child group in group i
        fk = fork(f(x), f(i - x))
        s[i] = merge(s[i], fk)
    return s[i]

def merge(s1, s2):
    if not s1:
        return s2
    if not s2:
        return s1
    for i in s2:
        # print just one way quickly
        s1[i] = s2[i]
        # combine all ways, slowly
        # if i in s1:
        #     s1[i].update(s2[i])
        # else:
        #     s1[i] = s2[i]
    return s1

def fork(s1, s2):
    d = {}
    # fork s1 s2
    for i in s1:
        for j in s2:
            if not i + j in d:
                d[i + j] = getExp(s1[i], s2[j], "+")
            if not i - j in d:
                d[i - j] = getExp(s1[i], s2[j], "-")
            if not j - i in d:
                d[j - i] = getExp(s2[j], s1[i], "-")
            if not i * j in d:
                d[i * j] = getExp(s1[i], s2[j], "*")
            if j != 0 and not i / j in d:
                d[i / j] = getExp(s1[i], s2[j], "/")
            if i != 0 and not j / i in d:
                d[j / i] = getExp(s2[j], s1[i], "/")
    return d

def getExp(s1, s2, op):
    exp = {}
    for i in s1:
        for j in s2:
            exp['(' + i + op + j + ')'] = 1
            # just print one way
            break
        # just print one way
        break
    return exp

def check(s):
    for i in xrange(target, 0, -1):
        if i in s:
            if i == target:
                print numbers, target, "\nFind ", len(s[i]), 'ways'
                for exp in s[i]:
                    print exp, ' = ', i
            else:
                print numbers, target, "\nFind nearest ", i, 'in', len(s[i]), 'ways'
                for exp in s[i]:
                    print exp, ' = ', i
            break
    print '\n'

def game(numbers, target):
    global s
    s = [None] * (2 ** len(numbers))
    for i in xrange(0, len(numbers)):
        numbers[i] = float(numbers[i])
    n = len(numbers)
    for i in xrange(0, n):
        s[2 ** i] = {numbers[i]: {str(numbers[i]): 1}}
    for i in xrange(1, 2 ** n):
        # we will get the f(numbers) in s[2**n - 1]
        s[i] = f(i)
    check(s[2 ** n - 1])

numbers = [4, 8, 6, 2, 2, 5]
s = [None] * (2 ** len(numbers))
target = 590
game(numbers, target)

numbers = [1, 2, 3, 4, 5, 6]
target = 590
game(numbers, target)
Assume A is your list of 6 numbers.
Define f(A) as the set of all results that can be computed using all the numbers in A; if we compute f(A), we can check whether the target is in it and get the answer, or else the closest answer.
We can split A into two non-empty proper subgroups A1 and A-A1, which reduces the problem from f(A) to f(A1) and f(A-A1), because f(A) = Union( a+b, a-b, b-a, a*b, a/b (b != 0), b/a (a != 0) ), where a is in f(A1) and b is in f(A-A1).
We write fork(A1, A-A1) for this process, so that f(A) = Union( fork(A1, A-A1) ) over all splits. We can remove all duplicate values inside fork(), which cuts the search space and makes the program faster.
So, if A = [1,2,3,4,5,6], then f(A) = fork( f([1]), f([2,3,4,5,6]) ) U ... U fork( f([1,2,3]), f([4,5,6]) ) U ..., where U stands for union.
Notice that f([2,3,4,5,6]) = fork( f([2,3]), f([4,5,6]) ) U ..., and f([3,4,5,6]) = fork( f([3]), f([4,5,6]) ) U ...; f([4,5,6]) is used in both.
So if we cache every f([...]), the program can be faster.
There are 2^len(A) - 2 possible (A1, A-A1) splits of A. We can use a binary mask to represent each subgroup.
For example: if A = [1,2,3,4,5,6] and A1 = [1,2,3], then the binary mask 000111 (7) stands for A1. For A2 = [1,3,5], the mask 010101 (21) stands for A2. For A3 = [1], the mask 000001 (1) stands for A3...
So we have a way to represent every subgroup of A; we can cache their results and make the whole process faster!
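A small illustration of the mask-to-subgroup mapping (added for clarity; mrange above enumerates exactly these masks):

A = [1, 2, 3, 4, 5, 6]

def subgroup(mask):
    # pick the elements of A whose bit is set in mask
    return [A[i] for i in xrange(len(A)) if mask >> i & 1]

print subgroup(0b000111)  # [1, 2, 3]
print subgroup(0b010101)  # [1, 3, 5]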
The number of combinations of six numbers, four operations and parentheses is at least on the order of 5 * 9!. So I think you should consider an AI algorithm; genetic programming or optimization seems to be the path to follow.
In the book Programming Collective Intelligence, chapter 11, Evolving Intelligence, you will find exactly what you want and much more. That chapter explains how to find a mathematical function combining operations and numbers (as you want) to match a result. You will be surprised how easy such a task is.
PS: The examples are written in Python.
I would try using an AST; at the very least it will make your expression generation part easier (no need to mess with brackets).
http://en.wikipedia.org/wiki/Abstract_syntax_tree
1) Generate some tree with N nodes (N = the count of numbers you have). I've read before how many such trees there are; their count is serious as N grows, by which I mean more than polynomial, to say the least.
2) Now just start changing the operations in the non-leaf nodes and keep evaluating the result.
But this is again backtracking with too many degrees of freedom.
This is a computationally complex task you're posing. I believe if you ask the question as you did, "let's generate a number K on the output such that |K-V| is minimal" (here V is the pre-defined desired result, i.e. 590 in your example), then I guess this problem is even NP-complete. Somebody please correct me if my intuition is lying to me.
So I think even the generation of all possible ASTs (assuming only one operation is allowed) is NP-complete, as their count is not polynomial. Not to mention that more than one operation is allowed here, and not to mention the minimal-difference requirement (between the result and the desired result).
1. Fast entirely online algorithm
The idea is to search not for a single expression for the target value, but for an equation where the target value is included in one part and both parts have an almost equal number of operations (2 and 3).
Since each part of the equation is relatively small, it does not take much time to generate all possible expressions for the given input values.
After both parts of the equation are generated, it is possible to scan a pair of sorted arrays containing the values of these expressions and find a pair of equal (or at least best-matching) values in them. After two matching values are found, we can take the corresponding expressions and join them into a single expression (in other words, solve the equation).
To join two expression trees together, we descend from the root of one tree to the "target" leaf, invert the corresponding operation for each node on this path ('*' to '/', '/' to '*' or '/', '+' to '-', '-' to '+' or '-'), and move the "inverted" root node to the other tree (also as a root node).
This algorithm is faster and easier to implement when all operations are invertible. So it is best used with floating-point division (as in my implementation) or with rational division. Truncating integer division is the most difficult case because it produces the same result for different inputs (42/25 = 1 and 25/25 is also 1). With zero-remainder integer division this algorithm gives a result almost instantly when an exact result is available, but needs some modifications to work correctly when an approximate result is needed.
See implementation on Ideone.
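The scan over the two sorted value arrays could look roughly like this (a sketch of the matching step only, with my own naming; the full implementation is on Ideone):

def closest_pair(xs, ys):
    # xs, ys: sorted lists of values from the two equation halves.
    # Classic two-pointer merge scan: O(len(xs) + len(ys)).
    i = j = 0
    best = None
    while i < len(xs) and j < len(ys):
        d = abs(xs[i] - ys[j])
        if best is None or d < best[0]:
            best = (d, i, j)
        if xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return best  # (difference, index into xs, index into ys)

print closest_pair([1.0, 4.0, 9.5], [2.5, 9.0])  # (0.5, 2, 1)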
2. Even faster approach with off-line pre-processing
As noticed by @WolframH, there are not that many possible input number combinations: only 3*3*C(9+4-1, 4) = 4455 if repetitions are possible, or 3*3*C(9, 4) = 1134 without duplicates. This allows us to pre-process all possible inputs off-line, store the results in compact form, and, when some particular result is needed, quickly unpack one of the pre-processed values.
The pre-processing program should take an array of 6 numbers and generate values for all possible expressions. Then it should drop out-of-range values and find the nearest result for all cases where there is no exact match. All this can be performed by the algorithm proposed by @Tim; his code needs only minimal modifications to do it, and it is also the fastest alternative (yet).
Since pre-processing is off-line, we can use something better than interpreted Python. One alternative is PyPy; another is to use some other fast language for the pre-processing step. Pre-processing all possible inputs should not take more than several minutes.
Speaking about the memory needed to store all pre-processed values, the only problem is the resulting expressions. Stored in string form, they take up to 4455*999*30 bytes, or about 120 MB. But each expression can be compressed. It may be represented in postfix notation like this: arg1 arg2 + arg3 arg4 + *. To store this we need 10 bits for the arguments' permutation, 10 bits for the 5 operations, and 8 bits to specify how arguments and operations are interleaved (6 arguments + 5 operations - 3 pre-defined positions: the first two are always arguments, the last one is always an operation). That is 28 bits per tree, or 4 bytes, which means only about 20 MB for the entire data set with duplicates, or 5 MB without them.
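A rough sketch of that 28-bit packing (my reading of the scheme; the field names are made up):

def pack(perm_id, ops_id, shape_id):
    # perm_id < 720 (6! argument orders), ops_id < 1024 (4^5 operation
    # choices), shape_id < 256 (interleaving pattern): 10+10+8 = 28 bits
    return perm_id << 18 | ops_id << 8 | shape_id

def unpack(code):
    return code >> 18, (code >> 8) & 0x3FF, code & 0xFF

code = pack(719, 1023, 255)
print unpack(code)  # (719, 1023, 255)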
3. Slow entirely online algorithm
There are some ways to speed up the algorithm in the OP:
The greatest speed improvement may be achieved by avoiding trying each commutative operation twice and by making the recursion tree less branchy.
Some optimization is possible by removing all branches where the result of a division operation is zero.
Memoization (dynamic programming) cannot give a significant speed boost here, but it may still be useful.
After enhancing the OP's approach with these ideas, approximately a 30x speedup is achieved:
from itertools import combinations

numbers = [4, 8, 6, 2, 15, 50]
target = best_value = 590
best_item = None
subsets = {}

def get_best(value, item):
    global best_value, target, best_item
    if value >= 0 and abs(target - value) < best_value:
        best_value = abs(target - value)
        best_item = item
    return value, item

def compare_one(value, op, left, right):
    item = ('(' + left + op + right + ')')
    return get_best(value, item)

def apply_one(left, right):
    yield compare_one(left[0] + right[0], '+', left[1], right[1])
    yield compare_one(left[0] * right[0], '*', left[1], right[1])
    yield compare_one(left[0] - right[0], '-', left[1], right[1])
    yield compare_one(right[0] - left[0], '-', right[1], left[1])
    if right[0] != 0 and left[0] >= right[0]:
        yield compare_one(left[0] / right[0], '/', left[1], right[1])
    if left[0] != 0 and right[0] >= left[0]:
        yield compare_one(right[0] / left[0], '/', right[1], left[1])

def memorize(seq):
    fs = frozenset(seq)
    if fs in subsets:
        for x in subsets[fs].items():
            yield x
    else:
        subsets[fs] = {}
        for value, item in try_all(seq):
            subsets[fs][value] = item
            yield value, item

def apply_all(left, right):
    for l in memorize(left):
        for r in memorize(right):
            for x in apply_one(l, r):
                yield x

def try_all(seq):
    if len(seq) == 1:
        yield get_best(numbers[seq[0]], str(numbers[seq[0]]))
    for length in range(1, len(seq)):
        for x in combinations(seq[1:], length):
            for value, item in apply_all(list(x), list(set(seq) - set(x))):
                yield value, item

for x, y in try_all([0, 1, 2, 3, 4, 5]): pass
print best_item
More speed improvements are possible if you add some constraints to the problem:
Allow integer division only when the remainder is zero (see the sketch below).
Require all intermediate results to be non-negative and/or below 1000.
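For example, the first constraint could be expressed by a zero-remainder division helper along these lines (a sketch with hypothetical names, not code from the answer):

def exact_divisions(a, b):
    # yield only divisions with remainder zero (constraint 1)
    if b != 0 and a % b == 0:
        yield a // b, '(%d/%d)' % (a, b)
    if a != 0 and b % a == 0:
        yield b // a, '(%d/%d)' % (b, a)

print list(exact_divisions(50, 2))  # [(25, '(50/2)')]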
Well, I won't give up. Following the lines of all the answers to your question, I came up with another algorithm. This algorithm gives the solution in about 3 milliseconds on average.
#! -*- coding: utf-8 -*-
import copy

numbers = [4, 8, 6, 2, 15, 50]
target = 590

operations = {
    '+': lambda x, y: x + y,
    '-': lambda x, y: x - y,
    '*': lambda x, y: x * y,
    '/': lambda x, y: y == 0 and 1e30 or x / y  # Handle zero division
}

def chain_op(target, numbers, result=None, expression=""):
    if len(numbers) == 0:
        return (expression, result)
    else:
        for choosen_number in numbers:
            remaining_numbers = copy.copy(numbers)
            remaining_numbers.remove(choosen_number)
            if result is None:
                return chain_op(target, remaining_numbers, choosen_number, str(choosen_number))
            else:
                incomming_results = []
                for key, op in operations.items():
                    new_result = op(result, choosen_number)
                    new_expression = "%s%s%d" % (expression, key, choosen_number)
                    incomming_results.append(chain_op(target, remaining_numbers, new_result, new_expression))
                diff = 1e30
                selected = None
                for exp_result in incomming_results:
                    exp, res = exp_result
                    if abs(res - target) < diff:
                        diff = abs(res - target)
                        selected = exp_result
                        if diff == 0:
                            break
                return selected

if __name__ == '__main__':
    print chain_op(target, numbers)
Erratum: this algorithm does not include solutions containing parentheses. It always hits the target or the closest result, my bad. Still, it is pretty fast. It can be adapted to support parentheses without much work.
Actually, there are two things you can do to bring the time down to milliseconds.
You are trying to find a solution for a given quiz by generating the numbers and the target number. Instead, you can generate the solution and just remove the operations. You can build something smart that will generate several quizzes and choose the most interesting one; however, in this case you lose the as-close-as-possible option.
Another way to go is pre-calculation. Solve 100 quizzes, ship them built into your application, and generate new ones on the fly, trying to keep your quiz stack at 100; also try to give the user only new quizzes. I had the same problem in my Bible games, and I used this method to speed things up. Instead of 10 seconds per question it takes me milliseconds, as I generate new questions in the background and always keep my stack at 100.
What about dynamic programming? You need the same intermediate results to compute the other options.
From Section 15.2 of Programming Pearls
The C code can be viewed here: http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c
When I implement it in Python using a suffix array:
example = open("iliad10.txt").read()
def comlen(p, q):
i = 0
for x in zip(p, q):
if x[0] == x[1]:
i += 1
else:
break
return i
suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:])) #VERY VERY SLOW
max_len = -1
for i in range(example_len - 1):
this_len = comlen(example[idx[i]:], example[idx[i+1]:])
print this_len
if this_len > max_len:
max_len = this_len
maxi = i
I found the idx.sort step very slow. I think it's slow because Python needs to pass the substrings by value instead of by pointer (as the C code above does).
The tested file can be downloaded from here
The C code needs only 0.3 seconds to finish.
time cat iliad10.txt |./longdup
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away.
real 0m0.328s
user 0m0.291s
sys 0m0.006s
But the Python code never ends on my computer (I waited for 10 minutes and then killed it).
Does anyone have ideas for how to make the code efficient? (For example, under 10 seconds.)
My solution is based on suffix arrays, constructed by prefix doubling, together with the longest common prefix (LCP) array. The worst-case complexity is O(n (log n)^2). The file "iliad.mb.txt" takes 4 seconds on my laptop. The longest_common_substring function is short and can be easily modified, e.g. for searching the 10 longest non-overlapping substrings. This Python code is faster than the original C code from the question if the duplicate strings are longer than 10000 characters.
from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    the ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if substring not in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. These results
    are then repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_step + 20 if _step > 2 else 0)) + 1MB

    Return value: (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array              for i in range(1, size):
                 assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array      for i in range(size):
                 assert rsa[sa[i]] == i
        lcp: Longest common prefix     for i in range(1, size):
                 assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
                 if sa[i-1] + lcp[i] < len(text):
                     assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])
    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip yet resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
I prefer this solution over a more complicated O(n log n) one because Python has a very fast list sorting algorithm (Timsort). Python's sort is probably faster than the necessary linear-time operations in the method from that article, which is O(n) only under very special assumptions about random strings with a small alphabet (typical for DNA genome analysis). I read in Gog 2011 that the worst-case O(n log n) of my algorithm can in practice be faster than many O(n) algorithms that cannot use the CPU memory cache.
The code in another answer based on grow_chains is 19 times slower than the original example from the question if the text contains a repeated string 8 kB long. Long repeated texts are not typical for classical literature, but they are frequent, e.g., in "independent" school homework collections. The program should not freeze on them.
I wrote an example and tests with the same code for Python 2.7, 3.3 - 3.6.
The translation of the algorithm into Python:
from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable):  # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data)))  # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
buffer() allows getting a substring without copying:
def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i))  # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]
It takes 5 seconds on my machine for the iliad.mb.txt.
In principle it is possible to find the duplicate in O(n) time and O(n) memory using a suffix array augmented with a lcp array.
Note: the *_memoryview() version is superseded by the *_buffer() version above.
A more memory-efficient version (compared to longest_duplicate_small()):
def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()
It takes 17 seconds on my machine for the iliad.mb.txt. The result is:
On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.
I had to define custom functions to compare memoryview objects because memoryview comparison either raises an exception in Python 3 or produces a wrong result in Python 2:
>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True
Related questions:
Find the longest repeating string and the number of times it repeats in a given string
finding long repeated substrings in a massive string
The main problem seems to be that Python does slicing by copy: https://stackoverflow.com/a/5722068/538551
You'll have to use a memoryview instead, to get a reference instead of a copy. When I did this, the program hung after the idx.sort step (which was very fast).
I'm sure with a little work, you can get the rest working.
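As a quick illustration of copy vs. reference (my example, Python 2):

data = "x" * 10**6
copied = data[1:]        # allocates a new ~1 MB string
view = buffer(data, 1)   # no copy, just a view into the same memory
print len(copied), len(view)  # 999999 999999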
Edit:
The above change will not work as a drop-in replacement because cmp does not work the same way as strcmp. For example, try the following C code:
#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
And compare the result to this Python:
test1 = "ovided by The Internet Classics Archive"
test2 = "rovided by The Internet Classics Archive."
print(cmp(test1, test2))
The C code prints -3 on my machine while the python version prints -1. It looks like the example C code is abusing the return value of strcmp (it IS used in qsort after all). I couldn't find any documentation on when strcmp will return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed a lot of values outside of that range (3, -31, 5 were the first 3 values).
To make sure that -3 wasn't some error code, if we reverse test1 and test2, we'll get 3.
Edit:
The above is interesting trivia, but not actually correct in terms of affecting either chunks of code. I realized this just as I shut my laptop and left a wifi zone... Really should double check everything before I hit Save.
FWIW, cmp most certainly works on memoryview objects (prints -1 as expected):
print(cmp(memoryview(test1), memoryview(test2)))
I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.
This version takes about 17 seconds on my circa-2007 desktop using a totally different algorithm:
#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a, b) in enumerate(zip(ex, ex[1:])):
    s = ''.join(b)
    if s not in chains:
        chains[s] = list()
    chains[s].append(a)

def grow_chains(chains):
    new_chains = dict()
    for (string, pos) in chains:
        offset = len(string)
        for p in pos:
            if p + offset >= len(ex):
                break
            # add one more character
            s = string + ex[p + offset]
            if s not in new_chains:
                new_chains[s] = list()
            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1:
    print 'length of chains', len(chains)
    # remove chains that appear only once
    chains = [(i, chains[i]) for i in chains if len(chains[i]) > 1]
    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]
    chains = grow_chains(chains)
The basic idea is to create a list of substrings and the positions where they occur, thus eliminating the need to compare the same strings again and again. The resulting list looks like [('ind him, but', [466548, 739011]), (' bulwark bot', [428251, 428924]), (' his armour,', [121559, 124919, 193285, 393566, 413634, 718953, 760088])]. Unique strings are removed. Then every list member grows by 1 character and a new list is created. Unique strings are removed again. And so on and so forth...
Instead of a complete shuffle, I am looking for a partial shuffle function in Python.
Example: "string" may give rise to "stnrig", but not "nrsgit".
It would be better if I could define a specific "percentage" of characters that have to be rearranged.
The purpose is to test string comparison algorithms. I want to determine the "percentage of shuffle" beyond which my algorithm will mark two (shuffled) strings as completely different.
Update:
Here is my code. Improvements are welcome!
import random

percent_to_shuffle = int(raw_input("Give the percent value to shuffle : "))
to_shuffle = list(raw_input("Give the string to be shuffled : "))
num_of_chars_to_shuffle = int((len(to_shuffle) * percent_to_shuffle) / 100)
for i in range(0, num_of_chars_to_shuffle):
    x = random.randint(0, (len(to_shuffle) - 1))
    y = random.randint(0, (len(to_shuffle) - 1))
    z = to_shuffle[x]
    to_shuffle[x] = to_shuffle[y]
    to_shuffle[y] = z
print ''.join(to_shuffle)
This is a simpler problem than it looks. And the language has the right tools not to stand between you and the idea, as usual:
import random

def pashuffle(string, perc=10):
    data = list(string)
    for index, letter in enumerate(data):
        if random.randrange(0, 100) < perc/2:
            new_index = random.randrange(0, len(data))
            data[index], data[new_index] = data[new_index], data[index]
    return "".join(data)
Your problem is tricky, because there are some edge cases to think about:
Strings with repeated characters (i.e. how would you shuffle "aaaab"?)
How do you measure chained character swaps or rearranging blocks?
In any case, the metric defined to shuffle strings up to a certain percentage is likely to be the same one you are using in your algorithm to see how close they are; a simple example is sketched below.
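For instance, a crude position-based metric (my suggestion, not from the question) could be:

def shuffle_fraction(original, shuffled):
    # fraction of positions holding a different character after shuffling
    assert len(original) == len(shuffled)
    changed = sum(1 for a, b in zip(original, shuffled) if a != b)
    return float(changed) / len(original)

print shuffle_fraction('string', 'stnrig')  # 0.5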
My code to shuffle n characters:
import random

def shuffle_n(s, n):
    idx = range(len(s))
    random.shuffle(idx)
    idx = idx[:n]
    mapping = dict((idx[i], idx[i - 1]) for i in range(n))
    return ''.join(s[mapping.get(x, x)] for x in range(len(s)))
Basically, it chooses n positions to swap at random, and then exchanges each of them with the next one in the list... This way it ensures that no inverse swaps are generated and exactly n characters are swapped (if there are repeated characters, bad luck).
Explained run with 'string', 3 as input:
idx is [0, 1, 2, 3, 4, 5]
we shuffle it, now it is [5, 3, 1, 4, 0, 2]
we take just the first 3 elements, now it is [5, 3, 1]
those are the characters that we are going to swap
s t r i n g
^ ^ ^
t (1) will be i (3)
i (3) will be g (5)
g (5) will be t (1)
the rest will remain unchanged
so we get 'sirgnt'
The bad thing about this method is that it does not generate all the possible variations; for example, it could not make 'gnrits' from 'string'. This could be fixed by making partitions of the indices to be shuffled, like this:
import random

def randparts(l):
    n = len(l)
    s = random.randint(0, n - 1) + 1
    if s >= 2 and n - s >= 2:  # the split makes two valid parts
        yield l[:s]
        for p in randparts(l[s:]):
            yield p
    else:  # the split would make a single cycle
        yield l

def shuffle_n(s, n):
    idx = range(len(s))
    random.shuffle(idx)
    # the partition loop must come first so that x is bound before use
    mapping = dict((x[i], x[i - 1])
                   for x in randparts(idx[:n])
                   for i in range(len(x)))
    return ''.join(s[mapping.get(x, x)] for x in range(len(s)))
import random

def partial_shuffle(a, part=0.5):
    # which characters are to be shuffled:
    idx_todo = random.sample(xrange(len(a)), int(len(a) * part))
    # what are the new positions of these to-be-shuffled characters:
    idx_target = idx_todo[:]
    random.shuffle(idx_target)
    # map all "normal" character positions {0:0, 1:1, 2:2, ...}
    mapper = dict((i, i) for i in xrange(len(a)))
    # update with all shuffles in the string: {old_pos:new_pos, old_pos:new_pos, ...}
    mapper.update(zip(idx_todo, idx_target))
    # use mapper to modify the string:
    return ''.join(a[mapper[i]] for i in xrange(len(a)))

for i in xrange(5):
    print partial_shuffle('abcdefghijklmnopqrstuvwxyz', 0.2)
prints
abcdefghljkvmnopqrstuxwiyz
ajcdefghitklmnopqrsbuvwxyz
abcdefhwijklmnopqrsguvtxyz
aecdubghijklmnopqrstwvfxyz
abjdefgcitklmnopqrshuvwxyz
Evil and using a deprecated API:
import random

# adjust constant to taste
# 0 -> no effect, 0.5 -> completely shuffled, 1.0 -> reversed
# Of course this assumes your input is already sorted ;)
''.join(sorted(
    'abcdefghijklmnopqrstuvwxyz',
    cmp=lambda a, b: cmp(a, b) * (-1 if random.random() < 0.2 else 1)
))
maybe like so:
>>> import random
>>> s = 'string'
>>> shufflethis = list(s[2:])
>>> random.shuffle(shufflethis)
>>> s[:2] + ''.join(shufflethis)
'stingr'
Taking from fortran's idea, I'm adding this to the collection. It's pretty fast:
import random

def partial_shuffle(st, p=20):
    p = int(round(p / 100.0 * len(st)))
    idx = range(len(st))
    sample = random.sample(idx, p)
    res = str()
    samptrav = 1
    for i in range(len(st)):
        if i in sample:
            res += st[sample[-samptrav]]
            samptrav += 1
            continue
        res += st[i]
    return res