All possible combinations of numbers with rounding error matching sum

All possible combinations of numbers with rounding error matching sum - python

I have encountered one problem that I struggle to solve. There is a group of numbers that should match some given sum and the function should output all possible combinations. The length of the input numbers differ, might be 3 or sometimes 20 numbers in the input array. Unfortunately, some numbers do have a rounding error and some do not have it, and we do not know which ones are accurate. Therefore, not always combination might be found for every sum, cause of it rounding error.
E.g. there are numbers 248, 512, 992, 240, 2134, 4214, 4211 and target sum is 3133.
I wanted to incorporate rounding error into the function that I found here:
## Create algorithm to find out underlying accounts
def subset_sum(numbers, target, partial=[], partial_sum=0):
if partial_sum == target:
yield partial
if partial_sum >= target:
return
for i, n in enumerate(numbers):
remaining = numbers[i + 1:]
yield from subset_sum(remaining, target, partial + [n], partial_sum + n)
And i wanted to do so by sort of extending the input array by 2*len(input_array) elements, where one would be an element-1 and second would be element+1.
Unfortunately, I do not know how to incorporate it into that algorithm and when i use loop to run the function multiple times with switching one element it will be super slow and create 3*length runs of this functions(actual number, actual number-1, actual numer+1). I dont think anyone would be able to help with this...

Ok, I guess you're looking for the "best" sum, rather than "exact" one, that is, find a subset sum such as the distance abs(partial_sum, target) is minimal. You can employ this simple modification of the algorithm: on each step, compute the distance and declare it "best" if it is less than the best distance so far.
Here's some working code (I also replaced recursion with a loop and a queue):
def best_subset_sum(lst, target):
## queue of candidates: (indexes, partial sum)
queue = [([i], lst[i]) for i in range(len(lst))]
## best candidate so far: (indexes, distance to target)
best = [[], target]
while queue:
indexes, s = queue.pop(0)
## the distance between the current sum and the target
diff = abs(target - s)
## is it better than the shortest so far?
if diff < best[1]:
## new best distance
best = indexes, diff
## exact sum found - stop now
if diff == 0:
break
## assuming all numbers are positive
if s < target:
i = indexes[-1] + 1
## for all indexes following the current combination
while i < len(lst):
## add current combination + i-th element to the queue
queue.append((indexes + [i], s + lst[i]))
i += 1
return [lst[i] for i in best[0]]
s = best_subset_sum([248, 512, 992, 240, 2132, 4214, 4211], 3133)
print(s, sum(s))
This prints
[248, 512, 240, 2134] 3134
And here's a modification to enumerate all such sums:
def best_subset_sums(lst, target):
queue = [[[i], lst[i]] for i in range(len(lst))]
best = [[[], target]]
while queue:
indexes, s = queue.pop(0)
diff = abs(target - s)
if diff < best[0][1]:
best = [[indexes, diff]]
elif diff == best[0][1]:
best.append([indexes, diff])
if s < target:
i = indexes[-1] + 1
while i < len(lst):
queue.append((indexes + [i], s + lst[i]))
i += 1
for b in best:
yield [lst[i] for i in b[0]]
for ls in best_subset_sums([3, 5, 7, 11, 13, 17, 19], 66):
print(ls, sum(ls))
# [5, 11, 13, 17, 19] 65
# [7, 11, 13, 17, 19] 67

Related

Subset problem in python - fix the number of addends that sum to the target

I'm trying to implement a simple program that aims to solve the subset problem in python.
I've found the following code that involves dynamically programming:
def subset_sum(numbers, target, partial=[]):
s = sum(partial)
# check if the partial sum is equals to target
if s == target:
print("sum(%s)=%s" % (partial, target))
if s >= target:
return # if we reach the number why bother to continue
for i in range(len(numbers)):
n = numbers[i]
remaining = numbers[i+1:]
subset_sum(remaining, target, partial + [n])
The code works but it finds all the possible combinations, while I'm only interested on subsets with a maximum number of addends that I want to specify time to time.
In other words I'd like to set a maximum number of the internal loops. To do that I've written the following code that considers all the possible combinations of at most 4 numbers that sum to the target (i.e. 3 internal loops)
def subset_sum(n_list, tot):
cnd = n_list[n_list < tot]
s = np.sort(cnd)
n_max = len(cnd)
possibilities = []
for i1 in range(n_max):
i2 = i1+1
while (i2<n_max)and(s[i1]+s[i2]<=tot):
if (s[i1]+s[i2]==tot):
possibilities.append([s[i1],s[i2]])
i3 = i2+1
while (i3<n_max)and(s[i1]+s[i2]+s[i3]<=tot):
if (s[i1]+s[i2]+s[i3]==tot):
possibilities.append([s[i1],s[i2],s[i3]])
i4 = i3+1
while (i4<n_max)and(s[i1]+s[i2]+s[i3]+s[i4]<=tot):
if (s[i1]+s[i2]+s[i3]+s[i4]==tot):
possibilities.append([s[i1],s[i2],s[i3],s[i4]])
i4+=1
i3+=1
i2+=1
return possibilities
This code works pretty well, can also be speed up with numba (while the first code no) but I cannot fix the maximum number of addends.
Is there a way to implement the function subset_sum with an extra argument that fix the maximum number of addends that sum to the target?

Since you are adding a number in each recursion you can just limit the recursion depth. To do so, you need to add a new parameter to control the maximum depth (a.k.a. the maximum number of addends).
Here is the code:
def subset_sum(numbers, target, num_elems, partial=[]):
# Check if the partial sum is equals to target
s = sum(partial)
if s == target:
print("sum(%s)=%s" % (partial, target))
# If we have surpassed the number there is no point to continue
if s >= target:
return
# If we have surpassed the number of elements there is no point to continue
if len(partial) >= num_elems:
return
# Otherwise go through the remaining numbers
for i in range(len(numbers)):
n = numbers[i]
remaining = numbers[i+1:]
subset_sum(remaining, target, num_elems, partial + [n])
You can run it with:
if __name__ == "__main__":
nums = [1, 2, 3, 4, 5]
num_elems = 3
target = 10
p = []
subset_sum(nums, target, num_elems, p)
And the output will be:
sum([1, 4, 5])=10
sum([2, 3, 5])=10
Notice that the combination of 4 elements ([1, 2, 3, 4]) is not shown.
EDIT:
To speed-up the above code with Numba you need to build an iterative version of it. Since you are basically computing the combinations of numbers in sets of num_elements size, you can check the iterative implementation of the itertools.combination (more details here). Based on that implementation you can obtain the following code:
def subset_sum_iter(numbers, target, num_elements):
# General: we iterate among the indexes and build each solution by taking the values in those indexes
# Initialize solutions list
solutions = []
# Build first index by taking the first num_elements from the numbers
indices = list(range(num_elements))
solution = [numbers[i] for i in indices]
if sum(solution) == target:
solutions.append(solution)
# We iterate over the rest of the indices until we have tried all combinations
while True:
for i in reversed(range(num_elements)):
if indices[i] != i + len(numbers) - num_elements:
break
else:
# No combinations left
break
# Increase current index and all its following ones
indices[i] += 1
for j in range(i + 1, num_elements):
indices[j] = indices[j - 1] + 1
# Check current solution
solution = [numbers[i] for i in indices]
if sum(solution) == target:
solutions.append(solution)
# Print all valid solutions
for sol in solutions:
print ("sum(" + str(sol) + ")=" + str(target))
Which can be run with:
if __name__ == "__main__":
nums = [1, 2, 3, 4, 5]
num_elems = 3
target = 10
# Calling iterative subset
subset_sum_iter(nums, target, num_elems)
And outputs:
sum([1, 4, 5])=10
sum([2, 3, 5])=10
As in the previous case, notice that only the combinations with 3 elements are shown.

I am not sure whether you prefer combinations or permuations here, but you could try this?
import itertools
limit = 1 #number of addends
possibilities = 0
combinations = []
not_possibilties = 0
number_addends = 4
while(limit <= number_addends):
for comb in itertools.combinations([number_list], limit):
if sum(comb) == target:
possibilities +=1
combinations.append(comb)
else:
not_possiblitities +=1
limit +=1
total_combi = possibilities + not_possibilties #check if actually all possibilities were done
If you need permutations just change itertools.combinations to itertools.permutationss

Is there a python function that returns the first positive int that does not occur in list?

I'm tryin to design a function that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A.
This code works fine yet has a high order of complexity, is there another solution that reduces the order of complexity?
Note: The 10000000 number is the range of integers in array A, I tried the sort function but does it reduces the complexity?
def solution(A):
for i in range(10000000):
if(A.count(i)) <= 0:
return(i)

The following is O(n logn):
a = [2, 1, 10, 3, 2, 15]
a.sort()
if a[0] > 1:
print(1)
else:
for i in range(1, len(a)):
if a[i] > a[i - 1] + 1:
print(a[i - 1] + 1)
break
If you don't like the special handling of 1, you could just append zero to the array and have the same logic handle both cases:
a = sorted(a + [0])
for i in range(1, len(a)):
if a[i] > a[i - 1] + 1:
print(a[i - 1] + 1)
break
Caveats (both trivial to fix and both left as an exercise for the reader):
Neither version handles empty input.
The code assumes there no negative numbers in the input.

O(n) time and O(n) space:
def solution(A):
count = [0] * len(A)
for x in A:
if 0 < x <= len(A):
count[x-1] = 1 # count[0] is to count 1
for i in range(len(count)):
if count[i] == 0:
return i+1
return len(A)+1 # only if A = [1, 2, ..., len(A)]

This should be O(n). Utilizes a temporary set to speed things along.
a = [2, 1, 10, 3, 2, 15]
#use a set of only the positive numbers for lookup
temp_set = set()
for i in a:
if i > 0:
temp_set.add(i)
#iterate from 1 upto length of set +1 (to ensure edge case is handled)
for i in range(1, len(temp_set) + 2):
if i not in temp_set:
print(i)
break

My proposal is a recursive function inspired by quicksort.
Each step divides the input sequence into two sublists (lt = less than pivot; ge = greater or equal than pivot) and decides, which of the sublists is to be processed in the next step. Note that there is no sorting.
The idea is that a set of integers such that lo <= n < hi contains "gaps" only if it has less than (hi - lo) elements.
The input sequence must not contain dups. A set can be passed directly.
# all cseq items > 0 assumed, no duplicates!
def find(cseq, cmin=1):
# cmin = possible minimum not ruled out yet
size = len(cseq)
if size <= 1:
return cmin+1 if cmin in cseq else cmin
lt = []
ge = []
pivot = cmin + size // 2
for n in cseq:
(lt if n < pivot else ge).append(n)
return find(lt, cmin) if cmin + len(lt) < pivot else find(ge, pivot)
test = set(range(1,100))
print(find(test)) # 100
test.remove(42)
print(find(test)) # 42
test.remove(1)
print(find(test)) # 1

Inspired by various solutions and comments above, about 20%-50% faster in my (simplistic) tests than the fastest of them (though I'm sure it could be made faster), and handling all the corner cases mentioned (non-positive numbers, duplicates, and empty list):
import numpy
def firstNotPresent(l):
positive = numpy.fromiter(set(l), dtype=int) # deduplicate
positive = positive[positive > 0] # only keep positive numbers
positive.sort()
top = positive.size + 1
if top == 1: # empty list
return 1
sequence = numpy.arange(1, top)
try:
return numpy.where(sequence < positive)[0][0]
except IndexError: # no numbers are missing, top is next
return top
The idea is: if you enumerate the positive, deduplicated, sorted list starting from one, the first time the index is less than the list value, the index value is missing from the list, and hence is the lowest positive number missing from the list.
This and the other solutions I tested against (those from adrtam, Paritosh Singh, and VPfB) all appear to be roughly O(n), as expected. (It is, I think, fairly obvious that this is a lower bound, since every element in the list must be examined to find the answer.) Edit: looking at this again, of course the big-O for this approach is at least O(n log(n)), because of the sort. It's just that the sort is so fast comparitively speaking that it looked linear overall.

adding sequential values in list to target value

The idea is for each number to add the next sequential numbers to try to reach the target value. If for each starting value (for i...) if adding the next sequential numbers exceeds the target then i has failed and move onto the next.
I'm getting some values slipping through and some duplicating.
If the targets intentionally match the lists it works fine; I've noticed 13 throws up some strange behaviour.
def addToTarget (mylist, target):
solutions_list = []
for i in range(0,len(mylist)):
#set base values
total = mylist[i]
counter = i
solutions = []
solutions.append(total)
if total == target:
solutions_list.append(solutions) # first value matches immediately
elif total > target:
solutions_list.append([counter-1, "first value already too high"])
elif counter == (len(mylist)):
solutions_list.append("caught as final value ")
while total < target and counter < (len(mylist)-1):
counter +=1
value = mylist[counter]
total += value
solutions.append(value)
if total == target:
solutions_list.append([counter, solutions])
elif total > target:
solutions_list.append([counter-1, "total > target during"])
elif counter == (len(mylist)-1):
solutions_list.append([counter-1, "total < target - came to end of list "])
else : solutions_list.append([counter-1, "not sure but certian values seem to slip through"])
return solutions_list
mylist = [5, 5, 3, 10, 2, 8, 10]
solutions_list = []
test = answer(mylist, 13)
for i in test : print(i)

You could use two markers that move through the list and keep track of sum of values between them. While current sum is less than target (13) and first marker is not at the end of the list move it forward and add value to current sum. Once first marker has been moved check if the current sum matches the target and update result accordingly. In final step move second marker forward one step and subtract item that it pointed to from current sum.
l = [5, 5, 3, 10, 2, 8, 10]
TARGET = 13
res = []
end = 0
current = 0
for start in range(len(l)):
while current < TARGET and end < len(l):
current += l[end]
end += 1
res.append(l[start:end] if current == TARGET else 'fail')
current -= l[start]
print(res)
Output:
[[5, 5, 3], 'fail', [3, 10], 'fail', 'fail', 'fail', 'fail']

Your code and your problem statement would benefit from rewriting as a sequence of small, self-contained, painfully obvious steps.
As far as I could understand, you take a list of numbers xs and look for values of i such that xs[i] + xs[i + 1] == target.
So, let's first generate a list of triples (i, x[i], x[i + 1]), then scan it for solutions.
An explicit way:
def pairs(xs):
for i in range(len(xs) - 1):
yield (i, xs[i], xs[i + 1])
A one-line way, fine for reasonably short lists:
def pairs(xs):
return zip(range(len(xs)), xs, xs[1:])
Now find the matches:
matches = [(i, x0, x1) for (i, x0, x1) if x0 + x1 == target]
We don't mark the various conditions of mismatches, though. Tese can be added if the list comprehension above is converted into an explicit loop.
Hope this helps.

Python: Determine whether each step in path across n arrays falls below threshold value

Given n arrays of integers, is there a good algorithm with which to determine whether there is a path across those arrays such that the minimum (Euclidean) distance of each "step" along that path falls below a certain threshold value? That is, the path across all arrays will include only one member from each array, and the distance of each step of that path will be determined by the absolute distance between the values from the two arrays being compared during the given step. For instance, say you have the following arrays:
a = [1,3,7]
b = [10,13]
c = [13,24]
and
threshold = 3
In that case, you would want to determine whether any elements of a and b have a distance of 3 or less between them, and for all pairs of elements from a and b that do in fact have a distance of three or less between them, you would want to determine whether either the given member from a or the given member from b has a distance of 3 or less from any member of c. (In the example above, the only path across the integers for which the distance of each step falls below the threshold condition is 7-->10-->13.)
Here's how I'm approaching such a problem when the number of arrays is three:
from numpy import*
a = [1,3,7]
b = [10,13]
c = [13,24]
d = [45]
def find_path_across_three_arrays_with_threshold_value_three(a,b,c):
'''this function takes three lists as input, and it determines whether
there is a path across those lists for which each step of that path
has a distance of three or less'''
threshold = 3
#start with a,b
for i in a:
for j in b:
#if the absolute value of i-j is less than or equal to the threshold parameter (user-specified proximity value)
if abs(i-j) <= threshold:
for k in c:
if abs(i-k) <= threshold:
return i,j,k
elif abs(j-k) <= threshold:
return i,j,k
#now start with a,c
for i in a:
for k in c:
if abs(i-k) <= threshold:
for j in b:
if abs(i-j) <= threshold:
return i,j,k
elif abs(j-k) <= threshold:
return i,j,k
#finally, start with b,c
for j in b:
for k in c:
if abs(j-k) <= threshold:
for i in a:
if abs(i-j) <= threshold:
return i,j,k
elif abs(i-k) <= threshold:
return i,j,k
if find_path_across_three_arrays_with_threshold_value_three(a,b,c):
print "ok"
If you didn't know in advance, though, how many arrays there were, what would be the most efficient way of calculating whether there is a path across all n arrays, such that the distance of each "step" in the path fell below the desired threshold value? Would something like Dijkstra's algorithm be the best way to generalize this problem for n arrays?
EDIT:
#Falko's method works for me:
import numpy as np
import itertools
my_list = [[1, 3, 7], [10, 13], [13, 24], [19], [16]]
def isPath(A, threshold):
for i in range(len(A) - 1):
#print "Finding edges from layer", i, "to", i + 1, "..."
diffs = np.array(A[i]).reshape((-1, 1)) - np.array(A[i + 1]).reshape((1, -1))
reached = np.any(np.abs(diffs) <= threshold, axis = 0)
A[i + 1] = [A[i + 1][j] for j in range(len(reached)) if reached[j]]
#print "Reachable nodes of next layer:", A[i + 1]
return any(reached)
for i in itertools.permutations(my_list):
new_list = []
for j in i:
new_list.extend([j])
if isPath(new_list,3):
print "threshold 3 match for ", new_list
if isPath(new_list,10):
print "threshold 10 match for ", new_list

I found a much simpler solution (maybe related to the one from JohnB; I'm not sure):
import numpy as np
def isPath(A, threshold):
for i in range(len(A) - 1):
print "Finding edges from layer", i, "to", i + 1, "..."
diffs = np.array(A[i]).reshape((-1, 1)) - np.array(A[i + 1]).reshape((1, -1))
reached = np.any(np.abs(diffs) <= threshold, axis = 0)
A[i + 1] = [A[i + 1][j] for j in range(len(reached)) if reached[j]]
print "Reachable nodes of next layer:", A[i + 1]
return any(reached)
print isPath([[1, 3, 7], [10, 13], [13, 24]], 3)
print isPath([[1, 3, 7], [10, 13], [13, 24]], 10)
Output:
Finding edges from layer 0 to 1 ...
Reachable nodes of next layer: [10]
Finding edges from layer 1 to 2 ...
Reachable nodes of next layer: [13]
True
Finding edges from layer 0 to 1 ...
Reachable nodes of next layer: [10, 13]
Finding edges from layer 1 to 2 ...
Reachable nodes of next layer: [13]
True
It steps from one layer to another an checks, which nodes still can be reached given the predefined threshold. Unreachable nodes are removed from the array. When the loop continues, those nodes are not considered anymore.
I guess it's pretty efficient and easy to implement.

First I'd build up an undirected graph: Each number in your array is a node and nodes of neighboring rows are connected if and only if their distance is smaller than your threshold.
Then you can use a standard algorithm to determine connected components of the graph. You'll probably find many references and code examples about this common problem.
Finally you need to check if one component contains nodes from a as well as nodes from your last row, c in this case.

(short answer: Floyd-Warshall is more efficient in this case than naive application of Dijkstra's)
I'm not 100% clear from your example, but it seems that you have to advance through the arrays in increasing order, and that you cannot backtrack.
ie...
A = [1, 300]
B = [2, 11]
C = [12, 301]
You go A(1) -> B(2), but there is no path to C because you can't jump to B(11) -> C(12). Similarly you can't jump A(300) -> C(301).
You could create, as you suggest, an adjacency matrix of size NM x NM where N is the |arrays| and M is |elements in each array|. You would want to use a sparse array implementation as most of the values are nil.
For each increasing pair (ai,bj), (bi,cj) you perform the pairwise calculations and store the connection if it is <= your threshold value.
The runtime for this would be N * M^2, which is smaller than the cost of finding paths (in the worst case) and so is probably acceptable. For cases where threshold << M and arrays do not contain repetitions this can be reduced to N*MlgM by sorting first. As at most threshold*M comparisons are needed for each array pair comparison.
To use Dijkstra's you'd have to run it M*M times, once for each pair of elements in arrays a and n. Which works out to O(M^2 * ElgV) (E is number of edges, V is number of vertexes) Which in the worst case E = (N-1)*M^2, and V is N*M. or N*M^4 * lg(N*M). Floyd-Warshall algorithm for all pairs of shortest paths runs in V^3, or (N*M)^3, which is smaller.

Longest equally-spaced subsequence

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example
1, 4, 5, 7, 8, 12
has a subsequence
4, 8, 12
My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²) time per point it seems.
Is there a faster way to solve this problem?
Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .
Timings. I tested with the following input data on my 32 bit system.
a= [random.randint(0,10000) for r in xrange(20000)]
a.sort()
The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
The O(nd) time method of Armin took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.
In order to be able to test the method of Kluev's I reran with
a= [random.randint(0,40000) for r in xrange(28000)]
a = list(set(a))
a.sort()
to make a list of length roughly 20000. All timings with pypy
ZelluX, 9 seconds
Kluev, 20 seconds
Armin, 52 seconds
It seems that if the ZelluX method could be made linear space it would be the clear winner.

We can have a solution O(n*m) in time with very little memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.
Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.
A = [1, 4, 5, 7, 8, 12] # in sorted order
Aset = set(A)
for d in range(1, 12):
already_seen = set()
for a in A:
if a not in already_seen:
b = a
count = 1
while b + d in Aset:
b += d
count += 1
already_seen.add(b)
print "found %d items in %d .. %d" % (count, a, b)
# collect here the largest 'count'
Updates:
This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.

UPDATE: I've found a paper on this problem, you can download it here.
Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.
We assume all numbers are saved in array a in ascending order, and n saves its length. 2D array l[i][j] defines length of longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).
lmax = 2
l = [[2 for i in xrange(n)] for j in xrange(n)]
for mid in xrange(n - 1):
prev = mid - 1
succ = mid + 1
while (prev >= 0 and succ < n):
if a[prev] + a[succ] < a[mid] * 2:
succ += 1
elif a[prev] + a[succ] > a[mid] * 2:
prev -= 1
else:
l[mid][succ] = l[prev][mid] + 1
lmax = max(lmax, l[mid][succ])
prev -= 1
succ += 1
print lmax

Update: First algorithm described here is obsoleted by Armin Rigo's second answer, which is much simpler and more efficient. But both these methods have one disadvantage. They need many hours to find the result for one million integers. So I tried two more variants (see second half of this answer) where the range of input integers is assumed to be limited. Such limitation allows much faster algorithms. Also I tried to optimize Armin Rigo's code. See my benchmarking results at the end.
Here is an idea of algorithm using O(N) memory. Time complexity is O(N2 log N), but may be decreased to O(N2).
Algorithm uses the following data structures:
prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.
Algorithm:
Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
Remove hash map record found on step #2.
Continue from step #2 while pq is not empty.
This algorithm updates O(N) elements of prev O(N) times each. And each of these updates may require to add a new "difference" to pq. All this means time complexity of O(N2 log N) if we use simple heap implementation for pq. To decrease it to O(N2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.
See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).
And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.
Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:
def findLESS(A):
Aset = set(A)
lmax = 2
d = 1
minStep = 0
while (lmax - 1) * minStep <= A[-1] - A[0]:
minStep = A[-1] - A[0] + 1
for j, b in enumerate(A):
if j+d < len(A):
a = A[j+d]
step = a - b
minStep = min(minStep, step)
if a + step in Aset and b - step not in Aset:
c = a + step
count = 3
while c + step in Aset:
c += step
count += 1
if count > lmax:
lmax = count
d += 1
return lmax
print(findLESS([1, 4, 5, 7, 8, 12]))
If range of integers in source data (M) is small, a simple algorithm is possible with O(M2) time and O(M) space:
def findLESS(src):
r = [False for i in range(src[-1]+1)]
for x in src:
r[x] = True
d = 1
best = 1
while best * d < len(r):
for s in range(d):
l = 0
for i in range(s, len(r), d):
if r[i]:
l += 1
best = max(best, l)
else:
l = 0
d += 1
return best
print(findLESS([1, 4, 5, 7, 8, 12]))
It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose source data has no duplicates. And (to keep the code simple) I also suppose that minimum input value is non-negative and close to zero.
Previous algorithm may be improved if instead of the array of booleans we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements bitset as a built-in Python integer. It has the same assumptions: no duplicates, minimum input value is non-negative and close to zero. Time complexity is O(M2 * log L) where L is the length of optimal subsequence, space complexity is O(M):
def findLESS(src):
r = 0
for x in src:
r |= 1 << x
d = 1
best = 1
while best * d < src[-1] + 1:
c = best
rr = r
while c & (c-1):
cc = c & -c
rr &= rr >> (cc * d)
c &= c-1
while c != 1:
c = c >> 1
rr &= rr >> (c * d)
rr &= rr >> d
while rr:
rr &= rr >> d
best += 1
d += 1
return best
Benchmarks:
Input data (about 100000 integers) is generated this way:
random.seed(42)
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))
And for fastest algorithms I also used the following data (about 1000000 integers):
s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))
All results show time in seconds:
Size: 100000 1000000
Second answer by Armin Rigo: 634 ?
By Armin Rigo, optimized: 64 >5000
O(M^2) algorithm: 53 2940
O(M^2*L) algorithm: 7 711

Algorithm
Main loop traversing the list
If number found in precalculate list, then it's belong to all sequences which are in that list, recalculate all the sequences with count + 1
Remove all precalculated for current element
Recalculate new sequences where first element is from range from 0 to current, and second is current element of traversal (actually, not from 0 to current, we can use the fact that new element shouldn't be more that max(a) and new list should have possibility to become longer that already found one)
So for list [1, 2, 4, 5, 7] output would be (it's a little messy, try code yourself and see)
index 0, element 1:
if 1 in precalc? No - do nothing
Do nothing
index 1, element 2:
if 2 in precalc? No - do nothing
check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
index 2, element 4:
if 4 in precalc? No - do nothing
check if 6 = 2 + (4 - 2) * 2 in our set? No
check if 7 = 1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}} 7 - element of the list, 3 is step.
index 3, element 5:
if 5 in precalc? No - do nothing
do not check 4 because 6 = 4 + (5 - 4) * 2 is less that calculated element 7
check if 8 = 2 + (5 - 2) * 2 in our set? No
check 10 = 2 + (5 - 1) * 2 - more than max(a) == 7
index 4, element 7:
if 7 in precalc? Yes - put it into result
do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7
result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence
Complexity
It shouldn't be more than O(N^2), and I think it's less because of earlier termination of searching new sequencies, I'll try to provide detailed analysis later
Code
def add_precalc(precalc, start, step, count, res, N):
if step == 0: return True
if start + step * res[1]["count"] > N: return False
x = start + step * count
if x > N or x < 0: return False
if precalc[x] is None: return True
if step not in precalc[x]:
precalc[x][step] = {"start":start, "count":count}
return True
def work(a):
precalc = [None] * (max(a) + 1)
for x in a: precalc[x] = {}
N, m = max(a), 0
ind = {x:i for i, x in enumerate(a)}
res = (0, {"start":0, "count":0})
for i, x in enumerate(a):
for el in precalc[x].iteritems():
el[1]["count"] += 1
if el[1]["count"] > res[1]["count"]: res = el
add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
t = el[1]["start"] + el[0] * el[1]["count"]
if t in ind and ind[t] > m:
m = ind[t]
precalc[x] = None
for y in a[i - m - 1::-1]:
if not add_precalc(precalc, y, x - y, 2, res, N): break
return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]

Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.
The idea is quite naive: like the original poster, it is greedy and just checks how far you can extend a subsequence from each pair of points --- however, checking first that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.
The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercice to the reader :-) The time is competitive with other O(n^2) solutions here.
A = [1, 4, 5, 7, 8, 12] # in sorted order
Aset = set(A)
lmax = 2
for j, b in enumerate(A):
for i in range(j):
a = A[i]
step = b - a
if b + step in Aset and a - step not in Aset:
c = b + step
count = 3
while c + step in Aset:
c += step
count += 1
#print "found %d items in %d .. %d" % (count, a, c)
if count > lmax:
lmax = count
print lmax

Your solution is O(N^3) now (you said O(N^2) per index). Here it is O(N^2) of time and O(N^2) of memory solution.
Idea
If we know subsequence that goes through indices i[0],i[1],i[2],i[3] we shouldn't try subsequence that starts with i[1] and i[2] or i[2] and i[3]
Note I edited that code to make it a bit easier using that a sorted but it will not work for equal elements. You may check number max number of equal elements in O(N) easily
Pseudocode
I'm seeking only for max length but that doesn't change anything
whereInA = {}
for i in range(n):
whereInA[a[i]] = i; // It doesn't matter which of same elements it points to
boolean usedPairs[n][n];
for i in range(n):
for j in range(i + 1, n):
if usedPair[i][j]:
continue; // do not do anything. It was in one of prev sequences.
usedPair[i][j] = true;
//here quite stupid solution:
diff = a[j] - a[i];
if diff == 0:
continue; // we can't work with that
lastIndex = j
currentLen = 2
while whereInA contains index a[lastIndex] + diff :
nextIndex = whereInA[a[lastIndex] + diff]
usedPair[lastIndex][nextIndex] = true
++currentLen
lastIndex = nextIndex
// you may store all indicies here
maxLen = max(maxLen, currentLen)
Thoughts about memory usage
O(n^2) time is very slow for 1000000 elements. But if you are going to run this code on such number of elements the biggest problem will be memory usage.
What can be done to reduce it?
Change boolean arrays to bitfields to store more booleans per bit.
Make each next boolean array shorter because we only use usedPairs[i][j] if i < j
Few heuristics:
Store only pairs of used indicies. (Conflicts with the first idea)
Remove usedPairs that will never used more (that are for such i,j that was already chosen in the loop)

This is my 2 cents.
If you have a list called input:
input = [1, 4, 5, 7, 8, 12]
You can build a data structure that for each one of this points (excluding the first one), will tell you how far is that point from anyone of its predecessors:
[1, 4, 5, 7, 8, 12]
x 3 4 6 7 11 # distance from point i to point 0
x x 1 3 4 8 # distance from point i to point 1
x x x 2 3 7 # distance from point i to point 2
x x x x 1 5 # distance from point i to point 3
x x x x x 4 # distance from point i to point 4
Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.
The numbers that belong to a series of equidistant numbers that include input[i], are those which have n * j in the i-th position of their column, where j is the number of matches already found when moving columns from left to right, plus the k-th predecessor of input[i], where k is the index of n in the column of input[i].
Example: if we consider i = 1, input[i] = 4, n = 3, then, we can identify a sequence comprehending 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1, because k is 0, so we take the first predecessor of i.
Possible implementation (sorry if the code is not using the same notation as the explanation):
def build_columns(l):
columns = {}
for x in l[1:]:
col = []
for y in l[:l.index(x)]:
col.append(x - y)
columns[x] = col
return columns
def algo(input, columns):
seqs = []
for index1, number in enumerate(input[1:]):
index1 += 1 #first item was sliced
for index2, distance in enumerate(columns[number]):
seq = []
seq.append(input[index2]) # k-th pred
seq.append(number)
matches = 1
for successor in input[index1 + 1 :]:
column = columns[successor]
if column[index1] == distance * matches:
matches += 1
seq.append(successor)
if (len(seq) > 2):
seqs.append(seq)
return seqs
The longest one:
print max(sequences, key=len)

Traverse the array, keeping a record of the optimal result/s and a table with
(1) index - the element difference in the sequence,
(2) count - number of elements in the sequence so far, and
(3) the last recorded element.
For each array element look at the difference from each previous array element; if that element is last in a sequence indexed in the table, adjust that sequence in the table, and update the best sequence if applicable, otherwise start a new sequence, unless the current max is greater than the length of the possible sequence.
Scanning backwards we can stop our scan when d is greater than the middle of the array's range; or when the current max is greater than the length of the possible sequence, for d greater than the largest indexed difference. Sequences where s[j] is greater than the last element in the sequence are deleted.
I converted my code from JavaScript to Python (my first python code):
import random
import timeit
import sys
#s = [1,4,5,7,8,12]
#s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
#s = [0, 6, 7, 10, 11, 12, 16, 18, 19]
m = [random.randint(1,40000) for r in xrange(20000)]
s = list(set(m))
s.sort()
lenS = len(s)
halfRange = (s[lenS-1] - s[0]) // 2
while s[lenS-1] - s[lenS-2] > halfRange:
s.pop()
lenS -= 1
halfRange = (s[lenS-1] - s[0]) // 2
while s[1] - s[0] > halfRange:
s.pop(0)
lenS -=1
halfRange = (s[lenS-1] - s[0]) // 2
n = lenS
largest = (s[n-1] - s[0]) // 2
#largest = 1000 #set the maximum size of d searched
maxS = s[n-1]
maxD = 0
maxSeq = 0
hCount = [None]*(largest + 1)
hLast = [None]*(largest + 1)
best = {}
start = timeit.default_timer()
for i in range(1,n):
sys.stdout.write(repr(i)+"\r")
for j in range(i-1,-1,-1):
d = s[i] - s[j]
numLeft = n - i
if d != 0:
maxPossible = (maxS - s[i]) // d + 2
else:
maxPossible = numLeft + 2
ok = numLeft + 2 > maxSeq and maxPossible > maxSeq
if d > largest or (d > maxD and not ok):
break
if hLast[d] != None:
found = False
for k in range (len(hLast[d])-1,-1,-1):
tmpLast = hLast[d][k]
if tmpLast == j:
found = True
hLast[d][k] = i
hCount[d][k] += 1
tmpCount = hCount[d][k]
if tmpCount > maxSeq:
maxSeq = tmpCount
best = {'len': tmpCount, 'd': d, 'last': i}
elif s[tmpLast] < s[j]:
del hLast[d][k]
del hCount[d][k]
if not found and ok:
hLast[d].append(i)
hCount[d].append(2)
elif ok:
if d > maxD:
maxD = d
hLast[d] = [i]
hCount[d] = [2]
end = timeit.default_timer()
seconds = (end - start)
#print (hCount)
#print (hLast)
print(best)
print(seconds)

This is a particular case for the more generic problem described here: Discover long patterns where K=1 and is fixed. It is demostrated there that it can be solved in O(N^2). Runnig my implementation of the C algorithm proposed there it takes 3 seconds to find the solution for N=20000 and M=28000 in my 32bit machine.

Greedy method
1 .Only one sequence of decision is generated.
2. Many number of decisions are generated.
Dynamic programming
1. It does not guarantee to give an optimal solution always.
2. It definitely gives an optimal solution.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

All possible combinations of numbers with rounding error matching sum - python

Related

Subset problem in python - fix the number of addends that sum to the target

Is there a python function that returns the first positive int that does not occur in list?

adding sequential values in list to target value

Python: Determine whether each step in path across n arrays falls below threshold value

Longest equally-spaced subsequence

Categories

Resources