Longest equally-spaced subsequence - python

I have a million integers in sorted order and I would like to find the longest subsequence where the difference between consecutive pairs is equal. For example
1, 4, 5, 7, 8, 12
has a subsequence
4, 8, 12
My naive method is greedy and just checks how far you can extend a subsequence from each point. This takes O(n²) time per point it seems.
Is there a faster way to solve this problem?
Update. I will test the code given in the answers as soon as possible (thank you). However it is clear already that using n^2 memory will not work. So far there is no code that terminates with the input as [random.randint(0,100000) for r in xrange(200000)] .
Timings. I tested with the following input data on my 32 bit system.
a= [random.randint(0,10000) for r in xrange(20000)]
The dynamic programming method of ZelluX uses 1.6G of RAM and takes 2 minutes and 14 seconds. With pypy it takes only 9 seconds! However it crashes with a memory error on large inputs.
The O(nd) time method of Armin took 9 seconds with pypy but only 20MB of RAM. Of course this would be much worse if the range were much larger. The low memory usage meant I could also test it with a= [random.randint(0,100000) for r in xrange(200000)] but it didn't finish in the few minutes I gave it with pypy.
In order to be able to test the method of Kluev's I reran with
a= [random.randint(0,40000) for r in xrange(28000)]
a = list(set(a))
to make a list of length roughly 20000. All timings with pypy
ZelluX, 9 seconds
Kluev, 20 seconds
Armin, 52 seconds
It seems that if the ZelluX method could be made linear space it would be the clear winner.

We can have a solution O(n*m) in time with very little memory needs, by adapting yours. Here n is the number of items in the given input sequence of numbers, and m is the range, i.e. the highest number minus the lowest one.
Call A the sequence of all input numbers (and use a precomputed set() to answer in constant time the question "is this number in A?"). Call d the step of the subsequence we're looking for (the difference between two numbers of this subsequence). For every possible value of d, do the following linear scan over all input numbers: for every number n from A in increasing order, if the number was not already seen, look forward in A for the length of the sequence starting at n with a step d. Then mark all items in that sequence as already seen, so that we avoid searching again from them, for the same d. Because of this, the complexity is just O(n) for every value of d.
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)
for d in range(1, 12):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)    # do not start again from b for this d
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'
This solution might be good enough if you're only interested in values of d that are relatively small; for example, if getting the best result for d <= 1000 would be good enough. Then the complexity goes down to O(n*1000). This makes the algorithm approximative, but actually runnable for n=1000000. (Measured at 400-500 seconds with CPython, 80-90 seconds with PyPy, with a random subset of numbers between 0 and 10'000'000.)
If you still want to search for the whole range, and if the common case is that long sequences exist, a notable improvement is to stop as soon as d is too large for an even longer sequence to be found.
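The early stop mentioned above can be sketched as follows. This is only an illustration in Python 3, not the code that was timed; the function name `longest_run_by_step` and the returned tuple layout are my own assumptions. A run longer than the current best of length L needs at least L steps of size d, so the scan over d stops once L * d exceeds the range of the input.

```python
def longest_run_by_step(values):
    """Return (length, start, step) of a longest equally-spaced subsequence."""
    A = sorted(set(values))
    if len(A) < 2:
        return (len(A), A[0] if A else None, 0)
    members = set(A)
    span = A[-1] - A[0]
    best = (2, A[0], A[1] - A[0])      # any two items form a run of length 2
    d = 1
    # A run longer than best[0] covers at least best[0] * d, so stop beyond span.
    while best[0] * d <= span:
        seen = set()
        for a in A:
            if a in seen:
                continue               # already inside a run counted for this d
            b, count = a, 1
            while b + d in members:
                b += d
                count += 1
                seen.add(b)
            if count > best[0]:
                best = (count, a, d)
        d += 1
    return best


print(longest_run_by_step([1, 4, 5, 7, 8, 12]))
```

When long runs exist, the loop over d ends early; on inputs without long runs it still degrades to the full O(n*m) scan.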

UPDATE: I've found a paper on this problem, you can download it here.
Here is a solution based on dynamic programming. It requires O(n^2) time complexity and O(n^2) space complexity, and does not use hashing.
We assume all numbers are saved in array a in ascending order, and n saves its length. 2D array l[i][j] defines length of longest equally-spaced subsequence ending with a[i] and a[j], and l[j][k] = l[i][j] + 1 if a[j] - a[i] = a[k] - a[j] (i < j < k).
lmax = 2
l = [[2 for i in xrange(n)] for j in xrange(n)]
for mid in xrange(n - 1):
    prev = mid - 1
    succ = mid + 1
    while (prev >= 0 and succ < n):
        if a[prev] + a[succ] < a[mid] * 2:
            succ += 1
        elif a[prev] + a[succ] > a[mid] * 2:
            prev -= 1
        else:
            # a[prev], a[mid], a[succ] are equally spaced
            l[mid][succ] = l[prev][mid] + 1
            lmax = max(lmax, l[mid][succ])
            prev -= 1
            succ += 1
print lmax

Update: First algorithm described here is obsoleted by Armin Rigo's second answer, which is much simpler and more efficient. But both these methods have one disadvantage. They need many hours to find the result for one million integers. So I tried two more variants (see second half of this answer) where the range of input integers is assumed to be limited. Such limitation allows much faster algorithms. Also I tried to optimize Armin Rigo's code. See my benchmarking results at the end.
Here is an idea for an algorithm using O(N) memory. Time complexity is O(N^2 log N), but may be decreased to O(N^2).
Algorithm uses the following data structures:
prev: array of indexes pointing to previous element of (possibly incomplete) subsequence.
hash: hashmap with key = difference between consecutive pairs in subsequence and value = two other hashmaps. For these other hashmaps: key = starting/ending index of the subsequence, value = pair of (subsequence length, ending/starting index of the subsequence).
pq: priority queue for all possible "difference" values for subsequences stored in prev and hash.
1. Initialize prev with indexes i-1. Update hash and pq to register all (incomplete) subsequences found on this step and their "differences".
2. Get (and remove) smallest "difference" from pq. Get corresponding record from hash and scan one of second-level hash maps. At this time all subsequences with given "difference" are complete. If second-level hash map contains subsequence length better than found so far, update the best result.
3. In the array prev: for each element of any sequence found on step #2, decrement index and update hash and possibly pq. While updating hash, we could perform one of the following operations: add a new subsequence of length 1, or grow some existing subsequence by 1, or merge two existing subsequences.
4. Remove hash map record found on step #2.
5. Continue from step #2 while pq is not empty.
This algorithm updates O(N) elements of prev O(N) times each, and each of these updates may require adding a new "difference" to pq. This means a time complexity of O(N^2 log N) if we use a simple heap implementation for pq. To decrease it to O(N^2) we might use more advanced priority queue implementations. Some of the possibilities are listed on this page: Priority Queues.
See corresponding Python code on Ideone. This code does not allow duplicate elements in the list. It is possible to fix this, but it would be a good optimization anyway to remove duplicates (and to find the longest subsequence beyond duplicates separately).
And the same code after a little optimization. Here search is terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range.
Armin Rigo's code is simple and pretty efficient. But in some cases it does some extra computations that may be avoided. Search may be terminated as soon as subsequence length multiplied by possible subsequence "difference" exceeds source list range:
def findLESS(A):
    Aset = set(A)
    lmax = 2
    d = 1
    minStep = 0
    while (lmax - 1) * minStep <= A[-1] - A[0]:
        minStep = A[-1] - A[0] + 1
        for j, b in enumerate(A):
            if j+d < len(A):
                a = A[j+d]
                step = a - b
                minStep = min(minStep, step)
                if a + step in Aset and b - step not in Aset:
                    c = a + step
                    count = 3
                    while c + step in Aset:
                        c += step
                        count += 1
                    if count > lmax:
                        lmax = count
        d += 1
    return lmax
print(findLESS([1, 4, 5, 7, 8, 12]))
If the range of integers in the source data (M) is small, a simple algorithm is possible with O(M^2) time and O(M) space:
def findLESS(src):
    r = [False for i in range(src[-1]+1)]
    for x in src:
        r[x] = True
    d = 1
    best = 1
    while best * d < len(r):
        for s in range(d):
            l = 0
            for i in range(s, len(r), d):
                if r[i]:
                    l += 1
                    best = max(best, l)
                else:
                    l = 0    # the run with step d is broken here
        d += 1
    return best
print(findLESS([1, 4, 5, 7, 8, 12]))
It is similar to the first method by Armin Rigo, but it doesn't use any dynamic data structures. I suppose source data has no duplicates. And (to keep the code simple) I also suppose that minimum input value is non-negative and close to zero.
The previous algorithm may be improved if, instead of the array of booleans, we use a bitset data structure and bitwise operations to process data in parallel. The code shown below implements the bitset as a built-in Python integer. It has the same assumptions: no duplicates, and a minimum input value that is non-negative and close to zero. Time complexity is O(M^2 * log L), where L is the length of the optimal subsequence; space complexity is O(M):
def findLESS(src):
    r = 0
    for x in src:
        r |= 1 << x
    d = 1
    best = 1
    while best * d < src[-1] + 1:
        c = best
        rr = r
        while c & (c-1):
            cc = c & -c
            rr &= rr >> (cc * d)
            c &= c-1
        while c != 1:
            c = c >> 1
            rr &= rr >> (c * d)
        rr &= rr >> d
        while rr:
            rr &= rr >> d
            best += 1
        d += 1
    return best
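To make the bitset trick easier to follow, here is a minimal sketch of its core step in isolation. The helper names `to_bitset` and `run_starts` are hypothetical and not part of the code above; the sketch uses the plain one-shift-per-item form rather than the doubling shortcut, and assumes non-negative inputs. After `bits &= bits >> d`, bit x survives only if both x and x + d were set, so repeating the step k times leaves exactly the starting points of runs of k + 1 items.

```python
def to_bitset(values):
    """Pack non-negative integers into one Python int: bit x is set for value x."""
    bits = 0
    for x in values:
        bits |= 1 << x
    return bits


def run_starts(bits, d, length):
    """Bitset of every x such that x, x+d, ..., x+(length-1)*d are all present."""
    result = bits
    for _ in range(length - 1):
        # Bit x stays set only if the shorter run exists at both x and x + d.
        result &= result >> d
    return result


bits = to_bitset([1, 4, 5, 7, 8, 12])
print(bin(run_starts(bits, 4, 3)))   # only bit 4 remains: the run 4, 8, 12
```

One shift-and-AND processes every starting point at once, which is where the speedup over the boolean array comes from.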
Input data (about 100000 integers) is generated this way:
s = sorted(list(set([random.randint(0,200000) for r in xrange(140000)])))
And for fastest algorithms I also used the following data (about 1000000 integers):
s = sorted(list(set([random.randint(0,2000000) for r in xrange(1400000)])))
All results show time in seconds:
Size:                         100000   1000000
Second answer by Armin Rigo:     634         ?
By Armin Rigo, optimized:         64     >5000
O(M^2) algorithm:                 53      2940
O(M^2*L) algorithm:                7       711

Main loop traversing the list
If the number is found in the precalculated list, then it belongs to all sequences stored for it in that list; recalculate all those sequences with count + 1
Remove all precalculated entries for the current element
Recalculate new sequences where the first element is taken from the range from 0 to current, and the second is the current element of the traversal (actually, not from 0 to current: we can use the fact that a new element shouldn't be more than max(a), and that the new list should have the possibility to become longer than the one already found)
So for list [1, 2, 4, 5, 7] output would be (it's a little messy, try code yourself and see)
index 0, element 1:
if 1 in precalc? No - do nothing
Do nothing
index 1, element 2:
if 2 in precalc? No - do nothing
check if 3 = 1 + (2 - 1) * 2 in our set? No - do nothing
index 2, element 4:
if 4 in precalc? No - do nothing
check if 6 = 2 + (4 - 2) * 2 in our set? No
check if 7 = 1 + (4 - 1) * 2 in our set? Yes - add new element {7: {3: {'count': 2, 'start': 1}}} 7 - element of the list, 3 is step.
index 3, element 5:
if 5 in precalc? No - do nothing
do not check 4 because 6 = 4 + (5 - 4) * 2 is less than the already calculated element 7
check if 8 = 2 + (5 - 2) * 2 in our set? No
check 10 = 2 + (5 - 1) * 2 - more than max(a) == 7
index 4, element 7:
if 7 in precalc? Yes - put it into result
do not check 5 because 9 = 5 + (7 - 5) * 2 is more than max(a) == 7
result = (3, {'count': 3, 'start': 1}) # step 3, count 3, start 1, turn it into sequence
It shouldn't be more than O(N^2), and I think it's less because of the earlier termination of the search for new sequences. I'll try to provide a detailed analysis later.
def add_precalc(precalc, start, step, count, res, N):
    if step == 0: return True
    if start + step * res[1]["count"] > N: return False
    x = start + step * count
    if x > N or x < 0: return False
    if precalc[x] is None: return True
    if step not in precalc[x]:
        precalc[x][step] = {"start":start, "count":count}
    return True
def work(a):
    precalc = [None] * (max(a) + 1)
    for x in a: precalc[x] = {}
    N, m = max(a), 0
    ind = {x:i for i, x in enumerate(a)}
    res = (0, {"start":0, "count":0})
    for i, x in enumerate(a):
        for el in precalc[x].iteritems():
            el[1]["count"] += 1
            if el[1]["count"] > res[1]["count"]: res = el
            add_precalc(precalc, el[1]["start"], el[0], el[1]["count"], res, N)
            t = el[1]["start"] + el[0] * el[1]["count"]
            if t in ind and ind[t] > m:
                m = ind[t]
        precalc[x] = None
        for y in a[i - m - 1::-1]:
            if not add_precalc(precalc, y, x - y, 2, res, N): break
    return [x * res[0] + res[1]["start"] for x in range(res[1]["count"])]

Here is another answer, working in time O(n^2) and without any notable memory requirements beyond that of turning the list into a set.
The idea is quite naive: like the original poster, it is greedy and just checks how far you can extend a subsequence from each pair of points --- however, checking first that we're at the start of a subsequence. In other words, from points a and b you check how far you can extend to b + (b-a), b + 2*(b-a), ... but only if a - (b-a) is not already in the set of all points. If it is, then you already saw the same subsequence.
The trick is to convince ourselves that this simple optimization is enough to lower the complexity to O(n^2) from the original O(n^3). That's left as an exercise to the reader :-) The time is competitive with other O(n^2) solutions here.
A = [1, 4, 5, 7, 8, 12]    # in sorted order
Aset = set(A)
lmax = 2
for j, b in enumerate(A):
    for i in range(j):
        a = A[i]
        step = b - a
        if b + step in Aset and a - step not in Aset:
            c = b + step
            count = 3
            while c + step in Aset:
                c += step
                count += 1
            #print "found %d items in %d .. %d" % (count, a, c)
            if count > lmax:
                lmax = count
print lmax
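If the subsequence itself is wanted rather than only its length, the same loop can remember where the best run starts. The following is a sketch under my own assumptions: Python 3, a hypothetical function name `longest_equally_spaced`, and duplicates removed up front.

```python
def longest_equally_spaced(values):
    """Return one longest equally-spaced subsequence of the distinct values."""
    A = sorted(set(values))
    if len(A) <= 2:
        return A
    members = set(A)
    best_start, best_step, best_len = A[0], A[1] - A[0], 2
    for j, b in enumerate(A):
        for i in range(j):
            a = A[i]
            step = b - a
            # Extend only from a genuine start: a - step must not be in the set.
            if b + step in members and a - step not in members:
                c = b + step
                count = 3
                while c + step in members:
                    c += step
                    count += 1
                if count > best_len:
                    best_start, best_step, best_len = a, step, count
    return [best_start + k * best_step for k in range(best_len)]


print(longest_equally_spaced([1, 3, 5, 6, 7, 9, 12]))   # [1, 3, 5, 7, 9]
```

When several runs share the maximum length, this returns the first one found; for the example in the question both 1, 4, 7 and 4, 8, 12 have length 3.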

Your solution is O(N^3) now (you said O(N^2) per index). Here is a solution that takes O(N^2) time and O(N^2) memory.
If we know a subsequence that goes through indices i[0], i[1], i[2], i[3], we shouldn't try a subsequence that starts with i[1] and i[2], or with i[2] and i[3].
Note that I edited the code to make it a bit simpler by using the fact that a is sorted, but it will not work for equal elements. You can easily find the maximum number of equal elements in O(N).
I only look for the maximum length, but that doesn't change anything.
# a is the sorted input list (distinct values), n = len(a)
whereInA = {}
for i in range(n):
    whereInA[a[i]] = i  # it doesn't matter which of the equal elements it points to
usedPairs = [[False] * n for _ in range(n)]
maxLen = 0
for i in range(n):
    for j in range(i + 1, n):
        if usedPairs[i][j]:
            continue  # do nothing: this pair was in one of the previous sequences
        usedPairs[i][j] = True
        # a quite naive extension step:
        diff = a[j] - a[i]
        if diff == 0:
            continue  # we can't work with equal elements
        lastIndex = j
        currentLen = 2
        while a[lastIndex] + diff in whereInA:
            nextIndex = whereInA[a[lastIndex] + diff]
            usedPairs[lastIndex][nextIndex] = True
            currentLen += 1
            lastIndex = nextIndex
        # you may store all indices here
        maxLen = max(maxLen, currentLen)
Thoughts about memory usage
O(n^2) time is very slow for 1000000 elements. But if you are going to run this code on such number of elements the biggest problem will be memory usage.
What can be done to reduce it?
Change boolean arrays to bitfields to store more booleans per byte.
Make each next boolean array shorter because we only use usedPairs[i][j] if i < j
Few heuristics:
Store only pairs of used indices. (Conflicts with the first idea)
Remove usedPairs entries that will never be used again (those for i, j that were already chosen in the loop)
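The first two ideas combine naturally: keep one bit per pair and only for i < j. A minimal sketch, assuming Python 3 and a hypothetical class name `UpperTriangleBits` (nothing here comes from the code above), could look like this:

```python
class UpperTriangleBits:
    """One bit per pair (i, j) with 0 <= i < j < n, packed into a bytearray."""

    def __init__(self, n):
        self.n = n
        total = n * (n - 1) // 2            # number of pairs with i < j
        self.data = bytearray((total + 7) // 8)

    def _pos(self, i, j):
        if not 0 <= i < j < self.n:
            raise IndexError("expected 0 <= i < j < n")
        # Rows 0..i-1 hold (n-1) + (n-2) + ... + (n-i) bits in total.
        return i * (2 * self.n - i - 1) // 2 + (j - i - 1)

    def get(self, i, j):
        pos = self._pos(i, j)
        return bool(self.data[pos >> 3] & (1 << (pos & 7)))

    def mark(self, i, j):
        pos = self._pos(i, j)
        self.data[pos >> 3] |= 1 << (pos & 7)


bits = UpperTriangleBits(1000)
bits.mark(3, 17)
print(bits.get(3, 17), bits.get(3, 18), len(bits.data))
```

This needs about n^2 / 16 bytes, so it is still quadratic in memory; it only shrinks the constant.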

This is my 2 cents.
If you have a list called input:
input = [1, 4, 5, 7, 8, 12]
You can build a data structure that, for each of these points (excluding the first one), tells you how far that point is from each of its predecessors:
[1, 4, 5, 7, 8, 12]
 x  3  4  6  7  11   # distance from point i to point 0
 x  x  1  3  4   8   # distance from point i to point 1
 x  x  x  2  3   7   # distance from point i to point 2
 x  x  x  x  1   5   # distance from point i to point 3
 x  x  x  x  x   4   # distance from point i to point 4
Now that you have the columns, you can consider the i-th item of input (which is input[i]) and each number n in its column.
The numbers that belong to a series of equidistant numbers that include input[i], are those which have n * j in the i-th position of their column, where j is the number of matches already found when moving columns from left to right, plus the k-th predecessor of input[i], where k is the index of n in the column of input[i].
Example: if we consider i = 1, input[i] = 4, n = 3, then we can identify a sequence comprising 4 (input[i]), 7 (because it has a 3 in position 1 of its column) and 1, because k is 0, so we take the first predecessor of i.
Possible implementation (sorry if the code is not using the same notation as the explanation):
def build_columns(l):
    columns = {}
    for x in l[1:]:
        col = []
        for y in l[:l.index(x)]:
            col.append(x - y)
        columns[x] = col
    return columns
def algo(input, columns):
    seqs = []
    for index1, number in enumerate(input[1:]):
        index1 += 1  # first item was sliced
        for index2, distance in enumerate(columns[number]):
            seq = []
            seq.append(input[index2])  # k-th pred
            seq.append(number)
            matches = 1
            for successor in input[index1 + 1:]:
                column = columns[successor]
                if column[index1] == distance * matches:
                    matches += 1
                    seq.append(successor)
            if len(seq) > 2:
                seqs.append(seq)
    return seqs
The longest one:
sequences = algo(input, build_columns(input))
print max(sequences, key=len)

Traverse the array, keeping a record of the optimal result/s and a table with
(1) index - the element difference in the sequence,
(2) count - number of elements in the sequence so far, and
(3) the last recorded element.
For each array element look at the difference from each previous array element; if that element is last in a sequence indexed in the table, adjust that sequence in the table, and update the best sequence if applicable, otherwise start a new sequence, unless the current max is greater than the length of the possible sequence.
Scanning backwards we can stop our scan when d is greater than the middle of the array's range; or when the current max is greater than the length of the possible sequence, for d greater than the largest indexed difference. Sequences where s[j] is greater than the last element in the sequence are deleted.
I converted my code from JavaScript to Python (my first python code):
import random
import timeit
import sys
#s = [1,4,5,7,8,12]
#s = [2, 6, 7, 10, 13, 14, 17, 18, 21, 22, 23, 25, 28, 32, 39, 40, 41, 44, 45, 46, 49, 50, 51, 52, 53, 63, 66, 67, 68, 69, 71, 72, 74, 75, 76, 79, 80, 82, 86, 95, 97, 101, 110, 111, 112, 114, 115, 120, 124, 125, 129, 131, 132, 136, 137, 138, 139, 140, 144, 145, 147, 151, 153, 157, 159, 161, 163, 165, 169, 172, 173, 175, 178, 179, 182, 185, 186, 188, 195]
#s = [0, 6, 7, 10, 11, 12, 16, 18, 19]
m = [random.randint(1,40000) for r in xrange(20000)]
s = sorted(set(m))  # distinct values in increasing order
lenS = len(s)
halfRange = (s[lenS-1] - s[0]) // 2
# trim isolated outliers at either end of the list
while s[lenS-1] - s[lenS-2] > halfRange:
    s.pop()
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2
while s[1] - s[0] > halfRange:
    s.pop(0)
    lenS -= 1
    halfRange = (s[lenS-1] - s[0]) // 2
n = lenS
largest = (s[n-1] - s[0]) // 2
#largest = 1000 #set the maximum size of d searched
maxS = s[n-1]
maxD = 0
maxSeq = 0
hCount = [None]*(largest + 1)
hLast = [None]*(largest + 1)
best = {}
start = timeit.default_timer()
for i in range(1,n):
    for j in range(i-1,-1,-1):
        d = s[i] - s[j]
        numLeft = n - i
        if d != 0:
            maxPossible = (maxS - s[i]) // d + 2
        else:
            maxPossible = numLeft + 2
        ok = numLeft + 2 > maxSeq and maxPossible > maxSeq
        if d > largest or (d > maxD and not ok):
            break  # no larger d can help for this i
        if hLast[d] != None:
            found = False
            for k in range (len(hLast[d])-1,-1,-1):
                tmpLast = hLast[d][k]
                if tmpLast == j:
                    found = True
                    hLast[d][k] = i
                    hCount[d][k] += 1
                    tmpCount = hCount[d][k]
                    if tmpCount > maxSeq:
                        maxSeq = tmpCount
                        best = {'len': tmpCount, 'd': d, 'last': i}
                elif s[tmpLast] < s[j]:
                    del hLast[d][k]
                    del hCount[d][k]
            if not found and ok:
                hLast[d].append(i)
                hCount[d].append(2)
        elif ok:
            if d > maxD:
                maxD = d
            hLast[d] = [i]
            hCount[d] = [2]
end = timeit.default_timer()
seconds = (end - start)
#print (hCount)
#print (hLast)

This is a particular case of the more generic problem described here: Discover long patterns, where K=1 and is fixed. It is demonstrated there that it can be solved in O(N^2). Running my implementation of the C algorithm proposed there, it takes 3 seconds to find the solution for N=20000 and M=28000 on my 32-bit machine.

Greedy method
1. Only one sequence of decisions is generated.
2. It does not always guarantee an optimal solution.
Dynamic programming
1. Many sequences of decisions are generated.
2. It definitely gives an optimal solution.


Using binary search to find the duplicate number in an array

The problem:
Given an array of integers nums containing n + 1 integers where each integer is in the range [1, n] inclusive.
There is only one repeated number in nums, return this repeated number.
You must solve the problem without modifying the array nums and using only constant extra space.
Here is one of the possible solutions, using binary search:
class Solution(object):
    def findDuplicate(self, nums):
        beg, end = 1, len(nums)-1
        while beg + 1 <= end:
            mid, count = (beg + end)//2, 0
            for num in nums:
                if num <= mid: count += 1
            if count <= mid:
                beg = mid + 1
            else:
                end = mid
        return end
Example 1:
Input: nums = [1,3,4,2,2]
Output: 2
Example 2:
Input: nums = [3,1,3,4,2]
Output: 3
Can someone please explain this solution for me? I understand the code but I don't understand the logic behind this. In particular, I do not understand how to construct the if statements (lines 7 - 13). Why and how do you know that when num <= mid then I need to do count += 1 and so on. Many thanks.
The solution keeps halving the range of numbers the answer can still be in.
For example, if the function starts with nums == [1, 3, 4, 2, 2], then the duplicate number must be between 1 and 4 inclusive by definition.
By counting how many of the numbers are smaller than or equal to the middle of that range (2), you can decide if the duplicate must be in the upper or lower half of that range. Since that count is greater than the middle value (3 numbers are less than or equal to 2, and 3 > 2), the duplicate must be in the lower half.
Repeating the process, knowing that the number must be between 1 and 2 inclusive, only 1 number is less than or equal to the middle of that range (1), which means the number must be in the upper half, and is 2.
Consider a slightly longer series: [1, 2, 5, 6, 3, 4, 3, 7]. The middle of 1 to 7 is 4, and 5 numbers are less than or equal to 4, so the duplicate must be between 1 and 4. The middle of 1 to 4 is 2, and only 2 numbers are less than or equal to 2, so the duplicate must be over 2, which leaves 3 to 4. The middle of that range is 3, and 4 numbers are less than or equal to 3, so the duplicate is 3.
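To watch the halving happen, here is a small sketch that records every step of the search. It follows the same logic as the solution in the question; the function name `find_duplicate_with_trace` and the shape of the recorded tuples are my own choices.

```python
def find_duplicate_with_trace(nums):
    """Binary search on the value range [1, n]; returns (duplicate, steps)."""
    beg, end = 1, len(nums) - 1
    steps = []
    while beg < end:
        mid = (beg + end) // 2
        count = sum(1 for num in nums if num <= mid)
        steps.append((beg, end, mid, count))
        if count <= mid:
            beg = mid + 1      # no surplus value in 1..mid: look above mid
        else:
            end = mid          # more than mid values fit in 1..mid: look there
    return end, steps


duplicate, steps = find_duplicate_with_trace([1, 3, 4, 2, 2])
for beg, end, mid, count in steps:
    print("range %d..%d, mid %d, count %d" % (beg, end, mid, count))
print("duplicate:", duplicate)
```

Each recorded tuple is (beg, end, mid, count), so you can compare count with mid by eye and see which half is kept.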
The solution will iterate over all n elements of nums a limited number of times, since it keeps halving the search space. It's certainly more efficient than the naive:
def findDuplicate(self, nums):
    for i, n in enumerate(nums):
        for j, m in enumerate(nums):
            if i != j and n == m:
                return n
But as user @fas suggests in the comments, this is better:
def findDuplicate(nums):
    p = 1
    while p < len(nums):
        p <<= 1
    r = 0
    for n in nums:
        r ^= n
    for n in range(len(nums), p):
        r ^= n
    return r
This is how given binary search works. Inside binary search you have implementation of isDuplicateLessOrEqualTo(x):
mid, count = (beg + end)//2, 0
for num in nums:
    if num <= mid: count += 1
if count <= mid:
    return False  # In this case there are no duplicates less or equal than mid.
                  # Actually count == mid would be enough, count < mid is impossible.
return True  # In this case there is a duplicate less or equal than mid.
isDuplicateLessOrEqualTo(x) is a non-decreasing function (assume x has a duplicate, then for all i < x it's false and for all i >= x it's true), that's why you can run binary search over it.
Each iteration you run through the array, which gives you overall complexity O(n log n) (where n is size of array).
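A quick way to convince yourself that the predicate is monotone is to tabulate it for every candidate x. This is only an illustrative sketch; `is_duplicate_at_most` and `predicate_table` are names I made up for it.

```python
def is_duplicate_at_most(nums, x):
    """True if the repeated value is <= x (more than x items fall into 1..x)."""
    return sum(1 for num in nums if num <= x) > x


def predicate_table(nums):
    """Predicate value for every candidate x in 1..n, where n = len(nums) - 1."""
    return [is_duplicate_at_most(nums, x) for x in range(1, len(nums))]


table = predicate_table([3, 1, 3, 4, 2])
print(table)                    # [False, False, True, True]
print(table.index(True) + 1)    # 3, the first x for which the predicate holds
```

The table is a block of False followed by a block of True, and the binary search simply locates the boundary without computing every entry.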
There's a faster solution. Note that xor(0..(2^n)-1) = 0 for n >= 2, because there are 2^(n-1) ones for each bit position and it's an even number, for example:
0_10 = 00_2
1_10 = 01_2
2_10 = 10_2
3_10 = 11_2
        ^
        2 ones here, 2 is even
       ^
       2 ones here, 2 is even
So by xor-ing all the numbers you'll receive exactly your duplicate number. Let's write it:
def nearestPowerOfTwo(number):
    # smallest power of two strictly greater than number, and at least 4
    lowerBoundDegreeOfTwo = number.bit_length()
    lowerBoundDegreeOfTwo = max(lowerBoundDegreeOfTwo, 2)
    return 2 ** lowerBoundDegreeOfTwo
class Solution(object):
    def findDuplicate(self, nums):
        xorSum = 0
        for i in nums:
            xorSum = xorSum ^ i
        # pad with the values n+1 .. 2^k - 1 so that everything but the duplicate cancels
        for i in range(len(nums), nearestPowerOfTwo(len(nums) - 1)):
            xorSum = xorSum ^ i
        return xorSum
As you can see that gives us O(n) complexity.
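The identity this relies on is easy to check numerically. The sketch below is mine (the helper `xor_range` is not from the answer) and it also shows the padding on the first example. Note that the cancellation assumes every value from 1 to n is present and the duplicate occurs exactly twice.

```python
from functools import reduce
from operator import xor


def xor_range(stop):
    """XOR of all integers in range(stop)."""
    return reduce(xor, range(stop), 0)


# xor(0 .. 2^k - 1) is 0 for every k >= 2 (checked here up to k = 11).
print(all(xor_range(2 ** k) == 0 for k in range(2, 12)))

nums = [1, 3, 4, 2, 2]
padded = nums + list(range(len(nums), 8))   # add 5, 6, 7 to complete 0..7
print(reduce(xor, padded, 0))               # everything cancels except the duplicate
```

For k = 1 the identity does not hold (0 xor 1 is 1), which is why the answer's helper never returns a power of two below 4.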
If anyone is interested in a different approach (not binary search) to solve this problem:
Sum all elements of the array - we will call it sumArray - the time complexity is O(n).
Sum all numbers from 1 to n (inclusive) - we will call it sumGeneral - this is simply (n * (n+1) / 2) - the time complexity is O(1).
Return the result of sumArray - sumGeneral
In total, the time complexity is O(n) (you cannot do better since you have to look at all elements of the array, potentially the repeated one is at the end), and additional space complexity is O(1).
(If you could use O(n) additional space complexity you could use a hash table)
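Both ideas fit in a few lines. This is a sketch with hypothetical function names; the sum version assumes that nums contains each of 1..n once plus exactly one extra copy of a single value, otherwise the difference is not the duplicate.

```python
def find_duplicate_by_sum(nums):
    """O(1) extra space; assumes 1..n each appear once plus one extra copy."""
    n = len(nums) - 1
    return sum(nums) - n * (n + 1) // 2


def find_duplicate_with_set(nums):
    """O(n) extra space: return the first value that is seen a second time."""
    seen = set()
    for num in nums:
        if num in seen:
            return num
        seen.add(num)
    return None


print(find_duplicate_by_sum([1, 3, 4, 2, 2]), find_duplicate_with_set([3, 1, 3, 4, 2]))
```

The set version makes no assumption about how often the value repeats, at the price of the extra memory.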

Getting all subsets from subset sum problem on Python using Dynamic Programming

I am trying to extract all subsets from a list of elements which add up to a certain value.
Example -
List = [1,3,4,5,6]
Sum - 9
Output Expected = [[3,6],[5,4]]
Have tried different approaches and getting the expected output but on a huge list of elements it is taking a significant amount of time.
Can this be optimized using Dynamic Programming or any other technique.
def subset(array, num):
    result = []
    def find(arr, num, path=()):
        if not arr:
            return
        if arr[0] == num:
            result.append(path + (arr[0],))
        find(arr[1:], num - arr[0], path + (arr[0],))
        find(arr[1:], num, path)
    find(array, num)
    return result
numbers = [2, 2, 1, 12, 15, 2, 3]
x = 7
def isSubsetSum(arr, subset, N, subsetSize, subsetSum, index , sum):
    global flag
    if (subsetSum == sum):
        flag = 1
        for i in range(0, subsetSize):
            print(subset[i], end = " ")
        print()
    else:
        for i in range(index, N):
            subset[subsetSize] = arr[i]
            isSubsetSum(arr, subset, N, subsetSize + 1,
                        subsetSum + arr[i], i + 1, sum)
If you want to output all subsets you can't do better than a sluggish O(2^n) complexity, because in the worst case that will be the size of your output and time complexity is lower-bounded by output size (this is a known NP-Complete problem). But, if rather than returning a list of all subsets, you just want to return a boolean value indicating whether achieving the target sum is possible, or just one subset summing to target (if it exists), you can use dynamic programming for a pseudo-polynomial O(nK) time solution, where n is the number of elements and K is the target integer.
The DP approach involves filling in an (n+1) x (K+1) table, with the sub-problems corresponding to the entries of the table being:
DP[i][k] = subset(A[i:], k) for 0 <= i <= n, 0 <= k <= K
That is, subset(A[i:], k) asks, 'Can I sum to (little) k using the suffix of A starting at index i?' Once you fill in the whole table, the answer to the overall problem, subset(A[0:], K) will be at DP[0][K]
The base cases are for i=n: they indicate that you can't sum to anything except for 0 if you're working with the empty suffix of your array
subset(A[n:], k>0) = False, subset(A[n:], k=0) = True
The recursive cases to fill in the table are:
subset(A[i:], k) = subset(A[i+1:], k) OR (A[i] <= k AND subset(A[i+1:], k-A[i]))
This simply relates the idea that you can use the current array suffix to sum to k either by skipping over the first element of that suffix and using the answer you already had in the previous row (when that first element wasn't in your array suffix), or by using A[i] in your sum and checking if you could make the reduced sum k-A[i] in the previous row. Of course, you can only use the new element if it doesn't itself exceed your target sum.
ex: subset(A[i:] = [3,4,1,6], k = 8)
would check: could I already sum to 8 with the previous suffix (A[i+1:] = [4,1,6])? No. Or, could I use the 3 which is now available to me to sum to 8? That is, could I sum to k = 8 - 3 = 5 with [4,1,6]? Yes. Because at least one of the conditions was true, I set DP[i][8] = True
Because all the base cases are for i=n, and the recurrence relation for subset(A[i:], k) relies on the answers to the smaller sub-problems subset(A[i+1:],...), you start at the bottom of the table, where i = n, fill out every k value from 0 to K for each row, and work your way up to row i = 0, ensuring you have the answers to the smaller sub-problems when you need them.
def subsetSum(A: list[int], K: int) -> bool:
    N = len(A)
    DP = [[None] * (K+1) for x in range(N+1)]
    DP[N] = [True if x == 0 else False for x in range(K+1)]
    for i in range(N-1, -1, -1):
        Ai = A[i]
        DP[i] = [DP[i+1][k] or (Ai <=k and DP[i+1][k-Ai]) for k in range(0, K+1)]
    # print result
    print(f"A = {A}, K = {K}")
    print('Ai,k:', *range(0,K+1), sep='\t')
    for (i, row) in enumerate(DP): print(A[i] if i < N else None, *row, sep='\t')
    print(f"DP[0][K] = {DP[0][K]}")
    return DP[0][K]
subsetSum([1,4,3,5,6], 9)
If you want to return an actual possible subset alongside the bool indicating whether or not it's possible to make one, then for every True flag in your DP you should also store the k index for the previous row that got you there (it will either be the current k index or k-A[i], depending on which table lookup returned True, which will indicate whether or not A[i] was used). Then you walk backwards from DP[0][K] after the table is filled to get a subset. This makes the code messier but it's definitely do-able. You can't get all subsets this way though (at least not without increasing your time complexity again) because the DP table compresses information.
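A minimal sketch of that walk-back follows. It is a slight variant of what is described above: instead of storing the k index next to each True flag, it re-checks the finished table at each row, which gives the same path. The function name `subset_sum_witness` is hypothetical, and the sketch assumes positive integers in A and K >= 0.

```python
def subset_sum_witness(A, K):
    """Return one list of elements of A summing to K, or None if impossible."""
    n = len(A)
    # DP[i][k] is True when some subset of the suffix A[i:] sums to k.
    DP = [[False] * (K + 1) for _ in range(n + 1)]
    DP[n][0] = True
    for i in range(n - 1, -1, -1):
        for k in range(K + 1):
            DP[i][k] = DP[i + 1][k] or (A[i] <= k and DP[i + 1][k - A[i]])
    if not DP[0][K]:
        return None
    chosen, k = [], K
    for i in range(n):
        # Skip A[i] while the rest can still reach k; otherwise A[i] must be used.
        if not DP[i + 1][k]:
            chosen.append(A[i])
            k -= A[i]
    return chosen


print(subset_sum_witness([1, 4, 3, 5, 6], 9))   # [3, 6]
```

The walk costs only O(n) on top of filling the table, but it yields a single subset, in line with the remark that the table cannot list all of them cheaply.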
Here is the optimized solution to the problem with a complexity of O(n^2).
def get_subsets(data: list, target: int):
    # initialize final result which is a list of all subsets summing up to target
    subsets = []
    # records the difference between the target value and a group of numbers
    differences = {}
    for number in data:
        prospects = []
        # iterate through every record in differences
        for diff in differences:
            # the number complements a record in differences, i.e. a desired subset is found
            if number - diff == 0:
                new_subset = [number] + differences[diff]
                if new_subset not in subsets:
                    subsets.append(new_subset)
            # the number fell short to reach the target; add to prospect instead
            elif number - diff < 0:
                prospects.append((number, diff))
        # update the differences record
        for prospect in prospects:
            new_diff = target - sum(differences[prospect[1]]) - prospect[0]
            differences[new_diff] = differences[prospect[1]] + [prospect[0]]
        differences[target - number] = [number]
    return subsets

Is there a python function that returns the first positive int that does not occur in list?

I'm trying to design a function that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A.
This code works fine, yet it has a high order of complexity. Is there another solution that reduces the order of complexity?
Note: the number 10000000 is the range of integers in array A. I tried the sort function, but does it reduce the complexity?
def solution(A):
    for i in range(10000000):
        if(A.count(i)) <= 0:
            return i
The following is O(n log n):
a = [2, 1, 10, 3, 2, 15]
a.sort()
if a[0] > 1:
    print(1)
else:
    for i in range(1, len(a)):
        if a[i] > a[i - 1] + 1:
            print(a[i - 1] + 1)
            break
If you don't like the special handling of 1, you could just append zero to the array and have the same logic handle both cases:
a = sorted(a + [0])
for i in range(1, len(a)):
    if a[i] > a[i - 1] + 1:
        print(a[i - 1] + 1)
        break
Caveats (both trivial to fix and both left as an exercise for the reader):
Neither version handles empty input.
The code assumes there are no negative numbers in the input.
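For readers who would rather not do the exercise, one possible way to cover both caveats with the same sort-based idea is sketched below. The function name `first_missing_positive_sorted` is my own, and this is not the answerer's code.

```python
def first_missing_positive_sorted(a):
    """Smallest positive integer not in a; tolerates empty input and values <= 0."""
    expected = 1
    for x in sorted(a):
        if x == expected:
            expected += 1
        elif x > expected:
            break              # gap found: expected is missing
        # otherwise x < expected: a duplicate or a non-positive value, ignore it
    return expected


print(first_missing_positive_sorted([2, 1, 10, 3, 2, 15]))   # 4
```

It is still O(n log n) because of the sort, and it also returns a value when the input has no gap at all.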
O(n) time and O(n) space:
def solution(A):
    count = [0] * len(A)
    for x in A:
        if 0 < x <= len(A):
            count[x-1] = 1  # count[0] is to count 1
    for i in range(len(count)):
        if count[i] == 0:
            return i+1
    return len(A)+1  # only if A = [1, 2, ..., len(A)]
This should be O(n). Utilizes a temporary set to speed things along.
a = [2, 1, 10, 3, 2, 15]
#use a set of only the positive numbers for lookup
temp_set = set()
for i in a:
    if i > 0:
        temp_set.add(i)
#iterate from 1 upto length of set +1 (to ensure edge case is handled)
for i in range(1, len(temp_set) + 2):
    if i not in temp_set:
        print(i)
        break
My proposal is a recursive function inspired by quicksort.
Each step divides the input sequence into two sublists (lt = less than pivot; ge = greater than or equal to pivot) and decides which of the sublists is processed in the next step. Note that there is no sorting.
The idea is that a set of integers n with lo <= n < hi contains "gaps" only if it has fewer than (hi - lo) elements.
The input sequence must not contain duplicates. A set can be passed directly.
# all cseq items > 0 assumed, no duplicates!
def find(cseq, cmin=1):
    # cmin = possible minimum not ruled out yet
    size = len(cseq)
    if size <= 1:
        return cmin+1 if cmin in cseq else cmin
    lt = []
    ge = []
    pivot = cmin + size // 2
    for n in cseq:
        (lt if n < pivot else ge).append(n)
    return find(lt, cmin) if cmin + len(lt) < pivot else find(ge, pivot)

test = set(range(1,100))
print(find(test)) # 100
test.remove(42)
print(find(test)) # 42
test.remove(1)
print(find(test)) # 1
Inspired by various solutions and comments above, about 20%-50% faster in my (simplistic) tests than the fastest of them (though I'm sure it could be made faster), and handling all the corner cases mentioned (non-positive numbers, duplicates, and empty list):
import numpy

def firstNotPresent(l):
    positive = numpy.fromiter(set(l), dtype=int)  # deduplicate
    positive = positive[positive > 0]  # only keep positive numbers
    positive.sort()
    top = positive.size + 1
    if top == 1:  # empty list
        return 1
    sequence = numpy.arange(1, top)
    try:
        # index of the first mismatch is 0-based, the sequence starts at 1
        return numpy.where(sequence < positive)[0][0] + 1
    except IndexError:  # no numbers are missing, top is next
        return top
The idea is: if you enumerate the positive, deduplicated, sorted list starting from one, the first time the index is less than the list value, the index value is missing from the list, and hence is the lowest positive number missing from the list.
This and the other solutions I tested against (those from adrtam, Paritosh Singh, and VPfB) all appear to be roughly O(n), as expected. (It is, I think, fairly obvious that this is a lower bound, since every element in the list must be examined to find the answer.) Edit: looking at this again, of course the big-O for this approach is at least O(n log(n)), because of the sort. It's just that the sort is so fast comparatively speaking that it looked linear overall.
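The enumeration idea from this answer can also be written without numpy. The following is only a sketch in plain Python; the function name is made up for the example and it makes no claim about matching the numpy version's speed.

```python
def first_missing_by_enumeration(values):
    """Enumerate the sorted, deduplicated positives starting from 1."""
    positives = sorted({x for x in values if x > 0})
    for expected, actual in enumerate(positives, start=1):
        # The first time the running index falls behind the value,
        # that index is the smallest positive number not in the input.
        if expected < actual:
            return expected
    # Nothing missing in 1..len(positives): the next integer is the answer.
    return len(positives) + 1


print(first_missing_by_enumeration([2, 1, 10, 3, 2, 15]))  # 4
```

Like the numpy version, this is O(n log n) because of the sort.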

Generate all permutations of a list without adjacent equal elements

When we sort a list, like
a = [1,2,3,3,2,2,1]
sorted(a) => [1, 1, 2, 2, 2, 3, 3]
equal elements are always adjacent in the resulting list.
How can I achieve the opposite task - shuffle the list so that equal elements are never (or as seldom as possible) adjacent?
For example, for the above list one of the possible solutions is
p = [1,3,2,3,2,1,2]
More formally, given a list a, generate a permutation p of it that minimizes the number of pairs p[i]==p[i+1].
Since the lists are large, generating and filtering all permutations is not an option.
Bonus question: how to generate all such permutations efficiently?
This is the code I'm using to test the solutions: https://gist.github.com/gebrkn/9f550094b3d24a35aebd
UPD: Choosing a winner here was a tough choice, because many people posted excellent answers. @VincentvanderWeele, @David Eisenstat, @Coady, @enrico.bacis and @srgerg provided functions that generate the best possible permutation flawlessly. @tobias_k and David also answered the bonus question (generate all permutations). Additional points to David for the correctness proof.
The code from @VincentvanderWeele appears to be the fastest.
This is along the lines of Thijser's currently incomplete pseudocode. The idea is to take the most frequent of the remaining item types unless it was just taken. (See also Coady's implementation of this algorithm.)
import collections
import heapq

class Sentinel:
    pass

def david_eisenstat(lst):
    counts = collections.Counter(lst)
    heap = [(-count, key) for key, count in counts.items()]
    heapq.heapify(heap)
    output = []
    last = Sentinel()
    while heap:
        minuscount1, key1 = heapq.heappop(heap)
        if key1 != last or not heap:
            last = key1
            minuscount1 += 1
        else:
            # most frequent item was just taken: use the runner-up instead
            minuscount2, key2 = heapq.heappop(heap)
            last = key2
            minuscount2 += 1
            if minuscount2 != 0:
                heapq.heappush(heap, (minuscount2, key2))
        output.append(last)
        if minuscount1 != 0:
            heapq.heappush(heap, (minuscount1, key1))
    return output
Proof of correctness
For two item types, with counts k1 and k2, the optimal solution has k2 - k1 - 1 defects if k1 < k2, 0 defects if k1 = k2, and k1 - k2 - 1 defects if k1 > k2. The = case is obvious. The others are symmetric; each instance of the minority element prevents at most two defects out of a total of k1 + k2 - 1 possible.
This greedy algorithm returns optimal solutions, by the following logic. We call a prefix (partial solution) safe if it extends to an optimal solution. Clearly the empty prefix is safe, and if a safe prefix is a whole solution then that solution is optimal. It suffices to show inductively that each greedy step maintains safety.
The only way that a greedy step introduces a defect is if only one item type remains, in which case there is only one way to continue, and that way is safe. Otherwise, let P be the (safe) prefix just before the step under consideration, let P' be the prefix just after, and let S be an optimal solution extending P. If S extends P' also, then we're done. Otherwise, let P' = Px and S = PQ and Q = yQ', where x and y are items and Q and Q' are sequences.
Suppose first that P does not end with y. By the algorithm's choice, x is at least as frequent in Q as y. Consider the maximal substrings of Q containing only x and y. If the first substring has at least as many x's as y's, then it can be rewritten without introducing additional defects to begin with x. If the first substring has more y's than x's, then some other substring has more x's than y's, and we can rewrite these substrings without additional defects so that x goes first. In both cases, we find an optimal solution T that extends P', as needed.
Suppose now that P does end with y. Modify Q by moving the first occurrence of x to the front. In doing so, we introduce at most one defect (where x used to be) and eliminate one defect (the yy).
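The two-item-type claim at the start of the proof collapses to a single expression, max(0, |k1 - k2| - 1). As a sanity check only, here is a brute-force sketch that compares that expression against exhaustive search for small counts; the helper names are made up for this example.

```python
from itertools import permutations


def defects(seq):
    """Number of adjacent equal pairs in seq."""
    return sum(seq[i - 1] == seq[i] for i in range(1, len(seq)))


def min_defects_brute_force(k1, k2):
    """Try every distinct arrangement of k1 'a's and k2 'b's."""
    items = "a" * k1 + "b" * k2
    return min(defects(p) for p in set(permutations(items)))


def min_defects_formula(k1, k2):
    """Closed form for two item types, as stated in the proof."""
    return max(0, abs(k1 - k2) - 1)


for k1 in range(4):
    for k2 in range(4):
        assert min_defects_brute_force(k1, k2) == min_defects_formula(k1, k2)
```

This only checks small cases and is not a substitute for the inductive argument.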
Generating all solutions
This is tobias_k's answer plus efficient tests to detect when the choice currently under consideration is globally constrained in some way. The asymptotic running time is optimal, since the overhead of generation is on the order of the length of the output. The worst-case delay unfortunately is quadratic; it could be reduced to linear (optimal) with better data structures.
from collections import Counter
from itertools import permutations
from operator import itemgetter
from random import randrange

def get_mode(count):
    return max(count.items(), key=itemgetter(1))[0]

def enum2(prefix, x, count, total, mode):
    prefix.append(x)
    count_x = count[x]
    if count_x == 1:
        del count[x]
    else:
        count[x] = count_x - 1
    yield from enum1(prefix, count, total - 1, mode)
    count[x] = count_x
    del prefix[-1]

def enum1(prefix, count, total, mode):
    if total == 0:
        yield tuple(prefix)
        return
    if count[mode] * 2 - 1 >= total and [mode] != prefix[-1:]:
        yield from enum2(prefix, mode, count, total, mode)
    else:
        defect_okay = not prefix or count[prefix[-1]] * 2 > total
        mode = get_mode(count)
        for x in list(count.keys()):
            if defect_okay or [x] != prefix[-1:]:
                yield from enum2(prefix, x, count, total, mode)

def enum(seq):
    count = Counter(seq)
    if count:
        yield from enum1([], count, sum(count.values()), get_mode(count))
    else:
        yield ()

def defects(lst):
    return sum(lst[i - 1] == lst[i] for i in range(1, len(lst)))

def test(lst):
    perms = set(permutations(lst))
    opt = min(map(defects, perms))
    slow = {perm for perm in perms if defects(perm) == opt}
    fast = set(enum(lst))
    print(lst, fast, slow)
    assert slow == fast

for r in range(10000):
    test([randrange(3) for i in range(randrange(6))])
Sort the list
Loop over the first half of the sorted list and fill all even indices of the result list
Loop over the second half of the sorted list and fill all odd indices of the result list
You will only have p[i]==p[i+1] if more than half of the input consists of the same element, in which case there is no other choice than putting the same element in consecutive spots (by the pigeonhole principle).
As pointed out in the comments, this approach may have one conflict too many if one of the elements occurs at least n/2 times (or n/2+1 for odd n; this generalizes to (n+1)/2 for both even and odd n). There are at most two such elements, and if there are two, the algorithm works just fine. The only problematic case is when there is one element that occurs at least half of the time. We can solve this by finding that element and dealing with it first.
I don't know enough about python to write this properly, so I took the liberty to copy the OP's implementation of a previous version from github:
def solve(lst):  # function name assumed; the snippet as posted has no def line
    # Sort the list
    a = sorted(lst)
    # Put the element occurring more than half of the times in front (if needed)
    n = len(a)
    m = (n + 1) // 2
    for i in range(n - m + 1):
        if a[i] == a[i + m - 1]:
            a = a[i:] + a[:i]
            break
    result = [None] * n
    # Loop over the first half of the sorted list and fill all even indices of the result list
    for i, elt in enumerate(a[:m]):
        result[2*i] = elt
    # Loop over the second half of the sorted list and fill all odd indices of the result list
    for i, elt in enumerate(a[m:]):
        result[2*i+1] = elt
    return result
The algorithm already given of taking the most common item left that isn't the previous item is correct. Here's a simple implementation, which optimally uses a heap to track the most common.
import collections, heapq
def nonadjacent(keys):
    heap = [(-count, key) for key, count in collections.Counter(keys).items()]
    heapq.heapify(heap)
    count, key = 0, None
    while heap:
        # push the previous item back (if any copies remain) while popping the most common
        count, key = heapq.heapreplace(heap, (count, key)) if count else heapq.heappop(heap)
        yield key
        count += 1
    # only copies of the last item are left
    for index in xrange(-count):
        yield key
>>> a = [1,2,3,3,2,2,1]
>>> list(nonadjacent(a))
[2, 1, 2, 3, 1, 2, 3]
You can generate all the 'perfectly unsorted' permutations (that have no two equal elements in adjacent positions) using a recursive backtracking algorithm. In fact, the only difference to generating all the permutations is that you keep track of the last number and exclude some solutions accordingly:
def unsort(lst, last=None):
    if lst:
        for i, e in enumerate(lst):
            if e != last:
                for perm in unsort(lst[:i] + lst[i+1:], e):
                    yield [e] + perm
    else:
        yield []
Note that in this form the function is not very efficient, as it creates lots of sub-lists. Also, we can speed it up by looking at the most-constrained numbers first (those with the highest count). Here's a much more efficient version using only the counts of the numbers.
import collections

def unsort_generator(lst, sort=False):
    counts = collections.Counter(lst)
    def unsort_inner(remaining, last=None):
        if remaining > 0:
            # most-constrained first, or sorted for pretty-printing?
            items = sorted(counts.items()) if sort else counts.most_common()
            for n, c in items:
                if n != last and c > 0:
                    counts[n] -= 1   # update counts
                    for perm in unsort_inner(remaining - 1, n):
                        yield [n] + perm
                    counts[n] += 1   # revert counts
        else:
            yield []
    return unsort_inner(len(lst))
You can use this to generate just the next perfect permutation, or a list holding all of them. But note, that if there is no perfectly unsorted permutation, then this generator will consequently yield no results.
>>> lst = [1,2,3,3,2,2,1]
>>> next(unsort_generator(lst))
[2, 1, 2, 3, 1, 2, 3]
>>> list(unsort_generator(lst, sort=True))
[[1, 2, 1, 2, 3, 2, 3],
... 36 more ...
[3, 2, 3, 2, 1, 2, 1]]
>>> next(unsort_generator([1,1,1]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
To circumvent this problem, you could use this together with one of the algorithms proposed in the other answers as a fallback. This will guarantee to return a perfectly unsorted permutation, if there is one, or a good approximation otherwise.
def unsort_safe(lst):
    try:
        return next(unsort_generator(lst))
    except StopIteration:
        return unsort_fallback(lst)
In python you could do the following.
Assuming you have a sorted list l, you can do:
length = len(l)
odd_ind = length % 2
odd_half = (length - odd_ind) // 2  # integer division, works in Python 2 and 3
for i in range(0, odd_half, 2):
    l[i], l[odd_half+odd_ind+i] = l[odd_half+odd_ind+i], l[i]
These are just in place operations and should thus be rather fast (O(N)).
Note that you will shift from l[i] == l[i+1] to l[i] == l[i+2] so the order you end up with is anything but random, but from how I understand the question it is not randomness you are looking for.
The idea is to split the sorted list in the middle then exchange every other element in the two parts.
For l = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5] this leads to l = [3, 1, 4, 2, 5, 3, 1, 4, 1, 5, 2]
The method fails to get rid of all the l[i] == l[i + 1] as soon as the abundance of one element is bigger than or equal to half of the length of the list.
While the above works fine as long as the abundance of the most frequent element is smaller than half the size of the list, the following function also handles the limit cases (the famous off-by-one issue) where every other element starting with the first one must be the most abundant one:
def no_adjacent(my_list):
    # my_list is assumed to be sorted already
    length = len(my_list)
    odd_ind = length % 2
    odd_half = (length - odd_ind) // 2
    for i in range(odd_half)[::2]:
        my_list[i], my_list[odd_half+odd_ind+i] = my_list[odd_half+odd_ind+i], my_list[i]
    # this is just for the limit case where the abundance of the most frequent is half of the list length
    if max([my_list.count(val) for val in set(my_list)]) + 1 - odd_ind > odd_half:
        max_val = my_list[0]
        max_count = my_list.count(max_val)
        for val in set(my_list):
            if my_list.count(val) > max_count:
                max_val = val
                max_count = my_list.count(max_val)
        # pull out the most abundant value and weave it between the others
        while max_val in my_list:
            my_list.remove(max_val)
        out = [max_val]
        max_count -= 1
        for val in my_list:
            out.append(val)
            if max_count:
                out.append(max_val)
                max_count -= 1
        if max_count:
            print('this is not working')
            return my_list
            #raise Exception('not possible')
        return out
    else:
        return my_list
Here is a good algorithm:
First, count how often each number occurs and store the counts in a map.
Sort this map so that the numbers that occur most often come first.
The first number of your answer is the first number in the sorted map.
Re-sort the map with the first now being one smaller.
If you want to improve efficiency look for ways to increase the efficiency of the sorting step.
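The steps above might be sketched in Python roughly as follows. This is an illustration rather than the answerer's own code: the function name is made up, and the rule "skip the item that was just used when another one is available" is borrowed from the other answers, since the steps as written do not spell it out.

```python
from collections import Counter


def greedy_spread(items):
    """Repeatedly take the most frequent remaining item that was not just used."""
    counts = Counter(items)
    result = []
    last = object()  # sentinel that compares unequal to every real item
    while counts:
        # Re-sort the remaining item types, most frequent first.
        ranked = sorted(counts, key=counts.get, reverse=True)
        # Prefer the best-ranked type that differs from the previous pick;
        # fall back to repeating it when nothing else is left.
        choice = next((x for x in ranked if x != last), ranked[0])
        result.append(choice)
        counts[choice] -= 1
        if counts[choice] == 0:
            del counts[choice]
        last = choice
    return result


print(greedy_spread([1, 2, 3, 3, 2, 2, 1]))
```

Sorting on every step costs O(k log k) for k distinct values, which is the step a heap would speed up.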
In answer to the bonus question: this is an algorithm which finds all permutations of a set where no adjacent elements can be identical. I believe this to be the most efficient algorithm conceptually (although others may be faster in practice because they translate into simpler code). It doesn't use brute force, it only generates unique permutations, and paths not leading to solutions are cut off at the earliest point.
I will use the term "abundant element" for an element in a set which occurs more often than all other elements combined, and the term "abundance" for the number of abundant elements minus the number of other elements.
e.g. the set abac has no abundant element, the sets abaca and aabcaa have a as the abundant element, and abundance 1 and 2 respectively.
Start with a set like:
Separate the first occurrences from the repeats:
firsts: abcd
repeats: aab
Find the abundant element in the repeats, if any, and calculate the abundance:
abundant element: a
abundance: 1
Generate all permutations of the firsts where the number of elements after the abundant element is not less than the abundance (so in the example the "a" cannot be last, which rules out bcda, bdca, cbda, cdba, dbca and dcba):
abcd, abdc, acbd, acdb, adbc, adcb, bacd, badc, bcad, bcda, bdac, bdca,
cabd, cadb, cbad, cbda, cdab, cdba, dabc, dacb, dbac, dbca, dcab, dcba
For each permutation, insert the set of repeated characters one by one, following these rules:
5.1. If the abundance of the set is greater than the number of elements after the last occurrence of the abundant element in the permutation so far, skip to the next permutation.
e.g. when permutation so far is abc, a set with abundant element a can only be inserted if the abundance is 2 or less, so aaaabc is ok, aaaaabc isn't.
5.2. Select the element from the set whose last occurrence in the permutation comes first.
e.g. when permutation so far is abcba and set is ab, select b
5.3. Insert the selected element at least 2 positions to the right of its last occurrence in the permutation.
e.g. when inserting b into permutation babca, results are babcba and babcab
5.4. Recurse step 5 with each resulting permutation and the rest of the set.
set = abcaba
firsts = abc
repeats = aab
perm3  set  select  perm4  set  select  perm5  set  select  perm6
abc    aab  a       abac   ab   b       ababc  a    a       ababac
                                        abacb  a    a       abacab
                    abca   ab   b       abcba  a    -
                                        abcab  a    a       abcaba
acb    aab  a       acab   ab   a       acaba  b    b       acabab
                    acba   ab   b       acbab  a    a       acbaba
bac    aab  b       babc   aa   a       babac  a    a       babaca
                                        babca  a    -
                    bacb   aa   a       bacab  a    a       bacaba
                                        bacba  a    -
bca    aab  -
cab    aab  a       caba   ab   b       cabab  a    a       cababa
cba    aab  -
This algorithm generates unique permutations. If you want to know the total number of permutations (where aba is counted twice because you can switch the a's), multiply the number of unique permutations with a factor:
F = N1! * N2! * ... * Nn!
where N is the number of occurrences of each element in the set. For a set abcdabcaba this would be 4! * 3! * 2! * 1! or 288, which demonstrates how inefficient an algorithm is that generates all permutations instead of only the unique ones. To list all permutations in this case, just list the unique permutations 288 times :-)
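As a quick check of the factor F, here is a small Python sketch (the function name is made up for the example) that multiplies the factorials of the occurrence counts:

```python
from collections import Counter
from math import factorial


def repeat_factor(elements):
    """F = N1! * N2! * ... * Nn!, with Ni the count of each distinct element."""
    factor = 1
    for occurrences in Counter(elements).values():
        factor *= factorial(occurrences)
    return factor


print(repeat_factor("abcdabcaba"))  # 288
```

Multiplying the number of unique permutations by this value gives the total when equal elements are treated as distinguishable.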
Below is a (rather clumsy) implementation in Javascript; I suspect that a language like Python may be better suited for this sort of thing. Run the code snippet to calculate the separated permutations of "abracadabra".
function seperatedPermutations(set) {
    var unique = 0, factor = 1, firsts = [], repeats = [], abund;
    seperateRepeats(set);
    abund = abundance(repeats);
    permutateFirsts([], firsts);
    alert("Permutations of [" + set + "]\ntotal: " + (unique * factor) + ", unique: " + unique);

    // split the input into first occurrences and (grouped) repeats
    function seperateRepeats(set) {
        for (var i = 0; i < set.length; i++) {
            var first, elem = set[i];
            if (firsts.indexOf(elem) == -1) firsts.push(elem)
            else if ((first = repeats.indexOf(elem)) == -1) {
                repeats.push(elem);
                factor *= 2;
            } else {
                repeats.splice(first, 0, elem);
                factor *= repeats.lastIndexOf(elem) - first + 2;
            }
        }
    }

    function permutateFirsts(perm, set) {
        if (set.length > 0) {
            for (var i = 0; i < set.length; i++) {
                var s = set.slice();
                var e = s.splice(i, 1);
                if (e[0] == abund.elem && s.length < abund.num) continue;
                permutateFirsts(perm.concat(e), s, abund);
            }
        }
        else if (repeats.length > 0) {
            insertRepeats(perm, repeats);
        }
        else {
            document.write(perm + "<BR>");
            ++unique;
        }
    }

    function insertRepeats(perm, set) {
        var abund = abundance(set);
        if (perm.length - perm.lastIndexOf(abund.elem) > abund.num) {
            var sel = selectElement(perm, set);
            var s = set.slice();
            var elem = s.splice(sel, 1)[0];
            for (var i = perm.lastIndexOf(elem) + 2; i <= perm.length; i++) {
                var p = perm.slice();
                p.splice(i, 0, elem);
                if (set.length == 1) {
                    document.write(p + "<BR>");
                    ++unique;
                } else {
                    insertRepeats(p, s);
                }
            }
        }
    }

    // pick the repeat whose last occurrence in perm comes first
    function selectElement(perm, set) {
        var sel, pos, min = perm.length;
        for (var i = 0; i < set.length; i++) {
            pos = perm.lastIndexOf(set[i]);
            if (pos < min) {
                min = pos;
                sel = i;
            }
        }
        return sel;
    }

    function abundance(set) {
        if (set.length == 0) return ({elem: null, num: 0});
        var elem = set[0], max = 1, num = 1;
        for (var i = 1; i < set.length; i++) {
            if (set[i] != set[i - 1]) num = 1
            else if (++num > max) {
                max = num;
                elem = set[i];
            }
        }
        return ({elem: elem, num: 2 * max - set.length});
    }
}

seperatedPermutations(["a","b","r","a","c","a","d","a","b","r","a"]);
The idea is to sort the elements from the most common to the least common, take the most common, decrease its count and put it back in the list keeping the descending order (but avoiding putting the last used element first to prevent repetitions when possible).
This can be implemented using Counter and bisect:
from collections import Counter
from bisect import bisect

def unsorted(lst):
    # use elements (-count, item) so bisect will put biggest counts first
    items = [(-count, item) for item, count in Counter(lst).most_common()]
    result = []
    while items:
        count, item = items.pop(0)
        result.append(item)
        if count != -1:
            element = (count + 1, item)
            index = bisect(items, element)
            # prevent insertion in position 0 if there are other items
            items.insert(index or (1 if items else 0), element)
    return result
>>> print unsorted([1, 1, 1, 2, 3, 3, 2, 2, 1])
[1, 2, 1, 2, 1, 3, 1, 2, 3]
>>> print unsorted([1, 2, 3, 2, 3, 2, 2])
[2, 3, 2, 1, 2, 3, 2]
Sort the list.
Generate a "best shuffle" of the list using this algorithm
It will give the minimum of items from the list in their original place (by item value) so it will try, for your example, to put the 1's, 2's and 3's away from their sorted positions.
Start with the sorted list of length n.
Let m=n/2.
Take the values at 0, then m, then 1, then m+1, then 2, then m+2, and so on.
Unless you have more than half of the numbers the same, you'll never get equivalent values in consecutive order.
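The interleaving described above could be sketched like this. The function name is made up, and rounding m up to (n + 1) // 2 is an assumption of this sketch so that odd lengths pair up cleanly; the answer itself only says m=n/2.

```python
def interleave_halves(items):
    """Take sorted positions 0, m, 1, m+1, 2, m+2, ... with m = ceil(n / 2)."""
    ordered = sorted(items)
    m = (len(ordered) + 1) // 2  # assumption: round up for odd n
    result = []
    for i in range(m):
        result.append(ordered[i])
        if m + i < len(ordered):  # the second half is one shorter when n is odd
            result.append(ordered[m + i])
    return result


print(interleave_halves([1, 2, 3, 3, 2, 2, 1]))  # [1, 2, 1, 3, 2, 3, 2]
print(interleave_halves([1, 2, 2, 3]))           # [1, 2, 2, 3]
```

The second call shows the boundary case raised in an earlier answer: a value that fills exactly half of the list can still end up adjacent to itself, so it needs special handling.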
Please forgive my "me too" style answer, but couldn't Coady's answer be simplified to this?
from collections import Counter
from heapq import heapify, heappop, heapreplace
from itertools import repeat

def srgerg(data):
    heap = [(-freq+1, value) for value, freq in Counter(data).items()]
    heapify(heap)
    freq = 0
    while heap:
        freq, val = heapreplace(heap, (freq+1, val)) if freq else heappop(heap)
        yield val
    yield from repeat(val, -freq)
Edit: Here's a python 2 version that returns a list:
def srgergpy2(data):
    heap = [(-freq+1, value) for value, freq in Counter(data).items()]
    heapify(heap)
    freq = 0
    result = list()
    while heap:
        freq, val = heapreplace(heap, (freq+1, val)) if freq else heappop(heap)
        result.append(val)
    result.extend(repeat(val, -freq))
    return result
Count number of times each value appears
Select values in order from most frequent to least frequent
Add selected value to final output, incrementing the index by 2 each time
Reset index to 1 if index out of bounds
from collections import defaultdict
from heapq import heapify, heappop

def distribute(values):
    counts = defaultdict(int)
    for value in values:
        counts[value] += 1
    counts = [(-count, key) for key, count in counts.iteritems()]
    heapify(counts)
    index = 0
    length = len(values)
    distributed = [None] * length
    while counts:
        count, value = heappop(counts)
        for _ in xrange(-count):
            distributed[index] = value
            # fill even slots first, then wrap around to the odd ones
            index = index + 2 if index + 2 < length else 1
    return distributed

Longest arithmetic progression with a hole

The longest arithmetic progression subsequence problem is as follows. Given an array of integers A, devise an algorithm to find the longest arithmetic progression in it. In other words find a sequence i1 < i2 < … < ik, such that A[i1], A[i2], …, A[ik] form an arithmetic progression, and k is maximal. The following code solves the problem in O(n^2) time and space. (Modified from http://www.geeksforgeeks.org/length-of-the-longest-arithmatic-progression-in-a-sorted-array/ . )
#!/usr/bin/env python
import sys

def arithmetic(arr):
    n = len(arr)
    if (n<=2):
        return n
    llap = 2
    # L[i][j] = length of the longest progression starting with arr[i], arr[j]
    L = [[0]*n for i in xrange(n)]
    for i in xrange(n):
        L[i][n-1] = 2
    for j in xrange(n-2,0,-1):
        i = j-1
        k = j+1
        while (i >=0 and k <= n-1):
            if (arr[i] + arr[k] < 2*arr[j]):
                k = k + 1
            elif (arr[i] + arr[k] > 2*arr[j]):
                L[i][j] = 2
                i -= 1
            else:
                L[i][j] = L[j][k] + 1
                llap = max(llap, L[i][j])
                i = i - 1
                k = j + 1
        while (i >=0):
            L[i][j] = 2
            i -= 1
    return llap

arr = [1,4,5,7,8,10]
print arithmetic(arr)
This outputs 4.
However I would like to be able to find arithmetic progressions where up to one value is missing. So if arr = [1,4,5,8,10,13] I would like it to report that there is a progression of length 5 with one value missing.
Can this be done efficiently?
Adapted from my answer to Longest equally-spaced subsequence. n is the length of A, and d is the range, i.e. the largest item minus the smallest item.
A = [1, 4, 5, 8, 10, 13]    # in sorted order
Aset = set(A)
for d in range(1, 13):
    already_seen = set()
    for a in A:
        if a not in already_seen:
            b = a
            count = 1
            while b + d in Aset:
                b += d
                count += 1
                already_seen.add(b)
            # if there is a hole to jump over:
            if b + 2 * d in Aset:
                b += 2 * d
                count += 1
                while b + d in Aset:
                    b += d
                    count += 1
                    # don't record in already_seen here
            print "found %d items in %d .. %d" % (count, a, b)
            # collect here the largest 'count'
I believe that this solution is still O(n*d), simply with larger constants than looking without a hole, despite the two "while" loops inside the two nested "for" loops. Indeed, fix a value of d: then we are in the "a" loop that runs n times; but each of the inner two while loops run at most n times in total over all values of a, giving a complexity O(n+n+n) = O(n) again.
Like the original, this solution is adaptable to the case where you're not interested in the absolute best answer but only in subsequences with a relatively small step d: e.g. n might be 1'000'000, but you're only interested in subsequences of step at most 1'000. Then you can make the outer loop stop at 1'000.
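To make the "stop the outer loop early" variant concrete, here is a sketch in Python 3 that wraps the same scan in a function and keeps track of the best result. The function name, the max_step parameter and the returned tuple layout are all choices made for this illustration, not part of the answer above.

```python
def longest_with_one_hole(values, max_step=None):
    """Return (count, step, first, last) for the best progression found.

    count is the number of items actually present; at most one value in
    the progression may be missing. Only steps 1..max_step are tried.
    """
    items = sorted(set(values))
    if not items:
        return (0, 0, None, None)
    present = set(items)
    if max_step is None:
        max_step = items[-1] - items[0]  # the full range, as in the answer
    best = (1, 0, items[0], items[0])
    for step in range(1, max_step + 1):
        seen = set()
        for start in items:
            if start in seen:
                continue
            last, count = start, 1
            while last + step in present:
                last += step
                count += 1
                seen.add(last)
            if last + 2 * step in present:  # jump over a single hole
                last += 2 * step
                count += 1
                while last + step in present:
                    last += step
                    count += 1  # deliberately not recorded in seen
            if count > best[0]:
                best = (count, step, start, last)
    return best


print(longest_with_one_hole([1, 4, 5, 8, 10, 13]))  # (4, 3, 1, 13)
```

For the example from the question this reports four present items from 1 to 13 with step 3, i.e. the progression 1, 4, 7, 10, 13 with 7 missing. Passing a small max_step bounds the running time at roughly O(n * max_step).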
