Algorithm for itertools.combinations in Python - python

I was solving a programming puzzle involving combinations. It led me to a wonderful itertools.combinations function and I'd like to know how it works under the hood. Documentation says that the algorithm is roughly equivalent to the following:
def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)
I got the idea: we start with the most obvious combination (r first consecutive elements). Then we change one (last) item to get each subsequent combination.
The thing I'm struggling with is the conditional inside the for loop.
for i in reversed(range(r)):
    if indices[i] != i + n - r:
        break
This expression is very terse, and I suspect it's where all the magic happens. Please give me a hint so I can figure it out.

The loop has two purposes:
Terminating if the last index-list has been reached
Determining the right-most position in the index-list that can be legally increased. This position is then the starting point for resetting all indices to the right.
Let us say you have an iterable over 5 elements, and want combinations of length 3. What you essentially need for this is to generate lists of indexes. The juicy part of the above algorithm generates the next such index-list from the current one:
# obvious
index-pool: [0,1,2,3,4]
first index-list: [0,1,2]
[0,1,3]
...
[1,3,4]
last index-list: [2,3,4]
i + n - r is the max value for index i in the index-list:
index 0: i + n - r = 0 + 5 - 3 = 2
index 1: i + n - r = 1 + 5 - 3 = 3
index 2: i + n - r = 2 + 5 - 3 = 4
# compare last index-list above
=>
for i in reversed(range(r)):
    if indices[i] != i + n - r:
        break
else:
    return
This loops backwards through the current index-list and stops at the first position that doesn't hold its maximum index-value. If all positions hold their maximum index-value, there is no further index-list, thus return.
In the general case of [0,1,4] one can verify that the next list should be [0,2,3]. The loop stops at position 1, the subsequent code
indices[i] += 1
increments the value for indices[i] (1 -> 2). Finally
for j in range(i+1, r):
    indices[j] = indices[j-1] + 1
resets all positions > i to the smallest legal index-values, each 1 larger than its predecessor.
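The three steps described above (scan from the right, increment, reset to the right) can be sketched as a single helper that maps one index-list to its successor. This is only an illustration under the assumption that the index-list is valid and sorted; the name next_indices is hypothetical and not part of itertools.

```python
def next_indices(indices, n):
    """Return the index-list that follows `indices`, or None if it is the last one.

    `next_indices` is a hypothetical helper name used only for illustration.
    """
    r = len(indices)
    indices = list(indices)  # work on a copy
    # Scan right-to-left for a position that is below its maximum i + n - r.
    for i in reversed(range(r)):
        if indices[i] != i + n - r:
            break
    else:
        return None  # every position is at its maximum: no successor
    indices[i] += 1
    # Reset everything to the right to the smallest legal values.
    for j in range(i + 1, r):
        indices[j] = indices[j - 1] + 1
    return indices


print(next_indices([0, 1, 4], 5))  # [0, 2, 3]
print(next_indices([2, 3, 4], 5))  # None
```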

This for loop does a simple thing: it checks whether the algorithm should terminate.
The algorithm starts with the first r items and advances until it reaches the last r items in the iterable, which are [S(n-r+1), ..., S(n-1), S(n)] (if we let S be the iterable).
Now, the algorithm scans every item in indices and makes sure it still has somewhere to go: it verifies that the i-th index is not n - r + i, which by the previous paragraph is the last position that index can take (we ignore the 1 here because lists are 0-based).
If all of these indices are equal to the last r positions, it goes into the else, committing the return and terminating the algorithm.
We could create the same functionality by using
if indices == list(range(n-r, n)): return
but the main reason for this "mess" (using reverse and break) is that the first index from the end that doesn't match is saved inside i and is used for the next level of the algorithm which increments this index and takes care of re-setting the rest.
You could check this by replacing the yields with
print('Combination: {} Indices: {}'.format(tuple(pool[i] for i in indices), indices))

Source code has some additional information about what is going on.
The yield statement before the while loop returns a trivial combination of elements (which is simply the first r elements of A, (A[0], ..., A[r-1])) and prepares indices for future work.
Let's say that we have A='ABCDE' and r=3. Then, after the first step the value of indices is [0, 1, 2], which points to ('A', 'B', 'C').
Let's look at the source code of the loop in question:
/* Scan indices right-to-left until finding one that is not
   at its maximum (i + n - r). */
for (i=r-1 ; i >= 0 && indices[i] == i+n-r ; i--)
    ;
This loop searches for the rightmost element of indices that hasn't reached its maximum value yet. After the very first yield statement the value of indices is [0, 1, 2]. Therefore, for loop terminates at indices[2].
Next, the following code increments the ith element of indices:
/* Increment the current index which we know is not at its
   maximum. Then move back to the right setting each index
   to its lowest possible value (one higher than the index
   to its left -- this maintains the sort order invariant). */
indices[i]++;
As a result, we get index combination [0, 1, 3], which points to ('A', 'B', 'D').
Then we roll back the subsequent indices if they are too big:
for (j=i+1 ; j<r ; j++)
    indices[j] = indices[j-1] + 1;
Indices increase step by step:
step indices
(0, 1, 2)
(0, 1, 3)
(0, 1, 4)
(0, 2, 3)
(0, 2, 4)
(0, 3, 4)
(1, 2, 3)
...
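As a quick check of the sequence above, itertools.combinations applied to range(5) yields the index tuples themselves, so the first few steps can be printed directly. A small sketch (the variable name first_steps is just for illustration):

```python
from itertools import combinations, islice

# combinations(range(n), r) yields the index tuples themselves,
# in the same lexicographic order as the steps listed above.
first_steps = list(islice(combinations(range(5), 3), 7))
for step, indices in enumerate(first_steps, start=1):
    print(step, indices)
```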

Related

Bubble sorting numbers prob

I am trying an algorithm for a bubble sort and there is a part I don't understand
nums = [1,4,3,2,10,6,8,5]
for i in range (len(nums)-1,0,-1):
    for j in range(i):
        if nums[j] > nums[j+1]:
            temp = nums[j]
            nums[j] = nums[j+1]
            nums[j+1] = temp
print(nums)
What do the numbers (-1,0,-1) mean in this part of the code (it doesn't sort properly without them)?
for i in range (len(nums)-1,0,-1):
The syntax for range in Python is:
range(start, end, step)
In your case, the loop starts from the last index (n-1) and moves toward the first index one step at a time. The end value 0 itself is excluded, so i stops at 1.
OK:
The first one is the starting point, the second tells Python where to stop, and the last one is the step.
len(nums) gives you (drum roll...) the length of this list, in our case 8.
len(nums)-1 is 8-1. We do this because when going through a list Python starts at index 0 and ends at index 7 (still 8 elements, but the last one has index 7, not 8).
We stop before 0 (the end value is excluded), with step -1. So i takes these values:
len(nums)-1 = 7
len(nums)-1-1 = 6
len(nums)-1-1-1 = 5
.....
len(nums)-1-1-1-1-1-1-1 = 1
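A quick way to see which values i takes is to turn the range into a list. The sketch below also wraps the loop from the question in a function; bubble_sort is a name chosen here for illustration, and the swap uses tuple assignment instead of a temp variable.

```python
# The end value of range() is excluded, so i runs from 7 down to 1.
nums = [1, 4, 3, 2, 10, 6, 8, 5]
i_values = list(range(len(nums) - 1, 0, -1))
print(i_values)  # [7, 6, 5, 4, 3, 2, 1]


def bubble_sort(values):
    """Return a sorted copy; after each outer pass, position i holds its final value."""
    values = list(values)
    for i in range(len(values) - 1, 0, -1):
        for j in range(i):  # compare pairs (j, j+1) up to index i
            if values[j] > values[j + 1]:
                values[j], values[j + 1] = values[j + 1], values[j]
    return values


print(bubble_sort(nums))  # [1, 2, 3, 4, 5, 6, 8, 10]
```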

Is there a python function that returns the first positive int that does not occur in list?

I'm trying to design a function that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A.
This code works fine yet has a high order of complexity. Is there another solution that reduces the order of complexity?
Note: The 10000000 number is the range of integers in array A. I tried the sort function, but does it reduce the complexity?
def solution(A):
    for i in range(1, 10000000):  # start at 1: the answer must be positive
        if(A.count(i)) <= 0:
            return(i)
The following is O(n logn):
a = [2, 1, 10, 3, 2, 15]
a.sort()
if a[0] > 1:
    print(1)
else:
    for i in range(1, len(a)):
        if a[i] > a[i - 1] + 1:
            print(a[i - 1] + 1)
            break
    else:
        print(a[-1] + 1)  # no gap found: the answer follows the largest value
If you don't like the special handling of 1, you could just append zero to the array and have the same logic handle both cases:
a = sorted(a + [0])
for i in range(1, len(a)):
    if a[i] > a[i - 1] + 1:
        print(a[i - 1] + 1)
        break
else:
    print(a[-1] + 1)  # no gap found: the answer follows the largest value
Caveats (both trivial to fix and both left as an exercise for the reader):
Neither version handles empty input.
The code assumes there are no negative numbers in the input.
O(n) time and O(n) space:
def solution(A):
    count = [0] * len(A)
    for x in A:
        if 0 < x <= len(A):
            count[x-1] = 1  # count[0] is to count 1
    for i in range(len(count)):
        if count[i] == 0:
            return i+1
    return len(A)+1  # only if A = [1, 2, ..., len(A)]
This should be O(n). Utilizes a temporary set to speed things along.
a = [2, 1, 10, 3, 2, 15]
# use a set of only the positive numbers for lookup
temp_set = set()
for i in a:
    if i > 0:
        temp_set.add(i)
# iterate from 1 up to length of set + 1 (to ensure edge case is handled)
for i in range(1, len(temp_set) + 2):
    if i not in temp_set:
        print(i)
        break
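The set-based idea above can also be packaged as a function that returns the result instead of printing it. A minimal sketch, with first_missing_positive as a hypothetical name; it relies on the fact that a set of k positive integers must leave a gap somewhere in 1..k+1.

```python
def first_missing_positive(values):
    """Smallest integer >= 1 that does not occur in `values` (hypothetical name)."""
    seen = {v for v in values if v > 0}
    # At most len(seen) positives are present, so the answer is at most len(seen) + 1.
    for candidate in range(1, len(seen) + 2):
        if candidate not in seen:
            return candidate


print(first_missing_positive([2, 1, 10, 3, 2, 15]))  # 4
```

This handles empty input, duplicates and non-positive numbers without special cases.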
My proposal is a recursive function inspired by quicksort.
Each step divides the input sequence into two sublists (lt = less than pivot; ge = greater than or equal to pivot) and decides which of the sublists is to be processed in the next step. Note that there is no sorting.
The idea is that a set of integers such that lo <= n < hi contains "gaps" only if it has fewer than (hi - lo) elements.
The input sequence must not contain duplicates. A set can be passed directly.
# all cseq items > 0 assumed, no duplicates!
def find(cseq, cmin=1):
    # cmin = possible minimum not ruled out yet
    size = len(cseq)
    if size <= 1:
        return cmin+1 if cmin in cseq else cmin
    lt = []
    ge = []
    pivot = cmin + size // 2
    for n in cseq:
        (lt if n < pivot else ge).append(n)
    return find(lt, cmin) if cmin + len(lt) < pivot else find(ge, pivot)
test = set(range(1,100))
print(find(test)) # 100
test.remove(42)
print(find(test)) # 42
test.remove(1)
print(find(test)) # 1
Inspired by various solutions and comments above, about 20%-50% faster in my (simplistic) tests than the fastest of them (though I'm sure it could be made faster), and handling all the corner cases mentioned (non-positive numbers, duplicates, and empty list):
import numpy
def firstNotPresent(l):
    positive = numpy.fromiter(set(l), dtype=int)  # deduplicate
    positive = positive[positive > 0]  # only keep positive numbers
    positive.sort()
    top = positive.size + 1
    if top == 1:  # empty list
        return 1
    sequence = numpy.arange(1, top)
    try:
        # where() gives a 0-based position; the missing value is one more
        return int(numpy.where(sequence < positive)[0][0]) + 1
    except IndexError:  # no numbers are missing, top is next
        return top
The idea is: if you enumerate the positive, deduplicated, sorted list starting from one, the first time the index is less than the list value, the index value is missing from the list, and hence is the lowest positive number missing from the list.
This and the other solutions I tested against (those from adrtam, Paritosh Singh, and VPfB) all appear to be roughly O(n), as expected. (It is, I think, fairly obvious that this is a lower bound, since every element in the list must be examined to find the answer.) Edit: looking at this again, of course the big-O for this approach is at least O(n log(n)), because of the sort. It's just that the sort is so fast, comparatively speaking, that it looked linear overall.

Optimization of list's sublist

The problem is to find the total number of sub-lists of a given list that don't contain numbers greater than a specified upper bound, say right, and whose maximum is at least a lower bound, say left. Suppose my list is x=[2, 0, 11, 3, 0], the upper bound for sub-list elements is 10 and the lower bound is 1; then my sub-lists are [[2],[2,0],[3],[3,0]], as sub-lists are always contiguous. My script runs well and produces correct output but needs some optimization.
def query(sliced,left,right):
    end_index=0
    count=0
    leng=len(sliced)
    for i in range(leng):
        stack=[]
        end_index=i
        while(end_index<leng and sliced[end_index]<=right):
            stack.append(sliced[end_index])
            if max(stack)>=left:
                count+=1
            end_index+=1
    print (count)
origin=[2,0,11,3,0]
left=1
right=10
query(origin,left,right)
output: 4
for a list say x=[2,0,0,1,11,14,3,5] valid sub-lists can be [[2],[2,0],[2,0,0],[2,0,0,1],[0,0,1],[0,1],[1],[3],[5],[3,5]] total being 10
Brute force
Generate every possible sub-list and check if the given criteria hold for each sub-list.
Worst case scenario: For every element e in the array, left < e < right.
Time complexity: O(n^3)
Optimized brute force (OP's code)
For every index in the array, incrementally build a temporary list (not really needed though) which is valid.
Worst case scenario: For every element e in the array, left < e < right.
Time complexity: O(n^2)
A more optimized solution
If the array has n elements, then the number of sub-lists in the array is 1 + 2 + 3 + ... + n = (n * (n + 1)) / 2 = O(n^2). We can use this formula strategically.
First, as @Tim mentioned, we can just consider the sum of the sub-lists that do not contain any numbers greater than right by partitioning the list about those numbers greater than right. This reduces the task to only considering sub-lists that have all elements less than or equal to right, then summing the answers.
Next, break apart the reduced sub-list (yes, the sub-list of the sub-list) by partitioning the reduced sub-list about the numbers greater than or equal to left. For each of those sub-lists, compute the number of possible sub-lists (which is k * (k + 1) / 2 if the sub-list has length k). Once that is done for all the sub-lists of sub-lists, add them together (store the sum in, say, w), then compute the total number of possible sub-lists of the reduced sub-list and subtract w.
Then aggregate your results by sum.
Worst case scenario: For every element e in the array, e < left.
Time Complexity: O(n)
I know this is very difficult to understand, so I have included working code:
def compute(sliced, lo, hi, left):
    num_invalid = 0
    start = 0
    search_for_start = True
    for end in range(lo, hi):
        if search_for_start and sliced[end] < left:
            start = end
            search_for_start = False
        elif not search_for_start and sliced[end] >= left:
            num_invalid += (end - start) * (end - start + 1) // 2
            search_for_start = True
    if not search_for_start:
        num_invalid += (hi - start) * (hi - start + 1) // 2
    return ((hi - lo) * (hi - lo + 1)) // 2 - num_invalid
def query(sliced, left, right):
    ans = 0
    start = 0
    search_for_start = True
    for end in range(len(sliced)):
        if search_for_start and sliced[end] <= right:
            start = end
            search_for_start = False
        elif not search_for_start and sliced[end] > right:
            ans += compute(sliced, start, end, left)
            search_for_start = True
    if not search_for_start:
        ans += compute(sliced, start, len(sliced), left)
    return ans
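To cross-check a linear-time solution, it can help to compare it against a straightforward reference count. The sketch below is not the code above: count_brute is a plain O(n^2) reference, and count_linear is a different formulation of the same subtraction idea (sub-lists with all elements <= right, minus those with all elements < left). All three function names are hypothetical, and count_linear assumes integer inputs.

```python
def count_brute(values, left, right):
    """Reference count: contiguous sub-lists whose maximum lies in [left, right]."""
    total = 0
    for start in range(len(values)):
        current_max = values[start]
        for end in range(start, len(values)):
            current_max = max(current_max, values[end])
            if current_max > right:
                break  # every longer sub-list from `start` is invalid too
            if current_max >= left:
                total += 1
    return total


def count_at_most(values, bound):
    """Number of contiguous sub-lists in which every element is <= bound."""
    total = 0
    run = 0  # length of the current run of elements <= bound
    for v in values:
        run = run + 1 if v <= bound else 0
        total += run  # `run` sub-lists end at this position
    return total


def count_linear(values, left, right):
    # Assumes integer inputs, so "max < left" is the same as "max <= left - 1".
    return count_at_most(values, right) - count_at_most(values, left - 1)


print(count_brute([2, 0, 11, 3, 0], 1, 10))   # 4
print(count_linear([2, 0, 11, 3, 0], 1, 10))  # 4
```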
Categorise the numbers as small, valid and large (S,V and L) and further index the valid numbers: V_1, V_2, V_3 etc. Let us start off by assuming there are no large numbers.
Consider the list A = [S,S,…,S,V_1,X,X,X,X,…,X]. If V_1 has index n, there are n+1 sub-lists of the form [V_1], [S,V_1], [S,S,V_1] and so on that end at V_1. To each of these n+1 sub-lists we can append nothing or one of the len(A)-n-1 sequences [X], [X,X], [X,X,X] and so on, giving a total of (n+1)(len(A)-n) sub-lists containing V_1.
But we can partition the set of all valid sub-lists by the first valid number they contain: those containing V_k but no V_j for j less than k. Hence we simply perform the same calculation on the remaining X,X,…,X part of the list using V_2, and iterate. This would require something like this:
def query(sliced,left,right,total):
    index=0
    while index<len(sliced):
        if sliced[index]>=left:
            total+=(index+1)*(len(sliced)-index)
            return total+query(sliced[index+1:],left,right,0)
        else:
            index+=1
    return total
To incorporate the large numbers, we can just partition the whole list according to where the large numbers occur and add the total number of sequences for each part. If we call our first function sub_query, then we arrive at the following:
def sub_query(sliced,left,right,total):
    index=0
    while index<len(sliced):
        if sliced[index]>=left:
            total+=(index+1)*(len(sliced)-index)
            return total+sub_query(sliced[index+1:],left,right,0)
        else:
            index+=1
    return total
def query(sliced,left,right):
    index=0
    count=0
    while index<len(sliced):
        if sliced[index]>right:
            count+=sub_query(sliced[:index],left,right,0)
            sliced=sliced[index+1:]
            index=0
        else:
            index+=1
    count+=sub_query(sliced,left,right,0)
    print (count)
This seems to run through the list and check for max/min values fewer times. Note it doesn't distinguish between sub-lists that are the same but from different positions in the original list (as would arise from a list such as [0,1,0,0,1,0]). But the code from the original post wouldn't do that either, so I am guessing this is not a requirement.

Array of positive integers , ideas for efficient implementation

I have a small problem within a bigger problem.
I have an array of positive integers. I need to find a position i in the array such that all the numbers which are smaller than the element at position i should appear after it.
Example:
(let's assume array is indexed at 1)
2, 3, 4, 1, 9,3, 2 => 3rd pos // 1,2,3 are less than 4 and are occurring after it.
5, 2, 1, 5 => 2nd pos
1,2,1 => 2nd pos
1, 4, 6, 7, 2, 3 => doesn't exist
I'm thinking of using a hashtable but I don't know exactly how. Or sorting would be better? Any ideas for an efficient idea?
We can start by creating a map (or hash table or whatever), which records the last occurrence for each entry:
for i from 1 to n
    lastOccurrence[arr[i]] = i
next
We know that if j is a valid answer, then every number smaller than j is also a valid answer. So we want to find the maximum j. The minimum j is obviously 1 because then the left sublist is empty.
We can then iterate over all possible values of j and check their validity.
maxJ = n
for j from 1 to n
    if j > maxJ
        return maxJ
    if lastOccurrence[arr[j]] == j
        return j
    maxJ = min(maxJ, lastOccurrence[arr[j]] - 1)
next
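For reference, the pseudocode above can be transcribed into Python roughly as follows. This is a direct sketch of that pseudocode, not a verified solution to every reading of the question; find_position is a hypothetical name, and positions are 1-based as in the question.

```python
def find_position(arr):
    """1-based position found by the pseudocode above (hypothetical function name)."""
    n = len(arr)
    # Record the last 1-based position of every value.
    last_occurrence = {}
    for pos, value in enumerate(arr, start=1):
        last_occurrence[value] = pos
    max_j = n
    for j in range(1, n + 1):
        if j > max_j:
            return max_j
        last = last_occurrence[arr[j - 1]]
        if last == j:
            return j
        max_j = min(max_j, last - 1)
    return max_j  # only reached for an empty list, where it returns 0


print(find_position([2, 3, 4, 1, 9, 3, 2]))  # 3
print(find_position([5, 2, 1, 5]))           # 2
print(find_position([1, 2, 1]))              # 2
```

For an all-distinct list such as [1, 4, 6, 7, 2, 3] this returns 1, the trivially valid position, which the question treats as "doesn't exist".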
def findMaxIndex(array):
    lastSet = set()  # built-in set; the old sets module no longer exists in Python 3
    size = len(array)
    maxIndex = size
    for index in range(size-1,-1,-1):
        if array[index] in lastSet:
            continue
        else:
            lastSet.add(array[index])
            maxIndex = index + 1
    if maxIndex == 1:
        return 0  # doesn't exist
    else:
        return maxIndex
Iterate from the last element to the first, using a set to keep the elements already seen. If the current element (index i) is not in the set, then the max index becomes i, and the element is added to the set.

better algorithm for checking 5 in a row/col in a matrix

Is there a good algorithm for checking whether there are 5 same elements in a row, in a column, or diagonally, given a square matrix, say 6x6?
There is of course the naive algorithm of iterating through every spot and then, for each point in the matrix, iterating through that row, column and diagonal. I am wondering if there is a better way of doing it.
You could keep a histogram in a dictionary (mapping element type -> int). And then you iterate over your row or column or diagonal, and increment histogram[element], and either check at the end to see if you have any 5s in the histogram, or if you can allow more than 5 copies, you can just stop once you've reached 5 for any element.
Simple, one-dimensional, example:
m = ['A', 'A', 'A', 'A', 'B', 'A']
h = {}
for x in m:
    if x in h:
        h[x] += 1
    else:
        h[x] = 1
print("Histogram:", h)
for k in h:
    if h[k]>=5:
        print("%s appears %d times." % (k,h[k]))
Output:
Histogram: {'A': 5, 'B': 1}
A appears 5 times.
Essentially, h[x] will store the number of times the element x appears in the array (in your case, this will be the current row, or column or diagonal). The elements don't have to appear consecutively, but the counts would be reset each time you start considering a new row/column/diagonal.
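The same per-line histogram can be written more compactly with collections.Counter from the standard library. A small sketch; the function name elements_with_at_least is hypothetical, and as above the copies do not have to be consecutive.

```python
from collections import Counter


def elements_with_at_least(line, k):
    """Return the elements that occur at least k times in one row/column/diagonal."""
    counts = Counter(line)
    return [element for element, count in counts.items() if count >= k]


print(elements_with_at_least(['A', 'A', 'A', 'A', 'B', 'A'], 5))  # ['A']
```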
You can check whether there are k same elements in a matrix of integers in a single pass.
Suppose that n is the size of the matrix and m is the largest absolute value of an element. We have n columns, n rows and 1 diagonal.
For each column, row or diagonal we have at most n distinct elements.
Now we can create a histogram containing (n + n + 1) * (2 * m + 1) elements, representing the rows, columns and the diagonal, each with one slot for every possible value.
size = (n + n + 1) * (2 * m + 1)
histogram = zeros(size, Int)
Now the tricky part is how to update this histogram.
Consider this function in pseudo-code:
updateHistogram(i, j, element)
    w = 2 * m + 1                    // slots per row, column or diagonal
    if (element < 0)
        element = m - element        // map negatives to m+1 .. 2m
    rowIndex = i * w + element
    columnIndex = n * w + j * w + element
    diagonalIndex = 2 * n * w + element
    histogram[rowIndex] = histogram[rowIndex] + 1
    histogram[columnIndex] = histogram[columnIndex] + 1
    if (i == j)
        histogram[diagonalIndex] = histogram[diagonalIndex] + 1
Now all you have to do is iterate through the histogram and check whether there is an entry >= k.
Your best approach may depend on whether you control the placement of elements.
For example, if you were building a game and just placed the most recent element on the grid, you could capture into four strings the vertical, horizontal, and diagonal strips that intersected that point, and use the same algorithm on each strip, tallying each element and evaluating the totals. The algorithm may be slightly different depending on whether you're counting five contiguous elements out of the six, or allow gaps as long as the total is five.
For rows you can keep a counter, which indicates how many of the same elements in a row you currently have. To do this, iterate through the row and
if current element matches the previous element, increase the counter by one. If counter is 5, then you have found the 5 elements you wanted.
if current element doesn't match previous element, set the counter to 1.
The same principle can be applied to columns and diagonals as well. You probably want to use array of counters for columns (one element for each column) and diagonals so you can iterate through the matrix once.
I did the small example for a smaller case, but you can easily change it:
n = 3
matrix = [[1, 2, 3, 4],
          [1, 2, 3, 1],
          [2, 3, 1, 3],
          [2, 1, 4, 2]]
col_counter = [1, 1, 1, 1]
for row in range(0, len(matrix)):
    row_counter = 1
    for col in range(0, len(matrix[row])):
        current_element = matrix[row][col]
        # check elements in a same row
        if col > 0:
            previous_element = matrix[row][col - 1]
            if current_element == previous_element:
                row_counter = row_counter + 1
                if row_counter == n:
                    print(n, 'in a row at:', row, col - n + 1)
            else:
                row_counter = 1
        # check elements in a same column
        if row > 0:
            previous_element = matrix[row - 1][col]
            if current_element == previous_element:
                col_counter[col] = col_counter[col] + 1
                if col_counter[col] == n:
                    print(n, 'in a column at:', row - n + 1, col)
            else:
                col_counter[col] = 1
I left out diagonals to keep the example short and simple, but for diagonals you can use the same principle as you use on columns. The previous element would be one of the following (depending on the direction of diagonal):
matrix[row - 1][col - 1]
matrix[row - 1][col + 1]
Note that you will need to make a little bit extra effort in the second case. For example traverse the row in the inner loop from right to left.
I don't think you can avoid iteration, but you can at least XOR neighbouring elements instead of comparing them: for integers, a ^ b is 0 exactly when a == b. Note that XOR-ing a whole run together does not tell you whether all its elements are equal (for example, 1 ^ 2 ^ 3 is 0).
You can try to improve your method with some heuristics: use the knowledge of the matrix size to exclude element sequences that cannot fit and skip unnecessary calculation. If the given vector size is 6, you want to find 5 equal elements, and the first 3 elements are all different, further calculation does not make sense.
This approach can give you a significant advantage, if 5 equal elements in a row happen rarely enough.
If you code the rows/columns/diagonals as bitmaps, "five in a row" means "mask % 31== 0 && mask / 31 == power_of_two"
00011111 := 0x1f 31 (five in a row)
00111110 := 0x3e 62 (five in a row)
00111111 := 0x3f 63 (six in a row)
If you want to treat the six-in-a-row case also as five-in-a-row, the easiest way is probably to:
if (mask == 0) return 0; /* avoid looping forever on an empty mask */
for ( ; !(mask & 1) ; mask >>= 1 ) {;}
return ((mask & 0x1f) == 0x1f) ? 1 : 0;
Maybe the Stanford bit-tweaking department has a better solution or suggestion that does not need looping?
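One loop-free possibility, shown here as a sketch rather than as the answer to the question above, is to AND the mask with shifted copies of itself: a bit survives all four shifts only if it starts a run of five set bits. The function name has_five_in_a_row is hypothetical, and mask is assumed to be a non-negative integer.

```python
def has_five_in_a_row(mask):
    """True if the non-negative integer `mask` contains five consecutive 1 bits."""
    # Bit i of the result is set only if bits i..i+4 of mask are all set.
    return (mask & (mask >> 1) & (mask >> 2) & (mask >> 3) & (mask >> 4)) != 0


print(has_five_in_a_row(0x1f))      # True  (five in a row)
print(has_five_in_a_row(0x3f))      # True  (six in a row also counts)
print(has_five_in_a_row(0b101111))  # False (only four consecutive bits)
```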
