Optimization of list's sublist - python

The problem is to count the contiguous sub-lists of a given list that contain no number greater than an upper bound (say right) and whose maximum is at least a lower bound (say left). Suppose my list is x=[2, 0, 11, 3, 0], the upper bound for sub-list elements is 10 and the lower bound is 1. Then the valid sub-lists are [[2],[2,0],[3],[3,0]], since sub-lists are always contiguous. My script runs well and produces the correct output but needs some optimization:
def query(sliced,left,right):
    end_index=0
    count=0
    leng=len(sliced)
    for i in range(leng):
        stack=[]
        end_index=i
        while(end_index<leng and sliced[end_index]<=right):
            stack.append(sliced[end_index])
            if max(stack)>=left:
                count+=1
            end_index+=1
    print (count)

origin=[2,0,11,3,0]
left=1
right=10
query(origin,left,right)
output:4
For a list such as x=[2,0,0,1,11,14,3,5], the valid sub-lists are [[2],[2,0],[2,0,0],[2,0,0,1],[0,0,1],[0,1],[1],[3],[5],[3,5]], 10 in total.

Brute force
Generate every possible sub-list and check if the given criteria hold for each sub-list.
Worst case scenario: For every element e in the array, left < e < right.
Time complexity: O(n^3)
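A minimal sketch of the brute-force approach described above, useful mainly as a reference for checking faster versions. The name count_brute_force is hypothetical; the validity condition (no element above right, maximum at least left) is taken from the question.

```python
def count_brute_force(values, left, right):
    """Count contiguous sub-lists whose maximum lies in [left, right]."""
    count = 0
    n = len(values)
    for start in range(n):
        for end in range(start + 1, n + 1):
            # Slicing and max() are each O(n), so the whole loop is O(n^3).
            if left <= max(values[start:end]) <= right:
                count += 1
    return count


print(count_brute_force([2, 0, 11, 3, 0], 1, 10))            # 4
print(count_brute_force([2, 0, 0, 1, 11, 14, 3, 5], 1, 10))  # 10
```

Checking max(...) <= right is equivalent to requiring that no element exceeds right, so a single comparison chain covers both conditions.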
Optimized brute force (OP's code)
For every index in the array, incrementally build a temporary list (not really needed though) which is valid.
Worst case scenario: For every element e in the array, left < e < right.
Time complexity: O(n^2)
A more optimized solution
If the array has n elements, then the number of sub-lists in the array is 1 + 2 + 3 + ... + n = (n * (n + 1)) / 2 = O(n^2). We can use this formula strategically.
First, as @Tim mentioned, partition the list about the numbers greater than right. No valid sub-list can contain such a number, so the task reduces to counting the valid sub-lists within each remaining piece (all of whose elements are less than or equal to right) and summing the answers.
Next, within each piece, partition again about the numbers greater than or equal to left. This leaves runs whose elements are all less than left. A run of length k has k * (k + 1) / 2 sub-lists, and every one of them is invalid because its maximum is below left. Add these counts together (store the sum in, say, w), then compute the total number of sub-lists of the piece and subtract w.
Then aggregate the results over all pieces by summing.
Worst case scenario: For every element e in the array, e < left.
Time Complexity: O(n)
I know this is very difficult to understand, so I have included working code:
def compute(sliced, lo, hi, left):
    num_invalid = 0
    start = 0
    search_for_start = True
    for end in range(lo, hi):
        if search_for_start and sliced[end] < left:
            start = end
            search_for_start = False
        elif not search_for_start and sliced[end] >= left:
            num_invalid += (end - start) * (end - start + 1) // 2
            search_for_start = True
    if not search_for_start:
        num_invalid += (hi - start) * (hi - start + 1) // 2
    return ((hi - lo) * (hi - lo + 1)) // 2 - num_invalid

def query(sliced, left, right):
    ans = 0
    start = 0
    search_for_start = True
    for end in range(len(sliced)):
        if search_for_start and sliced[end] <= right:
            start = end
            search_for_start = False
        elif not search_for_start and sliced[end] > right:
            ans += compute(sliced, start, end, left)
            search_for_start = True
    if not search_for_start:
        ans += compute(sliced, start, len(sliced), left)
    return ans
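The same subtraction idea can be written more compactly, as a sketch rather than a replacement for the code above: count the sub-lists in which every element is at most right, then subtract those in which every element is below left. The helper names count_all_satisfying and count_in_range are hypothetical, and the sketch assumes left <= right.

```python
def count_all_satisfying(values, keep):
    """Count contiguous sub-lists in which every element satisfies keep()."""
    total = 0
    run = 0  # length of the current run of satisfying elements
    for v in values:
        run = run + 1 if keep(v) else 0
        total += run  # exactly 'run' qualifying sub-lists end at this position
    return total


def count_in_range(values, left, right):
    # Assumes left <= right, so "all elements < left" implies "all elements <= right".
    not_too_big = count_all_satisfying(values, lambda v: v <= right)
    all_too_small = count_all_satisfying(values, lambda v: v < left)
    return not_too_big - all_too_small


print(count_in_range([2, 0, 11, 3, 0], 1, 10))            # 4
print(count_in_range([2, 0, 0, 1, 11, 14, 3, 5], 1, 10))  # 10
```

Each helper call is a single pass, so the whole computation stays O(n).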

Categorise the numbers as small, valid and large (S, V and L) and index the valid numbers: V_1, V_2, V_3, etc. Let us start by assuming there are no large numbers.
Consider the list A = [S,S,…,S,V_1,X,X,X,X,…,X]. If V_1 has index n, there are n+1 sub-lists ending at V_1: [V_1], [S,V_1], [S,S,V_1] and so on. To each of these n+1 sub-lists we can append either nothing or one of the len(A)-n-1 sequences [X], [X,X], [X,X,X] and so on, giving a total of (n+1)(len(A)-n) sub-lists containing V_1.
We can partition the set of all valid sub-lists into those containing V_k but no V_j for j less than k. Hence we simply perform the same calculation on the remaining X,X,…,X part of the list using V_2, and iterate. This would require something like this:
def query(sliced,left,right,total):
    index=0
    while index<len(sliced):
        if sliced[index]>=left:
            total+=(index+1)*(len(sliced)-index)
            return total+query(sliced[index+1:],left,right,0)
        else:
            index+=1
    return total
To incorporate the large numbers, we can partition the whole list according to where the large numbers occur and add up the number of valid sequences in each part. If we call our first function sub_query, we arrive at the following:
def sub_query(sliced,left,right,total):
    index=0
    while index<len(sliced):
        if sliced[index]>=left:
            total+=(index+1)*(len(sliced)-index)
            return total+sub_query(sliced[index+1:],left,right,0)
        else:
            index+=1
    return total

def query(sliced,left,right):
    index=0
    count=0
    while index<len(sliced):
        if sliced[index]>right:
            count+=sub_query(sliced[:index],left,right,0)
            sliced=sliced[index+1:]
            index=0
        else:
            index+=1
    count+=sub_query(sliced,left,right,0)
    print (count)
This seems to run through the list and check for max/min values fewer times. Note that it doesn't distinguish between sub-lists that are the same but come from different positions in the original list (as would arise from a list such as [0,1,0,0,1,0]). But the code from the original post wouldn't do that either, so I am guessing this is not a requirement.
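The recursive version copies the list on every slice. As a sketch, the same (n+1)(len(A)-n) counting can be done iteratively with indices only; the function name count_valid_sublists and the variable prev_valid are hypothetical. Within a segment free of large numbers, each valid element at index i contributes (i - prev_valid) * (end - i) sub-lists in which it is the first valid element.

```python
def count_valid_sublists(values, left, right):
    total = 0
    n = len(values)
    start = 0
    while start < n:
        if values[start] > right:      # large number: skip it
            start += 1
            continue
        end = start
        while end < n and values[end] <= right:
            end += 1                   # segment [start, end) has no large numbers
        prev_valid = start - 1         # index just before the segment
        for i in range(start, end):
            if values[i] >= left:
                # choices of left edge times choices of right edge
                total += (i - prev_valid) * (end - i)
                prev_valid = i
        start = end
    return total


print(count_valid_sublists([2, 0, 11, 3, 0], 1, 10))            # 4
print(count_valid_sublists([2, 0, 0, 1, 11, 14, 3, 5], 1, 10))  # 10
```

Every index is visited a constant number of times, so this runs in O(n) without recursion or slicing.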

Related

Using binary search to find the duplicate number in an array

The problem:
Given an array of integers nums containing n + 1 integers where each integer is in the range [1, n] inclusive.
There is only one repeated number in nums; return this repeated number.
You must solve the problem without modifying the array nums and using only constant extra space.
Here is one possible solution using binary search:
class Solution(object):
    def findDuplicate(self, nums):
        beg, end = 1, len(nums)-1
        while beg + 1 <= end:
            mid, count = (beg + end)//2, 0
            for num in nums:
                if num <= mid: count += 1
            if count <= mid:
                beg = mid + 1
            else:
                end = mid
        return end
Example 1:
Input: nums = [1,3,4,2,2]
Output: 2
Example 2:
Input: nums = [3,1,3,4,2]
Output: 3
Can someone please explain this solution for me? I understand the code but I don't understand the logic behind this. In particular, I do not understand how to construct the if statements (lines 7 - 13). Why and how do you know that when num <= mid then I need to do count += 1 and so on. Many thanks.
The solution keeps halving the range of numbers the answer can still be in.
For example, if the function starts with nums == [1, 3, 4, 2, 2], then the duplicate number must be between 1 and 4 inclusive by definition.
By counting how many of the numbers are smaller than or equal to the middle of that range (2), you can decide if the duplicate must be in the upper or lower half of that range. Since the count is greater than the midpoint (3 numbers are less than or equal to 2, and 3 > 2), the duplicate must be in the lower half.
Repeating the process, knowing that the number must be between 1 and 2 inclusive, only 1 number is less than or equal to the middle of that range (1), which means the number must be in the upper half, and is 2.
Consider a slightly longer series: [1, 2, 5, 6, 3, 4, 3, 7]. The midpoint of 1 and 7 is 4, and 5 numbers are less than or equal to 4, so the duplicate must be between 1 and 4. The midpoint of 1 and 4 is 2, and exactly 2 numbers are less than or equal to 2, so the duplicate must be over 2, that is, 3 or 4. The midpoint of 3 and 4 is 3, and 4 numbers are less than or equal to 3, which leaves 3.
The solution will iterate over all n elements of nums a limited number of times, since it keeps halving the search space. It's certainly more efficient than the naive:
def findDuplicate(self, nums):
    for i, n in enumerate(nums):
        for j, m in enumerate(nums):
            if i != j and n == m:
                return n
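But as user @fas suggests in the comments, this is better (it relies on every value from 1 to n being present, with the duplicate appearing exactly twice):
def findDuplicate(nums):
    # Start at 4: xor(0..p-1) == 0 only for powers of two p >= 4,
    # so starting at 1 would give the wrong answer for nums == [1, 1].
    p = 4
    while p < len(nums):
        p <<= 1
    r = 0
    for n in nums:
        r ^= n
    for n in range(len(nums), p):
        r ^= n
    return r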
This is how the given binary search works. Inside the binary search you have an implementation of isDuplicateLessOrEqualTo(x):
mid, count = (beg + end)//2, 0
for num in nums:
    if num <= mid: count += 1
if count <= mid:
    return False # In this case there are no duplicates less than or equal to mid.
                 # Actually count == mid would be enough, count < mid is impossible.
else:
    return True # In this case there is a duplicate less than or equal to mid.
isDuplicateLessOrEqualTo(x) is a non-decreasing function (assume x has a duplicate, then for all i < x it's false and for all i >= x it's true), that's why you can run binary search over it.
Each iteration you run through the array, which gives you overall complexity O(n log n) (where n is size of array).
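To make the monotone predicate explicit, here is a minimal sketch that separates it from the search loop. The names has_duplicate_at_most and find_duplicate are hypothetical; the logic follows the binary search shown in the question.

```python
def has_duplicate_at_most(nums, x):
    """True if the repeated value is <= x.

    Pigeonhole: if more than x entries fall in 1..x, one value there repeats.
    """
    return sum(1 for num in nums if num <= x) > x


def find_duplicate(nums):
    lo, hi = 1, len(nums) - 1  # the answer lies in [1, n]
    while lo < hi:
        mid = (lo + hi) // 2
        if has_duplicate_at_most(nums, mid):
            hi = mid        # the answer is mid or smaller
        else:
            lo = mid + 1    # the answer is strictly larger than mid
    return lo


print(find_duplicate([1, 3, 4, 2, 2]))  # 2
print(find_duplicate([3, 1, 3, 4, 2]))  # 3
```

The predicate is False for every x below the duplicate and True from the duplicate onwards, which is exactly the shape binary search needs.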
There's a faster solution. Note that xor(0..(2^n)-1) = 0 for n >= 2, because there are 2^(n-1) ones for each bit position and it's an even number, for example:
0_10 = 00_2
1_10 = 01_2
2_10 = 10_2
3_10 = 11_2
        ^
        2 ones here, 2 is even
       ^
       2 ones here, 2 is even
So by xor-ing all the numbers you'll receive exactly your duplicate number. Let's write it:
class Solution(object):
    @staticmethod
    def nearestPowerOfTwo(number):
        lowerBoundDegreeOfTwo = number.bit_length()
        lowerBoundDegreeOfTwo = max(lowerBoundDegreeOfTwo, 2)
        return 2 ** lowerBoundDegreeOfTwo
    def findDuplicate(self, nums):
        xorSum = 0
        for i in nums:
            xorSum = xorSum ^ i
        # the helper is a static method, so it must be called through self
        for i in range(len(nums), self.nearestPowerOfTwo(len(nums) - 1)):
            xorSum = xorSum ^ i
        return xorSum
As you can see that gives us O(n) complexity.
If anyone is interested in a different approach (not binary search) to solve this problem:
Sum all elements of the array - we will call it sumArray - the time complexity is O(n).
Sum all numbers from 1 to n (inclusive) - we will call it sumGeneral - this is simply (n * (n+1) / 2) - the time complexity is O(1).
Return the result of sumArray - sumGeneral
In total, the time complexity is O(n) (you cannot do better since you have to look at all elements of the array, potentially the repeated one is at the end), and additional space complexity is O(1).
(If you could use O(n) additional space complexity you could use a hash table)
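The two ideas above can be sketched as follows. The function names are hypothetical. The sum version assumes that every value from 1 to n appears once and the duplicate appears exactly twice; the set version does not need that assumption but uses O(n) extra space.

```python
def find_duplicate_by_sum(nums):
    # Assumption: values 1..n each appear once, plus one extra copy of the duplicate.
    n = len(nums) - 1
    sum_general = n * (n + 1) // 2   # 1 + 2 + ... + n
    return sum(nums) - sum_general


def find_duplicate_with_set(nums):
    # O(n) extra space; also works if the duplicate appears more than twice.
    seen = set()
    for num in nums:
        if num in seen:
            return num
        seen.add(num)
    return None  # no repeated value found


print(find_duplicate_by_sum([1, 3, 4, 2, 2]))    # 2
print(find_duplicate_with_set([3, 1, 3, 4, 2]))  # 3
```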

Getting all subsets from subset sum problem on Python using Dynamic Programming

I am trying to extract all subsets from a list of elements which add up to a certain value.
Example -
List = [1,3,4,5,6]
Sum - 9
Output Expected = [[3,6],[5,4]]
I have tried different approaches and get the expected output, but on a huge list of elements it takes a significant amount of time.
Can this be optimized using dynamic programming or any other technique?
Approach-1
def subset(array, num):
    result = []
    def find(arr, num, path=()):
        if not arr:
            return
        if arr[0] == num:
            result.append(path + (arr[0],))
        else:
            find(arr[1:], num - arr[0], path + (arr[0],))
        find(arr[1:], num, path)
    find(array, num)
    return result

numbers = [2, 2, 1, 12, 15, 2, 3]
x = 7
subset(numbers,x)
Approach-2
def isSubsetSum(arr, subset, N, subsetSize, subsetSum, index , sum):
    global flag
    if (subsetSum == sum):
        flag = 1
        for i in range(0, subsetSize):
            print(subset[i], end = " ")
        print("")
    else:
        for i in range(index, N):
            subset[subsetSize] = arr[i]
            isSubsetSum(arr, subset, N, subsetSize + 1,
                        subsetSum + arr[i], i + 1, sum)
If you want to output all subsets you can't do better than a sluggish O(2^n) complexity, because in the worst case that will be the size of your output and time complexity is lower-bounded by output size (this is a known NP-Complete problem). But, if rather than returning a list of all subsets, you just want to return a boolean value indicating whether achieving the target sum is possible, or just one subset summing to target (if it exists), you can use dynamic programming for a pseudo-polynomial O(nK) time solution, where n is the number of elements and K is the target integer.
The DP approach involves filling in an (n+1) x (K+1) table, with the sub-problems corresponding to the entries of the table being:
DP[i][k] = subset(A[i:], k) for 0 <= i <= n, 0 <= k <= K
That is, subset(A[i:], k) asks, 'Can I sum to (little) k using the suffix of A starting at index i?' Once you fill in the whole table, the answer to the overall problem, subset(A[0:], K) will be at DP[0][K]
The base cases are for i=n: they indicate that you can't sum to anything except for 0 if you're working with the empty suffix of your array
subset(A[n:], k>0) = False, subset(A[n:], k=0) = True
The recursive cases to fill in the table are:
subset(A[i:], k) = subset(A[i+1:], k) OR (A[i] <= k AND subset(A[i+1:], k-A[i]))
This simply relates the idea that you can use the current array suffix to sum to k either by skipping over the first element of that suffix and using the answer you already had in the previous row (when that first element wasn't in your array suffix), or by using A[i] in your sum and checking if you could make the reduced sum k-A[i] in the previous row. Of course, you can only use the new element if it doesn't itself exceed your target sum.
ex: subset(A[i:] = [3,4,1,6], k = 8)
would check: could I already sum to 8 with the previous suffix (A[i+1:] = [4,1,6])? No. Or, could I use the 3 which is now available to me to sum to 8? That is, could I sum to k = 8 - 3 = 5 with [4,1,6]? Yes. Because at least one of the conditions was true, I set DP[i][8] = True
Because all the base cases are for i=n, and the recurrence relation for subset(A[i:], k) relies on the answers to the smaller sub-problems subset(A[i+1:],...), you start at the bottom of the table, where i = n, fill out every k value from 0 to K for each row, and work your way up to row i = 0, ensuring you have the answers to the smaller sub-problems when you need them.
def subsetSum(A: list[int], K: int) -> bool:
    N = len(A)
    DP = [[None] * (K+1) for x in range(N+1)]
    DP[N] = [True if x == 0 else False for x in range(K+1)]
    for i in range(N-1, -1, -1):
        Ai = A[i]
        DP[i] = [DP[i+1][k] or (Ai <=k and DP[i+1][k-Ai]) for k in range(0, K+1)]
    # print result
    print(f"A = {A}, K = {K}")
    print('Ai,k:', *range(0,K+1), sep='\t')
    for (i, row) in enumerate(DP): print(A[i] if i < N else None, *row, sep='\t')
    print(f"DP[0][K] = {DP[0][K]}")
    return DP[0][K]

subsetSum([1,4,3,5,6], 9)
If you want to return an actual possible subset alongside the bool indicating whether or not it's possible to make one, then for every True flag in your DP you should also store the k index for the previous row that got you there (it will either be the current k index or k-A[i], depending on which table lookup returned True, which will indicate whether or not A[i] was used). Then you walk backwards from DP[0][K] after the table is filled to get a subset. This makes the code messier but it's definitely do-able. You can't get all subsets this way though (at least not without increasing your time complexity again) because the DP table compresses information.
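A minimal sketch of that walk-back, assuming non-negative integers. Instead of storing explicit back-pointers, it re-reads the finished table: if the row below can already make the current k, A[i] is skipped, otherwise A[i] must have been used. The function name one_subset is hypothetical.

```python
def one_subset(A, K):
    """Return one subset of A summing to K, or None if there is none."""
    n = len(A)
    # dp[i][k] is True when some subset of A[i:] sums to k.
    dp = [[False] * (K + 1) for _ in range(n + 1)]
    dp[n][0] = True
    for i in range(n - 1, -1, -1):
        for k in range(K + 1):
            dp[i][k] = dp[i + 1][k] or (A[i] <= k and dp[i + 1][k - A[i]])
    if not dp[0][K]:
        return None
    chosen, k = [], K
    for i in range(n):
        if dp[i + 1][k]:
            continue            # k is reachable without A[i]: skip it
        chosen.append(A[i])     # otherwise A[i] must be part of the sum
        k -= A[i]
    return chosen


print(one_subset([1, 4, 3, 5, 6], 9))  # [3, 6]
print(one_subset([2, 4], 5))           # None
```

Preferring to skip an element whenever possible is an arbitrary tie-break; taking the element first would return a different, equally valid subset.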
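Here is a faster approach, which its author describes as O(n^2). Note that it stores only one partial subset per remaining difference, so it can miss valid subsets when two different partial sums coincide: for data=[1,3,4,5,6] and target=9 it returns [[4,5],[3,6]] but not [1,3,5].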
def get_subsets(data: list, target: int):
    # initialize final result which is a list of all subsets summing up to target
    subsets = []
    # records the difference between the target value and a group of numbers
    differences = {}
    for number in data:
        prospects = []
        # iterate through every record in differences
        for diff in differences:
            # the number complements a record in differences, i.e. a desired subset is found
            if number - diff == 0:
                new_subset = [number] + differences[diff]
                new_subset.sort()
                if new_subset not in subsets:
                    subsets.append(new_subset)
            # the number fell short to reach the target; add to prospect instead
            elif number - diff < 0:
                prospects.append((number, diff))
        # update the differences record
        for prospect in prospects:
            new_diff = target - sum(differences[prospect[1]]) - prospect[0]
            differences[new_diff] = differences[prospect[1]] + [prospect[0]]
        differences[target - number] = [number]
    return subsets

Is there a python function that returns the first positive int that does not occur in list?

I'm trying to design a function that, given an array A of N integers, returns the smallest positive integer (greater than 0) that does not occur in A.
This code works fine but has a high order of complexity. Is there another solution that reduces the order of complexity?
Note: 10000000 is the range of the integers in array A. I tried the sort function, but does it reduce the complexity?
def solution(A):
    for i in range(10000000):
        if(A.count(i)) <= 0:
            return(i)
The following is O(n log n):
a = [2, 1, 10, 3, 2, 15]
a.sort()
if a[0] > 1:
    print(1)
else:
    for i in range(1, len(a)):
        if a[i] > a[i - 1] + 1:
            print(a[i - 1] + 1)
            break
    else:
        # no gap found: the answer follows the largest element
        print(a[-1] + 1)
If you don't like the special handling of 1, you could just append zero to the array and have the same logic handle both cases:
a = sorted(a + [0])
for i in range(1, len(a)):
    if a[i] > a[i - 1] + 1:
        print(a[i - 1] + 1)
        break
else:
    # no gap found: the answer follows the largest element
    print(a[-1] + 1)
Caveats (both trivial to fix and both left as an exercise for the reader):
Neither version handles empty input.
The code assumes there are no negative numbers in the input.
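One possible way to cover both caveats, as a sketch rather than the answerer's own fix: walk the sorted values while tracking the next expected positive integer. The function name first_missing_positive is hypothetical.

```python
def first_missing_positive(a):
    expected = 1                      # smallest positive not yet seen
    for x in sorted(a):
        if x == expected:
            expected += 1
        elif x > expected:
            break                     # gap found: 'expected' is missing
        # x < expected: a duplicate or a non-positive value, ignore it
    return expected


print(first_missing_positive([2, 1, 10, 3, 2, 15]))  # 4
print(first_missing_positive([]))                    # 1
print(first_missing_positive([-3, -1, 0]))           # 1
```

The sort still dominates, so this remains O(n log n).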
O(n) time and O(n) space:
def solution(A):
    count = [0] * len(A)
    for x in A:
        if 0 < x <= len(A):
            count[x-1] = 1 # count[0] is to count 1
    for i in range(len(count)):
        if count[i] == 0:
            return i+1
    return len(A)+1 # only if A = [1, 2, ..., len(A)]
This should be O(n). Utilizes a temporary set to speed things along.
a = [2, 1, 10, 3, 2, 15]
#use a set of only the positive numbers for lookup
temp_set = set()
for i in a:
    if i > 0:
        temp_set.add(i)
#iterate from 1 upto length of set +1 (to ensure edge case is handled)
for i in range(1, len(temp_set) + 2):
    if i not in temp_set:
        print(i)
        break
My proposal is a recursive function inspired by quicksort.
Each step divides the input sequence into two sublists (lt = less than pivot; ge = greater or equal than pivot) and decides, which of the sublists is to be processed in the next step. Note that there is no sorting.
The idea is that a set of integers such that lo <= n < hi contains "gaps" only if it has less than (hi - lo) elements.
The input sequence must not contain dups. A set can be passed directly.
# all cseq items > 0 assumed, no duplicates!
def find(cseq, cmin=1):
    # cmin = possible minimum not ruled out yet
    size = len(cseq)
    if size <= 1:
        return cmin+1 if cmin in cseq else cmin
    lt = []
    ge = []
    pivot = cmin + size // 2
    for n in cseq:
        (lt if n < pivot else ge).append(n)
    return find(lt, cmin) if cmin + len(lt) < pivot else find(ge, pivot)

test = set(range(1,100))
print(find(test)) # 100
test.remove(42)
print(find(test)) # 42
test.remove(1)
print(find(test)) # 1
Inspired by various solutions and comments above, about 20%-50% faster in my (simplistic) tests than the fastest of them (though I'm sure it could be made faster), and handling all the corner cases mentioned (non-positive numbers, duplicates, and empty list):
import numpy
def firstNotPresent(l):
    positive = numpy.fromiter(set(l), dtype=int) # deduplicate
    positive = positive[positive > 0] # only keep positive numbers
    positive.sort()
    top = positive.size + 1
    if top == 1: # empty list
        return 1
    sequence = numpy.arange(1, top)
    try:
        # where() yields a 0-based position; the missing value is position + 1
        return int(numpy.where(sequence < positive)[0][0]) + 1
    except IndexError: # no numbers are missing, top is next
        return top
The idea is: if you enumerate the positive, deduplicated, sorted list starting from one, the first time the index is less than the list value, the index value is missing from the list, and hence is the lowest positive number missing from the list.
This and the other solutions I tested against (those from adrtam, Paritosh Singh, and VPfB) all appear to be roughly O(n), as expected. (It is, I think, fairly obvious that this is a lower bound, since every element in the list must be examined to find the answer.) Edit: looking at this again, of course the big-O for this approach is at least O(n log(n)), because of the sort. It's just that the sort is so fast comparatively speaking that it looked linear overall.

Find the total number of triplets when summed are less than a given threshold

So I'm working on some practice problems and having trouble reducing the complexity. I am given an array of distinct integers a[] and a threshold value T. I need to find the number of triplets i,j,k such that a[i] < a[j] < a[k] and a[i] + a[j] + a[k] <= T. I've gotten this down from O(n^3) to O(n^2 log n) with the following python script. I'm wondering if I can optimize this any further.
import sys
import bisect
first_line = sys.stdin.readline().strip().split(' ')
num_numbers = int(first_line[0])
threshold = int(first_line[1])
count = 0
if num_numbers < 3:
    print(count)
else:
    numbers = sys.stdin.readline().strip().split(' ')
    numbers = list(map(int, numbers))
    numbers.sort()
    for i in range(num_numbers - 2):
        for j in range(i+1, num_numbers - 1):
            k_1 = threshold - (numbers[i] + numbers[j])
            if k_1 < numbers[j]:
                break
            else:
                cross_thresh = bisect.bisect(numbers,k_1) - (j+1)
                if cross_thresh > 0:
                    count += cross_thresh
    print(count)
In the above example, the first input line simply provides the number of numbers and the threshold. The next line is the full list. If the list has fewer than 3 elements, no triplets can exist, so we print 0. Otherwise, we read in the full list of integers, sort them, and process them as follows: we iterate over every pair of elements i and j (such that i < j) and compute the highest value of k that would not break i + j + k <= T. We then find the index (s) of the first element in the list that violates this condition and add the number of elements between j and s to the count. For 30,000 elements in a list, this takes about 7 minutes to run. Is there any way to make it faster?
You are performing binary search for each (i,j) pair to find the corresponding value for k. Hence O(n^2 log(n)).
I can suggest an algorithm that will have the worst case time complexity of O(n^2).
Assume the list is sorted from left to right and elements are numbered from 1 to n. Then the pseudo code is:
for i = 1 to n - 2:
    j = i + 1
    find maximal k with binary search
    while j < k:
        j = j + 1
        find maximal k with linear search to the left, starting from last k position
The reason this has the worst case time complexity of O(n^2) and not O(n^3) is because the position k is monotonically decreasing. Thus even with linear scanning, you are not spending O(n) for each (i,j) pair. Rather, you are spending a total of O(n) time to scan for k for each distinct i value.
O(n^2) version implemented in Python (based on wookie919's answer):
def triplets(N, T):
    N = sorted(N)
    result = 0
    for i in range(len(N)-2):
        k = len(N)-1
        for j in range(i+1, len(N)-1):
            while k>=0 and N[i]+N[j]+N[k]>T:
                k-=1
            result += max(k, j)-j
    return result

import random
sample = random.sample(range(1000000), 30000)
print(triplets(sample, 500000))
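When experimenting with a scan like this, it can help to cross-check it against an obviously correct brute force on small inputs. The sketch below does that; the names count_triplets and count_triplets_brute are hypothetical, and the input is assumed to contain distinct integers, as in the question.

```python
from itertools import combinations


def count_triplets(values, threshold):
    """O(n^2) count: for each i, k only ever moves left as j moves right."""
    nums = sorted(values)
    n = len(nums)
    total = 0
    for i in range(n - 2):
        k = n - 1
        for j in range(i + 1, n - 1):
            while k > j and nums[i] + nums[j] + nums[k] > threshold:
                k -= 1
            if k <= j:
                break           # no valid k remains for this or any larger j
            total += k - j      # every index in (j, k] works as the third element
    return total


def count_triplets_brute(values, threshold):
    """O(n^3) reference: test every 3-element combination."""
    return sum(1 for t in combinations(values, 3) if sum(t) <= threshold)


data = [5, 1, 4, 2, 3]
print(count_triplets(data, 8), count_triplets_brute(data, 8))  # 4 4
```

Because the values are distinct, each 3-element combination corresponds to exactly one ordering a[i] < a[j] < a[k], so the two counts should agree.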

Interviewstreet's Insertion sort program

I tried to program Interviewstreet's Insertion sort challenge in Python, and my code is shown below.
The program runs fine up to some input size (I'm not sure exactly what), but returns incorrect output for larger inputs. Can anyone point out what I am doing wrong?
# This program tries to identify number of times swapping is done to sort the input array
"""
=>Get input values and print them
=>Get number of test cases and get inputs for those test cases
=>Complete Insertion sort routine
=>Add a variable to count the swapping's
"""
def sort_swap_times(nums):
    """ This function takes a list of elements and then returns the number of times
    swapping was necessary to complete the sorting
    """
    times_swapped = 0
    # perform the insertion sort routine
    for j in range(1, len(nums)):
        key = nums[j]
        i = j - 1
        while i >= 0 and nums[i] > key:
            # perform swap and update the tracker
            nums[i + 1] = nums[i]
            times_swapped += 1
            i = i - 1
        # place the key value in the position identified
        nums[i + 1] = key
    return times_swapped

# get the upper limit.
limit = int(input())
swap_count = []
# get the length and elements.
for i in range(limit):
    length = int(input())
    elements_str = input() # returns a single string
    # convert the given elements from str to int
    elements_int = list(map(int, elements_str.split()))
    # pass integer elements list to perform the sorting
    # get the number of times swapping was needed and append the return value to swap_count list
    swap_count.append(sort_swap_times(elements_int))
# print the swap counts for each input array
for x in swap_count:
    print(x)
Your algorithm is correct, but this is a naive approach to the problem and will give you a Time Limit Exceeded error on large test cases (i.e., len(nums) > 10000). Let's analyze the run-time complexity of your algorithm.
for j in range(1, len(nums)):
    key = nums[j]
    i = j - 1
    while i >= 0 and nums[i] > key:
        # perform swap and update the tracker
        nums[i + 1] = nums[i]
        times_swapped += 1
        i = i - 1
    # place the key value in the position identified
    nums[i + 1] = key
The number of steps required in the above snippet is proportional to 1 + 2 + .. + len(nums)-1, or len(nums)*(len(nums)-1)/2 steps, which is O(len(nums)^2).
Hint:
Use the fact that all values will be within [1,10^6]. What you are really doing here is finding the number of inversions in the list, i.e. find all pairs of i < j s.t. nums[i] > nums[j]. Think of a data structure that allows you to find the number of swaps needed for each insert operation in logarithmic time complexity. Of course, there are other approaches.
Spoiler:
Binary Indexed Trees
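For readers who want to see the spoiler worked out, here is a minimal sketch of counting inversions with a binary indexed (Fenwick) tree. It assumes all values are positive integers, as the hint states; the function name count_inversions is hypothetical.

```python
def count_inversions(nums):
    """Number of pairs i < j with nums[i] > nums[j], in O(n log(max value))."""
    if not nums:
        return 0
    size = max(nums)
    tree = [0] * (size + 1)   # 1-indexed Fenwick tree of value counts
    inversions = 0
    for seen, x in enumerate(nums):
        # Count earlier elements with value <= x (prefix sum over 1..x).
        at_most_x = 0
        i = x
        while i > 0:
            at_most_x += tree[i]
            i -= i & -i
        # The remaining earlier elements are greater than x: each is one shift.
        inversions += seen - at_most_x
        # Record x in the tree.
        i = x
        while i <= size:
            tree[i] += 1
            i += i & -i
    return inversions


print(count_inversions([2, 1, 3, 1, 2]))  # 4
print(count_inversions([1, 1, 1, 2, 2]))  # 0
```

Each element costs two logarithmic walks over the tree, compared with the linear inner loop of insertion sort.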
