Recovering Subsets in Subset Sum Problem - Not All Subsets Appear - python

I was brushing up on dynamic programming (DP) when I came across this problem. I managed to use DP to determine how many solutions there are in the subset sum problem.
def SetSum(num_set, num_sum):
    #Initialize DP matrix with base cases set to 1
    matrix = [[0 for i in range(0, num_sum+1)] for j in range(0, len(num_set)+1)]
    for i in range(len(num_set)+1): matrix[i][0] = 1
    for i in range(1, len(num_set)+1): #Iterate through set elements
        for j in range(1, num_sum+1): #Iterate through sum
            if num_set[i-1] > j: #When current element is greater than sum take the previous solution
                matrix[i][j] = matrix[i-1][j]
            else:
                matrix[i][j] = matrix[i-1][j] + matrix[i-1][j-num_set[i-1]]
    #Retrieve elements of subsets
    subsets = SubSets(matrix, num_set, num_sum)
    return matrix[len(num_set)][num_sum]
Based on Subset sum - Recover Solution, I used the following method to retrieve the subsets since the set will always be sorted:
def SubSets(matrix, num_set, num):
    #Initialize variables
    height = len(matrix)
    width = num
    subset_list = []
    s = matrix[0][num-1] #Keeps track of number until a change occurs
    for i in range(1, height):
        current = matrix[i][width]
        if current > s:
            s = current #keeps track of changing value
            cnt = i -1 #backwards counter, -1 to exclude current value already appended to list
            templist = [] #to store current subset
            templist.append(num_set[i-1]) #Adds current element to subset
            total = num - num_set[i-1] #Initial total will be sum - max element
            while cnt > 0: #Loop backwards to find remaining elements
                if total >= num_set[cnt-1]: #Takes current element if it is less than total
                    templist.append(num_set[cnt-1])
                    total = total - num_set[cnt-1]
                cnt = cnt - 1
            templist.sort()
            subset_list.append(templist) #Add subset to solution set
    return subset_list
However, since it is a greedy approach it only works when the max element of each subset is distinct. If two subsets have the same max element then it only returns the one with the larger values. So for elements [1, 2, 3, 4, 5] with sum of 10 it only returns
[1, 2, 3, 4] , [1, 4, 5]
When it should return
[1, 2, 3, 4] , [2, 3, 5] , [1, 4, 5]
I could add another loop inside the while loop to leave out each element but that would increase the complexity to O(rows^3) which can potentially be more than the actual DP, O(rows*columns). Is there another way to retrieve the subsets without increasing the complexity? Or to keep track of the subsets while the DP approach is taking place? I created another method that can retrieve all of the unique elements in the solution subsets in O(rows):
def RecoverSet(matrix, num_set):
    height = len(matrix) - 1
    width = len(matrix[0]) - 1
    subsets = []
    while height > 0:
        current = matrix[height][width]
        top = matrix[height-1][width]
        if current > top:
            subsets.append(num_set[height-1])
        if top == 0:
            width = width - num_set[height-1]
        height -= 1
    return subsets
Which would output [1, 2, 3, 4, 5]. However, getting the actual subsets from it seems like solving the subset problem all over again. Any ideas/suggestions on how to store all of the solution subsets (not print them)?

That's actually a very good question, but it seems mostly you got the right intuition.
The DP approach allows you to build a 2D table and essentially encode how many subsets sum up to the desired target sum, which takes time O(target_sum*len(num_set)).
Now if you want to actually recover all solutions, this is another story in the sense that the number of solution subsets might be very large, in fact much larger than the table you built while running the DP algorithm. If you want to find all solutions, you can use the table as a guide but it might take a long time to find all subsets. In fact, you can find them by going backwards through the recursion that defined your table (the if-else in your code when filling up the table). What do I mean by that?
Well, let's say you try to find the solutions with only the filled table at your disposal. The first thing to do, to tell whether there is a solution at all, is to check that the element at row len(num_set) and column num has a value > 0, indicating that at least one subset sums up to num. Now there are two possibilities. Either the last number in num_set is used in a solution, in which case we must check whether there is a subset of all the numbers except that last one which sums up to num-num_set[-1]. This is one possible branch of the recursion. Or the last number in num_set is not used in a solution, in which case we must check whether we can still sum up to num using all the numbers except that last one.
If you keep going you will see that the recovering can be done by doing the recursion backwards. By keeping track of the numbers along the way (so the different paths in the table that lead to the desired sum) you can retrieve all solutions, but again bear in mind that the running time might be extremely long because we want to actually find all solutions, not just know their existence.
This code should be what you are looking for recovering solutions given the filled matrix:
def recover_sol(matrix, set_numbers, target_sum):
    up_to_num = len(set_numbers)
    ### BASE CASES (BOTTOM OF RECURSION) ###
    # If the target_sum becomes negative or there is no solution in the matrix, then
    # return an empty list and inform that this solution is not a successful one
    if target_sum < 0 or matrix[up_to_num][target_sum] == 0:
        return [], False
    # If bottom of recursion is reached, that is, target_sum is 0, just return an empty list
    # and inform that this is a successful solution
    if target_sum == 0:
        return [], True
    ### IF NOT BASE CASE, NEED TO RECURSE ###
    # Case 1: last number in set_numbers is not used in solution --> same target but one item less
    s1_sols, success1 = recover_sol(matrix, set_numbers[:-1], target_sum)
    # Case 2: last number in set_numbers is used in solution --> target is lowered by item up_to_num
    s2_sols, success2 = recover_sol(matrix, set_numbers[:-1], target_sum - set_numbers[up_to_num-1])
    # If Case 2 is a success but bottom of recursion was reached
    # so that it returned an empty list, just set current sol as the current item
    if s2_sols == [] and success2:
        # The set of solutions is just the list containing one item (so this explains the list in list)
        s2_sols = [[set_numbers[up_to_num-1]]]
    # Else there are already solutions and it is a success, go through the multiple solutions
    # of Case 2 and add the current number to them
    else:
        s2_sols = [[set_numbers[up_to_num-1]] + s2_subsol for s2_subsol in s2_sols]
    # Join lists of solutions for both Cases, and set success value to True
    # if either case returns a successful solution
    return s1_sols + s2_sols, success1 or success2
For the full solution with matrix filling AND recovering of solutions you can then do
def subset_sum(set_numbers, target_sum):
    n_numbers = len(set_numbers)
    #Initialize DP matrix with base cases set to 1
    matrix = [[0 for i in range(0, target_sum+1)] for j in range(0, n_numbers+1)]
    for i in range(n_numbers+1):
        matrix[i][0] = 1
    for i in range(1, n_numbers+1): #Iterate through set elements
        for j in range(1, target_sum+1): #Iterate through sum
            if set_numbers[i-1] > j: #When current element is greater than sum take the previous solution
                matrix[i][j] = matrix[i-1][j]
            else:
                matrix[i][j] = matrix[i-1][j] + matrix[i-1][j-set_numbers[i-1]]
    return recover_sol(matrix, set_numbers, target_sum)[0]
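Since the number of solutions can be far larger than the table, it may be worth producing them lazily instead of building one big list. The following is only a sketch of that idea, not a replacement for the code above: the names build_count_table and iter_subsets are made up for illustration, it assumes Python 3 and strictly positive integers, and it walks the same table backwards by index instead of slicing the list.

```python
def build_count_table(set_numbers, target_sum):
    """Same counting table as above: table[i][j] = number of subsets of the
    first i numbers that sum to j (assumes strictly positive integers)."""
    n = len(set_numbers)
    table = [[0] * (target_sum + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = 1  # the empty subset always sums to 0
    for i in range(1, n + 1):
        item = set_numbers[i - 1]
        for j in range(1, target_sum + 1):
            table[i][j] = table[i - 1][j]
            if item <= j:
                table[i][j] += table[i - 1][j - item]
    return table


def iter_subsets(table, set_numbers, target_sum):
    """Yield every subset that sums to target_sum, one at a time."""
    def walk(i, remaining, chosen):
        if remaining == 0:
            yield list(chosen)  # copy, because chosen is mutated afterwards
            return
        if i == 0 or table[i][remaining] == 0:
            return  # the table says this branch cannot reach the target
        # Branch 1: number i-1 is not used
        yield from walk(i - 1, remaining, chosen)
        # Branch 2: number i-1 is used
        item = set_numbers[i - 1]
        if item <= remaining:
            chosen.append(item)
            yield from walk(i - 1, remaining - item, chosen)
            chosen.pop()

    yield from walk(len(set_numbers), target_sum, [])


numbers = [1, 2, 3, 4, 5]
table = build_count_table(numbers, 10)
for subset in iter_subsets(table, numbers, 10):
    print(subset)
```

Because the zero entries of the table prune dead branches, every path that is followed ends in a real solution, so the work is roughly proportional to the total size of the output. If you do need to store the subsets, list(iter_subsets(...)) collects them.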
Cheers!

Related

Creating data in loop subject to moving condition

I am trying to create a list of data in a for loop then store this list in a list if it satisfies some condition. My code is
import numpy as np
R = 10
lam = 1
proc_length = 100
L = 1
#Empty list to store lists
exponential_procs_lists = []
for procs in range(0,R):
    #Draw exponential random variables
    z_exponential = np.random.exponential(lam,proc_length)
    #Sort values to increase
    z_exponential.sort()
    #Insert 0 at start of list
    z_dat_r = np.insert(z_exponential,0,0)
    sum = np.sum(np.diff(z_dat_r))
    if sum < 5*L:
        exponential_procs_lists.append(z_dat_r)
which will store those of the R lists that satisfy the sum < 5L condition. My question is, what is the best way to store R lists where the sum of each list is less than 5L? The lists can be of different lengths, but they must satisfy the condition that the sum of the increments is less than 5*L. Any help much appreciated.
Okay so based on your comment, I take that you want to generate an exponential_procs_list, inside which every sublist has a sum < 5*L.
Well, I modified your code to chop the sublists as soon as the sum exceeds 5*L.
Edit : See answer history to see my last answer for the approach above.
Well, looking closer, notice you don't actually need the discrete difference array. You're finding the difference array, summing it up and checking whether the sum is < 5L, and if it is, you append the original array.
But notice this:
If your array is like so: [0, 0.00760541, 0.22281415, 0.60476231], its difference array would be [0.00760541 0.21520874 0.38194816].
If you add the first x terms of the difference array, you get the (x+1)th element of the original array (because the array starts at 0). So you really just need to keep the elements which are less than 5L:
import numpy as np
R = 10
lam = 1
proc_length = 5
L = 1
exponential_procs_lists = []
def chop(nums, target):
    good_list = []
    for num in nums:
        if num >= target:
            break
        good_list.append(num)
    return good_list
for procs in range(0,R):
    z_exponential = np.random.exponential(lam,proc_length)
    z_exponential.sort()
    z_dat_r = np.insert(z_exponential,0,0)
    good_list = chop(z_dat_r, 5*L)
    exponential_procs_lists.append(good_list)
You could probably also just do a binary search(for better time complexity) or use a filter lambda, that's up to you.
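For what it's worth, here is a rough sketch of the binary search variant using the standard library's bisect module. The name chop_sorted is just an illustration, and it relies on the input already being sorted in increasing order, which holds for z_dat_r above. NumPy's np.searchsorted does the same job directly on arrays.

```python
from bisect import bisect_left


def chop_sorted(nums, target):
    """Return the leading elements of a sorted sequence that are < target.

    bisect_left finds the index of the first element >= target in O(log n),
    so everything before that index is strictly below the target.
    """
    cut = bisect_left(nums, target)
    return list(nums[:cut])


# The filter alternative does not need sorted input, but it is O(n):
def chop_filter(nums, target):
    return [num for num in nums if num < target]


print(chop_sorted([0, 0.5, 1.2, 5.0, 6.3], 5))
print(chop_filter([0, 0.5, 1.2, 5.0, 6.3], 5))
```

On sorted input both functions keep the same elements as chop; the binary search only pays off when the lists are long.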

Generate combinations such that the total is always 100 and uses a defined jump value

I am looking to generate a list of combinations such that the total is always 100. The combinations have to be generated based on a jump value (similar to how we use it in range or loop).
The number of elements in each combination is based on the length of the parent_list. If the parent list has 10 elements, we need each list in the output to have 10 elements.
parent_list=['a','b','c', 'd']
jump=15
sample of expected output is
[[15,25,25,35],[30,50,10,10],[20,15,20,45]]
I used the solution given in this question, but it doesn't give the option to add the Jump parameter. Fractions are allowed too.
This program finds all combinations of n positive integers whose sum is total such that at least one of them is a multiple of jump. It works in a recursive way, setting jump to 1 if the current sequence already contains an element that's a multiple of the original jump.
def find_sum(n, total, jump, seq=()):
    if n == 0:
        if total == 0 and jump == 1: yield seq
        return
    for i in range(1, total+1):
        yield from find_sum(n - 1, total - i, jump if i % jump else 1, seq + (i,))
for seq in find_sum(4, 100, 15):
    print(seq)
There are still a lot of solutions.

Divide a python list into subsets of lists (the smaller the number of subsets the better), each with sum less then K

I am fairly new to python. I am making a program and am stuck with a problem that can be summed up as follows:
Let's say we have a list of numbers (each is less than 5) [1.5, 3, 4, 2.5 , 1, 4, 0.5 etc]. I want to divide this list into subsets, with the condition that the sum of the items in each subset is <= 5. The list can have up to 200 items.
The optimal solution would be the one that returns the smallest number of subsets. But I am not looking for an optimal solution, just a good enough one.
This is called the bin packing problem. It is a well-studied NP-complete problem, meaning that no known algorithm gives exact answers (i.e. with the true minimum number of sublists) while also running efficiently for larger inputs.
However, since you only need a "good" enough solution, you are in luck; there are many good heuristics which give quite good answers in practice. A nice simple one is the "First Fit Decreasing" algorithm:
1. Sort the items in descending order (i.e. largest first).
2. Initialise a list to store the sublists in. Initially, there are none.
3. For each item:
   - If there are any sublists with sufficient remaining capacity, insert the item into the first one.
   - Otherwise, create a new empty sublist, and insert the item there.
This turns out to always give solutions using at most (11/9)b + 1 sublists, where b is the number of sublists used by an optimal solution (Yue, 1990).
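The steps above could be sketched in Python roughly as follows. This is a minimal illustration rather than a tuned implementation: the function name first_fit_decreasing and its capacity parameter are my own choices, and with arbitrary floats you may want a small tolerance in the comparison.

```python
def first_fit_decreasing(items, capacity):
    """Pack items into sublists whose sums do not exceed capacity."""
    bins = []       # the sublists built so far
    remaining = []  # remaining capacity of each sublist, same order as bins
    for item in sorted(items, reverse=True):  # largest first
        if item > capacity:
            raise ValueError("item %r does not fit in any sublist" % (item,))
        for idx, free in enumerate(remaining):
            if item <= free:
                # first sublist with enough room takes the item
                bins[idx].append(item)
                remaining[idx] -= item
                break
        else:
            # no existing sublist had room: open a new one
            bins.append([item])
            remaining.append(capacity - item)
    return bins


print(first_fit_decreasing([1.5, 3, 4, 2.5, 1, 4, 0.5], 5))
```

The inner scan makes this O(n * number of sublists), which is no problem for a list of up to 200 items.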
I would contest that this is more of an algorithm problem than it is python-specific, but one algorithm that pops into my head and feels simple enough would be to sort the list and create "buckets" (sub-lists) that start with the max element, then add numbers from the front of the list until no more can be added.
In Python that might look something like this:
x = [1.5, 3, 4, 2.5 , 1, 4, 0.5]
x.sort()
buckets = []
while True:
    # if the list is empty, break
    if x == []:
        break
    last_elem = x.pop() # pop removes the last element and returns it
    new_bucket = [last_elem] # create a new bucket initially with just that
    new_bucket_sum = last_elem
    # for the remaining numbers
    num_added = 0
    for num in x:
        if num + new_bucket_sum > 5:
            break
        new_bucket.append(num) # add it to the sub-list
        new_bucket_sum += num # account for the sum
        num_added += 1 # increase our count for this iteration
    buckets.append(new_bucket) # add the bucket
    x = x[num_added:] # take a sub-list of x (getting rid of the numbers added)
    # Note that we now loop again until all numbers have been placed in to buckets
# After this while loop breaks, you have all the buckets
print(buckets)
This was my go-to instinct. There are more "pythonic" ways I'd say to write that algorithm but since you are new to Python I thought it may be helpful to break it up and comment. There also may be better algorithms out there. Cheers
Just thought to add that if the elements of the resulting list-of-lists MUST maintain their original ordering (with respect to the input list), then you can do this:
elts = [1.5, 3, 4, 2.5 , 1, 4, 0.5]
res = []
temp = [] # for accumulating the numbers
temp_sum = 0 # the sum of the accumulated numbers
for e in elts:
    temp_sum += e # update the sum with current element
    if temp_sum > 5:
        # if updating the sum with the current element
        # makes the sum overshoot the limit
        # then don't accumulate the current element
        # instead ...
        res.append(temp) # append the previously accumulated elements to the result
        temp = [e] # start a new accumulator with the current element
        temp_sum = e # start a new accumulated sum with the current element
    else:
        # if updating the sum with the current element
        # does not make the sum overshoot the limit ...
        temp.append(e) # accumulate current element
# finally, append the last seen accumulator to the result
res.append(temp)
The result, res, will be [[1.5, 3], [4], [2.5, 1], [4, 0.5]]
I liked the challenge, so I created a heuristic algorithm based on random sampling of the base list. It searches for the best solution until a preset number of iterations is reached:
import numpy as np
#base_randlist = np.random.random(200) * 5
base_randlist = np.array([1.5, 3, 4, 2.5 , 1, 4, 0.5])
print(base_randlist)
sets = []
for i in range(10000):
    set_ = []
    subset = []
    randlist = base_randlist
    while randlist.shape[0] != 0:
        while True:
            if randlist.shape[0] == 0:
                set_.append(subset)
                break
            ind = np.random.randint(0, randlist.shape[0])
            last_subset = subset.copy()
            subset.append(randlist[ind])
            if sum(subset) <= 5:
                randlist = np.delete(randlist, ind)
            else:
                set_.append(last_subset)
                subset = []
                break
    sets.append(set_)
min_setnum = np.inf
for i, s in enumerate(sets):
    if min_setnum > len(s):
        min_setnum = len(s)
        min_ind = i
print(sets[min_ind])
print(min_setnum)
Out:
[1.5 3. 4. 2.5 1. 4. 0.5]
[[3.0, 0.5], [1.5, 2.5], [4.0], [4.0, 1.0]]
4

Pythonic way of checking if indefinite # of consec elements in list sum to given value

Having trouble figuring out a nice way to get this task done.
Say I have a list of triangular numbers up to 1000 -> [0,1,3,6,10,15,..] etc.
Given a number, I want to return the consecutive elements in that list that sum to that number.
i.e.
64 --> [15,21,28]
225 --> [105,120]
371 --> [36, 45, 55, 66, 78, 91]
if there's no consecutive numbers that add up to it, return an empty list.
882 --> [ ]
Note that the length of consecutive elements can be any number - 3,2,6 in the examples above.
The brute force way would iteratively check every possible run of consecutive elements starting at each element (start at 0, look at the sum of [0,1], look at the sum of [0,1,3], etc. until the sum is greater than the target number). But that's probably O(n^2) or maybe worse. Any way to do it better?
UPDATE:
Ok, so a friend of mine figured out a solution that works at O(n) (I think) and is pretty intuitively easy to follow. This might be similar (or the same) to Gabriel's answer, but it was just difficult for me to follow and I like that this solution is understandable even from a basic perspective. this is an interesting question, so I'll share her answer:
from functools import reduce # built in on Python 2, must be imported on Python 3
def findConsec(input1 = 7735):
    list1 = range(1, 1001)
    newlist = [reduce(lambda x,y: x+y,list1[0:i]) for i in list1]
    curr = 0
    end = 2
    num = sum(newlist[curr:end])
    while num != input1:
        if num < input1:
            num += newlist[end]
            end += 1
        elif num > input1:
            num -= newlist[curr]
            curr += 1
        if curr == end:
            return []
    if num == input1:
        return newlist[curr:end]
A 3-iteration max solution
Another solution would be to start close to where your numbers should be and walk forward from one position behind. For any number in the triangular list vec, its value can be defined by its index as:
vec[i] = sum(range(0,i+1))
The target sum divided by the length of the group is the average of the group and hence lies within the group's range, although that average itself may not be an element of the list.
Therefore, you can set the starting point for finding a group of n numbers whose sum matches a value val at the integer part of val divided by n. As that value may not be in the list, take the position that minimizes the difference to it.
# Python 2 code (print statements, integer division)
# vec as np.ndarray -> the triangular or whatever-type series
# val as int -> sum of n elements you are looking after
# n as int -> number of elements to be summed
import numpy as np
def seq_index(vec,n,val):
    index0 = np.argmin(abs(vec-(val/n)))-n/2-1 # covers odd and even n values
    intsum = 0 # sum of which to keep track
    count = 0 # counter
    seq = [] # indices of vec that sum up to val
    while count<=2: # walking forward from the initial guess of where the group begins or prior to it
        intsum = sum(vec[(index0+count):(index0+count+n)])
        if intsum == val:
            seq.append(range(index0+count,index0+count+n))
        count += 1
    return seq
# Example
vec = []
for i in range(0,100):
    vec.append(sum(range(0,i))) # build the series from i = 0 (0) to i = 99 (sum(range(0,99)) = 4851)
vec = np.array(vec) # convert to numpy to make it easier to query ranges
# looking for a value that belongs to the interval 0-4851
indices = seq_index(vec,3,4)
# print indices
print indices[0]
print vec[indices]
print sum(vec[indices])
Returns
print indices[0] -> [1, 2, 3]
print vec[indices] -> [0 1 3]
print sum(vec[indices]) -> 4 (which we were looking for)
This seems like an algorithm question rather than a question on how to do it in python.
Thinking backwards I would copy the list and use it in a similar way to the Sieve of Eratosthenes. I would not consider the numbers that are greater than x. Then start from the greatest number and sum backwards. Then if I get greater than x, subtract the greatest number (exclude it from the solution) and continue to sum backward.
This seems the most efficient way to me and actually is O(n) - you never go back (or forward in this backward algorithm), except when you subtract or remove the biggest element, which doesn't need accessing the list again - just a temp var.
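To make the backward idea a bit more concrete, here is a rough sketch of how I read it. The function name find_consecutive_backward is made up for illustration, it assumes Python 3 and an ascending list of non-negative numbers, and it treats a single element as a valid run, so adapt it if you need at least two elements.

```python
def find_consecutive_backward(numbers, target):
    """Scan from the large end for a run of consecutive elements
    that sums to target. Returns the run, or [] if there is none."""
    # Skip the elements that are greater than the target
    hi = len(numbers) - 1
    while hi >= 0 and numbers[hi] > target:
        hi -= 1
    lo = hi + 1  # current window is numbers[lo:hi + 1], empty at first
    total = 0    # running sum of the window (the "temp var")
    while True:
        if total == target and lo <= hi:
            return numbers[lo:hi + 1]
        if total < target or lo > hi:
            # Too small (or empty window): extend backwards by one element
            if lo == 0:
                return []  # nothing left to add
            lo -= 1
            total += numbers[lo]
        else:
            # Too large: drop the greatest element of the window
            total -= numbers[hi]
            hi -= 1


triangular = [n * (n + 1) // 2 for n in range(45)]  # 0, 1, 3, ..., 990
print(find_consecutive_backward(triangular, 225))
```

Each step moves one of the two ends down by one position and neither ever moves back up, which is where the O(n) comes from. Note that when several runs qualify it may return a different one than a forward scan would: 64 is both 15 + 21 + 28 and 28 + 36, and this version finds [28, 36].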
To answer Dunes question:
Yes, there is a reason: to subtract the next largest element when a candidate sums to more than the target. Going from the first element, hitting a dead end would require accessing the list again, or the temporary solution list, to subtract a set of elements that sum to more than the next element to be added. You risk increasing the complexity by accessing more elements.
To improve efficiency in the cases where an eventual solution is at the beginning of the sequence you can search for the smaller and larger pair using binary search. Once a pair of 2 elements, smaller than x is found then you can sum the pair and if it sums larger than x you go left, otherwise you go right. This search has logarithmic complexity in theory. In practice complexity is not what it is in theory and you can do whatever you like :)
You should pick the first three elements and sum them, and then keep subtracting the first of the three and adding the next element in the list to see if the sum adds up to whatever number you want. That would be O(n).
# vec as np.ndarray, whatever as the target sum (both defined elsewhere)
import numpy as np
itsum = sum(vec[0:3]) # sum of the first three elements, updated as the window slides
sequence = [range(0,3)] if itsum == whatever else [] # indices of the list that add up to whatever (creation)
for i in range(3,len(vec)):
    itsum -= vec[i-3]
    itsum += vec[i]
    if itsum == whatever:
        sequence.append(range(i-2,i+1)) # list of sequences that add up to whatever
The solution you provide in the question isn't truly O(n) time complexity: the way you compute your triangle numbers makes the computation O(n^2). The list comprehension throws away the previous work that went into calculating the last triangle number. That is: tn_i = tn_(i-1) + i (where tn is a triangle number). Since you also store the triangle numbers in a list, your space complexity is not constant, but related to the size of the number you are looking for. Below is an identical algorithm, but with O(n) time complexity and O(1) space complexity (written for python 3).
# for python 2, replace things like `highest = next(high)` with `highest = high.next()`
from itertools import count, takewhile, accumulate
def find(to_find):
    # next(low) == lowest number in total
    # next(high) == highest number not in total
    low = accumulate(count(1)) # generator of triangle numbers
    high = accumulate(count(1))
    total = highest = next(high)
    # highest = highest number in the sequence that sums to total
    # definitely can't find solution if the highest number in the sum is greater than to_find
    while highest <= to_find:
        # found a solution
        if total == to_find:
            # keep taking numbers from the low iterator until we find the highest number in the sum
            return list(takewhile(lambda x: x <= highest, low))
        elif total < to_find:
            # add the next highest triangle number not in the sum
            highest = next(high)
            total += highest
        else: # if total > to_find
            # subtract the lowest triangle number in the sum
            total -= next(low)
    return []

Select items around a value in a sorted list with multiple repeated values

I'm trying to select some elements in a python list. The list represents a distribution of the sizes of some other elements, so it contains multiple repeated values.
After I find the average value of this list, I want to pick those elements whose value lies between an upper bound and a lower bound around that average value. I can do that easily, but it selects too many elements (mainly because the distribution I have to work with is pretty much homogeneous). So I would like to be able to choose the bounds for the values, but also limit the spread of the search to something like 5 elements below the average and 5 elements above.
I'll add my code (it is super simple).
avg_lists = sum_lists/len(lists)
num_list = len(lists)
if (int(num_list/10)%2 == 0):
    window_size = int(num_list/10)
else:
    window_size = int(num_list/10)-1
out_file = open('chosenLists', 'w+')
chosen_lists = []
for list in lists:
    if ((len(list) >= (avg_lists-window_size)) and (len(list)<=(avg_lists+window_size))):
        chosen_lists.append(list)
        out_file.write("%s\n" % list)
If you are allowed to use median instead of average then you can use this simple solution:
def select(l, n):
    assert n <= len(l)
    s = sorted(l) # sort the list
    i = (len(s) - n) // 2
    return s[i:i+n] # return sublist of n elements from the middle
print(select([1,2,3,4,5,1,2,3,4,5], 5)) # shows [2, 2, 3, 3, 4]
The function select returns n elements closest to the median.
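If you do need the average rather than the median, a similar trick might work: sort, locate where the average would sit with a binary search, and take at most k elements on each side. This is only a sketch under my own assumptions (Python 3, a hypothetical function name select_around_mean, plain numbers as input), not something taken from the question's code.

```python
from bisect import bisect_left


def select_around_mean(values, k):
    """Return up to k elements below the mean and up to k elements
    at or above it, taken from the sorted values."""
    s = sorted(values)
    mean = sum(s) / len(s)
    pos = bisect_left(s, mean)  # index of the first element >= mean
    return s[max(0, pos - k):pos + k]


print(select_around_mean([1, 2, 3, 4, 5, 1, 2, 3, 4, 5], 2))
```

For the sample list the mean is 3.0, so with k = 2 this keeps the two elements just below 3 and the two elements starting at 3. You could still apply your upper and lower bounds to the result afterwards if you want both limits.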
