Find the best approximation of a subset for a value


I would like an algorithm that gives me the best approximation of a value using elements of a given set.
Here is an example:
N = 45
subset = [25,10,65,9,8]
output: [25,10,9]
The important point is that the algorithm must give the best approximation, regardless of the number of elements in the final result. The result must be the combination whose sum is exactly the value or, failing that, as close to it as possible without exceeding it.
Do you know an algorithm that could do that with minimal time cost?
Thanks a lot for your help.

You cannot do so in polynomial time (unless P=NP).
Finding out whether there is a subset with sum exactly N is clearly no harder than finding the subset with sum closest to N, and the former problem is called subset-sum, which is known to be NP-complete.
However, pseudo-polynomial time is possible. In fact, your problem is exactly the 0/1 knapsack optimization problem if we take the numbers in subset to be both the values and the weights in the translation to knapsack. This 0/1 knapsack problem has a dynamic programming solution that runs in O(nW), where n is the number of items in subset and W is the target, which is N in your code.
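The dynamic programming idea can be sketched as follows. This is a minimal sketch, not a tuned implementation: it assumes the inputs are positive integers, and the name closest_subset is made up for illustration. It keeps one witness subset for every reachable sum that does not exceed the limit, so it does at most O(nW) work.

```python
def closest_subset(values, limit):
    """Return (subset, total) with total as large as possible but <= limit.

    Assumes the values are positive integers.
    """
    # best[s] holds one subset (as a list) whose sum is exactly s
    best = {0: []}
    for v in values:
        # iterate over a snapshot so that each value is used at most once
        for s, chosen in list(best.items()):
            t = s + v
            if t <= limit and t not in best:
                best[t] = chosen + [v]
    top = max(best)
    return best[top], top


print(closest_subset([25, 10, 65, 9, 8], 45))  # ([25, 10, 9], 44)
```

The dictionary never holds more than limit + 1 entries, which is where the dependence on W comes from.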

The following code works for short lists. However, performance degrades significantly for longer lists, because every combination is enumerated:
import itertools
def closest(my_list, my_number):
    l = []
    # collect every non-empty combination together with its sum
    for i in range(1, len(my_list) + 1):
        for k in itertools.combinations(my_list, i):
            l.append([k, sum(k)])
    # keep only the combinations that do not exceed the target
    l = [i for i in l if i[1] <= my_number]
    l.sort(key=lambda x: x[1])
    return l[-1]
subset = [25, 10, 65, 9, 8]
best, total = closest(subset, 45)
print(best, total)
Output:
(25, 10, 9) 44

Related

Algorithm-finding-dedicated-sum-from-the-population-of-variables

I need a way of finding an exact value made up of the sum of variables chosen from a population. The algorithm can find just the first solution or all of them. So we can have 10, 20, or 30 different numbers, and we will sum some of them to get a desired number. As an example, we have a population of the following numbers: -2,-1,1,2,3,5,8,10 and we try to get 6 - this can be made of 8 and -2, 1 + 5 etc. I also need to consider at least 2 decimal places for accuracy, and ideally the sum of variables will exactly equal the requested value.
Thanks for any advice and help on this:)
I built a model using the simplex method in Excel, but I need the solution in Python.

This is the subset sum problem, which is an NP Complete problem.
There is a known pseudo-polynomial solution for it, if the numbers are integers. In your case, you need to consider numbers only to the 2nd decimal place, so you could convert the problem into integers by multiplying by 100 (1), and then run the pseudo-polynomial algorithm.
It will work quite nicely and efficiently if the range of numbers you have is quite small (complexity is O(n*W), where W is the sum of the numbers in absolute value).
Appendix:
Pseudo polynomial time solution is Dynamic Programming adaptation of the following recursive formula:
k is the desired number
n is the total number of elements in list.
// stop clause: Found a sum
D(k, i) = true | for all 0 <= i <= n
// Stop clause: failing attempt, cannot find sum in this branch.
D(x, n) = false | x != k
// Recursive step, either take the current element or skip it.
D(x, i) = D(x + arr[i], i+1) OR D(x, i+1)
Start from D(0,0)
If this is not the case, and the range of numbers is quite large, you might have to go with the brute-force solution of checking all possible subsets. This solution is of course exponential: it runs in O(2^n).
(1) Consider rounding if needed, but that's a simple preprocessing that doesn't affect the answer.
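As an illustration of the scaling trick, here is a minimal sketch that converts the inputs to integers and then records every reachable sum together with one subset that produces it. The function name find_exact_sum and the decimals parameter are made up for this example; it returns the first solution found, or None when no subset matches. Note that a target of 0 is trivially matched by the empty subset.

```python
def find_exact_sum(numbers, target, decimals=2):
    """Return one sub-list of numbers that sums to target, or None."""
    scale = 10 ** decimals
    # work on integers so that sums can be compared exactly
    ints = [round(x * scale) for x in numbers]
    goal = round(target * scale)
    # reachable[s] = indices of one subset whose scaled sum is s
    reachable = {0: []}
    for idx, v in enumerate(ints):
        # snapshot: each number may be used at most once
        for s, chosen in list(reachable.items()):
            t = s + v
            if t not in reachable:
                reachable[t] = chosen + [idx]
        if goal in reachable:
            return [numbers[i] for i in reachable[goal]]
    return None


solution = find_exact_sum([-2, -1, 1, 2, 3, 5, 8, 10], 6)
print(solution, sum(solution))
```

The number of distinct sums stored is bounded by the spread of the scaled values, which is the W in the O(n*W) estimate.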

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote function for this purpose which uses itertools.product() for Cartesian product of input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal
    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides+1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) the output is 0.0556, so it works fine.
But calling probability(10, 10, 50) takes a very long time (hours?), and there must be a better way:)
The loop for side_combination in sides_combinations: takes too long to iterate through the huge number of sides_combinations (10**10 of them in this case).
Please, can you help me find out how to speed up the calculation? I want to sleep tonight.

I guess the problem is to find the distribution of the sum of dice. An efficient way to do that is via discrete convolution. The distribution of the sum of independent variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is associative, so you can compute it conveniently just two pmfs at a time (the current distribution of the total so far, and the next one in the list). Then from the final result, you can read off the probabilities for each possible total. The first element in the result is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between you can figure out which one corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's just a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) FFT(B). But at this point I don't think you need to worry about that.
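A minimal sketch of the convolution approach, using integer counts so the arithmetic stays exact until the final division. The names convolve and dice_sum_probability are made up for this example; index k of the result corresponds to the total dice_number + k.

```python
def convolve(a, b):
    """Discrete convolution of two sequences of counts."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def dice_sum_probability(dice_number, sides, target):
    one_die = [1] * sides   # index 0 stands for face 1, index sides-1 for face sides
    counts = [1]            # zero dice: exactly one way to reach the empty total
    for _ in range(dice_number):
        counts = convolve(counts, one_die)
    # counts[k] is the number of ways to roll a total of dice_number + k
    k = target - dice_number
    if k < 0 or k >= len(counts):
        return 0.0
    return counts[k] / sides ** dice_number


print(round(dice_sum_probability(2, 6, 3), 4))  # 0.0556
print(dice_sum_probability(10, 10, 50))
```

Each convolution touches at most (dice_number * sides) * sides pairs, so the 10-dice case needs on the order of thousands of operations rather than 10**10.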

If someone is still interested in a solution that avoids the very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    # base case: a single die shows each face from 1 to sides with probability 1/sides
    if dice_number == 1:
        return (1 <= target <= sides**dice_number) / sides
    # condition on the value x shown by one die and recurse on the remaining dice
    return sum([probability(dice_number-1, sides, target-x)
                for x in range(1, sides+1)]) / sides
But you should add caching of the probability function's results; without it, the calculation will take a very long time as well.
P.S. This code is 100% not mine; I took it from the internet. I'm not smart enough to produce it myself. I hope you'll enjoy it as much as I did.
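One way to add the caching mentioned above is functools.lru_cache from the standard library. This is a sketch of the same recursion with memoization; the only changes are the decorator and a generator expression instead of a list.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    # base case: a single die shows each face 1..sides with probability 1/sides
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    # condition on the face x of one die, reuse cached results for the rest
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides


print(round(probability(2, 6, 3), 4))  # 0.0556
print(probability(10, 10, 50))
```

With the cache, each distinct (dice_number, target) pair is computed only once, so the number of evaluations grows with dice_number times the range of reachable targets instead of sides ** dice_number.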

Algorithm to calculate point at which to round values in an array up or down in order to least affect the mean

Consider a random array of values between 0 and 1 such as:
[0.1,0.2,0.8,0.9]
Is there a way to calculate the point at which the values should be rounded down or up to an integer so that the mean of the rounded array matches the mean of the un-rounded array as closely as possible? (In the above case it would be at the mean, but that is purely a coincidence.)
Or is it just trial and error?
I'm coding in Python.
Thanks for any help.

Add them up, then round the sum. That's how many 1s you want. Round so you get that many 1s.
def rounding_point(l):
    # if the input is sorted, you don't need the following line
    l = sorted(l)
    ones_needed = int(round(sum(l)))
    # this may require adjustment if there are duplicates in the input
    # values >= the returned point round up; with no 1s needed, l[-0] would be l[0], so return 1.0
    return 1.0 if ones_needed == 0 else l[-ones_needed]
If sorting the list turns out to be too expensive, you can use a selection algorithm like quickselect. Python doesn't come with a quickselect function built in, though, so don't bother unless your inputs are big enough that the asymptotic advantage of quickselect outweighs the constant factor advantage of the highly-optimized C sorting algorithm.
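To show how such a rounding point would be used, here is a small self-contained sketch. The helper names rounding_threshold and round_at are made up for this example, and it assumes every value at or above the threshold rounds up to 1 while everything below rounds down to 0. As noted above, duplicates at the threshold would need extra handling.

```python
def rounding_threshold(values):
    ordered = sorted(values)
    ones_needed = int(round(sum(ordered)))
    if ones_needed == 0:
        # nothing should round up; every value is below 1.0 in this case
        return 1.0
    # the ones_needed largest values are exactly those >= this element
    return ordered[-ones_needed]


def round_at(values, threshold):
    return [1 if v >= threshold else 0 for v in values]


values = [0.1, 0.2, 0.8, 0.9]
threshold = rounding_threshold(values)
rounded = round_at(values, threshold)
print(threshold, rounded)                  # 0.8 [0, 0, 1, 1]
print(sum(values) / len(values), sum(rounded) / len(rounded))
```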

Python - Sum of numbers

I am trying to sum all the numbers up to a range, with all the numbers up to the same range.
I am using python:
limit = 10
sums = []
for x in range(1,limit+1):
    for y in range(1,limit+1):
        sums.append(x+y)
This works just fine, however, because of the nested loops, if the limit is too big it will take a lot of time to compute the sums.
Is there any way of doing this without a nested loop?
(This is just a simplification of something that I need to do to solve a ProjectEuler problem. It involves obtaining the sum of all abundant numbers.)

[x + y for x in range(limit + 1) for y in range(x + 1)]
This still performs just as many calculations but will do it about twice as fast as a for loop.
from itertools import combinations
(a + b for a, b in combinations(range(n + 1), 2))
This avoids a lot of duplicate sums. I don't know if you want to keep track of those or not.
If you just want every sum with no representation of how you got it, then range(2*n + 1)
gives you what you want with no duplicates or looping at all.
In response to question:
[x + y for x in set1 for y in set2]

"I am trying to sum all the numbers up to a range, with all the numbers up to the same range."
So you want to compute limit**2 sums.
"because of the nested loops, if the limit is too big it will take a lot of time to compute the sums."
Wrong: it's not "because of the nested loops" -- it's because you're computing a quadratic number of sums, and therefore doing a quadratic amount of work.
"Is there any way of doing this without a nested loop?"
You can mask the nesting, as in @aaron's answer, and you can halve the number of sums you compute thanks to the problem's symmetry (though that doesn't do the same thing as your code), but, to prepare a list with a quadratic number of items, there's absolutely no way to avoid doing a quadratic amount of work.
However, for your stated purpose
"obtaining the sum of all abundant numbers."
you'd need an infinite amount of work, since there's an infinity of abundant numbers;-).
I think you have in mind problem 23, which is actually very different: it asks for the sum of all numbers that cannot be expressed as the sum of two abundant numbers. How the summation you're asking about would help you move closer to that solution really escapes me.

I'm not sure there is a good way to do this without nested loops.
If I were in your shoes, I would write it as follows:
[x+y for x in range(1,limit+1) for y in range(1,limit+1)]

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well-known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1).
To create the alias lookup:
1. Normalize the weights such that they sum to 1.0: (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2). This is the probability of choosing each item.
2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition: (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2) left to be assigned.
4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
5. Repeat steps 3 and 4 until none of the weight from the original list remains to be assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
1. Get a U(0,1) random number, say binary 0.001100000.
2. Bit-shift it left by lg2(|p|) to find the partition index. Here we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.

A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    # key log(u) / w orders items the same way as u ** (1 / w); keep the m largest keys
    elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
    return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.

Here's what I came up with for weighted selection without replacement:
import random
def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m), where m is the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random
def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []
    items = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    # search the plain list of running totals: in Python 3 a float cannot be compared with a tuple
    return [items[bisect.bisect(cum_weights, random.random() * total_weight)] for _ in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.

I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.

It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
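import random
def prep(weights):
    data_sz = len(weights)
    # scale the weights so that the average bucket holds exactly 1.0
    factor = data_sz/float(sum(weights))
    data = [[w*factor, i] for i,w in enumerate(weights)]
    # big is the index of the first overfull bucket
    big=0
    while big<data_sz and data[big][0]<=1.0: big+=1
    for small,bucket in enumerate(data):
        # a bucket whose partner index was changed has already been processed
        # (compare with != rather than "is not": identity of int objects is not guaranteed)
        if bucket[1] != small: continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big==data_sz: break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if (excess >= 0):
                big+=1
                while big<data_sz and data[big][0]<=1: big+=1
    return data
def sample(data):
    r=random.random()*len(data)
    idx = int(r)
    # the fractional part decides between the bucket itself and its partner
    return data[idx][1] if r-idx > data[idx][0] else idx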
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2]
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
    samples[sample(data)]+=1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))

The following is a description of random weighted selection of an element of a set (or multiset, if repeats are allowed), both with and without replacement, in O(n) space and O(log n) time per selection (assuming the tree is kept balanced).
It consists of implementing a binary search tree, sorted by the elements to be selected, where each node of the tree contains:
the element itself (element),
the un-normalized weight of the element (elementweight),
the sum of all the un-normalized weights of the left-child node and all of its children (leftbranchweight), and
the sum of all the un-normalized weights of the right-child node and all of its children (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A rough description of the algorithm follows. The algorithm is given a node of the tree. Then the values of leftbranchweight, rightbranchweight, and elementweight of node are summed, and the weights are divided by this sum, resulting in the values leftbranchprobability, rightbranchprobability, and elementprobability, respectively. Then a random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
    return the element (for selection without replacement, first remove it from the BST as normal, updating leftbranchweight and rightbranchweight of all the necessary nodes)
else if the number is less than (elementprobability + leftbranchprobability)
    recurse on leftchild (run the algorithm using leftchild as node)
else
    recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update the relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation of a BST is not attempted here; rather, it is hoped that this answer will help those who really need fast weighted selection without replacement (like I do).
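A rough sketch of the descent described above. To keep it short it does not implement a real BST: the tree is stored in an array (node i has children 2*i + 1 and 2*i + 2), each node keeps its own weight plus the total weight of its subtree, and "removal" is done by zeroing a weight rather than deleting a node. The class name WeightedTree and its methods are made up for this example.

```python
import random


class WeightedTree:
    """Array-backed binary tree: node i has children 2*i + 1 and 2*i + 2."""

    def __init__(self, items, weights):
        self.items = list(items)
        self.weights = [float(w) for w in weights]
        self.subtree = [0.0] * len(self.items)
        # fill subtree totals bottom-up: own weight plus both branch weights
        for i in reversed(range(len(self.items))):
            self._refresh(i)

    def _branch(self, i):
        return self.subtree[i] if i < len(self.items) else 0.0

    def _refresh(self, i):
        self.subtree[i] = (self.weights[i] + self._branch(2 * i + 1)
                           + self._branch(2 * i + 2))

    def sample(self, remove=False):
        if not self.items or self.subtree[0] <= 0.0:
            raise ValueError("no weight left to sample from")
        i = 0
        while True:
            left, right = 2 * i + 1, 2 * i + 2
            r = random.random() * self.subtree[i]  # fresh draw at every node
            if r < self.weights[i]:
                break                               # pick this node's element
            if r < self.weights[i] + self._branch(left):
                i = left
            elif self._branch(right) > 0.0:
                i = right
            elif self._branch(left) > 0.0:          # guard against round-off
                i = left
            else:
                break
        if remove:
            # zero the weight, then repair the totals on the path to the root
            self.weights[i] = 0.0
            j = i
            while True:
                self._refresh(j)
                if j == 0:
                    break
                j = (j - 1) // 2
        return self.items[i]


tree = WeightedTree(["a", "b", "c", "d"], [1, 2, 3, 4])
print([tree.sample() for _ in range(5)])             # with replacement
print([tree.sample(remove=True) for _ in range(4)])  # without replacement
```

Because the array layout is always balanced, both the descent and the weight update touch O(log n) nodes.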

This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice (added in NumPy 1.7.0) allows the sampling to be done with or without replacement and with given weights.

Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice

We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine the following probabilities for each candidate:
0.1
0.1
0.8
After 1'000'000 selections of 2 out of 3 without replacement, the observed share of each candidate became:
0.254315
0.256755
0.488930
You should know that those original probabilities are not achievable for a 2-of-3 selection without replacement: each candidate can appear at most once in a selection of two, so no candidate's share can exceed 1/2.
But we want the initial probabilities to be the profit distribution probabilities; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help us: randomly draw more than K times from the N candidates, and also store the weight of each validator for reward distribution:
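The observed shares can be checked exactly rather than by simulation. The sketch below assumes the selection is done by successive draws without replacement, each proportional to the remaining probabilities; the function name inclusion_shares is made up for this example. It enumerates every ordered selection of k candidates and accumulates how often each candidate is included.

```python
from itertools import permutations


def inclusion_shares(probs, k):
    """Share of the k selected slots that each candidate receives on average."""
    n = len(probs)
    included = [0.0] * n
    for order in permutations(range(n), k):
        p = 1.0
        remaining = sum(probs)
        # probability of drawing exactly this sequence without replacement
        for i in order:
            p *= probs[i] / remaining
            remaining -= probs[i]
        for i in order:
            included[i] += p
    # each selection fills k slots, so divide to get shares that sum to 1
    return [x / k for x in included]


print([round(s, 4) for s in inclusion_shares([0.1, 0.1, 0.8], 2)])
# [0.2556, 0.2556, 0.4889]
```

These values agree with the empirical figures above to within sampling noise.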
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657
