I need a way of finding an exact target value as a sum of variables chosen from a population. The algorithm can return just the first solution or all of them. So we might have 10, 20, or 30 different numbers, and we sum some of them to reach a desired number. As an example, given the population -2, -1, 1, 2, 3, 5, 8, 10 and a target of 6, this can be made from 8 + (-2), 1 + 5, etc. I need at least 2 decimal places to be considered as well for accuracy, and ideally the sum of the chosen variables will match the target exactly.
Thanks for any advice and help on this:)
I built a model using the simplex method in Excel, but I need the solution in Python.
This is the subset sum problem, which is NP-complete.
There is a known pseudo-polynomial solution for it if the numbers are integers. In your case, you only need to consider numbers to the 2nd decimal place, so you can convert the problem to integers by multiplying everything by 100 (1), and then run the pseudo-polynomial algorithm.
It works quite nicely and efficiently if the range of numbers you have is fairly small (the complexity is O(n*W), where W is the sum of the numbers' absolute values).
Appendix:
The pseudo-polynomial-time solution is a dynamic programming adaptation of the following recursive formula:
k is the desired number
n is the total number of elements in the list.
// stop clause: Found a sum
D(k, i) = true | for all 0 <= i <= n
// Stop clause: failing attempt, cannot find sum in this branch.
D(x, n) = false | x != k
// Recursive step, either take the current element or skip it.
D(x, i) = D(x + arr[i], i+1) OR D(x, i+1)
Start from D(0,0)
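Here is a minimal Python sketch of this recursion with memoization, including the multiply-by-100 scaling for two decimal places (the helper names are mine):

from functools import lru_cache

def subset_sum_exists(numbers, target):
    # Scale to integers so two decimal places are handled exactly.
    arr = [round(x * 100) for x in numbers]
    k = round(target * 100)
    n = len(arr)

    @lru_cache(maxsize=None)
    def D(x, i):
        if x == k:       # stop clause: found the sum
            return True
        if i == n:       # stop clause: ran out of elements
            return False
        # recursive step: take arr[i] or skip it
        return D(x + arr[i], i + 1) or D(x, i + 1)

    return D(0, 0)

print(subset_sum_exists([-2, -1, 1, 2, 3, 5, 8, 10], 6))  # True

Memoizing on the partial sum x is what gives the pseudo-polynomial behavior when the values are bounded.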
If this is not the case, and the range of numbers is quite high, you might have to go with the brute-force solution of checking all possible subsets. This solution is of course exponential, running in O(2^n).
(1) Consider rounding if needed, but that's a simple preprocessing that doesn't affect the answer.
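For completeness, the exponential brute force over all subsets is a few lines with itertools.combinations; this variant (my own sketch, reusing the same scaling) returns every solution, as the question asks:

from itertools import combinations

def all_subset_sums(numbers, target):
    arr = [round(x * 100) for x in numbers]
    k = round(target * 100)
    hits = []
    for r in range(1, len(arr) + 1):
        for idx in combinations(range(len(arr)), r):
            if sum(arr[i] for i in idx) == k:
                hits.append([numbers[i] for i in idx])
    return hits

print(all_subset_sums([-2, -1, 1, 2, 3, 5, 8, 10], 6))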
I was going through the solutions to this problem found on leetcode.
The problem states:
Given an array strs of strings consisting only of 0s and 1s, and two
integers m and n.
Your task is to find the maximum number of strings that you can form
with the given m 0s and n 1s. Each 0 and 1 can be used at most once.
Input: strs = ["10","0001","111001","1","0"], m = 5, n = 3
Output: 4
Explanation: In total, 4 strings can be formed using 5 0s and 3 1s,
namely "10", "0001", "1", "0".
The algorithm used to solve the problem is below:
def findMaxForm(strs, m, n):
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for s in strs:
        zeros, ones = s.count('0'), s.count('1')
        for i in range(m, zeros - 1, -1):
            for j in range(n, ones - 1, -1):
                # dp[i][j]: the most strings formable with i zeros and j ones available
                dp[i][j] = max(1 + dp[i - zeros][j - ones], dp[i][j])
    # print(dp)
    return dp[-1][-1]
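For the sample input above, a quick check (the driver line is mine) prints 4:

print(findMaxForm(["10", "0001", "111001", "1", "0"], 5, 3))  # 4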
The confusing part of the solution is dp[i][j] = max(1 + dp[i - zeros][j - ones], dp[i][j]). I am not sure what is going on here. Why do we subtract zeros from i and ones from j?
I also found a diagram that explains how the dp table should look for every element in the array.
My Questions:
What does the first table represent? What are the x and y axes? Why are there so many 1's? I think if I understand this part, something might click. I would appreciate it if someone walked through the diagram.
Why does this approach give us the maximum number of strings that can be formed? I think I am still confused about this part: dp[i][j] = max(1 + dp[i - zeros][j - ones], dp[i][j]).
Also the solution is described as a "3d-DP optimized to 2D space: dp[j][k]: i dimension is optimized to be used in-place." What does that mean?
When you encounter a string s, you basically have two options: it either belongs to the maximal solution, or it doesn't.
If you take it, the size of the set increases by one, but you have fewer ones and zeros left to use. If you don't take it, the size of the set remains unchanged, and so does the number of ones and zeros left.
The table dp represents the maximal such set you can get so far for each number of ones and zeros "left". For example, dp[m][n] means the best value you can get with m zeros and n ones available. Similarly, dp[2][3] is the best you can do when 2 zeros and 3 ones remain for the rest of the strings.
Let's wrap it together:
For some given number of zeros (i) left to use, and some number of ones (j) left to use, and a string s:
1 + dp[i - zeros][j - ones] means the maximal set size if you decide to add s to the set (and you are left with fewer ones and zeros).
dp[i][j] means you are not taking this element and are moving on.
When you invoke max() on both values, you basically say: I want the better one out of these two options.
I hope this answers the first two questions, of why it is maximal and what the dp line means.
Also the solution is described as a "3d-DP optimized to 2D space:
dp[j][k]: i dimension is optimized to be used in-place." What does
that mean?
Here you have a 3D problem: the third dimension is the strings themselves, which you iterate over, but the array has no dimension for it. You can work in place because you only ever need the values from the previous string, never anything "older" than that, which saves precious space.
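For contrast, here is a sketch (mine, not from the original solution) of the unoptimized version with an explicit dimension for the string index. The 2D version above collapses dp[idx] and dp[idx - 1] into a single table, which is why it iterates i and j backwards:

def findMaxForm3D(strs, m, n):
    # dp[idx][i][j]: max strings chosen from strs[:idx] given i zeros and j ones
    L = len(strs)
    dp = [[[0] * (n + 1) for _ in range(m + 1)] for _ in range(L + 1)]
    for idx in range(1, L + 1):
        zeros, ones = strs[idx - 1].count('0'), strs[idx - 1].count('1')
        for i in range(m + 1):
            for j in range(n + 1):
                dp[idx][i][j] = dp[idx - 1][i][j]        # skip the string
                if i >= zeros and j >= ones:             # or take it
                    dp[idx][i][j] = max(dp[idx][i][j],
                                        1 + dp[idx - 1][i - zeros][j - ones])
    return dp[L][m][n]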
I'm trying to solve this task.
I wrote function for this purpose which uses itertools.product() for Cartesian product of input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal

    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides + 1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) the output is 0.0556, so it works fine.
But calling probability(10, 10, 50) takes a very long time to compute (hours?); there must be a better way. :)
The loop for side_combination in sides_combinations: takes too long to iterate through the huge number of sides_combinations.
Please, can you help me find out how to speed up the calculation? I want to sleep tonight...
I guess the problem is to find the distribution of the sum of dice. An efficient way to do that is via discrete convolution: the distribution of a sum of variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is a binary, associative operator, so you can conveniently compute it just two pmfs at a time: the distribution of the total so far, convolved with the next one in the list. From the final result you can read off the probability of each possible total: the first element is the probability of the smallest possible total, the last is the probability of the largest, and in between you can figure out which entry corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's a simple summation, but it's a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) FFT(B). But at this point I don't think you need to worry about that.
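Here is a sketch of that plan in Python (my own, using integer counts and normalizing at the end; the names are illustrative):

def convolve(a, b):
    # counts of sums: out[i + j] accumulates a[i] * b[j]
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def probability(dice_number, sides, target):
    die = [1] * sides          # counts for faces 1..sides
    counts = [1]               # zero dice: one way to make a total of 0
    for _ in range(dice_number):
        counts = convolve(counts, die)
    # counts[t] is the number of ways to total t + dice_number,
    # because face values start at 1, not 0
    idx = target - dice_number
    ways = counts[idx] if 0 <= idx < len(counts) else 0
    return ways / sides ** dice_number

print(probability(2, 6, 3))     # 0.0555...
print(probability(10, 10, 50))  # fast, no 10^10 iteration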
If someone is still interested in a solution that avoids the very, very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides
But you should add caching of the probability function's results; if you don't, the calculation will take a very, very long time as well.
P.S. This code is 100% not mine; I took it from the internet. I'm not smart enough to produce it myself. Hope you'll enjoy it as much as I do.
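As noted above, memoization makes the recursion fast; a sketch using functools.lru_cache (my addition):

from functools import lru_cache

@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(probability(10, 10, 50))  # returns almost instantly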
I have a finite metric space given as a (symmetric) k by k distance matrix. I would like an algorithm to (approximately) isometrically embed this in euclidean space R^(k-1). While it is not always possible to do exactly by solving the system of equations given by the distances I am looking for a solution that embeds with some (very small) controllable error.
I currently use multidimensional scaling (MDS) with the output dimension set to (k-1). It occurs to me that, in general, MDS is optimized for the situation where you are trying to reduce the ambient embedding dimension to something less than (k-1) (typically 2 or 3), and that there may be a better algorithm for my restricted case.
Question: What is a good/fast algorithm for realizing a metric space of size k in R^{k-1} using euclidean distance?
Some parameters and pointers:
(1) My k's are relatively small. Say 3 < k < 25
(2) I don't actually care if I embed in R^{k-1}. If it simplifies things/makes things faster any R^N would also be fine as long as it's isometric. I'm happy if there's a faster algorithm or one with less error if I increase to R^k or R^(2k+1).
(3) If you can point to a python implementation I'll be even happier.
(4) Anything better than MDS would work.
Why not try LLE? You can also find the code there.
OK, as promised, here is a simple solution:
Notation: Let d_{i,j}=d_{j,i} denote the squared distance between points i and j. Let N be the number of points. Let p_i denote the point i and p_{i,k} the k-th coordinate of the point.
Let's hope I derive the algorithm correctly now. There will be some explanations afterwards so you can check the derivation (I hate it when lots of indexes appear).
The algorithm builds the coordinates incrementally to arrive at the correct solution.
Initialization:
p_{i,k} := 0 for all i and k
Calculation:
for i = 2 to N do
    sum = 0
    for j = 2 to i - 1 do
        accu = d_{i,1} - d_{i,j}
        for k = 1 to j - 1 do
            accu = accu + p_{j,k}^2
        done
        for k = 1 to j - 2 do
            accu = accu - 2 p_{i,k} p_{j,k}
        done
        p_{i,j-1} = accu / ( 2 p_{j,j-1} )
        sum = sum + p_{i,j-1}^2
    done
    p_{i,i-1} = sqrt( d_{i,1} - sum )
done
If I have not done any grave index mistakes (I usually do) this should do the trick.
The idea behind this:
We set the first point arbitrarily at the origin to make our lives easier. Note that for a point p_i we never set a coordinate k when k > i-1. I.e. for the second point we only set the first coordinate, for the third point we only set the first and second coordinates, etc.
Now let's assume we have coordinates for all points p_j where j < i, and we want to calculate the coordinates for p_i so that all d_{i,j} with j < i are satisfied (we do not yet care about constraints involving later points). Let k' = i-1 be the number of coordinates p_i may use. If we write down the set of equations, we have
d_{i,j} = \sum_{k=1}^{N} (p_{i,k} - p_{j,k})^2
Because both p_{i,k} and p_{j,k} are equal to zero for k>k' we can reduce this to:
d_{i,j} = \sum_{k=1}^{k'} (p_{i,k} - p_{j,k})^2
Also note that by the loop invariant all p_{j,k} will be zero when k>j-1. So we split this equation:
d_{i,j} = \sum_{k=1}^{j-1} (p_{i,k} - p_{j,k})^2 + \sum_{k=j}^{k'} p_{i,k}^2
For the first equation (j = 1, where p_1 is the origin) we just get:
d_{i,1} = \sum_{k=1}^{k'} p_{i,k}^2
This one will need some special treatment later.
Now if we expand all the binomials in the general equation we get:
d_{i,j} = \sum_{k=1}^{j-1} ( p_{i,k}^2 - 2 p_{i,k} p_{j,k} + p_{j,k}^2 ) + \sum_{k=j}^{k'} p_{i,k}^2
subtract the first equation from this and you are left with:
d_{i,j} - d_{i,1} = \sum_{k=1}^{j-1} ( p_{j,k}^2 - 2 p_{i,k} p_{j,k} )
for all j > 1.
If you look at this you'll note that all squares of coordinates of p_i are gone and the only squares we need are already known. This is a set of linear equations that can easily be solved using methods from linear algebra. Actually there is one more special thing about this set of equations: The equations already are in triangular form, so you only need the final step of propagating the solutions. For the final step we are left with one single quadratic equation that we can just solve by taking one square root.
I hope you can follow my reasoning. It's a bit late and my head is a bit spinning from those indexes.
EDIT: Yes, there were indexing mistakes. Fixed. I'll try to implement this in python when I have time in order to test it.
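For reference, here is a compact NumPy sketch of the same idea (my own, not the answerer's promised implementation): place the first point at the origin, build the Gram matrix of inner products from the squared distances, and read the coordinates off a Cholesky factor. It assumes the metric is exactly realizable in Euclidean space, i.e. the Gram matrix is positive definite.

import numpy as np

def embed(D):
    # D: k x k symmetric matrix of pairwise distances
    D2 = np.asarray(D, dtype=float) ** 2
    # Law of cosines: G[a][b] = (d2(a,0) + d2(b,0) - d2(a,b)) / 2
    # gives the inner products of points 1..k-1 with point 0 at the origin.
    G = 0.5 * (D2[1:, [0]] + D2[[0], 1:] - D2[1:, 1:])
    # Rows of the lower-triangular Cholesky factor are the coordinates of
    # points 1..k-1 in R^(k-1); raises LinAlgError if G is not positive
    # definite (i.e. the metric is not Euclidean).
    L = np.linalg.cholesky(G)
    return np.vstack([np.zeros(L.shape[1]), L])

# Unit equilateral triangle: expect (0,0), (1,0), (0.5, 0.866...)
print(embed([[0, 1, 1], [1, 0, 1], [1, 1, 0]]))

For degenerate configurations (G positive semidefinite but singular), an eigendecomposition that keeps the nonnegative eigenvalues is the more forgiving route, and it degrades gracefully when the metric is only approximately Euclidean.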
This is a toned down version of a computer vision problem I need to solve. Suppose you are given parameters n,q and have to count the number of ways of assigning integers 0..(q-1) to elements of n-by-n grid so that for each assignment the following are all true
No two neighbors (horizontally or vertically) get the same value.
Value at position (i,j) is 0
Value at position (k,l) is 0
Since (i,j,k,l) are not given, the output should be an array of the evaluations above, one for every valid setting of (i,j,k,l).
A brute force approach is below. The goal is to get an efficient algorithm that works for q<=100 and for n<=18.
def tuples(n, q):
    return [[a] + b for a in range(q) for b in tuples(n - 1, q)] if n > 1 else [[a] for a in range(q)]

def isvalid(t, n):
    grid = [t[n * i:n * (i + 1)] for i in range(n)]
    for r in range(n):
        for c in range(n):
            v = grid[r][c]
            left = grid[r][c - 1] if c > 0 else -1
            right = grid[r][c + 1] if c < n - 1 else -1
            top = grid[r - 1][c] if r > 0 else -1
            bottom = grid[r + 1][c] if r < n - 1 else -1
            if v == left or v == right or v == top or v == bottom:
                return False
    return True

def count(n, q):
    result = []
    for pos1 in range(n ** 2):
        for pos2 in range(n ** 2):
            total = 0
            for t in tuples(n ** 2, q):
                if t[pos1] == 0 and t[pos2] == 0 and isvalid(t, n):
                    total += 1
            result.append(total)
    return result
assert count(2,2)==[1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
Update 11/11
I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 3 hours for n=10, any q, from author's estimate)
Maybe this sounds too simple, but it works. Randomly distribute values to all the cells until only two are empty. Test all values for adjacency violations. Keep computing the running percentage of successful trials versus all trials until the variance drops within an acceptable margin.
The risk goes to zero, and all that is at risk is a little runtime.
This isn't an answer, just a contribution to the discussion which is too long for a comment.
tl;dr: Any algorithm which boils down to "compute the possibilities and count them", such as Eric Lippert's or a brute-force approach, won't work for @Yaroslav's goal of q <= 100 and n <= 18.
Let's first think about a single n x 1 column. How many valid numberings of this one column exist? For the first cell we can pick between q numbers. Since we can't repeat vertically, we can pick between q - 1 numbers for the second cell, and therefore q - 1 numbers for the third cell, and so on. For q == 100 and n == 18 that means there are q * (q - 1) ^ (n - 1) = 100 * 99 ^ 17 valid colorings which is very roughly 10 ^ 36.
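That magnitude is easy to sanity-check in Python:

import math

# valid colorings of a single 18 x 1 column with q = 100
print(math.log10(100 * 99 ** 17))  # ~35.9, i.e. roughly 10^36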
Now consider any two valid columns (call them the bread columns) separated by a buffer column (call it the mustard column). Here is a trivial algorithm to find a valid set of values for the mustard column when q >= 4. Start at the top cell of the mustard column. We only have to worry about the adjacent cells of the bread columns which have at most 2 unique values. Pick any third number for the mustard column. Consider the second cell of the mustard column. We must consider the previous mustard cell and the 2 adjacent bread cells with a total of at most 3 unique values. Pick the 4th value. Continue to fill out the mustard column.
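A sketch of that greedy fill (my own illustration; it assumes q >= 4 and 0-based values):

def fill_mustard(left, right, q):
    # Greedily color the mustard column: each cell must differ from its
    # left and right bread neighbors and from the mustard cell above it.
    column, above = [], None
    for l, r in zip(left, right):
        forbidden = {l, r, above}          # at most 3 distinct values
        column.append(next(v for v in range(q) if v not in forbidden))
        above = column[-1]
    return column

print(fill_mustard([0, 1, 0], [2, 0, 1], 4))  # e.g. [1, 2, 3]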
We have at most 2 columns containing a hard coded cell of 0. Using mustard columns, we can therefore make at least 6 bread columns, each with about 10 ^ 36 solutions for a total of at least 10 ^ 216 valid solutions, give or take an order of magnitude for rounding errors.
There are, according to Wikipedia, about 10 ^ 80 atoms in the universe.
Therefore, be cleverer.
Update 11/11 I've also asked this on TopCoder forums, and their solution is the most efficient one I've seen so far (about 41 hours for n=10, any q, from author's estimate)
I'm the author. Not 41, just 3 embarrassingly parallelizable CPU hours. I've counted symmetries. For n=10 there are only 675 really distinct pairs of (i,j) and (k,l). My program needs ~16 seconds for each.
This builds on Dave Aaron Smith's contribution to the discussion.
Let's not consider for now the last two constraints ((i,j) and (k,l)).
With only one column (nx1) the solution is q * (q - 1) ^ (n - 1).
How many choices are there for a second column? There are (q-1) for the top cell (1,2), but then q-1 or q-2 for the cell (2,2), depending on whether (1,2) and (2,1) have the same color or not.
Same thing for (3,2): q-1 or q-2 choices.
We can see we have a binary tree of possibilities, and we need to sum over that tree. Let's say the left child is always "same color on top and at left" and the right child is "different colors".
By computing, over the tree, the number of ways the left column can create each such configuration and the number of choices for the new cells we are coloring, we can count the number of ways to color two columns.
But let's now consider the probability distribution of the coloring of the second column: if we want to iterate the process, we need a uniform distribution on the second column. It should be as if the first column never existed, so that among all colorings of the first two columns we can say things like "1/q of them have color 0 in the top cell of the second column".
Without a uniform distribution this would be impossible.
The problem: is the distribution uniform?
Answer:
We would have obtained the same number of solutions by building the second column first, then the first one, and then the third one. The distribution of the second column is uniform in that case, so it also is in the first case.
We can now apply the same "tree idea" to count the number of possibilities for the third column.
I will try to develop this further and build a general formula (since the tree is of size 2^n we don't want to explore it explicitly).
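For small parameters, the column counts are easy to check by brute force (a throwaway sketch of my own):

from itertools import product

def valid_columns(n, q):
    # colorings of an n x 1 column with no equal vertical neighbors
    return [c for c in product(range(q), repeat=n)
            if all(c[i] != c[i + 1] for i in range(n - 1))]

def two_column_count(n, q):
    # pairs of valid columns that also differ in every row
    cols = valid_columns(n, q)
    return sum(all(a[i] != b[i] for i in range(n))
               for a in cols for b in cols)

n, q = 3, 3
print(len(valid_columns(n, q)), q * (q - 1) ** (n - 1))  # both print 12
print(two_column_count(n, q))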
A few observations which might help other answerers as well:
The values 1..q are interchangeable - they could be letters and the result would be the same.
The constraint that no neighbours match is a very mild one, so a brute-force approach will be excessively expensive. Even if you knew the values in all but one cell, there would still be at least q-4 possibilities for q>4.
The output of this will be pretty long: every setting of (i,j,k,l) needs a line. The number of combinations is something like n^2(n^2-3), since the two fixed zeroes can be anywhere except adjacent to each other, unless they need not obey the first rule. For n=18, the maximally hard case, this is ~18^4, i.e. roughly 10^5. So that's your minimum output size, and it is unavoidable as the problem is currently stated.
There are simple cases: when q=2, there are only the two possible checkerboards, so for any given pair of zeroes the answer is 1 or 0, depending on whether the two positions have the same checkerboard parity.
Point 3 makes the whole program at least O(n^2(n^2-3)), and also suggests that you will need something reasonably efficient for each pair of zeroes, as even writing ~10^5 lines without any computation takes a while. For reference, at a second per line that is ~10^5 s, i.e. more than a day, or a couple of hours on a 12-core box.
I suspect that there is an elegant answer given a pair of zeroes, but I'm not sure that there is an analytic solution to it. Given that you can do it with 2 or 3 colours depending on the positions of the zeroes, you could split the map into a series of regions, each of which uses only 2 or 3 colours, and then it's just the number of different combinations of 2 or 3 in q (qC2 or qC3) for each region times the number of regions, times the number of ways of splitting the map.
I'm not a mathematician, but it occurs to me that there ought to be an analytical solution to this problem, namely:
First, compute how many different colourings are possible for an NxN board with Q colours (subject to the rule that neighbours, defined as cells sharing a common edge, don't get the same colour). This ought to be a pretty simple formula.
Then figure out how many of these solutions have 0 in (i,j); this should be a 1/Q fraction of them.
Then figure out how many of the remaining solutions have 0 in (k,l), depending on the Manhattan distance |i-k|+|j-l|, and possibly the distance to the board edge and the "parity" of these distances, as in distance divisible by 2, by 3, by Q.
The last part is the hardest, though I think it might still be doable if you are really good at math.