Allocate an integer randomly across k bins - python

I'm looking for an efficient Python function that randomly allocates an integer across k bins.
That is, some function allocate(n, k) will produce a k-sized array of integers summing to n.
For example, allocate(4, 3) could produce [4, 0, 0], [0, 2, 2], [1, 2, 1], etc.
It should be randomly distributed per item, assigning each of the n items independently to one of the k bins.

This should be faster than your brute-force version when n >> k:
import numpy as np

def allocate(n, k):
    result = np.zeros(k)
    sum_so_far = 0
    for ind in range(k - 1):
        draw = np.random.randint(n - sum_so_far + 1)
        sum_so_far += draw
        result[ind] = draw
    result[k - 1] = n - sum_so_far
    return result
The idea is to draw a random number between 0 and some maximum m (which starts out equal to n), then subtract that draw from the maximum for the next draw, and so on, guaranteeing that the total never exceeds n. This fills the first k-1 entries; the final entry is set to whatever is missing to make the sum exactly n.
Note: I am not sure whether this results in a "fair" random distribution of values or if it is somehow biased towards putting larger values into earlier indices or something like that.
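A quick empirical check (my own sketch, not part of the original answer) shows the bias: under this scheme the first draw is uniform on 0..n, whereas per-item allocation would make the first bin Binomial(n, 1/k).
# Sketch: compare P(first bin == n) under the two schemes,
# using the allocate() defined above.
firsts = np.array([allocate(4, 3)[0] for _ in range(100_000)])
print((firsts == 4).mean())  # ≈ 1/5 = 0.2: the first draw is uniform on 0..4
# Per-item allocation would give (1/3)**4 ≈ 0.012, so this scheme does put
# large values into early indices more often.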

If you are looking for a uniform distribution across all possible allocations (which is different from randomly distributing each item individually):
Using the "stars and bars" approach, we can transform this into a question of picking k-1 positions for possible dividers from a list of n+k-1 possible positions. (Wikipedia proof)
from random import sample

def allocate(n, k):
    dividers = sample(range(1, n + k), k - 1)
    dividers = sorted(dividers)
    dividers.insert(0, 0)
    dividers.append(n + k)
    return [dividers[i + 1] - dividers[i] - 1 for i in range(k)]

print(allocate(4, 3))
There are ((n+k-1) choose (k-1)) possible distributions, and this is equally likely to result in each one of them.
(This is a modification of Wave Man's solution: that one is not uniform across all possible solutions. Note that the only way to get [0,0,4] is to roll (0,0), but there are two ways to get [1,2,1]: rolling (1,3) or (3,1). Choosing from n+k-1 slots and counting dividers as taking a slot corrects for this. In this solution, the random sample (1,2) corresponds to [0,0,4], and the equally likely random sample (2,5) corresponds to [1,2,1].)
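As a quick empirical sanity check (my addition, using the allocate just defined): all C(6, 2) = 15 compositions of 4 into 3 bins should show up roughly equally often.
from collections import Counter

counts = Counter(tuple(allocate(4, 3)) for _ in range(150_000))
print(len(counts))                                 # expect 15
print(min(counts.values()), max(counts.values()))  # both should be ≈ 10_000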

Here's a brute-force approach:
import numpy as np
def allocate(n, k):
    res = np.zeros(k)
    for i in range(n):
        res[np.random.randint(k)] += 1
    return res
Example:
for i in range(3):
    print(allocate(4, 3))
[0. 3. 1.]
[2. 1. 1.]
[2. 0. 2.]

Adapting Michael Szczesny's comment to numpy's newer Generator API:
def allocate(n, k):
    return np.random.default_rng().multinomial(n, [1 / k] * k)
This notebook verifies that it returns the same distribution as my brute-force approach.
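For reference, a minimal usage example (my addition): the entries are integers and always sum to n.
rng = np.random.default_rng()
draw = rng.multinomial(4, [1 / 3] * 3)
print(draw, draw.sum())  # e.g. [2 1 1] 4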

Here's my solution. I think it will make all possible allocations equally likely, but I don't have a proof of that.
from random import randint

def allocate(n, k):
    dividers = [randint(0, n) for i in range(k + 1)]
    dividers[0] = 0
    dividers[k] = n
    dividers = sorted(dividers)
    return [dividers[i + 1] - dividers[i] for i in range(k)]

print(allocate(10000, 100))
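A quick empirical check of that claim (my addition) on a small case: if the allocation were uniform, each of the 15 compositions of 4 into 3 bins would appear about equally often.
from collections import Counter

counts = Counter(tuple(allocate(4, 3)) for _ in range(150_000))
for comp, count in sorted(counts.items()):
    print(comp, count)  # uniform would mean ≈ 10_000 each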


Using Python to use Euler's formula into matrix approximation

I am trying to use Python and NumPy to represent Euler's formula e^(iπ) in matrix form, as e^A where

A = [[0, -π],
     [π,  0]]

and then apply the Maclaurin series for the exponential function e^x:

e^x = sum(n=0 to infinity) x^n/n! = 1 + x + x^2/2! + x^3/3! + ...
So I am trying to compute an approximation matrix S_(N+1) and print the matrix and its four entries.
I have tried emulating Euler's and Maclaurin's series; I think the final approximation matrix will be the one at N = 20, but currently my values do not add up. I am also trying to use np.linalg.norm to compute a 2-norm.
import math
import numpy as np

n = 0
A = np.eye(2)
A = math.pi * np.rot90(A)
A[0, 1] = -A[0, 1]
A

mac_series = 0
while n < 120:
    print(n)
    n += 1
    mac_series = (A**n) / (math.factorial(n))
    print("\n", mac_series)

np.linalg.norm(mac_series)
The main problem here is that you are confusing A**3 with A@A@A.
Just look at the case n=0:
A**0
#array([[1., 1.],
#       [1., 1.]])
I am pretty sure you were expecting A⁰ to be the identity (that is the only way the correspondence x+iy ⇔ np.array([[x,-y],[y,x]]) makes sense).
In numpy, you have np.linalg.matrix_power for that (or you could just accumulate the power yourself):
sum(np.linalg.matrix_power(A, i) / math.factorial(i) for i in range(20))
is
array([[-1.00000000e+00,  5.28918267e-10],
       [-5.28918267e-10, -1.00000000e+00]])
for example. Pretty sure that is what you were expecting (that is the matrix that represents real -1 using the same logic; and the whole point of Euler's identity is that e^(iπ) = -1).
By comparison,
sum(A**i / math.factorial(i) for i in range(20))
returns
array([[ 1.        ,  0.04321392],
       [23.14069263,  1.        ]])
Which is just the Maclaurin series computed elementwise for all four entries of the matrix. In other words, since your matrix is [[0,-π],[π,0]], you are evaluating the Maclaurin series of [[e⁰, exp(-π)], [exp(π), e⁰]]. And it works: e⁰ = 1, obviously; exp(π) is 23.140692632779267, so we got a very good approximation in our result; and exp(-π) is its reciprocal, 0.04321391826377226, also a good approximation.
So it works. It just does not do what you obviously intend: prove Euler's identity in matrix form, that is, compute exp(iπ), not just exp(π).
Without matrix_power, and with code closer to your initial version, you could do:
n = 0
mac_series = 0
Apowern = np.eye(2)  # A⁰ = Id for now
while n < 20:
    print(n)
    mac_series += Apowern / math.factorial(n)
    Apowern = Apowern @ A  # @ is the matrix multiplication operator
    n += 1
Note that I've also moved n += 1, which was misplaced in your code. You were summing Aⁿ⁺¹/(n+1)!, not Aⁿ/n! (in other words, your sum misses the A⁰/0! = Id term).
With this, I get the expected result
>>> mac_series
array([[-1.00000000e+00,  5.28918724e-10],
       [-5.28918724e-10, -1.00000000e+00]])
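As an independent cross-check (my addition, not in the original answer), scipy's built-in matrix exponential gives the same matrix for the A = [[0, -π], [π, 0]] defined above:
from scipy.linalg import expm

print(expm(A))  # ≈ [[-1, 0], [0, -1]]: exp(A) represents exp(iπ) = -1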
Last problem, more subtle: you may have noticed that I do only 20 iterations, not 120. That is because after 20 you start to have a numerical problem: Apowern (or np.linalg.matrix_power(A, n); it is the same problem for both methods) becomes too big. Since it is divided by n! in the sum, that doesn't prevent convergence mathematically, but it does prevent numeric convergence. And, in practice, after a while, numpy changes the type of Apowern.
So, we should avoid dividing a big matrix by a big number, and instead iterate on quantities that stay small enough. Like this, for example:
n = 0
mac_series = 0
NthTerm = np.eye(2)  # Aⁿ/n!; A⁰/0! = Id for now
while n < 120:  # 120 is no longer a problem
    print(n)
    mac_series += NthTerm
    n += 1
    NthTerm = (NthTerm @ A) / n  # if NthTerm was Aⁿ/n!,
    # it becomes (Aⁿ/n!) @ (A/(n+1)) = Aⁿ⁺¹/(n+1)!
Result
>>> mac_series
array([[-1.00000000e+00, -2.34844612e-16],
       [ 2.34844612e-16, -1.00000000e+00]])
tl;dr
You have 4 problems:
1. The one already mentioned by Roy: you are not accumulating the Aⁿ/n! terms, just replacing them, eventually keeping only the last. In other words, you need a += instead of =.
2. A**n is not Aⁿ. It is just A with every element raised to the power n. Said otherwise, [[x,-y],[y,x]]**n is not [[x,-y],[y,x]]ⁿ; it is [[xⁿ,(-y)ⁿ],[yⁿ,xⁿ]]. So you end up computing [[e⁰, 1/e^π], [e^π, e⁰]] ≈ [[1, 0.0432], [23.14, 1]], which is irrelevant.
3. n += 1 is misplaced.
4. The numerical problem of Aⁿ becoming huge. Even though you then divide it by an even huger n!, so it poses no theoretical/mathematical problem, numerically it does: the intermediate result is too big for the computer.

Iterate through all permutations randomly

I have imported a large array and I want to iterate through all row permutations at random.
The code is designed to break if a certain array produces the desired solution.
The attempt so far involves your normal iterative permutation procedure:
import numpy as np
import itertools
file = np.loadtxt("my_array.csv", delimiter=", ")
for i in itertools.permutations(file):
    # ** do something **
    if condition:
        break
However, I would like the iterations to cover all permutations, at random, with no repeats.
Ideally (unlike random iteration in Python), I would also avoid storing all permutations of the array in memory.
Therefore a generator based solution would be best.
Is there a simple solution?
The answer is to first write a function that, given an integer k in [0, n!), returns the kth permutation:
import numpy as np

def unrank(n, k):
    pi = np.arange(n)
    while n > 0:
        pi[n-1], pi[k % n] = pi[k % n], pi[n-1]
        k //= n
        n -= 1
    return pi
This technique is found in Ranking and unranking permutations in linear time by Wendy Myrvold and Frank Ruskey.
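A quick sanity check (my addition): for n = 3, the 6 ranks should map onto all 3! orderings.
from itertools import permutations

perms = {tuple(unrank(3, k)) for k in range(6)}
assert perms == set(permutations(range(3)))  # unrank hits every permutation exactly once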
Then, if we can generate a random permutation of [0, n!) we are done. We can find a technique for this (without having to construct the whole permutation) in Sometimes-Recurse Shuffle by Ben Morris and Phillip Rogaway. I have an implementation of it available here.
Then, all we have to do is:
import math

a = np.array(...)  # Load data.
p = SometimeShuffle(math.factorial(len(a)), "some_random_seed")
for kth_perm in p:
    shuffled_indices = unrank(len(a), kth_perm)
    shuffled_a = a[shuffled_indices]

Finding the error in my Python function to plot a histogram and return a list?

The instructions are as follows:
plotRandomUniformSum(M, N, nBins)
Let’s consider what happens when we add M uniform random numbers together. Since each random number can be between 0 and 1, we expect this sum will be between 0 and M, but somehow we expect that it is more likely for the sum to be near the middle (M/2) than near the ends (0 and M) for M > 1. First write a function randomUniformSum(M) that adds up M uniform random numbers between 0 and 1. Second, write a function that forms a list of N such numbers by calling randomUniformSum a total of N times. Third, plot the result in a histogram and return the list.
Remember that return means to exit the function. Therefore, you should first plot the histogram and then return the list.
What values should you use for binMin and binMax?
To answer this, consider the minimum and maximum values that randomUniformSum may assume. Try calling your function four times with M = 1, 2, 3, 10 setting N = 1000000 and nBins = 100 in each case. You will notice that as M increases, the distribution looks more and more like a Normal (or Gaussian) distribution. This is an illustration of the Central Limit Theorem, a very important theorem from Statistics which states that as you add more and more independent random variables (from any distribution, doesn’t have to be uniform) the sum approaches a Normal distribution - very cool :-).
This is what I have so far:
import random
import hist  # course-provided helper with plotHistogram

def randomUniformSum(M):
    sum = 0
    for i in range(M):
        sum += random.uniform(0, 1)
    return sum

def plotRandomUniformSum(M, N, nBins):
    L = []
    for i in range(N+1):
        x = randomUniformSum(M)
        L.append(x)
    hist.plotHistogram(L, nBins=nBins)
My autograder for this assignment returns an error that:
"Test Failed: None != [0.793340083761663, 0.8219540423197268, 0[202687
chars]9087]"
with different numbers for all tests.
Where is my error? I can't seem to find where I went wrong.
You're not returning the list from the plotRandomUniformSum(M, N, nBins) function:
def randomUniformSum(M):
    sum = 0
    for i in range(M):
        sum += random.uniform(0, 1)
    return sum

def plotRandomUniformSum(M, N, nBins):
    L = []
    for i in range(N+1):
        x = randomUniformSum(M)
        L.append(x)
    hist.plotHistogram(L, nBins=nBins)
    return L  # you missed this out

list_of_random_sums = plotRandomUniformSum(10, 1000000, 100)
Second, write a function that forms a list of N such numbers by calling randomUniformSum a total of N times.
Hello, welcome to SO. It seems you've formed a list of N+1 elements instead of N. Try changing the line for i in range(N+1): to for i in range(N):.
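If you want to test this without the course's hist module, here is a matplotlib stand-in (my assumption about what hist.plotHistogram does), with both fixes applied:
import random
import matplotlib.pyplot as plt

def randomUniformSum(M):
    return sum(random.uniform(0, 1) for _ in range(M))

def plotRandomUniformSum(M, N, nBins):
    L = [randomUniformSum(M) for _ in range(N)]  # N values, not N+1
    plt.hist(L, bins=nBins, range=(0, M))        # binMin = 0, binMax = M
    plt.show()
    return L                                     # plot first, then return

list_of_random_sums = plotRandomUniformSum(3, 1000000, 100)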

How to generate a matrix with random entries and with constraints on row and columns?

How can I generate a matrix whose entries are random real numbers between zero and one inclusive, with the additional constraint that the sum of each row and the sum of each column must be less than or equal to one?
Example:
matrix = [0.3, 0.4, 0.2;
          0.7, 0.0, 0.3;
          0.0, 0.5, 0.1]
If you want a matrix that is uniformly distributed and fulfills those constraints, you probably need a rejection method. In Matlab it would be:
n = 3;
done = false;
while ~done
    matrix = rand(n);
    done = all(sum(matrix,1)<=1) & all(sum(matrix,2)<=1);
end
Note that this will be slow for large n.
If you're looking for a Python way, this is simply a transcription of Luis Mendo's rejection method. For simplicity, I'll be using NumPy:
import numpy as np

n = 3
done = False
while not done:
    matrix = np.random.rand(n, n)
    done = np.all(np.logical_and(matrix.sum(axis=0) <= 1, matrix.sum(axis=1) <= 1))
If you don't have NumPy, then you can generate your 2D matrix as a list of lists instead:
import random

n = 3
done = False
while not done:
    # Create matrix as a list of lists
    matrix = [[random.random() for _ in range(n)] for _ in range(n)]
    # Compute the row sums and check that each is <= 1
    row_sums = [sum(matrix[i]) <= 1 for i in range(n)]
    # Compute the column sums and check that each is <= 1
    col_sums = [sum([matrix[j][i] for j in range(n)]) <= 1 for i in range(n)]
    # Only quit if all row and column sums are at most 1
    done = all(row_sums) and all(col_sums)
The rejection method will surely give you a uniform solution, but it might take a long time to generate a good matrix, especially if your matrix is large. Another, more tedious, approach is to generate each element such that the sums can only reach 1 in each direction: you always generate a new element between 0 and the remainder up to 1:
n = 3;
matrix = zeros(n+1); % dummy line in first row/column
for k1 = 2:n+1
    for k2 = 2:n+1
        matrix(k1,k2) = rand()*(1-max(sum(matrix(k1,1:k2-1)),sum(matrix(1:k1-1,k2))));
    end
end
matrix = matrix(2:end,2:end)
It's a bit tricky because for each element you check the row sum and column sum up to that point, and use the larger of the two when generating the new element (in order to stay below a sum of 1 in both directions). For practical reasons I padded the matrix with a zero row and column at the beginning to avoid indexing problems with k1-1 and k2-1.
Note that, as @LuisMendo pointed out, this will have a different distribution than the rejection method. But if your constraints do not concern the distribution, this could do as well (and it will give you a matrix from a single run).
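For completeness, here's a rough NumPy transcription of that sequential approach (my sketch, mirroring the MATLAB padding trick):
import numpy as np

n = 3
padded = np.zeros((n + 1, n + 1))  # dummy zero row/column avoids index special cases
for k1 in range(1, n + 1):
    for k2 in range(1, n + 1):
        # largest value that keeps both the row and column sums <= 1
        remainder = 1 - max(padded[k1, :k2].sum(), padded[:k1, k2].sum())
        padded[k1, k2] = np.random.rand() * remainder
matrix = padded[1:, 1:]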

Generate "random" matrix of certain rank over a fixed set of elements

I'd like to generate matrices of size mxn and rank r, with elements coming from a specified finite set, e.g. {0,1} or {1,2,3,4,5}. I want them to be "random" in some very loose sense of that word, i.e. I want to get a variety of possible outputs from the algorithm with distribution vaguely similar to the distribution of all matrices over that set of elements with the specified rank.
In fact, I don't actually care that it has rank r, just that it's close to a matrix of rank r (measured by the Frobenius norm).
When the set at hand is the reals, I've been doing the following, which is perfectly adequate for my needs: generate matrices U of size mxr and V of nxr, with elements independently sampled from e.g. Normal(0, 2). Then U V' is an mxn matrix of rank r (well, <= r, but I think it's r with high probability).
If I just do that and then round to binary / 1-5, though, the rank increases.
It's also possible to get a lower-rank approximation to a matrix by doing an SVD and taking the first r singular values. Those values, though, won't lie in the desired set, and rounding them will again increase the rank.
This question is related, but the accepted answer isn't "random," and the other answer suggests SVD, which doesn't work here as noted.
One possibility I've thought of is to make r linearly independent row or column vectors from the set and then get the rest of the matrix by linear combinations of those. I'm not really clear, though, either on how to get "random" linearly independent vectors, or how to combine them in a quasirandom way after that.
(Not that it's super-relevant, but I'm doing this in numpy.)
Update: I've tried the approach suggested by EMS in the comments, with this simple implementation:
import random
import numpy as np

des_rank = 3  # target rank; the factors below have inner dimension 3
real = np.dot(np.random.normal(0, 1, (10, 3)), np.random.normal(0, 1, (3, 10)))
bin = (real > .5).astype(int)
rank = np.linalg.matrix_rank(bin)
niter = 0
while rank > des_rank:
    cand_changes = np.zeros((21, 5))
    for n in range(20):
        i, j = random.randrange(5), random.randrange(5)
        v = 1 - bin[i, j]
        x = bin.copy()
        x[i, j] = v
        x_rank = np.linalg.matrix_rank(x)
        cand_changes[n, :] = (i, j, v, x_rank, max((rank + 1e-4) - x_rank, 0))
    cand_changes[-1, :] = (0, 0, bin[0, 0], rank, 1e-4)
    cdf = np.cumsum(cand_changes[:, -1])
    cdf /= cdf[-1]
    i, j, v, rank, score = cand_changes[np.searchsorted(cdf, random.random()), :]
    bin[int(i), int(j)] = int(v)  # indices come back as floats from the array
    niter += 1
    if niter % 1000 == 0:
        print(niter, rank)
It works quickly for small matrices but falls apart for e.g. 10x10 -- it seems to get stuck at rank 6 or 7, at least for hundreds of thousands of iterations.
It seems like this might work better with a better (i.e. less flat) objective function, but I don't know what that would be.
I've also tried a simple rejection method for building up the matrix:
import random
import numpy as np

def fill_matrix(m, n, r, vals):
    assert m >= r and n >= r
    trans = False
    if m > n:  # more columns than rows I think is better
        m, n = n, m
        trans = True
    get_vec = lambda: np.array([random.choice(vals) for i in range(n)])
    vecs = []
    n_rejects = 0
    # fill in r linearly independent rows
    while len(vecs) < r:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            vecs.append(v)
        else:
            n_rejects += 1
    print("have {} independent ({} rejects)".format(r, n_rejects))
    # fill in the rest of the dependent rows
    while len(vecs) < m:
        v = get_vec()
        if np.linalg.matrix_rank(np.vstack(vecs + [v])) > len(vecs):
            n_rejects += 1
            if n_rejects % 1000 == 0:
                print(n_rejects)
        else:
            vecs.append(v)
    print("done ({} total rejects)".format(n_rejects))
    m = np.vstack(vecs)
    return m.T if trans else m
This works okay for e.g. 10x10 binary matrices with any rank, but not for 0-4 matrices or much larger binaries with lower rank. (For example, getting a 20x20 binary matrix of rank 15 took me 42,000 rejections; with 20x20 of rank 10, it took 1.2 million.)
This is clearly because the space spanned by the first r rows is too small a portion of the space I'm sampling from, e.g. {0,1}^10, in these cases.
We want the intersection of the span of the first r rows with the set of valid values.
So we could try sampling from the span and looking for valid values, but since the span involves real-valued coefficients that's never going to find us valid vectors (even if we normalize so that e.g. the first component is in the valid set).
Maybe this can be formulated as an integer programming problem, or something?
My friend Daniel Johnson, who commented above, came up with an idea but I see he never posted it. It's not very fleshed out, but you might be able to adapt it.
If A is m-by-r and B is r-by-n and both have rank r then AB has rank r. Now, we just have to pick A and B such that AB has values only in the given set. The simplest case is S = {0,1,2,...,j}.
One choice would be to make A binary with appropriate row/col sums that guarantee the correct rank, and B with column sums adding to no more than j (so that each term in the product is in S) and row sums picked to cause rank r (or at least encourage it, as rejection can be used).
I just think that we can come up with two independent sampling schemes on A and B that are less complicated and quicker than trying to attack the whole matrix at once. Unfortunately, all my matrix sampling code is on the other computer. I know it generalized easily to allowing entries in a bigger set than {0,1} (i.e. S), but I can't remember how the computation scaled with m*n.
I am not sure how useful this solution will be, but you can construct a matrix that lets you search for the solution on another matrix with only 0 and 1 as entries. Searching randomly on the binary matrix is equivalent to randomly modifying the elements of the final matrix, but it is possible to come up with rules that do better than a random search.
If you want to generate an m-by-n matrix over the element set E with elements ei, 0<=i<k, you start off with the m-by-k*m matrix A, which holds the row vector [e0, e1, ..., e(k-1)] along a block diagonal: row i of A contains the k elements of E in columns i*k through i*k+k-1, and zeros elsewhere.
Clearly, this matrix has rank m. Now, you can construct another matrix, B, that has 1s at certain locations to pick the elements from the set E. B is the k*m-by-n matrix obtained by stacking blocks B0, B1, ..., B(m-1) on top of one another.
Each Bi is a k-by-n matrix. So, the size of AB is m-by-n and rank(AB) is min(m, rank(B)). If we want the output matrix to have only elements from our set E, then each column of Bi has to have exactly one element set to 1, and the rest set to 0.
If you want to search for a certain rank on B randomly, you need to start off with a valid B with max rank, and rotate a random column j of a random Bi by a random amount. This is equivalent to changing entry (row i, column j) of AB to a random element from our set, so it is not a very useful method.
However, you can do certain tricks with the matrices. For example, if k is 2 and there is no overlap between the first rows of B0 and B1, you can generate a linearly dependent row by adding the first rows of these two sub-matrices. The second row will also be linearly dependent on the rows of these two matrices. I am not sure if this will easily generalize to k larger than 2, but I am sure there will be other tricks you can employ.
For example, one simple method to generate a matrix of rank at most k (when m is k+1) is: take a random valid B0; keep rotating all rows of this matrix up to obtain B1 through B(m-2); then set the first row of B(m-1) to all 1s and its remaining rows to all 0s. The rank cannot be less than k (assuming n > k), because each column of B0 has exactly one nonzero element. The remaining rows of the matrices are all linear combinations (in fact, exact copies for almost all submatrices) of these rows. The first row of the last submatrix is the sum of all rows of the first submatrix, and its remaining rows are all zeros. For larger values of m, you can use permutations of the rows of B0 instead of simple rotation.
Once you generate one matrix that satisfies the rank constraint, you may get away with randomly shuffling the rows and columns of it to generate others.
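Row and column permutations amount to multiplying by invertible permutation matrices, so they preserve rank. A one-line sketch (my addition, assuming M is one matrix that already satisfies the constraints):
import numpy as np

M_shuffled = M[np.random.permutation(M.shape[0])][:, np.random.permutation(M.shape[1])]
# rank(M_shuffled) == rank(M), and the entries still come from the same set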
How about this?
import numpy as np
from sklearn.decomposition import NMF

rank = 30
n1, n2 = 100, 100
model = NMF(n_components=rank, init='random', random_state=0)
U = model.fit_transform(np.random.randint(1, 5, size=(n1, n2)))
V = model.components_
M = np.around(U) @ np.around(V)
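One caveat worth checking (my addition): since M is the product of an n1-by-rank matrix and a rank-by-n2 matrix, rank(M) is at most rank, but rounding the factors does not force the entries of M into the target set.
print(np.linalg.matrix_rank(M))  # at most `rank` by construction, possibly lower
print(np.unique(M))              # entries need not lie in the desired set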
