Random distribution of N points with a specified average spacing? - python

This is probably a question of logic more than of an algorithm. I want to distribute about N points in such a way that the average distance from a single point to the other N-1 points is distributed around a value d. That is, the points themselves are not normally distributed, but their spacing with respect to each other is.
Is there a logical way to implement this?
For example:
import numpy as np
X = np.ones(N)  # 1-D for simplicity
d = np.ones(N)
for i in range(N):
    X[i] = ...  # Insert the algorithm here
for i in range(N):
    da = 0
    for j in range(N):
        if i != j:
            da += np.abs(X[i] - X[j])  # distance to each other point, summed up
    da = da / (N - 1)  # average of all distances
    d[i] = da          # the average distance of point i to all other points
A = np.mean(d)  # the mean of all average distances
A is the parameter I want control over; it should be the basis for how all the points are distributed. Recommendations using built-in Python or C modules would work as well.
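One simple way to get direct control over A in 1-D (a sketch under the assumption that any point distribution is acceptable, not a method from this thread): distances scale linearly with the coordinates, so draw the points from any convenient distribution, measure the resulting mean pairwise distance, and rescale so that it equals the target value.
import numpy as np

def points_with_mean_spacing(N, d_target):
    # Draw N points from an arbitrary distribution (uniform here), then rescale
    # so that the mean pairwise distance A equals the requested value.
    X = np.random.default_rng().uniform(0.0, 1.0, N)
    diffs = np.abs(X[:, None] - X[None, :])
    A = diffs.sum() / (N * (N - 1))   # mean over all ordered pairs i != j
    return X * (d_target / A)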

Related

Choosing subset of farthest points in given set of points

Imagine you are given a set S of n points in 3 dimensions. The distance between any 2 points is the simple Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other. In other words, there is no other subset Q' of k points such that the minimum of all pairwise distances in Q is less than that in Q'.
If n is approximately 16 million and k is about 300, how do we efficiently do this?
My guess is that this is NP-hard, so maybe we just want to focus on approximation. One idea I can think of is using multidimensional scaling to sort these points along a line and then using a version of binary search to get the points that are furthest apart on this line.
This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991), and Ravi et al. (1994) give a factor-2 approximation for the problem, with the latter proving this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i, j])
while size(P) < p:
    Find a node v in V-P such that min_{v' in P} dist(v, v') is maximum.
    That is: find the node with the greatest minimum distance to the set P.
    P = P.union(v)
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # random matrix
d = (d + d.T) / 2          # make the matrix symmetric

print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
    for j in range(i + 1, N):
        if d[i, j] > maxdist:
            maxdist = d[i, j]
            bestpair = (i, j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    best_mindist = 0
    vbest = None
    for v in range(N):
        if v in P:
            continue
        # Minimum distance from candidate v to the already-chosen set P
        mindist = min(d[v, vprime] for vprime in P)
        if mindist > best_mindist:
            best_mindist = mindist
            vbest = v
    P.add(vbest)

print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400, the optimality gap was still 568% after 6000 s. The approximation algorithm took 0.47 s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # random matrix
d = (d + d.T) / 2          # make the matrix symmetric

m = grb.Model(name="MIP Model")
used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]
objective = grb.quicksum(d[i, j] * used[i] * used[j] for i in range(0, N) for j in range(i + 1, N))
m.addConstr(
    lhs=grb.quicksum(used),
    sense=grb.GRB.EQUAL,
    rhs=p
)
# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)
# m.Params.TimeLimit = 3*60
# solve with Gurobi
ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

p = 300
N = 16000000

# Find the convex hull in O(N log N)
points = np.random.rand(N, 3)   # N random points in 3-D

# Returned 420 points in testing
hull = ConvexHull(points)

# Extract the points forming the hull
hullpoints = points[hull.vertices, :]

# Naive way of finding the best pair in O(H^2) time if H is the number of points on the hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')

# Get the farthest-apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)

P = np.array([hullpoints[bestpair[0]], hullpoints[bestpair[1]]])

# Now we have a problem
print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    distance_to_P = cdist(points, P)
    minimum_to_each_of_P = np.min(distance_to_P, axis=1)
    best_new_point_idx = np.argmax(minimum_to_each_of_P)
    best_new_point = np.expand_dims(points[best_new_point_idx, :], 0)
    P = np.append(P, best_new_point, axis=0)

print(P)
I am also pretty sure that the problem is NP-hard; the most similar problem I found is the k-center problem. If runtime is more important than correctness, a greedy algorithm is probably your best choice:
Q = {}
while |Q| < k:
    Q += p from S where mindist(p, Q) is maximal
Side note: in similar problems, e.g. the set-cover problem, it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up I see 3 possibilities:
Index your dataset in an R-tree first, then perform a greedy search. Construction of the R-tree is O(n log n), and although it was developed for nearest-neighbour search, it can also help you find the furthest point from a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximating anyway, so you might be able to spare a little more accuracy. You can also combine this with the first approach (a sketch of this combination follows the list).
Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (let's call this set Q'). Then, in each step, you switch the point p_ from Q' that has the minimum distance to another point in Q' with a random point from S. If the resulting set Q'' is better, proceed with Q''; otherwise repeat with Q'. In order not to get stuck, you might want to choose another point from Q' than p_ if you could not find an adequate replacement for a couple of iterations.
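For illustration, here is a minimal NumPy/SciPy sketch of the second option combined with the greedy maxmin selection; the sample size and the starting point are arbitrary choices, not part of the original suggestion:
import numpy as np
from scipy.spatial.distance import cdist

def greedy_maxmin_on_sample(points, k, sample_size=100000, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(points), size=min(sample_size, len(points)), replace=False)
    sample = points[idx]
    chosen = [0]                                          # arbitrary starting point
    min_dist = cdist(sample, sample[chosen]).ravel()      # distance of every point to the chosen set
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))                    # farthest from everything chosen so far
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, cdist(sample, sample[[nxt]]).ravel())
    return sample[chosen]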
If you can afford to do ~ k*n distance calculations, then you could:
1. Find the center of the distribution of points.
2. Select the point furthest from the center (and remove it from the set of un-selected points).
3. Find the point furthest from all the currently selected points and select it.
4. Repeat 3. until you end up with k points.
Find the maximum extent of all points. Split into 7x7x7 voxels. For all points in a voxel find the point closest to its centre. Return these 7x7x7 points. Some voxels may contain no points, hopefully not too many.
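A rough NumPy sketch of this voxel idea (the 7x7x7 grid comes from the suggestion; the handling of degenerate extents and the choice of representative are my own):
import numpy as np

def voxel_representatives(points, grid=7):
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)                    # avoid division by zero
    cell = np.minimum(((points - lo) / span * grid).astype(int), grid - 1)
    key = cell[:, 0] * grid * grid + cell[:, 1] * grid + cell[:, 2]
    centre = lo + (cell + 0.5) / grid * span                  # centre of each point's voxel
    dist2 = np.sum((points - centre) ** 2, axis=1)
    keep = [members[np.argmin(dist2[members])]
            for members in (np.where(key == u)[0] for u in np.unique(key))]
    return points[np.array(keep)]                             # at most grid**3 points, one per occupied voxel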

how to estimate parameters of mixture of 2 exponential random variables (ideally in Python)

Imagine a simulation experiment in which the output is n total numbers, where k of them are sampled from an exponential random variable with rate a and n-k are sampled from an exponential random variable with rate b. The constraints are that 0 < a ≤ b and 0 ≤ k ≤ n, but a, b, and k are all unknown. Further, because of details of the simulation experiment, when a << b, k ≈ 0, and when a = b, k ≈ n/2.
My goal is to estimate either a or b (I don't care about k, and I don't need to estimate both a and b: just one of the two is fine). From speculation, it seems as though estimating just b might be the easiest path (when a << b, there is pretty much nothing to use to estimate a and plenty to estimate b, and when a = b, there is still plenty to estimate b). I want to do it in Python ideally, but I am open to any free software.
My first approach was to use scipy.optimize to optimize a likelihood function where, for each number in my dataset, I compute P(X=x) for an exponential with rate a, compute the same for an exponential with rate b, and simply choose the larger of the two:
from sys import stdin
from math import exp, log
from scipy.optimize import fmin

DATA = None

def pdf(x, l):  # compute P(X=x) for an exponential RV X with rate l
    return l * exp(-1 * l * x)

def logML(X, la, lb):  # log-ML of data points X given two exponentials with rates la and lb where la < lb
    ml = 0.0
    for x in X:
        ml += log(max(pdf(x, la), pdf(x, lb)))
    return ml

def f(x):  # objective function to minimize
    assert DATA is not None, "DATA cannot be None"
    la, lb = x
    if la > lb:  # force la <= lb
        return float('inf')
    elif la <= 0 or lb <= 0:
        return float('inf')  # force la and lb > 0
    return -1 * logML(DATA, la, lb)

if __name__ == "__main__":
    DATA = [float(x) for x in stdin.read().split()]  # read input data
    Xbar = sum(DATA) / len(DATA)                     # compute mean
    x0 = [1 / Xbar, 1 / Xbar]                        # start with la = lb = 1/mean
    result = fmin(f, x0, disp=True)
    print("ML Rates: la = %f and lb = %f" % tuple(result))
This unfortunately didn't work very well. For some selections of the parameters, it's within an order of magnitude, but for others, it's absurdly off. Given my problem (with its constraints) and my goal of estimating the larger parameter of the two exponentials (without caring about the smaller parameter nor the number of points that came from either), any ideas?
I posted the question in more general statistical terms on the stats Stack Exchange, and it got an answer:
https://stats.stackexchange.com/questions/291642/how-to-estimate-parameters-of-mixture-of-2-exponential-random-variables-ideally
Also, I tried the following, which worked decently well:
First, for every single integer percentile (1st percentile, 2nd percentile, ..., 99th percentile), I compute an estimate of b using the closed-form quantile equation for an exponential distribution (the q-th quantile equals −ln(1 − q)/λ, so λ = −ln(1 − q)/(q-th quantile), where the q-th quantile is the (q*100)-th percentile). The result is a list where the i-th element corresponds to the b estimate from the (i+1)-th percentile.
Then I perform peak-calling on this list using the Python implementation of the Matlab peak-calling function, take the list of resulting peaks, and return the minimum. It seems to work fairly well.
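A sketch of that procedure, using scipy.signal.find_peaks as a stand-in for the Matlab-style peak caller (the fallback when no peak is found is my own choice):
import numpy as np
from scipy.signal import find_peaks   # stand-in for the peak-calling function mentioned above

def estimate_b_from_percentiles(data):
    qs = np.arange(1, 100) / 100.0                 # quantiles for the 1st..99th percentiles
    quantiles = np.percentile(data, qs * 100)
    lam = -np.log(1.0 - qs) / quantiles            # per-percentile estimate of the rate
    peaks, _ = find_peaks(lam)
    return lam[peaks].min() if len(peaks) else lam.max()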
I will implement the EM solution in the Stack Exchange post as well and see which works better.
EDIT: I implemented the EM solution, and it seems to work decently well in my simulations (n = 1000, various a and b).
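For reference, a minimal EM sketch for a two-component exponential mixture (not the poster's code; initialisation and iteration count are arbitrary choices):
import numpy as np

def em_two_exponentials(x, iters=200):
    x = np.asarray(x, dtype=float)
    a, b, w = 0.5 / x.mean(), 2.0 / x.mean(), 0.5        # crude starting guesses
    for _ in range(iters):
        # E-step: responsibility of the rate-a component for each observation
        pa = w * a * np.exp(-a * x)
        pb = (1.0 - w) * b * np.exp(-b * x)
        r = pa / (pa + pb)
        # M-step: weighted maximum-likelihood updates of the rates and mixing weight
        a = r.sum() / (r * x).sum()
        b = (1.0 - r).sum() / ((1.0 - r) * x).sum()
        w = r.mean()
    if a > b:                                            # keep the convention a <= b
        a, b, w = b, a, 1.0 - w
    return a, b, w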

k-means with a centroid constraint

I'm working on a data science project for my intro to Data Science class, and we've decided to tackle a problem relating to desalination plants in California: "Where should we place k plants to minimize the distance to zip codes?"
The data that we have so far is: zip, city, county, population, latitude, longitude, and amount of water.
The issue is, I can't find any resources on how to force the centroid to be constrained to staying on the coast. What we've thought of so far is:
Use a normal kmeans algorithm, but move the centroid to the coast once clusters have settled (bad)
Use a normal kmeans algorithm with weights, making the coastal zips have infinite weight (I've been told this isn't a great solution)
What do you guys think?
K-means does not minimize distances.
It minimizes squared errors, which is quite different.
The difference is roughly that between the median and the mean in 1-dimensional data. The error can be massive.
Here is a counter example, assuming we have the coordinates:
-1 0
+1 0
0 -1
0 101
The center chosen by k-means would be (0, 25). The optimal location is (0, 0).
The sum of distances to the k-means center is > 152, while the optimum location has a total distance of 104. So here the centroid is almost 50% worse than the optimum! But the centroid (= multivariate mean) is what k-means uses!
k-means does not minimize the Euclidean distance!
This is one variant of how "k-means is sensitive to outliers".
It does not get better if you try to constrain it to place "centers" on the coast only...
Also, you may want to at least use haversine distance, because in California 1 degree north != 1 degree east (it's not at the Equator).
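For reference, a haversine distance in NumPy looks like this (the standard formula, not code from this answer; it returns kilometres and could stand in wherever a Euclidean distance is used below):
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))   # 6371 km = mean Earth radius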
Furthermore, you likely should not make the assumption that every location requires its own pipe, but rather they will be connected like a tree. This greatly reduces the cost.
I strongly suggest to treat this as a generic optimization problem, rather than k-means. K-means is an optimization too, but it may optimize the wrong function for your problem...
I would approach this by setting possible points that could be centers, i.e. your coastline.
I think this is close to Nathaniel Saul's first comment.
This way, for each iteration, instead of choosing a mean, a point out of the possible set would be chosen by proximity to the cluster.
I’ve simplified the conditions to only 2 data columns (lon. and lat.) but you should be able to extrapolate the concept. For simplicity, to demonstrate, I based this on code from here.
In this example, the purple dots are places on the coastline. If I understood correctly, the optimal Coastline locations should look something like this:
See code below:
#! /usr/bin/python3.6
# Code based on:
# https://datasciencelab.wordpress.com/2013/12/12/clustering-with-k-means-in-python/
import matplotlib.pyplot as plt
import numpy as np
import random

##### Simulation START #####
# Generate possible points.
def possible_points(n=20):
    y = list(np.linspace(-1, 1, n))
    x = [-1.2]
    X = []
    for i in list(range(1, n)):
        x.append(x[i-1] + random.uniform(-2/n, 2/n))
    for a, b in zip(x, y):
        X.append(np.array([a, b]))
    X = np.array(X)
    return X

# Generate sample
def init_board_gauss(N, k):
    n = float(N) / k
    X = []
    for i in range(k):
        c = (random.uniform(-1, 1), random.uniform(-1, 1))
        s = random.uniform(0.05, 0.5)
        x = []
        while len(x) < n:
            a, b = np.array([np.random.normal(c[0], s), np.random.normal(c[1], s)])
            # Continue drawing points from the distribution in the range [-1,1]
            if abs(a) < 1 and abs(b) < 1:
                x.append([a, b])
        X.extend(x)
    X = np.array(X)[:N]
    return X
##### Simulation END #####

# Identify points for each center.
def cluster_points(X, mu):
    clusters = {}
    for x in X:
        bestmukey = min([(i[0], np.linalg.norm(x - mu[i[0]]))
                         for i in enumerate(mu)], key=lambda t: t[1])[0]
        try:
            clusters[bestmukey].append(x)
        except KeyError:
            clusters[bestmukey] = [x]
    return clusters

# Get closest possible point for each cluster.
def closest_point(cluster, possiblePoints):
    closestPoints = []
    # Check average distance for each point.
    for possible in possiblePoints:
        distances = []
        for point in cluster:
            distances.append(np.linalg.norm(possible - point))
        closestPoints.append(np.sum(distances))  # minimize total distance
        # closestPoints.append(np.mean(distances))
    return possiblePoints[closestPoints.index(min(closestPoints))]

# Calculate new centers.
# Here the 'coast constraint' goes.
def reevaluate_centers(clusters, possiblePoints):
    newmu = []
    keys = sorted(clusters.keys())
    for k in keys:
        newmu.append(closest_point(clusters[k], possiblePoints))
    return newmu

# Check whether centers converged.
def has_converged(mu, oldmu):
    return (set([tuple(a) for a in mu]) == set([tuple(a) for a in oldmu]))

# Meta function that runs the steps of the process in sequence.
def find_centers(X, K, possiblePoints):
    # Initialize to K random centers
    oldmu = random.sample(list(possiblePoints), K)
    mu = random.sample(list(possiblePoints), K)
    while not has_converged(mu, oldmu):
        oldmu = mu
        # Assign all points in X to clusters
        clusters = cluster_points(X, mu)
        # Re-evaluate centers
        mu = reevaluate_centers(clusters, possiblePoints)
    return (mu, clusters)

K = 3
X = init_board_gauss(30, K)
possiblePoints = possible_points()
results = find_centers(X, K, possiblePoints)

# Show results
# Show constraints and clusters
# List point types
pointtypes1 = ["gx", "gD", "g*"]
plt.plot(
    np.matrix(possiblePoints).transpose()[0], np.matrix(possiblePoints).transpose()[1], 'm.'
)
for i in list(range(0, len(results[0]))):
    plt.plot(
        np.matrix(results[0][i]).transpose()[0], np.matrix(results[0][i]).transpose()[1], pointtypes1[i]
    )
pointtypes = ["bx", "yD", "c*"]
# Show all cluster points
for i in list(range(0, len(results[1]))):
    plt.plot(
        np.matrix(results[1][i]).transpose()[0], np.matrix(results[1][i]).transpose()[1], pointtypes[i]
    )
plt.show()
Edited to minimize total distance.

Is this a proper implementation of point charge dynamics with python ODEint

Since learning about point charges in my physics II class this semester, I have wanted to be able to investigate not only the static force and field distributions but also the actual trajectories of electrically charged particles. The first stage in doing this is to build a naive engine for simulating the dynamics of n individual point particles. I've implemented the solution using matrices in Python and was hoping someone could comment on whether I've done so correctly. As I don't know what kind of dynamics to expect, I can't tell directly from the videos whether my implementation of my equations is correct.
My Particular Problem
In particular, I cannot tell whether, in my calculation of the force magnitude, I am computing the 1/r^(3/2) factor correctly. Why? Because when I simulate a dipole and use 2/2 as the exponent, the particles start going in an elliptical orbit, which is what I would expect. However, when I use the correct exponent, I get this: Where is my code going wrong? What am I supposed to expect?
I'll first write down the equations I'm using:
Given n charges q_1, q_2, ..., q_n, with masses m_1, m_2, ..., m_n located at initial positions r_1, r_2, ..., r_n, with velocities (d/dt)r_1, (d/dt)r_2, ..., (d/dt)r_n the force induced on q_i by q_j is given by
F_(j -> i) = k * q_i * q_j / norm(r_i - r_j)^(3/2) * (r_i - r_j)
Now, the net marginal force on particle q_i is given as the sum of the pairwise forces:
F_(N, i) = sum_(j != i) F_(j -> i)
And then the net acceleration of particle q_i just normalizes the force by the mass of the particle:
(d^2/dt^2) r_i = F_(N, i) / m_i
In total, for n particles, we have an n-th order system of differential equations. We will also need to specify n initial particle velocities and n initial positions.
To implement this in python, I need to be able to compute pairwise point distances and pairwise charge multiples. To do this I tile the q vector of charges and the r vector of positions and take, respectively, their product and difference with their transpose.
import numpy as np
import numpy.ma as ma
from scipy.integrate import odeint

def integrator_func(y, t, q, m, n, d, k):
    y = np.copy(y.reshape((n*2, d)))
    # rj across, ri down
    rs_from = np.tile(y[:n], (n, 1, 1))
    # ri across, rj down
    rs_to = np.transpose(rs_from, axes=(1, 0, 2))
    # directional distance between each r_i and r_j
    # dr_ij is the force from j onto i, i.e. r_i - r_j
    dr = rs_to - rs_from
    # Used as a mask to ignore divides by zero between r_i and r_i
    nd_identity = np.eye(n).reshape((n, n, 1))
    # WHAT I AM UNSURE ABOUT
    drmag = ma.array(
        np.power(
            np.sum(np.power(dr, 2), 2),
            3./2),
        mask=nd_identity)
    # Pairwise q_i*q_j for force equation
    qsa = np.tile(q, (n, 1))
    qsb = np.tile(q, (n, 1)).T
    qs = qsa * qsb
    # Directional forces
    Fs = (k*qs/drmag).reshape((n, n, 1))
    # Dividing by m to obtain acceleration vectors
    a = np.sum(Fs*dr, 1) / np.asarray(m).reshape(n, 1)
    # Setting velocities
    y[:n] = np.copy(y[n:])
    # Entering the acceleration into the velocity slot
    y[n:] = np.copy(a)
    # Flattening it out for scipy.odeint to work properly
    return np.array(y).reshape(n*2*d)
def sim_particles(t, r, v, q, m, k=1.):
    """
    With n particles in d dimensions:
    t: timepoints to integrate over
    r: n*d matrix. The d-dimensional initial positions of n particles
    v: n*d matrix of initial particle velocities
    q: length-n array of particle charges
    m: length-n array of particle masses
    k: electric constant.
    """
    d = r.shape[-1]
    n = r.shape[0]
    y0 = np.zeros((n*2, d))
    y0[:n] = r
    y0[n:] = v
    y0 = y0.reshape(n*2*d)
    yf = odeint(
        integrator_func,
        y0,
        t,
        args=(q, m, n, d, k)).reshape(t.shape[0], n*2, d)
    return yf
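As a quick sanity check, a usage sketch for a dipole released from rest (my own example values, not from the original post):
import numpy as np

t = np.linspace(0, 10, 1000)
r = np.array([[-1.0, 0.0], [1.0, 0.0]])    # initial positions (n=2 particles, d=2)
v = np.zeros((2, 2))                        # start at rest
q = np.array([1.0, -1.0])                   # opposite charges
m = np.array([1.0, 1.0])
trajectory = sim_particles(t, r, v, q, m)   # shape (len(t), 2*n, d); rows 0..n-1 of each step are positions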

Non biased return a list of n random positive numbers (>=0) so that their sum == total_sum

I'm looking for either an algorithm or a suggestion to improve my code so that it generates a list of random numbers whose sum equals some arbitrary number. With my code below, the result will always be biased, as the first numbers will tend to be higher.
Is there a way to make the number selection more efficient?
#!/usr/bin/python
'''
Generate a list of 'numbs' positive random numbers whose sum = 'limit_sum'
'''
import random

def gen_list(numbs, limit_sum):
    my_sum = []
    for index in range(0, numbs):
        if index == numbs - 1:
            my_sum.append(limit_sum - sum(my_sum))
        else:
            my_sum.append(random.uniform(0, limit_sum - sum(my_sum)))
    return my_sum

# test
import pprint
pprint.pprint(gen_list(5, 20))
pprint.pprint(gen_list(10, 200))
pprint.pprint(gen_list(0, 30))
pprint.pprint(gen_list(1, 10))
The output:
[0.10845093828525609,
16.324799712999706,
0.08200162072303821,
3.4534885160590041,
0.031259211932997744]
[133.19609626532952,
47.464880208741029,
8.556082341110228,
5.7817325913462323,
4.6342577008233716,
0.22532341156764768,
0.0027495225618908918,
0.064738336208217895,
0.028888697891734455,
0.045250924420116689]
[]
[10]
Why not just generate the right number of uniformly distributed random numbers, tot them up and scale?
EDIT: To be a bit clearer: you want N numbers which sum to S? Then generate N uniformly distributed random numbers on the interval [0,1), or whatever your RNG produces. Add them up; they will total s (say), whereas you want them to total S, so multiply each number by S/s. Now the numbers are uniformly randomly distributed on [0, S/s), I think.
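In NumPy, this suggestion is essentially two lines (a sketch; see the caveat about the resulting distribution discussed further down):
import numpy as np

def scaled_uniform(n, total):
    x = np.random.rand(n)          # n uniform numbers on [0, 1)
    return x * (total / x.sum())   # rescale so they sum to 'total'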
Here's how I would do it:
1. Generate n-1 random numbers, all in the range [0, max].
2. Sort those numbers.
3. For each pair made up of the i-th and (i+1)-th numbers in the sorted list, create an interval (i, i+1) and compute its length. The first interval will start at 0 and end at the first number in the list, and the last interval will start at the last number and end at max.
Now the lengths of those intervals will always sum up to max, since they simply represent segments inside [0, max].
Code (in Python):
#! /usr/bin/env python
import random

def random_numbers(n, sum_to):
    values = [0] + [random.randint(0, sum_to) for i in xrange(n-1)] + [sum_to]
    values.sort()
    intervals = [values[i+1] - values[i] for i in xrange(len(values)-1)]
    return intervals

if __name__ == '__main__':
    print random_numbers(5, 100)
If you are looking for normally-distributed numbers with as little correlation as possible, and need to be rigorous* about this, I would suggest you take the following mathematical approach and translate into code.
(*rigorous: the problem with other approaches is that you can get "long tails" in your distributions -- in other words, it is rare but possible to have outliers that are very different from your expected output)
Generate N-1 independent and identically distributed (IID) gaussian random variables v0, v1, v2, ... vN-1 to match the N-1 degrees of freedom of your problem.
Create a column vector V where V = [0, v0, v1, v2, ..., vN-1]^T
Use a fixed weighting matrix W, where W consists of an orthonormal matrix** whose top row is [1 1 1 1 1 1 1 ... 1] / sqrt(N).
Your output vector is the product WV + SU/N where S is the desired sum and U is the column vector of 1's. In other words, the i'th output variable = the dot product of (row #i of matrix W) and column vector V, added to S/N.
The standard deviation of each output variable will be (I believe, can't verify right now) sqrt(N/N-1) * the standard deviation of the input random variables.
**orthonormal matrix: this is the hard part, I put in a question at math.stackexchange.com and there's a simple matrix W that works, and can be defined algorithmically with only 3 distinct values, so that you don't actually have to construct the matrix.
W, the Householder reflection of v-w where v = [sqrt(N), 0, 0, ..., 0] and w = [1, 1, 1, ..., 1], can be defined by:
W(1,i) = W(i,1) = 1/sqrt(N)
W(i,i) = 1 - K for i >= 2
W(i,j) = -K for i,j >= 2, i != j
K = 1/sqrt(N)/(sqrt(N)-1)
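A small NumPy sketch of the steps above, building W directly as the Householder reflection just described (function and variable names are my own):
import numpy as np

def make_w(n):
    # Householder reflection W = I - 2*u*u^T/(u^T u) with u = v - w,
    # v = [sqrt(n), 0, ..., 0] and w = [1, 1, ..., 1]; matches the closed form above.
    u = -np.ones(n)
    u[0] += np.sqrt(n)
    return np.eye(n) - 2.0 * np.outer(u, u) / u.dot(u)

def gaussian_with_sum(n, total):
    # N-1 IID gaussians, padded with a leading zero, rotated by W, shifted by S/N.
    V = np.concatenate(([0.0], np.random.default_rng().standard_normal(n - 1)))
    return make_w(n) @ V + total / n

# e.g. gaussian_with_sum(10, 100.0).sum() is 100.0 up to floating-point error.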
The problem with Mark's approach:
Why not just generate the right number of uniformly distributed random numbers, tot them up and scale ?
is that if you do this, you get a "long tail" distribution. Here's an example in MATLAB:
>> X = rand(100000,10);
>> Y = X ./ repmat(sum(X,2),1,10);
>> plot(sort(Y))
I've generated 100,000 sets of N=10 numbers in matrix X, and created matrix Y where each row of Y is the corresponding row of X divided by its sum (so that each row of Y sums to 1.0)
Plotting the sorted values of Y (each column sorted separately) yields approximately the same cumulative distribution:
A true uniform distribution would yield a straight line from 0 to the maximum value. You'll notice that it's sort of vaguely similar to a true uniform distribution, except at the end where there's a long tail. There's an excess of numbers generated between 0.2 and 0.5. The tail gets worse for larger values of N, because although the average value of the numbers goes down (mean = 1 / N), the maximum value stays at 1.0: the vector consisting of 9 values of 0.0 and 1 value of 1.0 is valid and can be generated this way, but is pathologically rare.
If you don't care about this, go ahead and use this method. And there are probably ways to generate "almost"-uniform or "almost"-gaussian distributions with desired sums, that are much simpler and more efficient than the one I describe above. But I caution you to be careful and understand the consequences of the algorithm you choose.
One fixup that leaves things sort-of-uniformly distributed without the long tail is as follows:
1. Generate a vector V of N uniformly-distributed random numbers from 0.0 to 1.0.
2. Find their sum S and their maximum value M.
3. If S < k*M (the maximum value is too much of an outlier), go back to step 1. I'm not sure what value to use for k, maybe k = N/2?
4. Output the vector V*Sdesired/S
Example in MATLAB for N=10:
>> X = rand(100000,10);
>> Y = X ./ repmat(sum(X,2),1,10);
>> i = sum(X,2)>(10/2)*max(X,[],2);
>> plot(sort(Y(i,:)))
All right, we're going to tackle the problem assuming the requirement is to generate a random vector of length N that is uniformly distributed over the allowed space, restated as follows:
Given
a desired length N,
a desired total sum S,
a range of allowed values [0,B] for each scalar value,
generate a random vector V of length N such that the random variable V is uniformly distributed throughout its permitted space.
We can simplify the problem by noting that we can calculate V = U * S where U is a similar random vector with desired total sum 1, and a range of allowed values [0,b] where b = B/S. The value b must be between 1/N and 1.
First consider N = 3. The space of allowed values {U} is a portion of a plane perpendicular to the vector [1 1 1] that passes through the point [1/3 1/3 1/3] and which lies inside the cube whose components range between 0 and b. This set of points {U} is shaped like a hexagon.
(TBD: picture. I can't generate one right now, I need access to MATLAB or another program that can do 3D plots. My installation of Octave can't.)
It is best to use an orthonormal weighting matrix W (see my other answer) with one vector = [1 1 1]/sqrt(3). One such matrix is
octave-3.2.3:1> A=1/sqrt(3)
A = 0.57735
octave-3.2.3:2> K=1/sqrt(3)/(sqrt(3)-1)
K = 0.78868
octave-3.2.3:3> W = [A A A; A 1-K -K; A -K 1-K]
W =
0.57735 0.57735 0.57735
0.57735 0.21132 -0.78868
0.57735 -0.78868 0.21132
which, again, is orthonormal (W*W = I)
If you consider the points of the cube [0 0 b],[0 b b],[0 b 0],[b b 0],[b 0 0], and [b 0 b] these form a hexagon and are all a distance of b*sqrt(2/3) from the cube's diagonal. These do not satisfy the problem in question, but are useful in a minute. The other two points [0 0 0] and [b b b] are on the cube's diagonal.
The orthonormal weighting matrix W allows us to generate points that are uniformly distributed within {U}, because orthonormal matrices are coordinate transformations that rotate/reflect and do not scale or skew.
We will generate points that are uniformly distributed in the coordinate system defined by the 3 vectors of W. The first component is the axis of the diagonal of the cube. The sum of U's components depends completely upon this axis and not at all on the others. Therefore the coordinate along this axis is forced to be 1/sqrt(3) which corresponds to the point [1/3, 1/3, 1/3].
The other two components are in directions perpendicular to the cube's diagonal. Since the maximum distance from the diagonal is b*sqrt(2/3), we will generate uniformly distributed numbers (u,v) between -b*sqrt(2/3) and +b*sqrt(2/3).
This gives us a random variable U' = [1/sqrt(3) u v]. We then compute U = U' * W. Some of the resulting points will be outside the allowable range (each component of U must be between 0 and b), in which case we reject that and start over.
In other words:
1. Generate independent random variables u and v that are each uniformly distributed between -b*sqrt(2/3) and +b*sqrt(2/3).
2. Calculate the vector U' = [1/sqrt(3), u, v]
3. Compute U = U' * W.
4. If any of U's components are outside the range [0,b], reject this value and go back to step 1.
5. Calculate V = U * S.
The solution is similar for higher dimensions (uniformly distributed points within a portion of the hyperplane perpendicular to a hypercube's main diagonal):
Precalculate a weighting matrix W of rank N.
Generate independent random variables u1, u2, ... uN-1 each uniformly distributed between -b*k(N) and +b*k(N).
Calculate the vector U' = [1/sqrt(N), u1, u2, ..., uN-1]
Compute U = U' * W. (there are shortcuts to actually having to construct and multiply by W.)
If any of U's components are outside the range [0,b], reject this value and go back to step 1.
Calculate V = U * S.
The range k(N) is a function of N that represents the maximum distance of the vertices of a hypercube of side 1 from its main diagonal. I'm not sure of the general formula but it's sqrt(2/3) for N = 3, sqrt(6/5) for N = 5, there's probably a formula for it somewhere.
I ran into this problem and specifically needed integers. An answer is to use the multinomial.
import numpy.random, numpy
total_sum = 20
n = 6
v = numpy.random.multinomial(total_sum, numpy.ones(n)/n)
As the multinomial documentation explains, you have rolled a fair six-sided die twenty times. v contains six numbers indicating the number of times each side of the die came up. Naturally, the elements of v have to sum to twenty. Here, six is n and twenty is total_sum.
With the multinomial, you can also simulate an unfair die, which is very useful in some cases.
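For example, reusing total_sum from the snippet above, a hypothetical biased die where the first face is five times as likely as each of the others:
v_unfair = numpy.random.multinomial(total_sum, [0.5, 0.1, 0.1, 0.1, 0.1, 0.1])  # probabilities must sum to 1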
The following is quite simple, and returns uniform results:
def gen_list(numbs, limit_sum):
    limits = sorted([random.uniform(0, limit_sum) for _ in xrange(numbs-1)])
    limits = [0] + limits + [limit_sum]
    return [x1 - x0 for (x0, x1) in zip(limits[:-1], limits[1:])]
The idea is simply that if you need, say, 5 numbers between 0 and 20, you can simply put 4 "limits" between 0 and 20, and you get a partition of the (0, 20) interval. The random numbers that you want are simply the lengths of the 5 intervals in the sorted list [0, random1, random2, random3, random4, 20].
PS: oops! looks like it's the same idea as MAK's response, albeit coded without using indexes!
You could keep a running total rather than having to call sum(my_sum) repeatedly.
