Sampling real numbers with sum and minimum value constraints - python

How can I sample N random values such that the following constraints are satisfied?
the N values add up to 1.0
none of the values is less than 0.01 (or some other threshold T << 1/N)
The following procedure was my first attempt.
import numpy

def proportions(N):
    proportions = list()
    for value in sorted(numpy.random.random(N - 1) * 0.98 + 0.01):
        prop = value - sum(proportions)
        proportions.append(prop)
    prop = 1.0 - sum(proportions)
    proportions.append(prop)
    return proportions
The * 0.98 + 0.01 bit was intended to enforce the ≥ 1% constraint. It works at the margins, but not internally: if two of the sorted random values are less than 0.01 apart, the resulting proportion drops below the threshold and is never caught or corrected. Example:
>>> numpy.random.seed(2000)
>>> proportions(5)
[0.3397481983960182, 0.14892479749759702, 0.07456518420712799, 0.005868759570153426, 0.43089306032910335]
Any suggestions to fix this broken approach or to replace it with a better approach?

You could adapt Mark Dickinson's nice solution:
import random
def proportions(n):
    dividers = sorted(random.sample(range(1, 100), n - 1))
    return [(a - b) / 100 for a, b in zip(dividers + [100], [0] + dividers)]
print(proportions(5))
# [0.13, 0.19, 0.3, 0.34, 0.04]
# or
# [0.31, 0.38, 0.12, 0.05, 0.14]
# etc
Note that this assumes "none of the values is less than 0.01" is a fixed threshold of exactly 0.01.
UPDATE: We can generalize if we take the reciprocal of the threshold and use that to replace the hard-coded 100 values in the proposed code.
def proportions(N, T=0.01):
    limit = int(1 / T)
    dividers = sorted(random.sample(range(1, limit), N - 1))
    return [(a - b) / limit for a, b in zip(dividers + [limit], [0] + dividers)]
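Example usage of the generalised version (this call is mine, not part of the original answer; the exact values vary per run):

print(proportions(5, T=0.01))  # five multiples of 0.01 that sum to 1.0 (up to float rounding)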

What about this?
N/2 times, choose a random number x such that both 1/N + x and 1/N - x satisfy your constraints; add 1/N + x and 1/N - x to the result.
If N is odd, also add 1/N (a rough sketch of this idea follows below).
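A minimal sketch of that pairing idea, as I read it (the function name proportions_paired and the threshold parameter T are my assumptions; each pair sums to exactly 2/N, so the total stays 1.0):

import random

def proportions_paired(N, T=0.01):
    base = 1.0 / N
    values = []
    for _ in range(N // 2):
        # keep both base + x and base - x at or above the threshold T
        x = random.uniform(0.0, base - T)
        values.extend([base + x, base - x])
    if N % 2:  # odd N: the leftover value is exactly 1/N
        values.append(base)
    random.shuffle(values)
    return values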


Create a bias for a range of numbers within a larger range

Using Python, I'm trying to create a program that randomly picks a float between 0 and 359.9. I wanted to be able to specify a bias for a range within those numbers. This might look like a function that takes 3 floats as arguments. The first two arguments would serve as the lower and upper bounds for my biased range. The third would be the probability that a number from that range would be picked. Beyond very basic usage of the random module, my knowledge in this area and in statistics in general is lacking. So I'm open to the possibility that a function may not be the best way to go about this, and grateful for any suggestions.
With something like this, I'd like a generic solution. I'm assuming:
biased ranges don't overlap, i.e. it makes little sense to have (90, 270, 0.6) as well as (180, 360, 0.4); unless you'd want that to mean the same as [(90, 180, 0.3), (180, 270, 0.3 + 0.2), (270, 360, 0.2)], in which case you can either create a function to simplify it for you, or just provide it like that.
in this case you want values in a degrees range of [0, 360), but in other cases you might want values in the range [0, 24) (hours), or [0, 60) (minutes), etc.
the summed chances for all bias ranges should add up to <= 1
My approach is to first use one random value to select which bias range, if any, the result comes from, and then either generate a random value in that range or, if no range was selected, generate a random value in the combined remaining range(s). To do the latter, a random value is generated in the range [0, scale - sum(widths of the bias ranges)); then, iterating over the bias ranges, check whether the value falls before the current range and return it, or add the width of the current range to skip over it and repeat.
The code:
from random import random
from typing import List, Tuple

def biased_random(bias: List[Tuple[float, float, float]], scale: float = 1) -> float:
    """
    Returns a float in the range [0.0, scale), but accepts a
    sorted list of biases of the form ([start, end), chance), where the sum of
    all chance values should be <= 1, and no interval [start, end) overlaps

    :param bias: list of 3-tuples indicating ([start, end), chance) for a bias
    :param scale: high end of the biased_random function, [0, scale)
    :return: random float in the range [0.0, scale) applying bias
    """
    assert sum(b[2] for b in bias) <= 1, \
        'total bias chance cannot exceed 1'
    assert all(0 <= b[0] < scale and b[0] <= b[1] < scale for b in bias), \
        'every interval is in range [0, scale)'
    assert all(b[0] <= b[1] for b in bias), \
        'every interval is positive length'
    assert all(b1[1] <= b2[0] for b1, b2 in zip(bias, bias[1:])), \
        'intervals are sorted with no overlaps'

    select = random()
    for b in bias:
        if select < b[2]:
            return b[0] + random() * (b[1] - b[0])
        else:
            select -= b[2]
    else:
        result = random() * (scale - sum(b[1] - b[0] for b in bias))
        for b in bias:
            if result < b[0]:
                return result
            else:
                result += b[1] - b[0]
        return result

def main():
    for __ in range(10):
        print(biased_random([(90, 180, 0.5), (180, 270, 0.25)], 360))

if __name__ == '__main__':
    main()
Similarly, if you wanted to generate a biased time of day, you could:
biased_random([(9, 17, 0.9)], 24) # 90% chance for a time within work hours
Note that you could get rid of the assertions for better performance, if you call this function very frequently (although they wouldn't be checked if you run Python with -O). Another solution would be to separate the checks into another function, only check a list of biases once and then reuse it.
Further optimisation could be achieved by creating a function that does all the checks and returns a partial function that has the scale, biases and a precomputed sum of ranges already enclosed, which you can then call for repeated random values with those biases applied.
Something like:
from random import random
from typing import List, Tuple, Callable
from functools import partial

def biased_random(bias: List[Tuple[float, float, float]], scale: float = 1, rest_range: float = None) -> float:
    """
    Returns a float in the range [0.0, scale), but accepts a
    sorted list of biases of the form ([start, end), chance), where the sum of
    all chance values should be <= 1, and no interval [start, end) overlaps

    :param bias: list of 3-tuples indicating ([start, end), chance) for a bias
    :param scale: high end of the biased_random function, [0, scale)
    :param rest_range: the remainder of range outside the bias (scale - sum of ranges)
    :return: random float in the range [0.0, scale) applying bias
    """
    select = random()
    if rest_range is None:
        rest_range = scale - sum(b[1] - b[0] for b in bias)
    for b in bias:
        if select < b[2]:
            return b[0] + random() * (b[1] - b[0])
        else:
            select -= b[2]
    else:
        result = random() * rest_range
        for b in bias:
            if result < b[0]:
                return result
            else:
                result += b[1] - b[0]
        return result

def get_biased_random(bias: List[Tuple[float, float, float]], scale: float = 1) -> Callable[[], float]:
    """
    Generate a biased_random function with a checked and set bias and scale

    :param bias: list of 3-tuples indicating ([start, end), chance) for a bias
    :param scale: high end of the biased_random function, [0, scale)
    :return: a biased_random function that will generate a biased value on every call
    """
    assert sum(b[2] for b in bias) <= 1, \
        'total bias chance cannot exceed 1'
    assert all(0 <= b[0] < scale and b[0] <= b[1] < scale for b in bias), \
        'every interval is in range [0, scale)'
    assert all(b[0] <= b[1] for b in bias), \
        'every interval is positive length'
    assert all(b1[1] <= b2[0] for b1, b2 in zip(bias, bias[1:])), \
        'intervals are sorted with no overlaps'
    return partial(biased_random, bias, scale, scale - sum(b[1] - b[0] for b in bias))

def main():
    biased_random_degrees = get_biased_random([(90, 180, 0.5), (180, 270, 0.25)], 360)
    for __ in range(10):
        print(biased_random_degrees())

    biased_random_time = get_biased_random([(9, 17, 0.9)], 24)
    for __ in range(10):
        print(biased_random_time())

if __name__ == '__main__':
    main()
Note that biased_random still works as in the first solution, but get_biased_random now allows you to generate a checked partial function which you can then call as many times as you like, with the certainty that the enclosed bias and scale are correct, and with the remaining range pre-computed to save time.
import random

SECTOR_BIAS = {
    (0.0, 59.99999): 0.9,
    (60.0, 119.99999): 0.02,
    (120.0, 179.99999): 0.02,
    (180.0, 239.99999): 0.02,
    (240.0, 299.99999): 0.02,
    (300.0, 359.99999): 0.02,
}

a = []
probability = []
for i in SECTOR_BIAS:
    a.append(i)
    probability.append(SECTOR_BIAS[i])

n = random.random() * 360
print(n)

def bias(lst, probability):
    zipped = zip(lst, probability)
    lst = [[i[0]] * int(i[1] * 100) for i in zipped]
    new = [b for i in lst for b in i]
    return new

biased_list = bias(a, probability)
random_range = random.choice(biased_list)
result = random.randint(int(random_range[0]), int(random_range[1])) + random.random()
print(result)
There are a couple of ways to achieve such behaviour; the easiest is probably just to duplicate items: if you want a probability of ⅓ for option x and ⅔ for y, you could use random.choice([x, y, y]).
Note: I might be slightly wrong with my probability calculations
import random

# This way there is about twice the chance that the result will be from the biased list than before
# 10% -> ~19% from the biased range (not 20%, because we added more items)
def multiply_items(full_range: tuple = (0, 100), biased: tuple = (0, 10), biased_by: float = 0.5) -> int:
    l = list(range(full_range[0], full_range[1]))
    l += list(range(biased[0], biased[1])) * int(1 / (1 - biased_by) - 1)
    return random.choice(l)
Edit: This is probably the one you're looking for
Another option is to get a random number between 0 and 1 and check whether it is below the bias probability; if so, choose from the biased range, otherwise choose from the rest:

# This way there is a biased_by chance (50% by default) that the result comes from the biased
# range, and otherwise it comes from the rest of the full range (the biased part is excluded)
# 10% -> 50% from the biased range
def divided_choice(full_range: tuple = (0, 100), biased: tuple = (0, 10), biased_by: float = 0.5) -> int:
    biased_list = range(biased[0], biased[1])
    full_list = range(full_range[0], full_range[1])
    full_list = [x for x in full_list if x not in biased_list]
    if random.random() < biased_by:
        return random.choice(biased_list)
    return random.choice(full_list)
Examples:
# 10%->~52%
multiply_items((0, 100), (0, 10), 0.9)
# 10%->25%
multiply_items((0, 100), (0, 10), 0.7)
# 10%->10% (won't change because of rounding, to make it do something you'll need to do more math)
multiply_items((0, 100), (0, 10), 0.2)
# 10%->20%
divided_choice((0, 100), (0, 10), 0.2)
# 10%->70%
divided_choice((0, 100), (0, 10), 0.7)
# 10%->90%
divided_choice((0, 100), (0, 10), 0.9)
This is the main idea behind it.

Finding the smallest solution set, if one exists (two multipliers)

Note: This is the two-multipliers variation of this problem
Given a set A, consisting of floats between 0.0 and 1.0, find a smallest set B such that for each a in A, there is either a value where a == B[x], or there is a pair of unique values where a == B[x] * B[y].
For example, given
$ A = [0.125, 0.25, 0.5, 0.75, 0.9]
A possible (but probably not smallest) solution for B is
$ B = solve(A)
$ print(B)
[0.25, 0.5, 0.75, 0.9]
This satisfies the initial problem, because A[0] == B[0] * B[1], A[1] == B[1], etc., which allows us to recreate the original set A. The length of B is smaller than that of A, but I’m guessing there are smaller answers as well.
I assume that the solution space for B is large, if not infinite. If a solution exists, how would a smallest set B be found?
Notes:
We're not necessarily limited to the items in A. B can consist of any set of values, whether or not they exist in A.
Since items in A are all 0-1 floats, I'm assuming that B will also be 0-1 floats. Is this the case?
This may be a constraint satisfaction problem, but I'm not sure how it would be defined?
Since floating point math is generally problematic, any answer should frame the algorithm around rational numbers.
Sort the array. For each pair of elements Am, An ∈ A with m < n, calculate their ratio Am / An.
Check if the ratio is equal to some element in A, which is not equal to Am nor to An.
Example:
A = { 0.125, 0.25, 0.5, 0.75, 0.9 }
(0.125, 0.25): 0.5 <--- bingo
(0.125, 0.5 ): 0.25 <--- bingo
(0.125, 0.75): 0.1(6)
(0.125, 0.9 ): 0.13(8)
(0.25 , 0.5 ): 0.5
(0.25 , 0.75): 0.(3)
(0.25 , 0.9 ): 0.2(7)
(0.5 , 0.75): 0.(6)
(0.5 , 0.9 ): 0.(5)
(0.75 , 0.9 ): 0.8(3)
The numerator (0.125) is redundant: it equals 0.25 * 0.5 (equivalently 0.5 * 0.25).
We can do better by introducing new elements:
Another example:
A = { 0.1, 0.11, 0.12, 0.2, 0.22, 0.24 }
(0.1 , 0.11): 0.(90) ***
(0.1 , 0.12): 0.8(3) +++
(0.1 , 0.2 ): 0.5 <--------
(0.1 , 0.22): 0.(45)
(0.1 , 0.24): 0.41(6)
(0.11, 0.12): 0.91(6) ~~~
(0.11, 0.2 ): 0.55
(0.11, 0.22): 0.5 <--------
(0.11, 0.24): 0.458(3)
(0.12, 0.2 ): 0.6
(0.12, 0.22): 0.(54)
(0.12, 0.24): 0.5 <--------
(0.2 , 0.22): 0.(90) ***
(0.2 , 0.24): 0.8(3) +++
(0.22, 0.24): 0.91(6) ~~~
Any 2 or more pairs (a1,a2), (a3,a4), (... , ...) with a common ratio f can be replaced with { a1, a3, ..., f }.
Hence adding 0.5 to our set makes { 0.1, 0.11, 0.12 } redundant.
B = {0.2, 0.22, 0.24, 0.5}
We are now (in the general case) left with an optimization problem of selecting which of these elements to remove and which of these factors to add in order to minimize the cardinality of B (which I leave as an exercise to the reader).
Note that there is no need to introduce numbers greater than 1. B can also be represented as { 0.1, 0.11, 0.12, 2} but this set has the same cardinality.
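A small sketch of that ratio-grouping step (my code, not the answerer's; it uses Fraction, per the question's note about floating point, and the helper name pairwise_ratios is mine):

from collections import defaultdict
from fractions import Fraction
from itertools import combinations

def pairwise_ratios(A):
    # group the pairs (small, large) of A by their exact ratio small / large
    ratios = defaultdict(list)
    fracs = [Fraction(str(a)) for a in sorted(A)]
    for small, large in combinations(fracs, 2):
        ratios[small / large].append((small, large))
    return ratios

for f, pairs in pairwise_ratios([0.1, 0.11, 0.12, 0.2, 0.22, 0.24]).items():
    if len(pairs) > 1:  # a ratio shared by two or more pairs, like the 0.5 above
        print(float(f), [(float(a), float(b)) for a, b in pairs])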
Google's OR-Tools provide a nice CP solver which can be used to get solutions to this. You can encode your problem as a simple set of boolean variables, saying which variables or combinations of variables are valid.
I start by pulling in the relevant part of the library and setting up a few variables:
from ortools.sat.python import cp_model
A = [0.125, 0.25, 0.5, 0.75, 0.9]
# A = [0.1, 0.11, 0.12, 0.2, 0.22, 0.24]
model = cp_model.CpModel()
we can then define a few helper functions for creating variables from our numbers:
vars = {}
def get_var(val):
    assert val >= 0 and val <= 1
    if val in vars:
        return vars[val]
    var = model.NewBoolVar(str(val))
    vars[val] = var
    return var

pairs = {}
def get_pair(pair):
    if pair in pairs:
        return pairs[pair]
    a, b = pair
    av = get_var(a)
    bv = get_var(b)
    var = model.NewBoolVar(f'[{a} * {b}]')
    model.AddBoolOr([av.Not(), bv.Not(), var])
    model.AddImplication(var, av)
    model.AddImplication(var, bv)
    pairs[pair] = var
    return var
i.e. get_var(0.5) will create a boolean variable (with name '0.5'), while get_pair((0.5, 0.8)) will create a variable and set constraints so that it's only true when the variables for 0.5 and 0.8 are also true. There's a useful document on encoding boolean logic in ortools.
then we can go through A figuring out what combinations are valid and adding them as constraints to the solver:
for i, a in enumerate(A):
    opts = {(a,)}
    for a2 in A[i+1:]:
        assert a < a2
        m = a / a2
        if m == a2:
            opts.add((m,))
        elif m < a2:
            opts.add((m, a2))
        else:
            opts.add((a2, m))
    alts = []
    for opt in opts:
        if len(opt) == 1:
            alts.append(get_var(*opt))
        else:
            alts.append(get_pair(opt))
    model.AddBoolOr(alts)
next we need a way of saying that we prefer variables to be false rather than true. the minimal version of this is:
model.Minimize(sum(vars.values()))
but we get much nicer results if we complicate this a bit and put a preference on values that were in A:
costsum = 0
for val, var in vars.items():
    cost = 1000 if val in A else 1001
    costsum += var * cost
model.Minimize(costsum)
finally, we can run our solver and print out a solution:
solver = cp_model.CpSolver()
status = solver.Solve(model)

print(solver.StatusName(status))
if status in {cp_model.FEASIBLE, cp_model.OPTIMAL}:
    B = [val for val, var in vars.items() if solver.Value(var)]
    print(sorted(B))
this gives me back the expected sets of:
[0.125, 0.5, 0.75, 0.9] and [0.2, 0.22, 0.24, 0.5]
for the two examples at the top
you could also encode the fact that you only consider solutions valid if |B| < |A| in the solver, but I'd be tempted to do that check outside, as sketched below.
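A minimal sketch of that outside check (my addition, reusing status and B from the code above; note that B is only defined when the solver found a solution):

if status in {cp_model.FEASIBLE, cp_model.OPTIMAL} and len(B) < len(A):
    print("use the strictly smaller set:", sorted(B))
else:
    print("no set strictly smaller than A was found")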

Scale/Transform/Normalise NumPy Array between Two Values

I have the following scenario:
import numpy as np

value_range = [250.0, 350.0]
precision = 0.01

unique_values = len(np.arange(min(value_range),
                              max(value_range) + precision,
                              precision))
This means all values range between 250.0 and 350.0 with a precision of 0.01, giving a potential total of 10001 unique values that the data set can have.
# This is the data I'd like to scale
values_to_scale = np.arange(min(value_range),
                            max(value_range) + precision,
                            precision)

# These are the bins I want to assign to
unique_bins = np.arange(1, unique_values + 1)
You can see in the above example, each value in values_to_scale will map exactly to its corresponding item in the unique_bins array. I.e. a value of 250.0 (values_to_scale[0]) will equal 1 (unique_bins[0]), etc.
However, if my values_to_scale array looks like:
values_to_scale = np.array((250.66, 342.02))
How can I do the scaling/transformation to get the unique bin value? I.e. 250.66 should equal a value of 66 but how do I obtain this?
NOTE The value_range could equally be between -1 and 1, I'm just looking for a generic way to scale/normalise data between two values.
You're basically looking for a linear interpolation between min and max:
minv = min(value_range)
maxv = max(value_range)
unique_values = int(((maxv - minv) / precision) + 1)
((values_to_scale - minv) / (maxv + precision - minv) * unique_values).astype(int)
# array([ 65, 9202])
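If exact bin indices at a fixed precision are the goal, a rounding-based variant (my sketch, not part of the answer above) is less sensitive to floating point truncation; whether you add 1 depends on whether your bins start at 0 or 1:

import numpy as np

value_range = [250.0, 350.0]
precision = 0.01
values_to_scale = np.array([250.66, 342.02])

minv = min(value_range)
# number of precision steps above the minimum, rounded to the nearest step
steps = np.round((values_to_scale - minv) / precision).astype(int)
print(steps)      # [  66 9202]
print(steps + 1)  # 1-based bins, so 250.0 maps to 1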

Is there a more efficient way to enumerate the probability of each possible outcome of a discrete random variable in Python or R?

I am computing the pmf theoretically in Python. Here is the code.
>>> import numpy as np
>>> import pandas as pd
>>> a_coin = np.array([0,1])
>>> three_coins = np.array(np.meshgrid(a_coin,a_coin,a_coin)).T.reshape(-1,3)
>>> heads = np.sum(three_coins, axis = 1)
>>> df = pd.DataFrame({'heads': heads, 'prob': 1/8})
>>> np.array(df.groupby('heads').sum()['prob'])
array([0.125, 0.375, 0.375, 0.125])
This piece of code simulates one toss of 3 fair coins.
The possible outcomes (number of heads) are {0, 1, 2, 3}.
The last line computes the probability of each possible outcome.
I would have to put 10 copies of a_coin in np.meshgrid(a_coin, ..., a_coin) to compute the pmf for tossing 10 fair coins, which seems tedious and inefficient (a generalisation sketch follows below).
The question is: is there a more efficient way to do this in Python or R?
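For reference, the meshgrid line above generalises without typing a_coin ten times by unpacking a list of copies (a sketch of mine, not from the answers below; np.bincount stands in for the pandas groupby):

import numpy as np

n = 10
a_coin = np.array([0, 1])
# unpack n copies of a_coin instead of writing them out by hand
outcomes = np.array(np.meshgrid(*[a_coin] * n)).T.reshape(-1, n)
heads = outcomes.sum(axis=1)
print(np.bincount(heads) / 2 ** n)  # pmf for 0..n heads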
Here's how to do it in R:
> sapply(0:3, choose, n=3)/sum(sapply(0:3, choose, n=3))
[1] 0.125 0.375 0.375 0.125
The choose function gives you the binomial coefficients. To turn them into probabilities just divide by their sums:
sapply(0:10, choose, n=10)
[1] 1 10 45 120 210 252 210 120 45 10 1
sapply(0:10, choose, n=10)/ sum( sapply(0:10, choose, n=10))
[1] 0.0009765625 0.0097656250 0.0439453125 0.1171875000 0.2050781250 0.2460937500 0.2050781250
[8] 0.1171875000 0.0439453125 0.0097656250 0.0009765625
It did not appear that you really wanted to enumerate so much as calculate. If you need to enumerate outcomes from 10 successive "fair" binomial draws, then you could use combn 11 times.
Here is an FFT-based numpy solution:
import numpy as np
from scipy import fftpack

def toss(n=10, p=0.5):
    t1 = np.zeros(fftpack.next_fast_len(n+1))
    t1[:2] = 1-p, p
    f1 = fftpack.rfft(t1)
    c1 = f1[1:(len(t1) - 1) // 2 * 2 + 1].view(f'c{2*t1.itemsize}')
    c1 **= n
    f1[::(len(t1) + 1) // 2 * 2 - 1] **= n
    return fftpack.irfft(f1)[:n+1]
For example:
>>> toss(3)
array([0.125, 0.375, 0.375, 0.125])
>>> toss(10)
array([0.00097656, 0.00976562, 0.04394531, 0.1171875 , 0.20507813,
0.24609375, 0.20507813, 0.1171875 , 0.04394531, 0.00976562,
0.00097656])
Using Python standard libraries you can get the probabilities as rational numbers (this is an exact solution), e.g.
from fractions import Fraction
from math import factorial
n=30
[Fraction(factorial(n), factorial(n - j)) * Fraction(1, factorial(j) * 2 ** n) for j in range(0, n + 1)]
This could be easily converted to floats, e.g.
list(map(float, [Fraction(factorial(n), factorial(n - j)) * Fraction(1, factorial(j) * 2 ** n) for j in range(0, n + 1)]))
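For completeness (my note, not part of the answer above): if an exact rational result is not required, scipy.stats.binom gives the same pmf directly:

import numpy as np
from scipy.stats import binom

n = 10
print(binom.pmf(np.arange(n + 1), n, 0.5))  # matches toss(10) above up to floating point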

Generating a list of random numbers, using custom bounds and summing to a desired value

I want to do something very similar to what is described in this answer. I want to create a list of random numbers that sum up to a given target value. If I did not care about the bounds, I could use what that answer suggests:
>>> print np.random.dirichlet(np.ones(10),size=1)
[[ 0.01779975 0.14165316 0.01029262 0.168136 0.03061161 0.09046587 0.19987289 0.13398581 0.03119906 0.17598322]]
However, I want to be able to control the ranges and the target of the individual parameters. I want to provide the bounds of each parameter. For instance, I would pass a list of three tuples, with each tuple specifying the lower and upper boundary of the uniform distribution. The target keyword argument would describe what the sum should add up to.
get_rnd_numbers([(0.0, 1.0), (0.2, 0.5), (0.3, 0.8)], target=0.9)
The output could for example look like this:
[0.2, 0.2, 0.5]
How could that be achieved?
Update:
Normalising, i.e. dividing by the sum of all random numbers, is not acceptable as it would distort the distribution.
The solution should work with an arbitrary number of parameters / tuples.
As was mentioned in the comment, this question is actually very similar but in another programming language.
from random import uniform

while True:
    a = uniform(0.0, 1.0)
    b = uniform(0.2, 0.5)
    c = 0.9 - a - b
    if 0.3 < c < 0.8:
        break

print(a, b, c)
Just draw the first two values at random, then subtract them from the target to get the third 'random number', and check that it satisfies its boundary conditions.
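A hedged generalisation of this rejection idea to an arbitrary number of (lower, upper) tuples (my sketch; the function body and the max_tries cap are mine, not the answerer's):

from random import uniform

def get_rnd_numbers(bounds, target, max_tries=100000):
    # draw all but the last value uniformly, derive the last one from the target,
    # and retry until that derived value falls inside its own bounds
    *free, (last_lo, last_hi) = bounds
    for _ in range(max_tries):
        values = [uniform(lo, hi) for lo, hi in free]
        last = target - sum(values)
        if last_lo < last < last_hi:
            return values + [last]
    raise RuntimeError("no sample satisfied the bounds")

print(get_rnd_numbers([(0.0, 1.0), (0.2, 0.5), (0.3, 0.8)], target=0.9))

Note that, as in the three-variable version above, the derived last value is not distributed like the freely drawn ones.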
OK, here is some idea/code to play with.
We will sample from a Dirichlet distribution, so the sum objective is automatically fulfilled.
Then, for each xi sampled from the Dirichlet, we apply a linear transformation with a different lower boundary li but the same scaling parameter s:
vi = li + s*xi
From the summation objective (Σi means summation over i) and the fact that Dirichlet samples always sum to 1,
Σi vi = target,
we can compute s:
s = target - Σi li
Let's put the mean value of each vi right in the middle of its interval:
E[vi] = li + s*E[xi] = (li + hi) / 2
E[xi] = (hi - li) / 2 / s
And let's introduce a knob that is roughly proportional to the inverse variance of the Dirichlet: the bigger the knob, the tighter the sampled values cluster around their means. So for the Dirichlet distribution's alpha parameter array,
alphai = E[xi] * vscale
where vscale is a user-defined variance scale factor. We check whether the sampled values violate the lower or upper boundary conditions and reject the sample if they do.
Code, Python 3.6, Anaconda 5.2
import numpy as np

boundaries = np.array([[0.0, 1.0], [0.2, 0.5], [0.3, 0.8]])
target = 0.9

def get_rnd_numbers(boundaries, target, vscale):
    lo = boundaries[:, 0]
    hi = boundaries[:, 1]
    s = target - np.sum(lo)
    alpha_i = (0.5 * (hi - lo) / s) * vscale
    print(np.sum(alpha_i))

    x_i = np.random.dirichlet(alpha_i, size=1)
    v_i = lo + s * x_i

    good_lo = not np.any(v_i < lo)
    good_hi = not np.any(v_i > hi)

    return (good_lo, good_hi, v_i)

vscale = 3.0

gl, gh, v = get_rnd_numbers(boundaries, target, vscale)
print((gl, gh, v, np.sum(v)))
if gl and gh:
    print("Good sample, use it")

gl, gh, v = get_rnd_numbers(boundaries, target, vscale)
print((gl, gh, v, np.sum(v)))
if gl and gh:
    print("Good sample, use it")

gl, gh, v = get_rnd_numbers(boundaries, target, vscale)
print((gl, gh, v, np.sum(v)))
if gl and gh:
    print("Good sample, use it")
You could play with different transformation ideas, and maybe remove the mean condition or replace it with something more sensible. I would advise keeping the idea of the knob, so you can tighten your sampling spread. A small retry wrapper is sketched below.
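A minimal retry wrapper around get_rnd_numbers (my sketch; the max_tries cap is an assumption, not part of the answer above):

def sample_until_valid(boundaries, target, vscale, max_tries=1000):
    # keep drawing until a sample respects every lower and upper bound
    for _ in range(max_tries):
        good_lo, good_hi, v_i = get_rnd_numbers(boundaries, target, vscale)
        if good_lo and good_hi:
            return v_i[0]
    raise RuntimeError("no valid sample found; try a different vscale")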
