Generate non-uniform random numbers [duplicate]


Algo (Source: Elements of Programming Interviews, 5.16)
You are given n numbers as well as probabilities p0, p1, ..., pn-1
which sum up to 1. Given a random number generator that produces values in
[0,1] uniformly, how would you generate one of the n numbers according
to their specific probabilities?
Example
If numbers are 3, 5, 7, 11, and the probabilities are 9/18, 6/18,
2/18, 1/18, then in 1000000 calls to the program, 3 should appear
500000 times, 7 should appear 111111 times, etc.
The book says to create intervals from the partial sums p0, p0 + p1, p0 + p1 + p2, etc., so in the example above the intervals are [0.0, 0.5), [0.5, 0.8333), etc. Combining these intervals into a sorted array of endpoints could look something like [1/18, 3/18, 9/18, 18/18]. Then run the random number generator and find the smallest endpoint that is larger than the generated value; the array index it corresponds to maps to an index in the given n numbers.
This would require O(N) pre-processing time and then O(log N) to binary search for the value.
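For reference, the book's approach can be sketched in a few lines of Python. This is only a minimal sketch: the names make_sampler and sample are made up for illustration, and the optional u argument exists purely so the lookup can be checked with a known draw.

```python
import bisect
import itertools
import random

def make_sampler(numbers, probs):
    # Running totals p0, p0+p1, ...: the O(N) pre-processing step.
    cumulative = list(itertools.accumulate(probs))

    def sample(u=None):
        # u is a uniform draw in [0, 1); pass it explicitly only for testing.
        if u is None:
            u = random.random()
        # Index of the first endpoint strictly greater than u: O(log N).
        i = bisect.bisect_right(cumulative, u)
        # Guard against the last endpoint rounding to slightly below 1.
        return numbers[min(i, len(numbers) - 1)]

    return sample

sample = make_sampler([3, 5, 7, 11], [9/18, 6/18, 2/18, 1/18])
value = sample()  # one of 3, 5, 7, 11
```

Here the endpoints are kept in the order the numbers were given rather than sorted by probability; the lookup works the same either way.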
I have an alternate solution that requires O(N) pre-processing time and O(1) execution time, and am wondering what may be wrong with it.
Why can't we iterate through each number in n, multiplying [n] * 100 * probability that matches with n. E.g [3] * (9/18) * 100. Concatenate all these arrays to get, at the end, a list of 100 elements, with the number of elements for each mapping to how likely it is to occur. Then, run the random num function and index into the array, and return the value.
Wouldn't this be more efficient than the provided solution?

Your number 100 is not independent of the input; it depends on the given p values. Any parameter that depends on the magnitude of the input values is really exponential in the input size, meaning you are actually using exponential space. Just constructing that array would thus take exponential time, even if it was structured to allow constant lookup time after generating the random number.
Consider two p values, 0.01 and 0.99. 100 values is sufficient to implement your scheme. Now consider 0.001 and 0.999. Now you need an array of 1,000 values to model the probability distribution. The amount of space grows with (I believe) the ratio of the largest p value and the smallest, not in the number of p values given.

If you have rational probabilities, you can make that work. Rather than 100, you must use a common denominator of the rational proportions. Insisting on 100 items will not fulfill the specs of your assigned example, let alone more diabolical ones.
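A minimal sketch of the common-denominator version, assuming the probabilities are available as exact fractions (the function name build_table is made up for illustration):

```python
import random
from fractions import Fraction
from functools import reduce
from math import gcd

def build_table(numbers, probs):
    fracs = [Fraction(p) for p in probs]
    # Least common multiple of the reduced denominators gives the table size.
    size = reduce(lambda a, b: a * b // gcd(a, b),
                  (f.denominator for f in fracs), 1)
    table = []
    for value, f in zip(numbers, fracs):
        # f * size is a whole number because size is a multiple of f.denominator.
        table.extend([value] * int(f * size))
    return table

probs = [Fraction(9, 18), Fraction(6, 18), Fraction(2, 18), Fraction(1, 18)]
table = build_table([3, 5, 7, 11], probs)
pick = table[random.randrange(len(table))]  # O(1) lookup per draw
```

For the example in the question the table has 18 entries, not 100. For probabilities like 1/1000 and 999/1000 it has 1000 entries, which is the space blow-up described in the other answer.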

Related

How to iterate through the Cartesian product of ten lists (ten elements each) faster? (Probability and Dice)

I'm trying to solve this task.
I wrote function for this purpose which uses itertools.product() for Cartesian product of input iterables:
def probability(dice_number, sides, target):
    from itertools import product
    from decimal import Decimal
    FOUR_PLACES = Decimal('0.0001')
    total_number_of_experiment_outcomes = sides ** dice_number
    target_hits = 0
    sides_combinations = product(range(1, sides+1), repeat=dice_number)
    for side_combination in sides_combinations:
        if sum(side_combination) == target:
            target_hits += 1
    p = Decimal(str(target_hits / total_number_of_experiment_outcomes)).quantize(FOUR_PLACES)
    return float(p)
When calling probability(2, 6, 3) output is 0.0556, so works fine.
But calling probability(10, 10, 50) takes a very long time (hours?), and there must be a better way :)
The loop for side_combination in sides_combinations: takes too long to iterate through the huge number of sides_combinations.
Please, can you help me find out how to speed up the calculation? I want to sleep tonight.

I guess the problem is to find the distribution of the sum of dice. An efficient way to do that is via discrete convolution. The distribution of the sum of independent variables is the convolution of their probability mass functions (or densities, in the continuous case). Convolution is associative, so you can compute it conveniently just two pmfs at a time (the current distribution of the total so far, and the next one in the list). Then from the final result, you can read off the probabilities for each possible total. The first element in the result is the probability of the smallest possible total, and the last element is the probability of the largest possible total. In between you can figure out which one corresponds to the particular sum you're looking for.
The hard part of this is the convolution, so work on that first. It's just a simple summation, but it's just a little tricky to get the limits of the summation correct. My advice is to work with integers or rationals so you can do exact arithmetic.
After that you just need to construct an appropriate pmf for each input die. The input is just [1, 1, 1, ... 1] if you're using integers (you'll have to normalize eventually) or [1/n, 1/n, 1/n, ..., 1/n] if rationals, where n = number of faces. Also you'll need to label the indices of the output correctly -- again this is just a little tricky to get it right.
Convolution is a very general approach for summations of variables. It can be made even more efficient by implementing convolution via the fast Fourier transform, since FFT(conv(A, B)) = FFT(A) FFT(B). But at this point I don't think you need to worry about that.
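As a rough illustration of the above, here is a sketch that uses integer counts ([1, 1, ..., 1] per die) and normalizes only at the end. The helper names convolve and sum_distribution are made up; probability keeps the signature from the question.

```python
def convolve(a, b):
    # Plain discrete convolution: out[k] = sum of a[i] * b[j] over i + j == k.
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def sum_distribution(dice_number, sides):
    die = [1] * sides   # un-normalized pmf of one die (faces 1..sides)
    counts = [1]        # identity element for convolution
    for _ in range(dice_number):
        counts = convolve(counts, die)
    # counts[k] is the number of ways to roll a total of dice_number + k.
    return counts

def probability(dice_number, sides, target):
    counts = sum_distribution(dice_number, sides)
    k = target - dice_number  # index 0 is the smallest total
    if not 0 <= k < len(counts):
        return 0.0
    return counts[k] / sides ** dice_number

print(round(probability(2, 6, 3), 4))
print(probability(10, 10, 50))
```

With ten ten-sided dice this does a few thousand multiplications instead of iterating over 10**10 combinations.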

If someone is still interested in a solution that avoids the very long iteration through all the itertools.product Cartesian products, here it is:
def probability(dice_number, sides, target):
    if dice_number == 1:
        return (1 <= target <= sides**dice_number) / sides
    return sum([probability(dice_number-1, sides, target-x) \
                for x in range(1,sides+1)]) / sides
But you should add caching of the probability function's results; if you don't, the calculation will take a very long time as well.
P.S. This code is 100% not mine; I took it from the internet. I'm not smart enough to produce it myself. Hope you'll enjoy it as much as I did.
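One possible way to add the caching mentioned above is functools.lru_cache from the standard library. This is a sketch of that idea applied to the same recursion, not necessarily what the original source of the code did:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def probability(dice_number, sides, target):
    # Base case: a single die shows `target` with probability 1/sides,
    # provided target is a valid face.
    if dice_number == 1:
        return (1 <= target <= sides) / sides
    # Condition on the value x of the last die; results for repeated
    # (dice_number, sides, target) arguments are served from the cache.
    return sum(probability(dice_number - 1, sides, target - x)
               for x in range(1, sides + 1)) / sides

print(round(probability(2, 6, 3), 4))
print(probability(10, 10, 50))
```

With the cache, each distinct (dice_number, target) pair is computed only once, so probability(10, 10, 50) needs only a few hundred distinct calls.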

Choose One Item from Every List, up to N combination, uniform distribution

I have 100 lists [x1..x100] , each one containing about 10 items. [x_i_1,...x_i_10]
I need to generate 80 vectors. Each vector is a product of all the lists, kind of like itertools.product(*x), except for 2 things:
(1)
I need every item in each vector to have a uniform distribution.
for example:
[ np.random.choice(xi) for xi in [x1..x100]] would be good, except for my second condition:
(2)
i can't have repetitions.
itertools.product solves this, but it doesn't meet condition (1).
I need to generate 80 vectors, use them, and re-ask for another 80, and repeat this process until a certain condition is met.
for EACH vector across all 80-size-batch, i need them to be uniform (condition 1) and non repeating (condition 2)
Creating all permutations and shuffling that list is a great solution for a smaller list, I'm using this batch system because of the HUGE number of possible permutations
Any ideas?
thx

Just use [np.random.choice(xi) for xi in [x1..x100]]. The probability that the same vector will be generated twice in 80 trials is vanishingly small. By the birthday problem, the probability that n items chosen independently from a set of d items will contain a repeated item is approximately 1 - exp(-n*(n-1)/(2*d)). In your case n = 80 and d = 10**100. The resulting probability is zero to a ridiculously large number of decimal places (the estimate gives roughly 3.16 x 10^-97, so the probability begins 0.000 ... with 96 zeros after the decimal point). Forget 80. You could generate 80 trillion such vectors and still have a vanishingly small probability of generating the same vector twice.
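To check the estimate numerically, here is a small sketch. The function name collision_probability is made up, and math.expm1 is used because 1 - exp(-x) evaluated naively would round to exactly 0 for such a tiny x.

```python
import math

def collision_probability(n, d):
    # Birthday-problem approximation: 1 - exp(-n*(n-1)/(2*d)).
    x = n * (n - 1) / (2 * d)
    # expm1(-x) computes exp(-x) - 1 without losing precision for tiny x.
    return -math.expm1(-x)

print(collision_probability(80, 10**100))           # 80 vectors
print(collision_probability(80 * 10**12, 10**100))  # 80 trillion vectors
print(collision_probability(23, 365))               # the classic birthday case
```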

use np.random.multinomial() in python

I have a task to randomly chose 100 element from a population of alpha list [a,b,c,d] with corresponding frequency (probability) [0.1, 0.3, 0.2, 0.4].
There are many different ways to do it. But here I want the function call (suppose there is one) to return a list of the number of times each element was chosen. Say it returns (20,20,30,30); that means 20 of element a are chosen, 20 of element b are chosen, etc.
I figured that np.random.multinomial is the way to go. Following the above example, I will need to call the function np.random.multinomial(100, [0.1,0.3,0.2,0.4],1 ). Is this right ? Thanks.
Related:
fast way to uniformly remove 10% of all the elements in a given list of python

Yes, np.random.multinomial(100, [0.1,0.3,0.2,0.4], 1 ) is correct. But since you are doing only one draw you'd maybe prefer the simpler np.random.multinomial(100, [0.1,0.3,0.2,0.4]) (without the ,1) which returns an array instead of an array of (one) array.

I agree with JulienD. The word "choose" and the given probabilities just don't go together.
When we use "choose", we mean a selection without regard to order.
When probabilities are given, we mean that these are constant probabilities (unless it is stated that they are conditional). So the items are "assigned" to categories with the given probabilities.
Of course, the count in each category is not exactly 100*probability. That is only the expected value over the long run. Just like if you toss a fair coin, you don't expect it to be HTHTHT...HT. But over the long run the count of H will be half of the total tosses.
import numpy.random as npr
npr.seed(123)
npr.multinomial(100, [0.1,0.3,0.2,0.4], 1)
# Out: array([[11, 27, 18, 44]])
As the number of simulations increases, the observed frequencies converge to the given probabilities.
simulations = 1000
sum(npr.multinomial(100, [0.1,0.3,0.2,0.4], simulations))/simulations/100
#Out:array([ 0.09995, 0.29991, 0.19804, 0.4021 ])

"Running" weighted average

I'm constantly adding/removing tuples to a list in Python and am interested in the weighted average (not the list itself). Since this part is computationally quite expensive compared to the rest, I want to optimise it. What's the best way of keeping track of the weighted average? I can think of two methods:
keeping the list and calculating the weighted average every time it gets accessed/changed (my current approach)
just keep track of current weighted average and the sum of all weights and change weight and current weighted average for every add/remove action
I would prefer the 2nd option, but I am worried about "floating point errors" induced by constant addition/subtraction. What's the best way of dealing with this?

Try doing it in integers? Python bignums should make a rational argument for rational numbers (sorry, It's late... really sorry actually).
Whether you will experience much floating point drift really depends on how many terms you are using and what your weighting coefficient is. You only get 53 bits of precision, but you might not need that much.
If your weighting factor is less than 1, then your error should be bounded since you are constantly decreasing it. Let's say your weight is 0.6 (horrible, because you cannot represent that exactly in binary). That is 0.100110011... in binary, which has to be rounded in the last stored bit. Any error you introduce from that rounding will then be decreased after you multiply again. The error in the most current term will dominate.
Don't do the final division until you need to. Once again given 0.6 as your weight and 10 terms, your term weights will be 99.22903012752124 for the first term all the way down to 1 for the last term (0.6**-t). Multiply your new term by 99.22..., add it to your running sum and subtract the trailing term out, then divide by 246.5725753188031 (sum([0.6**-x for x in range(0,10)])).
If you really want to adjust for that, you can add a ULP to the term you are about to remove, but this will just underestimate intentionally, I think.

Here is an answer that retains floating point for keeping a running total - I think a weighted average requires only two running totals:
Allocate an array to store your numbers in, so that inserting a number means finding an empty space in the array and setting it to that value and deleting a number means setting its value in the array to zero and declaring that space empty - you can use a linked list of free entries to find empty entries in time O(1)
Now you need to work out the sum of an array of size N. Treat the array as a full binary tree, as in heapsort, so offset 0 is the root, 1 and 2 are its children, 3 and 4 are the children of 1, 5 and 6 are the children of 2, and so on - the children of i are at 2i+1 and 2i+2.
For each internal node, keep the sum of all entries at or below that node in the tree. Now when you modify an entry you can recalculate the sum of the values in the array by working your way from that entry up to the root of the tree, correcting the partial sums as you go - this costs you O(log N) where N is the length of the array.
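A rough sketch of that tree of partial sums is below. The class name SumTree and the helpers put, clear and weighted_average are made up for illustration, the capacity of 8 slots is arbitrary, and the free list for finding empty slots is left out. Two trees are kept, one for weight * value and one for weight.

```python
class SumTree:
    def __init__(self, capacity):
        self.values = [0.0] * capacity
        # sums[i] = values[i] plus everything below node i in the tree.
        self.sums = [0.0] * capacity

    def set(self, i, value):
        self.values[i] = value
        n = len(self.values)
        # Recompute partial sums from slot i up to the root: O(log N).
        while True:
            total = self.values[i]
            left, right = 2 * i + 1, 2 * i + 2
            if left < n:
                total += self.sums[left]
            if right < n:
                total += self.sums[right]
            self.sums[i] = total
            if i == 0:
                break
            i = (i - 1) // 2

    def total(self):
        return self.sums[0] if self.sums else 0.0

weighted = SumTree(8)  # holds weight * value per slot
weights = SumTree(8)   # holds weight per slot

def put(slot, value, weight):
    weighted.set(slot, value * weight)
    weights.set(slot, weight)

def clear(slot):
    # Deleting an entry means zeroing its slot in both trees.
    put(slot, 0.0, 0.0)

def weighted_average():
    return weighted.total() / weights.total()
```

Because every partial sum on the path is recomputed from its children instead of being adjusted by adding and subtracting, rounding error from earlier updates should not pile up in the totals.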

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the reservoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the reservoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.

One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
1. Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each item.
2. Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
3. Take the variable with the least remaining weight, and place as much of its mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2) left to be assigned
4. If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
5. Repeat steps 3 and 4 until none of the weight from the original list remains to be assigned to a partition.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
1. Get a U(0,1) random number, say binary 0.001100000
2. Bitshift it by lg2(|p|) to find the partition index. Here we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
3. If the partition is split, use the fractional portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.

A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
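import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
    # Key log(u)/w with u uniform in (0, 1]; 1 - random.random() avoids log(0).
    elt = [(math.log(1.0 - random.random()) / weights[i], i) for i in range(len(weights))]
    # The indices with the m largest keys are the selected items.
    return [x[1] for x in heapq.nlargest(m, elt)]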
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.

Here's what I came up with for weighted selection without replacement:
def WeightedSelectionWithoutReplacement(l, n):
    """Selects without replacement n random elements from a list of (weight, item) tuples."""
    l = sorted((random.random() * x[0], x[1]) for x in l)
    return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
import bisect
import random
def WeightedSelectionWithReplacement(l, n):
    """Selects with replacement n random elements from a list of (weight, item) tuples."""
    cum_weights = []  # running totals, kept separate so bisect compares floats with floats
    items = []
    total_weight = 0.0
    for weight, item in l:
        total_weight += weight
        cum_weights.append(total_weight)
        items.append(item)
    return [items[bisect.bisect(cum_weights, random.random()*total_weight)] for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.

I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.

It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
import random
def prep(weights):
    data_sz = len(weights)
    factor = data_sz/float(sum(weights))
    # Each bucket is [scaled weight, partner index]; a bucket starts as its own partner.
    data = [[w*factor, i] for i,w in enumerate(weights)]
    big=0
    while big<data_sz and data[big][0]<=1.0: big+=1
    for small,bucket in enumerate(data):
        # Compare by value: identity ("is not") is unreliable for ints.
        if bucket[1] != small: continue
        excess = 1.0 - bucket[0]
        while excess > 0:
            if big==data_sz: break
            bucket[1] = big
            bucket = data[big]
            bucket[0] -= excess
            excess = 1.0 - bucket[0]
            if (excess >= 0):
                big+=1
                while big<data_sz and data[big][0]<=1: big+=1
    return data
def sample(data):
    r=random.random()*len(data)
    idx = int(r)
    return data[idx][1] if r-idx > data[idx][0] else idx
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2]
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
    samples[sample(data)]+=1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))

The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element),
the un-normalized weight of the element (elementweight),
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight), and
the sum of all the un-normalized weights of the right-child node and all of
its children (rightbranchweight).
Then we randomly select an element from the BST by descending the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of node are summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
    remove the element from the BST as normal, updating leftbranchweight
    and rightbranchweight of all the necessary nodes, and return the
    element.
else if the number is less than (elementprobability + leftbranchprobability)
    recurse on leftchild (run the algorithm using leftchild as node)
else
    recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
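As a simplified illustration of the descent, here is a sketch that swaps in an array-backed binary tree (node i has children 2*i+1 and 2*i+2) for the element-sorted BST, so no rebalancing or node deletion is needed; removal just zeroes a weight. It also assumes integer weights so that all arithmetic is exact. The class name WeightTree and its methods are made up for this sketch.

```python
import random

class WeightTree:
    def __init__(self, weights):
        self.own = list(weights)  # weight of each node's own element
        self.sub = list(weights)  # total weight of the subtree rooted at each node
        # Children have larger indices than their parent, so a reverse pass
        # completes every subtree before adding it to its parent.
        for i in range(len(self.sub) - 1, 0, -1):
            self.sub[(i - 1) // 2] += self.sub[i]

    def _add(self, i, delta):
        # Change one element's weight and fix the sums on the path to the root.
        self.own[i] += delta
        while True:
            self.sub[i] += delta
            if i == 0:
                return
            i = (i - 1) // 2

    def pick(self, remove=False):
        """Return an index with probability proportional to its weight."""
        total = self.sub[0] if self.sub else 0
        if total <= 0:
            raise ValueError("no weight left to sample from")
        r = random.randrange(total)
        i = 0
        # Invariant: 0 <= r < sub[i]. Stop when r falls in the node's own share.
        while r >= self.own[i]:
            r -= self.own[i]
            left = 2 * i + 1
            if r < self.sub[left]:
                i = left
            else:
                r -= self.sub[left]
                i = left + 1
        if remove:
            self._add(i, -self.own[i])  # without replacement: zero the weight
        return i

tree = WeightTree([5, 0, 3, 2])
first = tree.pick(remove=True)   # without replacement
second = tree.pick(remove=True)
```

Each pick walks one root-to-leaf path at most, so it costs O(log n), and the tree takes O(n) space, matching the bounds stated above.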

This is an old question for which numpy now offers an easy solution, so I thought I would mention it. numpy.random.choice (available since NumPy 1.7.0) allows the sampling to be done with or without replacement and with given weights.

Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice

We faced the problem of randomly selecting K validators out of N candidates once per epoch, proportionally to their stakes. But this gives us the following problem:
Imagine these probabilities for each candidate:
0.1
0.1
0.8
After 1'000'000 selections of 2 out of 3 without replacement, the observed probabilities of each candidate became:
0.254315
0.256755
0.488930
Note that the original probabilities are not achievable when selecting 2 of 3 without replacement.
But we want the initial probabilities to be the profit distribution probabilities; otherwise small candidate pools become more profitable. So we realized that random selection with replacement would help us: randomly select >K of N and also store the weight of each validator for reward distribution:
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
    int value = rand() % likehoodsSum;
    for (int i = 0; i < n; i++) {
        if (value < likehoods[i]) {
            if (weights[i] == 0) {
                validators.push_back(i);
            }
            weights[i]++;
            totalWeights++;
            break;
        }
        value -= likehoods[i];
    }
}
Over millions of samples, this gives a reward distribution that is almost the original one:
0.101230
0.099113
0.799657
