Justification of constants used in random.sample - python

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _log(3*k,4). The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is there justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand, size of a small set but find the "minus size of an empty list" confusing. Can someone shed any light on this?
what is meant specifically by "table" size, as apposed to say "set size".
Looking on the github repository, it looks like a very old version simply used the ratio k : 6*k, as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
"""Chooses k unique random elements from a population sequence or set.
Returns a new list containing elements from the population while
leaving the original population unchanged. The resulting list is
in selection order so that all sub-slices will also be valid random
samples. This allows raffle winners (the sample) to be partitioned
into grand prize and second place winners (the subslices).
Members of the population need not be hashable or unique. If the
population contains repeats, then each occurrence is a possible
selection in the sample.
To choose a sample in a range of integers, use range as an argument.
This is especially fast and space efficient for sampling from a
large population: sample(range(10000000), 60)
"""
# Sampling without replacement entails tracking either potential
# selections (the pool) in a list or previous selections in a set.
# When the number of selections is small compared to the
# population, then tracking selections is efficient, requiring
# only a small set and an occasional reselection. For
# a larger number of selections, the pool tracking method is
# preferred since the list takes less space than the
# set and it doesn't suffer from frequent reselections.
if isinstance(population, _Set):
population = tuple(population)
if not isinstance(population, _Sequence):
raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
randbelow = self._randbelow
n = len(population)
if not 0 <= k <= n:
raise ValueError("Sample larger than population or is negative")
result = [None] * k
setsize = 21 # size of a small set minus size of an empty list
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
if n <= setsize:
# An n-length list is smaller than a k-length set
pool = list(population)
for i in range(k): # invariant: non-selected at [0,n-i)
j = randbelow(n-i)
result[i] = pool[j]
pool[j] = pool[n-i-1] # move non-selected item into vacancy
else:
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow(n)
while j in selected:
j = randbelow(n)
selected_add(j)
result[i] = population[j]
return result
(I apologise is this question would be better placed in math.stackexchange. I couldn't think of any probability/statistics-y reasons for this particular ratio, and the comments sounded as though, it was maybe something to do with the amount of space that sets and lists use - but could't find any details anywhere).

This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)

Related

Detect significant changes in a data-set that gradually changes

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?
This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol = 2):
iterable = iter(iterable) # this is necessary if it is container rather than generator.
# note that if the iterable is already a generator iter(iterable) returns itself.
base = next(iterable)
for x in iterable:
if x >= (base * tol) or x <= (base/tol):
yield x
base = x
my_list = [10,15,17,20,30,40,50,70,80,60,40,20]
print(list(gen_significant_changes(my_list)))
I can't help with the Python part, but in terms of math, the problem you're asking is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant can be reached by raising 2 to a different power (as an integer) than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] /Array[0], 2)
previous = math.log(Array[t-1]/Array[0], 2)
if math.floor(current) <> math.floor(previous) a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.

Random Integer inside range with probability (Python) [duplicate]

This question already has answers here:
Random weighted choice
(7 answers)
Closed 8 years ago.
I am making a text-based RPG. I have an algorithm that determines the damage dealt by the player to the enemy which is based off the values of two variables. I am not sure how the first part of the algorithm will work quite yet, but that isn't important.
(AttackStrength is an attribute of the player that represents generally how strong his attacks are. WeaponStrength is an attribute of swords the player wields and represents generally how strong attacks are with the weapon.)
Here is how the algorithm will go:
import random
Damage = AttackStrength (Do some math operation to WeaponStrength) WeaponStrength
DamageDealt = randrange(DamageDealt - 4, DamageDealt + 1) #Bad pseudocode, sorry
What I am trying to do with the last line is get a random integer inside a range of integers with the minimum bound as 4 less than Damage, and the maximum bound as 1 more than Damage. But, that's not all. I want to assign probabilities that:
X% of the time DamageDealt will equal Damage
Y% of the time DamageDealt will equal one less than Damage
Z% of the time DamageDealt will equal two less than Damage
A% of the time DamageDealt will equal three less than Damage
B% of the time DamageDealt will equal three less than Damage
C% of the time DamageDealt will equal one more than Damage
I hope I haven't over-complicated all of this thank you!
I think the easiest way to do random weighted probability when you have nice integer probabilities like that is to simply populate a list with multiple copies of your choices - in the right ratios - then choose one element from it, randomly.
Let's do it from -3 to 1 with your (original) weights of 10,10,25,25,30 percent. These share a gcd of 5, so you only need a list of length 20 to hold your choices:
choices = [-3]*2 + [-2]*2 + [-1]*5 + [0]*5 + [1]*6
And implementation done, just choose randomly from that. Demo showing 100 trials:
trials = [random.choice(choices) for _ in range(100)]
[trials.count(i) for i in range(-3,2)]
Out[18]: [11, 7, 27, 22, 33]
Essentially, what you're trying to accomplish is simulation of a loaded die: you have six possibilities and want to assign different probabilities to each one. This is a fairly interesting problem, mathematically speaking, and here is a wonderful piece on the subject.
Still, you're probably looking for something a little less verbose, and the easiest pattern to implement here would be via roulette wheel selection. Given a dictionary where keys are the various 'sides' (in this case, your possible damage formulae) and the values are the probabilities that each side can occur (.3, .25, etc.), the method looks like this:
def weighted_random_choice(choices):
max = sum(choices.values())
pick = random.uniform(0, max)
current = 0
for key, value in choices.items():
current += value
if current > pick:
return key
Suppose that we wanted to have these relative weights for the outcomes:
a = (10, 15, 15, 25, 25, 30)
Then we create a list of partial sums b and a function c:
import random
b = [sum(a[:i+1]) for i,x in enumerate(a)]
def c():
n = random.randrange(sum(a))
for i, v in enumerate(b):_
if n < v: return i
The function c will return an integer from 0 to len(a)-1 with probability proportional to the weights specified in a.
This can be a tricky problem with a lot of different probabilities. Since you want to impose probabilities on the outcomes it's not really fair to call them "random". It always helps to imagine how you might represent your data. One way would be to keep a tuple of tuples like
probs = ((10, +1), (30, 0), (25, -1), (25, -2), (15, -3))
You will notice I have adjusted the series to put the highest adjustment first and so on. I have also removed the duplicate "15, -3) that your question implies because (I imagine) of a line duplicated by accident. One very useful test is to ensure that your probabilities add up to 100 (since I've represented them as integer percentages). This reveals a data fault:
>>> sum(prob[0] for prob in probs)
105
This needn't be an issue unless you really want your probabilities to sum to a sensible value. If this isn't necessary you can just treat them as weightings and select random numbers from (0, 104) instead of (0, 99). This is the course I will follow, but the adjustment should be relatively simple.
Given probs and a random number between 0 and (in your case) 104, you can iterate over the probs structure, accumulating probabilities until you find the bin this particular random number belongs to. This would look (something) like:
def damage_offset(N):
prob = random.randint(0, N-1)
cum_prob = 0
for prob, offset in probs:
cum_prob += prob
if cum_prob >= prob:
return offset
This should always terminate if you get your data right (hence my paranoid check on your weightings - I've been doing this quite a while).
Of course it's often possible to trade memory for speed. If the above needs to work faster then it's relatively easy to create a structure that maps random integer choices direct to their results. One way to construct such a mapping would be
damage_offsets = []
for i in range(N):
damage_offsets.append(damage_offset(i))
Then all you have to do after you've picked your random number r between 1 and N is to look up damage_offsets[r-1] for the particular value of r1 and you have created an O(1) operation. As I mentioned at the start, this isn't likely going to be terribly useful unless your probability list becomes huge (but if it does then you really will need to to avoid O(N) operations when you have large N for the number of probability buckets).
Apologies for untested code.

Python: Randomly draw several objects in a list

I am looking for the most efficient way to randomly draw nelements in a list given a list of probabilities stating the probability of each element to be picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
x = random.uniform(0, 1)
cumulative_proba = 0.0
for item, item_proba in zip(some_list, proba):
cumulative_proba += item_proba
if x < cumulative_proba:
break
return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I have implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post it here). My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
here's my lazy method... build a list with expected number of values for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.iteritems()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time when you call random_pick. I suggest to calculate it outside the method, and use a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(lgn).

Python: Number ranges that are extremely large?

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.
To get a random permutation of the range [0, n) in a memory efficient manner; you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
If you need only small fraction of values from the range e.g., to get k random values from [0, n) range:
import random
from functools import partial
def sample(n, k):
# assume n is much larger than k
randbelow = partial(random.randrange, n)
# from random.py
result = [None] * k
selected = set()
selected_add = selected.add
for i in range(k):
j = randbelow()
while j in selected:
j = randbelow()
selected_add(j)
result[i] = j
return result
print(sample(10**100, 10))
If you don't need the full list of numbers (and if you are getting billions, its hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
pop = range(1, maximum+1) # this is memory efficient in Python 3
sample = random.sample(pop, 10000)
return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.
An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integer, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)

Weighted random selection with and without replacement

Recently I needed to do weighted random selection of elements from a list, both with and without replacement. While there are well known and good algorithms for unweighted selection, and some for weighted selection without replacement (such as modifications of the resevoir algorithm), I couldn't find any good algorithms for weighted selection with replacement. I also wanted to avoid the resevoir method, as I was selecting a significant fraction of the list, which is small enough to hold in memory.
Does anyone have any suggestions on the best approach in this situation? I have my own solutions, but I'm hoping to find something more efficient, simpler, or both.
One of the fastest ways to make many with replacement samples from an unchanging list is the alias method. The core intuition is that we can create a set of equal-sized bins for the weighted list that can be indexed very efficiently through bit operations, to avoid a binary search. It will turn out that, done correctly, we will need to only store two items from the original list per bin, and thus can represent the split with a single percentage.
Let's us take the example of five equally weighted choices, (a:1, b:1, c:1, d:1, e:1)
To create the alias lookup:
Normalize the weights such that they sum to 1.0. (a:0.2 b:0.2 c:0.2 d:0.2 e:0.2) This is the probability of choosing each weight.
Find the smallest power of 2 greater than or equal to the number of variables, and create this number of partitions, |p|. Each partition represents a probability mass of 1/|p|. In this case, we create 8 partitions, each able to contain 0.125.
Take the variable with the least remaining weight, and place as much of it's mass as possible in an empty partition. In this example, we see that a fills the first partition. (p1{a|null,1.0},p2,p3,p4,p5,p6,p7,p8) with (a:0.075, b:0.2 c:0.2 d:0.2 e:0.2)
If the partition is not filled, take the variable with the most weight, and fill the partition with that variable.
Repeat steps 3 and 4, until none of the weight from the original partition need be assigned to the list.
For example, if we run another iteration of 3 and 4, we see
(p1{a|null,1.0},p2{a|b,0.6},p3,p4,p5,p6,p7,p8) with (a:0, b:0.15 c:0.2 d:0.2 e:0.2) left to be assigned
At runtime:
Get a U(0,1) random number, say binary 0.001100000
bitshift it lg2(p), finding the index partition. Thus, we shift it by 3, yielding 001.1, or position 1, and thus partition 2.
If the partition is split, use the decimal portion of the shifted random number to decide the split. In this case, the value is 0.5, and 0.5 < 0.6, so return a.
Here is some code and another explanation, but unfortunately it doesn't use the bitshifting technique, nor have I actually verified it.
A simple approach that hasn't been mentioned here is one proposed in Efraimidis and Spirakis. In python you could select m items from n >= m weighted items with strictly positive weights stored in weights, returning the selected indices, with:
import heapq
import math
import random
def WeightedSelectionWithoutReplacement(weights, m):
elt = [(math.log(random.random()) / weights[i], i) for i in range(len(weights))]
return [x[1] for x in heapq.nlargest(m, elt)]
This is very similar in structure to the first approach proposed by Nick Johnson. Unfortunately, that approach is biased in selecting the elements (see the comments on the method). Efraimidis and Spirakis proved that their approach is equivalent to random sampling without replacement in the linked paper.
Here's what I came up with for weighted selection without replacement:
def WeightedSelectionWithoutReplacement(l, n):
"""Selects without replacement n random elements from a list of (weight, item) tuples."""
l = sorted((random.random() * x[0], x[1]) for x in l)
return l[-n:]
This is O(m log m) on the number of items in the list to be selected from. I'm fairly certain this will weight items correctly, though I haven't verified it in any formal sense.
Here's what I came up with for weighted selection with replacement:
def WeightedSelectionWithReplacement(l, n):
"""Selects with replacement n random elements from a list of (weight, item) tuples."""
cuml = []
total_weight = 0.0
for weight, item in l:
total_weight += weight
cuml.append((total_weight, item))
return [cuml[bisect.bisect(cuml, random.random()*total_weight)] for x in range(n)]
This is O(m + n log m), where m is the number of items in the input list, and n is the number of items to be selected.
I'd recommend you start by looking at section 3.4.2 of Donald Knuth's Seminumerical Algorithms.
If your arrays are large, there are more efficient algorithms in chapter 3 of Principles of Random Variate Generation by John Dagpunar. If your arrays are not terribly large or you're not concerned with squeezing out as much efficiency as possible, the simpler algorithms in Knuth are probably fine.
It is possible to do Weighted Random Selection with replacement in O(1) time, after first creating an additional O(N)-sized data structure in O(N) time. The algorithm is based on the Alias Method developed by Walker and Vose, which is well described here.
The essential idea is that each bin in a histogram would be chosen with probability 1/N by a uniform RNG. So we will walk through it, and for any underpopulated bin which would would receive excess hits, assign the excess to an overpopulated bin. For each bin, we store the percentage of hits which belong to it, and the partner bin for the excess. This version tracks small and large bins in place, removing the need for an additional stack. It uses the index of the partner (stored in bucket[1]) as an indicator that they have already been processed.
Here is a minimal python implementation, based on the C implementation here
def prep(weights):
data_sz = len(weights)
factor = data_sz/float(sum(weights))
data = [[w*factor, i] for i,w in enumerate(weights)]
big=0
while big<data_sz and data[big][0]<=1.0: big+=1
for small,bucket in enumerate(data):
if bucket[1] is not small: continue
excess = 1.0 - bucket[0]
while excess > 0:
if big==data_sz: break
bucket[1] = big
bucket = data[big]
bucket[0] -= excess
excess = 1.0 - bucket[0]
if (excess >= 0):
big+=1
while big<data_sz and data[big][0]<=1: big+=1
return data
def sample(data):
r=random.random()*len(data)
idx = int(r)
return data[idx][1] if r-idx > data[idx][0] else idx
Example usage:
TRIALS=1000
weights = [20,1.5,9.8,10,15,10,15.5,10,8,.2];
samples = [0]*len(weights)
data = prep(weights)
for _ in range(int(sum(weights)*TRIALS)):
samples[sample(data)]+=1
result = [float(s)/TRIALS for s in samples]
err = [a-b for a,b in zip(result,weights)]
print(result)
print([round(e,5) for e in err])
print(sum([e*e for e in err]))
The following is a description of random weighted selection of an element of a
set (or multiset, if repeats are allowed), both with and without replacement in O(n) space
and O(log n) time.
It consists of implementing a binary search tree, sorted by the elements to be
selected, where each node of the tree contains:
the element itself (element)
the un-normalized weight of the element (elementweight), and
the sum of all the un-normalized weights of the left-child node and all of
its children (leftbranchweight).
the sum of all the un-normalized weights of the right-child node and all of
its chilren (rightbranchweight).
Then we randomly select an element from the BST by descending down the tree. A
rough description of the algorithm follows. The algorithm is given a node of
the tree. Then the values of leftbranchweight, rightbranchweight,
and elementweight of node is summed, and the weights are divided by this
sum, resulting in the values leftbranchprobability,
rightbranchprobability, and elementprobability, respectively. Then a
random number between 0 and 1 (randomnumber) is obtained.
if the number is less than elementprobability,
remove the element from the BST as normal, updating leftbranchweight
and rightbranchweight of all the necessary nodes, and return the
element.
else if the number is less than (elementprobability + leftbranchweight)
recurse on leftchild (run the algorithm using leftchild as node)
else
recurse on rightchild
When we finally find, using these weights, which element is to be returned, we either simply return it (with replacement) or we remove it and update relevant weights in the tree (without replacement).
DISCLAIMER: The algorithm is rough, and a treatise on the proper implementation
of a BST is not attempted here; rather, it is hoped that this answer will help
those who really need fast weighted selection without replacement (like I do).
This is an old question for which numpy now offers an easy solution so I thought I would mention it. Current version of numpy is version 1.2 and numpy.random.choice allows the sampling to be done with or without replacement and with given weights.
Suppose you want to sample 3 elements without replacement from the list ['white','blue','black','yellow','green'] with a prob. distribution [0.1, 0.2, 0.4, 0.1, 0.2]. Using numpy.random module it is as easy as this:
import numpy.random as rnd
sampling_size = 3
domain = ['white','blue','black','yellow','green']
probs = [.1, .2, .4, .1, .2]
sample = rnd.choice(domain, size=sampling_size, replace=False, p=probs)
# in short: rnd.choice(domain, sampling_size, False, probs)
print(sample)
# Possible output: ['white' 'black' 'blue']
Setting the replace flag to True, you have a sampling with replacement.
More info here:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.random.choice.html#numpy.random.choice
We faced a problem to randomly select K validators of N candidates once per epoch proportionally to their stakes. But this gives us the following problem:
Imagine probabilities of each candidate:
0.1
0.1
0.8
Probabilities of each candidate after 1'000'000 selections 2 of 3 without replacement became:
0.254315
0.256755
0.488930
You should know, those original probabilities are not achievable for 2 of 3 selection without replacement.
But we wish initial probabilities to be a profit distribution probabilities. Else it makes small candidate pools more profitable. So we realized that random selection with replacement would help us – to randomly select >K of N and store also weight of each validator for reward distribution:
std::vector<int> validators;
std::vector<int> weights(n);
int totalWeights = 0;
for (int j = 0; validators.size() < m; j++) {
int value = rand() % likehoodsSum;
for (int i = 0; i < n; i++) {
if (value < likehoods[i]) {
if (weights[i] == 0) {
validators.push_back(i);
}
weights[i]++;
totalWeights++;
break;
}
value -= likehoods[i];
}
}
It gives an almost original distribution of rewards on millions of samples:
0.101230
0.099113
0.799657

Categories