I have a task to randomly choose 100 elements from a population with labels [a,b,c,d] and corresponding frequencies (probabilities) [0.1, 0.3, 0.2, 0.4].
There are many different ways to do it. But here I want the function call (suppose there is one) to return a list of the number of elements chosen for each label. Say it returns (20,20,30,30); that means 20 of element a are chosen, 20 of element b are chosen, etc.
I figured that np.random.multinomial is the way to go. Following the above example, I will need to call np.random.multinomial(100, [0.1,0.3,0.2,0.4], 1). Is this right? Thanks.
Yes, np.random.multinomial(100, [0.1,0.3,0.2,0.4], 1 ) is correct. But since you are doing only one draw you'd maybe prefer the simpler np.random.multinomial(100, [0.1,0.3,0.2,0.4]) (without the ,1) which returns an array instead of an array of (one) array.
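To make the meaning of the returned counts concrete, here is a minimal standard-library sketch that produces the same kind of result by drawing 100 labels and tallying them. The helper name draw_counts is made up for this illustration; it is not part of NumPy.

```python
import random
from collections import Counter

labels = ["a", "b", "c", "d"]
probs = [0.1, 0.3, 0.2, 0.4]

def draw_counts(n, labels, probs, rng=random):
    """Hypothetical helper: draw n labels independently, return one count per label."""
    tally = Counter(rng.choices(labels, weights=probs, k=n))
    # Keep the counts in the same order as `labels`, like multinomial does.
    return [tally[label] for label in labels]

counts = draw_counts(100, labels, probs)
print(dict(zip(labels, counts)))  # counts vary from run to run but always sum to 100
```

The counts line up with the labels by position, so the first entry is the number of a's, the second the number of b's, and so on.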
I agree with JulienD. The word "choose" and the given probabilities just don't go together.
When we use "choose", we mean a combination, i.e. a selection without regard to order.
When probabilities are given, we mean that they are constant probabilities (unless it is stated that they are conditional). So the items are "assigned" to categories with the given probabilities.
Of course, the count in the categories is not 100*probabilities. That would have been the expected value over the long run. Just like if you toss a fair coin, you don't expect it to be HTHTHT...HT. But over the long run the count of H will be half of total tosses.
import numpy.random as npr
npr.seed(123)
npr.multinomial(100, [0.1,0.3,0.2,0.4], 1)
# Out: array([[11, 27, 18, 44]])
As the number of simulations increases, the observed proportions will converge to the given probabilities.
simulations = 1000
sum(npr.multinomial(100, [0.1,0.3,0.2,0.4], simulations))/simulations/100
#Out:array([ 0.09995, 0.29991, 0.19804, 0.4021 ])
I have 100 lists [x1..x100] , each one containing about 10 items. [x_i_1,...x_i_10]
I need to generate 80 vectors. Each vector takes one item from every list, kind of like an element of itertools.product(*x), except for 2 things:
(1)
I need every item in each vector to have a uniform distribution.
for example:
[np.random.choice(xi) for xi in [x1..x100]] would be good, except for my second condition:
(2)
I can't have repetitions.
itertools.product solves this, but it doesn't meet condition (1).
I need to generate 80 vectors, use them, ask for another 80, and repeat this process until a certain condition is met.
For EACH vector across all the 80-vector batches, I need the items to be uniform (condition 1) and the vectors to be non-repeating (condition 2).
Creating all permutations and shuffling that list is a great solution for a smaller list. I'm using this batch system because of the HUGE number of possible permutations.
Any ideas?
thx
Just use [np.random.choice(xi) for xi in [x1..x100]]. The probability that the same vector will be generated twice in 80 trials is vanishingly small. By the birthday problem, the probability that n items chosen independently and uniformly from a set of d items contain a repeat is approximately 1 - exp(-n*(n-1)/(2*d)). In your case n = 80 and d = 10**100, so the estimate is about 3160/10**100, or roughly 3.2 x 10^-97: a decimal that begins 0.000... with 96 zeros after the decimal point. Forget 80. You could generate 80 trillion such vectors and still have a vanishingly small probability of generating the same vector twice.
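The birthday estimate is easy to check numerically. This is a small sketch, assuming each of the 100 lists has exactly 10 items so that d = 10**100. It uses math.expm1 because 1 - exp(-x) would round to exactly 0.0 in floating point for an x this small.

```python
import math

n = 80          # vectors per batch
d = 10 ** 100   # number of possible vectors, assuming 100 lists of 10 items each

exponent = n * (n - 1) / (2 * d)      # 3160 / 10**100
p_repeat = -math.expm1(-exponent)     # 1 - exp(-exponent), computed without cancellation

print(p_repeat)  # on the order of 1e-97
```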
Algo (Source: Elements of Programming Interviews, 5.16)
You are given n numbers as well as probabilities p0, p1,.., pn-1
which sum up to 1. Given a rand num generator that produces values in
[0,1] uniformly, how would you generate one of the n numbers according
to their specific probabilities.
Example
If numbers are 3, 5, 7, 11, and the probabilities are 9/18, 6/18,
2/18, 1/18, then in 1000000 calls to the program, 3 should appear
500000 times, 7 should appear 111111 times, etc.
The book says to create intervals with endpoints p0, p0 + p1, p0 + p1 + p2, etc., so in the example above the intervals are [0.0, 0.5), [0.5, 0.8333), etc. Combining these intervals into a sorted array of endpoints gives [9/18, 15/18, 17/18, 18/18] (or [1/18, 3/18, 9/18, 18/18] if the probabilities are taken in ascending order). Then run the random number generator and find the smallest endpoint that is larger than the generated value; its array index maps to an index in the given n numbers.
This would require O(N) pre-processing time and then O(log N) to binary search for the value.
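For reference, the book's approach as described above can be sketched with the standard library. The function name pick is made up for this illustration, and the clamp on the index is only a guard against floating-point round-off in the last endpoint.

```python
import bisect
import itertools
import random

numbers = [3, 5, 7, 11]
probs = [9 / 18, 6 / 18, 2 / 18, 1 / 18]

# O(n) preprocessing: right endpoints of the intervals, i.e. the running sums.
cumulative = list(itertools.accumulate(probs))

def pick(numbers, cumulative, rng=random):
    """Return one number, chosen with the probabilities encoded in `cumulative`."""
    u = rng.random()                          # uniform in [0, 1)
    i = bisect.bisect_right(cumulative, u)    # index of the smallest endpoint > u, O(log n)
    return numbers[min(i, len(numbers) - 1)]  # clamp in case the last endpoint is slightly below 1

print([pick(numbers, cumulative) for _ in range(10)])
```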
I have an alternate solution that requires O(N) pre-processing time and O(1) execution time, and am wondering what may be wrong with it.
Why can't we iterate through each number in n, multiplying [n] * 100 * probability that matches with n. E.g [3] * (9/18) * 100. Concatenate all these arrays to get, at the end, a list of 100 elements, with the number of elements for each mapping to how likely it is to occur. Then, run the random num function and index into the array, and return the value.
Wouldn't this be more efficient than the provided solution?
Your number 100 is not independent of the input; it depends on the given p values. Any parameter that depends on the magnitude of the input values is really exponential in the input size, meaning you are actually using exponential space. Just constructing that array would thus take exponential time, even if it was structured to allow constant lookup time after generating the random number.
Consider two p values, 0.01 and 0.99. 100 values is sufficient to implement your scheme. Now consider 0.001 and 0.999. Now you need an array of 1,000 values to model the probability distribution. The amount of space grows with (I believe) the ratio of the largest p value and the smallest, not in the number of p values given.
If you have rational probabilities, you can make that work. Rather than 100, you must use a common denominator of the rational proportions. Insisting on 100 items will not fulfill the specs of your assigned example, let alone more diabolical ones.
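A minimal sketch of the common-denominator idea, assuming the probabilities are given as exact fractions. For the book's example the table needs 18 slots rather than 100. math.lcm with several arguments needs Python 3.9 or later.

```python
from fractions import Fraction
import math
import random

numbers = [3, 5, 7, 11]
probs = [Fraction(9, 18), Fraction(6, 18), Fraction(2, 18), Fraction(1, 18)]

# Smallest table size that represents every probability exactly.
denominator = math.lcm(*(p.denominator for p in probs))

# Each number fills a share of the table proportional to its probability.
table = []
for value, p in zip(numbers, probs):
    table.extend([value] * int(p * denominator))

def pick(table, rng=random):
    """O(1) draw: a uniform index into the table."""
    return table[rng.randrange(len(table))]

print(denominator, table)
```

The table size depends on the denominators, not on how many numbers there are, which is the space concern raised above.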
If I have a big list or numpy array that I need to split into sub-lists, how could I efficiently calculate the statistics (mean, standard deviation, etc.) for the whole list?
As a simple example, let's say that I have this small list:
l = [2,1,4,1,2,1,3,2,1,5]
>>> mean(l)
2.2000000000000002
But, if for some reason I need to split into sub-lists:
l1 = [2,1,4,1]
l2 = [2,1,3,2]
l3 = [1,5]
Of course, you don't need to know a lot about mathematics to know that this is NOT TRUE:
mean(l) = mean(mean(l1), mean(l2), mean(l3))
This is true only if all the lists have the same length, which is not the case here.
The background of this question is the case where you have a very big dataset that does not fit into memory, and thus you need to split it into chunks.
In general, you need to keep the so-called sufficient statistics for each subset. For the mean and standard deviation, the sufficient statistics are the number of data, their sum, and their sum of squares. Given those 3 quantities for each subset, you can compute the mean and standard deviation for the whole set.
The sufficient statistics are not necessarily any smaller than the subset itself. But for mean and standard deviation, the sufficient statistics are just a few numbers.
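As a rough sketch of that idea, using the three sub-lists from the question: each chunk is reduced to (count, sum, sum of squares), and only those triples are combined. The function names are made up for this illustration. It computes the population standard deviation, and the sum-of-squares formula can lose precision on data with a large mean and small spread, so treat it as a starting point.

```python
import math

def summarize(chunk):
    """Sufficient statistics of one chunk: count, sum, sum of squares."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(stats):
    """Mean and population standard deviation of the whole data set."""
    n = sum(s[0] for s in stats)
    total = sum(s[1] for s in stats)
    total_sq = sum(s[2] for s in stats)
    mean = total / n
    variance = total_sq / n - mean * mean     # E[x^2] - (E[x])^2
    return mean, math.sqrt(max(variance, 0.0))  # clamp tiny negative round-off

chunks = [[2, 1, 4, 1], [2, 1, 3, 2], [1, 5]]
mean, std = combine([summarize(c) for c in chunks])
print(mean, std)
```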
I assume you know the number of data points you have, i.e., len(l)? Then you could just calculate the sum of each list individually (i.e., map-reduce) or keep a running sum (e.g., if you are doing a readline()), and then divide by len(l) at the very end.
I am looking for the most efficient way to randomly draw n elements from a list, given a list of probabilities stating the probability of each element being picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
import random

def random_pick(some_list, proba):
    x = random.uniform(0, 1)
    cumulative_proba = 0.0
    for item, item_proba in zip(some_list, proba):
        cumulative_proba += item_proba
        if x < cumulative_proba:
            break
    return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
    list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post here.) My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
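For readers without access to the linked code, here is an independent pure-Python sketch of the alias method (the variant usually attributed to Vose). It is not the implementation referred to above, and the names build_alias and alias_draw are made up for this illustration.

```python
import random

def build_alias(probs):
    """O(n) preprocessing: build the 'dart board' as two tables."""
    n = len(probs)
    scaled = [p * n for p in probs]   # each column of the board has height 1
    prob = [1.0] * n                  # chance of keeping column i's own item
    alias = list(range(n))            # item used when column i's own item is rejected
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]
        alias[s] = l
        scaled[l] -= 1.0 - scaled[s]  # l donated the rest of column s
        (small if scaled[l] < 1.0 else large).append(l)
    # Anything left over keeps prob 1.0, which absorbs floating-point round-off.
    return prob, alias

def alias_draw(prob, alias, rng=random):
    """O(1) draw: pick a column, then keep its item or take the alias."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]
prob, alias = build_alias(MyProba)
list_of_drawn_elements = [aList[alias_draw(prob, alias)] for _ in range(len(aList))]
print(list_of_drawn_elements)
```

The tables are built once; each subsequent draw costs two random numbers and one comparison, regardless of the list length.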
Here's my lazy method: build a list with the expected number of copies of each value for the desired distribution, and use random.choice() to pick a value from the list. (This assumes the probabilities are multiples of 0.01.)
>>> import random
>>>
>>> values = [3,4,2,1,4,3,5,7,6,4]
>>> probs = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
>>> # zip, not dict: the values repeat, and duplicate dict keys would overwrite each other's probabilities
>>> expected_dist = [v for v, p in zip(values, probs) for _ in range(round(p * 100))]
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time you call random_pick. I suggest calculating it once outside the function and using a better data structure to store it, like a binary search tree, which will reduce the time complexity of each draw from O(n) to O(log n).
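If Python 3.6 or later is available, one way to get this behaviour without writing the tree yourself is random.choices, which accepts precomputed cumulative weights and locates each draw by bisection. A short sketch using the data from the question:

```python
import itertools
import random

aList = [3, 4, 2, 1, 4, 3, 5, 7, 6, 4]
MyProba = [0.1, 0.1, 0.2, 0, 0.1, 0, 0.2, 0, 0.2, 0.1]

# Computed once, outside the sampling loop.
cum_weights = list(itertools.accumulate(MyProba))

nb_draws = len(aList)
list_of_drawn_elements = random.choices(aList, cum_weights=cum_weights, k=nb_draws)
print(list_of_drawn_elements)
```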
For the purpose of conducting a psychological experiment I have to divide a set of pictures (240) described by 4 features (real numbers) into 3 subsets with equal number of elements in each subset (240/3 = 80) in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
If I understand your problem correctly, you might use random.sample() in Python:
import random
pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) // 3 # we need 3 subsets (integer division)
set1 = set(random.sample(sorted(pool), slen)) # 1st random subset (random.sample needs a sequence, not a set)
pool -= set1
set2 = set(random.sample(sorted(pool), slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
I would tackle this as follows:
Divide into 3 equal subsets.
Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
Compare each pair of elements, if swapping would reduce the "unevenness", swap them. Continue until there are either no more pairs to compare, or the total unevenness is below some arbitrary "good enough" threshold.
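The three steps above could look roughly like this in Python. The "unevenness" measure used here (spread of the per-group means plus spread of the per-group standard deviations, summed over features) is just one possible choice, and the names unevenness and balance are made up for this sketch. It is written for clarity rather than speed, so it is slow for 240 items as written.

```python
import random
import statistics

def unevenness(groups):
    """One possible measure: range of group means plus range of group stdevs, per feature."""
    score = 0.0
    for f in range(len(groups[0][0])):
        means = [statistics.fmean([item[f] for item in g]) for g in groups]
        stds = [statistics.pstdev([item[f] for item in g]) for g in groups]
        score += (max(means) - min(means)) + (max(stds) - min(stds))
    return score

def balance(items, n_groups=3, threshold=0.05, max_passes=20, rng=random):
    items = list(items)
    rng.shuffle(items)
    size = len(items) // n_groups
    groups = [items[k * size:(k + 1) * size] for k in range(n_groups)]
    best = unevenness(groups)
    for _ in range(max_passes):
        improved = False
        for a in range(n_groups):
            for b in range(a + 1, n_groups):
                for i in range(size):
                    for j in range(size):
                        # Try the swap; keep it only if it reduces the unevenness.
                        groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
                        score = unevenness(groups)
                        if score < best:
                            best, improved = score, True
                        else:
                            groups[a][i], groups[b][j] = groups[b][j], groups[a][i]
        if not improved or best < threshold:
            break
    return groups, best

# Small demo with made-up data: 30 items, 2 features each.
demo_rng = random.Random(0)
demo_items = [(demo_rng.gauss(0, 1), demo_rng.gauss(5, 2)) for _ in range(30)]
demo_groups, demo_score = balance(demo_items, rng=demo_rng)
print(demo_score)
```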
You can do this using the plyr library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE EACH GROUP AS EQUALLY AS POSSIBLE INTO THREE SUBSETS
ddply(mydf, .(feature), summarize, sub = sample(rep(1:3, length.out = length(feature))))
In case you are still interested in the exhaustive search question: you have 240 choose 80 possibilities for the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
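If you want to reproduce the count yourself, a short sketch with math.comb (Python 3.8 or later) is enough:

```python
import math

first = math.comb(240, 80)    # ways to pick the first subset
second = math.comb(160, 80)   # ways to pick the second subset from the rest
total = first * second        # the third subset is whatever remains

print(first)
print(second)
print(f"{total:.3e}")  # on the order of 10**112
```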
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i] X[3*i+1] X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure.
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
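A rough NumPy sketch of this procedure, assuming at least two features per item. The mismatch measure (total spread of the running per-subset feature sums) is a placeholder of my own; substitute whatever measure you actually care about. The function names are made up for this illustration.

```python
import itertools
import numpy as np

def mismatch(sums):
    """Placeholder measure: total spread of the per-subset feature sums."""
    return float(np.sum(sums.max(axis=0) - sums.min(axis=0)))

def assign_by_mahalanobis(X, n_subsets=3):
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    inv_cov = np.linalg.pinv(np.cov(X, rowvar=False))
    # Squared Mahalanobis distance of every item from the mean.
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    order = np.argsort(-d2)  # most extraordinary items first

    sums = np.zeros((n_subsets, X.shape[1]))
    subsets = [[] for _ in range(n_subsets)]
    for start in range(0, len(order), n_subsets):
        block = order[start:start + n_subsets]
        best_perm, best_score = None, None
        # Try every way of handing this block's items to the subsets.
        for perm in itertools.permutations(range(n_subsets), len(block)):
            trial = sums.copy()
            for item, subset in zip(block, perm):
                trial[subset] += X[item]
            score = mismatch(trial)
            if best_score is None or score < best_score:
                best_perm, best_score = perm, score
        for item, subset in zip(block, best_perm):
            subsets[subset].append(int(item))
            sums[subset] += X[item]
    return subsets

# Demo with made-up data: 240 items, 4 features.
demo_X = np.random.default_rng(0).normal(size=(240, 4))
demo_subsets = assign_by_mahalanobis(demo_X)
print([len(s) for s in demo_subsets])
```

Because every subset receives exactly one item per round, the subset sizes stay equal whenever the number of items is a multiple of three, so comparing sums is equivalent to comparing means.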