draw random element in numpy - python

I have an array of element probabilities, let's say [0.1, 0.2, 0.5, 0.2]. The array sums up to 1.0.
Using plain Python or numpy, I want to draw elements proportional to their probability: the first element about 10% of the time, second 20%, third 50% etc. The "draw" should return index of the element drawn.
I came up with this:
def draw(probs):
    cumsum = numpy.cumsum(probs / sum(probs))  # sum up to 1.0, just in case
    return len(numpy.where(numpy.random.rand() >= cumsum)[0])
It works, but it's too convoluted, there must be a better way. Thanks.

import numpy as np
def random_pick(choices, probs):
    '''
    >>> a = ['Hit', 'Out']
    >>> b = [.3, .7]
    >>> random_pick(a, b)
    '''
    cutoffs = np.cumsum(probs)
    idx = cutoffs.searchsorted(np.random.uniform(0, cutoffs[-1]))
    return choices[idx]
How it works:
In [22]: import numpy as np
In [23]: probs = [0.1, 0.2, 0.5, 0.2]
Compute the cumulative sum:
In [24]: cutoffs = np.cumsum(probs)
In [25]: cutoffs
Out[25]: array([ 0.1, 0.3, 0.8, 1. ])
Compute a uniformly distributed random number in the half-open interval [0, cutoffs[-1]):
In [26]: np.random.uniform(0, cutoffs[-1])
Out[26]: 0.9723114393023948
Use searchsorted to find the index where the random number would be inserted into cutoffs:
In [27]: cutoffs.searchsorted(0.9723114393023948)
Out[27]: 3
Return choices[idx], where idx is that index.
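Since the original question asks for the index rather than the picked value, a minimal variant of the same idea (my sketch, not part of the answer above) is:
import numpy as np

def draw_index(probs):
    # Cumulative sums act as interval boundaries; the last entry is the total weight.
    cutoffs = np.cumsum(probs)
    # A uniform draw in [0, total) falls into exactly one interval; return its index.
    return cutoffs.searchsorted(np.random.uniform(0, cutoffs[-1]))

print(draw_index([0.1, 0.2, 0.5, 0.2]))  # 0, 1, 2 or 3, with index 2 the most likely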

You want to sample from the categorical distribution, which is not implemented in numpy. However, the multinomial distribution is a generalization of the categorical distribution and can be used for that purpose.
>>> import numpy as np
>>>
>>> def sampleCategory(p):
...     return np.flatnonzero(np.random.multinomial(1, p, 1))[0]
...
>>> sampleCategory( [0.1,0.5,0.4] )
1

use numpy.random.multinomial - most efficient
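If you need many draws at once, one way to use it (a sketch, assuming the probabilities stay fixed across draws) is to request one trial per draw and take the argmax of each one-hot row:
import numpy as np

p = [0.1, 0.2, 0.5, 0.2]
# Each row of `draws` is a one-hot vector marking the sampled category.
draws = np.random.multinomial(1, p, size=10)
indices = draws.argmax(axis=1)  # position of the 1 in each row
print(indices)                  # e.g. [2 2 1 3 2 0 2 2 1 2]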

I've never used numpy, but I assume my code below (python only) does the same thing as what you accomplished in one line. I'm putting it here just in case you want it.
Looks very c-ish so apologies for not being very pythonic.
weight_total would be 1 for you.
import random

def draw(probs):
    weight_total = sum(probs)             # 1 for normalized probabilities
    r = random.uniform(0, weight_total)   # uniform draw works for float weights
    running_total = 0
    for i, p in enumerate(probs):
        running_total += p
        if running_total > r:
            return i

use bisect
import bisect
import random
import numpy
def draw(probs):
    cumsum = numpy.cumsum(probs) / sum(probs)  # normalized cumulative distribution
    return bisect.bisect_left(cumsum, numpy.random.rand())
should do the trick.
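A quick sanity check for the bisect-based draw defined above (the exact counts vary from run to run):
from collections import Counter

counts = Counter(draw([0.1, 0.2, 0.5, 0.2]) for _ in range(100_000))
print(counts)  # roughly 10%, 20%, 50% and 20% of the draws land on indices 0..3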

Related

Random Choice with different distributions for each sample

Imagine we want to randomly draw 0 or 1, n times, but every time we make the draw we want to use a different probability distribution.
Considering np.random.choice:
a = [0, 1]  # or just 2
size = n
p = [
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
]  # for instance, if n = 3
The problem is that p needs to be a 1-dimensional vector. How can we make something like this without having to call np.random.choice n different times?
The reason why I need to do this without calling np.random.choice multiple times is that I want an output of size n using a seed for reproducibility. However if I call np.random.choice n times with a seed the randomness is lost within the n calls.
What I need is the following:
s = sample(a, n, p) # len(a) = len(p)
print(s)
>>> [1, 0, 0]
Numpy has a way to get an array of random floats, between 0 and 1, like so:
>>> a = np.random.uniform(0, 1, size=3)
>>> a
array([0.41444637, 0.90898856, 0.85223613])
Then, you can compare those floats with the probabilities you want:
>>> p = np.array([0.01, 0.5, 1])
>>> (a < p).astype(int)
array([0, 0, 1])
(Note: p is the probability of a 1 value, for each element.)
Putting all of that together, you can write a function to do this:
def sample(p):
    n = p.size
    a = np.random.uniform(0, 1, size=n)
    return (a < p).astype(int)
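A quick usage sketch for the sample function above (my example values): p[i] here is the probability of drawing a 1 at position i, i.e. the second column of the question's p matrix, and seeding once up front makes the whole length-n output reproducible without re-seeding inside a loop:
import numpy as np

np.random.seed(0)              # one seed for the whole batch
p = np.array([0.8, 0.5, 0.3])  # probability of a 1 at each position
print(sample(p))               # a length-3 array of 0s and 1s, fixed by the seed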

Elements in list greater than or equal to elements in other list (without for loop?)

I have a list called x containing 1,000,000 elements (numbers), and I would like to count how many of them are equal to or above each of the thresholds [0.5, 0.55, 0.60, ..., 1]. Is there a way to do it without a for loop?
Right now I have the following code, which works for a single value from the [0.5, ..., 1] range, let's say 0.5, and assigns the result to the count variable:
count=len([i for i in x if i >= 0.5])
EDIT: Basically what I want to avoid is doing this... if possible?
obs=[]
alpha = [0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1]
for a in alpha:
    count = len([i for i in x if i >= a])
    obs.append(count)
Thanks in advance
Best, Mikael
I don't think it's possible without a loop, but you can sort the list x and then use the bisect module (doc) to locate the insertion point (index) for each threshold.
For example:
x = [0.341, 0.423, 0.678, 0.999, 0.523, 0.751, 0.7]
alpha = [0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1]
x = sorted(x)
import bisect
obs = [len(x) - bisect.bisect_left(x, a) for a in alpha]
print(obs)
Will print:
[5, 4, 4, 4, 3, 2, 1, 1, 1, 1, 0]
Note:
sorted() has complexity O(n log n) and bisect_left() O(log n)
You can use numpy and boolean indexing:
>>> import numpy as np
>>> a = np.array(list(range(100)))
>>> a[a>=50].size
50
Even if you are not writing a for loop yourself, the internal methods still loop over the elements, they just do it efficiently in C.
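A fully vectorized variant of the same comparison idea, getting all the counts at once (a sketch; it materializes a len(x) × len(alpha) boolean array, about 11 MB for the sizes in the question, so only do this if that fits in memory):
import numpy as np

x = np.random.rand(1_000_000)
alpha = np.arange(0.5, 1.01, 0.05)
# Compare every element against every threshold, then count matches per threshold.
obs = (x[:, None] >= alpha).sum(axis=0)
print(obs)  # one count per threshold, decreasing from about 500000 down to about 0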
You can use the function below, without a for loop on your end.
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
l = list(filter(lambda v: v >= 0.5, x))
print(len(l))  # 5 values are >= 0.5
Based on comments, you're ok with using numpy, so use np.searchsorted to insert alpha into a sorted version of x. Subtracting the insertion indices from x.size gives your counts.
If you're ok with sorting x in-place:
x.sort()
counts = x.size - np.searchsorted(x, alpha)
If not,
counts = x.size - np.searchsorted(np.sort(x), alpha)
These counts use the default side='left', where np.searchsorted(x, alpha) returns how many elements are strictly below each threshold, so counts holds the number of elements with x >= alpha. If you instead want the number of elements with x > alpha, add the keyword side='right':
np.searchsorted(x, alpha, side='right')
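A tiny worked example of the two side options (values made up for illustration):
import numpy as np

x = np.array([0.2, 0.5, 0.5, 0.9])  # already sorted
alpha = np.array([0.5])
print(x.size - np.searchsorted(x, alpha))                # [3] -> elements >= 0.5
print(x.size - np.searchsorted(x, alpha, side='right'))  # [1] -> elements >  0.5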
PS
There are a couple of significant problems with the line
count = len([i for i in x if i >= 0.5])
First of all, you're creating a list of all the matching elements instead of just counting them. To count them do
count = sum(1 for i in x if i >= threshold)
Now the problem is that you are doing a linear pass through the entire array for each alpha, which is not necessary.
As I commented under @Andrej Kesely's answer, let's say N = len(x) and M = len(alpha). Your implementation is O(M * N) time complexity, while sorting gives you O((M + N) log N). For M << N (few thresholds), your complexity is approximately O(N), which beats O(N log N). But for M ≈ N, yours approaches O(N^2) versus O(N log N).
EDIT: If you are using NumPy already, you can simply do this:
import numpy as np
# Make random data
np.random.seed(0)
x = np.random.binomial(n=20, p=0.5, size=1000000) / 20
bins = np.arange(0.55, 1.01, 0.05)
# One extra value for the upper bound of last bin
bins = np.append(bins, max(bins.max(), x.max()) + 1)
h, _ = np.histogram(x, bins)
result = np.cumsum(h)
print(result)
# [280645 354806 391658 406410 411048 412152 412356 412377 412378 412378]
If you are dealing with large arrays of numbers, you may consider using NumPy. But if you are using plain Python lists, you can do it, for example, like this:
def how_many_bigger(nums, mins):
    # List of counts for each minimum
    counts = [0] * len(mins)
    # For each number
    for n in nums:
        # For each minimum
        for i, m in enumerate(mins):
            # Add 1 to the count if the number is greater than or equal to the current minimum
            if n >= m:
                counts[i] += 1
    return counts
# Test
import random
# Make random data
random.seed(0)
nums = [random.random() for _ in range(1_000_000)]
# Make minimums
mins = [i / 100. for i in range(55, 101, 5)]
print(mins)
# [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
count = how_many_bigger(nums, mins)
print(count)
# [449771, 399555, 349543, 299687, 249605, 199774, 149945, 99928, 49670, 0]

Memory efficient mean pairwise distance

I am aware of the scipy.spatial.distance.pdist function and how to compute the mean from the resulting matrix/ndarray.
>>> x = np.random.rand(10000, 2)
>>> y = pdist(x, metric='euclidean')
>>> y.mean()
0.5214255824176626
In the example above y gets quite large (nearly 2,500 times as large as the input array):
>>> y.shape
(49995000,)
>>> from sys import getsizeof
>>> getsizeof(x)
160112
>>> getsizeof(y)
399960096
>>> getsizeof(y) / getsizeof(x)
2498.0019986009793
But since I am only interested in the mean pairwise distance, the distance matrix doesn't have to be kept in memory. Instead, the mean of each row (or column) can be computed separately, and the final mean can then be computed from the row means.
Is there already a function which exploits this property, or is there an easy way to extend/combine existing functions to do so?
If you use the squared version of the distance, its mean is equal to twice the summed per-coordinate variance computed with the n-1 denominator (ddof=1):
from scipy.spatial.distance import pdist
import numpy as np

x = np.random.rand(10000, 2)
print(pdist(x, 'sqeuclidean').mean())
print(np.var(x, 0, ddof=1).sum() * 2)
# 0.331474285845873
# 0.33147428584587346
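This works because of the standard identity sum_{i<j} ||x_i - x_j||^2 = n * sum_i ||x_i - mean(x)||^2; dividing the left side by the number of pairs n(n-1)/2 gives 2/(n-1) * sum_i ||x_i - mean(x)||^2, which is exactly twice the summed per-coordinate variance with ddof=1. A quick numerical check of the identity (my own sketch, on a small array):
import numpy as np
from scipy.spatial.distance import pdist

z = np.random.rand(500, 2)
lhs = pdist(z, 'sqeuclidean').sum()                   # sum over pairs ||z_i - z_j||^2
rhs = z.shape[0] * ((z - z.mean(axis=0)) ** 2).sum()  # n * sum_i ||z_i - mean||^2
print(np.allclose(lhs, rhs))                          # True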
You will have to weight each row by the number of observations that make up the mean. For example the pdist of a 3 x 2 matrix is the flattened upper triangle (offset of 1) of the squareform 3 x 3 distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import pairwise_distances

arr = np.arange(6).reshape(3, 2)
arr
array([[0, 1],
       [2, 3],
       [4, 5]])
pdist(arr)
array([2.82842712, 5.65685425, 2.82842712])
square = pairwise_distances(arr)
square
array([[0.        , 2.82842712, 5.65685425],
       [2.82842712, 0.        , 2.82842712],
       [5.65685425, 2.82842712, 0.        ]])
square[np.triu_indices(square.shape[0], 1)]
array([2.82842712, 5.65685425, 2.82842712])
There is the pairwise_distances_chunked function that can be used to iterate over the distance matrix in chunks of rows, but you will need to keep track of the row index to make sure you only take the mean of values in the upper/lower triangle of the matrix (the distance matrix is symmetric). This isn't complicated, but I imagine it will introduce a significant slowdown.
from sklearn.metrics import pairwise_distances_chunked

n = arr.shape[0]
tot = (n * n - n) / 2               # number of pairs in the upper triangle
weighted_means = 0
r = 0                               # global row index
for chunk in pairwise_distances_chunked(arr):
    for row in chunk:
        if r + 1 < n:               # the last row has no upper-triangle entries
            sm = row[r + 1:].mean()
            wgt = (n - r - 1) / tot
            weighted_means += sm * wgt
        r += 1

Multidimensional array for random.choice in NumPy

I have a table and I need to use random.choice for probability calculation,
for example (taken from docs):
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'],
dtype='|S11')
If I have 3D array instead of aa_milne_arr, it doesn't let me proceed. I need to generate random things with the different probabilities for the 3 arrays, but the same for elements inside of them. For example,
>>> arr0 = ['red', 'green', 'blue']
>>> arr1 = ['light', 'wind', 'sky']
>>> arr3 = ['chicken', 'wolf', 'dog']
>>> p = [0.5, 0.1, 0.4]
And I want the same probability for the elements in arr0 (0.5), arr1 (0.1) and arr3 (0.4), so that as a result I will see any element from arr0 with probability 0.5, etc.
Is there any elegant way to do it?
Divide the values of p by the lengths of the arrays, then repeat each resulting value by the corresponding length. Then choose from the concatenated array with the new probabilities:
arr = [arr0, arr1, arr3]
lens = [len(a) for a in arr]
p = [.5, .1, .4]
new_arr = np.concatenate(arr)
new_p = np.repeat(np.divide(p, lens), lens)
np.random.choice(new_arr, p=new_p)
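For the arrays in the question, the rescaled probabilities come out as follows (a quick check; values rounded):
import numpy as np

arr0 = ['red', 'green', 'blue']
arr1 = ['light', 'wind', 'sky']
arr3 = ['chicken', 'wolf', 'dog']
arr = [arr0, arr1, arr3]
lens = [len(a) for a in arr]
p = [.5, .1, .4]

new_p = np.repeat(np.divide(p, lens), lens)
print(new_p.round(4))  # [0.1667 0.1667 0.1667 0.0333 0.0333 0.0333 0.1333 0.1333 0.1333]
print(new_p.sum())     # 1.0 up to floating point; each group still sums to 0.5, 0.1, 0.4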
Here is what I came up with.
It takes either a vector of probabilities, or a matrix where the weights are organized in columns. The weights will be normalized to sum to 1.
import numpy as np
def choice_vect(source, weights):
    # Draw N choices, each picked among K options
    # source: K x N ndarray
    # weights: K x N ndarray or K vector of probabilities
    weights = np.atleast_2d(weights)
    source = np.atleast_2d(source)
    N = source.shape[1]
    if weights.shape[0] == 1:
        weights = np.tile(weights.transpose(), (1, N))
    cum_weights = weights.cumsum(axis=0) / np.sum(weights, axis=0)
    unif_draws = np.random.rand(1, N)
    choices = (unif_draws < cum_weights)
    bool_indices = choices > np.vstack((np.zeros((1, N), dtype='bool'), choices))[0:-1, :]
    return source[bool_indices]
It avoids using loops and is like a vectorized version of random.choice.
You can then use it like that:
source = [[1,2],[3,4],[5,6]]
weights = [0.5, 0.4, 0.1]
choice_vect(source,weights)
>> array([3, 2])
weights = [[0.5,0.1],[0.4,0.4],[0.1,0.5]]
choice_vect(source,weights)
>> array([1, 4])

How to check a condition on each element of each row with numpy

For machine learning, I'm applying the Parzen window algorithm.
I have an array of shape (m, n). I would like to check, for each row, whether any of the values is > 0.5, and if each of them is, return 0, otherwise 1.
I would like to know if there is a way to do this without a loop thanks to numpy.
You can use np.all with axis=1 on a boolean array.
import numpy as np
arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
print(np.all(arr>0.5, axis=1))
>> [True False False]
import numpy as np

# Value initialization
a = np.array([0.75, 0.25, 0.50])
# If a value is greater than 0.5 the result is 1, otherwise 0
y_predict = (a > 0.5).astype(float)
I have an array (m,n). I would like to check on each row if any of the values is > 0.5
That will be stored in b:
import numpy as np
a = # some np.array of shape (m,n)
b = np.any(a > 0.5, axis=1)
and if each of them is, then I would return 0, otherwise 1.
I'm assuming you mean 'and if this is the case for all rows'. In this case:
c = 1 - 1 * np.all(b)
c contains your return value, either 0 or 1.
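If what you actually want is one 0/1 value per row (0 when every value in the row is > 0.5, 1 otherwise, which is one reading of the question), the building blocks above combine like this (my sketch):
import numpy as np

arr = np.array([[0.8, 0.9], [0.1, 0.6], [0.2, 0.3]])
# 0 where the whole row is > 0.5, 1 otherwise
result = np.where(np.all(arr > 0.5, axis=1), 0, 1)
print(result)  # [0 1 1]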
