Multidimensional array for random.choice in NumPy - python

I have a table and I need to use random.choice for probability calculation,
for example (taken from docs):
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'],
      dtype='|S11')
If I have a 3D array instead of aa_milne_arr, it doesn't let me proceed. I need to generate random draws with different probabilities for the 3 arrays, but with the same probability for the elements inside each of them. For example,
>>> arr0 = ['red', 'green', 'blue']
>>> arr1 = ['light', 'wind', 'sky']
>>> arr3 = ['chicken', 'wolf', 'dog']
>>> p = [0.5, 0.1, 0.4]
And I want the same probability for the elements within arr0 (0.5), arr1 (0.1) and arr3 (0.4), so that as a result I will see, with probability 0.5, any element from arr0, etc.
Is there any elegant way to do it?

Divide the values of p by the lengths of the arrays, then repeat each value by the corresponding length. Then choose from the concatenated array with the new probabilities:
import numpy as np

arr = [arr0, arr1, arr3]
lens = [len(a) for a in arr]
p = [.5, .1, .4]
new_arr = np.concatenate(arr)
new_p = np.repeat(np.divide(p, lens), lens)
np.random.choice(new_arr, p=new_p)
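As a quick sanity check (my own addition, not part of the original answer), you can draw many samples and confirm that roughly 50% come from arr0, 10% from arr1 and 40% from arr3:
draws = np.random.choice(new_arr, size=100_000, p=new_p)
print(np.isin(draws, arr0).mean())  # ~0.5
print(np.isin(draws, arr1).mean())  # ~0.1
print(np.isin(draws, arr3).mean())  # ~0.4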

Here is what I came up with.
It takes either a vector of probabilities, or a matrix where the weights are organized in columns. The weights will be normalized to sum to 1.
import numpy as np

def choice_vect(source, weights):
    # Draw N choices, each picked among K options
    # source: K x N ndarray
    # weights: K x N ndarray or length-K vector of probabilities
    weights = np.atleast_2d(weights)
    source = np.atleast_2d(source)
    N = source.shape[1]
    if weights.shape[0] == 1:
        weights = np.tile(weights.transpose(), (1, N))
    cum_weights = weights.cumsum(axis=0) / np.sum(weights, axis=0)
    unif_draws = np.random.rand(1, N)
    choices = (unif_draws < cum_weights)
    bool_indices = choices > np.vstack((np.zeros((1, N), dtype='bool'), choices))[0:-1, :]
    return source[bool_indices]
It avoids using loops and is like a vectorized version of random.choice.
You can then use it like that:
source = [[1,2],[3,4],[5,6]]
weights = [0.5, 0.4, 0.1]
choice_vect(source,weights)
>> array([3, 2])
weights = [[0.5,0.1],[0.4,0.4],[0.1,0.5]]
choice_vect(source,weights)
>> array([1, 4])
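For what it's worth, a slightly shorter sketch of the same idea (my own variant, not the answer's code) selects the first exceeded cutoff per column with argmax instead of the boolean-difference trick:
import numpy as np

def choice_vect_argmax(source, weights):
    source = np.atleast_2d(source)
    weights = np.atleast_2d(weights)
    if weights.shape[0] == 1:
        weights = np.tile(weights.T, (1, source.shape[1]))
    cum_weights = weights.cumsum(axis=0) / weights.sum(axis=0)
    u = np.random.rand(1, source.shape[1])
    rows = (u < cum_weights).argmax(axis=0)           # first cutoff exceeded, per column
    return source[rows, np.arange(source.shape[1])]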

Related

Random Choice with different distributions for each sample

Imagine we want to randomly draw 0 or 1, n times. But every time we make a random choice we want to use a different probability distribution.
Considering np.random.choice:
a = [0, 1] or just 2
size = n
p = [
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3]
]  # For instance if n = 3
The problem is that p needs to be a 1-dimensional vector. How can we make something like this without having to call np.random.choice n different times?
The reason why I need to do this without calling np.random.choice multiple times is that I want an output of size n using a seed for reproducibility. However if I call np.random.choice n times with a seed the randomness is lost within the n calls.
What I need is the following:
s = sample(a, n, p) # len(a) = len(p)
print(s)
>>> [1, 0, 0]
Numpy has a way to get an array of random floats, between 0 and 1, like so:
>>> a = np.random.uniform(0, 1, size=3)
>>> a
array([0.41444637, 0.90898856, 0.85223613])
Then, you can compare those floats with the probabilities you want:
>>> p = np.array([0.01, 0.5, 1])
>>> (a < p).astype(int)
array([0, 0, 1])
(Note: p is the probability of a 1 value, for each element.)
Putting all of that together, you can write a function to do this:
def sample(p):
    n = p.size
    a = np.random.uniform(0, 1, size=n)
    return (a < p).astype(int)
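For example, a seeded, single-call usage (my own illustrative p values, assuming the sample() above and numpy imported as np) stays reproducible because the generator is consumed only once:
np.random.seed(0)
p = np.array([0.2, 0.5, 0.7])  # per-sample probability of drawing a 1
print(sample(p))               # [0 0 1] with this seed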

Elements in list greater than or equal to elements in other list (without for loop?)

I have a list containing 1,000,000 elements (numbers) called x and I would like to count how many of them are equal to or above [0.5,0.55,0.60,...,1]. Is there a way to do it without a for loop?
Right now I have the following code, which works for a specific value of the [0.5, ..., 1] interval, let's say 0.5, and assigns it to the count variable:
count=len([i for i in x if i >= 0.5])
EDIT: Basically what I want to avoid is doing this... if possible?
obs = []
alpha = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
for a in alpha:
    count = len([i for i in x if i >= a])
    obs.append(count)
I don't think it's possible without a loop, but you can sort x and then use the bisect module (doc) to locate each insertion point (index).
For example:
x = [0.341, 0.423, 0.678, 0.999, 0.523, 0.751, 0.7]
alpha = [0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1]
x = sorted(x)
import bisect
obs = [len(x) - bisect.bisect_left(x, a) for a in alpha]
print(obs)
Will print:
[5, 4, 4, 4, 3, 2, 1, 1, 1, 1, 0]
Note:
sorted() has O(n log n) complexity and bisect_left() O(log n).
You can use numpy and boolean indexing:
>>> import numpy as np
>>> a = np.array(list(range(100)))
>>> a[a>=50].size
50
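If memory is not a concern, a broadcast-comparison sketch (my own variant, not from the answer) gets all the counts at once without an explicit Python loop, at the cost of an N x M boolean matrix (about 11 MB here):
import numpy as np
x = np.asarray(x)                 # the 1,000,000 values from the question
alpha = np.array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1])
obs = (x[:, None] >= alpha).sum(axis=0)   # one count per threshold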
Even if you are not writing a for loop yourself, the internal methods still use one; they just iterate more efficiently. You can use the snippet below without a for loop on your end:
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
l = list(filter(lambda v: v >= 0.5, x))
print(l)
Based on comments, you're ok with using numpy, so use np.searchsorted to simply insert alpha into a sorted version of x. The indices will be your counts.
If you're ok with sorting x in-place:
x.sort()
counts = x.size - np.searchsorted(x, alpha)
If not,
counts = x.size - np.searchsorted(np.sort(x), alpha)
With the default side='left', the subtracted term counts the elements strictly below each alpha, so counts is the number of elements >= alpha, which is what you asked for. To count only elements strictly greater than alpha instead, add the keyword side='right':
np.searchsorted(x, alpha, side='right')
PS
There are a couple of significant problems with the line
count = len([i for i in x if i >= 0.5])
First of all, you're creating a list of all the matching elements instead of just counting them. To count them do
count = sum(1 for i in x if i >= threshold)
Now the problem is that you are doing a linear pass through the entire array for each alpha, which is not necessary.
As I commented under @Andrej Kesely's answer, let's say we have N = len(x) and M = len(alpha). Your implementation is O(M * N) time complexity, while sorting gives you O((M + N) log N). For M << N (a short alpha list), your complexity is approximately O(N), which beats O(N log N). But for M ~= N, yours approaches O(N^2) vs my O(N log N).
EDIT: If you are using NumPy already, you can simply do this:
import numpy as np
# Make random data
np.random.seed(0)
x = np.random.binomial(n=20, p=0.5, size=1000000) / 20
bins = np.arange(0.55, 1.01, 0.05)
# One extra value for the upper bound of last bin
bins = np.append(bins, max(bins.max(), x.max()) + 1)
h, _ = np.histogram(x, bins)
result = np.cumsum(h)
print(result)
# [280645 354806 391658 406410 411048 412152 412356 412377 412378 412378]
If you are dealing with large arrays of numbers, you may consider using NumPy. But if you are using simple Python lists, you can do it for example like this:
def how_many_bigger(nums, mins):
    # List of counts for each minimum
    counts = [0] * len(mins)
    # For each number
    for n in nums:
        # For each minimum
        for i, m in enumerate(mins):
            # Add 1 to the count if the number is greater than or equal to the current minimum
            if n >= m:
                counts[i] += 1
    return counts
# Test
import random
# Make random data
random.seed(0)
nums = [random.random() for _ in range(1_000_000)]
# Make minimums
mins = [i / 100. for i in range(55, 101, 5)]
print(mins)
# [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
count = how_many_bigger(nums, mins)
print(count)
# [449771, 399555, 349543, 299687, 249605, 199774, 149945, 99928, 49670, 0]

Calculate mean, variance, covariance of different length matrices in a split list

I have an array whose rows each hold 5 values: 4 data values and one index. I sort and split the array along the index. This leads to splits (matrices) of different lengths. From here on I want to calculate the mean and variance of the fourth value, and the covariance of the first 3 values, for every split. My current approach works with a for loop, which I would like to replace with matrix operations, but I am struggling with the different sizes of my matrices.
import numpy as np
A = np.random.rand(10,5)
A[:,-1] = np.random.randint(4, size=10)
sorted_A = A[np.argsort(A[:,4])]
splits = np.split(sorted_A, np.where(np.diff(sorted_A[:,4]))[0]+1)
My current for loop looks like this:
result = np.zeros((len(splits), 5))
for idx, values in enumerate(splits):
    if len(values) > 0:
        result[idx, 0] = np.mean(values[:, 3])
        result[idx, 1] = np.var(values[:, 3])
        result[idx, 2:5] = np.cov(values[:, 0:3].transpose(), ddof=0).diagonal()
    else:
        result[idx, 0] = values[:, 3]
I tried to work with masked arrays without success, since I couldn't load the matrices into the masked arrays in a proper form. Maybe someone knows how to do this or has a different suggestion.
You can use np.add.reduceat as follows:
>>> idx = np.concatenate([[0], np.where(np.diff(sorted_A[:,4]))[0]+1, [A.shape[0]]])
>>> result2 = np.empty((idx.size-1, 5))
>>> result2[:, 0] = np.add.reduceat(sorted_A[:, 3], idx[:-1]) / np.diff(idx)
>>> result2[:, 1] = np.add.reduceat(sorted_A[:, 3]**2, idx[:-1]) / np.diff(idx) - result2[:, 0]**2
>>> result2[:, 2:5] = np.add.reduceat(sorted_A[:, :3]**2, idx[:-1], axis=0) / np.diff(idx)[:, None]
>>> result2[:, 2:5] -= (np.add.reduceat(sorted_A[:, :3], idx[:-1], axis=0) / np.diff(idx)[:, None])**2
>>>
>>> np.allclose(result, result2)
True
Note that the diagonal of the covariance matrix is just the variances, which simplifies this vectorization quite a bit.
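A quick check of that claim on one split (assuming splits from the question, and my own check rather than part of the answer): the diagonal of the population covariance matrix equals the per-column variance.
v = splits[0]
print(np.allclose(np.cov(v[:, 0:3].T, ddof=0).diagonal(),
                  np.var(v[:, 0:3], axis=0)))  # True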

python numpy weighted average with nans

First things first: this is not a duplicate of NumPy: calculate averages with NaNs removed, and I'll explain why:
Suppose I have an array
a = array([1,2,3,4])
and I want to average over it with the weights
weights = [4,3,2,1]
output = average(a, weights=weights)
print output
2.0
OK. So this is pretty straightforward. But now I have something like this:
a = array([1,2,nan,4])
Calculating the average with the usual method of course yields nan. Can I avoid this?
In principle I want to ignore the nans, so I'd like to have something like this:
a = array([1,2,4])
weights = [4,3,1]
output = average(a, weights=weights)
print output
1.75
Alternatively, you can use a MaskedArray as such:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> ma = np.ma.MaskedArray(a, mask=np.isnan(a))
>>> np.ma.average(ma, weights=weights)
1.75
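The same masked-array route also works along an axis for 2-D data. Here is a small sketch with data of my own (not from the answer); the nans are ignored column-wise:
>>> A = np.array([[1., np.nan], [2., 3.], [np.nan, 4.]])
>>> W = np.array([[4., 4.], [3., 3.], [1., 1.]])
>>> mA = np.ma.MaskedArray(A, mask=np.isnan(A))
>>> np.ma.average(mA, weights=W, axis=0)   # column-wise weighted means: [10/7 ~ 1.4286, 3.25]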
First find out indices where the items are not nan, and then pass the filtered versions of a and weights to numpy.average:
>>> import numpy as np
>>> a = np.array([1,2,np.nan,4])
>>> weights = np.array([4,3,2,1])
>>> indices = np.where(np.logical_not(np.isnan(a)))[0]
>>> np.average(a[indices], weights=weights[indices])
1.75
As suggested by @mtrw in the comments, it would be cleaner to use a boolean mask here instead of an index array:
>>> indices = ~np.isnan(a)
>>> np.average(a[indices], weights=weights[indices])
1.75
I would offer another solution, which is more scalable to bigger dimensions (e.g. when averaging over a different axis). The code below works with a 2D array, which may contain nans, and takes the average over axis=0.
a = np.random.randint(5, size=(3,2)) # let's generate some random 2D array
# make weights matrix with zero weights at nan's in a
w_vec = np.arange(1, a.shape[0]+1)
w_vec = w_vec.reshape(-1, 1)
w_mtx = np.repeat(w_vec, a.shape[1], axis=1)
w_mtx *= (~np.isnan(a))
# take average as (weighted_elements_sum / weights_sum)
w_a = a * w_mtx
a_sum_vec = np.nansum(w_a, axis=0)
w_sum_vec = np.nansum(w_mtx, axis=0)
mean_vec = a_sum_vec / w_sum_vec
# mean_vec is vector with weighted nan-averages of array a taken along axis=0
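A small sanity-check sketch (with example data of my own, not from the answer), comparing this against the masked-array approach from the other answer:
a = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
w_vec = np.arange(1, a.shape[0] + 1).reshape(-1, 1)          # row weights 1, 2, 3
w_mtx = np.repeat(w_vec, a.shape[1], axis=1) * ~np.isnan(a)  # zero weight at nans
mean_vec = np.nansum(a * w_mtx, axis=0) / np.nansum(w_mtx, axis=0)
expected = np.ma.average(np.ma.MaskedArray(a, mask=np.isnan(a)), weights=w_mtx, axis=0)
print(np.allclose(mean_vec, expected))  # True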
Expanding on @Ashwini's and @Nicolas' answers, here is a version that can also handle the edge case where all the data values are np.nan, and that is designed to also work with a pandas DataFrame without type-related issues:
from typing import List, Union

import numpy as np
import pandas as pd

def calc_wa_ignore_nan(df: pd.DataFrame, measures: List[str],
                       weights: List[Union[float, int]]) -> np.ndarray:
    """ Calculates the weighted average of `measures`' values, ex-nans.

    When nans are present in `measures`' values,
    the weights are recalculated based only on the weights for non-nan measures.

    Note:
        The calculation used is NOT the same as just ignoring nans.
        For example, if we had data and weights:
            data = [2, 3, np.nan]
            weights = [0.5, 0.2, 0.3]
            calc_wa_ignore_nan approach:
                (2*(0.5/(0.5+0.2))) + (3*(0.2/(0.5+0.2))) == 2.285714285714286
            The ignoring nans approach:
                (2*0.5) + (3*0.2) == 1.6

    Args:
        df: Multiple rows of numeric data values with `measures` as column headers.
        measures: The str names of the columns to select from `df`.
        weights: The numeric weights associated with `measures`.

    Example:
        >>> df = pd.DataFrame({"meas1": [1, 1],
                               "meas2": [2, 2],
                               "meas3": [3, 3],
                               "meas4": [np.nan, 0],
                               "meas5": [5, 5]})
        >>> measures = ["meas2", "meas3", "meas4"]
        >>> weights = [0.5, 0.2, 0.3]
        >>> calc_wa_ignore_nan(df, measures, weights)
        array([2.28571429, 1.6])
    """
    assert not df.empty, "Nothing to calculate weighted average for: `df` is empty."
    # Need to coerce the type to np.float64 instead of python's float
    # to avoid "ufunc 'isnan' not supported for the input types ..." error
    data = np.array(df[measures].values, dtype=np.float64)
    # Make a 2d array with the same weights for each row
    # cast for safety and better errors
    weights = np.array([weights, ] * data.shape[0], dtype=np.float64)
    mask = np.isnan(data)
    masked_data = np.ma.masked_array(data, mask=mask)
    masked_weights = np.ma.masked_array(weights, mask=mask)
    # np.nanmean doesn't support weights
    weighted_avgs = np.average(masked_data, weights=masked_weights, axis=1)
    # Replace masked elements with np.nan,
    # otherwise those elements will be interpreted as 0 when read into a pd.DataFrame
    weighted_avgs = weighted_avgs.filled(np.nan)
    return weighted_avgs
All the solutions above are very good, but none of them handles the case where there are nans in the weights as well. To do that with pandas:
def weighted_average_ignoring_nan(df, col_value, col_weight):
    den = 0
    num = 0
    for index, row in df.iterrows():
        if ~np.isnan(row[col_weight]) & ~np.isnan(row[col_value]):
            den = den + row[col_weight]
            num = num + row[col_weight] * row[col_value]
    return num / den
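A vectorized sketch of the same idea (my own variant, not the answer's code), which avoids the iterrows loop by filtering out rows with a nan in either column first:
def weighted_average_ignoring_nan_vec(df, col_value, col_weight):
    valid = df[col_value].notna() & df[col_weight].notna()
    v = df.loc[valid, col_value]
    w = df.loc[valid, col_weight]
    return (v * w).sum() / w.sum()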
Since you're looking for the mean, another idea is to simply replace all the nan values with 0s:
>>> import numpy as np
>>> a = np.array([[ 3.,  2.,  5.], [np.nan,  4., np.nan], [np.nan, np.nan, np.nan]])
>>> w = np.array([[ 1.,  2.,  3.], [np.nan, np.nan, np.nan], [np.nan, np.nan, np.nan]])
>>> a[np.isnan(a)] = 0
>>> w[np.isnan(w)] = 0
>>> np.average(a, weights=w)
3.6666666666666665
This can be used with the axis functionality of the average function, but be careful that your weights don't sum up to 0.
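For instance, a minimal illustration of that caveat (my own addition, using a and w from above after the nan replacement): averaging along an axis works as long as no column of weights sums to zero, otherwise np.average raises ZeroDivisionError.
>>> np.average(a, weights=w, axis=0)   # fine: each column of w sums to a nonzero value
array([3., 2., 5.])
>>> np.average(a, weights=np.zeros_like(w), axis=0)   # all-zero weights
Traceback (most recent call last):
  ...
ZeroDivisionError: Weights sum to zero, can't be normalized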

draw random element in numpy

I have an array of element probabilities, let's say [0.1, 0.2, 0.5, 0.2]. The array sums up to 1.0.
Using plain Python or numpy, I want to draw elements proportional to their probability: the first element about 10% of the time, second 20%, third 50% etc. The "draw" should return index of the element drawn.
I came up with this:
def draw(probs):
    cumsum = numpy.cumsum(probs / sum(probs))  # sum up to 1.0, just in case
    return len(numpy.where(numpy.random.rand() >= cumsum)[0])
It works, but it's too convoluted, there must be a better way. Thanks.
import numpy as np

def random_pick(choices, probs):
    '''
    >>> a = ['Hit', 'Out']
    >>> b = [.3, .7]
    >>> random_pick(a, b)
    '''
    cutoffs = np.cumsum(probs)
    idx = cutoffs.searchsorted(np.random.uniform(0, cutoffs[-1]))
    return choices[idx]
How it works:
In [22]: import numpy as np
In [23]: probs = [0.1, 0.2, 0.5, 0.2]
Compute the cumulative sum:
In [24]: cutoffs = np.cumsum(probs)
In [25]: cutoffs
Out[25]: array([ 0.1, 0.3, 0.8, 1. ])
Compute a uniformly distributed random number in the half-open interval [0, cutoffs[-1]):
In [26]: np.random.uniform(0, cutoffs[-1])
Out[26]: 0.9723114393023948
Use searchsorted to find the index where the random number would be inserted into cutoffs:
In [27]: cutoffs.searchsorted(0.9723114393023948)
Out[27]: 3
Return choices[idx], where idx is that index.
You want to sample from the categorical distribution, which is not implemented in numpy. However, the multinomial distribution is a generalization of the categorical distribution and can be used for that purpose.
>>> import numpy as np
>>>
>>> def sampleCategory(p):
...     return np.flatnonzero( np.random.multinomial(1,p,1) )[0]
...
>>> sampleCategory( [0.1,0.5,0.4] )
1
use numpy.random.multinomial - most efficient
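A minimal sketch of that suggestion for many draws at once (my own example; each multinomial row with a single trial is a one-hot vector, so argmax recovers the drawn index):
import numpy as np
probs = [0.1, 0.2, 0.5, 0.2]
draws = np.random.multinomial(1, probs, size=10).argmax(axis=1)  # 10 indices drawn from probs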
I've never used numpy, but I assume my code below (plain Python only) does the same thing as what you accomplished in one line. I'm putting it here just in case you want it.
It looks very C-ish, so apologies for not being very pythonic.
weight_total would be 1 for you.
import random

def draw(probs):
    weight_total = sum(probs)  # 1 for you, since the probabilities sum to 1
    r = random.uniform(0, weight_total)
    running_total = 0
    for i, p in enumerate(probs):
        running_total += p
        if running_total > r:
            return i
use bisect
import bisect
import random
import numpy
def draw(probs):
    cumsum = numpy.cumsum(probs / sum(probs))
    return bisect.bisect_left(cumsum, numpy.random.rand())
should do the trick.
