Random Choice with different distributions for each sample - python

Imagine we want to draw 0 or 1 at random n times, but each draw should use a different probability distribution.
Considering np.random.choice:
a = [0, 1]  # or just 2
size = n
p = [
    [0.2, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
]  # for instance, if n = 3
The problem is that p needs to be a 1-dimensional vector. How can we do something like this without having to call np.random.choice n different times?
The reason I need to do this without calling np.random.choice multiple times is that I want an output of size n that is reproducible from a seed. However, if I set a seed and call np.random.choice n times, the randomness is lost across the n calls.
What I need is the following:
s = sample(a, n, p) # len(a) = len(p)
print(s)
>>> [1, 0, 0]

Numpy has a way to get an array of random floats, between 0 and 1, like so:
>>> a = np.random.uniform(0, 1, size=3)
>>> a
array([0.41444637, 0.90898856, 0.85223613])
Then, you can compare those floats with the probabilities you want:
>>> p = np.array([0.01, 0.5, 1])
>>> (a < p).astype(int)
array([0, 0, 1])
(Note: p is the probability of a 1 value, for each element.)
Putting all of that together, you can write a function to do this:
def sample(p):
    # p[i] is the probability of drawing a 1 at position i
    n = p.size
    a = np.random.uniform(0, 1, size=n)
    return (a < p).astype(int)
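For example, using the per-sample probabilities of drawing a 1 taken from the second column of the p matrix in the question (the seed value here is just an illustration; any fixed seed makes the whole output reproducible from a single call):
import numpy as np

np.random.seed(42)
p = np.array([0.8, 0.5, 0.3])  # probability of a 1 for each of the n = 3 samples
s = sample(p)
print(s)  # [1 0 0] with this seed -- one call, one draw per sample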

Related

Is there a numpy function to change values in an array by index without for loop?

I have written the code below:
size = int(len(np.where(features_temp != 0)[0]) * k_value)
idx = np.random.choice(np.where(np.in1d(features_temp, np.array(list_amino_numbers)))[0], size=size)
for i in idx:
    features_temp[i] = np.random.choice(
        list_amino_numbers,
        p=probabilities[:, list_amino_numbers.index(features_temp[i])].tolist(),
    )
This code works well, but I think it can run faster, mainly in the for loop. Is there some operation I can use to replace the for iteration? A vectorized sketch follows the input example below.
Code explanation: I am trying to change the values of features_temp at the indexes whose values are different from 0. Each index can be changed many times, and the number of possible changes depends on the number of non-zero values and a constant (this count is stored in size, and the chosen indexes are saved in idx). In the end, each replacement depends on a matrix (probabilities), where line i and column j gives the probability that value j is changed to value i (so I need to use the column matching the current value).
Input Example:
features_temp = np.array([3, 2, 0, 2, 1])
k_value = 1.5
list_amino_numbers = [1, 3, 2]
probabilities = np.array([[0.9, 0.2, 0.3], [0.07, 0.7, 0.5], [0.03, 0.1, 0.2]])
In this case, size = 6.
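As a hedged sketch (my assumption, not code from the question), the Python-level loop can be replaced by inverse-CDF sampling: draw one uniform number per selected index and pick the replacement from the cumulative probabilities of the matching column. One caveat: every entry of idx is resampled based on the value features_temp held before any replacement, which differs from the sequential loop when idx contains repeated indexes (there, a later draw can see an earlier replacement).
import numpy as np

np.random.seed(0)  # only so the sketch is reproducible

features_temp = np.array([3, 2, 0, 2, 1])
k_value = 1.5
list_amino_numbers = [1, 3, 2]
probabilities = np.array([[0.9, 0.2, 0.3],
                          [0.07, 0.7, 0.5],
                          [0.03, 0.1, 0.2]])

size = int(len(np.where(features_temp != 0)[0]) * k_value)
idx = np.random.choice(np.where(np.in1d(features_temp, list_amino_numbers))[0], size=size)

# Column of `probabilities` corresponding to the current value at each chosen index.
col_of = {v: j for j, v in enumerate(list_amino_numbers)}
cols = np.array([col_of[v] for v in features_temp[idx]])

# Inverse-CDF sampling: one uniform draw per chosen index, compared against the
# cumulative probabilities of its column (each column of `probabilities` sums to 1).
cum = np.cumsum(probabilities[:, cols], axis=0)        # shape (len(list_amino_numbers), size)
rows = (np.random.rand(1, size) < cum).argmax(axis=0)  # first row whose CDF exceeds the draw
features_temp[idx] = np.asarray(list_amino_numbers)[rows]
print(features_temp)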

Elements in list greater than or equal to elements in other list (without for loop?)

I have a list containing 1,000,000 elements (numbers) called x and I would like to count how many of them are equal to or above [0.5,0.55,0.60,...,1]. Is there a way to do it without a for loop?
Right now I have the following code, which works for a specific value of the [0.5, ..., 1] interval, let's say 0.5, and assigns the result to the count variable:
count=len([i for i in x if i >= 0.5])
EDIT: Basically what I want to avoid is doing this... if possible?
obs = []
alpha = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]
for a in alpha:
    count = len([i for i in x if i >= a])
    obs.append(count)
Thanks in advance
Best, Mikael
I don't think it's possible without a loop, but you can sort the list x and then use the bisect module to locate the insertion point (index) for each threshold.
For example:
import bisect

x = [0.341, 0.423, 0.678, 0.999, 0.523, 0.751, 0.7]
alpha = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1]

x = sorted(x)
obs = [len(x) - bisect.bisect_left(x, a) for a in alpha]
print(obs)
Will print:
[5, 4, 4, 4, 3, 2, 1, 1, 1, 1, 0]
Note: sorted() is O(n log n) and each bisect_left() call is O(log n).
You can use numpy and boolean indexing:
>>> import numpy as np
>>> a = np.arange(100)
>>> a[a >= 50].size
50
Even if you are not writing a for loop yourself, the internal methods still use one; they just iterate efficiently. You can use the snippet below without an explicit for loop on your end (note >= rather than >, to match the question, and len() to get the count):
x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
matches = list(filter(lambda v: v >= 0.5, x))
print(len(matches))  # 5
Based on comments, you're OK with using numpy, so use np.searchsorted to find where each alpha would be inserted into a sorted version of x; subtracting those insertion indices from x.size gives your counts.
If you're ok with sorting x in-place:
x.sort()
counts = x.size - np.searchsorted(x, alpha)
If not,
counts = x.size - np.searchsorted(np.sort(x), alpha)
With the default side='left', np.searchsorted returns the number of elements with x < alpha, so counts above is the number of elements with x >= alpha, which is what you asked for. If you want a strict x > alpha instead, add the keyword side='right':
np.searchsorted(x, alpha, side='right')
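As a quick check (not from the original answer), reusing the small x and alpha lists from the bisect answer above reproduces the same counts:
import numpy as np

x = np.array([0.341, 0.423, 0.678, 0.999, 0.523, 0.751, 0.7])
alpha = np.array([0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1])

x.sort()
counts = x.size - np.searchsorted(x, alpha)
print(counts)  # [5 4 4 4 3 2 1 1 1 1 0]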
PS
There are a couple of significant problems with the line
count = len([i for i in x if i >= 0.5])
First of all, you're creating a list of all the matching elements instead of just counting them. To count them do
count = sum(1 for i in x if i >= threshold)
Now the problem is that you are doing a linear pass through the entire array for each alpha, which is not necessary.
As I commented under @Andrej Kesely's answer, let's say we have N = len(x) and M = len(alpha). Your implementation is O(M * N) time complexity, while sorting gives you O((M + N) log N). For M << N (small alpha), your complexity is approximately O(N), which beats O(N log N). But for M ~= N, yours approaches O(N^2) vs my O(N log N).
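A rough way to see this empirically is a timing sketch like the one below (my own illustration, not from the answer; absolute numbers will vary by machine):
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000)
alpha = [i / 100 for i in range(50, 101, 5)]

def per_alpha_scan():
    # O(M * N): one full pass over x for every alpha
    return [sum(1 for i in x if i >= a) for a in alpha]

def sort_then_search():
    # O((M + N) log N): sort once, then binary-search each alpha
    xs = np.sort(x)
    return (xs.size - np.searchsorted(xs, alpha)).tolist()

assert per_alpha_scan() == sort_then_search()
print(timeit.timeit(per_alpha_scan, number=3))
print(timeit.timeit(sort_then_search, number=3))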
EDIT: If you are using NumPy already, you can simply do this:
import numpy as np
# Make random data
np.random.seed(0)
x = np.random.binomial(n=20, p=0.5, size=1000000) / 20
bins = np.arange(0.55, 1.01, 0.05)
# One extra value for the upper bound of last bin
bins = np.append(bins, max(bins.max(), x.max()) + 1)
h, _ = np.histogram(x, bins)
result = np.cumsum(h)
print(result)
# [280645 354806 391658 406410 411048 412152 412356 412377 412378 412378]
If you are dealing with large arrays of numbers, you may consider using NumPy. But if you are using plain Python lists, you can do it for example like this:
def how_many_bigger(nums, mins):
    # List of counts for each minimum
    counts = [0] * len(mins)
    # For each number
    for n in nums:
        # For each minimum
        for i, m in enumerate(mins):
            # Add 1 to the count if the number is greater than or equal to the current minimum
            if n >= m:
                counts[i] += 1
    return counts
# Test
import random
# Make random data
random.seed(0)
nums = [random.random() for _ in range(1_000_000)]
# Make minimums
mins = [i / 100. for i in range(55, 101, 5)]
print(mins)
# [0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
count = how_many_bigger(nums, mins)
print(count)
# [449771, 399555, 349543, 299687, 249605, 199774, 149945, 99928, 49670, 0]

Range of index in NumPy correlation function

I am looking into the NumPy correlation function
numpy.correlate(a, v, mode='valid')
Cross-correlation of two 1-dimensional sequences.
This function computes the correlation as generally defined in signal processing texts:
c_{av}[k] = sum_n a[n+k] * conj(v[n])
Then for the example:
a = [1, 2, 3]
v = [0, 1, 0.5]
np.correlate([1, 2, 3], [0, 1, 0.5], "full")
array([ 0.5, 2. , 3.5, 3. , 0. ])
So k in the output array runs from 0 to 4 in this example. However, I am wondering how a[n+k] is defined when (n+k) > 2 in this case?
Also, how is conj(v[n]) defined, and how is each element in the output array computed?
The formula c_{av}[k] = sum_n a[n+k] * conj(v[n]) is a little misleading because k on the left is not necessarily the Python index of the output array. In the 'full' mode, the possible values of k are those for which there exists at least one n such that a[n+k] * conj(v[n]) is defined (that is, both n+k and n fall in the ranges of respective arrays).
In your example, k in sum_n a[n+k] * conj(v[n]) can be -2, -1, 0, 1, 2. These generate the 5 values that you see. For example, k = -2 leaves only the n = 2 term, a[0]*conj(v[2]) = 1*0.5 = 0.5, and so on.
In general, the range of k in the 'full' mode is from 1-len(v) to len(a)-1 inclusive (here both lengths are 3, so k runs from -2 to 2). So, if k is really understood as the Python index into the output array, the formula should be
c_{av}[k] = sum_n a[n + k + 1 - len(v)] * conj(v[n])
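A short sketch (not from the original answer) that reproduces the 'full' output by summing a[n+k] * conj(v[n]) over the shifts k = 1-len(v), ..., len(a)-1:
import numpy as np

a = np.array([1, 2, 3])
v = np.array([0, 1, 0.5])

# Shifts covered by 'full' mode, in the order they appear in the output.
shifts = range(1 - len(v), len(a))
manual = [
    sum(a[n + k] * np.conj(v[n])
        for n in range(len(v))
        if 0 <= n + k < len(a))
    for k in shifts
]
print(manual)                      # [0.5, 2.0, 3.5, 3.0, 0.0]
print(np.correlate(a, v, "full"))  # [0.5 2.  3.5 3.  0. ]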

Generate each column of the numpy array with random number from different range

How can I generate a numpy array such that each column of the array comes from a uniform distribution with a different range, efficiently? The following code uses two for loops, which is slow; is there any matrix-style way to generate such an array faster? Thanks.
import numpy as np

num = 5
ranges = [[0, 1], [4, 5]]
a = np.zeros((num, len(ranges)))
for i in range(num):
    for j in range(len(ranges)):
        a[i, j] = np.random.uniform(ranges[j][0], ranges[j][1])
What you can do is produce all random numbers in the interval [0, 1) first and then scale and shift them accordingly:
import numpy as np
num = 5
ranges = np.asarray([[0,1],[4,5]])
starts = ranges[:, 0]
widths = ranges[:, 1]-ranges[:, 0]
a = starts + widths*np.random.random(size=(num, widths.shape[0]))
So basically, you create an array of the right size via np.random.random(size=(num, widths.shape[0])) with random number between 0 and 1. Then you scale each value by a factor corresponding to the width of the interval that you actually want to sample. Finally, you shift them by starts to account for the different starting values of the intervals.
numpy.random.uniform will broadcast its arguments, so it can generate the desired samples if you pass the following arguments:
low: the sequence of low values.
high: the sequence of high values.
size: a tuple like (num, m), where m is the number of ranges and num the number of groups of m samples to generate.
For example:
In [23]: num = 5
In [24]: ranges = np.array([[0, 1], [4, 5], [10, 15]])
In [25]: np.random.uniform(low=ranges[:, 0], high=ranges[:, 1], size=(num, ranges.shape[0]))
Out[25]:
array([[ 0.98752526, 4.70946614, 10.35525699],
[ 0.86137374, 4.22046152, 12.28458447],
[ 0.92446543, 4.52859103, 11.30326391],
[ 0.0535877 , 4.8597036 , 14.50266784],
[ 0.55854656, 4.86820001, 14.84934564]])

Multidimensional array for random.choice in NumPy

I have a table and I need to use random.choice for probability calculation,
for example (taken from docs):
>>> aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
>>> np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'],
dtype='|S11')
If I have a 3D array instead of aa_milne_arr, it doesn't let me proceed. I need to generate random values with different probabilities for the 3 arrays, but the same probability for the elements inside each of them. For example,
>>> arr0 = ['red', 'green', 'blue']
>>> arr1 = ['light', 'wind', 'sky']
>>> arr3 = ['chicken', 'wolf', 'dog']
>>> p = [0.5, 0.1, 0.4]
And I want the same probabilities for the elements within arr0 (0.5 in total), arr1 (0.1 in total) and arr3 (0.4 in total), so as a result I will see any element from arr0 with probability 0.5, etc.
Is there any elegant way to do it?
Divide the values of p by the lengths of the arrays, then repeat each resulting value by the corresponding length. Then choose from the concatenated array with the new probabilities:
import numpy as np

arr = [arr0, arr1, arr3]
lens = [len(a) for a in arr]
p = [.5, .1, .4]

new_arr = np.concatenate(arr)
new_p = np.repeat(np.divide(p, lens), lens)
np.random.choice(new_arr, p=new_p)
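As a quick sanity check (my own addition, not part of the answer), drawing many samples and measuring the per-group frequencies should come out near 0.5 / 0.1 / 0.4:
import numpy as np

np.random.seed(0)

arr0 = ['red', 'green', 'blue']
arr1 = ['light', 'wind', 'sky']
arr3 = ['chicken', 'wolf', 'dog']

arr = [arr0, arr1, arr3]
lens = [len(a) for a in arr]
p = [.5, .1, .4]

new_arr = np.concatenate(arr)
new_p = np.repeat(np.divide(p, lens), lens)

draws = np.random.choice(new_arr, size=100_000, p=new_p)
for group, prob in zip(arr, p):
    print(group, prob, round(np.isin(draws, group).mean(), 3))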
Here is what I came up with.
It takes either a vector of probabilities, or a matrix where the weights are organized in columns. The weights will be normalized to sum to 1.
import numpy as np

def choice_vect(source, weights):
    # Draw N choices, each picked among K options
    # source: K x N ndarray
    # weights: K x N ndarray or K vector of probabilities
    weights = np.atleast_2d(weights)
    source = np.atleast_2d(source)
    N = source.shape[1]
    if weights.shape[0] == 1:
        weights = np.tile(weights.transpose(), (1, N))
    cum_weights = weights.cumsum(axis=0) / np.sum(weights, axis=0)
    unif_draws = np.random.rand(1, N)
    choices = (unif_draws < cum_weights)
    bool_indices = choices > np.vstack((np.zeros((1, N), dtype='bool'), choices))[0:-1, :]
    return source[bool_indices]
It avoids using loops and is like a vectorized version of random.choice.
You can then use it like this:
source = [[1,2],[3,4],[5,6]]
weights = [0.5, 0.4, 0.1]
choice_vect(source,weights)
>> array([3, 2])
weights = [[0.5,0.1],[0.4,0.4],[0.1,0.5]]
choice_vect(source,weights)
>> array([1, 4])
