get a list of unique items from Random.choices function - python

I have a method that is using the random package to generate a list with certain probability for example:
import random
seed = 30
rand = random.Random(seed)
options_list = [1, 2, 3, 4, 5, 6]
prob_weights = [0.1, 0.2, 0.1, 0.05, 0.02, 0.06]
result = rand.choices(options_list, prob_weights, k=4)  # k will be <= len(options_list)
My problem is that result can contain the same item twice, and I want the items to be unique.
I could make the k parameter much larger and then filter out the unique items, but that seems like the wrong way to do it. I looked in the docs and I don't see a parameter for this on the choices function.
Any ideas how to configure random to return a list of unique items?

You can use np.random.choice, which allows you to assign probabilities to each entry and also to generate random samples without replacement. The probabilities must sum to one, however, so you'll have to divide the weight vector by its sum (its L1 norm). Here's how you could do it:
import numpy as np
options_list = np.array([1, 2, 3, 4, 5, 6])
prob_weights = np.array([0.1, 0.2, 0.1, 0.05, 0.02, 0.06])
prob_weights_scaled = prob_weights / sum(prob_weights)
some_length = 4
np.random.choice(a=options_list, size=some_length, replace=False, p=prob_weights_scaled)
Output
array([2, 1, 6, 3])
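If you'd rather stay with the standard library's random module (and keep the seeded random.Random from the question), one option is to draw one item at a time and remove it from the pool. This is a minimal sketch, not from the answer above; it assumes the options are unique, and since random.choices only needs relative weights, no renormalization is required:
import random

def weighted_sample_without_replacement(options, weights, k, seed=None):
    rand = random.Random(seed)
    options, weights = list(options), list(weights)
    picked = []
    for _ in range(k):
        # draw one item, then drop it and its weight from the pool
        i = options.index(rand.choices(options, weights)[0])
        picked.append(options.pop(i))
        weights.pop(i)
    return picked

print(weighted_sample_without_replacement(
    [1, 2, 3, 4, 5, 6], [0.1, 0.2, 0.1, 0.05, 0.02, 0.06], k=4, seed=30))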

Related

Random Choice with different distributions for each sample

Imagine we want to randomly draw 0 or 1, n times, but each time we make a random choice we want to use a different probability distribution.
Considering np.random.choice:
a = [0, 1]  # or just 2
size = n
p = [[0.2, 0.8],
     [0.5, 0.5],
     [0.7, 0.3]]  # for instance, if n = 3
The problem is that p needs to be a 1-dimensional vector. How can we do something like this without having to call np.random.choice n different times?
The reason I need to avoid calling np.random.choice multiple times is that I want an output of size n that is reproducible from a single seed. However, if I call np.random.choice n times with a seed, the randomness is lost across the n calls.
What I need is the following:
s = sample(a, n, p)  # len(p) == n
print(s)
>>> [1, 0, 0]
Numpy has a way to get an array of random floats, between 0 and 1, like so:
>>> a = np.random.uniform(0, 1, size=3)
>>> a
array([0.41444637, 0.90898856, 0.85223613])
Then, you can compare those floats with the probabilities you want:
>>> p = np.array([0.01, 0.5, 1])
>>> (a < p).astype(int)
array([0, 0, 1])
(Note: p is the probability of a 1 value, for each element.)
Putting all of that together, you can write a function to do this:
import numpy as np

def sample(p):
    n = p.size
    a = np.random.uniform(0, 1, size=n)
    return (a < p).astype(int)
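For the reproducibility requirement in the question, one seed plus this single vectorized call is deterministic. A quick check (a sketch; the output shown assumes NumPy's legacy global RNG with seed 0, and happens to reproduce the [1, 0, 0] from the example):
np.random.seed(0)              # one seed, one vectorized call
p = np.array([0.8, 0.5, 0.3])  # per-element probability of drawing a 1
print(sample(p))               # array([1, 0, 0])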

Is there a numpy function to change values in an array by index without for loop?

I have written the code below:
size = int(len(np.where(features_temp != 0)[0]) * k_value)
idx = np.random.choice(np.where(np.in1d(features_temp, np.array(list_amino_numbers)))[0], size=size)
for i in idx:
    features_temp[i] = np.random.choice(list_amino_numbers, p=probabilities[:, list_amino_numbers.index(features_temp[i])].tolist())
This code works well, but I think it can run faster, mainly in the for loop. Is there some operation I can use to replace the for iteration?
Code explanation: I am trying to change the values of features_temp at the indexes holding values different from 0. Each index can be changed many times, and the number of possible changes depends on the number of nonzero values and a constant (it is stored in idx). In the end, each change depends on a matrix (probabilities), where line i and column j give the probability of value j being changed to value i (so I need to use the column values).
Input Example:
features_temp = np.array([3, 2, 0, 2, 1])
k_value = 1.5
list_amino_numbers = [1, 3, 2]
probabilities = np.array([[0.9, 0.2, 0.3], [0.07, 0.7, 0.5], [0.03, 0.1, 0.2]])
In this case, size = 6.
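No answer is shown in this excerpt, but one possible vectorized approach (an illustration, not from the thread) replaces the loop by inverting each column's CDF with a single batch of uniform draws. Note the caveat: it samples every replacement from the original features_temp values, so repeated entries in idx behave differently than in the sequential loop above:
import numpy as np

list_amino = np.array(list_amino_numbers)
# map each current value at idx to its column position in `probabilities`
pos = np.zeros(list_amino.max() + 1, dtype=int)
pos[list_amino] = np.arange(len(list_amino))
cols = pos[features_temp[idx]]
# per-column CDFs; each column of `probabilities` sums to 1
cdf = probabilities.cumsum(axis=0)
# one uniform draw per index, inverted against the matching CDF column
u = np.random.random(len(idx))
rows = (u[None, :] < cdf[:, cols]).argmax(axis=0)
features_temp[idx] = list_amino[rows]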

Index array using array of unique values

I have three arrays, such that:
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([1, 2, 3, 4, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])
I want to create a new array with the dimensions of Data_Arr, but where each element is taken from Value_Arr using the index position in ID_Arr. So far I have this in a loop, but it's very slow as my Data array is very large:
out = np.zeros_like(Data_Arr, dtype=float)
for i in range(len(Data_Arr)):
    out[i] = Value_Arr[ID_Arr == Data_Arr[i]]
Is there a more pythonic way of doing this that avoids the loop (it doesn't have to use numpy)?
Actual data looks like:
Data_Arr = [ 852116 852116 852116 ... 1001816 1001816 1001816]
ID_Arr = [ 852116 852117 852118 ... 1001814 1001815 1001816]
Value_Arr = [1.5547194 1.5547196 1.5547197 ... 1.5536859 1.5536858 1.5536857]
shapes are:
Data_Arr = (4021165,)
ID_Arr = (149701,)
Value_Arr = (149701,)
Since ID_Arr is sorted, we can directly use np.searchsorted and index Value_Arr with the result:
Value_Arr[np.searchsorted(ID_Arr, Data_Arr)]
array([0.1, 0.1, 0.1, 0.6, 0.6, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.8, 0.8,
0.2, 0.2, 0.2])
If ID_Arr isn't sorted (note: if there may be out-of-bounds indices, we should remove them first; see Divakar's answer below):
s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out = Value_Arr[s_ind[ss]]
Checking with the arrays suggested by alaniwi:
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([2, 1, 3, 4, 5])
Value_Arr = np.array([0.6, 0.1, 0.3, 0.8, 0.2])
out_op = np.zeros_like(Data_Arr, dtype=float)
for i in range(len(Data_Arr)):
    out_op[i] = Value_Arr[ID_Arr == Data_Arr[i]]
s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out_answer = Value_Arr[s_ind[ss]]
np.array_equal(out_op, out_answer)
#True
Based on the approaches from this post, here are the adaptations.
Approach #1
# https://stackoverflow.com/a/62658135/ #Divakar
a, b, invalid_specifier = ID_Arr, Data_Arr, 0
sidx = a.argsort()
idx = np.searchsorted(a, b, sorter=sidx)
# Remove out-of-bounds indices as they won't be matches
idx[idx == len(a)] = 0
# Get traced-back indices corresponding to the original version of a
idx0 = sidx[idx]
# Mask out invalid ones with invalid_specifier and return
out = np.where(a[idx0] == b, Value_Arr[idx0], invalid_specifier)
Approach #2
Lookup based -
# https://stackoverflow.com/a/62658135/ #Divakar
def find_indices_lookup(a, b, invalid_specifier=-1):
    # Set up an array in which we will assign ranged numbers
    N = max(a.max(), b.max()) + 1
    lookup = np.full(N, invalid_specifier)
    # Index into lookup with b to trace back the positions. Non-matching ones
    # keep invalid_specifier, since they were never assigned a ranged number
    lookup[a] = np.arange(len(a))
    indices = lookup[b]
    return indices
idx = find_indices_lookup(ID_Arr, Data_Arr)
out = np.where(idx != -1, Value_Arr[idx], 0)
Faster/simpler variant
And a simplified and hopefully faster version would be a direct lookup of values -
a, b, invalid_specifier = ID_Arr, Data_Arr, 0
N = max(a.max(), b.max()) + 1
lookup = np.zeros(N, dtype=Value_Arr.dtype)
lookup[ID_Arr] = Value_Arr
out = lookup[Data_Arr]
If all values from Data_Arr are guaranteed to be in ID_Arr, we can use np.empty in place of np.zeros for the array assignment and thus gain a further performance boost.
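For completeness, that variant is a one-line change (a sketch; only safe under the guarantee just stated, since unassigned np.empty slots hold garbage):
lookup = np.empty(N, dtype=Value_Arr.dtype)  # skips zero-initialization
lookup[ID_Arr] = Value_Arr                   # every Data_Arr value must hit an assigned slot
out = lookup[Data_Arr]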
Looks like you want:
out = Value_Arr[ID_Arr[Data_Arr - 1] - 1]
Note that the - 1 offsets are due to Python/NumPy's 0-based indexing (and note this relies on ID_Arr being exactly [1, 2, ..., 5], as in the example).

How to write a loop-free code to find the max in each region?

I have a vector which specifies a number of regions over 1 to N. For example, if
A = [1,2,3,6,7,9,10]
Then the regions are [1,3], [6,7], [9,10] defined over interval [1,10] with N=10. I have another vector with length N that contains a set of positive and negative numbers:
x = [0.8,0.1,1,-1,-2,-0.76,0.1,0.2,0.9,0.6]
I want to find the maximum value of x in each region. In this example, the result is:
y = [1,0.1,0.9]
y_locs = [3,7,9]
It is possible to compute the max in each region by first obtaining regions from A and then using a for loop to find the max in each region. Is there a loop-free way to do that?
You could slice your array and use the built-in max() function. Something like:
x = [0.8, 0.1, 1, -1, -2, -0.76, 0.1, 0.2, 0.9, 0.6]
# each tuple contains (start_index, length, maximum_value)
max_list = [(0, 3, max(x[0:3])), (5, 2, max(x[5:7])), (8, 2, max(x[8:]))]
# 1-based position of each region's maximum within the full list
locations_list = [start + x[start:start + length].index(value) + 1
                  for start, length, value in max_list]
print(max_list)
print(locations_list)
Yields:
[(0, 3, 1), (5, 2, 0.1), (8, 2, 0.9)]
[3, 7, 9]
Notes:
I did use a for loop to iterate over each section, but you could expand this by hand into three separate lines with no for loop (that would become very tedious for large data, though).
I don't know the internals of max(), and it may use a hidden for loop.
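For a genuinely loop-free version, NumPy's ufunc machinery can produce both the grouped max and its location. A sketch (assuming, as in the question, that A lists every 1-based index covered by the regions):
import numpy as np

A = np.array([1, 2, 3, 6, 7, 9, 10])
x = np.array([0.8, 0.1, 1, -1, -2, -0.76, 0.1, 0.2, 0.9, 0.6])

vals = x[A - 1]                                     # x restricted to the regions
breaks = np.r_[0, np.where(np.diff(A) > 1)[0] + 1]  # start of each contiguous run
y = np.maximum.reduceat(vals, breaks)               # max of each run

# locations: sort by (region, descending value), keep each region's first entry
region = np.cumsum(np.r_[0, np.diff(A) > 1])
order = np.lexsort((-vals, region))
firsts = np.r_[0, np.where(np.diff(region[order]) > 0)[0] + 1]
y_locs = A[order[firsts]]

print(y)       # [1.0, 0.1, 0.9]
print(y_locs)  # [3 7 9]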

How do I "randomly" select numbers with a specified bias toward a particular number

How do I generate random numbers with a specified bias toward one number? For example, how would I pick between two numbers, 1 and 2, with a 90% bias toward 1? The best I can come up with is...
import random
print(random.choice([1, 1, 1, 1, 1, 1, 1, 1, 1, 2]))
Is there a better way to do this? The method I showed works in simple examples, but eventually I'll have to do more complicated selections with very specific biases (such as 37.65%), which would require a very long list.
EDIT:
I should have added that I'm stuck on numpy 1.6 so I can't use numpy.random.choice.
np.random.choice has a p parameter which you can use to specify the probability of the choices:
np.random.choice([1,2], p=[0.9, 0.1])
The algorithm used by np.random.choice() is relatively simple to replicate if you only need to draw one item at a time.
import numpy as np
def simple_weighted_choice(choices, weights, prng=np.random):
    running_sum = np.cumsum(weights)
    u = prng.uniform(0.0, running_sum[-1])
    i = np.searchsorted(running_sum, u, side='left')
    return choices[i]
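For reproducible draws, a seeded generator can be passed in through the prng parameter (np.random.RandomState is used here as just one option):
prng = np.random.RandomState(30)
draws = [simple_weighted_choice([1, 2], [0.9, 0.1], prng) for _ in range(10)]
# roughly nine 1s for every 2, and the same sequence on every run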
For random sampling with replacement, the essential code in np.random.choice is
cdf = p.cumsum()
cdf /= cdf[-1]
uniform_samples = self.random_sample(shape)
idx = cdf.searchsorted(uniform_samples, side='right')
So we can use that in a new function that does the same thing (but without error checking and other niceties):
import numpy as np
def weighted_choice(values, p, size=1):
    values = np.asarray(values)
    cdf = np.asarray(p).cumsum()
    cdf /= cdf[-1]
    uniform_samples = np.random.random_sample(size)
    idx = cdf.searchsorted(uniform_samples, side='right')
    sample = values[idx]
    return sample
Examples:
In [113]: weighted_choice([1, 2], [0.9, 0.1], 20)
Out[113]: array([1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1])
In [114]: weighted_choice(['cat', 'dog', 'goldfish'], [0.3, 0.6, 0.1], 15)
Out[114]:
array(['cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'dog', 'dog',
'dog', 'dog', 'dog', 'goldfish', 'dog', 'dog'],
dtype='|S8')
Something like this should do the trick, working with arbitrary floating-point probabilities without creating an intermediate array:
import random

try:
    from itertools import accumulate  # Python 3.x
except ImportError:
    def accumulate(l):  # fallback for Python 2.x
        tmp = 0
        for n in l:
            tmp += n
            yield tmp

def random_choice(a, p):
    sums = sum(p)
    accum = accumulate(p)  # cumulative sums of the weights
    accum = [n / sums for n in accum]  # normalize into a CDF
    rnd = random.random()
    for i, item in enumerate(accum):
        if rnd < item:
            return a[i]
The easy part is getting the index into a probability table. Make a table of cumulative weights, for example:
prb = [0.5, 0.65, 0.8, 1]
Get index with something like this:
def get_in_range(prb, pointer):
    """Returns the index of the matching range in table prb"""
    found = 0
    for p in prb:
        if pointer > p:
            found += 1
    return found
The index returned by get_in_range can then be used to look up the corresponding entry in a table of values.
Example usage:
import random
values = [1, 2, 3]
weights = [0.95, 0.99, 1]
result = values[get_in_range(weights, random.random())]
This should choose 1 with probability 95%, 2 with 4%, and 3 with 1%.
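The same cumulative-table lookup can also be done in O(log n) with the standard library's bisect module (an alternative sketch, not part of the answer above):
import bisect
import random

def weighted_pick(values, cum_weights):
    # bisect_right finds the first cumulative weight strictly above the draw
    return values[bisect.bisect_right(cum_weights, random.random())]

values = [1, 2, 3]
cum_weights = [0.95, 0.99, 1.0]
picks = [weighted_pick(values, cum_weights) for _ in range(10)]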
