how to get the index of numpy.random.choice? - python

Is it possible to modify the numpy.random.choice function in order to make it return the index of the chosen element?
Basically, I want to create a list and select elements randomly without replacement
import numpy as np
>>> a = [1,4,1,3,3,2,1,4]
>>> np.random.choice(a)
4
>>> a
[1, 4, 1, 3, 3, 2, 1, 4]
a.remove(np.random.choice(a)) will remove the first element of the list with that value (a[1] in the example above), which may not be the element that was actually chosen (e.g. a[7]).

Regarding your first question, you can work the other way around: randomly choose from the indices of the array a and then fetch the value.
>>> a = [1,4,1,3,3,2,1,4]
>>> a = np.array(a)
>>> np.random.choice(np.arange(a.size))
6
>>> a[6]
1
But if you just need a random sample without replacement, replace=False will do. I can't remember when it was first added to np.random.choice; it might have been NumPy 1.7.0, so if you are running a very old numpy it may not work. Keep in mind that the default is replace=True.
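For instance, a minimal sketch of pulling distinct indices this way (the array contents and sample size are just for illustration):
import numpy as np

a = np.array([1, 4, 1, 3, 3, 2, 1, 4])
idx = np.random.choice(a.size, size=3, replace=False)  # 3 distinct indices
values = a[idx]  # the corresponding elements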

Here's one way to find out the index of a randomly selected element:
import random # plain random module, not numpy's
random.choice(list(enumerate(a)))[0]
=> 4 # just an example, index is 4
Or you could retrieve the element and the index in a single step:
random.choice(list(enumerate(a)))
=> (1, 4) # just an example, index is 1 and element is 4

numpy.random.choice(a, size=however_many, replace=False)
If you want a sample without replacement, just ask numpy to make you one. Don't loop and draw items repeatedly. That'll produce bloated code and horrible performance.
Example:
>>> a = numpy.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> numpy.random.choice(a, size=5, replace=False)
array([7, 5, 8, 6, 2])
On a sufficiently recent NumPy (at least 1.17), you should use the new randomness API, which fixes a longstanding performance issue where the old API's replace=False code path unnecessarily generated a complete permutation of the input under the hood:
rng = numpy.random.default_rng()
result = rng.choice(a, size=however_many, replace=False)
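And if it's the index you're after (as in the question title), the same Generator call accepts an int and samples from its arange; a minimal sketch (array and sample size are illustrative):
import numpy as np

rng = np.random.default_rng()
a = np.array([1, 4, 1, 3, 3, 2, 1, 4])
idx = rng.choice(a.size, size=3, replace=False)  # 3 distinct indices
print(idx, a[idx])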

This is a bit out of left field compared with the other answers, but I thought it might help with what it sounds like you're trying to do in a slightly larger sense. You can generate a random sample without replacement by shuffling the indices of the elements in the source array:
source = np.random.randint(0, 100, size=100) # generate a set to sample from
idx = np.arange(len(source))
np.random.shuffle(idx)
subsample = source[idx[:10]]
This will create a sample (here, of size 10) by drawing elements from the source set (here, of size 100) without replacement.
You can interact with the non-selected elements by using the remaining index values, i.e.:
notsampled = source[idx[10:]]
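As a side note, the arange-plus-shuffle pair can be collapsed into a single call with np.random.permutation, which returns a shuffled copy of the indices and leaves source untouched; a small sketch reusing the names above:
idx = np.random.permutation(len(source))  # shuffled copy of arange(len(source))
subsample = source[idx[:10]]
notsampled = source[idx[10:]]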

Maybe late, but it's worth mentioning this solution because I think the simplest way to do so is:
a = [1, 4, 1, 3, 3, 2, 1, 4]
n = len(a)
idx = np.random.choice(list(range(n)), p=np.ones(n)/n)
It means you are choosing from the indices uniformly. In a more general case, you can do a weighted sampling (and return the index) in this way:
probs = [.3, .4, .2, 0, .1]  # one probability per index
n = len(probs)
idx = np.random.choice(list(range(n)), p=probs)
If you repeat this many times (e.g. 1e5 draws), the normalized histogram of the chosen indices comes out to roughly [0.30126, 0.39817, 0.19986, 0., 0.10071] in this case, which matches the given probabilities.
Anyway, the point is to choose from the indices (with whatever probabilities you need) and then use the indices to fetch the values.
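A quick way to check that claim empirically (the draw count of 1e5 matches the text above) is to normalize a bincount of many draws:
import numpy as np

probs = [.3, .4, .2, 0, .1]
draws = np.random.choice(len(probs), size=100000, p=probs)
print(np.bincount(draws, minlength=len(probs)) / draws.size)
# e.g. [0.30126 0.39817 0.19986 0.      0.10071]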

Instead of using choice, you can also simply shuffle your array with np.random.shuffle, i.e.
np.random.shuffle(a) # will shuffle a in-place
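One way this helps with drawing without replacement, sketched under the assumption that you no longer need the original order: after the shuffle, peeling elements off the end draws them in random order:
import numpy as np

a = np.array([1, 4, 1, 3, 3, 2, 1, 4])
np.random.shuffle(a)       # shuffles a in place
picked, a = a[-1], a[:-1]  # 'draw' one element without replacement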

Based on your comment:
The sample is already a. I want to work directly with a so that I can control how many elements are still left and perform other operations with a. – HappyPy
it sounds to me like you're interested in working with a after n randomly selected elements are removed. Instead, why not work with N = len(a) - n randomly selected elements from a? Since you want them to still be in the original order, you can select from indices like in @CTZhu's answer, but then sort them and grab from the original list:
import numpy as np
n = 3 #number to 'remove'
a = np.array([1,4,1,3,3,2,1,4])
i = np.random.choice(np.arange(a.size), a.size-n, replace=False)
i.sort()
a[i]
#array([1, 4, 1, 3, 1])
So now you can save that as a again:
a = a[i]
and work with a with n elements removed.
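If you also want to look at the n elements that were dropped, one small sketch (it must run before a is reassigned; np.delete returns a copy and leaves a untouched):
removed = np.delete(a, i)  # the n elements whose indices are not in i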

Here is a simple solution: just choose from the range of indices.
import numpy as np
a = [100,400,100,300,300,200,100,400]
i = np.random.choice(np.arange(len(a)))
print('index is ' + str(i) + ' number is ' + str(a[i]))

The question title and its description differ a bit. I just wanted the answer to the title question, which was getting only an (integer) index out of numpy.random.choice(). Rather than any of the above, I settled on index = numpy.random.choice(len(array_or_whatever)) (tested in numpy 1.21.6).
Ex:
import numpy
a = [1, 2, 3, 4]
i = numpy.random.choice(len(a))
The problem I had with the other solutions was the unnecessary conversion to list, which recreates the entire collection in a new object (slow!).
Reference: https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html?highlight=choice#numpy.random.choice
Key point from the docs about the first parameter a:
a: 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if it were np.arange(a)
Since the question is very old, it's possible I'm coming at this from the convenience of newer versions supporting exactly what the OP and I wanted.
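The same call extends to drawing several distinct indices at once; a small sketch (the sample size is arbitrary):
import numpy
a = [1, 2, 3, 4]
idx = numpy.random.choice(len(a), size=2, replace=False)  # two distinct indices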

Related

Split sorted list into two lists

I'm trying to split a sorted integer list into two lists. The first list would have all ints under n and the second all ints over n. Note that n does not have to be in the original list.
I can easily do this with:
under = []
over = []
for x in sorted_list:
    if x < n:
        under.append(x)
    else:
        over.append(x)
But it just seems like it should be possible to do this in a more elegant way knowing that the list is sorted. takewhile and dropwhile from itertools sound like the solution but then I would be iterating over the list twice.
Functionally, the best I can do is this:
i = 0
while sorted_list[i] < n:
    i += 1
under = sorted_list[:i]
over = sorted_list[i:]
But I'm not even sure if it is actually better than just iterating over the list twice and it is definitely not more elegant.
I guess I'm looking for a way to get the list returned by takewhile and the remaining list, perhaps, in a pair.
The correct solution here is the bisect module. Use bisect.bisect to find the index to the right of n (or the index where it would be inserted if it's missing), then slice around that point:
import bisect # At top of file
split_idx = bisect.bisect(sorted_list, n)
under = sorted_list[:split_idx]
over = sorted_list[split_idx:]
While any solution is going to be O(n) (you do have to copy the elements after all), the comparisons are typically more expensive than simple pointer copies (and associated reference count updates), and bisect reduces the comparison work on a sorted list to O(log n), so this will typically (on larger inputs) beat simply iterating and copying element by element until you find the split point.
Use bisect.bisect_left (which finds the leftmost index of n) instead of bisect.bisect (equivalent to bisect.bisect_right) if you want n to end up in over instead of under.
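A quick runnable check of the split (the list and n are chosen for illustration):
import bisect

sorted_list = [1, 2, 4, 5, 6, 7, 8]
n = 5
split_idx = bisect.bisect(sorted_list, n)  # -> 4
under = sorted_list[:split_idx]            # [1, 2, 4, 5]; n lands in under
over = sorted_list[split_idx:]             # [6, 7, 8]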
I would use following approach, where I find the index and use slicing to create under and over:
sorted_list = [1,2,4,5,6,7,8]
n=6
idx = sorted_list.index(n)
under = sorted_list[:idx]
over = sorted_list[idx:]
print(under)
print(over)
Output (same as with your code):
[1, 2, 4, 5]
[6, 7, 8]
Edit: As I understood the question wrong, here is an adapted solution that finds the nearest index via np.searchsorted:
import numpy as np
sorted_list = [1,2,4,5,6,7,8]
n=3
idx = np.searchsorted(sorted_list, n)
under = sorted_list[:idx]
over = sorted_list[idx:]
print(under)
print(over)
Output:
[1, 2]
[4, 5, 6, 7, 8]

How to filter two numpy arrays?

Edit: I fixed y so that x,y have the same length
I don't understand much about programming, but I have a giant mass of data to analyze and it has to be done in Python.
Say I have two arrays:
import numpy as np
x=np.array([1,2,3,4,5,6,7,8,9,10])
y=np.array([25,18,16,19,30,5,9,20,80,45])
and say I want to choose the values in y which are greater than 17, and keep only the values in x which have the same indices as the remaining values in y. For example, I want to erase the first value of y (25) and accordingly the matching value in x (1).
I tried this:
filter=np.where(y>17, 0, y)
but I don't know how to filter the x values accordingly (the actual data are much longer arrays, so doing it "by hand" is basically impossible).
Solution: using @mozway's tip, now that x, y have the same length the needed code is:
import numpy as np
x=np.array([1,2,3,4,5,6,7,8,9,10])
y=np.array([25,18,16,19,30,5,9,20,80,45])
x_filtered=x[y>17]
As your question is not fully clear and you did not provide the expected output, here are two possibilities:
filtering
Numpy arrays can be sliced by an array (iterable) of booleans.
If the two arrays were the same length you could do:
x[y>17]
Here, x is longer than y, so we first need to make it the same length:
import numpy as np
x=np.array([1,2,3,4,5,6,7,8,9,10])
y=np.array([25,18,16,19,30,5,9,20])
x[:len(y)][y>17]
Output: array([1, 2, 4, 5, 8])
replacement
To select between x and y based on a condition, use where:
np.where(y>17, x[:len(y)], y)
Output:
array([ 1, 2, 16, 4, 5, 5, 9, 8])
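As a follow-up to the filtering approach above: if you want both filtered arrays rather than just x, the same boolean mask can be reused; a small sketch building on the arrays above:
mask = y > 17
x_kept = x[:len(y)][mask]  # array([1, 2, 4, 5, 8])
y_kept = y[mask]           # array([25, 18, 19, 30, 20])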
As someone with little experience in Numpy specifically, I wrote this answer before seeing @mozway's excellent answer for filtering. My answer works on more generic containers than Numpy's arrays, though it uses more concepts as a result. I'll attempt to explain each concept in enough detail for the answer to make sense.
TL;DR:
Please, definitely read the rest of the answer, it'll help you understand what's going on.
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array([25,18,16,19,30,5,9,20])
filtered_x_list = []
filtered_y_list = []
for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])
        filtered_x_list.append(x[i])
filtered_x = np.array(filtered_x_list)
filtered_y = np.array(filtered_y_list)
# These lines are just for us to see what happened
print(filtered_x) # prints [1 2 4 5 8]
print(filtered_y) # prints [25 18 19 30 20]
Pre-requisite Knowledge
Python containers (lists, arrays, and a bunch of other stuff I won't get into)
Let's take a look at the line:
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
What's Python doing?
The first thing it's doing is creating a list:
[1, 2, 3] # and so on
Lists in Python have a few features that are useful for us in this solution:
Accessing elements:
x_list = [ 1, 2, 3 ]
print(x_list[0]) # prints 1
print(x_list[1]) # prints 2, and so on
Adding elements to the end:
x_list = [ 1, 2, 3 ]
x_list.append(4)
print(x_list) # prints [1, 2, 3, 4]
Iteration:
x_list = [ 1, 2, 3 ]
for x in x_list:
    print(x)
# prints:
# 1
# 2
# 3
Numpy arrays are slightly different: we can still access and iterate over their elements (and even overwrite individual values), but once they're created their size is fixed - they have no .append, and we can't insert or delete elements the way we can with lists.
So the filtered_x_list and the filtered_y_list are empty lists we're creating, but we're going to modify them by adding the values we care about to the end.
The second thing Python is doing is creating a numpy array, using the list to define its contents. The array constructor can take a list expressed as [...], or a list defined by x_list = [...], which we're going to take advantage of later.
A little more on iteration
In your question, for every x element, there is a corresponding y element. We want to test something for each y element, then act on the corresponding x element, too.
Since we can access the same element in both arrays using an index - x[0], for instance - instead of iterating over one list or the other, we can iterate over all indices needed to access the lists.
First, we need to figure out how many indices we're going to need, which is just the length of the lists. len(x) lets us do that - in this case, it returns 10.
What if x and y are different lengths? In this case, I chose the smallest of the two - first, do len(x) and len(y), then pass those to the min() function, which is what min(len(x), len(y)) in the code above means.
Finally, we want to actually iterate through the indices, starting at 0 and ending at len(x) - 1 or len(y) - 1, whichever is smallest. The range sequence lets us do exactly that:
for i in range(10):
    print(i)
# prints:
# 0
# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9
So range(min(len(x), len(y))) gets us the indices to iterate over, and now this line makes sense:
for i in range(min(len(x), len(y))):
Inside this for loop, i now gives us an index we can use for both x and y.
Now, we can do the comparison in our for loop:
for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])
Then, including xs for the corresponding ys is a simple case of just appending the same x value to the x list:
for i in range(min(len(x), len(y))):
    if y[i] > 17:
        filtered_y_list.append(y[i])
        filtered_x_list.append(x[i])
The filtered lists now contain the numbers you're after. The last two lines, outside the for loop, just create numpy arrays from the results:
filtered_x = np.array(filtered_x_list)
filtered_y = np.array(filtered_y_list)
Which you might want to do, if certain numpy functions expect arrays.
While there are, in my opinion, better ways to do this (I would probably write custom iterators that produce the intended results without creating new lists), they require a somewhat more advanced understanding of programming, so I opted for something simpler.

Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to be doing M/2 comparisons for each of those N elements. Plug in some big numbers, and NM/2 gets pretty big—e.g., 1 million values, half of them unique, and you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
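For reference, a simplified version of that recipe could look like this (a sketch, not the full recipe from the docs):
def unique_everseen(iterable, key=None):
    # Yield elements in first-seen order, skipping any whose key was already seen.
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element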
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = set(next(zip(*unique_everseen(enumerate(a), key=itemgetter(1)))))
a = [a[index] for index in sorted(indices)]
b = [b[index] for index in sorted(indices)]
c = [c[index] for index in sorted(indices)]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
How about this - basically get a set of all unique elements of A, and then get their indices, and create a new list based on these indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]

Inserting and removing into/from sorted list in Python

I have a sorted list of integers, L, and I have a value X that I wish to insert into the list such that L's order is maintained. Similarly, I wish to quickly find and remove the first instance of X.
Questions:
How do I use the bisect module to do the first part, if possible?
Is L.remove(X) going to be the most efficient way to do the second part? Does Python detect that the list has been sorted and automatically use a logarithmic removal process?
Example code attempts:
i = bisect_left(L, y)
L.pop(i) #works
del L[bisect_left(L, i)] #doesn't work if I use this instead of pop
You use the bisect.insort() function:
bisect.insort(L, X)
L.remove(X) will scan the whole list until it finds X. Use del L[bisect.bisect_left(L, X)] instead (provided that X is indeed in L).
Note that removing from the middle of a list is still going to incur a cost as the elements from that position onwards all have to be shifted left one step. A binary tree might be a better solution if that is going to be a performance bottleneck.
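Putting both operations together, a short worked example:
import bisect

L = [1, 3, 3, 5, 8]
bisect.insort(L, 4)              # L is now [1, 3, 3, 4, 5, 8]
del L[bisect.bisect_left(L, 3)]  # removes the first 3: [1, 3, 4, 5, 8]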
You could use Raymond Hettinger's IndexableSkiplist. It performs 3 operations in O(ln n) time:
insert value
remove value
lookup value by rank
import skiplist
import random
random.seed(2013)
N = 10
skip = skiplist.IndexableSkiplist(N)
data = list(range(N))  # list() so it can be shuffled in place on Python 3
random.shuffle(data)
for num in data:
    skip.insert(num)
print(list(skip))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
for num in data[:N//2]:
    skip.remove(num)
print(list(skip))
# [0, 3, 4, 6, 9]

how to match two numpy array of unequal length?

I have two 1D numpy arrays. The lengths are unequal. I want to make pairs (array1_element, array2_element) of the elements which are close to each other. Let's consider the following example:
a = [1,2,3,8,20,23]
b = [1,2,3,5,7,21,25]
The expected result is
[(1,1),
(2,2),
(3,3),
(8,7),
(20,21),
(23,25)]
It is important to note that 5 is left alone. This could easily be done with loops, but I have very large arrays. I considered using nearest neighbour, but that felt like killing a sparrow with a cannon.
Can anybody please suggest an elegant solution?
Thanks a lot.
How about using the Needleman-Wunsch algorithm? :)
The scoring matrix would be trivial, as the "distance" between two numbers is just their difference.
But that will probably feel like killing a sparrow with a tank ...
You could use the built-in map function to vectorize a function that does this. For example:
import numpy as np

ar1 = np.array([1,2,3,8,20,23])
ar2 = np.array([1,2,3,5,7,21,35])

def closest(ar1, ar2, i):
    x = np.abs(ar1[i] - ar2)        # distances from ar1[i] to every element of ar2
    index = np.where(x == x.min())  # position(s) of the closest element
    value = ar2[index]
    return value

def find(x):
    return closest(ar1, ar2, x)

c = np.array(list(map(find, range(ar1.shape[0]))))  # list() needed on Python 3
In the example above, it looked like you wanted to exclude values once they had been paired. In that case, you could include a removal process in the first function like this, but be very careful about how array 1 is sorted:
def closest(ar1, ar2, i):
    x = np.abs(ar1[i] - ar2)
    index = np.where(x == x.min())
    value = ar2[index]
    ar2[ar2 == value] = -10000000  # mark the paired value so it can't be picked again
    return value
The best method I can think of is to use a loop. If looping in Python is too slow, you can use Cython to speed up your code.
I think one can do it like this:
create two new structured arrays, such that there is a second index which is 0 or 1 indicating to which array the value belongs, i.e. the key
concatenate both arrays
sort the united array along the first field (the values)
use 2 stacks: go through the array putting elements with key 1 on the left stack, and when you cross an element with key 0, put them in the right stack. When you reach the second element with key 0, for the first with key 0 check the top and bottom of the left and right stacks and take the closest value (maybe with a maximum distance), switch stacks and continue.
The sort should be the slowest step, and the maximum total space for the stacks is n or m.
You can do the following:
import numpy as np

a = np.array([1,2,3,8,20,23])
b = np.array([1,2,3,5,7,21,25])

def find_closest(a, sorted_b):
    # index of the interval, between midpoints of consecutive b values, that each element of a falls into
    j = np.searchsorted(.5*(sorted_b[1:] + sorted_b[:-1]), a, side='right')
    return sorted_b[j]

b.sort() # or, b = np.sort(b), if you don't want to modify b in-place
print(np.c_[a, find_closest(a, b)])
# ->
# array([[ 1, 1],
# [ 2, 2],
# [ 3, 3],
# [ 8, 7],
# [20, 21],
# [23, 25]])
This should be pretty fast. How it works: for each number in a, searchsorted finds the index into b just past the midpoint between two consecutive sorted values of b, i.e. the index of the closest number.
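To see the midpoint trick concretely, a small check using the sorted b from the code above:
mid = .5 * (b[1:] + b[:-1])                # [ 1.5  2.5  4.   6.  14.  23. ]
j = np.searchsorted(mid, 8, side='right')  # -> 4
print(b[j])                                # 7, the value in b closest to 8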
