Checking whether elements are in a set in numpy - python

I am new to numpy, so any help is appreciated. I am curious how arrays handle Python sets.
This is my code, but it doesn't work as expected. I am trying to filter elements that are not in my set:
new_mask = np.where(np.isin(mask, my_set), 1, 0)
To my understanding, searching a set is more efficient than searching a list because of hashing, yet I found the following in the docs. I am wondering why it doesn't work?
Because of how array handles sets, the following does not work as expected:
>>> test_set = {1, 2, 4, 8}
>>> np.isin(element, test_set)
array([[False, False],
       [False, False]])
Casting the set to a list gives the expected result:
>>> np.isin(element, list(test_set))
array([[False,  True],
       [ True, False]])
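The short explanation: np.isin converts its second argument to an array, and converting a set yields a zero-dimensional object array containing the set itself rather than its members, so no element ever matches. A minimal sketch of the fix for the code in the question, assuming some illustrative values for mask (mask and my_set are the question's own names):
import numpy as np

mask = np.array([[1, 3], [4, 8]])  # illustrative data, not from the question
my_set = {1, 2, 4, 8}

# Cast the set to a list so np.isin sees its members:
new_mask = np.where(np.isin(mask, list(my_set)), 1, 0)
# array([[1, 0],
#        [1, 1]])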

Related

Get np.all() of varying length ranges within a bool array using numpy

I have the following numpy arrays:
selected = [True, False, False, True, True, True, True, True, False, False, False, True]
start_index = [0, 3, 5, 8, 10]
end_index = [3, 5, 8, 10, 12]  # End index itself is not included in the defined ranges
and I would like to get the following result:
result = [False, True, True, False, False]
In other words, I would like to get the equivalent of this code using numpy:
result = []
for idx in range(0, 5):
    result.append(np.all(selected[start_index[idx]:end_index[idx]]))
The difficulty is that the ranges are of different lengths, so I cannot just reshape the selected array and use np.all() on each row.
The answer from Mechanic Pig was exactly what I needed; posting it here as an answer for better visibility.
In my situation the different ranges were successive slices of the selected array, and in that specific case this one-liner was what I was looking for:
np.logical_and.reduceat(selected, start_index)
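As a quick check, here is that one-liner run on the question's data; np.logical_and.reduceat reduces each slice between consecutive start indices (and from the last index to the end of the array), which matches the ranges here because they are contiguous:
import numpy as np

selected = np.array([True, False, False, True, True, True, True, True,
                     False, False, False, True])
start_index = np.array([0, 3, 5, 8, 10])

result = np.logical_and.reduceat(selected, start_index)
# array([False,  True,  True, False, False])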

Is it possible to vectorize this numpy array comparison?

I have these two numpy arrays in Python:
a = np.array(sorted(np.random.rand(6)*6)) # It is sorted.
b = np.array(np.random.rand(3)*6)
Say that the arrays are
a = array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = array([0.06231266, 1.64276013, 5.22786201])
I want to produce an array containing, for each element in b, the index of the last element of a that is smaller than it, i.e. I want exactly this:
np.argmin(np.array([a<b_i for b_i in b]),1)-1
which produces array([-1, 1, 4]) meaning that b[0]<a[0], a[1]<b[1]<a[2] and a[4]<b[2]<a[5].
Is there any native numpy fast vectorized way of doing this avoiding the for loop?
To answer your specific question, i.e., a vectorized way to get the equivalent of np.array([a < b_i for b_i in b]), you can take advantage of broadcasting. Here, you could use:
a[None, ...] < b[..., None]
So:
>>> a[None, ...] < b[..., None]
array([[False, False, False, False, False, False],
       [ True,  True, False, False, False, False],
       [ True,  True,  True,  True,  True, False]])
Importantly, for broadcasting:
>>> a[None, ...].shape, b[..., None].shape
((1, 6), (3, 1))
Here's the link to the official numpy docs to understand broadcasting: https://numpy.org/doc/stable/user/basics.broadcasting.html. Some relevant tidbits:
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when
- they are equal, or
- one of them is 1
...
When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or "copied" to match the other.
Edit
As noted in the comments under your question, an entirely different approach is much better algorithmically than your brute-force solution: take advantage of binary search, using np.searchsorted.
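A minimal sketch of that approach on the example arrays above; the insertion index returned by np.searchsorted, minus one, is the index of the last element of a smaller than each b value, which matches the brute-force result:
import numpy as np

a = np.array([0.27148588, 0.42828064, 2.48130785, 4.01811243, 4.79403723, 5.46398145])
b = np.array([0.06231266, 1.64276013, 5.22786201])

idx = np.searchsorted(a, b) - 1
# array([-1,  1,  4])  -- binary search: O(len(b)*log(len(a))) instead of O(len(b)*len(a))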

too many indices for array when using np.where

I have the code:
a = b = np.arange(9).reshape(3, 3)
c = np.zeros(3)
for x in range(3):
    c[x] = np.average(b[np.where(a < x + 3)])
The output of c is
array([1. , 1.5, 2. ])
Instead of the for loop, I want to vectorize this with arrays, so I tried the following code:
a = b = np.arange(9).reshape(3, 3)
c = np.zeros(3)
i = np.arange(3)
c[i] = np.average(b[np.where(a < i[:, None, None] + 3)])
But it shows IndexError: too many indices for array
As for a < i[:, None, None] + 3, it correctly shows
array([[[ True,  True,  True],
        [False, False, False],
        [False, False, False]],

       [[ True,  True,  True],
        [ True, False, False],
        [False, False, False]],

       [[ True,  True,  True],
        [ True,  True, False],
        [False, False, False]]], dtype=bool)
But when I use b[np.where(a<i[:,None,None]+3)], it again shows IndexError: too many indices for array. I cannot get the correct output of c.
I sense you are trying to vectorize things here, though you don't say so explicitly. Now, I don't think you can index like that in a vectorized manner. To solve your question in a vectorized manner, I would suggest a more efficient way to get the sum-reduction with matrix multiplication using np.tensordot, with help from broadcasting as you had already set out in your attempts.
Thus, one solution would be -
from __future__ import division
i = np.arange(3)
mask = a < i[:, None, None] + 3
c = np.tensordot(b, mask, axes=((0, 1), (1, 2))) / mask.sum((1, 2))
Related post to understand tensordot.
Possible improvements on performance:
- Convert the mask to float dtype before feeding it to np.tensordot, as the BLAS-based matrix multiplication underneath is faster with floats.
- Use np.count_nonzero instead of np.sum for counting booleans, i.e. to replace the mask.sum() part.
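A minimal sketch applying both suggestions to the solution above (np.count_nonzero accepts an axis argument from NumPy 1.12 onward):
import numpy as np

a = b = np.arange(9).reshape(3, 3)
i = np.arange(3)
mask = a < i[:, None, None] + 3

# Float mask so the reduction goes through fast BLAS matrix multiplication:
sums = np.tensordot(b, mask.astype(np.float64), axes=((0, 1), (1, 2)))
# count_nonzero is faster than sum for counting True values:
counts = np.count_nonzero(mask, axis=(1, 2))
c = sums / counts
# array([1. , 1.5, 2. ])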

Find indices in an array that contain one of values from another array

How do I get the indices of values in an array (a) that match any of the values in another array (label) with more than one "marker"? For example, given
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
the goal is to find the indices of a with the value of 1 or 2; that is, 0, 1, 2, 3.
I tried several combinations. None of the following seems to work.
label = array([1, 2])
a = array([1, 1, 2, 2, 3, 3])
idx = where(a==label) # gives me only the index of the last value in label
idx = where(a==label[0] or label[1]) # Is confused by all or any?
idx = where(a==label[0] | label[1]) # gives me results as if nor. idx = [4,5]
idx = where(a==label[0] || label[1]) # syntax error
idx = where(a==bolean.or(label,0,1) # I know, this is not the correct form but I don't remember it correctly, but I remember the error: also asks for a.all or a.any
idx = where(label[0] or label[1] in a) # gives me only the first appearance. index = 0. Also without where().
idx = where(a==label[0] or a==label[1]).all()) # syntax error
idx = where(a.any(0,label[0] or label[1])) # gives me only the first appearance. index=0. Also without where().
idx = where(a.any(0,label[0] | label[1])) # gives me only the first appearance. index=0. Also without where().
idx=where(a.any(0,label)) # Datatype not understood
Ok, I think you get my problem. Does anyone know how to do it correctly? Best would be a solution with a general label instead of label[x] so that the use of label is more variable for later changes.
You can use numpy.in1d:
>>> a = numpy.array([1, 1, 2, 2, 3, 3])
>>> label = numpy.array([1, 2])
>>> numpy.in1d(a, label)
array([ True, True, True, True, False, False], dtype=bool)
The above returns a mask. If you want indices, you can call numpy.nonzero on the mask array.
Also, if the values in label array are unique, you can pass assume_unique=True to in1d to possibly speed it up.
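Putting those two steps together, a minimal sketch on the question's data:
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

mask = np.in1d(a, label)  # optionally assume_unique=True, per the note above
idx = np.nonzero(mask)[0]
# array([0, 1, 2, 3])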
np.where(a==label) is the same as np.nonzero(a==label). It tells us the coordinates (indexes) of all non-zero (or True) elements in the array a==label.
So instead of trying all these different where expressions, focus on the conditional array.
Without the where, here's what some of your expressions produce:
In [40]: a==label # 2 arrays don't match in size, scalar False
Out[40]: False
In [41]: a==label[0] # result is the size of a
Out[41]: array([ True, True, False, False, False, False], dtype=bool)
In [42]: a==label[0] or label[1] # or is a Python scalar operation
...
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
In [43]: a==label[0] | label[1]
Out[43]: array([False, False, False, False, True, True], dtype=bool)
This last is the same as a==(label[0] | label[1]); the | is evaluated before the ==, and 1 | 2 is 3, which is why it returned the indices of the 3s.
You need to understand how each of those arrays (or scalar or error) are produced before you understand what where gives you.
Correct combination of 2 equality tests (the extra () are important):
In [44]: (a==label[1]) | (a==label[0])
Out[44]: array([ True, True, True, True, False, False], dtype=bool)
Using broadcasting to separately test the 2 elements of label. Result is 2d array:
In [45]: a==label[:,None]
Out[45]:
array([[ True,  True, False, False, False, False],
       [False, False,  True,  True, False, False]], dtype=bool)
In [47]: (a==label[:,None]).any(axis=0)
Out[47]: array([ True, True, True, True, False, False], dtype=bool)
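To turn that combined mask back into the indices the question asks for, a minimal sketch using np.nonzero:
import numpy as np

a = np.array([1, 1, 2, 2, 3, 3])
label = np.array([1, 2])

idx = np.nonzero((a == label[:, None]).any(axis=0))[0]
# array([0, 1, 2, 3])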
As I understand it, you want the indices of 1 and 2 in array "a".
In that case, try
label = [1, 2]
a = [1, 1, 2, 2, 3, 3]
idx_list = []
for x in label:
    for i in range(len(a)):  # range over the full length so the last element is included
        if a[i] == x:
            idx_list.append(i)
I think what I'm reading as your intent is to get the indices in the second list, 'a', of the values in the first list, 'labels'. I think a dictionary is a good way to store this information, where the labels are the keys and the indices are the values.
Try this:
labels = [1, 2]
a = [1, 1, 2, 2, 3, 3]
results = {}
for label in labels:
    results[label] = [i for i, x in enumerate(a) if x == label]
If you want the indices of 1, just call results[1]. The list comprehension and the enumerate function are the real MVPs here.

Find where a NumPy array is equal to any value in a list of values

I have an array of integers and want to find where that array is equal to any value in a list of multiple values.
This can easily be done by treating each value individually, or by using multiple "or" statements in a loop, but I feel like there must be a better/faster way to do it. I'm actually dealing with arrays of size 4000 x 2000, but here is a simplified version of the problem:
fake = arange(9).reshape((3, 3))
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
want = (fake==0) + (fake==2) + (fake==6) + (fake==8)
print want
array([[ True, False,  True],
       [False, False, False],
       [ True, False,  True]], dtype=bool)
What I would like is a way to get want from a single command involving fake and the list of values [0, 2, 6, 8].
I'm assuming there is a package that has this included already that would be significantly faster than if I just wrote a function with a loop in Python.
The function numpy.in1d seems to do what you want. The only problems is that it only works on 1d arrays, so you should use it like this:
In [9]: np.in1d(fake, [0,2,6,8]).reshape(fake.shape)
Out[9]:
array([[ True, False,  True],
       [False, False, False],
       [ True, False,  True]], dtype=bool)
I have no clue why this is limited to 1d arrays only. Looking at its source code, it first seems to flatten the two arrays, after which it does some clever sorting tricks. But nothing would stop it from unflattening the result at the end again, like I had to do by hand here.
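That flatten-and-reshape dance is easy to package up yourself; a minimal sketch, with isin_nd as a hypothetical helper name:
import numpy as np

def isin_nd(arr, values):
    # Flatten for in1d, then restore the original shape.
    return np.in1d(arr.ravel(), values).reshape(arr.shape)

fake = np.arange(9).reshape((3, 3))
isin_nd(fake, [0, 2, 6, 8])
# array([[ True, False,  True],
#        [False, False, False],
#        [ True, False,  True]])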
NumPy 1.13+
As of NumPy v1.13, you can use np.isin, which works on multi-dimensional arrays:
>>> element = 2*np.arange(4).reshape((2, 2))
>>> element
array([[0, 2],
       [4, 6]])
>>> test_elements = [1, 2, 4, 8]
>>> mask = np.isin(element, test_elements)
>>> mask
array([[False,  True],
       [ True, False]])
NumPy pre-1.13
The accepted answer with np.in1d works only with 1d arrays and requires reshaping for the desired result. This approach is fine for versions of NumPy before v1.13.
@Bas's answer is the one you're probably looking for. But here's another way to do it, using numpy's vectorize trick:
import numpy as np
S = set([0,2,6,8])
@np.vectorize
def contained(x):
    return x in S
contained(fake)
=> array([[ True, False,  True],
          [False, False, False],
          [ True, False,  True]], dtype=bool)
The con of this solution is that contained() is called for each element (i.e. in python-space), which makes this much slower than a pure-numpy solution.
