Unexpected result from boolean mask slicing - python

I'm confused about the way numpy array slicing is working in the example below. I can't figure out how exactly the slicing is working and would appreciate an explanation.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
# Test 1 - Expected behaviour
print(arr[m])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 2 - Expected behaviour
print(arr[m,:])
Out:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
# Test 3 - Expected behaviour
print(arr[:,m])
Out:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
### What's going on here? ###
# Test 4
print(arr[m,m])
Out:
array([ 6, 11]) # <--- diagonal components. I expected [[6,7],[10,11]].
I found that I could achieve the desired result with arr[:,m][m]. But I'm still curious about how this works.

You can use broadcasting (an outer product of the mask with itself) to create a 2d mask.
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
m = [False,True,True,False]
mask2d = np.array([m]).T * m
print(arr[mask2d])
Output :
[ 6 7 10 11]
Alternatively, you can keep the output in 2d form as a masked array.
print(np.ma.masked_array(arr, ~mask2d))
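If you prefer a plain 2d block over a masked array, a minimal sketch (assuming, as here, the same 1d mask is applied to rows and columns, so the kept elements form a rectangle):
import numpy as np
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
m = np.array([False, True, True, False])
mask2d = m[:, None] & m                         # same 2d mask via broadcasting
block = arr[mask2d].reshape(m.sum(), m.sum())   # reshape the raveled selection
print(block)
# [[ 6  7]
#  [10 11]]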

It's just the way indexing works for numpy arrays. Usually if you have specific "slices" of rows and columns you want to select you just do:
import numpy as np
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12],
[13,14,15,16]
])
# You want to check out rows 2-3 cols 2-3
print(arr[2:4,2:4])
Out:
[[11 12]
[15 16]]
Now say you want to select arbitrary combinations of specific row and column indices, for example you want row0-col2 and row2-col3
print(arr[[0, 2], [2, 3]])
Out:
[ 3 12]
What you are doing is identical to the above. [m,m] is equivalent to:
[m,m] == [[False,True,True,False], [False,True,True,False]]
Which is in turn equivalent to saying you want row1-col1 and row2-col2
print(arr[[1, 2], [1, 2]])
Out:
[ 6 11]
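If the block the question expected is what you want, np.ix_ builds the cross product of the two masks for you (a sketch; np.ix_ treats boolean inputs as their nonzero indices):
import numpy as np
arr = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
m = [False, True, True, False]
# np.ix_ returns a (2,1) and a (1,2) integer index that broadcast together,
# selecting every row/column combination rather than element-wise pairs.
print(arr[np.ix_(m, m)])
# [[ 6  7]
#  [10 11]]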

I don't know why, but this is the way numpy treats slicing by a tuple of 1d boolean arrays:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
# Pseudocode for what NumPy does
# def arr[m1, m2]:
#     intm1 = np.transpose(np.argwhere(m1))  # [True, False, True] -> [0,2]
#     intm2 = np.transpose(np.argwhere(m2))  # [False, False, True, True] -> [2,3]
#     return arr[intm1, intm2]               # arr[[0,2],[2,3]]
print(arr[m1,m2]) # --> [3 12]
What I was expecting was slicing behaviour over non-contiguous segments of the array. Selecting the intersection of rows and columns can be achieved with:
arr = np.array([
[1,2,3,4],
[5,6,7,8],
[9,10,11,12]
])
m1 = [True, False, True]
m2 = [False, False, True, True]
def row_col_select(arr, *ms):
    n = arr.ndim
    assert(len(ms) == n)
    # Accumulate a full boolean mask which will have the shape of `arr`
    accum_mask = np.reshape(True, (1,) * n)
    for i in range(n):
        shape = tuple([1]*i + [arr.shape[i]] + [1]*(n-i-1))
        m = np.reshape(ms[i], shape)
        accum_mask = np.logical_and(accum_mask, m)
    # Select `arr` according to the full boolean mask.
    # The boolean mask is the multiplication of the boolean arrays across
    # each corresponding dimension. E.g. for m1 and m2 above it is:
    #   m1:   |  m2: False False True  True
    #         |
    #   True  |  [[False False True  True]
    #   False |   [False False False False]
    #   True  |   [False False True  True]]
    return arr[accum_mask]
print(row_col_select(arr,m1,m2)) # --> [ 3 4 11 12]
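For the common 2d case, the same accumulated mask can be built in one call with the outer version of logical_and; a short sketch reusing arr, m1 and m2 from above:
mask = np.logical_and.outer(m1, m2)   # shape (3, 4): True where both masks are True
print(arr[mask])                      # --> [ 3  4 11 12], raveled like row_col_select
print(arr[mask].reshape(np.count_nonzero(m1), np.count_nonzero(m2)))
# [[ 3  4]
#  [11 12]]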

In [55]: arr = np.array([
...: [1,2,3,4],
...: [5,6,7,8],
...: [9,10,11,12],
...: [13,14,15,16]
...: ])
...: m = [False,True,True,False]
In all your examples we can use this m1 instead of the boolean list:
In [58]: m1 = np.where(m)[0]
In [59]: m1
Out[59]: array([1, 2])
If m was a 2d array like arr then we could use it to select elements from arr - but they would be raveled. When used to select along one dimension, though, the equivalent integer array index is clearer. Yes, we could use np.array([2,1]) or np.array([2,1,1,2]) to select rows in a different order or even multiple times. But substituting m1 for m does not lose any information or control.
Select rows, or columns:
In [60]: arr[m1]
Out[60]:
array([[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
In [61]: arr[:,m1]
Out[61]:
array([[ 2, 3],
[ 6, 7],
[10, 11],
[14, 15]])
With 2 arrays, we get 2 elements, arr[1,1] and arr[2,2].
In [62]: arr[m1, m1]
Out[62]: array([ 6, 11])
Note that in MATLAB we have to use sub2ind to do the same thing. What's easy in numpy is a bit harder in MATLAB; for blocks it's the other way.
To get a block, we have to create a column array to broadcast with the row one:
In [63]: arr[m1[:,None], m1]
Out[63]:
array([[ 6, 7],
[10, 11]])
If that's too hard to remember, np.ix_ can do it for us:
In [64]: np.ix_(m1,m1)
Out[64]:
(array([[1],
[2]]),
array([[1, 2]]))
[63] uses the same mechanism as [62]; the difference is that the 2 arrays broadcast differently. It's the same broadcasting as done in these additions:
In [65]: m1+m1
Out[65]: array([2, 4])
In [66]: m1[:,None]+m1
Out[66]:
array([[2, 3],
[3, 4]])
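To see exactly which index pairs that broadcasting produces, np.broadcast_arrays can spell them out; a quick sketch with the m1 from above:
rows, cols = np.broadcast_arrays(m1[:, None], m1)
print(rows)              # [[1 1]
                         #  [2 2]]
print(cols)              # [[1 2]
                         #  [1 2]]
print(arr[rows, cols])   # each (row, col) pair selected element by element
# [[ 6  7]
#  [10 11]]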
This indexing behavior is perfectly consistent - provided we don't import expectations from other languages.
I used m1 because boolean arrays don't broadcast, as shown below:
In [67]: np.array(m)
Out[67]: array([False, True, True, False])
In [68]: np.array(m)[:,None]
Out[68]:
array([[False],
[ True],
[ True],
[False]])
In [69]: arr[np.array(m)[:,None], np.array(m)]
...
IndexError: too many indices for array
In fact the 'column' boolean doesn't work either:
In [70]: arr[np.array(m)[:,None]]
...
IndexError: boolean index did not match indexed array along dimension 1; dimension is 4 but corresponding boolean dimension is 1
We can use logical_and to broadcast a column boolean against a row boolean:
In [72]: mb = np.array(m)
In [73]: mb[:,None]&mb
Out[73]:
array([[False, False, False, False],
[False, True, True, False],
[False, True, True, False],
[False, False, False, False]])
In [74]: arr[_]
Out[74]: array([ 6, 7, 10, 11]) # 1d result
This is the case you quoted: "If obj.ndim == x.ndim, x[obj] returns a 1-dimensional array filled with the elements of x corresponding to the True values of obj"
Your other quote:
*"Advanced indexing always returns a copy of the data (contrast with basic slicing that returns a view)." *
means that if arr1 = arr[m,:], arr1 is a copy, and any modifications to arr1 will not affect arr. However I could use arr[m,:]=10to modify arr. The alternative to a copy is a view, as in basic indexing, arr2=arr[0::2,:]. modifications to arr2 do modify arr as well.
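A short sketch of that copy-vs-view distinction, using the same arr and m as above:
arr1 = arr[m, :]          # advanced (boolean) indexing -> copy
arr1[0, 0] = 999
print(arr[1, 0])          # still 5; modifying the copy leaves arr alone

arr2 = arr[0::2, :]       # basic slicing -> view
arr2[0, 0] = 999
print(arr[0, 0])          # 999; modifying the view changes arr

arr[m, :] = 10            # assigning through the index itself also modifies arr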

Related

What's the fastest way to return that indices of values of two arrays that are equal to each other?

Say I have these two numpy arrays:
A = np.array([[1,2,3],[4,5,6],[8,7,3]])
B = np.array([[1,2,3],[3,2,1],[8,7,3]])
It should return
[0,2]
Since the rows at the 0th and 2nd index are equal to each other.
What's the most efficient way of doing this?
I tried something like:
[val for val in range(len(A)) if A[val]==B[val]]
but got the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
You're better off looking for a vectorized solution, so you can try:
>>> np.where(np.all(A == B, axis=1))
(array([0, 2]),)
You can see the benefit of vectorization when it comes to speed here: https://chelseatroy.com/2018/11/07/code-mechanic-numpy-vectorization/amp/
Assuming A.shape == B.shape (otherwise just take A=A[:len(B)] and B=B[:len(A)]) consider:
>>> A==B
[[ True True True]
[False False False]
[ True True True]]
>>> (A==B).all(axis=1)
[ True False True]
>>> np.argwhere((A==B).all(axis=1))
[[0]
[2]]
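If you want a flat index array rather than the column argwhere gives, adding ravel() works (a small sketch):
>>> np.argwhere((A==B).all(axis=1)).ravel()
array([0, 2])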
You can do something like this:
>>> [a in B for a in A]
[True, False, True]
>>> A[[a in B for a in A]]
array([[1, 2, 3],
[8, 7, 3]])
>>> np.where((A==B).all(axis=1))
(array([0, 2]),)
The following solution works also for arrays that do not match in their first dimension, i.e., have a different number of rows. It also works if a match occurs multiple times.
import numpy as np
from scipy.spatial import distance
A = np.array([[1, 2 ,3],
[4, 5 ,6],
[8, 7, 3]])
B = np.array([[1, 2, 3],
[3, 2, 1],
[1, 2, 3],
[9, 9, 9]])
res = np.nonzero(distance.cdist(A, B) == 0)
# ==> (array([0, 0]), array([0, 2]))
The result res is a tuple of two arrays, which hold the matching row indices of the first and the second input array, respectively. So, in this example, the row at index 0 of the first array matches the row at index 0 of the second array, and it also matches the row at index 2 of the second array.
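The same all-pairs comparison can be done without scipy, by broadcasting the two arrays against each other; a sketch using the A and B defined in this answer:
# Compare every row of A with every row of B: (3,1,3) == (1,4,3) -> (3,4,3)
pairwise_equal = (A[:, None, :] == B[None, :, :]).all(axis=-1)   # shape (3, 4)
res = np.nonzero(pairwise_equal)
# ==> (array([0, 0]), array([0, 2]))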
In [174]: A = np.array([[1,2,3],[4,5,6],[8,7,3]])
...: B = np.array([[1,2,3],[3,2,1],[8,7,3]])
Your list comprehension works fine for lists:
In [175]: Al = A.tolist(); Bl = B.tolist()
In [177]: [val for val in range(len(Al)) if Al[val]==Bl[val]]
Out[177]: [0, 2]
For lists == is a simple boolean test - same or not; for arrays it returns a boolean array, which can't be used in an if:
In [178]: Al[0]==Bl[0]
Out[178]: True
In [179]: A[0]==B[0]
Out[179]: array([ True, True, True])
With arrays, you need to add an all(), as suggested by the error:
In [180]: [val for val in range(len(A)) if np.all(A[val]==B[val])]
Out[180]: [0, 2]
The list version will be faster.
But you can also compare the whole arrays, and take a row-by-row all:
In [181]: A==B
Out[181]:
array([[ True, True, True],
[False, False, False],
[ True, True, True]])
In [182]: np.all(A==B, axis=1)
Out[182]: array([ True, False, True])
In [183]: np.nonzero(np.all(A==B, axis=1))
Out[183]: (array([0, 2]),)

Understanding numpy.where

I want to get first index of numpy array element which is greater than some specific element of that same array. I tried following:
>>> Q5=[[1,2,3],[4,5,6]]
>>> Q5 = np.array(Q5)
>>> Q5[0][Q5>Q5[0,0]]
array([2, 3])
>>> np.where(Q5[0]>Q5[0,0])
(array([1, 2], dtype=int32),)
>>> np.where(Q5[0]>Q5[0,0])[0][0]
1
Q1. Is the above the correct way to obtain the first index of an element in Q5[0] greater than Q5[0,0]?
I am more concerned with np.where(Q5[0]>Q5[0,0]) returning the tuple (array([1, 2], dtype=int32),) and hence requiring me to double index with [0][0] at the end of np.where(Q5[0]>Q5[0,0])[0][0].
Q2. Why does this return a tuple, while the following returns a proper numpy array?
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)
array([-1, 2, 3])
So that I can index directly:
>>> np.where(Q5[0]>Q5[0,0],Q5[0],-1)[1]
2
In [58]: A = np.arange(1,10).reshape(3,3)
In [59]: A.shape
Out[59]: (3, 3)
In [60]: A
Out[60]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.where with just the condition is really np.nonzero.
Generate a boolean array:
In [63]: A==6
Out[63]:
array([[False, False, False],
[False, False, True],
[False, False, False]])
Find where that is true:
In [64]: np.nonzero(A==6)
Out[64]: (array([1]), array([2]))
The result is a tuple, one element per dimension of the condition. Each element is an indexing array, together they define the location of the True(s)
Another test with several True
In [65]: (A%3)==1
Out[65]:
array([[ True, False, False],
[ True, False, False],
[ True, False, False]])
In [66]: np.nonzero((A%3)==1)
Out[66]: (array([0, 1, 2]), array([0, 0, 0]))
Using the tuple to index the original array:
In [67]: A[np.nonzero((A%3)==1)]
Out[67]: array([1, 4, 7])
Using the 3 argument where to create a new array with a mix of values from A and A+10
In [68]: np.where((A%3)==1,A+10, A)
Out[68]:
array([[11, 2, 3],
[14, 5, 6],
[17, 8, 9]])
If the condition has multiple Trues, nonzero isn't the best tool for finding the "first", since it necessarily finds them all.
The nonzero tuple can be turned into a 2d array with a transpose. It actually may be easier to get the "first" from this array:
In [73]: np.argwhere((A%3)==1)
Out[73]:
array([[0, 0],
[1, 0],
[2, 0]])
You are looking in a 1d array, a row of A:
In [77]: A[0]>A[0,0]
Out[77]: array([False, True, True])
In [78]: np.nonzero(A[0]>A[0,0])
Out[78]: (array([1, 2]),) # 1 element tuple
In [79]: np.argwhere(A[0]>A[0,0])
Out[79]:
array([[1],
[2]])
In [81]: np.where(A[0]>A[0,0], 100, 0) # 3 argument where
Out[81]: array([ 0, 100, 100])
So whether you are searching a 1d array or a 2d (or 3 or 4), nonzero returns a tuple with one array element per dimension. That way it can always be used to index a like sized array. The 1d tuple might look redundant, but it is consistent with other dimensional results.
When trying to understand operations like this, read the docs carefully, and look at the individual steps. Here I looked at the conditional matrix, the nonzero result, and its various uses.
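Putting those pieces together for the original question, one way to get the "first" index and guard against the all-False case (a sketch, reusing Q5 from the question):
hits = np.nonzero(Q5[0] > Q5[0, 0])[0]   # integer indices where the condition holds
first = hits[0] if hits.size else None   # None if nothing is greater
print(first)                             # 1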
Using argmax with a boolean array will give you the index of the first True.
In [54]: q
Out[54]:
array([[1, 2, 3],
[4, 5, 6]])
In [55]: q > q[0,0]
Out[55]:
array([[False, True, True],
[ True, True, True]], dtype=bool)
argmax can take an axis/dimension argument.
In [56]: np.argmax(q > q[0,0], 0)
Out[56]: array([1, 0, 0], dtype=int64)
That says the first True is index one for column zero and index zero for columns one and two.
In [57]: np.argmax(q > q[0,0], 1)
Out[57]: array([1, 0], dtype=int64)
That says the first True is index one for row zero and index zero for row one.
Q1. Is the above the correct way to obtain the first index of an element in Q5[0] greater than Q5[0,0]?
No, I would use argmax with 1 for the axis argument, then select the first item from that result.
Q2. Why does this return a tuple?
You told it to return -1 for False values and return Q5[0] items for True values.
Q2 ...but below returns proper numpy array?
You got lucky and chose the correct index.
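One caveat worth adding here (my note, not part of the answer above): argmax also returns 0 when there is no True at all, so a guard with any() avoids a misleading result. A sketch using q from above:
row = q[0]                        # [1, 2, 3]
mask = row > row[0]               # [False, True, True]
first = mask.argmax() if mask.any() else -1
print(first)                      # 1 (would be -1 if nothing were greater)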
numpy.where() is like a for loop with an if.
numpy.where(condition, values, new_value)
condition - just like an if condition.
values - the values to iterate over.
new_value - where the condition is False, the value is replaced by new_value; where it is True, the original value is kept.
If we would like to write it for a 1-dimensional array it should look something like this:
[xv if c else yv for c, xv, yv in zip(condition, x, y)]
Example:
>>> a = np.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> np.where(a < 5, a, 10*a)
array([ 0, 1, 2, 3, 4, 50, 60, 70, 80, 90])
First we create an array with numbers from 0 to 9 (0, 1, 2 ... 7, 8, 9),
and then, for all the values in the array that are greater than or equal to 5, we multiply their value by 10.
So all the values in the array that are less than 5 stay the same, and all the values that are greater than or equal to 5 are multiplied by 10.
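To tie the list-comprehension picture back to the array example above, both give the same result; a small sketch:
>>> a = np.arange(10)
>>> looped = [xv if c else yv for c, xv, yv in zip(a < 5, a, 10*a)]
>>> np.array_equal(looped, np.where(a < 5, a, 10*a))
True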

How exactly does numpy.where() select the elements in this example?

From numpy docs
>>> np.where([[True, False], [True, True]],
... [[1, 2], [3, 4]],
... [[9, 8], [7, 6]])
array([[1, 8],
[3, 4]])
Am I right in assuming that the [[True, False], [True, True]] part is the condition and [[1, 2], [3, 4]] and [[9, 8], [7, 6]] are x and y respectively, according to the parameters in the docs?
Then how exactly is the function choosing the elements in the following examples?
Also, why is the element type in these examples a list?
>>> np.where([[True, False,True], [False, True]], [[1, 2,56], [3, 4]], [[9, 8,79], [7, 6]])
array([list([1, 2, 56]), list([3, 4])], dtype=object)
>>> np.where([[False, False,True,True], [False, True]], [[1, 2,56,69], [3, 4]], [[9, 8,90,100], [7, 6]])
array([list([1, 2, 56, 69]), list([3, 4])], dtype=object)
In the first case, each term is a (2,2) array (or rather list that can be made into such an array). For each True in the condition, it returns the corresponding term in x, the [[1 -][3,4]], and for each False, the term from y [[- 8][- -]]
In the second case, the lists are ragged
In [1]: [[True, False,True], [False, True]]
Out[1]: [[True, False, True], [False, True]]
In [2]: np.array([[True, False,True], [False, True]])
Out[2]: array([list([True, False, True]), list([False, True])], dtype=object)
The array has shape (2,), containing 2 lists. When cast as boolean it becomes a 2-element array with both elements True; only an empty list would produce False.
In [3]: _.astype(bool)
Out[3]: array([ True, True])
The where then returns just the x values.
This second case is understandable, but pathological.
more details
Let's demonstrate where in more detail, with a simpler case. Same condition array:
In [57]: condition = np.array([[True, False], [True, True]])
In [58]: condition
Out[58]:
array([[ True, False],
[ True, True]])
The single argument version, which is equivalent to condition.nonzero():
In [59]: np.where(condition)
Out[59]: (array([0, 1, 1]), array([0, 0, 1]))
Some find it easier to visualize the transpose of that tuple - the 3 pairs of coordinates where condition is True:
In [60]: np.argwhere(condition)
Out[60]:
array([[0, 0],
[1, 0],
[1, 1]])
Now the simplest version with 3 arguments, with scalar values.
In [61]: np.where(condition, True, False) # same as condition
Out[61]:
array([[ True, False],
[ True, True]])
In [62]: np.where(condition, 100, 200)
Out[62]:
array([[100, 200],
[100, 100]])
A good way of visualizing this action is with two masked assignments.
In [63]: res = np.zeros(condition.shape, int)
In [64]: res[condition] = 100
In [65]: res[~condition] = 200
In [66]: res
Out[66]:
array([[100, 200],
[100, 100]])
Another way to do this is to initialize an array with the y value(s), and then use the nonzero/where result to fill in the x value(s).
In [69]: res = np.full(condition.shape, 200)
In [70]: res
Out[70]:
array([[200, 200],
[200, 200]])
In [71]: res[np.where(condition)] = 100
In [72]: res
Out[72]:
array([[100, 200],
[100, 100]])
If x and y are arrays, not scalars, this masked assignment will require refinements, but hopefully for a start this will help.
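The refinement for array-valued x and y is small: index both with the same condition. A sketch, reusing condition from In [57] together with the x and y arrays from the question:
x = np.array([[1, 2], [3, 4]])
y = np.array([[9, 8], [7, 6]])
res = y.copy()                   # start from the y values
res[condition] = x[condition]    # overwrite where condition is True
print(res)
# [[1 8]
#  [3 4]]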
np.where(condition,x,y)
It checks the condition and, where it is True, returns x; otherwise it returns y.
np.where([[True, False], [True, True]],
[[1, 2], [3, 4]],
[[9, 8], [7, 6]])
Here your condition is [[True, False], [True, True]]
x = [[1 , 2] , [3 , 4]]
y = [[9 , 8] , [7 , 6]]
The first condition is True, so it returns 1 instead of 9.
The second condition is False, so it returns 8 instead of 2.
After reading about broadcasting as #hpaulj suggested I think I know how the function works.
It will try to broadcast the 3 arrays; if the broadcast is successful, it will use the True and False values to pick elements either from x or from y.
In the example
>>>np.where([[True, False,True], [False, True]], [[1, 2,56], [3, 4]], [[9, 8,79], [7, 6]])
We have
cnd=np.array([[True, False,True], [False, True]])
x=np.array([[1, 2,56], [3, 4]])
y=np.array([[9, 8,79], [7, 6]])
Now
>>>x.shape
Out[7]: (2,)
>>>y.shape
Out[8]: (2,)
>>>cnd.shape
Out[9]: (2,)
So all three are just arrays with 2 elements (of type list), even the condition (cnd). So both [True, False, True] and [False, True] will be evaluated as True, and both elements will be selected from x.
>>>np.where([[True, False,True], [False, True]], [[1, 2,56], [3, 4]], [[9, 8,79], [7, 6]])
Out[10]: array([list([1, 2, 56]), list([3, 4])], dtype=object)
I also tried it with a more complex example (a 2x2x2 broadcast) and it still explains it.
np.where([[[True,False],[True,True]], [[False,False],[True,False]]],
[[[12,45],[10,50]], [[100,10],[17,81]]],
[[[90,93],[85,13]], [[12,345], [190,56,34]]])
Where
cnd=np.array([[[True,False],[True,True]], [[False,False],[True,False]]])
x=np.array([[[12,45],[10,50]], [[100,10],[17,81]]])
y=np.array( [[[90,93],[85,13]], [[12,345], [190,56,34]]])
Here cnd and x have the shape (2,2,2) and y has the shape (2,2).
>>>cnd.shape
Out[14]: (2, 2, 2)
>>>x.shape
Out[15]: (2, 2, 2)
>>>y.shape
Out[16]: (2, 2)
Now, as #hpaulj commented, y will be broadcast to (2,2,2).
And it'll probably look like this
>>>cnd
Out[6]:
array([[[ True, False],
[ True, True]],
[[False, False],
[ True, False]]])
>>>x
Out[7]:
array([[[ 12, 45],
[ 10, 50]],
[[100, 10],
[ 17, 81]]])
>>>np.broadcast_to(y,(2,2,2))
Out[8]:
array([[[list([90, 93]), list([85, 13])],
[list([12, 345]), list([190, 56, 34])]],
[[list([90, 93]), list([85, 13])],
[list([12, 345]), list([190, 56, 34])]]], dtype=object)
And the result can be easily predicted to be
>>>np.where([[[True,False],[True,True]], [[False,False],[True,False]]], [[[12,45],[10,50]], [[100,10],[17,81]]],[[[90,93],[85,13]], [[12,345], [190,56,34]]])
Out[9]:
array([[[12, list([85, 13])],
[10, 50]],
[[list([90, 93]), list([85, 13])],
[17, list([190, 56, 34])]]], dtype=object)
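If you want to check up front whether the three arguments will broadcast at all, np.broadcast_shapes reports the common shape (an assumption here: NumPy 1.20+, where this helper exists):
>>> np.broadcast_shapes(cnd.shape, x.shape, y.shape)
(2, 2, 2)
>>> np.broadcast_shapes((2, 3), (2, 2))   # incompatible shapes raise a ValueError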

2d numpy mask not working as expected

I'm trying to turn a 2x3 numpy array into a 2x2 array by removing select indexes.
I think I can do this with a mask array with true/false values.
Given
[ 1, 2, 3],
[ 4, 1, 6]
I want to remove one element from each row to give me:
[ 2, 3],
[ 4, 6]
However this method isn't working quite like I would expect:
import numpy as np
in_array = np.array([
[ 1, 2, 3],
[ 4, 1, 6]
])
mask = np.array([
[False, True, True],
[True, False, True]
])
print(in_array[mask])
Gives me:
[2 3 4 6]
Which is not what I want. Any ideas?
The only thing 'wrong' with that is the shape - 1d rather than 2d. But what if your mask was
mask = np.array([
[False, True, False],
[True, False, True]
])
1 value in the first row, 2 in second. It couldn't return that as a 2d array, could it?
So the default behavior when masking like this is to return a 1d, or raveled result.
Boolean indexing like this is effectively a where indexing:
In [19]: np.where(mask)
Out[19]: (array([0, 0, 1, 1], dtype=int32), array([1, 2, 0, 2], dtype=int32))
In [20]: in_array[_]
Out[20]: array([2, 3, 4, 6])
It finds the elements of the mask which are true, and then selects the corresponding elements of the in_array.
Maybe the transpose of where is easier to visualize:
In [21]: np.argwhere(mask)
Out[21]:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2]], dtype=int32)
and indexing iteratively:
In [23]: for ij in np.argwhere(mask):
...: print(in_array[tuple(ij)])
...:
2
3
4
6
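If, as in your example, the mask keeps the same number of elements in every row, you can reshape the raveled result back to 2d; a small sketch with the arrays from the question:
out = in_array[mask].reshape(in_array.shape[0], -1)
print(out)
# [[2 3]
#  [4 6]]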

Numpy Vectorization While Indexing Two Arrays

I'm trying to vectorize the following function using numpy and am completely lost.
A = ndarray: Z x 3
B = ndarray: Z x 3
C = integer
D = ndarray: C x 3
Pseudocode:
entries = []
means = []
for i in range(C):
    for p in range(len(B)):
        if B[p] == D[i]:
            entries.append(A[p])
    means.append(columnwise_means(entries))
return means
An example would be:
A = [[1,2,3],[1,2,3],[4,5,6],[4,5,6]]
B = [[9,8,7],[7,6,5],[1,2,3],[3,4,5]]
C = 2
D = [[1,2,3],[4,5,6]]
Returns:
[average([9,8,7],[7,6,5]), average([1,2,3],[3,4,5])] = [[8,7,6],[2,3,4]]
I've tried using np.where, np.argwhere, np.mean, etc but can't seem to get the desired effect. Any help would be greatly appreciated.
Thanks!
Going by the expected output of the question, I am assuming that in the actual code you would have:
the IF conditional statement as if A[p] == D[i], and
entries appended from B: entries.append(B[p]).
So, here's one vectorized approach with NumPy broadcasting and dot-product -
mask = (D[:,None,:] == A).all(-1)
out = mask.dot(B)/(mask.sum(1)[:,None])
If the input arrays are integer arrays, then you can save on memory and boost performance by treating the rows as indices into an n-dimensional array, thus creating the 2D mask without going 3D, like so -
dims = np.maximum(A.max(0),D.max(0))+1
mask = np.ravel_multi_index(D.T,dims)[:,None] == np.ravel_multi_index(A.T,dims)
Sample run -
In [107]: A
Out[107]:
array([[1, 2, 3],
[1, 2, 3],
[4, 5, 6],
[4, 5, 6]])
In [108]: B
Out[108]:
array([[9, 8, 7],
[7, 6, 5],
[1, 2, 3],
[3, 4, 5]])
In [109]: mask = (D[:,None,:] == A).all(-1)
...: out = mask.dot(B)/(mask.sum(1)[:,None])
...:
In [110]: out
Out[110]:
array([[8, 7, 6],
[2, 3, 4]])
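For completeness, a quick check of the integer ravel_multi_index variant on the same data (a sketch; it builds the same 2D mask without the 3D intermediate):
D = np.array([[1,2,3],[4,5,6]])
dims = np.maximum(A.max(0), D.max(0)) + 1
mask2 = np.ravel_multi_index(D.T, dims)[:,None] == np.ravel_multi_index(A.T, dims)
print(np.array_equal(mask, mask2))   # True -> identical selection mask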
I see two hints:
First, comparing arrays by rows. A way to do that is to collapse your index system to 1D:
def indexer(M, base=256):
    return (M * base**np.arange(3)).sum(axis=1)
base is an integer > A.max(). Then the selection can be done like this:
indices=np.equal.outer(indexer(D),indexer(A))
which gives:
array([[ True, True, False, False],
[False, False, True, True]], dtype=bool)
Second, each group can have a different length, so vectorisation is difficult for the last step. Here is a way to achieve the job:
B = np.array(B)
means=[B[i].mean(axis=0) for i in indices]
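A minimal end-to-end sketch of this second approach on the example data from the question (note the A/B swap flagged in the first answer):
import numpy as np

A = np.array([[1,2,3],[1,2,3],[4,5,6],[4,5,6]])
B = np.array([[9,8,7],[7,6,5],[1,2,3],[3,4,5]])
D = np.array([[1,2,3],[4,5,6]])

def indexer(M, base=256):
    # collapse each length-3 row into a single integer key
    return (M * base**np.arange(3)).sum(axis=1)

indices = np.equal.outer(indexer(D), indexer(A))   # (2, 4) boolean selection
means = [B[i].mean(axis=0) for i in indices]
print(means)   # [array([8., 7., 6.]), array([2., 3., 4.])]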
