PyTorch/NumPy batched submatrix indexing

There's a single source (square) matrix L of shape (N, N)
import torch as pt
import numpy as np
N = 4
L = pt.arange(N*N).reshape(N, N) # or np.arange(N*N).reshape(N, N)
L = tensor([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
and a matrix (vector of vectors) of boolean masks m of shape (K, N) according to which I'd like to extract submatrices from L.
K = 3
m = tensor([[ True, True, False, False],
[False, True, True, False],
[False, True, False, True]])
I know how to extract a single submatrix using a single mask vector by calling L[m[i]][:, m[i]] for any i. So, for example, for i=0, we'd get
tensor([[ 0, 1],
[ 4, 5]])
but I need to perform the operation along the entire "batch" dimension. The end result I'm looking for then could be achieved by
res = []
for i in range(K):
    res.append(L[m[i]][:, m[i]])
output = pt.stack(res)
however, I hope there is a better solution excluding the for loop. I realize that the for loop solution itself would crash if the sum of m along the last dimension (dim/axis=1) wasn't constant, but if I can guarantee that it is, is there a better solution? If there isn't, would changing the selector representation help? I chose boolean masks for convenience, but I prefer better performance.

Notice that you can get the first square by indexing together with broadcasting:
import torch
r = torch.tensor([0,1])
L[r[:,None], r]
output:
tensor([[0, 1],
[4, 5]])
The same principle can be applied to the second square:
r = torch.tensor([1,2])
L[r[:,None], r]
output:
tensor([[ 5, 6],
[ 9, 10]])
In combination you get:
i = torch.tensor([[0, 1], [1, 2]])
L[i[:,:,None], i[:,None]]
output:
tensor([[[ 0, 1],
[ 4, 5]],
[[ 5, 6],
[ 9, 10]]])
All 3 squares:
i = torch.tensor([
[0, 1],
[1, 2],
[1, 3],
])
L[i[:,:,None], i[:,None]]
output:
tensor([[[ 0, 1],
[ 4, 5]],
[[ 5, 6],
[ 9, 10]],
[[ 5, 7],
[13, 15]]])
To summarize, I would suggest using indices instead of a mask.
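If the selectors already exist as boolean masks, they can be converted into such an index array first. A minimal NumPy sketch, assuming every row of m has the same number of True entries (the names idx, batched and looped are mine, not from the question):

```python
import numpy as np

N, K = 4, 3
L = np.arange(N * N).reshape(N, N)
m = np.array([[True, True, False, False],
              [False, True, True, False],
              [False, True, False, True]])

# Column positions of the True entries, one row of indices per mask.
# np.nonzero scans in row-major order, so the reshape groups them per mask;
# this only works if every mask selects the same number of entries.
idx = np.nonzero(m)[1].reshape(K, -1)          # shape (K, S)

# Broadcast a (K, S, 1) row selector against a (K, 1, S) column selector.
batched = L[idx[:, :, None], idx[:, None, :]]  # shape (K, S, S)

# Reference result from the loop in the question.
looped = np.stack([L[m[i]][:, m[i]] for i in range(K)])
print(np.array_equal(batched, looped))
```

The same indexing expression should carry over to PyTorch tensors, where torch.nonzero(m)[:, 1] plays the role of np.nonzero(m)[1].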

Related

Remove entire sub array from multi-dimensional array if any element in array is duplicate

I have a multi-dimensional array in Python where there may be a repeated integer within a vector in the array. For example.
array = [[1,2,3,4],
[2,9,12,4],
[5,6,7,8],
[6,8,12,13]]
I would like to completely remove the vectors that contain any element that has appeared previously. In this case, vector [2,9,12,4] and vector [6,8,12,13] should be removed because they have an element (for example, 2 and 6 respectively) that has appeared in a previous vector within that array. Note that [6,8,12,13] contains more than one element that has appeared previously (6 and 8, and also 12), so the code should be able to work with these scenarios as well.
The resulting array should end up being:
array = [[1,2,3,4],
[5,6,7,8]]
I thought I could achieve this with np.unique(array, axis=0), but I couldn't find another function that would take care of this particular kind of uniqueness.
Any thoughts are appreciated.
You can work with an array that pairs each number (sorted by value) with the index of the row it came from, which looks like this:
number_info = array([[ 0, 1],
[ 0, 2],
[ 1, 2],
[ 0, 3],
[ 0, 4],
[ 1, 4],
[ 2, 5],
[ 2, 6],
[ 3, 6],
[ 2, 7],
[ 2, 8],
[ 3, 8],
[ 1, 9],
[ 1, 12],
[ 3, 12],
[ 3, 13]])
It indicates that rows remove_idx = [2, 5, 8, 11, 14] of this array need to be removed, and they point to rows rows_idx = [1, 1, 3, 3, 3] of the original array. Now, the code:
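import numpy as np
array = np.asarray(array)  # the question's list of lists; .shape needs an ndarray
# Pair every value with the index of the row it came from.
flat_idx = np.repeat(np.arange(array.shape[0]), array.shape[1])
number_info = np.transpose([flat_idx, array.ravel()])
# Stable sort by value so that equal values stay ordered by row index.
number_info = number_info[np.argsort(number_info[:,1], kind='stable')]
# An entry is a repeat if it equals the previous value but comes from a later row.
remove_idx = np.where((np.diff(number_info[:,1])==0) &
(np.diff(number_info[:,0])>0))[0] + 1
remove_rows = number_info[remove_idx, 0]
output = np.delete(array, remove_rows, axis=0)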
Output:
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
Here's a quick way to do it with a list comprehension and set intersections:
>>> array = [[1,2,3,4],
... [2,9,12,4],
... [5,6,7,8],
... [6,8,12,13]]
>>> [v for i, v in enumerate(array) if not any(set(a) & set(v) for a in array[:i])]
[[1, 2, 3, 4], [5, 6, 7, 8]]
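The comprehension above rebuilds sets for every pair of rows. A sketch of the same rule with a single running set, assuming (as in the comprehension) that elements of removed rows still count as having appeared; drop_repeated_rows is a hypothetical helper name:

```python
def drop_repeated_rows(rows):
    """Keep only rows that share no element with any earlier row."""
    seen = set()   # every element met so far, including those in dropped rows
    kept = []
    for row in rows:
        if seen.isdisjoint(row):
            kept.append(row)
        seen.update(row)
    return kept

array = [[1, 2, 3, 4],
         [2, 9, 12, 4],
         [5, 6, 7, 8],
         [6, 8, 12, 13]]
print(drop_repeated_rows(array))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Each row is visited once, so the cost grows with the total number of elements rather than with the square of the number of rows.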

Compare two 3d Numpy array and return unmatched values with index and later recreate them without loop

I am currently working on a problem where one requirement is to compare two 3D NumPy arrays, return the unmatched values with their index positions, and later recreate the same array. Currently, the only approach I can think of is to loop across the arrays to get the values while comparing and again while recreating. The problem is scale: there will be hundreds of arrays, and looping affects the latency of the overall application. I would be thankful if anyone can help me make better use of NumPy comparison with minimal or no loops. Dummy code is below:
def compare_array(final_array_list):
    base_array = None
    i = 0
    for array in final_array_list:
        if i==0:
            base_array =array[0]
        else:
            index = np.where(base_array != array)
            # getting index like (array([0, 1]), array([1, 1]), array([2, 2]))
            # to access all unmatched values I need to loop. Need to avoid loop here
        i=i+1
    return [base_array, [unmatched value (8,10)and its index (array([0, 1]), array([1, 1]), array([2, 2])],..]
# similarly recreate array1 back
def recreate_array(array_list):
    # need to avoid looping while recreating array back
    return list of array #i.e. [base_array, array_1]
# creating dummy array
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = b = np.array([[[1, 2,3], [3, 4,8]], [[5, 6,7], [7, 8,10]]])
final_array_list = [base_array,array_1, ...... ]
#compare base_array with other arrays and get unmatched values (like 8,10 in array_1) and their index
difff_array = compare_array(final_array_list)
# recreate array1 from the base array after receiving unmatched value and its index value
recreate_array(difff_array)
I think this may be what you're looking for:
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = b = np.array([[[1, 2,3], [3, 4,8]], [[5, 6,7], [7, 8,10]]])
match_mask = (base_array == array_1)
idx_unmatched = np.argwhere(~match_mask)
# idx_unmatched:
# array([[0, 1, 2],
# [1, 1, 2]])
# values associated with idx_unmatched:
values_unmatched = base_array[tuple(idx_unmatched.T)]
# values_unmatched:
# array([5, 9])
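If "recreate" means rebuilding array_1 from base_array plus the stored differences, one possible sketch is to keep the values taken from array_1 (rather than from base_array) at the unmatched positions and write them back into a copy. The function names diff_against and rebuild_from are assumptions for illustration:

```python
import numpy as np

def diff_against(base, other):
    """Return (index tuple, values of `other`) where `other` differs from `base`."""
    idx = np.nonzero(base != other)   # one index array per dimension
    return idx, other[idx]

def rebuild_from(base, idx, values):
    """Recreate the other array from `base` and the stored differences."""
    out = base.copy()
    out[idx] = values
    return out

base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])

idx, values = diff_against(base_array, array_1)
print(values)                              # the two unmatched values, 8 and 10
restored = rebuild_from(base_array, idx, values)
print(np.array_equal(restored, array_1))
```

For a list of many arrays, the same two calls can be applied once per array, so the only remaining loop is over arrays and not over elements.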
I'm not sure I understand what you mean by "recreate them" (completely recreate them? why not use the arrays themselves?).
I can help you though by noting that there are plenty of functions which vectorize with numpy, and as a general rule of thumb, do not use for loops unless G-d himself tells you to :)
For example:
If a and b are np.arrays of the same (or broadcastable) shape, regardless of the number of dimensions, the simple a == b returns a boolean numpy array of that shape: True where the elements are equal at that coordinate, and False otherwise.
The function np.where(c) treats c as a boolean np.array and returns the indices at which c is True, as a tuple with one index array per dimension.
To clarify:
Here I instantiate two arrays, with b differing from a at three positions that are set to -1.
Note what a == b is, at the end.
>>> a = np.random.randint(low=0, high=10, size=(4, 4))
>>> b = np.copy(a)
>>> b[2, 3] = -1
>>> b[0, 1] = -1
>>> b[1, 1] = -1
>>> a
array([[9, 9, 3, 4],
[8, 4, 6, 7],
[8, 4, 5, 5],
[1, 7, 2, 5]])
>>> b
array([[ 9, -1, 3, 4],
[ 8, -1, 6, 7],
[ 8, 4, 5, -1],
[ 1, 7, 2, 5]])
>>> a == b
array([[ True, False, True, True],
[ True, False, True, True],
[ True, True, True, False],
[ True, True, True, True]])
Now the function np.where, whose output is a bit tricky but can be used easily. For a 2D input it returns two arrays of the same size: the first array holds the rows and the second array holds the columns of the places in which the given array is True.
>>> np.where(a == b)
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], dtype=int64), array([0, 2, 3, 0, 2, 3, 0, 1, 2, 0, 1, 2, 3], dtype=int64))
Now you can "fix" the b array to match a by replacing the values of b, at the indices where it differs from a, with a's values:
>>> b[np.where(a != b)]
array([-1, -1, -1])
>>> b[np.where(a != b)] = a[np.where(a != b)]
>>> np.all(a == b)
True

Numpy assignment like 'numpy.take'

Is it possible to assign to a numpy array along the lines of how the take functionality works?
E.g. if I have an array a, a list of indices inds, and a desired axis, I can use take as follows:
import numpy as np
a = np.arange(12).reshape((3, -1))
inds = np.array([1, 2])
print(np.take(a, inds, axis=1))
[[ 1 2]
[ 5 6]
[ 9 10]]
This is extremely useful when the indices / axis needed may change at runtime.
However, numpy does not let you do this:
np.take(a, inds, axis=1) = 0
print(a)
It looks like there is some limited (1-D) support for this via numpy.put, but I was wondering if there was a cleaner way to do this?
In [222]: a = np.arange(12).reshape((3, -1))
...: inds = np.array([1, 2])
...:
In [223]: np.take(a, inds, axis=1)
Out[223]:
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
In [225]: a[:,inds]
Out[225]:
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
construct an indexing tuple
In [226]: idx=[slice(None)]*a.ndim
In [227]: axis=1
In [228]: idx[axis]=inds
In [229]: a[tuple(idx)]
Out[229]:
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
In [230]: a[tuple(idx)] = 0
In [231]: a
Out[231]:
array([[ 0, 0, 0, 3],
[ 4, 0, 0, 7],
[ 8, 0, 0, 11]])
Or for a[inds,:]:
In [232]: idx=[slice(None)]*a.ndim
In [233]: idx[0]=inds
In [234]: a[tuple(idx)]
Out[234]:
array([[ 4, 0, 0, 7],
[ 8, 0, 0, 11]])
In [235]: a[tuple(idx)]=1
In [236]: a
Out[236]:
array([[0, 0, 0, 3],
[1, 1, 1, 1],
[1, 1, 1, 1]])
PP's suggestion:
def put_at(inds, axis=-1, slc=(slice(None),)):
    return (axis<0)*(Ellipsis,) + axis*slc + (inds,) + (-1-axis)*slc
To be used as in a[put_at(ind_list,axis=axis)]
I've seen both styles in numpy functions. This looks like the one used for expand_dims; mine was used in apply_along_axis/apply_over_axes.
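The indexing-tuple idea can be wrapped in a small helper. This is a sketch only; assign_along_axis is a hypothetical name, not a NumPy function:

```python
import numpy as np

def assign_along_axis(a, inds, axis, value):
    """In-place counterpart of np.take(a, inds, axis=axis) for assignment."""
    idx = [slice(None)] * a.ndim   # start with a[:, :, ...]
    idx[axis] = inds               # replace the chosen axis with the indices
    a[tuple(idx)] = value          # value must broadcast to the selected shape

a = np.arange(12).reshape((3, -1))
assign_along_axis(a, np.array([1, 2]), axis=1, value=0)
print(a)
```

Because the axis is an ordinary argument, it can be chosen at runtime, which is the property the question was after.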
earlier thoughts
In a recent take question I/we figured out that it was equivalent to arr.flat[ind] for some raveled index. I'll have to look that up.
There is an np.put that is equivalent to assignment to the flat:
Signature: np.put(a, ind, v, mode='raise')
Docstring:
Replaces specified elements of an array with given values.
The indexing works on the flattened target array. `put` is roughly
equivalent to:
a.flat[ind] = v
Its docs also mention place and putmask (and copyto).
numpy multidimensional indexing and the function 'take'
I commented take (without axis) is equivalent to:
lut.flat[np.ravel_multi_index(arr.T, lut.shape)].T
with ravel:
In [257]: a = np.arange(12).reshape((3, -1))
In [258]: IJ=np.ix_(np.arange(a.shape[0]), inds)
In [259]: np.ravel_multi_index(IJ, a.shape)
Out[259]:
array([[ 1, 2],
[ 5, 6],
[ 9, 10]], dtype=int32)
In [260]: np.take(a,np.ravel_multi_index(IJ, a.shape))
Out[260]:
array([[ 1, 2],
[ 5, 6],
[ 9, 10]])
In [261]: a.flat[np.ravel_multi_index(IJ, a.shape)] = 100
In [262]: a
Out[262]:
array([[ 0, 100, 100, 3],
[ 4, 100, 100, 7],
[ 8, 100, 100, 11]])
and to use put:
In [264]: np.put(a, np.ravel_multi_index(IJ, a.shape), np.arange(1,7))
In [265]: a
Out[265]:
array([[ 0, 1, 2, 3],
[ 4, 3, 4, 7],
[ 8, 5, 6, 11]])
Use of ravel is unnecessary in this case but might be useful in others.
I have given an example for use of
numpy.take in 2 dimensions. Perhaps you can adapt that to your problem
You can just use indexing in this way:
a[:,[1,2]]=0

pick TxK numpy array from TxN numpy array using TxK column index array

This is an indirect indexing problem.
It can be solved with a list comprehension.
The question is whether, or how, to solve it within numpy,
When
data.shape is (T,N)
and
c.shape is (T,K)
and each element of c is an int between 0 and N-1 inclusive, that is,
each element of c is intended to refer to a column number from data.
The goal is to obtain out where
out.shape = (T,K)
And for each i in 0..(T-1)
the row out[i] = [ data[i, c[i,0]] , ... , data[i, c[i,K-1]] ]
Concrete example:
data = np.array([\
[ 0, 1, 2],\
[ 3, 4, 5],\
[ 6, 7, 8],\
[ 9, 10, 11],\
[12, 13, 14]])
c = np.array([
[0, 2],\
[1, 2],\
[0, 0],\
[1, 1],\
[2, 2]])
out should be out = [[0, 2], [4, 5], [6, 6], [10, 10], [14, 14]]
The first row of out is [0,2] because the columns chosen are given by c's row 0, they are 0 and 2, and data[0] at columns 0 and 2 are 0 and 2.
The second row of out is [4,5] because the columns chosen are given by c's row 1, they are 1 and 2, and data[1] at columns 1 and 2 is 4 and 5.
Numpy fancy indexing doesn't seem to solve this in an obvious way because indexing data with c (e.g. data[c], np.take(data,c,axis=1) ) always produces a 3 dimensional array.
A list comprehension can solve it:
out = [ [data[rowidx,i1],data[rowidx,i2]] for (rowidx, (i1,i2)) in enumerate(c) ]
if K is 2 I suppose this is marginally OK. If K is variable, this is not so good.
The list comprehension has to be rewritten for each value K, because it unrolls the columns picked out of data by each row of c. It also violates DRY.
Is there a solution based entirely in numpy?
You can avoid loops with np.choose:
In [1]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
data = np.array([\
[ 0, 1, 2],\
[ 3, 4, 5],\
[ 6, 7, 8],\
[ 9, 10, 11],\
[12, 13, 14]])
c = np.array([
[0, 2],\
[1, 2],\
[0, 0],\
[1, 1],\
[2, 2]])
--
In [2]: np.choose(c, data.T[:,:,np.newaxis])
Out[2]:
array([[ 0, 2],
[ 4, 5],
[ 6, 6],
[10, 10],
[14, 14]])
Here's one possible route to a general solution...
Create masks for data to select the values for each column of out. For example, the first mask could be achieved by writing:
>>> np.arange(3) == np.vstack(c[:,0])
array([[ True, False, False],
[False, True, False],
[ True, False, False],
[False, True, False],
[False, False, True]], dtype=bool)
>>> data[_]
array([ 0, 4, 6, 10, 14])
The mask to get the values for the second column of out: np.arange(3) == np.vstack(c[:,1]).
So, to get the out array...
>>> mask0 = np.arange(3) == np.vstack(c[:,0])
>>> mask1 = np.arange(3) == np.vstack(c[:,1])
>>> np.vstack((data[mask0], data[mask1])).T
array([[ 0, 2],
[ 4, 5],
[ 6, 6],
[10, 10],
[14, 14]])
Edit: Given arbitrary array widths K and N you could use a loop to create the masks, so the general construction of the out array might simply look like this:
np.vstack([data[np.arange(N) == np.vstack(c[:,i])] for i in range(K)]).T
Edit 2: A slightly neater solution (though still relying on a loop) is:
np.vstack([data[i][c[i]] for i in range(T)])
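Two loop-free alternatives that should give the same result: pairing a column of row numbers with c in an integer-array index, and np.take_along_axis (available in NumPy 1.15 and later). A sketch using the data and c from the question; the names out_fancy and out_take are mine:

```python
import numpy as np

data = np.arange(15).reshape(5, 3)
c = np.array([[0, 2], [1, 2], [0, 0], [1, 1], [2, 2]])
T = data.shape[0]

# Row numbers as a (T, 1) column broadcast against the (T, K) column indices.
out_fancy = data[np.arange(T)[:, None], c]

# take_along_axis picks data[i, c[i, j]] for every i, j along axis 1.
out_take = np.take_along_axis(data, c, axis=1)

print(out_fancy)
```

Neither form depends on K, so nothing has to be rewritten when the number of picked columns changes.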

Using matlab's find like operation in python [duplicate]

I have a matrix of shape 10,000x30 in Python. What I want is to find the indices of the rows, i.e., from the 10,000 rows, determine the indices of those whose 5th column value equals 0.
How can I get the indices? Once I get the indices, I want to select the corresponding rows from another matrix B. How do I accomplish this?
>>> a = np.random.randint(0, 10, (10, 5))
>>> a
array([[4, 9, 7, 2, 9],
[1, 9, 5, 0, 8],
[1, 7, 7, 8, 4],
[6, 2, 1, 9, 6],
[6, 2, 0, 0, 8],
[5, 5, 8, 4, 5],
[6, 8, 8, 8, 8],
[2, 2, 3, 4, 3],
[3, 6, 2, 1, 2],
[6, 3, 2, 4, 0]])
>>> a[:, 4] == 0
array([False, False, False, False, False, False, False, False, False, True], dtype=bool)
>>> b = np.random.rand(10, 5)
>>> b
array([[ 0.37363295, 0.96763033, 0.72892652, 0.77217485, 0.86549555],
[ 0.83041897, 0.35277681, 0.13011611, 0.82887195, 0.87522863],
[ 0.88325189, 0.67976957, 0.56058782, 0.58438597, 0.10571746],
[ 0.27305838, 0.72306733, 0.01630463, 0.86069002, 0.9458257 ],
[ 0.23113894, 0.30396521, 0.92840314, 0.39544522, 0.59708927],
[ 0.71878406, 0.91327744, 0.71407427, 0.65388644, 0.416599 ],
[ 0.83550209, 0.85024774, 0.96788451, 0.72253464, 0.41661953],
[ 0.61458993, 0.34527785, 0.20301719, 0.10626226, 0.00773484],
[ 0.87275531, 0.54878131, 0.24933454, 0.29894835, 0.66966912],
[ 0.59533278, 0.15037691, 0.37865046, 0.99402371, 0.17325722]])
>>> b[a[:,4] == 0]
array([[ 0.59533278, 0.15037691, 0.37865046, 0.99402371, 0.17325722]])
>>>
To get a find like result instead of using logical indexing use np.where, which returns a tuple of arrays which serve as indices into each dimension:
>>> indices = np.where(a[:, 4] == 0)
>>> b[indices[0]]
array([[ 0.59533278, 0.15037691, 0.37865046, 0.99402371, 0.17325722]])
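For a one-dimensional condition, np.flatnonzero gives the index array directly, which is arguably the closest match to MATLAB's find. A small sketch with made-up matrices A and B (not the data above):

```python
import numpy as np

A = np.array([[4, 9, 7, 2, 9],
              [1, 9, 5, 0, 0],
              [6, 2, 0, 0, 8],
              [6, 3, 2, 4, 0]])
B = np.arange(20).reshape(4, 5)

# Indices of the rows whose 5th column (index 4) equals 0.
rows = np.flatnonzero(A[:, 4] == 0)
print(rows)

# Select the corresponding rows of the other matrix.
selected = B[rows]
print(selected)
```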
>>>
