2d numpy mask not working as expected - python

I'm trying to turn a 2x3 numpy array into a 2x2 array by removing select indexes.
I think I can do this with a mask array with true/false values.
Given
[ 1, 2, 3],
[ 4, 1, 6]
I want to remove one element from each row to give me:
[ 2, 3],
[ 4, 6]
However this method isn't working quite like I would expect:
import numpy as np
in_array = np.array([
[ 1, 2, 3],
[ 4, 1, 6]
])
mask = np.array([
[False, True, True],
[True, False, True]
])
print(in_array[mask])
Gives me:
[2 3 4 6]
Which is not what I want. Any ideas?

The only thing 'wrong' with that result is its shape: 1d rather than 2d. But what if your mask was
mask = np.array([
[False, True, False],
[True, False, True]
])
That has 1 value in the first row and 2 in the second. It couldn't return that as a 2d array, could it?
So the default behavior when masking like this is to return a 1d, or raveled, result.
Boolean indexing like this is effectively indexing with the result of np.where:
In [19]: np.where(mask)
Out[19]: (array([0, 0, 1, 1], dtype=int32), array([1, 2, 0, 2], dtype=int32))
In [20]: in_array[_]
Out[20]: array([2, 3, 4, 6])
It finds the elements of the mask which are true, and then selects the corresponding elements of the in_array.
Maybe the transpose of where is easier to visualize:
In [21]: np.argwhere(mask)
Out[21]:
array([[0, 1],
[0, 2],
[1, 0],
[1, 2]], dtype=int32)
and indexing iteratively:
In [23]: for ij in np.argwhere(mask):
...:     print(in_array[tuple(ij)])
...:
2
3
4
6
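For the mask in the question, every row keeps the same number of elements, so the flat result can be reshaped back to 2d. A minimal sketch, assuming that equal-count property holds (the names kept and result are only for illustration):

```python
import numpy as np

in_array = np.array([[1, 2, 3],
                     [4, 1, 6]])
mask = np.array([[False, True, True],
                 [True, False, True]])

# Number of True values per row; reshaping is only valid if they all match.
kept = mask.sum(axis=1)
assert (kept == kept[0]).all(), "rows keep different numbers of elements"

# Boolean indexing flattens in row-major order, so reshape restores the rows.
result = in_array[mask].reshape(in_array.shape[0], kept[0])
print(result)
```

If the rows keep different numbers of elements, the assertion fails and a list of per-row arrays is the more natural result.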

Related

Rows of sparse matrix where no column is zero

I have a matrix like this:
A = sp.csr_matrix(np.array(
[[1, 1, 2, 1],
[0, 0, 2, 0],
[1, 4, 1, 1],
[0, 1, 0, 0]]))
I want to get all the rows where all columns are nonzero, so I can then get their sum. Either as an array:
rows = [True, False, True, False]
result = A[rows].sum()
Or as indices:
rows = [0, 2]
result = A[rows].sum()
I am stuck however at the first part, figuring out which rows to include in the sum, as most results seem to be looking for the opposite (rows where all columns are zero).
In [35]: from scipy import sparse
In [36]: A = sparse.csr_matrix(np.array(
...: [[1, 1, 2, 1],
...: [0, 0, 2, 0],
...: [1, 4, 1, 1],
...: [0, 1, 0, 0]]))
In [37]: A
Out[37]:
<4x4 sparse matrix of type '<class 'numpy.int64'>'
with 10 stored elements in Compressed Sparse Row format>
Sparse matrices don't provide 'all/any' kinds of operations, because those treat 0s as significant values.
all on the dense equivalent works nicely:
In [41]: A.A.all(axis=1)
Out[41]: array([ True, False, True, False])
On the sparse matrix we can convert the dtype to boolean and sum along the axis, then test whether each row sum equals the full column count:
In [42]: A.astype(bool).sum(axis=1)
Out[42]:
matrix([[4],
[1],
[4],
[1]])
In [43]: A.astype(bool).sum(axis=1).A1==4
Out[43]: array([ True, False, True, False])
Notice that the sparse sum returns a np.matrix. I used A1 to turn that into a 1d array.
If the matrix isn't too large, working with the dense array may be faster. Sparse operations like sum are actually performed with matrix multiplication.
In [51]: A.astype(bool)*np.ones(4,int)
Out[51]: array([4, 1, 4, 1])
Or we could convert it to lil format, and look at the length of the 'rows':
In [67]: A.tolil().data
Out[67]:
array([list([1, 1, 2, 1]), list([2]), list([1, 4, 1, 1]), list([1])],
dtype=object)
In [68]: [len(i) for i in A.tolil().data]
Out[68]: [4, 1, 4, 1]
But wait, there's more. The indptr attribute of the csr is:
In [69]: A.indptr
Out[69]: array([ 0, 4, 5, 9, 10], dtype=int32)
In [70]: np.diff(A.indptr)
Out[70]: array([4, 1, 4, 1], dtype=int32)
I've omitted some test timings, but this last is clearly the fastest!
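Putting the indptr idea together with the row selection the question asks for might look like the sketch below. It assumes the matrix is in CSR format and calls eliminate_zeros() first, since explicitly stored zeros would otherwise be counted as nonzero entries; the names nnz_per_row and full_rows are only illustrative.

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1, 1, 2, 1],
                                [0, 0, 2, 0],
                                [1, 4, 1, 1],
                                [0, 1, 0, 0]]))

# Assumption: drop any explicitly stored zeros so that the stored-element
# count per row really is the nonzero count.
A.eliminate_zeros()

# Differences of indptr give the number of stored elements in each row.
nnz_per_row = np.diff(A.indptr)

# A row has no zero column when it stores one element per column.
full_rows = np.flatnonzero(nnz_per_row == A.shape[1])
result = A[full_rows].sum()
print(full_rows, result)
```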
It is a bit easier to do for numpy arrays than for sparse ones. If you do not mind converting to numpy as an intermediate step, you can get the right rows via
(A.toarray() != 0).all(axis=1)
to produce
array([ True, False, True, False])
and then use it in indexing A as such:
A[(A.toarray() != 0).all(axis=1),:].sum()
returns 12

Convert a list to a numpy mask array

Given a list like indice = [1, 0, 2] and a dimension m = 3, I want to get a mask array like this:
>>> import numpy as np
>>> mask_array = np.array([ [1, 1, 0], [1, 0, 0], [1, 1, 1] ])
>>> mask_array
[[1, 1, 0],
[1, 0, 0],
[1, 1, 1]]
Since m = 3, mask_array has 3 columns (axis=1), and its number of rows equals the length of indice.
The rule for converting indice to mask_array is: in each row, set to 1 the items whose column index is less than or equal to the corresponding entry of indice. For example, indice[0] = 1, so the first row is [1, 1, 0], given that the dimension is 3.
Are there any NumPy APIs that can be used to do this?
Sure, just use broadcasting with arange(m). Make sure to use an np.array for the indices, not a list:
>>> indice = [1, 0, 2]
>>> m = 3
>>> np.arange(m) <= np.array(indice)[..., None]
array([[ True, True, False],
[ True, False, False],
[ True, True, True]])
Note, the [..., None] just reshapes the indices array so that the broadcasting works like we want, like this:
>>> indices = np.array(indice)
>>> indices
array([1, 0, 2])
>>> indices[...,None]
array([[1],
[0],
[2]])
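To see why the broadcasting works, it can help to spell out the shapes. The sketch below is a step-by-step version of the same comparison; the intermediate names cols and col_vec are made up for illustration, and the final astype(int) is only there to get the 0/1 form shown in the question.

```python
import numpy as np

indice = np.array([1, 0, 2])
m = 3

cols = np.arange(m)        # shape (3,):   [0, 1, 2]
col_vec = indice[:, None]  # shape (3, 1): one threshold per row

# (3,) compared with (3, 1) broadcasts to (3, 3):
# entry [i, j] is True when j <= indice[i].
mask = cols <= col_vec

mask_array = mask.astype(int)
print(mask_array)
```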

Use numpy.argwhere to obtain the matching values in an np.array

I'd like to use np.argwhere() to obtain the values in an np.array.
For example:
z = np.arange(9).reshape(3,3)
[[0 1 2]
[3 4 5]
[6 7 8]]
zi = np.argwhere(z % 3 == 0)
[[0 0]
[1 0]
[2 0]]
I want this array: [0, 3, 6] and did this:
t = [z[tuple(i)] for i in zi] # -> [0, 3, 6]
I assume there is an easier way.
Why not simply use masking here:
z[z % 3 == 0]
For your sample matrix, this will generate:
>>> z[z % 3 == 0]
array([0, 3, 6])
If you index with a boolean array of the same shape, you get a 1d array of the elements at the positions where the boolean array is True.
This is also more efficient, since the filtering happens at the numpy level (whereas a list comprehension runs at the Python interpreter level).
Source for argwhere
def argwhere(a):
    """
    Find the indices of array elements that are non-zero, grouped by element.
    ...
    """
    return transpose(nonzero(a))
np.where, when called with only a condition, is the same as np.nonzero.
In [902]: z=np.arange(9).reshape(3,3)
In [903]: z%3==0
Out[903]:
array([[ True, False, False],
[ True, False, False],
[ True, False, False]], dtype=bool)
In [904]: np.nonzero(z%3==0)
Out[904]: (array([0, 1, 2], dtype=int32), array([0, 0, 0], dtype=int32))
In [905]: np.transpose(np.nonzero(z%3==0))
Out[905]:
array([[0, 0],
[1, 0],
[2, 0]], dtype=int32)
In [906]: z[[0,1,2], [0,0,0]]
Out[906]: array([0, 3, 6])
z[np.nonzero(z%3==0)] is equivalent to using I,J as indexing arrays:
In [907]: I,J =np.nonzero(z%3==0)
In [908]: I
Out[908]: array([0, 1, 2], dtype=int32)
In [909]: J
Out[909]: array([0, 0, 0], dtype=int32)
In [910]: z[I,J]
Out[910]: array([0, 3, 6])
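The three routes discussed here (boolean mask, nonzero, and argwhere) select the same values. A small sketch comparing them side by side, with variable names chosen only for illustration:

```python
import numpy as np

z = np.arange(9).reshape(3, 3)
cond = z % 3 == 0

# 1. Boolean mask indexing.
by_mask = z[cond]

# 2. Tuple of index arrays from nonzero (same as one-argument where).
by_nonzero = z[np.nonzero(cond)]

# 3. argwhere returns one (row, col) pair per match; transposing it and
#    wrapping it in a tuple turns it back into the nonzero form.
zi = np.argwhere(cond)
by_argwhere = z[tuple(zi.T)]

print(by_mask, by_nonzero, by_argwhere)
```

The mask form is the shortest; argwhere is mainly useful when the (row, col) pairs themselves are needed.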

Finding a matching row in a numpy matrix

Using numpy, I have a matrix called points.
points
=> matrix([[0, 2],
[0, 0],
[1, 3],
[4, 6],
[0, 7],
[0, 3]])
If I have the tuple (1, 3), I want to find the row in points that matches these numbers (in this case, the row index is 2).
I tried using np.where:
np.where(points == (1, 3))
=> (array([2, 2, 5]), array([0, 1, 1]))
What is the meaning of this output? Can it be used to find the row where (1, 3) occurs?
You just needed to look for ALL matches along each row, like so:
np.where((a==(1,3)).all(axis=1))[0]
Steps involved, using the given sample:
In [17]: a # Input matrix
Out[17]:
matrix([[0, 2],
[0, 0],
[1, 3],
[4, 6],
[0, 7],
[0, 3]])
In [18]: (a==(1,3)) # Matrix of broadcasted matches
Out[18]:
matrix([[False, False],
[False, False],
[ True, True],
[False, False],
[False, False],
[False, True]], dtype=bool)
In [19]: (a==(1,3)).all(axis=1) # Look for ALL matches along each row
Out[19]:
matrix([[False],
[False],
[ True],
[False],
[False],
[False]], dtype=bool)
In [20]: np.where((a==(1,3)).all(1))[0] # Use np.where to get row indices
Out[20]: array([2])
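The same idea with a plain ndarray instead of np.matrix may be a little simpler, because all(axis=1) then returns a 1d boolean array. This is a sketch under that assumption; target, row_matches and row_idx are illustrative names.

```python
import numpy as np

points = np.array([[0, 2],
                   [0, 0],
                   [1, 3],
                   [4, 6],
                   [0, 7],
                   [0, 3]])
target = (1, 3)

# Compare every row against the target (broadcast), then require that
# all columns in a row match.
row_matches = (points == target).all(axis=1)

# Indices of the rows that match completely.
row_idx = np.flatnonzero(row_matches)
print(row_idx)
```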

python numpy get masked data without flattening

How do I get the masked data only without flattening the data into a 1D array? That is, suppose I have a numpy array
a = np.array([[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])
and I mask all elements greater than 1,
b = ma.masked_greater(a, 1)
masked_array(data =
[[0 1 -- --]
[0 1 -- --]
[0 1 -- --]],
mask =
[[False False True True]
[False False True True]
[False False True True]],
fill_value = 999999)
How do I get only the masked elements without flattening the output? That is, I need to get
array([[ 2, 3],
[2, 3],
[2, 3]])
Let's try an example that produces a ragged result, with a different number of 'masked' values in each row.
In [292]: a=np.arange(12).reshape(3,4)
In [293]: a
Out[293]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
In [294]: a<6
Out[294]:
array([[ True, True, True, True],
[ True, True, False, False],
[False, False, False, False]], dtype=bool)
Indexing with the mask gives the flattened list of values that match this condition. It can't return a regular 2d array, so it has to fall back to a flattened array.
In [295]: a[a<6]
Out[295]: array([0, 1, 2, 3, 4, 5])
Do the same thing, but iterating row by row:
In [296]: [a1[a1<6] for a1 in a]
Out[296]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)]
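Trying to make an array of the result produces an object dtype array, which is little more than a list in an array wrapper (recent NumPy versions raise an error for ragged input like this unless dtype=object is passed explicitly):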
In [297]: np.array([a1[a1<6] for a1 in a])
Out[297]: array([array([0, 1, 2, 3]), array([4, 5]), array([], dtype=int32)], dtype=object)
The fact that the result is ragged is a good indicator that it is difficult, if not impossible, to perform that action with one vectorized operation.
Here's another way of producing the list of arrays. With sum I find how many elements there are in each row, and then use this to split the flattened array into sublists.
In [320]: idx=(a<6).sum(1).cumsum()[:-1]
In [321]: idx
Out[321]: array([4, 6], dtype=int32)
In [322]: np.split(a[a<6], idx)
Out[322]: [array([0, 1, 2, 3]), array([4, 5]), array([], dtype=float64)]
It does use 'flattening'. And for these small examples it is slower than the row iteration. (Don't worry about the empty float array; split had to construct something and used a default dtype.)
A different mask, without empty rows, clearly shows the equivalence of the 2 approaches.
In [344]: mask=np.tri(3,4,dtype=bool) # lower tri
In [345]: mask
Out[345]:
array([[ True, False, False, False],
[ True, True, False, False],
[ True, True, True, False]], dtype=bool)
In [346]: idx=mask.sum(1).cumsum()[:-1]
In [347]: idx
Out[347]: array([1, 3], dtype=int32)
In [348]: [a1[m] for a1,m in zip(a,mask)]
Out[348]: [array([0]), array([4, 5]), array([ 8, 9, 10])]
In [349]: np.split(a[mask],idx)
Out[349]: [array([0]), array([4, 5]), array([ 8, 9, 10])]
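The two cases (same count per row versus ragged) could be wrapped in one helper. The function below is hypothetical, not a NumPy API: it returns a 2d array when every row selects the same number of elements, as in the question, and otherwise falls back to the split approach above. It assumes a 2d input with at least one row.

```python
import numpy as np

def select_by_mask(a, mask):
    """Hypothetical helper: per-row selection without losing the row structure."""
    counts = mask.sum(axis=1)  # selected elements per row
    flat = a[mask]             # row-major flattened selection
    if np.all(counts == counts[0]):
        # Regular case: every row keeps counts[0] elements.
        return flat.reshape(a.shape[0], counts[0])
    # Ragged case: split the flat result at the row boundaries.
    return np.split(flat, counts.cumsum()[:-1])

a = np.array([[0, 1, 2, 3]] * 3)
regular = select_by_mask(a, a > 1)

b = np.arange(12).reshape(3, 4)
ragged = select_by_mask(b, np.tri(3, 4, dtype=bool))

print(regular)
print(ragged)
```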
Zip the two lists together, and then filter them out:
data = [[0, 1, 1, 1], [0, 1, 1, 1], [0, 1, 1, 1]]
mask = [[False, False, True, True],
[False, False, True, True],
[False, False, True, True]]
zipped = zip(data, mask) # [([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True]), ([0, 1, 1, 1], [False, False, True, True])]
masked = []
for lst, row_mask in zipped:
    pairs = zip(lst, row_mask) # [(0, False), (1, False), (1, True), (1, True)]
    masked.append([num for num, b in pairs if b])
print(masked) # [[1, 1], [1, 1], [1, 1]]
or more succinctly:
zipped = [...]
masked = [[num for num, b in zip(lst, mask) if b] for lst, mask in zipped]
print(masked) # [[1, 1], [1, 1], [1, 1]]
Because np.where is vectorized, you can use it to keep the items of the array where the mask is True and put None (or some arbitrary value) in the places where a value has been masked out. Note that None forces a less compact object array, so you may want to use -1 or some other special value instead.
import numpy as np
a = np.array([
[0, 1, 2, 3],
[0, 1, 2, 3],
[0, 1, 2, 3]])
mask = np.array([[ True, True, True, True],
[ True, False, True, True],
[False, True, True, False]])
np.where(mask, a, None)
This produces
array([[0, 1, 2, 3],
[0, None, 2, 3],
[None, 1, 2, None]], dtype=object)
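Using an integer sentinel instead of None keeps a compact integer dtype. A minimal sketch, assuming -1 never occurs as a real value in the data (the name filled is only for illustration):

```python
import numpy as np

a = np.array([[0, 1, 2, 3],
              [0, 1, 2, 3],
              [0, 1, 2, 3]])
mask = np.array([[True, True, True, True],
                 [True, False, True, True],
                 [False, True, True, False]])

# Keep a where mask is True; write the sentinel -1 everywhere else.
# Assumption: -1 is not a legitimate value in a.
filled = np.where(mask, a, -1)
print(filled)
print(filled.dtype)
```

The shape of a is preserved, and the masked-out positions can later be found again with filled == -1.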
