Numpy confusion with boolean indexing and broadcasting - python

I have a numpy array y:
y = np.arange(35).reshape(5,7)
5 rows, 7 columns. Now I create a boolean 1-d 5-element mask which picks out the rows I want (following the doc at numpy indexing):
b = np.array([False, False, True, False, True])
Then y[b] returns the rows of interest. But the doc is confusing: it says
Boolean arrays must be of the same shape as the array being indexed, or broadcastable to the same shape.
And b is not broadcastable with y:
>>> np.broadcast_arrays(y, b)
ValueError: shape mismatch: two or more arrays have incompatible dimensions on axis 1.
because broadcasting works by matching the trailing dimensions and working backwards.
In this case of boolean indexing, there's clearly some different rule at work; is the doc wrong or am I just misunderstanding it? If I did what the doc suggests and make b be of shape (5,1) it doesn't pick out the rows; it just gets the first column of each selected row and returns that as a 1-d array.
I suspect the real rule is that the boolean object's dims must match the original array's initial dims, and it selects the values of each of those dims where the boolean is true, returning all elements of any trailing dims. But I can't find anything official that says that's how it works.
So my question is, (a) am I doing it right and the doc is just wrong? (b) am I reading the wrong doc? (c) is there a better/different way to do it or to understand it?

The reduction of y[b] appears to do what you want. I don't think it's out of step with the docs and there is not a special case for boolean vs numbers for broadcasting here.
y[b] # gives what you want
# broadcast booleans
np.broadcast_arrays(b, y) #gives the error you saw
np.broadcast_arrays(b[:,np.newaxis], y) #broadcasts.
# same as broadcast numbers
np.broadcast_arrays(np.arange(5), y) #ALSO gives the error you saw
np.broadcast_arrays(np.arange(5)[:,np.newaxis], y) #broadcasts.
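The rule the question guesses at can be checked directly. The sketch below assumes a current NumPy release; the names rows, same_rows, full_mask and flat are made up for illustration. It shows that a 1-d mask over the first axis behaves like the integer indices of its True entries, and that a mask with the full shape of y returns a flat array.

```python
import numpy as np

y = np.arange(35).reshape(5, 7)
b = np.array([False, False, True, False, True])

# A 1-d mask on the first axis selects whole rows; it is equivalent
# to indexing with the positions where the mask is True.
rows = y[b]
same_rows = y[np.nonzero(b)[0]]
print(rows.shape)                       # (2, 7)
print(np.array_equal(rows, same_rows))  # True

# A mask with the same shape as y selects individual elements and
# always returns a 1-d array.
full_mask = np.broadcast_to(b[:, np.newaxis], y.shape)
flat = y[full_mask]
print(flat.shape)                       # (14,)
```

So the mask only has to match the leading dimension(s) of the array; the trailing dimensions are returned whole.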

Related

Identity index for NumPy array

Suppose I have a flat NumPy array a and want to define an index array i to index a with and thus obtain a again by a[i].
I tried
import numpy as np
a = np.array([1,2]).reshape(-1)
i = True
But this does not preserve shape: a[i] has shape (1, 2) while a has shape (2,).
I know I could reshape a[i] or use i = np.full_like(a, True, dtype=bool). I want neither: the reshape is unnecessary if i is, by some conditional definition, sometimes not plain True but a boolean array matching the shape of a. The second approach means I need a different i for arrays of different shapes.
So... is there something built-in in NumPy to just get the array back when used as an index?
Numpy can not preserve the shape of a boolean masked result because it may be ragged. When you pass in a single boolean scalar, things get special-case weird.
You must therefore use a fancy index. With a fancy index, the shape of the result is exactly the shape of the index. For a 1-D array the following is fine:
i = np.arange(a.size)
For more dimensions, you'll want to create a full indexing tuple, using np.indices for example. The elements of the tuple can broadcast to the final desired shape, so you can use sparse=True:
i = np.indices(a.shape, sparse=True)
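If you want i to be a numpy array, you can set sparse=False, in which case i will be of shape (a.ndim, *a.shape). Index with a[tuple(i)] in that case, since a[i] would treat the whole array as a single index into the first axis.
If you want to cheat, you can use slices. slice(None) is the object that the literal index : stands for: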
i = (slice(None),) * a.ndim
Or just index the first dimension only, which returns the entire array:
i = slice(None)
Or if you're feeling really lazy, use Ellipsis directly. This is the object that stands for the literal ..., meaning :, :, etc, as many times as necessary:
i = Ellipsis
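Going back to the boolean mask option, you can get the same effect with a separate mask for each dimension. For a flat a, a tuple of plain masks is enough, but for more dimensions side-by-side boolean arrays are converted to integer indices and broadcast against each other, so wrap them in np.ix_ to get the full outer product:
i = np.ix_(*(np.ones(k, dtype=bool) for k in a.shape))
You could save some memory by only allocating the largest shape and creating views:
s = np.ones(max(a.shape), dtype=bool)
i = np.ix_(*(s[:k] for k in a.shape))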
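A quick way to compare the candidates is to run them against a small 2-d array. This is only a sketch: the dictionary candidates and its labels are made up for illustration, and np.indices(..., sparse=True) assumes NumPy 1.17 or later.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Hypothetical collection of "identity" indices to compare.
candidates = {
    "np.indices (sparse)": np.indices(a.shape, sparse=True),
    "tuple of slices": (slice(None),) * a.ndim,
    "single slice": slice(None),
    "Ellipsis": Ellipsis,
}

for name, i in candidates.items():
    result = a[i]
    # Each candidate should give back the shape and values of a.
    print(name, result.shape, np.array_equal(result, a))

# The scalar True from the question adds a leading axis instead.
print(a[True].shape)  # (1, 2, 3)
```

The slice-based and Ellipsis variants return views, while the np.indices variant is a fancy index and returns a copy.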

NumPy: is assignment of a scalar to a slice broadcasting?

I know in Python,
[1,2,3][0:2]=7
doesn't work because the right side must be an iterable.
However, the same thing works for NumPy ndarrays:
a=np.array([1,2,3])
a[0:2]=9
a
Is this the same mechanism as broadcasting? On https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html, it is said broadcasting is only for arithmetic operations.
Yes, assignment follows the same broadcasting rules, because you can also assign an array to another array's items. This requires, however, that the second array's shape be broadcastable to the shape of the destination slice or array.
This is also mentioned in Assigning values to indexed arrays documentation:
As mentioned, one can select a subset of an array to assign to using a single index, slices, and index and mask arrays. The value being assigned to the indexed array must be shape consistent (the same shape or broadcastable to the shape the index produces).
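A small sketch of what "shape consistent" means in practice; the array name grid and the values are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
a[0:2] = 9                 # scalar is broadcast to the slice's shape (2,)
print(a)                   # [9 9 3]

grid = np.zeros((2, 3), dtype=int)
grid[:, :] = np.array([10, 20, 30])   # shape (3,) broadcasts to (2, 3)
grid[:, 1:] = np.array([[7], [8]])    # shape (2, 1) broadcasts to (2, 2)
print(grid)

# A shape that cannot be broadcast to the destination is rejected.
try:
    grid[:, :] = np.array([1, 2])     # shape (2,) vs. destination (2, 3)
except ValueError as exc:
    print("not broadcastable:", exc)
```

The same rules apply as for arithmetic: trailing dimensions are matched first, and a dimension of size 1 stretches to fit.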

numpy.argmax() on multi-dim data does not return an ndarray

The documentation of numpy.argmax states that it returns the index position of the maximum value found in the array and it does so by returning an ndarray of int with the same shape as the input array:
numpy.argmax(a, axis=None, out=None)
Returns the indices of the maximum values along an axis.
Parameters:
a : array_like
Input array.
axis : int, optional
By default, the index is into the flattened array, otherwise along the specified axis.
out : array, optional
If provided, the result will be inserted into this array. It should be of the appropriate shape and dtype.
Returns:
index_array : ndarray of ints
Array of indices into the array. It has the same shape as a.shape with the dimension along axis removed.
Why then does the command return for a 2-dim array a single int?
a = np.arange(6).reshape(2,3)
print np.argmax(a)
5
It seems to me the reshape acts as a new view and the argmax command still uses the underlying 1-dim array. I tried copying the array into an initialized 2-dim array with the same end result.
This is Python 2.7.12 but since they have it in the documentation I believe this is the expected behavior. Am I missing anything? How can I get the ndarray returned?
From the documentation you quoted:
By default, the index is into the flattened array
It's providing the integer such that a.flat[np.argmax(a)] == np.max(a). If you want to convert this into an index tuple, you can use the np.unravel_index function linked in the "See also" section.
The part about "the same shape as a.shape with the dimension along axis removed" applies if you actually provide an axis argument. Also, note the "with the dimension along axis removed" part; even if you specify axis, it won't return an array of the same shape as the input like you were expecting.
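To make both points concrete, here is a short sketch using the array from the question; the variable names flat_index and position are made up for illustration.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

# Without axis, argmax returns one index into the flattened array.
flat_index = np.argmax(a)
print(a.flat[flat_index] == a.max())   # True

# Convert the flat index into a (row, column) position.
position = np.unravel_index(flat_index, a.shape)
print(a[position] == a.max())          # True

# With axis, the result is an array whose shape is a.shape with
# that axis removed.
print(np.argmax(a, axis=0).shape)      # (3,)
print(np.argmax(a, axis=1).shape)      # (2,)
```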

transpose manipulation with indexing over multiple dimensions

I have trouble with numpy ndarrays when I index multiple dimensions at the same time:
> a = np.random.random((25,50,30))
> b = a[0,:,np.arange(30)]
> print(b.shape)
Here I expected the result to be (50,30), but actually the real result is (30,50) !
Can someone explain this, please? I don't get it, and this behavior introduces tons of bugs in my code. Thank you :)
Additional information :
Indexing in one dimension works perfectly :
> b = a[0,:,:]
> print(b.shape)
(50,30)
And when I have the transposition :
> a[0,:,0] == b[0,:]
True
From the numpy docs:
The easiest way to understand the situation may be to think in terms of the result shape. There are two parts to the indexing operation, the subspace defined by the basic indexing (excluding integers) and the subspace from the advanced indexing part. Two cases of index combination need to be distinguished:
The advanced indexes are separated by a slice, ellipsis or newaxis. For example x[arr1, :, arr2].
The advanced indexes are all next to each other. For example x[..., arr1, arr2, :] but not x[arr1, :, 1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that. In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
The first case applies to your expression, because the integer 0 counts as an advanced index here and is separated from np.arange(30) by a slice:
b = a[0,:,np.arange(30)]
When you use a list or array of integers to index a numpy array, you're using something known as fancy indexing. The rules for fancy indexing are not as straightforward as one might think, which is why your array has an unexpected shape. To avoid surprises, I'd recommend sticking with slicing. So, you should change your code to:
a = np.random.random((25,50,30))
b = a[0,:,:]
print(b.shape)
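If an index array really is needed for the last axis, one option (a sketch, not the only way) is to apply the integer index first and the array index in a second step, so the two advanced indexes never meet in one expression. The names cols, surprising and expected are made up for illustration.

```python
import numpy as np

a = np.random.random((25, 50, 30))
cols = np.arange(30)

# Integer and array index separated by a slice: the advanced
# dimension comes first in the result.
surprising = a[0, :, cols]
print(surprising.shape)    # (30, 50)

# Indexing in two steps keeps the axes in their original order.
expected = a[0][:, cols]
print(expected.shape)      # (50, 30)

# The two results are transposes of each other.
print(np.array_equal(surprising.T, expected))  # True
```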

Logical indices in numpy throwing exception [duplicate]

I am trying to write some code that uses logical numpy arrays to index a larger array, similar to how MATLAB allows array indexing with logical arrays.
import numpy as np
m = 4
n = 4
unCov = np.random.randint(10, size = (m,n) )
rowCov = np.zeros( m, dtype = bool )
colCov = np.ones( n, dtype = bool )
>>> unCov[rowCov, rowCov]
[] # as expected
>>> unCov[colCov, colCov]
[0 8 3 3] # diagonal values of unCov, as expected
>>> unCov[rowCov, colCov]
ValueError: shape mismatch: objects cannot be broadcast to a single shape
For this last evaluation, I expected an empty array, similar to what MATLAB returns. I'd rather not have to check rowCov/colCov for True elements prior to indexing. Why is this happening, and is there a better way to do this?
As I understand it, numpy will translate your 2d logical indices to actual index vectors: arr[[True,False],[False,True]] would become arr[0,1] for an ndarray of shape (2,2). However, in your last case one of the index arrays is all False, hence it corresponds to an index array of length 0. This is paired with the other, all-True index vector, corresponding to an index array of length 4.
From the numpy manual:
If the index arrays do not have the same shape, there is an attempt to
broadcast them to the same shape. If they cannot be broadcast to the
same shape, an exception is raised:
In your case, the error is exactly due to this:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-1411-28e41e233472> in <module>()
----> 1 unCov[colCov,rowCov]
IndexError: shape mismatch: indexing arrays could not be broadcast together with shapes (4,) (0,)
MATLAB, on the other hand, automatically returns an empty array if the index array is empty along any given dimension.
This actually highlights a fundamental difference between the logical indexing in MATLAB and numpy. In MATLAB, vectors in subscript indexing always slice out a subarray. That is, both
arr([1,2],[1,2])
and
arr([true,true],[true,true])
will return the 2 x 2 submatrix of the matrix arr. If the logical index vectors are shorter than the given dimension of the array, the missing indexing elements are assumed to be false. Fun fact: the index vector can also be longer than the given dimension, as long as the excess elements are all false. So the above is also equivalent to
arr([true,true,false,false],[true,true])
and
arr([true,true,false,false,false,false,false],[true,true])
for a 4 x 4 array (for the sake of argument).
In numpy, however, indexing with boolean-valued numpy arrays in this way will try to extract a vector. Furthermore, the boolean index vectors should be the same length as the dimension they are indexing into. In your 4 x 4 example,
unCov[np.array([True,True]),np.array([True,True])]
and
unCov[np.array([True,True,False,False,False]),np.array([True,True,False,False,False])]
both return the first two diagonal elements, so not a submatrix but rather a vector. Furthermore, they also give the less-than-encouraging warning along the lines of
/usr/bin/ipython:1: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 4 but corresponding boolean dimension is 5
So, in numpy, your logical indexing vectors should be the same length as the corresponding dimensions of the ndarray. And then what I wrote above holds true: the logical values are translated into indices, and the result is expected to be a vector. The length of this vector is the number of True elements in every index vector, so if your boolean index vectors have a different number of True elements, then the referencing doesn't make sense, and you get the error that you get.
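If the MATLAB-style submatrix is what you are after, one option is np.ix_, which accepts boolean vectors and builds an outer-product index from them. The sketch below uses made-up names (un_cov, row_some, col_some) and a seeded random generator for the data.

```python
import numpy as np

rng = np.random.default_rng(0)
un_cov = rng.integers(10, size=(4, 4))
row_cov = np.zeros(4, dtype=bool)   # no rows selected
col_cov = np.ones(4, dtype=bool)    # all columns selected

# np.ix_ turns the masks into an open mesh, so the selection is a
# submatrix rather than a vector of paired elements.
sub = un_cov[np.ix_(row_cov, col_cov)]
print(sub.shape)    # (0, 4)

row_some = np.array([True, False, True, False])
col_some = np.array([False, True, True, True])
block = un_cov[np.ix_(row_some, col_some)]
print(block.shape)  # (2, 3)

# Same result as masking rows first and columns second.
print(np.array_equal(block, un_cov[row_some][:, col_some]))  # True
```

With an all-False mask the result is simply empty along that dimension, so no check for True elements is needed before indexing.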
