Dimensions of array after multidimensional index slicing - python

I want to slice a multidimensional numpy array (>2 dimensions) along 2 of its axes using index slicing. What are the rules for where each of its original dimensions end up?
To illustrate my problem, let me provide an example. Say we have a 4D array:
import numpy as np
a = np.arange(2*3*4*5).reshape(2,3,4,5)
I'll create a tuple of indices using numpy.where, for slicing along axes 1 and 3:
mask = np.where(np.random.rand(3,5) > 0.5)
This will pick out random slices from my array a. Let's say it returned tuples of length 7.
To preserve the remaining dimensions I will use slice(None) objects:
b = a[(slice(None), mask[0], slice(None), mask[1])]
This changed the shape:
>>> a.shape
(2, 3, 4, 5)
>>> b.shape
(7, 2, 4)
The axes that were untouched (i.e. sliced using the slice(None) object) appear to have been preserved, whereas the sliced axes are destroyed and the resulting axis is moved to the front.
However, this is not always the case. When I apply a mask to axes 1 and 2:
mask2 = np.where(np.random.rand(3,4) > 0.5)
c = a[(slice(None), mask[0], mask[1], slice(None))]
I observe the following (numpy.where has returned tuples of length 7 again):
>>> c.shape
(2, 7, 5)
The axis resulting from the axes that have been destroyed by the slicing did not move to the front this time.
My guess is that it is related to whether the sliced axes are adjacent or not, but I want to know from what rules this behavior emerges.

https://docs.scipy.org/doc/numpy-1.15.4/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
Your where masks will produce a 1d (7,) shape array if applied to a 2d array, the values where the condition is true. You phrase that as 'destroying' a pair of axes.
In the second case that 7 can be placed between the 2 and 5.
But in the first it's ambiguous because of the slice in the middle (the non adjacency) - the fall back rule is to put it first, and order the slices after. In other words, instead of trying to choose between a (2,7,4) and (2,4,7) order, it chooses (7,2,4).
The ambiguity is clear in this case, and the default reasonable. It's more complicated with one or more of the dimensions is eliminated by a scalar index.

Related

Extract 2d ndarray from arbitrarily dimensional ndarray using index arrays

I want to extract parts of an numpy ndarray based on arrays of index positions for some of the dimensions. Let me show this on an example
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary, it could as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:] . Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4 . My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example same as dummy[X]
Iteratively reducing dummy by doing something like dummy = dummy[X[:,i],:] with i an iterator over the number of columns of X also does not reduce the ndarray in the example past 3x2x100
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?
I will try to provide some explainability to #Michael Szczesny answer.
First, notice that if you have an np.array with dimension n and pass m indexes where m<n, then it will be the same as using : in the dimensions >=m. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
It would be the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
dummy[([0, 1], [0, 0])] == np.array([dummy[(0,0)], dummy[(1,0)]])
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
You are here getting each dimension as an array, and then you will get:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which for me, makes more sense
as Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] are the indices of the first dimension of dummy and X[:,1] are the indices of the second dimension of dummy, if you transpose X (X.T) you'll have the the indices of the first dimension of dummy as X.T[0] and the indices of the second dimension of dummy as X.T[1].
Now to slice dummy as you want, you can specify the indices of the first and of the second dimension in this way:
dummy[(first_dim_indices, second_dim_indices)] = dummy[(X.T[0], X.T[1])]
In order to simplify the code (and since you doesn't want to transpose the X matrix twice) you can unpack X.T in a tuple as (*X.T,) and so write X[(*X.T,)] is the same thing to write dummy[(X.T[0], X.T[1])].
This writing is also useful if you have an unfixed number of dimensions to slice trough because you will unpack from X.T as many lines as there are dimensions to slice in dummy. For example suppose you want to retrieve an 1D-array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dim[(*X.T,)] is still valid.

Subtract Mean from Multidimensional Numpy-Array

I'm currently learning about broadcasting in Numpy and in the book I'm reading (Python for Data Analysis by Wes McKinney the author has mentioned the following example to "demean" a two-dimensional array:
import numpy as np
arr = np.random.randn(4, 3)
print(arr.mean(0))
demeaned = arr - arr.mean(0)
print(demeaned)
print(demeand.mean(0))
Which effectively causes the array demeaned to have a mean of 0.
I had the idea to apply this to an image-like, three-dimensional array:
import numpy as np
arr = np.random.randint(0, 256, (400,400,3))
demeaned = arr - arr.mean(2)
Which of course failed, because according to the broadcasting rule, the trailing dimensions have to match, and that's not the case here:
print(arr.shape) # (400, 400, 3)
print(arr.mean(2).shape) # (400, 400)
Now, i have gotten it to work mostly, by substracting the mean from every single index in the third dimension of the array:
demeaned = np.ones(arr.shape)
for i in range(3):
demeaned[...,i] = arr[...,i] - means
print(demeaned.mean(0))
At this point, the returned values are very close to zero and i think, that's a precision error. Am i actually right with this thought or is there another caveat, that i missed?
Also, this doesn't seam to be the cleanest, most 'numpy'-way to achieve what i wanted to achieve. Is there a function or a principle that i can make use of to improve the code?
As of numpy version 1.7.0, np.mean, and several other functions, accept a tuple in their axis parameter. This means that you can perform the operation on the planes of the image all at once:
m = arr.mean(axis=(0, 1))
This mean will have shape (3,), with one element for each plane of the image.
If you want to subtract the means of each pixel individually, you have to remember that broadcasting aligns shape tuples on the right edge. That means that you need to insert an extra dimension:
n = arr.mean(axis=2)
n = n.reshape(*n.shape, 1)
Or
n = arr.mean(axis=2)[..., None]
Try np.apply_along_axis().
np.apply_along_axis(lambda x: x - np.mean(x), 2, arr)
Output: you get the array of the same shape where each cell is demeaned in the dimension you want (the second parameter, here it is 2).

List of simple arrays with pyplot.plot

I have some trouble to understand how pyplot.plot works.
I take a simple example: I want to plot pyplot.plot(lst2, lst2) where lst2 is a list.
The difficulty comes from the fact that each element of lst2 is an array of shape (1,1). If the elements were floating and not array, there would be no problems.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
V2 = np.array([[1]])
W2 = np.array([[2]])
print('The shape of V2 is', V2.shape)
print('The shape of W2 is', W2.shape)
lst2 = [V2, W2]
plt.plot(lst2, lst2)
plt.show
Below is the end of the error message I got:
~\Anaconda3\lib\site-packages\matplotlib\axes\_base.py in _xy_from_xy(self,x, y)
245 if x.ndim > 2 or y.ndim > 2:
246 raise ValueError("x and y can be no greater than 2-D, but have "
--> 247 "shapes {} and {}".format(x.shape, y.shape))
248
249 if x.ndim == 1:
ValueError: x and y can be no greater than 2-D, but have shapes (2, 1, 1) and (2, 1, 1)
What surprised me in the error message is the mention of an array of dimension (2,1,1). It seems like the array np.array([V2,W2]) is built when we call pyplot.plot.
My question is then what happens behind the scenes when we call pyplot.plot(x,y) with x and y list? It seems like an array with the elements of x is built (and same for y). And these arrays must have maximum 2 axis. Am I correct?
I know that if I use numpy.squeeze on V2 and W2, it would work. But I would like to understand what it happening inside pyplot.plot in the example I gave.
Take a closer look at what you're doing:
V2 = np.array([[1]])
W2 = np.array([[2]])
lst2 = [V2, W2]
plt.plot(lst2, lst2)
For some odd reason you're defining your arrays to be of shape (1,1) by using a nested pair of brackets. When you construct lst2, you stack your arrays along a new leading dimension. This has nothing do with pyplot, this is numpy.
Numpy arrays are rectangular, and they are compatible with lists of lists of ... of lists. The level of nesting determines the number of dimensions of an array. Look at a simple 2d example:
>>> M = np.arange(2*3).reshape(2,3)
>>> print(repr(M))
array([[0, 1, 2],
[3, 4, 5]])
You can for all intents and purposes think of this 2x3 matrix as two row vectors. M[0] is the same as M[0,:] and is the first row, M[1] is the same as M[1,:] is the second row. You could then also construct this array from the two rows in the following way:
row1 = [0, 1, 2]
row2 = [3, 4, 5]
lst = [row1, row2]
np.array(lst)
My point is that we took two flat lists of length 3 (which are compatible with 1d numpy arrays of shape (3,)), and concatenated them in a list. The result was compatible with a 2d array of shape (2,3). The "2" is due to the fact that we put 2 lists into lst, and the "3" is due to the fact that both lists had a length of 3.
So, when you create lst2 above, you're doing something that is equivalent to this:
lst2 = [ [[1]], [[2]] ]
You put two nested sublists into an array-compatible list, and both sublists are compatible with shape (1,1). This implies that you'll end up with a 3d array (in accordance with the fact that you have three opening brackets at the deepest level of nesting), with shape (2,1,1). Again the 2 comes from the fact that you have two arrays inside, and the trailing dimensions come from the contents.
The real question is what you're trying to do. For one, your data shouldn't really be of shape (1,1). In the most straightforward application of pyplot.plot you have 1d datasets: one for the x and one for the y coordinates of your plot. For this you can use a simple (flat) list or 1d array for both x and y. What matters is that they are of the same length.
Then when you plot the two against each other, you pass the x coordinates first, then the y coordinates second. You presumably meant something like
plt.plot(V2,W2)
In which case you'd pass 2d arrays to plot, and you wouldn't see the error caused by passing a 3d-array-like. However, the behaviour of pyplot.plot is non-trivial for 2d inputs (columns of both datasets will get plotted against one another), and you have to make sure that you really want to pass 2d arrays as inputs. But you almost never want to pass the same object as the first two arguments to pyplot.plot.

What is the best way to do multi-dimensional indexing with numpy?

I am trying to do some indexing on a 3D numpy array.
Basically I have an array phi which has shape (F,A,D); for example (5, 3, 7). Generated, for example as follows:
F=5; A=3; D=7; phi = np.random.random((F,A,D))
My goal is to be able to index over A and D, with a 2D array such as [[0,1,2],[5,5,6]], which means take the values indexed by 0 in the 3rd dimension, for the the first position in A, the values indexed by 1 in the 3rd dimension for the second position of A and so on. The result should have a shape that is (F,A,2) or (F,2,A).
This would be equivalent to manually cycling all the values of the "indexer array" such as:
phi[:,0,0]; phi[:,1,1]; phi[:,2,2]
phi[:,0,5]; phi[:,1,5]; phi[:,2,6]
Intuitively I would do something like phi[:,:,[[0,1,2],[3,3,3]]], but it's shape ends up being (5, 3, 2, 3).
Any ideas on how to obtain the correct result?
I think this is what you want
phi[:,range(A),[[0,1,2],[5,5,6]]]
Your attempt
phi[:,:,[[0,1,2],[5,5,6]]]
takes the values along the third dimension for every values of the first two dimensions, therefore you end up with a shape of (5,3,2,3).
However, according to your example you want a continous increase in the second dimension which is accomplished in my code by range(A) and numpy's broadcasting.

Confusion in array operation in numpy

I generally use MATLAB and Octave, and i recently switching to python numpy.
In numpy when I define an array like this
>>> a = np.array([[2,3],[4,5]])
it works great and size of the array is
>>> a.shape
(2, 2)
which is also same as MATLAB
But when i extract the first entire column and see the size
>>> b = a[:,0]
>>> b.shape
(2,)
I get size (2,), what is this? I expect the size to be (2,1). Perhaps i misunderstood the basic concept. Can anyone make me clear about this??
A 1D numpy array* is literally 1D - it has no size in any second dimension, whereas in MATLAB, a '1D' array is actually 2D, with a size of 1 in its second dimension.
If you want your array to have size 1 in its second dimension you can use its .reshape() method:
a = np.zeros(5,)
print(a.shape)
# (5,)
# explicitly reshape to (5, 1)
print(a.reshape(5, 1).shape)
# (5, 1)
# or use -1 in the first dimension, so that its size in that dimension is
# inferred from its total length
print(a.reshape(-1, 1).shape)
# (5, 1)
Edit
As Akavall pointed out, I should also mention np.newaxis as another method for adding a new axis to an array. Although I personally find it a bit less intuitive, one advantage of np.newaxis over .reshape() is that it allows you to add multiple new axes in an arbitrary order without explicitly specifying the shape of the output array, which is not possible with the .reshape(-1, ...) trick:
a = np.zeros((3, 4, 5))
print(a[np.newaxis, :, np.newaxis, ..., np.newaxis].shape)
# (1, 3, 1, 4, 5, 1)
np.newaxis is just an alias of None, so you could do the same thing a bit more compactly using a[None, :, None, ..., None].
* An np.matrix, on the other hand, is always 2D, and will give you the indexing behavior you are familiar with from MATLAB:
a = np.matrix([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2, 1)
For more info on the differences between arrays and matrices, see here.
Typing help(np.shape) gives some insight in to what is going on here. For starters, you can get the output you expect by typing:
b = np.array([a[:,0]])
Basically numpy defines things a little differently than MATLAB. In the numpy environment, a vector only has one dimension, and an array is a vector of vectors, so it can have more. In your first example, your array is a vector of two vectors, i.e.:
a = np.array([[vec1], [vec2]])
So a has two dimensions, and in your example the number of elements in both dimensions is the same, 2. Your array is therefore 2 by 2. When you take a slice out of this, you are reducing the number of dimensions that you have by one. In other words, you are taking a vector out of your array, and that vector only has one dimension, which also has 2 elements, but that's it. Your vector is now 2 by _. There is nothing in the second spot because the vector is not defined there.
You could think of it in terms of spaces too. Your first array is in the space R^(2x2) and your second vector is in the space R^(2). This means that the array is defined on a different (and bigger) space than the vector.
That was a lot to basically say that you took a slice out of your array, and unlike MATLAB, numpy does not represent vectors (1 dimensional) in the same way as it does arrays (2 or more dimensions).

Categories