I have a trouble with numpy ndarray when I'm indexing multiple dimensions at the same time :
> a = np.random.random((25,50,30))
> b = a[0,:,np.arange(30)]
> print(b.shape)
Here I expected the result to be (50,30), but actually the real result is (30,50) !
Can someone explain it to me please I don't get it and this feature introduces tons of bugs in my code. Thank you :)
Additional information :
Indexing in one dimension works perfectly :
> b = a[0,:,:]
> print(b.shape)
(50,30)
And when I have the transposition :
> a[0,:,0] == b[0,:]
True
From numpy docs
The easiest way to understand the situation may be to think in terms
of the result shape. There are two parts to the indexing operation,
the subspace defined by the basic indexing (excluding integers) and
the subspace from the advanced indexing part. Two cases of index
combination need to be distinguished:
The advanced indexes are separated by a slice, ellipsis or newaxis.
For example x[arr1, :, arr2].
The advanced indexes are all next to each other. For example x[...,
arr1, arr2, :] but not x[arr1, :, 1] since 1 is an advanced index in
this regard.
In the first case, the dimensions resulting from the advanced indexing
operation come first in the result array, and the subspace dimensions
after that. In the second case, the dimensions from the advanced
indexing operations are inserted into the result array at the same
spot as they were in the initial array (the latter logic is what makes
simple advanced indexing behave just like slicing).
(my emphasis) the highlighted bit applies to your
b = a[0,:,np.arange(30)]
When you use a list or array of integers to index a numpy array, you're using something that is known as Fancy Indexing. The rules for Fancy Indexing are not so straightforward as one might think. This is the reason that you're array has the wrong dimension. To avoid surprises, I'd recommend you to stick with slicing. So, you should change your code to:
a = np.random.random((25,50,30))
b = a[0,:,:]
print(b.shape)
Related
I'm doing a lot of vector algebra and want to use numpy arrays to remove any need for loops and run faster.
What I've found is that if I have a matrix A of size [N,P] I constantly need to use np.array([A[:,0]).T to force A[:,0] to be a column vector of size (N,1)
Is there a way to keep the single row or column of a 2D array as a 2D array because it makes the following arithmetic sooo much easier. For example, I often have to multiply a column vector (from a matrix) with a row vector (also created from a matrix) to create a new matrix: eg
C = A[:,i] * B[j,:]
it'd be be great if I didn't have to keep using:
C = np.array([A[:,i]]).T * np.array([B[j,:]])
It really obfuscates the code - in MATLAB it'd simply be C = A[:,i] * B[j,:] which is easier to read and compare with the underlying mathematics, especially if there's a lot of terms like this in the same line, but unfortunately most of my colleagues don't have MATLAB licenses.
Note this isn't the only use case, so a specific function for this column x row operation isn't too helpful
Even MATLAB/Octave squeezes out excess dimensions:
>> ones(2,3,4)(:,:,1)
ans =
1 1 1
1 1 1
>> size(ones(2,3,4)(1,:)) # some indexing "flattens" outer dims
ans =
1 12
When I started MATLAB v3.5 2d matrix was all it had; cells, struct and higher dimensions were later additions (as demonstrated by the above examples).
Your:
In [760]: A=np.arange(6).reshape(2,3)
In [762]: np.array([A[:,0]]).T
Out[762]:
array([[0],
[3]])
is more convoluted than needed. It makes a list, then a (1,N) array from that, and finally a (N,1)
A[:,[0]], A[:,:,None], A[:,0:1] are more direct. Even A[:,0].reshape(-1,1)
I can't think of something simple that treats a scalar and list index the same.
Functions like np.atleast_2d can conditionally add a new dimension, but it will be a leading (outer) one. But by the rules of broadcasting leading dimensions are usually 'automatic'.
basic v advanced indexing
In the underlying Python, scalars can't be indexed, and lists can only be indexed with scalars and slices. The underlying syntax allows indexing with tuples, but lists reject those. It's numpy that has extended the indexing considerably - not with syntax but with how it handles those tuples.
numpy indexing with slices and scalars is basic indexing. That's where the dimension loss can occur. That's consistent with list indexing
In [768]: [[1,2,3],[4,5,6]][1]
Out[768]: [4, 5, 6]
In [769]: np.array([[1,2,3],[4,5,6]])[1]
Out[769]: array([4, 5, 6])
Indexing with lists and arrays is advanced indexing, without any list counterparts. This is perhaps where the differences between MATLAB and numpy are ugliest :)
>> A([1,2],[1,2])
produces a (2,2) block. In numpy that produces a "diagonal"
In [781]: A[[0,1],[0,1]]
Out[781]: array([0, 4])
To get the block we have to use lists (or arrays) that "broadcast" against each other:
In [782]: A[[[0],[1]],[0,1]]
Out[782]:
array([[0, 1],
[3, 4]])
To get the "diagonal" in MATLAB we have to use sub2ind([2,2],[1,2],[1,2]) to get the [1,4] flat indices.
What kind of multiplication?
In
np.array([A[:,i]]).T * np.array([B[j,:]])
is this elementwise (.*) or matrix?
For a (N,1) and (1,M) pair, A*B and A#B produce the same (N,M) result, but one uses broadcasting to generalize the outer product, and the other is inner/matrix product (with sum-of-products).
https://numpy.org/doc/stable/reference/generated/numpy.matrix.html
Returns a matrix from an array-like object, or from a string of data. A matrix is a specialized 2-D array that retains its 2-D nature through operations. It has certain special operators, such as * (matrix multiplication) and ** (matrix power).
I'm not sure how to re-implement it though, it's an interesting exercise.
As mentionned, matrix will be deprecated. But from np.array, you can specify the dimension with the argument ndim=2:
np.array([1, 2, 3], ndmin=2)
You can keep the dimension in the following way (using # for matrix multiplication)
C = A[:,[i]] # B[[j],:]
Notice the brackets around i and j, otherwise C won't be a 2-dimensional matrix.
I have a 5 dimension array like this
a=np.random.randint(10,size=[2,3,4,5,600])
a.shape #(2,3,4,5,600)
I want to get the first element of the 2nd dimension, and several elements of the last dimension
b=a[:,0,:,:,[1,3,5,30,17,24,30,100,120]]
b.shape #(9,2,4,5)
as you can see, the last dimension was automatically converted to the first dimension.
why? and how to avoid that?
This behavior is described in the numpy documentation. In the expression
a[:,0,:,:,[1,3,5,30,17,24,30,100,120]]
both 0 and [1,3,5,30,17,24,30,100,120] are advanced indexes, separated by slices. As the documentation explains, in such case dimensions coming from advanced indexes will be first in the resulting array.
If we replace 0 by the slice 0:1 it will change this situation (since it will leave only one advanced index), and then the order of dimensions will be preserved. Thus one way to fix this issue is to use the 0:1 slice and then squeeze the appropriate axis:
a[:,0:1,:,:,[1,3,5,30,17,24,30,100,120]].squeeze(axis=1)
Alternatively, one can keep both advanced indexes, and then rearrange axes:
np.moveaxis(a[:,0,:,:,[1,3,5,30,17,24,30,100,120]], 0, -1)
I am trying to 'expand' an array (generate a new array with proportionally more elements in all dimensions). I have an array with known numbers (let's call it X) and I want to make it j times bigger (in each dimension).
So far I generated a new array of zeros with more elements, then I used broadcasting to insert the original numbers in the new array (at fixed intervals).
Finally, I used linspace to fill the gaps, but this part is actually not directly relevant to the question.
The code I used (for n=3) is:
import numpy as np
new_shape = (np.array(X.shape) - 1 ) * ratio + 1
new_array = np.zeros(shape=new_shape)
new_array[::ratio,::ratio,::ratio] = X
My problem is that this is not general, I would have to modify the third line based on ndim. Is there a way to use such broadcasting for any number of dimensions in my array?
Edit: to be more precise, the third line would have to be:
new_array[::ratio,::ratio] = X
if ndim=2
or
new_array[::ratio,::ratio,::ratio,::ratio] = X
if ndim=4
etc. etc. I want to avoid having to write code for each case of ndim
p.s. If there is a better tool to do the entire process (such as 'inner-padding' that I am not aware of, I will be happy to learn about it).
Thank you
array = array[..., np.newaxis] will add another dimension
This article might help
You can use slice notation -
slicer = tuple(slice(None,None,ratio) for i in range(X.ndim))
new_array[slicer] = X
Build the slicing tuple manually. ::ratio is equivalent to slice(None, None, ratio):
new_array[(slice(None, None, ratio),)*new_array.ndim] = ...
I'm having trouble getting used to Numpy arrays (I'm a Matlab user). When I try to select just a range of values from an array, I see the resulting array has an extra dimension:
ioi = np.nonzero((self.data_array[0,:] >= range_start) & (self.data_array[0,:] <= range_end))
print("self.data_array.shape = {0}".format(self.data_array.shape))
print("self.data_array.shape[:,ioi] = {0}".format(self.data_array[:,ioi].shape))
The result is:
self.data_array.shape = (5, 50000)
self.data_array.shape[:,ioi] = (5, 1, 408)
I also see that ioi is a tuple. I don't know if that has anything to do with it.
What is happening here to create that extra dimension and what should I do, in the most direct way, to get an array shape of (5,408) in this case?
The simplest and most efficient thing would be to get rid of the np.nonzero call, and use logical indexing just as one would in Matlab. Here's an example. (I'm using random data of the same shape, FYI.)
>>> data = np.random.randn(5, 5000)
>>> start, end = -0.5, 0.5
>>> ioi = (data[0] > start) & (data[0] < end)
>>> print(ioi.shape)
(5000,)
>>> print(ioi.sum())
1900
>>> print(data[:, ioi].shape)
(5, 1900)
The np.nonzero call is not usually needed. Just like Matlab's find function, it's slow compared with logical indexing, and usually one's goal can be more efficiently accomplished with logical indexing. np.nonzero, just like find, should mostly be used only when you need the actual index values themselves.
As you suspected, the reason for the extra dimensions is that tuples are handled differently from other types of indexing arrays in NumPy. This is to allow more flexible indexing, such as with slices, ellipses, etc. See this useful page for in-depth explanation, especially the last section.
There are at least two other options to solve the problem. One is to use the ioi array, as returned from np.nonzero, directly as your only index to the data array. As in: self.data_array[ioi]. Part of why you have an extra dimension is that you actually have two set of indices in your call: the slice (:) and the tuple ioi. np.nonzero is guaranteed to return a tuple exactly for this reason, so that its output can always be used to directly index the source array.
The last option is to call np.squeeze on the returned array, but I'd opt for one of the above first.
Been working with numpy for a while now. Just when I think I have arrays figured out, though, it throws me another curve. For instance, I construct the 3D array pltz, and then
>>> gridset2 = range(0, pltx.shape[2], grdspc)
>>> pltz[10,:,gridset2].shape
(17, 160)
>>> pltz[10][:,gridset2].shape
(160, 17)
Why on Earth are the two shapes different?
Since your indexing expression has both a : and a list in it, NumPy needs to apply both the basic and advanced indexing rules, and the way they interact is kind of weird. The relevant documentation is here, and you should consult it if you want to know the full details. I'll focus on the part that causes this shape mismatch.
When all components of the indexing expression that use advanced indexing are next to each other, dimensions of the result coming from advanced indexing are placed into the result in the position of the dimensions they replace. Advanced indexing components are array-likes, such as arrays, lists, and scalars; scalars can also be used in basic indexing, but for this purpose, they're considered advanced. Thus, if arr.shape == (10, 20, 30) and ind.shape = (2, 3, 4), then
arr[:, ind, :].shape == (10, 2, 3, 4, 30)
Your first expression falls into this case.
On the other hand, if components of the indexing expression that use advanced indexing are separated by components that use basic indexing, there is no unambiguous place to insert the advanced indexing dimensions. For example, with
arr[ind, :, ind]
the result needs to have dimensions of length 2, 3, 4, and 20, and there's no good place to stick the 20.
When advanced indexing components are separated by basic indexing components, NumPy sticks all dimensions resulting from advanced indexing at the start of the result array. Basic indexing components are :, ..., and np.newaxis (None). Your second expression falls into this case.
Since your second expression has advanced indexing components separated by basic indexing components and your first expression doesn't, your two expressions use different indexing rules. To avoid this, you could separate the basic indexing and advanced indexing into two stages, or you could replace the basic indexing with equivalent advanced indexing. Whatever you do, I recommend putting an explanatory comment above such code.
You should tell us the lenth of gridset2 and the shape of pltz.
But I've deduced from the documentation that user2357112 gave us that
len(gridset2) == 17
pltz.shape[1] == 160
http://docs.scipy.org/doc/numpy-1.10.0/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
The advanced indexes are separated by a slice, ellipsis or newaxis.
For example x[arr1, :, arr2].
The advanced indexes are all next to
each other. For example x[..., arr1, arr2, :] but not x[arr1, :, 1]
since 1 is an advanced index in this regard.
In the first case, the
dimensions resulting from the advanced indexing operation come first
in the result array, and the subspace dimensions after that. In the
second case, the dimensions from the advanced indexing operations are
inserted into the result array at the same spot as they were in the
initial array
>>> pltz[10,:,gridset2].shape
(17, 160)
This is the first case in the quote, a slice in the middle. gridset2 is advanced indexing (e.g. [1,2,3,...]). It is put first; the [10,:] subspace is placed after.
>>> pltz[10][:,gridset2].shape
(160, 17)
with pltz[10], the new array (a view) is 2d `(160,N)'. It now puts the size 17 dim last, the 2nd case in the documentation.