I generally use MATLAB and Octave, and I recently switched to Python's NumPy.
In numpy when I define an array like this
>>> a = np.array([[2,3],[4,5]])
it works great and size of the array is
>>> a.shape
(2, 2)
which is the same as in MATLAB.
But when I extract the entire first column and check its size
>>> b = a[:,0]
>>> b.shape
(2,)
I get a size of (2,), but what is this? I expected the size to be (2, 1). Perhaps I have misunderstood a basic concept. Can anyone clarify this for me?
A 1D numpy array* is literally 1D - it has no size in any second dimension, whereas in MATLAB, a '1D' array is actually 2D, with a size of 1 in its second dimension.
If you want your array to have size 1 in its second dimension you can use its .reshape() method:
a = np.zeros(5,)
print(a.shape)
# (5,)
# explicitly reshape to (5, 1)
print(a.reshape(5, 1).shape)
# (5, 1)
# or use -1 in the first dimension, so that its size in that dimension is
# inferred from its total length
print(a.reshape(-1, 1).shape)
# (5, 1)
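As an aside (not part of the original answer): you can also avoid losing the dimension in the first place by indexing with a slice or a list instead of a scalar. A quick sketch:
a = np.array([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2,) - a scalar index drops the axis
print(a[:, 0:1].shape)
# (2, 1) - a slice keeps it (and returns a view)
print(a[:, [0]].shape)
# (2, 1) - a list index keeps it too (but makes a copy)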
Edit
As Akavall pointed out, I should also mention np.newaxis as another method for adding a new axis to an array. Although I personally find it a bit less intuitive, one advantage of np.newaxis over .reshape() is that it allows you to add multiple new axes in an arbitrary order without explicitly specifying the shape of the output array, which is not possible with the .reshape(-1, ...) trick:
a = np.zeros((3, 4, 5))
print(a[np.newaxis, :, np.newaxis, ..., np.newaxis].shape)
# (1, 3, 1, 4, 5, 1)
np.newaxis is just an alias of None, so you could do the same thing a bit more compactly using a[None, :, None, ..., None].
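Tying this back to the original example, a quick sketch using np.newaxis to turn the extracted column back into a (2, 1) array:
a = np.array([[2, 3], [4, 5]])
b = a[:, 0]
print(b[:, np.newaxis].shape)
# (2, 1)
print(np.newaxis is None)
# True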
* An np.matrix, on the other hand, is always 2D, and will give you the indexing behavior you are familiar with from MATLAB:
a = np.matrix([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2, 1)
For more info on the differences between arrays and matrices, see here.
Typing help(np.shape) gives some insight into what is going on here. For starters, you can get the output you expect by typing:
b = np.array([a[:, 0]]).T  # note the .T: without it, np.array([a[:, 0]]) has shape (1, 2)
Basically numpy defines things a little differently than MATLAB. In the numpy environment, a vector only has one dimension, and an array is a vector of vectors, so it can have more. In your first example, your array is a vector of two vectors, i.e.:
a = np.array([vec1, vec2])
So a has two dimensions, and in your example the number of elements in both dimensions is the same, 2. Your array is therefore 2 by 2. When you take a slice out of this, you are reducing the number of dimensions that you have by one. In other words, you are taking a vector out of your array, and that vector only has one dimension, which also has 2 elements, but that's it. Your vector is now 2 by _. There is nothing in the second spot because the vector is not defined there.
You could think of it in terms of spaces too. Your first array is in the space R^(2x2) and your second vector is in the space R^(2). This means that the array is defined on a different (and bigger) space than the vector.
That was a lot to basically say that you took a slice out of your array, and unlike MATLAB, numpy does not represent vectors (1 dimensional) in the same way as it does arrays (2 or more dimensions).
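A quick sketch of that dimension drop in code:
a = np.array([[2, 3], [4, 5]])
print(a.ndim, a.shape)
# 2 (2, 2)
b = a[:, 0]  # a scalar index removes one dimension
print(b.ndim, b.shape)
# 1 (2,)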
Related
I want to extract parts of a numpy ndarray based on arrays of index positions for some of the dimensions. Let me illustrate with an example.
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary; it could just as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:]. Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4. My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example, the same as dummy[X].
Iteratively reducing dummy, by doing something like dummy = dummy[X[:,i],:] with i an iterator over the columns of X, also does not reduce the ndarray in the example past 3x2x100.
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?
I will try to add some explanation to Michael Szczesny's answer.
First, notice that if you have an np.array with n dimensions and pass m indices where m < n, it is the same as using : for the remaining dimensions. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
This is the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
dummy[([0, 1], [0, 0])] == np.array([dummy[(0,0)], dummy[(1,0)]])
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
Here you are getting the indices for each dimension as separate arrays, so dummy[(*X.T,)] gives you:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which, to me, makes more sense.
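Putting it together for the example data, a minimal runnable sketch:
import numpy as np

dummy = np.random.rand(5, 2, 100)
X = np.array([[0, 1], [4, 1], [2, 0]])

result = dummy[tuple(X.T)]  # equivalent to dummy[(*X.T,)]
print(result.shape)
# (3, 100)
print(np.array_equal(result[0], dummy[0, 1, :]))
# True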
As Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] are the indices of the first dimension of dummy and X[:,1] are the indices of the second dimension of dummy, if you transpose X (X.T) you'll have the indices of the first dimension of dummy as X.T[0] and the indices of the second dimension of dummy as X.T[1].
Now to slice dummy as you want, you can specify the indices of the first and of the second dimension in this way:
dummy[(first_dim_indices, second_dim_indices)], i.e. dummy[(X.T[0], X.T[1])]
To simplify the code (and since you don't want to transpose the X matrix twice), you can unpack X.T into a tuple as (*X.T,), so writing dummy[(*X.T,)] is the same as writing dummy[(X.T[0], X.T[1])].
This writing is also useful if you have an unfixed number of dimensions to slice through, because you will unpack from X.T as many rows as there are dimensions to slice in dummy. For example, suppose you want to retrieve a 1D array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dummy[(*X.T,)] is still valid.
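To check that with concrete data (the third-dimension indices just have to be valid for dummy's last axis):
import numpy as np

dummy = np.random.rand(5, 2, 100)
X = np.array([[0, 1, 9], [4, 1, 8], [2, 0, 7]])

result = dummy[(*X.T,)]
print(result.shape)
# (3,)
print(result[0] == dummy[0, 1, 9])
# True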
Instead of an n-dimensional array, let's take a 3D array to illustrate my question:
>>> import numpy as np
>>> arr = np.ones(24).reshape(2, 3, 4)
So I have an array of shape (2, 3, 4). I would like to concatenate/fuse the 2nd and 3rd axis together to get an array of the shape (2, 12).
Wrongly, I thought I could do it easily with np.concatenate:
>>> np.concatenate(arr, axis=1).shape
(3, 8)
I found a way to do it by a combination of np.rollaxis and np.concatenate but it is increasingly ugly as the array goes up in dimension:
>>> np.rollaxis(np.concatenate(np.rollaxis(arr, 0, 3), axis=0), 0, 2).shape
(2, 12)
Is there any simple way to accomplish this? It seems very trivial, so there must exist some function, but I cannot seem to find it.
EDIT: Indeed I could use np.reshape, but that means computing the dimensions of the axes first. Is it possible without accessing/computing the shape beforehand?
On recent Python versions (3.5+, which allow multiple unpackings in a call) you can do:
anew = a.reshape(*a.shape[:k], -1, *a.shape[k+2:])
I recommend against directly assigning to .shape since it doesn't work on sufficiently noncontiguous arrays.
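For example, fusing axes 1 and 2 of the (2, 3, 4) array from the question (k = 1):
import numpy as np

a = np.ones(24).reshape(2, 3, 4)
k = 1  # fuse axes k and k+1
anew = a.reshape(*a.shape[:k], -1, *a.shape[k+2:])
print(anew.shape)
# (2, 12)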
Let's say that you have n dimensions in your array and that you want to fuse adjacent axes i and i+1:
shape = a.shape
new_shape = list(shape[:i]) + [-1] + list(shape[i+2:])
a.shape = new_shape
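Applied to the example array with i = 1 (using .reshape() rather than assigning to .shape, given the contiguity caveat above):
import numpy as np

a = np.ones(24).reshape(2, 3, 4)
i = 1
new_shape = list(a.shape[:i]) + [-1] + list(a.shape[i+2:])
print(a.reshape(new_shape).shape)
# (2, 12)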
I want to slice a multidimensional numpy array (>2 dimensions) along 2 of its axes using index slicing. What are the rules for where each of its original dimensions end up?
To illustrate my problem, let me provide an example. Say we have a 4D array:
import numpy as np
a = np.arange(2*3*4*5).reshape(2,3,4,5)
I'll create a tuple of indices using numpy.where, for slicing along axes 1 and 3:
mask = np.where(np.random.rand(3,5) > 0.5)
This will pick out random slices from my array a. Let's say it returned tuples of length 7.
To preserve the remaining dimensions I will use slice(None) objects:
b = a[(slice(None), mask[0], slice(None), mask[1])]
This changed the shape:
>>> a.shape
(2, 3, 4, 5)
>>> b.shape
(7, 2, 4)
The axes that were untouched (i.e. sliced using the slice(None) object) appear to have been preserved, whereas the sliced axes are destroyed and the resulting axis is moved to the front.
However, this is not always the case. When I apply a mask to axes 1 and 2:
mask2 = np.where(np.random.rand(3,4) > 0.5)
c = a[(slice(None), mask2[0], mask2[1], slice(None))]
I observe the following (numpy.where has returned tuples of length 7 again):
>>> c.shape
(2, 7, 5)
The axis resulting from the axes that have been destroyed by the slicing did not move to the front this time.
My guess is that it is related to whether the sliced axes are adjacent or not, but I want to know from what rules this behavior emerges.
https://docs.scipy.org/doc/numpy-1.15.4/reference/arrays.indexing.html#combining-advanced-and-basic-indexing
Your where masks, applied to a 2d array, would produce a 1d array of shape (7,): the values where the condition is true. You phrase that as 'destroying' a pair of axes.
In the second case that 7 can be placed between the 2 and 5.
But in the first case it's ambiguous because of the slice in the middle (the non-adjacency); the fallback rule is to put it first, and order the slices after. In other words, instead of trying to choose between a (2,7,4) and a (2,4,7) order, it chooses (7,2,4).
The ambiguity is clear in this case, and the default is reasonable. It's more complicated when one or more of the dimensions is eliminated by a scalar index.
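A quick sketch of the rule with deterministic indices:
import numpy as np

a = np.arange(2*3*4*5).reshape(2, 3, 4, 5)
i, j = np.array([0, 1, 2]), np.array([0, 1, 2])

# adjacent advanced indexes: the result axis stays in place
print(a[:, i, j, :].shape)
# (2, 3, 5)

# separated advanced indexes: the result axis moves to the front
print(a[:, i, :, j].shape)
# (3, 2, 4)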
I am wondering why in numpy there are one-dimensional arrays of shape (length, 1) and also one-dimensional arrays of shape (length,) without a second value.
I am running into this quite frequently, e.g. when using np.concatenate() which then requires a reshape step beforehand (or I could directly use hstack/vstack).
I can't think of a reason why this behavior is desirable. Can someone explain?
Edit:
It was suggested in one of the comments that my question is a possible duplicate. I am more interested in the underlying working logic of Numpy, and not in the fact that there is a distinction between 1d and 2d arrays, which I think is the point of the mentioned thread.
The data of an ndarray is stored as a 1d buffer - just a block of memory. The multidimensional nature of the array is produced by the shape and strides attributes, and the code that uses them.
The numpy developers chose to allow for an arbitrary number of dimensions, so the shape and strides are represented as tuples of any length, including 0 and 1.
In contrast, MATLAB was built around FORTRAN programs that were developed for matrix operations. In the early days everything in MATLAB was a 2d matrix. With version 5 (in the late 1990s) it was generalized to allow more than 2d, but never less. The numpy np.matrix still follows that old 2d MATLAB constraint.
If you come from a MATLAB world you are used to these 2 dimensions, and the distinction between a row vector and a column vector. But in math and physics that isn't influenced by MATLAB, a vector is a 1d array. Python lists are inherently 1d, as are C arrays. To get 2d you have to have lists of lists, or arrays of pointers to arrays, with x[1][2] style indexing.
Look at the shape and strides of this array and its variants:
In [48]: x=np.arange(10)
In [49]: x.shape
Out[49]: (10,)
In [50]: x.strides
Out[50]: (4,)
In [51]: x1=x.reshape(10,1)
In [52]: x1.shape
Out[52]: (10, 1)
In [53]: x1.strides
Out[53]: (4, 4)
In [54]: x2=np.concatenate((x1,x1),axis=1)
In [55]: x2.shape
Out[55]: (10, 2)
In [56]: x2.strides
Out[56]: (8, 4)
MATLAB adds new dimensions at the end. It orders its values like an order='F' array, and can readily change a (n,1) matrix to a (n,1,1,1). numpy is order='C' by default, and readily expands an array's dimensions at the start. Understanding this is essential when taking advantage of broadcasting.
Thus x1 + x is a (10,1)+(10,) => (10,1)+(1,10) => (10,10)
Because of broadcasting a (n,) array is more like a (1,n) one than a (n,1) one. A 1d array is more like a row matrix than a column one.
In [64]: np.matrix(x)
Out[64]: matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [65]: _.shape
Out[65]: (1, 10)
The point with concatenate is that it requires matching dimensions. It does not use broadcasting to adjust dimensions. There are a bunch of stack functions that ease this constraint, but they do so by adjusting the dimensions before using concatenate. Look at their code (readable Python).
So a proficient numpy user needs to be comfortable with that generalized shape tuple, including the empty () (0d array), (n,) 1d, and up. For more advanced stuff understanding strides helps as well (look for example at the strides and shape of a transpose).
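For instance (stride values shown assuming 8-byte integers; the exact numbers depend on the dtype):
x = np.arange(12).reshape(3, 4)
print(x.shape, x.strides)
# (3, 4) (32, 8)
print(x.T.shape, x.T.strides)
# (4, 3) (8, 32) - same data buffer, just swapped strides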
Much of it is a matter of syntax. This tuple (x) isn't a tuple at all (the parentheses are just redundant). (x,), however, is.
The difference between (x,) and (x,1) goes even further. You can take a look at the examples in previous questions like this one. Quoting the example from it, this is a 1D numpy array:
>>> np.array([1, 2, 3]).shape
(3,)
But this one is 2D:
>>> np.array([[1, 2, 3]]).shape
(1, 3)
Reshape does not make a copy unless it needs to, so it should be safe to use.
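A small sketch showing when reshape returns a view versus a copy:
import numpy as np

a = np.arange(6)
b = a.reshape(2, 3)
print(np.shares_memory(a, b))
# True - no copy was needed
c = b.T.reshape(-1)  # flattening a transpose forces a copy
print(np.shares_memory(b, c))
# False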
I have been curious about this for some time. I can live with it, but it always bites me when enough care is not taken, so I decided to post it here. Suppose the following example (Numpy version = 1.8.2):
a = array([[0, 1], [2, 3]])
print shape(a[0:0, :]) # (0, 2)
print shape(a[0:1, :]) # (1, 2)
print shape(a[0:2, :]) # (2, 2)
print shape(a[0:100, :]) # (2, 2)
print shape(a[0]) # (2, )
print shape(a[0, :]) # (2, )
print shape(a[:, 0]) # (2, )
I don't know how other people feel, but the result feels inconsistent to me. The last line is a column vector while the second-to-last line is a row vector; they should have different dimensions -- in linear algebra they do! (Line 5 is another surprise, but I will neglect it for now.) Consider a second example:
solution = scipy.sparse.linalg.dsolve.linsolve.spsolve(A, b) # solution of dimension (n, )
analytic = reshape(f(x, y), (n, 1)) # analytic of dimension (n, 1)
error = solution - analytic
Now error is of dimension (n, n). Yes, in the second line I should use (n, ) instead of (n, 1), but why? I used to use MATLAB a lot, where a one-d vector has dimension (n, 1), linspace/arange return arrays of dimension (n, 1), and there never exists (n, ). But in Numpy (n, 1) and (n, ) coexist, and there are many functions for dimension handling alone: atleast, newaxis and different uses of reshape, but to me those functions are more a source of confusion than of help. If an array prints like [1,2,3], then intuitively the dimension should be [1,3] instead of [3,], right? If Numpy did not have (n, ), I can only see a gain in clarity, not a loss in functionality.
So there must be some design reason behind this. I have searched from time to time without finding a clear answer or report. Could someone help clarify this confusion or provide some useful references? Your help is much appreciated.
numpy's philosophy is not that a[:, 0] is a "column vector" and a[0, :] a "row vector" in the general case. Rather, they are both, quite simply, vectors—i.e. arrays with one and only one dimension. This is actually highly logical and consistent (but yes, it can get annoying for those of us accustomed to Matlab).
I say "in the general case" because that is true for numpy's most general data structure, the array, which is intended for all kinds of multi-dimensional dense data storage and manipulation applications—not just matrix math. Having "rows" and "columns" is a highly specialized context for array operations—but yes, a very common one: that's why numpy also supplies the matrix class. Convert your array to a numpy.matrix (or use the matrix constructor instead of array to begin with) and you will see behaviour closer to what you expect. For more information, see What are the differences between numpy arrays and matrices? Which one should I use?
For cases where you're dealing with more than 2 dimensions, take a look at the numpy.expand_dims function. Though the syntax is annoyingly redundant and unpythonically verbose, when I'm working on arrays with more than 2 dimensions (so cannot use matrix), I'm forever having to use expand_dims to do this kind of thing:
A -= numpy.expand_dims( A.mean( axis=2 ), 2 ) # subtract mean-across-layers from A
instead of
A -= A.mean( axis=2 ) # throw an exception while naively attempting to subtract mean-across-layers from A
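(As an aside: on NumPy versions where reductions accept the keepdims argument, a shorter alternative is to keep the reduced axis directly:)
A -= A.mean(axis=2, keepdims=True)  # the size-1 axis is kept, so broadcasting just works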
But consider Matlab, by contrast. Matlab implicitly asserts that there is no such thing as a one-dimensional object and that the minimum number of dimensions a thing can ever have is 2. Sure, you and I are both highly accustomed to this, but take a moment to realize how arbitrary it is. There is clearly a conceptual difference between a fundamentally one-dimensional object, and a two-dimensional object that just happens to have extent 1 in one of its dimensions: the latter is allowed to grow in its second dimension, whereas the former doesn't even know what the second dimension means—and why should it? Hence a.shape==(N,) and a.shape==(N,1) make perfect sense as separate cases. You might as well ask "why is it not (N, 1, 1)?" or "why is it not (N, 1, 1, 1, 1, 1, 1)?"