What is the best way to keep dimensionality when subarraying numpy arrays?

Suppose I had a standard numpy array such as
a = np.arange(6).reshape((2,3))
When I take a subarray, by performing a task such as
a[1, :]
I lose dimensionality: the result turns into 1D and prints as array([3, 4, 5]).
Of course, the original array being 2D, I want to keep the dimensionality. So I have to do something tedious such as
b = a[1, :]
b = b.reshape(1, b.size)
Why does numpy decrease dimensionality when subarraying?
What is the best way to keep dimensionality, since a[1, :].reshape(1, a.size) will break (a.size is 6, but the row has only 3 elements)?

Just use slicing rather than indexing, and the shape will be preserved:
a[1:2]
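A quick check with the array from the question confirms that the 2D shape is preserved:
a = np.arange(6).reshape((2, 3))
a[1:2]          # array([[3, 4, 5]])
a[1:2].shape    # (1, 3)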

Although I agree with John Zwinck's answer, I wanted to provide an alternative in case, for whatever reason, you are forced into using indexing (instead of slicing).
OP says that "a[1, :].reshape(1, a.size) will break":
You can add dimensions to numpy arrays like this:
b = a[1]
# array([3, 4, 5])
b = a[1][np.newaxis]
# array([[3, 4, 5]])
(Note that np.newaxis is just None, but it's a lot more readable to use np.newaxis.)
As pointed out in the comments (@PaulPanzer and @Divakar), there are actually many ways to accomplish the same thing (again, with indexing instead of slicing):
These do not make a copy (changes to the result affect a):
a[1, None]
a[1, np.newaxis]
a[1].reshape(1, a.shape[1]) # Use shape, not size
This one does make a copy (the data is independent of a):
a[[1]]
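To see the view/copy distinction in action, here is a small demonstration (not part of the original answer) of how writes propagate:
a = np.arange(6).reshape(2, 3)
v = a[1, None]   # basic indexing with np.newaxis: a view, shape (1, 3)
v[0, 0] = 99
a[1, 0]          # 99: writing to the view changed a
c = a[[1]]       # advanced indexing: a copy, shape (1, 3)
c[0, 0] = -1
a[1, 0]          # still 99: the copy is independent of a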

Related

What is the difference between resize and reshape when using arrays in NumPy?

I have just started using NumPy. What is the difference between resize and reshape for arrays?
reshape() doesn't change the array, as the examples below show.
resize() does change it.
Here are some examples:
>>> numpy.random.rand(2,3)
array([[ 0.6832785 ,  0.23452056,  0.25131171],
       [ 0.81549186,  0.64789272,  0.48778127]])
>>> ar = numpy.random.rand(2,3)
>>> ar.reshape(1,6)
array([[ 0.43968751,  0.95057451,  0.54744355,  0.33887095,  0.95809916,
         0.88722904]])
>>> ar
array([[ 0.43968751,  0.95057451,  0.54744355],
       [ 0.33887095,  0.95809916,  0.88722904]])
After reshape, the array itself didn't change; reshape only returned a new, reshaped array.
>>> ar.resize(1,6)
>>> ar
array([[ 0.43968751,  0.95057451,  0.54744355,  0.33887095,  0.95809916,
         0.88722904]])
After resize the array changed its shape.
One major difference is that reshape() does not change your data, but resize() does. resize() first accommodates all the values from the original array; after that, if there is extra space (i.e. the new array is larger than the original), it fills in additional values. As @David mentioned in the comments, which values resize() adds depends on how it is called.
You can call reshape() and resize() function in the following two ways.
numpy.resize()
ndarray.resize() - where ndarray is the n-dimensional array you are resizing.
You can similarly call reshape() as numpy.reshape() and ndarray.reshape(); those two are essentially the same apart from the syntax.
One point to notice is that reshape() will return a view wherever possible, and a copy otherwise. It is hard to tell in advance which one will be returned, but you can make your code raise an error whenever the data would be copied.
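One way to do that (an illustrative trick, not spelled out above) is to assign to the shape attribute instead of calling reshape(); shape assignment never copies, and raises an error when a copy would be required:
a = np.arange(6).reshape(2, 3)
b = a.T          # a view of a with transposed strides
b.shape = (6,)   # raises AttributeError: flattening the transpose would require a copy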
For resize(), numpy.resize() returns a new copy of the array, whereas ndarray.resize() does it in place. Neither of them ever returns a view.
Now, coming to the point of what the values of the extra elements should be, the documentation for numpy.resize() says:
If the new array is larger than the original array, then the new array is filled with repeated copies of a. Note that this behavior is different from a.resize(new_shape) which fills with zeros instead of repeated copies of a.
So for ndarray.resize() the fill value is 0, while for numpy.resize() it is repeated values from the array itself (as many as fit in the new size). The code snippet below makes this clear.
In [40]: arr = np.array([1, 2, 3, 4])
In [41]: np.resize(arr, (2,5))
Out[41]:
array([[1, 2, 3, 4, 1],
       [2, 3, 4, 1, 2]])
In [42]: arr.resize((2,5))
In [43]: arr
Out[43]:
array([[1, 2, 3, 4, 0],
       [0, 0, 0, 0, 0]])
You can also see that ndarray.resize() returns None and does the resizing in-place.
reshape() is able to change the shape only (i.e. the meta info), not the number of elements.
If the array has five elements, we may use e.g. reshape(5, ), reshape(1, 5),
reshape(1, 5, 1), but not reshape(2, 3).
reshape() in general doesn't modify the data itself, only the meta info about it; the .reshape() method (of ndarray) returns the reshaped array, keeping the original array untouched.
resize() is able to change both the shape and the number of elements.
So for an array with five elements we may use resize(5, 1), but also resize(2, 2) or resize(7, 9).
The .resize() method (of ndarray) returns None, changing only the original array (it is an in-place change).
Suppose you have the following np.ndarray:
a = np.array([1, 2, 3, 4]) # Shape of this is (4,)
Now let's try a.reshape:
a.reshape(1, 4)
array([[1, 2, 3, 4]])
a.shape # This still returns (4,)
We see that the shape of a hasn't changed.
Let's try 'a.resize' now:
a.resize(1,4)
a.shape # Now the shape changes to (1,4)
resize changed the shape of our original NumPy array a (it changes the shape in place).
One more point is:
np.reshape can take -1 in one dimension. np.resize can't.
Example:
arr = np.arange(20)
arr.resize(5, 2, 2)    # works: 5 * 2 * 2 == 20
arr.reshape(2, 2, -1)  # works: the -1 is inferred to be 5

numpy: Why is there a difference between (x,1) and (x, ) dimensionality

I am wondering why numpy has both arrays of shape (length, 1) and one-dimensional arrays of shape (length,), without a second value.
I run into this quite frequently, e.g. when using np.concatenate(), which then requires a reshape step beforehand (or I could directly use hstack/vstack).
I can't think of a reason why this behavior is desirable. Can someone explain?
Edit:
It was suggested in the comments that my question is a possible duplicate. I am more interested in the underlying working logic of numpy, not in the mere fact that there is a distinction between 1d and 2d arrays, which I think is the point of the suggested thread.
The data of a ndarray is stored as a 1d buffer - just a block of memory. The multidimensional nature of the array is produced by the shape and strides attributes, and the code that uses them.
The numpy developers chose to allow for an arbitrary number of dimensions, so the shape and strides are represented as tuples of any length, including 0 and 1.
In contrast MATLAB was built around FORTRAN programs that were developed for matrix operations. In the early days everything in MATLAB was a 2d matrix. Around 2000 (v3.5) it was generalized to allow more than 2d, but never less. The numpy np.matrix still follows that old 2d MATLAB constraint.
If you come from a MATLAB world you are used to these 2 dimensions, and to the distinction between a row vector and a column vector. But in math and physics (where usage isn't shaped by MATLAB), a vector is a 1d array. Python lists are inherently 1d, as are C arrays. To get 2d you have to have lists of lists, or arrays of pointers to arrays, with x[1][2]-style indexing.
Look at the shape and strides of this array and its variants:
In [48]: x=np.arange(10)
In [49]: x.shape
Out[49]: (10,)
In [50]: x.strides
Out[50]: (4,)
In [51]: x1=x.reshape(10,1)
In [52]: x1.shape
Out[52]: (10, 1)
In [53]: x1.strides
Out[53]: (4, 4)
In [54]: x2=np.concatenate((x1,x1),axis=1)
In [55]: x2.shape
Out[55]: (10, 2)
In [56]: x2.strides
Out[56]: (8, 4)
MATLAB adds new dimensions at the end. It orders its values like an order='F' array, and can readily change a (n,1) matrix to a (n,1,1,1). numpy defaults to order='C', and readily expands an array's dimensions at the start. Understanding this is essential when taking advantage of broadcasting.
Thus x1 + x is a (10,1)+(10,) => (10,1)+(1,10) => (10,10)
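Continuing the session above, you can verify the broadcast shape directly:
In [57]: (x1 + x).shape
Out[57]: (10, 10)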
Because of broadcasting a (n,) array is more like a (1,n) one than a (n,1) one. A 1d array is more like a row matrix than a column one.
In [64]: np.matrix(x)
Out[64]: matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [65]: _.shape
Out[65]: (1, 10)
The point with concatenate is that it requires matching dimensions. It does not use broadcasting to adjust dimensions. There are a bunch of stack functions that ease this constraint, but they do so by adjusting the dimensions before using concatenate. Look at their code (readable Python).
So a proficient numpy user needs to be comfortable with that generalized shape tuple, including the empty () (0d array), (n,) 1d, and up. For more advanced stuff understanding strides helps as well (look for example at the strides and shape of a transpose).
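For instance, continuing the session above with x2 (shape (10, 2), strides (8, 4)), a transpose simply reverses the shape and strides tuples without moving any data:
In [66]: x2.T.shape
Out[66]: (2, 10)
In [67]: x2.T.strides
Out[67]: (4, 8)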
Much of it is a matter of syntax: (x) isn't a tuple at all (the parentheses are redundant), whereas (x,) is.
The difference between (x,) and (x, 1) goes even further. You can take a look at examples from previous questions on this topic. Quoting one such example, this is a 1D numpy array:
>>> np.array([1, 2, 3]).shape
(3,)
But this one is 2D:
>>> np.array([[1, 2, 3]]).shape
(1, 3)
reshape does not make a copy unless it needs to, so it is safe to use.
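You can verify this with np.shares_memory (a quick check, not part of the original answer):
>>> a = np.array([1, 2, 3])
>>> b = a.reshape(1, 3)
>>> np.shares_memory(a, b)
True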

weird result when using both slice indexing and boolean indexing on a 3d array

I encountered a quite weird problem when indexing a numpy ndarray. You can reproduce it with the following code. I don't understand why the result of indexing a is somehow transposed, while the result for the 2d array b is normal. Thanks.
In [1]: a = np.array(range(6)).reshape((1,2,3))
In [2]: mask = np.array([True, True, True])
In [3]: a
Out[3]:
array([[[0, 1, 2],
        [3, 4, 5]]])
In [4]: a[0, :, mask]
Out[4]:
array([[0, 3],
       [1, 4],
       [2, 5]])
In [5]: a[0, :, mask].shape
Out[5]: (3, 2)
In [6]: b = np.array(range(6)).reshape((2,3))
In [7]: b[:, mask].shape
Out[7]: (2, 3)
a[0, :, mask] mixes advanced indexing with slicing. The : is a slice (basic) index, while the 0 (for this purpose) and mask are considered advanced indexes.
The rules governing the behavior of indexing when both advanced indexing and slicing are combined state:
There are two parts to the indexing operation, the subspace defined by the basic indexing (excluding integers) and the subspace from the advanced indexing part. Two cases of index combination need to be distinguished:
1. The advanced indexes are separated by a slice, ellipsis or newaxis. For example x[arr1, :, arr2].
2. The advanced indexes are all next to each other. For example x[..., arr1, arr2, :] but not x[arr1, :, 1] since 1 is an advanced index in this regard.
In the first case, the dimensions resulting from the advanced indexing operation come first in the result array, and the subspace dimensions after that. In the second case, the dimensions from the advanced indexing operations are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
So since a[0, :, mask] has its advanced indexes separated by a slice (the first case), the shape of the resulting array has the axes associated with the advanced indexes pushed to the front and the axes associated with the slice pushed to the end. Thus the shape is (3, 2): the mask is associated with the axis of length 3, and the slice : with the axis of length 2. (The 0 index in effect removes the axis of length 1 from the resultant array, so it plays no role in the resultant shape.)
In contrast, b[:, mask] has all the advanced indexes together (the second case). So the shape of the resulting array keeps the axes in place. b[:, mask].shape is thus (2, 3).
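One way to sidestep the reordering (an illustrative workaround, not part of the original answer) is to apply the integer index first, so the boolean mask is the only advanced index left and the second case applies:
In [8]: a[0][:, mask].shape
Out[8]: (2, 3)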

Confusion in array operation in numpy

I generally use MATLAB and Octave, and I have recently been switching to Python and numpy.
In numpy when I define an array like this
>>> a = np.array([[2,3],[4,5]])
it works great, and the size of the array is
>>> a.shape
(2, 2)
which is also same as MATLAB
But when I extract the entire first column and check its size,
>>> b = a[:,0]
>>> b.shape
(2,)
I get size (2,). What is this? I expected the size to be (2, 1). Perhaps I have misunderstood a basic concept. Can anyone clarify this?
A 1D numpy array* is literally 1D - it has no size in any second dimension, whereas in MATLAB, a '1D' array is actually 2D, with a size of 1 in its second dimension.
If you want your array to have size 1 in its second dimension you can use its .reshape() method:
a = np.zeros(5,)
print(a.shape)
# (5,)
# explicitly reshape to (5, 1)
print(a.reshape(5, 1).shape)
# (5, 1)
# or use -1 in the first dimension, so that its size in that dimension is
# inferred from its total length
print(a.reshape(-1, 1).shape)
# (5, 1)
Edit
As Akavall pointed out, I should also mention np.newaxis as another method for adding a new axis to an array. Although I personally find it a bit less intuitive, one advantage of np.newaxis over .reshape() is that it allows you to add multiple new axes in an arbitrary order without explicitly specifying the shape of the output array, which is not possible with the .reshape(-1, ...) trick:
a = np.zeros((3, 4, 5))
print(a[np.newaxis, :, np.newaxis, ..., np.newaxis].shape)
# (1, 3, 1, 4, 5, 1)
np.newaxis is just an alias of None, so you could do the same thing a bit more compactly using a[None, :, None, ..., None].
* An np.matrix, on the other hand, is always 2D, and will give you the indexing behavior you are familiar with from MATLAB:
a = np.matrix([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2, 1)
For more info on the differences between arrays and matrices, see the NumPy documentation.
Typing help(np.shape) gives some insight into what is going on here. For starters, you can get the (2, 1) shape you expect by typing:
b = np.array([a[:, 0]]).T  # wrapping in a list gives shape (1, 2); .T turns it into (2, 1)
Basically, numpy defines things a little differently from MATLAB. In the numpy environment, a vector has only one dimension, and an array is a vector of vectors, so it can have more. In your first example, your array is a vector of two vectors, i.e.:
a = np.array([[vec1], [vec2]])
So a has two dimensions, and in your example the number of elements in both dimensions is the same, 2. Your array is therefore 2 by 2. When you take a slice out of it, you reduce the number of dimensions by one. In other words, you are taking a vector out of your array, and that vector has only one dimension, which also has 2 elements, but that's it. Your vector is now just 2 by nothing; there is nothing in the second spot because the vector is not defined there.
You could think of it in terms of spaces too. Your first array is in the space R^(2x2) and your second vector is in the space R^(2). This means that the array is defined on a different (and bigger) space than the vector.
That was a lot to basically say that you took a slice out of your array, and unlike MATLAB, numpy does not represent vectors (1-dimensional) the same way it represents arrays (2 or more dimensions).
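To tie this back to the slicing answer from the first question above: slicing with a range keeps the second dimension, so you can get the (2, 1) shape without any reshaping:
>>> a = np.array([[2, 3], [4, 5]])
>>> a[:, 0].shape
(2,)
>>> a[:, 0:1].shape
(2, 1)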

numpy vstack vs. column_stack

What exactly is the difference between numpy vstack and column_stack. Reading through the documentation, it looks as if column_stack is an implementation of vstack for 1D arrays. Is it a more efficient implementation? Otherwise, I cannot find a reason for just having vstack.
I think the following code illustrates the difference nicely:
>>> np.vstack(([1,2,3],[4,5,6]))
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.column_stack(([1,2,3],[4,5,6]))
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.hstack(([1,2,3],[4,5,6]))
array([1, 2, 3, 4, 5, 6])
I've included hstack for comparison as well. Notice how column_stack stacks along the second dimension whereas vstack stacks along the first dimension. The equivalent to column_stack is the following hstack command:
>>> np.hstack(([[1],[2],[3]],[[4],[5],[6]]))
array([[1, 4],
       [2, 5],
       [3, 6]])
I hope we can agree that column_stack is more convenient.
hstack stacks horizontally, vstack stacks vertically.
The problem with hstack is that when you append a column you need to convert it from a 1d array to a 2d column first, because a 1d array is normally interpreted as a row vector in a 2d context in numpy:
a = np.ones((2, 2))     # 2d, shape (2, 2)
b = np.array([0, 0])    # 1d, shape (2,)
np.hstack((a, b))       # ValueError: dimensions mismatch
So either np.hstack((a, b[:, None])) or np.column_stack((a, b)) works, where None serves as a shortcut for np.newaxis.
If you're stacking two vectors, you've got three options, as the example above shows: vstack makes them the rows of a 2d array, column_stack makes them the columns, and hstack concatenates them into one long 1d vector.
As for the (undocumented) row_stack, it is just a synonym of vstack, since a 1d array is ready to serve as a matrix row without extra work.
The case of 3D and above proved to be too huge to fit in the answer, so I've included it in the article called Numpy Illustrated.
The Notes section of the column_stack documentation points this out:
This function is equivalent to np.vstack(tup).T.
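You can check this for the 1-D inputs used above:
>>> np.vstack(([1,2,3],[4,5,6])).T
array([[1, 4],
       [2, 5],
       [3, 6]])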
There are many functions in numpy that are convenient wrappers of other functions. For example, the Notes section of vstack says:
Equivalent to np.concatenate(tup, axis=0) if tup contains arrays that are at least 2-dimensional.
It looks like column_stack is just a convenience function for vstack.
