About numpy's concatenate, hstack, vstack functions? - python

See some examples
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.concatenate((a,b), axis=0)) # [1,2,3,4,5,6]
print(np.hstack((a,b))) # [1,2,3,4,5,6]
print(np.vstack((a,b))) # [[1,2,3],[4,5,6]]
print(np.concatenate((a,b), axis=1)) # IndexError: axis 1 out of bounds [0, 1)
The result of hstack is the same as concatenate along axis=0, but the API documentation says hstack is equivalent to concatenate along axis=1; see https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.hstack.html#numpy.hstack
Concatenating along axis=1 raises an IndexError, and the API documentation says vstack is equivalent to concatenate along axis=0; see https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.vstack.html#numpy.vstack
Can anybody explain this? Also, can anybody explain how broadcasting works when an ndarray has fewer than 2 dimensions and you concatenate along axis=1?

Look at the actual code for hstack:
arrs = [atleast_1d(_m) for _m in tup]
# As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
if arrs[0].ndim == 1:
    return _nx.concatenate(arrs, 0)
else:
    return _nx.concatenate(arrs, 1)
I don't see anything in the docs about axis=1. The term it uses is 'stack them horizontally'.
As I noted a year ago in "Concatenation of 2 1D numpy arrays along 2nd axis", earlier versions don't raise an error if the axis is too high, but in 1.12 we get an error.
There is a newish np.stack that can add a dimension where needed:
In [46]: np.stack((np.arange(3), np.arange(4,7)),axis=1)
Out[46]:
array([[0, 4],
       [1, 5],
       [2, 6]])
The base function is concatenate. The various stack functions adjust array dimensions in one way or other, and then do concatenate. Look at their code to see the details. (I've summarized the differences in earlier posts as well).
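To make the "adjust dimensions, then concatenate" idea concrete, here is a small sketch. The variable names are mine, and this is not how np.stack is written internally; it only shows that stacking along axis=1 gives the same result as turning each 1-D input into a column by hand and then calling np.concatenate.

```python
import numpy as np

a = np.arange(3)        # shape (3,)
b = np.arange(4, 7)     # shape (3,)

# np.stack joins the inputs along a brand-new axis (here axis 1).
stacked = np.stack((a, b), axis=1)

# By hand: make each input a (3, 1) column, then concatenate along axis 1.
by_hand = np.concatenate((a[:, np.newaxis], b[:, np.newaxis]), axis=1)

print(stacked.shape)                      # (3, 2)
print(np.array_equal(stacked, by_hand))   # True
```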

np.hstack(tup) and np.concatenate(tup, axis=1) are indeed equivalent, but only if tup contains arrays that are at least 2-dimensional. This was in fact spelled out in the documentation for vstack, so it looks like it was just an oversight that the documentation for hstack did not say the same; future versions will, though.
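A quick way to check the "at least 2-dimensional" condition yourself is sketched below. It reuses a and b from the question; the reshaped arrays a2 and b2 and the flag names are only for illustration. The except clause catches both ValueError and IndexError because the exact exception type has changed between NumPy versions.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# 1-D inputs: hstack joins along axis 0, since axis 1 does not exist.
h1 = np.hstack((a, b))
same_1d = np.array_equal(h1, np.concatenate((a, b), axis=0))

# 2-D inputs: hstack matches axis=1 and vstack matches axis=0.
a2 = a.reshape(1, 3)
b2 = b.reshape(1, 3)
same_h = np.array_equal(np.hstack((a2, b2)), np.concatenate((a2, b2), axis=1))
same_v = np.array_equal(np.vstack((a2, b2)), np.concatenate((a2, b2), axis=0))

# Concatenating the 1-D arrays along axis=1 is rejected.
try:
    np.concatenate((a, b), axis=1)
    axis1_failed = False
except (ValueError, IndexError):
    axis1_failed = True

print(same_1d, same_h, same_v, axis1_failed)
```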

Related

Extract 2d ndarray from arbitrarily dimensional ndarray using index arrays

I want to extract parts of a numpy ndarray based on arrays of index positions for some of the dimensions. Let me show this with an example.
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary, it could as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:] . Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4 . My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example, the same as dummy[X].
Iteratively reducing dummy by doing something like dummy = dummy[X[:,i],:] with i an iterator over the number of columns of X also does not reduce the ndarray in the example past 3x2x100
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?
I will try to explain @Michael Szczesny's answer.
First, notice that if you have an np.array with n dimensions and pass m indices where m<n, it is the same as using : for each of the remaining dimensions. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
It would be the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
dummy[([0, 1], [0, 0])] == np.array([dummy[(0,0)], dummy[(1,0)]])
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
You are here getting each dimension as an array, and then you will get:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which to me makes more sense.
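If you want to convince yourself that this really returns the rows asked for, a sketch like the following may help. The names idx, result and expected are mine, and the random generator is just a stand-in for the original np.random.rand call.

```python
import numpy as np

rng = np.random.default_rng(0)
dummy = rng.random((5, 2, 100))
X = np.array([[0, 1], [4, 1], [2, 0]])

# Each row of X.T holds the indices for one leading dimension of dummy.
idx = tuple(X.T)          # (array([0, 4, 2]), array([1, 1, 0]))
result = dummy[idx]       # the trailing dimension is kept whole

# Reference: pick the rows one by one with plain integer indexing.
expected = np.array([dummy[i, j, :] for i, j in X])

print(result.shape)                       # (3, 100)
print(np.array_equal(result, expected))   # True
```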
As Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] are the indices of the first dimension of dummy and X[:,1] are the indices of the second dimension, transposing X (X.T) gives you the indices of the first dimension as X.T[0] and those of the second dimension as X.T[1].
Now to slice dummy as you want, you can specify the indices of the first and of the second dimension in this way:
dummy[(first_dim_indices, second_dim_indices)]  # here: dummy[(X.T[0], X.T[1])]
To simplify the code (and since you don't want to transpose the X matrix twice), you can unpack X.T into a tuple as (*X.T,), so writing dummy[(*X.T,)] is the same as writing dummy[(X.T[0], X.T[1])].
This form is also useful if you have a variable number of dimensions to index, because you unpack from X.T as many rows as there are dimensions to index in dummy. For example, suppose you want to retrieve a 1D array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dummy[(*X.T,)] is still valid.
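As a rough check that the same expression copes with more dimensions, here is a sketch using the 5x2x4x6x100 shape mentioned in the question. The index values in X and the extra column [9, 8, 7] are made up for illustration.

```python
import numpy as np

dummy = np.arange(5 * 2 * 4 * 6 * 100).reshape(5, 2, 4, 6, 100)

# One row per item to extract, one column per leading dimension (4 here).
X = np.array([[0, 1, 3, 5],
              [4, 1, 0, 2],
              [2, 0, 1, 1]])

rows = dummy[tuple(X.T)]
print(rows.shape)    # (3, 100)

# Indexing every dimension returns one scalar per row of X_full.
X_full = np.column_stack((X, [9, 8, 7]))
values = dummy[tuple(X_full.T)]
print(values.shape)  # (3,)
```

The expression itself never mentions the number of dimensions; it only depends on how many columns X has.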

What is the best way to keep dimensionality when subarraying numpy arrays?

Suppose I had a standard numpy array such as
a = np.arange(6).reshape((2,3))
When I subarray the array, for example with
a[1, :]
I lose a dimension: the result is 1D and prints as array([3, 4, 5]).
Since the original array is 2D, I want to keep the dimensionality, so I have to do something tedious such as
b=a[1, :]
b.reshape(1, b.size)
Why does numpy decrease dimensionality when subarraying?
What is the best way to keep dimensionality, since a[1, :].reshape(1, a.size) will break?
Just use slicing rather than indexing, and the shape will be preserved:
a[1:2]
Although I agree with John Zwinck's answer, I wanted to provide an alternative in case, for whatever reason, you are forced into using indexing (instead of slicing).
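The OP says that "a[1, :].reshape(1, a.size) will break". That is because a.size counts every element of a (6 here), while the row holds only a.shape[1] elements (3).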
You can add dimensions to numpy arrays like this:
b = a[1]
# array([3, 4, 5])
b = a[1][np.newaxis]
# array([[3, 4, 5]])
(Note that np.newaxis is None, but it's a lot more readable to use the np.newaxis)
As pointed out in the comments (@PaulPanzer and @Divakar), there are actually many ways to accomplish this same thing (again, with indexing instead of slicing).
These do not make a copy (changing data in any of them also changes a):
a[1, None]
a[1, np.newaxis]
a[1].reshape(1, a.shape[1]) # Use shape, not size
This one does make a copy (its data is independent of a):
a[[1]]
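The view-versus-copy claims above can be checked with np.shares_memory. The following is only a sketch: the candidates dictionary and its labels are mine, and it also includes the a[1:2] slice from the other answer for comparison.

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

candidates = {
    "a[1:2]": a[1:2],
    "a[1, None]": a[1, None],
    "a[1, np.newaxis]": a[1, np.newaxis],
    "a[1].reshape(1, a.shape[1])": a[1].reshape(1, a.shape[1]),
    "a[[1]]": a[[1]],
}

# Record, for each variant, its shape and whether it shares memory with a.
report = {label: (sub.shape, np.shares_memory(a, sub))
          for label, sub in candidates.items()}

for label, (shape, is_view) in report.items():
    print(f"{label:30s} shape={shape} shares memory with a: {is_view}")
```

Every variant keeps the (1, 3) shape; only the list-index form a[[1]] reports no shared memory.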

Adding a New Column to an Empty NumPy Array

I'm trying to add a new column to an empty NumPy array and am facing some troubles. I've looked at a lot of other questions, but for some reason they don't seem to be helping me solve the problem I'm facing, so I decided to ask my own question.
I have an empty NumPy array such that:
array1 = np.array([])
Let's say I have data that is of shape (100, 100), and want to append each column to array1 one by one. However, if I do for example:
array1 = np.append(array1, some_data[:, 0])
array1 = np.append(array1, some_data[:, 1])
I noticed that I won't be getting a (100, 2) matrix, but a (200,) array. So I tried to specify the axis as
array1 = np.append(array1, some_data[:, 0], axis=1)
which produces an AxisError: axis 1 is out of bounds for array of dimension 1.
Next I tried to use the np.c_[] method:
array1 = np.c_[array1, some_data[:, 0]]
which gives me a ValueError: all the input array dimensions except for the concatenation axis must match exactly.
Is there any way that I would be able to add columns to the NumPy array sequentially?
Thank you.
EDIT
I learned that my initial question didn't contain enough information for others to offer help, and made this update to make up for the initial mistake.
My overall objective is to write a program that selects features in a "greedy fashion." Basically, I'm trying to take the design matrix some_data, which is a (100, 100) matrix of floating point numbers, and fit a linear regression model with an increasing number of features until I find the best set of features.
For example, since I have a total of 100 features, the first round would fit the model on each of the 100, select the best one and store it, then continue with the remaining 99.
That's the plan I have in mind, but I got stuck at the very beginning with the problem I mentioned.
You start with a (0,) array and (n,) shaped one:
In [482]: arr1 = np.array([])
In [483]: arr1.shape
Out[483]: (0,)
In [484]: arr2 = np.array([1,2,3])
In [485]: arr2.shape
Out[485]: (3,)
np.append uses concatenate (but with some funny business when axis is not provided):
In [486]: np.append(arr1, arr2)
Out[486]: array([1., 2., 3.])
In [487]: np.append(arr1, arr2,axis=0)
Out[487]: array([1., 2., 3.])
In [489]: np.concatenate([arr1, arr2])
Out[489]: array([1., 2., 3.])
And trying axis=1
In [488]: np.append(arr1, arr2,axis=1)
---------------------------------------------------------------------------
AxisError Traceback (most recent call last)
<ipython-input-488-457b8657453e> in <module>()
----> 1 np.append(arr1, arr2,axis=1)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in append(arr, values, axis)
4526 values = ravel(values)
4527 axis = arr.ndim-1
-> 4528 return concatenate((arr, values), axis=axis)
AxisError: axis 1 is out of bounds for array of dimension 1
Look at the whole message - the error occurs in the concatenate step. You can't concatenate 1d arrays along axis=1.
Using np.append or even np.concatenate iteratively is slow (it creates a new array each time) and hard to initialize correctly. It is a poor substitute for the widely used append-to-an-empty-list recipe.
np.c_ is also just a cover function for concatenate.
There isn't just one empty array. np.array([[]]) and np.array([[[]]]) also have 0 elements.
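A short sketch of what "more than one empty array" means in practice: all of the arrays below have zero elements, but only a shape like (3, 0) lines up with 3-row columns. The names empties, start and grown are illustrative.

```python
import numpy as np

# Three "empty" arrays: zero elements each, but different shapes.
empties = [np.array([]), np.array([[]]), np.array([[[]]])]
shapes = [e.shape for e in empties]
sizes = [e.size for e in empties]
print(shapes)  # [(0,), (1, 0), (1, 1, 0)]
print(sizes)   # [0, 0, 0]

# A shape that column-wise concatenation with 3-row columns can accept.
start = np.zeros((3, 0), dtype=int)
grown = np.concatenate([start, np.arange(3)[:, None]], axis=1)
print(grown.shape)  # (3, 1)
```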
If you want to add a column to an array, you need to start with a 2d array, and the column also needs to be 2d.
Here's an example of a proper concatenation of 2 2d arrays:
In [490]: np.concatenate([ np.zeros((3,0),int), np.arange(3)[:,None]], axis=1)
Out[490]:
array([[0],
       [1],
       [2]])
column_stack is another cover function for concatenate that makes sure the inputs are 2d. But even with that getting an initial 'empty' array is tricky.
In [492]: np.column_stack([np.zeros(3,int), np.arange(3)])
Out[492]:
array([[0, 0],
       [0, 1],
       [0, 2]])
In [493]: np.column_stack([np.zeros((3,0),int), np.arange(3)])
Out[493]:
array([[0],
       [1],
       [2]])
np.c_ is a lot like column_stack, though implemented in a different way:
In [496]: np.c_[np.zeros(3,int), np.arange(3)]
Out[496]:
array([[0, 0],
       [0, 1],
       [0, 2]])
The basic message is that when using np.concatenate you need to pay attention to dimensions. Its variants allow you to fudge things a bit, but you really need to understand that fudging to get things right, especially when starting from this poorly defined idea of an 'empty' array.
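For the greedy-selection use case in the question, the append-to-a-list recipe mentioned above might look like the sketch below. The column indices 3, 17 and 42 are hypothetical placeholders for whatever features the selection step picks, and some_data is random stand-in data.

```python
import numpy as np

rng = np.random.default_rng(0)
some_data = rng.random((100, 100))   # stand-in for the design matrix

selected = []                        # plain Python list of 1-D columns
for col in (3, 17, 42):              # hypothetical chosen feature indices
    selected.append(some_data[:, col])

# Build the 2-D array once, after the loop.
array1 = np.column_stack(selected)
print(array1.shape)  # (100, 3)
```

The list grows cheaply, and the array is built in a single call, so there is no need for an 'empty' starting array at all.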
I usually use the concatenate method and do it like this:
# Some stuff
alldata = None
....
array1 = np.random.random((100,1))
if alldata is None: alldata = array1
...
array2 = np.random.random((100,1))
alldata = np.concatenate((alldata,array2),axis=1)
If you are working with vectors:
alldata = None
....
array1 = np.random.random((100,))
if alldata is None: alldata = array1[:,np.newaxis]
...
array2 = np.random.random((100,))
alldata = np.concatenate((alldata,array2[:,np.newaxis]),axis=1)

python: numpy reshaping an array

I am creating a numpy array using the following:
X = np.linspace(-5, 5, num=500)
This generates 500 evenly spaced points between -5 and 5. The shape of the resulting array is (500,). Now, I need to pass it to a function that expects a 2-D array, so I can reshape it as:
X = X.reshape((500, 1))
However, I noticed that X = X[:, None] has the same effect. For the life of me though, I cannot understand what this syntax is doing. Hoping someone can shed some light on this.
The syntax X[:, None] is actually the same as:
X[:, np.newaxis]
which is adding a new dimension to your original array.
The Python interpreter translates
x[:,None]
to
x.__getitem__((slice(None,None,None), None))
and the ndarray implementation of __getitem__ acts in much the same way as x.reshape(500,1). Implementation details will differ, but the effect is the same.
So at a syntax level, it's just normal Python. But the numpy semantics give it a distinctive meaning.
x[:, np.newaxis]
may be clearer, but np.newaxis is just an alias for None:
In [48]: np.newaxis is None
Out[48]: True
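As a small sanity check of the equivalences discussed above, this sketch compares the three spellings on the array from the question. The variable names col_none, col_newaxis, col_reshape and row are mine.

```python
import numpy as np

X = np.linspace(-5, 5, num=500)      # shape (500,)

col_none = X[:, None]                # None adds an axis after the first
col_newaxis = X[:, np.newaxis]       # same thing, spelled out
col_reshape = X.reshape((500, 1))    # explicit target shape
row = X[None, :]                     # new axis in front instead

print(col_none.shape)                          # (500, 1)
print(row.shape)                               # (1, 500)
print(np.array_equal(col_none, col_newaxis))   # True
print(np.array_equal(col_none, col_reshape))   # True
```

One practical difference is that X[:, None] does not need to know the length of X, while the reshape call hard-codes 500.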

numpy vstack vs. column_stack

What exactly is the difference between numpy vstack and column_stack? Reading through the documentation, it looks as if column_stack is an implementation of vstack for 1D arrays. Is it a more efficient implementation? Otherwise, I cannot find a reason for just having vstack.
I think the following code illustrates the difference nicely:
>>> np.vstack(([1,2,3],[4,5,6]))
array([[1, 2, 3],
       [4, 5, 6]])
>>> np.column_stack(([1,2,3],[4,5,6]))
array([[1, 4],
       [2, 5],
       [3, 6]])
>>> np.hstack(([1,2,3],[4,5,6]))
array([1, 2, 3, 4, 5, 6])
I've included hstack for comparison as well. Notice how column_stack stacks along the second dimension whereas vstack stacks along the first dimension. The equivalent to column_stack is the following hstack command:
>>> np.hstack(([[1],[2],[3]],[[4],[5],[6]]))
array([[1, 4],
       [2, 5],
       [3, 6]])
I hope we can agree that column_stack is more convenient.
hstack stacks horizontally, vstack stacks vertically.
The problem with hstack is that when you append a column you need to convert it from a 1d array to a 2d column first, because a 1d array is normally interpreted as a row vector in a 2d context in numpy:
a = np.ones((2, 2)) # 2d, shape = (2, 2)
b = np.array([0, 0]) # 1d, shape = (2,)
hstack((a, b)) -> dimensions mismatch error
So use either hstack((a, b[:, None])) or column_stack((a, b)), where None serves as a shortcut for np.newaxis.
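The mismatch described above can be reproduced with the sketch below. It assumes a is meant to be a 2-D array of shape (2, 2), as its comment says; the flag and result names are illustrative.

```python
import numpy as np

a = np.ones((2, 2))        # 2-D, shape (2, 2)
b = np.array([0, 0])       # 1-D, shape (2,)

try:
    np.hstack((a, b))
    hstack_failed = False
except ValueError:
    # Mixing 2-D and 1-D inputs: the number of dimensions does not match.
    hstack_failed = True

via_hstack = np.hstack((a, b[:, None]))      # make b a (2, 1) column first
via_column_stack = np.column_stack((a, b))   # column_stack does that for us

print(hstack_failed)                                  # True
print(via_hstack.shape)                               # (2, 3)
print(np.array_equal(via_hstack, via_column_stack))   # True
```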
If you're stacking two vectors, you've got three options: hstack, vstack, and column_stack.
As for the (undocumented) row_stack, it is just a synonym of vstack, as a 1d array is ready to serve as a matrix row without extra work.
The case of 3D and above proved to be too huge to fit in the answer, so I've included it in the article called Numpy Illustrated.
In the Notes section to column_stack, it points out this:
This function is equivalent to np.vstack(tup).T.
There are many functions in numpy that are convenient wrappers of other functions. For example, the Notes section of vstack says:
Equivalent to np.concatenate(tup, axis=0) if tup contains arrays that are at least 2-dimensional.
It looks like column_stack is just a convenience function for vstack.
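One caveat worth checking: as far as I can tell, the vstack(tup).T equivalence only holds for 1-D inputs. The sketch below (with made-up arrays x, y, p and q) suggests that for 2-D inputs column_stack behaves like hstack instead, and the transposed vstack gives a different element order.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# 1-D inputs: column_stack matches the transposed vstack.
same_1d = np.array_equal(np.column_stack((x, y)), np.vstack((x, y)).T)

# 2-D inputs: column_stack behaves like hstack, and the transposed
# vstack has the same shape but a different element order.
p = np.array([[1, 2], [3, 4]])
q = np.array([[5, 6], [7, 8]])
same_as_hstack = np.array_equal(np.column_stack((p, q)), np.hstack((p, q)))
same_2d = np.array_equal(np.column_stack((p, q)), np.vstack((p, q)).T)

print(same_1d, same_as_hstack, same_2d)  # True True False
```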
