Adding a New Column to an Empty NumPy Array - python

I'm trying to add a new column to an empty NumPy array and am running into trouble. I've looked at a lot of other questions, but they don't seem to address the problem I'm facing, so I decided to ask my own question.
I have an empty NumPy array such that:
array1 = np.array([])
Let's say I have data that is of shape (100, 100), and want to append each column to array1 one by one. However, if I do for example:
array1 = np.append(array1, some_data[:, 0])
array1 = np.append(array1, some_data[:, 1])
I noticed that I won't be getting a (100, 2) matrix, but a (200,) array. So I tried to specify the axis as
array1 = np.append(array1, some_data[:, 0], axis=1)
which produces an AxisError: axis 1 is out of bounds for array of dimension 1.
Next I tried to use the np.c_[] method:
array1 = np.c_[array1, some_data[:, 0]]
which gives me a ValueError: all the input array dimensions except for the concatenation axis must match exactly.
Is there any way that I would be able to add columns to the NumPy array sequentially?
Thank you.
EDIT
I learned that my initial question didn't contain enough information for others to offer help, and made this update to make up for the initial mistake.
My big objective is to make a program that selects features in a "greedy fashion." Basically, I'm trying to take the design matrix some_data, which is a (100, 100) matrix containing floating point numbers as entries, and fit a linear regression model with an increasing number of features until I find the best set of features.
For example, since I have a total of 100 features, the first round would fit the model on each of the 100 features, select the best one and store it, then continue with the remaining 99.
That's what I'm trying to do in my head, but I got stuck from the beginning with the problem I mentioned.

You start with a (0,)-shaped array and an (n,)-shaped one:
In [482]: arr1 = np.array([])
In [483]: arr1.shape
Out[483]: (0,)
In [484]: arr2 = np.array([1,2,3])
In [485]: arr2.shape
Out[485]: (3,)
np.append uses concatenate (but with some funny business when axis is not provided):
In [486]: np.append(arr1, arr2)
Out[486]: array([1., 2., 3.])
In [487]: np.append(arr1, arr2,axis=0)
Out[487]: array([1., 2., 3.])
In [489]: np.concatenate([arr1, arr2])
Out[489]: array([1., 2., 3.])
And trying axis=1
In [488]: np.append(arr1, arr2,axis=1)
---------------------------------------------------------------------------
AxisError Traceback (most recent call last)
<ipython-input-488-457b8657453e> in <module>()
----> 1 np.append(arr1, arr2,axis=1)
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py in append(arr, values, axis)
4526 values = ravel(values)
4527 axis = arr.ndim-1
-> 4528 return concatenate((arr, values), axis=axis)
AxisError: axis 1 is out of bounds for array of dimension 1
Look at the whole message - the error occurs in the concatenate step. You can't concatenate 1d arrays along axis=1.
Using np.append or even np.concatenate iteratively is slow (it creates a new array each time), and hard to initialize correctly. It is a poor substitute for the widely used list append-to-empty-list recipe.
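For example, a minimal sketch of that recipe applied to the question's setup (some_data here is just a random stand-in):
import numpy as np
some_data = np.random.random((100, 100))   # stand-in for the question's data
cols = []
for i in range(some_data.shape[1]):
    cols.append(some_data[:, i])            # each item has shape (100,)
array1 = np.column_stack(cols)              # one concatenate call -> (100, 100)
Collecting in a plain Python list and concatenating once at the end avoids rebuilding the array on every iteration.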
np.c_ is also just a cover function for concatenate.
There isn't just one empty array. np.array([[]]) and np.array([[[]]]) also have 0 elements.
If you want to add a column to an array, you need to start with a 2d array, and the column also needs to be 2d.
Here's an example of a proper concatenation of 2 2d arrays:
In [490]: np.concatenate([ np.zeros((3,0),int), np.arange(3)[:,None]], axis=1)
Out[490]:
array([[0],
[1],
[2]])
column_stack is another cover function for concatenate that makes sure the inputs are 2d. But even with that, getting an initial 'empty' array is tricky.
In [492]: np.column_stack([np.zeros(3,int), np.arange(3)])
Out[492]:
array([[0, 0],
[0, 1],
[0, 2]])
In [493]: np.column_stack([np.zeros((3,0),int), np.arange(3)])
Out[493]:
array([[0],
[1],
[2]])
np.c_ is a lot like column_stack, though implemented in a different way:
In [496]: np.c_[np.zeros(3,int), np.arange(3)]
Out[496]:
array([[0, 0],
[0, 1],
[0, 2]])
The basic message is that when using np.concatenate you need to pay attention to dimensions. Its variants allow you to fudge things a bit, but you really need to understand that fudging to get things right, especially when starting from this poorly defined idea of an 'empty' array.

I usually use the concatenate method and do it like this:
# Some stuff
alldata = None
....
array1 = np.random.random((100,1))
if alldata is None: alldata = array1
...
array2 = np.random.random((100,1))
alldata = np.concatenate((alldata,array2),axis=1)
In case you are working with vectors:
alldata = None
....
array1 = np.random.random((100,))
if alldata is None: alldata = array1[:,np.newaxis]
...
array2 = np.random.random((100,))
alldata = np.concatenate((alldata,array2[:,np.newaxis]),axis=1)
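For the greedy feature-selection goal described in the question's edit, a minimal sketch along the same lines (np.linalg.lstsq is only a stand-in for whatever model is actually fitted, and the data is random):
import numpy as np
some_data = np.random.random((100, 100))   # stand-in design matrix
y = np.random.random(100)                  # stand-in target
def sse(X, y):
    # sum of squared errors of a least-squares fit of y on X
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((X @ coef - y) ** 2)
selected = []                               # indices of the chosen features
remaining = list(range(some_data.shape[1]))
for _ in range(5):                          # greedily pick, say, 5 features
    best = min(remaining, key=lambda j: sse(some_data[:, selected + [j]], y))
    selected.append(best)
    remaining.remove(best)
design = some_data[:, selected]             # the (100, 5) matrix, built in one step
Note that the growing object is a list of column indices, not an array; the 2d array is only built at the end with one indexing operation.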

Related

Extract 2d ndarray from arbitrarily dimensional ndarray using index arrays

I want to extract parts of a NumPy ndarray based on arrays of index positions for some of the dimensions. Let me show this with an example.
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary; it could just as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:] . Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4 . My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example, the same as dummy[X]
Iteratively reducing dummy by doing something like dummy = dummy[X[:,i],:] with i an iterator over the number of columns of X also does not reduce the ndarray in the example past 3x2x100
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?
I will try to provide some explanation of Michael Szczesny's answer.
First, notice that if you have an np.array with dimension n and pass m indices where m < n, it will be the same as using : for the dimensions >= m. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
It would be the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
dummy[([0, 1], [0, 0])] == np.array([dummy[(0,0)], dummy[(1,0)]])
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
Here you are getting the indices for each dimension as an array, and then you will get:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which, for me, makes more sense.
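A short runnable check of that indexing, using the shapes from the question:
import numpy as np
dummy = np.random.rand(5, 2, 100)
X = np.array([[0, 1], [4, 1], [2, 0]])
out = dummy[tuple(X.T)]                        # same as dummy[(*X.T,)]
print(out.shape)                               # (3, 100)
print(np.array_equal(out[0], dummy[0, 1, :]))  # True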
As Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] are the indices of the first dimension of dummy and X[:,1] are the indices of the second dimension of dummy, if you transpose X (X.T) you'll have the indices of the first dimension of dummy as X.T[0] and the indices of the second dimension of dummy as X.T[1].
Now to slice dummy as you want, you can specify the indices of the first and of the second dimension in this way:
dummy[(first_dim_indices, second_dim_indices)] = dummy[(X.T[0], X.T[1])]
In order to simplify the code (and since you don't want to transpose the X matrix twice), you can unpack X.T into a tuple as (*X.T,), so writing dummy[(*X.T,)] is the same as writing dummy[(X.T[0], X.T[1])].
This form is also useful if the number of dimensions to slice through is not fixed, because you will unpack from X.T as many rows as there are dimensions to slice in dummy. For example, suppose you want to retrieve a 1D array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dummy[(*X.T,)] is still valid.
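A self-contained sketch of that three-column case (the dummy shape is assumed to be (5, 2, 100) as in the question):
import numpy as np
dummy = np.random.rand(5, 2, 100)
X = np.array([[0, 1, 9], [4, 1, 8], [2, 0, 7]])
out = dummy[(*X.T,)]               # all three dimensions are indexed
print(out.shape)                   # (3,) -- one scalar per row of X
print(out[1] == dummy[4, 1, 8])    # True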

Filtering data in arrays and keeping the same shape in axis 1

I'm trying to get some specific values given a dataframe using pandas and numpy.
My process right now is as it follows:
In[1]: df = pd.read_csv(file)
In[2]: a = df[df.columns[1]].values
Right now a has the following shape:
In[3]: a.shape
Out[4]: (8640, 1)
When I filter it out to get the values that match a given condition I don't get the same shape in the axis 1:
In[5]: b = a[a>100]
In[6]: b.shape
Out[7]: (3834,)
Right now I'm reshaping the new arrays every time I filter them; however, this is making my code look really messy and uncomfortable:
In[8]: (b.reshape(b.size,1)).shape
Out[9]: (3834, 1)
I really need it to have the shape (x, 1) in order to use some other functions, so is there any way to get that shape every time I filter out the values without having to reshape constantly?
EDIT:
The main reason I need to do this reshaping is that I need to get the minimum value in every row for two arrays with the same number of rows. What I use is np.min and np.concatenate.
For example:
av is the mean of 5 different columns in my dataframe:
av = np.mean(myColumns,axis=1)
Which has shape (8640, )
med is the median for the same columns:
med = np.median(myColumns,axis=1)
And when I try to get the minimum values I have the next error:
np.min(np.concatenate((av,med),axis=1),axis=1)
Traceback (most recent call last):
  File "", line 1, in
    np.min(np.concatenate((av,med),axis=1),axis=1)
AxisError: axis 1 is out of bounds for array of dimension 1
However, if I reshape av and med it works fine:
np.min(np.concatenate((av.reshape(av.size,1),med.reshape(av.size,1)),axis=1),axis=1)
Out[232]: array([0., 0., 0., ..., 0., 0., 0.])
You can use np.take(a, np.where(a>100)[0], axis=0) to keep the same shape as the original.
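A quick sketch of why this keeps the trailing axis, assuming a has shape (8640, 1) as in the question:
import numpy as np
a = np.random.rand(8640, 1) * 200
b = np.take(a, np.where(a > 100)[0], axis=0)
print(b.shape)   # (k, 1) -- still 2d, unlike boolean indexing, which gives (k,)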
If you really need this shape, this code gives shape (..., 1) and it's not that ugly:
np.expand_dims(a, 1)
or
a[:, np.newaxis]
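Applied to the min-per-row example from the edit (a hedged sketch; av and med are assumed to be the (8640,) arrays described in the question):
import numpy as np
av = np.random.rand(8640)
med = np.random.rand(8640)
stacked = np.concatenate((av[:, np.newaxis], med[:, np.newaxis]), axis=1)
row_min = np.min(stacked, axis=1)   # shape (8640,)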
If your code is not too heavy that you have to use numpy for performance, you can stick with pandas objects (DataFrame/Series) and maintain shape.
For example, take this example df (which, I must add, you should've provided with your question):
df = pd.DataFrame(data=np.random.rand(7,3), columns=['a','b','c'])
df
a b c
0 0.382530 0.748674 0.186446
1 0.142991 0.965972 0.299884
2 0.568910 0.469341 0.896786
3 0.452816 0.021598 0.989637
4 0.884955 0.738519 0.082460
5 0.944797 0.103953 0.287005
6 0.379389 0.593280 0.832720
To create an object with shape (7,1) you can use x = df[['a']], which is a dataframe with one column (compare with x=df['a'], which is a Series with shape (7,)).
Now, if I move on to a numpy array by using y=x.values, I still get the same shape (both x and y have shapes (7,1)).
However, both react differently to boolean indexing: calling y[y>0.3] will return an array with shape (6,), while calling x[x>0.3] will return ... a dataframe with shape (7,1). Let's see:
array:
y[y>0.3]
array([0.38252971, 0.56890993, 0.45281553, 0.88495521, 0.94479716,
0.37938899])
dataframe:
x[x>0.3]
a
0 0.382530
1 NaN
2 0.568910
3 0.452816
4 0.884955
5 0.944797
6 0.379389
So, to get a dataframe with the shape that you want, (6, 1), you can use
x[x['a']>0.3]
which returns
a
0 0.382530
2 0.568910
3 0.452816
4 0.884955
5 0.944797
6 0.379389
And then, only after doing all of your manipulations, you can call .values at the end to obtain a numpy array with the desired result.
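For instance, a condensed sketch of the whole route (the column name 'a' and the 0.3 threshold are just the example values used above):
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.rand(7, 3), columns=['a', 'b', 'c'])
x = df[['a']]                 # DataFrame, shape (7, 1)
filtered = x[x['a'] > 0.3]    # still 2d; rows that fail the condition are dropped
result = filtered.values      # numpy array with shape (k, 1)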
Now, generally speaking, manipulations on arrays are faster than on pandas objects, but working with pandas objects is easier, especially if you have lots of data processing to do.
You might prefer to work with numpy all the way, but the pandas option is worth knowing, and in my opinion is easier and simpler.
Hope this helps!

Converting OpenCV SURF features to float32 arrays in Python

I extract the features with the compute() function and add them to a list. I then try to convert all the features to float32 using NumPy so that they can be used with OpenCV for classification. The error I am getting is:
ValueError: setting an array element with a sequence.
Not really sure what I can do about this. I am following a book and doing the same steps except they use HOS to extract the features. I am extracting the features and getting back matrices of inconsistent sizes and am not sure how I can make them all equal. Related code (which might have minor syntax errors because I truncated it from the original code):
def get_SURF_feature_vector(area_of_interest, surf):
    # Detect the key points in the image
    key_points = surf.detect(area_of_interest);
    # Create array of zeros with the same shape and type as a given array
    image_key_points = np.zeros_like(area_of_interest);
    # Draw key points on the image
    image_key_points = cv2.drawKeypoints(area_of_interest, key_points, image_key_points, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
    # Create feature descriptors
    key_points, feature_descriptors = surf.compute(area_of_interest, key_points);
    # Plot image and descriptors
    # plt.imshow(image_key_points);
    # Return computed feature description matrix
    return feature_descriptors;

for x in range(0, len(data)):
    feature_list.append(get_SURF_feature_vector(area_of_interest[x], surf));

list_of_features = np.array(list_of_features, dtype=np.float32);
The error isn't specific to OpenCV at all, just numpy.
Your list feature_list contains different length arrays. You can't make a 2d array out of arrays of different sizes.
For e.g. you can reproduce the error really simply:
>>> np.array([[1], [2, 3]], dtype=np.float32)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
I'm assuming what you're expecting from the operation is to input [1], [2, 3] and be returned np.array([1, 2, 3]), i.e., concatenation (actually this is not what OP wants, see the comments under this post). You can use the np.hstack() or np.vstack() functions for those operations, just depending on the shape of your input. You can use np.concatenate() too with the axis argument, but the stacking operations are more explicit for 2D/3D arrays.
>>> a = np.array([1], dtype=np.float32)
>>> b = np.array([2, 3, 4], dtype=np.float32)
>>> np.hstack([a, b])
array([1., 2., 3., 4.], dtype=float32)
Descriptors are listed vertically though, so they should be stacked vertically, not horizontally as above. Thus you can simply do:
list_of_features = np.vstack(list_of_features)
You don't need to specify dtype=np.float32 as the descriptors are np.float32 by default (also, vstack doesn't have a dtype argument so you'd have to convert it after the stacking operation).
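A small sketch of that vstack route (the descriptor counts are made up; 64 is the SURF descriptor length without the extended flag):
import numpy as np
des1 = np.random.rand(500, 64).astype(np.float32)
des2 = np.random.rand(200, 64).astype(np.float32)
stacked = np.vstack([des1, des2])
print(stacked.shape, stacked.dtype)   # (700, 64) float32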
If you instead want a 3D array, then you need the same number of features across all images so that it's an evenly filled 3D array. You could just fill up your feature vectors with placeholder values, like 0s or np.nan, so that they're all the same length, and then you can group them together as you did originally.
>>> des1 = np.random.rand(500, 64).astype(np.float32)
>>> des2 = np.random.rand(200, 64).astype(np.float32)
>>> des3 = np.random.rand(400, 64).astype(np.float32)
>>> feature_descriptors = [des1, des2, des3]
So here each image's feature descriptors have a different number of features. You can find the largest one:
>>> max_des_length = max([len(d) for d in feature_descriptors])
>>> max_des_length
500
You can use np.pad() to pad each feature array with however many more values it needs to be the same size as your maximum size descriptor set.
Now this is a little unnecessary to do it all in one line, but whatever.
>>> feature_descriptors = [np.pad(d, ((0, (max_des_length - len(d))), (0, 0)), 'constant', constant_values=np.nan) for d in feature_descriptors]
The annoying argument here, ((0, (max_des_length - len(d))), (0, 0)), is just saying to pad with 0 elements on the top, max_des_length - len(d) elements on the bottom, 0 on the left, and 0 on the right.
As you can see here, I'm adding np.nan values to the arrays. If you left out the constant_values argument it defaults to 0. Lastly all you have to do is cast as a numpy array:
>>> feature_descriptors = np.array(feature_descriptors)
>>> feature_descriptors.shape
(3, 500, 64)

What is the difference between an array with shape (N,1) and one with shape (N)? And how to convert between the two?

Python newbie here coming from a MATLAB background.
I have a 1 column array and I want to move that column into the first column of a 3 column array. With a MATLAB background this is what I would do:
import numpy as np
A = np.zeros([150,3]) #three column array
B = np.ones([150,1]) #one column array which needs to replace the first column of A
#MATLAB-style solution:
A[:,0] = B
However this does not work because the "shape" of A is (150,3) and the "shape" of B is (150,1). And apparently the command A[:,0] results in a "shape" of (150).
Now, what is the difference between (150,1) and (150)? Aren't they the same thing: a column vector? And why isn't Python "smart enough" to figure out that I want to put the column vector, B, into the first column of A?
Is there an easy way to convert a 1-column vector with shape (N,1) to a 1-column vector with shape (N)?
I am new to Python and this seems like a really silly thing that MATLAB does much better...
Several things are different. In numpy, arrays may be 0d, 1d, or higher. In MATLAB, 2d is the smallest (and at one time the only) dimensionality. MATLAB readily expands dimensions at the end because it is Fortran ordered. numpy, which is C ordered by default, most readily expands dimensions at the front.
In [1]: A = np.zeros([5,3])
In [2]: A[:,0].shape
Out[2]: (5,)
Simple indexing reduces a dimension, regardless of whether it's A[0,:] or A[:,0]. Contrast that with what happens to a 3d MATLAB matrix, A(1,:,:) vs A(:,:,1).
numpy does broadcasting, adjusting dimensions during operations like sum and assignment. One basic rule is that dimensions may be automatically expanded toward the start if needed:
In [3]: A[:,0] = np.ones(5)
In [4]: A[:,0] = np.ones([1,5])
In [5]: A[:,0] = np.ones([5,1])
...
ValueError: could not broadcast input array from shape (5,1) into shape (5)
It can change (5,) LHS to (1,5), but can't change it to (5,1).
Another broadcasting example, +:
In [6]: A[:,0] + np.ones(5);
In [7]: A[:,0] + np.ones([1,5]);
In [8]: A[:,0] + np.ones([5,1]);
Now the (5,) works with the (5,1), but that's because it becomes (1,5), which together with (5,1) produces (5,5) - an outer-product style of broadcasting:
In [9]: (A[:,0] + np.ones([5,1])).shape
Out[9]: (5, 5)
In Octave
>> x = ones(2,3,4);
>> size(x(1,:,:))
ans =
1 3 4
>> size(x(:,:,1))
ans =
2 3
>> size(x(:,1,1) )
ans =
2 1
>> size(x(1,1,:) )
ans =
1 1 4
To do the assignment that you want, you can adjust either side.
Index in a way that preserves the number of dimensions:
In [11]: A[:,[0]].shape
Out[11]: (5, 1)
In [12]: A[:,[0]] = np.ones([5,1])
transpose the (5,1) to (1,5):
In [13]: A[:,0] = np.ones([5,1]).T
flatten/ravel the (5,1) to (5,):
In [14]: A[:,0] = np.ones([5,1]).flat
In [15]: A[:,0] = np.ones([5,1])[:,0]
squeeze, ravel also work.
Some quick tests in Octave indicate that it is more forgiving when it comes to dimension mismatches. But numpy prioritizes consistency. Once the broadcasting rules are understood, the behavior makes sense.
Use the squeeze method to eliminate the dimensions of size 1.
A[:,0] = B.squeeze()
Or just create B one-dimensional to begin with:
B = np.ones([150])
The fact that NumPy maintains a distinction between a 1D array and a 2D array with one of its dimensions equal to 1 is reasonable, especially when one begins working with n-dimensional arrays.
To answer the question in the title: there is an evident structural difference between an array of shape (3,) such as
[1, 2, 3]
and an array of shape (3, 1) such as
[[1], [2], [3]]
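To answer the "how to convert" part directly, a minimal sketch (the names are illustrative):
import numpy as np
col = np.ones((150, 1))      # shape (150, 1)
flat = col.ravel()           # shape (150,); squeeze() or reshape(-1) also work
back = flat[:, np.newaxis]   # shape (150, 1); reshape(-1, 1) also works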

About numpy's concatenate, hstack, vstack functions?

See some examples
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.concatenate((a,b), axis=0)) # [1,2,3,4,5,6]
print(np.hstack((a,b))) # [1,2,3,4,5,6]
print(np.vstack((a,b))) # [[1,2,3],[4,5,6]]
print(np.concatenate((a,b), axis=1)) # IndexError: axis 1 out of bounds [0, 1)
The result of hstack is the same as concatenate along axis=0, but the API documentation says hstack is equivalent to concatenation along axis=1; see https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.hstack.html#numpy.hstack
And concatenating along axis=1 raises an IndexError, while the API documentation says vstack is equivalent to concatenation along axis=0; see https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.vstack.html#numpy.vstack
Can anybody explain it? By the way, can anybody explain how to broadcast when the ndarray's dimension is less than 2 so that concatenating along axis=1 works?
Look at the actual code for hstack:
arrs = [atleast_1d(_m) for _m in tup]
# As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
if arrs[0].ndim == 1:
    return _nx.concatenate(arrs, 0)
else:
    return _nx.concatenate(arrs, 1)
I don't see anything in the docs about axis=1. The term it uses is 'stack them horizontally'.
As I noted a year ago in Concatenation of 2 1D numpy arrays along 2nd axis, earlier versions didn't raise an error if the axis was too high. But in 1.12 we get an error.
There is a newish np.stack that can add a dimension where needed:
In [46]: np.stack((np.arange(3), np.arange(4,7)),axis=1)
Out[46]:
array([[0, 4],
[1, 5],
[2, 6]])
The base function is concatenate. The various stack functions adjust array dimensions in one way or other, and then do concatenate. Look at their code to see the details. (I've summarized the differences in earlier posts as well).
np.hstack(tup) and np.concatenate(tup, axis=1) are indeed equivalent, but only if tup contains arrays that are at least 2-dimensional. This was in fact spelled out in the documentation for vstack, so it looks like it was just an oversight that it did not also appear in the documentation for hstack; it will in future versions, though.
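As for the broadcasting part of the question, a hedged sketch: give each 1d array a second axis yourself, and then axis=1 concatenation works:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(np.concatenate((a[:, None], b[:, None]), axis=1))
# [[1 4]
#  [2 5]
#  [3 6]]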
