Copy columns of subarray in Numpy - python

Given an array X of shape (100,8192), I want to copy the subarrays of length 8192 for each of the 100 outer dimensions 10 times, so that the resulting array has shape (100,8192,10).
I'm confused about how the tile function works. For a 1d array of shape (8192,), I can create a 2d array by copying it like this: np.tile(x,(10,1)).transpose() (probably not very elegant). But for a 2d array, I have no idea what tile actually does when given a tuple of values; the documentation is unclear about that.
Can anybody tell me how to do this please?
EDIT: Example, given the 2d array:
In [229]: x
Out[229]:
array([[1, 2, 3],
[4, 5, 6]])
In this case, copying along the columns 3 times, I want to get the following array:
In [233]: y
Out[233]:
array([[[1, 1, 1],
[2, 2, 2],
[3, 3, 3]],
[[4, 4, 4],
[5, 5, 5],
[6, 6, 6]]])

One way to do this is using np.repeat, e.g.:
Let X be the array of shape (100,8192). To replicate each subarray of length 8192 ten times along a new last axis, do the following:
X_new = np.repeat(X,10).reshape(100,8192,10)
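As a quick check, here is a minimal sketch that applies the same repeat-and-reshape idea to the small 2x3 example from the question. The np.broadcast_to variant is an alternative added here for comparison, not part of the answer above, and it assumes a read-only view is acceptable.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# repeat flattens x and repeats each element 3 times in a row;
# reshaping to (2, 3, 3) then puts the 3 copies on a new last axis.
y = np.repeat(x, 3).reshape(2, 3, 3)

# Alternative (assumption: a read-only view is acceptable): broadcast a
# trailing length-1 axis to length 3 without copying any data.
y_view = np.broadcast_to(x[:, :, None], (2, 3, 3))

print(y.shape)  # (2, 3, 3)
```

For the full-size case the same pattern would read np.repeat(X, 10).reshape(100, 8192, 10).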

Are you really asking for a shape of (100,8192,10)? From your description, I would rather have expected something like (100,10,8192). Could you provide an example? If you're actually asking for (100,10,8192), maybe you want:
np.tile(x,10).reshape((100,10,8192))
Is that what you're asking for?
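Since the question asks what tile does with a tuple, here is a small sketch on the 2x3 example. Each entry of the tuple is the number of repetitions along the corresponding axis, and a single integer is treated as repetitions along the last axis. The final transpose is added here only to show how the two layouts relate.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# reps=(2, 1): repeat 2 times along axis 0, once along axis 1.
stacked_rows = np.tile(x, (2, 1))      # shape (4, 3)

# reps=3 is treated as (1, 3): each row is laid out 3 times side by side,
tiled = np.tile(x, 3)                  # shape (2, 9)
# so splitting each row into 3 chunks gives the (rows, copies, length) layout.
z = tiled.reshape(2, 3, 3)

# Swapping the last two axes gives the (rows, length, copies) layout instead.
y = z.transpose(0, 2, 1)
```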

Related

How to broadcast update operation to numpy array

Say I have an array
A = [[0,1], [2,3]]
which is shape (2,2).
Then say I want to update the 0th row of A to [4,4], and the 1st row of A to [8,8], where the output is a new array of shape (2,2,2):
C = [[[4,4],[2,3]],
[[0,1],[8,8]]]
I want to do this without using a for-loop, i.e., I want to do this using numpy's vectorizing features.
Thanks
You're concatenating and reshaping three arrays:
np.concatenate(([[4, 4]], A, [[8, 8]]), axis=0).reshape(2, 2, 2).transpose(1, 0, 2)
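The sketch below checks that one-liner against the expected C and adds a more general alternative. The alternative (copy A once per update, then overwrite one row in each copy) is not from the answer above, and the names new_rows and C_alt are made up for illustration.

```python
import numpy as np

A = np.array([[0, 1],
              [2, 3]])
new_rows = np.array([[4, 4],
                     [8, 8]])

# One-liner from the answer above.
C = (np.concatenate(([[4, 4]], A, [[8, 8]]), axis=0)
       .reshape(2, 2, 2)
       .transpose(1, 0, 2))

# Alternative sketch: make one copy of A per update, then overwrite
# row i of copy i with new_rows[i] using paired integer indices.
n = len(new_rows)
C_alt = np.repeat(A[None, :, :], n, axis=0)
C_alt[np.arange(n), np.arange(n)] = new_rows
```

The alternative scales to any number of rows without having to work out the concatenation order by hand.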

Swapping the dimensions of a numpy array using Ellipsis?

This code swaps the first and last channels of an RGB image that is loaded into a Numpy array:
img = imread('image1.jpg')
# Convert from RGB -> BGR
img = img[..., [2, 1, 0]]
While I understand the use of Ellipsis for slicing in Numpy arrays, I couldn't understand the use of Ellipsis here. Could anybody explain what exactly is happening here?
tl;dr
img[..., [2, 1, 0]] produces the same result as taking the slices img[:, :, i] for each i in the index array [2, 1, 0], and then stacking the results along the last dimension of img. In other words:
img[..., [2,1,0]]
will produce the same output as:
np.stack([img[:,:,2], img[:,:,1], img[:,:,0]], axis=2)
The ellipsis ... is a placeholder that tells numpy which axis to apply the index array to. Without the ... the index array will be applied to the first axis of img instead of the last. Thus, without ..., the index statement:
img[[2,1,0]]
will produce the same output as:
np.stack([img[2,:,:], img[1,:,:], img[0,:,:]], axis=0)
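Both equivalences can be checked on a tiny stand-in array. This is a minimal sketch: the (2, 4, 3) array is a hypothetical substitute for a real image, and the first-axis example uses [1, 0] because that axis only has length 2 here.

```python
import numpy as np

# Hypothetical stand-in for an image: height 2, width 4, 3 channels.
img = np.arange(24).reshape(2, 4, 3)

# With the ellipsis, the index array is applied to the last axis.
last = img[..., [2, 1, 0]]
last_stacked = np.stack([img[:, :, 2], img[:, :, 1], img[:, :, 0]], axis=2)

# Without it, the index array is applied to the first axis.
first = img[[1, 0]]
first_stacked = np.stack([img[1, :, :], img[0, :, :]], axis=0)

print(np.array_equal(last, last_stacked))    # True
print(np.array_equal(first, first_stacked))  # True
```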
What the docs say
This is an example of what the docs call "Combining advanced and basic indexing":
When there is at least one slice (:), ellipsis (...) or np.newaxis in the index (or the array has more dimensions than there are advanced indexes), then the behaviour can be more complicated. It is like concatenating the indexing result for each advanced index element.
It goes on to describe that in this
case, the dimensions from the advanced indexing operations [in your example [2, 1, 0]] are inserted into the result array at the same spot as they were in the initial array (the latter logic is what makes simple advanced indexing behave just like slicing).
The 2D case
The docs aren't the easiest to understand, but in this case it's not too hard to pick apart. Start with a simpler 2D case:
arr = np.arange(12).reshape(4,3)
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
Using the same kind of advanced indexing with a single index value yields:
arr[:, [1]]
array([[ 1],
[ 4],
[ 7],
[10]])
which is the column of arr at index 1. In other words, it's as if you collected all possible values from arr while holding the index of the last axis fixed. As @hpaulj said in his comment, the ellipsis is there to act as a placeholder. It effectively tells numpy to iterate freely over all of the axes except for the last, to which the indexing array is applied.
You can also use this indexing syntax to shuffle the columns of arr around however you'd like:
arr[..., [1,0,2]]
array([[ 1, 0, 2],
[ 4, 3, 5],
[ 7, 6, 8],
[10, 9, 11]])
This is essentially the same operation as in your example, but on a 2D array instead of a 3D one.
You can explain what's going on with arr[..., [1,0,2]] by breaking it down to simpler indexing ops. It's kind of like you first take the return value of arr[..., [1]]:
array([[ 1],
[ 4],
[ 7],
[10]])
then the return value of arr[..., [0]]:
array([[0],
[3],
[6],
[9]])
then the return value of arr[..., [2]]:
array([[ 2],
[ 5],
[ 8],
[11]])
and then finally concatenate all of those results into a single array of shape (*arr.shape[:-1], len(ix)), where ix = [1, 0, 2] is the index array. The data along the last axis are ordered according to their order in ix.
One good way to understand exactly what the ellipsis is doing is to perform the same op without it:
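The take-slices-then-concatenate description can be written out directly. This is a minimal sketch of that mental model, not how numpy implements the indexing internally; the names pieces and rebuilt are just for illustration.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)
ix = [1, 0, 2]

# Take one single-column slice per index value, keeping the column axis...
pieces = [arr[..., [i]] for i in ix]          # each has shape (4, 1)
# ...then join them along the last axis, in the order given by ix.
rebuilt = np.concatenate(pieces, axis=-1)

print(np.array_equal(rebuilt, arr[..., ix]))  # True
```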
arr[[1,0,2]]
array([[3, 4, 5],
[0, 1, 2],
[6, 7, 8]])
In this case, the index array is applied to the first axis of arr, so the output is an array containing the [1,0,2] rows of arr. Adding an ... before the index array tells numpy to apply the index array to the last axis of arr instead.
Your 3D case
The case you asked about is the 3D equivalent of the 2D arr[..., [1,0,2]] example above. Say that img.shape is (480, 640, 3). You can think about img[..., [2, 1, 0]] as looping over each value i in ix=[2, 1, 0]. For every i, the indexing operation will gather the slab of shape (480, 640, 1) that lies along the ith index of the last axis of img. Once all three slabs are collected, the final result will be the equivalent of concatenating along their last axis (and in the order they were found).
Notes
The only difference between arr[..., [1]] and arr[:,1] is that arr[..., [1]] preserves the number of dimensions of the original array: it has shape (4, 1), whereas arr[:,1] has shape (4,).
For a 2D array, arr[:, [1]] is equivalent to arr[..., [1]]. : acts as a placeholder just like ..., but only for a single dimension.
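A short sketch of these two notes, using the same arr as above:

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

col_2d = arr[..., [1]]   # list index: the column axis is kept
col_1d = arr[:, 1]       # integer index: the column axis is dropped

print(col_2d.shape)  # (4, 1)
print(col_1d.shape)  # (4,)

# For a 2D array, : and ... select the same thing.
print(np.array_equal(arr[:, [1]], arr[..., [1]]))  # True
```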

Numpy horizontal concat with failure

I want to concatenate two numpy arrays with the shape (100,3) and (100,7) to get a (100,10) matrix.
I've tried hstack and concatenate, but I only get a ValueError: all the input arrays must have same number of dimensions
In a dummy example like the following it works ...
x=np.arange(30).reshape(10,3)
y=np.arange(20).reshape(10,2)
np.concatenate((x,y), axis=1)
UPDATE 1:
I've created the two matrices with sklearn's preprocessing module (RobustScaler and OneHotEncoder).
UPDATE 2:
When using scipy.sparse.hstack it works, but why?
The sparse hstack joins the coo attributes and builds a new coo sparse matrix from those. The numpy hstack knows nothing about the different sparse structure. To explain this further I'd have to explain sparse construction, and quote from the respective functions.
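To make the distinction concrete, here is a hedged sketch with stand-in data. It assumes the OneHotEncoder output is a SciPy sparse matrix, as this answer implies; the arrays below are made up and only mimic a dense block next to a sparse block.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the two feature blocks in the question:
# a dense scaled block and a sparse one-hot block.
dense_part = np.arange(12, dtype=float).reshape(4, 3)
onehot_part = sparse.csr_matrix(np.eye(4)[:, :2])

# Option 1: stay sparse. Convert the dense block and use the sparse hstack.
combined_sparse = sparse.hstack([sparse.csr_matrix(dense_part), onehot_part])

# Option 2: go dense. Convert the sparse block, then numpy's hstack works.
combined_dense = np.hstack((dense_part, onehot_part.toarray()))

print(combined_sparse.shape, combined_dense.shape)  # (4, 5) (4, 5)
```

Which option is preferable depends on the size of the one-hot block: converting a very wide sparse matrix with toarray() can use a lot of memory.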
If you want to concatenate vertically, axis must be equal to 0. This is explained in the doc for concatenate.
The documentation gives this example:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
array([[1, 2],
[3, 4],
[5, 6]])
np.concatenate((a, b.T), axis=1)
array([[1, 2, 5],
[3, 4, 6]])
This works perfectly fine for me:
import numpy as np
x=np.arange(100 * 3).reshape(100,3)
y=np.arange(100 * 7).reshape(100,7)
np.hstack((x,y)).shape # (100, 10)

Numpy: get 1D array as 2D array without reshape

I need to hstack multiple arrays with the same number of rows (although the number of rows varies between uses) but different numbers of columns. However, some of the arrays only have one column, e.g.
array = np.array([1,2,3,4,5])
which gives
#array.shape = (5,)
but I'd like to have the shape recognized as a 2d array, eg.
#array.shape = (5,1)
So that hstack can actually combine them.
My current solution is:
array = np.atleast_2d([1,2,3,4,5]).T
#array.shape = (5,1)
So I was wondering, is there a better way to do this? Would
array = np.array([1,2,3,4,5]).reshape(len([1,2,3,4,5]), 1)
be better?
Note that my use of [1,2,3,4,5] is just a toy list to make the example concrete. In practice it will be a much larger list passed into a function as an argument. Thanks!
Check the code of hstack and vstack. One, or both of those, pass the arguments through atleast_nd. That is a perfectly acceptable way of reshaping an array.
Some other ways:
arr = np.array([1,2,3,4,5]).reshape(-1,1) # saves the use of len()
arr = np.array([1,2,3,4,5])[:,None] # adds a new dim at end
np.array([1,2,3],ndmin=2).T # used by column_stack
hstack and vstack transform their inputs with:
arrs = [atleast_1d(_m) for _m in tup] # hstack
[atleast_2d(_m) for _m in tup] # vstack
test data:
a1=np.arange(2)
a2=np.arange(10).reshape(2,5)
a3=np.arange(8).reshape(2,4)
np.hstack([a1.reshape(-1,1),a2,a3])
np.hstack([a1[:,None],a2,a3])
np.column_stack([a1,a2,a3])
result:
array([[0, 0, 1, 2, 3, 4, 0, 1, 2, 3],
[1, 5, 6, 7, 8, 9, 4, 5, 6, 7]])
If you don't know ahead of time which arrays are 1d, then column_stack is easiest to use. The others require a little function that tests for dimensionality before applying the reshaping.
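The "little function" mentioned above could look like the sketch below. The helper name as_columns is hypothetical, not a numpy function; it turns 1d input into an (N, 1) column and leaves 2d input alone.

```python
import numpy as np

def as_columns(a):
    """Hypothetical helper: return 1d input as an (N, 1) column, leave 2d input alone."""
    a = np.asarray(a)
    return a[:, None] if a.ndim == 1 else a

a1 = np.arange(2)
a2 = np.arange(10).reshape(2, 5)
a3 = np.arange(8).reshape(2, 4)

stacked = np.hstack([as_columns(a) for a in (a1, a2, a3)])

# column_stack gives the same result without the helper.
print(np.array_equal(stacked, np.column_stack([a1, a2, a3])))  # True
```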
Numpy: use reshape or newaxis to add dimensions
If I understand your intent correctly, you wish to convert an array of shape (N,) to an array of shape (N,1) so that you can apply np.hstack:
In [147]: np.hstack([np.atleast_2d([1,2,3,4,5]).T, np.atleast_2d([1,2,3,4,5]).T])
Out[147]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
In that case, you could avoid reshaping the arrays and use np.column_stack instead:
In [151]: np.column_stack([[1,2,3,4,5], [1,2,3,4,5]])
Out[151]:
array([[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
I followed Ludo's work and just changed the size of v from 5 to 10000. I ran the code on my PC and the result shows that atleast_2d seems to be a more efficient method in the larger scale case.
import numpy as np
import timeit
v = np.arange(10000)
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
The result is:
atleast2d: 1.3809496470021259
reshape: 27.099974197000847
v[:,None]: 28.58291715100131
np.array(v,ndmin=2).T: 30.141663907001202
My suggestion is to use [:, None] when dealing with a short vector and np.atleast_2d when the vector is longer.
Just to add info on hpaulj's answer. I was curious about how fast the four methods described were. The winner is the method adding a column at the end of the 1d array.
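One possible reason for the large gap in these timings, offered as an observation rather than a measured conclusion: when v is already an ndarray, np.array(v) copies it by default, while np.atleast_2d(v) returns a view. The three slower variants in the benchmark all go through np.array(v). The sketch below only checks which results share memory with v; it does not time anything.

```python
import numpy as np

v = np.arange(10000)

view_a = np.atleast_2d(v).T        # no copy: reuses v's buffer
view_b = v[:, None]                # no copy either when v is already an array
copy_c = np.array(v)[:, None]      # np.array(v) copies v first

print(np.shares_memory(view_a, v))  # True
print(np.shares_memory(view_b, v))  # True
print(np.shares_memory(copy_c, v))  # False
```

If v is already an array, indexing it directly with v[:, None] avoids the copy as well.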
Here is what I ran:
import numpy as np
import timeit
v = [1,2,3,4,5]
print('atleast2d:',timeit.timeit(lambda:np.atleast_2d(v).T))
print('reshape:',timeit.timeit(lambda:np.array(v).reshape(-1,1))) # saves the use of len()
print('v[:,None]:', timeit.timeit(lambda:np.array(v)[:,None])) # adds a new dim at end
print('np.array(v,ndmin=2).T:', timeit.timeit(lambda:np.array(v,ndmin=2).T)) # used by column_stack
And the results:
atleast2d: 4.455070924214851
reshape: 2.0535152913971615
v[:,None]: 1.8387219828073285
np.array(v,ndmin=2).T: 3.1735243063353664

numpy vstack vs. column_stack

What exactly is the difference between numpy vstack and column_stack? Reading through the documentation, it looks as if column_stack is an implementation of vstack for 1D arrays. Is it a more efficient implementation? Otherwise, I cannot find a reason for just having vstack.
I think the following code illustrates the difference nicely:
>>> np.vstack(([1,2,3],[4,5,6]))
array([[1, 2, 3],
[4, 5, 6]])
>>> np.column_stack(([1,2,3],[4,5,6]))
array([[1, 4],
[2, 5],
[3, 6]])
>>> np.hstack(([1,2,3],[4,5,6]))
array([1, 2, 3, 4, 5, 6])
I've included hstack for comparison as well. Notice how column_stack stacks along the second dimension whereas vstack stacks along the first dimension. The equivalent to column_stack is the following hstack command:
>>> np.hstack(([[1],[2],[3]],[[4],[5],[6]]))
array([[1, 4],
[2, 5],
[3, 6]])
I hope we can agree that column_stack is more convenient.
hstack stacks horizontally, vstack stacks vertically.
The problem with hstack is that when you append a column you need to convert it from a 1d array to a 2d column first, because a 1d array is normally interpreted as a row vector in a 2d context in numpy:
a = np.ones((2, 2)) # 2d, shape = (2, 2)
b = np.array([0, 0]) # 1d, shape = (2,)
hstack((a, b)) -> dimensions mismatch error
So use either hstack((a, b[:, None])) or column_stack((a, b)),
where None serves as a shortcut for np.newaxis.
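A runnable version of this example, as a minimal sketch using a (2, 2) array of ones for a:

```python
import numpy as np

a = np.ones((2, 2))       # 2d, shape (2, 2)
b = np.array([0, 0])      # 1d, shape (2,)

try:
    np.hstack((a, b))     # 2d and 1d inputs cannot be joined along axis 1
    failed = False
except ValueError:
    failed = True

via_hstack = np.hstack((a, b[:, None]))      # b reshaped to a (2, 1) column
via_column_stack = np.column_stack((a, b))   # does the reshaping for you

print(failed)             # True
print(via_hstack.shape)   # (2, 3)
```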
If you're stacking two vectors, you've got three options:
As for the (undocumented) row_stack, it is just a synonym of vstack, as a 1d array is ready to serve as a matrix row without extra work.
The case of 3D and above proved to be too huge to fit in the answer, so I've included it in the article called Numpy Illustrated.
In the Notes section to column_stack, it points out this:
This function is equivalent to np.vstack(tup).T.
There are many functions in numpy that are convenient wrappers of other functions. For example, the Notes section of vstack says:
Equivalent to np.concatenate(tup, axis=0) if tup contains arrays that are at least 2-dimensional.
It looks like column_stack is just a convenience function for vstack.
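A caveat worth checking, as a hedged sketch: the equivalence with np.vstack(tup).T holds for 1D inputs, but the two differ for 2D inputs, so column_stack is not simply vstack followed by a transpose in general.

```python
import numpy as np

# 1d inputs: column_stack and vstack(...).T agree.
u, v = [1, 2, 3], [4, 5, 6]
same_1d = np.array_equal(np.column_stack((u, v)), np.vstack((u, v)).T)

# 2d inputs: column_stack joins the arrays side by side as they are,
# while vstack(...).T stacks them vertically and then transposes.
p = np.array([[1, 2], [3, 4]])
q = np.array([[5, 6], [7, 8]])
cs = np.column_stack((p, q))   # [[1, 2, 5, 6], [3, 4, 7, 8]]
vt = np.vstack((p, q)).T       # [[1, 3, 5, 7], [2, 4, 6, 8]]

print(same_1d)                 # True
print(np.array_equal(cs, vt))  # False
```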
