Extract 2d ndarray from arbitrarily dimensional ndarray using index arrays

I want to extract parts of a NumPy ndarray based on arrays of index positions for some of the dimensions. Let me illustrate this with an example.
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary, it could as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:] . Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4 . My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example, the same as dummy[X].
Iteratively reducing dummy by doing something like dummy = dummy[X[:,i],:] with i an iterator over the number of columns of X also does not reduce the ndarray in the example past 3x2x100
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?

I will try to explain @Michael Szczesny's answer.
First, notice that if you have an np.array with n dimensions and pass m indices where m<n, the result is the same as using : for the remaining dimensions. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
It would be the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
dummy[([0, 1], [0, 0])] == np.array([dummy[(0,0)], dummy[(1,0)]])
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
You are here getting each dimension as an array, and then you will get:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which, to me, makes more sense.
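The equivalence described above can be checked directly. This is a minimal sketch using the question's dummy and X; the seeded generator and the names result and expected are assumptions added only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
dummy = rng.random((5, 2, 100))
X = np.array([[0, 1], [4, 1], [2, 0]])

# One index array per leading dimension; the trailing axis is kept whole.
result = dummy[tuple(X.T)]

# Reference result built row by row, as written in the question.
expected = np.stack([dummy[0, 1, :], dummy[4, 1, :], dummy[2, 0, :]])

print(result.shape)                      # (3, 100)
print(np.array_equal(result, expected))  # True
```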

As Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] holds the indices of the first dimension of dummy and X[:,1] those of the second dimension, transposing X (X.T) gives you the indices of the first dimension as X.T[0] and the indices of the second dimension as X.T[1].
Now, to slice dummy as you want, you can specify the indices of the first and second dimensions in this way:
dummy[(first_dim_indices, second_dim_indices)]  # i.e. dummy[(X.T[0], X.T[1])]
To simplify the code (and to avoid writing X.T twice), you can unpack X.T into a tuple as (*X.T,), so dummy[(*X.T,)] is the same as dummy[(X.T[0], X.T[1])].
This form is also useful if the number of dimensions to index is not fixed, because X.T unpacks into as many rows as there are dimensions to index in dummy. For example, suppose you want to retrieve a 1D array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dummy[(*X.T,)] is still valid.
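To illustrate that the same expression works for any number of indexed dimensions, here is a small sketch. The helper name take_rows and the arrays big and X4 are hypothetical and not part of the original answers.

```python
import numpy as np

def take_rows(arr, idx):
    """Hypothetical helper: index the leading idx.shape[1] axes of arr
    with the rows of idx, keeping any remaining axes whole."""
    idx = np.asarray(idx)
    return arr[tuple(idx.T)]

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Four indexed dimensions, one trailing dimension kept whole.
big = rng.random((5, 2, 4, 6, 100))
X4 = np.array([[0, 1, 3, 5], [4, 1, 0, 2], [2, 0, 1, 1]])
out = take_rows(big, X4)
print(out.shape)  # (3, 100)

# Indexing every dimension yields a 1D array with one scalar per row of X.
dummy = rng.random((5, 2, 100))
X3 = np.array([[0, 1, 9], [4, 1, 8], [2, 0, 7]])
flat = take_rows(dummy, X3)
print(flat.shape)  # (3,)
```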

Related

Randomly select rows from numpy array based on a condition

Let's say I have 2 arrays of arrays: labels is 1D and data is 5D. Note that both arrays have the same first dimension.
To simplify things let's say labels contain only 3 arrays :
labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]], dtype=object) # dtype=object is needed for rows of unequal length
And let's say I have a datalist of data arrays (length=3) where each array has a 5D shape where the first dimension of each one is the same as the arrays of the labels array.
In this example, datalist has 3 arrays of shapes : (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1) respectively. Here, the first dimension of each of these arrays is the same as the lengths of each array in label.
Now I want to reduce the number of zeros in each array of labels and keep the other values. Let's say I want to keep only 3 zeros for each array. Therefore, the length of each array in labels as well as the first dimension of each array in data will be 6, 4 and 8.
In order to reduce the number of zeros in each array of labels, I want to randomly select and keep only 3. These same randomly selected indexes will then be used to select the corresponding rows from data.
For this example, the new_labels array will be something like this:
new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]], dtype=object)
Here's what I have tried so far :
all_ind=[] #to store indexes where value=0 for all arrays
indexes_to_keep=[] #to store the random selected indexes
new_labels=[] #to store the final results
for i in range(len(labels)):
    ind=[] #to store indexes where value=0 for one array
    for j in range(len(labels[i])):
        if (labels[i][j]==0):
            ind.append(j)
    all_ind.append(ind)
for k in range(len(labels)):
    indexes_to_keep.append(np.random.choice(all_ind[i], 3))
    aux= np.zeros(len(labels[i]) - len(all_ind[i]) + 3)
    ....
    ....
    Here, how can I fill **aux** with the values ?
    ....
    ....
    new_labels.append(aux)
Any suggestions ?

Working with NumPy arrays of different lengths is not a good idea, so you have to iterate over each item and apply some method to it. Assuming you only want to optimize that method, masking might work pretty well here:
def specific_choice(x, n):
    '''keep only n random zeros of the list x'''
    x = np.array(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)
    np.random.shuffle(idx) # shuffles idx in place, quite fast
    idx = idx[:n]
    mask[idx] = True
    return x[mask] # or mask if you need it
Iterating over a list is faster than iterating over an array, so an effective usage would be:
labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]
Output (one possible result, since the zeros to keep are chosen at random):
[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]
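The question also asks to apply the same selection to the data arrays. The following is a sketch of that step under stated assumptions: keep_mask is a hypothetical variant of specific_choice that returns the boolean mask, datalist is filled with random values of the shapes given in the question, and a seeded Generator replaces np.random.shuffle for reproducibility.

```python
import numpy as np

def keep_mask(labels_row, n, rng):
    """Boolean mask keeping every non-zero label and at most n random zeros."""
    row = np.asarray(labels_row)
    mask = row != 0
    zero_idx = np.flatnonzero(~mask)
    # Sample without replacement so exactly min(n, #zeros) zeros are kept.
    keep = rng.choice(zero_idx, size=min(n, zero_idx.size), replace=False)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
labels = [[0, 0, 0, 1, 1, 2, 0, 0], [0, 4, 0, 0, 0], [0, 3, 0, 2, 1, 0, 0, 1, 7, 0]]
# Stand-in data with the shapes from the question: (8|5|10, 3, 100, 10, 1).
datalist = [rng.random((len(row), 3, 100, 10, 1)) for row in labels]

masks = [keep_mask(row, 3, rng) for row in labels]
new_labels = [np.asarray(row)[m] for row, m in zip(labels, masks)]
new_data = [d[m] for d, m in zip(datalist, masks)]  # same rows as the labels

print([len(l) for l in new_labels])     # [6, 4, 8]
print([d.shape[0] for d in new_data])   # [6, 4, 8]
```

Returning the mask rather than the filtered labels keeps labels and data aligned, because the same boolean array indexes the first axis of both.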

Why does indexing a matrix by an integer produce a different shape than the dot product with a one hot vector in numpy?

I have a matrix that I initialized with numpy.random.uniform like so:
W = np.random.uniform(-1, 1, (V,N))
In my case, V = 10000 and N = 50, x is a positive integer
When I multiply W by a one hot vector x_vec of dimension V X 1, like W.T.dot(x_vec), I get a column vector with a shape of (50,1). When I try to get the same vector by indexing W, as in W[x].T or W[x,:].T I get shape (50,).
Can anyone explain to me why these two expressions return different shapes, and whether it's possible to return a (50,1) matrix (vector) with the indexing method? The vector of shape (50,) is problematic because it doesn't behave the same way as the (50,1) vector when I multiply it with other matrices, but I'd like to use indexing to speed things up a little.
*Sorry in advance if this question should be in a place like Cross Validated instead of Stack Exchange

They are different operations. A matrix (in the maths sense) times a matrix gives a matrix; some of your matrices just happen to have width 1.
Indexing with an integer scalar eats the dimension you are indexing into. Once you are down to a single dimension, .T does nothing because it doesn't have enough axes to shuffle.
If you want to go from (50,) to (50, 1) shape-wise, the recipe is indexing with None like so v[:, None]. In your case you have at least two one-line options:
W[x, :][:, None] # or W[x][:, None] or
W[x:x+1, :].T # or W[x:x+1].T
The second option preserves the first dimension of W by requesting a subrange of length one. The first option can be contracted into a single indexing operation (thanks to @hpaulj for pointing this out), which gives the arguably most readable option:
W[x, :, None]
The first index (scalar integer x) consumes the first dimension of W, the second dimension (unaffected by :) becomes the first and None creates a new dimension on the right.
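The options above can be compared against the one-hot product in a short sketch. The small V and N, the value of x, and the seeded generator are assumptions chosen for illustration; the question uses V = 10000 and N = 50.

```python
import numpy as np

V, N = 10, 4  # small illustrative sizes, not the question's values
rng = np.random.default_rng(0)
W = rng.uniform(-1, 1, (V, N))
x = 3

# One-hot column vector of shape (V, 1) selecting row x.
x_vec = np.zeros((V, 1))
x_vec[x] = 1.0

via_dot = W.T.dot(x_vec)    # matrix product, shape (N, 1)
via_index = W[x, :, None]   # integer index drops axis 0, None appends an axis

print(via_dot.shape, via_index.shape)   # (4, 1) (4, 1)
print(np.allclose(via_dot, via_index))  # True
print(W[x].T.shape)                     # (4,)  .T is a no-op on a 1D array
print(W[x:x+1].T.shape)                 # (4, 1)
```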

What is the best way to do multi-dimensional indexing with numpy?

I am trying to do some indexing on a 3D numpy array.
Basically I have an array phi which has shape (F,A,D); for example (5, 3, 7). Generated, for example as follows:
F=5; A=3; D=7; phi = np.random.random((F,A,D))
My goal is to be able to index over A and D with a 2D array such as [[0,1,2],[5,5,6]], which means: take the values indexed by 0 in the 3rd dimension for the first position in A, the values indexed by 1 in the 3rd dimension for the second position of A, and so on. The result should have a shape that is (F,A,2) or (F,2,A).
This would be equivalent to manually cycling all the values of the "indexer array" such as:
phi[:,0,0]; phi[:,1,1]; phi[:,2,2]
phi[:,0,5]; phi[:,1,5]; phi[:,2,6]
Intuitively I would do something like phi[:,:,[[0,1,2],[3,3,3]]], but its shape ends up being (5, 3, 2, 3).
Any ideas on how to obtain the correct result?

I think this is what you want
phi[:,range(A),[[0,1,2],[5,5,6]]]
Your attempt
phi[:,:,[[0,1,2],[5,5,6]]]
takes the values along the third dimension for every value of the first two dimensions, so you end up with a shape of (5,3,2,3).
However, according to your example, you want the index into the second dimension to increase in step with the columns of the index array, which is accomplished in my code by range(A) and NumPy's broadcasting.
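As a check of the broadcasting argument, this sketch compares the one-line indexing with the manual extraction listed in the question. The names idx, out and manual, and the seeded generator, are assumptions for illustration; np.arange(A) is used in place of range(A) and behaves the same here.

```python
import numpy as np

F, A, D = 5, 3, 7
rng = np.random.default_rng(0)  # seed chosen only for reproducibility
phi = rng.random((F, A, D))
idx = np.array([[0, 1, 2], [5, 5, 6]])  # shape (2, A)

# np.arange(A) has shape (A,) and broadcasts against idx's shape (2, A),
# pairing position a in the second axis with idx[r, a] in the third axis.
out = phi[:, np.arange(A), idx]
print(out.shape)  # (5, 2, 3), i.e. (F, 2, A)

# Manual equivalent: phi[:, 0, 0], phi[:, 1, 1], phi[:, 2, 2], then the next row.
manual = np.stack(
    [np.stack([phi[:, a, idx[r, a]] for a in range(A)], axis=1)
     for r in range(idx.shape[0])],
    axis=1,
)
print(np.array_equal(out, manual))  # True
```

If the (F, A, 2) layout is preferred, out.transpose(0, 2, 1) gives it without copying the data.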

Confusion in array operation in numpy

I generally use MATLAB and Octave, and I recently switched to Python and NumPy.
In numpy when I define an array like this
>>> a = np.array([[2,3],[4,5]])
it works great and the size of the array is
>>> a.shape
(2, 2)
which is the same as in MATLAB.
But when I extract the entire first column and look at its size
>>> b = a[:,0]
>>> b.shape
(2,)
I get size (2,). What is this? I expected the size to be (2,1). Perhaps I have misunderstood a basic concept. Can anyone clarify this for me?

A 1D numpy array* is literally 1D - it has no size in any second dimension, whereas in MATLAB, a '1D' array is actually 2D, with a size of 1 in its second dimension.
If you want your array to have size 1 in its second dimension you can use its .reshape() method:
a = np.zeros(5,)
print(a.shape)
# (5,)
# explicitly reshape to (5, 1)
print(a.reshape(5, 1).shape)
# (5, 1)
# or use -1 in the first dimension, so that its size in that dimension is
# inferred from its total length
print(a.reshape(-1, 1).shape)
# (5, 1)
Edit
As Akavall pointed out, I should also mention np.newaxis as another method for adding a new axis to an array. Although I personally find it a bit less intuitive, one advantage of np.newaxis over .reshape() is that it allows you to add multiple new axes in an arbitrary order without explicitly specifying the shape of the output array, which is not possible with the .reshape(-1, ...) trick:
a = np.zeros((3, 4, 5))
print(a[np.newaxis, :, np.newaxis, ..., np.newaxis].shape)
# (1, 3, 1, 4, 5, 1)
np.newaxis is just an alias of None, so you could do the same thing a bit more compactly using a[None, :, None, ..., None].
* An np.matrix, on the other hand, is always 2D, and will give you the indexing behavior you are familiar with from MATLAB:
a = np.matrix([[2, 3], [4, 5]])
print(a[:, 0].shape)
# (2, 1)
For more info on the differences between arrays and matrices, see here.

Typing help(np.shape) gives some insight into what is going on here. For starters, you can get the shape you expect by typing:
b = np.array([a[:,0]]).T  # shape (2, 1); without .T the shape is (1, 2)
Basically numpy defines things a little differently than MATLAB. In the numpy environment, a vector only has one dimension, and an array is a vector of vectors, so it can have more. In your first example, your array is a vector of two vectors, i.e.:
a = np.array([vec1, vec2])
So a has two dimensions, and in your example the number of elements in both dimensions is the same, 2. Your array is therefore 2 by 2. When you take a slice out of this, you are reducing the number of dimensions that you have by one. In other words, you are taking a vector out of your array, and that vector only has one dimension, which also has 2 elements, but that's it. Your vector is now 2 by _. There is nothing in the second spot because the vector is not defined there.
You could think of it in terms of spaces too. Your first array is in the space R^(2x2) and your second vector is in the space R^(2). This means that the array is defined on a different (and bigger) space than the vector.
That was a lot to basically say that you took a slice out of your array, and unlike MATLAB, numpy does not represent vectors (1 dimensional) in the same way as it does arrays (2 or more dimensions).
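The difference between dropping and keeping an axis can be seen side by side. This is a minimal sketch using the question's array a; the alternatives shown are standard NumPy indexing forms, listed here only for comparison.

```python
import numpy as np

a = np.array([[2, 3], [4, 5]])

col_1d = a[:, 0]           # integer index drops the second axis
col_slice = a[:, 0:1]      # length-one slice keeps it
col_list = a[:, [0]]       # list index keeps it
col_none = a[:, 0, None]   # None (np.newaxis) appends a new axis

print(col_1d.shape)     # (2,)
print(col_slice.shape)  # (2, 1)
print(col_list.shape)   # (2, 1)
print(col_none.shape)   # (2, 1)
```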

Convert 1D array into numpy matrix

I have a simple, one dimensional Python array with random numbers. What I want to do is convert it into a numpy Matrix of a specific shape. My current attempt looks like this:
import random
import numpy as np
randomWeights = []
for i in range(80):
    randomWeights.append(random.uniform(-1, 1))
W = np.mat(randomWeights)
W.reshape(8,10)
Unfortunately it always creates a matrix of the form:
[[random1, random2, random3, ...]]
So only the first element of one dimension gets used and the reshape command has no effect. Is there a way to convert the 1D array to a matrix so that the first x items will be row 1 of the matrix, the next x items will be row 2 and so on?
Basically this would be the intended shape:
[[1, 2, 3, 4, 5, 6, 7, 8],
[9, 10, 11, ... , 16],
[..., 800]]
I suppose I can always build a new matrix in the desired form manually by parsing through the input array. But I'd like to know if there is a simpler, more elegant solution with built-in functions that I'm not seeing. If I have to build those matrices manually, I'll have a ton of extra work in other areas of the code, since all my source data comes in simple 1D arrays but will be computed as matrices.

reshape() doesn't reshape in place; you need to assign the result:
>>> W = W.reshape(8,10)
>>> W.shape
(8, 10)
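The point that reshape() returns a new object can be shown with a plain ndarray. This sketch swaps np.mat for a regular array and uses a seeded generator, both of which are assumptions made for illustration rather than the asker's setup.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
weights = rng.uniform(-1, 1, 80)

# reshape returns a reshaped array and leaves the original shape untouched.
W = weights.reshape(8, 10)

print(weights.shape)  # (80,)
print(W.shape)        # (8, 10)
# Rows are filled in order: the first 10 items form row 0, the next 10 row 1.
print(np.array_equal(W[0], weights[:10]))  # True
```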

You can also use W.resize() (ndarray.resize()), which changes the shape in place instead of returning a new array.
