Randomly select rows from numpy array based on a condition


Let's say I have 2 arrays of arrays: each array in labels is 1D and each array in data is 5D. Note that corresponding arrays have the same first dimension.
To simplify things let's say labels contain only 3 arrays :
labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]], dtype=object) # dtype=object is needed for rows of unequal length in recent NumPy
I also have datalist, a list of 3 data arrays. Each one is 5D, and its first dimension matches the length of the corresponding array in labels.
In this example, datalist has 3 arrays of shapes (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1) respectively. Here, the first dimension of each of these arrays is the same as the length of the corresponding array in labels.
Now I want to reduce the number of zeros in each array of labels and keep the other values. Let's say I want to keep only 3 zeros for each array. Therefore, the lengths of the arrays in labels, as well as the first dimensions of the arrays in data, will be 6, 4 and 8.
To reduce the number of zeros in each array of labels, I want to randomly select 3 of them and keep only those. The same randomly selected indexes will then be used to select the corresponding rows from data.
For this example, the new_labels array will be something like this :
new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]], dtype=object)
Here's what I have tried so far :
all_ind=[] #to store indexes where value=0 for all arrays
indexes_to_keep=[] #to store the random selected indexes
new_labels=[] #to store the final results
for i in range(len(labels)):
    ind=[] #to store indexes where value=0 for one array
    for j in range(len(labels[i])):
        if (labels[i][j]==0):
            ind.append(j)
    all_ind.append(ind)
for k in range(len(labels)):
    # index with k here; i is left over from the previous loop
    indexes_to_keep.append(np.random.choice(all_ind[k], 3))
    aux= np.zeros(len(labels[k]) - len(all_ind[k]) + 3)
    ....
    ....
    Here, how can I fill **aux** with the values ?
    ....
    ....
    new_labels.append(aux)
Any suggestions ?

Working with numpy arrays of different lengths is not a good idea, so you need to iterate over each item and process it separately. Assuming you only want to optimize that per-item step, masking works pretty well here:
def specific_choice(x, n):
    '''Keep all non-zero values of x and only n randomly chosen zeros.'''
    x = np.array(x)
    mask = x != 0
    idx = np.flatnonzero(~mask)
    np.random.shuffle(idx) # shuffles idx in place, quite fast
    idx = idx[:n]
    mask[idx] = True
    return x[mask] # or return mask if you need it
Iterating over a list is faster than iterating over an array, so an effective usage would be:
labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]
Output (one possible result, since the zeros are chosen at random):
[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]
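The question also asks for the matching rows of each data array. Here is a minimal sketch of that step, assuming datalist is a list of arrays aligned with labels as described in the question. The helper name choose_mask, the stand-in zero-filled data, and the use of np.random.default_rng are illustrative assumptions, not part of the answer above:

```python
import numpy as np

def choose_mask(x, n, rng):
    """Boolean mask keeping every non-zero entry of x plus at most n random zeros.

    Hypothetical helper: same idea as specific_choice, but it returns the mask.
    """
    x = np.asarray(x)
    mask = x != 0
    zero_idx = np.flatnonzero(~mask)
    # Sample without replacement so exactly min(n, number of zeros) zeros are kept
    keep = rng.choice(zero_idx, size=min(n, zero_idx.size), replace=False)
    mask[keep] = True
    return mask

rng = np.random.default_rng(0)
labels = [[0, 0, 0, 1, 1, 2, 0, 0], [0, 4, 0, 0, 0], [0, 3, 0, 2, 1, 0, 0, 1, 7, 0]]
# Stand-in data with the shapes given in the question
datalist = [np.zeros((len(lab), 3, 100, 10, 1)) for lab in labels]

masks = [choose_mask(lab, 3, rng) for lab in labels]
new_labels = [np.asarray(lab)[m] for lab, m in zip(labels, masks)]
new_data = [d[m] for d, m in zip(datalist, masks)]  # same mask selects the rows

print([d.shape[0] for d in new_data])  # [6, 4, 8]
```

Because one boolean mask is applied to both the labels and the data, the rows stay aligned and keep their original order.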

Related

Extract 2d ndarray from arbitrarily dimensional ndarray using index arrays

I want to extract parts of a numpy ndarray based on arrays of index positions for some of the dimensions. Let me show this with an example.
Example data
dummy = np.random.rand(5,2,100)
X = np.array([[0,1],[4,1],[2,0]])
dummy is the original ndarray with dimensionality 5x2x100. This dimensionality is arbitrary, it could as well be 5x2x4x100.
X is a matrix of index values, here X[:,0] are the indices of the first dimension of dummy, X[:,1] those of the second dimension. The number of columns in X is always the number of dimensions in dummy minus 1.
Example output
I want to extract an ndarray of the following form for this example
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Complications
If the number of dimensions in dummy were fixed, this could just be done by dummy[X[:,0],X[:,1],:] . Sadly the dimensionality can be different, e.g. dummy could be a 5x2x4x6x100 ndarray and X correspondingly would then be 3x4 . My attempts at dealing with it have not yielded the desired result.
dummy[X,:] yields a 3x2x2x100 ndarray for this example, the same as dummy[X].
Iteratively reducing dummy by doing something like dummy = dummy[X[:,i],:], with i iterating over the columns of X, also does not reduce the ndarray in the example past 3x2x100.
I have a feeling that this should be pretty simple with numpy indexing, but I guess my search for a solution was missing the right terms for this.
Does anyone have a solution to this?

I will try to explain @Michael Szczesny's answer.
First, notice that if you index an np.array that has n dimensions with m indexes, where m<n, the result is the same as using : for the remaining n-m dimensions. In your case, for example:
dummy[(0, 0)] == dummy[0, 0, :]
Given that, note that you can also pass an array as an index. Thus:
dummy[([0, 1], [0, 0])]
It would be the same as:
np.array([dummy[(0,0)], dummy[(1,0)]])
You can validate that using:
np.array_equal(dummy[([0, 1], [0, 0])], np.array([dummy[(0,0)], dummy[(1,0)]]))
Finally, notice that:
(*X.T,)
# (array([0, 4, 2]), array([1, 1, 0]))
Here you get the indices for each dimension as a separate array, so indexing dummy with this tuple gives:
[
dummy[0,1],
dummy[4,1],
dummy[2,0]
]
Which is the same as:
[
dummy[0,1,:],
dummy[4,1,:],
dummy[2,0,:]
]
Edit: Instead of using (*X.T,), you can use tuple(X.T), which, to me, makes more sense.
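As a quick check of the explanation above, here is a small sketch that compares tuple(X.T) indexing against the explicitly stacked rows, and then repeats it for a 5-D array. The random data and the index values in X5 are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# 3-D case from the question
dummy = rng.random((5, 2, 100))
X = np.array([[0, 1], [4, 1], [2, 0]])
out = dummy[tuple(X.T)]
expected = np.stack([dummy[0, 1, :], dummy[4, 1, :], dummy[2, 0, :]])
print(out.shape, np.array_equal(out, expected))  # (3, 100) True

# 5-D case: X5 has one column per indexed dimension (ndim - 1 = 4)
dummy5 = rng.random((5, 2, 4, 6, 100))
X5 = np.array([[0, 1, 3, 5], [4, 1, 0, 2], [2, 0, 1, 1]])
out5 = dummy5[tuple(X5.T)]
print(out5.shape)  # (3, 100)
```

The same expression works for both cases because the number of index arrays in the tuple follows the number of columns in X.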

As Michael Szczesny wrote, the best solution is dummy[(*X.T,)].
Since X[:,0] are the indices of the first dimension of dummy and X[:,1] are the indices of the second dimension of dummy, if you transpose X (X.T) you'll have the indices of the first dimension of dummy as X.T[0] and the indices of the second dimension of dummy as X.T[1].
Now, to slice dummy as you want, you can specify the indices of the first and of the second dimension in this way:
dummy[(first_dim_indices, second_dim_indices)] # i.e. dummy[(X.T[0], X.T[1])]
To simplify the code (and to avoid transposing the X matrix twice), you can unpack X.T into a tuple as (*X.T,), so writing dummy[(*X.T,)] is the same as writing dummy[(X.T[0], X.T[1])].
This notation is also useful if you have a variable number of dimensions to slice through, because you will unpack from X.T as many rows as there are dimensions to slice in dummy. For example, suppose you want to retrieve a 1D array from dummy given the following indices:
first_dim: (0, 4, 2)
second_dim: (1, 1, 0)
third_dim: (9, 8, 7)
You can specify the indices of the 3 dimensions as X = np.array([[0,1,9],[4,1,8],[2,0,7]]) and dummy[(*X.T,)] is still valid.

How to tile a 1D numpy array using uneven subarrays as tiles?

Is there a way to use sub-arrays of a 1-D array as the input tiles for np.tile? I start with:
a 1D array,
the sizes of each of the tiles,
the number of repeats for each tile.
In this case, the number of repeats for each tile is equal to the number of elements in that tile.
Example:
arr = np.array([0,1,2,3,4])
tile_sizes = np.array([2, 3])
num_repeats = tile_sizes
#do some np.tile thing here
and the output array will be:
np.array([0,1,0,1,2,3,4,2,3,4,2,3,4])
Note that the first 2 elements (0 and 1) formed a tile of shape (2,) which was repeated 2 times. The next tile was 3 elements (2, 3, and 4) and was tiled 3 times.
The use-case for this will involve arrays of a million elements, so memory and speed are concerns, meaning broadcasting is preferred.
A non-broadcasting way to achieve this looks like:
tiles = np.split(arr, np.cumsum(tile_sizes)[:-1])
repeated_tiles = [np.tile(tile, tile.shape[0]) for tile in tiles]
output = np.concatenate(repeated_tiles)
output
# array([0, 1, 0, 1, 2, 3, 4, 2, 3, 4, 2, 3, 4])

It's not a perfect solution, but you can get rid of the list comprehension using np.repeat if that helps.
a = np.arange(5)
tile_sizes = np.array([2, 3])
tiles = np.array(np.split(a, np.cumsum(tile_sizes)[:-1]), dtype=object) # np.object was removed in NumPy 1.24; use the builtin object
tiles = np.concatenate(np.repeat(tiles, tile_sizes))
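If object arrays and np.split are a concern at a million elements, one alternative is to build the output indices directly and do a single fancy-indexing lookup. The sketch below is a different technique from the answer above, not a refinement of it, and the function name tile_uneven is hypothetical:

```python
import numpy as np

def tile_uneven(arr, tile_sizes, num_repeats):
    """Repeat consecutive tiles of arr; tile k has tile_sizes[k] elements
    and is repeated num_repeats[k] times. Hypothetical helper."""
    arr = np.asarray(arr)
    tile_sizes = np.asarray(tile_sizes)
    num_repeats = np.asarray(num_repeats)
    # Start offset of each tile inside arr
    tile_starts = np.cumsum(tile_sizes) - tile_sizes
    # Length and start offset of each tile's repeated block inside the output
    block_lens = tile_sizes * num_repeats
    block_starts = np.cumsum(block_lens) - block_lens
    # For every output position: which tile it belongs to, and its offset in that block
    block_id = np.repeat(np.arange(tile_sizes.size), block_lens)
    pos = np.arange(block_lens.sum()) - block_starts[block_id]
    # Wrap the offset around the tile length to get the source index
    return arr[tile_starts[block_id] + pos % tile_sizes[block_id]]

arr = np.array([0, 1, 2, 3, 4])
tile_sizes = np.array([2, 3])
print(tile_uneven(arr, tile_sizes, tile_sizes))
# [0 1 0 1 2 3 4 2 3 4 2 3 4]
```

This allocates a few integer arrays as long as the output, so it trades memory for avoiding the Python-level loop.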

Summarize ndarray by 2d array in Python

I want to summarize a 3d array dat using indices contained in a 2d array idx.
Consider the example below. For each margin along dat[:, :, i], I want to compute the median according to some index idx. The desired output (out) is a 2d array, whose rows record the index and columns record the margin. The following code works but is not very efficient. Any suggestions?
import numpy as np
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
out = np.empty((3, 3))
for i in np.unique(idx):
    out[i,] = np.median(dat[idx==i], axis = 0)
print(out)
Output:
[[ 1.5 2.5 3.5]
[ 6. 7. 8. ]
[ 9. 10. 11. ]]

To visualize the problem better, I will refer to the 2x2 dimensions of the array as the rows and columns, and the third dimension as depth. I will refer to vectors along the third dimension as "pixels" (pixels have length 3), and planes along the first two dimensions as "channels".
Your loop is accumulating a set of pixels selected by the mask idx == i, and taking the median of each channel within that set. The result is an Nx3 array, where N is the number of distinct indices that you have.
One day, generalized ufuncs will be ubiquitous in numpy, and np.median will be such a function. On that day, you will be able to use reduceat magic [1] to do something like
unq, ind = np.unique(idx, return_inverse=True)
np.median.reduceat(dat.reshape(-1, dat.shape[-1]), np.r_[0, np.where(np.diff(unq[ind]))[0]+1])
[1] See "Applying operation to unevenly split portions of numpy array" for more info on the specific type of magic.
Since this is not currently possible, you can use scipy.ndimage.median instead. This version allows you to compute medians over a set of labeled areas in an array, which is exactly what you have with idx. This method assumes that your index array contains N densely packed values, all of which are in range(N). Otherwise the reshaping operations will not work properly.
If that is not the case, start by transforming idx:
_, ind = np.unique(idx, return_inverse=True)
idx = ind.reshape(idx.shape)
OR
idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
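To make the relabeling step concrete, here is a small sketch with made-up, non-dense labels (10, 30, 70) that get mapped onto 0, 1, 2:

```python
import numpy as np

# Illustrative labels that are not densely packed
idx = np.array([[10, 10], [30, 70]])

# return_inverse gives, for each element, its position in the sorted unique values
dense = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
print(dense)
# [[0 0]
#  [1 2]]
```

The explicit reshape keeps the result in the shape of idx regardless of how the inverse array is returned.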
Since you are actually computing a separate median for each region and channel, you will need to have a set of labels for each channel. Flesh out idx to have a distinct set of indices for each channel:
chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
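For the example data, the offset labels can be inspected directly. This sketch only re-runs the lines above on the idx from the question, assuming 3 channels as in dat:

```python
import numpy as np

idx = np.array([[0, 0], [1, 2]])
chan = 3                      # dat.shape[-1] in the question's example
offset = idx.max() + 1        # number of distinct regions, here 3
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)

print(index.shape)     # (2, 2, 3)
print(index[..., 0])   # channel 0 keeps labels 0..2
print(index[..., 1])   # channel 1 uses labels 3..5
print(index[1, 1])     # [2 5 8]: region 2 has a different label in each channel
```

Each (region, channel) pair ends up with its own label in the range 0 to offset * chan - 1.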
Now index has an identical set of regions defined in each channel, which you can use in scipy.ndimage.median:
out = scipy.ndimage.median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
The input labels must be densely packed from zero to offset * chan for index=range(offset * chan) to work properly, and the reshape operation to have the right number of elements. The final transpose is just an artifact of how the labels are arranged.
Here is the complete product, along with an IDEOne demo of the result:
import numpy as np
from scipy.ndimage import median
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
def summarize(dat, idx):
    idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
    chan = dat.shape[-1]
    offset = idx.max() + 1
    index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
    return median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
print(summarize(dat, idx))

Selecting which dimension to index in a numpy array

I am writing a program that is supposed to be able to import numpy arrays of some higher dimension, e.g. something like an array a:
a = numpy.zeros([3,5,7,2])
Further, each dimension will correspond to some physical dimension, e.g. frequency, distance, ... and I will also import arrays with information about these dimensions, e.g. for a above:
freq = [1,2,3]
time = [0,1,2,3,4,5,6]
distance = [0,0,0,4,1]
angle = [0,180]
From this example and the shape of a, it is clear that freq belongs to dimension 0, time to dimension 2 and so on. But since this is not known in advance, I cannot take a frequency slice like
a_f1 = a[1,:,:,:]
since I do not know which dimension the frequency is indexed on.
So, what I would like is some way to choose which dimension to index with an index; in some Python'ish code something like
a_f1 = a.get_slice([0,], [[1],])
This is supposed to return the slice with index 1 from dimension 0 and the full other dimensions.
Doing
a_p = a[0, 1:, ::2, :-1]
would then correspond to something like
a_p = a.get_slice([0, 1, 2, 3], [[0,], [1,2,3,4], [0,2,4,6], [0,]])

You can fairly easily construct a tuple of indices, using slice objects where needed, and then use this to index into your array. The basic recipe is this:
indices = {
    0: # put here whatever you want to get on dimension 0,
    1: # put here whatever you want to get on dimension 1,
    # leave out whatever dimensions you want to get all of
}
ix = tuple(indices.get(dim, slice(None)) for dim in range(arr.ndim)) # must be a tuple; recent NumPy no longer accepts a list here
arr[ix]
Here I have done it with a dictionary since I think that makes it easier to see which dimension goes with which indexer.
So with your example data:
x = np.zeros([3,5,7,2])
We do this:
indices = {0: 1}
ix = tuple(indices.get(dim, slice(None)) for dim in range(x.ndim))
>>> x[ix].shape
(5, 7, 2)
Because your array is all zeros, I'm just showing the shape of the result to indicate that it is what we want. (Even if it weren't all zeros, it's hard to read a 3D array in text form.)
For your second example:
indices = {
    0: 0,
    1: slice(1, None),
    2: slice(None, None, 2),
    3: slice(None, -1)
}
ix = tuple(indices.get(dim, slice(None)) for dim in range(x.ndim))
>>> x[ix].shape
(4, 4, 1)
You can see that the shape corresponds to the number of values in your a_p example. One thing to note is that the first dimension is gone, since you only specified one value for that index. The last dimension still exists, but with a length of one, because you specified a slice that happens to just get one element. (This is the same reason that some_list[0] gives you a single value, but some_list[:1] gives you a one-element list.)
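The recipe can be wrapped in a small function. This is only a sketch: the name get_slice is borrowed from the question's wish-list and is not an existing NumPy method, and the dictionary-based signature is an assumption made here for illustration.

```python
import numpy as np

def get_slice(arr, indices):
    """Index arr along the axes named in indices (a dict: axis -> int or slice).

    Axes that are not listed are taken in full. Hypothetical helper.
    """
    ix = tuple(indices.get(dim, slice(None)) for dim in range(arr.ndim))
    return arr[ix]

# Non-zero data so that the comparisons below are meaningful
a = np.arange(3 * 5 * 7 * 2).reshape(3, 5, 7, 2)

a_f1 = get_slice(a, {0: 1})
print(a_f1.shape)                              # (5, 7, 2)
print(np.array_equal(a_f1, a[1, :, :, :]))     # True

a_p = get_slice(a, {0: 0, 1: slice(1, None), 2: slice(None, None, 2), 3: slice(None, -1)})
print(np.array_equal(a_p, a[0, 1:, ::2, :-1]))  # True

# For a single index on a single axis, np.take gives the same result
print(np.array_equal(np.take(a, 1, axis=0), a_f1))  # True
```

When only one axis is involved, np.take with its axis argument may be the shorter option; the dictionary form is handy once several axes are indexed at the same time.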

You can use advanced indexing to achieve this.
The index for each dimension needs to be shaped appropriately so that the indices will broadcast correctly across the array. For example, the index for the first dimension of a 3-d array needs to be shaped (x, 1, 1) so that it will broadcast across the first dimension. The index for the second dimension of a 3-d array needs to be shaped (1, y, 1) so that it will broadcast across the second dimension.
import numpy as np
a = np.zeros([3,5,7,2])
b = a[0, 1:, ::2, :-1]
indices = [[0,], [1,2,3,4], [0,2,4,6], [0,]]
def get_aslice(a, indices):
    n_dim_ = len(indices)
    index_array = [np.array(thing) for thing in indices]
    idx = []
    # reshape the arrays by adding single-dimensional entries
    # based on the position in the index array
    for d, thing in enumerate(index_array):
        shape = [1] * n_dim_
        shape[d] = thing.shape[0]
        #print(d, shape)
        idx.append(thing.reshape(shape))
    # index with a tuple: recent NumPy no longer treats a plain list
    # as a multidimensional index
    c = a[tuple(idx)]
    # to remove leading single-dimensional entries from the shape
    #while c.shape[0] == 1:
    #    c = np.squeeze(c, 0)
    # To remove all single-dimensional entries from the shape
    #c = np.squeeze(c)
    return c
For a as an input, it returns an array with shape (1,4,4,1), whereas your a_p example has a shape of (4,4,1). If the extra dimensions need to be removed, un-comment the np.squeeze lines in the function.
Now I feel silly. While reading the docs more slowly, I noticed that numpy has an indexing routine that does what you want: numpy.ix_
>>> a = np.zeros([3,5,7,2])
>>> indices = [[0,], [1,2,3,4], [0,2,4,6], [0,]]
>>> index_arrays = np.ix_(*indices)
>>> a_p = a[index_arrays]
>>> a_p.shape
(1, 4, 4, 1)
>>> a_p = np.squeeze(a_p)
>>> a_p.shape
(4, 4)

Crop part of np.array

I have a numpy array A like
A.shape
(512,270,1,20)
I don't want to use all 20 layers in dimension 4. The new array should be like
Anew.shape
(512,270,1,2)
So I want to crop out 2 "slices" of the array A.

From the python documentation, the answer is:
start = 4 # Index where you want to start.
Anew = A[:,:,:,start:start+2]

You can use a list or array of indices rather than slice notation in order to select an arbitrary sequence of indices in the final dimension:
x = np.zeros((512, 270, 1, 20))
y = x[..., [4, 10]] # the 5th and 11th indices in the final dimension
print(y.shape)
# (512,270,1,2)
