Use 1d boolean index to select out of 2d array

Sometimes I'll have an ND array out of which I need to select data, but the data criterion has only M < N dimensions. Take for example
## generate some matrix
test = np.arange(9).reshape((3, 3))
## some condition based on first-dimension only
selectMe = np.array([ True, True, False], dtype=bool)
Now, I would like to do
test[selectMe[:, None]]
but that leads to an IndexError:
IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 1
Naturally, if I repeat the boolean index on the second dimension, everything works -- the following is the expected output:
test[np.repeat(selectMe[:, None], 3, axis=1)]
Out[41]: array([0, 1, 2, 3, 4, 5])
However, this is quite inefficient. What's the natural way of achieving this with numpy without having to repeat the matrix?

If I understand your problem, you can use ellipsis (...) to cover unfiltered dimensions:
import numpy as np
test = np.arange(10000).reshape((100, 100))
# condition
selectMe = np.random.randint(0, 2, 100).astype(bool)
assert (test[selectMe, ...].ravel() == test[np.repeat(selectMe[:, None], 100, axis=1)]).all()
%timeit test[selectMe, ...].ravel() # 11.6 µs
%timeit test[np.repeat(selectMe[:, None], 100, axis=1)] # 103 µs
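As a minimal sketch of the same idea on the question's 3x3 example: a 1-D boolean index on its own already applies to the first axis only, so the ellipsis is optional. The variable names below are mine, not the asker's.

```python
import numpy as np

test = np.arange(9).reshape((3, 3))
select_me = np.array([True, True, False])

# A 1-D boolean index applies to the first axis; the other axes are kept whole.
rows = test[select_me]                 # shape (2, 3)
rows_ellipsis = test[select_me, ...]   # same selection, ellipsis made explicit

# Flatten to reproduce the 1-D output shown in the question.
flat = rows.ravel()
print(flat)  # [0 1 2 3 4 5]
```

Keeping the 2-D result (without ravel) is usually more convenient, since each selected row stays intact.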

Related

How to add element to empty 2d numpy array

I'm trying to insert elements to an empty 2d numpy array. However, I am not getting what I want.
I tried np.hstack but it is giving me a normal array only. Then I tried using append but it is giving me an error.
Error:
ValueError: all the input arrays must have same number of dimensions
randomReleaseAngle1 = np.random.uniform(20.0, 77.0, size=(5, 1))
randomVelocity1 = np.random.uniform(40.0, 60.0, size=(5, 1))
randomArray =np.concatenate((randomReleaseAngle1,randomVelocity1),axis=1)
arr1 = np.empty((2,2), float)
arr = np.array([])
for i in randomArray:
    data = [[170, 68.2, i[0], i[1]]]
    df = pd.DataFrame(data, columns = ['height', 'release_angle', 'velocity', 'holding_angle'])
    test_y_predictions = model.predict(df)
    print(test_y_predictions)
    if (np.any(test_y_predictions == 1)):
        arr = np.hstack((arr, np.array([i[0], i[1]])))
        arr1 = np.append(arr1, np.array([i[0], i[1]]), axis=0)
print(arr)
print(arr1)
I wanted to get something like
[[1.5,2.2],
[3.3,4.3],
[7.1,7.3],
[3.3,4.3],
[3.3,4.3]]
However, I'm getting
[56.60290125 49.79106307 35.45102444 54.89380834 47.09359271 49.19881675
22.96523274 44.52753514 67.19027156 54.10421167]

The recommended list append approach:
In [39]: alist = []
In [40]: for i in range(3):
    ...:     alist.append([i, i+10])
    ...:
In [41]: alist
Out[41]: [[0, 10], [1, 11], [2, 12]]
In [42]: np.array(alist)
Out[42]:
array([[ 0, 10],
[ 1, 11],
[ 2, 12]])
If we start with an np.empty((2,2)) array:
In [47]: arr = np.empty((2,2),int)
In [48]: arr
Out[48]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952]])
In [49]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[49]:
array([[139934912589760, 139934912589784],
[139934871674928, 139934871674952],
[ 1, 10],
[ 2, 11]])
Note that empty does not mean the same thing as the list []. It's a real 2x2 array, with 'unspecified' values. And those values remain when we add other arrays to it.
I could start with an array with a 0 dimension:
In [51]: arr = np.empty((0,2),int)
In [52]: arr
Out[52]: array([], shape=(0, 2), dtype=int64)
In [53]: np.concatenate((arr, [[1,10]],[[2,11]]), axis=0)
Out[53]:
array([[ 1, 10],
[ 2, 11]])
That looks more like the list append approach. But why start with the (0,2) array in the first place?
np.concatenate takes a list of arrays (or lists that can be made into arrays). I used nested lists that make (1,2) arrays. With this I can join them on axis 0.
Each concatenate makes a new array. So if done iteratively it is more expensive than the list append.
np.append just takes 2 arrays and does a concatenate. So doesn't add much. hstack tweaks shapes and joins on the 2nd (horizontal) dimension. vstack is another variant. But they all end up using concatenate.
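Applied to the loop in the question, the list-append approach might look like the sketch below. The `keep` function is a hypothetical stand-in for the `model.predict(df) == 1` test, which is not available here, and the threshold inside it is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Five (release_angle, velocity) pairs, mirroring the question's setup.
candidates = np.column_stack((
    rng.uniform(20.0, 77.0, size=5),
    rng.uniform(40.0, 60.0, size=5),
))

def keep(pair):
    # Hypothetical stand-in for `model.predict(df) == 1`.
    return pair[1] > 45.0

selected = []                      # plain Python list, cheap to append to
for pair in candidates:
    if keep(pair):
        selected.append([pair[0], pair[1]])

# Convert once at the end; reshape(-1, 2) keeps the result 2-D
# even when nothing was selected.
result = np.array(selected, dtype=float).reshape(-1, 2)
print(result.shape)
```

The array is built once after the loop, so no intermediate arrays are copied on each iteration.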

With the hstack method, you can just reshape after you get the final array:
arr = arr.reshape(-1, 2)
print(arr)
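The np.append method can be handled the same way: drop axis=0 so that np.append flattens its inputs, and start from an empty 1-D array such as np.array([]) rather than np.empty((2,2)):
arr1 = np.append(arr1, np.array([i[0], i[1]]))  # in the loop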
arr1 = arr1.reshape(-1, 2)
print(arr1)
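A self-contained sketch of the "grow a flat array, reshape at the end" idea follows. The sample pairs are made up, and the loop stands in for the question's prediction loop.

```python
import numpy as np

pairs = [(56.6, 49.8), (35.5, 54.9), (47.1, 49.2)]  # made-up sample values

flat = np.array([])                      # empty 1-D array
for angle, velocity in pairs:
    # hstack on 1-D inputs just extends the flat array: a0, v0, a1, v1, ...
    flat = np.hstack((flat, [angle, velocity]))

arr = flat.reshape(-1, 2)                # one (angle, velocity) pair per row
print(arr)
```

Each hstack call still copies the whole array, so for long loops the list-append approach is likely the cheaper option.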

Removing max and min elements of array from mean calculation

I am hoping to delete the highest and the lowest number from each row of a 3x4 array. Let's say the data looks like this:
a=np.array([[1,4,5,10],[2,6,5,0],[3,9,9,0]])
so I expected to see the result like this:
deleted_data=[4,5],[2,5],[3]
Could you advise me how to delete the max and min from each array?
To do so, I tried this (UPDATE):
#to find out the max / min values:
b = np.max(a,1) #max
c = np.min(a,1) #min
#creating dataset after deleting max & min
d=(a!=b[:,None]) & (a!=c[:,None])
f=[i[j] for i,j in zip(a, d)]
output: [array([8, 7, 7, 9, 9, 8]), array([8, 7, 8, 6, 8, 8]), array([9, 8, 9, 9, 8]), array([6, 7, 7, 6, 6, 7]), array([7, 7, 7, 7, 6])]
Now I am not sure how to calculate the mean of the list objects?
I would like to calculate the mean of each array, so I have tried this:
mean1=f.mean(axis=0)
but it did not work.

Another method is to use a Masked Array
import numpy.ma as ma
mask = np.logical_or(a == a.max(1, keepdims = 1), a == a.min(1, keepdims = 1))
a_masked = ma.masked_array(a, mask = mask)
from there if you want an average of the unmasked elements you can just do
a_masked.mean()
Or you could even do the mean of the rows
a_masked.mean(1).data
or columns (strange, but seems to be what you're asking for)
a_masked.mean(0).data
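Put together on the question's sample array, the masked-array approach could look like this sketch. Using `.filled(np.nan)` instead of `.data` is my own choice, so that a row in which every element is masked would show up as NaN.

```python
import numpy as np
import numpy.ma as ma

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])

# True marks entries to ignore: every occurrence of the row max or row min.
mask = (a == a.max(axis=1, keepdims=True)) | (a == a.min(axis=1, keepdims=True))
a_masked = ma.masked_array(a, mask=mask)

# Mean over the unmasked entries of each row.
row_means = a_masked.mean(axis=1).filled(np.nan)
print(row_means)  # [4.5 3.5 3. ]
```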

A python list has a remove method.
With a utility function we could remove the min and max elements from a row:
def foo(i,j,k):
    il = i.tolist()
    il.remove(j)
    il.remove(k)
    return il
In [230]: [foo(i,j,k) for i,j,k in zip(a,b,c)]
Out[230]: [[4, 5], [2, 5], [3, 9]]
This could be turned back into an array with np.array(...). Note that this removed just one of the 9 in the last row. If it had removed both, the last list would have just 1 value, and the result could not be turned back into a 2d array.
I'm sure we could come up with a pure-array method, possibly using argmax and argmin instead of max and min. But I think the list approach is a better starting point for a Python beginner.
An array masking approach
In [232]: bi = np.argmax(a,1)
In [233]: ci = np.argmin(a,1)
In [234]: bi
Out[234]: array([3, 1, 1], dtype=int32)
In [235]: ci
Out[235]: array([0, 3, 3], dtype=int32)
In [243]: mask = np.ones_like(a, bool)
In [244]: mask[np.arange(3),bi]=False
In [245]: mask[np.arange(3),ci]=False
In [246]: mask
Out[246]:
array([[False, True, True, False],
[ True, False, True, False],
[ True, False, True, False]], dtype=bool)
In [247]: a[mask]
Out[247]: array([4, 5, 2, 5, 3, 9])
In [248]: _.reshape(3,-1)
Out[248]:
array([[4, 5],
[2, 5],
[3, 9]])
Again this is better if we just delete one max and one min from each row.
Another masking approach:
In [257]: (a!=b[:,None]) & (a!=c[:,None])
Out[257]:
array([[False, True, True, False],
[ True, False, True, False],
[ True, False, False, False]], dtype=bool)
In [258]: a[(a!=b[:,None]) & (a!=c[:,None])]
Out[258]: array([4, 5, 2, 5, 3])
This does remove all '9's in the last row. But it does not preserve the row split.
This preserves the row structure, and allows variable lengths:
In [259]: mask=(a!=b[:,None]) & (a!=c[:,None])
In [260]: [i[j] for i,j in zip(a, mask)]
Out[260]: [array([4, 5]), array([2, 5]), array([3])]
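The question also asks how to take the mean of such a list of arrays. A list has no .mean method, so one option, sketched here on the sample data, is to take the mean of each row array in turn.

```python
import numpy as np

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])
b = a.max(axis=1)
c = a.min(axis=1)

mask = (a != b[:, None]) & (a != c[:, None])
kept = [row[keep] for row, keep in zip(a, mask)]   # rows of differing length

# Mean of each variable-length row, collected into one 1-D array.
means = np.array([row.mean() for row in kept])
print(means)  # [4.5 3.5 3. ]
```

A row whose values are all equal would leave an empty array here, and its mean would come out as NaN with a runtime warning.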

As @hpaulj predicted, there is an array-only method. And it's a doozy. As a one-liner:
a[np.arange(a.shape[0])[:, None], np.sort(np.argpartition(a, (0,-1), axis = 1)[:, 1:-1], axis = 1)]
Let's break that down:
y_ = np.argpartition(a, (0,-1), axis = 1)[:, 1:-1]
argpartition takes the indices of the 0th (smallest) and -1th (largest) elements of each row and moves them to the first and last positions respectively. [:,1:-1] selects everything else. argpartition can reorder the remaining elements, so
y = np.sort(y_ , axis = 1)
We sort the remaining indices back into their original order. Now we have a y.shape -> (m, n-2) array of indices with the max and min removed, for your original (m, n) = a.shape array.
To use this, we need the row indices as well.
x = np.arange(a.shape[0])[:, None]
arange just gives the m row indices. To broadcast this x.shape -> (a.shape[0],) -> (m,) array to your index array, you need the [:, None] to make x.shape -> (m, 1). Now the m lines up for broadcasting and you have your two sets of indices.
a[x, y]
array([[4, 5],
[2, 5],
[3, 9]])
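The steps above can be wrapped into a small function. This is only a sketch: the function name is mine, and it removes exactly one minimum and one maximum per row, so a duplicated maximum (the second 9 in the last row) is kept.

```python
import numpy as np

def drop_one_min_max(a):
    """Remove one minimum and one maximum from each row of a 2-D array."""
    # Column indices of everything except the smallest (first) and
    # largest (last) entry of each row.
    inner = np.argpartition(a, (0, -1), axis=1)[:, 1:-1]
    inner = np.sort(inner, axis=1)             # restore original column order
    rows = np.arange(a.shape[0])[:, None]      # shape (m, 1), broadcasts with inner
    return a[rows, inner]

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])
trimmed = drop_one_min_max(a)
print(trimmed.mean(axis=1))  # [4.5 3.5 6. ]
```

Because every row loses exactly two elements, the result stays a regular (m, n-2) array and the row means can be taken directly.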

You could get to the final destination of average of elements that are not the max or min per row in two steps with masking -
In [140]: a # input array
Out[140]:
array([[ 1, 4, 5, 10],
[ 2, 6, 5, 0],
[ 3, 9, 9, 0]])
In [141]: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
In [142]: (a*m).sum(1)/m.sum(1).astype(float)
Out[142]: array([ 4.5, 3.5, 3. ])
This avoids the mess of creating the intermediate ragged arrays, which aren't the most convenient data format to operate on with NumPy functions.
Alternatively, for performance boost, use np.einsum to get the equivalent of (a*m).sum(1) with np.einsum('ij,ij->i',a,m).
Runtime test on bigger array -
In [181]: np.random.seed(0)
In [182]: a = np.random.randint(0,10,(5000,5000))
# @Daniel F's soln from https://stackoverflow.com/a/47325431/
In [183]: %%timeit
...: mask = np.logical_or(a == a.max(1, keepdims = 1), a == a.min(1, keepdims = 1))
...: a_masked = ma.masked_array(a, mask = mask)
...: out = a_masked.mean(1).data
1 loop, best of 3: 251 ms per loop
# Posted in here
In [184]: %%timeit
...: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
...: out = (a*m).sum(1)/m.sum(1).astype(float)
10 loops, best of 3: 165 ms per loop
# Posted in here with additional einsum
In [185]: %%timeit
...: m = (a!=a.min(1,keepdims=1)) & (a!=a.max(1,keepdims=1))
...: out = np.einsum('ij,ij->i',a,m)/m.sum(1).astype(float)
10 loops, best of 3: 124 ms per loop
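For reuse, the masking-plus-einsum variant can be sketched as a function. The function name and the NaN handling for rows with nothing left are my additions, not part of the timed code above.

```python
import numpy as np

def trimmed_row_mean(a):
    """Mean of each row, ignoring all entries equal to the row min or max."""
    m = (a != a.min(axis=1, keepdims=True)) & (a != a.max(axis=1, keepdims=True))
    counts = m.sum(axis=1)
    # Row-wise sum of the kept entries; equivalent to (a * m).sum(1).
    totals = np.einsum('ij,ij->i', a, m.astype(a.dtype))
    # Rows with nothing left (e.g. all values equal) get NaN instead of
    # triggering a divide-by-zero warning.
    return np.where(counts > 0, totals / np.maximum(counts, 1), np.nan)

a = np.array([[1, 4, 5, 10], [2, 6, 5, 0], [3, 9, 9, 0]])
print(trimmed_row_mean(a))  # [4.5 3.5 3. ]
```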

If the question is to remove min and/or max elements from a numpy array arr then this is the easiest way in my opinion.
np.delete(arr, np.argmax(arr))
example
tmp = np.random.random(3)
print(tmp)
tmp = np.delete(tmp, np.argmax(tmp))
print(tmp)
returns
[0.7366768 0.65492774 0.93632866]
[0.7366768 0.65492774]
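np.delete also accepts a list of indices, so for a 1-D array the max and the min can be dropped in one call. A small sketch with fixed values (chosen by me so the result is reproducible):

```python
import numpy as np

tmp = np.array([0.74, 0.65, 0.94, 0.70])

# Delete the positions of the largest and the smallest element together.
trimmed = np.delete(tmp, [np.argmax(tmp), np.argmin(tmp)])
print(trimmed)
```

np.delete returns a new array and leaves tmp unchanged. For a 2-D array, argmax and argmin without an axis refer to the flattened array, so this sketch only covers the 1-D case.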

Altering arrays of different dimensions to be broadcasted together

I am looking for a more optimized way to convert a (n,n) or (n,n,1) matrix to a (n,n,3) matrix. I start out with an (n,n,3), but my dimensions get reduced after I perform a sum over the second axis to (n,n). Essentially, I want to keep the original size of the array and have the second axis just repeated 3 times. The reason I need this is that I will later be broadcasting it with another (n,n,3) array, but they need the same dimensions.
My current method works, but does not seem elegant.
a=np.random.random((n,n))
b=a.flatten().tolist()
a=np.array(list(zip(b,b,b)))  # list() is needed on Python 3, where zip returns an iterator
a.shape=n,n,3
This setup has the desired result, but is clunky and hard to follow. Is there perhaps a way to go directly from an (n,n) to an (n,n,3) by duplicating the second index? or perhaps a way to not downsize the array to begin with?

None or np.newaxis is a common way of adding a dimension to an array. reshape with (3,3,1) works just as well:
In [64]: arr=np.arange(9).reshape(3,3)
In [65]: arr1 = arr[...,None]
In [66]: arr1.shape
Out[66]: (3, 3, 1)
repeat, as a function or a method, replicates the values along that new axis.
In [72]: arr2=arr1.repeat(3,axis=2)
In [73]: arr2.shape
Out[73]: (3, 3, 3)
In [74]: arr2[0,0,:]
Out[74]: array([0, 0, 0])
But you might not need to do this. With broadcasting a (3,3,1) works with a (3,3,3).
In [75]: (arr1+arr2).shape
Out[75]: (3, 3, 3)
In fact it will broadcast with a (3,) to produce (3,3,3).
In [77]: arr1+np.ones(3,int)
Out[77]:
array([[[1, 1, 1],
[2, 2, 2],
...
[[7, 7, 7],
[8, 8, 8],
[9, 9, 9]]])
So arr1+np.zeros(3,int) is another way of expanding that (3,3,1) to (3,3,3).
The broadcasting rules are:
(3,3,1) + (3,) => (3,3,1) + (1,1,3) => (3,3,3)
broadcasting adds dimensions at the start as needed.
When you sum on an axis, you can keep the original number of dimensions with a parameter:
In [78]: arr2.sum(axis=2).shape
Out[78]: (3, 3)
In [79]: arr2.sum(axis=2, keepdims=True).shape
Out[79]: (3, 3, 1)
This is handy if you want to subtract the mean from an array along any dimension:
arr2-arr2.mean(axis=2, keepdims=True)
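For the situation in the question (an (n,n,3) array reduced by a sum), a sketch along these lines might avoid the expansion altogether. The variable names and the normalisation step are illustrative assumptions, not the asker's actual computation.

```python
import numpy as np

n = 4
x = np.random.default_rng(0).random((n, n, 3))

s = x.sum(axis=2, keepdims=True)      # shape (n, n, 1) instead of (n, n)
normalised = x / s                     # (n, n, 3) / (n, n, 1) -> (n, n, 3)

# If an explicit (n, n, 3) array is really required, broadcast_to gives a
# read-only view without copying, while np.repeat gives a writable copy.
s_view = np.broadcast_to(s, x.shape)
s_copy = np.repeat(s, 3, axis=2)
```

In most arithmetic the (n, n, 1) array can be used directly, and the view or copy is only needed when some other code insists on a full (n, n, 3) shape.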

You can first create a new axis (axis = 2) on a and then use np.repeat along this new axis:
np.repeat(a[:,:,None], 3, axis = 2)
Or another approach, flatten the array, repeat elements and then reshape:
np.repeat(a.ravel(), 3).reshape(n,n,3)
The result comparison:
import numpy as np
n = 4
a=np.random.random((n,n))
b=a.flatten().tolist()
a1=np.array(list(zip(b,b,b)))  # list() is needed on Python 3
a1.shape=n,n,3
# a1 is the result from the original method
(np.repeat(a[:,:,None], 3, axis = 2) == a1).all()
# True
(np.repeat(a.ravel(), 3).reshape(4,4,3) == a1).all()
# True
Timing shows that the built-in numpy.repeat is also faster:
import numpy as np
n = 4
a=np.random.random((n,n))

def rep():
    b=a.flatten().tolist()
    a1=np.array(list(zip(b,b,b)))  # list() is needed on Python 3
    a1.shape=n,n,3
%timeit rep()
# 100000 loops, best of 3: 7.11 µs per loop
%timeit np.repeat(a[:,:,None], 3, axis = 2)
# 1000000 loops, best of 3: 1.64 µs per loop
%timeit np.repeat(a.ravel(), 3).reshape(4,4,3)
# 1000000 loops, best of 3: 1.9 µs per loop

cut some rows and columns where values are 255

I am trying to get rid of all rows and columns in a grayscale numpy array where the values are 255.
My array could be:
arr = [[255,255,255,255],
[255,0,0,255],
[255,255,255,255]]
The result should be:
arr = [0,0]
I could just iterate over the array, but there should be a more Pythonic way to solve the problem.
For the rows I tried:
arr = arr[~(arr==255).all(1)]
This works really well, but I cannot find an equivalent solution for columns.

Given boolean arrays for rows and columns:
In [26]: rows
Out[26]: array([False, True, False], dtype=bool)
In [27]: cols
Out[27]: array([False, True, True, False], dtype=bool)
np.ix_ creates ordinal indexers which can be used to index arr:
In [32]: np.ix_(rows, cols)
Out[32]: (array([[1]]), array([[1, 2]]))
In [33]: arr[np.ix_(rows, cols)]
Out[33]: array([[0, 0]])
Therefore you could use
import numpy as np
arr = np.array([[255,255,255,255],
[255,0,0,255],
[255,255,255,255]])
mask = (arr != 255)
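rows = mask.any(axis=1)  # keep rows that contain at least one non-255 value
cols = mask.any(axis=0)  # keep columns that contain at least one non-255 value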
print(arr[np.ix_(rows, cols)])
which yields the 2D array
[[0 0]]
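As a reusable sketch of the np.ix_ idea: a row or column is kept when it contains at least one value other than the one being cropped. The function name and the `value` parameter are my own, not part of the answer.

```python
import numpy as np

def crop_value(arr, value=255):
    """Drop rows and columns that consist entirely of `value`."""
    mask = arr != value
    rows = mask.any(axis=1)    # rows with at least one other value
    cols = mask.any(axis=0)    # columns with at least one other value
    return arr[np.ix_(rows, cols)]

arr = np.array([[255, 255, 255, 255],
                [255,   0,   0, 255],
                [255, 255, 255, 255]])
print(crop_value(arr))  # [[0 0]]
```

np.ix_ is needed because indexing with two boolean arrays directly, as in arr[rows, cols], pairs them element by element instead of selecting the full row and column cross product.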

For the columns, you can simply transpose the array:
arr = arr.T[~(arr.T==255).all(1)].T
arr = arr[~(arr==255).all(1)]
which results in
>> print(arr)
[[0 0]]

Next argmax values in python

I have a function that returns the argmax from a large 2d array
getMax = np.argmax(dist, axis=1)
However I want to get the next biggest values, is there a way of removing the getMax values from the original array and then performing argmax again?

Use the command np.argsort(a, axis=-1, kind='quicksort', order=None), but with appropriate choice of arguments (below).
The documentation notes: "It returns an array of indices of the same shape as a that index data along the given axis in sorted order."
The default order is small to large. So sort with -dist (for quick coding). Caution: doing -dist causes a new array to be generated which you may care about if dist is huge. See bottom of post for a better alternative there.
Here is an example:
x = np.array([[1,2,5,0],[5,7,2,3]])
L = np.argsort(-x, axis=1)
print(L)
[[2 1 0 3]
[1 0 3 2]]
x
array([[1, 2, 5, 0],
[5, 7, 2, 3]])
So the n'th entry in a row of L gives the location of the n'th largest element in that row of x.
x is unchanged.
L[:,0] will give the same output as np.argmax(x, axis=1)
L[:,0]
array([2, 1])
np.argmax(x,axis=1)
array([2, 1])
and L[:,1] will give the same as a hypothetical argsecondmax(x)
L[:,1]
array([1, 0])
If you don't want to generate a new array, so you don't want to use -x:
L = np.argsort(x, axis=1)
print(L)
[[3 0 1 2]
[2 3 0 1]]
L[:,-1]
array([2, 1])
L[:,-2]
array([1, 0])
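To get the second-largest values themselves and not only their positions, the indices can be fed back into the array. A small sketch on the same x, using np.take_along_axis (available in NumPy 1.15 and later):

```python
import numpy as np

x = np.array([[1, 2, 5, 0], [5, 7, 2, 3]])

order = np.argsort(x, axis=1)          # ascending, so the largest is last
second_idx = order[:, -2]              # column index of the 2nd largest per row
# Pick one value per row at the given column index.
second_val = np.take_along_axis(x, second_idx[:, None], axis=1).ravel()

print(second_idx)  # [1 0]
print(second_val)  # [2 5]
```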

If speed is important to you, using argpartition rather than argsort could be useful.
For example, to return the n largest elements from an array:
import numpy as np
n = 10  # example value; n is not defined in the original snippet
l = np.random.randint(0, 101, int(1e6))  # integers in [0, 100]
top_n_1 = l[np.argsort(-l)[0:n]]
top_n_2 = l[np.argpartition(l, -n)[-n:]]
The %timeit function in IPython reports 56.9 ms per loop (10 loops, best of 3) for top_n_1 and 8.06 ms per loop (100 loops, best of 3) for top_n_2. Note that top_n_2 holds the n largest values in no particular order.
I hope this is useful.
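Since argpartition does not order the values it selects, a final sort of just those n values gives them largest first. A minimal sketch, with an arbitrary seed and n picked for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.integers(0, 101, size=1_000_000)
n = 5   # example value

# The n largest values, in arbitrary order.
top_unordered = values[np.argpartition(values, -n)[-n:]]
# Sorting only n elements is cheap; reverse to put the largest first.
top_sorted = np.sort(top_unordered)[::-1]
print(top_sorted)
```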
