Count instances in numpy array within a certain value of each row - python

I have a numpy array such as this
[[ 0, 57],
[ 7, 72],
[ 2, 51],
[ 8, 67],
[ 4, 42]]
I want to find out, for each row, how many elements in the 2nd column are within a certain distance (say, 10) of that row's 2nd-column value. So in this example, the solution would be
[[ 0, 57, 3],
[ 7, 72, 2],
[ 2, 51, 3],
[ 8, 67, 3],
[ 4, 42, 2]]
So [first row, third column] is 3, because there are 3 elements in the 2nd column (57, 51, 67) that are within distance 10 of 57, and similarly for each row.
Any help would be appreciated!

Here's one approach leveraging broadcasting with outer-subtraction -
(np.abs(a[:,1,None] - a[:,1]) <= 10).sum(1)
With the np.subtract.outer builtin and np.count_nonzero for counting -
np.count_nonzero(np.abs(np.subtract.outer(a[:,1],a[:,1]))<=10,axis=1)
Sample run -
# Input array
In [23]: a
Out[23]:
array([[ 0, 57],
[ 7, 72],
[ 2, 51],
[ 8, 67],
[ 4, 42]])
# Get count
In [24]: count = (np.abs(a[:,1,None] - a[:,1]) <= 10).sum(1)
In [25]: count
Out[25]: array([3, 2, 3, 3, 2])
# Stack with input
In [26]: np.c_[a,count]
Out[26]:
array([[ 0, 57, 3],
[ 7, 72, 2],
[ 2, 51, 3],
[ 8, 67, 3],
[ 4, 42, 2]])
Alternatively with SciPy's cdist -
In [53]: from scipy.spatial.distance import cdist
In [54]: (cdist(a[:,None,1],a[:,1,None], 'minkowski', p=2)<=10).sum(1)
Out[54]: array([3, 2, 3, 3, 2])
For a million rows in the input, we might want to resort to a loopy one -
n = len(a)
count = np.empty(n, dtype=int)
for i in range(n):
    count[i] = np.count_nonzero(np.abs(a[:,1] - a[i,1]) <= 10)
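If full broadcasting blows memory but the plain loop is too slow, a middle ground is to broadcast in chunks. A minimal sketch (the chunk size and the count_within name are my own, not from the answer):
import numpy as np

def count_within(col, dist=10, chunk=1024):
    # Chunked broadcasting: peak memory is O(chunk * n) instead of O(n * n)
    out = np.empty(len(col), dtype=int)
    for start in range(0, len(col), chunk):
        block = col[start:start + chunk, None]   # (chunk, 1) slice of the column
        out[start:start + chunk] = np.count_nonzero(np.abs(block - col) <= dist, axis=1)
    return out

a = np.array([[0, 57], [7, 72], [2, 51], [8, 67], [4, 42]])
print(count_within(a[:, 1]))  # [3 2 3 3 2]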

Here's a non-broadcasting approach, which takes advantage of the fact that to know how many numbers are within 3 of 10, you can subtract the count of numbers strictly less than 7 from the count of numbers <= 13.
import numpy as np

def broadcast(x, width):
    # for comparison
    return (np.abs(x[:,None] - x) <= width).sum(1)

def largest_leq(arr, x, allow_equal=True):
    maybe = np.searchsorted(arr, x)
    maybe = maybe.clip(0, len(arr) - 1)
    above = arr[maybe] > x if allow_equal else arr[maybe] >= x
    maybe[above] -= 1
    return maybe

def faster(x, width):
    uniq, inv, counts = np.unique(x, return_counts=True, return_inverse=True)
    counts = counts.cumsum()
    low_bounds = uniq - width
    low_ix = largest_leq(uniq, low_bounds, allow_equal=False)
    low_counts = counts[low_ix]
    low_counts[low_ix < 0] = 0
    high_bounds = uniq + width
    high_counts = counts[largest_leq(uniq, high_bounds)]
    delta = high_counts - low_counts
    out = delta[inv]
    return out
This passes my tests:
for width in range(1, 10):
    for window in range(5):
        for trial in range(10):
            x = np.random.randint(0, 10, width)
            b = broadcast(x, window).tolist()
            f = faster(x, window).tolist()
            assert b == f
and behaves pretty well even at larger sizes:
In [171]: x = np.random.random(10**6)
In [172]: %time faster(x, 0)
Wall time: 386 ms
Out[172]: array([1, 1, 1, ..., 1, 1, 1], dtype=int64)
In [173]: %time faster(x, 1)
Wall time: 372 ms
Out[173]: array([1000000, 1000000, 1000000, ..., 1000000, 1000000, 1000000], dtype=int64)
In [174]: x = np.random.randint(0, 10, 10**6)
In [175]: %timeit faster(x, 3)
10 loops, best of 3: 83 ms per loop
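The counting identity that faster exploits can also be seen directly with np.searchsorted on a sorted copy; a minimal sketch of the same idea (not code from the answer above):
import numpy as np

x = np.array([57, 72, 51, 67, 42])
s = np.sort(x)
width = 10
count = (np.searchsorted(s, x + width, side='right')    # how many values are <= x + width
         - np.searchsorted(s, x - width, side='left'))  # minus how many are < x - width
print(count)  # [3 2 3 3 2]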

Related

Check if there are 3 consecutive values in an array which are above some threshold

Say I have a np.array like this:
a = [1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78]
Is there a quick method to get the indices of all locations where 3 consecutive numbers are all above some threshold? That is, for some threshold th, get all x where this holds:
a[x]>th and a[x+1]>th and a[x+2]>th
Example: for threshold 40 and the list given above, x should be [4,8,9].
Many thanks.
Approach #1
Use convolution on the mask of boolean array obtained after comparison -
In [40]: a # input array
Out[40]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [42]: N = 3 # compare N consecutive numbers
In [44]: T = 40 # threshold for comparison
In [45]: np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
Out[45]: array([4, 8, 9])
Approach #2
Use binary_erosion -
In [77]: from scipy.ndimage.morphology import binary_erosion
In [31]: np.flatnonzero(binary_erosion(a>T,np.ones(N, dtype=int), origin=-(N//2)))
Out[31]: array([4, 8, 9])
Approach #3 (Specific case) : Small numbers of consecutive numbers check
For checking such a small number of consecutive numbers (three in this case), we can also use slicing on the compared mask for better performance -
m = a>T
out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
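For a general N, the same idea can be written with np.logical_and.reduce over N shifted views of the mask; a sketch of that generalization (mine, not part of the original answer):
import numpy as np

a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
N, T = 3, 40
m = a > T
# AND together N shifted views; True where N consecutive values exceed T
out = np.flatnonzero(np.logical_and.reduce([m[i:len(m) - N + 1 + i] for i in range(N)]))
print(out)  # [4 8 9]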
Benchmarking
Timings on the given sample tiled 100,000 times -
In [78]: a
Out[78]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [79]: a = np.tile(a,100000)
In [80]: N = 3
In [81]: T = 40
# Approach #3
In [82]: %%timeit
...: m = a>T
...: out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
1000 loops, best of 3: 1.83 ms per loop
# Approach #1
In [83]: %timeit np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
100 loops, best of 3: 10.9 ms per loop
# Approach #2
In [84]: %timeit np.flatnonzero(binary_erosion(a>T,np.ones(N, dtype=int), origin=-(N//2)))
100 loops, best of 3: 11.7 ms per loop
Try:
th = 40
results = [x for x in range(len(array) - 2) if array[x:x+3].min() > th]
which is a list comprehension for
th = 40
results = []
for x in range(len(array) - 2):
    if array[x:x+3].min() > th:
        results.append(x)
Another approach, using numpy.lib.stride_tricks.as_strided:
In [59]: import numpy as np
In [60]: from numpy.lib.stride_tricks import as_strided
Define the input data:
In [61]: a = np.array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [62]: N = 3
In [63]: threshold = 40
Compute the result; q is the boolean mask for the "big" values.
In [64]: q = a > threshold
In [65]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [66]: result
Out[66]: array([4, 8, 9])
Do it again with N = 4:
In [67]: N = 4
In [68]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [69]: result
Out[69]: array([8])
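On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view builds the same windows as the as_strided call, without the risk of hand-computed strides; a sketch:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
N, threshold = 3, 40
q = a > threshold
result = np.all(sliding_window_view(q, N), axis=1).nonzero()[0]
print(result)  # [4 8 9]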

How to swap array's columns if condition is satisfied

I have an Nx3 numpy array:
A = [[01,02,03]
[11,12,13]
[21,22,23]]
I need an array where the second and third columns are swapped if the sum of the second and third numbers is greater than 20:
[[01,02,03]
[11,13,12]
[21,23,22]]
Is it possible to achieve this without a loop?
UPDATE:
So, the story behind this is that I want to swap colors in an RGB image, namely green and blue, but not yellow; that is my condition. Empirically, I found it to be abs(green - blue) > 15 && (blue > green)
swapped = np.array(img).reshape(img.shape[0] * img.shape[1], img.shape[2])
idx = ((np.abs(swapped[:,1] - swapped[:,2]) < 15) & (swapped[:, 2] < swapped[:, 1]))
swapped[idx, 1], swapped[idx, 2] = swapped[idx, 2], swapped[idx, 1]
plt.imshow(swapped.reshape(img.shape[0], img.shape[1], img.shape[2]))
This actually works, but only partially: the first column is swapped, but the second one is overwritten.
Note that plain tuple assignment does not swap the columns here (tested in Python 3): a[:,2] and a[:,1] on the right are views, so the second assignment reads column 1 after it has already been overwritten, and the old column 2 ends up in both places.
a = np.array([[1,2,3],[11,12,13],[21,22,23]])
a[:,1], a[:,2] = a[:,2], a[:,1]
array([[ 1,  3,  3],
       [11, 13, 13],
       [21, 23, 23]])
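A variant that does work (my addition, not part of the answer above) indexes both columns at once; fancy indexing on the right-hand side produces a copy, so neither column is clobbered:
import numpy as np

a = np.array([[1, 2, 3], [11, 12, 13], [21, 22, 23]])
a[:, [1, 2]] = a[:, [2, 1]]   # RHS fancy indexing copies before assignment
print(a)
# [[ 1  3  2]
#  [11 13 12]
#  [21 23 22]]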
Here's one way with masking -
# Get 1D mask of length same as the column length of array and with True
# values at places where the combined sum is > 20
m = A[:,1] + A[:,2] > 20
# Get the masked elements off the second column
tmp = A[m,2]
# Assign into the masked places in the third col from the
# corresponding masked places in second col.
# Note that this won't change `tmp` because `tmp` isn't a view into
# the third col, but holds a separate memory space
A[m,2] = A[m,1]
# Finally assign into the second col from tmp
A[m,1] = tmp
Sample run -
In [538]: A
Out[538]:
array([[ 1, 2, 3],
[11, 12, 13],
[21, 22, 23]])
In [539]: m = A[:,1] + A[:,2] > 20
...: tmp = A[m,2]
...: A[m,2] = A[m,1]
...: A[m,1] = tmp
In [540]: A
Out[540]:
array([[ 1, 2, 3],
[11, 13, 12],
[21, 23, 22]])
How about using np.where along with "fancy" indexing, and np.flip to swap the elements?
In [145]: A
Out[145]:
array([[ 1, 2, 3],
[11, 12, 13],
[21, 22, 23]])
# extract matching sub-array
In [146]: matches = A[np.where(np.sum(A[:, 1:], axis=1) > 20)]
In [147]: matches
Out[147]:
array([[11, 12, 13],
[21, 22, 23]])
# swap elements and update the original array using "boolean" indexing
In [148]: A[np.where(np.sum(A[:, 1:], axis=1) > 20)] = np.hstack((matches[:, :1], np.flip(matches[:, 1:], axis=1)))
In [149]: A
Out[149]:
array([[ 1, 2, 3],
[11, 13, 12],
[21, 23, 22]])
One more approach, based on @Divakar's suggestion:
First, get the indices which are nonzero for the specified condition (here, the sum of the elements in the second and third columns > 20):
In [70]: idx = np.flatnonzero(np.sum(A[:, 1:3], axis=1) > 20)
Then create an open mesh using np.ix_
In [71]: gidx = np.ix_(idx,[1,2])
# finally update the original array `A`
In [72]: A[gidx] = A[gidx][:,::-1]
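Carried back to the OP's RGB case, the same np.ix_ pattern swaps green and blue only on matching pixels. A hypothetical sketch (the dummy img and the cast to int are my assumptions; the condition mirrors the code in the question):
import numpy as np

img = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)  # stand-in RGB image
flat = img.reshape(-1, 3)                                   # a view, so edits hit img too
# cast to int so the uint8 subtraction cannot wrap around
idx = np.flatnonzero((np.abs(flat[:, 1].astype(int) - flat[:, 2]) < 15)
                     & (flat[:, 2] < flat[:, 1]))
gidx = np.ix_(idx, [1, 2])
flat[gidx] = flat[gidx][:, ::-1]                            # swap G and B on matching pixels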

Changing multiple Numpy array elements using slicing in Python

Say I have the numpy array arr_1 = np.arange(10) returning:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
How do I change multiple elements to a certain value using slicing?
For example: changing the zeroth, first and second element that occur every five elements, starting from the first element, to 100. I want this:
array([0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
I tried arr_1[1::[5, 6, 7]] = 100 but that doesn't work.
Here is another solution based on what you did:
arr_1 = np.arange(10)
arr_1[1::5] = 100
arr_1[2::5] = 100
arr_1[3::5] = 100
and it returns:
array([ 0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
If your repeat offset divides the array length:
a.reshape((-1, 5))[:, 1:4] = 100
General case requires two lines:
a[: len(a) // 5 * 5].reshape((-1, 5))[:, 1:4] = 100
a[len(a) // 5 * 5 :][1:4] = 100
How it works: Reshaping in the described way stacks consecutive stretches of the array so that the target sub-stretches are aligned and can therefore be addressed in one go using standard 2d indexing (the x suffixes below mark the targeted elements):
>>> a = np.arange(15)
>>> a.reshape((-1, 5))
array([[ 0, 1x, 2x, 3x, 4],
[ 5, 6x, 7x, 8x, 9],
[10, 11x, 12x, 13x, 14]])
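A quick check of the two-line general form on a length-12 array, which has a ragged tail (this example is mine, using the values from the question):
import numpy as np

a = np.arange(12)
a[: len(a) // 5 * 5].reshape((-1, 5))[:, 1:4] = 100
a[len(a) // 5 * 5 :][1:4] = 100
print(a)  # [  0 100 100 100   4   5 100 100 100   9  10 100]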
Here's one approach with masking -
a = np.arange(10) # Input array
idx = np.array([0,1,2]) # Indices to be set
offset = 1 # Offset
a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
Sample run with original sample -
In [849]: a = np.arange(10) # Input array
...: idx = np.array([0,1,2]) # Indices to be set
...: offset = 1 # Offset
...:
...: a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
...:
In [850]: a
Out[850]: array([ 0, 100, 100, 100, 4, 5, 100, 100, 100, 9])
Sample run with non-sequential indices -
In [851]: a = np.arange(11) # Input array
...: idx = np.array([0,2,3]) # Indices to be set
...: offset = 1 # Offset
...:
In [852]: a[np.in1d(np.mod(np.arange(a.size),5) , idx+offset)] = 100
In [853]: a
Out[853]: array([ 0, 100, 2, 100, 100, 5, 100, 7, 100, 100, 10])
You just need to wrap your list of indexes in np.array(list). You were very close to being correct:
In [2]: arr_1 = np.arange(10)
In [3]: arr_1[np.array([0,1,2,5,6,7])] = 100
In [4]: arr_1
Out[4]: array([100, 100, 100, 3, 4, 100, 100, 100, 8, 9])
I used hand coded values for the indexes, per your requirements. You can get the indexes in an automated way using some technique you like, like that shown by Divakar.

Efficient numpy indexing: Take first N rows of every block of M rows

x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
I want to grab the first 2 rows of array x from every block of 5; the result should be:
x[fancy_indexing] = [1,2, 6,7, 11,12]
It's easy enough to build up an index like that using a for loop.
Is there a one-liner slicing trick that will pull it off? Points for simplicity here.
Approach #1 Here's a vectorized one-liner using boolean-indexing -
x[np.mod(np.arange(x.size),M)<N]
Approach #2 If you are going for performance, here's another vectorized approach using NumPy strides -
n = x.strides[0]
shp = (x.size//M,N)
out = np.lib.stride_tricks.as_strided(x, shape=shp, strides=(M*n,n)).ravel()
Sample run -
In [61]: # Inputs
...: x = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
...: N = 2
...: M = 5
...:
In [62]: # Approach 1
...: x[np.mod(np.arange(x.size),M)<N]
Out[62]: array([ 1, 2, 6, 7, 11, 12])
In [63]: # Approach 2
...: n = x.strides[0]
...: shp = (x.size//M,N)
...: out=np.lib.stride_tricks.as_strided(x,shape=shp,strides=(M*n,n)).ravel()
...:
In [64]: out
Out[64]: array([ 1, 2, 6, 7, 11, 12])
I first thought you needed this to work for 2d arrays, due to your phrasing of "first N rows of every block of M rows", so I'll leave my solution in that form.
You could work some magic by reshaping your array into 3d:
M = 5 # size of blocks
N = 2 # number of columns to keep from each block
x = np.arange(3*4*M).reshape(4,-1) # (4, 3*M)-shaped dummy input
x = x.reshape(x.shape[0], -1, M)[:, :, :N].reshape(x.shape[0], -1) # (4, 3*N)-shaped output
This extracts the first N columns of every block of M columns. In order to use it for your 1d case, you'd need to make your 1d array into a 2d one using x = x[None, :].
Reshape the array to multiple rows of five columns then take (slice) the first two columns of each row.
>>> x
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> x.reshape(x.shape[0] // 5, 5)[:,:2]
array([[ 1, 2],
[ 6, 7],
[11, 12]])
Or
>>> x.reshape(x.shape[0] // 5, 5)[:,:2].flatten()
array([ 1, 2, 6, 7, 11, 12])
>>>
It only works with 1-d arrays that have a length that is a multiple of five.
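If the length is not a multiple of five, one workaround (a sketch of my own, not from the answer) is to handle the full blocks and the partial tail separately:
import numpy as np

x = np.arange(1, 18)                            # length 17, not a multiple of 5
head = x[: len(x) // 5 * 5].reshape(-1, 5)[:, :2].ravel()
tail = x[len(x) // 5 * 5 :][:2]                 # first two of the partial block
print(np.concatenate([head, tail]))             # [ 1  2  6  7 11 12 16 17]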
import numpy as np
x = np.array(range(1, 16))
y = np.vstack([x[0::5], x[1::5]]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12])
Taking the first N rows of every block of M rows in the array [1, 2, ..., K]:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
y = np.vstack([x[i::M] for i in range(N)]).T.ravel()
y
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])
Notice that .T is essentially free, as it only manipulates the dimensions and strides of the array; .ravel() does copy the data here (the transposed array is not contiguous), but both operations are fast.
If you insist on getting your slice using fancy indexing:
import numpy as np
K = 30
M = 5
N = 2
x = np.array(range(1, K+1))
fancy_indexing = [i*M+n for i in range(len(x)//M) for n in range(N)]
x[fancy_indexing]
# => array([ 1, 2, 6, 7, 11, 12, 16, 17, 21, 22, 26, 27])

Find the index of the k smallest values of a numpy array

In order to find the index of the smallest value, I can use argmin:
import numpy as np
A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
print(A.argmin()) # 4 because A[4] = 0.1
But how can I find the indices of the k-smallest values?
I'm looking for something like:
print(A.argmin(numberofvalues=3))
# [4, 0, 7] because A[4] <= A[0] <= A[7] <= all other A[i]
Note: in my use case A has between ~10,000 and 100,000 values, and I'm interested only in the indices of the k=10 smallest values. k will never be > 10.
Use np.argpartition. It does not sort the entire array. It only guarantees that the kth element is in sorted position and all smaller elements will be moved before it. Thus the first k elements will be the k-smallest elements.
import numpy as np
A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
k = 3
idx = np.argpartition(A, k)
print(idx)
# [4 0 7 3 1 2 6 5]
This returns the k-smallest values. Note that these may not be in sorted order.
print(A[idx[:k]])
# [ 0.1 1. 1.5]
To obtain the k-largest values use
idx = np.argpartition(A, -k)
# [4 0 7 3 1 2 6 5]
A[idx[-k:]]
# [ 9. 17. 17.]
WARNING: Do not (re)use idx = np.argpartition(A, k); A[idx[-k:]] to obtain the k-largest.
That won't always work. For example, these are NOT the 3 largest values in x:
x = np.array([100, 90, 80, 70, 60, 50, 40, 30, 20, 10, 0])
idx = np.argpartition(x, 3)
x[idx[-3:]]
array([ 70, 80, 100])
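If the k results are needed in ascending order, a common follow-up pattern (not from the answer above) is to partition first and then argsort only the k-element slice, which stays cheap for small k:
import numpy as np

A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
k = 3
idx = np.argpartition(A, k)[:k]     # k smallest indices, in arbitrary order
idx = idx[np.argsort(A[idx])]       # sort just those k by their values
print(idx)     # [4 0 7]
print(A[idx])  # [0.1 1.  1.5]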
Here is a comparison against np.argsort, which also works but just sorts the entire array to get the result.
In [2]: x = np.random.randn(100000)
In [3]: %timeit idx0 = np.argsort(x)[:100]
100 loops, best of 3: 8.26 ms per loop
In [4]: %timeit idx1 = np.argpartition(x, 100)[:100]
1000 loops, best of 3: 721 µs per loop
In [5]: np.alltrue(np.sort(np.argsort(x)[:100]) == np.sort(np.argpartition(x, 100)[:100]))
Out[5]: True
You can use numpy.argsort with slicing
>>> import numpy as np
>>> A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
>>> np.argsort(A)[:3]
array([4, 0, 7], dtype=int32)
For n-dimensional arrays, this function works well. The indices are returned in a form that can be used directly to index the array. If you want a plain list of indices instead, transpose the result before converting it to a list.
To retrieve the k largest, simply pass in -k.
def get_indices_of_k_smallest(arr, k):
    idx = np.argpartition(arr.ravel(), k)
    return tuple(np.array(np.unravel_index(idx, arr.shape))[:, range(min(k, 0), max(k, 0))])
    # if you want it in a list of indices . . .
    # return np.array(np.unravel_index(idx, arr.shape))[:, range(k)].transpose().tolist()
Example:
r = np.random.RandomState(1234)
arr = r.randint(1, 1000, 2 * 4 * 6).reshape(2, 4, 6)
indices = get_indices_of_k_smallest(arr, 4)
indices
# (array([1, 0, 0, 1], dtype=int64),
# array([3, 2, 0, 1], dtype=int64),
# array([3, 0, 3, 3], dtype=int64))
arr[indices]
# array([ 4, 31, 54, 77])
%%timeit
get_indices_of_k_smallest(arr, 4)
# 17.1 µs ± 651 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
numpy.partition(your_array, k) is an alternative when only the values are needed, not the indices: it returns the array with the k smallest values placed before position k (not necessarily in sorted order among themselves).
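A quick sketch of np.partition for the values-only case:
import numpy as np

A = np.array([1, 7, 9, 2, 0.1, 17, 17, 1.5])
k = 3
print(np.partition(A, k)[:k])  # the 3 smallest values (0.1, 1.0, 1.5), not necessarily in order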
