I would like to know the fastest way to extract the indices of the first n non zero values per column in a 2D array.
For example, with the following array:
arr = np.array([
    [4, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [2, 0, 9, 0],
    [6, 0, 0, 0],
    [0, 7, 0, 0],
    [3, 0, 0, 0],
    [1, 2, 0, 0],
])
With n=2 I would get [0, 0, 1, 1, 2] as xs and [0, 3, 2, 5, 3] as ys: two values each in the first and second columns, and one in the third.
Here is how it is currently done:
x = []
y = []
n = 3
for i, c in enumerate(arr.T):
    a = c.nonzero()[0][:n]
    if len(a):
        x.extend([i] * len(a))
        y.extend(a)
In practice I have arrays of size (405, 256).
Is there a way to make it faster?
Here is a method that does not require sorting the array (only a linear scan is needed to find the non-zero values), although it is admittedly a bit convoluted since it chains several functions:
n = 2
# Get indices of non-zero values, column indices first
nnull = np.stack(np.where(arr.T != 0))
# Split the indices by column (groups of equal values in the first row)
cols_ids = np.array_split(range(len(nnull[0])), np.where(np.diff(nnull[0]) > 0)[0] + 1)
# Take at most n indices from each group and concatenate
np.concatenate([nnull[:, u[:n]] for u in cols_ids], axis=1)
outputs:
array([[0, 0, 1, 1, 2],
       [0, 3, 2, 5, 3]], dtype=int64)
Here is one approach using argsort; it gives a different order, though:
n = 2
m = arr != 0
# sort so that non-zero values come first in each column
idx = np.argsort(~m, axis=0)
# take the first n rows and keep only positions that are actually non-zero
m2 = np.take_along_axis(m, idx, axis=0)[:n]
y, x = np.where(m2)
# columns and the corresponding original row indices
x, idx[y, x]
# (array([0, 1, 2, 0, 1]), array([0, 2, 3, 3, 5]))
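If the column-major order from the question is needed, one way (a small sketch of mine, reusing the variables above) is to re-sort the pairs with np.lexsort:

rows = idx[y, x]
order = np.lexsort((rows, x))   # primary key: column (x), secondary key: row
xs, ys = x[order], rows[order]
# xs -> array([0, 0, 1, 1, 2]), ys -> array([0, 3, 2, 5, 3])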
Use a shifted comparison on the column indices returned by the transposed nonzero, so that at most n entries are kept per column:
>>> n = 2
>>> i, j = arr.T.nonzero()
>>> mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
>>> i[mask], j[mask]
(array([0, 0, 1, 1, 2], dtype=int64), array([0, 3, 2, 5, 3], dtype=int64))
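Wrapped as a reusable helper (a minimal sketch; the function name is mine, not from the answer):

import numpy as np

def first_n_nonzero(arr, n):
    # i: column indices, j: row indices, grouped column by column
    i, j = arr.T.nonzero()
    # an entry is kept only if the entry n positions earlier belongs to a different
    # column, i.e. at most the first n entries of each column survive
    mask = np.concatenate([[True] * n, i[n:] != i[:-n]])
    return i[mask], j[mask]

first_n_nonzero(arr, 2)
# (array([0, 0, 1, 1, 2]), array([0, 3, 2, 5, 3]))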
How can I count the number of times an array is present in a larger array?
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([1, 1, 1])
The count of the number of times b is present in a should be 3.
b can be any combination of 1s and 0s.
I'm working with huge arrays, so for-loops are pretty slow.
If the subarray being searched for contains all 1s, you can count the number of times the subarray appears in the larger array by convolving the two arrays with np.convolve and counting the number of entries in the result that equal the size of the subarray:
# 'valid' = convolve only over the complete overlap of the signals
>>> np.convolve(a, b, mode='valid')
array([1, 1, 2, 3, 2, 2, 2, 3, 3, 2, 1, 1])
# ^ ^ ^ <= Matches
>>> win_size = min(a.size, b.size)
>>> np.count_nonzero(np.convolve(a, b, mode='valid') == win_size)
3
For subarrays that may contain 0s, you can start by using convolution to transform a into an array containing the binary numbers encoded by each window of size b.size. Then just compare each element of the transformed array with the binary number encoded by b and count the matches:
>>> b = np.array([0, 1, 1]) # encodes '3'
>>> weights = 2 ** np.arange(b.size) # == [1, 2, 4, 8, ..., 2**(b.size-1)]
>>> np.convolve(a, weights, mode='valid')
array([4, 1, 3, 7, 6, 5, 3, 7, 7, 6, 4, 1])
# ^ ^ Matches
>>> target = (b * np.flip(weights)).sum() # target==3
>>> np.count_nonzero(np.convolve(a, weights, mode='valid') == target)
2
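Put together, the whole approach fits in a small helper (a sketch of mine; note that with the default integer dtype the weights overflow once b has more than about 62 elements):

import numpy as np

def count_subarray(a, b):
    # positional weights 1, 2, 4, ... so each window encodes a unique binary number
    weights = 2 ** np.arange(b.size)
    target = (b * weights[::-1]).sum()               # number encoded by b itself
    codes = np.convolve(a, weights, mode='valid')    # number encoded by each window of a
    return np.count_nonzero(codes == target)

a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
print(count_subarray(a, np.array([1, 1, 1])))   # 3
print(count_subarray(a, np.array([0, 1, 1])))   # 2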
Not a super fast method, but you can view a as a windowed array using np.lib.stride_tricks.sliding_window_view:
window = np.lib.stride_tricks.sliding_window_view(a, b.shape)
You can now equate this to b directly and find where they match:
result = (window == b).all(-1).sum()
For older versions of numpy (pre-1.20.0), you can use np.lib.stride_tricks.as_strided to achieve a similar result:
window = np.lib.stride_tricks.as_strided(
    a, shape=(*(np.array(a.shape) - b.shape + 1), *b.shape),
    strides=a.strides + (a.strides[0],) * b.ndim)
Here is a pure-Python solution using a generator expression:
a = [1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1]
b = [1, 1, 1]
sum(a[i:i+len(b)] == b for i in range(len(a) - len(b) + 1))
output: 3
Here are a few improvements on @Brian's answer:
Use np.correlate, not np.convolve; they are nearly identical, but convolve reads a and b in opposite directions.
To deal with templates that contain zeros, convert the zeros to -1. For example:
a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
b = np.array([0,1,1])
np.correlate(a,2*b-1)
# array([-1, 1, 2, 1, 0, 0, 2, 1, 1, 0, -1, 1])
The template fits where the correlation equals the number of ones in the template. The indices can be extracted like so:
(np.correlate(a,2*b-1)==np.count_nonzero(b)).nonzero()[0]
# array([2, 6])
If you only need the count, use np.count_nonzero:
np.count_nonzero((np.correlate(a,2*b-1)==np.count_nonzero(b)))
# 2
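The same trick wrapped as a small helper (a sketch; the name is mine, and it works for all-ones templates as well):

import numpy as np

def count_matches(a, b):
    # correlation equals the number of ones in b only where the window matches b exactly
    return np.count_nonzero(np.correlate(a, 2 * b - 1) == np.count_nonzero(b))

a = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1])
print(count_matches(a, np.array([1, 1, 1])))   # 3
print(count_matches(a, np.array([0, 1, 1])))   # 2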
For example, I have the following array z:
array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
How do I find the distances between successive 1s in this array, measured in the number of 0s between them?
For example, in the z array, such distances are:
[1, 3, 2]
Here is the code I currently have for it:
distances = []
prev_idx = 0
for idx, element in enumerate(z):
    if element == 1:
        distances.append(idx - prev_idx)
        prev_idx = idx
distances = np.array(distances[1:]) - 1
Can this operation be done without a for-loop, and maybe in a more efficient way?
UPD
The solution in @warped's answer works fine in the 1-D case.
But what if z is a 2D array like np.array([z, z])?
You can use np.where to find the ones, and then np.diff to get the distances:
q=np.where(z==1)
np.diff(q[0])-1
out:
array([1, 3, 2], dtype=int64)
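An equivalent one-liner (just a variant of the same idea) uses np.flatnonzero, which works here because the array only contains 0s and 1s:

np.diff(np.flatnonzero(z)) - 1   # array([1, 3, 2])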
edit:
for 2d arrays:
You can use the minimum Manhattan distance (decremented by 1) between the positions that contain ones to get the number of zeros in between:
def manhattan_distance(a, b):
    return np.abs(np.array(a) - np.array(b)).sum()

zeros_between = []
r, c = np.where(z == 1)
coords = list(zip(r, c))
for i, c in enumerate(coords[:-1]):
    zeros_between.append(
        np.min([manhattan_distance(c, coords[j]) - 1 for j in range(i + 1, len(coords))]))
If you don't want to use the for-loop, you can use np.where and np.roll:
import numpy as np
x = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 1])
pos = np.where(x==1)[0] #pos = array([0, 2, 6, 9])
shift = np.roll(pos,-1) # shift = array([2, 6, 9, 0])
result = ((shift-pos)-1)[:-1]
#shift-pos = array([ 2, 4, 3, -9])
#(shift-pos)-1 = array([ 1, 3, 2, -10])
#((shift-pos)-1)[:-1] = array([ 1, 3, 2])
print(result)
I have a 2d numpy array, A. I want to apply np.bincount() to each column of the matrix A to generate another 2d array B that is composed of the bincounts of each column of the original matrix A.
My problem is that np.bincount() is a function that takes a 1d array-like. It's not an array method like B = A.max(axis=1) for example.
Is there a more pythonic/numpythic way to generate this B array other than a nasty for-loop?
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
for x in range(A.shape[1]):
    B[:,x] = np.bincount(A[:,x])
Using the same philosophy as in this post, here's a vectorized approach -
m = A.shape[1]
n = A.max()+1
A1 = A + (n*np.arange(m))
out = np.bincount(A1.ravel(),minlength=n*m).reshape(m,-1).T
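To make the offset trick concrete, here is a tiny worked example (the values are mine, not from the answer): each column's values are shifted into their own disjoint range before a single flat bincount.

import numpy as np

A = np.array([[0, 1],
              [2, 1],
              [0, 2]])
m = A.shape[1]        # 2 columns
n = A.max() + 1       # 3 possible values: 0, 1, 2

A1 = A + n * np.arange(m)   # column j now holds values in [j*n, j*n + n)
out = np.bincount(A1.ravel(), minlength=n * m).reshape(m, -1).T
print(out)
# [[2 0]
#  [0 2]
#  [1 1]]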
I would suggest to use np.apply_along_axis, which will allow you to apply a 1D-method (in this case np.bincount) to 1D slices of a higher dimensional array:
import numpy as np
states = 4
rows = 8
cols = 4
A = np.random.randint(0,states,(rows,cols))
B = np.zeros((states,cols))
B = np.apply_along_axis(np.bincount, axis=0, arr=A)
You'll have to be careful, though. This (as well as your suggested for-loop) only works if the output of np.bincount has the right shape. If the maximum state is not present in one or more columns of your array A, the output for those columns will have a smaller length and the code will fail with a ValueError.
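One possible workaround (my suggestion, not part of the original answer) is to fix the output length via bincount's minlength argument; extra keyword arguments to np.apply_along_axis are forwarded to the 1-D function:

# pad every column's bincount to the same number of states
B = np.apply_along_axis(np.bincount, axis=0, arr=A, minlength=states)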
This solution using the numpy_indexed package (disclaimer: I am its author) is fully vectorized, thus does not include any python loops behind the scenes. Also, there are no restrictions on the input; not every column needs to contain the same set of unique values.
import numpy_indexed as npi
rowidx, colidx = np.indices(A.shape)
(bin, col), B = npi.count_table(A.flatten(), colidx.flatten())
This gives an alternative (sparse) representation of the same result, which may be much more appropriate if the B array does indeed contain many zeros:
(bin, col), count = npi.count((A.flatten(), colidx.flatten()))
Note that apply_along_axis is just syntactic sugar for a for-loop, and has the same performance characteristics.
Yet another possibility:
import numpy as np

def bincount_columns(x, minlength=None):
    nbins = x.max() + 1
    if minlength is not None:
        nbins = max(nbins, minlength)
    ncols = x.shape[1]
    count = np.zeros((nbins, ncols), dtype=int)
    colidx = np.arange(ncols)[None, :]
    np.add.at(count, (x, colidx), 1)
    return count
For example,
In [110]: x
Out[110]:
array([[4, 2, 2, 3],
       [4, 3, 4, 4],
       [4, 3, 4, 4],
       [0, 2, 4, 0],
       [4, 1, 2, 1],
       [4, 2, 4, 3]])
In [111]: bincount_columns(x)
Out[111]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2]])
In [112]: bincount_columns(x, minlength=7)
Out[112]:
array([[1, 0, 0, 1],
       [0, 1, 0, 1],
       [0, 3, 2, 0],
       [0, 2, 0, 2],
       [5, 0, 4, 2],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
I'm looking to compactly represent duplicates in a Python list / 1D numpy array. For instance, say we have
x = np.array([1, 0, 0, 3, 3, 0])
this array has several duplicate elements, that can be represented with a
group_id = np.array([0, 1, 1, 2, 2, 1])
so that all duplicates in a given cluster are found with x[group_id==<some_id>].
The list of duplicate pairs can be efficiently computed with sorting,
s_idx = np.argsort(x)
diff_idx = np.nonzero(x[s_idx[:-1]] == x[s_idx[1:]])[0]
where the pair s_idx[diff_idx] <-> s_idx[diff_idx+1] correspond to the indices in the original array that are duplicates.
(here array([1, 2, 3]) <-> array([2, 5, 4])).
However, I'm not sure how to efficiently calculate group_id from this linkage information for large array sizes (N > 10⁶).
Edit: as suggested by @Chris_Rands, this can indeed be done with itertools.groupby,
import numpy as np
import itertools
def get_group_id(x):
    group_id = np.zeros(x.shape, dtype='int')
    for i, j in itertools.groupby(x):
        j_el = next(j)
        group_id[x==j_el] = i
    return group_id
however the scaling appears to be O(n^2), and this would not scale to my use case (N > 10⁶),
for N in [50000, 100000, 200000]:
    %time _ = get_group_id(np.random.randint(0, N, size=N))
CPU times: total: 1.53 s
CPU times: total: 5.83 s
CPU times: total: 23.9 s
and I believe using the duplicate linkage information would be more efficient, as computing the duplicate pairs for N=200000 takes just 6.44 µs in comparison.
You could use numpy.unique:
In [13]: x = np.array([1, 0, 0, 3, 3, 0])
In [14]: values, cluster_id = np.unique(x, return_inverse=True)
In [15]: values
Out[15]: array([0, 1, 3])
In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])
(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)
Locations of the items in cluster 0:
In [22]: cid = 0
In [23]: values[cid]
Out[23]: 0
In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
Here's an approach using np.unique to keep the order according to the first appearance of a number -
unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
out = first_idx.argsort().argsort()[ID]
Sample run -
In [173]: x
Out[173]: array([1, 0, 0, 3, 3, 0, 9, 0, 2, 6, 0, 0, 4, 8])
In [174]: unq, first_idx, ID = np.unique(x,return_index=1,return_inverse=1)
In [175]: first_idx.argsort().argsort()[ID]
Out[175]: array([0, 1, 1, 2, 2, 1, 3, 1, 4, 5, 1, 1, 6, 7])
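For reference, the same steps applied to the x from the question (a small check of mine) reproduce the group_id given there:

import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
unq, first_idx, ID = np.unique(x, return_index=True, return_inverse=True)
# first_idx.argsort().argsort() ranks each unique value by where it first appears,
# so indexing with ID relabels x in order of first appearance
group_id = first_idx.argsort().argsort()[ID]
print(group_id)   # [0 1 1 2 2 1]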