Find repeat values in numpy nd array - python

My data samples are each a numpy array of shape e.g. (100, 100, 9), and I have 10 of these concatenated into a single array foo of shape (10, 100, 100, 9). Over the 10 data samples, I'd like to find the indices of repeat values. So for example, if foo[0, 42, 42, 3] = 0.72 and foo[0, 42, 42, 7] = 0.72, I'd like an output that reflects this. What is an efficient way of doing so?
I'm thinking of a boolean output array of shape (100, 100, 9), but is there a better approach than looping to compare each pair of data samples (quadratic runtime in the number of data samples, here 10)?

In the below snippet, dups is the desired result: a boolean array that shows which indices are duplicates. There's also a delta threshold, so any difference in values <= this threshold is a duplicate.
import numpy as np

delta = 0.
dups = np.zeros(foo.shape[1:], dtype=bool)
for i in range(foo.shape[0]):
    for j in range(foo.shape[0]):
        if i == j:
            continue
        dups |= abs(foo[i] - foo[j]) <= delta

Here is a solution using argsort on each sample. Not pretty and not fast, but it does the job.
import numpy as np
from timeit import timeit

def dupl(a, axis=0, make_dict=True):
    a = np.moveaxis(a, axis, -1)
    i = np.argsort(a, axis=-1, kind='mergesort')
    ai = a[tuple(np.ogrid[tuple(map(slice, a.shape))][:-1]) + (i,)]
    same = np.zeros(a.shape[:-1] + (a.shape[-1]+1,), bool)
    same[..., 1:-1] = np.diff(ai, axis=-1) == 0
    uniqs = np.where((same[..., 1:] & ~same[..., :-1]).ravel())[0]
    same = (same[..., 1:] | same[..., :-1]).ravel()
    reps = np.split(i.ravel()[same], np.cumsum(same)[uniqs[1:]-1])
    grps = np.searchsorted(uniqs, np.arange(0, same.size, a.shape[-1]))
    keys = ai.ravel()[uniqs]
    if make_dict:
        result = np.empty(a.shape[:-1], object)
        result.ravel()[:] = [dict(zip(*p)) for p in np.split(
            np.array([keys, reps], object), grps[1:], axis=-1)]
        return result
    else:
        return keys, reps, grps
a = np.random.randint(0, 10, (10, 100, 100, 9))
axis = 0
result = dupl(a, axis)

print('shape, axis, time (sec) for 10 trials:',
      a.shape, axis, timeit(lambda: dupl(a, axis=axis), number=10))
print('same without creating dict:',
      a.shape, axis, timeit(lambda: dupl(a, axis=axis, make_dict=False),
                            number=10))

# check
print("checking result")
am = np.moveaxis(a, axis, -1)
for af, df in zip(am.reshape(-1, am.shape[-1]), result.ravel()):
    assert len(set(af)) + sum(map(len, df.values())) == len(df) + am.shape[-1]
    for k, v in df.items():
        assert np.all(np.where(af == k)[0] == v)
print("no errors")
prints:
shape, axis, time (sec) for 10 trials: (10, 100, 100, 9) 0 5.328339613042772
same without creating dict: (10, 100, 100, 9) 0 2.568383438978344
checking result
no errors
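If all you need is the boolean dups mask from the question (rather than the per-cell dictionaries above), a simpler route is to sort along the sample axis and diff once: if any two samples at a cell are within delta of each other, then some adjacent pair in sorted order is too. A minimal sketch, assuming foo is the (10, 100, 100, 9) array and delta is the threshold from the question:

import numpy as np

def repeat_mask(foo, delta=0.0):
    # Sort along the sample axis so close values become neighbours, then a
    # single diff along that axis finds duplicates per cell.
    s = np.sort(foo, axis=0)
    return (np.diff(s, axis=0) <= delta).any(axis=0)

foo = np.random.randint(0, 5, (10, 100, 100, 9)).astype(float)
dups = repeat_mask(foo)   # boolean, shape (100, 100, 9)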

Related

How to index a multidimensional numpy array with a number of 1d boolean arrays?

Assume that I have a numpy array A with n dimensions, which might be very large, and that I have k 1-dimensional boolean masks M1, ..., Mk.
I would like to extract from A an n-dimensional array B which contains all the elements of A located at indices where the "outer-AND" of all the masks is True.
However, I would like to do this without first forming the (possibly very large) "outer-AND" of all the masks, and without extracting the specified elements one axis at a time, which creates (possibly many) intermediate copies in the process.
The example below demonstrates the two ways of extracting the elements from A just described above:
from functools import reduce
import numpy as np

m = 100
for _ in range(m):
    n = np.random.randint(0, 10)
    k = np.random.randint(0, n + 1)
    A_shape = tuple(np.random.randint(0, 10, n))
    A = np.random.uniform(-1, 1, A_shape)
    M_lst = [np.random.randint(0, 2, dim).astype(bool) for dim in A_shape]

    # creating shape of B:
    B_shape = tuple(map(np.count_nonzero, M_lst)) + A_shape[len(M_lst):]
    # size of B:
    B_size = np.prod(B_shape)

    # --- USING "OUTER-AND" OF ALL MASKS --- #
    # creating "outer-AND" of all masks:
    M = reduce(np.bitwise_and,
               (np.expand_dims(M, tuple(np.r_[:i, i+1:n])) for i, M in enumerate(M_lst)),
               True)
    # extracting elements from A and reshaping to the correct shape:
    B1 = A[M].reshape(B_shape)
    # checking that the correct number of elements was extracted
    assert B1.size == B_size
    # THE PROBLEM WITH THIS METHOD IS THE POSSIBLY VERY LARGE OUTER-AND OF ALL THE MASKS!

    # --- USING ONE MASK AT A TIME --- #
    B2 = A
    for i, M in enumerate(M_lst):
        B2 = B2[tuple(slice(None) for _ in range(i)) + (M,)]
    assert B2.size == np.prod(B_shape)
    assert B2.shape == B_shape
    # THE PROBLEM WITH THIS METHOD IS THE POSSIBLY LARGE NUMBER OF POSSIBLY LARGE INTERMEDIATE COPIES!
    assert np.all(B1 == B2)

    # EDIT 1:
    # USING np.ix_ AS SUGGESTED BY Chrysophylaxs
    i = np.ix_(*M_lst)
    B3 = A[i]
    assert B3.shape == B_shape
    assert B3.size == B_size
    assert np.prod(list(map(np.size, i))) == B_size

print(f'All three methods worked all {m} times')
Is there a smarter (more efficient) way to do this, possibly using an existing numpy function?
IIUC, you're looking for np.ix_; an example:
import numpy as np
arr = np.arange(60).reshape(3, 4, 5)
x = [True, False, True]
y = [False, True, True, False]
z = [False, True, False, True, False]
out = arr[np.ix_(x, y, z)]
out:
array([[[ 6,  8],
        [11, 13]],

       [[46, 48],
        [51, 53]]])
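Note that np.ix_ also covers the case from the question where only the first k of n axes have masks: the tuple it returns indexes the leading axes and leaves the trailing ones whole. A small sketch (shapes chosen arbitrarily for illustration):

import numpy as np

A = np.arange(2 * 3 * 4).reshape(2, 3, 4)
masks = [np.array([True, False]), np.array([False, True, True])]   # k = 2 masks for n = 3 axes
B = A[np.ix_(*masks)]          # indexes only the first two axes
assert B.shape == (1, 2, 4)    # last axis left untouched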

Matrix multiplication while subsetting elements from matrices and storing in a new matrix

I am attempting a numpy.matmul call with the following variables:
Matrix A of dimensions (p, t, q)
Matrix B of dimensions (r, t)
A categories vector of shape (r,) containing p distinct categories, used to select rows of B and to pick which slice of A to use.
The multiplications are done iteratively over the categories. For each category p_i, I extract from A a submatrix of shape (t, q). Then I multiply it with the subset of B selected by the mask categories == p_i, of shape (x, t). The matrix multiplication of (x, t) and (t, q) produces an output of shape (x, q), which is stored at S[mask].
I cannot figure out a non-iterative version of this algorithm. The first snippet below describes the iterative solution. The second is an attempt at what I would like to get, where everything is computed in a single step, which would presumably be faster. However, it is incorrect because matrix A has three dimensions instead of two. Maybe there is no way to do this in NumPy with a single call; in general, I am looking for advice/ideas to try out.
Thanks!
import numpy as np

p, q, r, t = 2, 9, 512, 4

# data initialization (random)
np.random.seed(500)
S = np.random.rand(r, q)
A = np.random.randint(0, 3, size=(p, t, q))
B = np.random.rand(r, t)
categories = np.random.randint(0, p, r)

print('iterative')  # iterative
for i in range(p):
    # print(i)
    a = A[i, :, :]
    mask = categories == i
    b = B[mask]
    print(b.shape, a.shape, S[mask].shape,
          np.matmul(b, a).shape)
    S[mask] = np.matmul(b, a)
print(S.shape)
A simple way to write it down:
S = np.random.rand(r, q)
print(A[:p,:,:].shape)
result = np.matmul(B, A[:p,:,:])
# iterative assignment
i = 0
S[categories == i] = result[i, categories == i, :]
i = 1
S[categories == i] = result[i, categories == i, :]
The next snippet will produce an error during the multiplication step.
# attempt to multiply once, indexing all categories only once (not possible)
np.random.seed(500)
S = np.random.rand(r, q)
# attempt to use the categories vector
a = A[categories, :, :]
b = B[categories]
# due to the shapes of the arrays, this multiplication is not possible
print('\nsingle step (error due to shapes of the matrix a')
print(b.shape, a.shape, S[categories].shape)
S[categories] = np.matmul(b, a)
print(S.shape)
iterative
(250, 4) (4, 9) (250, 9) (250, 9)
(262, 4) (4, 9) (262, 9) (262, 9)
(512, 9)
single step (error due to shapes of the 2nd matrix a).
(512, 4) (512, 4, 9) (512, 9)
In [63]: (np.ones((512,4)) @ np.ones((512,4,9))).shape
Out[63]: (512, 512, 9)
This is because the first array is broadcast to (1,512,4). I think you want instead to do:
In [64]: (np.ones((512,1,4)) @ np.ones((512,4,9))).shape
Out[64]: (512, 1, 9)
Then remove the middle dimension to get a (512,9).
Another way:
In [72]: np.einsum('ij,ijk->ik', np.ones((512,4)), np.ones((512,4,9))).shape
Out[72]: (512, 9)
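Putting that together for the question's variables, a minimal sketch (it re-creates A, B and categories as in the question so it runs standalone; the broadcasting/einsum wiring is an assumption, checked here against the iterative loop):

import numpy as np

p, q, r, t = 2, 9, 512, 4
np.random.seed(500)
A = np.random.randint(0, 3, size=(p, t, q))
B = np.random.rand(r, t)
categories = np.random.randint(0, p, r)

# Pick each row's (t, q) slice of A, add a singleton dimension to B so every
# row multiplies only its own slice, then drop that dimension again.
S_vec = np.matmul(B[:, None, :], A[categories]).squeeze(1)   # (r, q)
S_ein = np.einsum('rt,rtq->rq', B, A[categories])            # same thing via einsum

# check against the iterative version from the question
S_loop = np.empty((r, q))
for i in range(p):
    mask = categories == i
    S_loop[mask] = np.matmul(B[mask], A[i])
assert np.allclose(S_vec, S_ein) and np.allclose(S_vec, S_loop)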
To remove the loop altogether, you can try this
bigmask = np.arange(p)[:, np.newaxis] == categories
C = np.matmul(B, A)
res = C[np.broadcast_to(bigmask[..., np.newaxis], C.shape)].reshape(r, q)
# `res` has the same rows as the iterative `S` but in the wrong order
# so we need to reorder the rows
sort_index = np.argsort(np.broadcast_to(np.arange(r), bigmask.shape)[bigmask])
assert np.allclose(S, res[sort_index])
Though I'm not sure it's much faster than the iterative version.

get minimum value across array of indices

I have an n-by-3 index array (think of triangles indexing points) and a list of float values associated with the triangles. I now want to get for each index ("point") the minimum value, i.e., check all rows which contain the index, say, 0, and get the minimum value from vals across the respective rows:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = [
numpy.min(vals[numpy.any(a == i, axis=1)])
for i in range(6)
]
# out = numpy.array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])
This solution is inefficient because it does a full array comparison for every i.
This problem is quite similar to what numpy's ufunc.at methods solve, but numpy.min.at doesn't exist (np.min is not a ufunc).
Any hints?
Approach #1
One approach based on array assignment: set up a 2D array filled with NaNs, use the values in a as column indices (so this assumes they are integers), map vals into it, and take the NaN-skipping minimum down each column for the final output -
nr,nc = len(a),a.max()+1
m = np.full((nr,nc),np.nan)
m[np.arange(nr)[:,None],a] = vals[:,None]
out = np.nanmin(m,axis=0)
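A quick check of this approach on the question's example (a and vals as above):

import numpy as np

a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])

nr, nc = len(a), a.max() + 1
m = np.full((nr, nc), np.nan)
m[np.arange(nr)[:, None], a] = vals[:, None]   # each row's value goes into its label columns
out = np.nanmin(m, axis=0)
# out -> array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])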
Approach #2
Another one, again based on array assignment, but using masking and np.minimum.reduceat to avoid dealing with NaNs -
nr,nc = len(a),a.max()+1
m = np.zeros((nc,nr),dtype=bool)
m[a.T,np.arange(nr)] = 1
c = m.sum(1)
shift_idx = np.r_[0,c[:-1].cumsum()]
out = np.minimum.reduceat(np.broadcast_to(vals,m.shape)[m],shift_idx)
Approach #3
Another based on argsort (assuming you have all integers from 0 to a.max() in a) -
sidx = a.ravel().argsort()
c = np.bincount(a.ravel())
out = np.minimum.reduceat(vals[sidx//a.shape[1]],np.r_[0,c[:-1].cumsum()])
Approach #4
For memory efficiency, and hence performance, and also to complete the set -

from numba import njit

@njit
def numba1(a, vals, out):
    m, n = a.shape
    for j in range(m):
        for i in range(n):
            e = a[j, i]
            if vals[j] < out[e]:
                out[e] = vals[j]
    return out

def func1(a, vals, outlen=None):  # feed in output length as outlen if known
    if outlen is not None:
        N = outlen
    else:
        N = a.max() + 1
    out = np.full(N, np.inf)
    return numba1(a, vals, out)
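Example call on the question's data (the output length 6 is known here, so it can be passed in to skip the a.max() scan):

a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])
out = func1(a, vals, outlen=6)
# out -> array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])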
You may switch to pd.GroupBy or itertools.groupby if your for loop goes way beyond 6.
For instance,
r = a.ravel()
pd.Series(np.arange(len(r))//3).groupby(r).apply(lambda s: vals[s].min())
This solution would be faster for long loops, and probably slower for small loops (< 50)
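For completeness, a runnable version of the groupby idea (a and vals as in the question; giving every flattened entry its row's value and grouping by label is equivalent to the apply form above):

import numpy as np
import pandas as pd

a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])

r = a.ravel()
# each flattened entry carries its row's value; take the per-label minimum
out = pd.Series(vals.repeat(a.shape[1])).groupby(r).min()
# out.values -> array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])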
Here is one based on this Q&A:
If you have pythran, compile
file <stb_pthr.py>
import numpy as np

#pythran export sort_to_bins(int[:], int)

def sort_to_bins(idx, mx):
    if mx == -1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i] + 2] += 1
    for i in range(2, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i] + 1]] = i
        cnts[idx[i] + 1] += 1
    return res, cnts[:-1]
Otherwise the script will fall back to a sparse matrix based approach which is only slightly slower:
import numpy as np

try:
    from stb_pthr import sort_to_bins
    HAVE_PYTHRAN = True
except:
    HAVE_PYTHRAN = False

from scipy.sparse import csr_matrix

def sort_to_bins_sparse(idx, mx):
    if mx == -1:
        mx = idx.max() + 1
    aux = csr_matrix((np.ones_like(idx), idx, np.arange(idx.size+1)),
                     (idx.size, mx)).tocsc()
    return aux.indices, aux.indptr

if not HAVE_PYTHRAN:
    sort_to_bins = sort_to_bins_sparse

def f_op():
    mx = a.max() + 1
    return np.fromiter((np.min(vals[np.any(a == i, axis=1)])
                        for i in range(mx)), vals.dtype, mx)

def f_pp():
    idx, bb = sort_to_bins(a.reshape(-1), -1)
    res = np.minimum.reduceat(vals[idx//3], bb[:-1])
    res[bb[:-1] == bb[1:]] = np.inf
    return res

def f_div_3():
    sidx = a.ravel().argsort()
    c = np.bincount(a.ravel())
    bb = np.r_[0, c.cumsum()]
    res = np.minimum.reduceat(vals[sidx//a.shape[1]], bb[:-1])
    res[bb[:-1] == bb[1:]] = np.inf
    return res
a = np.array([
    [0, 1, 2],
    [2, 3, 0],
    [1, 4, 2],
    [2, 5, 3],
])
vals = np.array([0.1, 0.5, 0.3, 0.6])
assert np.all(f_op()==f_pp())
from timeit import timeit
a = np.random.randint(0,1000,(10000,3))
vals = np.random.random(10000)
assert len(np.unique(a))==1000
assert np.all(f_op()==f_pp())
print("1000/1000 labels, 10000 rows")
print("op ", timeit(f_op, number=10)*100, 'ms')
print("pp ", timeit(f_pp, number=100)*10, 'ms')
print("div", timeit(f_div_3, number=100)*10, 'ms')
a = 1 + 2 * np.random.randint(0,5000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
a = 1 + 2 * np.random.randint(0,100000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
Sample run (timings include @Divakar's approach #3 for reference):
1000/1000 labels, 10000 rows
op 145.1122640981339 ms
pp 0.7944229000713676 ms
div 2.2905819199513644 ms
5000/10000 labels, 1000000 rows
pp 113.86540920939296 ms
div 417.2476712032221 ms
100000/200000 labels, 1000000 rows
pp 158.23634970001876 ms
div 486.13436080049723 ms
UPDATE: @Divakar's latest (approach #4) is hard to beat, being essentially a C implementation. Nothing wrong with that except that jitting is not an option but a requirement here (the unjitted code is no fun to run). If one accepts that, the same can, of course, be done with pythran:
pythran -O3 labeled_min.py
file <labeled_min.py>
import numpy as np

#pythran export labeled_min(int[:,:], float[:])

def labeled_min(A, vals):
    mn = np.empty(A.max() + 1)
    mn[:] = np.inf
    M, N = A.shape
    for i in range(M):
        v = vals[i]
        for j in range(N):
            c = A[i, j]
            if v < mn[c]:
                mn[c] = v
    return mn
Both give another massive speedup:
from labeled_min import labeled_min
func1() # do not measure jitting time
print("nmb ", timeit(func1, number=100)*10, 'ms')
print("pthr", timeit(lambda:labeled_min(a,vals), number=100)*10, 'ms')
Sample run:
nmb 8.41792532010004 ms
pthr 8.104007659712806 ms
pythran comes out a few percent faster but this is only because I moved vals lookup out of the inner loop; without that they are all but equal.
For comparison, the previously best with and without non python helpers on the same problem:
pp 114.04887529788539 ms
pp (py only) 147.0821460010484 ms
Apparently, numpy.minimum.at exists:
import numpy
a = numpy.array([
    [0, 1, 2],
    [2, 3, 0],
    [1, 4, 2],
    [2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])

out = numpy.full(6, numpy.inf)
numpy.minimum.at(out, a.reshape(-1), numpy.repeat(vals, 3))
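On the question's data this fills out with the expected minima, [0.1, 0.1, 0.1, 0.5, 0.3, 0.6]. Worth noting that ufunc .at calls are unbuffered and tend to be much slower than the reduceat-based approaches above on large inputs, so this one-liner wins on readability rather than speed.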

Vectorizing an operation between all pairs of elements in two numpy arrays

Given two arrays where each row represents a circle (x, y, r):
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
I would like to pull out all of the pairs of circles that are not disjointed. This can be done by:
close_data = {}
for row1 in data[1]:  # loop over first array
    for row2 in data[2]:  # loop over second array
        condition = ((abs(row1[0]-row2[0]) + abs(row1[1]-row2[1])) < (row1[2]+row2[2]))
        if condition:  # circles overlap if true
            if tuple(row1) not in close_data:
                close_data[tuple(row1)] = [row1, row2]  # pull out close data points
            else:
                close_data[tuple(row1)].append(row2)

for k, v in close_data.items():
    print(k, v)
#desired outcome
#(455.108, 97.047799999999995, 0.012245333299999999)
#[array([ 4.55108000e+02, 9.70478000e+01, 1.22453333e-02]),
# array([ 4.55103000e+02, 9.70473000e+01, 1.20410000e-02]),
# array([ 4.55106000e+02, 9.70490000e+01, 1.50472381e-02])]
However, the nested loops over the arrays are very inefficient for large datasets. Is it possible to vectorize the calculations so I get the advantage of using numpy?
The most difficult bit is actually getting to your representation of the result. Oh, and I inserted a few squares (i.e. switched to a Euclidean test); if you really don't want Euclidean distances you have to change that back.
import numpy as np
data = {}
data[1] = np.array([[455.108, 97.0478, 0.0122453333],
[403.775, 170.558, 0.0138770952],
[255.383, 363.815, 0.0179857619]])
data[2] = np.array([[455.103, 97.0473, 0.012041],
[210.19, 326.958, 0.0156912857],
[455.106, 97.049, 0.0150472381]])
d1 = data[1][:, None, :]
d2 = data[2][None, :, :]
dists2 = ((d1[..., :2] - d2[..., :2])**2).sum(axis = -1)
radss2 = (d1[..., 2] + d2[..., 2])**2
inds1, inds2 = np.where(dists2 <= radss2)
# translate to your representation:
bnds = np.r_[np.searchsorted(inds1, np.arange(3)), len(inds1)]
rows = [data[2][inds2[bnds[i]:bnds[i+1]]] for i in range(3)]
out = dict([(tuple(data[1][i]), rows[i]) for i in range(3) if rows[i].size > 0])
Here is a pure numpythonic way (a is data[1] and b is data[2]):
In [80]: p = np.arange(3) # for creating the indices of combinations using np.tile and np.repeat
In [81]: a = a[np.repeat(p, 3)] # creates the first column of combination array
In [82]: b = b[np.tile(p, 3)] # creates the second column of combination array
In [83]: abs(a[:, :2] - b[:, :2]).sum(1) < a[:, 2] + b[:, 2]
Out[83]: array([ True, False, True, True, False, True, True, False, True], dtype=bool)
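A variant of the same pairwise test that uses broadcasting instead of building the repeated/tiled copies explicitly; a sketch assuming a is data[1] and b is data[2]:

import numpy as np

a = np.array([[455.108, 97.0478, 0.0122453333],
              [403.775, 170.558, 0.0138770952],
              [255.383, 363.815, 0.0179857619]])
b = np.array([[455.103, 97.0473, 0.012041],
              [210.19, 326.958, 0.0156912857],
              [455.106, 97.049, 0.0150472381]])

# Manhattan distance between centres vs. sum of radii, for all pairs at once.
close = (np.abs(a[:, None, :2] - b[None, :, :2]).sum(-1)
         < a[:, None, 2] + b[None, :, 2])   # (3, 3) boolean matrix
i, j = np.nonzero(close)                    # row/column indices of overlapping pairs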

NumPy indexing with varying position

I have an array input_data of shape (A, B, C) and an array ind of shape (B,). I want to loop along the B axis and, for each position i, take the sum of the two elements at positions ind[i] and ind[i]+1 along the C axis. The desired output has shape (A, B). I have the following code, which works, but I feel it is inefficient due to the index-based looping over the B axis. Is there a more efficient method?
import numpy as np

input_data = np.random.rand(2, 6, 10)
ind = [2, 3, 5, 6, 5, 4]

out = np.zeros((input_data.shape[0], input_data.shape[1]))
for i in range(len(ind)):
    d = input_data[:, i, ind[i]:ind[i]+2]
    out[:, i] = np.sum(d, axis=1)
Edited based on Divakar's answer:
import timeit
import numpy as np

N = 1000
input_data = np.random.rand(10, N, 5000)
ind = (4999 * np.random.rand(N)).astype(int)

def test_1():  # Old loop-based method
    out = np.zeros((input_data.shape[0], input_data.shape[1]))
    for i in range(len(ind)):
        d = input_data[:, i, ind[i]:ind[i]+2]
        out[:, i] = np.sum(d, axis=1)
    return out

def test_2():
    extent = 2  # Comes from 2 in "ind[i]:ind[i]+2"
    m, n, r = input_data.shape
    idx = (np.arange(n)*r + ind)[:, None] + np.arange(extent)
    out1 = input_data.reshape(m, -1)[:, idx].reshape(m, n, -1).sum(2)
    return out1

print(timeit.timeit(stmt=test_1, number=1000))
print(timeit.timeit(stmt=test_2, number=1000))
print(np.all(test_1() == test_2(), keepdims=True))
>> 7.70429363482
>> 0.392034666757
>> [[ True]]
Here's a vectorized approach using linear indexing with some help from broadcasting. We merge the last two axes of the input array, calculate the linear indices corresponding to the last two axes, perform slicing and reshape back to a 3D shape. Finally, we do summation along the last axis to get the desired output. The implementation would look something like this -
extent = 2 # Comes from 2 in "ind[i]:ind[i]+2"
m,n,r = input_data.shape
idx = (np.arange(n)*r + ind)[:,None] + np.arange(extent)
out1 = input_data.reshape(m,-1)[:,idx].reshape(m,n,-1).sum(2)
If the extent is always going to be 2 as stated in the question - "... sum of elements C[B[i]] and C[B[i]+1]", then you could simply do -
m,n,r = input_data.shape
ind_arr = np.array(ind)
axis1_r = np.arange(n)
out2 = input_data[:,axis1_r,ind_arr] + input_data[:,axis1_r,ind_arr+1]
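A quick sanity check of this shortcut against the loop from the question (a fresh small example, seeded only so it is reproducible):

import numpy as np

np.random.seed(0)
input_data = np.random.rand(2, 6, 10)
ind = np.array([2, 3, 5, 6, 5, 4])

m, n, r = input_data.shape
axis1_r = np.arange(n)
out2 = input_data[:, axis1_r, ind] + input_data[:, axis1_r, ind + 1]

out = np.zeros((m, n))
for i in range(n):
    out[:, i] = input_data[:, i, ind[i]:ind[i] + 2].sum(axis=1)
assert np.allclose(out, out2)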
You could also use integer array indexing combined with basic slicing:
import numpy as np

m, n, r = 2, 6, 10
input_data = np.arange(2*6*10).reshape(m, n, r)
ind = np.array([2, 3, 5, 6, 5, 4])

out = np.zeros((input_data.shape[0], input_data.shape[1]))
for i in range(len(ind)):
    d = input_data[:, i, ind[i]:ind[i]+2]
    out[:, i] = np.sum(d, axis=1)

out2 = input_data[:, np.arange(n)[:, None], np.add.outer(ind, range(2))].sum(axis=-1)
print(out2)
# array([[  5,  27,  51,  73,  91, 109],
#        [125, 147, 171, 193, 211, 229]])
assert np.allclose(out, out2)
