Intersection between two multi-dimensional arrays with tolerance - NumPy / Python - python

i am stuck at a problem. I have two 2-D numpy arrays, filled with x and y coordinates. Those arrays might look like:
array1([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]],
array2([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]],
Now I have to find "duplicate nodes". However, i also have to consider nodes as equal within a given tolerance of the coordinates, therefore, i can't use solutions like this . Since my arrays are quite big (~200.000 lines each) two simple for loops are not an option as well. My final output should look like this:
output([[(1.23, 5.63)],
[(2.32, 7.65)]],
I would appreciate some hints.
Cheers,

In order to compare to nodes with a giving tolerance I recommend to use numpy.isclose(), where you can set a relative and absolute tolerance.
numpy.isclose(1.24, 1.25, atol=1e-1)
# [True]
numpy.isclose([1.24, 2.31], [1.25, 2.32], atol=1e-1)
# [True, True]
Instead of using a two for loops, you can make use of itertools.product() package, to go through all pairs. The following code does what you want:
array1 = np.array([[1.22, 5.64],
[2.31, 7.63],
[4.94, 4.15]])
array2 = np.array([[1.23, 5.63],
[6.31, 10.63],
[2.32, 7.64]])
output = np.empty((0,2))
for i0, i1 in itertools.product(np.arange(array1.shape[0]),
np.arange(array2.shape[0])):
if np.all(np.isclose(array1[i0], array2[i1], atol=1e-1)):
output = np.concatenate((output, [array2[i1]]), axis=0)
# output = [[ 1.23 5.63]
# [ 2.32 7.64]]

Defining a isclose function similar to numpy.isclose, but a bit faster (mostly due to not checking any input and not supporting both relative and absolute tolerance):
import numpy as np
array1 = np.array([[(1.22, 5.64)],
[(2.31, 7.63)],
[(4.94, 4.15)]])
array2 = np.array([[(1.23, 5.63)],
[(6.31, 10.63)],
[(2.32, 7.65)]])
def isclose(x, y, atol):
return np.abs(x - y) < atol
Now comes the hard part. We need to calculate if any two values are close within the inner most dimension. For this I reshape the arrays in such a way that the first array has its values along the second dimension, replicated across the first and the second array has its values along the first dimension, replicated along the second (note the 1, 3 and 3, 1):
In [92]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03)
Out[92]:
array([[[ True, True],
[False, False],
[False, False]],
[[False, False],
[False, False],
[False, False]],
[[False, False],
[ True, True],
[False, False]]], dtype=bool)
Now we want all entries where the value is close to any other value (along the same dimension):
In [93]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0)
Out[93]:
array([[ True, True],
[ True, True],
[False, False]], dtype=bool)
Then we want only those where both values of the tuple are close:
In [111]: isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)
Out[111]: array([ True, True, False], dtype=bool)
And finally, we can use this to index array1:
In [112]: array1[isclose(array1.reshape(1,3,2), array2.reshape(3,1,2), 0.03).any(axis=0).all(axis=-1)]
Out[112]:
array([[[ 1.22, 5.64]],
[[ 2.31, 7.63]]])
If you want to, you can swap the any and all calls. One might be faster than the other in your case.
The 3 in the reshape calls needs to be substituted for the actual length of your data.
This algorithm will have the same bad runtime of the other answer using itertools.product, but at least the actual looping is done implicitly by numpy and is implemented in C. This is visible in the timings:
In [122]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
11.6 µs ± 493 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [126]: %timeit pares(array1_pares, array2_pares)
267 µs ± 8.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Where the pares function is the code defined by #Ferran Parés in another answer and the arrays as already reshaped there.
And for larger arrays it becomes more obvious:
array1 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array2 = np.random.normal(0, 0.1, size=(1000, 1, 2))
array1_pares = array1.reshape(1000, 2)
array2_pares = arra2.reshape(1000, 2)
In [149]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
135 µs ± 5.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [157]: %timeit pares(array1_pares, array2_pares)
1min 36s ± 6.85 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In the end this is limited by the available system memory. My machine (16GB RAM) can still handle arrays of length 20000, but that pushes it almost to 100%. It also takes about 12s:
In [14]: array1 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [15]: array2 = np.random.normal(0, 0.1, size=(20000, 1, 2))
In [16]: %timeit array1[isclose(array1.reshape(1,len(array1),2), array2.reshape(len(array2),1,2), 0.03).any(axis=0).all(axis=-1)]
12.3 s ± 514 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

There are many possible ways to define that tolerance. Since, we are talking about XY coordinates, most probably we are talking about euclidean distances to set that tolerance value. So, we can use Cython-powered kd-tree for quick nearest-neighbor lookup, which is very efficient both memory-wise and with performance. The implementation would look something like this -
from scipy.spatial import cKDTree
# Assuming a default tolerance value of 1 here
def intersect_close(a, b, tol=1):
# Get closest distances for each pt in b
dist = cKDTree(a).query(b, k=1)[0] # k=1 selects closest one neighbor
# Check the distances against the given tolerance value and
# thus filter out rows off b for the final output
return b[dist <= tol]
Sample step-by-step run -
# Input 2D arrays
In [68]: a
Out[68]:
array([[1.22, 5.64],
[2.31, 7.63],
[4.94, 4.15]])
In [69]: b
Out[69]:
array([[ 1.23, 5.63],
[ 6.31, 10.63],
[ 2.32, 7.65]])
# Get closest distances for each pt in b
In [70]: dist = cKDTree(a).query(b, k=1)[0]
In [71]: dist
Out[71]: array([0.01414214, 5. , 0.02236068])
# Mask of distances within the given tolerance
In [72]: tol = 1
In [73]: dist <= tol
Out[73]: array([ True, False, True])
# Finally filter out valid ones off b
In [74]: b[dist <= tol]
Out[74]:
array([[1.23, 5.63],
[2.32, 7.65]])
Timings on 200,000 pts -
In [20]: N = 200000
...: np.random.seed(0)
...: a = np.random.rand(N,2)
...: b = np.random.rand(N,2)
In [21]: %timeit intersect_close(a, b)
1 loop, best of 3: 1.37 s per loop

As commented, scaling and rounding your numbers might allow you to use intersect1d or the equivalent.
And if you have just 2 columns, it might work to turn it into a 1d array of complex dtype.
But you might also want to keep in mind what intersect1d does:
if not assume_unique:
# Might be faster than unique( intersect1d( ar1, ar2 ) )?
ar1 = unique(ar1)
ar2 = unique(ar2)
aux = np.concatenate((ar1, ar2))
aux.sort()
return aux[:-1][aux[1:] == aux[:-1]]
unique has been enhanced to handle rows (axis parameters), but intersect has not. In any case it uses argsort to put similar elements next to each other, and then skips the duplicates.
Notice that intersect concatenenates the unique arrays, sorts, and again finds the duplicates.
I know you didn't want a loop version, but to promote conceptualization of the problem here's one anyways:
In [581]: a = np.array([(1.22, 5.64),
...: (2.31, 7.63),
...: (4.94, 4.15)])
...:
...: b = np.array([(1.23, 5.63),
...: (6.31, 10.63),
...: (2.32, 7.65)])
...:
I removed a layer of nesting in your arrays.
In [582]: c = []
In [583]: for a1 in a:
...: for b1 in b:
...: if np.allclose(a1,b1, atol=0.5): c.append((a1,b1))
or as list comprehension
In [586]: [(a1,b1) for a1 in a for b1 in b if np.allclose(a1,b1,atol=0.5)]
Out[586]:
[(array([1.22, 5.64]), array([1.23, 5.63])),
(array([2.31, 7.63]), array([2.32, 7.65]))]
complex approximation
In [604]: aa = (a*10).astype(int)
In [605]: aa
Out[605]:
array([[12, 56],
[23, 76],
[49, 41]])
In [606]: ac=aa[:,0]+1j*aa[:,1]
In [607]: bb = (b*10).astype(int)
In [608]: bc=bb[:,0]+1j*bb[:,1]
In [609]: np.intersect1d(ac,bc)
Out[609]: array([12.+56.j, 23.+76.j])
intersect inspired
Concatenate the arrays, sort them, take difference, and find the small differences:
In [616]: ab = np.concatenate((a,b),axis=0)
In [618]: np.lexsort(ab.T)
Out[618]: array([2, 3, 0, 1, 5, 4], dtype=int32)
In [619]: ab1 = ab[_,:]
In [620]: ab1
Out[620]:
array([[ 4.94, 4.15],
[ 1.23, 5.63],
[ 1.22, 5.64],
[ 2.31, 7.63],
[ 2.32, 7.65],
[ 6.31, 10.63]])
In [621]: ab1[1:]-ab1[:-1]
Out[621]:
array([[-3.71, 1.48],
[-0.01, 0.01],
[ 1.09, 1.99],
[ 0.01, 0.02],
[ 3.99, 2.98]])
In [623]: ((ab1[1:]-ab1[:-1])<.1).all(axis=1) # refine with abs
Out[623]: array([False, True, False, True, False])
In [626]: np.where(Out[623])
Out[626]: (array([1, 3], dtype=int32),)
In [627]: ab[_]
Out[627]:
array([[2.31, 7.63],
[1.23, 5.63]])

May be you could try this using pure NP and self defined function:
import numpy as np
#Your Example
xDA=np.array([[1.22, 5.64],[2.31, 7.63],[4.94, 4.15],[6.1,6.2]])
yDA=np.array([[1.23, 5.63],[6.31, 10.63],[2.32, 7.65],[3.1,9.2]])
###Try this large sample###
#xDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
#yDA=np.round(np.random.uniform(1,2, size=(5000, 2)),2)
print(xDA)
print(yDA)
#Match x to y
def np_matrix(myx,myy,calp=0.2):
Xxx = np.transpose(np.repeat(myx[:, np.newaxis], myy.size, axis=1))
Yyy = np.repeat(myy[:, np.newaxis], myx.size, axis=1)
# define a caliper
matches = {}
dist = np.abs(Xxx - Yyy)
for m in range(0, myx.size):
if (np.min(dist[:, m]) <= calp) or not calp:
matches[m] = np.argmin(dist[:, m])
return matches
alwd_dist=0.1
xc1=xDA[:,1]
yc1=yDA[:,1]
m1=np_matrix(xc1,yc1,alwd_dist)
xc0=xDA[:,0]
yc0=yDA[:,0]
m0=np_matrix(xc0,yc0,alwd_dist)
shared_items = set(m1.items()) & set(m0.items())
if (int(len(shared_items))==0):
print("No Matched Items based on given allowed distance:",alwd_dist)
else:
print("Matched:")
for ke in shared_items:
print(xDA[ke[0]],yDA[ke[1]])

Related

get variance of matrix without zero values numpy

How to can I compute variance without zero elements?
For example
np.var([[1, 1], [1, 2]], axis=1) -> [0, 0.25]
I need:
var([[1, 1, 0], [1, 2, 0]], axis=1) -> [0, 0.25]
Is it what your are looking for? You can filter out columns where all values are 0 (or at least one value is not 0).
m = np.array([[1, 1, 0], [1, 2, 0]])
np.var(m[:, np.any(m != 0, axis=0)], axis=1)
# Output
array([0. , 0.25])
V1
You can use a masked array:
data = np.array([[1, 1, 0], [1, 2, 0]])
np.ma.array(data, mask=(data == 0)).var(axis=1)
The result is
masked_array(data=[0. , 0.25],
mask=False,
fill_value=1e+20)
The raw numpy array is the data attribute of the resulting masked array:
>>> np.ma.array(data, mask=(data == 0)).var(axis=1).data
array([0. , 0.25])
V2
Without masked arrays, the operation of removing a variable number of elements in each row is a bit tricky. It would be simpler to implement the variance in terms of the formula sum(x**2) / N - (sum(x) / N)**2 and partial reduction of ufuncs.
First we need to find the split indices and segment lengths. In the general case, that looks like
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
Now you can operate on the raveled masked data:
mdata = data[data != 0]
mdata2 = mdata**2
var = np.add.reduceat(mdata2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
This gives you the same result for var (probably more efficiently than the masked version by the way):
array([0. , 0.25])
V3
The var function appears to use the more traditional formula (x - x.mean()).mean(). You can implement that using the quantities above with just a bit more work:
means = (np.add.reduceat(mdata, inds) / lens).repeat(lens)
var = np.add.reduceat((mdata - means)**2, inds) / lens
Comparison
Here is a quick benchmark for the two approaches:
def nzvar_v1(data):
return np.ma.array(data, mask=(data == 0)).var(axis=1).data
def nzvar_v2(data):
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
mdata = data[data != 0]
return np.add.reduceat(mdata**2, inds) / lens - (np.add.reduceat(mdata, inds) / lens)**2
def nzvar_v3(data):
lens = np.count_nonzero(data, axis=1)
inds = np.r_[0, lens[:-1].cumsum()]
mdata = data[data != 0]
return np.add.reduceat((mdata - (np.add.reduceat(mdata, inds) / lens).repeat(lens))**2, inds) / lens
np.random.seed(100)
data = np.random.randint(10, size=(1000, 1000))
%timeit nzvar_v1(data)
18.3 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v2(data)
5.89 ms ± 69.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit nzvar_v3(data)
11.8 ms ± 62.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So for a large dataset, the second approach, while requiring a bit more code, appears to be ~3x faster than masked arrays and ~2x faster than using the traditional formulation.

Elementwise comparison between two large vectors, high degree of sparsity

Need a function that performs similarly to numpy.where function, but that doesn't run into memory issues caused by the dense representation of the Boolean array. The function should therefore be able to return an extremely sparse Boolean array.
While the example presented below works fine for small data sets/vectors, it is impossible to use the numpy.where function once my_sample is - for example - of shape (10.000.000, 1) and my_population is of shape (100.000, 1). Having read other threads, numpy.where apparently creates a dense Boolean array of shape (10.000.000, 100.000) when evaluating the expression numpy.where((my_sample == my_population.T)). This dense (10.000.000, 100.000) array cannot fit into memory on my machine/most machines.
The resulting array is extremely sparse. In my case, know it will have at most two 1s per row! Using the specifications from above, the sparsity equals 0.002%. This should definitely fit into memory.
Trying to create something similar to a model/design matrix for a numerical simulation. The resulting matrix will be used for some linear algebra operations.
Minimal working example: Please note that the positions/coordinates in the vectors are of importance.
# import packages
import numpy as np
# my_sample is the vector of observations
my_sample = ['a', 'b', 'c', 'a']
# my_population is the lookup vector
my_population = ['a', 'b', 'c']
# initalise the matrix (dense matrix for this exampe)
my_zero = np.zeros((len(my_sample), len(my_population)))
# reshape to arrays
my_sample = np.array(my_sample).reshape(-1, 1)
my_population = np.array((my_population)).reshape(-1, 1)
# THIS STEP CAUSES THE MEMORY ISSUES
my_indices = np.where((my_sample == my_population.T))
# set the matches to equal one
my_zero[my_indices] = 1
# show matrix
my_zero
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[1., 0., 0.]])
First, let's encode this as integers, not strings. Strings suck.
pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)
It's important that pop_levels and sample_levels be identical, but if they are you're pretty much done - pack these into sparse masks:
sample_mask = sps.csr_matrix((np.ones_like(sample_idx), sample_idx, range(len(sample_idx) + 1)))
And we're done:
>>> sample_mask.A
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
You may need to reorder your factor levels so that they're the same between your sample and population, but as long as you can unify those labels this is very simple to do with just matrix assignment.
A more direct route:
In [125]: my_sample = ['a', 'b', 'c', 'a']
...: my_population = ['a', 'b', 'c']
...:
...:
In [126]: np.array(my_sample)[:,None]==np.array(my_population)
Out[126]:
array([[ True, False, False],
[False, True, False],
[False, False, True],
[ True, False, False]])
This is a boolean dtype. If you want 0/1 integer matrix:
In [128]: (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
Out[128]:
array([[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[1, 0, 0]])
If you have space to make my_zero, you should have space to make this array. If there is still a problem with a large temporary buffer, you could try converting to 'uint8', which takes up less space.
In your version you make two large arrays, my_zero and my_sample == my_population.T. But beware that even if you make it past this step, you may not have space to do anything else with my_zero.
Creating a sparse matrix may save you space, but the sparsity has to be quite high to maintain any sort of speed. Though matrix-multiplication is a relative strong area for scipy.sparse matrices.
time tests
In [134]: %%timeit
...: pop_levels, pop_idx = np.unique(my_population, return_inverse=True)
...: sample_levels, sample_idx = np.unique(my_sample, return_inverse=True)
...: sample_mask = sparse.csr_matrix((np.ones_like(sample_idx), sample_idx, range(len(s
...: ample_idx) + 1)))
247 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [135]: timeit (np.array(my_sample)[:,None]==np.array(my_population)).astype(int)
9.61 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
And make a sparse matrix from the dense:
In [136]: timeit sparse.csr_matrix((np.array(my_sample)[:,None]==np.array(my_population)).as
...: type(int))
332 µs ± 1.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Scaling for large arrays could be quite different.

How to check if an array contains all the elements of another array? If not, output the missing elements

I need to check if an array A contains all elements of another array B. If not, output the missing elements. Both A and B are integers, and B is always from 0 to N with an interval of 1.
import numpy as np
A=np.array([1,2,3,6,7,8,9])
B=np.arange(10)
I know that I can use the following to check if there is any missing elements, but it does not give the index of the missing element.
np.all(elem in A for elem in B)
Is there a good way in python to output the indices of the missing elements?
IIUC you can try the following and assuming that B always is an "index" list:
[i for i in B if i not in A]
The output would be : [0, 4, 5]
Best way to do it with numpy
Numpy actually has a function to perform this : numpy.insetdiff1d
np.setdiff1d(B, A)
# Which returns
array([0, 4, 5])
You can use enumerate to get both index and content of a list. The following code would do what you want
idx = [idx for idx, element in enumerate(B) if element not in A]
I am assuming we want to get the elements exclusive to B, when compared to A.
Approach #1
Given the specific of B is always from 0 to N with an interval of 1, we can use a simple mask-based one -
mask = np.ones(len(B), dtype=bool)
mask[A] = False
out = B[mask]
Approach #2
Another one that edits B and would be more memory-efficient -
B[A] = -1
out = B[B>=0]
Approach #3
A more generic case of integers could be handled differently -
def setdiff_for_ints(B, A):
N = max(B.max(), A.max()) - min(min(A.min(),B.min()),0) + 1
mask = np.zeros(N, dtype=bool)
mask[B] = True
mask[A] = False
out = np.flatnonzero(mask)
return out
Sample run -
In [77]: A
Out[77]: array([ 1, 2, 3, 6, 7, 8, -6])
In [78]: B
Out[78]: array([1, 3, 4, 5, 7, 9])
In [79]: setdiff_for_ints(B, A)
Out[79]: array([4, 5, 9])
# Using np.setdiff1d to verify :
In [80]: np.setdiff1d(B, A)
Out[80]: array([4, 5, 9])
Timings -
In [81]: np.random.seed(0)
...: A = np.unique(np.random.randint(-10000,100000,1000000))
...: B = np.unique(np.random.randint(0,100000,1000000))
# #Hugolmn's soln with np.setdiff1d
In [82]: %timeit np.setdiff1d(B, A)
4.78 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [83]: %timeit setdiff_for_ints(B, A)
599 µs ± 6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

How to obtain only the first True value from each row of a numpy array?

I have a 4x3 boolean numpy array, and I'm trying to return a same-sized array which is all False, except for the location of the first True value on each row of the original. So if I have a starting array of
all_bools = np.array([[False, True, True],[True, True, True],[False, False, True],[False,False,False]])
all_bools
array([[False, True, True], # First true value = index 1
[ True, True, True], # First true value = index 0
[False, False, True], # First true value = index 2
[False, False, False]]) # No True Values
then I'd like to return
[[False, True, False],
[True, False, False],
[False, False, True],
[False, False, False]]
so indices 1, 0 and 2 on the first three rows have been set to True and nothing else. Essentially any True value (beyond the first on each row) from the original way have been set to False.
I've been fiddling around with this with np.where and np.argmax and I haven't yet found a good solution - any help gratefully received. This needs to run many, many times so I'd like to avoid iterating.
You can use cumsum, and find the first bool by comparing the result with 1.
all_bools.cumsum(axis=1).cumsum(axis=1) == 1
array([[False, True, False],
[ True, False, False],
[False, False, True],
[False, False, False]])
This also accounts for the issue #a_guest pointed out. The second cumsum call is needed to avoid matching all False values between the first and second True value.
If performance is important, use argmax and set values:
y = np.zeros_like(all_bools, dtype=bool)
idx = np.arange(len(x)), x.argmax(axis=1)
y[idx] = x[idx]
y
array([[False, True, False],
[ True, False, False],
[False, False, True],
[False, False, False]])
Perfplot Performance Timings
I'll take this opportunity to show off perfplot, with some timings, since it is good to see how our solutions vary with different sized inputs.
import numpy as np
import perfplot
def cs1(x):
return x.cumsum(axis=1).cumsum(axis=1) == 1
def cs2(x):
y = np.zeros_like(x, dtype=bool)
idx = np.arange(len(x)), x.argmax(axis=1)
y[idx] = x[idx]
return y
def a_guest(x):
b = np.zeros_like(x, dtype=bool)
i = np.argmax(x, axis=1)
b[np.arange(i.size), i] = np.logical_or.reduce(x, axis=1)
return b
perfplot.show(
setup=lambda n: np.random.randint(0, 2, size=(n, n)).astype(bool),
kernels=[cs1, cs2, a_guest],
labels=['cs1', 'cs2', 'a_guest'],
n_range=[2**k for k in range(1, 8)],
xlabel='N'
)
The trend carries forward to larger N. cumsum is very expensive, while there is a constant time difference between my second solution, and #a_guest's.
You can use the following approach using np.argmax and a product with np.logical_or.reduce for dealing with rows that are all False:
b = np.zeros_like(a, dtype=bool)
i = np.argmax(a, axis=1)
b[np.arange(i.size), i] = np.logical_or.reduce(a, axis=1)
Timing results
Different versions in increasing performance, i.e. fastest approach comes last:
In [1]: import numpy as np
In [2]: def f(a):
...: return a.cumsum(axis=1).cumsum(axis=1) == 1
...:
...:
In [3]: def g(a):
...: b = np.zeros_like(a, dtype=bool)
...: i = np.argmax(a, axis=1)
...: b[np.arange(i.size), i] = np.logical_or.reduce(a, axis=1)
...: return b
...:
...:
In [4]: x = np.random.randint(0, 2, size=(1000, 1000)).astype(bool)
In [5]: %timeit f(x)
10.4 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]: %timeit g(x)
120 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [7]: def h(a):
...: y = np.zeros_like(x)
...: idx = np.arange(len(x)), x.argmax(axis=1)
...: y[idx] += x[idx]
...: return y
...:
...:
In [8]: %timeit h(x)
92.1 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [9]: def h2(a):
...: y = np.zeros_like(x)
...: idx = np.arange(len(x)), x.argmax(axis=1)
...: y[idx] = x[idx]
...: return y
...:
...:
In [10]: %timeit h2(x)
78.5 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

numpy conditional list membership element wise

I have a 2D numpy array:
a = np.array([[0,1],
[2,3]])
I have a list of values to keep:
vals_keep = [1,2]
I want to test for list membership for each element in the array. Something like:
mask = a in vals_keep
The result I want:
array([[False, True],
[True, False]])
You can use isin
isin is an element-wise function version of the python keyword in
np.isin(a, vals_keep)
array([[False, True],
[ True, False]])
An added benefit of isin is that it's flexible with arrays of different dimensions:
a = np.arange(4).reshape(1,2,2,1)
np.isin(a, vals_keep)
array([[[[False],
[ True]],
[[ True],
[False]]]])
Here is one way using broadcasting:
In [35]: (a[:, :, None] == vals_keep).any(2)
Out[35]:
array([[False, True],
[ True, False]])
Which is faster than isin for small arrays (less than 100 rows):
In [37]: %timeit np.isin(a, vals_keep)
22 µs ± 728 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [38]: %timeit (a[:, :, None] == vals_keep).any(2)
12.6 µs ± 95.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
For large arrays it's better to use isin because broadcasting in 3D is not very efficient for large arrays/matrices.

Categories