The answer for three matrices was given in this question, but I'm not sure how to apply that logic to an arbitrary number of pairwise connected matrices:
f(i, j, k, l, ...) = min(A(i, j), B(i,k), C(i,l), D(j,k), E(j,l), F(k,l), ...)
where A, B, ... are matrices and i, j, ... are indices that range up to the respective dimensions of the matrices. With n indices there are n(n-1)/2 pairs, and thus n(n-1)/2 matrices. I would like to find (i, j, k, ...) such that f(i, j, k, l, ...) is maximized. I am currently doing that as follows:
import numpy as np
import itertools

# i j k l ...
dimensions = [50,50,50,50]
n_dims = len(dimensions)

pairs = list(itertools.combinations(range(n_dims), 2))

# Construct the matrices A(i,j), B(i,k), ...
matrices = []
for pair in pairs:
    matrices.append(np.random.rand(dimensions[pair[0]], dimensions[pair[1]]))

# All the different i,j,k,l... combinations
combinations = itertools.product(*list(map(np.arange, dimensions)))
combinations = np.asarray(list(combinations))

# Find the maximum minimum
vals = []
for i in range(len(pairs)):
    pair = pairs[i]
    matrix = matrices[i]
    vals.append(matrix[combinations[:, pair[0]], combinations[:, pair[1]]])
f = np.min(vals, axis=0)

best_indices = combinations[np.argmax(f)]
print(best_indices, np.max(f))
[5 17 17 18] 0.932985854758534
This is faster than iterating over all (i, j, k, l, ...), but a lot of time is spent constructing the combinations and vals matrices. Is there an alternative way to do this where (1) the speed of numpy's matrix computation can be preserved and (2) I don't have to construct the memory-intensive vals matrices?
Here is a generalisation of the 3D solution. I assume there are other (better?) ways of organising the recursion, but this works well enough. It does a 6D example (product of dims 9x10^6) in under 10 ms.
Sample run below. Note that occasionally the indices returned by the two methods do not match; this is because they are not always unique, and sometimes different index combinations yield the same maximum of minima. Also note that at the very end we do a single run of a huge 6D 9x10^12 example. Brute force is no longer viable on that; the smart method takes about 10 seconds.
trial 1
results identical True
results compatible True
brute force 276.8830654968042 ms
branch cut 9.971900499658659 ms
trial 2
results identical True
results compatible True
brute force 273.444719001418 ms
branch cut 9.236706099909497 ms
trial 3
results identical True
results compatible True
brute force 274.2998780013295 ms
branch cut 7.31226220013923 ms
trial 4
results identical True
results compatible True
brute force 273.0268925006385 ms
branch cut 6.956217200058745 ms
HUGE (100, 150, 200, 100, 150, 200) 9000000000000
branch cut 10246.754082996631 ms
Code:
import numpy as np
import itertools as it
import functools as ft

def bf(dims, pairs):
    # brute force: broadcast all pair matrices onto the full n-dimensional grid,
    # take the element-wise minimum and locate its maximum
    dims, pairs = np.array(dims), np.array(pairs, object)
    n, m = len(dims), len(pairs)
    IDX = np.empty((m, n), object)
    Y, X = np.triu_indices(n, 1)
    IDX[np.arange(m), Y] = slice(None)
    IDX[np.arange(m), X] = slice(None)
    idx = np.unravel_index(
        ft.reduce(np.minimum, (p[(*i,)] for p, i in zip(pairs, IDX))).argmax(), dims)
    return ft.reduce(np.minimum, (
        p[I] for p, I in zip(pairs, it.combinations(idx, 2)))), idx

def cut(dims, pairs, offs=None):
    # branch and cut: fix the first two indices (y, x) in order of decreasing
    # A[y, x], recurse on the remaining dimensions, and stop as soon as
    # A[y, x] can no longer beat the best value found so far
    n = len(dims)
    if n < 3:
        if n == 2:
            A = pairs[0] if offs is None else np.minimum(
                pairs[0], np.minimum.outer(offs[0], offs[1]))
            idx = np.unravel_index(A.argmax(), dims)
            return A[idx], idx
        else:
            idx = offs[0].argmax()
            return offs[0][idx], (idx,)
    gmx = min(map(np.min, pairs))
    gidx = n * (0,)
    A = pairs[0] if offs is None else np.minimum(
        pairs[0], np.minimum.outer(offs[0], offs[1]))
    Y, X = np.unravel_index(A.argsort(axis=None)[::-1], dims[:2])
    for y, x in zip(Y, X):
        if A[y, x] <= gmx:
            return gmx, gidx
        # propagate the choice (y, x) as per-dimension caps for the recursion
        coffs = [np.minimum(p1[y], p2[x])
                 for p1, p2 in zip(pairs[1:n-1], pairs[n-1:])]
        if offs is not None:
            coffs = [*map(np.minimum, coffs, offs[2:])]
        cmx, cidx = cut(dims[2:], pairs[2*n-3:], coffs)
        if cmx >= A[y, x]:
            return A[y, x], (y, x, *cidx)
        if gmx < cmx:
            gmx = min(A[y, x], cmx)
            gidx = y, x, *cidx
    return gmx, gidx

from timeit import timeit

IDX = 10, 15, 20, 10, 15, 20
for rep in range(4):
    print("trial", rep+1)
    pairs = [np.random.rand(i, j) for i, j in it.combinations(IDX, 2)]
    print("results identical", cut(IDX, pairs) == bf(IDX, pairs))
    print("results compatible", cut(IDX, pairs)[0] == bf(IDX, pairs)[0])
    print("brute force", timeit(lambda: bf(IDX, pairs), number=2)*500, "ms")
    print("branch cut", timeit(lambda: cut(IDX, pairs), number=10)*100, "ms")

IDX = 100, 150, 200, 100, 150, 200
pairs = [np.random.rand(i, j) for i, j in it.combinations(IDX, 2)]
print("HUGE", IDX, np.prod(IDX))
print("branch cut", timeit(lambda: cut(IDX, pairs), number=1)*1000, "ms")
Consider the following problem: Given a set of n intervals and a set of m floating-point numbers, determine, for each floating-point number, the subset of intervals that contain the floating-point number.
This problem has been addressed by constructing an interval tree (also called a range tree or segment tree). Implementations exist for the one-dimensional case, e.g. Python's intervaltree package. Usually, these implementations handle only one or a few floating-point numbers at a time, i.e. a small "m" above.
In my problem setting, both n and m are extremely large numbers (they come from an image processing problem). Further, I need to consider N-dimensional intervals (cuboids when N=3; I am modeling human brains with the finite element method). I have implemented a simple N-dimensional interval tree in Python, but it runs in a loop and can only take one floating-point number at a time. Can anyone help improve the implementation in terms of efficiency? You can change the data structure freely.
import sys
import time
import numpy as np

# find the indices of a satisfying x > a in one dimension
def find_index_smaller(a, x):
    idx = np.argsort(a)
    ss = np.searchsorted(a, x, sorter=idx)
    res = idx[0:ss]
    return res

# find the indices of a satisfying x < a in one dimension
def find_index_larger(a, x):
    return find_index_smaller(-a, -x)

# find the indices of a satisfying amin < x < amax in one dimension
def find_intv_at(amin, amax, x):
    idx = find_index_smaller(amin, x)
    idx2 = find_index_larger(amax[idx], x)
    res = idx[idx2]
    return res

# find the indices of a satisfying amin < x < amax in N dimensions
def find_intv_at_nd(amin, amax, x):
    dim = amin.shape[0]
    res = np.arange(amin.shape[-1])
    for i in range(dim):
        idx = find_intv_at(amin[i, res], amax[i, res], x[i])
        res = res[idx]
    return res
I also have two test examples for sanity check and performance testing:
def demo1():
    print ("By default, we do a correctness test")
    n_intv = 2
    n_point = 2
    # generate the test data
    point = np.random.rand(3, n_point)
    intv_min = np.random.rand(3, n_intv)
    intv_max = intv_min + np.random.rand(3, n_intv)*8
    print ("point ")
    print (point)
    print ("intv_min")
    print (intv_min)
    print ("intv_max")
    print (intv_max)
    print ("===Indexes of intervals that contain the point===")
    for i in range(n_point):
        print (find_intv_at_nd(intv_min, intv_max, point[:, i]))

def demo2():
    print ("Performance:")
    n_points = 100
    n_intv = 1000000
    # generate the test data
    points = np.random.rand(n_points, 3)*512
    intv_min = np.random.rand(3, n_intv)*512
    intv_max = intv_min + np.random.rand(3, n_intv)*8
    print ("point.shape = "+str(points.shape))
    print ("intv_min.shape = "+str(intv_min.shape))
    print ("intv_max.shape = "+str(intv_max.shape))
    starttime = time.time()
    for point in points:
        tmp = find_intv_at_nd(intv_min, intv_max, point)
    print("it took this long to run {} points, with {} intervals: {}".format(n_points, n_intv, time.time()-starttime))
My ideas would be:
Remove np.argsort() from the algorithm: the interval tree does not change, so the sorting could be done once in pre-processing.
Vectorize over x. The algorithm currently loops over each x; it would be nice if we could get rid of that loop (a rough sketch of what I mean follows below).
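For illustration, here is a minimal sketch of what I mean by idea 2 (the helper name is made up; it assumes the full dim x m x n_intv broadcast fits in memory, so the points would have to be processed in chunks for very large inputs):

import numpy as np

def find_intv_at_nd_batch(intv_min, intv_max, points):
    # intv_min, intv_max: shape (dim, n_intv); points: shape (dim, m)
    # mask[p, i] is True iff interval i contains point p in every dimension
    mask = np.logical_and(points[:, :, None] > intv_min[:, None, :],
                          points[:, :, None] < intv_max[:, None, :]).all(axis=0)
    return [np.flatnonzero(row) for row in mask]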
Any contribution would be appreciated.
Given an n x n array A of positive real numbers, I'm trying to find the minimum, over all combinations of three rows of the 2-D array, of the maximum of the element-wise minimum of the three rows. Using for-loops, that comes out to something like this:
import numpy as np

n = 100
np.random.seed(2)
A = np.random.rand(n,n)

global_best = np.inf
for i in range(n-2):
    for j in range(i+1, n-1):
        for k in range(j+1, n):
            # find the maximum of the element-wise minimum of the three vectors
            local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
            # if local_best is lower than global_best, update global_best
            if (local_best < global_best):
                global_best = local_best
                save_rows = [i, j, k]

print global_best, save_rows
In the case for n = 100, the output should be this:
Out[]: 0.492652949593 [6, 41, 58]
I have a feeling though that I could do this much faster using Numpy vectorization, and would certainly appreciate any help on doing this. Thanks.
This solution is 5x faster for n=100:
import itertools

coms = np.fromiter(itertools.combinations(np.arange(n), 3), 'i,i,i').view(('i', 3))
best = A[coms].min(1).max(1)
at = best.argmin()
global_best = best[at]
save_rows = coms[at]
The first line is a bit convoluted but turns the result of itertools.combinations into a NumPy array which contains all possible [i,j,k] index combinations.
From there, it's a simple matter of indexing into A using all the possible index combinations, then reducing along the appropriate axes.
This solution consumes a lot more memory as it builds the concrete array of all possible combinations A[coms]. It saves time for smallish n, say under 250, but for large n the memory traffic will be very high and it may be slower than the original code.
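To make the memory cost concrete, here is a rough estimate of what A[coms] holds (illustrative, assuming A is float64):

from math import comb

n = 100
n_combs = comb(n, 3)                   # 161700 row triples for n = 100
mem_bytes = n_combs * 3 * n * 8        # each triple pulls 3 rows of n float64 values
print(n_combs, mem_bytes / 1e6, "MB")  # roughly 388 MB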
Working in chunks lets you combine the speed of vectorized calculations while avoiding memory errors. Below is an example of converting the nested loops to chunked vectorization.
Starting from the same variables as in the question, a chunk length is defined so that calculations are vectorized inside each chunk and the loop runs over chunks rather than over individual combinations.
chunk = 2000  # define chunk length; if too small, the code won't take advantage
              # of vectorization, if it is too large, excessive memory usage will
              # slow down execution, or a MemoryError will be raised

combinations = itertools.combinations(range(n), 3)  # iterator over all possible
                                                    # combinations of 3 rows
N = n*(n-1)*(n-2)//6  # number of combinations (the length of combinations cannot be
                      # retrieved because it is an iterator)

# generate a list containing how many elements of combinations will be retrieved
# per iteration
n_chunks, remainder = divmod(N, chunk)
counts_list = [chunk for _ in range(n_chunks)]
if remainder:
    counts_list.append(remainder)

# Iterate one chunk at a time, using vectorized code to treat the chunk
for counts in counts_list:
    # retrieve combinations in current chunk
    current_comb = np.fromiter(combinations, dtype='i,i,i', count=counts)\
                     .view(('i', 3))
    # maximum of element-wise minimum in current chunk
    chunk_best = np.minimum(np.minimum(A[current_comb[:,0],:], A[current_comb[:,1],:]),
                            A[current_comb[:,2],:]).max(axis=1)
    ravel_save_row = chunk_best.argmin()  # minimum of maximums in current chunk
    # check if current chunk contains global minimum
    if chunk_best[ravel_save_row] < global_best:
        global_best = chunk_best[ravel_save_row]
        save_rows = current_comb[ravel_save_row]

print(global_best, save_rows)
I ran some performance comparisons with the nested loops, obtaining the following results (chunk_length = 1000):
n=100
Nested loops: 1.13 s ± 16.6 ms
Work by chunks: 108 ms ± 565 µs
n=150
Nested loops: 4.16 s ± 39.3 ms
Work by chunks: 523 ms ± 4.75 ms
n=500
Nested loops: 3min 18s ± 3.21 s
Work by chunks: 1min 12s ± 1.6 s
Note
After profiling the code, I found that the np.min call was what took longest (it dispatches to np.minimum.reduce). I replaced it with direct np.minimum calls, which improved performance a bit.
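Concretely, the two forms below compute the same element-wise minimum; the second is the direct np.minimum version used in the chunked code above (a small illustrative check, reusing A from the question):

m1 = np.array([A[0], A[1], A[2]]).min(0)       # .min(0) dispatches to np.minimum.reduce
m2 = np.minimum(np.minimum(A[0], A[1]), A[2])  # direct np.minimum calls
assert np.allclose(m1, m2)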
Don't try to vectorize loops that are not simple to vectorize. Instead, use a JIT compiler like Numba, or use Cython. Vectorized solutions are good if the resulting code is more readable, but in terms of performance a compiled solution is usually faster or, in the worst case, as fast as a vectorized solution (BLAS routines excepted).
Single-threaded example
import numba as nb
import numpy as np

# Min and max library calls may be costly for only 3 values
@nb.njit()
def max_min_3(A, B, C):
    max_of_min = -np.inf
    for i in range(A.shape[0]):
        loc_min = A[i]
        if (B[i] < loc_min):
            loc_min = B[i]
        if (C[i] < loc_min):
            loc_min = C[i]
        if (max_of_min < loc_min):
            max_of_min = loc_min
    return max_of_min

@nb.njit()
def your_func(A):
    n = A.shape[0]
    save_rows = np.zeros(3, dtype=np.uint64)
    global_best = np.inf
    for i in range(n):
        for j in range(i+1, n):
            for k in range(j+1, n):
                # find the maximum of the element-wise minimum of the three vectors
                local_best = max_min_3(A[i,:], A[j,:], A[k,:])
                # if local_best is lower than global_best, update global_best
                if (local_best < global_best):
                    global_best = local_best
                    save_rows[0] = i
                    save_rows[1] = j
                    save_rows[2] = k
    return global_best, save_rows
Performance of single-threaded version
n=100
your_version: 1.56s
compiled_version: 0.0168s (92x speedup)
n=150
your_version: 5.41s
compiled_version: 0.08122s (66x speedup)
n=500
your_version: 283s
compiled_version: 8.86s (31x speedup)
The first call has a constant compilation overhead of about 0.3-1 s. To measure the calculation time itself, call the function once before timing it.
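For instance, a minimal timing sketch along those lines (illustrative only):

import timeit

A = np.random.rand(100, 100)
your_func(A)  # first call triggers compilation (roughly 0.3-1 s overhead)
t = timeit.timeit(lambda: your_func(A), number=3) / 3
print("per-call time: %.4f s" % t)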
With a few code changes this task can also be parallelized.
Multi-threaded example
@nb.njit(parallel=True)
def your_func(A):
    n = A.shape[0]
    all_global_best = np.inf
    rows = np.empty((3), dtype=np.uint64)
    save_rows = np.empty((n,3), dtype=np.uint64)
    global_best_Temp = np.empty((n), dtype=A.dtype)
    global_best_Temp[:] = np.inf
    for i in range(n):
        for j in nb.prange(i+1, n):
            row_1 = 0
            row_2 = 0
            row_3 = 0
            global_best = np.inf
            for k in range(j+1, n):
                # find the maximum of the element-wise minimum of the three vectors
                local_best = max_min_3(A[i,:], A[j,:], A[k,:])
                # if local_best is lower than global_best, update global_best
                if (local_best < global_best):
                    global_best = local_best
                    row_1 = i
                    row_2 = j
                    row_3 = k
            save_rows[j,0] = row_1
            save_rows[j,1] = row_2
            save_rows[j,2] = row_3
            global_best_Temp[j] = global_best
        ind = np.argmin(global_best_Temp)
        if (global_best_Temp[ind] < all_global_best):
            rows[0] = save_rows[ind,0]
            rows[1] = save_rows[ind,1]
            rows[2] = save_rows[ind,2]
            all_global_best = global_best_Temp[ind]
    return all_global_best, rows
Performance of multi-threaded version
n=100
your_version: 1.56s
compiled_version: 0.0078s (200x speedup)
n=150
your_version: 5.41s
compiled_version: 0.0282s (191x speedup)
n=500
your_version: 283s
compiled_version: 2.95s (96x speedup)
Edit
In a newer Numba version (installed through the Anaconda Python distribution) I have to manually install tbb to get working parallelization.
You can use combinations from itertools, which is part of the Python standard library; it lets you remove all those nested loops.
from itertools import combinations
import numpy as np

n = 100
np.random.seed(2)
A = np.random.rand(n,n)

global_best = 1000000000000000.0
for i, j, k in combinations(range(n), 3):
    local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
    if local_best < global_best:
        global_best = local_best
        save_rows = [i, j, k]

print global_best, save_rows
I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I looked through various questions around here, but I couldn't find a good fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store number of rows in TMP as a parameter
N = TMP.shape[0]
# Get the indices that would be used as row indices to select rows off TMP and
# also as row,column indices for setting output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)
# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)
# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or calls can be replaced with dot products, which could be faster. For 0/1 rows, TMP.dot(TMP.T) counts the positions where both rows are 1 (the intersection), while (1-TMP).dot((1-TMP).T) counts the positions where both are 0, so the union size is M minus the latter. With that, we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
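As a quick, illustrative sanity check, either vectorized result should reproduce the loopy RESULT from the question:

print(np.allclose(out, RESULT))  # expected: True (upper triangle matches, the rest is zero)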
I'm trying to get a list of the indices of all the elements in an array, so for a 1000 x 1000 array I end up with [(0,0), (0,1), ..., (999,999)].
I made a function to do this, which is below:
def indices(alist):
    results = []
    ele = alist.size
    counterx = 0
    countery = 0
    x = alist.shape[0]
    y = alist.shape[1]
    while counterx < x:
        while countery < y:
            results.append((counterx, countery))
            countery += 1
        counterx += 1
        countery = 0
    return results
After I timed it, it seemed quite slow, taking about 650 ms to run (granted, on a slow laptop). So, figuring that numpy must have a way to do this faster than my mediocre coding, I took a look at the documentation and tried:
indices = [k for k in numpy.ndindex(q.shape)]
which took about 4.5 SECONDS (wtf?)
indices = [x for x,i in numpy.ndenumerate(q)]
better, but 1.5 seconds!
Is there a faster way to do this?
Thanks
how about np.ndindex?
np.ndindex(1000,1000)
This returns an iterable object:
>>> ix = numpy.ndindex(1000,1000)
>>> next(ix)
(0, 0)
>>> next(ix)
(0, 1)
>>> next(ix)
(0, 2)
In general, if you have an array, you can build the index iterable via:
index_iterable = np.ndindex(*arr.shape)
Of course, there's always np.ndenumerate as well which could be implemented like this:
def ndenumerate(arr):
    for ix in np.ndindex(*arr.shape):
        yield ix, arr[ix]
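A quick usage check of that sketch (illustrative):

arr = np.arange(6).reshape(2, 3)
for ix, val in ndenumerate(arr):
    print(ix, val)   # (0, 0) 0, then (0, 1) 1, and so on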
Have you thought of using itertools? It will generate an iterator for your results, and will almost certainly be optimally fast:
import itertools

a = range(1000)
b = range(1000)

product = itertools.product(a, b)
for x in product:
    print x
# (0, 0)
# (0, 1)
# ...
# (999, 999)
Notice this didn't require the dependency on numpy. Also, notice the fun use of range to create a list from 0 to 999.
Ahha!
Using numpy to build an array of all combinations of two arrays
Runs in 41 ms as opposed to the 330 ms using the itertools.product one!
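For reference, a numpy-only construction of that index array might look like this (a sketch of the kind of approach the linked question describes, not its exact code):

import numpy as np

# all (row, col) pairs for a 1000 x 1000 array, shape (1000000, 2), in C order
idx = np.indices((1000, 1000)).reshape(2, -1).T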
I have two arrays which are lex-sorted.
In [2]: a = np.array([1,1,1,2,2,3,5,6,6])
In [3]: b = np.array([10,20,30,5,10,100,10,30,40])
In [4]: ind = np.lexsort((b, a)) # sorts elements first by a and then by b
In [5]: print a[ind]
[1 1 1 2 2 3 5 6 6]
In [7]: print b[ind]
[ 10 20 30 5 10 100 10 30 40]
I want to do a binary search for (2, 7) and (5, 150), expecting (4, 7) as the answer. Something like:
In [6]: np.lexsearchsorted((a,b), ([2, 5], [7,150]))
We have the searchsorted function, but it works only on 1D arrays.
EDIT: Edited to reflect comment.
def comp_leq(t1, t2):
    if (t1[0] > t2[0]) or ((t1[0] == t2[0]) and (t1[1] > t2[1])):
        return 0
    else:
        return 1

def bin_search(L, item):
    from math import floor
    x = L[:]
    while len(x) > 1:
        index = int(floor(len(x)/2) - 1)
        # Check item
        if comp_leq(x[index], item):
            x = x[index+1:]
        else:
            x = x[:index+1]
    out = L.index(x[0])
    # If greater than all
    if item >= L[-1]:
        return len(L)
    else:
        return out

def lexsearch(a, b, items):
    z = zip(a, b)
    return [bin_search(z, item) for item in items]

if __name__ == '__main__':
    a = [1,1,1,2,2,3,5,6,6]
    b = [10,20,30,5,10,100,10,30,40]
    print lexsearch(a,b,([2,7],[5,150]))  # prints [4,7]
This code seems to do it for a set of (exactly) 2 lexsorted arrays.
You might be able to make it faster if you create a set of values[-1], and then create a dictionary with the boundaries for them.
I haven't checked other cases apart from the posted one, so please verify it's not bugged.
def lexsearchsorted_2(arrays, values, side='left'):
    assert len(arrays) == 2
    assert (np.lexsort(arrays) == range(len(arrays[0]))).all()
    # here it will be faster to work on all equal values in 'values[-1]' at one time
    boundries_l = np.searchsorted(arrays[-1], values[-1], side='left')
    boundries_r = np.searchsorted(arrays[-1], values[-1], side='right')
    # a recursive definition here would make it work for more than 2 lexsorted arrays
    return tuple([boundries_l[i] +
                  np.searchsorted(arrays[-2][boundries_l[i]:boundries_r[i]],
                                  values[-2][i],
                                  side=side)
                  for i in range(len(boundries_l))])
Usage:
import numpy as np
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
lexsearchsorted_2((b, a), ([7,150], [2, 5])) # return (4, 7)
I ran into the same issue and came up with a different solution. You can treat the multi-column data instead as single entries using a structured data type. A structured data type will allow one to use argsort/sort on the data (instead of lexsort, although lexsort appears faster at this stage) and then use the standard searchsorted. Here is an example:
import numpy as np
from itertools import repeat
# Setup our input data
# Every row is an entry, every column what we want to sort by
# Unlike lexsort, this takes columns in decreasing priority, not increasing
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
data = np.transpose([a,b])
# Sort the data
data = data[np.lexsort(data.T[::-1])]
# Convert to a structured data-type
dt = np.dtype(zip(repeat(''), repeat(data.dtype, data.shape[1])))  # the structured dtype
# the dtype change leaves a trailing length-1 dimension; ascontiguousarray is required for the view
data = np.ascontiguousarray(data).view(dt).squeeze(-1)
# You can also first convert to the structured data-type with the two lines above,
# then use data.sort()/data.argsort()/np.sort(data)
# Search the data
values = np.array([(2,7),(5,150)], dtype=dt) # note: when using structured data types the rows must be a tuple
pos = np.searchsorted(data, values)
# pos is (4,7) in this example, exactly what you would want
This works for any number of columns, uses the built-in numpy functions, the columns remain in the "logical" order (decreasing priority), and it should be quite fast.
I compared the two numpy-based methods time-wise.
#1 is the recursive method from #j0ker5 (the one below extends his example with his suggestion of recursion and works with any number of lexsorted rows)
#2 is the structured array from me
They both take the same inputs, basically like searchsorted except a and v are as per lexsort.
import numpy as np

def lexsearch1(a, v, side='left', sorter=None):
    def _recurse(a, v):
        if a.shape[1] == 0: return 0
        if a.shape[0] == 1: return a.squeeze(0).searchsorted(v.squeeze(0), side)
        bl = np.searchsorted(a[-1,:], v[-1], side='left')
        br = np.searchsorted(a[-1,:], v[-1], side='right')
        return bl + _recurse(a[:-1,bl:br], v[:-1])

    a, v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    if sorter is not None: a = a[:,sorter]
    bl = np.searchsorted(a[-1,:], v[-1,:], side='left')
    br = np.searchsorted(a[-1,:], v[-1,:], side='right')
    for i in xrange(len(bl)): bl[i] += _recurse(a[:-1,bl[i]:br[i]], v[:-1,i])
    return bl

def lexsearch2(a, v, side='left', sorter=None):
    from itertools import repeat
    a, v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    a_dt = np.dtype(zip(repeat(''), repeat(a.dtype, a.shape[0])))
    v_dt = np.dtype(zip(a_dt.names, repeat(v.dtype, a.shape[0])))
    a = np.asfortranarray(a[::-1,:]).view(a_dt).squeeze(0)
    v = np.asfortranarray(v[::-1,:]).view(v_dt).squeeze(0)
    return a.searchsorted(v, side, sorter).ravel()

a = np.random.randint(100, size=(2,10000))  # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000))  # Values to search for, rows in increasing priority
sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]
a = np.random.randint(100, size=(2,10000)) # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000)) # Values to search for, rows in increasing priority
sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]
And the timing results (in iPython):
# 2 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 33.4 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14 ms per loop
# 10 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 103 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14.7 ms per loop
Overall the structured array approach is faster, and it can be made even faster if you design it to work with the flipped and transposed versions of a and v. It gets relatively faster as the number of rows/keys goes up, barely slowing down when going from 2 rows to 10 rows.
I did not notice any significant timing difference between using a_sorted or a with sorter=sorted_idx, so I left those out for clarity.
I believe that a really fast method could be made using Cython, but this is as fast as it is going to get with pure Python and numpy.