Let's say I have a 2-dimensional matrix as a numpy array. If I want to delete rows with specific indices in this matrix, I use numpy.delete(). Here is an example of what I mean:
In [1]: my_matrix = numpy.array([
...: [10, 20, 30, 40, 50],
...: [15, 25, 35, 45, 55],
...: [95, 96, 97, 98, 99]
...: ])
In [2]: numpy.delete(my_matrix, [0, 2], axis=0)
Out[2]: array([[15, 25, 35, 45, 55]])
I'm looking for a way to do the above with matrices from the scipy.sparse package. I know it's possible to do this by converting the entire matrix into a numpy array but I don't want to do that. Is there any other way of doing that?
Thanks a lot!
For CSR, this is probably the most efficient way to do it in-place:
import numpy as np
import scipy.sparse

def delete_row_csr(mat, i):
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    n = mat.indptr[i+1] - mat.indptr[i]   # number of stored entries in row i
    if n > 0:
        # move the data and column indices of the following rows up to fill the gap, then truncate
        mat.data[mat.indptr[i]:-n] = mat.data[mat.indptr[i+1]:]
        mat.data = mat.data[:-n]
        mat.indices[mat.indptr[i]:-n] = mat.indices[mat.indptr[i+1]:]
        mat.indices = mat.indices[:-n]
    # drop row i's entry from indptr and adjust the following row offsets
    mat.indptr[i:-1] = mat.indptr[i+1:]
    mat.indptr[i:] -= n
    mat.indptr = mat.indptr[:-1]
    mat._shape = (mat._shape[0]-1, mat._shape[1])
In LIL format it's even simpler:
def delete_row_lil(mat, i):
    if not isinstance(mat, scipy.sparse.lil_matrix):
        raise ValueError("works only for LIL format -- use .tolil() first")
    mat.rows = np.delete(mat.rows, i)
    mat.data = np.delete(mat.data, i)
    mat._shape = (mat._shape[0] - 1, mat._shape[1])
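For example, a quick usage check of both helpers on the matrix from the question (my addition; uses the imports above):
m = scipy.sparse.csr_matrix(np.array([[10, 20, 30, 40, 50],
                                      [15, 25, 35, 45, 55],
                                      [95, 96, 97, 98, 99]]))
delete_row_csr(m, 0)       # in place; rows 1 and 2 of the original remain
l = m.tolil()
delete_row_lil(l, 1)       # in place; only [15, 25, 35, 45, 55] remains
print(m.shape, l.shape)    # (2, 5) (1, 5)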
pv.'s answer is a good and solid in-place solution that takes
a = scipy.sparse.csr_matrix((100,100), dtype=numpy.int8)
%timeit delete_row_csr(a.copy(), 0)
10000 loops, best of 3: 80.3 us per loop
for any array size. Since boolean indexing works for sparse matrices, at least in scipy >= 0.14.0, I would suggest using it whenever multiple rows are to be removed:
def delete_rows_csr(mat, indices):
    """
    Remove the rows denoted by ``indices`` from the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = numpy.ones(mat.shape[0], dtype=bool)
    mask[indices] = False
    return mat[mask]
This solution takes significantly longer for a single row removal
%timeit delete_rows_csr(a.copy(), [50])
1000 loops, best of 3: 509 us per loop
But it is more efficient for the removal of multiple rows, as the execution time barely increases with the number of rows:
%timeit delete_rows_csr(a.copy(), numpy.random.randint(0, 100, 30))
1000 loops, best of 3: 523 us per loop
In addition to loli's version of pv.'s answer, I expanded their function to allow for row and/or column deletion by index on CSR matrices.
import numpy as np
from scipy.sparse import csr_matrix
def delete_from_csr(mat, row_indices=[], col_indices=[]):
    """
    Remove the rows (denoted by ``row_indices``) and columns (denoted by ``col_indices``)
    from the CSR sparse matrix ``mat``.
    WARNING: Indices of altered axes are reset in the returned matrix
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")

    rows = []
    cols = []
    if row_indices:
        rows = list(row_indices)
    if col_indices:
        cols = list(col_indices)

    if len(rows) > 0 and len(cols) > 0:
        row_mask = np.ones(mat.shape[0], dtype=bool)
        row_mask[rows] = False
        col_mask = np.ones(mat.shape[1], dtype=bool)
        col_mask[cols] = False
        return mat[row_mask][:,col_mask]
    elif len(rows) > 0:
        mask = np.ones(mat.shape[0], dtype=bool)
        mask[rows] = False
        return mat[mask]
    elif len(cols) > 0:
        mask = np.ones(mat.shape[1], dtype=bool)
        mask[cols] = False
        return mat[:,mask]
    else:
        return mat
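For example, on the matrix from the original question (a minimal sketch using the imports above):
m = csr_matrix(np.array([[10, 20, 30, 40, 50],
                         [15, 25, 35, 45, 55],
                         [95, 96, 97, 98, 99]]))
# drop rows 0 and 2 and columns 1 and 3
print(delete_from_csr(m, row_indices=[0, 2], col_indices=[1, 3]).todense())
# [[15 35 55]]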
You can delete an interior row 0 < i < X.shape[0] - 1 from a CSR matrix X with
scipy.sparse.vstack([X[:i, :], X[i+1:, :]])
You can delete the first or the last row with X[1:, :] or X[:-1, :], respectively. Deleting multiple rows in one go will probably require rolling your own function.
For formats other than CSR, this might not necessarily work, as not all formats support row slicing.
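As a rough sketch of what such a helper could look like (my addition, not from the original answer; assumes X is CSR, the indices are valid, and not every row is removed):
import scipy.sparse

def delete_rows_vstack(X, indices):
    # keep the blocks of rows between the deleted indices and stack them again
    indices = sorted(set(indices))
    blocks, start = [], 0
    for i in indices:
        if i > start:
            blocks.append(X[start:i, :])
        start = i + 1
    if start < X.shape[0]:
        blocks.append(X[start:, :])
    return scipy.sparse.vstack(blocks).tocsr()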
To remove the i'th row from A simply use left matrix multiplication:
B = J*A
where J is a sparse identity matrix with i'th row removed.
Left multiplication by the transpose of J will insert a zero-vector back to the i'th row of B, which makes this solution a bit more general.
A0 = J.T * B
To construct J itself, I used pv.'s solution on a sparse diagonal matrix as follows (maybe there's a simpler solution for this special case?)
def identity_minus_rows(N, rows):
    if np.isscalar(rows):
        rows = [rows]
    J = sps.diags(np.ones(N), 0).tocsr()  # make a diag matrix
    # delete from the bottom up so earlier deletions don't shift the remaining row indices;
    # delete_row_csr works in place and returns None
    for r in sorted(rows, reverse=True):
        delete_row_csr(J, r)
    return J
You may also remove columns by right-multiplying by J.T of the appropriate size.
Finally, multiplication is efficient in this case because J is so sparse.
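For example (a small sketch; sps is scipy.sparse, np is numpy, and delete_row_csr is pv.'s in-place helper from above):
import numpy as np
import scipy.sparse as sps

A = sps.csr_matrix(np.array([[10, 20, 30],
                             [40, 50, 60],
                             [70, 80, 90]]))
J = identity_minus_rows(A.shape[0], 1)  # the 3x3 identity with row 1 removed, shape (2, 3)
B = J * A                               # A without row 1, shape (2, 3)
A0 = J.T * B                            # row 1 re-inserted as zeros, shape (3, 3)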
Note that sparse matrices support fancy indexing to some degree. So what you can do is this:
mask = np.ones(mat.shape[0], dtype=bool)
mask[rows_to_delete] = False
# unfortunately boolean indexing may not work directly on the sparse matrix, so convert to integer indices
w = np.flatnonzero(mask)
result = mat[w, :]
np.delete doesn't really do anything else internally either.
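Wrapped up as a small helper (a sketch of that pattern, assuming a format such as CSR that supports integer row indexing):
import numpy as np

def delete_rows_by_index(mat, rows_to_delete):
    # keep only the rows whose index is not marked for deletion
    mask = np.ones(mat.shape[0], dtype=bool)
    mask[rows_to_delete] = False
    keep = np.flatnonzero(mask)
    return mat[keep, :]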
Using loli's implementation, here I leave a function to remove columns:
def delete_cols_csr(mat, indices):
    """
    Remove the columns denoted by ``indices`` from the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = np.ones(mat.shape[1], dtype=bool)
    mask[indices] = False
    return mat[:,mask]
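A quick check on the example matrix from the question (my addition; np and csr_matrix as imported earlier):
m = csr_matrix(np.array([[10, 20, 30, 40, 50],
                         [15, 25, 35, 45, 55],
                         [95, 96, 97, 98, 99]]))
print(delete_cols_csr(m, [0, 2]).todense())
# [[20 40 50]
#  [25 45 55]
#  [96 98 99]]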
Related
Given a 2D matrix, e.g. np.array([[1,3,1],[2,0,5]]), I need to calculate for each entry the max of its row excluding the entry's own column, with the expected result here being np.array([[3,1,3],[5,5,2]]). What would be the most efficient way to do this?
Currently I implement it with a loop that excludes each column index in turn:
n_rows, n_cols = x.shape
row_max_mat = np.zeros((n_rows, n_cols))
rng = np.arange(n_cols)
for i in rng:
    row_max_mat[:,i] = np.amax(x[:,rng!=i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing (here a is the input matrix and cols = a.shape[1]):
mask = ~np.eye(cols, dtype=bool)
a[:,np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
       [5, 5, 2]])
You could do this using np.maximum.accumulate. Compute the forward and backward accumulations of maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd and 4th dimension could also work using a mask, but that would require (number of columns)² times the matrix's size to process and would likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining row wise and column wise results).
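For instance, the column-wise variant is the mirror image along axis 0 (my sketch, reusing m from above; rc[i, j] ends up being the max of column j excluding row i):
fmax_c = np.maximum.accumulate(m, axis=0)
bmax_c = np.maximum.accumulate(m[::-1, :], axis=0)[::-1, :]
rc = np.full(m.shape, np.min(m))
rc[:-1, :] = np.maximum(rc[:-1, :], bmax_c[1:, :])
rc[1:, :] = np.maximum(rc[1:, :], fmax_c[:-1, :])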
# Works for non-negative entries: a // row_max is 1 exactly at each row maximum and 0 elsewhere
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)
b = (((a // row_max)+1)%2)                            # 1 where the entry is not the row max
c = b*row_max                                         # row max everywhere except at the max position
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))  # second-largest value at the max position
c+d # result
Since we are looking for the max excluding each entry's own column, the output would basically have each row filled with that row's max, except at the position of the max element itself, which needs to be filled with the second largest value. As such, argpartition seems to fit right in there. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m,2,axis=1)
    R = np.arange(len(sidx))
    s0,s1 = sidx[:,0], sidx[:,1]
    mask = m[R,s0]>m[R,s1]
    L1c,L2c = np.where(mask,s0,s1), np.where(mask,s1,s0)
    out[R,L1c] = m[R,L2c]
    return out
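A quick check against the example from the question (my addition):
m = np.array([[1, 3, 1],
              [2, 0, 5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]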
Benchmarking
Other working solution(s) for large arrays -
# Alain T.'s solution
def max_accum(m):
    fmax = np.maximum.accumulate(m,axis=1)
    bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
    r = np.full(m.shape,np.min(m))
    r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
    r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0
Is there a way to get rid of the loop in the code below and replace it with vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
                   [0.35, 0.5],
                   [0.45, 0.6],
                   [0.52, 0.65],
                   [0.6, 0.8],
                   [0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i,ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
Approach #3 : If you are not iterating enough (just like you have a loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data. Then, use the two range limits for each row to decide on the start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and then get the minimum values. The benefit of slicing that way is that we would be working with views, which is very efficient in both memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by Daniel F for the case when the ranges are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
I'm using NumPy to store data into matrices.
I'm struggling to make the below Python code perform better.
RESULT is the data store I want to put the data into.
import numpy as np

TMP = np.array([[1,1,0],[0,0,1],[1,0,0],[0,1,1]])
n_row, n_col = TMP.shape[0], TMP.shape[0]
RESULT = np.zeros((n_row, n_col))

def do_something(array1, array2):
    intersect_num = np.bitwise_and(array1, array2).sum()
    union_num = np.bitwise_or(array1, array2).sum()
    try:
        return intersect_num / float(union_num)
    except ZeroDivisionError:
        return 0

for i in range(n_row):
    for j in range(n_col):
        if i >= j:
            continue
        RESULT[i, j] = do_something(TMP[i], TMP[j])
I guess it would be much faster if I could use some NumPy built-in function instead of for-loops.
I was looking for the various questions around here, but I couldn't find the best fit for my problem.
Any suggestion? Thanks in advance!
Approach #1
You could do something like this as a vectorized solution -
# Store number of rows in TMP as a parameter
N = TMP.shape[0]
# Get the indices that would be used as row indices to select rows off TMP and
# also as row,column indices for setting output array. These basically correspond
# to the iterators involved in the loopy implementation
R,C = np.triu_indices(N,1)
# Calculate intersect_num, union_num and division results across all iterations
I = np.bitwise_and(TMP[R],TMP[C]).sum(-1)
U = np.bitwise_or(TMP[R],TMP[C]).sum(-1)
vals = np.true_divide(I,U)
# Setup output array and assign vals into it
out = np.zeros((N, N))
out[R,C] = vals
Approach #2
For cases with TMP holding 1s and 0s, those np.bitwise_and and np.bitwise_or would be replaceable with dot-products and as such could be faster alternatives. So, with those we would have an implementation like so -
M = TMP.shape[1]
I = TMP.dot(TMP.T)
TMP_inv = 1-TMP
U = M - TMP_inv.dot(TMP_inv.T)
out = np.triu(np.true_divide(I,U),1)
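As a quick sanity check (my addition), either vectorized out can be compared against RESULT from the loop in the question:
print(np.allclose(out, RESULT))  # True for the sample TMP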
I have a sparse csr_matrix, and I want to change the values of a single row to different values. I can't find an easy and efficient implementation however. This is what it has to do:
A = csr_matrix([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
new_row = np.array([-1, -1, -1])
print(set_row_csr(A, 2, new_row).todense())
>>> [[ 0,  1,  0],
     [ 1,  0,  1],
     [-1, -1, -1]]
This is my current implementation of set_row_csr:
def set_row_csr(A, row_idx, new_row):
    A[row_idx, :] = new_row
    return A
But this gives me a SparseEfficiencyWarning. Is there a way of getting this done without manual index juggling, or is this my only way out?
physicalattraction's answer is indeed significantly quicker than my solution, which was to just add a separate matrix with that single row set (though the addition solution was still faster than the slicing solution).
The takeaway for me is that the fastest way to set rows in a csr_matrix or columns in a csc_matrix is to modify the underlying data yourself.
import time
import numpy as np
from scipy.sparse import csr_matrix

def time_copy(A, num_tries=10000):
    start = time.time()
    for i in range(num_tries):
        B = A.copy()
    end = time.time()
    return end - start

def test_method(func, A, row_idx, new_row, num_tries=10000):
    start = time.time()
    for i in range(num_tries):
        func(A.copy(), row_idx, new_row)
    end = time.time()
    copy_time = time_copy(A, num_tries)
    print("Duration {}".format((end - start) - copy_time))

def set_row_csr_slice(A, row_idx, new_row):
    A[row_idx,:] = new_row

def set_row_csr_addition(A, row_idx, new_row):
    # build a matrix whose only stored row is new_row and add it;
    # indptr needs one entry per row plus one
    indptr = np.zeros(A.shape[0] + 1)
    indptr[row_idx + 1:] = A.shape[1]
    indices = np.arange(A.shape[1])
    A += csr_matrix((new_row, indices, indptr), shape=A.shape)
>>> A = csr_matrix((np.ones(1000), (np.random.randint(0,1000,1000), np.random.randint(0, 1000, 1000))))
>>> test_method(set_row_csr_slice, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 4.938395977020264
>>> test_method(set_row_csr_addition, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 2.4161765575408936
>>> test_method(set_row_csr, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 0.8432261943817139
The slice solution also scales much worse with the size and sparsity of the matrix.
# Larger matrix, same fraction sparsity
>>> A = csr_matrix((np.ones(10000), (np.random.randint(0,10000,10000), np.random.randint(0, 10000, 10000))))
>>> test_method(set_row_csr_slice, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 18.335174798965454
>>> test_method(set_row_csr, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 1.1089558601379395
# Super sparse matrix
>>> A = csr_matrix((np.ones(100), (np.random.randint(0,10000,100), np.random.randint(0, 10000, 100))))
>>> test_method(set_row_csr_slice, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 13.371600151062012
>>> test_method(set_row_csr, A, 200, np.ones(A.shape[1]), num_tries = 10000)
Duration 1.0454308986663818
In the end, I managed to get this done with index juggling.
def set_row_csr(A, row_idx, new_row):
    '''
    Replace a row in a CSR sparse matrix A.

    Parameters
    ----------
    A: csr_matrix
        Matrix to change
    row_idx: int
        index of the row to be changed
    new_row: np.array
        list of new values for the row of A

    Returns
    -------
    None (the matrix A is changed in place)

    Prerequisites
    -------------
    The row index shall be smaller than the number of rows in A
    The number of elements in new row must be equal to the number of columns in matrix A
    '''
    assert sparse.isspmatrix_csr(A), 'A shall be a csr_matrix'
    assert row_idx < A.shape[0], \
        'The row index ({0}) shall be smaller than the number of rows in A ({1})' \
        .format(row_idx, A.shape[0])
    try:
        N_elements_new_row = len(new_row)
    except TypeError:
        msg = 'Argument new_row shall be a list or numpy array, is now a {0}' \
            .format(type(new_row))
        raise AssertionError(msg)
    N_cols = A.shape[1]
    assert N_cols == N_elements_new_row, \
        'The number of elements in new row ({0}) must be equal to ' \
        'the number of columns in matrix A ({1})' \
        .format(N_elements_new_row, N_cols)

    idx_start_row = A.indptr[row_idx]
    idx_end_row = A.indptr[row_idx + 1]
    additional_nnz = N_cols - (idx_end_row - idx_start_row)

    A.data = np.r_[A.data[:idx_start_row], new_row, A.data[idx_end_row:]]
    A.indices = np.r_[A.indices[:idx_start_row], np.arange(N_cols), A.indices[idx_end_row:]]
    A.indptr = np.r_[A.indptr[:row_idx + 1], A.indptr[(row_idx + 1):] + additional_nnz]
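Used on the example from the question, it modifies A in place (a minimal sketch; assumes from scipy import sparse plus the np/csr_matrix imports from above):
A = sparse.csr_matrix([[0, 1, 0],
                       [1, 0, 1],
                       [0, 1, 0]])
set_row_csr(A, 2, np.array([-1, -1, -1]))
print(A.todense())
# [[ 0  1  0]
#  [ 1  0  1]
#  [-1 -1 -1]]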
This is my approach:
A = A.tolil()
A[index, :] = new_row
A = A.tocsr()
Just convert to lil_matrix, change the row and convert back.
Something is wrong with this set_row_csr: yes, it is fast and it seemed to work for some test cases, but it seems to garble the internal CSR structure of the sparse matrix in my test cases. Try lil_matrix(A) afterwards and you will see error messages.
In physicalattraction's answer, len(new_row) must be equal to A.shape[1], which may not be desirable when adding sparse rows.
So, based on his answer, I came up with a method to set rows in CSR while keeping the sparsity property. Additionally, I added a method to convert dense arrays to sparse arrays (in data, indices format).
def to_sparse(dense_arr):
    sparse = [(data, index) for index, data in enumerate(dense_arr) if data != 0]
    # Convert list of tuples to lists
    sparse = list(map(list, zip(*sparse)))
    # Return data and indices
    return sparse[0], sparse[1]

def set_row_csr_unbounded(A, row_idx, new_row_data, new_row_indices):
    '''
    Replace a row in a CSR sparse matrix A.

    Parameters
    ----------
    A: csr_matrix
        Matrix to change
    row_idx: int
        index of the row to be changed
    new_row_data: np.array
        list of new values for the row of A
    new_row_indices: np.array
        list of indices for new row

    Returns
    -------
    None (the matrix A is changed in place)

    Prerequisites
    -------------
    The row index shall be smaller than the number of rows in A
    Row data and row indices must have the same size
    '''
    assert isspmatrix_csr(A), 'A shall be a csr_matrix'
    assert row_idx < A.shape[0], \
        'The row index ({0}) shall be smaller than the number of rows in A ({1})' \
        .format(row_idx, A.shape[0])
    try:
        N_elements_new_row = len(new_row_data)
    except TypeError:
        msg = 'Argument new_row_data shall be a list or numpy array, is now a {0}' \
            .format(type(new_row_data))
        raise AssertionError(msg)
    try:
        assert N_elements_new_row == len(new_row_indices), \
            'new_row_data and new_row_indices must have the same size'
    except TypeError:
        msg = 'Argument new_row_indices shall be a list or numpy array, is now a {0}' \
            .format(type(new_row_indices))
        raise AssertionError(msg)

    idx_start_row = A.indptr[row_idx]
    idx_end_row = A.indptr[row_idx + 1]
    additional_nnz = N_elements_new_row - (idx_end_row - idx_start_row)

    A.data = np.r_[A.data[:idx_start_row], new_row_data, A.data[idx_end_row:]]
    A.indices = np.r_[A.indices[:idx_start_row], new_row_indices, A.indices[idx_end_row:]]
    A.indptr = np.r_[A.indptr[:row_idx + 1], A.indptr[(row_idx + 1):] + additional_nnz]
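A small usage sketch combining the two helpers (my addition; assumes np, csr_matrix and isspmatrix_csr are imported):
A = csr_matrix([[0, 1, 0],
                [1, 0, 1],
                [0, 0, 0]])
row_data, row_indices = to_sparse([0, -1, -1])  # keep only the nonzero entries
set_row_csr_unbounded(A, 2, row_data, row_indices)
print(A.todense())
# [[ 0  1  0]
#  [ 1  0  1]
#  [ 0 -1 -1]]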
I have two arrays which are lex-sorted.
In [2]: a = np.array([1,1,1,2,2,3,5,6,6])
In [3]: b = np.array([10,20,30,5,10,100,10,30,40])
In [4]: ind = np.lexsort((b, a)) # sorts elements first by a and then by b
In [5]: print a[ind]
[1 1 1 2 2 3 5 6 6]
In [7]: print b[ind]
[ 10 20 30 5 10 100 10 30 40]
I want to do a binary search for (2, 7) and (5, 150) expecting (4, 7) as the answer.
In [6]: np.lexsearchsorted((a,b), ([2, 5], [7,150]))
We have searchsorted function but that works only on 1D arrays.
EDIT: Edited to reflect comment.
def comp_leq(t1,t2):
    if (t1[0] > t2[0]) or ((t1[0] == t2[0]) and (t1[1] > t2[1])):
        return 0
    else:
        return 1

def bin_search(L,item):
    from math import floor
    x = L[:]
    while len(x) > 1:
        index = int(floor(len(x)/2) - 1)
        # Check item
        if comp_leq(x[index], item):
            x = x[index+1:]
        else:
            x = x[:index+1]
    out = L.index(x[0])
    # If greater than all
    if item >= L[-1]:
        return len(L)
    else:
        return out

def lexsearch(a,b,items):
    z = zip(a,b)
    return [bin_search(z,item) for item in items]

if __name__ == '__main__':
    a = [1,1,1,2,2,3,5,6,6]
    b = [10,20,30,5,10,100,10,30,40]
    print lexsearch(a,b,([2,7],[5,150])) #prints [4,7]
This code seems to do it for a set of (exactly) 2 lexsorted arrays
You might be able to make it faster if you create a set of values[-1], and then create a dictionary with the boundaries for them.
I haven't checked other cases apart from the posted one, so please verify it's not bugged.
def lexsearchsorted_2(arrays, values, side='left'):
    assert len(arrays) == 2
    assert (np.lexsort(arrays) == range(len(arrays[0]))).all()
    # here it will be faster to work on all equal values in 'values[-1]' in one go
    boundries_l = np.searchsorted(arrays[-1], values[-1], side='left')
    boundries_r = np.searchsorted(arrays[-1], values[-1], side='right')
    # a recursive definition here will make it work for more than 2 lexsorted arrays
    return tuple([boundries_l[i] +
                  np.searchsorted(arrays[-2][boundries_l[i]:boundries_r[i]],
                                  values[-2][i],
                                  side=side)
                  for i in range(len(boundries_l))])
Usage:
import numpy as np
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
lexsearchsorted_2((b, a), ([7,150], [2, 5])) # return (4, 7)
I ran into the same issue and came up with a different solution. You can treat the multi-column data instead as single entries using a structured data type. A structured data type will allow one to use argsort/sort on the data (instead of lexsort, although lexsort appears faster at this stage) and then use the standard searchsorted. Here is an example:
import numpy as np
from itertools import repeat
# Setup our input data
# Every row is an entry, every column what we want to sort by
# Unlike lexsort, this takes columns in decreasing priority, not increasing
a = np.array([1,1,1,2,2,3,5,6,6])
b = np.array([10,20,30,5,10,100,10,30,40])
data = np.transpose([a,b])
# Sort the data
data = data[np.lexsort(data.T[::-1])]
# Convert to a structured data-type
dt = np.dtype(zip(repeat(''), repeat(data.dtype, data.shape[1]))) # the structured dtype
data = np.ascontiguousarray(data).view(dt).squeeze(-1) # the dtype change leaves a trailing 1 dimension; ascontiguousarray is required for the dtype change
# You can also first convert to the structured data-type with the two lines above then use data.sort()/data.argsort()/np.sort(data)
# Search the data
values = np.array([(2,7),(5,150)], dtype=dt) # note: when using structured data types the rows must be a tuple
pos = np.searchsorted(data, values)
# pos is (4,7) in this example, exactly what you would want
This works for any number of columns, uses the built-in numpy functions, the columns remain in the "logical" order (decreasing priority), and it should be quite fast.
I compared the two numpy-based methods time-wise.
#1 is the recursive method from j0ker5's answer (the one below extends his example with his suggestion of recursion and works with any number of lexsorted rows)
#2 is the structured array from me
They both take the same inputs, basically like searchsorted except a and v are as per lexsort.
import numpy as np
def lexsearch1(a, v, side='left', sorter=None):
    def _recurse(a, v):
        if a.shape[1] == 0: return 0
        if a.shape[0] == 1: return a.squeeze(0).searchsorted(v.squeeze(0), side)
        bl = np.searchsorted(a[-1,:], v[-1], side='left')
        br = np.searchsorted(a[-1,:], v[-1], side='right')
        return bl + _recurse(a[:-1,bl:br], v[:-1])

    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    if sorter is not None: a = a[:,sorter]
    bl = np.searchsorted(a[-1,:], v[-1,:], side='left')
    br = np.searchsorted(a[-1,:], v[-1,:], side='right')
    for i in xrange(len(bl)): bl[i] += _recurse(a[:-1,bl[i]:br[i]], v[:-1,i])
    return bl

def lexsearch2(a, v, side='left', sorter=None):
    from itertools import repeat
    a,v = np.asarray(a), np.asarray(v)
    if v.ndim == 1: v = v[:,np.newaxis]
    assert a.ndim == 2 and v.ndim == 2 and a.shape[0] == v.shape[0] and a.shape[0] > 1
    a_dt = np.dtype(zip(repeat(''), repeat(a.dtype, a.shape[0])))
    v_dt = np.dtype(zip(a_dt.names, repeat(v.dtype, a.shape[0])))
    a = np.asfortranarray(a[::-1,:]).view(a_dt).squeeze(0)
    v = np.asfortranarray(v[::-1,:]).view(v_dt).squeeze(0)
    return a.searchsorted(v, side, sorter).ravel()
a = np.random.randint(100, size=(2,10000)) # Values to sort, rows in increasing priority
v = np.random.randint(100, size=(2,10000)) # Values to search for, rows in increasing priority
sorted_idx = np.lexsort(a)
a_sorted = a[:,sorted_idx]
And the timing results (in iPython):
# 2 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 33.4 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14 ms per loop
# 10 rows
%timeit lexsearch1(a_sorted, v)
10 loops, best of 3: 103 ms per loop
%timeit lexsearch2(a_sorted, v)
100 loops, best of 3: 14.7 ms per loop
Overall the structured array approach is faster, and can be made even faster if you design it to work with the flipped and transposed versions of a and v. It gets even faster as the numbers of rows/keys goes up, barely slowing down when going from 2 rows to 10 rows.
I did not notice any significant timing difference between using a_sorted or a and sorter=sorted_idx so I left those out for clarity.
I believe that a really fast method could be made using Cython, but this is as fast as it is going to get with pure Python and numpy.