I am trying to significantly speed up the following code, but so far to no avail. The code takes in a 2D array and removes rows that, when compared to other rows in the array, are too similar. Please see the code and comments below.
as0 = a.shape[0]
for i in range(as0):
    a2s0 = a.shape[0] # shape may change after each iteration
    if i > (a2s0 - 1):
        break
    # takes the difference between all rows in array by iterating over each
    # row. Then sums the absolutes. The condition finally gives a boolean
    # array output - similarity condition of 0.01
    t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01
    # Retains the indices that are too similar and then deletes the
    # necessary row
    inddel = np.where(t)[0]
    inddel = [k for k in inddel if k != i]
    a = np.delete(a, inddel, 0)
I was wondering if vectorization was possible but I'm not too familiar with it. Any assistance would be greatly appreciated.
Edit:
    if i >= (a2s0 - 1): # Added equality sign
        break
    # Now this only calculates over rows that have not been compared.
    t = np.sum(np.absolute(a[i,:] - a[np.arange(i+1,a2s0),:]), axis=1)>0.01
    t = np.concatenate((np.ones(i+1,dtype=bool), t))
    a = a[t, :]
Approach #1 : Broadcasting
Here's one vectorized approach that uses broadcasting: extend a to 3D and perform those pairwise computations for all iterations in one go -
mask = (np.absolute(a[:,None] - a)).sum(2) < 0.01
a_out = a[~np.triu(mask,1).any(0)]
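For reference, here is the same idea wrapped in a small function (my sketch; the function name and the thresh parameter are assumptions, not part of the original answer). It also makes the memory cost visible: the pairwise difference creates an N x N x M intermediate, which is what the next approach avoids.

import numpy as np

def remove_similar_rows_bcast(a, thresh=0.01):
    # abs-difference of every pair of rows: an (N, N, M) intermediate, then sum over columns
    mask = np.absolute(a[:, None] - a).sum(2) < thresh
    # keep a row unless some earlier row is already within thresh of it
    return a[~np.triu(mask, 1).any(0)]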
Approach #2 : Using pdist('cityblock')
For large arrays, the previous approach would run into memory issues. So, as another method, we could use pdist with the 'cityblock' metric, which computes the Manhattan distances in condensed form, and then identify the corresponding row of each offending pair with searchsorted, without ever building the full square distance matrix.
Here's the implementation -
from scipy.spatial.distance import pdist
n = a.shape[0]
thresh = 0.01  # similarity threshold
dists = pdist(a, 'cityblock')
idx = np.flatnonzero(dists < thresh)
sep_idx = np.arange(n-1,0,-1).cumsum()
rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
a_out = np.delete(a,rm_idx,axis=0)
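To make the sep_idx/searchsorted step concrete, here is a tiny worked example (mine, purely for illustration) showing how a condensed pdist index is mapped back to the first row of its pair:

import numpy as np
# For n = 5 rows, pdist orders the pairs as (0,1),(0,2),(0,3),(0,4),(1,2),(1,3),(1,4),(2,3),(2,4),(3,4).
n = 5
sep_idx = np.arange(n-1, 0, -1).cumsum()      # [ 4  7  9 10]: condensed indices below 4 pair with row 0, below 7 with row 1, ...
# Condensed index 5 corresponds to the pair (1, 3); searchsorted recovers the row index 1:
print(np.searchsorted(sep_idx, 5, 'right'))   # -> 1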
Benchmarking
Approaches -
# Approach #2 from this post
def remove_similar_rows(a, thresh=0.01):
    n = a.shape[0]
    dists = pdist(a, 'cityblock')
    idx = np.flatnonzero(dists < thresh)
    sep_idx = np.arange(n-1,0,-1).cumsum()
    rm_idx = np.unique(np.searchsorted(sep_idx,idx,'right'))
    return np.delete(a,rm_idx,axis=0)
# @John Zwinck's soln
from sklearn.metrics.pairwise import manhattan_distances

def pairwise_manhattan_distances(a, thresh=0.01):
    d = manhattan_distances(a)
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]
Timings -
In [209]: a = np.random.randint(0,9,(4000,30))
# Let's set 100 rows randomly as dups
In [210]: idx0 = np.random.choice(4000,size=100, replace=0)
In [211]: idx1 = np.random.choice(4000,size=100, replace=0)
In [217]: a[idx0] = a[idx1]
In [238]: %timeit pairwise_manhattan_distances(a, thresh=0.01)
1 loops, best of 3: 225 ms per loop
In [239]: %timeit remove_similar_rows(a, thresh=0.01)
10 loops, best of 3: 100 ms per loop
Let's create some fake data:
np.random.seed(0)
a = np.random.random((4,3))
Now we have:
array([[ 0.5488135 , 0.71518937, 0.60276338],
[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Next, we want the sum of absolute elementwise differences for all pairs of rows. We can use the Manhattan distance:
d = sklearn.metrics.pairwise.manhattan_distances(a)
Which gives:
array([[ 0. , 0.33859562, 0.64870931, 0.31577611],
[ 0.33859562, 0. , 0.89318282, 0.6465111 ],
[ 0.64870931, 0.89318282, 0. , 0.5889615 ],
[ 0.31577611, 0.6465111 , 0.5889615 , 0. ]])
Now you can apply a threshold, keeping only one triangle:
m = np.tril(d < 0.4, -1) # large threshold just for this example
And get a boolean mask:
array([[False, False, False, False],
[ True, False, False, False],
[False, False, False, False],
[ True, False, False, False]], dtype=bool)
Which tells you that row 0 is "too similar" to both row 1 and row 3. Now you can remove rows from the original matrix where any element of the mask is True:
a[~np.any(m, axis=0)] # axis can be either 0 or 1 - design choice
Which gives you:
array([[ 0.54488318, 0.4236548 , 0.64589411],
[ 0.43758721, 0.891773 , 0.96366276],
[ 0.38344152, 0.79172504, 0.52889492]])
Putting it all together:
d = sklearn.metrics.pairwise.manhattan_distances(a)
a = a[~np.any(np.tril(d < 0.4, -1), axis=0)]
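For convenience, the whole thing can be wrapped in a small function (my sketch; the name drop_similar_rows and the thresh parameter are mine, not from the answer):

import numpy as np
from sklearn.metrics.pairwise import manhattan_distances

def drop_similar_rows(a, thresh=0.01):
    # drop each row that has a later, Manhattan-similar row (same logic as above)
    d = manhattan_distances(a)
    return a[~np.any(np.tril(d < thresh, -1), axis=0)]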
First the line:
t = np.sum(np.absolute(a[i,:] - a), axis=1)<0.01
takes the sum of the absolute differences between a single row and the whole array every time. This is probably not what you need; instead, take the differences between the current row and only the rows later in the array. You have already compared all of the preceding rows with the current one, so why do it again?
Also, deleting rows from the array is an expensive, slow operation, so you will probably find it quicker to check all of the rows first and then delete the near duplicates in one go. You can also skip any rows that are already slated for deletion, since you know they will be removed. A sketch of this is shown below.
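A minimal sketch of that strategy (my own code, not the answerer's; the function name prune_similar and the thresh parameter are hypothetical):

import numpy as np

def prune_similar(a, thresh=0.01):
    n = a.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        if not keep[i]:
            continue                      # row already slated for deletion -- skip it
        # compare only against the rows *after* i, as suggested above
        d = np.sum(np.abs(a[i] - a[i+1:]), axis=1)
        keep[i+1:] &= d >= thresh         # mark near-duplicates of row i for deletion
    return a[keep]                        # delete everything in one go at the end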
Related
Given a 2D matrix np.array([[1,3,1],[2,0,5]]), if one needs to calculate, for each element, the max of its row excluding that element's own column, with the expected return being np.array([[3,1,3],[5,5,2]]), what would be the most efficient way to do so?
Currently I implemented it with a loop to exclude its own col index:
n = x.shape[1]  # number of columns
row_max_mat = np.zeros(x.shape)
rng = np.arange(n)
for i in rng:
    row_max_mat[:, i] = np.amax(x[:, rng != i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
a = np.array([[1,3,1],[2,0,5]])
cols = a.shape[1]
mask = ~np.eye(cols, dtype=bool)
a[:,np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
[5, 5, 2]])
You could do this using np.maximum.accumulate. Compute the forward and backward accumulations of maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd and 4th dimension could also work using a mask, but that would require columns^2 times the matrix's size to process and would likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining the row-wise and column-wise results); a column-wise sketch is shown below.
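For example, a column-wise version of the same trick could look like this (my sketch, reusing m from above; not part of the original answer):

# max of each column excluding the element's own row: accumulate along axis 0 instead
fmax_c = np.maximum.accumulate(m, axis=0)
bmax_c = np.maximum.accumulate(m[::-1, :], axis=0)[::-1, :]
rc = np.full(m.shape, np.min(m))
rc[:-1, :] = np.maximum(rc[:-1, :], bmax_c[1:, :])
rc[1:, :] = np.maximum(rc[1:, :], fmax_c[:-1, :])
# For m = [[1,3,1],[2,0,5]] this gives [[2,0,5],[1,3,1]]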
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)
b = (((a // row_max)+1)%2)                            # 1 everywhere except at each row's max
c = b*row_max                                         # row max placed at the non-max positions
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))  # second-largest value placed at the max position
c+d # result
Since we are looking for the max excluding each element's own column, the output would basically have each row filled with that row's max, except at the position of the max element itself, which needs to be filled with the second-largest value. As such, argpartition seems to fit right in. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m,2,axis=1)
    R = np.arange(len(sidx))
    s0,s1 = sidx[:,0], sidx[:,1]
    mask = m[R,s0]>m[R,s1]
    L1c,L2c = np.where(mask,s0,s1), np.where(mask,s1,s0)
    out[R,L1c] = m[R,L2c]
    return out
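A quick check on the example matrix from the question (for illustration):

m = np.array([[1,3,1],[2,0,5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]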
Benchmarking
Other working solution(s) for large arrays -
# @Alain T.'s soln
def max_accum(m):
    fmax = np.maximum.accumulate(m,axis=1)
    bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
    r = np.full(m.shape,np.min(m))
    r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
    r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0
Is there a way to get rid of the loop in the code below and replace it with a vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
                   [0.35, 0.5],
                   [0.45, 0.6],
                   [0.52, 0.65],
                   [0.6, 0.8],
                   [0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i,ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
Approach #3 : If you are not iterating over many rows (just like the loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but use more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data, then use the two range limits for each row to get start and stop indices with np.searchsorted, and finally slice with those indices to get the minimum values. The benefit of slicing that way is that we would be working with views, which is very efficient in both memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by @Daniel F, for the case when the ranges are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
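Here is one possible sketch of that vectorized-searchsorted idea (my own, under the assumption that each search value actually falls inside its row's value range): offset each sorted row into its own disjoint interval, so a single np.searchsorted call serves all rows at once.

def row_searchsorted(sdata, vals, side='left'):
    # sdata: each row sorted ascending; vals: one search value per row (assumed to lie
    # within that row's value range, otherwise the offset trick below can misplace it)
    n, m = sdata.shape
    span = sdata.max() - sdata.min() + 1          # gap large enough to keep rows disjoint
    offset = span * np.arange(n)
    flat = (sdata + offset[:, None]).ravel()      # globally sorted 1D view of all rows
    idx = np.searchsorted(flat, vals + offset, side)
    return idx - m * np.arange(n)                 # back to per-row indices

# start = row_searchsorted(sdata, ranges[:, 0])
# stop  = row_searchsorted(sdata, ranges[:, 1], 'right')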
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
I've got a 2-row array called C like this:
from numpy import *
A = array([1,2,3,4,5])
B = array([50,40,30,20,10])
C = vstack((A,B))
I want to take all the columns in C where the value in the first row falls between i and i+2, and average them. I can do this with just A no problem:
i = 0
A_avg = []
while(i<6):
    selection = A[logical_and(A >= i, A < i+2)]
    A_avg.append(mean(selection))
    i += 2
then A_avg is:
[1.0,2.5,4.5]
I want to carry out the same process with my two-row array C, but I want to take the average of each row separately, while doing it in a way that's dictated by the first row. For example, for C, I want to end up with a 2 x 3 array that looks like:
[[1.0,2.5,4.5],
[50,35,15]]
Where the first row is A averaged in blocks between i and i+2 as before, and the second row is B averaged in the same blocks as A, regardless of the values it has. So the first entry is unchanged, the next two get averaged together, and the next two get averaged together, for each row separately. Anyone know of a clever way to do this? Many thanks!
I hope this is not too clever. TIL boolean indexing does not broadcast, so I had to manually do the broadcasting. Let me know if anything is unclear.
import numpy as np
A = [1,2,3,4,5]
B = [50,40,30,20,10]
C = np.vstack((A,B)) # float so that I can use np.nan
i = np.arange(0, 6, 2)[:, None]
selections = np.logical_and(A >= i, A < i+2)[None]
D, selections = np.broadcast_arrays(C[:, None], selections)
D = D.astype(float) # allows use of nan, and makes a copy to prevent repeated behavior
D[~selections] = np.nan # exclude these elements from mean
D = np.nanmean(D, axis=-1)
Then,
>>> D
array([[ 1. , 2.5, 4.5],
[ 50. , 35. , 15. ]])
Another way, using np.histogram to bin your data. This may be faster for large arrays, but is only useful for a few rows, since a histogram must be done with different weights for each row (a per-row loop for more rows is sketched after the code):
bins = np.arange(0, 7, 2) # include the end
n = np.histogram(A, bins)[0] # number of columns in each bin
a_mean = np.histogram(A, bins, weights=A)[0]/n
b_mean = np.histogram(A, bins, weights=B)[0]/n
D = np.vstack([a_mean, b_mean])
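For more than two rows, the same binning can simply be looped over the rows, one histogram call per row (my sketch, assuming the stacked array C from the question; the astype(float) cast is mine, to keep the division floating point):

bins = np.arange(0, 7, 2)
n = np.histogram(A, bins)[0].astype(float)   # number of columns in each bin
D = np.vstack([np.histogram(A, bins, weights=row)[0] / n for row in C])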
I have a symmetric matrix represented as a numpy array, like the following example:
[[ 1. 0.01735908 0.01628629 0.0183845 0.01678901 0.00990739 0.03326491 0.0167446 ]
[ 0.01735908 1. 0.0213712 0.02364181 0.02603567 0.01807505 0.0130358 0.0107082 ]
[ 0.01628629 0.0213712 1. 0.01293289 0.02041379 0.01791615 0.00991932 0.01632739]
[ 0.0183845 0.02364181 0.01293289 1. 0.02429031 0.01190878 0.02007371 0.01399866]
[ 0.01678901 0.02603567 0.02041379 0.02429031 1. 0.01496896 0.00924174 0.00698689]
[ 0.00990739 0.01807505 0.01791615 0.01190878 0.01496896 1. 0.0110924 0.01514519]
[ 0.03326491 0.0130358 0.00991932 0.02007371 0.00924174 0.0110924 1. 0.00808803]
[ 0.0167446 0.0107082 0.01632739 0.01399866 0.00698689 0.01514519 0.00808803 1. ]]
And I need to find the indices (row and column) of the greatest value without considering the diagonal. Since it is a symmetric matrix, I just took the upper triangle of the matrix.
ind = np.triu_indices(M_size, 1)
And then the index of the max value
max_ind = np.argmax(H[ind])
However, max_ind is an index into the vector obtained by taking the upper triangle with triu_indices. How do I know the row and column of the value I've just found?
The matrix could be any size but it's always symmetric. Do you know a better method to achieve the same?
Thank you
Couldn't you do this by using np.triu to return a copy of your matrix with all but the upper triangle zeroed, then just use np.argmax and np.unravel_index to get the row/column indices?
Example:
x = np.zeros((10,10))
x[3, 8] = 1
upper = np.triu(x, 1)
idx = np.argmax(upper)
row, col = np.unravel_index(idx, upper.shape)
The drawback of this method is that it creates a copy of the input matrix, but it should still be a lot quicker than looping over elements in Python. It also assumes that the maximum value in the upper triangle is > 0.
You can use the value of max_ind as an index into the ind data
max_ind = np.argmax(H[ind])
Out: 23
ind[0][max_ind], ind[1][max_ind],
Out: (4, 6)
Validate this by looking for the maximum in the entire matrix (won't always work -- data-dependent):
np.unravel_index(np.argmax(H), H.shape)
Out: (4, 6)
There's probably a neater "numpy way" to do this, but this is what comes to mind first:
import operator

answer = None
biggest = 0
for r, row in enumerate(matrix[:-1]):     # the last row has nothing to the right of the diagonal
    i, elem = max(enumerate(row[r+1:]), key=operator.itemgetter(1))
    if elem > biggest:
        biggest, answer = elem, (r, r + 1 + i)   # convert the slice-relative index back to a column index
Let's say I have a 2-dimensional matrix as a numpy array. If I want to delete rows with specific indices in this matrix, I use numpy.delete(). Here is an example of what I mean:
In [1]: my_matrix = numpy.array([
...: [10, 20, 30, 40, 50],
...: [15, 25, 35, 45, 55],
...: [95, 96, 97, 98, 99]
...: ])
In [2]: numpy.delete(my_matrix, [0, 2], axis=0)
Out[2]: array([[15, 25, 35, 45, 55]])
I'm looking for a way to do the above with matrices from the scipy.sparse package. I know it's possible to do this by converting the entire matrix into a numpy array but I don't want to do that. Is there any other way of doing that?
Thanks a lot!
For CSR, this is probably the most efficient way to do it in-place:
import numpy as np
import scipy.sparse

def delete_row_csr(mat, i):
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    n = mat.indptr[i+1] - mat.indptr[i]
    if n > 0:
        mat.data[mat.indptr[i]:-n] = mat.data[mat.indptr[i+1]:]
        mat.data = mat.data[:-n]
        mat.indices[mat.indptr[i]:-n] = mat.indices[mat.indptr[i+1]:]
        mat.indices = mat.indices[:-n]
    mat.indptr[i:-1] = mat.indptr[i+1:]
    mat.indptr[i:] -= n
    mat.indptr = mat.indptr[:-1]
    mat._shape = (mat._shape[0]-1, mat._shape[1])
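For illustration (my example, reusing the dense matrix from the question), the in-place CSR version could be exercised like this:

M = scipy.sparse.csr_matrix(np.array([[10, 20, 30, 40, 50],
                                      [15, 25, 35, 45, 55],
                                      [95, 96, 97, 98, 99]]))
delete_row_csr(M, 0)       # modifies M in place
delete_row_csr(M, 1)       # indices shift after each deletion -- this is the original row 2
print(M.toarray())         # [[15 25 35 45 55]]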
In LIL format it's even simpler:
def delete_row_lil(mat, i):
    if not isinstance(mat, scipy.sparse.lil_matrix):
        raise ValueError("works only for LIL format -- use .tolil() first")
    mat.rows = np.delete(mat.rows, i)
    mat.data = np.delete(mat.data, i)
    mat._shape = (mat._shape[0] - 1, mat._shape[1])
pv.'s answer is a good and solid in-place solution that takes
a = scipy.sparse.csr_matrix((100,100), dtype=numpy.int8)
%timeit delete_row_csr(a.copy(), 0)
10000 loops, best of 3: 80.3 us per loop
for any array size. Since boolean indexing works for sparse matrices, at least in scipy >= 0.14.0, I would suggest using it whenever multiple rows are to be removed:
def delete_rows_csr(mat, indices):
    """
    Remove the rows denoted by ``indices`` from the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, scipy.sparse.csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = numpy.ones(mat.shape[0], dtype=bool)
    mask[indices] = False
    return mat[mask]
This solution takes significantly longer for a single row removal
%timeit delete_rows_csr(a.copy(), [50])
1000 loops, best of 3: 509 us per loop
But it is more efficient for the removal of multiple rows, as the execution time barely increases with the number of rows:
%timeit delete_rows_csr(a.copy(), numpy.random.randint(0, 100, 30))
1000 loops, best of 3: 523 us per loop
In addition to @loli's version of @pv's answer, I expanded their function to allow for row and/or column deletion by index on CSR matrices.
import numpy as np
from scipy.sparse import csr_matrix
def delete_from_csr(mat, row_indices=[], col_indices=[]):
    """
    Remove the rows (denoted by ``row_indices``) and columns (denoted by ``col_indices``) from the CSR sparse matrix ``mat``.
    WARNING: Indices of altered axes are reset in the returned matrix.
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")

    rows = []
    cols = []
    if row_indices:
        rows = list(row_indices)
    if col_indices:
        cols = list(col_indices)

    if len(rows) > 0 and len(cols) > 0:
        row_mask = np.ones(mat.shape[0], dtype=bool)
        row_mask[rows] = False
        col_mask = np.ones(mat.shape[1], dtype=bool)
        col_mask[cols] = False
        return mat[row_mask][:,col_mask]
    elif len(rows) > 0:
        mask = np.ones(mat.shape[0], dtype=bool)
        mask[rows] = False
        return mat[mask]
    elif len(cols) > 0:
        mask = np.ones(mat.shape[1], dtype=bool)
        mask[cols] = False
        return mat[:,mask]
    else:
        return mat
You can delete row 0 < i < X.shape[0] - 1 from a CSR matrix X with
scipy.sparse.vstack([X[:i, :], X[i+1:, :]])
You can delete the first or the last row with X[1:, :] or X[:-1, :], respectively. Deleting multiple rows in one go will probably require rolling your own function; a possible sketch is shown below.
For other formats than CSR, this might not necessarily work as not all formats support row slicing.
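One possible way to roll such a function (my sketch, not from the original answer): stack the slices between the deleted rows.

import scipy.sparse

def delete_rows_by_slicing(X, rows):
    # assumes at least one row is kept; works for formats that support row slicing (e.g. CSR)
    rows = sorted(set(rows))
    keep_slices, start = [], 0
    for r in rows:
        if r > start:
            keep_slices.append(X[start:r, :])
        start = r + 1
    if start < X.shape[0]:
        keep_slices.append(X[start:, :])
    return scipy.sparse.vstack(keep_slices, format='csr')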
To remove the i'th row from A simply use left matrix multiplication:
B = J*A
where J is a sparse identity matrix with i'th row removed.
Left multiplication by the transpose of J will insert a zero-vector back to the i'th row of B, which makes this solution a bit more general.
A0 = J.T * B
To construct J itself, I used pv.'s solution on a sparse diagonal matrix as follows (maybe there's a simpler solution for this special case?)
def identity_minus_rows(N, rows):
    if np.isscalar(rows):
        rows = [rows]
    J = sps.diags(np.ones(N), 0).tocsr()  # make a diag matrix
    for r in sorted(rows, reverse=True):  # delete from the bottom up so earlier indices stay valid
        delete_row_csr(J, r)              # pv.'s function works in place and returns None
    return J
You may also remove columns by right-multiplying by J.T of the appropriate size.
Finally, multiplication is efficient in this case because J is so sparse.
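A small usage example (mine), assuming scipy.sparse is imported as sps and that delete_row_csr and identity_minus_rows are defined as above:

import numpy as np
import scipy.sparse as sps

A = sps.csr_matrix(np.arange(20).reshape(5, 4))
J = identity_minus_rows(A.shape[0], [1, 3])   # 3 x 5 selection matrix
B = J * A                                     # A with rows 1 and 3 removed
A0 = J.T * B                                  # rows 1 and 3 reinserted as zero rows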
Note that sparse matrices support fancy indexing to some degree. So what you can do is this:
mask = np.ones(mat.shape[0], dtype=bool)   # note: len(mat) is ambiguous for sparse matrices
mask[rows_to_delete] = False
# unfortunately boolean indexing may not work (pre-0.14 scipy), so use integer indices:
w = np.flatnonzero(mask)
result = mat[w, :]
The delete method doesn't really do anything else either.
Using @loli's implementation, here I leave a function to remove columns:
def delete_cols_csr(mat, indices):
    """
    Remove the columns denoted by ``indices`` from the CSR sparse matrix ``mat``.
    """
    if not isinstance(mat, csr_matrix):
        raise ValueError("works only for CSR format -- use .tocsr() first")
    indices = list(indices)
    mask = np.ones(mat.shape[1], dtype=bool)
    mask[indices] = False
    return mat[:, mask]