L2 normalization of rows in scipy sparse matrix - python

Since I want to use only numpy and scipy (I don't want to use scikit-learn), I was wondering how to perform an L2 normalization of rows in a huge scipy csc_matrix (2,000,000 x 500,000). The operation must consume as little memory as possible since it must fit in memory.
What I have so far is:
import numpy as np
import scipy.sparse as sp
tf_idf_matrix = sp.lil_matrix((n_docs, n_terms), dtype=np.float16)
# ... perform several operations and fill up the matrix
tf_idf_matrix = tf_idf_matrix / l2_norm(tf_idf_matrix)
# l2_norm() is what I want
def l2_norm(sparse_matrix):

Since I couldn't find the answer anywhere, I will post here how I approached the problem.
def l2_norm(sparse_csc_matrix):
    # first, I convert the csc_matrix to csr_matrix which is done in linear time
    norm = sparse_csc_matrix.tocsr(copy=True)
    # compute the inverse of l2 norm of non-zero elements
    norm.data **= 2
    norm = norm.sum(axis=1)
    n_nzeros = np.where(norm > 0)
    norm[n_nzeros] = 1.0 / np.sqrt(norm[n_nzeros])
    norm = np.array(norm).T[0]
    # modify sparse_csc_matrix in place
sparse_csc_matrix.data, norm)
If anyone has a better approach, please post it.
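For comparison, here is a minimal sketch of row-wise L2 normalization that uses only numpy and scipy. It is not the code from the answer above: the function name l2_normalize_rows and its dtype parameter are made up for illustration, and it assumes that holding one CSR copy of the matrix plus one integer array of length nnz is acceptable.

```python
import numpy as np
import scipy.sparse as sp


def l2_normalize_rows(matrix, dtype=np.float64):
    """Return a CSR copy of `matrix` whose non-empty rows have unit L2 norm.

    Hypothetical helper for illustration; all-zero rows are left untouched.
    """
    csr = matrix.tocsr().astype(dtype)  # astype copies, so the input is not modified
    n_rows = csr.shape[0]
    # Row index of every stored value, derived from the CSR row pointer.
    row_ids = np.repeat(np.arange(n_rows), np.diff(csr.indptr))
    # Sum of squares per row; rows without stored values get 0.
    sq_sums = np.bincount(row_ids, weights=csr.data ** 2, minlength=n_rows)
    inv_norms = np.zeros_like(sq_sums)
    nonzero = sq_sums > 0
    inv_norms[nonzero] = 1.0 / np.sqrt(sq_sums[nonzero])
    # Scale the stored values in place; the sparsity pattern is unchanged.
    csr.data *= inv_norms[row_ids]
    return csr
```

Calling l2_normalize_rows(tf_idf_matrix) would return the normalized matrix in CSR format; converting back with .tocsc() is possible if the column-oriented layout is needed afterwards.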


Is there a way to have fast boolean operations on scipy.sparse matrices?

I have to compute an XOR operation on very high dimensional (~30'000) vectors to get the Hamming distance. For example, I need to compute the XOR between one vector full of False with 16 sparsely located True values and each row of a 50'000x30'000 matrix.
As of now, the quickest way I have found is to not use scipy.sparse at all and to apply the simple ^ operation on each row.
That happens to be ten times faster than this:
sparse_hashes = scipy.sparse.csr_matrix((self.hashes)).astype('bool')
for i in range(all_points.shape[0]):
But ten times faster is still quite slow since, theoretically, having a sparse vector with 16 activations should make the computation the same as having a 16 dimension one.
Is there any solution? I'm really struggling here, thanks for the help!
If your vector is highly sparse (like 16/30000) I'd probably just skip fiddling with sparse xor entirely.
from scipy import sparse
import numpy as np
import numpy.testing as npt
matrix_1 = sparse.random(10000, 100, density=0.1, format='csc')
matrix_1.data = np.ones(matrix_1.data.shape, dtype=bool)
matrix_2 = sparse.random(1, 100, density=0.1, format='csc', dtype=bool)
vec = matrix_2.toarray().flatten()
# Pull out the part of the sparse matrix that matches the vector and sum it after xor
matrix_xor = (matrix_1[:, vec].toarray() ^ np.ones(vec.sum(), dtype=bool)[np.newaxis, :]).sum(axis=1)
# Sum the part that doesn't match the vector and add it
l1distances = np.asarray(matrix_1[:, ~vec].sum(axis=1)).flatten() + matrix_xor
# Double check that I can do basic math
npt.assert_array_equal(l1distances, (matrix_1.toarray() ^ vec[np.newaxis, :]).sum(axis=1))
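A related way to avoid an explicit XOR is the identity |a xor b| = |a| + |b| - 2|a and b|, which turns the Hamming distance into one sparse matrix-vector product. The sketch below is only an illustration of that identity; the function name hamming_to_rows is hypothetical and the inputs are assumed to be boolean (or 0/1) data.

```python
import numpy as np
from scipy import sparse


def hamming_to_rows(matrix, vec):
    """Hamming distance between boolean vector `vec` and every row of `matrix`.

    Hypothetical helper: uses |a xor b| = |a| + |b| - 2 * |a and b|.
    """
    counts = sparse.csr_matrix(matrix, dtype=np.int32)  # True -> 1, False -> 0
    v = np.asarray(vec, dtype=np.int32)
    # Number of True entries in each row of the matrix.
    row_popcount = np.asarray(counts.sum(axis=1)).ravel()
    # Number of positions where both the row and the vector are True.
    overlap = counts @ v
    return row_popcount + v.sum() - 2 * overlap
```

The cost is dominated by the matrix-vector product, which only touches the stored entries of the matrix.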

Covariance matrices should be equal but are not

I'm computing covariance in two ways that I think should tie out, but they do not.
Method 1: Compute the covariance matrix of a slice of an array of data
Method 2: Compute the covariance matrix of the full array of data, and reference an equivalent slice of that matrix.
The differences are tiny (order 1e-18), but these differences are growing with subsequent calculations in my code and preventing reproducibility. Is this a floating point issue? If so I'd still like to understand why it's happening and how to avoid it.
I'm on numpy 1.16.3
import numpy as np
state = np.random.RandomState(1)
X = state.rand(40,100)
A = np.cov(X[:20])
B = np.cov(X)[:20,:20]
print(np.array_equal(A, B))
diff = A - B
I would have expected a True result from array_equal, but I get False
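A minimal sketch of a tolerance-based comparison for the same data follows. A plausible explanation, stated here as an assumption rather than something confirmed in the post, is that the matrix product inside np.cov accumulates its sums in a different order for differently shaped inputs, so the two results can differ in the last bits. The tolerance 1e-12 is an arbitrary choice for illustration.

```python
import numpy as np

state = np.random.RandomState(1)
X = state.rand(40, 100)
A = np.cov(X[:20])
B = np.cov(X)[:20, :20]

# Bitwise comparison: sensitive to differences in the last floating point bits.
print(np.array_equal(A, B))
# Largest absolute discrepancy between the two results.
print(np.abs(A - B).max())
# Comparison with an explicit absolute tolerance (assumed value, adjust as needed).
print(np.allclose(A, B, rtol=0.0, atol=1e-12))
```

If exact reproducibility matters downstream, computing the covariance once and slicing it everywhere (or rounding to a fixed number of decimals) keeps both code paths on identical values.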

Scipy: epsilon neighborhood by sparse similarity with threshold

I am wondering if scipy offers the option to implement a primitive but memory-friendly approach to epsilon neighborhood search:
Compute pairwise similarity for my data, but set all similarities smaller than a threshold epsilon to zero on the fly and then output result directly as sparse matrix.
For example scipy.spatial.distance.pdist() is really fast, but the memory limit is reached early compared to my time limit, at least if I take squareform().
I know there are O(n*log(n)) solutions in this case, but for now it would be enough if the result could be sparse. Also, obviously I would have to use a similarity as opposed to a distance, but that should not be such a big problem, should it?
As long as you can recast your similarity measure in terms of a distance metric (say 1 minus the similarity) then the most efficient solution is to use sklearn's BallTree.
Otherwise you could build your own scipy.sparse.csr_matrix by comparing each point against the other points and throwing away all values smaller than the threshold.
Without knowing your specific similarity metric, this code should roughly do the trick:
import scipy.sparse as spsparse
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def sparse_similarity(X, epsilon=0.99, Y=None, similarity_metric=cosine_similarity):
    '''
    X : ndarray
        An m by n array of m original observations in an n-dimensional space.
    '''
    Nx, Dx = X.shape
    if Y is None:
        Y = X
    Ny, Dy = Y.shape
    assert Dx==Dy
    data = []
    indices = []
    indptr = [0]
    for ix in range(Nx):
        xsim = similarity_metric([X[ix]], Y)
        _ , kept_points = np.nonzero(xsim>=epsilon)
        # store the similarities that pass the threshold and their column indices
        data.extend(xsim[0, kept_points])
        indices.extend(kept_points)
        indptr.append(indptr[-1] + len(kept_points))
    return spsparse.csr_matrix((data, indices, indptr), shape=(Nx,Ny))
X = np.random.random(size=(1000,10))
sparse_similarity(X, epsilon=0.95)
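Since the question asks for a scipy-only approach, here is a hedged sketch of the same idea without scikit-learn. It processes the data in row blocks with scipy.spatial.distance.cdist so that only one dense block is in memory at a time. The function name sparse_cosine_similarity and the block_size parameter are assumptions for illustration, and the rows of X are assumed to be non-zero (cosine distance is undefined otherwise).

```python
import numpy as np
import scipy.sparse as sp
from scipy.spatial.distance import cdist


def sparse_cosine_similarity(X, epsilon=0.95, block_size=256):
    """Thresholded cosine similarity of the rows of X as a CSR matrix.

    Hypothetical helper: similarities below `epsilon` are dropped block by block.
    """
    blocks = []
    for start in range(0, X.shape[0], block_size):
        # Dense similarities of one block of rows against all rows.
        sim = 1.0 - cdist(X[start:start + block_size], X, metric="cosine")
        sim[sim < epsilon] = 0.0
        # Converting to CSR keeps only the surviving entries.
        blocks.append(sp.csr_matrix(sim))
    return sp.vstack(blocks, format="csr")
```

Peak memory is roughly block_size times the number of points for the dense block, plus the stored non-zeros, so block_size can be tuned to the available memory.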

Vectorizing or boosting time for an interpolation in Python

I have to boost the time for an interpolation over a large (NxMxT) matrix MTR, where:
N is about 8000;
M is about 10000;
T represents the number of times at which each NxM matrix is calculated (in my case it's 23).
I have to compute the interpolation element-wise, on all the T different times, and return the interpolated values over a different array of times (T_interp, in my case with length 47), so, as output, I want an NxMxT_interp matrix.
The code snippet below defines the function I built for the interpolation, using scipy.interpolate.Rbf (y is the array MTR[i,j,:], x is the times array with length T, x_interp is the new array of times with length T_interp):
# Interpolate without nans
def interp(x,y,x_interp,**kwargs):
    import numpy as np
    from scipy.interpolate import Rbf
    mask = np.isnan(y)
    y_mask = np.ma.array(y,mask = mask)
    x_new = [x[i] for i in np.where(~mask)[0]]
    if len(y_mask.compressed()) == 0:
        return [np.nan for i,n in enumerate(x_interp)]
    elif len(y_mask.compressed()) == 1:
        # a single valid sample: repeat its scalar value
        return [y_mask.compressed()[0] for i,n in enumerate(x_interp)]
    interp = Rbf(x_new,y_mask.compressed(),**kwargs)
    y_interp = interp(x_interp)
    return y_interp
I tried to achieve my goal either by looping over the NxM elements of the MTR matrix:
new_MTR = np.empty((N,M,T_interp))
for i in range(N):
    for j in range(M):
        new_MTR[i,j,:]=interp(times,MTR[i,j,:],New_times,function = 'linear')
or by using the np.apply_along_axis function:
new_MTR = np.apply_along_axis(lambda x: interp(times,x,New_times,function = 'linear'),2,MTR)
In both cases I estimated the time it takes to perform the whole operation, and it appears to be slightly better for the np.apply_along_axis function, but it will still take about 15 hours!
Is there a way to reduce this time? Maybe by vectorizing the entire operation? I don't know much about vectorizing and how it can be done in a situation like mine so any help would be much appreciated. Thank you!
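One possible direction, sketched under strong assumptions: if the data contain no NaNs and the new times lie inside the range of the original times, plain piecewise-linear interpolation can be applied to whole blocks of the array at once with scipy.interpolate.interp1d along the time axis. This swaps Rbf for ordinary linear interpolation, so it is not a drop-in replacement for the function above; the name interp_time_axis and the block_rows parameter are hypothetical.

```python
import numpy as np
from scipy.interpolate import interp1d


def interp_time_axis(mtr, times, new_times, block_rows=256):
    """Linearly interpolate an (N, M, T) array along its last axis.

    Hypothetical helper: assumes no NaNs in `mtr` and `new_times` within
    the range of `times`. Rows are processed in blocks to limit temporaries.
    """
    out = np.empty(mtr.shape[:2] + (len(new_times),), dtype=float)
    for start in range(0, mtr.shape[0], block_rows):
        block = mtr[start:start + block_rows]
        # One interpolator handles every (i, j) series in the block at once.
        f = interp1d(times, block, kind="linear", axis=-1)
        out[start:start + block_rows] = f(new_times)
    return out
```

Series that do contain NaNs would still need separate handling, for example by falling back to the per-element function only for those positions.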

How to get the condensed form of pairwise distances directly?

I have a very large scipy sparse csr matrix. It is a 100,000x2,000,000 dimensional matrix. Let's call it X. Each row is a sample vector in a 2,000,000 dimensional space.
I need to calculate the cosine distances between each pair of samples very efficiently. I have been using sklearn pairwise_distances function with a subset of vectors in X which gives me a dense matrix D: the square form of the pairwise distances which contains redundant entries. How can I use sklearn pairwise_distances to get the condensed form directly? Please refer to http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html to see what the condensed form is. It is the output of scipy pdist function.
I have memory limitations and I can't calculate the square form and then get the condensed form. Due to memory limitations, I also cannot use scipy pdist as it requires a dense matrix X which does not again fit in memory. I thought about looping through different chunks of X and calculate the condensed form for each chunk and join them together to get the complete condensed form, but this is relatively cumbersome. Any better ideas?
Any help is much much appreciated. Thanks in advance.
Below is a reproducible example (of course for demonstration purposes X is much smaller):
from scipy.sparse import rand
from scipy.spatial.distance import pdist
from sklearn.metrics.pairwise import pairwise_distances
X = rand(1000, 10000, density=0.01, format='csr')
dist1 = pairwise_distances(X, metric='cosine')
dist2 = pdist(X.toarray(), 'cosine')
As you see dist2 is in the condensed form and is a 499500 dimensional vector. But dist1 is in the symmetric squareform and is a 1000x1000 matrix.
I dug into the code for both versions, and think I understand what both are doing.
Start with a small simple X (dense):
X = np.arange(9.).reshape(3,3)
pdist cosine does:
norms = _row_norms(X)
_distance_wrap.pdist_cosine_wrap(_convert_to_double(X), dm, norms)
where _row_norms is a row dot - using einsum:
norms = np.sqrt(np.einsum('ij,ij->i', X,X))
So this is the first place where X has to be an array.
I haven't dug into the cosine_wrap, but it appears to do (probably in cython)
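xy = np.dot(X, X.T)
# or xy = np.einsum('ij,kj',X,X)
d = np.zeros((3,3),float) # square receiver
d2 = [] # condensed receiver
for i in range(3):
    for j in range(i+1,3):
        val = 1 - xy[i,j]/(norms[i]*norms[j])
        d2.append(val)
        d[j,i] = d[i,j] = val
print(d)
print('condensed', np.array(d2))
from scipy.spatial import distance
d1 = distance.pdist(X, 'cosine')
print(' pdist',d1)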
[[ 0. 0.11456226 0.1573452 ]
[ 0.11456226 0. 0.00363075]
[ 0.1573452 0.00363075 0. ]]
condensed [ 0.11456226 0.1573452 0.00363075]
pdist [ 0.11456226 0.1573452 0.00363075]
distance.squareform(d1) produces the same thing as my d array.
I can produce the same square array by dividing the xy dot product by the appropriate norm outer product:
dd = 1 - xy/(norms[:,None]*norms)
dd[range(dd.shape[0]),range(dd.shape[1])]=0 # clean up 0s
Or by normalizing X before taking the dot product. This appears to be what the scikit version does.
Xnorm = X/norms[:,None]
dd = 1 - np.dot(Xnorm, Xnorm.T)
scikit has added some cython code to do faster sparse calculations (beyond those provided by scipy.sparse, but using the same csr format):
from scipy import sparse
Xc = sparse.csr_matrix(X)
# csr_row_norm - pyx of following
cnorm = Xc.multiply(Xc).sum(axis=1)
cnorm = np.sqrt(cnorm)
X1 = Xc.multiply(1/cnorm) # dense matrix
dd = 1-X1*X1.T
To get a fast condensed form with sparse matrices I think you need to implement a fast condensed version of X1*X1.T. That means you need to understand how the sparse matrix multiplication is implemented - in c code. The scikit cython 'fast sparse' code might also give ideas.
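Short of writing compiled code, a block-wise sketch in pure scipy can already produce the condensed vector without ever holding the full square matrix. The version below is an illustration only: the name condensed_cosine_distances and the block_rows parameter are hypothetical, every row of X is assumed to be non-zero, and the condensed output itself (m*(m-1)/2 values) must still fit in memory.

```python
import numpy as np
from scipy import sparse


def condensed_cosine_distances(X, block_rows=256):
    """Condensed (pdist-style) cosine distances between the rows of sparse X.

    Hypothetical helper: rows are assumed to be non-zero.
    """
    X = sparse.csr_matrix(X, dtype=np.float64)
    m = X.shape[0]
    # Scale every row to unit L2 norm so that dot products are cosines.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    Xn = (sparse.diags(1.0 / norms) @ X).tocsr()
    out = np.empty(m * (m - 1) // 2)
    pos = 0
    for start in range(0, m, block_rows):
        stop = min(start + block_rows, m)
        # Similarities of rows start..stop-1 against rows start..m-1 only.
        sims = (Xn[start:stop] @ Xn[start:].T).toarray()
        for k in range(stop - start):
            # Global row i = start + k keeps columns j > i (local columns k+1 onward).
            row = 1.0 - sims[k, k + 1:]
            out[pos:pos + row.size] = row
            pos += row.size
    return out
```

The entries are written in the same order that pdist uses (row i against all j > i), so the result can be passed to scipy.spatial.distance.squareform if the square form is needed for a subset.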
numpy has some tri... functions which are straightforward Python code. They do not attempt to save time or space by implementing tri calculations directly. It's easier to iterate over the rectangular layout of an nd array (with shape and strides) than to do the more complex variable length steps of a triangular array. The condensed form only cuts the space and calculation steps by half.
Here's the main part of the C function pdist_cosine, which iterates over i and the upper j, calculating dot(x[i],y[j])/(norm[i]*norm[j]).
for (i = 0; i < m; i++) {
    for (j = i + 1; j < m; j++, dm++) {
        u = X + (n * i);
        v = X + (n * j);
        cosine = dot_product(u, v, n) / (norms[i] * norms[j]);
        if (fabs(cosine) > 1.) {
            /* Clip to correct rounding error. */
            cosine = npy_copysign(1, cosine);
        }
        *dm = 1. - cosine;
    }
}
