I would like to compute a distance matrix using the Jaccard distance, and to do so as fast as possible. I used to use scikit-learn's pairwise_distances function, but scikit-learn doesn't plan to support GPU, and there's even a known bug that makes the function slower when run in parallel.
My only constraint is that the resulting distance matrix can then be fed to scikit-learn's DBSCAN clustering algorithm. I was thinking about implementing the computation with tensorflow but couldn't find a nice and simple way to do it.
PS: I have reasons to precompute the distance matrix instead of letting DBSCAN do it as needed.
Hi, I was facing the same problem.
Given that the Jaccard similarity is the ratio of true positives (tp) to the sum of true positives, false negatives (fn) and false positives (fp), I came up with this solution:
def jaccard_distance(self):
    # element-wise products give the per-row counts of matches and mismatches
    tp = tf.reduce_sum(tf.multiply(self.target, self.prediction), 1)      # true positives
    fn = tf.reduce_sum(tf.multiply(self.target, 1 - self.prediction), 1)  # false negatives
    fp = tf.reduce_sum(tf.multiply(1 - self.target, self.prediction), 1)  # false positives
    # Jaccard distance = 1 - |intersection| / |union|
    return 1 - (tp / (tp + fn + fp))
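To make this easier to try in isolation, here is a minimal standalone sketch of the same row-wise computation (assuming TF 2 eager execution and two binary float tensors; the toy inputs a and b are only illustrative):

import tensorflow as tf

def jaccard_distance(target, prediction):
    # both inputs are float tensors of 0s and 1s with shape (n_samples, n_features)
    tp = tf.reduce_sum(target * prediction, axis=1)
    fn = tf.reduce_sum(target * (1 - prediction), axis=1)
    fp = tf.reduce_sum((1 - target) * prediction, axis=1)
    return 1 - tp / (tp + fn + fp)

# toy check: identical rows give distance 0, disjoint rows give distance 1
a = tf.constant([[1., 1., 0.], [1., 0., 0.]])
b = tf.constant([[1., 1., 0.], [0., 0., 1.]])
print(jaccard_distance(a, b).numpy())  # -> [0. 1.]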
Hope this helps!
I am not a TensorFlow expert, but here is the solution I came up with. As far as I know, the only ways in TensorFlow to run a computation over all pairs of a list are matrix multiplication and the broadcasting rules; this solution uses both at some point.
So let's assume we have an input boolean matrix of n_samples rows, one per set, and n_features columns, one per possible element. A value True in the i-th row, j-th column means the i-th set contains the element j, just like scikit-learn's pairwise_distances expects. We can then proceed as follows.
1. Cast the matrix to numbers, getting 1 for True and 0 for False.
2. Multiply the matrix by its own transpose. This produces a matrix M where each element M[i][j] contains the size of the intersection between the i-th and j-th sets.
3. Compute a cardinal vector containing the cardinality of each set by summing each row of the input matrix.
4. Make a row vector and a column vector from cardinal.
5. Compute 1 - M / (cardinalrow + cardinalcol - M). The broadcasting rules do all the work when adding a row vector and a column vector.
This algorithm as a whole seems a bit hackish, but it works and produces results within a reasonable margin of those computed by scikit-learn's pairwise_distances function. A better algorithm would probably make a single pass over every pair of input vectors and compute only half of the matrix, since it is symmetric. Any improvement is welcome.
setsin = tf.placeholder(tf.bool, shape=(N, M))
sets = tf.cast(setsin, tf.float16)
mat = tf.matmul(sets, sets, transpose_b=True, name="Main_matmul")
#mat = tf.cast(mat, tf.float32, name="Upgrade_mat")
#sets = tf.cast(sets, tf.float32, name="Upgrade_sets")
cardinal = tf.reduce_sum(sets, 1, name="Richelieu")
cardinalrow = tf.expand_dims(cardinal, 0)
cardinalcol = tf.expand_dims(cardinal, 1)
mat = 1 - mat / (cardinalrow + cardinalcol - mat)
I used the float16 type as it seems much faster than float32. Casting back up to float32 (the commented-out lines) is probably only useful if the cardinalities become large enough to be inaccurate in float16, or if more precision is needed for the division. But even when those casts are needed, it still seems worthwhile to do the matrix multiplication in float16.
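Since the original constraint was to feed the matrix to DBSCAN, here is a rough sketch of how the evaluated result could be plugged in; the eps and min_samples values are only placeholders, and it assumes the graph above was built with TF 1.x (where tf.placeholder and tf.Session exist):

import numpy as np
from sklearn.cluster import DBSCAN

# N, M must be the same values used when defining the placeholder above
input_sets = np.random.rand(N, M) > 0.5            # stand-in boolean data
with tf.Session() as sess:
    dist = sess.run(mat, feed_dict={setsin: input_sets})

# DBSCAN accepts the precomputed distance matrix via metric="precomputed"
labels = DBSCAN(eps=0.4, min_samples=5, metric="precomputed").fit_predict(dist)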
I was going through the book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, and the author was explaining how the pseudo-inverse (Moore-Penrose inverse) of a matrix is calculated in the context of Linear Regression. I'm quoting verbatim here:
The pseudoinverse itself is computed using a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the matrix multiplication of three matrices U Σ Vᵀ (see numpy.linalg.svd()). The pseudoinverse is calculated as X⁺ = V Σ⁺ Uᵀ. To compute the matrix Σ⁺, the algorithm takes Σ and sets to zero all values smaller than a tiny threshold value, then it replaces all nonzero values with their inverse, and finally it transposes the resulting matrix. This approach is more efficient than computing the Normal equation.
I've got an understanding of how the pseudo-inverse and SVD are related from this post, but I'm not able to grasp the rationale behind setting all values smaller than the threshold to zero. The inverse of a diagonal matrix is obtained by taking the reciprocals of the diagonal elements, so small values would become large values in the inverse matrix, right? Then why are we throwing those large values away by zeroing them out?
I went and looked at the NumPy code; it looks as follows, just for reference:
@array_function_dispatch(_pinv_dispatcher)
def pinv(a, rcond=1e-15, hermitian=False):
    a, wrap = _makearray(a)
    rcond = asarray(rcond)
    if _is_empty_2d(a):
        m, n = a.shape[-2:]
        res = empty(a.shape[:-2] + (n, m), dtype=a.dtype)
        return wrap(res)
    a = a.conjugate()
    u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)

    # discard small singular values
    cutoff = rcond[..., newaxis] * amax(s, axis=-1, keepdims=True)
    large = s > cutoff
    s = divide(1, s, where=large, out=s)
    s[~large] = 0

    res = matmul(transpose(vt), multiply(s[..., newaxis], transpose(u)))
    return wrap(res)
It's almost certainly an adjustment for numerical error. To see why this might be necessary, look what happens when you take the svd of a rank-one 2x2 matrix. We can create a rank-one matrix by taking the outer product of a vector like so:
>>> a = numpy.arange(2) + 1
>>> A = a[:, None] * a[None, :]
>>> A
array([[1, 2],
[2, 4]])
Although this is a 2x2 matrix, it only has one linearly independent column, and so its rank is one instead of two. So we should expect that when we pass it to svd, one of the singular values will be zero. But look what happens:
>>> U, s, V = numpy.linalg.svd(A)
>>> s
array([5.00000000e+00, 1.98602732e-16])
What we actually get is a singular value that is not quite zero. This result is inevitable in many cases, given that we are working with finite-precision floating point numbers. So although the problem you have identified is a real one, in practice we cannot tell the difference between a matrix that genuinely has a very small singular value and a matrix that ought to have a zero singular value but doesn't. Setting small values to zero is the safest practical way to handle that.
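To make the consequence concrete, here is a small sketch (using the rank-one matrix A from above) comparing what happens when the tiny singular value is inverted naively versus zeroed out, roughly mirroring the thresholding step in pinv:

import numpy as np

a = np.arange(2) + 1
A = a[:, None] * a[None, :]            # the rank-one 2x2 matrix from above

U, s, Vt = np.linalg.svd(A)

# naive inversion: the ~1e-16 singular value becomes ~1e16 and dominates
naive = Vt.T @ np.diag(1 / s) @ U.T

# thresholded inversion, as numpy.linalg.pinv does it
cutoff = 1e-15 * s.max()
s_inv = np.where(s > cutoff, 1 / s, 0.0)
thresholded = Vt.T @ np.diag(s_inv) @ U.T

print(naive)                                          # huge, meaningless entries
print(thresholded)                                    # matches np.linalg.pinv(A)
print(np.allclose(thresholded, np.linalg.pinv(A)))    # True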
I am wondering if scipy offers the option to implement a primitive but memory-friendly approach to epsilon neighborhood search:
Compute pairwise similarity for my data, but set all similarities smaller than a threshold epsilon to zero on the fly, and then output the result directly as a sparse matrix.
For example, scipy.spatial.distance.pdist() is really fast, but it hits my memory limit well before my time limit, at least if I apply squareform().
I know there are O(n log n) solutions for this case, but for now it would be enough if the result could be sparse. Also, obviously, I would have to use a similarity as opposed to a distance, but that should not be such a big problem, should it?
As long as you can recast your similarity measure in terms of a distance metric (say, 1 minus the similarity), then the most efficient solution is to use sklearn's BallTree.
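For the BallTree route, a rough sketch could look like the following (the function name sparse_neighbors and the eps value are only illustrative, and it assumes the similarity has already been recast as a distance):

import numpy as np
import scipy.sparse as spsparse
from sklearn.neighbors import BallTree

def sparse_neighbors(X, eps):
    # sparse matrix holding only the pairwise distances that are <= eps
    tree = BallTree(X)
    ind, dist = tree.query_radius(X, r=eps, return_distance=True)
    data, indices, indptr = [], [], [0]
    for neighbors, distances in zip(ind, dist):
        data.extend(distances)
        indices.extend(neighbors)
        indptr.append(indptr[-1] + len(neighbors))
    return spsparse.csr_matrix((data, indices, indptr), shape=(X.shape[0], X.shape[0]))

X = np.random.random(size=(1000, 10))
D = sparse_neighbors(X, eps=0.3)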
Otherwise you could build your own scipy.sparse.csr_matrix by comparing each point against the other i-1 points and throwing away all values smaller than the threshold.
Without knowing your specific similarity metric, this code should roughly do the trick:
import scipy.sparse as spsparse
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def sparse_similarity(X, epsilon=0.99, Y=None, similarity_metric=cosine_similarity):
    '''
    X : ndarray
        An m by n array of m original observations in an n-dimensional space.
    '''
    Nx, Dx = X.shape
    if Y is None:
        Y = X
    Ny, Dy = Y.shape
    assert Dx == Dy

    data = []
    indices = []
    indptr = [0]
    for ix in range(Nx):
        xsim = similarity_metric([X[ix]], Y)
        _, kept_points = np.nonzero(xsim >= epsilon)
        data.extend(xsim[0, kept_points])
        indices.extend(kept_points)
        indptr.append(indptr[-1] + len(kept_points))

    return spsparse.csr_matrix((data, indices, indptr), shape=(Nx, Ny))


X = np.random.random(size=(1000, 10))
sparse_similarity(X, epsilon=0.95)
I've got a 2x2 matrix defined by the variables J00, J01, J10, J11 coming in from other inputs. Since the matrix is small, I was able to compute the spectral norm by first computing the trace and determinant
J_T = tf.reduce_sum([J00, J11])
J_ad = tf.reduce_prod([J00, J11])
J_cb = tf.reduce_prod([J01, J10])
J_det = tf.reduce_sum([J_ad, -J_cb])
and then solving the quadratic
L1 = J_T/2.0 + tf.sqrt(J_T**2/4.0 - J_det)
L2 = J_T/2.0 - tf.sqrt(J_T**2/4.0 - J_det)
spectral_norm = tf.maximum(L1, L2)
This works, but it looks rather ugly and it isn't generalizable to larger matrices. Is there a cleaner way (maybe a method call that I'm missing) to compute spectral_norm?
The spectral norm of a matrix J equals the largest singular value of the matrix.
Therefore you can use tf.svd() to perform the singular value decomposition, and take the largest singular value:
spectral_norm = tf.svd(J, compute_uv=False)[..., 0]
where J is your matrix.
Notes:
I use compute_uv=False since we are interested only in singular values, not singular vectors.
J does not need to be square.
This solution works also for the case where J has any number of batch dimensions (as long as the two last dimensions are the matrix dimensions).
The ellipsis (...) indexing works as in NumPy.
I take the 0 index because we are interested only in the largest singular value.
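As a quick sanity check (not part of the original answer), the result can be compared with NumPy, where the spectral norm is np.linalg.norm(J, ord=2). The sketch below assumes TF 2, where the op lives under tf.linalg.svd:

import numpy as np
import tensorflow as tf

J = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])

# largest singular value = spectral norm
spectral_norm = tf.linalg.svd(J, compute_uv=False)[..., 0]

print(spectral_norm.numpy())               # ~5.465
print(np.linalg.norm(J.numpy(), ord=2))    # same value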
I have two np.ndarrays, data with shape (8000, 500) and sample with shape (1, 500).
What I am trying to achieve is to measure various metrics between every row in data and sample.
When using sklearn.metrics.pairwise.cosine_distances I was able to take advantage of numpy's broadcasting by executing the following line:
x = cosine_distances(data, sample)
But when I tried to use the same procedure with scipy.spatial.distance.cosine I got the error
ValueError: Input vector should be 1-D.
I guess this is a broadcasting issue and I'm trying to find a way to get around it.
My ultimate goal is to iterate over all of the distances available in scipy.spatial.distance that can accept two vectors and apply them to the data and the sample.
How can I replicate the broadcasting that happens automatically in sklearn in my scipy version of the code?
OK, looking at the docs, http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_distances.html
With (8000, 500) and (1, 500) inputs ((samples, features)), you should get back an (8000, 1) result ((samples1, samples2)).
I wouldn't describe that as broadcasting. It's more like a dot product that performs some sort of calculation (a norm) over the features (the 500 dimension), reducing them down to one value. In its handling of dimensions it's more like np.dot(data, sample.T).
scipy.spatial.distance.cosine (https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) computes the cosine distance between 1-D arrays, so it behaves more like:
for row in data:
    for s in sample:
        d = cosine(row, s)
or since sample has only one row
distances = np.array([cosine(row, sample[0]) for row in data])
In other words, the sklearn version does the pairwise iteration (maybe in compiled code), while the scipy.spatial version just evaluates the distance for one pair.
pairwise.cosine_similarity does
# K(X, Y) = <X, Y> / (||X||*||Y||)
K = safe_sparse_dot(X_normalized, Y_normalized.T, dense_output=dense_output)
That's the dot-like behavior that I mentioned earlier, but with the normalization added.
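To see that equivalence in plain NumPy (a small sketch, not from the original answer), normalizing the rows and taking a dot product reproduces cosine_distances:

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

data = np.random.random((8000, 500))
sample = np.random.random((1, 500))

# normalize rows to unit length, then <X, Y> is the cosine similarity
data_n = data / np.linalg.norm(data, axis=1, keepdims=True)
sample_n = sample / np.linalg.norm(sample, axis=1, keepdims=True)
manual = 1 - data_n @ sample_n.T          # cosine distance = 1 - similarity

print(np.allclose(manual, cosine_distances(data, sample)))  # True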
I have a large scipy.sparse.csc_matrix and would like to normalize it. That is, subtract the column mean from each element and divide by the column standard deviation (std).
scipy.sparse.csc_matrix has a .mean() method, but is there an efficient way to compute the variance or std?
You can calculate the variance yourself using the mean, with the following formula:
E[X^2] - (E[X])^2
E[X] stands for the mean. So to calculate E[X^2] you would square the csc_matrix element-wise and then take the mean. To get (E[X])^2 you simply square the result of the mean function applied to the original input.
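A minimal sketch of that formula for per-column variance (assuming X is a scipy.sparse.csc_matrix; .power(2) is the element-wise square and axis=0 gives column-wise means):

import numpy as np
import scipy.sparse as sp

X = sp.random(1000, 50, density=0.1, format='csc')

# E[X^2] - (E[X])^2, computed column-wise without densifying
mean = np.asarray(X.mean(axis=0)).ravel()               # E[X] per column
mean_sq = np.asarray(X.power(2).mean(axis=0)).ravel()   # E[X^2] per column
var = mean_sq - mean ** 2
std = np.sqrt(var)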
Sicco has the better answer.
However, another way is to convert the sparse matrix to a dense numpy array one column at a time (to keep the memory requirements lower compared to converting the whole matrix at once):
# mat is the sparse matrix
# Get the number of columns
cols = mat.shape[1]
arr = np.empty(shape=cols)
for i in range(cols):
    arr[i] = np.var(mat[:, i].toarray())
The most efficient way I know of is to use StandardScaler from scikit-learn:
from sklearn.preprocessing import StandardScaler
scalar = StandardScaler(with_mean=False)
scalar.fit(X)
Then the variances are in the attribute var_:
X_var = scalar.var_
The curious thing, though, is that when I densified first using pandas (which is very slow), my answer was off by a few percent. I don't know which is more accurate.
The efficient way is actually to densify the entire matrix, then standardize it in the usual way with
X = X.toarray()
X -= X.mean(axis=0)
X /= X.std(axis=0)
As @Sebastian has noted in his comments, standardizing destroys the sparsity structure (it introduces lots of non-zero elements) in the subtraction step, so there's no point keeping the matrix in a sparse format.