I want to calculate an NxN similarity matrix using sklearn's cosine distance. My problem is that my matrix is very large, with about 1000 entries, and my current approach is very slow. I need a real speed-up. Can anybody help me speed the code up?
for i in similarity_matrix.columns:
    for j in similarity_matrix.columns:
        if i == j:
            similarity_matrix.loc[i, j] = 0
        else:
            similarity_matrix.loc[i, j] = cosine(documents[int(i)], documents[int(j)])
Bonus task: In addition, I would like to use a weighted cosine formula, but it does not seem to be implemented in sklearn. Is that true?
Using for-loops is not the ideal solution; I would recommend falling back to SciPy's pdist function. My read is that you don't mean your matrix has 1000 entries but that it is 1000x1000? Either way, SciPy can handle this easily.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# condensed vector of pairwise cosine distances between the columns of `documents`
res = pdist(documents.T, 'cosine')
# 1 - cosine distance = cosine similarity, arranged back into a square DataFrame
similarities = 1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns)
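If it helps, here is a tiny hedged example of the input layout this assumes (one column per document, which is my reading of the `.T` and `.columns` usage above; the data is random):

documents = pd.DataFrame(np.random.rand(5, 4), columns=['d0', 'd1', 'd2', 'd3'])  # 4 documents, 5 features each
res = pdist(documents.T, 'cosine')
print(1 - pd.DataFrame(squareform(res), index=documents.columns, columns=documents.columns))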
I have trouble understanding what your weight vector looks like. Is it a constant value? pdist also accepts a custom metric function, so you can, for example, compute your weighted cosine directly with NumPy (which is also really fast):
pdist(X, lambda u, v: np.dot(np.multiply(u, weightvec), np.multiply(v, weightvec)) / (norm(np.multiply(u, weightvec)) * norm(np.multiply(v, weightvec))))
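A more self-contained sketch of that route; the weight vector and the particular definition of "weighted cosine" (cosine similarity of element-wise weighted vectors) are assumptions on my side:

import numpy as np
from numpy.linalg import norm
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(4, 5)   # 4 documents with 5 features each (example data)
weightvec = np.ones(5)     # example per-feature weights; all ones reduces to plain cosine

def weighted_cosine(u, v, w=weightvec):
    uw, vw = u * w, v * w
    return np.dot(uw, vw) / (norm(uw) * norm(vw))

similarities = squareform(pdist(X, weighted_cosine))  # diagonal stays 0, matching the loop in the question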
Does anyone have any idea how to efficiently implement a 2D probability Jaccard similarity algorithm in NumPy? This specific algorithm seems to be almost non-existent in the common computer-vision/ML libraries (it is not in PyTorch, TensorFlow, or sklearn; I wonder whether there is a specific reason for this). The formula for probability Jaccard similarity is (taken from Wikipedia):

J_P(x, y) = Σ_{i : x_i ≠ 0, y_i ≠ 0} 1 / Σ_j max(x_j / x_i, y_j / y_i)
This is one way of doing it. It's pretty straightforward: we use broadcasting to perform the divisions for all pairs of entries without loops:
import numpy as np

def jaccard_probability(x, y):
    # Ignore the == 0 terms. Note this assumes x and y are nonzero at the same
    # positions; otherwise the broadcast below fails with a shape mismatch.
    x0 = x[x != 0]
    y0 = y[y != 0]
    jac = np.sum(
        1.0 / np.sum(np.maximum(x0[:, None] / x0, y0[:, None] / y0), axis=0)
    )
    return jac
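A quick sanity check on made-up vectors with the same nonzero support (reusing jaccard_probability from above); identical inputs should give exactly 1.0:

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.1, 0.4, 0.5])
print(jaccard_probability(x, x))  # 1.0 for identical inputs
print(jaccard_probability(x, y))  # a value between 0 and 1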
However, I suggest you read the NumPy guide to get a grasp of the basics, at least of broadcasting, as it is a very useful tool to know if you plan on using NumPy in the future and want to write efficient code!
I have 40,000 points and I need to find the Euclidean distance between each pair of them. After searching the web, I found that an efficient way of calculating these pairwise distances is scipy.spatial.distance.cdist. But since there are 40,000 points, the distance matrix will take around 12 GB of memory.
Is there a way of reducing the memory required to store the distance matrix without compromising the speed of calculating it? Can the data type be changed to float32 instead of float64 in the calculation of the distance matrix?
cdist-like approach
The output datatype is the same as the datatype of the input.
import numpy as np
import numba as nb

@nb.njit(fastmath=True, parallel=True)
def calc_distance(vec_1, vec_2):
    res = np.empty((vec_1.shape[0], vec_2.shape[0]), dtype=vec_1.dtype)
    for i in nb.prange(vec_1.shape[0]):
        for j in range(vec_2.shape[0]):
            res[i, j] = np.sqrt((vec_1[i, 0] - vec_2[j, 0])**2 + (vec_1[i, 1] - vec_2[j, 1])**2 + (vec_1[i, 2] - vec_2[j, 2])**2)
    return res
Approach without repetitions
@nb.njit(fastmath=True)
def calc_distance_pairs(vec):
    res = np.empty((vec.shape[0]**2) // 2 - vec.shape[0] // 2, dtype=vec.dtype)
    ii = 0
    for i in range(vec.shape[0]):
        for j in range(i + 1, vec.shape[0]):
            res[ii] = np.sqrt((vec[i, 0] - vec[j, 0])**2 + (vec[i, 1] - vec[j, 1])**2 + (vec[i, 2] - vec[j, 2])**2)
            ii += 1
    return res
This cuts the amount of memory to less than 1/4 of the scipy cdist approach.
Timings
calc_distance: ~2s
calc_distance_pairs: ~3s
cdist: ~11s
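For the 40,000-point case, a hedged usage sketch (the coordinates are random stand-ins; the key point is that float32 input yields a float32 result, roughly halving memory compared to float64):

import numpy as np

points = np.random.rand(40_000, 3).astype(np.float32)  # stand-in for the real data
dist_full = calc_distance(points, points)   # full 40k x 40k float32 matrix, ~6.4 GB
dist_pairs = calc_distance_pairs(points)    # each pair stored once, ~3.2 GB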
I have around 1M binary numpy arrays, and I need to compute the Hamming distance between them to find the k-nearest neighbours; the fastest method I have found is cdist, which returns a float matrix of distances.
Since I don't have enough memory to hold a 1M x 1M float matrix, I'm computing it one element against the whole array at a time, like this:
from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')
The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to run it for different values of k).
Is there any faster way to do it?
I'm thinking about multiprocessing, or implementing it in C, but I have some trouble understanding how multiprocessing works in Python and I don't know how to mix C code with Python code.
If you want to compute the k-nearest neighbors, it may not be necessary to compute all n^2 pairs of distances. Instead, you can use a Kd tree or a ball tree (both are data structures for efficiently querying relations between a set of points).
Scipy has a package called scipy.spatial.kdtree. It does not, however, currently support the Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn) do have a ball tree implementation that supports the Hamming distance. Here's a small example using sklearn's ball tree.
from sklearn.neighbors import BallTree
import numpy as np

# Generate random binary data.
data = np.random.randint(0, 2, size=(10, 10))

# Build the BallTree.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)
print(neighbors)  # Row n has the nth vector's k closest neighbors.
print(distances)  # Same idea, but the hamming distance to those neighbors.
Now for the big caveat: for high-dimensional vectors, KDTree and BallTree become comparable to the brute-force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
I am interested in taking a look at the eigenvalues after performing multidimensional scaling. What function can do that? I looked at the documentation, but it does not mention eigenvalues at all.
Here is a code sample:
from sklearn import manifold

mds = manifold.MDS(n_components=100, max_iter=3000, eps=1e-9,
                   random_state=seed, dissimilarity="precomputed", n_jobs=1)
results = mds.fit(wordDissimilarityMatrix)
# need a way to get the eigenvalues
I also couldn't find it from reading the documentation. I suspect they aren't performing classical MDS, but something more sophisticated:
“Modern Multidimensional Scaling - Theory and Applications” Borg, I.; Groenen P. Springer Series in Statistics (1997)
“Nonmetric multidimensional scaling: a numerical method” Kruskal, J. Psychometrika, 29 (1964)
“Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis” Kruskal, J. Psychometrika, 29, (1964)
If you're looking for eigenvalues per classical MDS then it's not hard to get them yourself. The steps are:
Get your distance matrix. Then square it.
Perform double-centering.
Find the eigenvalues and eigenvectors.
Select the top k eigenvalues.
Your ith principal component is sqrt(eigenvalue_i) * eigenvector_i.
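In symbols (my summary of the standard classical-MDS formulation, not something taken from the scikit-learn docs): with D2 the element-wise squared distance matrix and J = I - (1/n) * 1 * 1^T the centering matrix, compute B = -1/2 * J * D2 * J, take the eigendecomposition B = V * Lambda * V^T, and the k-dimensional coordinates are X_k = V_k * Lambda_k^(1/2), built from the k largest eigenvalues and their eigenvectors.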
See below for code example:
import numpy as np
import numpy.linalg as la
import pandas as pd

# get some distance matrix
df = pd.read_csv("http://rosetta.reltech.org/TC/v15/Mapping/data/dist-Aus.csv")
A = df.values.T[1:].astype(float)
# square it element-wise
A = A**2
# centering matrix J_c = I - (1/n) * ones((n, n))
n = A.shape[0]
J_c = 1./n*(np.eye(n) - 1 + (n-1)*np.eye(n))
# perform double centering
B = -0.5*(J_c.dot(A)).dot(J_c)
# find eigenvalues and eigenvectors (eig does not sort them, so sort descending)
eigen_val, eigen_vec = la.eig(B)
eigen_vec = eigen_vec.T
order = np.argsort(eigen_val)[::-1]
eigen_val, eigen_vec = eigen_val[order], eigen_vec[order]
# select top 2 dimensions (for example)
PC1 = np.sqrt(eigen_val[0])*eigen_vec[0]
PC2 = np.sqrt(eigen_val[1])*eigen_vec[1]
I'm trying to write a spectral clustering algorithm using NumPy/SciPy for larger (but still tractable) systems, making use of SciPy's sparse linear algebra library. Unfortunately, I'm running into stability issues with eigsh().
Here's my code:
import numpy as np
import scipy.sparse
import scipy.sparse.linalg as SLA
import sklearn.utils.graph as graph
W = self._sparse_rbf_kernel(self.X_, self.datashape)
D = scipy.sparse.csc_matrix(np.diag(np.array(W.sum(axis = 0))[0]))
L = graph.graph_laplacian(W) # D - W
vals, vects = SLA.eigsh(L, k = self.k, M = D, which = 'SM', sigma = 0, maxiter = 1000)
The sklearn library refers to the scikit-learn package, specifically this method for calculating a graph laplacian from a sparse SciPy matrix.
_sparse_rbf_kernel is a method I wrote to compute pairwise affinities of the data points. It builds a sparse affinity matrix from image data by computing affinities only within the 8-neighborhood of each pixel (rather than for all pixel pairs, as scikit-learn's rbf_kernel method does; for the record, that doesn't fix this either).
Since the laplacian is unnormalized, I'm looking for the smallest eigenvalues and corresponding eigenvectors of the system. I understand that ARPACK is ill-suited for finding small eigenvalues, but I'm trying to use shift-invert to find these values and am still not having much success.
With the above arguments (specifically, sigma = 0), I get the following error:
RuntimeError: Factor is exactly singular
With sigma = 0.001, I get a different error:
scipy.sparse.linalg.eigen.arpack.arpack.ArpackNoConvergence: ARPACK error -1: No convergence (1001 iterations, 0/5 eigenvectors converged)
I've tried all three different values for mode with the same result. Any suggestions for using the SciPy sparse library for finding small eigenvalues of a large system?
You should use which='LM': in shift-invert mode, this parameter refers to the transformed eigenvalues (as explained in the documentation).
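Applied to the snippet in the question, the call becomes something like the sketch below. It reuses L, D, and self.k from above; the small nonzero sigma is my assumption, since with sigma=0 the factorization of the exactly singular Laplacian is what raises the "Factor is exactly singular" error.

# which='LM' now refers to the largest eigenvalues of the shift-inverted
# operator, i.e. the eigenvalues of (L, D) closest to sigma
vals, vects = SLA.eigsh(L, k=self.k, M=D, which='LM', sigma=1e-6, maxiter=1000)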