Use affinity or adjacency matrix for Spectral Clustering? - python

I was wondering whether there is a reason why the adjacency matrix is commonly used in Spectral Clustering instead of the affinity matrix. As far as I understand, an affinity matrix is simply a weighted adjacency matrix, right?
So how come there are such severe differences when applying Spectral Clustering? Shouldn't the underlying structure of both be the same?
And which one is more suitable for Spectral Clustering, and why? I was using this code to construct an affinity matrix, and the nx.adjacency_matrix function for the other one.
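For reference, a minimal sketch of the two constructions being compared; the karate-club graph, the random node features, and the gamma value below are placeholders, not from the original post:

import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.cluster import SpectralClustering

# placeholder graph and node features; replace with your own data
G = nx.karate_club_graph()
X = np.random.rand(G.number_of_nodes(), 2)

# unweighted (binary) adjacency matrix
A = nx.adjacency_matrix(G).toarray()

# weighted affinity matrix, here an RBF kernel on the node features
W = rbf_kernel(X, gamma=1.0)

# both can be passed as a precomputed affinity
labels_adj = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(A)
labels_aff = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(W)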
Also, do you know of any algorithm or method to evaluate the performance of my Spectral Clustering algorithm?
Thanks in advance
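A minimal sketch of two common evaluation routes, on placeholder data (X, labels, and y_true below are all stand-ins, not from the original post):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score, adjusted_rand_score

X = np.random.rand(100, 4)                      # placeholder data
labels = SpectralClustering(n_clusters=3, affinity="rbf").fit_predict(X)
y_true = np.random.randint(0, 3, size=100)      # placeholder ground-truth labels

# internal metric: needs no ground truth; ranges from -1 to 1, higher is better
print(silhouette_score(X, labels))

# external metric: only usable when true labels are known
print(adjusted_rand_score(y_true, labels))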

Related

K-means with Cosine similarity in Python

I have embedded vectors that I clustered with K-means using cosine similarity.
When I used Spark to do so, I had no problem. Now I wish to convert it into Python.
1. Is it possible to do so with scikit-learn? I couldn't find this option; it seems to support only Euclidean distance, is that correct?
2. Do you think that scaling all vectors to be on the unit sphere and then running sklearn's KMeans is identical to running k-means with cosine similarity? (See the sketch after this list.)
3. If 2 is correct, then for inference on new points, should I just scale them to the unit sphere and use sklearn's KMeans inference?
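A minimal sketch of points 2 and 3, on placeholder data. One caveat: sklearn's KMeans computes plain means as centroids and does not re-normalize them, so this is a close approximation of spherical k-means rather than an exact equivalent.

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

X = np.random.rand(100, 16)          # placeholder embedding vectors
X_unit = normalize(X)                # project onto the unit sphere (L2 norm = 1)

km = KMeans(n_clusters=5, n_init=10).fit(X_unit)

# inference on new points: apply the same normalization first
x_new = normalize(np.random.rand(1, 16))
print(km.predict(x_new))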

Is there an efficient python implementation of spectral clustering for large, dense matrices?

Currently I'm using the spectral clustering method from sklearn on my dense 7000x7000 matrix, which performs very slowly, exceeding an execution time of 6 hours. Is there a faster implementation of spectral clustering in Python?
I'd recommend performing PCA to project the data to a lower dimensionality, and then running mini-batch k-means.
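A minimal sketch of that suggestion, assuming X holds the raw feature vectors; the shapes, component count, and cluster count below are arbitrary placeholders:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(7000, 500)         # placeholder for the original feature matrix

# reduce dimensionality first; 50 components is an arbitrary choice
X_low = PCA(n_components=50).fit_transform(X)

# mini-batch k-means scales much better than spectral clustering
labels = MiniBatchKMeans(n_clusters=10, batch_size=1024).fit_predict(X_low)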

Affinity Propagation method selection

I am using the sklearn affinity propagation algorithm as below.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
I also have a similarity matrix created for the data I am using, and now I want to use that similarity matrix in the affinity propagation model.
In sklearn there are different methods for this, such as fit, fit_predict, and predict, so I'm not sure which one to use.
Is it correct if I use
affprop.fit(my_similarity_matrix)
Please suggest which one suits my case best.
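For what it's worth, a minimal sketch of the precomputed route; S below is a random placeholder similarity matrix:

import numpy as np
from sklearn.cluster import AffinityPropagation

S = np.random.rand(50, 50)       # placeholder similarity matrix (higher = more similar)
S = (S + S.T) / 2                # similarity matrices are typically symmetric

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
labels = affprop.fit_predict(S)  # fit(S) followed by reading affprop.labels_ is equivalent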

KNN when using a precomputed affinity matrix in Scikit's spectral clustering?

I have a similarity matrix that I have calculated between a large number of objects, and each object can have a non-zero similarity with any other object. I generated this matrix for another task, and would now like to cluster it for a new analysis.
It seems like scikit's spectral clustering method could be a good fit, because I can pass in a precomputed affinity matrix. I also know that spectral clustering typically uses some number of nearest neighbors when building the affinity matrix, and my similarity matrix does not have that same constraint.
If I pass in a matrix that allows any number of edges between nodes in the affinity matrix, will scikit limit each node to having only a certain number of nearest neighbors? If not, I guess I will have to make that change to my pre-computed affinity matrix.
You don't have to compute the affinity yourself to do spectral clustering; sklearn does that for you.
When you call sc = SpectralClustering(), the affinity parameter allows you to choose the kernel used to compute the affinity matrix. rbf is the default kernel and doesn't use a particular number of nearest neighbours. However, if you decide to choose another kernel, you may want to specify that number with the n_neighbors parameter.
You can then use sc.fit_predict(your_matrix) to compute the clusters.
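A minimal sketch of the API this answer describes, on placeholder data; note that in scikit-learn the parameter is spelled n_neighbors:

import numpy as np
from sklearn.cluster import SpectralClustering

X = np.random.rand(100, 4)            # placeholder feature matrix

# default: sklearn builds the affinity itself with an RBF kernel
sc = SpectralClustering(n_clusters=3, affinity="rbf")
labels = sc.fit_predict(X)

# alternative: k-nearest-neighbours affinity, where n_neighbors matters
sc_knn = SpectralClustering(n_clusters=3, affinity="nearest_neighbors", n_neighbors=10)
labels_knn = sc_knn.fit_predict(X)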
Spectral clustering does not require a sparsified matrix.
But if I'm not mistaken, it's faster to find the eigenvectors for the smallest non-zero eigenvalues of a sparse matrix than of a dense one. The worst case may remain O(n^3) though; spectral clustering is one of the slowest methods you can find.

Memory Efficient Agglomerative Clustering with Linkage in Python

I want to cluster 2D points (latitude/longitude) on a map. The number of points is 400k, so the input matrix would be 400k x 2.
When I run scikit-learn's AgglomerativeClustering I run out of memory, even though the machine has about 500GB of RAM.
class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, n_components=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean>)
I also tried the memory=Memory(cachedir) option with no success. Does anybody have a suggestion (another library or a change to the scikit code) that would let me run the clustering algorithm on this data?
I have run the algorithm successfully on small datasets.
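One commonly suggested way to cut the memory use (an assumption on my part, not from the thread) is to restrict merges to a sparse k-nearest-neighbour connectivity graph; a minimal sketch on placeholder coordinates:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

points = np.random.rand(400_000, 2)   # placeholder latitude/longitude pairs

# sparse connectivity graph: each point may only merge with its neighbours
conn = kneighbors_graph(points, n_neighbors=10, include_self=False)

model = AgglomerativeClustering(n_clusters=100, linkage="ward", connectivity=conn)
labels = model.fit_predict(points)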
