I have a similarity matrix that I have calculated between a large number of objects, and each object can have a non-zero similarity with any other object. I generated this matrix for another task, and would now like to cluster it for a new analysis.
It seems like scikit's spectral clustering method could be a good fit, because I can pass in a precomputed affinity matrix. I also know that spectral clustering typically uses some number of nearest neighbors when building the affinity matrix, and my similarity matrix does not have that same constraint.
If I pass in a matrix that allows any number of edges between nodes in the affinity matrix, will scikit limit each node to having only a certain number of nearest neighbors? If not, I guess I will have to make that change to my pre-computed affinity matrix.
You don't have to compute the affinity yourself to do spectral clustering; sklearn does that for you.
When you call sc = SpectralClustering(), the affinity parameter lets you choose the kernel used to compute the affinity matrix. rbf is the default kernel and doesn't use a particular number of nearest neighbours. However, if you choose the nearest_neighbors affinity instead, you can specify that number with the n_neighbors parameter.
You can then use sc.fit_predict(your_matrix) to compute the clusters.
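A minimal sketch of both routes (the toy data, n_clusters=3, and the exp(-squared-distance) similarity are arbitrary choices, not anything from your setup):

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
X = rng.random((100, 5))                      # toy feature matrix

# Default affinity: rbf kernel, a dense affinity matrix with no
# nearest-neighbour sparsification
sc = SpectralClustering(n_clusters=3, affinity="rbf")
labels = sc.fit_predict(X)

# With your own (n, n) similarity matrix, pass the matrix itself
similarity = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
sc_pre = SpectralClustering(n_clusters=3, affinity="precomputed")
labels_pre = sc_pre.fit_predict(similarity)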
Spectral clustering does not require a sparsified matrix.
But if I'm not mistaken, it's faster to find the smallest non-zero eigenvectors of a sparse matrix than of a dense one. The worst case may remain O(n^3) though - spectral clustering is one of the slowest methods you can find.
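A rough illustration of that point (the random symmetric matrix below is only a stand-in for a real graph Laplacian): scipy's eigsh extracts just the few smallest eigenpairs from a sparse matrix, instead of doing a full dense eigendecomposition.

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

# Stand-in for a sparse graph Laplacian: random sparse symmetric matrix
A = sparse.random(2000, 2000, density=0.01, format="csr", random_state=0)
L = A + A.T

# Spectral clustering only needs the few smallest eigenpairs;
# eigsh exploits the sparsity, unlike a full numpy.linalg.eigh
vals, vecs = eigsh(L, k=5, which="SA")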
I was wondering whether there is a reason why the adjacency matrix is commonly used in Spectral Clustering instead of the affinity matrix. As far as I understand, an affinity matrix is just a weighted adjacency matrix, right?
So how come there are such severe differences when applying Spectral Clustering? Shouldn't the underlying structure of both be the same?
And which one is more suitable for Spectral Clustering, and why? I was using this code to construct an affinity matrix and the nx.adjacency_matrix function for the other one.
Also, do you know of any algorithm or method to evaluate the performance of my Spectral Cluster algorithm?
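To make the comparison concrete, here is a small sketch of what I mean (the karate club graph and the random node features are just placeholders, and rbf_kernel stands in for my real affinity construction); silhouette_score seems to be one built-in way to evaluate the clusters without ground truth:

import networkx as nx
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import rbf_kernel

G = nx.karate_club_graph()
A = nx.adjacency_matrix(G).toarray()       # nonzero only where an actual edge exists

X = np.random.rand(A.shape[0], 3)          # placeholder node features
W = rbf_kernel(X)                          # dense: every pair of nodes gets a weight

labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(W)
print(silhouette_score(X, labels))         # internal evaluation, no ground truth needed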
Thanks in advance
I'd like to (efficiently) evaluate a Gaussian mixture model (GMM) over an (n,d) list of datapoints, given the GMM parameters ($\pi_k, \mu_k, \Sigma_k$). I can't find a way to do this using standard sklearn or scipy packages.
EDIT: assume there are n datapoints of dimension d, so the data is (n,d), and the GMM has k components; for example the covariance matrix of the k-th component, $\Sigma_k$, is (d,d), and altogether $\Sigma$ is (k,d,d).
For example, if you first fit a GMM in sklearn, you can call score_samples, but that only works after fitting to data. Or, in scipy, you can run a for-loop over multivariate_normal.pdf with each set of parameters and do a weighted sum/dot product, but this is slow. Checking the source code of either was not illuminating (for me).
I'm currently hacking something together with n-d arrays and tensor dot products .. oy .. hoping someone has a better way?
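For concreteness, the scipy version I mean looks roughly like this (gmm_log_density is just my own helper name, not a library function): it loops over the k components but stays vectorized over the n points, then combines them with logsumexp.

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_density(X, weights, means, covs):
    # X: (n, d); weights: (k,); means: (k, d); covs: (k, d, d)
    # One logpdf call per component, each vectorized over all n points
    log_probs = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ], axis=1)                            # (n, k)
    return logsumexp(log_probs, axis=1)   # (n,) log density of the mixture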
I have an np.array with 400 entries, each containing the values of a spectrum with 1000 points.
I want to identify the n most interesting indices of the spectrum and return them, so I can visualize them and use them as an input vector for my classifier.
Is it best to calculate the variance, apply a PCA or are there better-suited algorithms?
And how do I compute the accounted variance for that selection?
Thanks
Dimensionality reduction can be performed in two broad ways: feature extraction and feature selection. PCA is more suited to feature extraction, while "identify the n most interesting indices" is a feature selection problem. More details and how to code it up here: sklearn.feature_selection
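For example, a plain variance-ranking sketch (the random spectra are a placeholder for your array, and n=20 is chosen arbitrarily; the selectors in sklearn.feature_selection such as SelectKBest additionally need class labels):

import numpy as np

# Placeholder: 400 spectra with 1000 points each
spectra = np.random.rand(400, 1000)

# Rank the 1000 spectral indices by their variance across the 400 spectra
variances = spectra.var(axis=0)
n = 20
top_n_idx = np.argsort(variances)[-n:]   # indices of the n highest-variance points

# Fraction of the total per-feature variance accounted for by the selection
accounted = variances[top_n_idx].sum() / variances.sum()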
I am using sklearn affinity propagation algorithm as below.
affprop = sklearn.cluster.AffinityPropagation(affinity="precomputed", damping=0.5)
I also have a similarity matrix created for the data I am using. Now I want to use my similarity matrix in the affinity propagation model.
In sklearn there are different methods for this, such as fit, fit_predict, and predict, so I'm not sure which to use.
Is it correct if I use,
affprop.fit(my_similarity_matrix)
Could you suggest which of these suits my case best?
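For clarity, something like this is what I have in mind (the toy data and the negative squared distances are only a stand-in for my real similarity matrix):

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import euclidean_distances

# Toy stand-in for a precomputed similarity matrix
X = np.random.rand(30, 4)
my_similarity_matrix = -euclidean_distances(X, squared=True)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)

# fit_predict fits the model and returns one cluster label per object;
# calling fit and then reading affprop.labels_ would be equivalent
labels = affprop.fit_predict(my_similarity_matrix)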
I have an array of TF-IDF feature vectors. I'd like to find similar vectors in the array using two methods:
Cosine similarity
k-means clustering
Using Scikit Learn, this process is pretty simple.
Now I'd like to weight certain features so that they will influence the results more than the other features. For example, I might like to weight the first 100 elements of the TF-IDF vectors so that those features are more indicative of similarity than the rest of the features.
How can I meaningfully weight certain features in my feature vectors? Is the process for weighting certain features the same for each of the similarity algorithms I listed above?
As I understand it, low values in the TF-IDF matrix mean that the corresponding words are less significant. So one approach is to lower the values in the columns you consider less important (or, equivalently, raise the values in the columns you want to emphasise).
The matrices scikit-learn produces are sparse, so for testing and debugging you might want to convert them to a regular (dense) matrix. I also used xlsxwriter to get an overview of what is really happening when applying TF-IDF and KMeans++ (see https://www.dbc-enterprise-it-consulting.com/text-classifier/).
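A rough sketch of that column-scaling idea (the toy corpus and the factor 2.0 are placeholders; scaling columns this way affects both cosine similarities and k-means distances):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["the cat sat on the mat", "the dog ate my homework",
        "the cat chased the dog"]                 # toy corpus
X_tfidf = TfidfVectorizer().fit_transform(docs)   # sparse (n_docs, n_features)

# Scale up the columns you want to emphasise before computing cosine
# similarity or running k-means; the factor 2.0 is arbitrary.
# (A real vocabulary would have more than 100 features; this toy one doesn't,
# so here the slice simply covers every column.)
weights = np.ones(X_tfidf.shape[1])
weights[:100] = 2.0
X_weighted = normalize(X_tfidf.multiply(weights).tocsr())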