I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
With sklearn.mixture.GaussianMixture I can find the centroids (the "mean of each mixture component") via .means_. In my code I then take the point closest to each mean to get a representative sample for each cluster.
I want to do the same thing with SpectralClustering, but I don't see a .means_ attribute or any other way to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works, or simply a missing feature in this library.
As an example I would like to do:
sc = SpectralClustering(n_components=10, n_init=100)
sc.fit(data)
closest, _ = pairwise_distances_argmin_min(sc.means_, data)
But of course SpectralClustering doesn't have a .means_ attribute.
Thanks for any help on this.
Centroids are used in the KMeans algorithm. For spectral clustering, the estimator only stores the affinity matrix and the labels obtained from the algorithm.
It doesn't matter if Spectral Clustering (or any other clustering algorithm) uses the cluster centers or not!
You can compute the centroid of any cluster! It is the mean of the elements in that cluster (well, there is actually one constraint: the dataset itself must allow the notion of a mean).
So, compute the clusters using Spectral Clustering. For each cluster, compute the mean of the elements inside it (that is, the per-dimension mean for a cluster comprising m n-dimensional elements).
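For example, a minimal sketch along those lines (the toy data and n_clusters=10 are just placeholders; substitute your own array and settings):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin_min

# Toy data standing in for the unlabeled dataset.
data, _ = make_blobs(n_samples=300, centers=10, random_state=0)

sc = SpectralClustering(n_clusters=10, n_init=100)
labels = sc.fit_predict(data)

# "Centroid" of each cluster = mean of the points assigned to it.
centroids = np.array([data[labels == k].mean(axis=0) for k in np.unique(labels)])

# Closest actual sample to each centroid, mirroring the GaussianMixture .means_ workflow.
closest, _ = pairwise_distances_argmin_min(centroids, data)
representatives = data[closest]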
Related
To my understanding, agglomerative hierarchical clustering starts by clustering the points that are closest to each other. For comparison, I am trying to get different clustering results where only a certain percentage of the data has been clustered, e.g. 40%, 50%, 60%, ...
So I need a way to terminate the hierarchical clustering (Ward's) algorithm in sklearn after it has clustered a specified percentage of the data points. For example, stop clustering after 60% of the dataset has been clustered.
What would be the best way to do this?
Based on the Scikit-learn documentation:
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.
Hence, you can do "early stopping" by specifying a number of clusters and setting the compute_full_tree parameter appropriately (as described in the API). From the number of clusters obtained when running the algorithm with the full tree computed, you could then define ratios of the number of clusters.
What remains is to find the relation between the number of clusters and the proportion of data that has been clustered; but this is probably the most straightforward way to do what you want without modifying the Agglomerative Clustering algorithm itself.
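As a rough sketch of that last step, and assuming "clustered" means "ends up in a cluster with at least two members" (that interpretation is my own), you could scan n_clusters and measure the fraction of points in non-singleton clusters:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data standing in for your dataset.
X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

def fraction_clustered(X, n_clusters):
    """Fraction of points in a non-singleton cluster when merging stops at n_clusters."""
    labels = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward').fit_predict(X)
    sizes = np.bincount(labels)
    return (sizes[labels] >= 2).mean()

# Increase n_clusters until roughly 60% of the points remain in non-singleton clusters.
for k in range(2, len(X), 25):
    frac = fraction_clustered(X, k)
    if frac <= 0.6:
        print(k, frac)
        break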
I am a bit confused about clustering, e.g. K-means clustering.
I have already created clusters from the training data, and in the testing part I want to know whether the new points belong to one of the existing clusters or not.
My idea is to find the center of each cluster and the point farthest from it within that cluster in the training data; then, in the testing part, if the distance of a new point to a center is greater than a threshold (e.g. 1.5x the distance to the farthest point), it cannot be in that cluster.
Is this idea efficient and correct, and is there a Python function to do this?
One more question:
Could someone help me understand the difference between kmeans.fit() and kmeans.predict()? I get the same result from both!
I appreciate any help
In general, when you fit the K-means algorithm, you get the cluster centers as a result.
So, if you want to test which cluster a new point belongs to, you calculate the distance from the point to each cluster center and label the point with the label of the closest center.
If you are using the scikit-learn library:
predict(X) predicts the closest cluster each sample in X belongs to.
fit(X) fits the model to the data, i.e. computes the cluster centers.
The scikit-learn documentation has a nice example of how to use K-means.
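A rough sketch of the threshold idea from the question (the 1.5x factor and the toy data are just placeholders, not a recommendation):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy training and test data standing in for your own arrays.
X_train, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X_test, _ = make_blobs(n_samples=20, centers=3, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)

# Per-cluster radius = distance from each center to its farthest training point.
train_dist = np.min(kmeans.transform(X_train), axis=1)   # distance to the assigned center
radius = np.array([train_dist[kmeans.labels_ == k].max() for k in range(3)])

# For new points: the assigned cluster and the distance to that cluster's center.
test_labels = kmeans.predict(X_test)
test_dist = kmeans.transform(X_test)[np.arange(len(X_test)), test_labels]

# Accept a point only if it lies within 1.5x the cluster's training radius.
inside = test_dist <= 1.5 * radius[test_labels]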
I have a dataset of users and their music plays, with every play having location data. For every user I want to cluster their plays to see if they play music in given locations.
I plan on using the scikit-learn k-means package, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, k-medoids aka. PAM for arbitrary other distance functions).
The concept of k-means is variance minimization, and variance is essentially the same as squared Euclidean distance; it is not equivalent to other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
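If the locations are latitude/longitude pairs, one possible approach (my suggestion, not something from the original question) is DBSCAN with the haversine metric on coordinates converted to radians, with eps expressed as a distance divided by the Earth's radius:

import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (lat, lon) play locations for one user, in degrees.
coords_deg = np.array([
    [52.5200, 13.4050],
    [52.5201, 13.4049],
    [48.8566, 2.3522],
    [48.8570, 2.3530],
])

# Haversine works on radians; eps is a great-circle distance over the Earth's radius.
coords_rad = np.radians(coords_deg)
eps_km = 1.0
earth_radius_km = 6371.0

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(coords_rad)   # -1 marks noise, other labels are location clusters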
Is the data already in a vector space, e.g. GPS coordinates? If so, you can cluster on it directly; lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector-space format (a table lookup from locations to coordinates, for instance). Euclidean distance is a good choice for vector-space data.
To answer the question of whether they played music in a given location, you first fit your kmeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then you check whether any of those cluster centers are close enough to the locations you are checking for. This can be done using thresholding on the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but I can provide an example given arbitrary x and y coords instead if that's what you want.
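For instance, with made-up x/y coordinates and made-up places of interest (everything here, including the threshold, is a placeholder):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Made-up x/y play locations for a single user.
plays = np.array([[0.1, 0.2], [0.15, 0.22], [5.0, 5.1], [5.05, 5.0], [9.9, 0.1]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(plays)
centers = kmeans.cluster_centers_

# Locations you want to check, and a distance threshold for "close enough".
places_of_interest = np.array([[0.0, 0.0], [5.0, 5.0]])
threshold = 0.5

# True where some cluster center falls within the threshold of a place of interest.
near = cdist(places_of_interest, centers) <= threshold
plays_music_there = near.any(axis=1)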
Also note that KMeans is probably not ideal, since you have to set the number of clusters "k" manually and it could vary between people, or else add wrapper code around KMeans to determine "k". Other clustering models, such as MeanShift, can determine the number of clusters automatically; that may be a better fit here, and it also gives you cluster centers.
My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words. In other words, I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and ideally its Python library) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and see which words are clustered together.
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many algorithms need the number of clusters.
DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices.
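A minimal sketch of that with DBSCAN (the tiny matrix and the eps value are placeholders you would replace with your Jaccard distances and a tuned threshold):

import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for your precomputed Jaccard distance matrix (symmetric, zero diagonal).
dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

db = DBSCAN(eps=0.3, min_samples=2, metric='precomputed')
labels = db.fit_predict(dist)   # -1 = noise, other integers = cluster ids for the words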
You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.
The scipy clustering package could be useful (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note however that those require a condensed matrix as input (the upper triangular of the distance matrix). Hopefully the documentation pages will help you along.
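For example, something along these lines should work, assuming you already have a square distance matrix (the cut threshold is a placeholder):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Square, symmetric distance matrix standing in for your data.
D = np.array([
    [0.0, 0.3, 2.0, 2.1],
    [0.3, 0.0, 2.2, 2.0],
    [2.0, 2.2, 0.0, 0.4],
    [2.1, 2.0, 0.4, 0.0],
])

# scipy's hierarchy functions expect the condensed (upper-triangular) form.
condensed = squareform(D)
Z = linkage(condensed, method='average')

# Cut the dendrogram at a chosen distance to get flat cluster labels.
labels = fcluster(Z, t=1.0, criterion='distance')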
I recommend taking a look at agglomerative clustering.
There are a myriad of options in the scipy clustering module, and I'd like to be sure that I'm using them correctly. I have a symmetric distance matrix DR, and I'd like to find all clusters such that any point in the cluster has a neighbor at a distance of no more than 1.2.
L = linkage(DR,method='single')
F = fcluster(L, 1.2)
In linkage, I'm pretty sure single is what I want (the Nearest Point Algorithm). However for fcluster, I think I want the default, ‘inconsistent’, method:
‘inconsistent’: If a cluster node and all its descendants have an inconsistent value less than or equal to t then all its leaf descendants belong to the same flat cluster. When no non-singleton cluster meets this criterion, every node is assigned to its own cluster. (Default)
But maybe it's the ‘distance’ method:
‘distance’: Forms flat clusters so that the original observations in each flat cluster have no greater a cophenetic distance than t.
... I'm not sure. Which one should I use? What does cophenetic distance mean in this context?
You might want to look at DBSCAN. See the Wikipedia article on it. It looks like you are looking for the output of DBSCAN with minPts=1 and epsilon=1.2.
It's fairly simple to implement judging from the pseudocode on Wikipedia, in particular since you already seem to have a distance matrix. Just do it yourself.
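If you'd rather lean on existing libraries, a hedged sketch: with min_samples=1 there are no noise points, so scikit-learn's DBSCAN on the precomputed matrix should give the same partition as cutting the single-linkage tree at 1.2 with criterion='distance':

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import DBSCAN

# Small symmetric matrix standing in for DR.
DR = np.array([
    [0.0, 1.0, 3.0, 3.5],
    [1.0, 0.0, 3.2, 3.0],
    [3.0, 3.2, 0.0, 1.1],
    [3.5, 3.0, 1.1, 0.0],
])

# Single linkage, then cut so every within-cluster link is <= 1.2.
L = linkage(squareform(DR), method='single')
F = fcluster(L, t=1.2, criterion='distance')

# DBSCAN on the same matrix; clusters are the connected components
# of the "distance <= 1.2" neighbor graph.
db = DBSCAN(eps=1.2, min_samples=1, metric='precomputed').fit(DR)

print(F, db.labels_)   # the same grouping, up to label renaming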