Hierarchical clustering termination - python

To my understanding, agglomerative hierarchical clustering starts by merging the points that are closest to each other. I am trying to obtain the clustering results at stages where only a certain percentage of the data has been clustered, for comparison, e.g. 40%, 50%, 60%...
So I need a way to terminate the hierarchical clustering (Ward's) algorithm in sklearn after it has clustered a specified percentage of the data points. For example, stop clustering after 60% of the dataset has been clustered.
What would be the best way to do this?

Based on the Scikit-learn documentation:
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.
Hence, you can do "early stopping" by specifying a number of clusters and setting the compute_full_tree parameter appropriately (as described in the API). From the number of clusters obtained when running the algorithm with the full tree computed, you could then define ratios of the number of clusters.
It then remains to find the relation between the number of clusters and the proportion of data that has been clustered, but this is probably the most straightforward way to do what you want without modifying the actual agglomerative clustering algorithm.
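A minimal sketch of that idea, assuming the data is a NumPy array X and interpreting "clustered" as belonging to a cluster with more than one member (the n_clusters values and the random data are only illustrative):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(100, 2)  # placeholder for the real data

for n_clusters in (60, 50, 40):  # illustrative values only
    model = AgglomerativeClustering(
        n_clusters=n_clusters,
        linkage="ward",
        compute_full_tree=False,  # stop merging once n_clusters is reached
    )
    labels = model.fit_predict(X)
    cluster_sizes = np.bincount(labels)
    # Treat a point as "clustered" once it sits in a cluster with more than one member.
    clustered_fraction = (cluster_sizes[labels] > 1).mean()
    print(n_clusters, round(clustered_fraction, 2))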

Hierarchical Clustering Threshold

I'm using scikit-learn's agglomerative hierarchical clustering module to obtain clusters of a three-million-cell geographical hexagrid, using contiguity constraints and Ward affinity.
My question is related to the resulting number of clusters that the routine returns. I know that Ward's affinity minimizes the sum of the within-cluster variances. This is trivially minimized when a cluster is just one observation, right? That is why I assume the algorithm's starting point is just an arbitrarily large value for this variance, so that including the nearest contiguous observation (given that it is within the max distance threshold) decreases the variance of the group, and it continues this way until it reaches the top of the tree.
My question is: what is the criterion used to return the cluster labels? From what I have read, it seems the optimum number of clusters is given by the biggest jump in the tree, but I'm not sure if this is the criterion the developers are using. Does anyone know?
Ideally I could check by plotting the tree, but I'm clustering nearly 3 million cells, which makes plotting the tree both messy and infeasible (at least with my computer or the cluster I have access to).
Thanks
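One way to inspect the merge heights numerically, without drawing a dendrogram with millions of leaves, is sketched below; it assumes scikit-learn >= 0.24 (which adds compute_distances) and uses random placeholder data. Note that scikit-learn itself simply cuts the tree at the requested n_clusters (or distance_threshold) rather than detecting the biggest jump automatically.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(1000, 2)  # placeholder for the hexagrid data

model = AgglomerativeClustering(
    n_clusters=2,            # labels come from cutting the tree at n_clusters
    linkage="ward",
    compute_distances=True,  # store merge distances without plotting anything
)
model.fit(X)

# distances_ holds the merge height of each internal node of the tree.
heights = np.sort(model.distances_)
gaps = np.diff(heights)
print("largest jump occurs after merge", int(np.argmax(gaps)), "of", len(heights))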

Is there a way to assign a maximum number of clusters using DBSCAN?

If I am trying to cluster my data using DBSCAN, is there a way to assign a maximum number of clusters? I know I can set the minimum distance between points to be considered a cluster, but my data changes case by case and I would prefer to not allow more than 4 clusters. Any suggestions?
Not with DBSCAN itself. Connected components are connected components; there is no ambiguity at this point.
You could write your own rules to extract the X most significant clusters from an OPTICS reachability plot, though. OPTICS is a more flexible, variable-density formulation of DBSCAN.
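A sketch of that idea using scikit-learn's OPTICS implementation (min_samples and the eps sweep are only illustrative, and the data is a random placeholder):

import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.random.RandomState(0).rand(200, 2)  # placeholder data

optics = OPTICS(min_samples=5).fit(X)

labels, n_clusters = None, 0
for eps in np.linspace(0.05, 1.0, 20):  # sweep from fine to coarse (illustrative range)
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # ignore noise
    if 1 <= n_clusters <= 4:
        break  # keep the finest extraction that has at most 4 clusters
print("eps:", round(float(eps), 3), "clusters:", n_clusters)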

Finding cluster centroid or ".means_" with sklearn.cluster.SpectralClustering

I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
I am able to find the centroids, i.e. the "mean of each mixture component", in sklearn.mixture.GaussianMixture using .means_. In my code I then take the point that is closest to each mean to get a representative sample of each cluster.
I want to do the same thing with SpectralClustering, but I don't see a .means_ attribute or any other way to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works, or just a missing feature in this library.
As an example I would like to do:
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

sc = SpectralClustering(n_components=10, n_init=100)
sc.fit(data)
closest, _ = pairwise_distances_argmin_min(sc.means_, data)
But of course SpectralClustering doesn't have a .means_ attribute.
Thanks for any help on this.
Centroids are used by the KMeans algorithm. For spectral clustering, the model only stores the affinity matrix and the labels obtained from the algorithm.
It doesn't matter whether Spectral Clustering (or any other clustering algorithm) uses cluster centers or not!
You can compute the centroid of any cluster: it is simply the mean of the elements in that cluster (well, there is actually a constraint: the dataset itself must allow a notion of mean).
So, compute the clusters using Spectral Clustering. For each cluster, compute the mean of the elements inside it (that is, the mean on every dimension for a cluster comprised of m n-dimensional elements).
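A minimal sketch of that approach, assuming the data fits in a NumPy array (n_clusters=10 only mirrors the intent of the snippet in the question, and the data is a random placeholder):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

data = np.random.RandomState(0).rand(100, 5)  # placeholder data

sc = SpectralClustering(n_clusters=10, n_init=100)
labels = sc.fit_predict(data)

# Centroid of each cluster = mean of its members, computed per dimension.
centroids = np.vstack([data[labels == k].mean(axis=0) for k in np.unique(labels)])

# Index of the data point closest to each computed centroid.
closest, _ = pairwise_distances_argmin_min(centroids, data)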

How can you compare two cluster groupings in terms of similarity or overlap in Python?

Simplified example of what I'm trying to do:
Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:
100% score for [(A,B),(C)] vs. [(A,B),(C)]
~50% score for [(A,B),(C)] vs. [(A),(B,C)]
~20% score for [(A,B),(C)] vs. [(A,B,C)]
These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example, and in real applications you can have many data points and also more than 2 clusters per cluster grouping. Having such a metric is also useful when trying to compare a cluster grouping to a labeled grouping of data (when you have labeled data).
Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
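For reference, a sketch of that overlap-matrix idea on the toy example, using scikit-learn's contingency_matrix (the integer labels are arbitrary; only the grouping matters):

from sklearn.metrics.cluster import contingency_matrix

labels_kmeans = [0, 0, 1]     # [(A,B),(C)]
labels_meanshift = [0, 1, 1]  # [(A),(B,C)]

# overlap[i, j] = number of points that are in cluster i of the first grouping
# and in cluster j of the second grouping.
overlap = contingency_matrix(labels_kmeans, labels_meanshift)
print(overlap)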
Use evaluation metrics.
Many metrics are symmetric. For example, the adjusted Rand index.
A value close to 1 means the two groupings are very similar, a value close to 0 means the agreement is no better than random, and a value much less than 0 means each cluster of one grouping is "evenly" distributed over the clusters of the other.
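For example, with scikit-learn's adjusted_rand_score on the toy groupings above (the integer labels are arbitrary; identical partitions score exactly 1.0):

from sklearn.metrics import adjusted_rand_score

labels_kmeans = [0, 0, 1]     # [(A,B),(C)]
labels_meanshift = [0, 1, 1]  # [(A),(B,C)]
labels_single = [0, 0, 0]     # [(A,B,C)]

print(adjusted_rand_score(labels_kmeans, labels_kmeans))    # identical groupings -> 1.0
print(adjusted_rand_score(labels_kmeans, labels_meanshift))
print(adjusted_rand_score(labels_kmeans, labels_single))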
Well, determining the number of clusters is a problem in data analysis and a different issue from the clustering problem itself. There are quite a few criteria for this, such as AIC or the Cubic Clustering Criterion. I don't think scikit-learn offers an option to calculate these two by default, but I know that there are packages in R that do.

Computing K-means clustering on Location data in Python

I have a dataset of users and their music plays, with every play having location data. For every user I want to cluster their plays to see whether they play music in particular locations.
I plan on using the scikit-learn KMeans implementation, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, or k-medoids, a.k.a. PAM, for arbitrary other distance functions).
The concept of k-means is variance minimization. And variance is essentially the same as squared Euclidean distance, but it is not equivalent to other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
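If the plays are stored as latitude/longitude pairs, a sketch of DBSCAN with the haversine metric looks like this (the coordinates, eps and min_samples are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN

coords_deg = np.array([[51.5074, -0.1278],
                       [51.5080, -0.1281],
                       [48.8566, 2.3522]])  # placeholder lat/lon pairs in degrees
coords_rad = np.radians(coords_deg)         # haversine expects radians

earth_radius_km = 6371.0
eps_km = 0.5  # group plays within roughly 500 m of each other (illustrative)

db = DBSCAN(
    eps=eps_km / earth_radius_km,  # haversine distances are angles in radians
    min_samples=2,
    metric="haversine",
    algorithm="ball_tree",
)
labels = db.fit_predict(coords_rad)
print(labels)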
Is the data already in vector space, e.g. GPS coordinates? If so you can cluster on it directly; lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector-space format (a table lookup from locations to coordinates, for instance). Euclidean distance is a good choice for working with vector-space data.
To answer the question of whether they played music in a given location, you first fit your KMeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then you check whether any of those cluster centers are close enough to the locations you are interested in. This can be done by thresholding the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but an example with arbitrary x and y coords instead is sketched below.
Also note that KMeans is probably not ideal, as you have to manually set the number of clusters "k", which could vary between users, or add some wrapper code around KMeans to determine "k". There are other clustering models which can determine the number of clusters automatically, such as MeanShift, which may be a better fit in this case and can also tell you cluster centers.
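A sketch of that workflow, with arbitrary x/y coordinates standing in for the real location data (k, the example locations and the distance threshold are only illustrative):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

plays = np.random.RandomState(0).rand(50, 2)              # placeholder play locations
locations_to_check = np.array([[0.2, 0.3], [0.8, 0.9]])   # placeholder locations of interest

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(plays)
centers = kmeans.cluster_centers_

# Distance from each location of interest to every cluster center; a small
# distance suggests the user tends to play music near that location.
distances = cdist(locations_to_check, centers)
plays_nearby = (distances < 0.1).any(axis=1)  # illustrative threshold
print(plays_nearby)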
