I'm using scikit-learn's agglomerative hierarchical clustering to obtain clusters of a three-million-cell geographical hexagrid, using contiguity constraints and Ward linkage.
My question is about the resulting number of clusters that the routine returns. I know that Ward's linkage minimizes the sum of within-cluster variances, and this is trivially minimized when each cluster is a single observation, right? That is why I assume the algorithm's starting point is just an arbitrarily large value for this variance, so that merging in the nearest contiguous observation (provided it is within the max distance threshold) decreases the variance of the group, and it continues this way until it reaches the top of the tree.
My question is: what criterion is used to return the cluster labels? From what I have read, the optimal number of clusters seems to be given by the largest jump in the tree, but I'm not sure whether this is the criterion the developers are using. Does anyone know?
Ideally I could check by plotting the tree, but I'm clustering nearly 3 million cells, which makes plotting the dendrogram both messy and infeasible (at least with my computer or the cluster I have access to).
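For reference, a rough sketch of the kind of call I mean; the connectivity graph and the distance threshold below are placeholders, not my actual setup:

```python
# A rough sketch of the setup described above; the connectivity graph and
# the distance threshold are placeholders, not the real hexagrid inputs.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1000, 2)  # placeholder cell centroids

# Contiguity constraint, approximated here with a k-nearest-neighbours graph.
connectivity = kneighbors_graph(X, n_neighbors=6, include_self=False)

model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.1,  # placeholder "max distance" threshold
    linkage="ward",
    connectivity=connectivity,
)
labels = model.fit_predict(X)
print(model.n_clusters_)     # the number of clusters the routine returns
```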
Thanks
To my understanding, agglomerative hierarchical clustering starts by merging the points that are closest to each other. I am trying to get the different clustering results where only a certain percentage of the data has been clustered, for comparison, e.g. 40%, 50%, 60%...
So I need a way to terminate the hierarchical clustering (Ward's) algorithm in sklearn after it has clustered a specified percentage of the data points; for example, stop clustering after 60% of the dataset has been clustered.
What would be the best way to do this?
Based on the Scikit-learn documentation:
The AgglomerativeClustering object performs a hierarchical clustering using a bottom up approach: each observation starts in its own cluster, and clusters are successively merged together.
Hence, you can do "early stopping" by defining a number of clusters, and appropriately setting the compute_full_tree parameter (as defined in the API). From the number of clusters obtained when running the algorithm with the full tree computation, you could define ratios of the number of clusters.
It remains to find the relation between the number of clusters and the proportion of data that has been clustered, but this is probably the most straightforward way to do what you want without modifying the actual agglomerative clustering algorithm.
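A minimal sketch of that idea, with placeholder data and an arbitrary cluster count:

```python
# A rough sketch of the early-stopping idea above; X and the cluster
# count are placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(1000, 2)  # placeholder data

# Stop merging once 100 clusters remain; compute_full_tree=False keeps
# the algorithm from building the dendrogram past that point.
model = AgglomerativeClustering(
    n_clusters=100,
    linkage="ward",
    compute_full_tree=False,
)
labels = model.fit_predict(X)
```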
If I am trying to cluster my data using DBSCAN, is there a way to assign a maximum number of clusters? I know I can set the minimum distance between points to be considered a cluster, but my data changes case by case and I would prefer to not allow more than 4 clusters. Any suggestions?
Not with DBSCAN itself. Connected components are connected components; there is no ambiguity at this point.
You could write your own rules to extract the X most significant clusters from an OPTICS reachability plot, though. OPTICS is the variable-density generalization of DBSCAN.
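As a rough sketch of where such rules would hook in, assuming scikit-learn's OPTICS implementation and placeholder data:

```python
# A rough sketch, assuming scikit-learn's OPTICS; X and min_samples are placeholders.
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(500, 2)  # placeholder data

optics = OPTICS(min_samples=10).fit(X)

# The reachability plot is what custom extraction rules would operate on,
# e.g. keeping only the 4 deepest valleys to cap the number of clusters.
reachability = optics.reachability_[optics.ordering_]
labels = optics.labels_
```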
I was reading dlib's face clustering code and noticed that the process is like so:
Convert faces to vector using trained network
Use the Chinese whispers clustering algorithm to compute groups based on distance
Chinese whispers clustering can take a pretty long time when trying to cluster a large number (>10,000) of images.
In this pyimagesearch article the author uses DBSCAN, another clustering algorithm, to group a number of images by person.
Since the vectors generated by the neural net can be used to calculate the similarity between two faces, wouldn't it be better to just calculate a Euclidean distance matrix, then search for all the values that meet a confidence threshold (e.g. x < 0.3 for 70% confidence)?
Why use a clustering algorithm at all when you can just compare every face with every other face to determine which ones are the same person? Both DBSCAN and Chinese whispers clustering take a much longer time than calculating a distance matrix. With my dataset of 30,000 images the times are:
C-whisper - 5 minutes
distance matrix + search - 10-20 seconds
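For concreteness, the distance-matrix approach I have in mind looks roughly like this (the descriptor array and the threshold are placeholders):

```python
# A rough sketch of the distance-matrix idea; the descriptor array and the
# 0.3 threshold are placeholders standing in for the network's output.
import numpy as np
from scipy.spatial.distance import pdist, squareform

descriptors = np.random.rand(1000, 128)  # placeholder face descriptors

# Full pairwise Euclidean distance matrix.
dist = squareform(pdist(descriptors, metric="euclidean"))

# Off-diagonal pairs below the threshold are candidate "same person" matches.
matches = np.argwhere((dist < 0.3) & ~np.eye(len(dist), dtype=bool))
```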
DBSCAN actually takes only marginally longer than computing a distance matrix (when implemented right, 99% of the computation is the distance computations) and with indexing can sometimes be much faster because it does not need every pairwise distance if the index can prune computations.
But you can't just "read off" clusters from the distance matrix. The data there may be contradictory: the face detector may consider A and B similar and B similar to C, but A and C dissimilar! What do you do then? Clustering algorithms try to solve exactly such situations. For example, single-link, and to a lesser extent DBSCAN, would put A and C in the same cluster, whereas complete linkage would decide upon either AB or BC.
Actually, dlib's implementation is doing something very similar to what you are thinking of. Here is the code. It first checks every pair and rejects pairs whose distance is greater than a threshold. This is exactly what you proposed. But then it is doing a fine clustering on the result. So, what would change?
Simply cutting off by distances can work if you have well-separated data points. However, if your data points are very close to each other, this problem becomes very hard. Imagine a 1D feature: your data points are the integer positions between 0 and 10, and you want to put two data points in the same cluster if their distance is at most 1.5. So, what would you do? If you start with a pair, you could make a cluster. But if you pick a neighboring point, you will see that it is closer than your threshold to one point already in the cluster and farther than the threshold from the other. Clustering is about resolving this ambiguity.
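To make the ambiguity concrete, here is a toy sketch of that 1D example using SciPy's hierarchical clustering (not dlib's code), cutting the tree at 1.5:

```python
# A toy illustration of the ambiguity above: the integer points 0..10 in 1D,
# clustered with SciPy and cut at a distance of 1.5.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.arange(11, dtype=float).reshape(-1, 1)
dists = pdist(points)

# Single linkage chains everything into one cluster, because each point is
# within 1.5 of something already merged ...
single = fcluster(linkage(dists, method="single"), t=1.5, criterion="distance")

# ... while complete linkage has to decide which neighbours stay together.
complete = fcluster(linkage(dists, method="complete"), t=1.5, criterion="distance")

print(single)    # one cluster
print(complete)  # several small clusters (pairs and a leftover point)
```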
Simplified example of what I'm trying to do:
Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:
100% score for [(A,B),(C)] vs. [(A,B),(C)]
~50% score for [(A,B),(C)] vs. [(A),(B,C)]
~20% score for [(A,B),(C)] vs. [(A,B,C)]
These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example, and in real applications you can have many data points and also more than 2 clusters per cluster grouping. Having such a metric is also useful when trying to compare a cluster grouping to a labeled grouping of data (when you have labeled data).
Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
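For illustration, the overlap matrix I describe could be computed roughly like this, assuming scikit-learn's contingency_matrix and encoding A, B, C as 0, 1, 2:

```python
# A quick sketch of the overlap-matrix idea, using scikit-learn's
# contingency matrix; A, B, C are encoded as the indices 0, 1, 2.
from sklearn.metrics.cluster import contingency_matrix

grouping_1 = [0, 0, 1]  # [(A, B), (C)]
grouping_2 = [0, 1, 1]  # [(A), (B, C)]

# Entry (i, j) counts how many points cluster i of grouping_1 shares with
# cluster j of grouping_2.
print(contingency_matrix(grouping_1, grouping_2))
```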
Use evaluation metrics.
Many metrics are symmetric. For example, the adjusted Rand index.
A value close to 1 means the two groupings are very similar, a value close to 0 means the agreement is no better than random, and a negative value means the agreement is worse than random, e.g. each cluster of one grouping is spread "evenly" over the clusters of the other.
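A small sketch with scikit-learn's adjusted_rand_score, reusing the toy groupings from the question (A, B, C encoded as 0, 1, 2):

```python
# Comparing the example groupings from the question with the adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

labels_kmeans = [0, 0, 1]     # [(A, B), (C)]
labels_meanshift = [0, 1, 1]  # [(A), (B, C)]

print(adjusted_rand_score(labels_kmeans, labels_kmeans))     # 1.0 for identical groupings
print(adjusted_rand_score(labels_kmeans, labels_meanshift))  # much lower; can even be negative on tiny examples
```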
Well, determining the number of clusters is a problem in data analysis in its own right, distinct from the clustering problem itself. There are quite a few criteria for this, such as AIC or the Cubic Clustering Criterion. I don't think scikit-learn offers an option to calculate these two out of the box, but I know there are packages in R that do.
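As a rough illustration of the idea in Python, one workaround is to fit a Gaussian mixture (a stand-in model, not the clustering itself) over a range of cluster counts and compare their AIC values; the data here is a placeholder:

```python
# A rough sketch, assuming a Gaussian mixture as a stand-in model so AIC can
# be computed with scikit-learn; X is placeholder data.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(500, 2)  # placeholder data

aic_scores = {}
for k in range(1, 10):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    aic_scores[k] = gm.aic(X)

best_k = min(aic_scores, key=aic_scores.get)  # smallest AIC wins
print(best_k, aic_scores[best_k])
```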
I'm currently doing some clustering analysis of 3D coordinate points using the Python package sklearn.cluster.
I've used k-means clustering, which outputs a calculated cluster centre. What I really want is the data point of each cluster that has the minimum distance to all other data points in that cluster. I'm guessing this would be the point closest to the cluster centre in my set of data, but as my data set is huge, it isn't really practical to use some sort of minimizing search algorithm. Any suggestions of other clustering methods or other Python scripts that could help me find this?
Finding the point closest to the center is only O(n), so it is as cheap as one more iteration of k-means; not too bad.
It is worse than the mean, but it is your best guess.
Beware: it does not have the smallest average (Euclidean) distance to the other points.
The mean is a least-squares optimum: it has the smallest squared deviation (i.e. squared Euclidean distance).
This is the difference between the mean and the median. The median is the most central data point, not the mean. But finding the median is much more expensive than computing the average.
It should not be too hard to prove that the point closest to the mean will have the least squared deviation of all your data points (try showing that a point with a smaller RMSD must be closer to the mean).
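A minimal sketch of the O(n) lookup, assuming a fitted scikit-learn KMeans model and placeholder 3D data:

```python
# A minimal sketch, assuming a fitted scikit-learn KMeans model and
# placeholder 3D coordinates.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.rand(10000, 3)  # placeholder 3D points

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# For each cluster centre, find the index of the closest data point: one
# O(n) pass, about as cheap as a single extra k-means iteration.
closest_idx, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
representatives = X[closest_idx]  # one "most central" point per cluster
```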