Are there any types of clustering algorithms that focus on forming clusters of a specific size? This can be thought of as a grouping problem more than a clustering problem.
Basically, given n data points and fixed groups of a certain size k, find the optimal assignment of points to groups based on certain features, such that the distance between the features of the points within each group is minimized.
This problem seems to be pretty similar to a clustering problem, but the main difference is that we are concerned with a specific cluster size, but not concerned about the number of clusters.
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constrained clustering algorithms, although these usually only support "must-link" and "cannot-link" constraints, not size constraints.
You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.
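A minimal sketch of that swap heuristic, assuming the points are a NumPy array and using the sum of squared distances to each group's centroid as the objective (the function name and parameters are illustrative, not a reference implementation):

import numpy as np

def same_size_grouping(X, group_size, n_restarts=5, seed=0):
    # Swap heuristic for fixed-size groups: start from a random
    # equal-size assignment, then swap pairs of points between groups
    # while the total squared distance to each group centroid improves.
    rng = np.random.default_rng(seed)
    n = len(X)
    assert n % group_size == 0, "n must be divisible by group_size"
    k = n // group_size

    def cost(labels):
        return sum(((X[labels == g] - X[labels == g].mean(axis=0)) ** 2).sum()
                   for g in range(k))

    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):  # restarts help escape local minima
        labels = rng.permutation(np.repeat(np.arange(k), group_size))
        current = cost(labels)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                for j in range(i + 1, n):
                    if labels[i] == labels[j]:
                        continue
                    labels[i], labels[j] = labels[j], labels[i]  # try the swap
                    new_cost = cost(labels)
                    if new_cost < current - 1e-9:
                        current, improved = new_cost, True
                    else:
                        labels[i], labels[j] = labels[j], labels[i]  # undo
        if current < best_cost:
            best_cost, best_labels = current, labels.copy()
    return best_labels, best_cost

This recomputes the full cost on every candidate swap, so it is only suitable for small n as written; an incremental cost update would be the obvious optimization.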
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
The problem you are posing is a combinatorial optimization problem. It is very important to know whether you need an exact solution, or whether you can settle for an approximate one.
If you need exact solutions, there is a body of work that focuses on clustering with different types of constraints. The constraint you mentioned can be encoded in this framework. However, you should note that this approach only scales up to datasets of a certain size.
Related
This question is more theoretical; I am not specifically trying to problem-solve.
I was recently introduced to the k-means clustering algorithm, an unsupervised machine learning algorithm, and I was intrigued by the thought that on some sets of data, even completely random ones, the computed centroids could keep changing through each iteration.
Example: what I am trying to show here is, imagine if the program flipped between iteration 6 and iteration 9, and kept doing this forever.
I have had my code hang before while using k-means, so I don't believe this is impossible, but please let me know if this is a known occurrence or if it is impossible due to the nature of the algorithm.
If you need more information, just ask me in a comment. I am using Python 3.7.
tl;dr: No, a k-means algorithm always terminates if the algorithm is coded correctly.
Explanation:
The ideal way to think about this is not in terms of which data points would cause issues, but in terms of how k-means works in the broader sense. The k-means algorithm always operates in a finite space. For N data points and k clusters, there are at most k^N distinct assignments of points to clusters. (This number can be pretty large, but it is still finite.)
Secondly, a k-means algorithm always optimizes a loss function, based on the sum of squared distances between each data point and its assigned cluster center. This means two very important things: the k^N distinct assignments can be ordered from minimum loss to maximum loss, and the k-means algorithm will never go from a state of lower net loss to one of higher net loss.
These two conditions guarantee that the algorithm always descends towards a minimum-loss arrangement within a finite space, ensuring that it terminates.
The last edge case: what if more than one minimum state has equal loss? This is a highly unlikely scenario, but it can cause issues if and only if tie-breaking is coded poorly. Essentially, the only way this can cause a cycle is if a data point is equidistant from two cluster centers and is allowed to move away from its current cluster even on an equal distance. Suffice to say, implementations are generally coded so that data points never swap on a tie, or break ties in some other deterministic manner, thus avoiding this scenario entirely.
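To make the termination argument concrete, here is a bare-bones Lloyd's-algorithm sketch: argmin breaks distance ties deterministically (the lowest cluster index wins), and the loop stops as soon as the loss fails to decrease. The function is illustrative, not a production implementation:

import numpy as np

def kmeans(X, k, seed=0, max_iter=1000):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    prev_loss = np.inf
    for _ in range(max_iter):
        # squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)  # ties broken deterministically: lowest index wins
        loss = d2[np.arange(len(X)), labels].sum()
        if loss >= prev_loss:  # the loss never increases and the state space is
            break              # finite, so this must trigger eventually
        prev_loss = loss
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers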
I have a graph with 240k nodes and 550k edges with five attributes per node coming out of an autoencoder from a sparse dataset. I'm looking to partition the graph into n clusters, such that intra-partition attribute similarity is maximized, the partitions are connected, and the sum of one of the attributes doesn't exceed a threshold for any given cluster.
I've tried poking around with an autoencoder but had issues making a loss function that would get the results I needed. I've also looked at hierarchical clustering with connectivity constraints, but can't find a way to enforce my sum constraint optimally. The same issue applies to community detection algorithms on graphs, like Louvain.
If anyone knows of any approaches to solving this I'd love to hear it, ideally something implemented in Python already but I can probably implement whatever algorithm I need should it not be. Thanks!
First of all, the problem is most likely NP-hard, so the best you can do is some greedy optimization. It will definitely help to first break the graph into subsets that can never be connected (remove edges between nodes that are not similar enough, then compute the connected components). Then, for each component (which hopefully is much smaller than 240k nodes; otherwise, tough luck!), run a classic optimizer that lets you specify the cost function. It is probably a good idea to use an integer linear program, and to consider the Lagrangian dual of the problem.
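A possible sketch of that pre-splitting step, assuming a networkx graph whose nodes carry the attribute vectors; the attribute name and the similarity threshold are placeholders, not recommendations:

import networkx as nx
import numpy as np

def split_into_components(G, attr="features", min_similarity=0.8):
    # Drop edges whose endpoints are too dissimilar (cosine similarity
    # below the threshold), then return the connected components as
    # independent subproblems.
    H = G.copy()
    for u, v in list(H.edges()):
        a = np.asarray(H.nodes[u][attr], dtype=float)
        b = np.asarray(H.nodes[v][attr], dtype=float)
        sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if sim < min_similarity:
            H.remove_edge(u, v)
    return [H.subgraph(c).copy() for c in nx.connected_components(H)]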
The linkage matrix for clustering provides the cluster index and distance for each step of the clustering hierarchy.
When two clusters are merged, I would like to know which two points were the closest in the two clusters. I am using the linkage method "single", i.e. closest distance.
I know I can do this trivially by an exhaustive search and comparison. Is the information already there after linkage? Is there a smarter way to get this information?
To answer your questions:
No, this information is not available after linkage, at least according to the official SciPy documentation.
The closest-pair-of-points problem is a problem in computational geometry, and can be solved in O(n log n) time by a recursive divide-and-conquer algorithm (note that exhaustive search is quadratic). See this Wikipedia article for more information. Check also this paper by Shamos and Hoey. Note that the original formulation of the problem involves only one set of points; however, adapting it to two sets is straightforward, and you might find this discussion helpful.
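For the two-set variant that arises here, a KD-tree query is one practical sub-quadratic alternative to the exhaustive scan; a sketch using SciPy (the function name is illustrative):

import numpy as np
from scipy.spatial import cKDTree

def closest_pair_between(A, B):
    # Closest pair between two point sets A and B (the two-set variant
    # relevant after a single-linkage merge). Querying a KD-tree built
    # on B is typically much faster than the exhaustive |A| x |B| scan.
    tree = cKDTree(B)
    dists, nearest_in_B = tree.query(A)  # nearest B-point for every A-point
    i = int(np.argmin(dists))
    return i, int(nearest_in_B[i]), float(dists[i])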
I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean', sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what size sample to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of more smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: the plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters.
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off (a good description is here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously, if you have K equal to the number of data points, you can explain 100% of the variance, so the question is where the improvements in variance explained start to level off (see the sketch below).
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. Though I haven't seen this approach used to estimate K, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. not as many clusters as you're expecting).
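A minimal sketch of the elbow computation mentioned above, using sklearn's inertia_ (the within-cluster sum of squares) as the quantity to plot against K; the range of K values is arbitrary:

from sklearn.cluster import KMeans

def elbow_inertias(X, ks=range(2, 16)):
    # Within-cluster sum of squares for each K; plot these against K
    # and look for where the curve levels off.
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks}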
How much to sample
It looks like you've answered this from your plot: no matter what your sample size, you get the same pattern in silhouette score. So that pattern seems very robust to sampling assumptions.
k-means converges to local minima, and the starting positions play a crucial role in the result. It is often a good idea to reduce the noise and dimensionality using PCA or another dimensionality reduction technique before proceeding with k-means.
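For example, a sketch of that preprocessing, reusing feature_matrix from the question; the number of PCA components and K below are placeholders, not recommendations:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Standardize, reduce noise/dimensionality, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20, random_state=0),
    KMeans(n_clusters=10, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(feature_matrix)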
Just to add for the sake of completeness: it might be a good idea to find the optimal number of clusters via "partitioning around medoids" (PAM), which is commonly evaluated with the silhouette method.
The reason for the weird observations could be the different starting points for the different-sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. A tractable means is the worst-pair ratio, as discussed here: Clusterability.
Since there is no widely accepted best approach to determine the optimal number of clusters, all evaluation techniques, including the silhouette score, the gap statistic, etc., fundamentally rely on some form of heuristic / trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to develop over-confidence in any one of them.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing: the sklearn implementation of the silhouette score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
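For example, something along these lines, reusing feature_matrix and cluster_labels from the question; the number of repeats is arbitrary:

import numpy as np
from sklearn import metrics

# Repeat the sampled silhouette score with different random states to
# check whether sample_size=50000 gives stable results.
scores = [metrics.silhouette_score(feature_matrix, cluster_labels,
                                   metric='euclidean',
                                   sample_size=50000, random_state=rs)
          for rs in range(10)]
print(np.mean(scores), np.std(scores))  # a small std suggests the sample size is adequate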
I have written code in Python to implement the DBSCAN clustering algorithm.
My dataset consists of 14k users with each user represented by 10 features.
I am unable to decide exactly what values to use for Min_samples and epsilon as inputs.
How should I decide that?
The similarity measure is Euclidean distance (hence it becomes even more difficult to decide). Any pointers?
It is often quite hard to estimate the parameters for DBSCAN.
Did you think about the OPTICS algorithm? In that case you only need Min_samples, which would correspond to the minimal cluster size.
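A minimal sketch with sklearn's OPTICS, assuming the 14k x 10 feature matrix is called user_features (a made-up name); min_samples here plays roughly the role of a minimum cluster size:

from sklearn.cluster import OPTICS

# OPTICS avoids fixing epsilon up front; points labelled -1 are noise.
labels = OPTICS(min_samples=10, metric='euclidean').fit_predict(user_features)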
Otherwise, for DBSCAN, I've done it in the past by trial and error: try some values and see what happens. A general rule to follow is that if your dataset is noisy, you should use a larger value of Min_samples, and it is also correlated with the number of dimensions (10 in this case).
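And a sketch of that trial-and-error loop with sklearn's DBSCAN, again assuming a user_features matrix; the candidate epsilon values are purely illustrative and depend on the scale of your features:

import numpy as np
from sklearn.cluster import DBSCAN

# Try several epsilon values and inspect the resulting cluster count
# and noise fraction to narrow down a sensible range.
for eps in [0.1, 0.25, 0.5, 1.0, 2.0]:
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(user_features)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    noise_fraction = np.mean(labels == -1)
    print(f"eps={eps}: {n_clusters} clusters, {noise_fraction:.0%} noise")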