I read slides about top-k algorithms, namely Fagin's Algorithm (FA), the Threshold Algorithm (TA), and the No Random Access (NRA) algorithm. These algorithms are attractive because they do not need to see all values; they just set a threshold or compute lower and upper bounds.
Slide = http://alumni.cs.ucr.edu/~skulhari/Top-k-Query.pdf
Now I have a large set of data points and I want to get the top-k points (a subset) whose diversity is maximal, meaning that the distances among the points in the top-k subset are maximized. I know this problem is NP-hard. A brute-force algorithm is not feasible here because the dataset is large.
Here I just want to know: is it possible to get the top-k subset without computing the distances between all pairs of points, in the spirit of the FA/TA/NRA algorithms?
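Not a full answer, but for reference: a common heuristic for the max-min diversity objective is greedy farthest-point selection, which computes only O(nk) distances rather than all O(n^2) pairs. A minimal sketch, assuming NumPy and Euclidean distance (`points` and `k` are placeholders):

```python
import numpy as np

def greedy_diverse_subset(points, k):
    """Greedy farthest-point (Gonzalez-style) heuristic for max-min diversity.

    Uses O(n*k) distance computations instead of all O(n^2) pairwise distances.
    Returns the indices of the k selected points.
    """
    points = np.asarray(points, dtype=float)
    selected = [0]                                     # arbitrary starting point
    # distance of every point to its nearest already-selected point
    min_dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))                 # farthest from the current subset
        selected.append(nxt)
        min_dist = np.minimum(min_dist,
                              np.linalg.norm(points - points[nxt], axis=1))
    return selected
```

This is only a heuristic, not the exact optimum, but it avoids computing the full pairwise distance matrix.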
Related
Question: What is the best way to find the Eps and MinPts parameters for the DBSCAN algorithm?
Problem: The goal is to find the locations (clusters) based on coordinates (input data). The algorithm calculates the most visited areas and retrieves these clusters.
Approach:
I defined the epsilon (EPS) parameter as 1.5 km, converted to radians so it can be used by the DBSCAN algorithm: epsilon = 1.5 / 6371.0088 (the 1.5 km choice follows https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/).
If I set MinPts to a low value (e.g. MinPts = 5 produces about 2000 clusters), DBSCAN produces too many clusters, and I want to limit the relevance/size of the clusters to an acceptable value.
I use the haversine metric and the ball-tree algorithm to calculate great-circle distances between points (a minimal setup sketch follows this list).
Suggestions:
a k-nearest-neighbour (k-distance) approach to find EPS;
domain knowledge to decide the best values for EPS and MinPts.
Data: I'm using 160k coordinates, but the program should be able to handle different data inputs.
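As referenced above, a minimal sketch of the described setup, assuming scikit-learn; the coordinates below are made-up placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder (lat, lon) pairs in degrees; replace with the real 160k coordinates.
coords_deg = np.array([[41.15, -8.61],
                       [41.16, -8.62],
                       [38.72, -9.14]])

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian                    # 1.5 km expressed in radians

db = DBSCAN(eps=epsilon, min_samples=5,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(np.radians(coords_deg))   # haversine expects radians
```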
As you may know, setting MinPts high will not only prevent small clusters from forming, but will also change the shape of larger clusters, as their outskirts will be considered outliers.
Consider instead a third way to reduce the number of clusters: simply sort them by descending size (number of coordinates) and keep only the top 4 or 5. This way you won't be shown all the small clusters you're not interested in, and you can treat all those points as noise instead.
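A rough sketch of that post-processing step, assuming NumPy; `labels` would be DBSCAN's output, with -1 marking noise:

```python
import numpy as np

def keep_largest_clusters(labels, n_keep=5):
    """Keep only the n_keep largest clusters; relabel everything else as noise (-1)."""
    labels = np.asarray(labels).copy()
    ids, counts = np.unique(labels[labels != -1], return_counts=True)
    keep = ids[np.argsort(counts)[::-1][:n_keep]]
    labels[~np.isin(labels, keep)] = -1
    return labels
```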
You're essentially using DBSCAN for something it's not meant for, namely to find the n largest clusters, but that's fine - you just need to "tweak the algorithm" to fit your use case.
Update
If you know the entire dataset and it will not change in the future, I would just tweak minPts manually, based on your knowledge.
In scientific environments and with varying data sets, you consider the data as "generated from a stochastic process". However, that would mean that there is a chance - no matter how small - that there are minPts dogs in a remote forest somewhere at the same time, or minPts - 1 dogs in Central Park, where it's normally overcrowded.
What I mean by that is that if you go down the scientific road, you need to find a balance between the deterministic value of minPts and the probabilistic distribution of the points in your data set.
In my experience, it all comes down to whether or not you trust your knowledge, or would like to defer responsibility. In some government/scientific/large corporate positions, it's a safer choice to pin something on an algorithm than on gut feeling. In other situations, it's safe to use gut feeling.
I'm using scikit-learn's agglomerative hierarchical clustering to obtain clusters of a three-million-cell geographical hexagonal grid, using contiguity constraints and Ward linkage.
My question is about the number of clusters that the routine returns. I know that Ward linkage minimizes the sum of the within-cluster variances. This is trivially minimized when each cluster is a single observation, right? That is why I assume that the algorithm's starting point is just an arbitrarily large value for this variance, so that including the nearest contiguous observation (provided it is within the maximum distance threshold) decreases the variance of the group, and it continues this way until it reaches the top of the tree.
My question is: what criterion is used to return the cluster labels? From what I've read, the optimal number of clusters seems to be given by where the biggest jump in the tree occurs, but I'm not sure whether this is the criterion the developers are using. Does anyone know?
Ideally I could check by plotting the tree, but I'm clustering nearly 3 million cells, which makes plotting the tree both messy and unfeasible (at least with my computer or the cluster I have access to).
Thanks
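For reference, in scikit-learn the cut that produces the labels is specified by the user rather than detected automatically: you pass either a target number of clusters or a merge-distance threshold. A minimal sketch, with placeholder grid data and a k-nearest-neighbour graph standing in for the real contiguity constraints:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# Placeholder data: centroids of the hexagonal grid cells.
cell_centroids = np.random.rand(1000, 2)

# Contiguity approximated here by a k-nearest-neighbour graph.
connectivity = kneighbors_graph(cell_centroids, n_neighbors=6, include_self=False)

# Either fix the number of clusters ...
model = AgglomerativeClustering(n_clusters=50, linkage='ward',
                                connectivity=connectivity)
labels = model.fit_predict(cell_centroids)

# ... or cut the tree at a merge-distance threshold instead:
model = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0,
                                linkage='ward', connectivity=connectivity)
labels = model.fit_predict(cell_centroids)
```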
If I am trying to cluster my data using DBSCAN, is there a way to assign a maximum number of clusters? I know I can set the minimum distance between points for them to be considered a cluster, but my data changes case by case and I would prefer not to allow more than 4 clusters. Any suggestions?
Not with DBSCAN itself. Connected components are connected components, there is no ambiguity at this point.
You could write your own rules to extract the X most significant clusters from an OPTICS reachability plot, though. OPTICS is the variable-radius generalization of DBSCAN.
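A rough sketch of one such hand-written rule, using scikit-learn's OPTICS; keeping only the four largest extracted clusters is just one possible way to cap the count:

```python
import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(500, 2)                        # placeholder data

optics = OPTICS(min_samples=10).fit(X)
labels = optics.labels_.copy()                    # xi-based extraction by default

# Hand-written rule: keep at most the 4 largest clusters, mark the rest as noise.
ids, counts = np.unique(labels[labels != -1], return_counts=True)
keep = ids[np.argsort(counts)[::-1][:4]]
labels[~np.isin(labels, keep)] = -1

# The reachability plot itself is available for more elaborate rules:
reachability = optics.reachability_[optics.ordering_]
```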
I was reading dlib's face clustering code and noticed that the process is like so:
Convert faces to vector using trained network
Use Chinese whisper clustering algorithm to compute groups based on distance
Chinese whispers clustering can take a pretty long time when trying to cluster a large number (>10,000) of images.
In this pyimagesearch article the author uses DBSCAN, another clustering algorithm to group a number of images by person.
Since the vectors generated by the neural net can be used to calculate the similarity between two faces, wouldn't it be better to just calculate a Euclidean distance matrix, then search for all the values that meet a confidence threshold (e.g. x < 0.3 for 70% confidence)?
Why use a clustering algorithm at all when you can just compare every face with every other face to determine which ones are the same person? Both DBSCAN and Chinese whispers clustering take much longer than calculating a distance matrix. With my dataset of 30,000 images the times are:
C-whisper - 5 minutes
distance matrix + search - 10-20 seconds
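For concreteness, a minimal sketch of the distance-matrix-plus-threshold idea described in the question, assuming SciPy; the embeddings and the 0.3 cut-off are placeholders:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

embeddings = np.random.rand(1000, 128)            # placeholder 128-d face vectors

dist = squareform(pdist(embeddings, metric='euclidean'))
i, j = np.where(np.triu(dist < 0.3, k=1))         # index pairs under the threshold
matches = list(zip(i, j))
```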
DBSCAN actually takes only marginally longer than computing a distance matrix (when implemented right, 99% of the computation is the distance computations), and with indexing it can sometimes be much faster, because it does not need every pairwise distance if the index can prune computations.
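To illustrate that point, a sketch of the same clustering run either on a precomputed distance matrix or with an index-backed metric, assuming scikit-learn; the embeddings and eps value are placeholders:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

emb = np.random.rand(5000, 128)                   # placeholder face embeddings

# Variant 1: precompute all pairwise distances, then cluster on the matrix.
D = pairwise_distances(emb, metric='euclidean')
labels_pre = DBSCAN(eps=0.3, min_samples=3, metric='precomputed').fit_predict(D)

# Variant 2: let DBSCAN query a spatial index instead of a full matrix.
labels_idx = DBSCAN(eps=0.3, min_samples=3,
                    algorithm='ball_tree').fit_predict(emb)
```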
But you can't just "read off" clusters from the distance matrix. The data there may be contradictory: the face detector may consider A and B similar and B similar to C, but A and C dissimilar! What do you do then? Clustering algorithms try to resolve exactly such situations. For example, single-link (and, to a lesser extent, DBSCAN) would put A and C in the same cluster, whereas complete linkage would decide upon either AB or BC.
Actually, dlib's implementation does something very similar to what you are thinking of. Here is the code. It first checks every pair and rejects pairs whose distance is greater than a threshold. This is exactly what you proposed. But then it performs the actual clustering on that result. So, what would change?
Simply cutting off by distance can work if your data points are well separated. However, if your data points are very close to each other, this problem becomes very hard. Imagine a 1D feature where your data points are the integer positions between 0 and 10, and you want to put two points in the same cluster if their distance is at most 1.5. Where do you cut? If you start with one pair, you can form a cluster. But if you then pick a neighbouring point, you will find that it is closer than your threshold to one point already in the cluster and farther than the threshold from another. Clustering is about resolving this ambiguity.
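A small sketch of that 1D example with SciPy's hierarchical clustering, showing how single and complete linkage resolve the ambiguity differently; the 1.5 cut-off is taken from the text above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.arange(11, dtype=float).reshape(-1, 1)   # integer positions 0..10

# Single linkage chains neighbours (each adjacent pair is within 1.5),
# so all 11 points end up in one cluster.
single = fcluster(linkage(points, method='single'), t=1.5, criterion='distance')

# Complete linkage only merges groups whose *farthest* members are within 1.5,
# so the chain breaks into several small clusters.
complete = fcluster(linkage(points, method='complete'), t=1.5, criterion='distance')

print(single)     # one label for all points
print(complete)   # several different labels
```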
I'm currently doing some clustering analysis of 3D coordinate points using the python package sklearn.cluster.
I've used K-means clustering, which outputs a calculated cluster centre. What I really want is the data point in each cluster that has the minimum distance to all other data points in that cluster. I'm guessing this would be the point closest to the cluster centre in my data set, but as my data set is huge, it isn't really practical to run some kind of minimising search algorithm. Any suggestions for other clustering methods or Python scripts that could help me find this?
Finding the point closest to the center is only O(n), so it is as cheap as one more iteration of k-means - not too bad.
It is worse than the mean, but it is your best guess.
Beware: it does not have the smallest average (Euclidean) distance.
The mean is a least-squares optimum; it has the least squared deviation (i.e. squared Euclidean distance).
This is the difference between the mean and the median: the median is the most central data point, not the mean. But finding the median is much more expensive than computing the average.
It should not be too hard to prove that the point closest to the mean will have the least squared deviation of all your data points (try showing that a point with a smaller RMSD must be closer to the mean).
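A minimal sketch of that O(n) lookup, assuming scikit-learn's KMeans; `X` and the cluster count are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.rand(100000, 3)                     # placeholder 3D coordinates

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)

# For each cluster centre, find the index of the closest data point: one O(n) pass.
closest_idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
representatives = X[closest_idx]
```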