I have written Python code that implements the DBSCAN clustering algorithm.
My dataset consists of 14k users, each represented by 10 features.
I am unable to decide what values to use for the Min_samples and epsilon inputs.
How should I decide that?
The similarity measure is Euclidean distance, which makes it even harder to decide. Any pointers?
DBSCAN's parameters are often hard to estimate.
Have you considered the OPTICS algorithm? In that case you only need Min_samples, which corresponds to the minimal cluster size.
Otherwise, for DBSCAN I've done it in the past by trial and error: try some values and see what happens. A general rule is that if your dataset is noisy you should use a larger Min_samples value; the right value also depends on the number of dimensions (10 in this case).
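The trial-and-error approach can be made a bit more systematic. The following is only a sketch, assuming scikit-learn is available: the data is a synthetic stand-in (not the 14k-user dataset), and the eps and min_samples grids are arbitrary starting values, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 1,400 points, 10 features.
X, _ = make_blobs(n_samples=1400, n_features=10, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # Euclidean distance is scale-sensitive


def sweep(X, eps_values, min_samples_values):
    """Run DBSCAN for each setting; report cluster count and noise fraction."""
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
            # DBSCAN marks noise points with the label -1.
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            noise_fraction = float(np.mean(labels == -1))
            rows.append((eps, min_samples, n_clusters, noise_fraction))
    return rows


# Arbitrary grids, chosen only for illustration.
results = sweep(X, eps_values=[0.5, 1.0, 1.5, 2.0], min_samples_values=[5, 10, 20])
for eps, min_samples, n_clusters, noise in results:
    print(f"eps={eps:<4} min_samples={min_samples:<3} "
          f"clusters={n_clusters:<3} noise={noise:.2f}")
```

Settings where almost everything is noise, or where everything collapses into one cluster, can usually be discarded quickly; the interesting region lies in between.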
Hi, I am new to Python and trying to figure out the questions below. I would really appreciate any help. Thank you.
1. How do I get intracluster and intercluster distances in k-means using Python?
2. How do I verify the quality of clusters? Are there any measures to check the goodness of the clusters formed?
3. Is there a way to find out which factors/variables are the most significant features affecting the clustering (feature extraction/selection)?
I tried this for question 1 above. Is this the correct approach?
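import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
# km is the already fitted KMeans model; these are distances between cluster centers
dists = euclidean_distances(km.cluster_centers_)
tri_dists = dists[np.triu_indices(len(km.cluster_centers_), 1)]  # upper triangle, no diagonal
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
print(max_dist, avg_dist, min_dist)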
Avoid putting multiple questions into one.
1. K-means does not compute all these distances. Otherwise it would need O(n²) time and memory, which would be much slower! It uses a special property of variance known as the Koenig-Huygens theorem (another reason why it cannot simply optimize distances other than the sum of squares).
2. Yes, there have been over 20, probably even 100, such quality measures proposed in the literature. But that does not make it much easier to pick the "best" clustering: in the end, clusters are subjective for the user.
3. Yes, you can apply various techniques ranging from variance analysis to factor analysis to random forests.
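The variance property mentioned for the first question can be checked numerically. This is a small sketch on random data (the array X is a hypothetical stand-in for the points of one cluster): the sum of squared distances over all ordered pairs equals 2n times the sum of squared distances to the centroid, so the O(n²) computation is not needed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # hypothetical points of a single cluster
n = len(X)

# O(n^2): sum of squared Euclidean distances over all ordered pairs (i, j).
diffs = X[:, None, :] - X[None, :, :]
pairwise_sum = float((diffs ** 2).sum())

# O(n): sum of squared distances of every point to the centroid.
centroid = X.mean(axis=0)
within_ss = float(((X - centroid) ** 2).sum())

# Koenig-Huygens: both quantities agree up to floating-point error.
print(pairwise_sum, 2 * n * within_ss)
```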
As seen above, how can I build clusters that are approximately balanced in size in sklearn? I also have a question: clustering is done according to certain rules, so why can't we specify the number of points in each cluster? Anyway, I want to know how to achieve this step.
I have another idea about it: count the number of points for each label, then calculate the variance of those counts, and keep the clustering with the smallest variance.
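The idea of comparing runs by the variance of their label counts could look roughly like this. It is only a sketch: the label vectors are made up for illustration, and size_variance is a hypothetical helper name, not an sklearn function.

```python
import numpy as np


def size_variance(labels):
    """Variance of the cluster sizes; lower means more balanced clusters."""
    _, counts = np.unique(np.asarray(labels), return_counts=True)
    return float(counts.var())


# Hypothetical label vectors from two clustering runs on the same 8 points.
candidates = {
    "run_a": [0, 0, 0, 0, 0, 1, 2, 2],  # cluster sizes 5, 1, 2
    "run_b": [0, 0, 0, 1, 1, 1, 2, 2],  # cluster sizes 3, 3, 2
}

# Keep the run whose cluster sizes vary the least.
best = min(candidates, key=lambda name: size_variance(candidates[name]))
print(best, {name: size_variance(lab) for name, lab in candidates.items()})
```

Note that this criterion only measures balance; it says nothing about whether the balanced clusters are otherwise meaningful.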
Some methods (for example HDBSCAN, which is not part of sklearn: https://hdbscan.readthedocs.io/en/latest/parameter_selection.html) have parameters like min_cluster_size. sklearn's DBSCAN min_samples will probably work in a similar way. It will not give you exactly 'balanced' clusters but may help.
But in my opinion, it is sometimes more reasonable to run clustering algorithms with different parameters and select the 'more balanced' output by hand. In this case you can see which points are not separable and possibly add more data (calculate an additional distance matrix, for example) or change the target metric.
Why can't we specify the number in each cluster?
Because the tasks 'find clusters' and 'balance them' are somewhat opposed in most cases. I'm not even speaking about algorithms where you need to specify the number of clusters.
I am running k-means clustering on ~1 million items (each represented as a ~100-feature vector). I have run the clustering for various k, and now want to evaluate the different results with the silhouette score implemented in sklearn. Attempting to run it with no sampling seems unfeasible and takes a prohibitively long time, so I assume I need to use sampling, i.e.:
metrics.silhouette_score(feature_matrix, cluster_labels, metric='euclidean',sample_size=???)
I don't have a good sense of what an appropriate sampling approach is, however. Is there a rule of thumb for what sample size to use given the size of my matrix? Is it better to take the largest sample my analysis machine can handle, or to take the average of several smaller samples?
I ask in large part because my preliminary test (with sample_size=10000) has produced some really really unintuitive results.
I'm also open to alternative, more scalable evaluation metrics.
Editing to visualize the issue: The plot shows, for varying sample sizes, the silhouette score as a function of the number of clusters
What's not weird is that increasing the sample size seems to reduce noise. What is weird, given that I have 1 million very heterogeneous vectors, is that 2 or 3 is the "best" number of clusters. In other words, what's unintuitive is that I would find a more-or-less monotonic decrease in silhouette score as I increase the number of clusters.
Other metrics
Elbow method: Compute the % variance explained for each K, and choose the K where the plot starts to level off. (a good description is here https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set). Obviously if you have k == number of data points, you can explain 100% of the variance. The question is where do the improvements in variance explained start to level off.
Information theory: If you can calculate a likelihood for a given K, then you can use the AIC, AICc, or BIC (or any other information-theoretic approach). E.g. for the AICc, it just balances the increase in likelihood as you increase K with the increase in the number of parameters you need. In practice all you do is choose the K that minimises the AICc.
You may be able to get a feel for a roughly appropriate K by running alternative methods that give you back an estimate of the number of clusters, like DBSCAN. I haven't seen this approach used to estimate K, though, and it is likely inadvisable to rely on it like this. However, if DBSCAN also gave you a small number of clusters here, then there's likely something about your data that you might not be appreciating (i.e. not as many clusters as you're expecting).
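As an illustration of the elbow method from the list above, here is a minimal sketch assuming scikit-learn. The data is synthetic, and the fraction of variance explained is computed as 1 minus the ratio of the k-means inertia to the total sum of squares, which is one common way to define it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the feature matrix (assumption).
X, _ = make_blobs(n_samples=600, n_features=5, centers=4, random_state=0)

# Total sum of squares around the global mean (the k=1 inertia).
total_ss = float(((X - X.mean(axis=0)) ** 2).sum())

explained = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares for this k.
    explained[k] = 1.0 - km.inertia_ / total_ss

for k, frac in explained.items():
    print(f"k={k}: {100 * frac:.1f}% of variance explained")
```

The K to pick is where the printed percentages stop improving noticeably; on real data that point is often much less clear than on synthetic blobs.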
How much to sample
It looks like you've answered this from your plot: no matter how you sample, you get the same pattern in silhouette score. So that pattern seems very robust to sampling assumptions.
K-means converges to local minima. The starting positions play a crucial role in the optimal number of clusters found. It is often a good idea to reduce noise and dimensionality using PCA or another dimensionality reduction technique before running k-means.
Just to add for the sake of completeness: it might be a good idea to get the optimal number of clusters by "partitioning around medoids". It is equivalent to using the silhouette method.
The reason for the weird observations could be different starting points for the different sized samples.
Having said all the above, it is important to evaluate the clusterability of the dataset at hand. One tractable way is the worst pair ratio, as discussed here: Clusterability.
Since there is no widely accepted best approach to determine the optimal number of clusters, all evaluation techniques, including Silhouette Score, Gap Statistic, etc., fundamentally rely on some form of heuristic or trial-and-error argument. So to me, the best approach is to try out multiple techniques and NOT to become over-confident in any of them.
In your case, the ideal and most accurate score would be calculated on the entire data set. However, if you need to use partial samples to speed up the computation, you should use the largest possible sample size your machine can handle. The rationale is the same as getting as many data points as possible out of the population of interest.
One more thing is that the sklearn implementation of Silhouette Score uses random (non-stratified) sampling. You can repeat the calculation multiple times using the same sample size (say sample_size=50000) to get a sense of whether the sample size is large enough to produce consistent results.
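Repeating the sampled score could be done along these lines. This is a sketch on a small synthetic dataset, so the sample sizes are far smaller than the 50000 suggested above; repeated_silhouette is a hypothetical helper, while sample_size and random_state are existing parameters of sklearn's silhouette_score.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Small synthetic stand-in for the ~1 million x ~100 matrix (assumption).
X, _ = make_blobs(n_samples=5000, n_features=10, centers=5, random_state=0)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)


def repeated_silhouette(X, labels, sample_size, n_repeats=10):
    """Mean and standard deviation of the sampled score over several seeds."""
    scores = [
        silhouette_score(X, labels, metric="euclidean",
                         sample_size=sample_size, random_state=seed)
        for seed in range(n_repeats)
    ]
    return float(np.mean(scores)), float(np.std(scores))


summary = {size: repeated_silhouette(X, labels, size) for size in (100, 500, 2000)}
for size, (mean, std) in summary.items():
    print(f"sample_size={size}: mean={mean:.3f} std={std:.3f}")
```

If the standard deviation across repeats is small compared with the differences between the scores for different k, the sample size is probably large enough.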
Are there any types of clustering algorithms that focus on forming clusters of a specific size? This can be thought of as a grouping algorithm more than a clustering algorithm.
Basically, given n data points, and fixed groups of a certain size k, find the optimal distribution of points to sets based upon certain classifiers, that will hopefully minimize the distance of classifiers for each point in a given group.
This problem seems to be pretty similar to a clustering problem, but the main difference is that we are concerned with a specific cluster size, but not concerned about the number of clusters.
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constrained clustering algorithms, although these algorithms usually only support "must link" and "cannot link" constraints, not size constraints.
You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.
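The swap-based modification described above might be sketched as follows. This is not the ELKI tutorial's algorithm, only a naive illustration: same_size_groups and objective are hypothetical names, the objective is the k-means style sum of squared distances to the group centroid, and recomputing it for every candidate swap is far too slow for anything but small inputs.

```python
import numpy as np


def objective(X, labels, n_groups):
    """Sum of squared distances of each point to its group centroid."""
    total = 0.0
    for g in range(n_groups):
        members = X[labels == g]
        total += float(((members - members.mean(axis=0)) ** 2).sum())
    return total


def same_size_groups(X, group_size, n_restarts=5, seed=0):
    """Fixed-size grouping by random start plus improving pairwise swaps."""
    n = len(X)
    if n % group_size != 0:
        raise ValueError("this sketch assumes n is a multiple of group_size")
    n_groups = n // group_size
    rng = np.random.default_rng(seed)
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        # Random start: every group gets exactly group_size points.
        labels = rng.permutation(np.repeat(np.arange(n_groups), group_size))
        cost = objective(X, labels, n_groups)
        improved = True
        while improved:
            improved = False
            for i in range(n):
                for j in range(i + 1, n):
                    if labels[i] == labels[j]:
                        continue
                    # Swapping two members keeps all group sizes unchanged.
                    labels[i], labels[j] = labels[j], labels[i]
                    new_cost = objective(X, labels, n_groups)
                    if new_cost < cost - 1e-12:
                        cost, improved = new_cost, True
                    else:
                        labels[i], labels[j] = labels[j], labels[i]  # undo
        # Restarts guard against local minima; keep only the best result.
        if cost < best_cost:
            best_labels, best_cost = labels.copy(), cost
    return best_labels, best_cost


# Toy data: three well-separated blobs of four points each.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(4, 2)) for c in centers])
labels, cost = same_size_groups(X, group_size=4)
print(labels, round(cost, 3))
```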
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
The problem that you are posing is a combinatorial optimization problem. It is very important to know whether you need an exact solution or can settle for an approximate one.
If you need exact solutions, there is a body of work that focuses on clustering with different types of constraints. The constraint that you mentioned can be encoded in this framework. However, you should know that this approach only scales up to datasets of a certain size.
I have a list of many float numbers, representing the length of an operation made several times.
For each type of operation, I have a different trend in numbers.
I'm aware of the many random generators available in some Python modules, such as numpy.random.
For example, there are binomial, exponential, normal, Weibull, and so on.
I'd like to know if there's a way to find the random generator that best fits each list of numbers that I have.
I.e., the generator (with its params) that best fits the trend of the numbers in the list.
That's because I'd like to automate the generation of time lengths for each operation, so that I can simulate it over n years without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers. I'm trying to find the probability distribution that best fits the array of numbers I already have. The only problem I see is that each probability distribution has input params that may affect the result. So I'll have to figure out how to set these params automatically so that they best fit the list.
Any idea?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
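A rough numerical version of the Q-Q comparison, as a sketch only: the measured data is replaced by a synthetic exponential sample, the candidate samples are drawn with numpy.random using parameters guessed from the data's mean and standard deviation, and the straightness of the Q-Q line is summarised by a correlation coefficient instead of a plot. qq_correlation is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one list of measured operation lengths (assumption).
data = rng.exponential(scale=2.0, size=500)


def qq_correlation(data, candidate_sample):
    """Correlation between the sorted data and a sorted sample of equal size.

    Values close to 1 mean the Q-Q points lie close to a straight line.
    """
    a = np.sort(data)
    b = np.sort(candidate_sample)
    return float(np.corrcoef(a, b)[0, 1])


# Candidate generators, parameterised crudely from the data itself.
candidates = {
    "exponential": rng.exponential(scale=data.mean(), size=data.size),
    "normal": rng.normal(loc=data.mean(), scale=data.std(), size=data.size),
}
scores = {name: qq_correlation(data, sample) for name, sample in candidates.items()}
print(scores)
```

Plotting the two sorted arrays against each other (for example with matplotlib) gives the actual Q-Q plot; the correlation is just a convenient one-number summary.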
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT Adding reference to numpy least squares fitting example
Given a parameterized univariate distribution (e.g. exponential depends on lambda, or gamma depends on theta and k), the way to find the parameter values that best fit a given sample of numbers is called the maximum likelihood procedure. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results by expressing the log likelihood of your data set in terms of the parameters, setting the first derivatives to zero to maximize it, and using the inverse of the curvature matrix (the negative of the matrix of second derivatives) at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
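In Python, one way to try this is scipy.stats, whose continuous distributions have a fit method that performs maximum likelihood estimation. The sketch below makes several assumptions: the data is synthetic, the candidate list is arbitrary, and the location is pinned at zero (floc=0) for the positive-only families because durations cannot be negative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in for one list of measured operation lengths (assumption).
data = rng.gamma(shape=2.0, scale=1.5, size=1000)

# Candidate families; the selection is arbitrary and can be extended.
candidates = {
    "expon": stats.expon,
    "gamma": stats.gamma,
    "weibull_min": stats.weibull_min,
    "norm": stats.norm,
}

fits = {}
for name, dist in candidates.items():
    # Maximum likelihood fit; floc=0 fixes the location parameter at zero.
    params = dist.fit(data) if name == "norm" else dist.fit(data, floc=0)
    # Log likelihood of the data under the fitted distribution.
    loglik = float(np.sum(dist.logpdf(data, *params)))
    fits[name] = (params, loglik)

# Pick the family with the largest log likelihood.
best = max(fits, key=lambda name: fits[name][1])
for name, (params, loglik) in fits.items():
    print(f"{name:12s} loglik={loglik:10.2f} params={np.round(params, 3)}")
print("best:", best)
```

Families with more free parameters have an advantage in a raw log likelihood comparison, so close results should be read with some caution. Once a family is chosen, new values can be drawn with its rvs method, e.g. candidates[best].rvs(*fits[best][0], size=10).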
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is easy to work out what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood estimate for the exponential distribution solves sum(1/la - x_i) = 0, from which la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
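For example, the exponential case coded in plain Python, using the ten values from the Mathematica session above (exponential_rate_mle is a hypothetical helper name):

```python
from statistics import mean

# The sample printed in Out[43] above.
data = [1.55598, 0.375999, 0.0878202, 1.58705, 0.874423,
        2.17905, 0.247473, 0.599993, 0.404341, 0.31505]


def exponential_rate_mle(samples):
    """Maximum likelihood estimate of the exponential rate: 1 / sample mean."""
    return 1.0 / mean(samples)


la = exponential_rate_mle(data)
print(round(la, 5))
```

The result should agree with the ExponentialDistribution[1.21548] reported by EstimatedDistribution.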