Distortion function from K Means of Scikit-Learn - python

To get the distortion function (the sum of the distances from each point to its assigned center) when doing K-means clustering with scikit-learn, one simple way is just to get the centers (k_means.cluster_centers_) and sum up the distance for each point.
Just wondering if there is a faster way (in terms of programmer time)? Something like a direct function call?

This is already pre-computed at fit time in the inertia_ attribute for the KMeans class.
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> iris = load_iris()
>>> km = KMeans(3).fit(iris.data)
>>> km.inertia_
78.940841426146108

Depending on the definition of the distortion measure, it can be either
the sum of the squared distances of each example to its nearest cluster center,
or
the average squared Euclidean distance from the centroid of the respective cluster.
For the latter case, you can visit
Can distortion be derived from inertia rather than recalculating it from scratch in case of kmeans?
The inertia_ attribute of KMeans is defined in the official docs as
Sum of squared distances of samples to their closest cluster center,
weighted by the sample weights if provided.
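As a sanity check, inertia_ can be reproduced by hand from cluster_centers_ and labels_; a minimal sketch (dividing the sum by the number of samples then gives the "average distortion" variant mentioned above):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Squared distance of each sample to its assigned centroid.
sq_dists = ((X - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)
print(np.isclose(sq_dists.sum(), km.inertia_))   # True: inertia_ is the sum
print(sq_dists.sum() / X.shape[0])               # the "average distortion" variant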

Related

Python code for automatic execution of the Elbow curve method in K-modes clustering

I have the code for a manual, and therefore possibly error-prone, Elbow-method selection of the optimal number of clusters when K-modes clustering a binary DataFrame:
import numpy as np
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cost = []
for num_clusters in range(1, 10):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)

y = np.arange(1, 10)
plt.plot(y, cost)
The outcome of the for loop is a plot of the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself; I am looking for a computational way for the computer to do the job without me determining it "manually" (otherwise the whole code stops executing at that point).
What would be the code for selecting K automatically that would replace my manual selection?
Thank you.
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient gives a measure of how similar a data point is to its own cluster compared to other clusters; check the sklearn docs here.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using digits dataset.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

data, labels = load_digits(return_X_y=True)

silhouette_avg = []
for num_clusters in range(2, 20):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)

plt.plot(np.arange(2, 20), silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2, 20))

print(f"Best K: {np.argmax(silhouette_avg) + 2}")
output:
Best K: 9

Precomputed distance matrix in DBSCAN

Reading around, I find it is possible to pass a precomputed distance matrix into SKLearn DBSCAN. Unfortunately, I don't know how to pass it for calculation.
Say I have a 1D array with 100 elements, with just the names of the nodes. Then I have a 2D matrix, 100x100 with the distance between each element (in the same order).
I know I have to call it:
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
For a distance between nodes of 2 and a minimum of 5 nodes per cluster. Also, "precomputed" indicates that the 2D matrix should be used. But how do I pass the matrix in for the calculation?
The same question could apply if using RAPIDS CUML DBScan function (GPU accelerated).
documentation:
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean',
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]
[...]
metric : str or callable, default='euclidean'
The metric to use when calculating distance between instances in a feature array. If
metric is a string or callable, it must be one of the options allowed by
sklearn.metrics.pairwise_distances for its metric parameter. If metric is
“precomputed”, X is assumed to be a distance matrix and must be square. X may be a
sparse graph, in which case only “nonzero” elements may be considered neighbors for
DBSCAN.
[...]
So, the way you normally call this is:
from sklearn.cluster import DBSCAN
clustering = DBSCAN()
clustering.fit(X)
If you have a distance matrix instead, you do:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
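To make the "how do I pass it" part concrete, here is a minimal sketch; the node names and the 100x100 distance matrix are made-up stand-ins for the question's own data:
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in data: 100 node names and a symmetric 100x100 Euclidean distance matrix.
rng = np.random.default_rng(0)
names = np.array([f"node_{i}" for i in range(100)])
coords = rng.random((100, 2)) * 10
dist_matrix = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
labels = db.fit_predict(dist_matrix)     # X passed to fit/fit_predict is the distance matrix itself

# Map the cluster labels (-1 means noise) back to the node names.
for name, label in zip(names[:5], labels[:5]):
    print(name, label)
The same fit-time pattern should carry over to cuML's DBSCAN, but whether a given cuML version accepts metric="precomputed" is something to check in its documentation.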

Does sklearn DBSCAN assume distances are normalized

I'm learning about DBSCAN, and apparently the most important hyperparameter is eps. From the sklearn documentation:
eps : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the default of 0.5 does not take into account the range of the distances in our data. In other words, if I use distances from 1 to 100, will it still work the same way if I scale those distances up by a factor of 100, or down by a factor of 10? Or is this parameter supposed to be used with normalized distances (max_distance = 1)?
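A minimal experiment (with made-up blob data) illustrates the point of the question: eps is an absolute distance threshold in the same units as the data, not a normalized value, so rescaling the data without rescaling eps changes the result:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two tight, well-separated blobs on a small scale.
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.2, random_state=0)

labels_small = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
labels_scaled = DBSCAN(eps=0.5, min_samples=5).fit_predict(X * 100)   # same eps, data scaled by 100

print(np.unique(labels_small))    # typically [0 1]: two clusters found
print(np.unique(labels_scaled))   # typically [-1]: everything is treated as noise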

Specify max distance in agglomerative clustering (scikit learn)

When using a clustering algorithm, you always have to specify a cutoff parameter.
I am currently using agglomerative clustering with scikit-learn, and the only cutoff parameter I can see is the number of clusters.
agg_clust = AgglomerativeClustering(n_clusters=N)
y_pred = agg_clust.fit_predict(matrix)
But I would like to find an algorithm where you specify the maximum distance between elements of a cluster, and not the number of clusters.
The algorithm would then simply agglomerate clusters until the maximum distance is reached.
Any suggestions?
What you are looking for is implemented in scipy.cluster.hierarchy, see here.
So here is how you can do it:
from scipy.cluster.hierarchy import linkage, fcluster

# t is the maximum within-cluster distance you want to allow
y_pred = fcluster(linkage(matrix), t, criterion='distance')

# or, more directly
from scipy.cluster.hierarchy import fclusterdata
y_pred = fclusterdata(matrix, t, criterion='distance')
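A minimal usage sketch with made-up 2-D data and a made-up threshold t=1.5, just to show the call pattern:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
matrix = np.vstack([rng.normal(0, 0.1, (20, 2)),    # tight group near (0, 0)
                    rng.normal(5, 0.1, (20, 2))])   # tight group near (5, 5)

# Merge until the linkage distance would exceed t; cluster labels start at 1.
y_pred = fcluster(linkage(matrix), t=1.5, criterion='distance')
print(np.unique(y_pred))   # expected: [1 2], i.e. two clusters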

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of a dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality, and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
Data elements look something like [22, 418, 192], where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure and so use a different library. But what is more important, the concept of the mean used in k-means will not suit the cyclic dimension. Let's consider the following example:
#current cluster X, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and it leads to bizarre results:
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k-means + the cyclic distance measure results in a degenerate super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}
Solving this problem requires checking, for each point, whether it is better to measure it in the "simple" way or by representing it as x'=x-24. Unfortunately, given n points this gives 2^n possibilities.
This seems like a use case for kernelized k-means, where you actually cluster in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by the kernel (a "similarity measure" that is the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyds algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in some other distance function, as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the sketch after this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
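A hedged sketch of the second option (the sin/cos embedding of the hour), with made-up rows shaped like the question's [22, 418, 192] data, before running plain KMeans:
import numpy as np
from sklearn.cluster import KMeans

# Made-up rows of [hour, feature_a, feature_b], mimicking the question's data.
data = np.array([[22, 418, 192],
                 [23, 400, 200],
                 [0, 410, 195],
                 [12, 100, 50],
                 [13, 120, 55]], dtype=float)

hour = data[:, 0]
# Embed the cyclic hour on the unit circle so hours 23 and 0 end up close together.
hour_sin = np.sin(hour / 12 * np.pi)
hour_cos = np.cos(hour / 12 * np.pi)
X = np.column_stack([hour_sin, hour_cos, data[:, 1:]])
# In practice the other features should also be scaled so they do not drown out the hour embedding.

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # the late-night rows and the midday rows should land in separate clusters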
The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5
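And a matching sketch of the "change the distance-to-centroid calculation accordingly" part, i.e. a circular distance for the hour dimension:
def circular_hour_distance(h1, h2):
    # Shortest way around a 24-hour clock between two hours.
    d = abs(h1 - h2) % 24
    return min(d, 24 - d)

print(circular_hour_distance(23, 0))   # 1, not 23
print(circular_hour_distance(6, 18))   # 12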
