Elbow Method for K-Means in Python

I'm using the K-Means algorithm (in sklearn) to cluster a 1-D array of values, and I want to decide the optimal number of clusters (K) in my script.
I'm familiar with the Elbow Method, but all the implementations I've seen require plotting the clustering WCSS values and visually spotting the "elbow" in the plot.
Is there a way to find the elbow by code (not visually), or another way to find the optimal K programmatically?

A relatively simple method is to connect the point for the minimum k value and the point for the maximum k value on the WCSS curve with a straight line, and then pick the k whose point has the maximum vertical distance between the curve and that line:
import numpy as np
from sklearn.cluster import KMeans

def select_k(X: np.ndarray, k_range: np.ndarray) -> int:
    # Within-cluster sum of squares (WCSS) for each candidate k
    wss = np.empty(k_range.size)
    for i, k in enumerate(k_range):
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        wss[i] = ((X - kmeans.cluster_centers_[kmeans.labels_]) ** 2).sum()
    # Straight line through the first and last points of the WCSS curve
    slope = (wss[0] - wss[-1]) / (k_range[0] - k_range[-1])
    intercept = wss[0] - slope * k_range[0]
    y = k_range * slope + intercept
    # The elbow is the k with the largest vertical gap between line and curve
    return k_range[(y - wss).argmax()]
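For example, with some simple synthetic 1-D data (the values and the k range below are made up purely for illustration), the function above might be used like this:
import numpy as np

# Hypothetical 1-D data with three well-separated groups, reshaped to a column vector
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(loc, 0.5, 100) for loc in (0, 5, 10)]).reshape(-1, 1)

best_k = select_k(X, k_range=np.arange(1, 11))
print(best_k)  # for data like this, the result should be close to 3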

Related

Python code for automatic execution of the Elbow curve method in K-modes clustering

I have the code for a manual (and therefore possibly error-prone) Elbow-method selection of the optimal number of clusters when K-modes clustering a binary df:
from kmodes.kmodes import KModes
import numpy as np
import matplotlib.pyplot as plt

cost = []
for num_clusters in range(1, 10):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)

y = np.array(range(1, 10))
plt.plot(y, cost)
The outcome of the for loop is a plot with the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself; I am looking for a computational way. I want the computer to do the job without me determining it "manually" - otherwise the whole script stops executing at that point.
What would be the code for selecting K automatically that would replace my manual selection?
Thank you.
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient gives a measure of how similar a data point is to its own cluster compared to other clusters. Check the sklearn doc here.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using digits dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score
import numpy as np
import matplotlib.pyplot as plt

data, labels = load_digits(return_X_y=True)

silhouette_avg = []
for num_clusters in range(2, 20):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)

plt.plot(np.arange(2, 20), silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2, 20))

print(f"Best K: {np.argmax(silhouette_avg) + 2}")
output:
Best K: 9

How to find multiple elbow points in a curve?

You can see that my curve below has multiple elbow points. Currently I am using two algorithms to find the elbow points, and both of them found the 2nd elbow point, which has an x value of 10. I'd like to find the first elbow point at x=0.6, so the question is how to find both of these elbow points.
The algorithms I have tested out are here:
find the "elbow point" on an optimization curve with Python
Finding the best trade-off point on a curve
I prefer the 2nd algorithm, which is easy to implement and has no extra parameters to adjust; it is listed here as well. Thanks for your help.
import numpy as np
import numpy.matlib  # needs to be imported separately
import pandas as pd

def find_elbow(allCoord):
    # https://stackoverflow.com/questions/18062135/combining-two-series-into-a-dataframe-in-pandas
    # allCoord is an array of (x, y) points
    nPoints = len(allCoord)
    firstPoint = allCoord[0]
    # unit vector along the line from the first point to the last point
    lineVec = allCoord[-1] - allCoord[0]
    lineVecNorm = lineVec / np.sqrt(np.sum(lineVec ** 2))
    # perpendicular distance of every point from that line
    vecFromFirst = allCoord - firstPoint
    scalarProduct = np.sum(vecFromFirst * numpy.matlib.repmat(lineVecNorm, nPoints, 1), axis=1)
    vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm)
    vecToLine = vecFromFirst - vecFromFirstParallel
    distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1))
    idxOfBestPoint = np.argmax(distToLine)
    x_elbow = allCoord[idxOfBestPoint, 0]
    y_elbow = allCoord[idxOfBestPoint, 1]
    return idxOfBestPoint, x_elbow, y_elbow

x = delta_time_seconds
y = delta_data.iloc[:, inx].values
df1 = pd.concat([pd.Series(np.log10(x).tolist()), pd.Series(y)], axis=1)
data_curve_array = df1.iloc[1:, :].to_numpy()
inx_elbow, x_elbow, y_elbow = find_elbow(data_curve_array)
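The question is left open here, but one simple idea (my own sketch of an assumption, not a verified answer) is to run find_elbow a second time on the segment of the curve that lies before the elbow found on the first pass, which may surface the earlier, smaller elbow:
def find_two_elbows(allCoord):
    # First pass finds the dominant elbow (x = 10 in the curve above)
    idx2, x2, y2 = find_elbow(allCoord)
    # Second pass only looks at the sub-curve before that elbow,
    # where the earlier elbow may show up as the new maximum-distance point
    idx1, x1, y1 = find_elbow(allCoord[: idx2 + 1])
    return (idx1, x1, y1), (idx2, x2, y2)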

How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the cosine similarity between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance metric, preprocess the distance matrix such that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) -1), where CD is your cosine distance matrix), and then set metric to precomputed, and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_distance = cosine_similarity(points)
# option 1) vectors are close to each other if they are parallel
bespoke_distance = np.abs(np.abs(cosine_distance) -1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_distance - 1)
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass -cosinesimilarity as precomputed distance matrix and use -0.75 as eps.
D) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
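A quick sketch of option D (my own illustration, assuming a dense 0/1 matrix fits in memory and reusing random points like the example above):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

points = np.random.rand(1000, 3)
threshold = 0.75  # similarity threshold, chosen here only for illustration

sim = cosine_similarity(points)
# Binary "distance": 0 where similarity exceeds the threshold, 1 otherwise,
# so distance < eps=0.5 exactly when similarity > threshold.
binary_distance = np.where(sim > threshold, 0.0, 1.0)

labels = DBSCAN(metric='precomputed', eps=0.5).fit_predict(binary_distance)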
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
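For reference, the three conversions above might look like this in code (a minimal sketch; the array of similarities is made up for illustration):
import numpy as np

cos_sim = np.array([-1.0, 0.0, 0.5, 1.0])  # example cosine similarities

dist_1 = np.abs(cos_sim - 1)           # accepted answer above
dist_2 = np.arccos(cos_sim) / np.pi    # normalized angular distance
dist_3 = 1 - (cos_sim + 1) / 2         # rescale [-1, 1] to [0, 1], then flip

print(dist_1, dist_2, dist_3)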

Elbow method in Python

I am trying to implement the elbow method in Python on my own to get the optimal number of clusters.
Therefore I summed the inertias of the different k-means runs:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sum_squared_dist = []
K = range(1, 30)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(normalized_modeling_data)
    sum_squared_dist.append(km.inertia_)

plt.plot(K, sum_squared_dist, 'bx-')
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.show()
So the next approach would be to find the point where the curve starts to flatten (which should mean that the first derivative is falling).
Is there a built-in method in numpy or scikit-learn to calculate the derivative from an array?
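numpy does provide np.diff and np.gradient for discrete differences, which can stand in for a derivative here. A rough sketch of one heuristic (my own, not a canonical method): treat the point of largest curvature, i.e. the largest second difference of the inertia curve, as the elbow:
import numpy as np

wcss = np.asarray(sum_squared_dist)

first_diff = np.diff(wcss)         # how much the inertia drops from k to k+1
second_diff = np.diff(first_diff)  # how sharply that drop changes

# Take the elbow as the k where the curvature is largest
elbow_k = K[np.argmax(second_diff) + 1]
print(elbow_k)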

Spectral Clustering in scikit-learn: print items in a cluster

I know I can get the contents of a particular cluster in K-means clustering with the following code using scikit-learn.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do I do the same for spectral clustering, given that there is no 'cluster_centers_' attribute for spectral clustering? I am trying to cluster terms in text documents.
UPDATED:
Sorry, I did not understand your question correctly the first time.
I think it's impossible to do what you want with spectral clustering, because the spectral clustering method by itself doesn't compute any centers; it doesn't need them at all. It doesn't even operate on the sample points in the raw space: spectral clustering transforms your dataset into a different subspace and then tries to cluster the points there. And I don't know how to invert this transformation mathematically.
A Tutorial on Spectral Clustering
Maybe you should ask your question as more theoretical on Math-related communities of SO.
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
spectral.fit(X)
y_pred = spectral.labels_.astype(int)  # np.int is deprecated in newer numpy; plain int works
From here
Spectral clustering does not compute any centroids. In a more practical context, if you really need a kind of 'centroid' derived by the spectral clustering algorithm, you can always compute the average (mean) of the points belonging to the same cluster, after the clustering process has finished. These would be an approximation of the centroids defined in the context of the typical k-means algorithm. The same principle also applies to other clustering algorithms that do not produce centroids (e.g. hierarchical).
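A minimal sketch of that idea (assuming X is the original data as a NumPy array and spectral is the fitted model from the snippet above):
import numpy as np

# Mean of the raw points in each cluster, used as pseudo-centroids
pseudo_centroids = np.array([
    X[spectral.labels_ == c].mean(axis=0)
    for c in np.unique(spectral.labels_)
])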
While it's true that you can't get the cluster centers for spectral clustering, you can do something close that might be useful in some cases. To explain, I'll run through the spectral clustering algorithm quickly and explain the modification.
First, let's call our dataset X = {x_1, ..., x_N}, where each point is d-dimensional (d is the number of features you have in your dataset). We can think of X as an N by d matrix. Let's say we want to put this data into k clusters. Spectral clustering first transforms the data set into another representation and then uses K-means clustering on the new representation of the data to obtain clusters.
First, the affinity matrix A is formed by using K-neighbors information. For this, we need to choose a positive integer n to construct A. The element A_{i, j} is equal to 1 if x_i and x_j are both in the list of the top n neighbors of each other, and A_{i, j} is equal to 0 otherwise. A is a symmetric N by N matrix. Next, we construct the normalized Laplacian matrix L of A, which is L = I - D^{-1/2}AD^{-1/2}, where D is the degree matrix. Then the eigenvalue decomposition is performed on L to get L = VEV^{-1}, where V is the matrix of eigenvectors of L, and E is the diagonal matrix with the eigenvalues of L on the diagonal. Since L is positive semi-definite, its eigenvalues are all non-negative real numbers. For spectral clustering, we use this to order the columns of V so that the first column of V corresponds to the smallest eigenvalue of L, and the last column to the largest eigenvalue of L.
Next, we take the first k columns of V, and view them as N points in k-dimensional space. Let's write this truncated matrix as V', and write its rows as {v'_1, ..., v'_N}, where each v'_i is k-dimensional. Then we use the K-means algorithm to cluster these points into k clusters, {C'_1, ..., C'_k}. Then the clusters are assigned to the points in the dataset X by "pulling back" the clusters from V' to X: the point x_i is in cluster C_j if and only if v'_i is in cluster C'_j.
Now, one of the main points of transforming X into V' and clustering on that representation is that often X is not spherically distributed, and V' at least comes closer to being so. Since V' is closer to being spherically distributed, the centroid will be "inside" the cluster of points it defines. We can take the point in V' that is closest to the cluster centroid for each cluster. Let's call the cluster centroids {c_1,...,c_k}. These are points in the parameter space that V' is represented in. Then for each cluster, choose the point of V' that is closest to the cluster's centroid to get k points of V'. Let's say {v'_i_1,...,v'_i_k} are the representative points closest to the cluster centroids of V'. Then choose {x_i_1,...,x_i_k} as the cluster representatives for the clusters of X.
This method might not always work how you might want, but it's at least a way to get closer to what you're wanting, and maybe you can modify it to get closer to what you need. Here's some example code to show how to do this.
Let's use some fake data provided by scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons

moons_data = make_moons(n_samples=1000, noise=0.07, random_state=0)
moons = pd.DataFrame(data=moons_data[0], columns=['x', 'y'])
moons['label_truth'] = moons_data[1]

moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(8, 8),
    s=10,
    alpha=0.7
);
I'm going to kind of cheat and use the spectral clustering method provided by scikit-learn, and then extract the affinity matrix from there.
from sklearn.cluster import SpectralClustering

sclust = SpectralClustering(
    n_clusters=2,
    random_state=42,
    affinity='nearest_neighbors',
    n_neighbors=10,
    assign_labels='kmeans'
)
sclust.fit(moons[['x', 'y']]);

moons['label_cluster'] = sclust.labels_
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(16, 14),
    s=10,
    alpha=0.7,
    c='label_cluster',
    cmap='Spectral'
);
Next, we'll compute the normalized Laplacian of the affinity matrix, and instead of computing the whole eigenvalue decomposition of the Laplacian, we use the scipy function eigsh to extract the two eigenvectors corresponding to the two smallest eigenvalues (two, since we want two clusters).
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

affinity_matrix = sclust.affinity_matrix_
lpn = laplacian(affinity_matrix, normed=True)
w, v = eigsh(lpn, k=2, which='SM')

pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    figsize=(16, 16),
    s=10,
    alpha=0.7
);
Then let's use K-means to cluster on this new representation of the data. Let's also find the two points in this new representation that are closest to the cluster centroids, and highlight them.
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt

kmeans = KMeans(
    n_clusters=2,
    random_state=42
)
kmeans.fit(v)
center_0, center_1 = kmeans.cluster_centers_
representative_index_0 = np.argmin(np.array([euclidean(a, center_0) for a in v]))
representative_index_1 = np.argmin(np.array([euclidean(a, center_1) for a in v]))

fig, ax = plt.subplots(figsize=(16, 16));
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=10,
    alpha=0.7);
pd.DataFrame(v).iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
)
And finally, let's plot the original dataset with the corresponding points highlighted.
moons['labels_lpn_kmeans'] = kmeans.labels_

fig, ax = plt.subplots(figsize=(16, 14));
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=10,
    alpha=0.7,
    c='labels_lpn_kmeans',
    cmap='Spectral'
);
moons.iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);
As we can see, the highlighted points are maybe not where we might expect them to be, but this might be useful to give some way of algorithmically choosing points from each cluster.
