Elbow method in Python

I am trying to implement the elbow method in Python on my own to find the optimal number of clusters.
To do that, I collected the inertia of each k-means run:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

sum_squared_dist = []
K = range(1, 30)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(normalized_modeling_data)
    # inertia_ is the within-cluster sum of squared distances for this k
    sum_squared_dist.append(km.inertia_)
plt.plot(K, sum_squared_dist, 'bx-')
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.show()
The next step would be to find the point where the curve starts to flatten (which should mean that the magnitude of the first derivative is dropping).
Is there a built-in method in numpy or scikit-learn to calculate the derivative of an array?
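NumPy offers np.diff (discrete differences) and np.gradient (central differences), and both work on a plain array. Below is a minimal sketch, assuming the elbow is taken to be the k with the largest second difference of the inertia curve. That is only one possible heuristic. The helper name elbow_by_second_difference is hypothetical and the inertia values are made up for illustration:

```python
import numpy as np

def elbow_by_second_difference(k_values, inertias):
    """Return the k where the inertia curve bends most sharply (heuristic)."""
    k_values = np.asarray(k_values)
    inertias = np.asarray(inertias, dtype=float)
    # np.diff(..., n=2)[i] = inertias[i+2] - 2*inertias[i+1] + inertias[i],
    # a curvature estimate centred on k_values[i + 1].
    second_diff = np.diff(inertias, n=2)
    return int(k_values[np.argmax(second_diff) + 1])

# Made-up inertia values for k = 1..8 (illustration only).
K = np.arange(1, 9)
sum_squared_dist = np.array([1000.0, 520.0, 210.0, 180.0, 160.0, 145.0, 135.0, 128.0])

# First differences: change in inertia between consecutive k values.
first_diff = np.diff(sum_squared_dist)
best_k = elbow_by_second_difference(K, sum_squared_dist)
print(first_diff)
print("Elbow at k =", best_k)
```

On noisy curves the second difference can jump around, so it may be worth smoothing the inertia values or cross-checking against another criterion.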

Related

Python code for automatic execution of the Elbow curve method in K-modes clustering

I have the following code for manual (and therefore possibly wrong) elbow-method selection of the optimal number of clusters when applying K-modes clustering to a binary DataFrame:
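import numpy as np
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cost = []
for num_clusters in list(range(1, 10)):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    # cost_ is the clustering cost for this number of clusters
    cost.append(kmode.cost_)
y = np.array([i for i in range(1, 10, 1)])
plt.plot(y, cost)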
The for loop produces a plot of the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself. I am looking for a computational way, so that the computer picks K without me determining it "manually". Otherwise execution of the whole code stops at some point.
What would be the code for selecting K automatically that would replace my manual selection?
Thank you.
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient measures how similar a data point is to its own cluster compared with other clusters. See the scikit-learn documentation for silhouette_score.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using the digits dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

data, labels = load_digits(return_X_y=True)
silhouette_avg = []
for num_clusters in list(range(2, 20)):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)
plt.plot(np.arange(2, 20), silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2, 20))
# The list starts at k=2, so shift the argmax index by 2
print(f"Best K: {np.argmax(silhouette_avg) + 2}")
output:
Best K: 9

Elbow Method for K-Means in python

I'm using the K-Means algorithm (in sklearn) to cluster a 1-D array of values, and I want to decide the optimal number of clusters (K) in my script.
I'm familiar with the Elbow Method, but all implementations require plotting the clustering WCSS values and spotting the "elbow" in the plot visually.
Is there a way to find the elbow by code (not visually), or another way to find the optimal K by code?
A relatively simple method is to draw a straight line between the points for the smallest and the largest k on the elbow curve, and then pick the k at which the vertical distance between the curve and that straight line is largest:
import numpy as np
from sklearn.cluster import KMeans

def select_k(X: np.ndarray, k_range: np.ndarray) -> int:
    wss = np.empty(k_range.size)
    for i, k in enumerate(k_range):
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(X)
        # Within-cluster sum of squares for this k
        wss[i] = ((X - kmeans.cluster_centers_[kmeans.labels_]) ** 2).sum()
    # Straight line through the first and last points of the curve
    slope = (wss[0] - wss[-1]) / (k_range[0] - k_range[-1])
    intercept = wss[0] - slope * k_range[0]
    y = k_range * slope + intercept
    # k where the curve lies farthest below that line
    return k_range[(y - wss).argmax()]
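The same line-distance rule can be applied to any cost curve that has already been computed (for example a list of K-modes cost_ values), without refitting a model inside the function. A minimal sketch under that assumption. The helper name elbow_by_chord_distance is hypothetical and the cost values are made up:

```python
import numpy as np

def elbow_by_chord_distance(k_values, costs):
    """Return the k whose cost lies farthest below the line joining the end points."""
    k_values = np.asarray(k_values, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Straight line (chord) through the first and last points of the curve.
    slope = (costs[-1] - costs[0]) / (k_values[-1] - k_values[0])
    chord = costs[0] + slope * (k_values - k_values[0])
    # Vertical gap between the chord and the curve at every k.
    return int(k_values[np.argmax(chord - costs)])

# Made-up cost values for k = 1..9 (illustration only).
k_values = np.arange(1, 10)
cost = [900.0, 600.0, 380.0, 300.0, 270.0, 250.0, 235.0, 225.0, 218.0]
chosen_k = elbow_by_chord_distance(k_values, cost)
print("Elbow at k =", chosen_k)
```

Because only the vertical distance is used, the result depends on the range of k that was scanned, so a wider or narrower range may move the selected elbow.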

To determine the optimal k-mean for given dataset using python

I am pretty new to Python and to clustering. Right now I have a task to analyze a set of data and determine its optimal k for K-means using the elbow and silhouette methods.
As shown in the picture, my dataset has three features: the first is the weight of the tested person, the second is the blood cholesterol content of the person, and the third is the gender of the tested person ('0' means female, '1' means male).
First I used the elbow method to see the WCSS value at different values of k:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
This gives the plot below:
Then, I used the silhouette method to look at the silhouette score:
from sklearn.metrics import silhouette_score
sil = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k)
    # fit_predict fits the model and returns the labels in one call
    preds = kmeans.fit_predict(data)
    sil.append(silhouette_score(data, preds, metric='euclidean'))
plt.plot(range(2, 6), sil)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Sil')
plt.show()
for i in range(len(sil)):
    print(str(i + 2) + ":" + str(sil[i]))
And I got the following results:
Could anybody suggest how I can pick the optimal k? From some light research, some people say the higher the silhouette score the better (so in my case the number of clusters should be 2?), but in other cases people do not simply pick the number of clusters with the highest score.
Another thought: since I included gender as one feature, should I first split my data into two classes by gender and then cluster them separately?
The K-means algorithm is very sensitive to the scale on which your features are measured. In your case gender is a binary variable that only takes the values 0 and 1, while the other two features are measured on a much larger scale. I recommend normalizing your data first and then redoing the plots, which may give more consistent results between the elbow curve and the silhouette method.
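A minimal sketch of that normalization step, assuming StandardScaler (zero mean, unit variance per column) is an acceptable choice of scaling. The data array is a randomly generated stand-in for the weight / cholesterol / gender table, not the real dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data: weight, cholesterol, gender (0/1).
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(70, 12, size=200),
    rng.normal(200, 30, size=200),
    rng.integers(0, 2, size=200),
])

# Rescale every column so that no feature dominates the Euclidean
# distance just because of the units it is measured in.
scaled = StandardScaler().fit_transform(data)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("Highest silhouette score at k =", best_k)
```

Whether a binary column such as gender should be scaled the same way, or handled separately, is a modelling decision that this sketch does not settle.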

How to identify and separate clusters using K Means in python?

I'm trying to find clusters in a data set using the K-means method. I got the number of clusters from the elbow method, but I don't know how to identify and separate these clusters for further analysis on each cluster, such as applying linear regression to each one. My data set contains more than two variables.
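I got the number of clusters from the elbow method:
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Applying KMeans
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    # Distance from each point to its nearest centroid
    nearest = np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)
    # Distortion: mean squared distance to the nearest centroid
    distortions.append(np.sum(nearest ** 2) / df.shape[0])
# Elbow method for number of clusters
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()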
Suppose you found that the value k is the optimal number of clusters for your data using the Elbow method.
So you can use the following code to divide the data into different clusters:
kmeans = KMeans(n_clusters=k, random_state=0).fit(df)
y = kmeans.labels_  # cluster index assigned to each data point
y_pred = kmeans.predict(<unknown_sample>)  # only needed to assign a new sample to a cluster
After that you can separate the data based on the clusters as:
for i in range(k):
    # Boolean-mask row selection works for a pandas DataFrame;
    # for a NumPy array, df[y == i, :] is equivalent.
    cluster_i = df[y == i]  # subset of the data points assigned to cluster i
    # Do analysis on this subset of datapoints.
You can find more details related to different parameters in this link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
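As an illustration of the per-cluster analysis, here is a minimal sketch that fits one linear regression per cluster with pandas groupby. The column names x1, x2 and target, the choice k = 3, and the random data are all assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Made-up data: two features and a target that depends linearly on them.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=300), "x2": rng.normal(size=300)})
df["target"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=300)

k = 3  # assumed to have been read off the elbow plot
features = ["x1", "x2"]
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(df[features])
df["cluster"] = kmeans.labels_  # store the cluster index next to each row

# One regression model per cluster, keyed by cluster index.
models = {}
for cluster_id, cluster_df in df.groupby("cluster"):
    model = LinearRegression().fit(cluster_df[features], cluster_df["target"])
    models[cluster_id] = model
    print(cluster_id, len(cluster_df), model.coef_)
```

Storing the labels as a column keeps the cluster assignment attached to each row, which makes later filtering (df[df["cluster"] == i]) straightforward.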

Spectral Clustering Scikit learn print items in Cluster

I know I can get the contents of a particular cluster in K-means clustering with the following code using scikit-learn.
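order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names_out() is the current name (get_feature_names() was removed in scikit-learn 1.2)
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()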
How do I do the same for spectral clustering, given that it has no 'cluster_centers_' attribute? I am trying to cluster terms in text documents.
UPDATED:
Sorry, I didn't understand your question correctly the first time.
I think it's impossible to do what you want with spectral clustering, because the method itself doesn't compute any centers; it doesn't need them at all. It doesn't even operate on the sample points in the raw space: spectral clustering transforms your dataset into a different subspace and then clusters the points in that subspace. I don't know how to invert this transformation mathematically.
A Tutorial on Spectral Clustering
Maybe you should ask this as a more theoretical question on one of the math-related Stack Exchange communities.
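import numpy as np
from sklearn import cluster

spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
spectral.fit(X)
# Plain int works on all NumPy versions (np.int was removed in NumPy 1.24)
y_pred = spectral.labels_.astype(int)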
From here
Spectral clustering does not compute any centroids. In a more practical context, if you really need some kind of 'centroids' derived from the spectral clustering result, you can always compute the average (mean) of the points belonging to the same cluster once the clustering has finished. These would approximate the centroids as defined in the typical k-means algorithm. The same principle also applies to other clustering algorithms that do not produce centroids (e.g. hierarchical clustering).
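A minimal sketch of that idea, assuming dense numeric features. The two-blob dataset and the default rbf affinity are choices made only for this illustration:

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Made-up data: two well-separated blobs in 2-D.
X, _ = make_blobs(n_samples=150, centers=[[-2, 0], [2, 0]],
                  cluster_std=0.5, random_state=0)

spectral = SpectralClustering(n_clusters=2, random_state=0)
labels = spectral.fit_predict(X)

# "Pseudo-centroids": the mean of the original points in each cluster.
cluster_ids = np.unique(labels)
pseudo_centroids = np.vstack([X[labels == c].mean(axis=0) for c in cluster_ids])
print(pseudo_centroids)
```

For a dense document-term matrix, the same per-cluster mean vectors could be sorted with argsort to list the highest-weighted terms of each cluster, in the spirit of the K-means snippet from the question.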
While it's true that you can't get the cluster centers for spectral clustering, you can do something close that might be useful in some cases. To explain, I'll run through the spectral clustering algorithm quickly and explain the modification.
First, let's call our dataset X = {x_1, ..., x_N}, where each point is d-dimensional (d is the number of features you have in your dataset). We can think of X as an N by d matrix. Let's say we want to put this data into k clusters. Spectral clustering first transforms the data set into another representation and then uses K-means clustering on the new representation of the data to obtain clusters.
First, the affinity matrix A is formed by using K-neighbors information. For this, we need to choose a positive integer n to construct A. The element A_{i, j} is equal to 1 if x_i and x_j are both in the list of the top n neighbors of each other, and A_{i, j} is equal to 0 otherwise. A is a symmetric N by N matrix. Next, we construct the normalized Laplacian matrix L of A, which is L = I - D^{-1/2}AD^{-1/2}, where D is the degree matrix.
Then the eigenvalue decomposition is performed on L to get L = VEV^{-1}, where V is the matrix of eigenvectors of L, and E is the diagonal matrix with the eigenvalues of L on the diagonal. Since L is positive semi-definite, its eigenvalues are all non-negative real numbers. For spectral clustering, we use this to order the columns of V so that the first column of V corresponds to the smallest eigenvalue of L, and the last column to the largest eigenvalue of L.
Next, we take the first k columns of V, and view them as N points in k-dimensional space. Let's write this truncated matrix as V', and write its rows as {v'_1, ..., v'_N}, where each v'_i is k-dimensional. Then we use the K-means algorithm to cluster these points into k clusters {C'_1,...,C'_k}. Then the clusters are assigned to the points in the dataset X by "pulling back" the clusters from V' to X: the point x_i is in cluster C_j if and only if v'_i is in cluster C'_j.
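The steps just described can be sketched on a tiny example with plain NumPy. This is an illustration under stated assumptions, not scikit-learn's implementation: the affinity matrix A is written by hand (two groups of three mutually connected points) instead of being built from nearest neighbors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hand-made affinity matrix: nodes 0-2 are connected to each other,
# nodes 3-5 are connected to each other, and there are no edges between groups.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)

k = 2
degrees = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degrees)
# Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}
L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order,
# so the first k columns of V belong to the k smallest eigenvalues.
eigenvalues, V = np.linalg.eigh(L)
V_prime = V[:, :k]  # row i is the point v'_i in k-dimensional space

# Cluster the rows of V'; the label of v'_i is "pulled back" to x_i.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(V_prime)
print(eigenvalues.round(3))
print(labels)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping itself is meaningful.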
Now, one of the main points of transforming X into V' and clustering on that representation is that often X is not spherically distributed, and V' at least comes closer to being so. Since V' is closer to being spherically distributed, the centroid will be "inside" the cluster of points it defines. We can take the point in V' that is closest to the cluster centroid for each cluster. Let's call the cluster centroids {c_1,...,c_k}. These are points in the parameter space that V' is represented in. Then for each cluster, choose the point of V' that is closest to the cluster's centroid to get k points of V'. Let's say {v'_i_1,...,v'_i_k} are the representative points closest to the cluster centroids of V'. Then choose {x_i_1,...,x_i_k} as the cluster representatives for the clusters of X.
This method might not always work how you might want, but it's at least a way to get closer to what you're wanting, and maybe you can modify it to get closer to what you need. Here's some example code to show how to do this.
Let's use some fake data provided by scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
moons_data = make_moons(n_samples=1000, noise=0.07, random_state=0)
moons = pd.DataFrame(data=moons_data[0], columns=['x', 'y'])
moons['label_truth'] = moons_data[1]
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(8, 8),
    s=10,
    alpha=0.7
);
I'm going to kind of cheat and use the spectral clustering method provided by scikit-learn, and then extract the affinity matrix from there.
from sklearn.cluster import SpectralClustering
sclust = SpectralClustering(
    n_clusters=2,
    random_state=42,
    affinity='nearest_neighbors',
    n_neighbors=10,
    assign_labels='kmeans'
)
sclust.fit(moons[['x', 'y']]);
moons['label_cluster'] = sclust.labels_
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(16, 14),
    s=10,
    alpha=0.7,
    c='label_cluster',
    cmap='Spectral'
);
Next, we'll compute the normalized Laplacian of the affinity matrix, and instead of computing the whole eigenvalue decomposition of the Laplacian, we use the scipy function eigsh to extract the two (since we are wanting two clusters) eigenvectors corresponding to the two smallest eigenvalues.
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
affinity_matrix = sclust.affinity_matrix_
lpn = laplacian(affinity_matrix, normed=True)
w, v = eigsh(lpn, k=2, which='SM')
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    figsize=(16, 16),
    s=10,
    alpha=0.7
);
Then let's use K-means to cluster on this new representation of the data. Let's also find the two points in this new representation that are closest to the cluster centroids, and highlight them.
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt
kmeans = KMeans(
    n_clusters=2,
    random_state=42
)
kmeans.fit(v)
center_0, center_1 = kmeans.cluster_centers_
# Index of the embedded point closest to each K-means centroid
representative_index_0 = np.argmin(np.array([euclidean(a, center_0) for a in v]))
representative_index_1 = np.argmin(np.array([euclidean(a, center_1) for a in v]))
fig, ax = plt.subplots(figsize=(16, 16));
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=10,
    alpha=0.7
);
pd.DataFrame(v).iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);
And finally, let's plot the original dataset with the corresponding points highlighted.
moons['labels_lpn_kmeans'] = kmeans.labels_
fig, ax = plt.subplots(figsize=(16, 14));
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=10,
    alpha=0.7,
    c='labels_lpn_kmeans',
    cmap='Spectral'
);
moons.iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);
As we can see, the highlighted points are maybe not where we might expect them to be, but this might be useful to give some way of algorithmically choosing points from each cluster.
