I'm trying to find clusters in a data set using the K-means method. I got the number of clusters from the elbow method, but I don't know how to identify and separate these clusters so I can run further analysis on each one, for example fitting a linear regression per cluster. My data set contains more than two variables.
Applying K-means for the elbow method:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)  # fit once; the second fit() call was redundant
    # mean distance of each point to its nearest cluster center (distortion)
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])
Elbow method for the number of clusters:
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Suppose you found, using the elbow method, that k is the optimal number of clusters for your data.
So you can use the following code to divide the data into different clusters:
kmeans = KMeans(n_clusters=k, random_state=0).fit(df)
y = kmeans.labels_ # Will return the cluster numbers for each datapoint
y_pred = kmeans.predict(<unknown_sample>) # If you want to predict the cluster of a new sample
After that you can separate the data based on the clusters as:
for i in range(k):
    cluster_i = df[y == i]  # subset of the datapoints assigned to cluster i (boolean mask works for a DataFrame or array)
    # Do analysis on this subset of datapoints.
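Since the question mentions running a linear regression on each cluster, here is a minimal sketch of that step; the names feature_cols and target_col are hypothetical placeholders for your own predictors and target:

from sklearn.linear_model import LinearRegression

# Hypothetical column names; replace with your own predictors and target.
feature_cols = ['x1', 'x2']
target_col = 'target'

models = {}
for i in range(k):
    cluster_i = df[y == i]
    reg = LinearRegression().fit(cluster_i[feature_cols], cluster_i[target_col])
    models[i] = reg
    print(f"Cluster {i}: R^2 = {reg.score(cluster_i[feature_cols], cluster_i[target_col]):.3f}")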
You can find more details related to different parameters in this link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Related: I have the code for a manual (and therefore possibly wrong) elbow-method selection of the optimal number of clusters when K-modes clustering a binary df:
import numpy as np
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cost = []
for num_clusters in range(1, 10):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)

y = np.arange(1, 10)
plt.plot(y, cost)
An outcome of the for loop is a plot with the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself; I am looking for some computational way. I want the computer to do the job without me determining it "manually", since otherwise execution of the whole code stops at that point.
What would be the code for selecting the K automatically that would replace my manual selection?
Thank you.
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient gives a measure of how similar a data point is to its own cluster compared to other clusters; check the sklearn doc here.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the value of k with the best score (closest to 1).
Sample using digits dataset.
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import load_digits
data, labels = load_digits(return_X_y=True)
from sklearn.metrics import silhouette_score
silhouette_avg = []
for num_clusters in list(range(2,20)):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)
import matplotlib.pyplot as plt
plt.plot(np.arange(2,20),silhouette_avg,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2,20))
print (f"Best K: {np.argmax(silhouette_avg)+2}")
output:
Best K: 9
I am pretty new to Python and the clustering stuff. Right now I have a task to analyze a set of data and determine the optimal k for K-means using the elbow and silhouette methods.
My dataset has three features: the weight of the tested person, the person's blood cholesterol level, and the person's gender ('0' means female, '1' means male).
I first use the elbow method to look at the WCSS value at different k values:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
And I get the following elbow plot:
Then, I used the silhouette method to look at the silhouette score:
from sklearn.metrics import silhouette_score
sil = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k)
    preds = kmeans.fit_predict(data)  # one fit is enough
    sil.append(silhouette_score(data, preds, metric='euclidean'))
plt.plot(range(2, 6), sil)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Sil')
plt.show()
for i in range(len(sil)):
    print(str(i + 2) + ": " + str(sil[i]))
And I got the following results:
Could anybody suggest how I can pick the optimal k? I did some light research; some say the higher the silhouette score the better (in my case the number of clusters should then be 2?), but in other cases people do not simply use the number of clusters with the highest score.
Another thought is that I included gender as one of the features here; should I first divide my data into two classes by gender and then cluster them separately?
The K-means algorithm is very susceptible to the range in which your features are measured. In your case gender is a binary variable that only takes the values 0 and 1, while the other two features are measured on a much larger scale. I recommend normalizing your data first and then redoing the plots; that should produce results that are more consistent between your elbow curve and the silhouette method.
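As a minimal sketch of that normalization step (assuming data is the array or DataFrame with the three features from the question), you could scale the features before rerunning both loops:

from sklearn.preprocessing import StandardScaler

# Put every feature on a comparable scale so weight and cholesterol
# do not dominate the 0/1 gender column in the Euclidean distance.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Re-run the elbow and silhouette code above on data_scaled instead of data.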
I'm doing clustering of text data with K-means in Python's scikit-learn.
I have a problem with vectorizing the data because I get very different results when I'm using different vectorizers.
I want to cluster text data (the data are Instagram comments about US politics) and find the keywords for every cluster, but I do not know which vectorizer I should use.
For example, when I'm using:
cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
# Should I scale the x values?
# x = scale(x, with_mean=False)
# If I do this, the plot collapses to a single dot and the silhouette_score is less than 0.01
I get that my optimal number of clusters is 2, based on a silhouette_score of 0.87, and my graph looks like this:
And when I'm using:
cv = TfidfVectorizer(analyzer = 'word',max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
I get that my optimal number of clusters is 13, based on a silhouette_score of 0.0159, and my graph looks like this:
This is how I'm doing the clustering:
my_list = []
list_of_clusters = []
for i in range(2,15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    cluster_labels = kmeans.fit_predict(x)  # a single fit also fills kmeans.inertia_
    my_list.append(kmeans.inertia_)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print(round(silhouette_avg, 2))
    list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2,15),my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2
print('Number of clusters: ', number_of_clusters)
kmeans = KMeans(n_clusters = number_of_clusters, init = 'k-means++', random_state = 42)
kmeans.fit(x)
And this is how I plot the data:
# reduce the features to 2D
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=0)
reduced_features = pca.fit_transform(x.toarray())
# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=kmeans.predict(x), s=3)
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=50, c='r')
plt.show()
I think this is a very big difference, so I'm sure I'm doing something wrong, but I do not know what.
Thanks for your help :)
The Silhouette score, from sklearn:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels is 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Your best value is near 0, which means you don't have a good clustering.
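If you want to see where that near-zero average comes from, here is a small sketch using silhouette_samples (mentioned in the quoted docs) to inspect the per-sample values, assuming x and kmeans are the feature matrix and the final fitted model from your code:

from sklearn.metrics import silhouette_samples

# Per-sample silhouette values; many values at or below 0 indicate
# overlapping clusters or samples assigned to the wrong cluster.
per_sample = silhouette_samples(x, kmeans.labels_)
for i in range(kmeans.n_clusters):
    vals = per_sample[kmeans.labels_ == i]
    print(f"Cluster {i}: mean = {vals.mean():.3f}, share below 0 = {(vals < 0).mean():.0%}")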
A good clustering is a clustering in which:
the intra-class similarity is high (the documents in your cluster are similar).
the inter-class similarity is low (the documents from two clusters are not similar).
And of course, the clusters should tell you something interesting. For example, one cluster could contain all the words linked to a specific domain. But this is work that you do after the clusters are built.
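Since you also want the keywords for every cluster, one common way to inspect them is to sort each cluster center by weight and map the indices back to the vectorizer's vocabulary; a minimal sketch, assuming cv and kmeans are your fitted vectorizer and model:

import numpy as np

# Each cluster center lives in the vectorizer's feature space, so its
# largest components point at the most characteristic terms of that cluster.
terms = cv.get_feature_names_out()
order = np.argsort(kmeans.cluster_centers_, axis=1)[:, ::-1]
for i, row in enumerate(order):
    top_terms = [terms[j] for j in row[:10]]
    print(f"Cluster {i}: {', '.join(top_terms)}")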
Changing the vectorizing means that you are changing the features on which you are doing clustering.
From the python sklearn doc, CountVectorizer:
CountVectorizer implements both tokenization and occurrence counting in a single class
Basically you have the token counts as features.
Instead for TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF features.
Which means that you are using the Term Frequency - Inverse Document Frequency formula as features for your documents, which is calculated as:
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents with term t in it)
TF-IDF(t) = TF(t) * IDF(t)
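To see the difference concretely, here is a small sketch comparing the two feature matrices on a made-up toy corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the president spoke today",
        "the senate voted today",
        "voters like the president"]

counts = CountVectorizer(stop_words='english').fit_transform(docs)
tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Raw counts are integers, while tf-idf rows are reweighted and L2-normalized
# by default, so K-means sees the documents in very different positions.
print(counts.toarray())
print(tfidf.toarray().round(2))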
I am trying to implement the elbow method in Python on my own to get the optimal number of clusters.
Therefore I summed up the inertias of the different k-means runs:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

sum_squared_dist = []
K = range(1, 30)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(normalized_modeling_data)
    sum_squared_dist.append(km.inertia_)

plt.plot(K, sum_squared_dist, 'bx-')
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.show()  # note the parentheses; plt.show without them does nothing
So the next approach would be to find the point where the curve starts to flatten (which should mean that the first derivative is falling).
Is there a built-in method in numpy or scikit-learn to calculate the derivative of an array?
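As far as I know there is no built-in elbow finder, but np.diff gives discrete first and second differences, which is one rough way to locate the bend; a minimal sketch, assuming sum_squared_dist is the list from the loop above:

import numpy as np

inertias = np.array(sum_squared_dist)
first_diff = np.diff(inertias)        # how much the curve drops from k to k+1
second_diff = np.diff(first_diff)     # how quickly those drops shrink
elbow_k = np.argmax(second_diff) + 2  # +2 because K starts at 1 and diff was applied twice
print("Heuristic elbow at k =", elbow_k)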
After reading this post about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10,000,000 points, though only 8,000 unique ones. Therefore, I initially thought that, to speed things up, I'd use unique points only. It seems like this is a bad idea.
To keep the computational time down, the post suggests adding weights to each point. How can this be implemented in Python?
Using KMeans from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the data that is passed in as sample weights, whereas X holds the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans  # For applying KMeans

##--------------------------------------------------------------------------------------------------------##
# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Running k-means clustering with the 'X' array as the input coordinates and the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Storing the results together with the respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})

# Printing the count of points allotted to each cluster
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point gets an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for a cluster, take the weighted average of all points of that cluster (instead of the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
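A minimal sketch of that frequency-weighting idea using scikit-learn's sample_weight support, so the weighted centroid update is handled for you (assuming points is the full array of rows, a name used here only for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Collapse the data to its unique rows and count how often each one occurs;
# the counts become the weights of the unique points.
unique_points, counts = np.unique(points, axis=0, return_counts=True)

# Fitting on the ~8000 weighted unique points is equivalent to fitting on all
# duplicates, because the centroid update is a weighted average.
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0)
labels_unique = kmeans.fit_predict(unique_points, sample_weight=counts)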