I am pretty new to Python and to clustering. Right now I have a task to analyze a data set and determine its optimal number of clusters (k) for K-means by using the elbow and silhouette methods.
As shown in the picture, my dataset has three features: the weight of the tested person, the blood cholesterol content of the person, and the gender of the tested person ('0' means female, '1' means male).
I first use the elbow method to look at the WCSS value for different values of k:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
And I get the plot below:
Then, I used the silhouette method to look at the silhouette score:
from sklearn.metrics import silhouette_score
sil = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters = k).fit(data)
    preds = kmeans.fit_predict(data)
    sil.append(silhouette_score(data, preds, metric = 'euclidean'))
plt.plot(range(2, 6), sil)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Sil')
plt.show()
for i in range(len(sil)):
    print(str(i+2) +":"+ str(sil[i]))
And I got the following results:
Could anybody suggest how I can pick the optimal k? I did some light research: some people say the higher the silhouette score the better (so in my case the number of clusters should be 2?), but in other cases people do not simply pick the number of clusters with the highest score.
Another thought: since I included gender as a feature, should I first divide my data into two classes by gender and then cluster them separately?
The K-means algorithm is very sensitive to the range in which your features are measured. In your case gender is a binary variable that only takes the values 0 and 1, while the other two features are measured on a much larger scale. I recommend normalizing your data first and then doing the plots again, which could produce consistent results between your elbow curve and the silhouette method.
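A minimal sketch of that normalization step, assuming the features are numeric columns of an array. The synthetic `data` below is a made-up stand-in for the real weight / cholesterol / gender table, and `StandardScaler` is only one of several reasonable scalers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): weight, cholesterol, gender.
rng = np.random.default_rng(0)
n = 200
gender = rng.integers(0, 2, size=n)
weight = rng.normal(60 + 15 * gender, 8, size=n)
cholesterol = rng.normal(190, 30, size=n)
data = np.column_stack([weight, cholesterol, gender])

# Rescale every column to zero mean and unit variance so no feature dominates
# the Euclidean distances that K-means relies on.
scaled = StandardScaler().fit_transform(data)

# Recompute the silhouette score for each k on the scaled data.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scaled)
    scores[k] = silhouette_score(scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The same `scaled` array can be passed to the elbow loop, so both plots are computed on identical inputs.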
I have the following code for a manual, and therefore possibly wrong, elbow-method selection of the optimal number of clusters when running K-modes clustering on a binary DataFrame:
cost = []
for num_clusters in list(range(1,10)):
    kmode = KModes(n_clusters=num_clusters, init = "Huang", n_init = 10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)
y = np.array([i for i in range(1,10,1)])
plt.plot(y,cost)
The outcome of the for loop is a plot with the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself; I am looking for a computational way. I want the computer to do the job without me determining it "manually". Otherwise the whole code stops executing at some point.
Thank you.
What would be the code for selecting the K automatically that would replace my manual selection?
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient measures how similar a data point is to its own cluster compared to other clusters. See the scikit-learn documentation for silhouette_score.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using digits dataset.
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import load_digits
data, labels = load_digits(return_X_y=True)
from sklearn.metrics import silhouette_score
silhouette_avg = []
for num_clusters in list(range(2,20)):
    kmeans = KMeans(n_clusters=num_clusters, init = "k-means++", n_init = 10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)
import matplotlib.pyplot as plt
plt.plot(np.arange(2,20),silhouette_avg,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2,20))
print (f"Best K: {np.argmax(silhouette_avg)+2}")
output:
Best K: 9
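If you would rather automate the elbow itself (for example because the data are categorical and the silhouette route does not apply), one common heuristic is to take the point of the cost curve that lies farthest from the straight line joining its first and last points. The sketch below is only an illustration of that heuristic, not a library API: `pick_elbow` is a hypothetical helper name and the cost values are made up.

```python
import numpy as np

def pick_elbow(ks, costs):
    """Return the k whose point lies farthest from the chord joining the curve's ends."""
    ks = np.asarray(ks, dtype=float)
    costs = np.asarray(costs, dtype=float)
    # Rescale both axes to [0, 1] so the distance is not dominated by the cost scale.
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (costs - costs.min()) / (costs.max() - costs.min())
    # Perpendicular distance of each point from the line through the first and last points.
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    dist = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(dist)])

# Made-up cost values with a visible bend at k = 3 (illustrative only).
ks = list(range(1, 10))
cost = [1000, 600, 300, 270, 250, 235, 225, 218, 212]
print(pick_elbow(ks, cost))
```

With a real run you would pass the `cost` list collected in the loop. A curve with no clear bend still returns some k, so it is worth checking the plot at least once.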
I'm trying to use k-means on a dataset available at this link, using only the variables about the client. The problem is that 7 of the 8 variables are categorical, so I've used a one-hot encoder on them.
To use the elbow method to select an ideal number of clusters, I've run KMeans for 2 to 22 clusters and plotted the inertia_ values. But the shape wasn't anything like an elbow; it looked more like a straight line.
Am I doing something wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
bank = pd.read_csv('bank-additional-full.csv', sep=';') #available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
# 1. selecting only informations about the client
cli_vars = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
bank_cli = bank[cli_vars].copy()
#2. applying one hot encoder to categorical variables
X = bank_cli[['job', 'marital', 'education', 'default', 'housing', 'loan']]
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
X_2.values
enc = preprocessing.OneHotEncoder()
enc.fit(X_2)
one_hot_labels = enc.transform(X_2).toarray()
one_hot_labels.shape #(41188, 33)
#3. concatenating numeric and categorical variables
X = np.concatenate((bank_cli.values[:,0].reshape((41188,1)),one_hot_labels), axis = 1)
X.shape
X = X.astype(float)
X_fit = StandardScaler().fit_transform(X)
X_fit
#4. function to calculate k-means for 2 to 22 clusters
def calcular_cotovelo(data):
    wcss = []
    for i in range(2, 23):
        kmeans = KMeans(init = 'k-means++', n_init= 12, n_clusters = i)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    return wcss
cotovelo = calcular_cotovelo(X_fit)
#5. plot to see the elbow to select the ideal number of clusters
plt.plot(cotovelo)
plt.show()
This is the plot of the inertia to select the clusters. It's not in an elbow shape, and the values are very high.
K-means is not suited to categorical data. You should look at k-prototypes instead, which combines k-modes and k-means and is able to cluster mixed numerical and categorical data.
An implementation of k-prototypes is available in Python.
If you consider only the numerical variable, however, you can see an elbow with the k-means criterion:
To understand why you do not see any elbow (with k-means on both numerical and categorical data), you can look at the number of points per cluster. Each time you increase the number of clusters, a new cluster is formed from only a few points that were in a big cluster at the previous step, so the criterion is only slightly lower than at the previous step.
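The number of points per cluster can be checked with a few lines. This is a rough sketch on random one-hot-like data, which is an assumption standing in for the encoded bank features; with the real data you would pass `X_fit` instead of `X`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random binary data standing in for the one-hot encoded features (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 12)).astype(float)

sizes_by_k = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # np.bincount counts how many samples received each cluster label.
    sizes = np.bincount(labels, minlength=k)
    sizes_by_k[k] = sorted(sizes.tolist(), reverse=True)
    print(k, sizes_by_k[k])
```

Comparing the printed sizes from one k to the next shows whether each extra cluster only peels a handful of points off a large one.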
I'm doing clustering of text data with KMeans in Python's scikit-learn.
I have a problem with vectorizing the data because I get very different results when I use different vectorizers.
I want to cluster text data (Instagram comments about USA politics) and find the keywords for every cluster, but I do not know which vectorizer I should use.
For example, when I'm using:
cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#should I scale x value?
#x = scale(x, with_mean=False)
#If I do this I get the graph just one dot and silhouette_score less than 0.01
I get that my optimal number of clusters is 2, based on silhouette_score, which gives me a score of 0.87. And my graph looks like this:
And when I'm using:
cv = TfidfVectorizer(analyzer = 'word',max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
I get that my optimal number of clusters is 13, based on silhouette_score, which gives me a score of 0.0159. And my graph looks like this:
This is how I'm doing the clustering:
my_list = []
list_of_clusters = []
for i in range(2,15):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    cluster_labels = kmeans.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print(round(silhouette_avg,2))
    list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2,15),my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2
print('Number of clusters: ', number_of_clusters)
kmeans = KMeans(n_clusters = number_of_clusters, init = 'k-means++', random_state = 42)
kmeans.fit(x)
And this is how I plot the data:
# reduce the features to 2D
pca = PCA(n_components=2, random_state=0)
reduced_features = pca.fit_transform(x.toarray())
# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=kmeans.predict(x), s=3)
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=50, c='r')
plt.show()
I think this is a very big difference, so I'm sure that I'm doing something wrong, but I do not know what.
Thanks for your help :)
The Silhouette score, from sklearn:
The Silhouette Coefficient is calculated using the mean intra-cluster
distance (a) and the mean nearest-cluster distance (b) for each
sample. The Silhouette Coefficient for a sample is (b - a) / max(a,
b). To clarify, b is the distance between a sample and the nearest
cluster that the sample is not a part of. Note that Silhouette
Coefficient is only defined if number of labels is 2 <= n_labels <=
n_samples - 1.
This function returns the mean Silhouette Coefficient over all
samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate
overlapping clusters. Negative values generally indicate that a sample
has been assigned to the wrong cluster, as a different cluster is more
similar.
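To make the quoted definition concrete, here is a small hand-rolled version of (b - a) / max(a, b) using Euclidean distances. It is a sketch for illustration only, not how scikit-learn implements it; `silhouette_sample` is a hypothetical helper name, the four points are made up, and returning 0 for a single-member cluster is a convention assumed here:

```python
import numpy as np

def silhouette_sample(X, labels, i):
    """Silhouette coefficient (b - a) / max(a, b) for the sample at index i."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Euclidean distance from sample i to every sample (including itself, which is 0).
    dists = np.linalg.norm(X - X[i], axis=1)
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # assumed convention for a cluster with a single member
    # a: mean distance to the other members of the sample's own cluster.
    a = dists[own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the members of any other cluster.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

# Two tight, well-separated pairs of points (made-up data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
scores = [silhouette_sample(X, labels, i) for i in range(len(X))]
print(np.mean(scores))
```

Because the two pairs are far apart relative to their spread, a is small and b is large, so every score lands close to 1.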
Your best value is near 0, which means you don't have a good clustering.
A good clustering is a clustering in which:
the intra-class similarity is high (the documents in your cluster are similar).
the inter-class similarity is low (the documents from two clusters are not similar).
And of course, the clusters should tell you something interesting. For example, one cluster could contain all the words linked to a specific domain. But this is work that you do after the clusters are done.
Changing the vectorizer means that you are changing the features on which you are doing the clustering.
From the python sklearn doc, CountVectorizer:
CountVectorizer implements both tokenization and occurrence counting
in a single class
Basically you have the token counts as features.
Instead for TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF features.
Which means that you are using the Term Frequency - Inverse Document Frequency formula as features for your documents, which is calculated as:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
TF-IDF = TF(t) * IDF(t)
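The formula above can be worked through by hand on a toy corpus. This is a sketch of the textbook definition only; the three documents and the helper names `tf`, `idf` and `tf_idf` are made up for illustration. Note that scikit-learn's TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these:

```python
import math
from collections import Counter

# Toy corpus (made up); each document is split on whitespace.
docs = [
    "the vote was close",
    "the senate vote",
    "taxes and the budget",
]
tokenized = [d.split() for d in docs]

def tf(term, tokens):
    # Share of the document's tokens that are `term`.
    return Counter(tokens)[term] / len(tokens)

def idf(term, corpus):
    # log_e(total documents / documents containing `term`).
    # Assumes the term occurs in at least one document.
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# Score three terms of the second document against the whole corpus.
for term in ("the", "vote", "senate"):
    print(term, round(tf_idf(term, tokenized[1], tokenized), 4))
```

A word that appears in every document ("the") gets an IDF of 0 and drops out, while a word unique to one document ("senate") gets the highest weight. This is why the TF-IDF features, and therefore the clusters, differ so much from raw counts.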
I'm trying to find clusters in a data set using the K-means method. I got the number of clusters from the elbow method, but I don't know how to identify and separate these clusters for further analysis on each cluster, such as applying linear regression to each cluster. My data set contains more than two variables.
I got the number of clusters from the elbow method
Applying Kmeans
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    kmeanModel.fit(df)
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1))**2 / df.shape[0])
Elbow method for number of clusters
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Suppose you found that the value k is the optimal number of clusters for your data using the Elbow method.
So you can use the following code to divide the data into different clusters:
kmeans = KMeans(n_clusters=k, random_state=0).fit(df)
y = kmeans.labels_ # Will return the cluster numbers for each datapoint
y_pred = kmeans.predict(<unknown_sample>) # If want to predict for a new sample
After that you can separate the data based on the clusters as:
for i in range(k):
    cluster_i = df[y == i] # Subset of the datapoints that have been assigned to the cluster i (boolean mask; works for a DataFrame or a NumPy array)
    # Do analysis on this subset of datapoints.
You can find more details related to different parameters in this link: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
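As a sketch of the "analysis on each subset" step, the example below fits a separate linear regression inside every cluster. The data are synthetic and only stand in for `df` (an assumption), with the first column treated as the predictor and the second as the target. A pandas DataFrame accepts the same boolean mask for selecting rows:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): two well-separated groups with different slopes.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0, 5, 100), rng.uniform(20, 25, 100)])
y = np.where(x < 10, 2.0 * x, -1.0 * x + 50) + rng.normal(0, 0.1, 200)
data = np.column_stack([x, y])

k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)

models = {}
for i in range(k):
    cluster_i = data[labels == i]  # rows assigned to cluster i
    # Regress the second column on the first, using only this cluster's rows.
    models[i] = LinearRegression().fit(cluster_i[:, [0]], cluster_i[:, 1])
    print(i, len(cluster_i), models[i].coef_[0], models[i].intercept_)
```

Each entry of `models` is an independent regression, so you can inspect or predict with the model that belongs to the cluster a sample falls into.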
I have trained and tested a KNN model on a small supervised dataset of about 200 samples in Python. I would like to apply these results to a much larger unsupervised dataset of several thousand samples.
My question is: is there a way to fit the KNN model using the small supervised dataset, and then change the K-value for the large unsupervised dataset? I do not want to overfit the model by using the low K-value from the smaller dataset but am unsure how to fit the model and then change the K-value in Python.
Is this possible using KNN? Is there some other way to apply KNN to a much larger unsupervised dataset?
I would recommend actually fitting a KNN model on your larger dataset a couple of different times, each time using a different value for k. For each of those models you could then calculate the Silhouette Score.
Compare the various silhouette scores and choose for your final value of k (number of clusters) the value that you used for your highest scoring model.
As an example, here's some code I used to do this for myself last year:
import numpy as np
from sklearn import mixture
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
## A list of the different numbers of clusters (the 'n_components' parameter) with
## which we will run GMM.
number_of_clusters = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
## Graph plotting method
def makePlot(number_of_clusters, silhouette_scores):
    # Plot each value of 'number of clusters' vs. the silhouette score at that value
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.set_xlabel('GMM - number of clusters')
    ax.set_ylabel('Silhouette Score (higher is better)')
    ax.plot(number_of_clusters, silhouette_scores)
    # Ticks and grid
    xticks = np.arange(min(number_of_clusters), max(number_of_clusters)+1, 1.0)
    ax.set_xticks(xticks, minor=False)
    ax.set_xticks(xticks, minor=True)
    ax.xaxis.grid(True, which='both')
    yticks = np.arange(round(min(silhouette_scores), 2), max(silhouette_scores), .02)
    ax.set_yticks(yticks, minor=False)
    ax.set_yticks(yticks, minor=True)
    ax.yaxis.grid(True, which='both')
## Graph the mean silhouette score of each cluster amount.
## Print out the number of clusters that results in the highest
## silhouette score for GMM.
def findBestClusterer(number_of_clusters):
    silhouette_scores = []
    for i in number_of_clusters:
        # GaussianMixture is the current scikit-learn class for GMMs
        clusterer = mixture.GaussianMixture(n_components=i) # Use the model of your choice here
        clusterer.fit(<your data set>) # enter your data set's variable name here
        preds = clusterer.predict(<your data set>)
        score = silhouette_score(<your data set>, preds)
        silhouette_scores.append(score)
    ## Print a table of all the silhouette scores
    print("")
    print("| Number of clusters | Silhouette score |")
    print("| ------------------ | ---------------- |")
    for i in range(len(number_of_clusters)):
        ## Pad both columns to the header widths so the table stays aligned
        ## for one- and two-digit cluster counts alike.
        print("| {number:<18} | {score:<16.4f} |".format(number=number_of_clusters[i],
                                                          score=silhouette_scores[i]))
    ## Graph the plot of silhouette scores for each amount of clusters
    makePlot(number_of_clusters, silhouette_scores)
    ## Find and print out the cluster amount that gives the highest
    ## silhouette score.
    best_silhouette_score = max(silhouette_scores)
    index_of_best_score = silhouette_scores.index(best_silhouette_score)
    ideal_number_of_clusters = number_of_clusters[index_of_best_score]
    print("")
    print("Having {} clusters gives the highest silhouette score of {}.".format(ideal_number_of_clusters,
                                                                                round(best_silhouette_score, 4)))
findBestClusterer(number_of_clusters)
Please note that in my example, I used a GMM model instead of KNN, but you should be able to slightly modify the findBestClusterer() method to use whatever clustering algorithm you wish. In this method you will also specify your dataset.
In machine learning there are two broad types of learners, namely eager learners (decision trees, neural nets, SVMs...) and lazy learners such as KNN. In fact, KNN doesn't do any learning at all. It just stores the labeled data you have and then uses it to perform inference: it computes how similar the new (unlabeled) sample is to all of the samples it has stored (the labeled data). Then, based on majority voting among the K nearest instances (the K nearest neighbours, hence the name) of the new sample, it infers its class/value.
Now to get to your question, "training" the KNN has nothing to do with the K itself, so when performing inference feel free to use whatever K gives the best result for you.
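One way to decide which K "gives the best result" is to cross-validate a few candidates on the labeled data and then predict on the unlabeled set with the winner. The sketch below assumes scikit-learn's KNeighborsClassifier; the data, the candidate values of K and the names `X_labeled`, `y_labeled` and `X_unlabeled` are all made up for illustration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins (assumption): 200 labeled samples, 3000 unlabeled ones.
rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_labeled = np.array([0] * 100 + [1] * 100)
X_unlabeled = np.vstack([rng.normal(0, 1, (1500, 2)), rng.normal(4, 1, (1500, 2))])

# Score several values of K by 5-fold cross-validation on the labeled data.
cv_scores = {}
for k in (1, 3, 5, 7, 9, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_scores[k] = cross_val_score(knn, X_labeled, y_labeled, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)

# Fitting only stores (and indexes) the labeled samples, so refitting with a
# different K is cheap; K itself is applied when predicting.
model = KNeighborsClassifier(n_neighbors=best_k).fit(X_labeled, y_labeled)
predictions = model.predict(X_unlabeled)
print(best_k, np.bincount(predictions))
```

Since there is no training cost to speak of, trying a wider range of K values is inexpensive. The cross-validated accuracy on the labeled set is the only evidence available for choosing among them.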