I'm trying to use k-means on a dataset available at this link, using only the variables about the client. The problem is that 7 of the 8 variables are categorical, so I've used a one-hot encoder on them.
To use the elbow method to select an ideal number of clusters, I ran KMeans for 2 to 22 clusters and plotted the inertia_ values. But the shape wasn't anything like an elbow; it looked more like a straight line.
Am I doing something wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
bank = pd.read_csv('bank-additional-full.csv', sep=';') #available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#
# 1. selecting only information about the client
cli_vars = ['age', 'job', 'marital', 'education', 'default', 'housing', 'loan']
bank_cli = bank[cli_vars].copy()
#2. applying one hot encoder to categorical variables
X = bank_cli[['job', 'marital', 'education', 'default', 'housing', 'loan']]
le = preprocessing.LabelEncoder()
X_2 = X.apply(le.fit_transform)
X_2.values
enc = preprocessing.OneHotEncoder()
enc.fit(X_2)
one_hot_labels = enc.transform(X_2).toarray()
one_hot_labels.shape #(41188, 33)
#3. concatenating numeric and categorical variables
X = np.concatenate((bank_cli.values[:,0].reshape((41188,1)),one_hot_labels), axis = 1)
X.shape
X = X.astype(float)
X_fit = StandardScaler().fit_transform(X)
X_fit
#4. function to calculate k-means for 2 to 22 clusters
def calcular_cotovelo(data):
    wcss = []
    for i in range(2, 23):
        kmeans = KMeans(init='k-means++', n_init=12, n_clusters=i)
        kmeans.fit(data)
        wcss.append(kmeans.inertia_)
    return wcss
cotovelo = calcular_cotovelo(X_fit)
#5. plot to see the elbow to select the ideal number of clusters
plt.plot(range(2, 23), cotovelo)
plt.show()
This is the plot of the inertia to select the clusters. It's not in an elbow shape, and the values are very high.
K-means is not suited for categorical data. You should look at k-prototypes instead, which combines k-modes and k-means and is able to cluster mixed numerical and categorical data.
An implementation of k-prototypes is available in Python.
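A minimal sketch of that approach with the kmodes package (an addition, not from the original answer; it assumes bank_cli is the 7-column client DataFrame built in the question, used without one-hot encoding):
from kmodes.kprototypes import KPrototypes
# age (column 0) is numeric; the remaining six client columns are categorical.
kproto = KPrototypes(n_clusters=5, init='Cao', n_init=5)
clusters = kproto.fit_predict(bank_cli.values, categorical=[1, 2, 3, 4, 5, 6])
# cost_ plays the same role as KMeans' inertia_ for an elbow-style plot.
print(kproto.cost_)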
If you consider only the numerical variable, however, you can see an elbow with the k-means criterion:
To understand why you do not see an elbow (with k-means on both numerical and categorical data), you can look at the number of points per cluster. You can observe that each time you increase the number of clusters, a new cluster is formed with only a few points that were in a big cluster at the previous step, so the criterion is only slightly lower than at the previous step.
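A hedged sketch of that check, reusing the scaled matrix X_fit from the question:
import numpy as np
from sklearn.cluster import KMeans
# Count how many points fall in each cluster as k grows; with one-hot data the
# newly added cluster typically captures only a handful of points.
for k in (5, 10, 15):
    labels = KMeans(n_clusters=k, n_init=12).fit_predict(X_fit)
    print(k, np.bincount(labels))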
I have the code for a manual (and therefore possibly wrong) elbow-method selection of the optimal number of clusters when clustering a binary DataFrame with K-modes:
import numpy as np
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cost = []
for num_clusters in list(range(1, 10)):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)

y = np.array([i for i in range(1, 10, 1)])
plt.plot(y, cost)
The output of the for loop is a plot with the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that myself; I am looking for some computational way. I want the computer to do the job without me determining it "manually"; otherwise the whole script stops executing at that point.
What would be the code for selecting the K automatically that would replace my manual selection?
Thank you.
Use the silhouette coefficient [it will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient gives a measure of how similar a data point is to its own cluster compared to other clusters. Check the sklearn doc here.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
A sample using the digits dataset:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import load_digits
data, labels = load_digits(return_X_y=True)
from sklearn.metrics import silhouette_score
silhouette_avg = []
for num_clusters in list(range(2, 20)):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)
import matplotlib.pyplot as plt
plt.plot(np.arange(2,20),silhouette_avg,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2,20))
print (f"Best K: {np.argmax(silhouette_avg)+2}")
output:
Best K: 9
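As an aside (not part of the original answer), the elbow itself can also be located programmatically with the third-party kneed package; a minimal, hedged sketch on the same digits data:
from kneed import KneeLocator
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
data, _ = load_digits(return_X_y=True)
k_values = list(range(2, 20))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10)
    km.fit(data)
    inertias.append(km.inertia_)
# curve="convex", direction="decreasing" matches a typical inertia/cost curve;
# kl.elbow is None if no clear knee is found.
kl = KneeLocator(k_values, inertias, curve="convex", direction="decreasing")
print("Automatically selected K:", kl.elbow)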
I am using the make_moons dataset and I am trying to implement an outlier detection algorithm. That's why I want to generate 3 points which are far away from the normal data and test whether they are outliers or not. These 3 points should be randomly selected from my data and should be as far as possible from the normal data.
My algorithm will compare the distance of each point to a threshold value and decide whether it is an outlier or not.
I am aware of other resources for doing that, but my specific problem is my dataset; I could not find a way to adapt those solutions to it.
Here is my code to define the dataset and fit K-Means (I have to use the K-Means-fitted data):
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

data = make_moons(n_samples=100, noise=0, random_state=0)
X, y = data
n_clusters = 10
kmeans = KMeans(n_clusters=n_clusters, random_state=10)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
In short, how can I find the 3 farthest points in my data, to use them in outlier detection?
As stated in the comments, you should define a criterion to classify outliers. Either way, in the following code, I randomly selected three entries from X and multiplied them by 1,000, so that should surely make them outliers regardless of the definition you choose.
# Import libraries
import numpy as np
from sklearn.datasets import make_moons
# Create data
X, y = make_moons(100, random_state=123)
# Randomly select 3 row numbers from X
np.random.seed(5)
idx = np.random.randint(low=0, high=len(X), size=3)

# Overwrite the data from the randomly selected rows
scaler = 1000  # Change this number to whatever you need
for i in idx:
    X[i] = X[i] * scaler
Note: There is a small probability that idx will have duplicates. It won't happen with np.random.seed(5), but if you choose another seed (or opt not to use one at all) and get duplicates, simply try another seed or repeat until you don't get duplicates (or use np.random.choice(len(X), size=3, replace=False), which avoids duplicates altogether).
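Since the question also asks how to find the 3 points farthest from the rest of the data, here is a hedged sketch (an addition, not part of the original answer) that uses the distances to the fitted K-Means centroids:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0, random_state=0)
kmeans = KMeans(n_clusters=10, random_state=10).fit(X)
# kmeans.transform returns the distance from each point to every centroid;
# keep only the distance to the nearest centroid for each point.
dist_to_nearest = kmeans.transform(X).min(axis=1)
# Indices of the 3 points farthest from their nearest centroid.
farthest_idx = np.argsort(dist_to_nearest)[-3:]
print(farthest_idx, dist_to_nearest[farthest_idx])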
I am pretty new to Python and to clustering. Right now I have a task to analyze a set of data and determine the optimal K for K-means by using the elbow and silhouette methods.
As shown in the picture, my dataset has three features: the first is the weight of the tested person, the second is the person's blood cholesterol content, and the third is the gender of the tested person ('0' means female, '1' means male).
I first use the elbow method to see the WCSS value at different k values:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
And I get the plot below:
Then, I used the silhouette method to look at the silhouette score:
from sklearn.metrics import silhouette_score
sil = []
for k in range(2, 6):
    kmeans = KMeans(n_clusters=k).fit(data)
    preds = kmeans.fit_predict(data)
    sil.append(silhouette_score(data, preds, metric='euclidean'))
plt.plot(range(2, 6), sil)
plt.title('Silhouette Method')
plt.xlabel('Number of clusters')
plt.ylabel('Sil')
plt.show()
for i in range(len(sil)):
    print(str(i+2) + ":" + str(sil[i]))
And I got the following results:
Could anybody suggest how I can pick the optimal K? I did some light research; some say the higher the silhouette score the better (in my case the number of clusters should then be 2?), but in other cases people do not simply pick the cluster number with the highest score.
Another thought: since I included gender as one feature, should I first divide my data into two groups by gender and then cluster them separately?
The K-means algorithm is very sensitive to the scale on which your features are measured. In your case, gender is a binary variable which just takes the values 0 and 1, but the other two features are measured on a much larger scale. I recommend you normalize your data first and then redo the plots; that should produce results that are more consistent between your elbow curve and the silhouette method.
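A minimal sketch of that normalization step (the column names below are assumptions; use whatever the actual DataFrame contains):
from sklearn.preprocessing import StandardScaler
# Put weight and cholesterol on a scale comparable to the binary gender column.
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data[['weight', 'cholesterol', 'gender']])
# data_scaled can now replace data in the elbow and silhouette loops above.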
I'm doing clustering of text data with KMeans in Python's scikit-learn.
I have a problem with vectorizing the data, because I get very different results when I'm using different vectorizers.
I want to cluster text data (the data are Instagram comments about US politics) and I want to find the keywords for every cluster, but I do not know which vectorizer I should use.
For example, when I'm using:
cv = CountVectorizer(analyzer = 'word', max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
#should I scale x value?
#x = scale(x, with_mean=False)
#If I do this I get the graph just one dot and silhouette_score less than 0.01
I get that my optimal number of clusters is 2, based on a silhouette_score of 0.87, and my graph looks like this:
And when I'm using:
cv = TfidfVectorizer(analyzer = 'word',max_features = 8000, preprocessor=None, lowercase=True, tokenizer=None, stop_words = 'english')
x = cv.fit_transform(x)
I get that my optimal number of clusters is 13, based on a silhouette_score of 0.0159, and my graph looks like this:
This is how I'm doing the clustering:
my_list = []
list_of_clusters = []
for i in range(2, 15):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    my_list.append(kmeans.inertia_)
    cluster_labels = kmeans.fit_predict(x)
    silhouette_avg = silhouette_score(x, cluster_labels)
    print(round(silhouette_avg, 2))
    list_of_clusters.append(round(silhouette_avg, 1))
plt.plot(range(2,15),my_list)
plt.show()
number_of_clusters = max(list_of_clusters)
number_of_clusters = list_of_clusters.index(number_of_clusters)+2
print('Number of clusters: ', number_of_clusters)
kmeans = KMeans(n_clusters = number_of_clusters, init = 'k-means++', random_state = 42)
kmeans.fit(x)
And this is how I plot the data:
# reduce the features to 2D
from sklearn.decomposition import PCA
pca = PCA(n_components=2, random_state=0)
reduced_features = pca.fit_transform(x.toarray())
# reduce the cluster centers to 2D
reduced_cluster_centers = pca.transform(kmeans.cluster_centers_)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=kmeans.predict(x), s=3)
plt.scatter(reduced_cluster_centers[:, 0], reduced_cluster_centers[:,1], marker='x', s=50, c='r')
plt.show()
I think this is a very big difference, so I'm sure I'm doing something wrong, but I do not know what.
Thanks for your help :)
The Silhouette score, from sklearn:
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
Your best value is near 0, which means you don't have a good clustering.
A good clustering is a clustering in which:
the intra-class similarity is high (the documents in your cluster are similar).
the inter-class similarity is low (the documents from two clusters are not similar).
And of course, the clusters should tell you something interesting. For example, one cluster could contain all the words linked to a specific domain. But this is work that you do after the clusters are done.
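Since the question also asks for the keywords of every cluster, here is a hedged sketch (an addition, not part of the original answer); it assumes cv and kmeans are the fitted vectorizer and model from the question:
# The largest entries of each cluster center point to the terms most
# characteristic of that cluster.
terms = cv.get_feature_names_out()  # get_feature_names() in older sklearn versions
order = kmeans.cluster_centers_.argsort()[:, ::-1]
for i, centroid in enumerate(order):
    print(f"Cluster {i}:", [terms[idx] for idx in centroid[:10]])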
Changing the vectorizer means that you are changing the features on which you are doing the clustering.
From the python sklearn doc, CountVectorizer:
CountVectorizer implements both tokenization and occurrence counting in a single class.
Basically you have the token counts as features.
Instead, for TfidfVectorizer:
Convert a collection of raw documents to a matrix of TF-IDF features.
Which means that you are using the Term Frequency - Inverse Document Frequency formula as features for your documents, which is calculated as:
TF(t) = (number of times term t appears in a document) / (total number of terms in the document)
IDF(t) = log_e(total number of documents / number of documents with term t in it)
TF-IDF(t) = TF(t) * IDF(t)
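A small illustration of that difference on a toy corpus (an addition, not part of the original answer):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
    "the president spoke about the economy",
    "the senate voted on the economy bill",
    "voters discussed the election",
]
# Raw token counts as features.
counts = CountVectorizer(stop_words="english").fit_transform(corpus)
print(counts.toarray())
# TF-IDF weights as features: terms that appear in many documents are down-weighted.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
print(tfidf.toarray().round(2))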
I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM  # renamed GaussianMixture in newer sklearn versions
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
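A quick usage check with toy labels (an addition; the expected value is worked out in the comment):
# Cluster 0 holds two points with majority label 0; cluster 1 holds four points,
# three of which share label 1, so purity = (2 + 3) / 6 ≈ 0.83.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
print(purity_score(y_true, y_pred))  # 0.833...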
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist:
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters
    Returns:
        float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing e.g. with a set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # We set the bin edges so that we count the actual occurrence of classes
    # between two consecutive edges, the bigger one being excluded [bin_i, bin_i+1[
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)
    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner
    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g., having two equal clusters of size 50) will achieve a purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
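A quick sanity check (an addition, not part of the original answer): a perfect clustering whose labels are merely permuted relative to the true labels should still score 1.0, because the one-to-one matching recovers the correspondence.
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(cluster_accuracy(y_true, y_pred))  # 1.0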