I modified the Cluster comparison script to compute silhouette_score on the clustering output.
I added this line:
sil = silhouette_score(X, y_pred, metric='euclidean') if len(np.unique(y_pred)) > 1 else float('NaN')
and modified the plt.text() line for showing the sil value in the subplot:
txt = 'sil={:.3f}\n{:.2f}s'.format(sil,(t1 - t0))
plt.text(.99, .01, txt, transform=plt.gca().transAxes, size=15, horizontalalignment='right')
This is what I get:
Look at the 3rd row, in the columns for MeanShift and DBSCAN. The clusterings look the same, but the silhouette score is significantly lower for DBSCAN. How come?
Since this question is not about a programming error, should it be moved to Stats?
In short, the clusterings aren't the same. If you look very closely at the DBSCAN plot, you'll see that there is an outlier at the bottom left of the blue cluster that is not assigned to any cluster -- it appears as a black point.
Note that the silhouette score assumes that all points are assigned to a cluster, so it may not give the answer you'd expect. In this case, the single point not assigned to any cluster is enough to make a significant difference in the silhouette scores.
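As a quick check, you can recompute the silhouette with the noise points removed. The helper below is a small sketch (the function name is hypothetical), assuming y_pred holds DBSCAN labels with -1 marking the unassigned (noise) points:
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_ignoring_noise(X, y_pred):
    """Silhouette on all points vs. only the points assigned to a cluster."""
    mask = y_pred != -1  # DBSCAN marks noise points with the label -1
    sil_all = silhouette_score(X, y_pred) if len(np.unique(y_pred)) > 1 else float('nan')
    sil_assigned = (silhouette_score(X[mask], y_pred[mask])
                    if len(np.unique(y_pred[mask])) > 1 else float('nan'))
    return sil_all, sil_assigned
Comparing the two values makes the effect of that single unassigned point visible.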
I used the code below to output the results of the clustering. What does the value of "Cluster Centers" mean, and how should I interpret this output?
kmeans = KMeans(n_clusters = 4).fit(df)
print("Number of clusters: ", kmeans.n_clusters)
print("-"*70)
print("Cluster Centers: ", '\n', kmeans.cluster_centers_)
Number of clusters: 4
----------------------------------------------------------------------
Cluster Centers:
[[4.10000000e+02 9.92833333e+03 3.42200000e+03 3.73333333e+00
2.32433333e+03 1.36733333e+03 1.31600000e+03 5.16666667e+01
9.57000000e+02]
[4.55000000e+01 3.41650000e+03 1.42100000e+04 3.70000000e+00
5.95000000e+02 3.60000000e+02 3.46500000e+02 1.35000000e+01
2.34500000e+02]
[3.41666667e+01 1.14600000e+03 3.33358333e+03 3.69166667e+00
7.02500000e+02 4.14583333e+02 3.99166667e+02 1.53333333e+01
2.87916667e+02]
[5.14000000e+02 2.48310000e+04 5.78750000e+03 3.75000000e+00
1.75350000e+03 1.05200000e+03 1.01200000e+03 3.95000000e+01
7.02000000e+02]]
It means that you have four clusters, and the given vectors are the centers of those clusters.
So, for a new point, you can check which centroid is the closest and assign the new point to that cluster accordingly.
For example, in a plot of the four clusters above, the X markers would represent the centroids, and a new point can be classified according to its nearest centroid.
You can also compute quality measures on the clusters yourself; see Silhouette - Wikipedia.
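A minimal sketch of that assignment step, assuming the fitted kmeans object from the question (the new observation here is purely hypothetical):
import numpy as np

# A hypothetical new observation with the same 9 features as the training data.
new_point = np.zeros((1, 9))
label = kmeans.predict(new_point)[0]            # index of the closest centroid
nearest_centroid = kmeans.cluster_centers_[label]
print(label, nearest_centroid)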
Your code asks the KMeans algorithm to find four clusters (see the docs), and as expected you obtain 4 clusters. From kmeans.cluster_centers_ we can tell that your feature space is 9-dimensional (9 coordinates per point), because the cluster centroids are 9-dimensional.
The centroids are the means of all points within a cluster. This doc is a good introduction for getting an intuitive understanding of the k-means algorithm.
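As a sanity check (a sketch, assuming the df and kmeans objects from the question), each centroid should match the mean of the points assigned to that cluster once the algorithm has converged:
import numpy as np

for i in range(kmeans.n_clusters):
    cluster_mean = df[kmeans.labels_ == i].mean(axis=0).to_numpy()
    print(i, np.allclose(cluster_mean, kmeans.cluster_centers_[i]))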
I am currently doing some clustering based on word embeddings, and I am using some methods (elbow and Davies-Bouldin) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it is a measure of how well the data match their assigned cluster, ranging from -1 (mismatch) to 1 (correct match).
Using k-means clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (which is a bit too smooth, but that might be because I have a lot of data) and the Davies-Bouldin index, I should consider 5 clusters. However, I don't know whether 0.5 can be considered a good score. I added the graphs of the different measures I made, the function I used to generate them (found online), as well as the clustering obtained.
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

SEED = 10  # seed of 10 for reproducibility

def check_clustering(X, K):
    sse, db, slc = {}, {}, {}
    for k in range(2, K):
        kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=SEED).fit(X)
        if k == 3:
            labels = kmeans.labels_  # keep the 3-cluster labels for later use
        clusters = kmeans.labels_
        sse[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center
        db[k] = davies_bouldin_score(X, clusters)
        slc[k] = silhouette_score(X, clusters)

    plt.figure(figsize=(15, 10))
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(db.keys()), list(db.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Davies-Bouldin values")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(slc.keys()), list(slc.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Silhouette score")
    plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
(Also, a subsidiary question: I find the shape of the clusters a bit weird; I would expect them to be more fragmented. Is this a possible shape for clusters? Note that I used PCA to reduce the dimensions, so it might be because of that.)
Thank you for your help.
Just searched this myself.
A silhouette score of one means each data point is unlikely to be assigned to another cluster.
A score close to zero means each data point could easily be assigned to another cluster.
A score close to -1 means the data point has been misclassified.
Based on these assumptions, I'd say 0.55 is still informative though not definitive and therefore you would need additional analysis to make any assertions based on your data.
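To get a feel for what a value around 0.5 means, here is a small self-contained experiment (synthetic data, an assumption, not the asker's embeddings) comparing well-separated and overlapping blobs:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for std in (0.5, 2.5):  # tight clusters vs. heavily overlapping clusters
    X, _ = make_blobs(n_samples=1000, centers=5, cluster_std=std, random_state=0)
    labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
    print("cluster_std=%.1f: silhouette=%.2f" % (std, silhouette_score(X, labels)))
The tight blobs score close to 0.8, while the overlapping ones drop well below 0.5, which gives a rough calibration for your 0.5-0.55 range.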
After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10000000 points, though only 8000 unique ones. Therefore, I initially thought that for speeding it up, I’d use unique points only. Seems like this is a bad idea.
To keep computational time down, this post suggests to add weights to each point. How can this be implemented in python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the values used as weights, whereas X contains the actual points that need to be clustered.
from sklearn.cluster import KMeans  # for applying KMeans
import pandas as pd

# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)

# Running k-means clustering with the 'X' array as the input coordinates
# and the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)

# Storing results obtained together with respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})

# Printing count of points allotted to each cluster and then the cluster centers
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests to work with weighted average.
You can create a new dataset out of the old one, where the new dataset has an extra attribute for each point: its frequency (i.e., its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
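A minimal sketch of how the weighting can be done with scikit-learn's sample_weight, assuming X is the full array of (possibly duplicated) points; the count of each unique point becomes its weight:
import numpy as np
from sklearn.cluster import KMeans

# Collapse duplicates and keep how many times each unique point occurs.
unique_pts, counts = np.unique(X, axis=0, return_counts=True)

# Fitting on the ~8000 unique points with their counts as weights is effectively
# equivalent to fitting on all ~10 million duplicated points, but much faster.
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)
kmeans.fit(unique_pts, sample_weight=counts)
labels_unique = kmeans.predict(unique_pts)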
I have used the Elo and Glicko rating systems along with the results for matches to generate ratings for players. Prior to each match, I can generate an expectation (a float between 0 and 1) for each player based on their respective ratings. I would like test how accurate this expectation is, for two reasons:
To compare the different rating systems
To tune variables (such as kfactor in Elo) used to calculate ratings
There are a few differences from chess worth being aware of:
Possible results are wins (which I am treating as 1.0) and losses (0.0), with the very occasional (<5%) draw (0.5 each). Each individual match is rated, not a series as in chess.
Players have fewer matches -- many have fewer than 10, few go over 25, and the maximum is 75.
Thinking the appropriate function is "correlation", I attempted to create a DataFrame containing the prediction in one column (a float between 0 and 1) and the result in the other (1 | 0.5 | 0) and to use corr(), but based on the output, I am not sure whether this is correct.
If I create a DataFrame containing expectations and results for only the first player in a match (the results will always be 1.0 or 0.5 since due to my data source, losers are never displayed first), corr() returns very low: < 0.05. However, if I create a series which has two rows for each match and contains both the expectation and result for each player (or, alternatively, randomly choose which player to append, so results will be either 0, 0.5, or 1), the corr() is much higher: ~0.15 to 0.30. I don't understand why this would make a difference, which makes me wonder if I am either misusing the function or using the wrong function entirely.
If it helps, here is some real (not random) sample data: http://pastebin.com/eUzAdNij
An industry-standard way to judge the accuracy of a prediction is the Receiver Operating Characteristic (ROC). You can create it from your data using sklearn and matplotlib with the code below.
The ROC is a 2-D plot of the true positive rate vs. the false positive rate. You want the line to be above the diagonal; the higher, the better. The Area Under the Curve (AUC) is a standard measure of accuracy: the larger it is, the more accurate your classifier.
import pandas as pd
# read data
df = pd.read_csv('sample_data.csv', header=None, names=['classifier','category'])
# remove values that are not 0 or 1 (two of those)
df = df.loc[(df.category==1.0) | (df.category==0.0),:]
# examine data frame
df.head()
from matplotlib import pyplot as plt
# add this magic if you're in a notebook
# %matplotlib inline
from sklearn.metrics import roc_curve, auc
# matplotlib figure
figure, ax1 = plt.subplots(figsize=(8,8))
# create ROC itself
fpr,tpr,_ = roc_curve(df.category,df.classifier)
# compute AUC
roc_auc = auc(fpr,tpr)
# plotting bells and whistles
ax1.plot(fpr,tpr, label='%s (area = %0.2f)' % ('Classifier',roc_auc))
ax1.plot([0, 1], [0, 1], 'k--')
ax1.set_xlim([0.0, 1.0])
ax1.set_ylim([0.0, 1.0])
ax1.set_xlabel('False Positive Rate', fontsize=18)
ax1.set_ylabel('True Positive Rate', fontsize=18)
ax1.set_title("Receiver Operating Characteristic", fontsize=18)
plt.tick_params(axis='both', labelsize=18)
ax1.legend(loc="lower right", fontsize=14)
plt.grid(True)
figure.show()
From your data, you should get a plot like this one:
Actually, what you observe makes perfect sense. If there were no draws and you always showed the expectation of the winner in the first column, then there would be no correlation with the result column at all: no matter how big or small the expectation, the result is always 1.0, i.e. it does not depend on the expectation at all.
Because of the small percentage of draws (which probably correlate with expectations around 0.5), you can still observe a small correlation.
Maybe correlation is not the best measure for the accuracy of the predictions here.
One of the problems is that Elo does not predict a single result but the expected number of points. There is at least one unknown factor: the probability of a draw. You have to put additional knowledge about the probability of a draw into your models. This probability depends on the strength difference between the players: the bigger the difference, the smaller the chance of a draw. One could try the following approaches (a small sketch follows the list):
Mapping expected points onto expected results, e.g. 0...0.4 means a loss, 0.4...0.6 a draw, and 0.6...1.0 a win, and then counting how many results are predicted correctly.
For a player and a set of games, the measure of accuracy would be |predicted_score - score| / number_of_games, averaged over the players. The smaller the difference, the better.
A kind of Bayesian approach: if for a game the predicted number of points is x, then the score of the predictor is x if the game was won and 1-x if it was lost (you may have to skip the draws, or score them as 4*x*(1-x) so that a prediction of 0.5 gets a score of 1). The overall score of the predictor over all games would be the product of the single-game scores. The bigger the score, the better.
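A rough sketch of the first two ideas, using hypothetical arrays expected (the Elo expectation per game) and result (1, 0.5, or 0); the values and thresholds below are illustrative assumptions:
import numpy as np

expected = np.array([0.70, 0.55, 0.20, 0.90])   # hypothetical Elo expectations
result   = np.array([1.0,  0.5,  0.0,  1.0])    # actual outcomes

# 1) Map expected points onto discrete outcomes and count correct predictions.
predicted = np.where(expected < 0.4, 0.0, np.where(expected <= 0.6, 0.5, 1.0))
accuracy = np.mean(predicted == result)

# 2) Mean absolute difference between expected and actual score (smaller is better).
mae = np.abs(expected - result).mean()
print(accuracy, mae)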
I have been trying to implement DBSCAN using scikit-learn and am so far failing to find values of eps and min_samples that give me a sizeable number of clusters. I tried finding the average value in the distance matrix and used values on either side of the mean, but haven't got a satisfactory number of clusters:
Input:
db=DBSCAN(eps=13.0,min_samples=100).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
output:
Estimated number of clusters: 1
Input:
db=DBSCAN(eps=27.0,min_samples=100).fit(X)
Output:
Estimated number of clusters: 1
Also, some other information:
The average distance between any 2 points in the distance matrix is 16.8354
the min distance is 1.0
the max distance is 258.653
Also the X passed in the code is not the distance matrix but the matrix of feature vectors.
So please tell me, how do I determine these parameters?
Plot a k-distance graph and look for a knee there, as suggested in the DBSCAN paper.
(Your min_samples might be too high -- you probably won't see a knee in the 100-distance graph then.)
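A minimal k-distance sketch, assuming the same feature matrix X as in the question (the choice of k here is an assumption; use something close to the min_samples you intend to try):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 10  # try values well below 100
distances, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth_distances = np.sort(distances[:, -1])   # distance of each point to its k-th nearest neighbour

plt.plot(kth_distances)
plt.xlabel("Points sorted by k-distance")
plt.ylabel("Distance to %d-th nearest neighbour" % k)
plt.show()
The eps value at the knee of this curve is a reasonable starting point.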
Visualize your data. If you can't visually see clusters, there might be no clusters. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set is a Gaussian distribution, it is supposed to be a single cluster only.
Try changing the min_samples parameter to a lower value. This parameter affects the minimum size of each cluster formed. Maybe the possible clusters are all small, and the value you are using right now is too high for them to form.
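If you want to explore this systematically, a small sweep (a sketch over the asker's X; the parameter grids here are arbitrary assumptions) shows how the number of clusters and the noise fraction react to eps and min_samples:
from sklearn.cluster import DBSCAN

for eps in (5.0, 10.0, 15.0, 20.0):
    for min_samples in (5, 15, 50):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # ignore noise label -1
        noise_fraction = float((labels == -1).mean())
        print(eps, min_samples, n_clusters, round(noise_fraction, 2))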