I use the intended code to output the results of the clustering. What does this value mean in "Cluster Centers", and how should I interpret this data?
kmeans = KMeans(n_clusters = 4).fit(df)
print("Number of clusters: ", kmeans.n_clusters)
print("-"*70)
print("Cluster Centers: ", '\n', kmeans.cluster_centers_)
Number of clusters: 4
----------------------------------------------------------------------
Cluster Centers:
[[4.10000000e+02 9.92833333e+03 3.42200000e+03 3.73333333e+00
2.32433333e+03 1.36733333e+03 1.31600000e+03 5.16666667e+01
9.57000000e+02]
[4.55000000e+01 3.41650000e+03 1.42100000e+04 3.70000000e+00
5.95000000e+02 3.60000000e+02 3.46500000e+02 1.35000000e+01
2.34500000e+02]
[3.41666667e+01 1.14600000e+03 3.33358333e+03 3.69166667e+00
7.02500000e+02 4.14583333e+02 3.99166667e+02 1.53333333e+01
2.87916667e+02]
[5.14000000e+02 2.48310000e+04 5.78750000e+03 3.75000000e+00
1.75350000e+03 1.05200000e+03 1.01200000e+03 3.95000000e+01
7.02000000e+02]]
It means that you have four clusters, and the given vectors ar the centers of those clusters.
So, for a new point, you can check which centroid is the closest and you can determine the new point cluster accordingly.
For example, for the following four clusters above the X represent its centroids for the clusters and a new point can be classified accordingly.
Also, you can check for yourself measurement on the clusters. you can check here: Silhouette - Wikipedia
Your code asked to find four clusters using the KMeans algorithm. See the docs. As expected, you obtain 4 clusters. Based on the kmeans.cluster_centers_, we can tell that your space is 9-dimensional (9 coordinates for each point), because the cluster centroids are 9-dimensional.
The centroids are the means of all points within a cluster. This doc is a good introduction for getting an intuitive understanding of the k-means algorithm.
Related
I am currently doing some clustering based on words embeddings, and I am using some methods (elbow and David-Boulding) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it is a measure of the correct match of the data with the correct cluster, ranging from - 1 (mismatch) to 1 (correct match).
Using kmeans clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (that is a bit too smooth but it might because I have a lot of data) and the David-Bouldin index, I should consider 5 clusters. However, I don't know if 0.5 can be considered as a good score? I added the graphs of the different measures I made, the function I used to generate them (found online) as well as the clustering obtained.
def check_clustering(X, K):
sse,db,slc = {}, {}, {}
for k in range(2, K):
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=k, max_iter=1000,random_state=SEED).fit(X)
if k == 3: labels = kmeans.labels_
clusters = kmeans.labels_
sse[k] = kmeans.inertia_ # Inertia: Sum of distances of samples to their closest cluster center
db[k] = davies_bouldin_score(X,clusters)
slc[k] = silhouette_score(X,clusters)
plt.figure(figsize=(15,10))
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of cluster")
plt.ylabel("SSE")
plt.show()
plt.figure(figsize=(15,10))
plt.plot(list(db.keys()), list(db.values()))
plt.xlabel("Number of cluster")
plt.ylabel("Davies-Bouldin values")
plt.show()
plt.figure(figsize=(15,10))
plt.plot(list(slc.keys()), list(slc.values()))
plt.xlabel("Number of cluster")
plt.ylabel("Silhouette score")
plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
(Also, there is a subsidiary question but I find the shape of the clusters a bit weird (I would expect them to be more fragmented). Is it a possible shape of clusters? (Note that I used the PCA to reduce the dimensions, so it might be because of that).
Thank you for your help.
Just searched this myself.
A silhouette score of one means each data point is unlikely to be assigned to another cluster.
A score close to zero means each data point could be easily assigned to another cluster
A score close to -1 means the datapoint is misclassified.
Based on these assumptions, I'd say 0.55 is still informative though not definitive and therefore you would need additional analysis to make any assertions based on your data.
After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10000000 points, though only 8000 unique ones. Therefore, I initially thought that for speeding it up, I’d use unique points only. Seems like this is a bad idea.
To keep computational time down, this post suggests to add weights to each point. How can this be implemented in python?
Using K-Means package from Scikit library, clustering is performed for number of clusters as 11 here.
The array Y contains data that has been inserted as weights where as X has actual points that need to be clustered.
from sklearn.cluster import KMeans #For applying KMeans
##--------------------------------------------------------------------------------------------------------##
#Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)
#Running k-means clustering and enter the ‘X’ array as the input coordinates and ‘Y’
array as sample weights
wt_kmeansclus = kmeans.fit(X,sample_weight = Y)
predicted_kmeans = kmeans.predict(X, sample_weight = Y)
#Storing results obtained together with respective city-state labels
kmeans_results =
pd.DataFrame({"label":data_label,"kmeans_cluster":predicted_kmeans+1})
#Printing count of points alloted to each cluster and then the cluster centers
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests to work with weighted average.
You can create a new dataset out of the old one, and the new dataset will have an extra attribute for each point, it's frequency (i.e it's weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points).
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
On doing K means fit on some vectors with 3 clusters, I was able to get the labels for the input data.
KMeans.cluster_centers_ returns the coordinates of the centers and so shouldn't there be some vector corresponding to that? How can I find the value at the centroid of these clusters?
closest, _ = pairwise_distances_argmin_min(KMeans.cluster_centers_, X)
The array closest will contain the index of the point in X that is closest to each centroid.
Let's say the closest gave output as array([0,8,5]) for the three clusters. So X[0] is the closest point in X to centroid 0, and X[8] is the closest to centroid 1 and so on.
Source: https://codedump.io/share/XiME3OAGY5Tm/1/get-nearest-point-to-centroid-scikit-learn
The cluster centre value is the value of the centroid. At the end of k-means clustering, you'll have three individual clusters and three centroids, with each centroid being located at the centre of each cluster. The centroid doesn't necessarily have to coincide with an existing data point.
Sharda neglected to import the metrics module from scikit-learn, see below.
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
or
closest, _ = sklearn.metrics.pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
Assuming X is the input data and kmeans has been fit to that data, both options give you an array, closest, for which each element is the index of the closest element in X to that centroid. Thus, closest[0] is the index of the data closest to the first centroid and X[closest[0]] is that data.
To answer your first question, k-means clustering randomly selects a point in the plane for each centroid and then adjusts them all to be the best representatives of the data. The centroids will not necessarily end up coinciding with any of the original data. This contrasts with the Affinity Propagation Clustering algorithm which picks an exemplar data point as the representative for each cluster, not just a point in the same plane.
I have been using KMeans in order to extract clusters from a set of lines and i'm not very impressed with the results and i wanted to try out DBSCAN to see if this can produce better results. Does DBSCAN output cluster words as KMeans ?
I was able to use DBSCAN and was able to output number of clusters as '3' but i would like to know what context is driving it to make '3' clusters (i would like to know the words)
here is my code snippet
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f"% metrics.silhouette_score(X, labels))
You do not have direct control over how many clusters DBSCAN produces. It produces as many as happen to be there at the given density level; which is best done by varying epsilon.
Note that it also produces noise, i.e. one cluster (probably the first) is not a cluster but leftover points that do not belong to any cluster. But when you simply discard these points, your silhouette becomes false.
As DBSCAN clusters may be arbitrarily shaped, there is no meaningful 'centroid' as in k-means that you could interpretnas "words" (but often this interpretation is all but good anyway).
Please read the Wikipedia article & DBSCAN literature for further details.
I have been trying to implement DBSCAN using scikit and am so far failing to determine the values of epsilon and min_sample which will give me a sizeable number of clusters. I tried finding the average value in the distance matrix and used values on either side of the mean but haven't got a satisfactory number of clusters:
Input:
db=DBSCAN(eps=13.0,min_samples=100).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
output:
Estimated number of clusters: 1
Input:
db=DBSCAN(eps=27.0,min_samples=100).fit(X)
Output:
Estimated number of clusters: 1
Also so other information:
The average distance between any 2 points in the distance matrix is 16.8354
the min distance is 1.0
the max distance is 258.653
Also the X passed in the code is not the distance matrix but the matrix of feature vectors.
So please tell me how do i determine these parameters
plot a k-distance graph, and look for a knee there. As suggested in the DBSCAN article.
(Your min_samples might be too high - you probably won't have a knee in the 100-distance graph then.)
Visualize your data. If you can't visually see clusters, there might be no clusters. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set is a Gaussian distribution, it is supposed to be a single cluster only.
Try changing the min_samples parameter to a lower value. This parameter affects the minimum size of each cluster formed. May be, the possible clusters to be formed are all small sized and the parameter you are using right now is too high for them to be formed.