I have been trying to implement DBSCAN using scikit-learn, and so far I am failing to determine values of epsilon (eps) and min_samples that give me a sizeable number of clusters. I tried finding the average value in the distance matrix and used values on either side of the mean, but I haven't got a satisfactory number of clusters:
Input:
db=DBSCAN(eps=13.0,min_samples=100).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Output:
Estimated number of clusters: 1
Input:
db=DBSCAN(eps=27.0,min_samples=100).fit(X)
Output:
Estimated number of clusters: 1
Also, some other information:
The average distance between any 2 points in the distance matrix is 16.8354
the min distance is 1.0
the max distance is 258.653
Also, the X passed in the code is not the distance matrix but the matrix of feature vectors.
So please tell me: how do I determine these parameters?
Plot a k-distance graph and look for a knee there, as suggested in the DBSCAN article.
(Your min_samples might be too high; you probably won't have a knee in the 100-distance graph.)
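A minimal sketch of computing the k-distance curve mentioned above, assuming scikit-learn's NearestNeighbors is available. The data X_demo and the helper name k_distance_curve are hypothetical stand-ins, not part of the original question. The query point counts as its own first neighbor here, which is consistent with scikit-learn's DBSCAN counting the point itself toward min_samples.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted descending.

    The query point itself is the first neighbor (distance 0), which matches
    how scikit-learn's DBSCAN counts min_samples.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X)
    distances, _ = nn.kneighbors(X)  # shape (n_samples, k), ascending per row
    return np.sort(distances[:, -1])[::-1]

# Hypothetical stand-in data: two tight blobs plus uniform background noise.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(100, 2)),
    rng.uniform(low=-3.0, high=8.0, size=(20, 2)),
])

k_dist = k_distance_curve(X_demo, k=5)
print(k_dist[:5])   # largest k-distances (sparse regions, likely noise)
print(k_dist[-5:])  # smallest k-distances (dense cluster cores)
```

Plotting k_dist against its rank (for example with matplotlib) gives the graph in which to look for a knee; the distance at the knee is a candidate for eps.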
Visualize your data. If you can't visually see clusters, there might be no clusters. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set is a Gaussian distribution, it is supposed to be a single cluster only.
Try changing the min_samples parameter to a lower value. This parameter affects the minimum size of each cluster formed. Maybe the clusters that could be formed are all small, and the value you are using right now is too high for them to form.
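One way to check this suggestion is to hold eps fixed and sweep min_samples, recording the number of clusters and noise points each time. The sketch below uses hypothetical synthetic data (X_demo) and a hypothetical helper (summarize_dbscan); the eps and min_samples values are illustrative only and would need to be adapted to the real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three well-separated blobs of 100 points each.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.3, random_state=0)

def summarize_dbscan(X, eps, min_samples):
    """Fit DBSCAN and return (number of clusters, number of noise points)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

results = {}
for min_samples in (5, 20, 100):
    results[min_samples] = summarize_dbscan(X_demo, eps=1.0, min_samples=min_samples)
    print(min_samples, results[min_samples])
```

If the cluster count only becomes reasonable at low min_samples, that supports the idea that the dense groups in the data are smaller than the original setting of 100.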
Related
I am currently clustering data and I used hierarchy.fcluster with the parameter "maxclust".
According to the documentation, the algorithm picks a threshold that fits:
Finds a minimum threshold r so that the cophenetic distance between any two original observations in the same flat cluster is no more than r and no more than t flat clusters are formed.
Is there a way to figure out the chosen threshold?
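One possible way to recover such a threshold, sketched under the assumption that the linkage is monotonic and has no tied merge heights: take the largest cophenetic distance between any two observations that ended up in the same flat cluster. This follows the documented definition quoted above rather than SciPy's internals, and X_demo, the linkage method, and t are hypothetical choices.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in data: 30 random points in 2-D.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))

Z = linkage(X_demo, method="average")
t = 4
labels = fcluster(Z, t=t, criterion="maxclust")

# Cophenetic distance between every pair of observations, as a square matrix.
coph = squareform(cophenet(Z))

# Largest cophenetic distance between two observations sharing a flat cluster.
same_cluster = labels[:, None] == labels[None, :]
r = coph[same_cluster].max()
print("recovered threshold r:", r)

# Cutting at r with the 'distance' criterion should give the same number of clusters.
labels_at_r = fcluster(Z, t=r, criterion="distance")
print(len(set(labels)), len(set(labels_at_r)))
```

The recovered r is the smallest cut height that reproduces the flat clusters returned for maxclust; with tied merge heights the two calls may not agree.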
I use the code below to output the results of the clustering. What do the values under "Cluster Centers" mean, and how should I interpret them?
kmeans = KMeans(n_clusters = 4).fit(df)
print("Number of clusters: ", kmeans.n_clusters)
print("-"*70)
print("Cluster Centers: ", '\n', kmeans.cluster_centers_)
Number of clusters: 4
----------------------------------------------------------------------
Cluster Centers:
[[4.10000000e+02 9.92833333e+03 3.42200000e+03 3.73333333e+00
2.32433333e+03 1.36733333e+03 1.31600000e+03 5.16666667e+01
9.57000000e+02]
[4.55000000e+01 3.41650000e+03 1.42100000e+04 3.70000000e+00
5.95000000e+02 3.60000000e+02 3.46500000e+02 1.35000000e+01
2.34500000e+02]
[3.41666667e+01 1.14600000e+03 3.33358333e+03 3.69166667e+00
7.02500000e+02 4.14583333e+02 3.99166667e+02 1.53333333e+01
2.87916667e+02]
[5.14000000e+02 2.48310000e+04 5.78750000e+03 3.75000000e+00
1.75350000e+03 1.05200000e+03 1.01200000e+03 3.95000000e+01
7.02000000e+02]]
It means that you have four clusters, and the given vectors are the centers of those clusters.
So, for a new point, you can check which centroid is the closest and assign the new point to that cluster.
For example, in a plot of four clusters where each X marks a cluster's centroid, a new point can be classified according to the nearest X.
You can also measure the quality of the clusters yourself; see Silhouette - Wikipedia.
Your code asked to find four clusters using the KMeans algorithm. See the docs. As expected, you obtain 4 clusters. Based on the kmeans.cluster_centers_, we can tell that your space is 9-dimensional (9 coordinates for each point), because the cluster centroids are 9-dimensional.
The centroids are the means of all points within a cluster. This doc is a good introduction for getting an intuitive understanding of the k-means algorithm.
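The two points made in the answers above (centers are per-cluster means, and a new point belongs to the nearest center) can be checked directly. This is a small sketch on hypothetical 2-D data (X_demo and new_point are made up for illustration), not the 9-dimensional data from the question.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: two obvious groups in 2-D.
X_demo = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [9.0, 9.0], [9.2, 8.9], [8.8, 9.1],
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

# Each center is the mean of the points assigned to that cluster.
for c in range(kmeans.n_clusters):
    members = X_demo[kmeans.labels_ == c]
    print(c, kmeans.cluster_centers_[c], members.mean(axis=0))

# A new point goes to the cluster whose center is nearest (Euclidean distance).
new_point = np.array([[8.5, 9.5]])
dists = np.linalg.norm(kmeans.cluster_centers_ - new_point, axis=1)
nearest = int(np.argmin(dists))
predicted = int(kmeans.predict(new_point)[0])
print(nearest, predicted)
```

Each row of kmeans.cluster_centers_ has one value per input feature, so in the question's output every row of 9 numbers is the average of the 9 columns of df over the points in that cluster.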
I am currently doing some clustering based on word embeddings, and I am using some methods (elbow and Davies-Bouldin) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it is a measure of how well each data point matches its assigned cluster, ranging from -1 (mismatch) to 1 (correct match).
Using k-means clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (which is a bit too smooth, but that might be because I have a lot of data) and the Davies-Bouldin index, I should consider 5 clusters. However, I don't know whether 0.5 can be considered a good score. I added the graphs of the different measures I made, the function I used to generate them (found online), and the clustering obtained.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

SEED = 10  # seed of 10 for reproducibility

def check_clustering(X, K):
    sse, db, slc = {}, {}, {}
    for k in range(2, K):
        kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=SEED).fit(X)
        if k == 3:
            labels = kmeans.labels_
        clusters = kmeans.labels_
        # Inertia: sum of squared distances of samples to their closest cluster center
        sse[k] = kmeans.inertia_
        db[k] = davies_bouldin_score(X, clusters)
        slc[k] = silhouette_score(X, clusters)
    # SSE (elbow) curve
    plt.figure(figsize=(15, 10))
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()
    # Davies-Bouldin curve (lower is better)
    plt.figure(figsize=(15, 10))
    plt.plot(list(db.keys()), list(db.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Davies-Bouldin values")
    plt.show()
    # Silhouette curve (higher is better)
    plt.figure(figsize=(15, 10))
    plt.plot(list(slc.keys()), list(slc.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Silhouette score")
    plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
(Also, a subsidiary question: I find the shape of the clusters a bit weird; I would expect them to be more fragmented. Is this a possible shape for clusters? Note that I used PCA to reduce the dimensions, so it might be because of that.)
Thank you for your help.
Just searched this myself.
A silhouette score of 1 means each data point is unlikely to be assigned to another cluster.
A score close to 0 means each data point could easily be assigned to another cluster.
A score close to -1 means the data point is misclassified.
Based on these assumptions, I'd say 0.55 is still informative though not definitive and therefore you would need additional analysis to make any assertions based on your data.
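The overall silhouette score is the mean of per-point values s(i) = (b - a) / max(a, b), where a is the mean distance from point i to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. Looking at the per-point values can say more than the average alone. A minimal sketch on hypothetical synthetic data (X_demo is made up, not the word embeddings from the question):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical stand-in data: three Gaussian blobs.
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

overall = silhouette_score(X_demo, labels)        # mean over all points
per_point = silhouette_samples(X_demo, labels)    # one value per point

print("mean silhouette:", overall)
print("share of points below 0:", np.mean(per_point < 0))
print("lowest / highest per-point value:", per_point.min(), per_point.max())
```

A mean around 0.5 with few negative per-point values suggests a different situation from the same mean with many points below zero, so the distribution is worth inspecting alongside the average.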
I'm learning about DBSCAN and apparently the most important hyperparameter is eps, from sklearn documentation:
eps float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other.
This is not a maximum bound on the distances of points within a cluster.
This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
I notice that the default of 0.5 doesn't take the range of the distances in our data into account. In other words, if I use distances from 1 to 100, will it still work the same way if I scale those distances up by a factor of 100? Or down by a factor of 10? Or is this parameter supposed to be used with normalized distances (max_distance = 1)?
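The question can be tested empirically. The sketch below, on hypothetical synthetic data (X_demo, the scale factor, and min_samples are illustrative assumptions), fits DBSCAN on the original data, on the data multiplied by 100 with the same eps, and on the scaled data with eps multiplied by 100 as well.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.3, random_state=0)

scale = 100.0
base = DBSCAN(eps=0.5, min_samples=5).fit(X_demo).labels_
scaled_same_eps = DBSCAN(eps=0.5, min_samples=5).fit(X_demo * scale).labels_
scaled_both = DBSCAN(eps=0.5 * scale, min_samples=5).fit(X_demo * scale).labels_

print("clusters, original data:", len(set(base) - {-1}))
print("clusters, data x100, same eps:", len(set(scaled_same_eps) - {-1}))
print("labels identical when eps is scaled too:", np.array_equal(base, scaled_both))
```

In this sketch eps behaves as a distance in the same units as the data: scaling the data without scaling eps changes the result, while scaling both leaves the labels unchanged.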
I currently have a list of 3D coordinates which I want to cluster by density into an unknown number of clusters. In addition to that, I want to score the clusters by population and by distance to the centroids.
I would also like to be able to set a maximum possible distance from a certain centroid. Ideally the centroid is a point of the data set, but that is not absolutely necessary. I want to do this for lists ranging from approximately 100 to 10000 3D coordinates.
So for example, say I have a point [x,y,z] which could be my centroid:
Points that are closest to [x,y,z] should contribute the most to its score, i.e. via a logistic scoring function like y = (1 + exp(4*(-1.0+d)))**-1, where d represents the Euclidean distance to point [x,y,z]
( https://www.wolframalpha.com/input/?i=(1+%2B+exp(4(-1.0%2Bx)))**+-1 )
Since this function never reaches 0, a maximum distance must be set (e.g. 2 distance units) to limit the extent of the cluster.
I want to repeat this until no more clusters can be made. I am only interested in the centroid, so it should preferably be a real data point rather than an interpolated one, because it also has other properties connected to it.
I have already tried DBSCAN from sklearn, which is several orders of magnitude faster than my code, but it obviously does not accomplish what I want to do.
Currently I am just calculating the proximity of every point relative to all other points and am scoring every point by the number and distance to its neighbors (with the same scoring function discussed above), then I take the highest scored point and remove all other, lower scored, points that are within a certain cutoff distance. It gets the job done and is accurate, but it is too slow.
I hope I could be somewhat clear with what I want to do.
Use the neighbor search functions of sklearn to quickly find the points within the maximum distance of 2. Do this only once, and compute the logistic weights only once.
Then do the remainder using only this precomputed data.
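A minimal sketch of this suggestion, assuming scikit-learn's NearestNeighbors with a single radius query. The helper names (logistic_weight, greedy_centers), the data X_demo, and the choice to score a point by the sum of its neighbors' weights are assumptions made for illustration; the greedy selection follows the procedure described in the question.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def logistic_weight(d):
    """Scoring function from the question: (1 + exp(4 * (d - 1))) ** -1."""
    return 1.0 / (1.0 + np.exp(4.0 * (d - 1.0)))

def greedy_centers(X, max_dist=2.0):
    """Pick high-scoring points and suppress their neighbors within max_dist."""
    nn = NearestNeighbors(radius=max_dist).fit(X)
    # One radius query for all points; a point is not its own neighbor when X is omitted.
    dists, idxs = nn.radius_neighbors()
    # Score of a point: sum of logistic weights over its neighbors (assumed scoring).
    scores = np.array([logistic_weight(d).sum() for d in dists])

    suppressed = np.zeros(len(X), dtype=bool)
    centers = []
    for i in np.argsort(-scores):   # highest score first
        if suppressed[i]:
            continue
        centers.append(int(i))
        suppressed[idxs[i]] = True  # drop lower-scored points near this center
        suppressed[i] = True
    return centers, scores

# Hypothetical stand-in data: 500 random 3D coordinates.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 10.0, size=(500, 3))
centers, scores = greedy_centers(X_demo)
print(len(centers), "centers; best score:", scores[centers[0]])
```

The neighbor lists and weights are computed once, and the selection loop only touches that precomputed data, so every returned center is a real data point and no two centers lie within max_dist of each other.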