I have been using KMeans to extract clusters from a set of text lines, and I'm not very impressed with the results, so I wanted to try DBSCAN to see if it produces better ones. Does DBSCAN output the words per cluster the way KMeans does?
I was able to run DBSCAN and it reports 3 clusters, but I would like to know what is driving it to form those 3 clusters (i.e. I would like to see the words).
Here is my code snippet:
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

# Mark the core samples found by DBSCAN
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise (label -1) if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
You do not have direct control over how many clusters DBSCAN produces. It finds however many clusters exist at the given density level, which you control indirectly by varying epsilon.
Note that it also produces noise: the points labeled -1 are not a cluster but leftover points that do not belong to any cluster. If you simply discard these points, your silhouette score becomes misleading.
As DBSCAN clusters may be arbitrarily shaped, there is no meaningful 'centroid' as in k-means that you could interpret as "words" (and even in k-means that interpretation is often of dubious value anyway).
Please read the Wikipedia article & DBSCAN literature for further details.
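If you still want "words" out of DBSCAN, one workaround is to aggregate the feature weights of the documents that fall under each label. This is only a sketch, assuming X came from a fitted TfidfVectorizer called vectorizer (that name is an assumption, not part of your code):

import numpy as np

# Assumption: X is the TF-IDF matrix and `vectorizer` is the fitted TfidfVectorizer
terms = np.array(vectorizer.get_feature_names_out())

for cluster_id in sorted(set(labels)):
    if cluster_id == -1:
        continue  # skip noise points
    # mean TF-IDF weight of each term over the documents in this cluster
    mean_weights = np.asarray(X[labels == cluster_id].mean(axis=0)).ravel()
    print(cluster_id, terms[mean_weights.argsort()[::-1][:10]])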
Related
I use the code below to output the results of the clustering. What do the values under "Cluster Centers" mean, and how should I interpret this data?
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4).fit(df)

print("Number of clusters: ", kmeans.n_clusters)
print("-" * 70)
print("Cluster Centers: ", '\n', kmeans.cluster_centers_)
Number of clusters: 4
----------------------------------------------------------------------
Cluster Centers:
[[4.10000000e+02 9.92833333e+03 3.42200000e+03 3.73333333e+00
2.32433333e+03 1.36733333e+03 1.31600000e+03 5.16666667e+01
9.57000000e+02]
[4.55000000e+01 3.41650000e+03 1.42100000e+04 3.70000000e+00
5.95000000e+02 3.60000000e+02 3.46500000e+02 1.35000000e+01
2.34500000e+02]
[3.41666667e+01 1.14600000e+03 3.33358333e+03 3.69166667e+00
7.02500000e+02 4.14583333e+02 3.99166667e+02 1.53333333e+01
2.87916667e+02]
[5.14000000e+02 2.48310000e+04 5.78750000e+03 3.75000000e+00
1.75350000e+03 1.05200000e+03 1.01200000e+03 3.95000000e+01
7.02000000e+02]]
It means that you have four clusters, and the given vectors are the centers of those clusters.
So, for a new point, you can check which centroid is closest and assign the new point to that cluster.
For example, if you plotted the four clusters above with an 'X' marking each centroid, a new point would be classified according to whichever 'X' it lies closest to.
You can also measure the quality of the clusters yourself; see Silhouette - Wikipedia.
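As a concrete sketch of that assignment step (the new point below is made up purely for illustration), KMeans.predict does the closest-centroid lookup for you:

import numpy as np

# hypothetical new 9-dimensional observation, same feature order as df
new_point = np.array([[100.0, 5000.0, 4000.0, 3.7, 900.0, 500.0, 480.0, 20.0, 350.0]])

# predict() assigns the point to the cluster with the nearest centroid
print(kmeans.predict(new_point))

# equivalent manual check: index of the closest centroid
distances = np.linalg.norm(kmeans.cluster_centers_ - new_point, axis=1)
print(distances.argmin())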
Your code asked to find four clusters using the KMeans algorithm. See the docs. As expected, you obtain 4 clusters. Based on the kmeans.cluster_centers_, we can tell that your space is 9-dimensional (9 coordinates for each point), because the cluster centroids are 9-dimensional.
The centroids are the means of all points within a cluster. This doc is a good introduction for getting an intuitive understanding of the k-means algorithm.
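To see that concretely, you can recompute a centroid as the mean of the rows assigned to it; a small sketch, assuming df is the DataFrame you fitted on:

# rows that KMeans assigned to cluster 0
cluster_0_points = df[kmeans.labels_ == 0]

# on a converged fit, the column-wise mean of these rows is (essentially) the first centroid
print(cluster_0_points.mean(axis=0).to_numpy())
print(kmeans.cluster_centers_[0])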
I am currently doing some clustering based on word embeddings, and I am using some methods (elbow and Davies-Bouldin) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it measures how well each data point matches its assigned cluster, ranging from -1 (mismatch) to 1 (correct match).
Using k-means clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (which is a bit too smooth, but that might be because I have a lot of data) and the Davies-Bouldin index, I should consider 5 clusters. However, I don't know whether 0.5 can be considered a good score. I added the graphs of the different measures I made, the function I used to generate them (found online), as well as the clustering obtained.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

SEED = 10  # seed of 10 for reproducibility

def check_clustering(X, K):
    sse, db, slc = {}, {}, {}
    for k in range(2, K):
        kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=SEED).fit(X)
        if k == 3:
            labels = kmeans.labels_
        clusters = kmeans.labels_
        sse[k] = kmeans.inertia_  # inertia: sum of squared distances of samples to their closest cluster center
        db[k] = davies_bouldin_score(X, clusters)
        slc[k] = silhouette_score(X, clusters)

    plt.figure(figsize=(15, 10))
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("SSE")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(db.keys()), list(db.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Davies-Bouldin values")
    plt.show()

    plt.figure(figsize=(15, 10))
    plt.plot(list(slc.keys()), list(slc.values()))
    plt.xlabel("Number of clusters")
    plt.ylabel("Silhouette score")
    plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
(Also, a subsidiary question: I find the shape of the clusters a bit weird; I would expect them to be more fragmented. Is this a possible shape for clusters? Note that I used PCA to reduce the dimensions, so it might be because of that.)
Thank you for your help.
Just searched this myself.
A silhouette score close to one means the data point is well matched to its own cluster and unlikely to be assigned to another cluster.
A score close to zero means the data point could just as easily be assigned to another cluster.
A score close to -1 means the data point is misclassified.
Based on these interpretations, I'd say 0.55 is informative but not definitive, so you would need additional analysis to make any assertions about your data.
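If you want more than the single averaged score, one thing you can do (not something from the post above) is look at the per-point silhouette values, which show how that 0.5-0.55 average is spread across the clusters. A sketch, assuming kmeans is a model refit at your chosen k on the same X:

import numpy as np
from sklearn.metrics import silhouette_samples

# per-point silhouette values for the fitted model
sample_values = silhouette_samples(X, kmeans.labels_)

for cluster_id in np.unique(kmeans.labels_):
    cluster_values = sample_values[kmeans.labels_ == cluster_id]
    print(cluster_id, cluster_values.mean(), cluster_values.min())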
Hi, I have computed the mean of my vectors and used DBSCAN to cluster them. However, I am unsure how I should plot the results, since my data does not have an [x, y, z, ...] format.
Sample dataset:
mean_vec = [[2.2771908044815063],
[3.0691280364990234],
[2.7700443267822266],
[2.6123080253601074],
[2.6043469309806824],
[2.6386525630950928],
[2.7034034729003906],
[2.3540258407592773]]
I have used the code below (from scikit-learn) to obtain my clusters:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(mean_vec)

db = DBSCAN(eps=0.15, min_samples=5).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise (label -1) if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Is it possible to plot my clusters? The plotting code from the scikit-learn example is not working for me. The scikit-learn example can be found here.
On one-dimensional data, use kernel density estimation (KDE) rather than DBSCAN. It is much better supported by theory and much better understood. One can see DBSCAN as a fast approximation to KDE for the multivariate case.
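For example, here is a minimal KDE sketch with scikit-learn; the bandwidth is a guess you would need to tune, and the local minima of the estimated density serve as cut points between 1D clusters:

import numpy as np
from scipy.signal import argrelextrema
from sklearn.neighbors import KernelDensity

values = X.ravel()  # the single standardized feature as a flat array
grid = np.linspace(values.min(), values.max(), 500)[:, None]

# fit a 1D Gaussian KDE (bandwidth chosen arbitrarily here)
kde = KernelDensity(kernel='gaussian', bandwidth=0.2).fit(values[:, None])
density = np.exp(kde.score_samples(grid))

# local minima of the density are natural boundaries between 1D clusters
cut_points = grid[argrelextrema(density, np.less)[0]].ravel()
print(cut_points)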
Anyway, plotting one-dimensional data is not that hard. For example, you can plot a histogram.
The clusters will necessarily correspond to intervals, so you can also plot lines for the (min, max) of each cluster.
You can even abuse 2D scatter plots: simply use the cluster label as the y value.
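A small sketch of those plotting ideas, assuming X and labels come from your DBSCAN snippet above:

import matplotlib.pyplot as plt

values = X.ravel()  # the single feature as a flat array

# histogram of the one-dimensional data
plt.hist(values, bins=20)
plt.show()

# scatter plot abusing the cluster label as the y value (-1 is noise)
plt.scatter(values, labels, c=labels)
plt.xlabel("standardized value")
plt.ylabel("DBSCAN label")
plt.show()

# horizontal line from min to max of each cluster
for cluster_id in sorted(set(labels)):
    if cluster_id == -1:
        continue  # skip noise
    members = values[labels == cluster_id]
    plt.hlines(cluster_id, members.min(), members.max())
plt.show()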
I have been trying to implement DBSCAN using scikit-learn and have so far failed to determine values of epsilon and min_samples that give me a sizeable number of clusters. I tried finding the average value in the distance matrix and used values on either side of the mean, but haven't got a satisfactory number of clusters:
Input:
db = DBSCAN(eps=13.0, min_samples=100).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Output:
Estimated number of clusters: 1
Input:
db = DBSCAN(eps=27.0, min_samples=100).fit(X)
Output:
Estimated number of clusters: 1
Also some other information:
The average distance between any 2 points in the distance matrix is 16.8354.
The minimum distance is 1.0.
The maximum distance is 258.653.
Also the X passed in the code is not the distance matrix but the matrix of feature vectors.
So please tell me: how do I determine these parameters?
Plot a k-distance graph and look for a knee there, as suggested in the DBSCAN article.
(Your min_samples might be too high - you probably won't have a knee in the 100-distance graph then.)
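A minimal sketch of that k-distance plot, using k = min_samples as the usual heuristic (the value is yours to adjust):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 100  # the min_samples value you intend to pass to DBSCAN

# distance of every point to its k-th neighbour (the query point itself counts as the first one)
neighbors = NearestNeighbors(n_neighbors=k).fit(X)
distances, _ = neighbors.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# look for a knee in this curve; its height is a reasonable candidate for eps
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel("distance to the %d-th neighbour" % k)
plt.show()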
Visualize your data. If you can't visually see clusters, there might be no clusters. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set is a Gaussian distribution, it is supposed to be a single cluster only.
Try lowering the min_samples parameter. It affects the minimum size of each cluster that can be formed. It may be that the possible clusters are all small, and the value you are using right now is too high for them to form.
By my understanding of DBSCAN, it's possible to specify an epsilon of, say, 100 meters and, because DBSCAN takes into account density-reachability rather than direct density-reachability when finding clusters, end up with a cluster in which the maximum distance between any two points is greater than 100 meters. In a more extreme case, it seems possible that you could set epsilon to 100 meters and end up with a cluster 1 kilometer across:
see [2][6] in this array of images from scikit-learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and misunderstanding DBSCAN, if that's what's happening here.)
Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?
DBSCAN indeed does not impose a total size constraint on the cluster.
The epsilon value is best interpreted as the size of the gap separating two clusters (a gap that may contain at most minpts-1 objects).
I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. That structure can be simple (as in k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and DBSCAN).
You might instead be looking for vector quantization (reducing a data set to a smaller set of representatives) or set cover (finding the optimal cover for a given set).
However, I also have the impression that you aren't really sure on what you need and why.
A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.
Direct density-reachability, however, does not define a useful partitioning structure: it simply does not partition the data into disjoint partitions.
If you don't need this kind of strong structure (i.e. you don't do clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way; it will depend heavily on the ordering of your data. If you then process the result appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering is at least as expensive as a full DBSCAN run. Due to its loose requirements (in particular, "canopies" may overlap, and objects are expected to belong to multiple of them), it is easier to parallelize.
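For reference, here is a toy sketch of that two-threshold idea (one common formulation; this is not library code, and the random selection makes the result order-dependent, as noted above):

import numpy as np

def canopy(points, t1, t2, rng=np.random.default_rng(0)):
    # t1 > t2: loose threshold t1 decides canopy membership,
    # tight threshold t2 decides which points may still start or join new canopies
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[rng.integers(len(remaining))]
        dists = np.linalg.norm(points - points[center], axis=1)
        canopies.append([i for i in remaining if dists[i] <= t1])
        remaining = [i for i in remaining if dists[i] > t2]
    return canopies  # canopies may overlap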
Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters will all have at most that distance between any two of their objects. The only problem is that hierarchical clustering is usually O(n^3), i.e. it doesn't scale to large data sets, whereas DBSCAN runs in O(n log n) in good implementations (with index support).
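A hedged sketch of that last suggestion with scikit-learn's AgglomerativeClustering; the threshold value and the assumption that X is a plain feature matrix with Euclidean distances are mine, not from the question (for geographic points you would pass a precomputed distance matrix instead):

from sklearn.cluster import AgglomerativeClustering

max_pairwise_distance = 100.0  # your desired cap on the distance between any two points in a cluster

# complete linkage merges clusters by their largest pairwise distance,
# so cutting the dendrogram at this threshold bounds the diameter of every cluster
model = AgglomerativeClustering(
    n_clusters=None,
    linkage="complete",
    distance_threshold=max_pairwise_distance,
)
cluster_labels = model.fit_predict(X)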
I had the same problem and ended up solving it by using DBSCAN in combination with KMeans clustering: first I use DBSCAN to identify high-density clusters and remove outliers, then I take any cluster larger than 250 miles (in my case) and break it apart. Here's the code:
from sklearn.cluster import DBSCAN, KMeans
import mpu  # third-party package used for the haversine distance

clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:, 'cluster'] = clustering.labels_

def calculate_cluster_size(lat, long):
    # approximate cluster size as the haversine distance (in miles) between the
    # top-left and bottom-right corners of the cluster's bounding box
    left_top = (max(lat), min(long))
    right_bottom = (min(lat), max(long))
    distance = mpu.haversine_distance(left_top, right_bottom) * 0.621371  # km -> miles
    return distance

for c, df in load_geocodes.groupby('cluster'):
    if c == -1:
        continue  # don't do this for outliers
    distance = calculate_cluster_size(df['lat'], df['long'])
    print(distance)
    if distance > 250:
        # break the cluster into more sub-clusters until the largest
        # sub-cluster is less than 250 miles across
        max_distance = distance
        i = 2
        while max_distance > 250:
            kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
            df.loc[:, 'cl_temp'] = kmeans.labels_
            max_temp_cl_size = 0
            for temp_cl, temp_cl_df in df.groupby('cl_temp'):
                temp_cl_size = calculate_cluster_size(temp_cl_df['lat'], temp_cl_df['long'])
                if temp_cl_size > max_temp_cl_size:
                    max_temp_cl_size = temp_cl_size
            i += 1
            max_distance = max_temp_cl_size
        load_geocodes.loc[df.index, 'subcluster'] = kmeans.labels_