Python DBSCAN - How to plot clusters based on mean of vectors? - python

Hi, I have computed the mean of the vectors and used DBSCAN to cluster them. However, I am unsure how I should plot the results, since my data does not have an [x, y, z, ...] format.
Sample dataset:
mean_vec = [[2.2771908044815063],
[3.0691280364990234],
[2.7700443267822266],
[2.6123080253601074],
[2.6043469309806824],
[2.6386525630950928],
[2.7034034729003906],
[2.3540258407592773]]
I have used the code below (from scikit-learn) to get my clusters:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(mean_vec)
db = DBSCAN(eps=0.15, min_samples=5).fit(X)
# Boolean mask marking the core samples.
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Is it possible to plot my clusters? The plotting code from the scikit-learn example is not working for me.

On one-dimensional data, use kernel density estimation (KDE) rather than DBSCAN. It is much better supported by theory and much better understood. One can see DBSCAN as a fast approximation to KDE for the multivariate case.
Anyway, plotting one-dimensional data is not that hard. For example, you can plot a histogram.
Also, the clusters will necessarily correspond to intervals, so you can plot lines for the (min, max) of each cluster.
You can even abuse 2D scatter plots: simply use the label as the y value.
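A rough sketch of the KDE idea, in case it helps: evaluate a Gaussian kernel density on a grid, cut at its local minima, and report the (min, max) interval of each resulting group. The helper name `kde_split_1d`, the bandwidth of 0.1 and the grid size are my own assumptions and not something from the question; the bandwidth in particular needs tuning for real data.

```python
import numpy as np

def kde_split_1d(values, bandwidth, grid_size=512):
    """Label 1-D values by cutting at local minima of a Gaussian KDE.

    Hypothetical helper; bandwidth and grid_size are assumed tuning knobs.
    """
    values = np.asarray(values, dtype=float).ravel()
    grid = np.linspace(values.min() - 3 * bandwidth,
                       values.max() + 3 * bandwidth, grid_size)
    # Unnormalised Gaussian kernel density on the grid (only its shape matters).
    z = (grid[:, None] - values[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1)
    # Interior grid points that are not higher than the left neighbour
    # and strictly lower than the right neighbour.
    mid = density[1:-1]
    is_min = (mid <= density[:-2]) & (mid < density[2:])
    cuts = grid[1:-1][is_min]
    # A value's label is the number of cut points below it.
    labels = np.searchsorted(cuts, values)
    return labels, cuts

# Sample values from the question, flattened to one dimension.
mean_vec = [2.2771908044815063, 3.0691280364990234, 2.7700443267822266,
            2.6123080253601074, 2.6043469309806824, 2.6386525630950928,
            2.7034034729003906, 2.3540258407592773]
labels, cuts = kde_split_1d(mean_vec, bandwidth=0.1)
values = np.asarray(mean_vec)
for lab in np.unique(labels):
    members = values[labels == lab]
    print(f"cluster {lab}: min={members.min():.3f}, max={members.max():.3f}")
```

For the scatter-plot trick, passing `values` as x and `labels` as y to `plt.scatter` would put each cluster on its own row.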

Related

Detect cluster outliers

I have a dataset where every data sample consists of 10-20 2D coordinate points. The data is mostly clean, but occasionally there are falsely annotated points. For illustration, cleanly annotated data is either clustered in a small area or spread across a larger area, whereas an outlier I'm trying to filter out lies away from the "correct" cluster.
I tried z-score filtering, but this approach falsely marked many annotations as outliers:
std_score = np.abs((points - points.mean(axis=0)) / (np.std(points, axis=0) + 0.01))
validity = np.all(std_score <= np.quantile(std_score, 0.95, axis=0), axis=1)
Is there a method designed to solve this problem?
This seems like a typical clustering problem, and if the data looks as you suggest, KMeans from scikit-learn should do the trick. Let's look at how we can do this.
First, I generate a data sample that might look somewhat like your data.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1) # For reproducibility
cluster_1 = np.random.normal(loc = [1,1], scale = [0.2,0.2], size = (20,2))
cluster_2 = np.random.normal(loc = [2,1], scale = [0.4,0.4], size = (5,2))
plt.scatter(cluster_1[:,0], cluster_1[:,1])
plt.scatter(cluster_2[:,0], cluster_2[:,1])
plt.show()
points = np.vstack([cluster_1, cluster_2])
This is what the data will look like.
Next, we do the KMeans clustering.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 2).fit(points)
We choose n_clusters as 2 on the assumption that there are 2 clusters in the dataset (the correct points and the outliers). After finding these clusters, let's look at them.
plt.scatter(points[kmeans.labels_==0][:,0], points[kmeans.labels_==0][:,1], label='cluster_1')
plt.scatter(points[kmeans.labels_==1][:,0], points[kmeans.labels_==1][:,1], label ='cluster_2')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], label = 'cluster_center')
plt.legend()
plt.show()
The result will look like the image shown below.
This should solve your problem, but there are some things to keep in mind:
It will not be perfect every time.
It might be a problem if you don't have any outliers. This can be handled with silhouette scores.
It is difficult to know which cluster to discard. This can be done by looking at the cluster centers (the green points) or by finding the cluster with fewer points.
Endnote: You might lose some points, but you would automate the entire process. It depends on how much data you are willing to trade for manual time saved.
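The two caveats above (no outliers at all, and which cluster to discard) could be combined roughly as follows. This is only a sketch: `drop_minority_cluster` is a hypothetical helper, and the silhouette threshold of 0.5 is an assumed value that you would need to tune on your own samples.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def drop_minority_cluster(points, min_silhouette=0.5, random_state=0):
    """Split into 2 clusters and keep the larger one if the split looks real.

    Hypothetical helper; min_silhouette=0.5 is an assumed threshold.
    """
    points = np.asarray(points, dtype=float)
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(points)
    labels = kmeans.labels_
    # A low silhouette suggests there is no separate outlier group: keep everything.
    if silhouette_score(points, labels) < min_silhouette:
        return points
    # Otherwise treat the cluster with fewer points as the outliers.
    majority = np.argmax(np.bincount(labels))
    return points[labels == majority]

# Synthetic sample: 20 clean points and 3 far-away outliers.
rng = np.random.default_rng(1)
clean = rng.normal(loc=[1, 1], scale=0.05, size=(20, 2))
outliers = rng.normal(loc=[5, 5], scale=0.05, size=(3, 2))
kept = drop_minority_cluster(np.vstack([clean, outliers]))
print(kept.shape)
```

Discarding by cluster size assumes the outliers are always the minority, which matches the description in the question but is not guaranteed in general.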

What is considered to be a good silhouette score?

I am currently doing some clustering based on word embeddings, and I am using some methods (elbow and Davies-Bouldin) to determine the optimal number of clusters I should consider. In addition, I consider the silhouette measure. If I understood it correctly, it is a measure of how well the data match their assigned cluster, ranging from -1 (mismatch) to 1 (correct match).
Using k-means clustering, I obtain a silhouette score oscillating between 0.5 and 0.55. So according to the silhouette, the elbow method (which is a bit too smooth, but that might be because I have a lot of data) and the Davies-Bouldin index, I should consider 5 clusters. However, I don't know whether 0.5 can be considered a good score. I added the graphs of the different measures I made, the function I used to generate them (found online), as well as the clustering obtained.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

SEED = 10  # seed of 10 for reproducibility

def check_clustering(X, K):
    sse, db, slc = {}, {}, {}
    for k in range(2, K):
        kmeans = KMeans(n_clusters=k, max_iter=1000, random_state=SEED).fit(X)
        if k == 3: labels = kmeans.labels_
        clusters = kmeans.labels_
        sse[k] = kmeans.inertia_  # Inertia: sum of squared distances of samples to their closest cluster center
        db[k] = davies_bouldin_score(X, clusters)
        slc[k] = silhouette_score(X, clusters)
    # Plot each measure against the number of clusters.
    plt.figure(figsize=(15, 10))
    plt.plot(list(sse.keys()), list(sse.values()))
    plt.xlabel("Number of cluster")
    plt.ylabel("SSE")
    plt.show()
    plt.figure(figsize=(15, 10))
    plt.plot(list(db.keys()), list(db.values()))
    plt.xlabel("Number of cluster")
    plt.ylabel("Davies-Bouldin values")
    plt.show()
    plt.figure(figsize=(15, 10))
    plt.plot(list(slc.keys()), list(slc.values()))
    plt.xlabel("Number of cluster")
    plt.ylabel("Silhouette score")
    plt.show()
I am quite new to k-means clustering and mainly followed online tutorials. Can somebody tell me if the scores obtained through the different measures (but mostly silhouette's) seem correct?
Thank you for your answer.
(Also, a subsidiary question: I find the shape of the clusters a bit weird, as I would expect them to be more fragmented. Is this a possible shape for clusters? Note that I used PCA to reduce the dimensions, so it might be because of that.)
Thank you for your help.
Just searched this myself.
A silhouette score close to 1 means each data point is unlikely to be assigned to another cluster.
A score close to 0 means each data point could easily be assigned to another cluster.
A score close to -1 means the data point is misclassified.
Based on these assumptions, I'd say 0.55 is still informative though not definitive and therefore you would need additional analysis to make any assertions based on your data.
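To make those three cases concrete: for one point, the silhouette is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. Below is a small NumPy sketch of that formula. The function name is made up, it assumes Euclidean distance and at least two clusters, and it builds the full pairwise distance matrix, so it is only meant for small data.

```python
import numpy as np

def silhouette_per_point(X, labels):
    """Per-point silhouette (b - a) / max(a, b) with Euclidean distances.

    Illustrative helper, not a library function; assumes >= 2 clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    unique_labels = set(labels.tolist())
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # usual convention: a singleton cluster scores 0
        # Mean distance to the other members of the own cluster (self-distance is 0).
        a = dists[i, own].sum() / (n_own - 1)
        # Smallest mean distance to any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in unique_labels if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_per_point(X, [0, 0, 1, 1])  # labels follow the two groups
bad = silhouette_per_point(X, [0, 1, 0, 1])   # labels cut across the groups
print(good.mean(), bad.mean())
```

The mean over all points is the overall score. With the sensible labelling the mean comes out close to 1, and with the scrambled labelling it is negative.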

DBSCAN Clustering Python - cluster words

I have been using KMeans to extract clusters from a set of lines, and I'm not very impressed with the results, so I wanted to try DBSCAN to see if it can produce better results. Does DBSCAN output cluster words as KMeans does?
I was able to use DBSCAN and get the number of clusters as '3', but I would like to know what context is driving it to make 3 clusters (I would like to know the words).
Here is my code snippet:
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))
You do not have direct control over how many clusters DBSCAN produces. It produces as many as happen to exist at the given density level; the way to influence this is to vary epsilon.
Note that it also produces noise, i.e. one of the label values does not denote a cluster but the leftover points that do not belong to any cluster (in scikit-learn these points get the label -1). If you simply discard these points, however, your silhouette score becomes misleading.
As DBSCAN clusters may be arbitrarily shaped, there is no meaningful 'centroid' as in k-means that you could interpret as "words" (but often this interpretation is anything but good anyway).
Please read the Wikipedia article and the DBSCAN literature for further details.
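Since there is no centroid to inspect, one simple way to see what ended up in each cluster is to group the original lines by their label and read them. A minimal sketch, assuming scikit-learn's convention that -1 marks noise; the helper name, the lines and the labels below are all made up purely for illustration:

```python
from collections import defaultdict

def group_by_label(items, labels):
    """Map each cluster label to its items; label -1 (noise) is kept separate."""
    clusters = defaultdict(list)
    noise = []
    for item, label in zip(items, labels):
        if label == -1:
            noise.append(item)
        else:
            clusters[int(label)].append(item)
    return dict(clusters), noise

# Hypothetical lines and labels, only to show the shape of the result.
lines = ["cheap flights", "flight deals", "pasta recipe", "easy pasta", "zzz"]
labels = [0, 0, 1, 1, -1]
clusters, noise = group_by_label(lines, labels)
print(clusters)
print(noise)
```

In the question's setting, `labels` would be `db.labels_` and `lines` the texts in the same order as the rows of X.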

K-means Clustering in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
for i in range(len(x)):
    plt.plot(x[i], colors[labels[i]], markersize = 10)
plt.scatter(centroids[:, 0], marker = "x", s = 150, linewidths = 5, zorder = 10)
plt.show()
The code above displays 4 clusters, but they are definitely not something I want to have.
I also get an error, which makes it even worse. The output I get is in the picture below.
The error I get is: TypeError: scatter() missing 1 required positional argument: 'y'. The error is not a big deal because I don't like what I have anyway.
The following image shows how I want my clusters to look.
Your data is one-dimensional (a line). If you want to visualize it in two dimensions like the picture in your post, you should use two-dimensional or multi-dimensional data, for example [[1,3], [2,3], [1,5]].
After k-means, the points are divided into k clusters, and you can use scatter to visualize the output. By the way, scatter takes both x and y; it is a two-dimensional visualization.
I suggest you take a look at Orange, a Python data mining tool. You can do k-means by drag and drop
and visualize the output of k-means easily.
Good luck! Data mining is fun :-)
Your data is one-dimensional.
Don't expect a pretty 2D plot without making up data.
To get rid of the error, you can set y=x. But it will not change much; the data will still lie on a one-dimensional line.
You could of course add random noise and set y to random values. But that means making up fake data.
For one-dimensional data, I recommend not using clustering at all. These algorithms are designed for complex multivariate data where you cannot afford a good statistical model anymore. One-dimensional data can be sorted, which allows for much more efficient algorithms. You can easily do KDE on such data and fit thousands of statistical distributions. This will give you a much more meaningful model with higher statistical power.
From a quick look at your plot, I'd say there are no clusters. Instead, your data looks to me like a skewed normal distribution with one clear outlier (to be expected at this data set size). Please try a more statistical approach.
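To illustrate the point about sorting: once the values are sorted, the widest gaps between neighbours are natural places to cut. The sketch below is a plain gap heuristic that I am substituting for illustration, not KDE and not a statistical model, and `split_at_largest_gaps` is a made-up name.

```python
import numpy as np

def split_at_largest_gaps(values, n_groups):
    """Sort 1-D values and cut at the (n_groups - 1) widest gaps.

    Hypothetical helper illustrating a sort-based split; it is not KDE.
    """
    ordered = np.sort(np.asarray(values, dtype=float))
    if n_groups <= 1:
        return [ordered]
    gaps = np.diff(ordered)
    # Positions of the widest gaps, put back into ascending order.
    cut_after = np.sort(np.argsort(gaps)[-(n_groups - 1):])
    return np.split(ordered, cut_after + 1)

# Data from the question.
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
     380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
     234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
groups = split_at_largest_gaps(x, 2)
for g in groups:
    print(len(g), g.min(), g.max())
```

On this data the single widest gap lies between 684 and 916, so a two-way split just isolates 916, which fits the "one clear outlier" reading.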
Since you work with only one dimension, you should understand what exactly you are computing. With KMeans, you extract four average values; the best thing you can do here is draw your data against its index, with four horizontal lines showing these values. I get the following picture with the code below. This picture is the 1D equivalent of the 2D picture you are showing.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
x = [916,684,613,612,593,552,487,484,475,474,438,431,421,418,409,391,389,388,
380,374,371,369,357,356,340,338,328,317,316,315,313,303,283,257,255,254,245,
234,232,227,227,222,221,221,219,214,201,200,194,169,155,140]
kmeans = KMeans(n_clusters=4)
a = kmeans.fit(np.reshape(x,(len(x),1)))
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["g.","r.","y.","b."]
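# Draw one horizontal line per centroid.
for c in centroids[:, 0]:
    plt.plot([0, len(x) - 1], [c, c], "k")
# Plot each value against its index, colored by cluster.
for i in range(len(x)):
    plt.plot(i, x[i], colors[labels[i]], markersize = 10)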
plt.show()
Computing k-means on 1D data is more interesting with curves like the following one (from the page http://lasp.colorado.edu/home/sorce/2013/01/28/the-sorce-mission-celebrates-ten-years/), because you can clearly see two distinct average values:

Scikit DBSCAN eps and min_sample value determination

I have been trying to implement DBSCAN using scikit-learn and am so far failing to determine the values of eps and min_samples that will give me a sizeable number of clusters. I tried finding the average value in the distance matrix and used values on either side of the mean, but haven't got a satisfactory number of clusters:
Input:
db=DBSCAN(eps=13.0,min_samples=100).fit(X)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
output:
Estimated number of clusters: 1
Input:
db=DBSCAN(eps=27.0,min_samples=100).fit(X)
Output:
Estimated number of clusters: 1
Some other information:
The average distance between any 2 points in the distance matrix is 16.8354.
The min distance is 1.0.
The max distance is 258.653.
Also, the X passed in the code is not the distance matrix but the matrix of feature vectors.
So please tell me how I should determine these parameters.
Plot a k-distance graph and look for a knee there, as suggested in the DBSCAN article.
(Your min_samples might be too high; you probably won't have a knee in the 100-distance graph then.)
Visualize your data. If you can't visually see clusters, there might be no clusters. DBSCAN cannot be forced to produce an arbitrary number of clusters. If your data set is a Gaussian distribution, it is supposed to be a single cluster only.
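For reference, the values behind a k-distance graph can be computed roughly like this: take each point's distance to its k-th nearest neighbour and sort those distances. This sketch uses plain NumPy, assumes Euclidean distance, and builds the full pairwise distance matrix, so it only suits data sets that fit in memory; `k_distances` is a made-up helper name.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbour.

    Illustrative helper; builds an O(n^2) distance matrix.
    """
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest neighbour.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

# Three points close together and one far away.
demo = [[0, 0], [1, 0], [2, 0], [10, 0]]
print(k_distances(demo, 1))
```

Plotting the returned array (for example with `plt.plot`) gives the curve in which to look for the knee; the distance at the knee is a candidate for eps.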
Try changing the min_samples parameter to a lower value. This parameter affects the minimum size of each cluster formed. Maybe the clusters that could be formed are all small, and the value you are using right now is too high for them to form.
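A quick way to see the effect of min_samples is to count the clusters over a small grid of parameters. The sketch below uses synthetic data (two tight blobs of 30 points each) and parameter values chosen only for illustration; `count_clusters` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def count_clusters(X, eps_values, min_samples_values):
    """Return {(eps, min_samples): number of clusters}, ignoring noise (-1)."""
    counts = {}
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
            counts[(eps, min_samples)] = len(set(labels)) - (1 if -1 in labels else 0)
    return counts

# Synthetic data: two tight blobs of 30 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(30, 2)),
               rng.normal(5.0, 0.1, size=(30, 2))])
counts = count_clusters(X, eps_values=[0.5], min_samples_values=[5, 100])
for params, n in counts.items():
    print(params, n)
```

With only 60 points in total, no point can have 100 neighbours, so min_samples=100 leaves everything as noise, while min_samples=5 recovers both blobs.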
