K means clustering on unevenly sized clusters - python

I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this
But when I apply k-means it doesn't give me the centroids I expect, and it classifies points incorrectly.
Also, what would be a good way to find the points that are not classified correctly in scikit-learn?
Here is the code.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)

# Plot the fitted cluster centers
plt.plot(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 'ro')
plt.show()
Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following.
It might have happened because of a bad choice of initial centroids, but what could be the solution?

First of all, I hope you noticed that the ranges on the X and Y axes are different in the two figures. So the first centroid (sorted by X value) isn't that bad. The second and third ones come out where they do because of the large number of outliers: they probably each absorb half of the two rightmost clusters. Also, the output of k-means depends on the initial choice of centroids, so see whether different runs, or setting the init parameter to 'random', improve the results.
Another way to improve things would be to remove all the points that have fewer than some n neighbors within a radius d. To implement that efficiently you would probably need a k-d tree, or you could just use the DBSCAN provided by sklearn here and see if it works better.
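A rough sketch of that neighbor-count filtering idea, reusing Train_data from the question (the values of d and n_min below are made up and need tuning for your data's scales):

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

X = Train_data.values

# Keep only points with at least n_min other points within radius d
d, n_min = 0.5, 10
nn = NearestNeighbors(radius=d).fit(X)
neighbor_lists = nn.radius_neighbors(X, return_distance=False)
mask = np.array([len(idx) - 1 >= n_min for idx in neighbor_lists])  # -1 excludes the point itself

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(X[mask])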
Also, k-means++ is likely to pick outliers as initial cluster centers, as explained here. So you may want to change the init parameter of KMeans to 'random', perform multiple runs, and take the best centroids.
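A small variation on the question's code along those lines (the value 50 for n_init is just an illustrative choice):

from sklearn.cluster import KMeans

# 'random' picks initial centers uniformly from the data instead of k-means++;
# n_init independent runs are performed and the one with the lowest inertia
# (within-cluster sum of squares) is kept automatically.
km = KMeans(n_clusters=3, init='random', n_init=50, max_iter=300)
km.fit(Train_data.values)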
Since your data is 2-D, it is easy to check whether points are classified correctly or not. Use the mouse to 'pick' the coordinates of approximate centroids (see here) and just compare the clusters obtained from the picked coordinates with those obtained from k-means.
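A sketch of that comparison, assuming an interactive matplotlib backend and the fitted km model from the question; the cross-tabulation shows how the two labelings overlap:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

X = Train_data.values
plt.scatter(X[:, 0], X[:, 1], s=2)
picked = np.array(plt.ginput(3))                 # click three approximate centroids
plt.close()

manual_labels = cdist(X, picked).argmin(axis=1)  # assign each point to its nearest picked centroid
# Cluster numbers are arbitrary, so compare the two labelings with a cross-tabulation
print(pd.crosstab(manual_labels, km.labels_))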

I got a solution for this.
The problem was scaling.
I just scaled both features using
sklearn.preprocessing.scale
and this is my result:
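For reference, a minimal sketch of that scaling step, assuming the same Train_data as in the question:

from sklearn.preprocessing import scale
from sklearn.cluster import KMeans

# Standardize both features (zero mean, unit variance) before clustering
X_scaled = scale(Train_data.values)

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(X_scaled)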

Related

HDBSCAN: shouldn't any object in a cluster have a probability value > 0? And it produces inconsistent results

I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook.
import pandas
import numpy as np
data = pandas.read_csv('data.csv')
That data looks something like this:
import hdbscan
clusterSize = 6
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
And yay! everything seems to work!
So I then want to see some results, so I add these results to my data frame:
data.insert(18,"labels",clusterer.labels_)
data.insert(19,"probabilities",clusterer.probabilities_)
But wait, I have rows that are labeled as part of a cluster but have probability 0. How does that make sense? Shouldn't any object in a cluster have a probability value > 0? Oh, and all the probabilities are only 0 OR 1.
So I rerun this in Jupyter notebook, specifically, I just rerun
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
and I check the values for clusterer.labels_ and clusterer.probabilities_ and they are different. Isn't this thing supposed to be consistent? Why would those values change? Is there some hidden state that I'm not told about? But now my clusterer.probabilities_ have values that are between 0 and 1... so that's good right?
So I'm obviously not very familiar with this hdbscan tool, but can someone explain why it gives different answers when run multiple times, and whether a probability of 0 on a labeled/clustered object makes sense?
According to the API documentation:
labels: Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
probabilities: The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.
Therefore a probability of zero is meaningful.
I was also expecting the results of different runs on the same data to be the same, but it looks like that is not exactly true. According to Wikipedia:
DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN* is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.
So maybe selecting a specific variant of the algorithm will help make the clustering deterministic.
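As a quick sanity check (a sketch reusing the names from the question), you can verify whether the zero-probability rows are exactly the noise points labeled -1; if the fit matches the documented behaviour, this prints [-1] and True:

import numpy as np

labels = clusterer.labels_
probs = clusterer.probabilities_

print(np.unique(labels[probs == 0]))    # which labels occur among zero-probability points
print((probs[labels >= 0] > 0).all())   # do all clustered points have probability > 0?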

Weighted clustering in sklearn

Assume I have a set of points (x, y, and a size). I want to find clusters in my data using sklearn.cluster.DBSCAN and their centers. That is no problem if I treat every point the same. But actually I want the weighted centers instead of the geometrical centers (meaning a bigger point should count more than a smaller one).
I came across sample_weight, but I don't quite get whether that is what I need. When I use sample_weight (right side) I get completely different clusters than when I don't use it (left side):
Second, I thought about using np.repeat(x, w), where x is my data and w is the size of each point, so I get multiple copies of the points proportional to their weights. But this is probably not a smart solution since I have a lot of data, right?
Is sample_weight useful in my case, or are there suggestions for better solutions than using np.repeat? I know there are already some questions about sample_weight, but I could not work out how exactly to use it.
Thanks!
The most important thing for DBSCAN is the parameter setting. There are two parameters, eps (epsilon) and min_samples (minPts). eps is the radius of the neighborhood around each point, and a point is treated as a core point of a cluster if at least min_samples points fall within that radius. So instead of using np.repeat I would suggest adjusting these parameters for this dataset.
According to the documentation of DBSCAN, sample_weight is a tuning parameter for your runtime:
Another way to reduce memory and computation time is to remove
(near-)duplicate points and use sample_weight instead.
I think you want to address the quality of your result first before you tune your runtime.
I am not sure what you mean by weighted centers; perhaps you are referring to a different clustering algorithm, such as a Gaussian mixture model.
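If by weighted centers you mean a size-weighted average of each cluster's points, one way to get them after DBSCAN is sketched below (the coordinates and sizes are made up, and eps / min_samples need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up points (x, y) and sizes; replace with your own data
rng = np.random.default_rng(0)
points = rng.random((200, 2))
sizes = rng.integers(1, 10, size=200)

db = DBSCAN(eps=0.1, min_samples=5).fit(points, sample_weight=sizes)
labels = db.labels_

# Size-weighted center of each cluster (label -1 is noise and is skipped)
for lab in sorted(set(labels) - {-1}):
    mask = labels == lab
    center = np.average(points[mask], axis=0, weights=sizes[mask])
    print(lab, center)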

what is the best algorithm to cluster this data

Can someone help me find a good clustering algorithm that will cluster this into 3 clusters, without defining the number of clusters?
I have tried many algorithms in their basic form; nothing seems to work properly.
clustering = AgglomerativeClustering().fit(temp)
In the same way I tried DBSCAN and k-means too, just following the guidelines from sklearn, and I couldn't get the expected results.
My original dataset is a 1-D list of numbers, but the order of the numbers matters, so I generated a 2-D list as below.
from sklearn.cluster import AgglomerativeClustering

temp = []
for i in range(len(avgs)):
    temp.append([avgs[i], i + 1])

clustering = AgglomerativeClustering().fit(temp)
In plotting I used a similar range for the y axis:
ax2.scatter(range(len(plots[i])), plots[i], c=np.random.rand(3,))
The order of the data matters, so this needs to be clustered into 3. And there might be other datasets where the data is fairly uniform, so that the result should be just one cluster.
Link to the list if someone wants to try.
So I tried the step detection from your answer and got the following image. But how can I find the values of the peaks? Taking the max value gives me one of them, but how do I get the rest? The second-largest value is not an answer, because the point right next to the max is the second-largest.
Your data is not 2d coordinates. So don't choose an algorithm designed for that!
Instead your data appears to be sequential or time series.
What you want to use is a change point detection algorithm, capable of detecting a change in the mean value of a series.
A simple approach would be to compute the sum of the next 10 points minus the sum of the previous 10 points, then look for extreme values of this curve.
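A rough sketch of that idea (the window size of 10 comes from the sentence above; the series here is made up for illustration):

import numpy as np
from scipy.signal import find_peaks

def step_score(x, w=10):
    """For each position, sum of the next w points minus sum of the previous w points."""
    x = np.asarray(x, dtype=float)
    kernel = np.concatenate([-np.ones(w), np.ones(w)])
    return np.correlate(x, kernel, mode='valid')

# Made-up series with two level changes
x = np.concatenate([np.full(60, 1.0), np.full(60, 5.0), np.full(60, 2.0)])
score = step_score(x)

# Peaks of the absolute score are the candidate change points; `distance` stops
# two neighbouring samples of the same step from both being reported.
peaks, _ = find_peaks(np.abs(score), distance=10)
print(peaks)  # indices into `score`; add w to map back to positions in x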

How can I fine tune K means clustering when I'm only getting clusters in lines?

It's my first time trying to do K-Means clustering using Python and Sci-Kit Learn and I don't know what to make of my final cluster plot or how to fine tune my K means clustering algorithm.
My end goal is to find a clustering of user categories that delineates some interesting or useful behavior traits.
ATTEMPT 1:
Input: Gender, Age Range, Country (all one hot encoded because the data is categorical), and Account Age (numerical in weeks old)
Code:
import pandas as pd
import sklearn.cluster
from matplotlib import pyplot

# Convert DataFrame to matrix
mat2 = all_dummy.to_numpy()  # as_matrix() was removed in newer pandas versions
# Using sklearn
km2 = sklearn.cluster.KMeans(n_clusters=6)
km2.fit(mat2)
# Get cluster assignment labels
labels2 = km2.labels_
# Format results as a DataFrame
results2 = pd.DataFrame([all_dummy.index, labels2]).T
plot_x2 = results2[0].tolist()
plot_y2 = results2[1].tolist()
pyplot.scatter(plot_x2, plot_y2)
pyplot.show()
Plot:
Specific Questions:
What is the X and Y axis of this graph?
What is this graph even telling me?
Why are there only 3 clusters showing up when I put 6 clusters as an input? (answered by first comment and updated code and graph)
How can I fine tune this graph to tell me more and show me a useful relationship if I don't know what the relationship I am looking for is?
Read up on the limitations of k-means.
In particular, be aware that
you must remove all identifier columns
k-means is very sensitive to scale. All attributes need to be carefully scaled according to their value range, distribution, and importance; preprocessing is essential (see the sketch after this list).
k-means assumes continuous variables. Its use on categorical data, even when one-hot encoded, is questionable. It sometimes works "okayish" but hardly ever works well.
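As a minimal illustration of the scaling point (a sketch, assuming the all_dummy frame from the question with any identifier columns already dropped):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale every column to zero mean / unit variance so no attribute dominates
# the Euclidean distance just because of its value range.
X = StandardScaler().fit_transform(all_dummy.to_numpy())

km = KMeans(n_clusters=6, n_init=10)
labels = km.fit_predict(X)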
According to your code, the X axis corresponds to the indices of your samples (seeing your graph, I suppose you have around 10 000 users then), and the Y axis corresponds to the labels of each sample.
You might not have 6 clusters as an input. Indeed, when you format your results as a DataFrame, a labels variable is used, while it is actually labels2 that contains the computed cluster assignments. I don't know where your labels come from, but I suspect this is the reason you obtain those results. Hence, regarding question 2, this graph probably doesn't show anything relevant.
You could first use other visualisations to better understand how your data is being clustered. Sklearn's documentation provides many examples you could use for inspiration (1, 2, 3).
Hope it helped!

Computing K-means clustering on Location data in Python

I have a dataset of users and their music plays, with every play having location data. For every user I want to cluster their plays to see if they play music in given locations.
I plan on using the scikit-learn k-means package, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, k-medoids aka. PAM for arbitrary other distance functions).
The concept of k-means is variance minimization. And variance is essentially the same as squared Euclidean distances, but it is not the same as other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
Is the data already in vector space e.g. gps coordinates? If so you can cluster on it directly, lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector space format (table lookup of locations to coords for instance). Euclidean distance is a good choice to work with vector space data.
To answer the question of whether they played music in a given location, you first fit your kmeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then you check whether any of those cluster centers are close enough to the locations you are checking for. This can be done using thresholding on the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but I can provide an example given arbitrary x and y coords instead if that's what you want.
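Here is a sketch along those lines with made-up coordinates (the place coordinates and the 0.02-degree threshold are arbitrary illustrations):

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Made-up (lat, lon) play locations for one user; replace with real data
rng = np.random.default_rng(0)
plays = np.vstack([
    rng.normal([40.71, -74.00], 0.01, size=(100, 2)),
    rng.normal([40.73, -73.99], 0.01, size=(80, 2)),
])

km = KMeans(n_clusters=2, n_init=10).fit(plays)
centers = km.cluster_centers_

# Hypothetical locations to check against
places = np.array([[40.71, -74.00], [40.80, -73.95]])

# Threshold on the distance from each place to its nearest cluster center
dists = cdist(places, centers)          # shape (n_places, n_clusters)
plays_there = dists.min(axis=1) < 0.02  # arbitrary threshold in degrees
print(plays_there)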
Also note that KMeans is probably not ideal, as you have to manually set the number of clusters "k", which could vary between people, or add some wrapper code around KMeans to determine "k". There are other clustering models which can determine the number of clusters automatically, such as MeanShift, which may be more suitable in this case and can also tell you the cluster centers.
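A short sketch of that alternative (again with made-up coordinates; quantile=0.2 is just an illustrative choice):

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Made-up (lat, lon) play locations; replace with a real user's plays
rng = np.random.default_rng(1)
plays = np.vstack([
    rng.normal([40.71, -74.00], 0.01, size=(100, 2)),
    rng.normal([40.73, -73.99], 0.01, size=(80, 2)),
])

# MeanShift chooses the number of clusters itself; the bandwidth sets the scale
bandwidth = estimate_bandwidth(plays, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(plays)
print(len(ms.cluster_centers_), ms.cluster_centers_)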
