How to check whether a new point is inside the existing clusters (Python)

I am a bit confused about clustering, e.g. K-means clustering.
I have already created clusters on the training data, and in the testing part I want to know whether the new points fit into the existing clusters or not.
My idea is to find the center of each cluster and also the farthest point of each cluster in the training data; then, in the testing part, if the distance of a new point to its nearest center is greater than a threshold (e.g. 1.5x the distance to the farthest point), it cannot belong to that cluster.
Is this idea efficient and correct, and is there any Python function to do this?
One more question:
Could someone help me understand the difference between kmeans.fit() and kmeans.predict()? I get the same result from both functions!
I appreciate any help.

In general, when you fit the K-means algorithm, you get the cluster centers as a result.
So, if you want to test which cluster a new point belongs to, you calculate the distance from the point to each cluster center and label the point with the label of the closest center.
If you are using the scikit-learn library:
predict(X) predicts the closest cluster each sample in X belongs to.
fit(X) fits the model to the data, in other words it computes the cluster centers.
Here is a nice example of how to use K-means in scikit-learn.
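For concreteness, here is a minimal sketch of both points, using made-up 2-D data and the 1.5x threshold from the question (the variable names and the factor are placeholders, not a recommended setting). Note that on the training data, km.labels_ after fit() and km.predict(X_train) give identical labels, which is why the two calls looked the same:

import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(200, 2)             # stand-in for the training data
X_new = np.array([[0.5, 0.5], [5.0, 5.0]])   # stand-in for new/test points

km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X_train)                    # computes the cluster centers
labels_train = km.labels_          # identical to km.predict(X_train)

# Radius of each cluster: distance from its center to its farthest training point.
dist_train = np.linalg.norm(X_train - km.cluster_centers_[labels_train], axis=1)
radius = np.array([dist_train[labels_train == k].max() for k in range(km.n_clusters)])

# Assign new points to the nearest center, then flag those beyond 1.5x the radius.
labels_new = km.predict(X_new)
dist_new = np.linalg.norm(X_new - km.cluster_centers_[labels_new], axis=1)
outside = dist_new > 1.5 * radius[labels_new]

With this made-up data, the far-away point [5.0, 5.0] should end up flagged as outside every cluster, which is the behaviour the question asks for.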

Related

Hierarchical Clustering Threshold

I'm using scikit-learn's agglomerative hierarchical clustering to cluster a geographical hexagonal grid of about three million cells, using contiguity constraints and ward linkage.
My question is about the number of clusters that the routine returns. I know that ward linkage minimizes the sum of the within-cluster variances. That sum is trivially minimized when each cluster is a single observation, right? That is why I assume the algorithm's starting point is just some large value for this variance, so that including the nearest contiguous observation (provided it is within the maximum distance threshold) decreases the variance of the group, and it continues this way until it reaches the top of the tree.
My question is: what criterion is used to return the cluster labels? From what I've read, the optimal number of clusters is given by the biggest jump in the tree, but I'm not sure whether that is the criterion the developers use. Does anyone know?
Ideally I could check by plotting the tree, but I'm clustering nearly 3 million cells, which makes plotting the tree both messy and unfeasible (with my computer, or the cluster I have access to, at least).
Thanks
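As background, here is a minimal sketch (with made-up data) of the two parameters scikit-learn's AgglomerativeClustering exposes for deciding where to cut the tree: a fixed number of clusters, or a merge-distance threshold. The data, neighbor count, and threshold below are arbitrary placeholders:

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1000, 2)   # stand-in for the hexagrid cell coordinates
connectivity = kneighbors_graph(X, n_neighbors=6, include_self=False)

# Cut the tree at a fixed number of clusters ...
agg_k = AgglomerativeClustering(n_clusters=50, linkage='ward',
                                connectivity=connectivity).fit(X)

# ... or at a merge-distance threshold (n_clusters must then be None);
# the number of returned clusters follows from the threshold.
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=1.0,
                                linkage='ward', connectivity=connectivity).fit(X)
print(agg_t.n_clusters_, np.unique(agg_t.labels_).size)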

Finding cluster centroid or ".means_" with sklearn.cluster.SpectralClustering

I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
With sklearn.mixture.GaussianMixture I can successfully find the centroids / "mean of each mixture component" via .means_. In my code I then take the point closest to each mean to get a representative sample from each cluster.
I want to do the same thing with SpectralClustering, but I don't see a ".means_" attribute or any method to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works, or just a missing feature of this library.
As an example, I would like to do:
sc = SpectralClustering(n_components=10, n_init=100)
sc.fit(data)
closest, _ = pairwise_distances_argmin_min(sc.means_, data)
But of course SpectralClustering doesn't have a .means_ method.
Thanks for any help on this.
Centroids are used by the K-means algorithm. For spectral clustering, the estimator only stores the affinity matrix and the labels obtained from the algorithm.
It doesn't matter whether Spectral Clustering (or any other clustering algorithm) uses cluster centers or not!
You can compute the centroid of any cluster: it is simply the mean of the elements in that cluster (well, there actually is a constraint: the dataset itself must allow the notion of a mean).
So, compute the clusters using Spectral Clustering, then for each cluster compute the mean of the elements inside it (that is, the mean on every dimension, for a cluster made up of m n-dimensional elements).
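A minimal sketch of that approach, assuming data is an (n_samples, n_features) array as in the question, and assuming n_clusters=10 is what was intended (the question passed n_components, which controls the number of eigenvectors instead):

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

data = np.random.rand(300, 5)   # stand-in for the unlabeled data set

sc = SpectralClustering(n_clusters=10, n_init=100, random_state=0)
labels = sc.fit_predict(data)

# Centroid of each cluster = mean of its members, computed by hand.
centroids = np.array([data[labels == k].mean(axis=0) for k in range(10)])

# Closest actual sample to each centroid, as done with GaussianMixture.means_.
closest, _ = pairwise_distances_argmin_min(centroids, data)
representatives = data[closest]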

DBSCAN provided with lines as input

I am new to both machine learning and python and my goal is to experiment with route prediction through clustering.
I've just started using DBSCAN and I was able to obtain results given an array of coordinates as input to the fit procedure, e.g. [[1,1],[2,2],[3,3],...], which includes all coordinates of all routes.
However, what I really want is to provide DBSCAN with a set containing all routes/lines instead of a set containing all coordinates of all routes. Therefore, my question is whether this is possible (does it even make sense?) and if so how can I accomplish this?
Thank you for your time.
Why do you think density based clustering is a good choice for clustering routes? What notion of density would you use here?
I'd rather try hierarchical clustering with a proper route distance.
But if you have the distance matrix anyway, you can of course just try DBSCAN on it for "free" (computing the distances will be way more expensive than DBSCAN on a distance matrix).
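A minimal sketch of the precomputed-distance-matrix route, using the symmetric Hausdorff distance purely as a stand-in for whatever route distance you end up choosing; the routes, eps, and min_samples below are made up:

import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.cluster import DBSCAN

# Three made-up routes, each an (n_points, 2) array of coordinates.
routes = [np.array([[0, 0], [1, 1], [2, 2]]),
          np.array([[0, 0.1], [1, 1.1], [2, 2.1]]),
          np.array([[5, 5], [6, 6], [7, 7]])]

# Pairwise route distances (symmetric Hausdorff here, but any route distance works).
n = len(routes)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = max(directed_hausdorff(routes[i], routes[j])[0],
                directed_hausdorff(routes[j], routes[i])[0])
        D[i, j] = D[j, i] = d

# DBSCAN accepts the precomputed matrix directly; eps is in route-distance units.
db = DBSCAN(eps=0.5, min_samples=1, metric='precomputed').fit(D)
print(db.labels_)

The same distance matrix could equally be fed to a hierarchical clustering routine, as suggested above.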

K means clustering on unevenly sized clusters

I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this:
But when I apply k-means it doesn't give me the centroids I expect, and it classifies points incorrectly.
Also, how can I find the points that are not correctly classified in scikit-learn?
Here is the code.
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)
# Plot the fitted cluster centers as red dots.
plt.plot(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], 'ro')
plt.show()
Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following.
It might have happened because of a bad choice of initial centroids, but what could be the solution?
First of all, I hope you noticed that the range on the X and Y axes is different in the two figures, so the first centroid (sorted by X value) isn't that bad. The second and third ones come out where they do because of the large number of outliers: they are probably each taking half of the two rightmost clusters. Also, the output of k-means depends on the initial choice of centroids, so check whether different runs, or setting the init parameter to 'random', improve the results. Another way to improve the result would be to remove all points having fewer than some n neighbors within a radius d; to implement that efficiently you would probably need a k-d tree, or you can just use the DBSCAN provided by sklearn here and see if it works better.
Also, k-means++ is likely to pick outliers as initial centers, as explained here. So you may want to change the init parameter of KMeans to 'random', perform multiple runs, and take the best centroids, as sketched below.
Since your data is 2-D, it is easy to check whether points are classified correctly. Use the mouse to 'pick' the coordinates of an approximate centroid (see here) and compare the clusters obtained from those picked coordinates with the ones obtained from k-means.
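A small sketch of the "random init, multiple runs" suggestion, with random stand-in data in place of Train_data; note that n_init already repeats the initialization internally, so the outer loop over seeds is just belt-and-braces, and inertia_ (the within-cluster sum of squares) is used to keep the best centroids:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(3500, 2)   # stand-in for Train_data.values

runs = [KMeans(n_clusters=3, init='random', n_init=20, random_state=seed).fit(X)
        for seed in range(5)]
best = min(runs, key=lambda km: km.inertia_)
print(best.cluster_centers_)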
I got a solution for this.
The problem was scaling.
I just scaled both axes using
sklearn.preprocessing.scale
And this is my result
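A minimal sketch of that fix, with random stand-in data in place of Train_data (the second feature is deliberately given a much larger scale to mimic the original problem):

import numpy as np
from sklearn import preprocessing
from sklearn.cluster import KMeans

X = np.random.rand(3500, 2) * [1, 1000]   # stand-in for Train_data.values
X_scaled = preprocessing.scale(X)          # zero mean, unit variance per column

km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(X_scaled)
# cluster_centers_ now live in the scaled space; undo the scaling if you need
# them back in the original units.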

Computing K-means clustering on Location data in Python

I have a dataset of users and their music plays, with every play having location data. For every user I want to cluster their plays to see whether they play music in particular locations.
I plan on using the scikit-learn k-means implementation, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, k-medoids aka. PAM for arbitrary other distance functions).
The concept of k-means is variance minimization. And variance is essentially the same as squared Euclidean distances, but it is not the same as other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
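If you go the DBSCAN route on raw GPS coordinates, a minimal sketch could look like the following; the coordinates, eps, and min_samples are made up, and the haversine metric expects latitude/longitude in radians:

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up (lat, lon) plays in degrees; convert to radians for haversine.
coords_deg = np.array([[52.52, 13.40], [52.53, 13.41], [40.71, -74.01]])
coords_rad = np.radians(coords_deg)

earth_radius_m = 6371000.0
eps_m = 500.0   # group plays within ~500 m of each other (an arbitrary choice)

db = DBSCAN(eps=eps_m / earth_radius_m, min_samples=2,
            metric='haversine', algorithm='ball_tree').fit(coords_rad)
print(db.labels_)   # -1 marks noise points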
Is the data already in a vector space, e.g. GPS coordinates? If so, you can cluster it directly; lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector-space format (a table lookup from locations to coordinates, for instance). Euclidean distance is a fine choice for vector-space data.
To answer the question of whether they played music in a given location, first fit your k-means model on their location data, then find the "locations" of the clusters using the cluster_centers_ attribute. Then check whether any of those cluster centers are close enough to the locations you are interested in. This can be done by thresholding the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but I can provide an example with arbitrary x and y coordinates instead if that's what you want.
Also note that KMeans is probably not ideal, as you have to set the number of clusters "k" manually, and it could vary between users, or you need extra wrapper code around KMeans to determine "k". There are other clustering models that determine the number of clusters automatically, such as MeanShift, which may be more suitable in this case and which also gives you cluster centers.
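A small sketch of both suggestions with made-up coordinates: fit KMeans, threshold the distance from known locations to the cluster centers, and, as an alternative, let MeanShift pick the number of clusters itself. The threshold and bandwidth values are arbitrary placeholders:

import numpy as np
from sklearn.cluster import KMeans, MeanShift
from scipy.spatial.distance import cdist

# Made-up (lat, lon) plays for one user; real data would come from the dataset.
plays = np.array([[52.52, 13.40], [52.53, 13.41], [52.52, 13.39],
                  [48.85, 2.35], [48.86, 2.34]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(plays)
centers = km.cluster_centers_

# Check whether any cluster center is "close enough" to a location of interest.
locations = np.array([[52.52, 13.40]])                 # purely illustrative
near = cdist(locations, centers).min(axis=1) < 0.05    # threshold in degrees, an assumption

# MeanShift picks the number of clusters itself and also exposes cluster_centers_.
ms = MeanShift(bandwidth=0.1).fit(plays)
print(ms.cluster_centers_)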
