I have a dataset of users and their music plays, with every play having location data. For every user, I want to cluster their plays to see whether they play music in particular locations.
I plan on using scikit-learn's KMeans, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, or k-medoids, a.k.a. PAM, for arbitrary other distance functions).
The concept of k-means is variance minimization, and the within-cluster variance is essentially the sum of squared Euclidean distances to the cluster mean; it has no such relationship to other distances.
Have you considered DBSCAN? scikit-learn has DBSCAN, and it has index support (ball trees and k-d trees) to make it fast.
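If the plays come as latitude/longitude pairs, a minimal sketch (all coordinates and the 500 m radius below are made up) could look like this, using the haversine metric so that eps is a real geographic distance:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical (lat, lon) pairs in degrees for one user's plays.
    coords_deg = np.array([
        [52.5200, 13.4050],
        [52.5201, 13.4049],
        [48.8566,  2.3522],
        [48.8567,  2.3520],
    ])

    kms_per_radian = 6371.0088            # mean Earth radius in km
    eps_km = 0.5                          # group plays within ~500 m of each other

    db = DBSCAN(
        eps=eps_km / kms_per_radian,      # haversine works in radians
        min_samples=2,
        metric="haversine",
        algorithm="ball_tree",
    )
    labels = db.fit_predict(np.radians(coords_deg))
    print(labels)                         # e.g. [0 0 1 1]; -1 would mean noise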
Is the data already in a vector space, e.g. GPS coordinates? If so, you can cluster on it directly; over a limited area, latitude and longitude are close enough to x and y that it shouldn't matter much. If not, you will have to preprocess it into a vector-space format (a table lookup from locations to coordinates, for instance). Euclidean distance is a good choice for vector-space data.
To answer the question of whether they played music in a given location, you first fit your KMeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then you check whether any of those cluster centers are close enough to the locations you are checking for. This can be done by thresholding the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but here is a sketch with arbitrary x and y coordinates instead.
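All coordinates, the number of clusters and the threshold below are made up for illustration:

    import numpy as np
    from scipy.spatial.distance import cdist
    from sklearn.cluster import KMeans

    # Made-up x/y coordinates standing in for one user's play locations.
    rng = np.random.default_rng(0)
    plays = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2)),   # plays around "home"
        rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2)),   # plays around "work"
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(plays)
    centers = km.cluster_centers_

    # Locations you want to check, and a distance threshold in the same units as the coords.
    places_of_interest = np.array([[0.05, 0.0], [10.0, 10.0]])
    threshold = 0.5

    # cdist gives the distance from every place of interest to every cluster center.
    dists = cdist(places_of_interest, centers)
    plays_there = (dists < threshold).any(axis=1)
    print(plays_there)                     # [ True False]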
Also note that KMeans may not be ideal here, since you have to set the number of clusters "k" manually, and it could vary between users, or you need extra wrapper code around KMeans to choose "k". Other clustering models, such as MeanShift, determine the number of clusters automatically and also report cluster centers, which may suit this case better.
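A minimal MeanShift variant of the sketch above (the quantile used to estimate the bandwidth is arbitrary and would need tuning):

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth

    # Same kind of made-up play locations as in the previous sketch.
    rng = np.random.default_rng(0)
    plays = np.vstack([
        rng.normal(loc=(0.0, 0.0), scale=0.1, size=(50, 2)),
        rng.normal(loc=(5.0, 5.0), scale=0.1, size=(30, 2)),
    ])

    bandwidth = estimate_bandwidth(plays, quantile=0.3)
    ms = MeanShift(bandwidth=bandwidth).fit(plays)
    print(ms.cluster_centers_)             # centers found without specifying "k"
    print(len(np.unique(ms.labels_)))      # number of clusters inferred from the data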
I was reading dlib's face clustering code and noticed that the process is like so:
Convert each face to a vector using a trained network
Use the Chinese whispers clustering algorithm to compute groups based on distance
Chinese whispers clustering can take a pretty long time when trying to cluster a large number (>10,000) of images.
In this pyimagesearch article the author uses DBSCAN, another clustering algorithm, to group a number of images by person.
Since the vectors generated by the neural net can be used to calculate the similarity between two faces, wouldn't it be better to just calculate a Euclidean distance matrix, then search for all the values that meet a confidence threshold (e.g. x < 0.3 for 70% confidence)?
Why use a clustering algorithm at all when you can just compare every face with every other face to determine which ones are the same person? Both DBSCAN and Chinese whispers clustering take much longer than calculating a distance matrix. With my dataset of 30,000 images the times are:
C-whisper - 5 minutes
distance matrix + search - 10-20 seconds
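To be concrete, the "distance matrix + search" step I have in mind is roughly this (the embeddings here are just random placeholders for the vectors produced by the network, and 0.3 is the threshold from above):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # Placeholder for the real (n_faces, 128) embedding array from the network.
    embeddings = np.random.rand(1000, 128).astype(np.float32)

    dists = squareform(pdist(embeddings, metric="euclidean"))   # (n, n) distance matrix
    same_person = dists < 0.3                                   # proposed threshold

    # Pairs (i, j) with i < j that the threshold calls "same person".
    i_idx, j_idx = np.where(np.triu(same_person, k=1))
    print(len(i_idx), "matching pairs")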
DBSCAN actually takes only marginally longer than computing a distance matrix (when implemented right, 99% of the computation is the distance computations) and with indexing can sometimes be much faster because it does not need every pairwise distance if the index can prune computations.
But you can't just "read off" clusters from the distance matrix. The data there may be contradictory: the face detector may consider A and B similar and B similar to C, but A and C dissimilar! What do you do then? Clustering algorithms try to solve exactly such situations. For example single-link, and to a lesser extent DBSCAN, would put A and C in the same cluster, whereas complete linkage would decide upon either AB or BC.
Actually, dlib's implementation does something very similar to what you are thinking of. Here is the code. It first checks every pair and rejects pairs whose distance is greater than a threshold. This is exactly what you proposed. But then it still runs a clustering step on the result. So, what would change?
Simply cutting off by distance can work if your data points are well separated. However, if your data points are very close to each other, this problem becomes very hard. Imagine a 1-D feature where your data points sit at the integer positions between 0 and 10, and you want to put two data points in the same cluster if their distance is at most 1.5. If you start with a pair, you can form a cluster. But if you then pick a neighboring point, you will see that it is closer than the threshold to one point already in the cluster and farther than the threshold from another. Clustering is about resolving exactly this ambiguity.
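A tiny sketch of that 1-D example, showing how a pure distance cut-off leads to contradictory "same cluster" decisions:

    import numpy as np

    points = np.arange(11)                 # 1-D "features" at integer positions 0..10
    threshold = 1.5

    # Which pairs does the cut-off call "same cluster"?
    close = np.abs(points[:, None] - points[None, :]) <= threshold

    print(close[0, 1], close[1, 2])        # True True: 0~1 and 1~2 are "together"
    print(close[0, 2])                     # False: but 0 and 2 are not
    # Following the "together" links transitively would chain all 11 points into one
    # cluster even though 0 and 10 are far apart; deciding where to break such chains
    # is exactly what a clustering algorithm does.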
I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
I can successfully find the centroids (the "mean of each mixture component") in sklearn.mixture.GaussianMixture using .means_. In my code I then take the point closest to each mean to get a representative sample for each cluster.
I want to do the same thing with SpectralClustering, but I don't see a .means_ attribute or any other way to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works or just a lack of features in this library.
As an example I would like to do:
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics import pairwise_distances_argmin_min

    sc = SpectralClustering(n_components=10, n_init=100)
    sc.fit(data)
    closest, _ = pairwise_distances_argmin_min(sc.means_, data)

But of course SpectralClustering doesn't have a .means_ attribute.
Thanks for any help on this.
Centroids are used by the KMeans algorithm. For spectral clustering, the estimator only stores the affinity matrix and the labels obtained from the algorithm.
It doesn't matter if Spectral Clustering (or any other clustering algorithm) uses the cluster centers or not!
You can compute the centroid of any cluster! It is the mean of the elements in that cluster (well, there is one constraint: the dataset itself must allow a notion of a mean).
So, compute the clusters using Spectral Clustering. For each cluster, compute the mean of the elements inside it (that is, the per-dimension mean of a cluster comprising m n-dimensional elements).
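A minimal sketch of that idea (random placeholder data; n_clusters=10 is assumed to be what you meant by n_components=10):

    import numpy as np
    from sklearn.cluster import SpectralClustering
    from sklearn.metrics import pairwise_distances_argmin_min

    # Placeholder for your unlabeled data.
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 5))

    sc = SpectralClustering(n_clusters=10, n_init=100, random_state=0)
    labels = sc.fit_predict(data)

    # One centroid per cluster: the mean of the points assigned to that cluster.
    centroids = np.vstack([data[labels == k].mean(axis=0) for k in np.unique(labels)])

    # Index of the sample closest to each centroid, as in the GaussianMixture code above.
    closest, _ = pairwise_distances_argmin_min(centroids, data)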
I am new to both machine learning and Python, and my goal is to experiment with route prediction through clustering.
I've just started using DBSCAN and I was able to obtain results given an array of coordinates as input to the fit procedure, e.g. [[1,1],[2,2],[3,3],...], which includes all coordinates of all routes.
However, what I really want is to provide DBSCAN with a set containing all routes/lines instead of a set containing all coordinates of all routes. Therefore, my question is whether this is possible (does it even make sense?) and if so how can I accomplish this?
Thank you for your time.
Why do you think density-based clustering is a good choice for clustering routes? What notion of density would you use here?
I'd rather try hierarchical clustering with a proper route distance.
But if you have the distance matrix anyway, you can of course just try DBSCAN on it for "free" (computing the distances will be way more expensive than DBSCAN on a distance matrix).
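If you do build such a route-by-route distance matrix, a minimal sketch of running DBSCAN on it (the matrix values and eps here are made up; the distance itself could be e.g. a Hausdorff or Fréchet distance between trajectories):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Hypothetical symmetric (n_routes, n_routes) matrix of pairwise route distances.
    route_dist = np.array([
        [0.0, 0.2, 5.0],
        [0.2, 0.0, 4.8],
        [5.0, 4.8, 0.0],
    ])

    db = DBSCAN(eps=0.5, min_samples=2, metric="precomputed")
    labels = db.fit_predict(route_dist)
    print(labels)     # the first two routes form a cluster, the third is noise (-1)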
I am a bit confused about clustering, e.g. k-means clustering.
I have already created clusters from the training data, and in the testing part I want to know whether the new points belong to one of the existing clusters or not.
My idea is to find the center of each cluster and also the farthest training point in each cluster; then, in the testing part, if the distance of a new point to a center is greater than a threshold (e.g. 1.5x the distance of that cluster's farthest point), it cannot be in the cluster!
Is this idea efficient and correct, and is there a Python function to do this?
One more question:
Could someone help me understand the difference between kmeans.fit() and kmeans.predict()? I get the same result from both functions!
I appreciate any help
In general, when you fit the K-means algorithm, you get the cluster centers as a result.
So, if you want to test which cluster a new point belongs to, you calculate the distance from the point to each cluster center and give the point the label of the closest center.
If you are using the scikit-learn library:
predict(X) predicts the closest cluster each sample in X belongs to.
fit(X) fits the model to the data, i.e. computes the cluster centers (and stores the training assignments in labels_). On the training data, the labels stored by fit and those returned by predict are the same, which is why you see the same result from both.
Here is a nice example of how to use K-means in scikit-learn.
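And a minimal sketch tying fit, predict and the rejection idea from the question together (the data, number of clusters and the 1.5 factor are placeholders; the rejection rule is the question's heuristic, not a built-in scikit-learn feature):

    import numpy as np
    from sklearn.cluster import KMeans

    X_train = np.random.rand(100, 2)        # placeholder training data
    X_new = np.random.rand(5, 2)            # placeholder new points

    km = KMeans(n_clusters=3, n_init=10, random_state=0)
    km.fit(X_train)                         # learns the cluster centers from X_train

    train_labels = km.labels_               # training assignments stored by fit
    new_labels = km.predict(X_new)          # assigns NEW points to the nearest center

    # Distance of each new point to its nearest center.
    new_dist = np.min(km.transform(X_new), axis=1)

    # The question's rule: reject a point if it is farther than 1.5x the radius of
    # its assigned cluster (radius = distance of that cluster's farthest training point).
    train_dist = np.min(km.transform(X_train), axis=1)
    radius = np.array([train_dist[train_labels == k].max() for k in range(km.n_clusters)])
    outside = new_dist > 1.5 * radius[new_labels]
    print(new_labels, outside)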
My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words, so in other words I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and possibly a Python library for it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and obtain which words are clustered together.
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately you need the number of clusters for many algorithms.
DBSCAN is the only one that doesn't need the number of clusters and also uses arbitrary distance matrices.
You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.
The scipy clustering package could be useful (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note however that those require a condensed matrix as input (the upper triangle of the distance matrix). Hopefully the documentation pages will help you along.
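A rough sketch with a tiny made-up matrix (assuming it holds Jaccard distances, i.e. 1 minus the similarity, in dense form; the linkage method and cut height are arbitrary):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform

    # Hypothetical symmetric (n_words, n_words) Jaccard distance matrix.
    dist = np.array([
        [0.0, 0.1, 0.9, 0.8],
        [0.1, 0.0, 0.85, 0.9],
        [0.9, 0.85, 0.0, 0.2],
        [0.8, 0.9, 0.2, 0.0],
    ])

    condensed = squareform(dist)                       # upper triangle as a flat vector
    Z = linkage(condensed, method="average")           # hierarchical clustering
    labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram at 0.5
    print(labels)                                      # e.g. [1 1 2 2]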
I recommend taking a look at agglomerative clustering.