My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words, so in effect I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and, ideally, a Python library for it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and find out which words are clustered together.
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many of them still need the number of clusters.
DBSCAN is the only one that doesn't need the number of clusters and also works with arbitrary distance matrices.
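For example, a minimal sketch with a precomputed matrix (assuming dist is your word-by-word Jaccard distance matrix, i.e. 1 minus the similarity, and that eps and min_samples are placeholders to tune):

import numpy as np
from sklearn.cluster import DBSCAN

# Toy Jaccard *distance* matrix (1 - similarity): symmetric, zero diagonal.
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

db = DBSCAN(eps=0.3, min_samples=2, metric="precomputed")
labels = db.fit_predict(dist)   # label -1 marks noise points
print(labels)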
You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.
The scipy clustering package could be useful (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note, however, that those require a condensed distance matrix as input (the upper triangle of the distance matrix, as produced by scipy.spatial.distance.squareform). Hopefully the documentation pages will help you along.
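As a rough sketch (assuming dist is your full symmetric Jaccard distance matrix; the 0.5 cut-off is only a placeholder):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy symmetric matrix standing in for 1 - Jaccard similarity.
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

condensed = squareform(dist)                       # condensed (upper-triangle) form
Z = linkage(condensed, method="average")           # build the hierarchy
labels = fcluster(Z, t=0.5, criterion="distance")  # cut the dendrogram at distance 0.5
print(labels)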
I recommend taking a look at agglomerative clustering.
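A minimal sketch with scikit-learn's AgglomerativeClustering, which accepts a precomputed distance matrix and, with a distance threshold, does not need the number of clusters up front (the threshold is a placeholder; older scikit-learn versions use affinity="precomputed" instead of metric="precomputed"):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy matrix standing in for 1 - Jaccard similarity.
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

agg = AgglomerativeClustering(
    n_clusters=None,           # let the threshold decide how many clusters there are
    distance_threshold=0.5,    # placeholder cut-off, to be tuned
    metric="precomputed",      # affinity="precomputed" on older versions
    linkage="average",         # "ward" would require raw Euclidean data
)
labels = agg.fit_predict(dist)
print(labels)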
I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
I can find the centroids (the "mean of each mixture component") in sklearn.mixture.GaussianMixture using .means_. In my code I then take the point that is closest to each mean to get a representative sample for each cluster.
I want to do the same thing with SpectralClustering, but I don't see a .means_ attribute or any other way to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works, or just a lack of features in this library.
As an example I would like to do:
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

sc = SpectralClustering(n_components=10, n_init=100)
sc.fit(data)
closest, _ = pairwise_distances_argmin_min(sc.means_, data)  # wishful: no such attribute
But of course SpectralClustering doesn't have a .means_ attribute.
Thanks for any help on this.
Centroids are used by the KMeans algorithm. For spectral clustering, the estimator only stores the affinity matrix and the labels obtained from the algorithm.
It doesn't matter if Spectral Clustering (or any other clustering algorithm) uses the cluster centers or not!
You can compute the centroid of any cluster: it is simply the mean of the elements in that cluster (with the one caveat that the dataset must allow the notion of a mean).
So, compute the clusters using Spectral Clustering. Then, for each cluster, compute the mean of the elements inside it (that is, the mean along every dimension for a cluster made up of m n-dimensional elements).
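A minimal sketch of that approach (synthetic data from make_blobs stands in for the real dataset; the 10 clusters mirror the question):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

# Synthetic stand-in for the real data set.
data, _ = make_blobs(n_samples=300, centers=10, random_state=0)

sc = SpectralClustering(n_clusters=10, n_init=100, random_state=0)
labels = sc.fit_predict(data)

# Centroid of each cluster = mean of its members; the point closest to each
# centroid then serves as that cluster's representative sample.
centroids = np.array([data[labels == k].mean(axis=0) for k in np.unique(labels)])
closest, _ = pairwise_distances_argmin_min(centroids, data)
representatives = data[closest]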
I am new to both machine learning and python and my goal is to experiment with route prediction through clustering.
I've just started using DBSCAN and I was able to obtain results given an array of coordinates as input to the fit procedure, e.g. [[1,1],[2,2],[3,3],...], which includes all coordinates of all routes.
However, what I really want is to provide DBSCAN with a set containing all routes/lines instead of a set containing all coordinates of all routes. Therefore, my question is whether this is possible (does it even make sense?) and if so how can I accomplish this?
Thank you for your time.
Why do you think density based clustering is a good choice for clustering routes? What notion of density would you use here?
I'd rather try hierarchical clustering with a proper route distance.
But if you have the distance matrix anyway, you can of course just try DBSCAN on it for "free" (computing the distances will be way more expensive than DBSCAN on a distance matrix).
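As a sketch of that idea, one possible route distance is the symmetric Hausdorff distance between two polylines (purely illustrative; eps and min_samples are placeholders):

import numpy as np
from scipy.spatial.distance import directed_hausdorff
from sklearn.cluster import DBSCAN

# Each route is a sequence of (x, y) coordinates; toy data for illustration.
routes = [np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]),
          np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]),
          np.array([[10.0, 10.0], [11.0, 11.0]])]

def route_distance(a, b):
    # Symmetric Hausdorff distance between two routes.
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

n = len(routes)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = route_distance(routes[i], routes[j])

labels = DBSCAN(eps=2.0, min_samples=1, metric="precomputed").fit_predict(dist)
print(labels)   # one label per route, not per coordinate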
The linkage matrix for clustering provides the cluster indices and the distance for each step of the clustering hierarchy.
When two clusters are merged, I would like to know which two points were the closest between them. I am using the "single" method, i.e. closest distance.
I know I can do this trivially with an exhaustive search and comparison. Is the information already there after linkage? Is there a smarter way to get this information?
To answer your questions:
No, this information is not available after linkage, at least according to the official SciPy documentation.
The closest pair of points problem is a problem of computational geometry, and can be solved in O(n log n) time by a recursive divide-and-conquer algorithm (note that exhaustive search is quadratic). See this Wikipedia article for more information. Check also this paper by Shamos and Hoey. Note that the original formulation of the problem involves only one set of points; however, the adaptation for two sets is straightforward, and you might find this discussion helpful.
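If you do go the exhaustive route, the search can at least be restricted to the two clusters being merged, since the linkage matrix tells you their members. A rough sketch (the members() helper is just an illustration, not part of SciPy):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.random((10, 2))          # stand-in for the original observations

Z = linkage(X, method="single")
n = X.shape[0]

def members(Z, idx, n):
    # Original point indices belonging to cluster `idx` in the linkage encoding.
    if idx < n:
        return [idx]
    left, right = int(Z[idx - n, 0]), int(Z[idx - n, 1])
    return members(Z, left, n) + members(Z, right, n)

k = len(Z) - 1                   # inspect the last merge, as an example
a = members(Z, int(Z[k, 0]), n)
b = members(Z, int(Z[k, 1]), n)
D = cdist(X[a], X[b])
i, j = np.unravel_index(D.argmin(), D.shape)
print("closest pair:", a[i], b[j], "at distance", D[i, j], "linkage distance:", Z[k, 2])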
I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings
I used a simple distance function to create a distance matrix as input for DBSCAN, which resulted in the expected clusters.
Question: Is it possible to create a plot of these clustering results like in the demo?
I didn't find a solution through searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything in this project, I would appreciate a solution in Python.
I don't use python, so I cannot give you example code.
If your data isn't 2 dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling.
Essentially, it takes an input matrix (which should satisfy the triangle inequality, and ideally be derived from a Euclidean distance in some vector space; but you can often get good results even if this does not strictly hold). It then tries to find the best 2-dimensional data set that has the same distances.
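A minimal sketch of this with scikit-learn's MDS (assuming dist is the precomputed distance matrix you fed to DBSCAN and db_labels are the labels it returned; the toy values are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Toy stand-ins for the real precomputed distances and the DBSCAN labels.
dist = np.array([[0.0, 0.2, 0.9],
                 [0.2, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])
db_labels = np.array([0, 0, 1])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)     # 2-D embedding that approximates the distances

plt.scatter(coords[:, 0], coords[:, 1], c=db_labels)
plt.show()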
I have a dataset of users and their music plays, with every play having location data. For every user i want to cluster their plays to see if they play music in given locations.
I plan on using the scikit-learn k-means package, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, k-medoids aka. PAM for arbitrary other distance functions).
The concept behind k-means is variance minimization, and variance is essentially the same as squared Euclidean distance; it is not the same as other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
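For geographic points specifically, one common approach is DBSCAN with the haversine metric on lat/lon converted to radians; a minimal sketch (coordinates and eps are placeholders):

import numpy as np
from sklearn.cluster import DBSCAN

# Toy (lat, lon) pairs in degrees for one user's plays.
coords = np.array([[52.52, 13.40], [52.53, 13.41], [40.71, -74.00]])

earth_radius_km = 6371.0
eps_km = 1.0                     # treat plays within ~1 km as the same location

db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=2,
            metric="haversine", algorithm="ball_tree")
labels = db.fit_predict(np.radians(coords))   # haversine expects radians
print(labels)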
Is the data already in a vector space, e.g. GPS coordinates? If so, you can cluster on it directly; lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector-space format (a table lookup from locations to coordinates, for instance). Euclidean distance is a good choice for vector-space data.
To answer the question of whether they played music in a given location: first fit your KMeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then check whether any of those cluster centers are close enough to the locations you are checking for. This can be done by thresholding the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but I can provide an example with arbitrary x and y coordinates instead, if that's what you want.
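Something along those lines, as a sketch with made-up x/y coordinates (the threshold and target locations are arbitrary):

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Made-up play locations for one user, plus locations to check against.
plays = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [5.1, 4.9]])
targets = np.array([[1.0, 1.0], [9.0, 9.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(plays)
centers = km.cluster_centers_

# A target counts as "played at" if any cluster center lies within the threshold.
threshold = 0.5
played_at = (cdist(targets, centers) < threshold).any(axis=1)
print(played_at)    # [ True False]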
Also note that KMeans is probably not ideal, as you have to manually set the number of clusters "k", which could vary between people, or write some wrapper code around KMeans to determine "k". There are other clustering models which can determine the number of clusters automatically, such as MeanShift, which may be more suitable here and which also gives you cluster centers.
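A minimal MeanShift sketch along those lines (made-up coordinates; the bandwidth is hand-picked here):

import numpy as np
from sklearn.cluster import MeanShift

# Made-up x/y play locations for one user: two rough "places".
xy = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
               [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])

# Bandwidth chosen by hand; sklearn.cluster.estimate_bandwidth can pick one from the data.
ms = MeanShift(bandwidth=2.0)
labels = ms.fit_predict(xy)

print(labels)                 # no "k" had to be specified
print(ms.cluster_centers_)    # one center ("location") per discovered cluster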