Scipy clustering: which method to use in fcluster for simple grouping?

There are a myriad of options in the scipy clustering module, and I'd like to be sure that I'm using them correctly. I have a symmetric distance matrix DR, and I'd like to find all clusters such that any point in the cluster has a neighbor at a distance of no more than 1.2.
from scipy.cluster.hierarchy import linkage, fcluster

L = linkage(DR, method='single')
F = fcluster(L, 1.2)
In linkage, I'm pretty sure single is what I want (the Nearest Point Algorithm). However, for fcluster, I think I want the default ‘inconsistent’ criterion:
‘inconsistent’: If a cluster node and all its descendants have an inconsistent value less than or equal to t then all its leaf descendants belong to the same flat cluster. When no non-singleton cluster meets this criterion, every node is assigned to its own cluster. (Default)
But maybe it's the ‘distance’ criterion:
‘distance’: Forms flat clusters so that the original observations in each flat cluster have no greater a cophenetic distance than t.
... I'm not sure which one to use. What does cophenetic distance mean in this context?
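For reference (not asserting which is correct for this case), continuing from the snippet above, the two candidate criteria are selected explicitly like this:

import numpy as np

# Default: cut where the inconsistency coefficient exceeds 1.2
F_inconsistent = fcluster(L, 1.2, criterion='inconsistent')

# Alternative: cut purely on cophenetic (merge) distance 1.2
F_distance = fcluster(L, 1.2, criterion='distance')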

You might want to look at DBSCAN. See the Wikipedia article on it. It looks like you are looking for the output of DBSCAN with minPts=1 and epsilon=1.2.
It's fairly simple to implement yourself judging from the pseudocode on Wikipedia, in particular since you already seem to have a distance matrix. Just do it yourself.
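A minimal sketch of that suggestion using scikit-learn's DBSCAN with a precomputed distance matrix (assuming DR from the question is a full square matrix; eps and min_samples mirror epsilon=1.2 and minPts=1):

from sklearn.cluster import DBSCAN

# 'precomputed' tells DBSCAN to treat DR as pairwise distances rather than features
db = DBSCAN(eps=1.2, min_samples=1, metric='precomputed').fit(DR)
labels = db.labels_  # cluster id per point; with min_samples=1 there are no noise points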

Related

Efficient method for counting number of data points inside sphere of fixed radius centered on each data point

I have a database with many data points, each with an x,y,z coordinate. I want to count the number of points that are within a certain distance of neighboring points. Some points will have a pair that is within a radius R; others will not. I simply want to count the number of pairs within some distance. I could easily write an algorithm to do this, but it would not be efficient enough (since I would iterate over every single data point).
This seems like something that must already exist in astropy, scipy, etc. but I cannot seem to find what I am looking for. Is there anything out there that accomplishes this?
As mentioned by @Davis Herring in the comments, an efficient option is a k-d tree.
The k-d tree is an algorithm that avoids the brute-force approach and allows for efficient distance computations* (see bottom of answer for background).
There are several Python implementations of this, one of which is through SciPy:
SciPy k-d tree in Cython (faster since it uses C/Cython)
SciPy k-d tree in pure Python
You can use this by first constructing a k-d tree for your xyz data:
import numpy as np #for later code
from scipy.spatial import cKDTree
kdtree = cKDTree(xyzData)
Then, you query the k-d tree with a point, point, to compute the distance between point and its nearest neighbor. The output of this query is the distance NN_dist between point and its nearest neighbor, and the index NN_idx of that neighbor. To compute this for all of your points, we need a for loop, but thanks to the k-d tree this is still much faster than a brute-force computation:
numPoints = len(xyzData)
NN_dists = np.zeros(numPoints)  # pre-allocate an array to store distances
for i in range(numPoints):
    point = xyzData[i]
    NN_dist, NN_idx = kdtree.query(point, k=[1])
    # Note: 'k' specifies which kth-neighbor distance to compute,
    # so query k=[2] if you end up finding the point as its own "neighbor":
    if NN_dist[0] == 0:
        NN_dist, NN_idx = kdtree.query(point, k=[2])
    NN_dists[i] = NN_dist[0]
(see k-d tree query for more details).
Then, to find the distances that are below some threshold, you can use NumPy's boolean comparison operators (like <) to build a mask:
distanceThres = 10
goodIdx = NN_dists < distanceThres
goodPoints = xyzData[goodIdx]
This will give you a boolean mask goodIdx and the points goodPoints that are within your specified distance threshold distanceThres (though you may have to adapt this code to the shape/format of your xyz coordinate data).
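As a possible shortcut (my addition, not part of the original answer): cKDTree.query also accepts an array of points, so the loop can be replaced by a single vectorized call. Asking for k=2 neighbors and keeping the second column skips each point's zero-distance match with itself:

# Query all points at once; column 0 is each point matched with itself (distance 0)
dists, idxs = kdtree.query(xyzData, k=2)
NN_dists = dists[:, 1]  # distance from each point to its nearest *other* neighbor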
*A light background on k-d trees (glossing over fine details -- see references for more): the k-d tree method partitions a dataset in such a way that avoids computing the distance between every single point (i.e., the brute force method). It does this by dividing the dataset into binary space partitions to construct a k-d tree. These partitions are such that a distance computation (e.g., a nearest-neighbor search) can ignore datapoints that are in distant partitions. Additionally, this same k-d tree is reused for each point.
There are a lot of resources online about k-d trees in general. I found these references most helpful when I was learning about this algorithm: Stanford k-d trees or Princeton k-d trees.
Let me know if you have questions -- I had this exact problem myself during an astronomy project, so I may be able to help more.
I don't have direct experience with it but scipy.spatial.distance.pdist may be what you're looking for.
This link may be helpful as well. It gives an example of how to solve the problem as I understand it.
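A minimal sketch of the pdist route for the counting question, assuming xyzData is an (N, 3) coordinate array and R is the radius from the question (note that pdist builds the full condensed distance vector, so it only scales to moderate N):

import numpy as np
from scipy.spatial.distance import pdist

d = pdist(xyzData)                    # condensed vector of all pairwise distances
num_close_pairs = int(np.sum(d < R))  # number of unordered pairs closer than R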

Finding cluster centroid or ".means_" with sklearn.cluster.SpectralClustering

I have an unlabeled data set that I am trying to cluster with a variety of clustering algorithms.
I am successful in being able to find the centroids/"mean of each mixture component" in sklearn.mixture.GaussianMixture using .means_. In my code I am then taking the point that is closest to the means to get a representative sample at each cluster.
I want to do this same thing with SpectralClustering, but I don't see a .means_ attribute or any other way to get the centroid of each cluster. This may be a result of my misunderstanding of how spectral clustering works, or just a lack of features in this library.
As an example I would like to do:
sc = SpectralClustering(n_components=10, n_init=100)
sc.fit(data)
closest, _ = pairwise_distances_argmin_min(sc.means_, data)
But of course SpectralClustering doesn't have a .means_ attribute.
Thanks for any help on this.
Centroids are used by the KMeans algorithm. For spectral clustering, the algorithm only stores the affinity matrix and the labels it produces.
It doesn't matter if Spectral Clustering (or any other clustering algorithm) uses the cluster centers or not!
You can compute the centroid of any cluster! It is the mean of the elements in that cluster (well, there actually is a constraint: the dataset itself must allow the notion of a mean).
So, compute the clusters using Spectral Clustering. For each cluster, compute the mean of the elements inside it (that is, the mean along every dimension for a cluster comprised of m n-dimensional elements).
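A minimal sketch of that recipe, assuming data is a numeric (n_samples, n_features) array; note that n_clusters is used here rather than the n_components from the question's snippet:

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import pairwise_distances_argmin_min

sc = SpectralClustering(n_clusters=10, n_init=100)
labels = sc.fit_predict(data)

# Centroid of each cluster = mean of its members, computed per dimension
centroids = np.array([data[labels == k].mean(axis=0) for k in np.unique(labels)])

# Sample closest to each centroid, mirroring the GaussianMixture workflow
closest, _ = pairwise_distances_argmin_min(centroids, data)
representatives = data[closest]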

Python Agglomerative Clustering : finding the closest points in clusters

The linkage matrix for clustering provides the cluster indices and the merge distance for each step of the clustering hierarchy.
When two clusters are merged, I would like to know which two points were the closest in the clusters. I am using the linkage method "single", i.e. closest distance.
I know I can do this trivially by an exhaustive search and comparison. Is the information already there after linkage ? Is there a smarter way to get this information?
To answer your questions:
No, this information is not available after linkage, at least according to the official Python documentation.
The closest pair of points problem is a problem of computational geometry, and can be solved in O(n log n) time by a recursive divide and conquer algorithm (note that exhaustive search is quadratic). See this Wikipedia article for more information. Check also this paper by Shamos and Hoey. Note that the original formulation of the problem involves only one set of points. However, adaptation for two sets is straightforward; you might find this discussion helpful.
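For completeness, the exhaustive comparison the question mentions is short to write with SciPy. A sketch, where cluster_a and cluster_b stand for the member coordinates of the two merged clusters:

import numpy as np
from scipy.spatial.distance import cdist

def closest_pair(cluster_a, cluster_b):
    """Indices (i, j) into cluster_a/cluster_b of the closest pair, plus their distance."""
    D = cdist(cluster_a, cluster_b)               # full pairwise distance matrix
    i, j = np.unravel_index(np.argmin(D), D.shape)
    return i, j, D[i, j]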

Computing K-means clustering on Location data in Python

I have a dataset of users and their music plays, with every play having location data. For every user, I want to cluster their plays to see whether they play music in particular locations.
I plan on using the scikit-learn k-means package, but how do I get this to work with location data, as opposed to its default, Euclidean distance?
An example of it working would really help me!
Don't use k-means with anything other than Euclidean distance.
K-means is not designed to work with other distance metrics (see k-medians for Manhattan distance, k-medoids aka. PAM for arbitrary other distance functions).
The concept of k-means is variance minimization. And variance is essentially the same as squared Euclidean distances, but it is not the same as other distances.
Have you considered DBSCAN? sklearn should have DBSCAN, and it should by now have index support to make it fast.
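If the locations are lat/lon pairs, one way to follow that suggestion (a sketch of my own, not from the original answer) is DBSCAN with the haversine metric, which expects coordinates in radians and a radius scaled by the Earth's radius:

import numpy as np
from sklearn.cluster import DBSCAN

# Made-up (lat, lon) plays for one user; values are purely illustrative
coords_deg = np.array([[53.34, -6.26], [53.35, -6.25], [51.50, -0.12]])

eps_km = 1.0  # assumed cluster radius of about 1 km
db = DBSCAN(eps=eps_km / 6371.0, min_samples=2,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(np.radians(coords_deg))  # -1 marks noise points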
Is the data already in a vector space, e.g. GPS coordinates? If so, you can cluster on it directly; lat and lon are close enough to x and y that it shouldn't matter much. If not, preprocessing will have to be applied to convert it to a vector-space format (a table lookup of locations to coords, for instance). Euclidean distance is a good choice for vector-space data.
To answer the question of whether they played music in a given location, you first fit your kmeans model on their location data, then find the "locations" of their clusters using the cluster_centers_ attribute. Then you check whether any of those cluster centers are close enough to the locations you are checking for. This can be done using thresholding on the distance functions in scipy.spatial.distance.
It's a little difficult to provide a full example since I don't have the dataset, but I can provide an example given arbitrary x and y coords instead if that's what you want (see the sketch below).
Also note that KMeans is probably not ideal, as you have to manually set the number of clusters "k", which could vary between people, or add wrapper code around KMeans to determine "k". There are other clustering models which can determine the number of clusters automatically, such as mean shift, which may be a better fit in this case and can also tell you the cluster centers.
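In that spirit, a minimal sketch with arbitrary coordinates (the points, the number of clusters, and the threshold are all made up for illustration):

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Made-up (lat, lon) plays for one user
plays = np.array([[53.34, -6.26], [53.35, -6.25],
                  [51.50, -0.12], [51.51, -0.13]])

kmeans = KMeans(n_clusters=2, n_init=10).fit(plays)
centers = kmeans.cluster_centers_

# Is any cluster center within a threshold of a location of interest?
venue = np.array([[53.349, -6.260]])        # hypothetical location to check
print(cdist(centers, venue).min() < 0.05)   # threshold in degrees (crude for lat/lon, as noted above)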

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be just rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or more likely even O(n^3).
OPTICS has a number of tricky things besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of absolute thresholds as posted here (at which point the result will be approximately that of DBSCAN!).
The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:
http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf
The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters, it only computes the cluster order. For this it makes a duplicate of the database - it isn't really Weka code.
There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.
EDIT: the following is known to not be a complete implementation of OPTICS.
I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.
Here is a quick example of how to build clusters on the output of the optics algorithm:
def cluster(order, distance, points, threshold):
    ''' Given the output of the optics algorithm,
    compute the clusters:

    :param order: The order of the points
    :param distance: The relative distances of the points
    :param points: The actual points
    :param threshold: The threshold value to cluster on
    :returns: A list of cluster groups
    '''
    clusters = [[]]
    points = sorted(zip(order, distance, points))
    splits = ((v > threshold, p) for i, v, p in points)
    for iscluster, point in splits:
        if iscluster:
            clusters[-1].append(point)
        elif len(clusters[-1]) > 0:
            clusters.append([])
    return clusters

rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan . This is equivalent to OPTICS with an infinite maximal epsilon, and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can extract clusters from it via more traditional OPTICS methods as well if you would prefer.
Note that despite not limiting the epsilon parameter this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimal spanning tree algorithms, and can handle quite large datasets.
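A minimal usage sketch for that package (assuming the package name hdbscan and the API as of the linked repository):

import numpy as np
import hdbscan

points = np.random.rand(100, 2)  # (x, y) pairs, as in the question
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(points)

# Group points by label into the requested list-of-clusters format (-1 = noise)
clusters = [points[labels == k] for k in set(labels) if k != -1]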
There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.
It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and machine learning module for Python).
Here is the link:
https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
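A minimal sketch of that estimator (the parameters here are illustrative, not tuned):

import numpy as np
from sklearn.cluster import OPTICS

points = np.random.rand(100, 2)  # (x, y) pairs, as in the question
labels = OPTICS(min_samples=5).fit_predict(points)

# -1 marks points the extraction left unclustered
clusters = [points[labels == k] for k in set(labels) if k != -1]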
See "Density-based clustering approaches" on
http://www.chemometria.us.edu.pl/index.php?goto=downloads
You might want to look at a space-filling curve or a spatial index. An SFC reduces the 2D complexity to a 1D complexity. Have a look at Nick's hilbert curve quadtree spatial index blog, and you can download my implementation of an SFC at phpclasses.org (hilbert-curve).
