Python Agglomerative Clustering: finding the closest points in clusters

The linkage matrix produced by hierarchical clustering gives, for each merge step of the hierarchy, the indices of the two clusters joined and the distance between them.
When two clusters are merged, I would like to know which two points were the closest in the two clusters. I am using the linkage method "single", i.e. closest distance.
I know I can do this trivially by an exhaustive search and comparison. Is the information already there after linkage? Is there a smarter way to get it?

To answer your questions:
No, this information is not available after linkage, at least according to the official documentation.
The closest pair of points problem is a classic problem of computational geometry, and can be solved in O(n log n) time by a recursive divide-and-conquer algorithm (note that exhaustive search is quadratic). See this Wikipedia article for more information; check also this paper by Shamos and Hoey. Note that the original formulation of the problem involves only one set of points, but the adaptation to two sets (one per cluster) is straightforward; you might find this discussion helpful.
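If you do end up doing the exhaustive comparison, it can at least be vectorized. A minimal sketch using scipy.spatial.distance.cdist, assuming the two merged clusters are available as point arrays (cluster_a and cluster_b are hypothetical placeholders):

import numpy as np
from scipy.spatial.distance import cdist

def closest_pair(cluster_a, cluster_b):
    '''Return the indices and distance of the closest pair of points
    between two clusters (exhaustive, but vectorized).'''
    d = cdist(cluster_a, cluster_b)  # full |A| x |B| distance matrix
    i, j = np.unravel_index(np.argmin(d), d.shape)
    return i, j, d[i, j]

cluster_a = np.array([[0.0, 0.0], [1.0, 1.0]])
cluster_b = np.array([[3.0, 3.0], [1.5, 1.2]])
print(closest_pair(cluster_a, cluster_b))  # -> (1, 1, ~0.539)

For single linkage, the returned distance will match the merge distance recorded in the corresponding row of the linkage matrix.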

Related

DBSCAN plotting Non-geometrical-Data

I used the sklearn clustering algorithm DBSCAN to get clusters of my data.
Data: non-geometrical objects based on hexadecimal strings.
I used a simple distance function to create a distance matrix as input for DBSCAN, which produced the expected clusters.
Question: is it possible to create a plot of these cluster results like in the demo?
I didn't find a solution through searching.
I need to graphically demonstrate the similarities of the objects and clusters to each other.
Since I am using Python for everything in that project, I would appreciate a solution in Python.
I don't use Python, so I cannot give you example code.
If your data isn't 2-dimensional, you can try to find a good 2-dimensional approximation using Multidimensional Scaling (MDS).
Essentially, it takes an input distance matrix (which should satisfy the triangle inequality, and ideally be derived from a Euclidean distance in some vector space, although you can often get good results even when this does not strictly hold) and then tries to find the 2-dimensional data set that best preserves those distances.
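A minimal sketch of this route with scikit-learn, assuming a precomputed distance matrix; the toy data and labels below are stand-ins for the real string distances and DBSCAN output:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

# Stand-in data: two well-separated blobs in 5 dimensions.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, (20, 5)), rng.normal(3, 0.3, (20, 5))])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
labels = np.array([0] * 20 + [1] * 20)  # would come from DBSCAN

# MDS finds 2-D coordinates that approximately preserve the distances.
coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(dist)
plt.scatter(coords[:, 0], coords[:, 1], c=labels)
plt.show()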

Clustering with Specific Sized Groups

Are there any clustering algorithms that focus on forming clusters of a specific size? This can be thought of as a grouping algorithm more than a clustering algorithm.
Basically, given n data points and a fixed group size k, find the optimal assignment of points to groups based upon certain classifiers, one that hopefully minimizes the distance between the classifiers of the points within each group.
This problem seems pretty similar to a clustering problem, but the main difference is that we are concerned with a specific cluster size, not with the number of clusters.
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constraint clustering algorithms, although these usually only support "must link" and "cannot link" constraints, not size constraints.
You should be able to make a similar modification: first specify the group sizes, assign points randomly, then swap cluster members as long as your objective function improves, similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and keep only the best result (a sketch of this heuristic follows after the links below).
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
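A minimal sketch of the swap heuristic described above, assuming Euclidean distance to the group mean as the objective (the function name and parameters are illustrative, not from any library):

import numpy as np

def same_size_clusters(points, k, n_sweeps=50, seed=0):
    '''Start from a random equal-size assignment, then greedily swap
    pairs of points between groups while the total distance of the
    points to their group means improves.'''
    rng = np.random.default_rng(seed)
    n = len(points)
    labels = rng.permutation(np.arange(n) % k)  # equal-size random start

    def cost(lab):
        means = np.array([points[lab == g].mean(axis=0) for g in range(k)])
        return np.sum(np.linalg.norm(points - means[lab], axis=1))

    best = cost(labels)
    for _ in range(n_sweeps):
        improved = False
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    continue
                labels[i], labels[j] = labels[j], labels[i]  # try the swap
                c = cost(labels)
                if c < best:
                    best, improved = c, True
                else:
                    labels[i], labels[j] = labels[j], labels[i]  # undo
        if not improved:
            break
    return labels

points = np.random.rand(12, 2)
print(same_size_clusters(points, k=3))  # three groups of four points

Since swaps preserve the group sizes, the size constraint holds throughout; as noted above, restart with several seeds and keep the best result.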
The problem you are posing is a combinatorial optimization problem. It is very important to know whether you need an exact solution, or whether you can settle for an approximate one.
If you need exact solutions, there is a body of work on clustering under different types of constraints, and the constraint you mention can be encoded in that framework. You should know, however, that this approach only scales up to datasets of a certain size.

Clustering words based on Distance Matrix

My objective is to cluster words based on how similar they are with respect to a corpus of text documents. I have computed the Jaccard similarity between every pair of words, so in other words, I have a sparse distance matrix available. Can anyone point me to a clustering algorithm (and possibly a Python library implementing it) that takes a distance matrix as input? I also do not know the number of clusters beforehand. I only want to cluster these words and find out which words are clustered together.
You can use most algorithms in scikit-learn with a precomputed distance matrix. Unfortunately, many of them require the number of clusters.
DBSCAN is the only one that doesn't need the number of clusters and also accepts arbitrary distance matrices.
You could also try MeanShift, but that will interpret the distances as coordinates - which might also work.
There is also affinity propagation, but I haven't really seen that working well. If you want many clusters, that might be helpful, though.
disclosure: I'm a scikit-learn core dev.
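A minimal sketch of the precomputed-matrix route with DBSCAN, where the matrix and the eps value are placeholders you would derive from your Jaccard similarities (e.g. distance = 1 - similarity):

import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder: symmetric (n x n) distance matrix, e.g. 1 - Jaccard similarity.
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

labels = DBSCAN(eps=0.3, min_samples=2, metric='precomputed').fit_predict(dist)
print(labels)  # -> [0 0 -1]; -1 marks noise points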
The scipy clustering package could be useful (scipy.cluster). There are hierarchical clustering functions in scipy.cluster.hierarchy. Note, however, that those require a condensed distance matrix as input (the upper triangle of the square distance matrix, flattened). Hopefully the documentation pages will help you along.
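For the scipy route, scipy.spatial.distance.squareform converts a square distance matrix into the condensed form that linkage expects; a sketch with the same kind of placeholder matrix:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

condensed = squareform(dist)  # upper triangle as a flat vector
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=0.5, criterion='distance')  # cut the tree at 0.5
print(labels)  # -> [1 1 2]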
I recommend taking a look at agglomerative clustering.

Scipy clustering: which method to use in fcluster for simple grouping?

There are a myriad of options in the scipy clustering module, and I'd like to be sure that I'm using them correctly. I have a symmetric distance matrix DR, and I'd like to find all clusters such that any point in a cluster has a neighbor at a distance of no more than 1.2.

from scipy.cluster.hierarchy import linkage, fcluster

L = linkage(DR, method='single')
F = fcluster(L, 1.2)
In linkage, I'm pretty sure single is what I want (the Nearest Point Algorithm). However for fcluster, I think I want the default, ‘inconsistent’, method:
‘inconsistent’: If a cluster node and all its descendants have an inconsistent value less than or equal to t then all its leaf descendants belong to the same flat cluster. When no non-singleton cluster meets this criterion, every node is assigned to its own cluster. (Default)
But maybe it's the ‘distance’ method:
‘distance’: Forms flat clusters so that the original observations in each flat cluster have no greater a cophenetic distance than t.
... I'm not sure. Which one should I use? And what does "cophenetic distance" mean in this context?
You might want to look at DBSCAN; see the Wikipedia article on it. It looks like you are looking for the output of DBSCAN with minPts=1 and epsilon=1.2.
It's fairly simple to implement judging from the pseudocode on Wikipedia, in particular since you already have a distance matrix. Just do it yourself.
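Since minPts=1 makes every point a core point, DBSCAN's result here is just the connected components of the graph that links points at distance <= 1.2, which scipy can compute directly. A sketch, with a placeholder DR:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Placeholder symmetric distance matrix.
DR = np.array([[0.0, 1.0, 5.0],
               [1.0, 0.0, 5.0],
               [5.0, 5.0, 0.0]])

adjacency = csr_matrix(DR <= 1.2)  # edge wherever points are close enough
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)  # -> 2 [0 0 1]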

Python Implementation of OPTICS (Clustering) Algorithm

I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be only rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2), or more likely even O(n^3).
OPTICS has a number of tricky parts besides the obvious idea. In particular, the thresholding is proposed to be done with relative thresholds ("xi") instead of the absolute thresholds posted here (at which point the result will be approximately that of DBSCAN!).
The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:
http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf
The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters, it only computes the cluster order. For this it makes a duplicate of the database - it isn't really Weka code.
There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.
EDIT: the following is known to not be a complete implementation of OPTICS.
I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.
Here is a quick example of how to build clusters on the output of the OPTICS algorithm:
def cluster(order, distance, points, threshold):
    ''' Given the output of the OPTICS algorithm,
    compute the clusters:

    :param order: the cluster order of the points
    :param distance: the reachability distances of the points
    :param points: the actual points
    :param threshold: the threshold value to cluster on
    :returns: a list of cluster groups
    '''
    clusters = [[]]
    points = sorted(zip(order, distance, points))
    # A point joins the current cluster while its reachability distance
    # stays at or below the threshold; a larger value starts a new cluster.
    splits = ((v <= threshold, p) for i, v, p in points)
    for iscluster, point in splits:
        if iscluster:
            clusters[-1].append(point)
        elif len(clusters[-1]) > 0:
            clusters.append([])
    return clusters

rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan. This is equivalent to OPTICS with an infinite maximal epsilon, combined with a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can also extract clusters from it via more traditional OPTICS methods if you prefer.
Note that despite not limiting the epsilon parameter this implementation still achieves O(n log(n)) performance using kd-tree and ball-tree based minimal spanning tree algorithms, and can handle quite large datasets.
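A minimal usage sketch of that library; the parameter value is illustrative:

import numpy as np
import hdbscan

points = np.random.rand(100, 2)  # (x, y) pairs

clusterer = hdbscan.HDBSCAN(min_cluster_size=5)
labels = clusterer.fit_predict(points)  # -1 marks noise

# Group the points by cluster label, as the question asks for.
clusters = [points[labels == c] for c in set(labels) if c != -1]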
There is now the library pyclustering, which contains, amongst others, Python and C++ implementations of OPTICS.
OPTICS is now also implemented in the development version (v0.21.dev0) of scikit-learn (a clustering and machine learning module for Python).
here is the link:
https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
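A minimal usage sketch (parameter value illustrative; requires scikit-learn >= 0.21):

import numpy as np
from sklearn.cluster import OPTICS

points = np.random.rand(100, 2)  # (x, y) pairs

labels = OPTICS(min_samples=5).fit_predict(points)
clusters = [points[labels == c] for c in set(labels) if c != -1]  # -1 = noise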
See "Density-based clustering approaches" on
http://www.chemometria.us.edu.pl/index.php?goto=downloads
Consider a space-filling curve or a spatial index: an SFC reduces the 2-dimensional problem to a 1-dimensional one. Have a look at Nick's Hilbert curve quadtree spatial index blog, and at my implementation of an SFC at phpclasses.org (hilbert-curve).
