I'm looking for a decent implementation of the OPTICS algorithm in Python. I will use it to form density-based clusters of points ((x,y) pairs).
I'm looking for something that takes in (x,y) pairs and outputs a list of clusters, where each cluster in the list contains a list of (x, y) pairs belonging to that cluster.
I'm not aware of a complete and exact Python implementation of OPTICS. The links posted here seem to be only rough approximations of the OPTICS idea. They also do not use an index for acceleration, so they will run in O(n^2) or, more likely, even O(n^3).
OPTICS has a number of tricky parts beyond the obvious idea. In particular, the thresholding is supposed to be done with relative thresholds ("xi") rather than the absolute thresholds posted here (at which point the result will be approximately that of DBSCAN!).
The original OPTICS paper contains a suggested approach to converting the algorithm's output into actual clusters:
http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/OPTICS.pdf
The OPTICS implementation in Weka is essentially unmaintained and just as incomplete. It doesn't actually produce clusters; it only computes the cluster order, and for that it makes a duplicate of the database, so it isn't really Weka code.
There seems to be a rather extensive implementation available in ELKI in Java by the group that published OPTICS in the first place. You might want to test any other implementation against this "official" version.
EDIT: the following is known to not be a complete implementation of OPTICS.
I did a quick search and found the following (Optics). I can't vouch for its quality; however, the algorithm seems pretty simple, so you should be able to validate/adapt it quickly.
Here is a quick example of how to build clusters on the output of the optics algorithm:
def cluster(order, distance, points, threshold):
    ''' Given the output of the optics algorithm,
    compute the clusters:

    :param order: The order of the points
    :param distance: The relative distances of the points
    :param points: The actual points
    :param threshold: The threshold value to cluster on
    :returns: A list of cluster groups
    '''
    clusters = [[]]
    points = sorted(zip(order, distance, points))
    # A point whose value exceeds the threshold is added to the current cluster;
    # note that if `distance` holds reachability distances you may instead want
    # `v <= threshold`, so that low-reachability points form the cluster.
    splits = ((v > threshold, p) for i, v, p in points)
    for iscluster, point in splits:
        if iscluster:
            clusters[-1].append(point)
        elif len(clusters[-1]) > 0:
            clusters.append([])
    return clusters
rd, cd, order = optics(points, 4)
print(cluster(order, rd, points, 38.0))
While not technically OPTICS, there is an HDBSCAN* implementation for Python available at https://github.com/lmcinnes/hdbscan. This is equivalent to OPTICS with an infinite maximal epsilon and a different cluster extraction method. Since the implementation provides access to the generated cluster hierarchy, you can also extract clusters from it via more traditional OPTICS methods if you prefer.
Note that despite not limiting the epsilon parameter, this implementation still achieves O(n log n) performance using kd-tree and ball-tree based minimum spanning tree algorithms, and can handle quite large datasets.
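For completeness, here is a minimal sketch of how that package can be used to go from (x, y) pairs to a list of clusters; the min_cluster_size value and the random data are only illustrative:

import numpy as np
import hdbscan

points = np.random.rand(200, 2)                 # stand-in (x, y) data
labels = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(points)

# Group points by label; label -1 marks noise and is skipped
clusters = [points[labels == c].tolist() for c in sorted(set(labels)) if c != -1]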
There now exists the library pyclustering that contains, amongst others, a Python and a C++ implementation of OPTICS.
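A rough sketch of how its OPTICS class is typically used (check the pyclustering documentation for the exact constructor arguments; the eps/minpts values here are arbitrary):

from pyclustering.cluster.optics import optics

sample = [[1.0, 1.0], [1.1, 0.9], [5.0, 5.2], [5.1, 5.0]]  # (x, y) pairs
optics_instance = optics(sample, 0.5, 2)   # radius (eps) and minimum neighbours
optics_instance.process()
clusters = optics_instance.get_clusters()  # lists of point indices per cluster
cluster_points = [[sample[i] for i in c] for c in clusters]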
It is now implemented in the development version (scikit-learn v0.21.dev0) of sklearn (a clustering and machine learning module for Python).
here is the link:
https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html
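A minimal sketch with that estimator; min_samples and xi here are illustrative, not recommendations, and the last line produces the list-of-(x, y)-clusters format asked for in the question:

import numpy as np
from sklearn.cluster import OPTICS

X = np.random.rand(300, 2)                      # stand-in (x, y) data
labels = OPTICS(min_samples=5, xi=0.05).fit_predict(X)

# Label -1 marks noise; every other label is one cluster of (x, y) pairs
clusters = [X[labels == c].tolist() for c in sorted(set(labels)) if c != -1]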
See "Density-based clustering approaches" on
http://www.chemometria.us.edu.pl/index.php?goto=downloads
You want to look at a space-filling curve or a spatial index. An SFC reduces the 2-d problem to a 1-d one. You want to look at Nick's Hilbert curve quadtree spatial index blog. You can also download my implementation of an SFC at phpclasses.org (hilbert-curve).
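The answer above refers to a Hilbert curve; as a simpler stand-in, here is a sketch using a Z-order (Morton) key, which likewise maps 2-d points onto a 1-d ordering in which nearby points tend to stay close (the 16-bit grid resolution is arbitrary):

def morton_key(x, y, bits=16):
    # Interleave the bits of the integer grid coordinates x and y
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Scale points to an integer grid, then sort them along the 1-d curve
points = [(0.10, 0.70), (0.80, 0.20), (0.11, 0.69)]
scale = (1 << 16) - 1
ordered = sorted(points, key=lambda p: morton_key(int(p[0] * scale), int(p[1] * scale)))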
Hi, I am new to Python and trying to figure out the questions below. I really appreciate any help. Thank you.
How to get intracluster and intercluster distances in kmeans using python?
How to verify the quality of clusters? Any measures to check the goodness of clusters formed?
Is there a way to find out which factors/variables are the most significant features affecting the clustering (feature extraction/selection)?
I tried this for question 1 above; is this the correct approach?
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# Upper triangle of the centroid distance matrix (4 = number of clusters in km)
dists = euclidean_distances(km.cluster_centers_)
tri_dists = dists[np.triu_indices(4, 1)]
max_dist, avg_dist, min_dist = tri_dists.max(), tri_dists.mean(), tri_dists.min()
print(max_dist, avg_dist, min_dist)
Avoid putting multiple questions into one.
K-means does not compute all these distances. Otherwise it would need O(n²) time and memory, which would be much slower! It uses a special property of variance (another reason why it does not optimize distances other than sum-of-squares) known as the Koenig-Huygens theorem.
Yes, there have been over 20, probably even 100, such quality measures proposed in the literature. But that does not make it much easier to pick the "best" clustering: in the end, clusters are subjective to the user (see the sketch at the end of this answer for two measures that ship with scikit-learn).
Yes, you can apply various techniques ranging from variance analysis to factor analysis to random forests.
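To make the second point concrete, here is a small sketch with two of the internal quality measures that scikit-learn ships; the data and k-means settings are placeholders:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))          # higher is better, in [-1, 1]
print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # lower is better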
I have a database with many data points, each with an x, y, z coordinate. I want to count the number of points that are within a certain distance of neighboring points. Some points will have a pair that is within a radius R, others will not. I simply want to count the number of pairs within some distance. I could easily write an algorithm to do this, but it would not be efficient enough (since I would have to compare every pair of data points).
This seems like something that must already exist in astropy, scipy, etc. but I cannot seem to find what I am looking for. Is there anything out there that accomplishes this?
As mentioned by @Davis Herring in the comments, an efficient option is a k-d tree.
The k-d tree is an algorithm that avoids the brute-force approach and allows for efficient distance computations* (see bottom of answer for background).
There are several Python implementations of this, one of which is through SciPy:
SciPy k-d tree in Cython (faster since it uses C/Cython)
SciPy k-d tree in pure Python
You can use this by first constructing a k-d tree for your xyz data:
import numpy as np #for later code
from scipy.spatial import cKDTree
kdtree = cKDTree(xyzData)
Then, you must query the k-d tree with each point (called point below) to compute the distance between that point and its nearest neighbor. The output of this query is the distance NN_dist to the nearest neighbor and the index NN_idx of that neighbor. To compute this for all of your points we need a for loop, but thanks to the k-d tree this is much faster than a brute-force computation:
NN_dists = np.zeros(numPoints)  # pre-allocate an array to store distances
for i in range(numPoints):
    point = xyzData[i]
    NN_dist, NN_idx = kdtree.query(point, k=[1])
    # Note: 'k' specifies the kth neighbor distance to compute,
    # so set k=2 if you end up finding the point as its own "neighbor":
    if NN_dist == 0:
        NN_dist, NN_idx = kdtree.query(point, k=[2])
    NN_dists[i] = NN_dist
(see k-d tree query for more details).
Then, to find the distances that are below some threshold, you could use the built-in utility of NumPy arrays when using comparison operators (like <):
distanceThres = 10
goodIdx = NN_dists < distanceThres
goodPoints = xyzData[goodIdx]
This will give you the indices goodIdx and points goodPoints that are within your specified distance threshold distanceThres (though you may have to change this code depending on the shape/format of your xyz coordinate data).
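As an aside (this goes beyond the steps above): if the end goal is only to count how many pairs of points lie within some radius of each other, cKDTree can also do that directly, without the per-point loop:

pairs = kdtree.query_pairs(r=distanceThres)  # set of (i, j) index pairs within that distance
print(len(pairs))                            # number of such pairs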
*A light background on k-d trees (glossing over fine details -- see references for more): the k-d tree method partitions a dataset in such a way that avoids computing the distance between every single pair of points (i.e., the brute-force method). It does this by dividing the dataset into binary space partitions to construct a k-d tree. These partitions are such that a distance computation (e.g., a nearest-neighbor search) can ignore data points that are in distant partitions. Additionally, this same k-d tree is reused for each point.
There are a lot of resources online about k-d trees in general. I found these references most helpful when I was learning about this algorithm: Stanford k-d trees or Princeton k-d trees.
Let me know if you have questions -- I had this exact problem myself during an astronomy project, so I may be able to help more.
I don't have direct experience with it but scipy.spatial.distance.pdist may be what you're looking for.
This link may be helpful as well. It gives an example of how to solve the problem as I understand it.
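A small sketch of that idea, with the caveat that pdist materialises all n(n-1)/2 pairwise distances, so it is only practical for moderately sized datasets (the data and radius here are placeholders):

import numpy as np
from scipy.spatial.distance import pdist

xyz = np.random.rand(1000, 3)   # stand-in (x, y, z) data
R = 0.1
d = pdist(xyz)                  # condensed vector of all pairwise distances
print(int((d < R).sum()))       # number of pairs closer than R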
The linkage matrix for clustering provides the cluster indices and the distance for each step of the clustering hierarchy.
When two clusters are merged, I would like to know which two points were the closest in the two clusters. I am using the "single" linkage method, i.e., closest distance.
I know I can do this trivially by an exhaustive search and comparison. Is the information already there after linkage ? Is there a smarter way to get this information?
To answer your questions:
No, this information is not available after linkage, at least according to the official Python documentation.
The closest pair of points problem is a problem of computational geometry and can be solved in O(n log n) time by a recursive divide-and-conquer algorithm (note that exhaustive search is quadratic). See this Wikipedia article for more information. Check also this paper by Shamos and Hoey. Note that the original formulation of the problem involves only one set of points. However, the adaptation to two sets is straightforward; you might find this discussion helpful.
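If a brute-force check is acceptable in practice, here is a short sketch of the exhaustive approach for the two merged clusters (the member arrays are placeholders; this is not information returned by linkage itself):

import numpy as np
from scipy.spatial.distance import cdist

cluster_a = np.random.rand(50, 2)   # stand-in member points of one merged cluster
cluster_b = np.random.rand(40, 2)   # stand-in member points of the other

D = cdist(cluster_a, cluster_b)                 # all cross-cluster distances
i, j = np.unravel_index(np.argmin(D), D.shape)  # indices of the closest pair
print(cluster_a[i], cluster_b[j], D[i, j])      # the single-linkage pair and its distance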
Are there any types of clustering algorithms that focus on forming clusters of a specific size? This can be thought of as a grouping algorithm more than a clustering algorithm.
Basically, given n data points, and fixed groups of a certain size k, find the optimal distribution of points to sets based upon certain classifiers, that will hopefully minimize the distance of classifiers for each point in a given group.
This problem seems to be pretty similar to a clustering problem, but the main difference is that we are concerned with a specific cluster size, but not concerned about the number of clusters.
There is a tutorial on how to implement such an algorithm in ELKI:
http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans
Also have a look at constrained clustering algorithms; usually, though, these algorithms only support "must link" and "cannot link" constraints, not size constraints.
You should be able to do a similar modification where you first specify the group sizes, then assign points randomly, and swap cluster members as long as your objective function improves; similar to k-means / k-medoids. As you may get stuck in local minima, restart a number of times and only keep the best.
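A rough sketch of that swap heuristic in plain NumPy (this is not the ELKI tutorial code; the greedy O(n²) swap sweep is only meant to illustrate the idea, and restarts are left to the caller):

import numpy as np

def same_size_swap(X, k, n_sweeps=50, seed=0):
    # Balanced random initial assignment: cluster sizes stay fixed from here on
    rng = np.random.default_rng(seed)
    n = len(X)
    labels = np.repeat(np.arange(k), int(np.ceil(n / k)))[:n]
    rng.shuffle(labels)
    for _ in range(n_sweeps):
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        improved = False
        # Swap the labels of any pair of points if that lowers the total
        # point-to-centroid distance; sizes are preserved by construction
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    continue
                old = dists[i, labels[i]] + dists[j, labels[j]]
                new = dists[i, labels[j]] + dists[j, labels[i]]
                if new < old:
                    labels[i], labels[j] = labels[j], labels[i]
                    improved = True
        if not improved:
            break
    return labels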
See also earlier questions, e.g.
K-means algorithm variation with equal cluster size
and
Group n points in k clusters of equal size
The problem that you are posing is a combinatorial optimization problem. It is very important to know whether you need an exact solution or whether you can settle for an approximate one.
If you need exact solutions, there is a body of work that focuses on clustering with different types of constraints, and the constraint that you mentioned can be encoded in this framework. However, you should know that this approach only scales to datasets up to a certain size.
I am interested in performing k-means clustering on a list of words, with the distance measure being Levenshtein.
1) I know there are a lot of frameworks out there, including SciPy and Orange, that have a k-means implementation. However, they all require some sort of vector as the data, which doesn't really fit my case.
2) I need a good clustering implementation. I looked at python-clustering and realized that it a) doesn't return the sum of all the distances to each centroid, and b) doesn't have any sort of iteration limit or cut-off to ensure the quality of the clustering. python-clustering and the clustering algorithm on daniweb don't really work for me.
Can someone find me a good lib? Google hasn't been my friend.
Yeah, I think there isn't a good implementation of what I need.
I have some crazy requirements, like distance caching, etc.
So I think I will just write my own lib and release it as GPLv3 soon.
Not really an answer to your specific question, but I recommend glancing at "Programming Collective Intelligence". At the end of each chapter, e.g., clustering, it wanders off into describing all the best reading on the subject.
Maybe have a look at Weka. It is a Java library with some unsupervised learning implementations and nice visualization tools. It has been a while since I used it; I'm not sure if it is great for a real production environment, but it is definitely a good starting point.
What about this very nice answer on CrossValidated?
It uses Affinity Propagation instead of k-means, and in that case you can supply a distance matrix as input. I do not think any k-means-based approach could work in your case, since it is based on building a centroid, and in order to do that you have to be in a vector space.
Affinity Propagation has the bonus that it selects the number of clusters automatically, which you can tweak (to get more or fewer clusters) by altering the preference (which by default is the median of all pairwise distances, but you can choose other percentiles).
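A hedged sketch of that approach with scikit-learn's AffinityPropagation on a precomputed matrix; the tiny edit-distance function and word list are only for illustration, and the distances are negated because AffinityPropagation expects similarities:

import numpy as np
from sklearn.cluster import AffinityPropagation

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["kitten", "sitting", "mitten", "setting", "written"]
D = np.array([[levenshtein(a, b) for b in words] for a in words], dtype=float)

ap = AffinityPropagation(affinity="precomputed", random_state=0)
labels = ap.fit_predict(-D)   # negate: larger (less negative) means more similar
for c in sorted(set(labels)):
    print(c, [w for w, l in zip(words, labels) if l == c])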
If you need to specify the exact number of clusters, besides tweaking Affinity Propagation by trial and error, you could look for an implementation of k-medoids (apparently there is no implementation of it in sklearn, but people have asked for it here and there). K-medoids does not build centroids, so it does not need the concept of a vector space. Such an implementation might accept a precomputed distance matrix as input (I haven't checked the references I give, though).
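If no packaged k-medoids is at hand, here is a tiny Voronoi-iteration-style sketch that works purely on a precomputed distance matrix (so the Levenshtein matrix D from the previous sketch would do); this is an illustration, not PAM with a full swap search:

import numpy as np

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # assign each item to its nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            # The member with the smallest total distance to the others becomes the medoid
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids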