I currently have a list of 3D coordinates which I want to cluster by density into an unknown number of clusters. In addition to that, I want to score the clusters by population and by distance to the centroids.
I would also like to be able to set a maximum possible distance from a certain centroid. Ideally the centroid represents a point of the data set, but it is not absolutely necessary. I want to do this for lists ranging from approximately 100 to 10,000 3D coordinates.
So for example, say I have a point [x, y, z] which could be my centroid:
Points that are closest to [x, y, z] should contribute the most to its score, i.e. a logistic scoring function like y = (1 + exp(4*(-1.0 + x)))**-1, where x represents the Euclidean distance to the point [x, y, z]
( https://www.wolframalpha.com/input/?i=(1+%2B+exp(4(-1.0%2Bx)))**+-1 )
Since this function never reaches 0, a maximum distance (e.g. 2 distance units) needs to be set to limit the cluster.
I want to do this until no more clusters can be made. I am only interested in the centroid, so it should preferably be a real data point instead of an interpolated one, since it also has other properties connected to it.
I have already tried DBSCAN from sklearn, which is several orders of magnitude faster than my code, but it obviously does not accomplish what I want to do.
Currently I am just calculating the proximity of every point relative to all other points and scoring every point by the number of and distance to its neighbors (with the same scoring function discussed above). I then take the highest-scored point and remove all other, lower-scored points that are within a certain cutoff distance. It gets the job done and is accurate, but it is too slow.
I hope I have been somewhat clear about what I want to do.
Use the neighbor search of sklearn to quickly find the points within the maximum distance of 2. Do this only once, and compute the logistic weights only once as well.
Then do the remainder using only this precomputed data.
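A minimal sketch of that idea (assuming the coordinates sit in an (N, 3) NumPy array and using a placeholder cutoff of 2 distance units): one radius query with sklearn's NearestNeighbors returns every point's neighbors and distances, the logistic weights are computed once, and the greedy "take the best point, suppress its neighbors" step then only touches this precomputed data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.random.rand(1000, 3) * 10    # placeholder data
cutoff = 2.0                             # maximum cluster radius (assumption)

# one radius query for all points at once
nn = NearestNeighbors(radius=cutoff).fit(points)
dists, idxs = nn.radius_neighbors(points)

# logistic weights and per-point scores, computed once
weights = [1.0 / (1.0 + np.exp(4.0 * (d - 1.0))) for d in dists]
scores = np.array([w.sum() for w in weights])

# greedy selection: take the highest-scored remaining point as a centroid
# and suppress all points within the cutoff around it
available = np.ones(len(points), dtype=bool)
centroids = []
for i in np.argsort(scores)[::-1]:
    if available[i]:
        centroids.append(i)              # the centroid is a real data point
        available[idxs[i]] = False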
This is a question that I was asked in a job interview some time ago, and I still can't figure out a sensible answer.
The question is:
You are given a set of points (x, y). Find the 2 most distant points (distant from each other).
For example, for the points (0,0), (1,1), (-8,5), the most distant are (1,1) and (-8,5), because the distance between them is larger than both (0,0)-(1,1) and (0,0)-(-8,5).
The obvious approach is to calculate all distances between all points and find the maximum. The problem is that this is O(n^2), which makes it prohibitively expensive for large datasets.
There is an approach that first tracks the points on the boundary and then calculates distances only for them, on the premise that there will be fewer points on the boundary than "inside", but it is still expensive and will fail in the worst case.
I tried to search the web, but didn't find any sensible answer - although this might simply be my lack of search skills.
For this specific problem, with just a list of Euclidean points, one way is to find the convex hull of the set of points. The two most distant points can then be found by traversing the hull once with the rotating calipers method.
Here is an O(N log N) implementation:
http://mukeshiiitm.wordpress.com/2008/05/27/find-the-farthest-pair-of-points/
If the list of points is already sorted, you can remove the sort to get the optimal O(N) complexity.
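In Python, a hedged sketch of the hull idea using SciPy: compute the hull, then simply compare the hull vertices pairwise. This skips the rotating-calipers walk (the second step is O(h^2) in the number of hull vertices h rather than O(h)), but since h is usually tiny compared to n it tends to be fast enough in practice.
import numpy as np
from scipy.spatial import ConvexHull

def farthest_pair(points):
    pts = np.asarray(points, dtype=float)
    hull = pts[ConvexHull(pts).vertices]    # only hull vertices can realize the diameter
    # pairwise squared distances between hull vertices
    d2 = ((hull[:, None, :] - hull[None, :, :]) ** 2).sum(axis=-1)
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    return hull[i], hull[j]

print(farthest_pair([(0, 0), (1, 1), (-8, 5)]))    # the pair (1, 1) and (-8, 5)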
For a more general problem of finding most distant points in a graph:
Algorithm to find two points furthest away from each other
The accepted answer works in O(N^2).
Boundary point algorithms abound (look for convex hull algorithms). From there, it should take O(N) time to find the most-distant opposite points.
From the author's comment: first find any pair of opposite points on the hull, and then walk around it in semi-lock-step fashion. Depending on the angles between edges, you will have to advance either one walker or the other, but it will always take O(N) to circumnavigate the hull.
You are looking for an algorithm to compute the diameter of a set of points, Diam(S). It can be shown that this is the same as the diameter of the convex hull of S, Diam(S) = Diam(CH(S)). So first compute the convex hull of the set.
Now you have to find all the antipodal pairs on the convex hull and pick the pair with maximum distance. There are O(n) antipodal pairs on a convex polygon, so this gives an O(n lg n) algorithm for finding the farthest points.
This technique is known as Rotating Calipers. This is what Marcelo Cantos describes in his answer.
If you write the algorithm carefully, you can do without computing angles. For details, check this URL.
A stochastic algorithm to find the most distant pair would be
Choose a random point
Get the point most distant to it
Repeat a few times
Remove all visited points
Choose another random point and repeat a few times.
You are in O(n) as long as you predetermine "a few times", but you are not guaranteed to actually find the most distant pair. Depending on your set of points, though, the result should be pretty good. =)
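A rough sketch of that idea (a slightly simplified variant that restarts from new random points instead of removing visited ones, and keeps the best pair seen; restarts and hops are tunable placeholders):
import math
import random

def stochastic_farthest_pair(points, restarts=5, hops=3):
    best = (0.0, points[0], points[0])
    for _ in range(restarts):
        p = random.choice(points)
        for _ in range(hops):
            q = max(points, key=lambda r: math.dist(p, r))    # point most distant from p
            if math.dist(p, q) > best[0]:
                best = (math.dist(p, q), p, q)
            p = q                                             # hop to that point and repeat
    return best

print(stochastic_farthest_pair([(0, 0), (1, 1), (-8, 5)]))    # (distance, point, point)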
This question is introduced in Introduction to Algorithms. It mentions: 1) Calculate the convex hull in O(N lg N). 2) If there are M vertices on the convex hull, then we need O(M) to find the farthest pair.
I found this helpful link. It includes an analysis of the algorithm details and a program.
http://www.seas.gwu.edu/~simhaweb/alg/lectures/module1/module1.html
I hope this is helpful.
Find the mean of all the points, measure the distance between each point and the mean, take the point with the largest distance from the mean, and then find the point farthest from that one. Those points will be the absolute corners of the convex hull and the two most distant points.
I recently did this for a project that needed convex hulls confined to randomly directed infinite planes. It worked great.
See the comments: this solution isn't guaranteed to produce the correct answer.
Just a few thoughts:
You might look only at the points that define the convex hull of your set of points to reduce the number... but it still looks a bit "not optimal".
Otherwise there might be a recursive quad-/octree approach to rapidly bound some distances between sets of points and eliminate large parts of your data.
This seems easy if the points are given in Cartesian coordinates. So easy that I'm pretty sure that I'm overlooking something. Feel free to point out what I'm missing!
Find the points with the max and min values of their x, y, and z coordinates (6 points in total). These should be the most "remote" of all the boundary points.
Compute all the pairwise distances between these points (at most 15 unique distances).
Find the max distance
The two points that correspond to this max distance are the ones you're looking for.
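A minimal sketch of that recipe, written dimension-agnostically so it covers both the 2-D question and the x/y/z phrasing above. As the comments on the mean-based answer note, this kind of shortcut is a heuristic and is not guaranteed to return the true farthest pair.
import numpy as np

def extreme_coordinate_pair(points):
    pts = np.asarray(points, dtype=float)    # shape (n, d)
    # the min and max point along each coordinate axis (at most 2*d candidates)
    cand = pts[list(set(np.argmin(pts, axis=0)) | set(np.argmax(pts, axis=0)))]
    best = (0.0, cand[0], cand[0])
    for i in range(len(cand)):
        for j in range(i + 1, len(cand)):
            d = np.linalg.norm(cand[i] - cand[j])
            if d > best[0]:
                best = (d, cand[i], cand[j])
    return best

print(extreme_coordinate_pair([(0, 0), (1, 1), (-8, 5)]))    # finds (1, 1) and (-8, 5)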
Here's a good solution, which works in O(n log n). It's called the rotating calipers method.
https://www.geeksforgeeks.org/maximum-distance-between-two-points-in-coordinate-plane-using-rotating-calipers-method/
Firstly you find the convex hull, which you can compute in O(n log n) with Graham's scan. Only points on the convex hull can provide the maximal distance. This algorithm arranges the points of the convex hull in a clockwise traversal; this property will be used later.
Secondly, for every point on the convex hull you need to find the most distant point on this hull (called its antipodal point here). You don't have to search for each antipodal point separately (which would give quadratic time). Let's say the points of the convex hull are called p_1, ..., p_n, and their order corresponds to the clockwise traversal. A property of convex polygons is that when you iterate through the points p_j on the hull in clockwise order and calculate the distances d(p_i, p_j), these distances first do not decrease (and may increase) and then do not increase (and may decrease), so the maximum distance is easy to find. Moreover, once you have found the correct antipodal point p_j* for p_i, you can start the search for p_{i+1} with the candidate points starting from that p_j*; you don't need to check all previously seen points. In total, p_i iterates through the points p_1, ..., p_n once, and p_j iterates through these points at most twice, because p_j can never catch up with p_i (that would give zero distance) and we stop as soon as the distance starts decreasing.
A solution with runtime complexity O(N) is a combination of the above answers. In detail:
(1) You can compute the convex hull with runtime complexity O(N) if you use counting sort for the internal polar-angle sort and are willing to round the angles to the nearest integer degree in [0, 359], inclusive.
(2) Note that the number of points on the convex hull is then N_H, which is usually less than N. We can speculate about the size of the hull from the information in Cormen et al., Introduction to Algorithms, Exercise 33-5: for sparse-hulled distributions over a unit-radius disk, a convex polygon with k sides, and a 2-D normal distribution, the expected hull sizes are roughly n^(1/3), log_2(n), and sqrt(log_2(n)) respectively.
The farthest-pair problem is then a comparison between points on the hull. This is N_H^2 in the worst case, but each leading point's search for its distant partner can be truncated when the distances start to decrease, provided the points are traversed in convex-hull order (ordered CCW from the first point). The runtime complexity for this part is then O(N_H^2).
Because N_H^2 is usually less than N, the total runtime complexity for the farthest pair is O(N), with the caveat of using integer-degree angles to reduce the sort in the convex hull step to linear time.
Given a set of points {(x1,y1), (x2,y2), ..., (xn,yn)}, find the 2 most distant points.
My approach:
1). You need a reference point (xa,ya), and it will be:
xa = ( x1 + x2 +...+ xn )/n
ya = ( y1 + y2 +...+ yn )/n
2). Calculate all distances from point (xa,ya) to (x1,y1), (x2,y2), ..., (xn,yn)
The first "most distant point" (xb,yb) is the one with the maximum distance.
3). Calculate all distances from point (xb,yb) to (x1,y1), (x2,y2), ..., (xn,yn)
The other "most distant point" (xc,yc) is the one with the maximum distance.
So you get your most distant points (xb,yb) and (xc,yc) in O(n)
For example, for points: (0,0), (1,1), (-8, 5)
1). Reference point (xa,ya) = (-2.333, 2)
2). Calculate distances:
from (-2.333, 2) to (0,0) : 3.073
from (-2.333, 2) to (1,1) : 3.480
from (-2.333, 2) to (-8, 5) : 6.411
So the first most distant point is (-8, 5)
3). Calculate distances:
from (-8, 5) to (0,0) : 9.434
from (-8, 5) to (1,1) : 9.849
from (-8, 5) to (-8, 5) : 0
So the other most distant point is (1, 1)
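For what it's worth, a direct sketch of the three steps above; per the caveat on the similar mean-based answer earlier, this is a heuristic and is not guaranteed to find the true farthest pair.
import numpy as np

def mean_based_pair(points):
    pts = np.asarray(points, dtype=float)
    ref = pts.mean(axis=0)                                   # 1) reference point (xa, ya)
    b = pts[np.argmax(np.linalg.norm(pts - ref, axis=1))]    # 2) farthest from the reference
    c = pts[np.argmax(np.linalg.norm(pts - b, axis=1))]      # 3) farthest from (xb, yb)
    return b, c

print(mean_based_pair([(0, 0), (1, 1), (-8, 5)]))            # (-8, 5) and (1, 1)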
I have n and m binary vectors (of length 1500) from sets A and B respectively.
I need a metric (a kind of distance metric) that can say how similar those n vectors are to each other, and likewise for the m vectors.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means Set B has more similar vectors than Set A.
Which metric should I use? I thought of Jaccard similarity, but I am not able to put it in this context. Should I find the distance of each vector to every other vector to get the total distance, or something else?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It can be the Euclidean distance (not well suited to a 1500-dimensional space), a cosine (dot-product based) distance, or the Hamming distance (assuming your input features are completely independent, which is rarely the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which is probably not what you want. If n >> m, the total sum of distances for the n vectors will almost certainly be larger than the total distance for the m vectors.
What you're looking for is most probably the average of the distances between the members of each set. Then, depending on whether you want your average to be sensitive to outliers or not, you can use the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
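A hedged sketch of the averaging idea, using the Hamming distance as the per-pair measure; Jaccard works the same way by passing metric='jaccard'. The set sizes below are placeholders.
import numpy as np
from scipy.spatial.distance import pdist

def mean_intra_set_distance(vectors, metric="hamming"):
    # average over all pairwise distances within one set
    return pdist(np.asarray(vectors), metric=metric).mean()

A = np.random.randint(0, 2, size=(40, 1500))    # n binary vectors of length 1500
B = np.random.randint(0, 2, size=(25, 1500))    # m binary vectors of length 1500

if mean_intra_set_distance(A) > mean_intra_set_distance(B):
    print("Set B contains more similar vectors than Set A")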
As a newbie to Dynamic Time Warping (DTW), I find that its Python implementation, mlpy.dtw, is not documented in much detail. I have some problems with its return value.
Regarding the returned value dist, I have two questions:
Is there any typo here? For standard DTW, the documentation says:
Standard DTW as described in [Muller07], using the Euclidean distance (absolute value of the difference) or squared Euclidean distance (as in [Keogh01]) as local cost measure.
and for subsequence DTW, the documentation says:
Subsequence DTW as described in [Muller07], assuming that the length of y is much larger than the length of x and using the Manhattan distance (absolute value of the difference) as local cost measure.
The same so-called "absolute value of the difference" corresponds to two different distance metrics?
Total distance? After running the snippet
dist, cost, path = mlpy.dtw_std(x, y, dist_only=False)
dist is one value. So is it the lumped sum of all the distances between each matched pair?
Yes, the mlpy.dtw() function is not well documented.
First question: no typo here. As you can see in the documentation, the Euclidean, squared Euclidean and Manhattan distances concern the local cost measure. In this case the cost measure is defined as a distance between two real values (one dimension); see cost in the pseudocode at http://en.wikipedia.org/wiki/Dynamic_time_warping. So, in this case, the Manhattan distance and the Euclidean distance are the same (http://en.wikipedia.org/wiki/Euclidean_distance#One_dimension). Anyway, in the standard DTW, you can choose the Euclidean distance (absolute value of the difference) or the squared Euclidean distance (squared difference) via the squared parameter:
>>> import mlpy
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=False) # Euclidean distance
9.0
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=True) # Squared Euclidean distance
26.0
Second question: dist is the unnormalized minimum distance of the warping path between the time series x and y, i.e. the unnormalized DTW distance. You can normalize it by dividing by len(x) + len(y). See http://www.irit.fr/~Julien.Pinquier/Docs/TP_MABS/res/dtw-sakoe-chiba78.pdf
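For example, a minimal sketch reusing the toy sequences above:
import mlpy

x, y = [1, 2, 3], [4, 5, 6]
dist, cost, path = mlpy.dtw_std(x, y, dist_only=False)
norm_dist = dist / (len(x) + len(y))    # normalized DTW distance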
Cheers,
Davide
It seems to be an error in the documentation. The Euclidean distance is not the "absolute value of the difference"; that is the correct description of the Manhattan metric. Probably the author was thinking of the one-dimensional case, since in R (one dimension) both the Euclidean and Manhattan metrics are the same (and the Euclidean metric really is the absolute value of the difference then). I am not familiar with the library; if it only operates on 1-dimensional objects, then there is no error and these two distance measures are equivalent.
The dist value is the value of the best time warp (measured as the summed costs of matching; see the algorithm definition on Wikipedia). So it is in fact the minimum edit distance between the two sequences, where the costs of the particular edits are expressed as the dissimilarity (distance) between the "matched" objects.
I'm trying to do K-means clustering of a dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0 to 23, so the distance algorithm thinks that 0 is very far from 23, because in absolute terms it is. In reality, and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
from numpy import vstack
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
data = vstack(data)    # data is a list of rows like [hour, feature1, feature2]
fit = clusters.fit(data)
classes = fit.predict(data)
data elements look something like [22, 418, 192], where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to refine your distance measure and so use a different library. But what is more important, the concept of the mean used in k-means won't suit the cyclic dimension. Let's consider the following example:
#current cluster X, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results.
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k--means + distance measure result in degenerated super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}
Solving this problem requires checking, for each point, whether it is better to use it as-is in the "simple average" or to represent it as x' = x - 24 first. Unfortunately, given n points this gives 2^n possibilities.
This seems like a use case for kernelized k-means, where you actually cluster in an abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing sum-of-squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (see the sketch after this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
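Regarding the second option above, here is a hedged sketch of the sin/cos transform: the hour column is replaced by two unit-circle columns and plain K-means runs on the result. The column layout (hour first, as in the example rows like [22, 418, 192]) and the sample values are assumptions; in practice you may also want to scale the other features.
import numpy as np
from sklearn.cluster import KMeans

# rows like [hour, feature1, feature2]; the values here are made up
data = np.array([[22, 418, 192],
                 [23, 400, 180],
                 [ 1, 405, 185],
                 [11, 300,  90]], dtype=float)

# map the hour onto the unit circle so hours 23 and 0 end up close together
angles = data[:, 0] / 24.0 * 2 * np.pi
features = np.column_stack([np.sin(angles), np.cos(angles), data[:, 1:]])

labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)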
The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
# compute the circular mean of hours 23 and 0
import numpy as np

hours = np.array(range(24))

# hours to angles on the unit circle
angles = hours / 24.0 * (2 * np.pi)
sin = np.sin(angles)
cos = np.cos(angles)

# mean direction of hours 23 and 0
a = np.arctan2(sin[23] + sin[0], cos[23] + cos[0])
if a < 0:
    a += 2 * np.pi

# angle back to hour
hour = a * 24 / (2 * np.pi)
# hour == 23.5
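And for the "change the distance-to-centroid calculation accordingly" part, a small companion sketch of the wrap-around hour difference (a hypothetical helper, not part of sklearn):
def hour_diff(h1, h2):
    # wrap-around difference between two hours on a 24-hour clock
    d = abs(h1 - h2) % 24
    return min(d, 24 - d)

print(hour_diff(23, 1))    # 2, not 22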