I have done K-means++ clustering and obtained the centroids of the clusters, using Python 2.7, following the method given in http://datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/
In my problem there is a further constraint: the distance between any centroid and any node should be larger than a constant. What is the best way to achieve this?
It is possible that one centroid is too close to several nodes.
Any suggestions on how to displace the centroids a bit?
Many thanks.
For example, the nodes to be clustered are
MyNodes = [[469500, 5802610], [468764, 5803422], [467991, 5804202], [470260, 5799949], [469486, 5800730], [468713, 5801510], [467939, 5802291], [467166, 5803072], [467966, 5800204], [467193, 5800985], [466420, 5801766], [466457, 5799700], [465678, 5800488], [464950, 5799229], [470615, 5796600], [469842, 5797405], [470320, 5794955], [469547, 5795735], [468773, 5796516], [467990, 5797297], [470062, 5793215], [469289, 5793996], [468515, 5794776], [467742, 5795557], [466969, 5796338], [466195, 5797119], [469976, 5791334], [469202, 5792115], [468429, 5792896], [467656, 5793676], [466882, 5794457], [466109, 5795238], [465336, 5796050], [464600, 5796840], [470160, 5789250], [469354, 5789972], [468581, 5790753], [467808, 5791534], [467034, 5792315], [466261, 5793096], [465488, 5793877], [464714, 5794658], [463941, 5795499], [463150, 5796210], [469500, 5787920], [468698, 5788614], [467925, 5789395], [467152, 5790176]]
Centroids = [[ 467839.6, 5793224.1], [ 467617.22222222, 5800489.94444444]]
Centroids[0] would be too close to MyNodes[29], and Centroids[1] would be too close to MyNodes[8].
If I understand your question correctly, and I am not at all sure whether I do, then the solution is already apparent from your drawings:
You want the point that is closest to a given centroid point and is at least a minimum distance away from every point in a set of node points.
Draw a circle around each node point, with your minimum distance as the radius.
Intersect each circle with each other circle, and note the intersection points.
Discard any intersection point that is closer than the minimum distance to a node point.
From the remaining intersection points, take the one closest to the centroid point. That is your new displaced centroid.
Runtime for the naive implementation should be O(number_of_nodes^2), though you can optimize that by using some fast nearest-neighbour lookup data structure, such as a kd-tree, when you intersect the circles and discard the intersection points that are too close.
This should be the optimal solution; no matter which algorithm you use, you cannot find a point closer to the original centroid that fits the minimum distance constraint; and this algorithm should always find that optimal point.
Though since the centroid is generally surrounded by node points, the new point may be quite some distance away if the node points are densely packed.
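For concreteness, here is a minimal numpy sketch of that procedure. The function name displace_centroid, the eps tolerance, and the radial-push candidates (which cover the case where only a single circle is involved) are my additions, not part of the original method:

import numpy as np

def displace_centroid(centroid, nodes, r, eps=1e-9):
    # Move `centroid` to the nearest location at least `r` from every node.
    nodes = np.asarray(nodes, dtype=float)
    c = np.asarray(centroid, dtype=float)
    if np.linalg.norm(nodes - c, axis=1).min() >= r:
        return c  # constraint already satisfied

    candidates = []
    # Radial pushes: the closest point to c on each single circle.
    for n in nodes:
        d = np.linalg.norm(c - n)
        if d > eps:
            candidates.append(n + (c - n) / d * r)
    # Intersections of each pair of radius-r circles around nodes.
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            v = nodes[j] - nodes[i]
            d = np.linalg.norm(v)
            if eps < d <= 2 * r:
                mid = (nodes[i] + nodes[j]) / 2
                h = np.sqrt(max(r * r - (d / 2) ** 2, 0.0))
                perp = np.array([-v[1], v[0]]) / d
                candidates.extend([mid + h * perp, mid - h * perp])
    # Keep only candidates that respect the constraint, then take the
    # one closest to the original centroid. (This set is non-empty
    # whenever a valid displacement exists.)
    feasible = [p for p in candidates
                if np.linalg.norm(nodes - p, axis=1).min() >= r - eps]
    return min(feasible, key=lambda p: np.linalg.norm(p - c))

With the example data, displace_centroid(Centroids[0], MyNodes, 1000) would return the nearest point at least 1000 units from every node; the 1000 here is just a stand-in for whatever your constant is.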
Related
I currently have a list of 3D coordinates which I want to cluster by density into an unknown number of clusters. In addition to that, I want to score the clusters by population and by distance to the centroids.
I would also like to be able to set a maximum possible distance from a certain centroid. Ideally the centroid represent a point of the data-set, but it is not absolutely necessary. I want to do this for a list ranging from approximately 100 to 10000 3D coordinates.
So for example, say I have a point [x,y,z] which could be my centroid:
Points that are closest to [x,y,z] should contribute the most to its score (i.e. a logistic scoring function like y = (1 + exp(4*(-1.0+x)))**-1, where x represents the Euclidean distance to the point [x,y,z])
( https://www.wolframalpha.com/input/?i=(1+%2B+exp(4(-1.0%2Bx)))**+-1 )
Since this function never reaches 0, a maximum distance needs to be set, e.g. 2 distance units, to put a limit on the cluster.
I want to do this until no more clusters can be made. I am only interested in the centroid, and it should preferably be a real data point rather than an interpolated one, since it also has other properties connected to it.
I have already tried DBSCAN from sklearn, which is several orders of magnitude faster than my code, but it obviously does not accomplish what I want to do.
Currently I just calculate the proximity of every point to all other points and score every point by the number of, and distance to, its neighbors (with the same scoring function discussed above). Then I take the highest-scored point and remove all other, lower-scored points that are within a certain cutoff distance. It gets the job done and is accurate, but it is too slow.
I hope I have been somewhat clear about what I want to do.
Use the neighbor search of sklearn to find all points within the maximum distance of 2 quickly, and do this only once. Likewise, compute the logistic weights only once.
Then do the remainder using only this precomputed data.
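A minimal sketch of that idea, assuming sklearn and numpy are available; the points array is a stand-in for your 3-D coordinates, and the weight formula is the logistic function from the question:

import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.random.rand(1000, 3) * 10  # stand-in for your 3-D coordinates

# One batched radius query: for every point, all neighbours within 2.
# (Each neighbour list includes the point itself at distance 0.)
nn = NearestNeighbors(radius=2.0).fit(points)
dists, idxs = nn.radius_neighbors(points)

# Precompute the logistic weights once; a point's score is the sum of
# its neighbours' weights.
scores = np.array([np.sum(1.0 / (1.0 + np.exp(4.0 * (d - 1.0))))
                   for d in dists])

The greedy pick-highest-then-suppress loop can then run over scores and idxs alone, without ever recomputing a distance.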
This is a question that I was asked in a job interview some time ago, and I still can't figure out a sensible answer.
The question is:
you are given a set of points (x, y). Find the 2 points most distant from each other.
For example, for the points (0,0), (1,1), (-8,5), the most distant pair is (1,1) and (-8,5), because the distance between them is larger than both (0,0)-(1,1) and (0,0)-(-8,5).
The obvious approach is to calculate all distances between all points, and find maximum. The problem is that it is O(n^2), which makes it prohibitively expensive for large datasets.
There is an approach that first finds the points on the boundary and then calculates distances only for those, on the premise that there will be fewer points on the boundary than inside, but it is still expensive and will fail in the worst case.
Tried to search the web, but didn't find any sensible answer - although this might be simply my lack of search skills.
For this specific problem, with just a list of Euclidean points, one way is to find the convex hull of the set of points. The two most distant points can then be found by traversing the hull once with the rotating calipers method.
Here is an O(N log N) implementation:
http://mukeshiiitm.wordpress.com/2008/05/27/find-the-farthest-pair-of-points/
If the list of points is already sorted, you can remove the sort to get the optimal O(N) complexity.
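For illustration, here is a short Python sketch under the assumption that scipy is available; it computes the hull and then does a simple O(h^2) scan over the h hull vertices in place of full rotating calipers, which is usually fine because h is much smaller than n:

import numpy as np
from scipy.spatial import ConvexHull

def farthest_pair(points):
    pts = np.asarray(points, dtype=float)
    hull = pts[ConvexHull(pts).vertices]  # only hull points can be farthest
    n = len(hull)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i, j = max(pairs, key=lambda ij: np.linalg.norm(hull[ij[0]] - hull[ij[1]]))
    return hull[i], hull[j]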
For a more general problem of finding most distant points in a graph:
Algorithm to find two points furthest away from each other
The accepted answer works in O(N^2).
Boundary point algorithms abound (look for convex hull algorithms). From there, it should take O(N) time to find the most-distant opposite points.
From the author's comment: first find any pair of opposite points on the hull, and then walk around it in semi-lock-step fashion. Depending on the angles between edges, you will have to advance either one walker or the other, but it will always take O(N) to circumnavigate the hull.
You are looking for an algorithm to compute the diameter of a set of points, Diam(S). It can be shown that this is the same as the diameter of the convex hull of S, Diam(S) = Diam(CH(S)). So first compute the convex hull of the set.
Now you have to find all the antipodal points on the convex hull and pick the pair with maximum distance. There are O(n) antipodal pairs on a convex polygon. So this gives an O(n lg n) algorithm for finding the farthest points.
This technique is known as Rotating Calipers. This is what Marcelo Cantos describes in his answer.
If you write the algorithm carefully, you can do without computing angles. For details, check this URL.
A stochastic algorithm to find the most distant pair would be:
Choose a random point
Get the point most distant to it
Repeat a few times
Remove all visited points
Choose another random point and repeat a few times.
You are in O(n) as long as you predetermine "a few times", but you are not guaranteed to actually find the most distant pair. Depending on your set of points, though, the result should be pretty good. =)
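A rough Python sketch of this heuristic; the restarts and hops parameters are my stand-ins for "a few times", dist is any distance function you supply, and for brevity it skips the remove-all-visited-points step:

import random

def approx_farthest_pair(points, dist, restarts=5, hops=3):
    best = (None, None, -1.0)
    for _ in range(restarts):
        a = random.choice(points)           # random starting point
        for _ in range(hops):
            b = max(points, key=lambda p: dist(a, p))  # farthest from a
            if dist(a, b) > best[2]:
                best = (a, b, dist(a, b))
            a = b                           # hop to the far point, repeat
    return best[0], best[1]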
This question is discussed in Introduction to Algorithms. It mentions: 1) calculate the convex hull in O(N lg N); 2) if there are M vertices on the convex hull, then finding the farthest pair takes a further O(M).
I found this helpful link. It includes an analysis of the algorithm's details and a program.
http://www.seas.gwu.edu/~simhaweb/alg/lectures/module1/module1.html
I hope this is helpful.
Find the mean of all the points, measure the distance between each point and the mean, take the point farthest from the mean, and then find the point farthest from that one. Those points will be the absolute corners of the convex hull and the two most distant points.
I recently did this for a project that needed convex hulls confined to randomly directed infinite planes. It worked great.
See the comments: this solution isn't guaranteed to produce the correct answer.
Just a few thoughts:
You might look at only the points that define the convex hull of your set, to reduce the number of candidates, but it still looks a bit suboptimal.
Otherwise there might be a recursive quad/oct-tree approach to rapidly bound some distances between sets of points and eliminate large parts of your data.
This seems easy if the points are given in Cartesian coordinates. So easy that I'm pretty sure that I'm overlooking something. Feel free to point out what I'm missing!
Find the points with the max and min values of their x, y, and z coordinates (6 points total). These should be the most "remote" of all the boundary points.
Compute all the pairwise distances among these points (15 unique distances)
Find the max distance
The two points that correspond to this max distance are the ones you're looking for.
Here's a good solution, which works in O(n log n). It's called the Rotating Calipers method.
https://www.geeksforgeeks.org/maximum-distance-between-two-points-in-coordinate-plane-using-rotating-calipers-method/
First, you find the convex hull, which you can compute in O(n log n) with Graham's scan. Only points on the convex hull can yield the maximal distance. The scan arranges the points of the convex hull in clockwise order, a property that will be used later.
Second, for each point on the convex hull you need to find the most distant point on the hull (called its antipodal point here). You don't have to find each antipodal point with a separate search (which would give quadratic time). Say the points of the convex hull are p_1, ..., p_n, in clockwise order. Convex polygons have the property that, as you iterate through points p_j in clockwise order and calculate the distances d(p_i, p_j), these distances first don't decrease (and maybe increase) and then don't increase (and maybe decrease), so the maximum distance for a given p_i is easy to find. Better yet, once you have found the correct antipodal point p_j* for p_i, you can start the search for p_{i+1}'s antipodal point from that same p_j*; you don't need to recheck the previously seen points. In total, p_i iterates through p_1, ..., p_n once, and p_j iterates through these points at most twice, because p_j can never catch up with p_i (that would give zero distance) and we stop as soon as the distance starts decreasing.
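Here is a sketch of that walk in Python, assuming scipy is available. Note that scipy's ConvexHull returns the 2-D hull vertices in counterclockwise order, so the walk below runs counterclockwise rather than clockwise, but the idea is identical:

import numpy as np
from scipy.spatial import ConvexHull

def cross(o, a, b):
    # Twice the signed area of triangle (o, a, b).
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def diameter(points):
    pts = np.asarray(points, dtype=float)
    P = pts[ConvexHull(pts).vertices]  # CCW; assumes >= 3 non-collinear points
    n = len(P)
    best, j = 0.0, 1
    for i in range(n):
        ni = (i + 1) % n
        # Advance the antipodal pointer while the triangle on edge (i, ni)
        # keeps growing, i.e. while P[j+1] is farther from that edge than P[j].
        while cross(P[i], P[ni], P[(j + 1) % n]) > cross(P[i], P[ni], P[j]):
            j = (j + 1) % n
        best = max(best, np.linalg.norm(P[i] - P[j]),
                   np.linalg.norm(P[ni] - P[j]))
    return best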
A solution with runtime complexity O(N) is a combination of the above answers. In detail:
(1) One can compute the convex hull in O(N) if you use counting sort for the internal polar-angle sort and are willing to round angles to the nearest integer degree [0, 359], inclusive.
(2) Note that the number of points on the convex hull, N_H, is then usually less than N. We can estimate the size of the hull from Cormen et al., Introduction to Algorithms, Exercise 33-5: for sparse-hulled distributions drawn from a unit-radius disk, from a convex polygon with k sides, and from a 2-D normal distribution, the expected hull sizes are on the order of n^(1/3), log_2(n), and sqrt(log_2(n)) respectively.
The farthest-pair problem is then a comparison among the points on the hull. This is N_H^2 in the worst case, but each leading point's search for its distant point can be truncated as soon as the distances start to decrease, provided the points are traversed in convex-hull order (the points are ordered CCW from the first point). The runtime complexity of this part is then O(N_H^2).
Because N_H^2 is usually less than N, the total runtime complexity for the farthest pair is O(N), with the caveat of using integer degree angles so that the sort inside the convex hull computation becomes linear.
Given a set of points {(x1,y1), (x2,y2), ..., (xn,yn)}, find the 2 most distant points.
My approach:
1). You need a reference point (xa,ya), and it will be:
xa = ( x1 + x2 +...+ xn )/n
ya = ( y1 + y2 +...+ yn )/n
2). Calculate all distances from the point (xa,ya) to (x1,y1), (x2,y2), ..., (xn,yn)
The first "most distant point" (xb,yb) is the one with the maximum distance.
3). Calculate all distances from the point (xb,yb) to (x1,y1), (x2,y2), ..., (xn,yn)
The other "most distant point" (xc,yc) is the one with the maximum distance.
So you get your most distant points (xb,yb) and (xc,yc) in O(n)
For example, for points: (0,0), (1,1), (-8, 5)
1). Reference point (xa,ya) = (-2.333, 2)
2). Calculate distances:
from (-2.333, 2) to (0,0) : 3.073
from (-2.333, 2) to (1,1) : 3.480
from (-2.333, 2) to (-8, 5) : 6.411
So the first most distant point is (-8, 5)
3). Calculate distances:
from (-8, 5) to (0,0) : 9.434
from (-8, 5) to (1,1) : 9.849
from (-8, 5) to (-8, 5) : 0
So the other most distant point is (1, 1)
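A direct numpy transcription of those three steps (two_pass_farthest is a hypothetical name; keep in mind this runs in O(n) but, unlike the hull-based answers, is a heuristic):

import numpy as np

def two_pass_farthest(points):
    pts = np.asarray(points, dtype=float)
    ref = pts.mean(axis=0)                                  # (xa, ya)
    b = pts[np.argmax(np.linalg.norm(pts - ref, axis=1))]   # (xb, yb)
    c = pts[np.argmax(np.linalg.norm(pts - b, axis=1))]     # (xc, yc)
    return b, c

two_pass_farthest([(0, 0), (1, 1), (-8, 5)])  # -> (-8, 5) and (1, 1)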
I have a square grid with some points marked off as being the centers of the subparts of the grid. I'd like to be able to assign each location within the grid to the correct subpart. For example, if the subparts of the region were centered on the black dots, I'd like to be able to assign the red dot to the region in the lower right, as it is the closest black dot.
Currently, I do this by iterating over each possible red dot, and comparing its distance to each of the black dots. However, the width, length, and number of black dots in the grid is very high, so I'd like to know if there's a more efficient algorithm.
My particular data is formatted as such, where the numbers are just placeholders to correspond with the given example:
black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
grid = [[0 for i in range(0, 50)] for j in range(0, 50)]
For reference, in the sample case I hope to fill grid with the integers 1, 2, 3, 4, depending on whether each cell is closest to the 1st, 2nd, 3rd, or 4th entry in black_dots, so that I end up with something similar to the following picture, where each integer corresponds to a color (dots are left on for show).
To summarize: is there a more efficient way to do this, and if so, what is it?
You can use a breadth-first traversal to solve this problem.
Create a first-in, first-out queue. (A queue makes a traversal breadth-first.)
Create a Visited mask indicating whether a cell in your grid has been added to the queue or not. Set the mask to false.
Create a Parent mask indicating what black dot the cell ultimately belongs to.
Place all the black dots into the queue, flag them in the Visited mask, and assign them unique ids in the Parent mask.
Begin popping cells from the queue one by one. For each cell, iterate over the cell's neighbours. Place each not-yet-visited neighbour into the queue, flag it in Visited, and set its value in Parent equal to that of the cell you just popped.
Continue until the queue is empty.
The breadth-first traversal makes a wave which expands outward from each source cell (black dot). Since the waves all travel at the same speed across your grid, each wave gobbles up those cells closest to its source.
This solves the problem in O(N) time.
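A minimal Python sketch of this traversal, using the black_dots and 50x50 grid from the question. Since the wave expands in 4-connected steps, "closest" here means closest in grid steps rather than exact Euclidean distance:

from collections import deque

def label_grid(width, height, black_dots):
    parent = [[0] * height for _ in range(width)]    # 0 = unassigned
    visited = [[False] * height for _ in range(width)]
    q = deque()
    for k, (x, y) in enumerate(black_dots, start=1):
        parent[x][y] = k       # unique 1-based id per black dot
        visited[x][y] = True
        q.append((x, y))
    while q:
        x, y = q.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and not visited[nx][ny]:
                visited[nx][ny] = True
                parent[nx][ny] = parent[x][y]  # inherit the source's id
                q.append((nx, ny))
    return parent

grid = label_grid(50, 50, [(38, 8), (42, 39), (5, 14), (6, 49)])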
If I understand correctly, what you really need is to construct a Voronoi diagram of your centers:
https://en.m.wikipedia.org/wiki/Voronoi_diagram
A Voronoi diagram can be constructed very efficiently, with computational complexity similar to that of computing a convex hull.
The Voronoi diagram allows you to construct the optimal polygons surrounding your centers, which delimit the regions closest to the centers.
Given the Voronoi diagram, the task reduces to detecting which polygon each red dot lies in. Since the Voronoi cells are convex, you need an algorithm to decide whether a point is inside a convex polygon; however, simply traversing all the polygons has complexity O(n).
There are several algorithms to accelerate the point location so it can be done in O(log n):
https://en.m.wikipedia.org/wiki/Point_location
See also
Nearest Neighbor Searching using Voronoi Diagrams
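As a sketch, assuming scipy is available: scipy.spatial.Voronoi builds the diagram explicitly, while a KD-tree over the same centers answers the which-cell-does-this-point-lie-in question in O(log n) per query, which is all the grid problem above actually needs (black_dots reused from the question as the centers):

import numpy as np
from scipy.spatial import Voronoi, cKDTree

centers = np.array([(38, 8), (42, 39), (5, 14), (6, 49)], dtype=float)
vor = Voronoi(centers)  # explicit cell geometry, if you need the polygons

# Point location: the nearest center is, by definition, the Voronoi cell.
tree = cKDTree(centers)
xs, ys = np.meshgrid(np.arange(50), np.arange(50), indexing='ij')
queries = np.column_stack([xs.ravel(), ys.ravel()])
_, nearest = tree.query(queries)
grid = (nearest + 1).reshape(50, 50)  # 1-based region ids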
The "8-way" Voronoi diagram can be constructed efficiently (in linear time wrt the number of pixels) by a two-passes scanline process. (8-way means that distances are evaluated as the length of the shortest 8-connected path between two pixels.)
Assign every center a distinct color and create an array of distances of the same size as the image, initialized with 0 at the centers and "infinity" elsewhere.
In a top-down/left-right pass, update the distances of all pixels as being the minimum of the distances of the four neighbors W, NW, N and NE plus one, and assign the current pixel the color of the neighbor that achieves the minimum.
In a bottom-up/right-left pass, update the distances of all pixels as being the minimum of the current distance and the distances of the four neighbors E, SE, S, SW plus one, and assign the current pixel the color of the neighbor that achieves the minimum (or keep the current color).
It is also possible to compute the Euclidean Voronoi diagram efficiently (in linear time), but this requires a more sophisticated algorithm. It can be based on the wonderful paper "A General Algorithm for Computing Distance Transforms in Linear Time" by A. Meijster, J.B.T.M. Roerdink, and W.H. Hesselink, enhanced with some bookkeeping of which neighbor yields the smallest distance.
I need a way to characterize the size of sets of 2-D points, so I can determine whether to render them as individual points in a space or as representative polygons, dependent on the scale of the viewport. I already have an algorithm to calculate the convex hull of the set to produce the representative polygon, but I need a way to characterize its size. One obvious measure is the maximum distance between points on the convex hull, which is the diameter of the set. But I'm really more interested in the size of its cross-section perpendicular to its diameter, to figure out how narrow the bounding polygon is. Is there a simple way to do this, given the sorted list of vertices and the indices of the furthest points (ideally in Python)?
Or alternatively, is there an easy way to calculate the radii of the minimal area bounding ellipse of a set of points? I have seen some approaches to this problem, but nothing that I can readily convert to Python, so I'm really looking for something that's turnkey.
You can compute:
the size of its cross-section perpendicular to its diameter
with the following steps:
Find the convex hull
Find the two points a and b which are furthest apart
Find the direction vector d = (a - b).normalized() between those two
Rotate your axes so that this direction vector lies horizontal, using the matrix:
[ d.x, d.y]
[-d.y, d.x]
Find the minimum and maximum y value of points in this new coordinate system. The difference is your "width"
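Putting those steps together in numpy, assuming scipy for the hull and a simple scan over the hull vertices for the farthest pair:

import numpy as np
from scipy.spatial import ConvexHull

def width_along_diameter(points):
    pts = np.asarray(points, dtype=float)
    hull = pts[ConvexHull(pts).vertices]
    n = len(hull)
    # Farthest pair (a, b) by brute force over the hull vertices.
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    i, j = max(pairs, key=lambda ij: np.linalg.norm(hull[ij[0]] - hull[ij[1]]))
    d = (hull[i] - hull[j]) / np.linalg.norm(hull[i] - hull[j])
    R = np.array([[d[0], d[1]],
                  [-d[1], d[0]]])   # rotates d onto the x-axis
    rotated = hull @ R.T
    return rotated[:, 1].max() - rotated[:, 1].min()  # y-extent = "width"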
Note that this is not a particularly good definition of "width" - a better one is:
The minimal perpendicular distance between two distinct parallel lines each having at least one point in common with the polygon's boundary but none with the polygon's interior
Another useful definition of size might be twice the average distance between points on the hull and the center:
center = sum(convexhullpoints) / len(convexhullpoints)
size = 2 * sum(norm(p - center) for p in convexhullpoints) / len(convexhullpoints)  # norm = Euclidean length, e.g. numpy.linalg.norm
Is there even such a thing as a 3D centroid? Let me be perfectly clear: I've been reading and reading about centroids for the last 2 days, both on this site and across the web, so I'm perfectly aware of the existing posts on the topic, including Wikipedia.
That said, let me explain what I'm trying to do. Basically, I want to take a selection of edges and/or vertices, but NOT faces. Then, I want to place an object at the 3D centroid position.
I'll tell you what I don't want:
The average of the vertices, which would pull too far in any direction that has a more highly detailed mesh.
The bounding box center, because I already have something working for this scenario.
I'm open to suggestions about center of mass, but I don't see how this would work, because vertices or edges alone don't define any sort of mass, especially when I just have an edge loop selected.
For kicks, I'll show you some PyMEL that I worked up, using @Emile's code as a reference, but I don't think it's working the way it should:
from pymel.core import ls, spaceLocator
from pymel.core.datatypes import Vector
from pymel.core.nodetypes import NurbsCurve

def get_centroid(node):
    if not isinstance(node, NurbsCurve):
        raise TypeError("Requires NurbsCurve.")
    centroid = Vector(0, 0, 0)
    signed_area = 0.0
    cvs = node.getCVs(space='world')
    v0 = cvs[len(cvs) - 1]
    for i, cv in enumerate(cvs[:-1]):
        v1 = cv
        a = v0.x * v1.y - v1.x * v0.y
        signed_area += a
        centroid += sum([v0, v1]) * a
        v0 = v1
    signed_area *= 0.5
    centroid /= 6 * signed_area
    return centroid

texas = ls(selection=True)[0]
centroid = get_centroid(texas)
print(centroid)
spaceLocator(position=centroid)
In theory, centroid = SUM(pos*volume)/SUM(volume) when you split the part into finite volumes, each with a location pos and a volume value volume.
This is precisely the calculation done for finding the center of gravity of a composite part.
There is not just a 3D centroid, there is an n-dimensional centroid, and the formula for it is given in the "By integral formula" section of the Wikipedia article you cite.
Perhaps you are having trouble setting up this integral? You have not defined your shape.
[Edit] I'll beef up this answer in response to your comment. Since you have described your shape in terms of edges and vertices, I'll assume it is a polyhedron. You can partition a polyhedron into pyramids, find the centroids of the pyramids, and then the centroid of your shape is the centroid of those centroids (this last calculation is done using ja72's formula).
I'll assume your shape is convex (no hollow parts; if this is not the case, then break it into convex chunks). You can partition it into pyramids (triangulate it) by picking a point in the interior and drawing edges to the vertices. Then each face of your shape is the base of a pyramid. There are formulas for the centroid of a pyramid (you can look this up; it's 1/4 of the way from the centroid of the face to your interior point). Then, as was said, the centroid of your shape is the centroid of the centroids, using ja72's finite calculation, not an integral, as given in the other answer.
This is the same algorithm as in Hugh Bothwell's answer; however, I believe that 1/4 is correct instead of 1/3. Perhaps you can find some code for it lurking around somewhere, using the search terms in this description.
I like the question. Centre of mass sounds right, but the question then becomes, what mass for each vertex?
Why not use the average length of each edge that includes the vertex? This should compensate nicely for areas with a dense mesh.
You will have to recreate face information from the vertices (essentially a Delaunay triangulation).
If your vertices define a convex hull, you can pick any arbitrary point A inside the object. Treat your object as a collection of pyramidal prisms having apex A and each face as a base.
For each face, find the area Fa and the 2D centroid Fc; the prism's mass is then proportional to its volume (= 1/3 * base area * height, where the height is the component of Fc - A perpendicular to the face), and you can disregard the constant of proportionality as long as you do the same for all prisms; the center of mass is (2/3 A + 1/3 Fc), or a third of the way from the apex to the 2D centroid of the base.
You can then do a mass-weighted average of the center-of-mass points to find the 3d centroid of the object as a whole.
The same process should work for non-convex hulls - or even for A outside the hull - but the face-calculation may be a problem; you will need to be careful about the handedness of your faces.
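A sketch combining this pyramid decomposition with the 1/4 rule from the earlier answer (a tetrahedron's centroid is the mean of its four vertices), assuming scipy is available and the selected vertices bound a convex solid:

import numpy as np
from scipy.spatial import ConvexHull

def solid_centroid(points):
    pts = np.asarray(points, dtype=float)
    hull = ConvexHull(pts)
    A = pts.mean(axis=0)           # any interior point works as the apex
    total_vol, weighted = 0.0, np.zeros(3)
    for tri in hull.simplices:     # hull faces, already triangulated
        p0, p1, p2 = pts[tri]
        vol = abs(np.dot(p0 - A, np.cross(p1 - A, p2 - A))) / 6.0
        weighted += vol * (A + p0 + p1 + p2) / 4.0  # tetra centroid * mass
        total_vol += vol
    return weighted / total_vol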