Divide a region into parts efficiently in Python

I have a square grid with some points marked off as being the centers of the subparts of the grid. I'd like to be able to assign each location within the grid to the correct subpart. For example, if the subparts of the region were centered on the black dots, I'd like to be able to assign the red dot to the region in the lower right, as it is the closest black dot.
Currently, I do this by iterating over each possible red dot and comparing its distance to each of the black dots. However, the width, length, and number of black dots in the grid are all very large, so I'd like to know whether there's a more efficient algorithm.
My particular data is formatted as such, where the numbers are just placeholders to correspond with the given example:
black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
grid = [[0 for i in range(0, 50)] for j in range(0, 50)]
For reference, in the sample case, I hope to fill grid with the integers 1, 2, 3, 4, depending on whether each cell is closest to the 1st, 2nd, 3rd, or 4th entry in black_dots, ending up with something that would let me create a picture like the following, where each integer corresponds to a color (dots are left on for show).
To summarize: is there a more efficient way to do this, and if so, what is it?

You can use a breadth-first traversal to solve this problem.
Create a first-in, first-out queue. (A queue makes a traversal breadth-first.)
Create a Visited mask indicating whether a cell in your grid has been added to the queue or not. Set the mask to false.
Create a Parent mask indicating what black dot the cell ultimately belongs to.
Place all the black dots into the queue, flag them in the Visited mask, and assign them unique ids in the Parent mask.
Begin popping cells from the queue one by one. For each cell, iterate over the cell's neighbours. Place each unvisited neighbour into the queue, flag it in Visited, and set its value in Parent equal to that of the cell you just popped.
Continue until the queue is empty.
The breadth-first traversal makes a wave which expands outward from each source cell (black dot). Since the waves all travel at the same speed across your grid, each wave gobbles up those cells closest to its source.
This solves the problem in O(N) time, where N is the number of grid cells.
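A minimal sketch of the traversal above (the function name is illustrative; the Visited mask is folded into Parent, where -1 means unvisited). One caveat: a 4-connected BFS measures Manhattan distance rather than Euclidean distance, which is usually an acceptable approximation on a raster grid.

```python
from collections import deque

def assign_regions(grid_shape, black_dots):
    """Multi-source BFS: flood-fill the grid from all black dots at once."""
    rows, cols = grid_shape
    parent = [[-1] * cols for _ in range(rows)]   # -1 means "not visited yet"
    q = deque()
    for idx, (r, c) in enumerate(black_dots):
        parent[r][c] = idx + 1                    # ids 1..k, as in the question
        q.append((r, c))
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and parent[nr][nc] == -1:
                parent[nr][nc] = parent[r][c]     # inherit the source's id
                q.append((nr, nc))
    return parent
```

Calling `assign_regions((50, 50), black_dots)` with the data from the question fills the whole 50x50 grid with labels 1 through 4 in a single pass.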

If I understand correctly what you really need is to construct a Voronoi diagram of your centers:
https://en.m.wikipedia.org/wiki/Voronoi_diagram
A Voronoi diagram can be constructed very efficiently, with computational complexity similar to that of computing a convex hull.
The Voronoi diagram gives you the optimal polygons surrounding your centers, which delimit the regions closest to each center.
Once you have the Voronoi diagram, the task reduces to detecting which polygon each red dot lies in. Since Voronoi cells are convex, you need an algorithm that decides whether a point is inside a convex polygon. However, traversing all polygons has complexity O(n) per query.
There are several algorithms to accelerate the point location so it can be done in O(log n):
https://en.m.wikipedia.org/wiki/Point_location
See also
Nearest Neighbor Searching using Voronoi Diagrams
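For a rasterized grid like the one in the question, you can also skip explicit polygon construction: querying a k-d tree of the centers produces the same nearest-center labeling (the implicit Voronoi assignment) in O(log n) per cell. A sketch using scipy.spatial.cKDTree, assuming SciPy is acceptable here:

```python
import numpy as np
from scipy.spatial import cKDTree

black_dots = [(38, 8), (42, 39), (5, 14), (6, 49)]
tree = cKDTree(black_dots)

# Query every cell of a 50x50 grid in one vectorized call;
# labels are made 1-based to match the question.
xs, ys = np.meshgrid(np.arange(50), np.arange(50), indexing="ij")
cells = np.column_stack([xs.ravel(), ys.ravel()])
_, nearest = tree.query(cells)
grid = (nearest + 1).reshape(50, 50)
```

This gives exact Euclidean nearest-center assignment without ever materializing the Voronoi polygons.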

The "8-way" Voronoi diagram can be constructed efficiently (in linear time wrt the number of pixels) by a two-passes scanline process. (8-way means that distances are evaluated as the length of the shortest 8-connected path between two pixels.)
Assign every center a distinct color and create an array of distances of the same size as the image, initialized with 0 at the centers and "infinity" elsewhere.
In a top-down/left-right pass, update the distances of all pixels as being the minimum of the distances of the four neighbors W, NW, N and NE plus one, and assign the current pixel the color of the neighbor that achieves the minimum.
In a bottom-up/right-left pass, update the distances of all pixels as being the minimum of the current distance and the distances of the four neighbors E, SE, S, SW plus one, and assign the current pixel the color of the neighbor that achieves the minimum (or keep the current color).
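A sketch of the two-pass process described above (pure Python for clarity; at scale you would want NumPy or C, and the function name is illustrative):

```python
def eight_way_voronoi(shape, centers):
    """Two-pass chamfer propagation of 8-way (chessboard) distances and labels."""
    rows, cols = shape
    INF = rows * cols + 1                 # larger than any possible distance
    dist = [[INF] * cols for _ in range(rows)]
    label = [[-1] * cols for _ in range(rows)]
    for idx, (r, c) in enumerate(centers):
        dist[r][c] = 0
        label[r][c] = idx

    def sweep(order, neighbors):
        for r, c in order:
            for dr, dc in neighbors:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] + 1 < dist[r][c]:
                    dist[r][c] = dist[nr][nc] + 1
                    label[r][c] = label[nr][nc]   # inherit the winning color

    forward = [(r, c) for r in range(rows) for c in range(cols)]
    sweep(forward, ((0, -1), (-1, -1), (-1, 0), (-1, 1)))     # W, NW, N, NE
    sweep(forward[::-1], ((0, 1), (1, 1), (1, 0), (1, -1)))   # E, SE, S, SW
    return dist, label
```

With unit weights on all eight neighbors, the two sweeps together compute the exact chessboard distance to the nearest center, and the label array is the 8-way Voronoi partition.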
It is also possible to compute the Euclidean Voronoi diagram efficiently (in linear time), but this requires a more sophisticated algorithm. It can be based on the wonderful paper "A General Algorithm for Computing Distance Transforms in Linear Time" by A. Meijster, J.B.T.M. Roerdink and W.H. Hesselink, which must be enhanced with some accounting of the neighbor that causes the smallest distance.

Related

locating cell containing point in irregular 3D grid

I have an irregular 3D grid which looks something like this:
Typical dimensions of the grid are 100/100/100 cells. Each cell is spatially defined by the coords of its 8 corner nodes. The 4 vertices of each face of a cell are not necessarily co-planar, so I represent each face as a pair of triangles and thus a cell as a polyhedron consisting of 12 triangles (2 per face). I am trying to locate the IJK index of the cell that contains an XYZ point using Python. I bisect the cell range sequentially in the I, J and K directions and test which half of the grid the point lies in, using the method described here: Testing whether a 3D point is inside a 3D polyhedron. Unfortunately, this does not work in some cases. In the above figure, point A is physically outside the grid but inside the current bisection range (defined by the brown dotted lines), while point B is inside the grid but outside the current range. I think the reason for this is that the triangles representing the faces of the cells within the current range (e.g. the large brown triangles in the figure) are not co-planar with the triangles that comprise the individual cell faces within that range (e.g. those shaded yellow, blue etc). I have tried to show this in 2D below:
The current bisection range is shown by the brown dotted line and brown vertices. Initially, the red point is within the current range. We bisect in the X direction (bisection 1) and the red point is within the current range (dotted brown line), so we discard the right half. We now bisect in the Y direction (bisection 2) and the red point is outside this range, so we discard the top half. We eventually arrive at the final step, when we have a single index in each of the I & J directions. As shown here, this places the red point in the wrong cell.
Would appreciate any suggestions for an alternative algorithm to the one I am currently trying to implement. Stepping back, I am actually interested in calculating the faces within the grid intersected by a series of line segments, so am using the "point in a polyhedron" method as an intermediate step. I looked at geomdl which could represent each face as a NURBS object but does not seem to implement intersection between a ray and a NURBS object. I also had a quick look at the Python bindings to CGAL but that looked like a massive learning curve to climb, so put that aside. Thanks in advance!

Triangulation patterns in .ifc file format using coordinates and indexes

I've posted this in another forum as well due to the mathematical nature of the issue:
forum post
I have an .ifc file in which the raw data exported describes a wall in the xy plane by a set of coordinates and their corresponding indexes according to the link explanation:
Explanation
I have a txt where the data is divided into the coordinates in xyz space, then indexes and some other data.
I was hoping that someone can help me understand how to link the indexes to their corresponding coordinates. There are 164 coordinate pairs and 324 index pairs so it doesn't make sense to me that each index relates to only 1 coordinate pair.
The goal is to establish a relationship between indexes and coordinates such that this type of data can output the wall thickness, which is in this case '10'. I was thinking that (according to the link above) by taking the first triangle described, it should describe the edge of the wall in 3D and therefore give us one of its sides as the shortest segment in the wall which is the thickness.
I received an answer in the mentioned forum post, that I should
"...expanding out each coordinate in terms of X's, Y's, and Z's [instead of (X,Y,Z) triples] and then use every index triple to get the actual coordinate for the individual coordinate instead of one triple.
So for example you have X[], Y[] and Z[] and you have an index (a,b,c) then you find X[a], Y[b], and Z[c] not Point(a,b,c)... "
I didn't quite understand this explanation, and would appreciate any help or further explanation in order to achieve my goal.
Thank you
Let's start with the coordinates (IfcCartesianPointList3D): each one is a triplet, resulting in a point with (x,y,z) coordinates.
Then the IfcTriangulatedFaceSet uses indices to construct triangles. It has 2 indexing modes: direct and indirect via PnIndex. The indexing mode is determined by the existence of an array for PnIndex (attribute number 5). Take note that I call these variants direct and indirect - they aren't mentioned that way in the IFC documentation.
Direct indexing
PnIndex is not set. Let's look at a (simple and constructed) example:
#100=IFCCARTESIANPOINTLIST(((0,0,0),(1,0,0),(1,1,0),(0,1,0)));
#101=IFCTRIANGULATEDFACESET(
/*reference to the points*/ #100,
/*no normals*/ $,
/*no indication if closed or open*/ $,
/*coordinate indices*/ ((1,2,3),(1,3,4)),
/*no PnIndex*/ ());
This describes a square lying in the x-y-plane. Each entry in attribute CoordIndex is a triplet giving a one-based index into a point in the IfcCartesianPointList. This means there are two triangles constructed from the following points:
(0,0,0) (1,0,0) (1,1,0)
(0,0,0) (1,1,0) (0,1,0)
Indirect indexing
Let's build further on the previous example:
#100=IFCCARTESIANPOINTLIST(((0,0,0),(1,0,0),(1,1,0),(0,1,0)));
#101=IFCTRIANGULATEDFACESET(
/*reference to the points*/ #100,
/*no normals*/ $,
/*no indication if closed or open*/ $,
/*coordinate indices*/ ((1,2,3),(1,3,4)),
/*PnIndex*/ (2,3,4,1));
This time there is PnIndex set. It adds a level of indirection to access the points. Triplets from CoordIndex point into PnIndex (1-based). The value found in PnIndex is then used to access the IfcCartesianPointList.
So for the first triangle we have: (1,2,3) in CoordIndex. These point to 2, 3 and 4 in PnIndex. These result in the following points from the point list: (1,0,0) (1,1,0) (0,1,0)
Repeating the procedure for the second triangle (1,3,4), we get values 2, 4, 1 from PnIndex and the following points: (1,0,0) (0,1,0) (0,0,0)
It is again a square, but this time with a different triangulation.
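Both variants can be captured in a few lines (a sketch; the function name is illustrative):

```python
def resolve_triangles(points, coord_index, pn_index=None):
    """Turn 1-based CoordIndex triplets into point triplets.

    Direct mode (pn_index is None): each index points straight into the point list.
    Indirect mode: each index first goes through PnIndex (also 1-based).
    """
    triangles = []
    for triplet in coord_index:
        if pn_index is not None:
            triplet = tuple(pn_index[i - 1] for i in triplet)
        triangles.append(tuple(points[i - 1] for i in triplet))
    return triangles
```

Running it on the two examples above reproduces exactly the triangles listed: the direct case yields the first triangulation of the square, and passing PnIndex (2,3,4,1) yields the second.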
Now if you want to know your wall thickness you will need to calculate the extents from the resulting geometry. If your wall is aligned with the coordinate system axes this is easy (get the difference between the smallest and largest X, Y and Z). If it is not, you might need to transform the points or look further into 3D-extent calculations (my knowledge ends there).
In a triangulation, roughly num_triangles = 2 * num_vertices.
A wall (e.g. a rectangle) may be described by two triangles that share an edge and the two vertices of that edge.
Instead of describing the whole model triangle by triangle, each with its three vertices, or edge by edge, it's cheaper (it avoids repeating vertex data) to assign each vertex an index and define a triangle by the three indices of its vertices. This is usually called "indexed rendering".

K-mean clustering, make centroids not overlap nodes

I have done K-mean++ clustering and obtained the centroids of the clusters, using Python 2.7, following the method given in http://datasciencelab.wordpress.com/2014/01/15/improved-seeding-for-clustering-with-k-means/
In my problem, there is a further constraint that the distance between any centroid and any node should be larger than a constant. What is the best way to achieve this?
It is possible that one centroid is too close to several nodes.
Any suggestions on how to displace the centroids a bit?
Many thanks.
For example, the nodes to be clustered are
MyNodes = [[469500, 5802610], [468764, 5803422], [467991, 5804202], [470260, 5799949], [469486, 5800730], [468713, 5801510], [467939, 5802291], [467166, 5803072], [467966, 5800204], [467193, 5800985], [466420, 5801766], [466457, 5799700], [465678, 5800488], [464950, 5799229], [470615, 5796600], [469842, 5797405], [470320, 5794955], [469547, 5795735], [468773, 5796516], [467990, 5797297], [470062, 5793215], [469289, 5793996], [468515, 5794776], [467742, 5795557], [466969, 5796338], [466195, 5797119], [469976, 5791334], [469202, 5792115], [468429, 5792896], [467656, 5793676], [466882, 5794457], [466109, 5795238], [465336, 5796050], [464600, 5796840], [470160, 5789250], [469354, 5789972], [468581, 5790753], [467808, 5791534], [467034, 5792315], [466261, 5793096], [465488, 5793877], [464714, 5794658], [463941, 5795499], [463150, 5796210], [469500, 5787920], [468698, 5788614], [467925, 5789395], [467152, 5790176]]
Centroids = [[ 467839.6, 5793224.1], [ 467617.22222222, 5800489.94444444]]
Centroid[0] would be too close to node[29], and Centroid[1] would be too close to node[8].
If I understand your question correctly, and I am not at all sure whether I do, then the solution is already apparent from your drawings:
You want the point that is closest to a given centroid point and is at least the minimum distance away from every point in a set of node points.
Draw a circle around each node point, with your minimum distance as the radius.
Intersect each circle with each other circle, and note the intersection points.
Discard any intersection point that is closer than the minimum distance to a node point.
From the remaining intersection points, take the one closest to the centroid point. That is your new displaced centroid.
Runtime for the naive implementation should be O(number_of_nodes^2), though you can optimize that by using some fast nearest-neighbour lookup data structure, such as a kd-tree, when you intersect the circles and discard the intersection points that are too close.
This should be the optimal solution; no matter which algorithm you use, you cannot find a point closer to the original centroid that fits the minimum distance constraint; and this algorithm should always find that optimal point.
Though since the centroid is generally surrounded by node points, the new point may be quite some distance away if the node points are densely packed.
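A sketch of the construction in 2D (helper names are illustrative; the degenerate case where only a single node violates the constraint would additionally need the projection of the centroid onto that one circle as a candidate):

```python
import math

def circle_intersections(c1, c2, r):
    """Intersection points of two circles of equal radius r."""
    (x1, y1), (x2, y2) = c1, c2
    dx, dy = x2 - x1, y2 - y1
    d = math.hypot(dx, dy)
    if d == 0 or d > 2 * r:
        return []                     # coincident or too far apart
    h = math.sqrt(r * r - (d / 2) ** 2)
    mx, my = x1 + dx / 2, y1 + dy / 2
    ux, uy = -dy / d, dx / d          # unit vector perpendicular to the center line
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]

def displace_centroid(centroid, nodes, min_dist):
    """Move the centroid to the nearest point at least min_dist from every node."""
    eps = 1e-9
    cx, cy = centroid
    if all(math.hypot(cx - nx, cy - ny) >= min_dist for nx, ny in nodes):
        return centroid               # constraint already satisfied
    candidates = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            candidates += circle_intersections(nodes[i], nodes[j], min_dist)
    valid = [p for p in candidates
             if all(math.hypot(p[0] - nx, p[1] - ny) >= min_dist - eps
                    for nx, ny in nodes)]
    return min(valid, key=lambda p: math.hypot(p[0] - cx, p[1] - cy),
               default=centroid)
```

The discard step is the quadratic part; a kd-tree over the nodes would speed up both the candidate filtering and the initial constraint check, as noted above.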

Finding n nearest data points to grid locations

I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of Xstep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to have a set of for loops going through each grid point, with an embedded for loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones.
for i in range(XG.shape[0]):
    for j in range(YG.shape[0]):
        for k in range(ZG.shape[0]):
            Distance = np.zeros(XD.shape)
            for a in range(XD.shape[0]):
                Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
            B = np.zeros(20, int)
            for a in range(20):
                indx = np.argmin(Distance)
                B[a] = indx
                Distance[indx] = np.inf
This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.
Have a look at a seemingly similar but 2D problem and see if you cannot improve on it with ideas from there.
From the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point you'll quickly locate points in those three arrays and start from there.
Also, you don't really need the full Euclidean distance, since you are only interested in relative distances: comparing squared distances (skipping the square root) preserves the ordering. The Manhattan distance
abs(deltaX) + abs(deltaY) + abs(deltaZ)
is an even cheaper approximation, though it can occasionally change which points count as "closest". Either way you save on the expensive powers and square roots.
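The coordinate-sorting idea above is essentially what a k-d tree formalizes; with SciPy's cKDTree (assuming SciPy is an option; the data here is random stand-in data), the whole 20-nearest-neighbour search collapses to a single query:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
XD, YD, ZD = rng.random((3, 100_000))    # stand-in data point coordinates
FD = rng.random(100_000)                 # stand-in function values

tree = cKDTree(np.column_stack([XD, YD, ZD]))

# Indices of the 20 data points nearest one grid location:
dists, B = tree.query([0.5, 0.5, 0.5], k=20)
nearest_values = FD[B]
```

tree.query also accepts an array of query points, so all 100x100x100 grid locations can be handled in one vectorized call instead of a Python loop.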
No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
Allocate each data point to its closest center point.
Reposition each center point to the coordinates that would minimize the average distance from "its" points (to the "centroid" of the data subset).
Some data points now have a different closest center point. Repeat steps 1. and 2. until you converge (or near enough).
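The repositioning loop in the addendum is essentially Lloyd's algorithm (the k-means update); a compact sketch with illustrative names:

```python
import numpy as np

def lloyd(points, centers, iters=20):
    """Lloyd-style iteration: assign points to nearest center, then recenter."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        # Squared distances from every point to every center, shape (n_points, n_centers).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)              # step 1: closest center per point
        for k in range(len(centers)):
            members = points[assign == k]
            if len(members):
                centers[k] = members.mean(axis=0)  # step 2: move to the centroid
    return centers, assign
```

A fixed iteration count stands in for a proper convergence check (stop when assignments no longer change).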

Networkx graph clustering

in Networkx, how can I cluster nodes based on nodes color? E.g., I have 100 nodes, some of them are close to black, while others are close to white. In the graph layout, I want nodes with similar color stay close to each other, and nodes with very different color stay away from each other. How can I do that? Basically, how does the edge weight influence the layout of spring_layout? If NetworkX cannot do that, is there any other tools can help to calculate the layout?
Thanks
Ok, let's build an adjacency matrix W for that graph following a simple procedure:
if two adjacent vertices i and j have the same color, then the weight W_{i,j} of the edge between them is a big number (which you will tune in your experiments later); otherwise it is some small number, chosen analogously.
Now, let's write the Laplacian of the matrix as
L = D - W, where D is a diagonal matrix with elements d_{i,i} equal to the sum of the i-th row of W.
Now, one can easily show that the value of
f L f^T, where f is some arbitrary vector, is small when vertices connected by large adjacency weights have close values in f. You may think of this as a way to set up a coordinate system for the graph in which the i-th vertex has coordinate f_i in 1D space.
Now, let's choose a number of such vectors f^k, which give us a representation of the graph as a set of points in a Euclidean space in which, for example, k-means works: the i-th vertex of the initial graph now has coordinates f^1_i, f^2_i, ..., and adjacent vertices of the same color in the initial graph will be close in this new coordinate space.
The question of how to choose the vectors f is a simple one: just take a couple of eigenvectors of the matrix L corresponding to small but nonzero eigenvalues.
This is a well known method called spectral clustering.
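A minimal sketch of the embedding step with NumPy (the weight values here are the tunable "big" and "small" numbers mentioned above):

```python
import numpy as np

def spectral_embedding(W, k=1):
    """Embed vertices via the k smallest nonzero eigenvectors of L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
    return eigvecs[:, 1:k + 1]             # skip the trivial constant eigenvector

# Vertices 0,1 share a color (big weight), as do 2,3; cross edges get small weight.
W = np.array([[0.0, 10.0, 0.1, 0.1],
              [10.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.0, 10.0],
              [0.1, 0.1, 10.0, 0.0]])
f = spectral_embedding(W)[:, 0]   # Fiedler vector: 1D coordinates for the vertices
```

Feeding these coordinates to k-means (or directly to a force-directed layout as initial positions) groups the same-colored vertices together.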
Further reading:
The Elements of Statistical Learning: Data Mining, Inference, and Prediction. by Trevor Hastie, Robert Tibshirani and Jerome Friedman
which is available for free from the authors page http://www-stat.stanford.edu/~tibs/ElemStatLearn/
