This question already has answers here:
Looking for a fast way to find the polygon a point belongs to using Shapely
I have a Series of Shapely polygons (>1000) that do not overlap. I want to introduce a new Shapely point and quickly find out which polygon the point lies in. I have a for loop for this, but I am looking for a method that is faster.
import pandas as pd
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
test_points = pd.Series([[(0,1), (1,1), (1,0)], [(0,0), (0,1), (1,0)]])
# a Dataframe containing my polygons and an id
polygonized_points = pd.DataFrame({"polygons" : test_points.map(lambda x : Polygon(x)), "id" : range(0, len(test_points), 1)})
# a new point
new_point = Point(0.4, 0.3)
# allocation of point to hexes (which I want to be faster)
for idx, row in polygonized_points.iterrows():
    if row.polygons.contains(new_point):
        new_point_in_id = row.id  # imagine this would be a df with an empty column for the id variable
I'm fairly sure I've missed something to speed this up, because I don't think the for loop scales well. Thank you for your help! Best
The for loop is not the problem in this case: the point-in-polygon test is slow. Optimizing your code means minimizing the number of point-in-polygon tests, which is typically done using a spatial index. This answer from GIS Stack Exchange, https://gis.stackexchange.com/a/119935, does a good job of listing a number of possible spatial index strategies. The for loop itself, with some thousands of iterations, is of no concern. A very good option is an R-tree, for example from this Python package: https://toblerity.org/rtree/. The R-tree is built on the bounding boxes of your polygons and can be searched very efficiently. After querying it, you perform the costly point-in-polygon test only for the polygons whose bounding box contains the point (usually 2-5).
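As a rough illustration of that strategy, here is a minimal sketch using the rtree package, reusing polygonized_points and new_point from the question (a sketch, not a drop-in solution):
from rtree import index
from shapely.geometry import Point

# Build the R-tree once, over the bounding boxes of the polygons.
rtree_idx = index.Index()
for i, poly in enumerate(polygonized_points.polygons):
    rtree_idx.insert(i, poly.bounds)  # bounds = (minx, miny, maxx, maxy)

new_point = Point(0.4, 0.3)

# The index returns only candidates whose bounding box contains the point;
# the costly contains() test runs on those few candidates only.
new_point_in_id = None
for i in rtree_idx.intersection(new_point.bounds):
    if polygonized_points.polygons.iloc[i].contains(new_point):
        new_point_in_id = polygonized_points.id.iloc[i]
        break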
Related
I am trying to create a minimum spanning tree (MST) from geographical coordinates using scipy, but for the life of me I cannot understand how to extract information from it. The scipy documentation is not very clear, and multiple searches have not provided results.
For context: in total I have around 200k data points per set, and they look like this
My final objective is to create a line vector that connects these points through the MST, more or less as they appear in the image above. But for that I need an ordered list of point indices (or coordinates) I can work with.
Most of all, I need help understanding how to use the output of minimum_spanning_tree, but it might be that I am making mistakes along the way.
Overall steps
The steps I take are:
Create the sparse matrix with coordinate info
Provide the matrix to scipy.sparse.csgraph.minimum_spanning_tree
Do some magic to extract column values
This is the small sample test data:
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

test_data = {
    "index": [0, 1, 2, 3, 4],
    "X": [35, 36, 37, 38, 38],
    "Y": [2113, 2113, 2112, 2101, 2102]
}
df = pd.DataFrame(test_data)
Step 1, create the sparse matrix
xs = df[["X"]].values.squeeze().astype(int)
ys = df[["Y"]].values.squeeze().astype(int)
data = np.array(df.index).squeeze().astype(int)
max_dim = max(np.max(xs), np.max(ys)) + 1
max_dim
dist_matr = csr_matrix((data, (xs, ys)), shape=(max_dim, max_dim))
Q1: I couldn't understand what data is in this context, as the scipy docs do not explain it in detail. Should data be the labels of the points, or are they the edge weights?
Step 2: calculate the minimum spanning tree
mst = minimum_spanning_tree(dist_matr)
Step 3: get an ordered list of indices (or coordinates)
As I understand it the output of MST is a sparse graph that should look something like this (source)
Q2: However, my matrix is not 5X5, but max_value*max_value (2113 in this case). And it seems like the content of the matrix is not the edge weight. Am I getting this wrong?
I have tried to extract the connected components, but the labels don't make sense to me
# Label connected components.
num_graphs, labels = connected_components(mst, directed=False)
# This is a snippet I found somewhere but I have difficulties following the logic of it
results = [[] for i in range(max(labels) + 1)]
for idx, label in enumerate(labels):
    results[label].append(idx)
A portion of the results:
As you can see, the point coordinates are grouped in an odd way, without a relationship between x and y. I have also tried 'depth_first_order', but aside from requiring a starting point (which I wouldn't know how to choose), it gives me equally confusing output.
Q4: How do I "read" the MST matrix and extract the minimum spanning tree for all points?
I am happy to explore other solutions as long as they provide a similar result and are scalable; however, I have seen concerns about NetworkX for large amounts of data, and MisTree doesn't install on my setup.
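For reference, here is a minimal sketch of the workflow I think is intended, assuming minimum_spanning_tree should be given an n x n matrix of pairwise distances between the points (so row/column positions are point indices and the entries are edge weights); I am not certain this is the right reading of the docs:
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

df = pd.DataFrame({"X": [35, 36, 37, 38, 38],
                   "Y": [2113, 2113, 2112, 2101, 2102]})

# n x n matrix of pairwise Euclidean distances between the points.
coords = df[["X", "Y"]].to_numpy()
dist_matr = squareform(pdist(coords))

# The MST comes back as a sparse n x n matrix whose nonzero entries
# are the weights of the edges kept in the tree.
mst = minimum_spanning_tree(dist_matr)

# Each (i, j, w) triple is one edge of the tree: it connects point i
# to point j and has length w.
mst_coo = mst.tocoo()
for i, j, w in zip(mst_coo.row, mst_coo.col, mst_coo.data):
    print(f"edge {i} -> {j}, length {w:.2f}")
If that reading is right, an ordered walk along the tree could then be obtained with scipy.sparse.csgraph.depth_first_order on mst, starting from any node.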
I have two arrays of polygons called grid_1 and grid_2. grid_1 has 1,500,000 polygons and grid_2 has 60,000 polygons. As it is much too computationally expensive to calculate the intersecting polygon of each combination and then calculate its area, I instead would like to first calculate those polygons that intersect and then proceed. With nested for loops, it could look something like this:
from shapely.geometry import Polygon
from scipy import sparse
intersection_bool = sparse.lil_matrix((len(grid_1), len(grid_2))).astype("bool")
for i in range(len(grid_1)):
    for j in range(len(grid_2)):
        if grid_1[i].intersects(grid_2[j]):
            intersection_bool[i, j] = True
However, this is still too computationally expensive. This question recommends the use of STRtrees from shapely. However, I am not sure what the best way to do this efficiently is. I have tried the implementation below, but it is slower than the nested for loops. I think STRtrees are my answer, I am just not sure how to use them most efficiently. Thank you!
from shapely.strtree import STRtree
from shapely.geometry import Polygon
from scipy import sparse
intersection_bool = sparse.lil_matrix((len(grid_1), len(grid_2))).astype("bool")
grid_1_tree = STRtree(grid_1)
for j in range(len(grid_2)):
    result = grid_1_tree.query(grid_2[j])
    for i in range(len(grid_1)):
        if grid_1[i] in result:
            intersection_bool[i, j] = True
It is not yet the ideal answer, but something that sped up my code roughly 3x was putting my polygons into a geopandas.GeoSeries, which vectorizes the intersects test and brings me down to one for loop.
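A minimal sketch of that idea, assuming grid_1 and grid_2 are sequences of Shapely polygons as in the question:
import numpy as np
import geopandas as gpd
from scipy import sparse

grid_1_series = gpd.GeoSeries(grid_1)
intersection_bool = sparse.lil_matrix((len(grid_1), len(grid_2)), dtype=bool)

for j, poly in enumerate(grid_2):
    # One vectorized element-wise intersects() test of all of grid_1 against poly.
    mask = grid_1_series.intersects(poly).to_numpy()
    for i in np.flatnonzero(mask):
        intersection_bool[i, j] = True
(I believe that with Shapely 2.x, STRtree.query(grid_2[j]) returns integer positions into grid_1 directly, which would also avoid the slow "grid_1[i] in result" membership scan from the question.)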
I am trying to join two spatial datasets. The first contains points and the second polygons.
However, some of the points are outside of the polygons.
Is there a simple way of joining/snapping these points to the nearest polygon boundary, not the nearest polygon centroid?
At the moment I am joining to the nearest polygon centroid but this does not yield the results I am looking for.
You need to put all of the points (not the polygon points) into a KD-tree, using something like the sklearn package, which contains an efficient nearest-neighbours implementation. In Python it can be imported using:
import sklearn.neighbors as neighbors
If you have about 10 million polygons you only need a tree depth of 12 for it to be efficient. You can experiment with this; if there are fewer than 100,000, a leaf_size=9 might be enough. Putting all the points (in one single array) into a tree is done as follows:
tree = neighbors.KDTree( arrayOfPoints, leaf_size=12 )
Then you iterate over each polygon and the individual points in each polygon to find, for instance, the nearest 5 points. The algorithm is extremely quick at finding these because of the nature of the KD-tree; brute-force comparison can be 1000 times slower (as I found for massive data sets).
shortestDistances, closestIndices = tree.query( pointInPolygon, k=5 )
You might just want the nearest point, in which case you can set k=1; closestIndices[0] then gives the actual index into the point array.
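Pulling those snippets together, a minimal self-contained sketch (the arrays here are invented purely for illustration):
import numpy as np
import sklearn.neighbors as neighbors

# Invented data: the standalone points, and one polygon's boundary vertices.
arrayOfPoints = np.array([[0.0, 0.0], [2.0, 1.0], [5.0, 5.0], [9.0, 3.0]])
polygonVertices = np.array([[1.0, 1.0], [4.0, 4.0]])

# Build the KD-tree over the points once.
tree = neighbors.KDTree(arrayOfPoints, leaf_size=12)

# For each polygon vertex, find the nearest point and its distance.
shortestDistances, closestIndices = tree.query(polygonVertices, k=1)
for vertex, dist, idx in zip(polygonVertices, shortestDistances[:, 0], closestIndices[:, 0]):
    print(f"vertex {vertex} -> point index {idx}, distance {dist:.3f}")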
This is not a complete answer, but you can check distances between points and polygon boundaries using Shapely:
from shapely.geometry import Point, Polygon
p = Point(0,0)
poly = Polygon([[1,0], [1,10], [10, 0]])
print(p.distance(poly))
Edit:
So if you're not using big datasets, you could do something like:
my_points = [...]
my_polys = [...]
dict_point_poly = {}
for point in my_points:
    try:
        intersecting_poly = [poly for poly in my_polys
                             if point.intersects(poly)][0]
        this_poly = intersecting_poly
    except IndexError:
        distances = [(poly, point.distance(poly)) for poly in my_polys]
        distances.sort(key=lambda x: x[1])
        this_poly = distances[0][0]
    finally:
        dict_point_poly[point] = this_poly
(not the most efficient method, but one that is easily understood, I think)
I have several points (on the order of M) [(x0,y0),...,(xn,yn)]. I also have a limited number of hexagons. I want to find which hexagon each point falls in. Using Shapely this can be done one point at a time, so the loop below does the job, but is there any other way to do it faster?
import numpy as np
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

zones = np.zeros(n) - 1
for j, p in enumerate(points):
    point = Point(p)
    for i in range(len(poly_hex)):  # poly_hex = list of hexagonal polygons
        polygon = Polygon(poly_hex[i])
        if polygon.contains(point):
            zones[j] = int(i)
            break
Some ideas to speed things up:
Don't convert your hexagons to polygon = Polygon(poly_hex[i]) at each step of the inner loop. Calculate them once before the loop and store them in a list.
As the surface of a hexagon is close to a circle, also make a list of just the centre and the radius of each hexagon. Inside the loop, first test the distance between the point p and the centre of each hexagon, comparing it to the radius. Only when that distance is less than the radius, do the explicit test of whether the point is inside the hexagon (see the sketch after this list).
If this still isn't fast enough, and depending on how small and how far apart the hexagons are relative to your points, you could speed things up further with a quadtree or a grid, but that is less trivial. If most of the points are outside all hexagons, you could first test whether p is inside the convex hull of the hexagons.
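A minimal sketch of the first two ideas, reusing points, poly_hex and n from the question; the radius used is the centre-to-corner distance, so the circle test can only discard hexagons, never wrongly accept a point:
import numpy as np
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon

# Precompute the polygons, their centres and their circumradii once.
hex_polys = [Polygon(h) for h in poly_hex]
centres = [p.centroid.coords[0] for p in hex_polys]
radii = [max(np.hypot(vx - c[0], vy - c[1]) for vx, vy in h)
         for h, c in zip(poly_hex, centres)]

zones = np.zeros(n, dtype=int) - 1
for j, p in enumerate(points):
    pt = Point(p)
    for i, (poly, c, r) in enumerate(zip(hex_polys, centres, radii)):
        # Cheap circle test first; the costly contains() only for candidates.
        if np.hypot(p[0] - c[0], p[1] - c[1]) <= r and poly.contains(pt):
            zones[j] = i
            break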
Some remarks:
I don't understand your zones[j] = int(i): you convert i to an integer, while it already is an integer. Also, the way you store this means a point can only be inside at most one hexagon; therefore, you can stop the loop as soon as you encounter such a hexagon. (There is also the rare case where a point lies exactly on the edge between two hexagons. It might then be assigned to the first hexagon encountered, or, due to rounding errors, even end up outside both.)
If you want to save the zones as integers, create it as: zones = np.zeros(n, dtype=int)-1.
If your hexagons form a regular hexagonal grid, you can find the exact hexagon simply from the x and y coordinates without the need for in-polygon testing.
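For the last remark, a minimal sketch of mapping a point straight to its cell, assuming a regular pointy-top grid with one hexagon centred at the origin and centre-to-corner distance size (these grid conventions are assumptions chosen for the example):
import math

def point_to_hex(x, y, size):
    # Fractional axial coordinates of the point on the assumed grid.
    q = (math.sqrt(3) / 3 * x - 1 / 3 * y) / size
    r = (2 / 3 * y) / size
    # Round in cube coordinates so the result is always a valid cell.
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return rx, rz  # axial (q, r) of the containing hexagon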
For a certain purpose, I want to plot a polygon based on several latitude/longitude endpoints combined together.
The example data looks like this:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Polygon
from mpl_toolkits.basemap import Basemap

fig = plt.figure()
ax = plt.gca()
x_map1, x_map2 = 114.166,114.996
y_map1, y_map2 = 37.798,38.378
map = Basemap(llcrnrlon=x_map1,llcrnrlat=y_map1,urcrnrlon=x_map2,urcrnrlat=y_map2)
map.drawparallels(np.arange(y_map1+0.102,y_map2,0.2),labels=[1,0,0,1],size=14,linewidth=0,color= '#FFFFFF')
map.drawmeridians(np.arange(x_map1+0.134,x_map2,0.2),labels=[1,0,0,1],size=14,linewidth=0)
bo_x = [114.4390022, 114.3754847, 114.3054522, 114.3038236, 114.2802081, 114.2867228, 114.3378847, 114.3888619, \
114.6288783, 114.6848733, 114.7206292, 114.7341219]
bo_y = [38.16671389, 38.14472722, 38.14309861, 38.10156778, 38.08853833, 38.06980889, 38.03587472, 37.96409056, \
37.84975278, 37.84840333, 37.9017, 38.16683306]
x, y = map(bo_x, bo_y)
xy = list(zip(x, y))
poly = Polygon(xy, facecolor='red', alpha=0.4)
plt.gca().add_patch(poly)
The figure looks like this:
But when the lon and lat arrays are not in anticlockwise order, and they contain too many items to adjust manually, the polygon output may come out malformed.
Here, I shuffle bo_x and bo_y to simulate such a situation.
bo_x_adjust = [114.4390022, 114.3754847, 114.3054522, 114.3038236, 114.6288783, 114.6848733, 114.7206292, 114.7341219,
114.2802081, 114.2867228, 114.3378847, 114.3888619, ]
bo_y_adjust = [38.16671389, 38.14472722, 38.14309861, 38.10156778, 37.84975278, 37.84840333, 37.9017, 38.16683306,
38.08853833, 38.06980889, 38.03587472, 37.96409056, ]
The figure then looks like this:
So, here is my question: sometimes the original endpoints are not in an order that yields a proper closed polygon, and pre-organizing the arrays seems to be the way to go.
I think adjusting the order of arrays like bo_x and bo_y must follow two principles:
Elements of the two arrays should be adjusted synchronously, so as not to break the endpoint pairs (X~Y).
The new arrays should trace the outline in clockwise or anticlockwise order in 2-D space.
Any advice or guidelines would be appreciated.
Not an answer yet, but I needed the ability to attach images.
The problem may be ill defined. For example, these two legitimate polygons have the same vertices.
Do you want to get either one?
Here is a way to solve this with linear algebra. Sorry, I am only writing general guidelines, but it should nonetheless work.
Write a function that accepts two edge numbers j and k and checks whether those edges intersect. Note that you need to correctly handle the edge from the last vertex back to the first. You also need to make sure it returns False for adjacent edges, since these always 'intersect' by definition (they share a vertex).
The way to know whether two edges intersect is a little algebra. Extract from each edge its straight-line parameters a and b, from y = a*x + b. Then solve for the intersection x of the two lines by equating a1*x + b1 == a2*x + b2, i.e. x = (b2 - b1) / (a1 - a2). If this intersection x lies between the x's of the vertices of both edges, then the two edges indeed intersect.
Write a function that goes over all edge pairs and tests for intersection. Only when no intersection exists is the polygon legitimate.
Next, you can take one of two approaches:
Comprehensive approach - Go over all possible permutations of the vertices and test each permutation's polygon for intersections. Note that when permuting, you need to permute x and y together. Note also that there are a lot of permutations, so this could be very time consuming.
Greedy approach - As long as there are still intersections, go over the edge-pair combinations, and whenever there is an intersection, simply swap the two last edge coordinates (unwind the intersection). Then restart going over all the edge pairs. Repeat until there are no more intersections. This should work pretty fast, but it will not give the best polygon (e.g. it will not maximise the polygon area).
Hope this helps...
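A minimal sketch of the intersection test described above, using the y = a*x + b parametrisation; vertical and exactly collinear edges are skipped for brevity and would need special handling, and verts is an assumed list of (x, y) vertex pairs:
def edges_intersect(verts, j, k):
    # Edge i runs from verts[i] to verts[(i + 1) % n]; adjacent edges
    # (which share a vertex) are reported as non-intersecting.
    n = len(verts)
    if j == k or (j + 1) % n == k or (k + 1) % n == j:
        return False
    (x1, y1), (x2, y2) = verts[j], verts[(j + 1) % n]
    (x3, y3), (x4, y4) = verts[k], verts[(k + 1) % n]
    # Straight-line parameters y = a*x + b (assumes no vertical edges).
    a1 = (y2 - y1) / (x2 - x1)
    b1 = y1 - a1 * x1
    a2 = (y4 - y3) / (x4 - x3)
    b2 = y3 - a2 * x3
    if a1 == a2:
        return False  # parallel edges never cross (collinear overlap ignored)
    # Solve a1*x + b1 == a2*x + b2 for the crossing point.
    x = (b2 - b1) / (a1 - a2)
    return min(x1, x2) < x < max(x1, x2) and min(x3, x4) < x < max(x3, x4)

def polygon_is_legitimate(verts):
    # True when no two non-adjacent edges of the polygon cross.
    n = len(verts)
    return not any(edges_intersect(verts, j, k)
                   for j in range(n) for k in range(j + 1, n))
The greedy approach would then repeatedly scan the edge pairs with edges_intersect and swap vertices until polygon_is_legitimate holds.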