I am trying to join two spatial datasets. The first contains points and the second polygons.
However, some of the points are outside of the polygons.
Is there a simple way of joining/snapping these points to the nearest polygon boundary, rather than the nearest polygon centroid?
At the moment I am joining to the nearest polygon centroid but this does not yield the results I am looking for.
You need to put all of the points (not the polygon vertices) into a KD-tree, using something like the sklearn package, which contains an efficient nearest-neighbours implementation. In Python it can be imported with:
import sklearn.neighbors as neighbors
If you have about 10 million polygons, a leaf_size of around 12 should be enough for it to be efficient; you can experiment with this. With fewer than 100,000, leaf_size=9 might be enough. Putting all the points (in one single array) into a tree is done as follows:
tree = neighbors.KDTree( arrayOfPoints, leaf_size=12 )
Then you iterate over each polygon and the individual points in each polygon to find the nearest 5 points (for instance). The algorithm is very quick at finding these because of the nature of the KD-tree; a brute-force comparison can be 1000 times slower (as I found for massive datasets).
shortestDistances, closestIndices = tree.query( pointInPolygon, k=5 )
You might just want the nearest point, in which case you can set k=1; closestIndices[0] then gives the actual array index into the point list.
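For what it's worth, here is a minimal sketch of how the pieces could fit together with shapely and sklearn. Note that, unlike the description above, it builds the tree from the polygon boundary vertices and queries it with the points (the reverse direction, but the same KDTree machinery); the sample points and polygons are made up for illustration.
import numpy as np
import sklearn.neighbors as neighbors
from shapely.geometry import Point, Polygon
# Made-up data for illustration
points = [Point(0, 0), Point(25, 25)]
polygons = [Polygon([(1, 0), (1, 10), (10, 0)]),
            Polygon([(20, 20), (20, 30), (30, 20)])]
# Collect every boundary vertex, remembering which polygon it belongs to
vertices, owner = [], []
for i, poly in enumerate(polygons):
    for xy in poly.exterior.coords:
        vertices.append(xy)
        owner.append(i)
tree = neighbors.KDTree(np.array(vertices), leaf_size=12)
# Nearest boundary vertex (and hence nearest polygon boundary) for each point
dist, idx = tree.query([(p.x, p.y) for p in points], k=1)
nearest_poly = [owner[i[0]] for i in idx]
print(nearest_poly)  # e.g. [0, 1]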
This is not a complete answer, but you can check distances between points and polygons' boundaries using shapely:
from shapely.geometry import Point, Polygon
p = Point(0,0)
poly = Polygon([[1,0], [1,10], [10, 0]])
print(p.distance(poly))
Edit:
So if you're not using big datasets, you could do something like:
my_points = [...]
my_polys = [...]
dict_point_poly = {}
for point in my_points:
    try:
        intersecting_poly = [poly for poly in my_polys if point.intersects(poly)][0]
        this_poly = intersecting_poly
    except IndexError:
        distances = [(poly, point.distance(poly)) for poly in my_polys]
        distances.sort(key=lambda x: x[1])
        this_poly = distances[0][0]
    finally:
        dict_point_poly[point] = this_poly
(not the most efficient method, but one that is easily understood, I think)
I have a set of 2000 geospatial points (lon/lat), which I need to match with several other geospatial datasets (I am using Geopandas GeoDataFrames). I am using the sklearn BallTree function to find the neighbors within a certain radius of each point (in the function below, point is one of the 2000 points and right_gdf is the dataset that I need to get the neighbors from).
I am currently using a for loop to loop through all of the 2000 points and find the neighbors for each of them. However, depending on the size of right_gdf, this can take a long time. I am sure there is a way to speed this process up, potentially with parallel computing, but I am struggling to find it. I tried to use Dask delayed to parallelise the loop (see code below), but somehow this takes even longer than the simple for loop.
# Function that finds a point's neighbors within a certain radius
def neighbours_radius(point, right_gdf, R=1):
    # Create tree from the right gdf (use haversine for lat/lon coordinates)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # Find indices of all neighbors within R
    indices = tree.query_radius(point, r=R)[0]
    return indices
# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = neighbours_radius(point, right_gdf, R=R)
        # append index lists
        neighbors.append(ind)
    return neighbors
# Function that loops through the 2000 points with Dask delayed
def knn_gpd_dask(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = delayed(neighbours_radius)(point, right_gdf, R=R)
        # append index list
        neighbors.append(ind)
    result = compute(neighbors)
    return result
Can anyone help me speed up this process?
If you profile your code, I suspect you will find that creating the BallTree is taking up most of the time, because you are creating it 2000 times. You should create it only once, for example like this:
# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Create tree from the right gdf only once (use haversine for lat/lon coordinates)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        # Find indices of all neighbors within R
        indices = tree.query_radius(point, r=R)[0]
        # append index lists
        neighbors.append(indices)
    return neighbors
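As a further speed-up (this is my addition, based on the sklearn API rather than the answer above): BallTree.query_radius accepts an array of query points, so once the tree is built the Python-level loop can be dropped entirely. A sketch, keeping the placeholder read_file call from the question:
# Build the tree once and query all 2000 points in a single call.
# Caveat: with metric='haversine', both the coordinates and the radius are expected
# in radians, which the original code does not appear to account for.
def knn_gpd_vectorised(right_gdf, R=75):
    base = gpd.read_file(...)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # query_radius returns one array of neighbor indices per query point
    neighbors = tree.query_radius(base, r=R)
    return neighbors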
I have a Series of Shapely polygons (>1000) that do not overlap. I want to introduce a new Shapely point and quickly find out which polygon the point lies in. I have a for loop for this, but I am looking for a faster method.
import pandas as pd
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
test_points = pd.Series([[(0,1), (1,1), (1,0)], [(0,0), (0,1), (1,0)]])
# a DataFrame containing my polygons and an id
polygonized_points = pd.DataFrame({"polygons" : test_points.map(lambda x : Polygon(x)), "id" : range(0, len(test_points), 1)})
# a new point
new_point = Point(0.4, 0.3)
# allocation of point to hexes (which I want to be faster)
for idx, row in polygonized_points.iterrows():
    if row.polygons.contains(new_point):
        new_point_in_id = row.id  # imagine this would be a df with an empty column for the id variable
I'm sure I have missed something that would speed this up, because I don't think the for loop scales well. Thank you for your help!
The for loop is not the problem in this case: the point-in-polygon test is slow. Optimizing your code means minimizing the number of point-in-polygon tests, which is typically done using a spatial index. This answer from GIS Stack Exchange does a good job of listing a number of possible spatial index strategies: https://gis.stackexchange.com/a/119935. With only a few thousand iterations, the for loop itself is of no concern. A very good option is an R-tree, for example from this Python package: https://toblerity.org/rtree/. The R-tree searches efficiently over the bounding boxes of your polygons; after that you perform the costly point-in-polygon test only for the polygons whose bounding box contains the point (usually 2-5).
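A minimal sketch of that approach with the rtree package, reusing polygonized_points and new_point from the question (variable names here are mine):
from rtree import index
# Index the bounding box of every polygon once
idx = index.Index()
for i, poly in enumerate(polygonized_points.polygons):
    idx.insert(i, poly.bounds)
# Only the few polygons whose bounding box contains the point are tested exactly
for i in idx.intersection(new_point.bounds):
    if polygonized_points.polygons.iloc[i].contains(new_point):
        new_point_in_id = polygonized_points.id.iloc[i]
        break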
I have a large list of polygons (>10^6), most of which are non-intersecting, but some of these polygons are holes of other polygons (~10^3 cases). Here is an image to explain the problem: the smaller polygon is a hole in the larger polygon, but both are independent polygons in the list of polygons.
Now I would like to efficiently determine which polygons are holes and subtract the holes, i.e. subtract the smaller polygons which lie completely inside another polygon, and return a list of "cleaned" polygons. A pair of hole and parent polygon should be transformed like this (so basically the hole is subtracted from the parent):
There are plenty of similar questions on Stack Overflow and gis.stackexchange.com, but I haven't found one that actually solves this problem. Here are some related questions:
1. https://gis.stackexchange.com/questions/5405/using-shapely-translating-between-polygons-and-multipolygons
2. https://gis.stackexchange.com/questions/319546/converting-list-of-polygons-to-multipolygon-using-shapely
Here is some sample code.
from shapely.geometry import Point
from shapely.geometry import MultiPolygon
from shapely.ops import unary_union
import numpy as np
# Generate a list of polygons, where some are holes in others
def generateRandomPolygons(polygonCount=100, areaDimension=1000, holeProbability=0.5):
    pl = []
    radiusLarge = 2  # In the real dataset the size of polygons can vary
    radiusSmall = 1  # Size of holes can also vary
    for i in range(polygonCount):
        x, y = np.random.randint(0, areaDimension, (2))
        rn1 = np.random.random(1)
        pl.append(Point(x, y).buffer(radiusLarge))
        if rn1 < holeProbability:  # With a holeProbability, add a hole in the large polygon that was just added to the list
            pl.append(Point(x, y).buffer(radiusSmall))
    return pl
polygons = generateRandomPolygons()
print(len(polygons))
(Output figure omitted.)
Now, how can I create a new list of polygons with the holes removed? Shapely provides functions to subtract one polygon from another (difference), but is there a similar function for lists of polygons (maybe something like unary_union, but where overlaps are removed)? Alternatively, how can I efficiently determine which polygons are holes and then subtract them from the larger polygons?
Your problem is that you don't know which ones are "holes", right? To "efficiently determine which polygons are holes", you can use an R-tree to speed up the intersection check:
from rtree.index import Index
# create an rtree for efficient spatial queries
rtree = Index((i, p.bounds, None) for i, p in enumerate(polygons))
donuts = []
for i, this_poly in enumerate(polygons):
    # loop over indices of approximately intersecting polygons
    for j in rtree.intersection(this_poly.bounds):
        # ignore the intersection of this polygon with itself
        if i == j:
            continue
        other_poly = polygons[j]
        # ensure the polygon fully contains our match
        if this_poly.contains(other_poly):
            donut = this_poly.difference(other_poly)
            donuts.append(donut)
            break  # quit searching
print(len(donuts))
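The loop above only collects the donuts; to build the full "cleaned" list the question asks for, one option (my sketch, assuming a hole always sits inside exactly one parent and is never a parent itself) is a two-pass version:
# First pass: find the indices of all holes
hole_indices = set()
for i, this_poly in enumerate(polygons):
    for j in rtree.intersection(this_poly.bounds):
        if i != j and this_poly.contains(polygons[j]):
            hole_indices.add(j)
# Second pass: drop the holes and subtract them from their parents
cleaned = []
for i, this_poly in enumerate(polygons):
    if i in hole_indices:
        continue
    result = this_poly
    for j in rtree.intersection(this_poly.bounds):
        if i != j and this_poly.contains(polygons[j]):
            result = result.difference(polygons[j])
    cleaned.append(result)
print(len(cleaned))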
I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.ckdtree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
...
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the IDs of all pairs of points that are within a cutoff distance "d" of each other.
scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the IDs of all pairs of points that are within a cutoff distance d of each other, as you desired.
Shortcomings
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
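Since the question mentions scipy.spatial.ckdtree: for much larger point sets, where the full N x N distance matrix would not fit in memory, a KD-tree can return only the qualifying pairs directly. A minimal sketch using the same pos and d as above:
from scipy.spatial import cKDTree
tree = cKDTree(pos)
# Set of index pairs (i, j) with i < j whose points are closer than d;
# self-comparisons and duplicate (j, i) pairs are excluded automatically.
pairs = tree.query_pairs(r=d)
print(len(pairs))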
I have a list of X and Y coordinates from geodata of a specific part of the world. I want to assign each coordinate, a weight, based upon where it lies in the graph.
For Example: If a point lies in a place where there are a lot of other nodes around it, it lies in a high density area, and therefore has a higher weight.
The most immediate method I can think of is drawing circles of unit radius around each point, calculating whether the other points lie within it, and then using a function to assign a weight to that point. But this seems primitive.
I've looked at pySAL and NetworkX but it looks like they work with graphs. I don't have any edges in the graph, just nodes.
A standard solution would be to use KDE (Kernel Density Estimation).
Search the web for "kernel density estimation" and you will find plenty of resources; in Google, try: kernel density estimation ext:pdf
SciPy also has a KDE implementation; follow http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html. There is working example code there ;)
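A minimal sketch of how gaussian_kde could be used to turn point density into a per-point weight (the coordinates here are random placeholders for your lon/lat lists):
import numpy as np
from scipy.stats import gaussian_kde
# gaussian_kde expects an array of shape (n_dims, n_points)
xy = np.random.random((2, 1000))
kde = gaussian_kde(xy)
weights = kde(xy)  # estimated density at each point, usable as a weight
print(weights[:5])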
If you have a lot of points, you may compute nearest neighbors more efficiently using a KDTree:
import numpy as np
import scipy.spatial as spatial
points = np.array([(1, 2), (3, 4), (4, 5), (100,100)])
tree = spatial.KDTree(np.array(points))
radius = 3.0
neighbors = tree.query_ball_tree(tree, radius)
print(neighbors)
# [[0, 1], [0, 1, 2], [1, 2], [3]]
tree.query_ball_tree returns indices (of points) of the nearest neighbors. For example, [0,1] (at index 0) means points[0] and points[1] are within radius distance from points[0]. [0,1,2] (at index 1) means points[0], points[1] and points[2] are within radius distance from points[1].
frequency = np.array([len(n) for n in neighbors])
print(frequency)
# [2 3 2 1]
density = frequency/radius**2
print(density)
# [ 0.22222222 0.33333333 0.22222222 0.11111111]
Yes, you do have edges, and they are the distances between the nodes. In your case, you have a complete graph with weighted edges.
Simply derive the distance from each node to every other node -- which gives you O(N^2) time complexity -- and use both nodes and edges as input to one of the approaches you found.
It happens, though, that your problem seems to be more of an analysis problem than anything else; you should try running a clustering algorithm on your data, such as k-means, which clusters nodes based on a distance function, for which you can simply use the Euclidean distance.
The result of this algorithm is exactly what you need: you'll have clusters of nearby elements, you'll know which and how many elements are assigned to each group, and from these values you'll be able to generate the coefficient you want to assign to each node.
The only concern worth pointing out here is that you'll have to decide how many clusters (the k in k-means) you want to create.
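A minimal sketch of that idea with scikit-learn's KMeans (the number of clusters and the weighting rule are arbitrary choices of mine, not something this answer prescribes):
import numpy as np
from sklearn.cluster import KMeans
points = np.random.random((1000, 2))  # placeholder (x, y) coordinates
k = 5  # must be chosen up front, as noted above
labels = KMeans(n_clusters=k, n_init=10).fit_predict(points)
# e.g. weight each point by the size of the cluster it falls in
cluster_sizes = np.bincount(labels)
weights = cluster_sizes[labels]
print(weights[:5])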
Your initial inclination to draw a circle around each point and count the number of other points in that circle is a good one, and as mentioned by unutbu, a KDTree will be a fast way to solve this problem.
This can be done very easily with PySAL, which uses scipy's KDTree under the hood.
import pysal
import numpy
pts = numpy.random.random((100, 2))  # generate some random points
radius = 0.2  # pick an arbitrary radius
# Build a Spatial Weights Matrix
W = pysal.threshold_continuousW_from_array(pts, threshold=radius)
# Note: if your points are in latitude and longitude you can increase the accuracy by
# passing the radius of the earth to this function and it will use arc distances.
# W = pysal.threshold_continuousW_from_array(pts, threshold=radius, radius=pysal.cg.RADIUS_EARTH_KM)
print(W.cardinalities)
# {0: 10, 1: 15, ..... }
If your data is in a shapefile, simply replace threshold_continuousW_from_array with threshold_continuousW_from_shapefile; see the docs for details.