Fast method to match geospatial datasets in Python

I have a set of 2000 geospatial points (lon/lat), which I need to match with several other geospatial datasets (I am using Geopandas GeoDataFrames). I am using the sklearn BallTree function to find the neighbors within a certain radius of each point (in the function below, point is one of the 2000 points and right_gdf is the dataset that I need to get the neighbors from).
I am currently using a for-loop to loop through all of the 2000 points and find the neighbors for each of them. However, depending on the size of right_gdf, this can take a long time. I am sure there is a way to speed this process up, potentially with parallel computing, but I am struggling to find it. I tried to use Dask delayed to parallelise the loop (see code below) but somehow this takes even longer than the simple for loop.
import geopandas as gpd
from sklearn.neighbors import BallTree

# Function that finds a point's neighbors within a certain radius
def neighbours_radius(point, right_gdf, R=1):
    # Create tree from the right gdf (use haversine for lat/lon coordinates)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # Find indices of all neighbors within R
    indices = tree.query_radius(point, r=R)[0]
    return indices
# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = neighbours_radius(point, right_gdf, R=R)
        # append index lists
        neighbors.append(ind)
    return neighbors
from dask import delayed, compute

# Function that loops through the 2000 points with Dask delayed
def knn_gpd_dask(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        ind = delayed(neighbours_radius)(point, right_gdf, R=R)
        # append index list
        neighbors.append(ind)
    result = compute(neighbors)
    return result
Can anyone help me speed up this process?

If you profile your code, I suspect you will find that creating the BallTree is taking up most of the time, because you are creating it 2000 times. You should try to create it only once, like this for example:
# Function that loops through the 2000 points
def knn_gpd(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Create tree from the right gdf (use haversine for lat/lon coordinates)
    tree = BallTree(right_gdf, leaf_size=40, metric='haversine')
    # Empty list to fill in the indices of the neighbors
    neighbors = []
    # Loop through the points and find the neighbors within R.
    for i in range(len(base)):
        point = base.iloc[i:i+1, :]
        # Find indices of all neighbors within R
        indices = tree.query_radius(point, r=R)[0]
        # append index lists
        neighbors.append(indices)
    return neighbors
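Building on that, the per-point loop can also be dropped, since query_radius accepts a whole array of query points at once. Below is a minimal sketch, under assumptions not stated in the original post: the geometries are plain points, the coordinates are passed as [lat, lon] in radians (which sklearn's haversine metric expects), and R is a search radius in kilometres (hence the division by an approximate Earth radius of 6371 km).
import numpy as np
import geopandas as gpd
from sklearn.neighbors import BallTree

def knn_gpd_vectorised(right_gdf, R=75):
    # Load the gdf with the 2000 points
    base = gpd.read_file(...)
    # Coordinates as (n, 2) arrays of [lat, lon] in radians
    base_rad = np.radians(np.column_stack([base.geometry.y, base.geometry.x]))
    right_rad = np.radians(np.column_stack([right_gdf.geometry.y, right_gdf.geometry.x]))
    # Build the tree once
    tree = BallTree(right_rad, leaf_size=40, metric='haversine')
    # Query all 2000 points in one call; the radius is converted from km
    # to radians of arc (assumed Earth radius of 6371 km)
    neighbors = tree.query_radius(base_rad, r=R / 6371.0)
    # neighbors is an array of index arrays, one per base point
    return neighbors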

Related

Manually find the distance between centroid and labelled data points

I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword, and since I didn't specify it, numpy automatically computed the norm of the entire flattened matrix, not just along the rows (axis 1). I've updated it now, and it should return an array of distances for each point. Also, einsum was suggested by @Kris for their specific application.
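For reference, the einsum call only sums out the singleton axis that c[y] picks up from the column-vector y, so plain fancy indexing gives the same result; a quick check using the setup above:
import numpy as np

X = np.random.rand(10, 3)
clusters = 3
y = np.random.randint(0, clusters, size=(10, 1))
c = np.random.rand(clusters, 3)

# einsum version from the answer
d_einsum = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
# equivalent: index each point's centroid directly with the flattened labels
d_plain = np.linalg.norm(X - c[y.ravel()], axis=1)

assert np.allclose(d_einsum, d_plain)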

How to join a point to nearest polygon boundary

I am trying to join two spatial datasets. The first contains points and the second polygons.
However, some of the points are outside of the polygons.
Is there a simple way of joining/snapping these points to the nearest polygon boundary, not the nearest polygon centroid?
At the moment I am joining to the nearest polygon centroid but this does not yield the results I am looking for.
You need to put all the points (not the polygon vertices) into a KD-Tree, using something like the sklearn package. This package contains an efficient nearest-neighbours implementation. In Python it can be imported using:
import sklearn.neighbors as neighbors
If you have about 10 million polygons you only need a tree depth of 12 for it to be efficient. You can experiment with this. If fewer than 100,000, a leaf_size of 9 might be enough. Putting all the points (in one single array) into a tree is done with the following:
tree = neighbors.KDTree( arrayOfPoints, leaf_size=12 )
Then you iterate over each polygon and the individual points in each polygon to find the nearest 5 points (for instance). The algorithm is very quick at finding these because of the nature of the KDTree. Brute-force comparison can be 1000 times slower (as I found for massive data sets).
shortestDistances, closestIndices = tree.query( pointInPolygon, k=5 )
You might just want the nearest point, so you can set k=1 and then the closestIndices[0] is what you want for the actual array index from the point list.
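To make this concrete, here is a minimal sketch of that idea; point_coords and polygons are placeholder names, not from the original post:
import numpy as np
import sklearn.neighbors as neighbors

# point_coords: (N, 2) array of the loose points' x/y coordinates (placeholder)
# polygons: list of shapely Polygons (placeholder)
tree = neighbors.KDTree(point_coords, leaf_size=12)

nearest_point_for_poly = {}
for poly_id, poly in enumerate(polygons):
    boundary = np.asarray(poly.exterior.coords)   # the polygon's boundary vertices
    dists, idxs = tree.query(boundary, k=1)       # nearest loose point to each vertex
    best = np.argmin(dists[:, 0])                 # vertex with the overall closest point
    nearest_point_for_poly[poly_id] = (int(idxs[best, 0]), float(dists[best, 0]))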
This is not a complete answer, but you can check distances between points and polygons' boundaries using shapely:
from shapely.geometry import Point, Polygon
p = Point(0,0)
poly = Polygon([[1,0], [1,10], [10, 0]])
print(p.distance(poly))
Edit:
So if you're not using big datasets, you could do something like:
my_points = [...]
my_polys = [...]
dict_point_poly = {}
for point in my_points:
    try:
        intersecting_poly = [poly for poly in my_polys if
                             point.intersects(poly)][0]
        this_poly = intersecting_poly
    except IndexError:
        distances = [(poly, point.distance(poly)) for poly in my_polys]
        distances.sort(key=lambda x: x[1])
        this_poly = distances[0][0]
    finally:
        dict_point_poly[point] = this_poly
(not the most efficient method, but one that is easily understood, I think)
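If you also need the snapped location itself (the nearest point on the boundary, not just the distance), shapely's project/interpolate pair on the polygon exterior gives it, for example:
from shapely.geometry import Point, Polygon

p = Point(0, 0)
poly = Polygon([[1, 0], [1, 10], [10, 0]])

# distance along the exterior ring to the location closest to the point ...
dist_along = poly.exterior.project(p)
# ... and the actual point on the boundary at that location
snapped = poly.exterior.interpolate(dist_along)
print(snapped)   # POINT (1 0)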

Efficiently find closest points to track in space & time on gridded data

Summary/simplified version
Given a list of track points defined by three 1-dimensional arrays (lats, lons and dtime all with same length) and a gridded 3-dimensional array rr (defined by 2-D lat_radar, lon_radar coordinate arrays and a 1-dimensional time array time_radar) I want to extract all the grid values in rr where the coordinates (latitude, longitude AND time included) are closest to the three 1-dimensional arrays.
I've managed to use cKDTree to select points in space but I don't know how to generalize the solution to space & time together. Right now I have to do the selection on time separately and it makes the code quite bulky and hard to read.
More details about this problem follow below.
Extended version
I'm trying to develop an app that uses precipitation data obtained from weather radar composites to predict the precipitation along a track. Most apps usually predict the precipitation at a point without considering the point moving in time.
The idea is, given points identifying a track in space and time, find the closest grid points from radar data to obtain a precipitation estimate over the track (see plot). The final goal would be to shift the start time to identify the best time to leave to avoid rain.
I just optimized my previous algorithm, that was using plain loops, to use cKDTree from scipy. Execution time went down from 30s to 380ms :). However I think the code can still be optimized. Here is my attempt.
As input we have
lons, lats: coordinates of the track as N-dimensional arrays
dtime: timedelta T-dimensional array containing the time elapsed on the track
lon_radar, lat_radar: M x P matrices containing the coordinates of the radar data
dtime_radar: timedelta Q-dimensional array containing the radar forecast
rr: M x P x Q array containing the radar forecast at every time step
First find the grid points closest to the trajectory using cKDTree:
from scipy.spatial import cKDTree
import numpy as np

combined_x_y_arrays = np.dstack([lon_radar.ravel(), lat_radar.ravel()])[0]
points_list = list(np.vstack([lons, lats]).T)

def do_kdtree(combined_x_y_arrays, points):
    mytree = cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

results = do_kdtree(combined_x_y_arrays, points_list)
# As we have many duplicates, since the itinerary has a much higher resolution
# than the radar, we only select the unique points
inds_itinerary = np.unique(results)
lon_lat_itinerary = combined_x_y_arrays[inds_itinerary]
Then find the closest points in the track to subset it: it doesn't make sense to have a track resolution of 10 m if the radar only has grid points every km.
combined_x_y_arrays = np.vstack([lons, lats]).T
points_list = list(lon_lat_itinerary)
results = do_kdtree(combined_x_y_arrays, points_list)
Now we can use these positions to get the elapsed time on the trajectory and the corresponding time steps in the radar data:
dtime_itinerary = dtime[results]
# find indices of these dtimes in radar dtime
inds_dtime_radar = np.abs(np.subtract.outer(dtime_radar, dtime_itinerary)).argmin(0)
Now we have everything we need to find the precipitation, so only one last loop remains. I also loop over shifts to obtain predictions with different start times.
shifts = (1, 3, 5, 7, 9)
rain = np.empty(shape=(len(shifts), len(inds_itinerary)))
for i, shift in enumerate(shifts):
    temp = []
    for i_time, i_space in zip(inds_dtime_radar, inds_itinerary):
        temp.append(rr[i_time + shift].ravel()[i_space])
    rain[i, :] = temp
In particular I would like to find a way to combine the time search with the lat-lon search for the closest points.
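One possible way to fold the time search into the spatial one (a sketch, not from the original code) is to scale elapsed time into a third coordinate and query a single cKDTree on (lon, lat, scaled time). This assumes dtime and dtime_radar are numpy timedelta64 arrays and that the leading axis of rr is the time step, as in the loop above; time_scale is a made-up weight that must be tuned so that an hour of time difference and a degree of distance are comparable.
import numpy as np
from scipy.spatial import cKDTree

# Made-up weight: how many "degrees" one hour of time difference is worth
time_scale = 0.5

# Flatten the M x P radar grid and repeat it for each of the Q time steps
n_space = lon_radar.size
t_hours = dtime_radar / np.timedelta64(1, 'h')
grid = np.column_stack([
    np.tile(lon_radar.ravel(), len(t_hours)),
    np.tile(lat_radar.ravel(), len(t_hours)),
    np.repeat(t_hours, n_space) * time_scale,
])
tree = cKDTree(grid)

# Track points: lon, lat and elapsed time, scaled the same way
track = np.column_stack([lons, lats,
                         (dtime / np.timedelta64(1, 'h')) * time_scale])
_, flat_idx = tree.query(track)

# Recover (time step, grid cell) from the flat index and pull the forecast,
# assuming rr's first axis is the radar time step
i_time, i_space = np.divmod(flat_idx, n_space)
rain_track = rr.reshape(len(t_hours), -1)[i_time, i_space]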

Large set of x,y coordinates. Efficient way to find any within certain distance of each other?

I have a large set of data points in a pandas dataframe, with columns containing x/y coordinates for these points. I would like to identify all points that are within a certain distance "d" of any other point in the dataframe.
I first tried to do this using 'for' loops, checking the distance between the first point and all other points, then the distance between the second point and all others, etc. Clearly this is not very efficient for a large data set.
Recent searching online suggests that the best way might be to use scipy.spatial.cKDTree, but I can't figure out how to implement this. Most examples I see check against a single x/y location, whereas I want to check all vs all. Is anyone able to provide suggestions or examples, starting from an array of x/y coordinates taken from my dataframe as follows:
points = df_sub.loc[:,['FRONT_X','FRONT_Y']].values
That looks something like this:
[[19091199.587 -544406.722]
[19091161.475 -544452.426]
[19091163.893 -544464.899]
...
[19089150.04 -544747.196]
[19089774.213 -544729.005]
[19089690.516 -545165.489]]
The ideal output would be the ID's of all pairs of points that are within a cutoff distance "d" of each other.
scipy.spatial has many good functions for handling distance computations.
Let's create an array pos of 1000 (x, y) points, similar to what you have in your dataframe.
import numpy as np
from scipy.spatial import distance_matrix
num = 1000
pos = np.random.uniform(size=(num, 2))
# Distance threshold
d = 0.25
From here we shall use the distance_matrix function to calculate pairwise distances. Then we use np.argwhere to find the indices of all the pairwise distances less than some threshold d.
pair_dist = distance_matrix(pos, pos)
ids = np.argwhere(pair_dist < d)
ids now contains the "ID's of all pairs of points that are within a cutoff distance "d" of each other", as you desired.
Shortcomings
Of course, this method has the shortcoming that we always compute the distance between each point and itself (returning a distance of 0), which will always be less than our threshold d. However, we can exclude self-comparisons from our ids with the following fudge:
pair_dist[np.r_[:num], np.r_[:num]] = np.inf
ids = np.argwhere(pair_dist < d)
Another shortcoming is that we compute the full symmetric pairwise distance matrix when we only really need the upper or lower triangular pairwise distance matrix. However, unless this computation really is a bottleneck in your code, I wouldn't worry too much about this.
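If the full N x N distance matrix gets too large to hold in memory, a k-d tree avoids building it entirely; scipy's cKDTree.query_pairs returns exactly the index pairs within a cutoff. A sketch using the same random points as above:
import numpy as np
from scipy.spatial import cKDTree

num = 1000
pos = np.random.uniform(size=(num, 2))
d = 0.25

# Build the tree once and ask directly for every index pair (i, j), i < j,
# whose separation is below the threshold d
tree = cKDTree(pos)
pairs = tree.query_pairs(r=d)        # a set of (i, j) tuples
pairs = np.array(sorted(pairs))      # optional: as an (n_pairs, 2) array
This also sidesteps both shortcomings above, since self-comparisons and duplicate symmetric pairs are never generated.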

Improving a method for a spatially aware median filter for point clouds in Python

I have point cloud data from airborne LiDAR. It is noisy, so I want to run a median filter which collects points within N metres of each point, finds the median elevation value, and returns the neighbourhood median as an adjusted elevation value.
This is roughly analogous to gridding the data, and taking the median of elevations within each bin. Or scipy.signal.medfilt.
But - I want to preserve the location (x,y) of each point. Also I'm not sure that medfilt preserves the spatial information required.
I have a method, but it involves multiple for loops, which gets expensive when millions of points go in.
Updated - for each iteration of the first loop, a small patch of points is selected for the shapely intersection operation. The first version searched all input points for an intersection at every iteration. Now, only a small patch at a time is converted to a shapely geometry and used for the intersection:
import numpy as np
from shapely import geometry

def spatialmedian(pointcloud, radius):
    """
    Using shapely geometries, replace every point in a cloud with the
    median value of points within 'radius' units of the point.
    'pointcloud' must have no more than 3 dimensions (x, y, z)
    """
    new_z = []
    i = 0
    for point in pointcloud:
        # pick a point and make it a shapely Point
        point = geometry.Point(pointcloud[i, :])
        # select a patch around the point and make it a shapely MultiPoint
        patch = geometry.MultiPoint(list(pointcloud[
            (pointcloud[:, 0] > point.x - (radius + 0.5)) &
            (pointcloud[:, 0] < point.x + radius + 0.5) &
            (pointcloud[:, 1] > point.y - (radius + 0.5)) &
            (pointcloud[:, 1] < point.y + radius + 0.5)
        ]))
        # buffer the Point by radius
        pbuff = point.buffer(radius)
        # use the intersection method to find points in our
        # patch that lie inside the Point buffer
        isect = pbuff.intersection(patch)
        #print(isect.geom_type)
        # initialise another list
        plist = []
        # for every intersection set,
        # unpack it into a list and collect the median Z value
        if isect.geom_type == 'MultiPoint':
            #print('point has neighbours')
            for p in isect:
                plist.append(p.z)
            new_z.append(np.median(plist))
        else:
            # if the intersection set isn't MultiPoint,
            # it is an isolated point, whose median Z value
            # is its own.
            #print('isolated point')
            # append it to the big list
            new_z.append(isect.z)
        # iterate i
        i += 1
        #print(i)
    # return a list of new median-filtered Z coordinates
    return new_z
This works by:
ingesting a list/array of XYZ points
the first for loop goes through the list and for every point:
picks out a patch of the point cloud just bigger than the neighbourhood specified
uses shapely to place a 3 metre buffer around the point
finds the intersection of the buffer and the whole point cloud
extracts the set of points from that operation in another for loop
finding the median and appending it to a list of new Z values
returning the list of new Z values
For 10^4 points, I get a result in 11 seconds. For 10^5 points it takes 3 minutes, and most of my datasets run to 2-5 * 10^6 points. On a 2 * 10^6 point cloud it has been running overnight.
What I want is a faster/more efficient method!
I've been tinkering with python-pcl, which is fast for filtering point clouds, but I don't know how to return the indices of points which pass/fail python-pcl filters. I need those indices because each point has other attributes which must remain attached to it.
If anyone can suggest a more efficient method, please do so - I would highly appreciate your help. If it can't go faster and this code is helpful, feel free to use it.
Thanks!
After some good advice, I tried this:
# import numpy and scikit-learn's neighbours module
import numpy as np
from sklearn import neighbors as nb

# make a little time ticker
from datetime import datetime
startTime = datetime.now()

# generate a KDTree object. This takes ~95% of the
# processing time
tree = nb.KDTree(xyzi[:, 0:3], leaf_size=60)
# how long did tree generation take
print(datetime.now() - startTime)

# initialise a list
new_z = []
# for each point, collect neighbours within radius r
nhoods = tree.query_radius(xyzi[:, 0:3], r=3)

# iterate through the list of neighbourhoods,
# find the median height, and add it to the output list
for point in nhoods:
    new_z.append(np.median(xyzi[point, 2]))

# how long did it take?
print(datetime.now() - startTime)
This version took ~33 minutes for just under two million points. Acceptable, but still could be better.
Can the KDTree generation go faster using something like a @jit method?
Is there a better method than looping through all the neighbourhoods to find the neighbourhood medians? Here, nhoods is an array of arrays - I thought something like:
median = np.median(nhoods[:][:,2])
...but it didn't work.
Thanks!
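A further option, sketched under the assumption that scipy >= 1.6 is available and that xyzi is the same (N, >=3) array as above: scipy's cKDTree tends to build quickly here, and query_ball_point can query all points in one call across several cores via workers=-1.
import numpy as np
from scipy.spatial import cKDTree

xyz = xyzi[:, 0:3]

# Build the tree and query every point's 3 m neighbourhood in one call,
# using all available cores (workers=-1 needs scipy >= 1.6)
tree = cKDTree(xyz)
nhoods = tree.query_ball_point(xyz, r=3, workers=-1)

# nhoods is a list of index lists of varying length, so the median step
# stays a Python-level loop (written as a list comprehension)
new_z = [np.median(xyzi[idx, 2]) for idx in nhoods]
Because the neighbourhoods have different lengths, the final median step cannot be expressed as a single numpy call on a rectangular array, which is why the one-liner above didn't work.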
