Efficiently finding the closest coordinate pair from a set in Python

The Problem
Imagine I am standing in an airport. Given a geographic coordinate pair, how can I efficiently determine which airport I am standing in?
Inputs
A coordinate pair (x,y) representing the location I am standing at.
A set of coordinate pairs [(a1,b1), (a2,b2)...] where each coordinate pair represents one airport.
Desired Output
A coordinate pair (a,b) from the set of airport coordinate pairs representing the closest airport to the point (x,y).
Inefficient Solution
Here is my inefficient attempt at solving this problem. It is clearly linear in the length of the set of airports.
shortest_distance = None
shortest_distance_coordinates = None

point = (50.776435, -0.146834)

for airport in airports:
    distance = compute_distance(point, airport)
    if shortest_distance is None or distance < shortest_distance:
        shortest_distance = distance
        shortest_distance_coordinates = airport
The Question
How can this solution be improved? This might involve some way of pre-filtering the list of airports based on the coordinates of the location we are currently stood at, or sorting them in a certain order beforehand.

Using a k-dimensional tree:
>>> from scipy import spatial
>>> airports = [(10,10),(20,20),(30,30),(40,40)]
>>> tree = spatial.KDTree(airports)
>>> tree.query([(21,21)])
(array([ 1.41421356]), array([1]))
Where 1.41421356 is the distance between the queried point and the nearest neighbour and 1 is the index of the neighbour.
See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
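To map the returned index back to the coordinates of the nearest airport (reusing the airports list and tree from above), something like this works:

>>> distances, indices = tree.query([(21, 21)])
>>> airports[int(indices[0])]
(20, 20)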

If your coordinates are unsorted, the search can only be improved slightly. Assuming the pairs are (latitude, longitude), you can filter on latitude first, because on Earth
1 degree of latitude is about 111.2 km (69 miles),
but that alone will not give a huge speedup.
If you sort the airports by latitude first, then you can use a binary search to find the first airport that could match (airport_lat >= point_lat - tolerance) and then only compare up to the last one that could match (airport_lat <= point_lat + tolerance) - but take care of wrap-around (0 degrees equalling 360). While you cannot use the bisect module on the coordinate pairs directly, its sources are a good start for implementing a binary search.
While technically the search is still O(n) this way, you have far fewer actual distance calculations (depending on the tolerance) and only cheap latitude comparisons, so you will see a large speedup.
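For illustration, here is a minimal sketch of that sorted-latitude prefilter, assuming airports is a list of (latitude, longitude) tuples and compute_distance is the same helper used in the question:

import bisect

airports_by_lat = sorted(airports)          # sorted by latitude (the first tuple element)
lats = [a[0] for a in airports_by_lat]      # plain list of latitudes for bisect

def closest_airport(point, tolerance=1.0):  # tolerance in degrees of latitude
    lo = bisect.bisect_left(lats, point[0] - tolerance)
    hi = bisect.bisect_right(lats, point[0] + tolerance)
    candidates = airports_by_lat[lo:hi]     # only airports inside the latitude band
    if not candidates:                      # nothing in the band: widen the tolerance
        return closest_airport(point, tolerance * 2)
    return min(candidates, key=lambda airport: compute_distance(point, airport))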

From this SO question:
import numpy as np

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)
    return np.argmin(dist_2)
where node is a tuple of two values (x, y) and nodes is an array of such tuples ([(x_1, y_1), (x_2, y_2), ...]).
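For example, applied to the airports list from the earlier answer (note that closest_node returns the index of the nearest point, not its coordinates):

airports = [(10, 10), (20, 20), (30, 30), (40, 40)]
idx = closest_node((21, 21), airports)
print(airports[idx])   # (20, 20)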

@Juddling's answer is great, but KDTree does not support the haversine distance, which is better suited for latitude/longitude coordinates.
For the haversine distance you can use BallTree. Please note that you need to convert your coordinates to radians first.
from math import radians
from sklearn.neighbors import BallTree
import numpy as np

airports = [(10, 10), (20, 20), (30, 30), (40, 40)]
airports_rad = np.array([[radians(x[0]), radians(x[1])] for x in airports])
tree = BallTree(airports_rad, metric='haversine')
result = tree.query([(radians(21), radians(21))])
print(result)
gives
(array([[0.02391369]]), array([[1]], dtype=int64))
To convert the distance to meters you need to multiply by the earth radius (in meters).
earth_radius = 6371000  # mean Earth radius in meters
print(result[0][0] * earth_radius)
[152354.11114795]

Related

How to find all the neighboring points of a point within particular distance from it?

I have two dataframes, df1 and df2.
For each point in df2, I want to find all the neighbouring points from df1 that lie within a particular radial distance (this needs to be done for each point in df2 iteratively).
How could I do that?
In the given figure, the black points are in df1 and the red points are in df2; I would like to find the neighbouring points of each red point.
In pseudocode (not language specific and VERY loosely typed):
function getDistance(point pointA, point pointB){
    diffx = absoluteValue(pointA.x - pointB.x);
    diffy = absoluteValue(pointA.y - pointB.y);
    return squareRoot(diffx^2 + diffy^2)
}

for point1 in df1{
    //each obj stores a point and a corresponding distance
    Object distance{
        point2Identifier;
        distanceFromPoint1;
    }
    ObjectArray distances; //Array of distance objects
    for point2 in df2{
        distances.add(getDistance(point1, point2));
    }
    distances.getSmallest /*Finds the distance obj with the smallest distanceFromPoint1 prop and stores it however you see fit*/
}
This was off the top of my head and quickly typed, so simplification and implementation are up to you. This most likely is not the quickest nor the most efficient way of achieving what you want. I am sure it can be simplified greatly, especially in Python. As you probably know, the API is littered with methods for simplifying math-in-code.
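As a rough Python sketch of the same brute-force idea (assuming df1 and df2 are pandas DataFrames with x and y columns, and using 0.1 as an example radius, as in the answer below):

import numpy as np

def get_distance(ax, ay, bx, by):
    # plain Euclidean distance, as in the pseudocode above
    return np.sqrt((ax - bx) ** 2 + (ay - by) ** 2)

radius = 0.1
neighbours = {}
for i, p2 in df2.iterrows():
    d = get_distance(df1['x'], df1['y'], p2['x'], p2['y'])  # vectorised over all of df1
    neighbours[i] = df1[d <= radius]                        # df1 points near this df2 point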
Find all nearest neighbors within a specific distance
# x, y are the x, y columns in the data frame and the radial distance is 0.1
import numpy as np
import scipy.spatial as spatial

points = df1[['x', 'y']]
points_array = points.rename_axis('ID').values
point_tree = spatial.cKDTree(points_array)

for item in range(len(df2)):
    # all df1 points within a radial distance of 0.1 of this df2 point
    print(point_tree.data[point_tree.query_ball_point([df2.x.iloc[item], df2.y.iloc[item]], 0.1)])

Finding random closest neighbours in 3D

Given the names and GPS positions (latitude, longitude and elevation) of 750 people, I am trying to write Python code that finds the names of the 10 closest neighbours of a randomly selected individual.
import random
#random = rand.sample(range(0,750), 10)
coords = [(random.random()*2.0, random.random()*2.0, random.random()*2.0) for _ in range(750)]
To do this you should either work in spherical coordinates, or you can convert to Cartesian. Working in Cartesian makes the assumption that direct distance, and not a great elliptic arc, is how you are measuring distance.
import numpy as np
from sklearn.neighbors import DistanceMetric

R = 6371  # approximate radius of earth in km

# coordinates in (lat, lon, elv) in units of (rad, rad, km)
coords = np.random.random((750, 3)) * 2

cart_coords = np.array([((R + coord[2]) * np.cos(coord[0]) * np.cos(coord[1]),
                         (R + coord[2]) * np.cos(coord[0]) * np.sin(coord[1]),
                         (R + coord[2]) * np.sin(coord[0])) for coord in coords])

# calculate pairwise distances between all points
dist = DistanceMetric.get_metric('euclidean')
dist_vals = dist.pairwise(cart_coords)

# pick a random person
random_person = np.random.choice(np.arange(750))

# keep the 11 smallest distances (which include the person themselves) ...
top_ten = np.where(dist_vals[random_person] < sorted(dist_vals[random_person])[11])[0]

# ... then remove self from the list
top_ten = top_ten[top_ten != random_person]
print(top_ten)
If you wish to ignore the elevation and use the haversine formula instead, you can check this post: Vectorizing Haversine distance calculation in Python
The Earth is an ellipsoid with a difference of about 21km between the polar and equatorial radii. If you really want to go deeper you can look into the science of geodesy. astropy is a good package for this type of problem https://docs.astropy.org/en/stable/api/astropy.coordinates.spherical_to_cartesian.html
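For instance, a quick sketch of the astropy helper linked above (it expects the radius first, then latitude and longitude in radians; the values here are just dummies):

from astropy.coordinates import spherical_to_cartesian

R = 6371                       # approximate Earth radius in km
lat, lon, elv = 0.8, 1.2, 0.5  # radians, radians, km (dummy values)
x, y, z = spherical_to_cartesian(R + elv, lat, lon)
print(x, y, z)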
Couldn't you just use the distance formula? Given (x, y, z) coordinates, d = sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2) gives the distance between the randomly selected person and any other element. Just calculate the distance of every single person from the random person and then keep only the ten lowest values.
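A minimal numpy sketch of that brute-force suggestion, assuming the cart_coords array and random_person index from the answer above:

import numpy as np

# straight-line distances from the random person to everyone
d = np.linalg.norm(cart_coords - cart_coords[random_person], axis=1)

# indices of the ten closest people (position 0 of the sorted order is the person themselves)
ten_closest = np.argsort(d)[1:11]
print(ten_closest)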
You could use the excellent BallTree from sklearn:
import numpy as np
from sklearn.neighbors import BallTree

coords = np.random.random((750, 3)) * 2
tree = BallTree(coords)
random_person = np.random.choice(np.arange(750))

# the query point itself comes back as the nearest match (distance 0),
# so ask for k=11 and drop the first index to get the 10 true neighbours
closest_people = tree.query(coords[None, random_person], k=11)[1][0][1:]

Optimization function for the sum Google Maps distance

I am trying to find a point (latitude/longitude) that minimizes the sum of Google maps distance to all other N points.
I was able to extract the Google Maps distances between my latitude and longitude arrays but I wasn't able to minimize my function.
Code
import googlemaps
import numpy as np
from scipy.optimize import minimize

def minimize_g(input_g):
    gmaps1 = googlemaps.Client(key="xxx")

    def distance_f(x):
        dist = gmaps1.distance_matrix([x], np.array(input_g)[:, 1:3])
        sum_ = 0
        for obs in range(len(np.array(df[:3]))):
            sum_ += dist['rows'][0]['elements'][obs]['distance']['value']
        return sum_

    # initial guess: centroid
    centroid = input_g.mean(axis=0)
    optimization = minimize(distance_f, centroid, method='COBYLA')
    return optimization.x
Thanks!
If you are looking for any point on the map that results in the shortest total distance to all coordinates in your list, you can try writing a function that calculates the distance from one coordinate to another. Once you have that function ready to go, it's a matter of calculating the total distance to all your points from a test point.
Then, from some artificially created coordinates, you would minimize the distances to all your points with something along the lines of
import numpy as np

lats = [12.3, 12.4, 12.5]
lons = [16.1, 15.1, 14.1]

def total_distance_to_lats_and_lons(lat, lon):
    # some summation over distances from (lat, lon) to lats, lons;
    # as a simple placeholder, use the sum of Euclidean distances in degrees
    return sum(((lat - la) ** 2 + (lon - lo) ** 2) ** 0.5 for la, lo in zip(lats, lons))

# create two lists with 0.01 degree precision as an artificial grid of possibilities
test_lats = np.arange(min(lats), max(lats), 0.01)
test_lons = np.arange(min(lons), max(lons), 0.01)

test_distances = []            # total_distance to each combination of test_lat, test_lon
coordinate_combinations = []   # corresponding coordinates

for test_lat in test_lats:
    for test_lon in test_lons:
        coordinate_combinations.append([test_lat, test_lon])  # add a combination of coordinates
        test_distances.append(total_distance_to_lats_and_lons(test_lat, test_lon))  # add a distance

index_of_best_test_coordinate = np.argmin(test_distances)  # find index of the minimum value
print('Best match is index {}'.format(index_of_best_test_coordinate))
print('Coordinates: {}'.format(coordinate_combinations[index_of_best_test_coordinate]))
print('Total distance: {}'.format(test_distances[index_of_best_test_coordinate]))
This brute-force approach has some precision limitations and quickly becomes an expensive loop, so you can also apply the method iteratively: after each round, rebuild the test grid around the minimum found so far, increasing the precision and narrowing the start and end points of the test coordinate lists. After a few iterations you should have a fairly precise estimate. On the other hand, such an iterative method may converge to one of several local minima, yielding only one of multiple solutions.
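A rough sketch of that iterative refinement, reusing lats, lons and total_distance_to_lats_and_lons from above (the number of rounds, grid size and shrink factor are arbitrary choices):

import numpy as np

best_lat = (min(lats) + max(lats)) / 2
best_lon = (min(lons) + max(lons)) / 2
span = max(max(lats) - min(lats), max(lons) - min(lons))

for _ in range(4):                       # a few refinement rounds
    test_lats = np.linspace(best_lat - span, best_lat + span, 50)
    test_lons = np.linspace(best_lon - span, best_lon + span, 50)
    grid = [(la, lo) for la in test_lats for lo in test_lons]
    dists = [total_distance_to_lats_and_lons(la, lo) for la, lo in grid]
    best_lat, best_lon = grid[int(np.argmin(dists))]
    span /= 10                           # zoom in around the current best point

print(best_lat, best_lon)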

Identifying weighted clusters with max distance diameter and sum(weight) > 50

Problem
I need to identify 2-mile clusters of points, where each point has a value, and flag the 2-mile areas whose sum(value) > 50.
Data
I have data that looks like the following
ID COUNT LATITUDE LONGITUDE
187601546 20 025.56394 -080.03206
187601547 25 025.56394 -080.03206
187601548 4 025.56394 -080.03206
187601550 0 025.56298 -080.03285
Roughly 200K records. What I need to determine is whether there are any areas where the sum of COUNT exceeds 65 within a one mile radius (2 mile diameter) area.
Using each point as a center for an area
Now, I have python code from another project that will draw a shapefile around a point of x diameter as follows:
def poly_based_on_distance(center_lat, center_long, distance, bearing):
    # bearing is in degrees
    # distance in miles
    # print('center', center_lat, center_long)
    destination = (vincenty(miles=distance).destination(Point(center_lat, center_long),
                                                        bearing).format_decimal())
And a routine to return destination and then see which points are inside the radius.
## This is the evaluation for overlap between points and
## area polyshapes
area_list = []
store_geo_dict = {}
for stores in locationdict:
    location = Polygon(locationdict[stores])
    for areas in AREAdictionary:
        area = Polygon(AREAdictionary[areas])
        if location.intersects(area):
            area_list.append(areas)
    store_geo_dict[stores] = area_list
    area_list = []
At this point, I am simply drawing a circular shapefile around each of the 200K points, see which others were inside and doing the count.
Need Clustering Algorithm?
However, there might be an area with the required count density whose center is not one of the data points.
I'm familiar with clustering algorithms such as DBSCAN that use attributes for classification, but this is a matter of finding density clusters using a value for each point. Is there any clustering algorithm that can find every cluster of a 2 mile diameter circle where the inside count is >= 50?
Any suggestions? Python or R are the preferred tools, but this is wide open and probably a one-off, so computational efficiency is not a priority.
Not a complete solution, but maybe it will help simplify the problem, depending on the distribution of your data. I will use planar coordinates and cKDTree in my example; this might work with geographic data if you can ignore the curvature in a projection.
The main observation is the following: a point (x,y) does not contribute to a dense cluster if a ball of radius 2*r (e.g. 2 miles) around (x,y) contributes less than the cutoff value (e.g. 50 in your title). In fact, any point within r of (x,y) does not contribute to any dense cluster.
This allows you to repeatedly discard points from consideration. If you are left with no points, there are no dense clusters; if you are left with some points, clusters may exist.
import numpy as np
from scipy.spatial import cKDTree

# test data
N = 1000
data = np.random.rand(N, 2)
x, y = data.T

# test weights of each point
weights = np.random.rand(N)

def filter_noncontrib(pts, weights, radius=0.1, cutoff=60):
    tree = cKDTree(pts)
    contribs = np.array(
        [weights[tree.query_ball_point(pt, 2 * radius)].sum() for pt in pts]
    )
    return contribs >= cutoff

def possible_contributors(pts, weights, radius=0.1, cutoff=60):
    n_pts = len(pts)
    while len(pts):
        mask = filter_noncontrib(pts, weights, radius, cutoff)
        pts = pts[mask]
        weights = weights[mask]
        if len(pts) == n_pts:
            break
        n_pts = len(pts)
    return pts
Example with dummy data (figure omitted).
DBSCAN can be adapted (see Generalized DBSCAN; define core points as weight sum >= 50), but it will not ensure the maximum cluster size (it computes transitive closures).
You could also try complete linkage. Use it to find clusters with the desired maximum diameter, then check whether these satisfy the desired density. But that does not guarantee finding all of them.
It's probably faster to (a) build an index for fast radius search, and (b) for every point, find the neighbors within radius r and keep the point if they reach the desired minimum sum. But that does not guarantee finding everything, because the center is not necessarily a data point. Consider a maximum radius of 1 and a minimum weight of 100, with two points of weight 50 each at (0,0) and (1,1). Neither a query at (0,0) nor one at (1,1) will discover the solution, but a cluster centered at (.5,.5) satisfies the conditions.
Unfortunately, I believe your problem is at least NP-hard, so you won't be able to afford the ultimate solution.

How to calculate 3D distance (including altitude) between two points in GeoDjango

Prologue:
This is a question arising often in SO:
3d distance calculations with GeoDjango
Calculating distance between two points using latitude longitude and altitude (elevation)
Distance between two 3D point in geodjango (postgis)
I wanted to compose an example on SO Documentation, but the geodjango chapter never took off, and since the Documentation was shut down on August 8, 2017, I will follow the suggestion of this widely upvoted and discussed meta answer and write my example as a self-answered post.
Of course, I would be more than happy to see any different approach as well!!
Question:
Assume the model:
class MyModel(models.Model):
    name = models.CharField()
    coordinates = models.PointField()

Where I store the point in the coordinates field as a lng, lat, alt point:

MyModel.objects.create(
    name='point_name',
    coordinates='SRID=3857;POINT Z (100.00 10.00 150)')
I am trying to calculate the 3D distance between two such points:
p1 = MyModel.objects.get(name='point_1').coordinates
p2 = MyModel.objects.get(name='point_2').coordinates
d = Distance(m=p1.distance(p2))
Now d=X in meters.
If I change only the altitude of one of the points in question:
For example:
p1.coordinates = 'SRID=3857;POINT Z (100.00 10.00 200)'
from 150 previously, the calculation:
d = Distance(m=p1.distance(p2))
returns d=X again, as if the elevation were ignored.
How can I calculate the 3D distance between my points?
Reading from the documentation on the GEOSGeometry.distance method:
Returns the distance between the closest points on this geometry and the given geom (another GEOSGeometry object).
Note
GEOS distance calculations are linear – in other words, GEOS does not perform a spherical calculation even if the SRID specifies a geographic coordinate system.
Therefore we need to implement a method to calculate a more accurate 2D distance between 2 points and then we can try to apply the altitude (Z) difference between those points.
1. Great-Circle 2D distance calculation (Take a look at the 2022 UPDATE below the explanation for a better approach using geopy):
The most common way to calculate the distance between 2 points on the surface of a sphere (as the Earth is simplistically but usually modeled) is the Haversine formula:
The haversine formula determines the great-circle distance between two points on a sphere given their longitudes and latitudes.
Although from the great-circle distance wiki page we read:
Although this formula is accurate for most distances on a sphere, it too suffers from rounding errors for the special (and somewhat unusual) case of antipodal points (on opposite ends of the sphere). A formula that is accurate for all distances is the following special case of the Vincenty formula for an ellipsoid with equal major and minor axes.
We can create our own implementation of the Haversine or the Vincenty formula (as shown here for Haversine: Haversine Formula in Python (Bearing and Distance between two GPS points)) or we can use one of the already implemented methods contained in geopy:
geopy.distance.great_circle (Haversine):
from geopy.distance import great_circle
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
# This call will result in 536.997990696 miles
great_circle(newport_ri, cleveland_oh).miles
geopy.distance.vincenty (Vincenty):
from geopy.distance import vincenty
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
# This call will result in 536.997990696 miles
vincenty(newport_ri, cleveland_oh).miles
!!!2022 UPDATE: On 2D distance calculation using geopy:
GeoPy discourages the use of Vincenty as of version 1.14.0. Changelog states:
CHANGED: Vincenty usage now issues a warning. Geodesic should be used instead. Vincenty is planned to be removed in geopy 2.0. (#293)
So (especially if we are going to apply the calculation on a WGS84 ellipsoid) we should use geodesic distance instead:
from geopy.distance import geodesic
newport_ri = (41.49008, -71.312796)
cleveland_oh = (41.499498, -81.695391)
# This call will result in 538.390445368 miles
geodesic(newport_ri, cleveland_oh).miles
2. Adding altitude to the mix:
As mentioned, each of the above calculations yields a great circle distance between 2 points. That distance is also called "as the crow flies", assuming that the "crow" flies without changing altitude and as straight as possible from point A to point B.
We can have a better estimation of the "walking/driving" ("as the crow walks"??) distance by combining the result of one of the previous methods with the difference (delta) in altitude between point A and point B, inside the Euclidean Formula for distance calculation:
acw_dist = sqrt(great_circle(p1, p2).m**2 + (p1.z - p2.z)**2)
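A small runnable sketch of that combination using geopy's geodesic for the 2D part (the altitudes alt1 and alt2 in metres are just illustrative values):

from math import sqrt
from geopy.distance import geodesic

p1 = (41.49008, -71.312796)
p2 = (41.499498, -81.695391)
alt1, alt2 = 150.0, 200.0          # altitudes in metres (example values)

flat_m = geodesic(p1, p2).m        # 2D geodesic distance in metres
acw_dist = sqrt(flat_m ** 2 + (alt1 - alt2) ** 2)
print(acw_dist)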
The previous solution is prone to errors, especially the longer the real distance between the points is. I leave it here for comment continuation reasons.
GeoDjango Distance calculates the 2D distance between two points and doesn't take into consideration the altitude differences.
In order to get the 3D calculation, we need to create a distance function that will consider altitude differences in the calculation:
Theory:
Latitude, longitude and altitude are spherical (polar) coordinates, and we need to translate them to Cartesian coordinates (x, y, z) in order to apply the Euclidean formula and calculate the 3D distance.
Assume:
polar_point_1 = (long_1, lat_1, alt_1)
and polar_point_2 = (long_2, lat_2, alt_2)
Translate each point to its Cartesian equivalent by utilizing this formula (where alt is the distance from the Earth's centre, i.e. the Earth's radius plus the elevation):
x = alt * cos(lat) * sin(long)
y = alt * sin(lat)
z = alt * cos(lat) * cos(long)
and you will have p_1 = (x_1, y_1, z_1) and p_2 = (x_2, y_2, z_2) points respectively.
Finally use the Euclidean formula:
dist = sqrt((x_2-x_1)**2 + (y_2-y_1)**2 + (z_2-z_1)**2)
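A minimal Python sketch of the above theory (a sketch only: it treats the Earth as a perfect sphere and expects latitude/longitude in degrees and altitude in metres):

from math import radians, sin, cos, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

def polar_to_cartesian(lat, lng, alt):
    r = EARTH_RADIUS_M + alt          # radial distance from the Earth's centre
    lat, lng = radians(lat), radians(lng)
    return (r * cos(lat) * sin(lng),
            r * sin(lat),
            r * cos(lat) * cos(lng))

def distance_3d(p1, p2):
    # p1, p2 are (lat, lng, alt) tuples
    x1, y1, z1 = polar_to_cartesian(*p1)
    x2, y2, z2 = polar_to_cartesian(*p2)
    return sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2)

print(distance_3d((10.0, 100.0, 150), (10.5, 100.5, 200)))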
Using geopy, this is the easiest solution:
https://geopy.readthedocs.io/en/stable/#geopy.distance.lonlat
>>> from geopy.distance import distance, lonlat
>>> a = lonlat(-71.312796, 41.49008, 0)
>>> b = lonlat(-81.695391, 41.499498, 0)
>>> print(distance(a, b).miles)
538.3904453677203
Once converted into Cartesian coordinates, you can compute the norm with numpy:
np.linalg.norm(point_1 - point_2)
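For example, with two Cartesian points in metres (dummy values) as numpy arrays:

import numpy as np

point_1 = np.array([6371150.0, 0.0, 0.0])      # example Cartesian coordinates in metres
point_2 = np.array([6371000.0, 50000.0, 0.0])
print(np.linalg.norm(point_1 - point_2))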
