Say I have a list of X,Y points. I then add a new point; how would I find out which old point the newly added point is closest to? I've seen a few similar questions but couldn't get any to work. I'm looking for something like:
pnts = [[11145,1146], [11124,1155], [11212,1147], etc]
new_pnt = [11444, 1160]
new_pnt.closest()
I've tried scipy and KDTree and kept getting various errors. Brand new to Python, any help would be greatly appreciated.
The easiest and quickest way is to define a function that measures the distance to every point and returns the closest one. For example (assuming Euclidean distance):
>>> pnts = [[11145,1146], [11124,1155], [11212,1147]]
>>> new_pnt = [11444, 1160]
>>> def closest(points, new_point):
...     closest_point = None
...     closest_distance = None
...     for point in points:
...         distance = ((point[0] - new_point[0])**2 + (point[1] - new_point[1])**2)**0.5
...         if closest_distance is None or distance < closest_distance:
...             closest_point = point
...             closest_distance = distance
...     return closest_point
>>> closest(pnts, new_pnt)
[11212, 1147]
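The same loop can also be written more compactly with the built-in min and a key function. This is only a sketch, assuming Python 3.8 or later (for math.dist); the name closest_point is mine and does not come from the question.

```python
import math

def closest_point(points, new_point):
    """Return the element of points with the smallest Euclidean distance to new_point."""
    # math.dist computes the Euclidean distance between two coordinate sequences.
    return min(points, key=lambda p: math.dist(p, new_point))

pnts = [[11145, 1146], [11124, 1155], [11212, 1147]]
new_pnt = [11444, 1160]
print(closest_point(pnts, new_pnt))  # [11212, 1147]
```

Like the explicit loop, this still visits every point once, so it is fine for small lists.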
Do you have NumPy available? If so:
import numpy as np
index = np.argmin(np.sum((np.array(pnts) - np.array(new_pnt))**2, axis=1))
print(index) # 2
That is, the point pnts[2] is closest to new_pnt. The distance is given by the square root of the sum of the squared differences between the x coordinates and between the y coordinates. Here I leave out the square root, as the point with the smallest squared distance is also the point with the smallest distance.
Related
I have a set of coordinates in latitude and longitude format. I need to find the smallest cluster of these coordinates that are within, say, 50 miles of each other.
I am new to data science. How can I implement this in Python without using the sklearn library?
Well, if you need to find distances, you're using XY coords, and you don't want to use extra libs, then a place to start is to write a little function that finds the Euclidean distance (you know, a squared plus b squared equals c squared). Something like this can get you started. The function takes two tuples as coordinates, and it can also work for 3D coords if needed:
import math
def getEuclideanDistance(pointA, pointB):
    xA, yA = pointA
    # or for 3D coords, xA, yA, zA = pointA...
    xB, yB = pointB
    result = math.sqrt((xA-xB)**2 + (yA-yB)**2)
    # or for distance in 3d:
    # result = math.sqrt((xA-xB)**2 + (yA-yB)**2 + (zA-zB)**2)
    return result
We're using the **2 operator to square each coordinate difference, and math.sqrt to take the square root of their sum. I hope this puts you on the right track.
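Since the question is about latitude and longitude rather than flat XY coordinates, a great-circle distance may be a better fit than the Euclidean one. Below is a minimal sketch of the haversine formula using only the standard library. The function name, the (latitude, longitude) argument order in degrees, and the mean Earth radius of 3958.8 miles are assumptions made for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def haversine_miles(point_a, point_b):
    """Great-circle distance in miles between two (latitude, longitude) pairs in degrees."""
    lat_a, lon_a = map(math.radians, point_a)
    lat_b, lon_b = map(math.radians, point_b)
    h = (math.sin((lat_b - lat_a) / 2) ** 2
         + math.cos(lat_a) * math.cos(lat_b) * math.sin((lon_b - lon_a) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))

# One degree of latitude along a meridian.
one_degree = haversine_miles((0.0, 0.0), (1.0, 0.0))
print(round(one_degree, 1))  # about 69.1
```

Two points would then count as being within 50 miles of each other when haversine_miles(p, q) <= 50.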
This is my data set:
https://pastebin.com/SsuKP2eH
I'm trying to find the nearest point for all points in the data set. These points are latitude and longitude on the Earth's surface. Of course, the nearest point cannot be the same point.
I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.
I expected to get an array full of distances, but instead, I got an array full of zeroes with some numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbours, on my data set and got the same result. So, I did some debugging and reduced the range of random numbers he used, making it closer to my own data set.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
R = 6367
def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength
    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index
np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 'Longitude':np.random.uniform(0, 360,size=N)})
data = pd.DataFrame({'Latitude':np.random.uniform(-49.19,49.32,size=N), 'Longitude':np.random.uniform(-123.02, -123.23,size=N)})
result = using_kdtree(data)
I found that the resulting distances array had small values, close to 0. This makes me believe that the reason why the result array for my data set is full of zeroes is because the differences between points are very small. Somewhere, the KD Tree/nearest neighbours loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method can keep precision but it is far too slow with 7200 points to iterate through.
I think what's happening is that k=2 in
distance, index = tree.query(data[['x', 'y','z']], k=2)
tells KDTree that you want the two closest points to each query point. Because every query point is also stored in the tree, the closest one is the point itself, at distance zero. If you print index you will see an Nx2 array in which each row starts with its own row number. This is KDTree's way of saying that the closest point to the i-th point is the i-th point itself.
That is not useful here; you probably want only the second-closest point. Fortunately, I found this in the documentation of the k parameter of query:
Either the number of nearest neighbors to return, or a list of the
k-th nearest neighbors to return, starting from 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
So
distance, index = tree.query(data[['x', 'y','z']], k=[2])
gives only the distance and index of the second-closest point.
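To see the effect of k=[2] on something small enough to check by hand, here is a sketch with three made-up points on a line (the data and variable names are mine, not from the question):

```python
import numpy as np
from scipy import spatial

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
tree = spatial.KDTree(points)

# k=[2] asks only for the second-nearest neighbour of each query point;
# the nearest one (the point itself, at distance 0) is skipped.
distance, index = tree.query(points, k=[2])

# With a one-element list for k, each result has one column; take it as a flat array.
nearest_distance = distance[:, 0]
nearest_index = index[:, 0]
print(nearest_distance)  # distances 1, 1 and 4
print(nearest_index)     # neighbours 1, 0 and 1
```

Note that if the data set contains exact duplicate coordinates, the second-nearest neighbour of such a point is its duplicate, so a distance of 0 can still show up.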
I am writing some code to calculate the real distance between one point and the rest of the points from the same array. The array holds positions of particles in 3D space. There are N particles, so the array's shape is (N,3). I choose one particle and calculate the distance between this particle and the rest of the particles, all within one array.
Would anyone here have any idea how to do this?
What I have so far:
import random
import numpy as np

xbox = 10
ybox = 10
zbox = 10
nparticles = 15
positions = np.empty([nparticles, 3])
for i in range(nparticles):
    xrandomalocation = random.uniform(0, xbox)
    yrandomalocation = random.uniform(0, ybox)
    zrandomalocation = random.uniform(0, zbox)
    positions[i, 0] = xrandomalocation
    positions[i, 1] = yrandomalocation
    positions[i, 2] = zrandomalocation
And that's pretty much all I have right now. I was thinking of using np.linalg.norm, but I am not sure how to use it in my code (or whether to use it in a loop).
It sounds like you could use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist for this. For example, to get the distances from point X to the points in coords:
>>> from scipy.spatial import distance
>>> X = [(35.0456, -85.2672)]
>>> coords = [(35.1174, -89.9711),
... (35.9728, -83.9422),
... (36.1667, -86.7833)]
>>> distance.cdist(X, coords, 'euclidean')
array([[ 4.70444794, 1.6171966 , 1.88558331]])
pdist is similar, but only takes one array, and you get the distances between all pairs.
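Since the question mentions np.linalg.norm, here is a minimal sketch of that route without a loop. The sample positions and the chosen index i are made up for illustration.

```python
import numpy as np

positions = np.array([[0.0, 0.0, 0.0],
                      [3.0, 4.0, 0.0],
                      [0.0, 0.0, 2.0],
                      [1.0, 2.0, 2.0]])
i = 0  # index of the chosen particle

# Broadcasting subtracts positions[i] from every row; axis=1 takes the norm of each row.
distances = np.linalg.norm(positions - positions[i], axis=1)
print(distances)  # distances[i] is 0, the particle's distance to itself

# Drop the chosen particle's own entry if only the other particles matter.
others = np.delete(distances, i)
```

The result has one entry per particle, so for an (N,3) array you get N distances, or N-1 after removing the particle itself.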
I am using this function:
from scipy.spatial import distance
def closest_node(node, nodes):
    closest = distance.cdist([node], nodes)
    index = closest.argmin()
    euclidean = closest[0]
    return nodes[index], euclidean[index]
where node is the single point in the space that you want to compare with an array of points called nodes. It returns the closest point and its Euclidean distance to your original node.
The Problem
Imagine I am stood in an airport. Given a geographic coordinate pair, how can one efficiently determine which airport I am stood in?
Inputs
A coordinate pair (x,y) representing the location I am stood at.
A set of coordinate pairs [(a1,b1), (a2,b2)...] where each coordinate pair represents one airport.
Desired Output
A coordinate pair (a,b) from the set of airport coordinate pairs representing the closest airport to the point (x,y).
Inefficient Solution
Here is my inefficient attempt at solving this problem. It is clearly linear in the length of the set of airports.
shortest_distance = None
shortest_distance_coordinates = None
point = (50.776435, -0.146834)
for airport in airports:
    distance = compute_distance(point, airport)
    # Test for None first: comparing a number with None raises TypeError in Python 3.
    if shortest_distance is None or distance < shortest_distance:
        shortest_distance = distance
        shortest_distance_coordinates = airport
The Question
How can this solution be improved? This might involve some way of pre-filtering the list of airports based on the coordinates of the location we are currently stood at, or sorting them in a certain order beforehand.
Using a k-dimensional tree:
>>> from scipy import spatial
>>> airports = [(10,10),(20,20),(30,30),(40,40)]
>>> tree = spatial.KDTree(airports)
>>> tree.query([(21,21)])
(array([ 1.41421356]), array([1]))
Where 1.41421356 is the distance between the queried point and the nearest neighbour and 1 is the index of the neighbour.
See: http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
If your coordinates are unsorted, the search can only be improved slightly. Assuming each pair is (latitude, longitude), you can filter on latitude first, since on Earth
1 degree of latitude is about 111.2 km or 69 miles
but that would not give a huge speedup.
If you sort the airports by latitude first, then you can use a binary search to find the first airport that could match (airport_lat >= point_lat-tolerance) and then compare only up to the last one that could match (airport_lat <= point_lat+tolerance), but take care of 0 degrees equaling 360 if you apply the same filter to longitude. While you may not be able to use the bisect module directly on a list of coordinate pairs, its source is a good start for implementing a binary search.
Technically the search is still O(n) in the worst case, but you have far fewer actual distance calculations (depending on the tolerance) and only a few latitude comparisons, so you should see a large speedup.
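The sorted-latitude idea can be sketched roughly as follows. Here I sidestep the tuple problem by keeping the latitudes in a separate sorted list, so bisect can be used as is. The function name, the sample airports, and the use of math.dist on raw degrees as a stand-in for compute_distance are all assumptions for illustration.

```python
import bisect
import math

def nearest_in_latitude_band(point, airports_sorted, latitudes, tolerance, compute_distance):
    """Return the closest airport whose latitude is within tolerance degrees of point, or None."""
    # Binary search for the slice of airports inside the latitude band.
    low = bisect.bisect_left(latitudes, point[0] - tolerance)
    high = bisect.bisect_right(latitudes, point[0] + tolerance)
    candidates = airports_sorted[low:high]
    if not candidates:
        return None
    # Only the candidates in the band need a real distance calculation.
    return min(candidates, key=lambda airport: compute_distance(point, airport))

# Made-up sample data: (latitude, longitude) pairs, sorted by latitude once up front.
airports_sorted = sorted([(51.47, -0.45), (51.15, -0.18), (40.64, -73.78), (48.35, 11.79)])
latitudes = [lat for lat, _ in airports_sorted]

point = (50.776435, -0.146834)
# math.dist on raw degrees is only a stand-in for a real compute_distance.
print(nearest_in_latitude_band(point, airports_sorted, latitudes, 1.0, math.dist))
```

If the band is empty the function returns None, and the true nearest airport may lie outside a band that is too narrow, so the tolerance has to be chosen (or widened and retried) with that in mind.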
From this SO question:
import numpy as np
def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)
    return np.argmin(dist_2)
where node is a tuple with two values (x, y) and nodes is an array of tuples with two values ([(x_1, y_1), (x_2, y_2),])
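In case the einsum call looks opaque: 'ij,ij->i' multiplies the two arrays element by element and sums over j, which gives the squared distance for each row. A small check, using sample points I made up:

```python
import numpy as np

nodes = np.asarray([(10, 10), (20, 20), (30, 30), (40, 40)])
node = (21, 21)

deltas = nodes - node
# Row-wise dot product of deltas with itself, i.e. the squared distance per row.
squared = np.einsum('ij,ij->i', deltas, deltas)
# The same quantity written without einsum.
same = (deltas ** 2).sum(axis=1)

print(squared)             # [242   2 162 722]
print(np.argmin(squared))  # 1
```

As with the other squared-distance approach, no square root is needed to find the index of the nearest node.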
@Juddling's answer is great, but KDTree does not support the haversine distance, which is better suited for latitude/longitude coordinates.
For the haversine distance you can use BallTree. Note that you need to convert your coordinates to radians first.
from math import radians
from sklearn.neighbors import BallTree
import numpy as np
airports = [(10,10),(20,20),(30,30),(40,40)]
airports_rad = np.array([[radians(x[0]), radians(x[1])] for x in airports ])
tree = BallTree(airports_rad , metric = 'haversine')
result = tree.query([(radians(21),radians(21))])
print(result)
gives
(array([[0.02391369]]), array([[1]], dtype=int64))
To convert the distance to meters you need to multiply by the earth radius (in meters).
earth_radius = 6371000 # mean Earth radius in meters
print(result[0][0] * earth_radius)
[152354.11114795]
I have 2 sets of 2D points (A and B); each set has about 540 points. I need to find the points in set B that are farther than a defined distance alpha from all the points in A.
I have a solution, but it is not fast enough:
# for each new point, find the closest point in the target set
def find_closest_point(self, A, B):
    outliers = []
    for i in range(len(B)):
        # find all the Euclidean distances from B[i] to the points in A
        temp = distance.cdist([B[i]], A)
        minimum = numpy.min(temp)
        # if the point is too far away from the rest, it is considered an outlier
        if minimum > self.alpha:
            outliers.append([i, B[i]])
        else:
            continue
    return outliers
I am using python 2.7 with numpy and scipy. Is there another way to do this that I may gain a considerable increase in speed?
Thanks in advance for the answers
>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> A = np.random.randn(540, 2)
>>> B = np.random.randn(540, 2)
>>> alpha = 1.
>>> ind = np.all(cdist(A, B) > alpha, axis=0)
>>> outliers = B[ind]
gives you the points you want.
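If you have a very large set of points, you could calculate the x and y bounds of A, subtract alpha from the minimums and add alpha to the maximums, and then skip the distance calculation for every point in B that lies outside that boundary: such a point is necessarily farther than alpha from all the points in A.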
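A rough sketch of that bounding-box shortcut with plain NumPy follows. The function name and the sample data are mine, and it assumes A is non-empty.

```python
import numpy as np

def outliers_with_bbox_prefilter(A, B, alpha):
    """Return indices of the points in B that are farther than alpha from every point in A."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Bounding box of A, grown by alpha on every side.
    low = A.min(axis=0) - alpha
    high = A.max(axis=0) + alpha
    inside = np.all((B >= low) & (B <= high), axis=1)
    # Anything outside the grown box is farther than alpha from all of A.
    is_outlier = ~inside
    # Only the points inside the box need real distance calculations.
    candidates = B[inside]
    squared = ((candidates[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    is_outlier[inside] = squared.min(axis=1) > alpha ** 2
    return np.flatnonzero(is_outlier)

A = [[0.0, 0.0], [1.0, 1.0]]
B = [[0.5, 0.5], [10.0, 10.0], [1.9, 1.9], [0.0, 2.5]]
print(outliers_with_bbox_prefilter(A, B, alpha=1.0))  # [1 2 3]
```

How much this helps depends on how many points of B fall outside the box; if nearly all of them are inside, the plain cdist version does about the same work.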