What is the fastest way of determining which point q out of n points in 2D space is closest (smallest Euclidean distance) to point p? See attached image.
My current method of doing this in Python is storing all the distances in a list and then running
numpy.argmin(list_of_distances)
This is, however, a bit slow when calculating it for m points p. Or is it?
Instead of calculating the distances, you could calculate the squared distances. That way you don't need to perform n * m square roots.
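A minimal sketch of that idea, assuming a hypothetical (n, 2) NumPy array pts of candidate points and a length-2 array p:
import numpy as np

# sqrt is monotonic, so the argmin over squared distances
# equals the argmin over true distances
sq_dist = ((pts - p) ** 2).sum(axis=1)
i_closest = np.argmin(sq_dist)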
This falls under closest-point-query problems.
How many points are expected? Are your points static, or do they change? One naive but powerful approach for static points would be to pre-compute every known distance, giving O(1) lookups.
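As a sketch of that pre-computation, assuming hypothetical arrays P (m query points) and Q (n static candidate points), each of shape (*, 2):
from scipy.spatial.distance import cdist

D = cdist(P, Q)              # (m, n) distance matrix, computed once up front
nearest = D.argmin(axis=1)   # after that, each query's nearest point is a cheap lookup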
Put everything into NumPy as early as possible and do the calculations there. If you have many points, it is much faster than computing distances with lists:
import numpy as np

# px, py are the coordinates of the query point p
x = np.fromiter((point.x for point in points), dtype=float)
y = np.fromiter((point.y for point in points), dtype=float)
i_closest = np.argmin((x - px) ** 2 + (y - py) ** 2)
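For the m-points case from the question, a hedged extension via broadcasting (P here is a hypothetical (m, 2) array of query points):
Q = np.column_stack((x, y))                          # (n, 2) candidate points
diff = Q[None, :, :] - P[:, None, :]                 # (m, n, 2) pairwise differences
i_closest = (diff ** 2).sum(axis=2).argmin(axis=1)   # nearest index per query point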
I have a polymer with coordinates stored in an Nx3 NumPy array, where N is the number of particles in the polymer (the degree of polymerization).
I am trying to calculate the hydrodynamic radius of this polymer. The hydrodynamic radius is given by the first expression found in this link. The hydrodynamic radius Rh is essentially a harmonic average over the pairwise distances.
Given that P is the Nx3 array, this is my current numpy-pythonic implementation:
inv_dist = 0
for i in range(N - 1):
    for j in range(i + 1, N):
        inv_dist += 1 / np.linalg.norm(P[i, :] - P[j, :], 2)
Rh = 1 / (inv_dist / N**2)
np is numpy in this case. I am aware that the Wikipedia formula asks for an ensemble average, which would mean looping over every possible configuration of the polymer in my simulation. In any event, the two loops above would still be computed.
This is a nested for loop with N*(N-1)/2 iterations. As N gets large, this computation becomes increasingly taxing. How can I vectorize this code to bypass the for loops, at least to an extent?
I would appreciate any advice you have for me.
You can use scipy.spatial.distance.pdist:
from scipy.spatial.distance import pdist
inv_dist = (1/pdist(P)).sum()
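pdist(P) returns the condensed distance matrix: exactly the N*(N-1)/2 pairwise distances your double loop visits, so the final value follows as in your original code:
Rh = 1 / (inv_dist / N**2)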
This is my data set:
https://pastebin.com/SsuKP2eH
I'm trying to find the nearest point for all points in the data set. These points are latitude and longitude on the Earth's surface. Of course, the nearest point cannot be the same point.
I tried the KDTree solutions listed in this post: https://stackoverflow.com/a/45128643 and changed the poster's random points (generated by np.random.uniform) to my own data set.
I expected to get an array full of distances, but instead I got an array full of zeroes with some numbers like 2.87722e-06 and 0.616582 sprinkled in. This wasn't what I wanted. I tried the other solution, NearestNeighbors, on my data set and got the same result. So I did some debugging and reduced the range of the random numbers he used, making them closer to my own data set.
import numpy as np
import scipy.spatial as spatial
import pandas as pd

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length.
        """
        central_angle = 2 * np.arcsin(chord_length / (2.0 * R))
        arclength = R * central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y', 'z']])
    distance, index = tree.query(data[['x', 'y', 'z']], k=2)
    return dist_to_arclength(distance[:, 1])
    #return distance, index

np.random.seed(2017)
N = 1000
#data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 'Longitude':np.random.uniform(0, 360,size=N)})
data = pd.DataFrame({'Latitude':np.random.uniform(-49.19,49.32,size=N), 'Longitude':np.random.uniform(-123.02, -123.23,size=N)})
result = using_kdtree(data)
I found that the resulting distances array had small values, close to 0. This makes me believe the result array for my data set is full of zeroes because the differences between points are very small: somewhere, the KDTree/nearest-neighbours search loses precision and outputs garbage. Is there a way to make them keep the precision of my floats? The brute-force method keeps the precision, but it is far too slow with 7200 points to iterate through.
I think what's happening is that k=2 in
distance, index = tree.query(data[['x', 'y','z']], k=2)
tells KDTree you want the two closest points to each point. The closest is obviously the point itself, with distance zero from itself. Also, if you print index you see an Nx2 array where each row starts with the number of that row: this is KDTree's way of saying that the closest point to the i-th point is the i-th point itself.
Obviously that is not useful, and you probably want only the 2nd closest point. Fortunately, I found this in the documentation of the k parameter of query:
Either the number of nearest neighbors to return, or a list of the
k-th nearest neighbors to return, starting from 1.
https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query.html#scipy.spatial.KDTree.query
So
distance, index = tree.query(data[['x', 'y','z']], k=[2])
gives only the distance and index of the 2nd closest point.
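For reference, a minimal sketch of both variants, using the tree from the question's code:
# k=2: column 0 is each point's zero distance to itself,
# so the nearest *other* point is in column 1
distance, index = tree.query(data[['x', 'y', 'z']], k=2)
nearest_other = distance[:, 1]

# k=[2]: only the 2nd-nearest neighbour is returned (arrays of shape (N, 1))
distance, index = tree.query(data[['x', 'y', 'z']], k=[2])
nearest_other = distance[:, 0]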
I have been trying for a while now to calculate the Euclidean and Minkowski distance between all the vectors in a list of lists. I don't have much advanced mathematical knowledge.
I usually work with 4- or 5-dimensional vectors.
The vector list can range in size from 0 to around 200,000 vectors.
When calculating the distance, all the vectors have the same number of dimensions.
I have relied on these two questions during the process:
python numpy euclidean distance calculation between matrices of row vectors
Calculate Euclidean Distance between all the elements in a list of lists python
At first my code looked like this:
import numpy as np

def euclidean_distance_np(vec_list, single_vec):
    dist = (np.array(vec_list) - single_vec) ** 2
    dist = np.sum(dist, axis=1)
    dist = np.sqrt(dist)
    return dist

def minkowski_distance_np(vec_list, single_vec, p_val):
    dist = (np.abs(np.array(vec_list, dtype=np.int64) - single_vec) ** p_val).sum(axis=1) ** (1 / p_val)
    return dist
This worked well when I had a small number of vectors: I would calculate the distance of a single vector to all the vectors in the list and repeat the process for every vector, one by one. But once the list became 5 or 6 digits in length, these functions became extremely slow.
I managed to improve the Euclidean distance calculation like so:
x = np.array([v[0] for v in vec_list])
y = np.array([v[1] for v in vec_list])
z = np.array([v[2] for v in vec_list])
w = np.array([v[3] for v in vec_list])
t = np.array([v[4] for v in vec_list])
res = np.sqrt(np.square(x - x.reshape(-1, 1))
              + np.square(y - y.reshape(-1, 1))
              + np.square(z - z.reshape(-1, 1))
              + np.square(w - w.reshape(-1, 1))
              + np.square(t - t.reshape(-1, 1)))
But I cannot figure out how to apply the same approach to the Minkowski distance.
So, to be precise, my question is: how can I calculate the Minkowski distance in a way similar to the code above?
I would also appreciate any ideas for improvement or better ways to perform the calculations.
SciPy already implements the distance functions you need: minkowski, euclidean. But what you probably need is cdist.
NumPy is a great tool for matrix manipulation, but it doesn't contain every possible function. You can find most of the additional features and operations in SciPy, which is more oriented towards mathematics, science, and engineering.
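A minimal sketch, assuming vec_list is the list of equal-length vectors from the question:
import numpy as np
from scipy.spatial.distance import cdist

vecs = np.asarray(vec_list, dtype=np.float64)
res_euclidean = cdist(vecs, vecs)                    # all-pairs Euclidean, shape (k, k)
res_minkowski = cdist(vecs, vecs, 'minkowski', p=3)  # all-pairs Minkowski with p = 3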
I am working with two NumPy matrices, U (dimensions Nu x 3) and M (dimensions 3 x Nm).
U contains Nu users and 3 features.
M contains Nm movies (and the same 3 features).
For each user in U, I would like to calculate its Euclidean distance to every movie in M (so I need to compute Nu*Nm Euclidean distances).
Is this possible without an explicit double for-loop? I am working with large matrices, and the double for-loop will probably take too much time.
Thanks in advance.
Check out scipy.spatial.distance.cdist. Something like this will do:
from scipy.spatial.distance import cdist
dist = cdist(U, M.T)
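Here M.T turns M into an Nm x 3 array so that both inputs have their 3 features along the columns; the result dist then has shape (Nu, Nm), with dist[i, j] being the Euclidean distance from user i to movie j.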
I'm afraid not. You need to compute the Euclidean distance for every (user, movie) pair, so you'll have a time complexity of numOfUsers * numOfMovies, which is exactly what a double for-loop does. You can't do fewer operations than that unless you're willing to skip some pairs. The best you can do is optimize the individual distance calculation, but the number of operations will be quadratic one way or the other.
I have a list of n polar coordinates, and a distance function which takes in two coordinates.
I want to create an n x n matrix which contains the pairwise distances under my function. I realize I probably need to use some form of vectorization with numpy but am not sure exactly how to do so.
A simple code segment is below for your reference
import numpy as np
length = 10
coord_r = np.random.rand(length)*10
coord_alpha = np.random.rand(length)*np.pi
# Repeat vector to matrix form
coord_r_X = np.tile(coord_r, [length,1])
coord_r_Y = coord_r_X.T
coord_alpha_X = np.tile(coord_alpha, [length,1])
coord_alpha_Y = coord_alpha_X.T
# Law of cosines in polar coordinates:
# d^2 = r1^2 + r2^2 - 2*r1*r2*cos(alpha1 - alpha2)
matDistance = np.sqrt(coord_r_X**2 + coord_r_Y**2 - 2*coord_r_X*coord_r_Y*np.cos(coord_alpha_X - coord_alpha_Y))
print(matDistance)
You can use scipy.spatial.distance.pdist. However, if the distance you want is the Euclidean distance, you may be better off converting your points to rectangular coordinates first, since pdist will then do the calculations quite quickly using its built-in Euclidean distance.
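A minimal sketch of that conversion, assuming the coord_r and coord_alpha arrays from the question:
import numpy as np
from scipy.spatial.distance import pdist, squareform

x = coord_r * np.cos(coord_alpha)
y = coord_r * np.sin(coord_alpha)
points = np.column_stack((x, y))
matDistance = squareform(pdist(points))   # n x n Euclidean distance matrix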