Fastest way to find the minimum Euclidean distance between two arrays - Python

I have two arrays of x,y,z coordinates (e.g. a = [(x1,y1,z1)...(xN,yN,zN)], b = [(X1,Y1,Z1)...(XM,YM,ZM)]). For each point in a, I need the fastest way to find the index of the point in b with the minimum Euclidean distance to it. Here's the catch: I'm using a modified/weighted Euclidean equation. Currently I'm using two nested for loops, which is admittedly the slowest way to do it.
b typically has around 500 coordinate sets to choose from, but a can have tens to hundreds of thousands.
As an example:
a = (1,1,1), b = [(87,87,87),(2,2,2),(50,50,50)]
would return index 1.

You could build a k-d tree from array b and, for each coordinate in array a, find its nearest neighbor by traversing down the tree.
For array a of size n and array b of size m, the complexity would be O(m log m) for building the tree and, on average, O(n log m) for finding all the nearest neighbors.
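A minimal sketch of the k-d tree approach using SciPy's cKDTree. The question does not say what the weighted equation looks like, so this assumes the weighting is one non-negative weight per axis, i.e. d = sqrt(sum(w_i * (a_i - b_i)^2)); under that assumption, scaling every coordinate by sqrt(w_i) turns the problem into a plain Euclidean nearest-neighbor search. The function name nearest_indices and the weights parameter are illustrative, not from the original post.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_indices(a, b, weights=None):
    """For each point in a, return the index of the closest point in b.

    weights (assumed form): one non-negative weight per axis, so that
    d = sqrt(sum(w_i * (a_i - b_i)**2)).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if weights is not None:
        # Scaling by sqrt(w) makes plain Euclidean distance equal the weighted one.
        scale = np.sqrt(np.asarray(weights, dtype=float))
        a = a * scale
        b = b * scale
    tree = cKDTree(b)          # built once from the smaller array
    _, idx = tree.query(a)     # nearest neighbor in b for every row of a
    return idx


a = [(1, 1, 1)]
b = [(87, 87, 87), (2, 2, 2), (50, 50, 50)]
print(nearest_indices(a, b))  # [1]
```

If the modified metric cannot be written as a per-axis scaling, the tree no longer applies directly and a vectorized all-pairs computation may be the simpler option.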

Related

Selecting closest values by Euclidean distance from the mean from a numpy array

I'm sure there's a straightforward answer to this, but I'm very much a Python novice, and trawling Stack Overflow is getting me tantalisingly close but falling at the final hurdle, so apologies. I have an array of one-dimensional arrays (in reality composed of >2000 arrays, each of ~800 values), but for the sake of illustration:
group = [[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5],...]
I'm trying to select the nearest n 1-d arrays to the mean (by Euclidean distance), but I'm struggling to extract them from the original list. I can compute the distances and sort them, but can't then extract the corresponding arrays from the original group.
# Compute the mean
group_mean = group.mean(axis=0)
distances = []
for x in group:
    # Compute Euclidean distance from the mean
    distances.append(np.linalg.norm(x - group_mean))
# Sort distances
distances.sort()
print(distances[0:5])  # Prints the five nearest distances
Any advice as to how to select out the five (or whatever) arrays from group corresponding to the nearest distances would be much appreciated.
You can store each array alongside its distance, and sort on the distance to the mean:
import numpy as np
group = np.array([[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5]])
group_mean = group.mean(axis = 0)
distances = [[np.linalg.norm(x - group_mean),x] for x in group]
distances.sort(key=lambda a : a[0])
print(distances[0:5])  # Prints the five nearest [distance, array] pairs
If your arrays get larger, it might be wise to only save the index instead of the whole array:
distances = [[np.linalg.norm(x - group_mean),i] for i,x in enumerate(group)]
If you don't want to save the distances themselves, but just want to sort based on the distance, you can do this:
group = list(group)
group.sort(key=lambda x: np.linalg.norm(x - group_mean))
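For larger inputs, the same selection can be sketched without a Python loop: compute all distances in one call, then use np.argsort to get the row indices ordered by distance. The names n, nearest_idx and nearest_rows are illustrative.

```python
import numpy as np

group = np.array([[0, 1, 3, 4, 5], [0, 2, 3, 6, 7], [0, 4, 3, 2, 5]])
n = 2  # how many of the nearest rows to keep (illustrative value)

group_mean = group.mean(axis=0)
# One Euclidean distance per row, computed in a single vectorized call.
distances = np.linalg.norm(group - group_mean, axis=1)
# Indices of the rows, ordered from nearest to farthest from the mean.
nearest_idx = np.argsort(distances)[:n]
# Index the original array to recover the rows themselves.
nearest_rows = group[nearest_idx]
print(nearest_idx)
print(nearest_rows)
```

Keeping the indices rather than the sorted distances is what lets you go back to the original group.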

Python - Find closest indices from 2 sets

I have 2 sets of indices (i,j).
What I need to get is the 2 indices that are closest from the 2 sets.
It is easier to explain graphically:
Assuming I have all the indices that make the first black shape, and all the indices that make the second black shape, how do I find the closest indices (the red points in the figure) between those 2 shapes, in an efficient way (built in function in Python, not by iterating through all the possibilities)?
Any help will be appreciated!
As you asked about a built-in function rather than looping through all combinations, there's a function in scipy.spatial.distance that does just that: cdist outputs a matrix of distances between all pairs of points from the 2 inputs. If A and B are collections of 2D points, then:
from scipy.spatial import distance
dists = distance.cdist(A,B)
Then you can get the index of the minimal value in the matrix.
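One way to do that last step, as a small sketch: np.argmin returns a flat index into the distance matrix, and np.unravel_index converts it back to a (row, column) pair, i.e. an index into A and an index into B. The sample points below are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Two small, made-up sets of 2D points standing in for the two shapes.
A = np.array([[0, 0], [0, 1], [1, 0]])
B = np.array([[5, 5], [2, 1], [6, 7]])

dists = distance.cdist(A, B)  # dists[i, j] = distance from A[i] to B[j]
# Flat position of the smallest entry, converted back to (i, j).
i, j = np.unravel_index(np.argmin(dists), dists.shape)
closest_pair = (A[i], B[j])
print(i, j, dists[i, j])
```

Note that cdist materializes a len(A) x len(B) matrix, so for very large shapes memory use may become the limiting factor.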

Efficient pairwise sum

I am looking for a way to optimize the following code. It computes all the possible pairwise sums of the elements in an array:
import numpy as np
from itertools import combinations
N = 5000
a = np.random.rand(N)
c = ([a[i]+a[j] for i,j in combinations(range(N),2)])
This is relatively slow. I could get much better performance using the following:
b = a+a[:,None]
c = b[np.triu_indices(N,1)]
yet it still seems largely non-optimized: computing the full matrix b is inefficient because half of it ends up useless, and extracting its upper part (omitting the diagonal) is actually even slower than computing b.
Is there a way to do this faster? This makes me think of a similar problem, computing pairwise distances between points (Fastest pairwise distance metric in python) but I don't know if there is a way to do something similar here using scipy.
edit: I would like to keep the order of the sums, so that the result contains the sums for the index pairs in this order:
[(0,1), (0,2), ... (0,N-1), (1,2), .... (1,N-1), (2,3), ...],
i.e. the order you would get from a double for loop: for 0 <= i < N, for i < j < N
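One possible variation, offered as a sketch rather than a benchmarked answer: index a directly with the upper-triangle index arrays, which avoids materializing the full sum matrix b and preserves the requested order. Whether it is actually faster for N = 5000 is something to measure; np.triu_indices still has to produce two index arrays of length N(N-1)/2.

```python
import numpy as np

N = 6  # small value for illustration; the question uses N = 5000
a = np.random.rand(N)

# Row and column indices of the strict upper triangle, in row-major order:
# (0,1), (0,2), ..., (0,N-1), (1,2), ...
i, j = np.triu_indices(N, 1)
# Pairwise sums in that same order, without building a + a[:, None].
c = a[i] + a[j]
print(c.shape)  # (15,)
```

If the same N is reused many times, the index arrays i and j can be computed once and cached.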

How to get the n-th nearest neighbor for each point in a NumPy array from the same array?

I have a numpy array with shape (291336, 50). i.e. there are 291336 points where each point has 50 dimensions.
For each point in this array, I want to find the distance and index of its kth nearest neighbor by distance, belonging to the same array. I found this related question, but it finds the 1st nearest neighbor not the kth.
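I have thought about using this brute-force approach:
import numpy as np
for i in range(X.shape[0]):
    distance_from_i = {}
    for j in range(X.shape[0]):
        # store distance & index of j from i in distance_from_i
        distance_from_i[j] = np.linalg.norm(X[i] - X[j])
    # sort distance_from_i and select the k'th point (position 0 is i itself)
    kth_index = sorted(distance_from_i, key=distance_from_i.get)[k]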
But I know it's terrible. There must be a better way.
How do I solve this problem?
What about sorting them once by distance from np.zeros(50)?
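A different approach, sketched here with SciPy's cKDTree, is to query a spatial index built from the array itself. This assumes plain Euclidean distance and k >= 1; the helper name kth_nearest_neighbor is illustrative. Because every point is its own nearest neighbor at distance 0, the query asks for k + 1 neighbors and keeps the last one. Be aware that k-d trees tend to lose their advantage in high dimensions such as 50, so the running time on the full array may be closer to brute force than to the low-dimensional case.

```python
import numpy as np
from scipy.spatial import cKDTree


def kth_nearest_neighbor(X, k):
    """Return (distances, indices) of the k-th nearest neighbor of each row of X.

    Assumes k >= 1 and Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    tree = cKDTree(X)
    # Ask for k + 1 neighbors: the closest one is the query point itself.
    dists, idxs = tree.query(X, k=k + 1)
    # The last column holds the k-th neighbor once the point itself is skipped.
    return dists[:, -1], idxs[:, -1]


rng = np.random.default_rng(0)
X = rng.random((1000, 50))  # small stand-in for the (291336, 50) array
dist_k, idx_k = kth_nearest_neighbor(X, k=3)
print(dist_k.shape, idx_k.shape)
```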

Find the distance of each pair between two vectors

I have two vectors, let's say x=[2,4,6,7] and y=[2,6,7,8] and I want to find the euclidean distance, or any other implemented distance (from scipy for example), between each corresponding pair. That will be
dist=[0, 2, 1, 1].
When I try
dist = scipy.spatial.distance.cdist(x,y, metric='sqeuclidean')
or
dist = [scipy.spatial.distance.cdist(x,y, metric='sqeuclidean') for x,y in zip(x,y)]
I get
ValueError: XA must be a 2-dimensional array.
How am I supposed to calculate dist and why do I have to reshape data for that purpose?
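cdist does not compute the list of distances between corresponding pairs, but the matrix of distances between all pairs. It expects 2-dimensional inputs with one point per row, which is why passing 1-dimensional lists raises the ValueError.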
np.linalg.norm((np.asarray(x)-np.asarray(y))[:, None], axis=1)
is how I'd typically write this for the Euclidean distance between n-dimensional points; but if you are only dealing with 1-dimensional points, the absolute difference, as suggested by elpres, would be simpler.
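Both variants can be sketched side by side. The 1-dimensional case uses the vectors from the question; the 2-dimensional points P and Q are made up for illustration, with one point per row.

```python
import numpy as np

# 1-dimensional points: the distance between corresponding entries
# is just the absolute difference.
x = np.array([2, 4, 6, 7])
y = np.array([2, 6, 7, 8])
dist_1d = np.abs(x - y)  # [0 2 1 1]

# n-dimensional points (one per row): take the norm along each row.
P = np.array([[0.0, 0.0], [1.0, 1.0]])
Q = np.array([[3.0, 4.0], [1.0, 2.0]])
dist_nd = np.linalg.norm(P - Q, axis=1)  # [5. 1.]

print(dist_1d, dist_nd)
```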
