Argmin with the Euclidean distance condition - python

In Python, I have a vector v of 300 elements and an array arr of 20k 300-dimensional vectors. How do I quickly get the indices of the k elements of arr that are closest to v?

You can do this with NumPy (the example below finds the single closest vector):
import numpy as np
v = np.array([[1,1,1,1]])
arr = np.array([
[1,1,1,1],
[2,2,2,2],
[3,3,3,3]
])
dist = np.linalg.norm(v - arr, axis=1) # Euclidean distance
min_distance_index = np.argmin(dist) # Find index of minimum distance
closest_vector = arr[min_distance_index] # Get vector having minimum distance
closest_vector
# array([1, 1, 1, 1])
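The snippet above returns only the single nearest row, while the question asks for the k nearest. One option is np.argpartition, which avoids sorting all 20k distances. This is a minimal sketch, assuming arr has shape (n, d) and v has shape (d,); the helper name k_nearest_indices is made up for illustration.

```python
import numpy as np

def k_nearest_indices(v, arr, k):
    """Return the indices of the k rows of arr closest to v, nearest first."""
    # Squared distances are enough for ranking, since sqrt is monotonic.
    diff = arr - v
    d2 = np.einsum('ij,ij->i', diff, diff)
    k = min(k, len(d2))
    # argpartition moves the k smallest entries to the front, in arbitrary order.
    idx = np.argpartition(d2, k - 1)[:k]
    # Sort only those k candidates by distance.
    return idx[np.argsort(d2[idx])]

arr = np.array([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [1, 1, 1, 2]])
v = np.array([1, 1, 1, 1])
nearest = k_nearest_indices(v, arr, 2)
```

If the actual distances are needed as well, take np.sqrt of the squared distances for the returned indices only.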

Since 300 is a very small number, sorting all the elements and then just taking the first k is not an expensive operation (usually; it depends on how many thousand times per second you need to do this).
So sorted() is your friend: use the key= keyword argument, sorted_arr = sorted(arr, key=…), to implement sorting by Euclidean distance to v.
Then use the classic array[:end] slice syntax to select the first k.
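The sorted() suggestion could look roughly like this. Sorting the row indices rather than the rows themselves yields the indices the question asks for. The data and variable names are illustrative only.

```python
import numpy as np

arr = np.array([[3, 3, 3, 3], [1, 1, 1, 2], [2, 2, 2, 2], [1, 1, 1, 1]])
v = np.array([1, 1, 1, 1])
k = 2

# Sort the row indices by the Euclidean distance of each row to v.
order = sorted(range(len(arr)), key=lambda i: np.linalg.norm(arr[i] - v))
# Keep the first k indices.
closest = order[:k]
```

This calls np.linalg.norm once per row from Python, so for 20k rows a vectorized distance computation is likely to be faster.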

Related

Selecting closest values by Euclidian distance from the mean from a numpy array

I'm sure there's a straightforward answer to this, but I'm very much a Python novice and trawling stackoverflow is getting me tantalisingly close but falling at the final hurdle, so apologies. I have an array of one dimensional arrays (in reality composed of >2000 arrays, each of ~800 values), but for representation sake:
group = [[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5],...]
I'm trying to select the nearest n 1-d arrays to the mean (by Euclidian distance), but struggling to extract them from the original list. I can figure out the distances and sort them, but can't then extract them from the original group.
# Compute the mean
group_mean = group.mean(axis = 0)
distances = []
for x in group:
    # Compute Euclidian distance from the mean
    distances.append(np.linalg.norm(x - group_mean))
# Sort distances
distances.sort()
print(distances[0:5]) # Prints the five nearest distances
Any advice as to how to select out the five (or whatever) arrays from group corresponding to the nearest distances would be much appreciated.
You can store each array alongside its distance and sort on the distance to the mean:
import numpy as np
group = np.array([[0,1,3,4,5],[0,2,3,6,7],[0,4,3,2,5]])
group_mean = group.mean(axis = 0)
distances = [[np.linalg.norm(x - group_mean),x] for x in group]
distances.sort(key=lambda a : a[0])
print(distances[0:5]) # Prints the five nearest [distance, array] pairs
If your arrays get larger, it might be wise to only save the index instead of the whole array:
distances = [[np.linalg.norm(x - group_mean),i] for i,x in enumerate(group)]
If you don't want to save the distances themselves, but just want to sort based on the distance, you can do this:
group = list(group)
group.sort(key=lambda x: np.linalg.norm(x - group_mean))
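An alternative that stays in NumPy is to compute all the distances at once and use np.argsort to recover the row order. This is a sketch on the small example data; n is a hypothetical name for the number of arrays wanted.

```python
import numpy as np

group = np.array([[0, 1, 3, 4, 5], [0, 2, 3, 6, 7], [0, 4, 3, 2, 5]])
n = 2  # how many of the nearest arrays to keep

group_mean = group.mean(axis=0)
# One distance per row, computed without a Python loop.
distances = np.linalg.norm(group - group_mean, axis=1)
# Row indices ordered from nearest to farthest.
order = np.argsort(distances)
nearest_rows = group[order[:n]]
nearest_distances = distances[order[:n]]
```

Because order holds indices into group, the original arrays and their distances stay aligned.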

Get list of X minimum distances by their indices

I have a huge matrix (think 20000 x 1000) called Z that I need to generate the pairwise distance from so I'm currently using sklearn.metrics.pairwise.euclidean_distances(Z,Z) to generate the pairwise distances.
However, now I need to search through the result to find the smallest X distances but I need their indices.
An example would be:
A = 20000 x 1000 numpy.ndarray
B = sklearn.metrics.pairwise.euclidean_distances(A, A)
C = ((2400,100), (800,900), (29,999)) if X = 3
What would be the best way to go about doing this? I saw numpy.unravel_index(a.argmax(), a.shape) but I'm not sure it would work well for this instance.
You can use np.triu_indices to generate the indices that correspond to entries of the compressed distance matrix.
import numpy as np
from scipy.spatial.distance import pdist
# Generate points
Z = np.random.normal(0, 1, (1000, 3))
# Compute euclidean distance
distance = pdist(Z)
# Get the smallest distance
min_distance = np.min(distance)
# Get the indices (k = 1 to omit diagonal entries)
idx = np.asarray(np.triu_indices(len(Z), 1))
# Filter the indices (this also works when the minimum distance is not unique)
idx = idx[:, distance == min_distance]
If you know that there is exactly one minimum distance, you could also use
idx = idx[:, np.argmin(distance)]
which is slightly more efficient.
EDIT: To get all index pairs sorted by distance, apply the following to the unfiltered idx (that is, in place of the filtering step above):
idx = idx[:, np.argsort(distance)]
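If only the X smallest distances are needed, a full argsort can be avoided with np.argpartition. The sketch below assumes the condensed output of pdist, whose ordering matches np.triu_indices(n, 1). The random data and the variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
Z = rng.normal(0, 1, (200, 3))
X = 3  # number of closest pairs wanted

distance = pdist(Z)                       # condensed pairwise distances
rows, cols = np.triu_indices(len(Z), 1)   # same ordering as pdist

# Partition so the X smallest distances come first, then sort just those.
cand = np.argpartition(distance, X - 1)[:X]
cand = cand[np.argsort(distance[cand])]

# List of (i, j, distance) tuples, closest pair first.
closest_pairs = [(int(rows[c]), int(cols[c]), float(distance[c])) for c in cand]
```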

fastest way to get closest 10 euclidean neighbors of large feature vector in python

I have a numpy array that has 10,000 vectors with 3,000 elements in each. I want to return the top 10 indices of the closest pairs with the distance between them. So if row 5 and row 7 have the closest euclidean distance of 0.005, and row 8 and row 10 have the second closest euclidean distance of 0.0052 then I want to return [(8,10,.0052),(5,7,.005)]. The traditional for loop method is very slow. Is there an alternative quicker approach for a way to get euclidean neighbors of large features vectors (stored as np array)?
I'm doing the following:
l = []
for i in range(0,M.shape[0]):
    for j in range(0,M.shape[0]):
        if i != j and i > j:
            l.append( (i,j,euc(M[i],M[j])) )
return l
Here euc is a function to calculate euclidean distances between two vectors of a matrix using scipy.
Then I sort l and pull out the top 10 closest distances
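import numpy as np
def topTen(M):
    # All index pairs (i, j) with i < j
    i, j = np.triu_indices(M.shape[0], 1)
    # Squared distance for every pair
    diff = M[i] - M[j]
    dist_sq = np.einsum('ij,ij->i', diff, diff)
    # Indices of the 10 smallest squared distances (unordered), then sort them
    min_i = np.argpartition(dist_sq, 10)[:10]
    min_o = np.argsort(dist_sq[min_i])
    return np.vstack((i[min_i][min_o], j[min_i][min_o], dist_sq[min_i][min_o]**.5)).T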
This should be pretty fast as it only does sorting and the square root on the top 10, which are the long steps (outside of the looping).
I'll post this as an answer, but I admit it is not a real solution to the question, because it will only work for smaller arrays. The problem is that if you want to be really fast and avoid loops, you need to compute all the pairwise differences at once, and that implies a memory complexity on the order of the square of the input... Let's say 10,000 rows * 10,000 rows * 3,000 elems/row * 4 bytes/elem (say we're using float32) ≈ 1TB (!) of memory required (actually maybe twice that, because you probably need a couple of arrays that size). So while it is possible, it is not practical with these kinds of sizes. The following code shows how you could implement that (with sizes divided by 100).
import numpy as np
# Row length
n = 30
# Number of rows
m = 100
# Number of top elements
k = 10
# Input data
data = np.random.random((m, n))
# Tile the data in two different dimensions
data1 = np.tile(data[:, :, np.newaxis], (1, 1, m))
data2 = np.tile(data.T[np.newaxis, :, :], (m, 1, 1))
# Compute pairwise squared distances
dist = np.sum(np.square(data1 - data2), axis=1)
# Fill lower half with inf to avoid repeat and self-matching
dist[np.tril_indices(m)] = np.inf
# Find the k smallest distances over the whole upper triangle
# (taking only one minimum per row could miss close pairs that share a row)
flat = np.argpartition(dist, k, axis=None)[:k]
# Order those k candidates from closest to farthest
flat = flat[np.argsort(dist.ravel()[flat])]
# Top K indices pairs (K x 2 matrix)
top_idx = np.stack(np.unravel_index(flat, dist.shape), axis=1)
# Top K smallest distances
top_dist = np.sqrt(dist.ravel()[flat])
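The memory estimate above applies to the tiled (m, n, m) intermediate arrays. A possible way around it, sketched here under the assumption that an m x m result matrix does fit in memory, is to expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so that only matrices of shape (m, m) are ever created. The resulting dist_sq can take the place of dist in the code above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 30
data = rng.random((m, n))

# Squared norm of each row.
sq = np.einsum('ij,ij->i', data, data)
# Expanded form of the squared distance; the largest array is (m, m).
dist_sq = sq[:, None] + sq[None, :] - 2.0 * (data @ data.T)
# Rounding can leave tiny negative values; clip them before any square root.
np.maximum(dist_sq, 0.0, out=dist_sq)
```

The expanded form is less accurate than subtracting first when points are very close together relative to their norms, so it is worth checking against a direct computation on a sample.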

Compute numpy array pairwise Euclidean distance except with self

edit: this question is not specifically about calculating distances, rather the most efficient way to loop through a numpy array, specifying that for index i all comparisons should be made with the rest of the array, as long as the second index is not i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy
# Point feature class
fc = "MY_FEATURE_CLASS"
# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc,["SHAPE#X","SHAPE#Y","OID#"])
for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow((row[0]-row2[0]),2)+math.pow((row[1]-row2[1]),2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!
Using SciPy's pdist, you could write something like
import numpy as np
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: np.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
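A variant that avoids the Python-level lambda is to pass only the coordinate columns to pdist and mask the diagonal afterwards. This sketch assumes the X and Y values have been copied into a plain (N, 2) float array, here given the hypothetical name xy; the structured array returned by arcpy would need converting first.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in for the X, Y columns of the feature class.
xy = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# The default Euclidean metric runs in compiled code.
distances = squareform(pdist(xy))
# Mark self-comparisons so they are skipped by nan-aware reductions.
np.fill_diagonal(distances, np.nan)
# Example: distance from each point to its nearest other point.
nearest_other = np.nanmin(distances, axis=1)
```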
Each row of X is a 3-dimensional data instance or point.
The output pairwisedist[i, j] is the squared distance between X[i, :] and X[j, :]; take np.sqrt of it to get the Euclidean distance.
import numpy as np
X = np.array([[6,1,7],[10,9,4],[13,9,3],[10,8,15],[14,4,1]])
a = np.sum(X*X,1)
b = np.repeat( a[:,np.newaxis],len(X),axis=1)
pairwisedist = b + b.T -2* X.dot(X.T)
I wanted to point out that a hand-written square root of a sum of squares is prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance.
import math
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: math.hypot(a[0]-b[0], a[1]-b[1])))
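The overflow point can be seen on a small example. The values below are chosen only to push the intermediate squares past the float64 range.

```python
import numpy as np

dx = np.array([3.0, 1e200])
dy = np.array([4.0, 1e200])

# Squaring 1e200 exceeds the float64 range, so the naive form returns inf.
with np.errstate(over='ignore'):
    naive = np.sqrt(dx**2 + dy**2)

# np.hypot is designed to avoid the intermediate overflow and stays finite.
safe = np.hypot(dx, dy)
```

For ordinary coordinate values both forms agree; the difference only appears near the limits of the floating-point range.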

python numpy euclidean distance calculation between matrices of row vectors

I am new to Numpy and I would like to ask you how to calculate euclidean distance between points stored in a vector.
Let's assume that we have a numpy.array in which each row is a vector, and a single numpy.array holding one point. I would like to know whether it is possible to calculate the Euclidean distance between all the points and this single point and store the results in one numpy.array.
Here is an interface:
points #2d list of row-vectors
singlePoint #one row-vector
listOfDistances= procedure( points,singlePoint)
Can we have something like this?
Or is there a single command that takes a list of other points in place of the single point and returns a matrix of distances?
Thanks
To get the distance you can use the norm method of the linalg module in numpy:
np.linalg.norm(x - y)
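For many points against one point, the same function can be applied row by row through its axis argument. This is a small sketch with made-up data, using the variable names from the question's interface.

```python
import numpy as np

points = np.arange(20).reshape((10, 2))   # one point per row
single_point = np.array([3, 4])

# Broadcasting subtracts single_point from every row;
# axis=1 then gives one norm per row.
list_of_distances = np.linalg.norm(points - single_point, axis=1)
```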
While you can use vectorize, @Karl's approach will be rather slow with numpy arrays.
The easier approach is to just do np.hypot(*(points - single_point).T). (The transpose assumes that points is an Nx2 array, rather than a 2xN. If it's 2xN, you don't need the .T.)
However, this is a bit unreadable, so you can write it out more explicitly like this (using some canned example data...):
import numpy as np
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
dist = (points - single_point)**2
dist = np.sum(dist, axis=1)
dist = np.sqrt(dist)
import numpy as np
def distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2) ** 2))
To apply a function to each element of a numpy array, try numpy.vectorize.
To do the actual calculation, we need the square root of the sum of squares of differences (whew!) between pairs of coordinates in the two vectors.
We can use zip to pair the coordinates, and sum with a comprehension to sum up the results. That looks like:
sum((x - y) ** 2 for (x, y) in zip(singlePoint, pointFromArray)) ** 0.5
import numpy as np
# Define the function before it is called
def euclid_dist(t1, t2):
    return np.sqrt(((t1-t2)**2).sum(axis = 1))
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
distance = euclid_dist(single_point,points)
