Python / NumPy: Euclidean distance calculation between matrices of row vectors

I am new to NumPy and I would like to ask how to calculate the Euclidean distance between points stored in a vector.
Let's assume we have a numpy.array in which each row is a vector, plus a single numpy.array holding one point. Is it possible to calculate the Euclidean distance between all of those points and this single point, and store the results in one numpy.array?
Here is an interface:
points        # 2D array of row vectors
singlePoint   # one row vector
listOfDistances = procedure(points, singlePoint)
Can we have something like this?
Or is it possible to pass the single point in as another list of points, so that in one command we get back a whole matrix of distances?
Thanks

To get the distance between two points you can use the norm function from NumPy's linalg module:
np.linalg.norm(x - y)
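For the question's actual setup (every row of points against one single point), a minimal sketch assuming points is an (N, d) array:

import numpy as np

points = np.arange(20).reshape((10, 2))      # N row vectors (example data)
single_point = np.array([3, 4])              # the single point
# Broadcasting subtracts single_point from every row; axis=1 gives one norm per row.
list_of_distances = np.linalg.norm(points - single_point, axis=1)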

While you can use vectorize, @Karl's approach will be rather slow with numpy arrays.
The easier approach is to just do np.hypot(*(points - single_point).T). (The transpose assumes that points is an Nx2 array rather than 2xN. If it's 2xN, you don't need the .T.)
However, this is a bit unreadable, so you can write it out more explicitly like this (using some canned example data...):
import numpy as np
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
dist = (points - single_point)**2
dist = np.sum(dist, axis=1)
dist = np.sqrt(dist)
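For reference, the np.hypot one-liner from above gives the same values on this example data:

dist_hypot = np.hypot(*(points - single_point).T)   # identical to dist computed step by step above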

import numpy as np

def distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2) ** 2))

To apply a function to each element of a numpy array, try numpy.vectorize.
To do the actual calculation, we need the square root of the sum of squares of differences (whew!) between pairs of coordinates in the two vectors.
We can use zip to pair the coordinates, and sum with a comprehension to sum up the results. That looks like:
sum((x - y) ** 2 for (x, y) in zip(singlePoint, pointFromArray)) ** 0.5
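A minimal sketch of applying that expression to every row with a plain list comprehension (a simple alternative to wiring it through numpy.vectorize), using the same example data as above:

import numpy as np

single_point = [3, 4]
points = np.arange(20).reshape((10, 2))
# zip pairs up the coordinates of the two vectors; the comprehension does this per row.
list_of_distances = np.array([
    sum((x - y) ** 2 for x, y in zip(single_point, p)) ** 0.5
    for p in points
])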

import numpy as np

def euclid_dist(t1, t2):
    return np.sqrt(((t1 - t2) ** 2).sum(axis=1))

single_point = [3, 4]
points = np.arange(20).reshape((10, 2))
distance = euclid_dist(single_point, points)

Related

Argmin with the Euclidean distance condition

In Python, I have a vector v of 300 elements and an array arr of 20k 300-dimensional vectors. How do I quickly get the indices of the k closest elements to v from the array arr?
You can do this task with NumPy:
import numpy as np
v = np.array([[1,1,1,1]])
arr = np.array([
    [1, 1, 1, 1],
    [2, 2, 2, 2],
    [3, 3, 3, 3]
])
dist = np.linalg.norm(v - arr, axis=1) # Euclidean distance
min_distance_index = np.argmin(dist) # Find index of minimum distance
closest_vector = arr[min_distance_index] # Get vector having minimum distance
closest_vector
# array([1, 1, 1, 1])
Since the array is fairly small, sorting all elements and then just taking the first k is not an expensive operation (usually; it depends on how many thousand times per second you need to do this).
So, sorted() is your friend; use the key= keyword argument, sorted_arr = sorted(arr, key=…), to sort by Euclidean distance.
Then, use the classic slicing syntax, array[:k], to select the first k.
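A NumPy-only sketch for the k-closest part, building on the distance computation above (k = 2 is a hypothetical choice):

import numpy as np

v = np.array([1, 1, 1, 1])
arr = np.array([[1, 1, 1, 1],
                [2, 2, 2, 2],
                [3, 3, 3, 3]])
k = 2                                          # hypothetical number of neighbours

dist = np.linalg.norm(arr - v, axis=1)         # distance from every row of arr to v
k_closest_indices = np.argsort(dist)[:k]       # indices of the k smallest distances
k_closest_vectors = arr[k_closest_indices]
# For large arrays, np.argpartition(dist, k)[:k] finds the same k indices without a full sort.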

vectorized / linear algebra distance between points?

Suppose I have an array of points,
import numpy as np
pts = np.random.rand(100,3) # 100 points, X, Y, Z along second dimension
The naive approach to calculate the distance between each combination of points involves a double for loop and will be excruciatingly slow for large numbers of points,
def euclidean_distance(p1, p2):
    d = p2 - p1
    return np.sqrt((d ** 2).sum())

out = np.empty((pts.shape[0], pts.shape[0]))
for idx, point in enumerate(pts):
    for idx2, point_inner in enumerate(pts):
        out[idx, idx2] = euclidean_distance(point, point_inner)
How do I vectorize this calculation?
Take a look at scipy.spatial.distance.cdist. I'm not sure, but I assume SciPy has optimized this quite a lot. If you use the pts array for both inputs, you'll get an M x M array with zeros on the diagonal.
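A minimal sketch of that cdist call for the pts array from the question, plus the equivalent pdist/squareform form that only computes each pair once:

import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

pts = np.random.rand(100, 3)            # 100 points, X, Y, Z along the second dimension
out = cdist(pts, pts)                   # (100, 100) Euclidean distance matrix, zeros on the diagonal
out2 = squareform(pdist(pts))           # same matrix, built from the condensed pairwise form
# np.allclose(out, out2) -> True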

Vectorise Python code

I have coded a kriging algorithm but I find it quite slow. In particular, do you have an idea of how I could vectorise the piece of code in the cons function below?
import time
import numpy as np

B = np.zeros((200, 6))
P = np.zeros((len(B), len(B)))

def cons():
    time1 = time.time()
    for i in range(len(B)):
        for j in range(len(B)):
            P[i, j] = corr(B[i], B[j])
    time2 = time.time()
    return time2 - time1

def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))))

time_av = 0.
for i in range(30):
    time_av += cons()
print("Average =", time_av / 30.)
Edit: Bonus questions
What happens to the broadcasting solution if I want corr(B[i], C[j]), with C having the same dimensions as B?
What happens to the scipy solution if my p-norm orders are an array:
p=np.array([1.,2.,1.,2.,1.,2.])
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i)) ** p))
For 2., I tried P = np.exp(-cdist(B, C,'minkowski', p)) but scipy is expecting a scalar.
Your problem seems very simple to vectorize. For each pair of rows of B you want to compute
P[i,j] = np.exp(-np.sum(np.abs(B[i,:] - B[j,:])))
You can make use of array broadcasting and introduce a third dimension, summing along the last one:
P2 = np.exp(-np.sum(np.abs(B[:,None,:] - B),axis=-1))
The idea is to reshape the first occurrence of B to shape (N,1,M) while the second B is left with shape (N,M). With array broadcasting, the latter is equivalent to (1,N,M), so
B[:,None,:] - B
is of shape (N,N,M). Summing along the last index will then result in the (N,N)-shape correlation array you're looking for.
Note that if you were using scipy, you would be able to do this using scipy.spatial.distance.cdist (or, equivalently, a combination of scipy.spatial.distance.pdist and scipy.spatial.distance.squareform), without unnecessarily computing the lower triangular half of this symmetric matrix. Using @Divakar's suggestion in the comments for the simplest solution this way:
from scipy.spatial.distance import cdist
P3 = 1/np.exp(cdist(B, B, 'minkowski', p=1))
cdist will compute the Minkowski distance in 1-norm, which is exactly the sum of the absolute values of coordinate differences.
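For the bonus questions, the same broadcasting idea appears to carry over directly; a sketch, assuming C has the same shape as B and p is the length-6 array from the edit:

import numpy as np

B = np.random.rand(200, 6)                 # example data; the question uses zeros
C = np.random.rand(200, 6)                 # same shape as B
p = np.array([1., 2., 1., 2., 1., 2.])

# Bonus 1: corr(B[i], C[j]) for all i, j -- broadcast B against C instead of B against B.
P_bc = np.exp(-np.sum(np.abs(B[:, None, :] - C), axis=-1))

# Bonus 2: per-coordinate exponents -- p broadcasts along the last (length-6) axis.
P_p = np.exp(-np.sum(np.abs(B[:, None, :] - C) ** p, axis=-1))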

Find the distance of each pair between two vectors

I have two vectors, let's say x=[2,4,6,7] and y=[2,6,7,8] and I want to find the euclidean distance, or any other implemented distance (from scipy for example), between each corresponding pair. That will be
dist=[0, 2, 1, 1].
When I try
dist = scipy.spatial.distance.cdist(x,y, metric='sqeuclidean')
or
dist = [scipy.spatial.distance.cdist(x,y, metric='sqeuclidean') for x,y in zip(x,y)]
I get
ValueError: XA must be a 2-dimensional array.
How am I supposed to calculate dist and why do I have to reshape data for that purpose?
cdist does not compute the list of distances between corresponding pairs, but the matrix of distances between all pairs.
np.linalg.norm((np.asarray(x)-np.asarray(y))[:, None], axis=1)
This is how I'd typically write it for the Euclidean distance between n-dimensional points; but if you are only dealing with 1-dimensional points, the absolute difference, as suggested by elpres, would be simpler.
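For completeness, a short sketch of the element-wise version for the 1-D example in the question:

import numpy as np

x = np.array([2, 4, 6, 7])
y = np.array([2, 6, 7, 8])
dist = np.abs(x - y)      # array([0, 2, 1, 1]); in 1-D the Euclidean distance is just |x - y|
# For rows of n-dimensional points stacked in 2-D arrays X and Y, the analogue would be
# np.linalg.norm(X - Y, axis=1).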

Compute numpy array pairwise Euclidean distance except with self

Edit: this question is not specifically about calculating distances, but rather about the most efficient way to loop through a numpy array, given that for index i all comparisons should be made with the rest of the array, excluding the case where the second index equals i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy

# Point feature class
fc = "MY_FEATURE_CLASS"

# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc, ["SHAPE@X", "SHAPE@Y", "OID@"])

for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow(row[0] - row2[0], 2) + math.pow(row[1] - row2[1], 2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!
Using SciPy's pdist, you could write something like
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(npArray, lambda a, b: np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
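If the custom lambda is not needed, one option (a sketch, assuming the X and Y values of npArray can be pulled into a plain (N, 2) float array) is to hand just the coordinates to pdist with its default Euclidean metric:

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Positions 0 and 1 of each row are the X and Y fields; the ID field is dropped.
xy = np.array([[float(row[0]), float(row[1])] for row in npArray])
distances = squareform(pdist(xy))   # default metric is Euclidean; the diagonal zeros are the "self" pairs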
Each row of X is a 3-dimensional data instance or point.
The output pairwisedist[i, j] is the squared Euclidean distance between X[i, :] and X[j, :]:
import numpy as np

X = np.array([[6, 1, 7], [10, 9, 4], [13, 9, 3], [10, 8, 15], [14, 4, 1]])
a = np.sum(X * X, 1)
b = np.repeat(a[:, np.newaxis], X.shape[0], axis=1)
pairwisedist = b + b.T - 2 * X.dot(X.T)
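A quick sanity check of this identity, and a way to recover plain (non-squared) Euclidean distances, is to compare against SciPy's cdist; a sketch:

import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[6, 1, 7], [10, 9, 4], [13, 9, 3], [10, 8, 15], [14, 4, 1]])
a = np.sum(X * X, 1)
pairwisedist = a[:, np.newaxis] + a[np.newaxis, :] - 2 * X.dot(X.T)   # squared distances
euclidean = np.sqrt(np.maximum(pairwisedist, 0))   # clip tiny negatives caused by round-off
assert np.allclose(euclidean, cdist(X, X))         # cdist builds the same Euclidean matrix directly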
I wanted to point out that custom-written square-root-of-sum-of-squares implementations are prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance:
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])))
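np.hypot also vectorises over whole difference arrays, so the full pairwise matrix can be built with broadcasting; a sketch, assuming the coordinates sit in a plain (N, 2) float array xy (hypothetical name):

import numpy as np

xy = np.random.rand(10, 2)              # hypothetical (N, 2) array of X, Y coordinates
dx = xy[:, 0, None] - xy[None, :, 0]    # (N, N) pairwise X differences
dy = xy[:, 1, None] - xy[None, :, 1]    # (N, N) pairwise Y differences
distances = np.hypot(dx, dy)            # overflow/underflow-safe Euclidean distances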
