Compute numpy array pairwise Euclidean distance except with self - python

edit: this question is not specifically about calculating distances, rather the most efficient way to loop through a numpy array, specifying that for index i all comparisons should be made with the rest of the array, as long as the second index is not i.
I have a numpy array with columns (X, Y, ID) and want to compare each element to each other element, but not itself. So, for each X, Y coordinate, I want to calculate the distance to each other X, Y coordinate, but not itself (where distance = 0).
Here is what I have - there must be a more "numpy" way to write this.
import math, arcpy
# Point feature class
fc = "MY_FEATURE_CLASS"
# Load points to numpy array: (X, Y, ID)
npArray = arcpy.da.FeatureClassToNumPyArray(fc,["SHAPE#X","SHAPE#Y","OID#"])
for row in npArray:
    for row2 in npArray:
        if row[2] != row2[2]:
            # Pythagoras's theorem
            distance = math.sqrt(math.pow((row[0]-row2[0]),2)+math.pow((row[1]-row2[1]),2))
Obviously, I'm a numpy newbie. I will not be surprised to find this a duplicate, but I don't have the numpy vocabulary to search out the answer. Any help appreciated!

Using SciPy's pdist, you could write something like
import numpy as np
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: np.sqrt((a[0]-b[0])**2 + (a[1]-b[1])**2)))
pdist will compute the pair-wise distances using the custom metric that ignores the 3rd coordinate (which is your ID in this case). squareform turns this into a more readable matrix such that distances[0,1] gives the distance between the 0th and 1st rows.
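To make the "everything except itself" part concrete, here is a minimal sketch. It assumes the data can be held as a plain float array with columns (X, Y, ID); the `points` values and the `not_self` mask are made up for illustration and are not from the question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical stand-in for the (X, Y, ID) data: a plain float array.
points = np.array([[0.0, 0.0, 1],
                   [3.0, 4.0, 2],
                   [6.0, 8.0, 3]])

xy = points[:, :2]                 # drop the ID column before measuring
condensed = pdist(xy)              # one entry per unordered pair i < j, never i == j
distances = squareform(condensed)  # full symmetric matrix, zeros on the diagonal

# Boolean mask that is False only where a point would be compared with itself.
not_self = ~np.eye(len(xy), dtype=bool)

# One row per point, holding its distances to every *other* point.
to_others = distances[not_self].reshape(len(xy), -1)
```

The condensed output of pdist already skips the self-comparisons, so the mask is only needed if you want the square layout without the diagonal.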

Each row of X is a 3-dimensional data instance (point).
The output pairwisedist[i, j] is the squared Euclidean distance between X[i, :] and X[j, :]; take np.sqrt of it to get the distance itself.
import numpy as np
X = np.array([[6,1,7],[10,9,4],[13,9,3],[10,8,15],[14,4,1]])
a = np.sum(X*X,1)
b = np.repeat(a[:,np.newaxis],X.shape[0],axis=1)
pairwisedist = b + b.T - 2*X.dot(X.T)
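The same identity, ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, can be wrapped in a small function. This is only a sketch: the name `pairwise_euclidean` is made up here, and the clipping step is an assumption I added because rounding can leave tiny negative values before the square root.

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X * X, axis=1)                  # ||x_i||^2 for every row
    # Broadcasting replaces the explicit np.repeat of the squared norms.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dists, 0.0, out=sq_dists)           # guard against rounding below zero
    return np.sqrt(sq_dists)

X = np.array([[6, 1, 7], [10, 9, 4], [13, 9, 3], [10, 8, 15], [14, 4, 1]])
dist = pairwise_euclidean(X)
```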

I wanted to point out that a hand-written square root of a sum of squares is prone to overflow and underflow. The built-in math.hypot and np.hypot are much safer, with no compromise on performance.
import math
from scipy.spatial.distance import pdist, squareform
distances = squareform(pdist(npArray, lambda a,b: math.hypot(a[0]-b[0], a[1]-b[1])))
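A small illustration of the overflow and underflow point, using deliberately extreme made-up coordinate differences (`dx`, `dy` are not from the question):

```python
import numpy as np

# Deliberately extreme coordinate differences (illustrative values only).
dx = np.array([3e200, 1e-200])
dy = np.array([4e200, 1e-200])

# Squaring overflows to inf for the first pair and underflows to 0 for the second.
with np.errstate(over="ignore", under="ignore"):
    naive = np.sqrt(dx**2 + dy**2)

# np.hypot computes the same quantity without forming the squares directly.
safe = np.hypot(dx, dy)
```

For ordinary map coordinates the two approaches should agree; the difference only shows up near the limits of floating-point range.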

Related

Scipy: epsilon neighborhood by sparse similarity with threshold

I am wondering if scipy offers the option to implement a primitive but memory-friendly approach to epsilon neighborhood search:
Compute pairwise similarity for my data, but set all similarities smaller than a threshold epsilon to zero on the fly and then output result directly as sparse matrix.
For example scipy.spatial.distance.pdist() is really fast, but the memory limit is reached early compared to my time limit, at least if I take squareform().
I know there are O(n*log(n)) solutions in this case but for now it would be enough if the result could be sparse. Also obviously I would have to use a similarity as opposed to a distance, but that should not be such a big problem, should it.
As long as you can recast your similarity measure in terms of a distance metric (say 1 minus the similarity) then the most efficient solution is to use sklearn's BallTree.
Otherwise you could build your own scipy.sparse.csr_matrix by comparing each point against the other points and throwing away all values smaller than the threshold.
Without knowing your specific similarity metric, this code should roughly do the trick:
import scipy.sparse as spsparse
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
def sparse_similarity(X, epsilon=0.99, Y=None, similarity_metric=cosine_similarity):
    '''
    X : ndarray
        An m by n array of m original observations in an n-dimensional space.
    '''
    Nx, Dx = X.shape
    if Y is None:
        Y=X
    Ny, Dy = Y.shape
    assert Dx==Dy
    data = []
    indices = []
    indptr = [0]
    for ix in range(Nx):
        xsim = similarity_metric([X[ix]], Y)
        _ , kept_points = np.nonzero(xsim>=epsilon)
        data.extend(xsim[0,kept_points])
        indices.extend(kept_points)
        indptr.append(indptr[-1] + len(kept_points))
    return spsparse.csr_matrix((data, indices, indptr), shape=(Nx,Ny))
X = np.random.random(size=(1000,10))
sparse_similarity(X, epsilon=0.95)
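If the similarity can be recast as a Euclidean distance, the tree-based route can also be sketched with SciPy alone. Note that this swaps in scipy.spatial.cKDTree rather than sklearn's BallTree, and the data and the `radius` threshold below are made up for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((500, 3))   # illustrative data, not from the question
radius = 0.1               # hypothetical distance threshold (the "epsilon")

tree = cKDTree(X)
# Only pairs no farther apart than `radius` are stored; everything else stays implicit.
sparse_dists = tree.sparse_distance_matrix(
    tree, radius, output_type="coo_matrix"
).tocsr()
```

The dense n x n matrix is never built, so memory use should scale with the number of neighbour pairs rather than with n squared.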

vectorized / linear algebra distance between points?

Suppose I have an array of points,
import numpy as np
pts = np.random.rand(100,3) # 100 points, X, Y, Z along second dimension
The naive approach to calculate the distance between each combination of points involves a double for loop and will be excruciatingly slow for large numbers of points,
def euclidian_distance(p1, p2):
    d = p2 - p1
    # sum the squared differences first, then take the square root
    return np.sqrt((d**2).sum())
out = np.empty((pts.shape[0], pts.shape[0]))
# iterate over the points (rows) themselves, not over the swapped axes
for idx, point in enumerate(pts):
    for idx2, point_inner in enumerate(pts):
        out[idx,idx2] = euclidian_distance(point, point_inner)
How do I vectorize this calculation?
Take a look at scipy.spatial.distance.cdist. I'm not sure, but I assume SciPy has optimized this quite heavily. If you use the pts array for both inputs, you should get an M x M array with zeros on the diagonal.
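A minimal sketch of that suggestion, next to a pure-NumPy broadcasting version for comparison (the random data is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
pts = rng.random((100, 3))   # 100 points, X, Y, Z along the second dimension

# SciPy: all pairwise Euclidean distances in one call.
dist_cdist = cdist(pts, pts)

# NumPy only: broadcast to a (100, 100, 3) array of differences, then reduce.
diff = pts[:, None, :] - pts[None, :, :]
dist_bcast = np.sqrt((diff ** 2).sum(axis=-1))
```

The broadcasting version builds an intermediate (M, M, 3) array, so cdist is likely the better choice when M is large.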

Vectorise Python code

I have coded a kriging algorithm but I find it quite slow. In particular, do you have an idea how I could vectorise the piece of code in the cons function below:
import time
import numpy as np
B = np.zeros((200, 6))
P = np.zeros((len(B), len(B)))
def cons():
    time1=time.time()
    for i in range(len(B)):
        for j in range(len(B)):
            P[i,j] = corr(B[i], B[j])
    time2=time.time()
    return time2-time1
def corr(x,x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))))
time_av = 0.
for i in range(30):
    time_av+=cons()
# average over the 30 runs actually performed
print("Average=", time_av/30.)
Edit: Bonus questions
1. What happens to the broadcasting solution if I want corr(B[i], C[j]), with C of the same dimensions as B?
2. What happens to the scipy solution if my p-norm orders are an array:
p=np.array([1.,2.,1.,2.,1.,2.])
def corr(x, x_i):
    return np.exp(-np.sum(np.abs(np.array(x) - np.array(x_i))**p))
For 2., I tried P = np.exp(-cdist(B, C,'minkowski', p)) but scipy is expecting a scalar.
Your problem seems very simple to vectorize. For each pair of rows of B you want to compute
P[i,j] = np.exp(-np.sum(np.abs(B[i,:] - B[j,:])))
You can make use of array broadcasting and introduce a third dimension, summing along the last one:
P2 = np.exp(-np.sum(np.abs(B[:,None,:] - B),axis=-1))
The idea is to reshape the first occurrence of B to shape (N,1,M) while the second B is left with shape (N,M). With array broadcasting, the latter is equivalent to (1,N,M), so
B[:,None,:] - B
is of shape (N,N,M). Summing along the last index will then result in the (N,N)-shape correlation array you're looking for.
Note that if you are using scipy, you can do this with scipy.spatial.distance.cdist, or with a combination of scipy.spatial.distance.pdist and scipy.spatial.distance.squareform; the pdist route avoids unnecessarily computing the lower triangular half of this symmetric matrix. Using @Divakar's suggestion from the comments, the simplest solution this way is:
from scipy.spatial.distance import cdist
P3 = 1/np.exp(cdist(B, B, 'minkowski', p=1))
cdist will compute the Minkowski distance in 1-norm, which is exactly the sum of the absolute values of coordinate differences.
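For the two bonus questions, a broadcasting sketch might look like the following. The second array `C` and the random data are made up for illustration; the per-column exponent `p` is the one from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.random((50, 6))
C = rng.random((40, 6))    # hypothetical second point set with the same number of columns
p = np.array([1., 2., 1., 2., 1., 2.])

# (50, 1, 6) minus (1, 40, 6) broadcasts to (50, 40, 6).
diff = np.abs(B[:, None, :] - C[None, :, :])

# Bonus 1: corr(B[i], C[j]) for every pair (i, j).
P_bc = np.exp(-np.sum(diff, axis=-1))

# Bonus 2: a different exponent per column; p broadcasts along the last axis.
P_p = np.exp(-np.sum(diff ** p, axis=-1))
```

Because the exponent array simply broadcasts against the last axis, no scalar-only limitation applies here, at the cost of holding the full (N, K, M) difference array in memory.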

Array is too big for euclidean distance

my_array is a sparse 78000 x 200 matrix of zeros and ones; many rows are only zeros. I am trying to calculate the Euclidean distance, and then use multidimensional scaling to get the coordinates of every column (every column is a word in a vocabulary).
I get the error "array is too big" while calculating the Euclidean distance. There are other similar questions, but I don't know how to apply them in this case. I imagine that reducing the precision of the dist array would make it smaller, but I don't know how to do that. Working with sparse matrices or np.memmap might also help, but my_array itself is not the problem. The problem starts when the code tries to hold all the distance values, so I need to integrate one of these techniques into the calculation of the dist array. The dist array is a 78000 x 78000 matrix.
So my question is, how do I integrate any of these techniques in the calculation of the euclidean distance?
Or would it make sense to loop through dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))? And adjust the data type somewhere in there?
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics.pairwise import euclidean_distances
my_array = np.array(Y[2:])
dist = euclidean_distances(my_array)
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
pos = mds.fit_transform(dist)
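For scale: a 78000 x 78000 array of float64 needs about 48.7 GB, and float32 halves that to about 24.3 GB. The looped formula from the question could be sketched as below, computing one block of rows at a time in reduced precision. The function name `euclidean_distances_chunked`, the chunk size and the demo data are all assumptions made for illustration, not an established recipe.

```python
import numpy as np

def euclidean_distances_chunked(X, chunk_size=1000, dtype=np.float32):
    """Pairwise Euclidean distances between rows of X, built block by block."""
    X = np.asarray(X, dtype=dtype)
    n = X.shape[0]
    sq_norms = np.einsum("ij,ij->i", X, X)      # dot(x, x) for every row
    out = np.empty((n, n), dtype=dtype)         # result stored in reduced precision
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # dot(x, x) - 2 * dot(x, y) + dot(y, y) for one block of rows
        block = sq_norms[start:stop, None] - 2.0 * (X[start:stop] @ X.T) + sq_norms[None, :]
        np.maximum(block, 0, out=block)         # rounding can dip slightly below zero
        out[start:stop] = np.sqrt(block)
    return out

# Small illustrative 0/1 matrix standing in for my_array.
rng = np.random.default_rng(0)
demo = (rng.random((30, 8)) < 0.3).astype(np.float32)
demo_dist = euclidean_distances_chunked(demo, chunk_size=7)
```

Chunking only limits the size of the temporaries; the full result still has to live somewhere, and an np.memmap would be one option for that (not shown). Also, if I read the question correctly and the goal is coordinates for the 200 columns, the distances between columns (my_array.T) would form only a 200 x 200 matrix.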

python numpy euclidean distance calculation between matrices of row vectors

I am new to Numpy and I would like to ask you how to calculate euclidean distance between points stored in a vector.
Let's assume that we have a numpy.array in which each row is a vector, and a single numpy.array holding one point. I would like to know if it is possible to calculate the Euclidean distance between all the points and this single point and store the results in one numpy.array.
Here is an interface:
points #2d list of row-vectors
singlePoint #one row-vector
listOfDistances= procedure( points,singlePoint)
Can we have something like this?
Or is it possible to have one command to have the single point as a list of other points and at the end we get a matrix of distances?
Thanks
To get the distance you can use the norm method of the linalg module in numpy:
np.linalg.norm(x - y)
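For the many-points-versus-one case in the question, the same function takes an axis argument. A short sketch with the canned data used elsewhere on this page:

```python
import numpy as np

points = np.arange(20).reshape((10, 2))   # ten row-vectors
single_point = np.array([3, 4])           # one row-vector

# Broadcasting subtracts single_point from every row; axis=1 takes the norm of each row.
list_of_distances = np.linalg.norm(points - single_point, axis=1)
```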
While you can use vectorize, @Karl's approach will be rather slow with numpy arrays.
The easier approach is to just do np.hypot(*(points - single_point).T). (The transpose assumes that points is an Nx2 array, rather than a 2xN. If it's 2xN, you don't need the .T.)
However, this is a bit unreadable, so you can write it out more explicitly like this (using some canned example data...):
import numpy as np
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
dist = (points - single_point)**2
dist = np.sum(dist, axis=1)
dist = np.sqrt(dist)
import numpy as np
def distance(v1, v2):
    return np.sqrt(np.sum((v1 - v2) ** 2))
To apply a function to each element of a numpy array, try numpy.vectorize.
To do the actual calculation, we need the square root of the sum of squares of differences (whew!) between pairs of coordinates in the two vectors.
We can use zip to pair the coordinates, and sum with a comprehension to sum up the results. That looks like:
sum((x - y) ** 2 for (x, y) in zip(singlePoint, pointFromArray)) ** 0.5
import numpy as np
# define the helper before it is called
def euclid_dist(t1, t2):
    return np.sqrt(((t1-t2)**2).sum(axis = 1))
single_point = [3, 4]
points = np.arange(20).reshape((10,2))
distance = euclid_dist(single_point,points)
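The second part of the question, several reference points at once and a matrix of distances as the result, could be sketched with scipy.spatial.distance.cdist. The `query_points` array is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.arange(20).reshape((10, 2))
query_points = np.array([[3, 4], [0, 0]])   # hypothetical list of "single points"

# dist_matrix[i, j] is the distance from query_points[i] to points[j].
dist_matrix = cdist(query_points, points)
```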
