Fastest way to find Euclidean distance in Python

I have two sets of 2D points (A and B); each set has about 540 points. I need to find the points in set B that are farther than a defined distance alpha from all the points in A.
I have a solution, but it is not fast enough:
# find the closest point of each of the new points to the target set
def find_closest_point(self, A, B):
    outliers = []
    for i in range(len(B)):
        # find all the Euclidean distances
        temp = distance.cdist([B[i]], A)
        minimum = numpy.min(temp)
        # if the point is too far away from the rest, it is considered an outlier
        if minimum > self.alpha:
            outliers.append([i, B[i]])
        else:
            continue
    return outliers
I am using Python 2.7 with NumPy and SciPy. Is there another way to do this so that I gain a considerable increase in speed?
Thanks in advance for the answers.

>>> import numpy as np
>>> from scipy.spatial.distance import cdist
>>> A = np.random.randn(540, 2)
>>> B = np.random.randn(540, 2)
>>> alpha = 1.
>>> ind = np.all(cdist(A, B) > alpha, axis=0)
>>> outliers = B[ind]
gives you the points you want.

If you have a very large set of points, you could compute the x and y bounds of A, add and subtract alpha, and then eliminate from consideration all the points in B that lie outside that boundary.
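A minimal sketch of that pre-filter, assuming the same A, B and alpha as in the answer above (the box padded by alpha can only exclude points; anything inside it still needs the full check):

import numpy as np
from scipy.spatial.distance import cdist

A = np.random.randn(540, 2)
B = np.random.randn(540, 2)
alpha = 1.0

# Bounding box of A padded by alpha: any point of B outside this box is
# automatically farther than alpha from every point of A.
lo = A.min(axis=0) - alpha
hi = A.max(axis=0) + alpha
inside = np.all((B >= lo) & (B <= hi), axis=1)

# Points outside the padded box are outliers with no distance computation at all.
outliers = [B[~inside]]

# Only the points inside the box need the full distance check.
B_in = B[inside]
if len(B_in):
    far = np.all(cdist(A, B_in) > alpha, axis=0)
    outliers.append(B_in[far])
outliers = np.concatenate(outliers)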

Related

Is there a faster way to perform this neighbour finding operation

I'm trying to calculate Moran's I in Python (this is the underlying equation). My inputs are an Nx3 array coords containing the coordinates of each point and an array z which contains the values minus the overall mean. The operation requires each value of z to be multiplied with every point within a set distance (here set to 1.99). My problem is that in my case N is roughly 2 million, so the find_neighbours operation is very slow. Is there a way I could speed this up?
def find_neighbours(coords, idx, k):
    distances = np.sqrt(np.power(coords - coords[idx], 2).sum(axis=1))
    distances[idx] = np.inf
    return np.argwhere(distances <= k)

z = x - np.mean(x)
n = len(coords)
A = 0
B = np.sum([z[idx]**2 for idx, coord in enumerate(coords)])
S_0 = 0
for idx in range(len(coords)):
    neighbours = find_neighbours(coords, idx, 1.99)
    S_0 += len(neighbours)
    A += np.sum([(z[neighbour] * z[idx]) for neighbour in neighbours])
I = (n / S_0) * (A / B)
This is a classical problem with plenty of literature about it. It is called radius neighbor search in three-dimensional point clouds. You need to store your points in a better data structure to make the search faster; I would suggest an octree.
Check the Python code here and adapt it to your case.
For explanations, check this paper.
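If you would rather not implement an octree yourself, SciPy's k-d tree offers the same radius search out of the box; here is a rough sketch of the Moran's I computation above using scipy.spatial.cKDTree.query_pairs (the stand-in data sizes are arbitrary, and z is treated as one value per point, as in the question's code):

import numpy as np
from scipy.spatial import cKDTree

# Stand-in data: coords is Nx3, x holds the raw values
N = 10000
coords = np.random.rand(N, 3) * 50
x = np.random.rand(N)
z = x - np.mean(x)

tree = cKDTree(coords)
# All index pairs (i, j) with i < j whose distance is <= 1.99
pairs = np.array(sorted(tree.query_pairs(r=1.99)))

S_0 = 2 * len(pairs)                             # each pair is a neighbour in both directions
A = 2 * np.sum(z[pairs[:, 0]] * z[pairs[:, 1]])
B = np.sum(z ** 2)
I = (N / S_0) * (A / B)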

Choosing subset of farthest points in given set of points

Imagine you are given a set S of n points in 3 dimensions. The distance between any 2 points is the plain Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other; in other words, there exists no other subset Q' of k points such that the minimum of all pairwise distances in Q is less than that in Q'.
If n is approximately 16 million and k is about 300, how do we do this efficiently?
My guess is that this is NP-hard, so maybe we just want to focus on approximation. One idea I can think of is using multidimensional scaling to sort these points along a line and then use a version of binary search to get the points that are farthest apart on this line.
This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991), and Ravi et al. (1994) give a factor-2 approximation for the problem, with the latter proving that this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.

P = set([i, j])
while size(P) < p:
    Find a node v in V-P such that min_{v' in P} dist(v, v') is maximum.
    (That is: find the node with the greatest minimum distance to the set P.)
    P = P.union(v)
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)  # random matrix
d = (d + d.T) / 2         # make the matrix symmetric

print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
    for j in range(i + 1, N):
        if d[i, j] > maxdist:
            maxdist = d[i, j]
            bestpair = (i, j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    maxdist = 0
    vbest = None
    for v in range(N):
        if v in P:
            continue
        # greatest minimum distance from v to the current set P
        mindist = min(d[v, vprime] for vprime in P)
        if mindist > maxdist:
            maxdist = mindist
            vbest = v
    P.add(vbest)

print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400, the optimality gap was still 568% after 6000 s. The approximation algorithm took 0.47 s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)  # random matrix
d = (d + d.T) / 2         # make the matrix symmetric

m = grb.Model(name="MIP Model")
used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]
objective = grb.quicksum(d[i, j] * used[i] * used[j]
                         for i in range(0, N) for j in range(i + 1, N))
m.addConstr(
    lhs=grb.quicksum(used),
    sense=grb.GRB.EQUAL,
    rhs=p
)

# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)
# m.Params.TimeLimit = 3*60

# solve with Gurobi
ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist
p = 300
N = 16000000
# Find a convex hull in O(N log N)
points = np.random.rand(N, 3) # N random points in 3-D
# Returned 420 points in testing
hull = ConvexHull(points)
# Extract the points forming the hull
hullpoints = points[hull.vertices,:]
# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')
# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)
P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])
# Now we have a problem
print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    distance_to_P = cdist(points, P)
    minimum_to_each_of_P = np.min(distance_to_P, axis=1)
    best_new_point_idx = np.argmax(minimum_to_each_of_P)
    best_new_point = np.expand_dims(points[best_new_point_idx, :], 0)
    P = np.append(P, best_new_point, axis=0)
print(P)
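If cdist(points, P) against all 16 million points becomes the bottleneck as P grows, a variation (not part of the answer above) is to rebuild a cKDTree on the small set P each iteration and query it with all points, which keeps the minimum-distance step vectorized:

from scipy.spatial import cKDTree

# Drop-in replacement for the loop above; reuses points, P and p from that snippet.
while len(P) < p:
    tree = cKDTree(P)                      # P is small, so rebuilding is cheap
    min_dist, _ = tree.query(points, k=1)  # distance from every point to its nearest member of P
    best_new_point_idx = np.argmax(min_dist)
    P = np.append(P, points[best_new_point_idx:best_new_point_idx + 1, :], axis=0)
print(P)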
I am also pretty sure that the problem is NP-hard; the most similar problem I found is the k-center problem. If runtime is more important than correctness, a greedy algorithm is probably your best choice:
Q = {}
while |Q| < k:
    Q += p from S where mindist(p, Q) is maximal
Side note: In similar problems, e.g. the set-cover problem, it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up, I see three possibilities:
1. Index your dataset in an R-tree first, then perform a greedy search. Construction of the R-tree is O(n log n), and although it was developed for nearest-neighbor search, it can also help you find the furthest point from a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
2. Sample a subset of your 16 million points and perform the greedy algorithm on the subset (see the sketch after this list). You are approximating anyway, so you might be able to spare a little more accuracy. You can also combine this with the first approach.
3. Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (let's call this set Q'). Then in each step you swap the point p_ from Q' that has the minimum distance to another point in Q' with a random point from S. If the resulting set Q'' is better, proceed with Q''; otherwise repeat with Q'. In order not to get stuck, you might want to choose a point from Q' other than p_ if you cannot find an adequate replacement for a couple of iterations.
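A minimal sketch of the second idea, i.e. running the greedy maxmin selection on a random sample (the sample size and the stand-in data are arbitrary choices):

import numpy as np
from scipy.spatial.distance import cdist

S = np.random.rand(1000000, 3)   # stand-in for the 16 million points
k = 300

# Draw a random sample and run the greedy selection only on it
sample = S[np.random.choice(len(S), size=100000, replace=False)]

chosen = [0]                                   # start from an arbitrary point
min_dist = cdist(sample, sample[0:1]).ravel()  # distance of every sample point to the chosen set
for _ in range(k - 1):
    nxt = np.argmax(min_dist)                  # point with the greatest minimum distance
    chosen.append(nxt)
    min_dist = np.minimum(min_dist, cdist(sample, sample[nxt:nxt + 1]).ravel())
Q = sample[chosen]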
If you can afford to do ~k*n distance calculations, then you could:
1. Find the center of the distribution of points.
2. Select the point furthest from the center (and remove it from the set of unselected points).
3. Find the point furthest from all the currently selected points and select it.
4. Repeat step 3 until you end up with k points.
Find the maximum extent of all the points. Split the space into 7x7x7 voxels. For each voxel, find the point closest to its centre. Return these 7x7x7 points. Some voxels may contain no points, hopefully not too many.
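A rough sketch of that voxel idea, assuming the points sit in an (n, 3) array; the 7x7x7 grid size is taken from the suggestion above:

import numpy as np

points = np.random.rand(100000, 3)   # stand-in data
G = 7                                # 7x7x7 voxel grid

lo, hi = points.min(axis=0), points.max(axis=0)
# Integer voxel index of each point along each axis (clipped so the maximum stays in the last voxel)
cells = np.minimum((G * (points - lo) / (hi - lo)).astype(int), G - 1)
flat = cells[:, 0] * G * G + cells[:, 1] * G + cells[:, 2]

# Squared distance of each point to the centre of its own voxel
centres = lo + (cells + 0.5) * (hi - lo) / G
d2 = np.sum((points - centres) ** 2, axis=1)

# For every occupied voxel, keep the point closest to its centre
representatives = []
for cell in np.unique(flat):
    members = np.where(flat == cell)[0]
    representatives.append(points[members[np.argmin(d2[members])]])
representatives = np.array(representatives)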

Distance between one point and rest of the points in an array

I am writing some code to calculate the real distance between one point and the rest of the points from the same array. The array holds positions of particles in 3D space. There are N particles, so the array's shape is (N, 3). I choose one particle and calculate the distance between this particle and the rest of the particles, all within one array.
Would anyone here have any idea how to do this?
What I have so far:
import random
import numpy as np

xbox = 10
ybox = 10
zbox = 10
nparticles = 15

positions = np.empty([nparticles, 3])
for i in range(nparticles):
    xrandomalocation = random.uniform(0, xbox)
    yrandomalocation = random.uniform(0, ybox)
    zrandomalocation = random.uniform(0, zbox)
    positions[i, 0] = xrandomalocation
    positions[i, 1] = yrandomalocation
    positions[i, 2] = zrandomalocation
And that's pretty much all I have right now. I was thinking of using np.linalg.norm; however, I am not sure at all how to work it into my code (or maybe use it in a loop)?
It sounds like you could use scipy.spatial.distance.cdist or scipy.spatial.distance.pdist for this. For example, to get the distances from point X to the points in coords:
>>> from scipy.spatial import distance
>>> X = [(35.0456, -85.2672)]
>>> coords = [(35.1174, -89.9711),
... (35.9728, -83.9422),
... (36.1667, -86.7833)]
>>> distance.cdist(X, coords, 'euclidean')
array([[ 4.70444794, 1.6171966 , 1.88558331]])
pdist is similar, but only takes one array, and you get the distances between all pairs.
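For example, a small pdist sketch on the same coordinates (squareform turns the condensed vector back into a full matrix):

from scipy.spatial.distance import pdist, squareform

coords = [(35.1174, -89.9711),
          (35.9728, -83.9422),
          (36.1667, -86.7833)]

d = pdist(coords)       # condensed vector of the 3 pairwise distances
print(d)                # approximately [6.089, 3.356, 2.848]
print(squareform(d))    # the same distances as a symmetric 3x3 matrix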
I am using this function:
from scipy.spatial import distance

def closest_node(node, nodes):
    closest = distance.cdist([node], nodes)   # distances from node to every point in nodes
    index = closest.argmin()
    euclidean = closest[0]
    return nodes[index], euclidean[index]
where node is the single point in space that you want to compare with an array of points called nodes. It returns the closest point and the Euclidean distance to your original node.
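A short usage sketch, assuming the closest_node function above is defined and using a stand-in array shaped like the question's positions:

import numpy as np

positions = np.random.uniform(0, 10, size=(15, 3))   # stand-in particle positions

# Closest particle to particle 0, excluding particle 0 itself
point, dist = closest_node(positions[0], positions[1:])
print(point, dist)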

How to find mean in kmeans in single shot using numpy

I have a function:
def update(points, closest, centroids):
    return np.array([points[closest == k].mean(axis=0) for k in range(centroids.shape[0])])
It is basically the centroid-update step of the k-means algorithm. points is a matrix, closest is an assignment of each point to a cluster, and all I am doing is finding the new mean based on the points in each cluster.
I was wondering if I can get rid of that for loop, i.e. find the cluster means in one shot?
Here's a vectorized approach based on np.add.reduceat -
c = np.bincount(closest, minlength=centroids.shape[0])        # cluster sizes
mask = c != 0                                                 # clusters that actually contain points
pts_grp = points[closest.argsort()]                           # points grouped by cluster
cut_idx = np.append(0, c[mask].cumsum()[:-1])                 # start index of each non-empty group
out = np.full((centroids.shape[0], points.shape[1]), np.nan)  # empty clusters stay NaN
out[mask] = np.add.reduceat(pts_grp, cut_idx, axis=0) / c[mask, None].astype(float)
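For reference, another fully vectorized route (not the answer above) is np.add.at, which scatters per-cluster sums directly; a minimal sketch with random inputs (note it returns zeros, not NaN, for empty clusters):

import numpy as np

points = np.random.rand(100, 2)
centroids = np.random.rand(5, 2)
closest = np.random.randint(0, 5, size=100)

counts = np.bincount(closest, minlength=centroids.shape[0])
sums = np.zeros_like(centroids)
np.add.at(sums, closest, points)                        # accumulate the point sums per cluster
new_centroids = sums / np.maximum(counts, 1)[:, None]   # empty clusters keep a zero row here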

Efficient distance calculation between N points and a reference in numpy/scipy

I just started using SciPy/NumPy. I have a 100000x3 array; each row is a coordinate, and I have a 1x3 center point. I want to calculate the distance from each row in the array to the center and store them in another array. What is the most efficient way to do it?
I would take a look at scipy.spatial.distance.cdist:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
import numpy as np
import scipy
a = np.random.normal(size=(10,3))
b = np.random.normal(size=(1,3))
dist = scipy.spatial.distance.cdist(a,b) # pick the appropriate distance metric
For the default distance metric, dist is equivalent to:
np.sqrt(np.sum((a-b)**2,axis=1))
although cdist is much more efficient for large arrays (on my machine for your size problem, cdist is faster by a factor of ~35x).
I would use the sklearn implementation of the Euclidean distance. Its advantage is that it uses the more efficient expression based on matrix multiplication:
dist(x, y) = sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
A simple script would look like this:
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)
dist = euclidean_distances(x, y)  # (1000, 1000) matrix of pairwise distances
The advantage of this approach has been nicely described in the sklearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances
I am using this approach to crunch large data matrices (10000, 10000) with some minor modifications, like using the np.einsum function.
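Such an einsum-based variant might look roughly like this (a sketch under the same x/y shapes as above, not the exact code used on the large matrices):

import numpy as np

x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)

xx = np.einsum('ij,ij->i', x, x)            # squared norms of the rows of x
yy = np.einsum('ij,ij->i', y, y)            # squared norms of the rows of y
sq = xx[:, None] - 2.0 * x.dot(y.T) + yy[None, :]
np.maximum(sq, 0, out=sq)                   # clip tiny negatives caused by rounding
dist = np.sqrt(sq)                          # (1000, 1000) matrix of pairwise distances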
You can also use the expansion of the norm (similar to the binomial identity). This is probably the most efficient way to compute the distances for a matrix of points.
Here is a code snippet that I originally used for a k-Nearest-Neighbors implementation, in Octave, but you can easily adapt it to numpy since it only uses matrix multiplications (the equivalent is numpy.dot()):
% Computing the Euclidean distance between each known point (Xapp) and the unknown points (Xtest)
% Note: we use the expansion of the norm, like a binomial identity:
% ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2*<x1,x2>
[napp, d] = size(Xapp);
[ntest, d] = size(Xtest);
A = sum(Xapp.^2, 2);
A = repmat(A, 1, ntest);
B = sum(Xtest.^2, 2);
B = repmat(B', napp, 1);
C = Xapp*Xtest';
dist = A+B-2.*C;
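A NumPy adaptation of that Octave snippet might look like this, with broadcasting standing in for the repmat calls (a sketch, with arbitrary stand-in shapes):

import numpy as np

Xapp = np.random.rand(500, 4)    # known points
Xtest = np.random.rand(200, 4)   # unknown points

A = np.sum(Xapp ** 2, axis=1)[:, None]    # ||x1||^2 as a column, shape (napp, 1)
B = np.sum(Xtest ** 2, axis=1)[None, :]   # ||x2||^2 as a row, shape (1, ntest)
C = Xapp.dot(Xtest.T)                     # <x1, x2>, shape (napp, ntest)
dist = A + B - 2.0 * C                    # squared distances, like 'dist' in the Octave code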
This might not answer your question directly, but if you are after all pairwise combinations of particles, I've found the following solution to be faster than the pdist function in some cases.
import numpy as np

L = 100    # simulation box dimension
N = 100    # number of particles
dim = 2    # dimensions

# Generate random positions of particles
r = (np.random.random(size=(N, dim)) - 0.5) * L

# uti is a tuple of two 1-D numpy arrays containing the indices
# of the upper triangular matrix
uti = np.triu_indices(N, k=1)  # k=1 excludes the diagonal indices

# uti[0] is i, and uti[1] is j from the previous example
dr = r[uti[0]] - r[uti[1]]            # differences between particle positions
D = np.sqrt(np.sum(dr * dr, axis=1))  # distances; D is a 1-D array of length N*(N-1)/2 = 4950
See my blog post for a more in-depth look at this matter.
You may need to specify in more detail which distance function you are interested in, but here is a very simple (and efficient) implementation of the squared Euclidean distance based on the inner product (which can obviously be generalized, in a straightforward manner, to other kinds of distance measures):
In []: from numpy import dot, ones
In []: from numpy.random import randn
In []: P, c = randn(5, 3), randn(1, 3)
In []: dot(((P - c) ** 2), ones(3))
Out[]: array([ 8.80512, 4.61693, 2.6002, 3.3293, 12.41800])
where P holds your points and c is the center.
# Is this a correct way to find the biggest distance between points in the plane?
from math import sqrt

n = int(input("enter the range : "))
x = list(map(float, input("type x coordinates: ").split()))
y = list(map(float, input("type y coordinates: ").split()))

maxdis = 0
for i in range(n):
    for j in range(i + 1, n):   # each pair only needs to be checked once
        dist = sqrt((x[j] - x[i])**2 + (y[j] - y[i])**2)
        if maxdis < dist:
            maxdis = dist
print("maximum distance is : {:5g}".format(maxdis))
