I have a collection of n-dimensional points and I want to find which 2 are the closest. The best I could come up with for 2 dimensions is:
from numpy import *

myArr = array([[1, 2],
               [3, 4],
               [5, 6],
               [7, 8]])

n = myArr.shape[0]
cross = [[sum((myArr[i] - myArr[j]) ** 2), i, j]
         for i in xrange(n)
         for j in xrange(n)
         if i != j]

print min(cross)
which gives
[8, 0, 1]
But this is too slow for large arrays. What kind of optimisation can I apply to it?
RELATED:
Euclidean distance between points in two different Numpy arrays, not within
Try scipy.spatial.distance.pdist(myArr). This will give you a condensed distance matrix. You can use argmin on it and find the index of the smallest value. This can be converted into the pair information.
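For example, a minimal sketch of that recipe (the random array here is just for illustration; np.triu_indices recovers the (i, j) pair from the condensed index, since pdist enumerates pairs in the same upper-triangular order):
import numpy as np
from scipy.spatial.distance import pdist

pts = np.random.rand(1000, 4)   # any (n, d) array of points
d = pdist(pts)                  # condensed distances, length n*(n-1)/2
kmin = d.argmin()               # position of the smallest pairwise distance

# map the condensed index back to the pair (i, j)
rows, cols = np.triu_indices(len(pts), k=1)
i, j = rows[kmin], cols[kmin]
print(i, j, d[kmin])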
There's a whole Wikipedia page on just this problem, see: http://en.wikipedia.org/wiki/Closest_pair_of_points
Executive summary: you can achieve O(n log n) with a recursive divide and conquer algorithm (outlined on the Wiki page, above).
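For the curious, a rough Python sketch of that divide-and-conquer scheme for a list of 2-D point tuples might look like this (purely illustrative, not tuned; math.dist requires Python 3.8+):
import math

def closest_pair(points):
    # O(n log n) divide-and-conquer closest pair for 2-D points (a sketch)
    pts = sorted(points)                      # sort by x
    def rec(p):
        n = len(p)
        if n <= 3:                            # brute-force small cases
            best = (float('inf'), None, None)
            for i in range(n):
                for j in range(i + 1, n):
                    d = math.dist(p[i], p[j])
                    if d < best[0]:
                        best = (d, p[i], p[j])
            return best
        mid = n // 2
        midx = p[mid][0]
        best = min(rec(p[:mid]), rec(p[mid:]))
        # only points within the best distance of the dividing line can do better;
        # each needs to be compared with at most 7 neighbours in y-order
        strip = sorted((q for q in p if abs(q[0] - midx) < best[0]),
                       key=lambda q: q[1])
        for i in range(len(strip)):
            for j in range(i + 1, min(i + 8, len(strip))):
                d = math.dist(strip[i], strip[j])
                if d < best[0]:
                    best = (d, strip[i], strip[j])
        return best
    return rec(pts)

print(closest_pair([(1, 2), (3, 4), (5, 6), (7, 8)]))
# (2.8284271247461903, (1, 2), (3, 4))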
You could take advantage of the latest version of SciPy's (v0.9) Delaunay triangulation tools. You can be sure that the closest two points will be an edge of a simplex in the triangulation, which is a much smaller subset of pairs than doing every combination.
Here's the code (updated for general N-D):
import numpy
from scipy import spatial

def closest_pts(pts):
    # set up the triangulation; let Delaunay do the heavy lifting
    mesh = spatial.Delaunay(pts)

    # TODO: eliminate redundant edges (numpy.unique?)
    dim = pts.shape[1]
    edges = numpy.vstack((mesh.vertices[:, :dim], mesh.vertices[:, -dim:]))

    # the rest is easy
    x = mesh.points[edges[:, 0]]
    y = mesh.points[edges[:, 1]]
    dists = numpy.sum((x - y) ** 2, 1)
    idx = numpy.argmin(dists)
    # print 'distance: ', dists[idx]
    # print 'coords:\n', pts[edges[idx]]
    return edges[idx]

dim = 3
N = 1000 * dim
pts = numpy.random.random(N).reshape(N // dim, dim)
print closest_pts(pts)
In testing, the runtime seems close to O(n).
There is a scipy function pdist that will get you the pairwise distances between points in an array in a fairly efficient manner:
http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
It outputs the N*(N-1)/2 unique pairwise distances (since r_ij == r_ji). You can then search for the minimum value and avoid the whole loop mess in your code.
Perhaps you could proceed along these lines (this assumes an interactive session where numpy names such as randn, arange and where are already in the namespace, e.g. via %pylab):
In []: from scipy.spatial.distance import pdist as pd, squareform as sf
In []: m= 1234
In []: n= 123
In []: p= randn(m, n)
In []: d= sf(pd(p))
In []: a= arange(m)
In []: d[a, a]= d.max()
In []: where(d< d.min()+ 1e-9)
Out[]: (array([701, 730]), array([730, 701]))
With substantially more points you'll need to somehow exploit a hierarchical structure of your data (e.g. clustering or a spatial tree).
How fast is it compared to just doing a nested loop and keeping track of the shortest pair? I think creating a huge cross array is what might be hurting you. Even O(n^2) is still pretty quick if you're only doing 2 dimensional points.
The accepted answer is OK for small datasets, but its execution time scales as n**2. However, as pointed out by @payne, an optimal solution can achieve n*log(n) computation time scaling.
This optimal solution can be obtained using sklearn.neighbors.BallTree as follows.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neighbors import BallTree as tree
n = 10
dim = 2
xy = np.random.uniform(size=[n, dim])
# This solution is optimal when xy is very large
res = tree(xy)
dist, ids = res.query(xy, 2)
mindist = dist[:, 1] # second nearest neighbour
minid = np.argmin(mindist)
plt.plot(*xy.T, 'o')
plt.plot(*xy[ids[minid]].T, '-o')
This procedure scales well for very large sets of xy values and even for large dimensions dim (although the example illustrates the case dim=2). The two plotting lines draw all the points and highlight the closest pair.
An identical solution can be obtained using scipy.spatial.cKDTree, by replacing the sklearn import with the following SciPy one. Note however that cKDTree, unlike BallTree, does not scale well for high dimensions:
from scipy.spatial import cKDTree as tree
Related
I have N persons, let's call them p1, p2 and p3.
And I have K questions, let's call them Q1, Q2, Q3 and Q4.
Plus I have a variable called data which is a python3 dict:
data = {"p1": {"Q1":1, "Q2":0, "Q3":1, "Q4":0},
"p2": {"Q1":1, "Q2":1, "Q3":1, "Q4":0},
"p3": {"Q1":0, "Q2":0, "Q3":1, "Q4":1} }
This dictionary means that p1 knows the answers to questions 1 and 3 but not the answers to questions 2 and 4.
The problem is that we have an unknown person pX. We asked him our K questions and say we got:
"pX": {"Q1":1, "Q2":0, "Q3":1, "Q4":1}
The question is to rank the N persons from the most likely to be pX to the least likely.
What I did was to simply compute the euclidean distance between each person and pX.
So I have
d1 = compute_distance(pX, p1)
d2 = compute_distance(pX, p2)
d3 = compute_distance(pX, p3)
Then I just sort the distances, which gives me the person most likely to be pX (smallest distance) down to the least likely to be pX (largest distance).
And to finish I just present to the user each person with their distances sorted.
I find this approach easy to program and to reason about. However, a friend told me that for big N and K, a classification algorithm (using sklearn) would be better.
Is that true? What would a classification algorithm (like KNN or a random forest) bring to the table that the Euclidean distance doesn't?
Side question: do you think there is another result that would be interesting to present to the user? For instance, can we define an accuracy (the standard deviation of our data, for example)?
Let's re-phrase the question slightly. Instead of modeling the answers of each person as a series of Python dictionaries, we can define a 2D array of size N people times K questions. We can also define a "new" array of shape K.
import numpy as np
from scipy.spatial.distance import euclidean
X = np.array(
    [
        [1, 0, 1, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
    ]
)
new_x = np.array([1, 0, 1, 1])
If the goal is to find the "most similar" vector in X, we can frame that as minimizing the distance between X and new_x. If we have no prior knowledge of the structure of X, the naive approach is to iterate over all elements of X, compare the euclidean distance, and return a tuple of (distance, index):
def most_similar_brute_force(X, new_x):
    # Return index in X with minimum Euclidean distance to new_x
    ind = -1
    minimum_distance = np.inf
    for i, u in enumerate(X):
        # parentheses matter: without them the walrus would capture the boolean
        if (euc := euclidean(u, new_x)) < minimum_distance:
            minimum_distance = euc
            ind = i
    return euclidean(X[ind], new_x), ind

print(most_similar_brute_force(X, new_x))
# (1.0, 0)
But this is inefficient. Even assuming that the euclidean distance can be calculated in O(1) time (by assuming that all math operations are constant), there's still an O(N) linear hurdle because we have to check every value.
A more efficient approach would involve building a KDTree data structure (see scipy documentation, or Wikipedia's section on time complexity). Assuming the tree is well-balanced, querying for the 1-nearest neighbor is O(log N) on average and can give the same result as our naive solution:
from scipy.spatial import KDTree
tree = KDTree(X)
print(tree.query(new_x, k=1, p=2))
# (1.0, 0)
There's one caveat: in high-dimensional space, everything tends to be similar. The scipy KDTree documentation notes that its performance may not be significantly better than the brute force approach for high dimensions (roughly, K = 20+).
Our most_similar_brute_force will probably still be slower than KDTree since it is implemented in pure Python, but the caveat hints that a different approach may be needed for very high-dimensional data.
I have a long skinny data matrix (size: 250,000 x 10), which I will denote X. I also have a vector p measuring the quality of my data points. My goal is to compute the following function for each row x in my data matrix X:
r(x) = min{ ||x - y|| : y in X, p[y] > p[x] }
On a smaller dataset, I would use sklearn.metrics.pairwise_distances to precompute distances, like so:
from sklearn import metrics

n = len(X)
D_full = metrics.pairwise_distances(X)
r = np.zeros((n, 1))
for i in range(n):
    r[i] = (D_full[i, p > p[i]]).min()
However, the above approach is memory-expensive, since I need to store D_full: a full n x n matrix. It seems like sklearn.metrics.pairwise_distances_chunked could be a good tool for this sort of problem since the distance matrix is only stored one chunk at a time. I was hoping to get some assistance in how to use it though, as I'm currently unfamiliar with generator objects. Suppose I call the following:
from sklearn import metrics
D = metrics.pairwise_distances_chunked(X);
D_chunk = next(D)
The above code yields D (a generator object) and D_chunk (a 536 x n array). Does D_chunk correspond to the first 536 rows of the matrix D_full from my earlier approach? If so, does calling next(D) again give the next 536 rows?
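If that is the case, I imagine I could process the distance matrix one chunk at a time, along these lines (just a sketch of what I have in mind):
from sklearn import metrics
import numpy as np

r = np.full(len(X), np.inf)
start = 0
for D_chunk in metrics.pairwise_distances_chunked(X):
    for local_i, row in enumerate(D_chunk):
        i = start + local_i        # row index in the full distance matrix
        mask = p > p[i]            # columns with strictly higher quality
        if mask.any():
            r[i] = row[mask].min()
    start += D_chunk.shape[0]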
Thank you very much for your help.
This is an outline of a possible solution, but details are missing. In short, I would do the following:
Create a BallTree to query, and initialise min_quality_distance of size 250000 with say zeros.
Start with k=2.
For each vector, find its k nearest neighbours (including itself).
If the most distant of those k neighbours has sufficient quality, update min_quality_distance for that point.
For the remaining points, repeat with k=k+1.
In each iteration we have to query fewer vectors. The idea is that in every iteration you nibble away a few nearest neighbours that satisfy the condition, and it gets easier with every step (50% easier?). I will show how to do the first iteration, and from this it should be possible to build the loop.
You can do:
import numpy as np
size = 250000
X = np.random.random( size=(size,10))
p = np.random.random( size=size)
And create a BallTree with
from sklearn.neighbors import BallTree
tree = BallTree(X, leaf_size=10, metric='minkowski')
and query it for the first iteration with (this will take about 5 minutes):
k_nearest = 2
distances, indici = tree.query(X, k=k_nearest, return_distance=True, dualtree=False, sort_results=True)
The index of the most distant point within the nearest k is
most_far_away_indici = indici[:,-1:]
And its quality
p[most_far_away_indici]
So we can
quality_closeby = p[most_far_away_indici]
And check whether it is sufficient with
indici_sufficient_quality = quality_closeby > np.expand_dims(p, axis=1)
And we have
found_closeby = np.all( indici_sufficient_quality, axis=1 )
Which is True if we have found a point of sufficient quality nearby.
We can update the vector with
distances_nearby = distances[:,-1:]
rx = np.zeros(size)
rx[found_closeby] = distances_nearby[found_closeby][:,0]
And now we need to take care of the remaining points where we were unlucky; these are
~found_closeby
so
indici_not_found = ~found_closeby
and
distances, indici = tree.query(X[indici_not_found], k=3, return_distance=True, dualtree=False, sort_results=True)
etc..
I am sure the first few loops will take minutes, but after a few iterations the speed will quickly drop to seconds.
It is a little exercise with np.argwhere() etc. to make sure the right indices get updated.
It might not be the fastest, but it is a workable approach.
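For completeness, a rough sketch of how the pieces above might be assembled into the full loop (untested at the full 250,000-point scale; the variable names follow the snippets above):
import numpy as np
from sklearn.neighbors import BallTree

size = 250000
X = np.random.random(size=(size, 10))
p = np.random.random(size=size)

tree = BallTree(X, leaf_size=10, metric='minkowski')

rx = np.full(size, np.inf)       # distance to the nearest higher-quality point
remaining = np.arange(size)      # points not yet resolved
k = 2
while remaining.size > 0 and k <= size:
    distances, indici = tree.query(X[remaining], k=k, return_distance=True,
                                   sort_results=True)
    # only the newly discovered neighbour (rank k) needs checking: all closer
    # neighbours were already checked and rejected in earlier iterations
    newest = indici[:, -1]
    ok = p[newest] > p[remaining]
    rx[remaining[ok]] = distances[ok, -1]
    remaining = remaining[~ok]
    k += 1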
Since one cannot know the dimensions of some chunk, I suggest using np.ones_like instead of np.zeros.
Imagine you are given a set S of n points in 3 dimensions. The distance between any 2 points is the plain Euclidean distance. You want to choose a subset Q of k points from this set such that they are farthest from each other. In other words, there is no other subset Q' of k points such that the minimum of all pairwise distances in Q is less than that in Q'.
If n is approximately 16 million and k is about 300, how do we efficiently do this?
My guess is that this is NP-hard, so maybe we should just focus on approximation. One idea I can think of is to use multidimensional scaling to sort these points along a line and then use a version of binary search to get the points that are furthest apart on this line.
This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991) and Ravi et al. (1994) give a factor-2 approximation for the problem with the latter proving this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i, j])
while size(P) < p:
    Find a node v in V-P such that min_{v' in P} dist(v, v') is maximum.
    That is: find the node with the greatest minimum distance to the set P.
    P = P.union({v})
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # Random matrix standing in for the distances
d = (d + d.T) / 2          # Make the matrix symmetric

print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
    for j in range(i + 1, N):
        if d[i, j] > maxdist:
            maxdist = d[i, j]
            bestpair = (i, j)

P = set()
P.add(bestpair[0])
P.add(bestpair[1])

print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    # pick the node whose minimum distance to P is greatest (max-min)
    bestdist = 0
    vbest = None
    for v in range(N):
        if v in P:
            continue
        mindist = min(d[v, vprime] for vprime in P)
        if mindist > bestdist:
            bestdist = mindist
            vbest = v
    P.add(vbest)

print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400 after 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb

p = 50
N = 400

print("Building distance matrix...")
d = np.random.rand(N, N)   # Random matrix standing in for the distances
d = (d + d.T) / 2          # Make the matrix symmetric

m = grb.Model(name="MIP Model")
used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]

objective = grb.quicksum(d[i, j] * used[i] * used[j]
                         for i in range(0, N)
                         for j in range(i + 1, N))

m.addConstr(
    lhs=grb.quicksum(used),
    sense=grb.GRB.EQUAL,
    rhs=p
)

# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)
# m.Params.TimeLimit = 3*60

# solve the model with Gurobi
ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

p = 300
N = 16000000

# Find a convex hull in O(N log N)
points = np.random.rand(N, 3)   # N random points in 3-D

# Returned 420 points in testing
hull = ConvexHull(points)

# Extract the points forming the hull
hullpoints = points[hull.vertices, :]

# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')

# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)

P = np.array([hullpoints[bestpair[0]], hullpoints[bestpair[1]]])

# Now we have a problem
print("Finding optimal set...")
while len(P) < p:
    print("P size = {0}".format(len(P)))
    distance_to_P = cdist(points, P)
    minimum_to_each_of_P = np.min(distance_to_P, axis=1)
    best_new_point_idx = np.argmax(minimum_to_each_of_P)
    best_new_point = np.expand_dims(points[best_new_point_idx, :], 0)
    P = np.append(P, best_new_point, axis=0)

print(P)
I am also pretty sure that the problem is NP-hard; the most similar problem I found is the k-Center Problem. If runtime is more important than correctness, a greedy algorithm is probably your best choice:
Q = {}
while |Q| < k:
    Q += p from S where mindist(p, Q) is maximal
Side note: in similar problems (e.g. the set-cover problem) it can be shown that the solution of the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up I see 3 possibilities:
Index your dataset in an R-tree first, then perform a greedy search. Construction of the R-tree is O(n log n), and although it was developed for nearest-neighbour search, it can also help you find the point furthest from a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximating anyway, so you might be able to spare a little more accuracy. You can also combine this with algorithm 1.
Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (let's call this set Q'). Then in each step you swap the point p_ from Q' that has the minimum distance to another point in Q' with a random point from S. If the resulting set Q'' is better, proceed with Q''; otherwise repeat with Q'. In order not to get stuck, you might want to choose another point from Q' than p_ if you could not find an adequate replacement for a couple of iterations. A rough sketch of this third idea is given below.
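A small numpy sketch of that third, iterative idea (the helper name and parameters are made up; the swap-and-accept rule is the one described above):
import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_search(S, k, iters=2000, seed=0):
    # random-swap local search maximising the minimum pairwise distance (a sketch)
    rng = np.random.default_rng(seed)
    n = len(S)
    Q = rng.choice(n, size=k, replace=False)       # random initial subset Q'
    best = pdist(S[Q]).min()
    for _ in range(iters):
        d = squareform(pdist(S[Q]))
        np.fill_diagonal(d, np.inf)
        worst = d.min(axis=1).argmin()             # member with the smallest min-distance
        candidate = Q.copy()
        candidate[worst] = rng.integers(n)         # swap it for a random point of S
        if len(np.unique(candidate)) < k:
            continue                               # skip accidental duplicates
        score = pdist(S[candidate]).min()
        if score > best:                           # keep Q'' only if it is better
            Q, best = candidate, score
    return S[Q], best

# e.g. subset, min_distance = local_search(np.random.rand(10000, 3), k=300)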
If you can afford to do ~ k*n distance calculations then you could
Find the center of the distribution of points.
Select the point furthest from the center (and remove it from the set of un-selected points).
Find the point furthest from all the currently selected points and select it.
Repeat 3. until you end up with k points, as sketched below.
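A minimal sketch of that procedure (the function name is made up; keeping a running minimum-distance array holds the total work to roughly k*n distance computations):
import numpy as np
from scipy.spatial.distance import cdist

def farthest_point_subset(S, k):
    # greedy max-min selection seeded with the point farthest from the centre (a sketch)
    center = S.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(S - center, axis=1)))]
    # distance from every point to its nearest chosen point so far
    min_dist = cdist(S, S[chosen[-1:]]).ravel()
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))             # farthest from everything chosen
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, cdist(S, S[[nxt]]).ravel())
    return S[chosen]

# e.g. subset = farthest_point_subset(np.random.rand(100000, 3), k=300)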
Find the maximum extent of all points. Split into 7x7x7 voxels. For all points in a voxel, find the point closest to its centre. Return these (up to) 7x7x7 points. Some voxels may contain no points, hopefully not too many. A rough sketch of this idea:
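A hedged numpy sketch of that voxel idea (the function name and the 7x7x7 grid size are illustrative; empty voxels are simply skipped):
import numpy as np

def voxel_representatives(points, bins=7):
    # for each occupied voxel of a bins^3 grid, keep the point nearest the voxel centre
    lo, hi = points.min(axis=0), points.max(axis=0)
    size = (hi - lo) / bins
    idx = np.clip(((points - lo) / size).astype(int), 0, bins - 1)   # voxel of each point
    flat = np.ravel_multi_index(idx.T, (bins, bins, bins))           # one id per voxel
    centres = lo + (idx + 0.5) * size                                # centre of that voxel
    dist = np.linalg.norm(points - centres, axis=1)
    reps = {}
    for i, (v, d) in enumerate(zip(flat, dist)):
        if v not in reps or d < dist[reps[v]]:
            reps[v] = i
    return points[sorted(reps.values())]

# e.g. representatives = voxel_representatives(np.random.rand(100000, 3))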
Given two sets of points in n-dimensional space, how can one map points from one set to the other, such that each point is only used once and the total euclidean distance between the pairs of points is minimized?
For example,
import matplotlib.pyplot as plt
import numpy as np
# create six points in 2d space; the first three belong to set "A" and the
# second three belong to set "B"
x = [1, 2, 3, 1.8, 1.9, 3.4]
y = [2, 3, 1, 2.6, 3.4, 0.4]
colors = ['red'] * 3 + ['blue'] * 3
plt.scatter(x, y, c=colors)
plt.show()
So in the example above, the goal would be to map each red point to a blue point such that each blue point is only used once and the sum of the distances between points is minimized.
I came across this question which helps to solve the first part of the problem -- computing the distances between all pairs of points across sets using the scipy.spatial.distance.cdist() function.
From there, I could probably test every permutation of single elements from each row, and find the minimum.
The application I have in mind involves a fairly small number of datapoints in 3-dimensional space, so the brute force approach might be fine, but I thought I would check to see if anyone knows of a more efficient or elegant solution first.
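For the record, the brute-force idea I have in mind would look something like this (it checks every possible assignment, so it is only feasible for a handful of points):
from itertools import permutations

import numpy as np
from scipy.spatial.distance import cdist

A = np.column_stack([x[:3], y[:3]])   # the red points from the example above
B = np.column_stack([x[3:], y[3:]])   # the blue points
C = cdist(A, B)                       # distances between every red/blue pair

rows = np.arange(len(A))
best = min(permutations(range(len(B))),
           key=lambda perm: C[rows, np.array(perm)].sum())
print(best, C[rows, np.array(best)].sum())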
Here is an example of assigning (mapping) the elements of one set of points to the elements of another set of points, such that the sum of the Euclidean distances is minimized.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
np.random.seed(100)
points1 = np.array([(x, y) for x in np.linspace(-1,1,7) for y in np.linspace(-1,1,7)])
N = points1.shape[0]
points2 = 2*np.random.rand(N,2)-1
C = cdist(points1, points2)
_, assignment = linear_sum_assignment(C)

plt.plot(points1[:, 0], points1[:, 1], 'bo', markersize=10)
plt.plot(points2[:, 0], points2[:, 1], 'rs', markersize=7)
for p in range(N):
    plt.plot([points1[p, 0], points2[assignment[p], 0]],
             [points1[p, 1], points2[assignment[p], 1]], 'k')
plt.xlim(-1.1, 1.1)
plt.ylim(-1.1, 1.1)
plt.gca().set_aspect('equal')
plt.show()
There's a known algorithm for this, The Hungarian Method For Assignment, which works in O(n^3) time.
In SciPy, you can find an implementation in scipy.optimize.linear_sum_assignment
I just started using scipy/numpy. I have a 100000 x 3 array; each row is a coordinate, and I have a single 1 x 3 center point. I want to calculate the distance from each row in the array to the center and store them in another array. What is the most efficient way to do it?
I would take a look at scipy.spatial.distance.cdist:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
import numpy as np
import scipy.spatial

a = np.random.normal(size=(10, 3))
b = np.random.normal(size=(1, 3))
dist = scipy.spatial.distance.cdist(a, b)  # pick the appropriate distance metric
dist for the default distance metric is equivalent to:
np.sqrt(np.sum((a-b)**2,axis=1))
although cdist is much more efficient for large arrays (on my machine for your size problem, cdist is faster by a factor of ~35x).
I would use the sklearn implementation of the Euclidean distance. Its advantage is that it uses a more efficient expression based on matrix multiplication:
dist(x, y) = sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
A simple script would look like this:
import numpy as np

x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)
# row-wise squared norms, shaped for broadcasting
xx = np.einsum('ij,ij->i', x, x)[:, np.newaxis]
yy = np.einsum('ij,ij->i', y, y)[np.newaxis, :]
dist = np.sqrt(np.maximum(xx - 2 * np.dot(x, y.T) + yy, 0))
The advantage of this approach has been nicely described in the sklearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances
I am using this approach to crunch large data matrices (10000, 10000) with some minor modifications, such as using the np.einsum function.
You can also use the expansion of the norm (similar to a remarkable identity). This is probably the most efficient way to compute the distances for a matrix of points.
Here is a code snippet that I originally used for a k-Nearest-Neighbors implementation, in Octave, but you can easily adapt it to numpy since it only uses matrix multiplications (the equivalent is numpy.dot()):
% Computing the Euclidean distance between each known point (Xapp) and unknown points (Xtest)
% Note: we use the development of the norm just like a remarkable identity:
% ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2*<x1,x2>
[napp, d] = size(Xapp);
[ntest, d] = size(Xtest);
A = sum(Xapp.^2, 2);
A = repmat(A, 1, ntest);
B = sum(Xtest.^2, 2);
B = repmat(B', napp, 1);
C = Xapp*Xtest';
dist = A+B-2.*C;
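For reference, a direct numpy transliteration of that Octave snippet might look like this (a sketch; broadcasting stands in for repmat, and the shapes are invented for the example):
import numpy as np

Xapp = np.random.rand(500, 3)    # known points (napp x d)
Xtest = np.random.rand(200, 3)   # unknown points (ntest x d)

A = np.sum(Xapp ** 2, axis=1)[:, np.newaxis]    # ||x1||^2 as a column
B = np.sum(Xtest ** 2, axis=1)[np.newaxis, :]   # ||x2||^2 as a row
C = np.dot(Xapp, Xtest.T)                       # <x1, x2>
dist = A + B - 2 * C                            # squared distances, shape (napp, ntest)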
This might not answer your question directly, but if you are after the distances between all particle pairs, I've found the following solution to be faster than the pdist function in some cases.
import numpy as np
L = 100 # simulation box dimension
N = 100 # Number of particles
dim = 2 # Dimensions
# Generate random positions of particles
r = (np.random.random(size=(N,dim))-0.5)*L
# uti is a list of two (1-D) numpy arrays
# containing the indices of the upper triangular matrix
uti = np.triu_indices(N, k=1)  # k=1 eliminates diagonal indices
# uti[0] is i, and uti[1] is j from the previous example
dr = r[uti[0]] - r[uti[1]] # computes differences between particle positions
D = np.sqrt(np.sum(dr*dr, axis=1))  # computes distances; D is a 1-D array of length N*(N-1)/2 = 4950
See my blog post for a more in-depth look at this.
You may need to specify in more detail the distance function you are interested in, but here is a very simple (and efficient) implementation of the squared Euclidean distance based on an inner product (which can obviously be generalized, in a straightforward manner, to other kinds of distance measures):
In []: P, c= randn(5, 3), randn(1, 3)
In []: dot(((P- c)** 2), ones(3))
Out[]: array([ 8.80512, 4.61693, 2.6002, 3.3293, 12.41800])
Where P are your points and c is the center.
# Is this the right way to find the biggest distance between the points in the plane?
from math import sqrt

n = int(input("enter the range : "))
x = list(map(float, input("type x coordinates: ").split()))
y = list(map(float, input("type y coordinates: ").split()))

maxdis = 0
for i in range(n):
    for j in range(n):
        print(i, j, x[i], x[j], y[i], y[j])
        dist = sqrt((x[j] - x[i]) ** 2 + (y[j] - y[i]) ** 2)
        if maxdis < dist:
            maxdis = dist

print(" maximum distance is : {:5g}".format(maxdis))