Using pairwise_distances_chunked to compute nearest neighbor search

Using pairwise_distances_chunked to compute nearest neighbor search - python

I have a long skinny data matrix (size: 250,000 x 10), which I will denote X. I also have a vector p measuring the quality of my data points. My goal is to compute the following function for each row x in my data matrix X:
r(x) = min{ ||x-y|| | p[y]>p[x], y in X }
On a smaller dataset, what I would use sklearn.metrics.pairwise_distances to precompute distances, like so:
from sklearn import metrics
n = len(X);
D_full = metrics.pairwise_distances(X);
r = np.zeros((n,1));
for i in range(n):
r[i] = (D_full[i,p>p[i]]).min();
However, the above approach is memory-expensive, since I need to store D_full: a full n x n matrix. It seems like sklearn.metrics.pairwise_distances_chunked could be a good tool for this sort of problem since the distance matrix is only stored one chunk at a time. I was hoping to get some assistance in how to use it though, as I'm currently unfamiliar with generator objects. Suppose I call the following:
from sklearn import metrics
D = metrics.pairwise_distances_chunked(X);
D_chunk = next(D)
The above code yields D (a generator object) and D_chunk (a 536 x n array). Does D_chunkcorrespond to the first 536 rows of the matrix D_full from my earlier approach? If so, does next(D_chunk) correspond to the next 536 rows?
Thank you very much for your help.

This is a outline of a possible solution, but details are missing. In short, I would do the following:
Create a BallTree to query, and initialise min_quality_distance of size 250000 with say zeros.
for k=2
For each vector, find the closest k neighbour (including itself).
If vector with most distance within k found, has sufficient quality, update min_quality_distance for that point.
For remaining, repeat with k=k+1
In each iteration, we have to query less vectors. The idea is that in each iteration you nibble a few nearest neighbors with the right condition away, and it will be easier with every step. (50% easier?) I will show how to do the first iteration, and with this is should be possible to build the loop.
You can do;
import numpy as np
size = 250000
X = np.random.random( size=(size,10))
p = np.random.random( size=size)
And create a BallTree with
from sklearn.neighbors import BallTree
tree = BallTree(X, leaf_size=10, metric='minkowski')
and query it for first iteration with (this will take about 5 minutes.)
k_nearest = 2
distances, indici = tree.query(X, k=k_nearest, return_distance=True, dualtree=False, sort_results=True)
The indici of most far away point within the nearest k is
most_far_away_indici = indici[:,-1:]
And its quality
p[most_far_away_indici]
So we can
quality_closeby = p[most_far_away_indici]
And check if it is sifficient with
indici_sufficient_quality = quality_closeby > np.expand_dims(p, axis=1)
And we have
found_closeby = np.all( indici_sufficient_quality, axis=1 )
Which is True is we have found a sufficient quality nearby.
We can update the vector with
distances_nearby = distances[:,-1:]
rx = np.zeros(size)
rx[found_closeby] = distances_nearby[found_closeby][:,0]
And we now need to take care for the remaining where we were unlucky, these are
~found_closeby
so
indici_not_found = ~found_closeby
and
distances, indici = tree.query(X[indici_not_found], k=3, return_distance=True, dualtree=False, sort_results=True)
etc..
I am sure the first few loops will take minutes, but after a few iterations the speeds will quickly go to seconds.
It is a little exercise with np.argwhere() etc to make sure the right indicis get updates.
It might not be the fastest, but it is a workable approach.

Since one cannot know the dimensions of some chunk, I suggest using np.ones_like instead of np.zeros.

Related

How can I make the following code faster?

I am trying to optimize code for a challenge in codewars but I am having some trouble making it fast enough. The kata in question is this. The description is explained in a detailed way in the link I have given and I do not reproduce it here not to bloat the question. My attempted code is the following (in python)
# first I define the manhattan distance between any two given points
def manhattan(p1,p2):
return abs(p1[0]-p2[0])+abs(p1[1]-p2[1])
# I write a function that takes the minimum of the list of the distances from a given point to all agents
def distance(p,agents):
return min(manhattan(p,i) for i in agents)
# and the main function
def advice(agents,n):
# there could be agents outside of the range of the grid so I redefine agents cropping the agents that may be outside
agents=[i for i in agents if i[0]<n and i[1]<n]
# in case there is no grid or there are agents everywhere
if n==0 or len(agents)==n*n:
return []
# if there is no agent I output the whole matrix
if agents==[]:
return [(i,j) for i in range(n) for j in range(n)]
# THIS FOLLOWING PART IS THE PART THAT I THINK I NEED TO OPTIMIZE
# I define a dictionary with key point coordinates and value the output of the function distance then make a list with those entries with lowest values. This ends the script
dct=dict( ( (i,j),distance([i,j],agents) ) for i in range(n) for j in range(n))
# highest distance to an agent
mxm=max(dct.values())
return [ nn for nn,mm in dct.items() if mm==mxm]
The part I think I need to improve is marked in the code above. Any ideas of how to make this faster?

Your algorithm is doing a brute force check of all positions against all agents. If you want to accelerate the process, you have to use a strategy that allows you to skip large parts of these n x n x a combinations.
For example, you could group the points by distance in a dictionary starting with all the points at the largest possible distance. Then, go through the agents and redistribute only the farthest points to closer distances. Repeat this process until the farthest distance retains at least one point after going through all agents.
This would not eliminate all the distance checks but it skips a lot of distance computations for points that are already known to be closer to an agent than the farthest ones.
Here's an example:
def myAdvice2(agents,n):
maxDist = 2*n
distPoints = { maxDist:[ (x,y) for x in range(n) for y in range(n)] }
index = 0
retained = 0
lastMax = None
# keep refining until farthest distance is kept for all agents
while retained < len(agents):
ax,ay = agents[index]
index = (index + 1) % len(agents)
# redistribute points at farthest distance for this agent
points = distPoints.pop(maxDist)
for x,y in points:
distance = min(maxDist,abs(x-ax)+abs(y-ay))
if distance not in distPoints:
distPoints[distance] = []
distPoints[distance].append((x,y))
maxDist = max(distPoints)
# count retained agents at MaxDist
retained += 1
# reset count when maxDist is reduced
if lastMax and maxDist < lastMax : retained = 0
lastMax = maxDist
# once all agents have been applied to largest distance, we have a solution
return distPoints[maxDist]
In my performance tests this is about 3x faster and performs better on larger grids.
[EDIT] This algorithm can be further accelerated by sorting the agents based on their distance to the center. This will ensure that points at the farthest distances are eliminated faster because agents at central positions cover more ground.
You can add this sort at the beginning of the function:
agents = sorted(agents, key=lambda p: abs(p[0]-n//2)+abs(p[1]-n//2))
This will yield a 10% to 15% improvement on speed depending on the number of agent, their positions and the size of the area. Further optimisations could determine when this sort is worthwhile based on data.
[EDIT2] if you're going to use a brute force approach, leveraging parallel processing (using the GPU) could give suprizingly fast results.
This example, using numpy, is 3 to 10 times faster than my original approach. I'm guessing that this will only be the case up to a point (having more agents slows it down considerably) but it was much faster in all small scale tests I did.
import numpy as np
def myAdvice4(agents,n):
ax,ay = [np.array(c)[:,None,None] for c in zip(*agents)]
px = np.repeat(np.arange(n)[None,:],n,axis=0)
py = np.repeat(np.arange(n)[:,None],n,axis=1)
agentDist = abs(ax - px) + abs(ay - py)
minDist = np.min(agentDist,axis=0)
maxDist = np.max(minDist)
maxCoord = zip(*np.where(minDist == maxDist))
return list(maxCoord)

You can use sklearn.metrics.pairwise_distances with manhattan metric.

Simple k-means algorithm in Python

The following is a very simple implementation of the k-means algorithm.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(0)
DIM = 2
N = 2000
num_cluster = 4
iterations = 3
x = np.random.randn(N, DIM)
y = np.random.randint(0, num_cluster, N)
mean = np.zeros((num_cluster, DIM))
for t in range(iterations):
for k in range(num_cluster):
mean[k] = np.mean(x[y==k], axis=0)
for i in range(N):
dist = np.sum((mean - x[i])**2, axis=1)
pred = np.argmin(dist)
y[i] = pred
for k in range(num_cluster):
plt.scatter(x[y==k,0], x[y==k,1])
plt.show()
Here are two example outputs the code produces:
The first example (num_cluster = 4) looks as expected. The second example (num_cluster = 11) however shows only on cluster which is clearly not what I wanted. The code works depending on the number of classes I define and the number of iterations.
So far, I couldn't find the bug in the code. Somehow the clusters disappear but I don't know why.
Does anyone see my mistake?

You're getting one cluster because there really is only one cluster.
There's nothing in your code to avoid clusters disappearing, and the truth is that this will happen also for 4 clusters but after more iterations.
I ran your code with 4 clusters and 1000 iterations and they all got swallowed up in the one big and dominant cluster.
Think about it, your large cluster passes a critical point, and just keeps growing because other points are gradually becoming closer to it than to their previous mean.
This will not happen in the case that you reach an equilibrium (or stationary) point, in which nothing moves between clusters. But it's obviously a bit rare, and more rare the more clusters you're trying to estimate.
A clarification: The same thing can happen also when there are 4 "real" clusters and you're trying to estimate 4 clusters. But that would mean a rather nasty initialization and can be avoided by intelligently aggregating multiple randomly seeded runs.
There are also common "tricks" like taking the initial means to be far apart, or at the centers of different pre-estimated high density locations, etc. But that's starting to get involved, and you should read more deeply about k-means for that purpose.

K-means is also pretty sensitive to initial conditions. That said, k-means can and will drop clusters (but dropping to one is weird). In your code, you assign random clusters to the points.
Here's the problem: if I take several random subsamples of your data, they're going to have about the same mean point. Each iteration, the very similar centroids will be close to each other and more likely to drop.
Instead, I changed your code to pick num_cluster number of points in your data set to use as the initial centroids (higher variance). This seems to produce more stable results (didn't observe the dropping to one cluster behavior over several dozen runs):
import numpy as np
import matplotlib.pyplot as plt
DIM = 2
N = 2000
num_cluster = 11
iterations = 3
x = np.random.randn(N, DIM)
y = np.zeros(N)
# initialize clusters by picking num_cluster random points
# could improve on this by deliberately choosing most different points
for t in range(iterations):
if t == 0:
index_ = np.random.choice(range(N),num_cluster,replace=False)
mean = x[index_]
else:
for k in range(num_cluster):
mean[k] = np.mean(x[y==k], axis=0)
for i in range(N):
dist = np.sum((mean - x[i])**2, axis=1)
pred = np.argmin(dist)
y[i] = pred
for k in range(num_cluster):
fig = plt.scatter(x[y==k,0], x[y==k,1])
plt.show()

It does seem that there are NaN's entering the picture.
Using a seed=1, iterations=2, the number of clusters reduce from the initial 4 to effectively 3. In the next iteration this technically plummets to 1.
The NaN mean coordinates of the problematic centroid then result in weird things. To rule out those problematic clusters which became empty, one (possibly a bit too lazy) option is to set the related coordinates to Inf, thereby making it a "more distant than any other" point than those still in the game (as long as the 'input' coordinates cannot be Inf).
The below snippet is a quick illustration of that and a few debug messages that I used to peek into what was going on:
[...]
for k in range(num_cluster):
mean[k] = np.mean(x[y==k], axis=0)
# print mean[k]
if any(np.isnan(mean[k])):
# print "oh no!"
mean[k] = [np.Inf] * DIM
[...]
With this modification the posted algorithm seems to work in a more stable fashion (i.e. I couldn't break it so far).
Please also see the Quora link also mentioned among the comments about the split opinions, and the book "The Elements of Statistical Learning" for example here - the algorithm is not too explicitly defined there either in the relevant respect.

Choosing subset of farthest points in given set of points

Imagine you are given set S of n points in 3 dimensions. Distance between any 2 points is simple Euclidean distance. You want to chose subset Q of k points from this set such that they are farthest from each other. In other words there is no other subset Q’ of k points exists such that min of all pair wise distances in Q is less than that in Q’.
If n is approximately 16 million and k is about 300, how do we efficiently do this?
My guess is that this NP-hard so may be we just want to focus on approximation. One idea I can think of is using Multidimensional scaling to sort these points in a line and then use version of binary search to get points that are furthest apart on this line.

This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991) and Ravi et al. (1994) give a factor-2 approximation for the problem with the latter proving this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i,j])
while size(P)<p:
Find a node v in V-P such that min_{v' in P} dist(v,v') is maximum.
\That is: find the node with the greatest minimum distance to the set P.
P = P.union(v)
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np
p = 50
N = 400
print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2 #Make the matrix symmetric
print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
for j in range(i+1,N):
if d[i,j]>maxdist:
maxdist = d[i,j]
bestpair = (i,j)
P = set()
P.add(bestpair[0])
P.add(bestpair[1])
print("Finding optimal set...")
while len(P)<p:
print("P size = {0}".format(len(P)))
maxdist = 0
vbest = None
for v in range(N):
if v in P:
continue
for vprime in P:
if d[v,vprime]>maxdist:
maxdist = d[v,vprime]
vbest = v
P.add(vbest)
print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400 after 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb
p = 50
N = 400
print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2 #Make the matrix symmetric
m = grb.Model(name="MIP Model")
used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]
objective = grb.quicksum( d[i,j]*used[i]*used[j] for i in range(0,N) for j in range(i+1,N) )
m.addConstr(
lhs=grb.quicksum(used),
sense=grb.GRB.EQUAL,
rhs=p
)
# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)
# m.Params.TimeLimit = 3*60
# solving with Glpk
ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist
p = 300
N = 16000000
# Find a convex hull in O(N log N)
points = np.random.rand(N, 3) # N random points in 3-D
# Returned 420 points in testing
hull = ConvexHull(points)
# Extract the points forming the hull
hullpoints = points[hull.vertices,:]
# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')
# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)
P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])
# Now we have a problem
print("Finding optimal set...")
while len(P)<p:
print("P size = {0}".format(len(P)))
distance_to_P = cdist(points, P)
minimum_to_each_of_P = np.min(distance_to_P, axis=1)
best_new_point_idx = np.argmax(minimum_to_each_of_P)
best_new_point = np.expand_dims(points[best_new_point_idx,:],0)
P = np.append(P,best_new_point,axis=0)
print(P)

I am also pretty sure that the problem is NP-Hard, the most similar problem I found is the k-Center Problem. If runtime is more important than correctness a greedy algorithm is probably your best choice:
Q ={}
while |Q| < k
Q += p from S where mindist(p, Q) is maximal
Side note: In similar problems e.g., the set-cover problem it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up I see 3 possibilities:
Index your dataset in an R-Tree first, then perform a greedy search. Construction of the R-Tree is O(n log n), but though being developed for nearest neighbor search, it can also help you finding the furthest point to a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximate anyway so you might be able to spare a little more accuracy. You can also combine this with the 1. algorithm.
Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (lets call this set Q'). Then in each step you switch the point p_ from Q' that has the minimum distance to another one in Q' with a random point from S. If the resulting set Q'' is better proceed with Q'', otherwise repeat with Q'. In order not to get stuck you might want to choose another point from Q' than p_ if you could not find an adequate replacement for a couple of iterations.

If you can afford to do ~ k*n distance calculations then you could
Find the center of the distribution of points.
Select the point furthest from the center. (and remove it from the set of un-selected points).
Find the point furthest from all the currently selected points and select it.
Repeat 3. until you end with k points.

Find the maximum extent of all points. Split into 7x7x7 voxels. For all points in a voxel find the point closest to its centre. Return these 7x7x7 points. Some voxels may contain no points, hopefully not too many.

How can I speed up nearest neighbor search with python?

I have a code, which calculates the nearest voxel (which is unassigned) to a voxel ( which is assigned). That is i have an array of voxels, few voxels already have a scalar (1,2,3,4....etc) values assigned, and few voxels are empty (lets say a value of '0'). This code below finds the nearest assigned voxel to an unassigned voxel and assigns that voxel the same scalar. So, a voxel with a scalar '0' will be assigned a value (1 or 2 or 3,...) based on the nearest voxel. This code below works, but it takes too much time.
Is there an alternative to this ? or if you have any feedback on how to improve it further?
""" #self.voxels is a 3D numpy array"""
def fill_empty_voxel1(self,argx, argy, argz):
""" where # argx, argy, argz are the voxel location where the voxel is zero"""
argx1, argy1, argz1 = np.where(self.voxels!=0) # find the non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= self.mean) # self.mean is a mean radius search value
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
self.voxels[argx,argy,argz] = self.voxels[argx2,argy2,argz2] # update the voxel array
Example
""" Here is a small example with small dataset:"""
import numpy as np
from scipy.spatial import cKDTree
import timeit
voxels = np.zeros((10,10,5), dtype=np.uint8)
voxels[1:2,:,:] = 5.
voxels[5:6,:,:] = 2.
voxels[:,3:4,:] = 1.
voxels[:,8:9,:] = 4.
argx, argy, argz = np.where(voxels==0)
tic=timeit.default_timer()
argx1, argy1, argz1 = np.where(voxels!=0) # non zero voxels
a = np.column_stack((argx1, argy1, argz1))
b = np.column_stack((argx, argy, argz))
tree = cKDTree(a, leafsize=a.shape[0]+1)
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5.)
argx2, argy2, argz2 = a[ndx][:][:,0],a[ndx][:][:,1],a[ndx][:][:,2]
voxels[argx,argy,argz] = voxels[argx2,argy2,argz2]
toc=timeit.default_timer()
timetaken = toc - tic #elapsed time in seconds
print '\nTime to fill empty voxels', timetaken
for visualization:
from mayavi import mlab
data = voxels.astype('float')
scalar_field = mlab.pipeline.scalar_field(data)
iso_surf = mlab.pipeline.iso_surface(scalar_field)
surf = mlab.pipeline.surface(scalar_field)
vol = mlab.pipeline.volume(scalar_field,vmin=0,vmax=data.max())
mlab.outline()
mlab.show()
Now, if I have the dimension of the voxels array as something like (500,500,500), then the time it takes to compute the nearest search is no longer efficient. How can I overcome this? Could parallel computation reduce the time (I have no idea whether I can parallelize the code, if you do, please let me know)?
A potential fix:
I could substantially improve the computation time by adding the n_jobs = -1 parameter in the cKDTree query.
distances, ndx = tree.query(b, k=1, distance_upper_bound= 5., n_jobs=-1)
I was able to compute the distances in less than a hour for an array of (400,100,100) on a 13 core CPU. I tried with 1 processor and it takes around 18 hours to complete the same array.
Thanks to #gsamaras for the answer!

You can switch to approximate nearest neighbors (ANN) algorithms which usually take advantage of sophisticated hashing or proximity graph techniques to index your data quickly and perform faster queries. One example is Spotify's Annoy. Annoy's README includes a plot which shows precision-performance tradeoff comparison of various ANN algorithms published in recent years. The top-performing algorithm (at the time this comment was posted), hnsw, has a Python implementation under Non-Metric Space Library (NMSLIB).

It would be interesting to try sklearn.neighbors.NearestNeighbors, which offers n_jobs parameter:
The number of parallel jobs to run for neighbors search.
This package also provides the Ball Tree algorithm, which you can test versus the kd-tree one, however my hunch is that the kd-tree will be better (but that again does depend on your data, so research that!).
You might also want to use dimensionality reduction, which is easy. The idea is that you reduce your dimensions, thus your data contain less info, so that tackling the Nearest Neighbour Problem can be done much faster. Of course, there is a trade off here, accuracy!
You might/will get less accuracy with dimensionality reduction, but it might worth the try. However, this usually applies in a high dimensional space, and you are just in 3D. So I don't know if for your specific case it would make sense to use sklearn.decomposition.PCA.
A remark:
If you really want high performance though, you won't get it with python, you could switch to c++, and use CGAL for example.

How can I select L neighbors

I'd like to know how to select the optimal L neighbors of a specific point. Like saying I need to select the 5 neighbors. Is there any parameter to change.
I want to let it select L points where : L = SQRT[number of points in data set]
I have a huge data set, so I might find a lot of points near to each others while others far from them.
L, the number of neighbors to consider, can be chosen arbitrarily, or
with cross validation. With more training data, L can be larger, since
the training data is more dense in the underlying space X. With more
discontinuous or nonlinear dynamics in the classification, K should be
smaller, to capture these more local fluctuations.
NearestNeighbors(algorithm='auto', leaf_size=30, n_neighbors=5, p=2,
radius=1.0, warn_on_equidistant=True)

I want to let it select L points where : L = SQRT[number of points in data set]
That's not possible unless you compute the number of samples and its square root yourself. You can only pass an integer as n_neighbors.
The only way to take a variable number of neighbors into account is to use RadiusNeighbors{Classifier,Regressor}, which take a distance cutoff instead of a k parameter.

Please try the following example:
import numpy as np
rng = np.random.RandomState(42)
from sklearn.neighbors import NearestNeighbors
nnbrs = NearestNeighbors(n_neighbors=5)
points = rng.randn(500, 3)
nnbrs.fit(points)
point_of_interest = np.array([0, 1, 0])
distances, neighbor_indices = nnbrs.kneighbors(point_of_interest)
neighbors = points[neighbor_indices]
Does this obtain the desired result? You should try this on your sparse matrix data and play with algorithm= (see docs), if there are calculation time / memory problems

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.