How can I make the following code faster?

How can I make the following code faster? - python

I am trying to optimize code for a challenge in codewars but I am having some trouble making it fast enough. The kata in question is this. The description is explained in a detailed way in the link I have given and I do not reproduce it here not to bloat the question. My attempted code is the following (in python)
# first I define the manhattan distance between any two given points
def manhattan(p1,p2):
return abs(p1[0]-p2[0])+abs(p1[1]-p2[1])
# I write a function that takes the minimum of the list of the distances from a given point to all agents
def distance(p,agents):
return min(manhattan(p,i) for i in agents)
# and the main function
def advice(agents,n):
# there could be agents outside of the range of the grid so I redefine agents cropping the agents that may be outside
agents=[i for i in agents if i[0]<n and i[1]<n]
# in case there is no grid or there are agents everywhere
if n==0 or len(agents)==n*n:
return []
# if there is no agent I output the whole matrix
if agents==[]:
return [(i,j) for i in range(n) for j in range(n)]
# THIS FOLLOWING PART IS THE PART THAT I THINK I NEED TO OPTIMIZE
# I define a dictionary with key point coordinates and value the output of the function distance then make a list with those entries with lowest values. This ends the script
dct=dict( ( (i,j),distance([i,j],agents) ) for i in range(n) for j in range(n))
# highest distance to an agent
mxm=max(dct.values())
return [ nn for nn,mm in dct.items() if mm==mxm]
The part I think I need to improve is marked in the code above. Any ideas of how to make this faster?

Your algorithm is doing a brute force check of all positions against all agents. If you want to accelerate the process, you have to use a strategy that allows you to skip large parts of these n x n x a combinations.
For example, you could group the points by distance in a dictionary starting with all the points at the largest possible distance. Then, go through the agents and redistribute only the farthest points to closer distances. Repeat this process until the farthest distance retains at least one point after going through all agents.
This would not eliminate all the distance checks but it skips a lot of distance computations for points that are already known to be closer to an agent than the farthest ones.
Here's an example:
def myAdvice2(agents,n):
maxDist = 2*n
distPoints = { maxDist:[ (x,y) for x in range(n) for y in range(n)] }
index = 0
retained = 0
lastMax = None
# keep refining until farthest distance is kept for all agents
while retained < len(agents):
ax,ay = agents[index]
index = (index + 1) % len(agents)
# redistribute points at farthest distance for this agent
points = distPoints.pop(maxDist)
for x,y in points:
distance = min(maxDist,abs(x-ax)+abs(y-ay))
if distance not in distPoints:
distPoints[distance] = []
distPoints[distance].append((x,y))
maxDist = max(distPoints)
# count retained agents at MaxDist
retained += 1
# reset count when maxDist is reduced
if lastMax and maxDist < lastMax : retained = 0
lastMax = maxDist
# once all agents have been applied to largest distance, we have a solution
return distPoints[maxDist]
In my performance tests this is about 3x faster and performs better on larger grids.
[EDIT] This algorithm can be further accelerated by sorting the agents based on their distance to the center. This will ensure that points at the farthest distances are eliminated faster because agents at central positions cover more ground.
You can add this sort at the beginning of the function:
agents = sorted(agents, key=lambda p: abs(p[0]-n//2)+abs(p[1]-n//2))
This will yield a 10% to 15% improvement on speed depending on the number of agent, their positions and the size of the area. Further optimisations could determine when this sort is worthwhile based on data.
[EDIT2] if you're going to use a brute force approach, leveraging parallel processing (using the GPU) could give suprizingly fast results.
This example, using numpy, is 3 to 10 times faster than my original approach. I'm guessing that this will only be the case up to a point (having more agents slows it down considerably) but it was much faster in all small scale tests I did.
import numpy as np
def myAdvice4(agents,n):
ax,ay = [np.array(c)[:,None,None] for c in zip(*agents)]
px = np.repeat(np.arange(n)[None,:],n,axis=0)
py = np.repeat(np.arange(n)[:,None],n,axis=1)
agentDist = abs(ax - px) + abs(ay - py)
minDist = np.min(agentDist,axis=0)
maxDist = np.max(minDist)
maxCoord = zip(*np.where(minDist == maxDist))
return list(maxCoord)

You can use sklearn.metrics.pairwise_distances with manhattan metric.

Related

Using pairwise_distances_chunked to compute nearest neighbor search

I have a long skinny data matrix (size: 250,000 x 10), which I will denote X. I also have a vector p measuring the quality of my data points. My goal is to compute the following function for each row x in my data matrix X:
r(x) = min{ ||x-y|| | p[y]>p[x], y in X }
On a smaller dataset, what I would use sklearn.metrics.pairwise_distances to precompute distances, like so:
from sklearn import metrics
n = len(X);
D_full = metrics.pairwise_distances(X);
r = np.zeros((n,1));
for i in range(n):
r[i] = (D_full[i,p>p[i]]).min();
However, the above approach is memory-expensive, since I need to store D_full: a full n x n matrix. It seems like sklearn.metrics.pairwise_distances_chunked could be a good tool for this sort of problem since the distance matrix is only stored one chunk at a time. I was hoping to get some assistance in how to use it though, as I'm currently unfamiliar with generator objects. Suppose I call the following:
from sklearn import metrics
D = metrics.pairwise_distances_chunked(X);
D_chunk = next(D)
The above code yields D (a generator object) and D_chunk (a 536 x n array). Does D_chunkcorrespond to the first 536 rows of the matrix D_full from my earlier approach? If so, does next(D_chunk) correspond to the next 536 rows?
Thank you very much for your help.

This is a outline of a possible solution, but details are missing. In short, I would do the following:
Create a BallTree to query, and initialise min_quality_distance of size 250000 with say zeros.
for k=2
For each vector, find the closest k neighbour (including itself).
If vector with most distance within k found, has sufficient quality, update min_quality_distance for that point.
For remaining, repeat with k=k+1
In each iteration, we have to query less vectors. The idea is that in each iteration you nibble a few nearest neighbors with the right condition away, and it will be easier with every step. (50% easier?) I will show how to do the first iteration, and with this is should be possible to build the loop.
You can do;
import numpy as np
size = 250000
X = np.random.random( size=(size,10))
p = np.random.random( size=size)
And create a BallTree with
from sklearn.neighbors import BallTree
tree = BallTree(X, leaf_size=10, metric='minkowski')
and query it for first iteration with (this will take about 5 minutes.)
k_nearest = 2
distances, indici = tree.query(X, k=k_nearest, return_distance=True, dualtree=False, sort_results=True)
The indici of most far away point within the nearest k is
most_far_away_indici = indici[:,-1:]
And its quality
p[most_far_away_indici]
So we can
quality_closeby = p[most_far_away_indici]
And check if it is sifficient with
indici_sufficient_quality = quality_closeby > np.expand_dims(p, axis=1)
And we have
found_closeby = np.all( indici_sufficient_quality, axis=1 )
Which is True is we have found a sufficient quality nearby.
We can update the vector with
distances_nearby = distances[:,-1:]
rx = np.zeros(size)
rx[found_closeby] = distances_nearby[found_closeby][:,0]
And we now need to take care for the remaining where we were unlucky, these are
~found_closeby
so
indici_not_found = ~found_closeby
and
distances, indici = tree.query(X[indici_not_found], k=3, return_distance=True, dualtree=False, sort_results=True)
etc..
I am sure the first few loops will take minutes, but after a few iterations the speeds will quickly go to seconds.
It is a little exercise with np.argwhere() etc to make sure the right indicis get updates.
It might not be the fastest, but it is a workable approach.

Since one cannot know the dimensions of some chunk, I suggest using np.ones_like instead of np.zeros.

Efficient way to sample N points from a 3D space with the constraint of a minimum distance between two points in Python

I have 200 data points, each point is a list of 3 numbers that represents the position. I want to sample N=100 points from this 3D space, but with the constraint that the minimum distance between every two points must be larger than 0.15. The script below is the way I sample the points, but it just keeps running and never stops. Also, if I set a N larger than a value, the code cannot find all N points because I sample each point randomly and it gets to a point where no points can be sampled that isn't too close to the current points, but in reality the N can be much larger than this value if the point distribution is very "dense" (but still satisfies minimum distance larger than 0.15). Is there a more efficient way to do this?
import numpy as np
import random
import time
def get_random_points_not_too_close(points, npoints, min_distance):
random.shuffle(points)
final_points = [points[0]]
while len(final_points) < npoints:
for point in points:
if point in final_points:
continue
elif min([np.linalg.norm(np.array(p) - np.array(point)) for p in final_points]) > min_distance:
final_points.append(point)
return final_points
data = [[random.random() for i in range(3)] for j in range(200)]
t1 = time.time()
sample_points = get_random_points_not_too_close(points=data, npoints=100, min_distance=0.15)
t2 = time.time()
print(t2-t1)

Your algorithm could work for small sets of points, but it does not run in a deterministic time.
I did the following to create a random forest (simulated trees): first generate a square grid of points where the grid point distance is 3x the minimal distance. Now you take each point in the regular grid and translate it in a random direction, at a random distance which is max your distance. The result points will never be closer than max your distance.

How to get the K most distant points, given their coordinates?

We have boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that would have maximised distance between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how to select K most distant rows (points) from N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution yet not for 3d points.
I search for a reasonably fast approach (approximate - no precise solution needed) for K=200 and N=100000 and ND=6 (probably multigrid or ANN on KDTree based, SOM or triangulation based..).. Does anyone know one?

From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean, works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist
Npoints = 3 # or 4 or 5...
# making up some data:
data = np.matrix([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints )]
distances = []
for i in c:
distances.append(np.mean(pdist(data[i,:]))) # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances)) # finding the index of the max mean distance
rows = c[ind] # these are the points in question

I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
raise ValueError("bad ndarray dimensions combination")
return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
for i in range(K):
# remove this point from the K points
old_index = indices[i]
# calculate its sum of distances from the K points
distsums[old_index] = distances(data[indices], data[old_index]).sum()
# update the sums of distances of all points from the K-1 points
distsums -= distances(data, data[old_index])
# choose the point with the greatest sum of distances from the K-1 points
new_index = np.argmax(distsums)
# add it to the K points replacing the old_index
indices[i] = new_index
# don't consider it any more in distsums
distsums[new_index] = -np.inf
# update the sums of distances of all points from the K points
distsums += distances(data, data[new_index])
# sum all mutual distances of the K points
curr_sum = spatial.distance.pdist(data[indices]).sum()
# break if the sum hasn't changed
if curr_sum == prev_sum:
break
prev_sum = curr_sum
if ND == 2:
X, Y = data.T
marker_size = 4
plt.scatter(X, Y, s=marker_size)
plt.scatter(X[indices], Y[indices], s=marker_size)
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:

Assuming that if you read your csv file with N (10000) rows and D dimension (or features) into a N*D martix X. You can calculate the distance between each point and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X) ### convert to numpy array
distance_matrix = np.zeros((X.shape[0],X.shape[0]))
for i in range(X.shape[0]):
for j in range(i+1,X.shape[0]):
## We compute triangle matrix and copy the rest. Distance from point A to point B and distance from point B to point A are the same.
distance_matrix[i][j]= np.linalg.norm(X[i]-X[j]) ## Here I am calculating Eucledian distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to get the lower triangle of distance matrix, which is not really required in your case.
K = 5 ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)

Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this an interesting question but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2d, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks his bottom diagram and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for k furthest points but have a large number of equally valid possible solutions and no way prioritizing them. Here are two super easy examples illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3 Revenge of The Curse of Dimensionality Because you're looking for the most distant points, you have to x,y,z ... n coordinates for each point or you have to impute them. Now, your data set is much larger and slower.
Challenge 4 Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k) but the total dimensional volume of space will increase to power of the dimensions. The k number of furthest points you're is insignificant to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increase.
Now, if you had a low dimensionality, I would go with them as a solution (except the ones that are use nested for loops ... in NumPy or Pandas).
If I was in your position, I'd be thinking how I've got code in these other answers that I could use as a basis and maybe wonder why should I should trust this other than it lays out a framework on how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me reference to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's a though you put a balloon in a box. You could do this a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box For a 2D graph you have four points, for a 3D you have the 8 corners of the box (2^3).
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll 2^(D-1) points. So, for that linear 2D space, you have a line, for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post on by Inversion Labs on Best-fit Surfaces for 3D Data
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be a candidate for being equidistant. But if that were 3D space and rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertice (or point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satifice the solution.

If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
# You can speed this up using cython like scikit-learn does or numba
dist = np.sum((x1 - x2) ** 2)
# We invert the euclidean distance and set nearby points to the biggest possible
# positive float that isn't inf
inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.

Choosing subset of farthest points in given set of points

Imagine you are given set S of n points in 3 dimensions. Distance between any 2 points is simple Euclidean distance. You want to chose subset Q of k points from this set such that they are farthest from each other. In other words there is no other subset Q’ of k points exists such that min of all pair wise distances in Q is less than that in Q’.
If n is approximately 16 million and k is about 300, how do we efficiently do this?
My guess is that this NP-hard so may be we just want to focus on approximation. One idea I can think of is using Multidimensional scaling to sort these points in a line and then use version of binary search to get points that are furthest apart on this line.

This is known as the discrete p-dispersion (maxmin) problem.
The optimality bound is proved in White (1991) and Ravi et al. (1994) give a factor-2 approximation for the problem with the latter proving this heuristic is the best possible (unless P=NP).
Factor-2 Approximation
The factor-2 approximation is as follows:
Let V be the set of nodes/objects.
Let i and j be two nodes at maximum distance.
Let p be the number of objects to choose.
P = set([i,j])
while size(P)<p:
Find a node v in V-P such that min_{v' in P} dist(v,v') is maximum.
\That is: find the node with the greatest minimum distance to the set P.
P = P.union(v)
Output P
You could implement this in Python like so:
#!/usr/bin/env python3
import numpy as np
p = 50
N = 400
print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2 #Make the matrix symmetric
print("Finding initial edge...")
maxdist = 0
bestpair = ()
for i in range(N):
for j in range(i+1,N):
if d[i,j]>maxdist:
maxdist = d[i,j]
bestpair = (i,j)
P = set()
P.add(bestpair[0])
P.add(bestpair[1])
print("Finding optimal set...")
while len(P)<p:
print("P size = {0}".format(len(P)))
maxdist = 0
vbest = None
for v in range(N):
if v in P:
continue
for vprime in P:
if d[v,vprime]>maxdist:
maxdist = d[v,vprime]
vbest = v
P.add(vbest)
print(P)
Exact Solution
You could also model this as an MIP. For p=50, n=400 after 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less). A naive Gurobi Python representation might look like this:
#!/usr/bin/env python
import numpy as np
import gurobipy as grb
p = 50
N = 400
print("Building distance matrix...")
d = np.random.rand(N,N) #Random matrix
d = (d + d.T)/2 #Make the matrix symmetric
m = grb.Model(name="MIP Model")
used = [m.addVar(vtype=grb.GRB.BINARY) for i in range(N)]
objective = grb.quicksum( d[i,j]*used[i]*used[j] for i in range(0,N) for j in range(i+1,N) )
m.addConstr(
lhs=grb.quicksum(used),
sense=grb.GRB.EQUAL,
rhs=p
)
# for maximization
m.ModelSense = grb.GRB.MAXIMIZE
m.setObjective(objective)
# m.Params.TimeLimit = 3*60
# solving with Glpk
ret = m.optimize()
Scaling
Obviously, the O(N^2) scaling for the initial points is bad. We can find them more efficiently by recognizing that the pair must lie on the convex hull of the dataset. This gives us an O(N log N) way to find the pair. Once we've found it we proceed as before (using SciPy for acceleration).
The best way of scaling would be to use an R*-tree to efficiently find the minimum distance between a candidate point p and the set P. But this cannot be done efficiently in Python since a for loop is still involved.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist
p = 300
N = 16000000
# Find a convex hull in O(N log N)
points = np.random.rand(N, 3) # N random points in 3-D
# Returned 420 points in testing
hull = ConvexHull(points)
# Extract the points forming the hull
hullpoints = points[hull.vertices,:]
# Naive way of finding the best pair in O(H^2) time if H is number of points on
# hull
hdist = cdist(hullpoints, hullpoints, metric='euclidean')
# Get the farthest apart points
bestpair = np.unravel_index(hdist.argmax(), hdist.shape)
P = np.array([hullpoints[bestpair[0]],hullpoints[bestpair[1]]])
# Now we have a problem
print("Finding optimal set...")
while len(P)<p:
print("P size = {0}".format(len(P)))
distance_to_P = cdist(points, P)
minimum_to_each_of_P = np.min(distance_to_P, axis=1)
best_new_point_idx = np.argmax(minimum_to_each_of_P)
best_new_point = np.expand_dims(points[best_new_point_idx,:],0)
P = np.append(P,best_new_point,axis=0)
print(P)

I am also pretty sure that the problem is NP-Hard, the most similar problem I found is the k-Center Problem. If runtime is more important than correctness a greedy algorithm is probably your best choice:
Q ={}
while |Q| < k
Q += p from S where mindist(p, Q) is maximal
Side note: In similar problems e.g., the set-cover problem it can be shown that the solution from the greedy algorithm is at least 63% as good as the optimal solution.
In order to speed things up I see 3 possibilities:
Index your dataset in an R-Tree first, then perform a greedy search. Construction of the R-Tree is O(n log n), but though being developed for nearest neighbor search, it can also help you finding the furthest point to a set of points in O(log n). This might be faster than the naive O(k*n) algorithm.
Sample a subset from your 16 million points and perform the greedy algorithm on the subset. You are approximate anyway so you might be able to spare a little more accuracy. You can also combine this with the 1. algorithm.
Use an iterative approach and stop when you are out of time. The idea here is to randomly select k points from S (lets call this set Q'). Then in each step you switch the point p_ from Q' that has the minimum distance to another one in Q' with a random point from S. If the resulting set Q'' is better proceed with Q'', otherwise repeat with Q'. In order not to get stuck you might want to choose another point from Q' than p_ if you could not find an adequate replacement for a couple of iterations.

If you can afford to do ~ k*n distance calculations then you could
Find the center of the distribution of points.
Select the point furthest from the center. (and remove it from the set of un-selected points).
Find the point furthest from all the currently selected points and select it.
Repeat 3. until you end with k points.

Find the maximum extent of all points. Split into 7x7x7 voxels. For all points in a voxel find the point closest to its centre. Return these 7x7x7 points. Some voxels may contain no points, hopefully not too many.

Improving a method for a spatially aware median filter for point clouds in Python

I have point cloud data from airborne LiDAR. It is noisy, so I want to run a median filter which collects points within N metres of each point, finds the median elevation value, and returns the neighbourhood median as an adjusted elevation value.
This is roughly analogous to gridding the data, and taking the median of elevations within each bin. Or scipy.signal.medfilt.
But - I want to preserve the location (x,y) of each point. Also I'm not sure that medfilt preserves the spatial information required.
I have a method, but it involves multiple for loops. Expensive when millions of points go in
Updated - for each iteration of the first loop, a small patch of points is selected for the shapely intersection operation. The first version searched all input points for an intersection at every iteration. Now, only a small patch at a time is converted to a shapely geometry and used for the intersection:
import numpy as np
from shapely import geometry
def spatialmedian(pointcloud,radius):
"""
Using shapely geometires, replace every point in a cloud with the
median value of points within 'radius' units of the point
'pointcloud' must have no more than 3 dimensions (x,y,z)
"""
new_z = []
i = 0
for point in pointcloud:
#pick a point and make it a shapely Point
point = geometry.Point(pointcloud[i,:])
#select a patch around the point and make it a shapely
# MultiPoint
patch = geometry.MultiPoint(list(pointcloud[\
(pointcloud[:,0] > point.x - radius+0.5) &\
(pointcloud[:,0] < point.x + radius+0.5) &\
(pointcloud[:,1] > point.y - radius+0.5) &\
(pointcloud[:,1] < point.y + radius+0.5)\
]))
#buffer the Point by radius
pbuff = point.buffer(radius)
#use the intersection method to find points in our
# patch that lie inside the Point buffer
isect = pbuff.intersection(patch)
#print(isect.geom_type)
#initialise another list
plist = []
#for every intersection set,
# unpack it into a list and collect the median
# Z value.
if isect.geom_type == 'MultiPoint':
#print('point has neightbours')
for p in isect:
plist.append(p.z)
new_z.append(np.median(plist))
else:
# if the intersection set isn't MultiPoint,
# it is an isolated point, whose median Z value
# is it's own.
#print('isolated point')
#append it to the big list
new_z.append(isect.z)
#iterate i
i += 1
#print(i)
#return a list of new median filtered Z coordinates
return new_z
This works by:
ingesting a list/array of XYZ points
the first for loop goes through the list and for every point:
picks out a patch of the point cloud just bigger than the neighbourhood specified
uses shapely to place a 3 metre buffer around the point
finds the intersection of the buffer and the whole point cloud
extracts the set of points from that operation in another for loop
finding the median and appending it to a list of new Z values
returning the list of new Z values
For 10^4 points, I get a result in 11 seconds. For 10^5 points 3 minutes, and most of my datasets run into 2- 5 * 10^6 points. On a 2 * 10^6 point cloud it's been running overnight.
What I want is a faster/more efficient method!
I've been tinkering with python-pcl, which is fast for filtering point clouds, but I don't know how to return indices of points which pass/fail pcl-python filters. I need those indices because each point has other attributes which must remain attached to it.
If anyone can suggest a more efficient method, please do so - I would highly appreciate your help. If it can't go faster and this code is helpful, feel free to use it.
Thanks!

After some good advice, I tried this:
#import numpy and scikit-learn's neighbours modulw
import numpy as np
from sklearn import neighbors as nb
#make a little time ticker
from datetime import datetime
startTime = datetime.now()
# generate a KDTree object. This takes ~95% of the
# processing time
tree = nb.KDTree(xyzi[:,0:3], leaf_size=60)
# how long did tree generation take
print(datetime.now() - startTime)
#initialise a list
new_z = []
#for each point, collect neighbours within radius r
nhoods = tree.query_radius(xyzi[:,0:3], r=3)
# iterate through the list of neighbourhoods,
# find the median height, and add it to the output list
for point in nhoods:
new_z.append(np.median(xyzi[point,2]))
# how long did it take?
print(datetime.now() - startTime)
This version took ~33 minutes for just under two million points. Acceptable, but still could be better.
Can the KDTree generation go faster using a %jit method?
IS there a better method than looping through all the neighbourhoods to find neighbourhood means? here, nhood is an array-of-arrays - I thought something like:
median = np.median(nhoods[:][:,2])
...but it didn't.
Thanks!

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.