Simulate speakers around a sphere using superposition - speed improvments needed

Simulate speakers around a sphere using superposition - speed improvments needed - python

Note: Drastic speed improvements since posting, see edits at bottom.
I have some working code by it over utilizes loops and I'm pretty sure there should be a faster way of doing it. The size of the output array ends up being pretty large and so when I try to make other arrays the same size of the output, I run out of memory rather quickly.
I am simulating many speakers placed around a sphere all pointing toward the center. I have a simulation of a single speaker and I would like to leverage this single simulation by using the principle of superposition. Basically I want to sum up rotated copies of the single transducer simulation to get my final result.
I have an axisymmetric simulation of acoustic pressure data in cylindrical coordinates ("polar_coord_r", "polar_coord_z"). The pressure field from the simulation is unique at each R and Z value and completely described by a 2D array ("P_real_RZ").
I want to sum together multiple, rotated copies of the this pressure field onto a 3D Cartesian output array. Each copy is rotated to a different location on the sphere. Currently, I am specifying the rotation with an x,y,z point because it allows me to do vector math (spherical coordinates wouldn't allow me to do this as elegantly). The output array is rather large (770 × 770 × 804).
I have some working code to get the output from a single copy of the speaker ("transducer"). It takes about 12 seconds for each slice so it would take over two hours to add each new speaker!! I want to have a dozen or so copies of the speaker so this will take way to long.
The code takes a slice with constant X and computes the R and Z positions at each location in the that slice. "r_distance" is a 2D array containing the radial distance from a line passing between the origin and a point ("point"). Similarlity, "z_distance" is a 2D array containing the distance along that same line.
To get the pressure for the slice, I find the indices of the closest matching "polar_coord_r" and "polar_coord_z" to the computed R distances and Z distances. I use these indices to find what value of pressure (from P_real_RZ) to place at each value in the output.
Some definitions:
xx, yy, and zz are 1D arrays of describing the slices through the output volume
XXX, YYY, and ZZZ are 3D arrays produced by numpy.meshgrid
point is a point which defines the direction that the speaker is rotated. Basically it's just a position vector of the speakers center.
P_real_RZ is a 2D array which specifies the real pressure at each unique R and Z value.
polar_coord_r and polar_coord_z are 1D arrays which define the unique values of R and Z on which P_real_RZ is defined.
current_transducer (only one so far represented in this code) is the pressure values computer for the current point.
output is the result from summing many speakers/transducers together.
Any suggestions to speed up this code is greatly appreciated.
Working loop:
for i, x in enumerate(xx):
# Creates a unit vector from origin to a point
vector = normalize(point)
# Create a slice of the cartesian space with constant X
xyz_slice = np.array([x*np.ones_like(XXX[i,:,:]), YYY[i,:,:], ZZZ[i,:,:]])
# Projects the position vector of each point of the slice onto the unit vector.
projection = np.array(list(map(np.dot, xyz_slice, vector )))
# Normalizes the projection which results in the Z distance along the line passing through the point
#z_distance = np.apply_along_axis(np.linalg.norm, 0, projection) # this is the slow bit
z_distance = np.linalg.norm(projection, axis=0) # I'm an idiot
# Uses vector math to determine the distance from the line
# Each point in the XYZ slice is the sum of vector along the line and the vector away from the line (radial vector).
# By extension the position of the xyz point minus the projection of the point against the unit vector, results in the radial vector
# Norm the radial vector to get the R value for everywhere in the slice
#r_distance = np.apply_along_axis(np.linalg.norm, 0, xyz_slice - projection) # this is the slow bit
r_distance = np.linalg.norm(xyz_slice - projection, axis=0) # I'm an idiot
# Map the pressure data to each point in the slice using the R and Z distance with the RZ pressure slice.
# look for a more efficient way to do this perhaps. currently takes about 12 seconds per slice
r_indices = r_map_v(r_distance) # 1.3 seconds by itself
z_indices = z_map_v(z_distance)
r_indices[r_indices>384] = 384 # find and remove indicies above the max for r_distance
z_indices[r_indices>803] = 803
current_transducer[i,:,:] = P_real_RZ[z_indices, r_indices]
# Sum the mapped pressure data into the output.
output += current_transducer
I have also tried to work with the simulation data in the form of a 3D Cartesian array. That is the pressure data from the simulation for all x, y, and z values the same size as the output.I can rotate this 3D array in one direction (not two rotations needed for speakers arranged on a sphere). This takes up waaaay too much memory and is still painfully slow. I end up getting memory errors with this approach.
Edit: I found a slightly simpler way to do it but it is still slow. I've updated the code above so that there are no longer nested loops.
I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis(). I'm afraid I might have to rethink how I do this completely.
Final Edit: I initially had a nested loop which I assumed to be the issue. I don't know what made me think I needed to use apply_along_axis with linalg.norm. In any case that was the issue.

I haven't looked for all the ways that you could optimize this code, but this issue jumped out at me: "I ran a line profiler and the slowest lines by far were the two containing np.apply_along_axis()." np.linalg.norm accepts an axis argument. You can replace the line
z_distance = np.apply_along_axis(np.linalg.norm, 0, projection)
with
z_distance = np.linalg.norm(projection, axis=0)
(and likewise for the other use of np.apply_along_axis and np.linalg.norm).
That should improve the performance a bit.

Related

Python - How to resample a 2D shape?

I am writing a python script for some geometrical data manipulation (calculating motion trajectories for a multi-drive industrial machine). Generally, the idea is that there is a given shape (let's say - an ellipse, but it general case it can be any convex shape, defined with a series of 2D points), which is rotated and it's uppermost tangent point must be followed. I don't have a problem with the latter part but I need a little hint with the 2D shape preparation.
Let's say that the ellipse was defined with too little points, for example - 25. (As I said, ultimately this can be any shape, for example a rounded hexagon). To maintain necessary precision I need far more points (let's say - 1000), preferably equally distributed over whole shape or with higher density of points near corners, sharp curves, etc.
I have a few things ringing in my head, I guess that DFT (FFT) would be a good starting point for this resampling, analyzing the scipy.signal.resample() I have found out that there are far more functions in the scipy.signal package which sound promising to me...
What I'm asking for is a suggestion which way I should follow, what tool I should try for this job, which may be the most suitable. Maybe there is a tool meant exactly for what I'm looking for or maybe I'm overthinking this and one of the implementations of FFT like resample() will work just fine (of course, after some adjustments at the starting and ending point of the shape to make sure it's closing without issues)?
Scipy.signal sounds promising, however, as far as I understand, it is meant to work with time series data, not geometrical data - I guess this may cause some problems as my data isn't a function (in a mathematical understanding).
Thanks and best regards!

As far as I understood, what you want is to get an interpolated version of your original data.
The DFT (or FFT) will not achieve this purpose, since it will perform an Fourier Transform (which is not what you want).
Talking theoretically, what you need to interpolate your data is to define a function to calculate the result in the new-data-points.
So, let's say your data contains 5 points, in which one you have a 1D (to simplify) number stored, representing your data, and you want a new array with 10 points, filled with the linear-interpolation of your original data.
Using numpy.interp:
import numpy as np
original_data = [2, 0, 3, 5, 1] # define your data in 1D
new_data_resolution = 0.5 # define new sampling distance (i.e, your x-axis resolution)
interp_data = np.interp(
x = np.arange(0, 5-1+new_data_resolution , new_data_resolution), # new sampling points (new axis)
xp = range(original_data),
fp = original_data
)
# now interp_data contains (5-1) / 0.5 + 1 = 9 points
After this, you will have a (5-1) / new_resolution (which is greater than 5, since new_resolution < 1)-length data, which values will be (in this case) a linear interpolation of your original data.
After you have achieved/understood this example, you can dive in the scipy.interpolate module to get a better understanding in the interpolation functions (my example uses a linear function to get the data in the missing points).
Applying this to n-D dimensional arrays is straight-forward, iterating over each dimension of your data.

nD "cube" from ranges

I have a mixed integer problem. I need to minimize a function, which is a weighted least square regression, the weights being dependent on the regression (iteratively reweighted least square). 7 parameters define my piecewise regression. I need to find a local minima around a first guess.
I tried to write the problem in gekko, but I somehow find it very difficult to implement. After many tries, I stopped at "negative DOF".
Anyway, I decided to brute force the problem. It works, but it's slow. I build a cube (itertools) around my working point in 7D and calculate the weighted square errors at each of the 3^7 points. I have boundaries for each dimension, and sometimes my working point is on one of the faces of my 7D domain. Technically, I have 2^p * 3^(7-p) points. I now have a list of all the values, find the minimum, move my working point there and restart building a cube, excluding all the points that I have already calculated in the previous loop steps.
Now I want to accelerate it by calculating the gradient at my working point and move faster (skip a step or two in my loop). np.gradient will require a 7d array in order to perform correctly.
Given a point, and 7 ranges around that point, how to make a 7D array in an efficient way? How to make an image of this array with the values of my function?
Please don't say 7 for loops.

Regardless of whether your function is vectorized, you can use an approach with np.indices like this:
base_grid = np.indices(7 * (3,), sparse=False) - 1
This produces an array of all the combinations of -1, 0, 1 that you need. np.meshgrid does something similar, but the arrays will be separated into a tuple, which is inconvenient.
At each iteration, you modify the grid with your step (scale) and offset:
current_grid = base_grid * scale + offset
If your function is vectorized, you call it directly, the grid is 7 3x3x3x3x3x3x3 arrays. If it accepts seven inputs, just use star expansion.
If your function is not vectorized, you can still step along the corresponding elements in a single loop, not seven loops, using np.nditer:
with np.nditer([current_grid, None],
op_axes=[list(range(1, current_grid.ndim)), None]) as it:
for x, y in it:
y[:] = f(*x)
j = it.operands[1]

How to get the K most distant points, given their coordinates?

We have boring CSV with 10000 rows of ages (float), titles (enum/int), scores (float), ....
We have N columns each with int/float values in a table.
You can imagine this as points in ND space
We want to pick K points that would have maximised distance between each other.
So if we have 100 points in a tightly packed cluster and one point in the distance we would get something like this for three points:
or this
For 4 points it will become more interesting and pick some point in the middle.
So how to select K most distant rows (points) from N (with any complexity)? It looks like an ND point cloud "triangulation" with a given resolution yet not for 3d points.
I search for a reasonably fast approach (approximate - no precise solution needed) for K=200 and N=100000 and ND=6 (probably multigrid or ANN on KDTree based, SOM or triangulation based..).. Does anyone know one?

From past experience with a pretty similar problem, a simple solution of computing the mean Euclidean distance of all pairs within each group of K points and then taking the largest mean, works very well. As someone noted above, it's probably hard to avoid a loop on all combinations (not on all pairs). So a possible implementation of all this can be as follows:
import itertools
import numpy as np
from scipy.spatial.distance import pdist
Npoints = 3 # or 4 or 5...
# making up some data:
data = np.matrix([[3,2,4,3,4],[23,25,30,21,27],[6,7,8,7,9],[5,5,6,6,7],[0,1,2,0,2],[3,9,1,6,5],[0,0,12,2,7]])
# finding row indices of all combinations:
c = [list(x) for x in itertools.combinations(range(len(data)), Npoints )]
distances = []
for i in c:
distances.append(np.mean(pdist(data[i,:]))) # pdist: a method of computing all pairwise Euclidean distances in a condensed way.
ind = distances.index(max(distances)) # finding the index of the max mean distance
rows = c[ind] # these are the points in question

I propose an approximate solution. The idea is to start from a set of K points chosen in a way I'll explain below, and repeatedly loop through these points replacing the current one with the point, among the N-K+1 points not belonging to the set but including the current one, that maximizes the sum of the distances from the points of the set. This procedure leads to a set of K points where the replacement of any single point would cause the sum of the distances among the points of the set to decrease.
To start the process we take the K points that are closest to the mean of all points. This way we have good chances that on the first loop the set of K points will be spread out close to its optimum. Subsequent iterations will make adjustments to the set of K points towards a maximum of the sum of distances, which for the current values of N, K and ND appears to be reachable in just a few seconds. In order to prevent excessive looping in edge cases, we limit the number of loops nonetheless.
We stop iterating when an iteration does not improve the total distance among the K points. Of course, this is a local maximum. Other local maxima will be reached for different initial conditions, or by allowing more than one replacement at a time, but I don't think it would be worthwhile.
The data must be adjusted in order for unit displacements in each dimension to have the same significance, i.e., in order for Euclidean distances to be meaningful. E.g., if your dimensions are salary and number of children, unadjusted, the algorithm will probably yield results concentrated in the extreme salary regions, ignoring that person with 10 kids. To get a more realistic output you could divide salary and number of children by their standard deviation, or by some other estimate that makes differences in salary comparable to differences in number of children.
To be able to plot the output for a random Gaussian distribution, I have set ND = 2 in the code, but setting ND = 6, as per your request, is no problem (except you cannot plot it).
import matplotlib.pyplot as plt
import numpy as np
import scipy.spatial as spatial
N, K, ND = 100000, 200, 2
MAX_LOOPS = 20
SIGMA, SEED = 40, 1234
rng = np.random.default_rng(seed=SEED)
means, variances = [0] * ND, [SIGMA**2] * ND
data = rng.multivariate_normal(means, np.diag(variances), N)
def distances(ndarray_0, ndarray_1):
if (ndarray_0.ndim, ndarray_1.ndim) not in ((1, 2), (2, 1)):
raise ValueError("bad ndarray dimensions combination")
return np.linalg.norm(ndarray_0 - ndarray_1, axis=1)
# start with the K points closest to the mean
# (the copy() is only to avoid a view into an otherwise unused array)
indices = np.argsort(distances(data, data.mean(0)))[:K].copy()
# distsums is, for all N points, the sum of the distances from the K points
distsums = spatial.distance.cdist(data, data[indices]).sum(1)
# but the K points themselves should not be considered
# (the trick is that -np.inf ± a finite quantity always yields -np.inf)
distsums[indices] = -np.inf
prev_sum = 0.0
for loop in range(MAX_LOOPS):
for i in range(K):
# remove this point from the K points
old_index = indices[i]
# calculate its sum of distances from the K points
distsums[old_index] = distances(data[indices], data[old_index]).sum()
# update the sums of distances of all points from the K-1 points
distsums -= distances(data, data[old_index])
# choose the point with the greatest sum of distances from the K-1 points
new_index = np.argmax(distsums)
# add it to the K points replacing the old_index
indices[i] = new_index
# don't consider it any more in distsums
distsums[new_index] = -np.inf
# update the sums of distances of all points from the K points
distsums += distances(data, data[new_index])
# sum all mutual distances of the K points
curr_sum = spatial.distance.pdist(data[indices]).sum()
# break if the sum hasn't changed
if curr_sum == prev_sum:
break
prev_sum = curr_sum
if ND == 2:
X, Y = data.T
marker_size = 4
plt.scatter(X, Y, s=marker_size)
plt.scatter(X[indices], Y[indices], s=marker_size)
plt.grid(True)
plt.gca().set_aspect('equal', adjustable='box')
plt.show()
Output:
Splitting the data into 3 equidistant Gaussian distributions the output is this:

Assuming that if you read your csv file with N (10000) rows and D dimension (or features) into a N*D martix X. You can calculate the distance between each point and store it in a distance matrix as follows:
import numpy as np
X = np.asarray(X) ### convert to numpy array
distance_matrix = np.zeros((X.shape[0],X.shape[0]))
for i in range(X.shape[0]):
for j in range(i+1,X.shape[0]):
## We compute triangle matrix and copy the rest. Distance from point A to point B and distance from point B to point A are the same.
distance_matrix[i][j]= np.linalg.norm(X[i]-X[j]) ## Here I am calculating Eucledian distance. Other distance measures can also be used.
#distance_matrix = distance_matrix + distance_matrix.T - np.diag(np.diag(distance_matrix)) ## This syntax can be used to get the lower triangle of distance matrix, which is not really required in your case.
K = 5 ## Number of points that you want to pick
indexes = np.unravel_index(np.argsort(distance_matrix.ravel())[-1*K:], distance_matrix.shape)
print(indexes)

Bottom Line Up Front: Dealing with multiple equally distant points and the Curse of Dimensionality are going to be larger problems than just finding the points. Spoiler alert: There's a surprise ending.
I think this an interesting question but I'm bewildered by some of the answers. I think this is, in part, due to the sketches provided. You've no doubt noticed the answers look similar -- 2d, with clusters -- even though you indicated a wider scope was needed. Because others will eventually see this, I'm going to step through my thinking a bit slowly so bear with me for the early part.
It makes sense to start with a simplified example to see if we can generalize a solution with data that's easy to grasp and a linear 2D model is easiest of the easy.
We don't need to calculate all the distances though. We just need the ones at the extremes. So we can then take the top and bottom few values:
right = lin_2_D.nlargest(8, ['x'])
left = lin_2_D.nsmallest(8, ['x'])
graph = sns.scatterplot(x="x", y="y", data=lin_2_D, color = 'gray', marker = '+', alpha = .4)
sns.scatterplot(x = right['x'], y = right['y'], color = 'red')
sns.scatterplot(x = left['x'], y = left['y'], color = 'green')
fig = graph.figure
fig.set_size_inches(8,3)
What we have so far: Of 100 points, we've eliminated the need to calculate the distance between 84 of them. Of what's left we can further drop this by ordering the results on one side and checking the distance against the others.
You can imagine a case where you have a couple of data points way off the trend line that could be captured by taking the greatest or least y values, and all that starts to look like Walter Tross's top diagram. Add in a couple of extra clusters and you get what looks his bottom diagram and it appears that we're sort of making the same point.
The problem with stopping here is the requirement you mentioned is that you need a solution that works for any number of dimensions.
The unfortunate part is that we run into four challenges:
Challenge 1: As you increase the dimensions you can run into a large number of cases where you have multiple solutions when seeking midpoints. So you're looking for k furthest points but have a large number of equally valid possible solutions and no way prioritizing them. Here are two super easy examples illustrate this:
A) Here we have just four points and in only two dimensions. You really can't get any easier than this, right? The distance from red to green is trivial. But try to find the next furthest point and you'll see both of the black points are equidistant from both the red and green points. Imagine you wanted the furthest six points using the first graphs, you might have 20 or more points that are all equidistant.
edit: I just noticed the red and green dots are at the edges of their circles rather than at the center, I'll update later but the point is the same.
B) This is super easy to imagine: Think of a D&D 4 sided die. Four points of data in a three-dimensional space, all equidistant so it's known as a triangle-based pyramid. If you're looking for the closest two points, which two? You have 4 choose 2 (aka, 6) combinations possible. Getting rid of valid solutions can be a bit of a problem because invariably you face questions such as "why did we get rid of these and not this one?"
Challenge 2: The Curse of Dimensionality. Nuff Said.
Challenge 3 Revenge of The Curse of Dimensionality Because you're looking for the most distant points, you have to x,y,z ... n coordinates for each point or you have to impute them. Now, your data set is much larger and slower.
Challenge 4 Because you're looking for the most distant points, dimension reduction techniques such as ridge and lasso are not going to be useful.
So, what to do about this?
Nothing.
Wait. What?!?
Not truly, exactly, and literally nothing. But nothing crazy. Instead, rely on a simple heuristic that is understandable and computationally easy. Paul C. Kainen puts it well:
Intuitively, when a situation is sufficiently complex or uncertain,
only the simplest methods are valid. Surprisingly, however,
common-sense heuristics based on these robustly applicable techniques
can yield results which are almost surely optimal.
In this case, you have not the Curse of Dimensionality but rather the Blessing of Dimensionality. It's true you have a lot of points and they'll scale linearly as you seek other equidistant points (k) but the total dimensional volume of space will increase to power of the dimensions. The k number of furthest points you're is insignificant to the total number of points. Hell, even k^2 becomes insignificant as the number of dimensions increase.
Now, if you had a low dimensionality, I would go with them as a solution (except the ones that are use nested for loops ... in NumPy or Pandas).
If I was in your position, I'd be thinking how I've got code in these other answers that I could use as a basis and maybe wonder why should I should trust this other than it lays out a framework on how to think through the topic. Certainly, there should be some math and maybe somebody important saying the same thing.
Let me reference to chapter 18 of Computer Intensive Methods in Control and Signal Processing and an expanded argument by analogy with some heavy(-ish) math. You can see from the above (the graph with the colored dots at the edges) that the center is removed, particularly if you followed the idea of removing the extreme y values. It's a though you put a balloon in a box. You could do this a sphere in a cube too. Raise that into multiple dimensions and you have a hypersphere in a hypercube. You can read more about that relationship here.
Finally, let's get to a heuristic:
Select the points that have the most max or min values per dimension. When/if you run out of them pick ones that are close to those values if there isn't one at the min/max. Essentially, you're choosing the corners of a box For a 2D graph you have four points, for a 3D you have the 8 corners of the box (2^3).
More accurately this would be a 4d or 5d (depending on how you might assign the marker shape and color) projected down to 3d. But you can easily see how this data cloud gives you the full range of dimensions.
Here is a quick check on learning; for purposes of ease, ignore the color/shape aspect: It's easy to graphically intuit that you have no problem with up to k points short of deciding what might be slightly closer. And you can see how you might need to randomize your selection if you have a k < 2D. And if you added another point you can see it (k +1) would be in a centroid. So here is the check: If you had more points, where would they be? I guess I have to put this at the bottom -- limitation of markdown.
So for a 6D data cloud, the values of k less than 64 (really 65 as we'll see in just a moment) points are pretty easy. But...
If you don't have a data cloud but instead have data that has a linear relationship, you'll 2^(D-1) points. So, for that linear 2D space, you have a line, for linear 3D space, you'd have a plane. Then a rhomboid, etc. This is true even if your shape is curved. Rather than do this graph myself, I'm using the one from an excellent post on by Inversion Labs on Best-fit Surfaces for 3D Data
If the number of points, k, is less than 2^D you need a process to decide what you don't use. Linear discriminant analysis should be on your shortlist. That said, you can probably satisfice the solution by randomly picking one.
For a single additional point (k = 1 + 2^D), you're looking for one that is as close to the center of the bounding space.
When k > 2^D, the possible solutions will scale not geometrically but factorially. That may not seem intuitive so let's go back to the two circles. For 2D you have just two points that could be a candidate for being equidistant. But if that were 3D space and rotate the points about the line, any point in what is now a ring would suffice as a solution for k. For a 3D example, they would be a sphere. Hyperspheres (n-spheres) from thereon. Again, 2^D scaling.
One last thing: You should seriously look at xarray if you're not already familiar with it.
Hope all this helps and I also hope you'll read through the links. It'll be worth the time.
*It would be the same shape, centrally located, with the vertices at the 1/3 mark. So like having 27 six-sided dice shaped like a giant cube. Each vertice (or point nearest it) would fix the solution. Your original k+1 would have to be relocated too. So you would select 2 of the 8 vertices. Final question: would it be worth calculating the distances of those points against each other (remember the diagonal is slightly longer than the edge) and then comparing them to the original 2^D points? Bluntly, no. Satifice the solution.

If you're interested in getting the most distant points you can take advantage of all of the methods that were developed for nearest neighbors, you just have to give a different "metric".
For example, using scikit-learn's nearest neighbors and distance metrics tools you can do something like this
import numpy as np
from sklearn.neighbors import BallTree
from sklearn.neighbors.dist_metrics import PyFuncDistance
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
def inverted_euclidean(x1, x2):
# You can speed this up using cython like scikit-learn does or numba
dist = np.sum((x1 - x2) ** 2)
# We invert the euclidean distance and set nearby points to the biggest possible
# positive float that isn't inf
inverted_dist = np.where(dist == 0, np.nextafter(np.inf, 0), 1 / dist)
return inverted_dist
# Make up some fake data
n_samples = 100000
n_features = 200
X, _ = make_blobs(n_samples=n_samples, centers=3, n_features=n_features, random_state=0)
# We exploit the BallTree algorithm to get the most distant points
ball_tree = BallTree(X, leaf_size=50, metric=PyFuncDistance(inverted_euclidean))
# Some made up query, you can also provide a stack of points to query against
test_point = np.zeros((1, n_features))
distance, distant_points_inds = ball_tree.query(X=test_point, k=10, return_distance=True)
distant_points = X[distant_points_inds[0]]
# We can try to visualize the query results
plt.plot(X[:, 0], X[:, 1], ".b", alpha=0.1)
plt.plot(test_point[:, 0], test_point[:, 1], "*r", markersize=9)
plt.plot(distant_points[:, 0], distant_points[:, 1], "sg", markersize=5, alpha=0.8)
plt.show()
Which will plot something like:
There are many points that you can improve on:
I implemented the inverted_euclidean distance function with numpy, but you can try to do what the folks of scikit-learn do with their distance functions and implement them in cython. You could also try to jit compile them with numba.
Maybe the euclidean distance isn't the metric you would like to use to find the furthest points, so you're free to implement your own or simply roll with what scikit-learn provides.
The nice thing about using the Ball Tree algorithm (or the KdTree algorithm) is that for each queried point you have to do log(N) comparisons to find the furthest point in the training set. Building the Ball Tree itself, I think also requires log(N) comparison, so in the end if you want to find the k furthest points for every point in the ball tree training set (X), it will have almost O(D N log(N)) complexity (where D is the number of features), which will increase up to O(D N^2) with the increasing k.

Finding n nearest data points to grid locations

I'm working on a problem where I have a large set (>4 million) of data points located in a three-dimensional space, each with a scalar function value. This is represented by four arrays: XD, YD, ZD, and FD. The tuple (XD[i], YD[i], ZD[i]) refers to the location of data point i, which has a value of FD[i].
I'd like to superimpose a rectilinear grid of, say, 100x100x100 points in the same space as my data. This grid is set up as follows.
[XGrid, YGrid, ZGrid] = np.mgrid[Xmin:Xmax:Xstep, Ymin:Ymax:Ystep, Zmin:Zmax:Zstep]
XG = XGrid[:,0,0]
YG = YGrid[0,:,0]
ZG = ZGrid[0,0,:]
XGrid is a 3D array of the x-value at each point in the grid. XG is a 1D array of the x-values going from Xmin to Xmax, separated by a distance of XStep.
I'd like to use an interpolation algorithm I have to find the value of the function at each grid point based on the data surrounding it. In this algorithm I require 20 data points closest (or at least close) to my grid point of interest. That is, for grid point (XG[i], YG[j], ZG[k]) I want to find the 20 closest data points.
The only way I can think of is to have one for loop that goes through each data point and a subsequent embedded for loop going through all (so many!) data points, calculating the Euclidean distance, and picking out the 20 closest ones.
for i in range(0,XG.shape):
for j in range(0,YG.shape):
for k in range(0,ZG.shape):
Distance = np.zeros([XD.shape])
for a in range(0,XD.shape):
Distance[a] = (XD[a] - XG[i])**2 + (YD[a] - YG[j])**2 + (ZD[a] - ZG[k])**2
B = np.zeros([20], int)
for a in range(0,20):
indx = np.argmin(Distance)
B[a] = indx
Distance[indx] = float(inf)
This would give me an array, B, of the indices of the data points closest to the grid point. I feel like this would take too long to go through each data point at each grid point.
I'm looking for any suggestions, such as how I might be able to organize the data points before calculating distances, which could cut down on computation time.

Have a look at a seemingly simmilar but 2D problem and see if you cannot improve with ideas from there.
From the top of my head, I'm thinking that you can sort the points according to their coordinates (three separate arrays). When you need the closest points to the [X, Y, Z] grid point you'll quickly locate points in those three arrays and start from there.

Also, you don't really need the euclidian distance, since you are only interested in relative distance, which can also be described as:
abs(deltaX) + abs(deltaY) + abs(deltaZ)
And save on the expensive power and square roots...

No need to iterate over your data points for each grid location: Your grid locations are inherently ordered, so just iterate over your data points once, and assign each data point to the eight grid locations that surround it. When you're done, some grid locations may have too few data points. Check the data points of adjacent grid locations. If you have plenty of data points to go around (it depends on how your data is distributed), you can already select the 20 closest neighbors during the initial pass.
Addendum: You may want to reconsider other parts of your algorithm as well. Your algorithm is a kind of piecewise-linear interpolation, and there are plenty of relatively simple improvements. Instead of dividing your space into evenly spaced cubes, consider allocating a number of center points and dynamically repositioning them until the average distance of data points from the nearest center point is minimized, like this:
Allocate each data point to its closest center point.
Reposition each center point to the coordinates that would minimize the average distance from "its" points (to the "centroid" of the data subset).
Some data points now have a different closest center point. Repeat steps 1. and 2. until you converge (or near enough).

Performing many means in numpy

Good Morning,
I am implimenting a Cressman filter for doing distance weighted averages in Numpy.. I use a Ball Tree implimentation (thanks to Jake VanderPlas) to return a list of locatations for each point in a request array.. the query array (q) is shape [n,3] and at each point has the x,y,z at point I want to do a weighted average of points stored in the tree.. the code wrapped around the tree returns points within a certain distance so I get an arrays of variable length arrays..
I use a where to find non-empty entries (ie positions where there were at least some points within the radius of influence) creating the isgood array...
I then loop over all query points to return the weighted average of the values self.z (note that this can either be dims=1 or dims=2 to allow multiple co-gridding)
so the thing that complilcates using map or other quicker methods is the nonuniformity of the lengths of the arrays within self.distances and self.locations... I am still fairly green to numpy/python but I can not think of a way to do this array wise (ie not reverting to loops)
self.locations, self.distances = self.tree.query_radius( q, r, return_distance=True)
t2=time()
if debug: print "Removing voids"
isgood=np.where( np.array([len(x) for x in self.locations])!=0)[0]
interpol = np.zeros( (len(self.locations),) + np.shape(self.z[0]) )
interpol.fill(np.nan)
for dist, ix, posn, roi in zip(self.distances[isgood], self.locations[isgood], isgood, r[isgood]):
interpol[isgood[jinterpol]] = np.average(self.z[ix], weights=(roi**2-dist**2) / (roi**2 + dist**2), axis=0)
jinterpol += 1
so... Any hints of how to speed up the loop?..
For a typical mapping as appied to mapping weather radar data from a range,azimuth,elevation grid to a cartesian grid where I have 240x240x34 points and 4 variables takes 99s to query the tree (written by Jake in C and cython.. this is the hard step as you need to search the data!) and 100 seconds to do the calculation... which in my opinon is slow?? where is my overhead? is np.mean efficient or as it is called millions of times is there a speedup to be gained here? would I gain by using float32 rather than the default64... or even scaling to ints (which would be very hard to avoid wrap around in the weighting... any hints gratefully recieved!

You can find a discussion about the relative merits of the Cressman scheme vs using a Gaussian weight function at:
http://www.flame.org/~cdoswell/publications/radar_oa_00.pdf
The key is to match the smoothing parameter to the data (I recommend using a value close to the average spacing between data points). Once you know the smoothing parameter, you can set an "influence radius" equal to the radius where the weight function falls to 0.01 (or whatever).
How important is speed? If you wish, rather than calling an exponential function to determine the weight, you can make up a discrete table of weights for some fixed number of radius increments, which speeds up the calculation considerably. Ideally, you should have data outside the grid boundaries that can be used in the mapping of the values surrounding the gridpoints (even on the boundary points of the grid). Note this is NOT a true interpolation scheme - it won't return the observed values at the data points exactly. Like the Cressman scheme, it's a low-pass filer.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.