Is there a better way to make a 3D density function?
import numpy as np

def make_spot_3d(bright, spread, x0, y0, z0):
    # Create the x, y and z indices
    x = np.linspace(-50, 50, 200)
    y = np.linspace(-50, 50, 200)
    z = np.linspace(-50, 50, 200)
    X, Y, Z = np.meshgrid(x, y, z)
    Intensity = np.uint16(bright*np.exp(-((X-x0)/spread)**2
                                        -((Y-y0)/spread)**2
                                        -((Z-z0)/spread)**2))
    return Intensity
The function generates a 3D numpy array which can be plotted with mayavi.
However, when the function is used to generate a cluster of spots (~100) as follows:
Spots = np.asarray([make_spot_3d(100,2, *loc) for loc in locations])
cluster = np.sum(Spots, axis=0)
The execution time is around 1 minute (on an i5 CPU); I bet this could be faster.
An obvious improvement would be to use broadcasting to evaluate your intensity function over a 'sparse' mesh rather than a full meshgrid, e.g.:
X, Y, Z = np.meshgrid(x, y, z, sparse=True)
This reduces the runtime by about 4x on my machine:
%timeit make_spot_3d(1., 1., 0, 0, 0)
1 loops, best of 3: 1.56 s per loop
%timeit make_spot_3d_ogrid(1., 1., 0, 0, 0)
1 loops, best of 3: 359 ms per loop
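The exact sparse-mesh function timed above wasn't shown; for reference, a sketch of what make_spot_3d_ogrid might look like (np.ogrid, or meshgrid with sparse=True as above, both give open grids that broadcast):

def make_spot_3d_ogrid(bright, spread, x0, y0, z0):
    # Same as make_spot_3d, but with open (sparse) coordinate arrays;
    # broadcasting expands them only inside the exponential.
    x = np.linspace(-50, 50, 200)
    y = np.linspace(-50, 50, 200)
    z = np.linspace(-50, 50, 200)
    X, Y, Z = np.meshgrid(x, y, z, sparse=True)  # shapes (1, 200, 1), (200, 1, 1), (1, 1, 200)
    Intensity = np.uint16(bright*np.exp(-((X-x0)/spread)**2
                                        -((Y-y0)/spread)**2
                                        -((Z-z0)/spread)**2))
    return Intensity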
You can get rid of the overhead involved in the list comprehension by vectorizing the calculation over locations, spreads and brightnesses, e.g.:
def make_spots(bright, spread, x0, y0, z0):
    # Create the x, y and z indices
    x = np.linspace(-50, 50, 200)
    y = np.linspace(-50, 50, 200)
    z = np.linspace(-50, 50, 200)

    # this will broadcast out to an (nblobs, ny, nx, nz) array
    dx = x[None, None, :, None] - x0[:, None, None, None]
    dy = y[None, :, None, None] - y0[:, None, None, None]
    dz = z[None, None, None, :] - z0[:, None, None, None]

    spread = spread[:, None, None, None]
    bright = bright[:, None, None, None]

    # we can save time by performing the exponentiation over 2D arrays
    # before broadcasting out to 4D, since exp(a + b) == exp(a) * exp(b)
    s2 = spread * spread
    a = np.exp(-(dx * dx) / s2)
    b = np.exp(-(dy * dy) / s2)
    c = np.exp(-(dz * dz) / s2)

    intensity = bright * a * b * c

    return intensity.astype(np.uint16)
where bright, spread, x0, y0 and z0 are 1D vectors. This will generate an (nblobs, ny, nx, nz) array, which you could then sum over the first axis. Depending on how many blobs you are generating, and how large the grid is that you are evaluating them over, creating this intermediate array might become quite expensive in terms of memory.
Another option would be to initialize a single (ny, nx, nz) output array and compute the sum in-place:
def sum_spots_inplace(bright, spread, x0, y0, z0):
    # Create the x, y and z indices
    x = np.linspace(-50, 50, 200)
    y = np.linspace(-50, 50, 200)
    z = np.linspace(-50, 50, 200)

    dx = x[None, None, :, None] - x0[:, None, None, None]
    dy = y[None, :, None, None] - y0[:, None, None, None]
    dz = z[None, None, None, :] - z0[:, None, None, None]

    spread = spread[:, None, None, None]
    bright = bright[:, None, None, None]

    s2 = spread * spread
    a = np.exp(-(dx * dx) / s2)
    b = np.exp(-(dy * dy) / s2)
    c = np.exp(-(dz * dz) / s2)

    out = np.zeros((200, 200, 200), dtype=np.uint16)
    for ii in xrange(bright.shape[0]):
        out += bright[ii] * a[ii] * b[ii] * c[ii]

    return out
This will require less memory, but the potential downside is that it necessitates looping in Python.
To give you some idea of the relative performance:
def sum_spots_listcomp(bright, spread, x0, y0, z0):
    return np.sum([make_spot_3d(bright[ii], spread[ii], x0[ii], y0[ii], z0[ii])
                   for ii in xrange(len(bright))], axis=0)

def sum_spots_vec(bright, spread, x0, y0, z0):
    return make_spots(bright, spread, x0, y0, z0).sum(0)
# some fake data
bright = np.random.rand(10) * 100
spread = np.random.rand(10) * 100
x0 = (np.random.rand(10) - 0.5) * 50
y0 = (np.random.rand(10) - 0.5) * 50
z0 = (np.random.rand(10) - 0.5) * 50
%timeit sum_spots_listcomp(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 16.6 s per loop
%timeit sum_spots_vec(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 1.03 s per loop
%timeit sum_spots_inplace(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 330 ms per loop
Since you have an i5 processor and the spots are independent of each other, it would be nice to implement multithreading. You don't necessarily need multiple processes, since many NumPy operations release the GIL. The additional code can be quite simple:
from multiprocessing.dummy import Pool

if __name__ == '__main__':
    wrap = lambda pos: make_spot_3d(100, 2, *pos)
    cluster = sum(Pool().imap_unordered(wrap, positions))
Update
After some testing on my PC at work, I must admit that the code above is just too naive and inefficient. On 8 cores the speedup is only ~1.5x relative to the single-core performance.
I still think multithreading is a good idea, but its success depends very much on the implementation; one possible direction is sketched below.
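For example (a sketch only, not something I have benchmarked): give each worker thread a chunk of locations and let it accumulate its own partial volume, so the threads spend most of their time inside NumPy, which releases the GIL, instead of shipping many full 200x200x200 arrays back to the main thread. This assumes make_spot_3d and locations from the question are in scope.

from multiprocessing.dummy import Pool  # thread pool
import numpy as np

def render_chunk(chunk):
    # Sum all spots of one chunk into a single partial volume.
    partial = np.zeros((200, 200, 200), dtype=np.float64)
    for loc in chunk:
        partial += make_spot_3d(100, 2, *loc)
    return partial

if __name__ == '__main__':
    n_workers = 4
    chunks = np.array_split(np.asarray(locations), n_workers)
    pool = Pool(n_workers)
    cluster = sum(pool.map(render_chunk, chunks))
    pool.close()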
So, you're doing every operation in your expression 8 million (= 200*200*200) times. First of all, you can cut that down to 1 million (if the sphere happens to be in the center of your grid) by calculating only one eighth of it and mirroring that; mirroring doesn't come for free, but it is still far less expensive than exp. A sketch of that idea follows below.
Also, it's very probable that you should just stop calculating once the intensity has dropped to 0. With a little logarithm magic, you can come up with a region of interest that may be much smaller than the 200*200*200 grid.
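A rough sketch of the mirroring idea (not benchmarked), assuming the spot sits at the center of the grid and the grid is symmetric about 0, so exp only has to be evaluated on one octant:

import numpy as np

def make_centered_spot_3d(bright, spread):
    x = np.linspace(-50, 50, 200)
    half = x[100:]                      # positive half-axis; x is symmetric about 0
    X, Y, Z = np.meshgrid(half, half, half, sparse=True, indexing='ij')
    # exp evaluated on 1/8 of the grid only
    octant = bright * np.exp(-((X/spread)**2 + (Y/spread)**2 + (Z/spread)**2))
    # mirror along each axis in turn to rebuild the full 200x200x200 volume
    full = np.concatenate([octant[::-1], octant], axis=0)
    full = np.concatenate([full[:, ::-1], full], axis=1)
    full = np.concatenate([full[:, :, ::-1], full], axis=2)
    return full.astype(np.uint16)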
Related
I have a set of 68 keypoints (size [68, 2]) that I am mapping to gaussian heatmaps. To do this, I have the following function:
def generate_gaussian(t, x, y, sigma=10):
    """
    Generates a 2D Gaussian point at location x,y in tensor t.

    x should be in range (-1, 1).
    sigma is the standard deviation of the generated 2D Gaussian.
    """
    h, w = t.shape

    # Heatmap pixel per output pixel
    mu_x = int(0.5 * (x + 1.) * w)
    mu_y = int(0.5 * (y + 1.) * h)

    tmp_size = sigma * 3

    # Top-left
    x1, y1 = int(mu_x - tmp_size), int(mu_y - tmp_size)

    # Bottom right
    x2, y2 = int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)
    if x1 >= w or y1 >= h or x2 < 0 or y2 < 0:
        return t

    size = 2 * tmp_size + 1
    tx = np.arange(0, size, 1, np.float32)
    ty = tx[:, np.newaxis]
    x0 = y0 = size // 2

    # The gaussian is not normalized, we want the center value to equal 1
    g = torch.tensor(np.exp(- ((tx - x0) ** 2 + (ty - y0) ** 2) / (2 * sigma ** 2)))

    # Determine the bounds of the source gaussian
    g_x_min, g_x_max = max(0, -x1), min(x2, w) - x1
    g_y_min, g_y_max = max(0, -y1), min(y2, h) - y1

    # Image range
    img_x_min, img_x_max = max(0, x1), min(x2, w)
    img_y_min, img_y_max = max(0, y1), min(y2, h)

    t[img_y_min:img_y_max, img_x_min:img_x_max] = \
        g[g_y_min:g_y_max, g_x_min:g_x_max]

    return t
def rescale(a, img_size):
    # scale tensor to [-1, 1]
    return 2 * a / img_size[0] - 1
My current code uses a for loop to compute the gaussian heatmap for each of the 68 keypoint coordinates, then stacks the resulting tensors to create a [68, H, W] tensor:
x_k1 = [generate_gaussian(torch.zeros(H, W), x, y) for x, y in rescale(kp1.numpy(), frame.shape)]
x_k1 = torch.stack(x_k1, dim=0)
However, this method is super slow. Is there some way that I can do this without a for loop?
Edit:
I tried @Cris Luengo's proposal to compute a 1D Gaussian:
def generate_gaussian1D(t, x, y, sigma=10):
    h, w = t.shape

    # Heatmap pixel per output pixel
    mu_x = int(0.5 * (x + 1.) * w)
    mu_y = int(0.5 * (y + 1.) * h)

    tmp_size = sigma * 3

    # Top-left
    x1, y1 = int(mu_x - tmp_size), int(mu_y - tmp_size)

    # Bottom right
    x2, y2 = int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)
    if x1 >= w or y1 >= h or x2 < 0 or y2 < 0:
        return t

    size = 2 * tmp_size + 1
    tx = np.arange(0, size, 1, np.float32)
    ty = tx[:, np.newaxis]
    x0 = y0 = size // 2

    g = torch.tensor(np.exp(-np.power(tx - mu_x, 2.) / (2 * np.power(sigma, 2.))))
    g = g * g[:, None]

    g_x_min, g_x_max = max(0, -x1), min(x2, w) - x1
    g_y_min, g_y_max = max(0, -y1), min(y2, h) - y1
    img_x_min, img_x_max = max(0, x1), min(x2, w)
    img_y_min, img_y_max = max(0, y1), min(y2, h)

    t[img_y_min:img_y_max, img_x_min:img_x_max] = \
        g[g_y_min:g_y_max, g_x_min:g_x_max]

    return t
but my output ends up being an incomplete gaussian.
I'm not sure what I'm doing wrong. Any help would be appreciated.
You generate an NxN array g with a Gaussian centered on its center pixel. N is computed such that it extends by 3*sigma from that center pixel. This is the fastest way to build such an array:
tmp_size = sigma * 3
tx = np.arange(1, tmp_size + 1, 1, np.float32)
g = np.exp(-(tx**2) / (2 * sigma**2))
g = np.concatenate((np.flip(g), [1], g))
g = g * g[:, None]
What we're doing here is compute half a 1D Gaussian. We don't even bother computing the value of the Gaussian for the middle pixel, which we know will be 1. We then build the full 1D Gaussian by flipping our half-Gaussian and concatenating. Finally, the 2D Gaussian is built by the outer product of the 1D Gaussian with itself.
We could shave a bit of extra time by building a quarter of the 2D Gaussian, then concatenating four rotated copies of it. But the difference in computational cost is not very large, and this is much simpler. Note that np.exp is by far the most expensive operation here, so just by minimizing how often we call it, we significantly reduce the computational cost.
However, the best way to speed up the complete code is to compute the array g only once, rather than anew for each key point. Note how your sigma doesn't change, so all the arrays g that are computed are identical. If you compute it only once, it no longer matters which method you use to compute it, since this will be a minimal portion of the total program anyway.
You could, for example, have a global variable _gaussian to hold your array, and have your function compute it only the first time it is called. Or you could separate your function into two functions, one that constructs this array, and one that copies it into an image, and call them as follows:
g = create_gaussian(sigma=3)
x_k1 = [
copy_gaussian(torch.zeros(H, W), x, y, g)
for x, y in rescale(kp1.numpy(), frame.shape)
]
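For completeness, a possible sketch of that split (create_gaussian and copy_gaussian are just the names used in the call above; the cropping logic is lifted from your generate_gaussian, and the crop size is derived from g itself so it stays consistent with whatever sigma was used to build it):

import numpy as np
import torch

def create_gaussian(sigma=10):
    # Build the (2*3*sigma + 1)-sized 2D Gaussian once, using the
    # half-Gaussian + outer-product trick shown earlier.
    tmp_size = sigma * 3
    tx = np.arange(1, tmp_size + 1, 1, np.float32)
    g = np.exp(-(tx**2) / (2 * sigma**2))
    g = np.concatenate((np.flip(g), [1], g))
    return torch.tensor(g * g[:, None])

def copy_gaussian(t, x, y, g):
    # Paste the precomputed gaussian g into tensor t at keypoint (x, y),
    # cropping it where it falls outside the image.
    h, w = t.shape
    mu_x = int(0.5 * (x + 1.) * w)
    mu_y = int(0.5 * (y + 1.) * h)
    tmp_size = (g.shape[0] - 1) // 2      # g is (2*tmp_size + 1) square
    x1, y1 = int(mu_x - tmp_size), int(mu_y - tmp_size)
    x2, y2 = int(mu_x + tmp_size + 1), int(mu_y + tmp_size + 1)
    if x1 >= w or y1 >= h or x2 < 0 or y2 < 0:
        return t
    g_x_min, g_x_max = max(0, -x1), min(x2, w) - x1
    g_y_min, g_y_max = max(0, -y1), min(y2, h) - y1
    img_x_min, img_x_max = max(0, x1), min(x2, w)
    img_y_min, img_y_max = max(0, y1), min(y2, h)
    t[img_y_min:img_y_max, img_x_min:img_x_max] = \
        g[g_y_min:g_y_max, g_x_min:g_x_max]
    return t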
On the other hand, you're likely best off using existing functionality. For example, DIPlib has a function dip.DrawBandlimitedPoint() [disclosure: I'm an author] that adds a Gaussian blob to an image. Likely you'll find similar functions in other libraries.
I want to find all the numbers that are not in a list and are equal to the following formula: x + x_max * y + x_max * y_max * z, where (x_max, y_max) are parameters and (x, y, z) can vary in a restricted set. I also want to know the values of (x, y, z) needed to obtain each number.
I have managed to do this with the following code, but it is currently extremely slow due to the three nested loops. I am looking for a faster solution (maybe using NumPy arrays?).
This is what my code currently does, extremely slowly (but it works):
import numpy as np

x_max, y_max, z_max = 180, 90, 90
numbers = np.random.randint(x_max * y_max * z_max, size=1000)

results = np.array([[0, 0, 0, 0]])  # Initializes the array
for x in range(0, x_max):
    for y in range(0, y_max):
        for z in range(0, z_max):
            result = x + y * x_max + z * x_max * y_max
            if result not in numbers:
                results = np.append(results, [[x, y, z, result]], axis=0)

# Initial line is useless
results = np.delete(results, (0), axis=0)
Your nested loop and the calculation:
for x in range(0, x_max):
    for y in range(0, y_max):
        for z in range(0, z_max):
            result = x + y * x_max + z * x_max * y_max
simply produce every integer from 0 up to (but not including) x_max * y_max * z_max, and each integer is produced exactly once.
That fact makes this a lot easier:
values = np.arange(x_max * y_max * z_max)
results = np.setdiff1d(values, numbers)
This will give you all the integers that have been calculated and are not in the numbers exclusion list.
Now you are only missing the input x, y and z values. These can be recovered from the actual results with some straightforward modulo arithmetic:
z = results // (x_max * y_max)
rem = results % (x_max * y_max)
y = rem // x_max
x = rem % x_max
Now you can stack it all nicely together:
results = np.array([x, y, z, results])
You can tweak your results array if needs be, e.g.:
results = results.T # simple transpose
results = np.sort(results, axis=1) # sort the inner list
I used the above to compare the output from this calculation with that of the triple-nested loop calculation. The results were indeed equal.
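Put together as a single function (just the pieces above; the function name is mine):

import numpy as np

def setdiff_solution(x_max, y_max, z_max, numbers):
    # All candidate results, minus the excluded ones
    values = np.arange(x_max * y_max * z_max)
    results = np.setdiff1d(values, numbers)
    # Recover the (x, y, z) that produced each remaining result
    z = results // (x_max * y_max)
    rem = results % (x_max * y_max)
    y = rem // x_max
    x = rem % x_max
    return np.array([x, y, z, results]).T   # one [x, y, z, result] row per entry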
OK, so first of all, the main reason your code is slow is not the nested looping, it's that you're using np.append: this allocates an entirely new array on every call and should almost never be used in a loop. You're much better off just using a Python list, for which appending is amortized O(1) (internally, the list over-allocates, so it only actually reallocates memory occasionally as it grows).
The following should run somewhere on the order of 100x faster.
import numpy as np

x_max, y_max, z_max = 180, 90, 90
numbers = np.random.randint(x_max * y_max * z_max, size=1000)

# results = np.array([[0, 0, 0, 0]])  # Initializes the array  <-- Unnecessary
results = []  # <-- Use a python list when you want to expand things

for x in range(0, x_max):
    print(f'x={x}')
    for y in range(0, y_max):
        for z in range(0, z_max):
            result = x + y * x_max + z * x_max * y_max
            if result not in numbers:
                # results = np.append(results, [[x, y, z, result]], axis=0)  # <-- np.append is very slow O(N)
                results.append([x, y, z, result])  # <-- list.append is O(1)

results = np.array(results)

# Initial line is useless
# results = np.delete(results, (0), axis=0)  <-- Unnecessary without the unnecessary initialization.
... but, we can still get faster using numpy vectorization.
# ... See above code for computation of "results"
print(f'Found result with python loop and list in {time.time() - t_start:.3f}s')
t_start = time.time()
# Get the x, y, z indices that you'd normally get from a nested loop. They'll be in arrays of shape (x_max, y_max, z_max)
xs, ys, zs = np.meshgrid(np.arange(x_max), np.arange(y_max), np.arange(z_max), indexing='ij')
all_values = xs + ys * x_max + zs * x_max * y_max
valid_indices = ~np.isin(all_values, numbers) # Get a shape (x_max, y_max, z_max) boolean mask
# Now use the mask to filter each array (yielding a flat array of just the entries where the mask is True)
# ... then reshape it into a column of shape (n_valid, 1)
# ... so it can be stacked horizontally (h-stack) with the others along the second axis.
results_vectorized = np.hstack([v[valid_indices].reshape(-1, 1) for v in (xs, ys, zs, all_values)])
assert np.array_equal(results_vectorized, results)
print(f'Found result in {time.time() - t_start:.3f}s')
This is around 20x faster than the previous:
Found result with python loop and list in 3.630s
Found result in 0.154s
Speeding things up with numpy is always the same process:
Construct a big array containing all the information
Do computations on the whole array at the same time
Here I use np.fromfunction to construct the big array from the formula you gave.
So here is my ~60x speedup solution:
import numpy as np
from functools import partial

def formula(x, y, z, x_max, y_max, z_max):
    return x + y * x_max + z * x_max * y_max

def my_solution(x_max, y_max, z_max, seen):
    seen = np.unique(seen)
    results = np.fromfunction(
        partial(formula, x_max=x_max, y_max=y_max, z_max=z_max),
        shape=(x_max, y_max, z_max),
    )
    mask_not_seen = ~np.isin(results, seen)
    results_not_seen = results[mask_not_seen]
    indices_not_seen = np.where(mask_not_seen)
    return np.stack([*indices_not_seen, results_not_seen], axis=-1)
Let's check that it outputs the same as your solution:
x_max, y_max, z_max = 18, 9, 9
seen = np.random.randint(x_max * y_max * z_max, size=100)
op_out = op_solution(x_max, y_max, z_max, seen)
my_out = my_solution(x_max, y_max, z_max, seen)
assert np.all(op_out == my_out)
and that it is indeed quicker (~60x):
...: %timeit op_solution(x_max, y_max, z_max, seen)
...: %timeit my_solution(x_max, y_max, z_max, seen)
9.74 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
161 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Finally with the values you gave:
...: seen = np.random.randint(x_max * y_max * z_max, size=1000)
...: %timeit my_solution(x_max=180, y_max=90, z_max=90, seen=seen)
242 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I need to reconstruct a heightfield f(x,y) -> z from a vector field g(x,y) -> (a,b,c), according to a set of given equations, but I have an issue computing the gradient (I am using L_BFGS_B because I think it's the only optimizer that will let me avoid blowing up memory).
I am having the following conflict:
Images are 2d arrays, gradient is computed along each axis and needs two values per pixel
Heightfield is only 1 variable per pixel (the height)
Therefore, after computing the gradient, scipy complains that the function cannot return an array of shape (img_x * img_y, 2) as the gradient: only a 1D array with exactly as many elements as the heightfield array is accepted.
Am I missing something obvious here?
Detailed problem
I define three variables nx, ny, nz:
nx = g(x,y).x
ny = g(x,y).y
nz = g(x,y).z
Now, for each point (x,y) of the grid, I need to compute the height of the point (f(x,y)), according to the following equations:
nz * [f(x+1, y) - f(x,y)] == nx
nz * [f(x, y+1) - f(x,y)] == ny
I have tried expressing this in a loss function:
class Eval():
    def __init__(self, g):
        '''
        g: (x, y, 3)
        '''
        self.g = g

    def loss(self, x):
        depth = x.reshape(self.g.shape[:2])
        x_roll = np.roll(depth, -1, axis=0)
        y_roll = np.roll(depth, -1, axis=1)
        dx = depth - x_roll
        dy = depth - y_roll
        nx = self.g[:,:,0]
        ny = self.g[:,:,1]
        nz = self.g[:,:,2]
        loss_x = nz * dx - nx
        loss_y = nz * dy - ny
        self.error_loss = np.stack([loss_x, loss_y], axis=-1)
        total_loss = norm(self.error_loss, axis=-1)
        return np.sum(total_loss)

    def grads(self, x):
        x_roll = np.roll(self.error_loss[:,:,0], -1, axis=0)
        y_roll = np.roll(self.error_loss[:,:,1], -1, axis=1)
        dx = self.error_loss[:,:,0] - x_roll
        dy = self.error_loss[:,:,1] - y_roll
        g_xy = np.stack([dx, dy], axis=-1)
        # g_xy has shape (x, y, 2)
        # BUT THIS MUST RETURN (x * y), not (x * y, 2) nor (x * y * 2)
        # WHAT SHOULD I RETURN HERE ? ||g_xy|| ?
And call it like this
vector_field = ... # shape: 1024, 1024, 3
x0 = np.random.uniform(size=vector_field.shape[:2])
ev = Eval(vector_field)
result = optim.fmin_l_bfgs_b(ev.loss, x0, fprime=ev.grads)
Questions
Is there a way to express those equations so they can be solved another way? In theory they form an overconstrained linear system that could be solved with a least-squares approach, but I cannot find how to reformulate the equations in the form Ax = b.
Otherwise, what should I return for my gradient? Or am I even computing it properly?
I feel like there should be a quick way of speeding up this code. I think the answer is here, but I cannot seem to get my problem into that format. The underlying problem that I am attempting to solve is to find the point-wise differences in terms of the parallel and perpendicular components and create a 2D histogram of these differences.
out = np.zeros((len(rpbins)-1, len(pibins)-1))
tmp = np.zeros((len(x), 2))
for i in xrange(len(x)):
    tmp[:,0] = x - x[i]
    tmp[:,1] = y - y[i]
    para = np.sum(tmp**2, axis=-1)**(1./2)
    perp = np.abs(z - z[i])
    H, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
    out += H
Vectorizing things like this is tricky, because to get rid of a loop over n elements you have to construct an array of (n, n), so for large inputs you are likely to get a worse performance than with a Python loop. But it can be done:
mask = np.triu_indices(x.shape[0], 1)
para = np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2)
perp = np.abs(z[:, None] - z)
hist, _, _ = np.histogram2d(para[mask], perp[mask], bins=[rpbins, pibins])
The mask is to avoid counting each distance twice. I have also set the diagonal offset to 1 to avoid including the 0 distances of each point to itself in the histogram. But if you don't index para and perp with it, you get the exact same result as your code.
With this sample data:
items = 100
rpbins, pibins = np.linspace(0, 1, 3), np.linspace(0, 1, 3)
x = np.random.rand(items)
y = np.random.rand(items)
z = np.random.rand(items)
I get this for my hist and your out:
>>> hist
array([[ 1795., 651.],
[ 1632., 740.]])
>>> out
array([[ 3690., 1302.],
[ 3264., 1480.]])
and out[i, j] = 2 * hist[i, j] except for i = j = 0, where out[0, 0] = 2 * hist[0, 0] + items because of the 0 distance of each item to itself.
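In code, that relation reads (a small sketch using the names above):

out_from_hist = 2 * hist
out_from_hist[0, 0] += items   # the `items` zero self-distances all land in the first bin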
EDIT: Tried the following after tcaswell's comment:
items = 1000
rpbins, pibins = np.linspace(0, 1, 3), np.linspace(0, 1, 3)
x, y, z = np.random.rand(3, items)
def hist1(x, y, z, rpbins, pibins):
    mask = np.triu_indices(x.shape[0], 1)
    para = np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2)
    perp = np.abs(z[:, None] - z)
    hist, _, _ = np.histogram2d(para[mask], perp[mask], bins=[rpbins, pibins])
    return hist

def hist2(x, y, z, rpbins, pibins):
    mask = np.triu_indices(x.shape[0], 1)
    para = np.sqrt((x[:, None] - x)[mask]**2 + (y[:, None] - y)[mask]**2)
    perp = np.abs((z[:, None] - z)[mask])
    hist, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
    return hist

def hist3(x, y, z, rpbins, pibins):
    mask = np.triu_indices(x.shape[0], 1)
    para = np.sqrt(((x[:, None] - x)**2 + (y[:, None] - y)**2)[mask])
    perp = np.abs((z[:, None] - z)[mask])
    hist, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
    return hist
In [10]: %timeit -n1 -r10 hist1(x, y, z, rpbins, pibins)
1 loops, best of 10: 289 ms per loop
In [11]: %timeit -n1 -r10 hist2(x, y, z, rpbins, pibins)
1 loops, best of 10: 294 ms per loop
In [12]: %timeit -n1 -r10 hist3(x, y, z, rpbins, pibins)
1 loops, best of 10: 278 ms per loop
It seems that most of the time is spent instantiating new arrays, not doing actual computations, so while there is some efficiency to scrape off, there really isn't much.
I have a 3D array that I need to interpolate over one axis (the last dimension). Let's say y.shape = (nx, ny, nz); I want to interpolate in nz for every (nx, ny). However, I want to interpolate to a different value for each [i, j].
Here's some code to illustrate. If I wanted to interpolate to a single value, say new_z, I'd use scipy.interpolate.interp1d like this:
# y is a 3D ndarray
# x is a 1D ndarray with the abscissa values
# new_z is a number
f = scipy.interpolate.interp1d(x, y, axis=-1, kind='linear')
result = f(new_z)
However, for this problem what I actually want is to interpolate to a different new_z for each y[i, j]. So I do this:
# y is a 3D ndarray
# x is a 1D ndarray with the abscissa values
# new_z is a 2D array

result = numpy.empty(y.shape[:-1])
for i in range(nx):
    for j in range(ny):
        f = scipy.interpolate.interp1d(x, y[i, j], axis=-1, kind='linear')
        result[i, j] = f(new_z[i, j])
Unfortunately, with multiple loops this becomes inefficient and slow. Is there a better way to do this kind of interpolation? Linear interpolation is sufficient. A possibility is to implement this in Cython, but I was trying to avoid that because I want to have the flexibility of changing to cubic interpolation and don't want to do it by hand in Cython.
To speed up high-order interpolation, you can call interp1d() only once and then use its _spline attribute and the low-level function _bspleval() from the _fitpack module (note that these are private SciPy internals, so they may change or disappear between versions). Here is the code:
from scipy.interpolate import interp1d
import numpy as np
nx, ny, nz = 30, 40, 50
x = np.arange(0, nz, 1.0)
y = np.random.randn(nx, ny, nz)
new_x = np.random.random_integers(1, (nz-1)*10, size=(nx, ny))/10.0
def original_interpolation(x, y, new_x):
    result = np.empty(y.shape[:-1])
    for i in xrange(nx):
        for j in xrange(ny):
            f = interp1d(x, y[i, j], axis=-1, kind=3)
            result[i, j] = f(new_x[i, j])
    return result

def fast_interpolation(x, y, new_x):
    from scipy.interpolate._fitpack import _bspleval
    f = interp1d(x, y, axis=-1, kind=3)
    xj, cvals, k = f._spline
    result = np.empty_like(new_x)
    for (i, j), value in np.ndenumerate(new_x):
        result[i, j] = _bspleval(value, x, cvals[:, i, j], k, 0)
    return result
r1 = original_interpolation(x, y, new_x)
r2 = fast_interpolation(x, y, new_x)
>>> np.allclose(r1, r2)
True
%timeit original_interpolation(x, y, new_x)
%timeit fast_interpolation(x, y, new_x)
1 loops, best of 3: 3.78 s per loop
100 loops, best of 3: 15.4 ms per loop
I don't think interp1d has a method for doing this fast, so you can't avoid the loop here.
Cython you can probably still avoid by coding up the linear interpolation using np.searchsorted, something like this (not tested):
def interp3d(x, y, new_x):
    assert x.ndim == 1 and y.ndim == 3 and new_x.ndim == 2
    assert y.shape[:2] == new_x.shape and x.shape == y.shape[2:]

    nx, ny = y.shape[:2]
    new_x = new_x.ravel()
    j = np.arange(len(new_x))
    k = np.searchsorted(x, new_x).clip(1, len(x) - 1)
    y = y.reshape(-1, x.shape[0])
    p = (new_x - x[k-1]) / (x[k] - x[k-1])
    result = (1 - p) * y[j,k-1] + p * y[j,k]
    return result.reshape(nx, ny)
Doesn't help with cubic interpolation, though.
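A quick usage sketch with the shapes from the question (the values here are made up):

import numpy as np

nx, ny, nz = 30, 40, 50
x = np.arange(nz, dtype=float)                       # abscissa, monotonically increasing
y = np.random.randn(nx, ny, nz)
new_z = np.random.uniform(0, nz - 1, size=(nx, ny))  # a different target per (i, j)
result = interp3d(x, y, new_z)                       # shape (nx, ny)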
EDIT: made it a function and fixed off-by-one errors. Some timings vs. Cython (500x500x500 grid):
In [58]: %timeit interp3d(x, y, new_x)
10 loops, best of 3: 82.7 ms per loop
In [59]: %timeit cyfile.interp3d(x, y, new_x)
10 loops, best of 3: 86.3 ms per loop
In [60]: abs(interp3d(x, y, new_x) - cyfile.interp3d(x, y, new_x)).max()
Out[60]: 2.2204460492503131e-16
Though, one can argue that the Cython code is easier to read.
As the numpy suggestion above was taking too long, I couldn't wait, so here's the cython version for future reference. From some loose benchmarks it is about 3000 times faster (granted, it only does linear interpolation and doesn't do as much as interp1d, but it's OK for this purpose).
import numpy as N
cimport numpy as N
cimport cython

DTYPEf = N.float64
ctypedef N.float64_t DTYPEf_t

@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative-index wrapping for entire function
cpdef interp3d(N.ndarray[DTYPEf_t, ndim=1] x, N.ndarray[DTYPEf_t, ndim=3] y,
               N.ndarray[DTYPEf_t, ndim=2] new_x):
    """
    interp3d(x, y, new_x)

    Performs linear interpolation over the last dimension of a 3D array,
    according to new values from a 2D array new_x. Thus, interpolate
    y[i, j, :] for new_x[i, j].

    Parameters
    ----------
    x : 1-D ndarray (double type)
        Array containing the x (abscissa) values. Must be monotonically
        increasing.
    y : 3-D ndarray (double type)
        Array containing the y values to interpolate.
    x_new: 2-D ndarray (double type)
        Array with new abscissas to interpolate.

    Returns
    -------
    new_y : 2-D ndarray
        Interpolated values.
    """
    cdef int nx = y.shape[0]
    cdef int ny = y.shape[1]
    cdef int nz = y.shape[2]
    cdef int i, j, k
    cdef N.ndarray[DTYPEf_t, ndim=2] new_y = N.zeros((nx, ny), dtype=DTYPEf)

    for i in range(nx):
        for j in range(ny):
            for k in range(1, nz):
                if x[k] > new_x[i, j]:
                    new_y[i, j] = (y[i, j, k] - y[i, j, k - 1]) * \
                        (new_x[i, j] - x[k-1]) / (x[k] - x[k - 1]) + y[i, j, k - 1]
                    break
    return new_y
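To actually use it, the .pyx file has to be compiled first. One way (a sketch, assuming the code above is saved as interp3d_cy.pyx next to the calling script) is pyximport:

import numpy as np
import pyximport
pyximport.install(setup_args={'include_dirs': np.get_include()})  # headers needed for `cimport numpy`

from interp3d_cy import interp3d
new_y = interp3d(x, y, new_x)  # x: (nz,), y: (nx, ny, nz), new_x: (nx, ny), all float64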
Building on @pv.'s answer, and vectorising the inner loop, the following gives a substantial speedup (EDIT: changed the expensive numpy.tile to using numpy.lib.stride_tricks.as_strided):
import numpy
from scipy import interpolate
nx = 30
ny = 40
nz = 50
y = numpy.random.randn(nx, ny, nz)
x = numpy.float64(numpy.arange(0, nz))
# We select some locations in the range [0.1, nz-0.1]
new_z = numpy.random.random_integers(1, (nz-1)*10, size=(nx, ny))/10.0
# y is a 3D ndarray
# x is a 1D ndarray with the abcissa values
# new_z is a 2D array
def original_interpolation():
    result = numpy.empty(y.shape[:-1])
    for i in range(nx):
        for j in range(ny):
            f = interpolate.interp1d(x, y[i, j], axis=-1, kind='linear')
            result[i, j] = f(new_z[i, j])
    return result

grid_x, grid_y = numpy.mgrid[0:nx, 0:ny]

def faster_interpolation():
    flat_new_z = new_z.ravel()
    k = numpy.searchsorted(x, flat_new_z)
    k = k.reshape(nx, ny)

    lower_index = [grid_x, grid_y, k-1]
    upper_index = [grid_x, grid_y, k]

    tiled_x = numpy.lib.stride_tricks.as_strided(x, shape=(nx, ny, nz),
                                                 strides=(0, 0, x.itemsize))

    z_upper = tiled_x[upper_index]
    z_lower = tiled_x[lower_index]
    z_step = z_upper - z_lower
    z_delta = new_z - z_lower

    y_lower = y[lower_index]
    result = y_lower + z_delta * (y[upper_index] - y_lower)/z_step

    return result

# both should be the same (giving a small difference)
print numpy.max(
    numpy.abs(original_interpolation() - faster_interpolation()))
That gives the following times on my machine:
In [8]: timeit foo.original_interpolation()
10 loops, best of 3: 102 ms per loop
In [9]: timeit foo.faster_interpolation()
1000 loops, best of 3: 564 us per loop
Going to nx = 300, ny = 300 and nz = 500, gives a 130x speedup:
In [2]: timeit original_interpolation()
1 loops, best of 3: 8.27 s per loop
In [3]: timeit faster_interpolation()
10 loops, best of 3: 60.1 ms per loop
You'd need to write your own algorithm for cubic interpolation, but it shouldn't be so hard.
You could use map_coordinates for that:
from numpy import random, meshgrid, arange
from scipy.ndimage import map_coordinates
(nx, ny, nz) = (4, 5, 6)
# some random array
A = random.rand(nx, ny, nz)
# random floating-point indices in [0, nz-1]
Z = random.rand(nx, ny)*(nz-1)
# regular integer indices of shape (nx,ny)
X, Y = meshgrid(arange(nx), arange(ny), indexing='ij')
coords = (X, Y, Z) # X, Y, and Z are of shape (nx, ny)
print map_coordinates(A, coords, order=1, cval=-999.)
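If you later want the cubic interpolation mentioned in the question, map_coordinates accepts spline orders 0 through 5, so the same call with order=3 gives a cubic spline (its boundary handling may differ slightly from interp1d's cubic); a sketch reusing A and coords from above:

cubic = map_coordinates(A, coords, order=3, cval=-999.)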
Although there are several nice answers,
they're still doing 250k interpolations in a fixed 500-long array:
j250k = np.searchsorted( X500, X250k ) # indices in [0, 500)
This can be sped up with a LUT (lookup table) with, say, 5k slots:
lut = np.interp( np.arange(5000), X500, np.arange(500) ).round().astype(int)
xscale = (X - X.min()) * (5000 - 1) \
         / (X.max() - X.min())
j = lut.take( xscale.astype(int), mode="clip" ) # take(floats) in numpy 1.7 ?
#---------------------------------------------------------------------------
# X | | | | |
# j 0 1 2 3 4 ...
# LUT |....|.......|.|.............|.... -> int j (+ offset in [0, 1) )
#---------------------------------------------------------------------------
searchsorted is pretty fast, about log2(500) ≈ 9 comparisons per query,
so this is probably not much faster.
But LUT lookups are very fast in C, a simple speed / memory tradeoff.
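A self-contained sketch of the idea (X500 / X250k follow the names above; the slot count and the scaling of X500 into slot coordinates are my own filling-in):

import numpy as np

X500 = np.sort(np.random.rand(500))                    # the fixed 500-long abscissa
X250k = np.random.uniform(X500[0], X500[-1], 250000)   # the query points
nslots = 5000

def to_slots(v):
    # map values into [0, nslots - 1] "slot" coordinates
    return (v - X500.min()) * (nslots - 1) / (X500.max() - X500.min())

# lut[s] = approximate index into X500 for a query landing in slot s
lut = np.interp(np.arange(nslots), to_slots(X500), np.arange(500)).round().astype(int)

j_lut = lut.take(to_slots(X250k).astype(int), mode="clip")  # table lookup
j_exact = np.searchsorted(X500, X250k)                      # exact, for comparison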