How to vectorize this loop difference in numpy?

How to vectorize this loop difference in numpy? - python

I feel like there should be a quick way of speeding up this code. I think the answer is here, but I cannot seem to get my problem in that format. The underlying problem that I am attempting to solve is find the point wise difference in terms of the parallel and perpendicular components and create a 2d histogram of these differences.
out = np.zeros((len(rpbins)-1,len(pibins)-1))
tmp = np.zeros((len(x),2))
for i in xrange(len(x)):
tmp[:,0] = x - x[i]
tmp[:,1] = y - y[i]
para = np.sum(tmp**2,axis=-1)**(1./2)
perp = np.abs(z - z[i])
H, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
out += H

Vectorizing things like this is tricky, because to get rid of a loop over n elements you have to construct an array of (n, n), so for large inputs you are likely to get a worse performance than with a Python loop. But it can be done:
mask = np.triu_indices(x.shape[0], 1)
para = np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2)
perp = np.abs(z[:, None] - z)
hist, _, _ = np.histogram2d(para[mask], perp[mask], bins=[rpbins, pibins])
The mask is to avoid counting each distance twice. I have also set the diagonal offset to 1 to avoid including the 0 distances of each point to itself in the histogram. But if you don't index para and perp with it, you get the exact same result as your code.
With this sample data:
items = 100
rpbins, pibins = np.linspace(0, 1, 3), np.linspace(0, 1, 3)
x = np.random.rand(items)
y = np.random.rand(items)
z = np.random.rand(items)
I get this for my hist and your out:
>>> hist
array([[ 1795., 651.],
[ 1632., 740.]])
>>> out
array([[ 3690., 1302.],
[ 3264., 1480.]])
and out[i, j] = 2 * hist[i, j] except for i = j = 0, where out[0, 0] = 2 * hist[0, 0] + items because of the 0 distance of each item to itself.
EDIT Tried the following after tcaswell's comment:
items = 1000
rpbins, pibins = np.linspace(0, 1, 3), np.linspace(0, 1, 3)
x, y, z = np.random.rand(3, items)
def hist1(x, y, z, rpbins, pibins) :
mask = np.triu_indices(x.shape[0], 1)
para = np.sqrt((x[:, None] - x)**2 + (y[:, None] - y)**2)
perp = np.abs(z[:, None] - z)
hist, _, _ = np.histogram2d(para[mask], perp[mask], bins=[rpbins, pibins])
return hist
def hist2(x, y, z, rpbins, pibins) :
mask = np.triu_indices(x.shape[0], 1)
para = np.sqrt((x[:, None] - x)[mask]**2 + (y[:, None] - y)[mask]**2)
perp = np.abs((z[:, None] - z)[mask])
hist, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
return hist
def hist3(x, y, z, rpbins, pibins) :
mask = np.triu_indices(x.shape[0], 1)
para = np.sqrt(((x[:, None] - x)**2 + (y[:, None] - y)**2)[mask])
perp = np.abs((z[:, None] - z)[mask])
hist, _, _ = np.histogram2d(para, perp, bins=[rpbins, pibins])
return hist
In [10]: %timeit -n1 -r10 hist1(x, y, z, rpbins, pibins)
1 loops, best of 10: 289 ms per loop
In [11]: %timeit -n1 -r10 hist2(x, y, z, rpbins, pibins)
1 loops, best of 10: 294 ms per loop
In [12]: %timeit -n1 -r10 hist3(x, y, z, rpbins, pibins)
1 loops, best of 10: 278 ms per loop
It seems that most of the time is spent instantiating new arrays, not doing actual computations, so while there is some efficiency to scrape off, there really isn't much.

Related

Avoid three nested loops to search for all numbers not in a list

I want to find all the numbers that are not in a list and are equal to the following formula : x + x_max * y + x_max * y_max * z, where (x_max, y_max) are parameters and (x, y, z) can vary in a restricted set. I also want to know the values of (x, y, z) that I need to obtain each number.
I have succeeded to do this with the following code but it is currently extremely slow due to the three-nested loops. I am looking for a faster solution (maybe using NumPy arrays?).
What my code is actually doing with an extremely low speed (but works):
import numpy as np
x_max, y_max, z_max = 180, 90, 90
numbers = np.random.randint(x_max * y_max * z_max, size=1000)
results = np.array([[0, 0, 0, 0]]) # Initializes the array
for x in range(0, x_max):
for y in range(0, y_max):
for z in range(0, z_max):
result = x + y * x_max + z * x_max * y_max
if (result not in numbers):
results = np.append(results, [[x, y, z, result]], axis=0)
# Initial line is useless
results = np.delete(results, (0), axis=0)

Your nested loop and the calcuation:
for x in range(0, x_max):
for y in range(0, y_max):
for z in range(0, z_max):
result = x + y * x_max + z * x_max * y_max
simply calculate all integers between 0 and x_max * y_max * z_max. And all the integers are unique as well: no integer is calculated twice.
That fact makes this a lot easier:
values = np.arange(x_max * y_max * z_max)
results = np.setdiff1d(values, numbers)
This will give you all the integers that have been calculated and are not in the numbers exclusion list.
Now you only miss the input x, y and z values. These, though, can be calculated from the actual results with some straightforward modulo arithmetic:
z = results // (x_max * y_max)
rem = results % (x_max * y_max)
y = rem // x_max
x = rem % x_max
Now you can stack it all nicely together:
results = np.array([x, y, z, results])
You can tweak your results array if needs be, e.g.:
results = results.T # simple transpose
results = np.sort(results, axis=1) # sort the inner list
I used the above to compare the output from this calculation with that of the triple-nested loop calculation. The results were indeed equal.

Ok, so first of all, the main reason your code is slow is not the nested looping, it's because you're using np.append - this allocates an entirely new array with each append, and should almost never be used. You're much better off just using a python list, for which appending is an O(1) operation (internally, it only actually reallocates memory when the list grows past some multiple (I think something like 1+1/e times) its previous size).
The following should run somewhere on the order of 100x faster.
import numpy as np
x_max, y_max, z_max = 180, 90, 90
numbers = np.random.randint(x_max * y_max * z_max, size=1000)
# results = np.array([[0, 0, 0, 0]]) # Initializes the array <-- Unnecessary
results = [] # <-- Use a python list when you want to expand things
for x in range(0, x_max):
print(f'x={x}')
for y in range(0, y_max):
for z in range(0, z_max):
result = x + y * x_max + z * x_max * y_max
if (result not in numbers):
# results = np.append(results, [[x, y, z, result]], axis=0) # <-- np.append is very slow O(N)
results.append([x, y, z, result]) # <-- List.append is O(1)
results = np.array(results)
# Initial line is useless
# results = np.delete(results, (0), axis=0) <-- Unnecessary without the unnecessary initialization.
... but, we can still get faster using numpy vectorization.
# ... See above code for computation of "results"
print(f'Found result with python loop and list in {time.time() - t_start:.3f}s')
t_start = time.time()
# Get the x, y, z indices that you'd normally get from a nested loop. They'll be in arrays of shape (x_max, y_max, z_max)
xs, ys, zs = np.meshgrid(np.arange(x_max), np.arange(y_max), np.arange(z_max), indexing='ij')
all_values = xs + ys * x_max + zs * x_max * y_max
valid_indices = ~np.isin(all_values, numbers) # Get a shape (x_max, y_max, z_max) boolean mask
# Now use the mask to filter each array (yielding a flat (x_max*y_max*zmax) v[valid_indices] array)
# ... Then reshape it into a (x_max*y_max*zmax, 1) array
# ... So it can be stacked horizontally (h-stack) with the others along the second axis.
results_vectorized = np.hstack([v[valid_indices].reshape(-1, 1) for v in (xs, ys, zs, all_values)])
assert np.array_equal(results_vectorized, results)
print(f'Found result in {time.time() - t_start:.3f}s')
This is around 20x faster than the previous:
Found result with python loop and list in 3.630s
Found result in 0.154s

Speeding things up with numpy is always the same process:
Construct a big array containing all the information
Do computations on the whole array at the same time
Here I use np.fromfunction to construct the big array from the formula you gave.
So here is my ~60x speedup solution:
import numpy as np
from functools import partial
def formula(x, y, z, x_max, y_max, z_max):
return x + y * x_max + z * x_max * y_max
def my_solution(x_max, y_max, z_max, seen):
seen = np.unique(seen)
results = np.fromfunction(
partial(formula, x_max=x_max, y_max=y_max, z_max=z_max),
shape=(x_max, y_max, z_max),
)
mask_not_seen = ~np.isin(results, seen)
results_not_seen = results[mask_not_seen]
indices_not_seen = np.where(mask_not_seen)
return np.stack([*indices_not_seen, results_not_seen], axis=-1)
Let's check that it outputs the same as your solution:
x_max, y_max, z_max = 18, 9, 9
seen = np.random.randint(x_max * y_max * z_max, size=100)
op_out = op_solution(x_max, y_max, z_max, seen)
my_out = my_solution(x_max, y_max, z_max, seen)
assert np.all(op_out == my_out)
and that it is indeed quicker (~60x):
...: %timeit op_solution(x_max, y_max, z_max, seen)
...: %timeit my_solution(x_max, y_max, z_max, seen)
9.74 ms ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
161 µs ± 2.36 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Finally with the values you gave:
...: seen = np.random.randint(x_max * y_max * z_max, size=1000)
...: %timeit my_solution(x_max=180, y_max=90, z_max=90, seen=seen)
242 ms ± 3.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Double dot product with broadcasting in numpy

I have the following operation :
import numpy as np
x = np.random.rand(3,5,5)
w = np.random.rand(5,5)
y=np.zeros((3,5,5))
for i in range(3):
y[i] = np.dot(w.T,np.dot(x[i],w))
Which corresponds to the pseudo-expression y[m,i,j] = sum( w[k,i] * x[m,k,l] * w[l,j], axes=[k,l] or equivalently simply the dot product of w.T , x, w broadcaster over the first dimension of x.
How can I implement it with numpy's broadcasting rules ?
Thanks in advance.

Here's one vectorized approach with np.tensordot which should be better than broadcasting + summation anyday -
# Take care of "np.dot(x[i],w)" term
x_w = np.tensordot(x,w,axes=((2),(0)))
# Perform "np.dot(w.T,np.dot(x[i],w))" : "np.dot(w.T,x_w)"
y_out = np.tensordot(x_w,w,axes=((1),(0))).swapaxes(1,2)
Alternatively, all of the mess being taken care of with one np.einsum call, but could be slower -
y_out = np.einsum('ab,cae,eg->cbg',w,x,w)
Runtime test -
In [114]: def tensordot_app(x, w):
...: x_w = np.tensordot(x,w,axes=((2),(0)))
...: return np.tensordot(x_w,w,axes=((1),(0))).swapaxes(1,2)
...:
...: def einsum_app(x, w):
...: return np.einsum('ab,cae,eg->cbg',w,x,w)
...:
In [115]: x = np.random.rand(30,50,50)
...: w = np.random.rand(50,50)
...:
In [116]: %timeit tensordot_app(x, w)
1000 loops, best of 3: 477 µs per loop
In [117]: %timeit einsum_app(x, w)
1 loop, best of 3: 219 ms per loop
Giving the broadcasting a chance
The sum-notation was -
y[m,i,j] = sum( w[k,i] * x[m,k,l] * w[l,j], axes=[k,l] )
Thus, the three terms would be stacked for broadcasting, like so -
w : [ N x k x i x N x N]
x : [ m x k x N x l x N]
w : [ N x N X N x l x j]
, where N represents new-axis being appended to facilitate broadcasting along those dims.
The terms with new axes being added with None/np.newaxis would then look like this -
w : w[None, :, :, None, None]
x : x[:, :, None, :, None]
w : w[None, None, None, :, :]
Thus, the broadcasted product would be -
p = w[None,:,:,None,None]*x[:,:,None,:,None]*w[None,None,None,:,:]
Finally, the output would be sum-reduction to lose (k,l), i.e. axes =(1,3) -
y = p.sum((1,3))

faster way to make a 3D blob with python?

Is there a better way to make a 3D density function?
def make_spot_3d(bright, spread, x0,y0,z0):
# Create x and y indices
x = np.linspace(-50, 50, 200)
y = np.linspace(-50, 50, 200)
z = np.linspace(-50, 50, 200)
X, Y, Z = np.meshgrid(x, y, z)
Intensity = np.uint16(bright*np.exp(-((X-x0)/spread)**2
-((Y-y0)/spread)**2
-((Z-z0)/spread)**2))
return Intensity
The function can generate a 3D numpy array which can be plotted with mayavi
However when the function is used to generate a cluster of spots (~100) as follow:
Spots = np.asarray([make_spot_3d(100,2, *loc) for loc in locations])
cluster = np.sum(Spots, axis=0)
yielding for example:
The execution time is around 1 minute (cpu i5);I bet this could be faster.

An obvious improvement would be to use broadcasting to evaluate your intensity function over a 'sparse' mesh rather than a full meshgrid, e.g.:
X, Y, Z = np.meshgrid(x, y, z, sparse=True)
This reduces the runtime by a factor of about 4x on my machine:
%timeit make_spot_3d(1., 1., 0, 0, 0)
1 loops, best of 3: 1.56 s per loop
%timeit make_spot_3d_ogrid(1., 1., 0, 0, 0)
1 loops, best of 3: 359 ms per loop
You can get rid of the overhead involved in the list comprehension by vectorizing the calculation over locations, spreads and brightnesses, e.g.:
def make_spots(bright, spread, x0, y0, z0):
# Create x and y indices
x = np.linspace(-50, 50, 200)
y = np.linspace(-50, 50, 200)
z = np.linspace(-50, 50, 200)
# this will broadcast out to an (nblobs, ny, nx, nz) array
dx = x[None, None, :, None] - x0[:, None, None, None]
dy = y[None, :, None, None] - y0[:, None, None, None]
dz = z[None, None, None, :] - z0[:, None, None, None]
spread = spread[:, None, None, None]
bright = bright[:, None, None, None]
# we can save time by performing the exponentiation over 2D arrays
# before broadcasting out to 4D, since exp(a + b) == exp(a) * exp(b)
s2 = spread * spread
a = np.exp(-(dx * dx) / s2)
b = np.exp(-(dy * dy) / s2)
c = np.exp(-(dz * dz) / s2)
intensity = bright * a * b * c
return intensity.astype(np.uint16)
where bright, spread, x0, y0 and z0 are 1D vectors. This will generate an (nblobs, ny, nx, nz) array, which you could then sum over the first axis. Depending on how many blobs you are generating, and how large the grid is that you are evaluating them over, creating this intermediate array might become quite expensive in terms of memory.
Another option would be to initialize a single (ny, nx, nz) output array and compute the sum in-place:
def sum_spots_inplace(bright, spread, x0, y0, z0):
# Create x and y indices
x = np.linspace(-50, 50, 200)
y = np.linspace(-50, 50, 200)
z = np.linspace(-50, 50, 200)
dx = x[None, None, :, None] - x0[:, None, None, None]
dy = y[None, :, None, None] - y0[:, None, None, None]
dz = z[None, None, None, :] - z0[:, None, None, None]
spread = spread[:, None, None, None]
bright = bright[:, None, None, None]
s2 = spread * spread
a = np.exp(-(dx * dx) / s2)
b = np.exp(-(dy * dy) / s2)
c = np.exp(-(dz * dz) / s2)
out = np.zeros((200, 200, 200), dtype=np.uint16)
for ii in xrange(bright.shape[0]):
out += bright[ii] * a[ii] * b[ii] * c[ii]
return out
This will require less memory, but the potential downside is that it necessitates looping in Python.
To give you some idea of the relative performance:
def sum_spots_listcomp(bright, spread, x0, y0, z0):
return np.sum([make_spot_3d(bright[ii], spread[ii], x0[ii], y0[ii], z0[ii])
for ii in xrange(len(bright))], axis=0)
def sum_spots_vec(bright, spread, x0, y0, z0):
return make_spots(bright, spread, x0, y0, z0).sum(0)
# some fake data
bright = np.random.rand(10) * 100
spread = np.random.rand(10) * 100
x0 = (np.random.rand(10) - 0.5) * 50
y0 = (np.random.rand(10) - 0.5) * 50
z0 = (np.random.rand(10) - 0.5) * 50
%timeit sum_spots_listcomp(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 16.6 s per loop
%timeit sum_spots_vec(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 1.03 s per loop
%timeit sum_spots_inplace(bright, spread, x0, y0, z0)
# 1 loops, best of 3: 330 ms per loop

Since you have an i5 processor and the spots are independent from each other, it would be nice to implement multithreading. You don't necessarily require multiple processes as many Numpy operations release the GIL. The additional code can be quite simple:
from multiprocessing.dummy import Pool
if __name__ == '__main__':
wrap = lambda pos: make_spot_3d(100, 2, *pos)
cluster = sum(Pool().imap_unordered(wrap, positions))
Update
After some testing on my PC at work I must admit that the code above is just too naive and inefficient. On 8 cores the speedup is only ~1.5 times, relative to the singlecore performance.
I still think multithreading would be a good idea, but the success depends very much on the implementation.

So, you're doing every operation in your term 8 million (=200*200*200) times; first of all, you can cut that down to 1 million (if the sphere happens to be in the center of your grid) by just calculating one eighth of if and mirroring that. Mirroring doesn't come for free, but is still far less expensive than exp.
Also, it's very probable that you should just stop calculation after the value of intensity has dropped to 0. Using a little logarithm magic, you can come up with a region of interest that might be much smaller than the 200*200*200 grid.

How to vectorize finding max value in numpy array with if statement?

My Setup: Python 2.7.4.1, Numpy MKL 1.7.1, Windows 7 x64, WinPython
Context:
I tried to implement the Sequential Minimal Optimization algorithm for solving SVM. I use maximal violating pair approach.
The problem:
In working set selection procedure i want to find maximum value of gradient and its index for elements which met some condition, y[i]*alpha[i]<0 or y[i]*alpha[i]
#y - array of -1 and 1
y=np.array([-1,1,1,1,-1,1])
#alpha- array of floats in range [0,C]
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
#grad - array of floats
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
GMaxI=float('-inf')
GMax_idx=-1
n=alpha.shape[0] #usually n=100000
C=4
B=[0,0,C]
for i in xrange(0,n):
yi=y[i] #-1 or 1
alpha_i=alpha[i]
if (yi * alpha_i< B[yi+1]): # B[-1+1]=0 B[1+1]=C
if( -yi*grad[i]>=GMaxI):
GMaxI= -yi*grad[i]
GMax_idx = i
This procedure is called many times (~50000) and profiler shows that this is the bottleneck.
It is possible to vectorize this code?
Edit 1:
Add some small exemplary data
Edit 2:
I have checked solution proposed by hwlau , larsmans and Mr E. Only solutions proposed Mr E is correct. Below sample code with all three answers:
import numpy as np
y=np.array([ -1, -1, -1, -1, -1, -1, -1, -1])
alpha=np.array([0, 0.9, 0.4, 0.1, 1.33, 0, 0.9, 0])
grad=np.array([-3, -0.5, -1, -1, -0.2, -4, -0.4, -0.3])
C=4
B=np.array([0,0,C])
#hwlau - wrong index and value
filter = (y*alpha < C*0.5*(y+1)).astype('float')
GMax_idx = (filter*(-y*grad)).argmax()
GMax = -y[GMax_idx]*grad[GMax_idx]
print GMax_idx,GMax
#larsmans - wrong index
neg_y_grad = (-y * grad)[y * alpha < B[y + 1]]
GMaxI = np.max(neg_y_grad)
GMax_ind = np.argmax(neg_y_grad)
print GMax_ind,GMaxI
#Mr E - correct result
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -y * grad
values[~valid_mask] = np.min(values) - 1.0
GMaxI = values.max()
GMax_idx = values.argmax()
print GMax_idx,GMaxI
Output (GMax_idx, GMaxI)
0 -3.0
3 -0.2
4 -0.2
Conclusions
After checking all solutions, the fastest one (2x-6x) is solution proposed by #ali_m. However it requires to install some python packages: numba and all its prerequisites.
I have some trouble to use numba with class methods, so I create global functions which are autojited with numba, my solution look something like this:
from numba import autojit
#autojit
def FindMaxMinGrad(A,B,alpha,grad,y):
'''
Finds i,j indices with maximal violatin pair scheme
A,B - 3 dim arrays, contains bounds A=[-C,0,0], B=[0,0,C]
alpha - array like, contains alpha coeficients
grad - array like, gradient
y - array like, labels
'''
GMaxI=-100000
GMaxJ=-100000
GMax_idx=-1
GMin_idx=-1
for i in range(0,alpha.shape[0]):
if (y[i] * alpha[i]< B[y[i]+1]):
if( -y[i]*grad[i]>GMaxI):
GMaxI= -y[i]*grad[i]
GMax_idx = i
if (y[i] * alpha[i]> A[y[i]+1]):
if( y[i]*grad[i]>GMaxJ):
GMaxJ= y[i]*grad[i]
GMin_idx = i
return (GMaxI,GMaxJ,GMax_idx,GMin_idx)
class SVM(object):
def working_set(self,....):
FindMaxMinGrad(.....)

You can probably do quite a lot better than plain vectorization if you use numba to JIT-compile your original code that used nested loops.
import numpy as np
from numba import autojit
#autojit
def jit_max_grad(y, alpha, grad, B):
maxgrad = -inf
maxind = -1
for ii in xrange(alpha.shape[0]):
if (y[ii] * alpha[ii] < B[y[ii] + 1]):
g = -y[ii] * grad[ii]
if g >= maxgrad:
maxgrad = g
maxind = ii
return maxind, maxgrad
For comparison, here's Mr E's vectorized version:
def mr_e_max_grad(y, alpha, grad, B):
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -y * grad
values[~valid_mask] = np.min(values) - 1.0
GMaxI = values.max()
GMax_idx = values.argmax()
return GMax_idx, GMaxI
Timing:
y = np.array([ -1, -1, -1, -1, -1, -1, -1, -1])
alpha = np.array([0, 0.9, 0.4, 0.1, 1.33, 0, 0.9, 0])
grad = np.array([-3, -0.5, -1, -1, -0.2, -4, -0.4, -0.3])
C = 4
B = np.array([0,0,C])
%timeit mr_e_max_grad(y, alpha, grad, B)
# 100000 loops, best of 3: 19.1 µs per loop
%timeit jit_max_grad(y, alpha, grad, B)
# 1000000 loops, best of 3: 1.07 µs per loop
Update: if you want to see what the timings look like on bigger arrays, it's easy to define a function that generates semi-realistic fake data based on your description in the question:
def make_fake(n, C=4):
y = np.random.choice((-1, 1), n)
alpha = np.random.rand(n) * C
grad = np.random.randn(n)
B = np.array([0,0,C])
return y, alpha, grad, B
%%timeit y, alpha, grad, B = make_fake(100000, 4)
mr_e_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 1.83 ms per loop
%%timeit y, alpha, grad, B = make_fake(100000, 4)
jit_max_grad(y, alpha, grad, B)
# 1000 loops, best of 3: 471 µs per loop

I think this is a fully vectorized version
import numpy as np
#y - array of -1 and 1
y=np.array([-1,1,1,1,-1,1])
#alpha- array of floats in range [0,C]
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
#grad - array of floats
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
BY = np.take(B, y+1)
valid_mask = (y * alpha < BY)
values = -yi * grad
values[~valid_mask] = np.min(values) - 1.0
GMaxI = values.max()
GMax_idx = values.argmax()

Here you go:
y=np.array([-1,1,1,1,-1,1])
alpha=np.array([0.4,0.1,1.33,0,0.9,0])
grad=np.array([-1,-1,-0.2,-0.4,0.4,0.2])
C=4
filter = (y*alpha < C*0.5*(y+1)).astype('float')
GMax_idx = (filter*(-y*grad)).argmax()
GMax = -y[GMax_idx]*grad[GMax_idx]
No benchmark tried, but it is pure numerical and vectorized so it should be fast.

If you change B from a list to a NumPy array, you can at least vectorize the yi * alpha_i< B[yi+1] and push the loop inwards:
GMaxI = float('-inf')
GMax_idx = -1
for i in np.where(y * alpha < B[y + 1])[0]:
if -y[i] * grad[i] >= GMaxI:
GMaxI= -y[i] * grad[i]
GMax_idx = i
That should save a bit of time. Next up, you can vectorize -y[i] * grad[i]:
GMaxI = float('-inf')
GMax_idx = -1
neg_y_grad = -y * grad
for i in np.where(y * alpha < B[y + 1])[0]:
if neg_y_grad[i] >= GMaxI:
GMaxI= -y[i] * grad[i]
GMax_idx = i
Finally, we can vectorize away the entire loop by using max and argmax on -y * grad, filtered by y * alpha < B[y + 1]:
neg_y_grad = (-y * grad)
GMaxI = np.max(neg_y_grad[y * alpha < B[y + 1]])
GMax_idx = np.where(neg_y_grad == GMaxI)[0][0]

Fast interpolation over 3D array

I have a 3D array that I need to interpolate over one axis (the last dimension). Let's say y.shape = (nx, ny, nz), I want to interpolate in nz for every (nx, ny). However, I want to interpolate for a different value in each [i, j].
Here's some code to exemplify. If I wanted to interpolate to a single value, say new_z, I'd use scipy.interpolate.interp1d like this
# y is a 3D ndarray
# x is a 1D ndarray with the abcissa values
# new_z is a number
f = scipy.interpolate.interp1d(x, y, axis=-1, kind='linear')
result = f(new_z)
However, for this problem what I actually want is to interpolate to a different new_z for each y[i, j]. So I do this:
# y is a 3D ndarray
# x is a 1D ndarray with the abcissa values
# new_z is a 2D array
result = numpy.empty(y.shape[:-1])
for i in range(nx):
for j in range(ny):
f = scipy.interpolate.interp1d(x, y[i, j], axis=-1, kind='linear')
result[i, j] = f(new_z[i, j])
Unfortunately, with multiple loops this becomes inefficient and slow. Is there a better way to do this kind of interpolation? Linear interpolation is sufficient. A possibility is to implement this in Cython, but I was trying to avoid that because I want to have the flexibility of changing to cubic interpolation and don't want to do it by hand in Cython.

To speedup high order interpolate, you can call interp1d() only once, and then use the _spline attribute and the low level function _bspleval() in the _fitpack module. Here is the code:
from scipy.interpolate import interp1d
import numpy as np
nx, ny, nz = 30, 40, 50
x = np.arange(0, nz, 1.0)
y = np.random.randn(nx, ny, nz)
new_x = np.random.random_integers(1, (nz-1)*10, size=(nx, ny))/10.0
def original_interpolation(x, y, new_x):
result = np.empty(y.shape[:-1])
for i in xrange(nx):
for j in xrange(ny):
f = interp1d(x, y[i, j], axis=-1, kind=3)
result[i, j] = f(new_x[i, j])
return result
def fast_interpolation(x, y, new_x):
from scipy.interpolate._fitpack import _bspleval
f = interp1d(x, y, axis=-1, kind=3)
xj,cvals,k = f._spline
result = np.empty_like(new_x)
for (i, j), value in np.ndenumerate(new_x):
result[i, j] = _bspleval(value, x, cvals[:, i, j], k, 0)
return result
r1 = original_interpolation(x, y, new_x)
r2 = fast_interpolation(x, y, new_x)
>>> np.allclose(r1, r2)
True
%timeit original_interpolation(x, y, new_x)
%timeit fast_interpolation(x, y, new_x)
1 loops, best of 3: 3.78 s per loop
100 loops, best of 3: 15.4 ms per loop

I don't think interp1d has a method for doing this fast, so you can't avoid the loop here.
Cython you can probably still avoid by coding up the linear interpolation using np.searchsorted, something like this (not tested):
def interp3d(x, y, new_x):
assert x.ndim == 1 and y.ndim == 3 and new_x.ndim == 2
assert y.shape[:2] == new_x.shape and x.shape == y.shape[2:]
nx, ny = y.shape[:2]
new_x = new_x.ravel()
j = np.arange(len(new_x))
k = np.searchsorted(x, new_x).clip(1, len(x) - 1)
y = y.reshape(-1, x.shape[0])
p = (new_x - x[k-1]) / (x[k] - x[k-1])
result = (1 - p) * y[j,k-1] + p * y[j,k]
return result.reshape(nx, ny)
Doesn't help with cubic interpolation, though.
EDIT: made it a function and fixed off-by-one errors. Some timings vs. Cython (500x500x500 grid):
In [58]: %timeit interp3d(x, y, new_x)
10 loops, best of 3: 82.7 ms per loop
In [59]: %timeit cyfile.interp3d(x, y, new_x)
10 loops, best of 3: 86.3 ms per loop
In [60]: abs(interp3d(x, y, new_x) - cyfile.interp3d(x, y, new_x)).max()
Out[60]: 2.2204460492503131e-16
Though, one can argue that the Cython code is easier to read.

As the numpy suggestion above was taking too long, I could wait so here's the cython version for future reference. From some loose benchmarks it is about 3000 times faster (granted, it is only linear interpolation and doesn't to as much as interp1d but it's ok for this purpose).
import numpy as N
cimport numpy as N
cimport cython
DTYPEf = N.float64
ctypedef N.float64_t DTYPEf_t
#cython.boundscheck(False) # turn of bounds-checking for entire function
#cython.wraparound(False) # turn of bounds-checking for entire function
cpdef interp3d(N.ndarray[DTYPEf_t, ndim=1] x, N.ndarray[DTYPEf_t, ndim=3] y,
N.ndarray[DTYPEf_t, ndim=2] new_x):
"""
interp3d(x, y, new_x)
Performs linear interpolation over the last dimension of a 3D array,
according to new values from a 2D array new_x. Thus, interpolate
y[i, j, :] for new_x[i, j].
Parameters
----------
x : 1-D ndarray (double type)
Array containg the x (abcissa) values. Must be monotonically
increasing.
y : 3-D ndarray (double type)
Array containing the y values to interpolate.
x_new: 2-D ndarray (double type)
Array with new abcissas to interpolate.
Returns
-------
new_y : 3-D ndarray
Interpolated values.
"""
cdef int nx = y.shape[0]
cdef int ny = y.shape[1]
cdef int nz = y.shape[2]
cdef int i, j, k
cdef N.ndarray[DTYPEf_t, ndim=2] new_y = N.zeros((nx, ny), dtype=DTYPEf)
for i in range(nx):
for j in range(ny):
for k in range(1, nz):
if x[k] > new_x[i, j]:
new_y[i, j] = (y[i, j, k] - y[i, j, k - 1]) * \
(new_x[i, j] - x[k-1]) / (x[k] - x[k - 1]) + y[i, j, k - 1]
break
return new_y

Building on #pv.'s answer, and vectorising the inner loop, the following gives a substantial speedup (EDIT: changed the expensive numpy.tile to using numpy.lib.stride_tricks.as_strided):
import numpy
from scipy import interpolate
nx = 30
ny = 40
nz = 50
y = numpy.random.randn(nx, ny, nz)
x = numpy.float64(numpy.arange(0, nz))
# We select some locations in the range [0.1, nz-0.1]
new_z = numpy.random.random_integers(1, (nz-1)*10, size=(nx, ny))/10.0
# y is a 3D ndarray
# x is a 1D ndarray with the abcissa values
# new_z is a 2D array
def original_interpolation():
result = numpy.empty(y.shape[:-1])
for i in range(nx):
for j in range(ny):
f = interpolate.interp1d(x, y[i, j], axis=-1, kind='linear')
result[i, j] = f(new_z[i, j])
return result
grid_x, grid_y = numpy.mgrid[0:nx, 0:ny]
def faster_interpolation():
flat_new_z = new_z.ravel()
k = numpy.searchsorted(x, flat_new_z)
k = k.reshape(nx, ny)
lower_index = [grid_x, grid_y, k-1]
upper_index = [grid_x, grid_y, k]
tiled_x = numpy.lib.stride_tricks.as_strided(x, shape=(nx, ny, nz),
strides=(0, 0, x.itemsize))
z_upper = tiled_x[upper_index]
z_lower = tiled_x[lower_index]
z_step = z_upper - z_lower
z_delta = new_z - z_lower
y_lower = y[lower_index]
result = y_lower + z_delta * (y[upper_index] - y_lower)/z_step
return result
# both should be the same (giving a small difference)
print numpy.max(
numpy.abs(original_interpolation() - faster_interpolation()))
That gives the following times on my machine:
In [8]: timeit foo.original_interpolation()
10 loops, best of 3: 102 ms per loop
In [9]: timeit foo.faster_interpolation()
1000 loops, best of 3: 564 us per loop
Going to nx = 300, ny = 300 and nz = 500, gives a 130x speedup:
In [2]: timeit original_interpolation()
1 loops, best of 3: 8.27 s per loop
In [3]: timeit faster_interpolation()
10 loops, best of 3: 60.1 ms per loop
You'd need a write your own algorithm for cubic interpolation, but it shouldn't be so hard.

You could use map_coordinates for that:
from numpy import random, meshgrid, arange
from scipy.ndimage import map_coordinates
(nx, ny, nz) = (4, 5, 6)
# some random array
A = random.rand(nx, ny, nz)
# random floating-point indices in [0, nz-1]
Z = random.rand(nx, ny)*(nz-1)
# regular integer indices of shape (nx,ny)
X, Y = meshgrid(arange(nx), arange(ny), indexing='ij')
coords = (X, Y, Z) # X, Y, and Z are of shape (nx, ny)
print map_coordinates(A, coords, order=1, cval=-999.)

Although there are several nice answers,
they're still doing 250k interpolations in a fixed 500-long array:
j250k = np.searchsorted( X500, X250k ) # indices in [0, 500)
This can be sped up with a LUT, LookUp Table, with say 5k slots:
lut = np.interp( np.arange(5000), X500, np.arange(500) ).round().astype(int)
xscale = (X - X.min()) * (5000 - 1) \
/ (X.max() - X.min())
j = lut.take( xscale.astype(int), mode="clip" ) # take(floats) in numpy 1.7 ?
#---------------------------------------------------------------------------
# X | | | | |
# j 0 1 2 3 4 ...
# LUT |....|.......|.|.............|.... -> int j (+ offset in [0, 1) )
#---------------------------------------------------------------------------
searchsorted is pretty fast, time ~ ln2 500,
so this is probably not much faster.
But LUTs are very fast in C, a simple speed / memory tradeoff.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to vectorize this loop difference in numpy? - python

Related

Avoid three nested loops to search for all numbers not in a list

Double dot product with broadcasting in numpy

faster way to make a 3D blob with python?

How to vectorize finding max value in numpy array with if statement?

Fast interpolation over 3D array

Categories

Resources