Speed up function evaluation for integration in scipy - python

I am trying to port code from Matlab to SciPy. Here is the simplified version of the code I have written so far: https://gist.github.com/atmo/01b6e007be9ef90e402c . However, the Python version is considerably slower than Matlab. I've included profiling results in the gist; they show that Python spends almost 90% of its time evaluating the function f. Is there any way to speed up its evaluation, other than rewriting it in C or Cython?

As I mentioned in the comments, you can get rid of about half the calls to quad (and consequently to the complicated function f) if you take into account that the matrix is symmetric.
Further speed gains, still in pure Python, are to be had by rewriting that complicated function. I did most of that in sympy.
Finally, I tried to vectorize the call to quad using np.vectorize.
from scipy.integrate import quad
from scipy.special import jn as besselj
from scipy.linalg import norm
from numpy import exp
import numpy as np

def complicated_func(lmbd, a, n, k):
    u, v, w = 5, 3, 2
    x = a*lmbd
    fac = exp(2*x)
    comm = (2*w + x)
    part1 = ((v**2 + 4*w*(w + 2*x) + 2*x*(x - 1))*fac**5
        + 2*u*fac**4
        + (-v**2 - 4*(w*(3*w + 4*x + 1) + x*(x-2)) + 1)*fac**3
        + (-8*(w + x) + 2)*fac**2
        + (2*comm*(comm + 1) - 1)*fac)
    return part1/lmbd * besselj(n+1, lmbd) * besselj(k+1, lmbd)

def perform_quad(n, k, a):
    return quad(complicated_func, 0, np.inf, args=(a, n, k))[0]

def improved_main():
    sz = 20
    amatrix = np.zeros((sz, sz))
    ls = -np.linspace(1, 10, 20)/2
    inds = np.tril_indices(sz)
    myv3 = np.vectorize(perform_quad)
    res = myv3(inds[0], inds[1], ls.reshape(-1, 1))
    results = np.empty(res.shape[0])
    for rowind, row in enumerate(res):
        amatrix[inds] = row
        symm_matrix = amatrix + amatrix.T - np.diag(amatrix.diagonal())
        results[rowind] = norm(symm_matrix)
    return results
Timing results show a speed increase of a factor of 5 (you'll forgive me if I only ran it once; it takes long enough as it is):
In [11]: %timeit -n1 -r1 improved_main()
1 loops, best of 1: 6.92 s per loop
In [12]: %timeit -n1 -r1 main()
1 loops, best of 1: 35.9 s per loop
There was also a micro-gain to be had by replacing v immediately with its square, because that's the only way it is used in that complicated function: squared.
There's also an extreme amount of repetition in the calls to besselj, but I don't see how to avoid that, because quad determines the lmbd values, so you can't easily precompute them and then perform a lookup.
If you profile improved_main, you'll see that the number of calls to complicated_func has decreased by nearly a factor of 2 (the diagonal still needs to be computed). All the other speed gains can be attributed to np.vectorize and the improvements to complicated_func.
I don't have Matlab on my system, so I can't make any statements for its speed gain if you improve the complicated function there.

Your numpy version is probably comparable in speed to older MATLAB runs. But newer MATLAB versions do various forms of just-in-time compilation that speed up repeated calculations considerably.
My guess is that you can nibble away at the lambda and f code, and maybe cut their evaluation times in half. But the real killer is that you are calling f so many times.
For a start, I'd try to precalculate things in f. For example, define K1 = K[1] and use K1 in the calculations; that will reduce the number of indexing calls. Are any of the exponentials repeated? Maybe replace the lambda definition with a regular def, or combine it with f.
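Something like this hypothetical sketch (the real f lives in the gist, so K and the exponential term here are just stand-ins for its repeated subexpressions):

import numpy as np

def f(lmbd, K):
    K1 = K[1]                 # index once instead of on every use
    e2x = np.exp(2 * lmbd)    # evaluate the repeated exponential once
    return K1 * e2x + K1 / e2x  # reuse the cached values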

Related

When numba is effective?

I know numba creates some overhead, and in some situations (non-intensive computation) it becomes slower than pure Python. But what I don't know is where to draw the line. Is it possible to use the order of algorithmic complexity to figure out where?
For example, for adding two arrays (~O(n)) shorter than 5, pure Python is faster in this code:
import numpy as np
import numba

def sum_1(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@numba.jit('float64[:](float64[:],float64[:])')
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

# try 100
a = np.linspace(1.0, 2.0, 5)
b = np.linspace(1.0, 2.0, 5)
print("pure python: ")
%timeit -o sum_1(a,b)
print("\n\n\n\npython + numba: ")
%timeit -o sum_2(a,b)
UPDATE: what I am looking for is a guideline similar to this one:
"A general guideline is to choose different targets for different data sizes and algorithms. The “cpu” target works well for small data sizes (approx. less than 1KB) and low compute intensity algorithms. It has the least amount of overhead. The “parallel” target works well for medium data sizes (approx. less than 1MB). Threading adds a small delay. The “cuda” target works well for big data sizes (approx. greater than 1MB) and high compute intensity algorithms. Transferring memory to and from the GPU adds significant overhead."
It's hard to draw the line when numba becomes effective. However, there are a few indicators when it might not be:
If you cannot use jit with nopython=True: whenever you cannot compile it in nopython mode, you are either trying to compile too much or it won't be significantly faster.
If you don't use arrays: when you pass lists or other types to the numba function (other than from other numba functions), numba needs to copy them, which incurs a significant overhead.
If there is already a NumPy or SciPy function that does it: even if numba can be significantly faster for short arrays, it will almost always be only as fast for longer arrays (and you might easily neglect some common edge cases that those functions handle).
There's also another reason why you might not want to use numba in cases where it's just "a bit" faster than other solutions: numba functions have to be compiled, either ahead of time or on the first call, and in some situations the compilation will cost much more than your gain, even if you call the function hundreds of times. The compilation times also add up: numba is slow to import, and compiling the numba functions adds some overhead. It doesn't make sense to shave off a few milliseconds if the import overhead increases by 1-10 seconds.
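A quick way to see that first-call cost (a tiny sketch; the timings in the comments are what one typically observes, not measurements from this post):

import time
import numba as nb

@nb.njit
def add_one(x):
    return x + 1.0

t0 = time.time()
add_one(1.0)             # first call triggers compilation
print(time.time() - t0)  # typically a sizeable fraction of a second

t0 = time.time()
add_one(1.0)             # already compiled
print(time.time() - t0)  # microseconds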
Also, numba is complicated to install (without conda at least), so if you want to share your code you have a really "heavy dependency".
Your example lacks a comparison with NumPy methods and with a highly optimized version of pure Python. I added some more comparison functions and did a benchmark (using my library simple_benchmark):
import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 17)
}
b = benchmark([python_loop, numba_loop, numpy_methods, python_sum],
              arguments, warmups=[numba_loop])

%matplotlib notebook
b.plot()
Yes, the numba function is fastest for small arrays; however, the NumPy solution will be slightly faster for longer arrays. The Python solutions are slower, but the "faster" alternative is already significantly faster than your original proposed solution.
In this case I would simply use the NumPy solution because it's short, readable and fast, except when you're dealing with lots of short arrays and calling the function many times - then the numba solution would be significantly better.
If you do not know exactly what the consequences of explicit input and output declarations are, let numba decide them. With your input you may want to use 'float64(float64[::1],float64[::1])' (scalar output, contiguous input arrays). If you call the explicitly declared function with strided inputs it will fail; if you let Numba do the job, it will simply recompile.
Without fastmath=True it is also not possible to use SIMD, because it changes the precision of the result.
Calculating at least 4 partial sums (a 256-bit vector) and then calculating the sum of these partial sums is preferable here (NumPy also doesn't calculate a naive sum).
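A sketch of that partial-sums idea (not the code benchmarked below; with fastmath=True the compiler can generate something like this on its own, the unrolling is just made explicit here):

import numba as nb

@nb.njit(fastmath=True)
def sum_four_accumulators(a, b):
    s0 = s1 = s2 = s3 = 0.0
    n = min(a.shape[0], b.shape[0])
    i = 0
    while i + 4 <= n:            # four independent partial sums
        s0 += a[i]     + b[i]
        s1 += a[i + 1] + b[i + 1]
        s2 += a[i + 2] + b[i + 2]
        s3 += a[i + 3] + b[i + 3]
        i += 4
    for j in range(i, n):        # remainder elements
        s0 += a[j] + b[j]
    return s0 + s1 + s2 + s3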
Example using MSeifert's excellent benchmark utility:
import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop_zip(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

# Your version with suboptimal input and output declaration
# (prevents njit compilation)
@nb.jit('float64[:](float64[:],float64[:])')
def numba_your_func(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_zip_fastmath(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_fastmath_single(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in range(size):
        result += a[i] + b[i]
    return result

@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_multi(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in nb.prange(size):
        result += a[i] + b[i]
    return result

# just for fun... single-threaded for small arrays,
# multithreaded for larger arrays
@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_combined(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    if size > 2*10**4:
        result = numba_loop_fastmath_multi(a, b)
    else:
        result = numba_loop_fastmath_single(a, b)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 19)
}
b = benchmark([python_loop, numba_loop_zip, numpy_methods, numba_your_func,
               python_sum, numba_loop_zip_fastmath, numba_loop_fastmath_single,
               numba_loop_fastmath_multi, numba_loop_fastmath_combined],
              arguments,
              warmups=[numba_loop_zip, numba_loop_zip_fastmath, numba_your_func,
                       numba_loop_fastmath_single, numba_loop_fastmath_multi,
                       numba_loop_fastmath_combined])

%matplotlib notebook
b.plot()
Please note that using numba_loop_fastmath_multi or numba_loop_fastmath_combined(a,b) is recommended only in some special cases. More often, such a simple function is one part of another problem that can be parallelized more efficiently (starting threads has some overhead).
Running this code led to a ~6x speedup on my machine:
@numba.autojit
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result
Python: 3.31 µs, numba: 589 ns.
As for your question, I don't think this is really related to algorithmic complexity; it will probably depend mostly on the kind of operations you are doing. On the other hand, you can still plot a Python/numba comparison to see where the shift happens for a given function.

Sympy define function of the upper limit of an integral

Using Sympy, I would like to define a function of one variable where the variable is the upper limit of some integral.
I tried the following, which works (f and p are defined further down):
import sympy as sp

def g(a, x):
    y = sp.Symbol('y')
    expr = sp.Integral(f(y, p), [y, a, x])
    return expr.doit()
However, I wonder whether this is going to be efficient when evaluated at many points. I have been reading about lambdify and would like to use it for this case, but am not sure how.
I am actually not sure if lambdify is the right way to go. Alternatively, one could compute the indefinite integral once, and then only apply the limits to evaluate the definite integral.
Let me show an example. I have a function of one variable with some parameters, say a polynomial in y:
def f(y, p):
    c0, c1, c2 = p
    return c0 + c1*y + c2*y**2
I want to define another function by integrating this polynomial, where the new function depends on the upper limit of the integration (LaTeX because I don't have enough reputation...):
g_{a,p}(x) = \int_{a}^{x} f(y,p)\,dy
So, in this simple case, g(x) would be a polynomial of order 3 which needs to be evaluated between a and x. Once I have g(x), I want to evaluate it at "many" points, so my question is whether I can do this efficiently.
I made a naive implementation of the solution and one using sympy.lambdify. I only timed each once, so these are not the most accurate results. However, the sympy.lambdify version seems about 100x faster.
Naive implementation
import sympy as sp
import numpy as np
import time

p = (1, 2, 3, 4, 5, 6)  # example coefficients (not given in the original post)
a, b = 0, 1             # example limits/evaluation range (likewise assumed)

def f(y, p):
    c0, c1, c2, c3, c4, c5 = p
    return c0 + c1*y + c2*y**2 + c3*y**3 + c4*y**4 + c5*y**5

def g(a, x):
    y = sp.Symbol('y')
    expr = sp.Integral(f(y, p), [y, a, x])
    return expr.doit()

start = time.clock()
l = []
for x in np.arange(a, b, 0.001):
    l.append(g(a, x))
end = time.clock()
print end - start
Improved implementation
import sympy as sp
import numpy as np
import time

p = (1, 2, 3, 4, 5, 6)  # example coefficients (assumed, as above)
a, b = 0, 1

def f(y, p):
    c0, c1, c2, c3, c4, c5 = p
    return c0 + c1*y + c2*y**2 + c3*y**3 + c4*y**4 + c5*y**5

x = sp.Symbol('x')
y = sp.Symbol('y')
itgx = sp.Integral(f(y, p), [y, a, x])
start = time.clock()
g = sp.lambdify(x, itgx.doit(), "numpy")
l = g(np.arange(a, b, 0.001))
end = time.clock()
print end - start
On my architecture (i7-3770 @ 3.40 GHz, Ubuntu 14.04), the naive implementation takes 12.086627 s while the lambdify implementation takes 0.054799 s, which looks like a significant speedup. The Sympy manual also suggests using lambdify when possible.
So my question, which maybe is not clear enough, is: is there a better way of doing this kind of computation? If so, please let me know.
Of course the lambdified version is faster. Not only does it vectorize the result over a numpy array rather than using a Python for loop, you're also being very inefficient in the other version by recomputing the integral each time.
I don't see an actual question here. Your lambdified version looks correct. I don't see any issues with it.
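One more option worth sketching (my own suggestion, not from the answers above): you can lambdify over the limit a and the coefficients as well, so the symbolic integration happens exactly once and one compiled function covers every case:

import sympy as sp
import numpy as np

x, y, a = sp.symbols('x y a')
c = sp.symbols('c0:6')                       # symbolic coefficients c0..c5
f_expr = sum(ci * y**i for i, ci in enumerate(c))
g_expr = sp.integrate(f_expr, (y, a, x))     # integrate symbolically, once
g = sp.lambdify((x, a) + c, g_expr, "numpy")

# evaluate at many points with concrete limits and coefficients
vals = g(np.arange(0, 1, 0.001), 0, 1, 2, 3, 4, 5, 6)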

Tricking numpy/python into representing very large and very small numbers

I need to compute the integral of the following function within ranges that start as low as -150:
import numpy as np
from scipy.special import ndtr

def my_func(x):
    return np.exp(x ** 2) * 2 * ndtr(x * np.sqrt(2))
The problem is that this part of the function
np.exp(x ** 2)
tends toward infinity -- I get inf for values of x less than approximately -26.
And this part of the function
2 * ndtr(x * np.sqrt(2))
which is equivalent to
from scipy.special import erf
1 + erf(x)
tends toward 0.
So a very, very large number times a very, very small number should give me a reasonably sized number - but instead of that, Python gives me nan.
What can I do to circumvent this problem?
I think a combination of @askewchan's solution and scipy.special.log_ndtr will do the trick:
import numpy as np
from scipy.special import ndtr, log_ndtr

_log2 = np.log(2)
_sqrt2 = np.sqrt(2)

def my_func(x):
    return np.exp(x ** 2) * 2 * ndtr(x * np.sqrt(2))

def my_func2(x):
    return np.exp(x * x + _log2 + log_ndtr(x * _sqrt2))

print(my_func(-150))
# nan

print(my_func2(-150))
# 0.0037611803122451198
For x <= -20, log_ndtr(x) uses a Taylor series expansion of the error function to iteratively compute the log CDF directly, which is much more numerically stable than simply taking log(ndtr(x)).
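You can see the difference directly (a quick check; -40 is deep enough into the tail that ndtr underflows float64):

import numpy as np
from scipy.special import ndtr, log_ndtr

print(np.log(ndtr(-40)))  # -inf, because ndtr(-40) underflows to 0.0
print(log_ndtr(-40))      # about -804.6, computed directly and finite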
Update
As you mentioned in the comments, the exp can also overflow if x is sufficiently large. Whilst you could work around this using mpmath.exp, a simpler and faster method is to cast up to np.longdouble, which on my machine can represent values up to 1.189731495357231765e+4932:
import mpmath

def my_func3(x):
    return mpmath.exp(x * x + _log2 + log_ndtr(x * _sqrt2))

def my_func4(x):
    return np.exp(np.float128(x * x + _log2 + log_ndtr(x * _sqrt2)))

print(my_func2(50))
# inf

print(my_func3(50))
# mpf('1.0895188633566085e+1086')

print(my_func4(50))
# 1.0895188633566084842e+1086

%timeit my_func3(50)
# The slowest run took 8.01 times longer than the fastest. This could mean
# that an intermediate result is being cached.
# 100000 loops, best of 3: 15.5 µs per loop

%timeit my_func4(50)
# The slowest run took 11.11 times longer than the fastest. This could mean
# that an intermediate result is being cached.
# 100000 loops, best of 3: 2.9 µs per loop
There already is such a function: erfcx. I think erfcx(-x) should give you the integrand you want (note that 1+erf(x)=erfc(-x)).
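A quick check of that identity (erfcx(x) = exp(x**2) * erfc(x), so erfcx(-x) = exp(x**2) * (1 + erf(x)), which is exactly the integrand):

from scipy.special import erfcx

def my_func5(x):
    return erfcx(-x)

print(my_func5(-150))
# 0.003761180312245..., finite where the original returned nan
# Note: like my_func2, this still overflows float64 for large positive x.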
Not sure how helpful this will be, but here are a couple of thoughts that are too long for a comment.
You need to calculate the integral of exp(x^2) * (1 + erf(x)), which you correctly identified would be exp(x^2) + exp(x^2)*erf(x). Opening the brackets, you can integrate both parts of the sum.
The first part is the imaginary error function: the integral of exp(x^2) is (sqrt(pi)/2) * erfi(x), and scipy has this imaginary error function implemented (scipy.special.erfi).
The second part, the integral of exp(x^2)*erf(x), is harder: it is a generalized hypergeometric function. Sadly it looks like scipy does not have an implementation of it, but this package claims it does.
Here I used indefinite integrals without constants; knowing the from and to values, it is clear how to use definite ones.

Speeding up dynamic programming in python/numpy

I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts that cut the time down. However, as I potentially need to evaluate the entire matrix, I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1 s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10

void optimalcost(double** costin, double** costout){
    /*
    We assume that costout is initially
    filled with costin's values.
    */
    int i, j;
    double a, b, c, prevcost;
    for(i = 0; i < 400; i++){
        for(j = 0; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(a, min(b, c));
            prevcost = costout[i][j];
        }
    }
}
Update:
I'm on a Mac, and I didn't want to install a whole new Python toolchain, so I used Homebrew:
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height, width = left.shape
    cost = np.zeros((height, width, width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row, x, y] = abs(left[row, x] - right[row, y])
    return cost

@autojit
def optimalcosts(initcost):
    height = initcost.shape[0]
    costs = np.zeros_like(initcost)
    for row in range(height):
        costs[row, :, :] = optimalcost(initcost[row])
    return costs

@autojit
def optimalcost(cost):
    width1, width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1, width1):
        for j in range(1, width2):
            M[i, j] += min(M[i-1, j-1], prevcost + P1, M[i, j-1] + P1)
            prevcost = M[i, j]
    return M

prob_size = 400
left = np.random.rand(prob_size, prob_size)
right = np.random.rand(prob_size, prob_size)

print '---------- Numba Time ----------'
t = time.time()
c = cost(left, right)
optimalcost(c[100])
print time.time() - t

print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left, right)
optimalcost.py_func(c[100])
print time.time() - t
It's interesting writing code in Python that is so un-Pythonic. A note for anyone interested in writing Numba code: you need to express the loops in your code explicitly. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies).
Suppose we are finding the shortest path in the "row" direction (i.e. horizontally). We can first create our algorithm input:
import numpy as np

# The problem: 300 400x400 matrices
# Create an infinitely high boundary so that we don't need to handle index "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than 3 minutes. I believe you can improve even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool: it is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
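A minimal sketch of the multiprocessing.Pool route (assuming a single-matrix optimalcost like the one in the question; the 300 random matrices stand in for the real data):

import numpy as np
from multiprocessing import Pool

def optimalcost(cost, P1=10):
    # single-matrix dynamic program, as in the question
    M = cost.copy()
    for i in range(1, M.shape[0]):
        for j in range(1, M.shape[1]):
            M[i, j] += min(M[i-1, j-1], M[i-1, j] + P1, M[i, j-1] + P1)
    return M

if __name__ == '__main__':
    matrices = [np.random.rand(400, 400) for _ in range(300)]
    with Pool() as pool:                  # split on the batch axis
        results = pool.map(optimalcost, matrices)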

Need help vectorizing code or optimizing

I am trying to do a double integral by first interpolating the data to make a surface. I am using numba to try and speed this process up, but it's just taking too long.
Here is my code, with the images needed to run it located here and here.
Noting that your code has a quadruple-nested set of for loops, I focused on optimizing the inner pair. Here's the old code:
for i in xrange(K.shape[0]):
    for j in xrange(K.shape[1]):
        print(i,j)
        '''create an r vector '''
        r=(i*distX,j*distY,z)
        for x in xrange(img.shape[0]):
            for y in xrange(img.shape[1]):
                '''create a ksi vector, then calculate
                its norm, and the dot product of r and ksi'''
                ksi=(x*distX,y*distY,z)
                ksiNorm=np.linalg.norm(ksi)
                ksiDotR=float(np.dot(ksi,r))
                '''calculate the integrand'''
                temp[x,y]=img[x,y]*np.exp(1j*k*ksiDotR/ksiNorm)
        '''interpolate so that we can do the integral and take the integral'''
        temp2=rbs(a,b,temp.real)
        K[i,j]=temp2.integral(0,n,0,m)
Since K and img are each about 2000x2000, the innermost statements need to be executed sixteen trillion times. This is simply not practical using Python, but we can shift the work into C and/or Fortran using NumPy to vectorize. I did this one careful step at a time to try to make sure the results will match; here's what I ended up with:
'''create all r vectors'''
R = np.empty((K.shape[0], K.shape[1], 3))
R[:,:,0] = np.repeat(np.arange(K.shape[0]), K.shape[1]).reshape(K.shape) * distX
R[:,:,1] = np.arange(K.shape[1]) * distY
R[:,:,2] = z

'''create all ksi vectors'''
KSI = np.empty((img.shape[0], img.shape[1], 3))
KSI[:,:,0] = np.repeat(np.arange(img.shape[0]), img.shape[1]).reshape(img.shape) * distX
KSI[:,:,1] = np.arange(img.shape[1]) * distY
KSI[:,:,2] = z

# vectorized 2-norm; see http://stackoverflow.com/a/7741976/4323
KSInorm = np.sum(np.abs(KSI)**2, axis=-1)**(1./2)

# loop over entire K, which is same shape as img, rows first
# this loop populates K, one pixel at a time (so can be parallelized)
for i in xrange(K.shape[0]):
    for j in xrange(K.shape[1]):
        print(i, j)
        KSIdotR = np.dot(KSI, R[i,j])
        temp = img * np.exp(1j * k * KSIdotR / KSInorm)
        '''interpolate so that we can do the integral and take the integral'''
        temp2 = rbs(a, b, temp.real)
        K[i,j] = temp2.integral(0, n, 0, m)
The inner pair of loops is now completely gone, replaced by vectorized operations done in advance (at a space cost linear in the size of the inputs).
This reduces the time per iteration of the outer two loops from 340 seconds to 1.3 seconds on my Macbook Air 1.6 GHz i5, without using Numba. Of the 1.3 seconds per iteration, 0.68 seconds are spent in the rbs function, which is scipy.interpolate.RectBivariateSpline. There is probably room to optimize further--here are some ideas:
Reenable Numba. I don't have it on my system. It may not make much difference at this point, but easy for you to test.
Do more domain-specific optimization, such as trying to simplify the fundamental calculations being done. My optimizations are intended to be lossless, and I don't know your problem domain so I can't optimize as deeply as you may be able to.
Try to vectorize the remaining loops. This may be tough unless you are willing to replace the scipy RBS function with something supporting multiple calculations per call.
Get a faster CPU. Mine is pretty slow; you can probably get a speedup of at least 2x simply by using a better computer than my tiny laptop.
Downsample your data. Your test images are 2000x2000 pixels, but contain fairly little detail. If you cut their linear dimensions by 2-10x, you'd get a huge speedup.
So that's it for me for now. Where does this leave you? Assuming a slightly better computer and no further optimization work, even the optimized code would take about a month to process your test images. If you only have to do this once, maybe that's fine. If you need to do it more often, or need to iterate on the code as you try different things, you probably need to keep optimizing--starting with that RBS function which consumes more than half the time now.
Bonus tip: your code would be a lot easier to deal with if it didn't have nearly-identical variable names like k and K, nor used j as a variable name and also as a complex number suffix (0j).
