More efficient array calculation for this Python code

I have written a function which takes an N by N array and computes an output array based on it.
Here is what my code looks like:
import numpy as np

def calculate_output(input, N):
    output = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            val1 = 0 if y-1 < 0 else output[y-1][x] + input[y][x]
            val2 = 0 if x-1 < 0 else output[y][x-1] + input[y][x]
            output[y][x] = max(val1, val2)
    return output

N = 10000
input = np.reshape(np.random.binomial(1, [0.25] * N * N), (N, N))
output = calculate_output(input, N)
However, this computation is not fast enough: it takes about 300 seconds on my machine (compared to 3 seconds when implemented in C++).
Is there any way to improve this without writing a C extension?
I have tried using PyPy, but in this case the code is even slower with PyPy.

CPython is very slow because it is an interpreter, and it clearly cannot compete with C and C++ in such a case. The usual approach to reducing the cost of the interpreter is to avoid loops as much as possible and use a few Numpy vectorized calls instead. However, in this case it is barely possible to write an efficient implementation using Numpy vectorized calls.
On the other hand, PyPy is often much better for numerical code because of its JIT compilation. But its Numpy support is not great, mainly because PyPy relies on an implementation of Numpy rewritten in Python, which is not as good as the native Numpy implementation, and the native implementation would not be efficient because of the way Python extension modules are currently implemented. To put it shortly, AFAIK, the PyPy JIT cannot optimize Numpy accesses with the native implementation. As a result, the JIT can be slower than the CPython interpreter in your case.
However, you can speed the code up a lot using the Numba JIT compiler, which has been written for this exact use case. Moreover, a few optimizations can be applied to speed up the code even more (whatever the programming language used):
conditionals are generally slow; you can move them out of the hot loops by handling the borders in separate loops
writing zeros into the output matrix up front is not required and is actually slower
using direct 2D indexing (output[y, x]) is cleaner and likely a bit faster
integers can be used instead of floating-point numbers, since the output contains only integers and integer operations are faster than the same operations on floating-point numbers
import numpy as np
import numba as nb

@nb.njit(['int32[:,::1](int32[:,::1],int32)', 'int64[:,::1](int64[:,::1],int64)'])
def calculate_output(input, N):
    output = np.empty((N, N), input.dtype)
    for x in range(0, N):
        val2 = 0 if x-1 < 0 else output[0, x-1] + input[0, x]
        output[0, x] = max(0, val2)
    for y in range(1, N):
        val1 = 0 if y-1 < 0 else output[y-1, 0] + input[y, 0]
        output[y, 0] = max(val1, 0)
    for y in range(1, N):
        for x in range(1, N):
            val1 = output[y-1, x] + input[y, x]
            val2 = output[y, x-1] + input[y, x]
            output[y, x] = max(val1, val2)
    return output
The resulting calculate_output call is 730 times faster on my machine.
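For completeness, here is a minimal sketch of how such a comparison can be timed (the harness below is my own addition, not part of the original benchmark; the explicit signatures in the decorator mean compilation happens at decoration time, so the timed call measures only the computation):
import time
import numpy as np

N = 10000
inp = np.random.binomial(1, 0.25, (N, N)).astype(np.int64)

start = time.perf_counter()
out = calculate_output(inp, N)  # the Numba-compiled version defined above
print(f"Numba version: {time.perf_counter() - start:.2f} s")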


Python: how to speed up this function and make it more scalable?

I have the following function, which accepts an indicator matrix of shape (20,000 x 20,000), and I have to run the function 20,000 x 20,000 = 400,000,000 times. Note that the indicator_Matrix has to be passed as a pandas dataframe, as my actual problem's dataframe has a timeIndex and integer columns, but I have simplified this a bit for the sake of understanding the problem.
Pandas Implementation
import numpy as np
import pandas as pd

indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.sum(axis=1)
    d = indicator_Matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]
I tried to improve it using numpy, but it is still taking ages to run. I also tried concurrent.futures.ThreadPoolExecutor, but it still takes a long time, with not much improvement over the list comprehension.
Numpy Implementation
indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, [20000, 20000]))

def operations(indicator_Matrix):
    s = indicator_Matrix.to_numpy().sum(axis=1)
    d = (indicator_Matrix.to_numpy().T / s).T
    d = pd.DataFrame(d, index=indicator_Matrix.index, columns=indicator_Matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

output = [operations(indicator_Matrix) for i in range(0, 20000**2)]
Note that the reason I convert d to a dataframe again is that I need to obtain the column means and retain only the last column mean using .iloc[-1]. d[d>0].mean(axis=0) returns the column means, i.e.
2478 1.0
0 1.0
Update: I am still stuck on this problem. I wonder whether using GPU packages like cuDF and CuPy on my local desktop would make any difference.
Assuming the answer of @CrazyChucky is correct, one can implement a faster parallel Numba version. The idea is to use plain loops and to read the data contiguously. Reading data contiguously is important to make the computation cache-friendly/memory-efficient. Here is an implementation:
import numpy as np
import numba as nb

@nb.njit(['(int_[:,:],)', '(int_[:,::1],)', '(int_[::1,:],)'], parallel=True)
def compute_fastest(matrix):
    n, m = matrix.shape
    sum_by_row = np.zeros(n, matrix.dtype)
    is_row_major = matrix.strides[0] >= matrix.strides[1]
    if is_row_major:
        for i in nb.prange(n):
            s = 0
            for j in range(m):
                s += matrix[i, j]
            sum_by_row[i] = s
    else:
        for chunk_id in nb.prange(0, (n+63)//64):
            start = chunk_id * 64
            end = min(start+64, n)
            for j in range(m):
                for i2 in range(start, end):
                    sum_by_row[i2] += matrix[i2, j]
    count = 0
    s = 0.0
    for i in range(n):
        value = matrix[i, -1] / sum_by_row[i]
        if value > 0:
            s += value
            count += 1
    return s / count

# output = [compute_fastest(indicator_Matrix.to_numpy()) for i in range(0, 20000**2)]
Pandas dataframes can contain both row-major and column-major arrays. Depending on the memory layout, it is better to iterate over the rows or over the columns; this is why there are two implementations of the sum based on is_row_major. There are also three Numba signatures: one for row-major contiguous arrays, one for column-major contiguous arrays and one for non-contiguous arrays. Numba will compile the three function variants and automatically pick the best one at runtime. The Numba JIT compiler can generate a faster implementation (e.g. using SIMD instructions) when the input 2D array is known to be contiguous.
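As an illustration (my own sketch, not part of the original answer), the memory layout of the array extracted from a dataframe can be inspected through its flags, and a row-major copy can be forced if desired:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 2, (1000, 1000)))
arr = df.to_numpy()

print(arr.flags['C_CONTIGUOUS'])  # True for a row-major (C-ordered) array
print(arr.flags['F_CONTIGUOUS'])  # True for a column-major (Fortran-ordered) array

# Forcing a contiguous row-major copy lets Numba pick the 'int_[:,::1]' specialization
arr_c = np.ascontiguousarray(arr)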
Experimental Results
This computation is about 14.5 times faster than operations_simpler on my i5-9600KF processor (6 cores). It still takes a lot of time, but the computation is memory-bound and nearly optimal on my machine: it is limited by the main memory, which has to be read:
On a 2000x2000 dataframe with 32-bit integers:
- operations: 86.310 ms/iter
- operations_simpler: 5.450 ms/iter
- compute_fastest: 0.375 ms/iter
- optimal: 0.345-0.370 ms/iter
If you want faster code, then you need to use more compact data types. For example, a uint8 data type is large enough to contain the values 0 and 1, and it is 4 times smaller in memory on Windows. This means the code can be up to 4 times faster in this case. The smaller the data type, the faster the program. One could even try to pack 8 columns into 1 using bit tweaks, though this is generally significantly slower with Numba unless you have a lot of available cores.
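A minimal sketch of such a dtype reduction (my own illustration; note that the Numba signatures above would need an extra uint8 variant, or could be left unspecified for lazy compilation, to accept the smaller type):
import numpy as np
import pandas as pd

indicator_Matrix = pd.DataFrame(np.random.randint(0, 2, (2000, 2000)))

# 0/1 indicators fit in an unsigned 8-bit integer: 4-8x less memory to read
compact = indicator_Matrix.to_numpy().astype(np.uint8)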
Notes & Discussion
The above code works only with uniformly-typed columns. If this is not the case, you can split the dataframe into multiple column groups and convert each group to a Numpy array, then call the Numba function (modified to support groups) on each. Note that @CrazyChucky's code has a similar issue: a dataframe with mixed column datatypes converted to a Numpy array results in an object-based Numpy array, which is very inefficient (especially as a row-major Numpy array).
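A rough sketch of such a column-group split (my own illustration, using only basic pandas operations):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(5),
                   'b': np.random.randint(0, 2, 5),
                   'c': np.random.rand(5)})

# Group columns by dtype so each group converts to a homogeneous (non-object) array
groups = {}
for dtype in df.dtypes.unique():
    cols = df.dtypes[df.dtypes == dtype].index
    groups[str(dtype)] = df[cols].to_numpy()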
Note that using a GPU will not make the computation faster unless the input dataframe is already stored in the GPU memory. Indeed, CPU-GPU data transfers are more expensive than just reading the RAM (due to the interconnect overhead, which is generally a rather slow PCIe link). Note also that GPU memory is quite limited compared to CPU memory. If the target dataframe(s) do not need to be transferred, then using cudf is relatively simple and should give a small speed-up. For faster code, one needs to write a fast CUDA kernel, but this is clearly far from easy for dataframes with mixed datatypes. In the end, the resulting speed-up should be main_ram_throughput / gpu_ram_throughput, assuming there is no data transfer. This factor is generally 5-12. CUDA and cudf also require an Nvidia GPU.
Finally, reducing the input data size or just the amount of computation is certainly the best solution (as indicated in the comment by @zvone), since the problem is very computationally intensive.
You're doing some extra math you don't have to. In plain English, what you're doing is:
Summing each row
Turning the list of sums "sideways" and dividing each column by it
Taking the mean of each column, ignoring values ≤ 0
Returning only the rightmost mean
After step one, you no longer need anything but the rightmost column; you can ignore the other columns, only dividing and averaging the one whose result you care about. Changing your code accordingly:
def operations_simpler(indicator_matrix):
    sums = indicator_matrix.sum(axis=1)
    last_column = indicator_matrix.iloc[:, -1]
    divided = last_column / sums
    return divided[divided > 0].mean()
...yields the same result, and takes about a hundredth of the time. Extrapolating from shorter test runs, this cuts the time for 400,000,000 runs on my machine from about 114 years down to... about 324 days. Still not great. So far I've not managed to get it to run any faster by converting to NumPy, compiling with Numba, or employing multiprocessing, but I'll go ahead and post this for now in case it's helpful.
Note: You're unlikely to see any improvements with compute-heavy work like this from threading; if anything, you'd want to use multiprocessing. concurrent.futures offers executors for both. Threads are mostly useful to avoid waiting around for I/O.
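For what it's worth, here is a rough sketch of the multiprocessing route via concurrent.futures (my own addition; frames stands in for whatever collection of independent inputs the real problem iterates over):
from concurrent.futures import ProcessPoolExecutor

def run_all(frames):
    # Each dataframe is processed in a separate worker process
    with ProcessPoolExecutor() as pool:
        return list(pool.map(operations_simpler, frames))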
As per the previous answer, you can use Numba, or you can use two other alternatives. One is Dask, a distributed computing package that can parallelize your function's execution: it divides your data into smaller chunks and distributes the computation across many CPU cores or even numerous machines.
import dask.array as da

def operations(indicator_matrix):
    s = indicator_matrix.sum(axis=1)
    d = indicator_matrix.div(s, axis=0)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_dask = da.from_array(indicator_matrix, chunks=(1000, 1000))
output_dask = indicator_matrix_dask.map_blocks(operations, dtype=float)
output = output_dask.compute()
Another option is CuPy, which uses the GPU to speed up your function's execution:
import cupy as cp
import pandas as pd

def operations(indicator_matrix):
    s = cp.sum(indicator_matrix, axis=1)
    d = cp.divide(indicator_matrix.T, s).T
    d = pd.DataFrame(d, index=indicator_matrix.index, columns=indicator_matrix.columns)
    res = d[d > 0].mean(axis=0)
    return res.iloc[-1]

indicator_matrix_cupy = cp.asarray(indicator_matrix)
output_cupy = operations(indicator_matrix_cupy)
output = cp.asnumpy(output_cupy)

When is numba effective?

I know numba creates some overhead, and in some situations (non-intensive computation) it becomes slower than pure Python. But what I don't know is where to draw the line. Is it possible to use the order of algorithm complexity to figure out where?
For example, for adding two arrays (~O(n)) shorter than 5 elements, pure Python is faster in this code:
import numpy as np
import numba

def sum_1(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@numba.jit('float64[:](float64[:],float64[:])')
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

# try 100
a = np.linspace(1.0, 2.0, 5)
b = np.linspace(1.0, 2.0, 5)
print("pure python: ")
%timeit -o sum_1(a, b)
print("\n\n\n\npython + numba: ")
%timeit -o sum_2(a, b)
UPDATE: what I am looking for is a guideline similar to this one:
"A general guideline is to choose different targets for different data sizes and algorithms. The “cpu” target works well for small data sizes (approx. less than 1KB) and low compute intensity algorithms. It has the least amount of overhead. The “parallel” target works well for medium data sizes (approx. less than 1MB). Threading adds a small delay. The “cuda” target works well for big data sizes (approx. greater than 1MB) and high compute intensity algorithms. Transfering memory to and from the GPU adds significant overhead."
It's hard to draw the line where numba becomes effective. However, there are a few indicators that it might not be effective:
If you cannot use jit with nopython=True - whenever you cannot compile it in nopython mode, you are either trying to compile too much or it won't be significantly faster.
If you don't use arrays - when you deal with lists or other types that you pass to the numba function (except from other numba functions), numba needs to copy them, which incurs a significant overhead.
If there is already a NumPy or SciPy function that does it - even if numba can be significantly faster for short arrays, it will almost always only be about as fast for longer arrays (and you might easily overlook some common edge cases that these functions would handle).
There's also another reason why you might not want to use numba in cases where it's just "a bit" faster than other solutions: Numba functions have to be compiled, either ahead of time or when first called; in some situations the compilation will cost much more than your gain, even if you call the function hundreds of times. The compilation times also add up: numba is slow to import, and compiling the numba functions adds some overhead. It doesn't make sense to shave off a few milliseconds if the import overhead increases by 1-10 seconds.
Also, numba is complicated to install (without conda at least), so if you want to share your code you have a really "heavy" dependency.
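One mitigation worth knowing for the compile-time overhead (my own note, not part of the original answer) is on-disk caching of the compiled machine code, which amortizes compilation across interpreter sessions:
import numba as nb

@nb.njit(cache=True)  # compiled code is cached on disk and reused on the next run
def sum_cached(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += i + j
    return result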
Your example is lacking a comparison with NumPy methods and with a highly optimized version of pure Python. I added some comparison functions and did a benchmark (using my library simple_benchmark):
import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 17)
}
b = benchmark([python_loop, numba_loop, numpy_methods, python_sum], arguments, warmups=[numba_loop])

%matplotlib notebook
b.plot()
Yes, the numba function is fastest for small arrays; however, the NumPy solution will be slightly faster for longer arrays. The pure Python solutions are slower, but the "faster" Python alternative is already significantly faster than your original solution.
In this case I would simply use the NumPy solution because it's short, readable and fast, except when you're dealing with lots of short arrays and calling the function many times - then the numba solution would be significantly better.
If you do not know exactly what the consequences of explicit input and output declarations are, let Numba decide. With your input you may want to use 'float64(float64[::1],float64[::1])' (scalar output, contiguous input arrays). If you call the explicitly declared function with strided inputs, it will fail; if you let Numba do the job, it will simply recompile.
Without fastmath=True it is also not possible to use SIMD here, because vectorizing the reduction changes the precision of the result.
Calculating at least 4 partial sums (a 256-bit vector) and then summing those partial sums is preferable here (NumPy also doesn't calculate a naive sum).
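To make that last point concrete, here is a sketch of the manual partial-sum idea (my own illustration; with fastmath=True, LLVM is allowed to do this reordering automatically):
import numba as nb

@nb.njit
def sum_partial(a, b):
    size = min(a.shape[0], b.shape[0])
    s0 = s1 = s2 = s3 = 0.0
    # Four independent accumulators mimic what a 256-bit SIMD sum does
    for i in range(0, size - 3, 4):
        s0 += a[i]     + b[i]
        s1 += a[i + 1] + b[i + 1]
        s2 += a[i + 2] + b[i + 2]
        s3 += a[i + 3] + b[i + 3]
    # Remainder elements not covered by the blocks of four
    for j in range(size - size % 4, size):
        s0 += a[j] + b[j]
    return s0 + s1 + s2 + s3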
Example using MSeifert's excellent benchmark utility:
import numpy as np
import numba as nb
from itertools import chain

def python_loop(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit
def numba_loop_zip(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

# Your version with suboptimal input and output declaration (prevents njit compilation)
@nb.jit('float64[:](float64[:],float64[:])')
def numba_your_func(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_zip_fastmath(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result

@nb.njit(fastmath=True)
def numba_loop_fastmath_single(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in range(size):
        result += a[i] + b[i]
    return result

@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_multi(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    for i in nb.prange(size):
        result += a[i] + b[i]
    return result

# just for fun... single-threaded for small arrays,
# multithreaded for larger arrays
@nb.njit(fastmath=True, parallel=True)
def numba_loop_fastmath_combined(a, b):
    result = 0.0
    size = min(a.shape[0], b.shape[0])
    if size > 2*10**4:
        result = numba_loop_fastmath_multi(a, b)
    else:
        result = numba_loop_fastmath_single(a, b)
    return result

def numpy_methods(a, b):
    return a.sum() + b.sum()

def python_sum(a, b):
    return sum(chain(a.tolist(), b.tolist()))

from simple_benchmark import benchmark, MultiArgument

arguments = {
    2**i: MultiArgument([np.zeros(2**i), np.zeros(2**i)])
    for i in range(2, 19)
}
b = benchmark([python_loop, numba_loop_zip, numpy_methods, numba_your_func, python_sum,
               numba_loop_zip_fastmath, numba_loop_fastmath_single,
               numba_loop_fastmath_multi, numba_loop_fastmath_combined],
              arguments,
              warmups=[numba_loop_zip, numba_loop_zip_fastmath, numba_your_func,
                       numba_loop_fastmath_single, numba_loop_fastmath_multi,
                       numba_loop_fastmath_combined])

%matplotlib notebook
b.plot()
Please note that using numba_loop_fastmath_multi or numba_loop_fastmath_combined(a,b) is recommended only in some special cases. More often, such a simple function is just one part of a larger problem that can be parallelized more efficiently (starting threads has some overhead).
Running this code leads to a ~6x speedup on my machine:
# autojit (lazy compilation) comes from old Numba versions; in current Numba,
# plain numba.jit compiles lazily in the same way.
@numba.autojit
def sum_2(a, b):
    result = 0.0
    for i, j in zip(a, b):
        result += (i + j)
    return result
Python: 3.31 µs, numba: 589 ns.
As for your question, I really think this is not related to complexity; it will probably depend mostly on the kind of operations you are doing. On the other hand, you can still plot a python/numba comparison to see where the shift happens for a given function.
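A minimal sketch of such a crossover measurement (my own addition; sum_1 and sum_2 are the two variants from the question, and the sizes are arbitrary):
import timeit
import numpy as np

sizes = [2**k for k in range(2, 18)]
ratios = []
for n in sizes:
    a = np.linspace(1.0, 2.0, n)
    b = np.linspace(1.0, 2.0, n)
    t_py = timeit.timeit(lambda: sum_1(a, b), number=100)
    t_nb = timeit.timeit(lambda: sum_2(a, b), number=100)
    ratios.append(t_py / t_nb)

# A ratio above 1 means the numba version wins at that size;
# the crossover is roughly where the ratio passes 1.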

Faster alternative to numpy for manual element-wise operations on large arrays?

I have some code that was originally written in C (by someone else) using C-style malloc arrays. I later converted a lot of it to C++ style, using vector<vector<vector<complex>>> arrays for consistency with the rest of my project. I never timed it, but both methods seemed to be of similar speed.
I recently started a new project in Python, and I wanted to use some of this old code. Not wanting to move data back and forth between projects, I decided to port this old code into Python so that it's all in one project. I naively typed up all of the code in Python syntax, replacing any arrays in the old code with numpy arrays (initialising them like this: array = np.zeros(list((1024, 1024)), dtype=complex)). The code works fine, but it is excruciatingly slow. If I had to guess, I would say it's on the order of 1000 times slower.
Now having looked into it, I see that a lot of people say numpy is very slow for element-wise operations. While I have used some of the numpy functions for common mathematical operations, such as FFTs and matrix multiplication, most of my code involves nested for loops. A lot of it is pretty complicated and doesn't seem to me to be amenable to reducing to simple array operations that are faster in numpy.
So, I'm wondering if there is an alternative to numpy that is faster for these kind of calculations. The ideal scenario would be that there is a module that I can import that has a lot of the same functionality, so I don't have to rewrite much of my code (i.e., something that can do FFTs and initialises arrays in the same way, etc.), but failing that, I would be happy with something that I could at least use for the more computationally demanding parts of the code and cast back and forth between the numpy arrays as needed.
cpython arrays sounded promising, but a lot of benchmarks I've seen don't show enough of a difference in speed for my purposes. To give an idea of the kind of thing I'm talking about, this is one of the methods that is slowing down my code. This is called millions of times, and the vz_at() method contains a lookup table and does some interpolation to give the final return value:
def tra(self, tr, x, y, z_number, i, scalex, idx, rmax2, rminsq):
    M = 1024
    ixo = int(x[i] / scalex)
    iyo = int(y[i] / scalex)
    nx1 = ixo - idx
    nx2 = ixo + idx
    ny1 = iyo - idx
    ny2 = iyo + idx
    for ix in range(nx1, nx2 + 1):
        rx2 = x[i] - float(ix) * scalex
        rx2 = rx2 * rx2
        ixw = ix
        while ixw < 0:
            ixw = ixw + M
        ixw = ixw % M
        for iy in range(ny1, ny2 + 1):
            rsq = y[i] - float(iy) * scalex
            rsq = rx2 + rsq * rsq
            if rsq <= rmax2:
                iyw = iy
                while iyw < 0:
                    iyw = iyw + M
                iyw = iyw % M
                if rsq < rminsq:
                    rsq = rminsq
                vz = P.vz_at(z_number[i], rsq)
                tr[ixw, iyw] += vz
All up, there are a couple of thousand lines of code; this is just a small snippet to give an example. To be clear, a lot of my arrays are 1024x1024x1024 or 1024x1024 and complex-valued. Others are one-dimensional arrays on the order of a million elements. What's the best way to speed these element-wise operations up?
For information, some of your code can be made more concise and thus a bit more readable. For instance:
array = np.zeros(list((1024, 1024)), dtype=complex)
can be written
array = np.zeros((1024, 1024), dtype=complex)
As you are trying out Python, this is at least a nice benefit :-)
Now, for your problem there are several solutions in the current Python scientific landscape:
Numba is a just-in-time compiler for Python that is dedicated to array processing, achieving good performance when NumPy hits its limits (a short sketch follows after this list).
Pros: Little to no modification of your code as you just write plain Python, shows good performance in many situations. Numba should recognize some NumPy operations to avoid a Numba->Python->NumPy slowdown.
Cons: Can be tedious to install and hence to distribute Numba-based code.
Cython is a mix of Python and C to generate compiled functions. You can start from a pure Python file and accelerate the code via type annotations and the use of some "C"-isms.
Pros: stable, widely used, relatively easy to distribute Cython-based code.
Cons: need to rewrite the performance critical code, even if only in part.
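To make the Numba option concrete, here is a minimal sketch in the spirit of the tra method from the question (my own illustration: the function name and the vz_placeholder lookup stand-in are hypothetical, since the real interpolation/lookup code would also need to be Numba-compatible):
import numpy as np
import numba as nb

@nb.njit
def vz_placeholder(z, rsq):
    # Stand-in for the real interpolation/lookup; it must itself be njit-compatible
    return z / rsq

@nb.njit
def tra_numba(tr, x, y, z_number, i, scalex, idx, rmax2, rminsq):
    M = tr.shape[0]
    ixo = int(x[i] / scalex)
    iyo = int(y[i] / scalex)
    for ix in range(ixo - idx, ixo + idx + 1):
        rx2 = x[i] - ix * scalex
        rx2 *= rx2
        ixw = ix % M            # Python's % already wraps negatives into [0, M)
        for iy in range(iyo - idx, iyo + idx + 1):
            rsq = y[i] - iy * scalex
            rsq = rx2 + rsq * rsq
            if rsq <= rmax2:
                iyw = iy % M
                vz = vz_placeholder(z_number[i], max(rsq, rminsq))
                tr[ixw, iyw] += vz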
As an additional hint, Nicolas Rougier (a French scientist) wrote an online book on many situations where you can make use of NumPy to speed up Python code: http://www.labri.fr/perso/nrougier/from-python-to-numpy/

Speeding up dynamic programming in python/numpy

I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
from numpy import array, inf

def optimalcost(cost, P1=10):
    width1, width2 = cost.shape
    M = array(cost)
    for i in range(0, width1):
        for j in range(0, width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1], M[i-1,j]+P1, M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts that cut the time down. However, as I potentially need to evaluate the entire matrix, I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10
void optimalcost(double** costin, double** costout){
    /*
    We assume that costout is initially
    filled with costin's values.
    */
    int i, j;
    float a, b, c, prevcost = 0.0f;
    for(i = 0; i < 400; i++){
        for(j = 0; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += min(prevcost, min(b, c));
            prevcost = costout[i][j];
        }
    }
    return;
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np
#autojit
def cost(left, right):
height,width = left.shape
cost = np.zeros((height,width,width))
for row in range(height):
for x in range(width):
for y in range(width):
cost[row,x,y] = abs(left[row,x]-right[row,y])
return cost
#autojit
def optimalcosts(initcost):
costs = zeros_like(initcost)
for row in range(height):
costs[row,:,:] = optimalcost(initcost[row])
return costs
#autojit
def optimalcost(cost):
width1,width2 = cost.shape
P1=10
prevcost = 0.0
M = np.array(cost)
for i in range(1,width1):
for j in range(1,width2):
M[i,j] += min(M[i-1,j-1],prevcost+P1,M[i,j-1]+P1)
prevcost = M[i,j]
return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. Note for anyone interested in writing Numba code, you need to explicitly express loops in your code. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod, np.linalg.*, etc.). But for simple tasks like finding the shortest path (or lowest-energy path) above, you can vectorize the problem by thinking about what can be computed at the same time (and also try to avoid making copies):
Suppose we are finding a shortest path in the "row" direction (i.e. horizontally), we can first create our algorithm input:
import numpy as np

# The problem: 300 400*400 matrices
# Create an infinitely high boundary so that we don't need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and time it:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78s, which is already way faster than 3 minutes. I believe you can improve this even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or, you can simply use multiprocessing.Pool. It is easy to use, and this problem is trivial to split into smaller problems (by dividing on the batch axis).
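A rough sketch of the multiprocessing.Pool route (my own addition), splitting on the batch axis and assuming optimalcost is the per-matrix function from the question:
import numpy as np
from multiprocessing import Pool

def main():
    costs = np.random.rand(300, 400, 400)  # the 300 cost matrices
    with Pool() as pool:
        # Each worker handles one 400x400 matrix; results come back in order
        results = pool.map(optimalcost, costs)
    return np.stack(results)

if __name__ == '__main__':
    main()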

Numpy Slicing slow?

Hi, I am doing scientific computing using numpy + numba.
I've realized that in-place numpy array addition is very slow... compared to MATLAB.
here is the matlab code:
tic;
% A,B are 2-d matrices, ind may not be distinct
for ii = 1:N
    A(ind(ii),:) = A(ind(ii),:) + B(ii,:);
end
toc;
and here is the numpy code:
s = time.time()
# A,B are numpy.ndarray, ind may not be distinct
for k in xrange(N):
    A[ind[k],:] += B[k,:]
print time.time() - s
The result shows that numpy code is 10x slower than matlab... which confuses me a lot.
Moreover, when I pull the addition out of the for loop and just compare a single matrix addition with numpy.add, numpy and matlab seem to be comparable in speed.
One factor I know is that matlab uses a JIT for versions >= 2012a to speed up for loops, but when I tried numba on the python code, it did not speed it up even a bit. I think this is because numba does not touch the numpy.add function at all, hence the performance does not change.
I am guessing that matlab does some sick caching for this case, hence it beats numpy dramatically.
Any suggestion on how to speed up numpy ?
Try
A[ind] += B[:N]
i.e. without any loop.
If ind could have duplicate elements, you can use np.add.at:
np.add.at(A, ind, B[:N])
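A tiny demonstration of the difference when ind contains duplicates (my own example):
import numpy as np

A = np.zeros(3)
B = np.ones(4)
ind = np.array([0, 0, 1, 2])

A1 = A.copy()
A1[ind] += B           # buffered: the duplicated index 0 only receives one update
print(A1)              # [1. 1. 1.]

A2 = A.copy()
np.add.at(A2, ind, B)  # unbuffered: both updates to index 0 are applied
print(A2)              # [2. 1. 1.]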
Here's a version that uses dot matrix multiplication. It constructs a matrix of 1s and 0s from ind.
import numpy as np

def bar(A, B, ind):
    K, M = B.shape
    N, M = A.shape
    I = np.zeros((N, K))
    I[ind, np.arange(K)] = 1
    return A + np.dot(I, B)
For a problem with sizes like K,M,N = 30,14,15 this is about 3x faster. But for larger ones like K,M,N = 300,100,150 it's a bit slower.
