NumPy vs Cython - why is a nested loop so slow?

I am confused about why a nested loop over a 3D NumPy array is so slow in comparison with Cython.
I wrote a trivial example.
Python/NumPy version:
import numpy as np

def my_func(a, b, c):
    s = 0
    for z in xrange(401):
        for y in xrange(401):
            for x in xrange(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s += 1
    return s

a = np.zeros((401,401,401), dtype=np.float32)
b = np.zeros((401,401,401), dtype=np.uint32)
c = np.zeros((401,401,401), dtype=np.uint8)
s = my_func(a, b, c)
Cythonized version:
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def my_func(np.float32_t[:,:,::1] a, np.uint32_t[:,:,::1] b, np.uint8_t[:,:,::1] c):
    cdef np.uint16_t z, y, x
    cdef np.uint32_t s = 0
    for z in range(401):
        for y in range(401):
            for x in range(401):
                if a[z,y,x] == 0 and b[x,y,z] >= 0:
                    c[z,y,x] = 1
                    b[z,y,x] = z*y*x
                    s = s + 1
    return s
The Cythonized version of my_func() runs approx. 6500x faster. A simpler function with only an if-statement and array access can be even 10000x faster. The Python version of my_func() takes 500.651 sec. to finish. Is iterating over a relatively small 3D array really this slow, or did I make some mistake in the code?
Cython version 0.21.1, Python 2.7.5, GCC 4.8.1, Xubuntu 13.10.

Python is an interpreted language. One of the benefits of compiling to machine code is the huge speedup you get, especially with things like nested loops.
I don't know what your expectations are, but all interpreted languages will be terribly slow at the things you are trying to do (JIT compiling may help to some extent though).
The trick to getting good performance out of NumPy (or MATLAB or anything similar) is to avoid looping altogether and instead refactor your code into a few operations on large arrays. That way the looping takes place inside heavily optimized machine-code libraries instead of in your Python code.
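As a minimal illustration of the principle (my addition, not part of the original answer), a million-iteration Python loop collapses into a single call that loops in compiled code:

import numpy as np

a = np.random.rand(10**6)
b = np.random.rand(10**6)

# slow: one interpreter round-trip per element
s = 0.0
for i in xrange(a.shape[0]):
    s += a[i] * b[i]

# fast: the loop runs inside NumPy's compiled code
s_fast = np.dot(a, b)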

As Krumelur mentioned, Python loops are definitely slow. You can, however, use NumPy to your advantage: operations on entire arrays are quite fast, although you sometimes need a little ingenuity.
For instance, in your code, since the loop never reads a value in b after modifying it (I think? You'll definitely want to go through this and check), the following should be equivalent:
# Precalculate an array holding z*y*x at each index
zz, yy, xx = np.indices(a.shape)
prod = zz * yy * xx
# Use array-wide logical operations to compute the condition from a and the transpose of b
condition = np.logical_and(a == 0, b.T >= 0)
# Alter b and c only where the condition is true
b[condition] = prod[condition]
c[condition] = 1
s = condition.sum()
This does calculate z*y*x even where the condition is false. You could probably avoid that if it turns out to be taking a lot of time, but it's likely not a significant factor.
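If profiling showed the full product array did matter, a minimal sketch (my addition, reusing the condition array from above) would compute the products only at the indices where the condition holds:

zi, yi, xi = np.nonzero(condition)  # index arrays of the cells where condition is True
b[zi, yi, xi] = zi * yi * xi        # z*y*x computed only for those cells
c[condition] = 1
s = len(zi)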

A for loop over a NumPy array in Python is slow, so you should use vectorized calculations where possible. If the algorithm really needs a loop over every element of the array, here are some hints for speeding it up.
a[z,y,x] is a NumPy scalar value, and calculations with NumPy scalar values are very slow:
x = 3.0
%timeit x > 0
x = np.float64(3.0)
%timeit x > 0
The output on my PC, with NumPy 1.8.2 on Windows 7:
10000000 loops, best of 3: 64.3 ns per loop
1000000 loops, best of 3: 657 ns per loop
You can use the item() method to get the Python value directly:
if a.item(z, y, x) == 0 and b.item(x, y, z) >= 0:
    ...
This speeds up the for loop by about 8x.
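A minimal sketch of that comparison (my addition; timings will vary by machine, and a is the float32 array from the question):

import numpy as np
a = np.zeros((401, 401, 401), dtype=np.float32)

%timeit a[200, 200, 200] == 0       # comparison on a NumPy scalar, slow
%timeit a.item(200, 200, 200) == 0  # comparison on a plain Python float, fast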

Related

More efficient array calculation for this Python code

I have written a function which takes an N by N array and computes an output array based on it.
Here's what my code looks like:
import numpy as np

def calculate_output(input, N):
    output = np.zeros((N, N))
    for y in range(N):
        for x in range(N):
            val1 = 0 if y-1 < 0 else output[y-1][x] + input[y][x]
            val2 = 0 if x-1 < 0 else output[y][x-1] + input[y][x]
            output[y][x] = max(val1, val2)
    return output

N = 10000
input = np.reshape(np.random.binomial(1, [0.25] * N * N), (N, N))
output = calculate_output(input, N)
However, this computation is not fast enough: it takes about 300 seconds on my machine (compared to 3 seconds when implemented in C++).
Is there any way to improve this without writing a C extension?
I have tried using PyPy, but in this case the code is even slower.
CPython is very slow because it is an interpreter, and it clearly cannot compete with C and C++ in a case like this. The usual approach to reducing the interpreter's cost is to avoid loops as much as possible and use a few NumPy vectorized calls instead. In this case, however, it is barely possible to write an efficient implementation using NumPy vectorized calls.
On the other hand, PyPy is often much better for numerical code because of its JIT compilation. But its NumPy support is not great, mainly because PyPy relies on a reimplementation of NumPy written in Python, which is not as good as the native implementation, and the native implementation cannot run efficiently under PyPy because of the way Python C-extension modules are currently implemented. To put it shortly, AFAIK the PyPy JIT cannot optimize accesses into native NumPy arrays. As a result, the JIT can be slower than the CPython interpreter in your case.
You can, however, speed the code up a lot using the Numba JIT compiler, which was written for exactly this use case. Moreover, a few optimizations can be applied to make it even faster (whatever the programming language used):
- Conditionals are generally slow; you can move them out of the hot loop by handling the borders in separate loops.
- Writing zeros into the output matrix up front is not required and is actually slower (np.empty is enough).
- Direct 2D indexing (output[y,x]) is cleaner and likely a bit faster than chained indexing (output[y][x]).
- Integers can be used instead of floating-point numbers, since the output contains only integers and integer arithmetic is faster than the equivalent floating-point operations.
import numpy as np
import numba as nb

@nb.njit(['int32[:,::1](int32[:,::1],int32)', 'int64[:,::1](int64[:,::1],int64)'])
def calculate_output(input, N):
    output = np.empty((N, N), input.dtype)
    # first row: only the horizontal neighbour exists
    for x in range(0, N):
        val2 = 0 if x-1 < 0 else output[0, x-1] + input[0, x]
        output[0, x] = max(0, val2)
    # first column: only the vertical neighbour exists
    for y in range(1, N):
        val1 = output[y-1, 0] + input[y, 0]
        output[y, 0] = max(val1, 0)
    # interior: no conditionals needed in the hot loop
    for y in range(1, N):
        for x in range(1, N):
            val1 = output[y-1, x] + input[y, x]
            val2 = output[y, x-1] + input[y, x]
            output[y, x] = max(val1, val2)
    return output
The resulting calculate_output call is 730 times faster on my machine.
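As a usage sketch (my addition; the astype(np.int64) call just ensures the input matches one of the signatures given to nb.njit above):

import numpy as np

N = 10000
inp = np.reshape(np.random.binomial(1, [0.25] * N * N), (N, N)).astype(np.int64)
out = calculate_output(inp, N)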

Possible to use numba.guvectorize to emulate parallel forall / prange?

As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallel loop calculations on several cores.
As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".
Playing with the newly open-sourced @guvectorize decorator, I found a way to use it for virtually no-overhead emulation of the functionality of the late prange.
I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and I did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who have been using NumbaPro earlier, I'm posting this for all those fellow non-coders out there (see my answer to this "question").
Consider the example below, where a two-level nested for loop with a core doing some numerical calculation involving two input arrays and a function of the loop indices is executed in four different ways. Each variant is timed with IPython's %timeit magic:
1. naive for loop, compiled using numba.jit
2. forall-like construct using numba.guvectorize, executed in a single thread (target = "cpu")
3. forall-like construct using numba.guvectorize, executed in as many threads as there are CPU "cores" (in my case hyperthreads) (target = "parallel")
4. same as 3., but calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted
The last variant is included because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know exactly how the dispatching of gufunc calls is organized inside NumPy, but it appears that randomizing the "parallel" loop indices achieves slightly better load balancing.
On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:
1 loop, best of 3: 8.19 s per loop
1 loop, best of 3: 8.27 s per loop
1 loop, best of 3: 4.6 s per loop
1 loop, best of 3: 3.46 s per loop
Note: I'd be interested if this recipe readily applies to target="gpu" (it should do, but I don't have access to a suitable graphics card right now), and what's the speedup. Please post!
And here's the example:
import numpy as np
from numba import jit, guvectorize, float64, int64

@jit
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]  # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i))
arg_size = 20000
input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)

# do single-threaded naive nested for loop
# reset result_array inside the %timeit call
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()

# do single-threaded forall loop (loop indices in order)
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()

# do multi-threaded forall loop (loop indices in order)
loop_indices = range(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()

# do multi-threaded forall loop (loop indices scrambled for better load balancing)
loop_indices_scrambled = np.random.permutation(range(arg_size))
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()

# check validity
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))

How do I optimize this python code using cython?

I have some python code, and I'm wondering what I can do to optimize the speed for creating the array using Cython. Note that I have tried other methods: Counting Algorithm Performance Optimization in Pypy vs Python (Numpy vs List)
It seems like Cython is significantly faster than anything I've tried before right out of the box. I am wondering if I can get even more performance.
#!/usr/bin/env python

def create_array(size=4):
    """
    Creates a multi-dimensional array from size
    """
    array = [(x, y, z)
             for x in xrange(size)
             for y in xrange(size)
             for z in xrange(size)]
    return array
Thanks in advance!
I won't help with the Cython code, but I believe this operation can still be done efficiently in NumPy; you just haven't looked deeply enough yet.
import numpy as np

def variations_with_repetition(alphabetlen):
    """Return an array of all possible length-3 tuples with elements
    chosen from range(alphabetlen)."""
    a = np.arange(alphabetlen)
    z = np.vstack((
        np.repeat(a, alphabetlen**2),
        np.tile(np.repeat(a, alphabetlen), alphabetlen),
        np.tile(a, alphabetlen**2))).T
    return z
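A quick sanity check (my addition, not part of the original answer) that the rows come out in the same lexicographic order as the nested comprehension produces:

import numpy as np
from itertools import product

z = variations_with_repetition(4)
expected = np.array(list(product(range(4), repeat=3)))
assert np.array_equal(z, expected)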
Now, absolute execution speed is somewhat meaningless here, since you only mention wanting it below 2 ms for alphabetlen=32, and that depends on your CPU. But I can compare your proposed method to this one:
In [4]: %timeit array = [(x, y, z) for x in xrange(size) for y in xrange(size) for z in xrange(size)]
100 loops, best of 3: 3.3 ms per loop
In [5]: %timeit variations_with_repetition(32)
1000 loops, best of 3: 348 µs per loop
That's well below your desired 2 ms. But once again, your mileage may vary depending on the CPU.

Using pointers to numpy array data attribute

I'm trying to solve the bottleneck in my application, which is an elementwise sum of two matrices.
I'm using NumPy and Cython. I have a cdef class with a matrix attribute. Since Cython still doesn't support buffer arrays in class attributes, I followed this and tried to use a pointer to the data attribute of the matrix. The thing is, I'm sure I'm doing something wrong, as the results indicate.
What I tried to do is more or less the following:
cdef class the_class:
    cdef np.ndarray the_matrix
    cdef float_t* the_matrix_p

    def __init__(self):
        the_matrix_p = <float_t*> self.the_matrix.data

    cpdef the_function(self):
        other_matrix = self.get_other_matrix()
        the_matrix_p += other_matrix.data
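For reference, a hedged sketch (my addition, not from the original question) of how this pattern is usually written when it works: the cast must be assigned to self.the_matrix_p rather than a local name, the array's dtype must match the pointer type (float32 assumed here), and the object must keep a reference to the array so the buffer stays alive:

cimport numpy as np
import numpy as np

cdef class the_class:
    cdef np.ndarray the_matrix
    cdef np.float32_t* the_matrix_p

    def __init__(self):
        self.the_matrix = np.zeros((100, 100), dtype=np.float32)
        # assign to the attribute, not a local variable; the_matrix
        # keeps the buffer alive for as long as the pointer is used
        self.the_matrix_p = <np.float32_t*> self.the_matrix.data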
I have serious doubts that adding two NumPy arrays is a bottleneck you can solve by rewriting things in C. See the following code, which uses scipy.weave:
import numpy as np
from scipy.weave import inline

a = np.random.rand(10000000)
b = np.random.rand(10000000)
c = np.empty((10000000,))

def c_sum(a, b, c):
    length = a.shape[0]
    code = '''
    for(int j = 0; j < length; j++)
    {
        c[j] = a[j] + b[j];
    }
    '''
    inline(code, ['a', 'b', 'c', 'length'])
Once you run c_sum(a, b, c) once to get the C code compiled, these are the timings I get:
In [12]: %timeit c_sum(a, b, c)
10 loops, best of 3: 33.5 ms per loop
In [16]: %timeit np.add(a, b, out=c)
10 loops, best of 3: 33.6 ms per loop
So it seems you are looking at something like a 0.3% performance improvement, if the timing difference is not simply random noise, on an operation that takes a few tens of milliseconds when working on arrays of ten million elements. If it really is a bottleneck, this is hardly going to solve it.
Try compiling ATLAS and recompiling NumPy after that. This probably won't help with addition, but you can get a really nice performance boost with more complicated matrix operations (if you use any, of course).
Check out this simple benchmark. If your results fall too far from those given in the post, maybe your NumPy is not linked against an optimized BLAS implementation.
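A quick way to check what your build is using (a sketch; the exact output format varies across NumPy versions):

import numpy as np

np.show_config()  # look for ATLAS/OpenBLAS/MKL entries in the output

a = np.random.rand(1000, 1000)
%timeit np.dot(a, a)  # typically many times faster with an optimized BLAS than with the reference one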

How do I maximize efficiency with numpy arrays?

I am just getting to know NumPy, and I am impressed by its claims of C-like efficiency with memory access in its ndarrays. I wanted to see the differences between these and pythonic lists for myself, so I ran a quick timing test, performing a few of the same simple tasks with and without NumPy. NumPy outclassed regular lists by an order of magnitude in the allocation of and arithmetic operations on arrays, as expected. But this segment of code, identical in both tests, took about 1/8 of a second with a regular list and slightly over 2.5 seconds with NumPy:
file = open('timing.log', 'w')
for num in a2:
    if num % 1000 == 0:
        file.write("Multiple of 1000!\r\n")
file.close()
Does anyone know why this might be, and whether there is some other syntax I should be using for operations like this to take better advantage of what the ndarray can do?
Thanks...
EDIT: To answer Wayne's comment... I timed them both repeatedly and in different orders and got pretty much identical results each time, so I doubt it's another process. I put start = time() at the top of the file after the numpy import and then I have statements like print 'Time after traversal:\t',(time() - start) throughout.
a2 is a NumPy array, right? One possible reason it might be taking so long in NumPy (if other processes' activity doesn't account for it, as Wayne Werner suggested) is that you're iterating over the array using a Python loop. At every step of the iteration, Python has to fetch a single value out of the NumPy array and convert it to a Python integer, which is not a particularly fast operation.
NumPy works much better when you are able to perform operations on the whole array as a unit. In your case, one option (maybe not even the fastest) would be
file.write("Multiple of 1000!\r\n" * (a2 % 1000 == 0).sum())
Try comparing that to the pure-Python equivalent,
file.write("Multiple of 1000!\r\n" * len(filter(lambda i: i % 1000 == 0, a2)))
or
file.write("Multiple of 1000!\r\n" * sum(1 for i in a2 if i % 1000 == 0))
I'm not surprised that NumPy does poorly relative to the Python built-ins with your snippet. A large fraction of the performance benefit in NumPy comes from avoiding the loops and instead accessing the array by indexing:
In NumPy, it's more common to do something like this:
import numpy as NP

A = NP.random.randint(10, 100, 100).reshape(10, 10)
w = A[A % 2 == 0]
NP.save("test_file.npy", w)
Per-element access is very slow for numpy arrays. Use vector operations:
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 1.53 sec per loop
$ python -mtimeit -s 'import numpy as np; a2=np.arange(10**6)' '
> (a2 % 1000 == 0).sum()'
10 loops, best of 3: 22.6 msec per loop
$ python -mtimeit -s 'import numpy as np; a2= range(10**6)' '
> sum(1 for i in a2 if i % 1000 == 0)'
10 loops, best of 3: 90.9 msec per loop
