Cython: understanding what the html annotation file has to say?

After compiling the following Cython code, I get an HTML annotation file that looks like this:
import numpy as np
cimport numpy as np

cpdef my_function(np.ndarray[np.double_t, ndim=1] array_a,
                  np.ndarray[np.double_t, ndim=1] array_b,
                  int n_rows,
                  int n_columns):
    array_a[0:-1:n_columns] = 0
    array_a[n_columns - 1:n_rows * n_columns:n_columns] = 0
    array_a[0:n_columns] = 0
    array_a[n_columns * (n_rows - 1):n_rows * n_columns] = 0
    array_b[array_a == 3] = 0
    return array_a, array_b
My question is: why are those operations of my function still yellow? Does this mean that the code is still not as fast as it could be with Cython?

As you already know, yellow lines mean that some interaction with Python happens, i.e. Python functionality rather than raw C functionality is used; you can look into the produced code to see what happens there and whether it can/should be fixed/avoided.
Not every interaction with Python means a (measurable) slowdown.
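As an aside (my note, a minimal sketch): the annotation file itself is produced with the -a/--annotate switch, either as cython -a my_module.pyx on the command line (module name illustrative; this writes my_module.html next to the .pyx), or via the IPython cell magic:
%%cython -a
# the -a switch renders the yellow/white annotation inline under the cell
def trivial(double x):
    return 2.0 * x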
Let's take a look at this simplified function:
%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)] = 0.0
When we look into the produced code we see (I kept only the important parts):
__pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a));
__pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1);
__pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None);
PyObject_SetItem(((PyObject *)__pyx_v_a)
So basically we get a new slice (which is a numpy-array) and then use numpy's functionality (PyObject_SetItem) to set all elements to 0.0, which is C-code under the hood.
Let's take a look at a version with a hand-written for loop:
cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
It still uses PyObject_Length (because of len) and bounds checking, but otherwise it is C code. When we compare times:
>>> import numpy as np
>>> a=np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop
>>> b=np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop
You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean it has more overhead in the long run.
Let's disable these checks:
%%cython
cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i] = 0.0
In the produced HTML we can see that a[i] becomes as simple as it gets:
__pyx_t_3 = __pyx_v_i;
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
}
__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0).
And now:
>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop
We can improve it further by releasing the GIL in the for loop:
%%cython
cimport cython
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n = len(a)
    with nogil:
        for i in range(n):
            a[i] = 0.0
and now:
>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop
So it is somewhat faster, but you still cannot beat numpy for larger arrays.
In my opinion, there are two things to take from this:
Cython will not transform slices into access through a for loop, and thus Python functionality must be used.
There is a small overhead, but it is only the call into the numpy functionality; most of the work is done in numpy's code, and that cannot be sped up through Cython.
One last try, using the memset function:
%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a) * sizeof(np.double_t))
We get:
>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop
It is also as fast as the numpy-code for large arrays.
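One caveat worth adding (my note, not part of the measurements above): memset is only valid here because the buffer is contiguous and the bit pattern of 0.0 is all zero bytes. A defensive variant might look like this (the guard is illustrative):
%%cython
# a hedged sketch: refuse non-contiguous input before the raw memset
from libc.string cimport memset
cimport numpy as np

def use_memset_checked(np.ndarray[np.double_t] a):
    assert a.flags['C_CONTIGUOUS']  # memset assumes one contiguous block
    memset(&a[0], 0, len(a) * sizeof(np.double_t))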
As DavidW suggested, one could try to use memory views:
%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)] = 0.0
This leads to slightly faster code for small arrays and similarly fast code for large arrays (compared to numpy slices):
>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop
>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop
That means the memory-view slices have less overhead than the numpy slices. Here is the produced code:
__pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a);
__pyx_t_2.data = __pyx_v_a.data;
__pyx_t_2.memview = __pyx_v_a.memview;
__PYX_INC_MEMVIEW(&__pyx_t_2, 0);
__pyx_t_3 = -1;
if (unlikely(__pyx_memoryview_slice_memviewslice(
        &__pyx_t_2,
        __pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
        0, 0, &__pyx_t_3, 0, __pyx_t_1, 0, 1, 1, 0, 1) < 0))
{
    __PYX_ERR(0, 27, __pyx_L1_error)
}
{
    double __pyx_temp_scalar = 0.0;
    {
        Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
        Py_ssize_t __pyx_temp_idx;
        double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
        for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
            *((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
            __pyx_temp_pointer += 1;
        }
    }
}
__PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
__pyx_t_2.memview = NULL;
__pyx_t_2.data = NULL;
I think the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.
My compiler produces (at least on my machine) slightly faster code if memory views are used. I'm not sure whether it is worth investigating. At first sight, the difference in each iteration step is:
# generated code for memview slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;

# generated code for the memview for-loop:
__pyx_v_i = __pyx_t_3;
__pyx_t_4 = __pyx_v_i;
*((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
I would expect different compilers to handle this code differently well. But clearly, the first version is easier to optimize.
As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version, using slices, is about 20% faster on my machine. The difference is that for the double[::1] version it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
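To make that contrast concrete, here is a minimal sketch (the function names are mine, purely illustrative):
%%cython
# double[:] accepts any stride; the stride is only read at runtime.
# double[::1] promises C-contiguity at compile time, so the generated
# loop can use plain pointer increments.
def fill_any_stride(double[:] a):
    a[:] = 0.0

def fill_contiguous(double[::1] a):
    a[:] = 0.0
Passing a non-contiguous view such as np.ones(10)[::2] works with the first signature but raises a ValueError with the second; that is exactly the compile-time contiguity promise being enforced.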

Related

Numba: Manual looping faster than a += c * b with numpy arrays?

I would like to do a 'daxpy' (add to a vector the scalar multiple of a second vector and assign the result to the first) with numpy using numba. Doing the following test, I noticed that writing the loop myself was much faster than doing a += c * b.
I was not expecting this. What is the reason for this behavior?
import numpy as np
from numba import jit

x = np.random.random(int(1e6))
o = np.random.random(int(1e6))
c = 3.4

@jit(nopython=True)
def test1(a, b, c):
    a += c * b
    return a

@jit(nopython=True)
def test2(a, b, c):
    for i in range(len(a)):
        a[i] += c * b[i]
    return a

%timeit -n100 -r10 test1(x, o, c)
>>> 100 loops, best of 10: 2.48 ms per loop
%timeit -n100 -r10 test2(x, o, c)
>>> 100 loops, best of 10: 1.2 ms per loop
One thing to keep in mind is that 'manual looping' in numba is very fast, essentially the same as the C loop used by numpy operations.
In the first example there are two operations: a temporary array (c * b) is allocated and calculated, then that temporary array is added to a. In the second example, both calculations happen in the same loop with no intermediate result.
In theory, numba could fuse the loops and optimize the first version to do the same as the second, but it doesn't seem to be doing so. If you just want to optimize numpy ops, numexpr may also be worth a look, as it was designed for exactly that, though it probably won't do any better than the explicit fused loop.
In [17]: import numexpr as ne
In [18]: %timeit -r10 test2(x, o, c)
1000 loops, best of 10: 1.36 ms per loop
In [19]: %timeit ne.evaluate('x + o * c', out=x)
1000 loops, best of 3: 1.43 ms per loop

Cython: why is size_t faster than int?

Changing certain Cython variables from type int to type size_t can significantly reduce some functions' times (~30%), but I do not understand why.
For example:
cimport numpy as cnp
import numpy as np

def sum_int(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef int k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s

def sum_size_t(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef size_t k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s

a = np.array(range(1000000))
And the timing results:
In [17]: %timeit sum_int(a)
1000 loops, best of 3: 652 µs per loop
In [18]: %timeit sum_size_t(a)
1000 loops, best of 3: 427 µs per loop
I am new to Cython, and know Fortran better than C. Help me out. What is the important difference between these two variable types that causes such a performance difference? What is it that I don't grok about Cython?
You'd likely have to do line-by-line profiling to find out exactly, but one thing stands out to me from the produced C file: the int version is checked for wraparound to negative numbers, while size_t is assumed to be fine.
In the int loop (t_3 is assigned from k; they're the same type):
if (__pyx_t_3 < 0) {
    __pyx_t_3 += __pyx_v_A.shape[0];
    if (unlikely(__pyx_t_3 < 0)) __pyx_t_4 = 0;
} else if (unlikely(__pyx_t_3 >= __pyx_v_A.shape[0])) __pyx_t_4 = 0;
In the size_t loop:
if (unlikely(__pyx_t_3 >= (size_t)__pyx_v_A.shape[0])) __pyx_t_4 = 0;
So no wraparound test is needed because size_t is unsigned and guaranteed not to wrap around when indexing items in memory. The rest is virtually the same.
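A hedged aside (mine, not from the answer): you can get the same effect while keeping an int index by switching the checks off explicitly, assuming negative indices never occur in the loop:
%%cython
# a minimal sketch: with wraparound (and bounds) checking disabled,
# the int-indexed loop compiles essentially like the size_t one
cimport cython
cimport numpy as cnp

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_int_no_wrap(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef int k
    for k in range(A.shape[0]):
        s += A[k]
    return s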
Update: regarding your unsigned int results: what are your sizes of int and size_t? Any chance they're of different sizes, causing the change? In my case the C code for uint and size_t is identical (since size_t is unsigned, and specifically unsigned int on this system).
On a 64 bit system there seem to be two reasons:
Use an unsigned integer for the loop:
%%cython
cimport numpy as cnp
import numpy as np

def sum_int_unsigned(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef unsigned k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
Use a long instead of an int:
%%cython
cimport numpy as cnp
import numpy as np

def sum_int_unsigned_long(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef unsigned long k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
Timings:
%timeit sum_int(a)
1000 loops, best of 3: 1.52 ms per loop
%timeit sum_size_t(a)
1000 loops, best of 3: 671 µs per loop
Using unsigned brings us halfway:
%timeit sum_int_unsigned(a)
1000 loops, best of 3: 1.09 ms per loop
Using long accounts for the rest:
%timeit sum_int_unsigned_long(a)
1000 loops, best of 3: 648 µs per loop

Fastest way to populate a matrix with a function on pairs of elements in two numpy vectors?

I have two 1-dimensional numpy vectors va and vb which are being used to populate a matrix by passing all pair combinations to a function.
na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
    for j in range(nb):
        D[i, j] = foo(va[i], vb[j])
As it stands, this piece of code takes a very long time to run because va and vb are relatively large (4626 and 737 elements). However, I am hoping this can be improved, given that a similar procedure is performed with very good performance by the cdist method from scipy:
D = cdist(va, vb, metric)
I am obviously aware that scipy has the benefit of running this piece of code in C rather than in Python, but I'm hoping there is some numpy function I'm unaware of that can execute this quickly.
One of the lesser-known numpy features, from what the docs call functional programming routines, is np.frompyfunc. This creates a numpy ufunc from a Python function: not some other object that closely simulates a numpy ufunc, but a proper ufunc with all its bells and whistles. While the behavior is in many aspects very similar to np.vectorize, it has some distinct advantages that the following code hopefully highlights:
In [2]: def f(a, b):
   ...:     return a + b
   ...:

In [3]: f_vec = np.vectorize(f)

In [4]: f_ufunc = np.frompyfunc(f, 2, 1)  # 2 inputs, 1 output

In [5]: a = np.random.rand(1000)

In [6]: b = np.random.rand(2000)

In [7]: %timeit np.add.outer(a, b)  # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop

In [8]: %timeit f_vec(a[:, None], b)  # 50x slower than np.add
1 loops, best of 3: 488 ms per loop

In [9]: %timeit f_ufunc(a[:, None], b)  # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop

In [10]: %timeit f_ufunc.outer(a, b)  # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop
So while it is still clearly inferior to a properly vectorized implementation, it is a little faster (the looping is in C, but you still have the Python function call overhead).
cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na * nb times. That part is not likely to be further optimizable.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
Use xrange instead of range (if on Python 2.x; in Python 3, range is already generator-like).
Use enumerate instead of range plus explicit indexing (see the sketch after this list).
Use a Python speed "magic", such as cython or numba, to speed up the looping process.
If you can make further assumptions about foo, it might be possible to speed it up further.
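A minimal sketch of the enumerate suggestion (foo, va, vb, and D are the question's own names; only the loop shape changes):
# same computation as the question's loop, without explicit indexing
for i, a in enumerate(va):
    for j, b in enumerate(vb):
        D[i, j] = foo(a, b)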
Like @shx2 said, it all depends on what foo is. If you can express it in terms of numpy ufuncs, then use the outer method:
In [11]: N = 400

In [12]: B = np.empty((N, N))

In [13]: x = np.random.random(N)

In [14]: y = np.random.random(N)

In [15]: %%timeit
   ....: for i in range(N):
   ....:     for j in range(N):
   ....:         B[i, j] = x[i] - y[j]
   ....:
10 loops, best of 3: 87.2 ms per loop

In [16]: %timeit A = np.subtract.outer(x, y)  # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop
Otherwise you can push the looping down to the cython level. Continuing the trivial example above:
In [45]: %%cython
   ....: cimport cython
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: def foo(double[::1] x, double[::1] y, double[:, ::1] out):
   ....:     cdef int i, j
   ....:     for i in xrange(x.shape[0]):
   ....:         for j in xrange(y.shape[0]):
   ....:             out[i, j] = x[i] - y[j]
   ....:

In [46]: foo(x, y, B)

In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True

In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop
The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.
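For illustration, here is a hedged sketch of what those additions might look like (the names and choices are mine, not from the answer):
%%cython
# a sketch, assuming 1-D double inputs: sizes are taken from the inputs,
# and the output is allocated inside the function and returned to the caller
import numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def foo_checked(double[::1] x, double[::1] y):
    out = np.empty((x.shape[0], y.shape[0]))
    cdef double[:, ::1] out_view = out
    cdef int i, j
    for i in range(x.shape[0]):
        for j in range(y.shape[0]):
            out_view[i, j] = x[i] - y[j]
    return out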

Why is Cython slower than vectorized NumPy?

Consider the following Cython code:
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b
Running it in the interpreter yields (after a few runs to warm up the cache):
In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop
In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop
In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop
# See answer below :
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop
I tried it with different dataset sizes, and consistently had the vectorized NumPy function run faster than the compiled Cython code, while I was expecting Cython to be on par with vectorized NumPy in terms of performance.
Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?
Update: The raw pointer version seems to be on par with NumPy. So apparently there's some overhead in using memory views or NumPy indexing.
Another option is to use raw pointers (and the global directives to avoid repeating @cython.boundscheck etc.):
#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False
#...

cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
On my machine the difference isn't as large, but I can nearly eliminate it by changing the numpy and memory view functions like this:
@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]
and then, when I compile the C output from Cython, I use the flags -O3 and -march=native.
This seems to indicate that the difference in timings comes from the use of different compiler optimizations.
I use the 64 bit version of MinGW and NumPy 1.8.1.
Your results will probably vary depending on your package versions, hardware, platform, and compiler.
If you are using the IPython notebook's Cython magic, you can force an update with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native
If you are using a standard setup.py for your cython module you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
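A minimal setup.py sketch of that (the module and file names are placeholders of mine):
# assumes a single fastmod.pyx next to this setup.py
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "fastmod",                                    # hypothetical module name
    sources=["fastmod.pyx"],
    extra_compile_args=["-O3", "-march=native"],  # the flags discussed above
)
setup(ext_modules=cythonize([ext]))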
Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary.
That value defaults to 1 anyway.
A change that slightly increases the speed is to specify the stride:
def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

C array vs NumPy array

In terms of performance (algebraic operations, lookup, caching, etc.), is there a difference between C arrays (which can be exposed as a C array, a cython.view.array [Cython array], or a memoryview of the aforementioned two) and NumPy arrays (which in Cython should have no Python overhead)?
Edit:
I should mention that the NumPy arrays are statically typed using Cython, the dtypes are NumPy compile-time datatypes (e.g. cdef np.int_t or cdef np.float32_t), and the types in the C case are the C equivalents (cdef int_t and cdef float).
Edit2:
Here is the example from the Cython Memoryview documentation to further illustrate my question:
from cython.view cimport array as cvarray
import numpy as np
# Memoryview on a NumPy array
narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
cdef int [:, :, :] narr_view = narr
# Memoryview on a C array
cdef int carr[3][3][3]
cdef int [:, :, :] carr_view = carr
# Memoryview on a Cython array
cyarr = cvarray(shape=(3, 3, 3), itemsize=sizeof(int), format="i")
cdef int [:, :, :] cyarr_view = cyarr
Is there any difference between sticking with a C array vs a Cython array vs a NumPy array?
My knowledge on this is still imperfect, but this may be helpful.
I ran some informal benchmarks to show what each array type is good for and was intrigued by what I found.
Though these array types are different in many ways, if you are doing heavy computation with large arrays, you should be able to get similar performance out of any of them since item-by-item access should be roughly the same across the board.
A NumPy array is a Python object implemented using Python's C API.
NumPy arrays do provide an API at the C level, but they cannot be created independent from the Python interpreter.
They are especially useful because of all the different array manipulation routines available in NumPy and SciPy.
A Cython memory view is also a Python object, but it is made as a Cython extension type.
It does not appear to be designed for use in pure Python since it isn't a part of Cython that can be imported directly from Python, but you can return a view to Python from a Cython function.
You can look at the implementation at https://github.com/cython/cython/blob/master/Cython/Utility/MemoryView.pyx
A C array is a native type in the C language.
It is indexed like a pointer, but arrays and pointers are different.
There is some good discussion on this at http://c-faq.com/aryptr/index.html
They can be allocated on the stack and are easier for the C compiler to optimize, but they will be more difficult to access outside of Cython.
I know you can make a NumPy array from memory that has been dynamically allocated by other programs, but it seems a lot more difficult that way.
Travis Oliphant posted an example of this at http://blog.enthought.com/python/numpy-arrays-with-pre-allocated-memory/
If you are using C arrays or pointers for temporary storage within your program they should work very well for you.
They will not be as convenient for slicing or for any other sort of vectorized computation since you will have to do everything yourself with explicit looping, but they should allocate and deallocate faster and ought to provide a good baseline for speed.
Cython also provides an array class.
It looks like it is designed for internal use.
Instances are created when a memoryview is copied.
See http://docs.cython.org/src/userguide/memoryviews.html#view-cython-arrays
In Cython, you can also allocate memory and index a pointer to treat the allocated memory somewhat like an array.
See http://docs.cython.org/src/tutorial/memory_allocation.html
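A minimal sketch along the lines of that tutorial (mine; the function name and usage are illustrative):
%%cython
# allocate a raw block, use it like an array, and always free it
from libc.stdlib cimport malloc, free

def scratch_sum(int n):
    cdef double* buf = <double*> malloc(n * sizeof(double))
    if buf == NULL:
        raise MemoryError()
    cdef int i
    cdef double s = 0.0
    try:
        for i in range(n):
            buf[i] = i        # fill the scratch space
            s += buf[i]
    finally:
        free(buf)             # no garbage collector here; freeing is our job
    return s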
Here are some benchmarks that show somewhat similar performance for indexing large arrays.
This is the Cython file.
from numpy cimport ndarray as ar, uint64_t
cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            A[j] = n

def carr_time(uint64_t n=1000000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t AC[10000]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(10000):
            AC[j] = n

@cython.boundscheck(False)
@cython.wraparound(False)
def ptr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t* AP = &A[0]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(size):
            AP[j] = n

@cython.boundscheck(False)
@cython.wraparound(False)
def view_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t[:] AV = A
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            AV[j] = n
Timing these using IPython we obtain
%timeit -n 10 ndarr_time()
%timeit -n 10 carr_time()
%timeit -n 10 ptr_time()
%timeit -n 10 view_time()
10 loops, best of 3: 6.33 s per loop
10 loops, best of 3: 3.12 s per loop
10 loops, best of 3: 6.26 s per loop
10 loops, best of 3: 3.74 s per loop
These results struck me as a little odd, considering that, as per Efficiency: arrays vs pointers, arrays are unlikely to be significantly faster than pointers.
It appears that some sort of compiler optimization is making the pure C arrays and the typed memory views faster.
I tried turning off all the optimization flags on my C compiler and got the timings
1 loops, best of 3: 25.1 s per loop
1 loops, best of 3: 25.5 s per loop
1 loops, best of 3: 32 s per loop
1 loops, best of 3: 28.4 s per loop
It looks to me like the item-by item access is pretty much the same across the board, except that C arrays and Cython memory views seem to be easier for the compiler to optimize.
More commentary on this can be seen in these two blog posts I found some time ago:
http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/
http://jakevdp.github.io/blog/2012/08/16/memoryview-benchmarks-2/
In the second blog post he comments on how, if memory view slices are inlined, they can provide speeds similar to that of pointer arithmetic.
I have noticed in some of my own tests that explicitly inlining functions that use Memory View slices isn't always necessary.
As an example of this, I'll compute the inner product of every combination of two rows of an array.
from numpy cimport ndarray as ar
cimport cython
from numpy import empty

# An inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double dot_product(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# non-inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double dot_product_no_inline(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# function calling the inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        double[:, :] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = dot_product(Aview[i], Aview[j], A.shape[1])
    return res

# function calling the non-inlined version
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        double[:, :] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = dot_product_no_inline(Aview[i], Aview[j], A.shape[1])
    return res

# inlined dot product using numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double ndarr_dot_product(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# non-inlined dot product using numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double ndarr_dot_product_no_inline(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# function calling the inlined numpy array dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = ndarr_dot_product(A[i], A[j])
    return res

# function calling the non-inlined version for numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = ndarr_dot_product_no_inline(A[i], A[j])
    return res

# Version with explicit looping and item-by-item access.
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_loops(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j, k
        double tot
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            tot = 0.
            for k in range(A.shape[1]):
                tot += A[i, k] * A[j, k]
            res[i, j] = tot
    return res
Timing these we see
A = rand(1000, 1000)
%timeit dot_rows_slicing(A)
%timeit dot_rows_slicing_no_inline(A)
%timeit ndarr_dot_rows_slicing(A)
%timeit ndarr_dot_rows_slicing_no_inline(A)
%timeit dot_rows_loops(A)
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 3.65 s per loop
1 loops, best of 3: 3.66 s per loop
1 loops, best of 3: 1.04 s per loop
The results were as fast with explicit inlining as they were without it.
In both cases, the typed memory views were comparable to a version of the function that was written without slicing.
In the blog post, he had to write a specific example to force the compiler to not inline a function.
It appears that a decent C compiler (I'm using MinGW) is able to take care of these optimizations without being told to inline certain functions.
Memoryviews can be faster for passing array slices between functions within a Cython module, even without explicit inlining.
In this particular case, however, even pushing the loops to C doesn't really reach a speed anywhere near what can be achieved through proper use of matrix multiplication.
The BLAS is still the best way to do things like this.
%timeit A.dot(A.T)
10 loops, best of 3: 25.7 ms per loop
There is also automatic conversion from NumPy arrays to memoryviews, as in
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysum(double[:] A):
    cdef double tot = 0.
    cdef int i
    for i in range(A.size):
        tot += A[i]
    return tot
The one catch is that, if you want a function to return a NumPy array, you will have to use np.asarray to convert the memory view object back into a NumPy array.
This is a relatively inexpensive operation since memory views comply with the buffer protocol, http://www.python.org/dev/peps/pep-3118/
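A minimal sketch of that round trip (my example; the function name is illustrative):
%%cython
# take a view, work on it at C speed, hand a NumPy array back to Python
import numpy as np

def halve(double[:] v):
    cdef int i
    for i in range(v.shape[0]):
        v[i] /= 2.0
    return np.asarray(v)  # cheap: wraps the same buffer, no copy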
Conclusion
Typed memory views seem to be a viable alternative to NumPy arrays for internal use in a Cython module.
Array slicing will be faster with memory views, but there are not as many functions and methods written for memory views as there are for NumPy arrays.
If you don't need to call a bunch of the NumPy array methods and want easy array slicing, you can use memory views in place of NumPy arrays.
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
You can then use the view for passing slices between functions and the array for calling NumPy functions.
That approach is still somewhat limited, but it will work well if you are doing most of your processing with a single array.
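A minimal sketch of that pattern (all names are mine, purely illustrative):
%%cython
# one buffer, two handles: a memoryview for cheap slicing between
# Cython functions, and the NumPy array for NumPy functionality
import numpy as np

cdef double first_element(double[:] v):
    return v[0]

def demo():
    A = np.arange(10.0)        # the NumPy handle
    cdef double[:] v = A       # the memoryview handle, same memory
    x = first_element(v[5:])   # pass a slice cheaply via the view
    return x, A.mean()         # call NumPy methods via the array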
C arrays and/or dynamically allocated blocks of memory could be useful for intermediate calculations, but they are not as easy to pass back to Python for use there.
In my opinion, it is also more cumbersome to dynamically allocate multidimensional C arrays.
The best approach I am aware of is to allocate a large block of memory and then use integer arithmetic to index it as if it were a multidimensional array (sketched after this paragraph).
This could be an issue if you want easy allocation of arrays on the fly.
On the other hand, allocation times are probably a good bit faster for C arrays.
The other array types are designed to be nearly as fast and much more convenient, so I would recommend using them unless there is a compelling reason to do otherwise.
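Here is a hedged sketch of that flat-indexing idea, assuming a row-major square layout (the function is mine and purely illustrative):
%%cython
# one malloc'ed block indexed as an n x n matrix:
# element (i, j) lives at offset i * n + j
from libc.stdlib cimport malloc, free

def trace_of_identity(int n):
    cdef double* M = <double*> malloc(n * n * sizeof(double))
    if M == NULL:
        raise MemoryError()
    cdef int i, j
    cdef double t = 0.0
    try:
        for i in range(n):
            for j in range(n):
                M[i * n + j] = 1.0 if i == j else 0.0  # build the identity
        for i in range(n):
            t += M[i * n + i]                          # sum the diagonal
    finally:
        free(M)
    return t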
Update: As mentioned in the answer by @Veedrac, you can still pass Cython memory views to most NumPy functions.
When you do this, NumPy will usually have to create a new NumPy array object to work with the memory view anyway, so this will be somewhat slower.
For large arrays the effect will be negligible.
A call to np.asarray for a memory view will be relatively fast regardless of array size.
However, to demonstrate this effect, here is another benchmark:
Cython file:
def npy_call_on_view(npy_func, double[:] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)

def npy_call_on_arr(npy_func, ar[double] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)
in IPython:
from numpy.random import rand
A = rand(1)
%timeit npy_call_on_view(np.amin, A, 10000)
%timeit npy_call_on_arr(np.amin, A, 10000)
output:
10 loops, best of 3: 282 ms per loop
10 loops, best of 3: 35.9 ms per loop
I tried to choose an example that would show this effect well.
Unless many NumPy function calls on relatively small arrays are involved, this shouldn't change the time a whole lot.
Keep in mind that, regardless of which way we are calling NumPy, a Python function call still occurs.
This applies only to the functions in NumPy.
Most of the array methods are not available for memoryviews (though some of the attributes still are, like size, shape, and T).
For example A.dot(A.T) with NumPy arrays would become np.dot(A, A.T).
Don't use cython.view.array, use cpython.array.array.
See this answer of mine for details, although it only deals with speed. The recommendation is to treat cython.view.array as "demo" material and cpython.array.array as an actual, solid implementation. These arrays are very lightweight and better when just using them as scratch space.
Further, if you're ever tempted by malloc, raw access on these is no slower and instantiation takes only twice as long.
With regards to IanH's
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
It's worth noting that memoryviews have a "base" property and that many NumPy functions can also take memoryviews, so these do not have to be separate variables.
