C array vs NumPy array - python

In terms of performance (algebraic operations, lookup, caching, etc.), is there a difference between C arrays (which can be exposed as a C array, or a cython.view.array [Cython array], or a memoryview of the aforementioned two) and NumPy arrays (which in Cython should have no Python overhead)?
Edit:
I should mention that the NumPy array is statically typed in Cython, the dtypes are NumPy compile-time datatypes (e.g. cdef np.int_t or cdef np.float32_t), and the types in the C case are the C equivalents (cdef int and cdef float).
Edit2:
Here is the example from the Cython Memoryview documentation to further illustrate my question:
from cython.view cimport array as cvarray
import numpy as np
# Memoryview on a NumPy array
narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
cdef int [:, :, :] narr_view = narr
# Memoryview on a C array
cdef int carr[3][3][3]
cdef int [:, :, :] carr_view = carr
# Memoryview on a Cython array
cyarr = cvarray(shape=(3, 3, 3), itemsize=sizeof(int), format="i")
cdef int [:, :, :] cyarr_view = cyarr
Is there any difference between sticking with a C array vs a Cython array vs a NumPy array?

My knowledge on this is still imperfect, but this may be helpful.
I ran some informal benchmarks to show what each array type is good for and was intrigued by what I found.
Though these array types are different in many ways, if you are doing heavy computation with large arrays, you should be able to get similar performance out of any of them since item-by-item access should be roughly the same across the board.
A NumPy array is a Python object implemented using Python's C API.
NumPy arrays do provide an API at the C level, but they cannot be created independently of the Python interpreter.
They are especially useful because of all the different array manipulation routines available in NumPy and SciPy.
A Cython memory view is also a Python object, but it is made as a Cython extension type.
It does not appear to be designed for use in pure Python since it isn't a part of Cython that can be imported directly from Python, but you can return a view to Python from a Cython function.
You can look at the implementation at https://github.com/cython/cython/blob/master/Cython/Utility/MemoryView.pyx
A C array is a native type in the C language.
It is indexed like a pointer, but arrays and pointers are different.
There is some good discussion on this at http://c-faq.com/aryptr/index.html
They can be allocated on the stack and are easier for the C compiler to optimize, but they will be more difficult to access outside of Cython.
I know you can make a NumPy array from memory that has been dynamically allocated by other programs, but it seems a lot more difficult that way.
Travis Oliphant posted an example of this at http://blog.enthought.com/python/numpy-arrays-with-pre-allocated-memory/
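As a rough sketch of an alternative (not from that blog post, and with made-up names and sizes), memory allocated outside NumPy can also be wrapped without copying by casting the pointer to a typed memoryview inside Cython and then calling np.asarray on the view:
import numpy as np
from libc.stdlib cimport malloc, free

def wrap_external_buffer():
    # stand-in for a buffer that some other C code allocated and handed to us
    cdef double *buf = <double *> malloc(100 * sizeof(double))
    cdef double[:] view
    cdef int i
    if buf == NULL:
        raise MemoryError()
    for i in range(100):
        buf[i] = i
    # cast the raw pointer to a typed memoryview; np.asarray then wraps the
    # same memory without copying, so freeing buf would invalidate the array
    view = <double[:100]> buf
    return np.asarray(view)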
If you are using C arrays or pointers for temporary storage within your program they should work very well for you.
They will not be as convenient for slicing or for any other sort of vectorized computation since you will have to do everything yourself with explicit looping, but they should allocate and deallocate faster and ought to provide a good baseline for speed.
Cython also provides an array class.
It looks like it is designed for internal use.
Instances are created when a memoryview is copied.
See http://docs.cython.org/src/userguide/memoryviews.html#view-cython-arrays
In Cython, you can also allocate memory and index a pointer to treat the allocated memory somewhat like an array.
See http://docs.cython.org/src/tutorial/memory_allocation.html
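A minimal sketch of that pattern (the function and sizes are only illustrative), using malloc/free from libc.stdlib as in the memory allocation tutorial:
from libc.stdlib cimport malloc, free

def sum_of_squares(int n):
    # grab a temporary C buffer, index it like an array, and free it when done
    cdef double *tmp = <double *> malloc(n * sizeof(double))
    cdef double tot = 0.
    cdef int i
    if tmp == NULL:
        raise MemoryError()
    try:
        for i in range(n):
            tmp[i] = i * i
            tot += tmp[i]
        return tot
    finally:
        free(tmp)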
Here are some benchmarks that show somewhat similar performance for indexing large arrays.
This is the Cython file.
from numpy cimport ndarray as ar, uint64_t
cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            A[j] = n

def carr_time(uint64_t n=1000000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t AC[10000]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(10000):
            AC[j] = n

@cython.boundscheck(False)
@cython.wraparound(False)
def ptr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t* AP = &A[0]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(size):
            AP[j] = n

@cython.boundscheck(False)
@cython.wraparound(False)
def view_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t[:] AV = A
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            AV[j] = n
Timing these using IPython we obtain
%timeit -n 10 ndarr_time()
%timeit -n 10 carr_time()
%timeit -n 10 ptr_time()
%timeit -n 10 view_time()
10 loops, best of 3: 6.33 s per loop
10 loops, best of 3: 3.12 s per loop
10 loops, best of 3: 6.26 s per loop
10 loops, best of 3: 3.74 s per loop
These results struck me as a little odd, considering that, as per Efficiency: arrays vs pointers, arrays are unlikely to be significantly faster than pointers.
It appears that some sort of compiler optimization is making the pure C arrays and the typed memory views faster.
I tried turning off all the optimization flags on my C compiler and got the timings
1 loops, best of 3: 25.1 s per loop
1 loops, best of 3: 25.5 s per loop
1 loops, best of 3: 32 s per loop
1 loops, best of 3: 28.4 s per loop
It looks to me like the item-by-item access is pretty much the same across the board, except that C arrays and Cython memory views seem to be easier for the compiler to optimize.
More commentary on this can be seen in these two blog posts I found some time ago:
http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/
http://jakevdp.github.io/blog/2012/08/16/memoryview-benchmarks-2/
In the second blog post he comments on how, if memory view slices are inlined, they can provide speeds similar to that of pointer arithmetic.
I have noticed in some of my own tests that explicitly inlining functions that use Memory View slices isn't always necessary.
As an example of this, I'll compute the inner product of every combination of two rows of an array.
from numpy cimport ndarray as ar
cimport cython
from numpy import empty

# An inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double dot_product(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# non-inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double dot_product_no_inline(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot

# function calling the inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        double[:, :] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = dot_product(Aview[i], Aview[j], A.shape[1])
    return res

# function calling the non-inlined version
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        double[:, :] Aview = A
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = dot_product_no_inline(Aview[i], Aview[j], A.shape[1])
    return res

# inlined dot product using NumPy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double ndarr_dot_product(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# non-inlined dot product using NumPy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double ndarr_dot_product_no_inline(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot

# function calling the inlined NumPy array dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = ndarr_dot_product(A[i], A[j])
    return res

# function calling the non-inlined version for NumPy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing_no_inline(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i, j] = ndarr_dot_product_no_inline(A[i], A[j])
    return res

# Version with explicit looping and item-by-item access.
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_loops(ar[double, ndim=2] A):
    cdef:
        ar[double, ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j, k
        double tot
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            tot = 0.
            for k in range(A.shape[1]):
                tot += A[i, k] * A[j, k]
            res[i, j] = tot
    return res
Timing these we see
from numpy.random import rand
A = rand(1000, 1000)
%timeit dot_rows_slicing(A)
%timeit dot_rows_slicing_no_inline(A)
%timeit ndarr_dot_rows_slicing(A)
%timeit ndarr_dot_rows_slicing_no_inline(A)
%timeit dot_rows_loops(A)
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 3.65 s per loop
1 loops, best of 3: 3.66 s per loop
1 loops, best of 3: 1.04 s per loop
The results were as fast with explicit inlining as they were without it.
In both cases, the typed memory views were comparable to a version of the function that was written without slicing.
In the blog post, he had to write a specific example to force the compiler to not inline a function.
It appears that a decent C compiler (I'm using MinGW) is able to take care of these optimizations without being told to inline certain functions.
Memoryviews can be faster for passing array slices between functions within a Cython module, even without explicit inlining.
In this particular case, however, even pushing the loops to C doesn't really reach a speed anywhere near what can be achieved through proper use of matrix multiplication.
The BLAS is still the best way to do things like this.
%timeit A.dot(A.T)
10 loops, best of 3: 25.7 ms per loop
There is also automatic conversion from NumPy arrays to memoryviews as in
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cysum(double[:] A):
    cdef double tot = 0.
    cdef int i
    for i in range(A.size):
        tot += A[i]
    return tot
The one catch is that, if you want a function to return a NumPy array, you will have to use np.asarray to convert the memory view object to a NumPy array again.
This is a relatively inexpensive operation since memory views comply with http://www.python.org/dev/peps/pep-3118/
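For example, a minimal sketch (the function name is made up) of a Cython function that works on a memoryview internally but hands an ndarray back to Python:
cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def scaled(double[:] A, double factor):
    cdef int i
    for i in range(A.shape[0]):
        A[i] *= factor
    # np.asarray wraps the view's buffer (PEP 3118), so no copy is made
    return np.asarray(A)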
Conclusion
Typed memory views seem to be a viable alternative to NumPy arrays for internal use in a Cython module.
Array slicing will be faster with memory views, but there are not as many functions and methods written for memory views as there are for NumPy arrays.
If you don't need to call a bunch of the NumPy array methods and want easy array slicing, you can use memory views in place of NumPy arrays.
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
You can then use the view for passing slices between functions and the array for calling NumPy functions.
That approach is still somewhat limited, but it will work well if you are doing most of your processing with a single array.
C arrays and/or dynamically allocated blocks of memory could be useful for intermediate calculations, but they are not as easy to pass back to Python for use there.
In my opinion, it is also more cumbersome to dynamically allocate multidimensional C arrays.
The best approach I am aware of is to allocate a large block of memory and then use integer arithmetic to index it as if it were a multidimensional array.
This could be an issue if you want easy allocation of arrays on the fly.
On the other hand, allocation times are probably a good bit faster for C arrays.
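A sketch of the flat-block indexing mentioned above (names and sizes are only illustrative): element (i, j) of an nrows-by-ncols scratch "array" lives at offset i * ncols + j:
from libc.stdlib cimport malloc, free

def scratch_2d_demo(int nrows, int ncols):
    cdef double *buf = <double *> malloc(nrows * ncols * sizeof(double))
    cdef int i, j
    cdef double tot = 0.
    if buf == NULL:
        raise MemoryError()
    try:
        for i in range(nrows):
            for j in range(ncols):
                # manual row-major indexing stands in for buf[i][j]
                buf[i * ncols + j] = i + j
                tot += buf[i * ncols + j]
        return tot
    finally:
        free(buf)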
The other array types are designed to be nearly as fast and much more convenient, so I would recommend using them unless there is a compelling reason to do otherwise.
Update: As mentioned in the answer by @Veedrac, you can still pass Cython memory views to most NumPy functions.
When you do this, NumPy will usually have to create a new NumPy array object to work with the memory view anyway, so this will be somewhat slower.
For large arrays the effect will be negligible.
A call to np.asarray for a memory view will be relatively fast regardless of array size.
However, to demonstrate this effect, here is another benchmark:
Cython file:
from numpy cimport ndarray as ar

def npy_call_on_view(npy_func, double[:] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)

def npy_call_on_arr(npy_func, ar[double] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)
In IPython:
import numpy as np
from numpy.random import rand
A = rand(1)
%timeit npy_call_on_view(np.amin, A, 10000)
%timeit npy_call_on_arr(np.amin, A, 10000)
output:
10 loops, best of 3: 282 ms per loop
10 loops, best of 3: 35.9 ms per loop
I tried to choose an example that would show this effect well.
Unless many NumPy function calls on relatively small arrays are involved, this shouldn't change the time a whole lot.
Keep in mind that, regardless of which way we are calling NumPy, a Python function call still occurs.
This applies only to the functions in NumPy.
Most of the array methods are not available for memoryviews (some of the attributes still are, like size and shape and T).
For example A.dot(A.T) with NumPy arrays would become np.dot(A, A.T).

Don't use cython.view.array, use cpython.array.array.
See this answer of mine for details, although that only deals with speed. The recommendation is to treat cython.view.array as "demo" material, and cpython.array.array as an actual solid implementation. These arrays are very lightweight and better when just using them as scratch space.
Further, if you're ever tempted by malloc, raw access on these is no slower and instantiation takes only twice as long.
With regard to IanH's point:
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
It's worth noting that memoryviews have a "base" property, and many NumPy functions can also take memoryviews, so these do not have to be separate variables.
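A small sketch of both points (the function name is made up): the view remembers the array it was created from, and NumPy functions accept the view via the buffer protocol:
import numpy as np

def base_demo():
    cdef double[:] view
    arr = np.arange(5, dtype=np.float64)
    view = arr
    orig = view.base          # the ndarray the view was created from
    total = np.sum(view)      # NumPy accepts the memoryview directly
    return orig is arr, total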

Related

Nonzero for integers

My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage with np.nonzero is that it finds the indices of bits set to 1 much faster than iterating (with a for loop) over each bit and checking whether it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.nonzero, but on integers instead?
Thank you in advance for your suggestions!
A minor optimisation to your code would be to use the new-style random interface and generate bools rather than 64-bit integers:
rng = np.random.default_rng()

def original(n):
    bitstring = rng.integers(2, size=n, dtype=bool)
    return np.nonzero(bitstring)[0]
This takes ~24 µs on my laptop, tested with n up to 128.
I've previously noticed that getting NumPy to generate a permutation is particularly fast, hence my comment above. Leading to:
def perm(n):
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indices out of order; not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on a pre-allocated array, something like:
n = 32
a = np.arange(n)

def shuffle():
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.
After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
import numpy as np
rng = np.random.default_rng()

def func1(n):
    bit_array = np.random.randint(2, size=n)
    return np.nonzero(bit_array)[0]

def func2(n):
    bit_int = np.random.bit_generator.randbits(n)
    a = np.zeros(bit_int.bit_count())
    i = 0
    for j in range(n):
        if 1 & (bit_int >> j):
            a[i] = j
            i += 1
    return a

def func3(n):
    bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype=int)
    return np.nonzero(bit_array)[0]

def func4(n):
    rng = np.random.default_rng()
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]

def func5(n):
    a = np.arange(n)
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 times over each statement and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspond to the standard deviation.
For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n > 32). We can see that for low values of n (n < 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However, as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both worlds and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits, spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!
I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t

np.import_array()

DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t

cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil

cdef const char *bgen_capsule_name = "BitGenerator"

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
    cdef Py_ssize_t i
    cdef size_t nset
    cdef bitgen_t *rng
    capsule = bitgen.capsule
    if not PyCapsule_IsValid(capsule, bgen_capsule_name):
        raise ValueError("Expecting Numpy BitGenerator Capsule")
    rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)
    with bitgen.lock:
        nset = 0
        for i in range(state_len-1):
            state[i] = rng.next_uint64(rng.state)
            nset += __builtin_popcountl(state[i])
        i = state_len-1
        state[i] = rng.next_uint64(rng.state) & last_mask
        nset += __builtin_popcountl(state[i])
    return nset

cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
    cdef size_t j
    cdef int k
    j = 0
    while state:
        # find first set bit, returns zero when nothing is set
        k = __builtin_ffsl(state) - 1
        # clear out bit k
        state &= ~(1ul << k)
        # record in output
        result[j] = off + k
        j += 1
    return j

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
    cdef Py_ssize_t i, j, nset
    cdef np.uint64_t[::1] state
    cdef DTYPE_t[::1] result
    state = np.empty((n + 63) // 64, dtype=np.uint64)
    nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)
    pyresult = np.empty(nset, dtype=DTYPE)
    result = pyresult
    j = 0
    for i in range(len(state)):
        j += write_setbits(&result[j], i * 64, state[i])
    return pyresult
The above code is easy to use via the Cython Jupyter extension.
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit

import numpy as np
import matplotlib.pyplot as plt

bitgen = np.random.PCG64()

def func1(n):
    # bool type is a bit faster
    bit_array = np.random.randint(2, size=n, dtype=bool)
    return np.nonzero(bit_array)[0]

def func2(n):
    # OP's variant ends up using a CSPRNG which is slower
    bit_int = random.getrandbits(n)
    # this is much easier than using numpy arrays
    return [i for i in range(n) if 1 & (bit_int >> i)]

def func3(n):
    bit_string = format(random.getrandbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype='int8')
    return np.nonzero(bit_array)[0]

def func4(n):
    # shuffle variant is mostly the same, plot already busy enough
    a = np.random.permutation(n)
    return a[:np.random.binomial(n, 0.5)]

def func_cython(n):
    return rint(bitgen, n)

result = {}
niter = [2**i for i in range(1, 17)]

for name in 'func1 func2 func3 func4 func_cython'.split():
    result[name] = res = []
    for n in niter:
        t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
        nit, dt = t.autorange()
        res.append(dt / nit)

plt.loglog()
for name, times in result.items():
    plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
This produces a log-log plot of runtime against n for each function (plot not reproduced here).
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.
You could convert the number you get with randbits(n) to a numpy.ndarray.
Depending on the size of n, the compute time of the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]

fast access of sparse matrix in cython: memoryview vs vector of dictionaries

I used Cython to speed up my bottleneck in Python. The task is to compute the selective inverse (below S) of a sparse matrix given by its Cholesky factorization provided in csc format (data, indptr, indices). But the task itself is not really important; in the end it is a triply nested for-loop in which I have to access elements of S fast.
When I use a memoryview of a full/huge matrix
double[:,:] Sfull
and access the entries then the algorithm is quite fast and meets my expectations. But it is clear, that this is only possible when the matrix Sfull fits into the memory.
My approach was to use a list/vector of dictionaries/maps such that I can access the elements also relatively fast.
cdef vector[map[int, double]] S
It turned out that accessing the elements inside the loop with this data structure is around 20 times slower. Is this expected, or is there another issue? Can you suggest another data structure?
Thank you very much for any comments or help!
Best,
Manuel
Below is the Cython code, where the version with the full memoryview is commented out.
cdef int invTakC12(double[:] id_diag, double[:] data, int len_i, int[:] indptr, int[:] indices, double[:, :] Sfull):
    cdef vector[map[int, double]] S = testDictC(len_i-1)  # list of empty dicts
    cdef int i, j, j_i, lc
    cdef double q
    for i in range(len_i-2, -1, -1):
        for j_i in range(indptr[i+1]-1, indptr[i]-1, -1):
            j = indices[j_i]
            q = 0
            for lc in range(indptr[i+1]-1, indptr[i], -1):
                q += data[lc] * S[j][indices[lc]]
                #q += data[lc] * Sfull[indices[lc], j]
            S[j][i] = -q
            #Sfull[i,j] = -q
            if i == j:
                S[j][i] += id_diag[i]
                #Sfull[i,j] += id_diag[i]
            else:
                S[i][j] -= q
                #Sfull[j,i] -= id_diag[i]
    return 0
You can access the arrays independently - e.g.:
cdef double[:] S_data = S.data
cdef np.int32_t[:] S_ind = S.indices
cdef np.int32_t[:] S_indptr = S.indptr
If that's too inconvenient, you can put them in a C struct as pointers:
cdef struct mapped_csc:
    double *data
    np.int32_t *indices
    np.int32_t *indptr
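As a hedged sketch of how the pieces might fit together (the helper name is made up), the struct can be filled once from contiguous memoryviews of the three CSC arrays and then passed to C-level helpers:
cimport numpy as np

cdef struct mapped_csc:
    double *data
    np.int32_t *indices
    np.int32_t *indptr

cdef mapped_csc make_mapped_csc(double[::1] data, np.int32_t[::1] indices, np.int32_t[::1] indptr):
    # bundle raw pointers to the three CSC arrays so inner (possibly nogil)
    # loops can take them as a single argument
    cdef mapped_csc m
    m.data = &data[0]
    m.indices = &indices[0]
    m.indptr = &indptr[0]
    return m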

Loop over a Numpy array with Cython

Let a and b be two numpy.float arrays of length 1024, defined with
cdef numpy.ndarray a
cdef numpy.ndarray b
I notice that:
cdef int i
for i in range(1024):
b[i] += a[i]
is considerably slower than:
b += a
Why?
I really need to be able to loop manually over arrays.
The difference will be smaller if you tell Cython the data type and the number of dimensions for a and b:
cdef numpy.ndarray[np.float64_t, ndim=1] a, b
Although the difference will be smaller, you won't beat b += a, because it uses NumPy's SIMD-boosted routines (whose benefit depends on whether your CPU supports SIMD).

Why is Cython slower than vectorized NumPy?

Consider the following Cython code :
cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b
Running it in the interpreter yields (after a few runs to warm up the cache) :
In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop
In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop
In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop
# See answer below :
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop
I tried it with different dataset sizes, and consistently had the vectorized NumPy function run faster than the compiled Cython code, while I was expecting Cython to be on par with vectorized NumPy in terms of performance.
Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?
Update: The raw pointer version seems to be on par with NumPy. So apparently there's some overhead in using memory view or NumPy indexing.
Another option is to use raw pointers (and file-level compiler directives, to avoid repeating the @cython decorators):
# cython: wraparound=False
# cython: boundscheck=False
# cython: nonecheck=False
# ...

cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
On my machine the difference isn't as large, but I can nearly eliminate it by changing the numpy and memory view functions like this
@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n = a.shape[0]
    for i in range(n):
        a[i] += b[i]
and then, when I compile the C output from Cython, I use the flags -O3 and -march=native.
This seems to indicate that the difference in timings comes from the use of different compiler optimizations.
I use the 64 bit version of MinGW and NumPy 1.8.1.
Your results will probably vary depending on your package versions, hardware, platform, and compiler.
If you are using the IPython notebook's Cython magic, you can force an update with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native
If you are using a standard setup.py for your cython module you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
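A minimal setup.py sketch along those lines (module and file names are placeholders):
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "arraybench",                               # hypothetical module name
    ["arraybench.pyx"],                         # hypothetical source file
    include_dirs=[np.get_include()],            # needed for "cimport numpy"
    extra_compile_args=["-O3", "-march=native"],
)

setup(ext_modules=cythonize([ext]))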
Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary.
That value defaults to 1 anyway.
A change that slightly increases the speed is to specify the stride:
def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

No speed gains from Cython again?

The following is my cython code, the purpose is to do a bootstrap.
def boots(int trial, np.ndarray[double, ndim=2] empirical, np.ndarray[double, ndim=2] expected):
    cdef int length = len(empirical)
    cdef np.ndarray[double, ndim=2] ret = np.empty((trial, 100))
    cdef np.ndarray[long] choices
    cdef np.ndarray[double] m
    cdef np.ndarray[double] n
    cdef long o
    cdef int i
    cdef int j
    for i in range(trial):
        choices = np.random.randint(0, length, length)
        m = np.zeros(100)
        n = np.zeros(100)
        for j in range(length):
            o = choices[j]
            m.__iadd__(empirical[o])
            n.__iadd__(expected[o])
        empirical_boot = m / length
        expected_boot = n / length
        ret[i] = empirical_boot / expected_boot - 1
    ret.sort(axis=0)
    return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
# test code
empirical = np.ones((40000, 100))
expected = np.ones((40000, 100))
%prun -l 10 boots(100, empirical,expected)
It takes 11 seconds in pure Python with fancy indexing, and no matter how hard I tune it in Cython it stays the same.
np.random.randint(0, 40000, 40000) takes 1 ms, so 100 calls take 0.1 s.
np.sort(np.ones((40000, 100))) takes 0.2 s.
Thus I feel there must be ways to improve boots.
The primary issue you are seeing is that Cython only optimizes single-item access for typed arrays. This means that each of the lines in your code where you are using vectorization from NumPy still involve creating and interacting with Python objects.
The code you have there wasn't faster than the pure Python version because it wasn't really doing any of the computation differently.
You will have to avoid this by writing out the looping operations explicitly.
Here is a modified version of your code that runs significantly faster.
from numpy cimport ndarray as ar
from numpy cimport int32_t as int32
from numpy import empty
from numpy.random import randint
cimport cython

# Notice the use of these decorators to tell Cython to turn off
# some of the checking it does when accessing arrays.
@cython.boundscheck(False)
@cython.wraparound(False)
def boots(int32 trial, ar[double, ndim=2] empirical, ar[double, ndim=2] expected):
    cdef:
        int32 length = empirical.shape[0], i, j, k
        int32 o
        ar[double, ndim=2] ret = empty((trial, 100))
        ar[int32] choices
        ar[double] m = empty(100), n = empty(100)
    for i in range(trial):
        # Still calling Python on this line
        choices = randint(0, length, length)
        # It was faster to compute m and n separately.
        # I suspect that has to do with cache management.
        # Instead of allocating new arrays, I just filled the old ones with the new values.
        o = choices[0]
        for k in range(100):
            m[k] = empirical[o,k]
        for j in range(1, length):
            o = choices[j]
            for k in range(100):
                m[k] += empirical[o,k]
        o = choices[0]
        for k in range(100):
            n[k] = expected[o,k]
        for j in range(1, length):
            o = choices[j]
            for k in range(100):
                n[k] += expected[o,k]
        # Here I simplified some of the math and got rid of temporary arrays
        for k in range(100):
            ret[i,k] = m[k] / n[k] - 1.
    ret.sort(axis=0)
    return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
If you want to have a look at which lines of your code involve Python calls, the Cython compiler can generate an html file showing which lines call Python.
This option is called annotation.
The way you use it depends on how you are compiling your cython code.
If you are using the IPython notebook, just add the --annotate flag to the Cython cell magic.
You may also be able to benefit from turning on the C compiler optimization flags.
