Numpy vs Cython speed

Numpy vs Cython speed - python

I have an analysis code that does some heavy numerical operations using numpy. Just for curiosity, tried to compile it with cython with little changes and then I rewrote it using loops for the numpy part.
To my surprise, the code based on loops was much faster (8x). I cannot post the complete code, but I put together a very simple unrelated computation that shows similar behavior (albeit the timing difference is not so big):
Version 1 (without cython)
import numpy as np
def _process(array):
rows = array.shape[0]
cols = array.shape[1]
out = np.zeros((rows, cols))
for row in range(0, rows):
out[row, :] = np.sum(array - array[row, :], axis=0)
return out
def main():
data = np.load('data.npy')
out = _process(data)
np.save('vianumpy.npy', out)
Version 2 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row
cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
for row in range(0, rows):
out[row, :] = np.sum(array - array[row, :], axis=0)
return out
def main():
cdef np.ndarray[DTYPE_t, ndim=2] data
cdef np.ndarray[DTYPE_t, ndim=2] out
data = np.load('data.npy')
out = _process(data)
np.save('viacynpy.npy', out)
Version 3 (building a module with cython)
import cython
cimport cython
import numpy as np
cimport numpy as np
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
cdef _process(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row
cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
for row in range(0, rows):
for col in range(0, cols):
for row2 in range(0, rows):
out[row, col] += array[row2, col] - array[row, col]
return out
def main():
cdef np.ndarray[DTYPE_t, ndim=2] data
cdef np.ndarray[DTYPE_t, ndim=2] out
data = np.load('data.npy')
out = _process(data)
np.save('vialoop.npy', out)
With a 10000x10 matrix saved in data.npy, the times are:
$ python -m timeit -c "from version1 import main;main()"
10 loops, best of 3: 4.56 sec per loop
$ python -m timeit -c "from version2 import main;main()"
10 loops, best of 3: 4.57 sec per loop
$ python -m timeit -c "from version3 import main;main()"
10 loops, best of 3: 2.96 sec per loop
Is this expected or is there an optimization that I am missing? The fact that version 1 and 2 gives the same result is somehow expected, but why version 3 is faster?
Ps.- This is NOT the calculation that I need to make, just a simple example that shows the same thing.

With slight modification, version 3 becomes twice as fast:
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
cdef np.ndarray[DTYPE_t, ndim=2] out = np.empty((rows, cols))
for row in range(rows):
for row2 in range(rows):
for col in range(cols):
out[row, col] += array[row2, col] - array[row, col]
return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
If you need to use this function on small input arrays then you can reduce the overhead by using np.empty instead of np.ones. To reduce the overhead further use PyArray_EMPTY from the numpy C API.
If you use this function on very large input arrays (2**31) then the integers used for indexing (and in the range function) will overflow. To be safe use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process is your version 3.

As mentioned in the other answers, version 2 is essentially the same as version 1 since cython is unable to dig into the array access operator in order to optimise it. There are 2 reasons for this
First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However this overhead will become less significant if each operation deals with large arrays
Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.

I would recommend using the -a flag to have cython generate the html file that shows what is being translated into pure c vs calling the python API:
http://docs.cython.org/src/quickstart/cythonize.html
Version 2 gives nearly the same result as Version 1, because all of the heavy lifting is being done by the Python API (via numpy) and cython isn't doing anything for you. In fact on my machine, numpy is built against MKL, so when I compile the cython generated c code using gcc, Version 3 is actually a little slower than the other two.
Cython shines when you are doing an array manipulation that numpy can't do in a 'vectorized' way, or when you are doing something memory intensive that it allows you to avoid creating a large temporary array. I've gotten 115x speed-ups using cython vs numpy for some of my own code:
https://github.com/synapticarbors/pylangevin-integrator
Part of that was calling randomkit directory at the level of the c code instead of calling it through numpy.random, but most of that was cython translating the computationally intensive for loops into pure c without calls to python.

The difference may be due to version 1 and 2 doing a Python-level call to np.sum() for each row, while version 3 likely compiles to a tight, pure C loop.
Studying the difference between version 2 and 3's Cython-generated C source should be enlightening.

I'd guess the main overhead you are saving is the temporary arrays created. You create a great big array array - array[row, :], then reduce it into a smaller array using sum. But building that big temporary array won't be free, especially if you need to allocate memory.

Related

Cython loop over array of indexes

I would like to do a series of operations on particular elements of matrices. I need to define the indices of these elements in an external object (self.indices in the example below).
Here is a stupid example of implementation in cython :
%%cython -f -c=-O2 -I./
import numpy as np
cimport numpy as np
cimport cython
cdef class Test:
cdef double[:, ::1] a, b
cdef Py_ssize_t[:, ::1] indices
def __cinit__(self, a, b, indices):
self.a = a
self.b = b
self.indices = indices
#cython.boundscheck(False)
#cython.nonecheck(False)
#cython.wraparound(False)
#cython.initializedcheck(False)
cpdef void run1(self):
""" Use of external structure of indices. """
cdef Py_ssize_t idx, ix, iy
cdef int n = self.indices.shape[0]
for idx in range(n):
ix = self.indices[idx, 0]
iy = self.indices[idx, 1]
self.b[ix, iy] = ix * iy * self.a[ix, iy]
#cython.boundscheck(False)
#cython.nonecheck(False)
#cython.wraparound(False)
#cython.initializedcheck(False)
cpdef void run2(self):
""" Direct formulation """
cdef Py_ssize_t idx, ix, iy
cdef int nx = self.a.shape[0]
cdef int ny = self.a.shape[1]
for ix in range(nx):
for iy in range(ny):
self.b[ix, iy] = ix * iy * self.a[ix, iy]
with this on the python side :
import itertools
import numpy as np
N = 256
a = np.random.rand(N, N)
b = np.zeros_like(a)
indices = np.array([[i, j] for i, j in itertools.product(range(N), range(N))], dtype=int)
test = Test(a, b, indices)
and the results :
%timeit test.run1()
75.6 µs ± 1.51 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit test.run2()
41.4 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Why does the Test.run1() method run much slower than the Test.run2() method?
What are the possibilities to keep a similar level of performance as with Test.run2() by using an external list, array, or any other kind of structure of indices?

Because run1 is significantly more complicated...
run1 is having to read from two separate bits of memory which almost certainly makes the CPU cache less efficient.
It's fairly trivial for the compiler to work out exactly what order it's accessing the array elements in run2. In contrast in run1 it could be accessing them in any order. That likely allows for significant optimizations.
Your current performance is probably as good as it gets.

In addition of the good #DavidW answer, note that run2 is SIMD-friendly as opposed to run1. This means a compiler can easily generate SIMD instruction in run2 so to read multiple packed items from memory, multiply multiple items in a row thanks to packed SIMD instructions and write the packed items into memory. If the array is small enough to fit in CPU caches, which is the case here, the SIMD computation can be very fast. Indeed, nearly all modern x86-64 processors support the 256-bit wide AVX/AVX-2 instruction set that can operate on 8 32-bit integers in a row and 4 double-precision floating-point numbers. Additionally, such a code can easily be unrolled and well pipelined by modern processors. Hardware prefetchers are also optimized for this kind of use-case.
Meanwhile, run1 do indirect memory accesses. Compilers can hardly assume they are actually contiguous and generate packed loads/stores (this is very unlikely in most codes and this is up to developers to write this kind optimization). The indirection require multiple load instructions that saturate the load ports and make the overall computation at least twice slower. AVX-2 have a gather instruction that can theoretically help for such a case. That being said, the instruction is currently not well efficiently implemented on current Intel/AMD processors (it basically does scalar loads internally, saturating the load ports). Still, it should certainly make run1 runs as fast as run2 if the later is not vectorized (otherwise run2 should sharply outperform run1 even with gather instructions). Compilers unfortunately have a hard time using such instructions yet.
In fact, regarding the code and the timing, run2 should be even faster if SIMD instruction would be used. I think this is not the case and this is certainly because the -O2 optimization level is currently set in your code and compilers like GCC does not yet automatically vectorize the code (unless with the very last version AFAIK). Please consider using -O3. Please also consider enabling -mavx and -mavx2 if possible (this assume the target processor are not too old) as it should make the code faster.

Cython initialize matrix with zeros

Description
I simply want to create a matrix of rows x cols that is filled with 0s. Always working with numpy I thought using np.zeros as described in the docs is the easiest:
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def f1():
cdef:
int dim = 40000
int i, j
np.ndarray[DTYPE_t, ndim=2] mat = np.zeros([40000, 40000], dtype=DTYPE)
for i in range(dim):
for j in range(dim):
mat[i, j] = 1
Then I compared this using the arrays in c:
def f2():
cdef:
int dim = 40000
int[40000][40000] mat
int i, j
for i in range(dim):
for j in range(dim):
mat[i][j] = 1
The numpy version took 3 secs on my pc whereas the c version only took2.4e-5 secs. However when I return the array from f2() I noticed it is not zero filled (of course here it can't be, i==j however when not filling it it won't return a 0 array either). How can this be done in cython. I know in regular C it would be like: int arr[n][m] = {};.
Question
How can the c array be filled with 0s? (I would go for numpy instead if there is something obvious wrong in my code)

You do not want to be writing code like this:
int[40000][40000] mat generates a 6 gigabyte array on the stack (assuming 4 byte ints). Typically maximum stack sizes are of the order of a few Mb. I have no idea how this isn't crashing your PC.
However when I return the array from f2() [...]
The array you have allocated is completely local to the function. From a C point of view you cannot return it since it ceases to exist after the function has finished. I think Cython may convert it to a (nested) Python list for you. This requires a slow copy element-by-element and is not what you want.
For what you're doing here you're much better just using Numpy.
Cython doesn't support a good equivalent of the C arr = {} so if you do want initialize sensible, small C arrays you need to use of one:
loops,
memset (which you can cimport from libc.string),
Create a typed memoryview of it and do memview[:,:] = 0
The numpy version took 3 secs on my pc whereas the c version only took2.4e-5 secs.
This kind of difference usually suggests that the C compiler has optimized some code out (by detecting that the result is unused). It is unlikely to be a genuine speed-up.

Should a Cython memory view of indexes be of type Py_ssize_t or int?

I have cython code that takes a 2d numpy.ndarray of data (M) and a numpy.ndarray of indexes (Ixs). It loops through the entries of Ixs and uses the values ix of Ixs to index columns of M. See the code below:
def foo(double[:, ::1] M, int[:, ::1] Ixs):
cdef int rows = M.shape[0]
cdef int cols = M.shape[1]
cdef Py_ssize_t c, r
for c in range(rows):
for r in range(cols):
ix = Ixs[c, r]
dosomething(M[c, ix])
I know that I am supposed to use Py_ssize_t as a type for indexes (I have read it is to accommodate for 64 bit architectures) but right now I am using a memory view of type int... In this case I don't see a way to create a numpy.ndarray of Py_ssize_t so that ix is Py_ssize_t.
What is the correct way to write this cython code? Is there any problem in using int?

One thing to note, you will want to type ix
Your code as written will work OK, M[c, ix] will cast ix from an int to Py_ssize_t, which should always be a safe conversion.
That said, you can and probably should have your indexer array be of Py_ssize_t. The corresponding numpy type is np.intp
https://docs.scipy.org/doc/numpy-1.13.0/user/basics.types.html

Cython with numpy how to get rid of fancy indexing (no call to Python)

I want to release the GIL inside a for loop on a 3-dimensional numpy array
cdef np.ndarray[DTYPE_t,ndim=3] array=np.ones((10000000,4,2))
cdef np.ndarray[DTYPE_t,ndim=2] sliced_array
cdef int i
cdef int N=array.shape[0]
for i in range(N):
sliced_array=array[i]
#perform computations on slice
When I look at the html produced by Cython it looks like it is calling Python when it is doing sliced_array=array[i] I guess it is because it infers the size of the two other dimensions but even when using typed ranges for the second and third axis this line is still yellow !
sliced_array=array[i,typed_slice_x,typed_slice_y]

One of the advantages of the newer memoryview syntax over declaring things as numpy arrays is that you can do indexing operations without the GIL:
cdef double[:,:,:] array=np.ones((10000000,4,2))
cdef double[:,:] sliced_array
cdef int i
cdef int N=array.shape[0]
for i in range(N):
with nogil: # just to prove the point
sliced_array=array[i,:,:]
If you declare them as cdef np.ndarray then you can't easily avoid needing the GIL for indexing.

Improve cython array indexing speed

I'm have a pretty simple function which I need to speed up. Essentially I have a big array of 16 bit numbers with some holes in it. (About 10%) I need to traverse the array, find areas where there are 2 0's in a row, then fill them in with the average of the previous and next elements. This takes only a few milliseconds in C, but Python is doing way worse.
I've switched from regular python arrays to numpy arrays, and then compiled my code using cython, but I'm still really far from my target. I was hoping someone with more experience might look at what I'm doing and give me some feedback.
My regular python code looks like this:
self.rawData = numpy.fromfile(ql, numpy.uint16, 50000)
[snip]
def fixZeroes(self):
for x in range(2,len(self.rawData)):
if self.rawData[x] == 0 and self.rawData[x-1] == 0:
self.rawData[x] = (self.rawData[x-2] + self.rawData[x+2]) / 2
self.rawData[x-1] = (self.rawData[x-3] + self.rawData[x+1]) /2
My Cython code looks very similar:
import numpy as np
cimport numpy as np
DTYPE = np.uint16
ctypedef np.uint16_t DTYPE_t
#cython.boundscheck(False)
def fix_zeroes(np.ndarray[DTYPE_t, ndim=1] raw):
assert raw.dtype == DTYPE
cdef int len = 50000
for x in range(2,len):
if raw[x] == 0 and raw[x-1] == 0:
raw[x] = (raw[x-2] + raw[x+2]) / 2
raw[x-1] = (raw[x-3] + raw[x+1]) /2
return raw
When I run this code, the performance is still way slower than I'd like:
Starting cython zero fix
Finished: 0:00:36.983681
starting python zero fix
Finished: 0:00:41.434476
I really think I must be doing something wrong. Most every article I've seen talks about the huge performance gains numpy and cython add, but I'm barely breaking 10%.

You should declare the x variable that you are using to index the raw array:
cdef int x
you can also use other directives that usually provide a performance boost:
#cython.wraparound(False)
#cython.cdivision(True)
#cython.nonecheck(False)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.