Cython: Effectively using Numpy in Pure Python Mode

I am fairly new to using Cython and I am interested in using the "Pure Python" mode.
The work that I am doing right now uses numpy extensively, and knowing that there is a C API for numpy, I was excited to see what it could do.
As a small test, I put together two small test files, test.py and test.pxd. Their content is as follows:
test.py:
import cython
import numpy as np

@cython.locals(array=np.ndarray)
@cython.returns(np.ndarray)
def test(array):
    return np.cumsum(array)

test_array = np.array([1, 2, 3, 4, 5])
test(test_array)
test.pxd:
# cython: language_level=3
cimport numpy as np
cdef np.ndarray test(np.ndarray array)
I then compiled these files with cython -a test.py, hoping to see little to no Python interaction when calling np.cumsum(). However, when I inspected the generated HTML file, I found the line calling np.cumsum highlighted bright yellow.
From this, it appears that my call to np.cumsum heavily interacts with Python, which feels counter-intuitive. My expectation, since I should be using the cimported numpy, is that there should be very little Python interaction.
My question is: is my intuition correct? Have I set something up incorrectly in my files, preventing the cimported numpy from actually being used for the function call, and is that why I am still seeing so much yellow? Or am I fundamentally misunderstanding something?
Thanks for reading!

Defining the types as np.ndarray mainly improves one thing: it makes indexing them to get single values significantly faster. Almost everything else remains the same speed.
np.cumsum (and any other Numpy function) is called through the standard Python mechanism and runs at exactly the same speed (internally, of course, it's implemented in C and should be quite quick). Mathematical operators (such as +, -, *, etc.) are also called through Python and remain the same speed.
In reality your wrapping probably makes it slower - it adds an unnecessary type-check (to make sure that the array is an np.ndarray) and an extra layer of indirection.
There is nothing to be gained through typing here.
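To see where typing does pay off, here is a minimal sketch of my own (not from the question): a hand-rolled cumulative sum over a typed memoryview, where each element access compiles to a raw C read instead of a Python call. A whole-array call like np.cumsum gains nothing from typing, but an explicit indexing loop like this does:
import numpy as np

def cumsum_typed(double[:] array):
    # Each array[i] and out[i] below is a plain C memory access;
    # the interpreter is not involved inside the loop.
    cdef Py_ssize_t i, n = array.shape[0]
    out_arr = np.empty(n, dtype=np.float64)
    cdef double[:] out = out_arr
    cdef double total = 0.0
    for i in range(n):
        total += array[i]
        out[i] = total
    return out_arr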

Why is numba not speeding up the following piece of code?

import numpy as np
from numba import jit

@jit(nopython=True)
def sort(x):
    for i in range(1000):
        np.sort(x)
I thought numba was made for these sorts of tasks, where you have for loops combined with numpy operations. Yet this jitted function is 2-3x slower than the pure Python variant (i.e. the same function but without the jit), and yes I have run it after it was compiled.
Am I doing something wrong?
EDIT:
The size of x is len = 5000, and the dtype is int32 or float64 (I tried both).
The Numba implementation is not meant to be faster for relatively big arrays (e.g. > 1024). Indeed, both Numba and Numpy use a compiled sorting algorithm (except that Numba's is JIT-compiled). Numba can only be better here for small arrays, because it can mostly remove the overhead of calling a Numpy function from the CPython interpreter (and of performing many input checks). For an array of size = 5000, the running time is dominated by the sorting calls themselves and not by the overhead of the loop (see below).
Besides this, the two implementations appear to use slightly different algorithms (or at least not the same thresholds). As a result, they perform differently, and the difference depends on the input array: some sorting algorithms are fast on one specific kind of distribution while others are slow on it, and vice versa for other distributions.
Here is the running time of the two implementations plotted against the array size, tested on random arrays on my machine (with 32-bit integers from 0 to 1,000,000,000):
[plot: running time of the Numpy and Numba implementations versus array size]
One can see that Numba is faster for small arrays and slower for big ones. When len = 5000, the Numba implementation is 50% slower.
Note that you can tune the algorithm Numpy uses with the kind parameter of np.sort. Note also that some optimized Numpy implementations use parallelism, so those primitives can run faster; in that case the comparison with the Numba implementation is not fair, as Numba uses a sequential implementation (especially if parallel=True is not set). Besides this, this appears to be a well-known issue, and the Numba developers are working on it.
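For reference, selecting the sorting algorithm is just a keyword argument of np.sort (standard NumPy API; which kind wins depends on your data):
import numpy as np

x = np.random.randint(0, 1_000_000_000, size=5000, dtype=np.int32)

# Available kinds: 'quicksort' (the default introsort),
# 'stable'/'mergesort', and 'heapsort'; their speed depends on the input.
a = np.sort(x, kind='quicksort')
b = np.sort(x, kind='stable')
assert np.array_equal(a, b)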
I wouldn't expect any performance benefit either. Numba isn't a magic wand that magically improves performance just because you add it; it has an overhead that can easily sneak up on you.
It helps to understand what exactly numba does: it parses the AST of a Python function and compiles it to native code using LLVM, and for a lot of non-trivial cases this makes a huge difference because, honestly, Python sucks at complex math and branching. That is a reasonable drawback of its design choices.
Take a look at your code, though. It is a numpy sort function inside a for loop. Think logically about what optimisation numba could possibly make here to speed this up. Remember that numpy is already damn fast and numba can't really affect that performance. So you have essentially added overhead to the most critical part of your code, and hence the loss in performance.
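As a concrete way to see this overhead (my own sketch, not part of the original answer): time the jitted loop against the plain-Python one. np.sort is supported in nopython mode, and the warm-up call keeps compilation time out of the measurement.
import timeit
import numpy as np
from numba import jit

@jit(nopython=True)
def sort_jit(x):
    for i in range(1000):
        np.sort(x)

def sort_py(x):
    for i in range(1000):
        np.sort(x)

x = np.random.randint(0, 1_000_000_000, size=5000).astype(np.int32)
sort_jit(x)  # warm-up call so compilation time is not measured

print("numba :", timeit.timeit(lambda: sort_jit(x), number=10))
print("python:", timeit.timeit(lambda: sort_py(x), number=10))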

two dimensional array slicing in cython

I have a simple question: why is this not efficient?
import numpy as np
cimport numpy as c_np
import cython

def function():
    cdef c_np.ndarray[double, ndim=2] A = np.random.random((10,10))
    cdef c_np.ndarray[double, ndim=1] slice
    slice = A[1,:]  # this line is marked as slow by the profiler cython -a
    return
How should I slice a numpy matrix in Python without overhead?
In my code, A is an adjacency matrix, so the slices are the neighbours in my routing algorithm.
The lines marked by the annotator are only suggestions, not based on actual profiling. I think it uses a relatively simple heuristic, something like the number of Python API calls. It also does not take into account the number of times something is called - yellow lines inside tight loops are much more important than something called once.
In this case, what you are doing is fairly efficient - one call to numpy to get the sliced array, and the assignment of that array to a buffer.
The generated C code looks like it may be better using the memoryview syntax, which is functionally equivalent, but you would have to profile to know for sure whether it is actually faster.
%%cython -a
import numpy as np
cimport numpy as c_np
import cython

def function():
    cdef double[:, :] A = np.random.random((10,10))
    cdef double[:] slice
    slice = A[1,:]
    return
At the risk of repeating an already good answer (and not answering the question!), I'm going to point out a common misunderstanding with the annotator... Yellow indicates a lot of interaction with the Python interpreter. That (and the code you see when expanded) is a really useful hint when optimizing.
However! To quote from the first paragraph of every document on code optimization ever:
Profile first, then optimize. Seriously: don't guess, profile first.
And the annotations are definitely not a profile. Check out the Cython docs on profiling, and maybe this answer for line profiling: How to profile cython functions line-by-line
As an example, my code has some bright yellow where it calls some numpy functions on big arrays. There's actually very, very little room there for improvement*, as that Python-interaction overhead is amortized over a lot of computation on those big arrays.
*I might be able to eke out a little by using the numpy C interface directly, but again, it's amortized over huge computation.
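For instance, turning on Cython's profile hooks in an IPython session looks roughly like this (a minimal sketch under that assumption; summed is a made-up name):
%%cython
# cython: profile=True
# profile=True makes Cython emit hooks so cProfile can time this function
def summed(double[:] data):
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(data.shape[0]):
        total += data[i]
    return total
After that, cProfile.run("summed(np.random.random(10**6))") reports real per-function timings instead of the annotator's guesses.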

Alternatives of fused type in cython

I am working on rewriting a Python module, originally written in C using the Python-C API, in Cython. The module also uses NumPy. A major challenge of the project is to maintain the module's current speed, and it should also work for all NumPy data types. I am thinking of using fused types to make it generic, but I am worried about their bottleneck effect on performance. Are there any other techniques that I could use instead of fused types to achieve both speed and generic code?
Ignoring ali_m's perfectly valid comment about whether you've actually measured your performance issues...
http://docs.cython.org/src/userguide/fusedtypes.html#selecting-specializations
"For a cdef or cpdef function called from Cython this means that the specialization is figured out at compile time. For def functions the arguments are typechecked at runtime, and a best-effort approach is performed to figure out which specialization is needed."
Essentially, if you're calling from Cython there should be no issue - separate functions are generated and used without overhead. If you're calling from Python it obviously has to stop and think about which one to call.
But measure your performance before worrying about it! (And read the manual, which answers your question quite clearly.)
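For illustration, a minimal fused-type sketch (my own example, with made-up names) showing what gets specialized:
ctypedef fused numeric:
    int
    long
    float
    double

cpdef numeric total(numeric[:] data):
    # One C specialization is generated per type listed in `numeric`.
    # A call from Cython with a known type is resolved at compile time;
    # a call from Python type-checks the argument at runtime instead.
    cdef Py_ssize_t i
    cdef numeric acc = 0
    for i in range(data.shape[0]):
        acc += data[i]
    return acc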

How to improve Cython performance?

I am taking my first steps with Cython, and I am wondering how to improve performance even more.
Until now I have got down to half the usual (Python-only) execution time, but I think there must be more!
I know about cython -a, and I have already typed my variables. But there is still a lot of yellow in my function. Is this because Cython does not recognise numpy, or is there something else I am missing?
I believe you can benefit from using math functions from libc, as you are calling np.sqrt and np.floor on scalars. This has not only the Python call overhead, but the numpy ufuncs also have different code paths for scalars and arrays, so that involves at least a type switch.
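Concretely, the swap looks something like this in a .pyx file (a minimal sketch; libc.math ships with Cython):
from libc.math cimport sqrt, floor

def floor_sqrt(double x):
    # sqrt and floor here are the C library functions, compiled to
    # direct C calls with no Python or ufunc dispatch overhead.
    return floor(sqrt(x))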
I think it's not a problem: when I tested with the official tutorial, every np.* line was reported as yellow there too, and involved Python just the same as your code.
Point 3 at the end of that page should have explained this:
Calling NumPy/SciPy functions currently has a Python call overhead; it would be possible to take a short-cut from Cython directly to C. (This does however require some isolated and incremental changes to those libraries; mail the Cython mailing list for details).

Array order in `numpy.dot`

In Python's numerical library NumPy, how does the numpy.dot function deal with arrays of different memory order? numpy.dot(c-order, f-order) vs. dot(f-order, c-order), etc.
The reason I ask is that a long time ago (numpy 1.0.4?) I ran some tests and noticed that numpy.dot performed worse than calling dgemm from scipy.linalg directly with the correct transposition flags, even though both call the same BLAS library internally. (I suspected the reason was copying of the input matrices inside numpy.dot, which is tragic if the input is large.)
Now I tried again, and actually numpy.dot performs the same as dgemm, so there is no reason to keep the arrays in a specific order and set transposition flags manually. Much cleaner code.
So my question is: how does a recent (let's say 1.6.0) numpy.dot work, and what are the guarantees on when things are copied and when not? I'm concerned about 1) memory and 2) performance here. Cheers.
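One way to check this empirically (a sketch of mine using only standard NumPy calls) is to time np.dot across the four order combinations:
import timeit
import numpy as np

n = 2000
c = np.random.random((n, n))                      # C-ordered
f = np.asfortranarray(np.random.random((n, n)))   # Fortran-ordered

# If numpy.dot copies inputs to satisfy BLAS, the mixed-order cases
# will be noticeably slower; if it only sets transposition flags,
# all four timings should stay close.
for a, b, label in [(c, c, "C,C"), (c, f, "C,F"), (f, c, "F,C"), (f, f, "F,F")]:
    t = timeit.timeit(lambda: np.dot(a, b), number=3)
    print(label, round(t, 3))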
Possibly what you were seeing was related to a blas-optimized dot import error being caught and handled silently (this code snippet is from numeric.py):
# try to import blas optimized dot if available
try:
    # importing this changes the dot function for basic 4 types
    # to blas-optimized versions.
    from _dotblas import dot, vdot, inner, alterdot, restoredot
except ImportError:
    # docstrings are in add_newdocs.py
    inner = multiarray.inner
    dot = multiarray.dot
