I have a simple question, why this is not efficient:
import numpy as np
cimport numpy as c_np
import cython
def function():
cdef c_np.ndarray[double, ndim=2] A = np.random.random((10,10))
cdef c_np.ndarray[double, ndim=1] slice
slice = A[1,:] #this line is marked as slow by the profiler cython -a
return
How should I slice a numpy matrix in python without overhead.
In my code, A is an adjacency matrix, so the slices are the neighbours in my routing algorithm.
The lines marked by the annotator are only suggestions, not based on actual profiling. I think it uses a relatively simple heuristic, something like number of python api calls. It also does not take into account the number of times something is called - yellow lines inside tight loops are much more important than something called once.
In this case, what you are doing is fairly efficient - one call to numpy to get the sliced array, and the assignment of that array to a buffer.
The generated C code looks like it may be better using the memoryview syntax which is functionally equivalent, but you would have to profile to know for sure if this is actually faster.
%%cython -a
import numpy as np
cimport numpy as c_np
import cython
def function():
cdef double[:, :] A = np.random.random((10,10))
cdef double[:] slice
slice = A[1,:]
return
At the risk of repeating an already good answer (and not answering the question!), I'm going to point out a common misunderstanding with the annotator... Yellow indicates a lot of interacting with the python interpreter. That (and the code you see when expanded) is a really useful hint when optimizating.
However! To quote from the first paragraph in every document on code optimization ever:
Profile first, then optimize. Seriously: don't guess, profile first.
And the annotations are definitely not a profile. Check out the cython docs on profiling and maybe this answer for line profiling How to profile cython functions line-by-line
As an example my code has some bright yellow where it calls some numpy functions on big arrays. There's actually very, very little room there for improvement* as that python interaction overhead is amortized over a lot of computation on those big arrays.
*I might be able to eek out a little by using the numpy C interface directly, but again, amortized over huge computation.
Related
I am fairly new to using Cython and I am interested in using the "Pure Python" mode.
The work that I am doing right now uses numpy extensively and knowing that there is a C api for numpy, I was excited to see what it could do.
As a small test, I put together two small test files, test.py and test.pxd. Their content is as follows:
test.py:
import cython
import numpy as np
#cython.locals(array=np.ndarray)
#cython.returns(np.ndarray)
def test(array):
return np.cumsum(array)
test_array = np.array([1,2,3,4,5])
test(test_array)
test.pxd:
# cython: language_level=3
cimport numpy as np
cdef np.ndarray test(np.ndarray array)
I then compiled these files with cython -a test.py with the hopes that I would see little to no python interaction when calling np.cumsum(). However when I inspected the generated HTML file, I found the following:
From this, it appears that my call to np.cumsum heavily interacts with python, which is something that feels counter-intuitive. My expectation, since I (should) be using the cimported numpy, is that there should be very little python interaction.
My question is "is my intuition correct?". Have I set something up incorrectly with my files that is not allowing the cimported numpy to actually be used for the function call, and that is why I am still seeing so much yellow? Or am I fundamentally misunderstanding something.
Thanks for reading!
Defining the types as np.ndarray mainly improves one thing: it makes indexing them to get single values significantly faster. Almost everything else remains the same speed.
np.cumsum (and any other Numpy function) is called through the standard Python mechanism and runs at exactly the same speed (internally of course it's implemented in C and should be quite quick). Mathematical operator (such add +, -, *, etc.) are also called through Python and remain the same speed.
In reality your wrapping probably makes it slower - it adds an unnecessary type-check (to make sure that the array is an np.ndarray) and an extra layer of indirection.
There is nothing to be gained through typing here.
I am working on an algorithm that leverages Cython and C++ code to speed up a computation. Part of the computation involves keeping track of a 2D matrix, vecs that is D x D' dimensions (e.g. 1000 x 100). The algorithm parallelizes within Cython to set values per column. I am then interested in then obtaining the vecs values as a NumPy array in Python.
Modifying vecs in Cython
The pseudocode for setting each column of vecs is something like:
# this occurs in a Cython/C++ function
for icolumn in range(D'):
for irow in range(D):
vecs[irow, icolumn] = val
Data structure for vecs
To represent such a matrix, I am using a pointer of pointers of type npy_float32 (which I think is just numpy's float 32 type). I have a pointer to pointer array now that looks like this:
ctypedef np.npy_float32 DTYPE_t
cdef DTYPE_t** vecs # (D, D') array of vectors
Goal to obtain the vecs in NumPy Array at the Python Level
I am interested in converting this vecs variable into a numpy array. This is my attempt, but it doesn't work. I'm pretty novice at C++ and Cython.
numpy_vec_arr = np.asarray(<DTYPE_t[:,:]> proj_vecs_arr)
I'm not an expert in NumPy nor Cython, but I do know some about C++ and about interoperating C++ with other languages.
What problem are you trying to solve?
Answering that will prevent the XY problem, and might allow folks here to better help you.
Now, answering your original question. I see two ways to do this.
Use an available constructor of the NumPy array to construct a NumPy array. This will clearly give you a NumPy array, problem solved.
Create an object with the same in-memory structure as the NumPy array that Python expects, then convert that object to a NumPy array. This is tricky to do, bug prone, and depends on many implementation details. For those reasons, I would not suggest taking this approach.
I am working through some code and have come across this:
cdef:
float [::1] embed, feats, doc_embed, mention_embed, best_score
float [:, ::1] s_inp, p_inp
Could someone kindly explain what is being declared here? I am not quite sure if this is a python Slice or a C language specific thing. Please let me know if I can provide any other information.
These are definitions for 1D and 2D typed memoryviews. You can think of them as being numpy arrays. It's generally preferred to use memoryviews these days instead of numpy arrays directly because using memoryviews lets cython generate more efficient code.
I wrote a program using normal Python, and I now think it would be a lot better to use numpy instead of standard lists. The problem is there are a number of things where I'm confused how to use numpy, or whether I can use it at all.
In general how do np.arrays work? Are they dynamic in size like a C++ vector or do I have declare their length and type beforehand like a standard C++ array? In my program I've got a lot of cases where I create a list
ex_list = [] and then cycle through something and append to it ex_list.append(some_lst). Can I do something like with a numpy array? What if I knew the size of ex_list, could I declare and empty one and then add to it?
If I can't, let's say I only call this list, would it be worth it to convert it to numpy afterwards, i.e. is calling a numpy list faster?
Can I do more complicated operations for each element using a numpy array (not just adding 5 to each etc), example below.
full_pallete = [(int(1+i*(255/127.5)),0,0) for i in range(0,128)]
full_pallete += [col for col in right_palette if col[1]!=0 or col[2]!=0 or col==(0,0,0)]
In other words, does it make sense to convert to a numpy array and then cycle through it using something other than for loop?
Numpy arrays can be appended to (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html), although in general calling the append function many times in a loop has a heavy performance cost - it is generally better to pre-allocate a large array and then fill it as necessary. This is because the arrays themselves do have fixed size under the hood, but this is hidden from you in python.
Yes, Numpy is well designed for many operations similar to these. In general, however, you don't want to be looping through numpy arrays (or arrays in general in python) if they are very large. By using inbuilt numpy functions, you basically make use of all sorts of compiled speed up benefits. As an example, rather than looping through and checking each element for a condition, you would use numpy.where().
The real reason to use numpy is to benefit from pre-compiled mathematical functions and data processing utilities on large arrays - both those in the core numpy library as well as many other packages that use them.
Given the thread here
It seems that numpy is not the most ideal for ultra fast calculation. Does anyone know what overhead we must be aware of when using numpy for numerical calculation?
Well, depends on what you want to do. XOR is, for instance, hardly relevant for someone interested in doing numerical linear algebra (for which numpy is pretty fast, by virtue of using optimized BLAS/LAPACK libraries underneath).
Generally, the big idea behind getting good performance from numpy is to amortize the cost of the interpreter over many elements at a time. In other words, move the loops from python code (slow) into C/Fortran loops somewhere in the numpy/BLAS/LAPACK/etc. internals (fast). If you succeed in that operation (called vectorization) performance will usually be quite good.
Of course, you can obviously get even better performance by dumping the python interpreter and using, say, C++ instead. Whether this approach actually succeeds or not depends on how good you are at high performance programming with C++ vs. numpy, and what operation exactly you're trying to do.
Any time you have an expression like x = a * b + c / d + e, you end up with one temporary array for a * b, one temporary array for c / d, one for one of the sums and finally one allocation for the result. This is a limitation of Python types and operator overloading. You can however do things in-place explicitly using the augmented assignment (*=, +=, etc.) operators and be assured that copies aren't made.
As for the specific reason NumPy performs more slowly in that benchmark, it's hard to tell but it probably has to do with the constant overhead of checking sizes, type-marshaling, etc. that Cython/etc. don't have to worry about. On larger problems you'd probably see it get closer.
I can't really tell, but I'd guess there are two factors:
Perhaps numpy is copying more stuff? weave is often faster when you avoid allocating big temporary arrays, but this shouldn't matter here.
numpy has a bit of overhead used in iterating over (possibly) multidimensional arrays. This overhead would normally be dwarfed by number crunching, but an xor is really really fast, so all that really matters is the overhead.
Your sub-question: a = sin(x), how many roundtrips are there.
The trick is to pass a numpy array to sin(x), then there is only one 'roundtrip' for the whole array, since numpy will return an array of sin-values. There is no python for loop involved in this operation.