Cython: declare list-like function parameter

I'm trying to create a simple cython module and have the following problem. I would like to create a function like:
cdef float calc(float[:] a1, float[:] a2):
    cdef float res = 0
    cdef int l = len(a2)
    cdef float item_a2
    cdef float item_a1
    for idx in range(l):
        if a2[idx] > 0:
            item_a2 = a2[idx]
            item_a1 = a1[idx]
            res += item_a2 * item_a1
    return res
When the function is executed, the a1 and a2 parameters are Python lists, so I get the error:
TypeError: a bytes-like object is required, not 'list'
I just need to make such calculations and nothing more. But how should I declare the input parameters float[:] a1 and float[:] a2 if I want to maximize the speed-up from C?
Do I perhaps need to convert the lists to arrays manually?
P.S. I would also appreciate an explanation of whether it is necessary (in terms of performance) to declare cdef float item_a2 explicitly for the multiplication, or whether writing result += a2[idx] * a1[idx] directly is equivalent.

cdef float calc(float[:] a1, float[:] a2):
a1 and a2 can be any object that supports the buffer protocol and has a float type. The most common examples would be either a numpy array or the standard library array module. They will not accept Python lists because a Python list is not a single homogeneous C type packed efficiently into memory, but instead a collection of Python objects.
To create a suitable object from a Python list you can do either:
numpy.array([1.0,2.0],dtype=numpy.float32)
array.array('f',[1.0,2.0])
(You may want to consider using double/float64 instead of float for extra precision, but that's your choice)
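For instance, a minimal sketch of the calling side (calcmod is an invented module name for this example, and note that calc would need to be declared cpdef or def, rather than cdef, to be callable from Python at all):
import numpy as np
from calcmod import calc  # hypothetical compiled module containing calc()

a1 = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # or array.array('f', [...])
a2 = np.array([0.5, -1.0, 2.0], dtype=np.float32)
print(calc(a1, a2))  # float[:] parameters accept these buffers directly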
If you don't want to create array objects like this then Cython will not help you much since there is not much speed up possible with plain lists.
The np.ndarray[FLOAT, ndim=1] a1 syntax suggested in the other answer is an outdated version of the memoryview syntax you're already using. There are no advantages (and a few small disadvantages) to using it.
result += a2[idx] * a1[idx]
is fine - Cython knows the types of a1 and a2, so there is no need to create temporary intermediate variables. You can generate an HTML-highlighted file with cython -a filename.pyx; inspecting it will show where the non-accelerated parts of your code are.

Cython answer
One way you can do this (if you're open to using numpy):
import numpy as np
cimport numpy as np

ctypedef np.npy_float FLOAT
ctypedef np.npy_intp INTP

cdef FLOAT calc(np.ndarray[FLOAT, ndim=1, mode='c'] a1,
                np.ndarray[FLOAT, ndim=1, mode='c'] a2):
    cdef FLOAT res = 0
    cdef INTP l = a2.shape[0]
    cdef FLOAT item_a2
    cdef FLOAT item_a1
    for idx in range(l):
        if a2[idx] > 0:
            item_a2 = a2[idx]
            item_a1 = a1[idx]
            res += item_a2 * item_a1
    return res
This will require a np.float32 dtype for your array. If you wanted a np.float64, you can redefine FLOAT as np.float64_t.
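For example, that would be the only line to change:
ctypedef np.float64_t FLOAT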
One unsolicited piece of advice... l is a bad name for a variable, since it looks like a digit. Consider renaming it length, or something of the like.
Pure python with Numpy
Finally, it looks like you're trying to compute the dot product of two vectors, restricted to the positions where one of the arrays is positive. You could use NumPy here pretty efficiently to get the same result.
>>> import numpy as np
>>> a1 = np.array([0, 1, 2, 3, 4, 5, 6])
>>> a2 = np.array([1, 2, 0, 3, -1])
>>> a1[:a2.shape[0]].dot(np.maximum(a2, 0))
11
Note, I added the a1 slice since you didn't check for length equality in your Cython function, but used a2's length. So I assumed the lengths may differ.

Related

fast access of sparse matrix in cython: memoryview vs vector of dictionaries

I used Cython to speed up the bottleneck in my Python code. The task is to compute the selective inverse (below, S) of a sparse matrix given by its Cholesky factorization, provided in csc format (data, indptr, indices). But the task itself is not really important; in the end it is a triply nested for-loop where I have to access elements of S fast.
When I use a memoryview of a full/huge matrix
double[:,:] Sfull
and access the entries, the algorithm is quite fast and meets my expectations. But clearly, this is only possible when the matrix Sfull fits into memory.
My approach was to use a list/vector of dictionaries/maps, so that I can still access the elements relatively fast.
cdef vector[map[int, double]] S
It turned out that accessing the elements inside the loop with this data structure is around 20 times slower. Is this expected, or is there another issue? Do you see any other data structure I could use?
Thank you very much for any comments or help!
Best,
Manuel
Below is the Cython code, where the version with the full memoryview is commented out.
cdef int invTakC12(double[:] id_diag, double[:] data, int len_i,
                   int[:] indptr, int[:] indices, double[:, :] Sfull):
    cdef vector[map[int, double]] S = testDictC(len_i - 1)  # list of empty dicts
    cdef int i, j, j_i, lc
    cdef double q
    for i in range(len_i - 2, -1, -1):
        for j_i in range(indptr[i+1] - 1, indptr[i] - 1, -1):
            j = indices[j_i]
            q = 0
            for lc in range(indptr[i+1] - 1, indptr[i], -1):
                q += data[lc] * S[j][indices[lc]]
                #q += data[lc] * Sfull[indices[lc], j]
            S[j][i] = -q
            #Sfull[i,j] = -q
            if i == j:
                S[j][i] += id_diag[i]
                #Sfull[i,j] += id_diag[i]
            else:
                S[i][j] -= q
                #Sfull[j,i] -= id_diag[i]
    return 0
You can access the three arrays of the csc matrix independently as typed memoryviews - e.g.:
cdef double[:] S_data = S.data
cdef np.int32_t[:] S_ind = S.indices
cdef np.int32_t[:] S_indptr = S.indptr
If that's too inconvenient, you can put them in a C struct as pointers:
cdef struct mapped_csc:
    double *data
    np.int32_t *indices
    np.int32_t *indptr
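If it helps, here is a minimal sketch of filling such a struct from the memoryviews above (make_mapped_csc is an invented helper name; the pointers are only valid while the underlying scipy arrays stay alive):
cdef mapped_csc make_mapped_csc(double[:] data, np.int32_t[:] indices, np.int32_t[:] indptr):
    cdef mapped_csc m
    # Take the address of the first element of each memoryview buffer
    m.data = &data[0]
    m.indices = &indices[0]
    m.indptr = &indptr[0]
    return m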

Loop over a Numpy array with Cython

Let a and b be two numpy.float arrays of length 1024, defined with
cdef numpy.ndarray a
cdef numpy.ndarray b
I notice that:
cdef int i
for i in range(1024):
    b[i] += a[i]
is considerably slower than:
b += a
Why?
I really need to be able to loop manually over arrays.
The difference will be smaller if you tell Cython the data type and the number of dimensions for a and b:
cdef numpy.ndarray[np.float64_t, ndim=1] a, b
Although the difference will be smaller, you won't beat b += a, because it uses NumPy's SIMD-accelerated routines (how much they help depends on whether your CPU supports SIMD).
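For reference, a minimal sketch of the fully typed manual loop, written with the newer memoryview syntax (add_inplace is an invented name; the directive comment at the top disables bounds and wraparound checking for the whole module):
# cython: boundscheck=False, wraparound=False

def add_inplace(double[:] a, double[:] b):
    # With typed buffers and a typed index, each b[i] += a[i]
    # compiles down to a plain C array access.
    cdef Py_ssize_t i
    for i in range(b.shape[0]):
        b[i] += a[i]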

No speed gains from Cython again?

The following is my Cython code; the purpose is to do a bootstrap.
def boots(int trial, np.ndarray[double, ndim=2] empirical, np.ndarray[double, ndim=2] expected):
    cdef int length = len(empirical)
    cdef np.ndarray[double, ndim=2] ret = np.empty((trial, 100))
    cdef np.ndarray[long] choices
    cdef np.ndarray[double] m
    cdef np.ndarray[double] n
    cdef long o
    cdef int i
    cdef int j
    for i in range(trial):
        choices = np.random.randint(0, length, length)
        m = np.zeros(100)
        n = np.zeros(100)
        for j in range(length):
            o = choices[j]
            m.__iadd__(empirical[o])
            n.__iadd__(expected[o])
        empirical_boot = m / length
        expected_boot = n / length
        ret[i] = empirical_boot / expected_boot - 1
    ret.sort(axis=0)
    return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
# test code
empirical = np.ones((40000, 100))
expected = np.ones((40000, 100))
%prun -l 10 boots(100, empirical,expected)
It takes 11 seconds in pure Python with fancy indexing, and no matter how hard I tune it in Cython, it stays the same.
np.random.randint(0, 40000, 40000) takes 1 ms, so 100 calls take 0.1 s.
np.sort(np.ones((40000, 100))) takes 0.2 s.
Thus I feel there must be ways to improve boots.
The primary issue you are seeing is that Cython only optimizes single-item access for typed arrays. This means that every line in your code that uses NumPy vectorization still involves creating and interacting with Python objects.
The code you have there wasn't faster than the pure Python version because it wasn't really doing any of the computation differently.
You will have to avoid this by writing out the looping operations explicitly.
Here is a modified version of your code that runs significantly faster.
from numpy cimport ndarray as ar
from numpy cimport int32_t as int32
from numpy import empty
from numpy.random import randint
cimport cython

# Notice the use of these decorators to tell Cython to turn off
# some of the checking it does when accessing arrays.
@cython.boundscheck(False)
@cython.wraparound(False)
def boots(int32 trial, ar[double, ndim=2] empirical, ar[double, ndim=2] expected):
    cdef:
        int32 length = empirical.shape[0], i, j, k
        int32 o
        ar[double, ndim=2] ret = empty((trial, 100))
        ar[int32] choices
        ar[double] m = empty(100), n = empty(100)
    for i in range(trial):
        # Still calling Python on this line
        choices = randint(0, length, length)
        # It was faster to compute m and n separately.
        # I suspect that has to do with cache management.
        # Instead of allocating new arrays, I just filled the old ones with the new values.
        o = choices[0]
        for k in range(100):
            m[k] = empirical[o,k]
        for j in range(1, length):
            o = choices[j]
            for k in range(100):
                m[k] += empirical[o,k]
        o = choices[0]
        for k in range(100):
            n[k] = expected[o,k]
        for j in range(1, length):
            o = choices[j]
            for k in range(100):
                n[k] += expected[o,k]
        # Here I simplified some of the math and got rid of temporary arrays
        for k in range(100):
            ret[i,k] = m[k] / n[k] - 1.
    ret.sort(axis=0)
    return ret[int(trial * 0.025)].reshape((10,10)), ret[int(trial * 0.975)].reshape((10,10))
If you want to see which lines of your code involve Python calls, the Cython compiler can generate an HTML file showing them.
This option is called annotation.
How you use it depends on how you are compiling your Cython code.
If you are using the IPython notebook, just add the --annotate flag to the Cython cell magic.
You may also be able to benefit from turning on the C compiler optimization flags.
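For example, in the IPython notebook both suggestions can be combined in the cell magic (a sketch; -O3 is just one common optimization flag):
%load_ext Cython

%%cython --annotate --compile-args=-O3
# ... the Cython code for boots() goes here ...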

Optimizing numpy.dot with Cython

I have the following piece of code which I'd like to optimize using Cython:
sim = numpy.dot(v1, v2) / (sqrt(numpy.dot(v1, v1)) * sqrt(numpy.dot(v2, v2)))
dist = 1-sim
return dist
I have written and compiled the .pyx file, but when I run the code I do not see any significant improvement in performance. According to the Cython documentation, I have to add C types. The HTML file generated by Cython indicates that the bottleneck is the dot products (which is expected, of course). Does this mean that I have to define a C function for the dot products? If so, how do I do that?
EDIT:
After some research I have come up with the following code. The improvement is only marginal. I am not sure if there is something I can do to improve it:
from __future__ import division
import numpy as np
import math as m
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double c_sqrt "sqrt"(double)

ctypedef np.float_t reals  # typedef for easier reading

cdef inline double dot(np.ndarray[reals, ndim=1] v1, np.ndarray[reals, ndim=1] v2):
    cdef double result = 0
    cdef int i = 0
    cdef int length = v1.size
    cdef double el1 = 0
    cdef double el2 = 0
    for i in range(length):
        el1 = v1[i]
        el2 = v2[i]
        result += el1 * el2
    return result

@cython.cdivision(True)
def distance(np.ndarray[reals, ndim=1] ex1, np.ndarray[reals, ndim=1] ex2):
    cdef double dot12 = dot(ex1, ex2)
    cdef double dot11 = dot(ex1, ex1)
    cdef double dot22 = dot(ex2, ex2)
    cdef double sim = dot12 / (c_sqrt(dot11 * dot22))
    cdef double dist = 1 - sim
    return dist
As a general note, if you are calling numpy functions from within cython and doing little else, you generally will see only marginal gains if any at all. You generally only get massive speed-ups if you are statically typing code that makes use of an explicit for loop at the python level (not in something that is calling the Numpy C-API already).
You could try writing out the code for a dot product with all of the static typing of the counter, input numpy arrays, etc, with wraparound and boundscheck set to False, import the clib version of the sqrt function and then try to leverage the parallel for loop (prange) to make use of openmp.
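A minimal sketch of that suggestion (cosine_distance is an invented name; prange requires compiling and linking with OpenMP, e.g. -fopenmp):
# cython: boundscheck=False, wraparound=False
from cython.parallel import prange
from libc.math cimport sqrt

def cosine_distance(double[:] v1, double[:] v2):
    cdef double dot12 = 0, dot11 = 0, dot22 = 0
    cdef Py_ssize_t i
    # prange releases the GIL and splits the iterations across OpenMP
    # threads; Cython treats the += variables as thread-local reductions.
    for i in prange(v1.shape[0], nogil=True):
        dot12 += v1[i] * v2[i]
        dot11 += v1[i] * v1[i]
        dot22 += v2[i] * v2[i]
    return 1 - dot12 / sqrt(dot11 * dot22)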
You can change the expression
sim = numpy.dot(v1, v2) / (sqrt(numpy.dot(v1, v1)) * sqrt(numpy.dot(v2, v2)))
to
sim = numpy.dot(v1, v2) / sqrt(numpy.dot(v1, v1) * numpy.dot(v2, v2))

Optimizing my Cython/Numpy code? Only a 30% performance gain so far

Is there anything I've forgotten to do here in order to speed things up a bit? I'm trying to implement an algorithm described in a book called Tuning, Timbre, Spectrum, Scale. Also, if all else fails, is there a way for me to just write this part of the code in C and then be able to call it from Python?
import numpy as np
cimport numpy as np

# DTYPE = np.float
ctypedef np.float_t DTYPE_t

np.seterr(divide='raise', over='raise', under='ignore', invalid='raise')

"""
I define a timbre as the following 2d numpy array:
[[f0, a0], [f1, a1], [f2, a2]...] where f describes the frequency
of the given partial and a is its amplitude from 0 to 1. Phase is ignored.
"""

# Test Timbre
# cdef np.ndarray[DTYPE_t,ndim=2] t1 = np.array([[440,1],[880,.5],[(440*3),.333]])

# Calculates the inherent dissonance of one timbre of the above form
# using the diss2Partials function
cdef DTYPE_t diss1Timbre(np.ndarray[DTYPE_t,ndim=2] t):
    cdef DTYPE_t runningDiss1
    runningDiss1 = 0.0
    cdef unsigned int len = np.shape(t)[0]
    cdef unsigned int i
    cdef unsigned int j
    for i from 0 <= i < len:
        for j from i+1 <= j < len:
            runningDiss1 += diss2Partials(t[i], t[j])
    return runningDiss1

# Calculates the dissonance between two timbres of the above form
cdef DTYPE_t diss2Timbres(np.ndarray[DTYPE_t,ndim=2] t1, np.ndarray[DTYPE_t,ndim=2] t2):
    cdef DTYPE_t runningDiss2
    runningDiss2 = 0.0
    cdef unsigned int len1 = np.shape(t1)[0]
    cdef unsigned int len2 = np.shape(t2)[0]
    runningDiss2 += diss1Timbre(t1)
    runningDiss2 += diss1Timbre(t2)
    cdef unsigned int i1
    cdef unsigned int i2
    for i1 from 0 <= i1 < len1:
        for i2 from 0 <= i2 < len2:
            runningDiss2 += diss2Partials(t1[i1], t2[i2])
    return runningDiss2

cdef inline DTYPE_t float_min(DTYPE_t a, DTYPE_t b): return a if a <= b else b

# Calculates the dissonance of two partials of the form [f,a]
cdef DTYPE_t diss2Partials(np.ndarray[DTYPE_t,ndim=1] p1, np.ndarray[DTYPE_t,ndim=1] p2):
    cdef DTYPE_t f1 = p1[0]
    cdef DTYPE_t f2 = p2[0]
    cdef DTYPE_t a1 = abs(p1[1])
    cdef DTYPE_t a2 = abs(p2[1])
    # In order to ensure that f2 > f1:
    if f2 < f1:
        (f1, f2, a1, a2) = (f2, f1, a2, a1)
    # Constants of the dissonance curves
    cdef DTYPE_t _xStar
    _xStar = 0.24
    cdef DTYPE_t _s1
    _s1 = 0.021
    cdef DTYPE_t _s2
    _s2 = 19
    cdef DTYPE_t _b1
    _b1 = 3.5
    cdef DTYPE_t _b2
    _b2 = 5.75
    cdef DTYPE_t a = float_min(a1, a2)
    cdef DTYPE_t s = _xStar/(_s1*f1 + _s2)
    return (a * (np.exp(-_b1*s*(f2-f1)) - np.exp(-_b2*s*(f2-f1))))

cpdef dissTimbreScale(np.ndarray[DTYPE_t,ndim=2] t, np.ndarray[DTYPE_t,ndim=1] s):
    cdef DTYPE_t currDiss
    currDiss = 0.0
    cdef unsigned int i
    for i from 0 <= i < s.size:
        currDiss += diss2Timbres(t, transpose(t, s[i]))
    return currDiss

cdef np.ndarray[DTYPE_t,ndim=2] transpose(np.ndarray[DTYPE_t,ndim=2] t, DTYPE_t ratio):
    return np.dot(t, np.array([[ratio,0],[0,1]]))
Here are some things that I noticed:
1. Use t1.shape[0] instead of np.shape(t1)[0], and so on in other places.
2. Don't use len as a variable name because it is a built-in function in Python (not for speed, but for good practice). Use L or something like that.
3. Don't pass two-element arrays to functions unless you really need to. Cython checks the buffer every time you pass an array. So, instead of diss2Partials(t[i], t[j]), call diss2Partials(t[i,0], t[i,1], t[j,0], t[j,1]) and redefine diss2Partials appropriately.
4. Don't use abs, or at least not the Python one. It has to convert your C double to a Python float, call the abs function, then convert back to a C double. It would probably be better to make an inlined function like you did with float_min.
5. Calling np.exp does a similar thing to using abs. Change np.exp to exp and add from libc.math cimport exp to your imports at the top.
6. Get rid of the transpose function completely. The np.dot is really slowing things down, and there is no need for matrix multiplication here anyway. Rewrite your dissTimbreScale function to create an empty matrix, say t2. Before the current loop, set the second column of t2 equal to the second column of t (using a loop, preferably, though you could probably get away with a NumPy operation here). Then, inside the current loop, add a loop that sets the first column of t2 equal to the first column of t times s[i]. That's what your matrix multiplication was really doing. Then just pass t2 as the second parameter to diss2Timbres instead of the one returned by transpose (a sketch follows after the next paragraph).
Do 1-5 first because they are rather easy. Number 6 may take a little more time, effort and maybe experimentation, but I suspect that it may also give you a significant boost in speed.
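A minimal sketch of suggestion 6, keeping the names from the question (untested, and assuming np.empty's default float64 matches DTYPE_t):
cpdef dissTimbreScale(np.ndarray[DTYPE_t,ndim=2] t, np.ndarray[DTYPE_t,ndim=1] s):
    cdef DTYPE_t currDiss = 0.0
    cdef unsigned int i, k
    cdef unsigned int n = t.shape[0]
    cdef np.ndarray[DTYPE_t,ndim=2] t2 = np.empty((n, 2))
    # The amplitude column never changes, so copy it once up front
    for k in range(n):
        t2[k,1] = t[k,1]
    for i in range(s.size):
        # Scaling the frequency column by s[i] is all that the matrix
        # multiplication in transpose() was really doing
        for k in range(n):
            t2[k,0] = t[k,0] * s[i]
        currDiss += diss2Timbres(t, t2)
    return currDiss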
In your code:
for i from 0 <= i < len:
    for j from i+1 <= j < len:
        runningDiss1 += diss2Partials(t[i], t[j])
return runningDiss1
bounds checking is performed for each array lookup. Use the decorator @cython.boundscheck(False) before the function, and cast i and j to an unsigned int type before using them as indices. Look up the Cython for NumPy tutorial for more info.
I would profile your code in order to see which function takes the most time. If it is diss2Timbres you may benefit from the package "numexpr".
I compared Python/Cython and Numexpr for one of my functions (link to SO). Depending on the size of the array, numexpr outperformed both, Cython and Fortran.
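For instance, the core of diss2Partials is a pure elementwise formula, so a hedged numexpr sketch might look like this (a, s and d stand for precomputed per-pair arrays of amplitude minima, curve scales and frequency differences, a restructuring the answer above only hints at):
import numexpr as ne
import numpy as np

# Hypothetical precomputed per-pair arrays, all the same shape
a = np.random.rand(1000)   # amplitude minima
s = np.random.rand(1000)   # curve scale factors
d = np.random.rand(1000)   # frequency differences f2 - f1

# Evaluates the whole dissonance curve in one multithreaded pass
diss = ne.evaluate("a * (exp(-3.5*s*d) - exp(-5.75*s*d))")
total = diss.sum()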
NOTE: Just figured out this post is really old...
