I'm trying Cython for the first time, hoping to get a speedup on a function that does subtraction and addition on two numpy arrays and a float32. I want this function to be as quick as possible: it's called many times, so speeding it up would be a big win.
def broadcast(w, m, spl):
    """
    w and m are float32 ndarrays, e.g. shape (43,)
    spl is an np.float32 value, e.g. 9722.0
    """
    return w + (m - spl)
My cythonising so far is
import numpy as np
cimport numpy as np
DTYPE = np.float32
ctypedef np.float32_t DTYPE_t
def broadcast(np.ndarray w, np.ndarray m, np.float32 spl):
    return w + (m - spl)
but it returns the error:
'float32' is not a type identifier
I'm not sure why I can't declare the type. Do I need to declare a C type? What is an np.float32 in C?
As commented by @YXD, you won't get a speed improvement by using Cython to loop over the arrays and perform these operations. Especially for simple operations, Numpy uses SIMD instructions, which is very efficient.
Despite that, you can improve memory usage and performance by modifying the array w in place, if the original array is no longer needed:
def func(w, m, spl):
    w += m
    w -= spl
then, instead of calling:
out = func(w, m, spl)
the new call:
func(w, m, spl)
will store the output inside w.
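If w must be left untouched, the temporaries can still be avoided by writing into a preallocated buffer through the ufunc out= argument. A small sketch with made-up sample data matching the question's shapes:

import numpy as np

# assumed sample inputs matching the question's shapes/dtypes
w = np.random.rand(43).astype(np.float32)
m = np.random.rand(43).astype(np.float32)
spl = np.float32(9722.0)

out = np.empty_like(w)        # reusable output buffer
np.subtract(m, spl, out=out)  # out = m - spl, no temporary array
np.add(out, w, out=out)       # out = w + (m - spl), computed in place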
I encounter a strange warning when performing matrix multiplication after QR decomposition in a Numba-accelerated function. For example:
# Python 3.10
import numpy as np
from numba import jit

@jit
def qr_check(x):
    q, r = np.linalg.qr(x)
    return q @ r

x = np.random.rand(3, 3)
qr_check(x)
Running the above code, I get the following NumbaPerformanceWarning:
'@' is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
I'm not sure what's going wrong here. I know F is for Fortran, so array r is Fortran-contiguous, but why isn't array q as well?
It is about the details of how QR decomposition is implemented in numba.
As you noted, F stands for Fortran-contiguous (column-major).
A stands for a strided memory layout (not necessarily contiguous).
Numba does not call numpy.linalg.qr directly. Let's take a look at numba's source code:
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1418
@overload(np.linalg.qr)
def qr_impl(a):
    ...
As you can see, numba overloads the qr function.
Inside this function, numba calls the LAPACK routine for QR decomposition, which is implemented in Fortran, so the result is Fortran-contiguous. But additionally, q is sliced:
q[:, :minmn]
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1490
So the final layouts are:
A (strided) for Q
F (fortran) for R
You will get the same warning in a similar case with a matrix product:
@jit
def qr_check(x):
    q = np.zeros((100, 64))
    r = np.zeros((64, 200))
    return q @ r[:1000, :1000]
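If you want to get rid of the warning, one option (a sketch on my part, assuming your numba version supports np.ascontiguousarray, which is listed among numba's supported NumPy functions) is to copy q into a C-contiguous array before the product:

import numpy as np
from numba import jit

@jit
def qr_check_contig(x):
    q, r = np.linalg.qr(x)
    q = np.ascontiguousarray(q)  # force a C-contiguous copy so '@' sees contiguous inputs
    return q @ r

qr_check_contig(np.random.rand(3, 3))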
I have a large array to operate on, for example with a matrix transpose. numba is much faster than numpy:
# test_transpose.py
import numpy as np
import numba as nb
import time

@nb.njit('float64[:,:](float64[:,:])', parallel=True)
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
if __name__ == "__main__":
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = x.transpose().copy()
    print(f"numpy transpose: {round(time.time() - t, 4)} secs")

    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = transpose(x)
    print(f"numba paralleled transpose: {round(time.time() - t, 4)} secs")
Running it in a Windows command prompt:
D:\data\test>python test_transpose.py
numpy transpose: 2.0961 secs
numba paralleled transpose: 0.8584 secs
However, I want to input another large matrix of integers, creating x as
x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
An exception is raised:
Traceback (most recent call last):
File "test_transpose.py", line 39, in <module>
x = transpose(x)
File "C:\Program Files\Python38\lib\site-packages\numba\core\dispatcher.py", line 703, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(int64, 2d, C)
It does not accept the integer input matrix. If I drop the signature to remove the data type check:
@nb.njit(parallel=True)  # 'float64[:,:](float64[:,:])'
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
It is slower:
D:\Data\test>python test_transpose.py
numba paralleled transpose: 1.6653 secs
Using @nb.njit('int64[:,:](int64[:,:])', parallel=True) for the integer data matrix is faster, as expected.
So, how can I still allow mixed data type inputs but keep the speed, instead of creating a separate function for each type?
So, how can I still allow mixed data type inputs but keep the speed, instead of creating a separate function for each type?
The problem is that the Numba function is defined only for float64 types and not int64. The specification of the types is required because Numba compiles the Python code to native code with well-defined types. You can add multiple signatures to a Numba function:
@nb.njit(['float64[:,:](float64[:,:])', 'int64[:,:](int64[:,:])'], parallel=True)
def transpose(x):
    r, c = x.shape
    # Specifying the dtype is very important here.
    # This is a good habit to take to avoid numerical issues and
    # slower performance in Numpy too.
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
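With both signatures registered, the same compiled function accepts either dtype while keeping eager compilation. A minimal usage sketch (array sizes are arbitrary, using the imports from the question):

xf = np.random.randn(1000, 50)                                # float64 input
xi = np.random.randint(100, size=(1000, 50), dtype=np.int64)  # int64 input
print(transpose(xf).dtype)  # float64
print(transpose(xi).dtype)  # int64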
It is slower
This is because of lazy compilation. The first execution includes the compilation time. This is not the case when the signature is specified, because eager compilation is used instead.
numba is much faster
Well, not that much here, considering many cores are used. In fact, naive transposition is very inefficient on big matrices (it wastes about 90% of the memory throughput in this case on large arrays). There are faster algorithms. For more information, please read this post (it only considers in-place 2D square transposition, which is much simpler, but the idea is the same). Also note that the wider the type, the bigger the array, and the bigger the array, the slower the transposition.
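To see the lazy-compilation effect for yourself, compare the first call (which includes compilation) with the second. An illustrative sketch; absolute timings will vary with your machine:

import time
import numpy as np
import numba as nb

@nb.njit(parallel=True)  # no signature, so compilation is lazy
def transpose_lazy(x):
    r, c = x.shape
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2

x = np.random.randn(int(1e6), 50)

t = time.time()
transpose_lazy(x)
print("first call (includes compilation):", round(time.time() - t, 4), "secs")

t = time.time()
transpose_lazy(x)
print("second call (already compiled):", round(time.time() - t, 4), "secs")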
I would like to obtain a numpy array from element-wise calculation on different numpy arrays. As of now, I am using a lambda function to return a value, repeat that for all values, create a list therefrom, and convert to numpy array:
import math
import numpy as np
def weightAdjLoads(loadsX, loadsY, angles, g):
    adjust = lambda x, y, a: math.sqrt((abs(x) - math.sin(a)*g)**2 + (abs(y) - math.cos(a)*g)**2)
    return np.array([adjust(x, y, a) for x, y, a in zip(loadsX, loadsY, angles)])
This seems to me like too much overhead. Are there any numpy routines which could do just that?
I am aware of methods such as numpy.sqrt(A**2 + B**2), where A and B are numpy arrays. However, those only allow to apply predefined formulas. How can I apply custom formulas on numpy arrays?
numpy.sqrt(A**2 + B**2) is parsed by the Python interpreter into calls roughly as follows:
tmp1 = A**2 # A.__pow__(2)
tmp2 = B**2 # B.__pow__(2)
tmp3 = tmp1 + tmp2 # tmp1.__add__(tmp2)
tmp4 = np.sqrt(tmp3)
That is, there are defined numpy functions and methods for power, addition, sqrt etc.
Your lambda works with scalars, not numpy arrays:
math.sqrt((abs(x) - math.sin(a)*g)**2 + (abs(y) - math.cos(a)*g)**2)
Specifically it's the math trig functions that require scalars. abs works with arrays:
abs(A) => A.__abs__()
numpy provides a full set of trig functions, so this function should work with array, or scalar, arguments:
def foo(x, y, a):
    return np.sqrt((abs(x) - np.sin(a)*g)**2 + (abs(y) - np.cos(a)*g)**2)
There are ways of wrapping your scalar adjust into a numpy function, but the speed savings relative to your list comprehension are minor.
f = np.vectorize(adjust)
f = np.frompyfunc(adjust, 3, 1)
Mainly they make it easier to broadcast arrays to a scalar function. But to gain compiled speed you have to rewrite the function as in my foo, or use a third-party package like cython, numba, or numexpr.
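A quick check that the array version gives the same numbers as the original list comprehension (made-up sample data; g is set at module level because foo reads it from the enclosing scope, and foo/weightAdjLoads are as defined above):

import math
import numpy as np

g = 9.81  # assumed sample value
loadsX = np.random.randn(100)
loadsY = np.random.randn(100)
angles = np.random.rand(100) * np.pi

print(np.allclose(foo(loadsX, loadsY, angles),
                  weightAdjLoads(loadsX, loadsY, angles, g)))  # True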
I'm trying to use dot products, matrix inversion, and other basic linear algebra operations that are available in numpy from Cython. Functions like numpy.linalg.inv (inversion), numpy.dot (dot product), X.T (transpose of matrix/array). There's a large overhead to calling numpy.* from Cython functions, and the rest of the function is written in Cython, so I'd like to avoid this.
If I assume users have numpy installed, is there a way to do something like:
#include "numpy/npy_math.h"
as an extern, and call these functions? Or alternatively call BLAS directly (or whatever it is that numpy calls for these core operations)?
To give an example, imagine you have a function in Cython that does many things and in the end needs to make a computation involving dot products and matrix inverses:
cdef myfunc(...):
    # ... do many things faster than Python could
    # ...
    # compute one value using dot products and inv
    # without using
    #   import numpy as np
    #   np.*
    val = gammaln(sum(v)) - sum(gammaln(v)) + dot((v - 1).T, log(x).T)
How can this be done? If there's a library that already implements these in Cython, I could also use that, but I have not found anything. Even if those procedures are less optimized than BLAS, not having the overhead of calling the numpy Python module from Cython will still make things faster overall.
Example functions I'd like to call:
dot product (np.dot)
matrix inversion (np.linalg.inv)
matrix multiplication
taking transpose (equivalent of x.T in numpy)
gammaln function (like scipy.gammaln equivalent, which should be available in C)
I realize, as it says on the cython-users mailing list (https://groups.google.com/forum/?fromgroups=#!topic/cython-users/XZjMVSIQnTE), that if you call these functions on large matrices, there is no point in doing it from Cython, since calling them from numpy will just result in the majority of the time being spent in the optimized C code that numpy calls. However, in my case, I have many calls to these linear algebra operations on small matrices -- in that case, the overhead introduced by repeatedly going from Cython back to numpy and back to Cython will far outweigh the time spent actually computing the operation in BLAS. Therefore, I'd like to keep everything at the C/Cython level for these simple operations and not go through Python.
I'd prefer not to go through GSL, since that adds another dependency and since it's unclear if GSL is actively maintained. Since I'm assuming users of the code already have scipy/numpy installed, I can safely assume that they have all the associated C code that goes along with these libraries, so I just want to be able to tap into that code and call it from Cython.
edit: I found a library that wraps BLAS in Cython (https://github.com/tokyo/tokyo) which is close but not what I'm looking for. I'd like to call the numpy/scipy C functions directly (I'm assuming the user has these installed.)
Calling BLAS bundled with Scipy is "fairly" straightforward, here's one example for calling DGEMM to compute matrix multiplication: https://gist.github.com/pv/5437087 Note that BLAS and LAPACK expect all arrays to be Fortran-contiguous (modulo the lda/b/c parameters), hence order="F" and double[::1,:] which are required for correct functioning.
Computing inverses can similarly be done by applying the LAPACK function dgesv on the identity matrix. For the signature, see here. All this requires dropping down to rather low-level coding; you need to allocate temporary work arrays yourself, etc. However, these can be encapsulated into your own convenience functions, or you can just reuse the code from tokyo by replacing the lib_* functions with function pointers obtained from Scipy in the above way.
If you use Cython's memoryview syntax (double[::1,:]), transposing is the same x.T as usual. Alternatively, you can compute the transpose by writing a function of your own that swaps elements of the array across the diagonal. Numpy doesn't actually contain this operation; x.T only changes the strides of the array and doesn't move the data around.
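A quick way to see that x.T is only a strides trick and shares the same data (a small check in plain numpy):

import numpy as np

x = np.arange(6, dtype=np.float64).reshape(2, 3)
print(x.strides, x.T.strides)    # e.g. (24, 8) vs (8, 24): the strides are swapped
print(np.shares_memory(x, x.T))  # True: no data is moved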
It would probably be possible to rewrite the tokyo module to use the BLAS/LAPACK exported by Scipy and bundle it in scipy.linalg, so that you could just do from scipy.linalg.blas cimport dgemm. Pull requests are accepted if someone wants to get down to it.
As you can see, it all boils down to passing function pointers around. As alluded to above, Cython does in fact provide its own protocol for exchanging function pointers. For an example, consider from scipy.spatial import qhull; print(qhull.__pyx_capi__) --- those functions could be accessed via from scipy.spatial.qhull cimport XXXX in Cython (they're private though, so don't do that).
However, at the present, scipy.special does not offer this C-API. It would however in fact be quite simple to provide it, given that the interface module in scipy.special is written in Cython.
I don't think there is at the moment any sane and portable way to access the function doing the heavy lifting for gamln, (although you could snoop around the UFunc object, but that's not a sane solution :), so at the moment it's probably best to just grab the relevant part of source code from scipy.special and bundle it with your project, or use e.g. GSL.
Perhaps the easiest way if you do accept using the GSL would be to use this GSL->cython interface https://github.com/twiecki/CythonGSL and call BLAS from there (see the example https://github.com/twiecki/CythonGSL/blob/master/examples/blas2.pyx). It should also take care of the Fortran vs C ordering.
There aren't many new GSL features, but you can safely assume it is actively maintained. The CythonGSL is more complete compared to tokyo; e.g., it features symmetric-matrix products that are absent in numpy.
As I've just encountered the same problem and wrote some additional functions, I'll include them here in case someone else finds them useful. I code up some matrix multiplication, and also call LAPACK functions for matrix inversion, determinant, and Cholesky decomposition. You should consider trying to do linear algebra outside any loops, if you have any, unlike what I do here. And by the way, the determinant function isn't quite working, so suggestions are welcome. Also, please note that I don't do any checking to see whether the inputs are conformable.
from scipy.linalg.cython_lapack cimport dgetri, dgetrf, dpotrf
cpdef void inv_c(double[:, ::1] A, double[:, ::1] B,
                 double[:, ::1] work, int[::1] ipiv):
    '''invert float type square matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to invert
    B : memoryview (numpy array)
        n x n array to use within the function, function
        will modify this matrix in place to become the inverse of A
    work : memoryview (numpy array)
        n x n array to use within the function
    ipiv : memoryview (numpy array)
        length n integer array to use within function
    '''
    cdef int n = A.shape[0], info
    cdef int lwork = n * n  # size of the caller-supplied work buffer passed to dgetri
    B[...] = A
    dgetrf(&n, &n, &B[0, 0], &n, &ipiv[0], &info)
    dgetri(&n, &B[0, 0], &n, &ipiv[0], &work[0, 0], &lwork, &info)

cpdef double det_c(double[:, ::1] A, double[:, ::1] work, int[::1] ipiv):
    '''obtain determinant of float type square matrix A

    Notes
    -----
    As is, this function is not yet computing the sign of the determinant
    correctly, help!

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to compute determinant of
    work : memoryview (numpy array)
        n x n array to use within function
    ipiv : memoryview (numpy array)
        length n integer vector to use within function

    Returns
    -------
    detval : float
        determinant of matrix A
    '''
    cdef int n = A.shape[0], info
    work[...] = A
    dgetrf(&n, &n, &work[0, 0], &n, &ipiv[0], &info)
    cdef double detval = 1.
    cdef int j
    for j in range(n):
        # note: LAPACK's ipiv entries are 1-based (Fortran convention), so comparing
        # them against the 0-based index j is likely the source of the sign problem
        if j != ipiv[j]:
            detval = -detval*work[j, j]
        else:
            detval = detval*work[j, j]
    return detval

cdef void chol_c(double[:, ::1] A, double[:, ::1] B):
    '''cholesky factorization of real symmetric positive definite float matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n matrix to compute cholesky decomposition
    B : memoryview (numpy array)
        n x n matrix to use within function, will be modified
        in place to become cholesky decomposition of A. works
        similar to np.linalg.cholesky
    '''
    cdef int n = A.shape[0], info
    cdef char uplo = 'U'
    B[...] = A
    dpotrf(&uplo, &n, &B[0, 0], &n, &info)
    cdef int i, j
    for i in range(n):
        for j in range(n):
            if j > i:
                B[i, j] = 0

cpdef void dotmm_c(double[:, :] A, double[:, :] B, double[:, :] out):
    '''matrix multiply matrices A (n x m) and B (m x r)

    Parameters
    ----------
    A : memoryview (numpy array)
        n x m left matrix
    B : memoryview (numpy array)
        m x r right matrix
    out : memoryview (numpy array)
        n x r output matrix
    '''
    cdef Py_ssize_t i, j, k
    cdef double s
    cdef Py_ssize_t n = A.shape[0], m = A.shape[1]
    cdef Py_ssize_t l = B.shape[0], r = B.shape[1]
    for i in range(n):
        for j in range(r):
            s = 0
            for k in range(m):
                s += A[i, k]*B[k, j]
            out[i, j] = s
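For what it's worth, a possible calling pattern from Python once the above is compiled (a sketch; cy_linalg is a hypothetical module name, the caller allocates the scratch buffers, and ipiv is assumed to be a C int array, e.g. np.intc, to match the int[::1] memoryview):

import numpy as np
import cy_linalg  # hypothetical name of the compiled Cython extension

n = 4
A = np.random.rand(n, n) + n * np.eye(n)  # well-conditioned test matrix
B = np.empty((n, n))                      # receives the inverse
work = np.empty((n, n))                   # scratch space for dgetri
ipiv = np.empty(n, dtype=np.intc)         # pivot indices for dgetrf

cy_linalg.inv_c(A, B, work, ipiv)
print(np.allclose(B, np.linalg.inv(A)))   # True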
I have to do many loops of the following type
for i in range(len(a)):
    for j in range(i+1):
        c[i] += a[j]*b[i-j]
where a and b are short arrays (of the same size, which is between about 10 and 50). This can be done efficiently using a convolution:
import numpy as np
np.convolve(a, b)
However, this gives me the full convolution (i.e. the vector is too long, compared to the for loop above). If I use the 'same' option in convolve, I get the central part, but what I want is the first part. Of course, I can chop off what I don't need from the full vector, but I would like to get rid of the unnecessary computation time if possible.
Can someone suggest a better vectorization of the loops?
You could write a small C extension in Cython:
# cython: boundscheck=False
cimport numpy as np
import numpy as np  # zeros_like

ctypedef np.float64_t np_t

def convolve_cy_np(np.ndarray[np_t] a not None,
                   np.ndarray[np_t] b not None,
                   np.ndarray[np_t] c=None):
    if c is None:
        c = np.zeros_like(a)
    cdef Py_ssize_t i, j, n = c.shape[0]
    with nogil:
        for i in range(n):
            for j in range(i + 1):
                c[i] += a[j] * b[i - j]
    return c
It performs well for n=10..50 compared to np.convolve(a,b)[:len(a)] on my machine.
Also it seems like a job for numba.
There is no way to do a convolution with vectorized array manipulations in numpy. Your best bet is to use np.convolve(a, b, mode='same') and trim off what you don't need. That's probably going to be 10x faster than the double loop in pure Python that you have above. You could also roll your own with Cython if you are really concerned about the speed, but it likely won't be much, if any, faster than np.convolve().
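For reference, a small check that slicing the full convolution reproduces the double loop (assumed sample data of the sizes mentioned in the question):

import numpy as np

a = np.random.randn(30)
b = np.random.randn(30)

c = np.zeros(len(a))
for i in range(len(a)):
    for j in range(i + 1):
        c[i] += a[j] * b[i - j]

print(np.allclose(c, np.convolve(a, b)[:len(a)]))  # True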