I encounter a strange warning when performing matrix multiplication after QR decomposition in a Numba-accelerated function. For example:
# Python 3.10
import numpy as np
from numba import jit

@jit
def qr_check(x):
    q, r = np.linalg.qr(x)
    return q @ r

x = np.random.rand(3, 3)
qr_check(x)
Running the above code, I get the following NumbaPerformanceWarning:
'@' is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
I'm not sure what's going wrong here. I know F is for Fortran, so array r is Fortran-contiguous, but why isn't array q as well?
It comes down to how QR decomposition is implemented in Numba.
As you noted, F stands for Fortran-contiguous (column-major).
A stands for a strided memory layout (neither C- nor Fortran-contiguous).
Numba does not call numpy.linalg.qr directly. Let's take a look at Numba's source code:
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1418
@overload(np.linalg.qr)
def qr_impl(a):
    ...
As you can see, Numba overloads the qr function.
Inside this function, Numba calls the LAPACK routine for QR decomposition, which is implemented in Fortran, so the result is Fortran-contiguous. But additionally, q is sliced:
q[:, :minmn]
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1490
So the final layouts are:
A (strided) for Q
F (Fortran) for R
You will get the same warning in a similar case with a matrix product:
@jit
def qr_check(x):
    q = np.zeros((100, 64))
    r = np.zeros((64, 200))
    return q @ r[:1000, :1000]
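If the warning matters in your case, one possible workaround (a sketch of mine, not something the original answer prescribes) is to copy q into a C-contiguous array before the product; Numba supports np.ascontiguousarray, and with both operands contiguous the warning should go away:

import numpy as np
from numba import jit

@jit
def qr_check_contig(x):
    q, r = np.linalg.qr(x)
    # Copy q into a C-contiguous buffer; whether the extra copy
    # pays off depends on the matrix sizes, so benchmark it.
    return np.ascontiguousarray(q) @ r

qr_check_contig(np.random.rand(3, 3))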
Related
I have a large array to operate on, for example a matrix transpose. Numba is much faster:
# test_transpose.py
import numpy as np
import numba as nb
import time

@nb.njit('float64[:,:](float64[:,:])', parallel=True)
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j, i]
    return x2

if __name__ == "__main__":
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = x.transpose().copy()
    print(f"numpy transpose: {round(time.time() - t, 4)} secs")
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = transpose(x)
    print(f"numba paralleled transpose: {round(time.time() - t, 4)} secs")
Running it in a Windows command prompt:
D:\data\test>python test_transpose.py
numpy transpose: 2.0961 secs
numba paralleled transpose: 0.8584 secs
However, I want to input another large matrix of integers, defining x as
x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
An exception is raised:
Traceback (most recent call last):
File "test_transpose.py", line 39, in <module>
x = transpose(x)
File "C:\Program Files\Python38\lib\site-packages\numba\core\dispatcher.py", line 703, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(int64, 2d, C)
It does not accept the integer input matrix, since the signature declares float64 only. If I drop the explicit type signature:
@nb.njit(parallel=True)  # 'float64[:,:](float64[:,:])' signature removed
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j, i]
    return x2
It is slower:
D:\Data\test>python test_transpose.py
numba paralleled transpose: 1.6653 secs
Using @nb.njit('int64[:,:](int64[:,:])', parallel=True) for the integer data matrix is faster, as expected.
So, how can I still allow mixed data type inputs but keep the speed, instead of creating functions each for different types?
So, how can I still allow mixed data type inputs but keep the speed, instead of creating functions each for different types?
The problem is that the Numba function is defined only for float64 types and not int64. The specification of the types is required because Numba compiles the Python code to native code with well-defined types. You can add multiple signatures to a Numba function:
@nb.njit(['float64[:,:](float64[:,:])', 'int64[:,:](int64[:,:])'], parallel=True)
def transpose(x):
    r, c = x.shape
    # Specifying the dtype is very important here.
    # This is a good habit to take to avoid numerical issues and
    # slower performance in Numpy too.
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j, i]
    return x2
It is slower
This is because of lazy compilation: the first execution includes the compilation time. This is not the case when the signature is specified, because eager compilation is used instead.
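To see the lazy-compilation effect directly, here is a small sketch of mine using the signature-less variant from the question: trigger compilation with a warm-up call, then time the real work:

import time
import numpy as np
import numba as nb

@nb.njit(parallel=True)  # no signature: compiled lazily per input type
def transpose_lazy(x):
    r, c = x.shape
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j, i]
    return x2

x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
transpose_lazy(x[:2])  # warm-up call: compiles the int64 specialization
t = time.time()
transpose_lazy(x)
print(f"int64 transpose after warm-up: {round(time.time() - t, 4)} secs")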
numba is much faster
Well, not that much here, considering many cores are used. In fact, the naive transposition is very inefficient on big matrices (it wastes about 90% of the memory throughput in this case on large arrays). There are faster algorithms. For more information, please read this post (it only considers in-place 2D square transposition, which is much simpler, but the idea is the same). Also note that the wider the type, the bigger the array, and the bigger the array, the slower the transposition. A tiled variant is sketched below.
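For illustration, here is a cache-blocked (tiled) out-of-place transpose, a minimal sketch of the kind of faster algorithm alluded to above; the tile size of 32 is an assumption to tune per machine:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def transpose_tiled(x):
    r, c = x.shape
    out = np.empty((c, r), dtype=x.dtype)
    tile = 32  # assumed tile size; tune for your cache sizes
    for ii in nb.prange((r + tile - 1) // tile):
        i0 = ii * tile
        i1 = min(i0 + tile, r)
        for j0 in range(0, c, tile):
            j1 = min(j0 + tile, c)
            # copy one tile at a time so reads and writes stay cache-friendly
            for i in range(i0, i1):
                for j in range(j0, j1):
                    out[j, i] = x[i, j]
    return out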
Suppose I have two arrays. A has size n by d, and B has size t by d. Suppose I want to output an array C, where C[i, j] gives the cubed L3 norm between A[i] and B[j] (both of these have size d). i.e.
C[i, j] = |A[i, 0]-B[j, 0]|**3 + |A[i, 1]-B[j, 1]|**3 + ... + |A[i, d-1]-B[j, d-1]|**3
Can anyone redirect me to a really efficient way to do this? I have tried using a double for loop and something with array operations but the runtime is incredibly slow.
Edit: I know TensorFlow.norm works, but how could I implement this myself?
In accordance with @gph's answer, an explicit example of the application of Numpy's np.linalg.norm, using broadcasting to avoid any loops:
import numpy as np
n, m, d = 200, 300, 3
A = np.random.random((n,d))
B = np.random.random((m,d))
C = np.linalg.norm(A[:,None,:] - B[None,...], ord=3, axis = -1)
# The 'raw' option would be:
C2 = (np.sum(np.abs(A[:,None,:] - B[None,...])**3, axis = -1))**(1/3)
Both ways seem to take around the same time to execute. (Note that the question asks for the cubed norm, so drop the final **(1/3), or cube the result of np.linalg.norm, to match it exactly.) I've tried to use np.einsum, but let's just say my algebra skills have seen better days. Perhaps you'll have better luck (a nice resource is this wiki by Divakar).
There may be a few optimizations to speed this up, but the performance isn't going to be anywhere near that of specialized math packages. Those packages use BLAS and LAPACK to vectorize the operations. They also avoid a lot of type-checking overhead by enforcing types at the time you set a value. If performance is important, there really is no way you are going to do better than a numerical computing package like numpy.
Use a 3rd-party library written in C or create your own
numpy has a linalg module which should be able to compute your L3 norm for each A[i]-B[j]
If numpy works for you, take a look at numba's JIT, which can compile and cache some (numpy) code to be orders of magnitude faster (successive runs will take advantage of it). I believe you will need to describe the parts to accelerate as functions which will be called many times to take advantage of it.
Roughly:
import numpy as np
from numpy import linalg as LA
from numba import jit as numbajit

@numbajit
def L3norm(A, B, i, j):
    # pass the difference vector directly; wrapping it in a list would
    # make a 2D array, for which ord=3 is not a valid matrix norm
    return LA.norm(A[i] - B[j], ord=3)

# may be able to apply JIT here too
def operate(A, B, n, t):
    # generate output array
    C = np.zeros(shape=(n, t))
    for i in range(n):
        for j in range(t):
            C[i, j] = L3norm(A, B, i, j)
    return C
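Building on that suggestion, a fully-jitted alternative (my sketch, not from the answer above) computes the cubed L3 norm from the question directly with explicit loops, avoiding per-pair temporaries and the Python-level outer loops:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cubed_l3(A, B):
    n, d = A.shape
    t = B.shape[0]
    C = np.empty((n, t))
    for i in prange(n):
        for j in range(t):
            s = 0.0
            for k in range(d):
                # accumulate |A[i,k] - B[j,k]|**3, i.e. the cubed L3 norm
                s += abs(A[i, k] - B[j, k]) ** 3
            C[i, j] = s
    return C

# usage:
A = np.random.random((200, 3))
B = np.random.random((300, 3))
C = cubed_l3(A, B)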
In my code (written in Python 2.7), I create two numpy arrays, A and B. I then use them to assemble a larger matrix, H, with the following code
H = np.block([[A, B], [-B, -A]])
Various computations follow, involving substantial amounts of numpy manipulations and for loops. As a result, I would like to use Numba to optimize the code. However, it appears that the numpy block function is unsupported in Numba. The matrices A and B are not terribly large, so I'm fine using a function that may not be as optimized as np.block, but I would still like to assemble H in a block matrix fashion for the purpose of readability. Are there any functions that would accomplish this?
Just to make @hpaulj's comment concrete, making some basic assumptions about your input data and not doing any sort of error checking:
import numpy as np
import numba as nb

@nb.njit
def nb_block(X):
    xtmp1 = np.concatenate(X[0], axis=1)
    xtmp2 = np.concatenate(X[1], axis=1)
    return np.concatenate((xtmp1, xtmp2), axis=0)
The following also works:
@nb.njit
def nb_block2(X):
    xtmp1 = np.hstack(X[0])
    xtmp2 = np.hstack(X[1])
    return np.vstack((xtmp1, xtmp2))
and the performance of the two differs with different sized arrays. You should benchmark your own application.
Then calling:
A = np.zeros((50, 30))
B = np.ones((50, 30))
X = np.block([[A, B], [-B, -A]])
Y = nb_block(((A, B), (-B, -A)))  # Note the tuple-of-tuples vs list-of-lists
np.allclose(X, Y)  # True
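To compare the two variants on your own sizes, a quick timing sketch (the numbers are illustrative; call each function once first so compilation time is excluded):

import timeit

nb_block(((A, B), (-B, -A)))    # warm-up: compile before timing
nb_block2(((A, B), (-B, -A)))
print(timeit.timeit(lambda: nb_block(((A, B), (-B, -A))), number=1000))
print(timeit.timeit(lambda: nb_block2(((A, B), (-B, -A))), number=1000))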
Trying Cython for the first time, trying to get a speedup on a function that does subtraction and addition on two numpy arrays and a float32. I'm trying to get this function to be as quick as possible: it's called a lot of times, and if I can speed it up then it's a big win.
def broadcast(w, m, spl):
    """
    w and m are float32 ndarrays, e.g. shape (43,)
    spl is an np.float32 value, e.g. 9722.0
    """
    return w + (m - spl)
My Cython attempt so far is:
import numpy as np
cimport numpy as np

DTYPE = np.float32
ctypedef np.float32_t DTYPE_t

def broadcast(np.ndarray w, np.ndarray m, np.float32 spl):
    return w + (m - spl)
but it returns the error:
'float32' is not a type identifier
I'm not sure why I can't declare the type. Do I need to declare a C type? What is an np.float32 in C?
As commented by @YXD, you won't get a speed improvement using Cython to loop over the arrays to perform these operations. Especially for simple operations, Numpy uses SIMD instructions, which are very efficient.
That said, you could improve the memory usage and the performance by modifying array w in place, if the original array is no longer needed:
def func(w, m, spl):
    w += m
    w -= spl
then, instead of calling:
out = func(w, m, spl)
the new call:
func(w, m, spl)
will store the output inside w.
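As for the compile error itself (not addressed above): np.float32 is a Python object, not a C type identifier; the matching C-level type in Cython's numpy declarations is np.float32_t, which the question already aliases as DTYPE_t. A minimal sketch of the declaration that compiles:

import numpy as np
cimport numpy as np

ctypedef np.float32_t DTYPE_t

def broadcast(np.ndarray[DTYPE_t, ndim=1] w,
              np.ndarray[DTYPE_t, ndim=1] m,
              DTYPE_t spl):
    # spl is now a plain C float; the array arithmetic still
    # goes through Numpy, so don't expect a speedup from this alone
    return w + (m - spl)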
I'm trying to use dot products, matrix inversion and other basic linear algebra operations that are available in numpy from Cython. Functions like numpy.linalg.inv (inversion), numpy.dot (dot product), X.T (transpose of matrix/array). There's a large overhead to calling numpy.* from Cython functions, and the rest of the function is written in Cython, so I'd like to avoid this.
If I assume users have numpy installed, is there a way to do something like:
#include "numpy/npy_math.h"
as an extern, and call these functions? Or alternatively call BLAS directly (or whatever it is that numpy calls for these core operations)?
To give an example, imagine you have a function in Cython that does many things and in the end needs to make a computation involving dot products and matrix inverses:
cdef myfunc(...):
    # ... do many things faster than Python could
    # ...
    # compute one value using dot products and inv
    # without using
    # import numpy as np
    # np.*
    val = gammaln(sum(v)) - sum(gammaln(v)) + dot((v - 1).T, log(x).T)
how can this be done? If there's a library that implements these in Cython already, I can also use that, but I have not found anything. Even if those procedures are less optimized than BLAS directly, not having the overhead of calling the numpy Python module from Cython will still make things faster overall.
Example functions I'd like to call:
dot product (np.dot)
matrix inversion (np.linalg.inv)
matrix multiplication
taking transpose (equivalent of x.T in numpy)
gammaln function (like scipy.gammaln equivalent, which should be available in C)
I realize, as discussed on the mailing list (https://groups.google.com/forum/?fromgroups=#!topic/cython-users/XZjMVSIQnTE), that if you call these functions on large matrices, there is no point in doing it from Cython, since calling them from numpy will just result in the majority of the time being spent in the optimized C code that numpy calls. However, in my case, I have many calls to these linear algebra operations on small matrices -- in that case, the overhead introduced by repeatedly going from Cython back to numpy and back to Cython will far outweigh the time spent actually computing the operation in BLAS. Therefore, I'd like to keep everything at the C/Cython level for these simple operations and not go through Python.
I'd prefer not to go through GSL, since that adds another dependency and since it's unclear if GSL is actively maintained. Since I'm assuming users of the code already have scipy/numpy installed, I can safely assume that they have all the associated C code that goes along with these libraries, so I just want to be able to tap into that code and call it from Cython.
edit: I found a library that wraps BLAS in Cython (https://github.com/tokyo/tokyo) which is close but not what I'm looking for. I'd like to call the numpy/scipy C functions directly (I'm assuming the user has these installed.)
Calling the BLAS bundled with Scipy is "fairly" straightforward; here's one example for calling DGEMM to compute matrix multiplication: https://gist.github.com/pv/5437087 Note that BLAS and LAPACK expect all arrays to be Fortran-contiguous (modulo the lda/b/c parameters), hence order="F" and double[::1,:], which are required for correct functioning.
Computing inverses can similarly be done by applying the LAPACK function dgesv on the identity matrix. For the signature, see here. All this requires dropping down to rather low-level coding: you need to allocate temporary work arrays yourself, etc. However, these can be encapsulated into your own convenience functions, or you can just reuse the code from tokyo by replacing the lib_* functions with function pointers obtained from Scipy in the above way.
If you use Cython's memoryview syntax (double[::1,:]), the transpose is the same x.T as usual. Alternatively, you can compute the transpose by writing a function of your own that swaps elements of the array across the diagonal. Numpy doesn't actually contain this operation; x.T only changes the strides of the array and doesn't move the data around.
It would probably be possible to rewrite the tokyo module to use the BLAS/LAPACK exported by Scipy and bundle it in scipy.linalg, so that you could just do from scipy.linalg.blas cimport dgemm. Pull requests are accepted if someone wants to get down to it.
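As it happens, later SciPy releases (0.16+) expose exactly this interface as scipy.linalg.cython_blas and scipy.linalg.cython_lapack (the answer further below uses the latter). A minimal DGEMM sketch under that assumption, with no shape checking:

from scipy.linalg.cython_blas cimport dgemm

cpdef void matmul_f(double[::1, :] A, double[::1, :] B, double[::1, :] C):
    '''C := A @ B for Fortran-ordered double arrays (no error checking).'''
    cdef int m = A.shape[0], k = A.shape[1], n = B.shape[1]
    cdef double alpha = 1.0, beta = 0.0
    cdef char trans = 'N'  # no transposition of either operand
    dgemm(&trans, &trans, &m, &n, &k, &alpha,
          &A[0, 0], &m, &B[0, 0], &k, &beta, &C[0, 0], &m)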
As you can see, it all boils down to passing function pointers around. As alluded to above, Cython does in fact provide its own protocol for exchanging function pointers. For an example, consider from scipy.spatial import qhull; print(qhull.__pyx_capi__) --- those functions could be accessed via from scipy.spatial.qhull cimport XXXX in Cython (they're private though, so don't do that).
However, at the present, scipy.special does not offer this C-API. It would however in fact be quite simple to provide it, given that the interface module in scipy.special is written in Cython.
I don't think there is at the moment any sane and portable way to access the function doing the heavy lifting for gammaln (although you could snoop around the UFunc object, but that's not a sane solution :), so at the moment it's probably best to just grab the relevant part of the source code from scipy.special and bundle it with your project, or use e.g. GSL.
Perhaps the easiest way if you do accept using the GSL would be to use this GSL->cython interface https://github.com/twiecki/CythonGSL and call BLAS from there (see the example https://github.com/twiecki/CythonGSL/blob/master/examples/blas2.pyx). It should also take care of the Fortran vs C ordering.
There aren't many new GSL features, but you can safely assume it is actively maintained. CythonGSL is more complete than tokyo; e.g., it features symmetric-matrix products that are absent in numpy.
As I've just encountered the same problem and written some additional functions, I'll include them here in case someone else finds them useful. I code up some matrix multiplication and also call LAPACK functions for matrix inversion, determinant, and Cholesky decomposition. But you should consider trying to do linear algebra stuff outside any loops, if you have any, like I do here. Note that LAPACK returns 1-based (Fortran-style) pivot indices, which matters for the determinant's sign. Also, please note that I don't do any checking to see if inputs are conformable.
from scipy.linalg.cython_lapack cimport dgetri, dgetrf, dpotrf

cpdef void inv_c(double[:, ::1] A, double[:, ::1] B,
                 double[:, ::1] work, int[::1] ipiv):
    '''invert float type square matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to invert
    B : memoryview (numpy array)
        n x n array to use within the function, function
        will modify this matrix in place to become the inverse of A
    work : memoryview (numpy array)
        n x n array to use within the function
    ipiv : memoryview (numpy array)
        length n int array to use within function
    '''
    cdef int n = A.shape[0], info
    cdef int lwork = n * n  # size of the work array passed to dgetri
    B[...] = A
    dgetrf(&n, &n, &B[0, 0], &n, &ipiv[0], &info)
    dgetri(&n, &B[0, 0], &n, &ipiv[0], &work[0, 0], &lwork, &info)
cpdef double det_c(double[:, ::1] A, double[:, ::1] work, int[::1] ipiv):
    '''obtain determinant of float type square matrix A

    Notes
    -----
    LAPACK's dgetrf returns 1-based (Fortran-style) pivot indices,
    so the sign flip below compares ipiv[j] against j + 1.

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n array to compute determinant of
    work : memoryview (numpy array)
        n x n array to use within function
    ipiv : memoryview (numpy array)
        length n int vector to use within function

    Returns
    -------
    detval : float
        determinant of matrix A
    '''
    cdef int n = A.shape[0], info
    work[...] = A
    dgetrf(&n, &n, &work[0, 0], &n, &ipiv[0], &info)
    cdef double detval = 1.
    cdef int j
    for j in range(n):
        if ipiv[j] != j + 1:  # a row swap flips the determinant's sign
            detval = -detval * work[j, j]
        else:
            detval = detval * work[j, j]
    return detval
cdef void chol_c(double[:, ::1] A, double[:, ::1] B):
    '''cholesky factorization of real symmetric positive definite float matrix A

    Parameters
    ----------
    A : memoryview (numpy array)
        n x n matrix to compute cholesky decomposition of
    B : memoryview (numpy array)
        n x n matrix to use within function, will be modified
        in place to become cholesky decomposition of A. works
        similarly to np.linalg.cholesky
    '''
    cdef int n = A.shape[0], info
    # 'U' in LAPACK's column-major view corresponds to the lower
    # triangle of this C-contiguous (row-major) array
    cdef char uplo = 'U'
    B[...] = A
    dpotrf(&uplo, &n, &B[0, 0], &n, &info)
    cdef int i, j
    # zero out the strict upper triangle so B matches np.linalg.cholesky
    for i in range(n):
        for j in range(n):
            if j > i:
                B[i, j] = 0
cpdef void dotmm_c(double[:, :] A, double[:, :] B, double[:, :] out):
    '''matrix multiply matrices A (n x m) and B (m x r)

    Parameters
    ----------
    A : memoryview (numpy array)
        n x m left matrix
    B : memoryview (numpy array)
        m x r right matrix
    out : memoryview (numpy array)
        n x r output matrix
    '''
    cdef Py_ssize_t i, j, k
    cdef double s
    cdef Py_ssize_t n = A.shape[0], m = A.shape[1]
    cdef Py_ssize_t r = B.shape[1]
    for i in range(n):
        for j in range(r):
            s = 0
            for k in range(m):
                s += A[i, k] * B[k, j]
            out[i, j] = s
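For context, a hypothetical call from Python (my usage example, not the original answerer's), preallocating the workspace arrays once so they can be reused across calls, which is the point of this design:

import numpy as np

n = 4
A = np.random.rand(n, n)
A = A @ A.T + n * np.eye(n)        # make A symmetric positive definite
B = np.empty((n, n))
work = np.empty((n, n))
ipiv = np.empty(n, dtype=np.intc)  # LAPACK pivot indices are C ints

inv_c(A, B, work, ipiv)            # B now holds inv(A)
assert np.allclose(B, np.linalg.inv(A))
print(det_c(A, work, ipiv), np.linalg.det(A))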