In my code (written in Python 2.7), I create two numpy arrays, A and B. I then use them to assemble a larger matrix, H, with the following code
H = np.block([[A, B], [-B, -A]])
Various computations follow, involving substantial amounts of numpy manipulations and for loops. As a result, I would like to use Numba to optimize the code. However, it appears that the numpy block function is unsupported in Numba. The matrices A and B are not terribly large, so I'm fine using a function that may not be as optimized as np.block, but I would still like to assemble H in a block matrix fashion for the purpose of readability. Are there any functions that would accomplish this?
Just to make @hpaulj's comment concrete, making some basic assumptions about your input data and not doing any sort of error checking:
import numpy as np
import numba as nb
@nb.njit
def nb_block(X):
    xtmp1 = np.concatenate(X[0], axis=1)
    xtmp2 = np.concatenate(X[1], axis=1)
    return np.concatenate((xtmp1, xtmp2), axis=0)
The following also works:
@nb.njit
def nb_block2(X):
    xtmp1 = np.hstack(X[0])
    xtmp2 = np.hstack(X[1])
    return np.vstack((xtmp1, xtmp2))
The relative performance of the two depends on the array sizes, so you should benchmark your own application.
Then calling:
A = np.zeros((50,30))
B = np.ones((50,30))
X = np.block([[A, B], [-B, -A]])
Y = nb_block(((A, B), (-B, -A))) # Note the tuple-of-tuples vs list-of-lists
np.allclose(X, Y) # True
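For the specific two-by-two layout in the question, preallocating H and filling it block by block is another option. This is a minimal sketch that assumes A and B are 2D arrays with the same shape and dtype; the name assemble_h is made up for illustration. It relies only on np.empty and slice assignment, which Numba's nopython mode generally supports, so decorating it with @nb.njit should work, though that is not verified here.

```python
import numpy as np

def assemble_h(A, B):
    # Assumes A and B are 2D with identical shape and dtype (no error checking).
    n, m = A.shape
    H = np.empty((2 * n, 2 * m), dtype=A.dtype)
    H[:n, :m] = A    # top-left block
    H[:n, m:] = B    # top-right block
    H[n:, :m] = -B   # bottom-left block
    H[n:, m:] = -A   # bottom-right block
    return H

A = np.zeros((50, 30))
B = np.ones((50, 30))
H = assemble_h(A, B)
```

Each assignment names the block it fills, so the block structure stays readable.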
I encounter a strange warning when performing matrix multiplication after QR decomposition in a Numba-accelerated function. For example:
# Python 3.10
import numpy as np
from numba import jit
@jit
def qr_check(x):
    q, r = np.linalg.qr(x)
    return q @ r
x = np.random.rand(3,3)
qr_check(x)
Running the above code, I get the following NumbaPerformanceWarning:
'@' is faster on contiguous arrays, called on (array(float64, 2d, A), array(float64, 2d, F))
I'm not sure what's going wrong here. I know F is for Fortran, so array r is Fortran-contiguous, but why isn't array q as well?
It comes down to how QR decomposition is implemented in Numba.
As you noted, F stands for Fortran-contiguous (column-major).
A stands for "any", i.e. a strided layout that is not known to be contiguous.
Numba does not call numpy.linalg.qr directly. Let's take a look at Numba's source code:
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1418
@overload(np.linalg.qr)
def qr_impl(a):
    ...
As you can see, Numba overloads the function qr.
Inside this function, Numba calls the LAPACK routine for QR decomposition, which is implemented in Fortran, so the result is Fortran-contiguous. But additionally, q is sliced:
q[:, :minmn]
https://github.com/numba/numba/blob/251061051ea13c8618c5fbd5e6b3f90a3315fec9/numba/np/linalg.py#L1490
So the final layouts are:
A (strided) for Q
F (fortran) for R
You will get the same warning in a similar case with a matrix product:
@jit
def qr_check(x):
    q = np.zeros((100, 64))
    r = np.zeros((64, 200))
    return q @ r[:1000, :1000]
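If the warning matters, one possible workaround is to copy both factors into contiguous buffers before multiplying. The sketch below is an assumption about what should silence the warning, not something taken from the Numba documentation; the name qr_check_contiguous is made up, and the try/except fallback only exists so the snippet still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs, uncompiled, without Numba.
    def njit(func):
        return func

@njit
def qr_check_contiguous(x):
    q, r = np.linalg.qr(x)
    # Copy both factors into C-contiguous arrays before the product.
    return np.ascontiguousarray(q) @ np.ascontiguousarray(r)

x = np.random.rand(3, 3)
result = qr_check_contiguous(x)
```

The copies cost extra time and memory, so for small matrices it may be just as reasonable to ignore the warning.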
I need to run an operation, for example a matrix transpose, on a large array. numba is much faster:
# test_transpose.py
import numpy as np
import numba as nb
import time
@nb.njit('float64[:,:](float64[:,:])', parallel=True)
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
if __name__ == "__main__":
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = x.transpose().copy()
    print(f"numpy transpose: {round(time.time() - t, 4)} secs")
    x = np.random.randn(int(3e6), 50)
    t = time.time()
    x = transpose(x)
    print(f"numba paralleled transpose: {round(time.time() - t, 4)} secs")
Run in Windows command prompt
D:\data\test>python test_transpose.py
numpy transpose: 2.0961 secs
numba paralleled transpose: 0.8584 secs
However, I also want to pass in another large matrix, this time of integers, with x defined as
x = np.random.randint(int(3e6), size=(int(3e6), 50), dtype=np.int64)
An exception is raised:
Traceback (most recent call last):
File "test_transpose.py", line 39, in <module>
x = transpose(x)
File "C:\Program Files\Python38\lib\site-packages\numba\core\dispatcher.py", line 703, in _explain_matching_error
raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(int64, 2d, C)
It does not accept the integer input matrix. If I drop the type signature so that the integer matrix is accepted:
@nb.njit(parallel=True) # 'float64[:,:](float64[:,:])'
def transpose(x):
    r, c = x.shape
    x2 = np.zeros((c, r))
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
It is slower:
D:\Data\test>python test_transpose.py
numba paralleled transpose: 1.6653 secs
Using @nb.njit('int64[:,:](int64[:,:])', parallel=True) for the integer data matrix is faster, as expected.
So, how can I still allow mixed data type inputs but keep the speed, instead of creating functions each for different types?
So, how can I still allow mixed data type inputs but keep the speed, instead of creating functions each for different types?
The problem is that the Numba function is defined only for float64 and not for int64. Specifying the types is required because Numba compiles the Python code to native code with well-defined types. You can add multiple signatures to a Numba function:
@nb.njit(['float64[:,:](float64[:,:])', 'int64[:,:](int64[:,:])'], parallel=True)
def transpose(x):
    r, c = x.shape
    # Specifying the dtype is very important here.
    # This is a good habit to take to avoid numerical issues and
    # slower performance in Numpy too.
    x2 = np.zeros((c, r), dtype=x.dtype)
    for i in nb.prange(c):
        for j in range(r):
            x2[i, j] = x[j][i]
    return x2
It is slower
This is because of lazy compilation. The first execution includes the compilation time. This is not the case when a signature is specified, because eager compilation is used instead.
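One way to keep compilation time out of a benchmark is to call the function once before timing it. The helper below is a hypothetical sketch (the names time_after_warmup and double_all are made up); it works with any callable, and the plain Python function only stands in for a jitted one so that the snippet runs without Numba.

```python
import time

def time_after_warmup(func, *args):
    # First call: for a lazily compiled Numba function this triggers compilation.
    func(*args)
    # Second call with the same argument types: measures execution only.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in for a jitted function, so the sketch runs without Numba.
def double_all(values):
    return [2 * v for v in values]

result, elapsed = time_after_warmup(double_all, [1, 2, 3])
```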
numba is much faster
Well, not by much here, considering that many cores are used. In fact, the naive transposition is very inefficient on big matrices (it wastes about 90% of the memory throughput in this case on large arrays). There are faster algorithms. For more information, please read this post (it only considers in-place 2D square transposition, which is much simpler, but the idea is the same). Also note that the wider the type, the bigger the array, and the bigger the array, the slower the transposition.
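One family of faster algorithms works tile by tile, so that both the reads and the writes touch a small region of memory at a time. The following plain NumPy sketch only illustrates that idea and is not the algorithm from the linked post; the function name and the tile size of 64 are assumptions, and no timing is claimed for it.

```python
import numpy as np

def tiled_transpose(x, tile=64):
    r, c = x.shape
    out = np.empty((c, r), dtype=x.dtype)
    for i0 in range(0, r, tile):
        for j0 in range(0, c, tile):
            # Transpose one tile; slices clip automatically at the edges.
            out[j0:j0 + tile, i0:i0 + tile] = x[i0:i0 + tile, j0:j0 + tile].T
    return out

x = np.arange(200 * 130, dtype=np.int64).reshape(200, 130)
xt = tiled_transpose(x)
```

Note that the output keeps the input dtype, as recommended above.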
I am writing an application in Python having speed as the main driver. While optimizing my code, I found out that the main bottleneck is given by the code used to compute
In my code, this matrix multiplication is computed as
POW = np.arange(4)
y = C @ (x ** POW)
I tried different methods (e.g., for loops and others), but so far this is the fastest way I have found. Do you have any suggestions to improve the computation time?
It's absolutely right to use NumPy. NumPy does the actual mathematical operations in highly optimized C code. Using NumPy is faster than writing your own non-optimized C code.
Firstly, use float instead of int. Secondly, if you don't need double precision, then use np.float32.
POW = np.arange(4, dtype='f')
# C = C.astype('f', copy=False) # ensure that C.dtype == np.float32
y = C @ (x ** POW)
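To make the dtype advice concrete, here is a small sketch that checks which precision each variant actually computes in. The question does not show C or x, so the 5-by-4 coefficient matrix and the scalar x below are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C = rng.random((5, 4))  # hypothetical coefficient matrix (float64)
x = 1.5                 # hypothetical scalar input

POW64 = np.arange(4)             # integer exponents
POW32 = np.arange(4, dtype='f')  # float32 exponents
C32 = C.astype('f', copy=False)  # float32 copy of the coefficients

y64 = C @ (x ** POW64)    # computed in double precision
y32 = C32 @ (x ** POW32)  # computed in single precision
```

Whether float32 is actually faster depends on the array sizes and the BLAS build, so it is worth timing both on real data.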
Suppose I have two arrays. A has size n by d, and B has size t by d. Suppose I want to output an array C, where C[i, j] gives the cubed L3 norm between A[i] and B[j] (both of these have size d). i.e.
C[i, j] = |A[i, 0]-B[j, 0]|**3 + |A[i, 1]-B[j, 1]|**3 + ... + |A[i, d-1]-B[j, d-1]|**3
Can anyone redirect me to a really efficient way to do this? I have tried using a double for loop and something with array operations but the runtime is incredibly slow.
Edit: I know TensorFlow.norm works, but how could I implement this myself?
In accordance with @gph's answer, here is an explicit example of applying NumPy's np.linalg.norm, using broadcasting to avoid any loops:
import numpy as np
n, m, d = 200, 300, 3
A = np.random.random((n,d))
B = np.random.random((m,d))
C = np.linalg.norm(A[:,None,:] - B[None,...], ord=3, axis = -1)
# The 'raw' option would be:
C2 = (np.sum(np.abs(A[:,None,:] - B[None,...])**3, axis = -1))**(1/3)
Both ways seem to take around the same time to execute. I've tried to use np.einsum, but let's just say my algebra skills have seen better days. Perhaps you will have better luck (a nice resource is this wiki by Divakar).
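Since the question asks for the cubed L3 norm, the cube root can simply be left out. A minimal sketch using the same broadcasting trick (the function name cubed_l3_distances is made up for illustration) might look like this:

```python
import numpy as np

def cubed_l3_distances(A, B):
    # diff has shape (n, t, d) thanks to broadcasting.
    diff = A[:, None, :] - B[None, :, :]
    # Sum |difference|**3 over the last axis; no cube root is taken.
    return (np.abs(diff) ** 3).sum(axis=-1)

rng = np.random.default_rng(0)
A = rng.random((4, 3))
B = rng.random((5, 3))
C = cubed_l3_distances(A, B)
```

The intermediate array holds n * t * d elements, so for large inputs it may be necessary to process A in chunks of rows.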
There may be a few optimizations to speed this up, but the performance isn't going to be anywhere near specialized math packages. Those packages are using blas and lapack to vectorize the operations. They also avoid a lot of type checking overhead by enforcing types at the time you set a value. If performance is important there really is no way you are going to do better than a numerical computing package like numpy.
Use a 3rd-party library written in C or create your own
numpy has a linalg library which should be able to compute your L3 norm for each A[i]-B[j]
If numpy works for you, take a look at numba's JIT, which can compile and cache some (numpy) code to be orders of magnitude faster (successive runs will take advantage of it). I believe you will need to describe the parts to accelerate as functions which will be called many times to take advantage of it.
roughly
import numpy as np
from numpy import linalg as LA
from numba import jit as numbajit
@numbajit
def L3norm(A, B, i, j):
    # pass the 1-D difference directly; wrapping it in a list would make it 2-D,
    # and ord=3 is not a valid matrix norm
    return LA.norm(A[i] - B[j], ord=3)
# may be able to apply JIT here too
def operate(A, B, n, t):
    # generate array
    C = np.zeros(shape=(n, t))
    for i in range(n):
        for j in range(t):
            C[i, j] = L3norm(A, B, i, j)
    return C
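An alternative that tends to suit Numba well is to write out all three loops and compile the whole function, which also gives the cubed value directly without a cube root. This is only a sketch: the name cubed_l3_loops is made up, and the try/except fallback is there so the snippet still runs (slowly, uncompiled) if Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs, uncompiled, without Numba.
    def njit(func):
        return func

@njit
def cubed_l3_loops(A, B):
    n, d = A.shape
    t = B.shape[0]
    C = np.zeros((n, t))
    for i in range(n):
        for j in range(t):
            acc = 0.0
            for k in range(d):
                acc += abs(A[i, k] - B[j, k]) ** 3
            C[i, j] = acc
    return C

rng = np.random.default_rng(1)
A = rng.random((6, 3))
B = rng.random((5, 3))
C = cubed_l3_loops(A, B)
```

Unlike the broadcasting approach, this never materializes an n-by-t-by-d intermediate array.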
Can anyone direct me to the section of the NumPy manual that covers functions for root-mean-square calculations?
(I know this can be done with np.mean and np.abs. Isn't there a built-in? If not, why? Just curious, no offense.)
Can anyone also explain the complications of matrices versus arrays, just in the following case?
U is a matrix (T-by-N), and Ue is another matrix (T-by-N).
U[ind,:] is still a matrix, so I define k as a numpy array in the following fashion:
k = np.array(U[ind,:])
When I print k or type k in IPython, it displays the following:
K = array ([[2,.3 .....
......
9]])
You can see the double square brackets (which make it multi-dimensional, I guess), which give it shape = (1,N).
But I can't assign it to an array defined in this way:
l = np.zeros(N)
shape = (N,)
l[:] = k[:]
error:
matrix dimensions incompatible
Is there a way to accomplish the vector assignment I intend to do? Please don't tell me to do l = k (that defeats the purpose; I get different errors in the program, and I know the reasons. If needed, I can attach the piece of code).
Writing a loop is the dumb way, which is what I'm using for the time being.
I hope I was able to explain the problems I'm facing.
regards ...
For the RMS, I think this is the clearest:
from numpy import mean, sqrt, square, arange
a = arange(10) # For example
rms = sqrt(mean(square(a)))
The code reads like you say it: "root-mean-square".
For rms, the fastest expression I have found for small x.size (~ 1024) and real x is:
def rms(x):
    return np.sqrt(x.dot(x)/x.size)
This seems to be around twice as fast as the linalg.norm version (ipython %timeit on a really old laptop).
If you want complex arrays handled more appropriately then this also would work:
def rms(x):
    return np.sqrt(np.vdot(x, x)/x.size)
However, this version is nearly as slow as the norm version and only works for flat arrays.
For the RMS, how about
norm(V)/sqrt(V.size)
I don't know why it's not built in. I like
def rms(x, axis=None):
    return sqrt(mean(x**2, axis=axis))
If you have nans in your data, you can do
def nanrms(x, axis=None):
    return sqrt(nanmean(x**2, axis=axis))
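As a quick sanity check, the formulations suggested so far can be compared on the same data. This sketch uses a small real-valued array and spells out the np. prefix; the variable names are arbitrary.

```python
import numpy as np

def rms(x, axis=None):
    return np.sqrt(np.mean(np.square(x), axis=axis))

a = np.arange(10, dtype=float)
via_mean = rms(a)                               # sqrt(mean(square))
via_norm = np.linalg.norm(a) / np.sqrt(a.size)  # norm / sqrt(size)
via_dot = np.sqrt(a.dot(a) / a.size)            # dot-product version

m = np.arange(6, dtype=float).reshape(2, 3)
row_rms = rms(m, axis=1)                        # one RMS value per row
```

All three agree for real 1-D input; only the mean-based version takes an axis argument directly.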
Try this:
U = np.zeros((N,N))
ind = 1
k = np.zeros(N)
k[:] = U[ind,:]
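If U really is an np.matrix, as in the question, a row slice stays two-dimensional. One way to get a flat vector is to convert the row to a plain array and ravel it. The sketch below uses a hypothetical 3-by-4 matrix purely for illustration.

```python
import numpy as np

N = 4
U = np.matrix(np.arange(12.0).reshape(3, N))  # hypothetical 3-by-N matrix
ind = 1

row = U[ind, :]              # still a matrix, shape (1, N)
k = np.asarray(row).ravel()  # plain 1-D array, shape (N,)

l = np.zeros(N)
l[:] = k                     # element-wise copy into the existing array
```

For np.matrix objects, the A1 attribute gives the same flattened array in one step.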
I use this for RMS, all using NumPy, and let it also have an optional axis similar to other NumPy functions:
import numpy as np
rms = lambda V, axis=None: np.sqrt(np.mean(np.square(V), axis))
If you have complex vectors and are using PyTorch, the vector norm is the fastest approach on both CPU and GPU:
import torch
batch_size, length = 512, 4096
batch = torch.randn(batch_size, length, dtype=torch.complex64)
scale = 1 / torch.sqrt(torch.tensor(length))
rms_power = batch.norm(p=2, dim=-1, keepdim=True)
batch_rms = batch / (rms_power * scale)
Using a batched vdot, as in goodboy's approach, is 60% slower than the above. Using a naive method similar to deprecated's approach is 85% slower than the above.