The original issue stems from using np.linspace with arrays as the start and stop parameters, but right now I'm having trouble with the workaround I came up with.
Take the following:
from numba import njit
import numpy as np
@njit
def f1():
    start = np.array([0.1, 1.0], np.float32)
    stop = np.array([1.0, 10.0], np.float32)
    return np.linspace(start, stop, 10)
f1()
This raises an error: although the documentation says Numba supports "only the 3-argument form" of linspace, what it actually means is "the 3-argument form with scalar values for start and stop".
So I came up with the following workaround:
import numpy as np
from numba import njit
@njit
def f2():
    start = np.array([0.1, 1.0], np.float32)
    stop = np.array([1.0, 10.0], np.float32)
    pts_0 = np.linspace(start[0], stop[0], 10).astype(np.float32)  # works
    pts_1 = np.linspace(start[1], stop[1], 10).astype(np.float32)  # works
    return np.stack([pts_0, pts_1]).T  # error
which raises this error:
---------------------------------------------------------------------------
TypingError Traceback (most recent call last)
c:\Users\X\Desktop\X\data_analysis.ipynb Cell 46' in <cell line: 18>()
15 pts_1 = np.linspace(start[1], stop[1], 10).astype(np.float32)
16 return np.stack([pts_0, pts_1]).T
---> 18 r = f2()
File c:\Users\X\miniconda3\envs\X\lib\site-packages\numba\core\dispatcher.py:468, in _DispatcherBase._compile_for_args(self, *args, **kws)
464 msg = (f"{str(e).rstrip()} \n\nThis error may have been caused "
465 f"by the following argument(s):\n{args_str}\n")
466 e.patch_message(msg)
--> 468 error_rewrite(e, 'typing')
469 except errors.UnsupportedError as e:
470 # Something unsupported is present in the user code, add help info
471 error_rewrite(e, 'unsupported_error')
File c:\Users\X\miniconda3\envs\X\lib\site-packages\numba\core\dispatcher.py:409, in _DispatcherBase._compile_for_args.<locals>.error_rewrite(e, issue_type)
407 raise e
408 else:
--> 409 raise e.with_traceback(None)
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function stack at 0x00000186F280CAF0>) found for signature:
>>> stack(list(array(float32, 1d, C))<iv=None>)
Again, according to the documentation, np.stack is supported (with no side comments on this one either).
What am I missing?
np.stack is supported, but so far it expects a tuple rather than a list. Here is a fixed version:
@njit
def f2():
    start = np.array([0.1, 1.0], np.float32)
    stop = np.array([1.0, 10.0], np.float32)
    pts_0 = np.linspace(start[0], stop[0], 10).astype(np.float32)  # works
    pts_1 = np.linspace(start[1], stop[1], 10).astype(np.float32)  # works
    return np.stack((pts_0, pts_1)).T  # works
By the way, note that np.stack((pts_0, pts_1)).T is not very efficient, since it creates temporary arrays and a non-contiguous view. Since the purpose of using Numba is to speed code up, consider using basic loops, which should be faster here. The same applies to astype(np.float32): a loop can cast the values in place. Memory and allocations are expensive, and this is often what makes NumPy slow (along with the lack of special-purpose functions). Such costs will only get worse in the future (for more information, read about the "memory wall"), so one needs to avoid them.
Here is a significantly faster version with basic loops:
@njit
def f2():
    start1, start2 = np.float32(0.1), np.float32(1.0)
    stop1, stop2 = np.float32(1.0), np.float32(10.0)
    steps = 10
    delta = np.float32(1 / (steps - 1))
    res = np.empty((steps, 2), dtype=np.float32)
    for i in range(steps):
        res[i, 0] = start1 + (stop1 - start1) * (delta * i)
        res[i, 1] = start2 + (stop2 - start2) * (delta * i)
    return res
Note that results can be slightly different due to 32-bit FP rounding.
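If you need this for arbitrary start/stop arrays rather than hard-coded values, here is a sketch that generalizes the loop-based approach (my own variant, with the hypothetical name linspace_2d, so treat it as illustrative rather than part of the original answer):
from numba import njit
import numpy as np

@njit
def linspace_2d(start, stop, steps):
    # start and stop are 1-D float32 arrays of equal length
    m = start.shape[0]
    delta = np.float32(1.0 / (steps - 1))
    res = np.empty((steps, m), dtype=np.float32)
    for i in range(steps):
        t = np.float32(delta * i)
        for j in range(m):
            res[i, j] = start[j] + (stop[j] - start[j]) * t
    return res

start = np.array([0.1, 1.0], np.float32)
stop = np.array([1.0, 10.0], np.float32)
pts = linspace_2d(start, stop, 10)  # shape (10, 2), same layout as np.stack(...).T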
Related
I wrote this function to perform a rolling sum on NumPy arrays, inspired by this post:
import numpy as np

def np_rolling_sum(arr, n, axis=0):
    out = np.cumsum(arr, axis=axis)
    slc1 = [slice(None)] * len(arr.shape)
    slc2 = [slice(None)] * len(arr.shape)
    slc1[axis] = slice(n, None)
    slc2[axis] = slice(None, -n)
    out = out[tuple(slc1)] - out[tuple(slc2)]
    shape = list(out.shape)
    shape[axis] = arr.shape[axis] - out.shape[axis]
    out = np.concatenate((np.full(shape, 0), out), axis=axis)
    return out
It works fine, except when I need to use it on large arrays (around 1 billion elements). In that case, I get a SIGKILL on this line:
out = out[tuple(slc1)] - out[tuple(slc2)]
I already tried deleting arr after the cumsum since I no longer need it (apart from its shape, which I can store before the deletion), but it didn't help.
My next guess would be to implement batch processing for the operation causing the memory issue. Is there a better way to write this function so that it can deal with larger arrays?
Thanks for your help!
For people who might be interested, I finally added a decorator that checks whether NumPy array arguments are larger than a given size. If so, it turns them into dask arrays.
In order to keep the main function as close to the original as possible, I also added an argument that indicates which library should be used: numpy or dask.array.
Here is the final result:
import numpy as np
import dask.array as da
threshold = 50_000_000
def large_file_handler(func):
    def wrapper(*args, **kwargs):
        pos = list(args)
        for i in range(len(pos)):
            if type(pos[i]) == np.ndarray and pos[i].size > threshold:
                pos[i] = da.from_array(pos[i])
                kwargs['func_lib'] = da
        for k in list(kwargs):  # copy the keys since 'func_lib' may be added below
            if type(kwargs[k]) == np.ndarray and kwargs[k].size > threshold:
                kwargs[k] = da.from_array(kwargs[k])
                kwargs['func_lib'] = da
        return func(*pos, **kwargs)
    return wrapper
@large_file_handler
def np_rolling_sum(arr, n, axis=0, func_lib=np):
    out = func_lib.cumsum(arr, axis=axis)
    slc1 = [slice(None)] * len(arr.shape)
    slc2 = [slice(None)] * len(arr.shape)
    slc1[axis] = slice(n, None)
    slc2[axis] = slice(None, -n)
    out = out[tuple(slc1)] - out[tuple(slc2)]
    shape = list(out.shape)
    shape[axis] = arr.shape[axis] - out.shape[axis]
    out = func_lib.concatenate((np.full(shape, 0), out), axis=axis)
    return np.array(out)
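For anyone wanting to try it, here is a minimal usage sketch (the array sizes are hypothetical, chosen only so that one call stays on the plain NumPy path and the other crosses the threshold and goes through dask):
small = np.random.random(1_000_000)    # below the threshold: handled by NumPy
large = np.random.random(60_000_000)   # above the threshold: converted to a dask array

print(np_rolling_sum(small, 10).shape)  # (1000000,)
print(np_rolling_sum(large, 10).shape)  # (60000000,)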
Please feel free to tell me if this could be improved.
I have a Python code that scales really badly with n: for n=50 the run time is seconds, but for n=1000 it is many hours.
I therefore tried to use Numba to speed it up, but have been running into a lot of errors. I have managed to fix the errors so far, but now I get a "StopIteration" error that gives no hint of what bug to look for in the code.
Here is the non-Numba version (with the output which is what I am trying to get with Numba):
import numpy as np
import matplotlib.pyplot as plt
from math import *

counter = 0
n = 100
iter = 0

np.random.seed(0)
Zinitial = np.random.normal(0, 1, size=(2, n))[0, :]
np.random.seed(1)
Pinitial = np.random.normal(0, 1, size=(2, n))[1, :]

SPIN = np.zeros(n, dtype=int)
SPIN[int(n/2):] = 0
SPIN[:int(n/2)] = 1

SP = np.array(sorted(np.array([np.array([i, j, k]) for i, j, k in zip(Zinitial, Pinitial, SPIN)]),
                     key=lambda x: x[0]))

print("Initial energy : ", (np.sum(SP[:, 0]**2) + np.sum(SP[:, 1]**2))/2)

total_time = 0
Tmax = 10
alf = sqrt(10)

while total_time < Tmax:
    T = []
    for j in range(n - 1):
        b = (SP[j+1, 1] - SP[j, 1])/(SP[j+1, 0] - SP[j, 0])
        val1 = b + sqrt(b**2 + 2)
        val2 = b - sqrt(b**2 + 2)
        if val1 > 0:
            T.append(val1)
        else:
            T.append(val2)
    T = np.array(T)
    dt = min(T[T > 0])
    total_time = dt + total_time
    indix = list(T).index(dt)
    SP0 = SP[:, 0].copy()
    SP[:, 0] = SP0*cos(dt) + SP[:, 1]*sin(dt)
    SP[:, 1] = SP[:, 1]*cos(dt) - SP0*sin(dt)
    prel = (SP[indix, 1] - SP[indix+1, 1])/2
    rcoeff = 1/(1 + (prel*alf)**2)
    SP[[indix, indix+1]] = SP[[indix+1, indix]]
    SP = np.array(sorted(SP, key=lambda x: x[0]))
    rand_value = np.random.random()
    rcoeff = 1/(1 + (prel*alf)**2)
    if rcoeff > rand_value and SP[indix, 2] != SP[indix+1, 2]:
        counter = counter + 1
        SP[indix, 2], SP[indix+1, 2] = SP[indix+1, 2], SP[indix, 2]

print("total_time = ", total_time)
print("n= ", n)
print("rate = ", 2*counter/(n*(total_time)))
OUTPUT:
Initial energy : 95.56821008840404
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:29: RuntimeWarning: divide by zero encountered in double_scalars
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:31: RuntimeWarning: invalid value encountered in double_scalars
total_time = 10.000079819065235
n= 100
rate = 3.2079743942482555
final energy : 95.5682100884033
Here is the devil Numba version with the error output:
import numpy as np
import matplotlib.pyplot as plt
from math import *
from numba import jit
from numba import types, typed

@jit(nopython=True)
def f(SP, alf, Tmax, n):
    counter = 0
    total_time = 0
    T = np.empty(0, dtype=np.float64)
    while total_time < Tmax:
        for j in range(n - 1):
            b = (SP[j+1, 1] - SP[j, 1])/(SP[j+1, 0] - SP[j, 0])
            val1 = b + sqrt(b**2 + 2)
            val2 = b - sqrt(b**2 + 2)
            if val1 > 0:
                np.append(T, val1)
            else:
                np.append(T, val2)
        dt = min(T[T > 0])
        total_time = dt + total_time
        indix, = np.where(T == dt)[0]
        SP0 = SP[:, 0].copy()
        SP[:, 0] = SP0*cos(dt) + SP[:, 1]*sin(dt)
        SP[:, 1] = SP[:, 1]*cos(dt) - SP0*sin(dt)
        prel = (SP[indix, 1] - SP[indix+1, 1])/2
        rcoeff = 1/(1 + (prel*alf)**2)
        for h in range(n - 1):
            for z in range(SP.shape[1]):
                SP[h, z], SP[h + 1, z] = SP[h + 1, z], SP[h, z]
        SP = SP[SP[:, 0].argsort()]
        rand_value = np.random.random()
        rcoeff = 1/(1 + (prel*alf)**2)
        if rcoeff > rand_value and SP[indix, 2] != SP[indix+1, 2]:
            counter = counter + 1
            SP[indix, 2], SP[indix+1, 2] = SP[indix+1, 2], SP[indix, 2]
    rate = 2*counter/(n*total_time)
    energy = np.sum((SP[:, 0]**2 + SP[:, 1]**2)/2)
    return rate, energy

if __name__ == '__main__':
    n = 100
    Tmax = 10
    alf = sqrt(10)
    np.random.seed(0)
    Zinitial = np.random.normal(0, 1, size=(2, n))[0, :]
    np.random.seed(1)
    Pinitial = np.random.normal(0, 1, size=(2, n))[1, :]
    SPIN = np.zeros(n, dtype=int)
    SPIN[int(n/2):] = 0
    SPIN[:int(n/2)] = 1
    SP = np.array(sorted(np.array([np.array([i, j, k]) for i, j, k in zip(Zinitial, Pinitial, SPIN)]),
                         key=lambda x: x[0]))
    print("Initial energy : ", (np.sum(SP[:, 0]**2) + np.sum(SP[:, 1]**2))/2)
    rate, energy = f(SP, alf, Tmax, n)
    print("Rate of collision per particle = ", rate)
    print("Final energy : ", energy)
OUTPUT:
Initial energy : 95.56821008840404
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
<ipython-input-49-9d8cc152e0a8> in <module>()
64
65 print("Initial energy : ",(np.sum(SP[:,0]**2)+np.sum(SP[:,1]**2))/2 )
---> 66 rate,energy = f(SP, alf, Tmax, n)
67 print("Rate of collision per particle = ",rate)
68 print("Final energy : ",energy)
StopIteration:
Pardon me for dumping huge chunks of code, but I don't know which part has the bug.
There are many problems in the Numba code:
The result of np.append is never assigned, so the statement does nothing: np.append is not an in-place operation. This is why you get a StopIteration: the expression dt = min(T[T>0]) fails because T is empty.
np.append should never be used in a loop, since it creates a new array on every iteration, resulting in slow quadratic execution. Use a preallocated array (best solution) or a list (less efficient).
There are out-of-bounds accesses causing crashes and undefined behaviour. Please use the flags debug=True and boundscheck=True to track them down. For example, indix is set to 228 (which seems legitimate since T.shape is 297) while SP.shape is (100, 3), so SP[indix, 1] fails. Numba is designed to be fast by default, so it does not track out-of-bounds accesses, which can cause surprising behaviour: an out-of-bounds access is undefined behaviour, and the JIT can make wild assumptions in that case. I also strongly advise you to print the values of the variables to check that deterministic computations (i.e. not random-based ones) are the same in Numba.
Note that np.random.seed does not affect the Numba seed (Numba uses a separate seed that is not synchronised with NumPy). Thus, random numbers will likely differ between Numba and pure-NumPy code. A short sketch illustrating these points follows.
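To make those points concrete, here is a small sketch of my own (the helper compute_step is hypothetical and not part of the original code) showing a preallocated T, the debugging flags, and seeding Numba's generator from inside the compiled function:
import numpy as np
from numba import jit

# boundscheck=True and debug=True make Numba report out-of-bounds accesses
# instead of silently running into undefined behaviour (use only while
# debugging, since they slow the code down).
@jit(nopython=True, boundscheck=True, debug=True)
def compute_step(SP, n):
    np.random.seed(0)                      # seeds Numba's own generator, separate from NumPy's
    T = np.empty(n - 1, dtype=np.float64)  # preallocated once instead of np.append in a loop
    for j in range(n - 1):
        b = (SP[j+1, 1] - SP[j, 1]) / (SP[j+1, 0] - SP[j, 0])
        root = np.sqrt(b*b + 2)
        T[j] = b + root if b + root > 0 else b - root
    dt = T[T > 0].min()                    # T now has the expected n-1 entries
    return dt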
A device function I have written always throws a nopython exception, and I do not understand why or where my error is.
Here is a small example that reproduces my problem.
I have the following device function that I call from a kernel:
@cuda.jit(device=True)
def sub_stuff(vec_a, vec_b):
    x0 = vec_a[0] - vec_b[0]
    x1 = vec_a[1] - vec_b[1]
    x2 = vec_a[2] - vec_b[2]
    return [x0, x1, x2]
The kernel that calls this function looks like this:
@cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    if pos < vectors_a.size and pos < result_array.size:
        result_array[pos] = sub_stuff(vectors_a[pos], vectors_b[pos])
The three input arrays are the following:
vectors_a = np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)
When I now call the function via trace_via_polygon(vectors_a, vectors_b, result), a nopython error is thrown. If the device function returns only an integer value, this error does not occur.
Can someone explain to me where my mistake is?
Edit: FYI, as answered by talonmies, list construction isn't supported in device code. An alternative that helped me is using tuples, which are supported.
The source of your error is that the device function sub_stuff is attempting to create a list in GPU code, and that isn't supported.
About the best you can do would be something like this:
from numba import jit, guvectorize, int32, int64, float64
from numba import cuda
import numpy as np
import math
@cuda.jit(device=True)
def sub_stuff(vec_a, vec_b, result):
    for i in range(vec_a.shape[0]):
        result[i] = vec_a[i] - vec_b[i]

@cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    if pos < vectors_a.size and pos < result_array.size:
        sub_stuff(vectors_a[pos], vectors_b[pos], result_array[pos])
vectors_a = 100 + np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)
kernel_via_polygon[1,10](vectors_a, vectors_b, result)
print(result)
which uses a loop to iterate over the individual array slices and perform the subtraction between each element.
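As noted in the question's edit, returning a small tuple from the device function is another option. Here is a rough sketch of that variant (sub_stuff_tuple and kernel_via_polygon_tuple are my own names, and this assumes the vectors always have length 3):
from numba import cuda
import numpy as np

@cuda.jit(device=True)
def sub_stuff_tuple(vec_a, vec_b):
    # A tuple of scalars is a supported return type for device functions.
    return (vec_a[0] - vec_b[0],
            vec_a[1] - vec_b[1],
            vec_a[2] - vec_b[2])

@cuda.jit
def kernel_via_polygon_tuple(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    if pos < vectors_a.shape[0] and pos < result_array.shape[0]:
        x0, x1, x2 = sub_stuff_tuple(vectors_a[pos], vectors_b[pos])
        result_array[pos, 0] = x0
        result_array[pos, 1] = x1
        result_array[pos, 2] = x2

vectors_a = 100 + np.arange(1, 10).reshape((3, 3)).astype(np.float64)
vectors_b = np.arange(1, 10).reshape((3, 3)).astype(np.float64)
result = np.zeros_like(vectors_a)

kernel_via_polygon_tuple[1, 10](vectors_a, vectors_b, result)
print(result)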
I'm trying to exponentiate a complex matrix in Python and am running into some trouble. I'm using the scipy.linalg.expm function, and am getting a rather strange error message when I try the following code:
import numpy as np
from scipy import linalg
hamiltonian = np.mat('[1,0,0,0;0,-1,0,0;0,0,-1,0;0,0,0,1]')
# This works
t_list = np.linspace(0,1,10)
unitary = [linalg.expm(-(1j)*t*hamiltonian) for t in t_list]
# This doesn't
t_list = np.linspace(0,10,100)
unitary = [linalg.expm(-(1j)*t*hamiltonian) for t in t_list]
The error when the second experiment is run is:
This works!
Traceback (most recent call last):
File "matrix_exp.py", line 11, in <module>
unitary_t = [linalg.expm(-1*t*(1j)*hamiltonian) for t in t_list]
File "/usr/lib/python2.7/dist-packages/scipy/linalg/matfuncs.py", line 105, in expm
return scipy.sparse.linalg.expm(A)
File "/usr/lib/python2.7/dist- packages/scipy/sparse/linalg/matfuncs.py", line 344, in expm
X = _fragment_2_1(X, A, s)
File "/usr/lib/python2.7/dist- packages/scipy/sparse/linalg/matfuncs.py", line 462, in _fragment_2_1
X[k, k] = exp_diag[k]
TypeError: only length-1 arrays can be converted to Python scalars
This seems really strange since all I changed was the range of t I was using. Is it because the Hamiltonian is diagonal? In general, the Hamiltonians won't be, but I also want it to work for diagonal ones. I don't really know the mechanics of expm, so any help would be greatly appreciated.
That is interesting. One thing I can say is that the problem is specific to the np.matrix subclass. For example, the following works fine:
h = np.array(hamiltonian)
unitary = [linalg.expm(-(1j)*t*h) for t in t_list]
Digging a little deeper into the traceback, the exception is being raised in _fragment_2_1 in scipy.sparse.linalg.matfuncs.py, specifically these lines:
n = X.shape[0]
diag_T = T.diagonal().copy()
# Replace diag(X) by exp(2^-s diag(T)).
scale = 2 ** -s
exp_diag = np.exp(scale * diag_T)
for k in range(n):
    X[k, k] = exp_diag[k]
The error message
X[k, k] = exp_diag[k]
TypeError: only length-1 arrays can be converted to Python scalars
suggests to me that exp_diag[k] ought to be a scalar, but is instead returning a vector (and you can't assign a vector to X[k, k], which is a scalar).
Setting a breakpoint and examining the shapes of these variables confirms this:
ipdb> l
751 # Replace diag(X) by exp(2^-s diag(T)).
752 scale = 2 ** -s
753 exp_diag = np.exp(scale * diag_T)
754 for k in range(n):
755 import ipdb; ipdb.set_trace() # breakpoint e86ebbd4 //
--> 756 X[k, k] = exp_diag[k]
757
758 for i in range(s-1, -1, -1):
759 X = X.dot(X)
760
761 # Replace diag(X) by exp(2^-i diag(T)).
ipdb> exp_diag.shape
(1, 4)
ipdb> exp_diag[k].shape
(1, 4)
ipdb> X[k, k].shape
()
The underlying problem is that exp_diag is assumed to be either 1D or a column vector, but the diagonal of an np.matrix object is a row vector. This highlights a more general point that np.matrix is generally less well-supported than np.ndarray, so in most cases it's better to use the latter.
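You can see the difference directly; a quick illustration:
import numpy as np

m = np.mat('[1,0;0,2]')
a = np.array(m)

print(m.diagonal().shape)   # (1, 2): np.matrix stays 2-D, so the diagonal is a row vector
print(a.diagonal().shape)   # (2,): a plain ndarray gives a true 1-D diagonal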
One possible solution would be to use np.ravel() to flatten diag_T into a 1D np.ndarray:
diag_T = np.ravel(T.diagonal().copy())
This seems to fix the problem you're encountering, although there may be other issues relating to np.matrix that I haven't spotted yet.
I've opened a pull request here.
I am trying to use NumbaPro's cuda extension to multiply large matrices. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that would be fed in as a 1D array (thus, a.dot(numpy.diagflat(b)), which I have found to be synonymous with a * b). However, I am getting an assertion error that provides no information.
I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.
from numbapro import vectorize, cuda
from numba import f4,f8
import numpy as np
def generate_input(n):
    import numpy as np
    A = np.array(np.random.sample((n,n)))
    B = np.array(np.random.sample(n) + 10)
    return A, B

def product(a, b):
    return a * b
def main():
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)

    N = 1000
    A, B = generate_input(N)
    D = np.empty(A.shape)

    stream = cuda.stream()
    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dD = cuda.to_device(D, stream, copy=False)
        cu_product(dA, dB, out=dD, stream=stream)
        dD.to_host(stream)

if __name__ == '__main__':
    main()
This is what my terminal spits out:
Traceback (most recent call last):
File "cuda_vectorize.py", line 32, in <module>
main()
File "cuda_vectorize.py", line 28, in main
cu_product(dA, dB, out=dD, stream=stream)
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError
The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea with NumbaPro's vectorize is that it takes a scalar function as input and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
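For comparison, this is the usage pattern vectorize is designed for: a scalar kernel mapped element-wise over arrays of matching shape (a minimal sketch using the same signature style as the question, not a fix for the diagonal-product case):
from numbapro import vectorize
from numba import f8
import numpy as np

@vectorize([f8(f8, f8)], target='gpu')
def scalar_product(a, b):
    return a * b            # operates on scalars; vectorize maps it over whole arrays

x = np.random.sample(1000)
y = np.random.sample(1000)
z = scalar_product(x, y)    # element-wise product of two same-shaped 1-D arrays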
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n,n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:,:], float32[:,:], float32[:]])
def diagproduct(c, a, b):
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape
    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            c[y, x] = a[y, x] * b[x]

def main():
    N = 1000
    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    blas = cublas.api.Blas()
    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start

    diff = np.abs(D-E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    blockdim = (32, 8)
    griddim = (16, 16)
    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start

    diff = np.abs(F-E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took %f seconds" % numpy_time)
    print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()
The kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)), and expands the diagonal into a full matrix. The diagproduct function is smarter. It simply does a single multiply for each matrix element, and never expands the diagonal to a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
For very small matrices, however, CUBLAS SGEMM has an optimized path so it is closer to the CUDA performance. Here, N=100
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
Just to bounce back on all these considerations: I also wanted to implement some matrix computations on CUDA, but then heard about the numpy.einsum function.
It turns out that einsum is incredibly fast.
For a case like this, here is the code. But it can be applied to many other types of computation.
G = np.einsum('ij,j -> ij',A, B)
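As a quick sanity check (a small sketch of my own), the einsum result matches both the broadcasting form and the explicit diagonal-matrix form:
import numpy as np

N = 100
A = np.random.sample((N, N))
B = np.random.sample(N) + 10

G = np.einsum('ij,j -> ij', A, B)
assert np.allclose(G, A * B)                  # broadcasting form
assert np.allclose(G, A.dot(np.diagflat(B)))  # explicit diagonal matrix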
In terms of speed, here are the results for N = 10000
Numpy took 8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup