Working and manipulating NumPy arrays with Numba

Why can't Numba's jit compile a simple Numpy array operation?
Here is a minimal non-working example that reproduces Numba's failure to compile:
import numpy as np
from numba import jit

rows = 10
columns = 999999

A = np.empty((rows, columns))
b = np.linspace(0, 1, num=rows)

@jit(nopython=True)
def replicate(A, b):
    for i in range(A.shape[1]):
        A[:, i] = b
    return A  # optional

replicate(A, b)
With the following error:
TypingError: Failed at nopython (nopython frontend)
Cannot resolve setitem: array(float64, 1d, C, nonconst)[(slice3_type, int64)] = array(float64, 1d, C, nonconst)
File "<ipython-input-32-db24fbe2922f>", line 12
Am I doing something wrong?
As an aside, I need nopython mode because in my real-life code I frequently perform array addition, multiplication by scalars, and populating arrays with other arrays, and my understanding is that in object mode I won't get loop jitting and thus won't see any real performance boost.

Numba doesn't support NumPy slicing in nopython mode (at least as of the version in use here). Try unrolling the loops explicitly:
import numpy as np
from numba import jit

rows = 10
columns = 999999

a = np.empty((rows, columns))
b = np.linspace(0, 1, num=rows)

@jit(nopython=True)
def replicate(A, b):
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[i, j] = b[i]
    return A  # optional

replicate(a, b)
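For reference, newer Numba releases do support this kind of basic slice assignment in nopython mode, so on a recent install the original function compiles as written. A minimal sketch (assuming a reasonably current Numba version):

import numpy as np
from numba import jit

@jit(nopython=True)
def replicate(A, b):
    # Basic slice assignment compiles in nopython mode on recent Numba.
    for i in range(A.shape[1]):
        A[:, i] = b
    return A

a = np.empty((10, 1000))
b = np.linspace(0, 1, num=10)
replicate(a, b)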

Related

weird behavior of numba guvectorize

I wrote a function to test numba.guvectorize. The function takes the product of two NumPy arrays and computes the sum over the last axis, as follows:
from numba import guvectorize, float64
import numpy as np

@guvectorize([(float64[:], float64[:], float64)], '(n),(n)->()')
def g(x, y, res):
    res = np.sum(x * y)
However, the above guvectorize function returns wrong results as shown below:
>>> a = np.random.randn(3,4)
>>> b = np.random.randn(3,4)
>>> np.sum(a * b, axis=1)
array([-0.83053829, -0.15221319, -2.27825015])
>>> g(a, b)
array([4.67406747e-310, 0.00000000e+000, 1.58101007e-322])
What might be causing this problem?
Function g() receives an uninitialized array through the res parameter. Assigning a new value to it doesn't modify the original array passed to the function.
You need to replace the contents of res (and declare it as an array):
@guvectorize([(float64[:], float64[:], float64[:])], '(n),(n)->()')
def g(x, y, res):
    res[:] = np.sum(x * y)
The function operates on 1D vectors and returns a scalar (hence the signature (n),(n)->()); guvectorize does the job of dealing with 2D inputs and returning a 1D output.
>>> a = np.random.randn(3,4)
>>> b = np.random.randn(3,4)
>>> np.sum(a * b, axis=1)
array([-3.1756397 , 5.72632531, 0.45359806])
>>> g(a, b)
array([-3.1756397 , 5.72632531, 0.45359806])
But the original Numpy function np.sum is already vectorized and compiled, so there is little speed gain in using guvectorize in this specific case.
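A quick way to check this claim is a micro-benchmark (illustrative only, assuming the corrected g from above is in scope; absolute numbers will vary by machine):

import numpy as np
import timeit

a = np.random.randn(3, 4)
b = np.random.randn(3, 4)

# Plain NumPy versus the guvectorized version defined above.
print(timeit.timeit(lambda: np.sum(a * b, axis=1), number=100000))
print(timeit.timeit(lambda: g(a, b), number=100000))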
Your a and b arrays are 2-dimensional, while your guvectorized function has a signature that accepts 1D arrays and returns a 0D scalar. You have to modify it to accept 2D inputs and return a 1D output.
Also, in one case you call np.sum with axis=1 and in the other case without it; you have to do the same thing in both cases.
And instead of res = ..., use res[...] = .... This may not be the cause of the problem in the guvectorize case, but it is a general pitfall in NumPy code: you have to assign values into the array rather than rebind the variable to a new reference.
In my case I also added the cache=True parameter to the guvectorize decorator; it speeds things up by caching and re-using the compiled code instead of re-compiling it on every run.
The full corrected code is below:
from numba import guvectorize, float64
import numpy as np

@guvectorize([(float64[:, :], float64[:, :], float64[:])], '(n, m),(n, m)->(n)', cache=True)
def g(x, y, res):
    res[...] = np.sum(x * y, axis=1)

# Test
np.random.seed(0)
a = np.random.randn(3, 4)
b = np.random.randn(3, 4)
print(np.sum(a * b, axis=1))
print(g(a, b))
Output:
[ 2.57335386 3.41749149 -0.42290296]
[ 2.57335386 3.41749149 -0.42290296]

How to use supported numpy and math functions with CUDA in Python?

According to the numba 0.51.2 documentation, CUDA Python supports several math functions. However, it doesn't work in the following kernel function:
import math
from numba import cuda

@cuda.jit
def find_angle(angles):
    i, j = cuda.grid(2)
    if i < angles.shape[0] and j < angles.shape[1]:
        angles[i][j] = math.atan2(j, i)
The output:
numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
No definition for lowering <built-in function atan2>(int64, int64) -> float64
Am I using the function incorrectly?
The hint to the source of the problem is here:
No definition for lowering <built-in function atan2>(int64, int64) -> float64
The arguments returned by cuda.grid() (i.e. i, j which you are passing to atan2) are integer values because they are related to indexing.
numba can't find a version of atan2 that it can use that takes two integer arguments and returns a floating-point value:
float64 = atan2(int64, int64)
One possible solution is to convert your atan2 input arguments to match the type that numba seems to want to return from that function, which is evidently float64:
from numba import cuda, float64
import numpy
import math

@cuda.jit
def find_angle(angles):
    i, j = cuda.grid(2)
    if i < angles.shape[0] and j < angles.shape[1]:
        angles[i][j] = math.atan2(float64(j), float64(i))

block_x = 32
block_y = 32
block = (block_x, block_y)
x = 256
y = 256
grid = (x // block_x, y // block_y)  # not for arbitrary x and y
angles = numpy.ones((x, y), numpy.float64)
find_angle[grid, block](angles)
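As a side note (not part of the original answer), the grid computation above only works when x and y are exact multiples of the block dimensions. A common pattern for arbitrary sizes is ceiling division, relying on the bounds check inside the kernel:

# Ceiling division covers arbitrary x and y; the kernel's bounds
# check discards the threads that fall outside the array.
grid = ((x + block_x - 1) // block_x, (y + block_y - 1) // block_y)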

Argsort doesn't work in combination with numba

I have some problems applying np.argsort in combination with numba.
The input coord is a two-dimensional array of coordinates which need to be sorted in counter-clockwise order. All the variables are float64 NumPy arrays.
import numpy as np
from numba import jit

@jit
def sortVertices(coord):
    yr = coord[:, 1]
    xc = coord[:, 0]
    xl = np.array([xc.size])
    yl = np.array([yr.size])
    center_xc = np.sum(xc) / xl
    center_yr = np.sum(yr) / yl
    theta = np.arctan2(yr - center_yr, xc - center_xc) * 180.0 / np.pi
    indices = np.argsort(theta)
    x = xc[indices]
    y = yr[indices]
    coord_new = np.vstack((x, y)).T
    return coord_new
I updated numba. The error:
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function sortVertices failed at nopython mode lowering due to: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function make_quicksort_impl.<locals>.run_quicksort at 0x1298bc9d8>) found for signature:
run_quicksort(array(float64, 1d, C))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Overload in function 'register_jitable.<locals>.wrap.<locals>.ov_wrap': File: numba/core/extending.py: Line 150.
With argument(s): '(array(float64, 1d, C))':
Rejected as the implementation raised a specific error:
AttributeError: 'function' object has no attribute 'get_call_template'
raised from /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/numba/core/types/functions.py:511
Thank you in advance.
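A hedged sketch of a workaround (not from the thread): on recent Numba releases np.argsort is supported in nopython mode, and replacing the one-element size arrays with scalar means keeps every intermediate a scalar or a 1D array, which typically compiles cleanly:

import numpy as np
from numba import njit

@njit
def sortVertices(coord):
    xc = coord[:, 0]
    yr = coord[:, 1]
    # Scalar centroid instead of one-element arrays.
    center_xc = xc.mean()
    center_yr = yr.mean()
    theta = np.arctan2(yr - center_yr, xc - center_xc) * 180.0 / np.pi
    indices = np.argsort(theta)
    coord_new = np.empty_like(coord)
    coord_new[:, 0] = xc[indices]
    coord_new[:, 1] = yr[indices]
    return coord_new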

Device function throws nopython exception when it returns a list instead of an integer

A device function I have written always throws a nopython exception, and I do not understand why or where my error is.
Here a small example that represents my problem.
I have the following device function that I call from a kernel:
@cuda.jit(device=True)
def sub_stuff(vec_a, vec_b):
    x0 = vec_a[0] - vec_b[0]
    x1 = vec_a[1] - vec_b[1]
    x2 = vec_a[2] - vec_b[2]
    return [x0, x1, x2]
The kernel that calls this function looks like this:
@cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    if pos < vectors_a.size and pos < result_array.size:
        result_array[pos] = sub_stuff(vectors_a[pos], vectors_b[pos])
The three input arrays are the following:
vectors_a = np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)
When I now call the kernel via kernel_via_polygon(vectors_a, vectors_b, result), a nopython error is thrown. If the device function returns only an integer value instead, the error is prevented.
Can someone explain to me where my mistake is?
Edit: FYI, as answered by talonmies, list construction isn't supported in device code. An alternative that helped me is using tuples, which are supported (a sketch follows after the answer below).
The source of your error is that the device function sub_stuff is attempting to create a list in GPU code, and that isn't supported.
About the best you can do would be something like this:
from numba import cuda
import numpy as np

@cuda.jit(device=True)
def sub_stuff(vec_a, vec_b, result):
    for i in range(vec_a.shape[0]):
        result[i] = vec_a[i] - vec_b[i]

@cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    # Bounds check on rows, not on the total element count.
    if pos < vectors_a.shape[0] and pos < result_array.shape[0]:
        sub_stuff(vectors_a[pos], vectors_b[pos], result_array[pos])

vectors_a = 100 + np.arange(1, 10).reshape((3, 3))
vectors_b = np.arange(1, 10).reshape((3, 3))
result = np.zeros_like(vectors_a)

kernel_via_polygon[1, 10](vectors_a, vectors_b, result)
print(result)
which uses a loop to iterate over the individual array slices and performs the subtraction element by element.
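As a sketch of the tuple-based alternative mentioned in the question's edit (not from the original answer): tuples, unlike lists, are supported in device code, and the returned tuple can be unpacked element-wise in the kernel:

from numba import cuda
import numpy as np

@cuda.jit(device=True)
def sub_stuff(vec_a, vec_b):
    # Tuples are supported in device code, unlike lists.
    return (vec_a[0] - vec_b[0],
            vec_a[1] - vec_b[1],
            vec_a[2] - vec_b[2])

@cuda.jit
def kernel_via_polygon(vectors_a, vectors_b, result_array):
    pos = cuda.grid(1)
    if pos < vectors_a.shape[0] and pos < result_array.shape[0]:
        x0, x1, x2 = sub_stuff(vectors_a[pos], vectors_b[pos])
        result_array[pos, 0] = x0
        result_array[pos, 1] = x1
        result_array[pos, 2] = x2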

Anaconda's NumbaPro CUDA Assertion Error

I am trying to use NumbaPro's cuda extension to multiply large matrices. What I want in the end is to multiply a matrix of size NxN by a diagonal matrix that is fed in as a 1D array (thus a.dot(numpy.diagflat(b)), which I have found to be synonymous with a * b). However, I am getting an assertion error that provides no information.
I can only avoid this assertion error if I multiply two 1D arrays, but that is not what I want to do.
from numbapro import vectorize, cuda
from numba import f4, f8
import numpy as np

def generate_input(n):
    A = np.array(np.random.sample((n, n)))
    B = np.array(np.random.sample(n) + 10)
    return A, B

def product(a, b):
    return a * b

def main():
    cu_product = vectorize([f4(f4, f4), f8(f8, f8)], target='gpu')(product)
    N = 1000
    A, B = generate_input(N)
    D = np.empty(A.shape)
    stream = cuda.stream()
    with stream.auto_synchronize():
        dA = cuda.to_device(A, stream)
        dB = cuda.to_device(B, stream)
        dD = cuda.to_device(D, stream, copy=False)
        cu_product(dA, dB, out=dD, stream=stream)
        dD.to_host(stream)

if __name__ == '__main__':
    main()
This is what my terminal spits out:
Traceback (most recent call last):
File "cuda_vectorize.py", line 32, in <module>
main()
File "cuda_vectorize.py", line 28, in main
cu_product(dA, dB, out=dD, stream=stream)
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 109, in __call__
File "/opt/anaconda1anaconda2anaconda3/lib/python2.7/site-packages/numbapro/_cudadispatch.py", line 191, in _arguments_requirement
AssertionError
The problem is that you are using vectorize on a function that takes non-scalar arguments. The idea behind NumbaPro's vectorize is that it takes a scalar function as input and generates a function that applies the scalar operation in parallel to all the elements of a vector. See the NumbaPro documentation.
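For illustration (a hedged sketch using modern Numba's vectorize, since NumbaPro has long been deprecated), a scalar function vectorized this way works when the inputs have matching shapes:

from numba import vectorize, float64
import numpy as np

@vectorize([float64(float64, float64)])
def product(a, b):
    # Scalar in, scalar out; the generated ufunc handles whole arrays.
    return a * b

A = np.random.sample((4, 4))
B = np.random.sample((4, 4))  # same shape as A, so element-wise works
print(product(A, B))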
Your function takes a matrix and a vector, which are definitely not scalar. [Edit] You can do what you want on the GPU using either NumbaPro's wrapper for cuBLAS, or by writing your own simple kernel function. Here's an example that demonstrates both. Note that you will need NumbaPro 0.12.2 or later (just released as of this edit).
from numbapro import jit, cuda
from numba import float32
import numbapro.cudalib.cublas as cublas
import numpy as np
from timeit import default_timer as timer

def generate_input(n):
    A = np.array(np.random.sample((n, n)), dtype=np.float32)
    B = np.array(np.random.sample(n), dtype=A.dtype)
    return A, B

@cuda.jit(argtypes=[float32[:, :], float32[:, :], float32[:]])
def diagproduct(c, a, b):
    startX, startY = cuda.grid(2)
    gridX = cuda.gridDim.x * cuda.blockDim.x
    gridY = cuda.gridDim.y * cuda.blockDim.y
    height, width = c.shape
    for y in range(startY, height, gridY):
        for x in range(startX, width, gridX):
            c[y, x] = a[y, x] * b[x]

def main():
    N = 1000
    A, B = generate_input(N)
    D = np.empty(A.shape, dtype=A.dtype)
    E = np.zeros(A.shape, dtype=A.dtype)
    F = np.empty(A.shape, dtype=A.dtype)

    start = timer()
    E = np.dot(A, np.diag(B))
    numpy_time = timer() - start

    blas = cublas.api.Blas()
    start = timer()
    blas.gemm('N', 'N', N, N, N, 1.0, np.diag(B), A, 0.0, D)
    cublas_time = timer() - start
    diff = np.abs(D - E)
    print("Maximum CUBLAS error %f" % np.max(diff))

    blockdim = (32, 8)
    griddim = (16, 16)
    start = timer()
    dA = cuda.to_device(A)
    dB = cuda.to_device(B)
    dF = cuda.to_device(F, copy=False)
    diagproduct[griddim, blockdim](dF, dA, dB)
    dF.to_host()
    cuda_time = timer() - start
    diff = np.abs(F - E)
    print("Maximum CUDA error %f" % np.max(diff))

    print("Numpy took %f seconds" % numpy_time)
    print("CUBLAS took %f seconds, %0.2fx speedup" % (cublas_time, numpy_time / cublas_time))
    print("CUDA JIT took %f seconds, %0.2fx speedup" % (cuda_time, numpy_time / cuda_time))

if __name__ == '__main__':
    main()
The kernel is significantly faster because SGEMM does a full matrix-matrix multiply (O(n^3)) and expands the diagonal into a full matrix, while the diagproduct function is smarter: it does a single multiply for each matrix element and never expands the diagonal into a full matrix. Here are the results on my NVIDIA Tesla K20c GPU for N=1000:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.024535 seconds
CUBLAS took 0.010345 seconds, 2.37x speedup
CUDA JIT took 0.004857 seconds, 5.05x speedup
The timing includes all of the copies to and from the GPU, which is a significant bottleneck for small matrices. If we set N to 10,000 and run again, we get a much bigger speedup:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 7.245677 seconds
CUBLAS took 1.371524 seconds, 5.28x speedup
CUDA JIT took 0.264598 seconds, 27.38x speedup
For very small matrices, however, CUBLAS SGEMM has an optimized path, so it is closer to the CUDA performance. Here, N=100:
Maximum CUBLAS error 0.000000
Maximum CUDA error 0.000000
Numpy took 0.006876 seconds
CUBLAS took 0.001425 seconds, 4.83x speedup
CUDA JIT took 0.001313 seconds, 5.24x speedup
Just to bounce back on all these considerations: I also wanted to implement some matrix computations on CUDA, but then heard about the numpy.einsum function.
It turns out that einsum is incredibly fast.
For a case like this, here is the code for it, but it can be applied to many other types of computations.
G = np.einsum('ij,j->ij', A, B)
In terms of speed, here are the results for N = 10000
Numpy took 8.387756 seconds
CUDA JIT took 0.218394 seconds, 38.41x speedup
EINSUM took 0.131751 seconds, 63.66x speedup
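For what it's worth, the three formulations that appear in this thread are numerically equivalent, which is easy to sanity-check (a small illustrative snippet, not from the original answers):

import numpy as np

A = np.random.sample((100, 100))
B = np.random.sample(100)

ref = A.dot(np.diagflat(B))  # the original formulation
print(np.allclose(ref, A * B))  # broadcasting
print(np.allclose(ref, np.einsum('ij,j->ij', A, B)))  # einsum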
