How to do indexing properly in numba.cuda?

How to do indexing properly in numba.cuda? - python

I am writing a code which needs to do some indexing in python using numba.
However, I cannot do it correctly.
It seems something is prohibited.
The code is as follows:
from numba import cuda
import numpy as np
#cuda.jit
def function(output, size, random_array):
i_p, i_k1, i_k2 = cuda.grid(3)
if i_p<size and i_k1<size and i_k2<size:
a1=i_p**2+i_k1
a2=i_p**2+i_k2
a3=i_k1**2+i_k2**2
a=[a1,a2,a3]
for i in range(len(random_array)):
output[i_p,i_k1,i_k2,i] = a[int(random_array[i])]
output=cuda.device_array((10,10,10,5))
random_array=cuda.to_device(np.array([np.random.random()*3 for i in range(5)]))
size=10
threadsperblock = (8, 8, 8)
blockspergridx=(size + (threadsperblock[0] - 1)) // threadsperblock[0]
blockspergrid = ((blockspergridx, blockspergridx, blockspergridx))
# Start the kernel
function[blockspergrid, threadsperblock](output, size, random_array)
print(output.copy_to_host())
It yields an error:
LoweringError: Failed at nopython (nopython mode backend)
'CUDATargetContext' object has no attribute 'build_list'
File "<ipython-input-57-6058e2bfe8b9>", line 10
[1] During: lowering "$40.21 = build_list(items=[Var(a1, <ipython-input-57-6058e2bfe8b9> (7)), Var(a2, <ipython-input-57-6058e2bfe8b9> (8)), Var(a3, <ipython-input-57-6058e2bfe8b9> (9))])" at <ipython-input-57-6058e2bfe8b9> (10
Can anyone help me with this?
One choice is to feed a also as an input of the function, but when a is really large like some 1000*1000*1000*7 array, it always gives me out off memory.

The problem has nothing to do with array indexing. Within the kernel, this line:
a=[a1,a2,a3]
is not supported. You cannot create a list within a #cuda.jit function. The exact list of supported Python types within kernels is fully documented here.

Related

How to use numba?

I meet a problem about using numba jit decorator (#nb.jit)! Here is the warning from jupyter notebook,
NumbaWarning:
Compilation is falling back to object mode WITH looplifting enabled because Function "get_nb_freq" failed type inference due to: No implementation of function Function(<function dotenter image description here at 0x00000190AC399B80>) found for signature:
This is complete information:
Here is my code:
#numba.jit
def get_nb_freq( nb_count = None, onehot_ct = None):
# nb_freq = onehot_ct.T # nb_count
nb_freq = np.dot(onehot_ct.T, nb_count)
res = nb_freq/nb_freq.sum(axis = 1).reshape(Num_celltype,-1)
return res
## onehot_ct is array, and its shape is (921600,4)
## nb_count is array, and its shape is the same as onehot_ct
## Num_celltype equals 4

Based on your mentioned shapes we can create the arrays as:
onehot_ct = np.random.rand(921600, 4)
nb_count = np.random.rand(921600, 4)
Your prepared code will be work correctly and get an answer like:
[[0.25013102754197963 0.25021461207825463 0.2496806287276126 0.24997373165215303]
[0.2501574139037384 0.25018726649940737 0.24975108864220968 0.24990423095464467]
[0.25020550587624757 0.2501303498983212 0.24978335463279314 0.24988078959263807]
[0.2501855533482036 0.2500913419625523 0.24979681404573967 0.24992629064350436]]
So, it shows the code is working and the problem seems to be related to type of the arrays, that numba can not recognize them. So, signature may be helpful here, which by we can recognize the types manually for the function. So, based on the error I think the following signature will pass your issue:
#nb.jit("float64[:, ::1](float64[:, ::1], float32[:, ::1])")
def get_nb_freq( nb_count = None, onehot_ct = None):
nb_freq = np.dot(onehot_ct.T, nb_count)
res = nb_freq/nb_freq.sum(axis=1).reshape(4, -1)
return res
But it will stuck again if you test by get_nb_freq(nb_count.astype(np.float64), onehot_ct.astype(np.float32)), So another cause could be related to unequal types in np.dot. So, use the onehot_ct array as array type np.float64, could pass the issue:
#nb.jit("float64[:, ::1](float64[:, ::1], float32[:, ::1])")
def get_nb_freq( nb_count, onehot_ct):
nb_freq = np.dot(onehot_ct.astype(np.float64).T, nb_count)
res = nb_freq/nb_freq.sum(axis=1).reshape(4, -1)
return res
It ran on my machine with this correction. I recommend write numba equivalent codes (like this for np.dot) instead using np.dot or …, which can be much faster.

numba: overload of function

I'm trying out numba, the python package that is said to make my nparray super fast. I want to run my function in nonpython mode. What it essentially does is that it takes in an 20x20 array, assigns random numbers to each of its elements, calculate its inverse matrix, then return it.
But here's the problem, when I initialize the array result with np.zeros(), my script crashes and gives me an error message 'overload of function zeros'.
Could someone kindly tell me what is going on? Much appreciated.
from numba import njit
import time
import numpy as np
import random
arr = np.zeros((20,20),dtype = float)
#njit
def aFunctionWithNumba (incomingArray):
result = np.zeros(np.shape(incomingArray), dtype = float)
for i in range(len(incomingArray[0])):
for j in range(len(incomingArray[1])):
incomingArray[i,j] = random.randrange(105150,1541586)
result = np.linalg.inv(incomingArray)
return result
t0 = time.time()
fastArray = aFunctionWithNumba(arr)
t1 = time.time()
s1 = t1 - t0
Here's the full error message:
Exception has occurred: TypingError Failed in nopython mode pipeline (step: nopython frontend) No implementation of function Function(<built-in function zeros>) found for signature:
>>> zeros(UniTuple(int64 x 2), dtype=Function(<class 'float'>)) There are 2 candidate implementations:
- Of which 2 did not match due to: Overload of function 'zeros': File: numba\core\typing\npydecl.py: Line 511.
With argument(s): '(UniTuple(int64 x 2), dtype=Function(<class 'float'>))': No match.
During: resolving callee type: Function(<built-in function zeros>) During: typing of call at c:\Users\Eric\Desktop\testNumba.py (9)
File "testNumba.py", line 9: def aFunctionWithNumba (incomingArray):
result = np.zeros(np.shape(incomingArray), dtype = float)
^ File "C:\Users\Eric\Desktop\testNumba.py", line 25, in <module>
fastArray = aFunctionWithNumba(arr)

The error
You should use Numpy or Numba types inside JITted functions.
Changing the following line your code works:
result = np.zeros(np.shape(incomingArray), dtype=np.float64)
But your code will be more generic using:
result = np.zeros(incomingArray.shape, dtype=incomingArray.dtype)
Or, even better:
result = np.zeros_like(incomingArray)
The timing
The first time you call a JITted function it will take some time to compile it, much longer than the time it will take to execute it. So you should call the function with the same parameter types once before you make any timings.
Additional optimization
If you are interested in comparing the execution time of nested loops with or without Numba, your code is fine. Otherwise you can replace the loops with something like:
incomingArray[:] = np.random.random(incomingArray.shape) * (1541586 - 105150) + 105150

Why is Numba failing to compile this function?

I am trying to compile the following function with Numba:
#njit(fastmath=True, nogil=True)
def generate_items(array, start):
array_positions = np.empty(SIZE, dtype=np.int64)
count = 0
while count < SIZE - start:
new_array = mutate(np.empty(0, dtype=np.uint8))
if len(new_array) > 0:
array_positions[count] = len(array) # <<=== FAILS HERE
array = np.append(array, np.append(new_array, 255))
count += 1
return array, array_positions
But it fails on the indicated line above with this error message:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Cannot unify array(uint8, 1d, C) and array(int64, 1d, C) for 'array.3', defined at ...
Which doesn't seem to make sense since I'm just assigning an int (the result on len) to an array that has a dtype of np.int64?
Note that array is of type np.uint8 - but I'm not assigning the array itself so this message makes no sense to me.
I attempted to refactor to this:
tmp = len(array) # <<=== FAILS HERE
array_positions[count] = tmp
But then it fails there... same message.
I also tried replacing len(array) by array.size since this is a 1d array, but same error.
Can anyone see why this is failing?
I'm on Python 3.7 and Numba 0.50
Thanks!

First issue is that you use an unsupported function mutate (as stated here).
Next, as the error says, you try to add arrays with different data types in
np.append(array, np.append(new_array, 255))
where the array is of type int64 and new_array holds values of type uint8. If you've used #jit it would have showed you a warning of falling to object mode, but because you've enforced the nonpython mode by using the #njit decorator, it throws an error.
Cheers.

Cython Optimization of Numpy for Loop

I am new to cython and have the following code for a numpy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the numpy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy
DTYPE = np.double
def hdcfTransfomation(scanData):
cdef Py_ssize_t Position
scanLength = scanData.shape[0]
hdcfFunction_np = np.zeros(scanLength, dtype = DTYPE)
cdef double [::1] hdcfFunction = hdcfFunction_np
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
I know that using C math library functions would be more ideal than calling back into the numpy language (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, ect. but nothing is providing the performance that I think is possible with a fully optimized implementation of Cython.
Here is a working example of the numpy for-loop:
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength)
for position in range(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:-(position + 1)]
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
hdcfFunction[position] = arrayMean
return hdcfFunction
scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransformed(scanDataArray)

Always provide as much informations as possible (some example data, Python/Cython Version, Compiler Version/Settings and CPU Model.
Without that it is quite hard to compare any timings. For example this problem benefits quite well from SIMD-vectorization. It will make quite a difference which compiler you use or if you want to redistribute a compiled version which should also run on low-end or quite old CPUS (eg. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C-Compiler needs additional flags like march=native, but the real syntax is compiler dependend. I am am also not sure how Cython or the C-compiler optimizes this part:
arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)
If that loops (all vectorized commands are actually loops) are not joined, but intead there are temporary arryas like in pure Python created, this will slow down the code. It will be a good idea to create a 1D array first. (eg. scanData=scanData[::1]
As said I am not that familliar with Cython, I tried what is possible with Numba. At least it shows what should also be possible with a resonable good Cython implementation.
Maybe easier to otimize for the compiler
import numba as nb
import numpy as np
#nb.njit(fastmath=True,error_model='numpy',parallel=True)
#scanData is a 1D-array here
def hdcfTransfomation(scanData):
scanLength = scanData.shape[0]
hdcfFunction = np.zeros(scanLength, dtype = scanData.dtype)
for position in nb.prange(scanLength - 1):
topShift = scanData[1 + position:]
bottomShift = scanData[:scanData.shape[0]-(position + 1)]
sum=0.
jj=0
for i in range(scanLength-(position + 1)):
jj+=1
sum+=(topShift[i]-bottomShift[i])**2
hdcfFunction[position] = sum/jj
return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba it doesn't matter if you use a slightly modified version of your code (1D-array), or the code above.
Timings
#Quadcore Core i7-4th gen,Numba 0.4dev,Python 3.6
scanData=np.random.rand(80_000)
#The first call to the function isn't measured (compilation overhead),but the following calls.
Pure Python: 5900ms
Numba single-threaded: 947ms
Numba parallel: 260ms
Especially for larger arrays than np.random.rand(80_000) there may be better aproaches (loop tilling for better cache usage), but for this size that should be more or less OK (At least it fits in the L3-cache)
Naive GPU Implementation
from numba import cuda, float32
#cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData,out_data):
scanLength = scanData.shape[0]
position = cuda.grid(1)
if position < scanLength - 1:
sum= float32(0.)
offset=1 + position
for i in range(scanLength-offset):
sum+=(scanData[i+offset]-scanData[i])**2
out_data[position] = sum/(scanLength-offset)
hdcfTransfomation_gpu[scanData.shape[0]//64,64](scanData,res_3)
This gives about 400ms on a GT640 (float32) and 970ms (float64). For a good implemenation shared arrays should be considered.

Putting cython aside, does this do the same thing as your current code but without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try apply operations in numpy to 2D arrays before turning to cython for for loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)

Using Theano.scan with shared variables

I want to calculate the sumproduct of two arrays in Theano. Both arrays are declared as shared variables and are the result of prior computations. Reading the tutorial, I found out how to use scan to compute what I want using 'normal' tensor arrays, but when I tried to adapt the code to shared arrays I got the error message TypeError: function() takes at least 1 argument (1 given). (See minimal running code example below)
Where is the mistake in my code? Where is my misconception? I am also open to a different approach for solving my problem.
Generally I would prefer a version which takes the shared variables directly, because in my understanding, converting the arrays first back to Numpy arrays and than again passing them to Theano, would be wasteful.
Error message producing sumproduct code using shared variables:
import theano
import theano.tensor as T
import numpy
a1 = [1,2,4]
a2 = [3,4,5]
Ta1_shared = theano.shared(numpy.array(a1))
Ta2_shared = theano.shared(numpy.array(a2))
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1_shared, Ta2_shared, prior_value:
prior_value + Ta1_shared * Ta2_shared,
outputs_info=outputs_info,
sequences=[Ta1_shared, Ta2_shared])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function(outputs=Tsumprod_result)
print Tsumprod()
Error message:
TypeError: function() takes at least 1 argument (1 given)
Working sumproduct code using non-shared variables:
import theano
import theano.tensor as T
import numpy
a1 = [1, 2, 4]
a2 = [3, 4, 5]
Ta1 = theano.tensor.vector("a1")
Ta2 = theano.tensor.vector("coefficients")
outputs_info = T.as_tensor_variable(numpy.asarray(0, 'float64'))
Tsumprod_result, updates = theano.scan(fn=lambda Ta1, Ta2, prior_value:
prior_value + Ta1 * Ta2,
outputs_info=outputs_info,
sequences=[Ta1, Ta2])
Tsumprod_result = Tsumprod_result[-1]
Tsumprod = theano.function(inputs=[Ta1, Ta2], outputs=Tsumprod_result)
print Tsumprod(a1, a2)

You need to change the compilation line to this one:
Tsumprod = theano.function([], outputs=Tsumprod_result)
theano.function() always need a list of inputs. If the function take 0 input, like in this case, you need to give an empty list for the inputs.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to do indexing properly in numba.cuda? - python

The problem has nothing to do with array indexing. Within the kernel, this line: a=[a1,a2,a3] is not supported. You cannot create a list within a #cuda.jit function. The exact list of supported Python types within kernels is fully documented here.

Related

How to use numba?

numba: overload of function

Why is Numba failing to compile this function?

Cython Optimization of Numpy for Loop

Using Theano.scan with shared variables

Categories

Resources