I found that when the index into a NumPy array goes out of bounds inside a while loop in an njit-decorated function, the function handles the loop in a quite weird way, and I am not sure why it happens.
from numba import njit
import numpy as np

def func1(v):
    i = 0
    K = v[-1] + 1
    while v[i] < K:
        i += 1
    return i

@njit
def func2(v):
    i = 0
    K = v[-1] + 1
    while v[i] < K:
        i += 1
    return i

x = np.arange(2)
result2 = func2(x)
result1 = func1(x)
Here is a short summary of the results:
1) func2 won't raise an IndexError (func1 does).
2) func2 returns a different result every time the file is run in the console (sometimes it is 4; sometimes 5, 9, 12, etc., basically unstable output). I am using ipython version 7.8.0.
I am not sure why and how this happens (it could be due to Numba, Spyder, or ipython issues, or my CPU could be broken beyond repair), which is why I am asking for help here.
Note: I am using:
Anaconda's distribution of Python, Python version 3.7.4,
Spyder version 3.3.6,
ipython version 7.8.0,
Numba version 0.45.1,
OS: Windows 10 64-bit
Numba does not do bounds checking on NumPy arrays, for performance reasons. There is currently work to turn it on optionally (https://github.com/numba/numba/pull/4432). When you go outside the bounds of the array, you will get whatever happens to be in memory at that location, or possibly a segfault.
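Once that lands you can opt in per function. A minimal sketch, assuming a Numba version where the boundscheck flag from that PR is available:

from numba import njit
import numpy as np

@njit(boundscheck=True)  # opt-in bounds checking (assumes the flag from the PR above)
def func2(v):
    i = 0
    K = v[-1] + 1
    while v[i] < K:
        i += 1
    return i

func2(np.arange(2))  # raises IndexError instead of reading stray memory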
I've heard of Numba before, but never used it myself.
Here are the results of some messing around with it (version 0.45.1) just now.
from numba import njit
import numpy as np

x = np.arange(2)

@njit
def func2(v):
    i = 0
    k = v[-1] + 1
    while v[i] < k:
        i += 1
    return i

@njit
def func3(x):
    return x[2]

func2(x)               # returns 2, but no error raised
func3(x)               # returns 32, no error raised
func3(np.array([0]))   # returns 32, no error raised
func2([0, 1])          # IndexError raised
func3([0, 1])          # IndexError raised
So to me, the bug looks to be the result of some sort of interaction between Numba's JIT and NumPy arrays, since normal Python lists behave as expected.
Related
Below is my task code. In this case e0 = 15, but I would like to solve this problem for a set of e0 values (e0 = 7, 10, 15, 20, 28). I have a multi-core processor and I would like to distribute the calculations for each parameter e0 to a separate core.
How can I do these calculations in parallel in Python?
import sympy as sp
import scipy as sc
import numpy as np

e0 = 15
einf = 15

def Psi(r, n):
    return 2*np.exp(-r/n)*np.sqrt(sc.special.factorial(n)/sc.special.factorial(-1+n))*sc.special.hyp1f1(1-n, 2, 2*r/n)/n**2

def PsiSymb(n):
    r = sp.symbols('r')
    y1 = 2*sp.exp(-r/n)*np.sqrt(sc.special.factorial(n)/sc.special.factorial(-1+n))/n**2
    y2 = sp.simplify(sp.functions.special.hyper.hyper([1-n], [2], 2*r/n))
    y = y1*y2
    return y

def LaplacianPsi(n):
    r = sp.symbols('r')
    ydiff = 2/r*PsiSymb(n).diff(r) + PsiSymb(n).diff(r, 2)
    ydiffnum = sp.lambdify(r, ydiff, "numpy")
    return ydiffnum

def k(n1, n2):
    yint = sc.integrate.quad(lambda r: -0.5*Psi(r, n2)*LaplacianPsi(n1)(r)*r**2, 0, np.inf)
    return yint[0]

def p(n1, n2):
    potC = sc.integrate.quad(lambda r: Psi(r, n2)*(-1/r)*Psi(r, n1)*(r**2), 0, np.inf)
    potB1 = sc.integrate.quad(lambda r: Psi(r, n2)*(1/einf-1/e0)*((einf/e0)**(3/5))*(-e0/(2*r))*(np.exp(-r*2.23))*Psi(r, n1)*(r**2), 0, np.inf)
    potB2 = sc.integrate.quad(lambda r: Psi(r, n2)*(1/einf-1/e0)*((einf/e0)**(3/5))*(-e0/(2*r))*(np.exp(-r*2.4))*Psi(r, n1)*(r**2), 0, np.inf)
    pot = potC[0] + potB1[0] + potB2[0]
    return pot

def en(n1, n2):
    return k(n1, n2) + p(n1, n2)

nmax = 3
EnM = [[0]*nmax for i in range(nmax)]
for n1 in range(nmax):
    for n2 in range(nmax):
        EnM[n2][n1] = en(n1+1, n2+1)

EnEig = sc.linalg.eigvalsh(EnM)
EnB = min(EnEig)
print(EnB)
There is no need to use multiple cores for this computation. Indeed, the bottleneck is the LaplacianPsi function, which recomputes the same thing over and over. You can use memoization to fix this. Here is an example:
import functools

@functools.cache
def LaplacianPsi(n):
    r = sp.symbols('r')
    ydiff = 2/r*PsiSymb(n).diff(r) + PsiSymb(n).diff(r, 2)
    ydiffnum = sp.lambdify(r, ydiff, "numpy")
    return ydiffnum
# The rest is the same
The code can be further optimized: sc.special.factorial(n) / sc.special.factorial(-1+n) is actually just n, and np.sqrt is inefficient on scalars, so it should be replaced with math.sqrt(n). This results in code taking only 0.057 seconds, as opposed to 16.5 seconds for the initial implementation on my machine. The new implementation is thus about 290 times faster while producing the same result!
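For illustration, a sketch of Psi with those two simplifications applied (same signature as in the question):

import math

def Psi(r, n):
    # factorial(n) / factorial(n - 1) is just n; math.sqrt beats np.sqrt on scalars
    return 2*np.exp(-r/n)*math.sqrt(n)*sc.special.hyp1f1(1-n, 2, 2*r/n)/n**2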
Directly using many cores would just have wasted more resources for a slower result. You can still try to use more cores on top of the faster implementation provided here, though it might not be significantly faster.
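If you do want to distribute the e0 values over cores, here is a minimal sketch with multiprocessing (compute_EnB is a wrapper name introduced here for illustration; it assumes the functions above are defined at module level and that the EnM loop at the bottom of the script is moved under the __main__ guard):

from multiprocessing import Pool

def compute_EnB(e0_value):
    # each worker process rebinds the module-level e0 that p() reads
    global e0
    e0 = e0_value
    EnM = [[en(n1 + 1, n2 + 1) for n1 in range(nmax)] for n2 in range(nmax)]
    return min(sc.linalg.eigvalsh(EnM))

if __name__ == '__main__':
    with Pool(5) as pool:
        print(pool.map(compute_EnB, [7, 10, 15, 20, 28]))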
I'm trying out Numba, the Python package that is said to make my NumPy arrays super fast. I want to run my function in nopython mode. What it essentially does is take in a 20x20 array, assign random numbers to each of its elements, calculate its inverse matrix, then return it.
But here's the problem: when I initialize the array result with np.zeros(), my script crashes and gives me the error message 'Overload of function zeros'.
Could someone kindly tell me what is going on? Much appreciated.
from numba import njit
import time
import numpy as np
import random

arr = np.zeros((20, 20), dtype=float)

@njit
def aFunctionWithNumba(incomingArray):
    result = np.zeros(np.shape(incomingArray), dtype=float)
    for i in range(len(incomingArray[0])):
        for j in range(len(incomingArray[1])):
            incomingArray[i, j] = random.randrange(105150, 1541586)
    result = np.linalg.inv(incomingArray)
    return result

t0 = time.time()
fastArray = aFunctionWithNumba(arr)
t1 = time.time()
s1 = t1 - t0
Here's the full error message:
Exception has occurred: TypingError
Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<built-in function zeros>) found for signature:

>>> zeros(UniTuple(int64 x 2), dtype=Function(<class 'float'>))

There are 2 candidate implementations:
  - Of which 2 did not match due to:
    Overload of function 'zeros': File: numba\core\typing\npydecl.py: Line 511.
    With argument(s): '(UniTuple(int64 x 2), dtype=Function(<class 'float'>))': No match.

During: resolving callee type: Function(<built-in function zeros>)
During: typing of call at c:\Users\Eric\Desktop\testNumba.py (9)

File "testNumba.py", line 9:
def aFunctionWithNumba (incomingArray):
    result = np.zeros(np.shape(incomingArray), dtype = float)
    ^

File "C:\Users\Eric\Desktop\testNumba.py", line 25, in <module>
    fastArray = aFunctionWithNumba(arr)
The error
You should use Numpy or Numba types inside JITted functions.
Changing the following line, your code works:
result = np.zeros(np.shape(incomingArray), dtype=np.float64)
But your code will be more generic using:
result = np.zeros(incomingArray.shape, dtype=incomingArray.dtype)
Or, even better:
result = np.zeros_like(incomingArray)
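For completeness, here is a sketch of the question's function with just that line swapped in (loop bounds rewritten via shape; note the initial result is overwritten by np.linalg.inv anyway):

from numba import njit
import numpy as np
import random

@njit
def aFunctionWithNumba(incomingArray):
    result = np.zeros_like(incomingArray)  # shape and dtype taken from the argument
    for i in range(incomingArray.shape[0]):
        for j in range(incomingArray.shape[1]):
            incomingArray[i, j] = random.randrange(105150, 1541586)
    result = np.linalg.inv(incomingArray)
    return result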
The timing
The first time you call a JITted function it will take some time to compile it, much longer than the time it will take to execute it. So you should call the function with the same parameter types once before you make any timings.
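For example, a minimal pattern using the question's own names:

aFunctionWithNumba(arr)  # warm-up call: pays the compilation cost once

t0 = time.time()
fastArray = aFunctionWithNumba(arr)  # this call measures execution only
t1 = time.time()
s1 = t1 - t0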
Additional optimization
If you are interested in comparing the execution time of the nested loops with and without Numba, your code is fine. Otherwise you can replace the loops with something like:
incomingArray[:] = np.random.random(incomingArray.shape) * (1541586 - 105150) + 105150
I have a MATLAB function:

Bits = 30
NBits = ceil(fzero(@(x) 2^(x) - x - 1 - Bits, max(log2(Bits), 1)))
I want to convert it to Python, and I have written something like this so far:

from numpy import log, log2
from scipy.optimize import root_scalar

def func(x, Bits):
    return ((x)2^(x)-x-1-Bits, max(log2(Bits)))

However, it says that it needs to be (x)*2^. Does anybody know, first, whether the conversion from MATLAB to Python is correct, and second, whether the * has to be added?
Following a suggestion, I wrote this lambda function:

lambda x: (2^(x) -x -1 -Bits) , max(log2(Bits))

but I get this error:

TypeError: 'numpy.float64' object is not iterable
I don't have numpy or scipy on this computer, so here is my best attempt at an answer.

import math
from scipy.optimize import root_scalar

def YourFunc(Bits):
    # root_scalar returns a RootResults object, so take its .root attribute;
    # x1 gives the secant method a second starting point (root_scalar needs
    # either a bracket, a derivative, or two starting points)
    sol = root_scalar(lambda x: 2**x - x - 1 - Bits,
                      x0=max(log2(Bits), 1), x1=Bits)
    return math.ceil(sol.root)

Bits = 30
NBits = YourFunc(30)
print(NBits)
I used this function for log2 rather than the one from numpy. Try it:

def log2(x):
    return math.log(x, 2)
I am new to Cython and have the following code for a NumPy for loop that I am trying to optimize. So far, this Cython code isn't much faster than the NumPy for loop.
# cython: infer_types = True
import numpy as np
cimport numpy

DTYPE = np.double

def hdcfTransfomation(scanData):
    cdef Py_ssize_t position
    scanLength = scanData.shape[0]
    hdcfFunction_np = np.zeros(scanLength, dtype=DTYPE)
    cdef double [::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis=0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction
I know that using C math library functions would be more ideal than calling back into the numpy language (subtract, square, mean), but I am not sure where I can find a list of functions that can be called in this manner.
I have been trying to figure out ways to optimize this code by using different types, etc., but nothing is providing the performance that I think is possible with a fully optimized Cython implementation.
Here is a working example of the NumPy for loop:

def hdcfTransfomation(scanData):
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength)
    for position in range(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:-(position + 1)]
        arrayDiff = np.subtract(topShift, bottomShift)
        arraySquared = np.square(arrayDiff)
        arrayMean = np.mean(arraySquared, axis=0)
        hdcfFunction[position] = arrayMean
    return hdcfFunction

scanDataArray = np.random.rand(80000, 1)
transformedScan = hdcfTransfomation(scanDataArray)
Always provide as much information as possible (some example data, Python/Cython version, compiler version/settings, and CPU model). Without that it is quite hard to compare any timings. For example, this problem benefits quite well from SIMD vectorization, so it will make quite a difference which compiler you use, and whether you want to redistribute a compiled version that should also run on low-end or quite old CPUs (e.g. no AVX).
I am not very familiar with Cython, but I think your main problem is the missing declaration for scanData. Maybe the C compiler needs additional flags like march=native, but the exact syntax is compiler dependent. I am also not sure how Cython or the C compiler optimizes this part:

arrayDiff = np.subtract(topShift, bottomShift)
arraySquared = np.square(arrayDiff)
arrayMean = np.mean(arraySquared, axis = 0)

If those loops (all vectorized commands are actually loops) are not joined, and temporary arrays are created instead, as in pure Python, this will slow down the code. It is also a good idea to create a contiguous 1D array first (e.g. scanData = scanData[::1]).
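For what it's worth, here is a hedged sketch of a typed Cython version along those lines (explicit loops instead of temporary arrays; it assumes a C-contiguous double input and is untested):

# cython: boundscheck=False, wraparound=False
import numpy as np

def hdcfTransfomation(double[::1] scanData):
    cdef Py_ssize_t scanLength = scanData.shape[0]
    cdef Py_ssize_t position, i, n
    cdef double acc
    hdcfFunction_np = np.zeros(scanLength)
    cdef double[::1] hdcfFunction = hdcfFunction_np
    for position in range(scanLength - 1):
        acc = 0.0
        n = scanLength - (position + 1)
        for i in range(n):
            # same as (topShift[i] - bottomShift[i])**2, without the temporaries
            acc += (scanData[i + position + 1] - scanData[i])**2
        hdcfFunction[position] = acc / n
    return hdcfFunction_np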
As I said, I am not that familiar with Cython, so I tried what is possible with Numba instead. At least it shows what should also be possible with a reasonably good Cython implementation, and it may be easier for the compiler to optimize:
import numba as nb
import numpy as np

@nb.njit(fastmath=True, error_model='numpy', parallel=True)
def hdcfTransfomation(scanData):
    # scanData is a 1D array here
    scanLength = scanData.shape[0]
    hdcfFunction = np.zeros(scanLength, dtype=scanData.dtype)
    for position in nb.prange(scanLength - 1):
        topShift = scanData[1 + position:]
        bottomShift = scanData[:scanData.shape[0] - (position + 1)]
        sum = 0.
        jj = 0
        for i in range(scanLength - (position + 1)):
            jj += 1
            sum += (topShift[i] - bottomShift[i])**2
        hdcfFunction[position] = sum / jj
    return hdcfFunction
I also used parallelization here, because the problem is embarrassingly parallel. At least with a size of 80_000 and Numba, it doesn't matter whether you use a slightly modified version of your code (1D array) or the code above.
Timings
# Quad-core Core i7 4th gen, Numba 0.4dev, Python 3.6
scanData = np.random.rand(80_000)
# The first call to the function isn't measured (compilation overhead), only the following calls.

Pure Python:           5900 ms
Numba single-threaded:  947 ms
Numba parallel:         260 ms
Especially for arrays larger than np.random.rand(80_000) there may be better approaches (loop tiling for better cache usage), but for this size it should be more or less OK (at least it fits in the L3 cache).
Naive GPU Implementation
from numba import cuda, float32

@cuda.jit('void(float32[:], float32[:])')
def hdcfTransfomation_gpu(scanData, out_data):
    scanLength = scanData.shape[0]
    position = cuda.grid(1)
    if position < scanLength - 1:
        sum = float32(0.)
        offset = 1 + position
        for i in range(scanLength - offset):
            sum += (scanData[i + offset] - scanData[i])**2
        out_data[position] = sum / (scanLength - offset)

# res_3 is a preallocated device output array of the same length as scanData
hdcfTransfomation_gpu[scanData.shape[0] // 64, 64](scanData, res_3)
This gives about 400 ms on a GT640 (float32) and 970 ms (float64). For a good implementation, shared arrays should be considered.
Putting Cython aside, does this do the same thing as your current code, just without a for loop? We can tighten it up and correct for inaccuracies, but the first port of call is to try applying operations to 2D NumPy arrays before turning to Cython for explicit loops. It's too long to put in a comment.
import numpy as np
# Setup
arr = np.random.choice(np.arange(10), 100).reshape(10, 10)
top_shift = arr[:, :-1]
bottom_shift = arr[:, 1:]
arr_diff = top_shift - bottom_shift
arr_squared = np.square(arr_diff)
arr_mean = arr_squared.mean(axis=1)
I am writing code which needs to do some indexing in Python using Numba. However, I cannot get it to work correctly; it seems something is prohibited. The code is as follows:
from numba import cuda
import numpy as np

@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    if i_p < size and i_k1 < size and i_k2 < size:
        a1 = i_p**2 + i_k1
        a2 = i_p**2 + i_k2
        a3 = i_k1**2 + i_k2**2
        a = [a1, a2, a3]
        for i in range(len(random_array)):
            output[i_p, i_k1, i_k2, i] = a[int(random_array[i])]

output = cuda.device_array((10, 10, 10, 5))
random_array = cuda.to_device(np.array([np.random.random()*3 for i in range(5)]))
size = 10

threadsperblock = (8, 8, 8)
blockspergridx = (size + (threadsperblock[0] - 1)) // threadsperblock[0]
blockspergrid = (blockspergridx, blockspergridx, blockspergridx)

# Start the kernel
function[blockspergrid, threadsperblock](output, size, random_array)
print(output.copy_to_host())
It yields an error:

LoweringError: Failed at nopython (nopython mode backend)
'CUDATargetContext' object has no attribute 'build_list'

File "<ipython-input-57-6058e2bfe8b9>", line 10
[1] During: lowering "$40.21 = build_list(items=[Var(a1, <ipython-input-57-6058e2bfe8b9> (7)), Var(a2, <ipython-input-57-6058e2bfe8b9> (8)), Var(a3, <ipython-input-57-6058e2bfe8b9> (9))])" at <ipython-input-57-6058e2bfe8b9> (10)
Can anyone help me with this?
One option is to also feed a in as an input to the function, but when a is really large, like some 1000*1000*1000*7 array, it always gives me an out-of-memory error.
The problem has nothing to do with array indexing. Within the kernel, this line:

a = [a1, a2, a3]

is not supported. You cannot create a list within a @cuda.jit function. The exact list of supported Python types within kernels is fully documented here.
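As a minimal sketch of one possible workaround (untested on your setup): replace the list with a per-thread local array via cuda.local.array, which is supported inside kernels:

from numba import cuda, float64
import numpy as np

@cuda.jit
def function(output, size, random_array):
    i_p, i_k1, i_k2 = cuda.grid(3)
    if i_p < size and i_k1 < size and i_k2 < size:
        # each thread gets its own small scratch buffer instead of a Python list
        a = cuda.local.array(3, float64)
        a[0] = i_p**2 + i_k1
        a[1] = i_p**2 + i_k2
        a[2] = i_k1**2 + i_k2**2
        for i in range(random_array.shape[0]):
            output[i_p, i_k1, i_k2, i] = a[int(random_array[i])]

Since the three values are computed per thread, nothing large has to be passed in, which also sidesteps the out-of-memory problem you mention.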