How to run NumPy code on the GPU with Numba - python

I've developed several programs that are heavy on matrix operations; they run quite well, but since I spent some extra money on a GPU, I'd like to take advantage of it... ;-) I've tried several configurations from the Numba manual, but clearly something is missing. Why do I get an error when I try to execute some functions on CUDA (NVIDIA GTX 1050)?
Whether with Numba or some other module, how can I execute a portion of code (one that needs heavy parallelization) on the GPU?
import numpy as np
from numba import types
from numba.extending import intrinsic
from numba import jit, cuda
from numba import vectorize

@jit
def calculate_portfolio_return(returns, weights):
    portfolio_return = np.sum(returns.mean()*weights)*252
    print("Expected Portfolio Return:", portfolio_return)

@jit
def calculate_portfolio_risk(returns, weights):
    portfolio_variance = np.sqrt(np.dot(weights.T, np.dot(returns.cov()*252, weights)))
    print("Expected Risk:", portfolio_variance)

@jit
def generate_portfolios(weights, returns):
    preturns = []
    pvariances = []
    for i in range(10000):
        # weights = np.random.random(len(stocks_portfolio))
        weights = np.random.random(data.shape[1])
        weights /= np.sum(weights)
        preturns.append(np.sum(returns.mean()*weights)*252)
        pvariances.append(np.sqrt(np.dot(weights.T, np.dot(returns.cov()*252, weights))))
    preturns = np.array(preturns)
    pvariances = np.array(pvariances)
    return preturns, pvariances
......
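For context, the pattern the Numba manual describes for running code on the GPU is a @cuda.jit kernel. Below is a minimal, hedged sketch, not the portfolio code above: pandas-style calls such as returns.mean() and returns.cov() cannot be compiled for the GPU, so only plain array math is shown. It assumes a working CUDA toolkit/driver and a supported NVIDIA GPU (such as a GTX 1050).
import math
import numpy as np
from numba import cuda

# Illustrative element-wise kernel: annualize daily values by scaling with 252.
@cuda.jit
def scale_kernel(x, out):
    i = cuda.grid(1)            # absolute index of this thread
    if i < x.size:              # guard: the grid may be larger than the array
        out[i] = x[i] * 252.0

x = np.random.random(1_000_000)
out = np.empty_like(x)
threadsperblock = 256
blockspergrid = math.ceil(x.size / threadsperblock)
scale_kernel[blockspergrid, threadsperblock](x, out)   # Numba copies x/out to and from the device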

Related

How to accelerate numpy array masking?

I am profiling performance of a piece of Python code, using a line profiler.
In the code, I have a numpy array tt of shape (106906,) and dtype=int64. With the help of the profiler, I find that the second line below, mask[tt] = True, is quite slow. Is there any way to accelerate it? I am on Python 3 if that matters.
mask = np.zeros(100000, dtype='bool')
mask[tt] = True
You can use Numba, as @orlevii has suggested:
import numpy as np
from numba import njit

@njit
def f(mask, tt):
    mask[tt] = True

# Test:
mask = np.zeros(1000000, dtype='bool')
tt = np.random.randint(0, 1000000, 106906)
f(mask, tt)
A simple %%timeit check suggests that you should expect roughly 3 times faster execution.
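For reference, a self-contained way to reproduce that check with the standard-library timeit module (a minimal sketch repeating the setup above; the %%timeit magic assumes an IPython/Jupyter session):
import timeit
import numpy as np
from numba import njit

@njit
def f(mask, tt):
    mask[tt] = True

def f_plain(mask, tt):
    mask[tt] = True

mask = np.zeros(1000000, dtype='bool')
tt = np.random.randint(0, 1000000, 106906)
f(mask, tt)  # warm-up call so compilation time is excluded from the timing

t_numba = timeit.timeit(lambda: f(mask, tt), number=1000)
t_numpy = timeit.timeit(lambda: f_plain(mask, tt), number=1000)
print(t_numpy / t_numba)  # speed-up factor (the answer above reports roughly 3x)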
Further speed-up can be achieved by utilizing the GPU. An example of how to do it with PyTorch:
import torch
mask = torch.zeros(1000000).type(torch.cuda.FloatTensor)
tt = torch.randint(0,1000000,torch.Size([106906])).type(torch.cuda.LongTensor)
mask[tt] = True
Note that here we use a torch.Tensor object, PyTorch's equivalent of numpy.ndarray. The code will run only if you have an NVIDIA GPU with CUDA. Expect roughly a 30x speed-up over your original code on a Tesla V100-SXM2.
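The same idea also works with Numba's own CUDA target instead of PyTorch; here is a minimal sketch (again, it needs an NVIDIA GPU with CUDA):
import numpy as np
from numba import cuda

@cuda.jit
def set_mask(mask, tt):
    i = cuda.grid(1)          # one thread per index in tt
    if i < tt.size:
        mask[tt[i]] = True

mask = cuda.to_device(np.zeros(1000000, dtype=np.bool_))
tt = cuda.to_device(np.random.randint(0, 1000000, 106906))
threadsperblock = 256
blockspergrid = (tt.size + threadsperblock - 1) // threadsperblock
set_mask[blockspergrid, threadsperblock](mask, tt)
result = mask.copy_to_host()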

How to reduce integration time for integration over 2D connected domains

I need to compute many 2D integrations over domains that are simply connected (and convex most of the time). I'm using the Python function scipy.integrate.nquad to do this integration. However, the time required by this operation is significantly larger than for integration over a rectangular domain. Is there any faster implementation possible?
Here is an example; I integrate a constant function first over a circular domain (using a constraint inside the function) and then on a rectangular domain (default domain of nquad function).
from scipy import integrate
import time

def circular(x, y, a):
    if x**2 + y**2 < a**2/4:
        return 1
    else:
        return 0

def rectangular(x, y, a):
    return 1

a = 4
start = time.time()
result = integrate.nquad(circular, [[-a/2, a/2], [-a/2, a/2]], args=(a,))
now = time.time()
print(now-start)

start = time.time()
result = integrate.nquad(rectangular, [[-a/2, a/2], [-a/2, a/2]], args=(a,))
now = time.time()
print(now-start)
The rectangular domain takes only 0.00029 seconds, while the circular domain requires 2.07061 seconds to complete.
Also the circular integration gives the following warning:
IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
**opt)
One way to make the calculation faster is to use numba, a just-in-time compiler for Python.
The @jit decorator
Numba provides a @jit decorator that compiles Python code into optimized machine code, which can run in parallel on several CPUs. Jitting the integrand function takes little effort and yields some time savings, since the compiled code runs faster. One doesn't even have to worry about types; Numba handles all of this under the hood.
from scipy import integrate
from numba import jit

@jit
def circular_jit(x, y, a):
    if x**2 + y**2 < a**2 / 4:
        return 1
    else:
        return 0

a = 4
result = integrate.nquad(circular_jit, [[-a/2, a/2], [-a/2, a/2]], args=(a,))
This runs indeed faster and when timing it on my machine, I get:
Original circular function: 1.599048376083374
Jitted circular function: 0.8280022144317627
That is a ~50% reduction of computation time.
Scipy's LowLevelCallable
Function calls in Python are quite time consuming due to the nature of the language. The overhead can sometimes make Python code slow in comparison to compiled languages like C.
In order to mitigate this, Scipy provides a LowLevelCallable class which can be used to provide access to a low-level compiled callback function. Through this mechanism, Python's function call overhead is bypassed and further time saving can be made.
Note that in the case of nquad, the signature of the cfunc passed to LowLevelCallable must be one of:
double func(int n, double *xx)
double func(int n, double *xx, void *user_data)
where the int is the number of arguments and the values for the arguments are in the second argument. user_data is used for callbacks that need context to operate.
We can therefore slightly change the circular function signature in Python to make it compatible.
from scipy import integrate, LowLevelCallable
from numba import cfunc
from numba.types import intc, CPointer, float64

@cfunc(float64(intc, CPointer(float64)))
def circular_cfunc(n, args):
    x, y, a = (args[0], args[1], args[2])  # cannot do `(args[i] for i in range(n))` as `yield` is not supported
    if x**2 + y**2 < a**2/4:
        return 1
    else:
        return 0

circular_LLC = LowLevelCallable(circular_cfunc.ctypes)

a = 4
result = integrate.nquad(circular_LLC, [[-a/2, a/2], [-a/2, a/2]], args=(a,))
With this method I get
LowLevelCallable circular function: 0.07962369918823242
This is a 95% reduction compared to the original and 90% when compared to the jitted version of the function.
A bespoke decorator
In order to make the code more tidy and to keep the integrand function's signature flexible, a bespoke decorator function can be created. It will jit the integrand function and wrap it into a LowLevelCallable object that can then be used with nquad.
from scipy import integrate, LowLevelCallable
from numba import cfunc, jit
from numba.types import intc, CPointer, float64

def jit_integrand_function(integrand_function):
    jitted_function = jit(integrand_function, nopython=True)

    @cfunc(float64(intc, CPointer(float64)))
    def wrapped(n, xx):
        return jitted_function(xx[0], xx[1], xx[2])
    return LowLevelCallable(wrapped.ctypes)

@jit_integrand_function
def circular(x, y, a):
    if x**2 + y**2 < a**2 / 4:
        return 1
    else:
        return 0

a = 4
result = integrate.nquad(circular, [[-a/2, a/2], [-a/2, a/2]], args=(a,))
Arbitrary number of arguments
If the number of arguments is unknown, then we can use the convenient carray function provided by Numba to convert the CPointer(float64) to a Numpy array.
import numpy as np
from scipy import integrate, LowLevelCallable
from numba import cfunc, carray, jit
from numba.types import intc, CPointer, float64

def jit_integrand_function(integrand_function):
    jitted_function = jit(integrand_function, nopython=True)

    @cfunc(float64(intc, CPointer(float64)))
    def wrapped(n, xx):
        ar = carray(xx, n)
        return jitted_function(ar[0], ar[1], ar[2:])
    return LowLevelCallable(wrapped.ctypes)

@jit_integrand_function
def circular(x, y, a):
    if x**2 + y**2 < a[-1]**2 / 4:
        return 1
    else:
        return 0

ar = np.array([1, 2, 3, 4])
a = ar[-1]
result = integrate.nquad(circular, [[-a/2, a/2], [-a/2, a/2]], args=ar)

Numba jit with scipy

So I wanted to speed up a program I wrote with the help of Numba's jit. However, jit seems not to be compatible with many SciPy functions because they use try ... except ... structures that jit cannot handle (am I right on this point?).
A relatively simple solution I came up with is to copy the SciPy source code I need and delete the try/except parts (I already know that it will not run into errors, so the try part will always work anyway).
However I do not like this solution and I am not sure if it will work.
My code structure looks like the following
import scipy.integrate as integrate
from scipy.optimize import curve_fit
from numba import jit

def fitfunction():
    ...

@jit
def function(x):
    # do some stuff
    try:
        fit_param, fit_cov = curve_fit(fitfunction, x, y, p0=(0,0,0), maxfev=500)
        for idx in some_list:
            integrated = integrate.quad(lambda x: fitfunction(fit_param), lower, upper)
    except:
        fit_param = (0,0,0)
    ...
Now this results in the following error:
LoweringError: Failed at object (object mode backend)
I assume this is due to jit not being able to handle try/except (it also does not work if I only put jit on the curve_fit and integrate.quad parts and keep my own try/except structure around them):
import scipy.integrate as integrate
from scipy.optimize import curve_fit
from numba import jit

def fitfunction():
    ...

@jit
def integral(lower, upper):
    return integrate.quad(lambda x: fitfunction(fit_param), lower, upper)

@jit
def fitting(x, y, pzero, max_fev):
    return curve_fit(fitfunction, x, y, p0=pzero, maxfev=max_fev)

def function(x):
    # do some stuff
    try:
        fit_param, fit_cov = fitting(x, y, (0,0,0), 500)
        for idx in some_list:
            integrated = integral(lower, upper)
    except:
        fit_param = (0,0,0)
    ...
Is there a way to use jit with scipy.integrate.quad and curve_fit without manually deleting all try except structures from the scipy code?
And would it even speed up the code?
Numba simply is not a general-purpose library to speed code up. There is a class of problems that can be solved in a much faster way with numba (especially if you have loops over arrays, number crunching) but everything else is either (1) not supported or (2) only slightly faster or even a lot slower.
[...] would it even speed up the code?
SciPy is already a high-performance library so in most cases I would expect numba to perform worse (or rarely: slightly better). You might do some profiling to find out if the bottleneck is really in the code that you jitted, then you could get some improvements. But I suspect the bottleneck will be in the compiled code of SciPy and that compiled code is probably already heavily optimized (so it's really unlikely that you find an implementation that could "only" compete with that code).
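For example, a minimal profiling sketch (the model function here is purely illustrative, not taken from the question): time the jitted piece on its own, separately from the SciPy call, to see whether it is worth optimizing at all.
import timeit
import numpy as np
from numba import njit

@njit
def model(x, a, b, c):
    # illustrative stand-in for the function handed to curve_fit / quad
    return a * np.exp(-b * x) + c

x = np.linspace(0.0, 4.0, 1000)
model(x, 2.5, 1.3, 0.5)  # warm-up call so compilation is not timed

per_call = timeit.timeit(lambda: model(x, 2.5, 1.3, 0.5), number=10000) / 10000
print(per_call)  # if this is tiny compared to the full fit, jitting it won't help much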
Is there a way to use jit with scipy.integrate.quad and curve_fit without manually deleting all try except structures from the scipy code?
As you correctly assumed try and except is simply not supported by numba at this time.
2.6.1. Language
2.6.1.1. Constructs
Numba strives to support as much of the Python language as possible, but some language features are not available inside Numba-compiled functions. The following Python language features are not currently supported:
[...]
Exception handling (try .. except, try .. finally)
So the answer here is No.
Nowadays, try and except do work with numba (a minimal illustration follows below). However, numba and scipy are still not compatible. Yes, SciPy calls compiled C and Fortran, but it does so in a way that numba can't deal with.
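For instance, this pattern (raising inside a jitted function and catching it with a bare except, the same pattern used further below) compiles in nopython mode with recent Numba versions; a minimal sketch:
import numpy as np
from numba import njit

@njit
def safe_sqrt(x):
    try:
        if x < 0.0:
            raise Exception
        return np.sqrt(x)
    except:
        return 0.0

print(safe_sqrt(4.0), safe_sqrt(-1.0))  # 2.0 0.0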
Fortunately there are alternatives to scipy that work well with numba! Below I use NumbaQuadpack and NumbaMinpack to do some curve fitting and integration similar to your example code. Disclaimer: I put together these packages. Below, I also give an equivalent implementation in scipy.
The SciPy implementation is ~18 times slower than the NumbaQuadpack/NumbaMinpack version.
Using Scipy alternatives (0.23 ms)
from NumbaQuadpack import quadpack_sig, dqags
from NumbaMinpack import minpack_sig, lmdif
import numpy as np
import numba as nb
import timeit

np.random.seed(0)
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.rand(100)

@nb.jit
def fitfunction(x, A, B):
    return A*np.sin(B*x)

@nb.cfunc(minpack_sig)
def fitfunction_optimize(u_, fvec, args_):
    u = nb.carray(u_, (2,))
    args = nb.carray(args_, (200,))
    A, B = u
    x = args[:100]
    y = args[100:]
    for i in range(100):
        fvec[i] = fitfunction(x[i], A, B) - y[i]
optimize_ptr = fitfunction_optimize.address

@nb.cfunc(quadpack_sig)
def fitfunction_integrate(x, data):
    A = data[0]
    B = data[1]
    return fitfunction(x, A, B)
integrate_ptr = fitfunction_integrate.address

@nb.njit
def fast_function():
    try:
        neqs = 100
        u_init = np.array([2.0, .8], np.float64)
        args = np.append(x, y)
        fitparam, fvec, success, info = lmdif(optimize_ptr, u_init, neqs, args)
        if not success:
            raise Exception

        lower = 0.0
        uppers = np.linspace(np.pi, np.pi*2.0, 200)
        solutions = np.empty(len(uppers))
        for i in range(len(uppers)):
            solutions[i], abserr, success = dqags(integrate_ptr, lower, uppers[i], data=fitparam)
            if not success:
                raise Exception
    except:
        print('doing something else')

fast_function()
iters = 1000
t_nb = timeit.Timer(fast_function).timeit(number=iters)/iters
print(t_nb)
Using Scipy (4.4 ms)
import scipy.integrate as integrate
from scipy.optimize import curve_fit
import numpy as np
import numba as nb
import timeit

np.random.seed(0)
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x) + np.random.rand(100)

@nb.jit
def fitfunction(x, A, B):
    return A*np.sin(B*x)

def function():
    try:
        p0 = (2.0, .8)
        fit_param, fit_cov = curve_fit(fitfunction, x, y, p0=p0, maxfev=500)

        lower = 0.0
        uppers = np.linspace(np.pi, np.pi*2.0, 200)
        solutions = np.empty(len(uppers))
        for i in range(len(uppers)):
            solutions[i], abserr = integrate.quad(fitfunction, lower, uppers[i], args=tuple(fit_param))
    except:
        print('do something else')

function()
iters = 1000
t_sp = timeit.Timer(function).timeit(number=iters)/iters
print(t_sp)

Cuda Parallelized Kernel Shared Counter Variable

Is there a way to have an integer counter variable that can be incremented/decremented across all threads in a parallelized CUDA kernel? The code below outputs "[1]" since modifications to the counter array made in one thread are not visible to the others.
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    counter[0] = counter[0] + 1

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
One approach would be to use numba cuda atomics:
$ cat t18.py
import numpy as np
from numba import cuda

@cuda.jit('void(int32[:])')
def func(counter):
    cuda.atomic.add(counter, 0, 1)

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
threadsperblock = 64
blockspergrid = 18
print(blockspergrid * threadsperblock)
func[blockspergrid, threadsperblock](counter)
print(counter.copy_to_host())
$ python t18.py
1152
[1152]
$
An atomic operation performs an indivisible read-modify-write operation on the target, so threads do not interfere with each other when they update the target variable.
Certainly other methods are possible, depending on your actual needs, such as a classical parallel reduction. numba provides some reduction sugar also.
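For example, Numba's @cuda.reduce decorator turns a binary combine function into a GPU reduction; a minimal sketch (requires a CUDA-capable GPU):
import numpy as np
from numba import cuda

@cuda.reduce
def sum_reduce(a, b):
    # binary combine step; Numba builds the full parallel reduction around it
    return a + b

arr = np.ones(1152, dtype=np.float64)
total = sum_reduce(arr)   # equivalent to arr.sum(), computed on the device
print(total)              # 1152.0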

Optimizing root finding algorithm from scipy

I use the root function from scipy.optimize with the method "excitingmixing" in my code because other methods, like standard Newton, don't converge to the roots I am looking for.
However I would like to optimize my code using numba, which doesn't support the scipy package. I tried to look up the "exciting mixing" algorithm in the documentation to program it myself:
https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.root.html
I didn't find anything useful except the not really helpful statement that the method "uses a tuned diagonal Jacobian approximation".
I would be glad if someone could tell me something about the algorithm, or has an idea of how to optimize the scipy function in another way.
As requested here is a minimal code example:
import numpy as np
from scipy import optimize
from numba import jit

@jit(nopython=True)
def func(x):
    [a, b, c, d] = x
    da = a*(1-b)
    db = b*(1-c)
    dc = c
    dd = 1
    return [da, db, dc, dd]

@jit(nopython=True)
def getRoot(x0):
    solution = optimize.root(func, x0, method="excitingmixing")
    return solution.x

root = getRoot([0.1, 0.1, 0.2, 0.4])
print(root)
You can look in the source code of scipy to see the implementation of the excitingmixing option:
https://github.com/scipy/scipy/blob/c948e96ebb3454f6a82e9d14021cc601d7ce7a85/scipy/optimize/nonlin.py#L1272
You're likely not going to want to reimplement the entire root finding algorithm in numba. A better strategy you can test is to use numba to optimize the function that you pass to the scipy method. You're still going to pay some overhead from scipy calling a Python function, but you might see a performance increase if the bottleneck is evaluating the function and that can be done faster with a numba-jitted version. I've found it best to just experiment with numba and test with timeit.
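A minimal sketch of that split: jit only the function handed to SciPy and let scipy.optimize.root drive the iterations. The toy system below is illustrative, not the one from the question, and whether excitingmixing converges still depends on the problem.
import numpy as np
from scipy import optimize
from numba import njit

@njit
def residual(x):
    # illustrative system: x0**2 = 30, x1**2 = 8
    return np.array([x[0]**2 - 30.0, x[1]**2 - 8.0])

residual(np.array([1.0, 1.0]))  # warm-up call so compilation happens outside any timing
sol = optimize.root(residual, np.array([10.0, 10.0]), method="excitingmixing")
print(sol.success, sol.x)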
I wrote a little wrapper around MINPACK, called NumbaMinpack, which can be called within numba-compiled functions: https://github.com/Nicholaswogan/NumbaMinpack.
You should try the lmdif method, if Newton's method is failing you.
from NumbaMinpack import lmdif, hybrd, minpack_sig
from numba import njit, cfunc
import numpy as np

@cfunc(minpack_sig)
def myfunc(x, fvec, args):
    fvec[0] = x[0]**2 - args[0]
    fvec[1] = x[1]**2 - args[1]

funcptr = myfunc.address  # pointer to myfunc

x_init = np.array([10.0, 10.0])  # initial conditions
neqs = 2  # number of equations
args = np.array([30.0, 8.0])  # data you want to pass to myfunc

@njit
def test():
    # solve with lmdif
    sol = lmdif(funcptr, x_init, neqs, args)
    # OR solve with hybrd
    sol = hybrd(funcptr, x_init, args)
    return sol

test()  # it works!
