I am trying to further speed up some code written in python, compiled using Numba. When looking at the assembly generated by numba, I noticed double-precision operations being generated, which I felt was odd since the inputs and outputs are all supposed to be float32.
I declare the variable/array types as float32 outside of the jitted loop and pass them into the function. Strangely, I find that after running my tests, the variable "scalarout" has been converted to a Python float, which is actually a 64-bit value.
My code:
from scipy import ndimage, misc
import matplotlib.pyplot as plt
import numpy.fft
from timeit import default_timer as timer
import numba
# numba.config.DUMP_ASSEMBLY = 1
from numba import float32
from numba import jit, njit, prange
from numba import cuda
import numpy as np
import scipy as sp
# import llvmlite.binding as llvm
# llvm.set_option('', '--debug-only=loop-vectorize')
@njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen, scalarout):
    scalarout = np.float32(0.0)
    for y in prange(ylen):
        for x in prange(xlen):
            scalarout += a[y, x] * b[y, x]
    return scalarout
# ======================================== TESTS ========================================
print()
xlen = 100000
ylen = 16
a = np.random.rand(ylen, xlen).astype(np.float32)
b = np.random.rand(ylen, xlen).astype(np.float32)
print("a type = ", type(a[1,1]))
scalarout = (np.float32)(0.0)
print("scalarout type, before execution = ", type(scalarout))
iters=1000
time = 100.0
for n in range(iters):
    start = timer()
    scalarout = mydot(a, b, xlen, ylen, scalarout)
    end = timer()
    if end - start < time:
        time = end - start
print("Numba njit function time, in us = %16.10f" % ((end-start)*10**6))
print("function output = %f" % scalarout)
print("scalarout type, after execution = ", type(scalarout))
This is more of an extended comment than an answer. If you change the scalarout to be a float32 array of length 1 and modify that, your output is float32.
@njit(fastmath=True, parallel=False)
def mydot(a, b, xlen, ylen):
    scalarout = np.array([0.0], dtype=np.float32)
    for y in prange(ylen):
        for x in prange(xlen):
            scalarout[0] += a[y, x] * b[y, x]
    return scalarout
If you change return scalarout to return scalarout[0], then the output is again a python float.
In your original code for mydot, the result is a python float even if you write return np.float32(scalarout).
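If the goal is just to keep the arithmetic itself in single precision, one thing worth trying is pinning the types with an explicit Numba signature. A minimal sketch, assuming the same array layout as above (hypothetical name mydot32; the value handed back to Python may still be boxed as a 64-bit Python float, but that is separate from the precision used inside the compiled loop):

import numpy as np
from numba import njit, float32, int64

# Explicit signature: float32 return value, float32 2-D inputs, int64 loop bounds.
@njit(float32(float32[:, :], float32[:, :], int64, int64), fastmath=True)
def mydot32(a, b, xlen, ylen):
    acc = np.float32(0.0)  # accumulator stays float32 inside the compiled code
    for y in range(ylen):
        for x in range(xlen):
            acc += a[y, x] * b[y, x]
    return acc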
I have optimized my calculation of the Mandelbrot set a bit, and I now wish to be able to specify whether my arrays should be float64 or float32, instead of the easier implementation with type complex128 or complex64. I use the fact that for a complex number (a+jb)^2 = a^2-b^2 + (2ab)j, but this seems to give me a slightly different, wrong Mandelbrot set. The code is seen below:
from timeit import default_timer as timer
import numpy as np
from numexpr import evaluate
import matplotlib.pyplot as plt
#%% Inputs
N = 5000
I = 20
T = 2  # Threshold
#%% Functions
def mandel_brot_vector(I, C, T, datatype):
    Cre = np.array(C.real, dtype=datatype)
    Cim = np.array(C.imag, dtype=datatype)
    M = np.zeros(Cre.shape, dtype=datatype)
    zreal = 0
    zimag = 0
    for i in range(I):
        M[zreal*zreal + zimag*zimag < T**2] = i/I
        zreal = evaluate("zreal*zreal-zimag*zimag+Cre")  # complex multiplication rule
        zimag = evaluate("2*zreal*zimag+Cim")  # complex multiplication rule
    N = len(M[0])
    M = np.reshape(np.array(M), (N//2, N)).astype(datatype)
    M = np.concatenate((M, M[::-1]), axis=0)
    return M
def create_C(N, split):
    C_re = np.linspace(np.full((1, N), -2)[0], np.full((1, N), 1)[0], N).T
    C_im = np.linspace(np.full((1, N), 1.5*1j)[0], np.full((1, N), -1.5*1j)[0], N)
    C = C_re + C_im
    C = C[:N//2, :]
    if split != 0:
        C_split = np.array_split(C, split)
    else:
        C_split = C
    return np.array(C_split)
C = create_C(N, 0)
t0_32 = timer()
M32 = mandel_brot_vector(I,C,T,np.float32)
t_32 = timer() - t0_32
t0_64 = timer()
M64 = mandel_brot_vector(I,C,T,np.float64)
t_64 = timer() - t0_64
plt.matshow(M64,cmap="hot")
print(" "*10,f"N={N}")
print(f"{'Float 32':<20}{t_32:<40}",
f"\n{'Float 64':<20}{t_64:<40}"
)
Currently the image I get: wrong mandelbrot. For reference, the following function will produce the correct mandelbrot set but with complex128:
def mandel_brot(I, C, T):
    M = np.zeros(C.shape)
    z = 0
    for i in range(I):
        M[np.abs(z) < T] = i/I
        z = evaluate("z*z+C")
    N = len(M[0])
    M = np.reshape(np.array(M), (N//2, N)).astype(datatype)
    M = np.concatenate((M, M[::-1]), axis=0)
    return M
Hope someone can help solve this issue, thanks in advance. Btw do not bother with the split of the C array, it is set up to run with multiprocessing which I am not using in the code attached.
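For reference, the componentwise form of z <- z*z + C implied by the identity above needs the old real part kept in a temporary, otherwise the imaginary update sees the already-overwritten value. A sketch (untested), reusing zreal, zimag, Cre and Cim from the loop in mandel_brot_vector:

from numexpr import evaluate

# keep the previous real part before overwriting it
zreal_new = evaluate("zreal*zreal - zimag*zimag + Cre")
zimag = evaluate("2*zreal*zimag + Cim")
zreal = zreal_new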
I am trying to solve a lot of linear equations as fast as possible. To find out the fastest way I benchmarked NumPy and PyTorch, each on the CPU and on my GeForce 1080 GPU (using Numba for NumPy). The results really confused me.
This is the code I used with Python 3.8:
import timeit
import torch
import numpy
from numba import njit
def solve_numpy_cpu(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_numpy_njit_a(dim: int = 5):
    njit(solve_numpy_cpu, dim=dim)

@njit
def solve_numpy_njit_b(dim: int = 5):
    a = numpy.random.rand(dim, dim)
    b = numpy.random.rand(dim)
    for _ in range(1000):
        numpy.linalg.solve(a, b)

def solve_torch_cpu(dim: int = 5):
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    for _ in range(1000):
        torch.solve(b, a)

def solve_torch_gpu(dim: int = 5):
    torch.set_default_tensor_type("torch.cuda.FloatTensor")
    solve_torch_cpu(dim=dim)

def main():
    for f in (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu, solve_numpy_njit_a, solve_numpy_njit_b):
        time = timeit.timeit(f, number=1)
        print(f"{f.__name__:<20s}: {time:f}")

if __name__ == "__main__":
    main()
And these are the results:
solve_numpy_cpu : 0.007275
solve_torch_cpu : 0.012244
solve_torch_gpu : 5.239126
solve_numpy_njit_a : 0.000158
solve_numpy_njit_b : 1.273660
The slowest is CUDA accelerated PyTorch. I verified that PyTorch is using my GPU with
import torch
torch.cuda.is_available()
torch.cuda.get_device_name(0)
returning
True
'GeForce GTX 1080'
I can accept that, on the CPU, PyTorch is slower than NumPy. What I cannot understand is why PyTorch on the GPU is so much slower. Not that important, but actually even more confusing, is that Numba's njit decorator makes performance orders of magnitude slower, until you stop using the @ decorator syntax.
Is it my setup? Occasionally I get a weird message about the Windows page/swap file not being big enough. In case I've taken a completely obscure path to solving linear equations on the GPU, I'd be happy to be pointed in another direction.
Edit
So, I focused on Numba and changed my benchmarking a bit. As suggested by @max9111, I rewrote the functions to receive input and produce output because, in the end, that's what anyone would want to use them for. Now I also perform a first compile run for the Numba-accelerated function so the subsequent timing is fairer. Finally, I checked the performance against matrix size and plotted the results.
TL/DR: Up to matrix sizes of 500x500, Numba acceleration doesn't really make a difference for numpy.linalg.solve (presumably because both versions end up calling into the same LAPACK routine).
Here is the code:
import time
from typing import Tuple
import numpy
from matplotlib import pyplot
from numba import jit
@jit(nopython=True)
def solve_numpy_njit(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def main():
    a, b = get_data(10)
    # compile numba function
    p = solve_numpy_njit(a, b)

    matrix_size = [(x + 1) * 10 for x in range(50)]
    non_accelerated = []
    accelerated = []
    results = non_accelerated, accelerated

    for j, each_matrix_size in enumerate(matrix_size):
        for m, f in enumerate((solve_numpy, solve_numpy_njit)):
            average_time = -1.
            for k in range(5):
                time_start = time.time()
                for i in range(100):
                    a, b = get_data(each_matrix_size)
                    p = f(a, b)
                d_t = time.time() - time_start
                print(f"{each_matrix_size:d} {f.__name__:<30s}: {d_t:f}")
                average_time = (average_time * k + d_t) / (k + 1)
            results[m].append(average_time)

    pyplot.plot(matrix_size, non_accelerated, label="not numba")
    pyplot.plot(matrix_size, accelerated, label="numba")
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And these are the results (runtime against matrix edge length):
Edit 2
Seeing that Numba doesn't make much of a difference in my case, I came back to benchmarking PyTorch. And indeed, it appears to be roughly 4x faster than Numpy without even using a CUDA device.
Here is the code I used:
import time
from typing import Tuple
import numpy
import torch
from matplotlib import pyplot
def solve_numpy(a: numpy.ndarray, b: numpy.ndarray) -> numpy.ndarray:
    parameters = numpy.linalg.solve(a, b)
    return parameters

def get_data(dim: int) -> Tuple[numpy.ndarray, numpy.ndarray]:
    a = numpy.random.random((dim, dim))
    b = numpy.random.random(dim)
    return a, b

def get_data_torch(dim: int) -> Tuple[torch.tensor, torch.tensor]:
    a = torch.rand(dim, dim)
    b = torch.rand(dim, 1)
    return a, b

def solve_torch(a: torch.tensor, b: torch.tensor) -> torch.tensor:
    parameters, _ = torch.solve(b, a)
    return parameters

def experiment_numpy(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data(matrix_size)
        p = solve_numpy(a, b)

def experiment_pytorch(matrix_size: int, repetitions: int = 100):
    for i in range(repetitions):
        a, b = get_data_torch(matrix_size)
        p = solve_torch(a, b)

def main():
    matrix_size = [x for x in range(5, 505, 5)]
    experiments = experiment_numpy, experiment_pytorch
    results = tuple([] for _ in experiments)

    for i, each_experiment in enumerate(experiments):
        for j, each_matrix_size in enumerate(matrix_size):
            time_start = time.time()
            each_experiment(each_matrix_size, repetitions=100)
            d_t = time.time() - time_start
            print(f"{each_matrix_size:d} {each_experiment.__name__:<30s}: {d_t:f}")
            results[i].append(d_t)

    for each_experiment, each_result in zip(experiments, results):
        pyplot.plot(matrix_size, each_result, label=each_experiment.__name__)
    pyplot.legend()
    pyplot.show()

if __name__ == "__main__":
    main()
And here's the result (runtime against matrix edge length):
So for now, I'll be sticking with torch.solve. However, the original question remains:
How can I exploit my GPU to solve linear equations even faster?
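One direction that is sometimes suggested for this kind of workload is batching: a single 5x5 solve is far too small to keep a GPU busy, so launch overhead dominates. A hedged sketch, assuming a recent PyTorch where torch.linalg.solve accepts batched inputs and a CUDA device is available (not benchmarked here):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, dim = 1000, 5

# One batched call instead of 1000 tiny ones: a stack of matrices and matching
# right-hand sides, all resident on the GPU.
a = torch.rand(batch, dim, dim, device=device)
b = torch.rand(batch, dim, 1, device=device)

x = torch.linalg.solve(a, b)   # solves all systems in a single call

if device == "cuda":
    torch.cuda.synchronize()   # GPU work is asynchronous; wait before timing/printing
print(x.shape)                 # torch.Size([1000, 5, 1])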
I just wrote some code to compare the speed of calculation for a function written with plain NumPy and a function which uses ufuncify from SymPy:
import numpy as np
from sympy import symbols, Matrix
from sympy.utilities.autowrap import ufuncify
u,v,e,a1,a0 = symbols('u v e a1 a0')
dudt = u-u**3-v
dvdt = e*(u-a1*v-a0)
p = {'a1':0.5,'a0':1.5,'e':0.1}
eqs = Matrix([dudt,dvdt])
numeqs=eqs.subs([(a1,p['a1']),(a0,p['a0']),(e,p['e'])])
print eqs
print numeqs
dudt = ufuncify([u,v],numeqs[0])
dvdt = ufuncify([u,v],numeqs[1])
def syrhs(u, v):
    return dudt(u, v), dvdt(u, v)

def nprhs(u, v, p):
    dudt = u - u**3 - v
    dvdt = p['e']*(u - p['a1']*v - p['a0'])
    return dudt, dvdt
def compare(n=10000):
    import time
    timer_np = 0
    timer_sy = 0
    error = np.zeros(n)
    for i in range(n):
        u = np.random.random((128, 128))
        v = np.random.random((128, 128))
        start_time = time.time()
        npcalc = np.ravel(nprhs(u, v, p))
        mid_time = time.time()
        sycalc = np.ravel(syrhs(u, v))
        end_time = time.time()
        timer_np += (mid_time - start_time)
        timer_sy += (end_time - mid_time)
        error[i] = np.max(np.abs(npcalc - sycalc))
    print "Max difference is ", np.max(error), ", and mean difference is ", np.mean(error)
    print "Average speed for numpy ", timer_np/float(n)
    print "Average speed for sympy ", timer_sy/float(n)
On my machine the result is:
In [21]: compare()
Max difference is 5.55111512313e-17 , and mean difference is 5.55111512313e-17
Average speed for numpy 0.00128133814335
Average speed for sympy 0.00127074036598
Any suggestions on how to make either of the above functions faster are welcome!
After further exploration it seems that ufuncify and regular numpy functions give more or less the same speed of computation. Using numba or SymPy's Theano printing did not result in faster code. So the other option to make things faster is either Cython or wrapping C or Fortran code.
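For reference, ufuncify can also be pointed at its Cython (or f2py) backend instead of the default one, which is one way to try the C/Fortran route without hand-writing wrappers. A sketch reusing u, v and numeqs from the code above (assumes Cython and a C compiler are installed; not benchmarked here):

# Hypothetical variant of the wrappers above, generated through Cython;
# backend='f2py' would route through Fortran instead.
dudt_cy = ufuncify([u, v], numeqs[0], backend='cython')
dvdt_cy = ufuncify([u, v], numeqs[1], backend='cython')

def syrhs_cy(u, v):
    return dudt_cy(u, v), dvdt_cy(u, v)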
So while evaluating possibilities to speed up Python code I came across this Stack Overflow post: Comparing Python, Numpy, Numba and C++ for matrix multiplication
I was quite impressed with numba's performance and implemented some of our functions in numba. Unfortunately the speedup was only there for very small matrices, and for large matrices the code became very slow compared to the previous scipy sparse implementation. I thought this made sense, but nevertheless I repeated the test from the original post (code below).
When using a 1000 x 1000 matrix, according to that post even the python implementation should take roughly 0.01 s. Here are my results though:
python : 769.6387 seconds
numpy : 0.0660 seconds
numba : 3.0779 seconds
scipy : 0.0030 seconds
What am I doing wrong to get such different results from the original post? I copied the functions and did not change anything. I tried both Python 3.5.1 (64 bit) and Python 2.7.10 (32 bit), and a colleague tried the same code with the same results. This is the result for a 100x100 matrix:
python : 0.6916 seconds
numpy : 0.0035 seconds
numba : 0.0015 seconds
scipy : 0.0035 seconds
Did I make some obvious mistakes?
import numpy as np
import numba as nb
import scipy.sparse
import time
class benchmark(object):
    def __init__(self, name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self, ty, val, tb):
        end = time.time()
        print("%s : %0.4f seconds" % (self.name, end-self.start))
        return False

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]
    C = np.zeros((m, p))
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i, j] += A[i, k] * B[k, j]
    return C

def dot_np(A, B):
    C = np.dot(A, B)
    return C

def dot_scipy(A, B):
    C = A * B
    return C

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)
dim_x = 1000
dim_y = 1000
a = scipy.sparse.rand(dim_x, dim_y, density=0.01)
b = scipy.sparse.rand(dim_x, dim_y, density=0.01)
a_full = a.toarray()
b_full = b.toarray()
print("starting test")
with benchmark("python"):
dot_py(a_full, b_full)
with benchmark("numpy"):
dot_np(a_full, b_full)
with benchmark("numba"):
dot_nb(a_full, b_full)
with benchmark("scipy"):
dot_scipy(a, b)
print("finishing test")
Edit:
For anyone seeing this at a later time, these are the results I got when using sparse n x n matrices (1% of elements are nonzero).
In the linked Stack Overflow question where you got the code from, m = n = 3 and p is variable, whereas you are using m = n = p = 1000. The pure-Python triple loop does on the order of m*n*p multiply-adds, so that difference in size makes a huge difference in the timings.
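To put rough numbers on that (illustrative values only, since the linked post varies p), compare the inner-loop iteration counts:

small = 3 * 3 * 1000            # roughly the linked post's shapes, with p varied
large = 1000 * 1000 * 1000      # the 1000 x 1000 case benchmarked here
print(large // small)           # ~111,111 times more work for the pure-Python loop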
I am trying to set up a 3D loop with the assignment
C(i,j,k) = A(i,j,k) + B(i,j,k)
using Python on my GPU. This is my GPU:
http://www.geforce.com/hardware/desktop-gpus/geforce-gt-520/specifications
The sources I'm looking at / comparing with are:
http://nbviewer.ipython.org/gist/harrism/f5707335f40af9463c43
http://nbviewer.ipython.org/github/ContinuumIO/numbapro-examples/blob/master/webinars/2014_06_17/intro_to_gpu_python.ipynb
It's possible that I've imported more modules than necessary. This is my code:
import numpy as np
import numbapro
import numba
import math
from timeit import default_timer as timer
from numbapro import cuda
from numba import *
@autojit
def myAdd(a, b):
    return a + b

myAdd_gpu = cuda.jit(restype=f8, argtypes=[f8, f8], device=True)(myAdd)

@cuda.jit(argtypes=[float32[:,:,:], float32[:,:,:], float32[:,:,:]])
def myAdd_kernel(a, b, c):
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    tz = cuda.threadIdx.z
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    bz = cuda.blockIdx.z
    bw = cuda.blockDim.x
    bh = cuda.blockDim.y
    bd = cuda.blockDim.z
    i = tx + bx * bw
    j = ty + by * bh
    k = tz + bz * bd
    if i >= c.shape[0]:
        return
    if j >= c.shape[1]:
        return
    if k >= c.shape[2]:
        return
    for i in xrange(0, c.shape[0]):
        for j in xrange(0, c.shape[1]):
            for k in xrange(0, c.shape[2]):
                # c[i,j,k] = a[i,j,k] + b[i,j,k]
                c[i,j,k] = myAdd_gpu(a[i,j,k], b[i,j,k])
def main():
    my_gpu = numba.cuda.get_current_device()
    print "Running on GPU:", my_gpu.name
    cores_per_capability = {1: 8, 2: 32, 3: 192}
    cc = my_gpu.compute_capability
    print "Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)"
    majorcc = cc[0]
    print "Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT
    cores_per_multiprocessor = cores_per_capability[majorcc]
    print "Number of cores per multiprocessor:", cores_per_multiprocessor
    total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
    print "Number of cores on GPU:", total_cores

    N = 100
    thread_ct = my_gpu.WARP_SIZE
    block_ct = int(math.ceil(float(N) / thread_ct))
    print "Threads per block:", thread_ct
    print "Block per grid:", block_ct

    a = np.ones((N, N, N), dtype=np.float32)
    b = np.ones((N, N, N), dtype=np.float32)
    c = np.zeros((N, N, N), dtype=np.float32)

    start = timer()
    cg = cuda.to_device(c)
    myAdd_kernel[block_ct, thread_ct](a, b, cg)
    cg.to_host()
    dt = timer() - start

    print "Wall clock time with GPU in %f s" % dt
    print 'c[:3,:,:] = ' + str(c[:3,1,1])
    print 'c[-3:,:,:] = ' + str(c[-3:,1,1])

if __name__ == '__main__':
    main()
My result from running this is the following:
Running on GPU: GeForce GT 520
Compute capability: 2.1 (Numba requires >= 2.0)
Number of streaming multiprocessor: 1
Number of cores per multiprocessor: 32
Number of cores on GPU: 32
Threads per block: 32
Block per grid: 4
Wall clock time with GPU in 1.104860 s
c[:3,:,:] = [ 2. 2. 2.]
c[-3:,:,:] = [ 2. 2. 2.]
When I run the examples in the sources, I see significant speedup. I don't think my example is running properly since the wall clock time is much longer than I would expect. I've modeled this mostly from the "even bigger speedups with cuda python" section in the first example link.
I believe I've indexed correctly and safely. Maybe the problem is with my blockdim? or griddim? Or maybe I'm using the wrong types for my GPU. I think I read that they must be a certain type. I'm very new to this so the problem very well could be trivial!
Any and all help is greatly appreciated!
You are creating your indexes correctly but then you're ignoring them.
Running the nested loop
for i in xrange(0, c.shape[0]):
    for j in xrange(0, c.shape[1]):
        for k in xrange(0, c.shape[2]):
is forcing all your threads to loop through all values in all dimensions, which is not what you want. You want each thread to compute one value in a block and then move on.
I think something like this should work better...
i = tx + bx * bw
while i < c.shape[0]:
    j = ty + by * bh
    while j < c.shape[1]:
        k = tz + bz * bd
        while k < c.shape[2]:
            c[i,j,k] = myAdd_gpu(a[i,j,k], b[i,j,k])
            k += cuda.blockDim.z * cuda.gridDim.z
        j += cuda.blockDim.y * cuda.gridDim.y
    i += cuda.blockDim.x * cuda.gridDim.x
Try to compile and run it. Also make sure to validate it, as I have not.
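As a follow-up, here is a hypothetical usage sketch (untested) that launches the kernel with a 3D grid, so the y/z indices and strides above are actually exercised, and then validates the result on the host; the names match the question's code:

import numpy as np

threads_per_block = (8, 8, 8)            # example values, tune for your device
blocks_per_grid = tuple(
    (c.shape[d] + threads_per_block[d] - 1) // threads_per_block[d] for d in range(3)
)

cg = cuda.to_device(c)
myAdd_kernel[blocks_per_grid, threads_per_block](a, b, cg)
cg.to_host()                             # copy the result back into c

assert np.allclose(c, a + b), "GPU result does not match a + b"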
I don't see you using imshow, or show, so there is no need to import those.
It doesn't appear as though you use your import of math (I didn't see any calls of math.some_function).
Your imports from numba and numbapro seem repetitive. Your "from numba import cuda" overrides your "from numbapro import cuda", since it comes after it, so your calls to cuda use the cuda from numba, not numbapro. When you call "from numba import *", you import everything from numba, not just cuda, which seems to be the only thing you use. Also, (I believe) "import numba.cuda" is equivalent to "from numba import cuda". Why not eliminate all your imports from numba and numbapro with a single "from numba import cuda"?