Understanding and optimising threads, blocks, and grids in pyCUDA

Understanding and optimising threads, blocks, and grids in pyCUDA - python

I'm very new to GPU programming and pyCUDA and have a pretty fundamental gap in my knowledge. I have spent quite a bit of time searching SO, looking at example code and reading supporting documentation for CUDA/pyCUDA but haven't found much diversity in the explanations and can't get my head around a few things.
I am having trouble correctly defining block and grid dimensions. The code I am currently running is as follows, and aims to do element-wise multiplication of an array a by a float b:
from __future__ import division
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
rows = 256
cols = 10
a = np.ones((rows, cols), dtype=np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
b = np.float32(2)
mod = SourceModule("""
__global__ void MatMult(float *a, float b)
{
const int i = threadIdx.x + blockDim.x * blockIdx.x;
const int j = threadIdx.y + blockDim.y * blockIdx.y;
int Idx = i + j*gridDim.x;
a[Idx] *= b;
}
""")
func = mod.get_function("MatMult")
xBlock = np.int32(np.floor(1024/rows))
yBlock = np.int32(cols)
bdim = (xBlock, yBlock, 1)
dx, mx = divmod(rows, bdim[0])
dy, my = divmod(cols, bdim[1])
gdim = ( (dx + (mx>0)) * bdim[0], (dy + (my>0)) * bdim[1])
print "bdim=",bdim, ", gdim=", gdim
func(a_gpu, b, block=bdim, grid=gdim)
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2*a
The code should print the block dimensions bdim and the grid dimensions gdim, as well as an array of zeroes.
This works for small array sizes, for example, if rows=256 and cols=10 (as in the example above) the output is as follows:
bdim= (4, 10, 1) , gdim= (256, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
However, if I increase rows=512, I get the following output:
bdim= (2, 10, 1) , gdim= (512, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 2. 2. 2. ..., 2. 2. 2.]
[ 2. 2. 2. ..., 2. 2. 2.]
[ 2. 2. 2. ..., 2. 2. 2.]]
Indicating that the multiplication is happening twice for some elements of the array.
However, if I force the block dimensions to bdim = (1,1,1), the problem no longer occurs and I get the following (correct) output for the larger array size:
bdim= (1, 1, 1) , gdim= (512, 10)
[[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
...,
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
I don't understand this. What is happening here which means that this method of defining the block and grid dimensions is no longer appropriate as the array size is increased? Also, if block has dimensions (1,1,1) does this mean that the calculation is being performed serially?
Thanks in advance for any pointers and help!

You operate on 2D grid of 2D blocks. In your kernel you seem to assume that gridDim.x would return number of threads in x dimension of a grid.
__global__ void MatMult(float *a, float b)
{
const int i = threadIdx.x + blockDim.x * blockIdx.x;
const int j = threadIdx.y + blockDim.y * blockIdx.y;
int Idx = i + j*gridDim.x;
a[Idx] *= b;
}
The gridDim.x returns number of blocks r x direction of grid, not number of threads. In order to obtain number of threads in given direction you should multiply number of threads in a block with number of blocks in a grid in the same direction:
int Idx = i + j * blockDim.x * gridDim.x

Related

Difference between Results in Manual Function and Matrix Multiplication with odeint

I'm currently trying to develop a function that performs matrix multiplication while expanding a differential equation with odeint in Python and am seeing strange results.
I converted the function:
def f(x, t):
return [
-0.1 * x[0] + 2 * x[1],
-2 * x[0] - 0.1 * x[1]
]
to the below so that I can incorporate different matrices.
I have the below matrix of values and function that takes specific values of that matrix:
from scipy.integrate import odeint
x0_train = [2,0]
dt = 0.01
t = np.arange(0, 1000, dt)
matrix_a = np.array([-0.09999975, 1.999999, -1.999999, -0.09999974])
# Function to run odeint with
def f(x, t, a):
return [
a[0] * x[0] + a[1] * x[1],
a[2] * x[0] - a[3] * x[1]
]
odeint(f, x0_train, t, args=(matrix_a,))
>>> array([[ 2. , 0. ],
[ 1.99760115, -0.03999731],
[ 1.99440529, -0.07997867],
...,
[ 1.69090227, 1.15608741],
[ 1.71199436, 1.12319701],
[ 1.73240339, 1.08985846]])
This seems right, but when I create my own function to perform multiplication/regression, I see the results at the bottom of the array are completely different. I have two sparse arrays that provide the same conditions as matrix_a but with zeros around them.
from sklearn.preprocessing import PolynomialFeatures
new_matrix_a = array([[ 0. , -0.09999975, 1.999999 , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ],
[ 0. , -1.999999 , -0.09999974, 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. ,
0. ]])
# New function
def f_new(x, t, parameters):
polynomials = PolynomialFeatures(degree=5)
x = np.array(x).reshape(-1,2)
#x0_train_array_reshape = x0_train_array.reshape(1,2)
polynomial_transform = polynomials.fit(x)
polynomial_features = polynomial_transform.fit_transform(x).T
x_ode = np.matmul(parameters[0],polynomial_features)
y_ode = np.matmul(parameters[1],polynomial_features)
return np.concatenate((x_ode, y_ode), axis=None).tolist()
odeint(f_new, x0_train, t, args=(new_matrix_a,))
>>> array([[ 2.00000000e+00, 0.00000000e+00],
[ 1.99760142e+00, -3.99573216e-02],
[ 1.99440742e+00, -7.98188169e-02],
...,
[-3.50784051e-21, -9.99729456e-22],
[-3.50782881e-21, -9.99726119e-22],
[-3.50781711e-21, -9.99722781e-22]])
As you can see, I'm getting completely different values at the end of the array. I've been running through my code and can't seem to find a reason why they would be different. Does anybody have a clear reason why or if I'm doing something wrong with my f_new? Ideally, I'd like to develop a function that can take any values in that matrix_a, which is why I'm trying to create this new function.
Thanks in advance.

You should perhaps use numpy even more in the first version, to avoid sign errors in routine algorithms.
def f(x, t, a):
return a.reshape([2,2]) # x # or use matmul, or a.reshape([2,2]).dot(x)
or, for efficiency, pass the already reshaped a.

How to efficiently concatenate Numpy Array based on position conditioning?

The objective is to concatenate a Numpy Array according to a set of position. However, I am curious whether the concatenate and step as shown in the code below can be optimized further without the need of for loop and if-else statement?
tot_length=0.2 implementation
steps=0.1
start_val=0
repeat_perm=3
list_no =np.arange(start_val, tot_length, steps)
x, y, z = np.meshgrid(*[list_no for _ in range(3)], sparse=True)
ix = np.array(((x>=y) & (y>=z)).nonzero()).T
final_opt=list_no[ix]
final_opt[:,[0, 1]] = final_opt[:,[1, 0]]
all_result=itertools.product(range(0,ix.shape[1]), repeat=repeat_perm)
for num, num_pair in enumerate(all_result, start=1):
for num_x, num_pair_x in enumerate ( num_pair, start=0 ):
if (num == 1) &(num_x==0) :
cont_arry = final_opt [num_pair_x, :]
else:
cont_arry= np.concatenate((cont_arry, final_opt [num_pair_x, :]), axis=0)
final_arr =np.reshape(cont_arry, (-1, 9))
print(final_arr)
Output of size (27, 9), but only partial are shown below
[[0. 0. 0. 0. 0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.1 0. 0. ]
[0. 0. 0. 0. 0. 0. 0.1 0.1 0. ]
[0. 0. 0. 0.1 0. 0. 0. 0. 0. ]
[0. 0. 0. 0.1 0. 0. 0.1 0. 0. ]
[0. 0. 0. 0.1 0. 0. 0.1 0.1 0. ]
[0.1 0.1 0. 0.1 0.1 0. 0.1 0.1 0. ]]
Just some heads up,the cont_arry will be vectorised multiply with a 1D array of similar length with the cont_arry. Knowing this, is there a way to avoid from storing the result of concatenation on memory or what not to minimise potential memory issue since in actual application, the worst possible parameter setting is as below
tot_length=200
steps=0.1
start_val=0
repeat_perm=1200

I think your concatenate loop can be replaced with:
alist = []
for num, num_pair in enumerate(all_result, start=1):
for num_x, num_pair_x in enumerate ( num_pair, start=0 ):
alist.append( final_opt [num_pair_x, :]))
arr = np.array(alist)
# arr = np.concatenate(alist, axis=0)
# arr = np.vstack(alist)
There may be some details in this that I didn't catch. I haven't tried to test it. List append is much faster than concatenate, especially when done repeatedly.
concatenate is most efficient when give a whole list of arrays to join.
Better yet, don't iterate at all; instead make use of whole-array math and indexing. But I haven't tried to master your code, so won't suggest how to do that.

Solving ode with python getting wrong solution

i want to solve the following ode
KT + CT' = Q
to given example Data is my code below
import numpy as np
import scipy as sp
# Solve the following ODE
# K*T + C*T' = Q
# T' = C^-1 ( Q - K * T )
T_start=sp.array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
K = sp.array([[-0.01761969, 0.02704873, 0.00572222, 0. , 0. ,
0. ],
[ 0.02704873, -0.03546941, 0. , 0. , 0.00513177,
0. ],
[ 0.00572222, 0. , 0.03001858, -0.04752982, 0. ,
0.02030505],
[ 0. , 0. , -0.04752982, 0.0444405 , 0.00308932,
0. ],
[ 0. , 0.00513177, 0. , 0.00308932, 0.02629577,
-0.01793915],
[ 0. , 0. , 0.02030505, 0. , -0.01793915,
0.00084506]])
Q = sp.array([ 1.66342077, 0.16187956, 0.65115035, 0.71274755,2.54614269, 0.13680399])
C_invers = sp.array([[ 3.44827586, 0. , 0. , 0. , 0. ,
-0. ],
[ 0. , 1.5625 , 0. , 0. , 0. ,
-0. ],
[ 0. , 0. , 2.63157895, 0. , 0. ,
-0. ],
[ 0. , 0. , 0. , 2.17391304, 0. ,
-0. ],
[ 0. , 0. , 0. , 0. , 1.63934426,
-0. ],
[ 0. , 0. , 0. , 0. , 0. ,
2.38095238]])
time = np.linspace(0, 20, 10000)
#T_real = sp.array([[ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87]])
def deriv(T, t):
return sp.dot( C_invers, Q - np.dot(K, T) )
T_sol = sp.integrate.odeint(deriv, T_start, time)
i know that the result is
sp.array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
the solution is "stable" if and only if i use this as the T_start condition
but if i change my start condition for example to
T_start=sp.array([ 0, 0, 0, 0, 0, 0])
it won't converge im getting the following result:
where is my fault? Negative values make no sense for my system :/ Can you help me? thanks ;)

The array
array([ 151.26, 132.18, 131.64, 146.55, 147.87, 137.87])
is the equilibrium of your system (approximately). You can find this by setting the right-hand side of your system of equations to 0, which leads to Teq = inv(K)*Q:
In [9]: Teq = np.linalg.solve(K, Q)
In [10]: Teq
Out[10]:
array([ 151.25960795, 132.17972469, 131.6402527 , 146.55025359,
147.87025015, 137.87029892])
That's why your solution appears to be stable when you use these values for the starting point. The solution is very close to the equilibrium, so it doesn't change much.
Long term, however, the solution will eventually diverge away from Teq, because that equilibrium point is unstable. Your system, T' = inv(C)*(Q - K*T), is linear in T, so you can determine the stability by computing the eigenvalues of the coefficient matrix of T. That is, write T = inv(C)*Q - inv(C)*K*T. The coefficient matrix of T is -inv(C)*K. Here's how you can find the eigenvalues of that matrix:
In [11]: A = -C_invers.dot(K)
In [12]: np.linalg.eigvals(A)
Out[12]:
array([-0.2089754 , 0.12257481, -0.06349952, -0.01489581, 0.00146708,
0.05878143])
The coefficent matrix A has three positive eigenvalues. Those correspond to modes that will grow exponentially in time. That is, the equilibrium is unstable, so the growth that you see is to be expected.

Insert Numpy Array into Array with extending of the embedding array

First of all, I work with byte array (>= 400x400x1000) bytes.
I wrote a small function which can insert a multidimensional array (or a fraction of) into another one by indicating an offset. This works if the embedded array is smaller than the embedding array (case A). Otherwise the embedded array is truncated (case B).
case A) Inserting a 3x3 into a 5x5 matrix with offset 1,1 would look like this.
[[ 0. 0. 0. 0. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 1. 1. 1. 0.]
[ 0. 0. 0. 0. 0.]]
case B) If the offsets are exceeding the dimensions of the embedding matrix, the smaller array is truncated. E.g. a (-1,-1) offset would result in this.
[[ 1. 1. 0. 0. 0.]
[ 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0.]]
case C) Now, instead of truncating the embedded array, I want to extend the embedding array (by zeroes) if the embedded array is either bigger than the embedding array or the offsets enforce it (e.g. case B). Is there a smart way with numpy or scipy to solve this?
[[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0.]]
Actually I work with 3D array, but for simplicity I wrote an example for 2D arrays. Current source:
import numpy as np
import nibabel as nib
def addAtPos(mat_bigger, mat_smaller, xyz_coor):
size_sm_x, size_sm_y = np.shape(mat_smaller)
size_gr_x, size_gr_y = np.shape(mat_bigger)
start_gr_x, start_gr_y = xyz_coor
start_sm_x, start_sm_y = 0,0
end_x, end_y = (start_gr_x + size_sm_x), (start_gr_y + size_sm_y)
print(size_sm_x, size_sm_y)
print(size_gr_x, size_gr_y)
print(end_x, end_y)
if start_gr_x < 0:
start_sm_x = -start_gr_x
start_gr_x = 0
if start_gr_y < 0:
start_sm_y = -start_gr_y
start_gr_y = 0
if end_x > size_gr_x:
size_sm_x = size_sm_x - (end_x - size_gr_x)
end_x = size_gr_x
if end_y > size_gr_y:
size_sm_y = size_sm_y - (end_y - size_gr_y)
end_y = size_gr_y
# copy all or a chunk (if offset is small/big enough) of the smaller matrix into the bigger matrix
mat_bigger[start_gr_x:end_x, start_gr_y:end_y] = mat_smaller[start_sm_x:size_sm_x, start_sm_y:size_sm_y]
return mat_bigger
a_gr = np.zeros([5,5])
a_sm = np.ones([3,3])
a_res = addAtPos(a_gr, a_sm, [-2,1])
#print (a_gr)
print (a_res)

Actually there is an easier way to do it.
For your first example of a 3x3 array embedded to a 5x5 one you can do it with something like:
A = np.array([[1,1,1], [1,1,1], [1,1,1]])
(N, M) = A.shape
B = np.zeros(shape=(N + 2, M + 2))
B[1:-1:, 1:-1] = A
By playing with slicing you can select a subset of A and insert it anywhere within a continuous subset of B.
Hope it helps! ;-)

How do I pass a 2-dimensional array into a kernel in pycuda?

I found an answer here, but it is not clear if I should reshape the array. Do I need to reshape the 2d array into 1d before passing it to pycuda kernel?

There is no need to reshape a 2D gpuarray in order to pass it to a CUDA kernel.
As I said in the answer you linked to, a 2D numpy or PyCUDA array is just an allocation of pitched linear memory, stored in row major order by default. Both have two members which tell you everything that you need to access an array - shape and strides. For example:
In [8]: X=np.arange(0,15).reshape((5,3))
In [9]: print X.shape
(5, 3)
In [10]: print X.strides
(12, 4)
The shape is self explanatory, the stride is the pitch of the storage in bytes. The best practice for kernel code will be to treat the pointer supplied by PyCUDA as if it were allocated using cudaMallocPitch and treat the first element of stride as the byte pitch of the rows in memory. A trivial example might look like this:
import pycuda.driver as drv
from pycuda.compiler import SourceModule
import pycuda.autoinit
import numpy as np
mod = SourceModule("""
__global__ void diag_kernel(float *dest, int stride, int N)
{
const int tid = threadIdx.x + blockDim.x * blockIdx.x;
if (tid < N) {
float* p = (float*)((char*)dest + tid*stride) + tid;
*p = 1.0f;
}
}
""")
diag_kernel = mod.get_function("diag_kernel")
a = np.zeros((10,10), dtype=np.float32)
a_N = np.int32(a.shape[0])
a_stride = np.int32(a.strides[0])
a_bytes = a.size * a.dtype.itemsize
a_gpu = drv.mem_alloc(a_bytes)
drv.memcpy_htod(a_gpu, a)
diag_kernel(a_gpu, a_stride, a_N, block=(32,1,1))
drv.memcpy_dtoh(a, a_gpu)
print a
Here some memory is allocated on the device, a zeroed 2D array is copied to that allocation directly, and the result of the kernel (filling the diagonals with 1) copied back to the host and printed. It isn't necessary to flatten or otherwise modify the shape or memory layout of the 2D numpy data at any point in the process. The result is:
$ cuda-memcheck python ./gpuarray.py
========= CUDA-MEMCHECK
[[ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]]
========= ERROR SUMMARY: 0 errors

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Understanding and optimising threads, blocks, and grids in pyCUDA - python

Related

Difference between Results in Manual Function and Matrix Multiplication with odeint

How to efficiently concatenate Numpy Array based on position conditioning?

Solving ode with python getting wrong solution

Insert Numpy Array into Array with extending of the embedding array

How do I pass a 2-dimensional array into a kernel in pycuda?

Categories

Resources