PyCUDA LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered - python

I am trying to parallelize the bitonic sort with pycuda. For this I use SourceModule and the CUDA C code for a parallel bitonic sort. To manage the memory copies I use pycuda.driver.InOut, which simplifies some of the transfers.
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import numpy as np
from time import time
ker = SourceModule(
"""
__device__ void swap(int & a, int & b){
int tmp = a;
a = b;
b = tmp;
}
__global__ void bitonicSort(int * values, int N){
extern __shared__ int shared[];
int tid = threadIdx.x + blockDim.x * blockIdx.x;
// Copy input to shared mem.
shared[tid] = values[tid];
__syncthreads();
// Parallel bitonic sort.
for (int k = 2; k <= N; k *= 2){
// Bitonic merge:
for (int j = k / 2; j>0; j /= 2){
int ixj = tid ^ j;
if (ixj > tid){
if ((tid & k) == 0){
//Sort ascending
if (shared[tid] > shared[ixj]){
swap(shared[tid], shared[ixj]);
}
}
else{
//Sort descending
if (shared[tid] < shared[ixj]){
swap(shared[tid], shared[ixj]);
}
}
}
__syncthreads();
}
}
values[tid] = shared[tid];
}
"""
)
N = 8 #length of A
A = np.int32(np.random.randint(1, 20, N)) #random numbers in A
BLOCK_SIZE = 256
NUM_BLOCKS = (N + BLOCK_SIZE-1)//BLOCK_SIZE
bitonicSort = ker.get_function("bitonicSort")
t1 = time()
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE,1,1), grid=(NUM_BLOCKS,1), shared=4*N)
t2 = time()
print("Execution Time {0}".format(t2 - t1))
print(A)
Since the kernel uses extern __shared__, on the pycuda side I pass the shared parameter with the corresponding size, 4*N. I also tried declaring __shared__ int shared[N] in the kernel, but that doesn't work either (see: Getting started with shared memory on PyCUDA).
Running in Google Colab I get the following error:
/usr/local/lib/python3.6/dist-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
292
293 from pycuda.driver import module_from_buffer
--> 294 self.module = module_from_buffer(cubin)
295
296 self._bind_module()
LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered
Does anyone know what could be generating this error?

Your device code isn't accounting for the sizes of your arrays correctly.
You are launching 256 threads in a single block. That means that you will have 256 threads, with tid numbered 0..255, trying to execute each line of code. For example, in this case:
shared[tid] = values[tid];
You will have, for example, one thread trying to do shared[255] = values[255];
Neither your shared array nor your values array is that large. That is the reason for the illegal memory access error.
The simplest solution for this kind of trivial problem is to make your array sizes match your block size.
BLOCK_SIZE = N
According to my testing, that change clears up any errors and results in a properly sorted array.
It won't work for N greater than 1024 or for multi-block usage, but your code would have to be modified for a multi-block sort anyway.
If you still have trouble after making that change, I suggest restarting your python session or your colab session.
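For reference, here is a minimal sketch of the launch with that change applied (an illustration rather than the exact tested code: it keeps the kernel as-is and relies on N being a power of two no larger than 1024):
N = 8  # length of A; must be a power of two for this bitonic kernel
A = np.int32(np.random.randint(1, 20, N))
BLOCK_SIZE = N   # one thread per element, so every tid stays inside the arrays
NUM_BLOCKS = 1   # a single block, since __syncthreads() only synchronizes within a block
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE, 1, 1), grid=(NUM_BLOCKS, 1), shared=4 * N)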

Related

How to copy a 2D array (matrix) from Python with a C function (and do some computationally heavy work) which returns a 2D array (matrix) to Python?

I want to copy a 2D numpy array (matrix) in a C function and get it back in Python (and then do some calculations on it in C, taking advantage of C's speed). Therefore I need the C function matrix_copy to return a 2D array (or, I guess, a pointer to it). I tried the following code, but I get the output below, where one can see that the second dimension of the array is lost.
matrix_in.shape:
(300, 200)
matrix_out.shape:
(300,)
How could I change the code (I guess matrix_copy.c needs some pointer magic) so that I obtain an exact copy of matrix_in in matrix_out?
Here is the main.py script:
from ctypes import c_void_p, c_double, c_int, cdll
from numpy.ctypeslib import ndpointer
import numpy as np
import pdb
n = 300
m = 200
matrix_in = np.random.randn(n, m)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.restype = ndpointer(dtype=c_double,
shape=(n,))
matrix_out = matrix_copy(c_void_p(matrix_in.ctypes.data),
c_int(n),
c_int(m))
print("matrix_in.shape:")
print(matrix_in.shape)
print("matrix_out.shape:")
print(matrix_out.shape)
Here is the matrix_copy.c script:
#include <stdlib.h>
#include <stdio.h>
double * matrix_copy(const double * matrix_in, int n, int m){
double * matrix_out = (double *)malloc(sizeof(double) * (n*m));
int index = 0;
for(int i=0; i< n; i++){
for(int j=0; j<m; j++){
matrix_out[i+j] = matrix_in[i+j];
//matrix_out[i][j] = matrix_in[i][j];
// some heavy computations not yet implemented
}
}
return matrix_out;
}
which I compile with the command
cc -fPIC -shared -o matrix_copy.so matrix_copy.c
And as a side note, why does the notation matrix_out[i][j] = matrix_in[i][j]; throw me an error on compilation?
matrix_copy.c:10:26: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
~~~~~~~~~~~~~^~
matrix_copy.c:10:44: error: subscripted value is not an array, pointer, or vector
matrix_out[i][j] = matrix_in[i][j];
The second dimension is 'lost' because you explicitly omit it in the named shape argument of ndpointer. Change:
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,))
to
matrix_copy.restype = ndpointer(dtype=c_double, shape=(n,m), flags='C')
Where flags='C' additionally notes that the returned data is stored contiguously in row major order.
With regards to matrix_out[i][j] = matrix_in[i][j]; throwing an error, consider that matrix_in is of type const double *. matrix_in[i] would yield a value of type const double - how do you further index this value (i.e., with [j])?
If you want to emulate accessing a 2-dimensional array via a single pointer, you must calculate offsets manually. matrix_out[i+j] is not sufficient, as you must account for the span of each sub array:
matrix_out[i * m + j] = matrix_in[i * m + j];
Note that in C, size_t is the generally preferred type to use when dealing with memory sizes or array lengths.
matrix_copy.c, simplified:
#include <stdlib.h>
double *matrix_copy(const double *matrix_in, size_t n, size_t m)
{
double *matrix_out = malloc(sizeof *matrix_out * n * m);
for (size_t i = 0; i < n; i++)
for (size_t j = 0; j < m; j++)
matrix_out[i * m + j] = matrix_in[i * m + j];
return matrix_out;
}
matrix.py, with more explicit typing:
from ctypes import c_void_p, c_double, c_size_t, cdll, POINTER
from numpy.ctypeslib import ndpointer
import numpy as np
c_double_p = POINTER(c_double)
n = 300
m = 200
matrix_in = np.random.randn(n, m).astype(c_double)
lib = cdll.LoadLibrary("matrix_copy.so")
matrix_copy = lib.matrix_copy
matrix_copy.argtypes = c_double_p, c_size_t, c_size_t
matrix_copy.restype = ndpointer(
dtype=c_double,
shape=(n,m),
flags='C')
matrix_out = matrix_copy(
matrix_in.ctypes.data_as(c_double_p),
c_size_t(n),
c_size_t(m))
print("matrix_in.shape:", matrix_in.shape)
print("matrix_out.shape:", matrix_out.shape)
print("in == out", matrix_in == matrix_out)
The incoming data is probably a single block of memory. You need to create the substructure yourself.
In my C++ code I have to do the following on data (block) coming in via swig:
void divide2DDoubleArray(double * &block, double ** &subblockdividers, int noofsubblocks, int subblocksize){
/* The starting address of a block of doubles is used to generate
* pointers to subblocks.
*
* block: memory containing the original block of data
* subblockdividers: array of subblock addresses
* noofsubblocks: specify the number of subblocks produced
* subblocksize: specify the size of the subblocks produced
*
* Design by contract: application should make sure the memory
* in block is allocated and initialized properly.
*/
// Build 2D matrix for cols
subblockdividers=new double *[noofsubblocks];
subblockdividers[0]= block;
for (int i=1; i<noofsubblocks; ++i) {
subblockdividers[i] = &subblockdividers[i-1][subblocksize];
}
}
Now the pointer returned in subblockdividers can be used the way you would like.
Don't forget to free subblockdividers when you're done. (Note: adjustments might be needed to compile this as C code.)

Optimize cython functions operating on python lists

I am currently migrating to Cython a set of functions that are currently implemented in C++ through scipy.weave (now deprecated).
These functions operate on timeseries points that are 2D lists (e.g. [[17100, 19.2], [17101, 20.7], [17102, 20.3], ...]) both in input and in output. A sample function is subtract, which accepts two timeseries and calculates a new timeseries as the subtraction of the two inputs, going date by date.
The structure and the interfaces have to be maintained for backwards compatibility, but my profiling trials show that the Cython port is about 30%-40% slower than the original scipy.weave implementation.
I have tried many ways to optimize (internal conversions to NumPy arrays and memoryviews, C pointers, ...), but the conversion time required lengthens the overall execution time. Even defining input and output as C++ vectors, leveraging Cython's implicit conversions, doesn't seem effective at matching scipy.weave's speed. I have also used the various directives for boundscheck, wraparound, cdivision, ...
The highest slow-downs seem to be on functions that use nested loops, and I've seen that a little gain can be obtained by predefining the list size (cdef list target = [[-1, float('nan')]]*size).
I am aware that Cython can't do much for plain Python structures, especially lists, but are there any other tricks or techniques with which a speedup can be obtained?
=== EDIT - ADD CODE EXAMPLE ===
A good example of this type of function is the following.
The function takes a 2D list of dates/prices and a 2D list of dates/decimal factors and searches for matching dates between the two lists, computing the output from the corresponding price/factor by multiplying or dividing (which of the two is selected by a third input parameter).
My best-performing Cython code:
cimport cython
@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef apply_conversion(list original_timeserie, list factor_timeserie, int divide_or_multiply=False):
cdef:
Py_ssize_t i, j = 0, size = len(original_timeserie), size2 = len(factor_timeserie)
long original_date, factor_date
double original_price, factor_price, conv_price
list result = []
for i in range(size):
original_date = original_timeserie[i][0]
for j in range(j, size2):
factor_date = factor_timeserie[j][0]
if original_date == factor_date:
original_price = original_timeserie[i][1]
factor_price = factor_timeserie[j][1]
if divide_or_multiply:
if factor_price != 0:
conv_price = original_price / factor_price
else:
conv_price = float('inf')
else:
conv_price = original_price * factor_price
result.append([original_date, conv_price])
break
return result
The original scipy function:
int len = original_timeserie.length();
int len2 = factor_timeserie.length();
PyObject* py_serieconv = PyList_New(len);
PyObject* original_item = NULL;
PyObject* factor_item = NULL;
PyObject* date = NULL;
PyObject* value = NULL;
long original_date = 0;
long factor_date = 0;
double original_price = 0;
double factor_price = 0;
int j = 0;
for(int i=0;i<len;i++) {
original_item = PyList_GetItem(original_timeserie, i);
date = PyList_GetItem(original_item, 0);
original_date = PyInt_AsLong(date);
original_price = PyFloat_AsDouble( PyList_GetItem(original_item, 1) );
factor_item = NULL;
for(;j<len2;) {
factor_item = PyList_GetItem(factor_timeserie, j++);
factor_date = PyInt_AsLong(PyList_GetItem(factor_item, 0));
if (factor_date == original_date) {
factor_price = PyFloat_AsDouble(PyList_GetItem(factor_item, 1));
value = PyFloat_FromDouble(original_price * (divide_or_multiply==0 ? factor_price : 1/factor_price));
PyObject* py_new_item = PyList_New(2);
Py_XINCREF(date);
PyList_SetItem(py_new_item, 0, date);
PyList_SetItem(py_new_item, 1, value);
PyList_SetItem(py_serieconv, i, py_new_item);
break;
}
}
}
return_val = py_serieconv;
Py_XDECREF(py_serieconv);

Threaded code crashes calling FFI process

I've converted a function to use threads (as per this answer). It behaves as expected in tests (that is, it returns identical values to the non-threaded version). However, calling it from Python using ctypes causes the calling process to crash.
First, the working function:
#[no_mangle]
pub extern fn convert_vec(lon: Array, lat: Array) -> Array {
// snip
// orig is a Vec<(f32, f32)>
// convert is a conversion function
let result: Vec<(i32, i32)> = orig.iter()
.map(|elem| convert(elem.0, elem.1))
.collect();
// convert back to vector of unsigned integer Tuples
let nvec = result.iter()
.map(|ints| Tuple { a: ints.0 as u32, b: ints.1 as u32 })
.collect();
Array::from_vec(nvec)
}
And now the threaded version, which passes tests (using cargo test) but crashes when called from Python:
#[no_mangle]
pub extern fn convert_vec_threaded(lon: Array, lat: Array) -> Array {
// snip
// orig is a Vec<(f32, f32)>
// convert is a conversion function
let mut guards: Vec<JoinHandle<Vec<(i32, i32)>>> = vec!();
// split into slices
for chunk in orig.chunks(orig.len() / NUMTHREADS as usize) {
let chunk = chunk.to_owned();
let g = thread::spawn(move || chunk
.into_iter()
.map(|elem| convert(elem.0, elem.1))
.collect());
guards.push(g);
}
let mut result: Vec<(i32, i32)> = Vec::with_capacity(orig.len());
for g in guards {
result.extend(g.join().unwrap().into_iter());
}
// convert back to vector of unsigned integer Tuples
let nvec = result.iter()
.map(|ints| Tuple { a: ints.0 as u32, b: ints.1 as u32 })
.collect();
Array::from_vec(nvec)
}
The complete testable example is available here
From the error message it looks like you used a chunk size of 0 for some inputs. [T]::chunks(size) will assert that size != 0.
If we want NUMTHREADS chunks, we could split it like this:
// Divide into NUMTHREADS chunks
let mut size = orig.len() / NUMTHREADS;
if orig.len() % NUMTHREADS > 0 { size += 1; }
// If we want to avoid the case where orig.len() == 0, we need another adjustment:
size = std::cmp::max(1, size);
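// With this size, orig.chunks(size) yields at most NUMTHREADS chunks and can no longer panic, since size is always at least 1.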

Memory Leak in Ctypes method

I have a project mostly written in Python. This project runs on my Raspberry Pi (Model B). Using the Pi Camera I record to a stream. Every second I pause the recording, take the last frame from the stream and compare it with an older frame. The comparison is done in C code (mainly because it is faster than Python).
The C code is called from Python using Ctypes. See the code below.
# Load picturecomparer.so and set argument and return types
cmethod = ctypes.CDLL(Paths.CMODULE_LOCATION)
cmethod.compare_pictures.restype = ctypes.c_double
cmethod.compare_pictures.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
The two images to be compared are stored on disk. Python passes the paths of both images as arguments to the C code. The C code returns a value (double) which is the percentage difference between the two images.
# Call the C method to compare the images
difflevel = cmethod.compare_pictures(path1, path2)
The C code looks like this:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifndef STB_IMAGE_IMPLEMENTATION
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#ifndef STBI_ASSERT
#define STBI_ASSERT(x)
#endif
#endif
#define COLOR_R 0
#define COLOR_G 1
#define COLOR_B 2
#define OFFSET 10
double compare_pictures(const char* path1, const char* path2);
double compare_pictures(const char* path1, const char* path2)
{
double totalDiff = 0.0, value;
unsigned int x, y;
int width1, height1, comps1;
unsigned char * image1 = stbi_load(path1, &width1, &height1, &comps1, 0);
int width2, height2, comps2;
unsigned char * image2 = stbi_load(path2, &width2, &height2, &comps2, 0);
// Perform some checks to be sure images are valid
if (image1 == NULL | image2 == NULL) { return 0; }
if (width1 != width2 | height1 != height2) { return 0; }
for (y = 0; y < height1; y++)
{
for (x = 0; x < width1; x++)
{
// Calculate difference in RED
value = (int)image1[(x + y*width1) * comps1 + COLOR_R] - (int)image2[(x + y*width2) * comps2 + COLOR_R];
if (value < OFFSET && value > (OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
// Calculate difference in GREEN
value = (int)image1[(x + y*width1) * comps1 + COLOR_G] - (int)image2[(x + y*width2) * comps2 + COLOR_G];
if (value < OFFSET && value >(OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
// Calculate difference in BLUE
value = (int)image1[(x + y*width1) * comps1 + COLOR_B] - (int)image2[(x + y*width2) * comps2 + COLOR_B];
if (value < OFFSET && value >(OFFSET * -1)) { value = 0; }
totalDiff += fabs(value) / 255.0;
}
}
totalDiff = 100.0 * totalDiff / (double)(width1 * height1 * 3);
return totalDiff;
}
The C code is executed every ~2 seconds. I just noticed that there is a memory leak. After around 10 to 15 minutes my Raspberry Pi has only about 10MB of RAM left to use. A few minutes later it crashes and doesn't respond anymore.
I have done some checks to find out what causes this in my project. My entire project uses around 30-40MB of RAM if I disable the C code. This project is all my Raspberry Pi has to execute.
Model B: 512MB of RAM, shared between the CPU and GPU.
GPU: 128MB (/boot/config.txt).
My Linux distro uses: ~60MB.
So I have ~300MB for my project.
I hope someone can point me to where it goes wrong, or tell me whether I have to call the GC myself, etc.
Thanks in advance.
P.S. I know this way of comparing images is not the best, but it works for me for now.
Since the images are returned as pointers to buffers, stbi_load must be allocating space for them, and you are not releasing this space before returning, so the memory leak is not surprising.
Check the documentation to see if there is a specific stbi_image_free function, or try adding free(image1); free(image2); before the final return.
Having checked, I can categorically say that you should be calling STBI_FREE(image1); STBI_FREE(image2); before returning.
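If it helps to see the leak (and later the fix) from the Python side, a rough sketch is to call the comparison in a loop and watch the process's resident memory; this illustration assumes Linux (for the resource module) and uses made-up file names and a hard-coded library path in place of Paths.CMODULE_LOCATION:
import ctypes, resource
cmethod = ctypes.CDLL("./picturecomparer.so")  # hypothetical path, for illustration only
cmethod.compare_pictures.restype = ctypes.c_double
cmethod.compare_pictures.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
for _ in range(100):
    cmethod.compare_pictures(b"frame_old.jpg", b"frame_new.jpg")
    # ru_maxrss (kilobytes on Linux) keeps climbing while the C code leaks the image buffers
    print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)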

Disappointing results in pyCUDA benchmark for distance computing between N points

The following script was set up for benchmarking purposes. It computes the distance between N points using the Euclidean L2 norm. Three different routines are implemented:
High-level solution using the scipy.spatial.distance.pdist function.
Fairly low-level OpenMP powered scipy.weave.inline solution.
pyCUDA powered GPGPU solution.
Here are the benchmark results on an i5-3470 (16GB RAM) using a GTX660 (2GB RAM):
------------
Scipy Pdist
Execution time: 3.01975 s
First five elements: [ 0.74968684 0.71457213 0.833188 0.48084545 0.86407363]
Last five elements: [ 0.65717077 0.76850474 0.29652017 0.856179 0.56074625]
------------
Weave Inline
Execution time: 2.48705 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850474 0.29652017 0.856179 0.56074625]
------------
pyCUDA
CUDA clock timing: 0.713028930664
Execution time: 2.04364 s
First five elements: [ 0.74968684 0.71457213 0.83318806 0.48084542 0.86407363]
Last five elements: [ 0.65717083 0.76850468 0.29652017 0.856179 0.56074625]
------------
I am a bit disappointed in the pyCUDA performance. Since I am new to CUDA, there is probably something I am missing here. So where is the crux of the matter? Am I reaching the limits of global memory bandwidth? Is it a poor choice of block and grid sizes?
import numpy,time,math
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from scipy.spatial.distance import pdist
from scipy import weave
def weave_solution(x):
"""
OpenMP powered weave inline.
"""
N,DIM = numpy.shape(x)
L = ((N-1)**2+(N-1))/2
solution = numpy.zeros(L).astype(numpy.float32)
ncpu = 4
weave_omp = {'headers' : ['<omp.h>'],
'extra_compile_args': ['-fopenmp'],
'extra_link_args' : ['-lgomp']}
code = \
r'''
omp_set_num_threads(ncpu);
#pragma omp parallel
{
int j,d,pos;
float r=0.0;
#pragma omp for
for (int i=0; i<(N-1); i++){
for (j=(i+1); j<N; j++){
r = 0.0;
for (d=0; d<DIM; d++){
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
}
pos = (i*N+j)-(i*(i+1)/2)-i-1;
solution[pos] = sqrt(r);
}
}
}
'''
weave.inline(code,['x','N','DIM','solution','ncpu'],**weave_omp)
return numpy.array(solution)
def scipy_solution(x):
"""
SciPy High-level function
"""
return pdist(x).astype(numpy.float32)
def cuda_solution(x):
"""
pyCUDA
"""
N,DIM = numpy.shape(x)
N = numpy.int32(N)
DIM = numpy.int32(DIM)
L = ((N-1)**2+(N-1))/2
solution = numpy.zeros(L).astype(numpy.float32)
start = drv.Event()
end = drv.Event()
mod = SourceModule("""
__global__ void distance(float *x,int N,int DIM,float *solution){
const int i = blockDim.x * blockIdx.x + threadIdx.x;
int j,d,pos;
float r=0.0;
if ( i < (N-1) ){
for (j=(i+1); j<N; j++){
r = 0.0;
for (d=0; d<DIM; d++){
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
}
pos = (i*N+j)-(i*(i+1)/2)-i-1;
solution[pos] = sqrt(r);
}
}
}
""")
func = mod.get_function("distance")
start.record()
func(drv.In(x),N,DIM,drv.Out(solution),block=(192,1,1),grid=(192,1))
end.record()
end.synchronize()
secs = start.time_till(end)*1e-3
print "CUDA clock timing: ",secs
return solution
if __name__ == '__main__':
# Set up data points
N = 25000
DIM = 3
x = numpy.random.rand(N,DIM).astype(numpy.float32)
print "-"*12
# Scipy solution
print "Scipy Pdist"
stime = time.time()
spsolution = scipy_solution(x)
stime = time.time()-stime
print "Execution time: {0:.5f} s".format(stime)
print "Frist five elements:", spsolution[:5]
print "Last five elements:", spsolution[-5:]
print "-"*12
# Weave solution
print "Weave Inline"
wtime = time.time()
wsolution = weave_solution(x)
wtime = time.time()-wtime
print "Execution time: {0:.5f} s".format(wtime)
print "Frist five elements:", wsolution[:5]
print "Last five elements:", wsolution[-5:]
print "-"*12
# pyCUDA solution
print "pyCUDA"
ctime = time.time()
csolution = cuda_solution(x)
ctime = time.time()-ctime
print "Execution time: {0:.5f} s".format(ctime)
print "Frist five elements:", csolution[:5]
print "Last five elements:", csolution[-5:]
print "-"*12
Edit:
I have added the shebang line
#!/usr/bin/env python
at the top of the file and made it executable. After commenting out the computation using weave.inline and scipy.spatial.distance.pdist, I profiled the run with the NVIDIA Visual Profiler (screenshot not included here).
Right now you launch 192x192 threads, and each thread with i < N-1 runs the whole inner loop over j serially; you could easily launch more blocks/threads.
What you want to do is, instead of this loop, for (j=(i+1); j<N; j++){, use N-1 threads that each do just the inner loop body.
If you want to take it further you could have (N-1) * DIM threads each doing the statement in the inner loop, store the results in shared memory and finally do a reduction on them. See Optimizing Parallel Reduction in CUDA.
Looking at this line:
r += (x[i*DIM+d]-x[j*DIM+d])*(x[i*DIM+d]-x[j*DIM+d]);
The memory access pattern here is not uniform and not coalesced. I also do not know whether nvcc will optimize your expression down to two memory transactions instead of the four shown here, since I do not know whether pycuda passes -O3 to nvcc. To be sure, put (x[i*DIM+d]-x[j*DIM+d]) into a register variable and square it yourself.
You can also try putting #pragma unroll before each for loop to unroll them where possible.
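To make those suggestions concrete, here is a rough, untested sketch of a one-thread-per-(i, j)-pair kernel that also keeps the difference in a register, embedded via SourceModule in the same style as the original script (the kernel name distance_pairs and the 16x16 block shape are illustrative choices; the shared-memory reduction variant is not shown):
mod = SourceModule("""
__global__ void distance_pairs(float *x, int N, int DIM, float *solution){
    // one thread per (i, j) pair instead of one thread looping over all j
    const int i = blockDim.y * blockIdx.y + threadIdx.y;
    const int j = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N - 1 && j > i && j < N){
        float r = 0.0f;
        #pragma unroll
        for (int d = 0; d < DIM; d++){
            float diff = x[i*DIM + d] - x[j*DIM + d];  // difference kept in a register, squared once
            r += diff * diff;
        }
        int pos = (i*N + j) - (i*(i+1)/2) - i - 1;     // same condensed index as the original kernel
        solution[pos] = sqrt(r);
    }
}
""")
func = mod.get_function("distance_pairs")
threads = 16
blocks = (int(N) + threads - 1) // threads
func(drv.In(x), N, DIM, drv.Out(solution), block=(threads, threads, 1), grid=(blocks, blocks))
Roughly half of the launched threads (those with j <= i) do nothing, but for this problem size that is usually cheaper than serializing the inner loop in each thread.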
