Calling __host__ functions in PyCUDA

Is it possible to call __host__ functions in PyCUDA like you can __global__ functions? I noticed in the documentation that pycuda.driver.Function creates a handle to a __global__ function. __device__ functions can be called from a __global__ function, but __host__ code cannot. I'm aware that using a __host__ function pretty much defeats the purpose of PyCUDA, but there are some ready-made functions that I'd like to import and call as a proof of concept.
As a note, whenever I try to import the __host__ function, I get:
pycuda._driver.LogicError: cuModuleGetFunction failed: named symbol not found

No it is not possible.
This isn't a limitation of PyCUDA per se, but of CUDA itself. The __host__ decorator simply decays to plain host code, and the CUDA APIs can't handle such functions the way device code is handled (note that the APIs also don't handle __device__ functions, which are the true equivalent of __host__).
If you want to call or use __host__ functions from Python, you will need to use one of the standard C++/Python interoperability mechanisms, such as ctypes, SWIG, or Boost.Python.
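For instance, a minimal ctypes sketch of that approach (the library name, build command, and host_add function are hypothetical; the point is only that the __host__ code is reached through an ordinary shared library rather than through PyCUDA):

import ctypes

# Hypothetical shared library built from your CUDA sources, e.g.:
#   nvcc -Xcompiler -fPIC -shared hostfuncs.cu -o libhostfuncs.so
lib = ctypes.CDLL("./libhostfuncs.so")

# Assumes the C++ side declares: extern "C" int host_add(int a, int b)
lib.host_add.argtypes = [ctypes.c_int, ctypes.c_int]
lib.host_add.restype = ctypes.c_int

print(lib.host_add(2, 3))  # prints 5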
EDIT:
Since this answer was written five years ago, CUDA has added the ability to run host functions in CUDA streams via cuLaunchHostFunc (driver API) or cudaLaunchHostFunc (runtime API). Unfortunately, at the time of this edit (June 2022), PyCUDA doesn't expose this functionality, so it still isn't possible in PyCUDA and the core message of the original answer is unchanged.

Below is sample code that calls CUDA library APIs from PyCUDA via ctypes. It generates uniformly distributed random numbers with cuRAND and may serve as a reference for calling ready-made host-side functions (as the poster asks, such as the CUDA library APIs) from a PyCUDA program.
import numpy as np
from ctypes import CDLL, byref, cast, c_ulonglong, c_float, POINTER
import pycuda.gpuarray as gpuarray
import pycuda.autoinit

# --- Load the cuRAND host-side library (the path may differ on your system)
curand = CDLL("/usr/local/cuda/lib64/libcurand.so")

# --- Number of elements to generate
N = 10

# --- cuRAND enums
CURAND_RNG_PSEUDO_DEFAULT = 100

# --- Query the cuRAND version
i = c_ulonglong()
curand.curandGetVersion(byref(i))
print("curand version: ", i.value)

# --- Allocate space for generation
d_x = gpuarray.empty(N, dtype=np.float32)

# --- Create the random number generator
gen = c_ulonglong()
curand.curandCreateGenerator(byref(gen), CURAND_RNG_PSEUDO_DEFAULT)

# --- Generate random numbers into the gpuarray's device pointer
curand.curandGenerateUniform(gen, cast(d_x.ptr, POINTER(c_float)), N)
print(d_x)

# --- Destroy the generator when done
curand.curandDestroyGenerator(gen)

Related

Can you call the numba.cuda.random device function in user-created numba CUDA device functions?

I have a CUDA kernel and several device functions in Numba for a project. I'm trying to call xoroshiro128p_uniform_float32 from the numba.cuda.random module. Whenever I run the following code, I get the error shown below it:
import numba
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states
from numba.cuda.random import xoroshiro128p_uniform_float64

@cuda.jit('void(float32[:,:])', device=True)
def device(rng_states):
    thread_id = cuda.grid(1)
    probability = xoroshiro128p_uniform_float64(rng_states, thread_id)

@cuda.jit()
def kernel(rng_states):
    device(rng_states)

BPG = 10
TPB = 10
rng_states = create_xoroshiro128p_states(BPG * TPB, seed=42069)
kernel[TPB, BPG](rng_states)
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Untyped global name 'xoroshiro128p_uniform_float32': Cannot determine Numba type of <class 'function'>
Has anyone successfully called imported functions in a CUDA device function on numba before?
The error is here:
@cuda.jit('void(float32[:,:])', device=True)
def device(rng_states):
That signature is incorrect and also unnecessary. There is no reason that I know of to think that rng_states is or should be a 2D float32 array.
Do this:
@cuda.jit(device=True)
def device(rng_states):
A few other things I noticed that are not the proximate issue:
A kernel launch specifies blocks per grid first, then threads per block. So this might be incorrect if you modify things in the future:
kernel[TPB, BPG](rng_states)
To my eye it is better written as:
kernel[BPG, TPB](rng_states)
Finally, your question states:
When I try to call xoroshiro128p_uniform_float32
but your code reflects:
probability = xoroshiro128p_uniform_float64(
Regarding this question:
Has anyone successfully called imported functions in a CUDA device function on numba before?
In the general case, functions from a Python import may or may not be usable in CUDA device code (most are not). This random number generator, provided by Numba for this purpose, might be called a "special case".
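Putting the fixes above together, a corrected sketch of the poster's code (untested here, but following the numba.cuda.random documentation) would be:

from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states
from numba.cuda.random import xoroshiro128p_uniform_float64

@cuda.jit(device=True)  # no explicit signature needed
def device(rng_states):
    thread_id = cuda.grid(1)
    return xoroshiro128p_uniform_float64(rng_states, thread_id)

@cuda.jit
def kernel(rng_states):
    device(rng_states)

BPG = 10  # blocks per grid
TPB = 10  # threads per block
rng_states = create_xoroshiro128p_states(BPG * TPB, seed=42069)
kernel[BPG, TPB](rng_states)  # blocks per grid first, then threads per block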

Python: what is the difference between a package and a compiler?

I was reading the wiki page for Numba, and it says Numba is a "compiler". But then later on, it says that to use Numba, you import it like a package. I later looked up how to use Numba, and indeed, you just pip install it.
So now I am confused. I thought Numba was a compiler? But it seems to be used just like any other package, like numpy or pandas? What's the difference?
A compiler is a program that inputs something in human-readable form (usually a program in a specified language) and outputs a functionally equivalent stream in another, more machine-digestible form. Just as with any other transformation, it's equally viable as a command-line invocation or a function call.
As long as it's wrapped properly in a package for general use, it's perfectly reasonable to deliver a compiler as a Python package.
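As a concrete illustration that a compiler can be driven by a function call, Python exposes its own bytecode compiler as the built-in compile():

# Compile a string of Python source into a code object, then execute it.
source = "print(2 + 3)"
code_object = compile(source, "<example>", "exec")  # the compilation step
exec(code_object)                                   # runs the compiled code, prints 5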
Does that clear up the difficulty?
From what I have read in the Numba documentation, it's a package that you import into your project; you then use Numba decorators to indicate the parts of your code that you would like compiled just in time (JIT) in order to optimize them. Like in the following example:
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
When the monte_carlo_pi function is first called, Numba compiles it in order to optimize it, so there isn't a separate compilation step that you have to run.
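You can see the just-in-time compilation happen by timing the calls: the first call pays the compilation cost, while later calls run the cached machine code (a rough sketch; exact numbers depend on your machine):

import time
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples

t0 = time.perf_counter()
monte_carlo_pi(1_000_000)  # first call: Numba compiles, then runs
t1 = time.perf_counter()
monte_carlo_pi(1_000_000)  # second call: reuses the compiled code
t2 = time.perf_counter()
print(f"first call: {t1 - t0:.3f}s, second call: {t2 - t1:.3f}s")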

Interacting with AURA_SDK.dll through python using ctypes

I'm trying to control my ASUS ROG Flare keyboard LED colors using python.
I downloaded the Aura Software Developer Kit from the ASUS website.
link here: https://www.asus.com/campaign/aura/us/SDK.php
inside the kit there is a menu guide and a dll file called AURA_SDK.dll. The guide says that with the mentioned dll the keyboard can be controlled.
I'm using the ctypes Python package and succeeded in loading the DLL, but when I call the first function to obtain control of the keyboard, the program fails because I don't fully understand what argument the function needs.
Documentation from the guide (screenshot not reproduced here):
Code I am trying:
import ctypes
path_dll = 'AURA_SDK.dll'
dll = ctypes.cdll.LoadLibrary(path_dll)
res = dll.CreateClaymoreKeyboard() # fails here
Any ideas on how to create this argument?
Thanks in advance.
This should do it. A good habit to get into is to always define .argtypes and .restype for the functions you call. This makes sure parameters are converted correctly between Python and C types, and provides better error checking to help catch incorrect usage.
There are also many pre-defined Windows types in ctypes.wintypes, so you don't have to guess which ctypes type to use for a parameter.
Also note that WINAPI is defined as the __stdcall calling convention, so you should use WinDLL instead of CDLL to load the DLL. On 64-bit systems there is no difference between the standard C calling convention (__cdecl) and __stdcall, but it matters if you are using 32-bit Python or want portability to 32-bit Python.
import ctypes as ct
from ctypes import wintypes as w
dll = ct.WinDLL('./AURA_SDK') # Use WinDLL for WINAPI calls.
dll.CreateClaymoreKeyboard.argtypes = ct.POINTER(ct.c_void_p), # tuple of arguments
dll.CreateClaymoreKeyboard.restype = w.DWORD
handle = ct.c_void_p() # Make an instance to pass by reference and receive the handle.
res = dll.CreateClaymoreKeyboard(ct.byref(handle))
# res is non-zero on success
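A minimal follow-up check, based on the convention above that a non-zero result indicates success:

if res:
    print("Obtained keyboard control, handle =", handle.value)
else:
    print("CreateClaymoreKeyboard failed")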

PyCuda - How can I use functions written in Python in the kernel?

I want to parallelize my Python code and I'm trying to use PyCuda.
What I've seen so far is that you have to write a "Kernel" in C inside your Python code. This kernel is what gets parallelized. Am I right?
Example (doubling an array of random numbers, from https://documen.tician.de/pycuda/tutorial.html):
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
# Kernel:
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
The point is that my Python code uses classes and other features that are natural in Python but have no direct C equivalent (i.e., untranslatable to C).
Let me clarify: my code has 256 independent for-loops that I want to parallelize. These loops contain Python code that can't be translated to C.
How can I parallelize an actual Python code with PyCuda without translating my code to C?
You can't.
PyCUDA doesn't support device-side Python; all device code must be written in the CUDA C dialect.
Numba includes a direct Python compiler that allows an extremely limited subset of Python language features to be compiled and run directly on the GPU. This does not include access to any Python libraries such as numpy, scipy, etc.
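For comparison, here is roughly what the doublify example from the question looks like in Numba's restricted Python subset (a sketch, assuming a CUDA-capable GPU; a flat 16-element array is used so the flattened thread indexing is explicit):

import numpy as np
from numba import cuda

@cuda.jit
def doublify(a):
    # Flattened 2D thread index, mirroring the CUDA C version above
    idx = cuda.threadIdx.x + cuda.threadIdx.y * 4
    a[idx] *= 2

a = np.random.randn(16).astype(np.float32)
d_a = cuda.to_device(a)
doublify[1, (4, 4)](d_a)  # one block of 4x4 threads
print(d_a.copy_to_host())
print(a)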

How do you import a Python library within an R package using rPython?

The basic question is this: let's say I'm writing R functions that call Python via rPython, and I want to integrate this into a package. That's simple: it's irrelevant that the R function wraps around Python, and you proceed as usual, e.g.
# trivial example
# library(rPython)
add <- function(x, y) {
  python.assign("x", x)
  python.assign("y", y)
  python.exec("result = x + y")
  result <- python.get("result")
  return(result)
}
But what if the Python code inside the R functions requires users to import Python libraries first? e.g.
# python code, not R
import numpy as np
print(np.sin(np.deg2rad(90)))
# R function that calls Python via rPython
# *this function will not run without first executing `import numpy as np`
print_sin <- function(degree){
  python.assign("degree", degree)
  python.exec('result = np.sin(np.deg2rad(degree))')
  result <- python.get('result')
  return(result)
}
If you run this without importing the library numpy, you will get an error.
How do you import a Python library in an R package? How do you document it with roxygen2?
It appears the R standard is this:
# R function that calls Python via rPython
print_sin <- function(degree){
  python.assign("degree", degree)
  python.exec('import numpy as np')
  python.exec('result = np.sin(np.deg2rad(degree))')
  result <- python.get('result')
  return(result)
}
The drawback is that each time you call the R function, you re-import the entire Python library.
As @Spacedman and @DirkEddelbuettel suggest, you could add a .onLoad/.onAttach function to your package that calls python.exec to import the modules that will typically always be required by users of your package.
You could also test whether the module has already been imported before importing it, but (a) that gets you into a bit of an infinite-regress problem, because you need to import sys in order to perform the test, and (b) the answers to that question suggest that, at least in terms of performance, it shouldn't matter, e.g.
If you want to optimize by not importing things twice, save yourself the hassle because Python already takes care of this.
(although admittedly there is some quibbling elsewhere on that page about possible scenarios where there could be a performance cost).
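For completeness, the Python-side test mentioned in (a) is just a sys.modules lookup, something like the following (which you would send over with python.exec from R):

import sys

if 'numpy' not in sys.modules:
    import numpy as np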
But maybe your concern is stylistic rather than performance-oriented ...
