PyCuda - How can I use functions written in Python in the kernel?

I want to parallelize my Python code and I'm trying to use PyCuda.
What I have seen so far is that you have to write a "kernel" in C inside your Python code. This kernel is what gets parallelized. Am I right?
Example (doubling an array of random numbers, from https://documen.tician.de/pycuda/tutorial.html):
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
# Kernel:
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
The point is that my Python code uses classes and other features that are idiomatic Python and unsuitable for C (i.e. untranslatable to C).
Let me clarify: my code has 256 independent for-loops that I want to parallelize. These loops contain Python code that can't be translated to C.
How can I parallelize an actual Python code with PyCuda without translating my code to C?

You can't.
PyCUDA doesn't support device-side Python; all device code must be written in the CUDA C dialect.
Numba includes a direct Python compiler which allows an extremely limited subset of Python language features to be compiled and run directly on the GPU. This does not include access to any Python libraries such as numpy, scipy, etc.
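For illustration, here is a minimal sketch of what that looks like with Numba's CUDA support (an assumption on my part about the setup; it requires the numba package and a CUDA-capable GPU). Note that the kernel body itself must stay within the supported Python subset:
import numpy as np
from numba import cuda

@cuda.jit
def doublify(a):
    # each thread handles one element of the array
    i = cuda.grid(1)
    if i < a.size:
        a[i] *= 2.0

a = np.random.randn(16).astype(np.float32)
d_a = cuda.to_device(a)    # copy the array to the GPU
doublify[1, 32](d_a)       # launch 1 block of 32 threads
print(d_a.copy_to_host())  # copy the doubled array back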

Related

Julia's PyCall package generates Segmentation fault

I'm currently using PyCall to load a Python library for data compression based on LZ-77 into Julia. The python library is sweetsourcod and I have it installed in my home directory. Within that library, I am using the module lempel_ziv for some entropy measurements. I load the python module following PyCall's example. This is how I'm loading it into Julia:
using PyCall
sc = pyimport("sweetsourcod.lempel_ziv")
# PyObject <module 'sweetsourcod.lempel_ziv' from '/Users/danielribeiro/sweetsourcod/sweetsourcod/lempel_ziv.cpython-38-darwin.so'>
The use of this python library seems to be causing a segmentation fault within Julia; however, when I write the same code in python, the segmentation fault does not take place. The following Julia example triggers the segmentation fault
using PyCall

L = 1000000
nbins = [2*i for i = 1:2:15]
sc = pyimport("sweetsourcod.lempel_ziv")
# loop through all n
for n in nbins
    # loop through all configurations
    for i = 1:65
        # analogous to reading a configuration from scratch
        config = rand(0:255, L)
        # calculate entropy
        # 1.1300458785794484e6 --> cid of random sequence of same L
        entropy = sc.lempel_ziv_complexity(config, "lz77")[2] / 1.1300458785794484e6
    end
end
the line entropy = sc.lempel_ziv_complexity(config, "lz77")[2] / 1.1300458785794484e6 is what triggers the segfault. This is the minimal working example I was able to write in Julia to reproduce the segfault. The function lempel_ziv_complexity() compresses the array and returns a tuple with the LZ factors and the approximate size of the compressed file.
When I write identical code in Python, the segfault is not triggered. This is the working example in Python
import numpy as np
from sweetsourcod.lempel_ziv import lempel_ziv_complexity

L = 1000000
nbins = [2*i for i in range(1, 15, 2)]
for n in nbins:
    for i in range(1, 65, 1):
        config = np.random.randint(0, 256, L)
        entropy = lempel_ziv_complexity(config, "lz77")[1] / 1.1300458785794484e6
I suspect the segfault has to do with PyCall's internals, with which I am unfamiliar. I have also tried precompiling sweetsourcod into a module as suggested in PyCall's README. Does anyone have any suggestions on how to address this issue?
Thank you in advance!

Python: what is the difference between a package and a compiler?

I was reading the wiki page for Numba, and it says Numba is a "compiler". But then later on, it says that to use Numba, you import it like a package. I later looked up how to use Numba, and indeed, you just pip install it.
So now I am confused. I thought Numba was a compiler? But it seems to be used just like any other package, like numpy or pandas? What's the difference?
A compiler is a program that inputs something in human-readable form (usually a program in a specified language) and outputs a functionally equivalent stream in another, more machine-digestible form. Just as with any other transformation, it's equally viable as a command-line invocation or a function call.
As long as it's wrapped properly in a package for general use, it's perfectly reasonable to deliver a compiler as a Python package.
Does that clear up the difficulty?
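Python's own built-in compile() makes this concrete: it is a compiler exposed as an ordinary function call, turning source text into an executable form:
# compile() transforms source text into a code object; eval() then runs it
code = compile("3 + 4 * 2", "<expr>", "eval")
print(eval(code))  # 11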
From what I have read in the Numba documentation, it's a package that you import into your project and then use the Numba decorator to indicate parts of your code that you would like to have compiled just in time (JIT) in order to optimize them. Like in the following example:
from numba import jit
import random

@jit(nopython=True)
def monte_carlo_pi(nsamples):
    acc = 0
    for i in range(nsamples):
        x = random.random()
        y = random.random()
        if (x ** 2 + y ** 2) < 1.0:
            acc += 1
    return 4.0 * acc / nsamples
When monte_carlo_pi is called, Numba compiles it to optimized machine code on the fly, so there isn't a separate compilation step that you have to run yourself.
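To see the just-in-time behaviour, one can time successive calls (a minimal sketch; the exact numbers depend on your machine): the first call pays the compilation cost, later calls reuse the cached machine code.
import time

start = time.perf_counter()
monte_carlo_pi(1_000_000)  # first call triggers JIT compilation
print("first call:", time.perf_counter() - start)

start = time.perf_counter()
monte_carlo_pi(1_000_000)  # second call runs the compiled code
print("second call:", time.perf_counter() - start)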

Is there a built-in way to use inline C code in Python?

Even though numba and cython (and especially cython.inline) exist, in some cases it would be interesting to have inline C code in Python.
Is there a built-in way (in Python standard library) to have inline C code?
PS: scipy.weave used to provide this, but it's Python 2 only.
Directly in the Python standard library, probably not. But it's possible to have something very close to inline C in Python with the cffi module (pip install cffi).
Here is an example, inspired by this article and this question, showing how to implement a factorial function in Python + "inline" C:
from cffi import FFI

ffi = FFI()
ffi.set_source("_test", """
long factorial(int n) {
    long r = n;
    while(n > 1) {
        n -= 1;
        r *= n;
    }
    return r;
}
""")
ffi.cdef("""long factorial(int);""")
ffi.compile()

from _test import lib  # import the compiled library
print(lib.factorial(10))  # 3628800
Notes:
- ffi.set_source(...) defines the actual C source code
- ffi.cdef(...) is the equivalent of the .h header file
- you can of course add some cleaning code after, if you don't need the compiled library at the end (however, cython.inline does the same and the compiled .pyd files are not cleaned by default, see here); a sketch of such cleanup follows below
- this quick inline use is particularly useful during a prototyping / development phase. Once everything is ready, you can separate the build (which you do only once) from the rest of the code, which imports the pre-compiled library
It seems too good to be true, but it seems to work!
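For the cleanup mentioned in the notes, something along these lines could work (a sketch, assuming ffi.compile()'s default output names in the current directory):
import glob
import os

# remove the intermediate build artifacts produced by ffi.compile()
for pattern in ("_test.c", "_test.o", "_test.*.so", "_test.*.pyd"):
    for path in glob.glob(pattern):
        os.remove(path)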

How to use cython compiler in python

In this link, http://earthpy.org/speed.html, I found the following:
%%cython
import numpy as np

def useless_cython(year):
    # define types of variables
    cdef int i, j, n
    cdef double a_cum
    from netCDF4 import Dataset
    f = Dataset('air.sig995.'+year+'.nc')
    a = f.variables['air'][:]
    a_cum = 0.
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            for n in range(a.shape[2]):
                # here we have to convert numpy value to simple float
                a_cum = a_cum + float(a[i,j,n])
    # since a_cum is not a numpy variable anymore,
    # we introduce new variable d in order to save
    # data to the file easily
    d = np.array(a_cum)
    d.tofile(year+'.bin')
    print(year)
    return d
It seems to be as easy as just writing %%cython above the function. However, this doesn't work for me -> "Statement seems to have no effect", says my IDE.
After a bit of research I found that the %% syntax comes from IPython, which I did also install (as well as Cython). It still doesn't work. I am using Python 3.6.
Any ideas?
Once you are in the IPython interpreter, you have to load the extension before using it. This can be done with the %load_ext statement, so in your case:
%load_ext cython
These two tools are pretty well documented; if you have not seen them yet, take a look at the relevant parts of the documentation on the Cython side and on the IPython side.
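Put together, a minimal session could look like this (a sketch; %%cython must be the very first line of its own cell, and Cython must be installed):
In [1]: %load_ext cython

In [2]: %%cython
   ...: def csum(int n):
   ...:     # typed C variables let Cython generate a plain C loop
   ...:     cdef int i
   ...:     cdef long total = 0
   ...:     for i in range(n):
   ...:         total += i
   ...:     return total

In [3]: csum(100)
Out[3]: 4950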

Calling __host__ functions in PyCUDA

Is it possible to call __host__ functions in pyCUDA like you can __global__ functions? I noticed in the documentation that pycuda.driver.Function creates a handle to a __global__ function. __device__ functions can be called from a __global__ function, but __host__ code cannot. I'm aware that using a __host__ function pretty much defeats the purpose of pyCUDA, but there are some already made functions that I'd like to import and call as a proof of concept.
As a note, whenever I try to import the __host__ function, I get:
pycuda._driver.LogicError: cuModuleGetFunction failed: named symbol not found
No, it is not possible.
This isn't a limitation of PyCUDA, per se, but of CUDA itself. The __host__ decorator just decays away to plain host code, and the CUDA APIs don't and cannot handle it in the same way that device code can be handled (note that the APIs also don't handle __device__, which is the true device-side equivalent of __host__).
If you want to call/use __host__ functions from Python, you will need to use one of the standard C++/Python interoperability mechanisms, like ctypes, SWIG, Boost.Python, etc.
EDIT:
Since this answer was written five years ago, CUDA has added the ability to run host functions in CUDA streams via cuLaunchHostFunc (driver API) or cudaLaunchHostFunc. Unfortunately, at the time of this edit (June 2022), PyCUDA doesn't expose this functionality, so it still isn't possible in PyCUDA and the core message of the original answer is unchanged.
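That runtime entry point can still be reached from Python via ctypes, bypassing PyCUDA entirely (a rough sketch, assuming libcudart.so is on the loader path; the callback and default-stream handling here are illustrative only):
from ctypes import CDLL, CFUNCTYPE, c_void_p

cudart = CDLL("libcudart.so")

# cudaHostFn_t is void (*)(void *userData)
@CFUNCTYPE(None, c_void_p)
def host_fn(user_data):
    print("host function ran on the stream")

# enqueue the host function on the default stream (0), then wait for it
cudart.cudaLaunchHostFunc(c_void_p(0), host_fn, None)
cudart.cudaDeviceSynchronize()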
Below is sample code that calls CUDA library APIs from a pyCUDA program. The code generates uniformly distributed random numbers with cuRAND and may serve as a reference for including already-made functions (as the poster asks), such as the CUDA library APIs, in pyCUDA code.
import numpy as np
from ctypes import CDLL, byref, cast, c_ulonglong, c_float, POINTER
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit

# --- Load the cuRAND shared library
curand = CDLL("/usr/local/cuda/lib64/libcurand.so")

# --- Number of elements to generate
N = 10

# --- cuRAND enums
CURAND_RNG_PSEUDO_DEFAULT = 100

# --- Query the cuRAND version
i = c_ulonglong()
curand.curandGetVersion(byref(i))
print("curand version: ", i.value)

# --- Allocate space for generation
d_x = gpuarray.empty(N, dtype=np.float32)

# --- Create random number generator
gen = c_ulonglong()
curand.curandCreateGenerator(byref(gen), CURAND_RNG_PSEUDO_DEFAULT)

# --- Generate random numbers directly into the GPU array
curand.curandGenerateUniform(gen, cast(d_x.ptr, POINTER(c_float)), N)
print(d_x)

# --- Clean up the generator
curand.curandDestroyGenerator(gen)
