I'm working on a project that involves creating CUDA kernels in Python. Numba works quite well (what these guys have accomplished is quite incredible), and so does PyCUDA.
My problem is that I want to call a C device function from my Python-generated kernel. I couldn't find a way to accomplish this. Numba can call CFFI modules, but only in CPU code. In PyCUDA I can add my C device functions to the SourceModule, but I couldn't figure out how to include functions that already exist in another library.
Is there a way to accomplish this?
As far as I am aware, this isn't possible in either framework. Neither exposes the necessary toolchain controls for separate compilation, nor APIs to do runtime linking of device code.
Related
I would like to use my CPU's built-in instructions from within Numba-compiled functions, but am having trouble figuring out how to reference them. For example, I can confirm that my CPU has the popcnt instruction from the SSE4 instruction set using llvmlite.binding.get_host_cpu_features(), but I have no way of calling the instruction itself.
I need to be able to call these functions (instructions) from within other nopython compiled functions.
Ideally this would be done as closely to Python as possible, but in this case speed is more important than readability.
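For reference, here is a pure-Python baseline of what popcnt computes (the popcount name is made up; this is only useful for validating whatever intrinsic-backed version you end up with, not as a fast replacement for it):

```python
# Pure-Python baseline for what the SSE4 popcnt instruction computes:
# the number of set bits in an integer. Only useful for checking a
# faster intrinsic-backed implementation for correctness.
def popcount(x):
    n = 0
    while x:
        x &= x - 1  # clear the lowest set bit (Kernighan's trick)
        n += 1
    return n

print(popcount(0b101101))  # -> 4
```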
You can use Cython to call SSE intrinsics, but you cannot use Numba to do it. Code doing what you want via Cython is here: https://gist.github.com/aldro61/f604a3fa79b3dec5436a and here: https://gist.github.com/craffel/e470421958cad33df550
You can make a small assembly-language DLL and call it through ctypes, which in my experience has no overhead whatsoever when used from Numba nopython code. Alternatively, you can use instruction codes directly, as in this blog post on JIT in Python. The Piston JavaScript assembler might be used to obtain machine code for a small asm routine. Numba also allows writing small functions in LLVM IR, as described in this thread; of course, llvmlite might be used too.
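The ctypes call pattern looks roughly like this (a minimal sketch: libc's abs stands in for your assembly routine, and the DLL name you would actually load, e.g. my_asm.dll, is hypothetical):

```python
import ctypes

# On POSIX, CDLL(None) exposes symbols already loaded into the process,
# including libc. On Windows you would load your own DLL explicitly,
# e.g. ctypes.CDLL("my_asm.dll"). libc's abs stands in for an asm routine.
libc = ctypes.CDLL(None)

# Always declare argtypes/restype: ctypes defaults everything to int.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

print(libc.abs(-42))  # -> 42
```

A function wrapped this way can then be passed into Numba nopython code, since Numba understands ctypes function pointers.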
I know that TensorFlow is written with a C++ engine, but I haven't found any C++ source code in my installation directory (I installed via pip). When I inspect the Python code, I get the sense that the Python level is just a wrapper and the essence of the algorithms is not present there. For example, in tensorflow/python/ops/gradients.py, the gradients() function calls python_grad_func() to compute the gradients, which is a class method of DeFun.
My question is: are all the essential parts of TensorFlow written in C++, with the Python only serving as an API?
This is mostly correct, though there's a lot of sophisticated stuff implemented in Python. Instead of saying "algorithms" in C++, what I'd say is that the core dataflow execution engine and most of the ops (e.g., matmul, etc.) are in C++. A lot of the plumbing, as well as some functionality like defining gradients of functions, is in Python.
For more information and discussion about why it's this way, see this StackOverflow answer
I am developing a library for 3D and environmental audio in C++, which I wish to use from Python and a myriad of other languages. I wish it to work with Kivy on the iPhone down the line, so I want to use Cython instead of ctypes. To that end, I have already constructed and exposed a C-only api that uses only primitive types, i.e. no structs. I would ideally like to have only one implementation of the Python binding, and believe that on Mac/Linux this won't be a problem (I could be wrong, but that is off-topic for this question).
The problem is that it uses C++11 heavily, and consequently needs to be compiled with VC++ 2013. Python 2 is still typically compiled with VC++ 2008, and I believe Python 3 uses only VC++ 2010.
I am familiar with the rules for mixing runtimes: don't pass memory out and free it from the other side, never ever send handles back and forth, etc. I am following them. This means that I'm ctypes- and cffi-ready. All functions are appropriately extern "C" and the like.
Can I use this with Cython safely, and if so, how? My reading is giving me very mixed answers on whether it's safe to use the export library produced by VC++ 2013 with VC++ 2008, and Cython seems to have no built-in functionality for dynamic linking (which won't work on iOS anyway). The only thing I can think of that would work, besides linking with the import library, is to also bind the Windows DLL manipulation functions and use those from Cython, but this defeats the one-implementation goal pretty quickly.
I've already tried researching this myself, but there seems to be no good info, and simply trying it will be inconclusive for weeks; linking is no guarantee of functionality, after all. This is the kind of thing that can very easily seem to work without actually doing so; to that end, sources and links to further reading are much appreciated.
I have been browsing around for simple ways to program FFTs to run on my graphics card (a recent NVIDIA card supporting CUDA 3.something).
My current options are either to learn C, then the special C dialect for CUDA, or to use some Python CUDA functions. I'd rather not learn C yet, since I have only programmed in high-level languages.
I looked at PyCUDA and other ways to use my graphics card from Python, but I couldn't find any FFT library that could be used with Python code only.
Some libraries/projects seem to tackle similar problems (CUDAmat, Theano), but sadly I found no FFTs.
Does a function exist that does the same thing as numpy.fft.fft2(), using my graphics card?
EDIT: Bonus points for an open-source solution.
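Whatever GPU library you end up with, it helps to know exactly what numpy.fft.fft2 computes so you can validate results. A minimal pure-Python sketch of the 2-D DFT (the dft2 name is made up; it is O((MN)^2) and only useful as a correctness reference on tiny inputs):

```python
import cmath

# Naive 2-D discrete Fourier transform: the same quantity that
# numpy.fft.fft2 computes, written out directly from the definition.
# O((M*N)^2) -- only useful as a correctness check on tiny inputs.
def dft2(a):
    m, n = len(a), len(a[0])
    out = [[0j] * n for _ in range(m)]
    for u in range(m):
        for v in range(n):
            s = 0j
            for x in range(m):
                for y in range(n):
                    s += a[x][y] * cmath.exp(-2j * cmath.pi * (u * x / m + v * y / n))
            out[u][v] = s
    return out

# A unit impulse at (0, 0) transforms to all-ones.
print(dft2([[1, 0], [0, 0]]))
```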
There's PyFFT, which is open-source and based on Apple's (somewhat limited) implementation. Disclaimer: I work on PyFFT :)
Yes, ArrayFire has a 2-D FFT for Python.
Disclaimer: I work on ArrayFire.
A researcher has created a small simulation in MATLAB and we want to make it accessible to others. My plan is to take the simulation, clean up a few things and turn it into a set of functions. Then I plan to compile it into a C library and use SWIG to create a Python wrapper. At that point, I should be able to call the simulation from a small Django application. At least I hope so.
Do I have the right plan? Are there are any serious pitfalls that I'm not aware of at the moment?
One thing to remember is that the MATLAB compiler does not actually compile the MATLAB code into native machine instructions. It simply wraps it into a stand-alone executable or a library with its own runtime engine that runs it. You would be able to run your code without MATLAB installed, and you would be able to interface it with other languages, but it will still be interpreted MATLAB code, so there would be no speedup.
MATLAB Coder, on the other hand, is the tool that can generate C code from MATLAB. There are some limitations, though: not all MATLAB functions are supported for code generation, and there are things you cannot do, like change the type of a variable on the fly.
I remember that I was able to wrap a MATLAB simulation into a DLL file and then call it from a Delphi application. It worked really well.
I'd also try ctypes first.
1. Use the MATLAB compiler to compile the code into C.
2. Compile the C code into a DLL.
3. Use ctypes to load and call code from this DLL.
The hardest step is probably 1, but if you already know MATLAB and have used the MATLAB compiler, you should not have serious problems with it.
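Step 3 looks roughly like this (a sketch: libm's cos stands in for a function exported from the compiled simulation DLL, whose names here, libsimulation.so / simulation.dll, are made up):

```python
import ctypes
import ctypes.util

# Load the standard math library the same way you would load the
# compiled simulation, e.g. ctypes.CDLL("./libsimulation.so") or
# ctypes.CDLL("simulation.dll") on Windows.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)

# Declaring argtypes/restype matters: without it, ctypes passes and
# returns ints, silently corrupting double values.
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # -> 1.0
```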
Perhaps try ctypes instead of SWIG. If it has been included as a part of Python 2.5, then it must be good :-)