Getting started with PyOpenCL - python

I have recently discovered the power of GP-GPU (general purpose graphics processing unit) and want to take advantage of it to perform 'heavy' scientific and math calculations (that otherwise require big CPU clusters) on a single machine.
I know that there are several interfaces to operate on a GPU, the most prominent of those being CUDA and OpenCL. The latter has the advantage against CUDA to run on most graphics cards (NVIDIA, AMD, Intel) rather than NVIDA cards only. In my case, I have an ordinary Intel 4000 GPU that seems to be well cooperating with OpenCL.
Now, I need to learn how to operate with PyOpenCL to get it on further! So here comes the question:
How can I get started with PyOpenCL? What are the prerequisites? Do I really need to be experienced in Python and/or OpenCL?
My background is in fortran and as a matter of fact I need to translate and parallelize a lengthy fortran code to python (or pyopencl) that mainly deals with solving PDEs and diagonalizing matrices.
I have read the two relevant websites http://enja.org/2011/02/22/adventures-in-pyopencl-part-1-getting-started-with-python/ and http://documen.tician.de/pyopencl/ but they are not really helpful for newbies (ie, dummies).
I just don't know what to begin with. I do not aspire on becoming an expert on the field, just to get to know how one can parallelize simple math and linear algebra on pyopencl.
Any advice and help is highly welcome!

It seems you are looking for the fastest and most effective path to learn PyOpenCL. You do not need to know OpenCL (the hard part) at the start, but it will be helpful to know Python when you begin.
For learning Python syntax quickly, I recommend Codecademy's Python track: http://www.codecademy.com/tracks/python
Then, the Udacity parallel programming course is a great place to start with GPGPU (even though the course is taught in CUDA). https://www.udacity.com/course/cs344 This course will teach you fundamental GPGPU concepts very quickly. You will not need a NVIDIA GPU to participate, because all the course assessments are done online.
After (or during) the Udacity course, I recommend you read, run, and customize PyOpenCL code examples: https://github.com/inducer/pyopencl/tree/master/examples

Irrespective of the language of adoption for GPGPU computing such as Java,C/C++, Python, I would recommend you first get started with the basics of GPGPU computing and OpenCL.
You can use the following resources all are C/C++ oriented but this should you give enough knowledge about OpenCL, GPGPU hardware to get you started.
AMD OpenCL University Tool kit
Hetergeneous Computing with OpenCL Book 2nd Edition
NVIDIA OpenCL pages is another Excellent resorce
Streamcomputing.eu has nice openCL starter articles.
Intel OpenCL SDK tutorial
PyOpenCL specific
OpenCL in Action: How to Accelerate Graphics and Computation
has a chapter on PyOpenCL
OpenCL Programming Guide has chapter PyOpenCL
Both the books contain OpenCL 1.1 implementation but it should be good starting point for you.

As someone new to GPU programming I found the relevant articles you mentioned fairly straightforward though I found the sample code ran perfectly from the command line but not in Eclipse with Anaconda. I think this may be because the Eclipse pyopencl from anaconda is different from the command line version, and I have yet to work out how to resolve this.
For learning python there are a large number of resources online including free ebooks.
https://wiki.python.org/moin/BeginnersGuide
http://codecondo.com/10-ways-to-learn-python/
should be good starters. If you use Eclipse you should install pydev. In any case install Anaconda https://docs.continuum.io/anaconda/install as this will save you a lot of hassle.
I estimate a week or so to get to the level of proficiency you need in Python as long as you picj a few simple mini projects. You may also find that with numpy and scipy and possibly ipython notebook you may not need to delve into GPU programming
These links may help you avoid GPU programming or at least delay having to learn it. Be aware that thr cost of switching between cores means you have ot assign a singificant amount of work to each core
http://blog.dominodatalab.com/simple-parallelization/
https://pythonhosted.org/joblib/parallel.html
Generally I find it more efficient, if less fun, to learn only one thing at a time.
I hope this helps.

you can see this :)
Introduction 1
https://github.com/fr33dz/PyOpenCL_Tuto/blob/master/Intro_PyopenCL.ipynb
Introduction 2
https://github.com/fr33dz/PyOpenCL_Tuto/blob/master/2_PyOpenCL.ipynb
Matrix Multiplication
https://github.com/fr33dz/PyOpenCL_Tuto/blob/master/PyOpenCL_produit_2_matrices.ipynb

Related

Difference between pyOpenCL and opencl4py

Today I stumbled over a post in stackoverflow (see also here):
We are developing opencl4py, higher level bindings. This project uses CFFI, so it works on Pypy.
The major issue we encountered with pyopencl is that 'import pyopencl' does OpenCL initialization and takes the whole virtual memory in case of NVIDIA driver, preventing from correct forking and effectively disabling multiprocessing (yes, we claim that using pyopencl disables multiprocessing at least with NVIDIA). opencl4py uses lazy OpenCL initialization, resolving this "import hell".
Later, it gained some nice features like super easy binary program caching, etc. Unfortunately, the documentation is somewhat brief. The best way to learn how it works is go through the tests.
As there is also pyOpenCL, I was woundering what the difference between these two packages is. Does anybody know where I can find an overview on the pro's and con's for these both packages?
Edit: To include benshope's comment as I would also be interested: what does "disable[s] multiprocessing" mean? Like, it can't run kernels on several devices at one time?
As far as I know, there is no such overview. I'll try to list some key points:
pyOpenCL is a mature project with a relatively large user base. There are tutorials, FAQ, etc. opencl4py appeared on 03/2014; no tutorials, FAQ and so on - only unit tests and docstrings.
pyOpenCL is a native cPython extension, whereas opencl4py uses cffi, so that it works on PyPy (pyOpenCL does NOT) and it does not require to be recompiled each time cPython changes version.
PyOpenCL has extras, such as random number generator and OpenGL interoperability.
opencl4py is extensively tested in Samsung production real world scenarios and is being actively developed.
what does "disable[s] multiprocessing" mean? Like, it can't run kernels on several devices at one time?
Of course, it can, I was trying to say that after importing pyopencl, os.fork() or multiprocessing.Process() lead to crashes inside NVIDIA OpenCL userspace library. It is always a bad idea of doing work during import.

Multidimensional FFT in python with CUDA or OpenCL

I have been browsing around for simple ways to program FFTs to work on my graphic card (Which is a recent NVIDIA supporting CUDA 3.something).
My current option is either to learn C, then that special C version for CUDA, or use some python CUDA functions. I'd rather not learn C yet, since I only programmed in high-level languages.
I looked at pyCUDA and other ways to use my graphic card in python, but I couldn't find any FFT library which could be use with python code only.
Some libraries/project seem to tackle similar project (CUDAmat, Theano), but sadly I found no FFTs.
Does a function exist which could do the same thing as numpy.fft.fft2(), using my graphic card?
EDIT: Bonus point for an open source solution.
There's PyFFT, which is open-source and based on Apple's (somewhat limited) implementation. Disclaimer: I work on PyFFT :)
Yes, ArrayFire has a 2-D FFT for Python.
Disclaimer: I work on ArrayFire.

PyGPU Information/Documentation

I'm trying to quickly learn how to write some programs for GPUs using the PyGPU library I found. Initially, I thought this was going to be a very easy task, but I couldn't find any documentation or tutorials for this. I do not have any knowledge of C or any of the current frameworks provided by NVIDIA or ATI, so can anyone suggest a good jumping-off point?
PyGPU doesn't appear to have been updated since 2007. A more mature and well-supported Python interface to the GPU is PyCUDA.
If you are familiar with C++, I would also recommend Thrust as a well-supported, mature high-level interface to GPU programming. The Thrust Quick Start Guide is a great place to start.

Porting Python to an embedded system

I am working with an ARM Cortex M3 on which I need to port Python (without operating system). What would be my best approach? I just need the core Python and basic I/O.
Golly, that's kind of a tall order. There are so many services of a kernel that Python depends upon, and that you'd have to provide yourself. I'd think you'd be far better off looking for a lightweight OS -- maybe Minix 3? -- to put on your embedded processor.
Failing that, I'd be horribly tempted to think about hand-translating to C and building the essentials on that.
You should definitely look at eLua:
http://www.eluaproject.net
"Embedded power, driven by Lua
Quickly prototype and develop embedded software applications with the power of Lua and run them on a wide range of microcontroller architectures"
There are a few projects that have attempted to port Python to the situation you mention, take a look at python-on-a-chip, PyMite or tinypy. These are aimed at lower power microcontrollers without an OS and tend to focus on slightly older versions of the Python language and reduced library support.
One possible approach is to build your own stack machine in software to interpret and execute Python byte code directly. Certainly not a porting job and quite labor-intensive to implement, but a self-contained Python byte code stack processor built for your embedded system gets you around needing an operating system.
Another approach is writing your own low level executive (one step below a general purpose OS) that contains the bare minimum in services that a core Python interpreter port requires. I am not certain if this is more or less labor intensive than building a stack processor.
I am not recommending either of these approaches - personally, I like Charlie Martin's Minix 3 approach best since it is a balanced requirements compromise. On the other hand, what I suggest might be interesting if your project absolutely requires Python without an operating system and if the project has an excellent time and money budget.
Update 5 Mar 2012: Given a strict adherence to your Python/No OS requirements, another possibility of a path to a solution may lie in using an OS-less Java VM (e.g., jnode, currently in beta) and use Jython to create Java byte code from Python. Certainly not an ideal off-the-shelf solution, and it does seem to meet an OS-less Python requirement.
Compile it to c :)
http://shed-skin.blogspot.com/
fyi I just ported CPython 2.7x to non-POSIX OS. That was easy.
You need write pyconfig.h in right way, remove most of unused modules. Disable unused features.
Then fix compile, link errors. Then it just works after fixing some simple problems on run.
If You have no some POSIX header, write one by yourself. Implement all POSIX functions, that needed, such as file i/o.
Took 2-3 weeks in my case. Although I have heavily customized Python core. Unfortunately cannot opensource it :(.
After that I think Python can be ported easily to any platform, that has enough RAM.

Python CPU and OS

Wouldn't it be possible to have an OS entirely in Python if the Python VM itself is build into a hardware? Something like the good old Lisp Machine?
Suppose I have a cpu that is the hardware implementation of the python virtual machine, then all programs written in python would perform with the speed of assembly, won't it (but Python is mostly interpreted but we can compile it)?
If we have such a 'python-microprocessor', what about the memory and other subsystems? Would it be compatible with the current memory.
Is there any information on the registers and the Python VM architecture, something similar to what we have for 8086?
Wouldn't it be possible to have an OS
entirely in Python if the Python VM
itself is build into a hardware?
Something like the good old Lisp
Machine?
Yes, theoretically it would be possible.
Suppose I have a cpu that is the
hardware implementation of the python
virtual machine, then all programs
written in python would perform with
the speed of assembly, won't it (but
Python is mostly interpreted but we
can compile it)?
Python doesn't have a speed, it's a language. The speed of the interpreter (in this case the processor) can be tested. But just as it's difficult to compare the performance of a RISC and a CISC processor, comparing Assembly with Python will be difficult too.
If we have such a
'python-microprocessor', what about
the memory and other subsystems? Would
it be compatible with the current
memory.
The python microprocessor would have to do the memory management (and thus the garbage collection). Since that's normally done by the interpreter, now the microprocessor has to do it.
Is there any information on the
registers and the Python VM
architecture, something similar to
what we have for 8086?
Normally you don't access the memory directly in Python, so the registers shouldn't be relevant here.
Similar things were tried for Java, but none really took the world by storm.
Yeah, it might be possible, but designing new hardware is expensive. Would the return on investment justify building such a toy? I'd guess not, otherwise someone would have tried it by now. :)
Suppose I have a cpu that is the hardware implementation of the python virtual machine, then all programs written in python would perform with the speed of assembly, won't it (but Python is mostly interpreted but we can compile it)?
Yes it would be assembly speed. See this link for a comparison with an avr microcontroller assembly code. http://pycpu.wordpress.com/code-examples/speed-pycpu-vs-8bit-avr/.
It is a hardware implemntation of a cpu that can do very very limited python bytecode. But enought for ifs conditions and while loops with simple integers.
In the 70ties such ideas were quite popular. The idea was to close the semantic gap between compilers/virtual machines and instruction set architectures, and thereby bring programming languages and hardware closer together. However, when Patterson and Ditzel published The Case for the Reduced Instruction Set Computer (PDF, 672KB) and after the success of RISC and the microprocessor, the idea of closing the semantic gap was basically dead.
Now, with ever increasing transistor counts the idea may become interesting again. But, as others already noted, designing chips is costly. You need a very good reason to sink so much money. But it is definitely possible. IBM and Azul have shown this with their massively parallel Java Chips.
I guess you should call Google and convince them that they urgently need a Python processor. ;-)
New operating systems are interesting and cool, and basing one off of python would be cool. Then again, linux is so good and has so much development for it already. It would have to be the "right time".

Categories