PyCUDA/CUDA: Causes of non-deterministic launch failures? - python

Anyone following CUDA will probably have seen a few of my queries regarding a project I'm involved in, but for those who haven't I'll summarize. (Sorry for the long question in advance)
Three kernels: one generates a data set based on some input variables (it deals with bit combinations, so it can grow exponentially), another solves these generated linear systems, and a reduction kernel gets the final result out. These three kernels are run over and over again as part of an optimisation algorithm for a particular system.
On my dev machine (GeForce 9800GT, running under CUDA 4.0) this works perfectly, all the time, no matter what I throw at it (up to a computational limit given the exponential growth mentioned above). On a test machine (4x Tesla S1070, only one used, under CUDA 3.1), the exact same code (Python base, PyCUDA interface to the CUDA kernels) produces correct results for 'small' cases, but in mid-range cases the solving stage fails on random iterations.
Previous problems I've had with this code were down to the numeric instability of the problem and were deterministic in nature (i.e. failing at exactly the same stage every time), but this one is frankly pissing me off, as it will fail whenever it wants to.
As such, I don't have a reliable way of breaking the CUDA code out of the Python framework and doing proper debugging, and PyCUDA's debugger support is questionable, to say the least.
I've checked the usual things, like checking free memory on the device before each kernel invocation, and occupancy calculations say the grid and block allocations are fine. I'm not doing anything crazy or 4.0-specific, I'm freeing everything I allocate on the device at each iteration, and I've fixed all the data types as floats.
TL;DR: has anyone come across any gotchas in CUDA 3.1 that I haven't seen in the release notes, or any issues with PyCUDA's autoinit memory management that would cause intermittent launch failures on repeated invocations?

Have you tried:
cuda-memcheck python yourapp.py
You likely have an out of bounds memory access.
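If cuda-memcheck confirms an out-of-bounds access but the failing iteration keeps moving around, one thing that sometimes helps is forcing a synchronization after every launch, so the asynchronous error is reported at the kernel that caused it rather than at some later call. A minimal sketch, assuming a hypothetical kernels.cu file and kernel name:

import pycuda.autoinit            # creates a context on the first available device
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule(open("kernels.cu").read())   # hypothetical kernel source file
solve = mod.get_function("solve_systems")       # hypothetical kernel name

def checked_launch(kernel, *args, grid=(1, 1), block=(256, 1, 1)):
    kernel(*args, grid=grid, block=block)
    # Launches are asynchronous; synchronizing here makes a bad launch
    # surface at this call instead of at a later, unrelated API call.
    drv.Context.synchronize()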

You can use the NVIDIA CUDA profiler to see what gets executed before the failure.

Related

Numba bytecode generation for generic x64 processors? Rather than a SLOW first run compiling the @njit(cache=True) functions

I have a pretty large project converted to Numba, and after run #1 with @nb.njit(cache=True, parallel=True, nogil=True), well, it's slow on run #1 (around 15 seconds vs. 0.2-1 seconds after compiling). I realize it's compiling bytecode optimized for the specific PC I'm running it on, but since the code is distributed to a large audience, I don't want it to take forever compiling on the first run after we deploy our model. What is not covered in the documentation is a "generic x64" cache=True method. I don't care if the code is a little slower on a PC that doesn't have my specific processor; I only care that the initial and subsequent runtimes are quick, and I'd prefer that they don't differ by a huge margin if I distribute a cache file for the @njit functions at deployment.
Does anyone know if such a "generic" x64 implementation is possible using Numba? Or are we stuck with a slow run #1 and fast ones thereafter?
Please comment if you want more details; basically it's a function of around 50 lines of code that gets JIT-compiled via Numba and afterwards runs quite fast in parallel with no GIL. I'm willing to give up some extreme optimization if the code can work in a generic form across multiple processors, since where I work the PCs vary quite a bit in how advanced they are.
I looked briefly at AOT (ahead-of-time) compilation of Numba functions, but these functions have so many variables being altered that I think it would take me weeks to decorate them properly so they compile without a Numba dependency. I really don't have the time for AOT; it would make more sense to just rewrite the whole algorithm in Cython, and that's more like C/C++ and more time-consuming than I want to devote to this project. Unfortunately there is not (to my knowledge) a Numba -> Cython compiler project out there already. Maybe there will be one in the future (which would be outstanding), but I don't know of such a project currently.
Unfortunately, you mainly listed all the currently available options. Numba functions can be cached, and the signature can be specified so as to perform an eager compilation (compilation at the time of the function definition) instead of a lazy one (compilation during the first execution). Note that the cache=True flag is only meant to skip compilation when it has already been done on the same platform before, not to share the compiled code between multiple machines. AFAIK, the internal JIT used by Numba (llvmlite) does not support that. In fact, that is exactly what AOT compilation is for. That being said, AOT compilation requires the signatures to be provided (this is mandatory whatever approach/tool is used, as long as the function is compiled ahead of time) and it has quite strong limitations (e.g. currently there is no support for parallel code or fastmath). Keep in mind that Numba's main use case is just-in-time compilation, not ahead-of-time compilation.
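For completeness, a minimal sketch of eager compilation with an explicit signature plus cache=True (the function and signature here are made up, not the OP's code). Note that the on-disk cache entry is still tied to the machine and environment it was built on:

import numpy as np
from numba import njit

@njit("float64(float64[:])", cache=True, nogil=True)   # compiled at definition time
def total(x):
    s = 0.0
    for i in range(x.shape[0]):
        s += x[i]
    return s

total(np.ones(10))   # no JIT pause here beyond loading the cached machine code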
Regarding your use case, using Cython appears to make much more sense: the functions are pre-compiled once for some generic platform, and the compiled binaries can be shipped directly to users without the need for recompilation on the target machine.
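As a rough illustration of that route, a minimal setup.py (module names hypothetical) that builds a Cython extension with the compiler's default, portable x86-64 code generation, i.e. without passing anything like -march=native:

from setuptools import setup
from Cython.Build import cythonize

setup(
    name="myalgo",   # hypothetical package name
    ext_modules=cythonize("myalgo.pyx",
                          compiler_directives={"language_level": "3"}),
)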
I don't care if the code is a little slower on a PC that doesn't have my specific processor.
Well, regarding your code, a "generic" x86-64 build can be much slower. The main reason lies in the use of SIMD instructions. All x86-64 processors support the SSE2 instruction set, which provides basic 128-bit SIMD registers operating on integers and floating-point numbers. For about a decade, x86-64 processors have also supported the 256-bit AVX instruction set, which significantly speeds up floating-point computations. For at least 7 years, almost all mainstream x86-64 processors have supported the AVX2 instruction set, which mainly speeds up integer computations (although it also improves floating-point code thanks to new features). For nearly a decade, the FMA instruction set has been able to speed up code using fused multiply-adds by a factor of 2. Recent Intel processors support the 512-bit AVX-512 instruction set, which not only doubles the number of items that can be processed per instruction but also adds many useful features. In the end, SIMD-friendly code can be up to an order of magnitude faster with the newer instruction sets than with the baseline "generic" SSE2 instruction set. Compilers (e.g. GCC, Clang, ICC) are meant to generate portable code by default and thus only use SSE2 by default. Note that NumPy already uses such "new" features to speed up many functions considerably (see sorts, argmin/argmax, log/exp, etc.).
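If you want to see which of these instruction sets a given target machine actually reports, a quick sketch (Linux only, reading /proc/cpuinfo) is:

wanted = {"sse2", "avx", "avx2", "fma", "avx512f"}
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            # print the intersection of the flags we care about and the CPU's flags
            print(sorted(wanted & set(line.split(":", 1)[1].split())))
            break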

Spyder does not release memory when re-running cells

Spyder's memory management is driving me crazy lately. I need to load large datasets (up to 10 GB) and preprocess them. Later on I perform some calculations/modelling (using sklearn) and make plots from them. Spyder's cell functionality is perfect for this, since it allows me to run the same calculations several times (with different parameters) without repeating the time-consuming preprocessing steps. However, I'm running into memory issues when repeatedly running cells, for several reasons:
Re-running the same cell several times increases memory consumption. I don't understand this, since I'm not introducing new variables and previous (global) variables should just be overwritten.
Encapsulating variables in a function helps, but not to the degree one would expect. I have a strong feeling that the memory of local variables is often not released correctly (this also holds when returning values with .copy() to avoid references to local variables).
Large matplotlib objects do not get recycled properly. Running gc.collect() can help sometimes, but does not always clear all the memory used by plots.
When using the 'Clear all variables' button from the IPython console, often only part of the memory gets released, and several GB might still remain occupied.
Running %reset from the IPython console works better, but does not always clear the memory completely either (even when running import gc; gc.collect() afterwards).
The only thing that helps for sure is restarting the kernel. However, I don't like doing this since it removes all the output from the console.
Advice on any of the above points is appreciated, as well as some elaboration on the memory management of Spyder. I'm using multi-threading in my code and have a suspicion that some of the issues are related to that, even though I was not able to pinpoint the problem.
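Not an answer to why Spyder holds on to the memory, but a cleanup cell along these lines (the variable names are placeholders for your own large globals) is a common workaround between runs:

import gc
import matplotlib.pyplot as plt

plt.close("all")                          # release figure managers kept alive by the backend
for name in ("raw_data", "X", "model"):   # hypothetical large globals from earlier cells
    globals().pop(name, None)
gc.collect()                              # collect whatever just became unreachable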

Can time.clock() be heavily affected by the state of the system?

This is a rather general question:
I am having issues that the same operation measured by time.clock() takes longer now than it used to.
While I had some very similar measurements:
1954 s
1948 s
1948 s
one somewhat different measurement:
1999 s
and another even more different:
2207 s
It still seemed more or less OK, but for another one I got:
2782 s
And now that I am repeating the measurements, it seems to get slower and slower.
I am not summing over the measurements after rounding or doing other weird manipulations.
Do you have any ideas whether this could be affected by how busy the server is, by the clock speed, or by other variable parameters? I was hoping that using time.clock() instead of time.time() would mostly sort these out...
The OS is Ubuntu 18.04.1 LTS.
The operations are run in separate screen sessions.
The operations do not involve hard-disk access.
The operations are mostly numpy operations that are not distributed. So this is actually mainly C code being executed.
EDIT: This might be relevant: the measurements from time.time() and time.clock() are very similar in all of these cases; time.time() measurements are always just slightly longer than time.clock(). So unless I have missed something, whatever the cause is, it has almost exactly the same effect on time.clock() as on time.time().
EDIT: I do not think that my question has been answered. Another reason I could think of is that garbage collection contributes to CPU usage and is done more frequently when the RAM is full or going to be full.
Mainly, I am looking for an alternative measure that gives the same number for the same operations, where 'operations' means my algorithm executed from the same start state. Is there a simple way to count FLOPs or something similar?
The issue seems to be related to Python and Ubuntu.
Try the following:
Check that you have the latest stable build of the Python version you're using (Link 1)
Check the process list, and see which CPU core your Python executable is running on
Check the thread priority state on the CPU (Link 2, Link 3)
Note:
Timings may vary due to process switching, threading, other OS resource management, and the application code being executed (this cannot be controlled)
Suggestions:
It could be because of your system's build; try running your code on another machine or in a virtual machine.
Read Up:
Link 4
Link 5
Good Luck.
~ Dean Van Geunen
As a result of repeatedly running the same algorithm at different 'system states', I would summarize the answer to the question as follows:
Yes, time.clock() can be heavily affected by the state of the system.
Of course, this holds all the more for time.time().
The general reasons could be that:
The same Python code does not always result in the same commands being sent to the CPU; that is, the commands depend not only on the code and the start state, but also on the system state (e.g. garbage collection)
The system might interfere with the commands sent from Python, resulting in additional CPU usage (e.g. from core switching) that is still counted by time.clock()
The divergence can be very large, in my case around 50%.
It is not clear which of these are the actual reasons, nor how much each of them contributes to the problem.
It remains to be tested whether timeit helps with some or all of the above points. However, timeit is meant for benchmarking and might not be recommended for use during normal processing: it turns off garbage collection and does not give access to the return values of the timed function.
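As a small sketch of that caveat (run_algorithm is a stand-in for the real workload, not code from the question): timeit can be told to keep garbage collection enabled via its setup string, and time.process_time() measures CPU time only, which at least excludes time spent waiting on a busy machine:

import time
import timeit

def run_algorithm():                 # stand-in for the measured operation
    return sum(i * i for i in range(10**6))

# timeit disables gc during timing by default; re-enable it in setup to measure
# under the same conditions as normal processing.
wall = timeit.timeit("run_algorithm()",
                     setup="import gc; gc.enable(); from __main__ import run_algorithm",
                     number=1)

t0 = time.process_time()
run_algorithm()
cpu = time.process_time() - t0       # CPU seconds used by this process only
print(wall, cpu)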

Memory error when running k_cliques_community

I ran k_clique_communities on my network data and received a memory error after a long response time. The code works perfectly fine for my other data, but not for this one.
from networkx.algorithms.community import k_clique_communities

c = list(k_clique_communities(G_fb, 3))
list(c[0])
Here is a snapshot of the error traceback
I tried running your code on my system with 16 GB RAM and an i7-7700HQ; the kernel died after returning a memory error. I think it's because computing the k-cliques of size 3 takes a lot of computational power/memory, since you have quite a large number of nodes and edges. I think you need to look into other ways to find k-cliques, like GraphX - Apache Spark.
You probably have a 32-bit installation of Python. 32-bit installations are limited to 2 GB of memory regardless of how much system memory you have. It's inconvenient, but uninstalling Python and installing a 64-bit build works, and it's the only solution I've ever seen for this.
As stated in the networkx documentation:
To obtain a list of all maximal cliques, use list(find_cliques(G)). However, be aware that in the worst-case, the length of this list can be exponential in the number of nodes in the graph.
(this refers to find_cliques, but it applies to your problem as well).
That means that, if there is a huge number of cliques, the objects you create with list will be so many that you'll probably run out of memory: they need more space to be stored than plain int values.
I don't know if there is a workaround, though.
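One memory-friendlier pattern, if you mainly need to inspect or filter the maximal cliques rather than compute the full community structure, is to consume find_cliques lazily instead of materializing everything with list (G_fb is the graph from the question):

import networkx as nx

count = 0
for clique in nx.find_cliques(G_fb):   # generator: yields one maximal clique at a time
    if len(clique) >= 3:
        count += 1
print(count)

Note that k_clique_communities itself still needs the cliques for its percolation step, so this only helps if the full community computation is not strictly required.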

Run out of VRAM using Theano on Amazon cluster

I'm trying to execute the logistic_sgd.py code on an Amazon cluster running the ami-b141a2f5 (Theano - CUDA 7) image.
Instead of the included MNIST database I am using the SD19 database, which requires changing a few dimensional constants, but otherwise no code has been touched. The code runs fine locally on my CPU, but once I copy the code and data to the Amazon cluster over SSH and run it there, I get this output:
It looks to me like it is running out of VRAM, but it was my understanding that the code should already run on the GPU without any tinkering on my part. After following the suggestion from the error message, the error persists.
There's nothing especially strange here. The error message is almost certainly accurate: there really isn't enough VRAM. Often, a script will run fine on CPU but then fail like this on GPU simply because there is usually much more system memory available than GPU memory, especially since the system memory is virtualized (and can page out to disk if required) while the GPU memory isn't.
For this script, there needs to be enough memory to store the training, validation, and testing data sets, the model parameters, and enough working space to store intermediate results of the computation. There are two options available:
Reduce the amount of memory needed for one or more of these three components. Reducing the amount of training data is usually easiest; reducing the size of the model comes next. Unfortunately, both of those options will often impair the quality of the result. Reducing the amount of memory needed for intermediate results is usually beyond the developer's control; it is managed by Theano, although there is sometimes scope for altering the computation to achieve this once you have a good understanding of Theano's internals.
If the model parameters and working memory can fit in GPU memory, then the most common solution is to change the code so that the data is no longer stored in GPU memory (i.e. just store it as numpy arrays, not as Theano shared variables) and then pass each batch of data in as inputs instead of givens. The LSTM sample code is an example of this approach.
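A rough sketch of that second option (build_model, train_x, train_y and batch_size are placeholders, not part of the tutorial code): keep the data as numpy arrays on the host and feed each minibatch through the function's inputs:

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')
cost, updates = build_model(x, y)                 # hypothetical model-specific graph

train = theano.function(inputs=[x, y], outputs=cost, updates=updates)

for start in range(0, len(train_x), batch_size):  # minibatches stay in host memory
    train(train_x[start:start + batch_size],
          train_y[start:start + batch_size])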
