Conda Numba Cuda: libNVVM cannot be found - python

My development environment is Ubuntu 18.04.5 LTS with Python 3.6, and I have installed numba and cudatoolkit via conda. My GPU is an Nvidia GeForce GTX 1050 Ti, which is supported by CUDA.
The installation of conda and numba seems to work as intended, as I can import numba within python3.6 scripts.
The problem seems identical to the situation in the question asked here: Cuda: library nvvm not found
but none of the proposed solutions work in my case, and I'm not sure how to highlight my situation properly (I can't do it through an answer in the other thread...). If raising a duplicate question is inappropriate, please guide me to the proper conduct.
When I try to run the code below I get the following error:
numba.cuda.cudadrv.error.NvvmSupportError: libNVVM cannot be found. Do conda install cudatoolkit: library nvvm not found
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPB x TPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory.
    # The size and type of the arrays must be known at compile time.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Compute the partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

import numpy as np
matrix_A = np.array([[0.1, 0.2], [0.1, 0.2]])
Doing as suggested and running conda install cudatoolkit does not work. I have tried many variations of this install that I've found online, to no avail.
In the other post, a solution that seems to have worked for many is to add environment-variable export lines to the .bashrc file in the home directory. Those suggestions, however, refer to files under the /usr directory, where I have no CUDA data since I've installed through conda. I have tried many variations of these exports without success. This is perhaps where the solution lies, but if so, it would benefit from being generalized.
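Before pointing any exports anywhere, a useful sanity check is to see where (if anywhere) the conda-installed toolkit keeps libNVVM. The snippet below is a minimal sketch assuming a default Anaconda layout; the searched sub-directories are assumptions, not an exhaustive list:

# Hedged check: does the running interpreter's prefix contain libNVVM, and where?
import glob
import os
import sys

prefix = sys.prefix  # e.g. /home/<user>/anaconda3 or an env under .../envs/<name>
candidates = []
for sub in ('lib', os.path.join('nvvm', 'lib64')):
    candidates += glob.glob(os.path.join(prefix, sub, 'libnvvm*'))

print(prefix)
print(candidates if candidates else 'no libnvvm found under this prefix')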
Does anyone have any up-to-date or generalized solutions to this problem?
EDIT: adding information from terminal outputs (thanks for the hint to edit the question to do so)
> conda list numba
# packages in environment at /home/tobka/anaconda3:
#
# Name                    Version                   Build  Channel
numba                     0.51.2           py38h0573a6f_1
> conda list cudatoolkit
# packages in environment at /home/tobka/anaconda3:
#
# Name                    Version                   Build  Channel
cudatoolkit               11.0.221             h6bb024c_0
Also adding output from numba -s: https://pastebin.com/raw/6u1MUkxg
Idea of possible cause (not yet confirmed): I noticed in the numba -s output that it reports Python Version: 3.8.3, whereas I've been explicitly using python3.6 in the terminal, since simply using python has usually meant python2.7. I checked, however, and my system now uses Python 3.8.3 with the python command and Python 3.6.9 with the python3.6 command. And when running the code with python I get a different error instead, which is a good sign: raise ValueError(missing_launch_config_msg).
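A quick way to confirm which interpreter, and which numba, a given command actually picks up (a minimal sketch using only the standard library and numba's own API):

import sys
print(sys.executable)        # the Python binary that is actually running

import numba
print(numba.__version__)     # the numba that this interpreter imports

from numba import cuda
print(cuda.is_available())   # True only if this environment can reach the CUDA driver/toolkit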
I will try to fix the remaining errors and confirm that the code works, after which I will report back here.

Confirmation of solution: using python instead of python3.6 in the terminal solved the problem. The root cause was the user.

Related

Why does installing sagemath improve the performance of mpmath in python?

I noticed that the performance of mpmath, as odd as it sounds, depends on whether sagemath is installed or not, regardless of whether the sage module is loaded in the current session. In particular, I experienced this for operations with multiple-precision floats.
Example:
from mpmath import mp
import time

mp.prec = 650

# Note: x_mpmath and y_mpmath are not defined in the original snippet;
# two arbitrary mpf values are assumed here.
x_mpmath = mp.mpf(1) / 3
y_mpmath = mp.mpf(2) / 3

t = time.time()
for i in range(1000000):
    x_mpmath + y_mpmath
w = time.time()
# total seconds over 10**6 iterations == average microseconds per operation
print('plus:\t', (w - t), 'μs')

t = time.time()
for i in range(1000000):
    x_mpmath * y_mpmath
w = time.time()
print('times:\t', (w - t), 'μs')

# If sagemath is installed:
# plus:   0.12919950485229492 μs
# times:  0.17601895332336426 μs
#
# If sagemath is *not* installed:
# plus:   0.6239776611328125 μs
# times:  0.6283771991729736 μs
In both cases, however, the mpmath module is exactly the same:
import mpmath
print(mpmath.__file__)
# /usr/lib/python3.9/site-packages/mpmath/__init__.py
I thought that mpmath's backend would depend on some sagemath dependency, and if that is missing it falls back to a less optimized one, but I cannot figure out what it is precisely. My goal is to be able to install only the required packages to speed up mpmath instead of installing all of sagemath.
Since this may very well depend on how things are packaged, you might need details about my system: I am using Arch Linux and all packages are updated to the most recent versions (sagemath 9.3, mpmath 1.2.1, python 3.9.5).
I found the explanation. In /usr/lib/python3.9/site-packages/mpmath/libmp/backend.py at line 82 there is
if 'MPMATH_NOSAGE' not in os.environ:
    try:
        import sage.all
        import sage.libs.mpmath.utils as _sage_utils
        sage = sage.all
        sage_utils = _sage_utils
        BACKEND = 'sage'
        MPZ = sage.Integer
    except:
        pass
This loads all of sage if sagemath is installed and also sets it as a backend. This means that the following library is loaded next:
import sage.libs.mpmath.ext_libmp as ext_lib
From /usr/lib/python3.9/site-packages/mpmath/libmp/libmpf.py at line 1407. By looking at the __file__ of that module, one sees that it's a .so object, hence compiled, thus faster.
This also means that exporting MPMATH_NOSAGE to any nonempty value forces the backend to be the default one (python or gmpy), and indeed I can confirm that the code from the question runs slower in that case, even with sagemath installed.
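A quick way to check which backend mpmath actually selected (a small sketch; mpmath.libmp exposes the chosen BACKEND name):

import os
os.environ['MPMATH_NOSAGE'] = '1'   # set *before* the first mpmath import; comment out to allow the sage backend

import mpmath.libmp
print(mpmath.libmp.BACKEND)         # 'python', 'gmpy', or 'sage'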

"MKL ERROR: Parameter 12..." for large matrices with scipy.linalg.eigvalsh in Anaconda 2019.10-py37 (updated)

First of all, note that the problem seems to have existed for a long time (e.g. GitHub Scipy issue 8205).
I get an MKL error (see below) when trying to compute the eigenvalues of complex matrices of size 2000x2000 or larger using eigvalsh.
As far as I understand, the problem should have been fixed on the MKL side with release 2019.0, so I upgraded Anaconda to 2018.12-py37, since Scipy is linked against MKL 2019.1 in that version. Unfortunately, this does not work for me and I still get the error. Did I miss something, or is there a fix in sight? There seems to be no MKL-based conda-forge build of Scipy, so this might be the most recent version available (in Anaconda with MKL).
Anaconda Release 2018.12-py37
an extract from conda list:
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
mkl                       2019.1                      144
numpy                     1.15.4           py37h7e9f1db_0
numpy-base                1.15.4           py37hde5b4d6_0
scipy                     1.1.0            py37h7c811a0_2
I have already built a separate conda environment with only the necessary modules; it raises the same error.
import numpy as np
from scipy.linalg import eigvalsh
mat = np.random.rand(2000,2000) + 1j * np.random.rand(2000,2000)
mat += mat.conjugate().T
eigvalsh(mat)
I get the following error:
Intel MKL ERROR: Parameter 12 was incorrect on entry to ZHBRDB.
It just raises the error and returns an array with only zeros, except for the last entry.
Edit:
Now with Anaconda Release 2019.10-py37, I invested some more time and was able to trace the problem down to the point where the LAPACK functions are wrapped:
eigvalsh calls eigh with eigvals_only=True.
Investigating the eigh routine, I found that the main difference is that the LAPACK flag jobz is set to 'N' (eigenvalues only) instead of 'V' (eigenvalues and eigenvectors). By reconstructing the code I was able to test the wrapped routines for different cases: applying heevr to complex64 and complex128 Hermitian matrices gives the error described above, while applying syevr to float32 and float64 symmetric matrices gives no error and reasonable results, at least for randomly generated matrices up to 20k x 20k.
A new minimal example:
from numpy.random import rand
from scipy.linalg.lapack import get_lapack_funcs
mat = rand(2000,2000) + 1j * rand(2000,2000)
mat += mat.conjugate().T
(heevr,) = get_lapack_funcs(('heevr',), (mat,))
(syevr,) = get_lapack_funcs(('syevr',), (mat.real,))
w, v, info = heevr(mat, jobz='N') # produces error as described
#w, v, info = heevr(mat, jobz='V') # no error
#w, v, info = syevr(mat.real, jobz='N') # no error
#w, v, info = syevr(mat.real, jobz='V') # no error
print(w, v, info)
The variable info is 0 in every case.
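As a side check (a small sketch, not part of the original report), the BLAS/LAPACK build information a given scipy is linked against can be printed directly; the exact fields in the output vary between scipy versions:

import scipy
print(scipy.__version__)
scipy.show_config()   # prints the BLAS/LAPACK (e.g. MKL) configuration scipy was built with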

Anaconda package for cufft keeping arrays in gpu memory between fft / ifft calls

I am using the Anaconda suite with ipython 3.6.1 and their accelerate package. It contains a cufft sub-package with two functions, fft and ifft. As far as I understand, these take in a numpy array and output to a numpy array, both in system RAM, i.e. all GPU memory and all transfers between system and GPU memory are handled automatically, and GPU memory is released when the function returns. This all seems very nice and works for me. However, I would like to run multiple fft/ifft calls on the same array and each time extract just one number from the array. It would be nice to keep the array in GPU memory to minimize system <-> GPU transfers. Am I correct that this is not possible using this package? If so, is there another package that would do the same? I have noticed the reikna project, but that doesn't seem to be available in anaconda.
The thing I am doing (and would like to do efficiently on the GPU) is shown in short here using numpy.fft:
import math as m
import numpy as np
import numpy.fft as dft

nr = 100
nh = 2**16

h = np.random.rand(nh) * 1j
H = np.zeros(nh, dtype='complex64')
h[10] = 1

r = np.zeros(nr, dtype='complex64')

fftscale = m.sqrt(nh)
corr = 0.12j

for i in np.arange(nr):
    r[i] = h[10]
    H = dft.fft(h, nh) / fftscale
    h = dft.ifft(h * corr) * fftscale

r[nr-1] = h[10]
print(r)
Thanks in advance!
So I found ArrayFire, which seems rather easy to work with.
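For illustration, below is a minimal sketch of the same loop with the large array kept resident in GPU memory, so only a single scalar is copied back per iteration. It uses CuPy purely as an example of a package that exposes device arrays (CuPy is an assumption here, not the accelerate package from the question; ArrayFire's Python bindings allow an equivalent pattern):

import math as m
import numpy as np
import cupy as cp   # assumption: CuPy is installed; any device-array FFT package works similarly

nr = 100
nh = 2**16

h = cp.asarray(np.random.rand(nh) * 1j)   # one host -> device copy up front
h[10] = 1
r = np.zeros(nr, dtype='complex64')       # small result array stays on the host

fftscale = m.sqrt(nh)
corr = 0.12j

for i in range(nr):
    r[i] = h[10].get()                    # copy back a single element only
    H = cp.fft.fft(h, nh) / fftscale      # computed and kept on the device
    h = cp.fft.ifft(h * corr) * fftscale  # computed and kept on the device

r[nr - 1] = h[10].get()
print(r)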

cvxopt.solvers.qp in python causes the kernel to die

When I try to solve a quadratic programming problem with solvers.qp from the cvxopt package in python, it kills my kernel after a few seconds.
The documentation of the package is found at http://cvxopt.org/userguide/coneprog.html#cvxopt.solvers.qp . If I run the example code from that page:
from math import sqrt
from cvxopt import matrix
from cvxopt.solvers import qp
# Problem data.
n = 4
S = matrix([[ 4e-2, 6e-3, -4e-3, 0.0 ],
[ 6e-3, 1e-2, 0.0, 0.0 ],
[-4e-3, 0.0, 2.5e-3, 0.0 ],
[ 0.0, 0.0, 0.0, 0.0 ]])
pbar = matrix([.12, .10, .07, .03])
G = matrix(0.0, (n,n))
G[::n+1] = -1.0
h = matrix(0.0, (n,1))
A = matrix(1.0, (1,n))
b = matrix(1.0)
# Compute trade-off.
N = 100
mus = [ 10**(5.0*t/N-1.0) for t in range(N) ]
portfolios = [ qp(mu*S, -pbar, G, h, A, b)['x'] for mu in mus ]
After 2 seconds or so I get the following reply from python:
It seems the kernel died unexpectedly. Use 'Restart kernel' to continue using this console.
It seems the kernel died unexpectedly. Use 'Restart kernel' to continue using this console.
It seems the kernel died unexpectedly. Use 'Restart kernel' to continue using this console.
...
I also don't understand what this ['x'] option is all about, but even if I leave it out I get the same 'unexpected' death of the kernel. I also tried qp problems that definitely have a solution, like x^2 + y^2 under no constraints or with non-negativity constraints... whatever I do, it kills my kernel. What could be the problem?
Maybe it is important to say that:
I use Ubuntu 16
I use Python 3.5
I use cvxopt 1.1.9
The cvxopt package also uses C files.
I faced the same problem when running cvxopt in Jupyter Lab, so I moved my code to PyCharm and got an error:
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results.
I googled and found a question that solved it via:
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
You can run the script from the terminal to get the cause of this error.
When I got this error, the cause was: Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.
I use conda, so this was my solution:
conda config --add channels conda-forge
conda install -f cvxopt

How to configure Theano on Windows?

I have installed Theano on a Windows machine and followed the configuration instructions.
I placed the following .theanorc.txt file in C:\Users\my_username folder:
#!sh
[global]
device = gpu
floatX = float32
[nvcc]
fastmath = True
# flags=-m32 # we have this hard coded for now
[blas]
ldflags =
# ldflags = -lopenblas # placeholder for openblas support
I tried to run the test, but haven't managed to run it on the GPU. I guess the values from .theanorc.txt are not being read, because I added the line print config.device and it outputs "cpu".
Below is the basic test script and the output:
from theano import function, config, shared, sandbox
import theano.tensor as T
import numpy
import time

print config.device

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
print f.maker.fgraph.toposort()

t0 = time.time()
for i in xrange(iters):
    r = f()
t1 = time.time()

print 'Looping %d times took' % iters, t1 - t0, 'seconds'
print 'Result is', r
if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]):
    print 'Used the cpu'
else:
    print 'Used the gpu'
output:
pydev debugger: starting (pid: 9564)
cpu
[Elemwise{exp,no_inplace}(<TensorType(float64, vector)>)]
Looping 1000 times took 10.0310001373 seconds
Result is [ 1.23178032 1.61879341 1.52278065 ..., 2.20771815 2.29967753
1.62323285]
Used the cpu
I have installed CUDA Toolkit successfully but haven't managed to install pyCUDA. I guess Theano should work without pyCUDA installed anyway.
I would be very thankful if anyone could help solve this problem. I have followed these instructions but don't know why the configuration values in the program don't match the values in the .theanorc.txt file.
Contrary to what has been said on a couple of pages, my installation (Windows 10, Python 2.7, Theano 0.10.0.dev1) would not interpret config instructions within a .theanorc.txt file in my user profile folder, but would read a .theanorc file.
If you are having trouble creating a file with that style of name, use the following commands at a terminal:
cd %USERPROFILE%
type NUL > .theanorc
Sauce: http://ankivil.com/making-theano-faster-with-cudnn-and-cnmem-on-windows-10/
You are right that Theano does not need PyCUDA.
It is strange that Theano does not read your configuration file. The exact path that gets read is this. Just run this in Python and you'll see where to put it:
import os
os.path.expanduser('~/.theanorc.txt')
Try changing the content of .theanorc.txt as indicated on the Theano website (http://deeplearning.net/software/theano/install_windows.html). The paths need to be changed according to your installation.
[global]
floatX = float32
device = gpu
[nvcc]
flags=-LC:\Users\cchan\Anaconda3\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
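After editing the configuration, a quick check (a minimal sketch; device and floatX are standard theano.config attributes) verifies whether Theano actually picked the settings up:

import theano
print(theano.config.device)   # expected to report 'gpu' if the config file was read
print(theano.config.floatX)   # expected to report 'float32'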
