Why does installing sagemath improve the performance of mpmath in Python?

I noticed that the performance of mpmath, as oddly as it sounds, depends on whether sagemath is installed or not, regardless of whether the sage module is loaded in the current session. In particular, I experienced this for operations with multiple precision floats.
Example:
from mpmath import mp
import time

mp.prec = 650

# x_mpmath and y_mpmath were not defined in the original snippet;
# two arbitrary multiple-precision floats are assumed here.
x_mpmath = mp.mpf('1.2345678901234567890')
y_mpmath = mp.mpf('9.8765432109876543210')

# total seconds over 10**6 iterations == microseconds per operation
t = time.time()
for i in range(1000000):
    x_mpmath + y_mpmath
w = time.time()
print('plus:\t', (w-t), 'μs')

t = time.time()
for i in range(1000000):
    x_mpmath * y_mpmath
w = time.time()
print('times:\t', (w-t), 'μs')
# If sagemath is installed:
# plus: 0.12919950485229492 μs
# times: 0.17601895332336426 μs
#
# If sagemath is *not* installed:
# plus: 0.6239776611328125 μs
# times: 0.6283771991729736 μs
In both cases the mpmath module is exactly the same:
import mpmath
print(mpmath.__file__)
# /usr/lib/python3.9/site-packages/mpmath/__init__.py
I suspected that mpmath's backend depends on some sagemath dependency and falls back to a less optimized one when it is missing, but I cannot figure out what that dependency is. My goal is to install only the packages required to speed up mpmath, instead of all of sagemath.
Since this may well depend on how things are packaged, here are the details of my system: Arch Linux with all packages up to date (sagemath 9.3, mpmath 1.2.1, python 3.9.5).

I found the explanation. In /usr/lib/python3.9/site-packages/mpmath/libmp/backend.py at line 82 there is
if 'MPMATH_NOSAGE' not in os.environ:
    try:
        import sage.all
        import sage.libs.mpmath.utils as _sage_utils
        sage = sage.all
        sage_utils = _sage_utils
        BACKEND = 'sage'
        MPZ = sage.Integer
    except:
        pass
This loads all of sage if sagemath is installed and also sets it as the backend. This means that the following library is loaded next:
import sage.libs.mpmath.ext_libmp as ext_lib
This import happens in /usr/lib/python3.9/site-packages/mpmath/libmp/libmpf.py at line 1407. Looking at the __file__ of that module shows that it is a .so object, i.e. compiled, and therefore faster.
It also means that exporting MPMATH_NOSAGE to any nonempty value forces the backend back to the default one (python or gmpy); indeed, I can confirm that the code from the question runs slower in that case, even with sagemath installed.
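A minimal sketch of that check (not from the original post): set MPMATH_NOSAGE before the first import of mpmath and inspect which backend was picked up.

import os
os.environ['MPMATH_NOSAGE'] = '1'   # any value disables the sage backend

import mpmath
# BACKEND is 'sage', 'gmpy' or 'python' depending on what mpmath found
print(mpmath.libmp.BACKEND)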

Related

Conda Numba Cuda: libNVVM cannot be found

My development environment is Ubuntu 18.04.5 LTS with Python 3.6, and I have installed numba and cudatoolkit via conda. The GPU is an Nvidia GeForce GTX 1050 Ti, which is supported by CUDA.
The conda and numba installations seem to work as intended, since I can import numba within python3.6 scripts.
The problem seems identical situation to the question asked here: Cuda: library nvvm not found
but none of the proposed solutions seem to work in my case, and I'm not sure how to highlight my situation properly (I can't do it through an answer in the other thread...). If raising a duplicate of the question is inappropriate, then guide me to proper conduct.
When I try to run the code below I get the following error: numba.cuda.cudadrv.error.NvvmSupportError: libNVVM cannot be found. Do conda install cudatoolkit: library nvvm not found
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] and y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

import numpy as np
matrix_A = np.array([[0.1, 0.2], [0.1, 0.2]])
Doing as suggested and running conda install cudatoolkit does not work. I have tried many variations on this install that I've found online to no avail.
In the other post a solution that seems to have worked for many is to add lines about environment variables in the .bashrc file in the home directory. The suggestions however refer to files that exist in the /usr directory, where I have no cuda data since I've installed through conda. I have tried many variations on these exports without success. This is perhaps where the solution lies, but if so then the solution would benefit from being generalized.
Does anyone have any up-to-date or generalized solutions to this problem?
EDIT: adding information from terminal outputs (thanks for the hint of editing the question to do so)
> conda list numba
# packages in environment at /home/tobka/anaconda3:
#
# Name Version Build Channel
numba 0.51.2 py38h0573a6f_1
> conda list cudatoolkit
# packages in environment at /home/tobka/anaconda3:
#
# Name Version Build Channel
cudatoolkit 11.0.221 h6bb024c_0
Also adding output from numba -s: https://pastebin.com/raw/6u1MUkxg
Idea of possible cause (not yet confirmed): I noticed in the numba -s output that it specifies Python Version: 3.8.3, whereas I have been explicitly using python3.6 in the terminal, since simply using python has usually meant python2.7. I checked, however, and my system now uses Python 3.8.3 with the python command and Python 3.6.9 with python3.6. When running the code using python I get a different error instead, which is a good sign: raise ValueError(missing_launch_config_msg), i.e. the failure is now about calling the kernel without a launch configuration rather than about libNVVM.
I will try to fix the remaining errors and confirm that the code works, after which I will report back here.
Confirmation of solution: using python instead of python3.6 in the terminal solved the problem. The root cause was the user.
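For reference, the missing_launch_config_msg error above comes from calling a @cuda.jit kernel without a launch configuration. A hedged sketch of a launch, assuming TPB and fast_matmul as defined in the question's snippet (the array sizes and grid shape here are illustrative, not from the post):

import math
import numpy as np

A = np.random.rand(128, 128).astype(np.float32)
B = np.random.rand(128, 128).astype(np.float32)
C = np.zeros((128, 128), dtype=np.float32)

threads_per_block = (TPB, TPB)
blocks_per_grid = (math.ceil(C.shape[0] / TPB),
                   math.ceil(C.shape[1] / TPB))

# Without the [blocks_per_grid, threads_per_block] subscript, numba raises the
# "missing launch configuration" ValueError mentioned above.
fast_matmul[blocks_per_grid, threads_per_block](A, B, C)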

"MKL ERROR: Parameter 12..." for large matrices with scipy.linalg.eigvalsh in Anaconda 2019.10-py37 (updated)

First of all, note that the problem seems to exist for a long time now (e.g. GitHub Scipy Issue 8205).
It happens that I get an MKL error (see below) when trying to get the eigenvalues of complex matrices with size 2000x2000 or larger using eigvalsh.
Apparently, the problem should have been fixed on the MKL side with release 2019.0 as far as I understand, so I upgraded Anaconda to 2018.12-py37 since Scipy is linked against MKL 2019.1 in this version. Unfortunately, this does not work for me and I still get the error. Did I miss something or is there a fix in sight for this? There seems to be no MKL conda-forge version of Scipy so it might be the most recent version available (in Anaconda with MKL).
Anaconda Release 2018.12-py37
an extract from conda list:
# Name Version Build Channel
blas 1.0 mkl
mkl 2019.1 144
numpy 1.15.4 py37h7e9f1db_0
numpy-base 1.15.4 py37hde5b4d6_0
scipy 1.1.0 py37h7c811a0_2
I have already built a virtual conda environment with only the necessary modules. It raises the same errors.
import numpy as np
from scipy.linalg import eigvalsh
mat = np.random.rand(2000,2000) + 1j * np.random.rand(2000,2000)
mat += mat.conjugate().T
eigvalsh(mat)
I get the following error:
Intel MKL ERROR: Parameter 12 was incorrect on entry to ZHBRDB.
It just raises the error and returns an array with only zeros, except for the last entry.
Edit:
Now with Anaconda Release 2019.10-py37, I invested some more time and was able to trace the problem down to the point where the LAPACK functions are wrapped:
eigvalsh calls eigh with eigvals_only=True.
By investigating the eigh routine, I found out that the main difference is in setting the flag jobz for the LAPACK function to 'N' (only eigenvalues) instead of 'V' (eigenvalues and eigenvectors). By reconstructing the code I was able to test the wrapped heevr routine for different cases: applying it to complex64 and complex128 Hermitian matrices gives the error explained above, while applying it to float32 and float64 symmetric matrices gives no error and reasonable results for randomly generated matrices at least up to 20k x 20k.
A new minimal example:
from numpy.random import rand
from scipy.linalg.lapack import get_lapack_funcs
mat = rand(2000,2000) + 1j * rand(2000,2000)
mat += mat.conjugate().T
(heevr,) = get_lapack_funcs(('heevr',), (mat,))
(syevr,) = get_lapack_funcs(('syevr',), (mat.real,))
w, v, info = heevr(mat, jobz='N') # produces error as described
#w, v, info = heevr(mat, jobz='V') # no error
#w, v, info = syevr(mat.real, jobz='N') # no error
#w, v, info = syevr(mat.real, jobz='V') # no error
print(w, v, info)
The variable info is 0 in every case.
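Not part of the original post, but given the analysis above (only the jobz='N' path of heevr appears to be affected), one possible workaround sketch is to compute the eigenvalues through a different LAPACK driver, for example numpy's eigvalsh (which, to my understanding, is backed by the *heevd routines), or to request the eigenvectors and discard them:

import numpy as np
from scipy.linalg import eigh

mat = np.random.rand(2000, 2000) + 1j * np.random.rand(2000, 2000)
mat += mat.conjugate().T

# Route around the heevr/jobz='N' path: numpy's eigvalsh uses a different driver.
w1 = np.linalg.eigvalsh(mat)

# Alternatively, compute eigenvectors as well (jobz='V') and discard them.
w2, _ = eigh(mat)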

Set max number of threads at runtime on numpy/openblas

I'd like to know whether it is possible to change, at (Python) runtime, the maximum number of threads used by OpenBLAS behind numpy.
I know it's possible to set it before running the interpreter through the environment variable OMP_NUM_THREADS, but I'd like to change it at runtime.
For example, when using MKL instead of OpenBLAS, this is possible:
import mkl
mkl.set_num_threads(n)
You can do this by calling the openblas_set_num_threads function using ctypes. I often find myself wanting to do this, so I wrote a little context manager:
import contextlib
import ctypes
from ctypes.util import find_library

# Prioritize hand-compiled OpenBLAS library over version in /usr/lib/
# from Ubuntu repos
try_paths = ['/opt/OpenBLAS/lib/libopenblas.so',
             '/lib/libopenblas.so',
             '/usr/lib/libopenblas.so.0',
             find_library('openblas')]

openblas_lib = None
for libpath in try_paths:
    try:
        openblas_lib = ctypes.cdll.LoadLibrary(libpath)
        break
    except OSError:
        continue

if openblas_lib is None:
    raise EnvironmentError('Could not locate an OpenBLAS shared library', 2)


def set_num_threads(n):
    """Set the current number of threads used by the OpenBLAS server."""
    openblas_lib.openblas_set_num_threads(int(n))


# At the time of writing these symbols were very new:
# https://github.com/xianyi/OpenBLAS/commit/65a847c
try:
    openblas_lib.openblas_get_num_threads()

    def get_num_threads():
        """Get the current number of threads used by the OpenBLAS server."""
        return openblas_lib.openblas_get_num_threads()
except AttributeError:
    def get_num_threads():
        """Dummy function (symbol not present), returns -1."""
        return -1


try:
    openblas_lib.openblas_get_num_procs()

    def get_num_procs():
        """Get the total number of physical processors."""
        return openblas_lib.openblas_get_num_procs()
except AttributeError:
    def get_num_procs():
        """Dummy function (symbol not present), returns -1."""
        return -1


@contextlib.contextmanager
def num_threads(n):
    """Temporarily changes the number of OpenBLAS threads.

    Example usage:

        print("Before: {}".format(get_num_threads()))
        with num_threads(n):
            print("In thread context: {}".format(get_num_threads()))
        print("After: {}".format(get_num_threads()))
    """
    old_n = get_num_threads()
    set_num_threads(n)
    try:
        yield
    finally:
        set_num_threads(old_n)
You can use it like this:
with num_threads(8):
    np.dot(x, y)
As mentioned in the comments, openblas_get_num_threads and openblas_get_num_procs were very new features at the time of writing, and might therefore not be available unless you compiled OpenBLAS from the latest version of the source code.
We recently developed threadpoolctl, a cross-platform package to control the number of threads used in calls to C-level thread-pools in Python. It works similarly to the answer by @ali_m but automatically detects the libraries that need to be limited by looping through all loaded libraries. It also comes with introspection APIs.
This package can be installed with pip install threadpoolctl and comes with a context manager that allows you to control the number of threads used by packages such as numpy:
from threadpoolctl import threadpool_limits
import numpy as np

with threadpool_limits(limits=1, user_api='blas'):
    # In this block, calls to the BLAS implementation (like OpenBLAS or MKL)
    # will be limited to use only one thread. They can thus be used jointly
    # with thread-parallelism.
    a = np.random.randn(1000, 1000)
    a_squared = a @ a
You can also have finer control over different thread-pools (such as differentiating BLAS from OpenMP calls).
Note: this package is still in development and any feedback is welcome.
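A small sketch of the introspection API mentioned above (not from the original answer); threadpool_info() reports the thread-pools detected in the current process:

from threadpoolctl import threadpool_info
import numpy as np  # importing numpy loads its BLAS library

# Each entry is a dict describing one detected thread-pool
# (user_api, internal_api, filepath, version, num_threads, ...).
for pool in threadpool_info():
    print(pool)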

Speed difference in Python compiled with MS C vs. MinGW

On my Windows 7 machine, I use two CPython implementations:
1) WinPython distribution, which is compiled with MSC v.1500 64bit
2) MinGW-builds, which is compiled with MinGW/GCC 4.9.1 64bit
I've tried the MinGW-built version to compile some C extensions for Python, which need to be built with the same compiler as Python itself to function properly.
Now consider the following test script, which generates a random dictionary and repeatedly pickles and unpickles it.
import pickle, cPickle, random
from time import clock

def timeit(mdl, d, num=100, bestof=10):
    times = []
    for _ in range(bestof):
        start = clock()
        for _ in range(num):
            mdl.loads(mdl.dumps(d))
        times.append(clock() - start)
    return min(times)

def gen_dict(entries=100, keylength=5):
    formatstr = "{:0%dx}" % keylength
    d = {}
    for _ in range(entries):
        rn = random.randrange(16**keylength)  # 'keylength'-digit hex number
        # format into a string of length 'keylength' as key, decimal value as value
        d[formatstr.format(rn)] = rn
    return d

def main(entries=100, keylength=5, num=100, bestof=10):
    print "Dict size: %d entries, keylength: %d" % (entries, keylength)
    print ("Test is %d times pack/unpack. "
           "Take best time out of %d runs\n" % (num, bestof))
    d = gen_dict(entries, keylength)
    for mdl in [pickle, cPickle]:
        print "%s: %f s" % (mdl.__name__, timeit(mdl, d, num, bestof))

if __name__ == "__main__":
    main()
MSC CPython gave me
Dict size: 100 entries, keylength: 5
Test is 100 times pack/unpack. Take best time out of 10 runs
pickle: 0.107798 s
cPickle: 0.011802 s
and MinGW/GCC CPython gave me
Dict size: 100 entries, keylength: 5
Test is 100 times pack/unpack. Take best time out of 10 runs
pickle: 0.103065 s
cPickle: 0.075507 s
So the cPickle module (a standard library C extension for Python) is 6.4x slower on MinGW than on MSC.
I haven't investigated further (i.e. tested more C extensions), but I am quite surprised.
Is this to be expected?
Will other C extensions run in general slower on a Python/MinGW toolchain?
I have used MSYS2 and the MinGW-w64 toolchain to compile a large CPU-bound extension. It did not run unusually slowly; I actually think it runs faster than with MSC. One possible cause of slow extensions: the Mingw32CCompiler class in the file cygwinccompiler.py specified -O optimization. I changed that to -O2 and the performance improved.
I use the extension with the standard CPython as distributed from python.org.
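As an alternative to editing cygwinccompiler.py directly, optimization flags can also be passed per extension in setup.py via extra_compile_args. A hedged sketch (the module and source names here are placeholders, not from the original answer):

# setup.py -- placeholder names; only extra_compile_args is the point here.
from distutils.core import setup, Extension

setup(
    name='my_ext',
    ext_modules=[
        Extension('my_ext',
                  sources=['my_ext.c'],
                  extra_compile_args=['-O2']),
    ],
)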
Update
I tried your sample program on MSYS2. There are two versions of Python 2.7 available: one that is part of the MSYS2 distribution and the other is part of the MinGW-w64 tool-chain. The version that is included with MSYS2 does not exhibit the performance issue while the version included with MinGW-w64 does exhibit the performance issue with cPickle. Since the MSYS2 version is compiled by the GCC included in MinGW-w64, I believe the slowdown is related to the specific options used when compiling the MinGW version. I haven't looked at the source code for both versions to see what causes the difference.
Regarding the requirement to use the same version of the compiler for an extension as the Python interpreter - the answer is "It depends....". The problem occurs because there are some minor differences between the C runtime library that is used by each major version of MSC. IIRC, one of the differences can break the passing of file handles between Python and the extension. If you don't use any calls that rely on the differences, then you can mix compiler versions. Since there isn't a definitive list of differences, nor a way to prevent an extension from making those calls that are different, the only guaranteed answer is not mixing versions. My extension doesn't (I think) use any C runtime calls that are different. It only uses the Python C-API for all IO and memory management. I have successfully mixed compiler versions when testing but I still prefer not to do so.
I'm still experimenting with the MSYS2/MinGW-w64 approach to building my extension and using it with an MSC-compiled version of CPython. It does appear to work and it performs as expected.

numpy OpenBLAS set maximum number of threads

I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed up, I use OpenBLAS multi-threaded library to parallelize the numpy.dot function.
My setting is as follows,
OS: CentOS 6.2 server, #CPUs = 12, MEM = 96 GB
Python version: Python 2.7.6
numpy : numpy 1.8.0
OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/:
test_mul.py :
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000, 1000))
setup = "import numpy; x = numpy.random.random((1000, 1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
when I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
things go well.
However, when I set OMP_NUM_THREADS=8, the code only works occasionally:
sometimes it works, sometimes it does not even run and gives me core dumps.
When OMP_NUM_THREADS > 10, the code seems to break all the time.
I am wondering what is happening here. Is there something like a maximum number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
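As a cross-check from within Python (a small sketch; the exact output format varies between numpy builds), numpy's build configuration also lists the BLAS/LAPACK libraries it was built against:

import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against.
np.__config__.show()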
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
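Not part of the original answer, but as a quick runtime experiment (a hedged sketch), OpenBLAS also honours its own OPENBLAS_NUM_THREADS environment variable, which can be set before numpy is imported to cap the thread count without recompiling:

import os
os.environ['OPENBLAS_NUM_THREADS'] = '4'  # must be set before numpy (and OpenBLAS) is loaded

import numpy
x = numpy.random.random((1000, 1000))
numpy.dot(x, x.T)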
