Set max number of threads at runtime on numpy/OpenBLAS - Python

I'd like to know if it's possible to change at (Python) runtime the maximum number of threads used by OpenBLAS behind numpy?
I know it's possible to set it before running the interpreter through the environment variable OMP_NUM_THREADS, but I'd like to change it at runtime.
For example, when using MKL instead of OpenBLAS, this is possible:
import mkl
mkl.set_num_threads(n)

You can do this by calling the openblas_set_num_threads function using ctypes. I often find myself wanting to do this, so I wrote a little context manager:
import contextlib
import ctypes
from ctypes.util import find_library
# Prioritize hand-compiled OpenBLAS library over version in /usr/lib/
# from Ubuntu repos
try_paths = ['/opt/OpenBLAS/lib/libopenblas.so',
             '/lib/libopenblas.so',
             '/usr/lib/libopenblas.so.0',
             find_library('openblas')]
openblas_lib = None
for libpath in try_paths:
    try:
        openblas_lib = ctypes.cdll.LoadLibrary(libpath)
        break
    except OSError:
        continue
if openblas_lib is None:
    raise EnvironmentError('Could not locate an OpenBLAS shared library', 2)
def set_num_threads(n):
    """Set the current number of threads used by the OpenBLAS server."""
    openblas_lib.openblas_set_num_threads(int(n))

# At the time of writing these symbols were very new:
# https://github.com/xianyi/OpenBLAS/commit/65a847c
try:
    openblas_lib.openblas_get_num_threads()
    def get_num_threads():
        """Get the current number of threads used by the OpenBLAS server."""
        return openblas_lib.openblas_get_num_threads()
except AttributeError:
    def get_num_threads():
        """Dummy function (symbol not present), returns -1."""
        return -1

try:
    openblas_lib.openblas_get_num_procs()
    def get_num_procs():
        """Get the total number of physical processors."""
        return openblas_lib.openblas_get_num_procs()
except AttributeError:
    def get_num_procs():
        """Dummy function (symbol not present), returns -1."""
        return -1
@contextlib.contextmanager
def num_threads(n):
    """Temporarily changes the number of OpenBLAS threads.

    Example usage:

        print("Before: {}".format(get_num_threads()))
        with num_threads(n):
            print("In thread context: {}".format(get_num_threads()))
        print("After: {}".format(get_num_threads()))
    """
    old_n = get_num_threads()
    set_num_threads(n)
    try:
        yield
    finally:
        set_num_threads(old_n)
You can use it like this:
with num_threads(8):
    np.dot(x, y)
As mentioned in the comments, openblas_get_num_threads and openblas_get_num_procs were very new features at the time of writing, and might therefore not be available unless you compiled OpenBLAS from the latest version of the source code.

We recently developed threadpoolctl, a cross-platform package to control the number of threads used in calls to C-level thread-pools in Python. It works similarly to the answer by @ali_m, but automatically detects the libraries that need to be limited by looping through all loaded libraries. It also comes with introspection APIs.
This package can be installed with pip install threadpoolctl and comes with a context manager that allows you to control the number of threads used by packages such as numpy:
from threadpoolctl import threadpool_limits
import numpy as np
with threadpool_limits(limits=1, user_api='blas'):
    # In this block, calls to the BLAS implementation (like OpenBLAS or MKL)
    # will be limited to use only one thread. They can thus be used jointly
    # with thread-parallelism.
    a = np.random.randn(1000, 1000)
    a_squared = a @ a
You can also have finer control over different threadpools (such as differentiating BLAS from OpenMP calls).
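The introspection API mentioned above can be used to see which thread-pools were detected; a quick sketch (the output depends on your installation):
from threadpoolctl import threadpool_info
from pprint import pprint

# Each entry describes one detected thread-pool: its user_api
# ('blas' or 'openmp'), the library path, and its current num_threads.
pprint(threadpool_info())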
Note: this package is still in development and any feedback is welcome.

Related

Why does installing sagemath improve the performance of mpmath in python?

I noticed that the performance of mpmath, as odd as it sounds, depends on whether sagemath is installed, regardless of whether the sage module is loaded in the current session. In particular, I experienced this for operations with multiple-precision floats.
Example:
from mpmath import mp
import time

mp.prec = 650

# Operands elided in the original post; any two mpf values will do
x_mpmath = mp.mpf(2) ** 0.5
y_mpmath = mp.mpf(3) ** 0.5

# 10**6 iterations, so the elapsed seconds equal the average µs per operation
t = time.time()
for i in range(1000000):
    x_mpmath + y_mpmath
w = time.time()
print('plus:\t', (w-t), 'μs')

t = time.time()
for i in range(1000000):
    x_mpmath * y_mpmath
w = time.time()
print('times:\t', (w-t), 'μs')

# If sagemath is installed:
# plus: 0.12919950485229492 μs
# times: 0.17601895332336426 μs
#
# If sagemath is *not* installed:
# plus: 0.6239776611328125 μs
# times: 0.6283771991729736 μs
In both cases, the mpmath module is exactly the same:
import mpmath
print(mpmath.__file__)
# /usr/lib/python3.9/site-packages/mpmath/__init__.py
I thought that mpmath's backend might depend on some sagemath dependency, falling back to a less optimized one when that dependency is missing, but I cannot figure out what it is precisely. My goal is to install only the packages required to speed up mpmath, instead of installing all of sagemath.
Since this may well depend on how things are packaged, here are the details of my system: I am using Arch Linux, and all packages are updated to their most recent versions (sagemath 9.3, mpmath 1.2.1, python 3.9.5).
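For reference, one way to check which backend mpmath actually selected is its internal BACKEND flag (an implementation detail of mpmath.libmp, so treat it accordingly):
import mpmath.libmp

# One of 'python', 'gmpy' or 'sage'
print(mpmath.libmp.BACKEND)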
I found the explanation. In /usr/lib/python3.9/site-packages/mpmath/libmp/backend.py at line 82 there is
if 'MPMATH_NOSAGE' not in os.environ:
    try:
        import sage.all
        import sage.libs.mpmath.utils as _sage_utils
        sage = sage.all
        sage_utils = _sage_utils
        BACKEND = 'sage'
        MPZ = sage.Integer
    except:
        pass
This loads all of sage if sagemath is installed and also sets it as a backend. This means that the following library is loaded next:
import sage.libs.mpmath.ext_libmp as ext_lib
(from /usr/lib/python3.9/site-packages/mpmath/libmp/libmpf.py at line 1407). Looking at the __file__ of that module shows that it is a .so object, hence compiled, and thus faster.
This also means that exporting MPMATH_NOSAGE will force the backend to be the default one (python or gmpy); indeed, I can confirm that the code in the question runs slower in that case, even with sagemath installed.
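As a quick check of this explanation (the variable must be set before mpmath is first imported):
import os
os.environ['MPMATH_NOSAGE'] = '1'  # value doesn't matter; only the variable's presence is checked

import mpmath.libmp
print(mpmath.libmp.BACKEND)  # now 'python' or 'gmpy', even with sagemath installed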

MATLAB-generated Python packages conflict with PyQt5 on Ubuntu - possible library issue

I am building an application using Ubuntu 18.04 and PyQt 5.12.1, which imports Python packages generated from MATLAB code (these packages depend on the MATLAB Runtime). MATLAB packages in Python require the LD_LIBRARY_PATH environment variable to be set; without this, the program raises an exception when a MATLAB-generated package is imported.
However, I have found that PyQt cannot function when LD_LIBRARY_PATH is set. The program runs fine with the MATLAB Runtime installed, as long as the MATLAB package is not imported and the LD_LIBRARY_PATH is not set.
As prompted by the MATLAB Runtime installer, I added this to the environment variables in my PyCharm run/debug configuration:
LD_LIBRARY_PATH=/usr/local/MATLAB/MATLAB_Runtime/v96/runtime/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v96/sys/os/glnxa64:/usr/local/MATLAB/MATLAB_Runtime/v96/extern/bin/glnxa64
This causes a crash in the PyQt part of the program. Using the QT_DEBUG_PLUGINS=1 environment variable, the error message is as follows:
Got keys from plugin meta data ("xcb")
QFactoryLoader::QFactoryLoader() checking directory path "<redacted>/PyMODA/venv/bin/platforms" ...
Cannot load library <redacted>/venv/lib/python3.6/site-packages/PyQt5/Qt/plugins/platforms/libqxcb.so: (/usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64/libQt5XcbQpa.so.5: undefined symbol: _ZNK14QPlatformTheme14fileIconPixmapERK9QFileInfoRK6QSizeF6QFlagsINS_10IconOptionEE)
QLibraryPrivate::loadPlugin failed on "<redacted>/venv/lib/python3.6/site-packages/PyQt5/Qt/plugins/platforms/libqxcb.so" : "Cannot load library <redacted>/venv/lib/python3.6/site-packages/PyQt5/Qt/plugins/platforms/libqxcb.so: (/usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64/libQt5XcbQpa.so.5: undefined symbol: _ZNK14QPlatformTheme14fileIconPixmapERK9QFileInfoRK6QSizeF6QFlagsINS_10IconOptionEE)"
qt.qpa.plugin: Could not load the Qt platform plugin "xcb" in "" even though it was found.
This application failed to start because no Qt platform plugin could be initialized. Reinstalling the application may fix this problem.
Available platform plugins are: eglfs, linuxfb, minimal, minimalegl, offscreen, vnc, wayland-egl, wayland, wayland-xcomposite-egl, wayland-xcomposite-glx, webgl, xcb.
The important part:
"Cannot load library <...>/libqxcb.so: (/usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64/libQt5XcbQpa.so.5: undefined symbol: _ZNK14QPlatformTheme14fileIconPixmapERK9QFileInfoRK6QSizeF6QFlagsINS_10IconOptionEE)"
The MATLAB Runtime ships libQt5XcbQpa.so.5 in /usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64/, which must be exported to the LD_LIBRARY_PATH. It seems that this is being used by PyQt when the LD_LIBRARY_PATH is set, and it is an old version which is incompatible with the current version of PyQt.
Another library with the same name is in /usr/lib/x86_64-linux-gnu/, and it has a different MD5 checksum from the MATLAB version. However, adding this directory to the start of the LD_LIBRARY_PATH does not help, and neither does setting QT_QPA_PLATFORM_PLUGIN_PATH.
Is there a way to make the version in /usr/lib/x86_64-linux-gnu/ a higher priority than the MATLAB-supplied library? Is there another way to fix this issue?
I have discovered a workaround:
Run all MATLAB-packaged code in a new process; this is barely an inconvenience, since the computations must be run on a separate thread or process to prevent freezing the GUI anyway.
In each process which runs MATLAB-packaged code, set the LD_LIBRARY_PATH environment variable programmatically before importing the MATLAB modules. The import statements will have to be in a function rather than at the top of the file.
Here is a relatively minimal example:
import numpy as np
from multiprocessing import Process, Queue
from PyQt5.QtCore import QTimer


class MyPlot(PlotComponent):
    """
    A class which inherits from a base class PlotComponent, which is
    a subclass of QWidget. In this simple example, the window
    gets the data and calls the function `plot(self, data)` on an
    instance of this class.
    """
    def __init__(self, parent):
        super().__init__(parent)
        self.queue = Queue()

    def plot(self, data):
        """Calculate the results from the provided data, and plot them."""
        fs = data.frequency
        self.times = data.times
        signal = data.signal.tolist()
        # Create the process, supplying all data in non-MATLAB types.
        self.proc = Process(target=generate_solutions, args=(self.queue, signal, fs))
        self.proc.start()
        # Check for a result in 1 second.
        QTimer.singleShot(1000, self.check_result)

    def check_result(self):
        """Checks for a result from the other process."""
        if self.queue.empty():  # No data yet; check again in 1 second.
            QTimer.singleShot(1000, self.check_result)
            return
        w, l = self.queue.get()  # Get the data from the process.
        a = np.asarray(w)
        gh = np.asarray(l)
        # Create the plot.
        self.axes.pcolormesh(self.times, gh, np.abs(a))


def generate_solutions(queue, signal, freq):
    """
    Generates the solutions from the provided data, using the MATLAB-packaged
    code. Must be run in a new process.
    """
    import os
    # Set the LD_LIBRARY_PATH for this process. The particular value may
    # differ, depending on your installation.
    os.environ["LD_LIBRARY_PATH"] = (
        "/usr/local/MATLAB/MATLAB_Runtime/v96/runtime/glnxa64:"
        "/usr/local/MATLAB/MATLAB_Runtime/v96/bin/glnxa64:"
        "/usr/local/MATLAB/MATLAB_Runtime/v96/sys/os/glnxa64:"
        "/usr/local/MATLAB/MATLAB_Runtime/v96/extern/bin/glnxa64"
    )
    # Import these modules AFTER setting up the environment variables.
    import my_matlab_package
    import matlab

    package = my_matlab_package.initialize()
    # Convert the input into MATLAB data-types to pass to the MATLAB package.
    A = matlab.double([signal])
    fs_matlab = matlab.double([freq])
    # Calculate the result.
    w, l = package.perform_my_calculation(A, fs_matlab, nargout=2)
    # Convert the results back to normal Python data-types so that the
    # main process can use them without importing matlab, and put them
    # in the queue.
    queue.put((np.asarray(w), np.asarray(l)))

Calling __host__ functions in PyCUDA

Is it possible to call __host__ functions in pyCUDA like you can __global__ functions? I noticed in the documentation that pycuda.driver.Function creates a handle to a __global__ function. __device__ functions can be called from a __global__ function, but __host__ code cannot. I'm aware that using a __host__ function pretty much defeats the purpose of pyCUDA, but there are some already made functions that I'd like to import and call as a proof of concept.
As a note, whenever I try to import the __host__ function, I get:
pycuda._driver.LogicError: cuModuleGetFunction failed: named symbol not found
No, it is not possible.
This isn't a limitation of PyCUDA per se, but of CUDA itself. The __host__ decorator just decays away to plain host code, and the CUDA APIs don't and cannot handle it in the same way that device code is handled (note that the APIs don't expose __device__ functions either, which are the true device-side equivalent of __host__ functions).
If you want to call or use __host__ functions from Python, you will need to use one of the standard C++/Python interoperability mechanisms, such as ctypes, SWIG, or Boost.Python.
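For example, a minimal ctypes sketch; the library and function names here are hypothetical, assuming the __host__ function was compiled with extern "C" linkage into a shared library:
import ctypes

# Hypothetical shared library containing the compiled __host__ function,
# built e.g. with: nvcc -Xcompiler -fPIC -shared mylib.cu -o mylib.so
lib = ctypes.CDLL('./mylib.so')

# Declare the signature of the hypothetical host function:
#     double my_host_function(double x);
lib.my_host_function.argtypes = [ctypes.c_double]
lib.my_host_function.restype = ctypes.c_double

print(lib.my_host_function(2.0))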
EDIT:
Since this answer was written five years ago, CUDA has added the ability to run host functions in CUDA streams via cuLaunchHostFunc (driver API) or cudaLaunchHostFunc (runtime API). Unfortunately, at the time of this edit (June 2022), PyCUDA doesn't expose this functionality, so it still isn't possible in PyCUDA, and the core message of the original answer is unchanged.
Below is sample code that calls CUDA library APIs (here, cuRAND) from a pyCUDA program. It generates uniformly distributed random numbers and may serve as a reference for calling already-made functions, as the poster puts it, from pyCUDA code.
import numpy as np
import ctypes
from ctypes import CDLL, POINTER, byref, c_ulonglong, c_float
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
import pycuda.autoinit

curand = CDLL("/usr/local/cuda/lib64/libcurand.so")

# --- Number of elements to generate
N = 10

# --- cuRAND enums
CURAND_RNG_PSEUDO_DEFAULT = 100

# --- Query the cuRAND version
i = c_ulonglong()
curand.curandGetVersion(byref(i))
print("curand version: ", i.value)

# --- Allocate space for generation
d_x = gpuarray.empty(N, dtype=np.float32)

# --- Create random number generator
gen = c_ulonglong()
curand.curandCreateGenerator(byref(gen), CURAND_RNG_PSEUDO_DEFAULT)

# --- Generate random numbers
curand.curandGenerateUniform(gen, ctypes.cast(d_x.ptr, POINTER(c_float)), N)
print(d_x)
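For tidiness, cuRAND also provides curandDestroyGenerator to release the generator once you are done with it:
# --- Release the generator
curand.curandDestroyGenerator(gen)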

numpy OpenBLAS set maximum number of threads

I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed things up, I use the OpenBLAS multi-threaded library to parallelize numpy.dot.
My setup is as follows:
OS: CentOS 6.2 server, #CPUs = 12, MEM = 96 GB
Python version: Python 2.7.6
numpy: numpy 1.8.0
OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/, is in test_mul.py:
import numpy
import sys
import timeit
try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000, 1000))
setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
When I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
Things go well.
However, when I set OMP_NUM_THREADS=8, the code only works occasionally: sometimes it runs, sometimes it does not even start and gives me core dumps. When OMP_NUM_THREADS > 10, the code seems to break every time.
I am wondering what is happening here. Is there something like a MAXIMUM number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
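Alternatively, on reasonably recent numpy versions you can ask numpy directly which BLAS/LAPACK it was built against:
import numpy as np

# Prints the BLAS/LAPACK build configuration
np.show_config()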
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to set the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not detecting this correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
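Note also that OpenBLAS reads its own OPENBLAS_NUM_THREADS environment variable, which takes precedence over OMP_NUM_THREADS, so you can also cap the thread count per run without rebuilding:
$ OPENBLAS_NUM_THREADS=4 python test_mul.py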

Available disk space on an SMB share, via Python

Does anyone know a way to get the amount of space available on a Windows (Samba) share via Python 2.6 with its standard library? (also running on Windows)
e.g.
>>> os.free_space(r"\\myshare\folder")  # return free disk space, in bytes
1234567890
If PyWin32 is available:
free, total, totalfree = win32file.GetDiskFreeSpaceEx(r'\\server\share')
where free is the amount of free space available to the current user, and totalfree is the total amount of free space. Relevant documentation: PyWin32 docs, MSDN.
If PyWin32 is not guaranteed to be available, then for Python 2.5 and higher there is the ctypes module in the stdlib. The same function, using ctypes:
import sys
from ctypes import *

c_ulonglong_p = POINTER(c_ulonglong)

_GetDiskFreeSpace = windll.kernel32.GetDiskFreeSpaceExW
_GetDiskFreeSpace.argtypes = [c_wchar_p, c_ulonglong_p, c_ulonglong_p, c_ulonglong_p]

def GetDiskFreeSpace(path):
    if not isinstance(path, unicode):
        path = path.decode('mbcs')  # this is Windows-only code
    free, total, totalfree = c_ulonglong(0), c_ulonglong(0), c_ulonglong(0)
    if not _GetDiskFreeSpace(path, pointer(free), pointer(total), pointer(totalfree)):
        raise WindowsError
    return free.value, total.value, totalfree.value
Could probably be done better but I'm not really familiar with ctypes.
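Usage mirrors the PyWin32 call above (the server and share names are placeholders):
free, total, totalfree = GetDiskFreeSpace(r'\\server\share')
print free, total, totalfree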
The standard library has the os.statvfs() function, but unfortunately it's only available on Unix-like platforms.
If a Cygwin build of Python is available, os.statvfs() might work there too.
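For completeness, on a Unix-like platform the free space can be derived from the os.statvfs() result like this (a sketch; the mount point is a placeholder):
import os

st = os.statvfs('/mnt/myshare')
free_bytes = st.f_bavail * st.f_frsize  # bytes available to the current user
print free_bytes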
