I tried to run a simple Python program on my MacBook Pro 15 on each of my two partitions: macOS Mojave and Windows 10.
I use the spsolve function to solve a sparse linear system on some matrices, and I see that the same code with the same matrix is much slower on Windows than on macOS.
For example:
matrix 1 -> macOS: 29 sec / Windows: 377 sec
On macOS, when I run these calculations, the processor goes to full speed and I can hear the fan spinning up.
On Windows this does not happen; the processor stays at around 20%.
I use 64-bit Python 3 on both systems.
import scipy.io as sio
import numpy as np
import time
from scipy.sparse.linalg import spsolve

matrix_names = ['cfd1']
for matrice in matrix_names:
    mat = sio.loadmat('/matrix_path/%s' % matrice)
    A = mat['Problem']['A']
    A = A[0][0]
    matrix_size = np.shape(A)[0]
    xe = np.ones(matrix_size)   # known exact solution
    b = A * xe                  # right-hand side
    start = time.time()
    X = spsolve(A, b)
    end = time.time()
    print("Time %.6f sec" % (end - start))
The slow call is
X = spsolve(A, b)
I found the problem.
The MKL library is not linked by default into the SciPy build on Windows.
I'm not sure whether it's integrated on macOS, but using Anaconda on Windows (which ships SciPy built against the MKL library), the Python file runs as fast as on macOS.
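To check which BLAS/LAPACK implementation an installation is linked against, a quick diagnostic (assuming a NumPy/SciPy version where show_config is available):
import numpy as np
import scipy

# If the build is linked against MKL, the BLAS/LAPACK sections
# list libraries such as 'mkl_rt'; otherwise you typically see
# openblas or the reference netlib implementation.
np.show_config()
scipy.show_config()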
Related
I noticed that the performance of mpmath, as odd as it sounds, depends on whether sagemath is installed, regardless of whether the sage module is loaded in the current session. In particular, I experienced this for operations on multiple-precision floats.
Example:
from mpmath import mp
import time

mp.prec = 650
# example operands (illustrative values; any mpf works)
x_mpmath = mp.mpf(2) / 3
y_mpmath = mp.mpf(5) / 7

t = time.time()
for i in range(1000000):
    x_mpmath + y_mpmath
w = time.time()
# total seconds for 10**6 iterations = average μs per operation
print('plus:\t', (w-t), 'μs')

t = time.time()
for i in range(1000000):
    x_mpmath * y_mpmath
w = time.time()
print('times:\t', (w-t), 'μs')
# If sagemath is installed:
# plus: 0.12919950485229492 μs
# times: 0.17601895332336426 μs
#
# If sagemath is *not* installed:
# plus: 0.6239776611328125 μs
# times: 0.6283771991729736 μs
while in both cases the mpmath module is exactly the same:
import mpmath
print(mpmath.__file__)
# /usr/lib/python3.9/site-packages/mpmath/__init__.py
I thought that mpmath's backend might depend on some sagemath dependency, falling back to a less optimized one if that dependency is missing, but I cannot figure out what it is precisely. My goal is to install only the packages required to speed up mpmath, instead of installing all of sagemath.
Since this may very well depend on how things are packaged, you might need details on my system: I am using Arch Linux and all packages are updated to the most recent versions (sagemath 9.3, mpmath 1.2.1, python 3.9.5).
I found the explanation. In /usr/lib/python3.9/site-packages/mpmath/libmp/backend.py at line 82 there is
if 'MPMATH_NOSAGE' not in os.environ:
    try:
        import sage.all
        import sage.libs.mpmath.utils as _sage_utils
        sage = sage.all
        sage_utils = _sage_utils
        BACKEND = 'sage'
        MPZ = sage.Integer
    except:
        pass
This loads all of sage if sagemath is installed and also sets it as a backend. This means that the following library is loaded next:
import sage.libs.mpmath.ext_libmp as ext_lib
This import happens in /usr/lib/python3.9/site-packages/mpmath/libmp/libmpf.py at line 1407. Looking at the __file__ of that module, one sees that it's a .so object, hence compiled, and thus faster.
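One can confirm this directly (the printed path is illustrative for my Arch system):
import sage.libs.mpmath.ext_libmp as ext_lib
print(ext_lib.__file__)
# e.g. /usr/lib/python3.9/site-packages/sage/libs/mpmath/ext_libmp.cpython-39-x86_64-linux-gnu.so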
This also means that exporting MPMATH_NOSAGE (the check is only for the key's presence, so any value works) forces the backend to be the default one (python or gmpy), and indeed I can confirm that the code I wrote in the question runs slower in this case, even with sagemath installed.
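A quick way to check which backend is active (BACKEND is defined in mpmath.libmp; note that the environment variable must be set before mpmath is first imported):
import os
os.environ['MPMATH_NOSAGE'] = '1'  # set before importing mpmath
import mpmath.libmp
print(mpmath.libmp.BACKEND)        # 'python' or 'gmpy' instead of 'sage'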
My development environment is Ubuntu 18.04.5 LTS with Python 3.6, and I have installed numba and cudatoolkit via conda. My GPU is an Nvidia GeForce GTX 1050 Ti, which is supported by CUDA.
The installation of conda and numba seems to work as intended, as I can import numba within python3.6 scripts.
The problem seems identical to the question asked here: Cuda: library nvvm not found
but none of the proposed solutions work in my case, and I'm not sure how to present my situation properly (I can't do it through an answer in the other thread...). If raising a duplicate of the question is inappropriate, then please guide me to the proper conduct.
When I try to run the code below I get the following error: numba.cuda.cudadrv.error.NvvmSupportError: libNVVM cannot be found. Do conda install cudatoolkit: library nvvm not found
from numba import cuda, float32

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    if x >= C.shape[0] or y >= C.shape[1]:
        # Quit if (x, y) is outside of valid C boundary
        # (note: the check must use 'or', not 'and')
        return

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Compute the partial product on the shared memory
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]

        # Wait until all threads finish computing
        cuda.syncthreads()

    C[x, y] = tmp

import numpy as np
matrix_A = np.array([[0.1, 0.2], [0.1, 0.2]])
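For reference, the snippet above never actually launches the kernel; a numba CUDA kernel needs an explicit launch configuration, roughly like this (the sizes are illustrative and chosen as multiples of TPB):
import numpy as np
A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)
C = np.zeros((64, 64), dtype=np.float32)

threadsperblock = (TPB, TPB)
# ceil-divide so the grid covers the whole output matrix
blockspergrid = ((C.shape[0] + TPB - 1) // TPB,
                 (C.shape[1] + TPB - 1) // TPB)
fast_matmul[blockspergrid, threadsperblock](A, B, C)
Calling fast_matmul(...) without the [grid, block] configuration is what produces the missing_launch_config_msg error mentioned below.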
Doing as suggested and running conda install cudatoolkit does not work. I have tried many variations on this install that I've found online, to no avail.
In the other post, a solution that seems to have worked for many is to add lines about environment variables to the .bashrc file in the home directory. The suggestions, however, refer to files in the /usr directory, where I have no CUDA data since I've installed through conda. I have tried many variations of these exports without success. This is perhaps where the solution lies, but if so, the solution would benefit from being generalized.
Does anyone have any up-to-date or generalized solutions to this problem?
EDIT: adding information from terminal outputs (thanks for the hint to edit the question to do so).
> conda list numba
# packages in environment at /home/tobka/anaconda3:
#
# Name Version Build Channel
numba 0.51.2 py38h0573a6f_1
> conda list cudatoolkit
# packages in environment at /home/tobka/anaconda3:
#
# Name Version Build Channel
cudatoolkit 11.0.221 h6bb024c_0
Also adding output from numba -s: https://pastebin.com/raw/6u1MUkxg
Idea of a possible cause (not yet confirmed): I noticed in the numba -s output that it reports Python Version: 3.8.3, while I've been explicitly using python3.6 in the terminal, since plain python has usually meant Python 2.7 for me. I checked, however, and my system now runs Python 3.8.3 under the python command and Python 3.6.9 under python3.6. When running the code with python I get a different error instead, which is a good sign: raise ValueError(missing_launch_config_msg).
I will try to fix the remaining errors and confirm that the code works, after which I will report back.
Confirmation of solution: using python instead of python3.6 in the terminal solved the problem. The root cause was user error.
I want to parallelize my Python code, and I'm trying to use PyCUDA.
What I've seen so far is that you have to write a "kernel" in C inside your Python code. This kernel is what gets parallelized. Am I right?
Example (doubling an array of random numbers, from https://documen.tician.de/pycuda/tutorial.html):
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(4, 4)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
# Kernel:
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
func = mod.get_function("doublify")
func(a_gpu, block=(4, 4, 1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
The point is that my Python code has classes and other constructs that are all suitable for Python and unsuitable for C (i.e. untranslatable to C).
Let me clarify: my code has 256 independent for-loops that I want to parallelize. These loops contain Python code that can't be translated to C.
How can I parallelize actual Python code with PyCUDA without translating it to C?
You can't.
PyCUDA doesn't support device-side Python; all device code must be written in the CUDA C dialect.
Numba includes a direct Python compiler which allows an extremely limited subset of Python language features to be compiled and run directly on the GPU. This does not include access to any Python libraries such as numpy, scipy, etc.
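As a rough illustration of how limited that subset is, a Numba CUDA kernel is essentially restricted to scalar arithmetic, control flow, and array indexing (a minimal sketch; the array size and launch configuration are arbitrary):
from numba import cuda
import numpy as np

@cuda.jit
def double_elements(a):
    # only scalar math and indexing: no classes, no Python libraries
    i = cuda.grid(1)
    if i < a.size:
        a[i] *= 2.0

a = np.arange(16, dtype=np.float32)
double_elements[1, 32](a)   # numba copies the array to/from the device
print(a)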
I am using the Anaconda suite with ipython 3.6.1 and its accelerate package. There is a cufft sub-package in it with two functions, fft and ifft. These, as far as I understand, take a numpy array and output to a numpy array, both in system RAM, i.e. all GPU memory and transfers between system and GPU memory are handled automatically, and the GPU memory is released when the function returns. This all seems very nice and works for me. However, I would like to run multiple fft/ifft calls on the same array and, each time, extract just one number from it. It would be nice to keep the array in GPU memory to minimize system <-> GPU transfers. Am I correct that this is not possible using this package? If so, is there another package that can do the same? I have noticed the reikna project, but that doesn't seem to be available in Anaconda.
What I am doing (and would like to do efficiently on the GPU) is shown briefly here using numpy.fft:
import math as m
import numpy as np
import numpy.fft as dft

nr = 100
nh = 2**16
h = np.random.rand(nh)*1j
H = np.zeros(nh, dtype='complex64')
h[10] = 1
r = np.zeros(nr, dtype='complex64')
fftscale = m.sqrt(nh)
corr = 0.12j

for i in np.arange(nr):
    r[i] = h[10]                        # extract one number per iteration
    H = dft.fft(h, nh) / fftscale       # forward transform
    h = dft.ifft(H * corr) * fftscale   # apply corr in frequency domain, transform back
r[nr-1] = h[10]
print(r)
Thanks in advance!
So I found ArrayFire, which seems rather easy to work with.
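For comparison, here is a sketch of the same loop using CuPy (a different package from the ones mentioned above, but one that keeps arrays in GPU memory between FFT calls, which is exactly what is wanted here); only the final result is copied back to system RAM:
import math as m
import numpy as np
import cupy as cp

nr = 100
nh = 2**16
h = cp.asarray(np.random.rand(nh) * 1j)   # one host->device transfer
h[10] = 1
r = cp.zeros(nr, dtype='complex64')
fftscale = m.sqrt(nh)
corr = 0.12j

for i in range(nr):
    r[i] = h[10]
    H = cp.fft.fft(h) / fftscale          # stays on the GPU
    h = cp.fft.ifft(H * corr) * fftscale
print(cp.asnumpy(r))                      # one device->host transfer at the end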
I've got this very peculiar hang happening on my machine when using a Python multiprocessing Pool with numpy and PySide imported. This is the most entangled bug I have seen in my life so far :) The following code:
import numpy as np
import PySide

def hang():
    import multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
hangs, printing only 'before dot..'. But it is supposed to print:
before dot..
after dot.
success!
I'm not a gdb expert, but gdb seems to show that the process exits (or crashes) on the np.dot line:
[Inferior 1 (process 2884) exited normally]
There are several magical modifications I can make to prevent the hang:
- if you decrease the shape of the arrays going into dot (e.g. from 128 to 127)
- (!) if you increase the shape of the arrays going into dot from 128 to 256
- if you do not use multiprocessing and just run the function f directly
- (!!!) if you comment out the PySide import, which is not used anywhere in the code
Any help is appreciated!
Package versions:
numpy 1.8.1 or 1.7.1, PySide 1.2.1 or 1.2.2
Python version:
Python 2.7.5 (default, Sep 12 2013, 21:33:34)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
or
Python 2.7.6 (default, Apr 9 2014, 11:48:52)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Note: while hunting for information, I simplified the original code and question a bit. But here is a stack of updates to keep the history for others who may encounter this bug (e.g. I started with matplotlib, not with PySide).
Update: I narrowed the pylab import down to importing matplotlib with the PySide backend, and updated the code to match.
Update: I'm modifying the post to import only PySide instead of:
import matplotlib
matplotlib.use('qt4agg')
matplotlib.rcParams['backend.qt4']='PySide'
import matplotlib.pyplot
Update: Initial statistics show that it is a Mac-only issue: 3 people have it working on Ubuntu, 2 people got it hanging on a Mac.
Update: print(os.getpid()) before the dot operation gives me a pid that I don't see in top, which apparently means the child crashes and multiprocessing waits for a dead process. For this reason I cannot attach a debugger to it. I edited the main question accordingly.
This is a general issue with some BLAS libraries used by numpy for dot.
Apple Accelerate and OpenBLAS built with GNU OpenMP are known not to be safe to use on both sides of a fork (the parent and the child process multiprocessing creates). They will deadlock.
This cannot be fixed by numpy, but there are three workarounds:
- use netlib BLAS, ATLAS, or a git-master OpenBLAS based on pthreads (2.8.0 does not work)
- use Python 3.4 and its new multiprocessing spawn or forkserver start methods (see the sketch after this list)
- use threading instead of multiprocessing; numpy releases the GIL for its most expensive operations, so you can achieve decent threading speedups on typical desktop machines
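A minimal sketch of the spawn workaround, assuming Python 3.4+ (spawn starts a fresh interpreter instead of forking, so the child never inherits the parent's BLAS state):
import multiprocessing as mp
import numpy as np

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    ctx = mp.get_context('spawn')   # Python 3.4+
    pool = ctx.Pool(processes=1)
    pool.map(f, [None])
    print('success!')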
I believe this to be an issue with the multiprocessing module.
Try using the following instead.
import numpy as np
import PySide

def hang():
    # multiprocessing.dummy exposes the same API but is backed by threads
    import multiprocessing.dummy as multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
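This works because multiprocessing.dummy replicates the multiprocessing API on top of the threading module, so no fork happens at all; it is effectively the same as the threading workaround in the answer above.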
I ran into this exact problem. There was a deadlock when the child process used numpy.dot, but it ran when I reduced the size of the matrix. So instead of a dot product on a matrix with 156000 floats, I performed 3 dot products of 52000 each and concatenated the results. I'm not sure what the max limit is, or whether it depends on the number of child processes, available memory, or other factors. But if the largest matrix that does not deadlock can be identified by trial and error, then the following code should help.
def get_batch(X, update_iter, batchsize):
    curr_ptr = update_iter * batchsize
    if X.shape[0] - curr_ptr <= batchsize:
        X_i = X[curr_ptr:, :]
    else:
        X_i = X[curr_ptr:curr_ptr + batchsize, :]
    return X_i

def batch_dot(X, w, batchsize):
    y = np.zeros((1,))
    num_batches = X.shape[0] // batchsize   # integer division (works on Python 2 and 3)
    if X.shape[0] % batchsize != 0:
        num_batches += 1
    for batch_iter in range(0, num_batches):
        X_batch = get_batch(X, batch_iter, batchsize)
        y_batch = X_batch.dot(w)
        y = np.hstack((y, y_batch))
    return y[1:]
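Hypothetical usage, matching the 156000/52000 split described above (the shapes are illustrative):
import numpy as np
X = np.random.rand(156000, 10)
w = np.random.rand(10)
y = batch_dot(X, w, 52000)   # three dot products of 52000 rows each
print(y.shape)               # (156000,)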