I've got a very peculiar hang happening on my machine when using a Python multiprocessing Pool with numpy and PySide imported. It's the most entangled bug I have seen in my life so far :) The following code:
import numpy as np
import PySide

def hang():
    import multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
hangs, printing only 'before dot..'. But it is supposed to print:
before dot..
after dot.
success!
I'm no gdb expert, but gdb seems to show that the process exits (or crashes) on the 'np.dot' line:
[Inferior 1 (process 2884) exited normally]
There are several magical modifications I can make to prevent the hang:
if you decrease the shape of the arrays going into 'dot' (e.g. from 128 to 127)
(!) if you increase the shape of the arrays going into 'dot' from 128 to 256
if you do not use multiprocessing and just run the function 'f' directly
(!!!) if you comment out the PySide import, which is not used anywhere in the code
Any help is appreciated!
Packages version:
numpy = 1.8.1 or 1.7.1
PySide = 1.2.1 or 1.2.2
Python version:
Python 2.7.5 (default, Sep 12 2013, 21:33:34)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
or
Python 2.7.6 (default, Apr 9 2014, 11:48:52)
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.38)] on darwin
Notice: While hunting for information, I simplified the original code and question a bit. But here is a stack of updates to keep the history for others who may encounter this bug (e.g. I started with matplotlib, not with PySide).
Update: I narrowed the pylab import down to importing matplotlib with the PySide backend and updated the code to run.
Update: I'm modifying the post to import only PySide instead of:
import matplotlib
matplotlib.use('qt4agg')
matplotlib.rcParams['backend.qt4']='PySide'
import matplotlib.pyplot
Update: Initial statistics show that it is a Mac-only issue. 3 people have it working on Ubuntu, 2 people got it hanging on Mac.
Update: print(os.getpid()) before the dot operation gives me a pid that I don't see in 'top', which apparently means that the process crashed and multiprocessing is waiting for a dead process. For this reason I cannot attach a debugger to it. I edited the main question accordingly.
This is a general issue with some BLAS libraries used by numpy for dot.
Apple Accelerate and OpenBLAS built with GNU OpenMP are known not to be safe to use on both sides of a fork (the parent and the child process multiprocessing creates). They will deadlock.
This cannot be fixed by numpy, but there are three workarounds:
use netlib BLAS, ATLAS or a git master OpenBLAS based on pthreads (2.8.0 does not work)
use Python 3.4 and its new multiprocessing spawn or forkserver start methods (see the sketch after this list)
use threading instead of multiprocessing; numpy releases the GIL for the most expensive operations, so you can achieve decent threading speedups on typical desktop machines
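A minimal sketch of the second workaround (my illustration, assuming Python 3.4+; not code from the original post):

import multiprocessing as mp
import numpy as np

def f(ignore):
    # runs in a freshly spawned interpreter, so no fork-unsafe BLAS state is inherited
    return np.dot(np.zeros((128, 1)), np.zeros((1, 32))).shape

if __name__ == "__main__":
    ctx = mp.get_context('spawn')  # or 'forkserver'
    pool = ctx.Pool(processes=1)
    print(pool.map(f, [None]))
    pool.close()
    pool.join()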
I believe this to be an issue with the multiprocessing module.
Try using the following instead.
import numpy as np
import PySide
def hang():
    import multiprocessing.dummy as multiprocessing
    pool = multiprocessing.Pool(processes=1)
    pool.map(f, [None])

def f(ignore):
    print('before dot..')
    np.dot(np.zeros((128, 1)), np.zeros((1, 32)))
    print('after dot.')

if __name__ == "__main__":
    hang()
    print('success!')
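For context (my note, not part of the original answer): multiprocessing.dummy is a thread-based drop-in replacement for the multiprocessing API, so no fork happens and the fork-unsafe BLAS state is never duplicated into a child process. The same idea, written explicitly with a thread pool (reusing f from the snippet above):

from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=1)
pool.map(f, [None])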
I ran into this exact problem. There was a deadlock when the child process used numpy.dot, but it ran when I reduced the size of the matrix. So instead of a dot product on a matrix with 156000 floats, I performed 3 dot products of 52000 each and concatenated the results. I'm not sure what the max limit is, or whether it depends on the number of child processes, available memory or any other factors. But if the largest matrix that does not deadlock can be identified by trial and error, then the following code should help.
def get_batch(X, update_iter, batchsize):
    # return the slice of rows of X belonging to batch number update_iter
    curr_ptr = update_iter * batchsize
    if X.shape[0] - curr_ptr <= batchsize:
        X_i = X[curr_ptr:, :]
    else:
        X_i = X[curr_ptr:curr_ptr + batchsize, :]
    return X_i

def batch_dot(X, w, batchsize):
    y = np.zeros((1,))  # placeholder so hstack has something to append to
    num_batches = X.shape[0] // batchsize  # integer division (also correct on Python 3)
    if X.shape[0] % batchsize != 0:
        num_batches += 1
    for batch_iter in range(0, num_batches):
        X_batch = get_batch(X, batch_iter, batchsize)
        y_batch = X_batch.dot(w)
        y = np.hstack((y, y_batch))
    return y[1:]  # drop the placeholder zero
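A usage example for the code above (the sizes are the ones mentioned in the answer; the data itself is made up for illustration):

import numpy as np

X = np.random.random((156000, 10))
w = np.random.random((10,))
y = batch_dot(X, w, batchsize=52000)
assert np.allclose(y, X.dot(w))  # matches the unbatched dot product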
Related
I'm trying to understand why the following simple example doesn't successfully complete execution and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp  # I've tried both multiprocessing
# import torch.multiprocessing as mp  # and torch.multiprocessing

def really_simple_func():
    temp_val_2 = t.tensor(np.zeros(425447)[0:400000])  # this is the line that blocks.
    return 4.3

if __name__ == "__main__":
    print("Run brief starting")
    some_zeros = np.zeros(425447)
    temp_val = t.tensor(some_zeros[0:400000])  # DELETE THIS LINE TO MAKE IT WORK
    pool = mp.Pool(processes=1)
    job = pool.apply_async(really_simple_func)
    print("just before job.get()")
    result = job.get()
    print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 installed, numpy 1.22.3 and have tried both CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
    pool_outputs = list(
        tqdm(
            p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f), query_files),
            total=len(query_files),
        )
    )
For which the function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs automatically fails, and new ones are immediately started (which do not fail, for some reason). The whole process never completes though, since the main process still seems to be waiting for the first, failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a similar solution to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
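For reference (my addition, not from the original posts): torch.as_tensor reuses the numpy buffer when the dtype is compatible, just like torch.from_numpy, whereas torch.tensor always copies. A quick illustrative check:

import numpy as np
import torch

a = np.zeros(3)
print(torch.as_tensor(a).data_ptr() == torch.from_numpy(a).data_ptr())  # True: shared buffer
print(torch.tensor(a).data_ptr() == torch.from_numpy(a).data_ptr())     # False: torch.tensor copies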
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.
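A minimal sketch of the two working alternatives (array size as in the question; the failing variant is left commented out):

import numpy as np
import torch

temp_val = np.zeros(425447)[0:400000]

# blocks inside the pooled worker in the scenario above:
# bad = torch.tensor(temp_val)

good_1 = torch.from_numpy(temp_val)       # shares the numpy buffer, no copy
good_2 = torch.tensor(temp_val.tolist())  # copies via a plain Python list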
I noticed that the performance of mpmath, as odd as it sounds, depends on whether sagemath is installed, regardless of whether the sage module is loaded in the current session. In particular, I experienced this for operations with multiple-precision floats.
Example:
from mpmath import mp
import time

mp.prec = 650

# The original snippet did not define the operands; any two mpf values work:
x_mpmath = mp.mpf(2) ** 0.5
y_mpmath = mp.mpf(3) ** 0.5

t = time.time()
for i in range(1000000):
    x_mpmath + y_mpmath
w = time.time()
print('plus:\t', (w-t), 'μs')  # total seconds over 10^6 iterations = μs per operation

t = time.time()
for i in range(1000000):
    x_mpmath * y_mpmath
w = time.time()
print('times:\t', (w-t), 'μs')

# If sagemath is installed:
# plus:  0.12919950485229492 μs
# times: 0.17601895332336426 μs
#
# If sagemath is *not* installed:
# plus:  0.6239776611328125 μs
# times: 0.6283771991729736 μs
This happens even though in both cases the mpmath module is exactly the same:
import mpmath
print(mpmath.__file__)
# /usr/lib/python3.9/site-packages/mpmath/__init__.py
I thought that mpmath's backend would depend on some sagemath dependency, and if that is missing it falls back to a less optimized one, but I cannot figure out what it is precisely. My goal is to be able to install only the required packages to speed up mpmath instead of installing all of sagemath.
Since this may very well be dependent on how things are packaged, you might need to have details on my system: I am using Arch Linux and all packages are updated to the most recent versions (sagemath 9.3, mpmath 1.2.1, python 3.9.5).
I found the explanation. In /usr/lib/python3.9/site-packages/mpmath/libmp/backend.py at line 82 there is
if 'MPMATH_NOSAGE' not in os.environ:
    try:
        import sage.all
        import sage.libs.mpmath.utils as _sage_utils
        sage = sage.all
        sage_utils = _sage_utils
        BACKEND = 'sage'
        MPZ = sage.Integer
    except:
        pass
This loads all of sage if sagemath is installed and also sets it as a backend. This means that the following library is loaded next:
import sage.libs.mpmath.ext_libmp as ext_lib
From /usr/lib/python3.9/site-packages/mpmath/libmp/libmpf.py at line 1407. By looking at the __file__ of that module, one sees that it's a .so object, hence compiled, thus faster.
This also means that exporting MPMATH_NOSAGE to any nonempty value will force the backend to be the default one (python or gmpy), and indeed I can confirm that the code I wrote in the question runs slower in this case, even with sagemath installed.
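A quick way to verify which backend is active (my sketch, based on the backend.py logic quoted above; the variable must be set before the first mpmath import):

import os
os.environ['MPMATH_NOSAGE'] = '1'  # remove this line to allow the sage backend

from mpmath.libmp import backend
print(backend.BACKEND)  # 'python' or 'gmpy' now; 'sage' when sagemath is picked up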
I tried to execute a simple Python program on my MacBook Pro 15 on my two partitions: macOS Mojave and Windows 10.
I use the spsolve function to solve a sparse linear system for some matrices, and I see that the same code with the same matrix is much slower on Windows than on macOS.
For example:
matrix 1 -> macOS: 29 sec / Windows: 377 sec
On macOS, when I perform these calculations, the processor goes to full speed and I hear the fan spinning hard.
On Windows this does not happen; the processor stays at 20%.
I use Python 3 64bit on both systems.
from scipy import array, linalg, dot
import scipy.io as sio
import numpy as np
import time
from scipy.sparse.linalg import spsolve

matrix_names = ['cfd1']

for matrice in matrix_names:
    mat = sio.loadmat('/matrix_path/%s' % matrice)
    A = mat['Problem']['A']
    A = A[0][0]
    matrix_size = np.shape(A)[0]
    xe = np.ones(matrix_size)
    b = A * xe

    start = time.time()
    X = spsolve(A, b)
    end = time.time()

    print("Times %.6f sec" % (end - start))
The slow function is
X = spsolve(A, b)
I found the problem.
SciPy is not linked against the MKL library by default on Windows.
I'm not sure whether it's integrated on macOS; however, using Anaconda on Windows (which ships SciPy built against MKL), the Python file executes as fast as on macOS.
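To check which BLAS/LAPACK build is actually in use (my suggestion, not part of the original answer; the output format varies across versions and platforms):

import numpy as np
import scipy

np.show_config()
scipy.show_config()
# look for 'mkl' vs 'openblas'/'accelerate' entries in the output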
I am using scipy.stats.rv_continuous (v0.19.0) in order to create random values from a custom probability distribution. The code I am using looks as follows (using a Gaussian for debugging purposes):
from scipy.stats import rv_continuous
import numpy as np
import resource
import scipy
import sys

print "numpy version: {}".format(np.version.full_version)
print "Scipy version: {}".format(scipy.version.full_version)
print "Python {}".format(sys.version)

class gaussian_gen(rv_continuous):
    "Gaussian distribution"
    def _pdf(self, x):
        return np.exp(-x**2 / 2.) / np.sqrt(2.0 * np.pi)

def print_mem():
    mem = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print 'Memory usage: %s (kb)' % mem

print_mem()
gaussian = gaussian_gen(name='gaussian')
print_mem()
values = gaussian.rvs(size=1000)
print_mem()
values = gaussian.rvs(size=5000)
print_mem()
Which outputs:
numpy version: 1.12.0
Scipy version: 0.19.0
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609]
Memory usage: 69672 (kb)
Memory usage: 69672 (kb)
Memory usage: 426952 (kb)
Memory usage: 2215576 (kb)
As you can see, the memory consumption of this snippet seems really unreasonable. I have found this question, but it is slightly different since I am not creating a new class instance in a loop.
I thought I was using rv_continuous correctly, but I cannot see why it consumes such enormous amounts of memory. How do I use it correctly? Ideally, I would like a solution which I can actually call in a loop, but one step at a time.
I am using numpy and my model involves intensive matrix-matrix multiplication.
To speed up, I use OpenBLAS multi-threaded library to parallelize the numpy.dot function.
My setting is as follows,
OS: CentOS 6.2 server, #CPUs = 12, MEM = 96GB
Python version: Python 2.7.6
numpy: numpy 1.8.0
BLAS: OpenBLAS + IntelMKL
$ OMP_NUM_THREADS=8 python test_mul.py
The code, which I took from https://gist.github.com/osdf/, is below.
test_mul.py:
import numpy
import sys
import timeit

try:
    import numpy.core._dotblas
    print 'FAST BLAS'
except ImportError:
    print 'slow blas'

print "version:", numpy.__version__
print "maxint:", sys.maxint
print

x = numpy.random.random((1000, 1000))

setup = "import numpy; x = numpy.random.random((1000,1000))"
count = 5

t = timeit.Timer("numpy.dot(x, x.T)", setup=setup)
print "dot:", t.timeit(count)/count, "sec"
when I use OMP_NUM_THREADS=1 python test_mul.py, the result is
dot: 0.200172233582 sec
OMP_NUM_THREADS=2
dot: 0.103047609329 sec
OMP_NUM_THREADS=4
dot: 0.0533880233765 sec
Things go well.
However, when I set OMP_NUM_THREADS=8, the code starts to work only occasionally.
Sometimes it works, sometimes it does not even run and gives me core dumps.
When OMP_NUM_THREADS > 10, the code seems to break all the time.
I am wondering what is happening here. Is there something like a maximum number of threads that each process can use? Can I raise that limit, given that I have 12 CPUs in my machine?
Thanks
Firstly, I don't really understand what you mean by 'OpenBLAS + IntelMKL'. Both of those are BLAS libraries, and numpy should only link to one of them at runtime. You should probably check which of these two numpy is actually using. You can do this by calling:
$ ldd <path-to-site-packages>/numpy/core/_dotblas.so
Update: numpy/core/_dotblas.so was removed in numpy v1.10, but you can check the linkage of numpy/core/multiarray.so instead.
For example, I link against OpenBLAS:
...
libopenblas.so.0 => /opt/OpenBLAS/lib/libopenblas.so.0 (0x00007f788c934000)
...
If you are indeed linking against OpenBLAS, did you build it from source? If you did, you should see that in the Makefile.rule there is a commented option:
...
# You can define maximum number of threads. Basically it should be
# less than actual number of cores. If you don't specify one, it's
# automatically detected by the script.
# NUM_THREADS = 24
...
By default OpenBLAS will try to detect the maximum number of threads to use automatically, but you could try uncommenting and editing this line yourself if it is not being detected correctly.
Also, bear in mind that you will probably see diminishing returns in terms of performance from using more threads. Unless your arrays are very large it is unlikely that using more than 6 threads will give much of a performance boost because of the increased overhead involved in thread creation and management.
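If rebuilding OpenBLAS is inconvenient, the thread count can also be capped at runtime through environment variables (a sketch of mine, not from the original answer; the variables must be set before numpy is first imported):

import os
os.environ['OPENBLAS_NUM_THREADS'] = '6'  # cap for OpenBLAS builds
os.environ['OMP_NUM_THREADS'] = '6'       # cap for OpenMP-based builds

import numpy
import timeit

x = numpy.random.random((1000, 1000))
print "dot:", timeit.timeit(lambda: numpy.dot(x, x.T), number=5) / 5, "sec"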