I have the following setup:
import numpy as np
import matplotlib.pyplot as plt
import timeit
import numba

@numba.jit(nopython=True, cache=True)
def f(x):
    summ = 0
    for i in x:
        summ += i
    return summ

@numba.jit(nopython=True)
def g21(N, locs):
    rvs = np.random.normal(loc=locs, scale=locs, size=N)
    res = f(rvs)
    return res

@numba.jit(nopython=False)
def g22(N, locs):
    rvs = np.random.normal(loc=locs, scale=locs, size=N)
    res = f(rvs)
    return res
g21 and g22 are exactly the same function; the only difference is that one has nopython=True and the other nopython=False.
Now I give them an input. If locs is a scalar, Numba should be able to compile everything, since it supports numpy.random.normal() with this signature. However, if locs is an array, Numba does not support this signature and should fall back to the Python interpreter.
I run this first just to compile the functions
N = 10_000
g22(N, 3)
g22(N, np.linspace(0,1,N))
g21(N, 3)
# g21(N, np.linspace(0,1,N)) # returns an error
Now I run a speed comparison
%timeit g21(N, 3)
%timeit g22(N, 3)
%timeit g22(N, np.linspace(0,1,N))
which returns
274 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
270 µs ± 5.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
421 µs ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
It makes sense that g22(N, np.linspace(0,1,N)) is slowest, since it falls back to the Python interpreter.
However, what I don't understand is that g21(N, 3) is roughly the same speed as g22(N, 3), even though one has nopython=True and the other does not.
But g22 has the big advantage that it also accepts an array argument, as in g22(N, np.linspace(0,1,N)), so it is more versatile, while at the same time there is no speed penalty for having nopython=False.
So my questions are:
In this case, what is the use of nopython=True, if a function with nopython=False achieves the same speed?
In which specific cases is nopython=True better than nopython=False?
The documentation states:
Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython=True.
Note that Numba will try to compile the code to a native binary in both modes. However, nopython mode produces an error when this is not possible, while object mode produces a warning and causes fallback code to be used.
For some applications, performance can be critical, and you really do not want the fallback code to be called. This is the case for high-performance applications, for example. Having an error in this case is better than having code that runs for days instead of a few minutes on an expensive machine (like a supercomputer or a computing server). Using a different version of Numba can silently cause a fallback on some machines due to a feature not being supported. I personally always use nopython mode to prevent such cases (as the fallback code is generally too slow to be useful), and I consider object mode a bit useless. Put shortly, nopython mode offers stronger guarantees about performance.
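To make that guarantee concrete, here is a minimal sketch (the function strict_sum is mine, not from the question) of the failure behavior: with nopython=True, an unsupported argument type raises an error at compile time instead of silently running slower.

import numpy as np
import numba

@numba.jit(nopython=True)
def strict_sum(x):
    total = 0.0
    for v in x:
        total += v
    return total

# Supported argument type: compiles and runs in nopython mode.
print(strict_sum(np.ones(10)))  # 10.0

# Unsupported argument type: nopython=True fails fast instead of
# silently falling back to slow object mode.
try:
    strict_sum([1, "a"])
except Exception as exc:
    print(type(exc).__name__)  # e.g. TypingError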
Edit: this answer is wrong; see the comments below and read the accepted answer instead.
Well, I have continued using nopython=False, and it seems that this causes more recompilation. It's not a very scientific analysis, but my function is sometimes recompiled when I change various parameters, whereas with nopython=True it never got recompiled once it was compiled. So that seems to be a potential difference.
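One way to check this observation empirically (a sketch, not from the original post): a jitted function's dispatcher records every compiled specialization in its .signatures attribute, so you can watch when a new compilation actually happens.

import numpy as np
import numba

@numba.jit(nopython=True)
def total(x):
    s = 0.0
    for v in x:
        s += v
    return s

total(np.arange(3, dtype=np.float64))
total(np.arange(3, dtype=np.int64))
print(total.signatures)  # one entry per distinct argument type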
I am implementing an MCMC procedure, in which the most time-consuming part is calculating the logarithm of the negative binomial probability mass function (with matrices as arguments). The likelihood is computed in every iteration of the procedure for new parameter values.
I wrote my own function, which is faster than the built-in scipy nbinom.logpmf.
import numpy as np
import scipy.special as sc
from scipy.stats import nbinom
def my_logpmf_nb(x, n, p):
    """Logarithm of the negative binomial probability mass function."""
    coeff = sc.gammaln(n+x) - sc.gammaln(x+1) - sc.gammaln(n)
    return coeff + n*np.log(p) + sc.xlog1py(x, -p)
N = 20
M = 8
p = np.random.uniform(0,1,(N,M))
r = np.abs(np.random.normal(10,10, (N,M)))
matrix = np.random.negative_binomial(r,p)
%timeit -n 1000 my_logpmf_nb(matrix, r, p)
16.4 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 nbinom.logpmf(matrix, r, p)
62.7 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I tried to optimize further with Cython, but I failed completely (the function implemented in Cython is much slower).
from libc.math cimport log, log1p
import numpy as np
cimport cython

cdef:
    float EULER_MAS = 0.577215664901532  # Euler-Mascheroni constant

@cython.cdivision(True)
def gammaln(float z, int n=100000):
    """Compute the log of the gamma function for a real positive float z.
    From: https://stackoverflow.com/questions/54850985/fast-algorithm-for-log-gamma-function
    """
    cdef:
        float out = -EULER_MAS*z - log(z)
        int k
        float t
    for k in range(1, n):
        t = z / k
        out += t - log1p(t)
    return out

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def matrix_nb(double[:,:] x, double[:,:] nn, double[:,:] p):
    m = x.shape[0]
    n = x.shape[1]
    res = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            res[i,j] = (gammaln(nn[i,j]+x[i,j]) - gammaln(x[i,j]+1) - gammaln(nn[i,j])
                        + nn[i,j]*log(p[i,j]) + x[i,j]*log1p(0-p[i,j]))
    return res
matrix_bis = matrix.astype("float")
%timeit -n 1000 matrix_nb(matrix_bis, r, p)
49.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way? Or maybe numba would be useful?
Observe the original implementation:
def _logpmf(self, x, n, p):
    k = floor(x)
    combiln = (gamln(n+1) - (gamln(k+1) + gamln(n-k+1)))
    return combiln + special.xlogy(k, p) + special.xlog1py(n-k, -p)
You claim that the output is the same, but that's likely because you haven't tested certain edge cases. Your performance tests are not apples-to-apples, since the scipy implementation has more safety measures.
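One way to check that claim is to probe both implementations on awkward inputs rather than only on well-behaved random draws. A hedged sketch, reusing my_logpmf_nb from the question (the probe triples are arbitrary edge-case candidates):

import numpy as np
from scipy.stats import nbinom

# zero count, non-integer x, and extreme parameters
for x, n, p in [(0.0, 5.0, 0.3), (2.5, 5.0, 0.3), (1e6, 5.0, 1e-9)]:
    print(x, n, p, my_logpmf_nb(x, n, p), nbinom.logpmf(x, n, p))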
To improve performance you would need to drop down to a language that's closer to the metal, and possibly use GPGPU etc.
I tried to optimize further with Cython, but I failed completely (the function implemented in Cython is much slower).
The gammaln function of Scipy maps to the lgam native function from the Cephes math library. lgam calls lgam_sgn, which appears to be already highly optimized. My understanding of the code is that it makes use of a different numerical method that converges more efficiently (not all numerical methods are equivalent in speed and accuracy), thanks to a logarithmic version of Stirling's formula using a polynomial approximation of degree 4 (mixed with a rational approximation). The problem with your method is that it appears to require a lot of iterations to reach an accuracy similar to that of the Scipy function. A numerical method requiring 1_000_000 iterations is generally considered inefficient (and should not be used in high-performance code unless there is nothing better).
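For reference, the logarithmic form of Stirling's series that such implementations build on is the standard expansion (quoted here for context, not taken from the Cephes source):

$\ln \Gamma(z) \approx \left(z - \tfrac{1}{2}\right)\ln z - z + \tfrac{1}{2}\ln(2\pi) + \frac{1}{12z} - \frac{1}{360z^{3}} + \cdots$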
Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way?
A simple way to improve the execution time is to use multiple threads to compute the result in parallel. You can easily do that with prange. Note however that the computation should be sufficiently long for threads to be useful (it takes time to create threads, certainly more than computing a few items of res). For more information, please read the documentation about that. A sketch of the same idea follows.
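For illustration, here is what the prange idea looks like with Numba rather than Cython (a hedged sketch of the same parallel pattern; math.lgamma stands in for gammaln, and the function name is mine):

import math
import numpy as np
import numba

@numba.njit(parallel=True)
def logpmf_nb_parallel(x, n, p):
    m, k = x.shape
    res = np.empty((m, k))
    for i in numba.prange(m):  # rows are distributed across threads
        for j in range(k):
            res[i, j] = (math.lgamma(n[i, j] + x[i, j])
                         - math.lgamma(x[i, j] + 1.0)
                         - math.lgamma(n[i, j])
                         + n[i, j] * math.log(p[i, j])
                         + x[i, j] * math.log1p(-p[i, j]))
    return res

With the arrays from the question, this would be called as logpmf_nb_parallel(matrix_bis, r, p).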
There is another way to speed the computation up: using SIMD instructions. That being said, this is far from easy in this case, especially if you are not a highly skilled developer. The idea is to compute multiple numbers at the same time using the same sequence of instructions in a single function call. This is hard to implement here because of the possible branch divergence between the different SIMD lanes. The same problem happens on GPUs (ie. warp-level divergence). One solution is to use a numerical method that tends not to use branches or that can be implemented in a branch-less way. Your method has this property, but the speed-up provided by a SIMD implementation will certainly not be enough to outweigh the slow convergence. Besides, one also needs to implement log1p using SIMD instructions, with similar issues. AFAIK, the Intel math library should implement this function using SIMD instructions. Using it from Cython is certainly not trivial though.
Or maybe numba would be useful?
Certainly not here. Indeed, assuming you use the compilation flags -O3 -march=native (and possibly -ffast-math depending on your needs) to compile the C code, the results should be pretty close. The main difference is that Numba uses the LLVM compiler toolchain and adds some small overhead when accessing Numpy arrays or doing some specific math operations, while the C code produced by Cython is generally compiled with GCC. The C code can certainly be compiled using Clang (based on LLVM too). I expect GCC and LLVM to produce binaries that are about equally fast in this case.
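For completeness, those flags would be passed to the Cython build roughly like this (a sketch; the module and file names are made up):

# setup.py -- hypothetical build script for the Cython module above
from setuptools import setup
from setuptools.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    "nb_logpmf",                # assumed module name
    sources=["nb_logpmf.pyx"],  # assumed file name
    extra_compile_args=["-O3", "-march=native"],  # add "-ffast-math" if acceptable
)
setup(ext_modules=cythonize([ext]))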
I have a small ML library where I want to select elements in one array using a mask computed from a different array. I want to filter my data into groups depending on which prototype they belong to:
for neighbor_id in np.unique(nearest_neighbors):
    samples = data[nearest_neighbors == neighbor_id]
    # some function
This line takes 95% of my total runtime.
There are some questions from 8 years ago that had this problem too:
Indexing numpy record arrays is very slow
But I couldn't get that solution with np.take() to work for my use case; maybe there are more recent solutions?
Edit:
Little benchmark
import numpy as np
from timeit import timeit
# create test data
rng = np.random.default_rng(seed=42)
neighbors = rng.integers(0, 200, 100000)
data = rng.random(size=(100000, 800))
def my_solution():
    for neighbor_id in np.unique(neighbors):
        samples = data[neighbors == neighbor_id]

def jeromes_solution():
    index = np.argsort(neighbors)
    groups, offsets = np.unique(neighbors[index], return_index=True)
    for i in range(groups.size):
        neighbor_id = groups[i]
        group_start = offsets[i]
        group_end = offsets[i+1] if i+1 < groups.size else index.size
        group_index = index[group_start:group_end]
        samples = data[group_index]
%%timeit
my_solution()
>>> 222 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
jeromes_solution()
>>> 182 ms ± 8.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For larger dimensions, the difference becomes very small. Most of the time is still spent in samples = data[group_index]. Maybe we should sort the array to match the groups first, as in the sketch below?
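A sketch of that idea (reusing neighbors and data from the benchmark): reorder the rows once, so that each group becomes a contiguous slice:

index = np.argsort(neighbors)
data_sorted = data[index]  # one big gather instead of one gather per group
groups, offsets = np.unique(neighbors[index], return_index=True)
bounds = np.append(offsets, index.size)

for i in range(groups.size):
    samples = data_sorted[bounds[i]:bounds[i+1]]  # contiguous view, no copy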
The original approach (one boolean mask per unique value) is inefficient because it iterates over nearest_neighbors many times when this is not needed. In the worst case, the runtime is quadratic in len(nearest_neighbors), which is really bad for large arrays.
A generic solution to this problem is to use a group-by strategy. Unfortunately, Numpy does not implement such a feature (it has been requested by many users for a long time). Pandas does. Using Numpy, you can compute it manually with a sort-based approach. Here is the idea:
index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    neighbor_id = groups[i]
    group_start = offsets[i]
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    group_index = index[group_start:group_end]
    samples = data[group_index]
This solution runs in quasi-linear time (i.e. O(n log n)), as opposed to quadratic time (i.e. O(n²)).
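Pandas, mentioned above, exposes this directly: groupby accepts a label array aligned with the rows. A minimal sketch with toy data:

import numpy as np
import pandas as pd

nearest_neighbors = np.array([2, 0, 1, 2, 0])
data = np.arange(10, dtype=float).reshape(5, 2)

# one pass; each iteration yields one group's rows
for neighbor_id, group in pd.DataFrame(data).groupby(nearest_neighbors):
    samples = group.to_numpy()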
Update: profiling & possible optimizations
First of all, using a recent version of Numpy tends to result in faster execution (thanks to better use of SIMD units). That being said, it turns out the implementation of data[group_index] is surprisingly inefficient on Windows compared to Linux. Indeed, here are the results on a machine with an i5-9600KF processor and 40 GiB/s RAM (Numpy 1.22.4 is used on both Linux and Windows):
Linux:
- Initial code: 77 ms
- Optimized code: 62 ms <---
Windows:
- Initial code: ~210 ms
- Optimized code: 166 ms <---
Optimal time: >= 15 ms
On Windows, a profiling analysis of the optimized implementation shows that ~80% of the time is spent in the memcpy function called by Numpy. Surprisingly, 15-20% of the time is spent in the free_base function, most likely caused by many temporary Numpy arrays being freed. Overall, the RAM throughput is 12-16 GiB/s, and most of the time is lost writing data to RAM, even though this is not required here.
On Linux, a profiling analysis of the optimized implementation shows that >80% of the time is spent in the Numpy function __memmove_avx_unaligned_erms executed by data[group_index]. More specifically, the biggest part of the time is spent in the following assembly loop:
%time | instructions
--------------------------------------------------
2,30 │1a0:┌─→vmovdqu (%rsi),%ymm1
23,30 │ │ vmovdqu 0x20(%rsi),%ymm2
3,26 │ │ vmovdqu 0x40(%rsi),%ymm3
31,26 │ │ vmovdqu 0x60(%rsi),%ymm4
0,53 │ │ sub $0xffffffffffffff80,%rsi
1,62 │ │ vmovdqa %ymm1,(%rdi)
4,57 │ │ vmovdqa %ymm2,0x20(%rdi)
2,79 │ │ vmovdqa %ymm3,0x40(%rdi)
2,33 │ │ vmovdqa %ymm4,0x60(%rdi)
0,48 │ │ sub $0xffffffffffffff80,%rdi
0,01 │ │ cmp %rdi,%rdx
0,39 │ └──ja 1a0
This loop is particularly efficient (although non-temporal instructions are not used): it reads data into 256-bit AVX SIMD registers and stores the result in the temporary array. Most of the time is spent reading data from memory. At first glance, the writes are not so expensive because the resulting array fits in the L3 cache on my machine.
15% of the time spent in this function lies in 2 other instructions executed just before, which also load data from memory. These are a bit expensive because of the accesses to random lines of data.
Overall, the difference between Windows and Linux comes from the way Numpy is compiled. AFAIK, on Windows the Microsoft compiler (MSVC) is used to build Numpy, and it makes use of a slow memcpy function from the Microsoft standard library (CRT); on Linux, GCC generates a relatively good SIMD-based loop instead of a call to the glibc memcpy. Numpy is not really responsible for the slowdown: the Microsoft MSVC/CRT combination is the main issue here.
In practice, Numpy cannot saturate the RAM sequentially using only 1 core because of the way my CPU is designed. It can only read data from RAM at 28 GiB/s sequentially, versus 38-40 GiB/s in parallel. This gap is bigger on server-side computing nodes, so using multiple cores should help. This solution may not be possible depending on what you do with samples in the loop.
Moreover, writing data to samples while reading data from RAM tends to cause cache misses on my machine, resulting in samples being written back to RAM (which is slower than the L3). This problem can be fixed by computing rows of data on the fly (so nothing is written to the L3, only the L1 cache). Numba and Cython can be used to do that efficiently. However, this may not be possible depending on what you do with samples in the loop. This is the best optimization on my machine: combined with the previous one, it results in an 18 ms execution time. Surprisingly, using multiple threads alone (with Numba) does not make the execution faster, apparently because it causes more L3 cache misses, resulting in more data being written back to the (already saturated) RAM.
There is not much more to do to speed up this code sequentially. Writing the data to the L3 cache sequentially also takes 11 ms on my machine (so the best execution time of this method is 26 ms).
Note that the np.unique call with the above parameters is apparently not yet supported by Numba, so it must be done with Numpy in a separate function. That being said, one can compute a group-by using a bucket strategy (or a hash map, depending on the real-world input values). This method can be combined with on-the-fly computation of samples, assuming the computation using samples can be done incrementally. A sketch of the bucket strategy follows.
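Here is a sketch of that bucket strategy (the function name is mine; it assumes the ids are small non-negative integers):

import numpy as np

def bucket_groupby(ids):
    # per-id counts give each bucket's extent; a single linear pass then
    # scatters every row index into its bucket (no sort needed)
    counts = np.bincount(ids)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    order = np.empty(ids.size, dtype=np.int64)
    cursor = offsets[:-1].copy()
    for i in range(ids.size):  # this loop is Numba-friendly
        g = ids[i]
        order[cursor[g]] = i
        cursor[g] += 1
    return offsets, order

ids = np.array([2, 0, 1, 2, 0])
offsets, order = bucket_groupby(ids)
for g in range(offsets.size - 1):
    rows = order[offsets[g]:offsets[g+1]]  # row indices belonging to group g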
On Windows, quite naive Numba code can be a good way to overcome the inefficient Numpy implementation and thus improve the execution time significantly.
In short, it may or may not be possible to perform this operation faster, depending on the exact target platform and the exact computation actually performed in your real-world code.
Since an assignment problem can be posed in the form of a single matrix, I am wondering if NumPy has a function to solve such a matrix. So far I have found none. Maybe one of you guys know if NumPy/SciPy has an assignment-problem-solve function?
Edit: In the meantime I have found a Python (not NumPy/SciPy) implementation at http://software.clapper.org/munkres/. Still, I suppose a NumPy/SciPy implementation could be much faster, right?
There is now a numpy implementation of the Munkres algorithm in scikit-learn under sklearn/utils/linear_assignment_.py. Its only dependency is numpy. I tried it with some approximately 20x20 matrices, and it seems to be about 4 times as fast as the one linked in the question. cProfile shows 2.517 seconds vs 9.821 seconds for 100 iterations.
I was hoping that the newer scipy.optimize.linear_sum_assignment would be fastest, but (perhaps not surprisingly) the Cython library (which does not have pip support) is significantly faster, at least for my use case:
UPDATE: using munkres v1.1.2 and scipy v1.5.0 achieves the following results:
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0); c = np.random.rand(20,30)" "a,b = linear_sum_assignment(c)"
10000 loops, best of 5: 32.8 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0); c = np.random.rand(20,30); m = Munkres()" "a = m.compute(c)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0);" "c = np.random.rand(20,30); a,b = linear_sum_assignment(c)"
5000 loops, best of 5: 51.7 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0)" "c = np.random.rand(20,30); m = Munkres(); a = m.compute(c)"
10 loops, best of 5: 26 msec per loop
No, NumPy contains no such function. Combinatorial optimization is outside of NumPy's scope. It may be possible to do it with one of the optimizers in scipy.optimize but I have a feeling that the constraints may not be of the right form.
NetworkX probably also includes algorithms for assignment problems.
Yet another fast implementation, as already hinted by @Matthew: scipy.optimize has a function called linear_sum_assignment. From the docs:
The method used is the Hungarian algorithm, also known as the Munkres or Kuhn-Munkres algorithm.
https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
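A quick usage sketch (the toy cost matrix is mine; the optimal assignment here has total cost 5):

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] = cost of assigning worker i to job j
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])
row_ind, col_ind = linear_sum_assignment(cost)
print(col_ind)                       # [1 0 2]
print(cost[row_ind, col_ind].sum())  # 5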
There is an implementation of the Munkres algorithm as a Python extension module which has numpy support. I've used it successfully on my old laptop. However, it does not work on my new machine - I assume there is a problem with "new" numpy versions (or the 64-bit architecture).
As of version 2.4 (released 2019-10-16), NetworkX solves the problem through nx.algorithms.bipartite.minimum_weight_full_matching. At the time of writing, the implementation uses SciPy's scipy.optimize.linear_sum_assignment under the hood, so expect the same performance characteristics.
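A usage sketch of that function (the graph construction and node labels are my own; requires networkx >= 2.4):

import networkx as nx

# weighted bipartite graph: workers 0-2 on one side, jobs "a"-"c" on the other
G = nx.Graph()
G.add_weighted_edges_from([
    (0, "a", 4), (0, "b", 1), (0, "c", 3),
    (1, "a", 2), (1, "b", 1), (1, "c", 5),
    (2, "a", 3), (2, "b", 2), (2, "c", 2),
])
matching = nx.algorithms.bipartite.minimum_weight_full_matching(G, top_nodes=[0, 1, 2])
print(matching)  # maps each node to its partner, e.g. {0: 'b', 1: 'a', 2: 'c', ...}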
In addition to the solver in scipy.optimize.linear_sum_assignment already mentioned in some of the other answers, SciPy (as of 1.6.0) also comes with a sparsity-friendly solver in scipy.sparse.csgraph.min_weight_full_bipartite_matching.
In [2]: from scipy.sparse import random
In [3]: from scipy.sparse.csgraph import min_weight_full_bipartite_matching
In [4]: from scipy.optimize import linear_sum_assignment
In [15]: sparse = random(1000, 1000, density=0.01, format='csr')
In [16]: %timeit min_weight_full_bipartite_matching(sparse)
3.84 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: dense = sparse.toarray()
In [18]: %timeit linear_sum_assignment(dense)
18.8 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I ran some simple tests on the abs() and fabs() functions, and I don't understand what the advantages of using fabs() are, if it:
1) is slower
2) works only on floats
3) throws an exception if used on a different type
In [1]: %timeit abs(5)
10000000 loops, best of 3: 86.5 ns per loop
In [3]: %timeit fabs(5)
10000000 loops, best of 3: 115 ns per loop
In [4]: %timeit abs(-5)
10000000 loops, best of 3: 88.3 ns per loop
In [5]: %timeit fabs(-5)
10000000 loops, best of 3: 114 ns per loop
In [6]: %timeit abs(5.0)
10000000 loops, best of 3: 92.5 ns per loop
In [7]: %timeit fabs(5.0)
10000000 loops, best of 3: 93.2 ns per loop
It's even slower on floats!
From where I am standing, the only advantage of using fabs() is to make your code more readable, because by using it you clearly state your intention of working with float/double values.
Is there any other use of fabs()?
From an email response from Tim Peters:
Why does math have an fabs function? Both it and the abs builtin function
wind up calling fabs() for floats. abs() is faster to boot.
Nothing deep -- the math module supplies everything in C89's standard
libm (+ a few extensions), fabs() is a std C89 libm function.
There isn't a clear (to me) reason why one would be faster than the
other; sounds accidental; math.fabs() could certainly be made faster
(as currently implemented (via math_1), it endures a pile of
general-purpose "try to guess whether libm should have set errno"
boilerplate that's wasted (there are no domain or range errors
possible for fabs())).
It seems there is no compelling reason to use fabs. Just use abs for virtually all purposes.
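A short demonstration of the practical differences (a sketch):

import math

print(abs(-5), math.fabs(-5))      # 5 5.0 -- fabs always returns a float
print(abs(-5.5), math.fabs(-5.5))  # 5.5 5.5

# abs() also handles complex numbers (returning the magnitude); fabs() does not
print(abs(3 + 4j))                 # 5.0
try:
    math.fabs(3 + 4j)
except TypeError:
    print("fabs rejects complex input")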
I personally had an issue with my GCC compiler in C++: when using abs, it always returned an integer and not a double, even when the result was a double. It was a really big issue for me at the time, because it did not occur to me that abs could be the problem (I mean, it is not obvious and easy to think that way). But I accidentally tried fabs and the issue was solved; now my program runs perfectly.