Fast computation of the logarithm of the negative binomial probability mass function in Python

I am implementing an MCMC procedure, in which the most time-consuming part is calculating the logarithm of the negative binomial probability mass function (with matrices as arguments). The likelihood is computed in every iteration of the procedure for new values of the parameters.
I wrote my own function, which is faster than the built-in scipy nbinom.logpmf.
import numpy as np
import scipy.special as sc
from scipy.stats import nbinom

def my_logpmf_nb(x, n, p):
    """Logarithm of the negative binomial probability mass function."""
    coeff = sc.gammaln(n+x) - sc.gammaln(x+1) - sc.gammaln(n)
    return coeff + n*np.log(p) + sc.xlog1py(x, -p)
N = 20
M = 8
p = np.random.uniform(0,1,(N,M))
r = np.abs(np.random.normal(10,10, (N,M)))
matrix = np.random.negative_binomial(r,p)
%timeit -n 1000 my_logpmf_nb(matrix, r, p)
16.4 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit -n 1000 nbinom.logpmf(matrix, r, p)
62.7 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I tried to optimize further with Cython, but I failed completely (the function implemented in Cython is much slower).
from libc.math cimport log, log1p
import numpy as np
cimport cython

cdef:
    float EULER_MAS = 0.577215664901532  # Euler-Mascheroni constant

@cython.cdivision(True)
def gammaln(float z, int n=100000):
    """Compute the log of the gamma function for a real positive float z.
    From: https://stackoverflow.com/questions/54850985/fast-algorithm-for-log-gamma-function"""
    cdef:
        float out = -EULER_MAS*z - log(z)
        int k
        float t
    for k in range(1, n):
        t = z / k
        out += t - log1p(t)
    return out

@cython.cdivision(True)
@cython.wraparound(False)
@cython.boundscheck(False)
def matrix_nb(double[:,:] x, double[:,:] nn, double[:,:] p):
    m = x.shape[0]
    n = x.shape[1]
    res = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            res[i,j] = gammaln(nn[i,j]+x[i,j]) - gammaln(x[i,j]+1) - gammaln(nn[i,j]) + nn[i,j]*log(p[i,j]) + x[i,j]*log1p(0-p[i,j])
    return res
matrix_bis = matrix.astype("float")
%timeit -n 1000 matrix_nb(matrix_bis, r, p)
49.9 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way? Or maybe numba would be useful?

Observe the original implementation:
def _logpmf(self, x, n, p):
    k = floor(x)
    combiln = (gamln(n+1) - (gamln(k+1) + gamln(n-k+1)))
    return combiln + special.xlogy(k, p) + special.xlog1py(n-k, -p)
You claim that the output is the same, but that's likely because you haven't tested certain edge cases. Your performance tests are not apples-to-apples, since the scipy implementation has more safety measures.
To improve performance you would need to drop down to a language that's closer to the metal, and possibly use GPGPU etc.

I tried to optimize further with Cython, but I failed completely (the function implemented in Cython is much slower).
The function gammaln of Scipy is mapped to the native lgam function from the Cephes math library. lgam calls lgam_sgn, which appears to be already highly optimized. My understanding of the code is that it makes use of a different numerical method that converges more efficiently (not all numerical methods are equal in speed and accuracy), thanks to a logarithmic version of Stirling's formula using a polynomial approximation of degree 4 (mixed with a rational approximation). The problem with your method is that it appears to require a lot of iterations to reach an accuracy similar to the Scipy function. A numerical method requiring ~100,000 iterations (your default n) is generally considered inefficient and should not be used in high-performance code unless there is nothing better.
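To illustrate why the numerical method matters: a truncated logarithmic Stirling series (the idea behind the Cephes approach) reaches near double precision for moderate z with just a couple of correction terms, instead of thousands of series iterations. A minimal sketch, keeping in mind that Cephes additionally handles small z with recurrences and rational approximations:
import math

def lgamma_stirling(z):
    # log Gamma(z) ~ (z - 1/2)*log(z) - z + 0.5*log(2*pi) + 1/(12*z) - 1/(360*z**3)
    return ((z - 0.5) * math.log(z) - z + 0.5 * math.log(2.0 * math.pi)
            + 1.0 / (12.0 * z) - 1.0 / (360.0 * z**3))

print(lgamma_stirling(10.0) - math.lgamma(10.0))  # error ~1e-8, with 2 correction terms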
Is there a way to implement this more efficiently? I would highly appreciate even a hint. Can I use Cython in a different way?
A simple way to improve the execution time is to use multiple threads to compute the result in parallel. You can easily do that with Cython's prange. Note however that the computation should be sufficiently long for threads to be useful (it takes time to create threads, certainly more than computing a few items of res). For more information, please read the documentation about it.
There is another way to speed the computation up: using SIMD instructions. That being said, this is far from easy in this case, especially if you are not a highly skilled developer. The idea is to compute multiple numbers at the same time using the same sequence of instructions in a single function call. This is hard to implement here because of the possible branch divergence between the different SIMD lanes. The same problem happens on GPUs (i.e. warp-level divergence). One solution is to use a numerical method that tends not to use branches or that can be implemented in a branchless way. Your method has this property, but the speedup provided by a SIMD implementation will certainly not be enough to outweigh the slow convergence. Besides, one also needs to implement log1p using SIMD instructions, with similar issues. AFAIK, the Intel math library should implement this function using SIMD instructions. Using it from Cython is certainly not trivial though.
Or maybe numba would be useful?
Certainly not here. Indeed, assuming you use the compilation flags -O3 -march=native (and possibly -ffast-math depending on your needs) to compile the C code, the results should be pretty close. The main difference is that Numba uses the LLVM compiler toolchain and adds some small overhead when accessing Numpy arrays or doing some specific math operations, while the C code produced by Cython is generally compiled with GCC. The C code can certainly be compiled with Clang (based on LLVM too). I expect GCC and LLVM to produce binaries that are about equally fast in this case.
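For what it is worth, here is a minimal sketch of the threaded idea in Numba, using the C library's lgamma (through math.lgamma) rather than the slow series; the function name and signature are illustrative, not from the original post. Keep in mind that for a 20x8 matrix the thread-creation overhead will likely dominate:
import math
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def logpmf_nb_threads(x, n, p):
    # rows are computed in parallel; each element uses the native lgamma
    rows, cols = x.shape
    res = np.empty((rows, cols))
    for i in nb.prange(rows):
        for j in range(cols):
            res[i, j] = (math.lgamma(n[i, j] + x[i, j])
                         - math.lgamma(x[i, j] + 1.0)
                         - math.lgamma(n[i, j])
                         + n[i, j] * math.log(p[i, j])
                         + x[i, j] * math.log1p(-p[i, j]))
    return res

res = logpmf_nb_threads(matrix.astype(np.float64), r, p)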

Related

What is the fastest way to select multiple elements from a numpy array?

I have a small ML library where I want to select elements in an array with a mask from a different array. I want to filter my data into groups depending on which prototype they belong to:
for neighbor_id in np.unique(nearest_neighbors):
    samples = data[nearest_neighbors == neighbor_id]
    # some function
This line takes 95% of my total runtime.
There are some questions from 8 years ago that had this problem too:
Indexing numpy record arrays is very slow
But I couldn't get the np.take() solution from there to work for my use case; maybe there are more recent solutions?
Edit:
Little benchmark
import numpy as np
from timeit import timeit
# create test data
rng = np.random.default_rng(seed=42)
neighbors = rng.integers(0, 200, 100000)
data = rng.random(size=(100000, 800))
def my_solution():
    for neighbor_id in np.unique(neighbors):
        samples = data[neighbors == neighbor_id]

def jeromes_solution():
    index = np.argsort(neighbors)
    groups, offsets = np.unique(neighbors[index], return_index=True)
    for i in range(groups.size):
        neighbor_id = groups[i]
        group_start = offsets[i]
        group_end = offsets[i+1] if i+1 < groups.size else index.size
        group_index = index[group_start:group_end]
        samples = data[group_index]
%%timeit
my_solution()
>>> 222 ms ± 8.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
jeromes_solution()
>>> 182 ms ± 8.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For larger dimensions, the difference becomes very small. Most of the time is still spent in samples = data[group_index]. Maybe we should sort the array to match the groups first?
This approach is inefficient because it iterates over nearest_neighbors many times while this is not needed. In the worst case, the running time is quadratic in len(nearest_neighbors), which is really bad for large arrays.
A generic solution to this problem is to use a group-by strategy. Unfortunately, Numpy does not implement such a feature (which many users have been requesting for a long time). Pandas does. Using Numpy, you can compute it manually with a sort-based approach. Here is the idea:
index = np.argsort(nearest_neighbors)
groups, offsets = np.unique(nearest_neighbors[index], return_index=True)

for i in range(groups.size):
    neighbor_id = groups[i]
    group_start = offsets[i]
    group_end = offsets[i+1] if i+1 < groups.size else index.size
    group_index = index[group_start:group_end]
    samples = data[group_index]
This solution runs in quasi-linear time (i.e. O(n log n)), as opposed to quadratic time (i.e. O(n²)).
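For comparison, with Pandas the group-by can be written directly. A sketch, assuming nearest_neighbors and data are as in the question; grouping a Series of row indices avoids copying the wide data array into a DataFrame:
import numpy as np
import pandas as pd

row_ids = pd.Series(np.arange(len(data)))
for neighbor_id, idx in row_ids.groupby(nearest_neighbors):
    samples = data[idx.to_numpy()]
    # some function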
Update: profiling & possible optimizations
First of all, using a recent version of Numpy tends to result in a faster execution (thanks to a better use of SIMD units). That being said, it turns out the implementation of data[group_index] is surprisingly inefficient on Windows compared to Linux. Indeed, here are the results on a machine with an i5-9600KF processor and a 40 GiB/s RAM (Numpy 1.22.4 is used on both Linux and Windows):
Linux:
- Initial code: 77 ms
- Optimized code: 62 ms <---
Windows:
- Initial code: ~210 ms
- Optimized code: 166 ms <---
Optimal time: >= 15 ms
On Windows, a profiling analysis of the optimized implementation shows that ~80% of the time is spent in the memcpy function called by Numpy. Surprisingly, 15-20% of the time is spent in the free_base function, certainly caused by many temporary Numpy arrays being freed. Overall, the RAM throughput is 12-16 GiB/s, and most of the time is lost writing data to RAM although this is not required here.
On Linux, a profiling analysis of the optimized implementation shows that >80% of the time is spent in the Numpy function __memmove_avx_unaligned_erms executed by data[group_index]. More specifically, the biggest part of the time is spent in the following assembly loop:
%time | instructions
--------------------------------------------------
2,30 │1a0:┌─→vmovdqu (%rsi),%ymm1
23,30 │ │ vmovdqu 0x20(%rsi),%ymm2
3,26 │ │ vmovdqu 0x40(%rsi),%ymm3
31,26 │ │ vmovdqu 0x60(%rsi),%ymm4
0,53 │ │ sub $0xffffffffffffff80,%rsi
1,62 │ │ vmovdqa %ymm1,(%rdi)
4,57 │ │ vmovdqa %ymm2,0x20(%rdi)
2,79 │ │ vmovdqa %ymm3,0x40(%rdi)
2,33 │ │ vmovdqa %ymm4,0x60(%rdi)
0,48 │ │ sub $0xffffffffffffff80,%rdi
0,01 │ │ cmp %rdi,%rdx
0,39 │ └──ja 1a0
This loop is particularly efficient (although non-temporal instructions are not used): it reads data into 256-bit AVX SIMD registers and stores the result in the temporary array. Most of the time is spent reading data from memory. At first glance, the writes are not so expensive because the resulting array fits in the L3 cache on my machine.
Another 15% of the time spent in this function lies in two other instructions executed before these, which also load data from memory. They are a bit expensive because of the accesses to random lines of data.
Overall, the difference between Windows and Linux comes from the way Numpy is compiled. AFAIK, on Windows, the Microsoft compiler (MSVC) is used to build Numpy and it makes use of a slow memcpy function from the Microsoft standard library (CRT), while on Linux, GCC generates a relatively good SIMD-based loop as opposed to a call to the glibc memcpy. Numpy is not really responsible for the slowdown: the Microsoft MSVC/CRT are the main issue here.
In practice, Numpy cannot saturate the RAM sequentially using only one core because of the way my CPU is designed. It can only read data from RAM at 28 GiB/s sequentially, while this is 38-40 GiB/s in parallel. This gap is bigger on server-side computing nodes, so using multiple cores should help. This solution may not be possible depending on what you do with samples in the loop.
Moreover, writing data to samples while reading data from RAM tends to cause cache misses on my machine, resulting in samples being written back to RAM (slower than the L3). This problem can be fixed by computing lines of data on the fly (so nothing is written to the L3, only the L1 cache). Numba and Cython can be used to do that efficiently, though again this may not be possible depending on what you do with samples in the loop. This is the best optimization on my machine: combined with the previous one, it results in an 18 ms execution time. Surprisingly, using only multiple threads (with Numba) does not make the execution faster, apparently because it causes more L3 cache misses, resulting in more data being written back to the (already saturated) RAM.
There is not much left to do to speed up this code sequentially. Writing the data to the L3 cache alone takes 11 ms on my machine, so the best execution time of this method is about 26 ms.
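To illustrate the on-the-fly idea: if the per-group computation is a reduction, the rows never need to be materialized in a temporary samples array. A sketch assuming the downstream computation is a per-group mean (adapt the inner loop to your real function):
import numba as nb
import numpy as np

@nb.njit(parallel=True)
def group_means(data, index, offsets):
    # offsets has one trailing extra entry equal to index.size
    n_groups = offsets.size - 1
    out = np.zeros((n_groups, data.shape[1]))
    for g in nb.prange(n_groups):
        count = offsets[g + 1] - offsets[g]
        for k in range(offsets[g], offsets[g + 1]):
            row = index[k]
            for c in range(data.shape[1]):
                out[g, c] += data[row, c]  # accumulate instead of materializing samples
        if count > 0:
            for c in range(data.shape[1]):
                out[g, c] /= count
    return out

# index/offsets from the sort-based group-by above
index = np.argsort(nearest_neighbors)
_, offsets = np.unique(nearest_neighbors[index], return_index=True)
offsets = np.append(offsets, index.size)
means = group_means(data, index, offsets)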
Note that the np.unique call with the above parameters is apparently not supported by Numba yet, so it must be done with Numpy in a separate function. That being said, one can compute a group-by using a bucket strategy (or a hash map, depending on the real-world input values). This method can be combined with on-the-fly computation of samples, assuming the computation using samples can be done incrementally.
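Here is a sketch of the bucket strategy in Numba (bucket_group_index is a hypothetical helper; it assumes the ids are small non-negative integers). It is essentially a counting sort, building index and offsets in O(n) without calling np.unique:
import numba as nb
import numpy as np

@nb.njit
def bucket_group_index(labels, n_buckets):
    # pass 1: histogram of the labels
    counts = np.zeros(n_buckets, dtype=np.int64)
    for v in labels:
        counts[v] += 1
    # exclusive prefix sum: offsets[b] is where bucket b starts
    offsets = np.zeros(n_buckets + 1, dtype=np.int64)
    for b in range(n_buckets):
        offsets[b + 1] = offsets[b] + counts[b]
    # pass 2: scatter the row indices into their buckets (stable order)
    index = np.empty(labels.size, dtype=np.int64)
    pos = offsets[:-1].copy()
    for i in range(labels.size):
        b = labels[i]
        index[pos[b]] = i
        pos[b] += 1
    return index, offsets

index, offsets = bucket_group_index(nearest_neighbors, nearest_neighbors.max() + 1)
# rows of group b: data[index[offsets[b]:offsets[b+1]]]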
On Windows, using quite naive Numba code can be a good solution to overcome the inefficient Numpy implementation and thus improve the execution time significantly.
Put shortly, it may or may not be possible to perform this operation faster, depending on the exact target platform and the exact computation actually performed in your real-world code.

Efficient computation of entropy-like formula (sum(xlogx)) in Python

I'm looking for an efficient way to compute the entropy of vectors, without normalizing them and while ignoring any non-positive value.
Since the vectors aren't probability vectors, and shouldn't be normalized, I can't use scipy's entropy function.
So far I couldn't find a single numpy or scipy function to obtain this, and as a result my alternatives involve breaking the computation into two steps, which involves intermediate arrays and slows down the run time. If anyone can think of a single function for this computation it would be interesting.
Below is a timeit script for measuring the several alternatives that I tried. I'm using a pre-allocated array to avoid repeated allocations and deallocations during run time. It's possible to select which alternative to run by setting the value of func_code. I included the nansum version offered by one of the answers. The measurements on my 2019 MacBook Pro are:
matmul: 16.720187613
xlogy: 17.296380516
nansum: 20.059866123000003
import timeit
import numpy as np
from scipy import special

def matmul(arg):
    a, log_a = arg
    log_a.fill(0)
    np.log2(a, where=a > 0, out=log_a)
    return (a[:, None, :] @ log_a[..., None]).ravel()

def xlogy(arg):
    a, log_a = arg
    a[a < 0] = 0
    return np.sum(special.xlogy(a, a), axis=1) * (1/np.log(2))

def nansum(arg):
    a, log_a = arg
    return np.nansum(a * np.log2(a, out=log_a), axis=1)

def setup():
    a = np.random.rand(20, 1000) - 0.1
    log = np.empty_like(a)
    return a, log

setup_code = """
from __main__ import matmul, xlogy, nansum, setup
data = setup()
"""
func_code = "matmul(data)"
print(timeit.timeit(func_code, setup=setup_code, number=100000))
On my machine the computation of the logarithms takes about 80% of the time of matmul, so it is definitely the bottleneck, and optimizing the other functions will result in a negligible speedup.
The bad news is that the default implementation of np.log is not yet optimized on most platforms. Indeed, it is not vectorized by default, except on recent x86 Intel processors supporting AVX-512 (i.e. basically Skylake processors on servers and Ice Lake processors on PCs, but not the recent Alder Lake). This means the computation could be significantly faster once vectorized. AFAIK, the closed-source SVML library does support AVX/AVX2 and could speed it up (on x86-64 processors only). SVML is supported by Numexpr and Numba, which can be faster because of that, assuming you have access to the non-free SVML, which is part of the Intel tools often available on HPC machines (e.g. MKL, oneAPI, etc.).
If you do not have access to the SVML, there are two possible remaining options:
Implement your own optimized SIMD log2 function, which is possible but hard since it requires a good understanding of the hardware SIMD units and certainly requires writing C or Cython code. This solution consists of computing the log2 function as an n-degree polynomial approximation (it can be exact to 1 ULP with a big n, though one generally does not need that). Naive approximations (e.g. n=1) are much simpler to implement but often too inaccurate for scientific use. See the sketch after this list.
Implement a multi-threaded log computation typically using Numba/Cython. This is a desperate solution as multithreading can slow things down if the input data is not large enough.
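To give an idea of the first option, here is the gist of the polynomial approach in plain NumPy (a real SIMD version would be written in C/Cython). frexp reduces any positive float to mantissa * 2**exponent with the mantissa in [0.5, 1), so only that small interval needs to be approximated; the degree-5 least-squares fit below is an illustrative choice, accurate to roughly 1e-5:
import numpy as np

# fit a polynomial to log2 once, on the reduced interval [0.5, 1)
xs = np.linspace(0.5, 1.0, 1001)
coeffs = np.polyfit(xs, np.log2(xs), 5)

def approx_log2(a):
    m, e = np.frexp(a)               # a = m * 2**e with 0.5 <= m < 1
    return np.polyval(coeffs, m) + e

a = np.random.rand(20, 1000) + 0.1
print(np.max(np.abs(approx_log2(a) - np.log2(a))))  # max approximation error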
Here is an example of multi-threaded Numba solution:
import numba as nb
import numpy as np

@nb.njit('(UniTuple(f8[:,::1],2),)', parallel=True)
def matmul(arg):
    a, log_a = arg
    result = np.empty(a.shape[0])
    for i in nb.prange(a.shape[0]):
        s = 0.0
        for j in range(a.shape[1]):
            if a[i, j] > 0:
                s += a[i, j] * np.log2(a[i, j])
        result[i] = s
    return result
This is about 4.3 times faster on my 6-core PC (200 µs vs 46.4 µs). However, you should be careful if you run this on a server with many cores on such a small dataset, as it can actually be slower on some platforms.
Having np.log2 of negative numbers (or zero) just gives a runtime warning and sets those values to np.nan, which is probably the best way to deal with them. If you don't want them to pollute your sum, just use
np.nansum(v_i*np.log2(v_i))
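If you also want to silence the runtime warning, it can be suppressed locally; a small sketch (v_i is assumed here to be a 1-D vector with possible non-positive entries):
import numpy as np

v_i = np.random.rand(1000) - 0.1  # contains some non-positive values

with np.errstate(divide='ignore', invalid='ignore'):
    entropy_like = np.nansum(v_i * np.log2(v_i))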

Numba: when to use nopython=True?

I have the following setup:
import numpy as np
import matplotlib.pyplot as plt
import timeit
import numba

@numba.jit(nopython=True, cache=True)
def f(x):
    summ = 0
    for i in x:
        summ += i
    return summ

@numba.jit(nopython=True)
def g21(N, locs):
    rvs = np.random.normal(loc=locs, scale=locs, size=N)
    res = f(rvs)
    return res

@numba.jit(nopython=False)
def g22(N, locs):
    rvs = np.random.normal(loc=locs, scale=locs, size=N)
    res = f(rvs)
    return res
g21 and g22 are exactly the same function, except that one has nopython=True and the other nopython=False.
Now I give them an input. If locs is a scalar, then Numba should be able to compile everything, since it supports numpy.random.normal() with this signature. However, if locs is an array, Numba does not support this signature and should fall back to the Python interpreter.
I run this first just to compile the functions
N = 10_000
g22(N, 3)
g22(N, np.linspace(0,1,N))
g21(N, 3)
# g21(N, np.linspace(0,1,N)) # returns an error
Now I run a speed comparison
%timeit g21(N, 3)
%timeit g22(N, 3)
%timeit g22(N, np.linspace(0,1,N))
which returns
274 µs ± 3.43 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
270 µs ± 5.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
421 µs ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
It makes sense that g22(N, np.linspace(0,1,N)) is slowest, since it goes back to the Python interpreter.
However, what I don't understand is that g21(N, 3) is roughly the same speed as g22(N, 3), even though one has nopython=True and the other not.
But g22 has the big advantage that it can also take an array argument, as in g22(N, np.linspace(0,1,N)), so it's more versatile, while at the same time there is no speed penalty to having nopython=False.
So my questions are:
in this case, what is the use of using nopython=True, if a function with nopython=False achieves same speed?
in which specific case is nopython=True better than nopython=False?
in this case, what is the use of using nopython=True, if a function with nopython=False achieves same speed?
in which specific case is nopython=True better than nopython=False?
The documentation states:
Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython=True.
Note that Numba will try to compile the code to a native binary in both modes. However, nopython mode produces an error when this is not possible, while the other produces a warning and causes fallback code to be used.
For some applications, performance can be critical, so you really do not want the fallback code to be called. This is the case for high-performance applications, for example. Having an error in this case is better than having code that runs for days instead of a few minutes on an expensive machine (like a supercomputer or a computing server). Using a different version of Numba can silently cause a fallback on some machines due to features not being supported. I personally always use the nopython mode to prevent such cases (as the fallback code is generally too slow to be useful) and I consider the object mode a bit useless. Put shortly, nopython=True offers stronger guarantees about performance.
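To make the difference concrete, here is a sketch based on the question's own functions. With nopython=True, the unsupported signature fails loudly at compile time; with nopython=False, older Numba versions emit a NumbaWarning and silently fall back to object mode (recent versions deprecate this implicit fallback, so you may need forceobj=True there):
import numba
import numpy as np

@numba.jit(nopython=True)
def g_strict(N, locs):
    # array-valued loc/scale is not supported in nopython mode
    return np.random.normal(loc=locs, scale=locs, size=N)

@numba.jit(nopython=False)
def g_lenient(N, locs):
    return np.random.normal(loc=locs, scale=locs, size=N)

N = 10_000
g_lenient(N, np.linspace(0, 1, N))  # NumbaWarning, falls back to object mode
g_strict(N, np.linspace(0, 1, N))   # raises a TypingError at compile time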
Edit: this answer is wrong, see comment below and read accepted answer instead
Well, I have continued using nopython=False, and it seems that this causes a few more compilations. It's not a very scientific analysis, but my function seems to be recompiled sometimes when I change various parameters, whereas with nopython=True it never got recompiled once it was compiled. So that seems to be a potential difference.

Performance of the linear sum assignment algorithm [duplicate]

Since an assignment problem can be posed in the form of a single matrix, I am wondering if NumPy has a function to solve such a matrix. So far I have found none. Maybe one of you guys know if NumPy/SciPy has an assignment-problem-solve function?
Edit: In the meantime I have found a Python (not NumPy/SciPy) implementation at http://software.clapper.org/munkres/. Still, I suppose a NumPy/SciPy implementation could be much faster, right?
There is now a numpy implementation of the Munkres algorithm in scikit-learn under sklearn/utils/linear_assignment_.py; its only dependency is numpy. I tried it with some approximately 20x20 matrices, and it seems to be about 4 times as fast as the one linked in the question. cProfile shows 2.517 seconds vs 9.821 seconds for 100 iterations.
I was hoping that the newer scipy.optimize.linear_sum_assignment would be fastest, but (perhaps not surprisingly) the Cython library (which does not have pip support) is significantly faster, at least for my use case:
UPDATE: using munkres v1.1.2 and scipy v1.5.0 achieves the following results:
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0); c = np.random.rand(20,30)" "a,b = linear_sum_assignment(c)"
10000 loops, best of 5: 32.8 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0); c = np.random.rand(20,30); m = Munkres()" "a = m.compute(c)"
100 loops, best of 5: 2.41 msec per loop
$ python -m timeit -s "from scipy.optimize import linear_sum_assignment; import numpy as np; np.random.seed(0);" "c = np.random.rand(20,30); a,b = linear_sum_assignment(c)"
5000 loops, best of 5: 51.7 usec per loop
$ python -m timeit -s "from munkres import Munkres; import numpy as np; np.random.seed(0)" "c = np.random.rand(20,30); m = Munkres(); a = m.compute(c)"
10 loops, best of 5: 26 msec per loop
No, NumPy contains no such function. Combinatorial optimization is outside of NumPy's scope. It may be possible to do it with one of the optimizers in scipy.optimize but I have a feeling that the constraints may not be of the right form.
NetworkX probably also includes algorithms for assignment problems.
Yet another fast implementation, as already hinted by @Matthew: scipy.optimize has a function called linear_sum_assignment. From the docs:
The method used is the Hungarian algorithm, also known as the Munkres or Kuhn-Munkres algorithm.
https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.optimize.linear_sum_assignment.html
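For completeness, a minimal usage sketch (the cost matrix below is made up):
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j] is the cost of assigning worker i to job j
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

row_ind, col_ind = linear_sum_assignment(cost)
print(row_ind, col_ind)              # [0 1 2] [1 0 2]
print(cost[row_ind, col_ind].sum())  # minimal total cost: 1 + 2 + 2 = 5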
There is an implementation of the Munkres' algorithm as a python extension module which has numpy support. I've used it successfully on my old laptop. However, it does not work on my new machine - I assume there is a problem with "new" numpy versions (or 64bit arch).
As of version 2.4 (released 2019-10-16), NetworkX solves the problem through nx.algorithms.bipartite.minimum_weight_full_matching. At the time of writing, the implementation uses SciPy's scipy.optimize.linear_sum_assignment under the hood, so expect the same performance characteristics.
In addition to the solver in scipy.optimize.linear_sum_assignment already mentioned in some of the other answers, SciPy (as of 1.6.0) also comes with a sparsity-friendly solver in scipy.sparse.csgraph.min_weight_full_bipartite_matching.
In [2]: from scipy.sparse import random
In [3]: from scipy.sparse.csgraph import min_weight_full_bipartite_matching
In [4]: from scipy.optimize import linear_sum_assignment
In [15]: sparse = random(1000, 1000, density=0.01, format='csr')
In [16]: %timeit min_weight_full_bipartite_matching(sparse)
3.84 ms ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [17]: dense = sparse.toarray()
In [18]: %timeit linear_sum_assignment(dense)
18.8 ms ± 291 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Are NumPy's math functions faster than Python's?

I have a function defined by a combination of basic math functions (abs, cosh, sinh, exp, ...).
I was wondering if it makes a difference (in speed) to use, for example,
numpy.abs() instead of abs()?
Here are the timing results:
lebigot@weinberg ~ % python -m timeit 'abs(3.15)'
10000000 loops, best of 3: 0.146 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'from numpy import abs as nabs' 'nabs(3.15)'
100000 loops, best of 3: 3.92 usec per loop
numpy.abs() is slower than abs() because it also handles Numpy arrays: it contains additional code that provides this flexibility.
However, Numpy is fast on arrays:
lebigot@weinberg ~ % python -m timeit -s 'a = [3.15]*1000' '[abs(x) for x in a]'
10000 loops, best of 3: 186 usec per loop
lebigot@weinberg ~ % python -m timeit -s 'import numpy; a = numpy.empty(1000); a.fill(3.15)' 'numpy.abs(a)'
100000 loops, best of 3: 6.47 usec per loop
(PS: '[abs(x) for x in a]' is slower in Python 2.7 than the better map(abs, a), which is about 30% faster, yet still much slower than NumPy.)
Thus, numpy.abs() does not take much more time for 1000 elements than for 1 single float!
You should use NumPy functions to deal with NumPy types and regular Python functions to deal with regular Python types.
The worst performance usually occurs when mixing Python builtins with NumPy, because of type conversions. Those type conversions have been optimized lately, but it's still often better to avoid them. Of course, your mileage may vary, so use profiling tools to figure it out.
Also consider using tools like Cython or writing a C module if you want to optimize your program further. Or consider not using Python when performance matters.
But when your data has been put into a NumPy array, NumPy can be really fast at computing over a bunch of data.
In fact, on a NumPy array, the built-in abs calls NumPy's implementation via __abs__; see Why built-in functions like abs works on numpy array? So, in theory, there shouldn't be much performance difference.
import timeit
import numpy as np

x = np.random.standard_normal(10000)

def pure_abs():
    return abs(x)

def numpy_abs():
    return np.abs(x)

n = 10000
t1 = timeit.timeit(pure_abs, number=n)
print('Pure Python abs:', t1)
t2 = timeit.timeit(numpy_abs, number=n)
print('Numpy abs:', t2)
Pure Python abs: 0.435754060745
Numpy abs: 0.426516056061
