numpy.amax() will find the max value in an array, and numpy.amin() does the same for the min value. If I want to find both max and min, I have to call both functions, which requires passing over the (very big) array twice, which seems slow.
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
No. At the time of this writing, there is no such function. (And yes, if there were such a function, its performance would be significantly better than calling numpy.amin() and numpy.amax() successively on a large array.)
I don't think that passing over the array twice is a problem. Consider the following pseudo-code:
minval = array[0]
maxval = array[0]
for i in array:
if i < minval:
minval = i
if i > maxval:
maxval = i
While there is only 1 loop here, there are still 2 checks. (Instead of having 2 loops with 1 check each). Really the only thing you save is the overhead of 1 loop. If the arrays really are big as you say, that overhead is small compared to the actual loop's work load. (Note that this is all implemented in C, so the loops are more or less free anyway).
EDIT Sorry to the 4 of you who upvoted and had faith in me. You definitely can optimize this.
Here's some fortran code which can be compiled into a python module via f2py (maybe a Cython guru can come along and compare this with an optimized C version ...):
subroutine minmax1(a,n,amin,amax)
implicit none
!f2py intent(hidden) :: n
!f2py intent(out) :: amin,amax
!f2py intent(in) :: a
integer n
real a(n),amin,amax
integer i
amin = a(1)
amax = a(1)
do i=2, n
if(a(i) > amax)then
amax = a(i)
elseif(a(i) < amin) then
amin = a(i)
endif
enddo
end subroutine minmax1
subroutine minmax2(a,n,amin,amax)
implicit none
!f2py intent(hidden) :: n
!f2py intent(out) :: amin,amax
!f2py intent(in) :: a
integer n
real a(n),amin,amax
amin = minval(a)
amax = maxval(a)
end subroutine minmax2
Compile it via:
f2py -m untitled -c fortran_code.f90
And now we're in a place where we can test it:
import timeit
size = 100000
repeat = 10000
print timeit.timeit(
'np.min(a); np.max(a)',
setup='import numpy as np; a = np.arange(%d, dtype=np.float32)' % size,
number=repeat), " # numpy min/max"
print timeit.timeit(
'untitled.minmax1(a)',
setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
number=repeat), '# minmax1'
print timeit.timeit(
'untitled.minmax2(a)',
setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
number=repeat), '# minmax2'
The results are a bit staggering for me:
8.61869883537 # numpy min/max
1.60417699814 # minmax1
2.30169081688 # minmax2
I have to say, I don't completely understand it. Comparing just np.min versus minmax1 and minmax2 is still a losing battle, so it's not just a memory issue ...
notes -- Increasing size by a factor of 10**a and decreasing repeat by a factor of 10**a (keeping the problem size constant) does change the performance, but not in a seemingly consistent way which shows that there is some interplay between memory performance and function call overhead in python. Even comparing a simple min implementation in fortran beats numpy's by a factor of approximately 2 ...
You could use Numba, which is a NumPy-aware dynamic Python compiler using LLVM. The resulting implementation is pretty simple and clear:
import numpy
import numba
#numba.jit
def minmax(x):
maximum = x[0]
minimum = x[0]
for i in x[1:]:
if i > maximum:
maximum = i
elif i < minimum:
minimum = i
return (minimum, maximum)
numpy.random.seed(1)
x = numpy.random.rand(1000000)
print(minmax(x) == (x.min(), x.max()))
It should also be faster than a Numpy's min() & max() implementation. And all without having to write a single C/Fortran line of code.
Do your own performance tests, as it is always dependent on your architecture, your data, your package versions...
There is a function for finding (max-min) called numpy.ptp if that's useful for you:
>>> import numpy
>>> x = numpy.array([1,2,3,4,5,6])
>>> x.ptp()
5
but I don't think there's a way to find both min and max with one traversal.
EDIT: ptp just calls min and max under the hood
Just to get some ideas on the numbers one could expect, given the following approaches:
import numpy as np
def extrema_np(arr):
return np.max(arr), np.min(arr)
import numba as nb
#nb.jit(nopython=True)
def extrema_loop_nb(arr):
n = arr.size
max_val = min_val = arr[0]
for i in range(1, n):
item = arr[i]
if item > max_val:
max_val = item
elif item < min_val:
min_val = item
return max_val, min_val
import numba as nb
#nb.jit(nopython=True)
def extrema_while_nb(arr):
n = arr.size
odd = n % 2
if not odd:
n -= 1
max_val = min_val = arr[0]
i = 1
while i < n:
x = arr[i]
y = arr[i + 1]
if x > y:
x, y = y, x
min_val = min(x, min_val)
max_val = max(y, max_val)
i += 2
if not odd:
x = arr[n]
min_val = min(x, min_val)
max_val = max(x, max_val)
return max_val, min_val
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef void _extrema_loop_cy(
long[:] arr,
size_t n,
long[:] result):
cdef size_t i
cdef long item, max_val, min_val
max_val = arr[0]
min_val = arr[0]
for i in range(1, n):
item = arr[i]
if item > max_val:
max_val = item
elif item < min_val:
min_val = item
result[0] = max_val
result[1] = min_val
def extrema_loop_cy(arr):
result = np.zeros(2, dtype=arr.dtype)
_extrema_loop_cy(arr, arr.size, result)
return result[0], result[1]
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef void _extrema_while_cy(
long[:] arr,
size_t n,
long[:] result):
cdef size_t i, odd
cdef long x, y, max_val, min_val
max_val = arr[0]
min_val = arr[0]
odd = n % 2
if not odd:
n -= 1
max_val = min_val = arr[0]
i = 1
while i < n:
x = arr[i]
y = arr[i + 1]
if x > y:
x, y = y, x
min_val = min(x, min_val)
max_val = max(y, max_val)
i += 2
if not odd:
x = arr[n]
min_val = min(x, min_val)
max_val = max(x, max_val)
result[0] = max_val
result[1] = min_val
def extrema_while_cy(arr):
result = np.zeros(2, dtype=arr.dtype)
_extrema_while_cy(arr, arr.size, result)
return result[0], result[1]
(the extrema_loop_*() approaches are similar to what is proposed here, while extrema_while_*() approaches are based on the code from here)
The following timings:
indicate that the extrema_while_*() are the fastest, with extrema_while_nb() being fastest. In any case, also the extrema_loop_nb() and extrema_loop_cy() solutions do outperform the NumPy-only approach (using np.max() and np.min() separately).
Finally, note that none of these is as flexible as np.min()/np.max() (in terms of n-dim support, axis parameter, etc.).
(full code is available here)
Nobody mentioned numpy.percentile, so I thought I would. If you ask for [0, 100] percentiles, it will give you an array of two elements, the min (0th percentile) and the max (100th percentile).
However, it doesn't satisfy the OP's purpose: it's not faster than min and max separately. That's probably due to some machinery that would allow for non-extreme percentiles (a harder problem, which should take longer).
In [1]: import numpy
In [2]: a = numpy.random.normal(0, 1, 1000000)
In [3]: %%timeit
...: lo, hi = numpy.amin(a), numpy.amax(a)
...:
100 loops, best of 3: 4.08 ms per loop
In [4]: %%timeit
...: lo, hi = numpy.percentile(a, [0, 100])
...:
100 loops, best of 3: 17.2 ms per loop
In [5]: numpy.__version__
Out[5]: '1.14.4'
A future version of Numpy could put in a special case to skip the normal percentile calculation if only [0, 100] are requested. Without adding anything to the interface, there's a way to ask Numpy for min and max in one call (contrary to what was said in the accepted answer), but the standard implementation of the library doesn't take advantage of this case to make it worthwhile.
In general you can reduce the amount of comparisons for a minmax algorithm by processing two elements at a time and only comparing the smaller to the temporary minimum and the bigger one to the temporary maximum. On average one needs only 3/4 of the comparisons than a naive approach.
This could be implemented in c or fortran (or any other low-level language) and should be almost unbeatable in terms of performance. I'm using numba to illustrate the principle and get a very fast, dtype-independant implementation:
import numba as nb
import numpy as np
#nb.njit
def minmax(array):
# Ravel the array and return early if it's empty
array = array.ravel()
length = array.size
if not length:
return
# We want to process two elements at once so we need
# an even sized array, but we preprocess the first and
# start with the second element, so we want it "odd"
odd = length % 2
if not odd:
length -= 1
# Initialize min and max with the first item
minimum = maximum = array[0]
i = 1
while i < length:
# Get the next two items and swap them if necessary
x = array[i]
y = array[i+1]
if x > y:
x, y = y, x
# Compare the min with the smaller one and the max
# with the bigger one
minimum = min(x, minimum)
maximum = max(y, maximum)
i += 2
# If we had an even sized array we need to compare the
# one remaining item too.
if not odd:
x = array[length]
minimum = min(x, minimum)
maximum = max(x, maximum)
return minimum, maximum
It's definetly faster than the naive approach that Peque presented:
arr = np.random.random(3000000)
assert minmax(arr) == minmax_peque(arr) # warmup and making sure they are identical
%timeit minmax(arr) # 100 loops, best of 3: 2.1 ms per loop
%timeit minmax_peque(arr) # 100 loops, best of 3: 2.75 ms per loop
As expected the new minmax implementation only takes roughly 3/4 of the time the naive implementation took (2.1 / 2.75 = 0.7636363636363637)
This is an old thread, but anyway, if anyone ever looks at this again...
When looking for the min and max simultaneously, it is possible to reduce the number of comparisons. If it is floats you are comparing (which I guess it is) this might save you some time, although not computational complexity.
Instead of (Python code):
_max = ar[0]
_min= ar[0]
for ii in xrange(len(ar)):
if _max > ar[ii]: _max = ar[ii]
if _min < ar[ii]: _min = ar[ii]
you can first compare two adjacent values in the array, and then only compare the smaller one against current minimum, and the larger one against current maximum:
## for an even-sized array
_max = ar[0]
_min = ar[0]
for ii in xrange(0, len(ar), 2)): ## iterate over every other value in the array
f1 = ar[ii]
f2 = ar[ii+1]
if (f1 < f2):
if f1 < _min: _min = f1
if f2 > _max: _max = f2
else:
if f2 < _min: _min = f2
if f1 > _max: _max = f1
The code here is written in Python, clearly for speed you would use C or Fortran or Cython, but this way you do 3 comparisons per iteration, with len(ar)/2 iterations, giving 3/2 * len(ar) comparisons. As opposed to that, doing the comparison "the obvious way" you do two comparisons per iteration, leading to 2*len(ar) comparisons. Saves you 25% of comparison time.
Maybe someone one day will find this useful.
At first glance, numpy.histogram appears to do the trick:
count, (amin, amax) = numpy.histogram(a, bins=1)
... but if you look at the source for that function, it simply calls a.min() and a.max() independently, and therefore fails to avoid the performance concerns addressed in this question. :-(
Similarly, scipy.ndimage.measurements.extrema looks like a possibility, but it, too, simply calls a.min() and a.max() independently.
It was worth the effort for me anyways, so I'll propose the most difficult and least elegant solution here for whoever may be interested. My solution is to implement a multi-threaded min-max in one pass algorithm in C++, and use this to create an Python extension module. This effort requires a bit of overhead for learning how to use the Python and NumPy C/C++ APIs, and here I will show the code and give some small explanations and references for whoever wishes to go down this path.
Multi-threaded Min/Max
There is nothing too interesting here. The array is broken into chunks of size length / workers. The min/max is calculated for each chunk in a future, which are then scanned for the global min/max.
// mt_np.cc
//
// multi-threaded min/max algorithm
#include <algorithm>
#include <future>
#include <vector>
namespace mt_np {
/*
* Get {min,max} in interval [begin,end)
*/
template <typename T> std::pair<T, T> min_max(T *begin, T *end) {
T min{*begin};
T max{*begin};
while (++begin < end) {
if (*begin < min) {
min = *begin;
continue;
} else if (*begin > max) {
max = *begin;
}
}
return {min, max};
}
/*
* get {min,max} in interval [begin,end) using #workers for concurrency
*/
template <typename T>
std::pair<T, T> min_max_mt(T *begin, T *end, int workers) {
const long int chunk_size = std::max((end - begin) / workers, 1l);
std::vector<std::future<std::pair<T, T>>> min_maxes;
// fire up the workers
while (begin < end) {
T *next = std::min(end, begin + chunk_size);
min_maxes.push_back(std::async(min_max<T>, begin, next));
begin = next;
}
// retrieve the results
auto min_max_it = min_maxes.begin();
auto v{min_max_it->get()};
T min{v.first};
T max{v.second};
while (++min_max_it != min_maxes.end()) {
v = min_max_it->get();
min = std::min(min, v.first);
max = std::max(max, v.second);
}
return {min, max};
}
}; // namespace mt_np
The Python Extension Module
Here is where things start getting ugly... One way to use C++ code in Python is to implement an extension module. This module can be built and installed using the distutils.core standard module. A complete description of what this entails is covered in the Python documentation: https://docs.python.org/3/extending/extending.html. NOTE: there are certainly other ways to get similar results, to quote https://docs.python.org/3/extending/index.html#extending-index:
This guide only covers the basic tools for creating extensions provided as part of this version of CPython. Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python.
Essentially, this route is probably more academic than practical. With that being said, what I did next was, sticking pretty close to the tutorial, create a module file. This is essentially boilerplate for distutils to know what to do with your code and create a Python module out of it. Before doing any of this it is probably wise to create a Python virtual environment so you don't pollute your system packages (see https://docs.python.org/3/library/venv.html#module-venv).
Here is the module file:
// mt_np_forpy.cc
//
// C++ module implementation for multi-threaded min/max for np
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <python3.6/numpy/arrayobject.h>
#include "mt_np.h"
#include <cstdint>
#include <iostream>
using namespace std;
/*
* check:
* shape
* stride
* data_type
* byteorder
* alignment
*/
static bool check_array(PyArrayObject *arr) {
if (PyArray_NDIM(arr) != 1) {
PyErr_SetString(PyExc_RuntimeError, "Wrong shape, require (1,n)");
return false;
}
if (PyArray_STRIDES(arr)[0] != 8) {
PyErr_SetString(PyExc_RuntimeError, "Expected stride of 8");
return false;
}
PyArray_Descr *descr = PyArray_DESCR(arr);
if (descr->type != NPY_LONGLTR && descr->type != NPY_DOUBLELTR) {
PyErr_SetString(PyExc_RuntimeError, "Wrong type, require l or d");
return false;
}
if (descr->byteorder != '=') {
PyErr_SetString(PyExc_RuntimeError, "Expected native byteorder");
return false;
}
if (descr->alignment != 8) {
cerr << "alignment: " << descr->alignment << endl;
PyErr_SetString(PyExc_RuntimeError, "Require proper alignement");
return false;
}
return true;
}
template <typename T>
static PyObject *mt_np_minmax_dispatch(PyArrayObject *arr) {
npy_intp size = PyArray_SHAPE(arr)[0];
T *begin = (T *)PyArray_DATA(arr);
auto minmax =
mt_np::min_max_mt(begin, begin + size, thread::hardware_concurrency());
return Py_BuildValue("(L,L)", minmax.first, minmax.second);
}
static PyObject *mt_np_minmax(PyObject *self, PyObject *args) {
PyArrayObject *arr;
if (!PyArg_ParseTuple(args, "O", &arr))
return NULL;
if (!check_array(arr))
return NULL;
switch (PyArray_DESCR(arr)->type) {
case NPY_LONGLTR: {
return mt_np_minmax_dispatch<int64_t>(arr);
} break;
case NPY_DOUBLELTR: {
return mt_np_minmax_dispatch<double>(arr);
} break;
default: {
PyErr_SetString(PyExc_RuntimeError, "Unknown error");
return NULL;
}
}
}
static PyObject *get_concurrency(PyObject *self, PyObject *args) {
return Py_BuildValue("I", thread::hardware_concurrency());
}
static PyMethodDef mt_np_Methods[] = {
{"mt_np_minmax", mt_np_minmax, METH_VARARGS, "multi-threaded np min/max"},
{"get_concurrency", get_concurrency, METH_VARARGS,
"retrieve thread::hardware_concurrency()"},
{NULL, NULL, 0, NULL} /* sentinel */
};
static struct PyModuleDef mt_np_module = {PyModuleDef_HEAD_INIT, "mt_np", NULL,
-1, mt_np_Methods};
PyMODINIT_FUNC PyInit_mt_np() { return PyModule_Create(&mt_np_module); }
In this file there is a significant use of the Python as well as the NumPy API, for more information consult: https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple, and for NumPy: https://docs.scipy.org/doc/numpy/reference/c-api.array.html.
Installing the Module
The next thing to do is to utilize distutils to install the module. This requires a setup file:
# setup.py
from distutils.core import setup,Extension
module = Extension('mt_np', sources = ['mt_np_module.cc'])
setup (name = 'mt_np',
version = '1.0',
description = 'multi-threaded min/max for np arrays',
ext_modules = [module])
To finally install the module, execute python3 setup.py install from your virtual environment.
Testing the Module
Finally, we can test to see if the C++ implementation actually outperforms naive use of NumPy. To do so, here is a simple test script:
# timing.py
# compare numpy min/max vs multi-threaded min/max
import numpy as np
import mt_np
import timeit
def normal_min_max(X):
return (np.min(X),np.max(X))
print(mt_np.get_concurrency())
for ssize in np.logspace(3,8,6):
size = int(ssize)
print('********************')
print('sample size:', size)
print('********************')
samples = np.random.normal(0,50,(2,size))
for sample in samples:
print('np:', timeit.timeit('normal_min_max(sample)',
globals=globals(),number=10))
print('mt:', timeit.timeit('mt_np.mt_np_minmax(sample)',
globals=globals(),number=10))
Here are the results I got from doing all this:
8
********************
sample size: 1000
********************
np: 0.00012079699808964506
mt: 0.002468645994667895
np: 0.00011947099847020581
mt: 0.0020772050047526136
********************
sample size: 10000
********************
np: 0.00024697799381101504
mt: 0.002037393998762127
np: 0.0002713389985729009
mt: 0.0020942929986631498
********************
sample size: 100000
********************
np: 0.0007130410012905486
mt: 0.0019842900001094677
np: 0.0007540129954577424
mt: 0.0029724110063398257
********************
sample size: 1000000
********************
np: 0.0094779249993735
mt: 0.007134920000680722
np: 0.009129883001151029
mt: 0.012836456997320056
********************
sample size: 10000000
********************
np: 0.09471094200125663
mt: 0.0453535050037317
np: 0.09436299200024223
mt: 0.04188535599678289
********************
sample size: 100000000
********************
np: 0.9537652180006262
mt: 0.3957935369980987
np: 0.9624398809974082
mt: 0.4019058070043684
These are far less encouraging than the results indicate earlier in the thread, which indicated somewhere around 3.5x speedup, and didn't incorporate multi-threading. The results I achieved are somewhat reasonable, I would expect that the overhead of threading and would dominate the time until the arrays got very large, at which point the performance increase would start to approach std::thread::hardware_concurrency x increase.
Conclusion
There is certainly room for application specific optimizations to some NumPy code, it would seem, in particular with regards to multi-threading. Whether or not it is worth the effort is not clear to me, but it certainly seems like a good exercise (or something). I think that perhaps learning some of those "third party tools" like Cython may be a better use of time, but who knows.
Inspired by the previous answer I've written numba implementation returning minmax for axis=0 from 2-D array. It's ~5x faster than calling numpy min/max.
Maybe someone will find it useful.
from numba import jit
#jit
def minmax(x):
"""Return minimum and maximum from 2D array for axis=0."""
m, n = len(x), len(x[0])
mi, ma = np.empty(n), np.empty(n)
mi[:] = ma[:] = x[0]
for i in range(1, m):
for j in range(n):
if x[i, j]>ma[j]: ma[j] = x[i, j]
elif x[i, j]<mi[j]: mi[j] = x[i, j]
return mi, ma
x = np.random.normal(size=(256, 11))
mi, ma = minmax(x)
np.all(mi == x.min(axis=0)), np.all(ma == x.max(axis=0))
# (True, True)
%timeit x.min(axis=0), x.max(axis=0)
# 15.9 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit minmax(x)
# 2.62 µs ± 31.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The shortest way I've come up with is this:
mn, mx = np.sort(ar)[[0, -1]]
But since it sorts the array, it's not the most efficient.
Another short way would be:
mn, mx = np.percentile(ar, [0, 100])
This should be more efficient, but the result is calculated, and a float is returned.
Maybe use numpy.unique? Like so:
min_, max_ = numpy.unique(arr)[[0, -1]]
Just added it here for variety :) It's just as slow as sort.
Related
My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage with np.non_zero is that it finds indices of bits set to 1 much faster than if one iterates (with a for loop) over each bit and checks if it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.non_zero, but on integers instead?
Thank you in advance for your suggestions!
A minor optimisation to your code would be to use the new style random interface and generate bools rather than 64bit integers
rng = np.random.default_rng()
def original(n):
bitstring = rng.integers(2, size=n, dtype=bool)
return np.nonzero(bitstring)[0]
this causes it to take ~24 µs on my laptop, tested n upto 128.
I've previously noticed that getting a Numpy to generate a permutation is particularly fast, hence my comment above. Leading to:
def perm(n):
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indicies out of order, not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on an pre-allocated array, something like:
n = 32
a = np.arange(n)
def shuffle():
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.
After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
def func1(n):
bit_array = np.random.randint(2, size=n)
return np.nonzero(bit_array)[0]
def func2(n):
bit_int = np.random.bit_generator.randbits(n)
a = np.zeros(bit_int.bit_count())
i = 0
for j in range(n):
if 1 & (bit_int >> j):
a[i] = j
i += 1
return a
def func3(n):
bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype=int)
return np.nonzero(bit_array)[0]
def func4(n):
rng = np.random.default_rng()
a = rng.permutation(n)
return a[:rng.binomial(n, 0.5)]
def func5(n):
a = np.arange(n)
rng.shuffle(a)
return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 over a statement each time and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspondond to the standard deviation.
For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n>32). We can see that for low values of n (n< 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both world and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!
I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t
np.import_array()
DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t
cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil
cdef const char *bgen_capsule_name = "BitGenerator"
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
cdef Py_ssize_t i
cdef size_t nset
cdef bitgen_t *rng
capsule = bitgen.capsule
if not PyCapsule_IsValid(capsule, bgen_capsule_name):
raise ValueError("Expecting Numpy BitGenerator Capsule")
rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)
with bitgen.lock:
nset = 0
for i in range(state_len-1):
state[i] = rng.next_uint64(rng.state)
nset += __builtin_popcountl(state[i])
i = state_len-1
state[i] = rng.next_uint64(rng.state) & last_mask
nset += __builtin_popcountl(state[i])
return nset
cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
cdef size_t j
cdef int k
j = 0
while state:
# find first set bit returns zero when nothing is set
k = __builtin_ffsl(state) - 1
# clear out bit k
state &= ~(1ul<<k)
# record in output
result[j] = off + k
j += 1
return j
#cython.boundscheck(False) # Deactivate bounds checking
#cython.wraparound(False) # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
cdef Py_ssize_t i, j, nset
cdef np.uint64_t[::1] state
cdef DTYPE_t[::1] result
state = np.empty((n + 63) // 64, dtype=np.uint64)
nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)
pyresult = np.empty(nset, dtype=DTYPE)
result = pyresult
j = 0
for i in range(len(state)):
j += write_setbits(&result[j], i * 64, state[i])
return pyresult
The above code is easy to use via the Cython Jupyter extension.
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit
import numpy as np
import matplotlib.pyplot as plt
bitgen = np.random.PCG64()
def func1(n):
# bool type is a bit faster
bit_array = np.random.randint(2, size=n, dtype=bool)
return np.nonzero(bit_array)[0]
def func2(n):
# OPs variant ends up using a CSPRNG which is slower
bit_int = random.getrandbits(n)
# this is much easier than using numpy arrays
return [i for i in range(n) if 1 & (bit_int >> i)]
def func3(n):
bit_string = format(random.getrandbits(n), f'0{n}b')
bit_array = np.array(list(bit_string), dtype='int8')
return np.nonzero(bit_array)[0]
def func4(n):
# shuffle variant is mostly the same
# plot already busy enough
a = np.random.permutation(n)
return a[:np.random.binomial(n, 0.5)]
def func_cython(n):
return rint(bitgen, n)
result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
result[name] = res = []
for n in niter:
t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
nit, dt = t.autorange()
res.append(dt / nit)
plt.loglog()
for name, times in result.items():
plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
Which might produce output like:
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.
you could convert the number you get with randbits(n) to a numpy.ndarray.
depending on the size of n the compute time of the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]
I have a very sparse matrix, say 5000x3000, double precision floats. 80% of this matrix are zeros. I need to compute a sum of each row. All of that in python/cython. I wanted to speed up the process. Because I need to compute this sum a few million times, I thought that if I make indices of non zero elements and sum only them it will be faster. The result turns to be much slower than original "brute-force" summation of all zeros.
Here a minimal example:
#cython: language_level=2
import numpy as np
cimport numpy as np
import time
cdef int Ncells = 5000, KCells = 400, Ne= 350
cdef double x0=0.1, x1=20., x2=1.4, x3=2.8, p=0.2
# Setting up weight
all_weights = np.zeros( (Ncells,KCells) )
all_weights[ :Ne, :Ne ] = x0
all_weights[ :Ne, Ne: ] = x1
all_weights[Ne: , :Ne ] = x2
all_weights[Ne: , Ne: ] = x3
all_weights = all_weights * (np.random.rand(Ncells,KCells) < p)
# Making a memory view
cdef np.float64_t[:,:] my_weights = all_weights
# make an index of non zero weights
x,y = np.where( np.array(my_weights) > 0.)
#np_pawid = np.column_stack( (x ,y ) )
np_pawid = np.column_stack( (x ,y ) ).astype(int)
cdef np.int_t[:,:] pawid = np_pawid
# Making vector for column sum
summEE = np.zeros(KCells)
# Memory view
cdef np.float64_t [:] my_summEE = summEE
cdef int cc,dd,i
# brute-force summing
ntm = time.time()
for cc in range(KCells):
my_summEE[cc] = 0
for dd in range(Ncells):
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "BRUTE-FORCE summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices
ntm = time.time()
for dd,cc in pawid:
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "INDEX summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices unpacked by zip
ntm = time.time()
for dd,cc in zip(pawid[:,0],pawid[:,1]):
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "ZIPPED INDEX summation : %f s"%(stm-ntm)
my_summEE[:] = 0
# summing only non zero indices unpacked by zip
ntm = time.time()
for i in range(pawid.shape[0]):
dd = pawid[i,0]
cc = pawid[i,1]
my_summEE[cc] += my_weights[dd,cc]
stm = time.time()
print "INDEXING over INDEX summation: %f s"%(stm-ntm)
# Numpy brute-froce summing
ntm = time.time()
sumwee = np.sum(all_weights,axis=0)
stm = time.time()
print "NUMPY BRUTE-FORCE summation : %f s"%(stm-ntm)
#>
print
print "Number of brute-froce summs :",my_weights.shape[0]*my_weights.shape[1]
print "Number of indexing summs :",pawid.shape[0]
#<
I ran it on Raspberry Pi 3, but it seems the same results on PC too.
BRUTE-FORCE summation : 0.381014 s
INDEX summation : 18.479018 s
ZIPPED INDEX summation : 3.615952 s
INDEXING over INDEX summation: 0.450131 s
NUMPY BRUTE-FORCE summation : 0.013017 s
Number of brute-froce summs : 2000000
Number of indexing summs : 400820
NUMPY BRUTE-FORCE in Python : 0.029143 s
Can anyone explain why is cython code 3-4 time slower than numpy? Why is indexing, which reduces the number of summations from 2000000 to 400820, 45 times slower? It doesn't make any sense.
You're outside a function so accessing global variables. This means that Cython has to check they exist each time they're accessed, unlike function locals which it knows can't be accessed from elsewhere.
By default Cython handles negative indices and does bounds checking. You can turn these off in a number of ways. An obvious way is to add #cython.wraparound(False) and #cython.boundscheck(False) as decorators to your function definition. Be aware of what these actually do - the only turn off these features on cdefed numpy arrays or typed memoryviews and don't apply to much else (so don't just apply them everywhere as a cargo-cult type thing).
A good way to see where issues might be is to run cython -a <filename> and look at the annotated html file. Areas with yellow are potentially not optimized, and you can expand the lines to see the underlying C code. Obviously only worry about frequently called functions and loops in this respect - the fact your code to setup the Numpy arrays contains Python calls is expected and not a problem.
A few measurements:
As you wrote it
BRUTE-FORCE summation : 0.008625 s
INDEX summation : 0.713661 s
ZIPPED INDEX summation : 0.127343 s
INDEXING over INDEX summation: 0.002154 s
NUMPY BRUTE-FORCE summation : 0.001461 s
In a function
BRUTE-FORCE summation : 0.007706 s
INDEX summation : 0.681892 s
ZIPPED INDEX summation : 0.123176 s
INDEXING over INDEX summation: 0.002069 s
NUMPY BRUTE-FORCE summation : 0.001429 s
In a function with boundscheck and wraparound off:
BRUTE-FORCE summation : 0.005208 s
INDEX summation : 0.672948 s
ZIPPED INDEX summation : 0.124641 s
INDEXING over INDEX summation: 0.002006 s
NUMPY BRUTE-FORCE summation : 0.001467 s
My suggestions do help, but not too dramatically. My differences aren't as dramatic as you see (even for your unchanged code). Numpy still wins - at a guess:
I suspect it's multithreading.
a direct sum over a whole array will have predictable patterns of memory access, which may make it more efficient than a smaller number of operations with unpredictable memory access
What is the big O of the following if statement?
if "pl" in "apple":
...
What is the overall big O of how python determines if the string "pl" is found in the string "apple"
or any other substring in string search.
Is this the most efficient way to test if a substring is in a string? Does it use the same algorithm as .find()?
The time complexity is O(N) on average, O(NM) worst case (N being the length of the longer string, M, the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst-case scenario to O(N + M) by switching algorithms.
The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.
See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm has not changed since introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).
The post includes a Python-code outline of the algorithm:
def find(s, p):
# find first occurrence of p in s
n = len(s)
m = len(p)
skip = delta1(p)[p[m-1]]
i = 0
while i <= n-m:
if s[i+m-1] == p[m-1]: # (boyer-moore)
# potential match
if s[i:i+m-1] == p[:m-1]:
return i
if s[i+m] not in p:
i = i + m + 1 # (sunday)
else:
i = i + skip # (horspool)
else:
# skip
if s[i+m] not in p:
i = i + m + 1 # (sunday)
else:
i = i + 1
return -1 # not found
as well as speed comparisons.
In Python 3.10, the algorithm was updated to use an enhanced version of the Crochemore and Perrin's Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, with s at least 6 times as long as p), in response to a pathological edgecase someone reported. The commit adding this change included a write-up on how the algorithm works.
The Two-way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up-front to build a shift table from the s search needle. Once you have that table, this algorithm does have a best-case performance of O(N/M).
In Python 3.4.2, it looks like they are resorting to the same function, but there may be a difference in timing nevertheless. For example, s.find first is required to look up the find method of the string and such.
The algorithm used is a mix between Boyer-More and Horspool.
You can use timeit and test it yourself:
maroun#DQHCPY1:~$ python -m timeit 's = "apple";s.find("pl")'
10000000 loops, best of 3: 0.125 usec per loop
maroun#DQHCPY1:~$ python -m timeit 's = "apple";"pl" in s'
10000000 loops, best of 3: 0.0371 usec per loop
Using in is indeed faster (0.0371 usec compared to 0.125 usec).
For actual implementation, you can look at the code itself.
I think the best way to find out is to look at the source. This looks like it would implement __contains__:
static int
bytes_contains(PyObject *self, PyObject *arg)
{
Py_ssize_t ival = PyNumber_AsSsize_t(arg, PyExc_ValueError);
if (ival == -1 && PyErr_Occurred()) {
Py_buffer varg;
Py_ssize_t pos;
PyErr_Clear();
if (PyObject_GetBuffer(arg, &varg, PyBUF_SIMPLE) != 0)
return -1;
pos = stringlib_find(PyBytes_AS_STRING(self), Py_SIZE(self),
varg.buf, varg.len, 0);
PyBuffer_Release(&varg);
return pos >= 0;
}
if (ival < 0 || ival >= 256) {
PyErr_SetString(PyExc_ValueError, "byte must be in range(0, 256)");
return -1;
}
return memchr(PyBytes_AS_STRING(self), (int) ival, Py_SIZE(self)) != NULL;
}
in terms of stringlib_find(), which uses fastsearch().
My use case is to evaluate Poisson pmf on all points which is less than say, 10, and I would call such function multiple of times with difference lambdas. The lambdas are not known ahead of time so I cannot vectorize lambdas.
I heard from somewhere about a secret trick which is to use _pmf. What is the downside to do so? But still, it is a bit slow, is there any way to improve it without rewriting the pmf in C from scratch?
%timeit scipy.stats.poisson.pmf(np.arange(0,10),3.3)
%timeit scipy.stats.poisson._pmf(np.arange(0,10),3.3)
a = np.arange(0,10)
%timeit scipy.stats.poisson._pmf(a,3.3)
10000 loops, best of 3: 94.5 µs per loop
100000 loops, best of 3: 15.2 µs per loop
100000 loops, best of 3: 13.7 µs per loop
Update
Ok, simply I was just too lazy to write in cython. I had expected there is a faster solution for all discrete distribution that can be evaluated sequentially (iteratively) for consecutive x. E.g. P(X=3) = P(X=2) * lambda / 3 if X ~ Pois(lambda)
Related: Is the build-in probability density functions of `scipy.stat.distributions` slower than a user provided one?
I have less faith in Scipy and Python now. The library function isn't as advanced as what I had expected.
Most of scipy.stats distributions support vectorized evaluation:
>>> poisson.pmf(1, [5, 6, 7, 8])
array([ 0.03368973, 0.01487251, 0.00638317, 0.0026837 ])
This may or may not be fast enough, but you can try taking pmf calls out of the loop.
Re difference between pmf and _pmf: the real work is done in the underscored functions (_pmf, _cdf etc) while the public functions (pmf, cdf) make sure that only valid arguments make it to the _pmf (The output of _pmf is not guaranteed to be meaningful if the arguments are invalid, so use on your own risk).
>>> poisson.pmf(1, -1)
nan
>>> poisson._pmf(1, -1)
/home/br/virtualenvs/scipy-dev/local/lib/python2.7/site-packages/scipy/stats/_discrete_distns.py:432: RuntimeWarning: invalid value encountered in log
Pk = k*log(mu)-gamln(k+1) - mu
nan
Further details: https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L2721
Try implementing the pmf in cython. If your scipy is part of a package like Anaconda or Enthought you probably have cython installed. http://cython.org/
Try running it with pypy. http://pypy.org/
Rent time on a big AWS server (or similar).
I found that the scipy.stats.poisson class is tragically slow compared to just a simple python implementation.
No cython or vectors or anything.
import math
def poisson_pmf(x, mu):
return mu**x / math.factorial(x) * math.exp(-mu)
def poisson_cdf(k, mu):
p_total = 0.0
for x in range(k + 1):
p_total += poisson_pmf(x, mu)
return p_total
And if you check the source code of scipy.stats.poisson (even the underscore prefixed versions) then it is clear why!
The above implementation is now ONLY 10x slower than the exact equivalent in C (compiled with gcc -O3 v9.3). The scipy version is at least another 10x slower.
#include <math.h>
unsigned long factorial(unsigned n) {
unsigned long fact = 1;
for (unsigned k = 2; k <= n; ++k)
fact *= k;
return fact;
}
double poisson_pmf(unsigned x, double mu) {
return pow(mu, x) / factorial(x) * exp(-mu);
}
double poisson_cdf(unsigned k, double mu) {
double p_total = 0.0;
for (unsigned x = 0; x <= k; ++x)
p_total += poisson_pmf(x, mu);
return p_total;
}
I am writing some performance intensive code, and was hoping to get some feedback from the cythonistas out there on how to improve it further. The purpose of the functions I've written is a bit tough to explain, but what they do isn't all that intimidating. The first (roughly) takes two dictionaries of lists of numbers and joins them to get one dictionary of lists of numbers. It's only run once so I am less concerned with optimizing it. The second first calls the first, then uses its result to basically cross indices stored in a numpy array with the numbers in the lists of arrays to form queries (new numbers) on a (pybloomfiltermmap) bloom filter.
I've determined the heavy step is due to my nested loops and reduced the number of loops used, moved out of the loops everything that only needs to happen once, and typed everything to the best of my knowledge. Still, each iteration of i in the second function takes about 10 seconds, which is too much. The main things I still see as yellow in the html compilation output are due to indexed accesses in the lists and numpy array, so I tried to replace my lists with all numpy arrays but wasn't able to get any improvement. I would greatly appreciate any feedback you could provide.
#cython: boundscheck=False
#cython: wraparound=False
import numpy as np
cimport numpy as np
def merge_dicts_of_lists(dict c1, dict c2):
cdef dict res
cdef int n, length1, length2, length3
cdef unsigned int i, j, j_line, jj, k, kk, new_line
res = {n: [] for n in range(256)}
length1 = len(c1)
for i in range(length1):
length2 = len(c1[i])
for j in range(length2):
j_line = c1[i][j]
jj = (j_line) % 256
length3 = len(c2[jj])
for k in range(length3):
kk = c2[jj][k]
new_line = (j_line << 10) + kk
res[i].append(new_line)
return res
def get_4kmer_set(np.ndarray c1, dict c2, dict c3, bf):
cdef unsigned int num = 0
cdef unsigned long long query = 0
cdef unsigned int i, j, i_row, i_col, j_line
cdef unsigned int length1, length2
cdef dict merge
cdef list m_i
merge = merge_dicts_of_lists(c2, c3)
length1 = len(c1[:,0])
for i in range(length1):
print "i is %d" % i
i_row = c1[i,0]
i_col = c1[i,1]
m_i = merge[i_col]
length2 = len(m_i)
for j in range(length2):
j_line = m_i[j]
query = (i_row << 24) + (i_col << 20) + j_line
if query in bf:
num += 1
print "%d yes answers from bf" % num
For posterity's sake, I'm adding an off-topic answer, but I hope it will be useful for someone nonetheless. The code I posted above isn't much different from what I've decided to stay with, as it was already compiling to short C lines as seen by the Ctyhon html compilation output.
Since the innermost operation was a Bloom filter query, I found what helped most was speeding up that step in two ways. One was changing the hash function used by pybloomfiltermmap to an available C++ implementation of murmurhash3. I found pybloomfilter was using sha, which was comparatively slowas expected for a cryptographic hash function. The second boost came from applying the trick found in this paper: http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/rsa.pdf. Basically, it says you can save a lot of computation by using a linear combination of two hash values instead of k different hashes for the BF. These two tricks together gave an order of magnitude (~5x) improvement in query time.