Cython: why is size_t faster than int? - python

Changing certain Cython variables from type int to type size_t can significantly reduce the run time of some functions (by ~30%), but I do not understand why.
For example:
cimport numpy as cnp
import numpy as np
def sum_int(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef int k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
def sum_size_t(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef size_t k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
a = np.array(range(1000000))
And the timing results:
In [17]: %timeit sum_int(a)
1000 loops, best of 3: 652 µs per loop
In [18]: %timeit sum_size_t(a)
1000 loops, best of 3: 427 µs per loop
I am new to Cython, and know Fortran better than C. Help me out. What is the important difference between these two variable types that causes such a performance difference? What is it that I don't grok about Cython?

You'd likely have to do a line by line profiling to find out exactly, but one thing stands out to me from the produced C file: int version is checked for wraparound to negative numbers, size_t is assumed ok.
In the int loop: (t_3 is assigned from k, they're the same type)
if (__pyx_t_3 < 0) {
__pyx_t_3 += __pyx_v_A.shape[0];
if (unlikely(__pyx_t_3 < 0)) __pyx_t_4 = 0;
} else if (unlikely(__pyx_t_3 >= __pyx_v_A.shape[0])) __pyx_t_4 = 0;
In the size_t loop:
if (unlikely(__pyx_t_3 >= (size_t)__pyx_v_A.shape[0])) __pyx_t_4 = 0;
So no negative-index (wraparound) test is needed for size_t: it is unsigned, so the index can never be negative and a single upper-bound comparison is enough. The rest is virtually the same.
Update: regarding your unsigned int results - what's your size of int and size_t? Any chance they're different size, causing the change? In my case the C code for uint and size_t is identical. (since size_t is unsigned and specifically unsigned int on this system)
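To make the two branches concrete, here is a minimal pure-Python sketch of the index checks shown above. The function names are made up for illustration, and ctypes is only used to mimic how C reinterprets a value as size_t; this is not the code Cython generates.

```python
import ctypes

def resolve_signed(i, n):
    """Signed index: wrap a negative index once, then bounds-check."""
    if i < 0:
        i += n
        if i < 0:
            return None  # still negative: out of bounds
    elif i >= n:
        return None      # past the end: out of bounds
    return i

def resolve_unsigned(i, n):
    """Unsigned index: a single comparison is enough."""
    i = ctypes.c_size_t(i).value  # reinterpret as size_t, like a C cast
    if i >= n:
        return None
    return i

print(resolve_signed(-1, 5))    # 4: the negative index wraps to the last element
print(resolve_unsigned(-1, 5))  # None: -1 becomes a huge value and fails the check
```

The signed path has two extra comparisons per access, which is the difference visible in the generated C.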

On a 64 bit system there seem to be two reasons:
Use an unsigned integer for the loop:
%%cython
cimport numpy as cnp
import numpy as np
def sum_int_unsigned(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef unsigned k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
Use a long instead of an int:
%%cython
cimport numpy as cnp
import numpy as np
def sum_int_unsigned_long(cnp.int64_t[::1] A):
    cdef unsigned long s = 0
    cdef unsigned long k
    for k in xrange(A.shape[0]):
        s += A[k]
    return s
Timings:
%timeit sum_int(a)
1000 loops, best of 3: 1.52 ms per loop
%timeit sum_size_t(a)
1000 loops, best of 3: 671 µs per loop
Using unsigned brings us half way:
%timeit sum_int_unsigned(a)
1000 loops, best of 3: 1.09 ms per loop
Using long accounts for the rest:
%timeit sum_int_unsigned_long(a)
1000 loops, best of 3: 648 µs per loop
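A quick way to check the integer widths on a given machine, without compiling anything, is ctypes from the standard library. This is only a sketch for inspecting sizes; the numbers depend on platform and compiler (for example, long is 4 bytes on 64-bit Windows but 8 bytes on 64-bit Linux).

```python
import ctypes

# Size in bytes of the C types discussed above, as seen by this Python build
sizes = {
    "int": ctypes.sizeof(ctypes.c_int),
    "unsigned int": ctypes.sizeof(ctypes.c_uint),
    "long": ctypes.sizeof(ctypes.c_long),
    "unsigned long": ctypes.sizeof(ctypes.c_ulong),
    "size_t": ctypes.sizeof(ctypes.c_size_t),
    "Py_ssize_t": ctypes.sizeof(ctypes.c_ssize_t),
}
for name, nbytes in sizes.items():
    print(f"{name:>13}: {nbytes} bytes")
```

If int is narrower than size_t on your system, the loop index has to be widened before it can be used as a memory offset, which is consistent with the "long accounts for the rest" observation.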

Related

Nonzero for integers

My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.
One of the fast methods is to use NumPy and do
bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]
The advantage of np.nonzero is that it finds the indices of bits set to 1 much faster than iterating (with a for loop) over each bit and checking whether it is set to 1.
Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).
How would you do something similar to (and as fast as) np.nonzero, but on integers instead?
Thank you in advance for your suggestions!
A minor optimisation to your code would be to use the new-style random interface and generate bools rather than 64-bit integers:
rng = np.random.default_rng()
def original(n):
    bitstring = rng.integers(2, size=n, dtype=bool)
    return np.nonzero(bitstring)[0]
This takes ~24 µs on my laptop, tested for n up to 128.
I've previously noticed that getting NumPy to generate a permutation is particularly fast, hence my comment above. This leads to:
def perm(n):
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]
which takes between ~7 µs and ~10 µs depending on n. It also returns the indices out of order; not sure if that's an issue for you. If your n isn't changing much, you could also switch to using rng.shuffle on a pre-allocated array, something like:
n = 32
a = np.arange(n)
def shuffle():
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
which saves a couple of microseconds.
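The reason the permutation trick is valid: the number of ones among n fair random bits follows Binomial(n, 0.5), and given that count k, every k-element subset of positions is equally likely. Here is a minimal standard-library sketch of the same idea; the function name is illustrative, and the answer above uses NumPy's rng rather than the random module.

```python
import random

def random_set_bit_indices(n):
    """Indices of the ones in an n-bit fair random bitstring (unordered)."""
    # The popcount of n fair bits is a Binomial(n, 0.5) draw
    k = bin(random.getrandbits(n)).count("1")
    # Given k, each k-subset of range(n) is equally likely
    return random.sample(range(n), k)

print(random_set_bit_indices(32))
```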
After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:
def func1(n):
    bit_array = np.random.randint(2, size=n)
    return np.nonzero(bit_array)[0]
def func2(n):
    bit_int = np.random.bit_generator.randbits(n)
    a = np.zeros(bit_int.bit_count())
    i = 0
    for j in range(n):
        if 1 & (bit_int >> j):
            a[i] = j
            i += 1
    return a
def func3(n):
    bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype=int)
    return np.nonzero(bit_array)[0]
def func4(n):
    rng = np.random.default_rng()
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]
def func5(n):
    a = np.arange(n)
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]
I used timeit to do the benchmark, looping 1000 times over a statement each time and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspond to the standard deviation.
For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n > 32). We can see that for low values of n (n < 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However, as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both worlds and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits, spends too long doing the conversion.
For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.
Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!
I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:
from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer
import numpy as np
cimport numpy as np
cimport cython
from numpy.random cimport bitgen_t
np.import_array()
DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t
cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil
cdef const char *bgen_capsule_name = "BitGenerator"
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)  # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
    cdef Py_ssize_t i
    cdef size_t nset
    cdef bitgen_t *rng
    capsule = bitgen.capsule
    if not PyCapsule_IsValid(capsule, bgen_capsule_name):
        raise ValueError("Expecting Numpy BitGenerator Capsule")
    rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)
    with bitgen.lock:
        nset = 0
        for i in range(state_len-1):
            state[i] = rng.next_uint64(rng.state)
            nset += __builtin_popcountl(state[i])
        i = state_len-1
        state[i] = rng.next_uint64(rng.state) & last_mask
        nset += __builtin_popcountl(state[i])
    return nset
cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
    cdef size_t j
    cdef int k
    j = 0
    while state:
        # find first set bit returns zero when nothing is set
        k = __builtin_ffsl(state) - 1
        # clear out bit k
        state &= ~(1ul<<k)
        # record in output
        result[j] = off + k
        j += 1
    return j
@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)  # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
    cdef Py_ssize_t i, j, nset
    cdef np.uint64_t[::1] state
    cdef DTYPE_t[::1] result
    state = np.empty((n + 63) // 64, dtype=np.uint64)
    nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)
    pyresult = np.empty(nset, dtype=DTYPE)
    result = pyresult
    j = 0
    for i in range(len(state)):
        j += write_setbits(&result[j], i * 64, state[i])
    return pyresult
The above code is easy to use via the Cython Jupyter extension.
Comparing this to slightly tidied up versions of the OP's code can be done via:
import random
import timeit
import numpy as np
import matplotlib.pyplot as plt
bitgen = np.random.PCG64()
def func1(n):
    # bool type is a bit faster
    bit_array = np.random.randint(2, size=n, dtype=bool)
    return np.nonzero(bit_array)[0]
def func2(n):
    # OPs variant ends up using a CSPRNG which is slower
    bit_int = random.getrandbits(n)
    # this is much easier than using numpy arrays
    return [i for i in range(n) if 1 & (bit_int >> i)]
def func3(n):
    bit_string = format(random.getrandbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype='int8')
    return np.nonzero(bit_array)[0]
def func4(n):
    # shuffle variant is mostly the same
    # plot already busy enough
    a = np.random.permutation(n)
    return a[:np.random.binomial(n, 0.5)]
def func_cython(n):
    return rint(bitgen, n)
result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
    result[name] = res = []
    for n in niter:
        t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
        nit, dt = t.autorange()
        res.append(dt / nit)
plt.loglog()
for name, times in result.items():
    plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()
Which might produce output like:
Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.
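For checking results from the Cython version, the bit-scanning loop in write_setbits can be mirrored in pure Python. This is only a reference sketch (the function name is made up), far slower than the compiled code; it uses x & -x to isolate the lowest set bit instead of __builtin_ffsl, and counts bit 0 as the least significant bit.

```python
def set_bit_indices(x):
    """Return the positions of the set bits of a non-negative int, lowest first."""
    indices = []
    while x:
        low = x & -x                          # isolate the lowest set bit
        indices.append(low.bit_length() - 1)  # its position
        x ^= low                              # clear that bit
    return indices

print(set_bit_indices(0b10100))  # [2, 4]
```

The loop runs once per set bit rather than once per bit, which is the same property that makes the ffs-based loop cheap.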
You could convert the number you get from randbits(n) to a numpy.ndarray.
Depending on the size of n, the conversion should be faster than the loop.
n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]
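One thing to keep in mind with the string conversion: positions in the string count from the most significant bit, whereas shifting with (x >> j) & 1 counts from the least significant bit. A small standard-library sketch with the same value 616 shows the difference (variable names are illustrative):

```python
n = 10
x = 616                       # binary 1001101000
s = f"{x:0{n}b}"              # zero-padded binary string of length n

# Positions of '1' characters, counted from the left (most significant bit)
string_positions = [i for i, c in enumerate(s) if c == "1"]

# The same bits, counted from the right (least significant bit = 0)
bit_indices = sorted(n - 1 - i for i in string_positions)

print(string_positions)  # [0, 3, 4, 6]
print(bit_indices)       # [3, 5, 6, 9]
```

For uniformly random bits either convention gives the same distribution, so this only matters when results from the two approaches are compared against each other.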

Cython: understanding what the html annotation file has to say?

After compiling the following Cython code, I get the html file that looks like this:
import numpy as np
cimport numpy as np
cpdef my_function(np.ndarray[np.double_t, ndim = 1] array_a,
                  np.ndarray[np.double_t, ndim = 1] array_b,
                  int n_rows,
                  int n_columns):
    array_a[0:-1:n_columns] = 0
    array_a[n_columns - 1:n_rows * n_columns:n_columns] = 0
    array_a[0:n_columns] = 0
    array_a[n_columns* (n_rows - 1):n_rows * n_columns] = 0
    array_b[array_a == 3] = 0
    return array_a, array_b
My question is: why are those operations in my function still yellow? Does this mean that the code is still not as fast as it could be using Cython?
As you already know, yellow lines mean that some interactions with python happen, i.e. python functionality and not raw c-functionality is used, and you can look into the produced code to see, what happens and if it can/should be fixed/avoided.
Not every interaction with python means a (measurable) slowdown.
Let's take a look at this simplified function:
%%cython
cimport numpy as np
def use_slices(np.ndarray[np.double_t] a):
    a[0:len(a)]=0.0
When we look into the produced code we see (I kept only the important parts):
__pyx_t_1 = PyObject_Length(((PyObject *)__pyx_v_a));
__pyx_t_2 = PyInt_FromSsize_t(__pyx_t_1);
__pyx_t_3 = PySlice_New(__pyx_int_0, __pyx_t_2, Py_None);
PyObject_SetItem(((PyObject *)__pyx_v_a)
So basically we create a new slice object and then use NumPy's functionality (via PyObject_SetItem) to set all elements to 0.0, which is C code under the hood.
Let's take a look at version with hand-written for loop:
cimport numpy as np
def use_for(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0
It still uses PyObject_Length (because of len) and bounds checking, but otherwise it is C code. When we compare times:
>>> import numpy as np
>>> a=np.ones((500,))
>>> %timeit use_slices(a)
100000 loops, best of 3: 1.85 µs per loop
>>> %timeit use_for(a)
1000000 loops, best of 3: 1.42 µs per loop
>>> b=np.ones((250000,))
>>> %timeit use_slices(b)
10000 loops, best of 3: 104 µs per loop
>>> %timeit use_for(b)
1000 loops, best of 3: 364 µs per loop
You can see the additional overhead of creating a slice for small sizes, but the additional checks in the for-version mean it has more overhead in the long run.
Let's disable these checks:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks(np.ndarray[np.double_t] a):
    cdef int i
    for i in range(len(a)):
        a[i]=0.0
In the produced html we can see, that a[i] gets as simple as it gets:
__pyx_t_3 = __pyx_v_i;
*__Pyx_BufPtrStrided1d(__pyx_t_5numpy_double_t *, __pyx_pybuffernd_a.rcbuffer->pybuffer.buf, __pyx_t_3, __pyx_pybuffernd_a.diminfo[0].strides) = 0.0;
}
__Pyx_BufPtrStrided1d(type, buf, i0, s0) is defined as (type)((char*)buf + i0 * s0).
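In other words, the element address is the buffer start plus index times stride, with the stride measured in bytes. A rough standard-library illustration of that arithmetic, using struct on a bytes buffer (the helper name is hypothetical and this is not how Cython accesses memory):

```python
import struct

ITEMSIZE = struct.calcsize("<d")   # 8 bytes per double
buf = struct.pack("<6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)

def strided_get(buf, i, stride):
    """Read the double at byte offset i * stride, like (char*)buf + i0 * s0."""
    return struct.unpack_from("<d", buf, i * stride)[0]

print(strided_get(buf, 2, ITEMSIZE))      # 2.0: contiguous layout
print(strided_get(buf, 2, 2 * ITEMSIZE))  # 4.0: a view that skips every other element
```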
And now:
>>> %timeit use_for_no_checks(a)
1000000 loops, best of 3: 1.17 µs per loop
>>> %timeit use_for_no_checks(b)
1000 loops, best of 3: 246 µs per loop
We can improve it further by releasing gil in the for-loop:
%%cython
cimport cython
cimport numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def use_for_no_checks_no_gil(np.ndarray[np.double_t] a):
    cdef int i
    cdef int n=len(a)
    with nogil:
        for i in range(n):
            a[i]=0.0
and now:
>>> %timeit use_for_no_checks_no_gil(a)
1000000 loops, best of 3: 1.07 µs per loop
>>> %timeit use_for_no_checks_no_gil(b)
10000 loops, best of 3: 166 µs per loop
So it is somewhat faster, but still you cannot beat numpy for larger arrays.
In my opinion, there are two things to take from it:
Cython will not transform slices into element access through a for-loop, and thus Python functionality must be used.
There is a small overhead, but it is only that of calling the NumPy functionality; most of the work is done in NumPy code, and this cannot be sped up through Cython.
One last try using memset function:
%%cython
from libc.string cimport memset
cimport numpy as np
def use_memset(np.ndarray[np.double_t] a):
    memset(&a[0], 0, len(a)*sizeof(np.double_t))
We get:
>>> %timeit use_memset(a)
1000000 loops, best of 3: 821 ns per loop
>>> %timeit use_memset(b)
10000 loops, best of 3: 102 µs per loop
It is also as fast as the numpy-code for large arrays.
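A caveat worth spelling out: memset fills bytes, not doubles. It works here only because the all-zero byte pattern happens to be 0.0 for IEEE 754 doubles, and it assumes the array is contiguous in memory. A small sketch with ctypes (standard library only, unrelated to the Cython code above) illustrates the first point:

```python
import ctypes

# Filling with byte 0: every double becomes 0.0
arr = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)
ctypes.memset(arr, 0, ctypes.sizeof(arr))
zeroed = list(arr)

# Filling with byte 1: every byte is 0x01, which is NOT the double 1.0
arr2 = (ctypes.c_double * 4)()
ctypes.memset(arr2, 1, ctypes.sizeof(arr2))
not_ones = list(arr2)

print(zeroed)
print(not_ones)
```

So the memset approach is a special case for zeroing, not a general replacement for the slice assignment.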
As DavidW suggested, one could try to use memory-views:
%%cython
cimport numpy as np
def use_slices_memview(double[::1] a):
    a[0:len(a)]=0.0
leads to slightly faster code for small arrays but similarly fast code for large arrays (compared to numpy-slices):
>>> %timeit use_slices_memview(a)
1000000 loops, best of 3: 1.52 µs per loop
>>> %timeit use_slices_memview(b)
10000 loops, best of 3: 105 µs per loop
That means, that the memory-view slices have less overhead than the numpy-slices. Here is the produced code:
__pyx_t_1 = __Pyx_MemoryView_Len(__pyx_v_a);
__pyx_t_2.data = __pyx_v_a.data;
__pyx_t_2.memview = __pyx_v_a.memview;
__PYX_INC_MEMVIEW(&__pyx_t_2, 0);
__pyx_t_3 = -1;
if (unlikely(__pyx_memoryview_slice_memviewslice(
&__pyx_t_2,
__pyx_v_a.shape[0], __pyx_v_a.strides[0], __pyx_v_a.suboffsets[0],
0,
0,
&__pyx_t_3,
0,
__pyx_t_1,
0,
1,
1,
0,
1) < 0))
{
__PYX_ERR(0, 27, __pyx_L1_error)
}
{
double __pyx_temp_scalar = 0.0;
{
Py_ssize_t __pyx_temp_extent = __pyx_t_2.shape[0];
Py_ssize_t __pyx_temp_idx;
double *__pyx_temp_pointer = (double *) __pyx_t_2.data;
for (__pyx_temp_idx = 0; __pyx_temp_idx < __pyx_temp_extent; __pyx_temp_idx++) {
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
}
}
}
__PYX_XDEC_MEMVIEW(&__pyx_t_2, 1);
__pyx_t_2.memview = NULL;
__pyx_t_2.data = NULL;
I think the most important part: this code doesn't create an additional temporary object - it reuses the existing memory view for the slice.
My compiler produces (at least for my machine) a slightly faster code if memory views are used. Not sure whether it is worth an investigation. At the first sight the difference in every iteration step is:
# created code for memview-slices:
*((double *) __pyx_temp_pointer) = __pyx_temp_scalar;
__pyx_temp_pointer += 1;
#created code for memview-for-loop:
__pyx_v_i = __pyx_t_3;
__pyx_t_4 = __pyx_v_i;
*((double *) ( /* dim=0 */ ((char *) (((double *) data) + __pyx_t_4)) )) = 0.0;
I would expect different compilers to handle this code with varying success. But clearly, the first version is easier to optimize.
As Behzad Jamali pointed out, there is a difference between double[:] a and double[::1] a. The second version using slices is about 20% faster on my machine. The difference is that, for the double[::1] version, it is known at compile time that the memory accesses will be consecutive, and this can be used for optimization. In the version with double[:] we don't know anything about the stride until runtime.
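The contiguous/strided distinction can be seen with Python's built-in memoryview, which exposes the same buffer-protocol information. This is only an analogy using the standard library, not Cython's typed memoryviews: a contiguous view would be accepted as double[::1], while the strided one would only fit double[:].

```python
from array import array

data = array("d", range(8))
full = memoryview(data)      # contiguous: stride equals the item size
every_other = full[::2]      # strided: each step skips one element

print(full.contiguous, full.strides)
print(every_other.contiguous, every_other.strides)
```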

Why is Cython slower than vectorized NumPy?

Consider the following Cython code :
cimport cython
cimport numpy as np
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
def test_numpyvec(a, b):
    a += b
def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b
Running it in the interpreter yields (after a few runs to warm up the cache) :
In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop
In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop
In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop
# See answer below :
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop
I tried it with different dataset sizes, and consistently had the vectorized NumPy function run faster than the compiled Cython code, while I was expecting Cython to be on par with vectorized NumPy in terms of performance.
Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) in order to make such simple operations run faster? Can I improve the performance of this code?
Update: The raw pointer version seems to be on par with NumPy. So apparently there's some overhead in using memory view or NumPy indexing.
Another option is to use raw pointers (and the global directives to avoid repeating #cython...):
#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False
#...
cdef ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]
def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
On my machine the difference isn't as large, but I can nearly eliminate it by changing the numpy and memory view functions like this
@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]
@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]
and then, when I compile the C output from Cython, I use the flags -O3 and -march=native.
This seems to indicate that the difference in timings comes from the use of different compiler optimizations.
I use the 64 bit version of MinGW and NumPy 1.8.1.
Your results will probably vary depending on your package versions, hardware, platform, and compiler.
If you are using the IPython notebook's Cython magic, you can force an update with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native
If you are using a standard setup.py for your cython module you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
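For the setup.py route, a minimal sketch might look like the following. The module name fastloops and its .pyx file are hypothetical, setuptools is used here in place of distutils, and the flags are the GCC/Clang ones mentioned above (MSVC uses different options).

```python
# setup.py: a sketch; "fastloops" is a hypothetical module name
import numpy
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "fastloops",
    sources=["fastloops.pyx"],
    include_dirs=[numpy.get_include()],   # needed because the module cimports numpy
    extra_compile_args=["-O3", "-march=native"],
)

setup(ext_modules=cythonize([ext]))
```

It would typically be built with python setup.py build_ext --inplace.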
Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary.
That value defaults to 1 anyway.
A change that slightly increases the speed is to specify the stride:
def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

C array vs NumPy array

In terms of performance (algebraic operations, lookup, caching, etc.), is there a difference between C arrays (which can be exposed as a C array, or a cython.view.array [Cython array], or a memoryview of the aforementioned two) and NumPy arrays (which in Cython should have no Python overhead)?
Edit:
I should mention that the NumPy array is statically typed using Cython, and the dtypes are NumPy compile-time datatypes (e.g. cdef np.int_t or cdef np.float32_t), and the types in the C case are the C equivalents (cdef int_t and cdef float)
Edit2:
Here is the example from the Cython Memoryview documentation to further illustrate my question:
from cython.view cimport array as cvarray
import numpy as np
# Memoryview on a NumPy array
narr = np.arange(27, dtype=np.dtype("i")).reshape((3, 3, 3))
cdef int [:, :, :] narr_view = narr
# Memoryview on a C array
cdef int carr[3][3][3]
cdef int [:, :, :] carr_view = carr
# Memoryview on a Cython array
cyarr = cvarray(shape=(3, 3, 3), itemsize=sizeof(int), format="i")
cdef int [:, :, :] cyarr_view = cyarr
Is there any difference between sticking with a C array vs a Cython array vs a NumPy array?
My knowledge on this is still imperfect, but this may be helpful.
I ran some informal benchmarks to show what each array type is good for and was intrigued by what I found.
Though these array types are different in many ways, if you are doing heavy computation with large arrays, you should be able to get similar performance out of any of them since item-by-item access should be roughly the same across the board.
A NumPy array is a Python object implemented using Python's C API.
NumPy arrays do provide an API at the C level, but they cannot be created independent from the Python interpreter.
They are especially useful because of all the different array manipulation routines available in NumPy and SciPy.
A Cython memory view is also a Python object, but it is made as a Cython extension type.
It does not appear to be designed for use in pure Python since it isn't a part of Cython that can be imported directly from Python, but you can return a view to Python from a Cython function.
You can look at the implementation at https://github.com/cython/cython/blob/master/Cython/Utility/MemoryView.pyx
A C array is a native type in the C language.
It is indexed like a pointer, but arrays and pointers are different.
There is some good discussion on this at http://c-faq.com/aryptr/index.html
They can be allocated on the stack and are easier for the C compiler to optimize, but they will be more difficult to access outside of Cython.
I know you can make a NumPy array from memory that has been dynamically allocated by other programs, but it seems a lot more difficult that way.
Travis Oliphant posted an example of this at http://blog.enthought.com/python/numpy-arrays-with-pre-allocated-memory/
If you are using C arrays or pointers for temporary storage within your program they should work very well for you.
They will not be as convenient for slicing or for any other sort of vectorized computation since you will have to do everything yourself with explicit looping, but they should allocate and deallocate faster and ought to provide a good baseline for speed.
Cython also provides an array class.
It looks like it is designed for internal use.
Instances are created when a memoryview is copied.
See http://docs.cython.org/src/userguide/memoryviews.html#view-cython-arrays
In Cython, you can also allocate memory and index a pointer to treat the allocated memory somewhat like an array.
See http://docs.cython.org/src/tutorial/memory_allocation.html
Here are some benchmarks that show somewhat similar performance for indexing large arrays.
This is the Cython file.
from numpy cimport ndarray as ar, uint64_t
cimport cython
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            A[j] = n
def carr_time(uint64_t n=1000000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t AC[10000]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(10000):
            AC[j] = n
@cython.boundscheck(False)
@cython.wraparound(False)
def ptr_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t* AP = &A[0]
        uint64_t a
        int i, j
    for i in range(n):
        for j in range(size):
            AP[j] = n
@cython.boundscheck(False)
@cython.wraparound(False)
def view_time(uint64_t n=1000000, uint64_t size=10000):
    cdef:
        ar[uint64_t] A = np.empty(n, dtype=np.uint64)
        uint64_t[:] AV = A
        uint64_t i, j
    for i in range(n):
        for j in range(size):
            AV[j] = n
Timing these using IPython we obtain
%timeit -n 10 ndarr_time()
%timeit -n 10 carr_time()
%timeit -n 10 ptr_time()
%timeit -n 10 view_time()
10 loops, best of 3: 6.33 s per loop
10 loops, best of 3: 3.12 s per loop
10 loops, best of 3: 6.26 s per loop
10 loops, best of 3: 3.74 s per loop
These results struck me as a little odd, considering that, as per "Efficiency: arrays vs pointers", arrays are unlikely to be significantly faster than pointers.
It appears that some sort of compiler optimization is making the pure C arrays and the typed memory views faster.
I tried turning off all the optimization flags on my C compiler and got the timings
1 loops, best of 3: 25.1 s per loop
1 loops, best of 3: 25.5 s per loop
1 loops, best of 3: 32 s per loop
1 loops, best of 3: 28.4 s per loop
It looks to me like the item-by-item access is pretty much the same across the board, except that C arrays and Cython memory views seem to be easier for the compiler to optimize.
More commentary on this can be seen in these two blog posts I found some time ago:
http://jakevdp.github.io/blog/2012/08/08/memoryview-benchmarks/
http://jakevdp.github.io/blog/2012/08/16/memoryview-benchmarks-2/
In the second blog post he comments on how, if memory view slices are inlined, they can provide speeds similar to that of pointer arithmetic.
I have noticed in some of my own tests that explicitly inlining functions that use Memory View slices isn't always necessary.
As an example of this, I'll compute the inner product of every combination of two rows of an array.
from numpy cimport ndarray as ar
cimport cython
from numpy import empty
# An inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double dot_product(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot
# non-inlined dot-product
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double dot_product_no_inline(double[:] a, double[:] b, int size):
    cdef int i
    cdef double tot = 0.
    for i in range(size):
        tot += a[i] * b[i]
    return tot
# function calling inlined dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing(ar[double,ndim=2] A):
    cdef:
        double[:,:] Aview = A
        ar[double,ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = dot_product(Aview[i], Aview[j], A.shape[1])
    return res
# function calling non-inlined version
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_slicing_no_inline(ar[double,ndim=2] A):
    cdef:
        double[:,:] Aview = A
        ar[double,ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = dot_product_no_inline(Aview[i], Aview[j], A.shape[1])
    return res
# inlined dot product using numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef inline double ndarr_dot_product(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot
# non-inlined dot product using numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
cdef double ndarr_dot_product_no_inline(ar[double] a, ar[double] b):
    cdef int i
    cdef double tot = 0.
    for i in range(a.size):
        tot += a[i] * b[i]
    return tot
# function calling inlined numpy array dot product
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing(ar[double,ndim=2] A):
    cdef:
        ar[double,ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = ndarr_dot_product(A[i], A[j])
    return res
# function calling non-inlined version for numpy arrays
@cython.boundscheck(False)
@cython.wraparound(False)
def ndarr_dot_rows_slicing_no_inline(ar[double,ndim=2] A):
    cdef:
        ar[double,ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            res[i,j] = ndarr_dot_product_no_inline(A[i], A[j])
    return res
# Version with explicit looping and item-by-item access.
@cython.boundscheck(False)
@cython.wraparound(False)
def dot_rows_loops(ar[double,ndim=2] A):
    cdef:
        ar[double,ndim=2] res = empty((A.shape[0], A.shape[0]))
        int i, j, k
        double tot
    for i in range(A.shape[0]):
        for j in range(A.shape[0]):
            tot = 0.
            for k in range(A.shape[1]):
                tot += A[i,k] * A[j,k]
            res[i,j] = tot
    return res
Timing these we see
A = np.random.rand(1000, 1000)
%timeit dot_rows_slicing(A)
%timeit dot_rows_slicing_no_inline(A)
%timeit ndarr_dot_rows_slicing(A)
%timeit ndarr_dot_rows_slicing_no_inline(A)
%timeit dot_rows_loops(A)
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 1.02 s per loop
1 loops, best of 3: 3.65 s per loop
1 loops, best of 3: 3.66 s per loop
1 loops, best of 3: 1.04 s per loop
The results were as fast with explicit inlining as they were without it.
In both cases, the typed memory views were comparable to a version of the function that was written without slicing.
In the blog post, he had to write a specific example to force the compiler to not inline a function.
It appears that a decent C compiler (I'm using MinGW) is able to take care of these optimizations without being told to inline certain functions.
Memoryviews can be faster for passing array slices between functions within a Cython module, even without explicit inlining.
In this particular case, however, even pushing the loops to C doesn't really reach a speed anywhere near what can be achieved through proper use of matrix multiplication.
The BLAS is still the best way to do things like this.
%timeit A.dot(A.T)
10 loops, best of 3: 25.7 ms per loop
There is also automatic conversion from NumPy arrays to memoryviews as in
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def cysum(double[:] A):
    cdef double tot = 0.
    cdef int i
    for i in range(A.shape[0]):
        tot += A[i]
    return tot
The one catch is that, if you want a function to return a NumPy array, you will have to use np.asarray to convert the memory view object to a NumPy array again.
This is a relatively inexpensive operation since memory views comply with http://www.python.org/dev/peps/pep-3118/
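The reason the conversion is cheap is that the buffer protocol shares memory instead of copying it. Python's built-in memoryview follows the same protocol, so the idea can be sketched with the standard library alone (this is an analogy; np.asarray on a Cython memory view is the actual call discussed above):

```python
from array import array

data = array("d", [1.0, 2.0, 3.0])
view = memoryview(data)   # no copy: the view exposes data's buffer

view[0] = 10.0            # writing through the view...
print(data[0])            # ...changes the original array: 10.0
print(view.format, view.nbytes)
```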
Conclusion
Typed memory views seem to be a viable alternative to NumPy arrays for internal use in a Cython module.
Array slicing will be faster with memory views, but there are not as many functions and methods written for memory views as there are for NumPy arrays.
If you don't need to call a bunch of the NumPy array methods and want easy array slicing, you can use memory views in place of NumPy arrays.
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
You can then use the view for passing slices between functions and the array for calling NumPy functions.
That approach is still somewhat limited, but it will work well if you are doing most of your processing with a single array.
C arrays and/or dynamically allocated blocks of memory could be useful for intermediate calculations, but they are not as easy to pass back to Python for use there.
In my opinion, it is also more cumbersome to dynamically allocate multidimensional C arrays.
The best approach I am aware of is to allocate a large block of memory and then use integer arithmetic to index it as if it were a multidimensional array.
This could be an issue if you want easy allocation of arrays on the fly.
On the other hand, allocation times are probably a good bit faster for C arrays.
The other array types are designed to be nearly as fast and much more convenient, so I would recommend using them unless there is a compelling reason to do otherwise.
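The "one large block plus integer arithmetic" idea mentioned above can be sketched in plain Python as follows. The helper names make_flat and flat_index are hypothetical, and array.array is used here only as a stand-in for a malloc'ed block of doubles.

```python
from array import array

def make_flat(nrows, ncols):
    """Allocate one contiguous block holding nrows * ncols doubles."""
    return array("d", [0.0]) * (nrows * ncols)

def flat_index(i, j, ncols):
    """Row-major offset of element (i, j) inside the flat block."""
    return i * ncols + j

nrows, ncols = 3, 4
block = make_flat(nrows, ncols)
for i in range(nrows):
    for j in range(ncols):
        block[flat_index(i, j, ncols)] = 10 * i + j

print(block[flat_index(2, 3, ncols)])
```

In C or Cython the same offset formula would be applied to a raw pointer; the bookkeeping for the row length is what makes this more cumbersome than a memory view.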
Update: As mentioned in the answer by @Veedrac you can still pass Cython memory views to most NumPy functions.
When you do this, NumPy will usually have to create a new NumPy array object to work with the memory view anyway, so this will be somewhat slower.
For large arrays the effect will be negligible.
A call to np.asarray for a memory view will be relatively fast regardless of array size.
However, to demonstrate this effect, here is another benchmark:
Cython file:
def npy_call_on_view(npy_func, double[:] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)
def npy_call_on_arr(npy_func, ar[double] A, int n):
    cdef int i
    for i in range(n):
        npy_func(A)
in IPython:
from numpy.random import rand
A = rand(1)
%timeit npy_call_on_view(np.amin, A, 10000)
%timeit npy_call_on_arr(np.amin, A, 10000)
output:
10 loops, best of 3: 282 ms per loop
10 loops, best of 3: 35.9 ms per loop
I tried to choose an example that would show this effect well.
Unless many NumPy function calls on relatively small arrays are involved, this shouldn't change the time a whole lot.
Keep in mind that, regardless of which way we are calling NumPy, a Python function call still occurs.
This applies only to the functions in NumPy.
Most of the array methods are not available for memoryviews (some of the attributes still are, like size and shape and T).
For example A.dot(A.T) with NumPy arrays would become np.dot(A, A.T).
Don't use cython.view.array, use cpython.array.array.
See this answer of mine for details, although that only deals with speed. The recommendation is to treat cython.view.array as "demo" material, and cpython.array.array as an actual solid implementation. These arrays are very lightweight and better when just using them as scratch space.
Further, if you're ever tempted by malloc, raw access on these is no slower and instantiation takes only twice as long.
With regards to IanH's
If you need both the array slicing and the NumPy functionality for a given array, you can make a memory view that points to the same memory as the NumPy array.
It's worth noting that memoryviews have a "base" property and many Numpy functions can also take memoryviews, so these do not have to be separated variables.

NumPy: function for simultaneous max() and min()

numpy.amax() will find the max value in an array, and numpy.amin() does the same for the min value. If I want to find both max and min, I have to call both functions, which requires passing over the (very big) array twice, which seems slow.
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
No. At the time of this writing, there is no such function. (And yes, if there were such a function, its performance would be significantly better than calling numpy.amin() and numpy.amax() successively on a large array.)
I don't think that passing over the array twice is a problem. Consider the following pseudo-code:
minval = array[0]
maxval = array[0]
for i in array:
    if i < minval:
        minval = i
    if i > maxval:
        maxval = i
While there is only 1 loop here, there are still 2 checks. (Instead of having 2 loops with 1 check each). Really the only thing you save is the overhead of 1 loop. If the arrays really are big as you say, that overhead is small compared to the actual loop's work load. (Note that this is all implemented in C, so the loops are more or less free anyway).
EDIT Sorry to the 4 of you who upvoted and had faith in me. You definitely can optimize this.
Here's some fortran code which can be compiled into a python module via f2py (maybe a Cython guru can come along and compare this with an optimized C version ...):
subroutine minmax1(a,n,amin,amax)
implicit none
!f2py intent(hidden) :: n
!f2py intent(out) :: amin,amax
!f2py intent(in) :: a
integer n
real a(n),amin,amax
integer i
amin = a(1)
amax = a(1)
do i=2, n
if(a(i) > amax)then
amax = a(i)
elseif(a(i) < amin) then
amin = a(i)
endif
enddo
end subroutine minmax1
subroutine minmax2(a,n,amin,amax)
implicit none
!f2py intent(hidden) :: n
!f2py intent(out) :: amin,amax
!f2py intent(in) :: a
integer n
real a(n),amin,amax
amin = minval(a)
amax = maxval(a)
end subroutine minmax2
Compile it via:
f2py -m untitled -c fortran_code.f90
And now we're in a place where we can test it:
import timeit
size = 100000
repeat = 10000
print(timeit.timeit(
    'np.min(a); np.max(a)',
    setup='import numpy as np; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), " # numpy min/max")
print(timeit.timeit(
    'untitled.minmax1(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax1')
print(timeit.timeit(
    'untitled.minmax2(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax2')
The results are a bit staggering for me:
8.61869883537 # numpy min/max
1.60417699814 # minmax1
2.30169081688 # minmax2
I have to say, I don't completely understand it. Comparing just np.min versus minmax1 and minmax2 is still a losing battle, so it's not just a memory issue ...
notes -- Increasing size by a factor of 10**a and decreasing repeat by a factor of 10**a (keeping the problem size constant) does change the performance, but not in a seemingly consistent way which shows that there is some interplay between memory performance and function call overhead in python. Even comparing a simple min implementation in fortran beats numpy's by a factor of approximately 2 ...
You could use Numba, which is a NumPy-aware dynamic Python compiler using LLVM. The resulting implementation is pretty simple and clear:
import numpy
import numba
@numba.jit
def minmax(x):
    maximum = x[0]
    minimum = x[0]
    for i in x[1:]:
        if i > maximum:
            maximum = i
        elif i < minimum:
            minimum = i
    return (minimum, maximum)
numpy.random.seed(1)
x = numpy.random.rand(1000000)
print(minmax(x) == (x.min(), x.max()))
It should also be faster than NumPy's min() and max() implementation, and all without having to write a single line of C or Fortran code.
Do your own performance tests, as it is always dependent on your architecture, your data, your package versions...
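For such tests outside IPython, a minimal timing harness might look like the sketch below. The function name two_pass, the array size, and the repeat counts are choices made for this example. Any candidate (a Numba or Cython version, for instance) could be timed the same way by swapping it in for two_pass.

```python
import timeit

import numpy as np

x = np.random.default_rng(1).random(1_000_000)

def two_pass(a):
    """Baseline: two separate NumPy reductions."""
    return a.min(), a.max()

# Best of 5 repeats, 10 calls each, reported per call.
best = min(timeit.repeat(lambda: two_pass(x), number=10, repeat=5)) / 10
print(f"two-pass NumPy min/max: {best * 1e3:.3f} ms per call")
```

For JIT-compiled candidates, call the function once before timing so that compilation is not included in the measurement.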
There is a function for finding (max-min) called numpy.ptp if that's useful for you:
>>> import numpy
>>> x = numpy.array([1,2,3,4,5,6])
>>> numpy.ptp(x)
5
but I don't think there's a way to find both min and max with one traversal.
EDIT: ptp just calls min and max under the hood
Just to get some ideas on the numbers one could expect, given the following approaches:
import numpy as np
def extrema_np(arr):
    return np.max(arr), np.min(arr)
import numba as nb
@nb.jit(nopython=True)
def extrema_loop_nb(arr):
    n = arr.size
    max_val = min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    return max_val, min_val
import numba as nb
@nb.jit(nopython=True)
def extrema_while_nb(arr):
    n = arr.size
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    return max_val, min_val
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef void _extrema_loop_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i
    cdef long item, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    result[0] = max_val
    result[1] = min_val
def extrema_loop_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_loop_cy(arr, arr.size, result)
    return result[0], result[1]
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True
import numpy as np
cdef void _extrema_while_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i, odd
    cdef long x, y, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    result[0] = max_val
    result[1] = min_val
def extrema_while_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_while_cy(arr, arr.size, result)
    return result[0], result[1]
(the extrema_loop_*() approaches are similar to what is proposed here, while extrema_while_*() approaches are based on the code from here)
The timings (plotted in the original answer; the figure is not reproduced here) indicate that the extrema_while_*() approaches are the fastest, with extrema_while_nb() the fastest of all. In any case, the extrema_loop_nb() and extrema_loop_cy() solutions also outperform the NumPy-only approach (using np.max() and np.min() separately).
Finally, note that none of these is as flexible as np.min()/np.max() (in terms of n-dim support, axis parameter, etc.).
(full code is available here)
Nobody mentioned numpy.percentile, so I thought I would. If you ask for [0, 100] percentiles, it will give you an array of two elements, the min (0th percentile) and the max (100th percentile).
However, it doesn't satisfy the OP's purpose: it's not faster than min and max separately. That's probably due to some machinery that would allow for non-extreme percentiles (a harder problem, which should take longer).
In [1]: import numpy
In [2]: a = numpy.random.normal(0, 1, 1000000)
In [3]: %%timeit
...: lo, hi = numpy.amin(a), numpy.amax(a)
...:
100 loops, best of 3: 4.08 ms per loop
In [4]: %%timeit
...: lo, hi = numpy.percentile(a, [0, 100])
...:
100 loops, best of 3: 17.2 ms per loop
In [5]: numpy.__version__
Out[5]: '1.14.4'
A future version of Numpy could put in a special case to skip the normal percentile calculation if only [0, 100] are requested. Without adding anything to the interface, there's a way to ask Numpy for min and max in one call (contrary to what was said in the accepted answer), but the standard implementation of the library doesn't take advantage of this case to make it worthwhile.
In general you can reduce the number of comparisons for a minmax algorithm by processing two elements at a time and only comparing the smaller one to the temporary minimum and the bigger one to the temporary maximum. On average this needs only 3/4 of the comparisons of a naive approach.
This could be implemented in C or Fortran (or any other low-level language) and should be almost unbeatable in terms of performance. I'm using numba to illustrate the principle and get a very fast, dtype-independent implementation:
import numba as nb
import numpy as np
@nb.njit
def minmax(array):
    # Ravel the array and return early if it's empty
    array = array.ravel()
    length = array.size
    if not length:
        return
    # We want to process two elements at once so we need
    # an even sized array, but we preprocess the first and
    # start with the second element, so we want it "odd"
    odd = length % 2
    if not odd:
        length -= 1
    # Initialize min and max with the first item
    minimum = maximum = array[0]
    i = 1
    while i < length:
        # Get the next two items and swap them if necessary
        x = array[i]
        y = array[i+1]
        if x > y:
            x, y = y, x
        # Compare the min with the smaller one and the max
        # with the bigger one
        minimum = min(x, minimum)
        maximum = max(y, maximum)
        i += 2
    # If we had an even sized array we need to compare the
    # one remaining item too.
    if not odd:
        x = array[length]
        minimum = min(x, minimum)
        maximum = max(x, maximum)
    return minimum, maximum
It's definitely faster than the naive approach that Peque presented:
arr = np.random.random(3000000)
assert minmax(arr) == minmax_peque(arr) # warmup and making sure they are identical
%timeit minmax(arr) # 100 loops, best of 3: 2.1 ms per loop
%timeit minmax_peque(arr) # 100 loops, best of 3: 2.75 ms per loop
As expected the new minmax implementation only takes roughly 3/4 of the time the naive implementation took (2.1 / 2.75 = 0.7636363636363637)
This is an old thread, but anyway, if anyone ever looks at this again...
When looking for the min and max simultaneously, it is possible to reduce the number of comparisons. If it is floats you are comparing (which I guess it is) this might save you some time, although not computational complexity.
Instead of (Python code):
_max = ar[0]
_min = ar[0]
for ii in range(len(ar)):
    if ar[ii] > _max: _max = ar[ii]
    if ar[ii] < _min: _min = ar[ii]
you can first compare two adjacent values in the array, and then only compare the smaller one against current minimum, and the larger one against current maximum:
## for an even-sized array
_max = ar[0]
_min = ar[0]
for ii in range(0, len(ar), 2):  ## iterate over every other value in the array
    f1 = ar[ii]
    f2 = ar[ii+1]
    if (f1 < f2):
        if f1 < _min: _min = f1
        if f2 > _max: _max = f2
    else:
        if f2 < _min: _min = f2
        if f1 > _max: _max = f1
The code here is written in Python, clearly for speed you would use C or Fortran or Cython, but this way you do 3 comparisons per iteration, with len(ar)/2 iterations, giving 3/2 * len(ar) comparisons. As opposed to that, doing the comparison "the obvious way" you do two comparisons per iteration, leading to 2*len(ar) comparisons. Saves you 25% of comparison time.
Maybe someone one day will find this useful.
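To make the comparison counts concrete, here is a small pure-Python sketch that tallies comparisons for both strategies. The function names and the explicit counters are additions for illustration only. Unlike the even-sized snippet above, the pairwise version here seeds from the first element and handles a leftover element.

```python
def minmax_naive(values):
    """Two comparisons per element after the first."""
    lo = hi = values[0]
    comparisons = 0
    for v in values[1:]:
        comparisons += 2
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi, comparisons

def minmax_pairwise(values):
    """Three comparisons per pair of elements."""
    lo = hi = values[0]
    comparisons = 0
    n = len(values)
    i = 1  # element 0 already seeded lo and hi
    while i + 1 < n:
        a, b = values[i], values[i + 1]
        comparisons += 3
        if a > b:
            a, b = b, a
        if a < lo:
            lo = a
        if b > hi:
            hi = b
        i += 2
    if i < n:  # one element left over
        comparisons += 2
        v = values[i]
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return lo, hi, comparisons

data = [5, 3, 9, 1, 7, 2, 8, 6, 4]
print(minmax_naive(data))     # (1, 9, 16)
print(minmax_pairwise(data))  # (1, 9, 12)
```

For these nine values the pairwise version needs 12 comparisons against 16, which is the 3/4 ratio described above.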
At first glance, numpy.histogram appears to do the trick:
count, (amin, amax) = numpy.histogram(a, bins=1)
... but if you look at the source for that function, it simply calls a.min() and a.max() independently, and therefore fails to avoid the performance concerns addressed in this question. :-(
Similarly, scipy.ndimage.measurements.extrema looks like a possibility, but it, too, simply calls a.min() and a.max() independently.
It was worth the effort for me anyway, so I'll propose the most difficult and least elegant solution here for whoever may be interested. My solution is to implement a multi-threaded, one-pass min-max algorithm in C++, and use this to create a Python extension module. This effort requires a bit of overhead for learning how to use the Python and NumPy C/C++ APIs, and here I will show the code and give some small explanations and references for whoever wishes to go down this path.
Multi-threaded Min/Max
There is nothing too interesting here. The array is broken into chunks of size length / workers. The min/max is calculated for each chunk in a future, which are then scanned for the global min/max.
// mt_np.h
//
// multi-threaded min/max algorithm
#include <algorithm>
#include <future>
#include <vector>
namespace mt_np {
/*
* Get {min,max} in interval [begin,end)
*/
template <typename T> std::pair<T, T> min_max(T *begin, T *end) {
T min{*begin};
T max{*begin};
while (++begin < end) {
if (*begin < min) {
min = *begin;
continue;
} else if (*begin > max) {
max = *begin;
}
}
return {min, max};
}
/*
* get {min,max} in interval [begin,end) using #workers for concurrency
*/
template <typename T>
std::pair<T, T> min_max_mt(T *begin, T *end, int workers) {
const long int chunk_size = std::max((end - begin) / workers, 1l);
std::vector<std::future<std::pair<T, T>>> min_maxes;
// fire up the workers
while (begin < end) {
T *next = std::min(end, begin + chunk_size);
min_maxes.push_back(std::async(min_max<T>, begin, next));
begin = next;
}
// retrieve the results
auto min_max_it = min_maxes.begin();
auto v{min_max_it->get()};
T min{v.first};
T max{v.second};
while (++min_max_it != min_maxes.end()) {
v = min_max_it->get();
min = std::min(min, v.first);
max = std::max(max, v.second);
}
return {min, max};
}
}; // namespace mt_np
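The same chunk-and-reduce idea can be sketched in Python with concurrent.futures. This is only an illustration of the structure, not a replacement for the C++ code: the names chunk_min_max and min_max_threaded are hypothetical, the input is assumed to be non-empty, and any speedup depends on NumPy releasing the GIL during the per-chunk reductions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def chunk_min_max(chunk):
    """Min and max of one non-empty chunk."""
    return chunk.min(), chunk.max()

def min_max_threaded(arr, workers=4):
    """Split arr into chunks, reduce each in a thread, then combine."""
    # array_split may yield empty chunks when workers > arr.size; drop them.
    chunks = [c for c in np.array_split(arr, workers) if c.size]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(chunk_min_max, chunks))
    mins, maxs = zip(*partial)
    return min(mins), max(maxs)

rng = np.random.default_rng(1)
samples = rng.normal(0, 50, 100_000)
result = min_max_threaded(samples)
print(result == (samples.min(), samples.max()))
```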
The Python Extension Module
Here is where things start getting ugly... One way to use C++ code in Python is to implement an extension module. This module can be built and installed using the distutils.core standard module. A complete description of what this entails is covered in the Python documentation: https://docs.python.org/3/extending/extending.html. NOTE: there are certainly other ways to get similar results, to quote https://docs.python.org/3/extending/index.html#extending-index:
This guide only covers the basic tools for creating extensions provided as part of this version of CPython. Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python.
Essentially, this route is probably more academic than practical. With that being said, what I did next was, sticking pretty close to the tutorial, create a module file. This is essentially boilerplate for distutils to know what to do with your code and create a Python module out of it. Before doing any of this it is probably wise to create a Python virtual environment so you don't pollute your system packages (see https://docs.python.org/3/library/venv.html#module-venv).
Here is the module file:
// mt_np_forpy.cc
//
// C++ module implementation for multi-threaded min/max for np
#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
#include <python3.6/numpy/arrayobject.h>
#include "mt_np.h"
#include <cstdint>
#include <iostream>
#include <thread>
#include <type_traits>
using namespace std;
/*
* check:
* shape
* stride
* data_type
* byteorder
* alignment
*/
static bool check_array(PyArrayObject *arr) {
if (PyArray_NDIM(arr) != 1) {
PyErr_SetString(PyExc_RuntimeError, "Wrong shape, require (1,n)");
return false;
}
if (PyArray_STRIDES(arr)[0] != 8) {
PyErr_SetString(PyExc_RuntimeError, "Expected stride of 8");
return false;
}
PyArray_Descr *descr = PyArray_DESCR(arr);
if (descr->type != NPY_LONGLTR && descr->type != NPY_DOUBLELTR) {
PyErr_SetString(PyExc_RuntimeError, "Wrong type, require l or d");
return false;
}
if (descr->byteorder != '=') {
PyErr_SetString(PyExc_RuntimeError, "Expected native byteorder");
return false;
}
if (descr->alignment != 8) {
cerr << "alignment: " << descr->alignment << endl;
PyErr_SetString(PyExc_RuntimeError, "Require proper alignment");
return false;
}
return true;
}
template <typename T>
static PyObject *mt_np_minmax_dispatch(PyArrayObject *arr) {
npy_intp size = PyArray_SHAPE(arr)[0];
T *begin = (T *)PyArray_DATA(arr);
auto minmax =
mt_np::min_max_mt(begin, begin + size, thread::hardware_concurrency());
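// "L" would misread doubles, so pick the format code that matches T
const char *fmt = std::is_floating_point<T>::value ? "(d,d)" : "(L,L)";
return Py_BuildValue(fmt, minmax.first, minmax.second);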
}
static PyObject *mt_np_minmax(PyObject *self, PyObject *args) {
PyArrayObject *arr;
if (!PyArg_ParseTuple(args, "O", &arr))
return NULL;
if (!check_array(arr))
return NULL;
switch (PyArray_DESCR(arr)->type) {
case NPY_LONGLTR: {
return mt_np_minmax_dispatch<int64_t>(arr);
} break;
case NPY_DOUBLELTR: {
return mt_np_minmax_dispatch<double>(arr);
} break;
default: {
PyErr_SetString(PyExc_RuntimeError, "Unknown error");
return NULL;
}
}
}
static PyObject *get_concurrency(PyObject *self, PyObject *args) {
return Py_BuildValue("I", thread::hardware_concurrency());
}
static PyMethodDef mt_np_Methods[] = {
{"mt_np_minmax", mt_np_minmax, METH_VARARGS, "multi-threaded np min/max"},
{"get_concurrency", get_concurrency, METH_VARARGS,
"retrieve thread::hardware_concurrency()"},
{NULL, NULL, 0, NULL} /* sentinel */
};
static struct PyModuleDef mt_np_module = {PyModuleDef_HEAD_INIT, "mt_np", NULL,
-1, mt_np_Methods};
PyMODINIT_FUNC PyInit_mt_np() { return PyModule_Create(&mt_np_module); }
In this file there is a significant use of the Python as well as the NumPy API, for more information consult: https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple, and for NumPy: https://docs.scipy.org/doc/numpy/reference/c-api.array.html.
Installing the Module
The next thing to do is to utilize distutils to install the module. This requires a setup file:
# setup.py
from distutils.core import setup,Extension
module = Extension('mt_np', sources = ['mt_np_forpy.cc'])
setup (name = 'mt_np',
version = '1.0',
description = 'multi-threaded min/max for np arrays',
ext_modules = [module])
To finally install the module, execute python3 setup.py install from your virtual environment.
Testing the Module
Finally, we can test to see if the C++ implementation actually outperforms naive use of NumPy. To do so, here is a simple test script:
# timing.py
# compare numpy min/max vs multi-threaded min/max
import numpy as np
import mt_np
import timeit
def normal_min_max(X):
    return (np.min(X),np.max(X))
print(mt_np.get_concurrency())
for ssize in np.logspace(3,8,6):
    size = int(ssize)
    print('********************')
    print('sample size:', size)
    print('********************')
    samples = np.random.normal(0,50,(2,size))
    for sample in samples:
        print('np:', timeit.timeit('normal_min_max(sample)',
                                   globals=globals(),number=10))
        print('mt:', timeit.timeit('mt_np.mt_np_minmax(sample)',
                                   globals=globals(),number=10))
Here are the results I got from doing all this:
8
********************
sample size: 1000
********************
np: 0.00012079699808964506
mt: 0.002468645994667895
np: 0.00011947099847020581
mt: 0.0020772050047526136
********************
sample size: 10000
********************
np: 0.00024697799381101504
mt: 0.002037393998762127
np: 0.0002713389985729009
mt: 0.0020942929986631498
********************
sample size: 100000
********************
np: 0.0007130410012905486
mt: 0.0019842900001094677
np: 0.0007540129954577424
mt: 0.0029724110063398257
********************
sample size: 1000000
********************
np: 0.0094779249993735
mt: 0.007134920000680722
np: 0.009129883001151029
mt: 0.012836456997320056
********************
sample size: 10000000
********************
np: 0.09471094200125663
mt: 0.0453535050037317
np: 0.09436299200024223
mt: 0.04188535599678289
********************
sample size: 100000000
********************
np: 0.9537652180006262
mt: 0.3957935369980987
np: 0.9624398809974082
mt: 0.4019058070043684
These are far less encouraging than the results reported earlier in the thread, which indicated a speedup of around 3.5x without any multi-threading. The results I achieved are somewhat reasonable: I would expect the overhead of threading to dominate the time until the arrays get very large, at which point the performance increase would start to approach a factor of std::thread::hardware_concurrency.
Conclusion
There is certainly room for application specific optimizations to some NumPy code, it would seem, in particular with regards to multi-threading. Whether or not it is worth the effort is not clear to me, but it certainly seems like a good exercise (or something). I think that perhaps learning some of those "third party tools" like Cython may be a better use of time, but who knows.
Inspired by the previous answer, I've written a Numba implementation that returns the min and max along axis=0 of a 2-D array. It's ~5x faster than calling NumPy's min/max.
Maybe someone will find it useful.
import numpy as np
from numba import jit
@jit
def minmax(x):
    """Return minimum and maximum from 2D array for axis=0."""
    m, n = len(x), len(x[0])
    mi, ma = np.empty(n), np.empty(n)
    mi[:] = ma[:] = x[0]
    for i in range(1, m):
        for j in range(n):
            if x[i, j]>ma[j]: ma[j] = x[i, j]
            elif x[i, j]<mi[j]: mi[j] = x[i, j]
    return mi, ma
x = np.random.normal(size=(256, 11))
mi, ma = minmax(x)
np.all(mi == x.min(axis=0)), np.all(ma == x.max(axis=0))
# (True, True)
%timeit x.min(axis=0), x.max(axis=0)
# 15.9 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit minmax(x)
# 2.62 µs ± 31.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The shortest way I've come up with is this:
mn, mx = np.sort(ar)[[0, -1]]
But since it sorts the array, it's not the most efficient.
Another short way would be:
mn, mx = np.percentile(ar, [0, 100])
This should be more efficient, but the result is calculated, and a float is returned.
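A quick sketch of the float-return behaviour mentioned above, using a small integer array chosen for this example. The cast back to the input dtype is an optional extra, not part of the original suggestion.

```python
import numpy as np

ar = np.array([7, 2, 9, 4])            # integer input
mn, mx = np.percentile(ar, [0, 100])
print(mn, mx, type(mn))                # values match min/max, but as floats

# Casting back recovers the original dtype when that matters.
mn_int, mx_int = ar.dtype.type(mn), ar.dtype.type(mx)
```

For very large 64-bit integers the round trip through float can lose precision, so ar.min() and ar.max() remain the safer choice there.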
Maybe use numpy.unique? Like so:
min_, max_ = numpy.unique(arr)[[0, -1]]
Just added it here for variety :) It's just as slow as sort.
