How to speed up Poisson pmf function? - python

My use case is to evaluate the Poisson pmf at all points less than, say, 10, and I call such a function many times with different lambdas. The lambdas are not known ahead of time, so I cannot vectorize over them.
I heard from somewhere about a trick: use _pmf instead. What is the downside of doing so? Even then it is still a bit slow; is there any way to improve it without rewriting the pmf in C from scratch?
%timeit scipy.stats.poisson.pmf(np.arange(0,10),3.3)
%timeit scipy.stats.poisson._pmf(np.arange(0,10),3.3)
a = np.arange(0,10)
%timeit scipy.stats.poisson._pmf(a,3.3)
10000 loops, best of 3: 94.5 µs per loop
100000 loops, best of 3: 15.2 µs per loop
100000 loops, best of 3: 13.7 µs per loop
Update
OK, simply put, I was too lazy to write it in Cython. I had expected there to be a faster solution for any discrete distribution that can be evaluated sequentially (iteratively) for consecutive x, e.g. P(X=3) = P(X=2) * lambda / 3 if X ~ Pois(lambda).
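For concreteness, that recurrence can be coded directly (my own illustration, not an existing scipy routine):
import math
import numpy as np

def poisson_pmf_upto(lam, n):
    # P(X=0) = exp(-lam); thereafter P(X=k) = P(X=k-1) * lam / k
    out = np.empty(n)
    out[0] = math.exp(-lam)
    for k in range(1, n):
        out[k] = out[k - 1] * lam / k
    return out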
Related: Is the built-in probability density function of `scipy.stats.distributions` slower than a user-provided one?
I have less faith in SciPy and Python now; the library function isn't as optimized as I had expected.

Most scipy.stats distributions support vectorized evaluation:
>>> poisson.pmf(1, [5, 6, 7, 8])
array([ 0.03368973, 0.01487251, 0.00638317, 0.0026837 ])
This may or may not be fast enough, but you can try taking the pmf calls out of the loop.
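For example, if several lambdas do become available at once, pmf broadcasts over both arguments, so one call covers them all (the shapes here are my own illustration):
>>> import numpy as np
>>> ks = np.arange(10)
>>> mus = np.array([3.3, 4.1, 5.0])
>>> poisson.pmf(ks[None, :], mus[:, None]).shape  # one row per lambda
(3, 10)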
Re the difference between pmf and _pmf: the real work is done in the underscored functions (_pmf, _cdf, etc.), while the public functions (pmf, cdf) make sure that only valid arguments reach _pmf. The output of _pmf is not guaranteed to be meaningful if the arguments are invalid, so use it at your own risk:
>>> poisson.pmf(1, -1)
nan
>>> poisson._pmf(1, -1)
/home/br/virtualenvs/scipy-dev/local/lib/python2.7/site-packages/scipy/stats/_discrete_distns.py:432: RuntimeWarning: invalid value encountered in log
Pk = k*log(mu)-gamln(k+1) - mu
nan
Further details: https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L2721

Try implementing the pmf in Cython. If your scipy is part of a distribution like Anaconda or Enthought, you probably have Cython installed already. http://cython.org/
Try running it with pypy. http://pypy.org/
Rent time on a big AWS server (or similar).

I found the scipy.stats.poisson class to be tragically slow compared to a simple Python implementation.
No cython or vectors or anything.
import math

def poisson_pmf(x, mu):
    return mu**x / math.factorial(x) * math.exp(-mu)

def poisson_cdf(k, mu):
    p_total = 0.0
    for x in range(k + 1):
        p_total += poisson_pmf(x, mu)
    return p_total
And if you check the source code of scipy.stats.poisson (even the underscore-prefixed versions), it is clear why!
The above implementation is now ONLY about 10x slower than the exact equivalent in C (compiled with gcc -O3, v9.3). The scipy version is at least another 10x slower.
#include <math.h>

/* Note: overflows a 64-bit unsigned long for n >= 21. */
unsigned long factorial(unsigned n) {
    unsigned long fact = 1;
    for (unsigned k = 2; k <= n; ++k)
        fact *= k;
    return fact;
}

double poisson_pmf(unsigned x, double mu) {
    return pow(mu, x) / factorial(x) * exp(-mu);
}

double poisson_cdf(unsigned k, double mu) {
    double p_total = 0.0;
    for (unsigned x = 0; x <= k; ++x)
        p_total += poisson_pmf(x, mu);
    return p_total;
}
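One caveat (my addition): both direct implementations break down for larger x, since factorial overflows 64-bit integers at x = 21 and mu**x can exceed the float range. A log-space variant, sketched below with math.lgamma, stays numerically stable at a small speed cost; this mirrors what scipy itself computes, as seen in the _pmf warning output earlier in the thread:
import math

def poisson_pmf_log(x, mu):
    # exp(x*log(mu) - log(x!) - mu), with lgamma(x + 1) == log(x!); assumes mu > 0
    return math.exp(x * math.log(mu) - math.lgamma(x + 1) - mu)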

Related

Runtime of python's if substring in string

What is the big O of the following if statement?
if "pl" in "apple":
...
What is the overall big-O of how Python determines whether the string "pl" is found in the string "apple", or of any other substring-in-string search?
Is this the most efficient way to test if a substring is in a string? Does it use the same algorithm as .find()?
The time complexity is O(N) on average and O(N*M) in the worst case (N being the length of the longer string, M the length of the shorter string you search for). As of Python 3.10, heuristics are used to lower the worst case to O(N + M) by switching algorithms.
The same algorithm is used for str.index(), str.find(), str.__contains__() (the in operator) and str.replace(); it is a simplification of the Boyer-Moore with ideas taken from the Boyer–Moore–Horspool and Sunday algorithms.
See the original stringlib discussion post, as well as the fastsearch.h source code; until Python 3.10, the base algorithm had not changed since its introduction in Python 2.5 (apart from some low-level optimisations and corner-case fixes).
The post includes a Python-code outline of the algorithm:
def find(s, p):
    # find first occurrence of p in s
    n = len(s)
    m = len(p)
    skip = delta1(p)[p[m-1]]
    i = 0
    while i <= n-m:
        if s[i+m-1] == p[m-1]: # (boyer-moore)
            # potential match
            if s[i:i+m-1] == p[:m-1]:
                return i
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + skip # (horspool)
        else:
            # skip
            if s[i+m] not in p:
                i = i + m + 1 # (sunday)
            else:
                i = i + 1
    return -1 # not found
as well as speed comparisons.
In Python 3.10, the algorithm was updated to use an enhanced version of Crochemore and Perrin's Two-Way string searching algorithm for larger problems (with p and s longer than 100 and 2100 characters, respectively, and s at least 6 times as long as p), in response to a pathological edge case someone reported. The commit adding this change included a write-up on how the algorithm works.
The Two-Way algorithm has a worst-case time complexity of O(N + M), where O(M) is a cost paid up front to build a shift table from the search needle p. Once you have that table, the algorithm has a best-case performance of O(N/M).
In Python 3.4.2, it looks like they resort to the same function, but there may nevertheless be a difference in timing. For example, s.find first has to look up the find method on the string, and so on.
The algorithm used is a mix of Boyer-Moore and Horspool.
You can use timeit and test it yourself:
maroun@DQHCPY1:~$ python -m timeit 's = "apple";s.find("pl")'
10000000 loops, best of 3: 0.125 usec per loop
maroun@DQHCPY1:~$ python -m timeit 's = "apple";"pl" in s'
10000000 loops, best of 3: 0.0371 usec per loop
Using in is indeed faster (0.0371 usec compared to 0.125 usec).
For actual implementation, you can look at the code itself.
I think the best way to find out is to look at the source. This looks like it would implement __contains__:
static int
bytes_contains(PyObject *self, PyObject *arg)
{
    Py_ssize_t ival = PyNumber_AsSsize_t(arg, PyExc_ValueError);
    if (ival == -1 && PyErr_Occurred()) {
        Py_buffer varg;
        Py_ssize_t pos;
        PyErr_Clear();
        if (PyObject_GetBuffer(arg, &varg, PyBUF_SIMPLE) != 0)
            return -1;
        pos = stringlib_find(PyBytes_AS_STRING(self), Py_SIZE(self),
                             varg.buf, varg.len, 0);
        PyBuffer_Release(&varg);
        return pos >= 0;
    }
    if (ival < 0 || ival >= 256) {
        PyErr_SetString(PyExc_ValueError, "byte must be in range(0, 256)");
        return -1;
    }
    return memchr(PyBytes_AS_STRING(self), (int) ival, Py_SIZE(self)) != NULL;
}
in terms of stringlib_find(), which uses fastsearch().

Why are some float < integer comparisons four times slower than others?

When comparing floats to integers, some pairs of values take much longer to be evaluated than other values of a similar magnitude.
For example:
>>> import timeit
>>> timeit.timeit("562949953420000.7 < 562949953421000") # run 1 million times
0.5387085462592742
But if the float or integer is made smaller or larger by a certain amount, the comparison runs much more quickly:
>>> timeit.timeit("562949953420000.7 < 562949953422000") # integer increased by 1000
0.1481498428446173
>>> timeit.timeit("562949953423001.8 < 562949953421000") # float increased by 3001.1
0.1459577925548956
Changing the comparison operator (e.g. using == or > instead) does not affect the times in any noticeable way.
This is not solely related to magnitude because picking larger or smaller values can result in faster comparisons, so I suspect it is down to some unfortunate way the bits line up.
Clearly, comparing these values is more than fast enough for most use cases. I am simply curious as to why Python seems to struggle more with some pairs of values than with others.
A comment in the Python source code for float objects acknowledges that:
Comparison is pretty much a nightmare
This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a float might lose precision and make the comparison inaccurate. Trying to cast the float to an integer is not going to work either because any fractional part will be lost.
To get around this problem, Python performs a series of checks, returning the result if one of the checks succeeds. It compares the signs of the two values, then whether the integer is "too big" to be a float, then compares the exponent of the float to the length of the integer. If all of these checks fail, it is necessary to construct two new Python objects to compare in order to obtain the result.
When comparing a float v to an integer/long w, the worst case is that:
v and w have the same sign (both positive or both negative),
the integer w has few enough bits that it can be held in the size_t type (typically 32 or 64 bits),
the integer w has at least 49 bits,
the exponent of the float v is the same as the number of bits in w.
And this is exactly what we have for the values in the question:
>>> import math
>>> math.frexp(562949953420000.7) # gives the float's (significand, exponent) pair
(0.9999999999976706, 49)
>>> (562949953421000).bit_length()
49
We see that 49 is both the exponent of the float and the number of bits in the integer. Both numbers are positive and so the four criteria above are met.
Choosing one of the values to be larger (or smaller) can change the number of bits of the integer, or the value of the exponent, and so Python is able to determine the result of the comparison without performing the expensive final check.
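For instance, checking the "fast" pairs from the question (my verification; both quantities are machine-independent):
>>> (562949953422000).bit_length()    # integer increased by 1000
50
>>> math.frexp(562949953423001.8)[1]  # float increased by 3001.1
50
In the first fast case the float's exponent (49) no longer matches the integer's bit count (50); in the second the exponent (50) exceeds the bit count (49), so the cheap checks decide the comparison.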
This is specific to the CPython implementation of the language.
The comparison in more detail
The float_richcompare function handles the comparison between two values v and w.
Below is a step-by-step description of the checks that the function performs. The comments in the Python source are actually very helpful when trying to understand what the function does, so I've left them in where relevant. I've also summarised these checks in a list at the foot of the answer.
The main idea is to map the Python objects v and w to two appropriate C doubles, i and j, which can then be easily compared to give the correct result. Both Python 2 and Python 3 use the same ideas to do this (the former just handles int and long types separately).
The first thing to do is check that v is definitely a Python float and map it to a C double i. Next the function looks at whether w is also a float and maps it to a C double j. This is the best case scenario for the function as all the other checks can be skipped. The function also checks to see whether v is inf or nan:
static PyObject*
float_richcompare(PyObject *v, PyObject *w, int op)
{
    double i, j;
    int r = 0;
    assert(PyFloat_Check(v));
    i = PyFloat_AS_DOUBLE(v);

    if (PyFloat_Check(w))
        j = PyFloat_AS_DOUBLE(w);

    else if (!Py_IS_FINITE(i)) {
        if (PyLong_Check(w))
            j = 0.0;
        else
            goto Unimplemented;
    }
Now we know that if w failed these checks, it is not a Python float. The function then checks whether it is a Python integer. If so, the easiest test is to extract the sign of v and the sign of w (return 0 if zero, -1 if negative, 1 if positive). If the signs are different, this is all the information needed to return the result of the comparison:
else if (PyLong_Check(w)) {
    int vsign = i == 0.0 ? 0 : i < 0.0 ? -1 : 1;
    int wsign = _PyLong_Sign(w);
    size_t nbits;
    int exponent;

    if (vsign != wsign) {
        /* Magnitudes are irrelevant -- the signs alone
         * determine the outcome.
         */
        i = (double)vsign;
        j = (double)wsign;
        goto Compare;
    }
}
If this check failed, then v and w have the same sign.
The next check counts the number of bits in the integer w. If it has too many bits then it can't possibly be held as a float and so must be larger in magnitude than the float v:
nbits = _PyLong_NumBits(w);
if (nbits == (size_t)-1 && PyErr_Occurred()) {
    /* This long is so large that size_t isn't big enough
     * to hold the # of bits.  Replace with little doubles
     * that give the same outcome -- w is so large that
     * its magnitude must exceed the magnitude of any
     * finite float.
     */
    PyErr_Clear();
    i = (double)vsign;
    assert(wsign != 0);
    j = wsign * 2.0;
    goto Compare;
}
On the other hand, if the integer w has 48 or fewer bits, it can safely be converted to a C double j and compared:
if (nbits <= 48) {
    j = PyLong_AsDouble(w);
    /* It's impossible that <= 48 bits overflowed. */
    assert(j != -1.0 || ! PyErr_Occurred());
    goto Compare;
}
From this point onwards, we know that w has 49 or more bits. It will be convenient to treat w as a positive integer, so change the sign and the comparison operator as necessary:
if (vsign < 0) {
    /* "Multiply both sides" by -1; this also swaps the
     * comparator.
     */
    i = -i;
    op = _Py_SwappedOp[op];
}
Now the function looks at the exponent of the float. Recall that a float can be written (ignoring sign) as significand * 2^exponent and that the significand represents a number between 0.5 and 1:
(void) frexp(i, &exponent);
if (exponent < 0 || (size_t)exponent < nbits) {
    i = 1.0;
    j = 2.0;
    goto Compare;
}
This checks two things. If the exponent is less than 0 then the float is smaller than 1 (and so smaller in magnitude than any integer). Or, if the exponent is less than the number of bits in w, then we have v < |w|, since significand * 2^exponent is less than 2^nbits.
Failing these two checks, the function looks to see whether the exponent is greater than the number of bits in w. This shows that significand * 2^exponent is greater than 2^nbits and so v > |w|:
if ((size_t)exponent > nbits) {
    i = 2.0;
    j = 1.0;
    goto Compare;
}
If this check did not succeed we know that the exponent of the float v is the same as the number of bits in the integer w.
The only way that the two values can be compared now is to construct two new Python integers from v and w. The idea is to discard the fractional part of v, double the integer part, and then add one. w is also doubled and these two new Python objects can be compared to give the correct return value. Using an example with small values, 4.65 < 4 would be determined by the comparison (2*4)+1 == 9 < 8 == (2*4) (returning false).
{
    double fracpart;
    double intpart;
    PyObject *result = NULL;
    PyObject *one = NULL;
    PyObject *vv = NULL;
    PyObject *ww = w;

    // snip

    fracpart = modf(i, &intpart); // split i (the double that v mapped to)
    vv = PyLong_FromDouble(intpart);

    // snip

    if (fracpart != 0.0) {
        /* Shift left, and or a 1 bit into vv
         * to represent the lost fraction.
         */
        PyObject *temp;

        one = PyLong_FromLong(1);

        temp = PyNumber_Lshift(ww, one); // left-shift doubles an integer
        ww = temp;

        temp = PyNumber_Lshift(vv, one);
        vv = temp;

        temp = PyNumber_Or(vv, one); // a doubled integer is even, so this adds 1
        vv = temp;
    }
    // snip
}
}
For brevity I've left out the additional error-checking and garbage-tracking Python has to do when it creates these new objects. Needless to say, this adds additional overhead and explains why the values highlighted in the question are significantly slower to compare than others.
Here is a summary of the checks that are performed by the comparison function.
Let v be a float and cast it as a C double. Now, if w is also a float:
Check whether w is nan or inf. If so, handle this special case separately depending on the type of w.
If not, compare v and w directly by their representations as C doubles.
If w is an integer:
Extract the signs of v and w. If they are different then we know v and w are different and which is the greater value.
(The signs are the same.) Check whether w has too many bits to be a float (more bits than size_t can hold). If so, w has greater magnitude than v.
Check if w has 48 or fewer bits. If so, it can be safely cast to a C double without losing its precision and compared with v.
(w has at least 49 bits. We now treat w as a positive integer, having changed the comparison operator as appropriate.)
Consider the exponent of the float v. If the exponent is negative, then v is less than 1 and therefore less than any positive integer. Else, if the exponent is less than the number of bits in w then it must be less than w.
If the exponent of v is greater than the number of bits in w then v is greater than w.
(The exponent is the same as the number of bits in w.)
The final check. Split v into its integer and fractional parts. Double the integer part and add 1 to compensate for the fractional part. Now double the integer w. Compare these two new integers instead to get the result.
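To make the flow concrete, here is a rough pure-Python model of the checks for v < w, restricted to positive values (my own sketch of the logic above, not CPython's actual code):
import math

def float_lt_int(v, w):
    # Model of CPython's float < int checks, positive v and w only.
    nbits = w.bit_length()
    if nbits <= 48:
        return v < float(w)        # w converts to a double exactly: cheap
    exponent = math.frexp(v)[1]    # v == significand * 2**exponent
    if exponent < nbits:
        return True                # v < 2**exponent <= 2**(nbits-1) <= w
    if exponent > nbits:
        return False               # v >= 2**(exponent-1) >= 2**nbits > w
    # exponent == nbits: the expensive path -- exact integer comparison
    fracpart, intpart = math.modf(v)
    vv = 2 * int(intpart) + (1 if fracpart else 0)
    return vv < 2 * w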
Using gmpy2 with arbitrary precision floats and integers it is possible to get more uniform comparison performance:
~ $ ptipython
Python 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Dec 7 2015, 11:16:01)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import gmpy2
In [2]: from gmpy2 import mpfr
In [3]: from gmpy2 import mpz
In [4]: gmpy2.get_context().precision=200
In [5]: i1=562949953421000
In [6]: i2=562949953422000
In [7]: f=562949953420000.7
In [8]: i11=mpz('562949953421000')
In [9]: i12=mpz('562949953422000')
In [10]: f1=mpfr('562949953420000.7')
In [11]: f<i1
Out[11]: True
In [12]: f<i2
Out[12]: True
In [13]: f1<i11
Out[13]: True
In [14]: f1<i12
Out[14]: True
In [15]: %timeit f<i1
The slowest run took 10.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 441 ns per loop
In [16]: %timeit f<i2
The slowest run took 12.55 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 152 ns per loop
In [17]: %timeit f1<i11
The slowest run took 32.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 269 ns per loop
In [18]: %timeit f1<i12
The slowest run took 36.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 231 ns per loop
In [19]: %timeit f<i11
The slowest run took 78.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 156 ns per loop
In [20]: %timeit f<i12
The slowest run took 21.24 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 194 ns per loop
In [21]: %timeit f1<i1
The slowest run took 37.61 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 275 ns per loop
In [22]: %timeit f1<i2
The slowest run took 39.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 259 ns per loop

Why is this iterative Collatz method 30% slower than its recursive version in Python?

Prelude
I have two implementations for a particular problem, one recursive and one iterative, and I want to know what causes the iterative solution to be ~30% slower than the recursive one.
Given the recursive solution, I write an iterative solution making the stack explicit.
Clearly, I simply mimic what the recursion is doing, so of course the Python engine is better optimized to handle the bookkeeping. But can we write an iterative method with similar performance?
My case study is Problem #14 on Project Euler.
Find the longest Collatz chain with a starting number below one million.
Code
Here is a parsimonious recursive solution (credit due to veritas in the problem thread plus an optimization from jJjjJ):
def solve_PE14_recursive(ub=10**6):
    def collatz_r(n):
        if not n in table:
            if n % 2 == 0:
                table[n] = collatz_r(n // 2) + 1
            elif n % 4 == 3:
                table[n] = collatz_r((3 * n + 1) // 2) + 2
            else:
                table[n] = collatz_r((3 * n + 1) // 4) + 3
        return table[n]
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_r)
Here's my iterative version:
def solve_PE14_iterative(ub=10**6):
    def collatz_i(n):
        stack = []
        while not n in table:
            if n % 2 == 0:
                x, y = n // 2, 1
            elif n % 4 == 3:
                x, y = (3 * n + 1) // 2, 2
            else:
                x, y = (3 * n + 1) // 4, 3
            stack.append((n, y))
            n = x
        ysum = table[n]
        for x, y in reversed(stack):
            ysum += y
            table[x] = ysum
        return ysum
    table = {1: 1}
    return max(xrange(ub // 2 + 1, ub, 2), key=collatz_i)
And the timings on my machine (i7 machine with lots of memory) using IPython:
In [3]: %timeit solve_PE14_recursive()
1 loops, best of 3: 942 ms per loop
In [4]: %timeit solve_PE14_iterative()
1 loops, best of 3: 1.35 s per loop
Comments
The recursive solution is awesome:
Optimized to skip a step or two depending on the two least significant bits.
My original solution didn't skip any Collatz steps and took ~1.86 s
It is difficult to hit Python's default recursion limit of 1000.
collatz_r(9780657630) returns 1133 but requires less than 1000 recursive calls.
Memoization avoids retracing
collatz_r length calculated on-demand for max
Playing around with it, timings seem to be precise to +/- 5 ms.
Languages with static typing like C and Haskell can get timings below 100 ms.
I put the initialization of the memoization table in the method by design for this question, so that timings would reflect the "re-discovery" of the table values on each invocation.
collatz_r(2**1002) raises RuntimeError: maximum recursion depth exceeded.
collatz_i(2**1002) happily returns with 1003.
I am familiar with generators, coroutines, and decorators.
I am using Python 2.7. I am also happy to use Numpy (1.8 on my machine).
What I am looking for
an iterative solution that closes the performance gap
discussion on how Python handles recursion
the finer details of the performance penalties associated with an explicit stack
I'm looking mostly for the first, though the second and third are very important to this problem and would increase my understanding of Python.
Here's my shot at a (partial) explanation after running some benchmarks, which confirm your figures.
While recursive function calls are expensive in CPython, they aren't nearly as expensive as emulating a call stack using lists. The stack for a recursive call is a compact structure implemented in C (see Eli Bendersky's explanation and the file Python/ceval.c in the source code).
By contrast, your emulated stack is a Python list object, i.e. a heap-allocated, dynamically growing array of pointers to tuple objects, which in turn point to the actual values; goodbye, locality of reference, hello cache misses. You then use Python's notoriously slow iteration on these objects. A line-by-line profiling with kernprof confirms that iteration and list handling are taking a lot of time:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
16 @profile
17 def collatz_i(n):
18 750000 339195 0.5 2.4 stack = []
19 3702825 1996913 0.5 14.2 while not n in table:
20 2952825 1329819 0.5 9.5 if n % 2 == 0:
21 864633 416307 0.5 3.0 x, y = n // 2, 1
22 2088192 906202 0.4 6.4 elif n % 4 == 3:
23 1043583 617536 0.6 4.4 x, y = (3 * n + 1) // 2, 2
24 else:
25 1044609 601008 0.6 4.3 x, y = (3 * n + 1) // 4, 3
26 2952825 1543300 0.5 11.0 stack.append((n, y))
27 2952825 1150867 0.4 8.2 n = x
28 750000 352395 0.5 2.5 ysum = table[n]
29 3702825 1693252 0.5 12.0 for x, y in reversed(stack):
30 2952825 1254553 0.4 8.9 ysum += y
31 2952825 1560177 0.5 11.1 table[x] = ysum
32 750000 305911 0.4 2.2 return ysum
Interestingly, even n = x takes around 8% of the total running time.
(Unfortunately, I couldn't get kernprof to produce something similar for the recursive version.)
Iterative code is sometimes faster than recursive because it avoids function call overhead. However, stack.append is also a function call (and an attribute lookup on top) and adds similar overhead. Counting the append calls, the iterative version makes just as many function calls as the recursive version.
Comparing the first two and the last two timings here...
$ python -m timeit pass
10000000 loops, best of 3: 0.0242 usec per loop
$ python -m timeit -s "def f(n): pass" "f(1)"
10000000 loops, best of 3: 0.188 usec per loop
$ python -m timeit -s "def f(n): x=[]" "f(1)"
1000000 loops, best of 3: 0.234 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append" "f(1)"
1000000 loops, best of 3: 0.336 usec per loop
$ python -m timeit -s "def f(n): x=[]; x.append(1)" "f(1)"
1000000 loops, best of 3: 0.499 usec per loop
...confirms that the append call, excluding the attribute lookup, takes approximately the same time as calling a minimal pure-Python function, ~170 ns.
From the above I conclude that the iterative version does not enjoy an inherent advantage. The next question to consider is which one does more work. To get a (very) rough estimate, we can look at the number of lines executed in each version. I did a quick experiment to find out that:
collatz_r is called 1234275 times, and the body of the if block executes 984275 times.
collatz_i is called 250000 times, and the while loop runs 984275 times.
Now, let's say collatz_r has 2 lines outside the if and 4 lines inside (executed in the worst case, when the else is hit). That adds up to 6.4 million lines to execute (1234275 * 2 + 984275 * 4 = 6405650). Comparable figures for collatz_i could be 5 and 9, which add up to 10.1 million (250000 * 5 + 984275 * 9 = 10108475).
Even though that is just a rough estimate, it is well in line with the actual timings.
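If the per-step tuple objects on the emulated stack are indeed a major cost, one way to probe that (my sketch, untimed here) is to push n and y onto two flat lists instead of allocating an (n, y) tuple per step:
def collatz_i2(n, table):
    # variant of collatz_i: two flat lists instead of a list of tuples
    ns, ys = [], []
    while not n in table:
        if n % 2 == 0:
            x, y = n // 2, 1
        elif n % 4 == 3:
            x, y = (3 * n + 1) // 2, 2
        else:
            x, y = (3 * n + 1) // 4, 3
        ns.append(n)
        ys.append(y)
        n = x
    ysum = table[n]
    for x, y in zip(reversed(ns), reversed(ys)):
        ysum += y
        table[x] = ysum
    return ysum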

Calculating Integrals: One-liner solution?

I was working on a program to find the integral of a function, where the user specifies the amount of rectangles, the start, and the stop.
NOTE: I am using left endpoints of the rectangles.
I have the function working perfectly (at least, it seems to be perfect). However, I wanted to see if I could write a one-liner for it, but not sure how because I'm using eval(). Here is my original code:
def integral(function, n=1000, start=0, stop=100):
    """Returns integral of function from start to stop with 'n' rectangles"""
    increment, rectangles, x = float((stop - start)) / n, [], start
    while x <= stop:
        num = eval(function)
        rectangles.append(num)
        if x >= stop: break
        x += increment
    return increment * sum(rectangles)
This works fine:
>>> integral('x**2')
333833.4999999991
The actual answer is 1000000/3, so my function gives a pretty nice estimate (for only 1000 rectangles).
My attempt at a one-liner:
def integral2(function, n=1000, start=0, stop=100): rectangles = [(float(x) / n) for x in range(start*n, (stop*n)+1)]; return (float((stop-start))/n) * sum([eval(function) for x in rectangles])
However, this isn't truly a one-liner since I'm using a semicolon. Also, it's a bit slower (takes a few seconds longer, which is pretty significant) and gives the wrong answer:
>>> integral2('x**2')
33333833.334999967
So, is it possible to use a one-liner solution for this function? I wasn't sure how to implement eval() and float(x)/n in the same list comprehension. float(x)/n achieves a virtual 'step' in the range function.
Thanks!
def integral2(function, n=1000, start=0, stop=100): return (float(1)/n) * sum([eval(function) for x in [(float(x) / n) for x in range(start*n, (stop*n)+1)]])
Note that there is a big difference between integral and integral2: integral2 makes (stop*n)+1-(start*n) rectangles (100001 with the defaults), while integral only makes n rectangles, which also accounts for the ~100x timing gap below.
In [64]: integral('x**2')
Out[64]: 333833.4999999991
In [68]: integral2('x**2')
Out[68]: 333338.33334999956
In [69]: %timeit integral2('x**2')
1 loops, best of 3: 704 ms per loop
In [70]: %timeit integral('x**2')
100 loops, best of 3: 7.32 ms per loop
Perhaps a more comparable translation of integral would be:
def integral3(function, n=1000, start=0, stop=100): return (float(stop-start)/n) * sum([eval(function) for x in [start+(i*float(stop-start)/n) for i in range(n)]])
In [77]: %timeit integral3('x**2')
100 loops, best of 3: 7.1 ms per loop
Of course, it goes without saying that there is no purpose in making this a one-liner other than (perverse?) amusement :)
You don't need to use eval if you receive function as a Python callable itself.
You could also make use of the numpy.arange function to generate a list of float values.
Case 1: function is a Python callable
import numpy as np

def integrate(f, n, start, end):
    return sum([f(i)*(abs(start-end)/float(n)) for i in np.arange(start, end, abs(start-end)/float(n))])
Case 2: function is not a Python callable
def integrate(f, n, start, end):
    return sum([eval(f)*(abs(start-end)/float(n)) for x in np.arange(start, end, abs(start-end)/float(n))])
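For example, the first (callable) version can be exercised like this (my usage note; a left-endpoint sum slightly undershoots the exact 1000000/3 ≈ 333333.3):
>>> integrate(lambda x: x**2, 1000, 0, 100)  # ~332833.5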
How about some cheating using numpy's linspace?
integrate = lambda F, n0, nf: sum([F(x)*(abs(nf-n0)/(abs(nf-n0)*300))for x in np.linspace(n0, nf, abs(nf-n0)*300)])
It takes a function F, a lower bound n0 and an upper bound nf. Here's how it works for x**2 between 0 and 5:
In [31]: integrate(lambda x: x**2, 0, 5)
Out[31]: 41.68056482099179
It's pretty close.
EDIT: here's my linspace one liner.
linspace = lambda lo, hi, step:[lo + (hi-lo)/step*i + (hi-lo)/(step*step-1)*i for i in range(step)]
There's a way to tickle that into a full one liner integrator. I leave that to you my friend.

NumPy: function for simultaneous max() and min()

numpy.amax() will find the max value in an array, and numpy.amin() does the same for the min value. If I want to find both max and min, I have to call both functions, which requires passing over the (very big) array twice, which seems slow.
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
Is there a function in the numpy API that finds both max and min with only a single pass through the data?
No. At the time of this writing, there is no such function. (And yes, if there were such a function, its performance would be significantly better than calling numpy.amin() and numpy.amax() successively on a large array.)
I don't think that passing over the array twice is a problem. Consider the following pseudo-code:
minval = array[0]
maxval = array[0]
for i in array:
    if i < minval:
        minval = i
    if i > maxval:
        maxval = i
While there is only 1 loop here, there are still 2 checks. (Instead of having 2 loops with 1 check each). Really the only thing you save is the overhead of 1 loop. If the arrays really are big as you say, that overhead is small compared to the actual loop's work load. (Note that this is all implemented in C, so the loops are more or less free anyway).
EDIT Sorry to the 4 of you who upvoted and had faith in me. You definitely can optimize this.
Here's some fortran code which can be compiled into a python module via f2py (maybe a Cython guru can come along and compare this with an optimized C version ...):
subroutine minmax1(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  integer i

  amin = a(1)
  amax = a(1)
  do i=2, n
     if(a(i) > amax)then
        amax = a(i)
     elseif(a(i) < amin) then
        amin = a(i)
     endif
  enddo
end subroutine minmax1

subroutine minmax2(a,n,amin,amax)
  implicit none
  !f2py intent(hidden) :: n
  !f2py intent(out) :: amin,amax
  !f2py intent(in) :: a
  integer n
  real a(n),amin,amax
  amin = minval(a)
  amax = maxval(a)
end subroutine minmax2
Compile it via:
f2py -m untitled -c fortran_code.f90
And now we're in a place where we can test it:
import timeit

size = 100000
repeat = 10000

print timeit.timeit(
    'np.min(a); np.max(a)',
    setup='import numpy as np; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), " # numpy min/max"

print timeit.timeit(
    'untitled.minmax1(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax1'

print timeit.timeit(
    'untitled.minmax2(a)',
    setup='import numpy as np; import untitled; a = np.arange(%d, dtype=np.float32)' % size,
    number=repeat), '# minmax2'
The results are a bit staggering for me:
8.61869883537 # numpy min/max
1.60417699814 # minmax1
2.30169081688 # minmax2
I have to say, I don't completely understand it. Comparing just np.min versus minmax1 and minmax2 is still a losing battle, so it's not just a memory issue ...
notes -- Increasing size by a factor of 10**a and decreasing repeat by a factor of 10**a (keeping the problem size constant) does change the performance, but not in a seemingly consistent way which shows that there is some interplay between memory performance and function call overhead in python. Even comparing a simple min implementation in fortran beats numpy's by a factor of approximately 2 ...
You could use Numba, which is a NumPy-aware dynamic Python compiler using LLVM. The resulting implementation is pretty simple and clear:
import numpy
import numba

@numba.jit
def minmax(x):
    maximum = x[0]
    minimum = x[0]
    for i in x[1:]:
        if i > maximum:
            maximum = i
        elif i < minimum:
            minimum = i
    return (minimum, maximum)
numpy.random.seed(1)
x = numpy.random.rand(1000000)
print(minmax(x) == (x.min(), x.max()))
It should also be faster than NumPy's min() and max() implementations. And all without having to write a single line of C/Fortran code.
Do your own performance tests, as it is always dependent on your architecture, your data, your package versions...
There is a function for finding (max-min) called numpy.ptp if that's useful for you:
>>> import numpy
>>> x = numpy.array([1,2,3,4,5,6])
>>> x.ptp()
5
but I don't think there's a way to find both min and max with one traversal.
EDIT: ptp just calls min and max under the hood
Just to get some ideas on the numbers one could expect, given the following approaches:
import numpy as np

def extrema_np(arr):
    return np.max(arr), np.min(arr)

import numba as nb

@nb.jit(nopython=True)
def extrema_loop_nb(arr):
    n = arr.size
    max_val = min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    return max_val, min_val

import numba as nb

@nb.jit(nopython=True)
def extrema_while_nb(arr):
    n = arr.size
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    return max_val, min_val
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np

cdef void _extrema_loop_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i
    cdef long item, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    for i in range(1, n):
        item = arr[i]
        if item > max_val:
            max_val = item
        elif item < min_val:
            min_val = item
    result[0] = max_val
    result[1] = min_val

def extrema_loop_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_loop_cy(arr, arr.size, result)
    return result[0], result[1]
%%cython -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True, infer_types=True

import numpy as np

cdef void _extrema_while_cy(
        long[:] arr,
        size_t n,
        long[:] result):
    cdef size_t i, odd
    cdef long x, y, max_val, min_val
    max_val = arr[0]
    min_val = arr[0]
    odd = n % 2
    if not odd:
        n -= 1
    max_val = min_val = arr[0]
    i = 1
    while i < n:
        x = arr[i]
        y = arr[i + 1]
        if x > y:
            x, y = y, x
        min_val = min(x, min_val)
        max_val = max(y, max_val)
        i += 2
    if not odd:
        x = arr[n]
        min_val = min(x, min_val)
        max_val = max(x, max_val)
    result[0] = max_val
    result[1] = min_val

def extrema_while_cy(arr):
    result = np.zeros(2, dtype=arr.dtype)
    _extrema_while_cy(arr, arr.size, result)
    return result[0], result[1]
(the extrema_loop_*() approaches are similar to what is proposed here, while extrema_while_*() approaches are based on the code from here)
The timings (benchmark plot omitted here) indicate that the extrema_while_*() approaches are the fastest, with extrema_while_nb() being the fastest overall. In any case, the extrema_loop_nb() and extrema_loop_cy() solutions also outperform the NumPy-only approach (using np.max() and np.min() separately).
Finally, note that none of these is as flexible as np.min()/np.max() (in terms of n-dim support, axis parameter, etc.).
(full code is available here)
Nobody mentioned numpy.percentile, so I thought I would. If you ask for [0, 100] percentiles, it will give you an array of two elements, the min (0th percentile) and the max (100th percentile).
However, it doesn't satisfy the OP's purpose: it's not faster than min and max separately. That's probably due to some machinery that would allow for non-extreme percentiles (a harder problem, which should take longer).
In [1]: import numpy
In [2]: a = numpy.random.normal(0, 1, 1000000)
In [3]: %%timeit
...: lo, hi = numpy.amin(a), numpy.amax(a)
...:
100 loops, best of 3: 4.08 ms per loop
In [4]: %%timeit
...: lo, hi = numpy.percentile(a, [0, 100])
...:
100 loops, best of 3: 17.2 ms per loop
In [5]: numpy.__version__
Out[5]: '1.14.4'
A future version of NumPy could add a special case to skip the normal percentile calculation when only [0, 100] is requested. So, without adding anything to the interface, there is a way to ask NumPy for min and max in one call (contrary to what was said in the accepted answer), but the standard implementation of the library doesn't take advantage of this case to make it worthwhile.
In general you can reduce the number of comparisons for a minmax algorithm by processing two elements at a time and only comparing the smaller one to the running minimum and the bigger one to the running maximum. On average, one needs only 3/4 of the comparisons of the naive approach.
This could be implemented in C or Fortran (or any other low-level language) and should be almost unbeatable in terms of performance. I'm using numba to illustrate the principle and get a very fast, dtype-independent implementation:
import numba as nb
import numpy as np

@nb.njit
def minmax(array):
    # Ravel the array and return early if it's empty
    array = array.ravel()
    length = array.size
    if not length:
        return

    # We want to process two elements at once so we need
    # an even sized array, but we preprocess the first and
    # start with the second element, so we want it "odd"
    odd = length % 2
    if not odd:
        length -= 1

    # Initialize min and max with the first item
    minimum = maximum = array[0]

    i = 1
    while i < length:
        # Get the next two items and swap them if necessary
        x = array[i]
        y = array[i+1]
        if x > y:
            x, y = y, x
        # Compare the min with the smaller one and the max
        # with the bigger one
        minimum = min(x, minimum)
        maximum = max(y, maximum)
        i += 2

    # If we had an even sized array we need to compare the
    # one remaining item too.
    if not odd:
        x = array[length]
        minimum = min(x, minimum)
        maximum = max(x, maximum)

    return minimum, maximum
It's definitely faster than the naive approach that Peque presented:
arr = np.random.random(3000000)
assert minmax(arr) == minmax_peque(arr) # warmup and making sure they are identical
%timeit minmax(arr) # 100 loops, best of 3: 2.1 ms per loop
%timeit minmax_peque(arr) # 100 loops, best of 3: 2.75 ms per loop
As expected the new minmax implementation only takes roughly 3/4 of the time the naive implementation took (2.1 / 2.75 = 0.7636363636363637)
This is an old thread, but anyway, if anyone ever looks at this again...
When looking for the min and max simultaneously, it is possible to reduce the number of comparisons. If it is floats you are comparing (which I guess it is) this might save you some time, although not computational complexity.
Instead of (Python code):
_max = ar[0]
_min = ar[0]
for ii in xrange(len(ar)):
    if ar[ii] > _max: _max = ar[ii]
    if ar[ii] < _min: _min = ar[ii]
you can first compare two adjacent values in the array, and then only compare the smaller one against current minimum, and the larger one against current maximum:
## for an even-sized array
_max = ar[0]
_min = ar[0]
for ii in xrange(0, len(ar), 2):  ## iterate over every other value in the array
    f1 = ar[ii]
    f2 = ar[ii+1]
    if (f1 < f2):
        if f1 < _min: _min = f1
        if f2 > _max: _max = f2
    else:
        if f2 < _min: _min = f2
        if f1 > _max: _max = f1
The code here is written in Python, clearly for speed you would use C or Fortran or Cython, but this way you do 3 comparisons per iteration, with len(ar)/2 iterations, giving 3/2 * len(ar) comparisons. As opposed to that, doing the comparison "the obvious way" you do two comparisons per iteration, leading to 2*len(ar) comparisons. Saves you 25% of comparison time.
Maybe someone one day will find this useful.
At first glance, numpy.histogram appears to do the trick:
count, (amin, amax) = numpy.histogram(a, bins=1)
... but if you look at the source for that function, it simply calls a.min() and a.max() independently, and therefore fails to avoid the performance concerns addressed in this question. :-(
Similarly, scipy.ndimage.measurements.extrema looks like a possibility, but it, too, simply calls a.min() and a.max() independently.
It was worth the effort for me anyway, so I'll propose the most difficult and least elegant solution here for whoever may be interested. My solution is to implement a multi-threaded min-max-in-one-pass algorithm in C++ and use it to create a Python extension module. The effort requires a bit of overhead for learning the Python and NumPy C/C++ APIs, and here I will show the code and give some small explanations and references for whoever wishes to go down this path.
Multi-threaded Min/Max
There is nothing too interesting here. The array is broken into chunks of size length / workers. The min/max is calculated for each chunk in a future, which are then scanned for the global min/max.
// mt_np.cc
//
// multi-threaded min/max algorithm

#include <algorithm>
#include <future>
#include <vector>

namespace mt_np {

/*
 * Get {min,max} in interval [begin,end)
 */
template <typename T> std::pair<T, T> min_max(T *begin, T *end) {
  T min{*begin};
  T max{*begin};
  while (++begin < end) {
    if (*begin < min) {
      min = *begin;
      continue;
    } else if (*begin > max) {
      max = *begin;
    }
  }
  return {min, max};
}

/*
 * get {min,max} in interval [begin,end) using #workers for concurrency
 */
template <typename T>
std::pair<T, T> min_max_mt(T *begin, T *end, int workers) {
  const long int chunk_size = std::max((end - begin) / workers, 1l);
  std::vector<std::future<std::pair<T, T>>> min_maxes;
  // fire up the workers
  while (begin < end) {
    T *next = std::min(end, begin + chunk_size);
    min_maxes.push_back(std::async(min_max<T>, begin, next));
    begin = next;
  }
  // retrieve the results
  auto min_max_it = min_maxes.begin();
  auto v{min_max_it->get()};
  T min{v.first};
  T max{v.second};
  while (++min_max_it != min_maxes.end()) {
    v = min_max_it->get();
    min = std::min(min, v.first);
    max = std::max(max, v.second);
  }
  return {min, max};
}
}; // namespace mt_np
The Python Extension Module
Here is where things start getting ugly... One way to use C++ code in Python is to implement an extension module. This module can be built and installed using the distutils.core standard module. A complete description of what this entails is covered in the Python documentation: https://docs.python.org/3/extending/extending.html. NOTE: there are certainly other ways to get similar results, to quote https://docs.python.org/3/extending/index.html#extending-index:
This guide only covers the basic tools for creating extensions provided as part of this version of CPython. Third party tools like Cython, cffi, SWIG and Numba offer both simpler and more sophisticated approaches to creating C and C++ extensions for Python.
Essentially, this route is probably more academic than practical. With that being said, what I did next was, sticking pretty close to the tutorial, create a module file. This is essentially boilerplate for distutils to know what to do with your code and create a Python module out of it. Before doing any of this it is probably wise to create a Python virtual environment so you don't pollute your system packages (see https://docs.python.org/3/library/venv.html#module-venv).
Here is the module file:
// mt_np_forpy.cc
//
// C++ module implementation for multi-threaded min/max for np

#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION

#include <python3.6/Python.h>            // added: required for the CPython API
#include <python3.6/numpy/arrayobject.h>

#include "mt_np.h"

#include <cstdint>
#include <iostream>
#include <thread>                        // added: for thread::hardware_concurrency

using namespace std;

/*
 * check:
 *  shape
 *  stride
 *  data_type
 *  byteorder
 *  alignment
 */
static bool check_array(PyArrayObject *arr) {
  if (PyArray_NDIM(arr) != 1) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong shape, require (1,n)");
    return false;
  }
  if (PyArray_STRIDES(arr)[0] != 8) {
    PyErr_SetString(PyExc_RuntimeError, "Expected stride of 8");
    return false;
  }
  PyArray_Descr *descr = PyArray_DESCR(arr);
  if (descr->type != NPY_LONGLTR && descr->type != NPY_DOUBLELTR) {
    PyErr_SetString(PyExc_RuntimeError, "Wrong type, require l or d");
    return false;
  }
  if (descr->byteorder != '=') {
    PyErr_SetString(PyExc_RuntimeError, "Expected native byteorder");
    return false;
  }
  if (descr->alignment != 8) {
    cerr << "alignment: " << descr->alignment << endl;
    PyErr_SetString(PyExc_RuntimeError, "Require proper alignment");
    return false;
  }
  return true;
}

template <typename T>
static PyObject *mt_np_minmax_dispatch(PyArrayObject *arr) {
  npy_intp size = PyArray_SHAPE(arr)[0];
  T *begin = (T *)PyArray_DATA(arr);
  auto minmax =
      mt_np::min_max_mt(begin, begin + size, thread::hardware_concurrency());
  // NB: "(L,L)" builds two long longs; the double case would need "(d,d)"
  return Py_BuildValue("(L,L)", minmax.first, minmax.second);
}

static PyObject *mt_np_minmax(PyObject *self, PyObject *args) {
  PyArrayObject *arr;
  if (!PyArg_ParseTuple(args, "O", &arr))
    return NULL;
  if (!check_array(arr))
    return NULL;
  switch (PyArray_DESCR(arr)->type) {
  case NPY_LONGLTR: {
    return mt_np_minmax_dispatch<int64_t>(arr);
  } break;
  case NPY_DOUBLELTR: {
    return mt_np_minmax_dispatch<double>(arr);
  } break;
  default: {
    PyErr_SetString(PyExc_RuntimeError, "Unknown error");
    return NULL;
  }
  }
}

static PyObject *get_concurrency(PyObject *self, PyObject *args) {
  return Py_BuildValue("I", thread::hardware_concurrency());
}

static PyMethodDef mt_np_Methods[] = {
    {"mt_np_minmax", mt_np_minmax, METH_VARARGS, "multi-threaded np min/max"},
    {"get_concurrency", get_concurrency, METH_VARARGS,
     "retrieve thread::hardware_concurrency()"},
    {NULL, NULL, 0, NULL} /* sentinel */
};

static struct PyModuleDef mt_np_module = {PyModuleDef_HEAD_INIT, "mt_np", NULL,
                                          -1, mt_np_Methods};

PyMODINIT_FUNC PyInit_mt_np() { return PyModule_Create(&mt_np_module); }
In this file there is a significant use of the Python as well as the NumPy API, for more information consult: https://docs.python.org/3/c-api/arg.html#c.PyArg_ParseTuple, and for NumPy: https://docs.scipy.org/doc/numpy/reference/c-api.array.html.
Installing the Module
The next thing to do is to utilize distutils to install the module. This requires a setup file:
# setup.py
from distutils.core import setup, Extension

module = Extension('mt_np', sources=['mt_np_forpy.cc'])

setup(name='mt_np',
      version='1.0',
      description='multi-threaded min/max for np arrays',
      ext_modules=[module])
To finally install the module, execute python3 setup.py install from your virtual environment.
Testing the Module
Finally, we can test to see if the C++ implementation actually outperforms naive use of NumPy. To do so, here is a simple test script:
# timing.py
# compare numpy min/max vs multi-threaded min/max
import numpy as np
import mt_np
import timeit

def normal_min_max(X):
    return (np.min(X), np.max(X))

print(mt_np.get_concurrency())

for ssize in np.logspace(3, 8, 6):
    size = int(ssize)
    print('********************')
    print('sample size:', size)
    print('********************')
    samples = np.random.normal(0, 50, (2, size))
    for sample in samples:
        print('np:', timeit.timeit('normal_min_max(sample)',
                                   globals=globals(), number=10))
        print('mt:', timeit.timeit('mt_np.mt_np_minmax(sample)',
                                   globals=globals(), number=10))
Here are the results I got from doing all this:
8
********************
sample size: 1000
********************
np: 0.00012079699808964506
mt: 0.002468645994667895
np: 0.00011947099847020581
mt: 0.0020772050047526136
********************
sample size: 10000
********************
np: 0.00024697799381101504
mt: 0.002037393998762127
np: 0.0002713389985729009
mt: 0.0020942929986631498
********************
sample size: 100000
********************
np: 0.0007130410012905486
mt: 0.0019842900001094677
np: 0.0007540129954577424
mt: 0.0029724110063398257
********************
sample size: 1000000
********************
np: 0.0094779249993735
mt: 0.007134920000680722
np: 0.009129883001151029
mt: 0.012836456997320056
********************
sample size: 10000000
********************
np: 0.09471094200125663
mt: 0.0453535050037317
np: 0.09436299200024223
mt: 0.04188535599678289
********************
sample size: 100000000
********************
np: 0.9537652180006262
mt: 0.3957935369980987
np: 0.9624398809974082
mt: 0.4019058070043684
These results are far less encouraging than those reported earlier in the thread, which indicated somewhere around a 3.5x speedup without incorporating multi-threading. Still, they are somewhat reasonable: I would expect the overhead of threading to dominate until the arrays get very large, at which point the performance increase would start to approach a factor of std::thread::hardware_concurrency().
Conclusion
There is certainly room for application specific optimizations to some NumPy code, it would seem, in particular with regards to multi-threading. Whether or not it is worth the effort is not clear to me, but it certainly seems like a good exercise (or something). I think that perhaps learning some of those "third party tools" like Cython may be a better use of time, but who knows.
Inspired by the previous answer, I've written a numba implementation returning minmax for axis=0 of a 2-D array. It's ~5x faster than calling numpy min/max.
Maybe someone will find it useful.
import numpy as np
from numba import jit

@jit
def minmax(x):
    """Return minimum and maximum from 2D array for axis=0."""
    m, n = len(x), len(x[0])
    mi, ma = np.empty(n), np.empty(n)
    mi[:] = ma[:] = x[0]
    for i in range(1, m):
        for j in range(n):
            if x[i, j] > ma[j]: ma[j] = x[i, j]
            elif x[i, j] < mi[j]: mi[j] = x[i, j]
    return mi, ma
x = np.random.normal(size=(256, 11))
mi, ma = minmax(x)
np.all(mi == x.min(axis=0)), np.all(ma == x.max(axis=0))
# (True, True)
%timeit x.min(axis=0), x.max(axis=0)
# 15.9 µs ± 9.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit minmax(x)
# 2.62 µs ± 31.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The shortest way I've come up with is this:
mn, mx = np.sort(ar)[[0, -1]]
But since it sorts the array, it's not the most efficient.
Another short way would be:
mn, mx = np.percentile(ar, [0, 100])
This should be more efficient, but the result is computed by the percentile machinery, and floats are returned.
Maybe use numpy.unique? Like so:
min_, max_ = numpy.unique(arr)[[0, -1]]
Just added it here for variety :) It's just as slow as sort.
