I have the following function in pure python:
import numpy as np
def subtractPython(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]
    shape = (xAxisCount, yAxisCount, xAxisCount)
    results = np.zeros(shape)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results
I tried to cythonize it this way:
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
def subtractPython(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef int xAxisCount = a.shape[0]
    cdef int yAxisCount = a.shape[1]
    cdef np.ndarray[DTYPE_t, ndim=3] results = np.zeros([xAxisCount, yAxisCount, xAxisCount], dtype=DTYPE)
    cdef int lenB = len(b)
    cdef np.ndarray[DTYPE_t, ndim=2] subtracted
    for index in range(lenB):
        subtracted = (a - b[index])
        results[:, :, index] = subtracted
    return results
However, I'm not seeing any speedup. Is there something I'm missing, or can this process not be sped up?
EDIT -> I've realized that I'm not actually cythonizing the subtraction algorithm in the above code. I've managed to cythonize it, but it has the exact same runtime as a - b[:, None], so I guess this is the maximum speed of this operation.
This is basically a - b[:, None] -> has same runtime
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int
ctypedef np.int_t DTYPE_t
cimport cython
@cython.boundscheck(False)  # turn off bounds-checking for entire function
@cython.wraparound(False)   # turn off negative index wrapping for entire function
def subtract(np.ndarray[DTYPE_t, ndim=2] a, np.ndarray[DTYPE_t, ndim=2] b):
    cdef np.ndarray[DTYPE_t, ndim=3] result = np.zeros([b.shape[0], a.shape[0], a.shape[1]], dtype=DTYPE)
    cdef int lenB = b.shape[0]
    cdef int lenA = a.shape[0]
    cdef int lenColB = b.shape[1]
    cdef int rowA, rowB, column
    for rowB in range(lenB):
        for rowA in range(lenA):
            for column in range(lenColB):
                result[rowB, rowA, column] = a[rowA, column] - b[rowB, column]
    return result
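For reference, a small sanity check (not from the original post) that stacking a - b[index] along a new leading axis, as the loop versions do, matches the broadcasting one-liner mentioned in the edit:

import numpy as np

# Small shapes purely for illustration; the benchmark below uses 300x300 arrays.
a = np.random.randint(1, 1000, (5, 4))
b = np.random.randint(1, 1000, (3, 4))

loop = np.empty((b.shape[0], a.shape[0], a.shape[1]), dtype=a.dtype)
for index in range(len(b)):
    loop[index] = a - b[index]                    # stack a - b[index] along the first axis

assert np.array_equal(loop, a - b[:, None])       # identical to the broadcast expression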
When trying to optimize a function, one should always know the bottleneck of that function - without that, you will spend a lot of time running in the wrong direction.
Let's use your Python function as the baseline (actually I use results = np.zeros(shape, dtype=a.dtype), otherwise your method returns floats, which is probably a bug):
>>> import numpy as np
>>> a=np.random.randint(1,1000,(300,300), dtype=np.int)
>>> b=np.random.randint(1,1000,(300,300), dtype=np.int)
>>> %timeit subtractPython(a,b)
274 ms ± 3.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first question we should ask ourselves is: is this task memory-bound or CPU-bound? Obviously, it is memory-bound - a subtraction costs almost nothing compared to the required memory reads and writes.
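A rough back-of-the-envelope sketch (assuming the 300x300 int64 inputs used in the benchmark above) makes this concrete:

n, itemsize = 300, 8                      # 300x300 int64 inputs
result_bytes = n * n * n * itemsize       # the 300x300x300 output alone
subtractions = n * n * n                  # one subtraction per output element
print(result_bytes / 1e6, "MB written for", subtractions / 1e6, "M subtractions")
# -> 216.0 MB written for 27.0 M subtractions, i.e. one 8-byte write (plus reads) per op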
This means, above all, that we have to optimize the memory layout in order to reduce cache misses. As a rule of thumb, memory accesses should hit consecutive addresses, one after another.
Is this the case? No: the array results is in C order, i.e. row-major order, and thus the access
results[:, :, index] = subtracted
isn't consecutive. On the other hand,
results[index, :, :] = subtracted
would be a consecutive access. Let's change the way information is stored in result:
def subtract1(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]
    shape = (xAxisCount, xAxisCount, yAxisCount)  #<=== change order
    results = np.zeros(shape, dtype=a.dtype)
    for index in range(len(b)):
        subtracted = (a - b[index])
        results[index, :, :] = subtracted         #<===== consecutive access
    return results
The timings are now:
>>> %timeit subtract1(a,b)
35.8 ms ± 285 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
There are also two more small improvements: we don't have to initialize results with zeros, and we can save some Python overhead, but this gains us only about 5%:
def subtract2(a, b):
    xAxisCount = a.shape[0]
    yAxisCount = a.shape[1]
    shape = (xAxisCount, xAxisCount, yAxisCount)
    results = np.empty(shape, dtype=a.dtype)    #<=== no need for zeros
    for index in range(len(b)):
        results[index, :, :] = (a - b[index])   #<===== less Python overhead
    return results
>>> %timeit subtract2(a,b)
34.5 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
This is now about a factor of 8 faster than the original version.
You could use Cython to try to speed this up even further, but the task is probably still memory-bound, so don't expect it to get significantly faster - after all, Cython cannot make the memory itself faster. However, without proper profiling it is hard to tell how much improvement is possible - I would not be surprised if someone came up with a faster version.
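For completeness, here is a minimal sketch of that one-expression broadcasting variant (essentially the a - b[:, None] the question already mentions, wrapped as a hypothetical subtract3); it produces the same (len(b), len(a), a.shape[1]) layout as subtract2 and stays memory-bound, so no big win should be expected:

def subtract3(a, b):
    # result[index] == a - b[index], written as a single broadcast subtraction
    return a - b[:, None]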
Related
What I am doing now to compute sin(x)/x safely near zero is:
import numpy as np
eps = np.finfo(float).eps
def sindiv(x):
    x = np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)
But there are quite a lot of additional array operations. Is there a better way?
You could use numpy.sinc, which computes sin(pi x)/(pi x):
In [20]: x = 2.4
In [21]: np.sin(x)/x
Out[21]: 0.28144299189631289
In [22]: x_over_pi = x / np.pi
In [23]: np.sinc(x_over_pi)
Out[23]: 0.28144299189631289
In [24]: np.sinc(0)
Out[24]: 1.0
In NumPy array notation (so you get back an np.ndarray):
def sindiv(x):
    return np.where(np.abs(x) < 0.01, 1.0 - x*x/6.0, np.sin(x)/x)
Here I've made "epsilon" fairly large for testing and used the first two terms of the Taylor series for the approximation. In practice, I'd change 0.01 to some small multiple of your eps (machine epsilon).
xx = np.arange(-0.1, 0.1, 0.001)
yy = sindiv(xx)
type(yy)
outputs numpy.ndarray and the values are continuous (and differentiable, if that's important) near the origin.
If you don't want the double evaluation (i.e. both branches are evaluated in the above), then I think you have to go with a loop as I don't believe there is any sort of "lazy where" option.
def sindiv(x):
    sox = np.zeros(x.size)
    for i in xrange(x.size):
        xv = x[i]
        if np.abs(xv) < 0.001:  # For testing; use a small multiple of machine epsilon
            sox[i] = 1.0 - xv * xv / 6.0
        else:
            sox[i] = np.sin(xv) / xv
    return sox
To make this really Pythonic, though, it would be best to check the type of x and just use the non-array version if it is not an array.
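A minimal sketch of that dispatch (hypothetical; it reuses the 0.01 test threshold from above):

import numpy as np

def sindiv(x):
    if not isinstance(x, np.ndarray):      # scalar input: plain non-array branch
        return 1.0 - x * x / 6.0 if abs(x) < 0.01 else np.sin(x) / x
    return np.where(np.abs(x) < 0.01, 1.0 - x * x / 6.0, np.sin(x) / x)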
As others have said, numpy.sinc() is the easiest.
I want to include a copy of its current implementation in NumPy 1.21.2 (link) to show there are no special tricks:
y = pi * where(x == 0, 1.0e-20, x)
return sin(y)/y
It's basically just sin(x)/x. Note that in creating y: multiplication by pi, where(), and x == 0 will create at least 2 intermediate arrays plus the final array for y. And then sin(y)/y creates two more arrays. In total at least 5 arrays are created by numpy.sinc(); and by my count your sindiv() also creates at least 5 arrays, so it's not actually that wasteful.
Here is another implementation:
TINY = np.finfo(float).tiny  # ≈ 2e-308 (smallest 'normal' float)

def mysinc(x):
    y = np.abs(np.pi*x) + TINY
    return np.sin(y)/y
I'm pretty sure this returns values identical to numpy.sinc(). The reason is that sin(x) == x holds up to fairly 'large' values of x:
x = np.ldexp(1, -26, dtype=np.double) # x = 2**-26 ≈ 1.5e-8
print(np.sin(x) == x) # True
x = np.ldexp(1, -32, dtype=np.longdouble) # x = 2**-32 ≈ 2.3e-10
print(np.sin(x) == x) # True
So for small enough x (ignoring pi factors), mysinc(x) = (x+TINY)/(x+TINY) = x/x = np.sinc(x). The exact threshold at which this happens does not matter too much, so long as TINY < np.spacing(x) when it occurs, so that x + TINY == x in this regime.
(The cutoff is around the square-root of the machine epsilon as can be understood from the Taylor series sin(x) = x - x**3/6 + ... = x(1-x**2/6) + .... So TINY is always small enough to not matter.)
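A quick spot check of that claim (assuming the mysinc above; exact bit-for-bit equality can depend on the platform's libm):

import numpy as np

x = np.concatenate(([0.0], np.logspace(-12, 3, 1000)))
x = np.concatenate((x, -x))                     # include 0 and both signs
print(np.array_equal(mysinc(x), np.sinc(x)))    # True here; see the caveat above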
Timings
import numpy as np
eps = np.finfo(float).eps
tiny = np.finfo(float).tiny
def npsinc(x):
    y = np.pi * np.where(x == 0, 1.0e-20, x)
    return np.sin(y)/y

def sindiv(x):
    x = np.pi * np.abs(x)
    return np.maximum(eps, np.sin(x)) / np.maximum(eps, x)

def mysinc(x):
    y = np.abs(np.pi*x) + tiny
    return np.sin(y)/y

def mysinc2(x):
    y = np.abs(np.pi*x)
    y += tiny  # in-place addition
    return np.sin(y)/y
# Test data
x = np.random.rand(100)
x[np.random.randint(100, size=10)] = 0
%timeit npsinc(x)
# 10.9 µs ± 18.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit sindiv(x)
# 9.4 µs ± 12.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc(x)
# 7.38 µs ± 15.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit mysinc2(x)
# 8.64 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Curiously using mysinc2() with in-place addition seems to be slower, and using in-place numpy.abs() and in-place numpy.sin() is even slower. Not entirely sure why, but see this related question.
Regardless, if you really need performance, you can try using Cython to generate C code and do things properly instead of playing tricks with NumPy:
%%cython
from libc.math cimport M_PI, sin
cimport cython
cimport numpy as np
import numpy as np
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cdef _cysinc(double[:] x, double[:] out):
    cdef size_t i
    for i in range(x.shape[0]):
        if x[i] == 0:
            out[i] = 1
        else:
            out[i] = sin(M_PI*x[i])/(M_PI*x[i])

def cysinc(np.ndarray x):
    out = np.empty_like(x)
    _cysinc(x.ravel(), out.ravel())
    return out
%timeit cysinc(x)
# 4.38 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
As always, don't prematurely optimize, just use numpy.sinc() to begin with.
Side note
There's a question Is boost::math::sinc_pi unnecessarily complicated? that asks about the benefits of using a Taylor expansion about x=0. In summary, almost none, but maybe they are doing it for other reasons.
To emphasise, there is nothing unstable about floating point division, or dividing a small number by a small number since you're just dividing the significands and subtracting the exponents.
If you calculate sinc(x) as sin(x)/x, instead of a direct Taylor series or other method that sums to convergence beyond the machine epsilon np.spacing(sinc(x)), you will be off by at most np.spacing(sinc(x)) coming from the round-off error in division /, just as you'd get with multiplication *. (Assuming no subnormal business, which even here does not matter in the treatment of sin(x)/x.)
What about allowing division by zero and replacing the NaNs later?
import numpy as np
def sindiv(x):
    a = np.sin(x)/x
    a = np.nan_to_num(a)
    return a
If you don't want warnings, suppress them via seterr.
Of course, the variable a could be eliminated:
def sindiv(x):
    return np.nan_to_num(np.sin(x)/x)
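For example, a hedged variant using np.errstate, the context-manager form of seterr, so the 0/0 warning is silenced only inside the function:

import numpy as np

def sindiv(x):
    # Ignore the divide/invalid warnings raised by the 0/0 entries in this block only.
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.nan_to_num(np.sin(x) / x)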
I'm trying to find the fastest way to get the functionality of numpy's 'where' statement on a 2D numpy array; namely, retrieving the indices where a condition is met. It is simply much slower than other languages I have used (e.g., IDL, Matlab).
I have cythonized a function that marches through the array in nested for-loops. There is almost an order of magnitude increase in speed, but I would like to increase performance even more, if possible.
TEST.py:
from cython_where import *
import time
import numpy as np
data = np.zeros((2600,5200))
data[100:200,100:200] = 10
t0 = time.time()
inds,ct = cython_where(data,'EQ',10)
print time.time() - t0
t1 = time.time()
tmp = np.where(data == 10)
print time.time() - t1
My cython_where.pyx program:
from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
DTYPE1 = np.float
ctypedef np.float_t DTYPE1_t
DTYPE2 = np.int
ctypedef np.int_t DTYPE2_t
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def cython_where(np.ndarray[DTYPE1_t, ndim=2] data, oper, DTYPE1_t val):
    assert data.dtype == DTYPE1
    cdef int xmax = data.shape[0]
    cdef int ymax = data.shape[1]
    cdef unsigned int x, y
    cdef int count = 0
    cdef np.ndarray[DTYPE2_t, ndim=1] xind = np.zeros(100000, dtype=int)
    cdef np.ndarray[DTYPE2_t, ndim=1] yind = np.zeros(100000, dtype=int)
    if(oper == 'EQ' or oper == 'eq'):  # I didn't want to include GT, GE, LT, LE here
        for x in xrange(xmax):
            for y in xrange(ymax):
                if(data[x,y] == val):
                    xind[count] = x
                    yind[count] = y
                    count += 1
    return tuple([xind[0:count], yind[0:count]]), count
Output of TEST.py:
cython_test]$ python TEST.py
0.0139019489288
0.0982608795166
I've also tried numpy's argwhere, which is about as fast as where. I'm pretty new to numpy and cython, so if you have any other ideas to really increase performance, I'm all ears!
Contributions:
NumPy can be sped up by working on the flattened array, for a 4x gain:
%timeit np.where(data==10)
1 loops, best of 3: 105 ms per loop
%timeit np.unravel_index(np.where(data.ravel()==10),data.shape)
10 loops, best of 3: 26.0 ms per loop
I think you can optimize your cython code with that, avoiding computing k=i*ncol+j for each cell.
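Wrapped up, the flattened-where trick might look like this (a small helper, not from the original answer):

import numpy as np

def where_flat(data, val):
    flat_idx = np.where(data.ravel() == val)[0]    # 1-D indices on the flat view
    return np.unravel_index(flat_idx, data.shape)  # back to (row, col) index arrays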
Numba gives a simple alternative:
from numba import jit
dtype=data.dtype
@jit(nopython=True)
def numbaeq(flatdata, x, nrow, ncol):
    size = ncol*nrow
    ix = np.empty(size, dtype=dtype)
    jx = np.empty(size, dtype=dtype)
    count = 0
    k = 0
    while k < size:
        if flatdata[k] == x:
            ix[count] = k//ncol
            jx[count] = k%ncol
            count += 1
        k += 1
    return ix[:count], jx[:count]

def whereequal(data, x): return numbaeq(data.ravel(), x, *data.shape)
which gives:
%timeit whereequal(data,10)
10 loops, best of 3: 20.2 ms per loop
This is not a great optimization for Numba on such a problem; it stays below the Cython performance.
k//ncol and k%ncol can be computed at the same time with an optimized divmod operation.
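As a sketch, the inner loop of numbaeq could compute both with a single divmod call (numbaeq_divmod is a hypothetical variant; whether Numba lowers divmod to one hardware division is an assumption worth checking for your version):

import numpy as np
from numba import jit

@jit(nopython=True)
def numbaeq_divmod(flatdata, x, nrow, ncol):
    size = ncol * nrow
    ix = np.empty(size, dtype=np.int64)   # row indices
    jx = np.empty(size, dtype=np.int64)   # column indices
    count = 0
    for k in range(size):
        if flatdata[k] == x:
            q, r = divmod(k, ncol)        # quotient and remainder in one call
            ix[count] = q
            jx[count] = r
            count += 1
    return ix[:count], jx[:count]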
The ultimate steps are assembly language and parallelization, but those are other sports.
I have a function that computes the conditional (on kth alpha) log likelihood of a dirichlet distribution. I have it written in Cython and compiled, but my code calls it about 12M times and it seems to be the bottleneck, so I'm hoping to speed it up.
cimport numpy as np
import numpy as np
import math
DTYPE = np.float64
ctypedef np.float64_t DTYPE_t
def logFullConAlphaK(np.ndarray p, np.ndarray alpha, np.int k):
    assert p.dtype == np.float64 and alpha.dtype == np.float64
    cdef double t1 = sum(np.log(p))
    cdef DTYPE_t y = ((alpha[k-1]-1)*t1) - np.log(alpha[k-1]) + (p.shape[0]*
                     (math.lgamma(sum(alpha)) - math.lgamma(alpha[k-1])))
    return y
I compile the Cython into a .pyd file that I use in my code. Any thoughts on how I can speed this up?
Thanks
1) By declaring the data types and dimensions of your input arrays and for p.shape[0]:
def logFullConAlphaK(np.ndarray[DTYPE_t, ndim=1] p,
                     np.ndarray[DTYPE_t, ndim=1] alpha, int k):
    ...
    cdef int tmp
    tmp = p.shape[0]
2) By using C functions instead of Python functions from the module math:
cdef extern from "math.h":
    double log(double x) nogil
3) Using NumPy's np.ndarray.sum() method
4) Using Cython directives to avoid some overhead
Altogether:
#cython: wraparound=False
#cython: boundscheck=False
#cython: cdivision=True
#cython: nonecheck=False
import math
cimport numpy as np
import numpy as np

cdef extern from "math.h":
    double log(double x) nogil

DTYPE = np.float64
ctypedef np.float64_t DTYPE_t

def logFullConAlphaK(np.ndarray[DTYPE_t, ndim=1] p,
                     np.ndarray[DTYPE_t, ndim=1] alpha, int k):
    assert p.dtype == np.float64 and alpha.dtype == np.float64
    cdef double t1
    cdef int tmp
    t1 = np.log(p).sum()
    tmp = p.shape[0]
    cdef DTYPE_t y = ((alpha[k-1]-1)*t1) - log(alpha[k-1]) + (tmp*
                     (math.lgamma(alpha.sum()) - math.lgamma(alpha[k-1])))
    return y
Some performance comparisons among the OP's original solution, @cel's solution and mine:
In [2]: timeit solOP(a, b, 10)
1000 loops, best of 3: 273 µs per loop
In [3]: timeit solcel(a, b, 10)
10000 loops, best of 3: 30.5 µs per loop
In [4]: timeit solS(a, b, 10)
100000 loops, best of 3: 15.8 µs per loop
Take this (probably completely unrealistic) sample data:
n = 1000000
p = np.random.rand(n)
alpha = np.random.rand(n)
k = 12
I get the following timings:
%timeit logFullConAlphaK(p, alpha, k) -> 1 loops, best of 3: 174 ms per loop
%timeit logFullConAlphaK_opt(p, alpha, k) -> 100 loops, best of 3: 13.3 ms per loop
This version already gives you an order of magnitude in speed. Note that almost all of the speedup comes from using np.sum instead of the built-in sum. All other changes are just for cleaner code; they have no impact on the speed.
cimport numpy as np
import numpy as np
import math

def logFullConAlphaK_opt(double[:] p, double[:] alpha, int k):
    cdef double t1 = np.sum(np.log(p))
    cdef double y = ((alpha[k-1]-1)*t1) - np.log(alpha[k-1]) + (p.shape[0]*
                    (math.lgamma(np.sum(alpha)) - math.lgamma(alpha[k-1])))
    return y
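A minimal timing sketch of that point (numbers will vary by machine): the built-in sum walks the array element by element at the Python level, while np.sum is a single vectorized reduction.

import numpy as np
import timeit

p = np.random.rand(1_000_000)
print(timeit.timeit(lambda: sum(p), number=10))     # Python-level loop: slow
print(timeit.timeit(lambda: np.sum(p), number=10))  # C-level reduction: fast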
I was wondering if I'm missing something when using Cython with Numpy because I haven't seen much of an improvement. I wrote this code as an example.
Naive version:
import numpy as np
from skimage.util import view_as_windows
it = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (it, it), it)
container = np.zeros((windows.shape[0], windows.shape[1]))
def test(windows):
    for i in range(windows.shape[0]):
        for j in range(windows.shape[1]):
            container[i,j] = np.mean(windows[i,j])
    return container
%%timeit
test(windows)
1 loops, best of 3: 131 ms per loop
Cythonized version:
%%cython --annotate
import numpy as np
cimport numpy as np
from skimage.util import view_as_windows
import cython
cdef int step = 16
arr = np.arange(1000*1000, dtype=np.float64).reshape(1000,1000)
windows = view_as_windows(arr, (step, step), step)
@cython.boundscheck(False)
def cython_test(np.ndarray[np.float64_t, ndim=4] windows):
    cdef np.ndarray[np.float64_t, ndim=2] container = np.zeros((windows.shape[0], windows.shape[1]), dtype=np.float64)
    cdef int i, j
    I = windows.shape[0]
    J = windows.shape[1]
    for i in range(I):
        for j in range(J):
            container[i,j] = np.mean(windows[i,j])
    return container
%timeit cython_test(windows)
10 loops, best of 3: 126 ms per loop
As you can see, there is a very modest improvement, so maybe I'm doing something wrong. By the way, in the annotation that Cython produces, the numpy lines still have a yellow background even after including the efficient indexing syntax np.ndarray[DTYPE_t, ndim=2]. Why?
By the way, in my view the ideal outcome is being able to use most numpy functions but still get some reasonable improvement after taking advantage of efficient indexing syntax or maybe memory views as in HYRY's answer.
UPDATE
It seems I'm not doing anything wrong in the code I posted above and that the yellow background on some lines is normal, so I was left wondering the following: in which situations can I get a benefit from typing cdef np.ndarray[np.float64_t, ndim=2] in front of numpy arrays? I suppose there are specific instances where this is helpful, otherwise there wouldn't be much purpose in doing it.
You need to implement the mean() function yourself to speed up the code; this is because the overhead of calling a NumPy function is very high.
@cython.boundscheck(False)
@cython.wraparound(False)
def cython_test(double[:, :, :, :] windows):
    cdef double[:, ::1] container
    cdef int i, j, k, l
    cdef int n0, n1, n2, n3
    cdef double inv_n
    cdef double s
    n0, n1, n2, n3 = windows.base.shape
    container = np.zeros((n0, n1))
    inv_n = 1.0 / (n2 * n3)
    for i in range(n0):
        for j in range(n1):
            s = 0
            for k in range(n2):
                for l in range(n3):
                    s += windows[i, j, k, l]
            container[i,j] = s * inv_n
    return container.base
Here are the %timeit results:
python_test(windows): 63.7 ms
cython_test(windows): 1.24 ms
np.mean(windows, axis=(2, 3)): 2.66 ms
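For reference, that last row is just the plain NumPy reduction over the two window axes, which needs no Cython at all:

container = windows.mean(axis=(2, 3))   # mean over both 16x16 window axes in one call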
I am trying out Numba to speed up a function that computes a minimum conditional probability of joint occurrence.
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,2))
def cooccurance_probability(X):
    P = X.shape[1]
    CS = np.sum(X, axis=0)                # Column Sums
    D = np.empty((P, P), dtype=np.float)  # Return Matrix
    for i in range(P):
        for j in range(P):
            D[i, j] = (X[:,i] * X[:,j]).sum() / max(CS[i], CS[j])
    return D
cooccurance_probability_numba = autojit(cooccurance_probability)
However, I am finding the performance of cooccurance_probability and cooccurance_probability_numba to be pretty much the same.
%timeit cooccurance_probability(X)
1 loops, best of 3: 302 ms per loop
%timeit cooccurance_probability_numba(X)
1 loops, best of 3: 307 ms per loop
Why is this? Could it be due to the numpy element by element operation?
I am following as an example:
http://nbviewer.ipython.org/github/ellisonbg/talk-sicm2-2013/blob/master/NumbaCython.ipynb
[Note: I could halve the execution time due to the symmetric nature of the problem - but that isn't my main concern]
My guess would be that you're hitting the object layer instead of generating native code due to the calls to sum, which means that Numba isn't going to speed things up significantly. It just doesn't know how to optimize/translate sum (at this point). Additionally it's usually better to unroll vectorized operations into explicit loops with Numba. Notice that the ipynb that you link to only calls out to np.sqrt which I believe does get translated to machine code, and it operates on elements, not slices. I would try to expand out the sum in the inner loop as an explicit additional loop over elements, rather than taking slices and using the sum method.
My experience is that Numba can work wonders sometimes, but it doesn't speed-up arbitrary python code. You need to get a sense of the limitations and what it can optimize effectively. Also note that v0.11 is a bit different in this regard as compared to 0.12 and 0.13 due to the major refactoring that Numba went through between those versions.
Below is a solution using Josh's advice, which is spot on. It appears, however, that max() works fine in the implementation below. It would be great if there were a list of "safe" Python / NumPy functions.
[Note: I reduced the dimensions of the original matrix to 100 x 200]
import numpy as np
from numba import double
from numba.decorators import jit, autojit
X = np.random.random((100,200))
def cooccurance_probability_explicit(X):
    C = X.shape[0]
    P = X.shape[1]
    # - Column Sums - #
    CS = np.zeros((P,), dtype=np.float)
    for p in range(P):
        for c in range(C):
            CS[p] += X[c,p]
    D = np.empty((P, P), dtype=np.float)  # Return Matrix
    for i in range(P):
        for j in range(P):
            # - Compute Elemental Pairwise Sums over each Product Vector - #
            pws = 0
            for c in range(C):
                pws += (X[c,i] * X[c,j])
            D[i,j] = pws / max(CS[i], CS[j])
    return D
cooccurance_probability_explicit_numba = autojit(cooccurance_probability_explicit)
%timeit results:
%timeit cooccurance_probability(X)
10 loops, best of 3: 83 ms per loop
%timeit cooccurance_probability_explicit(X)
1 loops, best of 3: 2.55s per loop
%timeit cooccurance_probability_explicit_numba(X)
100 loops, best of 3: 7.72 ms per loop
The interesting thing about the results is that the explicitly written version executed by Python is very slow due to the large type-checking overhead. But passing it through Numba works its magic. (Numba is ~11.5 times faster than the Python solution using NumPy.)
Update: Added a Cython Function for Comparison (thanks to moarningsun: Cython function with variable sized matrix input)
%load_ext cythonmagic
%%cython
import numpy as np
cimport numpy as np
def cooccurance_probability_cy(double[:,:] X):
    cdef int C, P, i, j, c
    C = X.shape[0]
    P = X.shape[1]
    cdef double pws
    cdef double [:] CS = np.sum(X, axis=0)
    cdef double [:,:] D = np.empty((P,P), dtype=np.float)
    for i in range(P):
        for j in range(P):
            pws = 0.0
            for c in range(C):
                pws += (X[c, i] * X[c, j])
            D[i,j] = pws / max(CS[i], CS[j])
    return D
%timeit results:
%timeit cooccurance_probability_cy(X)
100 loops, best of 3: 12 ms per loop