I wrote the code below that uses a for loop. I would like to ask if there is a way to vectorize the operation within the second for loop since I intend to work with larger matrices.
import numpy as np
num = 5
A = np.array([[1,2,3,4,5], [4,5,6,4,5], [7,8,9,4,5], [10,11,12,4,5], [13,14,15,4,5]])
sm_factor = np.array([0.1 ,0.1, 0.1, 0.1, 0.1])
d2m = np.zeros((num, num))
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
for k in range(1, num-1):
    d2m[k, k-1] = 1
    d2m[k, k] = -2
    d2m[k, k+1] = 1
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
x_smf = 0
for i in range(len(sm_factor)):
    x_smf = x_smf + sm_factor[i] * (d2m @ (A[i, :]).T).T @ (d2m @ (A[i, :]).T)
x_smf
# 324.0
You can avoid loops for both the d2m matrix creation and the x_smf computation. For the creation, use sps.diags to build a sparse tridiagonal matrix, which you can cast to an array in order to edit the first and last rows. Your code will look like this (note that the result of diags has been cast to a dense ndarray using the scipy.sparse.dia_matrix.toarray method):
import numpy as np
import scipy.sparse as sps
# Dense tridiagonal matrix
d2m = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).toarray() # cast to array
# First line boundary conditions
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
# Last line boundary conditions
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
The solution proposed by Valdi_Bo enables you to remove the second FOR loop:
x_smf = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
However, I want to draw your attention to the fact that the d2m matrix is sparse, and storing it as a dense ndarray is bad for both computation time and memory use. Instead of casting to a dense ndarray, I advise you to cast to a sparse matrix format, for example lil_matrix (a list-of-lists sparse format), using the tolil() method instead of toarray():
# Sparse tridiagonal matrix
d2m_s = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).tolil() # cast to lil
Here is a script that compares the three implementations on a bigger case, num=4000 (for num=5 they all give 324). At this size I already see the benefit of using a sparse matrix. Here is the whole script (the first lines generalise the construction of A and sm_factor to values of num other than 5):
from time import time
import numpy as np
import scipy.sparse as sps
num = 4000
A = np.concatenate([np.arange(1, (num-2)*num+1).reshape(num, num-2), np.repeat([[4, 5]], num, axis=0)], axis=1)
sm_factor = 0.1*np.ones(num)
########## DENSE matrix + FOR loop ##########
d2m = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).toarray() # cast to array
# First line boundary conditions
d2m[0, 0] = 2
d2m[0, 1] = -5
d2m[0, 2] = 4
d2m[0, 3] = -1
# Last line boundary conditions
d2m[num-1, num-4] = -1
d2m[num-1, num-3] = 4
d2m[num-1, num-2] = -5
d2m[num-1, num-1] = 2
# FOR loop version
t_start = time()
x_smf = 0
for i in range(len(sm_factor)):
    x_smf = x_smf + sm_factor[i] * (d2m @ (A[i, :]).T).T @ (d2m @ (A[i, :]).T)
print(f'FOR loop version time: {time()-t_start}s')
print(f'FOR loop version value: {x_smf}\n')
########## DENSE matrix + VECTORIZED ##########
t_start = time()
x_smf_v = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
print(f'VECTORIZED version time: {time()-t_start}s')
print(f'VECTORIZED version value: {x_smf_v}\n')
########## SPARSE matrix + VECTORIZED ##########
d2m_s = sps.diags([1, -2, 1], [-1, 0, 1], shape=(num, num)).tolil() # cast to lil
# First line boundary conditions
d2m_s[0, 0] = 2
d2m_s[0, 1] = -5
d2m_s[0, 2] = 4
d2m_s[0, 3] = -1
# Last line boundary conditions
d2m_s[num-1, num-4] = -1
d2m_s[num-1, num-3] = 4
d2m_s[num-1, num-2] = -5
d2m_s[num-1, num-1] = 2
t_start = time()
x_smf_s = np.sum(sm_factor * np.square(d2m_s @ A.T).sum(axis=0))
print(f'SPARSE+VECTORIZED version time: {time()-t_start}s')
print(f'SPARSE+VECTORIZED version value: {x_smf_s}\n')
Here is what I get when running the code:
FOR loop version time: 25.878241777420044s
FOR loop version value: 3.752317536763356e+17
VECTORIZED version time: 1.0873610973358154s
VECTORIZED version value: 3.752317536763356e+17
SPARSE+VECTORIZED version time: 0.37279224395751953s
SPARSE+VECTORIZED version value: 3.752317536763356e+17
As you can see, using a sparse matrix gains you another factor of roughly 3 in computation time and doesn't require you to adapt the code that comes afterwards. It is also a good idea to try the various scipy sparse matrix formats (tocsc(), tocsr(), todok(), etc.); some may be better suited to your case.
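If you want to compare formats on your own problem, a quick illustrative sketch is to reuse the boundary-adjusted d2m_s from the script above and simply convert it before the vectorized computation (which format wins depends on your access pattern):
# Sketch: compare a few sparse formats (assumes num, A, sm_factor, d2m_s and time from the script above)
for fmt in ("tocsr", "tocsc", "todok"):
    d2m_f = getattr(d2m_s, fmt)()  # convert the lil matrix to another sparse format
    t_start = time()
    np.sum(sm_factor * np.square(d2m_f @ A.T).sum(axis=0))
    print(f'{fmt} time: {time()-t_start}s')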
After some research and printing intermediate results of your loop, I found the solution:
x_smf = np.sum(sm_factor * np.square(d2m @ A.T).sum(axis=0))
The result is:
324.0
By the way: creation of d2m can be shortened to:
d2m = np.zeros((num, num), dtype='int')
d2m[0, :4] = [ 2, -5, 4, -1]
for k in range(1, num-1):
    d2m[k, k-1:k+2] = [ 1, -2, 1]
d2m[-1, -4:] = [-1, 4, -5, 2]
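If you prefer to drop the remaining loop as well, the same matrix can be built in one shot from its diagonals. This is a minimal sketch equivalent to the construction above (it assumes num and numpy as np are already defined, and produces a float array rather than an int one):
# Loop-free construction of the same d2m
d2m = (np.diag(np.ones(num-1), -1)
       - 2*np.eye(num)
       + np.diag(np.ones(num-1), 1))
d2m[0, :4] = [ 2, -5, 4, -1]
d2m[-1, -4:] = [-1, 4, -5, 2]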
The code calculates the minimum value in each row and picks the next minimum by scanning the nearby elements on the same and the next row. Instead, I want the code to start with the minimum value of the first row and then progress by scanning the nearby elements. I don't want it to calculate the minimum value for each row. The outputs are attached.
import numpy as np
from scipy.ndimage import minimum_filter as mf
Pe = np.random.rand(5,5)
b = np.zeros((Pe.shape[0], 2))
#the footprint of the window, i.e., we do not consider the value itself or value in the row above
ft = np.asarray([[0, 0, 0],
[1, 0, 1],
[1, 1, 1]])
#applying scipy's minimum filter
#mode defines what should be considered as values at the edges
#setting the edges to INF
Pe_min = mf(Pe, footprint=ft, mode="constant", cval=np.inf)
#finding rowwise index of minimum value
idx = Pe.argmin(axis=1)
#retrieving minimum values and filtered values
b[:, 0] = np.take_along_axis(Pe, idx[None].T, 1).T[0]
b[:, 1] = np.take_along_axis(Pe_min, idx[None].T, 1).T[0]
print(b)
Present Output:
Desired Output:
You can solve this using a simple while loop: for a given current location, each step of the loop iterates over the neighborhood to find the smallest value amongst all the valid next locations, then stores it and moves there.
Since this can be pretty inefficient in pure Numpy, you can use Numba so the code can be executed efficiently. Here is the implementation:
import numpy as np
import numba as nb
Pe = np.random.rand(5,5)
# array([[0.58268917, 0.99645225, 0.06229945, 0.5741654 , 0.41407074],
# [0.4933553 , 0.93253261, 0.1485588 , 0.00133828, 0.09301049],
# [0.49055436, 0.53794993, 0.81358814, 0.25031136, 0.76174586],
# [0.69885908, 0.90878292, 0.25387689, 0.25735301, 0.63913838],
# [0.33781117, 0.99406778, 0.49133067, 0.95026241, 0.14237322]])
@nb.njit('int_[:,:](float64[:,::1])', boundscheck=True)
def minValues(arr):
    n, m = arr.shape
    assert n >= 1 and m >= 2
    res = []
    i, j = 0, np.argmin(arr[0,:])
    res.append((i, j))
    iPrev = jPrev = -1
    while iPrev < n-1:
        cases = [(i, j-1), (i, j+1), (i+1, j-1), (i+1, j), (i+1, j+1)]
        minVal = np.inf
        iMin = jMin = -1
        # Find the best candidate (smallest value)
        for (i2, j2) in cases:
            if i2 == iPrev and j2 == jPrev:  # No cycles
                continue
            if i2 < 0 or i2 >= n or j2 < 0 or j2 >= m:  # No out-of-bounds
                continue
            if arr[i2, j2] < minVal:
                iMin, jMin = i2, j2
                minVal = arr[i2, j2]
        assert not np.isinf(minVal)
        # Store it and update the values
        res.append((iMin, jMin))
        iPrev, jPrev = i, j
        i, j = iMin, jMin
    return np.array(res)
minValues(Pe)
# array([[0, 2],
# [1, 3],
# [1, 4],
# [2, 3],
# [3, 2],
# [3, 3],
# [4, 4],
# [4, 3]], dtype=int32)
The algorithm is relatively fast: it finds a path of length 141_855 in a Pe array of shape (100_000, 1_000) in only 15 ms on my machine (although it can be optimized further). The same code using only CPython (i.e. without the Numba JIT) takes 591 ms.
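If you want to reproduce a rough timing yourself, a minimal harness could look like the sketch below (the exact numbers depend on your machine and on the random Pe):
from time import time
big = np.random.rand(100_000, 1_000)
minValues(big)                     # optional warm-up call
t_start = time()
path = minValues(big)
print(len(path), f'{time()-t_start:.3f}s')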
So I create the sparse matrix below from the coordinate arrays as usual:
from scipy import sparse
I = np.array([0,1,2, 0,1,2, 0,1,2])
J = np.array([0,0,0,1,1,1,2,2,2])
DataElement = np.array([2,1,2,1,0,1,2,1,2])
A = sparse.coo_matrix((DataElement,(I,J)),shape=(3,3))
print(A.toarray()) ## This is what I expect to see.
My attempt with numpy is:
import numpy as np
U = np.empty((3,3,), order = "F")
U[:] = np.nan
## Initialize
U[0,0] = 2
U[2,0] = 2
U[0,2] = 2
U[2,2] = 2
for j in range(0,3):
    ## Slice columns first:
    if (j != 0 and j != 2):
        for i in range(0,3):
            ## slice rows:
            if (i != 0 and i != 2):
                U[i,j] = 0
            else:
                U[i,j] = 1
One way using numpy.add.at:
arr = np.zeros((3,3), int)
np.add.at(arr, (I, J), DataElement)
print(arr)
Output:
array([[2, 1, 2],
[1, 0, 1],
[2, 1, 2]])
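For reference, this matches the dense array produced by the coo_matrix construction from the question (coo_matrix likewise sums duplicate (i, j) entries, just as np.add.at does):
from scipy import sparse
A = sparse.coo_matrix((DataElement, (I, J)), shape=(3, 3))
print(np.array_equal(A.toarray(), arr))  # True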
There are several ways of manually filling the array.
First, you can explicitly define each entry:
U = np.array([[2,1,2],[1,0,1],[2,1,2]],order='F')
Or you can initialize an array with NaNs and then define each element by subscripting it:
U = np.empty((3,3,), order = "F")
U[:] = np.nan
U[0,0],U[0,1],U[0,2]=2,1,2
U[1,0],U[1,1],U[1,2]=1,0,1
U[2,0],U[2,1],U[2,2]=2,1,2
Finally, if there is a pattern, one can slice and define multiple values at once:
U[:,0]=[2,1,2]
U[:,1]=U[:,0]-1
U[:,2]=U[:,0]
In your attempt, you simply miss some of the entries, and they remain nans.
I have an n-by-3 index array (think of triangles indexing points) and a list of float values associated with the triangles. I now want to get for each index ("point") the minimum value, i.e., check all rows which contain the index, say, 0, and get the minimum value from vals across the respective rows:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = [
numpy.min(vals[numpy.any(a == i, axis=1)])
for i in range(6)
]
# out = numpy.array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])
This solution is inefficient because it does a full array comparison for every i.
This looks like a job for numpy's ufunc .at methods, but numpy.min.at doesn't exist.
Any hints?
Approach #1
One approach based on array assignment: set up a 2D array filled with NaNs, use the values in a as column indices (this assumes they are integers), map vals into it, and take the NaN-skipping minima along the columns for the final output -
nr,nc = len(a),a.max()+1
m = np.full((nr,nc),np.nan)
m[np.arange(nr)[:,None],a] = vals[:,None]
out = np.nanmin(m,axis=0)
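A quick check with the sample a and vals from the question reproduces the expected output:
import numpy as np
a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])
nr, nc = len(a), a.max() + 1
m = np.full((nr, nc), np.nan)
m[np.arange(nr)[:, None], a] = vals[:, None]
print(np.nanmin(m, axis=0))  # [0.1 0.1 0.1 0.5 0.3 0.6]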
Approach #2
Another one, again based on array assignment, but using masking and np.minimum.reduceat instead of dealing with NaNs -
nr,nc = len(a),a.max()+1
m = np.zeros((nc,nr),dtype=bool)
m[a.T,np.arange(nr)] = 1
c = m.sum(1)
shift_idx = np.r_[0,c[:-1].cumsum()]
out = np.minimum.reduceat(np.broadcast_to(vals,m.shape)[m],shift_idx)
Approach #3
Another one based on argsort (assuming a contains all integers from 0 to a.max()) -
sidx = a.ravel().argsort()
c = np.bincount(a.ravel())
out = np.minimum.reduceat(vals[sidx//a.shape[1]],np.r_[0,c[:-1].cumsum()])
Approach #4
For memory efficiency and hence performance, and also to complete the set -
from numba import njit
@njit
def numba1(a, vals, out):
    m,n = a.shape
    for j in range(m):
        for i in range(n):
            e = a[j,i]
            if vals[j] < out[e]:
                out[e] = vals[j]
    return out

def func1(a, vals, outlen=None): # feed in output length as outlen if known
    if outlen is not None:
        N = outlen
    else:
        N = a.max()+1
    out = np.full(N,np.inf)
    return numba1(a, vals, out)
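A minimal usage sketch on the question's sample data (func1 is the wrapper defined just above):
a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])
print(func1(a, vals))  # expected: [0.1 0.1 0.1 0.5 0.3 0.6]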
You may switch to pd.GroupBy or itertools.groupby if your for loop goes way beyond 6.
For instance,
import pandas as pd

r = a.ravel()
pd.Series(np.arange(len(r))//3).groupby(r).apply(lambda s: vals[s].min())
This solution would be faster for long loops, and probably slower for small loops (< 50)
Here is one based on this Q&A:
If you have pythran, compile the following file <stb_pthr.py>:
import numpy as np
#pythran export sort_to_bins(int[:], int)
def sort_to_bins(idx, mx):
    if mx==-1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i]+2] += 1
    for i in range(2, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return res, cnts[:-1]
Otherwise the script will fall back to a sparse matrix based approach which is only slightly slower:
import numpy as np
try:
    from stb_pthr import sort_to_bins
    HAVE_PYTHRAN = True
except:
    HAVE_PYTHRAN = False
from scipy.sparse import csr_matrix
def sort_to_bins_sparse(idx, mx):
    if mx==-1:
        mx = idx.max() + 1
    aux = csr_matrix((np.ones_like(idx),idx,np.arange(idx.size+1)),
                     (idx.size,mx)).tocsc()
    return aux.indices, aux.indptr

if not HAVE_PYTHRAN:
    sort_to_bins = sort_to_bins_sparse
def f_op():
    mx = a.max() + 1
    return np.fromiter((np.min(vals[np.any(a == i, axis=1)])
                        for i in range(mx)),vals.dtype,mx)

def f_pp():
    idx, bb = sort_to_bins(a.reshape(-1),-1)
    res = np.minimum.reduceat(vals[idx//3], bb[:-1])
    res[bb[:-1]==bb[1:]] = np.inf
    return res

def f_div_3():
    sidx = a.ravel().argsort()
    c = np.bincount(a.ravel())
    bb = np.r_[0,c.cumsum()]
    res = np.minimum.reduceat(vals[sidx//a.shape[1]],bb[:-1])
    res[bb[:-1]==bb[1:]] = np.inf
    return res
a = np.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = np.array([0.1, 0.5, 0.3, 0.6])
assert np.all(f_op()==f_pp())
from timeit import timeit
a = np.random.randint(0,1000,(10000,3))
vals = np.random.random(10000)
assert len(np.unique(a))==1000
assert np.all(f_op()==f_pp())
print("1000/1000 labels, 10000 rows")
print("op ", timeit(f_op, number=10)*100, 'ms')
print("pp ", timeit(f_pp, number=100)*10, 'ms')
print("div", timeit(f_div_3, number=100)*10, 'ms')
a = 1 + 2 * np.random.randint(0,5000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
a = 1 + 2 * np.random.randint(0,100000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
Sample run (timings include @Divakar approach 3 for reference):
1000/1000 labels, 10000 rows
op 145.1122640981339 ms
pp 0.7944229000713676 ms
div 2.2905819199513644 ms
5000/10000 labels, 1000000 rows
pp 113.86540920939296 ms
div 417.2476712032221 ms
100000/200000 labels, 1000000 rows
pp 158.23634970001876 ms
div 486.13436080049723 ms
UPDATE: @Divakar's latest (approach 4) is hard to beat, being essentially a C implementation. Nothing wrong with that, except that jitting is not an option but a requirement here (the unjitted code is no fun to run). If one accepts that, the same can, of course, be done with pythran:
pythran -O3 labeled_min.py
file <labeled_min.py>
import numpy as np
#pythran export labeled_min(int[:,:], float[:])
def labeled_min(A, vals):
    mn = np.empty(A.max()+1)
    mn[:] = np.inf
    M,N = A.shape
    for i in range(M):
        v = vals[i]
        for j in range(N):
            c = A[i,j]
            if v < mn[c]:
                mn[c] = v
    return mn
Both give another massive speedup:
from labeled_min import labeled_min
func1(a, vals) # warm-up call, so jitting time is not measured
print("nmb ", timeit(func1, number=100)*10, 'ms')
print("pthr", timeit(lambda:labeled_min(a,vals), number=100)*10, 'ms')
Sample run:
nmb 8.41792532010004 ms
pthr 8.104007659712806 ms
pythran comes out a few percent faster but this is only because I moved vals lookup out of the inner loop; without that they are all but equal.
For comparison, the previous best with and without non-Python helpers on the same problem:
pp 114.04887529788539 ms
pp (py only) 147.0821460010484 ms
Apparently, numpy.minimum.at exists:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = numpy.full(6, numpy.inf)
numpy.minimum.at(out, a.reshape(-1), numpy.repeat(vals, 3))
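Printing out reproduces the expected result from the question:
print(out)
# [0.1 0.1 0.1 0.5 0.3 0.6]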
I have an array a of length N and need to implement the following operation:

b[n] = sum_{i=0}^{n} p^{n-i} * a[i],  with p in [0, 1]

This equation is a lossy sum, where the first indices in the sum are attenuated by a stronger factor (p^{n-i}) than the last ones. The last index (i = n) is always weighted by 1. If p = 1, the operation is a simple cumsum:
b = np.cumsum(a)
If p != 1, I can implement this operation in a CPU-inefficient way:
b = np.empty(np.shape(a))
# I'm using the (-1,-1,-1) idiom for reversed ranges
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
# p_vec[0] = p^{N-1}, p_vec[-1] = 1
for n in range(N):
    b[n] = np.sum(a[:n+1]*p_vec[-(n+1):])
Or in a memory-inefficient but vectorized way (IMO it is CPU-inefficient too, since a lot of work is wasted):
a_idx = np.reshape(np.arange(N+1), (1, N+1)) - np.reshape(np.arange(N-1, 0-1, -1), (N, 1))
a_idx = np.maximum(0, a_idx)
# For N=4, a_idx looks like this:
# [[0, 0, 0, 0, 1],
# [0, 0, 0, 1, 2],
# [0, 0, 1, 2, 3],
# [0, 1, 2, 3, 4]]
a_ext = np.concatenate(([0], a,), axis=0) # len(a_ext) = N + 1
p_vec = np.power(p, np.arange(N, 0-1, -1)) # len(p_vec) = N + 1
b = np.dot(a_ext[a_idx], p_vec)
Is there a better way to achieve this 'lossy' cumsum?
What you want is an IIR filter, and you can use scipy.signal.lfilter(). Here is the code.
Your code:
import numpy as np
N = 10
p = 0.8
np.random.seed(0)
x = np.random.randn(N)
y = np.empty_like(x)
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
for n in range(N):
    y[n] = np.sum(x[:n+1]*p_vec[-(n+1):])
y
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
By using lfilter():
from scipy import signal
y = signal.lfilter([1], [1, -p], x)
print(y)
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
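For intuition: lfilter([1], [1, -p], x) implements the recurrence y[n] = x[n] + p*y[n-1], which unrolls exactly to the lossy sum above. A short sketch to verify, reusing x, p and y from the snippets above:
y_rec = np.empty_like(x)
acc = 0.0
for n in range(N):
    acc = x[n] + p * acc  # each step attenuates the running sum by p
    y_rec[n] = acc
print(np.allclose(y_rec, y))  # True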
I ran into this problem when implementing the vectorized SVM gradient for cs231n assignment1.
Here is an example:
ary = np.array([[1,-9,0],
[1,2,3],
[0,0,0]])
ary[[0,1]] += np.ones((2,3),dtype='int')
and it outputs:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
Everything is fine until the rows are not unique:
ary[[0,1,1]] += np.ones((3,3),dtype='int')
Although it didn't throw an error, the output was really strange:
array([[ 2, -8, 1],
[ 2, 3, 4],
[ 0, 0, 0]])
I expected the second row to be [3,4,5] rather than [2,3,4].
The naive way I used to solve this problem is a for loop like this:
ary = np.array([[ 2, -8, 1],
                [ 2, 3, 4],
                [ 0, 0, 0]], dtype=float)  # float dtype so the float updates can be added in place
# the rows I want to change
rows = [0,1,2,1,0,1]
# the change matrix
change = np.random.randn(6, 3)
for i,row in enumerate(rows):
    ary[row] += change[i]
I really don't know how to vectorize this for loop. Is there a better way to do this in NumPy? And why is it wrong to do something like this?
ary[rows] += change
In case anyone is curious why I want to do this, here is my implementation of the svm_loss_vectorized function; I need to compute the gradient of the weights based on the labels y:
def svm_loss_vectorized(W, X, y, reg):
    """
    Structured SVM loss function, vectorized implementation.
    Inputs and outputs are the same as svm_loss_naive.
    """
    loss = 0.0
    dW = np.zeros(W.shape) # initialize the gradient as zero

    # transpose X and W
    # D means input dimensions, N means number of train examples
    # C means number of classes
    # X.shape will be (D,N)
    # W.shape will be (C,D)
    X = X.T
    W = W.T
    dW = dW.T
    num_train = X.shape[1]

    # transpose W_y shape to (D,N)
    W_y = W[y].T
    S_y = np.sum(W_y*X, axis=0)
    margins = np.dot(W,X) + 1 - S_y
    mask = np.array(margins>0)

    # get the impact the num_train examples make on W's gradient
    # that is, only when the mask is positive
    # does the train example have an impact on W's gradient
    dW_j = np.dot(mask, X.T)
    dW += dW_j
    mul_mask = np.sum(mask, axis=0, keepdims=True).T

    # dW[y] -= mul_mask * X.T
    dW_y = mul_mask * X.T
    for i,label in enumerate(y):
        dW[label] -= dW_y[i]

    loss = np.sum(margins*mask) - num_train
    loss /= num_train
    dW /= num_train
    # add regularization term
    loss += reg * np.sum(W*W)
    dW += reg * 2 * W
    dW = dW.T
    return loss, dW
Using built-in np.add.at
The built-in for such tasks is np.add.at, i.e.
np.add.at(ary, rows, change)
But, since we are working with a 2D array, that might not be the most performant one.
Leveraging fast matrix-multiplication
As it turns out, we can leverage the very efficient matrix multiplication for such a case as well, and given a large enough number of repeated rows to sum, it can be really good. Here's how we can use it -
mask = rows == np.arange(len(ary))[:,None]
ary += mask.dot(change)
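As a quick sanity check that both approaches give the same result, here is a small, hypothetical float example (not the OP's arrays):
ary1 = np.zeros((3, 4))
ary2 = np.zeros((3, 4))
rows = np.array([0, 1, 2, 1, 0, 1])
change = np.random.rand(6, 4)
np.add.at(ary1, rows, change)                 # built-in unbuffered accumulation
mask = rows == np.arange(len(ary2))[:, None]  # (3, 6) selection mask
ary2 += mask.dot(change)                      # matmul accumulates the repeated rows
print(np.allclose(ary1, ary2))                # True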
Benchmarking
Let's time the np.add.at method against the matrix-multiplication based one for bigger arrays -
In [681]: ary = np.random.rand(1000,1000)
In [682]: rows = np.random.randint(0,len(ary),(10000))
In [683]: change = np.random.rand(10000,1000)
In [684]: %timeit np.add.at(ary, rows, change)
1 loop, best of 3: 604 ms per loop
In [687]: def matmul_addat(ary, rows, change):
     ...:     mask = rows == np.arange(len(ary))[:,None]
     ...:     ary += mask.dot(change)
In [688]: %timeit matmul_addat(ary, rows, change)
10 loops, best of 3: 158 ms per loop