I have a numpy operation that looks like the following:
for i in range(i_max):
    for j in range(j_max):
        r[i, j, x[i, j], y[i, j]] = c[i, j]
where x, y and c have the same shape.
Is it possible to use numpy's advanced indexing to speed this operation up?
I tried using:
i = numpy.arange(i_max)
j = numpy.arange(j_max)
r[i, j, x, y] = c
However, I didn't get the result I expected.
Using linear indexing -
d0, d1, d2, d3 = r.shape
np.put(r, np.arange(i_max)[:,None]*d1*d2*d3 + np.arange(j_max)*d2*d3 + x*d3 + y, c)
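For reference, the same flat indices can also be built with np.ravel_multi_index after explicitly broadcasting the index arrays. This is just a sketch of an equivalent formulation, not the version benchmarked below -
I = np.arange(i_max)[:, None]   # column vector, broadcasts against J, x, y
J = np.arange(j_max)
# Broadcast everything to (i_max, j_max), convert to flat indices and assign
Ib, Jb, xb, yb = np.broadcast_arrays(I, J, x, y)
np.put(r, np.ravel_multi_index((Ib, Jb, xb, yb), r.shape), c)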
Benchmarking and verification
Define functions -
def linear_indx(r, x, y, c, i_max, j_max):
    d0, d1, d2, d3 = r.shape
    np.put(r, np.arange(i_max)[:,None]*d1*d2*d3 + np.arange(j_max)*d2*d3 + x*d3 + y, c)
    return r

def org_app(r, x, y, c, i_max, j_max):
    for i in range(i_max):
        for j in range(j_max):
            r[i, j, x[i,j], y[i,j]] = c[i,j]
    return r
Setup input arrays and benchmark -
In [134]: # Setup input arrays
...: i_max = 40
...: j_max = 50
...: D0 = 60
...: D1 = 70
...: N = 80
...:
...: r = np.zeros((D0,D1,N,N))
...: c = np.random.rand(i_max,j_max)
...:
...: x = np.random.randint(0,N,(i_max,j_max))
...: y = np.random.randint(0,N,(i_max,j_max))
...:
In [135]: # Make copies for testing, as both functions make in-situ changes
...: r1 = r.copy()
...: r2 = r.copy()
...:
In [136]: # Verify results by comparing with original loopy approach
...: np.allclose(linear_indx(r1,x,y,c,i_max,j_max),org_app(r2,x,y,c,i_max,j_max))
Out[136]: True
In [137]: # Make copies for testing, as both functions make in-situ changes
...: r1 = r.copy()
...: r2 = r.copy()
...:
In [138]: %timeit linear_indx(r1,x,y,c,i_max,j_max)
10000 loops, best of 3: 115 µs per loop
In [139]: %timeit org_app(r2,x,y,c,i_max,j_max)
100 loops, best of 3: 2.25 ms per loop
The indexing arrays need to be broadcastable against each other for this to work. The only change needed is to add an axis to the first index i so that its shape broadcasts with the rest. The quick way to accomplish this is by indexing with None (which is equivalent to numpy.newaxis):
i = numpy.arange(i_max)
j = numpy.arange(j_max)
r[i[:,None], j, x, y] = c
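A tiny self-contained check (a sketch with made-up sizes) that the broadcasted assignment reproduces the double loop:
i_max, j_max, N = 3, 4, 5
x = numpy.random.randint(0, N, (i_max, j_max))
y = numpy.random.randint(0, N, (i_max, j_max))
c = numpy.random.rand(i_max, j_max)

# Reference: the original double loop
r_loop = numpy.zeros((i_max, j_max, N, N))
for a in range(i_max):
    for b in range(j_max):
        r_loop[a, b, x[a, b], y[a, b]] = c[a, b]

# Broadcasted advanced-indexing assignment
r_bcast = numpy.zeros((i_max, j_max, N, N))
i = numpy.arange(i_max)
j = numpy.arange(j_max)
r_bcast[i[:, None], j, x, y] = c

assert numpy.allclose(r_loop, r_bcast)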
Related
I am trying to vectorize the following operations with two matrices in Python.
f= matrix([[ 96],
[192],
[288],
[384]], dtype=int32)
g = matrix([[ 0.],
[ 70.],
[ 200.],
[ 60.]])
I need to create z without loops, where each element z[i] is the maximum of f[i] and the previous value z[i-1] plus g[i] (z[0] is simply f[0]). This loop is called thousands of times, so it slows down the run time.
for i in range(4):
    if i != 0:
        z[i] = max(f[i], z[i-1] + g[i])
    else:
        z[0] = f[i]
Any guidance on how to vectorize this code would be really helpful.
Thanks in advance.
Here is a vectorized version. It uses the cumulative maximum of the difference between f and cumsum(g) to find the points where f[i] takes over, i.e. where z[i] equals f[i]:
Timings:
N = 10
loopy 0.00594156 ms
vect 0.03193051 ms
N = 100
loopy 0.05560229 ms
vect 0.03186400 ms
N = 1000
loopy 0.57484017 ms
vect 0.04492043 ms
N = 10000
loopy 5.75115310 ms
vect 0.15519847 ms
N = 100000
loopy 57.30253551 ms
vect 1.69428380 ms
Code:
import numpy as np
import types
from timeit import timeit
def setup_data(N):
    g = np.random.random((N,))
    f = 2 + np.cumsum(np.random.random((N,)))
    return f, g

def f_loopy(f, g):
    N, = f.shape
    z = np.empty_like(f)
    for i in range(N):
        if i != 0:
            z[i] = max(f[i], z[i-1] + g[i])
        else:
            z[0] = f[i]
    return z

def f_vect(f, g):
    N, = f.shape
    gg = np.cumsum(g)
    rmx = np.maximum.accumulate(f - gg)
    sw = np.r_[0, 1 + np.flatnonzero(rmx[:-1] != rmx[1:]), N]
    return gg + np.repeat(f[sw[:-1]] - gg[sw[:-1]], np.diff(sw))

for N in [10, 100, 1000, 10000, 100000]:
    data = setup_data(N)
    ref = f_loopy(*data)
    print(f'N = {N}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.allclose(ref, func(*data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(*data)', globals={'f': func, 'data': data}, number=100)*10))
        except:
            print("{:16s} apparently failed".format(name[2:]))
I'd like to write a vectorized version of code that calculates the Arnaud Legoux Moving Average using NumPy (or Pandas). Could you help me with this, please? Thanks.
The non-vectorized version looks like the following (see below).
def NPALMA(pnp_array, **kwargs):
    '''
    ALMA - Arnaud Legoux Moving Average,
    http://www.financial-hacker.com/trend-delusion-or-reality/
    https://github.com/darwinsys/Trading_Strategies/blob/master/ML/Features.py
    '''
    length = kwargs['length']
    # just some number (6.0 is useful)
    sigma = kwargs['sigma']
    # sensitivity (close to 1) or smoothness (close to 0)
    offset = kwargs['offset']

    asize = length - 1
    m = offset * asize
    s = length / sigma
    dss = 2 * s * s

    alma = np.zeros(pnp_array.shape)
    wtd_sum = np.zeros(pnp_array.shape)

    for l in range(len(pnp_array)):
        if l >= asize:
            for i in range(length):
                im = i - m
                wtd = np.exp(-(im * im) / dss)
                alma[l] += pnp_array[l - length + i] * wtd
                wtd_sum[l] += wtd
            alma[l] = alma[l] / wtd_sum[l]

    return alma
Starting Approach
We can create sliding windows along the first axis and then use tensor multiplication with the range of wtd values for the sum-reductions.
The implementation would look something like this -
# Get all wtd values in an array
wtds = np.exp(-(np.arange(length) - m)**2/dss)
# Get the sliding windows for input array along first axis
pnp_array3D = strided_axis0(pnp_array,len(wtds))
# Initialize o/p array
out = np.zeros(pnp_array.shape)
# Get sum-reductions for the windows which don't need wrapping over
out[length:] = np.tensordot(pnp_array3D,wtds,axes=((1),(0)))[:-1]
# The entry at index length-1 needs wrapping (the original loop reads
# pnp_array[-1] there via a negative index). So, do it separately.
out[length-1] = wtds.dot(pnp_array[np.r_[-1,range(length-1)]])
# Finally perform the divisions
out /= wtds.sum()
Function to get the sliding windows: strided_axis0 is from here.
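In case that link is unavailable, here is one possible implementation of the helper using np.lib.stride_tricks.as_strided (a sketch; the linked version may differ in details):
def strided_axis0(a, L):
    # Sliding windows of length L along axis 0:
    # output shape is (a.shape[0] - L + 1, L) + a.shape[1:]
    s0 = a.strides[0]
    shp = (a.shape[0] - L + 1, L) + a.shape[1:]
    strd = (s0, s0) + a.strides[1:]
    return np.lib.stride_tricks.as_strided(a, shape=shp, strides=strd)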
Boost with 1D convolution
Those multiplications with the wtds values and then their sum-reductions are basically a convolution along the first axis. As such, we can use scipy.ndimage.convolve1d along axis=0. This would be much faster given the memory efficiency, as we won't be creating huge sliding windows.
The implementation would be -
from scipy.ndimage import convolve1d as conv
avgs = conv(pnp_array, weights=wtds/wtds.sum(), axis=0, mode='wrap')
Thus, out[length-1:], which are the non-zero rows, would be the same as avgs[:-length+1].
There could be some precision difference if we are working with really small kernel numbers from wtds. So, keep that in mind if using this convolution method.
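If a full-length output with leading zeros (matching the loopy version) is needed, the convolution result can be placed back following the alignment stated above (a sketch):
out = np.zeros(pnp_array.shape)
out[length-1:] = avgs[:-length+1]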
Runtime test
Approaches -
def original_app(pnp_array, length, m, dss):
    asize = length - 1
    alma = np.zeros(pnp_array.shape)
    wtd_sum = np.zeros(pnp_array.shape)
    for l in range(len(pnp_array)):
        if l >= asize:
            for i in range(length):
                im = i - m
                wtd = np.exp(-(im * im) / dss)
                alma[l] += pnp_array[l - length + i] * wtd
                wtd_sum[l] += wtd
            alma[l] = alma[l] / wtd_sum[l]
    return alma

def vectorized_app1(pnp_array, length, m, dss):
    wtds = np.exp(-(np.arange(length) - m)**2/dss)
    pnp_array3D = strided_axis0(pnp_array, len(wtds))
    out = np.zeros(pnp_array.shape)
    out[length:] = np.tensordot(pnp_array3D, wtds, axes=((1),(0)))[:-1]
    out[length-1] = wtds.dot(pnp_array[np.r_[-1, range(length-1)]])
    out /= wtds.sum()
    return out

def vectorized_app2(pnp_array, length, m, dss):
    wtds = np.exp(-(np.arange(length) - m)**2/dss)
    return conv(pnp_array, weights=wtds/wtds.sum(), axis=0, mode='wrap')
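Before timing, a small self-contained agreement check between the loopy reference and the sliding-window version (a sketch using the strided_axis0 helper from above; the parameter values are made up for illustration):
np.random.seed(0)
pnp = np.random.rand(100, 4)
length = 6
sigma = 6.0
offset = 0.85
m = np.floor(offset * (length - 1))
dss = 2 * (length / sigma)**2
assert np.allclose(original_app(pnp, length, m, dss),
                   vectorized_app1(pnp, length, m, dss))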
Timings -
In [470]: np.random.seed(0)
...: m,n = 1000,100
...: pnp_array = np.random.rand(m,n)
...:
...: length = 6
...: sigma = 0.3
...: offset = 0.5
...:
...: asize = length - 1
...: m = np.floor(offset * asize)
...: s = length / sigma
...: dss = 2 * s * s
...:
In [471]: %timeit original_app(pnp_array, length, m, dss)
...: %timeit vectorized_app1(pnp_array, length, m, dss)
...: %timeit vectorized_app2(pnp_array, length, m, dss)
...:
10 loops, best of 3: 36.1 ms per loop
1000 loops, best of 3: 1.84 ms per loop
1000 loops, best of 3: 684 µs per loop
In [472]: np.random.seed(0)
...: m,n = 10000,1000 # rest same as previous one
In [473]: %timeit original_app(pnp_array, length, m, dss)
...: %timeit vectorized_app1(pnp_array, length, m, dss)
...: %timeit vectorized_app2(pnp_array, length, m, dss)
...:
1 loop, best of 3: 503 ms per loop
1 loop, best of 3: 222 ms per loop
10 loops, best of 3: 106 ms per loop
I have the following operation :
import numpy as np
x = np.random.rand(3,5,5)
w = np.random.rand(5,5)
y=np.zeros((3,5,5))
for i in range(3):
    y[i] = np.dot(w.T, np.dot(x[i], w))
This corresponds to the pseudo-expression y[m,i,j] = sum( w[k,i] * x[m,k,l] * w[l,j], axes=[k,l] ), or equivalently simply the dot product of w.T, x and w, broadcast over the first dimension of x.
How can I implement it with numpy's broadcasting rules?
Thanks in advance.
Here's one vectorized approach with np.tensordot which should be better than broadcasting + summation any day -
# Take care of "np.dot(x[i],w)" term
x_w = np.tensordot(x,w,axes=((2),(0)))
# Perform "np.dot(w.T,np.dot(x[i],w))" : "np.dot(w.T,x_w)"
y_out = np.tensordot(x_w,w,axes=((1),(0))).swapaxes(1,2)
Alternatively, all of the mess can be taken care of with one np.einsum call, though it could be slower -
y_out = np.einsum('ab,cae,eg->cbg',w,x,w)
Runtime test -
In [114]: def tensordot_app(x, w):
...:     x_w = np.tensordot(x,w,axes=((2),(0)))
...:     return np.tensordot(x_w,w,axes=((1),(0))).swapaxes(1,2)
...:
...: def einsum_app(x, w):
...:     return np.einsum('ab,cae,eg->cbg',w,x,w)
...:
In [115]: x = np.random.rand(30,50,50)
...: w = np.random.rand(50,50)
...:
In [116]: %timeit tensordot_app(x, w)
1000 loops, best of 3: 477 µs per loop
In [117]: %timeit einsum_app(x, w)
1 loop, best of 3: 219 ms per loop
Giving the broadcasting a chance
The sum-notation was -
y[m,i,j] = sum( w[k,i] * x[m,k,l] * w[l,j], axes=[k,l] )
Thus, the three terms would be stacked for broadcasting, like so -
w : [ N x k x i x N x N]
x : [ m x k x N x l x N]
w : [ N x N x N x l x j]
Here, N represents a new axis appended to facilitate broadcasting along those dims.
The terms with new axes being added with None/np.newaxis would then look like this -
w : w[None, :, :, None, None]
x : x[:, :, None, :, None]
w : w[None, None, None, :, :]
Thus, the broadcasted product would be -
p = w[None,:,:,None,None]*x[:,:,None,:,None]*w[None,None,None,:,:]
Finally, the output would be a sum-reduction to lose (k,l), i.e. axes=(1,3) -
y = p.sum((1,3))
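A compact check (a sketch) that the tensordot, einsum and broadcasting formulations all agree with the original loop on small random inputs, reusing tensordot_app and einsum_app from the timing session above:
x = np.random.rand(3, 5, 5)
w = np.random.rand(5, 5)

y_loop = np.array([np.dot(w.T, np.dot(xi, w)) for xi in x])
y_td = tensordot_app(x, w)
y_es = einsum_app(x, w)
p = w[None, :, :, None, None] * x[:, :, None, :, None] * w[None, None, None, :, :]
y_bc = p.sum((1, 3))

assert np.allclose(y_loop, y_td)
assert np.allclose(y_loop, y_es)
assert np.allclose(y_loop, y_bc)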
Following up on this answer by jorgeca:
def patchify(img, patch_shape):
    img = np.ascontiguousarray(img)  # won't make a copy if not needed
    X, Y = img.shape
    x, y = patch_shape
    shape = ((X-x+1), (Y-y+1), x, y)  # number of patches, patch_shape
    # The right strides can be worked out by:
    # 1) Thinking of `img` as a chunk of memory in C order
    # 2) Asking how many items through that chunk of memory are needed when indices
    #    i,j,k,l are incremented by one
    strides = img.itemsize * np.array([Y, 1, Y, 1])
    return np.lib.stride_tricks.as_strided(img, shape=shape, strides=strides)
How can those overlapping arrays be merged back again to the original image?
Approach #1
Here's one approach after converting the 4D array of patches into 2D and then simply slicing and stacking the leftover rows and columns -
def unpatchify(img_patches, block_size):
    B0, B1 = block_size
    N = np.prod(img_patches.shape[1::2])
    patches2D = img_patches.transpose(0,2,1,3).reshape(-1, N)
    m, n = patches2D.shape
    row_mask = np.zeros(m, dtype=bool)
    col_mask = np.zeros(n, dtype=bool)
    row_mask[::B0] = 1
    col_mask[::B1] = 1
    row_mask[-B0:] = 1
    col_mask[-B1:] = 1
    return patches2D[np.ix_(row_mask, col_mask)]
Sample run -
In [233]: img = np.random.randint(0,255,(16,25))
...: block_size = (4,8)
...:
In [234]: np.allclose(img, unpatchify(patchify(img, block_size), block_size))
Out[234]: True
Approach #2
In the previous approach, the transpose on the big 4D array forces a copy at the subsequent reshape, and that copy might prove costly. To avoid that, here's another approach making heavy usage of slicing -
def unpatchify_v2(img_patches, block_size):
    B0, B1 = block_size
    m, n, r, q = img_patches.shape
    shp = m + r - 1, n + q - 1

    p1 = img_patches[::B0, ::B1].swapaxes(1,2)
    p1 = p1.reshape(-1, p1.shape[2]*p1.shape[3])
    p2 = img_patches[:, -1, 0, :]
    p3 = img_patches[-1, :, :, 0].T
    p4 = img_patches[-1, -1]

    out = np.zeros(shp, dtype=img_patches.dtype)
    out[:p1.shape[0], :p1.shape[1]] = p1
    out[:p2.shape[0], -p2.shape[1]:] = p2
    out[-p3.shape[0]:, :p3.shape[1]] = p3
    out[-p4.shape[0]:, -p4.shape[1]:] = p4
    return out
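A quick round-trip check for the slicing-based version too, reusing the sample sizes from the earlier run (a sketch):
img = np.random.randint(0, 255, (16, 25))
block_size = (4, 8)
assert np.allclose(img, unpatchify_v2(patchify(img, block_size), block_size))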
Runtime test
In [16]: img = np.random.randint(0,255,(1024,1024))
...: block_size = (3,3)
...: img_patches = patchify(img, block_size)
...:
In [17]: %timeit unpatchify(img_patches, block_size)
...: %timeit unpatchify_v2(img_patches, block_size)
10 loops, best of 3: 22.9 ms per loop
100 loops, best of 3: 2.25 ms per loop
In [18]: img = np.random.randint(0,255,(1024,1024))
...: block_size = (8,8)
...: img_patches = patchify(img, block_size)
...:
In [19]: %timeit unpatchify(img_patches, block_size)
...: %timeit unpatchify_v2(img_patches, block_size)
...:
10 loops, best of 3: 114 ms per loop
1000 loops, best of 3: 1.5 ms per loop
I'd like to speed up the following calculations handling r rays and n spheres. Here is what I got so far:
# shape of mu1 and mu2 is (r, n)
# shape of rays is (r, 3)
# note that intersections has 2n columns because for every sphere one can
# get up to two intersections (secant, tangent, no intersection)
intersections = np.empty((r, 2*n, 3))
for col in range(n):
    intersections[:, col, :] = rays * mu1[:, col][:, np.newaxis]
    intersections[:, col + n, :] = rays * mu2[:, col][:, np.newaxis]
# [...]
# calculate euclidean distance from the center of gravity (0,0,0)
distances = np.empty((r, 2 * n))
for col in range(n):
    distances[:, col] = np.linalg.norm(intersections[:, col], axis=1)
    distances[:, col + n] = np.linalg.norm(intersections[:, col + n], axis=1)
I tried speeding things up by avoiding the for-loops, but couldn't figure out how to broadcast the arrays properly so that I only need a single function call. Any help is much appreciated.
Here's a vectorized way using broadcasting -
intersections = np.hstack((mu1,mu2))[...,None]*rays[:,None,:]
distances = np.sqrt((intersections**2).sum(2))
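For clarity, the shape bookkeeping behind that one-liner, with some illustrative intermediate names (a sketch; the comments just restate what the broadcasting does):
mu = np.hstack((mu1, mu2))          # (r, 2*n)
scale = mu[..., None]               # (r, 2*n, 1)
directions = rays[:, None, :]       # (r, 1, 3)
intersections = scale * directions  # broadcasts to (r, 2*n, 3)
distances = np.sqrt((intersections**2).sum(2))   # (r, 2*n)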
The last step could be replaced with np.einsum, like so -
distances = np.sqrt(np.einsum('ijk,ijk->ij',intersections,intersections))
Or replace almost the whole thing with np.einsum for another vectorized way, like so -
mu = np.hstack((mu1,mu2))
distances = np.sqrt(np.einsum('ij,ij,ik,ik->ij',mu,mu,rays,rays))
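Written out, that einsum computes mu[i,j]**2 times the squared length of ray i, so the distances factorize into |mu| times the ray norms. A sketch of that equivalent form (not part of the benchmarks below):
mu = np.hstack((mu1, mu2))
distances = np.abs(mu) * np.linalg.norm(rays, axis=1)[:, None]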
Runtime tests and verify outputs -
def original_app(mu1, mu2, rays):
    r, n = mu1.shape
    intersections = np.empty((r, 2*n, 3))
    for col in range(n):
        intersections[:, col, :] = rays * mu1[:, col][:, np.newaxis]
        intersections[:, col + n, :] = rays * mu2[:, col][:, np.newaxis]

    distances = np.empty((r, 2 * n))
    for col in range(n):
        distances[:, col] = np.linalg.norm(intersections[:, col], axis=1)
        distances[:, col + n] = np.linalg.norm(intersections[:, col + n], axis=1)
    return distances

def vectorized_app1(mu1, mu2, rays):
    intersections = np.hstack((mu1, mu2))[..., None] * rays[:, None, :]
    return np.sqrt((intersections**2).sum(2))

def vectorized_app2(mu1, mu2, rays):
    intersections = np.hstack((mu1, mu2))[..., None] * rays[:, None, :]
    return np.sqrt(np.einsum('ijk,ijk->ij', intersections, intersections))

def vectorized_app3(mu1, mu2, rays):
    mu = np.hstack((mu1, mu2))
    return np.sqrt(np.einsum('ij,ij,ik,ik->ij', mu, mu, rays, rays))
Timings -
In [101]: # Inputs
...: r = 1000
...: n = 1000
...: mu1 = np.random.rand(r, n)
...: mu2 = np.random.rand(r, n)
...: rays = np.random.rand(r, 3)
In [102]: np.allclose(original_app(mu1,mu2,rays),vectorized_app1(mu1,mu2,rays))
Out[102]: True
In [103]: np.allclose(original_app(mu1,mu2,rays),vectorized_app2(mu1,mu2,rays))
Out[103]: True
In [104]: np.allclose(original_app(mu1,mu2,rays),vectorized_app3(mu1,mu2,rays))
Out[104]: True
In [105]: %timeit original_app(mu1,mu2,rays)
...: %timeit vectorized_app1(mu1,mu2,rays)
...: %timeit vectorized_app2(mu1,mu2,rays)
...: %timeit vectorized_app3(mu1,mu2,rays)
...:
1 loops, best of 3: 306 ms per loop
1 loops, best of 3: 215 ms per loop
10 loops, best of 3: 140 ms per loop
10 loops, best of 3: 136 ms per loop