I have an input matrix A of size I*J
And an output matrix B of size N*M
And some precalculated map of size N*M*2 that dictates for each coordinate in B, which coordinate in A to take. The map has no specific rule or linearity that I can use. Just a map that seems random.
The matrices are pretty big (~5000*~3000) so creating a mapping matrix is out of the question (5000*3000*5000*3000)
I managed to do it using a simple map and loop:
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
And I managed to do it using indexing:
B[coords_y, coords_x] = A[some_mapping[:, 0], some_mapping[:, 1]]
# Where coords_x, coords_y are defined as all of the coordinates:
# [[0,0],[0,1]..[0,M-1],[1,0],[1,1]...[N-1,M-1]]
This works much better, but still kind of slow.
I have infinite time in advance to calculate the mapping or any other utility calculation. But after these precalculations, this mapping should happen as fast as possible.
Currently, the only other option that I see is just to reimplement this in C or something faster...
(Just to make it clear if someone is curious, I'm creating an image out of some other, differently shaped and oriented image with some encoding. But its' mapping is very complicated and not something simple or linear that can be used)
If you have infinity time for precomputing you can get a slight speedup by going to flat indexing:
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
Then simply do:
A.ravel()[map_f]
Please note that this speedup is on top of the large speedup we get from fancy indexing. For example:
>>> A = np.random.random((5000, 3000))
>>> mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
>>>
>>> map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
>>>
>>> np.all(A.ravel()[map_f] == A[mapping[..., 0], mapping[..., 1]])
True
>>>
>>> timeit('A[mapping[:, :, 0], mappping[:, :, 1]]', globals=globals(), number=10)
4.101239089999581
>>> timeit('A.ravel()[map_f]', globals=globals(), number=10)
2.7831342950012186
If we were to compare to the original loopy code, the speedup would be more like ~40x.
Finally, note that this solution does not only avoid the additional dependency and potential installation nightmare that is numba, but is also simpler, shorter and faster:
numba:
precomp: 132.957 ms
main 238.359 ms
flat indexing:
precomp: 76.223 ms
main: 219.910 ms
Code:
import numpy as np
from numba import jit
#jit
def fast(A, B, mapping):
N, M = B.shape
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
return B
from timeit import timeit
A = np.random.random((5000, 3000))
mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
a = np.random.random((5, 3))
m = np.random.randint(0, 15, (5, 3, 2)) % [5, 3]
print('numba:')
print(f"precomp: {timeit('b = fast(a, np.empty_like(a), m)', globals=globals(), number=1)*1000:10.3f} ms")
print(f"main {timeit('B = fast(A, np.empty_like(A), mapping)', globals=globals(), number=10)*100:10.3f} ms")
print('\nflat indexing:')
print(f"precomp: {timeit('map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)', globals=globals(), number=10)*100:10.3f} ms")
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
print(f"main: {timeit('B = A.ravel()[map_f]', globals=globals(), number=10)*100:10.3f} ms")
One very nice solution to these types of performance critical problems is to keep it simple and utilize one of the high performance packages. The easiest might be Numba which provides the jit decorator that compiles array and loop heavy code to optimized LLVM. Below is a full example:
from time import time
import numpy as np
from numba import jit
# Function doing the computation
def normal(A, B, mapping):
N, M = B.shape
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
return B
# The same exact function, but with the Numba jit decorator
#jit
def fast(A, B, mapping):
N, M = B.shape
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
return B
# Create sample data
def create_sample_data(I, J, N, M):
A = np.random.random((I, J))
B = np.empty((N, M))
mapping = np.asarray(np.stack((
np.random.random((N, M))*I,
np.random.random((N, M))*J,
), axis=2), dtype=int)
return A, B, mapping
A, B, mapping = create_sample_data(500, 600, 700, 800)
# Run normally
t0 = time()
B = normal(A, B, mapping)
t1 = time()
print('normal took', t1 - t0, 'seconds')
# Run using Numba.
# First we should run the function with smaller arrays,
# just to compile the code.
fast(*create_sample_data(5, 6, 7, 8))
# Now, run with real data
t0 = time()
B = fast(A, B, mapping)
t1 = time()
print('fast took', t1 - t0, 'seconds')
This uses your own looping solution, which is inherently slow using standard Python, but as fast as C when using Numba. On my machine the normal function executes in 0.270 seconds, while the fast function executes in 0.00248 seconds. That is, Numba gives us a 109x speedup (!) pretty much for free.
Note that the fast Numba function is called twice, first with small input arrays and only then with the real data. This is a critical step which is often neglected. Without it, you will find that the performance increase is not nearly as good, as the first call is used to compile the code. The types and dimensions of the input arrays should be the same in this initial call, but the size in each dimension is not important.
I create B outside of the function(s) and passed it as an argument (to be "filled with values"). You might just as well allocate B inside of the function, Numba does not care.
The easiest way to get Numba is properly via the Anaconda distribution.
One option would be to use numba, which can often provide substantial improvements in this kind of simple algorithmic code.
import numpy as np
from numba import njit
I, J = 5000, 5000
N, M = 3000, 3000
A = np.random.randint(0, 10, [I, J])
B = np.random.randint(0, 10, [N, M])
mapping = np.dstack([np.random.randint(0, I - 1, (N, M)),
np.random.randint(0, J - 1, (N, M))])
B0 = B.copy()
def orig(A, B, mapping):
for i in range(N):
for j in range(M):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
new = njit(orig)
which gives us matching results:
In [313]: Bold = B0.copy()
In [314]: orig(A, Bold, mapping)
In [315]: Bnew = B0.copy()
In [316]: new(A, Bnew, mapping)
In [317]: (Bold == Bnew).all()
Out[317]: True
and is much faster:
In [320]: %time orig(A, B0.copy(), mapping)
Wall time: 6.11 s
In [321]: %time new(A, B0.copy(), mapping)
Wall time: 257 ms
and faster still after the first call, when it has to do its jit work:
In [322]: %time new(A, B0.copy(), mapping)
Wall time: 171 ms
In [323]: %time new(A, B0.copy(), mapping)
Wall time: 163 ms
for a 30x improvement for adding two lines of code.
The most straightforward optimization you can do is drop the native python loops and use fancy numpy indexing. You already have the array to do that:
import numpy as np
A = np.random.rand(2000,3000)
B = np.empty((2500,3500)) # just for shape, really
# this is the same as your original, but with random indices
mapping = np.stack([np.random.randint(0, A.shape[0] - 1, B.shape),
np.random.randint(0, A.shape[1] - 1, B.shape)],
axis=-1)
# your loopy original
def loopy(A, B, mapping):
B = B.copy()
for i in range(B.shape[0]):
for j in range(B.shape[1]):
B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
return B
# vectorization with fancy indexing
def fancy(A, mapping):
return A[mapping[...,0], mapping[...,1]]
Note that the fancy advanced-indexing function doesn't need preallocation of a B array, as a new array is constructed by the indexing operation.
There's a slight variation of the fancy indexing version which could be marginally more efficient: put your last dimension of mapping first, in this way both indexing arrays are contiguous blocks of memory. It turns out from my timing test that this happens to be slower in the above setup. Anyway:
mapping_T = mapping.transpose(2, 0, 1).copy() # but it's actually `mapping` without axis=-1 kwarg
# has shape (2, N, M)
def fancy_T(A, mapping_T):
return A[tuple(mapping_T)]
As Paul Panzer noted in a comment, just calling .transpose on mapping will not create a copy, but rather implement the transpose using stride tricks. In order to end up with a contiguous array (which is the point of the optimization) we need to force the creation of a copy.
I get the following timings in ipython:
# loopy(A, B, mapping)
6.63 s ± 141 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy(A, mapping)
250 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy_T(A, mapping_T)
277 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To be honest I don't understand why the original array order is faster compared to the transposed, but there's that.
Related
I would like to generate a numpy array by performing a sum of indexed values from another array
For for example, given the following arrays:
row_indices = np.array([[1, 1, 1], [0, 1, 1]])
col_indices = np.array([[0, 0, 1], [1, 1, 1]])
values = np.array([[2, 2, 3], [2, 4, 4]])
I would like so set a new array indexed_sum in the following way:
for i in range(row_indices.size):
indexed_sum[row_indices.flat[i], col_indices.flat[i]] += values.flat[i]
Such that:
indexed_sum = np.array([[0, 2], [4, 11]])
However, since this is a python loop and these arrays can be very large, this takes an unacceptable amount of time. Is there an efficient numpy method that I can use to accomplish this?
You might find success with numba, another Python package. I timed the following two functions in a Jupyter notebook with %timeit. Results are below:
import numba
import numpy as np
# Your loop, but in a function.
def run_sum(row_indicies, col_indicies, values):
indexed_sum = np.zeros((row_indicies.max() + 1, col_indicies.max() + 1))
for i in range(row_indicies.size):
indexed_sum[row_indicies.flat[i], col_indicies.flat[i]] += values.flat[i]
return indexed_sum
# Your loop with a numba decorator.
#numba.jit(nopython=True) # note you may be able to parallelize too
def run_sum_numba(row_indicies, col_indicies, values):
indexed_sum = np.zeros((row_indicies.max() + 1, col_indicies.max() + 1))
for i in range(row_indicies.size):
indexed_sum[row_indicies.flat[i], col_indicies.flat[i]] += values.flat[i]
return indexed_sum
My example data to have something bigger to chew on:
row_id_big = np.random.randint(0, 100, size=(1000,))
col_id_big = np.random.randint(0, 100, size=(1000,))
values_big = np.random.randint(0, 10, size=(1000,))
Results:
%timeit run_sum(row_id_big, col_id_big, values_big)
# 1.04 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit run_sum_numba(row_id_big, col_id_big, values_big)
# 3.85 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The loop with the numba decorator is a couple hundred times faster in this example. I'm not positive about the memory usage compared to your example. I had to initialize a numpy array to have somewhere to put the data, but if you have a better way of doing that step you might be able to improve performance further.
A note with numba: you need to run your loop once to start seeing the major speed improvements. You might be able to initialize the jit with just a toy example like yours here and see the same speedup.
Since the tradeoff between speed and memmory-ussage I think your method is well situable. But you can still make it faster:
avoid flattening the arrays inside the loop this will save you some time
instead of using .flat[:] or .flatten() use .ravel() (I'm not sure why but it seems to be faster)
also avoid for i in range.. just zip in the values of interest (see method3)
Here a good solution that will speed-up things:
r_i_f = row_indices.ravel()
c_i_f = col_indices.ravel()
v_f = values.ravel()
indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
for i,j,v in zip(r_i_f,c_i_f,v_f):
indexed_sum[i, j] += v
return indexed_sum
To see a comparision here's some toy code (correct any detail it's not proportioned and let me know if it works well for you)
def method1(values,row_indices,col_indices):
"""OP method"""
indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
for i in range(row_indices.size):
indexed_sum[row_indices.flat[i], col_indices.flat[i]] += values.flat[i]
return indexed_sum
def method2(values,row_indices,col_indices):
"""just raveling before loop. Time saved here is considerable"""
r_i_f = row_indices.ravel()
c_i_f = col_indices.ravel()
v_f = values.ravel()
indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
for i in range(row_indices.size):
indexed_sum[r_i_f[i], c_i_f[i]] += v_f[i]
return indexed_sum
def method3(values,row_indices,col_indices):
"""raveling, then avoiding range(...), just zipping
the time saved here is small but by no cost"""
r_i_f = row_indices.ravel()
c_i_f = col_indices.ravel()
v_f = values.ravel()
indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
for i,j,v in zip(r_i_f,c_i_f,v_f):
indexed_sum[i, j] += v
return indexed_sum
from time import perf_counter
import numpy as np
out_size = 50
in_shape = (5000,5000)
values = np.random.randint(10,size=in_shape)
row_indices = np.random.randint(out_size,size=in_shape)
col_indices = np.random.randint(out_size,size=in_shape)
t1 = perf_counter()
v1 = method1(values,row_indices,col_indices)
t2 = perf_counter()
v2 = method2(values,row_indices,col_indices)
t3 = perf_counter()
v3 = method3(values,row_indices,col_indices)
t4 = perf_counter()
print(f"method1: {t2-t1}")
print(f"method2: {t3-t2}")
print(f"method3: {t4-t3}")
Outputs for values of shape 5000x5000 and output shaped as 50x50:
method1: 23.66934896100429
method2: 14.241692076990148
method3: 11.415708078013267
aditional a comparison between fltten methods (in my computer)
q = np.random.randn(5000,5000)
t1 = perf_counter()
q1 = q.flatten()
t2 = perf_counter()
q2 = q.ravel()
t3 = perf_counter()
q3 = q.reshape(-1)
t4 = perf_counter()
q4 = q.flat[:]
t5 = perf_counter()
#print times:
print(f"q.flatten: {t2-t1}")
print(f"q.ravel: {t3-t2}")
print(f"q.reshape(-1): {t4-t3}")
print(f"q.flat[:]: {t5-t4}")
Outputs:
q.flatten: 0.043878231997950934
q.ravel: 5.550700007006526e-05
q.reshape(-1): 0.0006349250033963472
q.flat[:]: 0.08832104799512308
There's a lot of options for this, but they're all reinventing a wheel that kinda already exists.
import numpy as np
from scipy import sparse
row_indices = np.array([[1, 1, 1], [0, 1, 1]])
col_indices = np.array([[0, 0, 1], [1, 1, 1]])
values = np.array([[2, 2, 3], [2, 4, 4]])
What you want is the built-in behavior for the scipy sparse matrices:
arr = sparse.coo_matrix((values.flat, (row_indices.flat, col_indices.flat)))
Which yields a sparse data structure:
>>> arr
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in COOrdinate format>
But you can convert it back to a numpy array easily:
>>> arr.A
array([[ 0, 2],
[ 4, 11]])
There are some good answers here, but in the end I cheated and wrote an extension module method using the numpy C API, which runs in the trivial time that I wanted.
The code is precisely as boring as one would expect, but since an answer would seem incomplete without some, here is the core of it. It does make some unfortunate assumptions about typing that I mean to fill in with time.
int* row_data = PyArray_DATA(row_indices);
int* col_data = PyArray_DATA(col_indices);
double* value_data = PyArray_DATA(values);
double* output_data = PyArray_DATA(sum_obj);
for(int i = 0; i < input_rows; ++i)
{
for(int j = 0; j < input_cols; ++j)
{
long output_row = row_data[i*input_cols+j];
long output_col = col_data[i*input_cols+j];
output_data[output_row*out_col_count+output_col] += value_data[i*input_cols+j];
}
}
A matrix/vector multiplication looks like that and works fine:
A = np.array([[1, 0],
[0,1]]) # matrix
u = np.array([[3],[9]]) # column vector
U = np.dot(A,u) # U = A*u
U
Is there a way to keep the same structure if the elements Xx,Xy,Yx,Yy, u,v are 2- or 3-dimensional arrays each ?
A = np.array([[Xx, Xy],
[Yx,Yy]])
u = np.array([[u],[v]])
U = np.dot(A,u)
The desired result is that and works fine (so the answer to my question is rather 'nice to have'):
U = Xx*u + Xy*v
V = Yx*u + Yy*v
Succes Story
THX to #loopy walt and to #Ivan !
1. it works :
np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T#A.T).T , both reproduce the correct contravariant coordinate transformation.
2. minor issue :
np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T#A.T).T , both return an array with a pair of square brackets too much (it should be 2-dimensional). See #Ivan 's answer why:
array([[[ 0, 50, 200],
[ 450, 800, 1250],
[1800, 2450, 3200]]])
3. to optimize : building the block matrix is time consuming:
A = np.array([[Xx, Xy],
[Yx,Yy]])
w = np.array([[u],[v]])
So the overall return time is for np.einsum('ijmn,jkmn->ikmn', A, w) and (w.T#A.T).T lager. Perhaps can this be optimized.
- p_einstein np.einsum('ijmn,jkmn->ikmn', A, w) 9.07 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_tensor: (w.T#A.T).T 7.89 µs ± 225 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_vect: [Xx*u1 + Xy*v1, Yx*u1 + Yy*v1] 2.63 µs ± 173 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
- p_vect: including A = np.array([[Xx, Xy],[Yx,Yy]]) 7.66 µs ± 290 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is essentially a matrix block operation which you can generalize from a simple matrix operation. Your example is good because you defined your block matrices as such (I renamed them A_p and u_p to not conflict with matrix u):
A_p = np.array([[Xx, Xy],
[Yx, Yy]])
u_p = np.array([[u], [v]])
If you look at a standard matrix operation, in terms of shapes you have (r, s) meets (s, t) which ultimately results in a matrix of shape (r, t). The 's' dimension gets reduced in the process. If you're not yet familiar with the Einstein Summation, and np.einsum, you should take a quick look. Given matrices M and N, the matrix operation would look like this:
out = np.zeros((r, t))
for i in range(r):
for j in range(s):
for k in range(t):
out[i, k] += M[i, j]*N[j, k]
This becomes very intuitive when you come to realize how the operation takes place: covering the entire multi-dimension space, we accumulate products on the output matrix.
With np.einsum, this would very similar indeed (here M=A, and N=u):
>>> np.einsum('ij,jk->ik', A, u)
array([[3],
[9]])
It is useful to think with coordinates: i, j, and k rather than dimension sizes: r, s, and t respectively, just like when using np.einsum above. So I'll keep the former notation for the rest of the answer.
Now looking at the block matrices, input matrices contain blocks of matrices themselves. Matrix M is shape (i, j, (m, n)) (i.e. (i, j, m, n)) where (k, l) refers to the shape of the blocks. Similarly, for N we have (j, k, m, n). We've basically replaced the scalar shaped (,) with a 2x2 block (m, n), that's all.
Therefore you can directly infer that the A_p*u_p operation as being:
>>> np.einsum('ijmn,jkmn->ikmn', A_p, u_p)
array([[[[ 0, 50],
[200, 450]]],
[[[ 0, 110],
[440, 990]]]])
Which you can compare with the 'hand-made' result:
>>> np.stack((Xx*u + Xy*v, Yx*u + Yy*v))[:, None]
Notice this can be used for any (m, n) block and any matrix size.
You can now try implementing A_p*u_p with a loop just like A*u.
Edit: Expanding on #loopy walt's answer who showed you can achieve the same result with:
>>> (u_p.T # A_p.T).T
Indeed __matmul__ can handle multi-dimensional arrays, and will do so in the following manner (for a number of dimensions higher than 2): both inputs will be treated as a stack of matrices residing in the last two indexes. The inputs shapes are (i, j, m, n), and (j, k, m, n). In order to use __matmul__, we need to pull the block dimensions - namely (m, n) - in first positions, and leave the matrix dimensions - (i, j) and (j, k) respectively - last. The following steps need to be taken
Transposing will have the effect of inverting the order of dimensions: (n, m, j, i) and (n, m, k, j) respectively.
We then swap the two operands, we manage to compute something in the form of (*, k, j) x (*, j, i) = (*, k, i), where * = (n, m).
Transposing the outputs leads to a shape of (i, k, m, n).
Borrowing A_p and u_p from #Ivan I'd like to advertise the following much simpler approach that makes use of __matmul__'s inbuilt broadcasting. All we need to do is reverse the order of dimensions:
(u_p.T#A_p.T).T
This gives the same result as Ivan's einsum but is a bit more flexible giving us full broadcasting without us having to figure out the corresponding format string every time.
np.allclose(np.einsum('ijmn,jkmn->ikmn', A_p, u_p),(u_p.T#A_p.T).T)
# True
I have an 8x8x25000 array W and an 8 x 25000 array r. I want to multiple each 8x8 slice of W by each column (8x1) of r and save the result in Wres, which will end up being an 8x25000 matrix.
I am accomplishing this using a for loop as such:
for i in range(0,25000):
Wres[:,i] = np.matmul(W[:,:,i],res[:,i])
But this is slow and I am hoping there is a quicker way to accomplish this.
Any ideas?
Matmul can propagate as long as the 2 arrays share the same 1 axis length. From the docs:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, you have to perform 2 operations prior to matmul:
import numpy as np
a = np.random.rand(8,8,100)
b = np.random.rand(8, 100)
transpose a and b so that the first axis are the 100 slices
add an extra dimension to b so that b.shape = (100, 8, 1)
Then:
at = a.transpose(2, 0, 1) # swap to shape 100, 8, 8
bt = b.T[..., None] # swap to shape 100, 8, 1
c = np.matmul(at, bt)
c is now 100, 8, 1, reshape back to 8, 100:
c = np.squeeze(c).swapaxes(0, 1)
or
c = np.squeeze(c).T
And last, a one-liner just for conveniende:
c = np.squeeze(np.matmul(a.transpose(2, 0, 1), b.T[..., None])).T
An alternative to using np.matmul is np.einsum, which can be accomplished in 1 shorter and arguably more palatable line of code with no method chaining.
Example arrays:
np.random.seed(123)
w = np.random.rand(8,8,25000)
r = np.random.rand(8,25000)
wres = np.einsum('ijk,jk->ik',w,r)
# a quick check on result equivalency to your loop
print(np.allclose(np.matmul(w[:, :, 1], r[:, 1]), wres[:, 1]))
True
Timing is equivalent to #Imanol's solution so take your pick of the two. Both are 30x faster than looping. Here, einsum will be competitive because of the size of the arrays. With arrays larger than these, it would likely win out, and lose for smaller arrays. See this discussion for more.
def solution1():
return np.einsum('ijk,jk->ik',w,r)
def solution2():
return np.squeeze(np.matmul(w.transpose(2, 0, 1), r.T[..., None])).T
def solution3():
Wres = np.empty((8, 25000))
for i in range(0,25000):
Wres[:,i] = np.matmul(w[:,:,i],r[:,i])
return Wres
%timeit solution1()
100 loops, best of 3: 2.51 ms per loop
%timeit solution2()
100 loops, best of 3: 2.52 ms per loop
%timeit solution3()
10 loops, best of 3: 64.2 ms per loop
Credit to: #Divakar
I have two large vectors (of equal length) that I'm calculating a sliding window dot product for:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66])
out = np.array(
[[a[0]*b[0]+a[1]*b[1]+a[2]*b[2]],
[a[1]*b[1]+a[2]*b[2]+a[3]*b[3]],
[a[2]*b[2]+a[3]*b[3]+a[4]*b[4]],
[a[3]*b[3]+a[4]*b[4]+a[5]*b[5]],
])
[[154]
[319]
[550]
[847]]
Of course, I can call the dot product function but if the window/vector length is large then it is not as efficient as the following code:
window = 3
result = np.empty([4,1])
result[0] = a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
for i in range(3):
result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
[[154]
[319]
[550]
[847]]
Here, we are leveraging the fact that the i+1th dot product is similar to the ith dot product. That is,
result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
How can I convert my for loop into a vectorized function so that the computation can utilize the information from the ith step so as to reduce the computational redundancy while minimizing the amount of memory needed.
UPDATE
I actually needed:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66, 77, 88])
out = np.array(
[a[0]*b[0]+a[1]*b[1]+a[2]*b[2]+a[3]*b[3]]+a[4]*b[4]]+a[5]*b[5],
a[0]*b[1]+a[1]*b[2]+a[2]*b[3]+a[3]*b[4]]+a[4]*b[5]]+a[5]*b[6],
a[0]*b[2]+a[1]*b[3]+a[2]*b[4]+a[3]*b[5]]+a[4]*b[6]]+a[5]*b[7],
])
[1001
1232
1463]
So a would be slid across b and dot products would be calculated.
You could use partial sums for O(n) complexity:
ps = np.r_[0, np.cumsum(a*b)]
ps[3:]-ps[:-3]
# array([154, 319, 550, 847])
Or a variant that is closer to your original for loop and avoids very large partial sums:
k = 3
d = a*b
d[k:] -= d[:-k].copy()
np.cumsum(d)[k-1:]
# array([154, 319, 550, 847])
Update to match the updated Q.
This is now indeed a convolution, so #Divakar's solution more or less applies. Only, you'd convolve a[::-1] and b directly. If speed is a problem you may try and replace np.convolve with scipy.signal.fftconvolve which depending on the sizes of your operands may be significantly faster. For very small operands or operands of vastly different lengths, though, you may even lose some speed, so be sure to try both methods:
np.convolve(b, a[::-1], 'valid')
scipy.signal.fftconvolve(b, a[::-1], 'valid')
Approach #1
Use np.convolve on element-wise multiplication between the two inputs and with a kernel of all ones and size=3 -
np.convolve(a*b,np.ones(3),'valid')
Approach #2
Since we are simply summing elements in a window, we can also use uniform_filter, like so -
from scipy.ndimage.filters import uniform_filter1d as unif1d
def uniform_filter(a,W):
hW = (W-1)//2
return W*unif1d(a.astype(float),size=W, mode='constant')[hW:-hW]
out = uniform_filter(a*b,W=3)
Benchmarking
Loopy approach -
def loopy_approach(a,b):
window = 3
N = a.size-window+1
result = np.empty([N,1])
result[0] = a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
for i in range(N-1):
result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
return result
Timings and verification -
In [147]: a = np.random.randint(0,100,(1000))
...: b = np.random.randint(0,100,(1000))
...:
In [148]: out0 = loopy_approach(a,b).ravel()
...: out1 = np.convolve(a*b,np.ones(3),'valid')
...: out2 = uniform_filter(a*b,W=3)
...:
In [149]: np.allclose(out0,out1)
Out[149]: True
In [150]: np.allclose(out0,out2)
Out[150]: True
In [151]: %timeit loopy_approach(a,b)
...: %timeit np.convolve(a*b,np.ones(3),'valid')
...: %timeit uniform_filter(a*b,W=3)
...:
100 loops, best of 3: 2.27 ms per loop
100000 loops, best of 3: 7 µs per loop
100000 loops, best of 3: 10.2 µs per loop
Yet another approach using strides is:
In [12]: from numpy.lib.stride_tricks import as_strided
In [13]: def using_strides(a, b, w=3):
shape = a.shape[:-1] + (a.shape[-1] - w + 1, w)
strides = a.strides + (a.strides[-1],)
res = np.sum((as_strided(a, shape=shape, strides=strides) * \
as_strided(b, shape=shape, strides=strides)), axis=1)
return res[:, np.newaxis]
In [14]: using_strides(a, b, 3)
Out[14]:
array([[154],
[319],
[550],
[847]])
I noticed that indexing a multi dimensional array takes more time than indexing a single dimensional array
a1 = np.arange(1000000)
a2 = np.arange(1000000).reshape(1000, 1000)
a3 = np.arange(1000000).reshape(100, 100, 100)
When I index a1
%%timeit
a1[500000]
The slowest run took 39.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 84.6 ns per loop
%%timeit
a2[500, 0]
The slowest run took 31.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
%%timeit
a3[50, 0, 0]
The slowest run took 46.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 119 ns per loop
At what point should I consider an alternative way to index or slice a multi-dimensional array? What are the circumstances that make it worth the effort and loss of transparency?
One alternative to slicing an (n, m) array is to flatten the array and derive what it's one dimensional position must be.
consider a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
we can get the 2nd row, 3rd column with a[1, 2] and get 5
or we can calculate that 1 * a.shape[1] + 2 is the one dimensional position if we flatten a with order='C'
thus we can perform the equivalent slice with a.ravel()[1 * a.shape[1] + 2]
Is this efficient? No, for indexing a single number from an array, it isn't worth the trouble.
What about if we want to slice many numbers from the array? I devised the following test for a 2-D array
2-D test
from timeit import timeit
n, m = 10000, 10000
a = np.random.rand(n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
b = np.random.randint(n, size=k)
c = np.random.randint(m, size=k)
kw = dict(setup='from __main__ import a, b, c', number=100)
r.loc[k, 'Multi'] = timeit('a[b, c]', **kw)
r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] + c]', **kw)
r.div(r.sum(1), 0).plot.bar()
It appears that when slicing more than 100,000 numbers, it's better to flatten the array.
What about 3-D
3-D test
from timeit import timeit
l, n, m = 1000, 1000, 1000
a = np.random.rand(l, n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
b = np.random.randint(l, size=k)
c = np.random.randint(m, size=k)
d = np.random.randint(n, size=k)
kw = dict(setup='from __main__ import a, b, c, d', number=100)
r.loc[k, 'Multi'] = timeit('a[b, c, d]', **kw)
r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] * a.shape[2] + c * a.shape[1] + d]', **kw)
r.div(r.sum(1), 0).plot.bar()
Similar results, maybe more dramatic.
Conclusion
For 2 dimensional arrays, consider flattening and deriving flatten positions if you need to pull more than 100,000 elements from the array.
For 3 or more dimensions, it seems clear that flattening the array is almost always better.
Criticism is welcome
Did I do something wrong? Did I not think of something obvious?