I would like to generate a numpy array by performing a sum of indexed values from another array.
For example, given the following arrays:
row_indices = np.array([[1, 1, 1], [0, 1, 1]])
col_indices = np.array([[0, 0, 1], [1, 1, 1]])
values = np.array([[2, 2, 3], [2, 4, 4]])
I would like to build a new array indexed_sum in the following way:
for i in range(row_indices.size):
    indexed_sum[row_indices.flat[i], col_indices.flat[i]] += values.flat[i]
Such that:
indexed_sum = np.array([[0, 2], [4, 11]])
However, since this is a python loop and these arrays can be very large, this takes an unacceptable amount of time. Is there an efficient numpy method that I can use to accomplish this?
You might find success with numba, another Python package. I timed the following two functions in a Jupyter notebook with %timeit. Results are below:
import numba
import numpy as np
# Your loop, but in a function.
def run_sum(row_indicies, col_indicies, values):
    indexed_sum = np.zeros((row_indicies.max() + 1, col_indicies.max() + 1))
    for i in range(row_indicies.size):
        indexed_sum[row_indicies.flat[i], col_indicies.flat[i]] += values.flat[i]
    return indexed_sum
# Your loop with a numba decorator.
@numba.jit(nopython=True)  # note you may be able to parallelize too
def run_sum_numba(row_indicies, col_indicies, values):
    indexed_sum = np.zeros((row_indicies.max() + 1, col_indicies.max() + 1))
    for i in range(row_indicies.size):
        indexed_sum[row_indicies.flat[i], col_indicies.flat[i]] += values.flat[i]
    return indexed_sum
My example data to have something bigger to chew on:
row_id_big = np.random.randint(0, 100, size=(1000,))
col_id_big = np.random.randint(0, 100, size=(1000,))
values_big = np.random.randint(0, 10, size=(1000,))
Results:
%timeit run_sum(row_id_big, col_id_big, values_big)
# 1.04 ms ± 12.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit run_sum_numba(row_id_big, col_id_big, values_big)
# 3.85 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The loop with the numba decorator is a couple hundred times faster in this example. I'm not positive about the memory usage compared to your example. I had to initialize a numpy array to have somewhere to put the data, but if you have a better way of doing that step you might be able to improve performance further.
A note with numba: you need to run your loop once to start seeing the major speed improvements. You might be able to initialize the jit with just a toy example like yours here and see the same speedup.
Given the tradeoff between speed and memory usage, I think your method is well suited. But you can still make it faster:
avoid flattening the arrays inside the loop; this will save you some time
instead of using .flat[:] or .flatten(), use .ravel() (I'm not sure why, but it seems to be faster)
also avoid for i in range(...); just zip the values of interest (see method3)
Here is a solution that will speed things up:
r_i_f = row_indices.ravel()
c_i_f = col_indices.ravel()
v_f = values.ravel()
indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
for i,j,v in zip(r_i_f,c_i_f,v_f):
    indexed_sum[i, j] += v
To see a comparison, here's some toy code (correct any detail that is off, and let me know if it works well for you):
def method1(values,row_indices,col_indices):
    """OP method"""
    indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
    for i in range(row_indices.size):
        indexed_sum[row_indices.flat[i], col_indices.flat[i]] += values.flat[i]
    return indexed_sum
def method2(values,row_indices,col_indices):
    """just raveling before loop. Time saved here is considerable"""
    r_i_f = row_indices.ravel()
    c_i_f = col_indices.ravel()
    v_f = values.ravel()
    indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
    for i in range(row_indices.size):
        indexed_sum[r_i_f[i], c_i_f[i]] += v_f[i]
    return indexed_sum
def method3(values,row_indices,col_indices):
    """raveling, then avoiding range(...), just zipping
    the time saved here is small but by no cost"""
    r_i_f = row_indices.ravel()
    c_i_f = col_indices.ravel()
    v_f = values.ravel()
    indexed_sum = np.zeros((row_indices.max()+1,col_indices.max()+1))
    for i,j,v in zip(r_i_f,c_i_f,v_f):
        indexed_sum[i, j] += v
    return indexed_sum
from time import perf_counter
import numpy as np
out_size = 50
in_shape = (5000,5000)
values = np.random.randint(10,size=in_shape)
row_indices = np.random.randint(out_size,size=in_shape)
col_indices = np.random.randint(out_size,size=in_shape)
t1 = perf_counter()
v1 = method1(values,row_indices,col_indices)
t2 = perf_counter()
v2 = method2(values,row_indices,col_indices)
t3 = perf_counter()
v3 = method3(values,row_indices,col_indices)
t4 = perf_counter()
print(f"method1: {t2-t1}")
print(f"method2: {t3-t2}")
print(f"method3: {t4-t3}")
Outputs for values of shape 5000x5000 and output shaped as 50x50:
method1: 23.66934896100429
method2: 14.241692076990148
method3: 11.415708078013267
Additionally, a comparison between flatten methods (on my computer):
q = np.random.randn(5000,5000)
t1 = perf_counter()
q1 = q.flatten()
t2 = perf_counter()
q2 = q.ravel()
t3 = perf_counter()
q3 = q.reshape(-1)
t4 = perf_counter()
q4 = q.flat[:]
t5 = perf_counter()
#print times:
print(f"q.flatten: {t2-t1}")
print(f"q.ravel: {t3-t2}")
print(f"q.reshape(-1): {t4-t3}")
print(f"q.flat[:]: {t5-t4}")
Outputs:
q.flatten: 0.043878231997950934
q.ravel: 5.550700007006526e-05
q.reshape(-1): 0.0006349250033963472
q.flat[:]: 0.08832104799512308
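A likely explanation for the gap, offered as a sketch rather than a profile of the run above: on a contiguous array, ravel() and reshape(-1) return a view without copying any data, while flatten() and flat[:] always allocate and fill a new array. np.shares_memory makes the difference visible on a small, hypothetical array:

```python
import numpy as np

q = np.arange(12.0).reshape(3, 4)   # small C-contiguous example array

# True means the result is a view onto q's buffer, False means a copy.
views = {
    "ravel": np.shares_memory(q, q.ravel()),
    "reshape(-1)": np.shares_memory(q, q.reshape(-1)),
    "flatten": np.shares_memory(q, q.flatten()),
    "flat[:]": np.shares_memory(q, q.flat[:]),
}
for name, is_view in views.items():
    print(f"{name:12s} shares memory with q: {is_view}")

# For a non-contiguous input (e.g. a transpose), ravel() has to copy too.
ravel_of_transpose_is_view = np.shares_memory(q, q.T.ravel())
```

So the advantage of ravel() only holds while the input is contiguous in the requested order.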
There's a lot of options for this, but they're all reinventing a wheel that kinda already exists.
import numpy as np
from scipy import sparse
row_indices = np.array([[1, 1, 1], [0, 1, 1]])
col_indices = np.array([[0, 0, 1], [1, 1, 1]])
values = np.array([[2, 2, 3], [2, 4, 4]])
What you want is the built-in behavior for the scipy sparse matrices:
arr = sparse.coo_matrix((values.flat, (row_indices.flat, col_indices.flat)))
Which yields a sparse data structure:
>>> arr
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 6 stored elements in COOrdinate format>
But you can convert it back to a numpy array easily (on older SciPy versions arr.A does the same thing):
>>> arr.toarray()
array([[ 0, 2],
[ 4, 11]])
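If SciPy is not available, the same accumulation can be sketched with NumPy alone. This is a different technique from the sparse-matrix one above: np.add.at performs unbuffered in-place addition, and np.bincount sums weights per flat index. The output shape is assumed to be max index + 1 along each axis:

```python
import numpy as np

row_indices = np.array([[1, 1, 1], [0, 1, 1]])
col_indices = np.array([[0, 0, 1], [1, 1, 1]])
values = np.array([[2, 2, 3], [2, 4, 4]])

# Assumed output shape: one slot per index value that can occur.
shape = (row_indices.max() + 1, col_indices.max() + 1)

# Option 1: unbuffered in-place addition; repeated indices accumulate.
sum_at = np.zeros(shape, dtype=values.dtype)
np.add.at(sum_at, (row_indices.ravel(), col_indices.ravel()), values.ravel())

# Option 2: collapse (row, col) into one flat index and let bincount sum
# the weights per index. bincount returns float64 when weights are given.
flat_idx = np.ravel_multi_index((row_indices.ravel(), col_indices.ravel()), shape)
sum_bincount = np.bincount(
    flat_idx, weights=values.ravel(), minlength=shape[0] * shape[1]
).reshape(shape)

print(sum_at)
print(sum_bincount)
```

Which of the two is faster depends on the NumPy version and the data, so it is worth timing both on real inputs.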
There are some good answers here, but in the end I cheated and wrote an extension module method using the numpy C API, which runs in the trivial time that I wanted.
The code is precisely as boring as one would expect, but since an answer would seem incomplete without some, here is the core of it. It does make some unfortunate assumptions about typing that I mean to fill in with time.
int* row_data = PyArray_DATA(row_indices);
int* col_data = PyArray_DATA(col_indices);
double* value_data = PyArray_DATA(values);
double* output_data = PyArray_DATA(sum_obj);
for(int i = 0; i < input_rows; ++i)
{
    for(int j = 0; j < input_cols; ++j)
    {
        long output_row = row_data[i*input_cols+j];
        long output_col = col_data[i*input_cols+j];
        output_data[output_row*out_col_count+output_col] += value_data[i*input_cols+j];
    }
}
I have an n-by-3 index array (think of triangles indexing points) and a list of float values associated with the triangles. I now want to get for each index ("point") the minimum value, i.e., check all rows which contain the index, say, 0, and get the minimum value from vals across the respective rows:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = [
numpy.min(vals[numpy.any(a == i, axis=1)])
for i in range(6)
]
# out = numpy.array([0.1, 0.1, 0.1, 0.5, 0.3, 0.6])
This solution is inefficient because it does a full array comparison for every i.
This problem looks like a job for the at method of numpy's ufuncs, but numpy.min.at doesn't exist.
Any hints?
Approach #1
One approach based on array assignment: set up a 2D array filled with NaNs, use the values of a as column indices (so this assumes they are integers), map vals into it, and take the NaN-skipping minimum along each column for the final output -
nr,nc = len(a),a.max()+1
m = np.full((nr,nc),np.nan)
m[np.arange(nr)[:,None],a] = vals[:,None]
out = np.nanmin(m,axis=0)
Approach #2
Another one, again based on array assignment, but using masking and np.minimum.reduceat instead of dealing with NaNs -
nr,nc = len(a),a.max()+1
m = np.zeros((nc,nr),dtype=bool)
m[a.T,np.arange(nr)] = 1
c = m.sum(1)
shift_idx = np.r_[0,c[:-1].cumsum()]
out = np.minimum.reduceat(np.broadcast_to(vals,m.shape)[m],shift_idx)
Approach #3
Another based on argsort (assuming you have all integers from 0 to a.max() in a) -
sidx = a.ravel().argsort()
c = np.bincount(a.ravel())
out = np.minimum.reduceat(vals[sidx//a.shape[1]],np.r_[0,c[:-1].cumsum()])
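To make the steps of the argsort approach easier to follow, here is a sketch on the question's sample data with each intermediate named. The variable names rows, counts and starts are mine, and kind="stable" is only there to make the intermediate order reproducible:

```python
import numpy as np

a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])

flat = a.ravel()
sidx = flat.argsort(kind="stable")       # flat positions grouped by label
rows = sidx // a.shape[1]                # row each flat position came from
counts = np.bincount(flat)               # group size for every label
starts = np.r_[0, counts[:-1].cumsum()]  # offset where each group begins

# reduceat takes the minimum over each slice starts[i]:starts[i+1].
out = np.minimum.reduceat(vals[rows], starts)
print(out)
```

If some label between 0 and a.max() never occurs, two consecutive offsets coincide and reduceat returns a single element instead of a minimum over an empty group, which is why the approach assumes all labels are present.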
Approach #4
For memory efficiency and hence perf. and also to complete the set -
from numba import njit
@njit
def numba1(a, vals, out):
    m,n = a.shape
    for j in range(m):
        for i in range(n):
            e = a[j,i]
            if vals[j] < out[e]:
                out[e] = vals[j]
    return out
def func1(a, vals, outlen=None): # feed in output length as outlen if known
    if outlen is not None:
        N = outlen
    else:
        N = a.max()+1
    out = np.full(N,np.inf)
    return numba1(a, vals, out)
You may switch to pd.GroupBy or itertools.groupby if your for loop goes way beyond 6.
For instance,
r = a.ravel()  # a is the index array from the question
pd.Series(np.arange(len(r))//3).groupby(r).apply(lambda s: vals[s].min())
This solution would be faster for long loops, and probably slower for small loops (< 50)
Here is one based on this Q&A:
If you have pythran, compile
file <stb_pthr.py>
import numpy as np
#pythran export sort_to_bins(int[:], int)
def sort_to_bins(idx, mx):
    if mx==-1:
        mx = idx.max() + 1
    cnts = np.zeros(mx + 2, int)
    for i in range(idx.size):
        cnts[idx[i]+2] += 1
    for i in range(2, cnts.size):
        cnts[i] += cnts[i-1]
    res = np.empty_like(idx)
    for i in range(idx.size):
        res[cnts[idx[i]+1]] = i
        cnts[idx[i]+1] += 1
    return res, cnts[:-1]
Otherwise the script will fall back to a sparse matrix based approach which is only slightly slower:
import numpy as np
try:
    from stb_pthr import sort_to_bins
    HAVE_PYTHRAN = True
except ImportError:
    HAVE_PYTHRAN = False
from scipy.sparse import csr_matrix
def sort_to_bins_sparse(idx, mx):
    if mx==-1:
        mx = idx.max() + 1
    aux = csr_matrix((np.ones_like(idx),idx,np.arange(idx.size+1)),
                     (idx.size,mx)).tocsc()
    return aux.indices, aux.indptr
if not HAVE_PYTHRAN:
    sort_to_bins = sort_to_bins_sparse
def f_op():
    mx = a.max() + 1
    return np.fromiter((np.min(vals[np.any(a == i, axis=1)])
                        for i in range(mx)),vals.dtype,mx)
def f_pp():
    idx, bb = sort_to_bins(a.reshape(-1),-1)
    res = np.minimum.reduceat(vals[idx//3], bb[:-1])
    res[bb[:-1]==bb[1:]] = np.inf
    return res
def f_div_3():
    sidx = a.ravel().argsort()
    c = np.bincount(a.ravel())
    bb = np.r_[0,c.cumsum()]
    res = np.minimum.reduceat(vals[sidx//a.shape[1]],bb[:-1])
    res[bb[:-1]==bb[1:]] = np.inf
    return res
a = np.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = np.array([0.1, 0.5, 0.3, 0.6])
assert np.all(f_op()==f_pp())
from timeit import timeit
a = np.random.randint(0,1000,(10000,3))
vals = np.random.random(10000)
assert len(np.unique(a))==1000
assert np.all(f_op()==f_pp())
print("1000/1000 labels, 10000 rows")
print("op ", timeit(f_op, number=10)*100, 'ms')
print("pp ", timeit(f_pp, number=100)*10, 'ms')
print("div", timeit(f_div_3, number=100)*10, 'ms')
a = 1 + 2 * np.random.randint(0,5000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
a = 1 + 2 * np.random.randint(0,100000,(1000000,3))
vals = np.random.random(1000000)
nl = len(np.unique(a))
assert np.all(f_div_3()==f_pp())
print(f"{nl}/{a.max()+1} labels, 1000000 rows")
print("pp ", timeit(f_pp, number=10)*100, 'ms')
print("div", timeit(f_div_3, number=10)*100, 'ms')
Sample run (timings include @Divakar's approach 3 for reference):
1000/1000 labels, 10000 rows
op 145.1122640981339 ms
pp 0.7944229000713676 ms
div 2.2905819199513644 ms
5000/10000 labels, 1000000 rows
pp 113.86540920939296 ms
div 417.2476712032221 ms
100000/200000 labels, 1000000 rows
pp 158.23634970001876 ms
div 486.13436080049723 ms
UPDATE: @Divakar's latest (approach 4) is hard to beat, being essentially a C implementation. Nothing wrong with that except that jitting is not an option but a requirement here (the unjitted code is no fun to run). If one accepts that, the same can, of course, be done with pythran:
pythran -O3 labeled_min.py
file <labeled_min.py>
import numpy as np
#pythran export labeled_min(int[:,:], float[:])
def labeled_min(A, vals):
    mn = np.empty(A.max()+1)
    mn[:] = np.inf
    M,N = A.shape
    for i in range(M):
        v = vals[i]
        for j in range(N):
            c = A[i,j]
            if v < mn[c]:
                mn[c] = v
    return mn
Both give another massive speedup:
from labeled_min import labeled_min
func1(a, vals) # do not measure jitting time
print("nmb ", timeit(lambda:func1(a, vals), number=100)*10, 'ms')
print("pthr", timeit(lambda:labeled_min(a,vals), number=100)*10, 'ms')
Sample run:
nmb 8.41792532010004 ms
pthr 8.104007659712806 ms
pythran comes out a few percent faster but this is only because I moved vals lookup out of the inner loop; without that they are all but equal.
For comparison, the previously best with and without non python helpers on the same problem:
pp 114.04887529788539 ms
pp (py only) 147.0821460010484 ms
Apparently, numpy.minimum.at exists:
import numpy
a = numpy.array([
[0, 1, 2],
[2, 3, 0],
[1, 4, 2],
[2, 5, 3],
])
vals = numpy.array([0.1, 0.5, 0.3, 0.6])
out = numpy.full(6, numpy.inf)
numpy.minimum.at(out, a.reshape(-1), numpy.repeat(vals, 3))
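As a quick check, here is the same call compared against the slow per-label version from the question. This is a sketch in which the hard-coded 6 and 3 are replaced by a.max() + 1 and a.shape[1], which assumes labels start at 0:

```python
import numpy as np

a = np.array([[0, 1, 2], [2, 3, 0], [1, 4, 2], [2, 5, 3]])
vals = np.array([0.1, 0.5, 0.3, 0.6])

out = np.full(a.max() + 1, np.inf)
# Each row's value is repeated once per column so it lines up with a.ravel().
np.minimum.at(out, a.ravel(), np.repeat(vals, a.shape[1]))

# Reference result from the per-label comparison in the question.
expected = np.array(
    [vals[np.any(a == i, axis=1)].min() for i in range(a.max() + 1)]
)
print(out)
```

Labels that never occur in a keep their initial value of inf.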
I have an input matrix A of size I*J
And an output matrix B of size N*M
And some precalculated map of size N*M*2 that dictates for each coordinate in B, which coordinate in A to take. The map has no specific rule or linearity that I can use. Just a map that seems random.
The matrices are pretty big (~5000*~3000) so creating a mapping matrix is out of the question (5000*3000*5000*3000)
I managed to do it using a simple map and loop:
for i in range(N):
    for j in range(M):
        B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
And I managed to do it using indexing:
B[coords_y, coords_x] = A[some_mapping[:, 0], some_mapping[:, 1]]
# Where coords_x, coords_y are defined as all of the coordinates:
# [[0,0],[0,1]..[0,M-1],[1,0],[1,1]...[N-1,M-1]]
This works much better, but still kind of slow.
I have infinite time in advance to calculate the mapping or any other utility calculation. But after these precalculations, this mapping should happen as fast as possible.
Currently, the only other option that I see is just to reimplement this in C or something faster...
(Just to make it clear if someone is curious, I'm creating an image out of some other, differently shaped and oriented image with some encoding. But its mapping is very complicated and not something simple or linear that can be used)
If you have infinite time for precomputing, you can get a slight speedup by going to flat indexing:
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
Then simply do:
A.ravel()[map_f]
Please note that this speedup is on top of the large speedup we get from fancy indexing. For example:
>>> A = np.random.random((5000, 3000))
>>> mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
>>>
>>> map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
>>>
>>> np.all(A.ravel()[map_f] == A[mapping[..., 0], mapping[..., 1]])
True
>>>
>>> timeit('A[mapping[:, :, 0], mapping[:, :, 1]]', globals=globals(), number=10)
4.101239089999581
>>> timeit('A.ravel()[map_f]', globals=globals(), number=10)
2.7831342950012186
If we were to compare to the original loopy code, the speedup would be more like ~40x.
Finally, note that this solution does not only avoid the additional dependency and potential installation nightmare that is numba, but is also simpler, shorter and faster:
numba:
precomp: 132.957 ms
main 238.359 ms
flat indexing:
precomp: 76.223 ms
main: 219.910 ms
Code:
import numpy as np
from numba import jit
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
from timeit import timeit
A = np.random.random((5000, 3000))
mapping = np.random.randint(0, 15000, (5000, 3000, 2)) % [5000, 3000]
a = np.random.random((5, 3))
m = np.random.randint(0, 15, (5, 3, 2)) % [5, 3]
print('numba:')
print(f"precomp: {timeit('b = fast(a, np.empty_like(a), m)', globals=globals(), number=1)*1000:10.3f} ms")
print(f"main {timeit('B = fast(A, np.empty_like(A), mapping)', globals=globals(), number=10)*100:10.3f} ms")
print('\nflat indexing:')
print(f"precomp: {timeit('map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)', globals=globals(), number=10)*100:10.3f} ms")
map_f = np.ravel_multi_index((*np.moveaxis(mapping, 2, 0),), A.shape)
print(f"main: {timeit('B = A.ravel()[map_f]', globals=globals(), number=10)*100:10.3f} ms")
One very nice solution to these types of performance critical problems is to keep it simple and utilize one of the high performance packages. The easiest might be Numba which provides the jit decorator that compiles array and loop heavy code to optimized LLVM. Below is a full example:
from time import time
import numpy as np
from numba import jit
# Function doing the computation
def normal(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# The same exact function, but with the Numba jit decorator
@jit
def fast(A, B, mapping):
    N, M = B.shape
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# Create sample data
def create_sample_data(I, J, N, M):
    A = np.random.random((I, J))
    B = np.empty((N, M))
    mapping = np.asarray(np.stack((
        np.random.random((N, M))*I,
        np.random.random((N, M))*J,
        ), axis=2), dtype=int)
    return A, B, mapping
A, B, mapping = create_sample_data(500, 600, 700, 800)
# Run normally
t0 = time()
B = normal(A, B, mapping)
t1 = time()
print('normal took', t1 - t0, 'seconds')
# Run using Numba.
# First we should run the function with smaller arrays,
# just to compile the code.
fast(*create_sample_data(5, 6, 7, 8))
# Now, run with real data
t0 = time()
B = fast(A, B, mapping)
t1 = time()
print('fast took', t1 - t0, 'seconds')
This uses your own looping solution, which is inherently slow using standard Python, but as fast as C when using Numba. On my machine the normal function executes in 0.270 seconds, while the fast function executes in 0.00248 seconds. That is, Numba gives us a 109x speedup (!) pretty much for free.
Note that the fast Numba function is called twice, first with small input arrays and only then with the real data. This is a critical step which is often neglected. Without it, you will find that the performance increase is not nearly as good, as the first call is used to compile the code. The types and dimensions of the input arrays should be the same in this initial call, but the size in each dimension is not important.
I created B outside of the function(s) and passed it as an argument (to be "filled with values"). You might just as well allocate B inside of the function; Numba does not care.
The easiest way to get Numba is probably via the Anaconda distribution.
One option would be to use numba, which can often provide substantial improvements in this kind of simple algorithmic code.
import numpy as np
from numba import njit
I, J = 5000, 5000
N, M = 3000, 3000
A = np.random.randint(0, 10, [I, J])
B = np.random.randint(0, 10, [N, M])
mapping = np.dstack([np.random.randint(0, I - 1, (N, M)),
np.random.randint(0, J - 1, (N, M))])
B0 = B.copy()
def orig(A, B, mapping):
    for i in range(N):
        for j in range(M):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
new = njit(orig)
which gives us matching results:
In [313]: Bold = B0.copy()
In [314]: orig(A, Bold, mapping)
In [315]: Bnew = B0.copy()
In [316]: new(A, Bnew, mapping)
In [317]: (Bold == Bnew).all()
Out[317]: True
and is much faster:
In [320]: %time orig(A, B0.copy(), mapping)
Wall time: 6.11 s
In [321]: %time new(A, B0.copy(), mapping)
Wall time: 257 ms
and faster still after the first call, when it has to do its jit work:
In [322]: %time new(A, B0.copy(), mapping)
Wall time: 171 ms
In [323]: %time new(A, B0.copy(), mapping)
Wall time: 163 ms
for a 30x improvement for adding two lines of code.
The most straightforward optimization you can do is drop the native python loops and use fancy numpy indexing. You already have the array to do that:
import numpy as np
A = np.random.rand(2000,3000)
B = np.empty((2500,3500)) # just for shape, really
# this is the same as your original, but with random indices
mapping = np.stack([np.random.randint(0, A.shape[0] - 1, B.shape),
np.random.randint(0, A.shape[1] - 1, B.shape)],
axis=-1)
# your loopy original
def loopy(A, B, mapping):
    B = B.copy()
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            B[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]
    return B
# vectorization with fancy indexing
def fancy(A, mapping):
    return A[mapping[...,0], mapping[...,1]]
Note that the fancy advanced-indexing function doesn't need preallocation of a B array, as a new array is constructed by the indexing operation.
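To convince yourself that the two forms agree, here is a small self-contained check. The array sizes and the seeded generator are arbitrary choices of mine, kept tiny so the loop finishes instantly:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
A = rng.random((6, 7))
N, M = 4, 5
mapping = np.stack([rng.integers(0, A.shape[0], (N, M)),
                    rng.integers(0, A.shape[1], (N, M))], axis=-1)

# Reference: the explicit double loop.
B_loop = np.empty((N, M))
for i in range(N):
    for j in range(M):
        B_loop[i, j] = A[mapping[i, j, 0], mapping[i, j, 1]]

# Advanced indexing: two (N, M) index arrays produce an (N, M) result.
B_fancy = A[mapping[..., 0], mapping[..., 1]]

print(B_fancy.shape, np.array_equal(B_loop, B_fancy))
```

The result takes its shape from the index arrays, not from A, which is why no preallocated B is required.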
There's a slight variation of the fancy indexing version which could be marginally more efficient: put your last dimension of mapping first, in this way both indexing arrays are contiguous blocks of memory. It turns out from my timing test that this happens to be slower in the above setup. Anyway:
mapping_T = mapping.transpose(2, 0, 1).copy() # but it's actually `mapping` without axis=-1 kwarg
# has shape (2, N, M)
def fancy_T(A, mapping_T):
    return A[tuple(mapping_T)]
As Paul Panzer noted in a comment, just calling .transpose on mapping will not create a copy, but rather implement the transpose using stride tricks. In order to end up with a contiguous array (which is the point of the optimization) we need to force the creation of a copy.
I get the following timings in ipython:
# loopy(A, B, mapping)
6.63 s ± 141 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy(A, mapping)
250 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# fancy_T(A, mapping_T)
277 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
To be honest I don't understand why the original array order is faster compared to the transposed, but there's that.
I have two large vectors (of equal length) that I'm calculating a sliding window dot product for:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66])
out = np.array(
[[a[0]*b[0]+a[1]*b[1]+a[2]*b[2]],
[a[1]*b[1]+a[2]*b[2]+a[3]*b[3]],
[a[2]*b[2]+a[3]*b[3]+a[4]*b[4]],
[a[3]*b[3]+a[4]*b[4]+a[5]*b[5]],
])
[[154]
[319]
[550]
[847]]
Of course, I can call the dot product function but if the window/vector length is large then it is not as efficient as the following code:
window = 3
result = np.empty([4,1])
result[0] = a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
for i in range(3):
    result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
[[154]
[319]
[550]
[847]]
Here, we are leveraging the fact that the i+1th dot product is similar to the ith dot product. That is,
result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
How can I convert my for loop into a vectorized function, so that the computation can use the information from the ith step to reduce computational redundancy while minimizing the amount of memory needed?
UPDATE
I actually needed:
import numpy as np
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66, 77, 88])
out = np.array([
    a[0]*b[0]+a[1]*b[1]+a[2]*b[2]+a[3]*b[3]+a[4]*b[4]+a[5]*b[5],
    a[0]*b[1]+a[1]*b[2]+a[2]*b[3]+a[3]*b[4]+a[4]*b[5]+a[5]*b[6],
    a[0]*b[2]+a[1]*b[3]+a[2]*b[4]+a[3]*b[5]+a[4]*b[6]+a[5]*b[7],
])
[1001
1232
1463]
So a would be slid across b and dot products would be calculated.
You could use partial sums for O(n) complexity:
ps = np.r_[0, np.cumsum(a*b)]
ps[3:]-ps[:-3]
# array([154, 319, 550, 847])
Or a variant that is closer to your original for loop and avoids very large partial sums:
k = 3
d = a*b
d[k:] -= d[:-k].copy()
np.cumsum(d)[k-1:]
# array([154, 319, 550, 847])
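Why the partial-sum trick works: if ps[i] holds the sum of the first i products, then ps[i + k] - ps[i] is exactly the sum of products i through i + k - 1. A small sketch that checks this against one explicit dot product per window (the names windowed and reference are mine):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66])
k = 3

# ps[i] is the sum of the first i elementwise products (ps[0] == 0).
ps = np.r_[0, np.cumsum(a * b)]
windowed = ps[k:] - ps[:-k]

# Brute-force reference: one dot product per window.
reference = np.array([a[i:i + k] @ b[i:i + k] for i in range(len(a) - k + 1)])
print(windowed)
```

With floating-point inputs the subtraction of large partial sums can lose precision, which is what the second variant above is meant to reduce.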
Update to match the updated Q.
This is now indeed a convolution, so @Divakar's solution more or less applies. Only, you'd convolve a[::-1] and b directly. If speed is a problem, you may try replacing np.convolve with scipy.signal.fftconvolve, which, depending on the sizes of your operands, may be significantly faster. For very small operands or operands of vastly different lengths, though, you may even lose some speed, so be sure to try both methods:
np.convolve(b, a[::-1], 'valid')
scipy.signal.fftconvolve(b, a[::-1], 'valid')
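A quick check on the data from the updated question, as a sketch. np.correlate is shown as well because it does not flip its second argument, so it takes a unreversed; that is my addition, not something the question asked for:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66, 77, 88])

# Convolution flips its second argument, so passing a[::-1] undoes the flip
# and leaves a plain sliding dot product.
via_convolve = np.convolve(b, a[::-1], "valid")

# np.correlate does not flip, so it takes a as-is.
via_correlate = np.correlate(b, a, "valid")

# Brute-force reference: slide a across b one position at a time.
reference = np.array([a @ b[i:i + len(a)] for i in range(len(b) - len(a) + 1)])
print(via_convolve)
```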
Approach #1
Use np.convolve on element-wise multiplication between the two inputs and with a kernel of all ones and size=3 -
np.convolve(a*b,np.ones(3),'valid')
Approach #2
Since we are simply summing elements in a window, we can also use uniform_filter, like so -
from scipy.ndimage import uniform_filter1d as unif1d
def uniform_filter(a,W):
    hW = (W-1)//2
    return W*unif1d(a.astype(float),size=W, mode='constant')[hW:-hW]
out = uniform_filter(a*b,W=3)
Benchmarking
Loopy approach -
def loopy_approach(a,b):
    window = 3
    N = a.size-window+1
    result = np.empty([N,1])
    result[0] = a[0]*b[0]+a[1]*b[1]+a[2]*b[2]
    for i in range(N-1):
        result[i+1] = result[i]-a[i]*b[i]+a[i+window]*b[i+window]
    return result
Timings and verification -
In [147]: a = np.random.randint(0,100,(1000))
...: b = np.random.randint(0,100,(1000))
...:
In [148]: out0 = loopy_approach(a,b).ravel()
...: out1 = np.convolve(a*b,np.ones(3),'valid')
...: out2 = uniform_filter(a*b,W=3)
...:
In [149]: np.allclose(out0,out1)
Out[149]: True
In [150]: np.allclose(out0,out2)
Out[150]: True
In [151]: %timeit loopy_approach(a,b)
...: %timeit np.convolve(a*b,np.ones(3),'valid')
...: %timeit uniform_filter(a*b,W=3)
...:
100 loops, best of 3: 2.27 ms per loop
100000 loops, best of 3: 7 µs per loop
100000 loops, best of 3: 10.2 µs per loop
Yet another approach using strides is:
In [12]: from numpy.lib.stride_tricks import as_strided
In [13]: def using_strides(a, b, w=3):
    ...:     shape = a.shape[:-1] + (a.shape[-1] - w + 1, w)
    ...:     strides = a.strides + (a.strides[-1],)
    ...:     res = np.sum((as_strided(a, shape=shape, strides=strides) * \
    ...:                   as_strided(b, shape=shape, strides=strides)), axis=1)
    ...:     return res[:, np.newaxis]
In [14]: using_strides(a, b, 3)
Out[14]:
array([[154],
[319],
[550],
[847]])
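On NumPy 1.20 or later, sliding_window_view builds the same kind of windowed view without hand-computing shapes and strides. This is a sketch of that alternative, not the code the answer above used:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([11, 22, 33, 44, 55, 66])
w = 3

# Each row of the view is one length-w window; no data is copied here.
a_win = sliding_window_view(a, w)
b_win = sliding_window_view(b, w)

# The elementwise product does allocate an (n - w + 1, w) temporary.
out = (a_win * b_win).sum(axis=1)
print(out)
```

Like the as_strided version, this needs memory proportional to the number of windows times w for the product, so for long windows the cumulative-sum approach is lighter.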
I noticed that indexing a multi-dimensional array takes more time than indexing a one-dimensional array:
a1 = np.arange(1000000)
a2 = np.arange(1000000).reshape(1000, 1000)
a3 = np.arange(1000000).reshape(100, 100, 100)
When I index a1
%%timeit
a1[500000]
The slowest run took 39.17 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 84.6 ns per loop
%%timeit
a2[500, 0]
The slowest run took 31.85 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 102 ns per loop
%%timeit
a3[50, 0, 0]
The slowest run took 46.72 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 119 ns per loop
At what point should I consider an alternative way to index or slice a multi-dimensional array? What are the circumstances that make it worth the effort and loss of transparency?
One alternative to slicing an (n, m) array is to flatten the array and derive what its one-dimensional position must be.
consider a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
we can get the 2nd row, 3rd column with a[1, 2] and get 5
or we can calculate that 1 * a.shape[1] + 2 is the one dimensional position if we flatten a with order='C'
thus we can perform the equivalent slice with a.ravel()[1 * a.shape[1] + 2]
Is this efficient? No, for indexing a single number from an array, it isn't worth the trouble.
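For several positions at once, the same arithmetic can be sketched as below. np.ravel_multi_index is NumPy's helper for this conversion; the index arrays rows and cols are hypothetical:

```python
import numpy as np

a = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
rows = np.array([1, 0, 2])
cols = np.array([2, 1, 0])

# Manual C-order position: row * number_of_columns + column.
manual = rows * a.shape[1] + cols

# NumPy's helper computes the same positions for any number of dimensions.
helper = np.ravel_multi_index((rows, cols), a.shape)

print(manual, a.ravel()[manual], a[rows, cols])
```

In general the multiplier for each axis is the product of the sizes of all later axes, so for shape (l, n, m) the position of (i, j, k) is i * n * m + j * m + k.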
What about if we want to slice many numbers from the array? I devised the following test for a 2-D array
2-D test
from timeit import timeit
import numpy as np
import pandas as pd
n, m = 10000, 10000
a = np.random.rand(n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
    b = np.random.randint(n, size=k)
    c = np.random.randint(m, size=k)
    kw = dict(setup='from __main__ import a, b, c', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c]', **kw)
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] + c]', **kw)
r.div(r.sum(1), 0).plot.bar()
It appears that when slicing more than 100,000 numbers, it's better to flatten the array.
What about 3-D?
3-D test
from timeit import timeit
l, n, m = 1000, 1000, 1000
a = np.random.rand(l, n, m)
r = pd.DataFrame(index=np.power(10, np.arange(7)), columns=['Multi', 'Flat'])
for k in r.index:
    b = np.random.randint(l, size=k)
    c = np.random.randint(n, size=k)
    d = np.random.randint(m, size=k)
    kw = dict(setup='from __main__ import a, b, c, d', number=100)
    r.loc[k, 'Multi'] = timeit('a[b, c, d]', **kw)
    r.loc[k, 'Flat'] = timeit('a.ravel()[b * a.shape[1] * a.shape[2] + c * a.shape[2] + d]', **kw)
r.div(r.sum(1), 0).plot.bar()
Similar results, maybe more dramatic.
Conclusion
For 2 dimensional arrays, consider flattening and deriving flatten positions if you need to pull more than 100,000 elements from the array.
For 3 or more dimensions, it seems clear that flattening the array is almost always better.
Criticism is welcome
Did I do something wrong? Did I not think of something obvious?
I perform the cross product of contiguous segments of a trajectory (xy coordinates) using the following script:
In [129]:
def func1(xy, s):
    size = xy.shape[0]-2*s
    out = np.zeros(size)
    for i in range(size):
        p1, p2 = xy[i], xy[i+s] #segment 1
        p3, p4 = xy[i+s], xy[i+2*s] #segment 2
        out[i] = np.cross(p1-p2, p4-p3)
    return out
def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    tmp1 = p1-p2
    tmp2 = p4-p3
    return tmp1[:, 0] * tmp2[:, 1] - tmp2[:, 0] * tmp1[:, 1]
In [136]:
xy = np.array([[1,2],[2,3],[3,4],[5,6],[7,8],[2,4],[5,2],[9,9],[1,1]])
func2(xy, 2)
Out[136]:
array([ 0, -3, 16, 1, 22])
func1 is particularly slow because of the inner loop so I rewrote the cross-product myself (func2) which is orders of magnitude faster.
Is it possible to use the numpy einsum function to make the same calculation?
einsum computes sums of products only, but you could shoehorn the cross-product into a sum of products by reversing the columns of tmp2 and then changing the sign of its second column (the original first column):
def func3(xy, s):
    size = xy.shape[0]-2*s
    tmp1 = xy[0:size] - xy[s:size+s]
    tmp2 = xy[2*s:size+2*s] - xy[s:size+s]
    tmp2 = tmp2[:, ::-1]
    tmp2[:, 1] *= -1
    return np.einsum('ij,ij->i', tmp1, tmp2)
But func3 is slower than func2.
In [80]: xy = np.tile(xy, (1000, 1))
In [104]: %timeit func1(xy, 2)
10 loops, best of 3: 67.5 ms per loop
In [105]: %timeit func2(xy, 2)
10000 loops, best of 3: 73.2 µs per loop
In [106]: %timeit func3(xy, 2)
10000 loops, best of 3: 108 µs per loop
Sanity check:
In [86]: np.allclose(func1(xy, 2), func3(xy, 2))
Out[86]: True
I think the reason why func2 is beating einsum here is that the cost of setting up the loop in einsum for just 2 iterations is too high compared to manually writing out the sum, and the reversing and multiplying eat up some time as well.
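For reference, here is a sketch of the einsum formulation on the question's small trajectory, checked against the explicit formula. It builds the rotated vectors with np.stack instead of mutating a reversed view; the names u, v and v_rot are mine:

```python
import numpy as np

xy = np.array([[1, 2], [2, 3], [3, 4], [5, 6], [7, 8],
               [2, 4], [5, 2], [9, 9], [1, 1]])
s = 2
size = xy.shape[0] - 2 * s

u = xy[0:size] - xy[s:size + s]              # first segment vectors
v = xy[2 * s:size + 2 * s] - xy[s:size + s]  # second segment vectors

# Build (v_y, -v_x) so that the row-wise dot product with u = (u_x, u_y)
# equals u_x * v_y - u_y * v_x, the scalar 2-D cross product.
v_rot = np.stack([v[:, 1], -v[:, 0]], axis=1)
cross_einsum = np.einsum("ij,ij->i", u, v_rot)

# Explicit formula for comparison.
cross_manual = u[:, 0] * v[:, 1] - u[:, 1] * v[:, 0]
print(cross_einsum)
```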
np.cross is a smart little beast, that can handle broadcasting without any issue. So you can rewrite your func2 as:
def func2(xy, s):
    size = xy.shape[0]-2*s
    p1 = xy[0:size]
    p2 = xy[s:size+s]
    p3 = p2
    p4 = xy[2*s:size+2*s]
    return np.cross(p1-p2, p4-p3)
and it will produce the correct result:
>>> func2(xy, 2)
array([ 0, -3, 16, 1, 22])
In the latest numpy it will likely run a tad faster than your code, as it was rewritten to minimize intermediate array creation. You can look at the source code (pure Python) here.