Is there any quicker way to physically transpose a large 2D numpy matrix than array.transpose().copy()? And are there any routines for doing it with efficient memory use?
It may be worth looking at what transpose does, just so we are clear about what you mean by 'physically transposing'.
Start with a small (4,3) array:
In [51]: arr = np.array([[1,2,3],[10,11,12],[22,23,24],[30,32,34]])
In [52]: arr
Out[52]:
array([[ 1,  2,  3],
       [10, 11, 12],
       [22, 23, 24],
       [30, 32, 34]])
This is stored with a 1d data buffer, which we can display with ravel:
In [53]: arr.ravel()
Out[53]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
and strides which tell it to step columns by 8 bytes, and rows by 24 (3*8):
In [54]: arr.strides
Out[54]: (24, 8)
We can ravel with the "F" order - that's going down the rows:
In [55]: arr.ravel(order='F')
Out[55]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
While [53] is a view, [55] is a copy.
Now the transpose:
In [57]: arrt=arr.T
In [58]: arrt
Out[58]:
array([[ 1, 10, 22, 30],
       [ 2, 11, 23, 32],
       [ 3, 12, 24, 34]])
This is a view; we can traverse the [53] data buffer, going down the rows with 8-byte steps. Doing calculations with arrt is basically just as fast as with arr. With strided iteration, order 'F' is just as fast as order 'C'.
In [59]: arrt.strides
Out[59]: (8, 24)
Raveling with order 'F' recovers the original order:
In [60]: arrt.ravel(order='F')
Out[60]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
but doing a 'C' ravel creates a copy, with the same values as [55]:
In [61]: arrt.ravel(order='C')
Out[61]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
Copying the transpose makes an array that is the transpose stored in 'C' order. This is your 'physical transpose':
In [62]: arrc = arrt.copy()
In [63]: arrc.strides
Out[63]: (32, 8)
Raveling the transpose, as in [61], does make a copy, but usually we don't need to make the copy explicitly. I think the only reason to do so is to avoid several redundant copies in later calculations.
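As a quick way to tell a view from a physical copy, np.shares_memory and the flags attribute make the distinction explicit (a small sketch using the arrays above):

import numpy as np

arr = np.array([[1, 2, 3], [10, 11, 12], [22, 23, 24], [30, 32, 34]])
arrt = arr.T         # a view: same buffer, swapped strides
arrc = arrt.copy()   # a physical transpose: new C-ordered buffer

print(np.shares_memory(arr, arrt))  # True  -> view
print(np.shares_memory(arr, arrc))  # False -> copy
print(arrt.flags['C_CONTIGUOUS'])   # False
print(arrc.flags['C_CONTIGUOUS'])   # True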
I assume that you need to do a row-wise operation that uses the CPU cache more efficiently if rows are contiguous in memory, and you don't have enough memory available to make a copy.
Wikipedia has an article on in-place matrix transposition. It turns out that such a transposition is nontrivial. Here is a follow-the-cycles algorithm as described there:
import numpy as np
from numba import njit

@njit  # comment this line out for debugging
def transpose_in_place(a):
    """In-place matrix transposition for a rectangular matrix.
    https://stackoverflow.com/a/62507342/6228891

    Parameter:
    - a: 2D array. Unless it's a square matrix, it will be scrambled
      in the process.

    Return:
    - transposed array, using the same in-memory data storage as the
      input array.

    This algorithm is typically 10x slower than a.T.copy().
    Only use it if you are short on memory.
    """
    if a.shape == (1, 1):
        return a  # special case
    n, m = a.shape

    # find max length L of permutation cycle by starting at a[0,1].
    # k is the index in the flat buffer; i, j are the indices in a.
    L = 0
    k = 1
    while True:
        j = k % m
        i = k // m
        k = n*j + i
        L += 1
        if k == 1:
            break

    permut = np.zeros(L, dtype=np.int32)

    # Now do the permutations, one cycle at a time
    seen = np.full(n*m, False)
    aflat = a.reshape(-1)  # flat view
    for k0 in range(1, n*m-1):
        if seen[k0]:
            continue
        # construct cycle
        k = k0
        permut[0] = k0
        q = 1  # size of permutation array
        while True:
            seen[k] = True
            # note that this is slightly faster than the formula
            # on Wikipedia, k = n*k % (n*m-1)
            i = k // m
            j = k - i*m
            k = n*j + i
            if k == k0:
                break
            permut[q] = k
            q += 1
        # apply cyclic permutation
        tmp = aflat[permut[q-1]]
        aflat[permut[1:q]] = aflat[permut[:q-1]]
        aflat[permut[0]] = tmp

    aT = aflat.reshape(m, n)
    return aT
def test_transpose(n, m):
    a = np.arange(n*m).reshape(n, m)
    aT = a.T.copy()
    assert np.all(transpose_in_place(a) == aT)

def roundtrip_inplace(a):
    a = transpose_in_place(a)
    a = transpose_in_place(a)

def roundtrip_copy(a):
    a = a.T.copy()
    a = a.T.copy()

if __name__ == '__main__':
    test_transpose(1, 1)
    test_transpose(3, 4)
    test_transpose(5, 5)
    test_transpose(1, 5)
    test_transpose(5, 1)
    test_transpose(19, 29)
Even though I'm using numba.njit here so that the loops in the transpose function are compiled, it's still quite a bit slower than a copy-transpose.
n, m = 1000, 10000
a_big = np.arange(n*m, dtype=np.float64).reshape(n, m)
%timeit -r2 -n10 roundtrip_copy(a_big)
54.5 ms ± 153 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
%timeit -r2 -n1 roundtrip_inplace(a_big)
614 ms ± 141 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
Whatever you do will require O(n^2) time and memory. I would assume that .transpose and .copy (written in C) will be the most efficient choice for your application.
Edit: this assumes you actually need to copy the matrix
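For completeness, a minimal sketch of that copy route; np.ascontiguousarray(arr.T) is equivalent to arr.T.copy() here and makes the intent explicit:

import numpy as np

arr = np.arange(12, dtype=np.int64).reshape(3, 4)
arrt = np.ascontiguousarray(arr.T)  # C-contiguous physical transpose
print(arrt.flags['C_CONTIGUOUS'])   # True
print(arrt.strides)                 # (24, 8): row-major strides for a (4, 3) int64 array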
Say I have a np.array like this:
a = [1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78]
Is there a quick method to get the indices of all locations where 3 consecutive numbers are all above some threshold? That is, for some threshold th, get all x where this holds:
a[x]>th and a[x+1]>th and a[x+2]>th
Example: for threshold 40 and the list given above, x should be [4,8,9].
Many thanks.
Approach #1
Use convolution on the boolean mask obtained after comparison. Convolving with a length-N ones kernel counts how many of each N consecutive mask values are True, so the positions where the count reaches N are exactly the starts of all-above-threshold runs -
In [40]: a # input array
Out[40]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [42]: N = 3 # compare N consecutive numbers
In [44]: T = 40 # threshold for comparison
In [45]: np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
Out[45]: array([4, 8, 9])
Approach #2
Use binary_erosion -
In [77]: from scipy.ndimage.morphology import binary_erosion
In [31]: np.flatnonzero(binary_erosion(a>T,np.ones(N, dtype=int), origin=-(N//2)))
Out[31]: array([4, 8, 9])
Approach #3 (Specific case) : Small numbers of consecutive numbers check
For checking such a small number of consecutive numbers (three in this case), we can also use slicing on the compared mask for better performance -
m = a>T
out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
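Approach #3 hardcodes N=3. A small sketch (my generalization, not part of the original answer) extends the same sliced-mask idea to any N with np.logical_and.reduce:

import numpy as np

a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
N, T = 3, 40
m = a > T
# AND together the N shifted views of the mask
out = np.flatnonzero(np.logical_and.reduce([m[i:len(m) - N + 1 + i] for i in range(N)]))
print(out)  # [4 8 9]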
Benchmarking
Timings on the given sample tiled 100,000 times -
In [78]: a
Out[78]: array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [79]: a = np.tile(a,100000)
In [80]: N = 3
In [81]: T = 40
# Approach #3
In [82]: %%timeit
...: m = a>T
...: out = np.flatnonzero(m[:-2] & m[1:-1] & m[2:])
1000 loops, best of 3: 1.83 ms per loop
# Approach #1
In [83]: %timeit np.flatnonzero(np.convolve(a>T, np.ones(N, dtype=int),'valid')>=N)
100 loops, best of 3: 10.9 ms per loop
# Approach #2
In [84]: %timeit np.flatnonzero(binary_erosion(a>T,np.ones(N, dtype=int), origin=-(N//2)))
100 loops, best of 3: 11.7 ms per loop
try:
th=40
results = [ x for x in range( len( array ) -2 ) if(array[x:x+3].min() > th) ]
which is a list comprehension for
th=40
results = []
for x in range( len( array ) -2 ):
    if( array[x:x+3].min() > th ):
        results.append( x )
Another approach, using numpy.lib.stride_tricks.as_strided:
In [59]: import numpy as np
In [60]: from numpy.lib.stride_tricks import as_strided
Define the input data:
In [61]: a = np.array([ 1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
In [62]: N = 3
In [63]: threshold = 40
Compute the result; q is the boolean mask for the "big" values.
In [64]: q = a > threshold
In [65]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [66]: result
Out[66]: array([4, 8, 9])
Do it again with N = 4:
In [67]: N = 4
In [68]: result = np.all(as_strided(q, shape=(len(q)-N+1, N), strides=(q.strides[0], q.strides[0])), axis=1).nonzero()[0]
In [69]: result
Out[69]: array([8])
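On NumPy 1.20+, np.lib.stride_tricks.sliding_window_view is a safer wrapper around the same trick (a sketch, equivalent to the as_strided call above):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([1, 3, 4, 5, 60, 43, 53, 4, 46, 54, 56, 78])
N, threshold = 3, 40
q = a > threshold
result = np.all(sliding_window_view(q, N), axis=1).nonzero()[0]
print(result)  # [4 8 9]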
I want to get the rank of each element, so I use argsort in numpy:
np.argsort(np.array((1,1,1,2,2,3,3,3,3)))
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
it gives equal elements different ranks. Can I give them the same rank, like:
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't mind a dependency on scipy, you can use scipy.stats.rankdata, with method='min':
In [14]: a
Out[14]: array([1, 1, 1, 2, 2, 3, 3, 3, 3])
In [15]: from scipy.stats import rankdata
In [16]: rankdata(a, method='min')
Out[16]: array([1, 1, 1, 4, 4, 6, 6, 6, 6])
Note that rankdata starts the ranks at 1. To start at 0, subtract 1 from the result:
In [17]: rankdata(a, method='min') - 1
Out[17]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
If you don't want the scipy dependency, you can use numpy.unique to compute the ranking. Here's a function that computes the same result as rankdata(x, method='min') - 1:
import numpy as np
def rankmin(x):
u, inv, counts = np.unique(x, return_inverse=True, return_counts=True)
csum = np.zeros_like(counts)
csum[1:] = counts[:-1].cumsum()
return csum[inv]
For example,
In [137]: x = np.array([60, 10, 0, 30, 20, 40, 50])
In [138]: rankdata(x, method='min') - 1
Out[138]: array([6, 1, 0, 3, 2, 4, 5])
In [139]: rankmin(x)
Out[139]: array([6, 1, 0, 3, 2, 4, 5])
In [140]: a = np.array([1,1,1,2,2,3,3,3,3])
In [141]: rankdata(a, method='min') - 1
Out[141]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
In [142]: rankmin(a)
Out[142]: array([0, 0, 0, 3, 3, 5, 5, 5, 5])
By the way, a single call to argsort() does not give ranks. You can find an assortment of approaches to ranking in the question Rank items in an array using Python/NumPy, including how to do it using argsort().
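For instance, a small sketch of the argsort-based approach from that link: applying argsort twice yields ordinal ranks, where ties get distinct ranks (the behavior observed in the question):

import numpy as np

a = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
ordinal = a.argsort().argsort()  # rank of each element; ties broken by position
print(ordinal)  # [0 1 2 3 4 5 6 7 8]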
Alternatively, pandas series has a rank method which does what you need with the min method:
import pandas as pd
pd.Series((1,1,1,2,2,3,3,3,3)).rank(method="min")
# 0 1
# 1 1
# 2 1
# 3 4
# 4 4
# 5 6
# 6 6
# 7 6
# 8 6
# dtype: float64
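To match the question's zero-based integer output, subtract 1 and convert back to a NumPy array (a small sketch):

import numpy as np
import pandas as pd

a = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
ranks = (pd.Series(a).rank(method="min") - 1).astype(int).to_numpy()
print(ranks)  # [0 0 0 3 3 5 5 5 5]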
With focus on performance, here's an approach -
def rank_repeat_based(arr):
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr))+1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))
For a generic case with the elements in input array not already sorted, we would need to use argsort() to keep track of the positions. So, we would have a modified version, like so -
def rank_repeat_based_generic(arr):
    sidx = np.argsort(arr, kind='mergesort')
    idx = np.concatenate(([0], np.flatnonzero(np.diff(arr[sidx]))+1, [arr.size]))
    return np.repeat(idx[:-1], np.diff(idx))[sidx.argsort()]
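As a quick trace of rank_repeat_based on the sorted example (my annotation):

import numpy as np

arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3])
idx = np.concatenate(([0], np.flatnonzero(np.diff(arr)) + 1, [arr.size]))
print(idx)                                # [0 3 5 9] - start of each run, plus the end
print(np.repeat(idx[:-1], np.diff(idx)))  # [0 0 0 3 3 5 5 5 5] - each start repeated run-length times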
Runtime test
Testing out all the approaches listed thus far to solve the problem on a large dataset.
Sorted array case :
In [96]: arr = np.sort(np.random.randint(1,100,(10000)))
In [97]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 635 µs per loop
In [98]: %timeit rankmin(arr)
1000 loops, best of 3: 495 µs per loop
In [99]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 826 µs per loop
In [100]: %timeit rank_repeat_based(arr)
10000 loops, best of 3: 200 µs per loop
Unsorted case :
In [106]: arr = np.random.randint(1,100,(10000))
In [107]: %timeit rankdata(arr, method='min') - 1
1000 loops, best of 3: 963 µs per loop
In [108]: %timeit rankmin(arr)
1000 loops, best of 3: 869 µs per loop
In [109]: %timeit (pd.Series(arr).rank(method="min")-1).values
1000 loops, best of 3: 1.17 ms per loop
In [110]: %timeit rank_repeat_based_generic(arr)
1000 loops, best of 3: 1.76 ms per loop
I've written a function for the same purpose, using pure Python and NumPy only. Please have a look; I put comments in as well. Note that it produces dense ranks (0, 0, 0, 1, 1, 2, ...) rather than the min ranks shown above.
def my_argsort(array):
    # this type conversion lets us work with python lists and pandas series
    array = np.array(array)
    # create a mapping for unique values: a dictionary where keys are values
    # from the array and values are the desired indices; sorting the unique
    # values first ensures argsort yields their ranks, not just a permutation
    unique_values = sorted(set(array))
    mapping = dict(zip(unique_values, np.argsort(unique_values)))
    # apply mapping to our array
    # np.vectorize works similarly to map(), and can work with dictionaries
    array = np.vectorize(mapping.get)(array)
    return array
Hope that helps.
Complex solutions are unnecessary for this problem.
> ary = np.sort([1, 1, 1, 2, 2, 3, 3, 3, 3]) # or anything; must be sorted.
> a = np.cumsum(ary[1:] != ary[:-1]); a
array([0, 0, 1, 1, 2, 2, 2, 2])
> b = np.r_[0, a]; b # dense ranks; ties share one rank
array([0, 0, 0, 1, 1, 2, 2, 2, 2])
> c = np.flatnonzero(ary[1:] != ary[:-1])
> np.r_[0, 1 + c][b] # min ranks; ties get the first open rank
array([0, 0, 0, 3, 3, 5, 5, 5, 5])
I am surprised this specific question hasn't been asked before, but I really didn't find it on SO nor in the documentation of np.sort.
Say I have a random numpy array holding integers, e.g:
> temp = np.random.randint(1,10, 10)
> temp
array([2, 4, 7, 4, 2, 2, 7, 6, 4, 4])
If I sort it, I get ascending order by default:
> np.sort(temp)
array([2, 2, 2, 4, 4, 4, 4, 6, 7, 7])
but I want the solution to be sorted in descending order.
Now, I know I can always do:
reverse_order = np.sort(temp)[::-1]
but is this last statement efficient? Doesn't it create a copy in ascending order, and then reverses this copy to get the result in reversed order? If this is indeed the case, is there an efficient alternative? It doesn't look like np.sort accepts parameters to change the sign of the comparisons in the sort operation to get things in reverse order.
temp[::-1].sort() sorts the array in place, whereas np.sort(temp)[::-1] creates a new array. Sorting the reversed view ascending writes the values into the original buffer back to front, which leaves temp itself in descending order.
In [25]: temp = np.random.randint(1,10, 10)
In [26]: temp
Out[26]: array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])
In [27]: id(temp)
Out[27]: 139962713524944
In [28]: temp[::-1].sort()
In [29]: temp
Out[29]: array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])
In [30]: id(temp)
Out[30]: 139962713524944
>>> a=np.array([5, 2, 7, 4, 4, 2, 8, 6, 4, 4])
>>> np.sort(a)
array([2, 2, 4, 4, 4, 4, 5, 6, 7, 8])
>>> -np.sort(-a)
array([8, 7, 6, 5, 4, 4, 4, 4, 2, 2])
For short arrays I suggest using np.argsort() by finding the indices of the sorted negated array, which is slightly faster than reversing the sorted array:
In [37]: temp = np.random.randint(1,10, 10)
In [38]: %timeit np.sort(temp)[::-1]
100000 loops, best of 3: 4.65 µs per loop
In [39]: %timeit temp[np.argsort(-temp)]
100000 loops, best of 3: 3.91 µs per loop
Be careful with dimensions.
Let
x # initial numpy array
I = np.argsort(x) or I = x.argsort()
y = np.sort(x) # note: x.sort() sorts x in place and returns None, so it cannot be assigned to y
z # reverse sorted array
Full Reverse
z = x[I[::-1]]
z = -np.sort(-x)
z = np.flip(y)
np.flip changed in NumPy 1.15; earlier versions (1.14 and before) required the axis argument. Solution: pip install --upgrade numpy.
First Dimension Reversed
z = y[::-1]
z = np.flipud(y)
z = np.flip(y, axis=0)
Second Dimension Reversed
z = y[:, ::-1]
z = np.fliplr(y)
z = np.flip(y, axis=1)
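A quick check (my sketch) that the second-dimension variants agree:

import numpy as np

y = np.sort(np.random.rand(3, 4), axis=1)
assert np.array_equal(np.fliplr(y), y[:, ::-1])
assert np.array_equal(np.flip(y, axis=1), y[:, ::-1])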
Testing
Testing on a 100×10×10 array 1000 times.
Method | Time (ms)
-------------+----------
y[::-1] | 0.126659 # only in first dimension
-np.sort(-x) | 0.133152
np.flip(y) | 0.121711
x[I[::-1]] | 4.611778
x.sort() | 0.024961
x.argsort() | 0.041830
np.flip(x) | 0.002026
This is mainly due to reindexing rather than argsort.
# Timing code
import time
import numpy as np

def timeit(fun, xs):
    t = time.time()
    for i in range(len(xs)):  # inline and map gave much worse results for x[-I], 5*t
        fun(xs[i])
    t = time.time() - t
    print(np.round(t, 6))

I, N = 1000, (100, 10, 10)
xs = np.random.rand(I, *N)
timeit(lambda x: np.sort(x)[::-1], xs)
timeit(lambda x: -np.sort(-x), xs)
timeit(lambda x: np.flip(np.sort(x)), xs)  # np.sort, since x.sort() returns None
timeit(lambda x: x[x.argsort()[::-1]], xs)
timeit(lambda x: x.sort(), xs)
timeit(lambda x: x.argsort(), xs)
timeit(lambda x: np.flip(x), xs)
np.flip() and reversed indexing are basically the same. Below is a benchmark using three different methods. It seems np.flip() is slightly faster. Negation is slower because it is applied twice (before and after the sort), so simply reversing the sorted array beats it.
** Note that np.flip() is faster than np.fliplr() according to my tests.
def sort_reverse(x):
return np.sort(x)[::-1]
def sort_negative(x):
return -np.sort(-x)
def sort_flip(x):
return np.flip(np.sort(x))
arr=np.random.randint(1,10000,size=(1,100000))
%timeit sort_reverse(arr)
%timeit sort_negative(arr)
%timeit sort_flip(arr)
and the results are:
6.61 ms ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.69 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.57 ms ± 58.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Hello, I was searching for a solution to reverse-sort a two-dimensional numpy array, and I couldn't find anything that worked, but I think I have stumbled on a solution which I am uploading just in case anyone is in the same boat.
x=np.sort(array)
y=np.fliplr(x)
np.sort sorts ascending, which is not what you want, but the command fliplr flips the rows left to right! Seems to work!
Hope it helps you out!
I guess it's similar to the suggestion of -np.sort(-a) above, but I was put off going for that by the comment that it doesn't always work. Perhaps my solution won't always work either; however, I have tested it with a few arrays and it seems to be OK.
Unfortunately when you have a complex array, only np.sort(temp)[::-1] works properly. The two other methods mentioned here are not effective.
You could sort the array first (Ascending by default) and then apply np.flip()
(https://docs.scipy.org/doc/numpy/reference/generated/numpy.flip.html)
FYI It works with datetime objects as well.
Example:
x = np.array([2,3,1,0])
x_sort_asc=np.sort(x)
print(x_sort_asc)
>>> array([0, 1, 2, 3])
x_sort_desc=np.flip(x_sort_asc)
print(x_sort_desc)
>>> array([3,2,1,0])
Here is a quick trick
In[3]: import numpy as np
In[4]: temp = np.random.randint(1,10, 10)
In[5]: temp
Out[5]: array([5, 4, 2, 9, 2, 3, 4, 7, 5, 8])
In[6]: sorted = np.sort(temp)
In[7]: rsorted = list(reversed(sorted))
In[8]: sorted
Out[8]: array([2, 2, 3, 4, 4, 5, 5, 7, 8, 9])
In[9]: rsorted
Out[9]: [9, 8, 7, 5, 5, 4, 4, 3, 2, 2]
I suggest using this:
np.arange(start_index, end_index, intervals)[::-1]
for example:
np.arange(10, 20, 0.5)
np.arange(10, 20, 0.5)[::-1]
Then your result is:
[ 19.5, 19. , 18.5, 18. , 17.5, 17. , 16.5, 16. , 15.5,
15. , 14.5, 14. , 13.5, 13. , 12.5, 12. , 11.5, 11. ,
10.5, 10. ]
I have a 2d array of integers and I want to sum up 2d sub arrays of it. Both arrays can have arbitrary dimensions, although we can assume that the subarray will be orders of magnitudes smaller than the total array.
The reference implementation in python is trivial:
def sub_sums(arr, l, m):
    result = np.zeros((len(arr) // l, len(arr[0]) // m))
    rows = len(arr) // l * l
    cols = len(arr[0]) // m * m
    for i in range(rows):
        for j in range(cols):
            result[i // l, j // m] += arr[i, j]
    return result
The question is how I do this best using numpy, hopefully without any looping in python at all. For 1d arrays cumsum and r_ would work and I could use that with a bit of looping to implement a solution for 2d, but I'm still learning numpy and I'm almost certain there's some cleverer way.
Example output:
arr = np.asarray([range(0, 5),
                  range(4, 9),
                  range(8, 13),
                  range(12, 17)])
result = sub_sums(arr, 2, 2)
gives:
[[ 0  1  2  3  4]
 [ 4  5  6  7  8]
 [ 8  9 10 11 12]
 [12 13 14 15 16]]

[[ 10.  18.]
 [ 42.  50.]]
There is a blockshaped function which does something rather close to what you want:
In [81]: arr
Out[81]:
array([[ 0,  1,  2,  3,  4],
       [ 4,  5,  6,  7,  8],
       [ 8,  9, 10, 11, 12],
       [12, 13, 14, 15, 16]])
In [82]: blockshaped(arr[:,:4], 2,2)
Out[82]:
array([[[ 0,  1],
        [ 4,  5]],

       [[ 2,  3],
        [ 6,  7]],

       [[ 8,  9],
        [12, 13]],

       [[10, 11],
        [14, 15]]])
In [83]: blockshaped(arr[:,:4], 2,2).shape
Out[83]: (4, 2, 2)
Once you have the "blockshaped" array, you can obtain the desired result by reshaping (so the numbers in one block are strung out along a single axis) and then calling the sum method on that axis.
So, with a slight modification of the blockshaped function, you can define sub_sums like this:
import numpy as np
def sub_sums(arr, nrows, ncols):
    h, w = arr.shape
    h = (h // nrows) * nrows
    w = (w // ncols) * ncols
    arr = arr[:h, :w]
    return (arr.reshape(h // nrows, nrows, -1, ncols)
               .swapaxes(1, 2)
               .reshape(h // nrows, w // ncols, -1)
               .sum(axis=-1))
arr = np.asarray([range(0, 5),
                  range(4, 9),
                  range(8, 13),
                  range(12, 17)])
print(sub_sums(arr, 2, 2))
yields
[[10 18]
[42 50]]
Edit: Ophion provides a nice improvement -- use np.einsum on the reshaped array instead of the swapaxes/reshape/sum chain:
def sub_sums_ophion(arr, nrows, ncols):
    h, w = arr.shape
    h = (h // nrows) * nrows
    w = (w // ncols) * ncols
    arr = arr[:h, :w]
    return np.einsum('ijkl->ik', arr.reshape(h // nrows, nrows, -1, ncols))
In [105]: %timeit sub_sums(arr, 2, 2)
10000 loops, best of 3: 112 µs per loop
In [106]: %timeit sub_sums_ophion(arr, 2, 2)
10000 loops, best of 3: 76.2 µs per loop
Here is the simpler way, using np.add.reduceat (with indices [0] it sums whole rows along axis 1; see the sketch after this session for the full block sums):
In [160]: import numpy as np
In [161]: arr = np.asarray([range(0, 5),
                            range(4, 9),
                            range(8, 13),
                            range(12, 17)])
In [162]: np.add.reduceat(arr, [0], axis=1)
Out[162]:
array([[10],
       [30],
       [50],
       [70]])
In [163]: arr
Out[163]:
array([[ 0,  1,  2,  3,  4],
       [ 4,  5,  6,  7,  8],
       [ 8,  9, 10, 11, 12],
       [12, 13, 14, 15, 16]])
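To get the 2x2 block sums from the question, np.add.reduceat can be applied along both axes (a sketch; the [0, 2] lists mark where each block row/column starts, and the incomplete trailing column is trimmed first):

import numpy as np

arr = np.asarray([range(0, 5),
                  range(4, 9),
                  range(8, 13),
                  range(12, 17)])
trimmed = arr[:4, :4]  # drop the incomplete 5th column
result = np.add.reduceat(np.add.reduceat(trimmed, [0, 2], axis=0), [0, 2], axis=1)
print(result)
# [[10 18]
#  [42 50]]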
A very small change in your code is to use slicing and perform the sums of the sub-arrays using the sum() method:
def sub_sums(arr, l, m):
    result = np.zeros((len(arr) // l, len(arr[0]) // m))
    for i in range(len(arr) // l):
        for j in range(len(arr[0]) // m):
            # l is the block height (rows), m the block width (columns)
            result[i, j] = arr[i*l:(i+1)*l, j*m:(j+1)*m].sum()
    return result
Doing some very simple benchmarks shows that this is slower in the 2x2 case, about equal to your approach in the 3x3 case and faster for bigger sub-arrays (sub_sums2 is your version of the code):
In [19]: arr = np.asarray([range(100)] * 100)
In [20]: %timeit sub_sums(arr, 2, 2)
10 loops, best of 3: 21.8 ms per loop
In [21]: %timeit sub_sums2(arr, 2, 2)
100 loops, best of 3: 9.56 ms per loop
In [22]: %timeit sub_sums(arr, 3, 3)
100 loops, best of 3: 9.58 ms per loop
In [23]: %timeit sub_sums2(arr, 3, 3)
100 loops, best of 3: 9.36 ms per loop
In [24]: %timeit sub_sums(arr, 4, 4)
100 loops, best of 3: 5.58 ms per loop
In [25]: %timeit sub_sums2(arr, 4, 4)
100 loops, best of 3: 9.56 ms per loop
In [26]: %timeit sub_sums(arr, 10, 10)
1000 loops, best of 3: 939 us per loop
In [27]: %timeit sub_sums2(arr, 10, 10)
100 loops, best of 3: 9.48 ms per loop
Notice that with 10x10 sub-arrays it's about ten times faster (939 µs vs. 9.48 ms). In the 2x2 case it's about twice as slow. Your method takes basically the same time regardless of block size, while my implementation gets faster with bigger sub-arrays.
I'm pretty sure we can avoid using the for loops explicitly (maybe by reshaping the array so that it has the sub-arrays as rows?), but I'm not an expert in numpy and it may take some time before I'll be able to find the final solution. However, I believe that an order of magnitude is already a nice improvement.