scipy.sparse.diags allows me to enter multiple diagonal vectors, together with their location, to build a matrix such as
import numpy as np
from scipy.sparse import diags
vec = np.ones((5,))
vec2 = vec + 1
diags([vec, vec2], [-2, 2])
I'm looking for an efficient way to do the same but build a dense matrix, instead of DIA. np.diag only supports a single diagonal. What's an efficient way to build a dense matrix from multiple diagonal vectors?
Expected output: the same as np.array(diags([vec, vec2], [-2, 2]).todense())
One way would be to index into the flattened output array using a step of N+1:
import numpy as np
from scipy.sparse import diags
from timeit import timeit
def diags_pp(vecs, offs, dtype=float, N=None):
    if N is None:
        N = len(vecs[0]) + abs(offs[0])
    out = np.zeros((N, N), dtype)
    outf = out.reshape(-1)
    for vec, off in zip(vecs, offs):
        if off<0:
            outf[-N*off::N+1] = vec
        else:
            outf[off:N*(N-off):N+1] = vec
    return out
def diags_sp(vecs, offs):
    return diags(vecs, offs).toarray()
for N, k in [(10, 2), (100, 20), (1000, 200)]:
    print(N)
    O = np.arange(-k,k)
    D = [np.arange(1, N+1-abs(o)) for o in O]
    for n, f in list(globals().items()):
        if n.startswith('diags_'):
            print(n.replace('diags_', ''), timeit(lambda: f(D, O), number=10000//N)*N)
            if n != 'diags_sp':
                assert np.all(f(D, O) == diags_sp(D, O))
Sample run:
10
pp 0.06757194991223514
sp 1.9529316504485905
100
pp 0.45834919437766075
sp 4.684177896706387
1000
pp 23.397524026222527
sp 170.66762899048626
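To see why a step of N+1 walks along a diagonal: in a row-major N x N array, element (i, j) sits at flat position i*N + j, so moving one row down and one column right adds N+1. The sketch below checks this for a single offset on a small matrix and compares against np.diag. The helper name `fill_diagonal_flat` is made up for illustration.

```python
import numpy as np

def fill_diagonal_flat(N, vec, off):
    """Write vec onto diagonal `off` of a new N x N array via a flat view."""
    out = np.zeros((N, N), dtype=np.asarray(vec).dtype)
    flat = out.reshape(-1)          # view on the same memory (out is contiguous)
    if off < 0:
        # first element is at (-off, 0), i.e. flat index -off*N
        flat[-off * N::N + 1] = vec
    else:
        # first element is at (0, off); the stop keeps the slice from
        # wrapping around into the lower-left part of the matrix
        flat[off:N * (N - off):N + 1] = vec
    return out

lower = fill_diagonal_flat(5, [1, 2, 3], -2)
upper = fill_diagonal_flat(5, [7, 8, 9, 10], 1)
print(lower)
print(upper)
```

Both results should match `np.diag([1, 2, 3], -2)` and `np.diag([7, 8, 9, 10], 1)` respectively.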
With Paul Panzer's (10,2) case
In [107]: O
Out[107]: array([-2, -1, 0, 1])
In [108]: D
Out[108]:
[array([1, 2, 3, 4, 5, 6, 7, 8]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9])]
The diagonals have different lengths.
sparse.diags converts this to a sparse.dia_matrix:
In [109]: M = sparse.diags(D,O)
In [110]: M
Out[110]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 36 stored elements (4 diagonals) in DIAgonal format>
In [111]: M.data
Out[111]:
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 0., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
Here the ragged list of diagonals has been converted to a padded 2d array. This can be a convenient way of specifying the diagonals, but it isn't particularly efficient. It has to be converted to csr format for most calculations:
In [112]: timeit sparse.diags(D,O)
99.8 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [113]: timeit sparse.diags(D,O, format='csr')
371 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using np.diag I can construct the same array with an iteration
np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
In [117]: timeit np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
39.3 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with Paul's function:
In [120]: timeit diags_pp(D,O)
12.3 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The key step in np.diag is a flat assignment:
res[:n-k].flat[i::n+1] = v
This is essentially the same as Paul's outf assignments. So the functionality is basically the same, assigning each diagonal via a slice. Paul streamlines it.
Creating the M.data array (Out[111]) also requires copying the D arrays to a 2d array - but with different slices.
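As a rough illustration of that padding step, the sketch below builds the padded 2d array by hand with NumPy only. It assumes the layout visible in Out[111]: column j of each row holds the entry that lives in matrix column j, so lower diagonals are left-aligned and upper diagonals are shifted right by their offset. `pad_diagonals` is a hypothetical helper, not SciPy's implementation.

```python
import numpy as np

def pad_diagonals(diagonals, offsets, N):
    """Pack ragged diagonals into a (len(offsets), N) array, DIA-style."""
    data = np.zeros((len(offsets), N))
    for row, (vec, off) in enumerate(zip(diagonals, offsets)):
        start = max(0, off)              # upper diagonals begin at column `off`
        data[row, start:start + len(vec)] = vec
    return data

N = 10
O = np.arange(-2, 2)
D = [np.arange(1, N + 1 - abs(o)) for o in O]
padded = pad_diagonals(D, O, N)
print(padded)
```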
I have a Numpy array of dimensions (d1,d2,d3,d4), for instance A = np.arange(120).reshape((2,3,4,5)).
I would like to contract it so as to obtain B of dimensions (d1,d2,d4).
The d3-indices of parts to pick are collected in an indexing array Idx of dimensions (d1,d2).
Idx provides, for each couple (x1,x2) of indices along (d1,d2), the index x3 for which B should retain the whole corresponding d4-line in A, for example Idx = rng.integers(4, size=(2,3)).
To sum up, for all (x1,x2), I want B[x1,x2,:] = A[x1,x2,Idx[x1,x2],:].
Is there an efficient, vectorized way to do that, without using a loop? I'm aware that this is similar to Easy way to do nd-array contraction using advanced indexing in Python but I have trouble extending the solution to higher dimensional arrays.
MWE
import numpy as np
rng = np.random.default_rng()
A = np.arange(120).reshape((2,3,4,5))
Idx = rng.integers(4, size=(2,3))
# correct result:
B = np.zeros((2,3,5))
for i in range(2):
    for j in range(3):
        B[i,j,:] = A[i,j,Idx[i,j],:]
# what I would like, which doesn't work:
B = A[:,:,Idx[:,:],:]
One way to do that is np.squeeze(np.take_along_axis(A, Idx[:,:,None,None], axis=2), axis=2).
For example,
In [49]: A = np.arange(120).reshape(2, 3, 4, 5)
In [50]: rng = np.random.default_rng(0xeeeeeeeeeee)
In [51]: Idx = rng.integers(4, size=(2,3))
In [52]: Idx
Out[52]:
array([[2, 0, 1],
[0, 2, 1]])
In [53]: C = np.squeeze(np.take_along_axis(A, Idx[:,:,None,None], axis=2), axis=2)
In [54]: C
Out[54]:
array([[[ 10, 11, 12, 13, 14],
[ 20, 21, 22, 23, 24],
[ 45, 46, 47, 48, 49]],
[[ 60, 61, 62, 63, 64],
[ 90, 91, 92, 93, 94],
[105, 106, 107, 108, 109]]])
Check the known correct result:
In [55]: # correct result:
    ...: B = np.zeros((2,3,5))
    ...: for i in range(2):
    ...:     for j in range(3):
    ...:         B[i,j,:] = A[i,j,Idx[i,j],:]
    ...:
In [56]: B
Out[56]:
array([[[ 10., 11., 12., 13., 14.],
[ 20., 21., 22., 23., 24.],
[ 45., 46., 47., 48., 49.]],
[[ 60., 61., 62., 63., 64.],
[ 90., 91., 92., 93., 94.],
[105., 106., 107., 108., 109.]]])
Times for 3 alternatives:
In [91]: %%timeit
    ...: B = np.zeros((2,3,5),A.dtype)
    ...: for i in range(2):
    ...:     for j in range(3):
    ...:         B[i,j,:] = A[i,j,Idx[i,j],:]
    ...:
11 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [92]: timeit A[np.arange(2)[:,None],np.arange(3),Idx]
8.58 µs ± 44 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
In [94]: timeit np.squeeze(np.take_along_axis(A, Idx[:,:,None,None], axis=2), axis=2)
29.4 µs ± 448 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Relative times may differ with larger arrays. But this is a good size for testing the correctness.
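The second timing uses plain advanced indexing, which is not spelled out above. A short sketch of how it works: index arrays for the first two axes are broadcast against Idx, so each (i, j) pair picks position Idx[i, j] along the third axis while the last axis is kept whole. The seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed
A = np.arange(120).reshape(2, 3, 4, 5)
Idx = rng.integers(4, size=(2, 3))

i = np.arange(A.shape[0])[:, None]       # shape (2, 1)
j = np.arange(A.shape[1])                # shape (3,), broadcasts to (2, 3)
B_index = A[i, j, Idx]                   # shape (2, 3, 5)

# Reference result from the explicit double loop
B_loop = np.empty((2, 3, 5), dtype=A.dtype)
for x1 in range(2):
    for x2 in range(3):
        B_loop[x1, x2, :] = A[x1, x2, Idx[x1, x2], :]

print(np.array_equal(B_index, B_loop))
```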
I need to alternate Pytorch Tensors (similar to numpy arrays) with rows and columns of zeros. Like this:
Input => [[ 1,2,3],
[ 4,5,6],
[ 7,8,9]]
output => [[ 1,0,2,0,3],
[ 0,0,0,0,0],
[ 4,0,5,0,6],
[ 0,0,0,0,0],
[ 7,0,8,0,9]]
I am using the accepted answer in this question that proposes the following
def insert_zeros(a, N=1):
    # a : Input array
    # N : number of zeros to be inserted between consecutive rows and cols
    out = np.zeros( (N+1)*np.array(a.shape)-N,dtype=a.dtype)
    out[::N+1,::N+1] = a
    return out
The answer works perfectly, except that I need to perform this many times on many arrays and the time it takes has become the bottleneck. It is the step-sized slicing that takes most of the time.
For what it's worth, the matrices I am using it for are 4D, an example size of a matrix is 32x18x16x16 and I am inserting the alternate rows/cols only in the last two dimensions.
So my question is, is there another implementation with the same functionality but with reduced time?
I am not familiar with PyTorch, but I think the JAX library can speed up the code you provided considerably. For example, this version:
import numpy as np
import jax
import jax.numpy as jnp
from functools import partial
a = np.arange(10000).reshape(100, 100)
b = jnp.array(a)
@partial(jax.jit, static_argnums=1)
def new(a, N):
    out = jnp.zeros( (N+1)*np.array(a.shape)-N,dtype=a.dtype)
    out = out.at[::N+1,::N+1].set(a)
    return out
will improve the runtime about 10 times on GPU. The gain depends on the array size and on N (the larger the sizes, the better the performance). You can see benchmarks in my Colab link, based on the 4 answers proposed so far (JAX beats the others).
I believe JAX can be one of the best libraries for your case if you can adapt it to your problem (which is possible).
I found a few methods to achieve this result, and the indexing method seems to be consistently the fastest.
Some of the other methods could probably be improved, though: I tried to generalize them from 1D to 2D with an arbitrary number of leading dimensions, and might not have done it in the best way possible.
Edit: Yet another method using numpy, not faster.
Performance test (CPU):
In [4]: N, C, H, W = 11, 5, 128, 128
...: x = torch.rand(N, C, H, W)
...: k = 3
...:
...: x1 = interleave_index(x, k)
...: x2 = interleave_view(x, k)
...: x3 = interleave_einops(x, k)
...: x4 = interleave_convtranspose(x, k)
...: x5 = interleave_numpy(x, k)
...:
...: assert torch.all(x1 == x2)
...: assert torch.all(x2 == x3)
...: assert torch.all(x3 == x4)
...: assert torch.all(x4 == x5)
...:
...: %timeit interleave_index(x, k)
...: %timeit interleave_view(x, k)
...: %timeit interleave_einops(x, k)
...: %timeit interleave_convtranspose(x, k)
...: %timeit interleave_numpy(x, k)
9.51 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.6 ms ± 4.98 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
62.5 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50.6 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance test (GPU):
(numpy method not tested)
...: ...
...: x = torch.rand(N, C, H, W, device="cuda")
...: ...
260 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
861 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
912 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
429 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Implementations:
import numpy as np
import torch
import torch.nn.functional as F
import einops
def interleave_index(x, k):
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    out = x.new_zeros(*cdims, Hout, Wout)
    out[..., :: k + 1, :: k + 1] = x
    return out
def interleave_view(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/4
    """
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    zeros = [torch.zeros_like(x)] * k
    out = torch.stack([x, *zeros], dim=-1).view(*cdims, Hin, Wout + k)[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = torch.stack([out, *zeros], dim=-2).view(*cdims, Hout + k, Wout)[..., :-k, :]
    return out
def interleave_einops(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/6
    """
    zeros = [torch.zeros_like(x)] * k
    out = einops.rearrange([x, *zeros], "t ... h w -> ... h (w t)")[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = einops.rearrange([out, *zeros], "t ... h w -> ... (h t) w")[..., :-k, :]
    return out
def interleave_convtranspose(x, k):
    """
    From
    https://github.com/pytorch/pytorch/issues/7911#issuecomment-515493009
    """
    C = x.shape[-3]
    weight=x.new_ones(C, 1, 1, 1)
    return F.conv_transpose2d(x, weight=weight, stride=k+1, groups=C)
def interleave_numpy(x, k):
    """
    From https://stackoverflow.com/a/53179919
    """
    pos = np.repeat(np.arange(1, x.shape[-1]), k)
    out = np.insert(x, pos, 0, axis=-1)
    pos = np.repeat(np.arange(1, x.shape[-2]), k)
    out = np.insert(out, pos, 0, axis=-2)
    return out
Since you know the size of the array in advance, the first optimization is to create the out array outside the function. Then use numba to JIT-compile the function and write in place into the out array. This achieves a 5x speedup over the NumPy version you posted.
import numpy as np
from numba import njit
@njit
def insert_zeros_n(a, out, N=1):
    # out must be pre-filled with zeros; only every (N+1)-th entry is written
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[(N+1)*i, (N+1)*j] = a[i,j]
and call it with the specified N and a:
N = 1
a = np.arange(16*16).reshape(16, 16)
out = np.zeros( (N+1)*np.array(a.shape)-N,dtype=a.dtype)
insert_zeros_n(a, out, N)
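The preallocation idea also works without numba. A minimal NumPy-only sketch, assuming every input in a batch has the same shape so that a single zero-filled buffer can be reused: only the strided positions are ever written, so the zeros in between stay untouched. `fill_interleaved` is a name chosen for illustration, and no timing claim is made for it.

```python
import numpy as np

def fill_interleaved(a, out, N=1):
    """Write `a` into every (N+1)-th row/column of the last two axes of `out`."""
    out[..., ::N + 1, ::N + 1] = a
    return out

N = 1
shape = (32, 18, 16, 16)                                  # example size from the question
H, W = shape[-2:]
out = np.zeros(shape[:-2] + ((N + 1) * H - N, (N + 1) * W - N))

first = fill_interleaved(np.ones(shape), out, N).copy()  # copy: `out` is reused below
second = fill_interleaved(np.full(shape, 2.0), out, N)
print(out.shape)
```

Because the same buffer is returned on every call, a result has to be copied (or consumed) before the next call overwrites it.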
Encapsulated for any N, what about using numpy.kron with 4D inputs,
a = np.arange(1, 19).reshape((1, 2, 3, 3))
print(a)
# array([[[[ 1, 2, 3],
# [ 4, 5, 6],
# [ 7, 8, 9]],
#
# [[10, 11, 12],
# [13, 14, 15],
# [16, 17, 18]]]])
def interleave_kron(a, N=1):
    n = N + 1
    return np.kron(
        a, np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n))
    )[..., :-N, :-N]
where np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n)) could be computed once and reused (for example as a default argument) for the sake of performance.
and then
>>> interleave_kron(a, N=2)
array([[[[ 1., 0., 0., 2., 0., 0., 3.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 4., 0., 0., 5., 0., 0., 6.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 7., 0., 0., 8., 0., 0., 9.]],
[[10., 0., 0., 11., 0., 0., 12.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[13., 0., 0., 14., 0., 0., 15.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[16., 0., 0., 17., 0., 0., 18.]]]])
?
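Building the Kronecker factor only once, as suggested above, could look like the sketch below. It assumes N >= 1 (the trailing slice `:-N` would be empty for N = 0) and uses an integer kernel so that an integer input keeps its dtype. `make_kernel` and `interleave_kron_cached` are illustrative names.

```python
import numpy as np

def make_kernel(N, dtype=int):
    """(1, 1, N+1, N+1) block with a single 1 in its top-left corner."""
    n = N + 1
    kernel = np.zeros((1, 1, n, n), dtype=dtype)
    kernel[0, 0, 0, 0] = 1
    return kernel

def interleave_kron_cached(a, kernel):
    N = kernel.shape[-1] - 1
    # Each entry of `a` expands into an (N+1) x (N+1) block; trim the trailing zeros
    return np.kron(a, kernel)[..., :-N, :-N]

a = np.arange(1, 19).reshape((1, 2, 3, 3))
kernel = make_kernel(2)                      # built once, reused for every call
res = interleave_kron_cached(a, kernel)

# Cross-check against strided assignment
ref = np.zeros((1, 2, 7, 7), dtype=a.dtype)
ref[..., ::3, ::3] = a
print(np.array_equal(res, ref))
```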
I have been tasked with applying a "1-2-1" filter to a numpy array and returning an array of the filtered data, without using for loops, while loops, or list comprehensions.
The "1-2-1" filter maps each point of data to the average of itself twice and its neighbors. For example, if at some point the data contained ...1, 4, 3... then after applying the "1-2-1" filter the 4 would be replaced with (1 + 4 + 4 + 3) / 4 = 12 / 4 = 3.
For example, the numpy array [1, 1, 4, 3, 2] would, after the filter is applied, produce the numpy array [1. 1.75 3. 3. 2. ]
Since the end points of the data do not have two neighbors the 1-2-1 filter is only applied to the internal len(data) - 2 points, leaving the end points unchanged.
Essentially I need to access the values before and after a given point in a vectorized way, for an array that could be of any length. As much as I have googled, I cannot work out how.
Pandas solution
import pandas as pd
l = [1, 1, 4, 3, 2]
s = pd.Series(l)
>>> s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
array([1. , 1.75, 3. , 3. , 2. ])
Step by step:
>>> s.rolling(3, center=True).sum().values
array([nan, 6., 8., 9., nan])
>>> s.rolling(3, center=True).sum().add(s).values
array([nan, 7., 12., 12., nan])
>>> s.rolling(3, center=True).sum().add(s).div(4).values
array([ nan, 1.75, 3. , 3. , nan])
>>> s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
array([1. , 1.75, 3. , 3. , 2. ])
Numpy solution
a = np.array(l)
>>> np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
Step by step:
>>> np.convolve(a, [1, 2, 1], mode="valid")
array([ 7, 12, 12])
>>> np.convolve(a, [1, 2, 1], mode="valid") / 4
array([1.75, 3. , 3. ])
>>> np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
array([1. , 1.75, 3. , 3. , 2. ])
Performance
%timeit s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
706 µs ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
10.8 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
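The neighbour access the question asks about can also be written with shifted slices, with no convolution at all. A minimal sketch (the function name is made up, and it has not been timed against the two solutions above):

```python
import numpy as np

def filter_121(data):
    """Apply the 1-2-1 filter to interior points; end points are copied unchanged."""
    data = np.asarray(data, dtype=float)
    out = data.copy()
    # data[:-2] is the left neighbour and data[2:] the right neighbour of data[1:-1]
    out[1:-1] = (data[:-2] + 2 * data[1:-1] + data[2:]) / 4
    return out

result = filter_121([1, 1, 4, 3, 2])
print(result)
```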
You can try something like this:
import numpy as np
from scipy.linalg import circulant
x = np.array([1, 1, 4, 3, 2])
val = np.array([1, 2, 1])
offsets = np.array([0, 1, 2])
col0 = np.zeros(len(x))
col0[offsets] = val
C = circulant(col0).T[:-(len(val) - 1)]
print(C)
C is essentially a circulant matrix which looks like this:
array([[1., 2., 1., 0., 0.],
[0., 1., 2., 1., 0.],
[0., 0., 1., 2., 1.]])
Now you can simply compute the filtered output as follows:
y = (C @ x) / 4
print(y)
# array([1.75, 3. , 3. ])
What's the most efficient way to fill a scipy.sparse.dok_matrix, based on an input list?
Neither the number of columns nor the number of rows in the dok_matrix is known in advance.
The number of rows is the length of the input list, the number of columns depends on the values within the input list.
The obvious:
from typing import Any, List
import scipy.sparse
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    max_cols = 0
    datas = []
    for value in values:
        data = get_data(value)
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = scipy.sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
This has two for loops, one of them nested, and many len() checks as well. I can't imagine this being very efficient.
I have also considered:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    cols = 0
    dok_matrix = scipy.sparse.dok_matrix((0, 0))
    for row, value in enumerate(values):
        dok_matrix.resize(row + 1, cols)
        data = get_data(value)
        for col, datum in enumerate(data):
            if col + 1 > cols:
                cols = col + 1
                dok_matrix.resize(row + 1, cols)
            dok_matrix[row, col] = datum
    return dok_matrix
This hugely depends on how efficient scipy.sparse.dok_matrix.resize is, which I couldn't find in the documentation.
Which of these is most efficient?
Is there a better way that I am missing (maybe I can O(1) set an entire row at once?)?
With:
def get_dok_matrix(values):
    max_cols = 0
    datas = []
    for value in values:
        data = value # changed
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
And
In [13]: values = [[1],[1,2],[1,2,3],[4,5,6,7],[8,9,10,11,12]]
In [14]: dd = get_dok_matrix(values)
In [15]: dd
Out[15]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Dictionary Of Keys format>
In [16]: dd.A
Out[16]:
array([[ 1., 0., 0., 0., 0.],
[ 1., 2., 0., 0., 0.],
[ 1., 2., 3., 0., 0.],
[ 4., 5., 6., 7., 0.],
[ 8., 9., 10., 11., 12.]])
I wish you'd provided a values example, so I wouldn't have to study your code and create one that would work with it.
To make a coo format:
def get_coo_matrix(values):
    data, row, col = [],[],[]
    for i,value in enumerate(values):
        n = len(value)
        data.extend(value)
        row.extend([i]*n)
        col.extend(list(range(n)))
    return sparse.coo_matrix((data,(row,col)))
In [18]: M = get_coo_matrix(values)
In [19]: M
Out[19]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in COOrdinate format>
In [20]: M.A
Out[20]:
array([[ 1, 0, 0, 0, 0],
[ 1, 2, 0, 0, 0],
[ 1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 0],
[ 8, 9, 10, 11, 12]])
times:
In [22]: timeit dd = get_dok_matrix(values)
431 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: timeit M = get_coo_matrix(values)
152 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
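The Python-level `extend` calls can in principle be replaced by array operations. The sketch below only builds the (data, row, col) triplets with NumPy, assuming `values` is a non-empty list of sequences; the triplets could then be passed to `sparse.coo_matrix((data, (row, col)))` as in `get_coo_matrix`. `coo_triplets` is a hypothetical helper and has not been timed against the versions above.

```python
import numpy as np

def coo_triplets(values):
    """Return (data, row, col) arrays for a ragged list of rows."""
    lengths = np.array([len(v) for v in values])
    data = np.concatenate([np.asarray(v) for v in values])
    # Row index i is repeated once per element of row i
    row = np.repeat(np.arange(len(values)), lengths)
    # Column index restarts at 0 at the beginning of each row
    starts = np.cumsum(lengths) - lengths
    col = np.arange(lengths.sum()) - np.repeat(starts, lengths)
    return data, row, col

values = [[1], [1, 2], [1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
data, row, col = coo_triplets(values)

# Dense check with plain NumPy (valid here because no (row, col) pair repeats)
dense = np.zeros((row.max() + 1, col.max() + 1), dtype=data.dtype)
dense[row, col] = data
print(dense)
```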
I have an array of values arr with shape (N,) and an array of coordinates coords with shape (2,N). I want to represent this in an (M,M) array grid such that grid takes the value 0 at coordinates that are not in coords, and for the coordinates that are included it stores the sum of all values in arr that have that coordinate. So if M=3, arr = np.arange(4)+1, and coords = np.array([[0,0,1,2],[0,0,2,2]]) then grid should be:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
The reason this is nontrivial is that I need to be able to repeat this step many times and the values in arr change each time, and so can the coordinates. Ideally I am looking for a vectorized solution. I suspect that I might be able to use np.where somehow but it's not immediately obvious how.
Timing the solutions
I have timed the solutions posted so far, and it appears that the accumulator method is slightly faster than the sparse matrix method, with the second accumulation method being the slowest for the reasons explained in the comments:
%timeit for x in range(100): accumulate_arr(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): accumulate_arr_v2(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): sparse.coo_matrix((np.random.normal(0,1,10000),np.random.randint(100,size=(2,10000))),(100,100)).A
47.3 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
48.2 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
One way would be to create a sparse.coo_matrix and convert that to dense:
from scipy import sparse
sparse.coo_matrix((arr,coords),(M,M)).A
# array([[3, 0, 0],
# [0, 0, 3],
# [0, 0, 4]])
With np.bincount -
def accumulate_arr(coords, arr):
    # Get output array shape
    m,n = coords.max(1)+1
    # Get linear indices to be used as IDs with bincount
    lidx = np.ravel_multi_index(coords, (m,n))
    # Or lidx = coords[0]*(coords[1].max()+1) + coords[1]
    # Accumulate arr with IDs from lidx
    return np.bincount(lidx,arr,minlength=m*n).reshape(m,n)
Sample run -
In [58]: arr
Out[58]: array([1, 2, 3, 4])
In [59]: coords
Out[59]:
array([[0, 0, 1, 2],
[0, 0, 2, 2]])
In [60]: accumulate_arr(coords, arr)
Out[60]:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
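To make the linear-index step concrete, the sketch below repeats the sample run with the grid shape passed in explicitly. The `shape` parameter is an addition for illustration: inferring the shape from `coords.max(1)` gives a grid smaller than (M, M) whenever the largest coordinates happen to be unused.

```python
import numpy as np

def accumulate_with_shape(coords, arr, shape):
    # Flat position of (r, c) in a row-major grid is r * shape[1] + c
    lidx = np.ravel_multi_index(tuple(coords), shape)
    # bincount sums the weights that share the same flat position
    grid = np.bincount(lidx, weights=arr, minlength=shape[0] * shape[1])
    return lidx, grid.reshape(shape)

arr = np.arange(4) + 1
coords = np.array([[0, 0, 1, 2], [0, 0, 2, 2]])
lidx, grid = accumulate_with_shape(coords, arr, (3, 3))
lidx_big, grid_big = accumulate_with_shape(coords, arr, (4, 4))
print(lidx)
print(grid)
```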
Another with np.add.at on similar lines and might be easier to follow -
def accumulate_arr_v2(coords, arr):
    m,n = coords.max(1)+1
    out = np.zeros((m,n), dtype=arr.dtype)
    np.add.at(out, tuple(coords), arr)
    return out
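The reason `np.add.at` is used here instead of a plain `+=` with index arrays: fancy-indexed in-place addition is buffered, so when a coordinate appears more than once its contributions are not all accumulated. A small demonstration with the example data:

```python
import numpy as np

arr = np.arange(4) + 1
coords = np.array([[0, 0, 1, 2], [0, 0, 2, 2]])

buffered = np.zeros((3, 3))
buffered[tuple(coords)] += arr              # (0, 0) occurs twice; the two values are not summed

unbuffered = np.zeros((3, 3))
np.add.at(unbuffered, tuple(coords), arr)   # every occurrence is accumulated

print(buffered[0, 0], unbuffered[0, 0])
```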