I need to interleave PyTorch tensors (similar to NumPy arrays) with rows and columns of zeros, like this:
Input  => [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

Output => [[1, 0, 2, 0, 3],
           [0, 0, 0, 0, 0],
           [4, 0, 5, 0, 6],
           [0, 0, 0, 0, 0],
           [7, 0, 8, 0, 9]]
I am using the accepted answer to this question, which proposes the following:
def insert_zeros(a, N=1):
    # a : input array
    # N : number of zeros to be inserted between consecutive rows and cols
    out = np.zeros((N + 1) * np.array(a.shape) - N, dtype=a.dtype)
    out[::N + 1, ::N + 1] = a
    return out
The answer works perfectly, except that I need to perform this many times on many arrays, and the time it takes has become the bottleneck. It is the step-sized slicing that takes most of the time.
For what it's worth, the matrices I am using it for are 4D; an example size is 32x18x16x16, and I am inserting the alternating rows/cols only in the last two dimensions.
So my question is, is there another implementation with the same functionality but with reduced time?
I am not familiar with PyTorch, but to accelerate the code that you provided, I think the JAX library will help a lot. So, if:
import numpy as np
import jax
import jax.numpy as jnp
from functools import partial
a = np.arange(10000).reshape(100, 100)
b = jnp.array(a)
@partial(jax.jit, static_argnums=1)
def new(a, N):
    out = jnp.zeros((N + 1) * np.array(a.shape) - N, dtype=a.dtype)
    out = out.at[::N + 1, ::N + 1].set(a)
    return out
will improve the runtime about 10 times on GPU. It depends on the array size and N (the larger the sizes, the better the performance). You can see benchmarks on my Colab link based on the 4 answers proposed so far (JAX beats the others).
I believe that JAX can be one of the best libraries for your case if you can adapt it to your problem (it is possible).
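For reference, a minimal usage sketch (assuming the imports and the new function above; block_until_ready is needed because JAX dispatches work asynchronously, so timing without it only measures dispatch):

# First call compiles (jax.jit via the @partial decorator); later calls reuse the compiled kernel.
out = new(b, 1)
out.block_until_ready()   # wait for the asynchronous computation to finish
print(out.shape)          # (199, 199) for the 100x100 input above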
I found a few methods to achieve this result, and the indexing method seems to be consistently the fastest.
There might be some improvement to be made to the other methods, though, because I tried to generalize them from 1D to 2D and to an arbitrary number of leading dimensions, and might not have done it in the best way possible.
Edit: Yet another method using numpy, not faster.
Performance test (CPU):
In [4]: N, C, H, W = 11, 5, 128, 128
...: x = torch.rand(N, C, H, W)
...: k = 3
...:
...: x1 = interleave_index(x, k)
...: x2 = interleave_view(x, k)
...: x3 = interleave_einops(x, k)
...: x4 = interleave_convtranspose(x, k)
...: x5 = interleave_numpy(x, k)
...:
...: assert torch.all(x1 == x2)
...: assert torch.all(x2 == x3)
...: assert torch.all(x3 == x4)
...: assert torch.all(x4 == x5)
...:
...: %timeit interleave_index(x, k)
...: %timeit interleave_view(x, k)
...: %timeit interleave_einops(x, k)
...: %timeit interleave_convtranspose(x, k)
...: %timeit interleave_numpy(x, k)
9.51 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.6 ms ± 4.98 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
62.5 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50.6 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance test (GPU):
(numpy method not tested)
...: ...
...: x = torch.rand(N, C, H, W, device="cuda")
...: ...
260 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
861 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
912 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
429 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Implementations:
import numpy as np
import torch
import torch.nn.functional as F
import einops


def interleave_index(x, k):
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    out = x.new_zeros(*cdims, Hout, Wout)
    out[..., :: k + 1, :: k + 1] = x
    return out


def interleave_view(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/4
    """
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    zeros = [torch.zeros_like(x)] * k
    out = torch.stack([x, *zeros], dim=-1).view(*cdims, Hin, Wout + k)[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = torch.stack([out, *zeros], dim=-2).view(*cdims, Hout + k, Wout)[..., :-k, :]
    return out


def interleave_einops(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/6
    """
    zeros = [torch.zeros_like(x)] * k
    out = einops.rearrange([x, *zeros], "t ... h w -> ... h (w t)")[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = einops.rearrange([out, *zeros], "t ... h w -> ... (h t) w")[..., :-k, :]
    return out


def interleave_convtranspose(x, k):
    """
    From
    https://github.com/pytorch/pytorch/issues/7911#issuecomment-515493009
    """
    C = x.shape[-3]
    weight = x.new_ones(C, 1, 1, 1)
    return F.conv_transpose2d(x, weight=weight, stride=k + 1, groups=C)


def interleave_numpy(x, k):
    """
    From https://stackoverflow.com/a/53179919
    """
    pos = np.repeat(np.arange(1, x.shape[-1]), k)
    out = np.insert(x, pos, 0, axis=-1)
    pos = np.repeat(np.arange(1, x.shape[-2]), k)
    out = np.insert(out, pos, 0, axis=-2)
    return out
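As a quick sanity check, here is the 3x3 example from the question run through interleave_index defined above (a sketch):

a = torch.tensor([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
print(interleave_index(a, k=1))
# tensor([[1, 0, 2, 0, 3],
#         [0, 0, 0, 0, 0],
#         [4, 0, 5, 0, 6],
#         [0, 0, 0, 0, 0],
#         [7, 0, 8, 0, 9]])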
Since you know the size of the output array in advance, the first step in optimizing is to create the out array outside the function. Then try numba to jit-compile the function and work in-place on the out array. This achieves a 5x speedup over the numpy version you posted.
import numpy as np
from numba import njit

@njit
def insert_zeros_n(a, out, N=1):
    # write a[i, j] into out[(N+1)*i, (N+1)*j]; out is pre-allocated with zeros
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[(N + 1) * i, (N + 1) * j] = a[i, j]
and call it with the specified N and a:
N = 1
a = np.arange(16*16).reshape(16, 16)
out = np.zeros((N + 1) * np.array(a.shape) - N, dtype=a.dtype)
insert_zeros_n(a, out, N)
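For the 4D case mentioned in the question (interleaving only in the last two dimensions), the same idea might look like the sketch below; the variable names and the batch/channel layout are assumptions, and it is not benchmarked:

import numpy as np
from numba import njit

@njit
def insert_zeros_4d(a, out, N=1):
    # Write a[b, c, i, j] into out[b, c, (N+1)*i, (N+1)*j]; `out` must be pre-allocated with zeros.
    B, C, H, W = a.shape
    for b in range(B):
        for c in range(C):
            for i in range(H):
                for j in range(W):
                    out[b, c, (N + 1) * i, (N + 1) * j] = a[b, c, i, j]

a = np.random.rand(32, 18, 16, 16)
N = 1
out_shape = a.shape[:2] + ((N + 1) * a.shape[2] - N, (N + 1) * a.shape[3] - N)
out = np.zeros(out_shape, dtype=a.dtype)
insert_zeros_4d(a, out, N)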
Encapsulated for any N, what about using numpy.kron with 4D inputs?
a = np.arange(1, 19).reshape((1, 2, 3, 3))
print(a)
# array([[[[ 1,  2,  3],
#          [ 4,  5,  6],
#          [ 7,  8,  9]],
#
#         [[10, 11, 12],
#          [13, 14, 15],
#          [16, 17, 18]]]])
def interleave_kron(a, N=1):
    n = N + 1
    return np.kron(
        a, np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n))
    )[..., :-N, :-N]
where np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n)) could be precomputed/defaulted once and for all for the sake of performance (see the sketch after the example output below).
and then
>>> interleave_kron(a, N=2)
array([[[[ 1.,  0.,  0.,  2.,  0.,  0.,  3.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 4.,  0.,  0.,  5.,  0.,  0.,  6.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 7.,  0.,  0.,  8.,  0.,  0.,  9.]],

        [[10.,  0.,  0., 11.,  0.,  0., 12.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [13.,  0.,  0., 14.,  0.,  0., 15.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [ 0.,  0.,  0.,  0.,  0.,  0.,  0.],
         [16.,  0.,  0., 17.,  0.,  0., 18.]]]])
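A sketch of that externalization, precomputing the kernel once and reusing it across calls (the function and variable names here are made up for illustration):

def make_kron_kernel(N=1):
    # single 1 in the top-left corner of an (N+1)x(N+1) block
    n = N + 1
    kernel = np.zeros((1, 1, n, n))
    kernel[..., 0, 0] = 1.0
    return kernel

KERNEL_N2 = make_kron_kernel(N=2)  # built once

def interleave_kron_pre(a, kernel, N=1):
    return np.kron(a, kernel)[..., :-N, :-N]

# interleave_kron_pre(a, KERNEL_N2, N=2) should match interleave_kron(a, N=2)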
scipy.sparse.diags allows me to enter multiple diagonal vectors, together with their location, to build a matrix such as
from scipy.sparse import diags
vec = np.ones((5,))
vec2 = vec + 1
diags([vec, vec2], [-2, 2])
I'm looking for an efficient way to do the same but build a dense matrix, instead of DIA. np.diag only supports a single diagonal. What's an efficient way to build a dense matrix from multiple diagonal vectors?
Expected output: the same as np.array(diags([vec, vec2], [-2, 2]).todense())
One way would be to index into the flattened output array using a step of N+1:
import numpy as np
from scipy.sparse import diags
from timeit import timeit
def diags_pp(vecs, offs, dtype=float, N=None):
    if N is None:
        N = len(vecs[0]) + abs(offs[0])
    out = np.zeros((N, N), dtype)
    outf = out.reshape(-1)
    for vec, off in zip(vecs, offs):
        if off < 0:
            outf[-N*off::N+1] = vec
        else:
            outf[off:N*(N-off):N+1] = vec
    return out

def diags_sp(vecs, offs):
    return diags(vecs, offs).A

for N, k in [(10, 2), (100, 20), (1000, 200)]:
    print(N)
    O = np.arange(-k, k)
    D = [np.arange(1, N+1-abs(o)) for o in O]
    for n, f in list(globals().items()):
        if n.startswith('diags_'):
            print(n.replace('diags_', ''), timeit(lambda: f(D, O), number=10000//N)*N)
            if n != 'diags_sp':
                assert np.all(f(D, O) == diags_sp(D, O))
Sample run:
10
pp 0.06757194991223514
sp 1.9529316504485905
100
pp 0.45834919437766075
sp 4.684177896706387
1000
pp 23.397524026222527
sp 170.66762899048626
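Applied to the example from the question, a quick usage sketch (the 7x7 size follows from len(vec) plus the absolute offset):

vec = np.ones((5,))
vec2 = vec + 1
dense = diags_pp([vec, vec2], [-2, 2])
# same values as np.array(diags([vec, vec2], [-2, 2]).todense()), as a 7x7 dense array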
With Paul Panzer's (10,2) case
In [107]: O
Out[107]: array([-2, -1, 0, 1])
In [108]: D
Out[108]:
[array([1, 2, 3, 4, 5, 6, 7, 8]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9])]
The diagonals have different lengths.
sparse.diags converts this to a sparse.dia_matrix:
In [109]: M = sparse.diags(D,O)
In [110]: M
Out[110]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 36 stored elements (4 diagonals) in DIAgonal format>
In [111]: M.data
Out[111]:
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 0., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
Here the ragged list of diagonals has been converted to a padded 2d array. This can be a convenient way of specifying the diagonals, but it isn't particularly efficient. It has to be converted to csr format for most calculations:
In [112]: timeit sparse.diags(D,O)
99.8 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [113]: timeit sparse.diags(D,O, format='csr')
371 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using np.diag I can construct the same array with an iteration
np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
In [117]: timeit np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
39.3 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with Paul's function:
In [120]: timeit diags_pp(D,O)
12.3 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The key step in np.diag is a flat assignment:
res[:n-k].flat[i::n+1] = v
This is essentially the same as Paul's outf assignments. So the functionality is basically the same, assigning each diagonal via a slice. Paul streamlines it.
Creating the M.data array (Out[111]) also requires copying the D arrays to a 2d array - but with different slices.
I have a numpy array like
m = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 1, 1]])
(m == 1).argmax(0) will give array([0, 2, 4, 1, 1]). Is there any similar function to get both the minimum index and the maximum index of the 1s in each column? i.e.
array([[ nan, 2., 4., 1., 1.], [ nan, 3., 4., 4., 4.]])
One approach would be -
mask = m==1
mask_cumsum = mask.cumsum(0)
valid_mask = mask.any(0)
min_idx = (mask_cumsum==1).argmax(0)
max_idx = mask_cumsum.argmax(0)
min_max_idx = np.vstack((min_idx, max_idx))
out = np.where(valid_mask, min_max_idx, np.nan)
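Wrapped into a function and checked against the example m from the question (a sketch; the function name is made up):

import numpy as np

def min_max_one_idx(m):
    mask = m == 1
    csum = mask.cumsum(0)
    min_idx = (csum == 1).argmax(0)   # first row where the cumulative count hits 1
    max_idx = csum.argmax(0)          # first row where the cumulative count reaches its column maximum
    return np.where(mask.any(0), np.vstack((min_idx, max_idx)), np.nan)

min_max_one_idx(m)
# array([[nan,  2.,  4.,  1.,  1.],
#        [nan,  3.,  4.,  4.,  4.]])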
It's not a single function but you can wrap it into one and it should give the expected return:
def argwhere_both_sides_with_nan(arr):
    ones = arr == 1
    min_r = ones[::-1].argmax(0)  # index of the first 1 counted from the bottom (rows reversed)
    max_n = ones.argmax(0)        # index of the first 1 counted from the top
    res = np.array([max_n, arr.shape[0] - 1 - min_r], dtype=float)
    res[:, ~ones.any(0)] = np.nan  # columns with no 1 at all
    return res
>>> argwhere_both_sides_with_nan(m)
array([[nan,  2.,  4.,  1.,  1.],
       [nan,  3.,  4.,  4.,  4.]])
In case you have numba you could use it here, which could speed this up a bit:
import numba as nb
import numpy as np

@nb.njit
def first_and_last_index(arr, val):
    rows, cols = arr.shape
    ret = np.full((2, cols), np.nan)
    for row_idx in range(rows):
        for col_idx in range(cols):
            if arr[row_idx, col_idx] == val:
                if np.isnan(ret[0, col_idx]):
                    ret[0, col_idx] = row_idx  # first occurrence in this column
                ret[1, col_idx] = row_idx      # keeps updating -> last occurrence
    return ret
>>> first_and_last_index(m, 1)
array([[nan,  2.,  4.,  1.,  1.],
       [nan,  3.,  4.,  4.,  4.]])
Timings:
@nb.njit
def first_and_last_index(arr, val):
    rows, cols = arr.shape
    ret = np.full((2, cols), np.nan)
    for row_idx in range(rows):
        for col_idx in range(cols):
            if arr[row_idx, col_idx] == val:
                if np.isnan(ret[0, col_idx]):
                    ret[0, col_idx] = row_idx
                ret[1, col_idx] = row_idx
    return ret

def argwhere_both_sides_with_nan(arr):
    ones = arr == 1
    min_r = ones[::-1].argmax(0)  # index of the first 1 counted from the bottom
    max_n = ones.argmax(0)
    res = np.array([max_n, arr.shape[0] - 1 - min_r], dtype=float)
    res[:, ~ones.any(0)] = np.nan
    return res

def divakars(m):
    mask = m == 1
    mask_cumsum = mask.cumsum(0)
    valid_mask = mask.any(0)
    min_idx = (mask_cumsum == 1).argmax(0)
    max_idx = mask_cumsum.argmax(0)
    min_max_idx = np.vstack((min_idx, max_idx))
    out = np.where(valid_mask, min_max_idx, np.nan)
    return out
m = np.array([[0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1],
              [0, 1, 0, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 1]])
np.testing.assert_array_equal(first_and_last_index(m, 1), argwhere_both_sides_with_nan(m))
np.testing.assert_array_equal(first_and_last_index(m, 1), divakars(m))
%timeit first_and_last_index(m, 1)
# 6.77 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit argwhere_both_sides_with_nan(m)
# 121 µs ± 3.57 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit divakars(m)
# 138 µs ± 4.24 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
m = (np.random.random((1000, 1000)) > 0.5).astype(np.int64)
np.testing.assert_array_equal(first_and_last_index(m, 1), argwhere_both_sides_with_nan(m))
np.testing.assert_array_equal(first_and_last_index(m, 1), divakars(m))
%timeit first_and_last_index(m, 1)
# 10 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit argwhere_both_sides_with_nan(m)
# 12.8 ms ± 393 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit divakars(m)
# 67.2 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another solution, by flipping the array:
import numpy as np
a = list((m == 1).argmax(0))
b = (np.flipud(m) == 1).argmax(0)
# a is your first output
c = [len(m) - elt - 1 for elt in b]
# c is your second output
Current output :
a = [0, 2, 4, 1, 1]
c = [4, 3, 4, 4, 4]
There is just a problem with the all-zero column left to fix; a sketch of one fix follows below.
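A sketch of how that all-zero-column case could be patched up on top of this approach, using np.where to put nan where a column contains no 1:

import numpy as np

has_one = (m == 1).any(0)
a = np.where(has_one, (m == 1).argmax(0), np.nan)
c = np.where(has_one, len(m) - 1 - (np.flipud(m) == 1).argmax(0), np.nan)
# a -> [nan, 2., 4., 1., 1.],  c -> [nan, 3., 4., 4., 4.]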
I have a 2D NumPy array of shape (w, h). I want to add a third dimension with a size greater than 1 and copy the values along it. I was hoping broadcasting would do it, but it can't. This is how I'm doing it:
arr = np.expand_dims(arr, axis=2)
arr = np.concatenate((arr,arr,arr), axis=2)
Is there a faster way to do this?
You can push all dims forward, introducing a singleton dim/new axis as the last dim to create a 3D array and then repeat three times along that one with np.repeat, like so -
arr3D = np.repeat(arr[...,None],3,axis=2)
Here's another approach using np.tile -
arr3D = np.tile(arr[...,None],3)
Another approach that works:
x_train = np.stack((x_train,) * 3, axis=-1)
This is especially helpful when converting a single-channel grayscale matrix into a 3-channel matrix.
import numpy as np
import matplotlib.pyplot as plt

img3 = np.zeros((gray.shape[0], gray.shape[1], 3))
img3[:, :, 0] = gray
img3[:, :, 1] = gray
img3[:, :, 2] = gray

fig = plt.figure(figsize=(15, 15))
plt.imshow(img3)
Another simple approach is to multiply by an array of ones; broadcasting then copies the values across the new dimension:
a=np.random.randn(4,4) #a.shape = (4,4)
a = np.expand_dims(a,-1) #a.shape = (4,4,1)
a = a*np.ones((1,1,3))
a.shape #(4, 4, 3)
I'd suggest using the bare-bones numpy.concatenate(), simply because the piece of code below shows that it's the fastest among all the other suggested answers:
# sample 2D array to work with
In [51]: arr = np.random.random_sample((12, 34))
# promote the array `arr` to 3D and then concatenate along `axis 2`
In [52]: arr3D = np.concatenate([arr[..., np.newaxis]]*3, axis=2)
# verify for desired shape
In [53]: arr3D.shape
Out[53]: (12, 34, 3)
You can see the timings below to convince yourself (ordered best to worst):
In [42]: %timeit -n 100000 np.concatenate([arr[..., np.newaxis]]*3, axis=2)
1.94 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [43]: %timeit -n 100000 np.repeat(arr[..., np.newaxis], 3, axis=2)
4.38 µs ± 46.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [44]: %timeit -n 100000 np.dstack([arr]*3)
5.1 µs ± 57.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [49]: %timeit -n 100000 np.stack([arr]*3, -1)
5.12 µs ± 125 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [46]: %timeit -n 100000 np.tile(arr[..., np.newaxis], 3)
7.13 µs ± 85.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Having said that, if you're looking for the shortest piece of code, then you can use:
# wrap your 2D array in an iterable and then multiply it by the needed depth
arr3D = np.dstack([arr]*3)
# verify shape
print(arr3D.shape)
(12, 34, 3)
This would work too. (I don't think it is the recommended way :-) But maybe it is the closest to what you had in mind.)
np.array([img, img, img]).transpose(1,2,0)
It just stacks the target (img) as many times as you want (3) and makes the channel axis (3) the last axis.
Not sure if I understood correctly, but broadcasting seems to work for me in this case:
>>> a = numpy.array([[1,2], [3,4]])
>>> c = numpy.zeros((4, 2, 2))
>>> c[0] = a
>>> c[1:] = a+1
>>> c
array([[[ 1.,  2.],
        [ 3.,  4.]],

       [[ 2.,  3.],
        [ 4.,  5.]],

       [[ 2.,  3.],
        [ 4.,  5.]],

       [[ 2.,  3.],
        [ 4.,  5.]]])
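If a read-only result is enough, np.broadcast_to gives the broadcast view without copying at all (a sketch; note that the returned view is not writable unless you copy it):

import numpy as np

a = np.array([[1, 2], [3, 4]])
view = np.broadcast_to(a[None, :, :], (4, 2, 2))  # shape (4, 2, 2), no data copied
c = view.copy()  # materialize only if you need to modify individual slices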
So I have this array, right?
a=np.zeros(5)
I want to add values to it at the given indices, where indices can be duplicates.
e.g.
a[[1, 2, 2]] += [1, 2, 3]
I want this to produce array([ 0., 1., 5., 0., 0.]), but the answer I get is array([ 0., 1., 3., 0., 0.]).
I'd like this to work with multidimensional arrays and broadcastable indices and all that. Any ideas?
You need to use np.add.at to get around the buffering issue that you encounter with += (values are not accumulated at repeated indices). Specify the array, the indices, and the values to add in place at those indices:
>>> a = np.zeros(5)
>>> np.add.at(a, [1, 2, 2], [1, 2, 3])
>>> a
array([ 0., 1., 5., 0., 0.])
at is part of other ufuncs too (multiply, divide, and so on). This method will also work for multidimensional arrays.
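For example, on a 2D array with repeated (row, col) index pairs (a quick sketch):

import numpy as np

a = np.zeros((3, 4))
rows = [0, 0, 2]
cols = [1, 1, 3]
np.add.at(a, (rows, cols), [1, 2, 5])
# a[0, 1] accumulates 1 + 2 = 3, and a[2, 3] gets 5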
The operation you are performing can be looked at as binning; to be technically more specific, you are doing weighted binning, with those values being the weights and the indices being the bins. For such a binning operation, you can use np.bincount.
Here's the implementation -
import numpy as np
a=np.zeros(5) # initialize output array
idx = [1, 2, 2] # indices
vals = [1, 2, 3] # values
a[:max(idx)+1] = np.bincount(idx,vals) # finally store the bincounts
Runtime tests
Here are some runtime tests for two sets of input datasizes comparing the proposed bincount based approach and the add.at based approach listed in the other answer:
Datasize #1 -
In [251]: a=np.zeros(1000)
...: idx = np.sort(np.random.randint(1,1000,(500))).tolist()
...: vals = np.random.rand(500).tolist()
...:
In [252]: %timeit np.add.at(a, idx, vals)
10000 loops, best of 3: 63.4 µs per loop
In [253]: %timeit a[:max(idx)+1] = np.bincount(idx,vals)
10000 loops, best of 3: 42.4 µs per loop
Datasize #2 -
In [254]: a=np.zeros(10000)
...: idx = np.sort(np.random.randint(1,10000,(5000))).tolist()
...: vals = np.random.rand(5000).tolist()
...:
In [255]: %timeit np.add.at(a, idx, vals)
1000 loops, best of 3: 597 µs per loop
In [256]: %timeit a[:max(idx)+1] = np.bincount(idx,vals)
1000 loops, best of 3: 404 µs per loop
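A small variation (a sketch): np.bincount also accepts a minlength argument, which avoids slicing into a when the result should keep the full length of the output array:

import numpy as np

a = np.zeros(5)
idx = [1, 2, 2]
vals = [1, 2, 3]
a += np.bincount(idx, vals, minlength=len(a))
# array([0., 1., 5., 0., 0.])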