I need to interleave PyTorch tensors (similar to numpy arrays) with rows and columns of zeros, like this:
Input => [[ 1,2,3],
[ 4,5,6],
[ 7,8,9]]
output => [[ 1,0,2,0,3],
[ 0,0,0,0,0],
[ 4,0,5,0,6],
[ 0,0,0,0,0],
[ 7,0,8,0,9]]
I am using the accepted answer to this question, which proposes the following:
def insert_zeros(a, N=1):
    # a : Input array
    # N : number of zeros to be inserted between consecutive rows and cols
    out = np.zeros((N+1)*np.array(a.shape)-N, dtype=a.dtype)
    out[::N+1, ::N+1] = a
    return out
The answer works perfectly, except that I need to perform this many times on many arrays, and the time it takes has become the bottleneck. It is the step-sized slicing that takes most of the time.
For what it's worth, the matrices I am using it for are 4D, an example size of a matrix is 32x18x16x16 and I am inserting the alternate rows/cols only in the last two dimensions.
So my question is, is there another implementation with the same functionality but with reduced time?
I am not familiar with PyTorch, but to accelerate the code that you provided, I think the JAX library will help a lot. So, if:
import numpy as np
import jax
import jax.numpy as jnp
from functools import partial

a = np.arange(10000).reshape(100, 100)
b = jnp.array(a)

@partial(jax.jit, static_argnums=1)
def new(a, N):
    out = jnp.zeros((N+1)*np.array(a.shape)-N, dtype=a.dtype)
    out = out.at[::N+1, ::N+1].set(a)
    return out
will improve the runtime about 10 times on GPU. It depends on the array size and N (the larger the sizes, the better the performance). You can see benchmarks on my Colab link based on the 4 answers proposed so far (JAX beats the others).
I believe that JAX can be one of the best libraries for your case if you can adapt it to your problem (it is possible).
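As a rough usage sketch (my own addition, not from the original benchmark): the first call compiles the function for the given shape and static N, and block_until_ready() forces JAX's asynchronous dispatch to finish so the timing is fair:
import timeit

out = new(b, 1)
out.block_until_ready()  # first call triggers JIT compilation for this shape and N

t = timeit.timeit(lambda: new(b, 1).block_until_ready(), number=100)
print(f"{t / 100 * 1e3:.3f} ms per call")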
I found a few methods to achieve this result, and the indexing method seems to be consistently the fastest.
There might be some improvement to be made on the other methods, though, because I tried to generalize them from 1D to 2D with an arbitrary number of leading dimensions, and I might not have done it in the best way possible.
Edit: Yet another method using numpy, not faster.
Performance test (CPU):
In [4]: N, C, H, W = 11, 5, 128, 128
...: x = torch.rand(N, C, H, W)
...: k = 3
...:
...: x1 = interleave_index(x, k)
...: x2 = interleave_view(x, k)
...: x3 = interleave_einops(x, k)
...: x4 = interleave_convtranspose(x, k)
...: x5 = interleave_numpy(x, k)
...:
...: assert torch.all(x1 == x2)
...: assert torch.all(x2 == x3)
...: assert torch.all(x3 == x4)
...: assert torch.all(x4 == x5)
...:
...: %timeit interleave_index(x, k)
...: %timeit interleave_view(x, k)
...: %timeit interleave_einops(x, k)
...: %timeit interleave_convtranspose(x, k)
...: %timeit interleave_numpy(x, k)
9.51 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.6 ms ± 4.98 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
62.5 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50.6 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance test (GPU):
(numpy method not tested)
...: ...
...: x = torch.rand(N, C, H, W, device="cuda")
...: ...
260 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
861 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
912 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
429 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Implementations:
import numpy as np
import torch
import torch.nn.functional as F
import einops


def interleave_index(x, k):
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    out = x.new_zeros(*cdims, Hout, Wout)
    out[..., :: k + 1, :: k + 1] = x
    return out


def interleave_view(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/4
    """
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    zeros = [torch.zeros_like(x)] * k
    out = torch.stack([x, *zeros], dim=-1).view(*cdims, Hin, Wout + k)[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = torch.stack([out, *zeros], dim=-2).view(*cdims, Hout + k, Wout)[..., :-k, :]
    return out


def interleave_einops(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/6
    """
    zeros = [torch.zeros_like(x)] * k
    out = einops.rearrange([x, *zeros], "t ... h w -> ... h (w t)")[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = einops.rearrange([out, *zeros], "t ... h w -> ... (h t) w")[..., :-k, :]
    return out


def interleave_convtranspose(x, k):
    """
    From
    https://github.com/pytorch/pytorch/issues/7911#issuecomment-515493009
    """
    C = x.shape[-3]
    weight = x.new_ones(C, 1, 1, 1)
    return F.conv_transpose2d(x, weight=weight, stride=k + 1, groups=C)


def interleave_numpy(x, k):
    """
    From https://stackoverflow.com/a/53179919
    """
    pos = np.repeat(np.arange(1, x.shape[-1]), k)
    out = np.insert(x, pos, 0, axis=-1)
    pos = np.repeat(np.arange(1, x.shape[-2]), k)
    out = np.insert(out, pos, 0, axis=-2)
    return out
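As a quick sanity check (my own addition, not part of the benchmark) on the question's 4D case, e.g. a 32x18x16x16 tensor with k = 1 zeros inserted only along the last two dimensions:
x = torch.rand(32, 18, 16, 16)
y = interleave_index(x, 1)
print(y.shape)                            # torch.Size([32, 18, 31, 31])
assert torch.all(y[..., ::2, ::2] == x)   # original values sit on the stride-2 grid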
Since you know the size of the array in advance, the first step to optimize is to create the out array outside the function. Then, try numba to jit-compile the function and work in place on the out array. This achieves a 5x speedup over the numpy version you posted.
import numpy as np
from numba import njit

@njit
def insert_zeros_n(a, out, N=1):
    # note: the loop below assumes N == 1 (hence the hard-coded stride of 2)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[2*i, 2*j] = a[i, j]
and call it with the specified N and a:
N = 1
a = np.arange(16*16).reshape(16, 16)
out = np.zeros( (N+1)*np.array(a.shape)-N,dtype=a.dtype)
insert_zeros_n(a,out)
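A minimal sketch of a generalization to arbitrary N (my own variation on the loop above, not benchmarked): the stride simply becomes N + 1 instead of the hard-coded 2.
from numba import njit

@njit
def insert_zeros_any_n(a, out, N=1):
    # same idea as insert_zeros_n, but with stride N + 1
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[(N + 1) * i, (N + 1) * j] = a[i, j]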
To encapsulate this for any N, what about using numpy.kron with 4D inputs,
a = np.arange(1, 19).reshape((1, 2, 3, 3))
print(a)
# array([[[[ 1, 2, 3],
# [ 4, 5, 6],
# [ 7, 8, 9]],
#
# [[10, 11, 12],
# [13, 14, 15],
# [16, 17, 18]]]])
def interleave_kron(a, N=1):
    n = N + 1
    return np.kron(
        a, np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n))
    )[..., :-N, :-N]
where np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n)) could be externalized/defaulted once and for all for the sake of performance (see the sketch after the example below),
and then
>>> interleave_kron(a, N=2)
array([[[[ 1., 0., 0., 2., 0., 0., 3.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 4., 0., 0., 5., 0., 0., 6.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 7., 0., 0., 8., 0., 0., 9.]],
[[10., 0., 0., 11., 0., 0., 12.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[13., 0., 0., 14., 0., 0., 15.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[16., 0., 0., 17., 0., 0., 18.]]]])
?
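A minimal sketch of that externalization (my own addition, assuming a fixed N is reused across many calls): build the kernel once and pass it in.
import numpy as np

def make_kron_kernel(N=1):
    n = N + 1
    return np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n))

KERNEL_N2 = make_kron_kernel(N=2)  # built once, reused for every call

def interleave_kron_cached(a, N=2, kernel=KERNEL_N2):
    return np.kron(a, kernel)[..., :-N, :-N]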
Related
What's the most efficient way to fill a scipy.sparse.dok_matrix, based on an input list?
Neither the number of columns nor the number of rows in the dok_matrix is known in advance.
The number of rows is the length of the input list; the number of columns depends on the values within the input list.
The obvious:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    max_cols = 0
    datas = []
    for value in values:
        data = get_data(values)
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = scipy.sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
This has two for loops, a nested for loop, and many len() checks. I can't imagine it being very efficient.
I have also considered:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    cols = 0
    dok_matrix = scipy.sparse.dok_matrix((0, 0))
    for row, value in enumerate(values):
        dok_matrix.resize(row + 1, cols)
        data = get_data(values)
        for col, datum in enumerate(data):
            if col + 1 > cols:
                cols = col + 1
                dok_matrix.resize(row + 1, cols)
            dok_matrix[row, col] = datum
    return dok_matrix
This hugely depends on how efficient scipy.sparse.dok_matrix.resize is, which I couldn't find in the documentation.
Which of these is most efficient?
Is there a better way that I am missing (maybe I can O(1) set an entire row at once?)?
With:
def get_dok_matrix(values):
    max_cols = 0
    datas = []
    for value in values:
        data = value  # changed
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
And
In [13]: values = [[1],[1,2],[1,2,3],[4,5,6,7],[8,9,10,11,12]]
In [14]: dd = get_dok_matrix(values)
In [15]: dd
Out[15]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Dictionary Of Keys format>
In [16]: dd.A
Out[16]:
array([[ 1., 0., 0., 0., 0.],
[ 1., 2., 0., 0., 0.],
[ 1., 2., 3., 0., 0.],
[ 4., 5., 6., 7., 0.],
[ 8., 9., 10., 11., 12.]])
I wish you'd provided a values example, so I wouldn't have to study your code and create one that would work with it.
To make a coo format:
def get_coo_matrix(values):
    data, row, col = [], [], []
    for i, value in enumerate(values):
        n = len(value)
        data.extend(value)
        row.extend([i]*n)
        col.extend(list(range(n)))
    return sparse.coo_matrix((data, (row, col)))
In [18]: M = get_coo_matrix(values)
In [19]: M
Out[19]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in COOrdinate format>
In [20]: M.A
Out[20]:
array([[ 1, 0, 0, 0, 0],
[ 1, 2, 0, 0, 0],
[ 1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 0],
[ 8, 9, 10, 11, 12]])
times:
In [22]: timeit dd = get_dok_matrix(values)
431 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: timeit M = get_coo_matrix(values)
152 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
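If the result really has to be in DOK format, one option (my suggestion, not timed above) is to build the coo_matrix first and convert it, which should still be cheaper than filling the dok_matrix entry by entry in Python:
dok = get_coo_matrix(values).todok()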
I would like to add float coordinates to a numpy array by splitting the intensity among neighbouring pixels, based on the centre of mass of the coordinate.
As an example with integers:
import numpy as np
arr = np.zeros((5, 5), dtype=float)
coord = [2, 2]
arr[coord[0], coord[1]] = 1
arr
>>> array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
However I would like to distribute the intensity across neighbouring pixels when coord is float data, eg. coord = [2.2, 1.7].
I have considered using a gaussian, eg:
grid = np.meshgrid(*[np.arange(i) for i in arr.shape], indexing='ij')
out = np.exp(-np.dstack([(grid[i]-c)**2 for i, c in enumerate(coord)]).sum(axis=-1) / 0.5**2)
which gives good results, but becomes slow for 3d data and thousands of points.
Any advice or ideas would be appreciated, thanks.
Based on @rpoleski's suggestion, take a local region and apply weighting by distance. This is a good idea, although the implementation I have does not maintain the original centre of mass of the coordinates, for example:
from scipy.ndimage import center_of_mass
coord = [2.2, 1.7]
# get region coords
grid = np.meshgrid(*[range(2) for i in coord], indexing='ij')
# Euclidean distance between the region coords and the fractional part of coord
delta = np.linalg.norm(np.dstack([g-(c%1) for g, c, in zip(grid, coord)]), axis=-1)
value = 3 # pixel value of original coord
# create final array by 1/delta, ie. closer is weighted more
# normalise by sum of 1/delta
out = value * (1/delta) / (1/delta).sum()
out.sum()
>>> 3.0 # as expected
# but
center_of_mass(out)
>>> (0.34, 0.63) # should be (0.2, 0.7) in this case, ie. from coord
Any ideas?
Here is a simple (and hence most probably fast enough) solution that keeps the center of mass and has sum = 1:
import numpy as np
from scipy.ndimage import center_of_mass

arr = np.zeros((5, 5), dtype=float)
coord = [2.2, 1.7]
indexes = np.array([[x, y] for x in [int(coord[0]), int(coord[0])+1] for y in [int(coord[1]), int(coord[1])+1]])
values = [1. / (abs(coord[0]-index[0]) * abs(coord[1]-index[1])) for index in indexes]
sum_values = sum(values)
for (value, index) in zip(values, indexes):
    arr[index[0], index[1]] = value / sum_values
print(arr)
print(center_of_mass(arr))
which results in:
[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0.24 0.56 0. 0. ]
[0. 0.06 0.14 0. 0. ]
[0. 0. 0. 0. 0. ]]
(2.2, 1.7)
Note: I'm using taxicab distances - they're good for center of mass calculations.
For anyone needing this functionality, and thanks to @rpoleski, I came up with this, which uses Numba to speed up the calculation.
import numba
import numpy as np

@numba.njit
def _add_floats_to_array_2d(coords, arr, values):
    """
    Distribute float values around neighbouring pixels in array whilst maintaining center of mass.
    Uses Manhattan (taxicab) distances for center of mass calculation.
    This function uses numba to speed up the calculation but is limited to exactly 2D.

    Parameters
    ----------
    coords: (N, ndim) ndarray
        Floats to distribute into array.
    arr: ndim ndarray
        Floats will be distributed into this array.
        Array is modified in place.
    values: (N,) arraylike
        The total value of each coord to distribute into arr.
    """
    # local corner offsets in (row, col) order, matching the arr slice below
    indices_local = np.array([[[0, 0], [0, 1]], [[1, 0], [1, 1]]])
    for i, c in enumerate(coords):
        temp_abs = np.abs(indices_local - np.remainder(c, 1))
        temp = 1.0 / (temp_abs[..., 0] * temp_abs[..., 1])
        # handle perfect integers
        for j in range(temp.shape[0]):
            for k in range(temp.shape[1]):
                if np.isinf(temp[j, k]):
                    temp[j, k] = 0
        arr[int(c[0]) : int(c[0]) + 2, int(c[1]) : int(c[1]) + 2] += (
            values[i] * temp / temp.sum()
        )
Some testing:
arr = np.zeros((256, 256))
coords = np.random.rand(10000, 2) * arr.shape[0]
values = np.ones(len(coords))
%timeit arr[tuple(coords.astype(int).T)] = values
>>> 106 µs ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit _add_floats_to_array_2d(coords, arr, values)
>>> 13.5 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In fact it is better to compare to a buffered function, as the first test will overwrite any previous values instead of accumulating:
%timeit np.add.at(arr, tuple(coords.astype(int).T), values)
>>> 1.23 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
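As a quick correctness check (my own addition, assuming the function above): the distributed intensities should sum to the original values, and for a single coordinate the centre of mass should land back on it.
from scipy.ndimage import center_of_mass

arr_check = np.zeros((5, 5))
_add_floats_to_array_2d(np.array([[2.2, 1.7]]), arr_check, np.array([3.0]))
print(arr_check.sum())            # 3.0 -- the total intensity is conserved
print(center_of_mass(arr_check))  # approximately (2.2, 1.7)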
I have an array of values arr with shape (N,) and an array of coordinates coords with shape (N,2). I want to represent this in an (M,M) array grid such that grid takes the value 0 at coordinates that are not in coords, and for the coordinates that are included it should store the sum of all values in arr that have that coordinate. So if M=3, arr = np.arange(4)+1, and coords = np.array([[0,0,1,2],[0,0,2,2]]) then grid should be:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
The reason this is nontrivial is that I need to be able to repeat this step many times and the values in arr change each time, and so can the coordinates. Ideally I am looking for a vectorized solution. I suspect that I might be able to use np.where somehow but it's not immediately obvious how.
Timing the solutions
I have timed the solutions present at this time and it appears that the accumulator method is slightly faster than the sparse matrix method, with the second accumulation method being the slowest for the reasons explained in the comments:
%timeit for x in range(100): accumulate_arr(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): accumulate_arr_v2(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): sparse.coo_matrix((np.random.normal(0,1,10000),np.random.randint(100,size=(2,10000))),(100,100)).A
47.3 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
48.2 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
One way would be to create a sparse.coo_matrix and convert that to dense:
from scipy import sparse
sparse.coo_matrix((arr,coords),(M,M)).A
# array([[3, 0, 0],
# [0, 0, 3],
# [0, 0, 4]])
With np.bincount -
def accumulate_arr(coords, arr):
    # Get output array shape
    m, n = coords.max(1)+1
    # Get linear indices to be used as IDs with bincount
    lidx = np.ravel_multi_index(coords, (m, n))
    # Or lidx = coords[0]*(coords[1].max()+1) + coords[1]
    # Accumulate arr with IDs from lidx
    return np.bincount(lidx, arr, minlength=m*n).reshape(m, n)
Sample run -
In [58]: arr
Out[58]: array([1, 2, 3, 4])
In [59]: coords
Out[59]:
array([[0, 0, 1, 2],
[0, 0, 2, 2]])
In [60]: accumulate_arr(coords, arr)
Out[60]:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
Another with np.add.at on similar lines and might be easier to follow -
def accumulate_arr_v2(coords, arr):
    m, n = coords.max(1)+1
    out = np.zeros((m, n), dtype=arr.dtype)
    np.add.at(out, tuple(coords), arr)
    return out
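Both versions infer the grid shape from coords.max(1), whereas the question fixes the grid at (M, M) up front. A small sketch (my own variation) that passes M explicitly and uses minlength to pad the bincount output to the full grid:
import numpy as np

def accumulate_arr_fixed(coords, arr, M):
    # linear indices into the fixed M x M grid
    lidx = np.ravel_multi_index(coords, (M, M))
    return np.bincount(lidx, arr, minlength=M*M).reshape(M, M)

# accumulate_arr_fixed(np.array([[0, 0, 1, 2], [0, 0, 2, 2]]), np.arange(4)+1, M=3)
# reproduces the 3x3 example output above.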
scipy.sparse.diags allows me to enter multiple diagonal vectors, together with their location, to build a matrix such as
import numpy as np
from scipy.sparse import diags

vec = np.ones((5,))
vec2 = vec + 1
diags([vec, vec2], [-2, 2])
I'm looking for an efficient way to do the same but build a dense matrix, instead of DIA. np.diag only supports a single diagonal. What's an efficient way to build a dense matrix from multiple diagonal vectors?
Expected output: the same as np.array(diags([vec, vec2], [-2, 2]).todense())
One way would be to index into the flattened output array using a step of N+1:
import numpy as np
from scipy.sparse import diags
from timeit import timeit

def diags_pp(vecs, offs, dtype=float, N=None):
    if N is None:
        N = len(vecs[0]) + abs(offs[0])
    out = np.zeros((N, N), dtype)
    outf = out.reshape(-1)
    for vec, off in zip(vecs, offs):
        if off < 0:
            outf[-N*off::N+1] = vec
        else:
            outf[off:N*(N-off):N+1] = vec
    return out

def diags_sp(vecs, offs):
    return diags(vecs, offs).A

for N, k in [(10, 2), (100, 20), (1000, 200)]:
    print(N)
    O = np.arange(-k, k)
    D = [np.arange(1, N+1-abs(o)) for o in O]
    for n, f in list(globals().items()):
        if n.startswith('diags_'):
            print(n.replace('diags_', ''), timeit(lambda: f(D, O), number=10000//N)*N)
            if n != 'diags_sp':
                assert np.all(f(D, O) == diags_sp(D, O))
Sample run:
10
pp 0.06757194991223514
sp 1.9529316504485905
100
pp 0.45834919437766075
sp 4.684177896706387
1000
pp 23.397524026222527
sp 170.66762899048626
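For the exact vectors from the question (a sketch, assuming the imports and the diags_pp definition above), the dense result matches scipy's:
vec = np.ones((5,))
vec2 = vec + 1
dense = diags_pp([vec, vec2], [-2, 2])
assert np.array_equal(dense, diags([vec, vec2], [-2, 2]).toarray())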
With Paul Panzer's (10,2) case
In [107]: O
Out[107]: array([-2, -1, 0, 1])
In [108]: D
Out[108]:
[array([1, 2, 3, 4, 5, 6, 7, 8]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
array([1, 2, 3, 4, 5, 6, 7, 8, 9])]
The diagonals have different lengths.
sparse.diags converts this to a sparse.dia_matrix:
In [109]: M = sparse.diags(D,O)
In [110]: M
Out[110]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 36 stored elements (4 diagonals) in DIAgonal format>
In [111]: M.data
Out[111]:
array([[ 1., 2., 3., 4., 5., 6., 7., 8., 0., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 0.],
[ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.],
[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.]])
Here the ragged list of diagonals has been converted to a padded 2d array. This can be a convenient way of specifying the diagonals, but it isn't particularly efficient. It has to be converted to csr format for most calculations:
In [112]: timeit sparse.diags(D,O)
99.8 µs ± 3.65 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [113]: timeit sparse.diags(D,O, format='csr')
371 µs ± 155 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Using np.diag I can construct the same array with an iteration
np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
In [117]: timeit np.add.reduce([np.diag(v,k) for v,k in zip(D,O)])
39.3 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
and with Paul's function:
In [120]: timeit diags_pp(D,O)
12.3 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The key step in np.diag is a flat assignment:
res[:n-k].flat[i::n+1] = v
This is essentially the same as Paul's outf assignments. So the functionality is basically the same, assigning each diagonal via a slice. Paul streamlines it.
Creating the M.data array (Out[111]) also requires copying the D arrays to a 2d array - but with different slices.
I have a 70x70 numpy ndarray which is mainly diagonal. The only off-diagonal values are below the diagonal. I would like to make the matrix symmetric.
As a newcomer from the MATLAB world, I can't get it working without for loops. In MATLAB it was easy:
W = max(A, A')
where A' is the matrix transpose and the max() function takes care of making W symmetric.
Is there an elegant way to do so in Python as well?
EXAMPLE
The sample A matrix is:
1 0 0 0
0 2 0 0
1 0 2 0
0 1 0 3
The desired output matrix W is:
1 0 1 0
0 2 0 1
1 0 2 0
0 1 0 3
I found the following solution, which works for me:
import numpy as np
W = np.maximum( A, A.transpose() )
Use the NumPy tril and triu functions as follows. It essentially "mirrors" elements in the lower triangle into the upper triangle.
import numpy as np
A = np.array([[1, 0, 0, 0], [0, 2, 0, 0], [1, 0, 2, 0], [0, 1, 0, 3]])
W = np.tril(A) + np.triu(A.T, 1)
tril(m, k=0) gets the lower triangle of a matrix m (returns a copy of the matrix m with all elements above the kth diagonal zeroed). Similarly, triu(m, k=0) gets the upper triangle of a matrix m (all elements below the kth diagonal zeroed).
To prevent the diagonal being added twice, one must exclude the diagonal from one of the triangles, using either np.tril(A) + np.triu(A.T, 1) or np.tril(A, -1) + np.triu(A.T).
Also note that this behaves slightly differently to using maximum. All elements in the upper triangle are overwritten, regardless of whether they are the maximum or not. This means they can be any value (e.g. nan or inf).
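A tiny illustration of that difference (my own example): with a negative value below the diagonal, maximum keeps the larger zero from the upper triangle, while the tril/triu approach mirrors the value as intended.
B = np.array([[1., 0.], [-5., 2.]])
print(np.maximum(B, B.T))            # [[ 1.  0.] [ 0.  2.]] -- the -5 is lost
print(np.tril(B) + np.triu(B.T, 1))  # [[ 1. -5.] [-5.  2.]] -- mirrored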
For what it is worth, using the numpy equivalent of the MATLAB expression you mentioned is more efficient than the method in the link @plonser added.
In [1]: import numpy as np
In [2]: A = np.zeros((4, 4))
In [3]: np.fill_diagonal(A, np.arange(4)+1)
In [4]: A[2:,:2] = np.eye(2)
# numpy equivalent to MATLAB:
In [5]: %timeit W = np.maximum( A, A.T)
100000 loops, best of 3: 2.95 µs per loop
# method from link
In [6]: %timeit W = A + A.T - np.diag(A.diagonal())
100000 loops, best of 3: 9.88 µs per loop
Timing for larger matrices can be done similarly:
In [1]: import numpy as np
In [2]: N = 100
In [3]: A = np.zeros((N, N))
In [4]: A[2:,:N-2] = np.eye(N-2)
In [5]: np.fill_diagonal(A, np.arange(N)+1)
In [6]: print A
Out[6]:
array([[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 2., 0., ..., 0., 0., 0.],
[ 1., 0., 3., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 98., 0., 0.],
[ 0., 0., 0., ..., 0., 99., 0.],
[ 0., 0., 0., ..., 1., 0., 100.]])
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
10000 loops, best of 3: 28.6 µs per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
10000 loops, best of 3: 49.8 µs per loop
And with N = 1000
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
100 loops, best of 3: 5.65 ms per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
100 loops, best of 3: 11.7 ms per loop