I would like to add float coordinates to a numpy array by splitting the intensity based on the centre of mass of the coordinate to neighbouring pixels.
As an example with integers:
import numpy as np
arr = np.zeros((5, 5), dtype=float)
coord = [2, 2]
arr[coord[0], coord[1]] = 1
arr
>>> array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]])
However I would like to distribute the intensity across neighbouring pixels when coord is float data, eg. coord = [2.2, 1.7].
I have considered using a gaussian, eg:
grid = np.meshgrid(*[np.arange(i) for i in arr.shape], indexing='ij')
out = np.exp(-np.dstack([(grid[i]-c)**2 for i, c in enumerate(coord)]).sum(axis=-1) / 0.5**2)
which gives good results, but becomes slow for 3d data and thousands of points.
Any advice or ideas would be appreciated, thanks.
Based on @rpoleski's suggestion, I take a local region and apply weighting by distance. This is a good idea, although the implementation I have does not preserve the original centre of mass of the coordinates, for example:
from scipy.ndimage import center_of_mass
coord = [2.2, 1.7]
# get region coords
grid = np.meshgrid(*[range(2) for i in coord], indexing='ij')
# Euclidean distance between each local grid point and the fractional part of coord
delta = np.linalg.norm(np.dstack([g-(c%1) for g, c, in zip(grid, coord)]), axis=-1)
value = 3 # pixel value of original coord
# create final array by 1/delta, ie. closer is weighted more
# normalise by sum of 1/delta
out = value * (1/delta) / (1/delta).sum()
out.sum()
>>> 3.0 # as expected
# but
center_of_mass(out)
>>> (0.34, 0.63) # should be (0.2, 0.7) in this case, ie. from coord
Any ideas?
Here is a simple (and hence most probably fast enough) solution that keeps the center of mass and has sum = 1:
arr = np.zeros((5, 5), dtype=float)
coord = [2.2, 1.7]
indexes = np.array([[x, y] for x in [int(coord[0]), int(coord[0])+1] for y in [int(coord[1]), int(coord[1])+1]])
values = [1. / (abs(coord[0]-index[0]) * abs(coord[1]-index[1])) for index in indexes]
sum_values = sum(values)
for (value, index) in zip(values, indexes):
    arr[index[0], index[1]] = value / sum_values
print(arr)
print(center_of_mass(arr))
which results in:
[[0. 0. 0. 0. 0. ]
[0. 0. 0. 0. 0. ]
[0. 0.24 0.56 0. 0. ]
[0. 0.06 0.14 0. 0. ]
[0. 0. 0. 0. 0. ]]
(2.2, 1.7)
Note: I'm using taxicab distances - they're good for center of mass calculations.
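A side note on these weights: once normalised, the inverse-product weights 1/(|dx|*|dy|) come out as the same numbers as the usual bilinear weights (1-|dx|)*(1-|dy|). The sketch below checks this for the fractional part (0.2, 0.7) of the example coordinate; the names `frac`, `offsets`, `inverse` and `bilinear` are mine and not part of the answer.

```python
import numpy as np

frac = np.array([0.2, 0.7])                            # fractional part of coord = [2.2, 1.7]
offsets = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # (row, col) offsets of the 4 neighbours

dist = np.abs(offsets - frac)          # per-axis distance from coord to each neighbour
inverse = 1.0 / dist.prod(axis=1)      # inverse-product weights, as in the answer
inverse /= inverse.sum()               # normalise so the weights sum to 1
bilinear = (1.0 - dist).prod(axis=1)   # bilinear weights, which already sum to 1

print(inverse.reshape(2, 2))
print(bilinear.reshape(2, 2))
```

Both print the same 2x2 block of weights. The bilinear form has no division, so it stays well defined when a coordinate is an exact integer.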
For anyone needing this functionality, and thanks to @rpoleski, I came up with this, which uses Numba to speed up the calculation.
import numba

@numba.njit
def _add_floats_to_array_2d(coords, arr, values):
    """
    Distribute float values around neighbouring pixels in array whilst maintaining center of mass.
    Uses Manhattan (taxicab) distances for center of mass calculation.
    This function uses numba to speed up the calculation but is limited to exactly 2D.

    Parameters
    ----------
    coords: (N, ndim) ndarray
        Floats to distribute into array.
    arr: ndim ndarray
        Floats will be distributed into this array.
        Array is modified in place.
    values: (N,) arraylike
        The total value of each coord to distribute into arr.
    """
    # indices_local[j, k] == [j, k]: the (row, col) offset of each pixel
    # in the 2x2 neighbourhood, so temp[j, k] lines up with arr[row + j, col + k]
    indices_local = np.array([[[0, 0], [0, 1]], [[1, 0], [1, 1]]])
    for i, c in enumerate(coords):
        temp_abs = np.abs(indices_local - np.remainder(c, 1))
        temp = 1.0 / (temp_abs[..., 0] * temp_abs[..., 1])
        # handle perfect integers
        for j in range(temp.shape[0]):
            for k in range(temp.shape[1]):
                if np.isinf(temp[j, k]):
                    temp[j, k] = 0
        arr[int(c[0]) : int(c[0]) + 2, int(c[1]) : int(c[1]) + 2] += (
            values[i] * temp / temp.sum()
        )
Some testing:
arr = np.zeros((256, 256))
coords = np.random.rand(10000, 2) * arr.shape[0]
values = np.ones(len(coords))
%timeit arr[tuple(coords.astype(int).T)] = values
>>> 106 µs ± 4.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit _add_floats_to_array_2d(coords, arr, values)
>>> 13.5 ms ± 546 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In fact it is better to compare to an unbuffered function such as np.add.at, as the first test will overwrite any previous values instead of accumulating:
%timeit np.add.at(arr, tuple(coords.astype(int).T), values)
>>> 1.23 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
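Since the question mentions 3D data, here is a hedged sketch of how the same splitting could be vectorised for any number of dimensions with np.add.at. It uses the product form of the weights, (1 - distance) per axis, which normalises to the same values as the inverse-product weights. The function name `splat_linear_nd` is hypothetical, and the sketch assumes every coordinate lies in [0, shape - 1] along each axis.

```python
import itertools
import numpy as np

def splat_linear_nd(coords, values, shape):
    """Distribute each value over the 2**ndim pixels surrounding its float coordinate."""
    coords = np.asarray(coords, dtype=float)   # (N, ndim)
    values = np.asarray(values, dtype=float)   # (N,)
    upper = np.asarray(shape) - 1
    out = np.zeros(shape, dtype=float)
    base = np.floor(coords).astype(int)        # lower corner of each neighbourhood
    frac = coords - base                       # fractional part, in [0, 1)
    for corner in itertools.product((0, 1), repeat=coords.shape[1]):
        corner = np.asarray(corner)
        # per-axis weight: frac for the upper pixel, 1 - frac for the lower pixel
        weight = np.where(corner == 1, frac, 1.0 - frac).prod(axis=1)
        # clip so an exact coordinate on the last pixel stays in bounds (its weight is 0)
        idx = np.minimum(base + corner, upper)
        np.add.at(out, tuple(idx.T), values * weight)   # accumulates repeated indices
    return out

arr = splat_linear_nd([[2.2, 1.7]], [1.0], (5, 5))
print(arr)
```

Each point costs 2**ndim scattered additions, and exact-integer coordinates need no special handling because no division is involved.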
I need to alternate Pytorch Tensors (similar to numpy arrays) with rows and columns of zeros. Like this:
Input => [[ 1,2,3],
[ 4,5,6],
[ 7,8,9]]
output => [[ 1,0,2,0,3],
[ 0,0,0,0,0],
[ 4,0,5,0,6],
[ 0,0,0,0,0],
[ 7,0,8,0,9]]
I am using the accepted answer in this question that proposes the following
def insert_zeros(a, N=1):
    # a : Input array
    # N : number of zeros to be inserted between consecutive rows and cols
    out = np.zeros((N+1)*np.array(a.shape)-N, dtype=a.dtype)
    out[::N+1, ::N+1] = a
    return out
The answer works perfectly, except that I need to perform this many times on many arrays and the time it takes has become the bottleneck. It is the step-sized slicing that takes most of the time.
For what it's worth, the matrices I am using it for are 4D, an example size of a matrix is 32x18x16x16 and I am inserting the alternate rows/cols only in the last two dimensions.
So my question is, is there another implementation with the same functionality but with reduced time?
I am not familiar with PyTorch, but to accelerate the code that you provided, I think the JAX library will help a lot. So, if:
import numpy as np
import jax
import jax.numpy as jnp
from functools import partial
a = np.arange(10000).reshape(100, 100)
b = jnp.array(a)
@partial(jax.jit, static_argnums=1)
def new(a, N):
    out = jnp.zeros((N+1)*np.array(a.shape)-N, dtype=a.dtype)
    out = out.at[::N+1, ::N+1].set(a)
    return out
will improve the runtime about 10 times on GPU. It depends on the array size and N (the larger the sizes, the better the performance). You can see benchmarks on my Colab link, based on the 4 answers proposed so far (JAX beats the others).
I believe that JAX can be one of the best libraries for your case, if you can adapt it to your problem (which is possible).
I found a few methods to achieve this result, and the indexing method seems to be consistently the fastest.
There might be some improvement to be made to the other methods though, because I tried to generalize them from 1D to 2D with an arbitrary number of leading dimensions, and might not have done it in the best way possible.
Edit: Yet another method using numpy, not faster.
Performance test (CPU):
In [4]: N, C, H, W = 11, 5, 128, 128
...: x = torch.rand(N, C, H, W)
...: k = 3
...:
...: x1 = interleave_index(x, k)
...: x2 = interleave_view(x, k)
...: x3 = interleave_einops(x, k)
...: x4 = interleave_convtranspose(x, k)
...: x5 = interleave_numpy(x, k)
...:
...: assert torch.all(x1 == x2)
...: assert torch.all(x2 == x3)
...: assert torch.all(x3 == x4)
...: assert torch.all(x4 == x5)
...:
...: %timeit interleave_index(x, k)
...: %timeit interleave_view(x, k)
...: %timeit interleave_einops(x, k)
...: %timeit interleave_convtranspose(x, k)
...: %timeit interleave_numpy(x, k)
9.51 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
12.6 ms ± 4.98 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.3 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
62.5 ms ± 19.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
50.6 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance test (GPU):
(numpy method not tested)
...: ...
...: x = torch.rand(N, C, H, W, device="cuda")
...: ...
260 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
861 µs ± 6.77 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
912 µs ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
429 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Implementations:
import numpy as np
import torch
import torch.nn.functional as F
import einops

def interleave_index(x, k):
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    out = x.new_zeros(*cdims, Hout, Wout)
    out[..., :: k + 1, :: k + 1] = x
    return out

def interleave_view(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/4
    """
    *cdims, Hin, Win = x.shape
    Hout = (k + 1) * (Hin - 1) + 1
    Wout = (k + 1) * (Win - 1) + 1
    zeros = [torch.zeros_like(x)] * k
    out = torch.stack([x, *zeros], dim=-1).view(*cdims, Hin, Wout + k)[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = torch.stack([out, *zeros], dim=-2).view(*cdims, Hout + k, Wout)[..., :-k, :]
    return out

def interleave_einops(x, k):
    """
    From
    https://discuss.pytorch.org/t/how-to-interleave-two-tensors-along-certain-dimension/11332/6
    """
    zeros = [torch.zeros_like(x)] * k
    out = einops.rearrange([x, *zeros], "t ... h w -> ... h (w t)")[..., :-k]
    zeros = [torch.zeros_like(out)] * k
    out = einops.rearrange([out, *zeros], "t ... h w -> ... (h t) w")[..., :-k, :]
    return out

def interleave_convtranspose(x, k):
    """
    From
    https://github.com/pytorch/pytorch/issues/7911#issuecomment-515493009
    """
    C = x.shape[-3]
    weight = x.new_ones(C, 1, 1, 1)
    return F.conv_transpose2d(x, weight=weight, stride=k+1, groups=C)

def interleave_numpy(x, k):
    """
    From https://stackoverflow.com/a/53179919
    """
    pos = np.repeat(np.arange(1, x.shape[-1]), k)
    out = np.insert(x, pos, 0, axis=-1)
    pos = np.repeat(np.arange(1, x.shape[-2]), k)
    out = np.insert(out, pos, 0, axis=-2)
    return out
Since you know the size of the array in advance, the first optimization step is to create the out array outside the function. Then, try numba to jit-compile the function and work in-place on the out array. This achieves a 5X speedup over the numpy version you posted.
import numpy as np
from numba import njit
@njit
def insert_zeros_n(a, out, N=1):
    # copy each element of a to every (N+1)-th row and column of out
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[(N + 1) * i, (N + 1) * j] = a[i, j]
and call it with the specified N and a:
N = 1
a = np.arange(16*16).reshape(16, 16)
out = np.zeros( (N+1)*np.array(a.shape)-N,dtype=a.dtype)
insert_zeros_n(a,out)
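For the 4D case in the question (zeros inserted only in the last two dimensions), the same preallocation idea can also be tried in plain NumPy. This is a minimal sketch, not a benchmarked solution: the helper name `insert_zeros_last2` is hypothetical, and a reused `out` buffer is assumed to have the right shape and to hold zeros in the positions that are not overwritten.

```python
import numpy as np

def insert_zeros_last2(a, N=1, out=None):
    """Insert N zero rows/cols between entries of the last two axes of a."""
    *lead, h, w = a.shape
    shape = (*lead, (N + 1) * (h - 1) + 1, (N + 1) * (w - 1) + 1)
    if out is None:
        out = np.zeros(shape, dtype=a.dtype)
    # only the strided positions are written; the zeros in between are left untouched
    out[..., :: N + 1, :: N + 1] = a
    return out

x = np.random.rand(32, 18, 16, 16)
buf = np.zeros((32, 18, 31, 31))            # allocated once, reused across calls
y = insert_zeros_last2(x, N=1, out=buf)
print(y.shape)
```

Reusing `buf` avoids repeating the zero-fill on every call, which only helps if the strided positions are the same each time.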
Encapsulated for any N, what about using numpy.kron with 4D inputs,
a = np.arange(1, 19).reshape((1, 2, 3, 3))
print(a)
# array([[[[ 1, 2, 3],
# [ 4, 5, 6],
# [ 7, 8, 9]],
#
# [[10, 11, 12],
# [13, 14, 15],
# [16, 17, 18]]]])
def interleave_kron(a, N=1):
    n = N + 1
    return np.kron(
        a, np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n))
    )[..., :-N, :-N]
where np.hstack((1, np.zeros(pow(n, 2) - 1))).reshape((1, 1, n, n)) could be externalized/defaulted once for all for the sake of performance.
and then
>>> interleave_kron(a, N=2)
array([[[[ 1., 0., 0., 2., 0., 0., 3.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 4., 0., 0., 5., 0., 0., 6.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 7., 0., 0., 8., 0., 0., 9.]],
[[10., 0., 0., 11., 0., 0., 12.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[13., 0., 0., 14., 0., 0., 15.],
[ 0., 0., 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0., 0., 0.],
[16., 0., 0., 17., 0., 0., 18.]]]])
?
I have been tasked with applying a "1-2-1" filter to a numpy array and returning an array of the filtered data, without using for loops, while loops or list comprehensions.
The "1-2-1" filter maps each point of data to the average of itself twice and its neighbors. For example, if at some point the data contained ...1, 4, 3... then after applying the "1-2-1" filter the 4 would be replaced with (1 + 4 + 4 + 3) / 4 = 12 / 4 = 3.
For example the numpy array [1, 1, 4, 3, 2]
Would after the filter is applied produce a numpy array [1. 1.75 3. 3. 2. ]
Since the end points of the data do not have two neighbors the 1-2-1 filter is only applied to the internal len(data) - 2 points, leaving the end points unchanged.
Essentially, I need to access the values before and after a given point in a vectorized numpy operation, for an array that could be of any length. As much as I have googled, I cannot work out how.
Pandas solution
import pandas as pd
l = [1, 1, 4, 3, 2]  # sample data from the question
s = pd.Series(l)
>>> s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
array([1. , 1.75, 3. , 3. , 2. ])
Step by step:
>>> s.rolling(3, center=True).sum().values
array([nan, 6., 8., 9., nan])
>>> s.rolling(3, center=True).sum().add(s).values
array([nan, 7., 12., 12., nan])
>>> s.rolling(3, center=True).sum().add(s).div(4).values
array([ nan, 1.75, 3. , 3. , nan])
>>> s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
array([1. , 1.75, 3. , 3. , 2. ])
Numpy solution
a = np.array(l)
>>> np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
Step by step:
>>> np.convolve(a, [1, 2, 1], mode="valid")
array([ 7, 12, 12])
>>> np.convolve(a, [1, 2, 1], mode="valid") / 4
array([1.75, 3. , 3. ])
>>> np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
array([1. , 1.75, 3. , 3. , 2. ])
Performance
%timeit s.rolling(3, center=True).sum().add(s).div(4).fillna(s).values
706 µs ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.concatenate([a[:1], np.convolve(a, [1, 2, 1], mode="valid") / 4, a[-1:]])
10.8 µs ± 22.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
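Another loop-free option is plain slicing, which gives direct access to the values before and after each point. A minimal sketch, assuming the end points are simply copied through as the question specifies; the function name `filter_121` is my own.

```python
import numpy as np

def filter_121(data):
    """Apply the 1-2-1 filter to interior points, leaving both end points unchanged."""
    data = np.asarray(data, dtype=float)
    out = data.copy()
    # data[:-2] is the left neighbour, data[1:-1] the point itself, data[2:] the right neighbour
    out[1:-1] = (data[:-2] + 2.0 * data[1:-1] + data[2:]) / 4.0
    return out

print(filter_121([1, 1, 4, 3, 2]))   # values: 1, 1.75, 3, 3, 2
```

The three slices are views of the same array, so no intermediate copies of the input are needed, and arrays shorter than three elements pass through unchanged.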
You can try something like this:
import numpy as np
from scipy.linalg import circulant
x = np.array([1, 1, 4, 3, 2])
val = np.array([1, 2, 1])
offsets = np.array([0, 1, 2])
col0 = np.zeros(len(x))
col0[offsets] = val
C = circulant(col0).T[:-(len(val) - 1)]
print(C)
C is essentially a circulant matrix which looks like this:
array([[1., 2., 1., 0., 0.],
[0., 1., 2., 1., 0.],
[0., 0., 1., 2., 1.]])
Now you can simply compute the filtered output as follows:
y = (C @ x) / 4
print(y)
# array([1.75, 3. , 3. ])
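The matrix product above returns only the interior points. If the unchanged end points are wanted in the same product, one possibility is a square filter matrix whose first and last rows are identity rows. This is a sketch of that variation, built with np.eye instead of scipy.linalg.circulant; the name `F` is mine.

```python
import numpy as np

x = np.array([1, 1, 4, 3, 2], dtype=float)
n = len(x)

# tridiagonal 1-2-1 band, already divided by 4
F = (np.eye(n, k=-1) + 2.0 * np.eye(n) + np.eye(n, k=1)) / 4.0
# replace the first and last rows with identity rows so the end points pass through
F[0] = 0.0
F[0, 0] = 1.0
F[-1] = 0.0
F[-1, -1] = 1.0

y = F @ x
print(y)   # values: 1, 1.75, 3, 3, 2
```

This builds a dense n-by-n matrix, so it is mainly useful for illustration or for short arrays.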
What's the most efficient way to fill a scipy.sparse.dok_matrix, based on an input list ?
Neither the number of columns or rows in the dok_matrix are known in advance.
The number of rows is the length of the input list, the number of columns depends on the values within the input list.
The obvious:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    max_cols = 0
    datas = []
    for value in values:
        data = get_data(value)
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = scipy.sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
This has two for loops, one of them nested, and many len() checks. I can't imagine it being very efficient.
I have also considered:
def get_dok_matrix(values: List[Any]) -> scipy.sparse.dok_matrix:
    cols = 0
    dok_matrix = scipy.sparse.dok_matrix((0, 0))
    for row, value in enumerate(values):
        dok_matrix.resize(row + 1, cols)
        data = get_data(value)
        for col, datum in enumerate(data):
            if col + 1 > cols:
                cols = col + 1
                dok_matrix.resize(row + 1, cols)
            dok_matrix[row, col] = datum
    return dok_matrix
This hugely depends on how efficient scipy.sparse.dok_matrix.resize is, which I couldn't find in the documentation.
Which of these is most efficient?
Is there a better way that I am missing (maybe I can O(1) set an entire row at once?)?
With:
def get_dok_matrix(values):
    max_cols = 0
    datas = []
    for value in values:
        data = value # changed
        datas.append(data)
        if len(data) > max_cols:
            max_cols = len(data)
    dok_matrix = sparse.dok_matrix((len(values), max_cols))
    for i, data in enumerate(datas):
        for j, datum in enumerate(data):
            dok_matrix[i, j] = datum
    return dok_matrix
And
In [13]: values = [[1],[1,2],[1,2,3],[4,5,6,7],[8,9,10,11,12]]
In [14]: dd = get_dok_matrix(values)
In [15]: dd
Out[15]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 15 stored elements in Dictionary Of Keys format>
In [16]: dd.A
Out[16]:
array([[ 1., 0., 0., 0., 0.],
[ 1., 2., 0., 0., 0.],
[ 1., 2., 3., 0., 0.],
[ 4., 5., 6., 7., 0.],
[ 8., 9., 10., 11., 12.]])
I wish you'd provided a values example, so I wouldn't have to study your code and create one that would work with it.
To make a coo format:
def get_coo_matrix(values):
    data, row, col = [],[],[]
    for i,value in enumerate(values):
        n = len(value)
        data.extend(value)
        row.extend([i]*n)
        col.extend(list(range(n)))
    return sparse.coo_matrix((data,(row,col)))
In [18]: M = get_coo_matrix(values)
In [19]: M
Out[19]:
<5x5 sparse matrix of type '<class 'numpy.int64'>'
with 15 stored elements in COOrdinate format>
In [20]: M.A
Out[20]:
array([[ 1, 0, 0, 0, 0],
[ 1, 2, 0, 0, 0],
[ 1, 2, 3, 0, 0],
[ 4, 5, 6, 7, 0],
[ 8, 9, 10, 11, 12]])
times:
In [22]: timeit dd = get_dok_matrix(values)
431 µs ± 10.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [23]: timeit M = get_coo_matrix(values)
152 µs ± 524 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
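The row and column lists can also be built with array operations instead of per-row list extensions. The following is only a sketch of that idea, not timed against the versions above: the function name `ragged_to_coo` is hypothetical, and it assumes `values` is a non-empty list of 1-D sequences.

```python
import numpy as np
from scipy import sparse

def ragged_to_coo(values):
    """Build a COO matrix whose row i holds values[i], left-aligned and zero-padded."""
    lengths = np.array([len(v) for v in values])
    row = np.repeat(np.arange(len(values)), lengths)
    # column index restarts at 0 at the beginning of every row
    starts = np.cumsum(lengths) - lengths
    col = np.arange(lengths.sum()) - np.repeat(starts, lengths)
    data = np.concatenate([np.asarray(v) for v in values])
    shape = (len(values), int(lengths.max()))
    return sparse.coo_matrix((data, (row, col)), shape=shape)

values = [[1], [1, 2], [1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]]
M = ragged_to_coo(values)
print(M.toarray())
D = M.todok()   # convert only if dictionary-of-keys access is really needed
```

Passing `shape` explicitly keeps the column count correct even if the longest row happens to end in zeros.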
I have an array of values arr with shape (N,) and an array of coordinates coords with shape (N,2). I want to represent this in an (M,M) array grid such that grid takes the value 0 at coordinates that are not in coords, and for the coordinates that are included it should store the sum of all values in arr that have that coordinate. So if M=3, arr = np.arange(4)+1, and coords = np.array([[0,0,1,2],[0,0,2,2]]) then grid should be:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
The reason this is nontrivial is that I need to be able to repeat this step many times and the values in arr change each time, and so can the coordinates. Ideally I am looking for a vectorized solution. I suspect that I might be able to use np.where somehow but it's not immediately obvious how.
Timing the solutions
I have timed the solutions posted so far, and it appears that the accumulator method is slightly faster than the sparse matrix method, with the second accumulation method being the slowest for the reasons explained in the comments:
%timeit for x in range(100): accumulate_arr(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): accumulate_arr_v2(np.random.randint(100,size=(2,10000)),np.random.normal(0,1,10000))
%timeit for x in range(100): sparse.coo_matrix((np.random.normal(0,1,10000),np.random.randint(100,size=(2,10000))),(100,100)).A
47.3 ms ± 1.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
103 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
48.2 ms ± 36 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
One way would be to create a sparse.coo_matrix and convert that to dense:
from scipy import sparse
sparse.coo_matrix((arr,coords),(M,M)).A
# array([[3, 0, 0],
# [0, 0, 3],
# [0, 0, 4]])
With np.bincount -
def accumulate_arr(coords, arr):
    # Get output array shape
    m,n = coords.max(1)+1
    # Get linear indices to be used as IDs with bincount
    lidx = np.ravel_multi_index(coords, (m,n))
    # Or lidx = coords[0]*(coords[1].max()+1) + coords[1]
    # Accumulate arr with IDs from lidx
    return np.bincount(lidx,arr,minlength=m*n).reshape(m,n)
Sample run -
In [58]: arr
Out[58]: array([1, 2, 3, 4])
In [59]: coords
Out[59]:
array([[0, 0, 1, 2],
[0, 0, 2, 2]])
In [60]: accumulate_arr(coords, arr)
Out[60]:
array([[3., 0., 0.],
[0., 0., 3.],
[0., 0., 4.]])
Another with np.add.at on similar lines and might be easier to follow -
def accumulate_arr_v2(coords, arr):
    m,n = coords.max(1)+1
    out = np.zeros((m,n), dtype=arr.dtype)
    np.add.at(out, tuple(coords), arr)
    return out
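Both functions above infer the output shape from the largest coordinate present. The question asks for a fixed (M, M) grid, so here is a hedged variant of the bincount approach that takes M as an argument; the name `accumulate_fixed` is mine, and `coords` is assumed to have shape (2, N) as in the sample run.

```python
import numpy as np

def accumulate_fixed(coords, arr, M):
    """Sum arr into an (M, M) grid at the (row, col) positions given by coords."""
    # flatten each (row, col) pair into one linear index into the M*M grid
    lidx = np.ravel_multi_index(tuple(coords), (M, M))
    # bincount sums the weights that share the same linear index
    return np.bincount(lidx, weights=arr, minlength=M * M).reshape(M, M)

arr = np.arange(4) + 1
coords = np.array([[0, 0, 1, 2], [0, 0, 2, 2]])
print(accumulate_fixed(coords, arr, 3))
```

With a fixed M the output shape no longer changes from call to call when the coordinates change.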
I have a 70x70 numpy ndarray, which is mainly diagonal. The only off-diagonal values are below the diagonal. I would like to make the matrix symmetric.
As a newcomer from Matlab world, I can't get it working without for loops. In MATLAB it was easy:
W = max(A,A')
where A' is matrix transposition and the max() function takes care to make the W matrix which will be symmetric.
Is there an elegant way to do so in Python as well?
EXAMPLE
The sample A matrix is:
1 0 0 0
0 2 0 0
1 0 2 0
0 1 0 3
The desired output matrix W is:
1 0 1 0
0 2 0 1
1 0 2 0
0 1 0 3
I found the following solution, which works for me:
import numpy as np
W = np.maximum( A, A.transpose() )
Use the NumPy tril and triu functions as follows. It essentially "mirrors" elements in the lower triangle into the upper triangle.
import numpy as np
A = np.array([[1, 0, 0, 0], [0, 2, 0, 0], [1, 0, 2, 0], [0, 1, 0, 3]])
W = np.tril(A) + np.triu(A.T, 1)
tril(m, k=0) gets the lower triangle of a matrix m (returns a copy of the matrix m with all elements above the kth diagonal zeroed). Similarly, triu(m, k=0) gets the upper triangle of a matrix m (all elements below the kth diagonal zeroed).
To prevent the diagonal being added twice, one must exclude the diagonal from one of the triangles, using either np.tril(A) + np.triu(A.T, 1) or np.tril(A, -1) + np.triu(A.T).
Also note that this behaves slightly differently to using maximum. All elements in the upper triangle are overwritten, regardless of whether they are the maximum or not. This means they can be any value (e.g. nan or inf).
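A small illustration of that difference, using a made-up matrix with a negative entry below the diagonal (the matrix is my own example, not from the question):

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 2, 0],
              [-1, 0, 3]])

W_max = np.maximum(A, A.T)                  # element-wise maximum of A and its transpose
W_mirror = np.tril(A) + np.triu(A.T, 1)     # lower triangle copied into the upper triangle

print(W_max)      # the -1 is lost: max(-1, 0) is 0 at both mirrored positions
print(W_mirror)   # the -1 appears at both [2, 0] and [0, 2]
```

Both results are symmetric, but only the tril/triu version keeps lower-triangle values that are smaller than the entries they replace.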
For what it is worth, the NumPy equivalent of the MATLAB expression you mentioned is more efficient than the method from the link @plonser added.
In [1]: import numpy as np
In [2]: A = np.zeros((4, 4))
In [3]: np.fill_diagonal(A, np.arange(4)+1)
In [4]: A[2:,:2] = np.eye(2)
# numpy equivalent to MATLAB:
In [5]: %timeit W = np.maximum( A, A.T)
100000 loops, best of 3: 2.95 µs per loop
# method from link
In [6]: %timeit W = A + A.T - np.diag(A.diagonal())
100000 loops, best of 3: 9.88 µs per loop
Timing for larger matrices can be done similarly:
In [1]: import numpy as np
In [2]: N = 100
In [3]: A = np.zeros((N, N))
In [4]: A[2:,:N-2] = np.eye(N-2)
In [5]: np.fill_diagonal(A, np.arange(N)+1)
In [6]: print A
Out[6]:
array([[ 1., 0., 0., ..., 0., 0., 0.],
[ 0., 2., 0., ..., 0., 0., 0.],
[ 1., 0., 3., ..., 0., 0., 0.],
...,
[ 0., 0., 0., ..., 98., 0., 0.],
[ 0., 0., 0., ..., 0., 99., 0.],
[ 0., 0., 0., ..., 1., 0., 100.]])
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
10000 loops, best of 3: 28.6 µs per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
10000 loops, best of 3: 49.8 µs per loop
And with N = 1000
# numpy equivalent to MATLAB:
In [6]: %timeit W = np.maximum( A, A.T)
100 loops, best of 3: 5.65 ms per loop
# method from link
In [7]: %timeit W = A + A.T - np.diag(A.diagonal())
100 loops, best of 3: 11.7 ms per loop