I have a square symmetric matrix like this:
[[-1.         -0.70710678 -0.70710678 -0.70710678 -0.        ]
 [-0.70710678 -1.         -1.         -1.         -0.        ]
 [-0.70710678 -1.         -1.         -1.         -0.        ]
 [-0.70710678 -1.         -1.         -1.         -0.        ]
 [-0.         -0.         -0.         -0.         -1.        ]]
I would like to analyze all the numbers below or above the diagonal, but not the diagonal itself.
How can I find the unique values of this matrix, excluding any values that appear on the diagonal?
Expected output: [-0., -0.70710678]
You can get the diagonal values with arr.diagonal() and np.unique, then remove them from the unique values of the array:
unique = np.unique(arr)
index = np.ravel([np.where(unique == i) for i in np.unique(arr.diagonal())])
values = np.delete(unique, index)
print(values) # [-0.70710678 -0. ]
If a is the name of the numpy array with the representation you provided, then
print(np.array(np.setdiff1d(a, a.diagonal())))
does the trick with output
[-0.70710678 0. ]
(Original Answer) Alternatively,
import numpy as np
b = np.unique(a[~np.eye(a.shape[0],dtype=bool)].reshape(a.shape[0],-1))
print(b)
print(np.setdiff1d(b, a.diagonal()))
Printing b outputs the unique values of a with the main-diagonal positions excluded. The next line additionally removes any of those values that also appear on the diagonal of a.
The output is
[-1. -0.70710678 0. ]
[-0.70710678 0. ]
You can use Python sets, assuming a is the input:
b = np.array(list(set(a.flatten())-set(np.diagonal(a))))
output: array([-0.70710678, -0. ])
NB: this is faster for small arrays (the provided 25-item example) and roughly as fast as the numpy approaches for larger arrays (tested on 1M (1000x1000) and 100M (10k x 10k) items with 1000 unique possibilities).
timing:
code for the perfplot:
import numpy as np
import perfplot
def guy(a):
    unique = np.unique(a)
    index = np.ravel([np.where(unique == i) for i in np.unique(a.diagonal())])
    values = np.delete(unique, index)
    return values

def mozway(a):
    return np.array(list(set(a.flatten()) - set(np.diagonal(a))))

def oda(a):
    b = np.unique(a[~np.eye(a.shape[0], dtype=bool)].reshape(a.shape[0], -1))
    return np.setdiff1d(b, a.diagonal())

def oda_setdiff(a):
    return np.array(np.setdiff1d(a, a.diagonal()))

perfplot.show(
    setup=lambda n: np.random.randint(0, 1000, size=(n, n)),
    kernels=[guy, oda, oda_setdiff, mozway],
    n_range=[2**k for k in range(11)],
    xlabel="array shape in each dimension",
    equality_check=None,
)
Related
I have this for loop that I need to vectorize. The code below works, but takes a lot of time (this is a simplified example; the full version will have about 1e6 rows in col_ids). Can someone give me an idea of how to vectorize this code to get rid of the loop? If it matters, the col_ids are fixed (they will be the same every time the code is run), while the values will change.
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
result = np.zeros((4,3))
for idx, col_idx in enumerate(col_ids):
    result[np.arange(4), col_idx] += values[idx]
Result:
[[5.8 0. 0. ]
[5.8 0. 0. ]
[3.5 2.3 0. ]
[1.5 4.3 0. ]]
Update:
I am adding a second example, as there was some ambiguity in the dimensions of my first example. Only values and col_ids are updated; everything else is as in the first example. (I keep the first one, since it is referred to in the answers.)
values = np.array([1.5, 2, 5, 20, 50])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1], [0,0,1,2], [0,1,2,2]])
Result:
[[78.5 0. 0. ]
[28.5 50. 0. ]
[ 3.5 25. 50. ]
[ 1.5 7. 70. ]]
So result is m x n, col_ids is k x m, and values has length k. Both m and n are small (m=4, n=3); k is large (about 1e6 in the full example).
You can vectorize the loop, but creating the additional intermediate array makes it much slower for larger data (starting from a result of shape (50, 50)).
import numpy as np
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
(np.equal.outer(col_ids, np.arange(len(values))) * values[:,None,None]).sum(0)
# for a fixed result shape (4,3)
# (np.equal.outer(col_ids, np.arange(3)) * values[:,None,None]).sum(0)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
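For clarity, here is a shape breakdown of that one-liner with the first example's data (mask, weighted, and out are names introduced here, not part of the original answer):

import numpy as np

values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]])

mask = np.equal.outer(col_ids, np.arange(3))  # (k, m, n) boolean: True where col_ids[i, j] equals the column index
weighted = mask * values[:, None, None]       # spread each value over its (m, n) slice -> (k, m, n)
out = weighted.sum(0)                         # sum over the k rows of col_ids -> the (m, n) result

That (k, m, n) boolean intermediate is the additional array mentioned above, which is why this approach slows down as the data grows.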
The only reliably faster solution I could find is numba (using version 0.55.1). I thought this implementation would benefit from parallel execution, but I couldn't get any speed up on a 2-core colab instance.
import numba as nb
@nb.njit(parallel=False)  # Try parallel=True for multi-threaded execution; no speed up in my benchmarks
def fill(val, ids):
    res = np.zeros(ids.shape[::-1])
    for i in nb.prange(len(res)):
        for j in range(res.shape[1]):
            res[i, ids[j, i]] += val[j]
    return res
fill(values, col_ids)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
For a fixed result shape (4,3) with suitable input.
@nb.njit(boundscheck=True)  # ~1.25x slower, but much safer
def fill(val, ids):
    res = np.zeros((4, 3))
    for i in nb.prange(ids.shape[0]):
        for j in range(ids.shape[1]):
            res[j, ids[i, j]] += val[i]
    return res
fill(values, col_ids)
Output for the updated example data
array([[78.5, 0. , 0. ],
[28.5, 50. , 0. ],
[ 3.5, 25. , 50. ],
[ 1.5, 7. , 70. ]])
You can solve this using np.add.at. However, AFAIK, this function does not support 2D arrays, so you need to flatten the arrays, compute the 1D flattened indices, and then call the function:
result = np.zeros((4, 3))
n, m = result.shape
indices = np.tile(np.arange(0, n*m, m), col_ids.shape[0]) + col_ids.ravel()
np.add.at(result.ravel(), indices, np.repeat(values, n))  # in-place
print(result)
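For what it's worth, np.add.at also accepts a tuple of index arrays (one per dimension), which avoids the manual flattening; a minimal sketch with the first example's data (rows and vals are helper names introduced here):

import numpy as np

values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]])

result = np.zeros((4, 3))
rows = np.broadcast_to(np.arange(col_ids.shape[1]), col_ids.shape)  # row index of result for every entry of col_ids
vals = np.broadcast_to(values[:, None], col_ids.shape)              # repeat each value across its row of col_ids
np.add.at(result, (rows, col_ids), vals)                            # unbuffered in-place accumulation
print(result)
# [[5.8 0.  0. ]
#  [5.8 0.  0. ]
#  [3.5 2.3 0. ]
#  [1.5 4.3 0. ]]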
The objective is to randomly assign a constant value to locations in the lower triangle (tril) of a numpy array.
I wonder whether there is a more efficient and compact approach than the proposed solution below.
import numpy as np
import random
rand_n2 = np.random.randn(10,10)
arr=np.tril(rand_n2,-1)
n=np.where(arr!=0)
nsize=n[0].shape[0]
rand_idx = random.sample(range(1,nsize), nsize-1)
ndrop = 2  # Total number of locations to assign the constant value
for idx in range(ndrop):
    arr[n[0][rand_idx[idx]], n[1][rand_idx[idx]]] = 10  # Assign the constant value to a random tril location
You could initialize a matrix with random values and overwrite the upper triangle with zeros; then take random indices from the lower-triangle indices and overwrite them with the constant:
import numpy as np
# create the matrix with random values
size = 5
arr = np.random.rand(size, size)
arr[np.triu_indices(size, k=0)] = 0
# set values randomly
val = 10
k_max = 2
ix = np.random.choice(range(int((size*size-size)/2)), k_max)
rnd = np.tril_indices(size, k=-1)
arr[(rnd[0][ix], rnd[1][ix])] = val
array([[ 0. , 0. , 0. , 0. , 0. ],
[ 0.50754565, 0. , 0. , 0. , 0. ],
[ 0.98920062, 0.53945212, 0. , 0. , 0. ],
[ 0.54987252, 10. , 0.22052519, 0. , 0. ],
[10. , 0.82057924, 0.86199411, 0.85397047, 0. ]])
Don't know if this is much more efficient and compact, but I feel it's a bit cleaner and easier to read:
import numpy as np
rand_n2 = np.random.randn(10,10)
arr=np.tril(rand_n2,-1)
# create list of lower trianguler indices
tril_idx = [(i,j) for i in range(1,10) for j in range(i)]
# shuffle indices i.e. draw two at random
np.random.shuffle(tril_idx)
ndrop = 2  # Total number of locations to assign the constant value
for idx in tril_idx[:ndrop]:
    arr[idx] = 10  # Assign the constant value to a random tril location
Instead of using the double list comprehension to create the list of lower-triangular indices, you can use np.tril_indices() as well, as sketched below. Just take care, since it returns a tuple of arrays rather than an array of tuples.
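A minimal sketch of that np.tril_indices() variant (assuming the same 10x10 array, constant value, and ndrop as above):

import numpy as np

rand_n2 = np.random.randn(10, 10)
arr = np.tril(rand_n2, -1)

ndrop = 2
rows, cols = np.tril_indices(10, k=-1)                         # tuple of index arrays for the strict lower triangle
pick = np.random.choice(len(rows), size=ndrop, replace=False)  # draw ndrop distinct positions at random
arr[rows[pick], cols[pick]] = 10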
I have a numpy array and I want to rescale values along each row to values between 0 and 1 using the following procedure:
If the maximum value along a given row is X_max and the minimum value along that row is X_min, then the rescaled value (X_rescaled) of a given entry (X) in that row should become:
X_rescaled = (X - X_min)/(X_max - X_min)
As an example, let's consider the following array (arr):
arr = np.array([[1.0,2.0,3.0],[0.1, 5.1, 100.1],[0.01, 20.1, 1000.1]])
print(arr)
array([[ 1.00000000e+00, 2.00000000e+00, 3.00000000e+00],
[ 1.00000000e-01, 5.10000000e+00, 1.00100000e+02],
[ 1.00000000e-02, 2.01000000e+01, 1.00010000e+03]])
Presently, I am trying to use MinMaxScaler from scikit-learn in the following way:
from sklearn.preprocessing import MinMaxScaler
result = MinMaxScaler(arr)
But I keep getting my initial array back, i.e. result turns out to be the same as arr with the aforementioned method. What am I doing wrong?
How can I scale the array arr in the manner that I require (min-max scaling along each row)? Thanks in advance.
MinMaxScaler is a bit clunky to use; sklearn.preprocessing.minmax_scale is more convenient. This operates along columns, so use the transpose:
>>> import numpy as np
>>> from sklearn import preprocessing
>>>
>>> a = np.random.random((3,5))
>>> a
array([[0.80161048, 0.99572497, 0.45944366, 0.17338664, 0.07627295],
[0.54467986, 0.8059851 , 0.72999058, 0.08819178, 0.31421126],
[0.51774372, 0.6958269 , 0.62931078, 0.58075685, 0.57161181]])
>>> preprocessing.minmax_scale(a.T).T
array([[0.78888024, 1. , 0.41673812, 0.10562126, 0. ],
[0.63596033, 1. , 0.89412757, 0. , 0.314881 ],
[0. , 1. , 0.62648851, 0.35384099, 0.30248836]])
>>>
>>> b = np.array([(4, 1, 5, 3), (0, 1.5, 1, 3)])
>>> preprocessing.minmax_scale(b.T).T
array([[0.75 , 0. , 1. , 0.5 ],
[0. , 0.5 , 0.33333333, 1. ]])
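For comparison, the row-wise formula from the question can also be applied directly with plain numpy, without scikit-learn (a minimal sketch; row_min and row_max are helper names introduced here):

import numpy as np

arr = np.array([[1.0, 2.0, 3.0], [0.1, 5.1, 100.1], [0.01, 20.1, 1000.1]])

# X_rescaled = (X - X_min) / (X_max - X_min), computed per row via keepdims broadcasting
row_min = arr.min(axis=1, keepdims=True)
row_max = arr.max(axis=1, keepdims=True)
rescaled = (arr - row_min) / (row_max - row_min)
print(rescaled)  # each row now spans [0, 1]; the first row becomes [0. , 0.5, 1. ]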
I need to write a basic function that computes a 2D convolution between a matrix and a kernel.
I have recently got into Python, so I'm sorry for my mistakes.
My dissertation teacher said that I should write one myself so that I can understand it better and modify it for future improvements.
I have found an example of this function on a website, but I don't understand how the returned values are obtained.
This is the code (from http://docs.cython.org/src/tutorial/numpy.html):
from __future__ import division
import numpy as np

def naive_convolve(f, g):
    # f is an image and is indexed by (v, w)
    # g is a filter kernel and is indexed by (s, t),
    #   it needs odd dimensions
    # h is the output image and is indexed by (x, y),
    #   it is not cropped
    if g.shape[0] % 2 != 1 or g.shape[1] % 2 != 1:
        raise ValueError("Only odd dimensions on filter supported")
    # smid and tmid are number of pixels between the center pixel
    # and the edge, ie for a 5x5 filter they will be 2.
    #
    # The output size is calculated by adding smid, tmid to each
    # side of the dimensions of the input image.
    vmax = f.shape[0]
    wmax = f.shape[1]
    smax = g.shape[0]
    tmax = g.shape[1]
    smid = smax // 2
    tmid = tmax // 2
    xmax = vmax + 2*smid
    ymax = wmax + 2*tmid
    # Allocate result image.
    h = np.zeros([xmax, ymax], dtype=f.dtype)
    # Do convolution
    for x in range(xmax):
        for y in range(ymax):
            # Calculate pixel value for h at (x,y). Sum one component
            # for each pixel (s, t) of the filter g.
            s_from = max(smid - x, -smid)
            s_to = min((xmax - x) - smid, smid + 1)
            t_from = max(tmid - y, -tmid)
            t_to = min((ymax - y) - tmid, tmid + 1)
            value = 0
            for s in range(s_from, s_to):
                for t in range(t_from, t_to):
                    v = x - smid + s
                    w = y - tmid + t
                    value += g[smid - s, tmid - t] * f[v, w]
            h[x, y] = value
    return h
I don't know whether this function computes the weighted sum of the input and the filter, because I don't see an explicit sum here.
I applied this with
kernel = np.array([(1, 1, -1), (1, 0, -1), (1, -1, -1)])
file = np.ones((5,5))
naive_convolve(file, kernel)
I got this matrix:
[[ 1. 2. 1. 1. 1. 0. -1.]
[ 2. 3. 1. 1. 1. -1. -2.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 2. 1. -1. -1. -1. -3. -2.]
[ 1. 0. -1. -1. -1. -2. -1.]]
I tried to do a manual calculation (on paper) for the first full iteration of the function and I got h[0,0] = 0, because of the product filter[0, 0] * matrix[0, 0], but the function returns 1. I am very confused by this.
If anyone can help me understand what is going on here, I would be very grateful. Thanks! :)
Yes, that function computes the convolution correctly. You can check this using scipy.signal.convolve2d:
import numpy as np
from scipy.signal import convolve2d
kernel = np.array([(1, 1, -1), (1, 0, -1), (1, -1, -1)])
file = np.ones((5,5))
x = convolve2d(file, kernel)
print(x)
Which gives:
[[ 1. 2. 1. 1. 1. 0. -1.]
[ 2. 3. 1. 1. 1. -1. -2.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 3. 3. 0. 0. 0. -3. -3.]
[ 2. 1. -1. -1. -1. -3. -2.]
[ 1. 0. -1. -1. -1. -2. -1.]]
It's impossible to know how to explain all this to you, since I don't know where to start and I don't know why all the other explanations aren't working for you. I think, though, that you are doing all of this as a learning exercise so you can figure this out for yourself. From what I've seen on SO, asking big questions on SO is not a substitute for working it through yourself.
Your specific question of why h[0,0] = 0 in your calculation does not match this matrix is a good one. In fact, both are correct. The mismatch arises because the output of the convolution doesn't have explicit mathematical indices; they are implied. The center, which is mathematically indicated by the index [0, 0], corresponds to x[3, 3] in the matrix above.
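One way to see this shift concretely is to compare scipy's 'full' and 'same' modes (a small sketch reusing the example data and scipy.signal.convolve2d from the answer above): the 'full' output is the 'same'-sized output with a one-pixel border (half the 3x3 kernel) added on each side, so its centre lands at index [3, 3] of the 7x7 matrix.

import numpy as np
from scipy.signal import convolve2d

kernel = np.array([(1, 1, -1), (1, 0, -1), (1, -1, -1)])
file = np.ones((5, 5))

full = convolve2d(file, kernel, mode='full')   # 7x7, border pixels included
same = convolve2d(file, kernel, mode='same')   # 5x5, aligned with the input image

print(np.array_equal(full[1:-1, 1:-1], same))  # True: 'full' is 'same' plus a 1-pixel border
print(full[3, 3])                              # 0.0, the value at the centre of the 7x7 output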
What is the best way to fill in the lower triangle of a numpy array with zeros in place so that I don't have to do the following:
a=np.random.random((5,5))
a = np.triu(a)
since np.triu returns a copy, not a view. Preferably this would require no list indexing either, since I am working with large arrays.
Digging into the internals of triu you'll find that it just multiplies the input by the output of tri.
So you can just multiply the array in-place by the output of tri:
>>> a = np.random.random((5, 5))
>>> a *= np.tri(*a.shape)
>>> a
array([[ 0.46026582, 0. , 0. , 0. , 0. ],
[ 0.76234296, 0.5298908 , 0. , 0. , 0. ],
[ 0.08797149, 0.14881991, 0.9302515 , 0. , 0. ],
[ 0.54794779, 0.36896506, 0.92901552, 0.73747726, 0. ],
[ 0.62917827, 0.61674542, 0.44999905, 0.80970863, 0.41860336]])
Like triu, this still creates a second array (the output of tri), but at least it performs the operation itself in place. The splat (*a.shape) is a bit of a shortcut; consider basing your function on the full version of triu for something more robust. But note that you can still specify a diagonal:
>>> a = np.random.random((5, 5))
>>> a *= np.tri(*a.shape, k=2)
>>> a
array([[ 0.25473126, 0.70156073, 0.0973933 , 0. , 0. ],
[ 0.32859487, 0.58188318, 0.95288351, 0.85735005, 0. ],
[ 0.52591784, 0.75030515, 0.82458369, 0.55184033, 0.01341398],
[ 0.90862183, 0.33983192, 0.46321589, 0.21080121, 0.31641934],
[ 0.32322392, 0.25091433, 0.03980317, 0.29448128, 0.92288577]])
I now see that the question title and body describe opposite behaviors. Just in case, here's how you can fill the lower triangle with zeros. This requires you to specify the -1 diagonal:
>>> a = np.random.random((5, 5))
>>> a *= 1 - np.tri(*a.shape, k=-1)
>>> a
array([[0.6357091 , 0.33589809, 0.744803 , 0.55254798, 0.38021111],
[0. , 0.87316263, 0.98047459, 0.00881754, 0.44115527],
[0. , 0. , 0.51317289, 0.16630385, 0.1470729 ],
[0. , 0. , 0. , 0.9239731 , 0.11928557],
[0. , 0. , 0. , 0. , 0.1840326 ]])
If speed and memory use are still a limitation and Cython is available, a short Cython function will do what you want.
Here's a working version designed for a C-contiguous array with double precision values.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef make_lower_triangular(double[:,:] A, int k):
    """ Set all the entries of array A that lie above
    diagonal k to 0. """
    cdef int i, j
    for i in range(min(A.shape[0], A.shape[0] - k)):
        for j in range(max(0, i + k + 1), A.shape[1]):
            A[i, j] = 0.
This should be significantly faster than any version that involves multiplying by a large temporary array.
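Once the snippet above is compiled with Cython (for example via cythonize), usage would look something like the following; the module name tri_utils is hypothetical:

import numpy as np
from tri_utils import make_lower_triangular  # hypothetical module built from the .pyx above

a = np.random.random((5, 5))
make_lower_triangular(a, 0)  # zeros every entry strictly above the main diagonal, in place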
import numpy as np

n = 3
A = np.zeros((n, n))
for p in range(n):
    A[0, p] = p + 1
    if p > 0:
        A[1, p] = p + 3
    if p > 1:
        A[2, p] = p + 4

This creates an upper triangular matrix starting at 1.
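For reference, printing A shows the values 1 through 6 filling the upper triangle row by row:

print(A)
# [[1. 2. 3.]
#  [0. 4. 5.]
#  [0. 0. 6.]]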