Adjacency Matrix from Numpy array using Euclidean Distance - python

Can someone please help me generate a weighted adjacency matrix from a numpy array, based on the Euclidean distance between all pairs of rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, ...?
Given the following example with an input matrix(5, 4):
matrix = [[2,10,9,6],
[5,1,4,7],
[3,2,1,0],
[10, 20, 1, 4],
[17, 3, 5, 18]]
I would like to obtain a weighted adjacency matrix of shape (5,5) containing the smallest distances between nodes, i.e.,
if dist(row0, row1) = 10.77 and dist(row0, row2) = 12.84,
--> the output matrix will take the first distance as a column value.
I have already solved the first part for the generation of the adjacency matrix with the following code :
from scipy.spatial.distance import cdist
dist = cdist( matrix, matrix, metric='euclidean')
and I get the following result :
array([[ 0. , 10.77032961, 12.84523258, 15.23154621, 20.83266666],
[10.77032961, 0. , 7.93725393, 20.09975124, 16.43167673],
[12.84523258, 7.93725393, 0. , 19.72308292, 23.17326045],
[15.23154621, 20.09975124, 19.72308292, 0. , 23.4520788 ],
[20.83266666, 16.43167673, 23.17326045, 23.4520788 , 0. ]])
But I don't know yet how to limit the number of neighbors kept for each node. For example, if we define the number of neighbors N = 2, then for each row we keep only the two neighbors with the smallest distances, and we get as a result:
[[ 0. , 10.77032961, 12.84523258, 0, 0],
[10.77032961, 0. , 7.93725393, 0, 0],
[12.84523258, 7.93725393, 0. , 0, 0],
[15.23154621, 0, 19.72308292, 0. , 0 ],
[20.83266666, 16.43167673, 0, 0 , 0. ]]

You can use this cleaner solution to keep the n smallest values in each row of a matrix.
dist.argsort(1).argsort(1) creates a rank order along axis=1 (smallest is 0 and largest is 4), and the <= 2 comparison decides how many of the smallest values to keep. Because the zero on the diagonal always has rank 0, <= 2 keeps each node's 2 nearest neighbors. np.where then keeps the selected distances and replaces everything else with 0.
np.where(dist.argsort(1).argsort(1) <= 2, dist, 0)
array([[ 0. , 10.77032961, 12.84523258, 0. , 0. ],
[10.77032961, 0. , 7.93725393, 0. , 0. ],
[12.84523258, 7.93725393, 0. , 0. , 0. ],
[15.23154621, 0. , 19.72308292, 0. , 0. ],
[20.83266666, 16.43167673, 0. , 0. , 0. ]])
The same approach works along any axis, and for the n largest values as well as the n smallest.
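To see why applying argsort twice yields ranks, here is a minimal sketch on a single row of distances. The values are rounded from row 4 of the example, and the variable names are only illustrative.

```python
import numpy as np

# One row of distances, rounded from row 4 of the example above.
row = np.array([20.8, 16.4, 23.2, 23.5, 0.0])

order = row.argsort()    # indices that would sort the row: [4, 1, 0, 2, 3]
ranks = order.argsort()  # rank of each element, 0 = smallest: [2, 1, 3, 4, 0]

# Keep ranks 0..2: the zero on the diagonal plus the two nearest neighbours.
kept = np.where(ranks <= 2, row, 0)
print(kept.tolist())  # [20.8, 16.4, 0.0, 0.0, 0.0]
```

The first argsort answers "which element comes first, second, ...", while the second one inverts that permutation and answers "what position does each element take".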

Assuming a is your Euclidean distance matrix, you can use np.argpartition to choose the n min/max values per row. Keep in mind that the diagonal is always 0 and Euclidean distances are non-negative, so to keep the two closest points in each row you need to keep the three smallest values per row (including the 0 on the diagonal). This does not hold if you want the largest values, however. Note that the assignment below modifies a in place.
a[np.arange(a.shape[0])[:,None],np.argpartition(a, 3, axis=1)[:,3:]] = 0
output:
array([[ 0. , 10.77032961, 12.84523258, 0. , 0. ],
[10.77032961, 0. , 7.93725393, 0. , 0. ],
[12.84523258, 7.93725393, 0. , 0. , 0. ],
[15.23154621, 0. , 19.72308292, 0. , 0. ],
[20.83266666, 16.43167673, 0. , 0. , 0. ]])
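If the original distance matrix should stay untouched, the argpartition idea can be wrapped in a small helper. This is only a sketch: keep_k_nearest is a hypothetical name, it assumes k is smaller than the number of rows and that no two points coincide, and it computes the distances with plain NumPy broadcasting (which should give the same values as cdist with metric='euclidean').

```python
import numpy as np

def keep_k_nearest(dist, k):
    """Return a copy of `dist` with all but the k nearest neighbours per row zeroed."""
    out = dist.copy()
    # After partitioning at position k, positions 0..k of each row hold the
    # diagonal zero plus the k nearest neighbours; positions k+1.. hold the rest.
    drop = np.argpartition(dist, k, axis=1)[:, k + 1:]
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out

points = np.array([[2, 10, 9, 6],
                   [5, 1, 4, 7],
                   [3, 2, 1, 0],
                   [10, 20, 1, 4],
                   [17, 3, 5, 18]], dtype=float)

# Pairwise Euclidean distances via broadcasting, shape (5, 5).
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
adj = keep_k_nearest(dist, 2)
print(np.round(adj, 2))
```

Each row of adj keeps exactly two non-zero entries, and dist itself is left unchanged.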

Related

Truncating a 2D array for a given tolerance [Python]

An old question on Singular Value Decomposition led me to ask this question:
How could I truncate a 2-Dimensional array, to a number of columns dictated by a certain tolerance?
Specifically, please consider the following code snippet, which defines an accepted tolerance of 1e-4 and applies Singular Value Decomposition to a matrix 'A'.
#Python
tol=1e-4
U,Sa,V=np.linalg.svd(A)
S=np.diag(Sa)
The resulting singular value diagonal matrix 'S' holds non-negative singular values in decreasing order of magnitude.
What I want to obtain is a truncated 'S' matrix, in a way that the columns of the matrix holding singular values lower than 1e-4 would be removed. Then, apply this truncation to the matrix 'U'.
Is there a simple way of doing this? I have been looking around, and found some solutions to the problem for Matlab, but didn't find anything similar for Python.
For Matlab, the code would look something like:
%Matlab
tol=1e-4
mask=any(Sigma>=tol,2);
sigRB=Sigma(:,mask);
mask2=any(U>=tol,2);
B=U(:,mask);
Thanks in advance. I hope my post was not too messy to understand.
I am not sure if I understand you correctly. If my solution is not what you ask for, please consider adding an example to your question.
The following code drops all columns from array s that consist only of values smaller than tol.
s = np.array([
[1, 0, 0, 0, 0, 0],
[0, .9, 0, 0, 0, 0],
[0, 0, .5, 0, 0, 0],
[0, 0, 0, .4, 0, 0],
[0, 0, 0, 0, .3, 0],
[0, 0, 0, 0, 0, .2]
])
print(s)
tol = .4
ind = np.argwhere(s.max(axis=0) < tol)
s = np.delete(s, ind, 1)
print(s)
Output:
[[1. 0. 0. 0. 0. 0. ]
[0. 0.9 0. 0. 0. 0. ]
[0. 0. 0.5 0. 0. 0. ]
[0. 0. 0. 0.4 0. 0. ]
[0. 0. 0. 0. 0.3 0. ]
[0. 0. 0. 0. 0. 0.2]]
[[1. 0. 0. 0. ]
[0. 0.9 0. 0. ]
[0. 0. 0.5 0. ]
[0. 0. 0. 0.4]
[0. 0. 0. 0. ]
[0. 0. 0. 0. ]]
I am applying max along axis 0 (that is, down each column) and then using np.argwhere to get the indices of the columns whose max value is smaller than tol. For a diagonal matrix such as this one, the row-wise and column-wise maxima coincide, but axis 0 is the one that matches "columns" in general.
Edit: In order to truncate the columns of matrix 'U', so it coincides in size with the reduced matrix 'S', the following code works:
k = len(S[0])
Ured = U[:,0:k]
Uredsize = np.shape(Ured) # To check it has worked
print(Uredsize)
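Because np.linalg.svd returns the singular values sorted in decreasing order, the truncation can also be sketched directly on the 1D array of singular values, without building the full diagonal matrix first. The test matrix and the names U_red, S_red and Vh_red below are assumptions made for illustration.

```python
import numpy as np

tol = 1e-4
# Illustrative matrix whose singular values are 3, 1 and 1e-6.
A = np.diag([3.0, 1.0, 1e-6])

U, Sa, Vh = np.linalg.svd(A, full_matrices=False)

# Singular values are sorted in decreasing order, so counting is enough.
k = int(np.count_nonzero(Sa >= tol))

U_red = U[:, :k]          # first k left singular vectors
S_red = np.diag(Sa[:k])   # k x k diagonal matrix
Vh_red = Vh[:k, :]        # first k right singular vectors (as rows)

print(k, U_red.shape, S_red.shape, Vh_red.shape)  # 2 (3, 2) (2, 2) (2, 3)
```

The product U_red @ S_red @ Vh_red then approximates A up to the discarded singular values.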

1D Numpy array does not get reshaped into a 2D array

columns = np.shape(lines)[0] # Gets x-axis dimension of array lines (to get numbers of columns)
lengths = np.zeros(shape=(2,1)) # Create a 2D array
# lengths = [[ 0.]
# [ 0.]]
lengths = np.arange(columns).reshape((columns)) # Makes array have the same number of columns as columns and fills it with elements going up from zero <--- This line seems to be turning it into a 1D array
Output after printing lengths array:
print(lengths)
[0 1 2]
Expected Output Example:
print(lengths)
[[0 1 2]] # Notice the double square bracket
As a result, I cannot enter data along the second dimension of the array, because that dimension no longer exists:
np.append(lengths, 65, axis=1)
AxisError: axis 1 is out of bounds for array of dimension 1
I want the array to be 2D so I can store "IDs" on the first row and values on the second (at a later point in the program). I'm also aware that I could add another row to the array instead of doing it at initialization. But I'd rather not do that since I heard that's inefficient and this program's success is highly dependent on performance.
Thank you.
Since you eventually want a 2d array with ids in one row and values in the second, I'd suggest starting with the right size:
In [535]: arr = np.zeros((2,10),int)
In [536]: arr
Out[536]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
In [537]: arr[0,:]=np.arange(10)
In [538]: arr
Out[538]:
array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
Sure you could start with a 1 row array of ids, but adding that 2nd row at a later time requires making a new array anyways. np.append is just a variation on np.concatenate.
But to make a 2d array from arange I like:
In [539]: np.arange(10)[None,:]
Out[539]: array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
reshape also works, but has to be given the correct shape, e.g. (1,10).
In:
lengths = np.zeros(shape=(2,1)) # Create a 2D array
lengths = np.arange(columns).reshape((columns))
the 2nd lengths assignment replaces the first. You have to do an indexed assignment as I did with arr[0,:] to modify an existing array. lengths[0,:] = np.arange(10) wouldn't work because lengths only has 1 column, not 10. Assignments like this require correct pairing of dimensions.
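The difference between the shapes discussed above can be checked with a short sketch. The variable names other than columns are illustrative, and columns = 3 is assumed to match the printed output in the question.

```python
import numpy as np

columns = 3

flat = np.arange(columns).reshape(columns)      # shape (3,): still 1D
row_a = np.arange(columns).reshape(1, columns)  # shape (1, 3): one 2D row
row_b = np.arange(columns)[None, :]             # shape (1, 3): same result

# Preallocate both rows, then fill them by indexed assignment.
table = np.zeros((2, columns), dtype=int)
table[0, :] = np.arange(columns)  # "IDs" in the first row
table[1, 2] = 65                  # a value in the second row

print(flat.shape, row_a.shape, row_b.shape)  # (3,) (1, 3) (1, 3)
print(table.tolist())                        # [[0, 1, 2], [0, 0, 65]]
```

Indexed assignment writes into the existing array, whereas np.append would build a new one on every call.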
You don't need 2D data to fill a column of a 2D array; 1D data is enough.
You can also put the data into the 0th row instead of the 0th column if you change the organization of memory. The copy then writes into contiguous memory (memory without gaps), which is faster.
Program:
import numpy as np
data = np.arange(12)
#method 1
buf = np.zeros((12, 6))
buf[:,0] = data
print(buf)
#method 2
buf = np.zeros((6, 12))
buf[0] = data
print(buf)
Result:
[[ 0. 0. 0. 0. 0. 0.]
[ 1. 0. 0. 0. 0. 0.]
[ 2. 0. 0. 0. 0. 0.]
[ 3. 0. 0. 0. 0. 0.]
[ 4. 0. 0. 0. 0. 0.]
[ 5. 0. 0. 0. 0. 0.]
[ 6. 0. 0. 0. 0. 0.]
[ 7. 0. 0. 0. 0. 0.]
[ 8. 0. 0. 0. 0. 0.]
[ 9. 0. 0. 0. 0. 0.]
[ 10. 0. 0. 0. 0. 0.]
[ 11. 0. 0. 0. 0. 0.]]
[[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
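The layout difference between the two methods can be inspected through the array flags and strides. This sketch only looks at the memory layout of the default C-ordered float64 arrays; whether it translates into a measurable speedup depends on array size and hardware and would need timing.

```python
import numpy as np

col_target = np.zeros((12, 6))[:, 0]  # method 1: one column of a C-ordered array
row_target = np.zeros((6, 12))[0]     # method 2: one row of a C-ordered array

# A column view jumps over a whole row (6 * 8 bytes) between elements.
print(col_target.flags['C_CONTIGUOUS'], col_target.strides)  # False (48,)
# A row view steps through adjacent float64 values (8 bytes each).
print(row_target.flags['C_CONTIGUOUS'], row_target.strides)  # True (8,)
```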

Meshgrid a N-columned matrix in Numpy (or smth else)

Python version: 2.7
I have the following numpy 2d array:
array([[ -5.05000000e+01, -1.05000000e+01],
[ -4.04000000e+01, -8.40000000e+00],
[ -3.03000000e+01, -6.30000000e+00],
[ -2.02000000e+01, -4.20000000e+00],
[ -1.01000000e+01, -2.10000000e+00],
[ 7.10542736e-15, -1.77635684e-15],
[ 1.01000000e+01, 2.10000000e+00],
[ 2.02000000e+01, 4.20000000e+00],
[ 3.03000000e+01, 6.30000000e+00],
[ 4.04000000e+01, 8.40000000e+00]])
If I wanted to find all the combinations of the first and the second columns, I would use np.array(np.meshgrid(first_column, second_column)).T.reshape(-1,2). As a result, I would get a 100*2 matrix with 10*10 = 100 data points. However, my matrix can have 3, 4, or more columns, so I cannot use this numpy call as written.
Question: how can I make an automatically meshgridded matrix with 3+ columns?
UPD: for example, I have the initial array:
[[-50.5 -10.5]
[ 0. 0. ]]
As a result, I want to have the output array like this:
array([[-10.5, -50.5],
[-10.5, 0. ],
[ 0. , -50.5],
[ 0. , 0. ]])
or this:
array([[-50.5, -10.5],
[-50.5, 0. ],
[ 0. , -10.5],
[ 0. , 0. ]])
You could use the * operator on the transposed array to unpack its columns as separate arguments to np.meshgrid. Finally, a swap-axes operation is needed to merge the output grid arrays into one array.
Thus, one generic solution would be -
np.swapaxes(np.meshgrid(*arr.T),0,2)
Sample run -
In [44]: arr
Out[44]:
array([[-50.5, -10.5],
[ 0. , 0. ]])
In [45]: np.swapaxes(np.meshgrid(*arr.T),0,2)
Out[45]:
array([[[-50.5, -10.5],
[-50.5, 0. ]],
[[ 0. , -10.5],
[ 0. , 0. ]]])
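If a flat list of combinations is wanted, as in the desired output of the question, one possible variant stacks the grids along a new last axis and reshapes. This is a sketch rather than the answer's method: column_combinations is a hypothetical helper, and it uses indexing='ij' so that the coordinate axis ends up last for any number of columns (with swapaxes(0, 2) that only works out for two columns).

```python
import numpy as np

def column_combinations(arr):
    """All combinations of one value per column, as rows of an (n**k, k) array."""
    grids = np.meshgrid(*arr.T, indexing='ij')  # one grid per column
    return np.stack(grids, axis=-1).reshape(-1, arr.shape[1])

arr = np.array([[-50.5, -10.5],
                [0.0, 0.0]])
print(column_combinations(arr).tolist())
# [[-50.5, -10.5], [-50.5, 0.0], [0.0, -10.5], [0.0, 0.0]]

arr3 = np.arange(6).reshape(2, 3)       # three columns, two values each
print(column_combinations(arr3).shape)  # (8, 3)
```

Note that the number of rows grows as n**k, so this is only practical for a small number of columns.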

First n elements of row in numpy array

I'm trying to implement a k-nearest neighbour classifier in Python, and so I want to calculate the Euclidean distance. I have a dataset that I have converted into a big numpy array
[[ 0. 0. 4. ..., 1. 0. 1.]
[ 0. 0. 5. ..., 0. 0. 1.]
[ 0. 0. 14. ..., 16. 9. 1.]
...,
[ 0. 0. 3. ..., 2. 0. 3.]
[ 0. 1. 7. ..., 0. 0. 3.]
[ 0. 2. 10. ..., 0. 0. 3.]]
where the last element of each row indicates the class. So when calculating the Euclidean distance, I obviously don't want to include the last element. I thought I could do the following
for row in dataset:
    distance = euclidean_distance(vector, row[:dataset.shape[1] - 1])
but that still includes the last element
print row
>>> [[ 0. 0. 4. ..., 1. 0. 1.]]
print row[:dataset.shape[1] - 1]
>>> [[ 0. 0. 4. ..., 1. 0. 1.]]
as you can see both are the same.
You can subset the data using numpy slicing. If you find yourself iterating over a numpy array, stop and try to find a method that takes advantage of the vectorized nature of numpy operations.
Assuming your array is called arr:
data_points = arr[:,:-1]
classes = arr[:,-1]
For distance to vector calculations:
To find the distance between a 1d array and all of the rows of a 2d array, you can use the following. It assumes the 1d array is v and the 2d array is arr.
dist = np.sqrt(np.power(arr - v, 2).sum(axis=1))
dist will be a 1d array of Euclidean distances. For ranking nearest neighbours the np.sqrt can be skipped, since it does not change the order.
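Putting the slicing and the distance computation together, the neighbour lookup for a k-nearest-neighbour classifier might be sketched as follows. The toy dataset, the query vector v and k = 2 are made-up values for illustration.

```python
import numpy as np

# Toy dataset: the last column is the class label (values are made up).
arr = np.array([[0., 0., 1.],
                [3., 4., 1.],
                [6., 8., 3.],
                [1., 1., 3.]])
data_points = arr[:, :-1]  # all columns except the last
classes = arr[:, -1]       # the last column only

v = np.array([0., 0.])
dist = np.sqrt(((data_points - v) ** 2).sum(axis=1))  # one distance per row

k = 2
nearest = np.argsort(dist)[:k]    # indices of the k closest rows
print(nearest.tolist())           # [0, 3]
print(classes[nearest].tolist())  # [1.0, 3.0]
```

The labels of the nearest rows can then be fed into whatever voting rule the classifier uses.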
For pairwise calculations:
The following function takes a 2d array of numbers and returns the upper-triangular matrix of pairwise distances using the given L-norm distance measure (the Euclidean distance is the L=2 case).
import numpy as np
def pairwise_distance(arr, L=2):
    d = arr.shape[0]
    out = np.zeros((d, d))  # square output; only the upper triangle is filled
    for f in range(1, d):
        # f-th superdiagonal: distance between row i and row i + f
        out[:-f].ravel()[f::d+1] = np.power(np.abs(arr[:-f]-arr[f:]), L).sum(axis=1)
    return np.power(out, 1.0/L)
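As a cross-check for a function like this, the same upper-triangular Euclidean result can be sketched with broadcasting. This is an alternative technique, not the loop used above, and it builds a temporary array of shape (n, n, m), so it assumes the data is small enough to fit in memory.

```python
import numpy as np

arr = np.array([[0., 0.],
                [3., 4.],
                [6., 8.]])

diff = arr[:, None, :] - arr[None, :, :]  # shape (3, 3, 2): all row differences
full = np.sqrt((diff ** 2).sum(axis=-1))  # symmetric distance matrix
upper = np.triu(full, k=1)                # keep entries above the diagonal

print(upper.tolist())
# [[0.0, 5.0, 10.0], [0.0, 0.0, 5.0], [0.0, 0.0, 0.0]]
```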

Error in scipy sparse diags matrix construction

When using scipy.sparse.spdiags or scipy.sparse.diags I have noticed what I consider to be a bug in the routines, e.g.
scipy.sparse.spdiags([1.1,1.2,1.3],1,4,4).toarray()
returns
array([[ 0. , 1.2, 0. , 0. ],
[ 0. , 0. , 1.3, 0. ],
[ 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
That is, for positive diagonals it drops the first k values. One might argue that there is some grand programming reason for this and that I just need to pad with zeros. OK, annoying as that may be, one can use scipy.sparse.diags, which gives the correct result. However, this routine has a bug that can't be worked around:
scipy.sparse.diags([1.1,1.2],0,(4,2)).toarray()
gives
array([[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ],
[ 0. , 0. ]])
nice, and
scipy.sparse.diags([1.1,1.2],-2,(4,2)).toarray()
gives
array([[ 0. , 0. ],
[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2]])
but
scipy.sparse.diags([1.1,1.2],-1,(4,2)).toarray()
gives an error saying ValueError: Diagonal length (index 0: 2 at offset -1) does not agree with matrix size (4, 2). Obviously the answer is
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
and for extra random behaviour we have
scipy.sparse.diags([1.1],-1,(4,2)).toarray()
giving
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.1],
[ 0. , 0. ]])
Anyone know if there is a function for constructing diagonal sparse matrices that actually works?
Executive summary: spdiags works correctly, even if the matrix input isn't the most intuitive. diags has a bug that affects some offsets in rectangular matrices. There is a bug fix on scipy github.
The example for spdiags is:
>>> data = array([[1,2,3,4],[1,2,3,4],[1,2,3,4]])
>>> diags = array([0,-1,2])
>>> spdiags(data, diags, 4, 4).todense()
matrix([[1, 0, 3, 0],
[1, 2, 0, 4],
[0, 2, 3, 0],
[0, 0, 3, 4]])
Note that the 3rd column of data always appears in the 3rd column of the sparse. The other columns also line up. But they are omitted where they 'fall off the edge'.
The input to this function is a matrix, while the input to diags is a ragged list. The diagonals of the sparse matrix all have different numbers of values, so the specification has to accommodate this one way or the other. spdiags does this by ignoring some values, diags by taking a list input.
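The alignment rule described above (the value in column j of data lands in column j of the result, and values that fall off the edge are dropped) can be imitated in plain NumPy. This is only an illustration of the rule, not SciPy's implementation, and place_diagonals is a hypothetical helper.

```python
import numpy as np

def place_diagonals(data, offsets, m, n):
    """Dense illustration of the rule: data[r][j] lands in column j of the result."""
    out = np.zeros((m, n))
    for row, offset in zip(data, offsets):
        for j, value in enumerate(row[:n]):
            i = j - offset   # row index on the diagonal with this offset
            if 0 <= i < m:   # values that fall off the edge are dropped
                out[i, j] = value
    return out

# Same inputs as the spdiags example above.
example = place_diagonals([[1, 2, 3, 4]] * 3, [0, -1, 2], 4, 4)
print(example.astype(int).tolist())
# [[1, 0, 3, 0], [1, 2, 0, 4], [0, 2, 3, 0], [0, 0, 3, 4]]

# For a positive offset the first value is dropped, so pad with a leading zero.
padded = place_diagonals([[0, 1.1, 1.2, 1.3]], [1], 4, 4)
print(padded[0, 1], padded[1, 2], padded[2, 3])  # 1.1 1.2 1.3
```

This reproduces the behaviour reported in the question: without padding, 1.1 would sit in column 0 of the data and fall off the edge of the first superdiagonal.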
The sparse.diags([1.1,1.2],-1,(4,2)) error is puzzling.
The spdiags equivalent does work:
In [421]: sparse.spdiags([[1.1,1.2]],-1,4,2).A
Out[421]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
The error is raised in this block of code:
for j, diagonal in enumerate(diagonals):
    offset = offsets[j]
    k = max(0, offset)
    length = min(m + offset, n - offset)
    if length <= 0:
        raise ValueError("Offset %d (index %d) out of bounds" % (offset, j))
    try:
        data_arr[j, k:k+length] = diagonal
    except ValueError:
        if len(diagonal) != length and len(diagonal) != 1:
            raise ValueError(
                "Diagonal length (index %d: %d at offset %d) does not "
                "agree with matrix size (%d, %d)." % (
                j, len(diagonal), offset, m, n))
        raise
The actual matrix constructor in the diags is:
dia_matrix((data_arr, offsets), shape=(m, n))
This is the same constructor that spdiags uses, but without any manipulation.
In [434]: sparse.dia_matrix(([[1.1,1.2]],-1),shape=(4,2)).A
Out[434]:
array([[ 0. , 0. ],
[ 1.1, 0. ],
[ 0. , 1.2],
[ 0. , 0. ]])
In dia format, the inputs are stored exactly as given by spdiags (complete with that matrix with extra values):
In [436]: M.data
Out[436]: array([[ 1.1, 1.2]])
In [437]: M.offsets
Out[437]: array([-1], dtype=int32)
As @user2357112 points out, length = min(m + offset, n - offset) is wrong, producing 3 in the test case. Changing it to length = min(m + k, n - k) makes all cases for this (4,2) matrix work. But it fails with the transpose: diags([1.1,1.2], 1, (2, 4))
The correction, as of Oct 5, for this issue is:
https://github.com/pv/scipy-work/commit/529cbde47121c8ed87f74fa6445c05d71353eb6c
length = min(m + offset, n - offset, min(m,n))
With this fix, diags([1.1,1.2], 1, (2, 4)) works.
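The corrected length formula can be checked against NumPy's own notion of a diagonal. The sketch below is a standalone check, not part of SciPy; diagonal_length and old_length are hypothetical helper names.

```python
import numpy as np

def diagonal_length(m, n, offset):
    """Length of the diagonal at `offset` in an (m, n) matrix (corrected formula)."""
    return min(m + offset, n - offset, min(m, n))

def old_length(m, n, offset):
    """The formula from the code block that raised the error."""
    return min(m + offset, n - offset)

# Compare with the length of the diagonal that NumPy extracts from a dense array.
for m, n in [(4, 2), (2, 4), (4, 4)]:
    for offset in range(-(m - 1), n):
        expected = len(np.diagonal(np.zeros((m, n)), offset))
        assert diagonal_length(m, n, offset) == expected

print(old_length(4, 2, -1), diagonal_length(4, 2, -1))  # 3 2
print(diagonal_length(2, 4, 1))                         # 2
```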
