Confused with the output of sklearn.neighbors.NearestNeighbors

Here is the code.
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]])
>>> distances
array([[0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.        , 1.41421356]])
I don't really understand the shape of 'indices' and 'distances'. How do I understand what these numbers mean?

It's pretty straightforward, actually. For each data sample in the input to kneighbors() (X here), it will show 2 neighbors, because you specified n_neighbors=2. indices gives you the positions of those neighbors in the training data (again X here), and distances gives you the distances to the corresponding training points (the ones the indices refer to).
Take a single data point as an example. Treating X[0] as the first query point, the answer is in indices[0] and distances[0].
So for X[0]:
the index of the first nearest neighbor in the training data is indices[0, 0] = 0 and its distance is distances[0, 0] = 0. You can use this index value to get the actual data sample from the training data.
This makes sense, because you used the same data for training and querying, so the first nearest neighbor of each point is the point itself, at distance 0.
the index of the second nearest neighbor is indices[0, 1] = 1 and its distance is distances[0, 1] = 1.
Similarly for all other points. The first dimension of indices and distances corresponds to the query points and the second dimension to the number of neighbors requested.
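As a quick sanity check, you can map the returned indices back to coordinates with plain NumPy indexing; a minimal sketch using the X, indices and distances from above:
# the second nearest neighbor of X[0] is the training point at index indices[0, 1]
print(X[indices[0, 1]])  # [-2 -1]
# fancy indexing maps every index back to a point in one go;
# neighbors has shape (n_queries, n_neighbors, n_features)
neighbors = X[indices]
print(neighbors[0])      # coordinates of the two nearest neighbors of X[0]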

Maybe a little sketch will help
As an example, the closest point to the training sample with index 0 is 1, and since you are using n_neighbors = 2 (two neighbors) you would expect to see this pair in the results. And indeed you see that the pair [0, 1] appears in the output.

To add to the above, here is how you can gather the n_neighbors=2 neighbors referenced by the indices array into a pandas DataFrame (note that X is a NumPy array, so it is indexed with X[...], not X.iloc[...]):
import pandas as pd
df = pd.DataFrame([X[indices[row, col]] for row in range(indices.shape[0]) for col in range(indices.shape[1])])


Define cluster centers manually

Doing k-means cluster analysis, how do I manually define a certain cluster center?
For example I want to say my cluster centers are [1,2,3] and [3,4,5] and now I want to cluster my vectors to the predefined centers.
something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]] ?
To work around my problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
It basically defines a cluster for each vector. But it takes ages to compute, as I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without the need to compute them with the k-means algorithm (as the center outputs are basically the vector coordinates after I run the algorithm...).
Edit to be more specific about my task:
What I want is this: I have tons of vectors (generated from sentences) and now I want to cluster them. But imagine I have two columns of sentences and always want to assign a B column sentence to an A column sentence, not A column sentences to each other. That's why I want to set the cluster centers to the A column vectors and afterwards predict the closest B vectors to these centers. Hope that makes sense.
I am using sklearn kmeans atm
I think I know what you want to do. You want to manually select the centroids for k-means from some known examples and then perform the clustering to assign the closest data points to your pre-defined centroids.
The parameter you are looking for is the k-means initialization parameter init (see the documentation).
I have prepared a small example that would do exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 data points with 3 features
data = [[1, 0, 0],
        [1, 0.2, 0],
        [0, 0, 1],
        [0, 0, 0.9],
        [1, 0, 0.1]]
X = np.array(data)
distance_matrix(X, X)
The pairwise distance matrix shows which examples are closest to each other.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
you can select certain data points to be used as your initial centroids
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1) # run just one k-Means iteration so the centroids barely move; n_init=1 since we pass an explicit init array
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
As you can see k-Means labels the data points as expected. You might want to omit the max_iter parameter if you want your centroids to be updated.
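For the A/B-column use case from the question, a minimal sketch along the same lines (A_vectors and B_vectors are hypothetical names for your two sets of sentence vectors) would fix the centers to the A vectors and then assign each B vector to its closest A center via predict:
import numpy as np
from sklearn.cluster import KMeans
# hypothetical sentence vectors; replace with your own data
A_vectors = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # one cluster center per A sentence
B_vectors = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
kmeans = KMeans(n_clusters=len(A_vectors), init=A_vectors, n_init=1, max_iter=1)
kmeans.fit(A_vectors)               # each A vector is its own cluster, so the centers stay put
labels = kmeans.predict(B_vectors)  # index of the closest A center for each B vector
print(labels)                       # [0 1]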

The covariance matrix obtained from np.cov() is different from the calculation by hand

Here are the two vectors:
X = [1,3,4,5]
Y = [2,6,2,2]
np.cov(X,Y)
The output is
array([[ 2.91666667, -0.33333333],
       [-0.33333333,  4.        ]])
Here is the calculation by hand:
mu(X) = (1+3+4+5)/4 = 3.25
mu(Y) = (2+6+2+2)/4 = 3
var[X] = E[(X - mu(X))(X - mu(X))] = 8.75
var[Y] = E[(Y - mu(Y))(Y - mu(Y))] = 12
cov[X,Y] = E[(X - mu(X))(Y - mu(Y))] = -1
cov[Y,X] = E[(Y - mu(Y))(X - mu(X))] = -1
so the result is
array([[8.75, -1],
       [-1,   12]])
Notice that the result calculated by hand is just 3 times the array you get from np.cov(X,Y). My question is: why are the two matrices different, and does the 3 mean anything here, or is it just a coincidence?
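The factor of 3 is no coincidence: np.cov normalizes by N - 1 = 3 by default (Bessel's correction), while the hand calculation above only sums the products of deviations without dividing by anything. A quick check using the ddof argument of np.cov:
import numpy as np
X = [1, 3, 4, 5]
Y = [2, 6, 2, 2]
print(np.cov(X, Y))          # divides by N - 1 = 3 (the default)
print(np.cov(X, Y, ddof=0))  # divides by N = 4 instead (population covariance)
print(3 * np.cov(X, Y))      # recovers the raw sums from the hand calculation: [[8.75 -1], [-1 12]]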

Coloring specific links in a dendrogram

In a dendrogram from a hierarchical clustering in scipy, I would like to highlight links connecting specific two labels, let's say 0 and 1.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
clustering = hac.linkage(points, method='single', metric='cosine')
link_colors = ["black"] * (2 * len(points) - 1)
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
The clustering has the following format:
clustering[i] corresponds to node number len(points) + i and its first two numbers are indices of nodes that are linked. Nodes with indices smaller than len(points) correspond to original points, higher indices to the clusters.
When drawing the dendrogram, different indexing of the links is used and these are the indices that are used for choosing the color. How do the indices of the links (as indexed in link_colors) correspond to indices in clustering?
You were very close to the solution. The rows of clustering are sorted by the size of its third column (the merge distance). The indices of the color list for link_color_func are the row indices of clustering plus the length of points.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
import numpy as np
# Sample data
points = np.array([[8, 7, 7, 1],
                   [8, 4, 7, 0],
                   [4, 0, 6, 4],
                   [2, 4, 6, 3],
                   [3, 7, 8, 5]])
clustering = hac.linkage(points, method='single', metric='cosine')
clustering looks like this:
array([[3. , 4. , 0.00766939, 2. ],
[0. , 1. , 0.02763245, 2. ],
[5. , 6. , 0.13433008, 4. ],
[2. , 7. , 0.15768043, 5. ]])
As you can see, the ordering (and thus the row index) results from clustering being sorted by the third column.
To highlight now a specific link (e.g. [0,1] as you proposed) you have to find the row index of the pair [0,1] within clustering and add len(points). The resulting number is the index of the color list provided for link_color_func.
# Initialize the link_colors list with 'black' (as you did already)
link_colors = ['black'] * (2 * len(points) - 1)
# Specify the link you want to highlight
link_highlight = (0, 1)
# Find the index in clustering where the first two columns equal link_highlight.
# This raises an exception if you look for a link that is not in clustering (e.g. [0, 4])
index_highlight = np.where((clustering[:, 0] == link_highlight[0]) &
                           (clustering[:, 1] == link_highlight[1]))[0][0]
# The index in the color list of the desired link is the index in clustering + the length of points
link_colors[index_highlight + len(points)] = 'red'
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
Like this, you can highlight the desired link:
It works also for links between an original element and a cluster or between two clusters (e.g. link_highlight = (5, 6))
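If you would rather not depend on which of the two columns holds which index, one possible variation (my addition, under the assumption that both endpoints appear in the same row) matches the pair in either order:
# match the highlighted pair regardless of column order
mask = (np.isin(clustering[:, 0], link_highlight) &
        np.isin(clustering[:, 1], link_highlight))
index_highlight = np.where(mask)[0][0]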

Problem with NearestCentroid, python, cluster

I want to find the centroid coordinates of a cluster (list of points [x,y]).
So, I want to use NearestCentroid() from sklearn.
clf = NearestCentroid()
clf.fit(X, y)
X: an np.array of my coordinate points.
y: an np.array filled entirely with 1s.
I get an error when I call the fit() function:
ValueError: y has less than 2 classes
Maybe there is a problem with the array shapes
(X = (7, 2), y = (7,)).
The error occurs because NearestCentroid is a classifier, so it needs at least two distinct classes in y; for simply computing a centroid you don't need it at all. The centroid of a set of points can be calculated by summing the values in each dimension and averaging them. You can use numpy.mean() for this. Refer to the documentation: numpy.mean
import numpy as np
points = [[0, 0],
          [1, 1],
          [0, 1],
          [0, 100]]
a = np.array(points)
centroid = np.mean(a, axis=0)
print(centroid)
Which will give:
[ 0.25 25.5 ]
You can verify this by hand. Sum up the x-axis values: 0+1+0+0 = 1 and average it: 1/4. Same for y-axis: 0+1+1+100 = 102, average it: 102/4 = 25.5.
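If you later have several clusters labeled in y, a small extension of the same idea (a sketch, assuming integer labels) computes one centroid per label with a boolean mask:
import numpy as np
points = np.array([[0, 0], [1, 1], [0, 1], [0, 100]])
labels = np.array([0, 0, 0, 1])  # hypothetical cluster labels
for label in np.unique(labels):
    centroid = points[labels == label].mean(axis=0)
    print(label, centroid)
# 0 [0.33333333 0.66666667]
# 1 [  0. 100.]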

How to find linearly independent rows from a matrix

How can I identify the linearly independent rows of a matrix? For instance, in
[0 1 0 0]
[0 0 1 0]
[0 1 1 0]
[1 0 0 1]
the 4th row is independent.
First, your 3rd row is linearly dependent on your 1st and 2nd rows. However, your 1st and 4th columns are linearly dependent.
Two methods you could use:
Eigenvalue
If one eigenvalue of the matrix is zero, its corresponding eigenvector is linearly dependent. The documentation of eig states that the returned eigenvalues are repeated according to their multiplicity and not necessarily ordered. However, assuming the eigenvalues correspond to your row vectors, one method would be:
import numpy as np

matrix = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

lambdas, V = np.linalg.eig(matrix.T)
# The linearly dependent row vectors
print(matrix[lambdas == 0, :])
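Note that comparing floating point eigenvalues to exactly zero is fragile; a tolerance-based variant (my suggestion, not part of the original answer) would be:
# use a numerical tolerance instead of an exact comparison with 0
print(matrix[np.isclose(lambdas, 0), :])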
Cauchy-Schwarz inequality
To test for linear dependence of vectors and figure out which ones are dependent, you can use the Cauchy-Schwarz inequality. Basically, if the inner product of two vectors is equal to the product of their norms, the vectors are linearly dependent. Here is an example for the columns:
import numpy as np

matrix = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])

print(np.linalg.det(matrix))

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[0]):
        if i != j:
            inner_product = np.inner(matrix[:, i], matrix[:, j])
            norm_i = np.linalg.norm(matrix[:, i])
            norm_j = np.linalg.norm(matrix[:, j])

            print('I: ', matrix[:, i])
            print('J: ', matrix[:, j])
            print('Prod: ', inner_product)
            print('Norm i: ', norm_i)
            print('Norm j: ', norm_j)
            if np.abs(inner_product - norm_j * norm_i) < 1E-5:
                print('Dependent')
            else:
                print('Independent')
Testing the rows is a similar approach.
You could then extend this to test all combinations of vectors, but I imagine this solution scales badly with size.
With sympy you can find the linearly independent rows using sympy.Matrix.rref:
>>> import sympy
>>> import numpy as np
>>> mat = np.array([[0,1,0,0],[0,0,1,0],[0,1,1,0],[1,0,0,1]]) # your matrix
>>> _, inds = sympy.Matrix(mat).T.rref() # to check the rows you need to transpose!
>>> inds
[0, 1, 3]
Which basically tells you that rows 0, 1 and 3 are linearly independent, while row 2 isn't (it's a linear combination of rows 0 and 1).
You can then keep exactly these independent rows with slicing:
>>> mat[inds]
array([[0, 1, 0, 0],
[0, 0, 1, 0],
[1, 0, 0, 1]])
This also works well for rectangular (not only square) matrices.
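If you instead want the indices of the dependent rows, one way (my addition) is to take the complement of inds:
# rows not in inds are linear combinations of the independent ones
dependent = [i for i in range(mat.shape[0]) if i not in inds]
print(dependent)  # [2]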
I edited the code for the Cauchy-Schwarz inequality so that it scales better with dimension. The inputs are the matrix and its dimension, while the output is a new rectangular matrix containing, along its rows, the linearly independent columns of the starting matrix. This works under the assumption that the first column is never null, but it can readily be generalized to handle that case too. Another thing I observed is that 1e-5 seems to be a "sloppy" threshold, since some particular pathological vectors were found to be linearly dependent with it; 1e-4 doesn't give me the same problems. It was pretty difficult for me to find a really working routine to extract linearly independent vectors, so I'm willing to share mine. If you find some bugs, please report them!
from numpy import absolute, dot, zeros
from numpy.linalg import matrix_rank, norm

def find_li_vectors(dim, R):
    r = matrix_rank(R)
    index = zeros(r, dtype=int)  # this will save the positions of the li columns in the matrix
    counter = 0
    index[0] = 0  # without loss of generality we pick the first column as linearly independent
    j = 0         # therefore the second index is simply 0

    for i in range(R.shape[1]):  # loop over the columns
        if i != j:  # if the two columns are not the same
            inner_product = dot(R[:, i], R[:, j])  # compute the scalar product
            norm_i = norm(R[:, i])                 # compute norms
            norm_j = norm(R[:, j])

            # inner product and the product of the norms are equal only if the two vectors are parallel,
            # therefore we are looking for the ones which exhibit a difference bigger than a threshold
            if absolute(inner_product - norm_j * norm_i) > 1e-4:
                counter += 1        # counter is incremented
                index[counter] = i  # index is saved
                j = i               # j is refreshed
                # do not forget to refresh j: otherwise you would only compare against the first column!

    R_independent = zeros((r, dim))
    i = 0
    # now save the independent columns as the rows of a new matrix
    while i < r:
        R_independent[i, :] = R[:, index[i]]
        i += 1
    return R_independent
I know this was asked a while ago, but here is a very simple (although probably inefficient) solution. Given an array, the following finds a set of linearly independent vectors by progressively adding a vector and testing if the rank has increased:
from numpy.linalg import matrix_rank

def LI_vecs(dim, M):
    LI = [M[0]]
    for i in range(dim):
        tmp = []
        for r in LI:
            tmp.append(r)
        tmp.append(M[i])                # set tmp = LI + [M[i]]
        if matrix_rank(tmp) > len(LI):  # test if M[i] is linearly independent from all (row) vectors in LI
            LI.append(M[i])             # note that matrix_rank does not need to take in a square matrix
    return LI                           # return a set of linearly independent (row) vectors

# Example
mat = [[1, 2, 3, 4], [4, 5, 6, 7], [5, 7, 9, 11], [2, 4, 6, 8]]
LI_vecs(4, mat)
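For this example the function returns only the first two rows, [[1, 2, 3, 4], [4, 5, 6, 7]], since [5, 7, 9, 11] is the sum of those two and [2, 4, 6, 8] is twice the first one.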
I interpret the problem as finding rows that are linearly independent from other rows.
That is equivalent to finding rows that are linearly dependent on other rows.
Gaussian elimination, treating numbers smaller than a threshold as zeros, can do that. It is faster than finding the eigenvalues of the matrix, testing all combinations of rows with the Cauchy-Schwarz inequality, or computing a singular value decomposition.
See:
https://math.stackexchange.com/questions/1297437/using-gauss-elimination-to-check-for-linear-dependence
Problem with floating point numbers:
http://numpy-discussion.10968.n7.nabble.com/Reduced-row-echelon-form-td16486.html
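As a concrete sketch of this idea (my own illustration, not taken from the linked discussions), a pivoted QR decomposition of the transpose yields the indices of one maximal set of linearly independent rows, with a threshold absorbing floating point noise:
import numpy as np
from scipy.linalg import qr

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)

# QR with column pivoting on A.T: the pivot order ranks the rows of A
Q, R, pivots = qr(A.T, pivoting=True)
tol = 1e-10  # treat smaller diagonal entries of R as zero
rank = int(np.sum(np.abs(np.diag(R)) > tol))
independent_rows = sorted(pivots[:rank])
print(independent_rows)  # indices of one maximal independent set (which one depends on the pivoting)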
With regards to the following discussion:
Find dependent rows/columns of a matrix using Matlab?
from sympy import *
A = Matrix([[1,1,1],[2,2,2],[1,7,5]])
print(A.nullspace())
It is obvious that the first and second rows are multiples of each other.
If we execute the above code, we get [-1/3, -2/3, 1]. The indices of the zero elements in the null space show independence. But why is the third element here not zero? If we multiply the A matrix with the null space vector, we get a zero column vector. So what's wrong?
The answer which we are looking for is the null space of the transpose of A.
B = A.T
print(B.nullspace())
Now we get [-2, 1, 0], which shows that the third row is independent.
Two important notes here:
- Consider whether we want to check the row dependencies or the column dependencies.
- Notice that the null space of a matrix is not equal to the null space of the transpose of that matrix unless the matrix is symmetric.
You can find the vectors spanning the column space of the matrix by using the columnspace() method of SymPy's Matrix object. These are automatically linearly independent columns of the matrix.
import sympy as sp
import numpy as np

M = sp.Matrix([[0, 1, 0, 0],
               [0, 0, 1, 0],
               [1, 0, 0, 1]])

for i in M.columnspace():
    print(np.array(i))
    print()

# The output is the following:
# [[0]
#  [0]
#  [1]]
# [[1]
#  [0]
#  [0]]
# [[0]
#  [1]
#  [0]]
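If you want the rows instead, SymPy's Matrix also has a rowspace() method (a small addition on my part); note that it returns basis vectors taken from the echelon form, so they are not necessarily rows of the original matrix:
for r in sp.Matrix([[0, 1, 0, 0],
                    [0, 0, 1, 0],
                    [0, 1, 1, 0],
                    [1, 0, 0, 1]]).rowspace():
    print(np.array(r))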
