Define cluster centers manually - python

When doing k-means cluster analysis, how do I manually define specific cluster centers?
For example, I want to say my cluster centers are [1,2,3] and [3,4,5], and then assign my vectors to these predefined centers.
Something like kmeans.cluster_centers_ = [[1,2,3],[3,4,5]]?
To work around the problem, this is what I do at the moment:
number_of_clusters = len(vec)
kmeans = KMeans(number_of_clusters, init='k-means++', n_init=100)
kmeans.fit(vec)
This basically defines one cluster per vector, but it takes ages to compute because I have thousands of vectors/sentences. There must be an option to set the vector coordinates directly as cluster coordinates without computing them with the k-means algorithm (the resulting centers are basically the vector coordinates after I run the algorithm anyway).
Edit to be more specific about my task:
I have tons of vectors (generated from sentences) that I want to cluster. Imagine I have two columns of sentences and always want to match a column-B sentence to a column-A sentence, never column-A sentences to each other. That is why I want to set the column-A vectors as cluster centers and afterwards find the closest B vectors to these centers. I hope that makes sense.
I am using sklearn KMeans at the moment.

I think I know what you want to do: manually select the centroids for k-Means from some known examples, and then perform the clustering so that the closest data points are assigned to your predefined centroids.
The parameter you are looking for is the k-Means initialization parameter, init (see the documentation). Besides a method name, it also accepts an array of initial centers.
I have prepared a small example that does exactly this.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import distance_matrix
# 5 datapoints with 3 features
data = [[1, 0, 0],
[1, 0.2, 0],
[0, 0, 1],
[0, 0, 0.9],
[1, 0, 0.1]]
X = np.array(data)
distance_matrix(X,X)
The pairwise distance matrix shows which examples are closest to each other.
> array([[0. , 0.2 , 1.41421356, 1.3453624 , 0.1 ],
> [0.2 , 0. , 1.42828569, 1.36014705, 0.2236068 ],
> [1.41421356, 1.42828569, 0. , 0.1 , 1.3453624 ],
> [1.3453624 , 1.36014705, 0.1 , 0. , 1.28062485],
> [0.1 , 0.2236068 , 1.3453624 , 1.28062485, 0. ]])
You can select certain data points to be used as your initial centroids:
centroid_idx = [0,2] # let data point 0 and 2 be our centroids
centroids = X[centroid_idx,:]
print(centroids) # [[1. 0. 0.]
# [0. 0. 1.]]
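# max_iter=1 runs a single k-Means iteration; n_init=1 because the initial centers are given explicitly.
# Note: cluster_centers_ is still updated once, to the mean of the points assigned to each centroid.
kmeans = KMeans(n_clusters=2, init=centroids, n_init=1, max_iter=1)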
kmeans.fit(X)
kmeans.labels_
>>> array([0, 0, 1, 1, 0], dtype=int32)
As you can see, k-Means labels the data points as expected. Omit the max_iter parameter if you want the centroids to be updated until convergence.
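If the centers should stay exactly where you put them (as in the column-A/column-B task from the question), no k-Means iterations are needed; a nearest-center lookup is enough. The following is a minimal sketch, assuming Euclidean distance. The arrays centers_a and vectors_b are made-up illustrative data, not taken from the question.

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Hypothetical fixed centers (e.g. the column-A vectors)
centers_a = np.array([[1.0, 2.0, 3.0],
                      [3.0, 4.0, 5.0]])

# Hypothetical vectors to assign (e.g. the column-B vectors)
vectors_b = np.array([[1.1, 2.0, 2.9],
                      [2.9, 4.2, 5.0],
                      [0.0, 1.0, 2.0]])

# For every row of vectors_b, the index of the closest row in centers_a
labels = pairwise_distances_argmin(vectors_b, centers_a)
print(labels)
```

Because nothing is fitted, the centers never move, and the cost is a single distance computation between the two sets of vectors.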

Related

Linear Dependence of Set of Vectors in numpy

I want to check whether some vectors are dependent on each other or not by numpy, I found some good suggestions for checking linear dependency of rows of a matrix in the link below:
How to find linearly independent rows from a matrix
I can not understand the 'Cauchy-Schwarz inequality' method which I think is due to lack of my knowledge, however I tried the Eigenvalue method to check linear dependency among columns and here is my code:
A = np.array([
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 1, 1, 0],
[1, 0, 0, 1]
])
lambdas, V = np.linalg.eig(A)
print(lambdas)
print(V)
and I get:
[ 1. 0. 1.61803399 -0.61803399]
[[ 0. 0.70710678 0.2763932 -0.7236068 ]
[ 0. 0. 0.4472136 0.4472136 ]
[ 0. 0. 0.7236068 -0.2763932 ]
[ 1. -0.70710678 0.4472136 0.4472136 ]]
My question is: how do these eigenvectors and eigenvalues relate to the dependency of the columns of my matrix? How can I tell from these values which columns depend on each other and which are independent?
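The second column vector of V corresponds to the eigenvalue 0. An eigenvalue of 0 means A @ v = 0 for that eigenvector v, so its nonzero entries tell you which columns of A combine to zero. Here v is proportional to [1, 0, 0, -1], so the first and last columns of A are dependent (they are identical).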
Just take a look at the API documentation when you get confused.
v : (…, M, M) array
The normalized (unit “length”) eigenvectors, such that the column
v[:,i] is the eigenvector corresponding to the eigenvalue w[i].
You can find the linearly independent columns by QR decomposition as described here.
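To make the link between the zero eigenvalue and the dependent columns concrete, here is a small sketch using the matrix from the question. It assumes a square matrix with real eigenvalues; for a general (non-square) matrix the rank check still applies, but the eigenvalue approach does not.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)

# A rank smaller than the number of columns means some columns are dependent
rank = np.linalg.matrix_rank(A)

lambdas, V = np.linalg.eig(A)

# Eigenvectors whose eigenvalue is (numerically) zero satisfy A @ v = 0
null_vectors = V[:, np.isclose(lambdas, 0.0)]
v = null_vectors[:, 0]

# Columns with a nonzero coefficient in v take part in the dependency
dependent_columns = np.flatnonzero(~np.isclose(v, 0.0))
print(rank, dependent_columns)
```

The coefficients in v give the linear combination of columns that sums to zero, so every column with a nonzero coefficient can be written in terms of the others.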

How to read contents of scipy hierarchy cluster

I have the following code with which I am clustering hierarchically. My data object is an array of similarity distances I calculated earlier. I think I am executing the clustering properly. I thought I could just get the leaves of the Cluster, but when I compare that to the original input I get a mismatch.
I have two questions here:
Why is there a mismatch between the leaves of my cluster and my actual input data?
How can I extract the original data from a cluster by either the linkage matrix or clusternodes?
import numpy as np
import pandas
import scipy.cluster.hierarchy as sch
def list_difference(list1, list2):
    return [value for value in list1 if value not in list2]

if __name__ == '__main__':
    # example data for this question's purpose.
    data = [10, 11, 29, 288, 16]
    X = np.array([[i] for i in data])
    linkage_matrix = sch.average(X)
    rootnode, nodelist = sch.to_tree(linkage_matrix, rd=True)
    leaves = sch.leaves_list(linkage_matrix)
    print(list_difference(leaves, data))
I want to retrieve the original data points per cluster.
Given your data
data = [10, 11, 29, 288, 16]
the result is compatible with the dendrogram
sch.dendrogram(linkage_matrix);
Analyzing linkage_matrix we can confirm
print(linkage_matrix)
array([[ 0. , 1. , 1. , 2. ],
[ 4. , 5. , 5.5 , 3. ],
[ 2. , 6. , 16.66666667, 4. ],
[ 3. , 7. , 271.5 , 5. ]])
Row by row we have:
element 0 and element 1, merged at distance 1 into a cluster of 2 elements (this cluster is called 5)
element 4 and cluster 5 (the previous one), merged at distance 5.5 into a cluster of 3 elements (this cluster is called 6)
element 2 and cluster 6 (the previous one), merged at distance 16.667 into a cluster of 4 elements (this cluster is called 7)
element 3 and cluster 7 (the previous one), merged at distance 271.5 into a cluster of 5 elements
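Note that the numbers in the first two columns of the linkage matrix, and the values returned by leaves_list, are positions in the original input rather than the data values themselves, which explains the apparent mismatch. A minimal sketch of mapping them back to the data follows; the choice of cutting the tree into at most 2 flat clusters is an assumption made for illustration.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

data = np.array([10, 11, 29, 288, 16])
X = data.reshape(-1, 1)
linkage_matrix = sch.average(X)

# leaves_list returns positions in the original input, not the values
leaves = sch.leaves_list(linkage_matrix)
values_in_leaf_order = data[leaves]

# Cut the tree into at most 2 flat clusters and group the original values
flat_labels = sch.fcluster(linkage_matrix, t=2, criterion='maxclust')
clusters = {}
for label, value in zip(flat_labels, data):
    clusters.setdefault(int(label), []).append(int(value))

print(values_in_leaf_order)
print(clusters)
```

Indexing the original array with the leaf positions recovers the data points, and fcluster gives one label per input element, in input order.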

Coloring specific links in a dendrogram

In a dendrogram from a hierarchical clustering in scipy, I would like to highlight links connecting specific two labels, let's say 0 and 1.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
clustering = hac.linkage(points, method='single', metric='cosine')
link_colors = ["black"] * (2 * len(points) - 1)
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
The clustering has the following format:
clustering[i] corresponds to node number len(points) + i and its first two numbers are indices of nodes that are linked. Nodes with indices smaller than len(points) correspond to original points, higher indices to the clusters.
When drawing the dendrogram, different indexing of the links is used and these are the indices that are used for choosing the color. How do the indices of the links (as indexed in link_colors) correspond to indices in clustering?
You were very close to the solution. The rows of clustering are sorted by the value of the 3rd column (the merge distance). The indices of the color list used by link_color_func are the row indices of clustering plus the length of points.
import scipy.cluster.hierarchy as hac
from matplotlib import pyplot as plt
import numpy as np
# Sample data
points = np.array([[8, 7, 7, 1],
[8, 4, 7, 0],
[4, 0, 6, 4],
[2, 4, 6, 3],
[3, 7, 8, 5]])
clustering = hac.linkage(points, method='single', metric='cosine')
clustering looks like this:
array([[3. , 4. , 0.00766939, 2. ],
[0. , 1. , 0.02763245, 2. ],
[5. , 6. , 0.13433008, 4. ],
[2. , 7. , 0.15768043, 5. ]])
As you can see the ordering (and thus the row-index) results from clustering being sorted by the third column.
To highlight now a specific link (e.g. [0,1] as you proposed) you have to find the row index of the pair [0,1] within clustering and add len(points). The resulting number is the index of the color list provided for link_color_func.
# Initialize the link_colors list with 'black' (as you did already)
link_colors = ['black'] * (2 * len(points) - 1)
# Specify link you want to have highlighted
link_highlight = (0, 1)
# Find index in clustering where first two columns are equal to link_highlight. This will cause an exception if you look for a link, which is not in clustering (e.g. [0,4])
index_highlight = np.where((clustering[:,0] == link_highlight[0]) *
(clustering[:,1] == link_highlight[1]))[0][0]
# Index in color_list of desired link is index from clustering + length of points
link_colors[index_highlight + len(points)] = 'red'
hac.dendrogram(clustering, link_color_func=lambda k: link_colors[k])
plt.show()
Like this, you can highlight the desired link:
It works also for links between an original element and a cluster or between two clusters (e.g. link_highlight = (5, 6))
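The lookup above can be wrapped in a small helper that returns None instead of raising when the requested pair is not a direct merge. This is only a sketch: link_id is a hypothetical helper name, and no_plot=True is used so that the example runs without drawing anything.

```python
import numpy as np
import scipy.cluster.hierarchy as hac

points = np.array([[8, 7, 7, 1],
                   [8, 4, 7, 0],
                   [4, 0, 6, 4],
                   [2, 4, 6, 3],
                   [3, 7, 8, 5]])
clustering = hac.linkage(points, method='single', metric='cosine')

def link_id(clustering, n_points, pair):
    """Return the index used by link_color_func for a merged pair, or None."""
    lo, hi = sorted(pair)
    first_two = np.sort(clustering[:, :2], axis=1)
    rows = np.flatnonzero((first_two[:, 0] == lo) & (first_two[:, 1] == hi))
    return int(rows[0]) + n_points if rows.size else None

n = len(points)
link_colors = ['black'] * (2 * n - 1)
for pair in [(0, 1), (0, 4)]:  # (0, 4) is not a direct merge and is skipped
    k = link_id(clustering, n, pair)
    if k is not None:
        link_colors[k] = 'red'

# no_plot=True only computes the dendrogram layout and colors
result = hac.dendrogram(clustering, no_plot=True,
                        link_color_func=lambda k: link_colors[k])
print(result['color_list'])
```

Drop no_plot=True and call plt.show() to draw the dendrogram as in the answer above.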

confused with the output of sklearn.neighbors.NearestNeighbors

Here is the code.
from sklearn.neighbors import NearestNeighbors
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
distances, indices = nbrs.kneighbors(X)
>indices
>array([[0, 1],[1, 0],[2, 1],[3, 4],[4, 3],[5, 4]])
>distances
>array([[0. , 1. ],[0. , 1. ],[0. , 1.41421356], [0. , 1. ],[0. , 1. ],[0. , 1.41421356]])
I don't really understand the shape of 'indices' and 'distances'. How do I understand what these numbers mean?
It's pretty straightforward, actually. For each data sample in the input to kneighbors() (X here), it returns 2 neighbors (because you specified n_neighbors=2). indices gives the index of each neighbor in the training data (again X here), and distances gives the distance from the query point to that neighbor.
Take an example of single data point. Assuming X[0] as the first query point, the answer will be indices[0] and distances[0]
So for X[0],
the index of first nearest neighbor in training data is indices[0, 0] = 0 and distance is distances[0, 0] = 0. You can use this index value to get the actual data sample from the training data.
This makes sense, because you used the same data for training and testing, so the first nearest neighbor for each point is itself and the distance is 0.
the index of the second nearest neighbor is indices[0, 1] = 1 and the distance is distances[0, 1] = 1
Similarly for all other points. The first dimension in indices and distances correspond to the query points and second dimension to the number of neighbors asked.
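The shapes and the "first neighbor is the point itself" behaviour can be checked directly. The sketch below reuses the data from the question; the query point [0.5, 0.5] is a made-up example of a point that is not in the training data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)

# One row per query point, one column per requested neighbor
distances, indices = nbrs.kneighbors(X)
print(indices.shape, distances.shape)

# Querying with the training data: column 0 is each point itself
self_is_first = np.array_equal(indices[:, 0], np.arange(len(X)))

# A hypothetical new query point that is not part of the training data
query = np.array([[0.5, 0.5]])
q_dist, q_idx = nbrs.kneighbors(query)
neighbors = X[q_idx[0]]  # look up the actual neighbor coordinates
print(q_idx, neighbors)
```

For a query point outside the training data, the first column is no longer a trivial self-match, so both columns carry useful neighbors.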
Maybe a little sketch will help
As an example, the closest point to the training sample with index 0 is 1, and since you are using n_neighbors = 2 (two neighbors) you would expect to see this pair in the results. And indeed you see that the pair [0, 1] appears in the output.
To add to the above, here is how you can collect the n_neighbors=2 neighbors of every point into a pandas DataFrame using the indices array (X is a NumPy array, so it is indexed directly rather than with .iloc):
import pandas as pd
df = pd.DataFrame([X[indices[row, col]] for row in range(indices.shape[0]) for col in range(indices.shape[1])])

affinity propagation in python

I am seeing something strange while using AffinityPropagation from sklearn. I have a 4 x 4 numpy ndarray - which is basically the affinity-scores. sim[i, j] has the affinity score of [i, j]. Now, when I feed into the AffinityPropgation function, I get a total of 4 labels.
Here is a similar example with a smaller matrix:
In [215]: x = np.array([[1, 0.2, 0.4, 0], [0.2, 1, 0.8, 0.3], [0.4, 0.8, 1, 0.7], [0, 0.3, 0.7, 1]]
.....: )
In [216]: x
Out[216]:
array([[ 1. , 0.2, 0.4, 0. ],
[ 0.2, 1. , 0.8, 0.3],
[ 0.4, 0.8, 1. , 0.7],
[ 0. , 0.3, 0.7, 1. ]])
In [217]: clusterer = cluster.AffinityPropagation(affinity='precomputed')
In [218]: f = clusterer.fit(x)
In [219]: f.labels_
Out[219]: array([0, 1, 1, 1])
This says (according to Kevin), that the first sample (0th-indexed row) is a cluster (Cluster # 0) on its own and the rest of the samples are in another cluster (cluster # 1). But, still, I do not understand this output. What is a sample here? What are the members? I want to have a set of pairs (i, j) assigned to one cluster, another set of pairs assigned to another cluster and so on.
It looks like a 4-sample x 4-feature matrix..which I do not want. Is this the problem? If so, how to convert this to a nice 4-sample x 4-sample affinity-matrix?
The documentation (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) says
fit(X, y=None)
Create affinity matrix from negative euclidean distances, then apply affinity propagation clustering.
Parameters:
X: array-like, shape (n_samples, n_features) or (n_samples, n_samples) :
Data matrix or, if affinity is precomputed, matrix of similarities / affinities.
Thanks!
By your description it sounds like you are working with a "pairwise similarity matrix" x (although your example data does not show that). If this is the case, your matrix should be symmetric, so that sim[i,j] == sim[j,i], with diagonal values equal to 1. Example similarity data S:
S
array([[ 1. , 0.08276253, 0.16227766, 0.47213595, 0.64575131],
[ 0.08276253, 1. , 0.56776436, 0.74456265, 0.09901951],
[ 0.16227766, 0.56776436, 1. , 0.47722558, 0.58257569],
[ 0.47213595, 0.74456265, 0.47722558, 1. , 0.87298335],
[ 0.64575131, 0.09901951, 0.58257569, 0.87298335, 1. ]])
Typically when you already have a distance matrix you should use affinity='precomputed'. But in your case, you are using similarity. In this specific example you can transform it to a pseudo-distance using 1 - S. (The reason to do this is that I don't know whether Affinity Propagation will give you the expected results if you give it a similarity matrix as input):
1 - S
array([[ 0. , 0.91723747, 0.83772234, 0.52786405, 0.35424869],
[ 0.91723747, 0. , 0.43223564, 0.25543735, 0.90098049],
[ 0.83772234, 0.43223564, 0. , 0.52277442, 0.41742431],
[ 0.52786405, 0.25543735, 0.52277442, 0. , 0.12701665],
[ 0.35424869, 0.90098049, 0.41742431, 0.12701665, 0. ]])
With that being said, I think this is where your interpretation was off:
This says that the first 3-rows are similar, 4th row is a cluster on its own, and the 5th row is also a cluster on its own. Totally of 3 clusters.
The f.labels_ array:
array([0, 1, 1, 1, 0])
is telling you that samples (not rows) 0 and 4 are in cluster 0 and that samples 1, 2, and 3 are in cluster 1. You don't need 25 different labels for a 5-sample problem; that wouldn't make sense. Hope this helps a little. Try the demo (inspect the variables along the way and compare them with your data), which starts with raw data; it should help you decide whether Affinity Propagation is the right clustering algorithm for you.
According to this page https://scikit-learn.org/stable/modules/clustering.html
you can use a similarity matrix for AffinityPropagation.
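As a sketch of that usage, the similarity matrix from the question can be passed directly with affinity='precomputed'. The random_state argument is an assumption added here for reproducibility (it is available in recent scikit-learn versions), and the resulting labels may differ between versions, so none are shown.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Pairwise similarity matrix from the question: x[i, j] is the affinity of samples i and j
x = np.array([[1.0, 0.2, 0.4, 0.0],
              [0.2, 1.0, 0.8, 0.3],
              [0.4, 0.8, 1.0, 0.7],
              [0.0, 0.3, 0.7, 1.0]])

clusterer = AffinityPropagation(affinity='precomputed', random_state=0)
labels = clusterer.fit(x).labels_

# One label per sample (4 labels for a 4 x 4 matrix), not one per (i, j) pair
for sample, label in enumerate(labels):
    print(f"sample {sample} -> cluster {label}")

# Indices of the samples chosen as exemplars (cluster representatives)
print(clusterer.cluster_centers_indices_)
```

Each row/column index of the matrix is one sample, so the output assigns samples, not pairs, to clusters.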
