Spectral Clustering Scikit learn print items in Cluster - python
I know I can get the contents of a particular cluster in K-means clustering with the following code using scikit-learn.
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How do I do the same for spectral clustering, given that it has no 'cluster_centers_' attribute? I am trying to cluster terms in text documents.
UPDATED:
Sorry, I didn't understand your question correctly at first.
I think it's impossible to do what you want with spectral clustering, because the method itself doesn't compute any centers; it doesn't need them at all. It doesn't even operate on the sample points in the raw space: spectral clustering transforms your dataset into a different subspace and then clusters the points in that transformed space. And I don't know how to invert this transformation mathematically.
A Tutorial on Spectral Clustering
Maybe you should ask this as a more theoretical question on one of the math-related communities of SO.
import numpy as np
from sklearn import cluster
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
spectral.fit(X)
y_pred = spectral.labels_.astype(int)  # np.int was removed in NumPy 1.24
From here
Spectral clustering does not compute any centroids. In a more practical context, if you really need some kind of 'centroids' derived from the spectral clustering result, you can always compute the average (mean) of the points belonging to the same cluster once the clustering has finished. These would be an approximation of the centroids defined in the context of the typical k-means algorithm. The same principle also applies to other clustering algorithms that do not produce centroids (e.g. hierarchical).
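As a rough illustration of the averaging idea, here is a minimal sketch. The helper name cluster_means and the toy data are made up for this example; labels stands in for whatever labels_ your fitted model returns.

```python
import numpy as np

def cluster_means(X, labels):
    """Return (cluster_ids, means), where means[j] is the mean of the rows of X in cluster_ids[j]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    means = np.vstack([X[labels == c].mean(axis=0) for c in cluster_ids])
    return cluster_ids, means

# Toy data: four 2-d points in two clusters (hypothetical labels).
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])
cluster_ids, means = cluster_means(X, labels)
print(means)
```

For the text use case in the question, X would be the document-term matrix (converted with .toarray() if it is sparse and small enough), and means.argsort()[:, ::-1] could then play the role that order_centroids plays in the K-means snippet.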
While it's true that you can't get the cluster centers for spectral clustering, you can do something close that might be useful in some cases. To explain, I'll run through the spectral clustering algorithm quickly and explain the modification.
First, let's call our dataset X = {x_1, ..., x_N}, where each point is d-dimensional (d is the number of features you have in your dataset). We can think of X as an N by d matrix. Let's say we want to put this data into k clusters. Spectral clustering first transforms the data set into another representation and then uses K-means clustering on the new representation of the data to obtain clusters.
First, the affinity matrix A is formed by using K-neighbors information. For this, we need to choose a positive integer n to construct A. The element A_{i, j} is equal to 1 if x_i and x_j are both in the list of the top n neighbors of each other, and A_{i, j} is equal to 0 otherwise. A is a symmetric N by N matrix. Next, we construct the normalized Laplacian matrix L of A, which is L = I - D^{-1/2}AD^{-1/2}, where D is the degree matrix. Then the eigenvalue decomposition is performed on L to get L = VEV^{-1}, where V is the matrix of eigenvectors of L, and E is the diagonal matrix with the eigenvalues of L on the diagonal. Since L is positive semi-definite, its eigenvalues are all non-negative real numbers. For spectral clustering, we use this to order the columns of V so that the first column of V corresponds to the smallest eigenvalue of L, and the last column to the largest eigenvalue of L.
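To make the formula L = I - D^{-1/2}AD^{-1/2} concrete, here is a small sketch on a made-up 4-node affinity matrix (the matrix is an assumption for illustration, not data from this answer). It builds L by hand, compares it with scipy's laplacian, and computes the ordered eigendecomposition.

```python
import numpy as np
from scipy.sparse.csgraph import laplacian

# Hypothetical affinity matrix: symmetric, zero diagonal, no isolated nodes.
A = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

# Degree of each node, then L = I - D^{-1/2} A D^{-1/2}.
degrees = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt

# eigh is meant for symmetric matrices; it returns eigenvalues in ascending
# order, with the matching eigenvectors as the columns of V.
eigenvalues, V = np.linalg.eigh(L)

# The hand-built matrix should agree with scipy's normalized Laplacian.
matches_scipy = np.allclose(L, laplacian(A, normed=True))
print(matches_scipy, eigenvalues)
```

Because eigh already sorts the eigenvalues in ascending order, the first k columns of V are the ones used in the next step.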
Next, we take the first k columns of V, and view them as N points in k-dimensional space. Let's write this truncated matrix as V', and write its rows as {v'_1, ..., v'_N}, where each v'_i is k-dimensional. Then we use the K-means algorithm to cluster these points into k clusters {C'_1,...,C'_k}. Then the clusters are assigned to the points in the dataset X by "pulling back" the clusters from V' to X: the point x_i is in cluster C_j if and only if v'_i is in cluster C'_j.
Now, one of the main points of transforming X into V' and clustering on that representation is that often X is not spherically distributed, and V' at least comes closer to being so. Since V' is closer to being spherically distributed, the centroid will be "inside" the cluster of points it defines. We can take the point in V' that is closest to the cluster centroid for each cluster. Let's call the cluster centroids {c_1,...,c_k}. These are points in the parameter space that V' is represented in. Then for each cluster, choose the point of V' that is closest to the cluster's centroid to get k points of V'. Let's say {v'_i_1,...,v'_i_k} are the representative points closest to the cluster centroids of V'. Then choose {x_i_1,...,x_i_k} as the cluster representatives for the clusters of X.
This method might not always work how you might want, but it's at least a way to get closer to what you're wanting, and maybe you can modify it to get closer to what you need. Here's some example code to show how to do this.
Let's use some fake data provided by scikit-learn.
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
moons_data = make_moons(n_samples=1000, noise=0.07, random_state=0)
moons = pd.DataFrame(data=moons_data[0], columns=['x', 'y'])
moons['label_truth'] = moons_data[1]
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(8, 8),
    s=10,
    alpha=0.7
);
I'm going to kind of cheat and use the spectral clustering method provided by scikit-learn, and then extract the affinity matrix from there.
from sklearn.cluster import SpectralClustering
sclust = SpectralClustering(
    n_clusters=2,
    random_state=42,
    affinity='nearest_neighbors',
    n_neighbors=10,
    assign_labels='kmeans'
)
sclust.fit(moons[['x', 'y']]);
moons['label_cluster'] = sclust.labels_
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(16, 14),
    s=10,
    alpha=0.7,
    c='label_cluster',
    cmap='Spectral'
);
Next, we'll compute the normalized Laplacian of the affinity matrix. Instead of computing the whole eigenvalue decomposition of the Laplacian, we use the scipy function eigsh to extract only the two eigenvectors corresponding to the two smallest eigenvalues (two, because we want two clusters).
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
affinity_matrix = sclust.affinity_matrix_
lpn = laplacian(affinity_matrix, normed=True)
w, v = eigsh(lpn, k=2, which='SM')
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    figsize=(16, 16),
    s=10,
    alpha=0.7
);
Then let's use K-means to cluster on this new representation of the data. Let's also find the two points in this new representation that are closest to the cluster centroids, and highlight them.
from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt
kmeans = KMeans(
    n_clusters=2,
    random_state=42
)
kmeans.fit(v)
center_0, center_1 = kmeans.cluster_centers_
# Index of the embedded point closest to each centroid
representative_index_0 = np.argmin(np.array([euclidean(a, center_0) for a in v]))
representative_index_1 = np.argmin(np.array([euclidean(a, center_1) for a in v]))
fig, ax = plt.subplots(figsize=(16, 16));
pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=10,
    alpha=0.7
);
pd.DataFrame(v).iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
)
And finally, let's plot the original dataset with the corresponding points highlighted.
moons['labels_lpn_kmeans'] = kmeans.labels_
fig, ax = plt.subplots(figsize=(16, 14));
moons.plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=10,
    alpha=0.7,
    c='labels_lpn_kmeans',
    cmap='Spectral'
);
moons.iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);
As we can see, the highlighted points are maybe not where we might expect them to be, but this might be useful to give some way of algorithmically choosing points from each cluster.
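The code above hard-codes two clusters. One possible way to generalise the representative-point selection to k clusters is sketched below; the function name representative_indices and the toy embedding are made up for illustration, and pairwise_distances_argmin_min from scikit-learn replaces the explicit loop over euclidean distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def representative_indices(embedding, n_clusters, random_state=42):
    """Cluster the embedded points and return, for each centroid, the row index of the closest point."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    kmeans.fit(embedding)
    # For each centroid, find the index of the nearest row of `embedding`.
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, embedding)
    return closest, kmeans.labels_

# Toy embedding (an assumption): three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
group_centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
embedding = np.vstack([c + 0.1 * rng.standard_normal((20, 2)) for c in group_centers])

reps, labels = representative_indices(embedding, n_clusters=3)
print(reps, labels[reps])
```

With the variables from this answer, the call would be representative_indices(v, n_clusters=2), and moons.iloc[reps] would then give the corresponding points in the original data.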
Related
Python code for automatic execution of the Elbow curve method in K-modes clustering
I have the following code for manual (and therefore possibly wrong) elbow-method selection of the optimal number of clusters when K-modes clustering a binary df:
cost = []
for num_clusters in list(range(1,10)):
    kmode = KModes(n_clusters=num_clusters, init = "Huang", n_init = 10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)
y = np.array([i for i in range(1,10,1)])
plt.plot(y,cost)
The outcome of the for loop is a plot with the so-called elbow curve. I know this curve helps me choose an optimal K. I do not want to do that myself, though; I am looking for a computational way. I want a computer to do the job without me determining it "manually". Otherwise it stops executing the whole code at some point. What would be the code for selecting K automatically that would replace my manual selection? Thank you.
Use the silhouette coefficient [this will not work if the data points are represented as categorical values rather than N-d points].
The silhouette coefficient measures how similar a data point is to its own cluster compared to other clusters. Check the Sklearn doc here. The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar. So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using the digits dataset:
from sklearn.cluster import KMeans
import numpy as np
from sklearn.datasets import load_digits
data, labels = load_digits(return_X_y=True)
from sklearn.metrics import silhouette_score
silhouette_avg = []
for num_clusters in list(range(2,20)):
    kmeans = KMeans(n_clusters=num_clusters, init = "k-means++", n_init = 10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)
import matplotlib.pyplot as plt
plt.plot(np.arange(2,20),silhouette_avg,'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis For Optimal k')
_ = plt.xticks(np.arange(2,20))
print (f"Best K: {np.argmax(silhouette_avg)+2}")
output:
Best K: 9
Python DBSCAN - How to plot clusters based on mean of vectors?
Hi, I have computed the mean of the vectors and used DBSCAN to cluster them. However, I am unsure how I should plot the results, since my data does not have an [x,y,z...] format.
sample dataset:
mean_vec = [[2.2771908044815063], [3.0691280364990234], [2.7700443267822266], [2.6123080253601074], [2.6043469309806824], [2.6386525630950928], [2.7034034729003906], [2.3540258407592773]]
I have used the code below (from scikit-learn) to obtain my clusters:
X = StandardScaler().fit_transform(mean_vec)
db = DBSCAN(eps = 0.15, min_samples = 5).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
Is it possible to plot my clusters? The plot from scikit-learn is not working for me. The scikit-learn link can be found here
On one-dimensional data, use kernel density estimation rather than DBSCAN. It is much better supported by theory and much better understood. One can see DBSCAN as a fast approximation to KDE for the multivariate case. Anyway, plotting one-dimensional data is not that hard. For example, you can plot a histogram. Also, the clusters will necessarily correspond to intervals, so you can also plot lines for the (min, max) of each cluster. You can even abuse 2d scatter plots: simply use the label as the y value.
Elbow method in python
I am trying to implement the elbow method in Python on my own to get the optimum number of clusters. Therefore I collected the inertias of the different k-means runs:
sum_squared_dist = []
K = range(1,30)
for k in K:
    km = KMeans(n_clusters=k, random_state=0)
    km = km.fit(normalized_modeling_data)
    sum_squared_dist.append(km.inertia_)
plt.plot(K, sum_squared_dist, 'bx-')
plt.xlabel('number of clusters k')
plt.ylabel('Sum of squared distances')
plt.show()
So the next step would be to find the point where the curve starts to flatten (which should mean that the first derivative is falling). Is there a built-in method in numpy or scikit-learn to calculate the derivative from an array?
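NumPy does provide discrete differences through np.diff (and np.gradient). The sketch below uses made-up inertia values and a simple heuristic, taking the k with the largest second difference as the elbow; this is one possible rule under those assumptions, not an established standard.

```python
import numpy as np

# Hypothetical inertia values for k = 1..8.
ks = np.arange(1, 9)
inertia = np.array([1000.0, 600.0, 300.0, 250.0, 220.0, 200.0, 185.0, 175.0])

# First difference: change in inertia between consecutive k (discrete slope).
first_diff = np.diff(inertia)
# Second difference: how sharply the slope changes (discrete curvature).
second_diff = np.diff(inertia, n=2)

# second_diff[i] is centred on ks[i + 1], hence the +1 offset.
elbow_k = ks[np.argmax(second_diff) + 1]
print(elbow_k)
```

On noisy curves this heuristic can be unstable, so it is worth checking the chosen k against the plot.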
spatial data clustering with sklearn
I have arrays of latitude and longitude data points on which I want to do hierarchical clustering. Here is my code:
position = zip(longitude, latitude)
X = np.asarray(position)
knn_graph = kneighbors_graph(X, 30, include_self=False, metric= haversine)
for connectivity in (None, knn_graph):
    for n_clusters in(5,8,10,15,20):
        plt.figure(figsize=(4, 5))
        cnt = 0
        for index, linkage in enumerate(('average', 'complete', 'ward')):
            model = AgglomerativeClustering(linkage = linkage, connectivity = connectivity, n_clusters = n_clusters)
            model.fit(X)
            plt.scatter(X[:, 0], X[:, 1], c=model.labels_, cmap=plt.cm.spectral)
            plt.title('linkage=%s (ncluster) %s)' % (linkage, n_clusters), fontdict=dict(verticalalignment='top'))
            plt.axis([37.1, 37.9, -122.6, -121.6])
        plt.show()
The problem is that kneighbors_graph has a parameter called metric, which is how the distance is defined: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.kneighbors_graph.html I want to define my own (the real distance based on longitude, latitude, and the earth's radius). It seems I could not plug in my own function. Any ideas?
Note that the distance function usually expects a string (e.g. "haversine"). There are two places where you use a distance: the knn graph, and the affinity for the clustering. Hierarchical clustering has two types of distances, and thus two distance parameters. One is the distance between objects (e.g. haversine); the other is the distance between clusters, which is usually derived from the object distance by aggregation (e.g. maximum, minimum). Both are often called "distance". In sklearn, the first is called affinity.
Built-in function for plotting bayes decision boundary given the probability function
Is there a function in python, that plots bayes decision boundary if we input a function to it? I know there is one in matlab, but I'm searching for some function in python. I know that one way to achieve this is to iterate over the points, but I am searching for a built-in function. I have bivariate sample points on the axis, and I want to plot the decision boundary in order to classify them.
Going off the guess of Chris in the comments above, I'm assuming you want to cluster points according to the Gaussian Mixture model, a reasonable method assuming the underlying distribution is a linear combination of Gaussian distributed samples. Below I've shown an example using numpy to create a sample data set, sklearn for its GM modeling and pylab to show the results.
import numpy as np
from pylab import *
from sklearn import mixture
# Create some sample data
def G(mu, cov, pts):
    return np.random.multivariate_normal(mu,cov,pts)
# Three multivariate Gaussians with means and cov listed below
MU = [[5,3], [0,0], [-2,3]]
COV = [[[4,2],[0,1]], [[1,0],[0,1]], [[1,2],[2,1]]]
A = [G(mu,cov,500) for mu,cov in zip(MU,COV)]
PTS = np.concatenate(A) # Join them together
# Use a Gaussian Mixture model to fit
g = mixture.GaussianMixture(n_components=len(A))  # mixture.GMM was removed in scikit-learn 0.20
g.fit(PTS)
# Returns an index list of which cluster they belong to
C = g.predict(PTS)
# Plot the original points
X,Y = map(array, zip(*PTS))
subplot(211)
scatter(X,Y)
# Plot the points and color according to the cluster
subplot(212)
color_mask = ['k','b','g']
for n in range(len(A)):
    idx = (C==n)
    scatter(X[idx],Y[idx],color=color_mask[n])
show()
See the sklearn.mixture example page for more detailed information on the classification methods.