I am learning Python's scikit-learn.
The example given here displays the top occurring words in each cluster, but not a cluster name:
http://scikit-learn.org/stable/auto_examples/document_clustering.html
I found that the km object has km.labels_, which lists the centroid id for each sample (a number).
I have two questions:
1. How do I generate the cluster labels?
2. How do I identify the members of each cluster for further processing?
I have a working knowledge of k-means and am aware of tf-idf concepts.
How do I generate the cluster labels?
I'm not sure what you mean by this. You have no cluster labels other than cluster 1, cluster 2, ..., cluster n. That is why it's called unsupervised learning, because there are no labels.
Do you mean you actually have labels and you want to see if the clustering algorithm happened to cluster the data according to your labels?
In that case, the documentation you linked to provides an example:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
How do I identify the members of each cluster for further processing?
See the documentation for KMeans. In particular, the predict method:
predict(X)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
Returns:
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
If you don't need to predict on new data, km.labels_ gives the cluster assignment for each training sample.
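For question 2, a minimal sketch of what "identifying the members" could look like, assuming km is the fitted KMeans object and X is the tf-idf matrix from the linked example:
import numpy as np

# km.labels_ holds one cluster index per training sample, so the members of
# cluster i are simply the rows of X whose label equals i.
for cluster_id in np.unique(km.labels_):
    member_indices = np.where(km.labels_ == cluster_id)[0]
    print("Cluster %d: %d documents, e.g. rows %s"
          % (cluster_id, len(member_indices), member_indices[:5]))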
Oh, that's easy.
My environment:
scikit-learn version '0.20.0'
Just use the attribute .labels_ as in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Working example:
from sklearn.cluster import KMeans
import numpy as np

x1 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]
x2 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]
X_2D = np.concatenate((x1, x2), axis=1)

kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(X_2D)
print(kmeans.labels_)
Output:
[2 2 3 3 3 0 0 1 1 1]
So as you can see, we have 4 clusters, and each data example in the X_2D array is assigned a label accordingly.
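If you also need the coordinates that each label corresponds to (not asked above, just a small extra), the fitted object exposes them:
print(kmeans.cluster_centers_)  # one (x1, x2) center per cluster, 4 rows here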
I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using k-means, although a few mention that k-means might be overkill for clustering 1D arrays.
However, k-means is a supervised learning algorithm, which means I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking for some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy, and how could I implement it for my problem?
Try MeanShift. From the sklearn user guide for MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
Output:
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
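To then pull out the cluster with the highest values (the actual goal in the question), you can mask on the label of the maximum; this is a small addition, not part of the original demo:
best_label = labels[X.argmax()]            # label of the largest value in X
best_cluster = X[labels == best_label].ravel()
print(best_cluster)                        # [200 297 275 243], in the original order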
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just compute the differences of consecutive elements, i.e. look at x[i] - x[i-1].
Choose the k largest differences as split points, or define a threshold on when to split, e.g. 20. That depends on your knowledge of the data.
This is O(n), much faster than all the others mentioned. Also very understandable and predictable.
On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
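A minimal sketch of that idea on the y array from the question, using the "k largest differences" variant with k = 3 (variable names are just for illustration):
import numpy as np

y = np.array([1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243])
y_sorted = np.sort(y)

# Split at the k-1 largest gaps between consecutive sorted values.
k = 3
gaps = np.diff(y_sorted)
split_points = np.sort(np.argsort(gaps)[-(k - 1):]) + 1
clusters = np.split(y_sorted, split_points)

print(clusters)      # [1 ... 9], [54 56 57 60], [200 243 275 297]
print(clusters[-1])  # the "highest" cluster: [200 243 275 297]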
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest 3, since a cluster of 2 is lame, and for the metric the default euclidean works great, so you don't even need to specify it.
Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.
To put it all together, and assuming that by "cluster with the highest scores" you mean the cluster that includes the max value, we get:
from hdbscan import HDBSCAN
import numpy as np
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.
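If you want every member of that cluster rather than just HDBSCAN's exemplar points (exemplars_ is not guaranteed to hold all members), a plain boolean mask over the labels works as well; this is a small variation on the code above, not part of the original answer:
best_label = cluster_labels[y.argmax()]
best_cluster = y[cluster_labels == best_label].ravel()
print(best_cluster)  # [200 297 275 243], original order preserved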
I am working on a project where I exploit the cluster structure of an unlabeled dataset to improve the performance of a supervised learning classifier. After preprocessing the data, stored in a matrix X, I use k-means to cluster it like so:
from sklearn.cluster import KMeans
k = KMeans(n_clusters=40).fit(X)
I have the desired labels stored in y. I am interested in seeing how the different classes are clustered, i.e. whether the clusters are relatively pure or mixed.
To do this I want to see the proportions of each class in each cluster. This is a binary classification task: positive instances (represented by a 1 in y) and negative instances (represented by a 0 in y).
(The nth element of the y array is the correct label for the nth row of the X matrix.)
I would use pandas:
import pandas as pd
Combine the true labels and cluster labels into a dataframe:
df = pd.DataFrame({'clusters' : k.labels_, 'labels' : y})
Group by clusters and for each cluster get the fraction of 1's:
df.groupby('clusters').apply(lambda cluster: cluster.sum()/cluster.count())
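Since the labels are 0/1, the per-cluster mean is the same fraction of positives, so an equivalent shortcut is df.groupby('clusters')['labels'].mean(). A toy illustration (made-up labels, just to show the shape of the result):
import pandas as pd

df = pd.DataFrame({'clusters': [0, 0, 0, 1, 1, 1],
                   'labels':   [1, 1, 0, 0, 0, 1]})
print(df.groupby('clusters')['labels'].mean())
# clusters
# 0    0.666667
# 1    0.333333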
I am working on a project aiming to exploit the cluster structure of my dataset to improve a supervised active learning classifier for binary classification. I use the following code to cluster my data, X, using scikit-learn's K-Means implementation:
k = KMeans(n_clusters=(i+2), precompute_distances=True).fit(X)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())
The two classes are positive (represented by a 1) and negative (represented by a 0) and are stored in an array y.
This code first clusters X and then stores in a data frame each cluster's number and the percentage of positive instances within it.
I would now like to randomly select points from each cluster, until I have sampled 15%. How can I do this?
As requested here is a simplified script including a test dataset:
from sklearn.cluster import KMeans
import pandas as pd
X = [[1,2], [2,5], [1,2], [3,3], [1,2], [7,3], [1,1], [2,19], [1,11], [54,3], [78,2], [74,36]]
y = [0,0,0,0,0,0,0,0,0,1,0,0]
k = KMeans(n_clusters=4, precompute_distances=True).fit(X)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())
print(a)
Note: The real datasets are much larger consisting of thousands of features and thousands of data instances.
In response to @SandipanDey:
I can't tell you too much, but basically we are dealing with a highly unbalanced dataset (1:10,000) and we are only interested in identifying the minority class examples with recall > 95%, whilst reducing the number of labels requested. (Recall needs to be so high as it's related to healthcare.)
The minority examples cluster together, and any cluster containing a positive instance will usually contain at least x%, so by sampling x% we ensure that we identify all clusters with any positive instances. In this way we are able to quickly reduce the size of the dataset with potential positives. This partial dataset can then be used for active learning. Our approach is loosely inspired by 'Hierarchical Sampling for Active Learning'.
If I understood you correctly, the following code should serve the purpose:
import numpy as np
# For each cluster
# (1) Find all the points from X that are assigned to the cluster.
# (2) Choose x% from those points randomly.
n_clusters = 4
x = 0.15 # percentage
for i in range(n_clusters):
    # (1) indices of all the points from X that belong to cluster i
    C_i = np.where(k.labels_ == i)[0].tolist()
    n_i = len(C_i)  # number of points in cluster i
    # (2) indices of the points from X to be sampled from cluster i
    # (pass replace=False if the same index should not be drawn twice)
    sample_i = np.random.choice(C_i, int(x * n_i))
    print(i, sample_i)
Just out of curiosity, how are you going to use these x% points for active learning?
I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
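A quick sanity check with made-up labels (reusing the imports and purity_score from above):
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])  # one sample from class 0 ends up in cluster 1
print(purity_score(y_true, y_pred))    # (2 + 3) / 6 = 0.833...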
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist:
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true (np.ndarray): n*1 matrix of ground truth labels
        y_pred (np.ndarray): n*1 matrix of predicted clusters
    Returns:
        float: Purity score
    """
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing, e.g. with a set like 0,2 where 1 is missing.
    ## First find the unique labels, then map the labels to an ordered set:
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Set the bin edges so that each class gets its own bin,
    # the upper edge being excluded: [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner

    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g. two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
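To make the imbalance point concrete, here is a small synthetic check (reusing the imports and cluster_accuracy defined above): with 99 samples of one class and 1 of the other, split into two clusters of 50, purity comes out at 0.99 while cluster accuracy drops to 0.51.
y_true = np.array([0] * 99 + [1])
y_pred = np.array([0] * 50 + [1] * 50)

cm = metrics.cluster.contingency_matrix(y_true, y_pred)
print(np.sum(np.amax(cm, axis=0)) / np.sum(cm))  # purity: 0.99
print(cluster_accuracy(y_true, y_pred))          # cluster accuracy: 0.51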
I am clustering textual data using K-Means in Python (scikit-learn).
How do I get the cluster to which the line belongs?
Example :
data=["Red , Yellow and Blue are colours","Icecream is my favourite food","You can now get icecream in strawberry flavour too","Sky is blue"]
After performing K-Means with n_clusters=2, I expect two clusters to be formed s.t.
"Red , Yellow and Blue are colours","Sky is blue" lie in one cluster and "Icecream is my favourite food","You can now get icecream in strawberry flavour too" lie in the other.
How do I get this, i.e. which line is in which cluster?
Code for K-means :
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

true_k = 2  # the number of clusters expected above

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data)
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=500, n_init=20)
model.fit(X)
Try using the predict function.
Example -
model.predict(X)
From documentation -
predict(X)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to predict.
Returns:
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
This returns an array with the index of the cluster that each sample belongs to.
You can also use the fit_predict() function.
You can get the cluster centers from the cluster_centers_ attribute (model.cluster_centers_ in your case) and the label for each sample from model.labels_.
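For the concrete goal in the question (seeing which line is in which cluster), model.labels_ lines up with the original data list, so a small sketch continuing from the code in the question:
for cluster_id in range(true_k):
    print("Cluster %d:" % cluster_id)
    for sentence, label in zip(data, model.labels_):
        if label == cluster_id:
            print("   ", sentence)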