K-means cluster - Plot class proportions in each cluster - python

I am working on a project where I exploit the cluster structure of an unlabeled dataset to improve the performance of a supervised learning classifier. After preprocessing the data, which is stored in a matrix X, I use k-means to cluster it like so:
from sklearn.cluster import KMeans
k = KMeans(n_clusters=40).fit(X)
I have the true labels stored in y. I am interested in seeing how the different classes are clustered, i.e. whether the clusters are relatively pure or mixed.
To do this I want to see the proportion of each class in each cluster. This is a binary classification task: positive instances (represented by a 1 in y) and negative instances (represented by a 0 in y).
(The nth element of the y array is the correct label for the nth row of the X matrix.)

I would use pandas:
import pandas as pd
Combine the true labels and cluster labels into a dataframe:
df = pd.DataFrame({'clusters' : k.labels_, 'labels' : y})
Group by clusters and for each cluster get the fraction of 1's:
df.groupby('clusters').apply(lambda cluster: cluster.sum()/cluster.count())
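A minimal end-to-end sketch of the same idea on toy data (illustrative only; since the labels are 0/1, the per-cluster mean of the 'labels' column is exactly the fraction of positives):
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data standing in for the real X and y
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

k = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
df = pd.DataFrame({'clusters': k.labels_, 'labels': y})

# Fraction of positive (1) instances per cluster
proportions = df.groupby('clusters')['labels'].mean()
print(proportions)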

Related

How to find clusters of geospatial data based on their timestamp rather than position?

I have a set of geospatial data with me that also has the corresponding timestamps in a separate column.
Something like this:
Timestamp    Latitude    Longitude
1            1.56        104.57
2            1.57        105.42
4            1.65        103.32
12           1.76        101.15
14           1.78        100.45
16           1.80        99.65
I want to be able to cluster the data based on their timestamps rather than their distances.
So for the above example, I should obtain 2 clusters: one from the first 3 data points, and one from the remaining 3. I would also like to obtain the range of timestamps for each cluster if possible.
From what I've researched so far, I've only gotten either geospatial distance clustering, or time-series clustering, both of which do not sound like what I need. Are there any recommended algorithms for what I am trying to do?
Density-based spatial clustering of applications with noise, or DBSCAN for short, will be helpful in your case. DBSCAN is a density-based clustering algorithm which groups points based on their closeness to each other.
From what I understood in my quick research, DBSCAN draws a circle around each core point. The circle's radius is called epsilon. All the points within that circle are counted in the same cluster. The larger the epsilon, the more points you will have in your cluster, and vice versa.
There is more to this algorithm, which you can find at this & this links.
Why DBSCAN is good for Timeseries Clustering:
DBSCAN does not require k (number of clusters) as the input
In your case, there might be many clusters of time periods. Trying to fit an elbow curve to find the best number of clusters will be time-consuming & inefficient.
Code:
The below code snippet will do your task,
import pandas as pd
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt
# Getting Data
df = pd.DataFrame({
    'Timestamp': [1, 2, 4, 12, 14, 16, 25, 28, 29],
    'Latitude': [1.56, 1.57, 1.65, 1.76, 1.78, 1.80, 1.83, 1.845, 1.855],
    'Longitude': [104.57, 105.42, 103.32, 101.15, 100.45, 99.65, 100, 100.3, 101.2]})
# Initializing the object
db = DBSCAN(eps=3.0, min_samples=3)
# eps = epsilon value. The larger the epsilon, the more distant points you will catch in a single cluster.
# Ex. eps = 1.0 wasn't capturing the '4' value from [1,2,4] cluster. Increasing the epsilon
# helped in detecting that.
# min_samples = Minimum number of samples you want in your single cluster.
# Fitting the algorithm onto Timestamp column
df['Cluster'] = db.fit_predict(np.array(df['Timestamp']).reshape(-1,1))
print(f"Found {df['Cluster'].nunique()} clusters \n")
print(df)
# Plotting the graph
fig = plt.figure(figsize=(5, 5))
plt.xlabel('Latitude')
plt.ylabel('Longitude')
# Each scatter call gets its own colour from matplotlib's default cycle
for cluster_id, group in df.groupby('Cluster'):
    plt.scatter(group['Latitude'], group['Longitude'], label=f"Cluster {cluster_id}")
plt.legend()
plt.show()
OUTPUT: the clustered DataFrame printed to the console, followed by a scatter plot with one colour per cluster.
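Since you also asked for the range of timestamps in each cluster, a minimal follow-up sketch (assuming the df produced by the code above; DBSCAN labels noise points as -1, so those are excluded here):
# Range of timestamps per cluster
time_ranges = (df[df['Cluster'] != -1]
               .groupby('Cluster')['Timestamp']
               .agg(['min', 'max']))
print(time_ranges)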

Weighted k-means in python

After reading this post here about duplicate values in k-means clustering, I realized I cannot simply use unique points for clustering.
https://stats.stackexchange.com/questions/152808/do-i-need-to-remove-duplicate-objects-for-cluster-analysis-of-objects
I have over 10000000 points, though only 8000 unique ones. Therefore, I initially thought that for speeding it up, I’d use unique points only. Seems like this is a bad idea.
To keep computational time down, this post suggests to add weights to each point. How can this be implemented in python?
Using the KMeans class from the scikit-learn library, clustering is performed here with the number of clusters set to 11.
The array Y contains the sample weights, whereas X contains the actual points that need to be clustered.
import pandas as pd
from sklearn.cluster import KMeans  # For applying KMeans
##--------------------------------------------------------------------------------------------------------##
# Starting k-means clustering
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0, max_iter=1000)
# Running k-means clustering with the 'X' array as the input coordinates
# and the 'Y' array as sample weights
wt_kmeansclus = kmeans.fit(X, sample_weight=Y)
predicted_kmeans = kmeans.predict(X, sample_weight=Y)
# Storing results obtained together with respective city-state labels
kmeans_results = pd.DataFrame({"label": data_label, "kmeans_cluster": predicted_kmeans + 1})
# Printing count of points allotted to each cluster and then the cluster centers
print(kmeans_results.kmeans_cluster.value_counts())
I think the post suggests working with a weighted average.
You can create a new dataset out of the old one, where each point carries an extra attribute: its frequency (i.e. its weight).
Every time you calculate the new centroid for each cluster, take the weighted average of all points of that cluster (instead of calculating the simple mean of all points), as in the sketch below.
PS: Manipulating the dataset is dangerous. I'd parallelize the code if computational cost is a major factor.
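A minimal sketch of that weighted centroid update, assuming arrays X (the unique points), w (their frequencies/weights) and labels (current cluster assignments); all names here are illustrative:
import numpy as np

def weighted_centroids(X, w, labels, n_clusters):
    """Recompute each centroid as the frequency-weighted mean of its points."""
    centroids = np.zeros((n_clusters, X.shape[1]))
    for c in range(n_clusters):
        mask = labels == c
        # Assumes every cluster is non-empty; np.average with weights
        # gives the weighted mean of the cluster's points
        centroids[c] = np.average(X[mask], axis=0, weights=w[mask])
    return centroids
In practice, passing the frequencies as sample_weight to KMeans.fit, as in the snippet above, gives the same effect without re-implementing the update loop.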

Randomly sample x% of each cluster

I am working on a project aiming to exploit the cluster structure of my dataset to improve a supervised active learning classifier for binary classification. I use the following code to cluster my data, X, using scikit-learn's K-Means implementation:
k = KMeans(n_clusters=(i+2), precompute_distances=True, ).fit(X)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())
The two classes are positive (represented by a 1) and negative (represented by a 0) and are stored in an array y.
This code first clusters X and then stores, in a data frame, each cluster's number and the percentage of positive instances within it.
I would now like to randomly select points from each cluster, until I have sampled 15%. How can I do this?
As requested here is a simplified script including a test dataset:
from sklearn.cluster import KMeans
import pandas as pd
X = [[1,2], [2,5], [1,2], [3,3], [1,2], [7,3], [1,1], [2,19], [1,11], [54,3], [78,2], [74,36]]
y = [0,0,0,0,0,0,0,0,0,1,0,0]
k = KMeans(n_clusters=(4), precompute_distances=True, ).fit(X)
df = pd.DataFrame({'cluster' : k.labels_, 'percentage positive' : y})
a = df.groupby('cluster').apply(lambda cluster:cluster.sum()/cluster.count())
print(a)
Note: The real datasets are much larger consisting of thousands of features and thousands of data instances.
In response to #SandipanDey:
I can't tell you too much, but basically we are dealing with a highly unbalanced dataset (1:10,000) and we are only interested in identifying the minority class examples with recall > 95% whilst reducing the number of labels requested. (Recall needs to be so high as it's related to healthcare.)
The minority examples cluster together, and any cluster containing positive instances will usually contain at least x%, so by sampling x% we ensure that we identify all clusters with any positive instances. This lets us quickly reduce the dataset to the part with potential positives. This partial dataset can then be used for active learning. Our approach is loosely inspired by 'Hierarchical Sampling for Active Learning'.
If I understood you correctly, the following code should serve the purpose:
import numpy as np
# For each cluster:
# (1) Find all the points from X that are assigned to the cluster.
# (2) Choose x% of those points randomly.
n_clusters = 4
x = 0.15  # fraction to sample from each cluster
for i in range(n_clusters):
    # (1) indices of all the points from X that belong to cluster i
    C_i = np.where(k.labels_ == i)[0].tolist()
    n_i = len(C_i)  # number of points in cluster i
    # (2) indices of the points from X to be sampled from cluster i
    #     (without replacement, so no point is picked twice)
    sample_i = np.random.choice(C_i, int(x * n_i), replace=False)
    print(i, sample_i)
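A pandas-based alternative sketch (assuming pandas >= 1.1; GroupBy.sample draws without replacement by default). On the toy data above each cluster is tiny, so this is mainly useful on the real, much larger dataset:
# Randomly sample 15% of the rows of each cluster
sampled = df.groupby('cluster').sample(frac=0.15, random_state=0)
print(sampled)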
Just out of curiosity, how are you going to use these x% points for active learning?

Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.
I could use the function score() to compute the log probability under the model.
However, I am looking for a metric called 'purity' which is defined in this article.
How can I implement it in Python? My current implementation looks like this:
import numpy as np
from sklearn.mixture import GMM
# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)
clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)
# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)
But I cannot loop through each cluster in order to compute the confusion matrix (according to this question).
David's answer works but here is another way to do it.
import numpy as np
from sklearn import metrics
def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)
Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".
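A quick usage sketch with toy labels (illustrative values only):
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1])  # cluster assignments
print(purity_score(y_true, y_pred))    # (2 + 3) / 6 = 0.833...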
sklearn doesn't implement a cluster purity metric. You have 2 options:
Implement the measurement using sklearn data structures yourself. This and this have some python source for measuring purity, but either your data or the function bodies need to be adapted for compatibility with each other.
Use the (much less mature) PML library, which does implement cluster purity.
A very late contribution.
You can try to implement it like this, pretty much like in this gist
import numpy as np
from sklearn.metrics import accuracy_score

def purity_score(y_true, y_pred):
    """Purity score
    Args:
        y_true(np.ndarray): n*1 matrix Ground truth labels
        y_pred(np.ndarray): n*1 matrix Predicted clusters
    Returns:
        float: Purity score
    """
    # Work on a copy so the caller's array is not modified in place
    y_true = np.asarray(y_true).copy()
    # matrix which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape)
    # Ordering labels
    ## Labels might be missing, e.g. with a set like 0,2 where 1 is missing
    ## First find the unique labels, then map the labels to an ordered set
    ## 0,2 should become 0,1
    labels = np.unique(y_true)
    ordered_labels = np.arange(labels.shape[0])
    for k in range(labels.shape[0]):
        y_true[y_true == labels[k]] = ordered_labels[k]
    # Update unique labels
    labels = np.unique(y_true)
    # Use the class labels plus one extra edge as histogram bin edges,
    # so each class gets its own bin [bin_i, bin_i+1)
    bins = np.concatenate((labels, [np.max(labels) + 1]), axis=0)
    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred == cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred == cluster] = winner
    return accuracy_score(y_true, y_voted_labels)
The currently top voted answer correctly implements the purity metric, but may not be the most appropriate metric in all cases, because it does not ensure that each predicted cluster label is assigned only once to a true label.
For example, consider a dataset that is very imbalanced, with 99 examples of one label and 1 example of another label. Then any clustering (e.g: having two equal clusters of size 50) will achieve purity of at least 0.99, rendering it a useless metric.
Instead, in cases where the number of clusters is the same as the number of labels, cluster accuracy may be more appropriate. This has the advantage of mirroring classification accuracy in an unsupervised setting. To compute cluster accuracy, we need to use the Hungarian algorithm to find the optimal matching between cluster labels and true labels. The SciPy function linear_sum_assignment does this:
import numpy as np
from sklearn import metrics
from scipy.optimize import linear_sum_assignment
def cluster_accuracy(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # Find optimal one-to-one mapping between cluster labels and true labels
    row_ind, col_ind = linear_sum_assignment(-contingency_matrix)
    # Return cluster accuracy
    return contingency_matrix[row_ind, col_ind].sum() / np.sum(contingency_matrix)
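A quick usage sketch with toy labels (illustrative values only); note that the arbitrary numbering of the clusters does not hurt the score:
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 1, 0, 0, 2])    # cluster ids are arbitrary
print(cluster_accuracy(y_true, y_pred))  # 5 / 6 = 0.833...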

How to identify Cluster labels in kmeans scikit learn

I am learning python scikit.
The example given here
displays the top occurring words in each Cluster and not Cluster name.
http://scikit-learn.org/stable/auto_examples/document_clustering.html
I found that the km object has km.labels_, which lists the centroid id (a number) for each sample.
I have two questions:
1. How do I generate the cluster labels?
2. How to identify the members of the clusters for further processing.
I have a working knowledge of k-means and am aware of tf-idf concepts.
How do I generate the cluster labels?
I'm not sure what you mean by this. You have no cluster labels other than cluster 1, cluster 2, ..., cluster n. That is why it's called unsupervised learning, because there are no labels.
Do you mean you actually have labels and you want to see if the clustering algorithm happened to cluster the data according to your labels?
In that case, the documentation you linked to provides an example:
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels, km.labels_))
print("Completeness: %0.3f" % metrics.completeness_score(labels, km.labels_))
print("V-measure: %0.3f" % metrics.v_measure_score(labels, km.labels_))
How to identify the members of the clusters for further processing.
See the documentation for KMeans. In particular, the predict method:
predict(X)
Parameters:
X : {array-like, sparse matrix}, shape = [n_samples, n_features] New data to predict.
Returns:
labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
If you don't want to predict something new, km.labels_ should do that for the training data.
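For question 2, a minimal sketch of pulling out the members of a given cluster from the training data (assuming the fitted km object and the feature matrix X from the linked example):
import numpy as np

cluster_id = 0
# Indices of all training samples assigned to this cluster
member_idx = np.where(km.labels_ == cluster_id)[0]
# Rows of X (i.e. the corresponding documents) belonging to this cluster
cluster_members = X[member_idx]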
Oh that's easy
My environment:
scikit-learn version '0.20.0'
Just use the attribute .labels_ as in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
from sklearn.cluster import KMeans
import numpy as np
Working example:
x1 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]
x2 = [[1],[1],[2],[2],[2],[3],[3],[7],[7],[7]]
X_2D = np.concatenate((x1,x2),axis=1)
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
kmeans.fit(X_2D)
print(kmeans.labels_)
Output:
[2 2 3 3 3 0 0 1 1 1]
So as you can see, we have 4 clusters, and each data example in the X_2D array is assigned a label accordingly.
