Measuring dissimilarity within the cluster - Kmodes - python

I am clustering categorical data. I came across the k-modes algorithm and found it to be perfect for my requirements. Now I want to measure the dissimilarity within each cluster, for all the clusters, and reduce it as much as possible. Is there any way to do that?
Alternatively, is there any way to check how well my data has been clustered?
Since my data is categorical, approaches that rely on a numeric distance as the metric might not be helpful.

To measure the dissimilarity within a cluster you need to come up with some kind of a metric. For categorical data, one of the possible ways of calculating dissimilarity could be the following:
d(i, j) = (p - m) / p
where:
p is the number of categorical features (attributes) in your data
m is the number of matches you have between samples i and j
For example, if your data has 3 categorical features and the samples, i and j are as follows:
     Feature1  Feature2  Feature3
i    x         y         z
j    x         w         z
Here we have 3 categorical features, so p=3, and two of these three features have the same value in samples i and j, so m=2. Therefore
d(i,j) = (3 - 2) / 3
d(i,j) = 0.33
Another alternative is to convert your categorical variables to one-hot-encoded features and then compute the Jaccard similarity.
So, in order to measure the dissimilarity within a cluster you could calculate pairwise dissimilarity between each object in your cluster and then take the average of that.
Based on these measures you may also use the silhouette score to evaluate the quality of your clustering (but take it with a grain of salt: sometimes the score can be good while the clustering is not what you expected).
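A minimal sketch of the matching dissimilarity and the within-cluster average described above, in plain Python. The function names and the toy `clusters` dictionary are illustrative assumptions; they are not part of the kmodes package.

```python
from itertools import combinations


def matching_dissimilarity(a, b):
    """Return (p - m) / p for two samples with p categorical features."""
    p = len(a)
    m = sum(x == y for x, y in zip(a, b))  # number of matching features
    return (p - m) / p


def mean_within_cluster_dissimilarity(samples):
    """Average the dissimilarity over all pairs of samples in one cluster."""
    pairs = list(combinations(samples, 2))
    if not pairs:  # a cluster with a single sample has no pairs
        return 0.0
    return sum(matching_dissimilarity(a, b) for a, b in pairs) / len(pairs)


# The two samples from the example above
i = ("x", "y", "z")
j = ("x", "w", "z")
d_ij = matching_dissimilarity(i, j)  # (3 - 2) / 3

# Hypothetical clustering result: cluster label -> samples in that cluster
clusters = {
    0: [("x", "y", "z"), ("x", "w", "z"), ("x", "y", "z")],
    1: [("a", "b", "c"), ("a", "b", "d")],
}
within = {
    label: mean_within_cluster_dissimilarity(samples)
    for label, samples in clusters.items()
}
print(d_ij, within)
```

A lower value in `within` means the samples of that cluster agree on more features. The pairwise loop is quadratic in the cluster size, so for large clusters it may be worth averaging over a random subset of pairs instead.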

Related

Cosine Similarity normalization

I am trying to compare two 13-D vectors using the cosine similarity but want all of the column entries/features to have equal weighting. Right now, I have 3 features with much larger values that appear to be too heavily weighted in my comparison results. Is there any easy way to normalize the different features so that they are on a similar scale? I am doing this in Python.
The usual approach is to first center each feature x: x = x - np.mean(x). This places your frame of reference at the center of the data, so that you "look at the points from closer".
Then scale each feature: x = x / np.sqrt(np.mean(x**2)). This normalizes the features to a common scale and makes the points more evenly distributed over all possible directions in the feature space.
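A small sketch of the two steps above with NumPy. It assumes the vectors to compare are rows of a matrix `X` with one column per feature, so that a mean and a scale can be estimated per feature; the function names and the toy data are made up for illustration.

```python
import numpy as np


def standardize_features(X):
    """Center each column, then divide it by its root-mean-square value."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    rms = np.sqrt((centered ** 2).mean(axis=0))
    rms[rms == 0] = 1.0  # leave constant features at zero instead of dividing by 0
    return centered / rms


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy data: the second feature is on a much larger scale than the others
X = np.array([
    [1.0, 1000.0, 0.5],
    [2.0, 3000.0, 0.1],
    [3.0, 2000.0, 0.9],
    [4.0, 5000.0, 0.3],
])
Z = standardize_features(X)

print("raw:   ", cosine_similarity(X[0], X[1]))
print("scaled:", cosine_similarity(Z[0], Z[1]))
```

On the raw data the large-valued column dominates the dot product; after standardization every feature has zero mean and unit mean square, so each contributes on the same scale.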

Unsupervised learning clustering 1D array

I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using kmeans, although a few others mention that kmeans might be overkill for clustering 1D arrays.
However, kmeans is a supervised learning algorithm, which means that I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In array y I would see 3 clusters as so [1,2,4,7,9,5,4,7,9],[56,57,54,60],[200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy and how could I implement it for my problem?
Try MeanShift. From the sklearn user guide for MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
Output:
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, K-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not mean it is supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just sort the data and compute the differences of subsequent elements, i.e. look at x[i]-x[i-1].
Choose the k largest differences as split points, or define a threshold for when to split, e.g. 20. That depends on your knowledge of the data.
Once the data is sorted this is O(n), much faster than all the others mentioned. It is also very understandable and predictable.
On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
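A minimal sketch of the gap-based split in plain Python, using the array from the question. The function name and the choice of `n_splits=2` (giving the three clusters the question expects) are assumptions for illustration; the values are sorted first because the array in the question is not ordered.

```python
def split_by_largest_gaps(values, n_splits):
    """Sort the values and cut them at the n_splits largest gaps."""
    ordered = sorted(values)
    # gaps[i] is the difference between ordered[i + 1] and ordered[i]
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    # Positions of the n_splits largest gaps, put back in ascending order
    largest = sorted(range(len(gaps)), key=gaps.__getitem__, reverse=True)[:n_splits]
    clusters, start = [], 0
    for i in sorted(largest):
        clusters.append(ordered[start:i + 1])
        start = i + 1
    clusters.append(ordered[start:])
    return clusters


y = [1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243]
clusters = split_by_largest_gaps(y, n_splits=2)
best_cluster = clusters[-1]  # the group containing the largest values
print(clusters)
print(best_cluster)
```

With a fixed threshold instead of a fixed number of splits, the comprehension over `gaps` would simply keep every position whose gap exceeds the threshold. Note that a threshold of 20 would also split the top group of this particular array, since the gaps inside it are larger than 20.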
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest using 3 since a cluster of 2 is lame and for metric the default euclidean works great so you don't even need to mention it.
Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.
To put it all together and assuming by "cluster with the highest scores" you mean the cluster that includes the max value we get:
from hdbscan import HDBSCAN
import numpy as np
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.

Distance metric for n binary vectors

I have n and m binary vectors (of length 1500) from sets A and B respectively.
I need a metric that can say how similar (kind of distance metric) all those n vectors and m vectors are.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means Set B has more similar vectors than Set A.
Which metric should I use? I thought of Jaccard similarity. But I am not able to put it in this context. Should I find the distance of each vector with each other to find the total distance or something else ?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It can be a Euclidean distance (not suitable for a 1500-dimensional space), a cosine (dot product based) distance, or a Hamming distance (assuming your input features are completely independent, which is rarely the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which probably is not what you want. If n >> m, almost certainly the total sum of distances for n vectors is more than the total distance for m vectors.
What you're looking for is most probably an average of the distances between the members of your sets. Then, depending on whether you want your average to be sensitive to outliers or not, you can go for the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
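As a rough sketch of the scoring idea above, the following computes the average pairwise distance inside each set, so that the result does not depend on the set sizes. It uses the normalized Hamming distance (fraction of differing positions) purely as an example; the function name and the tiny 4-bit sets are assumptions for illustration.

```python
import numpy as np


def mean_pairwise_hamming(vectors):
    """Average fraction of differing positions over all pairs in one set."""
    X = np.asarray(vectors, dtype=bool)
    n = X.shape[0]
    # diff[i, j] is the fraction of positions where vectors i and j differ
    diff = (X[:, None, :] != X[None, :, :]).mean(axis=2)
    upper = np.triu_indices(n, k=1)  # each unordered pair once, no self-pairs
    return float(diff[upper].mean())


# Toy sets of 4-bit vectors; the real vectors would have length 1500
set_a = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]]
set_b = [[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 1]]

score_a = mean_pairwise_hamming(set_a)
score_b = mean_pairwise_hamming(set_b)
print(score_a, score_b)
if score_a > score_b:
    print("Set B has the more similar vectors")
```

The broadcasted comparison builds an n x n x length array, which is fine for a few hundred vectors; for larger sets a chunked loop or a library routine for condensed pairwise distances would be more memory-friendly.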

Evaluating vector distance measures

I am working with vectors of word frequencies and trying out some of the different distance measures available in scikit-learn's pairwise distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances by running pairwise distances over the same dataset with the distance metrics available in SciPy (e.g. Euclidean, Cityblock, etc.) and, for each metric:
1. convert the distances computed for the dataset to z-scores to normalize across metrics
2. get the range of these z-scores, i.e. the spread of the distances
3. use the distance metric that gives me the widest range of distances, as it apparently gives me the maximum spread over my dataset and the most variance to work with (cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal), but they don't apply 100% here...
Any help is much appreciated!
My code:
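import numpy as np
import scipy.stats.mstats
import sklearn.metrics.pairwise

matrix = np.random.uniform(0, .1, size=(10, 300))  # test data set
scipy_distances = ['euclidean', 'minkowski', ...]  # these are the distance metrics (list elided in the question)
for d in scipy_distances:  # iterate over distances
    distmatrix = sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
    # z-score each column of the distance matrix to normalize across metrics
    distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
    diststats = basicstatsmaker(distzscores)  # helper not shown in the question
    zrange = np.ptp(distzscores, axis=0)  # renamed from `range` to avoid shadowing the builtin
    print("range of metric", d, np.ptp(zrange))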
In general, this is just a heuristic, which might or might not work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out:
class Dummy_dist:
    def __init__(self):
        self.cheat = True

    def __call__(self, x, y):
        if self.cheat:
            self.cheat = False
            return 1e60
        else:
            return 0

dummy_dist = Dummy_dist()
This will give you a huge spread (even with z-score normalization). Of course this is a cheating example, as it is non-deterministic, but I wanted to show the basic counterexample; given your data, one can construct a deterministic analogue.
So what should you do? Treat your metric as a hyperparameter of your process. Do not split the process of generating your clustering/classification into two separate phases (first choosing a distance, then learning something); do it jointly. Consider each clustering/classification + distance pair as a single model: instead of working with k-means, you work with k-means+Euclidean, k-means+Minkowski and so on. This is the only statistically supported approach. You cannot construct a method for assessing the "general goodness" of a metric, as there is no such thing; metric quality can only be assessed in a particular task, which involves fixing every other element (the clustering/classification method, the particular dataset, etc.). Once you perform such a wide, exhaustive evaluation, checking many such pairs on many datasets, you might claim that a given metric performs best in that range of tasks.
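One way to read "metric as a hyperparameter" in practice is to let a cross-validated search pick it together with the model. The sketch below does this for a k-nearest-neighbours classifier in scikit-learn; the classifier, the synthetic data from `make_classification` and the three candidate metrics are assumptions chosen for illustration, not something prescribed by the answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real labelled dataset (assumption)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Candidate metrics, each evaluated as part of the full model
metrics = ["euclidean", "manhattan", "chebyshev"]
search = GridSearchCV(
    KNeighborsClassifier(n_neighbors=5),
    param_grid={"metric": metrics},
    cv=5,
)
search.fit(X, y)

# Cross-validated accuracy of each (classifier, metric) pair on this task
results = search.cv_results_
for metric, score in zip(results["param_metric"], results["mean_test_score"]):
    print(metric, round(float(score), 3))
print("best metric for this task:", search.best_params_["metric"])
```

The winner here only says which metric works best for this classifier on this dataset, which is exactly the point made above: the ranking can change as soon as the model or the data changes.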

Use K-means to learn features in Python

Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully extract k centroids from the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (doesn't seem like this to me) or do I need to combine them with the input data again?
Because of some answers: K-means is not "just" a method for clustering, instead it's a vector quantization method. That said the goal of k-means is to describe a dataset with a reduced number of feature vectors. Therefore there are big analogies to methods like Sparse Filtering/ Learning regarding the potential outcome.
Code Example
# Perform K-means, data already pre-processed
centroids = k_means(matrix_pca_whitened,1000)
# Assign data to centroid
idx,_ = vq(song_matrix_pca,centroids)
The clusters produced by the K-means algorithm separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn doc if you are unsure, and at the map to make sure you choose the right algorithm.
This is sort of a circular question: "understand" requires knowing something about the features outside of the k-means process. All that k-means does is identify k groups of physical proximity. It says "there are clumps of stuff in these 'k' places, and here's how all the points choose the nearest one."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization we look up which observation belongs to which cluster and therefore is best described by the feature vector (centroid).
If an observation was split into, say, 10 patches beforehand, it can be described by at most 10 feature vectors.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We could then create a vector for each of the 20 observations with the length 10 (=k) and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9 the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/features 5 and 9. You could also use the distance between patch and centroid instead of this hard assignment.
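The encoding from the example can be sketched with NumPy as follows. The centroids here are ten made-up 2-D points rather than the output of a real k-means run, and `encode_observation` is a hypothetical helper name; the patches are chosen so that they fall nearest to the 5th and 9th centroid, as in the example.

```python
import numpy as np


def encode_observation(patches, centroids):
    """Return a k-length indicator vector and the patch-to-centroid distances."""
    patches = np.asarray(patches, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    # Squared Euclidean distance of every patch to every centroid: (n_patches, k)
    d2 = ((patches[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # index of the closest centroid per patch
    hard = np.zeros(len(centroids))
    hard[nearest] = 1.0  # mark every centroid that describes some patch
    return hard, np.sqrt(d2)


# Stand-in centroids for k = 10 (assumption): points (0, 0), (1, 0), ..., (9, 0)
centroids = np.column_stack([np.arange(10.0), np.zeros(10)])

# One observation made of 2 patches
patches = [[4.1, 0.0], [7.9, 0.1]]

hard, distances = encode_observation(patches, centroids)
print(hard)             # hard assignment over the 10 centroids
print(distances.shape)  # soft alternative: one distance per patch and centroid
```

The `distances` array is the soft alternative mentioned above: instead of a 0/1 entry per centroid, each patch keeps its distance to all k centroids, which can be pooled (for example by taking the minimum per centroid) into a feature vector of length k.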
