Evaluating vector distance measures - python

I am working with vectors of word frequencies and trying out some of the different distance measures available in Scikit Learns Pairwise Distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances by running pairwise differences over the same dataset with the distance metrics available in Scipy (e.g. Euclidean, Cityblock, etc.) and for each metric
convert distances computed for the dataset to zscores to normalize across metrics
get the range of these zscores, i.e. the spread of the distances
use the distance metric that gives me the widest range of distances as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal, but they don't apply 100 % here...)
Any help is much appreciated!
My code:
matrix=np.random.uniform(0, .1, size=(10,300)) #test data set
scipy_distances=['euclidean', 'minkowski', ...] #these are the distance metrics
for d in scipy_distances: #iterate over distances
distmatrix=sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
diststats=basicstatsmaker(distzscores)
range=np.ptp(distzscores, axis=0)
print "range of metric", d, np.ptp(range)

In general - this is just a heuristic, which might, or not - work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out
class Dummy_dist:
def __init__(self):
self.cheat = True
def __call__(self, x, y):
if self.cheat:
self.cheat = False
return 1e60
else:
return 0
dummy_dist = Dummy_dist()
This will give you huuuuge spread (even with z-score normalization). Of course this is a cheating example as this is non determinsitic, but I wanted to show the basic counterexample, and of course given your data one can construct a deterministic analogon.
So what you should do? Your metric should be treated as hyperparameter of your process. You should not divide process of generating your clustering/classification into two separate phases: choosing a distance and then learning something; but you should do this jointly, consider your clustering/classification + distance pairs as a single model, thus instead of working with k-means, you will work with k-means+euclidean, k-means+minkowsky and so on. This is the only statistically supported approach. You cannot construct a method of assessing "general goodness" of the metric, as there is no such object, metric quality can be only assessed in a particular task, which involves fixing every other element (such as a clustering/classification method, particular dataset etc.). Once you perform such wide, exhaustive evaluation, check many such pairs, on many datasets, you might claim that given metric performes best in such range of tasks.

Related

How to find appropriate clustering algorithm to cluster my data? [duplicate]

I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 for good, 0 for bad and blank if the annotator has no choice.
I want to cluster similar users based on their reviews with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used cosine similarity measure with k-means clustering. The csv file is shown below:
UserID M1 M2 M3 ............... M200
user1 1 0 0
user2 0 1 1
user3 1 1 1
.
.
.
.
user100 1 0 1
The problem i am facing is that i don't know exactly how to find most optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that but i want to know the most stable or optimal number of clusters for this dataset.
I will appreciate some help..
Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is not a straightforward approach to determine the "best" model among a set of models that were trained on a certain dataset.
Nonetheless, there are some quantitative measures. Most of them are based on the concept of "how much are the points in a certain cluster more similar between themself than with the points in different clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation. Take a look at all the techniques that do not require labels_true (i.e. at all the unsupervised techniques).
Once you have a quantitative measure about the "goodness" of a certain clustering, you usually observe how this quantity evolves while changing the number of clusters; this approach is called Elbow Method.
Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.
seed_random = 1
fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
#Perform clustering.
kmeans = KMeans(n_clusters=n_clusters,
random_state=seed_random,
)
labels_clusters = kmeans.fit_predict(X)
#Insert fitted model and calculated cluster labels in dictionaries,
#for further reference.
fitted_kmeans[n_clusters] = kmeans
labels_kmeans[n_clusters] = labels_clusters
#Calculate various scores, and save them for further reference.
silhouette = silhouette_score(X, labels_clusters)
ch = calinski_harabasz_score(X, labels_clusters)
db = davies_bouldin_score(X, labels_clusters)
tmp_scores = {"n_clusters": n_clusters,
"silhouette_score": silhouette,
"calinski_harabasz_score": ch,
"davies_bouldin_score": db,
}
df_scores.append(tmp_scores)
#Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)
This code assumes that all your numerical features are in a DataFrame X.
All clustering performance metrics are stored in df_scores DataFrame.
You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().
It's pretty common to start with visualizing the data. Sometimes it is obvious graphically, that there are N classes/clusters. Other times you may be able to see if it's <5, <10, or <100 classes. It depends on your data really.
Another common approach is to use the Bayesian Information Criterium (BIC) or the Akaike Information Criterium (AIC).
The main takeaway is that a lot of classification-problems can yield optimal results if e.g. you have as many classes as you have inputs: every input fits perfectly in its own cluster.
BIC/AIC penalizes a high-dimensional solution, from the insight that simpler models are often better/more stable. I.e. they generalize better and overfit less.
From wikipedia:
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.
You can use the Gini index as a metric, and then do a Grid Search based on this metric. Tell me if you have any other question.
You could use the elbow method.
The base meaning of K-Means is to cluster the data points such that the total "within-cluster sum of squares (a.k.a WSS)" is minimized. Hence you can vary the k from 2 to n, while also calculating its WSS at each point; plot the graph and the curve. Find the location of the bend and that can be considered as an optimal number of clusters !

Distance metric for n binary vectors

I have n and m binary vectors(of length 1500) from set A and B respectively.
I need a metric that can say how similar (kind of distance metric) all those n vectors and m vectors are.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means Set B have more similar vectors than Set A.
Which metric should I use? I thought of Jaccard similarity. But I am not able to put it in this context. Should I find the distance of each vector with each other to find the total distance or something else ?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It can be an Euclidean distance (not suitable for a 1500 dimensional space), a cosine (dot product based) distance, or a Hamiltonian distance (assuming your input features are completely independent, which rarely is the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which probably is not what you want. If n >> m, almost certainly the total sum of distances for n vectors is more than the total distance for m vectors.
What you're looking for is most probably an average of the distances between the members of your sets. Then, depending on weather you want your average to be sensitive to outliers or not, you can go for average of the distances or average of squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.

Use K-means to learn features in Python

Question
I implemented a K-Means algorithm in Python. First I apply PCA and whitening to the input data. Then I use k-means to successfully subtract k centroids out of the data.
How can I use those centroids to understand the "features" learnt? Are the centroids already the features (doesn't seem like this to me) or do I need to combine them with the input data again?
Because of some answers: K-means is not "just" a method for clustering, instead it's a vector quantization method. That said the goal of k-means is to describe a dataset with a reduced number of feature vectors. Therefore there are big analogies to methods like Sparse Filtering/ Learning regarding the potential outcome.
Code Example
# Perform K-means, data already pre-processed
centroids = k_means(matrix_pca_whitened,1000)
# Assign data to centroid
idx,_ = vq(song_matrix_pca,centroids)
The clusters produced by the K-mean algorithms separate your input space into K regions. When you have new data, you can tell which region it belongs to, and thus classify it.
The centroids are just a property of these clusters.
You can have a look at the scikit-learn doc if you are unsure, and at the map to make sure you choose the right algorithm.
This is sort of a circular question: "understand" requires knowing something about the features outside of the k-means process. All that k-means does is to identify k groups of physical proximity. It says "there are clumps of stuff in these 'k' places, and here's how the all the points choose the nearest."
What this means in terms of the features is up to the data scientist, rather than any deeper meaning that k-means can ascribe. The variance of each group may tell you a little about how tightly those points are clustered. Do remember that k-means also chooses starting points at random; an unfortunate choice can easily give a sub-optimal description of the space.
A centroid is basically the "mean" of the cluster. If you can ascribe some deeper understanding from the distribution of centroids, great -- but that depends on the data and features, rather than any significant meaning devolving from k-means.
Is that the level of answer you need?
The centroids are in fact the features learnt. Since k-means is a method of vector quantization we look up which observation belongs to which cluster and therefore is best described by the feature vector (centroid).
By having one observation e.g. separated into 10 patches before, the observation might consist of 10 feature vectors max.
Example:
Method: K-means with k=10
Dataset: 20 observations divided into 2 patches each = 40 data vectors
We now perform K-means on this patched dataset and get the nearest centroid per patch. We could then create a vector for each of the 20 observations with the length 10 (=k) and if patch 1 belongs to centroid 5 and patch 2 belongs to centroid 9 the vector could look like: 0 - 0 - 0 - 0 - 1 - 0 - 0 - 0 - 1 - 0.
This means that this observation consists of the centroids/ features 5 and 9. You could also measure use the distance between patch and centroid instead of this hard assignment.

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23 and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference.
I'm doing something simple, similar to the following:
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters = 2)
data = vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements looks something like [22, 418, 192] where the first element is the hour.
Any ideas?
Even though #elyase answer is accepted, I think it is not the correct approach.
Yes, to use such distance you have to refine your distance measure and so - use different library. But what is more important - concept of mean used in k-means won't suit the cyclic dimension. Lets consider following example:
#current cluster X,, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
computing simple arithmetic mean will place the centoids in Xc=12.5,Yc=12.5, which from the point of view of cyclic meausre is incorect, it should be Xc=0.5,Yc=12.5. As you can see, asignment based on the cyclic distance measure is not "compatible" with simple mean operation, and leads to bizzare results.
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k--means + distance measure result in degenerated super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2},{y1,y2}
Solving this problem requires checking one if (whether it is better to measure "simple average" or by representing one of the points as x'=x-24). Unfortunately given n points it makes 2^n possibilities.
This seems as a use case of the kernelized k-means, where you are actually clustering in the abstract feature space (in your case - a "tube" rolled around the time dimension) induced by kernel ("similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyds algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by #elyase k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ.
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
The easiest approach, to me, is to adapt the K-means algorithm wraparound dimension via computing the "circular mean" for the dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5

dbscan - setting limit on maximum cluster span

By my understanding of DBSCAN, it's possible for you to specify an epsilon of, say, 100 meters and — because DBSCAN takes into account density-reachability and not direct density-reachability when finding clusters — end up with a cluster in which the maximum distance between any two points is > 100 meters. In a more extreme possibility, it seems possible that you could set epsilon of 100 meters and end up with a cluster of 1 kilometer:
see [2][6] in this array of images from scikit learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am misunderstanding DBSCAN if that's what's happening here.)
Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?
DBSCAN indeed does not impose a total size constraint on the cluster.
The epsilon value is best interpreted as the size of the gap separating two clusters (that may at most contain minpts-1 objects).
I believe, you are in fact not even looking for clustering: clustering is the task of discovering structure in data. The structure can be simpler (such as k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and k-means).
You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
However, I also have the impression that you aren't really sure on what you need and why.
A stength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.
Direct density reachability however, doesn't define a useful (partitioning) structure. It just does not partition the data into disjoint partitions.
If you don't need this kind of strong structure (i.e. you don't do clustering as in "structure discovery", but you just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way, but will highly depend on the ordering of your data. If you then preprocess it appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering however is at least as expensive than a full DBSCAN run. Due to the loose requirements (in particular, "clusters" may overlap, and objects are expected to belong to multiple "clusters"), it is easier to parallelize.
Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance inbetween of any two objects. The only problem is that hierarchical clustering usually is O(n^3), i.e. it doesn't scale to large data sets. DBSCAN runs in O(n log n) in good implementations (with index support).
I had the same problem and ended up solving it by using DBSCAN in combination with KMeans clustering: First I use DBSCAN to identify high density clusters and remove outliers, then I take any cluster larger than 250 Miles (in my case) and break it apart. Here's the code:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:,'cluster'] = clustering.labels_
import mpu
def calculate_cluster_size(lat, long):
left_top = (max(lat), min(long))
right_bottom = (min(lat), max(long))
distance = mpu.haversine_distance(left_top, right_bottom)*0.621371
return distance
for c, df in load_geocodes.groupby('cluster'):
if c == -1:
continue # don't do this for outliers
distance = calculate_cluster_size(df['lat'], df['long'])
print(distance)
if distance > 250:
# break clusters into more clusters until the maximum size of a cluster is less than 250 Miles
max_distance = distance
i = 2
while max_distance > 250:
kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
df.loc[:, 'cl_temp'] = kmeans.labels_
max_temp_cl_size = 0
for temp_cl, temp_cl_df in df.groupby('cl_temp'):
temp_cl_size = calculate_cluster_size(temp_cl_df['lat'], temp_cl_df['long'])
if temp_cl_size > max_temp_cl_size:
max_temp_cl_size = temp_cl_size
i += 1
max_distance = max_temp_cl_size
load_geocodes.loc[df.index,'subcluster'] = kmeans.labels_

Categories