Using cosine distance with scikit learn KNeighborsClassifier - python

Is it possible to use something like 1 - cosine similarity with scikit learn's KNeighborsClassifier?
This answer says no, but on the documentation for KNeighborsClassifier, it says the metrics mentioned in DistanceMetrics are available. Distance metrics don't include an explicit cosine distance, probably because it's not really a distance, but supposedly it's possible to input a function into the metric. I tried inputting the scikit learn linear kernel into KNeighborsClassifier but it gives me an error that the function needs two arrays as arguments. Anyone else tried this?

The cosine similarity is generally defined as x^T y / (||x|| * ||y||); it outputs 1 if the vectors point in the same direction and goes to -1 if they point in opposite directions. The quantity 1 - cosine similarity is not technically a metric (it violates the triangle inequality), and so you can't use accelerating structures like ball and kd trees with it. If you force scikit learn to use the brute force approach, you should be able to use it as a distance if you pass it your own custom distance metric object. There are methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
Notice though, that x^T y / (||x|| * ||y||) = (x/||x||)^T (y/||y||). The euclidean distance can be equivalently written as sqrt(x^T x + y^T y − 2 x^T y). If we normalize every datapoint before giving it to the KNeighborsClassifier, then x^T x = 1 for all x. So the euclidean distance will reduce to sqrt(2 − 2 x^T y). For identical inputs, we would get sqrt(2 - 2*1) = 0 and for complete opposites sqrt(2 - 2*(-1)) = 2. This is a monotonically increasing function of the cosine distance 1 − x^T y, so you can get the same ordering as the cosine distance by normalizing your data and then using the euclidean distance. So long as you use the uniform weights option, the results will be identical to having used a correct Cosine Distance.
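A minimal sketch of that equivalence, assuming a recent scikit-learn in which KNeighborsClassifier accepts metric="cosine" together with algorithm="brute". The toy data, the variable names and n_neighbors=5 are made up for illustration and are not part of the original answer.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))         # made-up training data
y_train = (X_train[:, 0] > 0).astype(int)   # made-up binary labels
X_test = rng.normal(size=(20, 5))

# Brute-force search with the built-in cosine distance (1 - cosine similarity).
knn_cos = KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute")
knn_cos.fit(X_train, y_train)

# Euclidean distance on L2-normalized rows; tree-based search is allowed here.
knn_euc = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_euc.fit(normalize(X_train), y_train)

d_cos, idx_cos = knn_cos.kneighbors(X_test)
d_euc, idx_euc = knn_euc.kneighbors(normalize(X_test))

# Same neighbours in the same order, hence the same uniform-weight predictions.
same_neighbors = np.array_equal(idx_cos, idx_euc)
same_predictions = np.array_equal(
    knn_cos.predict(X_test), knn_euc.predict(normalize(X_test))
)
# On unit vectors: euclidean = sqrt(2 - 2 * cos_sim) = sqrt(2 * cosine_distance).
same_distances = np.allclose(d_euc, np.sqrt(2 * d_cos), atol=1e-6)
```

The distances themselves differ (sqrt(2 * d) versus d), which is why distance-weighted voting would not be guaranteed to give identical results.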

KNN family class constructors have a parameter called metric, which lets you switch between the different distance metrics you want to use in the nearest neighbour model.
A list of available distance metrics can be found here
If you want to use the cosine metric for a ranking or classification problem, you can use the L2 (Euclidean) distance on normalized feature vectors; that gives you the same ranking/classification results (predictions made by argmax or argmin operations).

Related

Cosine similarities and totally different results using same source

I am learning word embeddings and cosine similarity. My data is composed of two sets of same words but in 2 different languages.
I did two tests:
I measured the cosine similarity using the average of the word vectors (which I think should be called soft cosine similarity instead)
I measured the cosine similarity using the word vectors
Should I expect to obtain roughly the same results? I noticed that sometimes I get two opposite results. Since I am new to this, I am trying to figure out if I did something wrong or if there is an explanation behind it. According to what I have been reading, soft cosine similarity should be more accurate than the usual cosine similarity.
Now, it's time for some data to show you. Unfortunately I can't post a part of my data (the words themselves), but I will try my best to give you the max of information I can give you.
Some other details before:
I am using FastText to create the embeddings, skipgram model with default parameters.
For the soft cosine similarity, I am using Scipy spatial distance cosine. Following some people's suggestions, to measure cosine similarity it seems that I should subtract the distance from 1, such as:
(1-distance.cosine(data['LANG1_AVG'].iloc[i],data['LANG2_AVG'].iloc[i]))
For the usual cosine similarity I am using the Fast Vector cosine similarity from FastText Multilingual, defined in this way:
@classmethod
def cosine_similarity(cls, vec_a, vec_b):
    """Compute cosine similarity between vec_a and vec_b"""
    return np.dot(vec_a, vec_b) / \
        (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
As you will see from the image here, for some words I obtained the same results or quite similar using the two methods. For others I obtained two totally different results. How can I explain this?
From what I understand, the soft similarity between two vectors x and y is given by (avg(x) * avg(y)) / (abs(avg(x)) * abs(avg(y))) = sign(avg(x) * avg(y)), which is either 1 or -1 depending on whether the averages have the same sign or not. This is probably not very helpful.
The cosine similarity is calculated with (x * y) / (||x|| * ||y||), where * is the dot product. 2 vectors pointing in the same direction will have a similarity of 1 (x * x = ||x||^2), 2 vectors pointing in opposite directions a similarity of -1 (x * -x = -||x||^2) and 2 perpendicular vectors a similarity of 0 ((1,0)*(0,1)=0). If the angle between the vectors is neither 0 nor 180 degrees, you will have a similarity score strictly between -1 and 1.
Bottom line: forget about the averages and only use the cosine similarity. Note that the cosine similarity compares the orientation and not the length of the vectors.
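A small sketch of the two computations contrasted above. It assumes, as this answer does, that "using the average" means collapsing each vector to its scalar mean; the vectors and the helper name scalar_mean_similarity are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))


def scalar_mean_similarity(vec_a, vec_b):
    """Hypothetical helper: similarity after reducing each vector to its scalar mean."""
    mean_a, mean_b = np.mean(vec_a), np.mean(vec_b)
    return float(mean_a * mean_b / (abs(mean_a) * abs(mean_b)))


x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, -1.0, 0.5])   # made-up vector

same_direction = cosine_similarity(x, 2 * x)         # vector length is ignored
opposite = cosine_similarity(x, -x)
perpendicular = cosine_similarity([1, 0], [0, 1])
full_vectors = cosine_similarity(x, y)
via_scipy = 1 - distance.cosine(x, y)                # SciPy returns a distance, not a similarity
collapsed = scalar_mean_similarity(x, y)             # only the sign of the two means survives
```

Here the full vectors give a similarity of about 0.21, while the collapsed version reports 1.0 simply because both means are positive.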
PS: the translation of "able" in french is "capable" and not "able" ;)
After some more additional research, I found a 2014 paper (Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model) that explains when and how it could be useful to use averages of the features, and it also explains what exactly a soft cosine measure is:
Our idea is more general: we propose to modify the
manner of calculation of similarity in Vector Space Model
taking into account similarity of features. If we apply
this idea to the cosine measure, then the “soft cosine
measure” is introduced, as opposed to traditional “hard
cosine”, which ignores similarity of features. Note that
when we consider similarity of each pair of features, it
is equivalent to introducing new features in the VSM.
Essentially, we have a matrix of similarity between pairs
of features and all these features represent new dimensions in the VSM.
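To make the quoted idea concrete, here is a hedged sketch of a soft cosine measure as I understand the definition: the dot products are replaced by a bilinear form a^T S b, where S holds the similarity between each pair of features. The three features and the values in S are invented for illustration.

```python
import numpy as np


def soft_cosine(a, b, feature_sim):
    """Soft cosine: (a^T S b) / (sqrt(a^T S a) * sqrt(b^T S b))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S = np.asarray(feature_sim, dtype=float)
    return float((a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)))


# Two documents over three hypothetical features, e.g. ["play", "game", "tea"].
doc_a = [1, 0, 0]
doc_b = [0, 1, 0]

S_hard = np.eye(3)                   # no feature is similar to any other
S_soft = np.array([[1.0, 0.5, 0.0],  # assume "play" and "game" are 0.5 similar
                   [0.5, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])

hard = soft_cosine(doc_a, doc_b, S_hard)   # ordinary ("hard") cosine
soft = soft_cosine(doc_a, doc_b, S_soft)
```

With the identity matrix the measure reduces to the usual cosine similarity (0 for these two documents); with the assumed feature similarities it becomes 0.5.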

How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute the cosine similarity matrix, convert it into a distance matrix that fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CS) - 1), where CS is your cosine similarity matrix), and then set metric to precomputed and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)

# pairwise cosine similarities in [-1, 1] (a similarity, not yet a distance)
cos_sim = cosine_similarity(points)

# option 1) vectors are close to each other if they are parallel (or anti-parallel)
bespoke_distance = np.abs(np.abs(cos_sim) - 1)

# option 2) vectors are close to each other if they point in the same direction
# (this overwrites option 1; keep only the line you need)
bespoke_distance = np.abs(cos_sim - 1)

# similarity >= 0.75 corresponds to distance <= 0.25
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) Check out Generalized DBSCAN, which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) You can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) You supposedly can even pass -cosine similarity as a precomputed distance matrix and use -0.75 as eps.
D) Just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
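A hedged sketch comparing the "1 - cosine similarity" option with the binary-matrix option (distance 0 for pairs above the threshold, 1 for all others). It assumes scikit-learn's DBSCAN accepts metric="cosine" with algorithm="brute"; the synthetic points and the 0.75 threshold are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
directions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# 50 points around each direction, with small angular noise and random lengths.
points = np.vstack([
    (d + 0.05 * rng.normal(size=(50, 3))) * rng.uniform(0.5, 5.0, size=(50, 1))
    for d in directions
])

threshold = 0.75                     # neighbours must have similarity >= 0.75
cos_sim = cosine_similarity(points)

# Built-in cosine distance, computed by brute force.
labels_builtin = DBSCAN(metric="cosine", algorithm="brute",
                        eps=1 - threshold).fit_predict(points)

# Precomputed distance 1 - similarity (clipped to guard against tiny negatives).
cos_dist = np.clip(1 - cos_sim, 0.0, None)
labels_precomputed = DBSCAN(metric="precomputed",
                            eps=1 - threshold).fit_predict(cos_dist)

# Binary distance: 0 for pairs above the threshold, 1 otherwise.
binary_dist = np.where(cos_sim > threshold, 0.0, 1.0)
labels_binary = DBSCAN(metric="precomputed", eps=0.5).fit_predict(binary_dist)
```

On this well-separated toy data all three variants should recover the same two direction clusters; on real data, pairs sitting exactly at the threshold can be treated differently by the strict and non-strict comparisons.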
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit the snag too). As I understand #2 is the more mathematically-correct approach; preserving angular distance.
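A quick numerical check of why these three conversions tend to behave alike: each is a strictly decreasing function of the cosine similarity, so they rank pairs identically. The grid of similarity values is arbitrary, and the clip before arccos is my addition to guard against rounding slightly outside [-1, 1].

```python
import numpy as np

cos_sim = np.linspace(-1.0, 1.0, 201)   # arbitrary grid of similarity values

dist_abs = np.abs(cos_sim - 1)                                  # equals 1 - cos_sim, range [0, 2]
dist_angular = np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi   # normalized angle, range [0, 1]
dist_rescaled = 1 - (cos_sim + 1) / 2                           # equals (1 - cos_sim) / 2, range [0, 1]


def strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return bool(np.all(np.diff(values) < 0))


all_monotone = all(
    strictly_decreasing(d) for d in (dist_abs, dist_angular, dist_rescaled)
)
first_is_twice_third = bool(np.allclose(dist_abs, 2 * dist_rescaled))
```

The first and third options differ only by a factor of 2. The angular version is a nonlinear rescaling, so any distance threshold has to be converted accordingly, and methods that average distances may not give exactly the same result with it.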

Distance metric for n binary vectors

I have n and m binary vectors (of length 1500) from sets A and B respectively.
I need a metric that can say how similar (kind of distance metric) all those n vectors and m vectors are.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means Set B has more similar vectors than Set A.
Which metric should I use? I thought of Jaccard similarity. But I am not able to put it in this context. Should I find the distance of each vector with each other to find the total distance or something else ?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It can be a Euclidean distance (not suitable for a 1500 dimensional space), a cosine (dot product based) distance, or a Hamming distance (assuming your input features are completely independent, which rarely is the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which probably is not what you want. If n >> m, almost certainly the total sum of distances for n vectors is more than the total distance for m vectors.
What you're looking for is most probably an average of the distances between the members of your sets. Then, depending on whether you want your average to be sensitive to outliers or not, you can go for the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
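A minimal sketch of the averaging idea, assuming SciPy's pdist with the Jaccard distance the asker mentioned. The two sets below are synthetic, and the helper name mean_pairwise_distance is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist


def mean_pairwise_distance(vectors, metric="jaccard"):
    """Average distance over all unordered pairs of rows (hypothetical helper)."""
    return float(pdist(np.asarray(vectors, dtype=bool), metric=metric).mean())


rng = np.random.default_rng(0)
length = 1500

# Set A: 20 unrelated random binary vectors.
set_a = rng.random((20, length)) < 0.5

# Set B: 8 near-copies of one base vector, each with roughly 1% of its bits flipped.
base = rng.random(length) < 0.5
flips = rng.random((8, length)) < 0.01
set_b = np.logical_xor(base, flips)

avg_a = mean_pairwise_distance(set_a)
avg_b = mean_pairwise_distance(set_b)
b_is_more_similar = avg_b < avg_a
```

Because the score is an average over pairs, it does not grow just because one set has more vectors than the other. Passing metric="hamming" instead gives the fraction of differing positions.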

Evaluating vector distance measures

I am working with vectors of word frequencies and trying out some of the different distance measures available in scikit-learn's pairwise distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances by running pairwise differences over the same dataset with the distance metrics available in Scipy (e.g. Euclidean, Cityblock, etc.) and for each metric
convert distances computed for the dataset to zscores to normalize across metrics
get the range of these zscores, i.e. the spread of the distances
use the distance metric that gives me the widest range of distances as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal), but they don't apply 100 % here...
Any help is much appreciated!
My code:
import numpy as np
import scipy.stats
import sklearn.metrics

matrix = np.random.uniform(0, .1, size=(10, 300))  # test data set
scipy_distances = ['euclidean', 'minkowski', ...]  # these are the distance metrics
for d in scipy_distances:  # iterate over distances
    distmatrix = sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
    distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
    diststats = basicstatsmaker(distzscores)  # asker's own helper, not shown
    zrange = np.ptp(distzscores, axis=0)  # renamed from `range` to avoid shadowing the builtin
    print("range of metric", d, np.ptp(zrange))
In general, this is just a heuristic, which might or might not work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out
class Dummy_dist:
    def __init__(self):
        self.cheat = True

    def __call__(self, x, y):
        if self.cheat:
            self.cheat = False
            return 1e60
        else:
            return 0

dummy_dist = Dummy_dist()
This will give you a huge spread (even with z-score normalization). Of course this is a cheating example, as it is non-deterministic, but I wanted to show the basic counterexample, and given your data one can construct a deterministic analogue.
So what should you do? Your metric should be treated as a hyperparameter of your process. You should not divide the process of generating your clustering/classification into two separate phases, choosing a distance and then learning something; you should do this jointly and consider your clustering/classification + distance pairs as a single model. Thus, instead of working with k-means, you will work with k-means+euclidean, k-means+minkowski and so on. This is the only statistically supported approach. You cannot construct a method of assessing the "general goodness" of a metric, as there is no such object; metric quality can only be assessed in a particular task, which involves fixing every other element (such as the clustering/classification method, the particular dataset etc.). Once you perform such a wide, exhaustive evaluation, checking many such pairs on many datasets, you might claim that a given metric performs best in such a range of tasks.
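One way to treat the metric as a hyperparameter, sketched with scikit-learn's GridSearchCV and a k-nearest-neighbours classifier. The iris dataset is only a stand-in for the asker's word-frequency matrix, and the candidate metrics and neighbour counts are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)   # stand-in dataset, not the asker's data

candidate_metrics = ["euclidean", "manhattan", "chebyshev", "cosine"]

# The metric is searched jointly with the model's other hyperparameters,
# and every combination is scored on the actual task via cross-validation.
search = GridSearchCV(
    KNeighborsClassifier(algorithm="brute"),   # brute force supports all four metrics
    param_grid={"metric": candidate_metrics, "n_neighbors": [3, 5, 7]},
    cv=5,
)
search.fit(X, y)

best_metric = search.best_params_["metric"]
best_score = search.best_score_
```

The winning metric is only "best" for this model on this dataset, which is the point made above: rerunning the search with a different estimator or different data may select a different metric.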

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23 and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference.
I'm doing something simple, similar to the following:
from numpy import vstack
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
data = vstack(data)  # data: the rows to cluster, defined elsewhere
fit = clusters.fit(data)
classes = fit.predict(data)
data elements look something like [22, 418, 192] where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure and so use a different library. But what is more important: the concept of the mean used in k-means won't suit the cyclic dimension. Let's consider the following example:
#current cluster X, based on centroid position Xc=24
x1=1
x2=24
#current cluster Y, based on centroid position Yc=10
y1=12
y2=13
Computing the simple arithmetic mean will place the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and leads to bizarre results.
Simple k-means will result in clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure results in a degenerate super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires checking, for each point, whether it is better to use it as is in the "simple average" or to represent it as x'=x-24. Unfortunately, given n points, that makes 2^n possibilities.
This seems like a use case for kernelized k-means, where you are actually clustering in the abstract feature space (in your case, a "tube" rolled around the time dimension) induced by a kernel (a "similarity measure", being the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
1. the reassignment step
2. the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by @elyase, k-means might no longer converge.
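A small numerical illustration of the "mean is a least-squares estimator" point, using a brute-force search over candidate centers. The four values and the grid are arbitrary; the absolute-deviation objective is included only as a contrast.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 10.0])     # arbitrary one-dimensional cluster
candidates = np.linspace(0.0, 12.0, 1201)    # candidate centers, step 0.01

# Objective k-means uses: sum of squared deviations from the center.
sum_sq = ((values[None, :] - candidates[:, None]) ** 2).sum(axis=1)
# A different objective: sum of absolute deviations from the center.
sum_abs = np.abs(values[None, :] - candidates[:, None]).sum(axis=1)

best_for_squares = candidates[np.argmin(sum_sq)]   # lands on the mean, 4.0
best_for_abs = candidates[np.argmin(sum_abs)]      # lands between the middle values 2 and 3
```

The squared objective is minimized at the arithmetic mean, while the absolute-deviation objective is minimized by a median. If the assignment step uses one distance and the update step still takes the mean, the two steps no longer optimize the same function.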
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi) which may be okay for SSQ.
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
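A hedged sketch of the sin/cos transformation suggested above, clustering only the hour dimension with scikit-learn's KMeans. The eight hours are invented; with real rows such as [22, 418, 192] the remaining features would also need to be brought to a comparable scale, which is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

hours = np.array([22, 23, 0, 1, 11, 12, 13, 14])   # made-up hour-of-day values

# Map each hour onto the unit circle so that 23 and 0 become neighbours.
angles = hours / 12 * np.pi
features = np.column_stack([np.sin(angles), np.cos(angles)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

late_night = labels[:4]   # hours 22, 23, 0, 1
midday = labels[4:]       # hours 11, 12, 13, 14
```

Plain Euclidean k-means is valid in this two-dimensional space, so no custom distance or modified centroid update is required. A centroid can be turned back into an hour with arctan2 on its sin and cos coordinates.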
The easiest approach, to me, is to adapt the K-means algorithm to the wraparound dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
# compute the circular mean of hour 0 and hour 23
import numpy as np

hours = np.arange(24)
# hours to angles (24.0 forces float division)
angles = hours / 24.0 * (2 * np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
# the mean angle is the direction of the summed unit vectors
a = np.arctan2(sin[23] + sin[0], cos[23] + cos[0])
if a < 0:
    a += 2 * np.pi
# angle back to hour
hour = a * 24 / (2 * np.pi)
# 23.5
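The same calculation can be wrapped into reusable helpers, together with the matching wrap-around distance that the modified assignment step would need. This is a sketch only: the function names and the period argument are my own and are not part of any library.

```python
import numpy as np


def circular_mean_hours(hours, period=24.0):
    """Circular mean of hour values (hypothetical helper)."""
    angles = np.asarray(hours, dtype=float) / period * 2 * np.pi
    # Direction of the summed unit vectors, mapped back into [0, 2*pi).
    mean_angle = np.arctan2(np.sin(angles).sum(), np.cos(angles).sum())
    return float((mean_angle % (2 * np.pi)) * period / (2 * np.pi))


def cyclic_distance(a, b, period=24.0):
    """Shortest distance between two hours on the clock (hypothetical helper)."""
    diff = abs(a - b) % period
    return min(diff, period - diff)


mean_late = circular_mean_hours([23, 0])   # 23.5 rather than the arithmetic 11.5
mean_wrap = circular_mean_hours([1, 24])   # 0.5 rather than the arithmetic 12.5
gap = cyclic_distance(0, 23)               # 1.0 rather than 23
```

The circular mean is not well defined when the points cancel out exactly, for example hours 0 and 12, so an implementation would need a rule for that case.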
