Is there a way to calculate a distance metric (euclidean or cosine similarity or manhattan) between two homomorphically encrypted vectors?
Specifically, I'm looking to generate embeddings of documents (using a transformer), homomorphically encrypting those embeddings, and wanting to calculate a distance metric between embeddings to obtain document similarity scores.
I have evaluated libraries like concrete-numpy, TenSEAL, and Pyfhel (HE libraries), and each appears to lack a specific mathematical operation (division, cumulative sum, or absolute value) needed to compute the distance metrics listed above.
(I did find this: https://github.com/ibarrond/Pyfhel/blob/master/examples/Demo_8_HammingDist.py which calculates hamming distance between encrypted vectors, but this metric doesn't help with document similarity).
[Credit goes to ibarrond - answer found here: https://github.com/ibarrond/Pyfhel/issues/155]
There is indeed! You just need to rely on a few tricks to overcome the limitations of supported operations in CKKS/BFV schemes (mainly additions and multiplications):
Cosine Similarity: Formulated as CS(x, y) = (sum(xᵢ * yᵢ))/(||x|| * ||y||), it would require a division and a norm.
The trick: Normalize the vectors x' = x / ||x|| and y' = y / ||y||, encrypt x' and y', and perform a simple scalar product sum(xᵢ' * yᵢ') to obtain the cosine similarity (see Demo_7_ScalarProd.py for how to do that).
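For concreteness, a minimal sketch of this trick, assuming Pyfhel 3.x with the CKKS scheme (the context parameters and the 512-dimensional random vectors are purely illustrative; see Demo_7_ScalarProd.py for the canonical version):

import numpy as np
from Pyfhel import Pyfhel

# CKKS context; n, scale and qi_sizes are illustrative, tune them for your vector length/precision
HE = Pyfhel()
HE.contextGen(scheme='CKKS', n=2**14, scale=2**30, qi_sizes=[60, 30, 30, 30, 60])
HE.keyGen()
HE.relinKeyGen()
HE.rotateKeyGen()   # rotations are needed for the cumulative sum

x = np.random.rand(512)                             # plaintext embeddings
y = np.random.rand(512)
x_n = (x / np.linalg.norm(x)).astype(np.float64)    # normalize BEFORE encrypting
y_n = (y / np.linalg.norm(y)).astype(np.float64)

ctxt_x = HE.encryptFrac(x_n)
ctxt_y = HE.encryptFrac(y_n)

ctxt_prod = ~(ctxt_x * ctxt_y)       # element-wise product, relinearized
ctxt_cs = HE.cumul_add(ctxt_prod)    # sum over the slots -> encrypted cosine similarity

cos_sim = HE.decryptFrac(ctxt_cs)[0]     # approximate result in the first slot
print(cos_sim, np.dot(x_n, y_n))         # should agree up to CKKS noise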
Euclidean Distance: Formulated as ED(x, y) = sqrt(sum((xᵢ - yᵢ)²)), it would require a square root.
The trick: Settle for the Squared Euclidean Distance instead, where SED(x, y) = sum((xᵢ - yᵢ)²). In Pyfhel you can perform element-wise squaring natively (Demo_3), followed by Pyfhel.cumul_add for the cumulative sum (some examples of this in Demo_7 and Demo_8).
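Building on the same context and ciphertexts as the cosine sketch above (again a Pyfhel 3.x sketch, not a drop-in implementation; normalization is not required for the SED, the normalized ciphertexts are reused only for brevity):

ctxt_diff = ctxt_x - ctxt_y            # element-wise difference
ctxt_sq = ~(ctxt_diff * ctxt_diff)     # element-wise squaring, relinearized
ctxt_sed = HE.cumul_add(ctxt_sq)       # encrypted squared Euclidean distance

sed = HE.decryptFrac(ctxt_sed)[0]
print(sed, np.sum((x_n - y_n) ** 2))   # should agree up to CKKS noise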
Manhattan Distance: Formulated as MD(x, y) = sum(|xᵢ - yᵢ|), it would require computing the absolute value.
The trick: If you encrypt binary values only (that is, x̂, ŷ such that x̂ᵢ, ŷᵢ ∈ {0,1} ∀ i), you can reformulate MD(x̂, ŷ) = sum((x̂ᵢ - ŷᵢ)²) = HD(x̂, ŷ), the Hamming Distance, which is implemented in Demo_8. For non-binary vectors there is no equivalent trick for the absolute value; the closest you can compute is sum((xᵢ - yᵢ)²), which is the SED from above rather than the true MD.
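A quick plaintext check of the binary-vector identity (plain NumPy, no encryption), just to see why the reformulation works:

import numpy as np

x = np.random.randint(0, 2, size=16)    # binary vectors
y = np.random.randint(0, 2, size=16)

smd = np.sum((x - y) ** 2)    # sum of squared differences
hd = np.sum(x != y)           # Hamming distance
md = np.sum(np.abs(x - y))    # Manhattan distance
print(smd == hd == md)        # True: for {0,1} entries all three coincide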
General Tip: Leverage the operations you can perform before encrypting (e.g., normalization) and after decrypting (e.g., taking the square root of the result) to avoid computing non-linear functions in FHE!
Demos referenced can be found here: https://github.com/ibarrond/Pyfhel/tree/master/examples
Related
I am learning word embeddings and cosine similarity. My data is composed of two sets of the same words but in two different languages.
I did two tests:
I measured the cosine similarity using the average of the word vectors (which I think should be called soft cosine similarity instead)
I measured the cosine similarity using the word vectors
Should I expect to obtain roughly the same results? I noticed that sometimes I get two opposite results. Since I am new to this, I am trying to figure out whether I did something wrong or whether there is an explanation behind it. According to what I have been reading, soft cosine similarity should be more accurate than the usual cosine similarity.
I would normally show some data at this point, but unfortunately I can't post part of my data (the words themselves), so I will try to give as much information as I can.
Some other details first:
I am using FastText to create the embeddings, a skipgram model with default parameters.
For the soft cosine similarity, I am using SciPy's spatial distance cosine. Following some suggestions, to obtain a similarity it seems that I should subtract the distance from 1, like this:
(1-distance.cosine(data['LANG1_AVG'].iloc[i],data['LANG2_AVG'].iloc[i]))
For the usual cosine similarity I am using the Fast Vector cosine similarity from FastText Multilingual, defined in this way:
@classmethod
def cosine_similarity(cls, vec_a, vec_b):
    """Compute cosine similarity between vec_a and vec_b"""
    return np.dot(vec_a, vec_b) / \
        (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
As you will see from the image here, for some words I obtained the same results or quite similar using the two methods. For others I obtained two totally different results. How can I explain this?
From what I understand, the soft similarity between two vectors x and y is given by (avg(x) * avg(y)) / (abs(avg(x)) * abs(avg(y))) = sign(avg(x) * avg(y)), which is either 1 or -1 depending on whether the averages have the same sign or not. This is probably not very helpful.
The cosine similarity is calculated as (x · y) / (||x|| * ||y||). Two vectors pointing in the same direction have a similarity of 1 (x · x = ||x||²), two vectors pointing in opposite directions a similarity of -1 (x · -x = -||x||²), and two perpendicular vectors a similarity of 0 ((1,0) · (0,1) = 0). If the angle between the vectors is not 0, 90, 180 or 270 degrees, you will get a similarity score strictly between -1 and 1.
Bottom line: forget about the averages and only use the cosine similarity. Note that the cosine similarity compares the orientation and not the length of the vectors.
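A tiny NumPy illustration of the first point, under the reading that "average of the word vectors" means collapsing each vector to the single number avg(x) (purely illustrative random data):

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
x, y = rng.normal(size=300), rng.normal(size=300)   # two toy word vectors

# cosine similarity of the full vectors: an informative value strictly between -1 and 1
cos_full = 1 - distance.cosine(x, y)

# "similarity of the averages": collapses to the sign of avg(x) * avg(y)
cos_avg = (x.mean() * y.mean()) / (abs(x.mean()) * abs(y.mean()))

print(cos_full)   # some value in (-1, 1)
print(cos_avg)    # always exactly +1.0 or -1.0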
PS: the translation of "able" in French is "capable", not "able" ;)
After some additional research, I found a 2014 paper ("Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model") that explains when and how it can be useful to use averages of the features, and also explains what exactly a soft cosine measure is:
Our idea is more general: we propose to modify the manner of calculation of similarity in Vector Space Model taking into account similarity of features. If we apply this idea to the cosine measure, then the "soft cosine measure" is introduced, as opposed to traditional "hard cosine", which ignores similarity of features. Note that when we consider similarity of each pair of features, it is equivalent to introducing new features in the VSM. Essentially, we have a matrix of similarity between pairs of features and all these features represent new dimensions in the VSM.
I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance measure, then preprocess that matrix so that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is the matrix returned by cosine_similarity). Then set metric to 'precomputed' and pass the precomputed distance matrix D in as X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)

# note: cosine_similarity returns similarities in [-1, 1], not distances
cosine_sim = cosine_similarity(points)

# option 1) vectors are close to each other if they are parallel (or anti-parallel)
bespoke_distance = np.abs(np.abs(cosine_sim) - 1)

# option 2) vectors are close to each other if they point in the same direction
# (keep only one of the two options; option 2 overwrites option 1 here)
bespoke_distance = np.abs(cosine_sim - 1)

results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass the negated cosine similarity matrix (-cosine_similarity) as a precomputed distance matrix and use -0.75 as eps.
D) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
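For example, option D might look like this (a sketch; the threshold and min_samples values are illustrative):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

X = np.random.rand(1000, 3)
threshold = 0.75

sim = cosine_similarity(X)
# binary distances: 0 where the similarity exceeds the threshold, 1 otherwise
binary_distance = np.where(sim > threshold, 0.0, 1.0)

# distance 0 <= eps=0.5 counts as a neighbour, distance 1 does not
labels = DBSCAN(metric='precomputed', eps=0.5, min_samples=5).fit_predict(binary_distance)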
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
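For reference, the three transforms side by side on a few sample similarity values (plain NumPy):

import numpy as np

cos_sim = np.array([-1.0, 0.0, 0.5, 1.0])

d1 = np.abs(cos_sim - 1)          # option 1: [2.0, 1.0, 0.5, 0.0]
d2 = np.arccos(cos_sim) / np.pi   # option 2: angular distance in [0, 1]: [1.0, 0.5, 0.333..., 0.0]
d3 = 1 - (cos_sim + 1) / 2        # option 3: linear rescaling to [0, 1]: [1.0, 0.5, 0.25, 0.0]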
I have n and m binary vectors (of length 1500) from sets A and B respectively.
I need a metric that can say how similar (a kind of distance metric) all those n vectors and all those m vectors are.
The output should be total_distance_of_n_vectors and total_distance_of_m_vectors.
And if total_distance_of_n_vectors > total_distance_of_m_vectors, it means set B has more similar vectors than set A.
Which metric should I use? I thought of Jaccard similarity, but I am not able to apply it in this context. Should I compute the distance of each vector to every other vector to get the total distance, or something else?
There are two concepts relevant to your question, which you should consider separately.
Similarity Measure:
Independent of your scoring mechanism, you should find a similarity measure which suits your data best. It could be the Euclidean distance (not well suited to a 1500-dimensional space), a cosine (dot-product-based) distance, or a Hamming distance (assuming your input features are completely independent, which is rarely the case).
A lot can go on in your distance function, and you should find one which makes sense for your data.
Scoring Mechanism:
You mention total_distance_of_vectors in your question, which is probably not what you want. If n >> m, the total sum of distances over n vectors will almost certainly be larger than the total over m vectors.
What you're looking for is most probably an average of the distances between the members of each set. Then, depending on whether you want your average to be sensitive to outliers or not, you can go for the average of the distances or the average of the squared distances.
If you want to dig deeper, you can also get the mean and variance of the distances within the two sets and compare the distributions.
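As a hypothetical scoring sketch for the binary-vector sets above, using SciPy's pairwise distances (the cohesion_score helper and the choice of 'jaccard' are mine; 'hamming' works the same way):

import numpy as np
from scipy.spatial.distance import pdist

def cohesion_score(vectors, metric='jaccard'):
    """Mean and variance of the pairwise distances within one set (lower mean = more similar)."""
    d = pdist(vectors.astype(bool), metric=metric)   # all pairwise distances within the set
    return d.mean(), d.var()

A = np.random.randint(0, 2, size=(40, 1500))   # n binary vectors of length 1500
B = np.random.randint(0, 2, size=(60, 1500))   # m binary vectors of length 1500

mean_A, var_A = cohesion_score(A)
mean_B, var_B = cohesion_score(B)
# if mean_A > mean_B, the vectors in set B are on average more similar to each other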
Is it possible to use something like 1 - cosine similarity with scikit learn's KNeighborsClassifier?
This answer says no, but the documentation for KNeighborsClassifier says that the metrics listed in DistanceMetric are available. That list doesn't include an explicit cosine distance, probably because it's not really a distance, but supposedly it's possible to pass a function as the metric. I tried passing the scikit-learn linear kernel to KNeighborsClassifier, but it gives me an error saying the function needs two arrays as arguments. Has anyone else tried this?
The cosine similarity is generally defined as xᵀy / (||x|| * ||y||), and it outputs 1 if the vectors are the same and goes to -1 if they are completely different. This definition is not technically a metric, so you can't use accelerating structures like ball trees and k-d trees with it. If you force scikit-learn to use the brute-force approach, you should be able to use it as a distance if you pass it your own custom distance metric object. There are methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
Notice, though, that xᵀy / (||x|| * ||y||) = (x/||x||)ᵀ(y/||y||). The Euclidean distance can be equivalently written as sqrt(xᵀx + yᵀy - 2xᵀy). If we normalize every datapoint before giving it to the KNeighborsClassifier, then xᵀx = 1 for all x, so the Euclidean distance reduces to sqrt(2 - 2xᵀy). For completely identical inputs we get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. Since sqrt(2 - 2xᵀy) decreases monotonically as xᵀy increases, you get the same neighbour ordering as with the cosine distance by normalizing your data and then using the Euclidean distance. As long as you use the uniform weights option, the results will be identical to having used a correct cosine distance.
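A sketch of the brute-force route with a custom distance callable (the cosine_distance helper and the random data are mine, purely for illustration):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_distance(a, b):
    # 1 - cosine similarity; not a true metric, hence the brute-force search below
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.random.rand(200, 50)
y = np.random.randint(0, 3, size=200)

knn = KNeighborsClassifier(n_neighbors=5, metric=cosine_distance,
                           algorithm='brute', weights='uniform')
knn.fit(X, y)
pred = knn.predict(np.random.rand(10, 50))

Note that calling a Python function per pair is slow; newer scikit-learn versions also accept metric='cosine' directly with the brute-force algorithm.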
The KNN family of class constructors has a parameter called metric, which lets you switch between the different distance metrics you want to use in your nearest-neighbour model.
A list of available distance metrics can be found here
If you want to use the cosine metric for ranking and classification problems, you can use the L2 (Euclidean) distance on normalized feature vectors instead; that gives you the same ranking/classification results (predictions made by argmax or argmin operations).
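A sketch of that normalization trick (random data, purely illustrative); with uniform weights this gives the same predictions as using a cosine distance directly:

import numpy as np
from sklearn.preprocessing import normalize
from sklearn.neighbors import KNeighborsClassifier

X = np.random.rand(200, 50)
y = np.random.randint(0, 3, size=200)

# L2-normalize the rows, then the default Euclidean metric reproduces the
# cosine-distance neighbour ordering (sqrt(2 - 2*cos_sim) is monotonic in cos_sim)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(normalize(X), y)
pred = knn.predict(normalize(np.random.rand(10, 50)))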
As a newbie to Dynamic Time Warping (DTW), I find that its Python implementation mlpy.dtw is not documented in much detail. I have some problems with its return value.
Regarding the returned value dist, I have two questions:
Is there a typo here? For standard DTW, the documentation says
Standard DTW as described in [Muller07], using the Euclidean distance
(absolute value of the difference) or squared Euclidean distance (as
in [Keogh01]) as local cost measure.
and for subsequence DTW, the documentation says
Subsequence DTW as described in [Muller07], assuming that the length
of y is much larger than the length of x and using the Manhattan
distance (absolute value of the difference) as local cost measure.
The same so-called "absolute value of the difference" corresponds to two different distance metrics?
Total distance? After running the snippet
dist, cost, path = mlpy.dtw_std(x, y, dist_only=False)
dist is one value. So is it the lumped sum of all the distances between each matched pair?
Yes, the mlpy.dtw() function is not well documented.
First question: no typo here. As you can see in the documentation, the Euclidean, squared Euclidean and Manhattan distances concern the local cost measure. In this case the cost measure is defined as a distance between two real values (one dimension); see cost in the pseudocode at http://en.wikipedia.org/wiki/Dynamic_time_warping. So, in this case, the Manhattan distance and the Euclidean distance are the same (http://en.wikipedia.org/wiki/Euclidean_distance#One_dimension). Anyway, in the standard DTW you can choose between the Euclidean distance (absolute value of the difference) and the squared Euclidean distance (squared difference) via the parameter squared:
>>> import mlpy
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=False) # Euclidean distance
9.0
>>> mlpy.dtw_std([1,2,3], [4,5,6], squared=True) # Squared Euclidean distance
26.0
Second question: dist is the cost of the unnormalized minimum-cost warping path between the time series x and y, i.e. the unnormalized DTW distance. You can normalize it by dividing by len(x) + len(y). See http://www.irit.fr/~Julien.Pinquier/Docs/TP_MABS/res/dtw-sakoe-chiba78.pdf
It seems to be an error in the documentation. The Euclidean distance is not the "absolute value of the difference"; that is the correct description of the Manhattan metric. Probably the author was thinking of the one-dimensional case, since in ℝ the Euclidean and Manhattan metrics coincide (and the Euclidean metric really is the absolute value of the difference then). I am not familiar with the library; if it only operates on one-dimensional objects, then there is no error and the two distance measures are equivalent.
The dist value is the cost of the best time warp (measured as the summed costs of matching; see the algorithm definition on Wikipedia). So it is in fact the minimum edit distance between the two sequences, where the cost of each particular edit is the dissimilarity (distance) between the "matched" objects.
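To make the "summed costs of matching" concrete, here is a minimal textbook DTW recurrence (the classic dynamic program, not mlpy's implementation) that reproduces the numbers from the mlpy example above:

def dtw(x, y, squared=False):
    """Unnormalized DTW distance with |difference| (or squared difference) as the local cost."""
    n, m = len(x), len(y)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i-1] - y[j-1]) ** 2 if squared else abs(x[i-1] - y[j-1])
            # extend the cheapest of the three allowed predecessor paths
            D[i][j] = cost + min(D[i-1][j-1], D[i-1][j], D[i][j-1])
    return D[n][m]

print(dtw([1, 2, 3], [4, 5, 6]))                 # 9.0, matching mlpy.dtw_std(..., squared=False)
print(dtw([1, 2, 3], [4, 5, 6], squared=True))   # 26.0, matching squared=True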