Reading around, I find it is possible to pass a precomputed distance matrix into SKLearn DBSCAN. Unfortunately, I don't know how to pass it in for the calculation.
Say I have a 1D array with 100 elements, with just the names of the nodes. Then I have a 2D matrix, 100x100 with the distance between each element (in the same order).
I know I have to call it:
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
for a distance between nodes of 2 and a minimum of 5 nodes per cluster, with "precomputed" indicating that the 2D matrix should be used. But how do I pass the matrix in for the calculation?
The same question would apply when using the RAPIDS cuML DBSCAN function (GPU accelerated).
documentation:
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean',
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]
[...]
metric : string, or callable, default='euclidean'
The metric to use when calculating distance between instances in a feature array. If
metric is a string or callable, it must be one of the options allowed by
sklearn.metrics.pairwise_distances for its metric parameter. If metric is
"precomputed", X is assumed to be a distance matrix and must be square. X may be a
sparse graph, in which case only "nonzero" elements may be considered neighbors for
DBSCAN.
[...]
So, the way you normally call this is:
from sklearn.cluster import DBSCAN
clustering = DBSCAN()
clustering.fit(X)
If you have a distance matrix, you do:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
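For a concrete, hedged sketch of how that looks end to end with the setup from the question: names and distance_matrix below are illustrative placeholders (here built from random 2D coordinates) standing in for the 1D array of node names and the 100x100 matrix of pairwise distances.
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative stand-ins for the question's data: 100 node names and a
# 100x100 symmetric matrix of pairwise distances between those nodes.
names = np.array(["node_%d" % i for i in range(100)])
coords = np.random.rand(100, 2) * 10
distance_matrix = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))

db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
labels = db.fit_predict(distance_matrix)  # one cluster label per row; -1 means noise

# Map the cluster labels back onto the node names.
for name, label in zip(names, labels):
    print(name, label)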
I want to pass my own distance matrix (row linkages) to seaborn clustermap.
There are already some posts on this like
Use Distance Matrix in scipy.cluster.hierarchy.linkage()?
But they all point to
scipy hierarchy linkage
which takes the clustering metric and method as arguments.
scipy.cluster.hierarchy.linkage(y, method='single',
metric='euclidean', optimal_ordering=False)
The input y may be either a 1d condensed distance matrix or a 2d array
of observation vectors
What I don't get is this:
My distance matrix is already based on a certain metric and method,
so why would I want to recalculate this in scipy's hierarchy linkage?
Is there an option where it purely uses my distances and creates the linkages?
For posterity, here is a complete method of how to do this, as @WarrenWeckesser in the comments and @SibbsGambling in the linked answer leave out some details.
Suppose distMatrix is your matrix of distances (don't have to be Euclidean), with entry in row i and column j representing the distance between the ith and jth objects. Then:
# import packages
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
import seaborn as sns
# define distance array as in linked answer
distArray = ssd.squareform(distMatrix)
# define linkage object
distLinkage = hierarchy.linkage(distArray)
# make clustermap
sns.clustermap(distMatrix, row_linkage=distLinkage, col_linkage=distLinkage)
Note that when creating the clustermap, you still have to reference the original matrix. If you want to use a different clustering method, such as method='ward', include that option when defining distLinkage.
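For instance, continuing the snippet above (so distArray, distMatrix, hierarchy and sns are as defined there), switching to Ward linkage would look like the sketch below; note that SciPy only guarantees correct results for Ward on Euclidean distances, so treat this as illustrative.
# Same clustermap, but with Ward linkage on the condensed distance array.
distLinkage = hierarchy.linkage(distArray, method='ward')
sns.clustermap(distMatrix, row_linkage=distLinkage, col_linkage=distLinkage)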
I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
Pass a range of values to the eps parameter of DBSCAN, e.g. eps=[0.75, 1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by sklearn.metrics.pairwise_distances for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute a distance matrix using cosine similarity as your distance measure, preprocess that matrix so that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is your cosine similarity matrix), then set metric to 'precomputed' and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN
total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_sim = cosine_similarity(points)
# option 1) vectors are close to each other if they are parallel
bespoke_distance = np.abs(np.abs(cosine_sim) - 1)
# option 2) vectors are close to each other if they point in the same direction
bespoke_distance = np.abs(cosine_sim - 1)
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
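The cluster assignments can then be read off the fitted estimator in the usual way, for example:
labels = results.labels_  # one label per row of bespoke_distance; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)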
A) Check out Generalized DBSCAN, which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) You can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) You supposedly can even pass -cosine_similarity as the precomputed distance matrix and use -0.75 as eps.
D) Just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
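A rough sketch of option D, where cos_sim, threshold and points are just placeholder names built from random data, not anything from the question:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

points = np.random.rand(100, 3)
cos_sim = cosine_similarity(points)
threshold = 0.75

# distance 0 where similarity exceeds the threshold, 1 otherwise
binary_dist = np.where(cos_sim > threshold, 0.0, 1.0)

labels = DBSCAN(eps=0.5, min_samples=5, metric='precomputed').fit_predict(binary_dist)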
A few options:
1. dist = np.abs(cos_sim - 1) (the accepted answer here)
2. dist = np.arccos(cos_sim) / np.pi (https://math.stackexchange.com/a/3385463/816178)
3. dist = 1 - (cos_sim + 1) / 2 (https://math.stackexchange.com/q/3241174/816178)
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, #2 is the more mathematically correct approach, as it preserves angular distance.
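As a hedged sketch, all three conversions can be computed from the same cosine-similarity matrix and then passed on as precomputed distances (cos_sim and X are just placeholder names on random data):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

X = np.random.rand(50, 4)
cos_sim = np.clip(cosine_similarity(X), -1.0, 1.0)  # clip tiny numerical overshoots

dist1 = np.abs(cos_sim - 1)           # option 1: the accepted answer above
dist2 = np.arccos(cos_sim) / np.pi    # option 2: normalized angular distance
dist3 = 1 - (cos_sim + 1) / 2         # option 3: linear rescaling of the similarity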
I have around 1M binary numpy arrays for which I need to compute the Hamming distance between each pair in order to find the k-nearest neighbours; the fastest method I have found is cdist, which returns a float matrix with the distances.
Since I don't have enough memory to hold a 1Mx1M float matrix, I'm doing it one element at a time, like this:
from scipy.spatial import distance
Hamming_Distance = distance.cdist(array1, all_array, 'hamming')
The problem is that each Hamming_Distance call takes about 2-3 s, so for 1M documents it takes an eternity (and I need to run it for different values of k).
Is there any faster way to do it?
I'm thinking of multiprocessing or writing it in C, but I have trouble understanding how multiprocessing works in Python, and I don't know how to mix C code with Python code.
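For concreteness, here is a hedged sketch of the chunked loop described above, with np.argpartition keeping only the k smallest distances per row so the full NxN matrix is never held in memory (the array size, chunk length and k are illustrative stand-ins):
import numpy as np
from scipy.spatial import distance

all_array = np.random.randint(0, 2, size=(10000, 64))  # stand-in for the 1M binary vectors
k = 5
chunk = 256

knn_idx = np.empty((len(all_array), k), dtype=np.int64)
for start in range(0, len(all_array), chunk):
    block = all_array[start:start + chunk]
    dists = distance.cdist(block, all_array, 'hamming')  # (chunk, N) float matrix
    # Each point's own index (distance 0) will be among the k smallest returned here.
    knn_idx[start:start + chunk] = np.argpartition(dists, k, axis=1)[:, :k]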
If you want to compute the k-nearest neighbors, it may not be necessary to compute all n^2 pairs of distances. Instead, you can use a Kd tree or a ball tree (both are data structures for efficiently querying relations between a set of points).
SciPy has a package called scipy.spatial.kdtree; it does not, however, currently support the Hamming distance as a metric between points. The wonderful folks at scikit-learn (aka sklearn), on the other hand, do have an implementation of ball tree with Hamming distance supported. Here's a small example using sklearn's ball tree.
from sklearn.neighbors import BallTree
import numpy as np
# Generate random binary data.
data = np.random.randint(0, 2, size=(10, 10))
# Build the ball tree.
ballt = BallTree(data, leaf_size=30, metric='hamming')
distances, neighbors = ballt.query(data, k=3)
print(neighbors)  # Row n has the nth vector's k closest neighbors.
print(distances)  # Same idea, but the Hamming distances to those neighbors.
Now for the big caveat. For high dimensional vectors, KDTree and BallTree become comparable to the brute force algorithm. I'm a bit unclear on the nature of your vectors, but hopefully the above snippet gives you some ideas/direction.
Is it possible to use something like 1 - cosine similarity with scikit learn's KNeighborsClassifier?
This answer says no, but in the documentation for KNeighborsClassifier it says the metrics mentioned in DistanceMetrics are available. Distance metrics don't include an explicit cosine distance, probably because it's not really a distance, but supposedly it's possible to pass a function as the metric. I tried passing the scikit-learn linear kernel to KNeighborsClassifier, but it gives me an error saying that the function needs two arrays as arguments. Has anyone else tried this?
The cosine similarity is generally defined as x^T y / (||x|| * ||y||), and it outputs 1 if the two vectors are the same and goes to -1 if they are completely different. This definition is not technically a metric, so you can't use accelerating structures like ball and kd trees with it. If you force scikit-learn to use the brute force approach, you should be able to use it as a distance if you pass it your own custom distance metric object. There are methods of transforming the cosine similarity into a valid distance metric if you would like to use ball trees (you can find one in the JSAT library).
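For example, here is a sketch of that brute-force route with a hand-rolled cosine-distance callable; the data and names are illustrative, not from the question:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def cosine_distance(a, b):
    # 1 - cosine similarity between two 1-D vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

knn = KNeighborsClassifier(n_neighbors=5, algorithm='brute', metric=cosine_distance)
knn.fit(X, y)
print(knn.predict(X[:3]))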
Notice, though, that x^T y / (||x|| * ||y||) = (x/||x||)^T (y/||y||). The Euclidean distance can equivalently be written as sqrt(x^T x + y^T y - 2 x^T y). If we normalize every datapoint before giving it to the KNeighborsClassifier, then x^T x = 1 for all x, so the Euclidean distance degrades to sqrt(2 - 2 x^T y). For identical inputs we get sqrt(2 - 2*1) = 0, and for complete opposites sqrt(2 - 2*(-1)) = 2. Since sqrt(2 - 2 x^T y) is a monotonically decreasing function of x^T y, normalizing your data and then using the Euclidean distance gives you the same ordering as the cosine distance. As long as you use the uniform weights option, the results will be identical to having used a correct cosine distance.
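A sketch of that normalize-then-Euclidean equivalence on random data (uniform weights, as stated above):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)

# L2-normalize every row; plain Euclidean KNN then ranks neighbours
# exactly as the cosine distance would.
X_norm = normalize(X)
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_norm, y)
print(knn.predict(normalize(X[:3])))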
The KNN family of class constructors has a parameter called metric; you can use it to switch between the different distance metrics you want to use in the nearest-neighbour model.
A list of available distance metrics can be found here
If you want to use the cosine metric for ranking and classification problems, you can use the L2 (Euclidean) distance on normalized feature vectors, which gives you the same ranking/classification results (predictions made by argmax or argmin operations).
To get the distortion function (sum of distances of each point to its center) when doing K-means clustering with Scikit-Learn, one simple way is just to get the centers (k_means.cluster_centers_) and sum up the distances for each point.
Just wondering if there is a faster way (in terms of programmer time)? Something like a direct function call, perhaps.
This is already pre-computed at fit time in the inertia_ attribute for the KMeans class.
>>> from sklearn.datasets import load_iris
>>> from sklearn.cluster import KMeans
>>> iris = load_iris()
>>> km = KMeans(3).fit(iris.data)
>>> km.inertia_
78.940841426146108
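For the record, this matches the manual sum-of-squared-distances computation from the question; a quick sketch on the same iris data (the random_state and n_init values are just illustrative):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Sum of squared distances from each point to its assigned center.
manual = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(manual, km.inertia_)  # the two values agree up to floating-point noise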
Depending on the definition of the distortion measure, it can be either
Sum of the square of the distance
of each example to its nearest cluster center.
OR
Average of the euclidean squared distance from the centroid of the respective clusters.
For the latter case, you can visit
Can distortion be derived from inertia rather than recalculating it from scratch in case of kmeans?
The inertia_ attribute in KMeans is defined in official docs as
Sum of squared distances of samples to their closest cluster center,
weighted by the sample weights if provided.
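If the average-distortion definition is the one you need, a minimal sketch is simply to divide inertia_ by the number of samples rather than recomputing anything (reusing the iris setup from above for illustration):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Average squared distance of the samples to their closest cluster center.
avg_distortion = km.inertia_ / X.shape[0]
print(avg_distortion)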