I am trying to cluster similar-looking bigrams using DBSCAN (sklearn) with Levenshtein distance as the distance metric. I need to cluster together similar-looking words (spelling errors) like the following:
Sundar Residency
Sndar Residency
Sundhar Residency
My code:
import editdistance
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Distance metric: look up the actual strings by index and compare them
def lev_metric(x, y):
    i, j = int(x[0]), int(y[0])
    return editdistance.eval(b1[i], b1[j])

# Loading data
b1 = debug_data['b1']
b1 = b1.tolist()
X1 = np.arange(len(b1)).reshape(-1, 1)

# Defining DBSCAN parameters and clustering
db = DBSCAN(eps=2, min_samples=2, metric=lev_metric)
predictions = db.fit_predict(X1)

# Printing results
tmp = pd.DataFrame({'b1': b1, 'cluster_id': predictions})
tmp.sort_values(by=['cluster_id'], ascending=True, inplace=True)
print(tmp)
The results are mixed. With eps = 2, which I assumed to be the maximum Levenshtein distance between any two points in a cluster, all points end up in one cluster. With eps = 1 the clustering is better, but even then the Levenshtein distance between some points in the same cluster is greater than 1.
Results with eps == 2
Results with eps == 1
Can anyone explain what is happening here?
Thank you for your help.
DBSCAN effectively uses the transitive closure of the epsilon-neighbourhood relation (density-reachability).
Therefore, the distance between points in one cluster can be much larger than epsilon. It only guarantees that there exists a chain a, b, c, d, e, f, ..., x such that every step is at most epsilon; there is no limit on the number of steps.
You could try Leader clustering (a minimal sketch follows below), but I would rather not use clustering at all and instead treat this as a simpler similarity-search problem: you want to find similar objects, not complex structured groups.
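For illustration, a minimal sketch of the Leader idea, reusing the editdistance package from the question (the helper name and the eps=2 threshold are just assumptions for this example):

import editdistance

def leader_cluster(words, eps=2):
    # Greedy Leader clustering: each word joins the first leader within
    # eps Levenshtein distance, otherwise it becomes a new leader.
    # Every member is within eps of its leader, so any two members of
    # the same group differ by at most 2 * eps.
    leaders, groups = [], []
    for w in words:
        for idx, lead in enumerate(leaders):
            if editdistance.eval(w, lead) <= eps:
                groups[idx].append(w)
                break
        else:
            leaders.append(w)
            groups.append([w])
    return groups

print(leader_cluster(["Sundar Residency", "Sndar Residency", "Sundhar Residency"], eps=2))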
Related
I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).
The issue:
eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;
but
sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1 and I want DBSCAN to consider two points to be neighbours if the distance between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.
I see two possible solutions:
pass a range of values to the eps parameter of DBSCAN e.g. eps=[0.75,1]
Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity
I do not know how to implement either of these.
Any guidance would be appreciated!
DBSCAN has a metric keyword argument. Docstring:
metric : string, or callable
The metric to use when calculating distance between instances in a
feature array. If metric is a string or callable, it must be one of
the options allowed by metrics.pairwise.calculate_distance for its
metric parameter.
If metric is "precomputed", X is assumed to be a distance matrix and
must be square. X may be a sparse matrix, in which case only "nonzero"
elements may be considered neighbors for DBSCAN.
So probably the easiest thing to do is to precompute the cosine similarities, turn them into a matrix that fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is your cosine similarity matrix), then set metric to 'precomputed' and pass the precomputed distance matrix D in for X, i.e. the data.
For example:
#!/usr/bin/env python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)
cosine_distance = cosine_similarity(points)  # note: these are similarities in [-1, 1]

# option 1) vectors are close to each other if they are parallel (or anti-parallel)
bespoke_distance = np.abs(np.abs(cosine_distance) - 1)

# option 2) vectors are close to each other if they point in the same direction
# (this overwrites option 1; keep whichever matches your criterion)
bespoke_distance = np.abs(cosine_distance - 1)

results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)
A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.
B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).
C) you supposedly can even pass the negated cosine similarity as a precomputed distance matrix and use -0.75 as eps.
D) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise, then use DBSCAN with eps=0.5 (a sketch follows below). It is trivial to show that distance < eps if and only if similarity > threshold.
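A minimal sketch of option D, assuming a similarity threshold of 0.75 and toy random vectors:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

points = np.random.rand(100, 3)              # toy data standing in for your vectors
sim = cosine_similarity(points)

threshold = 0.75
# distance 0 where similarity exceeds the threshold, 1 everywhere else
binary_dist = np.where(sim > threshold, 0.0, 1.0)

labels = DBSCAN(metric='precomputed', eps=0.5, min_samples=2).fit_predict(binary_dist)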
A few options:
dist = np.abs(cos_sim - 1) accepted answer here
dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178
I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit this snag too). As I understand it, option 2 is the most mathematically correct approach, since it preserves angular distance. The three conversions are collected in the sketch below.
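For convenience, the three conversions could be collected in one helper (a sketch; the function name and arguments are made up for this example, and cos_sim is assumed to hold values in [-1, 1]):

import numpy as np

def to_distance(cos_sim, how="angular"):
    # Convert a cosine-similarity matrix (values in [-1, 1]) into a distance matrix.
    if how == "abs":        # option 1: the accepted answer above
        return np.abs(cos_sim - 1)
    if how == "angular":    # option 2: angular distance, rescaled to [0, 1]
        return np.arccos(np.clip(cos_sim, -1, 1)) / np.pi
    if how == "linear":     # option 3: similarity 1 -> distance 0, similarity -1 -> distance 1
        return 1 - (cos_sim + 1) / 2
    raise ValueError(how)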
I am faced with the following array:
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
What I would like to do is extract the cluster with the highest scores. That would be
best_cluster = [200,297,275,243]
I have checked quite a few questions on Stack Overflow on this topic and most of them recommend using k-means, although a few mention that k-means might be overkill for clustering 1D arrays.
However, k-means is a supervised learning algorithm, so I would have to pass in the number of centroids. As I need to generalize this problem to other arrays, I cannot pass the number of centroids for each one of them. Therefore I am looking at implementing some sort of unsupervised learning algorithm that would be able to figure out the clusters by itself and select the highest one.
In the array y I would see 3 clusters, like so: [1,2,4,7,9,5,4,7,9], [56,57,54,60], [200,297,275,243].
What algorithm would best fit my needs, considering computation cost and accuracy and how could I implement it for my problem?
Try MeanShift. From the sklearn user guide for MeanShift:
The algorithm automatically sets the number of clusters, ...
Modified demo code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
# #############################################################################
# Generate sample data
X = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
X = np.reshape(X, (-1, 1))
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using
# bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=100)
ms = MeanShift(bandwidth=None, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
print(labels)
Output:
number of estimated clusters : 2
[0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
Note that MeanShift is not scalable with the number of samples. The recommended upper limit is 10,000.
BTW, as rahlf23 already mentioned, k-means is an unsupervised learning algorithm. The fact that you have to specify the number of clusters does not make it supervised.
See also:
Overview of clustering methods
Choosing the right estimator
Clustering is overkill here
Just compute the differences of subsequent elements. I.e. look at x[i]-x[i-1].
Choose the k largest differences as split points, or define a threshold on when to split, e.g. 20, depending on your knowledge of the data (a sketch follows below).
This is O(n), much faster than all the others mentioned. Also very understandable and predictable.
On one dimensional ordered data, any method that doesn't use the order will be slower than necessary.
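A minimal sketch of the k-largest-gaps variant on the array from the question (k = 2 is assumed here because three groups are expected; with the threshold variant you would instead split wherever the gap exceeds a fixed value):

import numpy as np

y = np.array([1, 2, 4, 7, 9, 5, 4, 7, 9, 56, 57, 54, 60, 200, 297, 275, 243])

y_sorted = np.sort(y)                        # order matters, so sort first
gaps = np.diff(y_sorted)                     # differences of subsequent elements

k = 2                                        # k largest gaps -> k + 1 groups
split_points = np.sort(np.argsort(gaps)[-k:]) + 1
clusters = np.split(y_sorted, split_points)

best_cluster = clusters[-1]                  # group containing the largest values
print(best_cluster)                          # [200 243 275 297]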
HDBSCAN is the best clustering algorithm and you should always use it.
Basically all you need to do is provide a reasonable min_cluster_size, a valid distance metric and you're good to go.
For min_cluster_size I suggest using 3 since a cluster of 2 is lame and for metric the default euclidean works great so you don't even need to mention it.
Don't forget that distance metrics apply to vectors and here we have scalars so some ugly reshaping is in order.
To put it all together and assuming by "cluster with the highest scores" you mean the cluster that includes the max value we get:
from hdbscan import HDBSCAN
import numpy as np
y = [1,2,4,7,9,5,4,7,9,56,57,54,60,200,297,275,243]
y = np.reshape(y, (-1, 1))
clusterer = HDBSCAN(min_cluster_size=3)
cluster_labels = clusterer.fit_predict(y)
best_cluster = clusterer.exemplars_[cluster_labels[y.argmax()]].ravel()
print(best_cluster)
The output is [297 200 275 243]. Original order is not preserved. C'est la vie.
I have a 288x288 correlation matrix of fairly typical structure, defined by:
from sklearn.cluster import AgglomerativeClustering
df = read_returns()
correl_matrix = df.corr()
where read_returns gives me a dataframe with a date index, and columns of the returns of assets.
Now - I want to cluster these correlations to reduce the population size.
By doing some reading and experimenting I discovered AgglomerativeClustering - and it appears at first pass to be an appropriate solution to my problem.
I define a distance metric as ((.5*(1-correl_matrix))**.5) and have:
cluster = AgglomerativeClustering(n_clusters=40, linkage='average')
cluster.fit(((.5*(1-correl_matrix))**.5).values)
label_groups = cluster.labels_
To observe some of the data and cross-check my work, I pick out cluster 1, look at the pairwise correlations, and find the minimum correlation between two items within that group:
single_cluster = []
for i in range(0, correl_matrix.shape[0]):
    if label_groups[i] == 1:
        single_cluster.append(correl_matrix.index[i])

# find the minimum pairwise correlation within the cluster
min_correl = 1.0
for x in single_cluster:
    for y in single_cluster:
        if x != y:
            if correl_matrix[x][y] < min_correl:
                min_correl = correl_matrix[x][y]
print(min_correl)
and get a min pairwise correlation of .20
To me this seems quite low - but "low based off what?" is a fair question to which I have no answer.
I would like to anticipate/enforce each pairwise correlation of a cluster to be >=.7 or something like this.
Is this possible in AgglomerativeClustering?
Am I accidentally going down the wrong path?
Hierarchical clustering supports different "linkage" strategies.
single-link: merges clusters based on the minimum distance between any two of their points
complete-link: merges clusters based on the maximum distance between any two of their points
...
If you want a high minimum correlation = small maximum distance, this calls for complete linkage.
You may want to treat negative correlations as "good", too.
i.e. use dist = 1 - abs(corr).
Make sure to use the dendrogram. If you have outliers in your data, you want to cut into (n_clusters + n_outliers) partitions. A minimal sketch with complete linkage on a precomputed distance matrix follows below.
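One way this could look with scikit-learn, as a sketch on toy data (distance_threshold=0.3 corresponds to enforcing |corr| >= 0.7 within a cluster; note that the metric= keyword is called affinity in older scikit-learn releases):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
returns = rng.normal(size=(100, 20))       # toy stand-in for the asset-return columns
correl_matrix = np.corrcoef(returns, rowvar=False)

dist = 1 - np.abs(correl_matrix)           # strong negative correlation also counts as "close"

cluster = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=0.3,                # complete linkage stops merging at distance 0.3,
                                           # so within-cluster |corr| stays above 0.7
    metric='precomputed',                  # named `affinity` in older scikit-learn releases
    linkage='complete',
)
label_groups = cluster.fit_predict(dist)
print(np.unique(label_groups).size, "clusters")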
I am working with vectors of word frequencies and trying out some of the different distance measures available in scikit-learn's pairwise distances. I would like to use these distances for clustering and classification.
I usually have a feature matrix of ~ 30,000 x 100. My idea was to choose a distance metric that maximizes the pairwise distances, by computing pairwise distances over the same dataset with the distance metrics available in SciPy (e.g. Euclidean, Cityblock, etc.) and, for each metric:
convert distances computed for the dataset to zscores to normalize across metrics
get the range of these zscores, i.e. the spread of the distances
use the distance metric that gives me the widest range of distances as it apparently gives me the maximum spread over my dataset and the most variance to work with. (Cf. code below)
My questions:
Does this approach make sense?
Are there other evaluation procedures that one should try? I found these papers (Gavin, Aggarwal), but they don't apply 100% here...
Any help is much appreciated!
My code:
import numpy as np
import scipy.stats.mstats
import sklearn.metrics.pairwise

matrix = np.random.uniform(0, .1, size=(10, 300))  # test data set
scipy_distances = ['euclidean', 'minkowski', ...]  # these are the distance metrics

for d in scipy_distances:  # iterate over distances
    distmatrix = sklearn.metrics.pairwise.pairwise_distances(matrix, metric=d)
    distzscores = scipy.stats.mstats.zscore(distmatrix, axis=0, ddof=1)
    diststats = basicstatsmaker(distzscores)
    range = np.ptp(distzscores, axis=0)
    print("range of metric", d, np.ptp(range))
In general, this is just a heuristic which might or might not work. In particular, it is easy to construct a "dummy metric" which will "win" in your approach even though it is useless. Try out:
class Dummy_dist:
    def __init__(self):
        self.cheat = True

    def __call__(self, x, y):
        if self.cheat:
            self.cheat = False
            return 1e60
        else:
            return 0

dummy_dist = Dummy_dist()
This will give you a huge spread (even with z-score normalization). Of course this is a cheating example, since it is non-deterministic, but I wanted to show a basic counterexample; given your data, one could construct a deterministic analogue.
So what should you do? Your metric should be treated as a hyperparameter of your process. You should not divide the process of generating your clustering/classification into two separate phases (choosing a distance, then learning something); you should do this jointly and consider each clustering/classification + distance pair as a single model, so instead of working with k-means you work with k-means+euclidean, k-means+minkowski, and so on (a sketch of such a joint evaluation follows below). This is the only statistically supported approach. You cannot construct a method for assessing the "general goodness" of a metric, as there is no such thing: metric quality can only be assessed in a particular task, which involves fixing every other element (the clustering/classification method, the particular dataset, etc.). Once you perform such a wide, exhaustive evaluation, checking many such pairs on many datasets, you might claim that a given metric performs best in that range of tasks.
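For a classification task, that joint evaluation might look like the sketch below (synthetic data and an arbitrary choice of k-NN as the learner; the point is only that the metric is scored by task performance, not by the spread heuristic):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)  # synthetic stand-in data

# Evaluate each (model, metric) pair on the actual task instead of
# scoring the metric in isolation.
for metric in ['euclidean', 'manhattan', 'chebyshev', 'cosine']:
    clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    scores = cross_val_score(clf, X, y, cv=5)
    print(metric, round(scores.mean(), 3))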
By my understanding of DBSCAN, it's possible to specify an epsilon of, say, 100 meters and, because DBSCAN takes into account density-reachability rather than direct density-reachability when finding clusters, end up with a cluster in which the maximum distance between any two points is > 100 meters. In a more extreme case, it seems possible that you could set an epsilon of 100 meters and end up with a cluster spanning 1 kilometer:
see [2][6] in this array of images from scikit learn for an example of when that might occur. (I'm more than willing to be told I'm a total idiot and am misunderstanding DBSCAN if that's what's happening here.)
Is there an algorithm that is density-based like DBSCAN but takes into account some kind of thresholding for the maximum distance between any two points in a cluster?
DBSCAN indeed does not impose a total size constraint on the cluster.
The epsilon value is best interpreted as the size of the gap separating two clusters (a gap that may contain at most minpts-1 objects).
I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. That structure can be simple (such as the partitions produced by k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical or density-based clustering).
You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
However, I also have the impression that you aren't really sure on what you need and why.
A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.
Direct density-reachability, however, doesn't define a useful (partitioning) structure. It simply does not partition the data into disjoint partitions.
If you don't need this kind of strong structure (i.e. you don't do clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the structure is not guaranteed to be optimal in any way but will highly depend on the ordering of your data. If you then preprocess it appropriately, it can still be useful. Unless you are in a distributed setting, canopy preclustering is at least as expensive as a full DBSCAN run. Due to the loose requirements (in particular, "clusters" may overlap and objects are expected to belong to multiple "clusters"), it is easier to parallelize (a rough sketch of the canopy idea follows below).
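For reference, a rough sketch of the canopy idea (the function, the thresholds and the euclidean distance are assumptions for illustration; real implementations differ in details such as how centres are picked):

import numpy as np

def canopy(points, t1, t2):
    # Canopy pre-clustering sketch: t1 > t2 are the loose and tight thresholds.
    # Canopies may overlap; points within t2 of a canopy centre are removed from
    # the candidate list, points between t2 and t1 stay candidates and may end
    # up in several canopies.
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        centre = remaining.pop(0)
        members, still_remaining = [centre], []
        for idx in remaining:
            d = np.linalg.norm(points[centre] - points[idx])
            if d < t1:
                members.append(idx)
            if d >= t2:
                still_remaining.append(idx)
        remaining = still_remaining
        canopies.append(members)
    return canopies

print(canopy(np.random.rand(50, 2), t1=0.4, t2=0.2))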
Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance between any two objects (a SciPy sketch follows below). The only problem is that hierarchical clustering usually is O(n^3), i.e. it doesn't scale to large data sets. DBSCAN runs in O(n log n) in good implementations (with index support).
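A minimal SciPy sketch of that dendrogram cut, on toy planar coordinates (for real lat/long data you would precompute haversine distances instead of using pdist's euclidean default):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.random.rand(200, 2) * 1000      # toy planar coordinates, in metres

max_distance = 100                          # desired maximum distance within a cluster
Z = linkage(pdist(points), method='complete')
labels = fcluster(Z, t=max_distance, criterion='distance')
print(len(np.unique(labels)), "clusters")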
I had the same problem and ended up solving it by using DBSCAN in combination with KMeans clustering: first I use DBSCAN to identify high-density clusters and remove outliers, then I take any cluster larger than 250 miles (in my case) and break it apart. Here's the code:
from sklearn.cluster import DBSCAN, KMeans
import mpu

# Step 1: DBSCAN to identify high-density clusters and remove outliers
clustering = DBSCAN(eps=0.3, min_samples=100).fit(load_geocodes[['lat', 'long']])
load_geocodes.loc[:, 'cluster'] = clustering.labels_

def calculate_cluster_size(lat, long):
    # approximate cluster size as the haversine distance (in miles)
    # between the corners of the bounding box
    left_top = (max(lat), min(long))
    right_bottom = (min(lat), max(long))
    distance = mpu.haversine_distance(left_top, right_bottom) * 0.621371
    return distance

# Step 2: break any cluster larger than 250 miles into KMeans sub-clusters
for c, df in load_geocodes.groupby('cluster'):
    if c == -1:
        continue  # don't do this for outliers
    distance = calculate_cluster_size(df['lat'], df['long'])
    print(distance)
    if distance > 250:
        # split into more and more clusters until the largest sub-cluster
        # is less than 250 miles across
        max_distance = distance
        i = 2
        while max_distance > 250:
            kmeans = KMeans(n_clusters=i, random_state=0).fit(df[['lat', 'long']])
            df.loc[:, 'cl_temp'] = kmeans.labels_
            max_temp_cl_size = 0
            for temp_cl, temp_cl_df in df.groupby('cl_temp'):
                temp_cl_size = calculate_cluster_size(temp_cl_df['lat'], temp_cl_df['long'])
                if temp_cl_size > max_temp_cl_size:
                    max_temp_cl_size = temp_cl_size
            i += 1
            max_distance = max_temp_cl_size
        load_geocodes.loc[df.index, 'subcluster'] = kmeans.labels_