Alternative to scipy.cluster.hierarchy.cut_tree() - python

I was running an agglomerative hierarchical clustering experiment in Python 3 and found that scipy.cluster.hierarchy.cut_tree() does not return the requested number of clusters for some input linkage matrices. So by now I know there is a bug in the cut_tree() function (as described here).
However, I need to be able to get a flat clustering with an assignment of k different labels to my datapoints. Do you know the algorithm to get a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing from scratch with no bugs?
You can test your code with this dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist

## Load dataset
X = np.load("dataset.npy")
## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')
print(is_valid_linkage(Z))
## Now let's say we want the flat cluster assignment with 10 clusters.
# If cut_tree() were working, we would do:
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
Side note: an alternative approach might be to use rpy2's cutree() as a substitute for scipy's cut_tree(), but I have never used it. What do you think?

One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':
from scipy.cluster.hierarchy import fcluster

k = 10  # the desired number of flat clusters
clust = fcluster(Z, k, criterion='maxclust')
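If you specifically want to reproduce what cut_tree() computes from scratch, here is a minimal sketch of one way to do it (my own, not scipy's implementation): replay the merges recorded in the linkage matrix, in row order, until only k clusters remain. Row i of Z merges clusters Z[i, 0] and Z[i, 1] into a new cluster with id n + i, so stopping after n - k merges leaves exactly k clusters:
import numpy as np

def cut_linkage(Z, k):
    """Replay the merges recorded in Z until only k clusters remain,
    then label each original observation by the cluster it ends up in."""
    n = Z.shape[0] + 1                      # number of original observations
    members = {i: [i] for i in range(n)}    # cluster id -> member observations
    for i in range(n - k):                  # row i creates cluster id n + i
        a, b = int(Z[i, 0]), int(Z[i, 1])
        members[n + i] = members.pop(a) + members.pop(b)
    labels = np.empty(n, dtype=int)
    for label, obs in enumerate(members.values()):
        labels[obs] = label
    return labels

labels = cut_linkage(Z, 10)
print(np.unique(labels).size)               # should print 10
The label numbers are arbitrary integers 0..k-1 and will generally not match cut_tree()'s numbering; only the grouping matters.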

Related

Python code for automatic execution of the Elbow curve method in K-modes clustering

I have the code for a manual (and therefore possibly error-prone) Elbow-method selection of the optimal number of clusters when K-modes clustering a binary DataFrame:
import numpy as np
import matplotlib.pyplot as plt
from kmodes.kmodes import KModes

cost = []
for num_clusters in range(1, 10):
    kmode = KModes(n_clusters=num_clusters, init="Huang", n_init=10)
    kmode.fit_predict(newdf_matrix)
    cost.append(kmode.cost_)

y = np.arange(1, 10)
plt.plot(y, cost)
The outcome of the for loop is a plot of the so-called elbow curve. I know this curve helps me choose an optimal K, but I do not want to do that by eye; I am looking for a computational way. I want the computer to do the job without me determining it "manually"; otherwise the whole script stops executing at that point.
What code would select K automatically and replace my manual selection?
Thank you.
Use the silhouette coefficient (note: this will not work if the data points are represented as categorical values rather than N-dimensional points).
The silhouette coefficient gives a measure of how similar a data point is to its own cluster compared to other clusters; see the sklearn docs here.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
So calculate silhouette_score for different values of k and use the one with the best score (closest to 1).
Sample using the digits dataset:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import silhouette_score

data, labels = load_digits(return_X_y=True)

silhouette_avg = []
for num_clusters in range(2, 20):
    kmeans = KMeans(n_clusters=num_clusters, init="k-means++", n_init=10)
    kmeans.fit_predict(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_avg.append(score)

plt.plot(np.arange(2, 20), silhouette_avg, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Silhouette score')
plt.title('Silhouette analysis for optimal k')
_ = plt.xticks(np.arange(2, 20))
print(f"Best K: {np.argmax(silhouette_avg) + 2}")
output:
Best K: 9
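If you would rather automate the elbow pick on the K-modes cost curve itself (instead of switching to the silhouette score), one option is the third-party kneed package. This is an addition to the answer above, not part of it, and assumes the cost list built in the question's loop:
from kneed import KneeLocator

ks = list(range(1, 10))
# `cost` is the list of kmode.cost_ values built in the question's loop
knee = KneeLocator(ks, cost, curve="convex", direction="decreasing")
print(f"Elbow at K = {knee.elbow}")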

Tag based co-occurrence image clustering

I labeled lots of object images using the Google Vision API. Using those labels (list in pickle here), I created a label co-occurrence matrix (download as a numpy array here). The matrix is 2195x2195.
Loading the data:
import pickle
import numpy as np

with open('labels.pkl', 'rb') as f:
    labels = pickle.load(f)
cooccurrence = np.load('cooccurrence.npy')
I would like to use cluster analysis to define a reasonable number of clusters (each defined as a list of Vision labels) that would represent some objects (e.g. cars, shoes, books, ...). I do not know what the right number of clusters is.
I tried the hierarchical clustering algorithm available in scikit-learn:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_colwidth', 1000)
# creating a non-symmetrical "similarity" matrix:
occurrences = cooccurrence.diagonal().copy()
similarities = cooccurrence / occurrences[:,None]
#clustering:
from sklearn.cluster import AgglomerativeClustering
clusters = AgglomerativeClustering(n_clusters=200, affinity='euclidean', linkage='ward').fit_predict(similarities)
#results in pandas:
df_clusters = pd.DataFrame({'cluster': clusters.tolist(), 'label': labels})
df_clusters_grouped = df_clusters.groupby(['cluster']).agg({'label': [len, list]})
df_clusters_grouped.columns = [' '.join(col).strip() for col in df_clusters_grouped.columns.values]
df_clusters_grouped.rename(columns = {'label len': 'cluster_size', 'label list': 'cluster_labels'}, inplace=True)
df_clusters_grouped.sort_values(by=['cluster_size'], ascending=False)
This way, I was able to create 200 clusters, one of which can look like this:
["Racket", "Racquet sport", "Tennis racket", "Rackets", "Tennis", "Racketlon", "Tennis racket accessory", "Strings"]
This somehow works, but I would rather use a soft clustering method that can assign one label to multiple clusters (for instance, "leather" might make sense for both shoes and wallets). Also, I had to define the number of clusters (200 in my example code), which is something I would rather get as a result (if possible).
I also played with hdbscan, k-clique and Gaussian mixture models, but I did not come up with any better output.
Clustering methods such as sklearn's AgglomerativeClustering require a data matrix as input. With metric="precomputed" you can also use a distance matrix (this does not work for k-means or Gaussian mixture modelling, which need coordinate data).
You, however, have a co-occurrence or similarity matrix. These values have the opposite meaning, so you will have to identify an appropriate transformation (for example, occurrences minus co-occurrences). Treating the co-occurrence matrix as a data matrix (and then using Euclidean distance, which is what you do) works to some extent, but has very weird semantics and is not recommended.
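For concreteness, here is one hedged sketch of that kind of transformation (my own suggestion, not a drop-in replacement for the asker's pipeline): symmetrise the similarity matrix, map similarities to distances, and feed the precomputed distance matrix to AgglomerativeClustering, which then cannot use 'ward' linkage:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Symmetrise and rescale the similarity matrix, then invert it so that
# "very similar" becomes distance 0 and "never co-occurs" becomes the maximum.
sim = (similarities + similarities.T) / 2
dist = 1.0 - sim / sim.max()
np.fill_diagonal(dist, 0.0)

# 'ward' needs coordinate data, so use e.g. 'average' linkage on the
# precomputed distances (older sklearn versions call this parameter `affinity`).
clusters = AgglomerativeClustering(
    n_clusters=200, metric='precomputed', linkage='average'
).fit_predict(dist)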

k-means using signature matrix generated from minhash

I have used MinHash on documents and their shingles to generate a signature matrix for these documents. I have verified that the signature matrices are good: comparing Jaccard distances of known similar documents (say, two articles about the same sports team or two articles about the same world event) gives correct readings.
My question is: does it make sense to use this signature matrix to perform k-means clustering?
I've tried using the signature vectors of documents and calculating the Euclidean distance of these vectors inside the iterative k-means algorithm, and I always get nonsense for my clusters. I know there should be two clusters (my data set is a few thousand articles about either sports or business), and in the end my two clusters are always just random. I'm convinced that the randomness of hashing words into integers skews the distance function every time and overpowers similar hash values in two signature matrices.
[Edited to highlight the question]
TL;DR
Short answer: No, it doesn't make sense to use the signature matrix for K-means clustering. At least, not without significant manipulation.
Some explanations
I'm coming at this after a few days of figuring out how to do the same thing (text clustering) myself. I might be wrong, but my perception is that you're making the same mistake I was: using MinHash to build an [n_samples x n_perms] matrix, then using this as a features matrix X on which you run k-means.
I'm guessing you're doing something like:
# THIS CODE IS AN EXAMPLE OF WHAT'S WRONG! DON'T IMPLEMENT IT!
import numpy as np
from datasketch import MinHash   # assuming the datasketch MinHash implementation
from nltk.util import ngrams
from sklearn.cluster import KMeans

# Get your data.
data = get_your_list_of_strings_to_cluster()
n_samples = len(data)

# MinHash all the strings
n_perms = 128
minhash_values = np.zeros((n_samples, n_perms), dtype='uint64')
for index, string in enumerate(data):
    minhash = MinHash(num_perm=n_perms)
    for gram in ngrams(string, 3):
        minhash.update("".join(gram).encode('utf-8'))
    minhash_values[index, :] = minhash.hashvalues

# Compute clusters
clusterer = KMeans(n_clusters=8)
clusters = clusterer.fit_predict(minhash_values)
This will behave horribly because of a fatal flaw: the minhash_values array is not a feature matrix. Each row is basically a list of features (hashes) which appear in that sample of text... but they're not column-aligned, so features are scattered into the wrong dimensions.
To turn that into a feature matrix, you'd have to look at all the unique hashes in minhash_values and then create a matrix of shape [n_samples x n_unique_hashes] (where n_unique_hashes is the number of unique features found), setting an entry to 1 where the text sample contains that feature and 0 elsewhere. Typically this matrix would be large and sparse. You could then cluster on that.
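A rough sketch of that conversion, assuming the minhash_values array from the (flawed) snippet above; this is illustrative only, since the result is huge and sparse:
import numpy as np
from scipy.sparse import csr_matrix

unique_hashes = np.unique(minhash_values)              # every distinct hash value seen
col_of = {h: j for j, h in enumerate(unique_hashes)}   # hash value -> column index

rows, cols = [], []
for i, signature in enumerate(minhash_values):
    for h in set(signature):
        rows.append(i)
        cols.append(col_of[h])

# Binary [n_samples x n_unique_hashes] matrix: entry (i, j) is 1 if sample i's
# signature contains hash j, 0 otherwise.
X = csr_matrix((np.ones(len(rows)), (rows, cols)),
               shape=(minhash_values.shape[0], len(unique_hashes)))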
Alternative way of text clustering
What an unbelievable hassle, though! Fortunately, scikit-learn is there to help: it provides some very easy-to-use and scalable vectorisers such as HashingVectorizer and TfidfVectorizer.
So your problem becomes easily solved:
# Imports
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import KMeans
# Get your data
data = get_your_list_of_strings_to_cluster()
# Get your feature matrix
text_features = HashingVectorizer(analyzer="word").fit_transform(data)
# Compute clusters
clusterer = KMeans(n_clusters=2)
clusters = clusterer.fit_predict(text_features)
And there you go. From there:
Fine-tune your vectoriser (try TfidfVectorizer too, tweak the input params, etc.).
Try other clusterers; for example, I find HDBSCAN miles better than k-means (quicker, more robust, more accurate, less tuning). See the sketch below.
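As a hedged illustration of that last point, assuming the third-party hdbscan package and the same hypothetical get_your_list_of_strings_to_cluster() helper as above:
import hdbscan
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

data = get_your_list_of_strings_to_cluster()
tfidf = TfidfVectorizer().fit_transform(data)

# Reduce the sparse TF-IDF matrix to a dense, low-dimensional representation,
# then let HDBSCAN pick the number of clusters itself (label -1 means "noise").
reduced = TruncatedSVD(n_components=50).fit_transform(tfidf)
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
clusters = clusterer.fit_predict(reduced)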
Hope this helps.
Tom

Specify max distance in agglomerative clustering (scikit learn)

When using a clustering algorithm, you always have to specify a shutoff parameter.
I am currently using AgglomerativeClustering with scikit-learn, and the only shutoff parameter that I can see is the number of clusters.
agg_clust = AgglomerativeClustering(n_clusters=N)
y_pred = agg_clust.fit_predict(matrix)
But I would like to find an algorithm where you specify the maximum distance between elements of a cluster, and not the number of clusters.
The algorithm would then simply agglomerate clusters until the maximum distance is reached.
Any suggestions?
What you are looking for is implemented in scipy.cluster.hierarchy, see here.
So here is how you can do it:
from scipy.cluster.hierarchy import linkage, fcluster

# t is the maximum cophenetic distance you allow within a cluster
y_pred = fcluster(linkage(matrix), t, criterion='distance')

# or, more directly:
from scipy.cluster.hierarchy import fclusterdata
y_pred = fclusterdata(matrix, t, criterion='distance')
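If you would rather stay within scikit-learn, newer versions of AgglomerativeClustering accept a distance_threshold directly (n_clusters must then be None). A minimal sketch using the same matrix and threshold t as above:
from sklearn.cluster import AgglomerativeClustering

# Merging stops once all remaining clusters are farther apart than t.
agg_clust = AgglomerativeClustering(n_clusters=None, distance_threshold=t)
y_pred = agg_clust.fit_predict(matrix)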

Built-in function for plotting the Bayes decision boundary given the probability function

Is there a function in Python that plots the Bayes decision boundary if we pass a probability function to it? I know there is one in MATLAB, but I'm searching for a function in Python. I know that one way to achieve this is to iterate over the points, but I am searching for a built-in function.
I have bivariate sample points on the axis, and I want to plot the decision boundary in order to classify them.
Going off Chris's guess in the comments above, I'm assuming you want to cluster points according to a Gaussian mixture model, a reasonable method assuming the underlying distribution is a linear combination of Gaussian-distributed samples. Below I've shown an example using numpy to create a sample data set, sklearn for its GM modelling, and matplotlib to show the results.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture

# Create some sample data
def G(mu, cov, pts):
    return np.random.multivariate_normal(mu, cov, pts)

# Three multivariate Gaussians with the means and covariances listed below
# (note: not all of these covariances are symmetric PSD, so numpy may warn)
MU = [[5, 3], [0, 0], [-2, 3]]
COV = [[[4, 2], [0, 1]], [[1, 0], [0, 1]], [[1, 2], [2, 1]]]
A = [G(mu, cov, 500) for mu, cov in zip(MU, COV)]
PTS = np.concatenate(A)  # Join them together

# Fit a Gaussian mixture model (GMM was renamed GaussianMixture in newer sklearn)
g = mixture.GaussianMixture(n_components=len(A))
g.fit(PTS)

# Returns an index list of which cluster each point belongs to
C = g.predict(PTS)

# Plot the original points
X, Y = PTS[:, 0], PTS[:, 1]
plt.subplot(211)
plt.scatter(X, Y)

# Plot the points again, coloured according to the predicted cluster
plt.subplot(212)
color_mask = ['k', 'b', 'g']
for n in range(len(A)):
    idx = (C == n)
    plt.scatter(X[idx], Y[idx], color=color_mask[n])
plt.show()
See the sklearn.mixture example page for more detailed information on the classification methods.
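Since the question asks specifically about drawing the decision boundary, here is a minimal sketch (my addition, reusing the fitted model g, the data PTS and the predictions C from above) that shades the decision regions by evaluating the model on a grid:
# Evaluate the fitted mixture on a dense grid and draw its decision regions.
xx, yy = np.meshgrid(
    np.linspace(PTS[:, 0].min() - 1, PTS[:, 0].max() + 1, 300),
    np.linspace(PTS[:, 1].min() - 1, PTS[:, 1].max() + 1, 300),
)
grid_labels = g.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.figure()
plt.contourf(xx, yy, grid_labels, alpha=0.2)
plt.scatter(PTS[:, 0], PTS[:, 1], c=C, s=5)
plt.show()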
