I want to pass my own distance matrix (row linkages) to seaborn clustermap.
There are already some posts on this like
Use Distance Matrix in scipy.cluster.hierarchy.linkage()?
But they all point to
scipy hierarchy linkage
which takes the clustering metric and method as arguments.
scipy.cluster.hierarchy.linkage(y, method='single',
metric='euclidean', optimal_ordering=False)
The input y may be either a 1d condensed distance matrix or a 2d array
of observation vectors
What I don't get is this:
My distance matrix is already based on a certain metric and method, so why would I want to recalculate this in scipy's hierarchy linkage?
Is there an option where it purely uses my distances and creates the linkages?
For posterity, here is a complete recipe for doing this, since @WarrenWeckesser in the comments and @SibbsGambling in the linked answer leave out some details.
Suppose distMatrix is your matrix of distances (don't have to be Euclidean), with entry in row i and column j representing the distance between the ith and jth objects. Then:
# import packages
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
import seaborn as sns
# define distance array as in linked answer
distArray = ssd.squareform(distMatrix)
# define linkage object
distLinkage = hierarchy.linkage(distArray)
# make clustermap
sns.clustermap(distMatrix, row_linkage=distLinkage, col_linkage=distLinkage)
Note that when creating the clustermap, you still have to reference the original matrix. If you want to use a different clustering method, such as method='ward', include that option when defining distLinkage.
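For concreteness, here is a minimal self-contained sketch of the same recipe with a small made-up distance matrix (the values and the method='average' choice are arbitrary, purely for illustration):
import numpy as np
from scipy.cluster import hierarchy
import scipy.spatial.distance as ssd
import seaborn as sns
# toy symmetric distance matrix with a zero diagonal (made-up values)
distMatrix = np.array([[0.0, 2.0, 6.0, 10.0],
                       [2.0, 0.0, 5.0, 9.0],
                       [6.0, 5.0, 0.0, 4.0],
                       [10.0, 9.0, 4.0, 0.0]])
# condensed 1-D form expected by linkage()
distArray = ssd.squareform(distMatrix)
# pass any linkage method here, e.g. method='ward'
distLinkage = hierarchy.linkage(distArray, method='average')
sns.clustermap(distMatrix, row_linkage=distLinkage, col_linkage=distLinkage)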
Related
I have carried out some clustering analysis on some data X and have arrived at both the labels y and the centroids c. Now, I'm trying to calculate the distance between X and their assigned cluster's centroid c. This is easy when we have a small number of points:
import numpy as np
# 10 random points in 3D space
X = np.random.rand(10,3)
# define the number of clusters, say 3
clusters = 3
# give each point a random label
# (in the real code this is found using KMeans, for example)
y = np.asarray([np.random.randint(0,clusters) for i in range(10)]).reshape(-1,1)
# randomly assign location of centroids
# (in the real code this is found using KMeans, for example)
c = np.random.rand(clusters,3)
# calculate distances
distances = []
for i in range(len(X)):
    distances.append(np.linalg.norm(X[i] - c[y[i][0]]))
Unfortunately, the actual data has many more rows. Is there a way to vectorise this somehow (instead of using a for loop)? I can't seem to get my head around the mapping.
Thanks to numpy's array indexing, you can actually turn your for loop into a one-liner and avoid explicit looping altogether:
distances = np.linalg.norm(X - np.einsum('ijk->ik', c[y]), axis=1)
will do the same thing as your original for loop.
EDIT: Thanks @Kris, I forgot the axis keyword; since I didn't specify it, numpy computed the norm of the entire flattened matrix rather than along each row (axis 1). I've updated it now, and it should return an array of distances, one per point. Also, einsum was suggested by @Kris for their specific application.
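For what it's worth, an equivalent and arguably more direct spelling of the same idea (my own rephrasing, not part of the original answer) is to flatten the label column before indexing into the centroids, which avoids the einsum step:
# c[y.ravel()] looks up each point's assigned centroid, giving an (n, 3) array
distances = np.linalg.norm(X - c[y.ravel()], axis=1)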
Reading around, I find it is possible to pass a precomputed distance matrix into SKLearn DBSCAN. Unfortunately, I don't know how to pass it for calculation.
Say I have a 1D array with 100 elements, with just the names of the nodes. Then I have a 2D matrix, 100x100 with the distance between each element (in the same order).
I know I have to call it:
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
That sets a neighborhood distance between nodes of 2 and a minimum of 5 nodes per cluster, and "precomputed" tells it to use the 2D matrix. But how do I actually pass the matrix in for the calculation?
The same question could apply if using RAPIDS CUML DBScan function (GPU accelerated).
documentation:
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean',
metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)
[...]
[...]
metric : str or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If
metric is a string or callable, it must be one of the options allowed by
sklearn.metrics.pairwise_distances for its metric parameter. If metric is
“precomputed”, X is assumed to be a distance matrix and must be square. X may be a
sparse graph, in which case only “nonzero” elements may be considered neighbors for
DBSCAN.
[...]
So, the way you normally call this is:
from sklearn.cluster import DBSCAN
clustering = DBSCAN()
clustering.fit(X)
If you have a precomputed distance matrix, you do:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(metric='precomputed')
clustering.fit(distance_matrix)
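To make the "how do I pass it" part concrete, here is a minimal sketch under the question's setup (100 named nodes and a 100x100 distance matrix; the names, coordinates, and the way the matrix is built are made up for illustration):
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
names = ["node_%d" % i for i in range(100)]   # 1D array of node names
coords = np.random.rand(100, 3)               # stand-in for whatever produced your distances
distance_matrix = pairwise_distances(coords)  # 100x100 symmetric matrix
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
db.fit(distance_matrix)                       # the matrix itself is what goes into fit()
# labels_[i] is the cluster of names[i]; -1 marks noise points
for name, label in zip(names[:5], db.labels_[:5]):
    print(name, label)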
I was doing an agglomerative hierarchical clustering experiment in Python 3 and I found scipy.cluster.hierarchy.cut_tree() is not returning the requested number of clusters for some input linkage matrices. So, by now I know there is a bug in the cut_tree() function (as described here).
However, I need to be able to get a flat clustering with an assignment of k different labels to my datapoints. Do you know the algorithm to get a flat clustering with k labels from an arbitrary input linkage matrix Z? My question boils down to: how can I compute what cut_tree() is computing from scratch with no bugs?
You can test your code with this dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage, is_valid_linkage
from scipy.spatial.distance import pdist
## Load dataset
X = np.load("dataset.npy")
## Hierarchical clustering
dists = pdist(X)
Z = linkage(dists, method='centroid', metric='euclidean')
print(is_valid_linkage(Z))
## Now let's say we want the flat cluster assignment with 10 clusters.
# If cut_tree() was working we would do
from scipy.cluster.hierarchy import cut_tree
cut = cut_tree(Z, 10)
Sidenote: An alternative approach could maybe be using rpy2's cutree() as a substitute for scipy's cut_tree(), but I never used it. What do you think?
One way to obtain k flat clusters is to use scipy.cluster.hierarchy.fcluster with criterion='maxclust':
from scipy.cluster.hierarchy import fcluster
clust = fcluster(Z, k, criterion='maxclust')
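As a quick sanity check (reusing Z from the snippet above and assuming k = 10 as in the question), the result is one label per observation, and the number of distinct labels should be k, or fewer if the tree cannot be split into exactly k clusters:
import numpy as np
k = 10
clust = fcluster(Z, k, criterion='maxclust')
print(len(clust), np.unique(clust).size)  # number of points, number of flat clusters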
I have a matrix that contains N users and K items. I want to plot that matrix in Python by treating each row as a vector with multiple coordinates. For example, a simple point plot requires X,Y. My vectors have K coordinates and I want to plot each one of those N vectors as a point to see their similarities. Can anyone help me with that?
UPDATE:
import matplotlib.pyplot as plt
# Matrix M has shape (944, 1683)
plt.figure()
plt.imshow(M, interpolation='nearest', cmap=plt.cm.ocean)
plt.colorbar()
plt.show()
but this gave me a heatmap of the whole matrix as the result. What I want is something more like a scatter plot where each of the N row vectors appears as a single point.
It is difficult from this question to be sure if my answer is relevant, but here's my best guess. I believe deltascience is asking how multidimensional vectors are generally plotted into two-dimensional space, as would be the case with a scatter plot. I think the best answer is that some kind of dimension reduction algorithm is generally performed. In other words, you don't do this by finding the right matplotlib code; you get your data into the right shape (one list for the X axis, and another list for the Y axis) and you then plot it using a typical matplotlib approach:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
M = np.random.rand(944, 1683)
pca = PCA(n_components=2)
reduced = pca.fit_transform(M)
# We need a 2 x 944 array, not 944 by 2 (all X coordinates in one list)
t = reduced.transpose()
plt.scatter(t[0], t[1])
plt.show()
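If you want a rough sense of how faithful the 2-D picture is (a small follow-up using the pca object above), the explained variance ratio tells you how much of the original variance the two plotted components retain:
# fraction of total variance captured by each plotted component, and their sum
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())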
Here are some relevant links:
https://stats.stackexchange.com/questions/63589/how-to-project-high-dimensional-space-into-a-two-dimensional-plane
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57
https://www.evl.uic.edu/documents/etemadpour_choosingvisualization_springer2016.pdf
July 2019 Addendum: It didn't occur to me at the time, but another way people often visualize multi-dimensional data is with network visualization. Each multi-dimensional array in this context would be a node, and the edge weight would be something like the cosine similarity of two nodes, or the Euclidean distance. NetworkX in Python has some really nice visualization options.
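Here is a minimal sketch of that network idea with made-up data; the 0.8 similarity threshold and the default layout are arbitrary illustrative choices:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity
M = np.random.rand(20, 50)   # 20 "nodes", each a 50-dimensional vector
sim = cosine_similarity(M)   # pairwise cosine similarities
G = nx.Graph()
G.add_nodes_from(range(len(M)))
threshold = 0.8              # arbitrary cutoff for drawing an edge
for i in range(len(M)):
    for j in range(i + 1, len(M)):
        if sim[i, j] > threshold:
            G.add_edge(i, j, weight=sim[i, j])
nx.draw(G, with_labels=True)
plt.show()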
Is there a function in Python that plots the Bayes decision boundary if we give it a function? I know there is one in MATLAB, but I'm searching for one in Python. I know that one way to achieve this is to iterate over the points, but I am searching for a built-in function.
I have bivariate sample points on the axis, and I want to plot the decision boundary in order to classify them.
Going off the guess of Chris in the comments above, I'm assuming you want to cluster points according to a Gaussian mixture model - a reasonable approach if the underlying distribution is a linear combination of Gaussian-distributed samples. Below I've shown an example using numpy to create a sample data set, sklearn for its mixture modeling, and pylab to show the results.
import numpy as np
from pylab import *
from sklearn import mixture
# Create some sample data
def G(mu, cov, pts):
    return np.random.multivariate_normal(mu, cov, pts)
# Three multivariate Gaussians with the means and (symmetric, positive semi-definite)
# covariances listed below
MU = [[5,3], [0,0], [-2,3]]
COV = [[[4,2],[2,2]], [[1,0],[0,1]], [[2,1],[1,2]]]
A = [G(mu, cov, 500) for mu, cov in zip(MU, COV)]
PTS = np.concatenate(A) # Join them together
# Use a Gaussian mixture model to fit
# (GaussianMixture is the current name; the old mixture.GMM class was removed from sklearn)
g = mixture.GaussianMixture(n_components=len(A))
g.fit(PTS)
# Returns an index list of which cluster each point belongs to
C = g.predict(PTS)
# Plot the original points
X, Y = PTS[:, 0], PTS[:, 1]
subplot(211)
scatter(X, Y)
# Plot the points again, colored according to the predicted cluster
subplot(212)
color_mask = ['k','b','g']
for n in range(len(A)):
    idx = (C == n)
    scatter(X[idx], Y[idx], color=color_mask[n])
show()
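Since the question specifically asks about the boundary itself, one common follow-up (a sketch reusing the fitted g, PTS, X, Y and C from above) is to evaluate predict on a dense grid and shade the resulting class regions; the borders between the shaded regions approximate the decision boundary:
xx, yy = np.meshgrid(np.linspace(PTS[:, 0].min(), PTS[:, 0].max(), 300),
                     np.linspace(PTS[:, 1].min(), PTS[:, 1].max(), 300))
grid_labels = g.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
figure()
contourf(xx, yy, grid_labels, alpha=0.3)  # shaded class regions
scatter(X, Y, c=C, s=10)                  # data points colored by cluster
show()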
See the sklearn.mixture example page for more detailed information on the classification methods.