Clustering algorithms with distance matrix as input - python

I need to cluster the graphs of countries around the world to find similarity.
The graphs are about covid-19 cases during the pandemic.
I need a clustering method that take distance matrix as input.

You should make your question more precise. If possible, try to include a reproducible example, with a small distance matrix to test.
Anyway, You can use : https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html with affinity="precomputed"
from sklearn.cluster import AgglomerativeClustering
import numpy as np
X = np.array([[0, 2, 3], [2, 0, 3], [3, 3, 0]])
clustering = AgglomerativeClustering(affinity="precomputed").fit(X)
clustering.labels_

Related

Scipy and the hierarchical clustering input

When performing hierarchical clustering with scipy, it is said in the docs here that scipy.cluster.hierarchy.linkage takes 1-D condensed distance matrix or a 2-D array of observation vectors as input. However, I generated a simple (symmetric) similarity matrix with pandas Dataframe and scipy took that as input with no problem at all, and the resulting dendrogram is just fine.
Can someone explain, how is this possible? Do I have outdated docs or...?
The docs are accurate, they just don't tell you what will happen if you actually try to use an uncondensed distance matrix.
The function raises a warning but still runs because it first tries to convert input into a numpy array. This creates a 2-D array from your 2-D DataFrame while at the same time recognizing that this likely isn't the expected input based on the array dimensions and symmetry.
Depending on the complexity (e.g. cluster separation, number of clusters, distribution of data across clusters) of your input data, the clustering may still look like it succeeds in generating a suitable dendrogram, as you noted. This makes sense conceptually because the result is a clustering of n- similarity vectors which may be well-separated in simple cases.
For example, here is some synthetic data with 150 observations and 2 clusters:
import pandas as pd
from scipy.spatial.distance import cosine, pdist, squareform
np.random.seed(42) # for repeatability
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
obs_df = pd.DataFrame(np.concatenate((a, b),), columns=['x', 'y'])
obs_df.plot.scatter(x='x', y='y')
Z = linkage(obs_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
If you generate a similarity matrix, this is an n x n matrix that could still be clustered as if it were n vectors. I can't plot 150-D vectors, but plotting the magnitude of each vector and then the dendrogram seems to confirm a similar clustering.
def similarity_func(u, v):
return 1-cosine(u, v)
dists = pdist(obs_df, similarity_func)
sim_df = pd.DataFrame(squareform(dists), columns=obs_df.index, index=obs_df.index)
sim_array = np.asarray(sim_df)
sim_lst = []
for vec in sim_array:
mag = np.linalg.norm(vec,ord=1)
sim_lst.append(mag)
pd.Series(sim_lst).plot.bar()
Z = linkage(sim_df, 'ward')
fig = plt.figure(figsize=(8, 4))
dn = dendrogram(Z)
What we're really clustering here is a vector whose components are similarity measures of each of the 150 points. We're clustering a collection of each point's intra- and inter-cluster similarity measures. Since the two clusters are different sizes, a point in one cluster will have a rather different collection of intra- and inter-cluster similarities relative to a point in the other cluster. Hence, we get two primary clusters that are proportionate to the number of points in each cluster just as we did in the first step.

Can somebody tell me the name of the algorithm if it exists or tell me how to find it

Here is the idea:
There is a huge 2D dataset (250,000 datapoints).
I need to get rid of 90% of the datapoint without hurting the data structure. Which means (i believe) to get rid of the closest ones. Density must decrease...
Considering we need to keep the structure - we can't just randomly delete 90% as this might cause bias. There may be a little element of random in this but no too much.
I can put the data in 2D matrix and divide into cells. Some cells then will have more datapoints and some will have less and some will have none.
I need the algorithm that will group those datapoints or the cells in my matrix into segments which will all have relatively close nummer of datapoints in it. Those segments or cells in "new" matrix can be different size(which i believe is the point in this algorithm).
I've drawn a picture. It is not accurate but I hope it will make idea a bit clearer.
Also I code in python :^)
Thank you!!
the algorithm you are searching is a unsupervised learning method, the most famous one is kmeans on python.
You can find the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
Here is a code example for an array:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
If you have to adjust it for a dataframe (df), it looks like this:
from sklearn.cluster import KMeans
X = df[['column A',..., 'column D']]
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
the output labels are your clusters.

Simple 2-D Clustering Algorithm in Python

Being new to unsupervised methods I'm in need of a push in the right direction with some semi-simple code to run through some data as a case study. The data I'm working on only has 300 or so observations but I'm wanting to learn how I can apply clustering to very large sets as well that behave similarly.
I have a 2 feature set of data and I'd like to run an DBSCAN or something similar using Euclidean distances (if this is the correct clustering approach).
As an example the data looks like this:
I can just tell by eye that clustering this way might not be the best method as the distribution looks irregular.
What method should I use to begin understanding similar distributions like these - especially when the set is very large (100s of thousands of observations).
For most machine learning tasks, scikit-learn is your friend here. For DBSCAN, scikit-learn has sklearn.cluster.DBSCAN. From the scikit-learn docs:
>>> from sklearn.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
... [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering.labels_
array([ 0, 0, 0, 1, 1, -1])
>>> clustering
DBSCAN(algorithm='auto', eps=3, leaf_size=30, metric='euclidean',
metric_params=None, min_samples=2, n_jobs=None, p=None)
You also have other clustering algorithms available to you through scikit-learn. You can see all of them here.

PCA: Get Top 20 Most Important Dimensions

I'm doing a bit of machine learning and trying to find important dimensions using PCA. Here's what I've done so far:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.98)
X_reduced = pca.fit_transform(df_normalized)
X_reduced.shape
(2208, 1961)
So I have 2,208 rows consisting of 1,961 columns after running PCA that explains 98% of the variance in my dataset. However, I'm worried that the dimensions with the least explanatory power may actually be hurting my attempt at prediction (my model may just find spurious correlations in the data).
Does SciKit-Learn order the columns by importance? If so, I could just do:
X_final = X_reduced[:, :20], correct?
Thanks for the help!
From the documentation it says the output is sorted by explained variance. So yes, you should be able to do what you suggest and just take the first N dimensions the output. You could also print the output variable explained_variance_ (or even explained_variance_ratio_) along with the components_ output to double check the order.
Example from the documentation shows how to access the explained variance amounts:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
so in your case you could do print(X_reduced.components_) and print(X_reduced.explained_variance_ratio_) to get both. Then simply take the first N that you want from X_reduced.components_ after finding what N explains y% of your variance.
Be aware! In your suggested solution you mix up the dimensions. X_reduced.components_ is of the shape [n_components, n_features] so for instance if you want the first 20 components you should use X_reduced.components[:20, :] I believe.

Python k-means algorithm

I am looking for Python implementation of k-means algorithm with examples to cluster and cache my database of coordinates.
Update: (Eleven years after this original answer, it's probably time for an update.)
First off, are you sure you want k-means? This page gives an excellent graphical summary of some different clustering algorithms. I'd suggest that beyond the graphic, look especially at the parameters that each method requires and decide whether you can provide the required parameter (eg, k-means requires the number of clusters, but maybe you don't know that before you start clustering).
Here are some resources:
sklearn k-means and sklearn other clustering algorithms
scipy k-means and scipy k-means2
Old answer:
Scipy's clustering implementations work well, and they include a k-means implementation.
There's also scipy-cluster, which does agglomerative clustering; ths has the advantage that you don't need to decide on the number of clusters ahead of time.
SciPy's kmeans2() has some numerical problems: others have reported error messages such as "Matrix is not positive definite - Cholesky decomposition cannot be computed" in version 0.6.0, and I just encountered the same in version 0.7.1.
For now, I would recommend using PyCluster instead. Example usage:
>>> import numpy
>>> import Pycluster
>>> points = numpy.vstack([numpy.random.multivariate_normal(mean,
0.03 * numpy.diag([1,1]),
20)
for mean in [(1, 1), (2, 4), (3, 2)]])
>>> labels, error, nfound = Pycluster.kcluster(points, 3)
>>> labels # Cluster number for each point
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
>>> error # The within-cluster sum of distances for the solution
1.7721661785401261
>>> nfound # Number of times this solution was found
1
For continuous data, k-means is very easy.
You need a list of your means, and for each data point, find the mean its closest to and average the new data point to it. your means will represent the recent salient clusters of points in the input data.
I do the averaging continuously, so there is no need to have the old data to obtain the new average. Given the old average k,the next data point x, and a constant n which is the number of past data points to keep the average of, the new average is
k*(1-(1/n)) + n*(1/n)
Here is the full code in Python
from __future__ import division
from random import random
# init means and data to random values
# use real data in your code
means = [random() for i in range(10)]
data = [random() for i in range(1000)]
param = 0.01 # bigger numbers make the means change faster
# must be between 0 and 1
for x in data:
closest_k = 0;
smallest_error = 9999; # this should really be positive infinity
for k in enumerate(means):
error = abs(x-k[1])
if error < smallest_error:
smallest_error = error
closest_k = k[0]
means[closest_k] = means[closest_k]*(1-param) + x*(param)
you could just print the means when all the data has passed through, but its much more fun to watch it change in real time. I used this on frequency envelopes of 20ms bits of sound and after talking to it for a minute or two, it had consistent categories for the short 'a' vowel, the long 'o' vowel, and the 's' consonant. wierd!
(Years later) this kmeans.py under is-it-possible-to-specify-your-own-distance-function-using-scikits-learn-k-means is straightforward and reasonably fast; it uses any of the 20-odd metrics in scipy.spatial.distance.
From wikipedia, you could use scipy, K-means clustering an vector quantization
Or, you could use a Python wrapper for OpenCV, ctypes-opencv.
Or you could OpenCV's new Python interface, and their kmeans implementation.
SciKit Learn's KMeans() is the simplest way to apply k-means clustering in Python. Fitting clusters is simple as:
kmeans = KMeans(n_clusters=2, random_state=0).fit(X).
This code snippet shows how to store centroid coordinates and predict clusters for an array of coordinates.
>>> from sklearn.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
... [4, 2], [4, 4], [4, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> kmeans.predict([[0, 0], [4, 4]])
array([0, 1], dtype=int32)
>>> kmeans.cluster_centers_
array([[ 1., 2.],
[ 4., 2.]])
(courtesy of SciKit Learn's documentation, linked above)
You can also use GDAL, which has many many functions to work with spatial data.
Python's Pycluster and pyplot can be used for k-means clustering and for visualization of 2D data. A recent blog post Stock Price/Volume Analysis Using Python and PyCluster gives an example of clustering using PyCluster on stock data.

Categories