Find distance between centroid and points in a single feature dataframe - KMeans - python

I'm working on an anomaly detection task using KMeans.
The Pandas dataframe I'm using has a single feature and looks like the following:
df = array([[12534.],
[12014.],
[12158.],
[11935.],
...,
[ 5120.],
[ 4828.],
[ 4443.]])
I'm able to fit and to predict values with the following instructions:
km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)
In order to identify anomalies, I would like to calculate the distance between the centroid and each single point, but with a single-feature dataframe I'm not sure that this is the correct approach.
I found examples that use Euclidean distance to calculate the distance. Here is one of them:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx) ** 2 + (y - cy) ** 2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})
This code doesn't work for me, because in my case the centroids look like the following, since I have a single-feature dataframe:
array([[11899.90692187],
[ 5406.54143126]])
In this case, what is the correct approach to find the distance between the centroid and the points? Is it possible?
Thank you, and sorry for the trivial question; I'm still learning.

There's scipy.spatial.distance_matrix you can make use of:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# set up a set of 2D points
np.random.seed(2)
df = np.random.uniform(0, 1, (100, 2))
# make it a dataframe
df = pd.DataFrame(df)
# clustering with 3 clusters
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)
# get centroids
centroids = km.cluster_centers_
# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:, 0], centroids[:, 1], c=range(centroids.shape[0]), s=1000)
This produces a scatter plot of the points coloured by cluster assignment, with the three centroids drawn as large markers.
Now the distance matrix:
from scipy.spatial import distance_matrix
dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))
You can confirm that this is correct by
dist_mat.idxmin(axis=1) == preds
And finally, the mean distance to centroids:
dist_mat.groupby(preds).mean()
gives:
          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556
where the columns denote the centroid index and the rows denote the clusters, so each cell is the mean distance from the points of one cluster to one centroid.

You can use scipy.spatial.distance.cdist to create a distance matrix:
from scipy.spatial.distance import cdist
dm = cdist(df, centroids)
This should give you a 2-D array in which each row represents an observation in your original dataset and each column represents a centroid: the entry in the x-th row and y-th column is the distance between your x-th observation and your y-th cluster centroid. cdist uses Euclidean distance by default, but you can use other metrics (not that it matters much for a dataset with only one feature).
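To connect this back to the single-feature case in the question, here is a minimal sketch of how you could flag anomalies as points that lie far from their assigned centroid. The data is a shortened version of the array from the question, and the 2-sigma cutoff is purely an illustrative choice, not a recommendation:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# single-feature data, as in the question (shortened)
df = np.array([[12534.], [12014.], [12158.], [11935.], [5120.], [4828.], [4443.]])

km = KMeans(n_clusters=2, n_init=10)
labels = km.fit_predict(df)

dm = cdist(df, km.cluster_centers_)            # (n_points, n_clusters) distances
dist_to_own = dm[np.arange(len(df)), labels]   # distance of each point to its own centroid

# flag points unusually far from their centroid; the 2-sigma cut is illustrative
threshold = dist_to_own.mean() + 2 * dist_to_own.std()
anomalies = df[dist_to_own > threshold]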

Related

Trying to code the nearest neighbours algorithm - euclidean distance function only calculates the distances for one row of the test set - why?

I am trying to code the Nearest Neighbours Algorithm from scratch and have come across a problem - my algorithm was only giving the index/classification of the nearest neighbour for one row/point of the test set. I went through every part of my code and realised that the problem is my Euclidean distance function. It only gives the result for one row.
This is the code I have written for Euclidean distance:
def euclidean_dist(r1, r2):
    dist = 0
    for j in range(0, len(r2)-1):
        dist = dist + (r2[j] - r1[j])**2
    return dist**0.5
Within my Nearest Neighbours algorithm this is the implementation of the Euclidean distance function:
for i in range(len(x_test)):
    dist1 = []
    dist2 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
        dist2.append(distances)
    dist1 = np.array(dist1)
    sorting(dist1)  # separate sorting function to sort the distances from lowest to highest
# the aim was to get one array, dist1, with the euclidean distances for each row sorted,
# and one array with the unsorted euclidean distances, dist2 (to be able to search for the index later in the code)
I noticed the problem when using the iris dataset and trying out this part of the function with it. I split the data set into testing and training (X_test, X_train and y_test).
When this was implemented with the data set I got the following array for dist2:
[0.3741657386773946,
1.643167672515499,
3.389690251335658,
2.085665361461421,
1.284523257866513,
3.9572717874818752,
0.9539392014169458,
3.5805027579936315,
0.7211102550927979,
...
0.8062257748298555,
0.4242640687119287,
0.5196152422706631]
Its length is 112 which is the same length as X_train, but these are only the Euclidean distances for the first row or point of the X_test set. The dist1 array is the same except it is sorted.
Why am I not getting the Euclidean distances for every row/point of the test set? I thought I iterated through correctly with the for loops, but clearly something is not quite right. Any advice or help would be appreciated.
The underlying issue is that dist1 and dist2 are re-initialized on every pass through the outer loop and never collected anywhere, so after the loop you are left with the distances for just one test row. Using NumPy for speed, its built-in distance function, and brevity:
import numpy as np

x_test_array = np.array(x_test)
x_train_array = np.array(x_train)
distance_matrix = np.linalg.norm(x_test_array[:, np.newaxis, :] - x_train_array[np.newaxis, :, :], axis=2)
Cell i,j in the matrix corresponds to the distance between x_test[i] and x_train[j].
You can then do sorting.
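For example, a minimal sketch of the sorting step with np.argsort (the value of k is an illustrative choice):
k = 3
# for each test point, the indices of its k nearest training points
nearest_idx = np.argsort(distance_matrix, axis=1)[:, :k]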
Edit: How to create the distance matrix without numpy:
matrix = []
for i in range(len(x_test)):
    dist1 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
    matrix.append(dist1)

Value at KMeans.cluster_centers_ in sklearn KMeans

After doing a k-means fit on some vectors with 3 clusters, I was able to get the labels for the input data.
KMeans.cluster_centers_ returns the coordinates of the centers, so shouldn't there be some vector corresponding to each of them? How can I find the value at the centroid of these clusters?
closest, _ = pairwise_distances_argmin_min(KMeans.cluster_centers_, X)
The array closest will contain the index of the point in X that is closest to each centroid.
Let's say closest gave the output array([0, 8, 5]) for the three clusters. Then X[0] is the closest point in X to centroid 0, X[8] is the closest to centroid 1, and so on.
Source: https://codedump.io/share/XiME3OAGY5Tm/1/get-nearest-point-to-centroid-scikit-learn
The cluster centre value is the value of the centroid. At the end of k-means clustering, you'll have three individual clusters and three centroids, with each centroid being located at the centre of each cluster. The centroid doesn't necessarily have to coincide with an existing data point.
Sharda neglected to import the metrics module from scikit-learn, see below.
from sklearn.metrics import pairwise_distances_argmin_min
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
or
import sklearn.metrics
closest, _ = sklearn.metrics.pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
Assuming X is the input data and kmeans has been fit to that data, both options give you an array, closest, for which each element is the index of the closest element in X to that centroid. Thus, closest[0] is the index of the data closest to the first centroid and X[closest[0]] is that data.
To answer your first question, k-means clustering randomly selects a point in the plane for each centroid and then adjusts them all to be the best representatives of the data. The centroids will not necessarily end up coinciding with any of the original data. This contrasts with the Affinity Propagation Clustering algorithm which picks an exemplar data point as the representative for each cluster, not just a point in the same plane.
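Putting the pieces together, a minimal end-to-end sketch (the toy data and the cluster count are illustrative assumptions):
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.rand(50, 4)  # toy data: 50 observations, 4 features
kmeans = KMeans(n_clusters=3, n_init=10).fit(X)

closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
print(closest)        # index of the observation nearest to each centroid
print(X[closest[0]])  # the data point closest to the first centroid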

Efficient distance for clustering computation

I want to compute the distance from a set of N 3D points to a set of M 3D centers and store the results in an NxM matrix (where column i holds the distances from all points to center i).
Example:
data = np.random.rand(100,3) # 100 toy 3D points
centers = np.random.rand(20,3) # 20 toy 3D points
For computing the distance between all points and a single center we can use broadcasting, so we avoid looping through all points:
i = 0 # first center
np.sqrt(np.sum(np.power(data - centers[i,:], 2),1)) # Euclidean distance
Now we can put this code in a loop that iterates over all centers:
distances = np.zeros((data.shape[0], centers.shape[0]))
for i in range(centers.shape[0]):
    distances[:, i] = np.sqrt(np.sum(np.power(data - centers[i, :], 2), 1))
However this is clearly an operation that could be parallelized and improved.
I'm wondering if there is a better way of doing this (maybe some multi-dimensional broadcasting or some library).
This is a very common problem for clustering and classification, where you want to get the distances from your data to a set of classes, so I think there should be some efficient implementation for this.
What's the best way of doing this?
Broadcast all the way:
import numpy as np
data = np.random.rand(100,3)
centers = np.random.rand(20,3)
distances = np.sqrt(np.sum(np.power(data[:,None,:] - centers[None,:,:], 2), axis=-1))
print(distances.shape)
# 100, 20
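For reference, the same NxM matrix can also be obtained with SciPy's cdist, which avoids materializing the intermediate (N, M, 3) array that the broadcast creates:
from scipy.spatial.distance import cdist
distances = cdist(data, centers)  # shape (100, 20), Euclidean by default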
If you just want the nearest center, and you have a lot of data points (a lot being more than a several 100 000 samples), you probably should store your data in a KD tree and query that with the centers (scipy.spatial.KDTree).
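A minimal sketch of the KD-tree route; note that building the tree on the centers and querying it with the data points directly yields each point's nearest center:
import numpy as np
from scipy.spatial import cKDTree

data = np.random.rand(500000, 3)   # many 3D points
centers = np.random.rand(20, 3)    # a few centers

tree = cKDTree(centers)
dist, idx = tree.query(data, k=1)  # per point: distance to, and index of, its nearest center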

Method for calculating irregularly spaced accumulation points

I am attempting to do the opposite of this: given a 2D image of (continuous) intensities, generate a set of irregularly spaced accumulation points, i.e., points that irregularly cover the 2D map and lie closer together in areas of high intensity (but without overlapping!).
My first try was "weighted" k-means. As I didn't find a working implementation of weighted k-means, the way I introduce the weights consists of repeating the points with high intensities. Here is my code:
import numpy as np
from math import ceil
from sklearn.cluster import KMeans

def accumulation_points_finder(x, y, data, n_points, method, cut_value):
    # computing the rms
    rms = estimate_rms(data)
    # structuring the data
    X, Y = np.meshgrid(x, y, sparse=False)
    if cut_value > 0.:
        mask = data > cut_value
        # applying the mask
        X = X[mask]; Y = Y[mask]; data = data[mask]
        _data = np.array([X, Y, data])
    else:
        X = X.ravel(); Y = Y.ravel(); data = data.ravel()
        _data = np.array([X, Y, data])
    if method == 'weighted_kmeans':
        res = []
        for i in range(len(data)):
            # weight each pixel by its intensity in units of the rms
            w = int(ceil(data[i] / rms))
            res.extend([[X[i], Y[i]]] * w)
        res = np.asarray(res)
        # kmeans object instantiation
        kmeans = KMeans(init='k-means++', n_clusters=n_points, n_init=25, n_jobs=2)
        # performing kmeans clustering
        kmeans.fit(res)
        # returning just (x,y) positions
        return kmeans.cluster_centers_
Here are two different results: 1) making use of all the data pixels, and 2) making use of only the pixels above some threshold (the RMS).
As you can see, the points seem to be more regularly spaced than concentrated in the areas of high intensity.
So my question is whether there exists a better (deterministic, if possible) method for computing such accumulation points.
Partition the data using quadtrees (https://en.wikipedia.org/wiki/Quadtree) into units of equal variance, using a defined threshold (or perhaps the concentration value could be used as well?), then keep one point per unit (the centroid). There will be more subdivisions in areas with rapidly changing values and fewer in the background areas.
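A minimal sketch of that idea, under the assumption that a cell is split while its intensity variance exceeds a threshold and that the cell centre stands in for the kept point (an intensity-weighted centroid would also work); quadtree_points, var_threshold and min_size are names made up for illustration:
import numpy as np

def quadtree_points(img, x0, y0, x1, y1, var_threshold, min_size=2):
    """Recursively subdivide img[y0:y1, x0:x1]; when a cell's variance falls
    below var_threshold (or the cell gets tiny), emit its centre as one point."""
    cell = img[y0:y1, x0:x1]
    if cell.size == 0:
        return []
    if cell.size <= min_size ** 2 or cell.var() <= var_threshold:
        return [((x0 + x1) / 2.0, (y0 + y1) / 2.0)]
    xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
    points = []
    for qx0, qy0, qx1, qy1 in [(x0, y0, xm, ym), (xm, y0, x1, ym),
                               (x0, ym, xm, y1), (xm, ym, x1, y1)]:
        points += quadtree_points(img, qx0, qy0, qx1, qy1, var_threshold, min_size)
    return points

# usage on a toy intensity map
img = np.random.rand(64, 64)
pts = quadtree_points(img, 0, 0, img.shape[1], img.shape[0], var_threshold=0.05)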

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:
from scipy.spatial import distance
from scipy.cluster import hierarchy

Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method="average", metric="euclidean")
T = hierarchy.fcluster(Z, 100, criterion="maxclust")
I am taking my matrix of features, computing the Euclidean distances between them, and then passing them to the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. The only things you need are the partition vector part with the flat cluster assignments and the original observations X:
import numpy as np

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. part[n]=c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max() + 1):
        codebook.append(X[part == i].mean(0))
    return np.vstack(codebook)
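For example, with the variables from the question:
centroids = to_codebook(features, T)  # a k x n array, one centroid of length n per flat cluster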
You can do something like this (D = number of dimensions):
import numpy as np

# Sum the vectors in each cluster
lens = {}       # will contain the number of observations in each cluster
centroids = {}  # will contain the centroid of each cluster
for idx, clno in enumerate(T):
    centroids.setdefault(clno, np.zeros(D))
    centroids[clno] += features[idx, :]
    lens.setdefault(clno, 0)
    lens[clno] += 1
# Divide by the number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
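If you'd rather have a k x D array than a dictionary, a one-line follow-up that orders the rows by cluster number:
centroid_matrix = np.vstack([centroids[clno] for clno in sorted(centroids)])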
