Define k-1 cluster centroids -- SKlearn KMeans - python

I am performing a binary classification of a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.
From sklearn KMeans documentation:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
I would like to pass an ndarray, but I only have 1 reliable centroid, not 2.
Is there a way to maximize the entropy between the first K-1 centroids and the Kth? Alternatively, is there a way to manually initialize K-1 centroids and use k-means++ for the remaining one?
=======================================================
Related questions:
This seeks to define K centroids with n-1 features. (I want to define k-1 centroids with n features).
Here is a description of what I want, but it was interpreted as a bug by one of the developers, and is "easily implement[able]"

I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):
import numpy as np
import pandas as pd

def distance(p1, p2):
    # squared Euclidean distance between two points
    return np.sum((p1 - p2)**2)

def find_remaining_centroid(data, known_centroids, k = 1):
    '''
    finds the remaining centroids by farthest-point selection,
    starting from one or more known centroids
    inputs:
        data - Numpy array containing the feature space
        known_centroids - Numpy array containing the location of one or multiple known centroids
        k - remaining centroids to be found
    '''
    # Perform casting if necessary
    if isinstance(data, pd.DataFrame):
        data = np.array(data)
    n_points = data.shape[0]
    # Initialize centroids list
    known_centroids = np.asarray(known_centroids)
    if known_centroids.ndim > 1:
        centroids = [cent for cent in known_centroids]
    else:
        centroids = [known_centroids]
    # Compute the k remaining centroids. The known centroids already
    # seed the search, so no random starting point is added.
    for c_id in range(k):
        ## initialize an array to store distances of data
        ## points from nearest centroid
        dist = np.empty(n_points)
        for i in range(n_points):
            point = data[i, :]
            d = np.inf
            ## compute distance of 'point' from each of the previously
            ## selected centroids and store the minimum distance
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist[i] = d
        ## select data point with maximum distance as our next centroid
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
    return centroids[-k:]
Its usage:
# For finding a third centroid:
third_centroid = find_remaining_centroid(X_train, np.array([presence_seed, absence_seed]), k = 1)
# For finding the second centroid:
second_centroid = find_remaining_centroid(X_train, presence_seed, k = 1)
Where presence_seed and absence_seed are known centroid locations.
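The same farthest-point idea can be sketched without the Python loops. This is only an illustration: farthest_point_init is a hypothetical name, the toy data is made up, and it assumes squared Euclidean distance as in the function above.

```python
import numpy as np

def farthest_point_init(data, known_centroids, k=1):
    """Return the known centroids followed by k data points picked by
    farthest-point selection, as one (n_known + k, n_features) array."""
    data = np.asarray(data, dtype=float)
    centroids = np.atleast_2d(np.asarray(known_centroids, dtype=float))
    # squared distance from every point to its nearest known centroid
    diffs = data[:, None, :] - centroids[None, :, :]
    min_dist = (diffs ** 2).sum(axis=2).min(axis=1)
    for _ in range(k):
        new = data[np.argmax(min_dist)]
        centroids = np.vstack([centroids, new])
        # a point's nearest centroid may now be the one just added
        min_dist = np.minimum(min_dist, ((data - new) ** 2).sum(axis=1))
    return centroids

# toy data (made up for illustration): two well-separated blobs
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(5.0, 0.5, (50, 2))])
presence_seed_demo = np.array([0.0, 0.0])

init = farthest_point_init(X_demo, presence_seed_demo, k=1)
print(init.shape)  # (2, 2)
```

The stacked array has the (n_clusters, n_features) shape that the documentation quoted in the question asks for, so it could be passed as init=init to KMeans (with n_init=1, since the starting centers are fixed).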

Related

Intra-cluster for custom k-means

I'm stuck trying to implement and plot in Python the intra-cluster variance of each cluster in k-means, in order to find the best number of clusters k. It is represented using this formula:
This is the sum of the squared distances of the data points belonging to a given cluster from its centroid, normalized by the size of the cluster Ck.
Then we can compute the intra-cluster variance for all clusters by adding up the individual cluster variances using this formula:
Can I get help implementing Wk and W?
The custom k-means implementation:
import numpy as np
import pandas as pd

def kmeans(X, k):
    iterations = 0
    data = pd.DataFrame(X)
    cluster = np.zeros(X.shape[0])
    # taking random samples from the datapoints as an initialization of centroids
    centroids = data.sample(n=k).values
    while True:
        # for each observation
        for i, row in enumerate(X):
            mn_dist = float('inf')
            # distance of the point from all centroids
            for idx, centroid in enumerate(centroids):
                # calculating euclidean distance (first two features only)
                d = np.sqrt((centroid[0]-row[0])**2 + (centroid[1]-row[1])**2)
                # assign closest centroid
                if mn_dist > d:
                    mn_dist = d
                    cluster[i] = idx
        # updating centroids by taking the mean value of all datapoints of each cluster
        new_centroids = pd.DataFrame(X).groupby(by=cluster).mean().values
        iterations += 1
        # if centroids are same then break.
        if np.count_nonzero(centroids-new_centroids) == 0:
            break
        else: # else update old centroids with new ones
            centroids = new_centroids
    return centroids, cluster, iterations
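A minimal sketch of Wk and W, assuming the missing formulas match the verbal description, that is Wk = (1/|Ck|) * sum over x in Ck of ||x - mu_k||^2 and W = sum over k of Wk. The function names are hypothetical, and the sketch assumes the labels in cluster index the rows of centroids, as returned by kmeans above.

```python
import numpy as np

def intra_cluster_variances(X, cluster, centroids):
    """W_k for each cluster: the summed squared distance of its members
    to the centroid, divided by the cluster size."""
    X = np.asarray(X, dtype=float)
    cluster = np.asarray(cluster).astype(int)
    W_k = np.zeros(len(centroids))
    for k, centroid in enumerate(centroids):
        members = X[cluster == k]
        if len(members) > 0:  # an empty cluster contributes 0
            W_k[k] = ((members - centroid) ** 2).sum() / len(members)
    return W_k

def total_intra_cluster_variance(X, cluster, centroids):
    """W: the sum of the per-cluster variances W_k."""
    return intra_cluster_variances(X, cluster, centroids).sum()

# toy example (made up for illustration)
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
cluster_demo = np.array([0, 0, 1, 1])
centroids_demo = np.array([[0.0, 1.0], [10.0, 2.0]])

W_k = intra_cluster_variances(X_demo, cluster_demo, centroids_demo)
W = total_intra_cluster_variance(X_demo, cluster_demo, centroids_demo)
print(W_k, W)  # [1. 4.] 5.0
```

To choose k, one could run kmeans for a range of k values, collect W for each run, and plot W against k to look for the point where the curve flattens.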

Trying to code the nearest neighbours algorithm - euclidean distance function only calculates the distances for one row of the test set - why?

I am trying to code the Nearest Neighbours Algorithm from scratch and have come across a problem: my algorithm was only giving the index/classification of the nearest neighbour for one row/point of the training set. I went through every part of my code and realised that the problem is my Euclidean distance function. It only gives the result for one row.
This is the code I have written for Euclidean distance:
def euclidean_dist(r1, r2):
    dist = 0
    for j in range(0, len(r2)-1):
        dist = dist + (r2[j] - r1[j])**2
    return dist**0.5
Within my Nearest Neighbours algorithm this is the implementation of the Euclidean distance function:
for i in range(len(x_test)):
    dist1 = []
    dist2 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
        dist2.append(distances)
    dist1 = np.array(dist1)
    sorting(dist1) #separate sorting function to sort the distances from lowest to highest,
    #the aim was to get one array, dist1, with the euclidean distances for each row sorted
    #and one array with the unsorted euclidean distances, dist2, (to be able to search for index later in the code)
I noticed the problem when using the iris dataset and trying out this part of the function with it. I split the data set into testing and training (X_test, X_train and y_test).
When this was implemented with the data set I got the following array for dist2:
[0.3741657386773946,
1.643167672515499,
3.389690251335658,
2.085665361461421,
1.284523257866513,
3.9572717874818752,
0.9539392014169458,
3.5805027579936315,
0.7211102550927979,
...
0.8062257748298555,
0.4242640687119287,
0.5196152422706631]
Its length is 112 which is the same length as X_train, but these are only the Euclidean distances for the first row or point of the X_test set. The dist1 array is the same except it is sorted.
Why am I not getting the Euclidean distances for every row/point of the test set? I thought I iterated through correctly with the for loops, but clearly something is not quite right. Any advice or help would be appreciated.
Using NumPy gives you speed, a built-in distance function, and shorter code:
x_test_array = np.array(x_test)
x_train_array = np.array(x_train)
distance_matrix = np.linalg.norm(x_test_array[:,np.newaxis,:]-x_train_array[np.newaxis,:,:], axis=2)
Cell i,j in the matrix corresponds to the distance between x_test[i] and x_train[j].
You can then do sorting.
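The sorting step could look like the following sketch. The arrays here are random stand-ins for the real train and test sets, and the variable names are only illustrative. np.argsort keeps the original training indices, which covers the role dist2 played in the question.

```python
import numpy as np

# random stand-ins for the real data (3 test rows, 6 training rows, 4 features)
rng = np.random.default_rng(0)
x_train_demo = rng.random((6, 4))
x_test_demo = rng.random((3, 4))

# rows index test points, columns index training points
distance_matrix = np.linalg.norm(
    x_test_demo[:, np.newaxis, :] - x_train_demo[np.newaxis, :, :], axis=2)

# for each test row: training indices ordered from nearest to farthest
order = np.argsort(distance_matrix, axis=1)
# the distances themselves in that same order
sorted_distances = np.take_along_axis(distance_matrix, order, axis=1)
# index of the single nearest training point for each test row
nearest_index = order[:, 0]
```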
Edit: How to create the distance matrix without numpy:
matrix = []
for i in range(len(x_test)):
    dist1 = []
    for j in range(len(x_train)):
        distances = euclidean_dist(x_test[i], x_train[j,:])
        dist1.append(distances)
    matrix.append(dist1)

unsupervised learning - clustering numpy arrays within numpy arrays

We're working with a dataset of spoken numbers. The WAV files are converted to MFCC values. Each row (WAV file) consists of around 20 to 40 arrays (depending on the length of the sound file), with 13 float values in each array. The goal of the task is to identify 10 spoken numbers. Because we don't have labels, we want to cluster them into 10 groups using an unsupervised learning method.
The code looks like this:
def kmeans(data, k=3, normalize=False, limit=500):
    """Basic k-means clustering algorithm.
    """
    # optionally normalize the data. k-means will perform poorly or strangely if the dimensions
    # don't have the same ranges.
    if normalize:
        stats = (data.mean(axis=0), data.std(axis=0))
        data = (data - stats[0]) / stats[1]
    # pick the first k points to be the centers. this also ensures that each group has at least
    # one point.
    centers = data[:k]
    for i in range(limit):
        # core of clustering algorithm...
        # first, use broadcasting to calculate the distance from each point to each center, then
        # classify based on the minimum distance.
        classifications = np.argmin(((data[:, :, None] - centers.T[None, :, :])**2).sum(axis=1), axis=1)
        # next, calculate the new centers for each cluster.
        new_centers = np.array([data[classifications == j, :].mean(axis=0) for j in range(k)])
        # if the centers aren't moving anymore it is time to stop.
        if (new_centers == centers).all():
            break
        else:
            centers = new_centers
    else:
        # this will not execute if the for loop exits on a break.
        raise RuntimeError(f"Clustering algorithm did not complete within {limit} iterations")
    # if data was normalized, the cluster group centers are no longer scaled the same way the original
    # data is scaled.
    if normalize:
        centers = centers * stats[1] + stats[0]
    print(f"Clustering completed after {i} iterations")
    return classifications, centers
classifications, centers = kmeans(speechdata, k=5)
plt.figure(figsize=(12, 8))
plt.scatter(x=speechdata[:, 0], y=speechdata[:, 1], s=100, c=classifications)
plt.scatter(x=centers[:, 0], y=centers[:, 1], s=500, c='k', marker='^')
The line "classifications, centers = kmeans(speechdata, k=5)" gives me an error: IndexError: too many indices for array.
How do I transform my array of arrays, which vary in length (one row has shape (20,13) and another might have (38,13)), so that I can cluster them?
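One possible way to get a rectangular array, offered as a sketch and not the only option, is to summarise each recording by statistics that do not depend on its number of frames. The helper name summarize_mfcc and the random stand-in data are assumptions for illustration. Note that this discards the temporal order of the frames; padding or truncating every recording to a common length and flattening it would be an alternative that keeps it.

```python
import numpy as np

def summarize_mfcc(frames):
    """Collapse a (n_frames, 13) MFCC array into one fixed-length vector
    by concatenating the per-coefficient mean and standard deviation."""
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# random stand-ins for three recordings with 20, 38 and 27 frames
rng = np.random.default_rng(0)
speechdata_demo = [rng.normal(size=(n, 13)) for n in (20, 38, 27)]

# one row per recording, 26 columns regardless of recording length
features = np.vstack([summarize_mfcc(f) for f in speechdata_demo])
print(features.shape)  # (3, 26)
```

The resulting 2-D array has the (n_samples, n_features) layout that the kmeans function above indexes into.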

coarse graining a graph (networkx)

I am trying to coarse grain a large network to a smaller network by predefined node labels. say:
large_network = np.random.rand(100,100)
labels = [1,1,1,1,
5,5,5,5,5,5,5,5,
0,0,0,0,0, ...] #[1x100]
For example, we have 10 regions, each containing a few nodes.
This is something like the membership list used by the community detection algorithms in networkx, which tells which community each node belongs to, but here I am defining it manually. Then I need to calculate the new reduced adjacency matrix, say [10x10].
The weight of the edge between two regions A and B is the average weight of the edges between them: w_{AB} = mean(edges(A, B)).
One way is to loop over the edges of each node and, if the two endpoints of an edge are in the membership lists of the two regions, add it to the weighted sum.
Am I doing this right?
Is there a better, more straightforward method?
You could use coo_matrix in scipy.sparse to do the job for you. The nice thing is that this approach can readily be extended to sparse network representations.
import numpy as np
from scipy.sparse import coo_matrix
# set parameters
N = 100 # no of nodes
M = 10 # no of types
# initialise random network and random node labels
weights = np.random.rand(N, N) # a.k.a "large_network"
labels = np.random.randint(0, M, size=N)
# label of the column node and of the row node for every entry of weights
col_labels = np.tile(labels, (N,1)) # N x N matrix, entry [i, j] = labels[j]
row_labels = col_labels.transpose() # N x N matrix, entry [i, j] = labels[i]
# get sum of weights by connection type (entries with the same coordinates are summed)
numerator = coo_matrix((weights.ravel(), (row_labels.ravel(), col_labels.ravel())), shape=(M,M)).toarray()
# count number of weights by connection type
adjacency = (weights > 0.).astype(int)
denominator = coo_matrix((adjacency.ravel(), (row_labels.ravel(), col_labels.ravel())), shape=(M,M)).toarray()
# normalise sum of weights by counts
small_network = numerator / denominator
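For dense matrices, the same averaging can also be sketched with a one-hot membership matrix and two matrix products. The name coarse_grain is hypothetical, and the sketch assumes the labels are integers in the range 0 to n_groups - 1; region pairs with no edges are left at 0 instead of dividing by zero.

```python
import numpy as np

def coarse_grain(weights, labels, n_groups):
    """Average edge weight between every pair of groups."""
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    # one-hot membership matrix: S[i, g] = 1 if node i belongs to group g
    S = np.zeros((len(labels), n_groups))
    S[np.arange(len(labels)), labels] = 1.0
    weight_sums = S.T @ weights @ S                        # summed weights per group pair
    edge_counts = S.T @ (weights > 0).astype(float) @ S    # number of edges per group pair
    # divide only where edges exist; other entries stay 0
    return np.divide(weight_sums, edge_counts,
                     out=np.zeros_like(weight_sums), where=edge_counts > 0)

# toy network (made up for illustration): 4 nodes in 2 groups
weights_demo = np.array([[0, 1, 2, 0],
                         [1, 0, 0, 4],
                         [2, 0, 0, 3],
                         [0, 4, 3, 0]], dtype=float)
labels_demo = np.array([0, 0, 1, 1])

small = coarse_grain(weights_demo, labels_demo, 2)
print(small)
```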

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster an m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. My code follows:
Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method = "average", metric = "euclidean")
T = hierarchy.fcluster(Z, 100, criterion = "maxclust")
I am taking my matrix of features, computing the Euclidean distance between them, and then passing them on to the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. All you need is the partition as a vector of flat cluster assignments, part, and the original observations X.
import numpy as np

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment
    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. part[n]=c is the cluster assigned to observation n
    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max()+1):
        codebook.append(X[part == i].mean(0))
    return np.vstack(codebook)
You can do something like this (D=number of dimensions):
# Sum the vectors in each cluster
lens = {} # will contain the lengths for each cluster
centroids = {} # will contain the centroids of each cluster
for idx,clno in enumerate(T):
    centroids.setdefault(clno,np.zeros(D))
    centroids[clno] += features[idx,:]
    lens.setdefault(clno,0)
    lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
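The same per-cluster averaging can be sketched without an explicit Python loop. The function name cluster_centroids is hypothetical and the small arrays stand in for features and T; the labels start at 1, as fcluster labels do.

```python
import numpy as np

def cluster_centroids(features, T):
    """Return the distinct cluster ids and one centroid row per id."""
    features = np.asarray(features, dtype=float)
    # inverse[i] is the position of T[i] within the sorted unique ids
    cluster_ids, inverse = np.unique(T, return_inverse=True)
    sums = np.zeros((len(cluster_ids), features.shape[1]))
    np.add.at(sums, inverse, features)   # accumulate the rows of each cluster
    counts = np.bincount(inverse)        # number of observations per cluster
    return cluster_ids, sums / counts[:, None]

# stand-in data (made up for illustration): 5 observations, 3 flat clusters
features_demo = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [10.0, 4.0], [5.0, 5.0]])
T_demo = np.array([1, 1, 2, 2, 3])

ids, cents = cluster_centroids(features_demo, T_demo)
print(ids)    # [1 2 3]
print(cents)
```

Because the ids come from np.unique, this variant also copes with cluster numbers that are not contiguous.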
