unsupervised learning - clustering numpy arrays within numpy arrays - python

We're working with a dataset of spoken numbers. The wav files are converted to MFCC values. Each row (one wav file) consists of around 20 to 40 arrays (depending on the length of the sound file), with 13 float values in each array. The goal of the task is to identify 10 spoken numbers. Because we don't have labels, we want to cluster the files into 10 groups using an unsupervised learning method.
The code looks like this:
import numpy as np
import matplotlib.pyplot as plt

def kmeans(data, k=3, normalize=False, limit=500):
    """Basic k-means clustering algorithm."""
    # optionally normalize the data. k-means will perform poorly or strangely if the dimensions
    # don't have the same ranges.
    if normalize:
        stats = (data.mean(axis=0), data.std(axis=0))
        data = (data - stats[0]) / stats[1]
    # pick the first k points to be the centers. this also ensures that each group has at least
    # one point.
    centers = data[:k]

    for i in range(limit):
        # core of clustering algorithm...
        # first, use broadcasting to calculate the distance from each point to each center, then
        # classify based on the minimum distance.
        classifications = np.argmin(((data[:, :, None] - centers.T[None, :, :])**2).sum(axis=1), axis=1)
        # next, calculate the new centers for each cluster.
        new_centers = np.array([data[classifications == j, :].mean(axis=0) for j in range(k)])
        # if the centers aren't moving anymore it is time to stop.
        if (new_centers == centers).all():
            break
        else:
            centers = new_centers
    else:
        # this will not execute if the for loop exits on a break.
        raise RuntimeError(f"Clustering algorithm did not complete within {limit} iterations")

    # if data was normalized, the cluster group centers are no longer scaled the same way the original
    # data is scaled.
    if normalize:
        centers = centers * stats[1] + stats[0]

    print(f"Clustering completed after {i} iterations")
    return classifications, centers
classifications, centers = kmeans(speechdata, k=5)
plt.figure(figsize=(12, 8))
plt.scatter(x=speechdata[:, 0], y=speechdata[:, 1], s=100, c=classifications)
plt.scatter(x=centers[:, 0], y=centers[:, 1], s=500, c='k', marker='^')
the line "classifications, centers = kmeans(speechdata, k=5)" gives me an error: IndexError: too many indices for array.
How do I transform my array of array data, with varying length (one row has shape (20,13) and one might have (38,13) so that I can cluster them?
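One way to get past this (a sketch, not the only option): k-means needs a single 2-D array with one fixed-length row per wav file, so you can summarize each file's frames over the time axis, e.g. with the per-coefficient mean and standard deviation, and stack those summaries:
import numpy as np

# A minimal sketch, assuming speechdata is a list (or object array) whose
# elements are the per-file MFCC matrices of shape (n_frames, 13).
# Collapsing each matrix over its time axis gives one fixed-length row per
# file, which is what the kmeans() function above expects.
features = np.vstack([
    np.concatenate([mfcc.mean(axis=0), mfcc.std(axis=0)])  # 26 values per file
    for mfcc in speechdata
])
classifications, centers = kmeans(features, k=10)  # 10 spoken numbers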

Related

Define k-1 cluster centroids -- SKlearn KMeans

I am performing a binary classification of a partially labeled dataset. I have a reliable estimate of its 1's, but not of its 0's.
From sklearn KMeans documentation:
init : {‘k-means++’, ‘random’ or an ndarray}
Method for initialization, defaults to ‘k-means++’:
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
I would like to pass an ndarray, but I only have 1 reliable centroid, not 2.
Is there a way to maximize the entropy between the K-1st centroids and the Kth? Alternatively, is there a way to manually initialize K-1 centroids and use K++ for the remaining?
=======================================================
Related questions:
This seeks to define K centroids with n-1 features. (I want to define k-1 centroids with n features).
Here is a description of what I want, but it was interpreted as a bug by one of the developers, and is "easily implement[able]"
I'm reasonably confident this works as intended, but please correct me if you spot an error (cobbled together from GeeksforGeeks):
import sys
import numpy as np
import pandas as pd

def distance(p1, p2):
    return np.sum((p1 - p2)**2)

def find_remaining_centroid(data, known_centroids, k=1):
    '''
    Initializes the centroids for k-means++
    inputs:
        data - Numpy array containing the feature space
        known_centroids - Numpy array containing the location of one or multiple known centroids
        k - remaining centroids to be found
    '''
    n_points = data.shape[0]

    # Initialize the centroid list
    if known_centroids.ndim > 1:
        centroids = [cent for cent in known_centroids]
    else:
        centroids = [np.array(known_centroids)]

    # Perform casting if necessary
    if isinstance(data, pd.DataFrame):
        data = np.array(data)

    # Add a randomly selected data point to the list
    centroids.append(data[np.random.randint(n_points), :])

    # Compute the remaining k-1 centroids
    for c_id in range(k - 1):
        ## initialize a list to store distances of data
        ## points from the nearest centroid
        dist = np.empty(n_points)
        for i in range(n_points):
            point = data[i, :]
            d = sys.maxsize
            ## compute the distance of 'point' from each of the previously
            ## selected centroids and store the minimum distance
            for j in range(len(centroids)):
                temp_dist = distance(point, centroids[j])
                d = min(d, temp_dist)
            dist[i] = d
        ## select the data point with maximum distance as our next centroid
        next_centroid = data[np.argmax(dist), :]
        centroids.append(next_centroid)
        # Reinitialize the distance array for the next centroid
        dist = np.empty(n_points)

    return centroids[-k:]
Its usage:
# For finding a third centroid:
third_centroid = find_remaining_centroid(X_train, np.array([presence_seed, absence_seed]), k = 1)
# For finding the second centroid:
second_centroid = find_remaining_centroid(X_train, presence_seed, k = 1)
Where presence_seed and absence_seed are known centroid locations.
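The centroids found this way can then be handed to scikit-learn as explicit initial centers. A sketch, assuming presence_seed is a 1-D array with n_features entries and second_centroid is the one-element list returned above:
import numpy as np
from sklearn.cluster import KMeans

# Stack the known seed with the centroid found by find_remaining_centroid
# (which returns a list), then pass them via init. n_init=1 because the
# initialization is no longer random, so restarts would all be identical.
init_centers = np.vstack([presence_seed] + list(second_centroid))
km = KMeans(n_clusters=2, init=init_centers, n_init=1)
labels = km.fit_predict(X_train)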

Calculate the length of an edge consisting of many pixel points

I have written a workflow to detect the edge of a flame in an image, and I was able to extract the edge line. It consists of many pixel points stored in an array (data in my code). Now, based on data, I would like to calculate the length of the edge. The idea is to calculate the distance between every pair of consecutive points in data and sum them all to get the length. I'm stuck on this step. Please help me, many thanks.
Here is the processed image:
And here is the original image that was converted into the processed image; I load it in the code to compare the results:
import cv2
import matplotlib.pyplot as plt

if __name__ == '__main__':
    path = '1897_1.jpg'  # processed image
    pic = cv2.imread(path)
    original = cv2.imread('1897_2.jpg')  # original image
    img2 = cv2.flip(original, 1)
    b, g, r = cv2.split(pic)
    img4 = cv2.flip(b, 1)
    h, w = img4.shape

    data = []
    th_val = 20
    for i in range(h):
        for j in range(w):
            val = img4[i, j]
            if val >= th_val:
                data.append(j)
                break

    b1 = range(len(data))
    b2 = len(data)
    result = [b2]
    print(b2)

    plt.figure(figsize=(10, 8))
    plt.subplot(121)
    plt.imshow(img4)
    plt.plot(data, b1)
    plt.axis('off')
    plt.subplot(122)
    plt.plot(data, b1)
    plt.imshow(img2)
    plt.axis('off')
I came up with a very simple solution. It is far from optimal, but it works for this example and it is a good starting point. Unfortunately, this solution is not optimal for the blue channel, where the curve is not smooth, but it works for the green and red channels.
data contains the width coordinate of the first pixel in each row that exceeds the threshold. So consecutive points are separated by a step of 1 pixel on the vertical axis and by data[i+1] - data[i] on the horizontal axis. These two values can be considered the two legs of a right triangle, and the hypotenuse is the distance we want to calculate. So, here is the solution:
length = 0
for i in range(0, len(data) - 1):
    cathetus = data[i+1] - data[i]
    hypotenuse = (cathetus**2 + 1**2)**0.5  # note: **0.5, not **1/2
    length += hypotenuse
print(length)
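The same sum can also be written in a vectorized form with NumPy (a sketch, using the data list built in the question's code):
import numpy as np

steps = np.diff(np.asarray(data, dtype=float))  # horizontal leg for each one-pixel vertical step
length = np.hypot(steps, 1.0).sum()             # sum of the hypotenuses
print(length)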
Update
I have come up with two solutions: a hardcoded one and one based on a library function. Let us start with the first: a moving mean is a rather good approximation of the underlying signal in the presence of noise. When the noise is not very strong and there is no missing data, you may use this approach. In the example below we select the points with x in [1, 2, 3], calculate the mean y for these points and assign that mean to the coordinate x = 2. Next we select the points with x in [2, 3, 4], and so on. As a result, we obtain a mean_data list with the y coordinates and mean_x with the x coordinates. We can then calculate the length with the approach described above. You can also increase the amount of smoothing by averaging over 4 or more points from data.
mean_data = []
mean_x = range(1, len(data) - 1)
for i in range(0, len(data) - 2):
    mean_d = (data[i] + data[i+1] + data[i+2]) / 3
    mean_data.append(mean_d)
Another approach is to use the smoothing tools from the SciPy package. One of them is shown below. When calculating the length you will have to account for the new x axis xnew.
# scipy.interpolate.spline was removed in newer SciPy versions;
# make_interp_spline is the current equivalent
from scipy.interpolate import make_interp_spline
import numpy as np

# transform the initial data to np.arrays
b1_ = np.array(b1)
data_ = np.array(data)
# create a new x axis with more data points
xnew = np.linspace(b1_.min(), b1_.max(), 50)  # 50 is the number of points in between
smoothed_data = make_interp_spline(b1_, data_)(xnew)
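The length can then be computed over the resampled curve; a sketch, keeping in mind that the spacing along xnew is no longer one pixel:
# distance between consecutive points of the smoothed curve, summed up
length = np.sqrt(np.diff(xnew)**2 + np.diff(smoothed_data)**2).sum()
print(length)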

Find distance between centroid and points in a single feature dataframe - KMeans

I'm working on an anomaly detection task using KMeans.
The pandas dataframe I'm using has a single feature and looks like the following:
df = array([[12534.],
[12014.],
[12158.],
[11935.],
...,
[ 5120.],
[ 4828.],
[ 4443.]])
I'm able to fit and to predict values with the following instructions:
km = KMeans(n_clusters=2)
km.fit(df)
km.predict(df)
In order to identify anomalies, I would like to calculate the distance between each centroid and each single point, but with a single-feature dataframe I'm not sure this is the correct approach.
I found examples that use Euclidean distance to calculate the distance. Here is one of them:
def k_mean_distance(data, cx, cy, i_centroid, cluster_labels):
    distances = [np.sqrt((x - cx)**2 + (y - cy)**2) for (x, y) in data[cluster_labels == i_centroid]]
    return distances

centroids = self.km.cluster_centers_
distances = []
for i, (cx, cy) in enumerate(centroids):
    mean_distance = k_mean_distance(day_df, cx, cy, i, clusters)
    distances.append({'x': cx, 'y': cy, 'distance': mean_distance})
This code doesn't work for me, because in my case the centroids look like the following, since I have a single-feature dataframe:
array([[11899.90692187],
[ 5406.54143126]])
In this case, what is the correct approach to find the distance between the centroids and the points? Is it possible?
Thank you, and sorry for the trivial question, I'm still learning.
There's scipy.spatial.distance_matrix you can make use of:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# set up a set of 2d points
np.random.seed(2)
df = np.random.uniform(0, 1, (100, 2))
# make it a dataframe
df = pd.DataFrame(df)

# clustering with 3 clusters
km = KMeans(n_clusters=3)
km.fit(df)
preds = km.predict(df)

# get centroids
centroids = km.cluster_centers_

# visualize
plt.scatter(df[0], df[1], c=preds)
plt.scatter(centroids[:, 0], centroids[:, 1], c=range(centroids.shape[0]), s=1000)
gives the following scatter plot of the points colored by cluster, with the centroids overlaid.
Now the distance matrix:
from scipy.spatial import distance_matrix
dist_mat = pd.DataFrame(distance_matrix(df.values, centroids))
You can confirm that this is correct by
dist_mat.idxmin(axis=1) == preds
And finally, the mean distance to centroids:
dist_mat.groupby(preds).mean()
gives:
          0         1         2
0  0.243367  0.525194  0.571674
1  0.525350  0.228947  0.575169
2  0.560297  0.573860  0.197556
where the columns denote the centroid number, the rows denote the cluster the points were assigned to, and each entry is the mean distance from the points in that cluster to that centroid.
You can use scipy.spatial.distance.cdist to create a distance matrix:
from scipy.spatial.distance import cdist
dm = cdist(df, centroids)
This should give you a 2-d array, where each row represents an observation in your original dataset and each column represents a centroid. The entry in the x-th row and y-th column gives the distance from your x-th observation to your y-th cluster centroid. cdist uses Euclidean distance by default, but you can use other metrics (not that the choice matters much for a dataset with only one feature).
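For the anomaly-detection part, what you typically want from this matrix is each observation's distance to its own centroid. A sketch building on dm and the labels from km.predict (the 95th-percentile cutoff is just an illustrative choice):
import numpy as np

preds = km.predict(df)                              # cluster label per observation
own_dist = dm[np.arange(dm.shape[0]), preds]        # distance to the assigned centroid
anomalies = own_dist > np.percentile(own_dist, 95)  # flag unusually distant points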

Method for calculating irregularly spaced accumulation points

I am attempting to do the opposite of this: given a 2D image of (continuous) intensities, generate a set of irregularly spaced accumulation points, i.e., points that irregularly cover the 2D map and are closer to each other in the areas of high intensity (but without overlapping!).
My first try was "weighted" k-means. As I didn't find a working implementation of weighted k-means, the way I introduce the weights consists of repeating the points with high intensities. Here is my code:
import numpy as np
from math import ceil
from sklearn.cluster import KMeans

def accumulation_points_finder(x, y, data, n_points, method, cut_value):
    # computing the rms (estimate_rms is defined elsewhere)
    rms = estimate_rms(data)
    # structuring the data
    X, Y = np.meshgrid(x, y, sparse=False)
    if cut_value > 0.:
        mask = data > cut_value
        # applying the mask
        X = X[mask]; Y = Y[mask]; data = data[mask]
        _data = np.array([X, Y, data])
    else:
        X = X.ravel(); Y = Y.ravel(); data = data.ravel()
        _data = np.array([X, Y, data])

    if method == 'weighted_kmeans':
        res = []
        for i in range(len(data)):
            w = int(ceil(data[i] / rms))
            res.extend([[X[i], Y[i]]] * w)
        res = np.asarray(res)
        # kmeans object instantiation
        kmeans = KMeans(init='k-means++', n_clusters=n_points, n_init=25, n_jobs=2)
        # performing kmeans clustering
        kmeans.fit(res)
        # returning just (x,y) positions
        return kmeans.cluster_centers_
Here are two different results: 1) Making use of all the data pixels. 2) Making use of only pixels above some threshold (RMS).
As you can see, the points seem to be regularly spaced rather than concentrated in the areas of high intensity.
So my question is whether there exists a better (deterministic, if possible) method for computing such accumulation points.
Partition the data using quadtrees (https://en.wikipedia.org/wiki/Quadtree) into units of equal variance (or perhaps the intensity value could also be used?), using a defined threshold, then keep one point per unit (the centroid). There will be more subdivisions in areas with rapidly changing values, fewer in the background areas.
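A sketch of that idea, assuming x, y and data are the same 1-D coordinate vectors and 2-D intensity map used in accumulation_points_finder above; the variance threshold and the intensity-weighted centroid are illustrative choices:
import numpy as np

def quadtree_points(x, y, data, var_threshold, min_size=4):
    """Illustrative sketch: recursively split the grid until the intensity
    variance inside a cell drops below var_threshold (or the cell gets very
    small), then keep one point per cell, here the intensity-weighted
    centroid. All names and parameters are placeholders."""
    points = []

    def split(r0, r1, c0, c1):
        block = data[r0:r1, c0:c1]
        if block.size == 0:
            return
        if block.size <= min_size or block.var() <= var_threshold:
            w = block - block.min() + 1e-12  # non-negative weights
            X, Y = np.meshgrid(x[c0:c1], y[r0:r1])
            points.append((np.average(X, weights=w),
                           np.average(Y, weights=w)))
            return
        rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
        split(r0, rm, c0, cm); split(r0, rm, cm, c1)
        split(rm, r1, c0, cm); split(rm, r1, cm, c1)

    split(0, data.shape[0], 0, data.shape[1])
    return np.array(points)
Because the splitting rule is purely data-driven, the result is deterministic: cells covering flat background collapse into a single point, while rapidly varying regions keep subdividing.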

How to get centroids from SciPy's hierarchical agglomerative clustering?

I am using SciPy's hierarchical agglomerative clustering methods to cluster a m x n matrix of features, but after the clustering is complete, I can't seem to figure out how to get the centroid from the resulting clusters. Below follows my code:
from scipy.spatial import distance
from scipy.cluster import hierarchy

Y = distance.pdist(features)
Z = hierarchy.linkage(Y, method="average", metric="euclidean")
T = hierarchy.fcluster(Z, 100, criterion="maxclust")
I am taking my matrix of features, computing the Euclidean distances between them, and then passing them to the hierarchical clustering method. From there, I am creating flat clusters, with a maximum of 100 clusters.
Now, based on the flat clusters T, how do I get the 1 x n centroid that represents each flat cluster?
A possible solution is a function that returns a codebook with the centroids, like kmeans in scipy.cluster.vq does. The only things you need are the partition vector with the flat cluster assignment (part) and the original observations X.
import numpy as np

def to_codebook(X, part):
    """
    Calculates centroids according to flat cluster assignment

    Parameters
    ----------
    X : array, (n, d)
        The n original observations with d features
    part : array, (n)
        Partition vector. p[n]=c is the cluster assigned to observation n

    Returns
    -------
    codebook : array, (k, d)
        Returns a k x d codebook with k centroids
    """
    codebook = []
    for i in range(part.min(), part.max() + 1):
        codebook.append(X[part == i].mean(0))
    return np.vstack(codebook)
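Used with the variables from the question, this gives one centroid row per flat cluster:
codebook = to_codebook(features, T)  # row i corresponds to flat cluster T.min() + i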
You can do something like this (D=number of dimensions):
# Sum the vectors in each cluster
lens = {}       # will contain the lengths for each cluster
centroids = {}  # will contain the centroids of each cluster
for idx, clno in enumerate(T):
    centroids.setdefault(clno, np.zeros(D))
    centroids[clno] += features[idx, :]
    lens.setdefault(clno, 0)
    lens[clno] += 1
# Divide by number of observations in each cluster to get the centroid
for clno in centroids:
    centroids[clno] /= float(lens[clno])
This will give you a dictionary with cluster number as the key and the centroid of the specific cluster as the value.
