k means using numpy - calculate error of each iteration - python

I am trying to perform an image quantization (reducing the number of colors of an image) using one of the k-means algorithms of numpy/scipy for a school project. The algorithm works fine, but I also want to calculate the sum of error for each iteration of the algorithm, i.e. the sum of distances of samples to their closest cluster center (this is one of the project tasks).
I couldn't find any kmeans method of numpy, or another fast, elegant way to perform this.
Is there such a way or method, and if not, what is the best way to perform this task? My goal is to minimize any re-implementation of the existing kmeans algorithm.
Below is my code so far:
import scipy.cluster.vq as vq

def quantize_rgb(im_orig, n_quant, n_iter):
    """
    A function that performs optimal quantization of a given RGB image.
    :param im_orig: the input RGB image to be quantized (float32 image with values in [0, 1])
    :param n_quant: the number of intensities the output image should have
    :param n_iter: the maximum number of iterations of the optimization procedure (may converge earlier.)
    """
    reshaped_im = im_orig.reshape(im_orig.shape[0] * im_orig.shape[1], 3)
    centroids, label = vq.kmeans2(reshaped_im, n_quant, n_iter)
    reshaped_im = centroids[label]
    im_quant = reshaped_im.reshape(im_orig.shape[0], im_orig.shape[1], 3)
    return im_quant

Simply use
vq.kmeans2(k=previous_centers, iter=1, minit="matrix")
to only do one iteration at a time.
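A minimal sketch of how this could be plugged into the function above to record the error after every iteration; the function name, the initial random step, and the error bookkeeping are assumptions added for illustration, not part of the original answer:
import numpy as np
import scipy.cluster.vq as vq

def quantize_rgb_with_errors(im_orig, n_quant, n_iter):
    reshaped_im = im_orig.reshape(-1, 3)
    # one initial step with random centroids
    centroids, label = vq.kmeans2(reshaped_im, n_quant, iter=1, minit='random')
    errors = []
    for _ in range(n_iter):
        # refine the previous centroids by exactly one more iteration
        centroids, label = vq.kmeans2(reshaped_im, centroids, iter=1, minit='matrix')
        # sum of distances of samples to their closest cluster center
        errors.append(np.linalg.norm(reshaped_im - centroids[label], axis=1).sum())
    im_quant = centroids[label].reshape(im_orig.shape)
    return im_quant, errors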

Related

Setting precision while scaling vectors using preprocessing in scikit learn

I have to calculate the Euclidean distance between two vectors, and I have to do scaling before I calculate the distances.
sample_A= np.array([1,1,1,0,0,1,0,0,1,1,0,0,0,0,0,0.008624,-0.002894,0.006471,0.000961,0.007407,-0.004442,-0.00966,-0.003026,0.010202,0.008907,-0.003031,-0.002724,0.002302,0.002171,-0.011219,0.006802,0.004588,0.030068,0.016608,0.021235,0.015706,0.102711,0.053489,0.006902,-0.010042,0.002647,0.036403,-0.010567,0.040207,0.065626,-0.010786,-0.010131,0.080007,-0.046524,-0.08577,0.120587,0.159285,0.058588,0.112184,0.011561])
sample_B = np.array([18,1,1,0,0,1,0,0,1,0,1,0,0,0,0,1.921413,-1.350259,-0.549294,-0.829648,-0.271365,-2.267258,-0.043207,-0.127863,0.46472,0.106202,-0.363018,-0.863932,-1.041068,0.944935,-0.269358,-0.705195,-0.505604,-0.721329,0.603105,-0.619679,-0.461518,0.595048,-0.097054,-1.602379,-0.373747,-0.253988,-0.476779,1.108103,1.428308,1.12896,1.296803,-0.086155,-0.555077,0.347556,0.202161,0.289031,0.676664,-0.318146,0.193779,0.841483])
The expected distance between these two points as per requirement is 7.296226771
from sklearn import preprocessing
from scipy.spatial import distance  # distance.euclidean is used below

A_scaled = preprocessing.scale(sample_A)
B_scaled = preprocessing.scale(sample_B)
distance.euclidean(A_scaled, B_scaled)
The value I got was 7.713635264892224.
My understanding is that this is because of the higher precision that is present while calculating the standard deviation and mean. Is there any way to provide the precision as input to the scaling function, or do I have to write a custom scale function?
If so, how can I write a custom scale function that applies to the entire numpy array?
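For reference, a custom scale function of the kind asked about might look like the sketch below; rounding the mean and standard deviation to a fixed number of decimals is only a guess at what the required precision refers to:
import numpy as np

def scale_with_precision(a, decimals=6):
    # standardize a 1-D array, rounding the mean and standard deviation
    # to a fixed number of decimals before applying them (assumption)
    a = np.asarray(a, dtype=np.float64)
    mean = np.round(a.mean(), decimals)
    std = np.round(a.std(), decimals)
    return (a - mean) / std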

Don't understand the output of Principal Component Analysis (PCA) in Python

I did a PCA in Python on audio spectrograms and am facing the following problem: I have a matrix where each row consists of flattened song features. After applying PCA it's clear to me that the dimensions are reduced, BUT I can't find that dimensional data in the original dataset.
import sys
import glob
from scipy.io.wavfile import read
from scipy import signal
from scipy.fftpack import fft
import numpy as np
import matplotlib.pyplot as plt
import pylab

# Read file to get samplerate and numpy array containing the signal
files = glob.glob('../some/*.wav')
song_list = []
for wav in files:
    (fs, x) = read(wav)
    channels = [
        np.array(x[:, 0]),
        np.array(x[:, 1])
    ]
    # Combine channels to make a mono signal out of stereo
    channel = np.mean(channels, axis=0)
    channel = channel[0:1024, ]
    # Generate spectrogram
    ## Freqs is the same with different songs, t differs slightly
    Pxx, freqs, t, plot = pylab.specgram(
        channel,
        NFFT=128,
        Fs=44100,
        detrend=pylab.detrend_none,
        window=pylab.window_hanning,
        noverlap=int(128 * 0.5))
    # Magnitude spectrum to use
    Pxx = Pxx[0:2]
    X_flat = Pxx.flatten()
    song_list.append(X_flat)

song_matrix = np.vstack(song_list)
If I now apply PCA to the song_matrix...
import matplotlib
from matplotlib.mlab import PCA
from sklearn import decomposition
#test = matplotlib.mlab.PCA(song_matrix.T)
pca = decomposition.PCA(n_components=2)
song_matrix_pca = pca.fit_transform(song_matrix.T)
pca.components_  # These components should be most helpful to discriminate between the songs due to their high variance
...the final 2 components are the following:
[Image: Final components - two dimensions from 15 wav-files]
The problem is that I can't find those two vectors in the original dataset with all its dimensions. What am I doing wrong, or am I misinterpreting the whole thing?
PCA doesn't give you the vectors in your dataset.
From Wikipedia :
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
Say you have a column vector V containing ONE flattened spectrogram. PCA will find a matrix M whose columns are orthogonal vectors (think of them as being at right angles to every other column in M).
Multiplying V by M (more precisely, computing T = M^T V) gives you a vector of "scores" T, which can be used to determine how much variance each column of M captures from the original data; each column of M captures progressively less variance in the data.
Multiplying V by M' (the first 2 columns of M), i.e. T' = M'^T V, produces a 2x1 vector T' representing the "dimension-reduced spectrogram". You could reconstruct an approximation of V as M'T'. This would work if you had a matrix of spectrograms, too. Keeping only two principal components produces an extremely lossy compression of your data.
But what if you want to add a new song to your dataset? Unless it is very much like the original songs (meaning it introduces little variance to the original data set), there's no reason to think that the vectors of M will describe the new song well. For that matter, even multiplying all the elements of V by a constant would render M useless. PCA is quite data-specific, which is why it's not used in image/audio compression.
The good news? You can use a Discrete Cosine Transform (DCT) to compress your training data. Instead of lines, it finds cosines that form a descriptive basis, and it doesn't suffer from the data-specific limitation. The DCT is used in JPEG, MP3 and other compression schemes.
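To make the roles of these matrices concrete, here is a minimal sketch (not from the original answer) using sklearn's PCA on toy data with songs as rows; components_ plays the role of the transpose of M', and the transformed output holds the scores T':
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
song_matrix = rng.normal(size=(15, 128))  # toy stand-in: 15 songs, 128 flattened spectrogram bins

pca = PCA(n_components=2)
scores = pca.fit_transform(song_matrix)   # T': one 2-vector of scores per song
basis = pca.components_                   # spans the reduced space; these are NOT rows of the data

approx = pca.inverse_transform(scores)    # lossy reconstruction, roughly M'T' plus the mean
print(scores.shape, basis.shape, approx.shape)  # (15, 2) (2, 128) (15, 128)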

How can I improve a "dumb" vector quantization algorithm for K-means clustering

I need to convert a codebase relying on the scipy.cluster.vq module to not use scipy so that I can implement it in C++.
First I am trying to replicate the results using only numpy.
Starting with an image of dimensions MxNx3, I create a "centroids" Kx3 array using kmeans with opencv.
I need to map each pixel of the original image to the pixel value in the centroids array that is closest to the original pixel.
I have it working, but performance is awful. I'm sure there must be more advanced ways to compute this, and I suspect it's related to a nearest neighbour search (maybe?) but don't know for sure.
Here is what I'm currently doing (I think this may be called a "brute force" approach):
iterate over every pixel in the image
calculate the euclidean distance between this pixel and each pixel in the centroid list
return the minimum value from the list generated in step 2
assign the original image pixel to the value of the centroids list that returned the minimum distance.
def vq(self, image, centroids):
    x, y, z = image.shape
    Z = np.reshape(image, (x * y, z))
    counts = np.zeros(len(centroids))
    clusterMap = np.zeros(Z.shape, np.uint8)
    for i in range(Z.shape[0]):
        color = Z[i]
        closestIndex = self.getClosestCenter(color, centroids)
        counts[closestIndex] += 1  # tracking how often each color occurs
        clusterMap[i] = centroids[closestIndex]
    return clusterMap, counts

def getClosestCenter(self, color, centers):
    distances = [0 for i in range(len(centers))]
    for i, center in enumerate(centers):
        distances[i] = self.getDistance(color, center)
    return distances.index(min(distances))

def getDistance(self, value1, value2):
    if len(value1) != len(value2): return None  # error
    sum = 0
    for i in range(len(value1)):
        sum += (value1[i] - value2[i]) ** 2
    return sum ** 0.5
First of all, profile your code to see where exactly it is slow.
Constructs such as enumerate can be very expensive because they require the creation and garbage collection of many tuple objects. A good rule of thumb is to avoid object allocations in inner loops and functions (this includes hidden objects such as tuples).
Last but not least, kmeans does not use Euclidean distance. It uses sum-of-squares. Get rid of the square root.
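Beyond those points, the per-pixel loop can be removed entirely with numpy broadcasting. A sketch (not part of the answer above) that mirrors the question's vq function:
import numpy as np

def vq_vectorized(image, centroids):
    x, y, z = image.shape
    Z = image.reshape(x * y, z).astype(np.float64)
    C = np.asarray(centroids, dtype=np.float64)
    # all pixel-to-centroid squared distances at once, shape (num_pixels, num_centroids);
    # no square root is needed just to find the closest centroid
    d2 = ((Z[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    closest = d2.argmin(axis=1)
    counts = np.bincount(closest, minlength=len(C))  # how often each color occurs
    clusterMap = C[closest].astype(np.uint8)         # shape (x*y, 3), like the original
    return clusterMap, counts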

Wrap-around when calculating distance for k-means

I'm trying to do a K-means clustering of some dataset using sklearn. The problem is that one of the dimensions is hour-of-day: a number from 0-23, and so the distance algorithm then thinks that 0 is very far from 23, because in absolute terms it is. In reality and for my purposes, hour 0 is very close to hour 23. Is there a way to make the distance algorithm do some form of wrap-around so it computes the more 'real' time difference?
I'm doing something simple, similar to the following:
import numpy as np
from sklearn.cluster import KMeans

clusters = KMeans(n_clusters=2)
data = np.vstack(data)
fit = clusters.fit(data)
classes = fit.predict(data)
data elements look something like [22, 418, 192], where the first element is the hour.
Any ideas?
Even though @elyase's answer is accepted, I think it is not the correct approach.
Yes, to use such a distance you have to redefine your distance measure and therefore use a different library. But what is more important, the concept of the mean used in k-means does not suit a cyclic dimension. Let's consider the following example:
# current cluster X, based on centroid position Xc=24
x1 = 1
x2 = 24
# current cluster Y, based on centroid position Yc=10
y1 = 12
y2 = 13
Computing the simple arithmetic mean places the centroids at Xc=12.5, Yc=12.5, which from the point of view of the cyclic measure is incorrect; it should be Xc=0.5, Yc=12.5. As you can see, assignment based on the cyclic distance measure is not "compatible" with the simple mean operation, and it leads to bizarre results:
Simple k-means will result in the clusters {x1,y1}, {x2,y2}
Simple k-means + cyclic distance measure results in the degenerate super cluster {x1,x2,y1,y2}
Correct clustering would be {x1,x2}, {y1,y2}
Solving this problem requires case checking (whether it is better to take the "simple average" or to represent one of the points as x'=x-24). Unfortunately, given n points this gives 2^n possibilities.
This seems like a use case for kernelized k-means, where you are actually clustering in an abstract feature space (in your case, a "tube" rolled around the time dimension) induced by the kernel (a "similarity measure", i.e. the inner product of some vector space).
Details of the kernel k-means are given here
Why k-means doesn't work with arbitrary distances
K-means is not a distance-based algorithm.
K-means minimizes the Within-Cluster-Sum-of-Squares, which is a kind of variance (it's roughly the weighted average variance of all clusters, where each object and dimension is given the same weight).
In order for Lloyd's algorithm to converge you need to have both steps optimize the same function:
the reassignment step
the centroid update step
Now the "mean" function is a least-squares estimator. I.e. choosing the mean in step 2 is optimal for the WCSS objective. Assigning objects by least-squares deviation (= squared Euclidean distance, monotone to Euclidean distance) in step 1 also yields guaranteed convergence. The mean is exactly where your wrap-around idea would fall apart.
If you plug in a random other distance function as suggested by @elyase, k-means might no longer converge.
Proper solutions
There are various solutions to this:
Use K-medoids (PAM). By choosing the medoid instead of the mean you do get guaranteed convergence with arbitrary distances. However, computing the medoid is rather expensive.
Transform the data into a kernel space where you are happy with minimizing Sum-of-Squares. For example, you could transform the hour into sin(hour / 12 * pi), cos(hour / 12 * pi), which may be okay for SSQ (a sketch follows this list).
Use other, distance-based clustering algorithms. K-means is old, and there has been a lot of research on clustering since. You may want to start with hierarchical clustering (which actually is just as old as k-means), and then try DBSCAN and the variants of it.
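A short sketch of the sin/cos transform from the second option above (the toy rows and the choice to leave the remaining features unscaled are assumptions):
import numpy as np
from sklearn.cluster import KMeans

# toy rows shaped like the question's [hour, feature1, feature2] data
data = np.array([[22, 418, 192],
                 [23, 401, 188],
                 [ 0, 410, 190],
                 [12, 100,  50]], dtype=float)

hour = data[:, 0]
features = np.column_stack([
    np.sin(hour / 12 * np.pi),  # cyclic encoding: hour 0 and hour 23 become neighbours
    np.cos(hour / 12 * np.pi),
    data[:, 1:],                # remaining features, left unscaled here
])
labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)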
The easiest approach, to me, is to adapt the K-means algorithm to the wrap-around dimension by computing the "circular mean" for that dimension. Of course, you will also need to change the distance-to-centroid calculation accordingly.
#compute the mean of hour 0 and 23
import numpy as np
hours = np.array(range(24))
#hours to angles
angles = hours/24 * (2*np.pi)
sin = np.sin(angles)
cos = np.cos(angles)
a = np.arctan2(sin[23]+sin[0], cos[23]+cos[0])
if a < 0: a += 2*np.pi
#angle back to hour
hour = a * 24 / (2*np.pi)
#23.5

How do I perform a convolution in python with a variable-width Gaussian?

I need to perform a convolution using a Gaussian, however the width of the Gaussian needs to change. I'm not doing traditional signal processing but instead I need to take my perfect Probability Density Function (PDF) and "smear" it, based on the resolution of my equipment.
For instance, suppose my PDF starts out as a spike/delta-function. I'll model this as a very narrow Gaussian. After being run through my equipment, it will be smeared out according to some Gaussian resolution. I can calculate this using the scipy.signal convolution functions.
import numpy as np
import matplotlib.pylab as plt
import scipy.signal as signal
import scipy.stats as stats
# Create the initial function. I model a spike
# as an arbitrarily narrow Gaussian
mu = 1.0 # Centroid
sig=0.001 # Width
original_pdf = stats.norm(mu,sig)
x = np.linspace(0.0,2.0,1000)
y = original_pdf.pdf(x)
plt.plot(x,y,label='original')
# Create the "smearing" function to convolve with the
# original function.
# I use a Gaussian, centered at 0.0 (no bias) and
# width of 0.5
mu_conv = 0.0 # Centroid
sigma_conv = 0.5 # Width
convolving_term = stats.norm(mu_conv,sigma_conv)
xconv = np.linspace(-5,5,1000)
yconv = convolving_term.pdf(xconv)
convolved_pdf = signal.convolve(y/y.sum(),yconv,mode='same')
plt.plot(x,convolved_pdf,label='convolved')
plt.ylim(0,1.2*max(convolved_pdf))
plt.legend()
plt.show()
This all works no problem. But now suppose my original PDF is not a spike, but some broader function. For example, a Gaussian with sigma=1.0. And now suppose my resolution actually varies over x: at x=0.5, the smearing function is a Gaussian with sigma_conv=0.5, but at x=1.5, the smearing function is a Gaussian with sigma_conv=1.5. And suppose I know the functional form of the x-dependence of my smearing Gaussian. Naively, I thought I would change the line above to
convolving_term = stats.norm(mu_conv,lambda x: 0.2*x + 0.1)
But that doesn't work, because the norm function expects a value for the width, not a function. In some sense, I need my convolving function to be a 2D array, where I have a different smearing Gaussian for each point in my original PDF, which remains a 1D array.
So is there a way to do this with functions already defined in Python? I have some code to do this that I wrote myself....but I want to make sure I've not just re-invented the wheel.
Thanks in advance!
Matt
Question, in brief:
How do I convolve with a non-stationary kernel (for example, a Gaussian that changes width for different locations in the data), and does Python have an existing tool for this?
Answer, sort-of:
It's difficult to prove a negative, but I do not think that a function to perform a convolution with a non-stationary kernel exists in scipy or numpy. Anyway, as you describe it, it can't really be vectorized well, so you may as well do a loop or write some custom C code.
One trick that might work for you is, instead of changing the kernel size with position, to stretch the data with the inverse scale (i.e., at places where you'd want the Gaussian width to be 0.5 of the base width, stretch the data to 2x). This way, you can do a single warping operation on the data, a standard convolution with a fixed-width Gaussian, and then unwarp the data to the original scale.
The advantages of this approach are that it's very easy to write and completely vectorized, and therefore probably fairly fast to run.
Warping the data (using, say, an interpolation method) will cause some loss of accuracy, but if you choose things so that the data is always expanded and not reduced in your initial warping operation, the losses should be minimal.
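A minimal sketch of the warp / convolve / unwarp idea described above; the width function 0.2*x + 0.1 is taken from the question, and the interpolation and normalization details are assumptions:
import numpy as np
import scipy.ndimage as ndimage
import scipy.stats as stats

x = np.linspace(0.0, 2.0, 2000)
y = stats.norm(1.0, 0.1).pdf(x)            # a broader original PDF
sigma_of_x = 0.2 * x + 0.1                 # desired local smearing width

# Warp: build a coordinate u in which the kernel width is constant (1 unit of u).
u = np.cumsum((x[1] - x[0]) / sigma_of_x)  # u(x) ~ integral of dx / sigma(x)
u_grid = np.linspace(u[0], u[-1], len(x))
y_warped = np.interp(u_grid, u, y)         # resample onto an even grid in u

# Standard fixed-width Gaussian convolution in the warped space.
du = u_grid[1] - u_grid[0]
y_warped_smeared = ndimage.gaussian_filter1d(y_warped, sigma=1.0 / du)

# Unwarp: read the smeared values back at the original x positions.
smeared = np.interp(u, u_grid, y_warped_smeared)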
