Expectation Maximization Algorithm (EM) for Gaussian Mixture Models (GMMs) - python

I'm trying to apply the Expectation Maximization Algorithm (EM) to a Gaussian Mixture Model (GMM) using Python and NumPy. The PDF document I am basing my implementation on can be found here.
Below are the equations. E-step (responsibilities):

w_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}

M-step (parameter updates):

\pi_k = \frac{1}{n} \sum_{i=1}^{n} w_{ik}, \qquad
\mu_k = \frac{\sum_{i=1}^{n} w_{ik} \, x_i}{\sum_{i=1}^{n} w_{ik}}, \qquad
\Sigma_k = \frac{\sum_{i=1}^{n} w_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} w_{ik}}
When applying the algorithm, I get the means of the first and the second cluster both equal to:
array([[2.50832195],
[2.51546208]])
When the actual vector means for the first and second cluster are, respectively:
array([[0],
[0]])
and:
array([[5],
[5]])
The same thing happens with the covariance matrices. I get:
array([[7.05168736, 6.17098629],
[6.17098629, 7.23009494]])
When it should be:
array([[1, 0],
[0, 1]])
for both clusters.
Here is the code:
np.random.seed(1)
# first cluster
X_11 = np.random.normal(0, 1, 1000)
X_21 = np.random.normal(0, 1, 1000)
# second cluster
X_12 = np.random.normal(5, 1, 1000)
X_22 = np.random.normal(5, 1, 1000)
X_1 = np.concatenate((X_11,X_12), axis=None)
X_2 = np.concatenate((X_21,X_22), axis=None)
# data matrix of k x n dimensions (2 x 2000 dimensions)
X = np.concatenate((np.array([X_1]),np.array([X_2])), axis=0)
# multivariate normal distribution function gives n x 1 vector (2000 x 1 vector)
def normal_distribution(x, mu, sigma):
    mvnd = []
    for i in range(np.shape(x)[1]):
        gd = (2*np.pi)**(-2/2) * np.linalg.det(sigma)**(-1/2) * np.exp((-1/2) * np.dot(np.dot((x[:,i:i+1]-mu).T, np.linalg.inv(sigma)), (x[:,i:i+1]-mu)))
        mvnd.append(gd)
    return np.reshape(np.array(mvnd), (np.shape(x)[1], 1))
# Initialized parameters
sigma_1 = np.array([[10, 0],
[0, 10]])
sigma_2 = np.array([[10, 0],
[0, 10]])
mu_1 = np.array([[10],
[10]])
mu_2 = np.array([[10],
[10]])
pi_1 = 0.5
pi_2 = 0.5
Sigma_1 = np.empty([2000, 2, 2])
Sigma_2 = np.empty([2000, 2, 2])
for i in range(10):
    # E-step: compute the responsibilities of each component for every point
    w_i1 = (pi_1*normal_distribution(X, mu_1, sigma_1))/(pi_1*normal_distribution(X, mu_1, sigma_1) + pi_2*normal_distribution(X, mu_2, sigma_2))
    w_i2 = (pi_2*normal_distribution(X, mu_2, sigma_2))/(pi_1*normal_distribution(X, mu_1, sigma_1) + pi_2*normal_distribution(X, mu_2, sigma_2))
    # M-step: update mixing weights, means, and covariances
    pi_1 = np.sum(w_i1)/2000
    pi_2 = np.sum(w_i2)/2000
    mu_1 = np.array([(1/(np.sum(w_i1)))*np.sum(w_i1.T*X, axis=1)]).T
    mu_2 = np.array([(1/(np.sum(w_i2)))*np.sum(w_i2.T*X, axis=1)]).T
    # per-sample weighted outer products for the covariance updates
    for j in range(2000):
        Sigma_1[j:j+1, :, :] = w_i1[j:j+1,:]*np.dot((X[:,j:j+1]-mu_1), (X[:,j:j+1]-mu_1).T)
        Sigma_2[j:j+1, :, :] = w_i2[j:j+1,:]*np.dot((X[:,j:j+1]-mu_2), (X[:,j:j+1]-mu_2).T)
    sigma_1 = (1/(np.sum(w_i1)))*np.sum(Sigma_1, axis=0)
    sigma_2 = (1/(np.sum(w_i2)))*np.sum(Sigma_2, axis=0)
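For reference, the values the fit collapses to are (up to sampling noise) just the overall mean and covariance of the pooled data, which can be checked directly:
# overall statistics of the pooled 2 x 2000 data matrix
print(X.mean(axis=1))  # ≈ [2.5, 2.5]
print(np.cov(X))       # ≈ [[7.25, 6.25], [6.25, 7.25]]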
I would really appreciate it if someone could point out the mistake in my code or in my understanding of the algorithm.

Related

How to get equally spaced grid points in an irregularly shaped figure?

I have an irregularly shaped image and I want to get equally spaced grid points inside it.
For example: [image of the irregular shape I'm working with]
I am thinking of using OpenCV to get the corner coordinates, and that part is easy, but I do not know how to pass all the corner coordinates, or how to divide my shape into identifiable geometric shapes and do this.
Right now, I have hard-coded the coordinates and created a function to which I pass them.
import numpy as np
import matplotlib.pyplot as plt
import functools
def gridFunc(arr):
    center = np.mean(arr, axis=0)
    x = np.arange(min(arr[:, 0]), max(arr[:, 0]) + 0.04, 0.4)
    y = np.arange(min(arr[:, 1]), max(arr[:, 1]) + 0.04, 0.4)
    a, b = np.meshgrid(x, y)
    points = np.stack([a.reshape(-1), b.reshape(-1)]).T

    def normal(a, b):
        v = b - a
        n = np.array([v[1], -v[0]])
        # normal needs to point out
        if (center - a) @ n > 0:
            n *= -1
        return n

    mask = functools.reduce(np.logical_and, [((points - a) @ normal(a, b)) < 0 for a, b in zip(arr[:-1], arr[1:])])
    #plt.plot(arr[:, 0], arr[:, 1])
    #plt.gca().set_aspect('equal')
    #plt.scatter(points[mask][:, 0], points[mask][:, 1])
    #plt.show()
    return points[mask]
arr1 = np.array([[0, 7],[3, 10],[3, 4],[0, 7]])
arr2 = np.array([[3, 0], [3, 14], [12, 14], [12, 0], [3,0]])
arr3 = np.array([[12, 4], [12, 10], [20, 10], [20, 4], [12, 4]])
arr_1 = gridFunc(arr1)
arr_2 = gridFunc(arr2)
arr_3 = gridFunc(arr3)
res = np.append(arr_1, arr_2)
res = np.reshape(res, (-1, 2))
res = np.append(res, arr_3)
res = np.reshape(res, (-1, 2))
plt.scatter(res[:,0], res[:,1])
plt.show()
The image I get is shown below, but I am doing this manually, and I want to extend it to other shapes as well.
[Image: scatter plot of the resulting grid points]
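One possible way to avoid the hard-coding (a sketch of my own, assuming the shape is available as a binary mask image, here a hypothetical shape_mask.png, and that each piece is a convex polygon, since gridFunc's half-plane mask only works for convex shapes): extract the contour with OpenCV, simplify it to corner points, and pass those to gridFunc.
import cv2
import numpy as np
# hypothetical input: a binary mask image of one (convex) piece of the shape
img = cv2.imread('shape_mask.png', cv2.IMREAD_GRAYSCALE)
contours, _ = cv2.findContours(img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# reduce the contour to the polygon's corner points
poly = cv2.approxPolyDP(contours[0], 2.0, True).reshape(-1, 2).astype(float)
# close the polygon so zip(arr[:-1], arr[1:]) covers every edge
arr = np.vstack([poly, poly[:1]])
points = gridFunc(arr)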

'Lossy' cumsum in numpy

I have an array a of length N and need to implement the following operation:

b_n = \sum_{i=0}^{n} p^{\,n-i} a_i, \qquad p \in [0, 1]

This is a lossy sum, where the first indices in the sum are weighted by a greater loss (p^{n-i}) than the last ones; the last index (i = n) is always weighted by 1. If p = 1, then the operation is a simple cumsum:
b = np.cumsum(a)
If p != 1, I can implement this operation in a CPU-inefficient way:
b = np.empty(np.shape(a))
# I'm using the (-1,-1,-1) idiom for reversed ranges
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
# p_vec[0] = p^{N-1}, p_vec[-1] = 1
for n in range(N):
    b[n] = np.sum(a[:n+1]*p_vec[-(n+1):])
Or in a memory-inefficient but vectorized way (in my opinion it is CPU-inefficient too, since a lot of work is wasted):
a_idx = np.reshape(np.arange(N+1), (1, N+1)) - np.reshape(np.arange(N-1, 0-1, -1), (N, 1))
a_idx = np.maximum(0, a_idx)
# For N=4, a_idx looks like this:
# [[0, 0, 0, 0, 1],
# [0, 0, 0, 1, 2],
# [0, 0, 1, 2, 3],
# [0, 1, 2, 3, 4]]
a_ext = np.concatenate(([0], a,), axis=0) # len(a_ext) = N + 1
p_vec = np.power(p, np.arange(N, 0-1, -1)) # len(p_vec) = N + 1
b = np.dot(a_ext[a_idx], p_vec)
Is there a better way to achieve this 'lossy' cumsum?
What you want is an IIR filter; you can use scipy.signal.lfilter(). Here is the comparison.
Your code:
import numpy as np
N = 10
p = 0.8
np.random.seed(0)
x = np.random.randn(N)
y = np.empty_like(x)
p_vec = np.power(p, np.arange(N-1, 0-1, -1))
for n in range(N):
    y[n] = np.sum(x[:n+1]*p_vec[-(n+1):])
y
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
By using lfilter():
from scipy import signal
y = signal.lfilter([1], [1, -p], x)
print(y)
the output:
array([1.76405235, 1.81139909, 2.42785725, 4.183179 , 5.21410119,
3.19400307, 3.50529088, 2.65287549, 2.01908154, 2.02586374])
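lfilter([1], [1, -p], x) works here because the lossy cumsum satisfies the first-order recurrence y[n] = p*y[n-1] + x[n]. A plain O(N) loop making that recurrence explicit (my own sketch, equivalent to the filter above):
def lossy_cumsum(a, p):
    # b[n] = p * b[n-1] + a[n], with b[-1] = 0
    b = np.empty(len(a))
    acc = 0.0
    for n, val in enumerate(a):
        acc = p * acc + val
        b[n] = acc
    return b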

How to broadcast or vectorize a linear interpolation of a 2D array that uses scipy.ndimage map_coordinates?

I have recently hit a roadblock when it comes to performance. I know how to manually loop and do the interpolation from the origin cell to all the other cells by brute-forcing/looping over each row and column in the 2D array.
However, when I process a 2D array of shape, say, (3000, 3000), the linear spacing and the interpolation come to a standstill and severely hurt performance.
I am looking for a way to optimize this loop. I am aware of vectorization and broadcasting, I am just not sure how I can apply them in this situation.
I will explain it with code and figures.
import numpy as np
from scipy.ndimage import map_coordinates
m = np.array([
[10,10,10,10,10,10],
[9,9,9,10,9,9],
[9,8,9,10,8,9],
[9,7,8,0,8,9],
[8,7,7,8,8,9],
[5,6,7,7,6,7]])
origin_row = 3
origin_col = 3
m_max = np.zeros(m.shape)
m_dist = np.zeros(m.shape)
rows, cols = m.shape
for col in range(cols):
    for row in range(rows):
        # get the linearly spaced coordinates from (row, col) to the origin
        x_plot = np.linspace(col, origin_col, 5)
        y_plot = np.linspace(row, origin_row, 5)
        # grab the interpolated line
        interpolated_line = map_coordinates(m,
                                            np.vstack((y_plot,
                                                       x_plot)),
                                            order=1, mode='nearest')
        m_max[row][col] = max(interpolated_line)
        m_dist[row][col] = np.argmax(interpolated_line)
print(m)
print(m_max)
print(m_dist)
As you can see this is very brute force. I have managed to broadcast all the code around this part, but I am stuck on this part.
Here is an illustration of what I am trying to achieve; I will go through the first iteration.
1.) The input array.
2.) The first loop runs from (0, 0) to the origin (3, 3).
3.) This returns [10 9 9 8 0]; the max will be 10 and its index will be 0.
4.) Here is the output for the sample array I used.
Here is an update of the performance based on the accepted answer.
To speed up the code, you could first create the x_plot and y_plot outside of the loops instead of recreating them on every iteration:
#this would be outside of the loops
num = 5
lin_col = np.array([np.linspace(i, origin_col, num) for i in range(cols)])
lin_row = np.array([np.linspace(i, origin_row, num) for i in range(rows)])
then you could access them in each loop by x_plot = lin_col[col] and y_plot = lin_row[row]
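That is, the question's double loop would become (everything else unchanged):
for col in range(cols):
    for row in range(rows):
        x_plot = lin_col[col]
        y_plot = lin_row[row]
        interpolated_line = map_coordinates(m, np.vstack((y_plot, x_plot)),
                                            order=1, mode='nearest')
        m_max[row][col] = max(interpolated_line)
        m_dist[row][col] = np.argmax(interpolated_line)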
Second, you can avoid both loops by calling map_coordinates once on a single stacked array of coordinates rather than on one vstack per (row, col) couple. To do so, you can create all the combinations of x_plot and y_plot by using np.tile and np.ravel, such as:
arr_vs = np.vstack(( np.tile( lin_row, cols).ravel(),
np.tile( lin_col.ravel(), rows)))
Note that ravel is not applied at the same place each time, in order to get all the combinations. Now you can use map_coordinates with this arr_vs and reshape the result with the number of rows, cols and num to get each interpolated_line in the last axis of a 3D array:
arr_map = map_coordinates(m, arr_vs, order=1, mode='nearest').reshape(rows,cols,num)
Finally, you can use np.max and np.argmax on the last axis of arr_map to get the results m_max and m_dist. So all the code would be:
import numpy as np
from scipy.ndimage import map_coordinates
m = np.array([
[10,10,10,10,10,10],
[9,9,9,10,9,9],
[9,8,9,10,8,9],
[9,7,8,0,8,9],
[8,7,7,8,8,9],
[5,6,7,7,6,7]])
origin_row = 3
origin_col = 3
rows, cols = m.shape
num = 5
lin_col = np.array([np.linspace(i, origin_col, num) for i in range(cols)])
lin_row = np.array([np.linspace(i, origin_row, num) for i in range(rows)])
arr_vs = np.vstack(( np.tile( lin_row, cols).ravel(),
np.tile( lin_col.ravel(), rows)))
arr_map = map_coordinates(m, arr_vs, order=1, mode='nearest').reshape(rows,cols,num)
m_max = np.max( arr_map, axis=-1)
m_dist = np.argmax( arr_map, axis=-1)
print (m_max)
print (m_dist)
and you get, as expected:
#m_max
array([[10, 10, 10, 10, 10, 10],
[ 9, 9, 10, 10, 9, 9],
[ 9, 9, 9, 10, 8, 9],
[ 9, 8, 8, 0, 8, 9],
[ 8, 8, 7, 8, 8, 9],
[ 7, 7, 8, 8, 8, 8]])
#m_dist
array([[0, 0, 0, 0, 0, 0],
[0, 0, 2, 0, 0, 0],
[0, 2, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0],
[0, 2, 0, 0, 0, 0],
[1, 1, 2, 1, 2, 1]])
EDIT: lin_col and lin_row are related, so you can build them faster:
if cols >= rows:
    arr = np.arange(cols)[:,None]
    lin_col = arr + (origin_col-arr)/(num-1.)*np.arange(num)
    lin_row = lin_col[:rows] + np.linspace(0, origin_row - origin_col, num)[None,:]
else:
    arr = np.arange(rows)[:,None]
    lin_row = arr + (origin_row-arr)/(num-1.)*np.arange(num)
    lin_col = lin_row[:cols] + np.linspace(0, origin_col - origin_row, num)[None,:]
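A quick sanity check (my addition) that this closed-form construction matches the list-comprehension version above:
assert np.allclose(lin_col, np.array([np.linspace(i, origin_col, num) for i in range(cols)]))
assert np.allclose(lin_row, np.array([np.linspace(i, origin_row, num) for i in range(rows)]))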
Here is a sort-of-vectorized approach. It is not very optimized and there may be one or two index-off-by-one errors, but it may give you ideas.
Two examples: a monochrome 384x512 test pattern and a "real" 3-channel 768x1024 image; both are uint8.
This takes half a minute on my machine.
For larger images one would need more RAM than I have (8 GB), or one would have to break the work into smaller chunks.
And the code:
import numpy as np

def rays(img, ctr):
    M, N, *d = img.shape
    aidx = 2*(slice(None),) + (img.ndim-2)*(None,)
    m, n = ctr
    out = np.empty_like(img)
    offsI = np.empty(img.shape, np.uint16)
    offsJ = np.empty(img.shape, np.uint16)
    # views of the four quadrants around the center point
    img4, out4, I4, J4 = ((x[m:, n:], x[m:, n::-1], x[m::-1, n:], x[m::-1, n::-1]) for x in (img, out, offsI, offsJ))
    for i, o, y, x in zip(img4, out4, I4, J4):
        for _ in range(2):
            M, N, *d = i.shape
            widths = np.arange(1, M+1, dtype=np.uint16).clip(None, N)
            I = np.arange(M, dtype=np.uint16).repeat(widths)
            J = np.ones_like(I)
            J[0] = 0
            J[widths[:-1].cumsum()] -= widths[:-1]
            J = J.cumsum(dtype=np.uint16)
            ii = np.arange(1, 2*M-1, dtype=np.uint16) // 2
            II = ii.clip(None, I[:, None])
            jj = np.arange(2*M-2, dtype=np.uint32) // 2 * 2 + 1
            jj[0] = 0
            JJ = ((1 + jj) * J[:, None] // (2*(I+1))[:, None]).astype(np.uint16).clip(None, J[:, None])
            idx = i[II, JJ].argmax(axis=1)
            II, JJ = (np.take_along_axis(ZZ[aidx], idx[:, None], 1)[:, 0] for ZZ in (II, JJ))
            y[I, J], x[I, J] = II, JJ
            SH = II, JJ, *np.ogrid[tuple(map(slice, img.shape))][2:]
            o[I, J] = i[SH]
            # transpose for the second pass over this quadrant
            i, o = i.swapaxes(0, 1), o.swapaxes(0, 1)
            y, x = x.swapaxes(0, 1), y.swapaxes(0, 1)
    return out, offsI, offsJ

from scipy.misc import face
f = face()
fr, *fidx = rays(f, (200, 400))
s = np.uint8((np.arange(384)[:, None] % 41 < 2) & (np.arange(512) % 41 < 2))
s = 255*s + 128*s[::-1, ::-1] + 64*s[::-1] + 32*s[:, ::-1]
sr, *sidx = rays(s, (200, 400))

from PIL import Image
Image.fromarray(f).show()
Image.fromarray(fr).show()
Image.fromarray(s).show()
Image.fromarray(sr).show()

Generating random numbers around a set of coordinates without for loop

I have a set of coordinate means (3D) and a set of standard deviations (3D) accompanying them, like this:
means = [[x1, y1, z1],
[x2, y2, z2],
...
[xn, yn, zn]]
stds = [[sx1, sy1, sz1],
[sx2, sy2, sz2],
...
[sxn, syn, szn]]
So the problem is N x 3.
I am looking to generate 1000 coordinate sample sets (N x 3 x 1000) randomly using np.random.normal(). Currently I generate the samples using a for loop:
for i in range(0, 1000):
    samples = np.random.normal(means, stds)
But I have the feeling I can lose the for loop and let numpy do it faster and in one call. Does anybody know how I should code that?
Alternatively, you can use the size argument:
import numpy as np
means = [ [0, 0, 0], [1, 1, 1] ]
std = [ [1, 1, 1], [1, 1, 1] ]
#100 samples
print(np.random.normal(means, std, size = (100, len(means), 3)))
You can repeat your means and stds arrays 1000 times, and then call np.random.normal() once:
import numpy as np
means = [[0, 0, 0],
         [1, 1, 1]]
stds = [[1, 1, 1],
        [2, 2, 2]]
means = np.array(means) * np.ones(1000)[:, None, None]
stds = np.array(stds) * np.ones(1000)[:, None, None]
samples = np.random.normal(means, stds)
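Note that samples then has shape (1000, N, 3); if you want the N x 3 x 1000 layout from the question, a final transpose does it:
samples = samples.transpose(1, 2, 0)  # shape (N, 3, 1000)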

Standardizing X different in Python Lasso and R glmnet?

I was trying to get the same result fitting a lasso model using Python's scikit-learn and R's glmnet. A helpful link
If I specify normalize=True in Python and standardize=T in R, they give me the same result.
Python:
from sklearn.linear_model import Lasso
X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha=0.01, fit_intercept=True, normalize=True)
reg.fit(X, y)
np.hstack((reg.intercept_, reg.coef_))
Out[95]: array([-0.89607695, 0. , -0.24743375, 1.03286824])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = T)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.8960770
V1 .
V2 -0.2474338
V3 1.0328682
However, if I don't want to standardize the variables and set normalize=False and standardize=F, they give me quite different results.
Python:
from sklearn.linear_model import Lasso
Z = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]])
y = np.array([1, 0, 0, 1])
reg = Lasso(alpha=0.01, fit_intercept=True, normalize=False)
reg.fit(Z, y)
np.hstack((reg.intercept_, reg.coef_))
Out[96]: array([-0.88 , 0.09384212, -0.36159299, 1.05958478])
R:
reg_glmnet = glmnet(X, y, alpha = 1, lambda = 0.02,standardize = F)
coef(reg_glmnet)
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -0.76000000
V1 0.04441697
V2 -0.29415542
V3 0.97623074
What's the difference between "normalize" in Python's Lasso and "standardize" in R's glmnet?
Currently, with regard to the normalize parameter, the docs state: "If you wish to standardize, please use StandardScaler before calling fit on an estimator with normalize=False."
So evidently normalize and standardize are not the same in sklearn.linear_model.Lasso. Having read the StandardScaler docs, I fail to understand the exact difference, but the fact that there is one is implied by the description of the normalize parameter.
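Following the docs' suggestion, here is a minimal sketch (my own illustration, not from the original answer) of standardizing explicitly with StandardScaler and then fitting with normalize=False; note the coefficients come out on the standardized scale:
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 1, 2], [3, 4, 2], [6, 5, 2], [5, 5, 3]], dtype=float)
y = np.array([1, 0, 0, 1], dtype=float)

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance per column
reg = Lasso(alpha=0.01, fit_intercept=True, normalize=False)
reg.fit(X_std, y)
print(np.hstack((reg.intercept_, reg.coef_)))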
