PyMC3 Dirichlet Process Multivariate Gaussian Mixture Model

I'm having trouble getting the shapes to work in a Dirichlet process Gaussian mixture model. My data array observations has shape (number of samples, number of dimensions). Each Gaussian's mean should be drawn from an isotropic prior, and each Gaussian's covariance should be the identity matrix. I thought I had set this up correctly, but I'm getting the following error:
Input dimension mis-match. (input[0].shape[1] = 13, input[1].shape[1] = 2)
My code is:
import numpy as np
import pymc3 as pm
import theano.tensor as tt

num_obs, obs_dim = observations.shape
max_num_clusters = 13

def stick_breaking(beta):
    portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(1 - beta)[:-1]])
    return beta * portion_remaining

with pm.Model() as model:
    # beta, observations, gaussian_mean_prior_cov_scaling and gaussian_cov_scaling
    # are defined elsewhere in the script
    w = pm.Deterministic("w", stick_breaking(beta))
    cluster_means = pm.MvNormal('cluster_means',
                                mu=pm.floatX(np.zeros(obs_dim)),
                                cov=pm.floatX(gaussian_mean_prior_cov_scaling * np.eye(obs_dim)),
                                shape=(max_num_clusters, obs_dim))
    comp_dists = pm.MvNormal.dist(mu=cluster_means,
                                  cov=gaussian_cov_scaling * np.eye(obs_dim),
                                  shape=(max_num_clusters, obs_dim))
    obs = pm.Mixture(
        "obs",
        w=w,
        comp_dists=comp_dists,
        observed=observations,
        shape=obs_dim)
Can someone clarify how to get the shapes to work?
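One hedged workaround (my addition, not a confirmed fix from this thread): because every component here has covariance gaussian_cov_scaling times the identity, the component log-densities have a simple closed form, so you can bypass pm.Mixture's multivariate shape handling and marginalize the cluster assignments yourself with logsumexp plus pm.Potential. The Beta(1, alpha) stick-breaking prior is my assumption; stick_breaking, observations and the scaling constants are taken from the question.
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    alpha = pm.Gamma("alpha", 1.0, 1.0)
    beta = pm.Beta("beta", 1.0, alpha, shape=max_num_clusters)
    w = pm.Deterministic("w", stick_breaking(beta))

    cluster_means = pm.MvNormal("cluster_means",
                                mu=pm.floatX(np.zeros(obs_dim)),
                                cov=pm.floatX(gaussian_mean_prior_cov_scaling * np.eye(obs_dim)),
                                shape=(max_num_clusters, obs_dim))

    # squared distance from every observation to every cluster mean: shape (N, K)
    diffs = observations[:, None, :] - cluster_means.dimshuffle("x", 0, 1)
    sq_dist = tt.sum(diffs ** 2, axis=-1)

    # log N(x_n | mu_k, gaussian_cov_scaling * I) for every (n, k) pair
    log_comp = (-0.5 * obs_dim * tt.log(2 * np.pi * gaussian_cov_scaling)
                - 0.5 * sq_dist / gaussian_cov_scaling)

    # mixture log-likelihood: sum_n log sum_k w_k N(x_n | mu_k, ...)
    pm.Potential("obs_loglike", tt.sum(pm.math.logsumexp(tt.log(w) + log_comp, axis=1)))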

Related

Weak PYMC3 Estimates for Large Datasets

I generated a dataset from a known Weibull distribution:
Weibull(alpha = A * SI^(-n), beta), where A = 1800, n = 0.5, beta = 1.5, and SI = 1000. Here is the link to the dataset (DF1).
I tried to estimate the parameters of the distribution in a Bayesian analysis using PyMC3; my code is below. The Bayesian estimates are very good when the dataset is small (100 data points) but drift away from the true values when the dataset is larger (500 data points).
For the larger dataset I tried to get better estimates by increasing the number of samples to 10000, tune to 10000, and target_accept to 0.99, but the estimates did not change significantly and were still far from the true values. Does anyone know how to set the parameters of pm.sample() to get better estimates for the larger dataset?
import warnings
import pandas as pd
import arviz as az
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from pymc3 import Model, Normal, Slice, sample
from pymc3.distributions import Interpolated
from scipy import stats

SI = 1000
ns1 = round(DF1['biased drops'], 2)
SIs = round(DF1['SIs'], 2)

def logp(SIs, ns1, SI):
    summ1 = 0
    for i in range(0, len(DF1)):
        print(i)
        F = DF1['failure'][i]
        nu = (ns1[i]) * (SIs[i] / SI) ** n
        PDF = (B * nu ** (B - 1)) / (A * SI ** -n) ** B
        R = np.exp(-(nu / (A * SI ** -n)) ** B)
        logLik = np.log((PDF ** F) * R)
        summ1 += logLik
    return summ1

with pm.Model() as model_ss1:
    MuB = pm.Uniform('MuB', lower=1, upper=3)
    SigmaB = pm.HalfNormal("SigmaB", 2/3)
    B = pm.Normal('B', mu=MuB, sigma=SigmaB)
    MuA = pm.Uniform('MuA', lower=400, upper=2000)
    SigmaA = pm.HalfNormal("SigmaA", 400)
    A = pm.Normal('A', mu=MuA, sigma=SigmaA)
    Mun = pm.Uniform('Mun', lower=0.2, upper=0.8)
    Sigman = pm.HalfNormal("Sigman", 0.16)
    n = pm.Normal('n', mu=Mun, sigma=Sigman)
    y = pm.DensityDist('y', logp, observed={'SI': SI, 'SIs': SIs.values.astype(int), 'ns1': ns1.values.astype(int)})
    trace_ss1 = pm.sample(1000, tune=1000, chains=2)

Bi = pm.summary(trace_ss1, var_names=['B'])['mean'][0]
Ai = pm.summary(trace_ss1, var_names=['A'])['mean'][0]
ni = pm.summary(trace_ss1, var_names=['n'])['mean'][0]
az.plot_trace(trace_ss1, var_names=['B', 'A', 'n'])
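Not an answer to the estimation question itself, but one structural note: the Python for-loop in logp assembles the likelihood term by term (and prints on every evaluation). Below is a minimal sketch of the same log-likelihood written with Theano tensor ops (my rewrite, assuming DF1['failure'] holds the F values used in the loop); it keeps the same DensityDist interface and the same reliance on A, B and n from the enclosing model.
def logp_vectorized(SIs, ns1, SI):
    # F, A, B and n are resolved from the surrounding scope, as in the original logp
    F = DF1['failure'].values
    nu = ns1 * (SIs / SI) ** n
    scale = A * SI ** (-n)
    log_pdf = tt.log(B) + (B - 1) * tt.log(nu) - B * tt.log(scale)
    log_R = -(nu / scale) ** B
    return tt.sum(F * log_pdf + log_R)

# used exactly like the original:
# y = pm.DensityDist('y', logp_vectorized,
#                    observed={'SI': SI, 'SIs': SIs.values.astype(int), 'ns1': ns1.values.astype(int)})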

Plot unimodal distributions determined from a multimodal distribution

I've used GaussianMixture to analyze a multimodal distribution. From the GaussianMixture class I can access the means and covariances via the attributes means_ and covariances_. How can I use them to plot the two underlying unimodal distributions?
I thought of using scipy.stats.norm, but I don't know what to pass as the loc and scale parameters. The desired output would be analogous to what is shown in the attached figure.
The example code in this question was modified from the answer here.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from scipy.stats import norm
ls = np.linspace(0, 60, 1000)
multimodal_norm = norm.pdf(ls, 0, 5) + norm.pdf(ls, 20, 10)
plt.plot(ls, multimodal_norm)
# concatenate ls and multimodal to form an array of samples
# the shape is [n_samples, n_features]
# we reshape them to create an additional axis and concatenate along it
samples = np.concatenate([ls.reshape((-1, 1)), multimodal_norm.reshape((-1,1))], axis=-1)
print(samples.shape)
gmix = mixture.GaussianMixture(n_components = 2, covariance_type = "full")
fitted = gmix.fit(samples)
print(fitted.means_)
print(fitted.covariances_)
# The idea is something like the following (not working):
new_norm1 = norm.pdf(ls, fitted.means_, fitted.covariances_)
new_norm2 = norm.pdf(ls, fitted.means_, fitted.covariances_)
plt.plot(ls, new_norm1, label='Norm 1')
plt.plot(ls, new_norm2, label='Norm 2')
It is not entirely clear what you are trying to accomplish. You are fitting a GaussianMixture model to the concatenation of a uniform grid and the summed pdf values of two Gaussians evaluated on that grid. That is not how a Gaussian mixture model is meant to be fitted: typically one fits the model to random observations drawn from some distribution (usually unknown, though it can be a simulated one).
Let me assume that you want to fit the GaussianMixture model to a sample drawn from a Gaussian mixture distribution, presumably to test how well the fit works when you know what the expected outcome is. Here is code that both simulates the right distribution and fits the model. It prints the parameters recovered by the fit; we observe that they are indeed close to the ones used to simulate the sample. A plot of the fitted mixture density is generated at the end.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture
from scipy.stats import norm
# set simulation parameters
mean1, std1, w1 = 0,5,0.5
mean2, std2, w2 = 20,10,1-w1
# simulate constituents
n_samples = 100000
np.random.seed(2021)
gauss_sample_1 = np.random.normal(loc = mean1,scale = std1,size = n_samples)
gauss_sample_2 = np.random.normal(loc = mean2,scale = std2,size = n_samples)
binomial = np.random.binomial(n=1, p=w1, size = n_samples)
# simulate the Gaussian mixture by picking from the two components
multimodal_samples = (gauss_sample_1 * binomial + gauss_sample_2 * (1 - binomial)).reshape(-1, 1)
# define and fit the mixture model
gmix = mixture.GaussianMixture(n_components=2, covariance_type="full")
fitted = gmix.fit(multimodal_samples)
print('fitted means:',fitted.means_[0][0],fitted.means_[1][0])
print('fitted stdevs:',np.sqrt(fitted.covariances_[0][0][0]),np.sqrt(fitted.covariances_[1][0][0]))
print('fitted weights:',fitted.weights_)
# Plot component pdfs and a joint pdf
ls = np.linspace(-50, 50, 1000)
new_norm1 = norm.pdf(ls, fitted.means_[0][0], np.sqrt(fitted.covariances_[0][0][0]))
new_norm2 = norm.pdf(ls, fitted.means_[1][0], np.sqrt(fitted.covariances_[1][0][0]))
multi_pdf = w1*new_norm1 + (1-w1)*new_norm2
plt.plot(ls, new_norm1, label='Norm pdf 1')
plt.plot(ls, new_norm2, label='Norm pdf 2')
plt.plot(ls, multi_pdf, label='multi-norm pdf')
plt.legend(loc = 'best')
plt.show()
The results are
fitted means: 22.358448018824642 0.8607494960575028
fitted stdevs: 8.770962351118127 5.58538485134623
fitted weights: [0.42517515 0.57482485]
As we can see, they are close, up to ordering (which the model of course cannot recover, and which is irrelevant anyway), to the values that went into the simulation:
mean1, std1, w1 = 0,5,0.5
mean2, std2, w2 = 20,10,1-w1
And here is the plot of the density and its parts. Recall that the pdf of the Gaussian mixture is not the sum of the component pdfs but their weighted average, with weights w1 and 1 - w1.
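As an optional visual check (my addition, not part of the original answer), the fitted mixture density, built from the fitted weights rather than the true w1, can be overlaid on a histogram of the simulated samples:
# multimodal_samples, fitted, ls and norm come from the answer's code above
fitted_pdf = (fitted.weights_[0] * norm.pdf(ls, fitted.means_[0][0], np.sqrt(fitted.covariances_[0][0][0]))
              + fitted.weights_[1] * norm.pdf(ls, fitted.means_[1][0], np.sqrt(fitted.covariances_[1][0][0])))
plt.hist(multimodal_samples.ravel(), bins=100, density=True, alpha=0.3, label='samples')
plt.plot(ls, fitted_pdf, label='fitted mixture pdf')
plt.legend(loc='best')
plt.show()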

Python: Develop a Multiple Linear Regression Model From Scratch

I am trying to create a multiple linear regression model from scratch in Python. Dataset used: the Boston Housing dataset from sklearn. Since my focus was on model building, I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a LinearRegression model to find the weights for each feature.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X=load_boston()
data=pd.DataFrame(X.data,columns=X.feature_names)
y=X.target
data.head()
#dropping three features
data=data.drop(['INDUS','NOX','AGE'],axis=1)
#new shape of the data (506,10) not including the target variable
#Passed the whole dataset to Linear Regression Model
model_lr=LinearRegression()
model_lr.fit(data,y)
model_lr.score(data,y)
0.7278959820021539
model_lr.intercept_
22.60536462807957 #----- intercept value
model_lr.coef_
array([-0.09649731, 0.05281081, 2.3802989 , 3.94059598, -1.05476566,
0.28259531, -0.01572265, -0.75651996, 0.01023922, -0.57069861]) #--- coefficients
Now I wanted to calculate the coefficients manually in Excel before creating the model in Python. To calculate the weight of each feature I used the simple-linear-regression slope formula (shown as an image in the original post):
b_j = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2)
To calculate the intercept I used the formula
b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
The intercept value from my calculations was 22.63551387 (almost the same as the model's).
The problem is that the feature weights from my calculation are far off from those of the sklearn linear model.
-0.002528644 #-- CRIM
-0.001028914 #-- Zn
-0.038663314 #-- CHAS
-0.035026972 #-- RM
-0.014275311 #-- DIS
-0.004058291 #-- RAD
-0.000241103 #-- TAX
-0.015035534 #-- PTRATIO
-0.000318376 #-- B
-0.006411897 #-- LSTAT
Using the first row as test data to check my calculations, I get 22.73167044199992, while the LinearRegression model predicts 30.42657776. The original value is 24.
As soon as I check other rows, though, the sklearn model shows much more variation, while the predictions made with my calculated weights all stay close to 22.
I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all the coefficients from my calculation so close to 0?
Here is my code for calculating the coefficients (beginner here):
x_1 = []
x_2 = []
for i, j in zip(data['CRIM'], y):
    mean_x = data['CRIM'].mean()
    mean_y = np.mean(y)
    c = i - mean_x * (j - mean_y)
    d = (i - mean_x) ** 2
    x_1.append(c)
    x_2.append(d)
print(sum(x_1) / sum(x_2))
Thank you for reading this long post, I appreciate it.
It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.
EDIT
Now, after seeing your code for the coefficient calculation, the problem is clearer.
You cannot use that equation to calculate the coefficient of each feature independently of the others, because each coefficient depends on all the features. I suggest you take a look at the derivation of the solution to this least-squares optimization problem in the simple case here and in the general case here. And as a general tip, stick with the matrix implementation whenever you can, as it is radically more efficient.
However, in this case we have a 10-dimensional feature vector, so in matrix notation the solution becomes
β = (X^T X)^-1 X^T y
See the derivation here.
I suspect you made some computational error, as implementing this in Python with the scalar formula is more tedious and error-prone than the matrix equivalent. But since you haven't shared that piece of your code, it's hard to know for sure.
Here's an example of how you would implement it:
def calc_coefficients(X, Y):
    X = np.mat(X)
    Y = np.mat(Y)
    return np.dot((np.dot(np.transpose(X), X)) ** (-1), np.transpose(np.dot(Y, X)))

def score_r2(y_pred, y_true):
    ss_tot = np.power(y_true - y_true.mean(), 2).sum()
    ss_res = np.power(y_true - y_pred, 2).sum()
    return 1 - ss_res / ss_tot
X = np.ones(shape=(506,11))
X[:,1:] = data.values
B=calc_coefficients(X,y)
##### Coefficients
B[:]
matrix([[ 2.26053646e+01],
[-9.64973063e-02],
[ 5.28108077e-02],
[ 2.38029890e+00],
[ 3.94059598e+00],
[-1.05476566e+00],
[ 2.82595310e-01],
[-1.57226536e-02],
[-7.56519964e-01],
[ 1.02392192e-02],
[-5.70698610e-01]])
#### Intercept
B[0]
matrix([[22.60536463]])
y_pred = np.dot(np.transpose(B),np.transpose(X))
##### First 5 rows predicted
np.array(y_pred)[0][:5]
array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])
##### First 5 rows Ground Truth
y[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
### R^2 score
score_r2(y_pred,y)
0.7278959820021539
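A quick sanity check (my addition): the closed-form coefficients in B should match the sklearn fit from the question up to floating-point noise, since both solve the same least-squares problem on the same data.
print(np.allclose(np.asarray(B).ravel()[1:], model_lr.coef_))     # expected to print True
print(np.isclose(np.asarray(B).ravel()[0], model_lr.intercept_))  # expected to print True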
Complete Solution - 2020 - Boston dataset
As the other answer said, to compute the coefficients for the linear regression you have to compute
β = (X^T X)^-1 X^T y
This gives you the coefficients (all the B values for the features plus the intercept).
Be sure to add a column of ones to X to compute the intercept (more in the code below).
Main.py
from sklearn.datasets import load_boston
import numpy as np
from CustomLibrary import CustomLinearRegression
from CustomLibrary import CustomMeanSquaredError
boston = load_boston()
X = np.array(boston.data, dtype="f")
Y = np.array(boston.target, dtype="f")
regression = CustomLinearRegression()
regression.fit(X, Y)
print("Projection matrix sk:", regression.coefficients, "\n")
print("bias sk:", regression.intercept, "\n")
Y_pred = regression.predict(X)
loss_sk = CustomMeanSquaredError(Y, Y_pred)
print("Model performance:")
print("--------------------------------------")
print("MSE is {}".format(loss_sk))
print("\n")
CustomLibrary.py
import numpy as np

class CustomLinearRegression():
    def __init__(self):
        self.coefficients = None
        self.intercept = None

    def fit(self, x, y):
        x = self.add_one_column(x)
        x_T = np.transpose(x)
        inverse = np.linalg.inv(np.dot(x_T, x))
        pseudo_inverse = inverse.dot(x_T)
        coef = pseudo_inverse.dot(y)
        self.intercept = coef[0]
        self.coefficients = coef[1:]
        return coef

    def add_one_column(self, x):
        '''
        fit() returns one coefficient per column of x plus the intercept,
        so to get the intercept we prepend a column of ones to x.
        '''
        X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
        X[:, 1:] = x
        return X

    def predict(self, x):
        predicted = np.array([])
        for sample in x:
            result = self.intercept
            for idx, feature_value_in_sample in enumerate(sample):
                result += feature_value_in_sample * self.coefficients[idx]
            predicted = np.append(predicted, result)
        return predicted

def CustomMeanSquaredError(Y, Y_pred):
    mse = 0
    for idx, data in enumerate(Y):
        mse += (data - Y_pred[idx]) ** 2
    return mse * (1 / len(Y))
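As an optional cross-check (my addition, assuming Main.py has been run so that X, Y and regression exist), the custom solution can be compared against sklearn's LinearRegression on the same data; the coefficients and intercept should agree closely.
from sklearn.linear_model import LinearRegression

sk = LinearRegression().fit(X, Y)
print("sklearn coefficients:", sk.coef_)
print("custom  coefficients:", regression.coefficients)
print("sklearn intercept:", sk.intercept_, "custom intercept:", regression.intercept)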

Generating random matrices for VAR(p) in PyMC3

I am trying to build a simple VAR(p) model using pymc3, but I'm getting cryptic errors about incompatible dimensions. I suspect the issue is that I'm not generating the random matrices properly. Here is an attempt at VAR(1); any help would be welcome:
import numpy
import pymc3

# generate some data
y_full = numpy.zeros((2, 100))
t = numpy.linspace(0, 2 * numpy.pi, 100)
y_full[0, :] = numpy.cos(5 * t) + numpy.random.randn(100) * 0.02
y_full[1, :] = numpy.sin(6 * t) + numpy.random.randn(100) * 0.01
y_obs = y_full[:, 1:]
y_lag = y_full[:, :-1]

with pymc3.Model() as model:
    beta = pymc3.MvNormal('beta', mu=numpy.ones(4), cov=numpy.ones((4, 4)), shape=(4,))
    mu = pymc3.Deterministic('mu', beta.reshape((2, 2)).dot(y_lag))
    y = pymc3.MvNormal('y', mu=mu, cov=numpy.eye(2), observed=y_obs)
The last line should be
y = pm.MvNormal('y',mu=mu.T, cov=np.eye(2),observed=y_obs.T)
MvNormal interprets the last dimension as the dimension of the multivariate normal vectors. Because of how NumPy indexing behaves, y_obs as written is a collection of 2 vectors of length 100 (y_lag[i].shape == (100,)), whereas the model needs 100 observations of length-2 vectors, hence the transposes.
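Putting the fix together, here is a minimal sketch of the corrected VAR(1) model (my assembly, not verified against the original data; note the all-ones prior covariance is also replaced by an identity matrix, since numpy.ones((4, 4)) is singular and MvNormal cannot factor it):
import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # prior over the flattened 2x2 VAR(1) coefficient matrix
    beta = pm.MvNormal('beta', mu=np.ones(4), cov=np.eye(4), shape=4)
    A = beta.reshape((2, 2))
    mu = pm.Deterministic('mu', A.dot(y_lag))          # shape (2, 99)
    # transpose so that the trailing axis is the 2-dimensional observation vector
    y = pm.MvNormal('y', mu=mu.T, cov=np.eye(2), observed=y_obs.T)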

class PCA matplotlib for face recognition

I am trying to do face recognition with principal component analysis (PCA) in Python, using the PCA class found in matplotlib.mlab. Here is its documentation:
class matplotlib.mlab.PCA(a)
compute the SVD of a and store data for PCA. Use project to project the data onto a reduced set of dimensions
Inputs:
a: a numobservations x numdims array
Attrs:
a : a centered unit sigma version of input a
numrows, numcols: the dimensions of a
mu : a numdims array of means of a
sigma : a numdims array of standard deviations of a
fracs : the proportion of variance of each of the principal components
Wt : the weight vector for projecting a numdims point or array into PCA space
Y : a projected into PCA space
The factor loadings are in the Wt factor, ie the factor loadings for the 1st principal component are given by Wt[0]
And here is my code:
import os
from PIL import Image
import numpy as np
import glob
import numpy.linalg as linalg
from matplotlib.mlab import PCA
#Step 1: put database images into a 2D array
filenames = glob.glob('C:\\Users\\Karim\\Downloads\\att_faces\\New folder/*.pgm')
filenames.sort()
img = [Image.open(fn).convert('L').resize((90, 90)) for fn in filenames]
images = np.asarray([np.array(im).flatten() for im in img])
#Step 2: database PCA
results = PCA(images.T)
w = results.Wt
#Step 3: input image
input_image = Image.open('C:\\Users\\Karim\\Downloads\\att_faces\\1.pgm').convert('L')
input_image = np.asarray(input_image)
#Step 4: input image PCA
results_in = PCA(input_image)
w_in = results_in.Wt
#Step 5: Euclidean distance
d = np.sqrt(np.sum(np.asarray(w - w_in)**2, axis=1))
But I am getting an error:
Traceback (most recent call last):
File "C:/Users/Karim/Desktop/Bachelor 2/New folder/matplotlib_pca.py", line 32, in <module>
d = np.sqrt(np.sum(np.asarray(w - w_in)**2, axis=1))
ValueError: operands could not be broadcast together with shapes (30,30) (92,92)
Can anyone help me correct the error?
Is this the correct way to do face recognition?
The error is telling you that the two arrays (w and w_in) are not the same size, so NumPy cannot figure out how to broadcast the values to take the difference.
I am not familiar with this function, but I would guess that the source of the issue is that your input images are different sizes.
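For what it's worth, a hedged alternative sketch (my addition, not using matplotlib.mlab.PCA): the usual eigenfaces recipe is to fit PCA once on the database and project both the database images and the probe image with that same fitted model, so distances are computed in one shared space. sklearn's PCA is used here because it accepts more features than samples; the 20-component choice is an assumption, and images/filenames are reused from the question's code.
from sklearn.decomposition import PCA as SkPCA

n_components = 20                                            # assumption: database has at least 20 images
pca = SkPCA(n_components=n_components)
db_proj = pca.fit_transform(images.astype(float))            # (n_images, n_components)

probe = Image.open('C:\\Users\\Karim\\Downloads\\att_faces\\1.pgm').convert('L').resize((90, 90))
probe_proj = pca.transform(np.array(probe, dtype=float).reshape(1, -1))

d = np.sqrt(np.sum((db_proj - probe_proj) ** 2, axis=1))     # distance to every database face
print(filenames[int(np.argmin(d))])                          # closest match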
