I was reading a few papers about recurrent neural networks (RNNs). At some point, I came across the following explanations:
The prediction model trained on sN is used to compute the error vectors for each point in the validation and test sequences. The error vectors are modelled to fit a multivariate Gaussian distribution N = N(μ, Σ). The likelihood p(t) of observing an error vector e(t) is given by the value of N at e(t) (similar to normalized innovations squared (NIS) used for novelty detection using Kalman filter based dynamic prediction model [5]). The error vectors for the points from vN1 are used to estimate the parameters μ and Σ using Maximum Likelihood Estimation.
And:
A Multivariate Gaussian Distribution is fitted to the error vectors on the validation set. y(t) is the probability of an error vector e(t) after applying the Multivariate Gaussian Distribution N = N(µ, Σ). Maximum Likelihood Estimation is used to select the parameters µ and Σ for the points from vN.
vN and vN1 are validation datasets; sN is the training dataset.
They are from two different articles but describe the same thing. I didn't really understand what they mean by fitting a Multivariate Gaussian Distribution to the data. What does it mean?
Many thanks,
Guillaume
Let's start with one-dimensional data first. If your data lie along a 1D line, they have a mean (µ) and a standard deviation (σ). Modeling them is then as simple as using (µ, σ) to generate new data points that follow your original distribution.
# Generating a new_point in a 1D Gaussian distribution
import random
mu, sigma = 1, 1.6
new_point = random.gauss(mu, sigma)
# 2.797757476598497
Now, in N-dimensional space, the multivariate normal distribution is a generalization of the one-dimensional case. The objective is to find a vector of N means µ and an N x N covariance matrix, this time denoted Σ, that model all data points in the N-dimensional space. Having them, you are able to generate as many random data points as you want following the main distribution. In Python/NumPy, you can do it like:
import numpy as np
mean = np.zeros(3)        # example: 3-dimensional mean vector
covariance = np.eye(3)    # example: 3x3 covariance matrix (identity)
new_data_point = np.random.multivariate_normal(mean, covariance, 1)
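To connect this back to the question: "fitting" the multivariate Gaussian simply means estimating µ and Σ from your data (here, the error vectors on the validation set). With Maximum Likelihood Estimation these are just the sample mean and sample covariance. A minimal sketch with made-up error vectors (the variable names are mine):
import numpy as np
from scipy.stats import multivariate_normal

errors = np.random.randn(1000, 4)                # stand-in for validation-set error vectors

mu = errors.mean(axis=0)                         # MLE of the mean vector
Sigma = np.cov(errors, rowvar=False, bias=True)  # MLE of the covariance matrix

# likelihood p(t) of a new error vector e_t under N(mu, Sigma)
e_t = np.array([0.1, -0.3, 0.2, 0.0])
p_t = multivariate_normal(mean=mu, cov=Sigma).pdf(e_t)
print(p_t)
In the anomaly-detection setting of the quoted papers, a low p(t) then flags e(t) as unusual with respect to the errors seen on normal validation data.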
Related
I would like to draw features (covariates) from a multivariate normal distribution such that each feature has a standard deviation of 1 and a pairwise correlation between 0.3 and 0.7.
One can use the following code snippet to draw from a multivariate normal:
rng = np.random.default_rng()
features = rng.multivariate_normal(np.zeros(n_vars), covar, size=n_obs)
However, this requires generating a (symmetric and positive semi-definite) covariance matrix with main-diagonal elements equal to 1 and off-diagonal elements between 0.3 and 0.7.
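Not part of the original question, but one straightforward way to obtain such a matrix is rejection sampling: draw the off-diagonal entries uniformly from [0.3, 0.7], symmetrize with a unit diagonal, and keep the first draw that is positive semi-definite. A sketch (the helper name random_corr_matrix is my own):
import numpy as np

rng = np.random.default_rng(0)
n_vars, n_obs = 5, 1000

def random_corr_matrix(n, low=0.3, high=0.7, max_tries=10000):
    # symmetric matrix, unit diagonal, off-diagonals in [low, high],
    # accepted only if all eigenvalues are non-negative (PSD)
    for _ in range(max_tries):
        c = rng.uniform(low, high, size=(n, n))
        c = np.triu(c, k=1)
        c = c + c.T + np.eye(n)
        if np.all(np.linalg.eigvalsh(c) >= 0):
            return c
    raise RuntimeError("no PSD matrix found; relax the bounds or reduce n")

covar = random_corr_matrix(n_vars)
features = rng.multivariate_normal(np.zeros(n_vars), covar, size=n_obs)
For a handful of variables this usually accepts a draw quickly; as n_vars grows, tightly constrained pairwise correlations are not always jointly feasible and the bounds may need to be relaxed.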
I realize there are several articles that demonstrate how to fit a GMM to a 1D Gaussian with sklearn ([1] and [2], to name a few). However, in all of those cases, the data are present as individual points drawn from a Gaussian distribution. In my case, I essentially have a frequency table (I'm working with spectroscopic data), where the distribution is Gaussian, but the individual points are unknown.
My distribution (i.e., the data I'm trying to fit) looks like this: [figure: a 1D Gaussian peak]
I'd like to use a GMM to deconvolve the two underlying Gaussian distributions that make up this peak.
So far, I've tried the following (assume my data is a 200x2 array, with position in one column and AFU in the other):
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
def gengmm(nc=4, n_iter=2):
    g = mixture.GMM(n_components=nc)  # number of components
    g.init_params = ""                # no initialization
    g.n_iter = n_iter                 # iterations of the EM method
    return g
I tried to see if I could fit this peak to just a single Gaussian:
g = gengmm(1, 100)
g.fit(data)
However, the mean and covariance I get back don't describe my data particularly well (notably, the mean of that fitted Gaussian is 127.5, which is not what a least-squares fit recovers).
Is there an easier way to do this? (I realize I can just use a least-squares fit to recover the initial Gaussian, but again, I'm ultimately trying to use this to determine the two underlying Gaussian distributions that make up the final one.)
Thanks!
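One common workaround (not from the thread above) is to resample pseudo-points from the frequency table, drawing positions with probability proportional to intensity, and then fit scikit-learn's current GaussianMixture to those points. A sketch with made-up peak parameters standing in for the 200x2 array:
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# stand-in for the question's 200x2 array: positions plus a peak built
# from two overlapping Gaussians (arbitrary parameters for illustration)
position = np.linspace(0, 100, 200)
intensity = (np.exp(-0.5 * ((position - 45) / 5) ** 2)
             + 0.7 * np.exp(-0.5 * ((position - 60) / 8) ** 2))

# turn the frequency table into pseudo-samples
weights = intensity / intensity.sum()
samples = rng.choice(position, size=20000, p=weights)

# fit a two-component mixture to the resampled points
gmm = GaussianMixture(n_components=2, random_state=0).fit(samples.reshape(-1, 1))
print(gmm.means_.ravel())
print(np.sqrt(gmm.covariances_.ravel()))
This only approximates a weighted fit, so for precise peak parameters a least-squares fit of a two-Gaussian model to the spectrum may still be the better tool.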
I have a database of features, a 2D np.array (2000 samples, each containing 100 features, i.e. 2000 x 100). I want to fit Gaussian distributions to my database using Python. My code is the following:
data = load_my_data() # loads a np.array of size 2000x100
clf = mixture.GaussianMixture(n_components= 50, covariance_type='full')
clf.fit(data)
I am not sure about the parameters, for example covariance_type, and how I can investigate whether the fit was successful or not.
EDIT: I debugged the code to investigate what is happening with clf.means_ and apparently it produced a matrix of shape n_components x n_features (50 x 100). Is there a way I can check that the fitting was successful, or plot the data? What are the alternatives to Gaussian mixtures (mixtures of exponentials, for example; I cannot find any available implementation)?
I think you are using the sklearn package.
Once you have called fit, type
print(clf.means_)
If it prints output, the model has been fitted; if it raises an error, it has not.
Hope this helps you.
You can also do dimensionality reduction with PCA down to, say, 3D and then plot the means together with the data.
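For what it's worth (not part of the answer above), recent versions of scikit-learn's GaussianMixture expose attributes that make this check more direct, and the bic/aic methods help compare different n_components or covariance_type settings. A sketch on stand-in data:
import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.randn(2000, 100)  # stand-in for the real 2000x100 feature array

clf = GaussianMixture(n_components=50, covariance_type='full', random_state=0)
clf.fit(data)

print(clf.converged_)    # True if EM converged within max_iter
print(clf.n_iter_)       # number of EM iterations actually used
print(clf.lower_bound_)  # final lower bound on the log-likelihood from EM
print(clf.bic(data))     # lower BIC = better trade-off of fit vs. complexity
Comparing bic(data) across a few candidate settings is a common way to choose the number of components.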
It is always preferable to choose a reduced set of candidate distributions before trying to identify the distribution (in other words, use a Cullen & Frey graph to reject the unlikely candidates), and then run a goodness-of-fit comparison and select the best result.
You can just create a list of all available distributions in scipy. An example with two distributions and random data:
import numpy as np
import scipy.stats as st

data = np.random.random(10000)

# Specify all candidate distributions here
distributions = [st.laplace, st.norm]

mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)  # negative log-likelihood at the fitted (maximum-likelihood) parameters
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, negative log-likelihood: {}'.format(best_fit[0].name, best_fit[1]))
I understand that you may want to regress one distribution against the other, rather than fit each to a parametric curve. If that is the case, you may be interested in plotting one against the other and running a linear (or polynomial) regression, then checking the coefficients.
A linear regression of the two distributions may tell you whether they are linearly dependent or not.
Linear Regression using Scipy documentation
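One possible reading of that suggestion (my interpretation, not spelled out above) is a Q-Q-style comparison: sort both samples and regress one on the other with scipy.stats.linregress, then inspect the slope, intercept and R²:
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.sort(rng.laplace(size=1000))  # quantiles of the first sample
y = np.sort(rng.normal(size=1000))   # quantiles of the second sample

result = stats.linregress(x, y)
print(result.slope, result.intercept, result.rvalue ** 2)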
From a sample of values of a random variable, I construct a density estimate using kernel density estimation.
cdf = gaussian_kde(sample)
What I need is to generate sample values of a random variable whose density function matches the constructed estimate. I know about the approach of inverting the cumulative distribution function, but since I cannot do it analytically it requires pretty complicated preparations. Is there an integrated solution, or maybe another way to accomplish the task?
If you're using a kernel density estimator (KDE) with Gaussian kernels, your density estimate is a Gaussian mixture model. This means that the density function is a weighted sum of 'mixture components', where each mixture component is a Gaussian distribution. In a typical KDE, there's a mixture component centered over each data point, and each component is a copy of the kernel. This distribution is easy to sample from without using the inverse CDF method. The procedure looks like this:
Setup
Let mu be a vector where mu[i] is the mean of mixture component i. In a KDE, this will just be the locations of the original data points.
Let sigma be a vector where sigma[i] is the standard deviation of mixture component i. In typical KDEs, this will be the kernel bandwidth, which is shared for all points (but variable-bandwidth variants do exist).
Let w be a vector where w[i] contains the weight of mixture component i. The weights must be positive and sum to 1. In a typical, unweighted KDE, all weights will be 1/(number of data points) (but weighted variants do exist).
Choose the number of random points to sample, n_total
Determine how many points will be drawn from each mixture component.
Let n be a vector where n[i] contains the number of points to sample from mixture component i.
Draw n from a multinomial distribution with "number of trials" equal to n_total and "success probabilities" equal to w. This means the number of points to draw from each mixture component will be randomly chosen, proportional to the component weights.
Draw random values
For each mixture component i:
Draw n[i] values from a normal distribution with mean mu[i] and standard deviation sigma[i]
Shuffle the list of random values, so they have random order.
This procedure is relatively straightforward because random number generators (RNGs) for multinomial and normal distributions are widely available. If your kernels aren't Gaussian but some other probability distribution, you can replicate this strategy, replacing the normal RNG in step 4 with a RNG for that distribution (if it's available). You can also use this procedure to sample from mixture models in general, not just KDEs.
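Here is a minimal sketch of that procedure in NumPy; the helper name sample_gaussian_kde and the bandwidth value are my own choices, not from the answer. (scipy.stats.gaussian_kde also provides a resample() method that does essentially this for you.)
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_kde(points, bandwidth, n_total, weights=None):
    # setup: component means are the data points; all share one bandwidth
    points = np.asarray(points)
    if weights is None:
        weights = np.full(len(points), 1.0 / len(points))  # unweighted KDE
    # choose how many draws come from each mixture component (multinomial)
    counts = rng.multinomial(n_total, weights)
    # draw from each component's normal distribution
    draws = np.concatenate([rng.normal(mu, bandwidth, size=k)
                            for mu, k in zip(points, counts) if k > 0])
    # shuffle so the values are in random order
    rng.shuffle(draws)
    return draws

sample = rng.normal(5, 2, size=200)  # original data
new_values = sample_gaussian_kde(sample, bandwidth=0.5, n_total=1000)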
I'm pretty new to PyMC and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits my observed data, not with a pre-built normal distribution, but with a more general method that uses histograms of simulated data to build PDFs. But so far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5,sigma=2). I want to retrieve these values (mean, sigma) with a bayesian inference (using MCMC).
I have a data simulator that generates, for each iteration of the MCMC process, a normal distribution of 5000 points with a random mean and sigma (uniform priors).
From the simulated distribution of points I compute a numpy histogram normed to 1 representing the pdf of the distribution (Nbins=int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x) = Σ_i ln f(x_i|θ) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I linearly interpolate the histogram values at every bin center, so I have a continuous pdf for the simulated distribution. Here f is the interpolated function I build from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the loglikelihood value at the end.
But some (real) data points are so far off the mean of the simulated distribution that f(xi)=0. For these points the code raises a math domain error (Reminder: log(0)=-inf). So I artificially set the pdf to a small epsilon for the points where it's usually set to 0.
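For illustration, that histogram-interpolated log-likelihood (with the epsilon floor) can be sketched standalone, outside PyMC; the helper name hist_loglike and the epsilon value below are my own choices:
import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline

def hist_loglike(observed, simulated, eps=1e-12):
    # build a pdf from a histogram of the simulated points, linearly
    # interpolated between bin centres
    hist, edges = np.histogram(simulated, bins=int(np.sqrt(len(simulated))),
                               density=True)
    centres = (edges[:-1] + edges[1:]) / 2
    pdf = InterpolatedUnivariateSpline(centres, hist, k=1, ext=0)
    # floor the pdf at eps so that log() stays finite for far-off points
    f = np.clip(pdf(observed), eps, None)
    return np.sum(np.log(f))

rng = np.random.default_rng(0)
observed = rng.normal(5, 2, 5000)
print(hist_loglike(observed, rng.normal(5, 2, 5000)))  # good parameters
print(hist_loglike(observed, rng.normal(0, 1, 5000)))  # poor parameters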
But here's the thing: the log-likelihood is not computed at every iteration. Actually, it is not computed at all in the present architecture of my code, and that's why the MCMC process is not converging. But I don't know why it is never called.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
#prior
m=pm.Uniform('m', lower=4, upper=6)
s=pm.Uniform('s', lower=1, upper=3)
@pm.deterministic
def data_simulator(mean_input=m, sig_input=s):
    out = np.empty(4, dtype=object)
    datasim = np.random.normal(mean_input, sig_input, N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:])/2
    m_sim = np.mean(datasim)
    s_sim = np.std(datasim)
    out[0] = m_sim
    out[1] = s_sim
    out[2] = bin_centers
    out[3] = hist
    return out
@pm.stochastic(observed=True)
def logp(value=data, mean_output=data_simulator.value[0], sigma_output=data_simulator.value[1], bin_centers_sim=data_simulator.value[2], hist_sim=data_simulator.value[3]):
    interp_sim = InterpolatedUnivariateSpline(bin_centers_sim, hist_sim, k=1, ext=0)  # returns the extrapolated values
    logp = np.sum(np.log(interp_sim(value)))
    print('logp =', logp)
    return logp
model = pm.Model({"mean": m, "sigma": s, "data_simulator": data_simulator, "loglikelihood": logp})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)