How to infer the parameters of a 1D Gaussian distribution using PyMC? - python

I'm pretty new to PyMC and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits my observed data, not with a pre-built normal distribution, but with a more general method that uses histograms of simulated data to build PDFs. So far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points drawn from a normal distribution (mean=5, sigma=2). I want to retrieve these values (mean, sigma) via Bayesian inference (using MCMC).
For each iteration of the MCMC process, a data simulator generates a normal distribution of 5000 points with a random mean and sigma (uniform priors).
From the simulated points I compute a numpy histogram normalized to 1, which represents the PDF of the distribution (Nbins=int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x) = ∑_i ln f(x_i|θ) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I linearly interpolate the histogram values at the bin centers, which gives me a continuous PDF for the simulated distribution. So here f is the interpolated function built from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the loglikelihood value at the end.
But some (real) data points lie so far from the mean of the simulated distribution that f(x_i)=0 there. For these points the code raises a math domain error (reminder: log(0)=-inf), so I artificially set the PDF to a small epsilon wherever it would otherwise be 0.
But here's the thing: the log-likelihood is not computed at every iteration. In fact, with the present architecture of my code, it is not computed at all, which is why the MCMC process is not converging. But I don't understand why it is never called.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
# Priors
m = pm.Uniform('m', lower=4, upper=6)
s = pm.Uniform('s', lower=1, upper=3)

@pm.deterministic
def data_simulator(mean_input=m, sig_input=s):
    # Simulate a dataset with the proposed parameters and summarize it
    out = np.empty(4, dtype=object)
    datasim = np.random.normal(mean_input, sig_input, N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    m_sim = np.mean(datasim)
    s_sim = np.std(datasim)
    out[0] = m_sim
    out[1] = s_sim
    out[2] = bin_centers
    out[3] = hist
    return out

@pm.stochastic(observed=True)
def loglikelihood(value=data,
                  mean_output=data_simulator.value[0],
                  sigma_output=data_simulator.value[1],
                  bin_centers_sim=data_simulator.value[2],
                  hist_sim=data_simulator.value[3]):
    # Linear interpolation of the simulated histogram; ext=0 returns extrapolated values
    interp_sim = InterpolatedUnivariateSpline(bin_centers_sim, hist_sim, k=1, ext=0)
    # Clip vanishing densities to a small epsilon so log() stays finite, as described above
    pdf_vals = np.clip(interp_sim(value), 1e-10, None)
    logp = np.sum(np.log(pdf_vals))
    print 'logp =', logp
    return logp

model = pm.Model({"mean": m, "sigma": s, "data_simulator": data_simulator, "loglikelihood": loglikelihood})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)
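For comparison, here is a minimal sketch of the "pre-built distribution" approach mentioned above, using the classic PyMC 2 API (whose Normal is parameterized by the precision tau = 1/sigma^2). This is only an illustration of the conventional route, not the histogram-based method the question is about:

import numpy as np
import pymc as pm

np.random.seed(0)
data = np.random.normal(5., 2., 5000)

m = pm.Uniform('m', lower=4, upper=6)
s = pm.Uniform('s', lower=1, upper=3)

@pm.deterministic
def tau(s=s):
    # PyMC 2's Normal uses precision rather than standard deviation
    return 1.0 / s**2

obs = pm.Normal('obs', mu=m, tau=tau, value=data, observed=True)

mcmc = pm.MCMC([m, s, tau, obs])
mcmc.sample(iter=10000, burn=5000)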

Related

How does scipy.stats distribution fitting work exactly?

I'm interested in the tail distribution of some given data, so I tried using scipy.stats to fit my data to a Gaussian, Generalized extreme value distribution, and a Generalized Pareto distribution.
This is what the data looks like:
Data Histogram
This is what I tried
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

data = df.loc[:, 'X']  # df is the original DataFrame
v = np.ceil(np.log2(len(data))) + 1  # Sturges' rule: an "adequate" number of bins to visualize the data's distribution
y, x = np.histogram(data, bins=int(v), density=True)  # "clustering" the data for the plot
plt.hist(data, bins=11, density=True)
plt.title("Histogram")
plt.show()
x = (x + np.roll(x, -1))[:-1] / 2.0  # midpoint of every bin interval, used as the x-axis reference for its corresponding y probability
# =============================================================================
# Fitting our data and plotting the PDFs
# =============================================================================
fit1 = stats.genextreme.fit(data, floc=0)  # fit() finds the optimal parameters (via MLE) for the chosen probability distribution
fit2 = stats.norm.fit(data)
fit3 = stats.genpareto.fit(data, floc=0)
fit4 = stats.weibull_min.fit(data, floc=0)
fit5 = stats.exponweib.fit(data, floc=0)
fit6 = stats.gumbel_r.fit(data, floc=0)
fit7 = stats.gumbel_l.fit(data, floc=0)
....
At first I got some strange results because I hadn't fixed the location parameter to 0; I still don't exactly understand why.
What surprised me the most, though, is that genextreme and weibull_min gave me different results, when I thought Weibull was a special case of the generalized extreme value distribution with a positive shape parameter.
Especially since the Weibull fit seems to work better here.
Here is the Weibull Fit:
Weibull Fit
And this is the GEV Fit:
GEV Fit
Actually the GEV Fit was similar to the Gumbel_r one:
Gumbel_r Fit
I read that one can deduce whether weibull_min or weibull_max should be used just from the shape of the data's histogram; how does one do that?
Since I am interested in extreme positive values (Tail distribution), shouldn't I be using Weibull_max since that's the limiting distribution of the maximum?
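Not an answer to the Weibull-vs-GEV question itself, but one quantitative way to compare the candidate fits is the Kolmogorov-Smirnov statistic (smaller means the fitted CDF is closer to the empirical one). A sketch on synthetic data, since the original df is not available here:

import scipy.stats as stats

# synthetic stand-in for the real data
data = stats.weibull_min.rvs(1.5, loc=0, scale=2, size=5000, random_state=0)

candidates = {
    'genextreme': stats.genextreme.fit(data, floc=0),
    'weibull_min': stats.weibull_min.fit(data, floc=0),
    'gumbel_r': stats.gumbel_r.fit(data),
}
for name, params in candidates.items():
    ks = stats.kstest(data, name, args=params)
    print(name, 'KS statistic:', ks.statistic)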

Fitting a theoretical distribution to a sampled empirical CDF with scipy stats

I have a plot for the CDF distribution of packet losses. I thus do not have the original data or the CDF model itself but samples from the CDF curve. (The data is extracted from plots published in literature.)
I want to find which distribution and with what parameters offers the closest fit to the CDF samples.
I've seen that scipy.stats distributions offer a fit(data) method, but all the examples apply to raw data points, with the PDF/CDF subsequently drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.
Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?
What alternatives could I use to find a matching known distribution?
I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.
If you have some data points and know the distribution its not hard to do using scipy. If you don't know the distribution, you could just iterate over all distributions until you find one which works reasonably well.
We can define functions of the form required for scipy.optimize.curve_fit. I.e., the first argument should be x, and then the other arguments are parameters.
I use this function to generate some test data based on the CDF of a normal random variable with a bit of added noise.
import numpy as np
import scipy.stats
import scipy.optimize

n = 100
x = np.linspace(-4, 4, n)
f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)
data = f(x, 0.2, 1) + 0.05*np.random.randn(n)
Now, use curve_fit to find parameters.
mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
This gives output
>> mu,sigma
0.1828320963531838, 0.9452044983927278
We can plot the original CDF (orange), the noisy data, and the fitted CDF (blue) and observe that it works pretty well.
Note that curve_fit can take some additional parameters, and that the output gives additional information about how good of a fit the function is.
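A minimal plotting sketch for the figure described above (assuming matplotlib and continuing from the variables defined in the curve_fit example):

import matplotlib.pyplot as plt

plt.plot(x, data, 'k.', label='noisy CDF samples')
plt.plot(x, f(x, 0.2, 1), color='C1', label='original CDF')   # orange
plt.plot(x, f(x, mu, sigma), color='C0', label='fitted CDF')  # blue
plt.legend()
plt.show()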
@tch Thank you for the answer. I read up on the technique and successfully applied it. I wanted to apply the fit to all continuous distributions supported by scipy.stats, so I ended up doing the following:
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit

fitted = []
failed = []
for d in dist_list:
    dist_name = d[0]                      # fetch the distribution name
    dist_object = getattr(ss, dist_name)  # fetch the distribution object
    param_default = d[1]                  # fetch the default distribution parameters
    # For distributions with only location and scale, fall back to loc=0 and scale=1
    if not param_default:
        param_default = (0, 1)
    # Compute the parameters of the fitted distribution
    try:
        param, cov = curve_fit(dist_object.cdf, data_in, data_out, p0=param_default, method='trf')
        # Only keep distributions that do not result in an all-zero covariance, as those are not a valid fit
        if np.any(cov):
            fitted.append((dist_name, param),)
    # Capture distributions that cannot be fitted (for a variety of reasons)
    except (NotImplementedError, RuntimeError) as e:
        failed.append((dist_name, e),)
In the above, the empirical CDF is captured in data_out, which holds the sampled CDF values for a range of data_in points. The list dist_list holds, for each continuous distribution in scipy.stats, the distribution name as the first element and a list of its default parameters as the second element. The default parameters I extract from scipy.stats._distr_params.
Some distributions cannot be fitted and raise an error; I keep those in the failed list.
Finally, I generate a list fitted which holds, for each successfully fitted distribution, the estimated parameters.
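For reference, one possible way to build dist_list as described above, relying on the private (and therefore potentially unstable) scipy.stats._distr_params module:

import scipy.stats as ss
from scipy.stats._distr_params import distcont  # list of (name, default shape parameters)

dist_list = [(name, params) for name, params in distcont if hasattr(ss, name)]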

How can I fit a GMM to a 1D Gaussian plot with sklearn?

I realize there are several articles that demonstrate how to fit a GMM to a 1D Gaussian with sklearn ([1] and [2], to name a few). However, in all of those cases, the data is present as individual points whose distribution is Gaussian. In my case, I essentially have a frequency table (I'm working with spectroscopic data), where the distribution is Gaussian but the individual points are unknown.
My distribution (i.e., the data I'm trying to fit) looks like this: 1D Gaussian Peak
I'd like to use GMM to deconvolve the 2 initial Gaussian distributions that make up this peak.
So far, I've tried the following (assume my data is a 200x2 array, with position in one column and AFU in the second):
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
def gengmm(nc=4, n_iter=2):
    g = mixture.GMM(n_components=nc)  # number of components
    g.init_params = ""                # no initialization
    g.n_iter = n_iter                 # number of EM iterations
    return g
I tried to see if I could fit this peak to just a single Gaussian:
g = gengmm(1, 100)
g.fit(data)
However, the mean and covariance I get don't describe my data particularly well (notably, the mean of that Gaussian peak is 127.5, which is not what the 1-component GMM recovers).
Is there an easier way to do this? (I realize I could just use a least-squares fit to recover the initial Gaussian, but again, I'm ultimately trying to use this to determine the two underlying Gaussian distributions that make up the final one.)
Thanks!
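One possible workaround (a sketch, not a full answer): turn the frequency table back into pseudo-samples by repeating each position proportionally to its normalized intensity, then fit a two-component mixture with the current sklearn API. The column order and the resampling factor of 1000 are assumptions:

import numpy as np
from sklearn.mixture import GaussianMixture

positions = data[:, 0]    # first column: position
intensities = data[:, 1]  # second column: AFU
counts = np.round(1000 * intensities / intensities.sum()).astype(int)
samples = np.repeat(positions, counts).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(samples)
print(gmm.means_.ravel())                 # centers of the two Gaussians
print(np.sqrt(gmm.covariances_.ravel()))  # their standard deviations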

Fit gaussians (or other distributions) on my data using python

I have a database of features, a 2D np.array (2000 samples, each containing 100 features, i.e. 2000 x 100). I want to fit Gaussian distributions to my database using Python. My code is the following:
data = load_my_data() # loads a np.array with size 2000x200
clf = mixture.GaussianMixture(n_components= 50, covariance_type='full')
clf.fit(data)
I am not sure about the parameters, for example covariance_type, and how I can investigate whether the fit was successful or not.
EDIT: I debugged the code to investigate what is happening with clf.means_, and apparently it produced a matrix of n_components x size_of_features (50 x 20). Is there a way I can check that the fitting was successful, or plot the data? What are the alternatives to Gaussian mixtures (mixtures of exponentials for example; I cannot find any available implementation)?
I think you are using the sklearn package.
Once you have fit the model, type
print clf.means_
If it prints output, the data has been fitted; if it raises an error, it has not.
Hope this helps you.
You can do dimensionality reduction using PCA to 3D space (let's say) and then plot means and data.
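A minimal sketch of that suggestion, assuming data and a fitted clf as in the question:

from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=3)
data_3d = pca.fit_transform(data)
means_3d = pca.transform(clf.means_)  # project the fitted means into the same 3D space

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data_3d[:, 0], data_3d[:, 1], data_3d[:, 2], s=2, alpha=0.3)
ax.scatter(means_3d[:, 0], means_3d[:, 1], means_3d[:, 2], c='red', s=60)
plt.show()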
It is always preferable to choose a reduced set of candidates before trying to identify the distribution (in other words, use a Cullen & Frey graph to reject the unlikely candidates), then run goodness-of-fit tests and select the best result.
You can just create a list of all available distributions in scipy. An example with two distributions and random data:
import numpy as np
import scipy.stats as st

data = np.random.random(10000)

# Specify all distributions here
distributions = [st.laplace, st.norm]
mles = []

for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]
best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print 'Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1])
I understand you may want to do a regression between two different distributions, rather than fitting them to an analytic curve. If that is the case, you may be interested in plotting one against the other and running a linear (or polynomial) regression, then checking the coefficients.
If that is the case, a linear regression of the two distributions may tell you whether they are linearly dependent or not.
Linear Regression using Scipy documentation

Get the distribution when a point is sampled from a mixture in PyMC3

I have a model with a pm.NormalMixture(), and when I sample from the normal mixture, I also want to know which of the mixed distributions each point is being sampled from.
import numpy as np
import pymc3 as pm

obs = np.concatenate([np.random.normal(5, 1, 100),
                      np.random.normal(10, 2, 200)])

with pm.Model() as model:
    mu = pm.Normal('mu', 10, 10, shape=2)
    sd = pm.Normal('sd', 10, 10, shape=2)
    x = pm.NormalMixture('x', mu=mu, sd=sd, observed=obs)
I sample from that model, then use the trace to sample from the posterior predictive distribution. What I want to know is, for each x in the posterior predictive trace, which of the two normal distributions it was sampled from. Is that possible in PyMC3 without doing it manually?
This example demonstrates how posterior predictive checks (PPCs) work. The gist of a PPC is that you first draw random samples from the trace. The trace is essentially always multivariate, and in your model a single sample would be defined by the vector (mu[i,0], mu[i,1], sd[i,0], sd[i,1]). Then, for each trace sample, generate random numbers from the distribution specified for the likelihood with its parameter values equal to those from the trace samples. In your case, this would be NormalMixture(mu[i,:], sd[i,:]). In your model, x is the likelihood function, not an individual point of the trace.
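In code, the PPC draw described above looks roughly like this (sample_ppc is the older PyMC3 name; newer releases call it sample_posterior_predictive):

with model:
    trace = pm.sample(1000)
    ppc = pm.sample_ppc(trace, samples=500)
# ppc['x'] has shape (500, len(obs)): new data simulated from the
# NormalMixture likelihood, one row per selected trace sample.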
Some practical notes:
You haven't specified a weighting variable, so I'm assuming by default it forces the normal distributions to be weighted equally (I haven't tested this).
The odds of a given point coming from one distribution or the other are just the ratio of the probability densities at that point (see the sketch after these notes).
Check out this for recommendations on how to choose priors. For example, your SD prior is placing a lot of weight on very large SDs, which would bias your results, especially for smaller datasets.
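A sketch of that density-ratio idea, using posterior-mean parameters as a point estimate (an approximation, not the full Bayesian answer; equal weights are assumed, and abs() is taken because the Normal prior on sd also allows negative values):

import numpy as np
import scipy.stats as st

mu_hat = trace['mu'].mean(axis=0)          # posterior mean of each component mean
sd_hat = np.abs(trace['sd']).mean(axis=0)  # posterior mean of each component sd

x_new = 7.0
dens = st.norm(mu_hat, sd_hat).pdf(x_new)  # density of x_new under each component
resp = dens / dens.sum()                   # membership probabilities (equal weights)
print(resp)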
Good luck!
