Using scipy optimize for MLE estimate and curve fitting

Using scipy optimize for MLE estimate and curve fitting - python

I randomly generated 1000 data points using the weights I know are true for the normal distribution. Now I am trying to minimize the -log likelihood function to estimate the values of sig^2 and the weights. I sort of get the process conceptually, but when I try to code it I'm just lost.
This is my model:
p(y|x, w, sig^2) = N(y|w0+w1x+...+wnx^n, sig^2)
I've been googling for a while now and I've learned the scipy.stats.optimize.minimize function is good for this, but I can't get it to work right. Every solution I have tried has worked for the example I got the solution from, but I'm unable to extrapolate it to my problem.
x = np.linspace(0, 1000, num=1000)
data = []
for y in x:
data.append(np.polyval([.5, 1, 3], y))
#plot to confirm I do have a normal distribution...
data.sort()
pdf = stats.norm.pdf(data, np.mean(data), np.std(data))
plt.plot(test, pdf)
plt.show()
#This is where I am stuck.
logLik = -np.sum(stats.norm.logpdf(data, loc=??, scale=??))
I have found that the equation error(w) = .5*sum(poly(x_n, w) - y_n)^2 is relevant for minimizing the error of the weights, which therefore maximizes my likelihood for the weights, but I don't understand how to code this... I have found a similar relationship for sig^2, but have the same problem. Can somebody clarify how to do this to help my curve fitting? Maybe go as far to post psuedo code I can use?

Yes, implementing likelihood fitting with minimize is tricky, I spend a lot of time on it. Which is why I wrapped it. If I may shamelessly plug my own package symfit, your problem can be solved by doing something like this:
from symfit import Parameter, Variable, Likelihood, exp
import numpy as np
# Define the model for an exponential distribution
beta = Parameter()
x = Variable()
model = (1 / beta) * exp(-x / beta)
# Draw 100 samples from an exponential distribution with beta=5.5
data = np.random.exponential(5.5, 100)
# Do the fitting!
fit = Likelihood(model, data)
fit_result = fit.execute()
I have to admit I don't exactly understand your distribution, since I don't understand the role of your w, but perhaps with this code as an example, you'll know how to adapt it.
If not, let me know the full mathematical equation of your model so I can help you further.
For more info check the docs. (For a more technical description of what happens under the hood, read here and here.)

I think there's an issue with your setup. With maximum likelihood, you obtain the parameters that maximize the probability of observing your data (given a certain model). Your model seems to be:
where epsilon is N(0, sigma).
So you maximize it:
or equivalently take logs to get:
The f in this case is the log-normal probability density function which you can get with stats.norm.logpdf. You should then use scipy.minimize to maximize an expression that will be the summation of stats.norm.logpdf evaluated at each of the i points, from 1 to your sample size.
If I've understood you correctly, your code is missing having a y vector plus an x vector! Show us a sample of those vectors and I can update my answer to include a sample code for estimating MLE with that date.

Related

trouble getting started with simple pymc3 example

I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end-gauge of actual (unknown) length at room-temperature length, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
#prior distributions
theta = pm.Normal('theta',mu=-.1,sd=.04)
alpha = pm.Normal('alpha',mu=.0000115,sd=.0000012)
length = pm.Normal('length',mu=50,sd=1)
mumeas = length*(1+alpha*theta)
with basic_model:
obs = pm.Normal('obs',mu=mumeas,sd=5.8e-6,observed=xdata)
#yobs = Normal('yobs',)
start = pm.find_MAP()
#trace = pm.sample(2000, step=pm.Metropolis, start=start)
step = pm.Metropolis()
trace = pm.sample(10000, tune=200000,step=step,start=start,njobs=1)
length_samples = trace['length']
fig,ax=plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
data = jags_data,
n.chains = 1,
n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
variable.names =
c("l","mu","alpha","theta"),#,"ypred"),
n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
l~dnorm(50,1.0)
alpha~dnorm(0.0000115,694444444444)
theta~dnorm(-0.1,625)
mu<-l*(1+alpha*theta)
xbar~dnorm(mu,29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!

I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to a situation of highly correlated posterior parameters. Add to that the extreme scale imbalances, then it makes sense this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
# tau === precision parameterization
dT = pm.Normal('dT', mu=-0.1, tau=625)
alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
length = pm.Normal('length', mu=50.0, tau=1.0)
mu = pm.Deterministic('mu', length*(1+alpha*dT))
# only one input observation; tau indicates the 5.8 nm sd
obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.

How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the parameters of the Gaussian found by using the fit of the first column to be the initial parameters of the starting point for the next column to use. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
x = np.linspace(-50,50,100)
Gauss_Model = models.Gaussian1D(amplitude = 1000., mean = 0, stddev = 1.)
Fitting_Model = fitting.LevMarLSQFitter()
Fit_Data = []
for i in range(0, Data_Array.shape[0]):
Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!

I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like fit_data[-1].amplitude or fit_data[-1].mean, then you could modify your loop to use something like:
for i in range(0, data_array.shape[0]):
if fit_data: # true if not an empty list
Gauss_Model = models.Gaussian1D(amplitude=fit_data[-1].amplitude,
mean=fit_data[-1].mean,
stddev=fit_data[-1].stddev)
fit_data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
basically checking whether you have already fit a model, and if you have, use the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably the best fit model to the data it was presented. If you want to estimate the error in the parameters of your model, you can use the fact that, for two normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the distribution A + B will have mean m_a + m_b and variance is v_a + v_b. Thus, the distribution of your means will be N(sum(means)/n, sum(variances)/n). Basically you can say that your true mean is centered at the mean of your means with standard deviation (sum(stddev)/sqrt(n)).

I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit import GaussianModel
x = np.linspace(-50,50,100)
# get Data_Array from somewhere....
# create a model for a Gaussian
Gauss_Model = GaussianModel()
# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)
Fit_Results = []
for i in range(Data_Array.shape[1]):
result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
Fit_Results.append(result)
# update `params` with the current best fit params for the next column
params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.

Gaussian fit in Python - parameters estimation

I want to fit an array of data (in the program called "data", of size "n") with a Gaussian function and I want to get the estimations for the parameters of the curve, namely the mean and the sigma. Is the following code, which I found on the Web, a fast way to do that? If so, how can I actually get the estimated values of the parameters?
import pylab as plb
from scipy.optimize import curve_fit
from scipy import asarray as ar,exp
x = ar(range(n))
y = data
n = len(x) #the number of data
mean = sum(x*y)/n #note this correction
sigma = sum(y*(x-mean)**2)/n #note this correction
def gaus(x,a,x0,sigma,c):
return a*exp(-(x-x0)**2/(sigma**2))+c
popt,pcov = curve_fit(gaus,x,y,p0=[1,mean,sigma,0.0])
print popt
print pcov
plt.plot(x,y,'b+:',label='data')
plt.plot(x,gaus(x,*popt),'ro:',label='fit')
plt.legend()
plt.title('Fig. 3 - Fit')
plt.xlabel('q')
plt.ylabel('data')
plt.show()

To answer your first question, "Is the following code, which I found on the Web, a fast way to do that?"
The code that you have is in fact the right way to proceed with fitting your data, when you believe is Gaussian and know the fitting function (except change the return function to
a*exp(-(x-x0)**2/(sigma**2)).
I believe for a Gaussian function you don't need the constant c parameter.
A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit.
To answer your second question, "If so, how can I actually get the estimated values of the parameters?"
You can go to the link provided for scipy.optimize.curve_fit and find that the best fit parameters reside in your popt variable. In your example, popt will contain the mean and sigma of your data. In addition to the best fit parameters, pcov will contain the covariance matrix, which will have the errors of your mean and sigma. To obtain 1sigma standard deviations, you can simply use np.sqrt(pcov) and obtain the same.

Difference between Levenberg-Marquardt-Algorithm and ODR

I was able to fit curves to a x/y dataset using peak-o-mat, as shown below. Thats a linear background and 10 lorentzian curves.
Since I need to fit many similar curves I wrote a scripted fitting routine, using mpfit.py, which is a Levenberg-Marquardt-Algorithm. However the fit takes longer and, in my opinion, is less accurate than the peak-o-mat result:
Starting values
Fit result with fixed linear background (values for linear background taken from the peak-o-mat result)
Fit result with all variables free
I believe the starting values are already very close, but even with the fixed linear background, the left lorentzian is clearly a degradation of the fit.
The result is even worse for total free fit.
Peak-o-mat appears to use scipy.odr.odrpack. Now what is more likely:
I did some implementation error?
odrpack is more suitable for this particular problem?
Fitting to a more simple problem (linear data with one peak in the middle) shows very good correlation between peak-o-mat and my script. Also I did not find a lot about ordpack.
Edit: It seems I could answer the question by myself, however the answer is a bit unsettling. Using scipy.odr (which allows fitting with odr or leastsq method) both give the result as peak-o-mat, even without constraints.
The image below shows again the data, the start values (almost perfect) and then the odr and leastsq fits. The component curves are for the odr one
I will switch to odr, but this still leaves me upset. The methods (mpfit.py, scipy.optimize.leastsq, scipy.odr in leastsq mode) 'should' yield the same results.
And for people stumbling upon this post: to do the odr fit an error must be specified for x and y values. If there is no error, use small values with sx << sy.
linear = odr.Model(f)
mydata = odr.RealData(x, y, sx = 1e-99, sy = 0.01)
myodr = odr.ODR(mydata, linear, beta0 = beta0, maxit = 2000)
myoutput1 = myodr.run()

You can use peak-o-mat for scripting as well. The easiest would be to create project containing all data you want to fit via the GUI, clean it, transform it and attach (i.e. choose a model, provide an initial guess and fit) the base model to one of the sets. Then you can (deep)copy that model and attach it to all of the other data sets. Try this:
from peak_o_mat.project import Project
from peak_o_mat.fit import Fit
from copy import deepcopy
p = Project()
p.Read('in.lpj')
base = p[2][0] # this is the set which has been fit already
for data in p[2][1:]: # all remaining sets of plot number 2
mod = deepcopy(base.mod)
data.mod = mod
f = Fit(data, data.mod)
res = f.run()
pars = res[0]
err = res[1]
data.mod._newpars(pars, err)
print data.mod.parameters_as_table()
p.Write('out')
Please tell me, if you need more details.

correct usage of scipy.optimize.fmin_bfgs

I am playing around with logistic regression in Python. I have implemented a version where the minimization of the cost function is done via gradient descent, and now I'd like to use the BFGS algorithm from scipy (scipy.optimize.fmin_bfgs).
I have a set of data (features in matrix X, with one sample in every row of X, and correpsonding labels in vertical vector y). I am trying to find parameters Theta to minimize:
I have trouble understanding how fmin_bfgs works exactly. As far as I get it, I have to pass a function to be minimized and a set of initial values for Thetas.
I do the following:
initial_values = numpy.zeros((len(X[0]), 1))
myargs = (X, y)
theta = scipy.optimize.fmin_bfgs(computeCost, x0=initial_values, args=myargs)
where computeCost calculates J(Thetas) as illustrated above. But I get some index-related errors, so I think I am not supplying what fmin_bfgs expects.
Can anyone shed some light on this?

After wasting hours on it, solved again by power of posting...I was defining computeCost(X, y, Thetas), but as Thetas is the target parameter for optimization, it should have been the first parameter in the signature. Fixed and works!

i don't know your whole code, but have you tried
initial_values = numpy.zeros(len(X[0]))
? This x0 should be a 1d vector, i think.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.