Python - Minimizing Chi-squared

I have been trying to fit a linear model to a set of stress/strain data by minimizing chi-squared. Unfortunately the code below is not correctly minimizing the chisqfunc function: it finds the minimum at the initial conditions, x0, which is not correct. I have looked through the scipy.optimize documentation and tested minimizing other functions, which worked correctly. Could you please suggest how to fix the code below, or suggest another method I can use to fit a linear model to data by minimizing chi-squared?
import numpy
import scipy.optimize as opt

filename = 'data.csv'
data = numpy.loadtxt(open(filename,"r"), delimiter=",")

stress = data[:,0]
strain = data[:,1]
err_stress = data[:,2]

def chisqfunc((a, b)):
    model = a + b*strain
    chisq = numpy.sum(((stress - model)/err_stress)**2)
    return chisq

x0 = numpy.array([0,0])
result = opt.minimize(chisqfunc, x0)
print result
Thank you for reading my question and any help would be greatly appreciated.
Cheers, Will
EDIT: Data set I am currently using: Link to data

The problem is that your initial guess is very far from the actual solution. If you add a print statement inside chisqfunc() like print (a,b), and rerun your code, you'll get something like:
(0, 0)
(1.4901161193847656e-08, 0.0)
(0.0, 1.4901161193847656e-08)
This means that minimize evaluates the function only at these points.
If you now evaluate chisqfunc() at these three pairs of values, you'll see that the results match EXACTLY; for example:
print chisqfunc((0,0))==chisqfunc((1.4901161193847656e-08,0))
True
This happens because of floating-point rounding: when evaluating stress - model, the variable stress is so many orders of magnitude larger than model that the tiny change in model is truncated away.
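A tiny demonstration of the effect (my own illustration, not part of the original answer; the 1e9 magnitude is just an assumed example of a stress value):
stress_value = 1.0e9                   # an assumed stress magnitude
tiny_step = 1.4901161193847656e-08     # the finite-difference step minimize uses
print((stress_value - tiny_step) == stress_value)   # True: the step is completely lost in double precision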
One could then try brute-forcing it by increasing the floating-point precision, writing data = data.astype(numpy.float128) just after loading the data with loadtxt. minimize still fails, with result.success=False, but with a helpful message:
Desired error not necessarily achieved due to precision loss.
One possibility is then to provide a better initial guess, so that in the subtraction stress - model the model part is of the same order of magnitude as stress; the other is to rescale the data, so that the solution is closer to your initial guess (0, 0).
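For the first option, a quick way to get a sensible x0 (my own hedged sketch, not part of the original answer) is a weighted linear fit with numpy.polyfit, which for a straight line minimizes the same chi-squared:
# weights of 1/err_stress make polyfit minimize sum(((stress - model)/err_stress)**2)
b_guess, a_guess = numpy.polyfit(strain, stress, 1, w=1.0/err_stress)   # returns [slope, intercept]
x0 = numpy.array([a_guess, b_guess])
result = opt.minimize(chisqfunc, x0)
Whether minimize can then refine this further still depends on the magnitudes involved.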
It is MUCH better to just rescale the data, making it, for example, nondimensional with respect to a characteristic stress value (like the yielding/cracking stress of this material).
This is an example of the fitting, using as a stress scale the maximum measured stress. There are very few changes from your code:
import numpy
import scipy.optimize as opt
from matplotlib.pyplot import plot, show   # for the plot calls below; the original snippet assumed these were already in the namespace

filename = 'data.csv'
data = numpy.loadtxt(open(filename,"r"), delimiter=",")

stress = data[:,0]
strain = data[:,1]
err_stress = data[:,2]

smax = stress.max()
stress = stress/smax
# I am assuming the errors err_stress are in the same units as stress.
err_stress = err_stress/smax

def chisqfunc((a, b)):
    model = a + b*strain
    chisq = numpy.sum(((stress - model)/err_stress)**2)
    return chisq

x0 = numpy.array([0,0])
result = opt.minimize(chisqfunc, x0)
print result
assert result.success == True

a, b = result.x*smax
plot(strain, stress*smax)
plot(strain, a + b*strain)
show()
Your linear model is quite good, i.e. your material has a very linear behaviour over this range of deformation (what material is it, anyway?).
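Since the question also asks for another method, here is a hedged sketch (not from the original answer) using scipy.optimize.curve_fit on the same rescaled arrays; with sigma it minimizes the identical chi-squared and additionally returns parameter uncertainties (absolute_sigma needs a reasonably recent SciPy):
from scipy.optimize import curve_fit

def linear(x, a, b):
    return a + b*x

popt, pcov = curve_fit(linear, strain, stress, sigma=err_stress, absolute_sigma=True)
a, b = popt*smax                                   # back to physical units
a_err, b_err = numpy.sqrt(numpy.diag(pcov))*smax   # 1-sigma parameter uncertainties, also rescaled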


How do I improve a Gaussian/Normal fit in Python 3.X by using a running median?

I have an array of 100x100 data points, where I'm trying to perform a Gaussian fit to each column of 100 values in the array. I then want the Gaussian parameters found by fitting the first column to be used as the initial parameters for the next column's fit. Let's say I start with the initial parameters of 1000, 0, and 1, and the fit finds values of 800, 3, and 1.5. I then want the fitter to use these three parameters as initial values for the next column.
My code is:
import numpy as np
from astropy.modeling import models, fitting

x = np.linspace(-50, 50, 100)
Gauss_Model = models.Gaussian1D(amplitude=1000., mean=0, stddev=1.)
Fitting_Model = fitting.LevMarLSQFitter()

Fit_Data = []
for i in range(0, Data_Array.shape[0]):   # Data_Array is the 100x100 data array
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
Right now it uses the same initial values for every fit. Does anyone know how to perform such a running median/mean for a Gaussian fitting method? Would really appreciate any help or being pointed in the right direction, thanks!
I'm not familiar with the specific library you are using, but if you can get your fitted parameters out with something like fit_data[-1].amplitude or fit_data[-1].mean, then you could modify your loop to use something like:
for i in range(0, Data_Array.shape[0]):
    if Fit_Data:  # true if not an empty list
        Gauss_Model = models.Gaussian1D(amplitude=Fit_Data[-1].amplitude,
                                        mean=Fit_Data[-1].mean,
                                        stddev=Fit_Data[-1].stddev)
    Fit_Data.append(Fitting_Model(Gauss_Model, x, Data_Array[:,i]))
basically checking whether you have already fit a model, and if you have, use the most recent fitted amplitude, mean, and standard deviation as the starting point for your next Gauss_Model.
A thought: this might speed up your fitting, but it shouldn't result in a "better" fit to the 100 data points in each fit operation. Your resulting model is probably the best-fit model to the data it was presented with. If you want to estimate the error in the parameters of your model, you can use the fact that, for two independent normal distributions A ~ N(m_a, v_a) and B ~ N(m_b, v_b), the distribution A + B has mean m_a + m_b and variance v_a + v_b. Thus, the distribution of the average of your fitted means is N(sum(means)/n, sum(variances)/n^2). Basically you can say that your true mean is centered at the mean of your means, with standard deviation sqrt(sum(variances))/n (which reduces to stddev/sqrt(n) when every fit reports the same stddev).
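A quick numerical illustration of that sum rule: if A ~ N(1, 4) and B ~ N(2, 9), then A + B ~ N(3, 13), so its standard deviation is sqrt(13) ≈ 3.6 rather than 2 + 3 = 5.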
I also cannot tell what library you are using, and the details of how to do this probably depend on the details of how that library stores the fitted values. I can say that for lmfit (https://lmfit.github.io/lmfit-py/) we struggled with this sort of usage and arrived at a design that makes what you are trying to do pretty easy. With lmfit, you might compose this problem as:
import numpy as np
from lmfit.models import GaussianModel

x = np.linspace(-50, 50, 100)

# get Data_Array from somewhere....

# create a model for a Gaussian
Gauss_Model = GaussianModel()

# make a set of parameters, setting initial values
params = Gauss_Model.make_params(amplitude=1000, center=0, sigma=1.0)

Fit_Results = []
for i in range(Data_Array.shape[1]):
    result = Gauss_Model.fit(Data_Array[:, i], params, x=x)
    Fit_Results.append(result)
    # update `params` with the current best-fit params for the next column
    params = result.params
Note that this works because lmfit is careful that Model.fit() will not alter the input parameters, and will put the resulting best-fit parameters for each fit in result.params.
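For inspecting each fit afterwards, a small usage sketch (these are standard lmfit ModelResult attributes, shown here for the last column as an example):
last = Fit_Results[-1]
print(last.fit_report())                 # formatted summary of best-fit values and uncertainties
print(last.params['center'].value,       # best-fit Gaussian center
      last.params['center'].stderr)      # its estimated 1-sigma uncertainty
best_curve = last.best_fit               # the model evaluated at the best-fit parameters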
And, if you decide you do want to have all columns use the original initial values, just comment out that last params = result.params.
Lmfit has a lot more bells and whistles, but I hope that helps you do what you need.

Using scipy optimize for MLE estimate and curve fitting

I randomly generated 1000 data points using the weights I know are true for the normal distribution. Now I am trying to minimize the -log likelihood function to estimate the values of sig^2 and the weights. I sort of get the process conceptually, but when I try to code it I'm just lost.
This is my model:
p(y|x, w, sig^2) = N(y|w0+w1x+...+wnx^n, sig^2)
I've been googling for a while now and I've learned that the scipy.optimize.minimize function is good for this, but I can't get it to work right. Every solution I have tried has worked for the example I got the solution from, but I'm unable to extrapolate it to my problem.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x = np.linspace(0, 1000, num=1000)
data = []
for y in x:
    data.append(np.polyval([.5, 1, 3], y))

# plot to confirm I do have a normal distribution...
data.sort()
pdf = stats.norm.pdf(data, np.mean(data), np.std(data))
plt.plot(data, pdf)
plt.show()

# This is where I am stuck.
logLik = -np.sum(stats.norm.logpdf(data, loc=??, scale=??))
I have found that the equation error(w) = .5*sum(poly(x_n, w) - y_n)^2 is relevant for minimizing the error of the weights, which therefore maximizes my likelihood for the weights, but I don't understand how to code this... I have found a similar relationship for sig^2, but have the same problem. Can somebody clarify how to do this to help my curve fitting? Maybe even go as far as to post pseudo code I can use?
Yes, implementing likelihood fitting with minimize is tricky; I spent a lot of time on it, which is why I wrapped it. If I may shamelessly plug my own package symfit, your problem can be solved by doing something like this:
from symfit import Parameter, Variable, Likelihood, exp
import numpy as np
# Define the model for an exponential distribution
beta = Parameter()
x = Variable()
model = (1 / beta) * exp(-x / beta)
# Draw 100 samples from an exponential distribution with beta=5.5
data = np.random.exponential(5.5, 100)
# Do the fitting!
fit = Likelihood(model, data)
fit_result = fit.execute()
I have to admit I don't exactly understand your distribution, since I don't understand the role of your w, but perhaps with this code as an example, you'll know how to adapt it.
If not, let me know the full mathematical equation of your model so I can help you further.
For more info check the docs. (For a more technical description of what happens under the hood, read here and here.)
I think there's an issue with your setup. With maximum likelihood, you obtain the parameters that maximize the probability of observing your data (given a certain model). Your model seems to be:
y_i = w0 + w1*x_i + ... + wn*x_i^n + eps_i
where eps_i is N(0, sig^2).
So you maximize the likelihood:
L(w, sig^2) = prod_i N(y_i | w0 + w1*x_i + ... + wn*x_i^n, sig^2)
or equivalently take logs to get the log-likelihood:
log L(w, sig^2) = sum_i log N(y_i | w0 + w1*x_i + ... + wn*x_i^n, sig^2)
The term inside the sum is the log of the normal probability density, which you can get with stats.norm.logpdf. You should then use scipy.optimize.minimize to minimize the negative of that summation, i.e. -sum of stats.norm.logpdf evaluated at each of the i points, from 1 to your sample size.
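A minimal sketch of that recipe (my own illustration with made-up synthetic data; the quadratic model, noise level and variable names are all assumptions):
import numpy as np
from scipy import stats, optimize

# made-up data: y = 0.5*x**2 + 1*x + 3 plus Gaussian noise with sigma = 2
np.random.seed(0)
x = np.linspace(0, 10, 200)
y = np.polyval([0.5, 1.0, 3.0], x) + np.random.normal(0, 2.0, size=x.size)

def neg_log_likelihood(theta):
    # theta = (w2, w1, w0, log_sigma); fitting log(sigma) keeps sigma positive
    w = theta[:-1]
    sigma = np.exp(theta[-1])
    mu = np.polyval(w, x)                 # polynomial mean at each x
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

res = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0, 0.0])
print(res.x[:-1], np.exp(res.x[-1]))      # weights and sigma, roughly (0.5, 1, 3) and 2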
If I've understood you correctly, your code is missing a y vector as well as an x vector! Show us a sample of those vectors and I can update my answer to include sample code for estimating the MLE with that data.

using undetermined number of parameters in scipy function curve_fit

First question:
I'm trying to fit experimental datas with function of the following form:
f(x) = m_o*(1-exp(-t_o*x)) + ... + m_j*(1-exp(-t_j*x))
Currently, I can't find a way to have an undetermined number of parameters m_j, t_j, so I'm forced to do something like this:
def fitting_function(x, m_1, t_1, m_2, t_2):
    return m_1*(1. - numpy.exp(-t_1*x)) + m_2*(1. - numpy.exp(-t_2*x))

parameters, covariance = curve_fit(fitting_function, xExp, yExp, maxfev=100000)
(xExp and yExp are my experimental points)
Is there a way to write my fitting function like this:
def fitting_function(x, li):
    res = 0.
    for idx in range(len(li) // 2):
        res += li[2*idx]*(1 - numpy.exp(-li[2*idx + 1]*x))
    return res
where li is the list of fitting parameters and then do a curve_fitting? I don't know how to tell to curve_fitting what is the number of fitting parameters.
When I try this kind of form for fitting_function, I have errors like "ValueError: Unable to determine number of fit parameters."
Second question:
Is there any way to force my fitting parameters to be positive?
Any help appreciated :)
See my question and answer here. I've also made a minimal working example demonstrating how it could be done for your application. I make no claims that this is the best way - I am muddling through all this myself, so any critiques or simplifications are appreciated.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl

def wrapper(x, *args): # take a flat list of arguments and break it down into two lists for the fit function to understand
    N = len(args)/2
    amplitudes = list(args[0:N])
    timeconstants = list(args[N:2*N])
    return fit_func(x, amplitudes, timeconstants)

def fit_func(x, amplitudes, timeconstants): # the actual fit function
    fit = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        fit += m*(1.0 - np.exp(-t*x))
    return fit

def gen_data(x, amplitudes, timeconstants, noise=0.1): # generate some fake data
    y = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        y += m*(1.0 - np.exp(-t*x))
    if noise:
        y += np.random.normal(0, noise, size=len(x))
    return y

def main():
    x = np.arange(0, 100)
    amplitudes = [1, 2, 3]
    timeconstants = [0.5, 0.2, 0.1]

    y = gen_data(x, amplitudes, timeconstants, noise=0.01)

    p0 = [1, 2, 3, 0.5, 0.2, 0.1]
    popt, pcov = curve_fit(lambda x, *p0: wrapper(x, *p0), x, y, p0=p0) # call with lambda function

    yfit = gen_data(x, popt[0:3], popt[3:6], noise=0)
    pl.plot(x, y, x, yfit)
    pl.show()

    print popt
    print pcov

if __name__ == "__main__":
    main()
A word of warning, though: a linear sum of exponentials makes the fit EXTREMELY sensitive to noise, particularly for a large number of parameters. You can test that by adding even a small amount of noise to the data generated in the script - even small deviations cause it to get the answer entirely wrong while the fit still looks perfectly valid by eye (test with noise=0, 0.01, and 0.1). Be very careful interpreting your results even if the fit looks good. The form also allows for variable swapping: the best-fit solution is the same even if you swap any pair (m_i, t_i) with (m_j, t_j), meaning your chi-square has multiple identical local minima, so your variables may get swapped around during fitting depending on the initial conditions. This is unlikely to be a numerically robust way to extract these parameters.
To your second question, yes, you can, by defining your exponentials like so:
m_0**2*(1.0 - np.exp(-t_0**2*x)) + ...
Basically, square them all in your actual fit function, fit them, and then square the results (which could be negative or positive) to get your actual parameters. You can also define variables to be between a certain range by using different proxy forms.
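As an aside on the positivity question (my own hedged note, not from the original answer): recent SciPy versions let curve_fit constrain parameters directly through its bounds argument, which avoids the squaring trick:
# keep every amplitude and time constant non-negative; p0 as in the script above
popt, pcov = curve_fit(lambda x, *p0: wrapper(x, *p0), x, y, p0=p0, bounds=(0, np.inf))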

Fitting Parametric Curves in Python

I have experimental data of the form (X,Y) and a theoretical model of the form (x(t;*params),y(t;*params)) where t is a physical (but unobservable) variable, and *params are the parameters that I want to determine. t is a continuous variable, and there is a 1:1 relationship between x and t and between y and t in the model.
In a perfect world, I would know the value of T (the real-world value of the parameter) and would be able to do an extremely basic least-squares fit to find the values of *params. (Note that I am not trying to "connect" the values of x and y in my plot, like in 31243002 or 31464345.) I cannot guarantee that in my real data, the latent value T is monotonic, as my data is collected across multiple cycles.
I'm not very experienced doing curve fitting manually, and have to use extremely crude methods without easy access to a basic scipy function. My basic approach involves:
Choose some value of *params and apply it to the model
Take an array of t values and put it into the model to create an array of model(*params) = (x(*params),y(*params))
Interpolate X (the data values) into model to get Y_predicted
Run a least-squares (or other) comparison between Y and Y_predicted
Do it again for a new set of *params
Eventually, choose the best values for *params
There are several obvious problems with this approach.
1) I'm not experienced enough with coding to develop a very good "do it again" step other than "try everything in the solution space," or maybe "try everything in a coarse grid" and then "try everything again in a slightly finer grid in the hotspots of the coarse grid." I tried MCMC methods, but I never found any optimum values, largely because of problem 2.
2) Steps 2-4 are super inefficient in their own right.
I've tried something like the following (resembling pseudo-code; the actual functions are made up). There are many minor quibbles that could be made about using broadcasting on A, B, but those are less significant than the problem of needing to interpolate for every single step.
People I know have recommended using some sort of Expectation Maximization algorithm, but I don't know enough about that to code one up from scratch. I'm really hoping there's some awesome scipy (or otherwise open-source) algorithm I haven't been able to find that covers my whole problem, but at this point I am not hopeful.
import numpy as np
import scipy as sci
from scipy import interpolate

X_data = ...   # experimental x values (placeholder)
Y_data = ...   # experimental y values (placeholder)

def x(t, A, B):
    return A**t + B**t

def y(t, A, B):
    return A*t + B

def interp(A, B):
    ts = np.arange(-10, 10, 0.1)
    xs = x(ts, A, B)
    ys = y(ts, A, B)
    f = interpolate.interp1d(xs, ys)
    return f

N = 101
lsqs = np.zeros((N**2, 3))

count = 0
for i in range(0, N):
    A = 0.1*i              # checks A between 0 and 10
    for j in range(0, N):
        B = 10 + 0.1*j     # checks B between 10 and 20
        f = interp(A, B)
        y_fit = f(X_data)
        squares = np.sum((y_fit - Y_data)**2)
        lsqs[count] = (A, B, squares)  # puts the values in place for comparison later
        count += 1                     # allows us to move to the next cell

i = np.argmin(lsqs[:, 2])
A_optimal = lsqs[i][0]
B_optimal = lsqs[i][1]
If I understand the question correctly, the params are constants which are the same in every sample, but t varies from sample to sample. So, for example, maybe you have a whole bunch of points which you believe have been sampled from a circle
x = a+r cos(t)
y = b+r sin(t)
at different values of t.
In this case, what I would do is eliminate the variable t to get a relation between x and y -- in this case, (x-a)^2+(y-b)^2 = r^2. If your data fit the model perfectly, you would have (x-a)^2+(y-b)^2 = r^2 at each of your data points. With some error, you could still find (a,b,r) to minimize
sum_i ((x_i-a)^2 + (y_i-b)^2 - r^2)^2.
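A hedged sketch of that least-squares idea with SciPy (the circle model, data and parameter values here are entirely made up for illustration):
import numpy as np
from scipy.optimize import least_squares

# made-up noisy points on a circle centred at (2, -1) with radius 3
t = np.random.uniform(0, 2*np.pi, 200)
X = 2 + 3*np.cos(t) + np.random.normal(0, 0.05, t.size)
Y = -1 + 3*np.sin(t) + np.random.normal(0, 0.05, t.size)

def residuals(params):
    a, b, r = params
    # the eliminated-t relation (x-a)^2 + (y-b)^2 - r^2, evaluated at every data point
    return (X - a)**2 + (Y - b)**2 - r**2

fit = least_squares(residuals, x0=[0.0, 0.0, 1.0])
print(fit.x)   # roughly (2, -1, 3); the sign of r is arbitrary since only r^2 enters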
Mathematica's Eliminate command can automate the procedure of eliminating t in some cases.
PS You might do better at stats.stackexchange, math.stackexchange or mathoverflow.net . I know the last one has a scary reputation, but we don't bite, really!

Difference between Levenberg-Marquardt-Algorithm and ODR

I was able to fit curves to an x/y dataset using peak-o-mat, as shown below. That's a linear background plus 10 Lorentzian curves.
Since I need to fit many similar curves I wrote a scripted fitting routine, using mpfit.py, which is a Levenberg-Marquardt-Algorithm. However the fit takes longer and, in my opinion, is less accurate than the peak-o-mat result:
Starting values
Fit result with fixed linear background (values for linear background taken from the peak-o-mat result)
Fit result with all variables free
I believe the starting values are already very close, but even with the fixed linear background, the fit of the leftmost Lorentzian is clearly degraded.
The result is even worse for the completely free fit.
Peak-o-mat appears to use scipy.odr.odrpack. Now what is more likely:
Did I make some implementation error?
Is odrpack more suitable for this particular problem?
Fitting a simpler problem (linear data with one peak in the middle) shows very good agreement between peak-o-mat and my script. Also, I did not find a lot of documentation about odrpack.
Edit: It seems I could answer the question myself, although the answer is a bit unsettling. Using scipy.odr (which allows fitting with either the odr or the leastsq method), both give the same result as peak-o-mat, even without constraints.
The image below again shows the data, the start values (almost perfect) and then the odr and leastsq fits. The component curves shown are for the odr fit.
I will switch to odr, but this still leaves me upset. The methods (mpfit.py, scipy.optimize.leastsq, scipy.odr in leastsq mode) 'should' yield the same results.
And for people stumbling upon this post: to do the odr fit an error must be specified for x and y values. If there is no error, use small values with sx << sy.
linear = odr.Model(f)
mydata = odr.RealData(x, y, sx = 1e-99, sy = 0.01)
myodr = odr.ODR(mydata, linear, beta0 = beta0, maxit = 2000)
myoutput1 = myodr.run()
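For reference, a self-contained version of that snippet might look like the following (a hedged sketch with a made-up linear model and data; the model function signature follows scipy.odr conventions):
import numpy as np
import scipy.odr as odr

def f(B, x):
    # scipy.odr model functions take the parameter vector first
    return B[0]*x + B[1]

x = np.linspace(0, 10, 50)
y = 2.0*x + 1.0 + np.random.normal(0, 0.1, x.size)

linear = odr.Model(f)
mydata = odr.RealData(x, y, sx=1e-99, sy=0.01)
myodr = odr.ODR(mydata, linear, beta0=[1.0, 0.0], maxit=2000)
# myodr.set_job(fit_type=2) would switch from ODR to an ordinary least-squares fit
myoutput1 = myodr.run()
print(myoutput1.beta)   # fitted slope and intercept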
You can use peak-o-mat for scripting as well. The easiest would be to create a project containing all the data you want to fit via the GUI, clean it, transform it and attach (i.e. choose a model, provide an initial guess and fit) the base model to one of the sets. Then you can (deep)copy that model and attach it to all of the other data sets. Try this:
from peak_o_mat.project import Project
from peak_o_mat.fit import Fit
from copy import deepcopy

p = Project()
p.Read('in.lpj')

base = p[2][0]  # this is the set which has been fit already

for data in p[2][1:]:  # all remaining sets of plot number 2
    mod = deepcopy(base.mod)
    data.mod = mod
    f = Fit(data, data.mod)
    res = f.run()
    pars = res[0]
    err = res[1]
    data.mod._newpars(pars, err)
    print data.mod.parameters_as_table()

p.Write('out')
Please tell me, if you need more details.
