I would like to use scipy's ODR to fit a curve to a set of variables with variances. In this case, I am fitting a linear function with a fixed Y-axis crossing point (e.g. a*x + 100). Because I could not find an estimator (I asked about that here), I am using scipy.optimize's curve_fit to estimate the initial a value. The function works perfectly without standard deviations, but when I add them, the output makes no sense at all (the curve ends up far above all of the points). What could be the cause of such behaviour? Thanks!
The code is attached here:
import scipy.odr as SODR
from scipy.optimize import curve_fit
def fun(kier, arg):
    '''
    Function to fit
    :param kier: list of parameters
    :param arg: argument of the function fun
    :return y: value of the function at arg
    '''
    y = kier[0]*arg + 100  # + kier[1]
    return y
zx = [30.120348566300354, 36.218214083626386, 52.86998374096616]
zy = [83.47033171149137, 129.10207165602722, 85.59465198231146]
dx = [2.537935346025827, 4.918719773247683, 2.5477221183398977]
dy = [3.3729431749276837, 5.33696690247701, 2.0937213187876]
sx = [6.605581618516947, 8.221194790372632, 22.980577676739113]
sy = [1.0936584882351936, 0.7749999999999986, 20.915359045447914]
dx_total = [9.143516964542775, 13.139914563620316, 25.52829979507901]
dy_total = [4.466601663162877, 6.1119669024770085, 23.009080364235516]
# curve fitting
popt, pcov = curve_fit(fun, zx, zy)
danesd = SODR.RealData(x=zx, y=zy, sx=dx_total, sy=dy_total)
model = SODR.Model(fun)
onbig = SODR.ODR(danesd, model, beta0=[popt[0]])
outputbig = onbig.run()
biga = outputbig.beta[0]
print(biga)
daned = SODR.RealData(x=zx, y=zy, sx=dx, sy=dy)
on = SODR.ODR(daned, model, beta0=[popt[0]])
outputd = on.run()
normala = outputd.beta[0]
print(normala)
The outputs are:
30.926925885047254 (this is the output with standard deviation)
-0.25132703671513873 (this is without standard deviation)
This makes no sense, as shown here:
Also, I'd be happy to get any feedback on whether my code is clear and on the formatting of this question. I am still very new here.
Related
I am trying to fit the parameters of a transit light curve.
I have observed transit light curve data, and I am using a Python script that, given 4 parameters (period, a (semi-major axis), inclination, planet radius), returns a model transit light curve. I would like to minimize the residual between these two light curves. This is what I am trying to do: first, estimate a maximum likelihood using method="L-BFGS-B", and then apply MCMC using emcee to estimate the uncertainties.
The code:
p = lmfit.Parameters()
p.add_many(('per', 2.), ('inc', 90.), ('a', 5.), ('rp', 0.1))
per_b = [1., 3.]
a_b = [4., 6.]
inc_b = [88., 90.]
rp_b = [0.1, 0.3]
bounds = [(per_b[0], per_b[1]), (inc_b[0], inc_b[1]), (a_b[0], a_b[1]), (rp_b[0], rp_b[1])]
def residual(p):
    v = p.valuesdict()
    eclipse.criarEclipse(v['per'], v['a'], v['inc'], v['rp'])
    lc0 = numpy.array(eclipse.getCurvaLuz())    # observed flux data
    ts0 = numpy.array(eclipse.getTempoHoras())  # observed time data
    c = numpy.linspace(min(time_phased[bb]), max(time_phased[bb]),
                       len(time_phased[bb]), endpoint=True)
    nn = interpolate.interp1d(ts0, lc0)
    return nn(c) - smoothed_LC[bb]              # residual between the model and the data
Inside def residual(p) I make sure that both the observed data (time_phased[bb] and smoothed_LC[bb]) have the same size as the model transit light curve. I want it to give me the best-fit values for the parameters (v['per'], v['a'], v['inc'], v['rp']).
I need your help and I appreciate your time and your attention. Kindest regards, Yuri.
Your example is incomplete, with many partial concepts and some invalid Python. This makes it slightly hard to understand your intention. If the answer below is not sufficient, update your question with a complete example.
It seems pretty clear that you want to model your data smoothed_LC[bb] (not sure what bb is) with a model for some effect of an eclipse. With that assumption, I would recommend using the lmfit.Model approach. Start by writing a function that models the data, just so you check and plot your model. I'm not entirely sure I understand everything you're doing, but this model function might look like this:
import numpy
from scipy import interpolate
from lmfit import Model
# import eclipse from somewhere....
def eclipse_lc(c, per, a, inc, rp):
    eclipse.criarEclipse(per, a, inc, rp)
    lc0 = numpy.array(eclipse.getCurvaLuz())    # observed flux data
    ts0 = numpy.array(eclipse.getTempoHoras())  # observed time data
    return interpolate.interp1d(ts0, lc0)(c)
With this model function, you can build a Model:
lc_model = Model(eclipse_lc)
and then build parameters for your model. This will automatically name them after the argument names of your model function. Here, you can also give them initial values:
params = lc_model.make_params(per=2, inc=90, a=5, rp=0.1)
You wanted to place upper and lower bounds on these parameters. This is done by setting the min and max attributes of the parameters, not by making an ordered array of bounds:
params['per'].min = 1.0
params['per'].max = 3.0
and so on. But also: setting such tight bounds is usually a bad idea. Set bounds to avoid unphysical parameter values or when it becomes evident that you need to place them.
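If you do decide to keep the question's bounds, the remaining ones are set the same way. A short sketch, using the values from the per_b, inc_b, a_b and rp_b lists in the question:
# bounds taken from the question's inc_b, a_b and rp_b lists
params['inc'].min = 88.0
params['inc'].max = 90.0
params['a'].min = 4.0
params['a'].max = 6.0
params['rp'].min = 0.1
params['rp'].max = 0.3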
Now, you can fit your data with this model. Well, first you need to get the data you want to model. This seems less clear from your example, but perhaps:
c_data = numpy.linspace(min(time_phased[bb]), max(time_phased[bb]),
len(time_phased[bb]), endpoint=True)
lc_data = smoothed_LC[bb]
Well: why do you need to make this c_data? Why not just use time_phased as the independent variable? Anyway, now you can fit your data to your model with your parameters:
result = lc_model.fit(lc_data, params, c=c_data)
At this point, you can print out a report of the results and/or view or get the best-fit arrays:
print(result.fit_report())
for p in result.params.items(): print(p)
import matplotlib.pyplot as plt
plt.plot(c_data, lc_data, label='data')
plt.plot(c_data, result.best_fit, label='fit')
plt.legend()
plt.show()
Hope that helps...
I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using lmfit. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculate the uncertainty as the square root of the counts in each bin (it's an exponential model), then use the lmfit module to determine the best fit along with means and uncertainties. However, graphing the output of the model.fit() method gives a graph where the red line is the fit (and obviously not the correct fit). [Fit result output graph]
I've looked online and can't find a solution to this, I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
    def __init__(self, file_name):
        times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ', unpack=False)[:, 0])
        self.times = []
        for i in range(len(times_dirty)):
            if times_dirty[i] < 40000:
                self.times.append(times_dirty[i])
        self.counts = []
        self.binBounds = []
        self.uncertainties = []
        self.means = []

    def binData(self, k):
        self.counts, self.binBounds = np.histogram(self.times, bins=k)
        self.binBounds = self.binBounds[:-1]

    def calcStats(self):
        if len(self.counts) == 0:
            print('Run binData function first')
        else:
            self.uncertainties = sqrt(self.counts)

    def plotData(self, fit):
        plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
        plt.plot(self.binBounds, fit.init_fit, 'k--')
        plt.plot(self.binBounds, fit.best_fit, 'r')
        plt.show()


def decay(t, N, lamb, B):
    return N * lamb * exp(-lamb * t) + B
def main():
    # raw string so the backslashes in the Windows path are not treated as escape sequences
    muonEvents = data(r'C:\Users\Colt\Downloads\muon.data')
    muonEvents.binData(10)
    muonEvents.calcStats()

    mod = Model(decay)
    result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B=1)

    muonEvents.plotData(result)
    print(result.fit_report())
    print(len(muonEvents.times))


if __name__ == "__main__":
    main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
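For example, a rough sketch of that test, assuming the muonEvents and mod objects from the question's main() are in scope:
# rescale both axes by 1000 before fitting, just to test the scaling idea
t_scaled = muonEvents.binBounds / 1000.0
c_scaled = muonEvents.counts / 1000.0
result_scaled = mod.fit(c_scaled, t=t_scaled, N=1, lamb=1, B=1)
print(result_scaled.fit_report())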
Just to build on James Phillips' answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1 and t > 100. So, if the algorithm starts at lamb=1 and varies that by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, having the variables have values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.
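As a minimal sketch of that change (the exact numbers are only starting guesses, not fitted results), the fit call in the question's main() would become:
result = mod.fit(muonEvents.counts, t=muonEvents.binBounds,
                 N=1.e6, lamb=1.e-4, B=100)
print(result.fit_report())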
I am trying to fit my data points. The fit without errors does not look very good, so now I am trying to fit the data taking the error at each point into account. My fit function is below:
def fit_func(x,a,b,c):
return np.log10(a*x**b + c)
then my data points are below:
r = [ 0.00528039,0.00721161,0.00873037,0.01108928,0.01413011,0.01790143,0.02263833, 0.02886089,0.03663713,0.04659512,0.05921978,0.07540126,0.09593949, 0.12190075,0.15501736,0.19713563,0.25041524,0.3185025,0.40514023,0.51507869, 0.65489938,0.83278859,1.05865016,1.34624082]
logf = [-1.1020581079659384, -1.3966927245616112, -1.4571368537041418, -1.5032694247562564, -1.8534775558300272, -2.2715812166948304, -2.2627690390113862, -2.5275290780299331, -3.3798813619309365, -6.0, -2.6270989211307034, -2.6549656159564918, -2.9366845162570079, -3.0955026428779604, -3.2649261507250289, -3.2837123017838366, -3.0493752067042856, -3.3133647996463229, -3.0865051494299243, -3.1347499415910169, -3.1433062918466632, -3.1747394718538979, -3.1797597345585245, -3.1913094832146616]
Because my data are on a log scale (logf), the error bars for each data point are not symmetric. The upper and lower error bars are below:
upper = [0.070648916083227764, 0.44346256268274886, 0.11928131794776076, 0.094260899008089094, 0.14357124858039971, 0.27236750587684311, 0.18877122991380402, 0.28707938182603066, 0.72011863806906318, 0, 0.16813325716948757, 0.13624929595316049, 0.21847915642008875, 0.25456116079315372, 0.31078368240910148, 0.23178227464741452, 0.09158189214515966, 0.14020538489677881, 0.059482730164901909, 0.051786777740678414, 0.041126467609954531, 0.034394612910981337, 0.027206248503368613, 0.021847333685597548]
lower = [0.06074797748043137, 0.21479225959441428, 0.093479845697059583, 0.077406149968278104, 0.1077175009766278, 0.16610073183912188, 0.13114254113054535, 0.17133966123838595, 0.57498950902908286, 2.9786837094190934, 0.12090437578535695, 0.10355760401838676, 0.14467588244034646, 0.15942693835964539, 0.17929440903034921, 0.15031667827534712, 0.075592499975030591, 0.10581886912443572, 0.05230849287772843, 0.04626422871423852, 0.03756658820680725, 0.03186944137872727, 0.025601929615431285, 0.02080073540367966]
I have the fitting as:
popt, pcov = optimize.curve_fit(fit_func, r, logf,sigma=[lower,upper])
logf_fit = fit_func(r,*popt)
But this is wrong. How can I implement curve fitting from scipy so that it includes the upper and lower errors? And how can I get the fitting errors of the parameters a, b, and c?
You can use scipy.optimize.leastsq with custom weights:
import scipy.optimize as optimize
import numpy as np
# redefine lists as arrays
x = np.array(r)
y = np.array(logf)
errup = np.array(upper)
errlow = np.array(lower)

# model function
def fit_func(x, a, b, c):
    return np.log10(a*x**b + c)

# residual (error) function with asymmetric weights
def my_error(V):
    a, b, c = V
    yfit = fit_func(x, a, b, c)
    weight = np.ones_like(yfit)
    weight[yfit > y] = errup[yfit > y]     # if the fit point is above the measurement, use the upper error
    weight[yfit <= y] = errlow[yfit <= y]  # else use the lower error
    return (yfit-y)**2/weight**2

answer = optimize.leastsq(my_error, x0=[0.0001, -1, 0.0006])
a, b, c = answer[0]
print(a, b, c)
It works, but it is very sensitive to the initial values, since the log can end up in the wrong domain (negative numbers), in which case the fit fails. Here I find a=9.14464745425e-06, b=-1.75179880756, c=0.00066720486385, which is pretty close to the data.
I have a class describing a mathematical function. The class needs to be able to least squares fit itself to passed in data. i.e. you can call a method like this:
classinstance.Fit(x,y)
and it adjusts its internal variables to best fit the data. I'm trying to use scipy.optimize.curve_fit for this, and it needs me to pass in a model function. The problem is that the model function is within the class and needs to access the variables and members of the class to compute the data. However, curve_fit can't call a function whose first parameter is self. Is there a way to make curve_fit use a method of the class as its model function?
Here is a minimum executable snippet to show the issue:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
# This is a class which encapsulates a gaussian and fits itself to data.
class GaussianComponent():
    # This is a formula string showing the exact code used to produce the gaussian.
    # It has to be printed for the user, and it can be used to compute values.
    Formula = 'self.Amp*np.exp(-((x-self.Center)**2/(self.FWHM**2*np.sqrt(2))))'

    # These parameters describe the gaussian.
    Center = 0
    Amp = 1
    FWHM = 1

    # HERE IS THE CONUNDRUM: IF I LEAVE SELF IN THE DECLARATION, CURVE_FIT
    # CANNOT CALL IT SINCE IT REQUIRES THE WRONG NUMBER OF PARAMETERS.
    # IF I REMOVE IT, FITFUNC CAN'T ACCESS THE CLASS VARIABLES.
    def FitFunc(self, x, y, Center, Amp, FWHM):
        eval('y - ' + self.Formula.replace('self.', ''))

    # This uses curve_fit to adjust the gaussian parameters to best match the
    # data passed in.
    def Fit(self, x, y):
        #FitFunc = lambda x, y, Center, Amp, FWHM: eval('y - ' + self.Formula.replace('self.', ''))
        FitParams, FitCov = curve_fit(self.FitFunc, x, y, (self.Center, self.Amp, self.FWHM))
        self.Center = FitParams[0]
        self.Amp = FitParams[1]
        self.FWHM = FitParams[2]

    # Give back a vector which describes what this gaussian looks like.
    def GetPlot(self, x):
        y = eval(self.Formula)
        return y
# Make a gaussian with default shape and position (height 1 at the origin, FWHM 1).
g = GaussianComponent()
# Make a space in which we can plot the gaussian.
x = np.linspace(-5,5,100)
y = g.GetPlot(x)
# Make some "experimental data" which is just the default shape, noisy, and
# moved up the y axis a tad so the best fit will be different.
ynoise = y + np.random.normal(loc=0.1, scale=0.1, size=len(x))
# Draw it
plt.plot(x,y, x,ynoise)
plt.show()
# Do the fit (but this doesn't work...)
g.Fit(x,y)
And this produces the following graph and then crashes since the model function is incorrect when it tries to do the fit.
Thanks in advance!
I spent some time looking at your code and unfortunately turned out to be 2 minutes late. Anyhow, to make things a bit more interesting, I've edited your class. Here's what I concocted:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
class GaussianComponent():
    def __init__(self, func, params=None):
        self.formula = func
        self.params = params

    def eval(self, x):
        # expose only the parameters, x and numpy to eval
        allowed_locals = {key: self.params[key] for key in self.params}
        allowed_locals["x"] = x
        allowed_globals = {"np": np}
        return eval(self.formula, allowed_globals, allowed_locals)

    def Fit(self, x, y):
        # curve_fit expects a function of (x, *free_params), so wrap eval()
        # and unpack the parameter dictionary in a fixed order
        names = list(self.params)

        def wrapped(x, *args):
            self.params = dict(zip(names, args))
            return self.eval(x)

        p0 = [self.params[name] for name in names]
        FitParams, FitCov = curve_fit(wrapped, x, y, p0=p0)
        self.params = dict(zip(names, FitParams))

# Make a gaussian with default shape and position (height 1 at the origin, FWHM 1).
g = GaussianComponent("Amp*np.exp(-((x-Center)**2/(FWHM**2*np.sqrt(2))))",
                      params={"Amp": 1, "Center": 0, "FWHM": 1})
**SNIPPED FOR BREVITY**
I believe you'll perhaps find this a more satisfying solution?
Currently all your gaussian parameters are class attributes; that means if you try to create a second instance of your class with different values for the parameters, you will change the values for the first instance too. By making all the parameters instance attributes, you get rid of that. That is why we have classes in the first place.
Your issue with self stems from the fact that you write self in your Formula. Now you don't have to any more. I think it makes a bit more sense like this, because when you instantiate an object of the class you can add as many or as few params to your declared function as you want. It doesn't even have to be a gaussian now (unlike before).
Just put all the params in a dictionary, just like curve_fit does, and forget about them.
By explicitly stating what eval can use, you help make sure that any evil-doers have a harder time breaking your code. It's still possible though, it always is with eval.
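To make the intent concrete, here is a short usage sketch (my own illustration, assuming the edited class above): build a component, generate noisy data, fit it, and inspect the fitted parameter dictionary.
x = np.linspace(-5, 5, 100)
g = GaussianComponent("Amp*np.exp(-((x-Center)**2/(FWHM**2*np.sqrt(2))))",
                      params={"Amp": 1, "Center": 0, "FWHM": 1})
# fake "experimental" data: the default gaussian plus a little noise and offset
ynoise = g.eval(x) + np.random.normal(loc=0.1, scale=0.1, size=len(x))
g.Fit(x, ynoise)
print(g.params)   # fitted Amp, Center and FWHM
plt.plot(x, ynoise, label='data')
plt.plot(x, g.eval(x), label='fit')
plt.legend()
plt.show()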
Good luck, ask if you need something clarified.
Ahh! It was actually a bug in my code. If I change this line:
def FitFunc(self, x, y, Center, Amp, FWHM):
to
def FitFunc(self, x, Center, Amp, FWHM):
Then we are fine. So curve_fit does correctly handle the self parameter but my model function shouldn't include y.
(Embarrassed!)
In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated Python function.
Is it possible to do that?
The following PyMC3 code shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm

# Predefine values on a two-parameter grid (x, w) for a set of i values (1, 2, 3)
idata = np.array([1, 2, 3])
size = 20
gridlength = size*size
Grid = np.empty((gridlength, 2+len(idata)))
for x in range(size):
    for w in range(size):
        # A silly version of my real model evaluated on the grid.
        Grid[x*size+w, :] = np.array([x, w]+[(x**i + w**i) for i in idata])

# A function to find the nearest value in Grid and return its product with the third variable z
def FindFromGrid(x, w, z):
    return Grid[int(x)*size+int(w), 2:] * z

# Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12, 2:]*3.6 + yerror  # i.e. true x=16, w=12 and z=3.6

with pm.Model() as model:
    # Priors
    x = pm.Uniform('x', lower=0, upper=size)
    w = pm.Uniform('w', lower=0, upper=size)
    z = pm.Uniform('z', lower=-5, upper=10)

    # Expected value
    y_hat = pm.Deterministic('y_hat', FindFromGrid(x, w, z))

    # Data likelihood
    ysigmas = np.ones(len(idata))*9.0
    y_like = pm.Normal('y_like', mu=y_hat, sd=ysigmas, observed=ydata)

    # Inference...
    start = pm.find_MAP()        # Find starting value by optimization
    step = pm.NUTS(state=start)  # Instantiate MCMC sampling algorithm
    trace = pm.sample(1000, step, start=start, progressbar=False)  # draw 1000 posterior samples using NUTS sampling

print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z': 3.6})
fig.show()
When I run this code, I get an error at the y_hat stage, because the int() function inside FindFromGrid(x,w,z) needs an integer, not a FreeRV.
Finding y_hat from a pre-calculated grid is important because my real model for y_hat cannot be expressed in an analytical form.
I tried OpenBUGS earlier, but I found out here that it is not possible to do this in OpenBUGS. Is it possible in PyMC?
Update
Based on an example on the PyMC GitHub page, I found that I need to add the following decorator to my FindFromGrid(x,w,z) function:
@pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar], otypes=[t.dvector])
This seems to solve the above-mentioned issue, but I cannot use the NUTS sampler any more since it needs gradients.
Metropolis does not seem to be converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this might not converge is that Metropolis() by default samples in the joint space, while Gibbs-within-Metropolis is often more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice samplers to be non-blocked (i.e. within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.
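For example, a rough sketch of trying both suggestions, reusing the model object from the question:
with model:
    start = pm.find_MAP()
    # non-blocked Metropolis: each free variable is updated separately (Gibbs-within-Metropolis)
    step = pm.Metropolis(blocked=False)
    # or, as an alternative, try the Slice sampler:
    # step = pm.Slice()
    trace = pm.sample(5000, step, start=start)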