confidence interval for the data itself using lmfit in python - python

Here is the link for the LMFIT implementation of the confidence intervals of parameters: http://lmfit.github.io/lmfit-py/confidence.html
Here is the code I am using:
import lmfit
import numpy as np
# x = np.linspace(1, 10, 250)
# np.random.seed(0)
# y = 1. -np.exp(-(x)/10.) + 0.1*np.random.randn(len(x))
pars = lmfit.Parameters()
pars.add_many(('n', 1.), ('tau', 3.))
# def residual(pars,data=None):
def residual(pars):
v = pars.valuesdict()
# if data is None:
# return 1.0 - np.exp(-(x**v['n'])/v['tau'])
return 1.0 - np.exp(-(x**v['n'])/v['tau'])-y
# create Minimizer
mini = lmfit.Minimizer(residual, pars)
# first solve with Nelder-Mead
out1 = mini.minimize(method='Nelder')
out2 = mini.minimize(method='leastsq', params=out1.params)
lmfit.report_fit(out2.params, min_correl=0.5)
ci, trace = lmfit.conf_interval(mini, out2, sigmas=[0.95],
trace=True, verbose=False)
lmfit.printfuncs.report_ci(ci)

It is a bit difficult to understand the title Confidence interval for the data itself using lmfit in python (there is no data), or the the first sentence I am doing curve fitting using lmfit package (you need data to fit).
I think what you are asking for is a way to get extreme values for the model function that best matches your data. If so, would it work to evaluate your function with all combinations of parameter values best +/- delta (where delta could be any uncertainly level you like), and take the extreme values of the model function? That's not very automated, but shouldn't be too hard.

Related

Statsmodels ARIMA: how to get confidence/prediction interval?

How to generate "lower" and "upper" predictions, not just "yhat"?
import statsmodels
from statsmodels.tsa.arima.model import ARIMA
assert statsmodels.__version__ == '0.12.0'
arima = ARIMA(df['value'], order=order)
model = arima.fit()
Now I can generate "yhat" predictions
yhat = model.forecast(123)
and get confidence intervals for model parameters (but not for predictions):
model.conf_int()
but how to generate yhat_lower and yhat_upper predictions?
In general, the forecast and predict methods only produce point predictions, while the get_forecast and get_prediction methods produce full results including prediction intervals.
In your example, you can do:
forecast = model.get_forecast(123)
yhat = forecast.predicted_mean
yhat_conf_int = forecast.conf_int(alpha=0.05)
If your data is a Pandas Series, then yhat_conf_int will be a DataFrame with two columns, lower <name> and upper <name>, where <name> is the name of the Pandas Series.
If your data is a numpy array (or Python list), then yhat_conf_int will be an (n_forecasts, 2) array, where the first column is the lower part of the interval and the second column is the upper part.
To generate prediction intervals as opposed to confidence intervals (which you have neatly made the distinction between, and is also presented in Hyndman's blog post on the difference between prediction intervals and confidence intervals), then you can follow the guidance available in this answer.
You could also try to compute bootstrapped prediction intervals, which is laid out in this answer.
Below, is my attempt at implementing this (I'll update it when I get the chance to check it in more detail):
def bootstrap_prediction_interval(y_train: Union[list, pd.Series],
y_fit: Union[list, pd.Series],
y_pred_value: float,
alpha: float = 0.05,
nbootstrap: int = None,
seed: int = None):
"""
Bootstraps a prediction interval around an ARIMA model's predictions.
Method presented clearly here:
- https://stats.stackexchange.com/a/254321
Also found through here, though less clearly:
- https://otexts.com/fpp3/prediction-intervals.html
Can consider this to be a time-series version of the following generalisation:
- https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/
:param y_train: List or Series of training univariate time-series data.
:param y_fit: List or Series of model fitted univariate time-series data.
:param y_pred_value: Float of the model predicted univariate time-series you want to compute P.I. for.
:param alpha: float = 0.05, the prediction uncertainty.
:param nbootstrap: integer = 1000, the number of bootstrap sampling of the residual forecast error.
Rules of thumb provided here:
- https://stats.stackexchange.com/questions/86040/rule-of-thumb-for-number-of-bootstrap-samples
:param seed: Integer to specify if you want deterministic sampling.
:return: A list [`lower`, `pred`, `upper`] with `pred` being the prediction
of the model and `lower` and `upper` constituting the lower- and upper
bounds for the prediction interval around `pred`, respectively.
"""
# get number of samples
n = len(y_train)
# compute the forecast errors/resid
fe = y_train - y_fit
# get percentile bounds
percentile_lower = (alpha * 100) / 2
percentile_higher = 100 - percentile_lower
if nbootstrap is None:
nbootstrap = np.sqrt(n).astype(int)
if seed is None:
rng = np.random.default_rng()
else:
rng = np.random.default_rng(seed)
# bootstrap sample from errors
error_bootstrap = []
for _ in range(nbootstrap):
idx = rng.integers(low=n)
error_bootstrap.append(fe[idx])
# get lower and higher percentiles of sampled forecast errors
fe_lower = np.percentile(a=error_bootstrap, q=percentile_lower)
fe_higher = np.percentile(a=error_bootstrap, q=percentile_higher)
# compute P.I.
pi = [y_pred_value + fe_lower, y_pred_value, y_pred_value + fe_higher]
return pi
using ARIMA you need to include seasonality and exogenous variables in the model yourself. While using SARIMA (Seasonal ARIMA) or SARIMAX (also for exogenous factors) implementation give C.I. to summary_frame:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
dta = sm.datasets.sunspots.load_pandas().data[['SUNACTIVITY']]
dta.index = pd.Index(pd.date_range("1700", end="2009", freq="A"))
print(dta)
print("init data:\n")
dta.plot(figsize=(12,4));
plt.show()
##print("SARIMAX(dta, order=(2,0,0), trend='c'):\n")
result = sm.tsa.SARIMAX(dta, order=(2,0,0), trend='c').fit(disp=False)
print(">>> result.params:\n", result.params, "\n")
##print("SARIMA_model.plot_diagnostics:\n")
result.plot_diagnostics(figsize=(15,12))
plt.show()
# summary stats of residuals
print(">>> residuals.describe:\n", result.resid.describe(), "\n")
# Out-of-sample forecasts are produced using the forecast or get_forecast methods from the results object
# The get_forecast method is more general, and also allows constructing confidence intervals.
fcast_res1 = result.get_forecast()
# specify that we want a confidence level of 90%
print(">>> forecast summary at alpha=0.01:\n", fcast_res1.summary_frame(alpha=0.10), "\n")
# plot forecast
fig, ax = plt.subplots(figsize=(15, 5))
# Construct the forecasts
fcast = result.get_forecast('2010Q4').summary_frame()
print(fcast)
fcast['mean'].plot(ax=ax, style='k--')
ax.fill_between(fcast.index, fcast['mean_ci_lower'], fcast['mean_ci_upper'], color='k', alpha=0.1);
fig.tight_layout()
plt.show()
docs: "The forecast above may not look very impressive, as it is almost a straight line. This is because this is a very simple, univariate forecasting model. Nonetheless, keep in mind that these simple forecasting models can be extremely competitive"
p.s. here " you can use it in a non-seasonal way by setting the seasonal terms to zero."

Fitting parameters of a data set from a model

I am trying to fit the parameters of a transit light curve.
I have observed transit light curve data and I am using a .py in python that through 4 parameters (period, a(semi-major axis), inclination, planet radius) returns a model transit light curve. I would like to minimize the residual between these two light curves. This is what I am trying to do: First - Estimate a max likelihood using method = "L-BFGS-B" and then apply the mcmc using emcee to estimate the uncertainties.
The code:
p = lmfit.Parameters()
p.add_many(('per', 2.), ('inc', 90.), ('a', 5.), ('rp', 0.1))
per_b = [1., 3.]
a_b = [4., 6.]
inc_b = [88., 90.]
rp_b = [0.1, 0.3]
bounds = [(per_b[0], per_b[1]), (inc_b[0], inc_b[1]), (a_b[0], a_b[1]), (rp_b[0], rp_b[1])]
def residual(p):
v = p.valuesdict()
eclipse.criarEclipse(v['per'], v['a'], v['inc'], v['rp'])
lc0 = numpy.array(eclipse.getCurvaLuz()) (observed flux data)
ts0 = numpy.array(eclipse.getTempoHoras()) (observed time data)
c = numpy.linspace(min(time_phased[bb]),max(time_phased[bb]),len(time_phased[bb]),endpoint=True)
nn = interpolate.interp1d(ts0,lc0)
return nn(c) - smoothed_LC[bb] (residual between the model and the data)
Inside def residual(p) I make sure that both the observed data (time_phased[bb] and smoothed_LC[bb]) have the same size of the model transit light curve. I want it to give me the best fit values for the parameters (v['per'], v['a'], v['inc'], v['rp']).
I need your help and I appreciate your time and your attention. Kindest regards, Yuri.
Your example is incomplete, with many partial concepts and some invalid Python. This makes it slightly hard to understand your intention. If the answer below is not sufficient, update your question with a complete example.
It seems pretty clear that you want to model your data smoothed_LC[bb] (not sure what bb is) with a model for some effect of an eclipse. With that assumption, I would recommend using the lmfit.Model approach. Start by writing a function that models the data, just so you check and plot your model. I'm not entirely sure I understand everything you're doing, but this model function might look like this:
import numpy
from scipy import interpolate
from lmfit import Model
# import eclipse from somewhere....
def eclipse_lc(c, per, a, inc, p):
eclipse.criarEclipse(per, a, inc, rp)
lc0 = numpy.array(eclipse.getCurvaLuz()) # observed flux data
ts0 = numpy.array(eclipse.getTempoHoras()) # observed time data
return interpolate.interp1d(ts0,lc0)(c)
With this model function, you can build a Model:
lc_model = Model(eclipse_lc)
and then build parameters for your model. This will automatically name them after the argument names of your model function. Here, you can also give them initial values:
params = lc_model.make_params(per=2, inc=90, a=5, rp=0.1)
You wanted to place upper and lower bounds on these parameters. This is done by setting min and max parameters, not making an ordered array of bounds:
params['per'].min = 1.0
params['per'].max = 3.0
and so on. But also: setting such tight bounds is usually a bad idea. Set bounds to avoid unphysical parameter values or when it becomes evident that you need to place them.
Now, you can fit your data with this model. Well, first you need to get the data you want to model. This seems less clear from your example, but perhaps:
c_data = numpy.linspace(min(time_phased[bb]), max(time_phased[bb]),
len(time_phased[bb]), endpoint=True)
lc_data = smoothed_LC[bb]
Well: why do you need to make this c_data? Why not just use time_phased as the independent variable? Anyway, now you can fit your data to your model with your parameters:
result = lc_model(lc_data, params, c=c_data)
At this point, you can print out a report of the results and/or view or get the best-fit arrays:
print(result.fit_report())
for p in result.params.items(): print(p)
import matplotlib.pyplot as plt
plt.plot(c_data, lc_data, label='data')
plt.plot(c_data. result.best_fit, label='fit')
plt.legend()
plt.show()
Hope that helps...

Why don't the parameters change when I am trying to impelement parameter estimation in Python?

I am trying with help of this post: http://adventuresinpython.blogspot.com/2012/08/fitting-differential-equation-system-to.html, but the parameters remain the same, independently of what initial conditions are chosen.
#Zombie Display
# zombie apocalypse modeling
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy import integrate
from scipy.optimize import fmin
#=====================================================
#Notice we must import the Model Definition
#The Model
#=======================================================
def eq(par,initial_cond,start_t,end_t,incr):
#-time-grid-----------------------------------
t = np.linspace(start_t, end_t,incr)
#differential-eq-system----------------------
def funct(phi,t):
S=phi[0]
E=phi[1]
I=phi[2]
R=phi[3]
dS_dt = mu*(N-S)-beta*S*I/N-nu*S
dE_dt = beta*S*I/N-(mu+sigma)*E
dI_dt = sigma*E-(mu+gamma)*I
dR_dt=gamma*I-mu*R+nu*S
dphi_dt = [dS_dt,dE_dt,dI_dt,dR_dt]
return dphi_dt
#integrate------------------------------------
ds = integrate.odeint(funct,initial_cond,t)
return (ds[:,0],ds[:,1],ds[:,2],ds[:,3],t)
#=======================================================
#=====================================================
#1.Get Data
#====================================================
Td=np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])#time
Zd=np.array([274,326,547,639,2000,2700,4400,6000,7700,9700,11200,14300,17200,19200,20000])#zombie pop
#====================================================
#2.Set up Info for Model System
#===================================================
# model parameters
#----------------------------------------------------
beta=1.4
gamma=0.06
sigma=0.15
mu=0
nu=0
rates=(beta,gamma,sigma,mu,nu)
# model initial conditions
#---------------------------------------------------
N=11*pow(10,6)
S0 = N-1 # initial population
E0 = 105.1 # initial zombie population
I0 = 27.679 # initial death population
R0= 2.0
y0 = [S0, E0, I0, R0] # initial condition vector
# model steps
#---------------------------------------------------
start_time=0.0
end_time=140
intervals=5.0
mt=np.linspace(start_time,end_time,intervals)
# model index to compare to data
#----------------------------------------------------
findindex=lambda x:np.where(mt>=x)[0][0]
mindex=list(map(findindex,Td))
#=======================================================
#3.Score Fit of System
#=========================================================
def score(parms):
#a.Get Solution to system
F0,F1,F2,F3,T=eq(parms,y0,start_time,end_time,intervals)
#b.Pick of Model Points to Compare
Zm=F1[mindex]
#c.Score Difference between model and data points
ss=lambda data,model:((data-model)**2).sum()
return ss(Zd,Zm)
#========================================================
#=========================================================
#4.Optimize Fit
#=======================================================
fit_score=score(rates)
answ=fmin(score,(rates),full_output=1,maxiter=1000000)
bestrates=answ[0]
bestscore=answ[1]
beta, gamma, sigma, mu, nu=answ[0]
newrates=(beta,gamma,sigma,mu,nu)
#=======================================================
#5.Generate Solution to System
#=======================================================
F0,F1,F2,F3,T=eq(newrates,y0,start_time,end_time,intervals)
Zm=F1[mindex]
Tm=T[mindex]
#======================================================
#6. Plot Solution to System
#=========================================================
plt.figure()
plt.plot(T,F1,'b-',Tm,Zm,'ro',Td,Zd,'go')
plt.xlabel('days')
plt.ylabel('population')
title='Zombie Apocalypse Fit Score: '+str(bestscore)
plt.title(title)
plt.show()
I know that this is huge, but if somebody is expert from parameter estimations for differential equation system, I would be glad to hear any information. Thanks a lot!
You do not see any change in the parameters because you do not use them, the score function is constant. While you pass on parms to eq, inside eq you do not unpack the par parameter.
Add in eq as first line
beta,gamma,sigma,mu,nu = par
and the minimization algorithm does something non-trivial. It is still possible that no solution will be found in reasonable time. Set maxiter to something more reasonable. If the method then fails, it may also be a matter of the problem formulation. Perhaps the initial conditions need also be optimization variables, or the ODE system is not really suitable.

lmfit for exponential data returns linear function

I'm working on fitting muon lifetime data to a curve to extract the mean lifetime using the lmfit function. The general process I'm using is to bin the 13,000 data points into 10 bins using the histogram function, calculating the uncertainty with the square root of the counts in each bin (it's an exponential model), then use the lmfit module to determine the best fit along with means and uncertainty. However, graphing the output of the model.fit() method returns this graph, where the red line is the fit (and obviously not the correct fit). Fit result output graph
I've looked online and can't find a solution to this, I'd really appreciate some help figuring out what's going on. Here's the code.
import os
import numpy as np
import matplotlib.pyplot as plt
from numpy import sqrt, pi, exp, linspace
from lmfit import Model
class data():
def __init__(self,file_name):
times_dirty = sorted(np.genfromtxt(file_name, delimiter=' ',unpack=False)[:,0])
self.times = []
for i in range(len(times_dirty)):
if times_dirty[i]<40000:
self.times.append(times_dirty[i])
self.counts = []
self.binBounds = []
self.uncertainties = []
self.means = []
def binData(self,k):
self.counts, self.binBounds = np.histogram(self.times, bins=k)
self.binBounds = self.binBounds[:-1]
def calcStats(self):
if len(self.counts)==0:
print('Run binData function first')
else:
self.uncertainties = sqrt(self.counts)
def plotData(self,fit):
plt.errorbar(self.binBounds, self.counts, yerr=self.uncertainties, fmt='bo')
plt.plot(self.binBounds, fit.init_fit, 'k--')
plt.plot(self.binBounds, fit.best_fit, 'r')
plt.show()
def decay(t, N, lamb, B):
return N * lamb * exp(-lamb * t) +B
def main():
muonEvents = data('C:\Users\Colt\Downloads\muon.data')
muonEvents.binData(10)
muonEvents.calcStats()
mod = Model(decay)
result = mod.fit(muonEvents.counts, t=muonEvents.binBounds, N=1, lamb=1, B = 1)
muonEvents.plotData(result)
print(result.fit_report())
print (len(muonEvents.times))
if __name__ == "__main__":
main()
This might be a simple scaling problem. As a quick test, try dividing all raw data by a factor of 1000 (both X and Y) to see if changing the magnitude of the data has any effect.
Just to build on James Phillips answer, I think the data you show in your graph imply values for N, lamb, and B that are very different from 1, 1, 1. Keep in mind that exp(-lamb*t) is essentially 0 for lamb = 1, and t> 100. So, if the algorithm starts at lamb=1 and varies that by a little bit to find a better value, it won't actually be able to see any difference in how well the model matches the data.
I would suggest trying to start with values that are more reasonable for the data you have, perhaps N=1.e6, lamb=1.e-4, and B=100.
As James suggested, having the variables have values on the order of 1 and putting in scale factors as necessary is often helpful in getting numerically stable solutions.

How to define General deterministic function in PyMC

In my model, I need to obtain the value of my deterministic variable from a set of parent variables using a complicated python function.
Is it possible to do that?
Following is a pyMC3 code which shows what I am trying to do in a simplified case.
import numpy as np
import pymc as pm
#Predefine values on two parameter Grid (x,w) for a set of i values (1,2,3)
idata = np.array([1,2,3])
size= 20
gridlength = size*size
Grid = np.empty((gridlength,2+len(idata)))
for x in range(size):
for w in range(size):
# A silly version of my real model evaluated on grid.
Grid[x*size+w,:]= np.array([x,w]+[(x**i + w**i) for i in idata])
# A function to find the nearest value in Grid and return its product with third variable z
def FindFromGrid(x,w,z):
return Grid[int(x)*size+int(w),2:] * z
#Generate fake Y data with error
yerror = np.random.normal(loc=0.0, scale=9.0, size=len(idata))
ydata = Grid[16*size+12,2:]*3.6 + yerror # ie. True x= 16, w= 12 and z= 3.6
with pm.Model() as model:
#Priors
x = pm.Uniform('x',lower=0,upper= size)
w = pm.Uniform('w',lower=0,upper =size)
z = pm.Uniform('z',lower=-5,upper =10)
#Expected value
y_hat = pm.Deterministic('y_hat',FindFromGrid(x,w,z))
#Data likelihood
ysigmas = np.ones(len(idata))*9.0
y_like = pm.Normal('y_like',mu= y_hat, sd=ysigmas, observed=ydata)
# Inference...
start = pm.find_MAP() # Find starting value by optimization
step = pm.NUTS(state=start) # Instantiate MCMC sampling algorithm
trace = pm.sample(1000, step, start=start, progressbar=False) # draw 1000 posterior samples using NUTS sampling
print('The trace plot')
fig = pm.traceplot(trace, lines={'x': 16, 'w': 12, 'z':3.6})
fig.show()
When I run this code, I get error at the y_hat stage, because the int() function inside the FindFromGrid(x,w,z) function needs integer not FreeRV.
Finding y_hat from a pre calculated grid is important because my real model for y_hat does not have an analytical form to express.
I have earlier tried to use OpenBUGS, but I found out here it is not possible to do this in OpenBUGS. Is it possible in PyMC ?
Update
Based on an example in pyMC github page, I found I need to add the following decorator to my FindFromGrid(x,w,z) function.
#pm.theano.compile.ops.as_op(itypes=[t.dscalar, t.dscalar, t.dscalar],otypes=[t.dvector])
This seems to solve the above mentioned issue. But I cannot use NUTS sampler anymore since it needs gradient.
Metropolis seems to be not converging.
Which step method should I use in a scenario like this?
You found the correct solution with as_op.
Regarding the convergence: Are you using pm.Metropolis() instead of pm.NUTS() by any chance? One reason this could not converge is that Metropolis() by default samples in the joint space while often Gibbs within Metropolis is more effective (and this was the default in pymc2). Having said that, I just merged this: https://github.com/pymc-devs/pymc/pull/587 which changes the default behavior of the Metropolis and Slice sampler to be non-blocked by default (so within Gibbs). Other samplers like NUTS that are primarily designed to sample the joint space still default to blocked. You can always explicitly set this with the kwarg blocked=True.
Anyway, update pymc with the most recent master and see if convergence improves. If not, try the Slice sampler.

Categories