I'm trying to use polyfit to find the best fitting straight line to a set of data, but I also need to know the uncertainty on the parameters, so I want the covariance matrix too. The online documentation suggests I write:
polyfit(x, y, 2, cov=True)
but this gives the error:
TypeError: polyfit() got an unexpected keyword argument 'cov'
And sure enough help(polyfit) shows no keyword argument 'cov'.
So does the online documentation refer to a previous release of numpy? (I have 1.6.1, the newest one). I could write my own polyfit function, but has anyone got any suggestions for why I don't have a covariance option on my polyfit?
Thanks
For a solution that comes from a library, I find that using scikits.statsmodels is a convenient choice. In statsmodels, regression objects have callable attributes that return the parameters and standard errors. I put an example of how this would work for you below:
# Imports, I assume NumPy for forming your data.
import numpy as np
import scikits.statsmodels.api as sm
# Form the data here
(X, Y) = ....
# Desired polynomial degree (define it before building the design matrix)
deg = 2
# X is assumed to be an (N, 1) column vector here
reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg + 1):
    reg_x_data = np.hstack((reg_x_data, X**ii))  # Append the ii-th degree term.
# Store OLS regression results into `result`
result = sm.OLS(Y,reg_x_data).fit()
# Print the estimated coefficients
print result.params
# Print the basic OLS standard error in the coefficients
print result.bse
# Print the estimated basic OLS covariance matrix
print result.cov_params() # <-- Notice, this one is a function by convention.
# Print the heteroskedasticity-consistent standard error
print result.HC0_se
# Print the heteroskedasticity-consistent covariance matrix
print result.cov_HC0
There are additional robust covariance attributes in the result object as well; you can see them by printing out dir(result). Also, by convention, the covariance matrices for the heteroskedasticity-consistent estimators are only available after you access the corresponding standard error: you must reference result.HC0_se before result.cov_HC0, because the first access causes the second to be computed and stored.
Pandas is another library that probably provides more advanced support for these operations.
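As a side note, the design-matrix loop above can be written more compactly with numpy.vander (a sketch, not required for the statsmodels approach; X.ravel() is used in case X is stored as a column vector):
# Equivalent design matrix, columns [1, X, X**2, ..., X**deg] (increasing powers).
reg_x_data = np.vander(X.ravel(), deg + 1)[:, ::-1]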
Non-library function
This might be useful when you don't want to / can't rely on an extra library function.
Below is a function I wrote to return the OLS regression coefficients along with a number of related quantities. It returns the residuals, the regression variance and standard error (the square root of the residual variance), the asymptotic formula for the large-sample variance, the OLS covariance matrix, the heteroskedasticity-consistent "robust" covariance estimate (the OLS covariance weighted according to the residuals), and the "White" or "bias-corrected" heteroskedasticity-consistent covariance.
import numpy as np
###
# Regression and standard error estimation functions
###
def ols_linreg(X, Y):
    """ ols_linreg(X, Y)

    Ordinary least squares regression estimator given explanatory variables
    matrix X and observations matrix Y. The length of the first dimension of
    X and Y must be the same (equal to the number of samples in the data set).

    Note: these methods should be adapted if you need to use this for large data.
    This is mostly for illustrating what to do for calculating the different
    classical standard errors. You would never really want to compute the inverse
    matrices for large problems.

    This was developed with NumPy 1.5.1.
    """
    (N, K) = X.shape
    t1 = np.linalg.inv((np.transpose(X)).dot(X))
    t2 = (np.transpose(X)).dot(Y)
    beta = t1.dot(t2)
    residuals = Y - X.dot(beta)
    sig_hat = (1.0/(N-K))*np.sum(residuals**2)
    sig_hat_asymptotic_variance = 2*sig_hat**2/N
    conv_st_err = np.sqrt(sig_hat)

    sum1 = 0.0
    for ii in range(N):
        sum1 = sum1 + np.outer(X[ii,:], X[ii,:])
    sum1 = (1.0/N)*sum1
    ols_cov = (sig_hat/N)*np.linalg.inv(sum1)

    PX = X.dot(np.linalg.inv(np.transpose(X).dot(X)).dot(np.transpose(X)))
    robust_se_mat1 = np.linalg.inv(np.transpose(X).dot(X))
    robust_se_mat2 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)).dot(X))
    robust_se_mat3 = np.transpose(X).dot(np.diag(residuals[:,0]**(2.0)/(1.0-np.diag(PX))).dot(X))
    v_robust = robust_se_mat1.dot(robust_se_mat2.dot(robust_se_mat1))
    v_modified_robust = robust_se_mat1.dot(robust_se_mat3.dot(robust_se_mat1))

    """ Returns:
    beta -- The vector of coefficient estimates, ordered on the columns of X.
    residuals -- The vector of residuals, Y - X.beta
    sig_hat -- The sample variance of the residuals.
    conv_st_err -- The 'standard error of the regression', sqrt(sig_hat).
    sig_hat_asymptotic_variance -- The analytic formula for the large-sample variance.
    ols_cov -- The covariance matrix under the basic OLS assumptions.
    v_robust -- The "robust" covariance matrix, weighted to account for the residuals and heteroskedasticity.
    v_modified_robust -- The bias-corrected and heteroskedasticity-consistent covariance matrix.
    """

    return beta, residuals, sig_hat, conv_st_err, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust
For your problem, you would use it like this:
import numpy as np
# Define or load your data:
(Y, X) = ....
# Desired polynomial degree
deg = 2
reg_x_data = np.ones(X.shape)  # 0th degree term.
for ii in range(1, deg + 1):
    reg_x_data = np.hstack((reg_x_data, X**ii))  # Append the ii-th degree term.
# Get all of the regression data.
beta, residuals, sig_hat, conv_st_error, sig_hat_asymptotic_variance, ols_cov, v_robust, v_modified_robust = ols_linreg(reg_x_data,Y)
# Print the covariance matrix:
print ols_cov
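If you also want per-coefficient standard errors from this output (the analogue of result.bse in the statsmodels example above), they are the square roots of the diagonal of the covariance matrix:
# Standard errors of the coefficients under the basic OLS assumptions
print(np.sqrt(np.diag(ols_cov)))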
If you spot any bugs in my computations (especially the heteroskedasticity-consistent estimators) please let me know and I'll fix it asap.
In scipy there is no support for fitting discrete distributions to data. I know there are already a lot of questions on this subject.
For example, if I have an array like the one below:
x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]
I couldn't apply for this array:
from scipy.stats import nbinom
param = nbinom.fit(x)
But I would like to ask: as of today, is there any way to fit these three discrete distributions (negative binomial, Poisson, geometric) and then choose the best fit for the discrete dataset?
You can use Method of Moments to fit any particular distribution.
Basic idea: get empirical first, second, etc. moments, then derive distribution parameters from these moments.
So, in all these cases we only need two moments. Let's get them:
import pandas as pd
# for other distributions, you'll need to implement PMF
from scipy.stats import nbinom, poisson, geom
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {} # we'll use it later
Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction, while pandas' do. If you have 100+ samples, there shouldn't be much difference, but on smaller samples it could be important.
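For example, a quick illustration of the difference on a tiny made-up sample:
import numpy as np
import pandas as pd

sample = [1, 2, 3, 4]
print(np.var(sample))             # 1.25     (ddof=0, no Bessel's correction)
print(pd.Series(sample).var())    # 1.666... (ddof=1, with Bessel's correction)
print(np.var(sample, ddof=1))     # 1.666... numpy with the correction applied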
Now, let's get parameters for these distributions. Negative binomial has two parameters: p, r. Let's estimate them and calculate likelihood of the dataset:
# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:
p = 1 - mean / var # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p
UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. So, to be consistent with the scipy convention, use:
p = mean / var
r = p * mean / (1-p)
END OF UPD
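As a quick sanity check of the scipy parametrization (a sketch; r and p are the moment estimates from the update above, mean and var the sample moments), the fitted distribution should reproduce the sample moments:
# Both pairs should be (approximately) equal.
print(nbinom.mean(r, p), mean)
print(nbinom.var(r, p), var)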
UPD2:
I'd suggest using the log-likelihood code from #thilak's answer below instead. It avoids loss of precision, which is especially important on large samples.
END OF UPD2
Calculate likelihood:
likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()
Same for Poisson, there is only one parameter:
# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()
Same for Geometric distribution:
# mean = 1 / p # this form fits the scipy definition
p = 1 / mean
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
Finally, let's get the best fit:
best_fit = max(likelihoods, key=lambda x: likelihoods[x])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])
Let me know if you have any questions
Great answer by Marat.
In addition to Marat's post, I would most certainly recommend taking the log of the probability mass function. Some information on why log-likelihood is preferred over likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
I would rewrite the code for the negative binomial as:
log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()
Note that I have used:
logpmf instead of pmf
sum instead of product
And to find out the best distribution:
best_fit = max(log_likelihoods, key=lambda x: log_likelihoods[x])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])
Is it possible to obtain the value of the chi squared as a direct output of scipy.optimize.curve_fit()?
Usually, it is easy to compute it after the fit by squaring the difference between the model and the data, weighting by the uncertainties and summing all up. However, it is not as direct when the parameter sigma is passed a 2D matrix (the covariance matrix of the data) instead of a simple 1D array.
Are really the best-fit parameters and its covariance matrix the only two outputs that can be extracted from curve_fit()?
It is not possible to obtain the value of chi^2 from scipy.optimize.curve_fit directly without manual calculations. It is possible to get additional output from curve_fit besides popt and pcov by providing the argument full_output=True, but the additional output does not contain the value of chi^2. (The additional output is documented e.g. at leastsq here).
In the case where sigma is a MxM array, the definition of the chi^2 function minimized by curve_fit is slightly different.
In this case, curve_fit minimizes the function r.T @ inv(sigma) @ r, where r = ydata - f(xdata, *popt), instead of chisq = sum((r / sigma) ** 2) as in the case of a one-dimensional sigma; see the documentation of the parameter sigma.
So you should also be able to calculate chi^2 in your case by evaluating r.T @ inv(sigma) @ r with your optimized parameters.
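For instance, a minimal self-contained sketch of that calculation (the straight-line model and the 5x5 data covariance matrix are made up for illustration):
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    return a * x + b

xdata = np.linspace(0.0, 1.0, 5)
ydata = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
sigma = 0.04 * np.eye(5) + 0.01 * np.ones((5, 5))   # data covariance matrix (2-D sigma)

popt, pcov = curve_fit(f, xdata, ydata, sigma=sigma, absolute_sigma=True)

r = ydata - f(xdata, *popt)
chisq = r @ np.linalg.solve(sigma, r)   # same as r.T @ inv(sigma) @ r, without the explicit inverse
print(chisq)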
An alternative would be to use another package, for example lmfit, where the value of chi square can be directly obtained from the fit result:
from lmfit.models import GaussianModel
model = GaussianModel()
# create parameters with initial guesses:
params = model.make_params(center=9, amplitude=40, sigma=1)
result = model.fit(n, params, x=centers)  # n: your measured data values, centers: the corresponding x values
print(result.chisqr)
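If the reduced chi-square is what you are after, lmfit also exposes it on the fit result (as far as I can tell):
print(result.redchi)   # chi-square divided by the number of degrees of freedom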
I'm using Python's statsmodels to perform a weighted linear regression. Since this is my first time with this module, I ran some basic tests.
Doing these, I find that the estimated errors on the parameters (i.e., the uncertainties) are independent of the weights.
This doesn't fit my naive expectation (that larger error bars would produce a more uncertain result), or the definition I've seen in nonlinear fitting, which generally uses the weighted Hessian to report the uncertainty of the parameters.
I couldn't find an explicit description of the calculation of the parameter-errors (either in the statsmodels docs or support pages). Posting here to see if anyone has experience with this (and for future searchers).
Minimal code:
import numpy as np
import statsmodels.api as sm
escale = 1
xvals = np.arange(1,11)
yvals = xvals**2 # dummy nonlinear y-data
evals = escale*np.sqrt(yvals) # errorbars
X = sm.add_constant(xvals) # to get intercept term
model = sm.WLS(yvals, X, 1.0/evals**2) # weight inverse to error
result = model.fit()
print "escale = ", escale
print "fit = ", result.params
print "err = ", result.bse # same value independent of escale
So is this a numerical bug in statsmodels, or PEBKAC (i.e., is this correct behavior and my expectations are wrong)?
Suppose 'h' is a function of x, y, z and t and it gives us a graph line (t, h) (simulated). At the same time we also have an observed graph (observed values of h against t). How can I reduce the difference between the observed and simulated (t, h) graphs by optimizing the values of x, y and z? I want to change the simulated graph so that it comes closer and closer to the observed graph, in MATLAB or Python. In the literature I have read that people have done the same thing with the Levenberg-Marquardt algorithm, but I don't know how to do it.
You are actually trying to fit the parameters x,y,z of the parametrized function h(x,y,z;t).
MATLAB
You're right that in MATLAB you should either use lsqcurvefit of the Optimization toolbox, or fit of the Curve Fitting Toolbox (I prefer the latter).
Looking at the documentation of lsqcurvefit:
x = lsqcurvefit(fun,x0,xdata,ydata);
It says in the documentation that you have a model F(x,xdata) with coefficients x and sample points xdata, and a set of measured values ydata. The function returns the least-squares parameter set x, with which your function is closest to the measured values.
Fitting algorithms usually need starting points; some implementations can choose them randomly, but in the case of lsqcurvefit this is what x0 is for. If you have
h = @(x,y,z,t) ... %// actual function here
t_meas = ... %// actual measured times here
h_meas = ... %// actual measured data here
then in the conventions of lsqcurvefit,
fun <--> @(params,t) h(params(1),params(2),params(3),t)
x0 <--> starting guess for [x,y,z]: [x0,y0,z0]
xdata <--> t_meas
ydata <--> h_meas
Your function h(x,y,z,t) should be vectorized in t, such that for vector input in t the return value is the same size as t. Then the call to lsqcurvefit will give you the optimal set of parameters:
x = lsqcurvefit(@(params,t) h(params(1),params(2),params(3),t),[x0,y0,z0],t_meas,h_meas);
h_fit = h(x(1),x(2),x(3),t_meas); %// best guess from curve fitting
Python
In python, you'd have to use the scipy.optimize module, and something like scipy.optimize.curve_fit in particular. With the above conventions you need something along the lines of this:
import scipy.optimize as opt
popt, pcov = opt.curve_fit(lambda t,x,y,z: h(x,y,z,t), t_meas, h_meas, p0=[x0,y0,z0])
Note that the p0 starting array is optional, but all parameters will be set to 1 if it's missing. The result you need is the popt array, containing the optimal values for [x,y,z]:
x,y,z = popt
h_fit = h(x,y,z,t_meas)
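By default, curve_fit uses a Levenberg-Marquardt-type least-squares solver for unconstrained problems, so it matches what you read in the literature. Here is a fully self-contained sketch with a made-up model h(x, y, z; t) = x*exp(-y*t) + z and synthetic "observed" data; your real h and measurements would take their place:
import numpy as np
import scipy.optimize as opt

# Made-up model for illustration; replace with your actual h(x, y, z, t).
def h(x, y, z, t):
    return x * np.exp(-y * t) + z

# Synthetic "observed" data standing in for your measurements.
t_meas = np.linspace(0.0, 5.0, 50)
h_meas = h(2.0, 1.3, 0.5, t_meas) + 0.05 * np.random.randn(t_meas.size)

x0, y0, z0 = 1.0, 1.0, 0.0   # starting guesses for x, y, z
popt, pcov = opt.curve_fit(lambda t, x, y, z: h(x, y, z, t),
                           t_meas, h_meas, p0=[x0, y0, z0])

x, y, z = popt
h_fit = h(x, y, z, t_meas)   # simulated curve with the fitted parameters
print(popt)                  # optimal [x, y, z]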
I have a lot of x-y data points with errors on y that I need to fit non-linear functions to. Those functions can be linear in some cases, but are more usually exponential decay, gauss curves and so on. SciPy supports this kind of fitting with scipy.optimize.curve_fit, and I can also specify the weight of each point. This gives me weighted non-linear fitting which is great. From the results, I can extract the parameters and their respective errors.
There is just one caveat: the errors are only used as weights, but their absolute scale is not reflected in the resulting parameter errors. If I double the errors on all of my data points, I would expect the uncertainty of the result to increase as well. So I built a test case (source code) to test this.
Fit with scipy.optimize.curve_fit gives me:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
Same but with 2 * y_err:
Parameters: [ 1.99900756 2.99695535]
Errors: [ 0.00424833 0.00943236]
So you can see that the values are identical. This tells me that the algorithm does not take those into account, but I think the values should be different.
I read about another fit method here as well, so I tried to fit with scipy.odr as well:
Beta: [ 2.00538124 2.95000413]
Beta Std Error: [ 0.00652719 0.03870884]
Same but with 20 * y_err:
Beta: [ 2.00517894 2.9489472 ]
Beta Std Error: [ 0.00642428 0.03647149]
The values are slightly different, but I do not think that this accounts for the increase in the error at all. I think it is just rounding errors or a slightly different weighting.
Is there some package that allows me to fit the data and get the actual errors? I have the formulas here in a book, but I do not want to implement this myself if I do not have to.
I have now read about linfit.py in another question. This handles what I have in mind quite well. It supports both modes, and the first one is what I need.
Fit with linfit:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00772283 0.04449971]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.15445662 0.88999413]
Fit with linfit(relsigma=True):
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Same but with 20 * y_err:
Parameters: [ 2.02600849 2.91759066]
Errors: [ 0.00622595 0.03587451]
Should I answer my question or just close/delete it now?
One way that works well and actually gives a better result is the bootstrap method. When data points with errors are given, one uses a parametric bootstrap and lets each x and y value describe a Gaussian distribution. Then one draws a point from each of those distributions and obtains a new bootstrapped sample. Performing a simple unweighted fit gives one value for the parameters.
This process is repeated some 300 to a couple thousand times. One will end up with a distribution of the fit parameters where one can take mean and standard deviation to obtain value and error.
Another neat thing is that one does not obtain a single fit curve as a result, but lots of them. For each interpolated x value one can again take mean and standard deviation of the many values f(x, param) and obtain an error band:
Further steps in the analysis are then performed again hundreds of times with the various fit parameters. This will then also take into account the correlation of the fit parameters as one can see clearly in the plot above: Although a symmetric function was fitted to the data, the error band is asymmetric. This will mean that interpolated values on the left have a larger uncertainty than on the right.
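A minimal sketch of this parametric bootstrap (toy straight-line model and synthetic data, resampling only the y errors; every name here is a placeholder for your own model and measurements):
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b):
    return a * x + b

rng = np.random.RandomState(42)
x = np.linspace(0, 10, 30)
yerr = 0.5 * np.ones_like(x)
y = 2.0 * x + 3.0 + rng.normal(scale=yerr)

n_boot = 1000
boot_params = np.empty((n_boot, 2))
for i in range(n_boot):
    # Draw a new sample: each y value from a Gaussian centred on the measurement.
    y_boot = rng.normal(loc=y, scale=yerr)
    boot_params[i], _ = curve_fit(func, x, y_boot, p0=[1.0, 1.0])

# Parameter values and uncertainties from the bootstrap distribution.
print(boot_params.mean(axis=0))
print(boot_params.std(axis=0))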
Please note, from the documentation of curve_fit:
sigma : None or N-length sequence
If not None, this vector will be used as relative weights in the
least-squares problem.
The key point here is as relative weights; therefore, yerr in line 53 and 2*yerr in line 57 of your test script should give you a similar, if not the same, result.
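To see this concretely, here is a small self-contained sketch (synthetic straight-line data rather than your script) showing that scaling sigma by a constant leaves the reported parameter errors unchanged:
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b):
    return a * x + b

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
yerr = 0.5 * np.ones(50)
y = 2.0 * x + 3.0 + rng.normal(scale=yerr)

_, cov1 = curve_fit(model, x, y, sigma=yerr)
_, cov2 = curve_fit(model, x, y, sigma=2 * yerr)

# Identical parameter errors: sigma acts only as relative weights here.
print(np.sqrt(np.diag(cov1)))
print(np.sqrt(np.diag(cov2)))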
When you increase the actual residual error, you will see the values in the covariance matrix grow large. Say we change y += random to y += 5*random in the function generate_data():
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 1.92810458, 3.97843448]))
('Errors: ', array([ 0.09617346, 0.64127574]))
Compares to the original result:
Fit with scipy.optimize.curve_fit:
('Parameters:', array([ 2.00760386, 2.97817514]))
('Errors: ', array([ 0.00782591, 0.02983339]))
Also notice that the parameter estimates are now further off from (2, 3), as we would expect from the increased residual error and the larger confidence intervals of the parameter estimates.
Short answer
For absolute values that include uncertainty in y (and in x for odr case):
In the scipy.odr case use stddev = numpy.sqrt(numpy.diag(cov))
where cov is the covariance matrix odr gives in the output.
In the scipy.optimize.curve_fit case use the absolute_sigma=True flag.
For relative values (excludes uncertainty):
In the scipy.odr case use the sd value from the output.
In the scipy.optimize.curve_fit case use the absolute_sigma=False flag.
Use numpy.polyfit like this:
p, cov = numpy.polyfit(x, y, 1, cov=True)
errorbars = numpy.sqrt(numpy.diag(cov))
Long answer
There is some undocumented behavior in all of these functions. My guess is that the functions are mixing relative and absolute values. At the end of this answer is code that either gives what you want or doesn't, depending on how you process the output (is there a bug?). Also, curve_fit might have gotten the absolute_sigma flag only recently.
My point is in the output. It seems that odr calculates the standard deviation as if there were no uncertainties, similar to polyfit, but if the standard deviation is calculated from the covariance matrix, the uncertainties are there. curve_fit does this with the absolute_sigma=True flag. Below is the output containing:
the diagonal elements of the covariance matrix, cov(0,0) and cov(1,1),
the wrong way to get the standard deviations of the slope and the constant from the outputs, and
the right way to get the standard deviations of the slope and the constant from the outputs.
odr: 1.739631e-06 0.02302262 [ 0.00014863 0.0170987 ] [ 0.00131895 0.15173207]
curve_fit: 2.209469e-08 0.00029239 [ 0.00014864 0.01709943] [ 0.0004899 0.05635713]
polyfit: 2.232016e-08 0.00029537 [ 0.0001494 0.01718643]
Notice that odr and polyfit have exactly the same standard deviation. polyfit does not take the uncertainties as input, so odr apparently does not use the uncertainties when calculating the standard deviation either. The covariance matrix does use them: if, in the odr case, the standard deviation is calculated from the covariance matrix, the uncertainties are there, and they change if the uncertainty is increased. Fiddling with dy in the code below will show it.
I am writing this here mostly because it is important to know when finding error limits (and the Fortran ODRPACK guide that scipy refers to has some misleading information about this: the guide says the standard deviation should be the square root of the covariance matrix, but it is not).
import scipy.odr
import scipy.optimize
import numpy
x = numpy.arange(200)
y = x + 0.4*numpy.random.random(x.shape)
dy = 0.4
def stddev(cov): return numpy.sqrt(numpy.diag(cov))
def f(B, x): return B[0]*x + B[1]
linear = scipy.odr.Model(f)
mydata = scipy.odr.RealData(x, y, sy = dy)
myodr = scipy.odr.ODR(mydata, linear, beta0 = [1.0, 1.0], sstol = 1e-20, job=00000)
myoutput = myodr.run()
cov = myoutput.cov_beta
sd = myoutput.sd_beta
p = myoutput.beta
print 'odr: ', cov[0,0], cov[1,1], sd, stddev(cov)
p2, cov2 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma = dy,
                                    absolute_sigma = False,
                                    xtol = 1e-20)
p3, cov3 = scipy.optimize.curve_fit(lambda x, a, b: a*x + b,
                                    x, y, [1, 1],
                                    sigma = dy,
                                    absolute_sigma = True,
                                    xtol = 1e-20)
print 'curve_fit: ', cov2[0,0], cov2[1,1], stddev(cov2), stddev(cov3)
p, cov4 = numpy.polyfit(x, y, 1, cov=True)
print 'polyfit: ', cov4[0,0], cov4[1,1], stddev(cov4)