I have a question about the fit algorithms used in scipy. In my program, I have a set of x and y data points with y errors only, and want to fit a function
f(x) = (a[0] - a[1])/(1 + np.exp((x - a[2])/a[3])) + a[1]
to it.
The problem is that I get absurdly high errors on the parameters, and also different values and errors for the fit parameters, using the two scipy fit routines scipy.odr.ODR (with the least-squares fit type) and scipy.optimize.leastsq. Here is my example:
Fit with scipy.odr.ODR, fit_type=2
Beta: [ 11.96765963 68.98892582 100.20926023 0.60793377]
Beta Std Error: [ 4.67560801e-01 3.37133614e+00 8.06031988e+04 4.90014367e+04]
Beta Covariance: [[ 3.49790629e-02 1.14441187e-02 -1.92963671e+02 1.17312104e+02]
[ 1.14441187e-02 1.81859542e+00 -5.93424196e+03 3.60765567e+03]
[ -1.92963671e+02 -5.93424196e+03 1.03952883e+09 -6.31965068e+08]
[ 1.17312104e+02 3.60765567e+03 -6.31965068e+08 3.84193143e+08]]
Residual Variance: 6.24982731975
Inverse Condition #: 1.61472215874e-08
Reason(s) for Halting:
Sum of squares convergence
and then the fit with scipy.optimize.leastsq:
Fit with scipy.optimize.leastsq
beta: [ 11.9671859 68.98445306 99.43252045 1.32131099]
Beta Std Error: [0.195503 1.384838 34.891521 45.950556]
Beta Covariance: [[ 3.82214235e-02 -1.05423284e-02 -1.99742825e+00 2.63681933e+00]
[ -1.05423284e-02 1.91777505e+00 1.27300761e+01 -1.67054172e+01]
[ -1.99742825e+00 1.27300761e+01 1.21741826e+03 -1.60328181e+03]
[ 2.63681933e+00 -1.67054172e+01 -1.60328181e+03 2.11145361e+03]]
Residual Variance: 6.24982904455 (calculated by me)
My point is the third fit parameter: the results are
scipy.odr.ODR, fit_type=2:
C = 100.209 +/- 80600
scipy.optimize.leastsq:
C = 99.432 +/- 12.730
I don't know why the first error is so much higher. Even better: if I put exactly the same data points with errors into Origin 9, I get
C = x0 = 99.41849 +/- 0.20283
and putting again exactly the same data into C++ ROOT (CERN) gives
C = 99.85 +/- 1.373
even though I used exactly the same initial values for ROOT and Python. Origin doesn't need any.
Do you have any clue why this happens and which is the best result?
I added the code for you at pastebin:
Data
C++ code
Python code: http://pastebin.com/jZVyzMkS
Thank you for helping!
EDIT: here's the plot related to SirJohnFranklin's post:
Did you actually try plotting the ODR and leastsq fits side by side? They look basically identical:
Consider what the parameters correspond to - the step function described by beta[0] and beta[1], the initial and final values, explains by far the majority of the variance in your data. By contrast, small changes in beta[2] and beta[3], the inflexion point and slope, will have comparatively little effect on the overall shape of the curve and therefore the residual variance for the fit. It's therefore no surprise that these parameters have high standard errors, and are fitted slightly differently by the two algorithms.
The overall greater standard errors reported by ODR are due to the fact that this model incorporates errors in the y-values whereas the ordinary least squares fit does not - errors in the measured y-values ought to reduce our confidence in the estimated fit parameters.
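For anyone comparing the two routines, here is a minimal sketch (assuming NumPy arrays x, y and y_err already hold your data points and y-errors; the initial guess is only a rough value) of how the sigmoid can be fitted with both scipy.odr and scipy.optimize.leastsq. Scaling the leastsq covariance by the residual variance is the step that is easiest to get wrong:

import numpy as np
from scipy import odr, optimize

def f(a, x):
    return (a[0] - a[1]) / (1 + np.exp((x - a[2]) / a[3])) + a[1]

a0 = [12.0, 69.0, 100.0, 1.0]                       # rough initial guess

# ODR with fit_type=2 (ordinary least squares, y-errors used as weights)
out = odr.ODR(odr.RealData(x, y, sy=y_err), odr.Model(f), beta0=a0)
out.set_job(fit_type=2)
odr_result = out.run()
print(odr_result.beta, odr_result.sd_beta)

# leastsq on the weighted residuals
residuals = lambda a: (f(a, x) - y) / y_err
popt, cov_x, info, msg, ier = optimize.leastsq(residuals, a0, full_output=True)
dof = len(x) - len(popt)
cov = cov_x * (residuals(popt)**2).sum() / dof      # scale by residual variance
print(popt, np.sqrt(np.diag(cov)))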
(Sadly, I can't upload the plot of the fit, because I need more reputation. I'll give the plot to Captain Sandwich, so he can upload it for me.)
I'm in the same workgroup as the person who started the thread, but I did this plot.
So, I added x-errors to the data, because I hadn't gotten that far last time. The error obtained through ODR is still absurdly high (4.18550164e+04 on beta[2]). In the plot, I show what the fit from ROOT (CERN) gives, now with x and y errors. Here, x0 is beta[2].
The red and the green curve have a different beta[2]: the left one is minus the fit error of 3.430 obtained by ROOT, and the right one is plus that error. I think this makes total sense, much more than the error of 0.2 given by the Origin 9 fit (which, I think, can only handle y-errors) or the error of about 40k given by ODR, which also includes x and y errors.
Maybe, because ROOT is mostly used by astrophysicists who need very robust fitting algorithms, it can handle much more difficult fits, but I don't know enough about the robustness of fitting algorithms.
Related
I'm using scipy.stats.linregress to fit a group of points as seen in the plot below. The points are the blue circles, the linear fit is the black line and the grey lines are samples taken using the stderr and intercept_stderr values to sample the slope and intercept values using numpy.random.normal (code below).
My question is: given that stderr and intercept_stderr are standard errors and numpy.random.normal expects standard deviations, should I multiply stderr and intercept_stderr by $\sqrt{N}$ when sampling?
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
x = np.array([-4.12078708, -3.89764352, -3.77248038, -3.66125475, -3.56117129,
-3.47019951, -3.3868179 , -3.30985686, -3.2383979 , -3.17170652,
-3.10918616, -3.05034566, -2.99477581, -2.94213208, -2.89212166,
-2.84449361, -2.79903124, -2.75554612, -2.71387343, -2.67386809,
-2.63540181, -2.59836054, -2.56264246, -2.52815628, -2.49481986,
-2.462559 , -2.43130646, -2.40100111, -2.37158722, -2.34301385,
-2.31523428, -2.28820561, -2.2618883 , -2.23624587, -2.21124457,
-2.18685312, -2.16304247, -2.13978561, -2.11705736, -2.09483422,
-2.07309423, -2.05181683, -2.03098275, -2.01057388])
y = np.array([10.54683181, 10.37020828, 10.93819231, 10.1338195 , 10.68036321,
10.48930797, 10.2340761 , 10.52002056, 10.20343913, 10.29089844,
10.36190947, 10.26050936, 10.36528216, 10.41799894, 10.40077834,
10.2513676 , 10.30768792, 10.49377725, 9.73298189, 10.1158334 ,
10.29359023, 10.38660209, 10.30087358, 10.49464606, 10.23305099,
10.34389097, 10.29016557, 10.0865885 , 10.338077 , 10.34950896,
10.15110388, 10.33316701, 10.22837808, 10.3848174 , 10.56872297,
10.24457621, 10.48255182, 10.39029786, 10.0208671 , 10.17400544,
9.82086658, 10.51361151, 10.4376062 , 10.18610696])
res = stats.linregress(x, y)
s_vals = np.random.normal(res.slope, res.stderr, 100)
i_vals = np.random.normal(res.intercept, res.intercept_stderr, 100)
for i in range(100):
    plt.plot(x, i_vals[i] + s_vals[i]*x, c='grey', alpha=.1)
plt.scatter(x, y)
plt.plot(x, res.intercept + res.slope*x, c='k')
plt.show()
TL;DR
Indeed, it is an estimate of the standard deviation, since the standard error of a parameter is the standard deviation of that parameter's sampling distribution. No, there is no need to "de-normalize" by multiplying by np.sqrt(n). Finally, however, you might want to change the distribution from which you sample the simulated parameters to a t-distribution.
Qualitative explanation
No further multiplication (e.g. by np.sqrt(n)) is needed, i.e. the normalization stays in place. Why is that? Intuitively speaking, the slope and intercept are, in a sense, summary statistics of a dataset consisting of pairs (x, y): they characterize the dataset as a whole rather than a single pair of points (x_i, y_i). Just as for a sampled summary statistic (e.g. the mean of x), we use a normalized estimator of its standard deviation. In the case of a regression, the average variability of all datapoints in the dataset feeds into the estimated variability of the resulting intercept; the square root of the sample size merely balances the summed variability across the datapoints against their absolute number.
A more rigorous explanation concerns the variance-covariance matrix of the estimator $\hat\beta$. The square roots of its diagonal elements are the standard errors of the elements of the estimator; in particular, the square root of the first diagonal element is the standard error of the intercept. With a bit of linear algebra, one can connect each parameter's standard error (in your case, those of the intercept and slope) with the standard error of the regression model. Since the standard error of the regression model s is an asymptotically unbiased estimate of the standard deviation of the noise σ, no re-scaling of the intercept's standard error is required.
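To make this concrete, here is a small cross-check using the x, y and res from the question (the explicit design-matrix construction below is my own illustration, not part of linregress): build s^2 (X'X)^-1 by hand and compare its diagonal to the standard errors linregress reports.

import numpy as np

X = np.column_stack([np.ones_like(x), x])        # design matrix: intercept column, then x
resid = y - (res.intercept + res.slope * x)
s2 = np.sum(resid**2) / (len(x) - 2)             # s^2: unbiased estimate of the noise variance
cov = s2 * np.linalg.inv(X.T @ X)                # variance-covariance matrix of (intercept, slope)

print(np.sqrt(cov[0, 0]), res.intercept_stderr)  # should agree
print(np.sqrt(cov[1, 1]), res.stderr)            # should agree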
Regarding the distributions from which you sample/simulate the intercept and slope: rather than a Normal distribution, the estimates standardized by their standard errors follow a (Student's) t-distribution with n-2 degrees of freedom. See slide 18. In turn,
s_vals = np.random.standard_t(df=len(x)-2, size=100) * res.stderr + res.slope
i_vals = np.random.standard_t(df=len(x)-2, size=100) * res.intercept_stderr + res.intercept
However, with sample sizes beyond n=30, the realizations will be almost statistically indistinguishable from those sampled from a Gaussian distribution. This is because the t-distribution converges to the standard normal distribution rather quickly.
Visual explanation
We can skip the quantitative arguments, though. What do we expect from estimators based on datasets? The more data we have, the more certain we are about the fixed but unknown true parameters. In turn, if we increase the size n of the dataset, the simulated grey lines should move closer together. This is what happens when we use the standard error as the scale parameter: increasing the sample size by a factor of 14 brings the grey lines closer. Using the standard error multiplied by np.sqrt(n) instead leaves the grey lines equally far apart even when the dataset size is drastically increased; in fact, multiplying by the square root of n exactly undoes the advantage of a higher sample size.
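To see this scaling without collecting new data, here is a small sketch that simply tiles the question's points to mimic a larger sample:

import numpy as np
from scipy import stats

res_small = stats.linregress(x, y)
x_big, y_big = np.tile(x, 14), np.tile(y, 14)    # 14x more points, same spread
res_big = stats.linregress(x_big, y_big)

print(res_small.stderr, res_big.stderr)          # stderr shrinks by roughly sqrt(14)
print(res_small.stderr / np.sqrt(14))            # ~ res_big.stderr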
import numpy as np
from scipy import optimize

def sigmoid_function(x1, k, xo, a, c):
    return (a / (1 + np.exp(-k*(x1 - xo)))) + c
x_data=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35]
y_data =[0.08965066,0.08990541,0.09007396,0.09013885,0.09021248,0.09038204,
0.09044601,0.09062396,0.09074469,0.09097924,0.09101625,0.09110833,
0.09130073,0.09153685,0.09165991,0.09189038,0.09236043,0.09329333,
0.09470363,0.09750811,0.10305867,0.11295684,0.12767181,0.14647349,
0.16744916,0.18869261,0.20908784,0.22828775,0.2459888 ,0.262817,
0.27898482,0.29499955,0.31033699,0.32526762,0.33972489]
result,covariance= optimize.curve_fit(sigmoid_function,x_data,y_data, maxfev=10000)
Curve with exact data
Curve fit result
I am new to ML. Please let me know if I can change any parameters in curve_fit().
If you look at the optimization at the scale of your observations, then it appears that the optimization function is not working very well.
But if you zoom out and look at the scale of the optimization function, things look quite different.
When no bounds or optimization method are provided to curve_fit, it uses Levenberg-Marquardt, which can fail to find the global solution.
In cases with only one minimum, an uninformed standard guess like β = (1, 1, …, 1) will work fine; in cases with multiple minima, the algorithm converges to the global minimum only if the initial guess is already somewhat close to the final solution.
What you are seeing is the optimization falling into a local minimum. As the quote above says, you can get around this by providing initial parameters that are closer to the solution, so the optimization can avoid that trap, for instance by doing this:
p0 = [0.1, 0.1, 0.1, 0.1]
result, covariance = optimize.curve_fit(sigmoid_function, x_data, y_data, p0)
the optimization behaves as you expect:
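If you also want to constrain the parameters, curve_fit accepts bounds; any finite bounds make it switch from Levenberg-Marquardt to the 'trf' solver. A hedged sketch (the bounds and starting guess below are only illustrative values for k, xo, a and c, chosen by eye from the data):

p0 = [0.1, 25.0, 0.3, 0.1]                       # illustrative starting guess
lower = [0.0, 0.0, 0.0, 0.0]
upper = [10.0, 40.0, 1.0, 1.0]
result, covariance = optimize.curve_fit(sigmoid_function, x_data, y_data,
                                        p0=p0, bounds=(lower, upper))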
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end gauge of actual (unknown) length at room temperature, and a measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient, dT is the deviation from room temperature, and m is the measured quantity. The goal is to find the posterior distribution of length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty).
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
    # prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1 + alpha*theta)
with basic_model:
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)
length_samples = trace['length']
fig, ax = plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn't working. I've been trying for a while and it never converges to the expected solution given by the R code. I tried the default sampler (NUTS, I think) as well as Metropolis, but that completely failed with a zero-gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
data = jags_data,
n.chains = 1,
n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
variable.names =
c("l","mu","alpha","theta"),#,"ypred"),
n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
  l ~ dnorm(50, 1.0)
  alpha ~ dnorm(0.0000115, 694444444444)
  theta ~ dnorm(-0.1, 625)
  mu <- l*(1 + alpha*theta)
  xbar ~ dnorm(mu, 29726516052)
}
(Note: I think dnorm in JAGS takes the precision 1/sigma^2 as its second argument, hence the weird-looking numbers.)
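A quick sanity check of those precisions against the standard deviations quoted above (tau = 1/sigma^2):

print(1 / 0.04**2)         # 625            (theta / dT prior)
print(1 / 0.0000012**2)    # ~6.94e11       (alpha prior)
print(1 / 5.8e-6**2)       # ~2.97e10       (likelihood for xbar)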
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model, which can lead to highly correlated posterior parameters. Add to that the extreme scale imbalances, and it makes sense that this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1 + alpha*dT))
    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there: there's a strong correlation between length and dT, plus 4 or 5 orders of magnitude difference in scale between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue: somehow we need to reparameterize so that all the tau values are closer to 1, run the sampler, and then transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but a rough starting point is sketched below.
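A rough, untested sketch of such a reparameterization: each prior is sampled on a unit scale and transformed back through a Deterministic, with the offsets and scales read off the tau values above. Note that the tiny observation scale still concentrates the posterior of length_z far below its prior scale, so further rescaling may be needed:

with pm.Model() as m1:
    # unit-scale latent variables
    dT_z = pm.Normal('dT_z', mu=0.0, sd=1.0)
    alpha_z = pm.Normal('alpha_z', mu=0.0, sd=1.0)
    length_z = pm.Normal('length_z', mu=0.0, sd=1.0)

    # transform back to physical units (means/sds taken from the original priors)
    dT = pm.Deterministic('dT', -0.1 + 0.04 * dT_z)
    alpha = pm.Deterministic('alpha', 1.15e-5 + 1.2e-6 * alpha_z)
    length = pm.Deterministic('length', 50.0 + 1.0 * length_z)

    mu = pm.Deterministic('mu', length * (1 + alpha * dT))
    obs = pm.Normal('obs', mu=mu, sd=5.8e-6, observed=[50.000215])

    trace = pm.sample(2000, tune=2000, chains=4, cores=4)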
When creating a line of best fit with numpy's polyfit, you can set the parameter full to True. This returns 4 extra values, apart from the coefficients. What do these values mean, and what do they tell me about how well the function fits my data?
https://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html
What I'm doing is:
bestFit = np.polyfit(x_data, y_data, deg=1, full=True)
and I get the result:
(array([0.00062008, 0.00328837]), array([0.00323329]), 2,
 array([1.30236506, 0.55122159]), 1.1102230246251565e-15)
The documentation says that the four extra pieces of information are: residuals, rank, singular_values, and rcond.
Edit:
I am looking for a further explanation of how rcond and singular_values describe the goodness of fit.
Thank you!
how rcond and singular_values describe the goodness of fit.
Short answer: they don't.
They do not describe how well the polynomial fits the data; that is what the residuals are for. They describe how numerically robust the computation of that polynomial was.
rcond
The value of rcond is not really about the quality of fit; it describes the process by which the fit was obtained, namely a least-squares solution of a linear system. Most of the time the user of polyfit does not provide this parameter, so a suitable value is picked by polyfit itself and then returned to the user for information.
rcond is used for truncation in ill-conditioned matrices. The least-squares solver does two things:
Finds x that minimizes the norm of residuals Ax-b
If multiple x achieve this minimum, it returns the x with the smallest norm among those.
The second clause kicks in when some changes of x do not affect the right-hand side at all. But since floating-point computations are imperfect, what usually happens is that some changes of x affect the right-hand side very little, and this is where rcond is used to decide when "very little" should be treated as "zero up to noise".
For example, consider the system
x1 = 1
x1 + 0.0000000001 * x2 = 2
This one can be solved exactly: x1 = 1 and x2 = 10000000000. But that tiny coefficient (which, in reality, came out of some matrix manipulations) has some numeric error in it; for all we know it could be negative, or zero. Should we let it have such a huge influence on the solution?
So, in such a situation the matrix (specifically its singular values) gets truncated at level rcond. This leaves
x1 = 1
x1 = 2
for which the least-squares solution is x1 = 1.5, x2 = 0. Note that this solution is robust: no huge numbers from tiny fluctuations of coefficients.
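The same truncation can be reproduced with numpy's own least-squares solver (a small illustration; the rcond semantics below assume NumPy 1.14 or later):

import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1e-10]])
b = np.array([1.0, 2.0])

exact, *_ = np.linalg.lstsq(A, b, rcond=None)    # tiny singular value kept
robust, *_ = np.linalg.lstsq(A, b, rcond=1e-6)   # tiny singular value truncated
print(exact)     # ~ [1.0, 1.0e10]
print(robust)    # ~ [1.5, 0.0]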
Singular values
When one solves a linear system Ax = b in the least squares sense, the singular values of A determine how numerically tricky this is. Specifically, large disparity between largest and smallest singular values is problematic: such systems are ill-conditioned. An example is
0.835*x1 + 0.667*x2 = 0.168
0.333*x1 + 0.266*x2 = 0.067
The exact solution is (1, -1). But if the right-hand side is changed from 0.067 to 0.066, the solution becomes (-666, 834) -- totally different. The problem is that the singular values of A are (roughly) 1 and 1e-6; this magnifies any change on the right-hand side by a factor of about 1e6.
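You can verify that sensitivity directly (illustration only):

import numpy as np

A = np.array([[0.835, 0.667],
              [0.333, 0.266]])
print(np.linalg.svd(A, compute_uv=False))    # ~ [1.15e+00, 8.7e-07]
print(np.linalg.solve(A, [0.168, 0.067]))    # ~ [ 1., -1.]
print(np.linalg.solve(A, [0.168, 0.066]))    # ~ [-666., 834.]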
Unfortunately, polynomial fitting often results in ill-conditioned matrices. For example, fitting a polynomial of degree 24 to 25 equally spaced data points is inadvisable.
import numpy as np
x = np.arange(25)
np.polyfit(x, x, 24, full=True)
The singular values are
array([4.68696731e+00, 1.55044718e+00, 7.17264545e-01, 3.14298605e-01,
1.16528492e-01, 3.84141241e-02, 1.15530672e-02, 3.20120674e-03,
8.20608411e-04, 1.94870760e-04, 4.28461687e-05, 8.70404409e-06,
1.62785983e-06, 2.78844775e-07, 4.34463936e-08, 6.10212689e-09,
7.63709211e-10, 8.39231664e-11, 7.94539407e-12, 6.32326226e-13,
4.09332903e-14, 2.05501534e-15, 7.55397827e-17, 4.81104905e-18,
8.98275758e-20])
which, with the default value of rcond (5.55e-15 here), gets four of them truncated to 0.
The difference in magnitude between smallest and largest singular values indicates that perturbing the y-values by numbers of size 1e-15 can result in changes of about 1 to the coefficients. (Not every perturbation will do that, just some that happen to align with a singular vector for a small singular value).
Rank
The effective rank is just the number of singular values above the rcond threshold. In the example above it's 21. This means that even though the fit is over 25 points and yields a polynomial with 25 coefficients, there are only 21 degrees of freedom in the solution.
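Tying the three values together for the degree-24 example above (the returned rank should equal the number of singular values that survive the rcond cutoff):

coeffs, residuals, rank, sv, rcond = np.polyfit(x, x, 24, full=True)
print(rank)                              # 21
print(np.sum(sv > rcond * sv.max()))     # 21, computed from sv and rcond by hand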
I am following the Orthogonal distance regression method to fit data with errors on both the dependent and independent variables.
I am fitting the data with a simple straight line, my model is y = ax + b.
Now I am able to write the code and plot the line fitting the data, but I am NOT able to interpret the results:
Beta: [ 2.08346947 0.0024333 ]
Beta Std Error: [ 0.03654482 0.00279946]
Beta Covariance: [[ 2.06089823e-03 -9.99220260e-05]
[ -9.99220260e-05 1.20935366e-05]]
Residual Variance: 0.648029925546
Inverse Condition #: 0.011825289654
Reason(s) for Halting:
Sum of squares convergence
The Beta is just the array containing the values of the parameters of my model (a, b), and Beta Std Error, the associated errors.
Regarding the other values, I don't know their meaning.
In particular, I would like to know which one is indicative of the goodness of fit, something like the chi-square one gets when fitting with errors only on the dependent variable.
Beta Covariance is the covariance matrix of your fitted parameters. It can be thought of as a matrix describing how inter-connected your two parameters are with respect to both themselves and each other.
Residual Variance is, I believe, a measure of the goodness of fit: the smaller the value, the better the fit to your data.
Inverse Condition is the inverse (1/x) of the condition number. The condition number defines how sensitive your fitted function is to changes in the input.
scipy.odr is a wrapper around a much older FORTRAN-77 package known as ODRPACK. The documentation for ODRPACK can actually be found on the scipy website; it may help you in understanding what you need to know, as it contains the mathematical descriptions of these parameters.
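As a rough illustration (assuming output is the Output object returned by ODR(...).run()), the printed report maps onto these attributes. Note that in scipy.odr the standard errors are already scaled by the residual variance, i.e. sd_beta**2 == diag(cov_beta) * res_var, which you can verify with your own numbers above (e.g. sqrt(2.06e-3 * 0.648) ~ 0.0365):

import numpy as np

print(output.beta)          # fitted parameters (a, b)
print(output.sd_beta)       # their standard errors
print(output.cov_beta)      # parameter covariance matrix (unscaled)
print(output.res_var)       # residual variance = weighted sum of squares / (N - p),
                            # i.e. the reduced chi-square when sx/sy are the measurement errors
print(output.sum_square)    # total weighted sum of squares (the chi-square itself)
print(output.inv_condnum)   # inverse condition number
print(np.allclose(output.sd_beta**2, np.diag(output.cov_beta) * output.res_var))  # True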