I'm teaching myself PyMC but got stuck with the following problem:
I have a model whose parameters should be determined from successive measurements. In the beginning the parameter's prior is uninformative, but should be updated after each measurement (i.e. replaced by the posterior). In short, I want to do sequential updating with PyMC.
Consider the following (somewhat constructed) example:
Measurement 1: 10 questions, 9 correct answers
Measurement 2: 5 questions, 3 correct answers
Of course, this can be solved analytically with beta/binomial conjugate priors, but this is not the point here :)
Alternatively, both measurements could be combined to n=15 and k=12. However, this is too simple. I want to take the hard way for educational purposes.
I found a solution in this answer, where new priors are sampled from the posterior. This is almost what I want, but sampling the prior feels a bit messy because the result depends on the number of samples and other settings.
My attempted solution puts both measurements, each with its own prior, in one model, like this:
n1, k1 = 10, 9
n2, k2 = 5, 3
theta1 = pymc.Beta('theta1', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)
theta2 = ? # should be the posterior of theta1
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)
How can I get the posterior of theta1 as the prior of theta2?
Is this even possible, or did I just demonstrate ultimate ignorance about Bayesian statistics?
The only way sequential updating works sensibly is in two different models. Specifying them in the same model does not make any sense, since we have no posteriors until after MCMC has completed.
In principle, you would examine the distribution of theta1 and specify a prior that best resembles it. In this simple conjugate case it is easy: with a Beta(1, 1) prior and 9 successes out of 10 trials, the posterior is Beta(1+9, 1+1), so it would be:
theta2 = pymc.Beta('theta2', alpha=10, beta=2)
since you don't need MCMC to determine what the posterior of theta1 is. More generally, you could fit a Beta distribution to the posterior samples, say using scipy.stats.beta.fit.
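To make that concrete, here is a minimal sketch of the two-model approach in PyMC2 syntax as in the question (the sampler settings are arbitrary; only pymc.MCMC, the trace, and scipy.stats.beta.fit are assumed):

import pymc
import scipy.stats

# model 1: uninformative prior, first measurement
n1, k1 = 10, 9
theta1 = pymc.Beta('theta1', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)
mcmc = pymc.MCMC([theta1, outcome1])
mcmc.sample(20000, burn=5000)

# fit a Beta to the posterior samples; floc/fscale pin the support to [0, 1]
a, b, _, _ = scipy.stats.beta.fit(mcmc.trace('theta1')[:], floc=0, fscale=1)

# model 2: the fitted Beta (close to Beta(10, 2) here) becomes the new prior
n2, k2 = 5, 3
theta2 = pymc.Beta('theta2', alpha=a, beta=b)
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)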
Related
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end gauge of actual (unknown) length at room temperature, and measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient, dT is the deviation from room temperature, and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty).
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
    # prior distributions
    theta = pm.Normal('theta', mu=-.1, sd=.04)
    alpha = pm.Normal('alpha', mu=.0000115, sd=.0000012)
    length = pm.Normal('length', mu=50, sd=1)
    mumeas = length*(1 + alpha*theta)

with basic_model:
    obs = pm.Normal('obs', mu=mumeas, sd=5.8e-6, observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000, step=step, start=start, njobs=1)

length_samples = trace['length']
fig, ax = plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label="posterior of $\lambda_1$", color="#A60628", normed=True)
I would really appreciate any help as to why this isn't working. I've been trying for a while and it never converges to the expected solution given by the R code. I tried the default sampler (NUTS, I think) as well as Metropolis, but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)

# Data
jags_data <- list(xbar=50.000215)

jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)

post_samples <- coda.samples(model = jags_code,
                             variable.names = c("l", "mu", "alpha", "theta"), #, "ypred"
                             n.iter = 30000)

summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
    l ~ dnorm(50, 1.0)
    alpha ~ dnorm(0.0000115, 694444444444)
    theta ~ dnorm(-0.1, 625)
    mu <- l*(1 + alpha*theta)
    xbar ~ dnorm(mu, 29726516052)
}
(note that dnorm in JAGS takes a precision, 1/sigma^2, as its second argument, hence the weird-looking numbers)
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to highly correlated posterior parameters. Add to that the extreme scale imbalances, and it makes sense that this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1 + alpha*dT))
    # only one input observation; tau corresponds to the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
[Images omitted: trace plots, forest plot, and pair plot.]
Here, however, you can see the core problem is still there: strong correlation between length and dT, plus a 4 or 5 order of magnitude scale difference between their standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
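For example, one place to start (a sketch only, not verified against the course result) is to sample standardized offsets and rescale them, so every free parameter the sampler sees is roughly N(0, 1):

import pymc3 as pm

with pm.Model() as m1:
    # standardized parameters, all roughly N(0, 1) from the sampler's view
    dT_z = pm.Normal('dT_z', mu=0.0, sd=1.0)
    alpha_z = pm.Normal('alpha_z', mu=0.0, sd=1.0)
    length_z = pm.Normal('length_z', mu=0.0, sd=1.0)

    # transform back to the original units (sd values derived from the taus above)
    dT = pm.Deterministic('dT', -0.1 + 0.04*dT_z)
    alpha = pm.Deterministic('alpha', 1.15e-5 + 1.2e-6*alpha_z)
    length = pm.Deterministic('length', 50.0 + 1.0*length_z)

    mu = pm.Deterministic('mu', length*(1 + alpha*dT))

    # same single summary-statistic observation as in the JAGS model
    obs = pm.Normal('obs', mu=mu, sd=5.8e-6, observed=[50.000215])

    trace = pm.sample(2000, tune=2000, chains=4)  # NUTS by default

Note this only rescales the priors; the tiny observation noise still makes the posterior geometry difficult, so it may need further tuning.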
After fitting a local level model using UnobservedComponents from statsmodels , we are trying to find ways to simulate new time series with the results. Something like:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
y[70:] += 10
plt.plot(X, label='X')
plt.plot(y, label='y')
plt.axvline(69, linestyle='--', color='k')
plt.legend();
ss = {}
ss["endog"] = y[:70]
ss["level"] = "llevel"
ss["exog"] = X[:70]
model = UnobservedComponents(**ss)
trained_model = model.fit()
Is it possible to use trained_model to simulate new time series given the exogenous variable X[70:]? Just as we have the arma_process.generate_sample(nsample=100), we were wondering if we could do something like:
trained_model.generate_random_series(nsample=100, exog=X[70:])
The motivation is to compute the probability of observing a time series as extreme as y[70:] (a p-value for whether the observed response is bigger than the predicted one).
[EDIT]
After reading Josef's and cfulton's comments, I tried implementing the following:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
But this resulted in simulations that don't seem to track the predicted_mean of the forecast with X_post as exog. Here's an example:
While y_post meanders around 100, the simulation is at -400. This approach always leads to a p-value of 50%.
So I then tried using initial_state=0 and random shocks; here's the result:
The simulations now seemed to follow the predicted mean and its 95% credible interval (although, as cfulton comments below, this is also a wrong approach, since it replaces the level variance of the trained model).
I tried using this approach just to see what p-values I'd observe. Here's how I compute the p-value:
samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=0,
                        state_shocks=np.random.normal(size=len(X_post)))
    r += sim.sum() >= y_post_sum
print(r / samples)
For context, this is the Causal Impact model developed by Google. As it's been implemented in R, we've been trying to replicate the implementation in Python using statsmodels as the core to process time series.
We already have a quite cool WIP implementation, but we still need the p-value to know whether we in fact had an impact that is not explained by mere randomness (the approach of simulating series and counting the ones whose summation surpasses y_post.sum() is also implemented in Google's model).
In my example I used y[70:] += 10. If I add just 1 instead of 10, Google's p-value computation returns 0.001 (there's an impact in y), whereas the Python approach returns 0.247 (no impact).
Only when I add +5 to y_post does the model return a p-value of 0.02; as it's lower than 0.05, we consider that there's an impact in y_post.
I'm using python3, statsmodels version 0.9.0
[EDIT2]
After reading cfulton's comments I decided to fully debug the code to see what was happening. Here's what I found:
When we create an object of type UnobservedComponents, the representation of the Kalman filter is eventually initialized. By default, it receives the parameter initial_variance as 1e6, which sets the same property on the object.
When we run the simulate method, initial_state_cov is created using this same value:
initial_state_cov = (
    np.eye(self.k_states, dtype=self.ssm.transition.dtype) *
    self.ssm.initial_variance
)
Finally, this same value is used to find initial_state:
initial_state = np.random.multivariate_normal(
    self._initial_state, self._initial_state_cov)
This results in a normal distribution with variance 1e6 (i.e., a standard deviation of 1000).
I then tried running the following:
mod1 = UnobservedComponents(np.zeros(len(X_post)), level='llevel', exog=X_post, initial_variance=1)
sim = mod1.simulate(f_model.params, len(X_post))
plt.plot(sim, label='simul')
plt.plot(y_post, label='y')
plt.legend();
print(sim.sum() > y_post.sum())
Which resulted in:
I then tested the p-value, and now, for a variation of +1 in y_post, the model correctly identifies the added signal.
However, when I tested with the same data that we have in R's Google package, the p-value was still off. Maybe it's a matter of further tweaking the input to increase its accuracy.
@Josef is correct, and you did the right thing with:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
The simulate method simulates data according to the model in question, which is why you can't directly use trained_model to simulate when you have exogenous variables.
But for some reason the simulations always ended up being lower than y_post.
I think this should be expected - running your example and looking at the estimated coefficients, we get:
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------------
sigma2.irregular 0.9278 0.194 4.794 0.000 0.548 1.307
sigma2.level 0.0021 0.008 0.270 0.787 -0.013 0.018
beta.x1 1.1882 0.058 20.347 0.000 1.074 1.303
The variance of the level is very small, which means it is extremely unlikely that the level would shift upwards by nearly 10 percent in a single period, based on the model you specified (the estimated level standard deviation is sqrt(0.0021) ≈ 0.046 per period, so a jump of 10 is hundreds of standard deviations away).
When you used:
mod1.simulate(f_model.params, len(X_post), initial_state=0,
              state_shocks=np.random.normal(size=len(X_post)))
what happened is that the level term is the only unobserved state here, and by providing your own shocks with a variance equal to 1, you essentially overrode the level variance actually estimated by the model. I don't think that setting the initial state to 0 has much of an effect here (see the edit below).
You write:
the p-value computation was closer, but still is not correct.
I'm not sure what this means - why would you expect the model to think such a jump was a likely occurrence? What p-value are you expecting to achieve?
Edit:
Thanks for investigating further (in Edit 2). First, what I think you should do is:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
initial_state = np.random.multivariate_normal(
    f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
Now, the explanation:
In Statsmodels 0.9, we didn't yet have exact treatment of states with a diffuse initialization (it has been merged in since then, though, and this is one reason that I wasn't able to replicate your results until I tested your example with the 0.9 codebase). These "initially diffuse" states don't have a long-run mean that we can solve for (e.g. a random walk process), and the state in the local level case is such a state.
The "approximate" diffuse initialization involves setting the initial state mean to zero and the initial state variance to a large number (as you discovered).
For simulations, the initial state is, by default, sampled from the given initial state distribution. Since this model is initialized with approximate diffuse initialization, that explains why your process was initialized around some random number.
Your solution is a good patch, but it's not optimal because it doesn't base the initial state for the simulated period on the last state from the estimated model / data. These values are given by f_model.predicted_state[..., -1] and f_model.predicted_state_cov[..., -1].
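Putting that together with the p-value loop from the question, the whole thing might look like this (a sketch; f_model, X_post, and y_post are assumed from the code above):

import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)

samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    # draw the simulation's initial state from the fitted model's last predicted state
    initial_state = np.random.multivariate_normal(
        f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
    r += sim.sum() >= y_post_sum
print(r / samples)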
I would like to implement a function in Python (using NumPy) that takes a mathematical function (e.g. p(x) = e^(-x), as below) as input and generates random numbers distributed according to that function's probability distribution. And I need to plot them, so we can see the distribution.
What I actually need is a random number generator for exactly the following two functions, but if it could take other functions, why not:
1) p(x) = e^(-x)
2) g(x) = (1/sqrt(2*pi)) * e^(-(x^2)/2)
Does anyone have any idea how this is doable in python?
For simple distributions like the ones you need, or if you have a CDF that is easy to invert in closed form, you can find plenty of samplers in NumPy, as correctly pointed out in Olivier's answer.
For arbitrary distributions you could use Markov chain Monte Carlo (MCMC) sampling methods.
The simplest, and maybe easiest to understand, variant of these algorithms is Metropolis sampling.
The basic idea goes like this:
start from a random point x and take a random step xnew = x + delta
evaluate the desired probability distribution at the starting point, p(x), and at the new one, p(xnew)
if the new point is more probable, p(xnew)/p(x) >= 1, accept the move
if the new point is less probable, randomly decide whether to accept or reject, depending on how probable [1] the new point is
take a new step from this point and repeat the cycle
It can be shown, see e.g. Sokal [2], that points sampled with this method follow the target probability distribution.
An extensive implementation of Monte Carlo methods in Python can be found in the PyMC3 package.
Example implementation
Here's a toy example just to show you the basic idea, not meant in any way as a reference implementation. Please refer to mature packages for any serious work.
import numpy as np

def uniform_proposal(x, delta=2.0):
    return np.random.uniform(x - delta, x + delta)

def metropolis_sampler(p, nsamples, proposal=uniform_proposal):
    x = 1  # start somewhere
    for i in range(nsamples):
        trial = proposal(x)  # random neighbour from the proposal distribution
        acceptance = p(trial)/p(x)
        # accept the move conditionally
        if np.random.uniform() < acceptance:
            x = trial
        yield x
Let's see if it works with some simple distributions
Gaussian mixture
def gaussian(x, mu, sigma):
    return 1./sigma/np.sqrt(2*np.pi)*np.exp(-((x-mu)**2)/2./sigma/sigma)

p = lambda x: gaussian(x, 1, 0.3) + gaussian(x, -1, 0.1) + gaussian(x, 3, 0.2)
samples = list(metropolis_sampler(p, 100000))
Cauchy
def cauchy(x, mu, gamma):
    return 1./(np.pi*gamma*(1.+((x-mu)/gamma)**2))

p = lambda x: cauchy(x, -2, 0.5)
samples = list(metropolis_sampler(p, 100000))
Arbitrary functions
You don't really have to sample from proper probability distributions. You might just have to enforce a limited domain in which to sample your random steps [3]
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(p, 100000, domain=(0, 10)))
p = lambda x: (np.sin(x)/x)**2
samples = list(metropolis_sampler(p, 100000, domain=(-4*np.pi, 4*np.pi)))
Conclusions
There is still way too much to say about proposal distributions, convergence, correlation, efficiency, applications, Bayesian formalism, other MCMC samplers, etc.
I don't think this is the proper place and there is plenty of much better material than what I could write here available online.
[1] The idea here is to favor exploration where the probability is higher, but still look at low-probability regions, as they might lead to other peaks. Fundamental is the choice of the proposal distribution, i.e. how you pick new points to explore. Steps that are too small might constrain you to a limited area of your distribution; too big, and exploration becomes very inefficient.
[2] Physics oriented. The Bayesian formalism (Metropolis-Hastings) is preferred these days, but IMHO it's a little harder to grasp for beginners. There are plenty of tutorials available online; see e.g. this one from Duke University.
[3] The implementation is not shown so as not to add too much confusion, but it's straightforward: you just have to wrap trial steps at the domain edges or make the desired function go to zero outside the domain.
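For instance, a minimal sketch of the second option, zeroing the target outside the domain (done here with a small wrapper rather than the domain keyword used above, so the sampler itself stays unchanged):

def bounded(p, domain):
    # return a version of p that is zero outside [lo, hi]
    lo, hi = domain
    def p_bounded(x):
        return p(x) if lo <= x <= hi else 0.0
    return p_bounded

# usage: the sampler starts at x = 1, inside the domain, so p(x) > 0 initially
p = lambda x: np.sqrt(x)
samples = list(metropolis_sampler(bounded(p, (0, 10)), 100000))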
NumPy offers a wide range of probability distributions.
The first function is an exponential distribution with parameter 1.
np.random.exponential(1)
The second one is a normal distribution with mean 0 and variance 1.
np.random.normal(0, 1)
Note that in both cases the arguments are optional, as these are the default values for these distributions.
As a sidenote, you can also find those distributions in the random module as random.expovariate and random.gauss respectively.
More general distributions
While NumPy will likely cover all your needs, remember that you can always compute the inverse cumulative distribution function of your distribution and feed it values from a uniform distribution.
inverse_cdf(np.random.uniform())
For example, if NumPy did not provide the exponential distribution, you could do this (note the 1 - u inside the log, which keeps the argument positive):
def exponential():
    return -np.log(1.0 - np.random.uniform())
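Since the question also asks to plot the result, a quick sanity check is to histogram the samples against the known density (a sketch):

import numpy as np
import matplotlib.pyplot as plt

samples = np.random.exponential(1, size=100000)
xs = np.linspace(0, 8, 200)
plt.hist(samples, bins=100, density=True, alpha=0.5, label='samples')
plt.plot(xs, np.exp(-xs), label='p(x) = e^(-x)')
plt.legend()
plt.show()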
If you encounter distributions whose CDF is not easy to invert in closed form, then consider filippo's great answer.
I randomly generated 1000 data points using weights I know to be true for the normal distribution. Now I am trying to minimize the negative log-likelihood function to estimate the values of sig^2 and the weights. I sort of get the process conceptually, but when I try to code it I'm just lost.
This is my model:
p(y|x, w, sig^2) = N(y|w0+w1x+...+wnx^n, sig^2)
I've been googling for a while now and I've learned that the scipy.optimize.minimize function is good for this, but I can't get it to work right. Every solution I have tried has worked for the example I got the solution from, but I'm unable to extrapolate it to my problem.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

x = np.linspace(0, 1000, num=1000)
data = []
for y in x:
    data.append(np.polyval([.5, 1, 3], y))

# plot to confirm I do have a normal distribution...
data.sort()
pdf = stats.norm.pdf(data, np.mean(data), np.std(data))
plt.plot(data, pdf)
plt.show()

# This is where I am stuck.
logLik = -np.sum(stats.norm.logpdf(data, loc=??, scale=??))
I have found that the equation error(w) = .5*sum((poly(x_n, w) - y_n)^2) is relevant for minimizing the error of the weights, which therefore maximizes the likelihood for the weights, but I don't understand how to code this... I have found a similar relationship for sig^2, but have the same problem. Can somebody clarify how to do this to help my curve fitting? Maybe go as far as to post pseudocode I can use?
Yes, implementing likelihood fitting with minimize is tricky; I spent a lot of time on it, which is why I wrapped it. If I may shamelessly plug my own package symfit, your problem can be solved by doing something like this:
from symfit import Parameter, Variable, Likelihood, exp
import numpy as np
# Define the model for an exponential distribution
beta = Parameter()
x = Variable()
model = (1 / beta) * exp(-x / beta)
# Draw 100 samples from an exponential distribution with beta=5.5
data = np.random.exponential(5.5, 100)
# Do the fitting!
fit = Likelihood(model, data)
fit_result = fit.execute()
I have to admit I don't exactly understand your distribution, since I don't understand the role of your w, but perhaps with this code as an example, you'll know how to adapt it.
If not, let me know the full mathematical equation of your model so I can help you further.
For more info check the docs. (For a more technical description of what happens under the hood, read here and here.)
I think there's an issue with your setup. With maximum likelihood, you obtain the parameters that maximize the probability of observing your data (given a certain model). Your model seems to be:
y_i = w0 + w1*x_i + ... + wn*x_i^n + epsilon_i
where epsilon_i is N(0, sig^2). So you maximize the likelihood:
L(w, sig^2) = prod_i N(y_i | w0 + w1*x_i + ... + wn*x_i^n, sig^2)
or equivalently take logs to get:
log L = sum_i log N(y_i | w0 + w1*x_i + ... + wn*x_i^n, sig^2)
The per-point log-density here is the log of the normal pdf, which you can get with stats.norm.logpdf. You should then use scipy.optimize.minimize on the negative of the summation of stats.norm.logpdf evaluated at each of the i points, from 1 to your sample size (minimizing the negative log-likelihood is the same as maximizing the likelihood).
If I've understood you correctly, your code is missing a y vector plus an x vector! Show us a sample of those vectors and I can update my answer to include sample code for estimating the MLE with that data.
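In the meantime, here is a minimal sketch of the procedure described above, with synthetic x and y (assumed, since yours aren't posted): build noisy polynomial data, then minimize the negative sum of stats.norm.logpdf over the weights and sigma.

import numpy as np
from scipy import optimize, stats

# synthetic data (assumed): y = 0.5*x^2 + 1*x + 3 plus Gaussian noise
np.random.seed(0)
x = np.linspace(0, 10, 1000)
y = np.polyval([0.5, 1, 3], x) + np.random.normal(scale=2.0, size=x.size)

def neg_log_lik(params):
    # last entry is log(sigma), so sigma stays positive; the rest are weights
    w, log_sigma = params[:-1], params[-1]
    mu = np.polyval(w, x)  # model mean: w0*x^2 + w1*x + w2 here
    return -np.sum(stats.norm.logpdf(y, loc=mu, scale=np.exp(log_sigma)))

# minimizing the negative log-likelihood == maximizing the likelihood
result = optimize.minimize(neg_log_lik, x0=[1.0, 1.0, 1.0, 0.0], method='Nelder-Mead')
w_hat, sigma_hat = result.x[:-1], np.exp(result.x[-1])
print(w_hat, sigma_hat)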
I have a question about the fit algorithms used in scipy. In my program, I have a set of x and y data points with y errors only, and want to fit a function
f(x) = (a[0] - a[1])/(1 + np.exp((x - a[2])/a[3])) + a[1]
to it.
The problem is that I get absurdly high errors on the parameters, and also different values and errors for the fit parameters, from the two scipy fit routines scipy.odr.ODR (with least-squares algorithm) and scipy.optimize.leastsq. I'll give my example:
Fit with scipy.odr.ODR, fit_type=2
Beta: [ 11.96765963 68.98892582 100.20926023 0.60793377]
Beta Std Error: [ 4.67560801e-01 3.37133614e+00 8.06031988e+04 4.90014367e+04]
Beta Covariance: [[ 3.49790629e-02 1.14441187e-02 -1.92963671e+02 1.17312104e+02]
[ 1.14441187e-02 1.81859542e+00 -5.93424196e+03 3.60765567e+03]
[ -1.92963671e+02 -5.93424196e+03 1.03952883e+09 -6.31965068e+08]
[ 1.17312104e+02 3.60765567e+03 -6.31965068e+08 3.84193143e+08]]
Residual Variance: 6.24982731975
Inverse Condition #: 1.61472215874e-08
Reason(s) for Halting:
Sum of squares convergence
and then the fit with scipy.optimize.leastsq:
Fit with scipy.optimize.leastsq
beta: [ 11.9671859 68.98445306 99.43252045 1.32131099]
Beta Std Error: [0.195503 1.384838 34.891521 45.950556]
Beta Covariance: [[ 3.82214235e-02 -1.05423284e-02 -1.99742825e+00 2.63681933e+00]
[ -1.05423284e-02 1.91777505e+00 1.27300761e+01 -1.67054172e+01]
[ -1.99742825e+00 1.27300761e+01 1.21741826e+03 -1.60328181e+03]
[ 2.63681933e+00 -1.67054172e+01 -1.60328181e+03 2.11145361e+03]]
Residual Variance: 6.24982904455 (calculated by me)
My point is the third fit parameter: the results are
scipy.odr.ODR, fit_type=2:
C = 100.209 +/- 80600
scipy.optimize.leastsq:
C = 99.432 +/- 12.730
I don't know why the first error is so much higher. Even better: If I put exactly the same data points with errors into Origin 9 I get
C = x0 = 99.41849 +/- 0.20283
and again exactly the same data into C++ ROOT (CERN):
C = 99.85 +/- 1.373
even though I used exactly the same initial values for ROOT and Python. Origin doesn't need any.
Do you have any clue why this happens and which is the best result?
I added the code for you at pastebin:
Data
C++ code
Python code: http://pastebin.com/jZVyzMkS
Thank you for helping!
EDIT: here's the plot related to SirJohnFranklin's post:
Did you actually try plotting the ODR and leastsq fits side by side? They look basically identical:
Consider what the parameters correspond to - the step function described by beta[0] and beta[1], the initial and final values, explains by far the majority of the variance in your data. By contrast, small changes in beta[2] and beta[3], the inflexion point and slope, will have comparatively little effect on the overall shape of the curve and therefore the residual variance for the fit. It's therefore no surprise that these parameters have high standard errors, and are fitted slightly differently by the two algorithms.
The overall greater standard errors reported by ODR are due to the fact that this model incorporates errors in the y-values whereas the ordinary least squares fit does not - errors in the measured y-values ought to reduce our confidence in the estimated fit parameters.
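For reference, the switch between the two regimes in scipy.odr is the fit_type argument; a sketch (x, y, sy, and the start values are assumed from your pastebin code):

import numpy as np
from scipy import odr

# logistic step model, as in the question
model = odr.Model(lambda a, x: (a[0] - a[1])/(1 + np.exp((x - a[2])/a[3])) + a[1])
data = odr.RealData(x, y, sy=sy)  # y errors only
job = odr.ODR(data, model, beta0=[12.0, 69.0, 100.0, 0.6])
job.set_job(fit_type=2)  # 2 = ordinary least squares; 0 = explicit ODR
out = job.run()
print(out.beta, out.sd_beta)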
(Sadly, I can't upload the fit, because I need more reputation. I'll give the plot to Captain Sandwich, so he can upload it for me.)
I'm in the same workgroup as the person who started the thread, but I did this plot.
So, I added x-errors on the data, because I hadn't gotten that far last time. The error obtained through the ODR is still absurdly high (4.18550164e+04 on beta[2]). In the plot, I show what the fit from ROOT (CERN) gives, now with x and y errors. Here, x0 is beta[2].
The red and the green curve have a different beta[2]: the left one minus the fit error of 3.430 obtained by ROOT, and the right one plus the error. I think this makes total sense, much more than the error of 0.2 given by the Origin 9 fit (which can only handle y-errors, I think), or the error of about 40k given by the ODR, which also includes x and y errors.
Maybe, because ROOT is mostly used by astrophysicists who need very robust fitting algorithms, it can handle much more difficult fits, but I don't know enough about the robustness of fitting algorithms.