I am trying to fit some sample data on a semilogy plot with scipy's curve_fit function. The best-fit curve looks fine with the code below, but I am having trouble with the 2-sigma curves, which I want to show together with the best-fit curve, with the band between them filled in grey. My code looks like the following:
import sys
import os
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
import scipy.optimize as optimization
M = np.array([-2, -1, 0, 1, 2, 3,4])
Y_z = np.array([0.05, 0.2, 3, 8, 50, 344, 2400 ])
# curve fit linear function
def line(x, a, b):
    return a*x + b
popt, pcov = curve_fit(line, M, np.log10(Y_z)) # change here
# plotting
plt.semilogy(M , Y_z, 'o')
plt.semilogy(M, 10**line(M, popt[0], popt[1]), ':', label = 'curve-fit')
# plot 1 sigma -error
y1 = 10**(line(M, popt[0] + pcov[0,0]**0.5, popt[1] - pcov[1,1]**0.5))
y2 = 10**(line(M, popt[0] - pcov[0,0]**0.5, popt[1] + pcov[1,1]**0.5))
plt.semilogy(M, y1, ':')
plt.semilogy(M, y2, ':')
plt.fill_between(M, y1, y2, facecolor="gray", alpha=0.15)
plt.xlabel(r"$\log X$")
plt.ylabel('Y')
plt.legend()
plt.show()
Any help with the 2-sigma (variance) curves would be much appreciated.
In principle, a linear fit doesn't need non-linear least-squares curve-fitting at all: linear regression should work.
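For illustration, a minimal sketch of that simpler route (my own addition, not part of the original answer) would be an ordinary least-squares line through log10(Y_z) with numpy.polyfit:

import numpy as np

M = np.array([-2, -1, 0, 1, 2, 3, 4])
Y_z = np.array([0.05, 0.2, 3, 8, 50, 344, 2400])

# Plain linear regression of log10(Y_z) against M; cov=True also returns
# the covariance matrix of the fitted slope and intercept.
coeffs, cov = np.polyfit(M, np.log10(Y_z), deg=1, cov=True)
slope, intercept = coeffs
print(slope, intercept, np.sqrt(np.diag(cov)))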
That said, to address your questions, you might find lmfit (http://lmfit.github.io/lmfit-py/) useful here. It has a slightly higher-level and slightly more Pythonic approach to curve-fitting, and adds many features. One of these is calculating the uncertainty in the result for a selected value of sigma.
To do your fit with lmfit, it would look like
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as optimization
import lmfit
M = np.array([-2, -1, 0, 1, 2, 3,4])
Y_z = np.array([0.05, 0.2, 3, 8, 50, 344, 2400 ])
# curve fit linear function
def line(x, a, b):
    return a*x + b
# set up model and create parameters from model function
# note that function argument names are used for parameters
model = lmfit.Model(line)
params = model.make_params(a=1, b=0)
result = model.fit(np.log10(Y_z), params, x=M)
print(result.fit_report())
which will print out a report about the fit like this:
[[Model]]
Model(line)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 8
# data points = 7
# variables = 2
chi-square = 0.10468256
reduced chi-square = 0.02093651
Akaike info crit = -25.4191304
Bayesian info crit = -25.5273101
[[Variables]]
a: 0.77630819 +/- 0.02734470 (3.52%) (init = 1)
b: 0.22311337 +/- 0.06114460 (27.41%) (init = 0)
[[Correlations]] (unreported correlations are < 0.100)
C(a, b) = -0.447
You can calculate the 2-sigma uncertainty in the best-fit result as
# calculate 2-sigma uncertainty in result
del2 = result.eval_uncertainty(sigma=2, x=M)
and then use this and the fit results to plot the results (slightly modified from your form):
plt.plot(M, np.log10(Y_z), 'o', label='data')
plt.plot(M, result.best_fit, ':', label = 'curve-fit')
plt.fill_between(M, result.best_fit-del2, result.best_fit+del2, facecolor="grey", alpha=0.15)
plt.xlabel(r"$\log X$")
plt.ylabel('Y')
plt.legend()
plt.show()
which should produce a plot of the data, the best-fit line, and the grey 2-sigma band.
hope that helps.
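If you would rather stay with plain scipy, a minimal sketch (my own addition, not part of the answer above) of propagating the full covariance matrix from curve_fit, including the a-b correlation, into a 2-sigma band around the fitted line could look like this:

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

M = np.array([-2, -1, 0, 1, 2, 3, 4])
Y_z = np.array([0.05, 0.2, 3, 8, 50, 344, 2400])

def line(x, a, b):
    return a*x + b

popt, pcov = curve_fit(line, M, np.log10(Y_z))

# Variance of the predicted log10(y): var = x^2*var(a) + var(b) + 2*x*cov(a,b)
var_log = M**2 * pcov[0, 0] + pcov[1, 1] + 2 * M * pcov[0, 1]
band = 2 * np.sqrt(var_log)   # 2-sigma width in log10 space

best = line(M, *popt)
plt.semilogy(M, Y_z, 'o')
plt.semilogy(M, 10**best, ':', label='curve-fit')
plt.fill_between(M, 10**(best - band), 10**(best + band),
                 facecolor='gray', alpha=0.15)
plt.legend()
plt.show()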
I am trying to fit a power law to some data that follow a power law with noise, displayed on a log-log scale:
The fit with scipy curve_fit is the orange line and the red line is the noiseless power law.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def powerFit(x, A, B):
    return A * x**(-B)
x = np.logspace(np.log10(1),np.log10(1e5),30)
A = 1
B = 2
np.random.seed(1)
y_noise = np.random.normal(1, 0.2, size=x.size)
y = powerFit(x, A, B) * y_noise
plt.plot(x, y, 'o')
plt.xscale('log')
plt.yscale('log')
popt, pcov = curve_fit(powerFit, x, y, p0 = [A, B])
perr = np.sqrt(np.diag(pcov))
residuals = y - powerFit(x, *popt)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((y-np.mean(y))**2)
r_squared = 1 - (ss_res / ss_tot)
plt.plot(x, powerFit(x, *popt),
         label='fit: A*x**(-B)\n A=%.1e +- %.0e\n B=%.1e +- %.0e\n R2 = %.2f'
               % tuple(np.concatenate((np.array([popt, perr]).T.flatten(), [r_squared]))))
plt.plot(x, powerFit(x, A, B), 'r--', label = 'A = %d, B = %d'%(A,B))
plt.legend()
plt.savefig("fig.png", dpi = 300)
I do not understand what is happening with the fitted power law. Why does it look wrong? How could I solve this?
Note: I know I could also fit the power law by plotting log(y) vs log(x). But according to this answer, curve_fit should be able to do it correctly on the raw data as well. So my question is whether it is possible to do a power-law fit on a log-log scale without log-transforming. I want to avoid the log-log transformation because it cannot be applied to every fit (consider, for example, a fit to y = A*x**(-Bx)).
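One thing worth noting (my own addition, not from the original post): with multiplicative noise, an unweighted least-squares fit is dominated by the few largest y values, so the tail of the power law is effectively ignored and looks badly off on a log-log plot. A minimal sketch that keeps curve_fit but weights each point by its relative error (sigma proportional to y) would be:

import numpy as np
from scipy.optimize import curve_fit

def powerFit(x, A, B):
    return A * x**(-B)

x = np.logspace(np.log10(1), np.log10(1e5), 30)
np.random.seed(1)
y = powerFit(x, 1, 2) * np.random.normal(1, 0.2, size=x.size)

# Assume a roughly constant *relative* error, so sigma ~ y; this keeps the
# small-y points from being swamped by the large-y points.
popt, pcov = curve_fit(powerFit, x, y, p0=[1, 2], sigma=np.abs(y))
print(popt)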
I have the following data-set:
x = 0, 5, 10, 15, 20, 25, 30
y = 0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368
Now I want to plot these data, fit them with my function f(x) = A*K*x/(1+K*x), and find the parameters A and K.
I wrote the following Python script, but it does not produce the fit I need:
import matplotlib.pyplot as plt
import numpy as np
from scipy.optimize import curve_fit
x = np.array([0, 5, 10, 15, 20, 25, 30])
y = np.array([0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368])
def func(x, A, K):
    return A*K*x / (1+K*x)
plt.plot(x, y, 'b-', label='data')
popt, pcov = curve_fit(func, x, y)
plt.plot(x, func(x, *popt), 'r-', label='fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
Still, it does not give a good fit. Can anyone suggest changes to this script, or a new one, that will properly fit the data with my desired fitting function?
The classic problem: you didn't give any initial guess for A or K. In that case the default value of 1 is used for all parameters, which is not suitable for your data set, so the fit converges somewhere else. You can come up with guesses in different ways: by looking at the data, from the physical meaning of the parameters, etc. You can pass the guesses with the p0 parameter of scipy.optimize.curve_fit; it accepts a list of values in the order they appear in the function you want to optimize. I used 0.1 for both, and I got this curve:
popt, pcov = curve_fit(func, x, y, p0=[0.1, 0.1])
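A data-driven way to come up with those guesses (my own sketch, not part of the original answer): A is roughly the plateau the data approach at large x, and 1/K is roughly the x at which y reaches half of that plateau.

import numpy as np
from scipy.optimize import curve_fit

x = np.array([0, 5, 10, 15, 20, 25, 30])
y = np.array([0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368])

def func(x, A, K):
    return A*K*x / (1+K*x)

A0 = y.max()                               # rough plateau estimate
K0 = 1.0 / x[np.argmin(np.abs(y - A0/2))]  # x closest to half-plateau
popt, pcov = curve_fit(func, x, y, p0=[A0, K0])
print(popt)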
Try Minuit, a minimizer developed at CERN.
from iminuit import Minuit
import numpy as np
import matplotlib.pyplot as plt
def func(x, A, K):
    return A*K*x / (1+K*x)

def least_squares(a, b):
    yvar = 0.01
    return sum((y - func(x, a, b)) ** 2 / yvar)
x = np.array([0, 5, 10, 15, 20, 25, 30])
y = np.array([0, 0.13157895, 0.31578947, 0.40789474, 0.46052632, 0.5, 0.53947368])
m = Minuit(least_squares, a=5, b=5)
m.migrad() # finds minimum of least_squares function
m.hesse() # computes errors
plt.plot(x, y, "o")
plt.plot(x, func(x, *m.values.values()))
# print parameter values and uncertainty estimates
for p in m.parameters:
    print("{} = {} +/- {}".format(p, m.values[p], m.errors[p]))
And the outcome:
a = 0.955697134431429 +/- 0.4957121286951612
b = 0.045175437602766676 +/- 0.04465599806912648
As mentioned here, scikit-learn's Gaussian process regression (GPR) permits "prediction without prior fitting (based on the GP prior)". But I have an idea of what my prior should be (i.e. it should not simply have a mean of zero; perhaps my output y scales linearly with my input X, i.e. y = X). How could I encode this information into GPR?
Below is a working example, but it assumes a zero-mean prior. I have read that "The GaussianProcessRegressor does not allow for the specification of the mean function, always assuming it to be the zero function, highlighting the diminished role of the mean function in calculating the posterior." I believe this is the motivation behind custom kernels (e.g. heteroscedastic kernels) with variable scales at different X, although I'm still trying to understand what capability they provide. Is there a way to get around the zero-mean prior so that an arbitrary prior can be specified in scikit-learn?
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
def f(x):
    """The function to predict."""
    return 1.5*(1. - np.tanh(100.*(x-0.96))) + 1.5*x*(x-0.95) + 0.4 + 1.5*(1.-x)*np.random.random(x.shape)
# Instantiate a Gaussian Process model
kernel = C(10.0, (1e-5, 1e5)) * RBF(10.0, (1e-5, 1e5))
X = np.array([0.803, 0.827, 0.861, 0.875, 0.892, 0.905,
              0.91, 0.92, 0.925, 0.935, 0.941, 0.947, 0.96,
              0.974, 0.985, 0.995, 1.0])
X = np.atleast_2d(X).T
# Observations and noise
y = f(X).ravel()
noise = np.linspace(0.4,0.3,len(X))
y += noise
# Instantiate a Gaussian Process model
gp = GaussianProcessRegressor(kernel=kernel, alpha=noise ** 2,
                              n_restarts_optimizer=10)
# Fit to data using Maximum Likelihood Estimation of the parameters
gp.fit(X, y)
# Make the prediction on the meshed x-axis (ask for MSE as well)
x = np.atleast_2d(np.linspace(0.8, 1.02, 1000)).T
y_pred, sigma = gp.predict(x, return_std=True)
plt.figure()
plt.errorbar(X.ravel(), y, noise, fmt='k.', markersize=10, label=u'Observations')
plt.plot(x, y_pred, 'k-', label=u'Prediction')
plt.fill(np.concatenate([x, x[::-1]]),
         np.concatenate([y_pred - 1.9600 * sigma,
                         (y_pred + 1.9600 * sigma)[::-1]]),
         alpha=.1, fc='k', ec='None', label='95% confidence interval')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim(0.8, 1.02)
plt.ylim(0, 5)
plt.legend(loc='lower left')
plt.show()
Here is an example of how to use a prior mean function with the sklearn GPR model.
import numpy as np
from matplotlib import pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel
A=np.linspace(5,25,num=100)
# prior mean function
prior_beta=12-0.3*A
# true function
true_beta=20-0.7*A
np.random.seed(44)
# Training data
size=15
ind=np.random.randint(0,100,size=size)
# generate the posterior variance (noisy samples)
var_=np.random.uniform(0.1,10.0,size=size)
A_=A[ind][:, np.newaxis]
beta_=true_beta[ind]-prior_beta[ind]
beta_1=true_beta[ind]
plt.figure()
kernel = ConstantKernel(4) * RBF(length_scale=2, length_scale_bounds=(1e-3, 1e2))
gp = GaussianProcessRegressor(kernel=kernel,
                              alpha=var_, optimizer=None).fit(A_, beta_)
X_ = np.linspace(5, 25, 100)
y_mean, y_cov = gp.predict(X_[:, np.newaxis], return_cov=True)
# Now you add the prior mean function back
y_mean=y_mean+12-0.3*X_
plt.plot(X_, y_mean, 'k', lw=3, zorder=9, label='predicted')
plt.fill_between(X_, y_mean - 3*np.sqrt(np.diag(y_cov)),
                 y_mean + 3*np.sqrt(np.diag(y_cov)),
                 alpha=0.5, color='k', label='+-3sigma')
plt.plot(A,true_beta, 'r', lw=3, zorder=9,label='truth')
plt.plot(A,prior_beta, 'blue', lw=3, zorder=9,label='prior')
plt.errorbar(A_[:, 0], beta_1, yerr=3*np.sqrt(var_), fmt='x', ecolor='g',
             marker='s', mfc='g', ms=10, capsize=6, label='training set')
plt.title("Initial: %s\n"% (kernel))
plt.legend()
plt.show()
I have some data to which I have fitted a normal distribution using the fit method of scipy.stats.norm, like so:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import matplotlib.mlab as mlab
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
mu, sigma = norm.fit(x)
n, bins, patches = ax.hist(x,nbins,normed=1,facecolor = 'grey', alpha = 0.5, label='before');
y0 = mlab.normpdf(bins, mu, sigma) # Line of best fit
ax.plot(bins,y0,'k--',linewidth = 2, label='fit before')
ax.set_title('$\mu$={}, $\sigma$={}'.format(mu, sigma))
plt.show()
I would now like to extract the uncertainty/error in the fitted mu and sigma values. How can I go about this?
You can use scipy.optimize.curve_fit:
This method returns not only the estimated optimal values of the parameters, but also the corresponding covariance matrix:
popt : array
Optimal values for the parameters so that the sum of the squared residuals
of f(xdata, *popt) - ydata is minimized
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
You can calculate the standard deviation errors of the parameters from the square roots of the diagonal elements of the covariance matrix as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.optimize import curve_fit
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
n, bins, patches = ax.hist(x,nbins, density=True, facecolor = 'grey', alpha = 0.5, label='before');
centers = (0.5*(bins[1:]+bins[:-1]))
pars, cov = curve_fit(lambda x, mu, sig : norm.pdf(x, loc=mu, scale=sig), centers, n, p0=[0,1])
ax.plot(centers, norm.pdf(centers,*pars), 'k--',linewidth = 2, label='fit before')
ax.set_title('$\mu={:.4f}\pm{:.4f}$, $\sigma={:.4f}\pm{:.4f}$'.format(pars[0],np.sqrt(cov[0,0]), pars[1], np.sqrt(cov[1,1 ])))
plt.show()
This results in the following plot:
See also lmfit (https://github.com/lmfit/lmfit-py) which gives an easier interface and reports uncertainties in fitted variables. To fit data to a normal distribution, see http://lmfit.github.io/lmfit-py/builtin_models.html#example-1-fit-peak-data-to-gaussian-lorentzian-and-voigt-profiles
and use something like
from lmfit.models import GaussianModel
model = GaussianModel()
# create parameters with initial guesses:
params = model.make_params(center=9, amplitude=40, sigma=1)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
The report will include the 1-sigma errors like
[[Variables]]
sigma: 1.23218358 +/- 0.007374 (0.60%) (init= 1.0)
center: 9.24277047 +/- 0.007374 (0.08%) (init= 9.0)
amplitude: 30.3135620 +/- 0.157126 (0.52%) (init= 40.0)
fwhm: 2.90157055 +/- 0.017366 (0.60%) == '2.3548200*sigma'
height: 9.81457817 +/- 0.050872 (0.52%) == '0.3989423*amplitude/max(1.e-15, sigma)'
I am trying to fit Gaussian data to a specific three-term Gaussian form (in which the amplitude in one term is equal to twice the standard deviation of the next term). Here is my attempt:
import numpy as np
#from scipy.optimize import curve_fit
import scipy.optimize as optimize
import matplotlib.pyplot as plt
#r=np.linspace(0.0e-15,4e-15, 100)
data = np.loadtxt('V_lambda_n.dat')
r = data[:, 0]
V = data[:, 1]
def func(x, ps1, ps2, ps3, ps4):
    return ps1*np.exp(-(x/ps2)**2) + ps2*np.exp(-(x/ps3)**2) + ps3*np.exp(-(x/ps4)**2)
popt, pcov = optimize.curve_fit(func, r, V, maxfev=10000)
#params = optimize.curve_fit(func, ps1, ps2, ps3, ps4)
#[ps1, ps2, ps2, ps4] = params[0]
p1=plt.plot(r, V, 'bo', label='data')
p2=plt.plot(r, func(r, *popt), 'r-', label='fit')
plt.xticks(np.linspace(0, 4, 9, endpoint=True))
plt.yticks(np.linspace(-50, 150, 9, endpoint=True))
plt.show()
Here is the result:
How may I fix this code to improve the fit? Thanks
With help from the scipy-user forum, I tried the following initial guess:
p0=[V.max(), std_dev, V.max(), 2]
The fit got a lot better with this initial guess.
I hope the fit can still be improved beyond this.
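For reference, a minimal sketch of the full call with such an initial guess; note that std_dev is not defined in the code posted above, so here it is estimated from the data as a |V|-weighted RMS width of the profile, which is an assumption on my part:

import numpy as np
import scipy.optimize as optimize

data = np.loadtxt('V_lambda_n.dat')
r, V = data[:, 0], data[:, 1]

def func(x, ps1, ps2, ps3, ps4):
    return ps1*np.exp(-(x/ps2)**2) + ps2*np.exp(-(x/ps3)**2) + ps3*np.exp(-(x/ps4)**2)

# Rough width scale of the profile, used only as an initial guess
w = np.abs(V)
std_dev = np.sqrt(np.sum(w * r**2) / np.sum(w))

p0 = [V.max(), std_dev, V.max(), 2]
popt, pcov = optimize.curve_fit(func, r, V, p0=p0, maxfev=10000)
print(popt)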