With this code (before the snippet I just read in the data, which works fine, and after the snippet I just add labels etc.):
import matplotlib.pyplot as plt
from lmfit import models

plt.errorbar(xdata, ydata, yerr, fmt='.', label='Data')
model = models.GaussianModel()
params = model.make_params()
params['center'].set(6.5)
#params['center'].vary = False
fit = model.fit(ydata, params=params, x=xdata, weights=1/yerr)
print(fit.fit_report())
plt.plot(xdata, fit.best_fit, label='Fit')
I try to fit the last peak (approximately at x=6.5). But as you can see in the picture, the code does not do that. Can anyone tell me why that is?
Edit: If I run the line params['center'].vary = False, the "fit" just becomes zero everywhere.
I have never used lmfit, but the problem is most likely that you are trying to fit the whole data region. Considering the entire region of data you handed to the .fit call, the resulting fit is probably the best possible, and correct.
You should try to pass only the relevant data to the fit. In your case xdata should contain only the data points from around 5.5 up to 7.5 (or somewhere around those numbers), and ydata has to be restricted to the same points, of course. Then the fit should work nicely, as sketched below.
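A minimal sketch of that idea (assuming xdata, ydata and yerr are NumPy arrays, and reusing the model and params from the question; the 5.5-7.5 window is just a guess from the description):

import numpy as np

# restrict the fit to the region around the last peak
mask = (xdata > 5.5) & (xdata < 7.5)
fit = model.fit(ydata[mask], params=params, x=xdata[mask], weights=1/yerr[mask])
print(fit.fit_report())
plt.plot(xdata[mask], fit.best_fit, label='Fit (last peak only)')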
I'm doing a curve fit in Python using scipy.curve_fit, and the fit itself looks great; however, the parameters that are generated don't make sense.
The equation is (ax)^b + cx, but with the parameters Python finds (a = -c and b = 1), the whole equation just equals 0 for every value of x.
Here is the plot and my code.
https://i.stack.imgur.com/fBfg7.png
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# experimental data
xdata = cfu_u
ydata = OD_u
# x-values to plot for curve fit
min_cfu = 0.1
max_cfu = 9.1
x_vec = pow(10,np.arange(min_cfu,max_cfu,0.1))
# exponential function
def func(x, a, b, c):
    return (a*x)**b + c*x
# curve fit
popt, pcov = curve_fit(func, xdata, ydata)
# plot experimental data and fitted curve
plt.plot(x_vec, func(x_vec, *popt), label = 'curve fit',color='slateblue',linewidth = 2.2)
plt.plot(cfu_u,OD_u,'-',label = 'experimental data',marker='.',markersize=8,color='deepskyblue',linewidth = 1.4)
plt.legend(loc='upper left',fontsize=12)
plt.ylabel("Y",fontsize=12)
plt.xlabel("X",fontsize=12)
plt.xscale("log")
plt.gcf().set_size_inches(7, 5)
plt.show()
print(popt)
[ 1.44930871e+03 1.00000000e+00 -1.44930871e+03]
How can I find the actual parameters?
Edit: here is the actual experimental raw data I used: https://pastebin.com/CR2BCJji
The chosen function model is:
y(x) = (ax)^b + cx
In order to understand the difficulty encountered, one first has to compare the behaviour of the function to the data over the range of the lowest values of X.
We see that y(x)=0 is an acceptable fit for the points over a large range (at least 6 decades), considering the scatter. These are the majority of the experimental points (18 points out of 27). The function y(x)=0 is obtained from the function model only if b=1, leading to y(x)=(a+c)x, together with a+c=0. At first sight Python seems to give b=1 and c=-a. But we have to look more carefully.
Of course the function y(x)=0 is not suitable for the 9 points at larger X.
This leads one to think that the fit of the whole set of points is an extension of the above fit, with parameter values different from b=1 and a+c=0 but not far from them, so that the fit remains good on the above 18 points.
Conclusion: the actual values of the parameters found by Python are certainly very close to b=1, with a close to 1.44930871e+03 and c close to -1.44930871e+03.
The calculation inside Python is certainly carried out with 16 or 18 digits, but the display shows 9 digits only. This is not sufficient to see that b might be slightly different from 1 and that c might be slightly different from -a. This suggests that the key might simply be a matter of displaying enough digits.
Yes, the fit produced by Python looks great. This is a fine performance from the mathematical viewpoint. But the physical significance is doubtful when so many digits are essential to the fit over the whole range.
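A quick way to check this, as a rough sketch reusing popt from the code above: print the fitted parameters with full precision and verify whether b really equals 1 and c really equals -a.

import numpy as np

np.set_printoptions(precision=17)
print(popt)                 # all digits of a, b, c
print(popt[0] + popt[2])    # a + c: exactly zero only if c = -a
print(popt[1] - 1.0)        # b - 1: exactly zero only if b = 1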
What I have done so far:
I am trying to fit noisy data (which I generated myself by adding random noise to my function) to a Gauss-Hermite function that I have defined. It works well in some cases for lower values of h3 and h4, but every once in a while it produces a really bad fit even for low h3, h4 values, and for higher h3, h4 values it always gives a bad fit.
My code:
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as mpl
import matplotlib.pyplot as plt
# Let's define the Gauss-Hermite function
# a=amplitude, x0=location of peak, sig=std dev, h3, h4=Gauss-Hermite coefficients
def gh_func(x, a, x0, sig, h3, h4):
    w = (x - x0)/sig
    return a*np.exp(-.5*w**2)*(1 + h3*(-np.sqrt(3)*w + (2/np.sqrt(3))*w**3)
                                 + h4*(np.sqrt(6)/4 - np.sqrt(6)*w**2 + (np.sqrt(6)/3)*w**4))
#generate clean data
x = np.linspace(-10, 20, 100)
y = gh_func(x, 10, 5, np.sqrt(3), -0.10,-0.03) #it gives okay fit for h3=-0.10, h4=-0.03 but bad fits for higher values like h3=-0.4 and h4=-0.3.
#add noise to data
noise=np.random.normal(0,np.sqrt(0.5),size=len(x))
yn = y + noise
fig = mpl.figure(1)
ax = fig.add_subplot(111)
ax.plot(x, y, c='k', label='analytic function')
ax.scatter(x, yn, s=5, label='fake noisy data')
fig.savefig('model_and_noise_h3h4.png')
# Executing curve_fit on noisy data
popt, pcov = curve_fit(gh_func, x, yn)
#popt returns the best fit values for parameters of the given model (func)
print('Fitted Parameters (Gaus_Hermite):\na = %.10f , x0 = %.10f , sig = %.10f\nh3 = %.10f , h4 = %.10f' \
%(popt[0],popt[1],popt[2],popt[3],popt[4]))
ym = gh_func(x, popt[0], popt[1], popt[2], popt[3], popt[4])
ax.plot(x, ym, c='r', label='Best fit')
ax.legend()
fig.savefig('model_fit_h3h4.png')
plt.legend(loc='upper left')
plt.xlabel("v")
plt.ylabel("f(v)")
What I want to do:
I want to find a better fitting method than just curve_fit from scipy.optimize, but I am not sure what I can use. Even if we end up using curve_fit, I need a way to produce better fits by providing initial guesses for the parameters that are generated automatically; e.g. one approach for a single-peak Gaussian only is described in the accepted answer of this post (Jean Jacquelin's method): gaussian fitting inaccurate for lower peak width using Python. But this is just for mu, sigma and amplitude, not h3, h4.
Besides curve_fit from scipy.optimize, I think there's one called lmfit: https://lmfit.github.io/lmfit-py/ but I am not sure how I would implement it in my code. I do not want to use manual initial guesses for the parameters; I want the fit to be found automatically.
Thanks a lot!
Using lmfit for this fitting problem would be straightforward, by creating an lmfit.Model from your gh_func, with something like
from lmfit import Model
gh_model = Model(gh_func)
params = gh_model.make_params(x0=?, a=?, sig=?, h3=?, h4=?)
where those question marks would have to be initial values for the parameters in question.
Your use of scipy.optimize.curve_fit does not provide initial values for the variables. Unfortunately, this does not raise an error or give any indication of a problem, because the authors of scipy.optimize.curve_fit have lied to you by making initial values optional. Initial values for all parameters are always required for all non-linear least-squares analyses. This is not a choice of the implementation, it is a feature of the mathematical algorithm. What curve_fit hides from you is that leaving p0=None sets all initial values to 1.0. Whether that is remotely appropriate depends on the model and data being fit; it cannot always be reasonable. As an example, if the x values for your data extended from 9000 to 9500 and the peak function was centered around 9200, starting with x0 of 1 would almost certainly not find a suitable fit. It is deceptive to imply in any way that initial values are optional. They just are not: not sometimes, not for well-defined problems. Initial values are never optional.
Initial values don't have to be perfect, but they need to be of the right scale. Many people will warn you about "false minima" - getting stuck at a solution that is not bad but not "best". My experience is that the more common problem people run into is initial values that are so far off that the model is just not sensitive to small changes in their values, and so can never be optimized (with x0=1,sig=1 for data with x on [9000, 9500] and centered at 9200 being exactly in that category).
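For the code in the question, that simply means passing p0 explicitly. A rough sketch, reusing gh_func, x and yn defined above; the numbers here are only plausible guesses of roughly the right scale, not the true values:

# explicit initial guesses for a, x0, sig, h3, h4
p0 = [8.0, 4.0, 2.0, 0.0, 0.0]
popt, pcov = curve_fit(gh_func, x, yn, p0=p0)
print(popt)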
Your essential question of "providing initial guesses for the parameters which are generated automatically" is hard to answer without knowing the properties of your model function. You might be able to use some heuristics to guess values from a data set. Lmfit has such heuristic "guess parameter values from data" functions for many peak-like functions. You probably have some sense of the physical (or other domain) meaning of h3 and h4 and know what kinds of values are reasonable, and can probably give better initial values than h3=h4=1. It might be that you want to start by guessing parameters as if the data were a simple Gaussian (say, using lmfit.models.guess_from_peak()) and then use the difference between that and your data to get the scale for h3 and h4.
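A rough sketch of that idea, assuming the gh_func, x and yn defined in the question; here a plain Gaussian guess seeds a, x0 and sig, and h3, h4 simply start at zero rather than being estimated from the residual:

import numpy as np
from lmfit import Model
from lmfit.models import GaussianModel

# heuristic single-Gaussian guess from the data
gpar = GaussianModel().guess(yn, x=x)
amp = gpar['amplitude'].value
cen = gpar['center'].value
sig = gpar['sigma'].value
height = amp / (sig*np.sqrt(2*np.pi))   # peak height of that Gaussian

# use the Gaussian estimates to seed the Gauss-Hermite fit
gh_model = Model(gh_func)
params = gh_model.make_params(a=height, x0=cen, sig=sig, h3=0.0, h4=0.0)
result = gh_model.fit(yn, params, x=x)
print(result.fit_report())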
I am currently working up some experimental data and am having a hard time understanding whether I should be using a log scale or actually applying np.log to the data.
Here is the plot I have made.
Blue represents using plt.yscale('log'), whereas orange comes from creating a new column and applying np.log to the data.
My question
Why are their magnitudes so different? Which is correct? And if plt.yscale('log') is the right way to do it, is there a way I can get those values, since I need to do a curve fit afterwards?
Thanks in advance for anyone that can provide some answers!
Edit (1):
I understand that plt.yscale('log') uses base 10, while np.log is the natural log. I have tried using np.log10 on the data instead, and it gives smaller values that do not seem to correspond to using a log scale.
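For what it's worth, a minimal sketch (with made-up, strictly positive data) of how the two relate: plotting np.log10(y) on a linear axis traces the same shape as plotting y on a log-scaled axis, only the tick labels differ, and np.log(y) is that same curve scaled by ln(10).

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1, 10, 50)
y = np.exp(x)                 # made-up, strictly positive data

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(x, y)
ax1.set_yscale('log')         # original values on a log-scaled axis
ax2.plot(x, np.log10(y))      # log10-transformed values on a linear axis
plt.show()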
Your data is getting log-ified, but it is "pointing" in the wrong direction.
Consider this toy data
x = np.linspace(0, 1, 100)[:-1]
y = np.log(1-x) + 5
Then we plot
plt.plot(x, y)
If I log scale it:
It's just more exaggerated
plt.plot(x, y)
plt.xscale('log')
You need to point your data in the other direction, like normal log data:
plt.plot(-x, y)
But you also have to make sure the data is positive or ... you know ... logs and stuff ¯\_(ツ)_/¯
plt.plot(-x + 1, y)
plt.xscale('log')
I have a set of data in a 15x55 matrix in a separate file which I import and pull data from using this code:
from numpy import loadtxt

ScanNum = []
#Import the data
data = loadtxt('FNScan40.txt')
#Define columns
ccd1= data[:,14]
t= data[:,0]
I am confident about this part of the code because I have used it elsewhere and it worked. I then defined the function to which I wanted to fit the data:
import math

def Kinetics(tk, A, B):
    f = A*(1.0 - math.e**(-B*tk))
    return f
Where A and B are the unknown coefficients.
Then, I put the x and y data that I have into arrays.
x = array([t])
y = array([ccd1])
Up until here I am pretty sure that everything is correct, it is when I actually try to perform the curve fit that I have trouble. This is the rest of the code that I have:
popt, pcov = curve_fit(Kinetics, x, y, p0=None)
print popt, pcov
plt.figure()
plt.plot(x, Kinetics(x, *popt), label="Fitted Curve")
plt.show()
When I execute the code I get the error message: Improper input: N=2 must not exceed M=1. I know that N is the number of data points while M is the number of initial parameters. I am not sure how to fix this. The only thing I found to try was to set my own initial parameters, so in the above code I defined p0=[1,1] (since I have two parameters I am trying to guess, A and B). This only resulted in the same error (Improper input: N=2 must not exceed M=1), so I tried varying how many '1's I put in. If I go over two I just get an error saying I've tried to pass too many arguments into Kinetics, which makes sense.
I've tried everything I could think of or find on the Internet, to no avail. If M is the initial parameters, why doesn't changing the number of '1's in p0 change what the error reports as M? What can I do to fix this problem?
The problem I was encountering was in the arrays. I decided to look at the actual arrays to see if they were the issue. Including the brackets around 't' and 'ccd1' (i.e. array([t])) wrapped each of them in an extra dimension, so the length of each array became one. Taking out the brackets fixed that problem, as sketched below. There are other, unrelated issues in the code that should be easy to resolve.
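A minimal sketch of that fix, assuming t and ccd1 are the 1-D columns read from the file above:

x = array(t)       # shape (N,), not (1, N) as with array([t])
y = array(ccd1)
popt, pcov = curve_fit(Kinetics, x, y, p0=[1.0, 1.0])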
I was able to fit curves to an x/y dataset using peak-o-mat, as shown below. That's a linear background and 10 Lorentzian curves.
Since I need to fit many similar curves, I wrote a scripted fitting routine using mpfit.py, which implements the Levenberg-Marquardt algorithm. However, the fit takes longer and, in my opinion, is less accurate than the peak-o-mat result:
Starting values
Fit result with fixed linear background (values for linear background taken from the peak-o-mat result)
Fit result with all variables free
I believe the starting values are already very close, but even with the fixed linear background, the left Lorentzian clearly degrades the fit.
The result is even worse for the totally free fit.
Peak-o-mat appears to use scipy.odr.odrpack. Now, which is more likely:
Did I make an implementation error?
Is odrpack more suitable for this particular problem?
Fitting a simpler problem (linear data with one peak in the middle) shows very good agreement between peak-o-mat and my script. Also, I did not find a lot of information about odrpack.
Edit: It seems I could answer the question myself, although the answer is a bit unsettling. Using scipy.odr (which allows fitting with either the odr or the leastsq method), both give the same result as peak-o-mat, even without constraints.
The image below shows again the data, the starting values (almost perfect) and then the odr and leastsq fits. The component curves shown are those of the odr fit.
I will switch to odr, but this still leaves me upset. The methods (mpfit.py, scipy.optimize.leastsq, scipy.odr in leastsq mode) 'should' yield the same results.
And for people stumbling upon this post: to do the odr fit, an error must be specified for the x and y values. If there are no errors, use small values with sx << sy:
from scipy import odr

linear = odr.Model(f)
mydata = odr.RealData(x, y, sx = 1e-99, sy = 0.01)
myodr = odr.ODR(mydata, linear, beta0 = beta0, maxit = 2000)
myoutput1 = myodr.run()
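For completeness, a self-contained toy example of the same call pattern: the straight-line model and synthetic data here are made up just to show the pieces the snippet above assumes (a model function f(beta, x) and a starting vector beta0), and fit_type=2 switches scipy.odr to ordinary least squares instead of ODR.

from scipy import odr
import numpy as np

def f(beta, x):
    # toy model: straight line; replace with the actual peak model
    return beta[0]*x + beta[1]

x = np.linspace(0, 10, 50)
y = 2.0*x + 1.0 + np.random.normal(0, 0.1, x.size)
beta0 = [1.0, 0.0]

mydata = odr.RealData(x, y, sx=1e-99, sy=0.01)
myodr = odr.ODR(mydata, odr.Model(f), beta0=beta0, maxit=2000)
# myodr.set_job(fit_type=2)   # optional: ordinary least squares instead of ODR
myoutput = myodr.run()
print(myoutput.beta, myoutput.sd_beta)   # fitted parameters and their standard errors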
You can use peak-o-mat for scripting as well. The easiest approach would be to create a project containing all the data you want to fit via the GUI, clean it, transform it, and attach (i.e. choose a model, provide an initial guess and fit) the base model to one of the sets. Then you can (deep)copy that model and attach it to all of the other data sets. Try this:
from peak_o_mat.project import Project
from peak_o_mat.fit import Fit
from copy import deepcopy
p = Project()
p.Read('in.lpj')
base = p[2][0] # this is the set which has been fit already
for data in p[2][1:]:  # all remaining sets of plot number 2
    mod = deepcopy(base.mod)
    data.mod = mod
    f = Fit(data, data.mod)
    res = f.run()
    pars = res[0]
    err = res[1]
    data.mod._newpars(pars, err)
    print data.mod.parameters_as_table()
p.Write('out')
Please tell me if you need more details.