I am unable to fit a logarithmic or exponential decay curve to my experimental data points: the fitted curves do not even remotely resemble the pattern in my data.
I have the following example data:
data = {'X': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        'Y': [55, 55, 55, 54, 54, 54, 54, 53, 53, 50, 45, 37, 27, 16, 0]}
df = pd.DataFrame(data, columns=['X', 'Y'])
df.plot(x='X', y='Y', kind='scatter')
plt.show()
This outputs:
I then try fitting an exponential decay and logarithmic decay curve to these data points using this code and outputting the root mean square error for each curve:
# load the dataset
data = df.values
# choose the input and output variables
x, y = data[:, 0], data[:, 1]
def func1(x, a, b, c):
    return a*exp(b*x)+c
def func2(x, a, b):
    return a * np.log(x) + b
params, _ = curve_fit(func1, x, y)
a, b, c = params[0], params[1], params[2]
yfit1 = a*exp(x*b)+c
rmse = np.sqrt(np.mean((yfit1 - y) ** 2))
print('Exponential decay fit:')
print('y = %.5f * exp(x*%.5f)+%.5f' % (a, b, c))
print('RMSE:')
print(rmse)
print('')
params, _ = curve_fit(func2, x, y)
a, b = params[0], params[1]
yfit2 = a * np.log(x) + b
rmse = np.sqrt(np.mean((yfit2 - y) ** 2))
print('Logarithmic decay fit:')
print('y = %.5f * ln(x)+ %.5f' % (a, b))
print('RMSE:')
print(rmse)
print('')
plt.plot(x, y, 'bo', label="y-original")
plt.plot(x, yfit1, label="y=a*exp(x*b)+c")
plt.plot(x, yfit2, label="y=a * np.log(x) + b")
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='best', fancybox=True, shadow=True)
plt.grid(True)
plt.show()
And I receive this output:
I then try the same approach on my experimental data, using these new data points:
data = {'X': [0, 30, 60, 90, 120, 150, 180, 210, 240, 270, 300, 330, 360, 390, 420, 450, 480],
        'Y': [2.011399983, 1.994139959, 1.932761226, 1.866343728, 1.709889128,
              1.442674671, 1.380548494, 1.145193671, 0.820646118, 0.582299012,
              0.488162766, 0.264390575, 0.139457758, 0, 0, 0, 0]}
df = pd.DataFrame(data, columns=['X', 'Y'])
df.plot(x='X', y='Y', kind='scatter')
plt.show()
This shows:
I then try using the previous code to fit an exponential decay curve and a logarithmic decay curve to these new data points with this:
import pandas as pd
import numpy as np
from numpy import array, exp
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
# load the dataset
data = df.values
# choose the input and output variables
x, y = data[:, 0], data[:, 1]
def func1(x, a, b, c):
    return a*exp(b*x)+c
def func2(x, a, b):
    return a * np.log(x) + b
params, _ = curve_fit(func1, x, y)
a, b, c = params[0], params[1], params[2]
yfit1 = a*exp(x*b)+c
rmse = np.sqrt(np.mean((yfit1 - y) ** 2))
print('Exponential decay fit:')
print('y = %.5f * exp(x*%.5f)+%.5f' % (a, b, c))
print('RMSE:')
print(rmse)
print('')
params, _ = curve_fit(func2, x, y)
a, b = params[0], params[1]
yfit2 = a * np.log(x) + b
rmse = np.sqrt(np.mean((yfit2 - y) ** 2))
print('Logarithmic decay fit:')
print('y = %.5f * ln(x)+ %.5f' % (a, b))
print('RMSE:')
print(rmse)
print('')
plt.plot(x, y, 'bo', label="y-original")
plt.plot(x, yfit1, label="y=a*exp(x*b)+c")
plt.plot(x, yfit2, label="y=a * np.log(x) + b")
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='best', fancybox=True, shadow=True)
plt.grid(True)
plt.show()
And I receive this output which looks totally wrong:
And then I receive this plotted output which looks very far off from my experimental data points:
I do not understand why my first curve fitting attempt worked so smoothly, while my second attempt turned into an incoherent mess that seems to have broken the curve_fit function. I also do not understand why the fitted curve goes below zero on the y-axis when none of my experimental y values are negative. My experimental data plots fine as points, so I am not sure what prevents the curves from being fitted to them. How can I fix my code so that curve_fit() properly fits an exponential decay curve and a logarithmic decay curve to my experimental data points?
As already pointed out in the comments, the data appear to follow a logistic-type model.
The main difficulty when fitting with the usual software is choosing initial values of the parameters to start the iterative calculation. An unconventional method, whose general principle is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales, does not need initial values. For example, the numerical calculation is shown below:
With your second data set:
With your first data set:
If you want a more accurate fit according to a specific fitting criterion (MSE, MSRE, MAE, or other), you could take the parameter values above as starting values in non-linear regression software.
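As a rough illustration of the last step, here is a minimal sketch that fits a decreasing logistic curve to the second data set with scipy.optimize.curve_fit. The three-parameter form, the function name logistic_decay, and the starting values (read off the scatter plot rather than taken from the integral-equation method) are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import expit  # numerically stable 1 / (1 + exp(-z))

# Second data set from the question
x = np.arange(0, 481, 30, dtype=float)
y = np.array([2.011399983, 1.994139959, 1.932761226, 1.866343728,
              1.709889128, 1.442674671, 1.380548494, 1.145193671,
              0.820646118, 0.582299012, 0.488162766, 0.264390575,
              0.139457758, 0.0, 0.0, 0.0, 0.0])

def logistic_decay(x, a, k, x_mid):
    # Plateau of height a for small x, falling to 0 for large x;
    # x_mid is the x value where the curve reaches a / 2
    return a * expit(-k * (x - x_mid))

# Starting values estimated from the data: plateau height, a slope scale
# based on the x range, and the x whose y is closest to half the plateau
half_index = np.argmin(np.abs(y - y.max() / 2))
p0 = [y.max(), 4.0 / (x.max() - x.min()), x[half_index]]

params, _ = curve_fit(logistic_decay, x, y, p0=p0)
rmse = np.sqrt(np.mean((logistic_decay(x, *params) - y) ** 2))
print('a = %.4f, k = %.5f, x_mid = %.2f' % tuple(params))
print('RMSE:', rmse)
```

Note also that the second data set contains x = 0, where np.log(x) evaluates to -inf, so the logarithmic model from the question cannot be evaluated at that point.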
The problem here is fitting the Monod equations to experimental data. The model of bacterial growth and degradation of the organic carbon looks like this:
dX/dt = (u * S * X )/(K + S)
dS/dt = ((-1/Y) * u * S * X )/(K + S)
These equations are solved using the scipy odeint function. The results of the integration are stored in two vectors, one for growth and the other for degradation. The next step is to fit this model to the experimentally observed data and estimate the model parameters u, K and Y. When the code is run, the following error is produced:
File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\optimize\minpack.py", line 392, in leastsq
raise TypeError('Improper input: N=%s must not exceed M=%s' % (n, m))
TypeError: Improper input: N=3 must not exceed M=2
For convenience, the curve-fitting part is commented out so that the plot of the expected result can be generated. Below is the code sample:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import curve_fit
"""Experimental data!"""
t_exp = np.array([0, 8, 24, 32, 48, 96, 168])
S_exp = np.array([5.5, 4.7, 3.7, 2.5, 1.5, 0.7, 0.5])
X_exp = np.array([10000, 17000, 30000, 40000, 60000, 76000, 80000])
"Model of the microbial growth and the TOC degradation"
# SETTING UP THE MODEL
def f(t, u, K, Y):
    'Function that returns mutually dependent variables X and S'
    def growth(x, t):
        X = x[0]
        S = x[1]
        "Now differential equations are defined!"
        dXdt = (u * S * X )/(K + S)
        dSdt = ((-1/Y) * u * S * X )/(K + S)
        return [dXdt, dSdt]
    # INTEGRATING THE DIFFERENTIAL EQUATIONS
    "initial Conditions"
    init = [10000, 5]
    results = odeint(growth, init, t)
    "Taking out desired column vectors from results array"
    return results[:,0], results[:,1]
# CURVE FITTING AND PARAMETER ESTIMATION
"""k, kcov = curve_fit(f, t_exp, [X_exp, S_exp], p0=(1, 2, 2))
u = k[0]
K = k[1]
Y = k[2]"""
# RESULTS OF THE MODEL WITH THE ESTIMATED MODEL PARAMETERS
t_mod = np.linspace(0, 168, 100)
compute = f(t_mod, 0.8, 75, 13700)# these fit quite well, but estimated manually
X_mod = compute[0]
S_mod = compute[1]
# PLOT OF THE MODEL AND THE OBSERVED DATA
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(t_exp, X_exp, "yo")
ax1.plot(t_mod, X_mod, "g--", linewidth=3)
ax1.set_ylabel("X")
ax2 = ax1.twinx()
ax2.plot(t_exp, S_exp, "mo", )
ax2.plot(t_mod, S_mod, "r--", linewidth=3)
ax2.set_ylabel("S", color="r")
for tl in ax2.get_yticklabels():
    tl.set_color("r")
plt.show()
Any advice of how to cope with this problem and proceed further would be highly appreciated. Thanks in advance.
The result of f() needs to have the same shape as the experimental data you pass to curve_fit as its third argument. In the last line of f() you just take the t = 0s value of the solution for both ODEs and return that, but you should return the complete solution. When fitting several data sets at once with curve_fit, just concatenate them (stack them horizontally), i.e.
def f(t, u, K, Y):
    .....
    return np.hstack((results[:,0], results[:,1]))
and call curve_fit like
k, kcov = curve_fit(f, t_exp, np.hstack([X_exp, S_exp]), p0=(1, 2, 2))
You will have to adapt the plotting part of your script, too:
compute = f(t_mod, u, K, Y)
compute = compute.reshape((2,-1))
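The stacking pattern can be tried in isolation. The sketch below uses a toy two-output model (a decaying and a growing exponential sharing the parameters amp and rate) as a hypothetical stand-in for the two columns returned by odeint; the model, its names, and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, amp, rate):
    # Two outputs that depend on the same parameters (toy stand-in
    # for the X and S columns of the ODE solution)
    decay = amp * np.exp(-rate * t)
    growth = amp * (1.0 - np.exp(-rate * t))
    # curve_fit needs one 1-D vector, so stack the outputs end to end
    return np.hstack((decay, growth))

t_obs = np.linspace(0.0, 5.0, 8)
true_params = (3.0, 0.7)
y_obs = model(t_obs, *true_params)   # noise-free synthetic "measurements"

# ydata is already stacked in the same order as the model output
popt, pcov = curve_fit(model, t_obs, y_obs, p0=(1.0, 1.0))

# Split the stacked prediction back into the two curves, e.g. for plotting
decay_fit, growth_fit = model(t_obs, *popt).reshape((2, -1))
print(popt)
```

When the two series differ greatly in magnitude, as X and S do in the question, the larger one dominates the sum of squares; passing per-point values through the sigma argument of curve_fit is one possible way to balance them.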
I want to exactly represent my noisy data with a numpy.polynomial polynomial. How can I do that?
In this example, I chose Legendre polynomials. When I use the polynomial legfit function, the coefficients it returns are either very large or very small. So, I think I need some sort of regularization.
Why doesn't my fit get more accurate as I increase the degree of the polynomial? (It can be seen that the 20, 200, and 300 degree polynomials are essentially identical.) Are any regularization options available in the polynomial package?
I tried implementing my own regression function, but it feels like I am re-inventing the wheel. Is making my own fitting function the best path forward?
from scipy.optimize import least_squares as mini
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 5, 1000)
tofit = np.sin(3 * x) + .6 * np.sin(7*x) - .5 * np.cos(3 * np.cos(10 * x))
# This is here to illustrate what I expected the legfit function to do
# I expected it to do least squares regression with some sort of regularization.
def myfitfun(x, y, deg):
    def fitness(a):
        return ((np.polynomial.legendre.legval(x, a) - y)**2).sum() + np.sum(a ** 2)
    return mini(fitness, np.zeros(deg)).x
degrees = [2, 4, 8, 16, 40, 200]
plt.plot(x, tofit, c='k', lw=4, label='Data')
for deg in degrees:
    #coeffs = myfitfun(x, tofit, deg)
    coeffs = np.polynomial.legendre.legfit(x, tofit, deg)
    plt.plot(x, np.polynomial.legendre.legval(x, coeffs), label="Degree=%i"%deg)
plt.legend()
Legendre polynomials are meant to be used over the interval [-1,1]. Try to replace x with 2*x/x[-1] - 1 in your fit and you'll see that all is good:
nx = 2*x/x[-1] - 1
for deg in degrees:
    #coeffs = myfitfun(x, tofit, deg)
    coeffs = np.polynomial.legendre.legfit(nx, tofit, deg)
    plt.plot(x, np.polynomial.legendre.legval(nx, coeffs), label="Degree=%i"%deg)
The easy way to use the proper interval in the fit is to use the Legendre class
from numpy.polynomial import Legendre as L
p = L.fit(x, y, order)
This will scale and shift the data to the interval [-1, 1] and track the scaling factors.
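A short sketch of the class-based approach applied to the test signal from the question. The chosen degrees are arbitrary; the point is only that the fitted object records the data interval as its domain and that the residual shrinks as the degree grows once the scaling is handled.

```python
import numpy as np
from numpy.polynomial import Legendre

x = np.linspace(0, 5, 1000)
tofit = np.sin(3 * x) + 0.6 * np.sin(7 * x) - 0.5 * np.cos(3 * np.cos(10 * x))

rms = {}
for deg in (8, 20, 60):
    # Legendre.fit maps [x.min(), x.max()] onto the window [-1, 1] internally
    p = Legendre.fit(x, tofit, deg)
    # Calling p(x) applies the same mapping, so the original x is used directly
    rms[deg] = np.sqrt(np.mean((p(x) - tofit) ** 2))

print('domain:', p.domain, 'window:', p.window)
for deg, value in rms.items():
    print('degree %3d  RMS residual %.4g' % (deg, value))
```

Because the mapping is stored on the returned object, there is no need to rescale x by hand when evaluating the fit.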
I have a set of coordinates (x, y, z(x, y)) which describe intensities (z) at coordinates x, y. For a set number of these intensities at different coordinates, I need to fit a 2D Gaussian that minimizes the mean squared error.
The data is in numpy matrices and for each fitting session I will have either 4, 9, 16 or 25 coordinates. Ultimately I just need to get the central position of the gaussian (x_0, y_0) that has smallest MSE.
All of the examples that I have found use scipy.optimize.curve_fit but the input data they have is over an entire mesh rather than a few coordinates.
Any help would be appreciated.
Introduction
There are multiple ways to approach this. You can use non-linear methods (e.g. scipy.optimize.curve_fit), but they'll be slow and aren't guaranteed to converge. You can linearize the problem (fast, unique solution), but any noise in the "tails" of the distribution will cause issues. There are actually a few tricks you can apply to this particular case to avoid the latter issue. I'll show some examples, but I don't have time to demonstrate all of the "tricks" right now.
Just as a side note, a general 2D Gaussian has 6 parameters, so you won't be able to fully fit things with 4 points. However, it sounds like you might be assuming that there's no covariance between x and y and that the variances are the same in each direction (i.e. a perfectly "round" bell curve). If that's the case, then you only need four parameters. If you know the amplitude of the Gaussian, you'll only need three. However, I'm going to start with the general solution, and you can simplify it later on, if you want to.
For the moment, let's focus on solving this problem using non-linear methods (e.g. scipy.optimize.curve_fit).
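The general equation for a 2D Gaussian is (directly from Wikipedia):
$$f(x, y) = A \exp\left(-\left(a (x - x_0)^2 + 2 b (x - x_0)(y - y_0) + c (y - y_0)^2\right)\right)$$
where:
$$\begin{bmatrix} a & b \\ b & c \end{bmatrix}$$
is essentially 0.5 over the covariance matrix, A is the amplitude,
and (x₀, y₀) is the center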
Generate simplified sample data
Let's write the equation above out:
import numpy as np
import matplotlib.pyplot as plt
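def gauss2d(x, y, amp, x0, y0, a, b, c):
    inner = a * (x - x0)**2
    # Cross term. Note: the Wikipedia form is 2*b*(x - x0)*(y - y0);
    # this example squares both factors.
    inner += 2 * b * (x - x0)**2 * (y - y0)**2
    inner += c * (y - y0)**2
    return amp * np.exp(-inner)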
And then let's generate some example data. To start with, we'll generate some data that will be easy to fit:
np.random.seed(1977) # For consistency
x, y = np.random.random((2, 10))
x0, y0 = 0.3, 0.7
amp, a, b, c = 1, 2, 3, 4
zobs = gauss2d(x, y, amp, x0, y0, a, b, c)
fig, ax = plt.subplots()
scat = ax.scatter(x, y, c=zobs, s=200)
fig.colorbar(scat)
plt.show()
Note that we haven't added any noise, and the center of the distribution is within the range that we have data (i.e. center at 0.3, 0.7 and a scatter of x,y observations between 0 and 1). For the moment, let's stick with this, and then we'll see what happens when we add noise and shift the center.
Non-linear fitting
To start with, let's use scipy.optimize.curve_fit to perform a non-linear least-squares fit to the Gaussian function. (On a side note, you can play around with the exact minimization algorithm by using some of the other functions in scipy.optimize.)
The scipy.optimize functions expect a slightly different function signature than the one we originally wrote above. We could write a wrapper to "translate", but let's just re-write the gauss2d function instead:
def gauss2d(xy, amp, x0, y0, a, b, c):
    x, y = xy
    inner = a * (x - x0)**2
    inner += 2 * b * (x - x0)**2 * (y - y0)**2
    inner += c * (y - y0)**2
    return amp * np.exp(-inner)
All we did was have the function expect the independent variables (x & y) as a single 2xN array.
Now we need to make an initial guess at what the Gaussian curve's parameters actually are. This is optional (the default is all ones, if I recall correctly), but you're likely to have problems converging if 1, 1 is not particularly close to the "true" center of the Gaussian curve. For that reason, we'll use the x and y values of our largest observed z-value as a starting point for the center. I'll leave the rest of the parameters as 1, but if you know that they're likely to consistently be significantly different, change them to something more reasonable.
Here's the full, stand-alone example:
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt

def main():
    x0, y0 = 0.3, 0.7
    amp, a, b, c = 1, 2, 3, 4
    true_params = [amp, x0, y0, a, b, c]
    xy, zobs = generate_example_data(10, true_params)
    x, y = xy

    # Start the search at the location of the largest observed value
    i = zobs.argmax()
    guess = [1, x[i], y[i], 1, 1, 1]
    pred_params, uncert_cov = opt.curve_fit(gauss2d, xy, zobs, p0=guess)

    zpred = gauss2d(xy, *pred_params)
    print('True parameters: ', true_params)
    print('Predicted params:', pred_params)
    print('Residual, RMS(obs - pred):', np.sqrt(np.mean((zobs - zpred)**2)))

    plot(xy, zobs, pred_params)
    plt.show()

def gauss2d(xy, amp, x0, y0, a, b, c):
    x, y = xy
    inner = a * (x - x0)**2
    inner += 2 * b * (x - x0)**2 * (y - y0)**2
    inner += c * (y - y0)**2
    return amp * np.exp(-inner)

def generate_example_data(num, params):
    np.random.seed(1977) # For consistency
    xy = np.random.random((2, num))
    zobs = gauss2d(xy, *params)
    return xy, zobs

def plot(xy, zobs, pred_params):
    x, y = xy
    # Evaluate the fitted surface on a regular grid for display
    yi, xi = np.mgrid[:1:30j, -.2:1.2:30j]
    xyi = np.vstack([xi.ravel(), yi.ravel()])

    zpred = gauss2d(xyi, *pred_params)
    zpred.shape = xi.shape

    fig, ax = plt.subplots()
    ax.scatter(x, y, c=zobs, s=200, vmin=zpred.min(), vmax=zpred.max())
    im = ax.imshow(zpred, extent=[xi.min(), xi.max(), yi.max(), yi.min()],
                   aspect='auto')
    fig.colorbar(im)
    ax.invert_yaxis()
    return fig

main()
In this case, we exactly(ish) recover our original "true" parameters.
True parameters: [1, 0.3, 0.7, 2, 3, 4]
Predicted params: [ 1. 0.3 0.7 2. 3. 4. ]
Residual, RMS(obs - pred): 1.01560615193e-16
As we'll see in a second, this won't always be the case...
Adding Noise
Let's add some noise to our observations. All I've done here is change the generate_example_data function:
def generate_example_data(num, params):
    np.random.seed(1977) # For consistency
    xy = np.random.random((2, num))
    noise = np.random.normal(0, 0.3, num)
    zobs = gauss2d(xy, *params) + noise
    return xy, zobs
However, the result looks quite different:
And as far as the parameters go:
True parameters: [1, 0.3, 0.7, 2, 3, 4]
Predicted params: [ 1.129 0.263 0.750 1.280 32.333 10.103 ]
Residual, RMS(obs - pred): 0.152444640098
The predicted center hasn't changed much, but the b and c parameters have changed quite a bit.
If we change the center of the function to somewhere slightly outside of our scatter of points:
x0, y0 = -0.3, 1.1
We'll wind up with complete nonsense as a result in the presence of noise! (It still works correctly without noise.)
True parameters: [1, -0.3, 1.1, 2, 3, 4]
Predicted params: [ 0.546 -0.939 0.857 -0.488 44.069 -4.136]
Residual, RMS(obs - pred): 0.235664449826
This is a common problem when fitting a function that decays to zero. Any noise in the "tails" can result in a very poor result. There are a number of strategies to deal with this. One of the easiest is to weight the inversion by the observed z-values. Here's an example for the 1D case (focusing on linearizing the problem): "How can I perform a least-squares fitting over multiple data sets fast?" If I have time later, I'll add an example of this for the 2D case.
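To make the weighting idea concrete for two dimensions, here is a minimal sketch for the simplified "round" Gaussian mentioned earlier (no x-y covariance, equal widths). It is not the code from the linked 1D example: the function name fit_round_gaussian_center is hypothetical, and the approach assumes every observed z is strictly positive so that the logarithm can be taken.

```python
import numpy as np

def fit_round_gaussian_center(x, y, z):
    """Weighted linear least-squares fit of an isotropic 2D Gaussian.

    Model: z = amp * exp(-a * ((x - x0)**2 + (y - y0)**2)), with all z > 0.
    Taking logs gives ln z = c0 + c1*x + c2*y + c3*(x**2 + y**2),
    which is linear in c0..c3.
    """
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    design = np.column_stack([np.ones_like(x), x, y, x**2 + y**2])
    # Weight each row by the observed intensity so that the faint,
    # noise-dominated tails have less influence on the solution
    w = z
    coeffs, *_ = np.linalg.lstsq(design * w[:, None], np.log(z) * w, rcond=None)
    c0, c1, c2, c3 = coeffs
    a = -c3
    x0, y0 = c1 / (2 * a), c2 / (2 * a)
    amp = np.exp(c0 + a * (x0**2 + y0**2))
    return amp, x0, y0, a

# Nine noise-free samples on a 3x3 grid, centre deliberately off-grid
gx, gy = np.meshgrid(np.arange(3.0), np.arange(3.0))
z_obs = 2.0 * np.exp(-0.5 * ((gx - 1.2)**2 + (gy - 0.8)**2))

amp, x0, y0, a = fit_round_gaussian_center(gx.ravel(), gy.ravel(), z_obs.ravel())
print(amp, x0, y0, a)
```

Since the problem is linear, there is no initial guess and no iteration, which suits the small 4 to 25 point patches described in the question. With noisy data the estimate could also serve as a starting point for curve_fit.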
So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety, 2)  # degree 2, matching the quadratic evaluated below
trendx = []
trendy = []
for a in range(datasetx[0], (datasetx[-1]+1)):
    trendx.append(a)
    trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the errorbars, but what I don't know is using this, how to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?
I think you can use the curve_fit function from scipy.optimize (see its documentation). A basic usage example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a*x**2 + b*x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
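A small sketch of that calculation, reusing the quadratic from the example above. The seeded random generator is an assumption added only to make the run repeatable.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x**2 + b * x + c

rng = np.random.default_rng(0)   # seeded only for repeatability
x = np.linspace(0, 4, 50)
yn = func(x, 5, 3, 4) + 0.2 * rng.normal(size=x.size)

popt, pcov = curve_fit(func, x, yn)

# Diagonal of pcov holds the variances; the square root gives the
# one-standard-deviation uncertainty of each coefficient
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

The off-diagonal entries of pcov describe how the coefficient estimates are correlated, which matters when propagating the errors to derived quantities.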
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a*x**2 + b*x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# standard deviation of the noise on each y value (must be positive)
y_sigma = 0.2 * y
popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
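One detail worth checking when sigma holds real measurement errors: by default curve_fit treats sigma as relative weights and rescales pcov using the fit residuals, while absolute_sigma=True keeps the values as absolute standard deviations. The sketch below compares the two; the seeded generator and the 20% error level are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, b, c):
    return a * x**2 + b * x + c

rng = np.random.default_rng(1)   # seeded only for repeatability
x = np.linspace(0, 4, 20)
y = func(x, 5, 3, 4)
y_sigma = 0.2 * y                # known standard deviation of each point
yn = y + y_sigma * rng.normal(size=x.size)

# Default: sigma acts as relative weights and pcov is rescaled by the residuals
_, pcov_rel = curve_fit(func, x, yn, sigma=y_sigma)
# absolute_sigma=True: sigma is used as given and pcov is not rescaled
_, pcov_abs = curve_fit(func, x, yn, sigma=y_sigma, absolute_sigma=True)

perr_rel = np.sqrt(np.diag(pcov_rel))
perr_abs = np.sqrt(np.diag(pcov_abs))
print("relative sigma:", perr_rel)
print("absolute sigma:", perr_abs)
```

The two sets of errors differ by a single common factor, the square root of the reduced chi-square of the fit.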
One more thing, about using numpy arrays. One of the main advantages of numpy is that you can avoid for loops, because operations on arrays apply elementwise. So the for loop in your example can also be written as follows:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Where I use arange instead of range as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)
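For a plain polynomial trendline, numpy's own polyfit can also return a covariance matrix through its cov argument, which may be enough without scipy. A minimal sketch on synthetic data (the data and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)   # seeded only for repeatability
x = np.linspace(0, 4, 50)
y = 5 * x**2 + 3 * x + 4 + 0.2 * rng.normal(size=x.size)

# cov=True returns the coefficients (highest power first) together
# with their covariance matrix
coeffs, cov = np.polyfit(x, y, 2, cov=True)
coeff_err = np.sqrt(np.diag(cov))

for c, e in zip(coeffs, coeff_err):
    print("%.3f +/- %.3f" % (c, e))
```

If per-point errors are available, they can be passed through the w argument of polyfit as 1/sigma.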
I have not been able to find any way of getting the errors in the coefficients that is built in to numpy or python. I have a simple tool that I wrote based on Section 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
    '''
    Purpose
    -------
    Perform least squares using the procedure outlined in 8.5 and 8.6 of Taylor, solving
    matrix equation X a = Y

    Examples
    --------
    >>> from numpy import matrix
    >>> xMat = matrix([[ 1, 5, 25],
    ...                [ 1, 7, 49],
    ...                [ 1, 9, 81],
    ...                [ 1, 11, 121]])
    >>> # matrix has rows of format [constant, x, x^2]
    >>> yMat = matrix([[142],
    ...                [168],
    ...                [211],
    ...                [251]])
    >>> a, varCoef, yRes = leastSquares(xMat, yMat)
    >>> # a is a column matrix, holding the three coefficients a, b, c, corresponding to
    >>> # the equation a + b*x + c*x^2

    Returns
    -------
    a: matrix
        best fit coefficients
    varCoef: matrix
        variance of derived coefficients
    yRes: matrix
        y-residuals of fit
    '''
    xMatSize = xMat.shape
    numMeas = xMatSize[0]
    numVars = xMatSize[1]

    # Normal equations: (X^T X) a = X^T Y
    xxMat = xMat.T * xMat
    xyMat = xMat.T * yMat
    xxMatI = xxMat.I

    aMat = xxMatI * xyMat
    yAvgMat = xMat * aMat
    yRes = yMat - yAvgMat

    # Residual variance with (measurements - parameters) degrees of freedom
    var = (yRes.T * yRes) / (numMeas - numVars)
    varCoef = xxMatI.diagonal() * var[0, 0]

    return aMat, varCoef, yRes
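The same procedure can be written with plain arrays instead of the matrix class. The sketch below is an assumed equivalent (the name least_squares_ndarray is hypothetical), run on the four points from the docstring example:

```python
import numpy as np

def least_squares_ndarray(x_mat, y_vec):
    # Same normal-equation procedure as above, using ndarrays and the @ operator
    x_mat = np.asarray(x_mat, dtype=float)
    y_vec = np.asarray(y_vec, dtype=float)
    num_meas, num_vars = x_mat.shape
    xx_inv = np.linalg.inv(x_mat.T @ x_mat)
    coeffs = xx_inv @ x_mat.T @ y_vec
    residuals = y_vec - x_mat @ coeffs
    # Residual variance with (measurements - parameters) degrees of freedom
    variance = residuals @ residuals / (num_meas - num_vars)
    return coeffs, np.diag(xx_inv) * variance, residuals

x = np.array([5.0, 7.0, 9.0, 11.0])
y = np.array([142.0, 168.0, 211.0, 251.0])
design = np.column_stack([np.ones_like(x), x, x**2])   # rows of [1, x, x^2]

coeffs, var_coef, res = least_squares_ndarray(design, y)
print("coefficients (constant, x, x^2):", coeffs)
print("standard deviations:", np.sqrt(var_coef))
```

As in the original function, the second return value is the variance of each coefficient, so take the square root to get a standard deviation.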