The problem here is fitting the Monod equations to experimental data. The model of bacterial growth and degradation of organic carbon looks like this:
dX/dt = (u * S * X) / (K + S)
dS/dt = (-1/Y) * (u * S * X) / (K + S)
These equations are solved using scipy's odeint function. The results of the integration are stored in two vectors, one for growth and another for degradation. The next step is to fit this model to the experimentally observed data and estimate the model parameters u, K and Y. Once the code is run, the following error is produced:
File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\optimize\minpack.py", line 392, in leastsq
raise TypeError('Improper input: N=%s must not exceed M=%s' % (n, m))
TypeError: Improper input: N=3 must not exceed M=2
For convenience, the curve-fitting part is commented out so that a plot of the expected result can be generated. Below is the code sample:
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint
from scipy.optimize import curve_fit
"""Experimental data!"""
t_exp = np.array([0, 8, 24, 32, 48, 96, 168])
S_exp = np.array([5.5, 4.7, 3.7, 2.5, 1.5, 0.7, 0.5])
X_exp = np.array([10000, 17000, 30000, 40000, 60000, 76000, 80000])
"Model of the microbial growth and the TOC degradation"
# SETTING UP THE MODEL
def f(t, u, K, Y):
'Function that returns mutually dependent variables X and S'
def growth(x, t):
X = x[0]
S = x[1]
"Now differential equations are defined!"
dXdt = (u * S * X )/(K + S)
dSdt = ((-1/Y) * u * S * X )/(K + S)
return [dXdt, dSdt]
# INTEGRATING THE DIFFERENTIAL EQUATIONS
"initial Conditions"
init = [10000, 5]
results = odeint(growth, init, t)
"Taking out desired column vectors from results array"
return results[:,0], results[:,1]
# CURVE FITTING AND PARAMETER ESTIMATION
"""k, kcov = curve_fit(f, t_exp, [X_exp, S_exp], p0=(1, 2, 2))
u = k[0]
K = k[1]
Y = k[2]"""
# RESULTS OF THE MODEL WITH THE ESTIMATED MODEL PARAMETERS
t_mod = np.linspace(0, 168, 100)
compute = f(t_mod, 0.8, 75, 13700)  # these fit quite well, but were estimated manually
X_mod = compute[0]
S_mod = compute[1]
# PLOT OF THE MODEL AND THE OBSERVED DATA
fig = plt.figure()
ax1 = fig.add_subplot(111)
ax1.plot(t_exp, X_exp, "yo")
ax1.plot(t_mod, X_mod, "g--", linewidth=3)
ax1.set_ylabel("X")
ax2 = ax1.twinx()
ax2.plot(t_exp, S_exp, "mo")
ax2.plot(t_mod, S_mod, "r--", linewidth=3)
ax2.set_ylabel("S", color="r")
for tl in ax2.get_yticklabels():
    tl.set_color("r")
plt.show()
Any advice on how to deal with this problem and proceed further would be highly appreciated. Thanks in advance.
The result of f() needs to have the same shape as the experimental data you feed into curve_fit as the third parameter. In the last line of f() you take only the t = 0 value of the solution of both ODEs and return that, but you should return the complete solution. When fitting several sets of data at once with curve_fit, just concatenate them (stack horizontally), i.e.
def f(t, u, K, Y):
    .....
    return np.hstack((results[:, 0], results[:, 1]))
and call curve_fit like
k, kcov = curve_fit(f, t_exp, np.hstack([X_exp, S_exp]), p0=(1, 2, 2))
You will have to adapt the plotting part of your script, too:
compute = f(t_mod, u, K, Y)
compute = compute.reshape((2,-1))
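Putting it all together, the fitting and plotting steps would then look like this (a sketch using the arrays defined in the question; here the parameter estimates come out of curve_fit instead of being set manually):

# Fit the stacked data and unpack the parameter estimates
k, kcov = curve_fit(f, t_exp, np.hstack([X_exp, S_exp]), p0=(1, 2, 2))
u, K, Y = k

# Evaluate the fitted model on a dense grid and split the stacked
# result back into X and S for the two plot axes
t_mod = np.linspace(0, 168, 100)
X_mod, S_mod = f(t_mod, u, K, Y).reshape((2, -1))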
Following the recommendations in this answer, I have used several combinations of values for beta0 and, as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of the relative scales of the values of X versus Y (the X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
for num in range(1, 5):
    plt.subplot(2, 2, num)
    plt.title('X range is %.1f times Y' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0] * X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a * xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0] * xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()
It seems that no combination of beta0 helps... The only way to get the polyfit and ODR fits to be similar is to swap X and Y, OR, as shown here, to increase the range of values of X with regard to Y, which is still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and it is not a problem of the data.
=== SOLUTION ===
Thanks to norok2's answer, when the Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
    plt.subplot(3, 3, num)
    plt.title('Y range is %.1f times X' % (float(100 / max(X))))
    X *= 10
    z = np.polyfit(X, Y, 1)
    plt.plot(X, Y, 'k.', alpha=0.1)

    # Fit using odr
    def f(B, X):
        return B[0] * X + B[1]

    linear = odr.Model(f)
    mydata = odr.RealData(X, Y,
                          sy=min(1/np.var(Y), 1/np.var(X)))  # here the trick!! :)
    myodr = odr.ODR(mydata, linear, beta0=z)
    myodr.set_job(fit_type=0)
    myoutput = myodr.run()
    a, b = myoutput.beta
    sa, sb = myoutput.sd_beta
    xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
    yp = a * xp + b
    plt.plot(xp, yp, label='ODR')
    yp2 = z[0] * xp + z[1]
    plt.plot(xp, yp2, label='polyfit')
    plt.legend()
    plt.ylim(-1000, 2000)
plt.show()
The key difference between polyfit() and the orthogonal distance regression (ODR) fit is that polyfit works under the assumption that the error in x is negligible. If this assumption is violated, as it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
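For reference, the relevant part of polyfit's source can be paraphrased roughly as follows (a sketch, not a verbatim copy; the toy x, y and deg are mine):

import numpy as np

x = np.array([0.001, 0.002, 0.003])
y = np.array([1.0, 2.0, 3.0])
deg = 1

# Columns of the Vandermonde matrix are rescaled to unit norm before
# the least-squares solve, and the scaling is undone afterwards.
lhs = np.vander(x, deg + 1)
scale = np.sqrt((lhs * lhs).sum(axis=0))
lhs /= scale  # the pre-whitening step
c, resids, rank, s = np.linalg.lstsq(lhs, y, rcond=None)
c = (c.T / scale).T  # undo the scaling on the coefficients
print(np.isclose(c, np.polyfit(x, y, deg)).all())  # True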
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to fix your numerical conditioning. You can change:
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
(EDIT: note there was a typo on the line above)
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
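Another way to look at the same conditioning problem (my sketch, not part of the answer above) is to rescale both variables to unit variance before fitting and transform the coefficients back afterwards; for a linear model y = a*x + b the back-transformation is a = a_s * y_std / x_std and b = b_s * y_std:

import numpy as np
from scipy import odr

def odr_linear_rescaled(X, Y):
    # Hypothetical helper, shown only to illustrate the conditioning
    # argument; fits Y = a*X + b after rescaling both variables.
    x_std, y_std = X.std(), Y.std()
    Xs, Ys = X / x_std, Y / y_std  # both now on comparable scales

    linear = odr.Model(lambda B, x: B[0] * x + B[1])
    out = odr.ODR(odr.RealData(Xs, Ys), linear, beta0=[1.0, 0.0]).run()

    a_s, b_s = out.beta
    # Undo the scaling: y/y_std = a_s * (x/x_std) + b_s
    return a_s * y_std / x_std, b_s * y_std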
I cannot format source code in a comment, so I place it here. This code uses ODR to calculate the fit statistics; note the line with the comment "parameter order for odr", where I use a wrapper function for the ODR call to my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats
x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])
def f(x, b0, b1):
    return b0 + (b1 * x)

def f_wrapper_for_odr(beta, x):  # parameter order for odr
    return f(x, *beta)
parameters, cov = curve_fit(f, x, y)
model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x, y)
# fit_type=2 selects ordinary least squares; maxit=0 means ODR does not
# iterate, so the statistics are evaluated at the curve_fit parameters
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta  # squared: parameter variances
t_df = scipy.stats.t.ppf(0.975, df_e)  # two-sided 95% critical value
ci = []
for i in range(len(parameters)):
    ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i],
               parameters[i] + t_df * parameterStatistics.sd_beta[i]])
tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0 # coef. p-values
for i in range(len(parameters)):
    print('parameter:', parameters[i])
    print('    conf interval:', ci[i][0], ci[i][1])
    print('    tstat:', tstat_beta[i])
    print('    pstat:', pstat_beta[i])
    print()
As an example, here's my code for fitting multiexponential decays with a monoexponential decay (this doesn't produce a good fit, but it still works as an example):
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit
lifetimes = [1e-9, 2e-9, 4e-9, 8e-9]
amplitudes = [1000, 1000, 1000, 1000]
background = 10
t = np.arange(1024)*1e-10
y = np.zeros(len(t))
for i in range(len(lifetimes)):
    y += amplitudes[i] * np.exp(-t / lifetimes[i])
y += np.random.poisson(background, len(y))
def fit_fun(t, amplitude, lifetime, background):
    return amplitude * np.exp(-t / lifetime) + background

def loss_fun(params, x, y, fit_fun, c=5):
    fit_y = fit_fun(x, *params)
    residuals = y - fit_y
    loss = np.sum(residuals**2)
    return loss
p0 = [1000, 6e-9, 10]
result = minimize(loss_fun, p0, args=(t, y, fit_fun))
params_minimize = result.x
minimize_y = fit_fun(t, *params_minimize)
params_fit, _ = curve_fit(fit_fun, t, y, p0)
fit_y = fit_fun(t, *params_fit)
plt.semilogy(t, y)
plt.semilogy(t, minimize_y)
plt.semilogy(t, fit_y)
plt.ylim([1, plt.ylim()[1]])
plt.show()
Here are the resulting fits (green was fitted with curve_fit and orange with minimize).
So, why doesn't using minimize work properly?
Also, the reason I'm doing this is that I want to implement a loss function other than least squares. If it isn't possible this way, how could I do that?
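As a side note on that last point (a sketch of mine, not part of the original post): for Poisson-distributed count data like this, one common alternative to least squares is the Poisson negative log-likelihood, which plugs straight into minimize in place of loss_fun, assuming the fit_fun, t, y and p0 defined above:

def poisson_nll(params, x, y, fit_fun):
    # Poisson negative log-likelihood, up to a params-independent constant
    mu = fit_fun(x, *params)
    mu = np.clip(mu, 1e-12, None)  # guard against log(0)
    return np.sum(mu - y * np.log(mu))

result = minimize(poisson_nll, p0, args=(t, y, fit_fun), method='Nelder-Mead')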
Suppose I have x and y vectors with a weight vector wgt. I can fit a cubic curve (y = a x^3 + b x^2 + c x + d) by using np.polyfit as follows:
y_fit = np.polyfit(x, y, deg=3, w=wgt)
Now, suppose I want to do another fit, but this time, I want the fit to pass through 0 (i.e. y = a x^3 + b x^2 + c x, d = 0), how can I specify a particular coefficient (i.e. d in this case) to be zero?
Thanks
You can try something like the following:
Import curve_fit from scipy, i.e.
from scipy.optimize import curve_fit
import matplotlib.pyplot as plt
import numpy as np
Define the curve fitting function. In your case,
def fit_func(x, a, b, c):
    # Curve fitting function
    return a * x**3 + b * x**2 + c * x  # d = 0 is implied
Perform the curve fitting,
# Curve fitting
params = curve_fit(fit_func, x, y)
[a, b, c] = params[0]
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = a * x_fit**3 + b * x_fit**2 + c * x_fit
Plot the results if you please,
plt.plot(x, y, '.r') # Data
plt.plot(x_fit, y_fit, 'k') # Fitted curve
This does not answer the question in the sense that it does not use numpy's polyfit to force the fit through the origin, but it solves the problem.
Hope someone finds it useful :)
You can use np.linalg.lstsq and construct your coefficient matrix manually. To start, I'll create the example data x and y, and the "exact fit" y0:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(100)
y0 = 0.07 * x ** 3 + 0.3 * x ** 2 + 1.1 * x
y = y0 + 1000 * np.random.randn(x.shape[0])
Now I'll create a full cubic polynomial 'training' or 'independent variable' matrix that includes the constant d column.
XX = np.vstack((x ** 3, x ** 2, x, np.ones_like(x))).T
Let's see what I get if I compute the fit with this dataset and compare it to polyfit:
p_all = np.linalg.lstsq(XX, y, rcond=None)[0]
pp = np.polyfit(x, y, 3)
print(np.isclose(pp, p_all).all())
# Returns True
Where I've used np.isclose because the two algorithms do produce very small differences.
You're probably thinking 'that's nice, but I still haven't answered the question'. From here, forcing the fit to have a zero offset is the same as dropping the np.ones column from the array:
p_no_offset = np.linalg.lstsq(XX[:, :-1], y, rcond=None)[0]  # use [0] to just grab the coefs
Ok, let's see what this fit looks like compared to our data:
y_fit = np.dot(p_no_offset, XX[:, :-1].T)
plt.plot(x, y0, 'k-', linewidth=3)
plt.plot(x, y_fit, 'y--', linewidth=2)
plt.plot(x, y, 'r.', ms=5)
This gives this figure,
WARNING: When using this method on data that does not actually pass through (x, y) = (0, 0), you will bias your estimates of the output solution coefficients (p), because lstsq will be trying to compensate for the fact that there is an offset in your data. It's a sort of 'square peg, round hole' problem.
Furthermore, you could also fit your data to the cubic term only by doing:
p_ = np.linalg.lstsq(XX[:, :1], y, rcond=None)[0]
Here again the warning above applies. If your data contains quadratic, linear or constant terms the estimate of the cubic coefficient will be biased. There can be times when - for numerical algorithms - this sort of thing is useful, but for statistical purposes my understanding is that it is important to include all of the lower terms. If tests turn out to show that the lower terms are not statistically different from zero that's fine, but for safety's sake you should probably leave them in when you estimate your cubic.
Best of luck!
Okay, so how would I approach writing code to optimize the constants a and b in a differential equation like dy/dt = a*y^2 + b, using curve_fit? I would be using odeint to solve the ODE and then curve_fit to optimize a and b.
If you could please provide input on this situation, I would greatly appreciate it!
You might be better served by looking at ODEs with Sympy. Scipy/Numpy are fundamentally numerical packages and aren't really set up to do algebraic/symbolic operations.
You definitely can do this:
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import curve_fit
def f(y, t, a, b):
    return a * y**2 + b

def y(t, a, b, y0):
    """
    Solution to the ODE y'(t) = f(y, t, a, b) with initial condition y(0) = y0
    """
    y = odeint(f, y0, t, args=(a, b))
    return y.ravel()
# Some random data to fit
data_t = np.sort(np.random.rand(200) * 10)
data_y = data_t**2 + np.random.rand(200)*10
popt, cov = curve_fit(y, data_t, data_y, [-1.2, 0.1, 0])
a_opt, b_opt, y0_opt = popt
print("a = %g" % a_opt)
print("b = %g" % b_opt)
print("y0 = %g" % y0_opt)
import matplotlib.pyplot as plt
t = np.linspace(0, 10, 2000)
plt.plot(data_t, data_y, '.',
         t, y(t, a_opt, b_opt, y0_opt), '-')
plt.gcf().set_size_inches(6, 4)
plt.savefig('out.png', dpi=96)
plt.show()
To address specifically this type of problem, I decided to write a wrapper package which unifies sympy and scipy. It's called symfit. Fitting to your ODE would then look like this:
import numpy as np
from symfit import variables, parameters, Fit, D, ODEModel

tdata = np.array([10, 26, 44, 70, 120])
ydata = 10e-4 * np.array([44, 34, 27, 20, 14])

y, t = variables('y, t')
a, b = parameters('a, b')

model_dict = {
    D(y, t): a * y**2 + b  # note ** for powers; ^ is bitwise XOR in Python
}
ode_model = ODEModel(model_dict, initial={t: 0.0, y: 0.0})

fit = Fit(ode_model, t=tdata, y=ydata)
fit_result = fit.execute()
As you can see from the way it is defined as a dict, fitting to systems of (first order) ODEs is no problem. Check out the docs for more!
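To read the estimated constants out of the result afterwards, something like this should work (based on symfit's FitResults API):

print(fit_result)              # fit summary
a_val = fit_result.value(a)    # estimated value of a
b_val = fit_result.value(b)    # estimated value of b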
So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety, 2)
trendx = []
trendy = []
for a in range(datasetx[0], datasetx[-1] + 1):
    trendx.append(a)
    trendy.append(trend[0] * a**2 + trend[1] * a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the error bars, but what I don't know is how to use this error to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?
I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * x**2 + b * x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
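For example, with the popt and pcov from the snippet above, the standard deviations of a, b and c are:

perr = np.sqrt(np.diag(pcov))  # standard deviation of each coefficient
print(perr)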
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return a * x**2 + b * x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata (taking abs so the values can act as standard deviations)
y_sigma = np.abs(0.2 * y * np.random.normal(size=len(x)))
popt, pcov = curve_fit(func, x, yn, sigma=y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr=y_sigma, fmt='o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
Then something else, about using numpy arrays. One of the main advantages of using numpy is that you can avoid for loops because operations on arrays are applied elementwise. So the for loop in your example can also be done as follows:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Where I use arange instead of range as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)
I have not been able to find any way of getting the errors in the coefficients that is built into numpy or python. I have a simple tool that I wrote based on Sections 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note that the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
    '''
    Purpose
    -------
    Perform least squares using the procedure outlined in 8.5 and 8.6 of
    Taylor, solving the matrix equation X a = Y

    Examples
    --------
    >>> from numpy import matrix
    >>> xMat = matrix([[ 1,  5,  25],
    ...                [ 1,  7,  49],
    ...                [ 1,  9,  81],
    ...                [ 1, 11, 121]])
    >>> # matrix has rows of format [constant, x, x^2]
    >>> yMat = matrix([[142],
    ...                [168],
    ...                [211],
    ...                [251]])
    >>> a, varCoef, yRes = leastSquares(xMat, yMat)
    >>> # a is a column matrix, holding the three coefficients a, b, c,
    >>> # corresponding to the equation a + b*x + c*x^2

    Returns
    -------
    a: matrix
        best-fit coefficients
    varCoef: matrix
        variance of the derived coefficients
    yRes: matrix
        y-residuals of the fit
    '''
    xMatSize = xMat.shape
    numMeas = xMatSize[0]
    numVars = xMatSize[1]

    xxMat = xMat.T * xMat
    xyMat = xMat.T * yMat
    xxMatI = xxMat.I
    aMat = xxMatI * xyMat                        # normal-equation solution
    yAvgMat = xMat * aMat                        # fitted y values
    yRes = yMat - yAvgMat                        # residuals
    var = (yRes.T * yRes) / (numMeas - numVars)  # residual variance
    varCoef = xxMatI.diagonal() * var[0, 0]      # coefficient variances
    return aMat, varCoef, yRes
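A short usage sketch based on the doctest above; taking the square root of varCoef gives the coefficient standard deviations mentioned earlier:

import numpy as np
from numpy import matrix

xMat = matrix([[1, 5, 25], [1, 7, 49], [1, 9, 81], [1, 11, 121]])
yMat = matrix([[142], [168], [211], [251]])

a, varCoef, yRes = leastSquares(xMat, yMat)
coefStd = np.sqrt(varCoef)  # standard deviations of the coefficients
print(a)
print(coefStd)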