Multiple Regression Python - python

I really don't understand what's wrong with my (simple) code...
i just want to test a multiple linear regression (....!).
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn
n = 1000
X1 = linspace(2, 8.5, n)
X2 = linspace(-4, 2.9, n)
X3 = linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size = n, loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression()
#Train:
my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
And I get some stupid results, like the coefficients are [ 0.25127347 0.26673645 0.65717676] ...
I also tried the OLS way, but I still get non sense coef (slighty different but still stupid)
(It's work with a one variable linear regression, something like Y = 2*X + 5, I would get coef and intercept really close to real one)

Thanks all!
I didn't perform a linear reression for a while, and of course it's because X is not invertible (in R, it gives me 'nan').
So it wasn't a smart question...
Thanks again!

The fact that the coefficients do not at all resemble the "true" ones that you have set indicates that multicollinearity might be a problem. The issue with your code is that your X matrix is near-singular, which renders numerical results instable. As can be seen in #R.yan's graphs, your X1 and X2 are almost identical except for a linear shift. This is corroborated by the fact that your X matrix, which has 1000 rows and three columns, only has a rank of 2. See:
np.linalg.matrix_rank(X)
Out[26]: 2
Try the following instead:
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn
from sklearn.linear_model import LinearRegression
n = 1000
# adding noise to your data:
X1 = np.linspace(2, 8.5, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X2 = np.linspace(-4, 2.9, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X3 = np.linspace(-1, 16, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 1)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression(fit_intercept = True)
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
Coefficients:
[[ 1.99273588 3.20068392 -1.19688422]]
Constant:
[ 4.02296003]
Now, we get the right coefficients, and a matrix of full rank:
np.linalg.matrix_rank(X)
Out[32]: 3
Note that in linear regression, X needs to have a rank equal to the number of columns (or rows, if that is less). If it does not, this means there is multicollinearity, which renders numerical results for the inverse of X'X instable (depending on which algorithm is used). See this description for more information on multicollinearity.

I guess the code gives you the correct answer. I plot the predicted Y base on the coef_ and intercept_ from your regression and have the following graph.
import pandas as pd
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
n = 1000
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression()
plt.plot(Y, color='blue', label='Y')
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
plt.scatter(X.index.values,X['X1'], c='black')
plt.scatter(X.index.values,X['X2'], c='black')
plt.scatter(X.index.values,X['X3'], c='black')
Y_pred = my_reg.coef_[0][0]*X['X1'] + my_reg.coef_[0][1]*X['X2'] +my_reg.coef_[0][2]*X['X3'] + my_reg.intercept_
plt.plot(Y_pred, color="red", label='predict')
plt.legend()
Out[]: ('Coefficients: \n', array([[ 3.13842691e+12, 1.01316187e+13, -5.31223199e+12]]))
('Constant: \n', array([ 2.89373889e+13]))

Related

Interpolate between linear and nonlinear values

I have been able to interpolate values successfully from linear values of x to sine-like values of y.
However - I am struggling to interpolate the other way - from nonlinear values of y to linear values of x.
The below is a toy example
import matplotlib.pylab as plt
from scipy import interpolate
#create 100 x values
x = np.linspace(-np.pi, np.pi, 100)
#create 100 values of y where y= sin(x)
y=np.sin(x)
#learn function to map y from x
f = interpolate.interp1d(x, y)
With new values of linear x
xnew = np.array([-1,1])
I get correctly interpolated values of nonlinear y
ynew = f(xnew)
print(ynew)
array([-0.84114583, 0.84114583])
The problem comes when I try and interpolate values of x from y.
I create a new function, the reverse of f:
f2 = interpolate.interp1d(y,x,kind='cubic')
I put in values of y that I successfully interpolated before
ynew=np.array([-0.84114583, 0.84114583])
I am expecting to get the original values of x [-1, 1]
But I get:
array([-1.57328791, 1.57328791])
I have tried putting in other values for the 'kind' parameter with no luck and am not sure if I have got the wrong approach here. Thanks for your help
I guess the problem raises from the fact, that x is not a function of y, since for an arbitrary y value there may be more than one x value found.
Take a look at a truncated range of data.
When x ranges from 0 to np.pi/2, then for every y value there is a unique x value.
In this case the snippet below works as expected.
>>> import numpy as np
>>> from scipy import interpolate
>>> x = np.linspace(0, np.pi / 2, 100)
>>> y = np.sin(x)
>>> f = interpolate.interp1d(x, y)
>>> f([0, 0.1, 0.3, 0.5])
array([0. , 0.09983071, 0.29551713, 0.47941047])
>>> f2 = interpolate.interp1d(y, x)
>>> f2([0, 0.09983071, 0.29551713, 0.47941047])
array([0. , 0.1 , 0.3 , 0.50000001])
Maxim provided the reason for this behavior. This interpolation is a class designed to work for functions. In your case, y=arcsin(x) is only in a limited interval a function. This leads to interesting phenomena in the interpolation routine that interpolates to the nearest y-value which in the case of the arcsin() function is not necessarily the next value in the x-y curve but maybe several periods away. An illustration:
import numpy as np
import matplotlib.pylab as plt
from scipy import interpolate
xmin=-np.pi
xmax=np.pi
fig, axes = plt.subplots(3, 3, figsize=(15, 10))
for i, fac in enumerate([2, 1, 0.5]):
x = np.linspace(xmin * fac, xmax*fac, 100)
y=np.sin(x)
#x->y
f = interpolate.interp1d(x, y)
x_fit = np.linspace(xmin*fac, xmax*fac, 1000)
y_fit = f(x_fit)
axes[i][0].plot(x_fit, y_fit)
axes[i][0].set_ylabel(f"sin period {fac}")
if not i:
axes[i][0].set_title(label="interpolation x->y")
#y->x
f2 = interpolate.interp1d(y, x)
y2_fit = np.linspace(.99 * min(y), .99 * max(y), 1000)
x2_fit = f2(y2_fit)
axes[i][1].plot(x2_fit, y2_fit)
if not i:
axes[i][1].set_title(label="interpolation y->x")
#y->x with cubic interpolation
f3 = interpolate.interp1d(y, x, kind="cubic")
y3_fit = np.linspace(.99 * min(y), .99 * max(y), 1000)
x3_fit = f3(y3_fit)
axes[i][2].plot(x3_fit, y3_fit)
if not i:
axes[i][2].set_title(label="cubic interpolation y->x")
plt.show()
As you can see, the interpolation works along the ordered list of y-values (as you instructed it to), and this works particularly badly with cubic interpolation.

Linear regression ODR fails

Following the recommendations in this answer I have used several combination of values for beta0, and as shown here, the values from polyfit.
This example is UPDATED in order to show the effect of relative scales of values of X versus Y (X range is 0.1 to 100 times Y):
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
for num in range(1, 5):
plt.subplot(2, 2, num)
plt.title('X range is %.1f times Y' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
It seems that no combination of beta0 helps... The only way to get polyfit and ODR fit similar is to swap X and Y, OR as shown here to increase the range of values of X with regard to Y, still not really a solution :)
=== EDIT ===
I do not want ODR to be the same as polyfit. I am showing polyfit just to emphasize that the ODR fit is wrong and it is not a problem of the data.
=== SOLUTION ===
thanks to #norok2 answer when Y range is 0.001 to 100000 times X:
from random import random, seed
from scipy import polyfit
from scipy import odr
import numpy as np
from matplotlib import pyplot as plt
seed(1)
X = np.array([random() / 1000 for i in range(1000)])
Y = np.array([i + random()**2 for i in range(1000)])
plt.figure(figsize=(12, 12))
for num in range(1, 10):
plt.subplot(3, 3, num)
plt.title('Y range is %.1f times X' % (float(100 / max(X))))
X *= 10
z = np.polyfit(X, Y, 1)
plt.plot(X, Y, 'k.', alpha=0.1)
# Fit using odr
def f(B, X):
return B[0]*X + B[1]
linear = odr.Model(f)
mydata = odr.RealData(X, Y,
sy=min(1/np.var(Y), 1/np.var(X))) # here the trick!! :)
myodr = odr.ODR(mydata, linear, beta0=z)
myodr.set_job(fit_type=0)
myoutput = myodr.run()
a, b = myoutput.beta
sa, sb = myoutput.sd_beta
xp = np.linspace(plt.xlim()[0], plt.xlim()[1], 1000)
yp = a*xp+b
plt.plot(xp, yp, label='ODR')
yp2 = z[0]*xp+z[1]
plt.plot(xp, yp2, label='polyfit')
plt.legend()
plt.ylim(-1000, 2000)
plt.show()
The key difference between polyfit() and the Orthogonal Distance Regression (ODR) fit is that polyfit works under the assumption that the error on x is negligible. If this assumption is violated, like it is in your data, you cannot expect the two methods to produce similar results.
In particular, ODR() is very sensitive to the errors you specify.
If you do not specify any error/weighting, it will assign a value of 1 for both x and y, meaning that any scale difference between x and y will affect the results (the so-called numerical conditioning).
On the contrary, polyfit(), before computing the fit, applies some sort of pre-whitening to the data (see around line 577 of its source code) for better numerical conditioning.
Therefore, if you want ODR() to match polyfit(), you could simply fine-tune the error on Y to change your numerical conditioning.
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
mydata = odr.RealData(X, Y)
# equivalent to: odr.RealData(X, Y, sx=1, sy=1)
to:
mydata = odr.RealData(X, Y, sx=1, sy=1/np.var(Y))
(EDIT: note there was a typo on the line above)
I tested that this works for any numerical conditioning between 1e-10 and 1e10 of your Y (it is / 10. or 1e-1 in your example).
Note that this would only make sense for well-conditioned fits.
I cannot format source code in a comment, and so place it here. This code uses ODR to calculate fit statistics, note the line that has "parameter order for odr" such that I use a wrapper function for the ODR call to my "actual" function.
from scipy.optimize import curve_fit
import numpy as np
import scipy.odr
import scipy.stats
x = np.array([5.357, 5.797, 5.936, 6.161, 6.697, 6.731, 6.775, 8.442, 9.861])
y = np.array([0.376, 0.874, 1.049, 1.327, 2.054, 2.077, 2.138, 4.744, 7.104])
def f(x,b0,b1):
return b0 + (b1 * x)
def f_wrapper_for_odr(beta, x): # parameter order for odr
return f(x, *beta)
parameters, cov= curve_fit(f, x, y)
model = scipy.odr.odrpack.Model(f_wrapper_for_odr)
data = scipy.odr.odrpack.Data(x,y)
myodr = scipy.odr.odrpack.ODR(data, model, beta0=parameters, maxit=0)
myodr.set_job(fit_type=2)
parameterStatistics = myodr.run()
df_e = len(x) - len(parameters) # degrees of freedom, error
cov_beta = parameterStatistics.cov_beta # parameter covariance matrix from ODR
sd_beta = parameterStatistics.sd_beta * parameterStatistics.sd_beta
ci = []
t_df = scipy.stats.t.ppf(0.975, df_e)
ci = []
for i in range(len(parameters)):
ci.append([parameters[i] - t_df * parameterStatistics.sd_beta[i], parameters[i] + t_df * parameterStatistics.sd_beta[i]])
tstat_beta = parameters / parameterStatistics.sd_beta # coeff t-statistics
pstat_beta = (1.0 - scipy.stats.t.cdf(np.abs(tstat_beta), df_e)) * 2.0 # coef. p-values
for i in range(len(parameters)):
print('parameter:', parameters[i])
print(' conf interval:', ci[i][0], ci[i][1])
print(' tstat:', tstat_beta[i])
print(' pstat:', pstat_beta[i])
print()

Asymmetric error bars in Scipy's odrpack

I am using Scipy's odrpack to fit a linear function to some data that has uncertainties in both the x and y dimensions. Each data point has it's own uncertainty that is asymmetric.
I can fit a function using symmetric uncertainties, but this is not a true representation of my data.
How can I perform the fit with this in mind?
This is my code so far. It receives input data as a command line argument, and the uncertainties i'm using are just random numbers at the moment. (also, two fits are happening, one for positive data points another for the negative. The reasons are unrelated to this question)
import sys
import numpy as np
import scipy.odr.odrpack as odrpack
def f(B, x):
return B[0]*x + B[1]
xdata = sys.argv[1].split(',')
xdata = [float(i) for i in xdata]
xdata = np.array(xdata)
#find indices of +/- data
zero_ind = np.where(xdata >= 0)[0][0]
x_p = xdata[zero_ind:]
x_m = xdata[:zero_ind+1]
ydata = sys.argv[2].split(',')
ydata = [float(i) for i in ydata]
ydata = np.array(ydata)
y_p = ydata[zero_ind:]
y_m = ydata[:zero_ind+1]
sx_m = np.random.random(len(x_m))
sx_p = np.random.random(len(x_p))
sy_m = np.random.random(len(y_m))
sy_p = np.random.random(len(y_p))
linear = odrpack.Model(f)
data_p = odrpack.RealData(x_p, y_p, sx=sx_p, sy=sy_p)
odr_p = odrpack.ODR(data_p, linear, beta0=[1.,2.])
out_p = odr_p.run()
data_m = odrpack.RealData(x_m, y_m, sx=sx_m, sy=sy_m)
odr_m = odrpack.ODR(data_m, linear, beta0=[1.,2.])
out_m = odr_m.run()
Thanks!
I will just give you solution with random data,I could not bother to import your data
import numpy as np
import scipy.odr.odrpack as odrpack
np.random.seed(1)
N = 10
x = np.linspace(0,5,N)*(-1)
y = 2*x - 1 + np.random.random(N)
sx = np.random.random(N)
sy = np.random.random(N)
def f(B, x):
return B[0]*x + B[1]
linear = odrpack.Model(f)
# mydata = odrpack.Data(x, y, wd=1./np.power(sx,2), we=1./np.power(sy,2))
mydata = odrpack.RealData(x, y, sx=sx, sy=sy)
myodr = odrpack.ODR(mydata, linear, beta0=[1., 2.])
myoutput = myodr.run()
myoutput.pprint()
Than we got
Beta: [ 1.92743947 -0.94409236]
Beta Std Error: [ 0.03117086 0.11273067]
Beta Covariance: [[ 0.02047196 0.06690713]
[ 0.06690713 0.26776027]]
Residual Variance: 0.04746112419196648
Inverse Condition #: 0.10277763521624257
Reason(s) for Halting:
Sum of squares convergence

Python curve_fit with multiple independent variables

Python's curve_fit calculates the best-fit parameters for a function with a single independent variable, but is there a way, using curve_fit or something else, to fit for a function with multiple independent variables? For example:
def func(x, y, a, b, c):
return log(a) + b*log(x) + c*log(y)
where x and y are the independent variable and we would like to fit for a, b, and c.
You can pass curve_fit a multi-dimensional array for the independent variables, but then your func must accept the same thing. For example, calling this array X and unpacking it to x, y for clarity:
import numpy as np
from scipy.optimize import curve_fit
def func(X, a, b, c):
x,y = X
return np.log(a) + b*np.log(x) + c*np.log(y)
# some artificially noisy data to fit
x = np.linspace(0.1,1.1,101)
y = np.linspace(1.,2., 101)
a, b, c = 10., 4., 6.
z = func((x,y), a, b, c) * 1 + np.random.random(101) / 100
# initial guesses for a,b,c:
p0 = 8., 2., 7.
print(curve_fit(func, (x,y), z, p0))
Gives the fit:
(array([ 9.99933937, 3.99710083, 6.00875164]), array([[ 1.75295644e-03, 9.34724308e-05, -2.90150983e-04],
[ 9.34724308e-05, 5.09079478e-06, -1.53939905e-05],
[ -2.90150983e-04, -1.53939905e-05, 4.84935731e-05]]))
optimizing a function with multiple input dimensions and a variable number of parameters
This example shows how to fit a polynomial with a two dimensional input (R^2 -> R) by an increasing number of coefficients. The design is very flexible so that the callable f from curve_fit is defined once for any number of non-keyword arguments.
minimal reproducible example
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def poly2d(xy, *coefficients):
x = xy[:, 0]
y = xy[:, 1]
proj = x + y
res = 0
for order, coef in enumerate(coefficients):
res += coef * proj ** order
return res
nx = 31
ny = 21
range_x = [-1.5, 1.5]
range_y = [-1, 1]
target_coefficients = (3, 0, -19, 7)
xs = np.linspace(*range_x, nx)
ys = np.linspace(*range_y, ny)
im_x, im_y = np.meshgrid(xs, ys)
xdata = np.c_[im_x.flatten(), im_y.flatten()]
im_target = poly2d(xdata, *target_coefficients).reshape(ny, nx)
fig, axs = plt.subplots(2, 3, figsize=(29.7, 21))
axs = axs.flatten()
ax = axs[0]
ax.set_title('Unknown polynomial P(x+y)\n[secret coefficients: ' + str(target_coefficients) + ']')
sm = ax.imshow(
im_target,
cmap = plt.get_cmap('coolwarm'),
origin='lower'
)
fig.colorbar(sm, ax=ax)
for order in range(5):
ydata=im_target.flatten()
popt, pcov = curve_fit(poly2d, xdata=xdata, ydata=ydata, p0=[0]*(order+1) )
im_fit = poly2d(xdata, *popt).reshape(ny, nx)
ax = axs[1+order]
title = 'Fit O({:d}):'.format(order)
for o, p in enumerate(popt):
if o%2 == 0:
title += '\n'
if o == 0:
title += ' {:=-{w}.1f} (x+y)^{:d}'.format(p, o, w=int(np.log10(max(abs(p), 1))) + 5)
else:
title += ' {:=+{w}.1f} (x+y)^{:d}'.format(p, o, w=int(np.log10(max(abs(p), 1))) + 5)
title += '\nrms: {:.1f}'.format( np.mean((im_fit-im_target)**2)**.5 )
ax.set_title(title)
sm = ax.imshow(
im_fit,
cmap = plt.get_cmap('coolwarm'),
origin='lower'
)
fig.colorbar(sm, ax=ax)
for ax in axs.flatten():
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.show()
P.S. The concept of this answer is identical to my other answer here, but the code example is way more clear. At the time given, I will delete the other answer.
Fitting to an unknown numer of parameters
In this example, we try to reproduce some measured data measData.
In this example measData is generated by the function measuredData(x, a=.2, b=-2, c=-.8, d=.1). I practice, we might have measured measData in a way - so we have no idea, how it is described mathematically. Hence the fit.
We fit by a polynomial, which is described by the function polynomFit(inp, *args). As we want to try out different orders of polynomials, it is important to be flexible in the number of input parameters.
The independent variables (x and y in your case) are encoded in the 'columns'/second dimension of inp.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def measuredData(inp, a=.2, b=-2, c=-.8, d=.1):
x=inp[:,0]
y=inp[:,1]
return a+b*x+c*x**2+d*x**3 +y
def polynomFit(inp, *args):
x=inp[:,0]
y=inp[:,1]
res=0
for order in range(len(args)):
print(14,order,args[order],x)
res+=args[order] * x**order
return res +y
inpData=np.linspace(0,10,20).reshape(-1,2)
inpDataStr=['({:.1f},{:.1f})'.format(a,b) for a,b in inpData]
measData=measuredData(inpData)
fig, ax = plt.subplots()
ax.plot(np.arange(inpData.shape[0]), measData, label='measuered', marker='o', linestyle='none' )
for order in range(5):
print(27,inpData)
print(28,measData)
popt, pcov = curve_fit(polynomFit, xdata=inpData, ydata=measData, p0=[0]*(order+1) )
fitData=polynomFit(inpData,*popt)
ax.plot(np.arange(inpData.shape[0]), fitData, label='polyn. fit, order '+str(order), linestyle='--' )
ax.legend( loc='upper left', bbox_to_anchor=(1.05, 1))
print(order, popt)
ax.set_xticklabels(inpDataStr, rotation=90)
Result:
Yes. We can pass multiple variables for curve_fit. I have written a piece of code:
import numpy as np
x = np.random.randn(2,100)
w = np.array([1.5,0.5]).reshape(1,2)
esp = np.random.randn(1,100)
y = np.dot(w,x)+esp
y = y.reshape(100,)
In the above code I have generated x a 2D data set in shape of (2,100) i.e, there are two variables with 100 data points. I have fit the dependent variable y with independent variables x with some noise.
def model_func(x,w1,w2,b):
w = np.array([w1,w2]).reshape(1,2)
b = np.array([b]).reshape(1,1)
y_p = np.dot(w,x)+b
return y_p.reshape(100,)
We have defined a model function that establishes relation between y & x.
Note: The shape of output of the model function or predicted y should be (length of x,)
popt, pcov = curve_fit(model_func,x,y)
The popt is an 1D numpy array containing predicted parameters. In our case there are 3 parameters.
Yes, there is: simply give curve_fit a multi-dimensional array for xData.

Python - calculating trendlines with errors

So I've got some data stored as two lists, and plotted them using
plot(datasetx, datasety)
Then I set a trendline
trend = polyfit(datasetx, datasety)
trendx = []
trendy = []
for a in range(datasetx[0], (datasetx[-1]+1)):
trendx.append(a)
trendy.append(trend[0]*a**2 + trend[1]*a + trend[2])
plot(trendx, trendy)
But I have a third list of data, which is the error in the original datasety. I'm fine with plotting the errorbars, but what I don't know is using this, how to find the error in the coefficients of the polynomial trendline.
So say my trendline came out to be 5x^2 + 3x + 4 = y, there needs to be some sort of error on the 5, 3 and 4 values.
Is there a tool using NumPy that will calculate this for me?
I think you can use the function curve_fit of scipy.optimize (documentation). A basic example of the usage:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,50)
y = func(x, 5, 3, 4)
yn = y + 0.2*np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn)
Following the documentation, pcov gives:
The estimated covariance of popt. The diagonals provide the variance
of the parameter estimate.
So in this way you can calculate an error estimate on the coefficients. To have the standard deviation you can take the square root of the variance.
Now you have an error on the coefficients, but it is only based on the deviation between the ydata and the fit. In case you also want to account for an error on the ydata itself, the curve_fit function provides the sigma argument:
sigma : None or N-length sequence
If not None, it represents the standard-deviation of ydata. This
vector, if given, will be used as weights in the least-squares
problem.
A complete example:
import numpy as np
from scipy.optimize import curve_fit
def func(x, a, b, c):
return a*x**2 + b*x + c
x = np.linspace(0,4,20)
y = func(x, 5, 3, 4)
# generate noisy ydata
yn = y + 0.2 * y * np.random.normal(size=len(x))
# generate error on ydata
y_sigma = 0.2 * y * np.random.normal(size=len(x))
popt, pcov = curve_fit(func, x, yn, sigma = y_sigma)
# plot
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(x, yn, yerr = y_sigma, fmt = 'o')
ax.plot(x, np.polyval(popt, x), '-')
ax.text(0.5, 100, r"a = {0:.3f} +/- {1:.3f}".format(popt[0], pcov[0,0]**0.5))
ax.text(0.5, 90, r"b = {0:.3f} +/- {1:.3f}".format(popt[1], pcov[1,1]**0.5))
ax.text(0.5, 80, r"c = {0:.3f} +/- {1:.3f}".format(popt[2], pcov[2,2]**0.5))
ax.grid()
plt.show()
Then something else, about using numpy arrays. One of the main advantages of using numpy is that you can avoid for loops because operations on arrays apply elementwise. So the for-loop in your example can also be done as following:
trendx = arange(datasetx[0], (datasetx[-1]+1))
trendy = trend[0]*trendx**2 + trend[1]*trendx + trend[2]
Where I use arange instead of range as it returns a numpy array instead of a list.
In this case you can also use the numpy function polyval:
trendy = polyval(trend, trendx)
I have not been able to find any way of getting the errors in the coefficients that is built in to numpy or python. I have a simple tool that I wrote based on Section 8.5 and 8.6 of John Taylor's An Introduction to Error Analysis. Maybe this will be sufficient for your task (note the default return is the variance, not the standard deviation). You can get large errors (as in the provided example) because of significant covariance.
def leastSquares(xMat, yMat):
'''
Purpose
-------
Perform least squares using the procedure outlined in 8.5 and 8.6 of Taylor, solving
matrix equation X a = Y
Examples
--------
>>> from scipy import matrix
>>> xMat = matrix([[ 1, 5, 25],
[ 1, 7, 49],
[ 1, 9, 81],
[ 1, 11, 121]])
>>> # matrix has rows of format [constant, x, x^2]
>>> yMat = matrix([[142],
[168],
[211],
[251]])
>>> a, varCoef, yRes = leastSquares(xMat, yMat)
>>> # a is a column matrix, holding the three coefficients a, b, c, corresponding to
>>> # the equation a + b*x + c*x^2
Returns
-------
a: matrix
best fit coefficients
varCoef: matrix
variance of derived coefficents
yRes: matrix
y-residuals of fit
'''
xMatSize = xMat.shape
numMeas = xMatSize[0]
numVars = xMatSize[1]
xxMat = xMat.T * xMat
xyMat = xMat.T * yMat
xxMatI = xxMat.I
aMat = xxMatI * xyMat
yAvgMat = xMat * aMat
yRes = yMat - yAvgMat
var = (yRes.T * yRes) / (numMeas - numVars)
varCoef = xxMatI.diagonal() * var[0, 0]
return aMat, varCoef, yRes

Categories