I have a set of x and y data and I want to use exponential regression to find the curve that best fits that set of points, i.e.:
y = P1 + P2 exp(-P0 x)
I want to calculate the values of P0, P1 and P2.
I use the software "Igor Pro", which calculates the values for me, but I want a Python implementation. I used the curve_fit function, but the values that I get are nowhere near the ones calculated by Igor. Here are the sets of data that I have:
Set1:
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
Values calculated by Igor:
P1=376.91, P2=5393.9, P0=3.7776
Values calculated by curve_fit:
P1=702.45, P2=-13.33, P0=-2.6744
Set2:
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
Values calculated by Igor:
P1=321, P2=4848, P0=-1.94
Values calculated by curve_fit:
No optimal values found
I use curve_fit as follows:
import numpy as np
from scipy.optimize import curve_fit
popt, pcov = curve_fit(lambda t, a, b, c: a * np.exp(-b * t) + c, x, y)
where:
P1=c, P2=a and P0=b
Well, when comparing fit results, it is always important to include uncertainties in the fitted parameters. That is, when you say that the values from Igor (P1=376.91, P2=5393.9, P0=3.7776) and from curve_fit (P1=702.45, P2=-13.33, P0=-2.6744) are different, what is it that leads you to conclude that those values are actually different?
Of course, in everyday conversation, 376.91 and 702.45 are very different, mostly because simply stating a value to 2 decimal places implies accuracy at approximately that scale (the distance between New York and Tokyo is 10,850 km but is not really 1,084,702,431 cm -- that might be the distance between bus stops in the two cities). But when comparing fit results, that everyday knowledge cannot be assumed, and you have to include uncertainties. I don't know if Igor will give you those. scipy curve_fit can, but it requires some work to extract them -- a pity.
Allow me to recommend trying lmfit (disclaimer: I am an author). With that, you would set up and execute the fit like this:
import numpy as np
from lmfit import Model
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
# x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
# y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
# Define the function that we want to fit to the data
def func(x, offset, scale, decay):
    return offset + scale * np.exp(-decay * x)
model = Model(func)
params = model.make_params(offset=375, scale=5000, decay=4)
result = model.fit(y, params, x=x)
print(result.fit_report())
This would print out the result of
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 49
# data points = 9
# variables = 3
chi-square = 72.2604167
reduced chi-square = 12.0434028
Akaike info crit = 24.7474672
Bayesian info crit = 25.3391410
R-squared = 0.99362489
[[Variables]]
offset: 413.168769 +/- 17348030.9 (4198775.95%) (init = 375)
scale: 16689.6793 +/- 1.3337e+10 (79909638.11%) (init = 5000)
decay: 5.27555726 +/- 1016721.11 (19272297.84%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 1.000
C(offset, decay) = 1.000
C(offset, scale) = 1.000
indicating that the uncertainties in the parameter values are simply enormous and the correlations between all parameters are 1. This is because you have only 2 distinct x values, which makes it impossible to accurately determine 3 independent parameters.
And note that with an uncertainty of 17 million, the values for P1 (offset) of 413 and 702 do actually agree. The problem is not that Igor and curve_fit disagree on the best value; it is that neither can determine the value with any accuracy at all.
For your other dataset, the situation is a little better, with a result:
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 82
# data points = 9
# variables = 3
chi-square = 1118.19957
reduced chi-square = 186.366596
Akaike info crit = 49.4002551
Bayesian info crit = 49.9919289
R-squared = 0.98272310
[[Variables]]
offset: 320.876843 +/- 42.0154403 (13.09%) (init = 375)
scale: 4797.14487 +/- 2667.40083 (55.60%) (init = 5000)
decay: 1.93560164 +/- 0.47764470 (24.68%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 0.995
C(offset, decay) = 0.940
C(offset, scale) = 0.904
Here the correlations are still high, but the parameters are reasonably well determined. Also, note that the best-fit values here are much closer to those you got from Igor, and probably "within the uncertainty".
And this is why one always needs to include uncertainties with the best-fit values reported from a fit.
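For readers who stay with scipy, the "some work" needed to get uncertainties out of curve_fit amounts to taking the square root of the diagonal of the returned covariance matrix. A minimal sketch on Set 2, assuming starting values close to the Igor result (the p0 values are my own choice for illustration, not something from the question):

```python
import numpy as np
from scipy.optimize import curve_fit

# Set 2 from the question
x = np.array([1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58])
y = np.array([648, 618, 636, 485, 384, 639, 630, 583, 529], dtype=float)

def func(x, offset, scale, decay):
    return offset + scale * np.exp(-decay * x)

# p0 is an assumed starting point near the values reported by Igor
popt, pcov = curve_fit(func, x, y, p0=[320, 4800, 1.9])

# 1-sigma uncertainties: square root of the diagonal of the covariance matrix
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip(("offset", "scale", "decay"), popt, perr):
    print(f"{name}: {value:.4g} +/- {err:.3g}")
```

The off-diagonal elements of pcov carry the correlations between parameters, which is the other piece of information the fit report above shows.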
Set 1 :
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
One observes that there are only two different values of x: 1.06 and 0.91.
On the other hand, there are three parameters to optimise: P0, P1 and P2. This is too many.
In other words, infinitely many exponential curves can be found to fit the two clusters of points. The differences between the curves can be due to slight differences in the computation methods of non-linear regression, especially in the methods used to choose the initial values of the iterative process.
In this particular case a simple linear regression would be unambiguous.
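As a sketch of that remark (using numpy.polyfit, which is my choice of tool, not something named in the question), a degree-1 least-squares fit on Set 1 has a unique solution, and with only two distinct x values the line passes through the mean of each cluster:

```python
import numpy as np

# Set 1 from the question
x = np.array([1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91])
y = np.array([476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5])

# Degree-1 least-squares fit: y ~ slope * x + intercept
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)

# Compare the fitted line with the mean of each cluster of points
for xv in np.unique(x):
    print(xv, y[x == xv].mean(), slope * xv + intercept)
```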
By comparison, both Igor and curve_fit give an excellent fit: the points are very close to both curves. One understands that infinitely many other exponential functions would fit as well.
Set 2 :
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
The difficulty that you encounter might be due to the choice of the "guessed" initial values of the parameters, which are required to start the iterative process of nonlinear regression.
In order to check this hypothesis one can use a different method which doesn't need initial guessed values. The MathCad code and numerical calculus are shown below.
Don't be surprised if the values of the parameters that you get with your software are slightly different from the above values (a, b, c). The fitting criterion implicitly set in your software is probably different from the fitting criterion set in my software.
Blue curve: the regression method is a least mean square error fit with respect to a linear integral equation of which the exponential equation is a solution. Ref.: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
This non-standard method isn't iterative and doesn't require initial "guessed" values of parameters.
I am attempting to write a program that reads two sets of data from a .csv file into arrays, and then fits them to a piecewise function. What's most important to me is that these fits are done simultaneously because they have the same parameters. This piecewise function is my attempt to do so, though if you know a better way to fit them simultaneously I'd also greatly appreciate advice regarding that.
To avoid having to upload the csv files I've added the data directly into the arrays.
import numpy
import csv
import matplotlib
from scipy import optimize
xdata = [2.0, 10.0, 30.0, 50.0, 70.0, 90.0, 110.0, 130.0, 150.0, 250.0, 400.0, 1002.0, 1010.0, 1030.0, 1050.0, 1070.0, 1090.0, 1110.0, 1130.0, 1150.0, 1250.0, 1400.0]
ydata = [0.013833958803215633, 0.024273268442992078, 0.08792766000711709, 0.23477725658012044, 0.31997367288103884, 0.3822895295625711, 0.46037063893452784, 0.5531831477605121, 0.559757863748663, 0.6443036770720387, 0.7344601382896991, 2.6773979205076136e-09, 9.297289736857164e-10, 0.10915332214935693, 0.1345307163724643, 0.1230161681870127, 0.11286094974672768, 0.09186485171688986, 0.06609131137369342, 0.052616358869021135, 0.034629686697483314, 0.03993853791147095]
The first 11 points I want to fit to the function labeled 'SSdecay', and the second 11 points I want to fit to the function labeled 'SUdecay'. My attempt at doing this simultaneously was making the piecewise function labeled 'fitfunction'.
#defines functions to be used in fitting
#to fit the first half of data
def SSdecay(x, lam1, lam2, norm, xoff):
    return norm*(1 + lam2/(lam1 - lam2)*numpy.exp(-lam1*(x - xoff)) -
                 lam1/(lam1 - lam2)*numpy.exp(-lam2*(x - xoff)))
#to fit the second half of data
def SUdecay(x, lam1, lam2, norm, xoff):
    return norm*(lam1/(lam1 - lam2))*(-numpy.exp(-lam1*(x - xoff)) +
                                      numpy.exp(-lam2*(x - xoff)))
#piecewise function combining SS and SU functions to fit the whole data set
def fitfunction(x, lam1, lam2, norm, xoff):
    y = numpy.piecewise(x,[x < 1000, x >= 1000],[SSdecay(x, lam1, lam2, norm, xoff),SUdecay(x, lam1, lam2, norm, xoff)])
    return y
#fits the piecewise function with initial guesses for parameters
p0=[0.01,0.02,1,0]
popt, pcov = optimize.curve_fit(fitfunction, xdata, ydata, p0)
print(popt)
print(pcov)
After running this I get the error:
ValueError: NumPy boolean array indexing assignment cannot assign 22 input values to the 11 output values where the mask is true
It seems as though curve_fit does not like that I'm using a piecewise function but I am unsure why or if it is a fixable kind of problem.
Here are my results for separately fitting the two functions using the normalized data. It looks unlikely that these will work as a single piecewise equation, please see the plot image and source code below. I also have very different fitted parameters for the two equations:
SS parameters: [ 0.0110936, 0.09560932, 0.72929264, 6.82520026]
SU parameters: [ 3.46853883e-02, 9.54208972e-03, 1.99877873e-01, 1.00465563e+03]
import numpy
import matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
xdata = [2.0, 10.0, 30.0, 50.0, 70.0, 90.0, 110.0, 130.0, 150.0, 250.0, 400.0, 1002.0, 1010.0, 1030.0, 1050.0, 1070.0, 1090.0, 1110.0, 1130.0, 1150.0, 1250.0, 1400.0]
ydata = [0.013833958803215633, 0.024273268442992078, 0.08792766000711709, 0.23477725658012044, 0.31997367288103884, 0.3822895295625711, 0.46037063893452784, 0.5531831477605121, 0.559757863748663, 0.6443036770720387, 0.7344601382896991, 2.6773979205076136e-09, 9.297289736857164e-10, 0.10915332214935693, 0.1345307163724643, 0.1230161681870127, 0.11286094974672768, 0.09186485171688986, 0.06609131137369342, 0.052616358869021135, 0.034629686697483314, 0.03993853791147095]
#to fit the first half of data
def SSdecay(x, lam1, lam2, norm, xoff):
    return norm*(1 + lam2/(lam1 - lam2)*numpy.exp(-lam1*(x - xoff)) -
                 lam1/(lam1 - lam2)*numpy.exp(-lam2*(x - xoff)))
#to fit the second half of data
def SUdecay(x, lam1, lam2, norm, xoff):
    return norm*(lam1/(lam1 - lam2))*(-numpy.exp(-lam1*(x - xoff)) +
                                      numpy.exp(-lam2*(x - xoff)))
# some initial parameter values
initialParameters_ss = numpy.array([0.01, 0.02, 1.0, 0.0])
initialParameters_su = initialParameters_ss # same values for this example
# curve fit the equations individually to their respective data
ssParameters, pcov = curve_fit(SSdecay, xdata[:11], ydata[:11], initialParameters_ss)
suParameters, pcov = curve_fit(SUdecay, xdata[11:], ydata[11:], initialParameters_su)
# values for display of fitted function
lam1_ss, lam2_ss, norm_ss, xoff_ss = ssParameters
lam1_su, lam2_su, norm_su, xoff_su = suParameters
# for plotting the fitting results
y_fit_ss = SSdecay(xdata[:11], lam1_ss, lam2_ss, norm_ss, xoff_ss) # first data set, first equation
y_fit_su = SUdecay(xdata[11:], lam1_su, lam2_su, norm_su, xoff_su) # second data set, second equation
plt.plot(xdata, ydata, 'D') # plot the raw data as a scatterplot
plt.plot(xdata[:11], y_fit_ss) # plot the SS equation using the fitted parameters
plt.plot(xdata[11:], y_fit_su) # plot the SU equation using the fitted parameters
plt.show()
print('SS parameters:', ssParameters)
print('SU parameters:', suParameters)
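As for the ValueError in the question: numpy.piecewise expects a list of callables (or scalars) and applies each one only to the elements selected by its condition, whereas the question passes arrays that were already evaluated on all 22 points. A minimal sketch of a version that evaluates without that error, with the two decay functions repeated so the block runs on its own. Whether one shared parameter set can describe both halves is a separate matter, as the separate fits above suggest:

```python
import numpy as np

def SSdecay(x, lam1, lam2, norm, xoff):
    return norm*(1 + lam2/(lam1 - lam2)*np.exp(-lam1*(x - xoff)) -
                 lam1/(lam1 - lam2)*np.exp(-lam2*(x - xoff)))

def SUdecay(x, lam1, lam2, norm, xoff):
    return norm*(lam1/(lam1 - lam2))*(-np.exp(-lam1*(x - xoff)) +
                                      np.exp(-lam2*(x - xoff)))

def fitfunction(x, lam1, lam2, norm, xoff):
    x = np.asarray(x, dtype=float)
    # Pass callables: each is evaluated only where its condition is True
    return np.piecewise(
        x,
        [x < 1000, x >= 1000],
        [lambda t: SSdecay(t, lam1, lam2, norm, xoff),
         lambda t: SUdecay(t, lam1, lam2, norm, xoff)],
    )

xdata = np.array([2.0, 10.0, 30.0, 50.0, 70.0, 90.0, 110.0, 130.0, 150.0,
                  250.0, 400.0, 1002.0, 1010.0, 1030.0, 1050.0, 1070.0,
                  1090.0, 1110.0, 1130.0, 1150.0, 1250.0, 1400.0])

# Evaluate at the initial guess from the question: one value per input point
y0 = fitfunction(xdata, 0.01, 0.02, 1.0, 0.0)
print(y0.shape)
```

This function keeps the signature curve_fit expects, so it can be passed in place of the original fitfunction.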
In the framework of my bachelor's thesis, I need to evaluate my data with Python. Unfortunately there's no suitable script from my fellow students yet and I'm quite new to programming.
I have this data set and I'm trying to fit it with a Gaussian by using scipy.optimize.curve_fit. Since there are a lot of unusable counts, especially at the end of the axis, I'd like to restrict the part that is to be fitted.
Picture raw data
This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x=np.arange(5120)
y=array([ 0.81434599, 1.17054264, 0.85279188, ..., 1. ,
1. , 13.56291391]) #most of the data isn't interesting
#to me, part of interest see below
def Gauss(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))
mean = sum(x * y) / sum(y)
sigma = np.sqrt(sum(y * (x - mean)**2) / sum(y))
popt,pcov = curve_fit(Gauss, x, y, p0=[max(y), mean, sigma],
maxfev=360000)
plt.plot(x,y,label='data')
plt.plot(x,Gauss(x, *popt), 'r-',label='fit')
On docs.scipy.org I've found a general description about curve_fit
If I try using
bounds=([2400,-np.inf, -np.inf],[2600, np.inf, np.inf]),
I'm getting the ValueError: x0 is infeasible. What is the problem here?
I also tried to confine it with
popt,pcov = curve_fit(Gauss, x[2400:2600], y[2400:2600], p0=[max(y), mean, sigma], maxfev=360000)
as suggested in a comment on this question: "Error when obtaining gaussian fit for graph" at stackoverflow
In this case I only get a straight line though.
Picture: Confinement with x[2400:2600],y[2400:2600] as arguments of curve_fit
I really hope you can help me out here. I only need a way to fit a small part of my data. Thanks in advance!
interesting data:
y=array([ 0.93396226, 1.00884956, 1.15457413, 1.07590759,
0.88915094, 1.07142857, 1.10714286, 1.14171123, 1.06666667,
0.84975369, 0.95480226, 0.99388379, 1.01675978, 0.83967391,
0.9771987 , 1.02402402, 1.04531722, 1.07492795, 0.97135417,
0.99714286, 1.0248139 , 1.26223776, 1.1533101 , 0.99099099,
1.18867925, 1.15772871, 0.95076923, 1.03313253, 1.02278481,
0.93265993, 1.06705539, 1.00265252, 1.02023121, 0.92076503,
0.99728997, 1.03353659, 1.15116279, 1.04336043, 0.95076923,
1.05515588, 0.92571429, 0.93448276, 1.02702703, 0.90056818,
0.96068796, 1.08493151, 1.13584906, 1.1212938 , 1.0739645 ,
0.98972603, 0.94594595, 1.07913669, 0.98425197, 0.87762238,
0.96811594, 1.02710843, 0.99392097, 0.91384615, 1.09809264,
1.00630915, 0.93175074, 0.87572254, 1.00651466, 0.78772379,
1.12244898, 1.2248062 , 0.97109827, 0.94607843, 0.97900262,
0.97527473, 1.01212121, 1.16422287, 1.20634921, 0.97275204,
1.01090909, 0.99404762, 1.00561798, 1.01146132, 1.08695652,
0.97214485, 1.03525641, 0.99096386, 1.05135952, 1.16451613,
0.90462428, 0.76876877, 0.47701149, 0.27607362, 0.21580547,
0.20598007, 0.16766467, 0.15533981, 0.19745223, 0.15407855,
0.18925831, 0.26997245, 0.47603834, 0.596875 , 0.85126582, 0.96
, 1.06578947, 1.08761329, 0.89548023, 0.99705882, 1.07142857,
0.95677233, 0.86119874, 1.02857143, 0.98250729, 0.94214876,
1.04166667, 0.96024465, 1.07022472, 1.10344828, 1.04859335,
0.96655518, 1.06424581, 1.01754386, 1.03492063, 1.18627451,
0.91036415, 1.03355705, 1.09116809, 0.96083551, 1.01298701,
1.03691275, 1.02923977, 1.11612903, 1.01457726, 1.06285714,
0.98186528, 1.16470588, 0.86645963, 1.07317073, 1.09615385,
1.21192053, 0.94385027, 0.94244604, 0.88390501, 0.95718654,
0.9691358 , 1.01729107, 1.01119403, 1.20350877, 1.12890625,
1.06940063, 0.90410959, 1.14662757, 0.97093023, 1.03021148,
1.10629921, 0.97118156, 1.10693642, 1.07917889, 0.9484127 ,
1.07581227, 0.98006645, 0.98986486, 0.90066225, 0.90066225,
0.86779661, 0.86779661, 0.96996997, 1.01438849, 0.91186441,
0.91290323, 1.03745318, 1.0615942 , 0.97202797, 1.16608997,
0.94182825, 1.08333333, 0.9076087 , 1.18181818, 1.20618557,
1.01273885, 0.93606138, 0.87457627, 0.90575916, 1.09756098,
0.99115044, 1.13380282, 1.04333333, 1.04026846, 1.0297619 ,
1.04334365, 1.03395062, 0.92553191, 0.98198198, 1. ,
0.9439528 , 1.02684564, 1.1372549 , 0.96676737, 0.99649123,
1.07051282, 1.10367893, 1.0866426 , 1.15384615, 0.99667774])
You might find the lmfit module (https://lmfit.github.io/lmfit-py/) useful for this. It is designed to make curve fitting very easy, has built-in models for common peaks like Gaussian, and has many useful features such as allowing you to set bounds on parameters. A fit to your data with lmfit might look like this:
import numpy as np
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel, ConstantModel
y = np.array([.....]) # uses your shorter data range
x = np.arange(len(y))
# make a model that is a Gaussian + a constant:
model = GaussianModel(prefix='peak_') + ConstantModel()
# make parameters with starting values:
params = model.make_params(c=1.0, peak_center=90,
peak_sigma=5, peak_amplitude=-5)
# it's not really needed for this data, but you can put bounds on
# parameters like this (or set .vary=False to fix a parameter)
params['peak_sigma'].min = 0 # sigma > 0
params['peak_amplitude'].max = 0 # amplitude < 0
params['peak_center'].min = 80
params['peak_center'].max = 100
# run fit
result = model.fit(y, params, x=x)
# print, plot results
print(result.fit_report())
plt.plot(x, y)
plt.plot(x, result.best_fit)
plt.show()
This will print out
[[Model]]
(Model(gaussian, prefix='peak_') + Model(constant))
[[Fit Statistics]]
# function evals = 54
# data points = 200
# variables = 4
chi-square = 1.616
reduced chi-square = 0.008
Akaike info crit = -955.625
Bayesian info crit = -942.432
[[Variables]]
peak_sigma: 4.03660814 +/- 0.204240 (5.06%) (init= 5)
peak_center: 91.2246614 +/- 0.200267 (0.22%) (init= 90)
peak_amplitude: -9.79111362 +/- 0.445273 (4.55%) (init=-5)
c: 1.02138228 +/- 0.006796 (0.67%) (init= 1)
peak_fwhm: 9.50548558 +/- 0.480950 (5.06%) == '2.3548200*peak_sigma'
peak_height: -0.96766623 +/- 0.041854 (4.33%) == '0.3989423*peak_amplitude/max(1.e-15, peak_sigma)'
[[Correlations]] (unreported correlations are < 0.100)
C(peak_sigma, peak_amplitude) = -0.599
C(peak_amplitude, c) = -0.328
C(peak_sigma, c) = 0.196
and make a plot of the data with the best-fit curve drawn over it.
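For comparison, here is a sketch of the same kind of bounded fit with scipy.optimize.curve_fit, on synthetic data that only imitates the dip in the question (the function dip, the noise level and the starting values are assumptions for illustration). The entries of bounds follow the order of the function's parameters, and p0 has to lie inside them. In the question, bounds=([2400, ...], [2600, ...]) constrained the first parameter a rather than x0, so p0[0]=max(y) fell outside the allowed range. That is what the "x0 is infeasible" error reports; the x0 in that message is the optimizer's starting vector, not the Gaussian center.

```python
import numpy as np
from scipy.optimize import curve_fit

def dip(x, a, x0, sigma, c):
    """Gaussian of height a (negative for a dip) on a constant background c."""
    return c + a * np.exp(-(x - x0)**2 / (2 * sigma**2))

# Synthetic stand-in for the 200-point region of interest
rng = np.random.default_rng(42)
x = np.arange(200, dtype=float)
y = dip(x, -0.9, 91.0, 4.0, 1.0) + rng.normal(0.0, 0.05, x.size)

# bounds and p0 are given in parameter order: (a, x0, sigma, c)
lower = [-np.inf, 80.0, 0.0, -np.inf]
upper = [0.0, 100.0, np.inf, np.inf]
p0 = [-0.5, 90.0, 5.0, 1.0]  # must lie inside [lower, upper]

popt, pcov = curve_fit(dip, x, y, p0=p0, bounds=(lower, upper))
print(popt)
```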
I am trying to fit a Gaussian to a spectrum, and the y values are on the order of 10^(-19). curve_fit gives me a poor fit, both before and after I multiply my whole data by 10^(-19). Attached is my code; it is a fairly simple set of data except that the values are very small. If I want to keep my original values, how would I get a reasonable Gaussian fit that would give me the correct parameters?
#get fits data
aaa=pyfits.getdata('p1.cal.fits')
aaa=np.matrix(aaa)
nrow=np.shape(aaa)[0]
ncol=np.shape(aaa)[1]
ylo=79
yhi=90
xlo=0
xhi=1023
glo=430
ghi=470
#sum all the rows to get spectrum
ysum=[]
for x in range(xlo,xhi):
    sum=np.sum(aaa[ylo:yhi,x])
    ysum.append(sum)
wavelen_pix=range(xhi-xlo)
max=np.max(ysum)
print "maximum is at x=", np.where(ysum==max)
##fit gaussian
#fit only part of my data in the chosen range [glo:ghi]
x=wavelen_pix[glo:ghi]
y=ysum[glo:ghi]
def func(x, a, x0, sigma):
    return a*np.exp(-(x-x0)**2/float((2*sigma**2)))
sig=np.std(ysum[500:1000]) #std of background noise
popt, pcov = curve_fit(func, x, sig)
print popt
#this gives me [1.,1.,1.], which is obviously wrong
gaus=func(x,popt[0],popt[1],popt[2])
aaa is a 153 by 1024 image matrix, partly looks like this:
matrix([[ -8.99793629e-20, 8.57133275e-21, 4.83523386e-20, ...,
-1.54811004e-20, 5.22941515e-20, 1.71179195e-20],
[ 2.75769318e-20, 1.03177243e-20, -3.19634928e-21, ...,
1.66583803e-20, -9.88712568e-22, -2.56897725e-20],
[ 2.88121935e-20, 8.57964252e-21, -2.60784327e-20, ...,
1.72335180e-20, -7.61189937e-21, -3.45333075e-20],
...,
[ 1.04006903e-20, 1.61200683e-20, 7.04195205e-20, ...,
1.72459645e-20, 4.29404029e-20, 1.99889374e-20],
[ 3.22315752e-21, -5.61394194e-21, 3.28763096e-20, ...,
1.99063583e-20, 2.12989880e-20, -1.23250648e-21],
[ 3.66591810e-20, -8.08647455e-22, -6.22773168e-20, ...,
-4.06145681e-21, 4.92453132e-21, 4.23689309e-20]], dtype=float32)
You are calling curve_fit incorrectly, here is the usage
curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, **kw)
f is your function, whose first argument is an array of the independent variable and whose subsequent arguments are the function parameters (such as amplitude, center, etc.)
xdata is the independent variable
ydata is the dependent variable
p0 is an initial guess at the function parameters (for your Gaussian these are amplitude, center, width)
By default p0 is set to a list of ones [1,1,...], which is probably why you get that as a result: the fit never really ran because curve_fit was called incorrectly.
Try estimating the amplitude, center, and width from the data, then make a p0 object (see below for details)
init_guess = ( a_i, x0_i, sig_i) # same order as they are supplied to your function
popt, pcov = curve_fit(func, xdata=x,ydata=y,p0=init_guess)
Here is a short example
xdata = np.linspace(0, 4, 50)
mygauss = ( 10,2,0.5) #( amp, center, width)
y = func(xdata, *mygauss ) # using your func defined above
ydata = y + 2*(np.random.random(50)- 0.5) # add some noise to create fake data
Now I can guess the fit params
ai = np.max( ydata) # guess the amplitude
xi = xdata[ np.argmax( ydata)] # guess the position of center
Guessing the width is tricky, I would first find where the half max is located (there are two, but you only need to find one, as the Gaussian is symmetric):
pos_half = np.argmin( np.abs( ydata-ai/2 ) ) # subtract half the amplitude and find the minimum
Now evaluate how far this is from the center of the gaussian (xi) :
sig_i = np.abs( xi - xdata[ pos_half] ) # estimate the width
Now you can make the initial guess
init_guess = (ai, xi, sig_i)
and fit
params, variance = curve_fit( func, xdata=xdata, ydata=ydata, p0=init_guess)
print(params)
#array([ 9.99457443, 2.01992858, 0.49599629])
which is very close to mygauss. Hope it helps.
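Put together, the steps above can be sketched as one runnable script. The helper name guess_gaussian and the seeded random generator are my additions for illustration; func is the Gaussian from the question, and the half-maximum distance is only a rough stand-in for sigma.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(x, a, x0, sigma):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2))

def guess_gaussian(xdata, ydata):
    """Rough (a, x0, sigma) starting values; hypothetical helper, assumes one positive peak."""
    a_i = np.max(ydata)                            # amplitude: highest point
    x0_i = xdata[np.argmax(ydata)]                 # center: position of the highest point
    pos_half = np.argmin(np.abs(ydata - a_i / 2))  # sample closest to half maximum
    sig_i = np.abs(x0_i - xdata[pos_half])         # width: distance from center to half maximum
    return a_i, x0_i, sig_i

# Fake data: a known Gaussian plus uniform noise
rng = np.random.default_rng(0)
xdata = np.linspace(0, 4, 50)
ydata = func(xdata, 10, 2, 0.5) + 2 * (rng.random(50) - 0.5)

init_guess = guess_gaussian(xdata, ydata)
popt, pcov = curve_fit(func, xdata, ydata, p0=init_guess)
print(init_guess)
print(popt)
```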
Forget about rescaling, making linear changes, or using the p0 parameter, which usually don't work! Try using the bounds parameter of curve_fit for n parameters like this:
a0 = np.array([a01, ..., a0n])  # lower bounds, one per parameter
af = np.array([af1, ..., afn])  # upper bounds, one per parameter
popt, pcov = curve_fit(func, x, y, method="trf", bounds=(a0, af))
Hope it works!
;)
I'm running a ridge regression on somewhat collinear data. One of the methods used to identify a stable fit is a ridge trace and thanks to the great example on scikit-learn, I'm able to do that. Another method is to calculate variance inflation factors (VIFs) for each variable as k increases. When the VIFs decrease to <5 it is an indication the fit is satisfactory. Statsmodels has code for VIFs, but it is for an OLS regression. I've attempted to alter it to handle a ridge regression.
I'm checking my results against Regression Analysis by Example, 5th edition, chapter 10. My code generates the correct results for k = 0.000, but not after that. Working SAS code is available, but I'm not a SAS user and I don't know the differences between that implementation and scikit-learn's (and/or statsmodels's).
I've been stuck on this for a few days so any help would be much appreciated.
#http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm
from __future__ import division
import numpy as np
import pandas as pd
example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(example)
scaler.transform(example)
X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)
import pylab as pl
from sklearn import linear_model
###############################################################################
# Compute paths
alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)
coefs = []
vif_list = [[] for x in range(X.shape[1])]
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)
    for j, data in enumerate(X.columns):
        cols = [col for col in X.columns if col not in [data]]
        Z = X[cols]
        yy = X.iloc[:,j]
        clf2.set_params(alpha=a)
        clf2.fit(Z, yy)
        r_squared_j = clf2.score(Z, yy)
        vif = 1. / (1. - r_squared_j)
        print(r_squared_j)
        vif_list[j].append(vif)
pd.DataFrame(vif_list, columns = alphas).T
pd.DataFrame(coefs, index=alphas)
###############################################################################
# Display results
ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(alphas, coefs)
pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
Variance inflation factor for Ridge regression is just three lines. I checked it with the example on the UCLA statistics page.
A variation of this will make it into the next statsmodels release. Here is my current function:
def vif_ridge(corr_x, pen_factors, is_corr=True):
    """variance inflation factor for Ridge regression

    assumes penalization is on standardized variables
    data should not include a constant

    Parameters
    ----------
    corr_x : array_like
        correlation matrix if is_corr=True or original data if is_corr is False.
    pen_factors : iterable
        iterable of Ridge penalization factors
    is_corr : bool
        Boolean to indicate how corr_x is interpreted, see corr_x

    Returns
    -------
    vif : ndarray
        variance inflation factors for parameters in columns and ridge
        penalization factors in rows

    could be optimized for repeated calculations
    """
    corr_x = np.asarray(corr_x)
    if not is_corr:
        # the `bias` argument of corrcoef has no effect, so it is omitted
        corr = np.corrcoef(corr_x, rowvar=False)
    else:
        corr = corr_x
    eye = np.eye(corr.shape[1])
    res = []
    for k in pen_factors:
        minv = np.linalg.inv(corr + k * eye)
        vif = minv.dot(corr).dot(minv)
        res.append(np.diag(vif))
    return np.asarray(res)
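A quick sanity check of the formula, as a sketch on synthetic collinear data (the data and the compact restatement of the function for a correlation-matrix input are assumptions for illustration): at k = 0 the expression reduces to the diagonal of the inverse correlation matrix, which is the classical VIF, and the values shrink as k grows.

```python
import numpy as np

def vif_ridge(corr, pen_factors):
    """Compact restatement of the function above for a correlation-matrix input."""
    corr = np.asarray(corr)
    eye = np.eye(corr.shape[1])
    res = []
    for k in pen_factors:
        minv = np.linalg.inv(corr + k * eye)
        res.append(np.diag(minv.dot(corr).dot(minv)))
    return np.asarray(res)

# Synthetic, deliberately collinear data: three noisy copies of one variable
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
data = np.hstack([z + 0.1 * rng.normal(size=(100, 1)) for _ in range(3)])
corr = np.corrcoef(data, rowvar=False)

vifs = vif_ridge(corr, [0.0, 0.01, 0.1])
print(vifs)  # one row per penalization factor, one column per variable

# At k = 0 the result equals the classical VIF, diag(inv(corr))
print(np.allclose(vifs[0], np.diag(np.linalg.inv(corr))))
```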
What is an efficient method for determining the skew/kurtosis of a bar graph in python? Considering that bar graphs are not binned (unlike histograms) this question would not make a lot of sense but what I am trying to do is to determine the symmetry of a graph's height vs distance (rather than frequency vs bins). In other words, given a value of heights(y) measured along distance(x) i.e.
y = [6.18, 10.23, 33.15, 55.25, 84.19, 91.09, 106.6, 105.63, 114.26, 134.24, 137.44, 144.61, 143.14, 150.73, 156.44, 155.71, 145.88, 120.77, 99.81, 85.81, 55.81, 49.81, 37.81, 25.81, 5.81]
x = [0.03, 0.08, 0.14, 0.2, 0.25, 0.31, 0.36, 0.42, 0.48, 0.53, 0.59, 0.64, 0.7, 0.76, 0.81, 0.87, 0.92, 0.98, 1.04, 1.09, 1.15, 1.2, 1.26, 1.32, 1.37]
What is the symmetry of that height(y) distribution (skewness) and its peakedness (kurtosis) as measured over distance(x)? Are skewness/kurtosis appropriate measurements for determining the normality of a distribution of real values? Or does scipy/numpy offer something similar for that type of measurement?
I can achieve a skew/kurtosis estimate of height(y) frequency values binned along distance(x) by the following
from itertools import chain
from scipy import stats
from matplotlib.pyplot import hist, xlabel, ylabel
freq=list(chain(*[[x_v]*int(round(y_v)) for x_v,y_v in zip(x,y)]))
x.extend([x[-1:][0]+x[0]]) #add one extra bin edge
hist(freq,bins=x)
ylabel("Height Frequency")
xlabel("Distance(km) Bins")
print("Skewness,","Kurtosis:",stats.describe(freq)[4:])
Skewness, Kurtosis: (-0.019354300509997705, -0.7447085398785758)
In this case the height distribution is symmetrical (skew -0.02) around the midpoint distance and characterized by a platykurtic (-0.74 kurtosis, i.e. broad) distribution.
Considering that I repeat each x value as many times as its height y to create a frequency, the size of the resulting list can sometimes get very large. I was wondering if there is a better method to approach this problem? I suppose that I could always try to normalize dataset y to a range of perhaps 0 - 100 without losing too much information on the dataset's skew/kurtosis.
This isn't a python question, nor is it really a programming question but the answer is simple nonetheless. Instead of skew and kurtosis, let's first consider the easier values based off the lower moments, the mean and standard deviation. To make it concrete, and to fit with your question, let's assume your data looks like:
X = 3, 3, 5, 5, 5, 7 = x1, x2, x3 ....
Which would give a "bar graph" that looks like:
{3:2, 5:3, 7:1} = {k1:p1, k2:p2, k3:p3}
The mean, u, is given by
E[X] = (1/N) * (x1 + x2 + x3 + ...) = (1/N) * (3 + 3 + 5 + ...)
Our data, however, has repeated values, so this can be rewritten as
E[X] = (1/N) * (p1*k1 + p2*k2 + ...) = (1/N) * (3*2 + 5*3 + 7*1)
The next term, the standard dev., s, is simply
sqrt(E[(X-u)^2]) = sqrt((1/N)*( (x1-u)^2 + (x2-u)^2 + ...))
But we can apply the same reduction to the E[(X-u)^2] term and write it as
E[(X-u)^2] = (1/N)*( p1*(k1-u)^2 + p2*(k2-u)^2 + ... )
= (1/6)*( 2*(3-u)^2 + 3*(5-u)^2 + 1*(7-u)^2 )
Which means we don't have to have a multiple copy of each data item to do the sum as you indicated in your question.
The skew and kurtosis are quite simple at this point:
skew = E[(X-u)^3] / (E[(X-u)^2])^(3/2)
kurtosis = ( E[(X-u)^4] / (E[(X-u)^2])^2 ) - 3
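The formulas above can be sketched directly with numpy, treating the bar heights as weights so that no expanded list is needed. The function name weighted_skew_kurtosis is a hypothetical helper, not a scipy or numpy routine.

```python
import numpy as np

def weighted_skew_kurtosis(values, weights):
    """Skewness and excess kurtosis of `values` weighted by `weights` (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                     # normalized weights, the p_i / N above
    u = np.sum(p * values)              # mean, E[X]
    m2 = np.sum(p * (values - u)**2)    # E[(X-u)^2]
    m3 = np.sum(p * (values - u)**3)    # E[(X-u)^3]
    m4 = np.sum(p * (values - u)**4)    # E[(X-u)^4]
    return m3 / m2**1.5, m4 / m2**2 - 3.0

# The small example from above: X = 3, 3, 5, 5, 5, 7
skew, kurt = weighted_skew_kurtosis([3, 5, 7], [2, 3, 1])
print(skew, kurt)
```

For the data in the question the call would be weighted_skew_kurtosis(x, y) with the original x (before the extra bin edge is appended), with no need to round y or build the long freq list.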