I'm running a ridge regression on somewhat collinear data. One of the methods used to identify a stable fit is a ridge trace and thanks to the great example on scikit-learn, I'm able to do that. Another method is to calculate variance inflation factors (VIFs) for each variable as k increases. When the VIFs decrease to <5 it is an indication the fit is satisfactory. Statsmodels has code for VIFs, but it is for an OLS regression. I've attempted to alter it to handle a ridge regression.
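For reference, the quantity I'm computing for each predictor at each value of k is the usual

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where R_j^2 is the coefficient of determination from regressing the j-th (standardized) predictor on the remaining predictors, here with the same ridge penalty k applied.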
I'm checking my results against Regression Analysis by Example, 5th edition, chapter 10. My code generates the correct results for k = 0.000, but not after that. Working SAS code is available, but I'm not a SAS user and I don't know the differences between that implementation and scikit-learn's (and/or statsmodels's).
I've been stuck on this for a few days so any help would be much appreciated.
#http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm
from __future__ import division
import numpy as np
import pandas as pd
example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(example)
scaler.transform(example)
X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)
import pylab as pl
from sklearn import linear_model
###############################################################################
# Compute paths
alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]
clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)
coefs = []
vif_list = [[] for x in range(X.shape[1])]
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)
    for j, data in enumerate(X.columns):
        cols = [col for col in X.columns if col not in [data]]
        Z = X[cols]
        yy = X.iloc[:, j]
        clf2.set_params(alpha=a)
        clf2.fit(Z, yy)
        r_squared_j = clf2.score(Z, yy)
        vif = 1. / (1. - r_squared_j)
        print(r_squared_j)
        vif_list[j].append(vif)
pd.DataFrame(vif_list, columns=alphas).T
pd.DataFrame(coefs, index=alphas)
###############################################################################
# Display results
ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(alphas, coefs)
# pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')  # ridge_cv is not defined in this script
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
Variance inflation factor for Ridge regression is just three lines. I checked it with the example on the UCLA statistics page.
A variation of this will make it into the next statsmodels release. Here is my current function:
def vif_ridge(corr_x, pen_factors, is_corr=True):
    """Variance inflation factor for Ridge regression

    assumes penalization is on standardized variables
    data should not include a constant

    Parameters
    ----------
    corr_x : array_like
        correlation matrix if is_corr=True, or original data if is_corr is False.
    pen_factors : iterable
        iterable of Ridge penalization factors
    is_corr : bool
        Boolean to indicate how corr_x is interpreted, see corr_x

    Returns
    -------
    vif : ndarray
        variance inflation factors for parameters in columns and ridge
        penalization factors in rows

    could be optimized for repeated calculations
    """
    corr_x = np.asarray(corr_x)
    if not is_corr:
        corr = np.corrcoef(corr_x, rowvar=0, bias=True)
    else:
        corr = corr_x
    eye = np.eye(corr.shape[1])
    res = []
    for k in pen_factors:
        minv = np.linalg.inv(corr + k * eye)
        vif = minv.dot(corr).dot(minv)
        res.append(np.diag(vif))
    return np.asarray(res)
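For example, with the X DataFrame and the alphas grid from the question it would be called roughly like this (a sketch; it passes the correlation matrix, since the function assumes penalization on standardized variables):

import pandas as pd

# Correlation matrix of the predictors, evaluated over the whole penalty grid
vifs = vif_ridge(X.corr(), alphas)
print(pd.DataFrame(vifs, index=alphas, columns=X.columns))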
Related
I'm studying artificial intelligence in Python, and now I can't understand why my code doesn't work (specifically its part with data normalization)
code:
import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# Data binarization
data_binarized = preprocessing.Binarizer(threshold=2.1).transform(input_data)
print("\nBinarized data:\n", data_binarized)

# Output of the mean value and standard deviation
print("\nBEFORE:")
print("Mean =", input_data.mean(axis=0))
print("Std deviation =", input_data.std(axis=0))

# Exclusion of the mean
data_scaled = preprocessing.scale(input_data)
print("\nAFTER:")
print("Mean =", data_scaled.mean(axis=0))

# Min-max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

# Data normalization
data_normalized_11 = preprocessing.normalize(input_data, norm='11')
data_normalized_12 = preprocessing.normalize(input_data, norm='12')
print("\nL1 normalized data:\n", data_normalized_11)
print("\nL1 normalized data:\n", data_normalized_12)
I'm studying AI from a book and did everything exactly as written there, but my code does not work. The output should be:
L1 normalized data:
[[ 0.45132743 -0.25663717  0.2920354 ]
 [-0.0794702   0.51655629 -0.40397351]
 [ 0.609375    0.0625      0.328125  ]
 [ 0.33640553 -0.4562212  -0.20737327]]

L2 normalized data:
[[ 0.75765788 -0.43082507  0.49024922]
 [-0.12030718  0.78199664 -0.61156148]
 [ 0.87690281  0.08993875  0.47217844]
 [ 0.55734935 -0.75585734 -0.34357152]]
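For reference, sklearn's preprocessing.normalize expects the norm name spelled with the letter l ('l1' / 'l2'), not the digits '11' / '12'. A minimal sketch of the working calls:

import numpy as np
from sklearn import preprocessing

input_data = np.array([[5.1, -2.9, 3.3],
                       [-1.2, 7.8, -6.1],
                       [3.9, 0.4, 2.1],
                       [7.3, -9.9, -4.5]])

# norm='l1' / norm='l2' (letter l); other strings raise a ValueError
data_normalized_l1 = preprocessing.normalize(input_data, norm='l1')
data_normalized_l2 = preprocessing.normalize(input_data, norm='l2')
print("\nL1 normalized data:\n", data_normalized_l1)
print("\nL2 normalized data:\n", data_normalized_l2)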
I have a set of x and y data and I want to use exponential regression to find the curve that best fits that set of points, i.e.:
y = P1 + P2 exp(-P0 x)
I want to calculate the values of P0, P1 and P2.
I use the software "Igor Pro", which calculates the values for me, but I want a Python implementation. I used the curve_fit function, but the values I get are nowhere near the ones calculated by Igor. Here are the sets of data that I have:
Set1:
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
Values calculated by Igor:
P1=376.91, P2=5393.9, P0=3.7776
Values calculated by curve_fit:
P1=702.45, P2=-13.33, P0=-2.6744
Set2:
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
Values calculated by Igor:
P1=321, P2=4848, P0=-1.94
Values calculated by curve_fit:
No optimal values found
I use curve_fit as follow:
from scipy.optimize import curve_fit
popt, pcov = curve_fit(lambda t, a, b, c: a * np.exp(-b * t) + c, x, y)
where:
P1=c, P2=a and P0=b
Well, when comparing fit results, it is always important to include uncertainties in the fitted parameters. That is, when you say that the values from Igor (P1=376.91, P2=5393.9, P0=3.7776) and from curve_fit (P1=702.45, P2=-13.33, P0=-2.6744) are different, what is it that leads you to conclude those values are actually different?
Of course, in everyday conversation, 376.91 and 702.45 are very different, mostly because simply stating a value to 2 decimal places implies accuracy at approximately that scale (the distance between New York and Tokyo is 10,850 km, but it is not really 1,084,702,431 cm -- that might be the distance between bus stops in the two cities). But when comparing fit results, that everyday knowledge cannot be assumed, and you have to include uncertainties. I don't know if Igor will give you those. scipy's curve_fit can, but it requires some work to extract them -- a pity.
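For completeness, that work is roughly the following: the second value curve_fit returns is the covariance matrix, and the 1-sigma uncertainties are the square roots of its diagonal. A minimal sketch with your Set1 data (it assumes the fit converges, as it did for you):

import numpy as np
from scipy.optimize import curve_fit

x = np.array([1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91])
y = np.array([476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5])

popt, pcov = curve_fit(lambda t, a, b, c: a * np.exp(-b * t) + c, x, y)
perr = np.sqrt(np.diag(pcov))   # 1-sigma uncertainties for a (P2), b (P0), c (P1)
for name, val, err in zip(('P2 (a)', 'P0 (b)', 'P1 (c)'), popt, perr):
    print(name, '=', val, '+/-', err)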
Allow me to recommend trying lmfit (disclaimer: I am an author). With that, you would set up and execute the fit like this:
import numpy as np
from lmfit import Model

x = [1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91]
y = [476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5]
# x = [1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
# y = [648, 618, 636, 485, 384, 639, 630, 583, 529]

# Define the function that we want to fit to the data
def func(x, offset, scale, decay):
    return offset + scale * np.exp(-decay * x)

model = Model(func)
params = model.make_params(offset=375, scale=5000, decay=4)

result = model.fit(y, params, x=x)
print(result.fit_report())
This would print out the result of
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 49
# data points = 9
# variables = 3
chi-square = 72.2604167
reduced chi-square = 12.0434028
Akaike info crit = 24.7474672
Bayesian info crit = 25.3391410
R-squared = 0.99362489
[[Variables]]
offset: 413.168769 +/- 17348030.9 (4198775.95%) (init = 375)
scale: 16689.6793 +/- 1.3337e+10 (79909638.11%) (init = 5000)
decay: 5.27555726 +/- 1016721.11 (19272297.84%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 1.000
C(offset, decay) = 1.000
C(offset, scale) = 1.000
indicating that the uncertainties in the parameter values are simply enormous and the correlations between all parameters are 1. This is because you have only 2 x values, which will make it impossible to accurately determine 3 independent variables.
And, note that with an uncertainty of 17 million, the values for P1 (offset) of 413 and 702 do actually agree. The problem is not that Igor and curve_fit disagree on the best value, it is that neither can determine the value with any accuracy at all.
For your other dataset, the situation is a little better, with a result:
[[Model]]
Model(func)
[[Fit Statistics]]
# fitting method = leastsq
# function evals = 82
# data points = 9
# variables = 3
chi-square = 1118.19957
reduced chi-square = 186.366596
Akaike info crit = 49.4002551
Bayesian info crit = 49.9919289
R-squared = 0.98272310
[[Variables]]
offset: 320.876843 +/- 42.0154403 (13.09%) (init = 375)
scale: 4797.14487 +/- 2667.40083 (55.60%) (init = 5000)
decay: 1.93560164 +/- 0.47764470 (24.68%) (init = 4)
[[Correlations]] (unreported correlations are < 0.100)
C(scale, decay) = 0.995
C(offset, decay) = 0.940
C(offset, scale) = 0.904
the correlations are still high, but the parameters are reasonably well determined. Also, note that the best-fit values here are much closer to those you got from Igor, and probably "within the uncertainty".
And this is why one always needs to include uncertainties with the best-fit values reported from a fit.
Set 1 :
x = [ 1.06, 1.06, 1.06, 1.06, 1.06, 1.06, 0.91, 0.91, 0.91 ]
y = [ 476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5 ]
One observes that there are only two distinct values of x: 1.06 and 0.91.
On the other hand there are three parameters to optimise: P0, P1 and P2. That is too many.
In other words, infinitely many exponential curves can be found that fit the two clusters of points. The differences between the fitted curves can be due to slight differences in the computation methods of non-linear regression, especially in how the initial values of the iterative process are chosen.
In this particular case a simple linear regression would be unambiguous (see the sketch just below).
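A minimal sketch of that straight-line fit on Set 1 (hypothetical, only to illustrate the point):

import numpy as np

x = np.array([1.06] * 6 + [0.91] * 3)
y = np.array([476, 475, 476.5, 475.25, 480, 469.5, 549.25, 548.5, 553.5])

# With only two distinct x values, the least-squares straight line simply
# passes through the two cluster means, so there is no ambiguity at all.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)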
By comparison, plotting the data together with both fitted curves (figure not reproduced here) shows that both Igor and curve_fit give excellent fits: the points are very close to both curves. One understands that infinitely many other exponential functions would fit just as well.
Set 2 :
x = [ 1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [ 648, 618, 636, 485, 384, 639, 630, 583, 529]
The difficulty you meet might be due to the choice of the "guessed" initial values of the parameters, which are required to start the iterative process of non-linear regression.
In order to check this hypothesis, one can use a different method which doesn't need guessed initial values. (The MathCad code and the numerical results were given as an image, which is not reproduced here.)
Don't be surprised if the values of the parameters that you get with your software are slightly different from the values (a, b, c) obtained this way. The fitting criterion implicitly set in your software is probably different from the fitting criterion set in mine.
The regression method used (the blue curve in the omitted figure) is a least-mean-square-error fit with respect to a linear integral equation to which the exponential function is a solution. Ref.: https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
This non-standard method isn't iterative and doesn't require initial "guessed" values of the parameters.
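For readers without MathCad, here is a rough Python transcription of the idea for the model y = a + b*exp(c*x); treat it as an illustrative sketch of the integral-equation approach rather than the exact computation:

import numpy as np

def exp_fit_no_guess(x, y):
    """Fit y = a + b*exp(c*x) without initial guesses, via the linear
    integral equation that the exponential function satisfies."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    # Cumulative trapezoidal integral S_k of y, with S_1 = 0
    S = np.concatenate(([0.0], np.cumsum(0.5 * (y[1:] + y[:-1]) * np.diff(x))))
    # First linear least squares: y_k - y_1 = A*(x_k - x_1) + c*S_k, with A = -a*c
    _, c = np.linalg.lstsq(np.column_stack([x - x[0], S]), y - y[0], rcond=None)[0]
    # Second linear least squares with c fixed gives a and b
    a, b = np.linalg.lstsq(np.column_stack([np.ones_like(x), np.exp(c * x)]), y, rcond=None)[0]
    return a, b, c

x = [1.36, 1.44, 1.41, 1.745, 2.25, 1.42, 1.45, 1.5, 1.58]
y = [648, 618, 636, 485, 384, 639, 630, 583, 529]
print(exp_fit_no_guess(x, y))   # (a, b, c) with y = a + b*exp(c*x), i.e. P1, P2, -P0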
I am currently trying to evaluate some data of mine and tried replicating the fit function described here: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_classic_dr_variable.htm
At first I was having some trouble with numpy.float_power overflowing, but I think I fixed it (did I really?).
I am now using scipy.optimize.curve_fit to fit the described sigmoid to my data, but it never actually seems to fit; instead it produces constant functions, and I have no idea why.
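For reference, the equation on that GraphPad page, as I read it (X there is the logarithm of the dose/concentration):

$$Y = \mathrm{Bottom} + \frac{\mathrm{Top} - \mathrm{Bottom}}{1 + 10^{(\mathrm{LogEC50} - X)\cdot \mathrm{HillSlope}}}$$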
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

'''
Just a method that produces some simple test data
'''
def test_data_1():
    return np.array([[0.000610352, 0.002441406, 0.009765625, 0.0390625, 0.15625, 0.625, 2.5, 10],
                     [0.89, 0.81, 0.64, 0.48, 0.45, 0.50, 0.58, 0.70]])

'''
Just a simple method that produces some more test data
'''
def test_data_2():
    return np.array([[0.000610352, 0.002441406, 0.009765625, 0.0390625, 0.15625, 0.625, 2.5, 10],
                     [1, 0.83, 0.68, 0.52, 0.48, 0.59, 0.75, 0.62]])

'''
Dose response curve as described in: https://www.graphpad.com/guides/prism/latest/curve-fitting/reg_classic_dr_variable.htm
'''
def sigmoidal_dose_response_with_variable_slope(x_data, *params):
    # Extract relevant parameters. Flattening the array just in case?
    r_params = np.array(params).flatten()
    bottom = r_params[0]
    top = r_params[1]
    logec50 = r_params[2]
    slope = r_params[3]
    # Calculating the numerator
    numerator = top - bottom
    # Calculating the denominator
    denominator = 1 + np.float_power(10, (logec50 - x_data) * slope, dtype=np.longdouble)
    return np.array(bottom + (numerator / denominator), dtype=np.float64)

if __name__ == "__main__":
    x_data, y_data = test_data_1()
    # Guessing bottom and top as the lowest and highest y-values.
    bottom_guess = np.min(y_data)
    bottom_guess_idx = np.argmin(y_data)
    top_guess = np.max(y_data)
    top_guess_idx = np.argmax(y_data)
    # Guessing logec50 as the middle between those parameters
    logec50_guess = np.linalg.norm(x_data[top_guess_idx] - x_data[bottom_guess_idx]) / 2 \
        + np.min([x_data[top_guess_idx], x_data[bottom_guess_idx]])
    # Guessing a slope of 1
    slope_guess = 1
    p0 = [bottom_guess, top_guess, logec50_guess, slope_guess]
    # Fitting the curve to my data
    popt, pcov = curve_fit(sigmoidal_dose_response_with_variable_slope, x_data, y_data, p0)
    # Making the x-axis scale logarithmically
    fig, ax = plt.subplots()
    ax.set_xscale('log')
    # Plot my data
    plt.plot(x_data, y_data, 's')
    # Calculate function data. The borders are merely a guess
    x_val = np.linspace(0, 10, 100)
    y_val = sigmoidal_dose_response_with_variable_slope(x_val, popt)
    # Plot
    plt.plot(x_val, y_val)
    plt.show()
It should be easily testable.
Update:
Something like this is what I am looking for (the target curve was shown as an image, not reproduced here).
I've been trying to understand Gibbs sampling for some time. Recently, I saw a video that made a good deal of sense.
https://www.youtube.com/watch?v=a_08GKWHFWo
The author used Gibbs sampling to converge on the mean values (theta_1 and theta_2) of a bivariate normal distribution, using the process as follows:
init: Initialize theta_2 to a random value.
Loop:
sample theta_1 conditioned on theta_2 as N~(p(theta_2), [1-p**2])
sample theta_2 conditioned on theta_1 as N~(p(theta_1), [1-p**2])
(repeat until convergence.)
I tried this on my own and ran into an issue:
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
rv = multivariate_normal(mean=[0.5, -0.2], cov=[[1, 0.9], [0.9, 1]])
rv.mean
>>>
array([ 0.5, -0.2])
rv.cov
>>>
array([[1. , 0.9],
[0.9, 1. ]])
import numpy as np
samples = []
curr_t2 = np.random.rand()
def gibbs(iterations=5000):
    theta_1 = np.random.normal(curr_t2, (1-0.9**2), None)
    theta_2 = np.random.normal(theta_1, (1-0.9**2), None)
    samples.append((theta_1, theta_2))
    for i in range(iterations-1):
        theta_1 = np.random.normal(theta_2, (1-0.9**2), None)
        theta_2 = np.random.normal(theta_1, (1-0.9**2), None)
        samples.append((theta_1, theta_2))
gibbs()
sum([a for a,b in samples])/len(samples)
>>>
4.745736136676516
sum([b for a,b in samples])/len(samples)
>>>
4.746816908769834
Now, I see where I messed up. I found theta_1 conditioned on theta_2's actual value, not its probability. Likewise, I found theta_2 conditioned on theta_1's actual value, not its probability.
Where I'm stuck is, how do I evaluate the probability of either theta taking on any given observed value?
Two options I see: the probability density (based on the location on the normal curve) and the p-value (integrating from infinity and/or negative infinity to the observed value). Neither of these solutions sounds "right."
How should I proceed?
Perhaps my video wasn't clear enough. The algorithm does not converge "on the mean values" but rather it converges to samples from the distribution. Nonetheless, averages of samples from the distributions will converge to their respective mean values.
The issue is with your conditional means. In the video, I chose marginal means of zero to reduce notation. If you have non-zero marginal means, the conditional expectation for a bivariate normal involves the marginal means, the correlation, and the standard deviations (which are both 1 in your bivariate normal). The updated code is
import numpy as np
from scipy.stats import multivariate_normal
mu1 = 0.5
mu2 = -0.2
rv = multivariate_normal(mean=[mu1, mu2], cov=[[1, 0.9], [0.9, 1]])
samples = []
curr_t2 = np.random.rand()
def gibbs(iterations=5000):
    theta_1 = np.random.normal(mu1 + 0.9 * (curr_t2 - mu2), (1-0.9**2), None)
    theta_2 = np.random.normal(mu2 + 0.9 * (theta_1 - mu1), (1-0.9**2), None)
    samples.append((theta_1, theta_2))
    for i in range(iterations-1):
        theta_1 = np.random.normal(mu1 + 0.9 * (theta_2 - mu2), (1-0.9**2), None)
        theta_2 = np.random.normal(mu2 + 0.9 * (theta_1 - mu1), (1-0.9**2), None)
        samples.append((theta_1, theta_2))
gibbs()
sum([a for a,b in samples])/len(samples)
sum([b for a,b in samples])/len(samples)
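As a quick sanity check (a sketch; the numbers vary from run to run and it reuses the samples list filled above): with the conditional means written as mu1 + rho*(theta_2 - mu2) and mu2 + rho*(theta_1 - mu1), the sample averages should now settle near 0.5 and -0.2 instead of drifting away.

import numpy as np

arr = np.array(samples)      # shape (iterations, 2)
print(arr.mean(axis=0))      # should be close to [0.5, -0.2]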
The PLS regression using sklearn gives very poor prediction results. When I get the model I cannot find a way to obtain the "intercept". Perhaps this affects the prediction of the model? The matrices of scores and loadings are fine, and so is the arrangement of the coefficients. In any case, how do I get the intercept from the attributes already obtained?
This code returns the coefficients of the variables.
from pandas import DataFrame
from sklearn.cross_decomposition import PLSRegression

X = DataFrame({
    'x1': [0.0, 1.0, 2.0, 2.0],
    'x2': [0.0, 0.0, 2.0, 5.0],
    'x3': [1.0, 0.0, 2.0, 4.0],
}, columns=['x1', 'x2', 'x3'])

Y = DataFrame({
    'y': [-0.2, 1.1, 5.9, 12.3],
}, columns=['y'])

def regPLS1(X, Y):
    _COMPS_ = len(X.columns)  # all latent variables
    model = PLSRegression(_COMPS_).fit(X, Y)
    return model.coef_
The result is:
regPLS1(X,Y)
>>> array([[ 0.84], [ 2.44], [-0.46]])
In addition to these coefficients, the value of the intercept is: 0.26. What am I doing wrong?
EDIT
The correct predict (evaluate) response is Y_hat (exactly the same as the observed Y):
Y_hat = [-0.2 1.1 5.9 12.3]
To calculate the intercept use the following:
plsModel = PLSRegression(_COMPS_).fit( X, Y )
y_intercept = plsModel.y_mean_ - numpy.dot(plsModel.x_mean_ , plsModel.coef_)
I got the formula directly from the R "pls" package:
BInt[1,,i] <- object$Ymeans - object$Xmeans %*% B[,,i]
I tested the results and calculated the same intercepts in R 'pls' and scikit-learn.
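Putting this together as a small sketch (it reuses the X and Y from the question; it assumes the coef_ layout of that era of scikit-learn, i.e. shape (n_features, n_targets) with the x_mean_/y_mean_ attributes present, whereas newer releases transpose coef_ and expose an intercept_ attribute directly):

import numpy
from sklearn.cross_decomposition import PLSRegression

plsModel = PLSRegression(3).fit(X, Y)

# Intercept from the fitted means, mirroring the R 'pls' formula above
y_intercept = plsModel.y_mean_ - numpy.dot(plsModel.x_mean_, plsModel.coef_)
print(y_intercept)   # the question reports 0.26 for this data

# If coef_ maps raw X to Y (as assumed in this thread), then
# coefficients + intercept should reproduce plsModel.predict(X)
print(numpy.dot(X, plsModel.coef_) + y_intercept)
print(plsModel.predict(X))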
Based on my reading of the implementation of _PLS, the formula is Y = XB + Err, where model.coef_ is the estimate of B. If you look at the predict method, it looks like it uses the fitted parameter y_mean_ as the Err, so I believe that's what you want. Use model.y_mean_ instead of model.coef_. Hope this helps!