I obtain negative R^2 regardless of what regression model I use - python

I am trying to build a model that describes the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation. I followed this explanation to perform the regressions.
The problem is that all models give a negative R^2 on the test data. I tried 1st-, 2nd- and 3rd-degree polynomial models, and the result gets even worse as the degree increases.
Could somebody help me figure out what is wrong? Or which model could I use to get rid of the negative R^2 and obtain a sensible value?
Summary for the simple linear regression
MAE, MSE, RMSE and R^2 for the simple linear regression
MAE, MSE, RMSE and R^2 for the simple polynomial regression
MAE, MSE, RMSE and R^2 for the polynomial regression with regularization and cross-validation
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data (Three independent variables - Cycle, Internal Resistance and CV Capacity; One dependent - Full Capacity)
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046]
}
#Data to test the model
Test = {'Cycle': [29, 30, 31, 32, 33, 34, 35],
'Internal_Resistance': [0.026217822, 0.032549629, 0.025744309, 0.027945824, 0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [196.610972, 194.915587, 183.209067, 182.41669, 204.018257, 179.929472, 189.576431],
'Full_Capacity': [1777.880947, 1792.21646, 1785.653845, 1788.401923, 1782.983718, 1793.939504, 1788.67233]
}
#Convert initial data into DataFrame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head())
print()
#Convert data to test the model into DataFrame
dft = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
dft1 = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity'])
Xt = dft1.to_numpy()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
#STATSMODELS
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
#SCIKIT LEARN (Simple polynomial regression and polynomial regression with regularization and cross-validation)
# Fitting data - simple polynomial regression (1st degree)
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
X_linear_test=dft.drop('Full_Capacity',axis=1)
y_linear_test=dft['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
y_pred_linear_test = linear_model.predict(X_linear_test)
#Coefficients for the model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print()
#Metrics of the model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
print()
MAE_linear_test = mean_absolute_error(y_linear_test, y_pred_linear_test)
print("Mean absolute error of linear model (validation):",MAE_linear_test)
MSE_linear_test = mean_squared_error(y_linear_test, y_pred_linear_test)
print("Mean-squared error of linear model (validation):",MSE_linear_test)
RMSE_linear_test = np.sqrt(MSE_linear_test)
print("Root-mean-squared error of linear model (validation):",RMSE_linear_test)
print()
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
print ("R2 value of linear model (validation):",linear_model.score(X_linear_test,y_linear_test))
print()
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with linear fit",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear_test,y_linear_test,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear_test,y_pred_linear_test, 'k--', lw=2)
#Fitting data - simple polynomial regression (3rd degree)
poly = PolynomialFeatures(3,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print()
print()
print("3rd degree polynomial regression")
print()
print()
print(X_poly_feature_name)
print(len(X_poly_feature_name))
print()
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
print()
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
print()
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
#Testing the model
test = PolynomialFeatures(3,include_bias=False)
X_test=test.fit_transform(Xt)
X_test_feature_name = test.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_test_feature_name)
print(len(X_test_feature_name))
print()
df_test = pd.DataFrame(X_test, columns=X_test_feature_name)
print(df_test.head())
print()
df_test['y']=dft['Full_Capacity']
#Data to test the polynomial models
X_testo=df_test.drop('y',axis=1)
y_testo=df_test['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
y_poly_test = np.array(poly.predict(X_testo))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
print()
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_train, y_poly)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_train, y_poly)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print()
MAE_poly_test = mean_absolute_error(y_testo, y_poly_test)
print("Mean absolute error of simple polynomial model (validation):",MAE_poly_test)
MSE_poly_test = mean_squared_error(y_testo, y_poly_test)
print("Mean-squared error of simple polynomial model (validation):",MSE_poly_test)
RMSE_poly_test = np.sqrt(MSE_poly_test)
print("Root-mean-squared error of simple polynomial model (validation):",RMSE_poly_test)
print()
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
print ("R2 value of simple polynomial model (validation):",model_poly.score(X_testo,y_testo))
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with simple polynomial model",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly,y_poly, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Simple polynomial model",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly_test,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly_test,y_poly_test, 'k--', lw=2)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, fit_intercept = True, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
y_pred2 = np.array(model1.predict(X_testo))
print()
print()
print("3rd degree polynomial regression with regularization and cross-validation")
print()
print()
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print()
print(coeff1[coeff1['Coefficients Metamodel']!=0])
print("Intercept of the new polynomial model:",model1.intercept_)
print()
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_train, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_train, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
print()
MAE_1_test = mean_absolute_error(y_testo, y_pred2)
print("Mean absolute error of the new polynomial model (validation):",MAE_1_test)
MSE_1_test = mean_squared_error(y_testo, y_pred2)
print("Mean-squared error of the new polynomial model (validation):",MSE_1_test)
RMSE_1_test = np.sqrt(MSE_1_test)
print("Root-mean-squared error of the new polynomial model (validation):",RMSE_1_test)
print()
print ("R2 value of the new polynomial model:",model1.score(X_train,y_train))
print ("R2 value of the new polynomial model (validation):",model1.score(X_testo,y_testo))
print ("Alpha of the new polynomial model:",model1.alpha_)
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Metamodel",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred2,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred2,y_pred2, 'k--', lw=2)
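A negative R^2 on held-out data is not necessarily a sign of broken code. R^2 is 1 - SS_res/SS_tot, where SS_tot is measured around the mean of the data being scored, so it goes below zero whenever the model predicts worse than a constant equal to the test-set mean. The sketch below uses the test Full_Capacity values from the question; the helper name `r2_score_manual` and the constant offset of 10 are illustrative assumptions, not part of the original code.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Test targets copied from the question
y_test = np.array([1777.880947, 1792.21646, 1785.653845, 1788.401923,
                   1782.983718, 1793.939504, 1788.67233])

# Baseline: always predict the mean of the test targets
r2_mean = r2_score_manual(y_test, np.full_like(y_test, y_test.mean()))

# Hypothetical model whose predictions are all off by 10 units
r2_biased = r2_score_manual(y_test, y_test + 10.0)

print("R^2 of the mean baseline:", r2_mean)
print("R^2 with a constant offset of 10:", r2_biased)
```

Because the test targets span only about 16 units, an error of 10 units (under 1% of the target value) is already enough to push R^2 well below zero. With seven test points that vary this little, MAE and RMSE may be more informative than R^2.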

Related

How to code ordinary least squares linear regression for a 3D scatterplot in Python?

This is for a 3D scatterplot.
The instructions state:
Now that we have fit our model, which means that we have computed the optimal model parameters, we can use our model to plot the regression line for the data. Below, I supply you with x_fit and y_fit that represent the x- and y-data of the regression line, respectively. All we need to do next is ask the model to predict a z_fit value for each x_fit and y_fit pair by invoking the model's predict() method. This should make sense when you consider the ordinary least squares linear regression equation for calculating z_fit:
z_fit = theta1 * x_fit + theta2 * y_fit + theta3
where theta1, theta2 and theta3 are the computed model parameters. You must use x_fit and y_fit as features to be passed together as a DataFrame to the model's predict() method, which will return z_fit as determined by the above equation. Once you obtain z_fit, you are ready to plot the regression line by plotting it against x_fit and y_fit.
My code so far is:
features = df[['x', 'y']]
response = df['z']
#perform linear regression model fitting
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(features, response)
model
print("Model Coefficients: ", model.coef_)
print("Model Intercept : ", model.intercept_)
Output:
Model Coefficients: [ 8.02602396 15.95639323]
Model Intercept : -8.632876775060822
This is where I'm stuck:
# Plot Curve Fit
x_fit = np.linspace(0,21,1000)
y_fit = x_fit
#ENTER CODE HERE#
Assuming
theta1, theta2 = model.coef_
theta3 = model.intercept_
Equation
z_fit = (theta1 * x_fit) + (theta2 * y_fit) + theta3
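The instructions ask for z_fit to come from the model's predict() method rather than from the hand-written equation. A minimal sketch of that step follows. The DataFrame `df` built here is hypothetical stand-in data (the question's data is not shown), and the name `fit_features` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the question's df, which is not shown
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 21, 50), "y": rng.uniform(0, 21, 50)})
df["z"] = 8.0 * df["x"] + 16.0 * df["y"] - 8.6 + rng.normal(0, 1, 50)

model = LinearRegression().fit(df[["x", "y"]], df["z"])

# x- and y-data of the regression line, as supplied in the exercise
x_fit = np.linspace(0, 21, 1000)
y_fit = x_fit

# Pass both features together as a DataFrame with the training column names
fit_features = pd.DataFrame({"x": x_fit, "y": y_fit})
z_fit = model.predict(fit_features)

# Cross-check against the explicit equation from the answer above
theta1, theta2 = model.coef_
theta3 = model.intercept_
z_manual = theta1 * x_fit + theta2 * y_fit + theta3
print(np.allclose(z_fit, z_manual))
```

Both routes give the same values, so either can be plotted. On a matplotlib 3D axes (for example one created with `fig.add_subplot(projection='3d')`), `ax.plot(x_fit, y_fit, z_fit)` would draw the line.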

Issue when plotting polynomial regression curve

I have run linear and polynomial regression, but when I plot the results of the polynomial regression, the plot shows multiple lines instead of a single curve. It could be a very simple code issue, but I cannot resolve it.
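from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
pre_process = PolynomialFeatures(degree = 5)
# Expand the training inputs into polynomial features
# (train_x is an assumed name; this step is not shown in the original snippet)
data_train_x = pre_process.fit_transform(train_x)
poly_model = LinearRegression()
poly_model.fit(data_train_x,train_y)
print(poly_model.coef_)
print(poly_model.intercept_)
# Prediction on validation dataset (reuse the transformer fitted on the training data)
data_valid_x = pre_process.transform(valid_x)
poly_pred = poly_model.predict(data_valid_x)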
plt.scatter(valid_x, valid_y, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(valid_x, poly_pred)
plt.show()
This is the result of the polynomial regression fit:
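A likely cause of the "multiple lines" is that valid_x is not sorted: plt.plot connects points in the order they are given, so unsorted x values make the line jump back and forth. The sketch below sorts the validation inputs before predicting. It assumes a single input feature and uses synthetic data, since the question's data is not shown. The names `train_x`, `order`, `valid_x_sorted` and `poly_pred_sorted` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic single-feature data standing in for the question's arrays
rng = np.random.default_rng(0)
train_x = rng.uniform(-3, 3, size=(80, 1))
train_y = train_x.ravel() ** 3 - 2 * train_x.ravel() + rng.normal(0, 1, 80)
valid_x = rng.uniform(-3, 3, size=(30, 1))  # deliberately unsorted

pre_process = PolynomialFeatures(degree=5)
poly_model = LinearRegression().fit(pre_process.fit_transform(train_x), train_y)

# Sort the validation inputs so the curve is drawn from left to right
order = np.argsort(valid_x.ravel())
valid_x_sorted = valid_x[order]
poly_pred_sorted = poly_model.predict(pre_process.transform(valid_x_sorted))

print(valid_x_sorted[:3].ravel(), poly_pred_sorted[:3])
```

Calling `plt.plot(valid_x_sorted, poly_pred_sorted)` on these arrays should then draw one curve. Using `plt.scatter` for the predictions would also avoid the connecting lines.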

Difference between ridge regression and SVM regressor (SVR) with polynomial kernel of degree = 1

I am trying to build a model for an application.
I have used both ridge regression and SVR from sklearn, and they seem to behave differently even though I tried to keep the parameters the same.
I set the regularization parameter to 1 in both models (they both use L2 regularization).
There is an extra parameter for the poly kernel (coef0), which I set to zero.
The data are standardized.
from sklearn.linear_model import Ridge
linear_ridge = Ridge(alpha=1.0) # L2 regularization
linear_ridge.fit(np.array(X_train) , np.array(y_train))
from sklearn import svm
model_SVR_poly = svm.SVR(kernel = 'poly' , coef0=0.0 , degree = 1, C = 1.0 , epsilon = 0.1 ) #L2 regularization
model_SVR_poly.fit(np.array(X_train) , np.array(y_train))
Linear_ridge_pred = linear_ridge.predict(test_data[start_data:]) *Y_std[0] + Y_mean[0]
svr_poly_pred = model_SVR_poly.predict(test_data[start_data:]) *Y_std[0] + Y_mean[0]
If epsilon is decreased to 0.0, the SVR undershoots more than the ridge; if it is increased, the SVR overshoots more.
In the testing phase, the Ridge seems to undershoot while the SVR seems to overshoot.
What is the difference between the two implementations, in my case or in general?
As you point out, there are some differences between the implementations of Ridge() and SVR().
First, the loss functions differ: the SVR models use the epsilon-insensitive loss or the squared epsilon-insensitive loss, whereas Ridge uses a squared-error loss. An example in the sklearn documentation also emphasizes this, although it compares Kernel Ridge Regression and SVR with a non-linear kernel.
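To make the difference in loss concrete, here is a small sketch that evaluates both per-sample losses on the same residuals. The function names and the residual values are illustrative assumptions; epsilon=0.1 matches the value used in the question.

```python
import numpy as np

def squared_loss(residual):
    """Per-sample data-fit term used by Ridge."""
    return residual ** 2

def epsilon_insensitive_loss(residual, epsilon=0.1):
    """Per-sample data-fit term used by SVR: zero inside the epsilon tube."""
    return np.maximum(0.0, np.abs(residual) - epsilon)

# Illustrative residuals (true value minus prediction)
residuals = np.array([-2.0, -0.05, 0.0, 0.05, 0.5, 2.0])
sq = squared_loss(residuals)
eps = epsilon_insensitive_loss(residuals, epsilon=0.1)

for r, a, b in zip(residuals, sq, eps):
    print(f"residual={r:+.2f}  squared={a:.4f}  eps-insensitive={b:.4f}")
```

Residuals inside the tube cost nothing under the epsilon-insensitive loss, and large residuals grow linearly rather than quadratically, so outliers pull the SVR fit less than they pull the Ridge fit. The regularization strengths are also parameterized differently: Ridge multiplies the penalty by alpha, while SVR multiplies the data-fit term by C, so alpha=1 and C=1 are not guaranteed to give the same trade-off.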
In addition, using SVR with a polynomial kernel of degree 1 adds a further difference: SVR is built on top of the LibSVM library, whose polynomial kernel has one more parameter to consider, gamma (you might set it to 1 for convenience; it defaults to 'scale').
Here is the difference in fit that I get by adapting this toy example (with untuned parameters). I have also included LinearSVR(), which differs from SVR() in a few further ways.
print(__doc__)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR, SVR
import matplotlib.pyplot as plt
np.random.seed(42)
# #############################################################################
# Generate sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# #############################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
# #############################################################################
# Fit regression model
svr_lin = SVR(kernel='linear', C=1, tol=1e-5)
svr_lins = LinearSVR(loss='squared_epsilon_insensitive', C=1, tol=1e-5, random_state=42)
svr_poly = SVR(kernel='poly', C=1, degree=1, gamma=1, tol=1e-5, coef0=0.0)
ridge = Ridge(alpha=1, random_state=42)
y_lin = svr_lin.fit(X, y).predict(X)
y_lins = svr_lins.fit(X, y).predict(X)
y_poly = svr_poly.fit(X, y).predict(X)
y_ridge = ridge.fit(X, y).predict(X)
coef_y_lin, intercept_y_lin = svr_lin.coef_, svr_lin.intercept_
coef_y_lins, intercept_y_lins = svr_lins.coef_, svr_lins.intercept_
coef_y_ridge, intercept_y_ridge = ridge.coef_, ridge.intercept_
# #############################################################################
# Look at the results
lw = 2
plt.figure(figsize=(10,5))
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_lins, color='navy', lw=lw, label='Linear model (LinearSVR) %s, %s' %
(coef_y_lins, intercept_y_lins))
plt.plot(X, y_lin, color='red', lw=lw, label='Linear model (SVR) %s, %s' % (coef_y_lin, intercept_y_lin))
plt.plot(X, y_poly, color='cornflowerblue', lw=lw, label='Polynomial model of degree 1 (SVR)')
plt.plot(X, y_ridge, color='g', lw=lw, label='Ridge %s, %s' % (coef_y_ridge, intercept_y_ridge))
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.axis([0, 5, -1, 1.5])

Overfitting the Lasso regression model to the data

I wanted to create a model that describes the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation.
I found that the last method allows automatic feature selection (including the polynomial degrees), which I actually need because simple linear regression does not perform well. I followed this explanation to perform polynomial regression with Lasso regularization and cross-validation.
In that example, the method is used to avoid the overfitting that happens with simple polynomial regression. In my case, however, it seems to have caused overfitting instead.
Could somebody help me understand what I am doing wrong in the code? Or is there a better way to fit the data to a model?
Code (linear regression with statsmodels, polynomial regressions with scikit learn):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949, 0.026217822, 0.032549629, 0.025744309, 0.027945824],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131, 196.610972, 194.915587, 183.209067, 182.41669],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046, 1777.880947, 1792.21646, 1785.653845, 1788.401923]
}
Test = {'Cycle': [33, 34, 35],
'Internal_Resistance': [0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [204.018257, 179.929472, 189.576431],
'Full_Capacity': [1782.983718, 1793.939504, 1788.67233]
}
#Initial data presented in a form of a data frame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head(32))
print()
print(X)
print()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
# Fitting data with scikit learn - simple linear regression
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
#Metrics of the linear model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
#Coefficients for the linear model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
#Fitting data with a simple polynomial model
poly = PolynomialFeatures(2,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_poly_feature_name)
print(len(X_poly_feature_name))
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_poly, y_train)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_poly, y_train)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_pred1, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_pred1, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print ("R2 value of the new polynomial model:",model1.score(X_train,y_pred1))
print ("Alpha of the new polynomial model:",model1.alpha_)
print(coeff1[coeff1['Coefficients Metamodel']!=0])
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
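One thing worth checking in the code above: the Lasso metrics are computed as mean_absolute_error(y_pred1, y_pred1), mean_squared_error(y_pred1, y_pred1) and model1.score(X_train, y_pred1), all of which compare the predictions with themselves. That always gives zero error and an R^2 of 1, which can look like extreme overfitting. The sketch below shows the effect with the first four Full_Capacity values from the question and hypothetical predictions (`y_pred` is made up for illustration).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First four targets from the question; the predictions are hypothetical
y_true = np.array([1703.8575, 1740.7017, 1760.66, 1775.248302])
y_pred = np.array([1720.0, 1745.0, 1755.0, 1770.0])

# Comparing predictions with themselves: trivially "perfect"
mae_self = mean_absolute_error(y_pred, y_pred)
r2_self = r2_score(y_pred, y_pred)

# Comparing against the true targets: the meaningful numbers
mae_real = mean_absolute_error(y_true, y_pred)
r2_real = r2_score(y_true, y_pred)

print("self-comparison:", mae_self, r2_self)
print("against targets:", mae_real, r2_real)
```

Under that reading, passing y_train as the true values (for example `mean_absolute_error(y_train, y_pred1)` and `model1.score(X_train, y_train)`) would report the actual training fit.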
I found that a simple polynomial with a single interaction term gives an OK fit. Note that a 3D scatterplot of the SoH data, without using "Cycle", shows that some areas would benefit from additional data to characterize the response surface:
def func(Internal_Resistance, CV_Capacity):
    # The function name is illustrative; the coefficients are the fitted values
    a = 1.6708148450040499E+03
    b = 6.5825133247934986E-01
    c = 4.8477389499541523E+03
    d = -2.7015882838321772E+01
    temp = a
    temp += b * CV_Capacity
    temp += c * Internal_Resistance
    temp += d * Internal_Resistance * CV_Capacity
    return temp
Lasso is a regularization method that can be used to avoid overfitting.
It adds a term to the loss function that acts as a constraint on the weights, so the loss has two terms: one that is responsible for fitting the data, and a regularization term.
In addition, a constant controls the trade-off between those two terms. In your case you may need to increase the strength of the regularization term (increase the constant) to avoid overfitting.
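The effect of that constant can be sketched with sklearn's Lasso, where it is the `alpha` parameter. The data below is synthetic and only for illustration (ten features, of which two actually drive the target); the alpha values are arbitrary assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features matter
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 40)

l1_norms = {}
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    l1_norms[alpha] = float(np.sum(np.abs(model.coef_)))
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))

for alpha in l1_norms:
    print(f"alpha={alpha}: sum|coef|={l1_norms[alpha]:.3f}, "
          f"non-zero coefficients={nonzero_counts[alpha]}")
```

A larger alpha shrinks the coefficients and tends to set more of them exactly to zero. LassoCV chooses alpha by cross-validation from a grid, and a custom grid can be supplied through its `alphas` argument if the automatically generated one does not reach strong enough regularization.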

How to calculate OLS regression with Survey Weights in Python.

I want to do a linear regression on survey data with survey weights.
The survey data is from the EU and each observation has a weight (0.4 for one respondent, 1.5 for another).
This weight is described as:
"The European Weight, variable 6, produces a representative sample of
the European Community as a whole when used in analysis. This variable
adjusts the size of each national sample according to each nation's
contribution to the population of the European Community."
To do my calculation I'm using sklearn.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X,y, sample_weight = weights)
X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.
Am I using 'sample_weight' correctly? Is this the correct way to handle survey weights in scikit?
TL;DR: Yes.
Here is a very simple example of it working:
import numpy as np
import matplotlib.pylab as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])
def weighted_lr(X, y, weights):
    """Quick function to run weighted linear regression and return a
    plot and some predictions"""
    regr.fit(X,y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred
y_pred = weighted_lr(X, y, weights)
print(y_pred)
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)
[[ 7.14285714]
[ 24.28571429]
[ 58.57142857]]
[[ 9.96051333]
[ 20.05923001]
[ 40.25666338]]
In the first model, with equal weights, the fit behaves as expected from an ordinary linear regression.
In the second model, however, the last value has a very low weight relative to the other two and is almost ignored: the fit is determined mostly by the first two values.
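As a cross-check, the coefficients that sklearn returns with sample_weight can be compared with the closed-form weighted least squares solution, which solves (A^T W A) beta = A^T W y. The sketch below reuses the data from the example above. The names `A`, `W`, `beta` and `sk` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1.0, 2.0, 4.0]).reshape(-1, 1)
y = np.array([10.0, 20.0, 60.0])
weights = np.array([1000.0, 1000.0, 1.0])

# sklearn fit with per-observation weights
sk = LinearRegression().fit(X, y, sample_weight=weights)

# Closed-form weighted least squares with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X.ravel()])
W = np.diag(weights)
beta = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

print("sklearn  :", sk.intercept_, sk.coef_[0])
print("WLS solve:", beta[0], beta[1])
```

The two agree, which supports reading sample_weight as weighted least squares for the point estimates. Note that this only checks the coefficients: sklearn's LinearRegression does not report standard errors, so inference on weighted survey data needs a different tool.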
