I obtain negative R^2 regardless of what regression model I use - python
I am trying to build a model that describes the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation. I followed this explanation to perform the regressions.
The problem is that every model gives a negative R^2 on the test data. I tried 1st-, 2nd- and 3rd-degree polynomial models; with the higher degrees the result gets even worse.
Could somebody help me figure out what is wrong? Or which model could I use to get rid of the negative R^2 and obtain a normal one?
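For context on what a negative test R^2 means, here is a minimal sketch of the usual definition, R^2 = 1 - SS_res / SS_tot. It uses the seven test values of Full_Capacity from the code below; the helper name `r_squared` and the "shifted" predictions are made up for illustration and are not taken from any of the fitted models.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of "always predict the mean"
    return 1.0 - ss_res / ss_tot

# Full_Capacity values of the test set (cycles 29 to 35)
y_test = np.array([1777.880947, 1792.21646, 1785.653845, 1788.401923,
                   1782.983718, 1793.939504, 1788.67233])

# Baseline: predict the test mean everywhere, which gives R^2 = 0
baseline = np.full_like(y_test, y_test.mean())

# Hypothetical predictions that are systematically 10 units too high
shifted = y_test + 10.0

print("R^2 of the mean baseline:", r_squared(y_test, baseline))
print("R^2 of the shifted predictions:", r_squared(y_test, shifted))
```

The test targets vary by only a few units around their mean, so SS_tot is small. Any prediction error larger than that spread makes SS_res exceed SS_tot, and R^2 drops below zero: the model does worse on the test set than a constant prediction of the test mean.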
[Images of the results, not reproduced here: OLS summary for the simple linear regression; MAE, MSE, RMSE and R^2 for the simple linear regression, for the simple polynomial regression, and for the polynomial regression with regularization and cross-validation.]
Code:
import numpy as np  # needed for np.sqrt and np.array below
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data (Three independent variables - Cycle, Internal Resistance and CV Capacity; One dependent - Full Capacity)
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046]
}
#Data to test the model
Test = {'Cycle': [29, 30, 31, 32, 33, 34, 35],
'Internal_Resistance': [0.026217822, 0.032549629, 0.025744309, 0.027945824, 0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [196.610972, 194.915587, 183.209067, 182.41669, 204.018257, 179.929472, 189.576431],
'Full_Capacity': [1777.880947, 1792.21646, 1785.653845, 1788.401923, 1782.983718, 1793.939504, 1788.67233]
}
#Convert initial data into DataFrame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head())
print()
#Convert data to test the model into DataFrame
dft = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
dft1 = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity'])
Xt = dft1.to_numpy()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
#STATSMODELS
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
#SCIKIT LEARN (Simple polynomial regression and polynomial regression with regularization and cross-validation)
# Fitting data - simple polynomial regression (1st degree)
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
X_linear_test=dft.drop('Full_Capacity',axis=1)
y_linear_test=dft['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
y_pred_linear_test = linear_model.predict(X_linear_test)
#Coefficients for the model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print()
#Metrics of the model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
print()
MAE_linear_test = mean_absolute_error(y_linear_test, y_pred_linear_test)
print("Mean absolute error of linear model (validation):",MAE_linear_test)
MSE_linear_test = mean_squared_error(y_linear_test, y_pred_linear_test)
print("Mean-squared error of linear model (validation):",MSE_linear_test)
RMSE_linear_test = np.sqrt(MSE_linear_test)
print("Root-mean-squared error of linear model (validation):",RMSE_linear_test)
print()
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
print ("R2 value of linear model (validation):",linear_model.score(X_linear_test,y_linear_test))
print()
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with linear fit",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear_test,y_linear_test,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear_test,y_pred_linear_test, 'k--', lw=2)
#Fitting data - simple polynomial regression (3rd degree)
poly = PolynomialFeatures(3,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print()
print()
print("3rd degree polynomial regression")
print()
print()
print(X_poly_feature_name)
print(len(X_poly_feature_name))
print()
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
print()
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
print()
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
#Testing the model
test = PolynomialFeatures(3,include_bias=False)
X_test=test.fit_transform(Xt)
X_test_feature_name = test.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_test_feature_name)
print(len(X_test_feature_name))
print()
df_test = pd.DataFrame(X_test, columns=X_test_feature_name)
print(df_test.head())
print()
df_test['y']=dft['Full_Capacity']
#Data to test the polynomial models
X_testo=df_test.drop('y',axis=1)
y_testo=df_test['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
y_poly_test = np.array(poly.predict(X_testo))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
print()
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_train, y_poly)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_train, y_poly)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print()
MAE_poly_test = mean_absolute_error(y_testo, y_poly_test)
print("Mean absolute error of simple polynomial model (validation):",MAE_poly_test)
MSE_poly_test = mean_squared_error(y_testo, y_poly_test)
print("Mean-squared error of simple polynomial model (validation):",MSE_poly_test)
RMSE_poly_test = np.sqrt(MSE_poly_test)
print("Root-mean-squared error of simple polynomial model (validation):",RMSE_poly_test)
print()
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
print ("R2 value of simple polynomial model (validation):",model_poly.score(X_testo,y_testo))
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with simple polynomial model",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly,y_poly, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Simple polynomial model",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly_test,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly_test,y_poly_test, 'k--', lw=2)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, fit_intercept = True, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
y_pred2 = np.array(model1.predict(X_testo))
print()
print()
print("3rd degree polynomial regression with regularization and cross-validation")
print()
print()
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print()
print(coeff1[coeff1['Coefficients Metamodel']!=0])
print("Intercept of the new polynomial model:",model1.intercept_)
print()
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_train, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_train, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
print()
MAE_1_test = mean_absolute_error(y_testo, y_pred2)
print("Mean absolute error of the new polynomial model (validation):",MAE_1_test)
MSE_1_test = mean_squared_error(y_testo, y_pred2)
print("Mean-squared error of the new polynomial model (validation):",MSE_1_test)
RMSE_1_test = np.sqrt(MSE_1_test)
print("Root-mean-squared error of the new polynomial model (validation):",RMSE_1_test)
print()
print ("R2 value of the new polynomial model:",model1.score(X_train,y_train))
print ("R2 value of the new polynomial model (validation):",model1.score(X_testo,y_testo))
print ("Alpha of the new polynomial model:",model1.alpha_)
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Metamodel",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred2,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred2,y_pred2, 'k--', lw=2)
[1]: https://i.stack.imgur.com/AhSwJ.png
Related
How to code ordinary least squares linear regression for a 3D scatterplot in Python?
This is for a 3D scatterplot. The instructions state:
Now that we have fit our model, which means that we have computed the optimal model parameters, we can use our model to plot the regression line for the data. Below, I supply you with x_fit and y_fit that represent the x- and y-data of the regression line, respectively. All we need to do next is ask the model to predict a z_fit value for each x_fit and y_fit pair by invoking the model's predict() method. This should make sense when you consider the ordinary least squares linear regression equation for calculating z_fit: z_fit = θ1·x_fit + θ2·y_fit + θ3, where θ1, θ2 and θ3 are the computed model parameters. You must use x_fit and y_fit as features to be passed together as a DataFrame to the model's predict() method, which will return z_fit as determined by the above equation. Once you obtain z_fit, you are ready to plot the regression line by plotting it against x_fit and y_fit.
My code so far is:
features = df[['x', 'y']]
response = df['z']
#perform linear regression model fitting
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(features, response)
model
print("Model Coefficients: ", model.coef_)
print("Model Intercept : ", model.intercept_)
Output:
Model Coefficients: [ 8.02602396 15.95639323]
Model Intercept : -8.632876775060822
This is where I'm stuck:
# Plot Curve Fit
x_fit = np.linspace(0,21,1000)
y_fit = x_fit
#ENTER CODE HERE#
# Plot Curve Fit
x_fit = np.linspace(0,21,1000)
y_fit = x_fit
#ENTER CODE HERE#
Assuming
theta1, theta2 = model.coef_
theta3 = model.intercept_
Equation
z_fit = (theta1 * x_fit) + (theta2 * y_fit) + theta3
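The instructions quoted in the question ask for x_fit and y_fit to be passed together as a DataFrame to predict(). A minimal sketch of that route is shown here, checked against the manual equation from the answer. The DataFrame `df` is a synthetic stand-in for the asker's data, and the name `fit_features` is chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the asker's df with columns x, y, z (hypothetical data)
rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.uniform(0, 21, 50), "y": rng.uniform(0, 21, 50)})
df["z"] = 8.0 * df["x"] + 16.0 * df["y"] - 8.6 + rng.normal(0, 1, 50)

model = LinearRegression().fit(df[["x", "y"]], df["z"])

# Points along the regression line
x_fit = np.linspace(0, 21, 1000)
y_fit = x_fit

# Use the same column names as during fitting so predict() accepts the frame
fit_features = pd.DataFrame({"x": x_fit, "y": y_fit})
z_fit = model.predict(fit_features)

# Same result computed by hand from the fitted parameters
theta1, theta2 = model.coef_
theta3 = model.intercept_
z_manual = theta1 * x_fit + theta2 * y_fit + theta3
print("predict() matches the manual equation:", np.allclose(z_fit, z_manual))
```

With z_fit available, the line can be drawn on a 3D axis, for example with `ax.plot(x_fit, y_fit, z_fit)` on an axis created with `projection='3d'`.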
Issue when plotting polynomial regression curve
I have run linear and polynomial regression, but when I plot the results of the polynomial regression, the plot shows multiple lines instead of a single curve. It could be a very simple code issue, but I cannot resolve it.
from sklearn.preprocessing import PolynomialFeatures
pre_process = PolynomialFeatures(degree = 5)
poly_model = LinearRegression()
poly_model.fit(data_train_x,train_y)
print(poly_model.coef_)
print(poly_model.intercept_)
# Prediction on validation dataset
data_valid_x = pre_process.fit_transform(valid_x)
poly_pred = poly_model.predict(data_valid_x)
plt.scatter(valid_x, valid_y, facecolor='None', edgecolor='k', alpha=0.3)
plt.plot(valid_x, poly_pred)
plt.show()
This is the result of the polynomial regression fit:
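One common cause of this symptom is that `plt.plot` connects the points in the order they appear, so unsorted x values produce a zig-zag of line segments. The thread does not confirm that this is the cause here, so the following is only a sketch under that assumption. The data are synthetic, and `train_x`, `order`, `valid_x_sorted` and `poly_pred_sorted` are names made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a single feature, in random (unsorted) order
rng = np.random.default_rng(0)
train_x = rng.uniform(-3, 3, size=(80, 1))
train_y = train_x[:, 0] ** 3 - train_x[:, 0] + rng.normal(0, 1, 80)
valid_x = rng.uniform(-3, 3, size=(30, 1))

pre_process = PolynomialFeatures(degree=5)
poly_model = LinearRegression()
poly_model.fit(pre_process.fit_transform(train_x), train_y)

# Sort the validation points by x before predicting and plotting
order = np.argsort(valid_x[:, 0])
valid_x_sorted = valid_x[order]
poly_pred_sorted = poly_model.predict(pre_process.transform(valid_x_sorted))

print(valid_x_sorted[:3, 0], poly_pred_sorted[:3])
```

Plotting `plt.plot(valid_x_sorted[:, 0], poly_pred_sorted)` then draws the fitted polynomial as one curve from left to right.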
Difference between ridge regression and SVM regressor (SVR) with polynomial kernel of degree = 1
I am trying to build a model for an application. I have used both ridge regression and SVR from sklearn, and they seem to behave differently although I tried to keep the parameters the same.
I have set the regularization parameter to 1 in both models (they both have L2 regularization).
There is an extra parameter for the poly kernel, which I set to zero.
The data are standardized.
from sklearn.linear_model import Ridge
linear_ridge = Ridge(alpha=1.0) # L2 regularization
linear_ridge.fit(np.array(X_train) , np.array(y_train))
from sklearn import svm
model_SVR_poly = svm.SVR(kernel = 'poly' , coef0=0.0 , degree = 1, C = 1.0 , epsilon = 0.1 ) #L2 regularization
model_SVR_poly.fit(np.array(X_train) , np.array(y_train))
Linear_ridge_pred = linear_ridge.predict(test_data[start_data:]) *Y_std[0] + Y_mean[0]
svr_poly_pred = model_SVR_poly.predict(test_data[start_data:]) *Y_std[0] + Y_mean[0]
If the value of epsilon is decreased to 0.0, the SVR undershoots more than the ridge, and if it is increased, it overshoots more. In the testing phase, the Ridge seems to undershoot while the SVR seems to overshoot.
What is the difference between the two implementations, in my case or in general?
To me, there are indeed some differences between the implementations of Ridge() and SVR(), as you point out.
On one side, there is a difference in the loss function, as you can see here (epsilon-insensitive loss and squared epsilon-insensitive loss) vs here (Ridge loss). This is also emphasized in this example from the sklearn documentation, which however compares Kernel Ridge Regression and SVR with a non-linear kernel.
In addition, using SVR with a polynomial kernel of degree 1 adds a further difference: as you can see here and here (SVR is built on top of the LibSVM library), there is a further parameter (gamma) to be considered (you might set it to 1 for convenience; it equals 'scale' by default).
Here is the difference in fitting that I could get by adjusting this toy example (with non-tuned parameters). I have also tried LinearSVR(), which has some further differences with respect to SVR(), as you can see e.g. here or here.
print(__doc__)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR, SVR
import matplotlib.pyplot as plt
np.random.seed(42)
# #############################################################################
# Generate sample data
X = np.sort(5 * np.random.rand(40, 1), axis=0)
y = np.sin(X).ravel()
# #############################################################################
# Add noise to targets
y[::5] += 3 * (0.5 - np.random.rand(8))
# #############################################################################
# Fit regression model
svr_lin = SVR(kernel='linear', C=1, tol=1e-5)
svr_lins = LinearSVR(loss='squared_epsilon_insensitive', C=1, tol=1e-5, random_state=42)
svr_poly = SVR(kernel='poly', C=1, degree=1, gamma=1, tol=1e-5, coef0=0.0)
ridge = Ridge(alpha=1, random_state=42)
y_lin = svr_lin.fit(X, y).predict(X)
y_lins = svr_lins.fit(X, y).predict(X)
y_poly = svr_poly.fit(X, y).predict(X)
y_ridge = ridge.fit(X, y).predict(X)
coef_y_lin, intercept_y_lin = svr_lin.coef_, svr_lin.intercept_
coef_y_lins, intercept_y_lins = svr_lins.coef_, svr_lins.intercept_
coef_y_ridge, intercept_y_ridge = ridge.coef_, ridge.intercept_
# #############################################################################
# Look at the results
lw = 2
plt.figure(figsize=(10,5))
plt.scatter(X, y, color='darkorange', label='data')
plt.plot(X, y_lins, color='navy', lw=lw, label='Linear model (LinearSVR) %s, %s' % (coef_y_lins, intercept_y_lins))
plt.plot(X, y_lin, color='red', lw=lw, label='Linear model (SVR) %s, %s' % (coef_y_lin, intercept_y_lin))
plt.plot(X, y_poly, color='cornflowerblue', lw=lw, label='Polynomial model of degree 1 (SVR)')
plt.plot(X, y_ridge, color='g', lw=lw, label='Ridge %s, %s' % (coef_y_ridge, intercept_y_ridge))
plt.xlabel('data')
plt.ylabel('target')
plt.title('Support Vector Regression')
plt.legend()
plt.axis([0, 5, -1, 1.5])
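To make the difference in loss functions concrete, the two objectives can be written side by side. This is a sketch in standard textbook notation (the symbols w, b, phi, gamma, c_0 and d are chosen here and do not appear in the thread); the exact scaling of each term in a given library version is best checked against its documentation.

```latex
% Ridge regression: squared error plus an L2 penalty weighted by alpha
\min_{w,\,b}\; \sum_{i=1}^{n} \bigl(y_i - w^\top x_i - b\bigr)^2 \;+\; \alpha\,\lVert w \rVert_2^2

% epsilon-SVR: L2 penalty plus the epsilon-insensitive loss weighted by C
\min_{w,\,b}\; \tfrac{1}{2}\,\lVert w \rVert_2^2 \;+\; C \sum_{i=1}^{n}
  \max\bigl(0,\; \lvert y_i - w^\top \phi(x_i) - b \rvert - \varepsilon\bigr)

% Polynomial kernel; with d = 1 and c_0 = 0 it reduces to gamma * x^T x'
K(x, x') = \bigl(\gamma\, x^\top x' + c_0\bigr)^{d}
```

Two things follow from this form. Residuals smaller than epsilon cost nothing in SVR and larger ones are penalized linearly, whereas Ridge penalizes every residual quadratically. Also, C multiplies the data term while alpha multiplies the penalty, so C acts as an inverse regularization strength, and C = 1 and alpha = 1 need not correspond to the same amount of regularization.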
Overfitting the Lasso regression model to the data
I wanted to create a model that would describe the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation. I found that the last method allows automatic feature selection (with degrees), which I actually need because the simple linear regression does not perform well. I followed this explanation to perform polynomial regression with Lasso regularization and cross-validation.
In that example, the method is used to avoid the overfitting that happens with a simple polynomial regression. In my case, however, it seems to have caused overfitting instead.
Could somebody help me understand what I am doing wrong in the code implementation? Or is there a better way to fit the data to a model?
Code (linear regression with statsmodels, polynomial regressions with scikit learn):
import numpy as np  # needed for np.sqrt and np.array below
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949, 0.026217822, 0.032549629, 0.025744309, 0.027945824],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131, 196.610972, 194.915587, 183.209067, 182.41669],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046, 1777.880947, 1792.21646, 1785.653845, 1788.401923]
}
Test = {'Cycle': [33, 34, 35],
'Internal_Resistance': [0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [204.018257, 179.929472, 189.576431],
'Full_Capacity': [1782.983718, 1793.939504, 1788.67233]
}
#Initial data presented in a form of a data frame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head(32))
print()
print(X)
print()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
# Fitting data with scikit learn - simple linear regression
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
#Metrics of the linear model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
#Coefficients for the linear model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
#Fitting data with a simple polynomial model
poly = PolynomialFeatures(2,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_poly_feature_name)
print(len(X_poly_feature_name))
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_poly, y_train)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_poly, y_train)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_pred1, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_pred1, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print ("R2 value of the new polynomial model:",model1.score(X_train,y_pred1))
print ("Alpha of the new polynomial model:",model1.alpha_)
print(coeff1[coeff1['Coefficients Metamodel']!=0])
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
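A side note on the metric calls in the Lasso part of the code above: they compare `y_pred1` with itself (`mean_absolute_error(y_pred1, y_pred1)`, `model1.score(X_train, y_pred1)`), which reports zero error and an R^2 of 1 for any model, so the apparent perfect fit may come from the metric calls and not from the Lasso. A small sketch of the effect follows, using the first five Full_Capacity values and a deliberately poor, made-up predictor (the reversed targets).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# First five Full_Capacity values from the data above
y_train = np.array([1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053])

# Deliberately poor "predictions": the targets in reverse order (illustration only)
y_pred = y_train[::-1]

# Comparing predictions with themselves always looks perfect
print("MAE(pred, pred):", mean_absolute_error(y_pred, y_pred))
print("R^2(pred, pred):", r2_score(y_pred, y_pred))

# Comparing with the true targets reveals the actual quality
print("MAE(true, pred):", mean_absolute_error(y_train, y_pred))
print("R^2(true, pred):", r2_score(y_train, y_pred))
```

Passing `y_train` as the first argument, as the linear-model part of the same code already does, gives the real training error of the Lasso model.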
I found that a simple polynomial with a single interaction term gives an OK fit. Note that a 3D scatterplot of the SoH data, without using "Cycle", shows that there are some areas that would benefit from additional data to characterize the response surface:
# Fitted parameters
a = 1.6708148450040499E+03
b = 6.5825133247934986E-01
c = 4.8477389499541523E+03
d = -2.7015882838321772E+01
# The original snippet ends with a bare `return`; the function wrapper and its name are assumed here
def full_capacity(CV_Capacity, Internal_Resistance):
    temp = a
    temp += b * CV_Capacity
    temp += c * Internal_Resistance
    temp += d * Internal_Resistance * CV_Capacity
    return temp
Lasso is a regularization method that can be used to avoid overfitting. It adds a term to the loss function that acts as a constraint on the weights. The loss function then has two terms: one that is responsible for fitting the data, and a regularization term. In addition, there is a constant that controls the trade-off between these two terms. In your case, you may need to increase the strength of the regularization term (increase the constant) in order to avoid overfitting.
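The effect of the constant can be illustrated with a small sweep. This is only a sketch on synthetic data (not the battery data from the question), where just two of eight features actually matter. In scikit-learn the constant is the `alpha` parameter of `Lasso`; the alpha values are arbitrary picks.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of eight features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, 60)
X_scaled = StandardScaler().fit_transform(X)

alphas = [0.01, 0.1, 1.0, 10.0]
l1_norms = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_scaled, y)
    l1_norms.append(np.sum(np.abs(model.coef_)))   # total size of the weights
    n_nonzero = int(np.sum(model.coef_ != 0))      # features still in use
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients, L1 norm {l1_norms[-1]:.3f}")
```

Larger values of alpha shrink the weights and remove more features. When alpha is large enough, every coefficient is zero and the model only predicts the mean, so the constant has to be tuned, for example with `LassoCV`, and not simply maximized.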
How to calculate OLS regression with survey weights in Python
I want to do a linear regression on survey data with survey weights. The survey data is from the EU and each observation has a weight (0.4 for one respondent, 1.5 for another). This weight is described as:
"The European Weight, variable 6, produces a representative sample of the European Community as a whole when used in analysis. This variable adjusts the size of each national sample according to each nation's contribution to the population of the European Community."
To do my calculation I'm using sklearn.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X,y, sample_weight = weights)
X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.
Am I using 'sample_weight' correctly, and is this the correct way to handle survey weights in scikit-learn?
TL;DR: Yes.
Here is a very simple example of it working:
import numpy as np
import matplotlib.pylab as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])
def weighted_lr(X, y, weights):
    """Quick function to run weighted linear regression and return a plot and some predictions"""
    regr.fit(X,y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred
y_pred = weighted_lr(X, y, weights)
print(y_pred)
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)
Output:
[[ 7.14285714] [ 24.28571429] [ 58.57142857]]
[[ 9.96051333] [ 20.05923001] [ 40.25666338]]
In the first linear regression model, with equal weights, the model behaves as expected from a normal linear regression. In the second model, the low weight on the last value means the fit almost ignores that point; most of the training weight goes to the other two values.
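One way to build intuition for what `sample_weight` does to the fitted line: with integer weights, the weighted fit should coincide with an ordinary fit on data in which each row is repeated as many times as its weight. The sketch below checks this on the toy data from the answer; the weights `[3, 1, 2]` are arbitrary values chosen for the check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [4.0]])
y = np.array([10.0, 20.0, 60.0])
weights = np.array([3, 1, 2])  # arbitrary integer weights for the check

# Weighted fit on the three original rows
weighted = LinearRegression().fit(X, y, sample_weight=weights)

# Unweighted fit on rows repeated according to their weight
X_rep = np.repeat(X, weights, axis=0)
y_rep = np.repeat(y, weights)
repeated = LinearRegression().fit(X_rep, y_rep)

print("Same slope:", np.allclose(weighted.coef_, repeated.coef_))
print("Same intercept:", np.isclose(weighted.intercept_, repeated.intercept_))
```

This covers the coefficients only. `LinearRegression` does not report standard errors, so any inference that has to account for the survey design needs to be handled separately.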