How calculate OLS regression with Survey Weights in Python.

How calculate OLS regression with Survey Weights in Python. - python

I want to do a linear regression on survey data with survey weights.
The survey data is from the EU and each observation has a weight. (.4 for an one respondent, 1.5 for another.)
This weight is described as:
"The European Weight, variable 6, produces a representative sample of
the European Community as a whole when used in analysis. This variable
adjusts the size of each national sample according to each nation's
contribution to the population of the European Community."
To do my calculation I'm using sklearn.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X,y, sample_weight = weights)
X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.
Am I using 'sample_weight' correctly, does is this the correct way to handle survey weights in scikit?

TL DR; Yes.
Here is a very simple example of it working,
import numpy as np
import matplotlib.pylab as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])
def weighted_lr(X, y, weights):
"""Quick function to run weighted linear regression and return a
plot and some predictions"""
regr.fit(X,y, sample_weight=weights)
y_pred = regr.predict(X)
plt.scatter(X, y)
plt.plot(X, y_pred)
plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
plt.show()
return y_pred
y_pred = weighted_lr(X, y, weights)
print(y_pred)
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)
[[ 7.14285714]
[ 24.28571429]
[ 58.57142857]]
[[ 9.96051333]
[ 20.05923001]
[ 40.25666338]]
On the first linear regression model with even weights we see the model behave as expected from a normal linear regression model.
Next, however, we see that in the second model, with low weighing on the last value, almost ignores this last value. The majority of the training has been weighted to the other two values here.

Related

Scaling gives me worse result (lower R-squared) in Linear Regression

(I leave my dataset at the bottom line). I'm trying to use Linear Regression on a dataset where predictors are the product ID, weight, type, Outlet_Establishment_Year, etc and target variable is the Item_Outlet_Sales. I use R-squared as the metric. I think the predictors have different units so I'll need to scale them. If I do so:
X = cleaned_data.iloc[:, :-1] # predictors
X = pd.get_dummies(data = X, drop_first = True) # convert categorical variables to numerical variables
Y = cleaned_data.iloc[:, -1] # target
Then I scale the data, perform Linear Regression and calculate R-squared which yield 0.57 as a result:
from sklearn.preprocessing import StandardScaler
concat_data = pd.concat([X, Y], axis = 1)
scaled_data = StandardScaler().fit_transform(concat_data)
X_scaled = scaled_data[:, :-1]
Y_scaled = scaled_data[:, -1]
print(X_scaled.shape, Y_scaled.shape)
from sklearn.linear_model import LinearRegression
LR_scaled_model = LinearRegression()
LR_scaled_model.fit(X_scaled, Y_scaled)
from sklearn.metrics import *
predicted_sales = LR_scaled_model.predict(X_scaled)
print('R-squared:', r2_score(Y_scaled, predicted_sales))
And if I just implement Linear Regression without scaling, the R-squared is 0.67
LR_non_scaling_model = LinearRegression()
LR_non_scaling_model.fit(X, Y)
predicted_sales = LR_non_scaling_model.predict(X)
print('R-squared:', r2_score(Y, predicted_sales))
How would you explain this? And, in linear regression tasks, when should I and when should not I scale my data?
Dataset: https://drive.google.com/file/d/1AeK2aCnKtr0xMHz1B_Vfq4HnIkd2pxW_/view?usp=share_link

It seems like the scaling is also applied to the one-hot-encoded dummy variable which IMO should not happen. If you only scale continuous variables, does that change the behavior?
Generally, scaling only affects the interpretation of the coefficients and not the quality of the model. After standard scaling, a coefficient $\beta_1$ can be interpreted as:
A one standard deviation change in the independent variable is associated with a $\beta_1$ change in the dependent variable

My train/test model is returning an error and is train/test model and normal linear regression model two separate models?

I recently attending a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
#Below is y = mx + c
return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values the two different models yield as well as the predicted values they yield?. Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit in a polynomial regression?
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))

You better split to train and test using sklearn method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels. 0.2 stands for 80% train and 20% test.
BTW - the error you are describing could be because you dataframe has only 80 rows, leaving x[80:] empty

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and ı want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?

I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html

Negative accuracy in linear regression

My linear regression model has negative coefficient of determination R².
How can this happen? Any idea is helpful.
Here is my dataset:
year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
The code of the LinearRegression model is as follows:
import pandas as pd
from sklearn.linear_model import LinearRegression
data =pd.read_csv("data.csv", header=None )
data = data.drop(0,axis=0)
X=data[0]
Y=data[1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1,shuffle =False)
lm = LinearRegression()
lm.fit(X_train.values.reshape(-1,1), Y_train.values.reshape(-1,1))
Y_pred = lm.predict(X_test.values.reshape(-1,1))
accuracy = lm.score(Y_test.values.reshape(-1,1),Y_pred)
print(accuracy)
output
-3592622948027972.5

Here is the formula of the R² score:
\hat{y_i} is the predictor of the i-th observation y_i and \bar{y} is the mean of all observations.
Therefore, a negative R² means that if someone knew the mean of your y_test sample and always used it as a "prediction", this "prediction" would be more accurate than your model.
Moving on to your dataset (thanks to #Prayson W. Daniel for the convenient loading script), let us have a quick look at your data.
df.population.plot()
It looks like a logarithmic transformation could help.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
EDIT
As suggested by #Prayson W. Daniel, here is the model fit after it is transformed back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133

Sckit-learn’s LinearRegression scores uses 𝑅2 score. A negative 𝑅2 means that the model fitted your data extremely bad. Since 𝑅2 compares the fit of the model with that of the null hypothesis( a horizontal straight line ), then 𝑅2 is negative when the model fits worse than a horizontal line.
𝑅2 = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2 is greater than SUM((y - AVG(y))**2, then 𝑅2 will be negative.
reasons and ways to correct it
Problem 1: You are performing a random split of time-series data. Random split will ignore the temporal dimension.
Solution: Preserve time flow (See code below)
Problem 2: Target values are so large.
Solution: Unless we use Tree-base models, you would have to do some target feature engineering to scale data in a range that models can learn.
Here is a code example. Using defaults parameters of LinearRegression and log|exp transformation of our target values, my attempt yield ~87% R2 score:
import pandas as pd
import numpy as np
# we need to transform/feature engineer our target
# I will use log from numpy. The np.log and np.exp to make the value learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
regressor=regressor,
func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
For those interested in making it better, here is a way to read that dataset
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
Results:

Overfitting the Lasso regression model to the data

I wanted to create a model that would describe the behaviour of my data. I tried the simple linear regression, simple polynomial regression and polynomial regression with regularization and cross-validation.
I found that the last method allows an automatic feature selection (with degrees), which I actually need as the simple linear regression does not perform well. I followed this explanation to perform polynomial regression with Lasso regularization and cross-validation.
In this example, this method is used to avoid overfitting which happens if you use a simple polynomial regression. In my case, however, it seems that it vice versa caused overfitting.
I was wondering whether somebody could help me to understand what I am doing wrong in the code implementation? Or maybe there is a better solution on how best fit data into a model?
Code (linear regression with statsmodels, polynomial regressions with scikit learn):
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949, 0.026217822, 0.032549629, 0.025744309, 0.027945824],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131, 196.610972, 194.915587, 183.209067, 182.41669],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046, 1777.880947, 1792.21646, 1785.653845, 1788.401923]
}
Test = {'Cycle': [33, 34, 35],
'Internal_Resistance': [0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [204.018257, 179.929472, 189.576431],
'Full_Capacity': [1782.983718, 1793.939504, 1788.67233]
}
#Initial data presented in a form of a data frame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head(32))
print()
print(X)
print()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
# Fitting data with scikit learn - simple linear regression
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
#Metrics of the linear model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
#Coefficients for the linear model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
#Fitting data with a simple polynomial model
poly = PolynomialFeatures(2,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_poly_feature_name)
print(len(X_poly_feature_name))
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_poly, y_train)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_poly, y_train)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_pred1, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_pred1, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print ("R2 value of the new polynomial model:",model1.score(X_train,y_pred1))
print ("Alpha of the new polynomial model:",model1.alpha_)
print(coeff1[coeff1['Coefficients Metamodel']!=0])
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2) ```

I found that a simple polynomial with a single interaction term gives an OK fit. Note that a 3D scatterplot of the SoH data, without using "cycle" shows that there are some areas that would benefit from additional data to characterize the response surface:
a = 1.6708148450040499E+03
b = 6.5825133247934986E-01
c = 4.8477389499541523E+03
d = -2.7015882838321772E+01
temp = a
temp += b * CV_Capacity
temp += c * Internal_Resistance
temp += d * Internal_Resistance * CV_Capacity
return temp

Lasso is a regularization method that can be used in order to avoid overfitting.
In this method we add a term to the loss function which is a kind of constraint on the weights. Then, you have 2 terms in your loss function - a term that responsible to fit the data, and a regularization term.
In addition, you have a constant that controls the trade off between those 2 term. There is a chance that in your case you should increase the strength of the regularization term (increase the constant), in order to avoid overfitting.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How calculate OLS regression with Survey Weights in Python. - python

Related

Scaling gives me worse result (lower R-squared) in Linear Regression

My train/test model is returning an error and is train/test model and normal linear regression model two separate models?

Multiple Linear Regression. Coeffs don't match

Negative accuracy in linear regression

Overfitting the Lasso regression model to the data

Categories

Resources