Overfitting a Lasso regression model to the data - Python
I wanted to create a model that would describe the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation.
I found that the last method allows automatic feature selection (via the polynomial degree), which I actually need, since simple linear regression does not perform well. I followed this explanation to perform polynomial regression with Lasso regularization and cross-validation.
In that example, the method is used to avoid the overfitting that occurs with plain polynomial regression. In my case, however, it seems to have caused overfitting instead.
Could somebody help me understand what I am doing wrong in the implementation? Or is there a better way to fit this data to a model?
Code (linear regression with statsmodels, polynomial regressions with scikit-learn):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949, 0.026217822, 0.032549629, 0.025744309, 0.027945824],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131, 196.610972, 194.915587, 183.209067, 182.41669],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046, 1777.880947, 1792.21646, 1785.653845, 1788.401923]
}
Test = {'Cycle': [33, 34, 35],
'Internal_Resistance': [0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [204.018257, 179.929472, 189.576431],
'Full_Capacity': [1782.983718, 1793.939504, 1788.67233]
}
#Initial data presented in a form of a data frame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head(32))
print()
print(X)
print()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
# Fitting data with scikit learn - simple linear regression
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
#Metrics of the linear model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
#Coefficients for the linear model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
#Fitting data with a simple polynomial model
poly = PolynomialFeatures(2,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_poly_feature_name)
print(len(X_poly_feature_name))
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_poly, y_train)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_poly, y_train)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_train, y_pred1)  # compare predictions against the actual training targets
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_train, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print ("R2 value of the new polynomial model:",model1.score(X_train,y_pred1))
print ("Alpha of the new polynomial model:",model1.alpha_)
print(coeff1[coeff1['Coefficients Metamodel']!=0])
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
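Note that the Test dictionary defined above is never used in this code. As an illustration only (not part of the original code), here is a minimal sketch of how those held-out rows could be scored against the fitted LassoCV model, reusing the names defined above:

import numpy as np
from sklearn.metrics import mean_squared_error

# Rebuild the same 2nd-degree polynomial features for the held-out Test rows
dft = DataFrame(Test, columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
Xt = dft[['Cycle','Internal_Resistance','CV_Capacity']].to_numpy()
Xt_poly = pd.DataFrame(PolynomialFeatures(2, include_bias=False).fit_transform(Xt),
                       columns=X_train.columns)
y_test = dft['Full_Capacity']

# Score the cross-validated Lasso model on data it has never seen
y_test_pred = model1.predict(Xt_poly)
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_test_pred)))
print("Test R2:", model1.score(Xt_poly, y_test))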
I found that a simple polynomial with a single interaction term gives an OK fit. Note that a 3D scatterplot of the SoH data, without using "Cycle", shows that there are some regions that would benefit from additional data to characterize the response surface. The fitted coefficients and a small prediction function:
def predict_full_capacity(Internal_Resistance, CV_Capacity):
    # coefficients of the fitted surface
    a = 1.6708148450040499E+03
    b = 6.5825133247934986E-01
    c = 4.8477389499541523E+03
    d = -2.7015882838321772E+01
    temp = a
    temp += b * CV_Capacity
    temp += c * Internal_Resistance
    temp += d * Internal_Resistance * CV_Capacity
    return temp
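The coefficients above were presumably produced with some curve-fitting tool; as an illustration only, a comparable surface fit could be reproduced with scipy.optimize.curve_fit on the SoH data from the question:

import numpy as np
from scipy.optimize import curve_fit

def surface(X, a, b, c, d):
    # X is a tuple of (Internal_Resistance, CV_Capacity) arrays
    ir, cv = X
    return a + b * cv + c * ir + d * ir * cv

ir = np.array(SoH['Internal_Resistance'])
cv = np.array(SoH['CV_Capacity'])
y = np.array(SoH['Full_Capacity'])

# Returns the optimal parameters and their covariance matrix
popt, pcov = curve_fit(surface, (ir, cv), y)
print(popt)   # a, b, c, d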
Lasso is a regularization method that can be used to avoid overfitting.
With this method we add a term to the loss function that acts as a constraint on the weights. The loss function then has two terms: one responsible for fitting the data, and a regularization term.
In addition, there is a constant that controls the trade-off between those two terms. There is a chance that in your case you should increase the strength of the regularization term (increase the constant) in order to avoid overfitting; a sketch of what that could look like follows.
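As an illustration only (not part of the original answer), two ways to strengthen the penalty, reusing X_train and y_train from the question. The alpha values are arbitrary examples, and normalize=True is left out, so the polynomial features would normally need scaling first:

import numpy as np
from sklearn.linear_model import Lasso, LassoCV

# Option 1: let cross-validation choose alpha, but only from a grid of stronger penalties
model_strong = LassoCV(cv=10, alphas=np.logspace(-1, 2, 50), max_iter=10000)
model_strong.fit(X_train, y_train)
print("Chosen alpha:", model_strong.alpha_)

# Option 2: fix a single, larger alpha by hand and inspect which coefficients survive
lasso = Lasso(alpha=1.0, max_iter=10000)
lasso.fit(X_train, y_train)
print("Non-zero coefficients:", (lasso.coef_ != 0).sum())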
Related
How to code ordinary least squares linear regression for a 3D scatterplot in Python?
This is for a 3D scatterplot. The instructions state:

Now that we have fit our model, which means that we have computed the optimal model parameters, we can use our model to plot the regression line for the data. Below, I supply you with x_fit and y_fit that represent the x- and y-data of the regression line, respectively. All we need to do next is ask the model to predict a z_fit value for each x_fit and y_fit pair by invoking the model's predict() method. This should make sense when you consider the ordinary least squares linear regression equation for calculating z_fit: z_fit = theta1 * x_fit + theta2 * y_fit + theta3, where theta1, theta2 and theta3 are the computed model parameters. You must use x_fit and y_fit as features to be passed together as a DataFrame to the model's predict() method, which will return z_fit as determined by the above equation. Once you obtain z_fit, you are ready to plot the regression line by plotting it against x_fit and y_fit.

My code so far is:

features = df[['x', 'y']]
response = df['z']

#perform linear regression model fitting
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(features, response)
model

print("Model Coefficients: ", model.coef_)
print("Model Intercept : ", model.intercept_)

Output:

Model Coefficients: [ 8.02602396 15.95639323]
Model Intercept : -8.632876775060822

This is where I'm stuck:

# Plot Curve Fit
x_fit = np.linspace(0,21,1000)
y_fit = x_fit
#ENTER CODE HERE#
Assuming

theta1, theta2 = model.coef_
theta3 = model.intercept_

the equation is:

z_fit = (theta1 * x_fit) + (theta2 * y_fit) + theta3
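Putting that together, a minimal sketch of the missing step, assuming the model and the column names 'x' and 'y' from the question:

import numpy as np
import pandas as pd

# Plot Curve Fit
x_fit = np.linspace(0, 21, 1000)
y_fit = x_fit

# Pass the grid through the model as a DataFrame with the same feature names used for fitting
fit_features = pd.DataFrame({'x': x_fit, 'y': y_fit})
z_fit = model.predict(fit_features)
# z_fit can now be plotted against x_fit and y_fit on the 3D axes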
I obtain negative R^2 regardless of what regression model I use
I am trying to create a model that would describe the behaviour of my data. I tried simple linear regression, simple polynomial regression, and polynomial regression with regularization and cross-validation. I followed this explanation to perform the regressions. The problem is that all models give a negative R^2 for the test data. I tried 1st, 2nd and 3rd degree polynomial models, and it only gets worse. Could somebody help me figure out what is wrong? Or what model can I use to get rid of the negative R^2 and obtain a normal one?

Summary for the simple linear regression
MAE, MSE, RMSE and R^2 for the simple linear regression
MAE, MSE, RMSE and R^2 for the simple polynomial regression
MAE, MSE, RMSE and R^2 for the polynomial regression with regularization and cross-validation

Code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from pandas import DataFrame
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
# Import function to automatically create polynomial features
from sklearn.preprocessing import PolynomialFeatures
# Import Linear Regression and a regularized regression function
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LassoCV
#Initial data (Three independent variables - Cycle, Internal Resistance and CV Capacity; One dependent - Full Capacity)
SoH = {'Cycle': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28],
'Internal_Resistance': [0.039684729, 0.033377614, 0.031960606, 0.03546798, 0.036786229, 0.03479803, 0.026613861, 0.028650246, 0.028183795, 0.035455215, 0.029205355, 0.033891692, 0.026988849, 0.025647298, 0.033970376, 0.03172454, 0.032437203, 0.033771218, 0.030939938, 0.036919977, 0.027832869, 0.028602469, 0.023065191, 0.028890529, 0.026640394, 0.031488253, 0.02865842, 0.027648949],
'CV_Capacity': [389.9270401, 307.7366414, 357.6412139, 192.134787, 212.415946, 204.737916, 166.506029, 157.826878, 196.432589, 181.937188, 192.070363, 209.890964, 198.978988, 206.126864, 185.631644, 193.776497, 200.61431, 174.359373, 177.503285, 174.07905, 170.654873, 184.528031, 208.065379, 210.134795, 208.199237, 184.693507, 193.00402, 191.913131],
'Full_Capacity': [1703.8575, 1740.7017, 1760.66, 1775.248302, 1771.664053, 1781.958089, 1783.2295, 1784.500912, 1779.280477, 1780.175547, 1800.761265, 1789.047162, 1791.763677, 1787.014667, 1796.520256, 1798.349587, 1791.776304, 1788.892761, 1791.990303, 1790.307248, 1796.580484, 1803.89133, 1793.305294, 1784.638742, 1780.056339, 1783.081746, 1772.001436, 1794.182046]
}
#Data to test the model
Test = {'Cycle': [29, 30, 31, 32, 33, 34, 35],
'Internal_Resistance': [0.026217822, 0.032549629, 0.025744309, 0.027945824, 0.027332509, 0.027960729, 0.028969193],
'CV_Capacity': [196.610972, 194.915587, 183.209067, 182.41669, 204.018257, 179.929472, 189.576431],
'Full_Capacity': [1777.880947, 1792.21646, 1785.653845, 1788.401923, 1782.983718, 1793.939504, 1788.67233]
}
#Convert initial data into DataFrame
df = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
df1 = DataFrame(SoH,columns=['Cycle','Internal_Resistance','CV_Capacity'])
X = df1.to_numpy()
print(df.head())
print()
#Convert data to test the model into DataFrame
dft = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity','Full_Capacity'])
dft1 = DataFrame(Test,columns=['Cycle','Internal_Resistance','CV_Capacity'])
Xt = dft1.to_numpy()
#Plot the Full Capacity vs predictors (Cycle, Internal Resistance and CV Capacity)
for i in df.columns:
    df.plot.scatter(i,'Full_Capacity', edgecolors=(0,0,0),s=50,c='g',grid=True)
#STATSMODELS
# Fitting data with statsmodels
X1 = df[['Cycle','Internal_Resistance','CV_Capacity']]
Y1 = df['Full_Capacity']
X1 = sm.add_constant(X1.values) # adding a constant
model = sm.OLS(Y1, X1).fit()
predictions = model.predict(X1)
print_model = model.summary()
print(print_model)
print()
#SCIKIT LEARN (Simple polynomial regression and polynomial regression with regularization and cross-validation)
# Fitting data - simple polynomial regression (1st degree)
linear_model = LinearRegression(normalize=True)
X_linear=df.drop('Full_Capacity',axis=1)
y_linear=df['Full_Capacity']
X_linear_test=dft.drop('Full_Capacity',axis=1)
y_linear_test=dft['Full_Capacity']
linear_model.fit(X_linear,y_linear)
y_pred_linear = linear_model.predict(X_linear)
y_pred_linear_test = linear_model.predict(X_linear_test)
#Coefficients for the model
coeff_linear = pd.DataFrame(linear_model.coef_,index=df.drop('Full_Capacity',axis=1).columns, columns=['Linear model coefficients'])
print(coeff_linear)
print()
#Metrics of the model
MAE_linear = mean_absolute_error(y_linear, y_pred_linear)
print("Mean absolute error of linear model:",MAE_linear)
MSE_linear = mean_squared_error(y_linear, y_pred_linear)
print("Mean-squared error of linear model:",MSE_linear)
RMSE_linear = np.sqrt(MSE_linear)
print("Root-mean-squared error of linear model:",RMSE_linear)
print()
MAE_linear_test = mean_absolute_error(y_linear_test, y_pred_linear_test)
print("Mean absolute error of linear model (validation):",MAE_linear_test)
MSE_linear_test = mean_squared_error(y_linear_test, y_pred_linear_test)
print("Mean-squared error of linear model (validation):",MSE_linear_test)
RMSE_linear_test = np.sqrt(MSE_linear_test)
print("Root-mean-squared error of linear model (validation):",RMSE_linear_test)
print()
print ("R2 value of linear model:",linear_model.score(X_linear,y_linear))
print ("R2 value of linear model (validation):",linear_model.score(X_linear_test,y_linear_test))
print()
#Plot predicted values vs actual values
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with linear fit",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear,y_linear,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear,y_pred_linear, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with linear fit",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred_linear_test,y_linear_test,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred_linear_test,y_pred_linear_test, 'k--', lw=2)
#Fitting data - simple polynomial regression (3rd degree)
poly = PolynomialFeatures(3,include_bias=False)
X_poly = poly.fit_transform(X)
X_poly_feature_name = poly.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print()
print()
print("3rd degree polynomial regression")
print()
print()
print(X_poly_feature_name)
print(len(X_poly_feature_name))
print()
df_poly = pd.DataFrame(X_poly, columns=X_poly_feature_name)
print(df_poly.head())
print()
df_poly['y']=df['Full_Capacity']
print(df_poly.head())
print()
X_train=df_poly.drop('y',axis=1)
y_train=df_poly['y']
#Testing the model
test = PolynomialFeatures(3,include_bias=False)
X_test=test.fit_transform(Xt)
X_test_feature_name = test.get_feature_names(['Feature'+str(l) for l in range(1,4)])
print(X_test_feature_name)
print(len(X_test_feature_name))
print()
df_test = pd.DataFrame(X_test, columns=X_test_feature_name)
print(df_test.head())
print()
df_test['y']=dft['Full_Capacity']
#Data to test the polynomial models
X_testo=df_test.drop('y',axis=1)
y_testo=df_test['y']
poly = LinearRegression(normalize=True)
model_poly=poly.fit(X_train,y_train)
y_poly = poly.predict(X_train)
y_poly_test = np.array(poly.predict(X_testo))
coeff_poly = pd.DataFrame(model_poly.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients polynomial model'])
print(coeff_poly)
print()
#Metrics of the polynomial model
MAE_poly = mean_absolute_error(y_train, y_poly)
print("Mean absolute error of simple polynomial model:",MAE_poly)
MSE_poly = mean_squared_error(y_train, y_poly)
print("Mean-squared error of simple polynomial model:",MSE_poly)
RMSE_poly = np.sqrt(MSE_poly)
print("Root-mean-squared error of simple polynomial model:",RMSE_poly)
print()
MAE_poly_test = mean_absolute_error(y_testo, y_poly_test)
print("Mean absolute error of simple polynomial model (validation):",MAE_poly_test)
MSE_poly_test = mean_squared_error(y_testo, y_poly_test)
print("Mean-squared error of simple polynomial model (validation):",MSE_poly_test)
RMSE_poly_test = np.sqrt(MSE_poly_test)
print("Root-mean-squared error of simple polynomial model (validation):",RMSE_poly_test)
print()
print ("R2 value of simple polynomial model:",model_poly.score(X_train,y_train))
print ("R2 value of simple polynomial model (validation):",model_poly.score(X_testo,y_testo))
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with simple polynomial model",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly,y_poly, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Simple polynomial model",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_poly_test,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_poly_test,y_poly_test, 'k--', lw=2)
#Fitting data with a polynomial model with regularization and cross-validation
model1 = LassoCV(cv=10,verbose=0,normalize=True,eps=0.001,n_alphas=100, fit_intercept = True, tol=0.0001,max_iter=10000)
model1.fit(X_train,y_train)
y_pred1 = np.array(model1.predict(X_train))
y_pred2 = np.array(model1.predict(X_testo))
print()
print()
print("3rd degree polynomial regression with regularization and cross-validation")
print()
print()
coeff1 = pd.DataFrame(model1.coef_,index=df_poly.drop('y',axis=1).columns, columns=['Coefficients Metamodel'])
print(coeff1)
print()
print(coeff1[coeff1['Coefficients Metamodel']!=0])
print("Intercept of the new polynomial model:",model1.intercept_)
print()
#Metrics of the polynomial model with regularization and cross-validation
MAE_1 = mean_absolute_error(y_train, y_pred1)
print("Mean absolute error of the new polynomial model:",MAE_1)
MSE_1 = mean_squared_error(y_train, y_pred1)
print("Mean-squared error of the new polynomial model:",MSE_1)
RMSE_1 = np.sqrt(MSE_1)
print("Root-mean-squared error of the new polynomial model:",RMSE_1)
print()
MAE_1_test = mean_absolute_error(y_testo, y_pred2)
print("Mean absolute error of the new polynomial model (validation):",MAE_1_test)
MSE_1_test = mean_squared_error(y_testo, y_pred2)
print("Mean-squared error of the new polynomial model (validation):",MSE_1_test)
RMSE_1_test = np.sqrt(MSE_1_test)
print("Root-mean-squared error of the new polynomial model (validation):",RMSE_1_test)
print()
print ("R2 value of the new polynomial model:",model1.score(X_train,y_train))
print ("R2 value of the new polynomial model (validation):",model1.score(X_testo,y_testo))
print ("Alpha of the new polynomial model:",model1.alpha_)
print()
plt.figure(figsize=(12,8))
plt.xlabel("Predicted value with Metamodel",fontsize=20)
plt.ylabel("Actual y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred1,y_train,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred1,y_pred1, 'k--', lw=2)
plt.figure(figsize=(12,8))
plt.xlabel("Predicted (validation) value with Metamodel",fontsize=20)
plt.ylabel("Actual (validation) y-values",fontsize=20)
plt.grid(1)
plt.scatter(y_pred2,y_testo,edgecolors=(0,0,0),lw=2,s=80)
plt.plot(y_pred2,y_pred2, 'k--', lw=2)
Logistic regression with intercept only
I need to fit a logistic regression with sklearn, but with no x vector, just a model with an intercept. How can this be done? I cannot find any working solution. Thanks.
Edit: I want to find an alternative in sklearn to R's regression y ~ 1.
I did not find a way to run a logit on only the intercept, so I created a constant column and ran the model without the intercept.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

### Create the data
a = np.array([1] * 20 + [0] * 180)
df = pd.DataFrame(a, columns = ['y'])
df['intercept'] = 1

## Conduct the Logit Regression analysis
logmodel = LogisticRegression(fit_intercept=False)
logit_result = logmodel.fit(df.loc[:, ~df.columns.isin(['y'])],df['y'])

#### Print the coefficient
print(logit_result.intercept_)
print(logit_result.coef_)
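As a sanity check (a sketch, not part of the original answer, assuming the same 20/180 data): an intercept-only logistic model should recover the log-odds of the positive rate, and sklearn's default L2 penalty shrinks the coefficient slightly, so C can be set large to effectively switch regularization off:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Analytic intercept-only solution: the log-odds of the positive class rate
p = 20 / 200
print(np.log(p / (1 - p)))   # approximately -2.197

# Weaken the default L2 penalty (large C) so the fitted coefficient matches the analytic value
logmodel = LogisticRegression(fit_intercept=False, C=1e9)
logmodel.fit(df[['intercept']], df['y'])
print(logmodel.coef_)        # should be close to -2.197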
Predict future values after using polynomial regression in python
I'm currently using TensorFlow and sklearn to try to make a model that can predict the amount of sales for a certain product, X, based on the outdoor temperature in Celsius.
I took my dataset for the temperature and set it equal to the x variable, and the amount of sales to the y variable. As seen in the picture below, there is some sort of correlation between the temperature and the amount of sales.
First and foremost, I tried to do linear regression to see how well it'd fit. This is the code for that:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train, y_train) #fit tries to fit the x variable and y variable.

#Let's try to plot it out.
y_pred = model.predict(x_train)
plt.scatter(x_train,y_train)
plt.plot(x_train,y_pred,'r')
plt.legend(['Predicted Line', 'Observed data'])
plt.show()

This resulted in a predicted line that had a pretty poor fit.
A very nice feature from sklearn, however, is that you can try to predict a value based on a temperature, so if I were to write model.predict(15), I'd get the output array([6949.05567873]).
This is exactly what I want; I just wanted the line to fit better, so instead I tried polynomial regression with sklearn by doing the following:

from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)

new_model = LinearRegression()
new_model.fit(x_new,y_train)

#plotting
y_prediction = new_model.predict(x_new) #this actually predicts x...?
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()

The line seems to fit better now.
My problem now is that I can't use new_model.predict(x), since it results in "ValueError: shapes (1,1) and (8,) not aligned: 1 (dim 1) != 8 (dim 0)". I understand that this is because I'm using an 8-degree polynomial, but is there any way for me to predict the y-axis value based on ONE temperature using the polynomial regression model?
Try using

new_model.predict([x**a for a in range(1,9)])

or, according to your previously used code, you can do

new_model.predict(poly.fit_transform(x))

Since you fit a polynomial y = a*x^1 + b*x^2 + ... + h*x^8, you need to transform your input in the same manner, i.e. turn it into a polynomial without the intercept and slope terms. This was what you passed into the Linear Regression training function. It learns the slope terms for that polynomial. The plot you've shown only contains the x^1 term you indexed into (x_new[:,0]), which means that the data you're using has more columns.
One last note: always make sure your training data and future/validation data undergo the same preprocessing steps, to ensure your model works.
Here's some detail. Let's start by running your code on synthetic data.

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from numpy.random import rand

x_train = rand(1000,1)
y_train = rand(1000,1)

poly = PolynomialFeatures(degree=8, include_bias=False) #the bias is avoiding the need to intercept
x_new = poly.fit_transform(x_train)

new_model = LinearRegression()
new_model.fit(x_new,y_train)

#plotting
y_prediction = new_model.predict(x_new) #this predicts y
plt.scatter(x_train,y_train)
plt.plot(x_new[:,0], y_prediction, 'r')
plt.legend(['Predicted line', 'Observed data'])
plt.show()

Now we can predict a y value by transforming an x-value into a polynomial of degree 8 without an intercept:

print(new_model.predict(poly.fit_transform(0.25)))

[[0.47974408]]
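One caveat (an addition, not from the original answer): in recent scikit-learn versions the transformers expect a 2D array rather than a bare scalar, so a single temperature has to be wrapped before transforming. A sketch reusing poly and new_model from above:

import numpy as np

temperature = 15                              # a single new x-value
x_single = np.array([[temperature]])          # shape (1, 1): one sample, one feature
x_single_poly = poly.transform(x_single)      # shape (1, 8): the same degree-8 features
print(new_model.predict(x_single_poly))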
How to calculate an OLS regression with survey weights in Python
I want to do a linear regression on survey data with survey weights. The survey data is from the EU, and each observation has a weight (0.4 for one respondent, 1.5 for another). This weight is described as:

"The European Weight, variable 6, produces a representative sample of the European Community as a whole when used in analysis. This variable adjusts the size of each national sample according to each nation's contribution to the population of the European Community."

To do my calculation I'm using sklearn:

from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X, y, sample_weight = weights)

X is a pandas DataFrame, y is a numpy.ndarray, and weights is a pandas Series.
Am I using sample_weight correctly? Is this the correct way to handle survey weights in scikit-learn?
TL;DR: Yes.
Here is a very simple example of it working:

import numpy as np
import matplotlib.pylab as plt
from sklearn import linear_model

regr = linear_model.LinearRegression()

X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])

def weighted_lr(X, y, weights):
    """Quick function to run weighted linear regression and return a plot and some predictions"""
    regr.fit(X, y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred

y_pred = weighted_lr(X, y, weights)
print(y_pred)

weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)

[[ 7.14285714]
 [ 24.28571429]
 [ 58.57142857]]
[[ 9.96051333]
 [ 20.05923001]
 [ 40.25666338]]

On the first linear regression model, with even weights, we see the model behave as expected from a normal linear regression model. Next, however, we see that the second model, with a low weight on the last value, almost ignores this last value. The majority of the training has been weighted towards the other two values here.
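For comparison (an alternative route not covered in the original answer), statsmodels offers weighted least squares, which also reports weighted standard errors; a sketch on the same toy data:

import numpy as np
import statsmodels.api as sm

X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60])
weights = np.array([1000, 1000, 1])

# add_constant supplies the intercept column; WLS minimizes the weighted squared residuals
X_const = sm.add_constant(X)
wls_model = sm.WLS(y, X_const, weights=weights).fit()
print(wls_model.params)      # intercept and slope
print(wls_model.summary())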