Stepwise Regression in Python - python

How to perform stepwise regression in python? There are methods for OLS in SCIPY but I am not able to do stepwise. Any help in this regard would be a great help. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. Following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.

Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.

You may try mlxtend which got various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf,k_features = 10,forward=True,floating=False, scoring='r2',cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)

You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.

I developed this repository https://github.com/xinhe97/StepwiseSelectionOLS
My Stepwise Selection Classes (best subset, forward stepwise, backward stepwise) are compatible to sklearn. You can do Pipeline and GridSearchCV with my Classes.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X,y,feature_index):
# Fit model on feature_set and calculate rsq_adj
regr = sm.OLS(y,X[:,feature_index]).fit()
rsq_adj = regr.rsquared_adj
bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
rsq = regr.rsquared
return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}
################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
# Pull out predictors we still need to process
remaining_predictors_index = [p for p in range(X.shape[1])
if p not in predictors_index]
results = []
for p in remaining_predictors_index:
new_predictors_index = predictors_index+[p]
new_predictors_index.sort()
results.append(self.processSubset(X,y,new_predictors_index))
# Wrap everything up in a nice dataframe
models = pd.DataFrame(results)
# Choose the model with the highest rsq_adj
# best_model = models.loc[models['bic'].idxmin()]
best_model = models.loc[models['rsq'].idxmax()]
# Return the best model, along with model's other information
return best_model
def forwardK(self,X_est,y_est, fK):
models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
predictors_index = []
M = min(fK,X_est.shape[1])
for i in range(1,M+1):
print(i)
models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
predictors_index = models_fwd.loc[i,'predictors_index']
print(models_fwd)
# best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
# best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
return best_model_fwd, best_predictors

Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.

"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p values of your
variables written as 'P>|t|'. Then check for the variable with the highest p
value. Suppose x3 has the highest value e.g 0.956. Then remove this column
from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these methods until you remove all the columns which have p value higher than the significance value(e.g 0.05). In the end your variable X_opt will have all the optimal variables with p values less than significance level.

Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm, a statsmodels.OLS.fit(Y,X), where X is an array of n ones, where n is the
number of data points, and Y, where Y is the response in the training data
curr_preds- a list with ['const']
potential_preds- a list of all potential predictors.
There also needs to be a pandas dataframe X_mix that has all of the data, including 'const', and all of the data corresponding to the potential predictors
tol, optional. The max pvalue, .05 if not specified
def mixed_selection (lm, curr_preds, potential_preds, tol = .05):
while (len(potential_preds) > 0):
index_best = -1 # this will record the index of the best predictor
curr = -1 # this will record current index
best_r_squared = lm.rsquared_adj # record the r squared of the current model
# loop to determine if any of the predictors can better the r-squared
for pred in potential_preds:
curr += 1 # increment current
preds = curr_preds.copy() # grab the current predictors
preds.append(pred)
lm_new = sm.OLS(y, X_mix[preds]).fit() # create a model with the current predictors plus an addional potential predictor
new_r_sq = lm_new.rsquared_adj # record r squared for new model
if new_r_sq > best_r_squared:
best_r_squared = new_r_sq
index_best = curr
if index_best != -1: # a potential predictor improved the r-squared; remove it from potential_preds and add it to current_preds
curr_preds.append(potential_preds.pop(index_best))
else: # none of the remaining potential predictors improved the adjust r-squared; exit loop
break
# fit a new lm using the new predictors, look at the p-values
pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
pval_too_big = []
# make a list of all the p-values that are greater than the tolerance
for feat in pvals.index:
if(pvals[feat] > tol and feat != 'const'): # if the pvalue is too large, add it to the list of big pvalues
pval_too_big.append(feat)
# now remove all the features from curr_preds that have a p-value that is too large
for feat in pval_too_big:
pop_index = curr_preds.index(feat)
curr_preds.pop(pop_index)

Related

Squared Error Relevance Area (SERA) implementation in Python as custom evaluation metric

I'm facing an imbalanced regression problem and I've already tried several ways to solve this problem. Eventually I came a cross this new metric called SERA (Squared Error Relevance Area) as a custom scoring function for imbalanced regression as mentioned in this paper. https://link.springer.com/article/10.1007/s10994-020-05900-9
In order to calculate SERA you have to compute the relevance function phi, which is varied from 0 to 1 in small steps. For each value of relevance (phi) (e.g. 0.45) a subset of the training dataset is selected where the relevance is greater or equal to that value (e.g. 0.45). And for that selected training subset sum of squared errors is calculated i.e. sum(y_true - y_pred)**2 which is known as squared error relevance (SER). Then a plot us created for SER vs phi and area under the curve is calculated i.e. SERA.
Here is my implementation, inspired by this other question here in StackOverflow:
import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer
def calc_sera(y_true, y_pred,x_relevance=None):
# creating a list from 0 to 1 with 0.001 interval
start_range = 0
end_range = 1
interval_size = 0.001
list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
list_1.append(start_range)
list_1.append(end_range)
epsilon = sorted(list_1, key=lambda x: float(x))
df = pd.concat([y_true,y_pred,x_relevance],axis=1,keys= ['true', 'pred', 'phi'])
# Initiating lists to store relevance(phi) and squared-error relevance (ser)
relevance = []
ser = []
# Converting the dataframe to a numpy array
rel_arr = x_relevance
# selecting a phi value
for phi in epsilon:
relevance.append(phi)
error_squared_sum = 0
error_squared_sum = sum((df[df.phi>=phi]['true'] - df[df.phi>=phi]['pred'])**2)
ser.append(error_squared_sum)
# squared-error relevance area (sera)
# numerical integration using simps(y, x)
sera = simps(ser, relevance)
return sera
sera = make_scorer(calc_sera, x_relevance=X['relevance'], greater_is_better=False)
I implemented a simple GridSearch using this score as an evaluation metric to select the best model:
model = CatBoostRegressor(random_state=0)
cv = KFold(n_splits = 5, shuffle = True, random_state = 42)
parameters = {'depth': [6,8,10],'learning_rate' : [0.01, 0.05, 0.1],'iterations': [100, 200, 500,1000]}
clf = GridSearchCV(estimator=model,
param_grid=parameters,
scoring=sera,
verbose=0,cv=cv)
clf.fit(X=X.drop(columns=['relevance']),
y=y,
sample_weight=X['relevance'])
print("Best parameters:", clf.best_params_)
print("Lowest SERA: ", clf.best_score_)
I also added the relevance function as weights to the model so it could apply this weights in the learning task. However, what I am getting as output is this:
Best parameters: {'depth': 6, 'iterations': 100, 'learning_rate': 0.01}
Lowest SERA: nan
Any clue on why SERA value is returning nan? Should I implement this another way?
Whenever you get unexpected NaN scores in a grid search, you should set the parameter error_score="raise" to get an error traceback, and debug from there.
In this case I think I see the problem though: sera is defined with x_relevance=X['relevance'], which includes all the rows of X. But in the search, you're cross-validating: each testing set has fewer rows, and those are what sera will be called on. I can think of a couple of options; I haven't tested either, so let me know if something doesn't work.
Use pandas index
In your pd.concat, just set join="inner". If y_true is a slice of the original pandas series (I think this is how GridSearchCV will pass it...), then the concatenation will join on those row indices, so keeping the whole of X['relevance'] is fine: it will just drop the irrelevant rows. y_pred may well be a numpy array, so you might need to set its index appropriately first?
Keep relevance in X
Then your scorer can reference the relevance column directly from the sliced X. For this, you will want to drop that column from the fitting data, which could be difficult to do for the training but not the testing set; however, CatBoost has an ignored_features parameter that I think ought to work.

Apply coefficients from n degree polynomial to formula

I use a sklearn LinearRegression()estimator, with 5 variables
['feat1', 'feat2', 'feat3', 'feat4', 'feat5']
In order to predict a continuous value.
Estimator returns the list of coefficient values and the bias:
linear = LinearRegression()
print(linear.coef_)
print(linear.intercept_)
[ 0.18799409 -0.05406106 -0.01327966 -0.13348129 -0.00614054]
-0.011064865422734674
Then, given the fact I have each feature as variables, I can hardcode the coefficients into a linear formula and estimate my values, like so:
val = ((0.18799409*feat1) - (0.05406106*feat2) - (0.01327966*feat3) - (0.13348129*feat4) - (0.00614054*feat5)) -0.011064865422734674
Now lets say I use a polynomial regression of degree 2, using a pipeline, and by printing:
model = Pipeline(steps=[
('scaler',StandardScaler()),
('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
('linear_regression', LinearRegression())])
#fit model
model.fit(X_train, y_train)
print(model['linear_regression'].coef_)
print(model['linear_regression'].intercept_)
I get:
[ 7.06524186e-01 -2.98605001e-02 -4.67175212e-02 -4.86890790e-01
-1.06320101e-02 -2.77958604e-03 -3.38253025e-04 -7.80563090e-03
4.51356888e-03 8.32036733e-03 3.57638244e-02 -2.16446849e-02
-7.92169287e-02 3.36809467e-02 -6.60531497e-03 2.16613331e-02
2.10097993e-02 3.49970303e-02 -3.02970698e-02 -7.81462599e-03]
0.011042927069084668
How do I transform the formula above in order to calculate val from regression, with values from .coef_ and .intercept_, using array indexing instead of hardcoding the values, for any 'n' degree ?
Is there any scipy or numpy method suited for that?
It's important to note that polynomial regression is just an extended case of linear regression, thus all we need to do is transform our input data consistently. For any N we can use the PolynomialFeatures from sklearn.preprocessing. From using dummy data, we can see how this would work:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#set parameters
X = np.stack([np.arange(i,i+10) for i in range(5)]).T
Y = np.random.randn(10)*10+3
N = 2
poly_reg=PolynomialFeatures(degree=N,include_bias=False)
X_poly=poly_reg.fit_transform(X)
#print(X[0],X_poly[0]) #to check parameters, note that it includes the y intercept as an input of 1
poly = LinearRegression().fit(X_poly, Y)
And thus, we can get the coef_ the way you were doing before, and simply perform a matrix multiplication to get the regressed value.
new_dat = poly_reg.transform(np.arange(2,2+10,2)[None]) #5 new datapoints
np.testing.assert_array_equal(poly.predict(new_dat),new_dat # poly.coef_ + poly.intercept_)
----EDIT----
In case you cannot use the transform for PolynomialFeatures, it's just a iterated combination loop to generate the data from your list of features.
new_feats = np.array([feat1,feat2,feat3,feat4,feat5])
from itertools import combinations_with_replacement
def gen_poly_feats(x,N):
#this function returns all unique groupings (w/ replacement) of the indices into the array x for use in polynomial regression.
return np.concatenate([[np.product(x[np.array(i)]) for i in list(combinations_with_replacement(range(len(x)), n))] for n in range(1,N+1)])[None]
new_feats_poly = gen_poly_feats(new_feats,N)
# just to be sure that this matches...
np.testing.assert_array_equal(new_feats_poly,poly_reg.transform(new_feats[None]))
#then we can use the above linear regression model to predict the new data
val = new_feats_poly # poly.coef_ + poly.intercept_

Statsmodels ARIMA: how to get confidence/prediction interval?

How to generate "lower" and "upper" predictions, not just "yhat"?
import statsmodels
from statsmodels.tsa.arima.model import ARIMA
assert statsmodels.__version__ == '0.12.0'
arima = ARIMA(df['value'], order=order)
model = arima.fit()
Now I can generate "yhat" predictions
yhat = model.forecast(123)
and get confidence intervals for model parameters (but not for predictions):
model.conf_int()
but how to generate yhat_lower and yhat_upper predictions?
In general, the forecast and predict methods only produce point predictions, while the get_forecast and get_prediction methods produce full results including prediction intervals.
In your example, you can do:
forecast = model.get_forecast(123)
yhat = forecast.predicted_mean
yhat_conf_int = forecast.conf_int(alpha=0.05)
If your data is a Pandas Series, then yhat_conf_int will be a DataFrame with two columns, lower <name> and upper <name>, where <name> is the name of the Pandas Series.
If your data is a numpy array (or Python list), then yhat_conf_int will be an (n_forecasts, 2) array, where the first column is the lower part of the interval and the second column is the upper part.
To generate prediction intervals as opposed to confidence intervals (which you have neatly made the distinction between, and is also presented in Hyndman's blog post on the difference between prediction intervals and confidence intervals), then you can follow the guidance available in this answer.
You could also try to compute bootstrapped prediction intervals, which is laid out in this answer.
Below, is my attempt at implementing this (I'll update it when I get the chance to check it in more detail):
def bootstrap_prediction_interval(y_train: Union[list, pd.Series],
y_fit: Union[list, pd.Series],
y_pred_value: float,
alpha: float = 0.05,
nbootstrap: int = None,
seed: int = None):
"""
Bootstraps a prediction interval around an ARIMA model's predictions.
Method presented clearly here:
- https://stats.stackexchange.com/a/254321
Also found through here, though less clearly:
- https://otexts.com/fpp3/prediction-intervals.html
Can consider this to be a time-series version of the following generalisation:
- https://saattrupdan.github.io/2020-03-01-bootstrap-prediction/
:param y_train: List or Series of training univariate time-series data.
:param y_fit: List or Series of model fitted univariate time-series data.
:param y_pred_value: Float of the model predicted univariate time-series you want to compute P.I. for.
:param alpha: float = 0.05, the prediction uncertainty.
:param nbootstrap: integer = 1000, the number of bootstrap sampling of the residual forecast error.
Rules of thumb provided here:
- https://stats.stackexchange.com/questions/86040/rule-of-thumb-for-number-of-bootstrap-samples
:param seed: Integer to specify if you want deterministic sampling.
:return: A list [`lower`, `pred`, `upper`] with `pred` being the prediction
of the model and `lower` and `upper` constituting the lower- and upper
bounds for the prediction interval around `pred`, respectively.
"""
# get number of samples
n = len(y_train)
# compute the forecast errors/resid
fe = y_train - y_fit
# get percentile bounds
percentile_lower = (alpha * 100) / 2
percentile_higher = 100 - percentile_lower
if nbootstrap is None:
nbootstrap = np.sqrt(n).astype(int)
if seed is None:
rng = np.random.default_rng()
else:
rng = np.random.default_rng(seed)
# bootstrap sample from errors
error_bootstrap = []
for _ in range(nbootstrap):
idx = rng.integers(low=n)
error_bootstrap.append(fe[idx])
# get lower and higher percentiles of sampled forecast errors
fe_lower = np.percentile(a=error_bootstrap, q=percentile_lower)
fe_higher = np.percentile(a=error_bootstrap, q=percentile_higher)
# compute P.I.
pi = [y_pred_value + fe_lower, y_pred_value, y_pred_value + fe_higher]
return pi
using ARIMA you need to include seasonality and exogenous variables in the model yourself. While using SARIMA (Seasonal ARIMA) or SARIMAX (also for exogenous factors) implementation give C.I. to summary_frame:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import pandas as pd
dta = sm.datasets.sunspots.load_pandas().data[['SUNACTIVITY']]
dta.index = pd.Index(pd.date_range("1700", end="2009", freq="A"))
print(dta)
print("init data:\n")
dta.plot(figsize=(12,4));
plt.show()
##print("SARIMAX(dta, order=(2,0,0), trend='c'):\n")
result = sm.tsa.SARIMAX(dta, order=(2,0,0), trend='c').fit(disp=False)
print(">>> result.params:\n", result.params, "\n")
##print("SARIMA_model.plot_diagnostics:\n")
result.plot_diagnostics(figsize=(15,12))
plt.show()
# summary stats of residuals
print(">>> residuals.describe:\n", result.resid.describe(), "\n")
# Out-of-sample forecasts are produced using the forecast or get_forecast methods from the results object
# The get_forecast method is more general, and also allows constructing confidence intervals.
fcast_res1 = result.get_forecast()
# specify that we want a confidence level of 90%
print(">>> forecast summary at alpha=0.01:\n", fcast_res1.summary_frame(alpha=0.10), "\n")
# plot forecast
fig, ax = plt.subplots(figsize=(15, 5))
# Construct the forecasts
fcast = result.get_forecast('2010Q4').summary_frame()
print(fcast)
fcast['mean'].plot(ax=ax, style='k--')
ax.fill_between(fcast.index, fcast['mean_ci_lower'], fcast['mean_ci_upper'], color='k', alpha=0.1);
fig.tight_layout()
plt.show()
docs: "The forecast above may not look very impressive, as it is almost a straight line. This is because this is a very simple, univariate forecasting model. Nonetheless, keep in mind that these simple forecasting models can be extremely competitive"
p.s. here " you can use it in a non-seasonal way by setting the seasonal terms to zero."

Python: Develope Multiple Linear Regression Model From Scrath

I am trying to create a multiple linear regression model from scratch in python. Dataset used: Boston Housing Dataset from Sklearn. Since my focus was on the model building I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a Linear Regression model to find out the weights for each feature.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X=load_boston()
data=pd.DataFrame(X.data,columns=X.feature_names)
y=X.target
data.head()
#dropping three features
data=data.drop(['INDUS','NOX','AGE'],axis=1)
#new shape of the data (506,10) not including the target variable
#Passed the whole dataset to Linear Regression Model
model_lr=LinearRegression()
model_lr.fit(data,y)
model_lr.score(data,y)
0.7278959820021539
model_lr.intercept_
22.60536462807957 #----- intercept value
model_lr.coef_
array([-0.09649731, 0.05281081, 2.3802989 , 3.94059598, -1.05476566,
0.28259531, -0.01572265, -0.75651996, 0.01023922, -0.57069861]) #--- coefficients
Now I wanted to calculate the coefficients manually in excel before creating the model in python. To calculate the weights of each feature I used this formula:
Calculating the Weights of the Features
To calculate the intercept I used the formula
b0 = mean(y)-b1*mean(x1)-b2*(mean(x2)....-bn*mean(xn)
The intercept value from my calculations was 22.63551387(almost same to that of the model)
The problem is that the weights of the features from my calculation are far off from that of the sklearn linear model.
-0.002528644 #-- CRIM
-0.001028914 #-- Zn
-0.038663314 #-- CHAS
-0.035026972 #-- RM
-0.014275311 #-- DIS
-0.004058291 #-- RAD
-0.000241103 #-- TAX
-0.015035534 #-- PTRATIO
-0.000318376 #-- B
-0.006411897 #-- LSTAT
Using the first row as a test data to check my calculations, I get 22.73167044199992 while the Linear Regression model predicts 30.42657776. The original value is 24.
But as soon as I check for other rows the sklearn model is having more variation while the predictions made by the weights from my calculations are all showing values close to 22.
I think I am making a mistake in calculating the weights, but I am not sure where the problem is? Is there a mistake in my calculation? Why are all my coefficients from the calculations so close to 0?
Here is my Code for Calculating the coefficients:(beginner here)
x_1=[]
x_2=[]
for i,j in zip(data['CRIM'],y):
mean_x=data['CRIM'].mean()
mean_y=np.mean(y)
c=i-mean_x*(j-mean_y)
d=(i-mean_x)**2
x_1.append(c)
x_2.append(d)
print(sum(x_1)/sum(x_2))
Thank you for reading this long post, I appreciate it.
It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.
EDIT
Now after seeing your code for the coefficient calculation, the problem is clearer.
You cannot use this equation to calculate the coefficients of each feature independent of each other, as each coefficient will depend on all the features. I suggest you take a look at the derivation of the solution to this least squares optimization problem in the simple case here and in the general case here. And as a general tip stick with matrix implementation whenever you can, as this is radically more efficient.
However, in this case we have a 10-dimensional feature vector, and so in matrix notation it becomes.
See derivation here
I suspect you made some computational error here, as implementing this in python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this peace of your code its hard to know.
Here's an example of how you would implement it:
def calc_coefficients(X,Y):
X=np.mat(X)
Y = np.mat(Y)
return np.dot((np.dot(np.transpose(X),X))**(-1),np.transpose(np.dot(Y,X)))
def score_r2(y_pred,y_true):
ss_tot=np.power(y_true-y_true.mean(),2).sum()
ss_res = np.power(y_true -y_pred,2).sum()
return 1 -ss_res/ss_tot
X = np.ones(shape=(506,11))
X[:,1:] = data.values
B=calc_coefficients(X,y)
##### Coeffcients
B[:]
matrix([[ 2.26053646e+01],
[-9.64973063e-02],
[ 5.28108077e-02],
[ 2.38029890e+00],
[ 3.94059598e+00],
[-1.05476566e+00],
[ 2.82595310e-01],
[-1.57226536e-02],
[-7.56519964e-01],
[ 1.02392192e-02],
[-5.70698610e-01]])
#### Intercept
B[0]
matrix([[22.60536463]])
y_pred = np.dot(np.transpose(B),np.transpose(X))
##### First 5 rows predicted
np.array(y_pred)[0][:5]
array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])
##### First 5 rows Ground Truth
y[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
### R^2 score
score_r2(y_pred,y)
0.7278959820021539
Complete Solution - 2020 - boston dataset
As the other said, to compute the coefficients for the linear regression you have to compute
β = (X^T X)^-1 X^T y
This give you the coefficients ( all B for the feature + the intercept ).
Be sure to add a column with all 1ones to the X for compute the intercept(more in the code)
Main.py
from sklearn.datasets import load_boston
import numpy as np
from CustomLibrary import CustomLinearRegression
from CustomLibrary import CustomMeanSquaredError
boston = load_boston()
X = np.array(boston.data, dtype="f")
Y = np.array(boston.target, dtype="f")
regression = CustomLinearRegression()
regression.fit(X, Y)
print("Projection matrix sk:", regression.coefficients, "\n")
print("bias sk:", regression.intercept, "\n")
Y_pred = regression.predict(X)
loss_sk = CustomMeanSquaredError(Y, Y_pred)
print("Model performance:")
print("--------------------------------------")
print("MSE is {}".format(loss_sk))
print("\n")
CustomLibrary.py
import numpy as np
class CustomLinearRegression():
def __init__(self):
self.coefficients = None
self.intercept = None
def fit(self, x , y):
x = self.add_one_column(x)
x_T = np.transpose(x)
inverse = np.linalg.inv(np.dot(x_T, x))
pseudo_inverse = inverse.dot(x_T)
coef = pseudo_inverse.dot(y)
self.intercept = coef[0]
self.coefficients = coef[1:]
return coef
def add_one_column(self, x):
'''
the fit method with x feature return x coefficients ( include the intercept)
so for have the intercept + x feature coefficients we have to add one column ( in the beginning )
with all 1ones
'''
X = np.ones(shape=(x.shape[0], x.shape[1] +1))
X[:, 1:] = x
return X
def predict(self, x):
predicted = np.array([])
for sample in x:
result = self.intercept
for idx, feature_value_in_sample in enumerate(sample):
result += feature_value_in_sample * self.coefficients[idx]
predicted = np.append(predicted, result)
return predicted
def CustomMeanSquaredError(Y, Y_pred):
mse = 0
for idx,data in enumerate(Y):
mse += (data - Y_pred[idx])**2
return mse * (1 / len(Y))

pyspark Linear Regression Example from official documentation - Bad results?

I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here)
I also found this question on stackoverflow, which is essentially the same question as mine. The answer suggest to tweak the step size, which I also tried to do, however the results are still as random as without tweaking the step size. The code I'm using looks like this:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
values = [float(x) for x in line.replace(',', ' ').split(' ')]
return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData,100000,0.01)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda (v, p): (v - p)**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
The results look as follows:
(Expected Label, Predicted Label)
(-0.4307829, -0.7824231588143065)
(-0.1625189, -0.6234287565006766)
(-0.1625189, -0.41979307020176226)
(-0.1625189, -0.6517649080382241)
(0.3715636, -0.38543073492870156)
(0.7654678, -0.7329426818746223)
(0.8544153, -0.33273378445315)
(1.2669476, -0.36663240056848917)
(1.2669476, -0.47541427992967517)
(1.2669476, -0.1887811811672498)
(1.3480731, -0.28646712964591936)
(1.446919, -0.3425075015127807)
(1.4701758, -0.14055275401870437)
(1.4929041, -0.06819303631450688)
(1.5581446, -0.772558163357755)
(1.5993876, -0.19251656391040356)
(1.6389967, -0.38105697301968594)
(1.6956156, -0.5409707504639943)
(1.7137979, 0.14914490255841997)
(1.8000583, -0.0008818203337740971)
(1.8484548, 0.06478505759587616)
(1.8946169, -0.0685096804502884)
(1.9242487, -0.14607596025743624)
(2.008214, -0.24904211817187422)
(2.0476928, -0.4686214015035236)
(2.1575593, 0.14845590638215034)
(2.1916535, -0.5140996125798528)
(2.2137539, 0.6278134417345228)
(2.2772673, -0.35049969044209983)
(2.2975726, -0.06036824276546304)
(2.3272777, -0.18585219083806218)
(2.5217206, -0.03167349168036536)
(2.5533438, -0.1611040092884861)
(2.5687881, 1.1032200139582564)
(2.6567569, 0.04975777739217784)
(2.677591, -0.01426285133724671)
(2.7180005, 0.07853368755223371)
(2.7942279, -0.4071930969456503)
(2.8063861, 0.000492545291049501)
(2.8124102, -0.019947344959659177)
(2.8419982, 0.03023139779978133)
(2.8535925, 0.5421291261646886)
(2.9204698, 0.3923068894170366)
(2.9626924, 0.21639267973240908)
(2.9626924, -0.22540434628281075)
(2.9729753, 0.2363938458250126)
(3.0130809, 0.35136961387278565)
(3.0373539, 0.013876918415846595)
(3.2752562, 0.49970959078043126)
(3.3375474, 0.5436323480304672)
(3.3928291, 0.48746004196839055)
(3.4355988, 0.3350764608584778)
(3.4578927, 0.6127634045652381)
(3.5160131, -0.03781697409079157)
(3.5307626, 0.2129806543371961)
(3.5652984, 0.5528805608876549)
(3.5876769, 0.06299042506665305)
(3.6309855, 0.5648082098866389)
(3.6800909, -0.1588172848952902)
(3.7123518, 0.1635062564072022)
(3.9843437, 0.7827244309795267)
(3.993603, 0.6049246406551748)
(4.029806, 0.06372113813964088)
(4.1295508, 0.24281029469705093)
(4.3851468, 0.5906868686740623)
(4.6844434, 0.4055055537895428)
(5.477509, 0.7335244827296759)
Mean Squared Error = 6.83550847274
So, what am I missing? Since the data is from the official spark documentation, I would guess that it should be suited to apply linear regression on it (and get at least a reasonably good prediction)?
For starters you're missing an intercept. While mean values of the independent variables are close to zero:
parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
## -0.0294, 0.0669]
mean of the dependent variable is pretty far from it:
parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627
Forcing the regression line to go through the origin in case like this doesn't make sense. So lets see how LinearRegressionWithSGD performs with default arguments and added intercept:
model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = (parsedData.map(lambda p: (p.label, model.predict(p.features))))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
Lets compare it to the analytical solution
import numpy as np
from sklearn import linear_model
features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())
lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
As you can results obtained using LinearRegressionWithSGD are almost optimal.
You could add a grid search but in this particular case there is probably nothing to gain.

Categories