I am trying to create a multiple linear regression model from scratch in Python. Dataset used: the Boston Housing dataset from sklearn. Since my focus was on the model building, I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a LinearRegression model to find out the weights for each feature.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X=load_boston()
data=pd.DataFrame(X.data,columns=X.feature_names)
y=X.target
data.head()
#dropping three features
data=data.drop(['INDUS','NOX','AGE'],axis=1)
#new shape of the data (506,10) not including the target variable
#Passed the whole dataset to Linear Regression Model
model_lr=LinearRegression()
model_lr.fit(data,y)
model_lr.score(data,y)
0.7278959820021539
model_lr.intercept_
22.60536462807957 #----- intercept value
model_lr.coef_
array([-0.09649731, 0.05281081, 2.3802989 , 3.94059598, -1.05476566,
0.28259531, -0.01572265, -0.75651996, 0.01023922, -0.57069861]) #--- coefficients
Now I wanted to calculate the coefficients manually in Excel before creating the model in Python. To calculate the weight of each feature I used this formula:
b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
To calculate the intercept I used the formula
b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
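In numpy, those two formulas amount to this sketch (with dummy single-feature data, just to illustrate):
import numpy as np

# dummy single feature and target, only to illustrate the two formulas above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b0 = y.mean() - b1 * x.mean()  # with more features: subtract b_i * mean(x_i) for each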
The intercept value from my calculations was 22.63551387 (almost the same as the model's).
The problem is that the weights of the features from my calculation are far off from those of the sklearn linear model.
-0.002528644 #-- CRIM
-0.001028914 #-- Zn
-0.038663314 #-- CHAS
-0.035026972 #-- RM
-0.014275311 #-- DIS
-0.004058291 #-- RAD
-0.000241103 #-- TAX
-0.015035534 #-- PTRATIO
-0.000318376 #-- B
-0.006411897 #-- LSTAT
Using the first row as test data to check my calculations, I get 22.73167044199992, while the LinearRegression model predicts 30.42657776. The original value is 24.
But as soon as I check other rows, the sklearn model shows more variation, while the predictions made with the weights from my calculations all stay close to 22.
I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all my coefficients from the calculations so close to 0?
Here is my code for calculating the coefficients (beginner here):
import numpy as np

x_1=[]
x_2=[]
for i,j in zip(data['CRIM'],y):
    mean_x=data['CRIM'].mean()
    mean_y=np.mean(y)
    c=i-mean_x*(j-mean_y)
    d=(i-mean_x)**2
    x_1.append(c)
    x_2.append(d)
print(sum(x_1)/sum(x_2))
Thank you for reading this long post, I appreciate it.
It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.
EDIT
Now after seeing your code for the coefficient calculation, the problem is clearer.
You cannot use this equation to calculate the coefficients of each feature independently of the others, as each coefficient depends on all the features. I suggest you take a look at the derivation of the solution to this least squares optimization problem in the simple case here and in the general case here. And as a general tip, stick with the matrix implementation whenever you can, as it is radically more efficient.
However, in this case we have a 10-dimensional feature vector, so in matrix notation it becomes
β = (X^T X)^-1 X^T y
See the derivation here.
I suspect you made some computational error here, as implementing this in Python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this piece of your code, it's hard to know.
Here's an example of how you would implement it:
import numpy as np

def calc_coefficients(X, Y):
    X = np.mat(X)
    Y = np.mat(Y)
    # (X^T X)^-1 X^T y, written with np.mat so ** -1 is the matrix inverse
    return np.dot((np.dot(np.transpose(X), X))**(-1), np.transpose(np.dot(Y, X)))

def score_r2(y_pred, y_true):
    ss_tot = np.power(y_true - y_true.mean(), 2).sum()
    ss_res = np.power(y_true - y_pred, 2).sum()
    return 1 - ss_res/ss_tot
X = np.ones(shape=(506,11))
X[:,1:] = data.values
B=calc_coefficients(X,y)
##### Coefficients
B[:]
matrix([[ 2.26053646e+01],
[-9.64973063e-02],
[ 5.28108077e-02],
[ 2.38029890e+00],
[ 3.94059598e+00],
[-1.05476566e+00],
[ 2.82595310e-01],
[-1.57226536e-02],
[-7.56519964e-01],
[ 1.02392192e-02],
[-5.70698610e-01]])
#### Intercept
B[0]
matrix([[22.60536463]])
y_pred = np.dot(np.transpose(B),np.transpose(X))
##### First 5 rows predicted
np.array(y_pred)[0][:5]
array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])
##### First 5 rows Ground Truth
y[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
### R^2 score
score_r2(y_pred,y)
0.7278959820021539
Complete Solution - 2020 - boston dataset
As others have said, to compute the coefficients for the linear regression you have to compute
β = (X^T X)^-1 X^T y
This gives you the coefficients (all the B for the features plus the intercept).
Be sure to add a column of all ones to X to compute the intercept (more in the code).
Main.py
from sklearn.datasets import load_boston
import numpy as np
from CustomLibrary import CustomLinearRegression
from CustomLibrary import CustomMeanSquaredError
boston = load_boston()
X = np.array(boston.data, dtype="f")
Y = np.array(boston.target, dtype="f")
regression = CustomLinearRegression()
regression.fit(X, Y)
print("Projection matrix sk:", regression.coefficients, "\n")
print("bias sk:", regression.intercept, "\n")
Y_pred = regression.predict(X)
loss_sk = CustomMeanSquaredError(Y, Y_pred)
print("Model performance:")
print("--------------------------------------")
print("MSE is {}".format(loss_sk))
print("\n")
CustomLibrary.py
import numpy as np

class CustomLinearRegression():

    def __init__(self):
        self.coefficients = None
        self.intercept = None

    def fit(self, x, y):
        x = self.add_one_column(x)
        x_T = np.transpose(x)
        inverse = np.linalg.inv(np.dot(x_T, x))
        pseudo_inverse = inverse.dot(x_T)
        coef = pseudo_inverse.dot(y)
        self.intercept = coef[0]
        self.coefficients = coef[1:]
        return coef

    def add_one_column(self, x):
        '''
        fit() on x features returns x coefficients, so to get the intercept
        plus the x feature coefficients we have to add one column
        (at the beginning) filled with ones
        '''
        X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
        X[:, 1:] = x
        return X

    def predict(self, x):
        predicted = np.array([])
        for sample in x:
            result = self.intercept
            for idx, feature_value_in_sample in enumerate(sample):
                result += feature_value_in_sample * self.coefficients[idx]
            predicted = np.append(predicted, result)
        return predicted

def CustomMeanSquaredError(Y, Y_pred):
    mse = 0
    for idx, data in enumerate(Y):
        mse += (data - Y_pred[idx])**2
    return mse * (1 / len(Y))
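For reference, the same MSE can be computed without the Python loop; a minimal vectorized sketch, equivalent to CustomMeanSquaredError above:
import numpy as np

def VectorizedMeanSquaredError(Y, Y_pred):
    # mean of the squared residuals, via numpy broadcasting
    return np.mean((np.asarray(Y) - np.asarray(Y_pred)) ** 2)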
I'm facing an imbalanced regression problem and I've already tried several ways to solve it. Eventually I came across this new metric called SERA (Squared Error Relevance Area), a custom scoring function for imbalanced regression, as mentioned in this paper: https://link.springer.com/article/10.1007/s10994-020-05900-9
To calculate SERA you have to compute the relevance function phi, which is varied from 0 to 1 in small steps. For each value of relevance (phi) (e.g. 0.45), a subset of the training dataset is selected where the relevance is greater than or equal to that value. For that selected training subset, the sum of squared errors is calculated, i.e. sum((y_true - y_pred)**2), which is known as the squared error relevance (SER). Then SER is plotted against phi and the area under the curve is calculated, i.e. SERA.
Here is my implementation, inspired by this other question here in StackOverflow:
import pandas as pd
from scipy.integrate import simps
from sklearn.metrics import make_scorer

def calc_sera(y_true, y_pred, x_relevance=None):
    # creating a list from 0 to 1 with 0.001 interval
    start_range = 0
    end_range = 1
    interval_size = 0.001
    list_1 = [round(val * interval_size, 3) for val in range(1, 1000)]
    list_1.append(start_range)
    list_1.append(end_range)
    epsilon = sorted(list_1, key=lambda x: float(x))
    df = pd.concat([y_true, y_pred, x_relevance], axis=1, keys=['true', 'pred', 'phi'])
    # Initiating lists to store relevance (phi) and squared-error relevance (SER)
    relevance = []
    ser = []
    # for each phi, sum the squared errors over the rows with relevance >= phi
    for phi in epsilon:
        relevance.append(phi)
        error_squared_sum = sum((df[df.phi >= phi]['true'] - df[df.phi >= phi]['pred'])**2)
        ser.append(error_squared_sum)
    # squared-error relevance area (SERA)
    # numerical integration using simps(y, x)
    sera = simps(ser, relevance)
    return sera

sera = make_scorer(calc_sera, x_relevance=X['relevance'], greater_is_better=False)
I implemented a simple GridSearch using this score as an evaluation metric to select the best model:
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV, KFold

model = CatBoostRegressor(random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)
parameters = {'depth': [6, 8, 10],
              'learning_rate': [0.01, 0.05, 0.1],
              'iterations': [100, 200, 500, 1000]}
clf = GridSearchCV(estimator=model,
                   param_grid=parameters,
                   scoring=sera,
                   verbose=0, cv=cv)
clf.fit(X=X.drop(columns=['relevance']),
        y=y,
        sample_weight=X['relevance'])
print("Best parameters:", clf.best_params_)
print("Lowest SERA: ", clf.best_score_)
I also added the relevance function as weights to the model, so it can apply these weights in the learning task. However, what I am getting as output is this:
Best parameters: {'depth': 6, 'iterations': 100, 'learning_rate': 0.01}
Lowest SERA: nan
Any clue on why the SERA value is returning nan? Should I implement this another way?
Whenever you get unexpected NaN scores in a grid search, you should set the parameter error_score="raise" to get an error traceback, and debug from there.
In this case I think I see the problem though: sera is defined with x_relevance=X['relevance'], which includes all the rows of X. But in the search, you're cross-validating: each testing set has fewer rows, and those are what sera will be called on. I can think of a couple of options; I haven't tested either, so let me know if something doesn't work.
Use pandas index
In your pd.concat, just set join="inner". If y_true is a slice of the original pandas series (I think this is how GridSearchCV will pass it...), then the concatenation will join on those row indices, so keeping the whole of X['relevance'] is fine: it will just drop the irrelevant rows. y_pred may well be a numpy array, so you might need to set its index appropriately first, as in the sketch below.
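A sketch of that change inside calc_sera (untested; it assumes y_true arrives as a pandas slice that keeps its original index):
import numpy as np
import pandas as pd

def calc_sera(y_true, y_pred, x_relevance=None):
    # align y_pred with y_true's index, then inner-join so that only the
    # test fold's rows of x_relevance survive the concat
    y_pred = pd.Series(np.asarray(y_pred), index=y_true.index)
    df = pd.concat([y_true, y_pred, x_relevance], axis=1, join="inner",
                   keys=['true', 'pred', 'phi'])
    ...  # the rest of the function is unchanged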
Keep relevance in X
Then your scorer can reference the relevance column directly from the sliced X. For this, you will want to drop that column from the fitting data, which could be difficult to do for the training but not the testing set; however, CatBoost has an ignored_features parameter that I think ought to work. A sketch follows.
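A sketch of that option (also untested): keep 'relevance' in X, point ignored_features at its column index, and use a callable scorer so it can read the relevance column off the sliced test fold:
import pandas as pd
from catboost import CatBoostRegressor

rel_idx = list(X.columns).index('relevance')
model = CatBoostRegressor(random_state=0, ignored_features=[rel_idx])

def sera_scorer(estimator, X_test, y_test):
    # X_test here is the already-sliced test fold, relevance column included
    y_pred = pd.Series(estimator.predict(X_test), index=X_test.index)
    return -calc_sera(y_test, y_pred, x_relevance=X_test['relevance'])

You would then pass scoring=sera_scorer to GridSearchCV and fit on the full X, relevance column included.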
I use a sklearn LinearRegression() estimator, with 5 variables
['feat1', 'feat2', 'feat3', 'feat4', 'feat5']
in order to predict a continuous value.
The estimator returns the list of coefficient values and the bias:
linear = LinearRegression()
linear.fit(X_train, y_train)  # fit on the training data
print(linear.coef_)
print(linear.intercept_)
[ 0.18799409 -0.05406106 -0.01327966 -0.13348129 -0.00614054]
-0.011064865422734674
Then, given that I have each feature as a variable, I can hardcode the coefficients into a linear formula and estimate my values, like so:
val = ((0.18799409*feat1) - (0.05406106*feat2) - (0.01327966*feat3) - (0.13348129*feat4) - (0.00614054*feat5)) -0.011064865422734674
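(Equivalently, that hardcoded formula reduces to a dot product; a sketch:)
import numpy as np

# same result as the hardcoded line above, using the coefficient array directly
feats = np.array([feat1, feat2, feat3, feat4, feat5])
val = feats @ linear.coef_ + linear.intercept_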
Now let's say I use a polynomial regression of degree 2, using a pipeline, and print:
model = Pipeline(steps=[
('scaler',StandardScaler()),
('polynomial_features', PolynomialFeatures(degree=degree, include_bias=False)),
('linear_regression', LinearRegression())])
#fit model
model.fit(X_train, y_train)
print(model['linear_regression'].coef_)
print(model['linear_regression'].intercept_)
I get:
[ 7.06524186e-01 -2.98605001e-02 -4.67175212e-02 -4.86890790e-01
-1.06320101e-02 -2.77958604e-03 -3.38253025e-04 -7.80563090e-03
4.51356888e-03 8.32036733e-03 3.57638244e-02 -2.16446849e-02
-7.92169287e-02 3.36809467e-02 -6.60531497e-03 2.16613331e-02
2.10097993e-02 3.49970303e-02 -3.02970698e-02 -7.81462599e-03]
0.011042927069084668
How do I transform the formula above in order to calculate val from the regression, with the values from .coef_ and .intercept_, using array indexing instead of hardcoding the values, for any degree n?
Is there any scipy or numpy method suited for that?
It's important to note that polynomial regression is just an extended case of linear regression, thus all we need to do is transform our input data consistently. For any N we can use PolynomialFeatures from sklearn.preprocessing. Using dummy data, we can see how this works:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
#set parameters
X = np.stack([np.arange(i,i+10) for i in range(5)]).T
Y = np.random.randn(10)*10+3
N = 2
poly_reg=PolynomialFeatures(degree=N,include_bias=False)
X_poly=poly_reg.fit_transform(X)
#print(X[0],X_poly[0]) # to inspect the transformed features (with include_bias=False there is no leading column of ones)
poly = LinearRegression().fit(X_poly, Y)
And thus, we can get the coef_ the way you were doing before, and simply perform a matrix multiplication to get the regressed value.
new_dat = poly_reg.transform(np.arange(2,2+10,2)[None]) # one new datapoint with 5 features
np.testing.assert_array_equal(poly.predict(new_dat), new_dat @ poly.coef_ + poly.intercept_)
----EDIT----
In case you cannot use the transform from PolynomialFeatures, it's just an iterated combination loop to generate the data from your list of features.
new_feats = np.array([feat1,feat2,feat3,feat4,feat5])

from itertools import combinations_with_replacement
def gen_poly_feats(x,N):
    # this function returns all unique groupings (w/ replacement) of the indices
    # into the array x for use in polynomial regression.
    return np.concatenate([[np.prod(x[np.array(i)]) for i in list(combinations_with_replacement(range(len(x)), n))] for n in range(1,N+1)])[None]

new_feats_poly = gen_poly_feats(new_feats,N)
# just to be sure that this matches...
np.testing.assert_array_equal(new_feats_poly, poly_reg.transform(new_feats[None]))
# then we can use the above linear regression model to predict the new data
val = new_feats_poly @ poly.coef_ + poly.intercept_
A reproducible example to fix the discussion:
from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
boston = scale(load_boston().data)
target = load_boston().target
import numpy as np
alphas = np.linspace(1.0,200.0, 5)
fit0 = RidgeCV(alphas=alphas, store_cv_values = True, gcv_mode='eigen').fit(boston, target)
fit0.alpha_
fit0.cv_values_[:,0]
The question: what formula is used to compute fit0.cv_values_?
Edit:
@Abhinav Arora's answer below seems to suggest that fit0.cv_values_[:,0][0], the first entry of fit0.cv_values_[:,0], would be
(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2
where fit1 is a ridge regression with alpha = 1.0, fitted to the dataset from which observation 0 was removed.
Let's see:
1) create new dataset with first row of original dataset removed:
from sklearn.linear_model import Ridge
boston1 = np.delete(boston, (0), axis=0)
target1 = np.delete(target, (0), axis=0)
2) fit a ridge model with alpha = 1.0 on this truncated dataset:
fit1 = Ridge(alpha=1.0).fit(boston1, target1)
3) check the MSE of that model on the first data-point:
(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2
it is array([ 37.64650853]), which is not the same as what is produced by fit0.cv_values_[:,0], ergo:
fit0.cv_values_[:,0][0]
which is 37.495629960571137
What gives?
Quoting from the Sklearn documentation:
Cross-validation values for each alpha (if store_cv_values=True and
cv=None). After fit() has been called, this attribute will contain the
mean squared errors (by default) or the values of the
{loss,score}_func function (if provided in the constructor).
Since you have not provided any scoring function in the constructor and also not provided anything for the cv argument in the constructor, this attribute should store the mean squared error for each sample using Leave-One-Out cross validation. The general formula for Mean Squared Error is

MSE = (1/n) * sum_{i=1..n} (Y_i - Y_hat_i)^2

where Y_hat is the prediction of your regressor and Y is the true value.
In your case, you are doing Leave-One-Out cross validation. Therefore, in every fold you have only 1 test point and thus n = 1. So fit0.cv_values_[:,0] will simply give you the squared error for every point in your training data set when it was part of the test fold, for alpha = 1.0.
Hope that helps.
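Spelled out as an explicit leave-one-out loop, the claim looks like this (a sketch; small numerical differences, like the one in your edit, can come from how the closed-form shortcut discussed in the answer below handles details such as the intercept):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

sq_errs = np.empty(len(target))
for train, test in LeaveOneOut().split(boston):
    m = Ridge(alpha=1.0).fit(boston[train], target[train])
    sq_errs[test] = (m.predict(boston[test]) - target[test]) ** 2
# sq_errs should broadly track fit0.cv_values_[:, 0]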
Let's look - it's open source after all
The first call to fit makes a call upwards to its parent, _BaseRidgeCV (line 997, in that implementation). We haven't provided a cross-validation generator, so we make another call upwards to _RidgeGCV.fit. There's plenty of math in the documentation of this function, but we're so close to the source that I'll let you go and read about it.
Here's the actual source
v, Q, QT_y = _pre_compute(X, y)
n_y = 1 if len(y.shape) == 1 else y.shape[1]
cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
C = []

scorer = check_scoring(self, scoring=self.scoring, allow_none=True)
error = scorer is None

for i, alpha in enumerate(self.alphas):
    weighted_alpha = (sample_weight * alpha
                      if sample_weight is not None
                      else alpha)
    if error:
        out, c = _errors(weighted_alpha, y, v, Q, QT_y)
    else:
        out, c = _values(weighted_alpha, y, v, Q, QT_y)
    cv_values[:, i] = out.ravel()
    C.append(c)
Note the unexciting _pre_compute function:
def _pre_compute(self, X, y):
    # even if X is very sparse, K is usually very dense
    K = safe_sparse_dot(X, X.T, dense_output=True)
    v, Q = linalg.eigh(K)
    QT_y = np.dot(Q.T, y)
    return v, Q, QT_y
Abhinav has explained what's going on at a mathematical level - it's simply accumulating the weighted mean squared error. The details of the implementation, and where it differs from yours, can be evaluated step-by-step from the code.
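If you just want the closed form without digging through the source: for a fixed alpha, ridge regression has an exact leave-one-out identity via the hat matrix, which is what makes the loop-free computation above possible. A sketch for the no-intercept case (sklearn's actual code works on the kernel matrix from _pre_compute instead):
import numpy as np

def ridge_loo_squared_errors(X, y, alpha):
    # hat matrix H = X (X^T X + alpha*I)^-1 X^T;
    # the exact LOO residual is (y_i - yhat_i) / (1 - h_ii)
    H = X @ np.linalg.inv(X.T @ X + alpha * np.eye(X.shape[1])) @ X.T
    resid = y - H @ y
    return (resid / (1 - np.diag(H))) ** 2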
I am trying to implement this algorithm to find the intercept and slope for a single variable:
Here is my Python code to update the intercept and slope. But it is not converging: RSS is increasing with each iteration rather than decreasing, and after some iterations it becomes infinite. I cannot find any error in my implementation of the algorithm. How can I solve this problem? I have attached the CSV file too.
Here is the code.
import pandas as pd
import numpy as np

#Defining gradient_decend
#This Function takes X value, Y value and vector of w0(intercept),w1(slope)
#INPUT FEATURES=X(sq.feet of house size)
#TARGET VALUE=Y (Price of House)
#W=np.array([w0,w1]).reshape(2,1)
#W=[w0,
#   w1]
def gradient_decend(X,Y,W):
    intercept=W[0][0]
    slope=W[1][0]
    #Here i will get a list
    #list is like this
    #gd=[sum(predicted_value-(intercept+slope*x)),
    #    sum(predicted_value-(intercept+slope*x)*x)]
    gd=[sum(y-(intercept+slope*x) for x,y in zip(X,Y)),
        sum(((y-(intercept+slope*x))*x) for x,y in zip(X,Y))]
    return np.array(gd).reshape(2,1)

#Defining Residual sum of squares
def RSS(X,Y,W):
    return sum((y-(W[0][0]+W[1][0]*x))**2 for x,y in zip(X,Y))

#Reading Training Data
training_data=pd.read_csv("kc_house_train_data.csv")

#Defining fixed parameters
#Learning Rate
n=0.0001
iteration=1500
#Intercept
w0=0
#Slope
w1=0

#Creating 2,1 vector of w0,w1 parameters
W=np.array([w0,w1]).reshape(2,1)

#Running gradient Decend
for i in range(iteration):
    W=W+((2*n)*(gradient_decend(training_data["sqft_living"],training_data["price"],W)))
    print(RSS(training_data["sqft_living"],training_data["price"],W))
Here is the CSV file.
Firstly, I find that when writing machine learning code, it's best NOT to use complex list comprehensions, because anything that you can iterate
is easier to read if written with normal loops and indentation, and/or
can be done with numpy broadcasting
And using proper variable names can help you better understand the code. Using Xs, Ys, Ws as shorthand is nice only if you're good at math. Personally, I don't use them in code, especially when writing in Python. From import this: explicit is better than implicit.
My rule of thumb is to remember that if I write code I can't read 1 week later, it's bad code.
First, let's decide what the input parameters for gradient descent are. You will need:
feature_matrix (The X matrix, type: numpy.array, a matrix of N * D size, where N is the no. of rows/datapoints and D is the no. of columns/features)
output (The Y vector, type: numpy.array, a vector of size N)
initial_weights (type: numpy.array, a vector of size D).
Additionally, to check for convergence you will need:
step_size (the magnitude of change when iterating through to change the weights; type: float, usually a small number)
tolerance (the criteria to break the iterations; when the gradient magnitude is smaller than tolerance, assume that your weights have converged; type: float, usually a small number but much bigger than the step size).
Now to the code.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False # Set a boolean to check for convergence
    weights = np.array(initial_weights) # make sure it's a numpy array
    while not converged:
        # compute the predictions based on feature_matrix and weights.
        # iterate through the rows and find the single scalar predicted
        # value for each weight * column.
        # hint: a dot product can solve this easily
        predictions = [??? for row in feature_matrix]
        # compute the errors as predictions - output
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            # Hint: the derivative is = 2 * dot product of feature_column and errors.
            derivative = 2 * ????
            # add the squared value of the derivative to the gradient magnitude (for assessing convergence)
            gradient_sum_squares += (derivative * derivative)
            # subtract the step size times the derivative from the current weight
            weights[i] -= (step_size * derivative)
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = ???
        # Then check whether the magnitude is lower than the tolerance.
        if ???:
            converged = True
    # Once the while loop breaks, return the weights.
    return(weights)
I hope the extended pseudo-code helps you better understand the gradient descent. I won't fill in the ??? so as to not spoil your homework.
Note that your RSS code is also hard to read and maintain. It's easier to do just:
>>> import numpy as np
>>> prediction = np.array([1,2,3])
>>> output = np.array([1,1,5])
>>> residual = output - prediction
>>> RSS = sum(residual * residual)
>>> RSS
5
Going through numpy basics will go a long way to machine learning and matrix-vector manipulation without going nuts with iterations: http://docs.scipy.org/doc/numpy-1.10.1/user/basics.html
I have solved my own problem!
Here is the solution.
import numpy as np
import pandas as pd
import math
from sys import stdout

#function takes the pandas dataframe, input features list and the target column name
def get_numpy_data(data, features, output):
    #Adding a constant column with value 1 in the dataframe.
    data['constant'] = 1
    #Adding the name of the constant column in the feature list.
    features = ['constant'] + features
    #Creating Feature matrix (selecting columns and converting to matrix).
    features_matrix=data[features].as_matrix()
    #Target column is converted to the numpy array
    output_array=np.array(data[output])
    return(features_matrix, output_array)

def predict_outcome(feature_matrix, weights):
    weights=np.array(weights)
    predictions = np.dot(feature_matrix, weights)
    return predictions

def errors(output,predictions):
    errors=predictions-output
    return errors

def feature_derivative(errors, feature):
    derivative=np.dot(2,np.dot(feature,errors))
    return derivative

def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    #Initial weights are converted to numpy array
    weights = np.array(initial_weights)
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions=predict_outcome(feature_matrix,weights)
        # compute the errors as predictions - output:
        error=errors(output,predictions)
        gradient_sum_squares = 0 # initialize the gradient
        # while not converged, update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            feature=feature_matrix[:, i]
            # compute the derivative for weight[i]:
            deriv=feature_derivative(error,feature)
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares=gradient_sum_squares+(deriv**2)
            # update the weight based on step size and derivative:
            weights[i]=weights[i] - np.dot(step_size,deriv)
        gradient_magnitude = math.sqrt(gradient_sum_squares)
        stdout.write("\r%d" % int(gradient_magnitude))
        stdout.flush()
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)

#Example of Implementation
#Importing Training and Testing Data
# train_data=pd.read_csv("kc_house_train_data.csv")
# test_data=pd.read_csv("kc_house_test_data.csv")
# simple_features = ['sqft_living', 'sqft_living15']
# my_output= 'price'
# (simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
# initial_weights = np.array([-100000., 1., 1.])
# step_size = 7e-12
# tolerance = 2.5e7
# simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
# print(simple_weights)
It is so simple
def mean(values):
    return sum(values)/float(len(values))

def variance(values, mean):
    return sum([(x-mean)**2 for x in values])

def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar += (x[i]-mean_x) * (y[i]-mean_y)
    return covar

def coefficients(dataset):
    x = []
    y = []
    for line in dataset:
        xi, yi = map(float, line.split(','))
        x.append(xi)
        y.append(yi)
    dataset.close()
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean)/variance(x, x_mean)
    b0 = y_mean - b1*x_mean
    return [b0, b1]

dataset = open('trainingdata.txt')
b0, b1 = coefficients(dataset)
n = float(raw_input())
print(b0 + b1*n)
Reference: www.machinelearningmastery.com/implement-simple-linear-regression-scratch-python/
How can I perform stepwise regression in Python? There are methods for OLS in scipy, but I am not able to do stepwise regression. Any help in this regard would be a great help. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I aim to select variables such that my model has the lowest p-value. The following link explains the objective:
https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk
Thanks again.
Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
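The core of it looks roughly like this (a condensed sketch of the linked function, selecting by adjusted R-squared; see the post for the full version):
import statsmodels.formula.api as smf

def forward_selected(data, response):
    # greedily add the predictor that most improves adjusted R-squared
    remaining = set(data.columns) - {response}
    selected, current_score = [], 0.0
    while remaining:
        scores = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, " + ".join(selected + [candidate]))
            scores.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
        best_new_score, best_candidate = max(scores)
        if best_new_score <= current_score:
            break
        remaining.remove(best_candidate)
        selected.append(best_candidate)
        current_score = best_new_score
    return smf.ols("{} ~ {}".format(response, " + ".join(selected)), data).fit()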
You may try mlxtend, which has various selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()
# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)
# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
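After fitting, you can inspect which features were selected and the cross-validated score of that subset:
# indices of the selected features and the CV score of the chosen subset
print(sfs1.k_feature_idx_)
print(sfs1.k_score_)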
You can make forward-backward selection based on statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use Pipeline and GridSearchCV with my classes.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}

################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []
    M = min(fK, X_est.shape[1])
    for i in range(1, M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']
    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Using the summary method, you can check in your kernel the p-values of your variables, written as 'P>|t|'. Then check for the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.
X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
Repeat these steps until you have removed all the columns whose p-value is higher than the significance level (e.g. 0.05). In the end your variable X_opt will contain all the optimal variables, with p-values less than the significance level. (A loop automating this elimination is sketched below.)
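A sketch of that loop (assuming X is a numpy array that already includes the constant column, as above):
import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    cols = list(range(X.shape[1]))
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X[:, cols]).fit()
        worst = int(np.argmax(regressor_OLS.pvalues))
        if regressor_OLS.pvalues[worst] <= significance_level:
            return cols, regressor_OLS  # all remaining p-values are significant
        cols.pop(worst)  # drop the predictor with the highest p-value

X_opt_cols, regressor_OLS = backward_elimination(X, y)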
Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:
lm, a statsmodels.OLS.fit(Y,X), where X is an array of n ones (n being the number of data points) and Y is the response in the training data
curr_preds - a list with ['const']
potential_preds - a list of all potential predictors.
There also needs to be a pandas dataframe X_mix that has all of the data, including 'const', and all of the data corresponding to the potential predictors.
tol, optional. The max p-value, .05 if not specified.
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1 # this will record the index of the best predictor
        curr = -1 # this will record the current index
        best_r_squared = lm.rsquared_adj # record the r squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1 # increment current
            preds = curr_preds.copy() # grab the current predictors
            preds.append(pred)
            lm_new = sm.OLS(y, X_mix[preds]).fit() # model with the current predictors plus one additional potential predictor
            new_r_sq = lm_new.rsquared_adj # record r squared for new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1: # a potential predictor improved the r-squared; remove it from potential_preds and add it to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else: # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # fit a new lm using the new predictors, look at the p-values
        pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const': # if the p-value is too large, add it to the list of big p-values
                pval_too_big.append(feat)
        # now remove all the features from curr_preds that have a p-value that is too large
        for feat in pval_too_big:
            pop_index = curr_preds.index(feat)
            curr_preds.pop(pop_index)
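A hypothetical usage, assuming y and the dataframe X_mix described above are already defined:
import statsmodels.api as sm

# start from the intercept-only model and let mixed selection grow it
lm0 = sm.OLS(y, X_mix[['const']]).fit()
curr_preds = ['const']
potential_preds = [c for c in X_mix.columns if c != 'const']
mixed_selection(lm0, curr_preds, potential_preds, tol=.05)
print(curr_preds)  # mutated in place with the chosen predictors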