The reproducible example to fix the discussion:
from sklearn.linear_model import RidgeCV
from sklearn.datasets import load_boston
from sklearn.preprocessing import scale
boston = scale(load_boston().data)
target = load_boston().target
import numpy as np
alphas = np.linspace(1.0,200.0, 5)
fit0 = RidgeCV(alphas=alphas, store_cv_values = True, gcv_mode='eigen').fit(boston, target)
fit0.alpha_
fit0.cv_values_[:,0]
The question: what formula is used to compute fit0.cv_values_?
Edit:
@Abhinav Arora's answer below seems to suggest that fit0.cv_values_[:,0][0], the first entry of fit0.cv_values_[:,0], would be
(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2
where fit1 is a ridge regression with alpha = 1.0, fitted to the data-set from which observation 0 was removed.
Let's see:
1) create new dataset with first row of original dataset removed:
from sklearn.linear_model import Ridge
boston1 = np.delete(boston, (0), axis=0)
target1 = np.delete(target, (0), axis=0)
2) fit a ridge model with alpha = 1.0 on this truncated dataset:
fit1 = Ridge(alpha=1.0).fit(boston1, target1)
3) check the MSE of that model on the first data-point:
(fit1.predict(boston[0,].reshape(1, -1)) - target[0])**2
This gives array([ 37.64650853]), which is not the same as the corresponding entry of fit0.cv_values_[:,0], namely:
fit0.cv_values_[:,0][0]
which is 37.495629960571137
What gives?
Quoting from the Sklearn documentation:
Cross-validation values for each alpha (if store_cv_values=True and
cv=None). After fit() has been called, this attribute will contain the
mean squared errors (by default) or the values of the
{loss,score}_func function (if provided in the constructor).
Since you have not provided a scoring function or a cv argument in the constructor, this attribute should store the squared error for each sample under leave-one-out cross-validation. The general formula for the mean squared error is
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$
where \hat{Y} (the Y with the cap) is the prediction of your regressor and the other Y is the true value.
In your case, you are doing leave-one-out cross-validation. Therefore, in every fold you have only 1 test point and thus n = 1. So fit0.cv_values_[:,0] will simply give you the squared error for every point in your training data set when it was the test fold and the value of alpha was 1.0.
Hope that helps.
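For ridge regression without an intercept, the per-sample leave-one-out squared errors described above can be computed in closed form instead of refitting n times. The following is a minimal sketch of that identity on synthetic data; the function names and the data are illustrative assumptions, and it makes no claim about how RidgeCV handles centering or the intercept, which is one place where a hand-rolled check could plausibly differ from cv_values_.

```python
import numpy as np

def ridge_loo_squared_errors(X, y, alpha):
    """Leave-one-out squared errors via the hat-matrix shortcut (no intercept)."""
    n_features = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + alpha * np.eye(n_features))
    H = X @ A_inv @ X.T                      # hat matrix of the full fit
    residuals = y - H @ y                    # in-sample residuals
    return (residuals / (1.0 - np.diag(H))) ** 2

def ridge_loo_brute_force(X, y, alpha):
    """Same quantity, obtained by actually refitting with each row removed."""
    n_samples, n_features = X.shape
    errors = np.empty(n_samples)
    for i in range(n_samples):
        keep = np.arange(n_samples) != i
        X_train, y_train = X[keep], y[keep]
        coef = np.linalg.solve(
            X_train.T @ X_train + alpha * np.eye(n_features),
            X_train.T @ y_train,
        )
        errors[i] = (y[i] - X[i] @ coef) ** 2
    return errors

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=30)

fast = ridge_loo_squared_errors(X_demo, y_demo, alpha=1.0)
slow = ridge_loo_brute_force(X_demo, y_demo, alpha=1.0)
print(np.allclose(fast, slow))
```

If the two arrays agree here but a comparison against RidgeCV does not, the difference is more likely to come from preprocessing (such as intercept handling) than from the squared-error formula itself.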
Let's look - it's open source after all
The first call to fit makes a call upwards to its parent, _BaseRidgeCV (line 997, in that implementation). We haven't provided a cross-validation generator, so we make another call upwards to _RidgeGCV.fit. There's plenty of math in the documentation of this function, but we're so close to the source that I'll let you go and read about it.
Here's the actual source
v, Q, QT_y = _pre_compute(X, y)
n_y = 1 if len(y.shape) == 1 else y.shape[1]
cv_values = np.zeros((n_samples * n_y, len(self.alphas)))
C = []
scorer = check_scoring(self, scoring=self.scoring, allow_none=True)
error = scorer is None
for i, alpha in enumerate(self.alphas):
    weighted_alpha = (sample_weight * alpha
                      if sample_weight is not None
                      else alpha)
    if error:
        out, c = _errors(weighted_alpha, y, v, Q, QT_y)
    else:
        out, c = _values(weighted_alpha, y, v, Q, QT_y)
    cv_values[:, i] = out.ravel()
    C.append(c)
Note the un-exciting pre_compute function
def _pre_compute(self, X, y):
    # even if X is very sparse, K is usually very dense
    K = safe_sparse_dot(X, X.T, dense_output=True)
    v, Q = linalg.eigh(K)
    QT_y = np.dot(Q.T, y)
    return v, Q, QT_y
Abhinav has explained what's going on at a mathematical level: it's simply accumulating the weighted mean squared error. The details of their implementation, and where it differs from yours, can be traced step by step through the code.
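The point of the eigendecomposition in _pre_compute is that it is done once and then reused for every alpha. As a rough sketch of that idea (not a copy of scikit-learn's _errors, whose body is not shown here), the leave-one-out squared errors of a no-intercept ridge model can be read off from the eigenvalues v and eigenvectors Q of K = X X^T. The helper name loo_errors_via_eigh and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def loo_errors_via_eigh(X, y, alphas):
    """LOO squared errors for each alpha from one eigendecomposition of X X^T."""
    K = X @ X.T
    v, Q = np.linalg.eigh(K)          # done once, shared by all alphas
    QT_y = Q.T @ y
    cv_values = np.zeros((X.shape[0], len(alphas)))
    for j, alpha in enumerate(alphas):
        w = 1.0 / (v + alpha)         # eigenvalues of (K + alpha * I)^-1
        c = Q @ (w * QT_y)            # (K + alpha * I)^-1 y
        G_inv_diag = (Q ** 2) @ w     # diagonal of (K + alpha * I)^-1
        cv_values[:, j] = (c / G_inv_diag) ** 2
    return cv_values

# Synthetic data, purely for illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 3))
y_demo = rng.normal(size=20)
alphas_demo = np.linspace(1.0, 200.0, 5)
cv_demo = loo_errors_via_eigh(X_demo, y_demo, alphas_demo)
print(cv_demo.shape)
```

Each column of the result corresponds to one alpha, which mirrors the layout of the cv_values array in the source excerpt above.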
For example, for a simple linear model y=wx+b where x and y are input and output respectively, w and b are training parameters, I am wondering, in every epoch, how can I update b first and then update w?
Tensorflow might not be the best tool for this. You can do it just using python.
And if you need to do the regression with a more complex function scikit-learn might be a more appropriate library.
Regardless of the tool, you can do Batch Gradient Descent or Stochastic Gradient Descent.
But first you need to define a "cost function". This function tells you how far your predictions are from the true values, for example least mean squares (LMS). Functions of this type take your model's prediction and the true value, and the result drives the adjustment of the training parameters.
This is the function that BGD or SGD optimizes during training.
Here is an example I wrote to understand what is happening. It is not the optimal solution, but it should give you an idea of how the updates work.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")
alpha = 0.00005
thetas = np.array([1.,1.])
def h(thetas, x):
    #print(f'theta 0: {thetas[0]}')
    #print(f'theta 1: {thetas[1]}')
    #print(f'h=m:{thetas[0] + (thetas[1]*x[1])}')
    return thetas[0] + (thetas[1]*x[1])
for i in zip(tips.total_bill, tips.tip):
    x = np.array([1, i[0]])
    y = i[1]
    for index, theta in enumerate(thetas):
        #print(f'theta in: {thetas[index]}')
        #print(f'error: {thetas[index] + alpha*(y - h(thetas, x))*x[index]}')
        thetas[index] = thetas[index] + alpha*(y - h(thetas, x))*x[index]
        #print(f'theta out: {thetas[index]}')
    #print(thetas)
print(thetas)
xplot = np.linspace(min(tips.total_bill), max(tips.total_bill), 100, endpoint=True)
xp = [[1,x] for x in xplot]
yp = [h(thetas, xi) for xi in xp]
plt.scatter(tips.total_bill,tips.tip)
plt.plot(xplot, yp, 'o', color= 'orange')
plt.show()
Not really possible. TF's backprop calculates gradients across all variables based on the values of the other variables at the time of the forward pass. If you want to alternate between training w and b, you would unfreeze w and freeze b (set it to trainable=False), run a forward and backward pass, then freeze w and unfreeze b, and run a forward and backward pass again. I don't think that would run very fast, since TF isn't really designed to switch the trainable flag on every mini-batch.
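Outside TensorFlow, the alternating scheme is easy to write by hand. Here is a minimal NumPy sketch, assuming a mean-squared-error cost and a hypothetical helper named fit_alternating: in each epoch b is updated first with w held fixed, then the residuals are recomputed with the new b and w is updated.

```python
import numpy as np

def fit_alternating(x, y, learning_rate=0.05, epochs=1000):
    """Fit y = w*x + b by updating b first, then w, in every epoch."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # Step 1: gradient step on b only (w is held fixed)
        residual = y - (w * x + b)
        b += learning_rate * 2.0 * residual.mean()
        # Step 2: recompute residuals with the new b, then step on w only
        residual = y - (w * x + b)
        w += learning_rate * 2.0 * (residual * x).mean()
    return w, b

# Noise-free synthetic line, purely for illustration
x_demo = np.linspace(-1.0, 1.0, 50)
y_demo = 3.0 * x_demo + 0.5
w_fit, b_fit = fit_alternating(x_demo, y_demo)
print(w_fit, b_fit)
```

The learning rate and epoch count are arbitrary choices that happen to suit this small, well-scaled example; other data would need different values.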
I am trying to create a multiple linear regression model from scratch in python. Dataset used: Boston Housing Dataset from Sklearn. Since my focus was on the model building I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a Linear Regression model to find out the weights for each feature.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
X=load_boston()
data=pd.DataFrame(X.data,columns=X.feature_names)
y=X.target
data.head()
#dropping three features
data=data.drop(['INDUS','NOX','AGE'],axis=1)
#new shape of the data (506,10) not including the target variable
#Passed the whole dataset to Linear Regression Model
model_lr=LinearRegression()
model_lr.fit(data,y)
model_lr.score(data,y)
0.7278959820021539
model_lr.intercept_
22.60536462807957 #----- intercept value
model_lr.coef_
array([-0.09649731, 0.05281081, 2.3802989 , 3.94059598, -1.05476566,
0.28259531, -0.01572265, -0.75651996, 0.01023922, -0.57069861]) #--- coefficients
Now I wanted to calculate the coefficients manually in excel before creating the model in python. To calculate the weights of each feature I used this formula:
b1 = sum((x_i - mean(x)) * (y_i - mean(y))) / sum((x_i - mean(x))^2), applied to each feature x separately
To calculate the intercept I used the formula
b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
The intercept value from my calculations was 22.63551387 (almost the same as that of the model).
The problem is that the weights of the features from my calculation are far off from that of the sklearn linear model.
-0.002528644 #-- CRIM
-0.001028914 #-- Zn
-0.038663314 #-- CHAS
-0.035026972 #-- RM
-0.014275311 #-- DIS
-0.004058291 #-- RAD
-0.000241103 #-- TAX
-0.015035534 #-- PTRATIO
-0.000318376 #-- B
-0.006411897 #-- LSTAT
Using the first row as test data to check my calculations, I get 22.73167044199992, while the Linear Regression model predicts 30.42657776. The original value is 24.
When I check other rows, the sklearn model's predictions vary much more, while the predictions made with the weights from my calculations all stay close to 22.
I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all my coefficients so close to 0?
Here is my code for calculating the coefficients (beginner here):
x_1=[]
x_2=[]
for i,j in zip(data['CRIM'],y):
    mean_x=data['CRIM'].mean()
    mean_y=np.mean(y)
    c=i-mean_x*(j-mean_y)
    d=(i-mean_x)**2
    x_1.append(c)
    x_2.append(d)
print(sum(x_1)/sum(x_2))
Thank you for reading this long post, I appreciate it.
It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.
EDIT
Now after seeing your code for the coefficient calculation, the problem is clearer.
You cannot use this equation to calculate the coefficients of each feature independent of each other, as each coefficient will depend on all the features. I suggest you take a look at the derivation of the solution to this least squares optimization problem in the simple case here and in the general case here. And as a general tip stick with matrix implementation whenever you can, as this is radically more efficient.
However, in this case we have a 10-dimensional feature vector, and so in matrix notation it becomes
$$\beta = (X^T X)^{-1} X^T y$$
See derivation here
I suspect you made some computational error here, as implementing this in Python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this piece of your code, it's hard to know.
Here's an example of how you would implement it:
def calc_coefficients(X,Y):
    # np.asmatrix keeps matrix semantics, so **(-1) below is a matrix inverse
    X = np.asmatrix(X)
    Y = np.asmatrix(Y)
    return np.dot((np.dot(np.transpose(X),X))**(-1),np.transpose(np.dot(Y,X)))
def score_r2(y_pred,y_true):
    ss_tot=np.power(y_true-y_true.mean(),2).sum()
    ss_res = np.power(y_true -y_pred,2).sum()
    return 1 -ss_res/ss_tot
X = np.ones(shape=(506,11))
X[:,1:] = data.values
B=calc_coefficients(X,y)
##### Coefficients
B[:]
matrix([[ 2.26053646e+01],
[-9.64973063e-02],
[ 5.28108077e-02],
[ 2.38029890e+00],
[ 3.94059598e+00],
[-1.05476566e+00],
[ 2.82595310e-01],
[-1.57226536e-02],
[-7.56519964e-01],
[ 1.02392192e-02],
[-5.70698610e-01]])
#### Intercept
B[0]
matrix([[22.60536463]])
y_pred = np.dot(np.transpose(B),np.transpose(X))
##### First 5 rows predicted
np.array(y_pred)[0][:5]
array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])
##### First 5 rows Ground Truth
y[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
### R^2 score
score_r2(y_pred,y)
0.7278959820021539
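To see why the one-feature formula cannot be applied to each feature separately, here is a small sketch on synthetic data with two correlated features (the data, the true coefficients, and the helper scalar_slope are all illustrative assumptions). The joint least-squares fit recovers the coefficients used to generate y, while the per-feature slopes do not. Separately, note that in the loop posted in the question, i-mean_x*(j-mean_y) multiplies before it subtracts, so it does not compute (i-mean_x)*(j-mean_y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)      # deliberately correlated with x1
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.1, size=n)

def scalar_slope(x, y):
    """Slope of a simple (one-feature) regression of y on x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Joint fit: intercept column plus both features, solved by least squares
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

slope_x1_alone = scalar_slope(x1, y)
slope_x2_alone = scalar_slope(x2, y)
print("joint fit (intercept, x1, x2):", beta)
print("one feature at a time:", slope_x1_alone, slope_x2_alone)
```

The per-feature slope for x1 absorbs part of the effect of x2 because the two move together, which is why it ends up far from the joint coefficient.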
Complete Solution - 2020 - boston dataset
As the others said, to compute the coefficients for the linear regression you have to compute
β = (X^T X)^-1 X^T y
This gives you all the coefficients (one β per feature, plus the intercept).
Be sure to add a column of ones to X so that the intercept is computed too (more in the code).
Main.py
from sklearn.datasets import load_boston
import numpy as np
from CustomLibrary import CustomLinearRegression
from CustomLibrary import CustomMeanSquaredError
boston = load_boston()
X = np.array(boston.data, dtype="f")
Y = np.array(boston.target, dtype="f")
regression = CustomLinearRegression()
regression.fit(X, Y)
print("Projection matrix sk:", regression.coefficients, "\n")
print("bias sk:", regression.intercept, "\n")
Y_pred = regression.predict(X)
loss_sk = CustomMeanSquaredError(Y, Y_pred)
print("Model performance:")
print("--------------------------------------")
print("MSE is {}".format(loss_sk))
print("\n")
CustomLibrary.py
import numpy as np
class CustomLinearRegression():
    def __init__(self):
        self.coefficients = None
        self.intercept = None
    def fit(self, x , y):
        x = self.add_one_column(x)
        x_T = np.transpose(x)
        inverse = np.linalg.inv(np.dot(x_T, x))
        pseudo_inverse = inverse.dot(x_T)
        coef = pseudo_inverse.dot(y)
        self.intercept = coef[0]
        self.coefficients = coef[1:]
        return coef
    def add_one_column(self, x):
        '''
        fit returns one coefficient per column of its input, so to get the
        intercept as well as the feature coefficients we prepend a column
        of ones to x.
        '''
        X = np.ones(shape=(x.shape[0], x.shape[1] +1))
        X[:, 1:] = x
        return X
    def predict(self, x):
        predicted = np.array([])
        for sample in x:
            result = self.intercept
            for idx, feature_value_in_sample in enumerate(sample):
                result += feature_value_in_sample * self.coefficients[idx]
            predicted = np.append(predicted, result)
        return predicted
def CustomMeanSquaredError(Y, Y_pred):
    mse = 0
    for idx,data in enumerate(Y):
        mse += (data - Y_pred[idx])**2
    return mse * (1 / len(Y))
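The same three steps (fit, predict, mean squared error) can also be written without Python-level loops. This is only a sketch of an alternative, not a replacement for the class above: the function names are made up for illustration, and it swaps the explicit inverse for np.linalg.lstsq, which solves the same least-squares problem.

```python
import numpy as np

def fit_ols(x, y):
    """Return (intercept, coefficients) of an ordinary least-squares fit."""
    design = np.column_stack([np.ones(x.shape[0]), x])  # prepend the ones column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

def predict_ols(x, intercept, coefficients):
    """Vectorized prediction: one matrix-vector product instead of nested loops."""
    return intercept + x @ coefficients

def mean_squared_error(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# Noise-free synthetic data, purely for illustration
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = 4.0 + x_demo @ np.array([1.0, -2.0, 3.0])

intercept_demo, coef_demo = fit_ols(x_demo, y_demo)
mse_demo = mean_squared_error(y_demo, predict_ols(x_demo, intercept_demo, coef_demo))
print(intercept_demo, coef_demo, mse_demo)
```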
I want to run a kernel ridge regression in Python using sklearn.kernel_ridge.KernelRidge with a custom kernel (the Wendland kernel) that is not implemented in scikit-learn, so I have to provide a callable. (I want to avoid the 'precomputed' option in order to stay consistent with my other models.) The problem is that the callable has to return a single float, so it is called once for each pair of data points, which makes training really slow.
Looking at a similar model, SVM.SVR, one provides a callable kernel function that returns the whole kernel matrix at once, which is much faster.
So my question is whether it is possible to make KernelRidge accept a callable that returns the Gram matrix in one step, in order to speed up the process. Are there other alternatives?
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import check_pairwise_arrays, euclidean_distances
def Wendland_kernel(eps=None):
    #Kernel I want to use and am allowed to use with SVM.SVR
    def Wendland_gram_intern(X, Y=None, eps=eps):
        X, Y = check_pairwise_arrays(X,Y)
        if eps is None:
            eps = 1.0 / X.shape[1]
        K = euclidean_distances(X, Y, squared=False)
        K = 1 - eps*K
        return np.maximum(K,0)**2
    return Wendland_gram_intern
def Wendland_single(eps=None):
    #Kernel I have to use
    def Wendland_single_intern(x1, y1, eps=eps):
        K = np.linalg.norm(x1-y1)
        K = 1 - eps*K
        return np.maximum(K,0)**2
    return Wendland_single_intern
X = np.random.random((10,2))
y = np.random.normal(size=(10,))
clf = KernelRidge(kernel=Wendland_single(eps=2.5))
clf.fit(X, y)
print(clf.predict([[0.5,0.5]]))
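To make the cost difference concrete, the sketch below builds the Wendland Gram matrix in one vectorized step, checks it against the pair-by-pair version, and then solves the kernel ridge system directly in NumPy. It sidesteps KernelRidge entirely and therefore does not answer whether KernelRidge can take a whole-matrix callable; all function names here are illustrative assumptions. The same Gram matrix could also be passed to KernelRidge through the 'precomputed' option mentioned in the question.

```python
import numpy as np

def wendland_gram(X, Y=None, eps=1.0):
    """Whole Gram matrix at once: (1 - eps * ||x - y||)_+ ** 2 for all pairs."""
    Y = X if Y is None else Y
    diff = X[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.maximum(1.0 - eps * dist, 0.0) ** 2

def wendland_pair(x1, y1, eps=1.0):
    """The same kernel evaluated for a single pair of points."""
    return np.maximum(1.0 - eps * np.linalg.norm(x1 - y1), 0.0) ** 2

def kernel_ridge_fit(K, y, alpha=1.0):
    """Dual coefficients of kernel ridge regression: solve (K + alpha * I) c = y."""
    return np.linalg.solve(K + alpha * np.eye(K.shape[0]), y)

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.random((10, 2))
y_demo = rng.normal(size=10)

gram_fast = wendland_gram(X_demo, eps=2.5)
gram_slow = np.array([[wendland_pair(a, b, eps=2.5) for b in X_demo] for a in X_demo])

dual_coef = kernel_ridge_fit(gram_fast, y_demo, alpha=1.0)
X_new = np.array([[0.5, 0.5]])
prediction = wendland_gram(X_new, X_demo, eps=2.5) @ dual_coef
print(np.allclose(gram_fast, gram_slow), prediction)
```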
I am trying to implement this algorithm to find the intercept and slope for single variable:
Here is my Python code to update the intercept and slope, but it is not converging. RSS increases with each iteration rather than decreasing, and after some iterations it becomes infinite. I cannot find any error in my implementation of the algorithm. How can I solve this problem? I have attached the CSV file too.
Here is the code.
import pandas as pd
import numpy as np
#Defining gradient_decend
#This Function takes X value, Y value and vector of w0(intercept),w1(slope)
#INPUT FEATURES=X(sq.feet of house size)
#TARGET VALUE=Y (Price of House)
#W=np.array([w0,w1]).reshape(2,1)
#W=[w0,
# w1]
def gradient_decend(X,Y,W):
    intercept=W[0][0]
    slope=W[1][0]
    #Here i will get a list
    #list is like this
    #gd=[sum(predicted_value-(intercept+slope*x)),
    # sum(predicted_value-(intercept+slope*x)*x)]
    gd=[sum(y-(intercept+slope*x) for x,y in zip(X,Y)),
        sum(((y-(intercept+slope*x))*x) for x,y in zip(X,Y))]
    return np.array(gd).reshape(2,1)
#Defining Residual sum of squares
def RSS(X,Y,W):
    return sum((y-(W[0][0]+W[1][0]*x))**2 for x,y in zip(X,Y))
#Reading Training Data
training_data=pd.read_csv("kc_house_train_data.csv")
#Defining fixed parameters
#Learning Rate
n=0.0001
iteration=1500
#Intercept
w0=0
#Slope
w1=0
#Creating 2,1 vector of w0,w1 parameters
W=np.array([w0,w1]).reshape(2,1)
#Running gradient Decend
for i in range(iteration):
    W=W+((2*n)* (gradient_decend(training_data["sqft_living"],training_data["price"],W)))
    print(RSS(training_data["sqft_living"],training_data["price"],W))
Here is the CSV file.
Firstly, I find that when writing machine learning code it's best NOT to use complex list comprehensions, because anything that you can iterate over
is easier to read when written with normal loops and indentation, and/or
can be done with numpy broadcasting.
And using proper variable names can help you better understand the code. Using Xs, Ys, Ws as short hand is nice only if you're good at math. Personally, I don't use them in the code, especially when writing in python. From import this: explicit is better than implicit.
My rule of thumb is to remember that if I write code I can't read 1 week later, it's bad code.
First, let's decide what the input parameters for gradient descent are. You will need:
feature_matrix (The X matrix, type: numpy.array, a matrix of N * D size, where N is the no. of rows/datapoints and D is the no. of columns/features)
output (The Y vector, type: numpy.array, a vector of size N)
initial_weights (type: numpy.array, a vector of size D).
Additionally, to check for convergence you will need:
step_size (the magnitude of change when iterating through to change the weights; type: float, usually a small number)
tolerance (the criterion for stopping the iterations: when the gradient magnitude is smaller than tolerance, assume that your weights have converged; type: float, usually a small number but much bigger than the step size).
Now to the code.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False # Set a boolean to check for convergence
    weights = np.array(initial_weights) # make sure it's a numpy array
    while not converged:
        # compute the predictions based on feature_matrix and weights.
        # iterate through the row and find the single scalar predicted
        # value for each weight * column.
        # hint: a dot product can solve this easily
        predictions = [??? for row in feature_matrix]
        # compute the errors as predictions - output
        errors = predictions - output
        gradient_sum_squares = 0 # initialize the gradient sum of squares
        # while we haven't reached the tolerance yet, update each feature's weight
        for i in range(len(weights)): # loop over each weight
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            # compute the derivative for weight[i]:
            # Hint: the derivative is = 2 * dot product of feature_column and errors.
            derivative = 2 * ????
            # add the squared value of the derivative to the gradient magnitude (for assessing convergence)
            gradient_sum_squares += (derivative * derivative)
            # subtract the step size times the derivative from the current weight
            weights[i] -= (step_size * derivative)
        # compute the square-root of the gradient sum of squares to get the gradient magnitude:
        gradient_magnitude = ???
        # Then check whether the magnitude is lower than the tolerance.
        if ???:
            converged = True
    # Once the while loop breaks, return the weights.
    return(weights)
I hope the extended pseudo-code helps you better understand the gradient descent. I won't fill in the ??? so as to not spoil your homework.
Note that your RSS code is also unreadable and unmaintainable. It's easier to do just:
>>> import numpy as np
>>> prediction = np.array([1,2,3])
>>> output = np.array([1,1,5])
>>> residual = output - prediction
>>> RSS = sum(residual * residual)
>>> RSS
5
Going through numpy basics will go a long way to machine learning and matrix-vector manipulation without going nuts with iterations: http://docs.scipy.org/doc/numpy-1.10.1/user/basics.html
I have solved my own problem!
Here is the working solution.
import numpy as np
import pandas as pd
import math
from sys import stdout
#function Takes the pandas dataframe, Input features list and the target column name
def get_numpy_data(data, features, output):
    #Adding a constant column with value 1 in the dataframe.
    data['constant'] = 1
    #Adding the name of the constant column in the feature list.
    features = ['constant'] + features
    #Creating Feature matrix(Selecting columns and converting to matrix).
    #(.values is used because newer pandas no longer provides DataFrame.as_matrix())
    features_matrix=data[features].values
    #Target column is converted to the numpy array
    output_array=np.array(data[output])
    return(features_matrix, output_array)
def predict_outcome(feature_matrix, weights):
    weights=np.array(weights)
    predictions = np.dot(feature_matrix, weights)
    return predictions
def errors(output,predictions):
    errors=predictions-output
    return errors
def feature_derivative(errors, feature):
    derivative=np.dot(2,np.dot(feature,errors))
    return derivative
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    converged = False
    #Initital weights are converted to numpy array
    weights = np.array(initial_weights)
    while not converged:
        # compute the predictions based on feature_matrix and weights:
        predictions=predict_outcome(feature_matrix,weights)
        # compute the errors as predictions - output:
        error=errors(output,predictions)
        gradient_sum_squares = 0 # initialize the gradient
        # while not converged, update each weight individually:
        for i in range(len(weights)):
            # Recall that feature_matrix[:, i] is the feature column associated with weights[i]
            feature=feature_matrix[:, i]
            # compute the derivative for weight[i]:
            #predict=predict_outcome(feature,weights[i])
            #err=errors(output,predict)
            deriv=feature_derivative(error,feature)
            # add the squared derivative to the gradient magnitude
            gradient_sum_squares=gradient_sum_squares+(deriv**2)
            # update the weight based on step size and derivative:
            weights[i]=weights[i] - np.dot(step_size,deriv)
        gradient_magnitude = math.sqrt(gradient_sum_squares)
        stdout.write("\r%d" % int(gradient_magnitude))
        stdout.flush()
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
#Example of Implementation
#Importing Training and Testing Data
# train_data=pd.read_csv("kc_house_train_data.csv")
# test_data=pd.read_csv("kc_house_test_data.csv")
# simple_features = ['sqft_living', 'sqft_living15']
# my_output= 'price'
# (simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
# initial_weights = np.array([-100000., 1., 1.])
# step_size = 7e-12
# tolerance = 2.5e7
# simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights, step_size,tolerance)
# print(simple_weights)
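One visible difference between the two versions is the step size: 0.0001 in the question versus 7e-12 in the commented example above. The sketch below, on synthetic data with made-up values in a square-footage-like range, suggests why that matters: with a large unscaled feature, a step of 1e-4 makes RSS grow at every iteration, while standardizing the feature lets a moderate step converge. The helper name and all numbers are assumptions for illustration, not a reconstruction of the original dataset.

```python
import numpy as np

def gradient_descent_rss(x, y, step_size, iterations):
    """Plain batch gradient descent on RSS for the model y ~ w0 + w1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    weights = np.zeros(2)
    for _ in range(iterations):
        errors = X @ weights - y
        weights -= step_size * 2.0 * (X.T @ errors)   # gradient of RSS is 2 X^T errors
    rss = float(np.sum((X @ weights - y) ** 2))
    return weights, rss

# Synthetic "house size vs price" data, purely for illustration
rng = np.random.default_rng(0)
sqft = rng.uniform(500.0, 4000.0, size=200)
price = 50000.0 + 200.0 * sqft + rng.normal(scale=1000.0, size=200)
rss_start = float(np.sum(price ** 2))          # RSS with both weights at zero

# Unscaled feature with the step size from the question: RSS blows up
_, rss_raw = gradient_descent_rss(sqft, price, step_size=1e-4, iterations=5)

# Standardized feature: a moderate step size now converges
sqft_scaled = (sqft - sqft.mean()) / sqft.std()
weights_scaled, rss_scaled = gradient_descent_rss(
    sqft_scaled, price, step_size=1e-3, iterations=200
)
print(rss_raw > rss_start, rss_scaled < rss_start)
```

Note that the weights found on the standardized feature are expressed in standardized units, so they need converting back before being compared with weights fitted on raw square footage.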
It is so simple
def mean(values):
    return sum(values)/float(len(values))
def variance(values, mean):
    # sum of squared deviations (not divided by n; the factor cancels in b1)
    return sum([(x-mean)**2 for x in values])
def covariance(x, mean_x, y, mean_y):
    covar = 0.0
    for i in range(len(x)):
        covar+=(x[i]-mean_x) * (y[i]-mean_y)
    return covar
def coefficients(dataset):
    x = []
    y = []
    for line in dataset:
        xi, yi = map(float, line.split(','))
        x.append(xi)
        y.append(yi)
    dataset.close()
    x_mean, y_mean = mean(x), mean(y)
    b1 = covariance(x, x_mean, y, y_mean)/variance(x, x_mean)
    b0 = y_mean-b1*x_mean
    return [b0, b1]
dataset = open('trainingdata.txt')
b0, b1 = coefficients(dataset)
n=float(input())
print(b0+b1*n)
reference : www.machinelearningmastery.com/implement-simple-linear-regression-scratch-python/
I am planning to use Linear Regression in Spark. To get started, I checked out the example from the official documentation (which you can find here)
I also found this question on Stack Overflow, which is essentially the same as mine. The answer suggests tweaking the step size, which I also tried, but the results are still as random as they were without tweaking it. The code I'm using looks like this:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
data = sc.textFile("data/mllib/ridge-data/lpsa.data")
parsedData = data.map(parsePoint)
# Build the model
model = LinearRegressionWithSGD.train(parsedData,100000,0.01)
# Evaluate the model on training data
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
print("Mean Squared Error = " + str(MSE))
The results look as follows:
(Expected Label, Predicted Label)
(-0.4307829, -0.7824231588143065)
(-0.1625189, -0.6234287565006766)
(-0.1625189, -0.41979307020176226)
(-0.1625189, -0.6517649080382241)
(0.3715636, -0.38543073492870156)
(0.7654678, -0.7329426818746223)
(0.8544153, -0.33273378445315)
(1.2669476, -0.36663240056848917)
(1.2669476, -0.47541427992967517)
(1.2669476, -0.1887811811672498)
(1.3480731, -0.28646712964591936)
(1.446919, -0.3425075015127807)
(1.4701758, -0.14055275401870437)
(1.4929041, -0.06819303631450688)
(1.5581446, -0.772558163357755)
(1.5993876, -0.19251656391040356)
(1.6389967, -0.38105697301968594)
(1.6956156, -0.5409707504639943)
(1.7137979, 0.14914490255841997)
(1.8000583, -0.0008818203337740971)
(1.8484548, 0.06478505759587616)
(1.8946169, -0.0685096804502884)
(1.9242487, -0.14607596025743624)
(2.008214, -0.24904211817187422)
(2.0476928, -0.4686214015035236)
(2.1575593, 0.14845590638215034)
(2.1916535, -0.5140996125798528)
(2.2137539, 0.6278134417345228)
(2.2772673, -0.35049969044209983)
(2.2975726, -0.06036824276546304)
(2.3272777, -0.18585219083806218)
(2.5217206, -0.03167349168036536)
(2.5533438, -0.1611040092884861)
(2.5687881, 1.1032200139582564)
(2.6567569, 0.04975777739217784)
(2.677591, -0.01426285133724671)
(2.7180005, 0.07853368755223371)
(2.7942279, -0.4071930969456503)
(2.8063861, 0.000492545291049501)
(2.8124102, -0.019947344959659177)
(2.8419982, 0.03023139779978133)
(2.8535925, 0.5421291261646886)
(2.9204698, 0.3923068894170366)
(2.9626924, 0.21639267973240908)
(2.9626924, -0.22540434628281075)
(2.9729753, 0.2363938458250126)
(3.0130809, 0.35136961387278565)
(3.0373539, 0.013876918415846595)
(3.2752562, 0.49970959078043126)
(3.3375474, 0.5436323480304672)
(3.3928291, 0.48746004196839055)
(3.4355988, 0.3350764608584778)
(3.4578927, 0.6127634045652381)
(3.5160131, -0.03781697409079157)
(3.5307626, 0.2129806543371961)
(3.5652984, 0.5528805608876549)
(3.5876769, 0.06299042506665305)
(3.6309855, 0.5648082098866389)
(3.6800909, -0.1588172848952902)
(3.7123518, 0.1635062564072022)
(3.9843437, 0.7827244309795267)
(3.993603, 0.6049246406551748)
(4.029806, 0.06372113813964088)
(4.1295508, 0.24281029469705093)
(4.3851468, 0.5906868686740623)
(4.6844434, 0.4055055537895428)
(5.477509, 0.7335244827296759)
Mean Squared Error = 6.83550847274
So, what am I missing? Since the data is from the official spark documentation, I would guess that it should be suited to apply linear regression on it (and get at least a reasonably good prediction)?
For starters you're missing an intercept. While mean values of the independent variables are close to zero:
parsedData.map(lambda lp: lp.features).mean()
## DenseVector([-0.031, -0.0066, 0.1182, -0.0199, 0.0178, -0.0249,
##              -0.0294, 0.0669])
mean of the dependent variable is pretty far from it:
parsedData.map(lambda lp: lp.label).mean()
## 2.452345085074627
Forcing the regression line to go through the origin in a case like this doesn't make sense. So let's see how LinearRegressionWithSGD performs with default arguments and an added intercept:
model = LinearRegressionWithSGD.train(parsedData, intercept=True)
valuesAndPreds = (parsedData.map(lambda p: (p.label, model.predict(p.features))))
valuesAndPreds.map(lambda vp: (vp[0] - vp[1]) ** 2).mean()
## 0.44005904185432504
Let's compare it to the analytical solution:
import numpy as np
from sklearn import linear_model
features = np.array(parsedData.map(lambda lp: lp.features.toArray()).collect())
labels = np.array(parsedData.map(lambda lp: lp.label).collect())
lm = linear_model.LinearRegression()
lm.fit(features, labels)
np.mean((lm.predict(features) - labels) ** 2)
## 0.43919976805833411
As you can see, the results obtained using LinearRegressionWithSGD are almost optimal.
You could add a grid search, but in this particular case there is probably nothing to gain.
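The effect of the missing intercept can be reproduced without Spark. The sketch below uses synthetic data shaped like the example (67 rows, 8 centered features, label mean near 2.45; the values themselves are made up) and compares least-squares fits with and without an intercept column. With exactly centered features, the gap in MSE equals the squared mean of the labels, which is roughly the size of the gap reported above.

```python
import numpy as np

def ols_mse(X, y, fit_intercept):
    """Training MSE of a least-squares fit, with or without an intercept column."""
    design = np.column_stack([np.ones(len(y)), X]) if fit_intercept else X
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.mean((design @ coef - y) ** 2))

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
features = rng.normal(size=(67, 8))
features -= features.mean(axis=0)                 # center every feature column
true_coef = 0.3 * rng.normal(size=8)
labels = 2.45 + features @ true_coef + rng.normal(scale=0.5, size=67)

mse_with = ols_mse(features, labels, fit_intercept=True)
mse_without = ols_mse(features, labels, fit_intercept=False)
print(mse_with, mse_without, labels.mean() ** 2)
```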