How to compute SSE in Python

I need to fit a linear model to the Wine Quality dataset and then find the MSE for each k-fold. Here is my code:
regressor = LinearRegression()
regressor.fit(Features_train, Quality_train)
scores = cross_val_score(regressor, Features, Quality, cv=10,
                         scoring='mean_squared_error')
print(scores)
The problem here is that the MSE values come out negative. Here is the scores array:
[-0.47093648 -0.40001874 -0.46928925 -0.4317235  -0.37665658 -0.52359841
 -0.40046081 -0.42944953 -0.36179521 -0.48792052]
According to the formula, MSE should not be negative.

Refer to this thread:
scikit-learn cross validation, negative values with mean squared error
In summary, this is supposed to happen: scikit-learn's scorers follow a "greater is better" convention, so error metrics are reported negated. The actual MSE is simply the positive version of the number you're getting.
Hope this helps!
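For completeness, here is a minimal sketch of recovering the positive per-fold MSE, assuming a recent scikit-learn where the scorer is named 'neg_mean_squared_error' (Features and Quality stand in for the question's data):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# cross_val_score follows the "greater is better" convention, so the scorer
# returns the negated MSE for each of the 10 folds.
neg_mse = cross_val_score(LinearRegression(), Features, Quality,
                          cv=10, scoring='neg_mean_squared_error')
mse_per_fold = -neg_mse        # flip the sign to get the usual (positive) MSE
print(mse_per_fold)
print(mse_per_fold.mean())     # average MSE over the folds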

Related

GridSearchCV Returns WORST Possible Parameter (Ridge & Lasso Regression)

Problem: Scikit-learn's GridSearchCV is returning the parameter which results in the worst score (Root MSE) rather than the best.
I think the problem may be that I am not using train_test_split to create a hold-out test set, because it is time-series data and I do not want to disrupt the time order. Another possible cause is that I have over 7,000 features but only 50 observations. Clarification from anyone who knows whether these could be the problems, and what I might do to remedy them, would be greatly appreciated.
I start with the following code (and have imported Ridge, GridSearchCV, make_pipeline, TimeSeriesSplit, numpy, pandas, etc.):
ridge_pipe = make_pipeline(Ridge(random_state=42, max_iter=100000))
tscv = TimeSeriesSplit(n_splits=5)
param_grid = {'ridge__alpha': np.logspace(1e-300, 1e-1, 500)}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1.2589254117941673}
-4.067235334106922
Skeptical that this would be the best Root MSE, I next tried finding the score when considering an alpha value of 1e-300 alone:
param_grid = {'ridge__alpha': [1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
It gives me this output:
{'ridge__alpha': 1e-300}
-2.0906161667718835e-13
Clearly, then, an alpha value of 1e-300 gives a better Root MSE (approx. -2e-13) than an alpha value of 1e-1 (approx. -4), since - as I understand it - a negative Root MSE from GridSearchCV means the same thing as a positive Root MSE in any other context: a Root MSE of -2e-13 is really 2e-13, -4 is really 4, and the lower the Root MSE the better.
To see if np.logspace could be the culprit, I instead provide just a list of values:
param_grid = {'ridge__alpha': [1e-1, 1e-50, 1e-60, 1e-70, 1e-80, 1e-90, 1e-100, 1e-300]}
grid = GridSearchCV(ridge_pipe, param_grid, cv=tscv,
                    scoring='neg_root_mean_squared_error', n_jobs=-1)
grid.fit(news_df, y_battles)
print(grid.best_params_)
print(grid.score(news_df, y_battles))
And the output shows the same problem:
{'ridge__alpha': 0.1}
-2.0419740158869386
And I don't think it's because I'm using TimeSeriesSplit, because I have tried using cv=5 instead of cv=tscv inside GridSearchCV() and it results in the same problem.
The same issue happens when I try Lasso instead of Ridge. Any thoughts?
This appears to be fine. The problem is that you're comparing the final outputs on the same dataset that the best_estimator_ was trained on (search's method score delegates to the score method of search.best_estimator_, which is the model using best hyperparameters refitted on the entire training set); but the grid search is selecting based on cross-validated scores, which are a better indicator for future performance.
Specifically, with alpha=1e-300 (practically zero), the model overfits badly to the training data, and so the rmse on that training data is very small (2e-13). Meanwhile, with alpha=1.26, the model performs worse on the training data (rmse 4), but performs better on unseen data. You can see those cross-validation scores in the grid search's attribute cv_results_.
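As an illustration, here is a minimal sketch of how one could inspect those cross-validated scores, assuming the grid, news_df and y_battles objects from the question (the column names are standard GridSearchCV output):
import pandas as pd

# Cross-validated (negated) RMSE for every alpha the grid search tried.
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['param_ridge__alpha', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())

# best_score_ is the cross-validated score of the winning alpha; compare that
# across settings rather than grid.score(news_df, y_battles), which is
# measured on the same data the refitted model was trained on.
print(grid.best_score_)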

How to calculate the RMSE on Ridge regression model

I have fit a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as follows:
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, and I am trying to understand whether my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable when you take the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price skyrockets that term to 1000000.
You may want to transform the price to log scale with the natural logarithm (numpy.log); this is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), so see the available kernels for guidance. With this approach you will not get such big values.
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem; a sketch of both suggestions follows below.
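This is only a sketch, assuming the train DataFrame from the question contains numeric features; the deprecated normalize argument is left out:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Model the log of the price so that huge absolute errors do not dominate.
y_log = np.log(train['SalePrice'])
X = train.drop('SalePrice', axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.30)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
pred = ridge.predict(X_test)

rmse_log = np.sqrt(mean_squared_error(y_test, pred))       # RMSE on the log scale
mae = mean_absolute_error(np.exp(y_test), np.exp(pred))    # MAE back on the price scale
print(rmse_log, mae)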

Logistic Regression mean square error

NOTE: I appreciate the massive quantity of comments suggesting that this is an inappropriate way to quantify model performance. However, this is irrelevant to my error, and this error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question".
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I call .predict_proba, I get an array of per-class probabilities, while my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba is predicting the probability that a sample belongs to a class. The arg max of those probabilities is the predicted class (categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, which is basically the mean squared error of the predicted probabilities, is a known and valid loss function for classification models that produce probability scores; I would take a look at that as well.
To your particular issue, you want to compare the probabilities returned for your target class, i.e. for a binary class problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_true, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, encode your y_true categorically, i.e. each class gets its own column, and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error

x = np.arange(10)  # ground-truth values
y = x              # predictions (identical here, so the RMSE is 0)
rmse = np.sqrt(mean_squared_error(x, y))
One can transform the y_test into a format compatible with the predict_proba output as follows:
import sklearn.linear_model
import sklearn.preprocessing

model = sklearn.linear_model.LogisticRegression().fit(X, y)  # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(
    label_encoder.transform(y_test).reshape((-1, 1)))
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
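For instance, here is a minimal sketch of computing a mean squared error between the one-hot encoded labels and the predicted probabilities, reusing model, X_test, y_test and y_test_onehot from the snippet above (whether to call this a multiclass Brier score is a matter of convention):
import numpy as np
import sklearn.metrics

pred_proba = model.predict_proba(X_test)

# Compare the (n_samples, n_classes) probability matrix with the one-hot
# encoded labels, averaging the squared error over all class columns.
mse = sklearn.metrics.mean_squared_error(
    y_test_onehot.toarray(), pred_proba, multioutput='uniform_average')
rmse = np.sqrt(mse)
print(mse, rmse)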

GridSearchCV.best_score not same as cross_val_score(GridSearchCV.best_estimator_)

Consider the following grid search:
grid = GridSearchCV(clf, parameters, n_jobs=-1, iid=True, cv=5)
grid_fit = grid.fit(X_train1, y_train1)
According to sklearn's documentation, grid_fit.best_score_ returns "The mean cross-validated score of the best_estimator".
To me that would mean that the average of:
cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=5)
should be exactly the same as:
grid_fit.best_score_.
However, I am getting a 10% difference between the two numbers. What am I missing?
I am using the gridsearch on proprietary data so I am hoping somebody has run into something similar in the past and can guide me without a fully reproducible example. I will try to reproduce this with the Iris dataset if it's not clear enough...
When an integer is passed to the cv parameter, as in GridSearchCV(..., cv=int_number), StratifiedKFold will be used for the cross-validation splitting, so the data set will be split by StratifiedKFold. This might affect the accuracy and therefore the best score; one way to rule this out is sketched below.
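The sketch passes the same explicit splitter object to both GridSearchCV and cross_val_score so that both see identical folds; it assumes the clf, parameters, X_train1 and y_train1 objects from the question:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# One fixed splitter shared by both evaluations.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(clf, parameters, n_jobs=-1, cv=skf)
grid_fit = grid.fit(X_train1, y_train1)

scores = cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=skf)
print(grid_fit.best_score_, scores.mean())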

auc score in scikit-learn for non-binary classifiers

I want to calculate the roc_auc for different classifiers. Some are not binary classifiers. Here is a portion of the code I used:
if hasattr(clf, "decision_function"):
    y_score = clf.fit(X_train, y_train).decision_function(X_test)
else:
    y_score = clf.fit(X_train, y_train).predict_proba(X_test)
AUC = roc_auc_score(y_test, y_score)
However, I get an error for some classifiers (Nearest Neighbors, for example):
ValueError: bad input shape
Just a remark, I used: y_score = clf.fit(X_train, y_train).predict_proba(X_test), but I don't really know if it's correct to use it.
Okay, so first things first:
clf.fit(X_train, y_train)
That will fit your model to your training data, the first parameter being the features and the second being the target. Okay, nicely done.
After fitting, you can apply ".predict" or ".predict_proba" on another dataset to get a prediction of its results, or you can do both fit and predict at the same time, as you did below:
clf.fit(X_train, y_train).predict_proba(X_test)
Now those are your predictions, not your score.
Your score will be a function of the prediction and the true value (y_test).
You can use different score metrics depending on the kind of problem you have, such as accuracy, precision, recall, f1, etc. (read more at http://scikit-learn.org/stable/modules/model_evaluation.html).
Now, roc_auc_score is one of those metrics, but you have to watch what you input to that function, otherwise it won't work. As explained on the roc_auc_score page (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score), the parameters should be:
y_true: True binary labels in binary label indicators.
y_score : Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
So, if you have class labels or multilabels for y_true, the function won't work; it has to be binary.
y_score, on the other hand, can be either binary or probabilities (in the range [0, 1]).
hope that helps!
Edit: if you have a multiclass problem, what you can do is tackle the classes one at a time. That way it becomes many binary problems/models: try building a model to predict whether it is class A or not and draw the ROC curve for it, then move on to the next class and build another model, and so on. A sketch of this one-vs-rest approach is given below.
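This is a minimal sketch only, assuming a multiclass clf with predict_proba and the train/test variables from the question; label_binarize gives each class its own column, turning the task into one binary problem per class:
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)                  # shape (n_samples, n_classes)
y_test_bin = label_binarize(y_test, classes=clf.classes_)

# One AUC per class: column k of the binarized labels vs the probability of class k.
for k, cls in enumerate(clf.classes_):
    print(cls, roc_auc_score(y_test_bin[:, k], y_proba[:, k]))

# Recent scikit-learn versions can also average over classes directly.
print(roc_auc_score(y_test, y_proba, multi_class='ovr'))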
