Logistic Regression mean square error - python

NOTE: I appreciate the massive quantity of comments suggesting that this is an inappropriate way to quantify model performance. However, that is irrelevant to my error, and this error occurs for a variety of other metrics as well. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question".
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I call .predict_proba, I get a matrix of probabilities (one column per class), while my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)

predict_proba is predicting the probability that a sample belongs to a class. The arg max of those probabilities is the predicted class (categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))

The Brier score, essentially the mean squared error of the predicted probabilities, is a well-known and valid loss function for classification models that produce probability scores; I would take a look at that as well.
To your particular issue: you want to compare the probabilities returned for your target class, i.e. for a binary classification problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_test, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, one-hot encode your y_true (i.e. each class gets its own column) and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
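A minimal sketch of that idea for the multiclass case, assuming y_test, X_test, and your_model from the question (label_binarize is just one convenient way to get the one-hot form):
from sklearn.preprocessing import label_binarize
from sklearn.metrics import mean_squared_error
probs = your_model.predict_proba(X_test)
# one column per class, in the same order as the probability columns
y_onehot = label_binarize(y_test, classes=your_model.classes_)
mse = mean_squared_error(y_onehot, probs, multioutput='uniform_average')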

Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error
x = np.arange(10)  # example ground-truth values
y = x              # example predictions (identical here, so the RMSE is 0)
rmse = np.sqrt(mean_squared_error(x, y))

One can transform the y_test into a format compatible with the predict_proba output as follows:
import sklearn.linear_model
import sklearn.preprocessing
model = sklearn.linear_model.LogisticRegression().fit(X, y)  # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(
    label_encoder.transform(y_test).reshape((-1, 1))
)
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
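For instance, a multiclass mean-squared-error / Brier-style score could then be computed roughly like this (a sketch; OneHotEncoder returns a sparse matrix, hence .toarray(), and weights_test is the sample-weight vector from the question):
import sklearn.metrics
pred_proba = model.predict_proba(X_test)
score = sklearn.metrics.mean_squared_error(
    y_test_onehot.toarray(), pred_proba, sample_weight=weights_test
)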

Related

random forest: predict vs predict_proba

I am working on a multiclass, highly imbalanced classification problem. I use a random forest as the base classifier.
I have to report model performance on the evaluation set using multiple criteria (metrics: precision, recall, confusion matrix, roc_auc).
Model training:
rf = RandomForestClassifier()
rf.fit(train_X, train_y)
To obtain precision/recall and the confusion matrix, I do:
pred = rf.predict(test_X)
# multiclass, so precision/recall/F1 need an averaging strategy
precision = metrics.precision_score(test_y, pred, average='weighted')
recall = metrics.recall_score(test_y, pred, average='weighted')
f1_score = metrics.f1_score(test_y, pred, average='weighted')
confusion_matrix = metrics.confusion_matrix(test_y, pred)
Fine, but then computing roc_auc requires the prediction probability of classes and not the class labels. For that I must further do this:
y_prob = rf.predict_proba(test_X)
roc_auc = metrics.roc_auc_score(test_y, y_prob, multi_class='ovr')  # multiclass needs a one-vs-rest/one-vs-one strategy
But then I'm worried that the outcome produced by rf.predict() may not be consistent with rf.predict_proba(), and therefore with the roc_auc score I'm reporting. I know that calling predict several times will produce exactly the same result, but I'm concerned that predict followed by predict_proba might produce slightly different results, making it inappropriate to report them together with the metrics above.
If that is the case, is there a way to control this, making sure the class probabilities used by predict() to decide the predicted labels are exactly the same as the ones returned when I then call predict_proba?
predict_proba() and predict() are consistent with each other. In fact, predict uses predict_proba internally, as can be seen here in the source code: it simply returns the class with the highest predicted probability.
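If you want to convince yourself of this on your own data, here is a quick sanity check (a sketch assuming rf is the fitted classifier from above):
import numpy as np
probs = rf.predict_proba(test_X)
# predict() should return the class with the highest predicted probability
labels_from_probs = rf.classes_[np.argmax(probs, axis=1)]
print(np.array_equal(rf.predict(test_X), labels_from_probs))  # expected: True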

How to calculate the RMSE on Ridge regression model

I have fit a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
import numpy as np
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, and I am trying to understand whether my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable by taking the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets that sample's contribution to 1,000,000.
You may want to transform the price to log-scale with the natural logarithm (numpy.log); this is a common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), so see the available kernels for guidance. With this approach you will not get such big values (see the sketch below).
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem.
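A rough sketch of that idea, reusing the names from your code (np.log1p and np.expm1 handle the transform and its inverse safely around zero):
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error
# fit on log(1 + SalePrice); the squared error is then on the log scale
ridge.fit(X_train, np.log1p(y_train))
pred_log = ridge.predict(X_test)
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_test), pred_log))
# back-transform to the original price scale for an interpretable absolute error
mae = mean_absolute_error(y_test, np.expm1(pred_log))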

How does sklearn KNeighborsClassifier score method work?

knn.score(X_test, y_test)
Here X_test is a numpy array that contains test cases and y_test contains their correct labels.
This is the code that returns the reliability score of a model I made to differentiate between species of iris.
How does this function work? Does it predict every value from the X_test array, compare it with the y_test array, and compute the mean?
The KNeighborsClassifier is a subclass of the sklearn.base.ClassifierMixin. From the documentation of the score method:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
The source code itself for the score method:
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
It's simply a shortcut for producing predictions on the test data and computing the accuracy score against the given labels.
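You can verify the equivalence directly; a minimal sketch, assuming knn is the fitted classifier from the question:
from sklearn.metrics import accuracy_score
manual = accuracy_score(y_test, knn.predict(X_test))
print(manual == knn.score(X_test, y_test))  # expected: True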

How to compute SSE in python

I need to fit a linear model to the Wine Quality dataset and then find the MSE for each fold of a 10-fold cross-validation. Following is my code:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

regressor = LinearRegression()
regressor.fit(Features_train, Quality_train)
scores = cross_val_score(regressor, Features, Quality, cv=10,
                         scoring='mean_squared_error')
print(scores)
Problem here is that one or two values of MSE are negative. Following is the scores array:
[-0.47093648 -0.40001874 -0.46928925 -0.4317235  -0.37665658 -0.52359841
 -0.40046081 -0.42944953 -0.36179521 -0.48792052]
According to the formula, it should not be negative.
Do refer to this thread below:
scikit-learn cross validation, negative values with mean squared error
In summary, this is supposed to happen: cross_val_score treats every scoring metric as something to maximize, so loss metrics like MSE are reported with a flipped sign. The actual MSE is simply the positive version of the number you're getting.
Hope this helps!
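A minimal sketch of recovering the usual positive values from the scores array above:
import numpy as np
mse_scores = -np.array(scores)     # flip the sign back to positive MSE
rmse_scores = np.sqrt(mse_scores)  # per-fold RMSE, if that's what you need
print(mse_scores.mean(), rmse_scores.mean())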

predict continuous values using sklearn bagging classifier

Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models and then reports the majority answer. It seems like this algorithm could be used to generate a probability for each class and then report the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values and then compare performance with RMSE. An example is below:
import math
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor

trees = BaggingRegressor()
trees.fit(X_train, Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test)))
