Minimizing BER for sklearn linear models - python

I'm using sklearn's LinearSVC and LogisticRegression models for classification. I want the optimization in training to minimize BER (balanced error rate). I noticed that you can set the class_weight parameter in the models to 'balanced', which should (according to documentation) 'automatically adjust weights inversely proportional to class frequencies in the input data'. However, you can also pass sample weights as a parameter to the fit() method. I'm wondering if these are equivalent paths of achieving the same goal? Anything I should be aware of / concerned about here?
--
(Added detail below based on bot suggestion)
For example:
You could define
model = sklearn.svm.LinearSVC(class_weight = 'balanced')
and then fit
model.fit(X_train, y_train)
OR you could define
model = sklearn.svm.LinearSVC()
and then fit
wts = _array of length n (number of samples) formulated to balance the classes_
model.fit(X_train,y_train, sample_weight = wts)
...
Are these equivalent and do either/both satisfy my goal of minimizing BER?
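A minimal sketch of how the two paths can be compared empirically. It uses LogisticRegression on synthetic imbalanced data (the dataset and variable names are illustrative assumptions, not from the question), and builds the per-sample weights with compute_sample_weight("balanced", y), which should reproduce what class_weight='balanced' does internally. Note that either route only reweights the surrogate loss; neither optimizes BER directly, so BER is measured afterwards as 1 minus the balanced accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced binary problem (illustrative data only).
X, y = make_classification(n_samples=600, n_features=5,
                           weights=[0.9, 0.1], random_state=0)

# Path 1: let the estimator derive the class weights.
clf_cw = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Path 2: build equivalent per-sample weights and pass them to fit().
wts = compute_sample_weight("balanced", y)
clf_sw = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=wts)

# The fitted coefficients should agree up to solver tolerance.
same_coef = np.allclose(clf_cw.coef_, clf_sw.coef_, atol=1e-3)
print("coefficients match:", same_coef)

# BER is 1 - balanced accuracy (mean of the per-class recalls).
ber = 1.0 - balanced_accuracy_score(y, clf_cw.predict(X))
print(f"training BER: {ber:.3f}")
```

The same comparison can be tried with LinearSVC; its solver uses a random number generator, so fixing random_state is advisable before comparing coefficients.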

Related

Underfitting, Overfitting, Good_Generalization

So as a part of my assignment I'm applying linear and lasso regressions, and here's Question 7.
Based on the scores from question 6, what gamma value corresponds to a
model that is underfitting (and has the worst test set accuracy)? What
gamma value corresponds to a model that is overfitting (and has the
worst test set accuracy)? What choice of gamma would be the best
choice for a model with good generalization performance on this
dataset (high accuracy on both training and test set)?
Hint: Try plotting the scores from question 6 to visualize the
relationship between gamma and accuracy. Remember to comment out the
import matplotlib line before submission.
This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization) Please note there is only one correct solution.
I really need help, I can't really think of any way to solve this last question. What code should I use to determine (Underfitting, Overfitting, Good_Generalization) and why???
Thanks,
Data set: http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io
Here's my code from question 6:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

def answer_six():
    # SVC requires kernel='rbf', C=1, random_state=0 as instructed
    # C: Penalty parameter C of the error term
    # random_state: The seed of the pseudo random number generator
    # used when shuffling the data for probability estimates
    # The radial basis function kernel, or RBF kernel, is a popular
    # kernel function used in various kernelized learning algorithms.
    # In particular, it is commonly used in support vector machine
    # classification
    model = SVC(kernel='rbf', C=1, random_state=0)
    # np.logspace returns numbers spaced evenly on a log scale;
    # here, six gamma values from 1e-4 to 1e1
    gamma = np.logspace(-4, 1, 6)
    # Create a validation curve for the model over the gamma range.
    # Scoring uses accuracy. X_subset and y_subset come from the
    # earlier part of the assignment.
    train_scores, test_scores = validation_curve(
        model, X_subset, y_subset,
        param_name='gamma', param_range=gamma, scoring='accuracy')
    # Average the train and test scores across CV folds (axis=1)
    sc = (train_scores.mean(axis=1), test_scores.mean(axis=1))
    return sc

answer_six()
Well, make yourself familiar with overfitting. You are supposed to produce something like this: Article on this topic
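On the left you have underfitting, on the right overfitting. Where both errors are low, you have good generalisation.
These are all a function of gamma, the RBF kernel coefficient, which acts like a regularizer: a small gamma gives a smoother, simpler model and a large gamma a more flexible one.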
Overfitting means your model is wrong. If the model is wrong:
plot the data as a scatter plot
change the model, e.g. from linear to polynomial, or to a support vector machine with a suitable kernel.
Underfitting means your dataset is wrong:
add new data, ideally correlated with the target.
Check by the numbers:
compare the score/accuracy on the test set and the training set. If both are high and there is no big difference between them, you are doing well.
If the test score or the training score is low, you are facing overfitting or underfitting.
Hope this helps.
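One way to turn the "check by the numbers" advice into code is sketched below. The selection rules are assumptions chosen for illustration (lowest training score for underfitting, largest train/test gap for overfitting, highest test score for good generalization), and the score arrays are made-up placeholders standing in for the output of answer_six(); the helper name pick_gammas is hypothetical.

```python
import numpy as np

def pick_gammas(gammas, train_mean, test_mean):
    """Return (underfitting, overfitting, good_generalization) gamma values."""
    gammas = np.asarray(gammas)
    train_mean = np.asarray(train_mean)
    test_mean = np.asarray(test_mean)
    # Underfitting: the model cannot even fit the training data.
    underfit = gammas[np.argmin(train_mean)]
    # Overfitting: training score far above the test score.
    overfit = gammas[np.argmax(train_mean - test_mean)]
    # Good generalization: best score on held-out data.
    good = gammas[np.argmax(test_mean)]
    return underfit, overfit, good

gammas = np.logspace(-4, 1, 6)
# Placeholder scores for illustration only, not real results.
train_mean = np.array([0.57, 0.93, 0.99, 1.00, 1.00, 1.00])
test_mean = np.array([0.56, 0.92, 0.98, 0.99, 0.85, 0.52])

underfit, overfit, good = pick_gammas(gammas, train_mean, test_mean)
print(underfit, overfit, good)
```

Plotting both curves against gamma on a log axis, as the assignment hint suggests, is still the best way to sanity-check whatever rule is used.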

Logistic Regression mean square error

NOTE: I appreciate the massive quantity of comments suggesting that this is inappropriate to quantify model performance. However, this is irrelevant to my error, and this error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question"
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I .predict_proba, I get a vector of probabilities. However, my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba is predicting the probability that a sample belongs to a class. The arg max of those probabilities is the predicted class (categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, which is essentially the mean squared error between the predicted probabilities and the observed outcomes, is a known and valid loss function for classification models that produce probability scores; I would take a look at that as well.
To your particular issue, you want to compare the probabilities returned for your target class, i.e. for a binary class problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_true, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, encode your y_true categorically, i.e. each class gets its own column, and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error
x = np.arange(10)
y = x
rmse = np.sqrt(mean_squared_error(x, y))
One can transform the y_test into a format compatible with the predict_proba output as follows:
model = sklearn.linear_model.LogisticRegression().fit(X,y) # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(label_encoder.transform(y_test).reshape((-1,1)))
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
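As a rough end-to-end sketch of the approach above, the snippet below one-hot encodes the test labels so they line up with the predict_proba columns and then computes an RMSE. It uses label_binarize instead of the LabelEncoder/OneHotEncoder pair, and the iris data as a stand-in dataset; both are choices made for this illustration. Note that label_binarize returns a single column for binary problems, so this sketch assumes three or more classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)          # shape (n_samples, n_classes)

# One column per class, in the same order as the predict_proba columns.
y_test_onehot = label_binarize(y_test, classes=model.classes_)

# MSE averaged over classes, then the square root for RMSE.
rmse = np.sqrt(mean_squared_error(y_test_onehot, proba))
print(f"RMSE on probabilities: {rmse:.3f}")
```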

scikit-learn classification on soft labels

According to the documentation it is possible to specify different loss functions to SGDClassifier. And as far as I understand log loss is a cross-entropy loss function which theoretically can handle soft labels, i.e. labels given as some probabilities [0,1].
The question is: is it possible to use SGDClassifier with the log loss function out of the box for classification problems with soft labels? And if not, how can this task (linear classification on soft labels) be solved using scikit-learn?
UPDATE:
Because of the way the target is labeled and the nature of the problem, hard labels don't give good results. But it is still a classification problem (not regression) and I want to keep the probabilistic interpretation of the prediction, so regression doesn't work out of the box either. The cross-entropy loss function can handle soft labels in the target naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
So the question is probably:
How do I specify my own loss function for SGDClassifier, for example? It seems scikit-learn doesn't stick to a modular approach here, and changes would need to be made somewhere inside its sources.
I recently had this problem and came up with a nice fix that seems to work.
Basically, transform your targets to log-odds-ratio space using the inverse sigmoid function. Then fit a linear regression. Then, to do inference, take the sigmoid of the predictions from the linear regression model.
So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to say [1e-8, 1 - 1e-8] to avoid instability issues when we take logs).
We take the inverse sigmoid, then we fit a linear regression (assuming predictor variables are in matrix X):
import numpy as np
y = np.clip(y, 1e-8, 1 - 1e-8) # numerical stability
inv_sig_y = np.log(y / (1 - y)) # transform to log-odds-ratio space
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X, inv_sig_y)
Then to make predictions:
def sigmoid(x):
    # Equivalent to exp(x) / (1 + exp(x)), but avoids inf / inf for large x
    return 1 / (1 + np.exp(-x))
preds = sigmoid(lr.predict(X_new))
This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.
Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
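For a self-contained check of the recipe above, here is a small sketch on synthetic data. The data, the weight vector true_w, and the use of scipy.special.logit/expit in place of the hand-written transforms are assumptions made for this illustration. Because the soft labels are generated exactly from a linear model in log-odds space, the regression should recover them almost perfectly; real soft labels will be noisier.

```python
import numpy as np
from scipy.special import expit, logit
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])      # hypothetical ground-truth weights
y_soft = expit(X @ true_w)               # soft labels in (0, 1)

# Clamp, move to log-odds space, and fit an ordinary linear regression.
y_clipped = np.clip(y_soft, 1e-8, 1 - 1e-8)
lr = LinearRegression().fit(X, logit(y_clipped))

# Map predictions back to probabilities with the sigmoid.
preds = expit(lr.predict(X))
print("max abs error:", np.abs(preds - y_soft).max())
```

Note that least squares in log-odds space is not the same objective as cross-entropy on the soft labels, so the fit can differ from what a true soft-label logistic regression would give.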
According to the docs,
The ‘log’ loss gives logistic regression, a probabilistic classifier.
In general a loss function is of the form Loss( prediction, target ), where prediction is the model's output, and target is the ground-truth value. In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
So in answer to your question, it depends on if you are referring to the prediction or target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for prediction and by the data on hand for target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1), and this would have to be handled.
For those interested, I've implemented a custom class that behaves like a normal classifier, but takes any regressor in the constructor to perform the transformation suggested by #nlml:
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_array
from scipy.special import softmax
import numpy as np

def _log_odds_ratio_scale(X):
    X = np.clip(X, 1e-8, 1 - 1e-8) # numerical stability
    X = np.log(X / (1 - X)) # transform to log-odds-ratio space
    return X

class FuzzyTargetClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, regressor):
        '''
        Fits regressor in the log odds ratio space (inverse crossentropy) of target variable.
        During prediction, rescales back to probability space with softmax function.

        Parameters
        ---------
        regressor: Sklearn Regressor
            base regressor to fit log odds ratio space. Any valid sklearn regressor can be used here.
        '''
        self.regressor = regressor

    def fit(self, X, y=None, **kwargs):
        # ensure passed y is one-hot-encoded-like (one column per class)
        y = check_array(y, accept_sparse=True, dtype='numeric', ensure_min_features=1)
        # one independent copy of the regressor per class column
        self.regressors_ = [clone(self.regressor) for _ in range(y.shape[1])]
        for i in range(y.shape[1]):
            self._fit_single_regressor(self.regressors_[i], X, y[:, i], **kwargs)
        return self

    def _fit_single_regressor(self, regressor, X, ysub, **kwargs):
        ysub = _log_odds_ratio_scale(ysub)
        regressor.fit(X, ysub, **kwargs)
        return regressor

    def decision_function(self, X):
        # stack per-class log-odds predictions into shape (n_samples, n_classes)
        all_results = []
        for reg in self.regressors_:
            results = reg.predict(X)
            if results.ndim < 2:
                results = results.reshape(-1, 1)
            all_results.append(results)
        results = np.hstack(all_results)
        return results

    def predict_proba(self, X):
        results = self.decision_function(X)
        results = softmax(results, axis=1)
        return results

    def predict(self, X):
        # returns the column index of the highest-scoring class
        results = self.decision_function(X)
        results = results.argmax(1)
        return results

predict continuous values using sklearn bagging classifier

Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models, then reports the majority answer. It seems like this algorithm could be used to generate probability functions for each classification then reporting the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
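To make the predict_proba output described above concrete, here is a small runnable sketch on synthetic three-class data. The dataset and the reduced n_estimators values are assumptions chosen only to keep the example quick; they are not part of the original question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier

# Synthetic 3-class problem (illustrative data only).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

trees = BaggingClassifier(ExtraTreesClassifier(n_estimators=20, random_state=0),
                          n_estimators=5, random_state=0)
trees.fit(X, y)

# One row per sample, one column per class; values are averaged over
# the bagged estimators, so each row is a probability distribution.
proba = trees.predict_proba(X[:10])
print(proba.shape)
print(proba.sum(axis=1))
```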
I typically use BaggingRegressor() for continuous values, and then compare performance with RMSE. Example below:
import math
from sklearn import metrics
from sklearn.ensemble import BaggingRegressor
trees = BaggingRegressor()
trees.fit(X_train,Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test)))

Scikit Learn: Logistic Regression model coefficients: Clarification

I need to know how to return the logistic regression coefficients in such a manner that I can generate the predicted probabilities myself.
My code looks like this:
lr = LogisticRegression()
lr.fit(training_data, binary_labels)
# Generate probabilities automatically
predicted_probs = lr.predict_proba(training_data)
I had assumed the lr.coef_ values would follow typical logistic regression, so that I could return the predicted probabilities like this:
sigmoid( dot([val1, val2, offset], lr.coef_.T) )
But this is not the appropriate formulation. Does anyone have the proper format for generating predicted probabilities from Scikit Learn LogisticRegression?
Thanks!
Take a look at the documentation (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html): the offset (intercept) coefficient isn't stored in lr.coef_
coef_ array, shape = [n_classes-1, n_features] Coefficient of the
features in the decision function. coef_ is readonly property derived
from raw_coef_ that follows the internal memory layout of liblinear.
intercept_ array, shape = [n_classes-1] Intercept (a.k.a. bias) added
to the decision function. It is available only when parameter
intercept is set to True.
Try:
sigmoid( dot([val1, val2], lr.coef_.T) + lr.intercept_ )
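The formula above can be checked against predict_proba directly. The sketch below does so for a binary problem on synthetic data (the dataset is an assumption for illustration) and uses scipy.special.expit as the sigmoid. It applies to the binary case only; multiclass models need a softmax over the per-class scores instead.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (illustrative data only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
lr = LogisticRegression().fit(X, y)

# coef_ has shape (1, n_features), so transpose it before the dot product,
# then add the intercept and squash with the sigmoid.
manual = expit(X @ lr.coef_.T + lr.intercept_).ravel()

# Column 1 of predict_proba is the probability of the positive class.
builtin = lr.predict_proba(X)[:, 1]

print(np.allclose(manual, builtin))
```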
The easiest way is to read the coef_ attribute of the fitted LR classifier.
For the definition of coef_, check the scikit-learn documentation.
See the example:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(x_train,y_train)
weight = clf.coef_
