post-process cross-validated prediction before scoring - python

I have a regression problem where I am cross-validating the results and evaluating the performance. I know beforehand that the ground truth cannot be smaller than zero. Therefore, I would like to intercept the predictions before they are fed to the scoring metric and clip any negative predictions to zero. I thought the make_scorer function would be useful for this. Is it possible to somehow post-process the predictions after cross-validation, but before applying an evaluation metric to them?
from sklearn.metrics import mean_squared_error, r2_score, make_scorer
from sklearn.model_selection import cross_validate
# X = Stacked feature vectors
# y = ground truth vector
# regr = some regression estimator
#### How to indicate that the predictions need post-processing
#### before applying the score function???
scoring = {'r2': make_scorer(r2_score),
           'neg_mse': make_scorer(mean_squared_error)}
scores = cross_validate(regr, X, y, scoring=scoring, cv=10)
PS: I know there are constrained estimators, but I wanted to see how a heuristic approach like this would perform.

One thing you can do is wrap those scorers you're looking to use (r2_score, mean_squared_error) in a custom scorer function using make_scorer() as you suggested.
Take a look at this part of the sklearn documentation and this Stack Overflow post for some examples. In particular, your function can do something like this:
import numpy as np

def clipped_r2(y_true, y_pred):
    y_pred_clipped = np.clip(y_pred, 0, None)
    return r2_score(y_true, y_pred_clipped)

def clipped_mse(y_true, y_pred):
    y_pred_clipped = np.clip(y_pred, 0, None)
    return mean_squared_error(y_true, y_pred_clipped)
This allows you to do the post-processing right within the scorer before calling the scoring function (in this case r2_score or mean_squared_error). Then, to use it, just call make_scorer as you were doing above, setting greater_is_better according to whether you are wrapping a score function (like r2_score, where greater is better) or a loss function (like mean_squared_error, where lower is better):
scoring = {'r2': make_scorer(clipped_r2, greater_is_better=True),
           'neg_mse': make_scorer(clipped_mse, greater_is_better=False)}
scores = cross_validate(regr, X, y, scoring=scoring, cv=10)
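As a side note, the per-fold results then show up in the dict returned by cross_validate under 'test_<name>' keys; a minimal sketch of how you might inspect them, assuming the setup above:

import numpy as np

# Each entry in `scoring` appears in the results as 'test_<name>'
print(scores['test_r2'].mean(), scores['test_r2'].std())

# Because greater_is_better=False negates the loss inside make_scorer,
# 'test_neg_mse' holds negative MSE values; flip the sign to report MSE
print(-scores['test_neg_mse'].mean())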

Related

How to get the coefficients in Lasso Regression at every split while performing 10 fold cross validation?

I am doing a randomized search (RandomizedSearchCV) to find the alpha value in Lasso regression, and I am performing 10-fold cross-validation. Is there a way to get the coefficient values for every split, just like we get the scores via the cv_results_ attribute?
There is no direct way to do this via RandomizedSearchCV. But you can work around this by defining your own class that e.g. prints the coefficients to the console when the predict function is called:
from sklearn.linear_model import Lasso
class MyLasso(Lasso):
    def predict(self, X):
        print(self.coef_)
        return super().predict(X)
MyLasso behaves the same as Lasso and can be used as usual:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
X, y = make_regression(n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
param_distributions = {'alpha': [0.01, 0.1, 1]}
rs = RandomizedSearchCV(
    MyLasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42
)
rs.fit(X_train, y_train)
Output for the example above (three iterations of 2-fold cross-validation gives six results):
[64.57650818 98.64237403 57.07123743 60.56898095 35.59985227]
[64.57001187 98.63679695 57.06557977 60.56304163 35.59888746]
[64.43774582 98.55938568 57.01219706 60.49221968 35.51151313]
[64.37690435 98.49805298 56.95345309 60.43375789 35.5018112 ]
[63.05012223 97.72950224 56.42179336 59.72460697 34.62812171]
[62.44582912 97.11061327 55.83218634 59.14092054 34.53104869]
It seemed to me that saving the coefficients as additional scores would be slicker than modifying the estimator itself as in #afsharov's answer. Defining a scorer and passing it to the search as
def coefs_scorer(estimator, X, y):
    return estimator.coef_

rs = RandomizedSearchCV(
    ...
    scoring={'r2': 'r2', 'coefs': coefs_scorer},
    refit='r2',
)
fails because there's a check that scorers return single numbers. So you need to unpack the coefficients, and I ended up with this:
from functools import partial

def coefs_scorer(estimator, X, y, i):
    return estimator.coef_[i]

scoring = {'r2': 'r2'}
for i in range(X_train.shape[1]):
    scoring[f'coef{i}'] = partial(coefs_scorer, i=i)

param_distributions = {'alpha': [0.01, 0.1, 1]}
rs = RandomizedSearchCV(
    Lasso(),
    param_distributions=param_distributions,
    cv=2,
    n_iter=3,
    random_state=42,
    scoring=scoring,
    refit='r2',
)
Note that with multiple metrics you need to specify which to use for refitting. Because of all the additional work, I'm not so sure this is better than the custom class. It does have a few advantages though:
- If you wanted to pickle the best estimator, you don't need to package-ize the custom class.
- The scores are programmatically saved rather than just printed.
- Since they're scores, you get the average and standard deviation of the coefficients across folds stored in cv_results_ (see the sketch below the disadvantages; of course, calculating them yourself wouldn't be difficult).
Disadvantages:
- We had to specify a metric per feature. It's ugly, but worse, it assumes you know the number of features in advance (it would fail if your estimator were a pipeline with a feature selection or certain feature engineering steps).
- If you return train scores, you'll duplicate the coefficients in cv_results_.
- These aren't actually scores, so semantically this is hacky.
- The scorer assumes that coef_ exists and is one-dimensional.
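To illustrate the cv_results_ point from the advantages above, here is a short sketch of how the per-fold coefficients could be pulled out after fitting (key names follow the scorer names defined in the loop above):

rs.fit(X_train, y_train)

# Multi-metric scorers show up as 'mean_test_<name>' / 'std_test_<name>',
# with one array entry per parameter candidate
for i in range(X_train.shape[1]):
    print(f"coef{i}:",
          rs.cv_results_[f'mean_test_coef{i}'],
          "+/-",
          rs.cv_results_[f'std_test_coef{i}'])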

How to use log_loss scorer in gridsearchcv?

Is it possible to use the log_loss metric in GridSearchCV?
I have seen a few posts where people mention neg_log_loss. Is it the same as log_loss? If not, is it possible to use log_loss directly in GridSearchCV?
As stated in the documentation, scoring may take different inputs: string, callable, list/tuple, dict or None. If you use strings, you can find a list of possible entries here.
There, as the string representative for log loss, you find "neg_log_loss", i.e. the negative log loss, which is simply the log loss multiplied by -1. This is an easy way to turn the minimization problem (you want the minimum log loss) into the maximization problem that GridSearchCV expects (it requires a score, not a loss): the minimum log loss is equivalent to the maximum negative log loss.
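For example, using the string alias directly (same toy setup as the snippet below):

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC(gamma="scale", probability=True)

# "neg_log_loss" is the built-in string name for the negated log loss
clf = GridSearchCV(svc, parameters, cv=5, scoring="neg_log_loss")
clf.fit(iris.data, iris.target)
print(clf.best_score_)  # negative log loss, so closer to 0 is better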
If instead you want to directly pass a log loss function to the GridSearchCV, you just have to create a scorer from the Scikit-learn log_loss function by using make_scorer:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import log_loss, make_scorer
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC(gamma="scale", probability=True)
LogLoss = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
clf = GridSearchCV(svc, parameters, cv=5, scoring=LogLoss)
clf.fit(iris.data, iris.target)
print(clf.best_score_, clf.best_estimator_)
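One caveat: if I remember correctly, recent scikit-learn versions (1.4+) deprecate the needs_proba flag of make_scorer in favour of response_method, so on those versions the scorer would be built roughly like this:

# Equivalent scorer on newer scikit-learn, where needs_proba is deprecated
LogLoss = make_scorer(log_loss, greater_is_better=False,
                      response_method="predict_proba")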

Scoring strategy of sklearn.model_selection.GridSearchCV for LatentDirichletAllocation

I am trying to apply GridSearchCV on the LatentDirichletAllocation using the sklearn library.
The current pipeline looks like this:
vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,
                             stop_words='english',
                             lowercase=True,
                             token_pattern='[a-zA-Z0-9]{3,}'
                             )
data_vectorized = vectorizer.fit_transform(doc_clean)  # where doc_clean is processed text

lda_model = LatentDirichletAllocation(n_components=number_of_topics,
                                      max_iter=10,
                                      learning_method='online',
                                      random_state=100,
                                      batch_size=128,
                                      evaluate_every=-1,
                                      n_jobs=-1,
                                      )
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}
model = GridSearchCV(lda_model, param_grid=search_params)
model.fit(data_vectorized)
Currently, GridSearchCV uses the approximate log-likelihood as the score to determine which is the best model. What I would like to do is change my scoring method to be based on the approximate perplexity of the model instead.
According to sklearn's documentation of GridSearchCV, there is a scoring argument that I can use. However, I do not know how to apply perplexity as a scoring method, and I cannot find any examples online of people applying it. Is this possible?
By default, GridSearchCV will use the score() function of the final estimator in the pipeline.
make_scorer can be used here, but for calculating perplexity you will need other data from the fitted model as well, which could be a little complex to provide through make_scorer.
You can make a wrapper over your LDA, in which you re-implement the score() function to return the perplexity. Something along the lines of:
class MyLDAWithPerplexityScorer(LatentDirichletAllocation):
    def score(self, X, y=None):
        # You can change the options passed to perplexity here
        score = super(MyLDAWithPerplexityScorer, self).perplexity(X, sub_sampling=False)
        # Since lower perplexity is better, return its negative
        return -1 * score
And then can use this in place of LatentDirichletAllocation in your code like:
...
...
...
lda_model = MyLDAWithPerplexityScorer(n_components=number_of_topics,
                                      ....
                                      ....
                                      n_jobs=-1,
                                      )
...
...
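A slightly more explicit sketch of how the wrapper plugs into the grid search, reusing data_vectorized and the parameter grid from the question (my own arrangement, not part of the original answer):

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# score() now returns negative perplexity, so GridSearchCV picks the
# parameter combination with the lowest perplexity
model = GridSearchCV(MyLDAWithPerplexityScorer(max_iter=10,
                                               learning_method='online',
                                               random_state=100),
                     param_grid=search_params)
model.fit(data_vectorized)
print(model.best_params_, -model.best_score_)  # flip the sign to get perplexity back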
The score and perplexity methods seem to be buggy and dependent on the number of topics; as a result, the grid search will tend to favour the lowest number of topics. See this GitHub issue.

f1_score metric in lightgbm

I want to train a lgb model with a custom metric: f1_score with weighted average.
I went through the advanced examples of lightgbm over here and found the implementation of a custom binary error function. I implemented a similar function to return the f1_score, as shown below.
def f1_metric(preds, train_data):
    labels = train_data.get_label()
    return 'f1', f1_score(labels, preds, average='weighted'), True
I tried to train the model by passing feval parameter as f1_metric as shown below.
evals_results = {}
bst = lgb.train(params,
                dtrain,
                valid_sets=[dvalid],
                valid_names=['valid'],
                evals_result=evals_results,
                num_boost_round=num_boost_round,
                early_stopping_rounds=early_stopping_rounds,
                verbose_eval=25,
                feval=f1_metric)
Then I am getting ValueError: Found input variables with inconsistent numbers of samples:
The training set is being passed to the function rather than the validation set.
How can I configure such that the validation set is passed and f1_score is returned?
The docs are a bit confusing. When describing the signature of the function that you pass to feval, they call its parameters preds and train_data, which is a bit misleading.
But the following seems to work:
import numpy as np
from sklearn.metrics import f1_score

def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikit-learn's f1_score doesn't like probabilities
    return 'f1', f1_score(y_true, y_hat), True
evals_result = {}
clf = lgb.train(param, train_data, valid_sets=[val_data, train_data], valid_names=['val', 'train'], feval=lgb_f1_score, evals_result=evals_result)
lgb.plot_metric(evals_result, metric='f1')
To use more than one custom metric, define one overall custom metrics function just like above, in which you calculate all metrics and return a list of tuples.
Edit: Fixed the code; since for F1 bigger is better, the last element of the returned tuple should of course be set to True.
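To expand on the multiple-metrics note above, here is a hedged sketch of a single feval returning several (name, value, is_higher_better) tuples; the accuracy metric is added purely as an illustration:

import numpy as np
from sklearn.metrics import f1_score, accuracy_score

def lgb_f1_and_acc(y_hat, data):
    # One feval can report several metrics by returning a list of
    # (eval_name, eval_result, is_higher_better) tuples
    y_true = data.get_label()
    y_pred = np.where(y_hat < 0.5, 0, 1)
    return [('f1', f1_score(y_true, y_pred), True),
            ('acc', accuracy_score(y_true, y_pred), True)]

# used exactly like before: lgb.train(..., feval=lgb_f1_and_acc)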
Regarding Toby's answer:
def lgb_f1_score(y_hat, data):
    y_true = data.get_label()
    y_hat = np.round(y_hat)  # scikits f1 doesn't like probabilities
    return 'f1', f1_score(y_true, y_hat), True
I suggest changing the y_hat part to this:
y_hat = np.where(y_hat < 0.5, 0, 1)
Reason:
I used y_hat = np.round(y_hat) and found out that during training the lightgbm model will sometimes (very unlikely, but still a chance) regard our y prediction as multiclass instead of binary.
My speculation:
Sometimes the y prediction will be small enough or high enough to be rounded to a negative value or to 2? I'm not sure, but when I changed the code to use np.where, the bug was gone.
It cost me a morning to figure out this bug, although I'm not really sure whether the np.where solution is good.

scikit-learn classification on soft labels

According to the documentation, it is possible to specify different loss functions for SGDClassifier. And as far as I understand, log loss is a cross-entropy loss function which theoretically can handle soft labels, i.e. labels given as probabilities in [0, 1].
The question is: is it possible to use SGDClassifier with the log loss function out of the box for classification problems with soft labels? And if not, how can this task (linear classification on soft labels) be solved using scikit-learn?
UPDATE:
Because of the way the target is labeled and the nature of the problem, hard labels don't give good results. But it is still a classification problem (not regression), and I want to keep the probabilistic interpretation of the prediction, so regression doesn't work out of the box either. The cross-entropy loss function can handle soft labels in the target naturally. It seems that all loss functions for linear classifiers in scikit-learn can only handle hard labels.
So the question is probably:
How do I specify my own loss function for SGDClassifier, for example? It seems scikit-learn doesn't stick to the modular approach here, and changes would need to be made somewhere inside its sources.
I recently had this problem and came up with a nice fix that seems to work.
Basically, transform your targets to log-odds-ratio space using the inverse sigmoid function. Then fit a linear regression. Then, to do inference, take the sigmoid of the predictions from the linear regression model.
So say we have soft targets/labels y ∈ (0, 1) (make sure to clamp the targets to say [1e-8, 1 - 1e-8] to avoid instability issues when we take logs).
We take the inverse sigmoid, then we fit a linear regression (assuming predictor variables are in matrix X):
import numpy as np
from sklearn.linear_model import LinearRegression

y = np.clip(y, 1e-8, 1 - 1e-8)   # numerical stability
inv_sig_y = np.log(y / (1 - y))  # transform to log-odds-ratio space

lr = LinearRegression()
lr.fit(X, inv_sig_y)
Then to make predictions:
def sigmoid(x):
    ex = np.exp(x)
    return ex / (1 + ex)

preds = sigmoid(lr.predict(X_new))
This seems to work, at least for my use case. My guess is that it's not far off what happens behind the scenes for LogisticRegression anyway.
Bonus: this also seems to work with other regression models in sklearn, e.g. RandomForestRegressor.
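For instance, the bonus point might look like this (reusing X, inv_sig_y, and sigmoid from above; the hyperparameters are arbitrary):

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, inv_sig_y)                   # fit in log-odds space
rf_preds = sigmoid(rf.predict(X_new))  # map predictions back to (0, 1)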
According to the docs,
The ‘log’ loss gives logistic regression, a probabilistic classifier.
In general a loss function is of the form Loss(prediction, target), where prediction is the model's output and target is the ground-truth value. In the case of logistic regression, prediction is a value on (0,1) (i.e., a "soft label"), while target is 0 or 1 (i.e., a "hard label").
So in answer to your question, it depends on if you are referring to the prediction or target. Generally speaking, the form of the labels ("hard" or "soft") is given by the algorithm chosen for prediction and by the data on hand for target.
If your data has "hard" labels, and you desire a "soft" label output by your model (which can be thresholded to give a "hard" label), then yes, logistic regression is in this category.
If your data has "soft" labels, then you would have to choose a threshold to convert them to "hard" labels before using typical classification methods (i.e., logistic regression). Otherwise, you could use a regression method where the model is fit to predict the "soft" target. In this latter approach, your model could give values outside of (0,1), and this would have to be handled.
For those interested, I've implemented a custom class that behaves like a normal classifier but takes any regressor in the constructor to perform the transformation suggested by #nlml:
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.utils.validation import check_array
from scipy.special import softmax
import numpy as np

def _log_odds_ratio_scale(X):
    X = np.clip(X, 1e-8, 1 - 1e-8)  # numerical stability
    X = np.log(X / (1 - X))         # transform to log-odds-ratio space
    return X

class FuzzyTargetClassifier(ClassifierMixin, BaseEstimator):

    def __init__(self, regressor):
        '''
        Fits a regressor in the log-odds-ratio space (inverse cross-entropy) of the target variable.
        During inference, rescales back to probability space with the softmax function.

        Parameters
        ---------
        regressor: sklearn regressor
            Base regressor to fit in log-odds-ratio space. Any valid sklearn regressor can be used here.
        '''
        self.regressor = regressor
        return

    def fit(self, X, y=None, **kwargs):
        # ensure passed y is onehotencoded-like
        y = check_array(y, accept_sparse=True, dtype='numeric', ensure_min_features=1)
        self.regressors_ = [clone(self.regressor) for _ in range(y.shape[1])]
        for i in range(y.shape[1]):
            self._fit_single_regressor(self.regressors_[i], X, y[:, i], **kwargs)
        return self

    def _fit_single_regressor(self, regressor, X, ysub, **kwargs):
        ysub = _log_odds_ratio_scale(ysub)
        regressor.fit(X, ysub, **kwargs)
        return regressor

    def decision_function(self, X):
        all_results = []
        for reg in self.regressors_:
            results = reg.predict(X)
            if results.ndim < 2:
                results = results.reshape(-1, 1)
            all_results.append(results)
        results = np.hstack(all_results)
        return results

    def predict_proba(self, X):
        results = self.decision_function(X)
        results = softmax(results, axis=1)
        return results

    def predict(self, X):
        results = self.decision_function(X)
        results = results.argmax(1)
        return results
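A short usage sketch for the class above (the regressor choice and the random soft-label matrix are my own assumptions, just to show the expected shapes):

from sklearn.ensemble import RandomForestRegressor
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 5))
y_soft = rng.dirichlet(alpha=[2.0, 2.0, 2.0], size=100)  # rows sum to 1, one column per class

clf = FuzzyTargetClassifier(RandomForestRegressor(n_estimators=50, random_state=0))
clf.fit(X, y_soft)
proba = clf.predict_proba(X)  # softmax over the per-class regressor outputs
labels = clf.predict(X)       # argmax over classes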
