I'm working on a Kaggle competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques#evaluation) and it states that my model will be evaluated by:
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
I couldn't find this in the docs (it's basically RMSE(log(truth), log(prediction)), so I went about writing a custom scorer:
def custom_loss(truth, preds):
truth_logs = np.log(truth)
print(truth_logs)
preds_logs = np.log(preds)
numerator = np.sum(np.square(truth_logs - preds_logs))
return np.sum(np.sqrt(numerator / len(truth)))
custom_scorer = make_scorer(custom_loss, greater_is_better=False)
Two questions:
1) Should my custom loss function return a numpy array of scores (one for each (truth, prediction) pair? Or should it be the total loss over those (truth, prediction) pairs, returning a single number?
I looked into the docs but they weren't super helpful re: what my custom loss function should return.
2) When I run:
xgb_model = xgb.XGBRegressor()
params = {"max_depth": [3, 4], "learning_rate": [0.05],
"n_estimators": [1000, 2000], "n_jobs": [8], "subsample": [0.8], "random_state": [42]}
grid_search_cv = GridSearchCV(xgb_model, params, scoring=custom_scorer,
n_jobs=8, cv=KFold(n_splits=10, shuffle=True, random_state=42), verbose=2)
grid_search_cv.fit(X, y)
grid_search_cv.best_score_
I get back:
-0.12137097567803554
which is very surprising. Given that my loss function is taking RMSE(log(truth) - log(prediction)), I shouldn't be able to have a negative best_score_.
Any idea why it's negative?
Thanks!
1) You should return a single number as loss, not array. GridSearchCV will sort the params accroding to the results of this scorer.
By the way instead of defining a custom metric, you can use mean_squared_log_error, which does what you want.
2) Why does it return negative? - Without your actual data and complete code we cant say.
You should be careful with the notation.
There are 2 levels of optimization here:
The loss function optimized when the XGBRegressor is fitted to the data.
The scoring function that is optimized during the grid search.
I prefer calling the second scoring function instead of loss function, since loss function usually refers to a term that is subject to optimization during the model fitting process itself.
However, your custom function only specifies 2. whilst leaving 1. untouched. In case you want to change the loss function of XGBRegressor see here. Most regression models have several criteria from which you can choose such as mean_square_error or mean_absolute_error.
Note, that passing customized loss functions is not supported at the moment (see reasons here and here).
The make_scorer function sign flips if greater_is_better is False
Related
My model performs a multi-class (3) classification task.
I would like to change the way model "fits". Instead of calculation of a metric such as acc or logloss - I would like to run a simulation on whole data set to see how the model performs after each fit, in real-time.
Please note that simulation != loss/error. Simulation takes into the consideration time component of the data, the sequence in which events occur. Whereas the loss function simply calculates the error based on true values.
Currently I do the simulation after the whole "fitting" process has been done:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
all_ds = lgb.Dataset(X, label=y)
train_ds = lgb.Dataset(X_train, label=y_train)
test_ds = lgb.Dataset(X_test, label=y_test)
params = {
'device_type': "gpu",
'objective': 'multiclass',
'metric': 'multi_logloss',
"boosting_type": "gbdt",
"num_class": 3,
'random_state': 123
}
# fit
model = lgb.train(
params,
train_ds,
num_boost_round=20
valid_sets=[test_ds]
)
# make prediction on a whole data set
y_pred= model.predict(all_ds)
# simulate
simulation_result = simulate(X, y_pred) # float value
current process is:
fit step 1 - error x
fit step 2 - error y
..
fit step 20 - error z
simulate - see how the model performs
I would like to change the process to
fit step 1 - simulate - use result of simulation as an error
fit step 2 - simulate - use result of simulation as an error
..
fit step 20 - simulate - use result of simulation as an error
Is there a way to achieve it through a custom callback or a custom evaluation metric or some other way?
I tried creating a custom eval metric, unfortunately I cannot invoke predict() from within the function. Moreover I find the preds parameter value to be something I cannot simply use without transformations of some sort.. It contains some sort of multidimensional array that I have no idea how to convert to actual predictions.
def customEvalMetric(preds, eval_data):
# how to invoke predict() method on a whole dataset here?
# OR how to convert preds to one-hot encoded values?
# simulation_result = simulate(all_ds, ..?..)
return 'simulation_result', simulation_result, True
and using as
model = lgb.train(
params,
train_ds,
num_boost_round=20
valid_sets=[all_ds],
feval=customEvalMetric,
)
p.s. now that I think about it - I could in theory fit once in a loop, then use init_model to load the existing model weights.. Is this the only way?
I suppose this question is applicable to other tree boosting libraries since the API are similar (xgboost for example)
The custom eval function should work. As per the docs, preds is:
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
So if this is a classification problem, you might need to apply the softmax transformation to each row. For a regression problem, you should be able to use this output as-is.
I would like to maximize a function: func(minCount, wordNgrams, lr, epoch, loss)
with GridSearch on only these values:
`{'minCount': [2, 3],
'wordNgrams': [1, 2, 3, 4, 5],
'lr': [0.1, 0.01, 0.001, 0.0001],
'epoch': [5, 10, 15, 20, 25, 30],
'loss': [hs, ns, softmax]}`
I have read about sklearn.model_selection.GridSearchCV(estimator, param_grid, ...)
But, I don't know, to where I should put my func(minCount, wordNgrams, lr, epoch, loss)
By the way, I've read about bayesian optimization (https://github.com/fmfn/BayesianOptimization), but don't have any understanding of how to use this with string and int parameters
According to the documentation , you have two solutions:
You can pass estimator = func to GridSearchCV, but you also need to pass a scoring function. The scoring function will take func's outputs and return a score (float) that GridSearchCV will seek to optimize. Example:
def my_scoring_function(func_outputs):
"""
process the outputs of func and return a score.
if func already reutrns the value you want to minimize,
my_scoring_function will be the identity function.
score is the value to optimize
"""
return score
cv = GridSearchCV(estimator=func, param_grid=my_param_grid, scoring=my_scoring_function)
More complex, but elegant: You can rewrite your func as an object implementing scikit-learn's estimator methods (good tutorial here with a gid search example). This means that will basically follow a set of conventions that will make your function behave like scikit-learn's objects. GridSearchCV will then know how to deal with it. This might be overkill for your problem though.
Regarding bayesian optimisation, it is interesting if your problem meets the following criterions:
Evaluating your function is very costly (in terms of time/resource...) and you cannot afford calling it as many times as a grid search requires you to. In your case, you have 720 combinations of parameters to explore, so if one evaluation costs 10s, you will have to run the grid search for 7200s.
You want to explore a broader parameter space, or you want to search continuous spaces for some parameters. Typically, it can be interesting for the learning rate. In this case, you can also use a random search, also implemented in scikit learn.
For more details on bayesian optimization I would recommend this article that I find very comprehensive.
I'm interested in examining loss values per example, using a tf.keras.Model instance, with the following restrictions:
Assuming I already called model.compile(), the information about the loss function is there - I don't want to explicitly define loss like in this example from the docs
Without performing gradient descend (I am aware of the option of accessing the loss in tf.callback.Callbacks, but I don't want to perform GD)
Also, using a callback to get losses and then reloading the initial weights is not a valid solution, I want to avoid GD entirely.
So is there a way to achieve this? I would expect something that will look something like
model = tf.keras.Sequential([....])
model.compile(optimizer=..., loss=...)
single_loss_value = model.get_loss(single_x, single_y)
batch_loss_valujes = model.get_loss(x, y)
To get the loss of some samples you can use model.evaluate. There you can use x and y, as a single sample or a batch.
model = tf.keras.Sequential([....])
model.compile(optimizer=..., loss=...)
single_loss_value = model.evaluate(single_x, single_y)
batch_loss_values = model.evaluate(x, y)
Context:
I am using Passive Aggressor from scikit library and confused whether to use warm start or partial fit.
Efforts hitherto:
Referred this thread discussion:
https://github.com/scikit-learn/scikit-learn/issues/1585
Gone through the scikit code for _fit and _partial_fit.
My observations:
_fit in turn calls _partial_fit.
When warm_start is set, _fit calls _partial_fit with self.coef_
When _partial_fit is called without coef_init parameter and self.coef_ is set, it continues to use self.coef_
Question:
I feel both are ultimately providing the same functionalities.Then, what is the basic difference between them? In which contexts, either of them are used?
Am I missing something evident? Any help is appreciated!
I don't know about the Passive Aggressor, but at least when using the SGDRegressor, partial_fit will only fit for 1 epoch, whereas fit will fit for multiple epochs (until the loss converges or max_iter
is reached). Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, but with fit and warm_start it will act as if you would combine your old data and your new data together and fit the model once until convergence.
Example:
from sklearn.linear_model import SGDRegressor
import numpy as np
np.random.seed(0)
X = np.linspace(-1, 1, num=50).reshape(-1, 1)
Y = (X * 1.5 + 2).reshape(50,)
modelFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
shuffle=True, max_iter=2000, tol=1e-3, warm_start=True)
modelPartialFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
shuffle=True, max_iter=2000, tol=1e-3, warm_start=False)
# first fit some data
modelFit.fit(X, Y)
modelPartialFit.fit(X, Y)
# for both: Convergence after 50 epochs, Norm: 1.46, NNZs: 1, Bias: 2.000027, T: 2500, Avg. loss: 0.000237
print(modelFit.coef_, modelPartialFit.coef_) # for both: [1.46303288]
# now fit new data (zeros)
newX = X
newY = 0 * Y
# fits only for 1 epoch, Norm: 1.23, NNZs: 1, Bias: 1.208630, T: 50, Avg. loss: 1.595492:
modelPartialFit.partial_fit(newX, newY)
# Convergence after 49 epochs, Norm: 0.04, NNZs: 1, Bias: 0.000077, T: 2450, Avg. loss: 0.000313:
modelFit.fit(newX, newY)
print(modelFit.coef_, modelPartialFit.coef_) # [0.04245779] vs. [1.22919864]
newX = np.reshape([2], (-1, 1))
print(modelFit.predict(newX), modelPartialFit.predict(newX)) # [0.08499296] vs. [3.66702685]
If warm_start = False, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) will reset the model's trainable parameters for the initialisation. If warm_start = True, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) will retain the values of the model's trainable parameters from the previous run, and use those initially.
Regardless of the value of warm_start, each call to partial_fit() will retain the previous run's model parameters and use those initially.
Example using MLPRegressor:
import sklearn.neural_network
import numpy as np
np.random.seed(0)
x = np.linspace(-1, 1, num=50).reshape(-1, 1)
y = (x * 1.5 + 2).reshape(50,)
cold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=1)
warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=1)
cold_model.fit(x,y)
print cold_model.coefs_, cold_model.intercepts_
#[array([[0.17009494]])] [array([0.74643783])]
cold_model.fit(x,y)
print cold_model.coefs_, cold_model.intercepts_
#[array([[-0.60819342]])] [array([-1.21256186])]
#after second run of .fit(), values are completely different
#because they were re-initialised before doing the second run for the cold model
warm_model.fit(x,y)
print warm_model.coefs_, warm_model.intercepts_
#[array([[-1.39815616]])] [array([1.651504])]
warm_model.fit(x,y)
print warm_model.coefs_, warm_model.intercepts_
#[array([[-1.39715616]])] [array([1.652504])]
#this time with the warm model, params change relatively little, as params were
#not re-initialised during second call to .fit()
cold_model.partial_fit(x,y)
print cold_model.coefs_, cold_model.intercepts_
#[array([[-0.60719343]])] [array([-1.21156187])]
cold_model.partial_fit(x,y)
print cold_model.coefs_, cold_model.intercepts_
#[array([[-0.60619347]])] [array([-1.21056189])]
#with partial_fit(), params barely change even for cold model,
#as no re-initialisation occurs
warm_model.partial_fit(x,y)
print warm_model.coefs_, warm_model.intercepts_
#[array([[-1.39615617]])] [array([1.65350392])]
warm_model.partial_fit(x,y)
print warm_model.coefs_, warm_model.intercepts_
#[array([[-1.39515619]])] [array([1.65450372])]
#and of course the same goes for the warm model
First, let us look at the difference between .fit() and .partial_fit().
.fit() would let you train from the scratch. Hence, you could think of this as a option that can be used only once for a model. If you call .fit() again with a new set of data, the model would be build on the new data and will have no influence of previous dataset.
.partial_fit() would let you update the model with incremental data. Hence, this option can be used more than once for a model. This could be useful, when the whole dataset cannot be loaded into the memory, refer here.
If both .fit() or .partial_fit() are going to be used once, then it makes no difference.
warm_start can be only used in .fit(), it would let you start the learning from the co-eff of previous fit(). Now it might sound similar to the purpose to partial_fit(), but recommended way would be partial_fit(). May be do the partial_fit() with same incremental data few number of times, to improve the learning.
About difference. Warm start it just an attribute of class. Partial fit it is method of this class. It's basically different things.
About same functionalities. Yes, partial fit will use self.coef_ because it still needed to get some values to update on training period. And for empty coef_init we just put zero values to self.coef_ and go to the next step of training.
Description.
For first start:
Whatever how (with or without warm start). We will train on zero coefficients but in result we will save average of our coefficients.
N+1 start:
With warm start. We will check via method _allocate_parameter_mem our previous coefficients and take it to train. In result save our average coefficients.
Without warm start. We will put zero coefficients (as first start) and go to training step. In result we will still write average coefficients to memory.
I am using sklearn.svm.svc from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally calculates the probability?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing parameter vectors A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
Actually I found a slightly different answer that they used this code to convert decision value to probability
'double fApB = decision_value*A+B;
if (fApB >= 0)
return Math.exp(-fApB)/(1.0+Math.exp(-fApB));
else
return 1.0/(1+Math.exp(fApB)) ;'
Here A and B values can be found in the model file (probA and probB).
It offers a way to convert probability to decision value and thus to hinge loss.
Use that ln(0) = -200.