I am performing a Multinomial Logistic Regression on variables in the NHTS 2017 dataset. According to the docs, sklearn.linear_model.LogisticRegression uses cross-entropy loss (log loss) as the loss function to optimize the model. However, as I add new features and fit the model, the loss does not seem to be monotonically decreasing. Specifically, if I fit household driver count to vehicle ownership (driver count is the single most predictive variable for vehicle ownership), I get a lower loss than if I indiscriminately fit all of the variables.
Possibly this is due to sklearn.metrics.log_loss doing something different than the actual loss function for LogisticRegression. Possibly the problem has become so non-convex that it finds a crappy solution. Can anybody help explain why my loss would increase as I add features?
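Roughly, the comparison I am making looks like this (a minimal sketch; df is the loaded NHTS household table, and 'driver_count' / 'vehicle_count' are placeholders for the actual column names):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# df: pandas DataFrame with the NHTS household variables (placeholder names below)
y = df['vehicle_count']

# Model 1: driver count only
X_small = df[['driver_count']]
clf_small = LogisticRegression(max_iter=1000).fit(X_small, y)
print(log_loss(y, clf_small.predict_proba(X_small)))

# Model 2: every available feature
X_all = df.drop(columns=['vehicle_count'])
clf_all = LogisticRegression(max_iter=1000).fit(X_all, y)
print(log_loss(y, clf_all.predict_proba(X_all)))  # comes out higher, not lower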
There could be multiple reasons but my guess is the following:
penalty - by default logistic regression is trained with an l2 penalty to prevent overfitting. In this case, the loss function is the cross entropy loss plus the l2 norm of the weights. As a result, more features will not necessarily guarantee that the cross entropy itself decreases.
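If you want to check whether the penalty is what you're seeing, here is a quick sketch of an experiment (X_all and y are placeholders for your full design matrix and target): refit with the regularization effectively switched off and compare the plain cross entropy on the same training data.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Default: l2 penalty with C=1.0 -- the optimizer minimizes
# cross entropy + (1/C) * l2 norm of the weights, not cross entropy alone.
clf_l2 = LogisticRegression(max_iter=1000).fit(X_all, y)

# Effectively unpenalized fit (very large C); recent sklearn versions also accept penalty=None.
clf_unpen = LogisticRegression(C=1e9, max_iter=1000).fit(X_all, y)

print("penalized fit, training log loss:  ", log_loss(y, clf_l2.predict_proba(X_all)))
print("unpenalized fit, training log loss:", log_loss(y, clf_unpen.predict_proba(X_all)))
# With the penalty removed, adding features should not increase the training log loss
# (up to convergence tolerance).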
By the way, it seems like your goal is to get the highest score (lowest loss) on the training set. I am not going to dispute that, but maybe look into test/validation sets as well.
Related
I am trying to predict labels for building performance: {1, 0}. Since this is binary classification, I tried sigmoid and identity activation functions with Xavier initialization. However, I cannot improve the accuracy of my models, as the loss and accuracy stay flat after each training epoch. This is a very imbalanced dataset in which class 1 makes up about 90% of the samples, so I assume this might be due to the initial bias. Can you help me with this one? You can see the setup of the training process in the attached images (model definition, hyperparameters, results).
Here are several suggestions which may help:
Use an activation function after each hidden layer.
A learning rate of 0.1 is too high for Adam. Try something smaller (3e-4, for example).
You are printing the loss value incorrectly: currently it is taken from the last iteration only. Calculate the mean epoch loss instead.
Minor suggestion: since the task is binary classification, it's better to apply torch.nn.BCELoss, or torch.nn.BCEWithLogitsLoss if you don't use a sigmoid on the last layer. The last linear layer must have output_size=1 in this case.
The best model checkpoint may be missed with the code you provided: you calculate accuracy every 10 epochs, but the accuracy > best_accuracy check runs every epoch, which is inconsistent (see the sketch below for one way to keep these in sync).
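Putting the suggestions above together, a rough sketch of what the training loop could look like (model, train_loader, val_loader and num_epochs are assumed to exist; the pos_weight value is just the negative/positive ratio implied by your 90/10 imbalance):

import torch
import torch.nn as nn

# Sketch only: model, train_loader, val_loader and num_epochs are assumed to be
# defined elsewhere; pos_weight ~ 1/9 is the negative/positive ratio for a dataset
# where class 1 makes up ~90% of the samples.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([1 / 9.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

best_accuracy = 0.0
for epoch in range(num_epochs):
    model.train()
    epoch_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad()
        logits = model(x)  # last linear layer has output_size=1, no sigmoid
        loss = criterion(logits, y.float().unsqueeze(1))
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    print(f"epoch {epoch}: mean train loss = {epoch_loss / len(train_loader):.4f}")

    # Evaluate and checkpoint on the same schedule, so the best model is never missed.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            preds = (torch.sigmoid(model(x)) > 0.5).long().squeeze(1)
            correct += (preds == y).sum().item()
            total += y.numel()
    accuracy = correct / total
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        torch.save(model.state_dict(), "best_model.pt")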
In many ML applications a weighted loss may be desirable, since some types of incorrect prediction might be worse outcomes than others. E.g. in medical binary classification (healthy/ill), a false negative, where the patient doesn't get further examinations, is a worse outcome than a false positive, where a follow-up examination will reveal the error.
So if I define a weighted loss function like this:
def weighted_loss(prediction, target):
    if prediction == target:
        return 0    # correct, no loss
    elif prediction == 0:  # class 0 is healthy
        return 100  # false negative, very bad
    else:
        return 1    # false positive, incorrect
How can I pass something equivalent to this to scikit-learn classifiers like Random Forests or SVM classifiers?
I am afraid your question is ill-posed, stemming from a fundamental confusion between the different notions of loss and metric.
Loss functions do not work with prediction == target-type conditions - this is what metrics (like accuracy, precision, recall etc) do - which, however, play no role during loss optimization (i.e. training), and serve only for performance assessment. Loss does not work with hard class predictions; it only works with the probabilistic outputs of the classifier, where such equality conditions never apply.
An additional layer of "insulation" between loss and metrics is the choice of a threshold, which is necessary for converting the probabilistic outputs of a classifier (the only thing that matters during training) to "hard" class predictions (the only thing that matters for the business problem under consideration). And again, this threshold plays absolutely no role during model training (where the only relevant quantity is the loss, which knows nothing about thresholds and hard class predictions); as nicely put in the Cross Validated thread Reduce Classification Probability Threshold:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
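To make the separation concrete, here is a small sketch with made-up numbers: the loss is computed straight from the probabilistic outputs, while a metric such as recall only comes into existence once a threshold has been chosen.

import numpy as np
from sklearn.metrics import log_loss, recall_score

y_true = np.array([1, 0, 1, 1, 0])            # hard labels (toy example)
probs = np.array([0.9, 0.4, 0.35, 0.8, 0.1])  # classifier's probability of class 1

# The loss works directly on the probabilities: no threshold, no prediction == target checks.
print("log loss:", log_loss(y_true, probs))

# A metric needs hard predictions, which exist only after a threshold is chosen.
threshold = 0.5
y_pred = (probs >= threshold).astype(int)
print("recall at 0.5 threshold:", recall_score(y_true, y_pred))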
Although you can certainly try to optimize this (decision) threshold with extra procedures outside of the narrowly-defined model training (i.e. loss minimization), as you briefly describe in the comments, your expectation that
I am pretty sure that I'd get better results if the decision boundaries drawn by the RBFs took that into account, when fitting to the data
with something similar to your weighted_loss function is futile.
So, no function similar to your weighted_loss shown here (essentially a metric, and not a loss function, despite its name), that employs equality conditions like prediction == target, can be used for model training.
The discussion in the following SO threads might also be useful in clarifying the issue:
Loss & accuracy - Are these reasonable learning curves?
What is the difference between loss function and metric in Keras? (despite the title, the definitions are generally applicable and not only for Keras)
Cost function training target versus accuracy desired goal
How to interpret loss and accuracy for a machine learning model
I have two features, say F1 and F2, which have a correlation of about 0.9.
When I built my model, I first considered all the features for my regression model. Once I had the model, I ran Lasso regression on it, hoping that this would tackle any collinearity between the features. However, the Lasso regression kept both F1 and F2 in the model.
Two questions:
i) If F1 and F2 are highly correlated, but Lasso regression still kept both of them, what could this mean? Does it mean regularization doesn't work in some cases?
ii) How do I adjust my model or the Lasso regression model to kick out F1 or F2 in my model? (I am using sklearn.linear_model.LogisticRegression, and have set penalty='l1' or 'elasticnet', tried very large or very small C values, tried 'liblinear' or 'saga' solvers, and l1_ratio=1, but I still can't kick out either F1 or F2 from my model)
Answers to your questions:
i) Lasso reduces coefficients gradually. You may find a nice picture in some books authored by Robert Tibshirani, the person behind the Lasso/Ridge, where you will see how some coefficients gradually fall to zero as the regularization coefficient increases (you may perform such an experiment yourself). The fact that the model still keeps both may mean one of two things: either the model deems both important, or there is not enough regularization to kill one of them.
ii) You're right to go with Lasso-style L1 regularization; the relevant knob is the C parameter. The way it's coded in sklearn, the smaller the C, the stronger the regularization (C is the inverse of the regularization strength). Though in machine learning your task is not to totally exclude collinearity ("to kill F1 or F2", in your parlance), but to find a model (or a set of params, if you wish) that will generalize best. That is done through model tuning via CV. Warning: higher regularization means more underfitting.
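The "experiment" mentioned in (i) is easy to run yourself; here is a rough sketch (X and y stand in for your own data, standardized first so the coefficients are on a comparable scale):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# X, y: your design matrix (containing F1 and F2) and target -- placeholders.
X_std = StandardScaler().fit_transform(X)

# Smaller C = stronger L1 regularization; watch coefficients drop to exactly zero.
for C in [1.0, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(penalty='l1', solver='liblinear', C=C).fit(X_std, y)
    print(f"C={C}: non-zero coefficients = {np.count_nonzero(clf.coef_)}")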
I would add though that collinearity is somewhat dangerous for linear regression because it may give rise to model instability (differing coefficients on different subsamples). So, with linear regression, you may wish to check this too.
If I correctly understood the significance of the loss function to the model, it directs the model to be trained based on minimizing the loss value. So, for example, if I want my model to be trained in order to have the least mean absolute error, I should use the MAE as the loss function. Why is it, then, that you sometimes see someone wanting to achieve the best accuracy possible, but building the model to minimize a completely different function? For example:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')
How come the model above is trained to give us the best acc, since during its training it will try to minimize a different function (MSE)? I know that, once trained, the model's metric will report the best acc found during training.
My doubt is: shouldn't the focus of the model during its training be to maximize acc (or minimize 1/acc) instead of minimizing MSE? If done that way, wouldn't the model give us even higher accuracy, since it knows it has to maximize it during its training?
To start with, the code snippet you have used as example:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')
is actually invalid (although Keras will not produce any error or warning) for a very simple and elementary reason: MSE is a valid loss for regression problems, for which problems accuracy is meaningless (it is meaningful only for classification problems, where MSE is not a valid loss function). For details (including a code example), see own answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?; for a similar situation in scikit-learn, see own answer in this thread.
Continuing to your general question: in regression settings, usually we don't need a separate performance metric, and we normally use just the loss function itself for this purpose, i.e. the correct code for the example you have used would simply be
model.compile(loss='mean_squared_error', optimizer='sgd')
without any metrics specified. We could of course use metrics='mse', but this is redundant and not really needed. Sometimes people use something like
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['mse','mae'])
i.e. optimise the model according to the MSE loss, but show also its performance in the mean absolute error (MAE) in addition to MSE.
Now, your question:
shouldn't the focus of the model during its training be to maximize acc (or minimize 1/acc) instead of minimizing MSE?
is indeed valid, at least in principle (save for the reference to MSE), but only for classification problems, where, roughly speaking, the situation is as follows: we cannot use the vast arsenal of convex optimization methods in order to directly maximize the accuracy, because accuracy is not a differentiable function; so, we need a proxy differentiable function to use as loss. The most common example of such a loss function suitable for classification problems is the cross entropy.
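A tiny numeric illustration of why such a proxy is needed: two classifiers can have exactly the same accuracy while the cross entropy still tells them apart (and, unlike accuracy, changes smoothly as the probabilities move):

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 0, 0])

# Two classifiers that produce identical hard predictions at a 0.5 threshold...
probs_confident = np.array([0.99, 0.95, 0.05, 0.60])
probs_hesitant = np.array([0.51, 0.55, 0.45, 0.60])

for name, p in [("confident", probs_confident), ("hesitant", probs_hesitant)]:
    acc = accuracy_score(y_true, (p >= 0.5).astype(int))
    print(f"{name}: accuracy = {acc:.2f}, cross entropy = {log_loss(y_true, p):.3f}")
# Both give accuracy = 0.75, but the cross entropy differs -- and it has a usable gradient.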
Rather unsurprisingly, this question of yours pops up from time to time, albeit with slight variations in context; see for example own answers in
Cost function training target versus accuracy desired goal
Targeting a specific metric to optimize in tensorflow
For the interplay between loss and accuracy in the special case of binary classification, you may find my answers in the following threads useful:
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy?
I was training Gradient Boosting Models using sklearn's GradientBoostingClassifier [sklearn.ensemble.GradientBoostingClassifier] when I encountered the "loss" parameter.
The official explanation given on sklearn's page is:
loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)
loss function to be optimized. ‘deviance’ refers to deviance (= logistic regression) for classification with probabilistic outputs. For loss ‘exponential’ gradient boosting recovers the AdaBoost algorithm.
sklearn.ensemble.GradientBoostingClassifier
My question is: according to my limited understanding, the 'deviance' loss function is used for probabilistic classification (like Naive Bayes's probabilistic outputs used for classification).
What happens with the 'exponential' loss function?
OR
When should the 'exponential' loss function be used?
According to the sklearn.ensemble.AdaBoostClassifier page
sklearn.ensemble.AdaBoostClassifier
for the 'algorithm' parameter:
algorithm : {‘SAMME’, ‘SAMME.R’}, optional (default=’SAMME.R’)
If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.
Does this mean that 'SAMME.R' (of AdaBoost) is similar to the 'deviance' option of the 'loss' parameter of GradientBoostingClassifier?
Is my understanding correct, or am I missing something?
Thanks!