Exact definitions of loss functions in sklearn.linear_model.SGDClassifier - python

I know that I may change loss function to one of the following:
loss : str, 'hinge' or 'log' or 'modified_huber'
The loss function to be used. Defaults to 'hinge'. The hinge loss is
a margin loss used by standard linear SVM models. The 'log' loss is
the loss of logistic regression models and can be used for
probability estimation in binary classifiers. 'modified_huber'
is another smooth loss that brings tolerance to outliers.
But what are the definitions of these functions?
I understand that hinge is max(0, 1 - margin). What about the others?

Here are the graphs of all these functions, taken from the scikit-learn example gallery:
In the current dev version of the example, the losses are implemented inline in the script.

sklearn's source code is available on GitHub, so you can examine it. The list of loss functions can be found in sklearn/linear_model/stochastic_gradient.py. The definitions of those losses are here: sklearn/linear_model/sgd_fast.pyx#L46
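For quick reference, the three losses can be written as functions of the margin z = y * f(x), with y in {-1, +1}. The following is a plain-Python sketch of the usual textbook definitions, not a copy of the Cython code in sgd_fast.pyx; the function names are illustrative.

```python
import math


def hinge(z):
    """Hinge loss: max(0, 1 - z), where z = y * f(x) is the margin."""
    return max(0.0, 1.0 - z)


def log_loss(z):
    """Logistic loss: log(1 + exp(-z)), written to avoid overflow."""
    if z > 0:
        return math.log1p(math.exp(-z))
    # For z <= 0 use the identity log(1 + exp(-z)) = -z + log(1 + exp(z)).
    return -z + math.log1p(math.exp(z))


def modified_huber(z):
    """Modified Huber: squared hinge for z >= -1, linear (-4z) below that."""
    if z >= 1.0:
        return 0.0
    if z >= -1.0:
        return (1.0 - z) ** 2
    return -4.0 * z


for z in (-2.0, 0.0, 0.5, 2.0):
    print(z, hinge(z), log_loss(z), modified_huber(z))
```

The two branches of modified_huber meet at z = -1, where both give 4, which is what keeps the loss smooth while growing only linearly for badly misclassified points.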

Related

Creating a criterion that measures the F1 Loss

I am currently creating a criterion to measure the MSE loss using:
loss_fcn = torch.nn.MSELoss()
loss = loss_fcn(logits[getMaskForBatch(subgraph)], labels.float())
Now I need to change it to the F1 score, but I cannot seem to find a library that could be used for it.
Depending on the task, you need a specific loss function.
The loss function (also known as the objective, cost, or error function) is, in a sense, the counterpart of the optimization function: the loss function measures the loss, and the optimization function reduces it. The two should stay in balance so that the model does not overfit.
PyTorch regression losses:
nn.L1Loss: L1 loss (MAE)
nn.MSELoss: L2 loss (MSE)
nn.SmoothL1Loss: Huber
PyTorch classification losses:
nn.CrossEntropyLoss
nn.KLDivLoss
nn.NLLLoss
PyTorch GAN training:
nn.MarginRankingLoss
So if you used nn.MSELoss you probably need to stay with regression, because F1 is a classification metric.
If you really need the F1 score for some other reason, you may use scikit-learn.
Why do you need to do that?
The F1 score is usually an evaluation metric, not a loss function. Moreover, to use the F1 score as a loss function you have to ensure that it is differentiable and, ideally, convex (which I think is probably not the case; otherwise it would already be in the literature).
There are many loss functions that could suit your problem, such as cross-entropy, negative log-likelihood, CTC loss, etc.
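If an F1-like training signal is still wanted, one workaround that is sometimes used is a "soft" F1 computed from predicted probabilities instead of hard decisions. The sketch below is an illustration under that assumption, not something taken from the answers above; soft_f1_loss and eps are hypothetical names, and plain Python lists stand in for tensors. Written with PyTorch tensor operations, the same arithmetic would be differentiable with respect to the probabilities.

```python
def soft_f1_loss(probs, labels, eps=1e-8):
    """Return 1 - soft F1 for binary labels (0/1) and probabilities in [0, 1].

    True positives, false positives and false negatives are replaced by
    their "soft" counts, so the result changes smoothly with the probabilities.
    """
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)  # eps avoids division by zero
    return 1.0 - soft_f1


labels = [1, 0, 1, 1]
print(soft_f1_loss([0.9, 0.2, 0.8, 0.6], labels))  # confident and mostly right
print(soft_f1_loss([0.4, 0.6, 0.3, 0.5], labels))  # less accurate, higher loss
```

This surrogate is not convex, so it does not remove the concern raised above; it only makes the quantity usable with gradient-based training.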

I want to define loss function for 3-best classification

I'm designing a classification model.
My problem is that there are many categories that have similar features.
I think the best option would be to regenerate the category hierarchy, but the categories are fixed.
So I am focusing on 3-best accuracy instead of 1-best accuracy.
I want to define a loss function for 3-best accuracy.
I don't care where the correct answer appears within positions 1 to 3.
Is there a good loss function for that? Or how can I define one?
You can use keras.metrics.top_k_categorical_accuracy to calculate the accuracy, but that is an accuracy metric. I don't think there is any built-in top-k loss function in TensorFlow or Keras as of now. A loss function should be differentiable to work with gradient-based learning methods, and top-k, just like the accuracy metric, is not a differentiable function. Hence it can be used as an evaluation metric but not as a learning objective. So you won't find a built-in method for this; however, there are research papers that aim to solve this problem. You might want to have a look at Learning with Average Top-k Loss and Smooth Loss Functions for Deep Top-k Classification.
You can use either of the following:
top_k_categorical_accuracy:
keras.metrics.top_k_categorical_accuracy(y_true, y_pred, k=3)
sparse_top_k_categorical_accuracy:
keras.metrics.sparse_top_k_categorical_accuracy(y_true, y_pred, k=3)
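To make the metric concrete, here is a minimal plain-Python sketch of what top-k accuracy computes with integer labels. It is not the Keras implementation; top_k_accuracy and the toy scores are made up for illustration.

```python
def top_k_accuracy(scores, true_labels, k=3):
    """Fraction of samples whose true class index is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(true_labels)


scores = [
    [0.10, 0.50, 0.20, 0.15, 0.05],  # true class 2 is ranked 2nd
    [0.40, 0.30, 0.20, 0.06, 0.04],  # true class 3 is ranked 4th
]
labels = [2, 3]

print(top_k_accuracy(scores, labels, k=1))
print(top_k_accuracy(scores, labels, k=3))
```

Because the result depends only on the ranking of the scores, a small change to the scores usually leaves it unchanged, which is why it gives no useful gradient as a training objective.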

Optimizing for accuracy instead of loss in Keras model

If I have correctly understood the significance of the loss function to the model, it directs the model to be trained by minimizing the loss value. So, for example, if I want my model to be trained to have the least mean absolute error, I should use the MAE as the loss function. Why is it, then, that you sometimes see someone who wants to achieve the best accuracy possible but builds the model to minimize a completely different function? For example:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')
How come the model above is trained to give us the best acc, since during its training it will try to minimize another function (MSE)? I know that, once trained, the metric of the model will give us the best acc found during training.
My doubt is: shouldn't the focus of the model during its training be to maximize acc (or minimize 1/acc) instead of minimizing MSE? If done that way, wouldn't the model give us even higher accuracy, since it knows it has to maximize it during its training?
To start with, the code snippet you have used as example:
model.compile(loss='mean_squared_error', optimizer='sgd', metrics='acc')
is actually invalid (although Keras will not produce any error or warning) for a very simple and elementary reason: MSE is a valid loss for regression problems, for which accuracy is meaningless (accuracy is meaningful only for classification problems, where MSE is not a valid loss function). For details (including a code example), see my answer in What function defines accuracy in Keras when the loss is mean squared error (MSE)?; for a similar situation in scikit-learn, see my answer in this thread.
Continuing to your general question: in regression settings, usually we don't need a separate performance metric, and we normally use just the loss function itself for this purpose, i.e. the correct code for the example you have used would simply be
model.compile(loss='mean_squared_error', optimizer='sgd')
without any metrics specified. We could of course use metrics='mse', but this is redundant and not really needed. Sometimes people use something like
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['mse','mae'])
i.e. optimise the model according to the MSE loss, but show also its performance in the mean absolute error (MAE) in addition to MSE.
Now, your question:
shouldn't the focus of the model during its training to maximize acc (or minimize 1/acc) instead of minimizing MSE?
is indeed valid, at least in principle (save for the reference to MSE), but only for classification problems, where, roughly speaking, the situation is as follows: we cannot use the vast arsenal of convex optimization methods in order to directly maximize the accuracy, because accuracy is not a differentiable function; so, we need a proxy differentiable function to use as loss. The most common example of such a loss function suitable for classification problems is the cross entropy.
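A small numeric illustration of this point, as a plain-Python sketch with made-up probabilities (the helper names are not from any library): when the predictions improve slightly, the cross entropy decreases while the accuracy does not move at all, so accuracy alone gives the optimizer nothing to follow.

```python
import math


def binary_cross_entropy(probs, labels):
    """Mean negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    )
    return -total / len(labels)


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples classified correctly after thresholding."""
    correct = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


labels = [1, 0, 1]
before = [0.6, 0.4, 0.3]  # third sample is misclassified
after = [0.7, 0.3, 0.4]   # every probability moved in the right direction

print(accuracy(before, labels), accuracy(after, labels))
print(binary_cross_entropy(before, labels), binary_cross_entropy(after, labels))
```

Both accuracy values are 2/3 because no prediction crossed the 0.5 threshold, whereas the cross entropy registers the improvement.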
Rather unsurprisingly, this question of yours pops up from time to time, albeit with slight variations in context; see for example my answers in
Cost function training target versus accuracy desired goal
Targeting a specific metric to optimize in tensorflow
For the interplay between loss and accuracy in the special case of binary classification, you may find my answers in the following threads useful:
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy?

loss parameter explanation for "sklearn.ensemble.GradientBoostingClassifier"

I was training Gradient Boosting Models using sklearn's GradientBoostingClassifier [sklearn.ensemble.GradientBoostingClassifier] when I encountered the "loss" parameter.
The official explanation given on sklearn's page is:
loss : {‘deviance’, ‘exponential’}, optional (default=’deviance’)
loss function to be optimized. ‘deviance’ refers to deviance (=
logistic regression) for classification with probabilistic outputs.
For loss ‘exponential’ gradient boosting recovers the AdaBoost
algorithm.
sklearn.ensemble.GradientBoostingClassifier
My question is: according to my limited understanding, the 'deviance' loss function is used for probabilistic classification (like Naive Bayes's probabilistic outputs used for classification).
What happens with the 'exponential' loss function?
Or:
When should the 'exponential' loss function be used?
According to the sklearn.ensemble.AdaBoostClassifier page,
for the 'algorithm' parameter:
algorithm : {‘SAMME’, ‘SAMME.R’}, optional (default=’SAMME.R’)
If ‘SAMME.R’ then use the SAMME.R real boosting algorithm.
base_estimator must support calculation of class probabilities. If
‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R
algorithm typically converges faster than SAMME, achieving a lower
test error with fewer boosting iterations.
Does this mean that 'SAMME.R' (of AdaBoost) is similar to the 'deviance' option of the 'loss' parameter of GradientBoostingClassifier?
Is my understanding correct, or am I missing something?
Thanks!

Fitting Keras L1 models

I have a simple Keras model (a normal Lasso linear model) where the inputs are fed into a single 'neuron', Dense(1, kernel_regularizer=l1(fdr))(input_layer), but the weights of this model are never set exactly to zero. I find this interesting, since scikit-learn's Lasso can set coefficients exactly to zero.
I have used Adam and TensorFlow's FtrlOptimizer for optimisation, and they have the same problem.
I've already checked this question, but it does not explain why sklearn can set values exactly to zero, not to mention how its models converge in ~500 ms on my server when the same model in Keras takes 2.4 s with early termination.
Is this all because of the optimizer being used or am I missing something?
"Is this all because of the optimizer being used or am I missing something?"
Indeed. If you look into the actual function that gets called when you fit Lasso from scikit-learn (it is called from the ElasticNet class), you see that it uses a different optimization algorithm.
Coordinate descent in scikit-learn's ElasticNet starts with a coefficient vector equal to zero, and then considers adding nonzero entries one at a time (this is related to stepwise feature selection for linear regression).
Other methods used to optimize L1-regularized regression also work in that way: for example, LARS (least-angle regression), which is also available in scikit-learn.
In contrast to that, a paper on the FTRL algorithm says:
Unfortunately, OGD is not particularly effective at producing
sparse models. In fact, simply adding a subgradient
of the L1 penalty to the gradient of the loss (∇_w ℓ_t(w))
will essentially never produce coefficients that are exactly
zero.
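The difference can be seen on a one-dimensional toy problem, minimizing 0.5 * (w - a)**2 + lam * |w|. The sketch below is an illustration with made-up numbers, not scikit-learn's or Keras's actual code. It assumes the usual closed-form coordinate update for the Lasso, the soft-thresholding operator, and compares it with plain subgradient steps.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t and return exactly 0.0 when |x| <= t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sign(x):
    """Sign of x as -1, 0 or 1; used as a subgradient of |x|."""
    return (x > 0) - (x < 0)


a, lam = 0.3, 0.5  # |a| <= lam, so the exact minimizer is w = 0

# Coordinate-descent style update: solve the 1-D problem in closed form.
w_cd = soft_threshold(a, lam)

# Subgradient descent: add a subgradient of the L1 penalty to the loss gradient.
w_sgd, lr = 1.0, 0.1
for _ in range(200):
    grad = (w_sgd - a) + lam * sign(w_sgd)
    w_sgd -= lr * grad

print(w_cd)   # exactly 0.0
print(w_sgd)  # hovers near zero; not expected to land on exactly 0.0
```

The closed-form update can return an exact zero because it contains an explicit thresholding branch, whereas the gradient-style iterate keeps stepping back and forth across zero with a size set by the learning rate.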
