The code for the loss function in scikit-learn logistic regression is:
# Logistic loss is the negative of the log of the logistic function.
out = -np.sum(sample_weight * log_logistic(yz)) + .5 * alpha * np.dot(w, w)
However, it seems to differ from the common form of the logarithmic loss function, which reads:
-(y log(p) + (1-y) log(1-p))
(please see http://wiki.fast.ai/index.php/Log_Loss)
Could anyone explain how to read the code for the loss function in scikit-learn logistic regression, and how it relates to the general form of the logarithmic loss function?
Thank you in advance.
First, note that 0.5 * alpha * np.dot(w, w) is just the L2 regularization term. Setting it aside, the sklearn logistic regression loss reduces to the following
-np.sum(sample_weight * log_logistic(yz))
Also, the np.sum is there because the loss is accumulated over multiple samples, so for a single sample it again reduces to
-sample_weight * log_logistic(yz)
Finally, if you read HERE, you will note that sample_weight is an optional array of weights assigned to individual samples. If it is not provided, each sample is given unit weight. So it equals one (the original definition of cross entropy loss does not weight samples unequally), hence the loss reduces to:
- log_logistic(yz)
which is equivalent to
- log_logistic(y * np.dot(X, w)).
Now, why does it look different from the cross entropy loss function (in essence it is the same), i.e.:
- [y log(p) + (1-y) log(1-p)].
The reason is, we can use either of two different labeling conventions for binary classification, either using {0, 1} or {-1, 1}, which results in the two different representations. But they are the same!
More details (on why they are the same) can be found HERE. Note that you should read the response by Manuel Morales.
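As a quick numeric check of the claim above, here is a minimal sketch in plain Python. It assumes z = np.dot(X, w) for one sample, p = sigmoid(z), and the mapping y = 2t - 1 between a {0, 1} label t and a {-1, 1} label y; the helper names are illustrative and are not scikit-learn functions.

```python
import math

def log_logistic(z):
    # log(1 / (1 + exp(-z))), split into two branches to avoid overflow
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def loss_pm1(y, z):
    # Form used in the scikit-learn snippet, with labels y in {-1, 1}
    return -log_logistic(y * z)

def loss_01(t, z):
    # Textbook cross entropy, with labels t in {0, 1} and p = sigmoid(z)
    p = 1.0 / (1.0 + math.exp(-z))
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))

for z in (-2.0, 0.5, 3.0):
    for t in (0, 1):
        y = 2 * t - 1  # map {0, 1} to {-1, 1}
        print(z, t, loss_pm1(y, z), loss_01(t, z))
```

Each printed row should show the same value twice, up to floating-point rounding.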
After studying autograd, I tried to write a loss function myself.
Here is my loss:
def myCEE(outputs, targets):
    exp = torch.exp(outputs)
    A = torch.log(torch.sum(exp, dim=1))
    hadamard = F.one_hot(targets, num_classes=10).float() * outputs
    B = torch.sum(hadamard, dim=1)
    return torch.sum(A - B)
I compared it with torch.nn.CrossEntropyLoss.
Here are the results:
for i, j in train_dl:
    inputs = i
    targets = j
    break
outputs = model(inputs)
myCEE(outputs,targets) : tensor(147.5397, grad_fn=<SumBackward0>)
loss_func = nn.CrossEntropyLoss(reduction='sum') : tensor(147.5397, grad_fn=<NllLossBackward>)
The values were the same.
I thought that, because these are different functions, the grad_fn values differ, and that this would not cause any problems.
But something happened!
After 4 epochs, the loss values turned to nan.
In contrast to myCEE, training with nn.CrossEntropyLoss went well.
So, I wonder if there is a problem with my function.
After reading some posts about nan problems, I stacked more convolutions onto the model.
As a result, 39 epochs of training ran without an error.
Nevertheless, I'd like to know the difference between myCEE and nn.CrossEntropyLoss.
torch.nn.CrossEntropyLoss differs from your implementation because it uses a trick to counter the numerically unstable computation of the exponential for large values. Given the logits {l_1, ..., l_j, ..., l_n}, the softmax is defined as:
softmax(l_i) = exp(l_i) / sum_j(exp(l_j))
The trick is to multiply both the numerator and denominator by exp(-β):
softmax(l_i) = exp(l_i)*exp(-β) / [sum_j(exp(l_j))*exp(-β)]
= exp(l_i-β) / sum_j(exp(l_j-β))
Then the log-softmax comes down to:
logsoftmax(l_i) = l_i - β - log[sum_j(exp(l_j-β))]
In practice β is chosen as the highest logit value i.e. β = max_j(l_j).
You can read more about it on this question: Numerically Stable Softmax.
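To make the trick concrete, here is a minimal sketch in plain Python (standard library only, so it is not the actual PyTorch implementation). The function names are illustrative; the point is only that subtracting β = max_j(l_j) before exponentiating keeps every exp argument at or below zero.

```python
import math

def log_softmax_stable(logits):
    # Shift by the largest logit so that every exp() argument is <= 0
    beta = max(logits)
    shifted = [l - beta for l in logits]
    log_sum = math.log(sum(math.exp(s) for s in shifted))
    return [s - log_sum for s in shifted]

def cross_entropy_stable(logits, target):
    # Negative log-probability of the target class index
    return -log_softmax_stable(logits)[target]

def log_softmax_naive(logits):
    # Direct formula: exp() overflows once a logit is large enough
    log_sum = math.log(sum(math.exp(l) for l in logits))
    return [l - log_sum for l in logits]

big_logits = [1000.0, 0.0, -1000.0]
print(cross_entropy_stable(big_logits, 0))  # finite

try:
    log_softmax_naive(big_logits)
except OverflowError:
    print("naive version overflowed")
```

In plain Python the naive version raises OverflowError; with floating-point tensors the same overflow typically shows up as inf and then nan, which is consistent with the nan loss described in the question.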
I need to calculate the Aitchison distance as a loss function between input and output datasets.
While calculating this metric I need to compute the geometric mean of each row (the dataset has size [batches x features] when the loss is evaluated).
In the simple case we can imagine there is only 1 batch, so I just need one geometric mean for the input and one for the output dataset.
So how can this be done in TensorFlow? I didn't find any dedicated metric or reduce function.
You can easily calculate the geometric mean of a tensor as a loss function (or in your case as part of the loss function) with tensorflow using a numerically stable formula highlighted here. The provided code fragment closely resembles the pytorch solution posted here that follows the abovementioned formula (and scipy implementation).
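from tensorflow.keras import backend as K

def gmean_loss(y_true, y_pred, dim=1):
    # Geometric mean along `dim`, computed as exp(mean(log(x))).
    # The log needs strictly positive inputs, so this takes the absolute error
    # and clips it away from zero (an assumption; adapt it to your data).
    error = K.abs(y_pred - y_true)
    logx = K.log(K.maximum(error, K.epsilon()))
    return K.exp(K.mean(logx, axis=dim))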
You can define dim according to your needs or integrate it into your code.
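For reference, the same formula can be sketched without TensorFlow. The snippet below is plain Python and assumes strictly positive rows; it also shows one common way to get the Aitchison distance from the question, namely the Euclidean distance between centered log-ratio (clr) vectors, where clr(x) = log(x) - mean(log(x)). The function names are illustrative.

```python
import math

def geometric_mean(row):
    # exp(mean(log(x))): avoids overflow from multiplying many values
    return math.exp(sum(math.log(v) for v in row) / len(row))

def clr(row):
    # Centered log-ratio: log of each value divided by the geometric mean
    mean_log = sum(math.log(v) for v in row) / len(row)
    return [math.log(v) - mean_log for v in row]

def aitchison_distance(x, y):
    # Euclidean distance between the clr-transformed rows
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

x = [1.0, 2.0, 4.0]
print(geometric_mean(x))
print(aitchison_distance(x, [2.0, 4.0, 8.0]))  # rescaling a row leaves it (numerically) unchanged
```

In a TensorFlow loss the same steps would be applied along the feature axis of each batch.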
I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by :
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses, one per sample; each value measures the mismatch between the true label and the distribution your model predicts.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which is the product dL/da * da/dz (i.e. the derivative of the loss wrt the activation times the derivative of the activation wrt the linear function).
The delta in the link you posted is this dL/dz term: for a softmax or sigmoid output paired with cross entropy, the product simplifies to output - target. This blog does a decent job of explaining how all the parts fit together; the activation function it uses is a sigmoid, but the overall pieces fit together in the same way.
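To see how the loss and delta = output - target relate, here is a minimal numeric sketch in plain Python. It assumes a softmax output layer feeding the cross entropy, and it compares a finite-difference gradient of the loss with respect to the pre-activations z against softmax(z) - target. The function names are illustrative and are not Keras APIs.

```python
import math

def softmax(z):
    m = max(z)  # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    # target is a one-hot list; the loss is -sum(target * log(softmax(z)))
    return -sum(t * math.log(p) for t, p in zip(target, softmax(z)))

def numeric_grad(z, target, eps=1e-6):
    # Central finite differences of the loss with respect to each z_i
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grad

z = [0.2, -1.0, 1.5]
target = [0.0, 0.0, 1.0]
analytic = [p - t for p, t in zip(softmax(z), target)]  # output - target
numeric = numeric_grad(z, target)
print(analytic)
print(numeric)
```

The two printed lists should agree to several decimal places, which is why the forward-pass code never needs to compute delta explicitly: it appears only when the loss is differentiated.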
I am new to this field and this is probably a pretty silly question. I want to build a normal ANN, but I am not sure whether I can use a weighted mean squared error as the loss function.
If we are not treating each sample equally, that is, if we care more about the prediction precision for some categories of samples than for others, then we want a weighted loss function.
Let's say we have a categorical feature ci, where i is the index of the sample, and for simplicity we assume this feature takes a binary value, either 0 or 1. So we can form the loss function as
(ci + 1)(yi_hat - yi)^2
#and take the sum for all i
Is there going to be any problem with back-propagation? I don't see any issue with calculating the gradient or updating the weights between layers.
And, if no issue, how can I program this loss function in Keras? Because it seems that the loss function only takes two parameters, y_true and y_pred, how can I plug in the vector c?
There is absolutely nothing wrong with that. Functions can declare the constants within themselves or even take the constants from an outside scope:
import keras
import keras.backend as K
c = K.constant([c1,c2,c3,c4,...,cn])
def weighted_loss(y_true,y_pred):
    loss = keras.losses.get('mse')
    return c * loss(y_true,y_pred)
Or exactly like yours:
def weighted_loss(y_true,y_pred):
    weighted = (c+1)*K.square(y_true-y_pred)
    return K.sum(weighted)
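On the back-propagation question, a small plain-Python sketch may help. It assumes the loss is the sum over i of (ci + 1)(yi_hat - yi)^2 exactly as written in the question; the weights only rescale each sample's gradient, so nothing in the chain rule changes. The function names are illustrative.

```python
def weighted_sse(c, y_true, y_pred):
    # Sum over samples of (c_i + 1) * (y_pred_i - y_true_i)^2
    return sum((ci + 1) * (yp - yt) ** 2 for ci, yt, yp in zip(c, y_true, y_pred))

def weighted_sse_grad(c, y_true, y_pred):
    # Derivative with respect to each prediction: 2 * (c_i + 1) * (y_pred_i - y_true_i)
    return [2 * (ci + 1) * (yp - yt) for ci, yt, yp in zip(c, y_true, y_pred)]

c = [0, 1]
y_true = [1.0, 2.0]
y_pred = [1.5, 1.0]
print(weighted_sse(c, y_true, y_pred))       # 2.25
print(weighted_sse_grad(c, y_true, y_pred))  # [1.0, -4.0]
```

Note that a constant c of fixed length only lines up with y_true and y_pred when every batch has that same length. As an alternative, Keras model.fit accepts a sample_weight argument, which may be a simpler way to pass per-sample weights such as ci + 1.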
I am using sklearn.svm.SVC from scikit-learn to do binary classification. I am using its predict_proba() function to get probability estimates. Can anyone tell me how predict_proba() internally calculates the probability?
Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.
Platt scaling requires first training the SVM as usual, then optimizing the scalar parameters A and B such that
P(y|X) = 1 / (1 + exp(A * f(X) + B))
where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses an internal five-fold cross validation, meaning that training SVMs with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
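The sigmoid above can be sketched in a few lines of plain Python. This is only an illustration of the formula P(y|X) = 1 / (1 + exp(A * f(X) + B)), not LibSVM's or scikit-learn's code; the function name is made up, and the two branches exist only to avoid overflow in exp for large arguments.

```python
import math

def platt_probability(decision_value, A, B):
    # P(y|X) = 1 / (1 + exp(A * f(X) + B)), evaluated without overflow
    fApB = decision_value * A + B
    if fApB >= 0:
        return math.exp(-fApB) / (1.0 + math.exp(-fApB))
    return 1.0 / (1.0 + math.exp(fApB))

# The numbers from the example above: f(X) = 10, A = 1, B = -9.9
print(round(platt_probability(10.0, 1.0, -9.9), 3))  # 0.475
```

With these values the decision function is positive while the calibrated probability falls below 0.5, which is the inconsistency the example describes.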
Actually I found a slightly different answer: they use this code to convert a decision value to a probability
double fApB = decision_value*A+B;
if (fApB >= 0)
    return Math.exp(-fApB)/(1.0+Math.exp(-fApB));
else
    return 1.0/(1+Math.exp(fApB));
Here A and B values can be found in the model file (probA and probB).
It offers a way to convert probability to decision value and thus to hinge loss.
Use that ln(0) = -200.