Difference between batch-average and global Fscore - python

I am working on a false-positive-reduction problem, and the ratio of positive to negative samples is approximately 1.7:1.
I learned from an answer that precision, recall, F-score, or even weighting true positives, false positives, true negatives, and false negatives differently depending on their cost, can be used to evaluate different models for this kind of classification task.
Since Precision, Recall, and F-score were removed from Keras, I found some ways to track those metrics during training, such as the GitHub repo keras-metrics.
Besides, I also found other solutions that define precision like this:
from keras import backend as K

def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.
    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
However, those methods only track the metrics during training, and they all state that they compute a batch-wise average rather than a global value.
I wonder how necessary it is to keep track of those metrics during training. Or should I just focus on the loss and accuracy during training, and then evaluate all models with validation functions from, say, scikit-learn, which compute those metrics globally?
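For reference, this is the kind of global, after-training evaluation I have in mind (a sketch; model, x_val, and y_val stand for my own trained model and validation data):

from sklearn.metrics import precision_score, recall_score, f1_score

# Predict on the whole validation set once, then threshold at 0.5.
y_prob = model.predict(x_val)
y_pred = (y_prob > 0.5).astype(int)

# Global metrics, computed over all validation samples at once.
print('precision:', precision_score(y_val, y_pred))
print('recall:   ', recall_score(y_val, y_pred))
print('f1:       ', f1_score(y_val, y_pred))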

In Keras, all training metrics are measured batch-wise.
To obtain a global metric, Keras will average these batch-metrics.
Something like sum(batch_metrics) / batches.
Since most metrics are means over the number of samples, that kind of averaging does not change the global value much.
If samples % batch_size == 0, then we can say that:
sum(all_samples_metrics) / samples == sum(all_batch_metrics) / batches
But these specific metrics you are talking about are not divided by the "number of samples", but by the number of samples "that satisfy a condition". Thus, the divisor in each batch is different. Mathematically, the result of averaging the batch-metrics to obtain a global result will not reflect the true global result.
So, can we say that they're not good for training?
Well, no. They may be good for training. Sometimes "accuracy" is a terrible metric for a specific problem.
The key to using these metrics batch-wise is to have a batch size big enough to avoid too much variation in the divisors.
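To make the divisor issue concrete, here is a small NumPy sketch (hypothetical numbers) comparing the average of per-batch precisions with the precision computed over all samples at once:

import numpy as np

# Two hypothetical batches of binary labels and predictions.
y_true_batches = [np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])]
y_pred_batches = [np.array([1, 0, 0, 0]), np.array([1, 1, 1, 0])]

def precision_of(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    predicted_positives = np.sum(y_pred == 1)
    return tp / predicted_positives if predicted_positives else 0.0

# Batch-wise average, as Keras reports during training.
per_batch = [precision_of(t, p) for t, p in zip(y_true_batches, y_pred_batches)]
print('average of batch precisions:', np.mean(per_batch))   # (1/1 + 1/3) / 2 = 0.67

# Global precision over all samples at once.
y_true_all = np.concatenate(y_true_batches)
y_pred_all = np.concatenate(y_pred_batches)
print('global precision:', precision_of(y_true_all, y_pred_all))   # 2/4 = 0.5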

Related

keras apply threshold for loss function

I am developing a Keras model. My dataset is badly imbalanced, so I want to set a threshold for training and testing. If I'm not mistaken, during backpropagation the neural network compares the predicted values with the true ones, calculates the error, and updates the weights of the neurons based on that error.
As far as I know, Keras uses 0.5 as the threshold. I know there are ways to apply custom metrics (such as recall and precision) with a custom threshold, but that threshold is only used for calculating the recall; it is not applied in the loss function. To be clear: if I set 0.85 as my threshold, the network would still use 0.5 as the threshold to calculate the loss and 0.85 only for the recall.
Is there any way to set this threshold for training as well?
There is no such thing as a threshold for the loss.
A loss function must be "differentiable", thus it must be a "continuous" function.
The best you can do is to set "class weights", as in these examples: Higher loss penalty for true non-zero predictions
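For example, a minimal sketch of passing class weights to fit (model, x_train, and y_train are placeholders, and the weight values are hypothetical and should be tuned to your imbalance):

# Give the positive class (label 1 here) a larger weight, so its errors
# contribute more to the loss during training.
class_weight = {0: 1.0, 1: 5.0}

model.fit(x_train, y_train,
          epochs=10,
          batch_size=32,
          class_weight=class_weight)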
In addition to class weights...
you can use a metric function with a threshold parameter:
model.compile(..., metrics=[tf.keras.metrics.BinaryAccuracy(threshold=0.5)])
you can use a sigmoid activation in the last layer and apply the threshold manually afterwards:
pred_labels = np.where(y_pred>0.5, 1, 0)
score = sklearn.metrics.accuracy_score(pred_labels, labels)

Wasserstein loss can be negative?

I'm currently training a WGAN in Keras with the (approximate) Wasserstein loss below:
def wasserstein_loss(y_true, y_pred):
    return K.mean(y_true * y_pred)
However, this loss can obviously be negative, which is weird to me.
I trained the WGAN for 200 epochs and got the critic Wasserstein loss training curve below.
The above loss is calculated by
d_loss_valid = critic.train_on_batch(real, np.ones((batch_size, 1)))
d_loss_fake = critic.train_on_batch(fake, -np.ones((batch_size, 1)))
d_loss, _ = 0.5*np.add(d_loss_valid, d_loss_fake)
The resulting generated sample quality is great, so I think I trained the WGAN correctly. However, I still cannot understand why the Wasserstein loss can be negative while the model still works. According to the original WGAN paper, the Wasserstein loss can be used as a performance indicator for the GAN, so how should we interpret it? Am I misunderstanding anything?
The Wasserstein loss is a measure of the Earth Mover's distance, i.e. of the difference between two probability distributions. In TensorFlow it is implemented as d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real), which can obviously give a negative number if d_fake moves too far to the other side of the d_real distribution. You can see it on your plot, where during training your real and fake distributions change sides until they converge around zero. So, as a performance measurement, you can use it to see how far the generator is from the real data and on which side it is now.
See the distributions plot:
P.S. it's crossentropy loss, not Wasserstein.
Perhaps this article can help you more, if you haven't read it yet. However, another question is how the optimizer can minimize the negative loss (towards zero).
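To make the sign behaviour concrete, here is a small sketch of that critic-loss expression with hypothetical critic outputs (TensorFlow 2.x, eager mode):

import tensorflow as tf

# Hypothetical critic scores for a batch of real and fake samples.
d_real = tf.constant([0.8, 1.2, 0.9])   # critic output on real data
d_fake = tf.constant([1.5, 1.1, 1.4])   # critic output on generated data

# Critic loss as quoted above: mean(fake) - mean(real).
d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real)
print(float(d_loss))  # positive here; becomes negative if d_fake drops below d_real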
It looks like I cannot comment on the answer given by Sergeiy Isakov because I do not have enough reputation. I wanted to comment because I think that information is not correct.
In principle, the Wasserstein distance cannot be negative, because a distance metric cannot be negative. The actual expression (dual form) for the Wasserstein distance involves the supremum over all 1-Lipschitz functions (you can look it up on the web). Since it is a supremum, we always take the Lipschitz function that gives the largest value in order to obtain the Wasserstein distance. However, the Wasserstein distance we compute with a WGAN is just an estimate, not the real Wasserstein distance. If the number of inner critic iterations is low, the critic may not have enough updates to move that estimate to a positive value.
Thought experiment: if we obtain a Wasserstein estimate that is negative, we can always negate the critic function to make the estimate positive. That means there exists a Lipschitz function that gives a positive value, larger than the one that gave the negative value. So Wasserstein estimates cannot be negative, since by definition we need the supremum over all 1-Lipschitz functions.
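For reference, the dual (Kantorovich–Rubinstein) form referred to above is:
W(P_r, P_g) = sup_{||f||_L <= 1} ( E_{x ~ P_r}[f(x)] - E_{x ~ P_g}[f(x)] )
Since f = 0 is itself 1-Lipschitz and gives the value 0, the supremum, and hence the true distance, is never negative; only the critic's finite approximation of it can dip below zero.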

The return value of model.evaluate_generator

I don't understand this: since the model is evaluated on a group of images, not a single image, I think the score should be the average of the loss and metrics over that group of images.
This is a neural network model based on Keras; to evaluate its performance, I need to calculate the average loss and metrics and their standard deviation.
score = model.evaluate_generator(evaluateGene, test_images, verbose=1)
print('%.3f' % score[0], '%.3f' % score[1], '%.3f' % score[2])
I want the mean loss and metrics and their standard deviation, but this function doesn't seem able to do that. Are there good solutions for returning a mean value and the std? Thanks a lot!
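One way to get both a mean and a standard deviation is to compute per-sample losses yourself and then aggregate them with NumPy. A sketch, assuming a model that outputs probabilities trained with binary cross-entropy; evaluateGene, model, and n_batches stand for your generator, your trained model, and the number of evaluation batches:

import numpy as np

per_sample_losses = []
for _ in range(n_batches):
    x_batch, y_batch = next(evaluateGene)
    # Clip to avoid log(0); assumes the model outputs probabilities.
    y_pred = np.clip(model.predict_on_batch(x_batch), 1e-7, 1 - 1e-7)
    # Per-sample binary cross-entropy, averaged over the output units/pixels.
    bce = -(y_batch * np.log(y_pred) + (1 - y_batch) * np.log(1 - y_pred))
    per_sample_losses.extend(bce.reshape(len(x_batch), -1).mean(axis=1))

print('mean loss:', np.mean(per_sample_losses))
print('std of loss:', np.std(per_sample_losses))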

Loss function for simple Reinforcement Learning algorithm

This question comes from watching the following video on TensorFlow and Reinforcement Learning from Google I/O 18: https://www.youtube.com/watch?v=t1A3NTttvBA
Here they train a very simple RL algorithm to play the game of Pong.
In the slides they use, the loss is defined like this (at approx. 11m 25s):
loss = -R(sampled_actions * log(action_probabilities))
Further, they show the following code (at approx. 20m 26s):
# loss
cross_entropies = tf.losses.softmax_cross_entropy(
    onehot_labels=tf.one_hot(actions, 3), logits=Ylogits)
loss = tf.reduce_sum(rewards * cross_entropies)
# training operation
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.001, decay=0.99)
train_op = optimizer.minimize(loss)
Now my question is this: they use +1 for winning and -1 for losing as rewards. In the code provided, won't any cross-entropy loss that's multiplied by a negative reward become very low (very negative)? And if the training operation uses the optimizer to minimize the loss, then isn't the algorithm being trained to lose?
Or is there something fundamental I'm missing (probably because of my very limited mathematical skills)?
Great question, Corey. I have also wondered exactly what this popular RL loss function actually means. I've seen many implementations of it, and many contradict each other. To my understanding, it means this:
Loss = - log(pi) * A
where A is the advantage compared to a baseline. In Google's case they used a baseline of 0, so A = R. The log-probabilities are multiplied by the one-hot encoding of the specific action taken at that specific time step, e.g. [1, 0, 0] in your example above, so we ignore the 0s and only keep the term for the chosen action. Hence we get the equation above.
If you naively calculate this loss for a negative reward (A = -1):
Loss = - (-1) * log(P) = log(P)
For any P less than 1, the log is negative, so you get a negative loss, which could be interpreted as "very good" but really doesn't make physical sense.
The correct way:
However, in my opinion (and please correct me if I'm wrong), you do not use the loss value directly. You take the gradient of the loss, that is, the derivative of -log(pi) * A.
Therefore, you would have:
-(d(pi) / pi) * A
Now, if you have a large negative reward, it translates into a large gradient, one that pushes the probability of that action down.
I hope this makes sense.
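As an illustration of the sign logic, here is a small sketch using modern TensorFlow rather than the tf.losses / tf.train API from the video; the logits, actions, and rewards are hypothetical:

import tensorflow as tf

# Hypothetical logits for 3 actions over a batch of 2 time steps.
logits = tf.Variable([[2.0, 0.5, 0.1],
                      [0.3, 1.5, 0.2]])
actions = tf.constant([0, 1])          # actions actually sampled
rewards = tf.constant([1.0, -1.0])     # +1 for a win, -1 for a loss

with tf.GradientTape() as tape:
    # Cross-entropy against the sampled action = -log(pi(action)).
    cross_entropies = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    # Weighting by the reward flips the sign for losing actions, so minimizing
    # this loss pushes losing actions' probabilities down and winning ones up.
    loss = tf.reduce_sum(rewards * cross_entropies)

grads = tape.gradient(loss, [logits])
print(float(loss), grads[0].numpy())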

Confusions about multi-class classification

I'm working on a 200-class classification task (but it's a bit different, because there can be multiple 1's in the y vector) using a 4-layer fully-connected neural network. Most of the time y (the label vector) contains only one or two 1's, and that's where the problem is: when training, the model tends to predict all labels as zero, even where they should be 1.
Thus the accuracy is low (less than 99%, which is practically worse than an all-zero prediction). The activation function for each layer is sigmoid. Could you give me some advice on how to improve the model?
This is my loss function. The accuracy metric is misleading, because predicting all labels as 0 already gives almost 99% accuracy.
loss = tf.reduce_mean(
    tf.reduce_sum(
        -(sum_all - sum_one) / sum_all * tf.multiply(ys, tf.log(prediction))
        - sum_one / sum_all * tf.multiply((one - ys), tf.log(one - prediction)),
        reduction_indices=[1]))
sum_one indicates the number of 1's in the label. I implemented a weighting here.
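One commonly suggested alternative to hand-rolled weighting like this (just a sketch, not from the original post) is tf.nn.weighted_cross_entropy_with_logits, which up-weights the positive labels via pos_weight; depending on the TensorFlow version, the first argument may be named targets instead of labels:

import tensorflow as tf

def weighted_loss(ys, logits, pos_weight=50.0):
    # logits: raw (pre-sigmoid) outputs of the last layer, shape (batch, 200).
    # ys: multi-hot label vectors of the same shape.
    # pos_weight > 1 makes missed 1's cost more than spurious 1's; the value
    # 50.0 here is hypothetical and should be tuned to the label sparsity.
    per_label = tf.nn.weighted_cross_entropy_with_logits(
        labels=ys, logits=logits, pos_weight=pos_weight)
    return tf.reduce_mean(tf.reduce_sum(per_label, axis=1))

Note that this expects raw logits, so the final sigmoid would only be applied at prediction time.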
