Unexpected 1.0000 top_k_categorical_accuracy - python

I'm training a classification model, and getting some weird metric values. The top1 accuracy is fairly low and has relatively normal behavior, but the top_k_categorical_accuracy (with the default k=5) is often exactly 1.0000. This seems highly implausible, given how low the top1 accuracy is. What could be going on here?

This is due to the model predicting 0.00000000 for most categories for a given example. Keras uses in_top_k for calculating the top_k_categorical_accuracy metric. In the documentation:
Note that the behavior of InTopK differs from the TopK op in its handling of ties; if multiple classes have the same prediction value and straddle the top-k boundary, all of those classes are considered to be in the top k.
So all predictions of 0 are tied, and they all count as part of the top "5" as long as 4 or fewer predictions are nonzero. In that case all 200 classes count as being in the top 5, and thus we get 1.0000 for this metric.
Relevant GitHub issue: #10767
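For illustration, here is a minimal sketch (not from the original post; it assumes TensorFlow 2.x eager execution and a made-up 200-class shape matching the setup described above) of the tie behavior of tf.math.in_top_k, which backs this metric:

import numpy as np
import tensorflow as tf

num_classes = 200
predictions = np.zeros((1, num_classes), dtype=np.float32)
predictions[0, 3] = 0.9  # a single nonzero score; the other 199 classes are tied at 0.0

# Class 42 has a prediction of exactly 0, yet it is reported as "in the top 5",
# because all the tied zeros straddle the top-k boundary and are counted.
print(tf.math.in_top_k(targets=[42], predictions=predictions, k=5).numpy())  # [ True]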


Scaling the sigmoid output

I am training a network on images for binary classification. The input images are normalized to have pixel values in the range [0, 1], and the weight matrices are initialized from a normal distribution. However, the output of my last Dense layer with sigmoid activation yields values with only a very minute difference between the two classes. For example:
output for class 1: 0.377525, output for class 2: 0.377539
The difference between the classes appears only after 4 decimal places. Is there any workaround to make sure that the output for class 1 falls between 0 and 0.5 and the output for class 2 falls between 0.5 and 1?
Edit:
I have tried both cases.
Case 1 - Dense(1, 'sigmoid') with binary crossentropy
Case 2 - Dense(2, 'softmax') with binary crossentropy
For case 1, the output values differ by a very small amount, as mentioned in the problem above. As such, I am taking the mean of the predicted values to act as a threshold for classification. This works up to some extent, but it is not a permanent solution.
For case 2, the prediction overfits to one class only.
A sample of the code:
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     MaxPooling2D, Flatten, Dense)
from tensorflow.keras.models import Model

inputs = Input(shape=(128, 156, 1))
x = Conv2D(.....)(inputs)
x = BatchNormalization()(x)
x = MaxPooling2D()(x)
...
flat = Flatten()(x)
out = Dense(1, activation='sigmoid')(flat)
model = Model(inputs, out)
model.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['binary_accuracy'])
It seems you are confusing a binary classification architecture with a 2-label multi-class classification setup.
Since you mention the probabilities for the 2 classes, class1 and class2, you have set up a single-label multi-class problem. That means you are trying to predict the probabilities of 2 classes, where a sample can have only one of the labels at a time.
In this setup, it's proper to use softmax instead of sigmoid. Your loss function would be binary_crossentropy as well.
Right now, with the multi-label setup and sigmoid activation, you are independently predicting the probability of a sample being class1 and being class2 at the same time (i.e., multi-label multi-class classification).
Once you change to softmax, you should see more significant differences between the probabilities, IF the sample actually belongs definitively to one of the 2 classes and your model is well trained and confident about its predictions (compare validation vs. training results).
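As a rough sketch of the two output-layer setups (this is an assumption-laden example, not the original poster's model: the convolutional trunk is replaced by a plain Flatten, and the softmax variant is paired here with sparse_categorical_crossentropy rather than the binary_crossentropy mentioned in the answer):

from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(128, 156, 1))
features = Flatten()(inputs)  # stand-in for the convolutional trunk

# Option A: single sigmoid unit + binary cross-entropy (integer labels 0 or 1)
out_a = Dense(1, activation='sigmoid')(features)
model_a = Model(inputs, out_a)
model_a.compile(optimizer='adamax', loss='binary_crossentropy', metrics=['binary_accuracy'])

# Option B: two-unit softmax + sparse categorical cross-entropy (same integer labels)
out_b = Dense(2, activation='softmax')(features)
model_b = Model(inputs, out_b)
model_b.compile(optimizer='adamax', loss='sparse_categorical_crossentropy', metrics=['accuracy'])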
First, I would like to say that the information you provided is insufficient to debug your problem exactly, because you didn't provide any code for your model and optimizer. I suspect there might be an error in the labels. I also suggest you use a softmax activation function instead of the sigmoid in the final layer, although your approach will still work, since a binary classification problem can output a single node with a binary cross-entropy loss.
If you want to receive an accurate solution, please provide more information.

"All zero class" prediction by Neural Network

In a classification problem involving the identification of fraudulent transactions, I reduced the dimensionality of the data (28 columns) [a complete quasi-separation was detected by Logit in statsmodels] using a stacked autoencoder (28 -> 15 -> 5) and fed the compressed data (5 columns) into a neural network with two hidden layers, each having 10 nodes and a 'relu' activation function. I trained the model over 100 epochs (the AUC metric didn't go beyond 0.500, and the training loss became constant after a few epochs). The model predicted all the records of the test set as non-fraudulent (class 0) and yielded a confusion matrix like this:
Confusion matrix:
[[70999     0]
 [  115     0]]
Accuracy score: 0.9983828781955733
Can someone please explain the problem behind this result and suggest a feasible solution?
Since your accuracy is over 99% with an all-zero-class prediction, the percentage of fraud cases in your training set is less than 1%.
Typically, if fraud cases are that rare, the model will not place enough importance on them to predict well.
To fix this, you can add misclassification costs or class weights that down-weight the majority class (equivalently, give the rare fraud class more importance), or use class-balancing methods such as SMOTE; a sketch of both follows.
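A minimal sketch of the two fixes, assuming a compiled Keras model named model and arrays X_train / y_train with 0 = non-fraud and 1 = fraud (these names are placeholders, not from the original post):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Option A: weight the rare fraud class more heavily during training.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
model.fit(X_train, y_train, epochs=100, class_weight=dict(zip(classes, weights)))

# Option B: oversample the minority class before training (needs the imbalanced-learn package).
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE().fit_resample(X_train, y_train)
model.fit(X_res, y_res, epochs=100)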

Tensorflow Beginner, Basic Question On Linear Model

https://www.tensorflow.org/tutorials/estimator/linear
I am following the TensorFlow documentation to implement a linear classifier, but I'd like to use my own data instead of the tutorial dataset. I just have a few general questions.
My dataset is as follows. It's not a time series.
row[0] - float (changed to binary, 0 = negative, 1 = positive) VALUE TO ESTIMATE
row[1] - string (categorical, changed to vocabulary, ints 1,2,3,4,5,6,7,8,9)
row[2-19] - float (positive and negative)
row[20-60] - ints (percentile ranks, ints 10,20,30,40,50,60,70,80,90)
row[61-95] - ints (binary 1, 0)
I started by using 50k (45k training) rows of data and num_epochs=100, batch_size=256.
{'accuracy': 0.8912, 'accuracy_baseline': 0.8932, 'auc': 0.7101819, 'auc_precision_recall': 0.2830853, 'average_loss': 0.30982444, 'label/mean': 0.1068, 'loss': 0.31013006, 'precision': 0.4537037, 'prediction/mean': 0.11840516, 'recall': 0.0917603, 'global_step': 17600}
Does the column I want to estimate need to be a column of binaries for this model?
Is it a bad idea to mix data types like this? Would it be necessary to normalize the data using something like preprocessing.Normalization?
Should I alter the epochs/batch if I want to use more data?
The accuracy seems high but the loss also seems quite high, why is that?
Any other suggestions?
Thanks for any help or advice.
Here are the answers to your questions.
By default, tf.estimator.LinearClassifier assumes a binary classification problem with n_classes=2, but you can have more than 2 classes as well.
For a linear classifier, normalizing the data won't affect accuracy much, compared to the change you would see for a non-linear classifier after normalizing the same data.
Observe the change in accuracy and loss; if they do not change much for about 5-10 epochs, you can stop at that number of epochs. You can then repeat the same step while varying the batch size.
Accuracy and loss are not directly dependent on each other. Consider your case of classifying 0 and 1: a model that always predicts 0.51 for the true class has the same accuracy as one that predicts 0.99, but a much higher loss. The best model has high accuracy and low loss; if your model gives good accuracy but high loss, that means it made huge errors on a few examples.
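A quick numeric illustration of this point, using the natural-log cross-entropy of a single correct prediction:

import numpy as np

p_barely = 0.51      # probability assigned to the true class by a hesitant model
p_confident = 0.99   # probability assigned to the true class by a confident model

print(-np.log(p_barely))     # ~0.673: counted as correct, but the loss is high
print(-np.log(p_confident))  # ~0.010: same accuracy contribution, much lower loss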
Tuning your model's hyperparameters based on several observations and feeding quality data with some preprocessing is always the best way to reach high accuracy and low loss, and additional data is always good to have.
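As a sketch of how mixed feature types can be declared for this estimator (the column names 'category' and 'f2'..'f19' and the divide-by-100 normalizer are made-up placeholders, not taken from the question):

import tensorflow as tf

# Categorical column with a small integer vocabulary (like row[1] in the question).
category_col = tf.feature_column.categorical_column_with_vocabulary_list(
    'category', vocabulary_list=[1, 2, 3, 4, 5, 6, 7, 8, 9])

# Numeric columns; normalizer_fn can apply a simple per-feature rescaling if desired.
float_cols = [tf.feature_column.numeric_column('f%d' % i, normalizer_fn=lambda x: x / 100.0)
              for i in range(2, 20)]

linear_est = tf.estimator.LinearClassifier(
    feature_columns=[category_col] + float_cols,
    n_classes=2)  # binary target: 0 = negative, 1 = positive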

Wasserstein loss can be negative?

I'm currently training a WGAN in keras with (approx) Wasserstein loss as below:
from keras import backend as K

def wasserstein_loss(y_true, y_pred):
    return K.mean(y_true * y_pred)
However, this loss can obviously be negative, which is weird to me.
I trained the WGAN for 200 epochs and got the critic Wasserstein loss training curve below.
The above loss is calculated by
d_loss_valid = critic.train_on_batch(real, np.ones((batch_size, 1)))
d_loss_fake = critic.train_on_batch(fake, -np.ones((batch_size, 1)))
d_loss, _ = 0.5*np.add(d_loss_valid, d_loss_fake)
The resulting generated sample quality is great, so I think I trained the WGAN correctly. However, I still cannot understand why the Wasserstein loss can be negative while the model still works. According to the original WGAN paper, the Wasserstein loss can be used as a performance indicator for the GAN, so how should we interpret it? Am I misunderstanding anything?
The Wasserstein loss is a measure of the Earth Mover's distance, which is a distance between two probability distributions. In TensorFlow it is implemented as d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real), which can obviously give a negative number if d_fake moves too far to the other side of the d_real distribution. You can see it on your plot, where during training the real and fake distributions change sides until they converge around zero. So as a performance measure you can use it to see how far the generator is from the real data and on which side it is now.
See the distributions plot:
P.S. it's crossentropy loss, not Wasserstein.
Perhaps this article can help you more, if you haven't read it yet. However, the other question is how the optimizer can minimize the negative loss (towards zero).
It looks like I cannot comment on the answer given by Sergeiy Isakov because I do not have enough reputation. I wanted to comment because I think that information is not correct.
In principle, the Wasserstein distance cannot be negative, because a distance metric cannot be negative. The actual expression (dual form) for the Wasserstein distance involves a supremum over all 1-Lipschitz functions (you can look it up on the web). Since it is a supremum, we always take the Lipschitz function that gives the largest value when computing the Wasserstein distance. However, the Wasserstein value we compute using a WGAN is just an estimate and not the real Wasserstein distance. If the number of inner critic iterations is low, the critic may not have enough iterations to move to a positive value.
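For reference, the dual (Kantorovich-Rubinstein) form referred to above, as used in the WGAN paper, is:

W(P_r, P_g) = \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{x \sim P_g}[f(x)]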
Thought experiment: if we suppose that we obtain a Wasserstein estimate that is negative, we can always negate the critic function to make the estimate positive. That means there exists a Lipschitz function that gives a positive value larger than the Lipschitz function that gave the negative value. So Wasserstein estimates cannot be negative, as by definition we need to take the supremum over all 1-Lipschitz functions.

sigmoid output for detection class returns incorrect performance

Summary of my problem: I have a detection task (binary classification with unbalanced classes). I use a sigmoid to classify samples. The reported f-score, precision and recall seem to consider both classes; e.g., the true positives seem to be the total number of correctly classified samples, not the number of samples belonging to class '1' that are correctly classified.
Longer explanation: In my experiment I have demographic data about persons and I have to predict whether they bought a product or not. I used PCA to reduce the initial features to just 4 features, and the data is stored in a CSV file (the first column has the class labels, '0' and '1'). Note that most people didn't buy, so the two classes are very unbalanced. I use the CSVDataset class to read it:
dataset: &train !obj:pylearn2.datasets.csv_dataset.CSVDataset {
    path: 'input.csv',
    task: 'classification'
}
I want to start with a simple classification model, and I use the f-score as the performance measure. Therefore, my first idea was to use an MLP model with a single sigmoid layer (the default 'detection' monitor provides recall, precision and f-score):
model: !obj:pylearn2.models.mlp.MLP {
    layers: [
        !obj:pylearn2.models.mlp.Sigmoid {
            layer_name: 'y',
            dim: 2,
            irange: .005
        }
    ],
    nvis: 4,
}
My initial idea was to set dim to 1 (the decision rule would be: if the output > 0.5 choose class '1', if < 0.5 choose class '0'). However, I got the error ValueError: Can't convert to VectorSpace of dim 1. Expected either dim=2 (merged one-hots) or 2 (concatenated one-hots), and then I decided to set dim to 2 (the decision rule would be: if out1 > out0 choose '1', otherwise choose '0').
In my train.yaml I follow, more or less, the softmax example notebook provided in the documentation. For example, I use the BGD algorithm and set the batch_size to the total number of examples in the training set (74164 examples, a small dataset!), just to avoid confusion when checking the performance manually.
The model was trained with the provided train.py script and everything seemed fine, until I had a look at the results. As mentioned earlier, it is a detection problem where the class to detect ('1') occurs very rarely. Therefore, I was very surprised to see high values for the reported train_y_f1 (the best result is approx. 94%, after one epoch).
To check this, I computed the f-score manually using the provided predict_csv.py script and then loading the predictions. I saw that in fact there were only misses (all '1' samples were classified as '0'), so precision, recall and f-score should all be zero. Why does the detection monitor report higher values?
After some investigation, I found that the MLP has an output for each class, and I verified (computed it manually and got the same numbers) that the true positives and false positives defined in get_detection_channels_from_state() actually refer to both classes, '1' and '0'; e.g., the true positives are the number of vectors belonging to '1' classified as '1' plus the number of vectors belonging to '0' classified as '0'. So the MLP is classifying everything as '0', and since nearly all vectors belong to '0', the performance looks good. This is a known issue for unbalanced detection problems, where the correct-classification rate is not a suitable measure, and it is the reason why we have measures such as the f-score or AUC. However, if tp and fp in get_detection_channels_from_state() consider both classes, then the reported f-score is not useful (not to me, at least).
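To make this concrete, here is a small sketch (not pylearn2 code; the numbers are made up) of the difference between counting only class '1' as positive and counting hits on both classes:

import numpy as np

y_true = np.array([0] * 95 + [1] * 5)   # rare positive class, as in the question
y_pred = np.zeros(100, dtype=int)       # a model that classifies everything as '0'

tp = np.sum((y_pred == 1) & (y_true == 1))   # 0
fp = np.sum((y_pred == 1) & (y_true == 0))   # 0
fn = np.sum((y_pred == 0) & (y_true == 1))   # 5
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f1)                                    # 0.0 when only class '1' counts as positive

hits = np.sum(y_pred == y_true)              # 95: counting correct predictions on either class
print(hits / len(y_true))                    # 0.95, which just mirrors the overall accuracy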
I can imagine that this is known to the designer of the Sigmoid class, so I can only assume that I am doing something wrong. Hopefully somebody can give me a hint :)
Note: I have submitted this question to the pylearn2 user mailing list. If I get an answer I will copy it here...
The pylearn2 monitor calculates the f1 score, % misclass, etc., for each batch, not for the entire epoch. When it generates the report, the f1 score is the mean of the f1 scores for all the batches in the epoch. Reporting the mean over all the batches works just fine when you look at quantities like misclass:
misclass[n] is the score for the nth batch
misclass_epoch = mean(misclass[0], misclass[1], ..., misclass[n])
However, you can't construct the same statement for the f1 score:
f1_epoch != mean(f1[0], f1[1], ..., f1[n])
where f1[n] = 2 * precision[n] * recall[n] / (precision[n] + recall[n])
For demonstration purposes, try setting the batch size to be the size of the data set (you can get away with this in the mnist example). The f1 score will then be correct.
So the best advice is to keep an eye on quantities in the monitor like misclass, where the mean over the batches is the same as the value for the epoch. Once you've trained the NN, you can then make predictions for your entire validation set and calculate the f1 score at that point.
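A small numeric sketch (made-up batches, using scikit-learn rather than pylearn2) of why the mean of per-batch f1 scores differs from the f1 computed over the whole epoch:

import numpy as np
from sklearn.metrics import f1_score

# Two toy "batches": the model is perfect on the first and misses every positive in the second.
y_true_b1 = np.array([1, 0])
y_pred_b1 = np.array([1, 0])
y_true_b2 = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_b2 = np.zeros(8, dtype=int)

per_batch = [f1_score(y_true_b1, y_pred_b1, zero_division=0),
             f1_score(y_true_b2, y_pred_b2, zero_division=0)]
print(np.mean(per_batch))          # 0.5: mean of the per-batch f1 scores

y_true = np.concatenate([y_true_b1, y_true_b2])
y_pred = np.concatenate([y_pred_b1, y_pred_b2])
print(f1_score(y_true, y_pred))    # ~0.333: f1 computed over the whole "epoch"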
