I'm using TensorFlow 1.10 and training on a subset of a retinopathy dataset. The problem is that the model always predicts the class that occurs most often, which in my case is class 0. I did a little digging and came across something called under-sampling. When I left out all the class 0 samples (just to see what happens), the model only predicted class 2, which apparently has the highest frequency after class 0.
Here is the code for optimization:
def data_pipe_line(data,checkpoint_path,i_data=None,epoch=5):
    place_X=tf.placeholder(tf.float32,[None,500,400,3],name='p1')
    place_Y=tf.placeholder(tf.int32,[None],name='p2')
    infer_data=tf.data.Dataset.from_tensor_slices((place_X,place_Y))
    infer_data=infer_data.batch(100)
    iterator=tf.data.Iterator.from_structure(data.output_types,data.output_shapes)
    next_image,next_label=iterator.get_next()
    Y=tf.one_hot(next_label,5)
    Y=tf.cast(Y,tf.float32)
    logits=model(next_image,0.7)
    print(logits)
    print(Y)
    train_iterator=iterator.make_initializer(data,name='train_op')
    inference_iterator_op=iterator.make_initializer(infer_data,name='inference_op')
    with tf.name_scope("loss"):
        loss=tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y,logits=logits),name='cost')
    #the learning rate is so low because the batch-size is very small and has a lot of noise
    optimizer=tf.train.AdamOptimizer(learning_rate=0.0005).minimize(loss)
    #getting the accuracy
    prediction=tf.argmax(logits,1,name='pred')
    equality=tf.equal(prediction,tf.argmax(Y,1))
    accuracy=tf.reduce_mean(tf.cast(equality,tf.float32))
    init_op=tf.global_variables_initializer()
    tf.summary.scalar("loss",loss)
    tf.summary.scalar("accuracy",accuracy)
    merged=tf.summary.merge_all()
    saver=tf.train.Saver()
    j=0
    with tf.Session() as sess:
        writer=tf.summary.FileWriter("./nn_logs",sess.graph)
        sess.run(init_op)
        for _ in range(epoch):
            sess.run(train_iterator)
            while True:
                try:
                    #print(sess.run(logits))
                    j=j+1
                    summary = sess.run(merged)
                    _,acc,l=sess.run([optimizer,accuracy,loss])
                    if(j%20==0 or j==1):
                        print("iters: {}, loss: {:.10f}, training accuracy: {:.2f}".format(j, l, acc*100))
                    writer.add_summary(summary,j)
                except tf.errors.OutOfRangeError:
                    break
        saver.save(sess,checkpoint_path)
The model trains well at first: the loss goes down for some time and then just fluctuates (within a range of 5). The accuracy of course fluctuates a lot, since the model is only predicting one class.
From what you describe, I conclude that your dataset is highly imbalanced and that the model you trained did not generalize to your data at all. The model predicts the class with the most samples because, as a blind guess, that label has the best chance of being correct. One approach to solving this is to balance your data.
Under Sampling
One way to do that is the technique called under-sampling, as you have found out. In your training set, you find the class with the fewest samples, and then you pass that same number of samples from each class to the training phase. The selection is done at random. The test set remains intact.
Over Sampling
Here you go the opposite way: take the class with the most samples and bring every other class up to that same number of samples. The simplest way to do that is to repeat samples chosen at random. Another way would be to augment your data.
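Both resampling strategies can be sketched in a few lines of plain Python. This is only an illustration: the function name `resample_balanced`, the toy class counts and the integer stand-ins for images are all made up here, and in practice you would apply it to the training set only.

```python
import random
from collections import defaultdict

def resample_balanced(samples, labels, mode="under", seed=0):
    """Return a class-balanced copy of (samples, labels).

    mode="under": every class is cut down to the size of the smallest class.
    mode="over":  every class is grown to the size of the largest class by
                  repeating randomly chosen samples of that class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    sizes = [len(xs) for xs in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    pairs = []
    for y, xs in by_class.items():
        if mode == "under":
            chosen = rng.sample(xs, target)  # random subset, no repeats
        else:
            # keep every original sample, then add random repeats
            chosen = xs + rng.choices(xs, k=target - len(xs))
        pairs.extend((x, y) for x in chosen)

    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]

# Hypothetical training set in which class 0 dominates.
labels = [0] * 70 + [1] * 10 + [2] * 20
samples = list(range(len(labels)))  # stand-ins for images

_, under_labels = resample_balanced(samples, labels, mode="under")
_, over_labels = resample_balanced(samples, labels, mode="over")
print({c: under_labels.count(c) for c in (0, 1, 2)})
print({c: over_labels.count(c) for c in (0, 1, 2)})
```

Under-sampling throws data away, while over-sampling by repetition shows the model the same minority images many times, which is why augmentation is often combined with it.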
You can also weight the loss function. G. Anderson already linked a good question about this.
The intuition behind weighting the loss
When the classes are imbalanced, the model learns that it is barely penalized for getting the label wrong on a class with few samples. Because there are fewer samples of that class, the mistake occurs less often, and when it does occur it has the same impact as a mistake on a class with more samples. What the weighted loss does is give more importance to a mistake on a class with few samples and less importance to a mistake on a class with many samples.
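To make the intuition concrete, here is a minimal sketch of a class-weighted cross-entropy in plain Python. The weighting rule used (number of samples divided by number of classes times the class count) is one common heuristic and an assumption here, not something prescribed above; the 8-to-2 toy labels and the constant prediction are invented for illustration.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c): rare classes get large weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Mean over the batch of -weights[y] * log(p[y])."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / len(labels)

# Toy batch: 8 samples of class 0 and 2 samples of class 1.
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
uniform = {0: 1.0, 1: 1.0}

# A model that puts 90% of the probability on class 0 for every input.
probs = [[0.9, 0.1]] * len(labels)

unweighted = weighted_cross_entropy(probs, labels, uniform)
weighted = weighted_cross_entropy(probs, labels, weights)
print(weights)
print(unweighted, weighted)
```

With the weights applied, the "always predict the majority class" behaviour is penalized more heavily, so the optimizer has more incentive to move away from it.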
Answering the question in the commentary
Remember: the loss function is what drives the correction of the learned model weights. When you add another dense layer to your model, you are changing the model in a way that affects the learning step, but the loss itself stays the same. So your model will still assign the same level of importance to mistakes in every class.
Related
I am working on a multi-class classification problem with some class imbalance (100 classes, a handful of which have only 1 or 2 samples associated with them).
I have been able to get a LinearSVC (& CalibratedClassifierCV) model to achieve ~98% accuracy, which is great.
The problem is that for all of the misclassified predictions - the business will incur a monetary loss. That is, for each misclassification - we would incur a $1,000 loss. A solution to this would be to classify a datapoint as "Unknown" instead of a complete misclassification (these unknowns could then be human-classified which would cost roughly $10 per "Unknown" prediction). Clearly, this is cheaper than the $1,000/misclassification loss.
Any suggestions for how I would go about incorporating this "Unknown" class?
I currently have:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

svm = LinearSVC()
clf = CalibratedClassifierCV(svm, cv=3)
# fit model
clf.fit(X_train, y_train)
# get probabilities for each decision
decision_probabilities = clf.predict_proba(X_test)
# get the confidence for the highest class:
confidence = np.max(decision_probabilities, axis=1)
I was planning to use the predict_proba method of the CalibratedClassifierCV model, and for any sample whose maximum probability falls under a threshold (yet to be determined) I would classify it as "Unknown" rather than as the class that probability belongs to.
The problem is that when I checked the correct predictions, some have confidence values as low as 30%. Similarly, there are incorrect predictions with confidence values as high as 95%. If I were to just set a threshold of, say, 50%, my accuracy would go down significantly, I would have quite a bit of "Unknown" predictions (a loss), and still a few misclassifications (an even bigger loss).
Is there a way to incorporate another loss function on this back-end classification (predicted class vs 'unknown' class)?
Any help would be greatly appreciated!
A few suggestions right off the bat:
Accuracy is not the right metric for evaluating models on imbalanced datasets. For example, if 90% of the samples belong to one class, 90% accuracy is achieved by a dumb model that always predicts the majority class. Precision and recall are generally better metrics in such cases. Choosing between the two is generally a business decision.
Given the input signals, it may be difficult to do better than 98%, especially since for some classes you have too few samples. What you can do is group the minority classes together and give them a single label, e.g. 'other'. This way, the model will hopefully have enough samples to learn that these samples differ from all the other classes, and will classify them as 'other'.
When you try to replace a manual business process with ML, you generally do not remove human intervention completely. The goal is to use the model on the cases, classes, or parts of the input space where it does well and to use the manual process for the rest. One way to do this is the 'other' label: once your model has predicted 'other', a human classifies those samples manually. Another method is to find a threshold on the predicted probability above which the model has high accuracy and sufficient population coverage. For example, let's say you have 100% (typically 90-100%) accuracy whenever the output probability is above 0.70. If this covers enough of the input population, you use the ML model only on such cases. For everything else, the manual process is followed.
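One way to pick such a threshold is to score each candidate by the money it would cost on a held-out validation set, using the $1,000 per misclassification and $10 per "Unknown" figures from the question. The sketch below is only that, a sketch: the function names, the confidence values and the candidate thresholds are invented for illustration.

```python
def total_cost(confidences, is_correct, threshold, cost_wrong=1000, cost_unknown=10):
    """Cost of routing every prediction below `threshold` to a human."""
    cost = 0
    for conf, ok in zip(confidences, is_correct):
        if conf < threshold:
            cost += cost_unknown   # sent to manual review as "Unknown"
        elif not ok:
            cost += cost_wrong     # model answered, and answered wrongly
    return cost

def best_threshold(confidences, is_correct, candidates):
    """Candidate threshold with the lowest total cost on the given data."""
    return min(candidates, key=lambda t: total_cost(confidences, is_correct, t))

# Hypothetical validation results: top-class probability, and whether the
# predicted class was right. Note the confident mistake at 0.95 and the
# correct prediction at 0.30, as described in the question.
confidences = [0.99, 0.95, 0.90, 0.85, 0.60, 0.55, 0.40, 0.30]
is_correct = [True, False, True, True, True, False, False, True]
candidates = [0.0, 0.5, 0.7, 0.96]

for t in candidates:
    print(t, total_cost(confidences, is_correct, t))
print("cheapest threshold:", best_threshold(confidences, is_correct, candidates))
```

Because a misclassification costs 100 times as much as a manual review, the cheapest threshold can be quite high even though it lowers the share of cases the model handles on its own. The threshold should be chosen on data the model was not trained on.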
I am training a GAN, and the accuracy doesn't change over the epochs while the loss is decreasing.
Is something wrong, or is this normal because it's a GAN?
Thank you in advance.
In order to fully answer this question (specific to your case) we'd need to know what loss function you are using and how you measure accuracy.
In general, this can certainly happen for a variety of reasons. The easiest reason to illustrate is with a simple classifier. Suppose you have a 2-class classification problem (for simplicity) and an input $x$ and label (1, 0), i.e. the label says it belongs to class 1 and not class 2. When you feed $x$ through your network you get an output: $y=(p_1, p_2)$. If $p_1 > p_2$ then the prediction is correct (i.e. you chose the right class). The loss function can continue to go down until $p_1=1$ and $p_2=0$ (the target). So, you can have lots of correct predictions (high accuracy) but still have room to improve the output to better match the labels (room for improved loss).
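A small numeric illustration of the argument above, in plain Python. The three outputs are made-up values of $(p_1, p_2)$ for the same input at three points during training, with label (1, 0); the loss is assumed to be cross-entropy.

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy for one sample, given the probability of its true class."""
    return -math.log(p_true_class)

# Hypothetical network outputs (p_1, p_2) for one input whose label is (1, 0).
outputs = [(0.55, 0.45), (0.70, 0.30), (0.95, 0.05)]

losses = [cross_entropy(p1) for p1, _ in outputs]
correct_flags = [p1 > p2 for p1, p2 in outputs]

for (p1, p2), loss, ok in zip(outputs, losses, correct_flags):
    print("p1={:.2f} p2={:.2f} correct={} loss={:.4f}".format(p1, p2, ok, loss))
```

The prediction is correct at every step, so this sample contributes the same amount to accuracy throughout, yet its loss keeps shrinking as $p_1$ approaches 1.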
I am trying to do binary classification, and one class (0) has approximately one third as many samples as the other class (1). When I run the raw data through a normal feed-forward neural network, the accuracy is about 0.78. However, when I add class_weights, the accuracy drops to about 0.49. The ROC curve also seems to be better without the class_weights. Why does this happen, and how can I fix it?
I have already tried changing the model and adding regularization, dropout, etc., but nothing seems to change the overall accuracy.
This is how I get my weights:
import numpy as np
from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(enumerate(class_weights))
Here are the results without the weights:
Here are the results with the weights:
I would expect the results to be better with the class_weights, but the opposite seems to be true. Even the ROC curve does not seem to do any better with the weights.
Due to the class imbalance a very weak baseline of always selecting the majority class will get accuracy of approximately 75%.
The validation curve of the network that was trained without class weights appears to show that it is picking a solution close to always selecting the majority class. This can be seen from the network not improving much over the validation accuracy it gets in the 1st epoch.
I would recommend looking into the confusion matrix, precision and recall metrics to get more information about which model is better.
This answer seems too late, but I hope it is helpful anyway. I just want to add four points:
Your data is split roughly 25% minority to 75% majority, and accuracy is computed as:
accuracy = (true positive + true negative) / (true positive + true negative + false positive + false negative)
Thus, if you look at accuracy as the metric, almost any model can achieve around 75% accuracy simply by predicting the majority class all the time. That is why the model without class weights was not really learning to predict correctly on the validation set.
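The 75% baseline can be checked by plugging the confusion-matrix counts into the formula above. This sketch assumes a validation set of 100 samples with the 25/75 split mentioned above, treats the majority class (1) as "positive", and uses a hypothetical model that always predicts class 1.

```python
def accuracy(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Always predicting the majority class (1) on 75 positives and 25 negatives:
tp = 75  # majority samples, all predicted as 1
fn = 0   # no majority sample is ever predicted as 0
fp = 25  # every minority sample is wrongly predicted as 1
tn = 0   # no minority sample is ever predicted as 0

# Recall of the minority class: the share of class 0 samples that were found.
minority_recall = tn / (tn + fp)

print("accuracy:", accuracy(tp, tn, fp, fn))
print("minority recall:", minority_recall)
```

The accuracy looks respectable while the minority class is never detected, which is exactly what per-class precision and recall expose.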
With class weights, the learning curve was not smooth, but the model actually started to learn, and it failed from time to time on the validation set.
As already stated, switching to a metric such as the F1 score may help. I saw that you are using TensorFlow; it has an F1 score metric in TensorFlow Addons, which you can find in their documentation here. For my part, I looked at the classification report in scikit-learn. Let's say you want to see the model's performance on the validation set (X_val, y_val):
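from sklearn.metrics import classification_report
# model.predict returns probabilities; convert them to class labels first
# (assuming a single sigmoid output and a 0.5 cut-off)
y_prob = model.predict(X_val, batch_size=64, verbose=1)
y_predict = (y_prob > 0.5).astype(int).ravel()
print(classification_report(y_val, y_predict))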
Other techniques you might want to try are combining upsampling and downsampling, or SMOTE.
Best of luck!
classifier.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
classifier.fit(X_train, y_train, epochs=50, batch_size=100)
Epoch 1/50
27455/27455 [==============================] - 3s 101us/step - loss: 2.9622 - acc: 0.5374
I know I'm compiling my model in the first line and fitting it in the second. I know what an optimizer is. I'm interested in the meaning of metrics=['accuracy'] and in what exactly acc: XXX means when I compile the model.
Also, I'm getting acc: 1.000 (100%) when I train my model, but when I test it I'm getting 80% accuracy. Is my model overfitting?
Ok, let's begin from the top,
First, metrics=['accuracy']. A model can be evaluated on multiple metrics, and accuracy is one of them. Others include binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, and sparse_top_k_categorical_accuracy. These are only the built-in ones; you can also create custom metrics. To understand metrics in more detail, you need a clear understanding of loss in a neural network. As you may know, the loss function must be differentiable for backpropagation to work. This is not necessary for metrics: they are used purely for model evaluation and can therefore be functions that are not differentiable. As the Keras documentation puts it:
A metric function is similar to a loss function, except that the results from evaluating a metric are not used when training the model. You may use any of the loss functions as a metric function.
On your own, you can define a custom accuracy that is not differentiable but expresses what you actually need from your model.
TL;DR: Metrics are like loss functions, except that they are not used in backpropagation, only for model evaluation.
Now,
acc: XXX might just mean that the first minibatch has not finished propagating yet, so no accuracy score is available. I have not paid much attention to it, but it usually stays there for a few seconds, so this is speculation on my part.
Finally, the 20% drop in performance outside of training: yes, this can be a case of overfitting. No one can know for sure without looking at your dataset, but it most probably is, and you may need to look at the data the model performs badly on to find the cause.
If something is unclear, doesn't make sense, feel free to comment.
Having 100% accuracy on the training set and 80% accuracy on the test set doesn't by itself mean that your model overfits. Moreover, it almost surely doesn't overfit if your model has many more effective parameters than the number of training samples [2], [5] (for an example of an insanely large model, see [1]). This contradicts conventional statistical learning theory, but these are empirical results.
For models with number of parameters greater than number of samples, it's better to continue to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and even if the validation loss increases [3]. This may hold even regardless of batch size [4].
Clarifications (edit)
The "models" I was referring to are neural networks with two or more hidden layers (could be also convolutional layers prior to dense layers).
[1] is cited to show a clear contradiction to classical statistical learning theory, which says that large models may overfit without some form of regularization.
I would invite anyone who disagrees with "almost surely doesn't overfit" to provide a reproducible example where models (say, for MNIST/CIFAR etc.) with a few hundred thousand parameters do overfit, in the sense of a test error curve that increases with iterations.
[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017.
[2] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv preprint arXiv:1706.10239, 2017.
[3] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
[4] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
[5] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Starting off with the first part of your question -
Keras defines a Metric as "a function that is used to judge the performance of your model". In this case you are using accuracy as the function to judge how good your model is. (This is the norm)
For the second part of your question - acc is the accuracy of your model at that Epoch. This can, and will change depending on which metrics were defined in the model.
Finally, given what you have told us, it is possible that you have ended up with an overfit model, but there are simple solutions.
So the meaning of metrics=['accuracy'] actually depends on which loss function you use. You can see how Keras handles this from line 375 and down. Since you are using categorical_crossentropy, your case falls through to the final else branch (the elif covers sparse_categorical_crossentropy). Hence your metric function is set to
metric_fn = metrics_module.categorical_accuracy
See this post for a description of the logic behind sparse_categorical_accuracy; categorical_accuracy follows the same logic, except that the true class is given as a one-hot vector rather than an integer index. That should clear up the meaning of "accuracy" in your case. It basically just counts how many of your predictions (the class with maximum probability) match the true class.
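The counting described above can be written out in a few lines of plain Python. This is an illustrative re-implementation for one-hot labels, not the Keras source; the labels and probabilities are invented.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def categorical_accuracy(y_true_one_hot, y_pred_probs):
    """Fraction of samples whose most probable predicted class is the true class."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true_one_hot, y_pred_probs))
    return hits / len(y_true_one_hot)

# Four hypothetical samples, three classes.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [
    [0.7, 0.2, 0.1],  # predicts class 0, correct
    [0.1, 0.8, 0.1],  # predicts class 1, correct
    [0.3, 0.3, 0.4],  # predicts class 2, correct (barely)
    [0.5, 0.4, 0.1],  # predicts class 0, wrong
]
print(categorical_accuracy(y_true, y_pred))
```

Note that the third sample counts as fully correct even though the model gives the true class only 40% probability; accuracy ignores how confident the prediction is, which is one way it differs from the loss.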
Comparing training and validation accuracy can reveal signs of overfitting. To test this, plot the training accuracy and the validation accuracy together and see at what point the validation accuracy starts to decrease. Follow this for a good description of how to plot accuracy, loss, etc. to test for overfitting.
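Even without a plot, the same check can be done on the per-epoch numbers. The sketch below uses invented accuracy histories (in Keras these would typically come from the history object returned by fit when validation data is supplied, which is an assumption about your setup) and a hypothetical helper name.

```python
def peak_validation_epoch(val_acc):
    """1-based epoch at which validation accuracy is highest."""
    return max(range(len(val_acc)), key=lambda i: val_acc[i]) + 1

# Hypothetical per-epoch accuracies.
train_acc = [0.60, 0.75, 0.85, 0.92, 0.97, 1.00]
val_acc = [0.58, 0.70, 0.78, 0.80, 0.79, 0.77]

peak = peak_validation_epoch(val_acc)
print("validation accuracy peaks at epoch", peak)
for epoch, (tr, va) in enumerate(zip(train_acc, val_acc), start=1):
    marker = "  <- validation starts dropping after this" if epoch == peak else ""
    print("epoch {}: train={:.2f} val={:.2f} gap={:.2f}{}".format(epoch, tr, va, tr - va, marker))
```

Training accuracy that keeps rising while validation accuracy falls after its peak is the usual signature to look for.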
My model throws up learning curves as I have shown below. Are these fine? I am a beginner and all across the internet I see that as training examples increase the Training score should decrease and then converge. But here the training score is increasing and then converging. Therefore I would like to know does this indicate a bug in my code / something wrong with my input?
Okay I figured out what was wrong with my code.
train_sizes , train_accuracy , cv_accuracy = lc(linear_model.LogisticRegression(solver='lbfgs',penalty='l2',multi_class='ovr'),trainData,multiclass_response_train,train_sizes=np.array([0.1,0.33,0.5,0.66,1.0]),cv=5)
I had not entered a regularization parameter for Logistic Regression.
But now,
train_sizes , train_accuracy , cv_accuracy = lc(linear_model.LogisticRegression(C=1000,solver='lbfgs',penalty='l2',multi_class='ovr'),trainData,multiclass_response_train,train_sizes=np.array([0.1,0.33,0.5,0.66,1.0]),cv=5)
The learning curve looks alright.
Can anybody tell me why this is so? i.e. with default reg term the training score increases and with lower reg it decreases?
Data details: 10 classes. Images of varying sizes. (Digit Classification - street view digits)
You need to be more precise regarding your metrics. What metrics are used here?
Loss in general means: lower is better, while Score usually means: higher is better.
This also means, that the interpretation of your plot is dependent on the used metrics during training and cross-validation.
Have a look at the related scikit-learn documentation page:
http://scikit-learn.org/stable/modules/learning_curve.html
The score is typically some measure that needs to be maximized (ROCAUC, accuracy,...). Intuitively you could expect that the more training examples you see the better your model gets and hence the higher the score is. There are however some subtleties regarding overfitting and underfitting that you should keep in mind.
Building off of Alex's answer, it looks like the default regularization parameter for your model underfits the data a bit, because when you relaxed regularization, you see 'more appropriate' learning curves. It doesn't matter how many examples you throw at a model that underfits.
As for your concern of why the training score increases in the first case rather than decreases -- it's probably a consequence of the multiclass data you're using. With fewer training examples, you have fewer numbers of images of each class (because lc tries to keep the same class distribution in each fold of the cv), so with regularization (if you call C=1 regularization, that is), it may be harder for your model to accurately guess some of the classes.