How to evaluate my MLPClassifier model? Is confusion matrix, accuracy, classification report enough? Do i need ROC for evaluating my MLPClassifier result? And aside from that how can i plot loss for test and training set, i used loss_curve function but it only show the loss plot for training set.
Ps. I'm dealing with multi-class classification problem.
This is a very open question and with no code, so I will answer you with what I think is best. Usually for multi-label classification problem it is standard to use accuracy as a measure to track training. Another good measure is called f1-score. Sklearn's classification_report is a very good method to track training.
Confusion matrices come after you train the model. They are used to check where the model is failing by evaluating which classes are harder to predict.
ROC curves are, usually, for binary classification problems. They can be adapted to multi-class by doing a one class vs the rest approach.
For the losses, it seems to me you might be confusing things. Training takes place over epochs, testing does not. If you train over 100 epochs, then you have 100 values for the loss to plot. Testing does not use epochs, at most it uses batches, therefore plotting the loss does not make sense. If instead you are talking about validation data, then yes you can plot the loss just like with the training data.
Related
I'm working on a multimodal classifier (text + image) using pytorch (only 2 classes).
Since I don't have a lot of data, i've decided to use StratifiedKFold to avoid overfitting.
I noticed a strange behavior on training/testing curves.
My training accuracy quickly converges forward a unique value for few epochs before evolving again.
With these results I directly thought of overfitting, .67 being the maximum accuracy of the model.
With the rest of the data separated by the KFold, I tested my model in evaluation mode.
I've been quite surprised since test accuracy follows (quite exactly) the training accuracy while the loss (CrossEntropyLoss) still evolves.
Note : changing the batch size only make growing of accuracy delays or brings closer the moment the loss evolves.
Any ideas about this behaviour ?
I have training sample X_train, and Y_train to train and X_estimated.
I got task to make my classificator learn as accurate as it can, and then predict vector of results over X_estimated to get close results to Y_estimated (which i have now, and I have to be as much precise as it can). If I split my training data to like 75/25 to train and test it, I can get accuracy using sklearn.metrics.accuracy_score and confusion matrix. But I am losing that 25% of samples, that would make my predictions more accurate.
Is there any way, I could learn by using 100% of the data, and still be able to see accuracy score (or percentage), so I can predict it many times, and save best (%) result?
I am using random forest with 500 estimators, and usually get like 90% accuracy. I want to save best prediction vector as possible for my task, without splitting any data (not wasting anything), but still be able to calculate accuracy (so I can save best prediction vector) from multiple attempts (random forest always shows different results)
Thank you
Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and be able to get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but i guess the simple answer to your question would be no.
As per your requirement, you can try K Fold Cross Validation. If you split it in 90|10 i.e for Train|Test. Achieving to take 100% data for training is not possible as you have to test the data then only you can validate the same that how good your model is. K Fold CV takes your whole train data into consideration in each fold and randomly takes test data sample from the train data. And lastly calculates the accuracy by taking summation of all the folds. Then finally you can test the accuracy by using 10% of the data.
More you can read here and here
K Fold Cross Validation
Skearn provides simple methods for performing K fold cross validation. Simply you have to pass no of folds in the method. But then remember, more the folds, it takes more time to train the model. More you can check here
It is not necessary to do 75|25 split of your data all the time. 75
|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.
In general splitting your data set is critical for evaluation. So I would recommend you always do that.
Said that, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance or to estimate the generalization accuracy.
One particularly prominent method is leveraging out-of-bag samples of models based on bootstrapping, i.e. RandomForests.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)
if you are doing classification always go with stratified k-fold cv(https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
if you're doing regression then go with simple k-fold cv or you can divide the target as bins and do stratified k-fold cv. by this way you can use your data completely in model training.
I am trying to do binary classification, and the one class (0) is approximately 1 third of the other class (1). when I run the raw data through a normal feed forward neural network, the accuracy is about 0.78. However, when I implement class_weights, the accuracy drops to about 0.49. The roc curve also seems to do better without the class_weights. Why does this happen, and how can i fix it?
II have already tried changing the model, and implementing regularization, and dropouts, etc. But nothing seems to change the overall accuracy
this is how i get my weights:
class_weights = class_weight.compute_class_weight('balanced', np.unique(y_train), y_train)
class_weight_dict = dict(enumerate(class_weights))
Here is the results without the weights:
Here is with the weights:
I would expect the results to be better with the class_weights but the opposite seems to be true. Even the roc does not seem to do any better with the weights.
Due to the class imbalance a very weak baseline of always selecting the majority class will get accuracy of approximately 75%.
The validation curve of the network that was trained without class weights appears to show that it is picking a solution close to always selecting the majority class. This can be seen from the network not improving much over the validation accuracy it gets in the 1st epoch.
I would recommend looking into the confusion matrix, precision and recall metrics to get more information about which model is better.
This answer seems too late, but I hope it is helpful anyway. I just want to add four points:
Since the proportion of your data is minority: 25% and majority: 75%, accuracy is computed as:
accuracy = True positive + true negative / (true positive + true negative + false positive + false negative)
Thus, if you look at the accuracy as a metric, most likely any models could achieve around 75% accuracy by simply predicting the majority class all the time. That's why on the validation set, the model was not able to predict correctly.
While with class weights, the learning curve was not smooth but the model actually started to learn and it failed from time to time on the validation set.
As it was already stated, perhaps changing metrics such as F1 score would help. I saw that you are implementing tensorflow, tensorflow has metric F1 score on their Addons, you can find it on their documentation here. For me, I looked at the classfication report in scikit learn, let's say you want to see the model's performance on the validation set (X_val, y_val):
from sklearn.metrics import classification_report
y_predict = model.predict(X_val, batch_size=64, verbose=1
print(classification_report(y_val, y_predict))
Other techniques you might want to try such as implementing upsampling and downsampling at the same time can help, or SMOTE.
Best of luck!
I have been comparing different regression models from sklearn, On doing so I was confused with the model's score value that i got.
Below in the code you can see that i have used both Linear Regression and Ridge Regression but the difference in score values for the training and test data set vary by a lot.
using Linear Regression
from sklearn.linear_model import LinearRegression as lr
model = lr()
model.fit(X_train, y_train)
model.predict(X_test)
print("LINEAR REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
------------------------------------------------------------
O/P
LINEAR REGRESSION
Training Score 0.7147120015665793
Test Score 0.4242120003778227
Using Ridge Regression
from sklearn.linear_model import Ridge as r
model = r(alpha = 20).fit(X_train, y_train)
model.predict(X_test)
print("RIDGE REGRESSION")
print("Training Score", end = "\t")
print(model.score(X_train, y_train))
print("Test Score", end = "\t")
print(model.score(X_test, y_test))
-----------------------------------------------------------
O/P
RIDGE REGRESSION
Training Score 0.4991610348613835
Test Score 0.32642156452579363
My question is, Does a smaller difference between the score values of the training and test dataset mean that my model is Generalized and fitting equally for both the test and Train Data ( not overfitting ) or does it mean something else.
If it does mean something else please do explain.
And How does the "alpha" value affect the ridge regression model?
I am a beginner so please do explain anything as simple as possible.
Thank You.
Maybe you can add a separate validation set to you model.fit or you set the validation_split paramter like in the keras docs of the fit method, I don't know if there is something like that in sklearn kit.
But in general the scores for the validation-set or test-st and the training set should be nearly equal, otherwise the model tends to overfitting.
Also there are a bunch of metrics you can use to evaluate your model. I would recommend the book Oreilly Deep Learning Page 39. There is a really good explanation.
Or have a look here and here.
Or look here chapter 5.2.
Feel free to ask other questions.
Expanding on Max's answer, overfitting is a modelling error when the trained model models the trained data too well. Now this usually occurs when the model is complex enough (high VC dimension) that it learns very intricate details and noises that it will negatively impact the final performance. VC Dimension Caltech Lecture on VC Overfitting A simple way to observe overfitting is to look at the difference between the training and test results.
Going back to your example, the score difference between test and training data for linear regression is 0.290. While the difference for ridge regression is 0.179. From this single experiment alone, it is difficult to judge whether or not a model is overfitting since usually in practice there will always be some difference. But here, we can say that ridge regression tends to less overfit for this dataset.
Now when making a decision on which model to pick, we have to also consider other factors beside overfitting itself. In this case Linear Regression tends to perform 10% higher on the test dataset as compared to Ridge Regression, so you have to take that into account as well. Perhaps conducting further experiment using different validation techniques and fine tune different hyper-parameters should be the next steps.
Can someone explain how to use the oob_decision_function_ attribute for the python SciKit Random Forest Classifier? I want to use it to plot learning curves comparing training and validation error against different training set sizes in order to identify overfitting and other problems. Can't seem to find any information about how to do this.
You can pass in a custom scoring function into any of the scoring parameters in the model evaluation fields, it needs to have the signiture classifier, X, y_true -> score.
For your case you could use something like
from sklearn.learning_curve import learning_curve
learning_curve(r, X, y, cv=3, scoring=lambda c,x,y: c.oob_score_)
This will compute 3-fold cross validated oob scores against different training set sizes. Btw I don't think you should get overfitting with random forests, that's one of the benefits of them.