I'm using a RandomForestClassifier for a binary classification problem.
I plotted the following learning curve. Can I say that more training data would benefit this model?
[Figure: Learning curve for RandomForest Classifier]
From the training curve, it is clear accuracy will not improve by adding more instances.
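One way to check this numerically rather than by eye is to recompute the curve with scikit-learn's learning_curve and see whether the cross-validated score is still rising at the largest training sizes. The sketch below is only illustrative: the synthetic dataset, the forest settings, and the five training sizes are assumptions, not the asker's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic binary data stands in for the real dataset (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Cross-validated train/validation accuracy at five training-set sizes.
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={int(n):4d}  train={tr:.3f}  validation={va:.3f}")

# Change in validation accuracy over the last step of the curve.
last_gain = val_mean[-1] - val_mean[-2]
print(f"gain from the last increase in training size: {last_gain:+.3f}")
```

If the validation score has flattened and the last gain is within the fold-to-fold noise, more data of the same kind is unlikely to help much. If it is still climbing, more data may be worth collecting.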
I am trying to interpret these learning curves.
These seem to overfit after the 1st epoch.
I have built a model using TensorFlow and the BERT transformer.
Is there another way to interpret these, other than concluding that the optimal number of epochs is one?
[Figure: Accuracy learning curve]
[Figure: Loss learning curve]
How should I evaluate my MLPClassifier model? Are a confusion matrix, accuracy, and a classification report enough, or do I also need a ROC curve? Aside from that, how can I plot the loss for both the training and test sets? I used the loss_curve_ attribute, but it only shows the loss for the training set.
P.S. I'm dealing with a multi-class classification problem.
This is a very open question with no code, so I will answer with what I think is best. For multi-class classification problems it is standard to use accuracy as the measure to track during training. Another good measure is the F1-score, especially when the classes are imbalanced. Sklearn's classification_report is a convenient way to get precision, recall, and F1-score for every class at once.
Confusion matrices come after you train the model. They are used to check where the model is failing by evaluating which classes are harder to predict.
ROC curves are usually used for binary classification problems. They can be adapted to multi-class problems with a one-vs-rest approach, where each class in turn is treated as the positive class.
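To make the metrics above concrete, here is a minimal sketch that computes all three on a held-out split. The iris dataset, the scaling step, and the network size are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 3-class dataset used purely for illustration (an assumption).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# MLPs are sensitive to feature scale, so scale inside a pipeline.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# Per-class precision, recall and F1-score.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# One-vs-rest ROC AUC, averaged over the classes.
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")
print(f"one-vs-rest ROC AUC: {auc_ovr:.3f}")
```

Off-diagonal entries of the confusion matrix show which pairs of classes the model mixes up, which the single accuracy number hides.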
For the losses, it seems to me you might be confusing things. Training takes place over epochs, testing does not. If you train over 100 epochs, then you have 100 values for the loss to plot. Testing does not use epochs, at most it uses batches, therefore plotting the loss does not make sense. If instead you are talking about validation data, then yes you can plot the loss just like with the training data.
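For the validation case, one possible approach is to train one epoch at a time with partial_fit and compute the log loss on a held-out validation split after each epoch. This is a sketch under assumptions: the digits dataset, the 20 epochs, and the network size are arbitrary, and calling partial_fit in a loop is a substitute for fit, not what loss_curve_ does internally.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# 10-class dataset used purely for illustration (an assumption).
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

classes = np.unique(y)
clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0)

train_losses, val_losses = [], []
for epoch in range(20):
    # Each partial_fit call makes one pass over the training data.
    clf.partial_fit(X_train, y_train, classes=classes)
    # Training loss as tracked by the estimator during that pass.
    train_losses.append(clf.loss_)
    # Validation loss computed on data the model never trains on.
    val_losses.append(log_loss(y_val, clf.predict_proba(X_val), labels=classes))

for epoch, (tr, va) in enumerate(zip(train_losses, val_losses), start=1):
    print(f"epoch {epoch:2d}  train loss={tr:.4f}  validation loss={va:.4f}")
```

The two lists can then be passed to any plotting library. A validation loss that starts rising while the training loss keeps falling is the usual sign of overfitting.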
I am working on a multiclass classification problem. I want to know whether my model is overfitting or underfitting. I am learning how to plot learning curves. My question is, is the order of steps I have done correct?
1. Scaling
2. Baseline model
3. Learning curve to see how well the baseline model performs
4. Hyperparameter tuning
5. Fit the model and predict on test data
6. Final learning curve to determine if the model is over- or underfitting
The first plot was made after running CV for the baseline model, before hyperparameter tuning. The second plot was made at the end, after hyperparameter tuning and fitting the final model with the best hyperparameters.
I have a question regarding the interpretability of machine learning algorithms.
I have a dataset looking like this:
[Figure: tabular data set]
I have trained a classification model (MLPClassifier from Scikit-Learn) and want to know which features have the biggest impact (the highest weight) on the decision.
My final goal is to find different solutions (combination of features) which will have a high probability (>90%) to be classified as 1.
Does somebody know a way to get these solutions?
Thanks in advance!
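The simplest way to obtain feature importances for a classification task is to use a tree-based model such as a random forest or a decision tree, both implemented in sklearn. These expose a feature_importances_ attribute, which MLPClassifier does not have:
from sklearn.ensemble import RandomForestClassifier

# X: feature matrix, y: class labels (your own data)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
clf.fit(X, y)
# After the fit step: one importance value per feature, summing to 1
print(clf.feature_importances_)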
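The feature importances tell you how much each feature contributes to the forest's decisions. If your MLP classifier is trained properly, it will likely rely on broadly similar features, although this is not guaranteed because the two models are different. To measure importance on the MLP itself, a model-agnostic method such as sklearn.inspection.permutation_importance can be used.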
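If you would rather inspect the MLP directly instead of a surrogate forest, permutation importance works with any fitted estimator. The sketch below is illustrative only: the synthetic data, the network size, and the way rows are filtered at the 0.9 probability threshold are assumptions, not a complete method for generating new feature combinations.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data stands in for the real table (an assumption).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
)
mlp.fit(X_train, y_train)

# Shuffle one feature at a time and measure the drop in test accuracy.
result = permutation_importance(mlp, X_test, y_test, n_repeats=10, random_state=0)
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(
        f"feature {i}: {result.importances_mean[i]:.3f} "
        f"+/- {result.importances_std[i]:.3f}"
    )

# Rows the model assigns to class 1 with probability above 0.9.
proba_1 = mlp.predict_proba(X_test)[:, 1]
confident = X_test[proba_1 > 0.9]
print(f"{len(confident)} of {len(X_test)} test rows have P(class 1) > 0.9")
```

Looking at the high-confidence rows is only a starting point for the second part of the question. It shows feature combinations that already exist in the data, not new ones.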
I'm trying a range of online classifiers in the scikit-learn library to train a model on a very large dataset. I found that many classifiers support partial_fit, which allows incremental learning. I want to use the Ridge classifier in this setting, but could not find a partial_fit implementation for it. Is there an alternative model in sklearn that can do this?
sklearn.linear_model.SGDClassifier supports partial_fit. Its loss options include 'hinge', 'log_loss' (called 'log' in older versions), 'modified_huber', 'squared_hinge', and 'perceptron'.
sklearn.linear_model.SGDRegressor also supports partial_fit. Its default loss is 'squared_error' (called 'squared_loss' before scikit-learn 1.0). The possible values are 'squared_error', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'.