So I am working on a multiclass problem with 6 outcome classes. I am using a OneVsRest classifier and trying to retrieve the prediction probabilities for every class using .predict_proba.
I was expecting the sum of the prediction probabilities of all classes for every observation to come out as one; however, that is not the case.
predictor = OneVsRestClassifier(xgb.XGBClassifier())
predictor.fit(X_train, y_train)
y_pred = predictor.predict_proba(X_test)
print(y_pred[1])
My output is: [0.11484083 0.02525082 0.02969465 0.58868223 0.09889702 0.03193117]
Can that be correct?
From the documentation of OneVsRestClassifier, this strategy consists in fitting one classifier per class; for each classifier, the class is fitted against all the other classes. For example, if you want to classify dog, cat, and bird with OneVsRestClassifier, it will train 3 models:
The first model will be trained to check whether your data is dog or not.
The second model will be trained to check whether your data is cat or not.
The third model will be trained to check whether your data is bird or not.
The three models are trained independently, each as a binary classification problem, so the probabilities from the three models do not have to sum to 1.
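If you want per-class scores that do sum to 1 for each observation, one option is to renormalize the rows yourself. A minimal sketch, reusing predictor and X_test from the question (note that, depending on your scikit-learn version, OneVsRestClassifier may already normalize the rows in the single-label case):
import numpy as np

proba = predictor.predict_proba(X_test)      # shape: (n_samples, n_classes)
print(proba[1], proba[1].sum())              # check whether this row sums to 1

# if it does not, renormalize the independent per-class scores
normalized = proba / proba.sum(axis=1, keepdims=True)
print(normalized[1], normalized[1].sum())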
I fit a random forest model for the data. I divided my dataset into training and testing in the ratio of 70:30 and trained the model. I got an accuracy of 80% for the test data. Then I took a benchmark dataset and tested the model with that dataset. That dataset only contained data with true labels (1). But when I get the predictions for the benchmark dataset using the model, all the true positives are classified as negatives, and the accuracy is 90%. Why is that? Is there a way to interpret this?
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
y_pred_benchmark = classifier.predict(XBench_test)
print("Accuracy on test data: {:.4f}".format(classifier.score(X_test, y_test)))  # This gives 80%
print("Accuracy on benchmark data: {:.4f}".format(classifier.score(XBench_test, YBench_test)))  # This gives 90%
I'll take a shot at providing a better way to interpret your results. In cases where you have an imbalanced dataset, accuracy is not going to be a good way to measure your performance.
Here is a common example:
Imagine you have a disease that is present in only 0.01% of people. If you predict that no one has the disease, you have an accuracy of 99.99%, but your model is not a good model.
In this example it appears your benchmark dataset (commonly referred to as a test dataset) has imbalanced classes, and you are getting an accuracy of 90% when you call the classifier.score method. In this case, accuracy is not a good way to interpret the model. You should instead look at other metrics.
Other common metrics to look at are precision and recall, which tell you how your model is performing on the positive class. In this case, since all of the true positives are predicted as negatives, your recall would be 0 (and your precision 0 or undefined), meaning your model is not differentiating the classes well.
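As a quick check you could compute these explicitly. A minimal sketch, reusing y_test, y_pred, YBench_test, and y_pred_benchmark from the question's code (zero_division=0 is available in recent scikit-learn versions and silences the warning when nothing is predicted positive):
from sklearn.metrics import precision_score, recall_score, classification_report

print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:   ", recall_score(y_test, y_pred, zero_division=0))

# per-class precision/recall/F1 for the benchmark set in one call
print(classification_report(YBench_test, y_pred_benchmark, zero_division=0))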
Going further, if you have imbalanced classes it may be better to check different score thresholds and look at metrics like ROC AUC. These metrics use the probability scores output by the model (predict_proba in sklearn) and evaluate performance across different thresholds. Perhaps your model works well at a lower threshold, with the positive cases consistently scoring higher than the negative cases.
Scikit-learn has a number of other metric functions you can use; they are located in the sklearn.metrics module.
Here is one way you could incorporate ROC AUC into your code:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X = dataset.iloc[:, 1:11].values
y = dataset.iloc[:, 11].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, shuffle=True)
XBench_test = benchmarkData.iloc[:, 1:11].values
YBench_test = benchmarkData.iloc[:, 11].values
classifier = RandomForestClassifier(n_estimators=35, criterion='entropy', max_depth=30,
                                    min_samples_split=2, min_samples_leaf=1, max_features='sqrt',
                                    class_weight='balanced', bootstrap=True, random_state=0, oob_score=True)
classifier.fit(X_train, y_train)
# use predict_proba; column 1 holds the probability of the positive class
y_pred = classifier.predict_proba(X_test)[:, 1]
y_pred_benchmark = classifier.predict_proba(XBench_test)[:, 1]
## instead of measuring accuracy, use ROC AUC (roc_auc_score takes the true labels and the scores)
print("ROC AUC on test data: {:.4f}".format(roc_auc_score(y_test, y_pred)))
print("ROC AUC on benchmark data: {:.4f}".format(roc_auc_score(YBench_test, y_pred_benchmark)))
Note that roc_auc_score needs both classes present in the true labels, so it cannot actually be computed on a benchmark set that contains only positive labels; for that comparison you would need a benchmark that also includes negative cases.
In the mlxtend library, there is an ensemble-learning meta-classifier for stacking called "StackingClassifier".
Here is an example of a StackingClassifier function call:
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
What is meta_classifier here? What is it used for?
What is stacking?
Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs -- meta-features -- of the individual classification models in the ensemble.
Source: StackingClassifier - mlxtend
So the meta_classifier parameter lets us choose the classifier that is fitted on the outputs of the individual models.
Example:
Assume that you have used 3 binary classification models, say LogisticRegression, a decision tree, and KNN, for stacking. Let's say 0, 0, 1 are the classes predicted by the three models for a given sample. Now we need a classifier that combines these predicted values, in the simplest case by majority voting, and that classifier is the meta_classifier. In this example it would pick 0 as the predicted class.
You can extend this to probability values as well.
Refer to the mlxtend API documentation for more info.
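For concreteness, here is a minimal runnable sketch of the idea, using the iris dataset purely for illustration (the particular base models and meta-classifier chosen here are arbitrary):
from mlxtend.classifier import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# level-0 (base) models
clf1 = DecisionTreeClassifier(random_state=0)
clf2 = KNeighborsClassifier()
clf3 = GaussianNB()

# the meta_classifier (level-1 model) is trained on the base models' predictions
lr = LogisticRegression(max_iter=1000)
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)

print(cross_val_score(sclf, X, y, cv=5).mean())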
The meta-classifier is the one that takes in all the predicted values of your base models. In your example you have three classifiers clf1, clf2, clf3; let's say clf1 is naive Bayes, clf2 is a random forest, and clf3 is an SVM. For every data point x_i in your dataset, all three models compute h_1(x_i), h_2(x_i), h_3(x_i), where h_1, h_2, h_3 are the functions learned by clf1, clf2, clf3. These three models produce three predicted values for each sample, and they can all run in parallel. A model is then trained on these predicted values; that model is the meta-classifier, which is logistic regression in your case.
So for a new query point x_q, the prediction is computed as h'(h_1(x_q), h_2(x_q), h_3(x_q)), where h' (h dash) is the function learned by the meta-classifier that produces y_q.
The advantage of a meta-classifier, or of ensemble models in general, is that if clf1 gives an accuracy of 90%, clf2 gives 92%, and clf3 gives 93%, the stacked model trained with the meta-classifier can often reach an accuracy above 93%. These stacking classifiers are used extensively in Kaggle competitions.
meta_classifier is simply the classifier that makes the final prediction by using the other classifiers' predictions as features. So it takes the classes predicted by the various classifiers and picks the final one as the result you need.
The mlxtend documentation also has a nice and simple visual presentation of how StackingClassifier works.
I am carrying out supervised machine learning. At present, using scikit-learn's metrics, my code prints out the accuracy over the entire corpus.
I also wish to print out the accuracy for the top 3 topics and then the top 5 topics. How can I do so?
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model = model.fit(matrix, label)
y_train_pred = model.predict(matrix_test)
print(metrics.accuracy_score(label_test, y_train_pred))
You could use a confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Example: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
This way you get detailed information about the predictions for each category.
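A minimal sketch, reusing label_test and y_train_pred from the question's code, that also derives a per-topic accuracy from the confusion matrix:
from sklearn.metrics import confusion_matrix

# rows are true topics, columns are predicted topics
cm = confusion_matrix(label_test, y_train_pred)
print(cm)

# per-topic accuracy (recall): correct predictions for a topic divided by its total samples
per_topic_accuracy = cm.diagonal() / cm.sum(axis=1)
print(per_topic_accuracy)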
Can I use sklearn's BaggingClassifier to produce continuous predictions? Is there a similar package? My understanding is that the bagging classifier predicts several classifications with different models and then reports the majority answer. It seems like this algorithm could be used to generate a probability for each classification and then report the mean value.
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict(X_test)
If you're interested in predicting probabilities for the classes in your classifier, you can use the predict_proba method, which gives you a probability for each class. It's a one-line change to your code:
trees = BaggingClassifier(ExtraTreesClassifier())
trees.fit(X_train,Y_train)
Y_pred = trees.predict_proba(X_test)
The shape of Y_pred will be [n_samples, n_classes].
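If what you are after is a single continuous score per sample (for example, the probability of the positive class), you can take the relevant column; the column order follows trees.classes_. A small sketch under that assumption:
proba = trees.predict_proba(X_test)   # shape: (n_samples, n_classes)
print(trees.classes_)                 # class corresponding to each column
positive_scores = proba[:, 1]         # per-sample probability of the second class (e.g. label 1)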
If your Y_train values are continuous and you want to predict those continuous values (i.e., you're working on a regression problem), then you can use the BaggingRegressor instead.
I typically use BaggingRegressor() for continuous values and then compare performance with RMSE. Example below:
import math
from sklearn.ensemble import BaggingRegressor
from sklearn import metrics

trees = BaggingRegressor()
trees.fit(X_train, Y_train)
scores_RMSE = math.sqrt(metrics.mean_squared_error(Y_test, trees.predict(X_test)))
I am currently trying to build a text classification model (document classification) with roughly 80 classes. When I build and train the model using random forest (after vectorizing the text into a TF-IDF matrix), the model works well. However, when I introduce new data, the words it contains aren't necessarily identical to those in the training set. This is a problem because I end up with a different number of features in my training set than in my test set (the training set has fewer dimensions than the test set).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

####### Convert bag of words to TF-IDF matrix
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(data)
print(tfidf_matrix.shape)
## number of features = 421

####### Train Random Forest model
clf = RandomForestClassifier(max_depth=None, min_samples_split=2, random_state=1, n_jobs=-1)

####### k-fold cross-validation
scores = cross_val_score(clf, tfidf_matrix.toarray(), labels, cv=7, n_jobs=-1)
print(scores.mean())

### this is the new data matrix for unseen data
new_tfidf = tfidf_vectorizer.fit_transform(new_X)
### number of features = 619

clf.fit(tfidf_matrix.toarray(), labels)
clf.predict(new_tfidf.toarray())
How can I go about creating a working RF model for classification that will incorporate new features (words) that weren't seen in the training?
Do not call fit_transform on the unseen data, only transform! That will keep the dictionary from the training set.
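A minimal sketch of that fix, reusing the variable names from the question:
tfidf_vectorizer = TfidfVectorizer()

# learn the vocabulary from the training text only
tfidf_matrix = tfidf_vectorizer.fit_transform(data)

# reuse the same vocabulary for unseen data: transform, not fit_transform
new_tfidf = tfidf_vectorizer.transform(new_X)

clf.fit(tfidf_matrix.toarray(), labels)
clf.predict(new_tfidf.toarray())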
You cannot introduce new features into the test set that were not part of your training set. The model is trained on a specific dictionary of terms, and that same dictionary of terms must be used across training, validation, testing, and production. Furthermore, the indices of the words in your feature vector cannot change either.
You should create one large matrix using all of your data and then split the rows into your train and test sets. This guarantees that you will have the same feature set for train and test.
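A minimal sketch of that approach, using hypothetical all_texts / all_labels variables for the combined data:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer()
full_matrix = vectorizer.fit_transform(all_texts)   # one matrix, one shared vocabulary

# split the rows afterwards, so train and test have identical feature columns
X_train, X_test, y_train, y_test = train_test_split(full_matrix, all_labels, test_size=0.3, random_state=1)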