I am carrying out supervised machine learning. At present, using scikit-learn's metrics, my script prints the accuracy over the entire corpus.
I also wish to print the accuracy for the top 3 topics and then the top 5 topics. How can I do so?
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression().fit(matrix, label)
# use the fitted model (not `model1`) to predict on the test set
y_pred = model.predict(matrix_test)
print(metrics.accuracy_score(label_test, y_pred))
You could use a confusion matrix: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
Example: http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html
This way you get specific information about the predictions for each category, as sketched below.
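For example, here is a minimal sketch, assuming the fitted model, matrix_test and label_test from the question:

from sklearn.metrics import confusion_matrix

# rows are true classes, columns are predicted classes;
# the diagonal holds the number of correct predictions per class
y_pred = model.predict(matrix_test)
cm = confusion_matrix(label_test, y_pred)
print(cm)
# per-class accuracy (recall): correct predictions divided by each row's total
print(cm.diagonal() / cm.sum(axis=1))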
I am wondering whether it is possible to do voting for classification tasks. I have seen plenty of blogs explaining how to use voting for regression purposes, as given below.
import xgboost as xgb
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# initializing all the model objects with default parameters
model_1 = LinearRegression()
model_2 = xgb.XGBRegressor()
model_3 = RandomForestRegressor()

# training all the models on the training dataset
model_1.fit(X_train, y_train)
model_2.fit(X_train, y_train)
model_3.fit(X_train, y_train)

# predicting the output on the validation dataset
pred_1 = model_1.predict(X_test)
pred_2 = model_2.predict(X_test)
pred_3 = model_3.predict(X_test)

# final prediction: average of the predictions of all 3 models
pred_final = (pred_1 + pred_2 + pred_3) / 3.0

# mean squared error between the true and the averaged predicted values
print(mean_squared_error(y_test, pred_final))
That can be done. For classification you can use scikit-learn's VotingClassifier:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# initializing all the model objects with default parameters
model_1 = svm.SVC(kernel='rbf')
model_2 = XGBClassifier()
model_3 = RandomForestClassifier()

# building the final model with a hard-voting (majority vote) classifier
final_model = VotingClassifier(
    estimators=[('svc', model_1), ('xgb', model_2), ('rf', model_3)],
    voting='hard')

# applying 10-fold cross-validation
scores = cross_val_score(final_model, X_all, y, cv=10, scoring='accuracy')
print(scores)
print('Model accuracy score : {0:0.4f}'.format(scores.mean()))
You can add more machine learning models than three if necessary.
Note that I have applied 10-fold cross-validation here and reported the mean accuracy.
Of course you can use the same idea for classes; only your voting will use a different function. This is how Random Forests arrive at their prediction (the individual decision trees in the forest "vote" for a common prediction). You can, for example, employ a majority vote over all classifiers, or you can use the individual predictions to formulate a probability for your prediction: each class could get the fraction of votes it received as its output. A sketch of both options is below.
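A minimal sketch of both options, assuming pred_1, pred_2 and pred_3 hold the class labels predicted by three already-fitted classifiers:

import numpy as np

# stack the individual predictions: shape (n_classifiers, n_samples)
preds = np.stack([pred_1, pred_2, pred_3])
classes = np.unique(preds)

# count the votes each class received per sample: shape (n_classes, n_samples)
votes = np.array([(preds == c).sum(axis=0) for c in classes])

# majority vote: the label with the most votes per sample
majority = classes[votes.argmax(axis=0)]

# or turn the vote fractions into a simple per-class "probability"
vote_fractions = votes / preds.shape[0]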
So I am working on a multiclass problem with 6 outcome classes. I am using a OneVsRest classifier and trying to retrieve the prediction probabilities for every class using .predict_proba.
I was expecting the sum of the prediction probabilities of all classes for every observation to come out as one; however, that is not the case.
import xgboost as xgb
from sklearn.multiclass import OneVsRestClassifier

# note: pass an instance, xgb.XGBClassifier(), not the class itself
predictor = OneVsRestClassifier(xgb.XGBClassifier())
predictor.fit(X_train, y_train)
y_pred = predictor.predict_proba(X_test)
print(y_pred[1])
My output is: [0.11484083 0.02525082 0.02969465 0.58868223 0.09889702 0.03193117]
Can that be correct?
From the documentation of OneVsRestClassifier: "this strategy consists in fitting one classifier per class. For each classifier, the class is fitted against all the other classes." For example, if you want to classify dog, cat, and bird with OneVsRestClassifier, it will train 3 models:
the first model will be trained to check whether your data is a dog or not,
the second model will be trained to check whether your data is a bird or not,
the third model will be trained to check whether your data is a cat or not.
The three models are trained independently as binary classifiers, so their probabilities do not necessarily sum to 1.
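If you do want values that sum to 1 for each observation, one simple post-processing option (a sketch, assuming the predictor and X_test from the question) is to normalize each row of the predict_proba output:

proba = predictor.predict_proba(X_test)                      # shape (n_samples, n_classes)
proba_normalized = proba / proba.sum(axis=1, keepdims=True)  # each row rescaled to sum to 1
print(proba_normalized[1], proba_normalized[1].sum())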
So I have a simple Sequential model built in Keras and I'm asked to train it multiple times (specifically, 5 times) with the same set of data (although I would have to change the train-test split). I would then like to average these trained runs in the sense of:
Average the final accuracy on train and on validation.
Average the learning curve.
I know I can do this with a loop in plain Python, but since this seems like a common thing to do, I wonder if there is already a built-in function for exactly that. I know there is a way to train multiple times and save the best model, but I just want the average of the final results.
Maybe you are thinking about Bagging. https://en.wikipedia.org/wiki/Bootstrap_aggregating
You can train multiple models and average their outputs, but not the models themselves. There isn't (and there won't be) a built-in function for that, because it is as simple as, for example, computing an average in a regression task. A sketch is below.
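A minimal sketch of that idea, where models is assumed to be a list of already-trained Keras models and x_test the evaluation data:

import numpy as np

# predictions has shape (n_models, n_samples, ...); average over the models axis
predictions = np.stack([m.predict(x_test) for m in models])
averaged_prediction = predictions.mean(axis=0)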
For averaging accuracy metrics over n trials, see the pseudocode below:

import numpy as np
from tensorflow import keras

def loop_model(x_train, y_train, x_test, y_test, n_loops=5):
    arr_metric = list()
    for i in range(n_loops):
        print("Trial", i + 1, "of", n_loops)
        model = build_model(units, activation, ...)   # build and compile a fresh model each trial
        history = model.fit(x_train, y_train, ...)
        y_pred = model.predict(x_test)                # predict() returns only the predictions
        metric = compute_metric(y_test, y_pred)       # compare against the held-out labels
        arr_metric.append(metric)
        del model
        keras.backend.clear_session()
    # Convert to a numpy array to compute the mean
    avg_metric = np.array(arr_metric).mean()
    return avg_metric
The function build_model() builds and compiles the model, while compute_metric() computes your accuracy metric. These are not built-in functions but part of the pseudocode. Using a numpy array to compute the mean is one approach; there are other things you can try.
See this answer as to why I suggested using the last two lines in the for loop.
I am trying to perform K-fold cross-validation and GridSearchCV to optimise my Gradient Boosting model, following this link:
https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
I have a few questions regarding the screenshot of the Model Report below:
1) How is the accuracy of 0.814365 calculated? Where in the script does it do a train-test split? If you change cv_folds=5 to any other integer, the accuracy is still 0.814365. In fact, removing cv_folds and passing performCV=False also gives the same accuracy.
(Note: my scikit-learn 80/20 train-test split without CV gives an accuracy of around 0.79-0.80.)
2) Again, how is the AUC Score (Train) calculated? And should this be ROC AUC rather than AUC? My scikit-learn model gives an AUC of around 0.87. Like the accuracy, this score seems fixed.
3) Why is the mean CV Score so much lower than the AUC (Train) Score? It looks like they both use roc_auc (my scikit-learn model gives 0.77 for the ROC AUC).
import numpy as npy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("123.csv")
target = 'APPROVED'  # item to predict
IDcol = 'ID'

def modelfit(alg, ddf, predictors, performCV=True, printFeatureImportance=True, cv_folds=5):
    # Fit the algorithm on the data
    alg.fit(ddf[predictors], ddf['APPROVED'])

    # Predict on the training set:
    ddf_predictions = alg.predict(ddf[predictors])
    ddf_predprob = alg.predict_proba(ddf[predictors])[:, 1]

    # Perform cross-validation:
    if performCV:
        cv_score = cross_val_score(alg, ddf[predictors], ddf['APPROVED'], cv=cv_folds, scoring='roc_auc')

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %f" % metrics.accuracy_score(ddf['APPROVED'].values, ddf_predictions))
    print("AUC Score (Train): %f" % metrics.roc_auc_score(ddf['APPROVED'], ddf_predprob))
    if performCV:
        print("CV Score : Mean - %.5g | Std - %.5g | Min - %.5g | Max - %.5g" %
              (npy.mean(cv_score), npy.std(cv_score), npy.min(cv_score), npy.max(cv_score)))

    # Print feature importances:
    if printFeatureImportance:
        feat_imp = pd.Series(alg.feature_importances_, predictors).sort_values(ascending=False)
        feat_imp.plot(kind='bar', title='Feature Importances')
        plt.ylabel('Feature Importance Score')

# Choose all predictors except the target & ID columns
predictors = [x for x in df.columns if x not in [target, IDcol]]
gbm0 = GradientBoostingClassifier(random_state=10)
modelfit(gbm0, df, predictors)
The main reason your cv_score appears low is that comparing it to the training accuracy isn't a fair comparison. Your training accuracy is calculated on the same data that was used to fit the model, whereas the cv_score is the average score from the testing folds within your cross-validation. As you can imagine, a model performs better making predictions on data it has already been trained on than on new data it has never seen before.
Your accuracy_score and AUC calculations appear fixed because you always feed the same inputs (ddf['APPROVED'], ddf_predictions and ddf_predprob) into them. The performCV section doesn't actually transform any of those datasets, so if you use the same model, model parameters, and input data, you will get the same predictions going into the calculations.
Based on your comments, there are a number of reasons the cv_score accuracy could be lower than the accuracy on your full testing set. One of the main reasons is that your model has access to more training data when you use the full training set, as opposed to a subset of the training data in each CV fold. This is especially true if your data set isn't all that large; when data is scarce, the extra training data matters more and can provide better performance.
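To make that comparison fair, one option is to hold out a test set and compare the cross-validated score on the training portion with the score on the held-out portion; a sketch, assuming df, predictors and target from the question:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df[target], test_size=0.2, random_state=10)

gbm = GradientBoostingClassifier(random_state=10)

# average ROC AUC over the CV folds of the training data only
cv_auc = cross_val_score(gbm, X_train, y_train, cv=5, scoring='roc_auc').mean()

# ROC AUC on data the model has never seen
gbm.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1])

print("CV ROC AUC: %.4f | Held-out ROC AUC: %.4f" % (cv_auc, test_auc))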
In the mlxtend library, there is an ensemble-learning meta-classifier for stacking called StackingClassifier.
Here is an example of a StackingClassifier function call:
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
meta_classifier=lr)
What is meta_classifier here? What is it used for?
What is stacking?
Stacking is an ensemble learning technique to combine multiple classification models via a meta-classifier. The individual classification models are trained based on the complete training set; then, the meta-classifier is fitted based on the outputs -- meta-features -- of the individual classification models in the ensemble.
Source : StackingClassifier-mlxtend
So the meta_classifier parameter lets us choose the classifier that is fitted on the outputs of the individual models.
Example:
Assume that you have used 3 binary classification models, say LogisticRegression, a decision tree and KNN, for stacking. Let's say 0, 0, 1 are the classes predicted by the models. Now we need a classifier that will do a majority vote on the predicted values, and that classifier is the meta_classifier. In this example it would pick 0 as the predicted class.
You can extend this to probability values as well.
Refer mlxtend-API for more info
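For reference, a minimal runnable sketch of such a setup with mlxtend (the base estimators and the iris data are only illustrative choices):

from mlxtend.classifier import StackingClassifier
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# three base classifiers, as in the example above
clf1 = LogisticRegression(max_iter=1000)
clf2 = DecisionTreeClassifier()
clf3 = KNeighborsClassifier()

# the meta_classifier is fitted on the outputs of the base classifiers
lr = LogisticRegression(max_iter=1000)
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3], meta_classifier=lr)

sclf.fit(X, y)
print(sclf.predict(X[:5]))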
The meta-classifier is the one that takes in all the predicted values of your base models. In your example you have three classifiers clf1, clf2, clf3; let's say clf1 is naive Bayes, clf2 is a random forest, and clf3 is an SVM. For every data point x_i in your dataset, all three models are run, giving h_1(x_i), h_2(x_i), h_3(x_i), where h_1, h_2, h_3 are the functions learned by clf1, clf2, clf3. These three predicted values can be produced in parallel. A model is then trained on these predictions; that model is the meta-classifier, which is logistic regression in your case.
So for a new query point x_q, the final prediction is computed as h'(h_1(x_q), h_2(x_q), h_3(x_q)), where h' is the function learned by the meta-classifier and produces y_q.
The advantage of a meta-classifier or ensemble model is that if, say, clf1 gives an accuracy of 90%, clf2 gives 92%, and clf3 gives 93%, the stacked model trained with the meta-classifier can often achieve an accuracy higher than the best individual model. Such stacking classifiers are used extensively in Kaggle competitions.
meta_classifier is simply the classifier that makes the final prediction by using the predictions of all the other classifiers as features. So it takes the classes predicted by the various classifiers and picks the final one as the result that you need.
Here is a nice and simple presentation of StackingClassifier.