Recursive feature elimination on Random Forest using scikit-learn - python

I'm trying to perform recursive feature elimination with scikit-learn and a random forest classifier, using OOB ROC as the method of scoring each subset created during the recursive process.
However, when I try to use the RFECV method, I get the error AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'.
Random Forests don't have coefficients per se, but they do rank features by Gini importance, so I'm wondering how to get around this problem.
Please note that I want to use a method that explicitly tells me which features from my pandas DataFrame were selected in the optimal grouping, since I am using recursive feature selection to minimize the amount of data I feed into the final classifier.
Here's some example code:
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
ranking_ = rfe.fit(X_train, y_train).ranking_
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Here's what I've done to adapt RandomForestClassifier to work with RFECV:
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
Using this class alone does the trick if you use 'accuracy' or 'f1' scoring. With 'roc_auc', RFECV complains that the multiclass format is not supported. After switching to two-class classification with the code below, the 'roc_auc' scoring works. (Using Python 3.4.1 and scikit-learn 0.15.1.)
y=(pd.Series(iris.target, name='target')==2).astype(int)
Plugging into your code:
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=(pd.Series(iris.target, name='target')==2).astype(int)
rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
selector=rfecv.fit(x, y)
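To answer the part about knowing which DataFrame columns end up in the optimal subset: once the fit above finishes, RFECV exposes a boolean support mask and a ranking, so the selected column names can be read straight off the DataFrame. A short usage sketch based on the fitted selector:
# Boolean mask over the original columns: True = kept in the optimal subset
selected_columns = x.columns[rfecv.support_]
print(selected_columns.tolist())
print(rfecv.n_features_)  # number of features in the optimal subset
print(rfecv.ranking_)     # rank 1 = selected; higher ranks were eliminated earlier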

This is my code; I've tidied it up a bit to make it relevant to your task:
features_to_use = fea_cols  # this is a list of features
# empty dataframe to record the results of each run
trim_5_df = DataFrame(columns=features_to_use)
run = 1
# this will remove the 5 worst features, as ranked by the feature importances computed by the RF classifier
while len(features_to_use) > 6:
    print('number of features: %d' % len(features_to_use))
    # build the classifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
    # train the classifier
    clf.fit(train[features_to_use], train['OpenStatusMod'].values)
    print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
    # predict the class and print the classification report, f1 micro, f1 macro score
    pred = clf.predict(test[features_to_use])
    print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
    print('micro score: ')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
    print('macro score:\n')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
    # predict the class probabilities
    probs = clf.predict_proba(test[features_to_use])
    # rescale the priors
    new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
    # calculate logloss with the rescaled probabilities
    print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
    row = {}
    if hasattr(clf, "feature_importances_"):
        # sort the features by importance
        sorted_idx = np.argsort(clf.feature_importances_)
        # reverse the order so it is descending
        sorted_idx = sorted_idx[::-1]
        # add to dataframe
        row['num_features'] = len(features_to_use)
        row['features_used'] = ','.join(features_to_use)
        # trim the worst 5
        sorted_idx = sorted_idx[:-5]
        # swap the features list with the trimmed features
        temp = features_to_use
        features_to_use = []
        for feat in sorted_idx:
            features_to_use.append(temp[feat])
        # add the logloss performance
        row['logloss'] = [log_loss(test['OpenStatusMod'].values, new_probs)]
    print('')
    # add the row to the dataframe
    trim_5_df = trim_5_df.append(DataFrame(row))
    run += 1
So what I'm doing here is: I have a list of features I want to train on and then predict with; using the feature importances, I trim the worst 5 and repeat. During each run I add a row to record the prediction performance so that I can do some analysis later.
The original code was much bigger; I had different classifiers and datasets I was analysing, but I hope you get the picture from the above. The thing I noticed was that, for random forest, the number of features I removed on each run affected the performance, so trimming by 1, 3 and 5 features at a time resulted in a different set of best features.
I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1 feature at a time or 3 or 5.
I hope I'm not teaching you to suck eggs here, you probably know more than me, but my approach to ablative analysis was to use a fast classifier to get a rough idea of the best sets of features, then use a better-performing classifier, then start hyperparameter tuning, again doing coarse-grained comparisons first and then fine-grained ones once I had a feel for what the best parameters were.
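For reference, swapping in gradient boosting for the loop above is a one-line change; a minimal sketch (the surrounding loop and data names are assumed from the snippet above):
from sklearn.ensemble import GradientBoostingClassifier
# Drop-in replacement for the RandomForestClassifier built at the top of the loop;
# GradientBoostingClassifier also exposes feature_importances_ after fitting,
# so the rest of the trimming logic works unchanged.
clf = GradientBoostingClassifier(n_estimators=1000, random_state=0)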

I submitted a request to add coef_ so that RandomForestClassifier can be used with RFECV. However, it turned out the change had already been made; it will be in version 0.17.
https://github.com/scikit-learn/scikit-learn/issues/4945
You can pull the latest dev build if you want to use it now.
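On scikit-learn 0.17 or later, RFECV should fall back to feature_importances_ when coef_ is absent, so no wrapper class is needed. A minimal sketch against the iris example above (binary target kept so that 'roc_auc' scoring works; not tested against the dev build):
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
import pandas as pd
iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = (pd.Series(iris.target, name='target') == 2).astype(int)
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=5, scoring='roc_auc')
rfecv.fit(x, y)
print(x.columns[rfecv.support_].tolist())  # the selected feature names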

Here's what I ginned up. It's a pretty simple solution, and it relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But it should be easy to extend if desired.
from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix

def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """Enhances confusion_matrix by adding sensitivity and specificity metrics."""
    cm = confusion_matrix(actuals, predictions, labels = labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0] + cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0] + cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy

iris = datasets.load_iris()
x = pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y = pandas.Series(iris.target, name='target')
response, _ = pandas.factorize(y)

xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x, response, test_size = .25, random_state = 36583)
print "building the first forest"
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
importances = pandas.DataFrame({'name': x.columns, 'imp': rf.feature_importances_
                                }).sort(['imp'], ascending = False).reset_index(drop = True)

cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)

rfeMatrix = pandas.DataFrame({'numFeatures': [numFeatures],
                              'weightedAccuracy': [weightedAccuracy],
                              'sensitivity': [sensitivity],
                              'specificity': [specificity]})

print "running RFE on %d features" % numFeatures

for i in range(1, numFeatures, 1):
    varsUsed = importances['name'][0:i]
    print "now using %d of %s features" % (len(varsUsed), numFeatures)
    xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x[varsUsed], response, test_size = .25)
    rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
                                n_jobs = -1, verbose = 1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n" + str(cm))
    print('the sensitivity is %d percent' % (sensitivity * 100))
    print('the specificity is %d percent' % (specificity * 100))
    print('the weighted accuracy is %d percent' % (weightedAccuracy * 100))
    rfeMatrix = rfeMatrix.append(
        pandas.DataFrame({'numFeatures': [len(varsUsed)],
                          'weightedAccuracy': [weightedAccuracy],
                          'sensitivity': [sensitivity],
                          'specificity': [specificity]}), ignore_index = True)

print("\n" + str(rfeMatrix))
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

print "the final features used are %s" % featuresUsed

Related

Is there a way to get all computed coefficients in a GridSearchCV?

I am trying different ML models, all using a pipeline which includes a transformer and an algorithm, 'nested' in a GridSearchCV to find the best hyperparameters.
When running Ridge, Lasso and ElasticNet regressions, I would like to store all the computed coefficients, not only the best_estimator_ coefficients, in order to plot them along the alpha path.
In other words, every time GridSearchCV changes the alpha parameter and fits a new model, I would like to store the resulting coefficients so I can plot them against the alpha values.
You can take a look at this official scikit-learn post for a beautiful example.
This is my code:
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error
import time
start = time.time()
# Cross-validated - Ridge Regression
model_ridge = make_pipeline(transformer, Ridge()) # my transformer is already defined
alphas = np.logspace(-5, 5, num = 50)
params = {'ridge__alpha' : alphas}
grid = GridSearchCV(model_ridge, param_grid = params, cv=10)
grid.fit(X_train, y_train)
regressor = grid.estimator.named_steps['ridge'].coef_ # when I add this line, it returns an error
stop = time.time()
training_time = stop-start
y_pred = grid.predict(X_test)
Ridge_Regression_results = {'Algorithm' : 'Ridge Regression',
                            'R²' : grid.score(X_train, y_train),
                            'MAE' : mean_absolute_error(y_test, y_pred),
                            'RMSE' : np.sqrt(mean_squared_error(y_test, y_pred)),
                            'Training time (sec)' : training_time}
In this topic: return coefficients from Pipeline object in sklearn, the author was advised to use the named_steps attribute of the pipeline.
But in my case, when I try to use it, it returns the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18260/3310195105.py in <module>
13
14 grid.fit(X_train, y_train)
---> 15 regressor = grid.estimator.named_steps['ridge'].coef_
16
17
AttributeError: 'Ridge' object has no attribute 'coef_'
I don't understand why this is happening.
For this to work, my guess is that this storing should happen during the GridSearchCV loop, but I can't figure out how to do this.
You can get the coefficients by making them "scores," although it isn't very semantically correct.
import pandas as pd

def myscores(estimator, X, y):
    # return every Ridge coefficient as its own "score", plus the real R² score
    r2 = estimator.score(X, y)
    coefs = estimator.named_steps["ridge"].coef_
    ret_dict = {
        f'a_{i}': coef for i, coef in enumerate(coefs)
    }
    ret_dict['r2'] = r2
    return ret_dict

grid = GridSearchCV(
    model_ridge,
    param_grid=params,
    scoring=myscores,
    refit='r2'
)
grid.fit(X_train, y_train)
print(pd.DataFrame(grid.cv_results_))
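Each key returned by myscores should then show up in cv_results_ under the usual multimetric column names (mean_test_a_0, split0_test_a_0, mean_test_r2, and so on). The column names and plotting below are a hedged sketch of how the coefficients could be pulled out and plotted against the alpha path, not verified output:
import matplotlib.pyplot as plt
results = pd.DataFrame(grid.cv_results_)
# one 'mean_test_a_i' column per coefficient returned by myscores (assumed naming)
coef_cols = [c for c in results.columns if c.startswith('mean_test_a_')]
# mean cross-validated coefficient values against the alpha path
coef_path = (results.assign(alpha=results['param_ridge__alpha'].astype(float))
                    .set_index('alpha')[coef_cols])
coef_path.plot(logx=True)
plt.xlabel('alpha')
plt.ylabel('mean coefficient value across folds')
plt.show()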

Improve precision of my predictive technique in Python

I am using the following Python code to make output predictions depending on some values using decision trees based on entropy/gini index. My input data is contained in the file: https://drive.google.com/file/d/1C8GZ2wiqFUW3WuYxyc0G3axgkM1Uwsb6/view?usp=sharing
The first column "gold" in the file contains the output that I am trying to predict (either T or N). The remaining columns represent 0/1 features that I can use to predict the first column. I am using a test set of 30% and a training set of 70%, and I get the same precision/recall whether I use entropy or the gini index: a precision of 0.80 and a recall of 0.54 for T. I would like to increase the precision for T, and I am okay if the recall for T goes down; I am willing to accept this tradeoff. I do not care about the precision/recall of the N predictions; improving the precision of T is all I care about. I guess increasing the precision means the classifier should abstain from making predictions in situations where it is not certain. How can I do that?
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
from sklearn import tree
import collections
import pydotplus

column_count = 0

# Function importing the dataset
def importdata():
    balance_data = pd.read_csv('data1extended.txt', sep=',')
    row_count, column_count = balance_data.shape

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data, column_count

def columns(balance_data):
    row_count, column_count = balance_data.shape
    return column_count

# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with the gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

# Univariate selection
def selection(column_count, data):
    # data = pd.read_csv("data1extended.txt")
    X = data.iloc[:, 1:column_count]  # independent columns
    y = data.iloc[:, 0]               # target column, i.e. gold
    # apply SelectKBest class to extract the top 5 best features
    bestfeatures = SelectKBest(score_func=chi2, k=5)
    fit = bestfeatures.fit(X, y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)
    df = pd.DataFrame(data, columns=X)
    # concat two dataframes for better visualization
    featureScores = pd.concat([dfcolumns, dfscores], axis=1)
    featureScores.columns = ['Specs', 'Score']  # naming the dataframe columns
    print(featureScores.nlargest(5, 'Score'))   # print the 5 best features
    return X, y, data, df

# Feature importance
def feature(X, y):
    model = ExtraTreesClassifier()
    model.fit(X, y)
    print(model.feature_importances_)  # use inbuilt feature_importances_ of tree based classifiers
    # plot graph of feature importances for better visualization
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(5).plot(kind='barh')
    plt.show()

# Correlation matrix
def correlation(data, column_count):
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(column_count, column_count))
    # plot heat map
    g = sns.heatmap(data[top_corr_features].corr(), annot=True, cmap="RdYlGn")

def generate_decision_tree(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    data_feature_names = ['callersAtLeast1T','CalleesAtLeast1T','callersAllT','calleesAllT','CallersAtLeast1N','CalleesAtLeast1N','CallersAllN','CalleesAllN','childrenAtLeast1T','parentsAtLeast1T','childrenAtLeast1N','parentsAtLeast1N','childrenAllT','parentsAllT','childrenAllN','ParentsAllN','ParametersatLeast1T','FieldMethodsAtLeast1T','ReturnTypeAtLeast1T','ParametersAtLeast1N','FieldMethodsAtLeast1N','ReturnTypeN','ParametersAllT','FieldMethodsAllT','ParametersAllN','FieldMethodsAllN']
    # generate model
    model = clf.fit(X, y)
    # Create DOT data
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=data_feature_names,
                                    class_names=y)
    # Draw graph
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Show graph
    Image(graph.create_png())
    # Create PDF
    graph.write_pdf("tree.pdf")
    # Create PNG
    graph.write_png("tree.png")

# Driver code
def main():
    # Building Phase
    data, column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

    # COMMENTED OUT THE 4 FOLLOWING LINES DUE TO MEMORY ERROR
    # X, y, dataheaders, df = selection(column_count, data)
    # generate_decision_tree(X, y)
    # feature(X, y)
    # correlation(dataheaders, column_count)

# Calling main function
if __name__ == "__main__":
    main()
I would suggest using a Pipeline to build the data pipeline, and GridSearchCV to find the best possible hyperparameters and settings for the pipe.
A basic example:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('kbest', SelectKBest(chi2, k=3000)),
                 ('clf', DecisionTreeClassifier())
                 ])

pipe_params = {'kbest__k': range(1, 10, 1),
               'kbest__score_func': [f_classif, chi2],
               'clf__max_depth': np.arange(1, 30),
               'clf__min_samples_leaf': [1, 2, 4, 5, 10, 20, 30, 40, 80, 100]}

grid_search = GridSearchCV(pipe, pipe_params, n_jobs=-1,
                           scoring='accuracy', cv=10)
grid_search.fit(X_train, Y_train)
This will iterate over every hyperparameter combination in pipe_params and choose the best model based on accuracy.
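After the search finishes, the refit best pipeline and the winning settings can be inspected directly; a short usage sketch (best_params_, best_score_ and best_estimator_ are standard GridSearchCV attributes; X_test is assumed to exist alongside X_train):
print(grid_search.best_params_)   # best hyperparameter combination found
print(grid_search.best_score_)    # mean cross-validated accuracy for that combination
best_pipe = grid_search.best_estimator_   # the refit pipeline
y_pred = best_pipe.predict(X_test)        # predictions from the best pipeline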

10-fold cross-validation and obtaining RMSE

I'm trying to compare the RMSE I get from performing multiple linear regression on the full data set to the RMSE from 10-fold cross-validation, using the KFold module in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function
def standRegres(xArr, yArr):
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df, comm_target)
    p = linreg.predict(comm_df)
    e = p - comm_target
    xval_err += np.sqrt(np.dot(e, e) / len(comm_df))
rmse_10cv = xval_err / 10
I get an error saying that the KFold object is not iterable.
There are several things you need to correct in this code.
You cannot iterate over kf; you can only iterate over kf.split(comm_df).
You need to actually use the train/test splits that KFold provides; you are not using them in your code! The goal of KFold is to fit your regression on the train observations and to evaluate it (i.e. compute the RMSE, in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas):
kf = KFold(n_splits=10)
linreg = LinearRegression()
xval_err = 0
for train, test in kf.split(comm_df):
    linreg.fit(comm_df[train], comm_target[train])
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))
rmse_10cv = xval_err / 10
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum = 0
RMSE_length = 10
X = np.array(comm_df)
y = np.array(comm_target)

for loop_number, (train, test) in enumerate(kf.split(X)):

    ## Get Training Matrix and Vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)

    ## Get Testing Matrix Values
    X_test_array = X[test]
    y_actual_values = y[test]

    ## Fit the Linear Regression Model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)

    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)

    ## Calculate the RMSE
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)

    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum = RMSE_cross_fold + RMSE_sum

## Calculate the average and print
RMSE_cross_fold_avg = RMSE_sum / RMSE_length
print('The Mean RMSE across all folds is', RMSE_cross_fold_avg)
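RMSEcalc isn't shown in the post; a minimal version consistent with how it is called above might look like this (an assumed implementation, not the author's):
import numpy as np

def RMSEcalc(predictions, actuals):
    # root mean squared error between predicted and actual values
    predictions = np.asarray(predictions).ravel()
    actuals = np.asarray(actuals).ravel()
    return np.sqrt(np.mean((predictions - actuals) ** 2))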

How to get the prediction probabilities using cross validation in scikit-learn

I am using RandomForestClassifier as follows using cross validation for a binary classification (class labels are 0 and 1).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
clf=RandomForestClassifier(random_state = 42, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
accuracy = cross_val_score(clf, X, y, cv=k_fold, scoring = 'accuracy')
print("Accuracy: " + str(round(100*accuracy.mean(), 2)) + "%")
f1 = cross_val_score(clf, X, y, cv=k_fold, scoring = 'f1_weighted')
print("F Measure: " + str(round(100*f1.mean(), 2)) + "%")
Now I want to order my data by the predicted probability of class 1, using the cross-validation results. For that I tried the following two ways.
pred = clf.predict_proba(X)[:,1]
print(pred)
probs = clf.predict_proba(X)
best_n = np.argsort(probs, axis=1)[:,-6:]
I get the following error in both cases:
NotFittedError: This RandomForestClassifier instance is not fitted
yet. Call 'fit' with appropriate arguments before using this method.
I am just wondering where I am going wrong.
I am happy to provide more details if needed.
In case you want to use the cross-validated models for unseen data points, use the following approach.
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
iris = datasets.load_iris()
X = iris.data
y = iris.target
clf = RandomForestClassifier(n_estimators=10, random_state = 42, class_weight="balanced")
cv_results = cross_validate(clf, X, y, cv=3, return_estimator=True)
clf_fold_0 = cv_results['estimator'][0]
clf_fold_0.predict_proba([iris.data[133]])
# array([[0. , 0.5, 0.5]])
Have a look at the documentation; it specifies that the predicted probability is calculated as the mean of the predicted class probabilities of the trees in the forest.
In your case, you first need to call the fit() method to build the trees in the model. Once you have fitted the model on the training data, you can call the predict_proba() method.
This is also what the error message says.
# Fit model
model = RandomForestClassifier(...)
model.fit(X_train, Y_train)
# Probabilty
model.predict_proba(X)[:,1]
I solved my problem using the following code:
proba = cross_val_predict(clf, X, y, cv=k_fold, method='predict_proba')
print(proba[:,1])
print(np.argsort(proba[:,1]))
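To order the original rows by the class-1 probability, the argsort indices can be applied back to X; a minimal sketch, assuming X is a pandas DataFrame aligned with y (use plain indexing instead of .iloc if X is a numpy array):
import numpy as np
order = np.argsort(proba[:, 1])[::-1]   # highest class-1 probability first
X_ordered = X.iloc[order]
print(X_ordered.head())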

Scikit-Learn Random Forest Classifier: High accuracy on Training and Test, but not Production

I am training a classifier that assigns text-based requests to departments. I have ~107,000 labeled examples spread over 22 imbalanced classes with roughly the following distribution:
Class 1: 10,000
Class 2: 60,000
Class 3: 7,000
Class 4: 5,000
Class 5: 3,500
Classes 6 & 7: 2000 samples each
Classes 7-15: 1500 samples each
Classes 16-22: 500 samples each
I have been preprocessing the data to provide an even number of samples per class (anywhere from 5,000 to 50,000 samples each). With the above classifier and the balanced training data, I am able to get up to 98.5% accuracy on the test data with a 50-50 split of the total training data. But when new requests come in and I load the saved classifier, it only achieves 50-70% accuracy at best. The requests are relatively stable, in that the same requests always go to the same department, so I am very surprised to see only 50-70% accuracy in production, especially given such high accuracy on the test data:
import logging
import os
from collections import Counter

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.externals import joblib
from sklearn.metrics import classification_report
from sklearn.utils import resample

logger = logging.getLogger(__name__)

def up_sample(data, labels, **kwargs):
    label_counts = Counter(labels)
    max_label = max(label_counts, key=label_counts.get)
    max_label_count = kwargs.get('samples', label_counts[max_label])
    output_text = []
    output_labels = []
    for label, count in label_counts.items():
        label_text = [data_row for data_row, label_row in zip(data, labels) if label_row == label]
        resampled_labels = [label] * max_label_count
        resampled_text = resample(label_text, n_samples=max_label_count, random_state=0)
        output_text = output_text + resampled_text
        output_labels = output_labels + resampled_labels
    return output_text, output_labels

clf = Pipeline(
    steps=(('tfidf_vectorizer', TfidfVectorizer(stop_words='english')),
           ('clf', RandomForestClassifier(n_estimators=250, n_jobs=-1)))
)

label_encoder = LabelEncoder()

resampled_data, resampled_labels = up_sample(data, labels)  # UPDATE: produces ~700,000 samples, with many duplicates
labels = label_encoder.fit_transform(resampled_labels)

X_train, X_test, y_train, y_test = train_test_split(resampled_data, labels, test_size=0.5, random_state=0)  # UPDATE: many duplicates in both training and test data sets as a result of upsampling
clf.fit(X_train, y_train)
test_score = clf.score(X_test, y_test)
logger.debug('Test Score: %s', test_score)  # 0.98-0.99

cross_validation_results = cross_val_score(clf, resampled_data, labels)
logger.debug('Cross Validation results: %r', cross_validation_results)  # [98.7, 99.1, 97.8]

y_test_predicted = clf.predict(X_test)
output_classification_report = classification_report(y_test, y_test_predicted, target_names=label_encoder.classes_)
logger.debug(output_classification_report)  # 0.95-1.0 for precision and recall for all classes

clf_file_name = os.path.join(directory, clf_name)
joblib.dump(clf, clf_file_name)
label_encoder_file_name = os.path.join(directory, label_encoder_name)
joblib.dump(label_encoder, label_encoder_file_name)

# Later, in a different script
clf_file_name = os.path.join(directory, name)
clf = joblib.load(clf_file_name)
label_encoder_file_name = os.path.join(directory, name)
label_encoder = joblib.load(label_encoder_file_name)

predictions = clf.predict(new_data)
logger.debug(clf.score(new_data, new_labels))  # 50-70%
Also, when I retrain the classifier on new_data and predict on new_data, it is 100% accurate. I know it will score much higher since it has already seen those examples. I have been reading about out-of-bag error in random forests, which I understand could be related to my issue, but I am not familiar enough with OOB to know how to correct for it. I do not know how to proceed from here. How do I go about resolving this issue?
I have already read through the following questions/resources for resolving my issue, prior to posting my own question, but feel free to let me know if I overlooked something from them:
What is the out of bag error in random forests
Random Forest - how to handle overfitting?
How do I solve for overfitting in random forest of Python scikit-learn?
Random Forest is overfitting
Difference between OOB score and score of random forest model in scikit-learn package
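For what it's worth, the UPDATE comments in the code already point at a likely culprit: upsampling happens before the train/test split, so duplicated rows land on both sides of the split and inflate the test score. A hedged sketch of splitting first and upsampling only the training portion (it reuses the names from the snippet above, before any upsampling, and is an illustration rather than the original pipeline):
# encode the labels once, on the original (non-upsampled) label list
labels_enc = label_encoder.fit_transform(labels)
# split first, so no duplicated request can appear in both train and test
X_train_raw, X_test, y_train_raw, y_test = train_test_split(
    data, labels_enc, test_size=0.5, random_state=0, stratify=labels_enc)
# upsample only the training portion, then fit as before
X_train, y_train = up_sample(X_train_raw, y_train_raw)
clf.fit(X_train, y_train)
# this score is computed on requests the model has never seen,
# so it should track production accuracy more closely
logger.debug('Test Score: %s', clf.score(X_test, y_test))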
