I am running a logistic regression and would like to plot its learning curve to get a feel for the data. How can I do this? Here is my code thus far:
import numpy as np
from sklearn import metrics, preprocessing, cross_validation
from sklearn.feature_extraction.text import TfidfVectorizer
import sklearn.linear_model as lm
import pandas as p
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=' ')
print "loading data.."
traindata = list(np.array(p.read_table('train.tsv'))[:,2])
testdata = list(np.array(p.read_table('test.tsv'))[:,2])
y = np.array(p.read_table('train.tsv'))[:,-1]
tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode',
analyzer='word',token_pattern=r'\w{1,}',ngram_range=(1, 2), use_idf=1,smooth_idf=1,sublinear_tf=1)
rd = lm.LogisticRegression(penalty='l2', dual=True, tol=0.0001,
C=1, fit_intercept=True, intercept_scaling=1.0,
class_weight=None, random_state=None)
X_all = traindata + testdata
lentrain = len(traindata)
print "fitting pipeline"
tfv.fit(X_all)
print "transforming data"
X_all = tfv.transform(X_all)
X = X_all[:lentrain]
X_test = X_all[lentrain:]
print "20 Fold CV Score: ", np.mean(cross_validation.cross_val_score(rd, X, y, cv=20, scoring='roc_auc'))
print "training on full data"
rd.fit(X,y)
pred = rd.predict_proba(X_test)[:,1]
testfile = p.read_csv('test.tsv', sep="\t", na_values=['?'], index_col=1)
pred_df = p.DataFrame(pred, index=testfile.index, columns=['label'])
pred_df.to_csv('benchmark.csv')
print "submission file created.."
What I would like to create is something like this, so I can have a better understanding of what is going on:
Can anyone help me with this please?
Not quite as general as it should be, but it'll do the job with a little fiddling on your end.
from matplotlib import pyplot as plt
from sklearn import metrics
import numpy as np
def data_size_response(model, trX, teX, trY, teY, score_func, prob=True, n_subsets=20):
    train_errs, test_errs = [], []
    subset_sizes = np.exp(np.linspace(3, np.log(trX.shape[0]), n_subsets)).astype(int)
    for m in subset_sizes:
        model.fit(trX[:m], trY[:m])
        if prob:
            # pass the positive-class probability so binary scores like roc_auc_score get a 1-D array
            train_err = score_func(trY[:m], model.predict_proba(trX[:m])[:, 1])
            test_err = score_func(teY, model.predict_proba(teX)[:, 1])
        else:
            train_err = score_func(trY[:m], model.predict(trX[:m]))
            test_err = score_func(teY, model.predict(teX))
        print "training error: %.3f test error: %.3f subset size: %d" % (train_err, test_err, m)
        train_errs.append(train_err)
        test_errs.append(test_err)
    return subset_sizes, train_errs, test_errs
def plot_response(subset_sizes, train_errs, test_errs):
    plt.plot(subset_sizes, train_errs, lw=2)
    plt.plot(subset_sizes, test_errs, lw=2)
    plt.legend(['Training Error', 'Test Error'])
    plt.xscale('log')
    plt.xlabel('Dataset size')
    plt.ylabel('Error')
    plt.title('Model response to dataset size')
    plt.show()
model = # put your model here
score_func = # put your scoring function here
response = data_size_response(model,trX,teX,trY,teY,score_func,prob=True)
plot_response(*response)
The data_size_response function takes a model (in your case an instantiated LR model), a pre-split dataset (train/test X and Y arrays; you can use sklearn's train_test_split function to generate these), and a scoring function as input. It iterates through your dataset, training on n exponentially spaced subsets, and returns the "learning curve". There is also a plotting function for visualizing this response.
I would have liked to use cross_val_score like in your example, but it would require modifying the sklearn source to get back training scores in addition to the test scores it already provides. The prob argument controls whether to use the model's predict_proba or predict method, which matters for certain model/scoring function combinations, e.g. roc_auc_score.
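As a rough sketch of how you might call this for your LR model (assuming you split the vectorized X and y from your script with train_test_split and score with roc_auc_score; the variable names are just placeholders):
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score
# split the tf-idf features (X) and labels (y) from your script into train/test pieces
trX, teX, trY, teY = train_test_split(X, y, test_size=0.2, random_state=0)
# rd is your LogisticRegression instance; prob=True because roc_auc_score needs probabilities
subset_sizes, train_errs, test_errs = data_size_response(rd, trX, teX, trY, teY, roc_auc_score, prob=True)
plot_response(subset_sizes, train_errs, test_errs)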
Example plot on a subset of the MNIST dataset:
Let me know if you have any questions!
Related
I recently attended a class where the instructor was teaching us how to create a linear regression model using Python. Here is my linear regression model:
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats
import numpy as np
from sklearn.metrics import r2_score
#Define the path for the file
path=r"C:\Users\H\Desktop\Files\Data.xlsx"
#Read the file into a dataframe ensuring to group by weeks
df=pd.read_excel(path, sheet_name = 0)
df=df.groupby(['Week']).sum()
df = df.reset_index()
#Define x and y
x=df['Week']
y=df['Payment Amount Total']
#Draw the scatter plot
plt.scatter(x, y)
plt.show()
#Now we draw the line of linear regression
#First we want to look for these values
slope, intercept, r, p, std_err = stats.linregress(x, y)
#We then create a function
def myfunc(x):
    # Below is y = mx + c
    return slope * x + intercept
#Run each value of the x array through the function. This will result in a new array with new values for the y-axis:
mymodel = list(map(myfunc, x))
#We plot the scatter plot and line
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
#We print the value of r
print(r)
#We predict what the cost will be in week 23
print(myfunc(23))
The instructor said we now must use the train/test model to determine how accurate the model above is. This confused me a little as I understood it to mean we will further refine the model above. Or, does it simply mean we will use:
a normal linear regression model
a train/test model
and compare the r values and the predicted values the two different models yield? Is the train/test model considered a regression model?
I tried to create the train/test model but I'm not sure if it's correct (the packages were imported from the above example). When I run the train/test code I get the following error:
ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.
Here is the full code:
train_x = x[:80]
train_y = y[:80]
test_x = x[80:]
test_y = y[80:]
#I display the training set:
plt.scatter(train_x, train_y)
plt.show()
#I display the testing set:
plt.scatter(test_x, test_y)
plt.show()
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
myline = np.linspace(0, 6, 100)
plt.scatter(train_x, train_y)
plt.plot(myline, mymodel(myline))
plt.show()
#Let's look at how well my training data fit in a polynomial regression?
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(train_y, mymodel(train_x))
print(r2)
#Now we want to test the model with the testing data as well
mymodel = np.poly1d(np.polyfit(train_x, train_y, 4))
r2 = r2_score(test_y, mymodel(test_x))
print(r2)
#Now we can use this model to predict new values:
#We predict what the total amount would be on the 23rd week:
print(mymodel(23))
You would be better off splitting into train and test sets using sklearn's method:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Where X is your features dataframe and y is the column of your labels; test_size=0.2 gives an 80% train / 20% test split.
BTW - the error you are describing could be because your dataframe has 80 or fewer rows, leaving x[80:] empty.
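For illustration, a rough sketch applied to the Week / Payment Amount Total columns from your question (the column usage and the linregress fit come from your code; the test_size and random_state values here are arbitrary):
from sklearn.model_selection import train_test_split
from scipy import stats
from sklearn.metrics import r2_score
# split the question's x (Week) and y (Payment Amount Total) series
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
# fit the same simple linear model on the training portion only
slope, intercept, r, p, std_err = stats.linregress(x_train, y_train)
# evaluate on the held-out portion to see how well the model generalizes
y_pred = slope * x_test + intercept
print(r2_score(y_test, y_pred))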
I'm working on my first Linear Regression code using a Tech with Tim video (https://www.youtube.com/watch?v=45ryDIPHdGg) and have run into a snag. I'm using the UCI student data from here: https://archive.ics.uci.edu/ml/datasets/Student+Performance
My initial model code ran fine. Then I iterated to find an optimal accuracy model, which was fine. Where it starts to go off the rails is that I tried to then inject those optimal coefficients into a new model, then run two predictions:
The very first model (pre-optimization loop)
The optimized model
Against the same x_test1 data set. To compare the two, I simply summed the squared difference between predicted and actual y values. Then I also recorded the final accuracy of both models.
I've done something wrong because the accuracy of my new "optimized" model is the same as or lower than the very first model, and the difference values are very similar as well. I expected the optimized model to have much less error and a higher accuracy.
Can someone help me to see the error? I suspect the error lies after the plot section of code. Thanks in advance, code below.
# Import libraries
import pandas as pd
import numpy as np
import sklearn
import pickle
import matplotlib.pyplot as plt
from sklearn import linear_model
from math import sqrt
from sklearn.linear_model import LinearRegression
from matplotlib import style
# from sklearn.utils import shuffle
# Read in Data
data = pd.read_csv("student-mat.csv", sep=";")
# Slice data to include only desired headings
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
# Define the attribute we are trying to predict; called "label".
# Others are "features" and used to predict label
predict = "G3"
# Create array of features and label
X = np.array(data.drop([predict], 1))
y = np.array(data[predict])
# Split data into training and testing data. 90% used for training, 10% testing
# Test size 0.1 = 10% of array size
x_train1, x_test1, y_train1, y_test1 = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
# Create 1st linear model and fit
linear = linear_model.LinearRegression()
linear.fit(x_train1, y_train1)
# Compute accuracy of model
acc = linear.score(x_test1, y_test1)
# Iterate for a given number of times (max_iter) to find an optimal accuracy value and record best coefficients
loop_num = 1
max_iter = 1000
best_acc = acc
best_coef = linear.coef_
best_int = linear.intercept_
acc_counter = [acc]
print("\nInitial Accuracy: %4.3f" % acc)
while loop_num < max_iter + 1:
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear2 = linear_model.LinearRegression()
    linear2.fit(x_train, y_train)
    acc = linear2.score(x_test, y_test)
    acc_counter.append(acc)
    print("\nAccuracy of run " + str(loop_num) + " is: %4.3f" % acc)
    if acc > best_acc:
        print("\n\tBetter accuracy found.")
        best_acc = acc
        best_coef = linear2.coef_
        best_int = linear2.intercept_
        print("Co: \n", linear2.coef_)
        print("Intercept: \n", linear2.intercept_)
    else:
        print("\n\tFit Discarded.")
    loop_num += 1
print("\nBest Acccuracy: \n%4.3f" % best_acc)
print("\nBest Coefficients: \n", best_coef)
print("\nBest Intercept: \n", best_int)
# Plot Accuracy over time
x_scale = []
for x in range(max_iter + 1):
    x_scale.append(x)
plt.plot(x_scale, acc_counter, color='green', linestyle='dashed', linewidth=3, marker='o',
markerfacecolor='blue', markersize=5)
ymax = max(acc_counter)
ymin = min(acc_counter)
xpos = acc_counter.index(ymax)
xmax = x_scale[xpos]
annot_max_acc = str(ymax)
plt.annotate('Max Accuracy = ' + annot_max_acc[0:4], xy=(xmax, ymax), xycoords='data', xytext=(.8, .95),
textcoords='axes fraction',
arrowprops=dict(facecolor='black', shrink=0.05), horizontalalignment='right', verticalalignment='top')
plt.ylim(ymin, 1.0)
plt.xlabel('Run Number')
plt.ylabel('Accuracy')
plt.title('Prediction Accuracy over Time')
plt.show()
# Create model with best coefficients from above
new_model = linear_model.LinearRegression()
new_model.intercept_ = best_int
new_model.coef_ = best_coef
# Predict y values for 1st model (not best) then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp = []
predictions = linear.predict(x_test1)
for x in range(len(predictions)):
    print(predictions[x], x_test1[x], y_test1[x])
    diff = sqrt((predictions[x] - y_test1[x])**2)
    print("\tDifference is ", diff)
    comp.append(diff)
print("\n\n\nBREAK")
print(comp)
print("\nSum of errors is ", sum(comp))
# Predict y values of best model (with optimal coefficients from above) using same x_test1 values as 1st model
# then compute difference between predictions and actual values
print("\n\n\nBREAK")
comp2 = []
predictions_new_model = new_model.predict(x_test1)
for x in range(len(predictions_new_model)):
    print(predictions_new_model[x], x_test1[x], y_test1[x])
    diff2 = sqrt((predictions_new_model[x] - y_test1[x])**2)
    print("\tDifference is ", diff2)
    comp2.append(diff2)
print("\n\n\nBREAK")
print(comp2)
print("\nSum of errors is ", sum(comp2))
print("\n\n\nFirst model fit difference: ", sum(comp))
print("\nSecond model fit difference ", sum(comp2))
print('\n1st model score: ',linear.score(x_train1, y_train1))
print('\nBest model score: ',new_model.score(x_train1, y_train1))
Looking at your code, I realize that you're using the same model (LinearRegression) and not changing any hyperparameters between runs, so there is no real improvement to be found. The slight differences come from the fact that you split the data again on every run without giving a random seed, so each run simply sees a different split. To actually improve the model you have to change the hyperparameters of the estimator. See more here: hyperparameter tuning
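A small sketch of the random-seed point (variable names follow your code; the random_state value is arbitrary):
import sklearn.model_selection as ms
from sklearn import linear_model
# with a fixed random_state the split, and therefore the score, is reproducible;
# re-running the loop in the question just re-rolls this split
x_train, x_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.1, random_state=42)
linear = linear_model.LinearRegression().fit(x_train, y_train)
print(linear.score(x_test, y_test))
# cross-validation gives a better picture of the score's spread than hunting for the luckiest split
scores = ms.cross_val_score(linear_model.LinearRegression(), X, y, cv=10)
print(scores.mean(), scores.std())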
I am new to statistical modelling so please forgive me if I am mistaken about this.
I am currently working on a function in Python which will predict the accuracy score of a logistic regression model on the test data set. The user will have the flexibility to supply model parameters/coefficients other than the ones generated by training the model (this is part of the requirement). I have functional code which updates the coefficients, but the accuracy/predictions on the test data set stay the same no matter what model parameters I supply. My understanding is that the score on the test set should change if I change the model coefficients?
I am using the statsmodels library to make things easier for me, and I am following this link. Can someone please help me understand what I am missing? Below is the code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.formula.api as sm
from sklearn.model_selection import train_test_split
data = pd.read_csv("E:\\Dev\\testing\\rawdata.txt", header=None,
names=['Exam1', 'Exam2', 'Admitted'])
X = data.copy() # our training data
y = X.Admitted.copy() # copy "y" column values out
X.drop(['Admitted'], axis=1, inplace=True) # then, drop y column
# manually add the intercept
X['intercept'] = 1.0 # so we don't need to use sm.add_constant every time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
model = sm.Logit(y_train, X_train)
result = model.fit()
print("old parameters :\n" + str(list(result.params)))
#New parameters supplied
mdict = { 'Exam1':10000000.2234, 'Exam2':1.1233423, 'intercept':2313.423 }
result.params = mdict
print("New parameters: \n"+str(result.params))
def logitPredict(modelParams, X, threshold):
    probabilities = modelParams.predict(X)
    return [1 if x >= threshold else 0 for x in probabilities]
predictions = logitPredict(result, X_test, .5)
accuracy = np.mean(predictions == y_test)
#accuracy always remains same as train model
print ('accuracy = {0}%'.format(accuracy*100) )
#test sample
myExams = pd.DataFrame({'Exam1': [40.], 'Exam2': [78.], 'intercept': [1.]})
myExams
print ('Your probability = {0}%'.format(result.predict(myExams)[0]*100))
For the same dataset (here Bupa) and parameters I get different accuracies.
What did I overlook?
R implementation:
data_file = "bupa.data"
dataset = read.csv(data_file, header = FALSE)
nobs <- nrow(dataset) # 303 observations
sample <- train <- sample(nrow(dataset), 0.95*nobs) # 227 observations
# validate <- sample(setdiff(seq_len(nrow(dataset)), train), 0.1*nobs) # 30 observations
test <- setdiff(seq_len(nrow(dataset)), train) # 76 observations
svmfit <- svm(V7~ .,data=dataset[train,],
type="C-classification",
kernel="linear",
cost=1,
cross=10)
testpr <- predict(svmfit, newdata=na.omit(dataset[test,]))
accuracy <- sum(testpr==na.omit(dataset[test,])$V7)/length(na.omit(dataset[test,])$V7)
I get accuracy: 0.94
but when I do the following in Python (scikit-learn)
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
import pandas as pd
from sklearn import svm, grid_search
f = open("data/bupa.data")
dataset = np.loadtxt(fname = f, delimiter = ',')
nobs = np.shape(dataset)[0]
print("Number of Observations: %d" % nobs)
y = dataset[:,6]
X = dataset[:,:-1]
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.06, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
scores = cross_validation.cross_val_score(clf, X, y, cv=10, scoring='accuracy')
I get accuracy 0.67
please help me.
I came across this post having the same issue - wildly different accuracy between scikit-learn and the e1071 bindings for libSVM. I think the issue is that e1071 scales the training data and then keeps the scaling parameters for use in predicting new observations. Scikit-learn does not do this and leaves it up to the user to realize that the same scaling approach needs to be taken on both training and test data. I only thought to check this after encountering and reading this guide from the nice people behind libSVM.
While I don't have your data, str(svmfit) should give you the scaling params (mean and standard deviation of the columns of Bupa). You can use these to appropriately scale your data in Python (see below for an idea). Alternatively, you can scale the entire dataset together in Python and then do the test/train split; either way should now give you identical predictions.
def manual_scale(a, means, sds):
    a1 = a - means
    a1 = a1 / sds
    return a1
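One possible way to use it, reusing the train/test arrays and the svm import from your Python snippet (a sketch; the key point is that the same means/sds, computed from the training data, are applied to both sets):
# compute the scaling parameters from the training data only
# (str(svmfit) in R reports the analogous centering/scale values)
means = X_train.mean(axis=0)
sds = X_train.std(axis=0, ddof=1)  # R uses the N-1 definition of the standard deviation
X_train_scaled = manual_scale(X_train, means, sds)
X_test_scaled = manual_scale(X_test, means, sds)
clf = svm.SVC(kernel='linear', C=1).fit(X_train_scaled, y_train)
print(clf.score(X_test_scaled, y_test))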
When using Support Vector Regression in Python/sklearn and R/e1071, both the x and y variables need to be scaled/unscaled.
Here is a self-contained example using rpy2 to show equivalence of R and Python results (first part with disabled scaling in R, second part with 'manual' scaling in Python):
# import modules
import matplotlib.pyplot as plt
import numpy as np
import sklearn
import sklearn.model_selection
import sklearn.datasets
import sklearn.preprocessing
import sklearn.svm
import rpy2
import rpy2.robjects
import rpy2.robjects.packages
# use R e1071 SVM function via rpy2
def RSVR(x_train, y_train, x_test,
         cost=1.0, epsilon=0.1, gamma=0.01, scale=False):
    # convert Python arrays to R matrices
    rx_train = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_train).T.flatten()), nrow=len(x_train))
    ry_train = rpy2.robjects.FloatVector(np.array(y_train).flatten())
    rx_test = rpy2.robjects.r['matrix'](rpy2.robjects.FloatVector(np.array(x_test).T.flatten()), nrow=len(x_test))
    # train SVM
    e1071 = rpy2.robjects.packages.importr('e1071')
    rsvr = e1071.svm(x=rx_train,
                     y=ry_train,
                     kernel='radial',
                     cost=cost,
                     epsilon=epsilon,
                     gamma=gamma,
                     scale=scale)
    # run SVM
    predict = rpy2.robjects.r['predict']
    ry_pred = np.array(predict(rsvr, rx_test))
    return ry_pred
# define auxiliary function for plotting results
def plot_results(y_test, py_pred, ry_pred, title, lim=[-500, 500]):
    plt.title(title)
    plt.plot(lim, lim, lw=2, color='gray', zorder=-1)
    plt.scatter(y_test, py_pred, color='black', s=40, label='Python/sklearn')
    plt.scatter(y_test, ry_pred, color='orange', s=10, label='R/e1071')
    plt.xlabel('observed')
    plt.ylabel('predicted')
    plt.legend(loc=0)
    return None
# get example regression data
x_orig, y_orig = sklearn.datasets.make_regression(n_samples=100, n_features=10, random_state=42)
# split into train and test set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x_orig, y_orig, train_size=0.8)
# SVM parameters
# (identical but named differently for R/e1071 and Python/sklearn)
C = 1000.0
epsilon = 0.1
gamma = 0.01
# setup SVM and scaling classes
psvr = sklearn.svm.SVR(kernel='rbf', C=C, epsilon=epsilon, gamma=gamma)
x_sca = sklearn.preprocessing.StandardScaler()
y_sca = sklearn.preprocessing.StandardScaler()
# run R and Python SVMs without any scaling
# (see 'scale=False')
py_pred = psvr.fit(x_train, y_train).predict(x_test)
ry_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=False)
# scale both x and y variables
sx_train = x_sca.fit_transform(x_train)
sy_train = y_sca.fit_transform(y_train.reshape(-1, 1))[:, 0]
sx_test = x_sca.transform(x_test)
sy_test = y_sca.transform(y_test.reshape(-1, 1))[:, 0]
# run Python SVM on scaled data and invert scaling afterwards
ps_pred = psvr.fit(sx_train, sy_train).predict(sx_test)
ps_pred = y_sca.inverse_transform(ps_pred.reshape(-1, 1))[:, 0]
# run R SVM with native scaling on original/unscaled data
# (see 'scale=True')
rs_pred = RSVR(x_train, y_train, x_test,
cost=C, epsilon=epsilon, gamma=gamma, scale=True)
# plot results
plt.subplot(121)
plot_results(y_test, py_pred, ry_pred, 'without scaling (Python/sklearn default)')
plt.subplot(122)
plot_results(y_test, ps_pred, rs_pred, 'with scaling (R/e1071 default)')
plt.tight_layout()
UPDATE: Actually, the scaling uses a slightly different definition of variance in R and Python, see this answer (1/(N-1)... in R vs. 1/N... in Python where N is the sample size). However, for typical sample sizes, this should be negligible.
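For example, the difference boils down to the divisor used when computing the standard deviation (a quick numpy check):
import numpy as np
x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.std(x, ddof=0))  # 1/N definition, as used by sklearn's StandardScaler
print(np.std(x, ddof=1))  # 1/(N-1) definition, as used by R's scale()/e1071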
I can confirm these statements. One indeed needs to apply the same scaling to the train and test sets. In particular I have done this:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X = sc_X.fit_transform(X)
where X is my training set. Then, when preparing the test set, I simply used the StandardScaler instance obtained from scaling the training set. It is important to use it just for transforming, not for fitting and transforming (like above), i.e.:
X_test = sc_X.transform(X_test)
This allowed me to obtain substantial agreement between the R and scikit-learn results.
I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process.
However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
Random Forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem.
Please note that I want to use a method that will explicitly tell me what features from my pandas DataFrame were selected in the optimal grouping as I am using recursive feature selection to try to minimize the amount of data I will input into the final classifier.
Here's some example code:
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
ranking_ = rfe.fit(X_train, y_train).ranking_
File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
Here's what I've done to adapt RandomForestClassifier to work with RFECV:
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
Just using this class does the trick if you use 'accuracy' or 'f1' scoring. For 'roc_auc', RFECV complains that the multiclass format is not supported. Changing the problem to two-class classification with the code below makes the 'roc_auc' scoring work. (Using Python 3.4.1 and scikit-learn 0.15.1.)
y=(pd.Series(iris.target, name='target')==2).astype(int)
Plugging into your code:
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=(pd.Series(iris.target, name='target')==2).astype(int)
rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
selector=rfecv.fit(x, y)
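Since you want to know explicitly which DataFrame columns were selected: after fitting, RFECV exposes a boolean support_ mask along with ranking_ and n_features_, so something like this should work on the selector above (a short sketch using those standard attributes):
# boolean mask over x.columns of the features kept in the optimal subset
selected_columns = x.columns[selector.support_]
print("optimal number of features: %d" % selector.n_features_)
print("selected features: %s" % list(selected_columns))
print("feature ranking (1 = selected): %s" % list(selector.ranking_))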
This is my code, I've tidied it up a bit to make it relevant to your task:
features_to_use = fea_cols # this is a list of features
# empty dataframe
trim_5_df = DataFrame(columns=features_to_use)
run=1
# this will remove the 5 worst features determined by their feature importance computed by the RF classifier
while len(features_to_use) > 6:
    print('number of features:%d' % (len(features_to_use)))
    # build the classifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
    # train the classifier
    clf.fit(train[features_to_use], train['OpenStatusMod'].values)
    print('classifier score: %f\n' % clf.score(train[features_to_use], df['OpenStatusMod'].values))
    # predict the class and print the classification report, f1 micro, f1 macro score
    pred = clf.predict(test[features_to_use])
    print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
    print('micro score: ')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
    print('macro score:\n')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
    # predict the class probabilities
    probs = clf.predict_proba(test[features_to_use])
    # rescale the priors
    new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
    # calculate logloss with the rescaled probabilities
    print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
    row = {}
    if hasattr(clf, "feature_importances_"):
        # sort the features by importance
        sorted_idx = np.argsort(clf.feature_importances_)
        # reverse the order so it is descending
        sorted_idx = sorted_idx[::-1]
        # add to dataframe
        row['num_features'] = len(features_to_use)
        row['features_used'] = ','.join(features_to_use)
        # trim the worst 5
        sorted_idx = sorted_idx[:-5]
        # swap the features list with the trimmed features
        temp = features_to_use
        features_to_use = []
        for feat in sorted_idx:
            features_to_use.append(temp[feat])
    # add the logloss performance
    row['logloss'] = [log_loss(test['OpenStatusMod'].values, new_probs)]
    print('')
    # add the row to the dataframe
    trim_5_df = trim_5_df.append(DataFrame(row))
    run += 1
So what I'm doing here is: I have a list of features I want to train on and then predict against; using the feature importances, I trim the worst 5 and repeat. During each run I add a row to record the prediction performance so that I can do some analysis later.
The original code was much bigger; I had different classifiers and datasets I was analysing, but I hope you get the picture from the above. The thing I noticed was that, for random forest, the number of features I removed on each run affected the performance, so trimming 1, 3 and 5 features at a time resulted in different sets of best features.
I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1 feature at a time or 3 or 5.
I hope I'm not teaching you to suck eggs here; you probably know more than me, but my approach to ablative analysis was to use a fast classifier to get a rough idea of the best sets of features, then use a better-performing classifier, then start hyperparameter tuning, again doing coarse-grained comparisons and then fine-grained ones once I got a feel for what the best params were.
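If it helps, a minimal sketch of that coarse-then-fine idea with GridSearchCV (using the modern sklearn.model_selection API; the grids are made-up examples, and X_train/y_train stand in for your training split):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# coarse pass: a wide, sparse grid to find the right neighbourhood
coarse = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'n_estimators': [100, 500, 1000], 'max_depth': [None, 5, 20]},
                      cv=3)
coarse.fit(X_train, y_train)
print(coarse.best_params_)
# fine pass: a narrower grid around the best coarse values
fine = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'n_estimators': [400, 500, 600], 'max_depth': [15, 20, 25]},
                    cv=5)
fine.fit(X_train, y_train)
print(fine.best_params_, fine.best_score_)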
I submitted a request to add coef_ so RandomForestClassifier may be used with RFECV. However, the change had already been made. This change will be in version 0.17.
https://github.com/scikit-learn/scikit-learn/issues/4945
You can pull the latest dev build if you want to use it now.
Here's what I ginned up. It's a pretty simple solution, and relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But, it should be easily made more extensible if desired.
from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix
def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """enhances confusion_matrix by adding sensitivity and specificity metrics"""
    cm = confusion_matrix(actuals, predictions, labels=labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0] + cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0] + cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy
iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')
response, _ = pandas.factorize(y)
xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x, response, test_size = .25, random_state = 36583)
print "building the first forest"
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
importances = pandas.DataFrame({'name': x.columns, 'imp': rf.feature_importances_}
                               ).sort(['imp'], ascending=False).reset_index(drop=True)
cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)
rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures],
'weightedAccuracy':[weightedAccuracy],
'sensitivity':[sensitivity],
'specificity':[specificity]})
print "running RFE on %d features"%numFeatures
for i in range(1, numFeatures, 1):
    varsUsed = importances['name'][0:i]
    print "now using %d of %s features" % (len(varsUsed), numFeatures)
    xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x[varsUsed], response, test_size=.25)
    rf = RandomForestClassifier(n_estimators=500, min_samples_split=2,
                                n_jobs=-1, verbose=1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n" + str(cm))
    print('the sensitivity is %d percent' % (sensitivity * 100))
    print('the specificity is %d percent' % (specificity * 100))
    print('the weighted accuracy is %d percent' % (weightedAccuracy * 100))
    rfeMatrix = rfeMatrix.append(
        pandas.DataFrame({'numFeatures': [len(varsUsed)],
                          'weightedAccuracy': [weightedAccuracy],
                          'sensitivity': [sensitivity],
                          'specificity': [specificity]}), ignore_index=True)
print("\n"+str(rfeMatrix))
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()
print "the final features used are %s"%featuresUsed