I'm having issues with attribute errors when implementing a linear SVM with scikit-learn. I'm using a linear classifier with cross-validation through the RFECV method, and I can't access any of the attributes of the SVC. I'm not sure whether it has to do with the feature selection or the base model.
from sklearn import svm, metrics
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split

f1, kappa, precision, recall = [], [], [], []
model = svm.SVC(kernel='linear')
selector = RFECV(model)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=pct_test)
selector = selector.fit(X_train, Y_train)
my_prediction = selector.predict(X_test)
f1.append(metrics.f1_score(Y_test, my_prediction))
kappa.append(metrics.cohen_kappa_score(Y_test, my_prediction))
precision.append(metrics.precision_score(Y_test, my_prediction))
recall.append(metrics.recall_score(Y_test, my_prediction))
# these three lines are the ones that raise the AttributeError
print model.intercept_
print model.support_vectors_
print model.coef_
The metrics all work fine; the attribute accesses all fail.
The error message is:
AttributeError: 'SVC' object has no attribute 'intercept_'
Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
Aside: I'm very new to OOP. If there's an underlying concept I'm missing, please elaborate or send over a link.
You are fitting (training on the data) the RFECV object selector, but trying to access attributes of the SVC object model. model itself was never trained, hence it has no intercept_ attribute.
To access the intercept of SVC, you should use:
selector.estimator_.intercept_
But understand that this estimator is fitted only on the reduced dataset (after the specified features have been eliminated).
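A minimal sketch of the working pattern (assuming X_train and Y_train as in the question):

from sklearn.svm import SVC
from sklearn.feature_selection import RFECV

selector = RFECV(SVC(kernel='linear'))
selector = selector.fit(X_train, Y_train)
# estimator_ is the fitted clone, trained on the selected features only
print(selector.estimator_.intercept_)
print(selector.estimator_.coef_)
# boolean mask telling you which original features were kept
print(selector.support_)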
Explanation:
You see, RFECV internally uses RFE to find the important features in each iteration, and RFE clones the supplied estimator for that purpose. So when you initialize RFECV with model, the training happens on a clone of model.
Checking the source code:
Line 407 (inside the fit method of RFECV):
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
Line 428 (for estimating the scores):
scores = parallel(func(rfe, self.estimator, X, y, train, test, scorer)
                  for train, test in cv.split(X, y))
And then Line 165 (Inside fit method of RFE):
estimator = clone(self.estimator)
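A quick illustration of what clone does (a minimal sketch, reusing X_train and Y_train from above): it returns a fresh, unfitted estimator with the same parameters, so the original object is never modified.

from sklearn.base import clone
from sklearn.svm import SVC

model = SVC(kernel='linear')
fitted_copy = clone(model).fit(X_train, Y_train)
# fitted_copy now has intercept_, coef_, etc.; model still does not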
Related
I tried to use this permutation importance code:
perm_importance = permutation_importance(model, x_test, y_test, random_state=7)
Error seen:
"TypeError: estimator should be an estimator implementing 'fit' method"
The question:
Is there a way to compute permutation variable importance after training with xgb.train instead of the scikit-learn wrapper's fit()?
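I'm not aware of a direct way with a raw Booster, but one common workaround (a hedged sketch, assuming x_train, y_train, x_test, y_test are already defined) is to train through xgboost's scikit-learn wrapper, which does implement fit() and therefore satisfies permutation_importance:

import xgboost as xgb
from sklearn.inspection import permutation_importance

# XGBClassifier wraps the same booster behind a scikit-learn interface;
# the hyperparameters here are placeholders
model = xgb.XGBClassifier(n_estimators=100)
model.fit(x_train, y_train)
perm_importance = permutation_importance(model, x_test, y_test, random_state=7)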
I am creating a custom scorer for a GridSearchCV object. For the custom scorer I need probabilities from two different dataframes, but the model should only be trained on one of them; the other dataframe is needed only to get probabilities, which will then be used in the scoring function.
I had considered concatenating the dataframes, but one of them has no ground truth, which would create an issue with passing y_true.
I had also tried to pass the model to the custom score function, but I got a traceback saying the model was not fit. Here is an example of what I am trying to do:
def fit(self, X_train, y_train, X_info):
    grid = self._create_grid_search()
    clf = GradientBoostingClassifier()
    score_func = make_scorer(self.make_custom_score, needs_proba=True,
                             clf=clf, X_info=X_info)
    model = GridSearchCV(estimator=clf,
                         param_grid=grid,
                         scoring=score_func,
                         cv=3)

def make_custom_score(self, y_true, y_score, clf, X_info):
    ...
I found this question: SKLearn cross-validation: How to pass info on fold examples to my scorer function?
which seems like it might be a possibility. That approach would be to write a scoring function of the form scorer(estimator, X, y), but I think it would still have the issue that the model is trained on all of the data. Is there any way to pass the estimator into the custom score function used by GridSearchCV?
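One pattern that may fit (a sketch under assumptions, reusing clf, grid, and X_info from the question; my_custom_metric is a hypothetical stand-in for your metric): GridSearchCV accepts any callable of the form scorer(estimator, X, y), and the estimator it passes in is the clone fitted only on the current fold's training split, not on all of the data. X_info can be captured in a closure:

def make_info_scorer(X_info):
    def scorer(estimator, X, y):
        # estimator is already fitted on this CV fold's training data
        proba_fold = estimator.predict_proba(X)
        proba_info = estimator.predict_proba(X_info)
        return my_custom_metric(y, proba_fold, proba_info)  # hypothetical metric
    return scorer

model = GridSearchCV(estimator=clf, param_grid=grid,
                     scoring=make_info_scorer(X_info), cv=3)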
I am trying to follow this tutorial to learn machine-learning-based prediction, but I have two questions about it.
Ques1: How do I set n_estimators in the piece of code below? Otherwise it will always assume the default value.
from sklearn.cross_validation import KFold

def run_cv(X, y, clf_class, **kwargs):
    # Construct a KFold object
    kf = KFold(len(y), n_folds=5, shuffle=True)
    y_pred = y.copy()
    # Iterate through folds
    for train_index, test_index in kf:
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with keyword arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
It is being called as:
from sklearn.svm import SVC
print "%.3f" % accuracy(y, run_cv(X,y,SVC))
Ques2: How do I use the already-trained model (e.g. the one obtained from the SVM) to predict on new (test) data that I didn't use for training?
For your first question: n_estimators is a parameter of ensemble classifiers such as RandomForestClassifier (SVC does not accept it), and anything you pass through **kwargs is forwarded to the classifier's initializer in the step clf = clf_class(**kwargs), as shown below.
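For example (a sketch, assuming RandomForestClassifier is the estimator you want to tune):

from sklearn.ensemble import RandomForestClassifier

# n_estimators is forwarded through **kwargs to RandomForestClassifier(...)
y_pred = run_cv(X, y, RandomForestClassifier, n_estimators=100)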
For your second question, the cross-validation in the code you've linked is just for model evaluation, i.e. comparing different types of models and hyperparameters and estimating the likely effectiveness of your model in production. Once you've decided on your model, you need to refit it on the whole dataset:
clf.fit(X,y)
Then you can get predictions with clf.predict or clf.predict_proba.
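If you also want to save the fitted model to a file and reuse it later, a minimal sketch (X_new is a placeholder for whatever unseen data you want to score; the joblib import path matches this sklearn era, while newer versions use a top-level import joblib):

from sklearn.externals import joblib

clf.fit(X, y)                       # refit on the whole dataset
joblib.dump(clf, 'svm_model.pkl')   # save to disk

# later, possibly in another session
clf = joblib.load('svm_model.pkl')
predictions = clf.predict(X_new)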
I am trying to reproduce the example here but using RandomForestClassifier.
I can't see how to transform this part of the code:
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
I tried
# Learn to predict each class against the other
classifier = OneVsRestClassifier(RandomForestClassifier())
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
but I get
AttributeError: Base estimator doesn't have a decision_function attribute.
Is there a workaround?
Well, you should know what decision_function is used for. It makes sense for an SVM classifier because it gives the distance of each data point from the hyperplane that separates the classes; a RandomForestClassifier has no such hyperplane, so calling decision_function on it makes no sense. You can use other methods that RandomForestClassifier does support: use predict_proba if you want the probabilities of your classified data points.
Here is the reference for the supported functions
Just to mention: RandomForestClassifier does support oob_decision_function_, which is the out-of-bag estimate on your training set (computed when the forest is built with oob_score=True).
So just replace your line like this:
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
or
y_score = classifier.fit(X_train, y_train).predict(X_test)
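For completeness, a minimal sketch of the predict_proba variant in the context of that ROC example (it assumes y_train and y_test were binarized with label_binarize and that random_state is defined, as in the original example):

from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_curve, auc

classifier = OneVsRestClassifier(RandomForestClassifier(random_state=random_state))
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# one ROC curve per class; here, class 0
fpr, tpr, _ = roc_curve(y_test[:, 0], y_score[:, 0])
roc_auc = auc(fpr, tpr)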
I am using sklearn for SVM training, and I am using cross-validation to evaluate the estimator and avoid overfitting.
I split the data into two parts: train data and test data. Here is the code:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)
print scores
Now I need to evaluate the estimator clf on X_test.
clf.score(X_test, y_test)
Here I get an error saying that the model is not fitted with fit(). But isn't the model fitted inside cross_val_score? What is the problem?
cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset, and it automatically performs one or more rounds of cross-validation by splitting your data into random training/validation sets, fitting on the training set, and computing the score on the validation set. See the documentation here for an example and more explanation.
The reason clf.score(X_test, y_test) raises an exception is that cross_val_score performs the fitting on copies of the estimator rather than the original (see the use of clone(estimator) in the source code here). Because of this, clf remains unchanged, and unfitted, outside of the function call, so it cannot score anything until you fit it yourself.
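In other words, the fix is simply to fit clf yourself before scoring on the held-out data:

clf = svm.SVC(kernel='linear', C=1)
clf.fit(X_train, y_train)         # fit on the training split
print(clf.score(X_test, y_test))  # now scoring works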