OneVsRestClassifier and random forest - python

I am trying to reproduce the example here but using RandomForestClassifier.
I can't see how to transform this part of the code:
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
I tried
# Learn to predict each class against the other
classifier = OneVsRestClassifier(RandomForestClassifier())
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
but I get
AttributeError: Base estimator doesn't have a decision_function
attribute.
Is there a workaround?

First, consider what decision_function is for. With an SVM classifier it returns the signed distance of each data point from the hyperplane that separates the data. RandomForestClassifier has no such hyperplane and does not implement decision_function, which is why OneVsRestClassifier raises the error. Use one of the methods that RandomForestClassifier does support instead: predict_proba gives you the class probabilities of your data points.
Here is the reference for the supported functions
Note that RandomForestClassifier does expose oob_decision_function_, the out-of-bag estimate computed on your training set (available when the forest is built with oob_score=True).
So just replace your line with
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
or, if you only need the predicted labels rather than scores,
y_score = classifier.fit(X_train, y_train).predict(X_test)
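For reference, here is a minimal sketch of the predict_proba route end to end. The iris data, the 50/50 split and the per-class AUC at the end are my own assumptions for illustration; the original example's data and metric are not shown in the question.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

# Illustrative data only: three classes, four features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# One random forest per class, each trained as "this class vs the rest".
classifier = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=50, random_state=0)
)

# predict_proba takes the place of decision_function: one score column per class.
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)

# Example use of the scores: one-vs-rest ROC AUC for each class.
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
per_class_auc = [
    roc_auc_score(y_test_bin[:, i], y_score[:, i]) for i in range(3)
]
print(y_score.shape, per_class_auc)
```

Each column of y_score plays the role that the corresponding decision_function column plays in the SVC version.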

Related

predict_proba method available in OneVsRestClassifier

I am using sklearn's OneVsOneClassifier in a pipeline like so:
smt = SMOTE(random_state=42)
base_model = LogisticRegression()
pipeline = Pipeline([('sampler', smt), ('model', base_model)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X_train, y_train)
# prediction
yhat = classifier.predict(X_test)
But then I cannot do:
yhat_prob = classifier.predict_proba(X_test)
AttributeError: 'OneVsOneClassifier' object has no attribute 'predict_proba'
scikit-learn's OneVsRestClassifier does provide a predict_proba method. I am surprised OneVsOneClassifier doesn't have this method.
How do I then get class probability estimates from my pipeline above?
It's not clear how to turn the pairwise one-vs-one results into class probabilities, so predict_proba is not implemented for OneVsOneClassifier. See https://github.com/scikit-learn/scikit-learn/issues/6164
There is the decision_function method, which returns a score per class and so gives a more nuanced version of predict.
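A small sketch of what decision_function gives you here. To keep it self-contained I assume plain LogisticRegression on the iris data and leave out the SMOTE step from the question; those choices are mine, not the asker's setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# Illustrative three-class data (assumption, not the asker's dataset).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, stratify=y
)

classifier = OneVsOneClassifier(LogisticRegression(max_iter=1000))
classifier.fit(X_train, y_train)

# No predict_proba on the one-vs-one wrapper.
has_proba = hasattr(classifier, "predict_proba")

# decision_function returns one score per class for every sample.
scores = classifier.decision_function(X_test)

# predict picks the class with the highest score in each row.
yhat = classifier.predict(X_test)
yhat_from_scores = classifier.classes_[scores.argmax(axis=1)]
print(has_proba, scores.shape)
```

The scores are not probabilities (they do not sum to one), but they can be used for ranking or thresholding.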

Cross val predict expects as input an already fitted model?

I am reading Geron's Hands-on Machine Learning. On page 90, there is a section about the confusion matrix. He says that we need some predictions, so he does the following:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train5, cv=3)
This object sgd_clf is a stochastic gradient descent classifier which was fitted with the train data in the previous section. My question is: if sgd_clf is already trained, why is it better to split the train set into three parts, retrain (?) sgd_clf on two of them, make a prediction on the third, and so on? Why not just let it predict on the full X_train? Or just take a new, unfitted classifier as input? Why pass an already trained sgd_clf as input only to retrain it? I am a bit confused.
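I see your confusion, and I think Geron doesn't mean you should use the fitted model for cross-validation. cross_val_predict works on clones of the estimator you pass in, so any earlier fit is ignored and a fresh copy is trained on each set of training folds. He just wants to compare the naive fitting method with cross-validation.
The complete code should be as follows: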
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
# No cross-validation
sgd_clf1 = SGDClassifier(random_state=42)
sgd_clf1.fit(X_train, y_train5)
# With cross-validation
sgd_clf2 = SGDClassifier(random_state=42)
cross_val_score(sgd_clf2, X_train, y_train5, cv=3, scoring='accuracy')
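To see that passing an already fitted classifier makes no difference, here is a quick check. The synthetic data from make_classification is an assumption standing in for the book's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic binary problem (assumption, stands in for the book's data).
X_train, y_train = make_classification(n_samples=300, random_state=42)

# Fit on the full training set first, as in the earlier section of the book.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train)
coef_before = sgd_clf.coef_.copy()

# cross_val_predict fits clones on the training folds; each prediction comes
# from a model that did not see that sample during training.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train, cv=3)

# The estimator passed in is left exactly as it was.
unchanged = np.array_equal(coef_before, sgd_clf.coef_)
print(y_train_pred.shape, unchanged)
```

Calling sgd_clf.predict(X_train) instead would score the model on data it was trained on, which is why the out-of-fold predictions are preferred for the confusion matrix.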

attribute error in sklearn svm.SVC

I'm having issues with attribute errors when implementing a linear SVM with scikit-learn. I'm using a linear classifier with cross-validation through the RFECV method, and I can't access any of the attributes of the SVC. I'm not sure whether it has to do with the feature selection or the base model.
model = svm.SVC(kernel='linear')
selector=RFECV(model)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=pct_test)
selector=selector.fit(X_train, Y_train)
my_prediction = selector.predict(X_test)
f1.append(metrics.f1_score(Y_test, my_prediction))
kappa.append(metrics.cohen_kappa_score(Y_test, my_prediction))
precision.append(metrics.precision_score(Y_test, my_prediction))
recall.append(metrics.recall_score(Y_test, my_prediction))
print(model.intercept_)
print(model.support_vectors_)
print(model.coef_)
Metrics work fine, attributes all fail.
The error message is:
AttributeError: 'SVC' object has no attribute 'intercept_'
Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
Aside: I'm very new to OOP. If there's an underlying concept I'm missing, please elaborate or send over a link.
You are fitting (training) the RFECV object selector, but trying to access attributes of the SVC object model, which was never trained. Hence it has no attribute intercept_.
To access the intercept of SVC, you should use:
selector.estimator_.intercept_
But note that this estimator is fitted only on the reduced dataset (after eliminating features as specified).
Explanation:
RFECV internally uses RFE to find the important features in each iteration, and RFE clones the supplied estimator for that purpose. So when you initialize RFECV with model, the training happens on a clone of model, not on model itself.
Checking the source code:
Line 407 (inside the fit method of RFECV):
rfe = RFE(estimator=self.estimator,
          n_features_to_select=n_features_to_select,
          step=self.step, verbose=self.verbose)
Line 428 (for estimating the scores):
scores = parallel(func(rfe, self.estimator, X, y, train, test, scorer)
                  for train, test in cv.split(X, y))
And then Line 165 (Inside fit method of RFE):
estimator = clone(self.estimator)
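The effect of the cloning can be checked directly. This is a minimal sketch on synthetic data (make_classification and its parameters are my assumption, not the asker's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

# Synthetic binary data for illustration only.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

model = SVC(kernel="linear")
selector = RFECV(model, cv=3).fit(X, y)

# The object passed in stays unfitted, so it has no fitted attributes.
model_is_fitted = hasattr(model, "intercept_")

# The fitted clone lives on the selector and only sees the selected features.
fitted_svc = selector.estimator_
print(model_is_fitted)
print(fitted_svc.intercept_)
print(fitted_svc.coef_.shape, selector.n_features_)
```

So coef_ has one column per selected feature, not per original feature; selector.support_ tells you which original columns those are.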

Using cross_val_predict against test data set

I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation has been removed; these now live in sklearn.model_selection
from sklearn.model_selection import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
# KFold now takes the number of splits instead of the sample count (3 was the old default)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
cross_val_score and cross_val_predict do not leave you with a fitted model: they fit copies of the estimator on each training fold on the fly and discard them afterwards. If you look at the documentation (section 3.1.1.1), you'll see that fit is never called on the estimator beforehand.
As @DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross-validation, which lets you estimate the accuracy of your model. Take a look at sklearn's cross-validation page.
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with the method that does not apply CV. The CV accuracy is validated on X_train and y_train. The other method fits the model on X_train and y_train and is tested on X_test and y_test. So the comparison is not fair, since the two are evaluated on different datasets.
What you can do is use the estimators returned by cross_validate:
from sklearn.model_selection import cross_validate
lr_fit = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
# cross_validate returns a dict; lr_fit["estimator"] holds one fitted model per fold
y_pred = lr_fit["estimator"][0].predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()
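Putting the pieces of this thread together, a sketch of the whole workflow might look like the following. The synthetic data and the train/test split are assumptions standing in for train_df and test_df.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-in for train_df / test_df (assumption).
X, y = make_classification(n_samples=400, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lr = RandomForestClassifier(
    random_state=1, class_weight="balanced", n_estimators=25, max_depth=6
)
kf = KFold(n_splits=3)

# Cross-validated accuracy, computed on the training data only.
cv_results = cross_validate(lr, X_train, y_train, cv=kf, return_estimator=True)
cv_accuracy = cv_results["test_score"].mean()

# One fitted model per fold; each one can be scored on the held-out test set.
fold_test_accuracy = [
    est.score(X_test, y_test) for est in cv_results["estimator"]
]

# The usual final step: refit on all training data, then predict on the test set.
final_model = lr.fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(cv_accuracy, fold_test_accuracy, test_accuracy)
```

The per-fold models were each trained on only part of the training data, so refitting on all of it is normally what you want for the final predictions.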

Prediction for RBM in scikit

I would like to use an RBM in scikit-learn. I can define and train an RBM like many other classifiers.
from sklearn.neural_network import BernoulliRBM
clf = BernoulliRBM(random_state=0, verbose=True)
clf.fit(X_train, y_train)
But I can't seem to find a function that makes me a prediction. I am looking for an equivalent for one of the following in scikit.
y_score = clf.decision_function(X_test)
y_score = clf.predict(X_test)
Neither function is present in BernoulliRBM.
BernoulliRBM is an unsupervised method, so the labels in clf.fit(X_train, y_train) are ignored and the call amounts to clf.fit(X_train). It is mostly used for non-linear feature extraction whose output can be fed to a classifier. It would look like this:
from sklearn import linear_model
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
So the features extracted by rbm are passed to the LogisticRegression model. Take a look here for a full example.
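Here is a runnable sketch of that pipeline, which gives you the predict and decision_function calls the question asks for. The digits dataset, the scaling to [0, 1] and the hyperparameter values are my assumptions, chosen only to keep the example small.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Digits pixels range from 0 to 16; scale them to [0, 1] for the RBM.
X, y = load_digits(return_X_y=True)
X = X / 16.0
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative hyperparameters, not tuned.
rbm = BernoulliRBM(n_components=64, n_iter=10, random_state=0)
logistic = LogisticRegression(max_iter=1000)
classifier = Pipeline(steps=[("rbm", rbm), ("logistic", logistic)])

# The RBM step is fitted without labels; the labels reach only the last step.
classifier.fit(X_train, y_train)

# The pipeline exposes the prediction methods of its final estimator.
y_pred = classifier.predict(X_test)
y_score = classifier.decision_function(X_test)
print(y_pred.shape, y_score.shape)
```

The prediction methods come from LogisticRegression; the RBM itself only contributes transform, which the pipeline applies before the classifier sees the data.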
