I am using sklearn's OneVsOneClassifier in a pipeline like so:
smt = SMOTE(random_state=42)
base_model = LogisticRegression()
pipeline = Pipeline([('sampler', smt), ('model', base_model)])
classifier = OneVsOneClassifier(estimator=pipeline)
classifier.fit(X_train, y_train)
# prediction
yhat = classifier.predict(X_test)
But then I cannot do:
yhat_prob = classifier.predict_proba(X_test)
AttributeError: 'OneVsOneClassifier' object has no attribute 'predict_proba'
scikit-learn's OneVsRestClassifier does provide a predict_proba method. I am surprised OneVsOneClassifier doesn't have this method.
How do I then get class probability estimates from my pipeline above?
It's not clear how to use OvO to get probabilities, so it's not implemented. https://github.com/scikit-learn/scikit-learn/issues/6164
There is the decision_function method for a more nuanced version of predict.
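As a minimal sketch of that suggestion, assuming a plain LogisticRegression as the base estimator and the iris data as a stand-in for X_train/y_train (the SMOTE step is left out so the snippet needs only scikit-learn), decision_function returns one score per class, and the highest score corresponds to what predict returns:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

# Stand-in data (an assumption for illustration): 3-class iris.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifier = OneVsOneClassifier(estimator=LogisticRegression(max_iter=1000))
classifier.fit(X_train, y_train)

# One column of aggregated pairwise scores per class (these are not probabilities).
scores = classifier.decision_function(X_test)

# The class with the highest score is the predicted class.
yhat_from_scores = classifier.classes_[np.argmax(scores, axis=1)]
yhat = classifier.predict(X_test)
```

The scores are built from pairwise votes, so they are useful for ranking classes but should not be read as calibrated probabilities.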
I am taking a course on Coursera and need to submit this last assignment in order to pass. However, I am unable to complete it. I encounter NotFittedError in line 16 of the code. Can someone help me find what is wrong with this code?
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
def engagement_model():
    train = pd.read_csv('assets/train.csv')
    train_X = train[train.columns[1:9]]
    train_y = train.iloc[:, 9:]
    test = pd.read_csv('assets/test.csv')
    X_train, X_test, y_train, y_test = train_test_split(train_X, train_y)
    class_rf=RandomForestClassifier()
    grid_values = {'n_estimators':[10,100], 'max_depth': [None, 30]}
    grid_clf_auc = GridSearchCV(class_rf, param_grid=grid_values, scoring='roc_auc_score')
    predict_test = grid_clf_auc.predict_proba(test[test.columns[1:9]])
    predict_test = predict_test[:,1]
    return pd.series(predict_test, index=[test['id']])
engagement_model()
The error I am getting is
NotFittedError: This GridSearchCV instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
You are getting an error because you didn't fit the model. You first need to fit it on your train data, and then you can predict on the test data. Use the .fit() method on your grid_clf_auc.
Before calling grid_clf_auc.predict_proba, you need to call grid_clf_auc.fit(train_X, train_y). Otherwise you have just created a GridSearchCV object but have not fitted it to your data. A classifier that has not been fitted to your data cannot make a prediction.
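A minimal sketch of the fitted version, assuming synthetic data from make_classification in place of assets/train.csv (the data, the smaller grid, and cv=3 are illustration choices, not the assignment's). It also uses the scorer name 'roc_auc' and the capitalized pd.Series, since 'roc_auc_score' is the name of the metric function rather than a scoring string:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the course CSV files (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid_values = {'n_estimators': [10, 50], 'max_depth': [None, 5]}
grid_clf_auc = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=grid_values,
    scoring='roc_auc',
    cv=3,
)

# Fit first: this runs the search and refits the best model on X_train.
grid_clf_auc.fit(X_train, y_train)

# Only now can the search object predict; column 1 is the positive class.
predict_test = grid_clf_auc.predict_proba(X_test)[:, 1]
result = pd.Series(predict_test)
```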
If I use LightGBM,
there are two ways of using it. First method:
model=lgb.LGBMClassifier()
model.fit(X,y)
model.predict_proba(values)
With this I get a predict_proba method to predict probabilities.
If I use it natively:
import lightgbm as lgb
dset = lgb.Dataset(X, label=y)
params = {}
params['learning_rate'] = 0.003
params['boosting_type'] = 'gbdt'
params['objective'] = 'binary'
params['metric'] = 'binary_logloss'
model = lgb.train(params, dset, 100)
I get a predict method, but predict_proba does not exist if I use it this way. Can anyone advise what I did wrong?
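predict_proba exists only on the scikit-learn wrapper (LGBMClassifier, after fit()). The Booster returned by the native lgb.train() does not have it; with objective 'binary', its predict() already returns the probability of the positive class.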
I'm confused about using cross_val_predict in a test data set.
I created a simple Random Forest model and used cross_val_predict to make predictions:
from sklearn.ensemble import RandomForestClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import cross_val_predict, KFold
lr = RandomForestClassifier(random_state=1, class_weight="balanced", n_estimators=25, max_depth=6)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr,train_df[features_columns], train_df["target"], cv=kf)
predictions = pd.Series(predictions)
I'm confused on the next step here. How do I use what is learnt above to make predictions on the test data set?
cross_val_score and cross_val_predict do not fit the estimator you pass in. They fit copies of it on each training fold on the fly and discard them, so lr is still unfitted afterwards. If you look at the documentation (section 3.1.1.1), you'll see that they never ask you to call fit yourself.
As @DmitryPolonskiy commented, the model has to be trained (with the fit method) before it can be used to predict.
# Train the model (a.k.a. `fit` training data to it).
lr.fit(train_df[features_columns], train_df["target"])
# Use the model to make predictions based on testing data.
y_pred = lr.predict(test_df[features_columns])
# Compare the predicted y values to actual y values.
accuracy = (y_pred == test_df["target"]).mean()
cross_val_predict is a method of cross validation, which lets you determine the accuracy of your model. Take a look at sklearn's cross-validation page.
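The two steps can be put side by side in a short sketch, assuming synthetic data in place of train_df and test_df (the data and the 3-fold split are illustration choices): cross_val_predict gives out-of-fold predictions for estimating performance on the training data, and a separate fit is still needed before predicting on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

# Synthetic stand-in for train_df / test_df (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

lr = RandomForestClassifier(random_state=1, class_weight="balanced",
                            n_estimators=25, max_depth=6)
kf = KFold(n_splits=3)

# Out-of-fold predictions: each training row is predicted by a model
# that did not see that row during fitting.
cv_predictions = cross_val_predict(lr, X_train, y_train, cv=kf)
cv_accuracy = (cv_predictions == y_train).mean()

# lr itself is still unfitted at this point, so fit it on the full training set.
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
test_accuracy = (y_pred == y_test).mean()
```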
I am not sure the question was answered. I had a similar thought. I want to compare the results (accuracy, for example) with the method that does not apply CV. The cross-validated accuracy is computed on X_train and y_train. The other method fits the model using X_train and y_train and tests it on X_test and y_test. So the comparison is not fair, since they are on different datasets.
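What you can do is use one of the estimators returned by cross_validate (it returns a dict with one fitted estimator per fold):
cv_results = cross_validate(lr, train_df[features_columns], train_df["target"], cv=kf, return_estimator=True)
lr_fit = cv_results["estimator"][0]  # fitted on the training part of the first fold
y_pred = lr_fit.predict(test_df[features_columns])
accuracy = (y_pred == test_df["target"]).mean()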
I am trying to reproduce the example here but using RandomForestClassifier.
I can't see how to transform this part of the code
# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
I tried
# Learn to predict each class against the other
classifier = OneVsRestClassifier(RandomForestClassifier())
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
but I get
AttributeError: Base estimator doesn't have a decision_function attribute.
Is there a workaround?
Well, you should know what decision_function is used for. It is provided by estimators such as SVMs and linear models, where it gives the signed distance of your data points from the hyperplane that separates the data; RandomForestClassifier has no such hyperplane and does not implement it. You can use other methods that are supported by RFC, for example predict_proba if you want the probabilities of your classified data points.
Here is the reference for the supported functions
Just to mention, RFC does expose oob_decision_function_, which is the out-of-bag estimate on your training set (available when the forest is fitted with oob_score=True).
So just replace your line with:
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
or
y_score = classifier.fit(X_train, y_train).predict(X_test)
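A minimal sketch of the predict_proba variant, assuming the iris data with plain 1-d class labels as a stand-in for the example's X_train/y_train (the linked example binarizes the labels first, which is not reproduced here):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Stand-in data (an assumption for illustration): 3-class iris.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One random forest per class, each trained as "this class vs. the rest".
classifier = OneVsRestClassifier(RandomForestClassifier(random_state=0))

# predict_proba plays the role decision_function had in the SVC version:
# it returns one score column per class.
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
```

For a ROC curve you would generally want these per-class scores rather than the hard labels returned by predict.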
I would like to use an RBM in scikit-learn. I can define and train an RBM like many other classifiers.
from sklearn.neural_network import BernoulliRBM
clf = BernoulliRBM(random_state=0, verbose=True)
clf.fit(X_train, y_train)
But I can't seem to find a function that makes me a prediction. I am looking for an equivalent for one of the following in scikit.
y_score = clf.decision_function(X_test)
y_score = clf.predict(X_test)
Neither function is present in BernoulliRBM.
BernoulliRBM is an unsupervised method, so it does not use labels: clf.fit(X_train) is enough, and any y passed to fit is ignored. It is mostly used for non-linear feature extraction whose output can be fed to a classifier. It would look like this:
logistic = linear_model.LogisticRegression()
rbm = BernoulliRBM(random_state=0, verbose=True)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
So the features extracted by rbm are passed to the LogisticRegression model. Take a look here for a full example.
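A self-contained sketch of that pipeline, assuming the scikit-learn digits data scaled to [0, 1] as a stand-in for X_train/y_train; the n_components and n_iter values are illustration choices, not tuned settings:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in data (an assumption): 8x8 digit images with pixel values 0..16.
X, y = load_digits(return_X_y=True)
X = X / 16.0  # BernoulliRBM works on inputs in the [0, 1] range
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rbm = BernoulliRBM(n_components=64, n_iter=10, random_state=0)
logistic = LogisticRegression(max_iter=1000)
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])

# The RBM learns features without labels; y_train is used by the final step.
classifier.fit(X_train, y_train)

# The pipeline as a whole provides the prediction methods the RBM alone lacks.
y_pred = classifier.predict(X_test)
y_score = classifier.predict_proba(X_test)
```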