Order of coef_ and estimators_ elements in scikit-learn classifiers

In scikit-learn, classifiers such as LinearSVC and MultinomialNB expose a coef_ attribute, which lets you inspect the weights of the model for each label, e.g. coef_[0], coef_[1], etc. A full code example is shown in the 20_newsgroups documentation in the show_top10 method.
When using a OneVsRestClassifier inside a pipeline, I wrote the following code to get the weights of the wrapped models for a three-class classification:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', MaxAbsScaler()),
    ('clf', OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced')))])
pipeline.fit(X_train, y_train)

pipeline.named_steps['clf'].estimators_[0].coef_.toarray()[0]
pipeline.named_steps['clf'].estimators_[1].coef_.toarray()[0]
pipeline.named_steps['clf'].estimators_[2].coef_.toarray()[0]
But how is the order of the elements in coef_ (for LinearSVC) and estimators_ (for SVC wrapped in a OneVsRestClassifier) defined?
In general, how do I reliably determine which class the estimators_[i] model belongs to?
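In both cases the order follows the fitted classes_ attribute: scikit-learn sorts the class labels during fit, so row i of coef_ (LinearSVC) and estimators_[i] (OneVsRestClassifier) both correspond to classes_[i]. A minimal sketch of the mapping, reusing the pipeline above:
ovr = pipeline.named_steps['clf']
# classes_[i] is the label that estimators_[i] separates from the rest
for label, estimator in zip(ovr.classes_, ovr.estimators_):
    print(label, estimator.coef_.toarray()[0])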

Related

How to combine already trained classifiers with StackingClassifier?

StackingClassifier in sklearn can stack several models. When the .fit method is called, the underlying models are trained.
A typical use case for StackingClassifier:
model1 = LogisticRegression()
model2 = RandomForestClassifier()
combination = StackingClassifier(estimators=[('lr', model1), ('rf', model2)])
combination.fit(X_train, y_train)
However, what I need is the following:
model1 = LogisticRegression()
model1.fit(X_train_1, y_train_1)
model2 = RandomForestClassifier()
model2.fit(X_train_2, y_train_2)
combination = StackingClassifier(estimators=[('lr', model1), ('rf', model2)], refit=False)
combination.fit(X_train_3, y_train_3)
where the refit parameter does not exist; it is what I would need.
I have already trained model1 and model2 and do not want to refit them. I just need to fit the stacking model that combines the two. How do I elegantly combine them into one model that produces an end-to-end .predict?
Of course, I can generate predictions from the first and second models, build a data frame from them, and fit the third one on that. I would like to avoid this, because then I cannot ship the model as a single end-to-end artifact.
You're close: it's cv="prefit", not refit=False. From the API docs:
cv : int, cross-validation generator, iterable, or “prefit”, default=None
[...]
"prefit" to assume the estimators are prefit. In this case, the estimators will not be refitted.

RandomForestClassifier - Odd error with trying to identify feature importance in sklearn?

I'm trying to retrieve the importance of features within a RandomForestClassifier model, i.e. a coefficient-like score for each feature in the model. I'm running the following code:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

random_forest = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=123))
random_forest.fit(X_train, y_train)
print(random_forest.estimator.feature_importances_)
but I am receiving the following error:
NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
What exactly am I doing wrong? You can see that I fit the model right before trying to read the feature importances, but it doesn't seem to work as it should.
Similarly, the code below with a LogisticRegression model works fine:
log_reg = SelectFromModel(LogisticRegression(class_weight = "balanced", random_state = 123))
log_reg.fit(X_train, y_train)
print(log_reg.estimator_.coef_)
You have to use the attribute estimator_ to access the fitted estimator (see the docs). Note that you forgot the trailing underscore, so it should be:
print(random_forest.estimator_.feature_importances_)
Interestingly, you did it correctly for your example with the LogisticRegression model.
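As a side note, a short sketch of what else the fitted selector exposes (using the question's variable names):
importances = random_forest.estimator_.feature_importances_
mask = random_forest.get_support()  # boolean mask of the features the selector kept
X_train_selected = random_forest.transform(X_train)  # drop the non-selected features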

How to access attribute from a trained estimator in TransformedTargetRegressor pipeline in scikit-learn?

I set up a small pipeline with scikit-learn that I wrapped in a TransformedTargetRegressor object. After training, I would like to access an attribute of my trained estimator (e.g. feature_importances_). Can anyone tell me how this can be done?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor

# set up the pipeline
pipeline = Pipeline(steps=[('scale', StandardScaler(with_mean=True, with_std=True)),
                           ('estimator', RandomForestRegressor())])

# transform the target variable
model = TransformedTargetRegressor(regressor=pipeline,
                                   transformer=MinMaxScaler())

# fit model
model.fit(X_train, y_train)
I tried the following:
# try to access the attribute of the fitted estimator
model.get_params()['regressor__estimator'].feature_importances_
model.regressor.named_steps['estimator'].feature_importances_
But this results in the following NotFittedError:
NotFittedError: This RandomForestRegressor instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.
When you look into the documentation of TransformedTargetRegressor it says that the attribute .regressor_ (note the trailing underscore) returns the fitted regressor. Hence, your call should look like:
model.regressor_.named_steps['estimator'].feature_importances_
Your previous calls were just returning an unfitted clone. That's where the error came from.
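If you also want to line the importances up with your column names, a small sketch (assuming X_train is a pandas DataFrame):
import pandas as pd

importances = model.regressor_.named_steps['estimator'].feature_importances_
print(pd.Series(importances, index=X_train.columns).sort_values(ascending=False))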

Multioutput Stacking Regressor

One-shot question: I'm trying to build a multioutput stacked regressor (StackingRegressor was added in sklearn 0.22).
As far as I understand, I have to combine StackingRegressor and MultiOutputRegressor. After several attempts, this seems to be the right order:
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

estimators = [('svr', SVR(kernel='rbf', C=1e3, gamma=0.1)),
              ('knn', KNeighborsRegressor(n_neighbors=5)),
              ('omp', OrthogonalMatchingPursuit())]

reg = MultiOutputRegressor(StackingRegressor(estimators=estimators,
                                             final_estimator=RandomForestRegressor(n_estimators=5)))

X = np.random.random((200, 20))
y = np.random.random((200, 4))
reg.fit(X, y)
reg.predict(X)
But the predict method fails with an error:
*** ValueError: The base estimator should implement a predict method
I searched for this error in the sklearn source and it seems to come from MultiOutputRegressor:
if not hasattr(self.estimator, "predict"):
    raise ValueError("The base estimator should implement a predict method")
So I tried to look at the self.estimator model:
reg.estimator.predict(X)
but I obtain this error:
*** AttributeError: 'StackingRegressor' object has no attribute 'final_estimator_'
Looking at the attributes of reg.estimator, I cannot find final_estimator_, only final_estimator, so my workaround is to create that attribute myself:
reg.estimator.final_estimator_ = reg.estimator.final_estimator
It works, but I'm no longer sure my model is doing what it is supposed to do (maybe it is using the same final estimator for each coordinate of the output).
Is this a bug in the combination StackingRegressor + MultiOutputRegressor, or am I missing something?
Thanks!
Set stack_method='predict' at model initialization, and it should work fine. I don't know why the 'auto' option doesn't work, but the quick fix looks like this:
model = StackingClassifier(estimators=level0, final_estimator=level1, stack_method='predict', cv=5)
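Note that the snippet above uses StackingClassifier, where stack_method is a constructor parameter; a self-contained sketch with hypothetical level0/level1 estimators:
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# hypothetical base and final estimators standing in for level0 and level1
level0 = [('lr', LogisticRegression()), ('knn', KNeighborsClassifier())]
level1 = RandomForestClassifier(n_estimators=10)

# stack_method='predict' feeds plain class predictions (instead of
# probabilities or decision values) to the final estimator
model = StackingClassifier(estimators=level0, final_estimator=level1,
                           stack_method='predict', cv=5)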

Python sklearn : fit_transform() does not work for GridSearchCV

I am creating a GridSearchCV classifier as
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {}
gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')

# Fit/train the gridSearchClassifier on the training set
gridSearchClassifier.fit(Xtrain, ytrain)
This works well, and I can predict. However, now I want to retrain the classifier. For this, I want to do a fit_transform() on some feedback data:
gridSearchClassifier.fit_transform(Xnew, yNew)
But I get this error
AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'
Basically, I am trying to fit_transform() on the classifier's internal TfidfVectorizer. I know that I can access the Pipeline's internal components using the named_steps attribute. Can I do something similar for the gridSearchClassifier?
Just call them step by step.
gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)
fit_transform is nothing more than these two lines of code; it is simply not implemented as a single method for GridSearchCV.
Update
From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-estimator that fits a model across multiple hyperparameter settings. Thus, once you call fit, you get the best estimator in the best_estimator_ field of your object. In your case it is a pipeline, and you can extract any part of it as usual, thus:
gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements, etc.
# for example: print(clf.named_steps['vect'])
You should not use GridSearchCV as a classifier; it is only a method for fitting hyperparameters, and once you have found them you should work with best_estimator_ instead. Also remember that if you refit the TF-IDF vectorizer, your classifier becomes useless: you cannot change the data representation and expect the old model to work well. You have to refit the whole classifier once your data changes (unless the change is carefully designed and you make sure the old dimensions mean exactly the same; sklearn does not support such operations, so you would have to implement this from scratch).
@lejot is correct that you should call fit() on the gridSearchClassifier.
Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.
You can access the already fitted steps:
tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']
You can then transform new text in new_X using:
X_vec = tfidf.transform(new_X)
You can make predictions using this X_vec with:
x_pred = clf.predict(X_vec)
You can also make predictions with text going through the entire pipeline:
X_pred = gridSearchClassifier.predict(new_X)
