Why isn't `model.fit` defined in scikit-learn?

I am following step 3 of this example:
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)
I don't understand why scikit-learn doesn't recognize model.fit.
Do I need to assign that variable first?
Is there a missing import?
I'm working in Jupyter with scikit-learn 0.17.1.

You first need to instantiate whatever model you're using:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)

fit(X, y) is a method that can be called on an estimator.
To use this method on model, you have to create model first and make sure it's an instance of an estimator class.
Documentation
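For reference, a minimal self-contained sketch of the same flow, assuming the built-in iris data stands in for the tutorial's dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

dataset = load_iris()
model = DecisionTreeClassifier()  # instantiate the estimator first
model.fit(dataset.data, dataset.target)
predicted = model.predict(dataset.data)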

Related

Add transform method to sklearn predictor to use it as an intermediate step of a sklearn pipeline

I have a sklearn Pipeline with several transformers and a LinearRegression predictor at the end. I want to add more custom transformers after it, plus a final custom predictor, but LinearRegression doesn't have a transform method, so calling the full pipeline's predict method raises an error.
I thought about adding the transform method to the LinearRegression class using inheritance and doing something like:
class NewModel(LinearRegression):
    def transform(self, X):
        X["prediction"] = self.predict(X)
        return X
but I want to know if there is a better way to solve the problem, so that I can use any type of sklearn predictor in the middle of a pipeline. For instance, I would like a class that takes a sklearn predictor as an argument, adds a transform method that calls the predictor's predict method as in the example above, and appends the new column to the dataframe.
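One possible direction, as a minimal sketch: a wrapper that takes any sklearn predictor and exposes its predictions through a transform method. This assumes X is a pandas DataFrame; PredictorAsTransformer is a hypothetical name, not a sklearn class:

from sklearn.base import BaseEstimator, TransformerMixin

class PredictorAsTransformer(BaseEstimator, TransformerMixin):
    # hypothetical wrapper: turns any sklearn predictor into a pipeline transformer
    def __init__(self, predictor):
        self.predictor = predictor

    def fit(self, X, y=None):
        self.predictor.fit(X, y)
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's DataFrame
        X["prediction"] = self.predictor.predict(X)
        return X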

cast xgboost.Booster class to XGBRegressor or load XGBRegressor from xgboost.Booster

I get a model from SageMaker of type:
<class 'xgboost.core.Booster'>
I can score this locally, which is great, but some Google searches suggest that it may not be possible to do "standard" things like this (taken from here):
plt.barh(boston.feature_names, xgb.feature_importances_)
Is it possible to transform xgboost.core.Booster to XGBRegressor? Maybe one could use the save_raw method, looking at this? Thanks!
So far I tried:
xgb_reg = xgb.XGBRegressor()
xgb_reg._Boster = model
xgb_reg.feature_importances_
but this results in:
NotFittedError: need to call fit or load_model beforehand
Something along these lines appears to work fine:
import tarfile
import xgboost as xgb

local_model_path = "model.tar.gz"
with tarfile.open(local_model_path) as tar:
    tar.extractall()

model = xgb.XGBRegressor()
model.load_model(model_file_name)  # model_file_name: the model file extracted from the archive
model can then be used as usual; model.tar.gz is an artifact coming from SageMaker.
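Once loaded this way, the sklearn-style attributes are available again, so calls like the one from the question should work. A sketch, where feature_names is assumed to be your own list of column names:

import matplotlib.pyplot as plt
plt.barh(feature_names, model.feature_importances_)  # feature_names: assumed column names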

RandomForestClassifier - Odd error with trying to identify feature importance in sklearn?

I'm trying to retrieve the importance of the features within a RandomForestClassifier model, i.e. the coefficient for each feature in the model.
I'm running the following code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

random_forest = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=123))
random_forest.fit(X_train, y_train)
print(random_forest.estimator.feature_importances_)
but I am receiving the following error:
NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
What exactly am I doing wrong? You can see I fit the model right before trying to identify the feature importances, but it doesn't seem to work as it should.
Similarly, I have the code below with a LogisticRegression model, and it works fine:
log_reg = SelectFromModel(LogisticRegression(class_weight="balanced", random_state=123))
log_reg.fit(X_train, y_train)
print(log_reg.estimator_.coef_)
You have to use the attribute estimator_ to access the fitted estimator (see the docs). Note that you forgot the trailing _. So it should be:
print(random_forest.estimator_.feature_importances_)
Interestingly, you did it correctly for your example with the LogisticRegression model.
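To make the distinction concrete, a short sketch of the two attributes on a fitted SelectFromModel (attribute names per the sklearn docs):

random_forest.estimator    # the unfitted template estimator passed to SelectFromModel
random_forest.estimator_   # the copy that SelectFromModel actually fitted
print(random_forest.estimator_.feature_importances_)  # works once fitted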

Get feature importance from GridSearchCV

Is there a way to get feature importance from a sklearn's GridSearchCV?
For example:
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor

print("starting grid search ......")
optimized_GBM = GridSearchCV(LGBMRegressor(),
                             params,  # params: your hyperparameter grid
                             cv=3,
                             n_jobs=-1)
optimized_GBM.fit(tr, yvar)
preds2 = optimized_GBM.predict(te)
Is there a way I can access the feature importances?
Maybe something like:
optimized_GBM.feature_importances_
This one works:
optimized_GBM.best_estimator_.feature_importances_
Got it. It goes something like this:
optimized_GBM.best_estimator_.feature_importance()
If you happened to run this through a Pipeline and receive object has no attribute 'feature_importance', try
optimized_GBM.best_estimator_.named_steps["step_name"].feature_importances_
where step_name is the corresponding name in your pipeline.
That depends on which model you selected. If you choose an SVM you won't have a feature importance attribute, but with decision trees you will get it.

Python sklearn : fit_transform() does not work for GridSearchCV

I am creating a GridSearchCV classifier as
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters = {}
gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
# Fit/train the gridSearchClassifier on Training Set
gridSearchClassifier.fit(Xtrain, ytrain)
This works well, and I can predict. However, now I want to retrain the classifier. For this I want to do a fit_transform() on some feedback data.
gridSearchClassifier.fit_transform(Xnew, yNew)
But I get this error
AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'
Basically, I am trying to fit_transform() the classifier's internal TfidfVectorizer. I know that I can access the Pipeline's internal components using the named_steps attribute. Can I do something similar for the gridSearchClassifier?
Just call them step by step.
gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)
fit_transform is nothing more than these two lines of code; it is simply not implemented as a single method for GridSearchCV.
Update
From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-estimator for fitting a model with multiple hyperparameter settings. Thus, once you call fit, you get the best estimator inside the best_estimator_ field of your object. In your case it is a pipeline, and you can extract any part of it as usual, thus:
gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements etc.
# for example: print(clf.named_steps['vect'])
You should not use GridSearchCV as a classifier; it is only a method for fitting hyperparameters, and once you find them you should work with best_estimator_ instead. However, remember that if you refit the TFIDF vectorizer, your classifier becomes useless: you cannot change the data representation and expect the old model to work well. You have to refit the whole classifier once your data changes (unless the change is carefully designed and you make sure the old dimensions mean exactly the same thing; sklearn does not support such operations, so you would have to implement this from scratch).
@lejot is correct that you should call fit() on the gridSearchClassifier.
Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.
You can access the already fitted steps:
tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']
You can then transform new text in new_X using:
X_vec = tfidf.transform(new_X)
You can make predictions using this X_vec with:
x_pred = clf.predict(X_vec)
You can also make predictions for the text going through the entire pipeline with:
X_pred = gridSearchClassifier.predict(new_X)
