Feature importance in sklearn using adaboost - python

I am using the Python library sklearn with the AdaBoost classifier, and I want to identify which features are most important in the classification. The following is my code:
ada = AdaBoostClassifier(n_estimators=100)
selector = RFECV(ada, step=1, cv=5)
selector = selector.fit(np.asarray(total_data), np.asarray(target))
selector.support_
print "featue ranking", selector.ranking_
I am getting the following error:
selector = selector.fit(np.asarray(total_data), np.asarray(target))
File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 336, in fit
ranking_ = rfe.fit(X_train, y_train).ranking_
File "C:\Python27\lib\site-packages\sklearn\feature_selection\rfe.py", line 148, in fit
if estimator.coef_.ndim > 1:
AttributeError: 'AdaBoostClassifier' object has no attribute 'coef_'
Does anyone have any idea about this, and how to correct it?
Thanks!!

Straight from the docstring of RFECV:
Parameters
----------
estimator : object
A supervised learning estimator with a `fit` method that updates a
`coef_` attribute that holds the fitted parameters. Important features
must correspond to high absolute values in the `coef_` array.
For instance, this is the case for most supervised learning
algorithms such as Support Vector Classifiers and Generalized
Linear Models from the `svm` and `linear_model` modules.
In other words, RFE is currently only implemented for linear models. You could make it work for other models by changing it to use feature_importances_ instead of coef_ and submit a patch.
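Until such a patch exists, note that AdaBoostClassifier itself exposes feature_importances_ once fitted, so you can rank features without RFECV at all. A minimal sketch, reusing total_data and target from your question:
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier(n_estimators=100)
ada.fit(np.asarray(total_data), np.asarray(target))

# Indices of the features, most important first
ranking = np.argsort(ada.feature_importances_)[::-1]
print("feature ranking", ranking)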

Related

How do I get the MAE, RMSE, MSE and R^2 on a Pycaret model?

I am trying to get the MAE, RMSE, MSE and R^2 on a model, but it actually only gives me metrics that are mostly used for classification, not regression.
These are the metrics that the model gives me:
I have already read the PyCaret documentation, but I only found the add_metric() option, and I don't know if that function will work for this (I also didn't understand how add_metric() works).
My setup function:
exp = setup(data=dataset, target='Lower Salary', categorical_features=cat_f,
            ignore_features=['Job Title', 'Headquarters', 'Founded', 'Type of ownership',
                             'Competitors', 'company_txt', 'job_title_sim', 'seniority_by_title',
                             'Salary Estimate', 'Job Description', 'Industry', 'Hourly',
                             'Employer provided'],
            normalize=True, session_id=123)
My create_model function:
logit = create_model('lr')
The actual problem is that you are using logistic regression. That is a classification model, not a regression model. That is why you are only seeing metrics for classification models.
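If the goal is to predict 'Lower Salary' as a number, the fix is to go through PyCaret's regression module instead, where 'lr' means linear regression and create_model reports MAE, MSE, RMSE, R2, etc. A minimal sketch, reusing dataset and cat_f from your setup (ignore_features omitted for brevity):
from pycaret.regression import setup, create_model

# Regression experiment: the score grid now shows MAE, MSE, RMSE and R2
exp = setup(data=dataset, target='Lower Salary', categorical_features=cat_f,
            normalize=True, session_id=123)

# 'lr' is linear regression in the regression module, not logistic regression
lin_reg = create_model('lr')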
I'm actually working on a time series project right now and need to calculate these metrics. This is what I do.
I haven't used PyCaret, but this approach requires that you can get hold of your test dataset and your model's predictions on it.
You will need sqrt() from the math library, and r2_score(), mean_squared_error() and mean_absolute_error() from sklearn.metrics.
import math
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print(f'Mean Absolute Error = {mean_absolute_error(actual, pred)}')
print(f'Mean Squared Error = {mean_squared_error(actual, pred)}')
print(f'Root Mean Squared Error = {math.sqrt(mean_squared_error(actual, pred))}')
print(f'r2 = {r2_score(actual, pred)}')
Looking at the pycaret docs, it looks like get_leaderboard() might work for your case.

RandomForestClassifier - Odd error with trying to identify feature importance in sklearn?

I'm trying to retrieve the importance of the features within a RandomForestClassifier model, i.e. the coefficient for each feature in the model. I'm running the following code:
random_forest = SelectFromModel(RandomForestClassifier(n_estimators = 200, random_state = 123))
random_forest.fit(X_train, y_train)
print(random_forest.estimator.feature_importances_)
but I am receiving the following error:
NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
What exactly am I doing wrong? You can see I fit the model right before trying to read the feature importances, but it doesn't seem to work as it should.
Similarly, the code below with a LogisticRegression model works fine:
log_reg = SelectFromModel(LogisticRegression(class_weight = "balanced", random_state = 123))
log_reg.fit(X_train, y_train)
print(log_reg.estimator_.coef_)
You have to call the attribute estimator_ to access the fitted estimator (see the docs). Observe that you forgot the trailing _. So it should be:
print(random_forest.estimator_.feature_importances_)
Interestingly, you did it correctly for your example with the LogisticRegression model.
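For reference, a self-contained sketch of the corrected call, with toy data standing in for X_train and y_train:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Toy data standing in for X_train / y_train
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=123)

random_forest = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=123))
random_forest.fit(X_train, y_train)

# estimator_ (trailing underscore) is the fitted copy;
# estimator (no underscore) is the unfitted template you passed in
print(random_forest.estimator_.feature_importances_)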

Get feature importance from GridSearchCV

Is there a way to get feature importance from sklearn's GridSearchCV?
For example:
from sklearn.model_selection import GridSearchCV
print("starting grid search ......")
optimized_GBM = GridSearchCV(LGBMRegressor(),
                             params,
                             cv=3,
                             n_jobs=-1)
optimized_GBM.fit(tr, yvar)
preds2 = optimized_GBM.predict(te)
Is there a way I can access the feature importances?
Maybe something like
optimized_GBM.feature_importances_
This one works
optimized_GBM.best_estimator_.feature_importances_
Got it. It goes something like this:
optimized_GBM.best_estimator_.feature_importance()
If you happened to run this through a Pipeline and receive object has no attribute 'feature_importance', try
optimized_GBM.best_estimator_.named_steps["step_name"].feature_importances_
where step_name is the corresponding name in your pipeline.
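For example, here is a hypothetical pipeline whose LightGBM step is named 'model' (the step name and parameter grid are illustrative, with tr and yvar reused from the question):
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([('scale', StandardScaler()), ('model', LGBMRegressor())])
optimized_GBM = GridSearchCV(pipe, {'model__n_estimators': [50, 100]}, cv=3)
optimized_GBM.fit(tr, yvar)

# Reach through the fitted pipeline to the LightGBM step
print(optimized_GBM.best_estimator_.named_steps['model'].feature_importances_)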
That depends on what model you have selected. If you choose an SVM you won't have a feature importance attribute, but with decision trees you will get it.

Reusing model fitted by cross_val_score in sklearn using joblib [duplicate]

This question already has an answer here:
Using sklearn cross_val_score and kfolds to fit and help predict model
(1 answer)
I created the following function in python:
import os
import re
from shutil import move

# Older, Python 2-era sklearn/joblib import paths assumed by this snippet
from sklearn.cross_validation import cross_val_score
from sklearn.externals import joblib

filenameL = []  # collects saved filenames (defined outside the function in the original)

def cross_validate(algorithms, data, labels, cv=4, n_jobs=-1):
    print "Cross validation using: "
    for alg, predictors in algorithms:
        print alg
        print
        # Compute the accuracy score for all the cross validation folds.
        scores = cross_val_score(alg, data, labels, cv=cv, n_jobs=n_jobs)
        # Take the mean of the scores (because we have one for each fold)
        print scores
        print("Cross validation mean score = " + str(scores.mean()))
        name = re.split('\(', str(alg))
        filename = str('%0.5f' % scores.mean()) + "_" + name[0] + ".pkl"
        # We might use this another time
        joblib.dump(alg, filename, compress=1, cache_size=1e9)
        filenameL.append(filename)
        try:
            move(filename, "pkl")
        except:
            os.remove(filename)
        print
    return
I thought that in order to do cross validation, sklearn had to fit your model.
However, when I try to use it later (f is the pkl file I saved above in joblib.dump(alg, filename, compress=1, cache_size=1e9)):
alg = joblib.load(f)
predictions = alg.predict_proba(train_data[predictors]).astype(float)
I get no error on the first line (so it looks like the load is working), but then on the following line it tells me: NotFittedError: Estimator not fitted, call fit before exploiting the model.
What am I doing wrong? Can't I reuse the model fitted during cross-validation? I looked at Keep the fitted parameters when using a cross_val_score in scikits learn, but either I don't understand the answer or it is not what I am looking for. What I want is to save the whole model with joblib so that I can use it later without re-fitting.
It's not quite correct that cross-validation has to fit your model; rather, k-fold cross-validation fits your model k times on partial datasets. If you want the model itself, you need to fit it again on the whole dataset; this isn't part of the cross-validation process. So it wouldn't be redundant to call
alg.fit(data, labels)
to fit your model after your cross validation.
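A sketch of the full save/reload flow with that extra fit, reusing alg, data, labels, filename, train_data and predictors from the question:
alg.fit(data, labels)                   # fit on the whole dataset
joblib.dump(alg, filename, compress=1)  # now the pickle holds a fitted model

# later
alg = joblib.load(filename)
predictions = alg.predict_proba(train_data[predictors]).astype(float)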
Another approach: rather than using the specialized function cross_val_score, you could think of this as a special case of a cross-validated grid search, with a single point in the parameter space. In that case GridSearchCV will by default refit the model over the entire dataset (it has a parameter refit=True), and it also exposes predict and predict_proba methods in its API.
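A minimal sketch of that approach (the sklearn.model_selection import path assumes a recent sklearn; older versions had GridSearchCV in sklearn.grid_search):
from sklearn.model_selection import GridSearchCV

# An empty param_grid yields a single candidate, so this is plain
# cross-validation; refit=True (the default) then refits that candidate
# on the entire dataset.
gs = GridSearchCV(alg, param_grid={}, cv=4, refit=True)
gs.fit(data, labels)
print(gs.best_score_)                   # mean cross-validation score

joblib.dump(gs.best_estimator_, filename, compress=1)
model = joblib.load(filename)
proba = model.predict_proba(data)       # fitted, so this works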
The real reason your model is not fitted is that cross_val_score first clones your model and fits the clones (see the source), so your original model is never fitted.
cross_val_score does not keep the fitted models, but cross_val_predict gives you the out-of-fold predictions they make. There is no cross_val_predict_proba, but you can still get predict_proba output for a cross-validated model.
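In recent sklearn versions that looks like this; the method parameter of cross_val_predict selects which estimator method is called on each fold:
from sklearn.model_selection import cross_val_predict

# Out-of-fold class probabilities: each row is predicted by a model
# that never saw that row during fitting
proba = cross_val_predict(alg, data, labels, cv=4, method='predict_proba')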

Python sklearn : fit_transform() does not work for GridSearchCV

I am creating a GridSearchCV classifier as
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression())
])
parameters= {}
gridSearchClassifier = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
# Fit/train the gridSearchClassifier on Training Set
gridSearchClassifier.fit(Xtrain, ytrain)
This works well, and I can predict. However, now I want to retrain the classifier. For this I want to do a fit_transform() on some feedback data.
gridSearchClassifier.fit_transform(Xnew, yNew)
But I get this error
AttributeError: 'GridSearchCV' object has no attribute 'fit_transform'
Basically, I am trying to fit_transform() on the classifier's internal TfidfVectorizer. I know that I can access the Pipeline's internal components using the named_steps attribute. Can I do something similar for the gridSearchClassifier?
Just call them step by step.
gridSearchClassifier.fit(Xnew, yNew)
transformed = gridSearchClassifier.transform(Xnew)
fit_transform is nothing more than these two lines of code; it is simply not implemented as a single method for GridSearchCV.
update
From the comments it seems that you are a bit lost about what GridSearchCV actually does. It is a meta-method for fitting a model with multiple hyperparameters. Once you call fit, you get the winning estimator inside the best_estimator_ field of your object. In your case it is a pipeline, and you can extract any part of it as usual, thus
gridSearchClassifier.fit(Xtrain, ytrain)
clf = gridSearchClassifier.best_estimator_
# do something with clf, its elements etc.
# for example print clf.named_steps['vect']
You should not use GridSearchCV as a classifier; it is only a method for fitting hyperparameters. Once you have found them, work with best_estimator_ instead. Remember, however, that if you refit the TF-IDF vectorizer, your classifier becomes useless: you cannot change the data representation and expect the old model to work well. You have to refit the whole classifier once your data change (unless it is a carefully designed change where you make sure the old dimensions mean exactly the same; sklearn does not support such operations, so you would have to implement this from scratch).
@lejot is correct that you should call fit() on the gridSearchClassifier.
Provided refit=True is set on the GridSearchCV, which is the default, you can access best_estimator_ on the fitted gridSearchClassifier.
You can access the already fitted steps:
tfidf = gridSearchClassifier.best_estimator_.named_steps['vect']
clf = gridSearchClassifier.best_estimator_.named_steps['clf']
You can then transform new text in new_X using:
X_vec = tfidf.transform(new_X)
You can make predictions using this X_vec with:
x_pred = clf.predict(X_vec)
You can also make predictions for the text going through the entire pipeline with:
X_pred = gridSearchClassifier.predict(new_X)
