GridSearchCV's documentation states that I can pass a scoring function.
scoring : string, callable or None, default=None
I would like to use the native accuracy_score as the scoring function.
So here is my attempt. Imports and some data:
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn import neighbors
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])
Now when I use just k-fold cross-validation without my scoring function, everything works as intended:
parameters = {
    'n_neighbors': [2, 3, 4],
    'weights': ['uniform', 'distance'],
    'p': [1, 2, 3]
}
model = neighbors.KNeighborsClassifier()
k_fold = KFold(len(Y), n_folds=6, shuffle=True, random_state=0)
clf = GridSearchCV(model, parameters, cv=k_fold) # TODO will change
clf.fit(X, Y)
print clf.best_score_
But when I change the line to
clf = GridSearchCV(model, parameters, cv=k_fold, scoring=accuracy_score) # or accuracy_score()
I get the error ValueError: Cannot have number of folds n_folds=10 greater than the number of samples: 6, which in my opinion does not describe the real problem.
In my opinion the real problem is that accuracy_score does not follow the signature scorer(estimator, X, y) that the documentation requires.
So how can I fix this problem?
It will work if you change scoring=accuracy_score to scoring='accuracy' (see the documentation for the full list of scorers you can use by name in this way).
In theory you should be able to pass a custom scoring function the way you're trying, but you're right that accuracy_score itself doesn't have the required scorer(estimator, X, y) API; it has to be wrapped first, as shown below.
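A minimal sketch, assuming the model, parameters and k_fold defined above: wrapping the metric with make_scorer gives it the required signature.
from sklearn.metrics import accuracy_score, make_scorer
# make_scorer turns a metric with signature metric(y_true, y_pred) into a scorer(estimator, X, y)
acc_scorer = make_scorer(accuracy_score)
clf = GridSearchCV(model, parameters, cv=k_fold, scoring=acc_scorer)
clf.fit(X, Y)
print(clf.best_score_)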
Here is an example of using weighted kappa as the scoring metric for GridSearchCV with a simple random forest model. The key learning for me was to pass the parameters related to the scorer (here weights="quadratic") into the make_scorer call.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer
kappa_scorer = make_scorer(cohen_kappa_score, weights="quadratic")
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_features': range(2, 10),  # try feature counts from 2 to 9
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 300, 500],
    'max_depth': [5]
}
# Create a base model
random_forest = RandomForestClassifier(class_weight="balanced_subsample", random_state=1)
# Instantiate the grid search model, searching for the best model by quadratic kappa
grid_search = GridSearchCV(estimator=random_forest, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=2, scoring=kappa_scorer)
# Fit the grid search to the data
grid_search.fit(final_tr, yTrain)
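After fitting, the selected parameter combination and its mean cross-validated kappa can be inspected:
print(grid_search.best_params_)
print(grid_search.best_score_)  # mean cross-validated quadratic kappa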
Related
I'm trying to use a random forest with grid search, but this error shows up:
ValueError: Invalid parameter classifier for estimator Pipeline(steps=[('tfidf_vectorizer', TfidfVectorizer()),
('rf_classifier', RandomForestClassifier())]).
Check the list of available parameters with `estimator.get_params().keys()`.
import numpy as np # linear algebra
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import pipeline, ensemble, preprocessing, feature_extraction, metrics
train = pd.read_json('cleaned_data1')
# split dataset into X, Y
X = train.iloc[:, 0]
Y = train.iloc[:, 2]
estimators = pipeline.Pipeline([
    ('tfidf_vectorizer', feature_extraction.text.TfidfVectorizer(lowercase=True)),
    ('rf_classifier', ensemble.RandomForestClassifier())
])
print(estimators.get_params().keys())
params = {"classifier__max_depth": [3, None],
"classifier__max_features": [1, 3, 10],
"classifier__min_samples_split": [1, 3, 10],
"classifier__min_samples_leaf": [1, 3, 10],
# "bootstrap": [True, False],
"classifier__criterion": ["gini", "entropy"]}
X_train,X_test,y_train,y_test=train_test_split(X,Y, test_size=0.2)
rf_classifier=GridSearchCV(estimators,params, cv=10 , n_jobs=-1 ,scoring='accuracy',iid=True)
rf_classifier.fit(X_train,y_train)
y_pred=rf_classifier.predict(X_test)
metrics.confusion_matrix(y_test,y_pred)
print(metrics.accuracy_score(y_test,y_pred))
I've tried adding these params
param_grid = {
'n_estimators': [200, 500],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth' : [4,5,6,7,8],
'criterion' :['gini', 'entropy']
}
but I still get the same error.
Please ensure that when you reference a step in the pipeline, you use the same step name when initializing the parameter grid.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
# Define a pipeline to search for the best combination of PCA truncation
# and classifier regularization.
pca = PCA()
# set the tolerance to a large value to make the example faster
logistic = LogisticRegression(max_iter=10000, tol=0.1)
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])
X_digits, y_digits = datasets.load_digits(return_X_y=True)
# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
'pca__n_components': [5, 15, 30, 45, 64],
'logistic__C': np.logspace(-4, 4, 4),
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1)
search.fit(X_digits, y_digits)
print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
In this example (taken from the sklearn documentation), the LogisticRegression model is referenced as 'logistic', so its parameters are prefixed with logistic__ in param_grid. Also, as a side note, for RandomForestClassifier a value of min_samples_split=1 is not possible and will result in an error.
Where you have called the random forest ensemble 'rf_classifier' within the pipeline, you should rename this to 'classifier', which should solve the issue.
The params look for something named 'classifier' in the pipeline so they can apply themselves; however, there is currently nothing with that name, which is why this error is thrown.
Alternatively (I'm not sure if this will work, but it's worth testing), you could change "classifier__" in the params list to "rf_classifier__" so that the params recognise the classifier that is actually in the pipeline, as shown in the sketch below.
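A minimal sketch of that second option, assuming the pipeline steps keep their current names (note that min_samples_split must be at least 2 for RandomForestClassifier):
params = {"rf_classifier__max_depth": [3, None],
          "rf_classifier__max_features": [1, 3, 10],
          "rf_classifier__min_samples_split": [2, 3, 10],  # 1 is not a valid value
          "rf_classifier__min_samples_leaf": [1, 3, 10],
          "rf_classifier__criterion": ["gini", "entropy"]}
rf_classifier = GridSearchCV(estimators, params, cv=10, n_jobs=-1, scoring='accuracy')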
I want to perform recursive feature elimination with cross-validation (RFECV) inside 10-fold cross-validation (i.e. cross_val_predict or cross_validate) in sklearn.
Since RFECV itself already performs cross-validation, as its name suggests, I am not clear how to do it. My current code is as follows.
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
rfecv = RFECV(estimator=clf, step=1, cv=k_fold)
Please let me know how I can use the data X and y with rfecv in 10-fold cross validation.
I am happy to provide more details if needed.
To use recursive feature elimination in conjunction with a pre-defined k_fold, you should use RFE and not RFECV:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(random_state = 0, class_weight="balanced")
selector = RFE(clf, n_features_to_select=5, step=1)
cv_acc = []
for train_index, val_index in k_fold.split(X, y):
    selector.fit(X[train_index], y[train_index])
    pred = selector.predict(X[val_index])
    acc = accuracy_score(y[val_index], pred)
    cv_acc.append(acc)
cv_acc
# result:
[1.0,
0.9333333333333333,
0.9333333333333333,
1.0,
0.9333333333333333,
0.9333333333333333,
0.8666666666666667,
1.0,
0.8666666666666667,
0.9333333333333333]
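If you want a single summary number, take the mean of these fold accuracies:
import numpy as np
np.mean(cv_acc)  # approximately 0.94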
To perform feature selection with RFE and then fit an RF with 10-fold cross-validation, here's how you could do it:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix
from sklearn.feature_selection import RFE
rf = RandomForestClassifier(random_state = 0, class_weight="balanced")
rfe = RFE(estimator=rf, step=1)
Now transform the original X by fitting the RFE:
X_new = rfe.fit_transform(X, y)
Here are the ranked features (not much of a problem with only 4 of them):
rfe.ranking_
# array([2, 3, 1, 1])
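Features ranked 1 are the ones kept; the corresponding boolean mask is also available:
rfe.support_
# array([False, False, True, True])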
Now split into train and test data and perform a cross validation in conjunction with a grid search using GridSearchCV (they usually go together):
X_train, X_test, y_train, y_test = train_test_split(X_new,y,train_size=0.7)
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
param_grid = {
'n_estimators': [5, 10, 15, 20],
'max_depth': [2, 5, 7, 9]
}
grid_clf = GridSearchCV(rf, param_grid, cv=k_fold.split(X_train, y_train))
grid_clf.fit(X_train, y_train)
y_pred = grid_clf.predict(X_test)
confusion_matrix(y_test, y_pred)
array([[17, 0, 0],
[ 0, 11, 0],
[ 0, 3, 14]], dtype=int64)
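You can also inspect the winning parameter combination and its mean cross-validated score:
grid_clf.best_params_
grid_clf.best_score_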
I need to perform leave-one-out cross-validation of an RF model.
I successfully built a model with high predictive ability.
Now I need to perform a LOO test prior to publication.
Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()
# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
FC_data1 = FC_data.drop(to_drop, axis=1)
y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
X_train = X.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
X_train.head(21)
y_train = y.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
y_train.head(21)
X_test = X.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
X_test.head(5)
y_test = y.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
y_test.head(5)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)
from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = {"n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100], "max_depth":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100]}
grid_search_cv_clf = GridSearchCV(clf_rf, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)
from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)
grid_search_cv_clf.best_params_
best_clf = grid_search_cv_clf.best_estimator_
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(25)
Now I need the q2 value.
Finally, I wrote this code and got a reasonably high score of 0.9071543776303185.
from sklearn.model_selection import LeaveOneOut
parameters = {"n_estimators":[4], "max_depth":[20]}
loo_clf = GridSearchCV(best_clf, parameters, cv=LeaveOneOut())
loo_clf.fit(features_important, y_train)
loo_clf.score(features_important, y_train)
I'm not sure if this is q2 or not. What do you think?
I also decided to obtain the 5-fold cross-validation score. However, it gives ridiculous values, for example: -36.58997717, 0.76801832, -1.59900448, 0.1834304, -2.38256389, with a mean of -7.924019361863889.
from sklearn.model_selection import cross_val_score
cvs=cross_val_score(best_clf, features_important, y_train)
mean_cross_val_score = cvs.mean()
mean_cross_val_score
Is there a way to fix this?
You should not run the hyper-parameter search before doing the model evaluation. Instead, you should nest the two cross-validations; otherwise, you are leaking some information. To know more about this, look at the following example from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py
Therefore, in your particular use-case, you should use: GridSearchCV, SelectFromModel, and cross_val_score:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
X, y = make_regression(n_samples=100)
feature_selector = SelectFromModel(
RandomForestRegressor(n_jobs=-1), threshold="mean"
)
pipe = make_pipeline(
feature_selector, RandomForestRegressor(n_jobs=-1)
)
param_grid = {
# define the grid of the random-forest for the feature selection
"selectfrommodel__estimator__n_estimators": [10, 20],
"selectfrommodel__estimator__max_depth": [3, 5],
# define the grid of the random-forest for the prediction
"randomforestregressor__n_estimators": [10, 20],
"randomforestregressor__max_depth": [5, 8],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, cv=3)
# The outer loop evaluates the tuned pipeline (grid_search), so the two CVs are nested.
# You could use LOO in this way, but be aware that it is not good practice:
# it leads to a large variance when evaluating your model.
# scores = cross_val_score(grid_search, X, y, cv=LeaveOneOut(), error_score='raise')
scores = cross_val_score(grid_search, X, y, cv=2, error_score='raise')
scores.mean()
You need to specify the scoring and the cv arguments.
Use this:
from sklearn.model_selection import LeaveOneOut, cross_val_score
mycv = LeaveOneOut()
cvs = cross_val_score(best_clf, features_important, y_train, scoring='r2', cv=mycv)
mean_cross_val_score = cvs.mean()
print(mean_cross_val_score)
This will return the mean cross-validated R2 score using LOOCV.
For more scoring options see here: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
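To list the predefined scorer names programmatically (the exact helper depends on your sklearn version; this one exists in recent releases):
from sklearn.metrics import get_scorer_names
print(sorted(get_scorer_names()))  # the strings accepted by scoring='...'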
I use GridSearchCV to fit an SVM, and I want to know the number of support vectors for all the fitted models. For now I can only access this attribute for the best model.
Toy example:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = SVC()
params = {'C': [0.01, 0.1, 1]}
search = GridSearchCV(estimator=clf, cv=2, param_grid=params, return_train_score=True)
search.fit(X, y);
This gives the number of support vectors for the best model only:
search.best_estimator_.n_support_
How can I get n_support_ for all models, just as we get the train/test error separately for each value of the parameter C?
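One possible workaround (a sketch, not a built-in GridSearchCV feature): loop over ParameterGrid and use cross_validate with return_estimator=True, which keeps the model fitted on each fold.
from sklearn.model_selection import ParameterGrid, cross_validate
# For every candidate C, collect n_support_ from the model fitted on each CV fold
for p in ParameterGrid(params):
    res = cross_validate(SVC(**p), X, y, cv=2, return_estimator=True)
    print(p, [est.n_support_ for est in res["estimator"]])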
I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.
The code I have looks like this:
import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
...
#get data (x, y, labels)
...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)
svr_estimator = Pipeline([
("scale", preprocessing.StandardScaler()),
("svr", LinearSVR),
])
search_params = dict(
svr__C = C_space,
svr__epsilon = epsilon_space
)
kfold = LabelKFold(labels, 5)
svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)
train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
...
#plot learning curve
My question is how to set up the cv argument for the grid search and the learning curve so that the original set is broken into training and test sets that don't share any labels for computing the learning curve, and then, from those training sets, further split into training and test sets without sharing labels for the grid search.
Essentially, how do I run a nested LabelKFold?
I, the user who created the bounty for this question, wrote the following reproducible example using data available from sklearn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold
digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1
strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in LabelKFold
## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)
## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)
#############################################################################
## this works: using regular randomly-allocated 10-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
##########################################################################
## this does not work: attempting to use label-aware CV in the inner loop
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
From your question, you are looking for the LabelKFold score on your data, while grid-searching the parameters of your pipeline in each iteration of this outer LabelKFold, again using a LabelKFold. Although I was not able to achieve that out of the box, it takes only one loop:
outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv=inner_cv, n_jobs=1)
    clf.fit(X[outer_train], Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))
Running the code, the first iteration yields:
Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11]) Test: set([9, 2, 3, 12, 6])
Inner:
Train: set([0, 10, 11, 5, 7]) Test: set([8, 1, 4])
Train: set([1, 4, 5, 8, 10, 11]) Test: set([0, 7])
Train: set([0, 1, 4, 8, 7]) Test: set([10, 11, 5])
Hence, it is easy to verify that it executes as intended. Your cross-validation scores are in the list scores and you can easily process them. I have used the variables (e.g. strata) that you defined in your last piece of code.
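If you want a single nested-CV estimate, take the mean of the outer-fold scores:
import numpy as np
np.mean(scores)  # nested cross-validation estimate of the ROC AUC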