How to nest LabelKFold?

I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.
The code I have looks like this:
import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
...
#get data (x, y, labels)
...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)
svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("svr", LinearSVR()),  # a pipeline step must be an estimator instance, not the class
])
search_params = dict(
    svr__C = C_space,
    svr__epsilon = epsilon_space
)
kfold = LabelKFold(labels, 5)
svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)
train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
...
#plot learning curve
My question is how to setup the cv attribute for the grid search and learning curve so that it will break my original set into training and test sets that don't share any labels for computing the learning curve. And then from those training sets, further separate them into training and test sets without sharing labels for the grid search?
Essentially, how do I run a nested LabelKFold?
I, the user who created the bounty for this question, wrote the following reproducible example using data available from sklearn.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold
digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1
strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in
## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)
## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)
#############################################################################
## this works: using regular randomly-allocated 5-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
##########################################################################
## this does not work: attempting to use label-aware CV in the inner loop
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Y, cv=outer_cv, n_jobs=1)

From your question, you want a LabelKFold score on your data, where each iteration of the outer LabelKFold grid-searches the parameters of your pipeline using another LabelKFold. I was not able to achieve that out of the box, but it takes only one loop:
outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv= inner_cv, n_jobs=1)
    clf.fit(X[outer_train],Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))
Running the code, the first iteration yields:
Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11])    Test: set([9, 2, 3, 12, 6])
    Inner:
        Train: set([0, 10, 11, 5, 7])    Test: set([8, 1, 4])
        Train: set([1, 4, 5, 8, 10, 11])    Test: set([0, 7])
        Train: set([0, 1, 4, 8, 7])    Test: set([10, 11, 5])
The output makes it easy to verify that the splits behave as intended: no label appears in both the training and the test set, at either level. The cross-validation scores are collected in the list scores for further processing. I reused the variables (e.g., strata) that you defined in your last piece of code.
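In scikit-learn 0.18 and later, LabelKFold was renamed GroupKFold and moved to sklearn.model_selection, and the splitter no longer takes the labels in its constructor. The following is a minimal sketch of the same nested loop under that newer API, not the answer's original code. The synthetic data and the names groups and search are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

# Synthetic stand-in data (assumption): 260 samples, 13 groups, binary target.
rng = np.random.RandomState(0)
X = rng.normal(size=(260, 8))
z = (X[:, 0] + 0.5 * rng.normal(size=260) > 0).astype(int)
groups = np.arange(260) % 13

outer_cv = GroupKFold(n_splits=3)
scores = []
for outer_train, outer_test in outer_cv.split(X, z, groups):
    # Outer training and test folds must not share any group.
    assert not set(groups[outer_train]) & set(groups[outer_test])

    search = GridSearchCV(
        RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10),
        param_grid={"max_features": [2, 4]},
        scoring="roc_auc",
        cv=GroupKFold(n_splits=3),
    )
    # Passing the groups of the outer training fold makes the inner split group-aware.
    search.fit(X[outer_train], z[outer_train], groups=groups[outer_train])
    scores.append(search.score(X[outer_test], z[outer_test]))

print(np.mean(scores))
```

The inner splitter only ever sees the groups of the current outer training fold, which mirrors slicing strata[outer_train] in the loop above.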

Related

Python Scikit-Learn RandomizedSearchCV with custom scoring functions

I am using Scikit-Learn's Random Forest Regressor, Pipeline, and RandomizedSearchCV to predict the target variable using some features in my dataset. I need to use my own custom scoring functions that calculate weighted scores using weights (signifying the importance of observations) from the dataset. My code seems to work but I am getting a warning when the grid runs:
DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for examples using ravel(). self.__final_estimator.fit(Xt, y, **fit_params)
This is related to .fit(X_train, y_train). Based on this warning, if I change the code to .fit(X_train, y_train.values.ravel()) then I cannot get my weighted scores to work. I have tried editing the code in different/appropriate ways to get the weighted scores to work but to no avail.
I am including my code below that runs on a test data in test.csv. The file has four columns: two feature columns ('x1', 'x2'), target ('y') and weight ('weight') columns. The custom scoring functions below are simple functions that calculate weighted rmse_score and mean_abs_error_score. How can I use .fit(X_train, y_train.values.ravel()) and still compute the scores?
import pandas as pd
import numpy as np
import sklearn.model_selection as skms
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
def rmse_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    rmse = np.sqrt(np.mean(weight*(y_true.values-y_pred)**2))
    return rmse
def mean_abs_error_score(y_true, y_pred, weight):
    weight = weight.loc[y_true.index.values]
    mae = np.mean(weight*np.absolute(y_true.values-y_pred))
    return mae
#---- reading data
heart_df = pd.read_csv('data\\test.csv')
#---- splitting into training & testing sets
y = heart_df['y']
X = heart_df[['x1', 'x2']]
X_train, X_test, y_train, y_test = skms.train_test_split(X, y, test_size=0.20)
X_train_weights = heart_df['weight'].loc[X_train.index.values]
params = {"weight": X_train_weights}
my_scorer1 = make_scorer(rmse_score, greater_is_better=False, **params)
my_scorer2 = make_scorer(mean_abs_error_score, greater_is_better=False, **params)
#---- random forest training with hyperparameter tuning
pipe = Pipeline([("scaler", StandardScaler()), ("rfr", RandomForestRegressor())])
random_grid = { "rfr__n_estimators": [10, 100, 500, 1000],
"rfr__max_depth": [10, 20, 30, 40, 50, None],
"rfr__max_features": [0.25, 0.50, 0.75],
"rfr__min_samples_split": [5, 10, 20],
"rfr__min_samples_leaf": [3, 5, 10],
"rfr__bootstrap": [True, False]
}
rfr_cv = skms.RandomizedSearchCV(pipe,
param_distributions=random_grid,
n_iter = 15,
cv = 3,
verbose=3,
scoring={'rmse': my_scorer1, 'mae':my_scorer2},
refit = 'rmse',
random_state=42,
n_jobs = -1)
rfr_cv.fit(X_train, y_train)
best_params = rfr_cv.best_params_
best_score = rfr_cv.best_score_
print(f'best hyperparameters = {best_params}')
print(f'best score = {best_score}')

RandomForestRegressor used with GridSearchCV and RandomSearchCV may be overfitting on test set

I am following along with the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron. In chapter 2 you get hands-on experience building an ML system using the California Housing Prices dataset from StatLib.
I have been running cross validation tests using BOTH GridSearchCV and RandomSearchCV to test and see which performs better (they both perform about the same, depending on the run GridSearch will perform better than RandomSearch and vice versa). During my cross validation of the training set, all of my RMSE's come back (after about 10 folds) looking like so:
49871.10156541779 {'max_features': 6, 'n_estimators': 100} GRID SEARCH CV
49573.67188289324 {'max_features': 6, 'n_estimators': 300} GRID SEARCH CV
49759.116323927 {'max_features': 8, 'n_estimators': 100} GRID SEARCH CV
49388.93702859155 {'max_features': 8, 'n_estimators': 300} GRID SEARCH CV
49759.445071611895 {'max_features': 10, 'n_estimators': 100} GRID SEARCH CV
49517.74394767381 {'max_features': 10, 'n_estimators': 300} GRID SEARCH CV
49796.22587441326 {'max_features': 12, 'n_estimators': 100} GRID SEARCH CV
49616.61833604992 {'max_features': 12, 'n_estimators': 300} GRID SEARCH CV
49795.571075148444 {'max_features': 14, 'n_estimators': 300} GRID SEARCH CV
49790.38581725693 {'n_estimators': 100, 'max_features': 12} RANDOM SEARCH CV
49462.758078362356 {'n_estimators': 300, 'max_features': 8} RANDOM SEARCH CV
Please note that I am selecting the best results out of about 50 or so results to present here. I am using the following code to generate this:
param_grid = [{'n_estimators' : [3, 10, 30, 100, 300],
'max_features' : [2, 4, 6, 8, 10, 12, 14]},
{'bootstrap' : [False], 'n_estimators' : [3, 10, 12],
'max_features' : [2, 3, 4]}]
forest_regressor = RandomForestRegressor({'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse',
'max_depth': None, 'max_features': 8, 'max_leaf_nodes': None,
'max_samples': None, 'min_impurity_decrease': 0.0,
'min_impurity_split': None, 'min_samples_leaf': 1,
'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0,
'n_estimators': 300, 'n_jobs': None, 'oob_score': False,
'random_state': None, 'verbose': 0, 'warm_start': False})
grid_search = GridSearchCV(forest_regressor, param_grid, cv=10, scoring="neg_mean_squared_error",
return_train_score=True, refit=True)
grid_search.fit(Dataframe, TrainingLabels)
prediction = grid_search.predict(Dataframe)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params, "GRID SEARCH CV")
##################################################################################
#Randomized Search Cross Validation
param_grid = [{'n_estimators' : [3, 10, 30, 100, 300],
'max_features' : [2, 4, 6, 8, 10, 12, 14]},
{'bootstrap' : [False], 'n_estimators' : [3, 10, 12],
'max_features' : [2, 3, 4]}]
forest_regressor = RandomForestRegressor({'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse',
'max_depth': None, 'max_features': 8, 'max_leaf_nodes': None,
'max_samples': None, 'min_impurity_decrease': 0.0,
'min_impurity_split': None, 'min_samples_leaf': 1,
'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0,
'n_estimators': 300, 'n_jobs': None, 'oob_score': False,
'random_state': None, 'verbose': 0, 'warm_start': False})
rand_search = RandomizedSearchCV(forest_regressor, param_grid, cv=10, refit=True,
scoring='neg_mean_squared_error', return_train_score=True)
rand_search.fit(Dataframe, TrainingLabels)
prediction = rand_search.predict(Dataframe)
cvres = rand_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params, "RANDOM SEARCH CV")
Now, I am doing things a little differently than what the book states; my pipeline looks as such:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
class Dataframe_Manipulation:
    def __init__(self):
        self.dataframe = pd.read_csv(r'C:\Users\bohayes\AppData\Local\Programs\Python\Python38\Excel and Text\housing.csv')
    def Cat_Creation(self):
        # Creation of an Income Category to organize the median incomes into strata (bins) to sample from
        self.income_cat = self.dataframe['income_category'] = pd.cut(self.dataframe['median_income'],
                                                                     bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                                                                     labels=[1, 2, 3, 4, 5])
        self.rooms_per_house_cat = self.dataframe['rooms_per_house'] = self.dataframe['total_rooms']/self.dataframe['households']
        self.bedrooms_per_room_cat = self.dataframe['bedrooms_per_room'] = self.dataframe['total_bedrooms']/self.dataframe['total_rooms']
        self.pop_per_house = self.dataframe['pop_per_house'] = self.dataframe['population'] / self.dataframe['households']
        return self.dataframe
    def Fill_NA(self):
        self.imputer = KNNImputer(n_neighbors=5, weights='uniform')
        self.dataframe['total_bedrooms'] = self.imputer.fit_transform(self.dataframe[['total_bedrooms']])
        self.dataframe['bedrooms_per_room'] = self.imputer.fit_transform(self.dataframe[['bedrooms_per_room']])
        return self.dataframe
    def Income_Cat_Split(self):
        self.inc_cat_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
        for self.train_index, self.test_index in self.inc_cat_split.split(self.dataframe, self.dataframe['income_category']):
            self.strat_train_set = self.dataframe.loc[self.train_index].reset_index(drop=True)
            self.strat_test_set = self.dataframe.loc[self.test_index].reset_index(drop=True)
        # the proportion is the % of total instances and which strata they are assigned to
        self.proportions = self.strat_test_set['income_category'].value_counts() / len(self.strat_test_set)
        # Only pulling out training set!!!!!!!!!!!!!!!
        return self.strat_train_set, self.strat_test_set
    def Remove_Cats_Test(self):
        self.test_labels = self.strat_test_set['median_house_value'].copy()
        self.strat_test_set = self.strat_test_set.drop(['median_house_value'], axis=1)
        return self.test_labels
    def Remove_Cats_Training(self):
        self.training_labels = self.strat_train_set['median_house_value'].copy()
        self.strat_train_set = self.strat_train_set.drop(['median_house_value'], axis=1)
        return self.training_labels
    def Encode_Transform(self):
        self.column_trans = make_column_transformer((OneHotEncoder(), ['ocean_proximity']), remainder='passthrough')
        self.training_set_encoded = self.column_trans.fit_transform(self.strat_train_set)
        self.test_set_encoded = self.column_trans.fit_transform(self.strat_test_set)
        return self.training_set_encoded, self.test_set_encoded
    def Standard_Scaler(self):
        self.scaler = StandardScaler()
        self.scale_training_set = self.scaler.fit(self.training_set_encoded)
        self.scale_test_set = self.scaler.fit(self.test_set_encoded)
        self.scaled_training_set = self.scaler.transform(self.training_set_encoded)
        self.scaled_test_set = self.scaler.transform(self.test_set_encoded)
        return self.scaled_training_set
    def Test_Set(self):
        return self.scaled_test_set
A = Dataframe_Manipulation()
B = A.Cat_Creation()
C = A.Fill_NA()
D = A.Income_Cat_Split()
TestLabels = A.Remove_Cats_Test()
TrainingLabels = A.Remove_Cats_Training()
G = A.Encode_Transform()
TrainingSet = A.Standard_Scaler()
TestSet = A.Test_Set()
The grid and random searches come after this bit. However, my RMSE scores come back drastically different when I evaluate on the TestSet, which leads me to believe that I am overfitting. Or maybe the RMSEs look different because I am using a smaller test set? Here they are:
19366.910530221918
19969.043158986697
Here is the code that generates those values; it comes after I run the grid and random searches and apply the fitted models to the test set and test labels:
#Final Grid Model
final_grid_model = grid_search.best_estimator_
final_grid_prediction = final_grid_model.predict(TestSet)
final_grid_mse = mean_squared_error(TestLabels, final_grid_prediction)
final_grid_rmse = np.sqrt(final_grid_mse)
print(final_grid_rmse)
###################################################################################
#Final Random Model
final_rand_model = rand_search.best_estimator_
final_rand_prediction = final_rand_model.predict(TestSet)
final_rand_mse = mean_squared_error(TestLabels, final_rand_prediction)
final_rand_rmse = np.sqrt(final_rand_mse)
print(final_rand_rmse)
Just to make sure, I also computed a 95% confidence interval for the test RMSE; here are the code and the results:
#Confidence Grid Search
confidence = 0.95
squared_errors = (final_grid_prediction - TestLabels) ** 2
print(np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
loc=squared_errors.mean(),
scale=stats.sem(squared_errors))))
###################################################################################
#Confidence Random Search
confidence1 = 0.95
squared_errors1 = (final_rand_prediction - TestLabels) ** 2
print(np.sqrt(stats.t.interval(confidence1, len(squared_errors1) - 1,
loc=squared_errors1.mean(),
scale=stats.sem(squared_errors1))))
>>>[18643.4914044 20064.26363526]
[19222.30464011 20688.84660134]
Why is it that my average RMSE score on the TrainingSet is about 49,000 and that same score on the test set is averaging at about 19,000? I must be overfitting, but I am not sure how or where I am going wrong.

tl;dr: Your code is unnecessarily convoluted for such a (standard) job; do not re-invent the wheel, go with a pipeline instead.
There is an error in how you scale your data, which most probably is the root cause of the observed behavior here; in the second line:
self.scale_training_set = self.scaler.fit(self.training_set_encoded)
self.scale_test_set = self.scaler.fit(self.test_set_encoded)
you essentially overwrite your scaler with the results on the test set fit, and subsequently you actually scale your training data with this test-fitted scaler:
self.scaled_training_set = self.scaler.transform(self.training_set_encoded)
Since your test set is only 20% of the dataset, it does not contain enough values to adequately cover the whole range (min-max) of the (bigger) training set. As a result, the training set is mis-scaled (it actually contains values well above the max value of the test set), which probably leads to a higher RMSE (RMSE is not scale invariant and, by definition, depends on the scale of the predictions).
You may think that using StratifiedShuffleSplit upstream should have protected you from such a case, but truth is that StratifiedShuffleSplit is only good for classification datasets, and it is actually meaningless in regression ones (I am genuinely surprised that it does not throw an error here).
To remedy this issue, you should just remove the line
self.scale_test_set = self.scaler.fit(self.test_set_encoded)
from your Standard_Scaler() function.
Keep in mind that, in general, we never fit on a test set - we only transform; scikit-learn pipelines, apart from saving you from having to write all this boilerplate code (thus increasing the probability of coding errors), will protect you from this kind of error...
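As a minimal sketch of the pipeline approach recommended above: the scaler lives inside the pipeline, so it is fitted on the training data only and merely applied to the test data. The synthetic data and the LinearRegression stand-in (used here instead of a random forest to keep the example small) are assumptions, not the asker's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three features on a non-unit scale.
rng = np.random.RandomState(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler and the model on the training data only.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# The scaler statistics match the training set, not the test set.
scaler = model.named_steps["standardscaler"]
train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
print(train_only)

# score()/predict() only transform the test data with the train-fitted scaler.
test_score = model.score(X_test, y_test)
print(test_score)
```

Because the same pipeline object can be handed to GridSearchCV, the scaler is also re-fitted inside each cross-validation fold, so validation folds never influence the scaling either.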

How to perform cross-validation of a random-forest model in scikit-learn?

I need to perform leave-one-out cross-validation of a random-forest (RF) model.
I successfully built a model with high predictive ability.
Now I need to run a leave-one-out (LOO) test prior to publication.
Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()
# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
FC_data1 = FC_data.drop(FC_data[to_drop], axis=1)
y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
X_train = X.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
X_train.head(21)
y_train = y.drop(["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"], axis=0)
y_train.head(21)
X_test = X.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
X_test.head(5)
y_test = y.loc[["3-Acetoisopseudopsoralen", "3-Carbethoxypsoralen", "4,4'-Dimethylangelicin",
"4,7,4'-Trimethylallopsoralen", "Psoralen"]]
y_test.head(5)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)
from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = {"n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100], "max_depth":[1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 40, 50, 100]}
grid_search_cv_clf = GridSearchCV(clf_rf, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)
from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)
grid_search_cv_clf.best_params_
best_clf = grid_search_cv_clf.best_estimator_
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(25)
Now I need q2 value.
Finally, I wrote this code and got a reasonably high score of 0.9071543776303185.
from sklearn.model_selection import LeaveOneOut
parameters = {"n_estimators":[4], "max_depth":[20]}
loo_clf = GridSearchCV(best_clf, parameters, cv=LeaveOneOut())
loo_clf.fit(features_important, y_train)
loo_clf.score(features_important, y_train)
I'm not sure whether this is q2 or not. What do you think?
I also decided to obtain 5-fold cross-validation score. However, it gives ridiculous values like, for example: -36.58997717, 0.76801832, -1.59900448, 0.1834304 , -2.38256389 and a mean of -7.924019361863889.
from sklearn.model_selection import cross_val_score
cvs=cross_val_score(best_clf, features_important, y_train)
mean_cross_val_score = cvs.mean()
mean_cross_val_score
Is there a way to fix it?

You should not run the hyper-parameter search before evaluating the model. Instead, you should nest the two cross-validations; otherwise, you leak information from the evaluation data into the model selection. To learn more about this, look at the following example from the scikit-learn documentation: https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html#sphx-glr-auto-examples-model-selection-plot-nested-cross-validation-iris-py
Therefore, in your particular use-case, you should use: GridSearchCV, SelectFromModel, and cross_val_score:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
X, y = make_regression(n_samples=100)
feature_selector = SelectFromModel(
    RandomForestRegressor(n_jobs=-1), threshold="mean"
)
pipe = make_pipeline(
    feature_selector, RandomForestRegressor(n_jobs=-1)
)
param_grid = {
    # define the grid of the random-forest for the feature selection
    "selectfrommodel__estimator__n_estimators": [10, 20],
    "selectfrommodel__estimator__max_depth": [3, 5],
    # define the grid of the random-forest for the prediction
    "randomforestregressor__n_estimators": [10, 20],
    "randomforestregressor__max_depth": [5, 8],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1, cv=3)
# You can use the LOO in this way. Be aware that this is not a good practice,
# it leads to large variance when evaluating your model.
# scores = cross_val_score(grid_search, X, y, cv=LeaveOneOut(), error_score='raise')
# Pass the grid search (not the bare pipeline) so the tuning is nested in the outer CV.
scores = cross_val_score(grid_search, X, y, cv=2, error_score='raise')
scores.mean()

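You need to specify the scoring and the cv arguments.
Use this:
from sklearn.model_selection import LeaveOneOut, cross_val_score
mycv = LeaveOneOut()
cvs=cross_val_score(best_clf, features_important, y_train, scoring='r2',cv = mycv)
mean_cross_val_score = cvs.mean()
print(mean_cross_val_score)
This is intended to return the mean cross-validated R2 score using LOOCV. Note, however, that R2 is not well defined on a test fold that contains a single sample, so with leave-one-out the per-fold scores can come back as NaN.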
For more scoring options see here: https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values
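Since R2 cannot be computed on a single held-out sample, one commonly used way to obtain a leave-one-out q2 is to pool the out-of-sample predictions first and compute a single R2 over all of them. The following is a minimal sketch of that idea, assuming this pooled definition of q2 is the one wanted; the synthetic data stands in for features_important and y_train.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneOut, cross_val_predict

# Synthetic stand-in data (assumption): 30 samples, 5 features.
X, y = make_regression(n_samples=30, n_features=5, noise=5.0, random_state=0)
model = RandomForestRegressor(n_estimators=20, random_state=0)

# One out-of-sample prediction per observation, each from a model
# fitted on the remaining 29 observations.
y_loo = cross_val_predict(model, X, y, cv=LeaveOneOut())

# Pooled q2: 1 - PRESS / total sum of squares around the overall mean of y.
q2 = r2_score(y, y_loo)
print(q2)
```

If the hyper-parameters are tuned as well, the estimator passed to cross_val_predict should be the whole search object, so that tuning is repeated inside every leave-one-out fold.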

SKLearn Naive Bayes: add feature after tfidf vectorization

So I have been tasked with training a model on phone call transcripts. The following code does this. A little background info:
- x is a list of strings, each ith element is an entire transcript
- y is a list of booleans, stating the outcome of a call being positive or negative.
The following code works, but here is my issue. I want to include call duration as a feature to train on. I'd assume after the TFIDF transformer that vectorizes the transcripts, I would just concatenate the call duration feature to the TFIDF output right? Maybe this is easier than I think, but I have the transcripts and the durations all in the pandas data frame you see at the beginning of the code. So if I have that data frame column (numpy array) of durations, what do I need to do to add that feature into my model?
Additional Questions:
Am I missing a fundamental assumption about Naive Bayes model that limits me to vectorized strings?
At which step in my pipeline do I add the new feature?
Can this even be done in a pipeline or do I have to break it apart to do something like this?
Code:
import numpy as np
import pandas as pd
import random
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectPercentile
from sklearn.metrics import roc_auc_score
from sklearn.feature_selection import chi2
def main():
filename = 'QA_training.pkl'
splitRatio = 0.67
dataframe = loadData(filename)
x, y = getTrainingData(dataframe)
print len(x), len(y)
x_train, x_test = splitDataset(x, splitRatio)
y_train, y_test = splitDataset(y, splitRatio)
#x_train = np.asarray(x_train)
percentiles = [10, 15, 20, 25, 30, 35, 40, 45, 50]
MNNB_pipe = Pipeline([('vec', CountVectorizer()),('tfidf', TfidfTransformer()),('select', SelectPercentile(score_func=chi2)),('clf', MultinomialNB())])
MNNB_param_grid = {
#'vec__max_features': (10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000),
'tfidf__use_idf': (True, False),
'tfidf__sublinear_tf': (True, False),
'vec__binary': (True, False),
'tfidf__norm': ('l1', 'l2'),
'clf__alpha': (1, 0.1, 0.01, 0.001, 0.0001, 0.00001),
'select__percentile': percentiles
}
MNNB_search = GridSearchCV(MNNB_pipe, param_grid=MNNB_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
MNNB_search = MNNB_search.fit(x_train, y_train)
MNNB_search_best_cv = cross_val_score(MNNB_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
SGDC_pipe = Pipeline([('vec', CountVectorizer()),('tfidf', TfidfTransformer()),('select', SelectPercentile(score_func=chi2)),('clf', SGDClassifier())])
SGDC_param_grid = {
#'vec__max_features': [10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000],
'tfidf__use_idf': [True, False],
'tfidf__sublinear_tf': [True, False],
'vec__binary': [True, False],
'tfidf__norm': ['l1', 'l2'],
'clf__loss': ['modified_huber','log'],
'clf__penalty': ['l1','l2'],
'clf__alpha': [1e-3],
'clf__n_iter': [5,10],
'clf__random_state': [42],
'select__percentile': percentiles
}
SGDC_search = GridSearchCV(SGDC_pipe, param_grid=SGDC_param_grid, cv=10, scoring='roc_auc', n_jobs=-1, verbose=1)
SGDC_search = SGDC_search.fit(x_train, y_train)
SGDC_search_best_cv = cross_val_score(SGDC_search.best_estimator_, x_train, y_train, cv=10, scoring='roc_auc', n_jobs=-1, verbose=10)
# pre_SGDC = SGDC_clf.predict(x_test)
# print (np.mean(pre_SGDC == y_test))
mydata = [{'model': MNNB_search.best_estimator_.named_steps['clf'],'features': MNNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': MNNB_search_best_cv.mean()},
#{'model': GNB_search.best_estimator_.named_steps['classifier'],'features': GNB_search.best_estimator_.named_steps['select'], 'mean_cv_scores': GNB_search_best_cv.mean()},
{'model': SGDC_search.best_estimator_.named_steps['clf'],'features': SGDC_search.best_estimator_.named_steps['select'], 'mean_cv_scores': SGDC_search_best_cv.mean()}]
model_results_df = pd.DataFrame(mydata)
model_results_df.to_csv("best_model_results.csv")

As far as I'm aware, sklearn pipelines are API driven -- There's no real magic that happens in the pipeline itself. So, from that perspective, you should be able to create your own wrapper around TfidfVectorizer that does what you want it to do. For example, let's assume that you have a DataFrame that looks like this:
df = pd.DataFrame({'text': ['foo text', 'bar text'], 'duration': [1, 2]})
you could probably implement your transform as follows:
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

class MyVectorizer(object):
    def __init__(self, tfidf_kwargs=None):
        # fall back to an empty dict: ** cannot unpack None
        self._tfidf = TfidfVectorizer(**(tfidf_kwargs or {}))
    def fit(self, X, y=None):
        self._tfidf.fit(X['text'], y)
        return self
    def fit_transform(self, X, y=None):
        self.fit(X)
        return self.transform(X, copy=False)
    def transform(self, X, copy=True):
        result = self._tfidf.transform(X['text'], copy=copy)
        # result is a sparse matrix, so append the duration column with
        # scipy.sparse.hstack (np.column_stack does not handle sparse input)
        durations = sparse.csr_matrix(X[['duration']].values)
        return sparse.hstack((result, durations)).tocsr()
And then I think you should be all set to use this instead of the original tfidf vectorizer.
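In scikit-learn 0.20 or later, the same combination can be sketched without a custom wrapper by using ColumnTransformer, which applies a different transformer to each DataFrame column and concatenates the results. This is an alternative under that version assumption, not the answer's method. The toy DataFrame extends the one above, and the labels y are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data (assumption): four transcripts with durations and made-up outcomes.
df = pd.DataFrame({
    "text": ["foo text", "bar text", "foo foo", "bar bar"],
    "duration": [1, 2, 3, 4],
})
y = [True, False, True, False]

features = ColumnTransformer([
    # A plain string column name hands the vectorizer a 1-D series of documents.
    ("tfidf", TfidfVectorizer(), "text"),
    # A list of column names passes a 2-D block through unchanged.
    ("duration", "passthrough", ["duration"]),
])
pipe = Pipeline([("features", features), ("clf", MultinomialNB())])
pipe.fit(df, y)

# Number of columns the classifier sees: TF-IDF vocabulary plus duration.
n_features = pipe.named_steps["features"].transform(df).shape[1]
print(n_features)
```

MultinomialNB expects non-negative features, which holds for call durations. A raw duration is on a very different scale from TF-IDF weights, though, so rescaling or binning it first may be worth trying.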

Pass a scoring function from sklearn.metrics to GridSearchCV

GridSearchCV's documentation states that I can pass a scoring function.
scoring : string, callable or None, default=None
I would like to use a native accuracy_score as a scoring function.
So here is my attempt. Imports and some data:
import numpy as np
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn import neighbors
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([0, 1, 0, 0, 0, 1])
Now when I use just k-fold cross-validation without my scoring function, everything works as intended:
parameters = {
'n_neighbors': [2, 3, 4],
'weights':['uniform', 'distance'],
'p': [1, 2, 3]
}
model = neighbors.KNeighborsClassifier()
k_fold = KFold(len(Y), n_folds=6, shuffle=True, random_state=0)
clf = GridSearchCV(model, parameters, cv=k_fold) # TODO will change
clf.fit(X, Y)
print clf.best_score_
But when I change the line to
clf = GridSearchCV(model, parameters, cv=k_fold, scoring=accuracy_score) # or accuracy_score()
I get the error: ValueError: Cannot have number of folds n_folds=10 greater than the number of samples: 6. which in my opinion does not represent the real problem.
In my opinion, the problem is that accuracy_score does not follow the signature scorer(estimator, X, y) that is given in the documentation.
So how can I fix this problem?

It will work if you change scoring=accuracy_score to scoring='accuracy' (see the documentation for the full list of scorers you can use by name in this way.)
In theory, you should be able to pass custom scoring functions like you're trying, but my guess is that you're right and accuracy_score doesn't have the right API.
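The mismatch is that a metric such as accuracy_score takes (y_true, y_pred), while the scoring argument expects a callable with the signature scorer(estimator, X, y). A minimal sketch of bridging the two with make_scorer, assuming a scikit-learn version that provides sklearn.model_selection; the enlarged toy dataset is an assumption so that three folds have enough samples.

```python
import numpy as np
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Toy data (assumption): 12 points in two clusters.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2],
              [-1, -2], [-2, -3], [2, 3], [1, 2], [3, 1], [-3, -1]])
Y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0])

# make_scorer adapts metric(y_true, y_pred) to scorer(estimator, X, y).
accuracy_scorer = make_scorer(accuracy_score)

clf = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 2, 3], "weights": ["uniform", "distance"]},
    cv=KFold(n_splits=3, shuffle=True, random_state=0),
    scoring=accuracy_scorer,
)
clf.fit(X, Y)
print(clf.best_score_)

# The wrapped scorer can also be called directly with the scorer signature.
direct_score = accuracy_scorer(clf.best_estimator_, X, Y)
```

For built-in metrics the string form (scoring='accuracy') is shorter; make_scorer is mainly useful for metrics that need extra arguments or have no predefined name.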

Here is an example of using weighted kappa as the scoring metric for GridSearchCV with a simple random forest model. The key lesson for me was to pass the parameters of the metric itself through the make_scorer function.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import cohen_kappa_score, make_scorer
kappa_scorer = make_scorer(cohen_kappa_score,weights="quadratic")
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_features': range(2,10), # try features from 2 to 9
    'min_samples_leaf': [3, 4, 5],
    'n_estimators' : [100,300,500],
    'max_depth': [5]
}
# Create a base model
random_forest = RandomForestClassifier(class_weight ="balanced_subsample",random_state=1)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = random_forest, param_grid = param_grid,
                           cv = 5, n_jobs = -1, verbose = 2, scoring = kappa_scorer) # search for best model using weighted kappa
# Fit the grid search to the data
grid_search.fit(final_tr, yTrain)
