Scikitlearn GridSearchCV best model scores - python

I am trying to print the training and test score of the best model from my GridSearchCV object. My initial guess was to use cv_results['best_train_score'] and cv_results['best_test_score'] but after looking at the documentation I dont think there is a 'best_train_score' for cv_results.
I also see that there is a best_estimator_ but I'm not sure if I can use this to print a test and a training score. Any help is greatly appreciated.

You can use the best_estimator_ of your fitted GridSearchCV to retrieve the best model and then use the score function of your estimator to calculate the train and test accuracy.
As follows:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, train_test_split
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
iris.data, iris.target, test_size=0.2
)
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}
svc = svm.SVC()
cv = GridSearchCV(svc, parameters)
cv.fit(iris.data, iris.target)
model = cv.best_estimator_
print(f"train score: {model.score(X_train, y_train)}")
print(f"test score: {model.score(X_test, y_test)}")
Output:
train score: 0.9916666666666667
test score: 1.0

Related

Why do my CatBoost fit metrics are different than the sklearn evaluation metrics?

I'm still not sure this should be a question for this forum or for Cross-Validated, but I'll try this one, since it's more about the output of the code than the technique per se. Here's the thing, I'm running a CatBoost Classifier, just like this:
# import libraries
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split
# import data
train = pd.read_csv("train.csv")
# get features and label
X = train[["Pclass", "Sex", "SibSp", "Parch", "Fare"]]
y = train[["Survived"]]
# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# model parameters
model_cb = CatBoostClassifier(
cat_features=["Pclass", "Sex"],
loss_function="Logloss",
eval_metric="AUC",
learning_rate=0.1,
iterations=500,
od_type = "Iter",
od_wait = 200
)
# fit model
model_cb.fit(
X_train,
y_train,
plot=True,
eval_set=(X_test, y_test),
verbose=50,
)
y_pred = model_cb.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(roc_auc_score(y_test, y_pred))
The dataframe I'm using is from the Titanic competition (link).
The problem is that the model_cb.fit step is showing an AUC of 0.87, but the last line, the roc_auc_score from sklearn is showing me an AUC of 0.73, i.e., a much lower. The AUC from CatBoost, from what I understood is supposedly already on the testing dataset.
Any ideas on which is the problem here and how could I fix it?
The ROC curve needs predicted probabilities or some other sort of confidence measure, not hard class predictions. Use
y_pred = model_cb.predict_proba(X_test)[:, 1]
See Scikit-learn : roc_auc_score and Why does roc_curve return only 3 values?.

Best Practice for Executing Ensemble Methods

I am running the sample code below.
df = pd.read_csv('C:\\my_path\\test.csv', header=0, encoding = 'unicode_escape')
df = df.fillna(0)
X = df1.drop(columns = ['PRICE','MATURITYDATE'])
y = df1['PRICE']
from sklearn.model_selection import train_test_split
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
#create new a knn model
knn = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)
#fit model to training data
knn_gs.fit(X_train, y_train)
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neigbors value
print(knn_gs.best_params_)
# RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
#create a new random forest classifier
rf = RandomForestClassifier()
#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}
#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)
#fit model to training data
rf_gs.fit(X_train, y_train)
#save best model
rf_best = rf_gs.best_estimator_
#check best n_estimators value
print(rf_gs.best_params_)
# REGRESSION
from sklearn.linear_model import LogisticRegression
#create a new logistic regression model
log_reg = LogisticRegression()
#fit the model to the training data
log_reg.fit(X_train, y_train)
#test the three models with the test data and print their accuracy scores
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))
# VOTING CLASSIFIER
from sklearn.ensemble import VotingClassifier
#create a dictionary of our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]
#create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')
It's all from the link below
https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a
The problem that I'm running into with each of these methods is always this:
ValueError: Unknown label type: 'continuous'
I guess everything needs to be converted into a categorical type or, perhaps, one hot encoding needs to be applied. Is this correct? What is the best way to deal with this kind of issue? I'm hoping to keep things simple and very generic, without introducing custom coding. This is why I am leaning towards the scikit-learn libraries. I'd greatly appreciate any/all thoughts and insights. Thanks so much!

I am getting Not Fitted error in random forest classifier?

I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifer as my target variable is float. When I am trying to fit my model and then output them in sorted order to get the important features I am getting Not fitted error how to fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that previously your classifier was being fit with the training data you setup, but the regressor is now being fit with X and y.
However, I don't see here where you're setting X and y in the first place or even more where you actually load in a dataset. Could it be you forgot this step as well as what Harpal mentioned in another answer?

Logistic Regression: Train using past data and predict using current data?

I've trained and tested my logistic regression using available data but now need to output a future prediction. I want to include the 2017 values that I used in my training and test set to predict the 2018 probability.
This is the code I used to train and test my model:
Xadj = train.ix[:,('2016 transaction count','critical_CI', 'critical_CN','critical_CS',
'critical_FI', 'critical_IN','critical_OI','critical_RA','create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015','create_year_2016')]
#Coded is the transformation of 2017 transaction count to a binary variable
y = y=train.ix[:,('2017 transaction count coded')]
logit_model=sm.Logit(y,Xadj)
result=logit_model.fit()
print(result.summary())
X_train, X_test, y_train, y_test = train_test_split(Xadj, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
#Cross Validation
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, random_state=7)
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
In an attempt to export predictions for 2018, I have done the following:
#Create 2018 Purchase Probability
train['2018 Purchase Probability']=pd.DataFrame({'2018 Purchase Probability' : []})
yact=train.ix[:,('2018 Purchase Probability')]
#Adding in 2017 values
X = train.ix[:, ('2017 transaction count','critical_CI', 'critical_CN','critical_CS',
'critical_FI', 'critical_IN','critical_OI','critical_RA','create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015','create_year_2016','create_year_2017')]
from sklearn.preprocessing import scale, StandardScaler
scaler = StandardScaler()
scaler.fit(Xadj)
X = scaler.transform(Xadj)
X_pred = scaler.transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(Xadj, y)
#Generate 0/1 prediction
prediction = logreg.predict(X= X)
#Generate odds ratio
precent_prediction = logreg.predict_proba(X= X)
prediction = pd.DataFrame(prediction)
I'm not sure if I've done this correctly and judging from my output (which is mostly 1's) I don't think I have. I am new to coding in Python and am struggling to turn my tested model into a future prediction that can be used to make decisions.
Thanks in advance for any help!

Cross validation and model selection

I am using sklearn for SVM training. I am using the cross-validation to evaluate the estimator and avoid the overfitting model.
I split the data into two parts. Train data and test data. Here is the code:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
iris.data, iris.target, test_size=0.4, random_state=0
)
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, X_train, y_train, cv=5)
print scores
Now I need to evaluate the estimator clf on X_test.
clf.score(X_test, y_test)
here, I get an error saying that the model is not fitted using fit(), but normally, in cross_val_score function the model is fitted? What is the problem?
cross_val_score is basically a convenience wrapper for the sklearn cross-validation iterators. You give it a classifier and your whole (training + validation) dataset and it automatically performs one or more rounds of cross-validation by splitting your data into random training/validation sets, fitting the training set, and computing the score on the validation set. See the documentation here for an example and more explanation.
The reason why clf.score(X_test, y_test) raises an exception is because cross_val_score performs the fitting on a copy of the estimator rather than the original (see the use of clone(estimator) in the source code here). Because of this, clf remains unchanged outside of the function call, and is therefore not properly initialized when you call clf.fit.

Categories