KNN algorithm with GridSearchCV - python

I'm trying to create a KNN model with GridSearchCV but am getting an error pertaining to param_grid: "ValueError: Invalid parameter classifier_leaf_size for estimator KNeighborsClassifier(). Check the list of available parameters with estimator.get_params().keys().".
knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
scores = cross_val_score(knn, x_train, y_train, scoring='accuracy', cv=cv, n_jobs=-1)
print('Training Accuracy - KNN Classification: ', knn.score(x_train, y_train))
print('Testing Accuracy - KNN Classification', knn.score(x_test, y_test))
plt.show()
#classification report
cr = classification_report(y_test, y_pred)
print(cr, "\n")
#grid
estimator_KNN = KNeighborsClassifier(algorithm='auto')
knn_grid_set_up = {'n_neigbors': (1,10,1),
                   'classifier_leaf_size': (20,40,1), 'p': (1,2),
                   'classifier_weights': ('uniform', 'distance')
                   }
grid_search_KNN = GridSearchCV(
    estimator=estimator_KNN,
    param_grid=knn_grid_set_up,
    scoring='accuracy',
    n_jobs=-1,
    cv=5
)
grid_search_KNN.fit(x_train, y_train)
What causes the error? I've read the documentation and tried various approaches, but I still can't understand what is going on.
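For reference, the keys in param_grid must match the parameter names that KNeighborsClassifier itself exposes (check estimator.get_params().keys(), as the error suggests): n_neighbors, leaf_size, weights, p. Below is a minimal sketch of the same grid with those names; it is an illustration rather than a verified fix for this exact dataset, and x_train/y_train are assumed to be the arrays defined above.
# Minimal sketch (assumption): use the classifier's own parameter names,
# since the estimator passed to GridSearchCV is KNeighborsClassifier itself
# and no pipeline prefix such as 'classifier__' applies here.
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

estimator_KNN = KNeighborsClassifier(algorithm='auto')
knn_param_grid = {
    'n_neighbors': list(range(1, 11)),    # not 'n_neigbors'
    'leaf_size': list(range(20, 41, 5)),  # not 'classifier_leaf_size'
    'weights': ['uniform', 'distance'],   # not 'classifier_weights'
    'p': [1, 2],
}
grid_search_KNN = GridSearchCV(
    estimator=estimator_KNN,
    param_grid=knn_param_grid,
    scoring='accuracy',
    n_jobs=-1,
    cv=5,
)
# grid_search_KNN.fit(x_train, y_train)  # x_train/y_train as defined in the question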

Related

Get the error "All intermediate steps should be transformers and implement fit and transform or be the string passthrough" and I can't resolve it

I'm trying to create a training and prediction pipeline that lets me train models on various sizes of training data and make predictions on the testing data. I wrote this function:
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    results = {}
    learner = Pipeline(steps=[('tree', DecisionTreeClassifier()),
                              ('logistic', LogisticRegression()),
                              ('naive', MultinomialNB())])
    learner.fit(X_train, y_train)
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results
and then wrote this code:
clf_A = DecisionTreeClassifier()
clf_B = LogisticRegression()
clf_C = MultinomialNB()
samples_100 = len(y_train)
samples_10 = int(len(y_train) * 0.1)
samples_1 = int(len(y_train) * 0.01)
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
            train_predict(clf, samples, X_train, y_train, X_test, y_test)
vs.evaluate(results, accuracy, fscore)
but got the error:
TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' 'DecisionTreeClassifier()' (type <class 'sklearn.tree._classes.DecisionTreeClassifier'>) doesn't
I have tried a lot but could not find a solution. Can you help me?
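For context, the error comes from the Pipeline itself: every intermediate step must be a transformer (implementing fit and transform), and only the final step may be a plain estimator, so chaining three classifiers is rejected. Below is a minimal sketch, an assumption about the intent rather than a verified fix, in which the function trains the classifier that was passed in (on the first sample_size rows) instead of building a Pipeline:
# Minimal sketch (assumption): train the passed-in learner directly instead of
# wrapping three classifiers in one Pipeline.
from sklearn.metrics import accuracy_score, fbeta_score

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    results = {}
    learner.fit(X_train[:sample_size], y_train[:sample_size])
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    return results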

how to use "scikit-learn calibration" after fine-tuning lightgbm

I fine-tuned LGBM and want to apply calibration, but I'm having trouble doing so.
I have 1) train, 2) valid, 3) test data.
I trained and fine-tuned LGBM using 1) train data and 2) valid data.
From that, I got the best parameters for LGBM.
Now I want to calibrate, so that my model's output can be directly interpreted as a confidence level, but I'm confused about how to use scikit-learn's CalibratedClassifierCV.
In my situation, should I use cv='prefit' or cv=5? Also, should I use train data or valid data fitting CalibratedClassifierCV?
1) uncalibrated_clf but after training
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
Which one is right? Are all of them right, or only one or two?
Below is the full code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.calibration import calibration_curve
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb
import matplotlib.pyplot as plt
np.random.seed(0)
n_samples = 10000
X, y = make_classification(
    n_samples=3*n_samples, n_features=20, n_informative=2,
    n_classes=2, n_redundant=2, random_state=32)
#n_samples = N_SAMPLES//10
X_train, y_train = X[:n_samples], y[:n_samples]
X_valid, y_valid = X[n_samples:2*n_samples], y[n_samples:2*n_samples]
X_test, y_test = X[2*n_samples:], y[2*n_samples:]
plt.figure(figsize=(12, 9))
plt.plot([0, 1], [0, 1], '--', color='gray')
# 1) Uncalibrated_clf but fine-tuned on training data
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
y_prob = clf.predict_proba(X_test)[:, 1]
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(
    fraction_of_positives,
    mean_predicted_value,
    'o-', label='uncalibrated_clf')
# 2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob1 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives1, mean_predicted_value1 = calibration_curve(y_test, y_prob1, n_bins=10)
plt.plot(
    fraction_of_positives1,
    mean_predicted_value1,
    'o-', label='calibrated_clf1')
# 2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
y_prob2 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives2, mean_predicted_value2 = calibration_curve(y_test, y_prob2, n_bins=10)
plt.plot(
    fraction_of_positives2,
    mean_predicted_value2,
    'o-', label='calibrated_clf2')
plt.legend()
# 2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob3 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives3, mean_predicted_value3 = calibration_curve(y_test, y_prob3, n_bins=10)
plt.plot(
    fraction_of_positives3,
    mean_predicted_value3,
    'o-', label='calibrated_clf3')
plt.legend()
The way to go about this is:
a) fit the model and calibrate on the hold out set
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv='prefit').fit(X_val, y_val)
y_pred = calibrated.predict(X_test)
(this is actually the meaning of prefit here: the model is already fitted; now take a new, relevant set and calibrate the output).
b) fit the model and calibrate with cross validation on the training set
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv=5).fit(X_train, y_train)
y_pred_val = calibrated.predict(X_val)
As is usually the case, the number of cross-validation folds and the method (isotonic regression vs. Platt scaling, or 'sigmoid' in scikit-learn's jargon) depend critically on your data and your setup. Therefore, I'd suggest putting those into a grid search and seeing what produces the best results.
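One way to do that, as a minimal sketch rather than a definitive recipe (it assumes a recent scikit-learn where CalibratedClassifierCV takes the wrapped model via the estimator argument, and it reuses X_train/y_train from the question):
# Hedged sketch: grid-search the calibration method and fold count,
# scored with a proper scoring rule (log loss) rather than accuracy.
import lightgbm as lgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

calib_search = GridSearchCV(
    estimator=CalibratedClassifierCV(estimator=lgb.LGBMClassifier()),
    param_grid={'method': ['sigmoid', 'isotonic'], 'cv': [3, 5]},
    scoring='neg_log_loss',
    n_jobs=-1,
)
calib_search.fit(X_train, y_train)
print(calib_search.best_params_, calib_search.best_score_)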
Finally, a deeper dive can be found here:
https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/

The predictions from StackingRegressor (Sklearn) are not reproducible

I am working on training a regression model using StackingRegressor, and I found out that the predictions from this model are not consistent even though I am using the same random_state.
Here is my code:
random_seed = 42
mdl_lgbm = lightgbm.LGBMRegressor(colsample_bytree=0.6,
                                  learning_rate=0.05,
                                  max_depth=6,
                                  min_child_samples=227,
                                  min_child_weight=10,
                                  n_estimators=1800,
                                  num_leaves=45,
                                  reg_alpha=0,
                                  reg_lambda=1,
                                  subsample=0.6,
                                  n_jobs=-1,
                                  random_state=random_seed)
mdl_xgb = xgb.XGBRegressor(subsample=0.5,
                           n_estimators=900,
                           min_child_weight=8,
                           max_depth=6,
                           learning_rate=0.03,
                           colsample_bytree=0.8,
                           n_jobs=-1,
                           reg_alpha=2,
                           reg_lambda=50,
                           objective='reg:squarederror',
                           random_state=random_seed)
mdl_rf = RandomForestRegressor(bootstrap=True,
                               max_depth=110,
                               max_features='auto',
                               min_samples_leaf=5,
                               min_samples_split=5,
                               n_estimators=1430,
                               n_jobs=-1,
                               random_state=random_seed)
# Base models
base_mdl_names = {
    'XGB': mdl_xgb,
    'LGBM': mdl_lgbm,
    'RF': mdl_rf,
}
final_estimator = xgb.XGBRegressor(subsample=0.3,
                                   n_estimators=1200,
                                   min_child_weight=2,
                                   max_depth=5,
                                   learning_rate=0.06,
                                   colsample_bytree=0.8,
                                   n_jobs=-1,
                                   reg_alpha=1,
                                   reg_lambda=0.1,
                                   objective='reg:squarederror',
                                   random_state=random_seed)
base_estimators = list()
for name, mdl in base_mdl_names.items():
    base_estimators.append((name, mdl))
stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=5,
                                passthrough=True)
stacked_mdl.fit(X_train, y_train)
Please note that I am not changing X_train. When I use the trained model for prediction, the results are not reproducible. I mean that if I re-train the model, the results will be different even though every input is the same. Any clue why this is happening?
StackingRegressor uses cv (cross-validation). Therefore you also have to set its random_state in order to get exactly the same cross-validation split on each run.
You should do as follows:
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, random_state=random_seed, shuffle=True)
stacked_mdl = StackingRegressor(estimators=base_estimators,
                                final_estimator=final_estimator,
                                cv=kfold,
                                passthrough=True)

Different output in SVM model and hyperparameter tuning result

I want to ask about the output of my classification model.
I use this model to do classification on my dataset. Check out my code below.
clf = svm.SVC(kernel='linear', C = 1.0)
X_train, X_test, y_train, y_test = train_test_split(vector, data['label'], test_size=0.33, random_state=123)
clf.fit(X_train,y_train)
y_preds = clf.predict(X_test)
print(confusion_matrix(y_test, y_preds))
print(classification_report(y_test,y_preds))
print('\nAccuracy Score = ',accuracy_score(y_test, y_preds))
I use kernel='linear' and C=1.0, and the output of accuracy_score shows:
Accuracy Score = 0.7777777777777778
But when I try hyperparameter tuning with randomized search, I still can't get a better result than the model I built before. So I tried checking with the same parameters in RandomizedSearchCV, but I get a different result. Check my code below:
X_train, X_test, y_train, y_test = train_test_split(vector, data['label'], test_size=0.33, random_state=123)
parameters = [ {'C': [1], 'kernel': ['linear']} ]
search = RandomizedSearchCV(n_iter = 1, estimator = svm.SVC(), param_distributions = parameters)
search.fit(X_train,y_train)
print(search.best_params_)
print(search.best_score_)
Based on my code above, the output shows:
{'kernel': 'linear', 'C': 1}
0.737162980081435
Because of that, I conclude that it uses the same parameters but gets a different result. Can someone tell me why the results are different?
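One thing worth noting (a general property of scikit-learn searches rather than something specific to this dataset): the two numbers measure different things. accuracy_score above is computed on the held-out test split, whereas best_score_ is the mean cross-validated accuracy over folds of the training split, so they need not match even with identical parameters. A like-for-like comparison would score the refitted best estimator on the same test split, roughly as sketched below:
# Minimal sketch (assumption): score the search's refitted best estimator on the
# same test split used for the standalone SVC, so both numbers are comparable.
from sklearn.metrics import accuracy_score

best_clf = search.best_estimator_  # refit on the full training split by default
print('Test accuracy of best estimator =', accuracy_score(y_test, best_clf.predict(X_test)))
print('Mean CV accuracy (best_score_) =', search.best_score_)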

Results from GridSearchCV/RandomizedSearchCV cannot be reproduced by running a single model using the same parameters

I am running RandomizedSearchCV with 5 folds in order to find the best parameters. I have a hold-out set (X_test) that I use for prediction. The relevant portion of my code is:
svc= SVC(class_weight=class_weights, random_state=42)
Cs = [0.01, 0.1, 1, 10, 100, 1000, 10000]
gammas = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
param_grid = {'C': Cs,
              'gamma': gammas,
              'kernel': ['linear', 'rbf', 'poly']}
my_cv = TimeSeriesSplit(n_splits=5).split(X_train)
rs_svm = RandomizedSearchCV(SVC(), param_grid, cv=my_cv, scoring='accuracy',
                            refit='accuracy', verbose=3, n_jobs=1, random_state=42)
rs_svm.fit(X_train, y_train)
y_pred = rs_svm.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
clfreport = classification_report(y_test, y_pred)
print (rs_svm.best_params_)
The result is the following classification report:
Now, I am interested in reproducing this result using a stand-alone model (no RandomizedSearchCV) with the selected parameters:
from sklearn.model_selection import TimeSeriesSplit
tcsv = TimeSeriesSplit(n_splits=5)
for train_index, test_index in tcsv.split(X_train):
    train_index_ = int(train_index.shape[0])
    test_index_ = int(test_index.shape[0])
    X_train_, y_train_ = X_train[0:train_index_], y_train[0:train_index_]
    X_test_, y_test_ = X_train[test_index_:], y_train[test_index_:]
    class_weights = compute_class_weight('balanced', np.unique(y_train_), y_train_)
    class_weights = dict(enumerate(class_weights))
    svc = SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True,
              random_state=42)
    svc.fit(X_train_, y_train_)
    y_pred_ = svc.predict(X_test)
    cm = confusion_matrix(y_test, y_pred_)
    clfreport = classification_report(y_test, y_pred_)
In my understanding, the classification reports should be identical, but my results after this run are:
Does anyone have any suggestions why that might be happening?
Given your 1st code snippet, where you use RandomizedSearchCV to find the best hyperparameters, you don't need to do any splitting again; so, in your 2nd snippet, you should just fit using the found hyperparameters and the class weights using the whole of your training set, and then predict on your test set:
class_weights = compute_class_weight('balanced', np.unique(y_train), y_train)
class_weights = dict(enumerate(class_weights))
svc= SVC(C=0.01, gamma=0.1, kernel='linear', class_weight=class_weights, verbose=True, random_state=42)
svc.fit(X_train, y_train)
y_pred_=svc.predict(X_test)
cm = confusion_matrix(y_test, y_pred_)
clfreport = classification_report(y_test, y_pred_)
The discussion in Order between using validation, training and test sets might be useful for clarifying the procedure...
