Below is an example of using scikit-learn to get cross-validated predictions from k-nearest neighbors, with k chosen by cross-validation. The code seems to work, but how can I also print the k that was selected in each of the outer folds?
import numpy as np, sklearn
n = 100
X = np.random.randn(n, 2)
y = np.where(np.sum(X, axis = 1) + np.random.randn(n) > 0, "blue", "red")
preds = sklearn.model_selection.cross_val_predict(
X = X,
y = y,
estimator = sklearn.model_selection.GridSearchCV(
estimator = sklearn.neighbors.KNeighborsClassifier(),
param_grid = {'n_neighbors': range(1, 7)},
cv = sklearn.model_selection.KFold(10, random_state = 133),
scoring = 'accuracy'),
cv = sklearn.model_selection.KFold(10, random_state = 144))
You can't get this directly from that function, so you need to replace cross_val_predict with cross_validate and set the return_estimator flag to True. You can then retrieve the fitted estimators from the returned dictionary under the key estimator. The selected parameters of each estimator are stored in its best_params_ attribute. So:
import numpy as np
import sklearn
# sklearn 0.20.3 doesn't seem to import submodules in __init__
# So importing them directly is required.
import sklearn.model_selection
import sklearn.neighbors
n = 100
X = np.random.randn(n, 2)
y = np.where(np.sum(X, axis = 1) + np.random.randn(n) > 0, "blue", "red")
# shuffle=True is needed for random_state to take effect;
# recent scikit-learn versions raise an error otherwise.
scores = sklearn.model_selection.cross_validate(
    X = X,
    y = y,
    estimator = sklearn.model_selection.GridSearchCV(
        estimator = sklearn.neighbors.KNeighborsClassifier(),
        param_grid = {'n_neighbors': range(1, 7)},
        cv = sklearn.model_selection.KFold(10, shuffle = True, random_state = 133),
        scoring = 'accuracy'),
    cv = sklearn.model_selection.KFold(10, shuffle = True, random_state = 144),
    return_estimator=True)
# Selected hyper-parameters for the estimator from the first fold
print(scores['estimator'][0].best_params_)
Unfortunately you can't get the actual predictions AND the hyper-parameters selected from the same function. If you want that, you will have to do the nested cross-validation manually:
# shuffle=True is needed for random_state to take effect
cv = sklearn.model_selection.KFold(10, shuffle = True, random_state = 144)
estimator = sklearn.model_selection.GridSearchCV(
    estimator = sklearn.neighbors.KNeighborsClassifier(),
    param_grid = {'n_neighbors': range(1, 7)},
    cv = sklearn.model_selection.KFold(10, shuffle = True, random_state = 133),
    scoring = 'accuracy')
for train, test in cv.split(X,y):
    X_train, y_train = X[train], y[train]
    X_test, y_test = X[test], y[test]
    # Inner grid search runs on the outer training fold only
    m = estimator.fit(X_train, y_train)
    print(m.best_params_)
    y_pred = m.predict(X_test)
    print(y_pred)
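If you want the result in the same shape that cross_val_predict would give you, the manual loop can store each fold's predictions at the test indices. The following is only a sketch under a few assumptions: the seeded RandomState, the names outer_cv, search, preds and selected_k, and shuffle=True in both KFold objects are my own choices, not part of the original code.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data as in the question, but seeded (assumption) for repeatability
rng = np.random.RandomState(0)
n = 100
X = rng.randn(n, 2)
y = np.where(X.sum(axis=1) + rng.randn(n) > 0, "blue", "red")

outer_cv = KFold(10, shuffle=True, random_state=144)
search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 7)},
    cv=KFold(10, shuffle=True, random_state=133),
    scoring="accuracy",
)

preds = np.empty(n, dtype=y.dtype)  # one out-of-fold prediction per sample
selected_k = []                     # one selected k per outer fold

for train, test in outer_cv.split(X, y):
    search.fit(X[train], y[train])          # inner CV picks k on the training fold
    selected_k.append(search.best_params_["n_neighbors"])
    preds[test] = search.predict(X[test])   # predictions land at the test indices

print(selected_k)
print((preds == y).mean())  # overall out-of-fold accuracy
```

Every sample appears in exactly one outer test fold, so preds is completely filled after the loop.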
Related
I would like to include log transformation as part of my hyperparameter tuning. I'm currently running GridSearchCV twice and then select the best model from both runs. Is there a way to do this as part of GridSearchCV instead?
Example of my current model below:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression
X, y = make_regression(n_features=4, n_informative=2,
random_state=0, shuffle=False)
rf = RandomForestRegressor()
def do_log(y):
    y_t = np.log(y+1)
    return(y_t)
def do_exp(y_t):
    # Inverse of do_log
    y = np.exp(y_t)-1
    return(y)
transformed_rf = TransformedTargetRegressor(rf, func = do_log, inverse_func=do_exp)
param_grid = {'regressor__n_estimators': [100, 500, 1000]}
grid_search1 = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 10)
grid_search1.fit(X, y)
grid_search2 = GridSearchCV(estimator = transformed_rf, param_grid = param_grid, cv = 10)
grid_search2.fit(X, y)
if (grid_search1.best_score_ > grid_search2.best_score_):
    best_model = grid_search1.best_estimator_
elif (grid_search1.best_score_ < grid_search2.best_score_):
    best_model = grid_search2.best_estimator_
else:
    print("same performance for both models")
    best_model = grid_search1.best_estimator_
I'm looking for something like this:
param_grid = {'estimators': [100, 500, 1000],
'log_transform': [True, False],
}
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, cv = 10)
grid_search.fit(X, y)
best_model = grid_search.best_estimator_
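One possible way to get close to this, sketched below, is to search over a single TransformedTargetRegressor and pass GridSearchCV a list of two grids: one with func and inverse_func both set to None (which leaves the target untransformed) and one with the log transform. The assumptions here are mine: the target is shifted to be non-negative so the logarithm is defined, np.log1p and np.expm1 stand in for do_log and do_exp, and the n_estimators values are kept small so the example runs quickly.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_features=4, n_informative=2,
                       random_state=0, shuffle=False)
y = y - y.min()  # assumption: shift the target so log(y + 1) is defined

model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(random_state=0))

n_estimators = [10, 50]  # small values for a quick demo (assumption)
param_grid = [
    # No target transformation: both functions left as None
    {"regressor__n_estimators": n_estimators,
     "func": [None], "inverse_func": [None]},
    # Log transformation of the target and its inverse
    {"regressor__n_estimators": n_estimators,
     "func": [np.log1p], "inverse_func": [np.expm1]},
]

grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X, y)

print(grid_search.best_params_)
best_model = grid_search.best_estimator_
```

Using a list of grids keeps func and inverse_func paired, so the search never tries a transform with a mismatched inverse.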
My problem is that the training error is very low but the test error is high. I have already used PCA to reduce the dimensionality of the features, and these are the best results I can get so far, but they are still not good enough on the test data:
XGBoost :
R2 Score : 0.559832465443366
MSE : 0.021168084677487115
RMSE : 0.1454925588388874
MAE : 0.12313938140869134
dataset: https://docs.google.com/spreadsheets/d/1xLTv4jLh7j3sTh0UKMHnSUvMXx1qNiXZ/edit?usp=share_link&ouid=116330084208220275542&rtpof=true&sd=true
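This is my code:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
dataset = pd.read_excel('Data.xlsx')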
x = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 4)
sc = StandardScaler()
x_train[:, :] = sc.fit_transform(x_train[:, :])
x_test[:, :] = sc.transform(x_test[:, :])
pca = PCA(n_components = 4)
x_train = pca.fit_transform(x_train)
x_test = pca.transform(x_test)
rf = RandomForestRegressor()
adb = AdaBoostRegressor()
xgb = xgb.XGBRegressor()
gbrt = GradientBoostingRegressor()
rf_parameters = {'n_estimators':[200,500],'criterion':['squared_error', 'absolute_error', 'friedman_mse', 'poisson'], 'max_features': ['sqrt', 'log2', None]}
adb_parameters = {'n_estimators':[200,500],'loss':['linear', 'square', 'exponential']}
xgb_parameters = {'booster':['gbtree', 'dart'],
'sampling_method':['uniform', 'gradient_based'],
'tree_method':['auto','exact','approx','hist','gpu_hist'],
'n_estimators':[200,500]}
gbrt_parameters = {'loss':['squared_error', 'absolute_error', 'huber', 'quantile'],'n_estimators':[200,500],'criterion':['friedman_mse', 'squared_error'], 'max_features':['auto', 'sqrt', 'log2']}
rf_grid = GridSearchCV(rf, rf_parameters, cv = 8, n_jobs = -1)
adb_grid = GridSearchCV(adb, adb_parameters, cv = 8, n_jobs = -1)
xgb_grid = GridSearchCV(xgb, xgb_parameters, cv = 8, n_jobs = -1)
gbrt_grid = GridSearchCV(gbrt, gbrt_parameters, cv = 8, n_jobs = -1)
rf_grid.fit(x_train, y_train)
adb_grid.fit(x_train, y_train)
xgb_grid.fit(x_train, y_train)
gbrt_grid.fit(x_train, y_train)
y_pred_rf = rf_grid.predict(x_test)
y_pred_adb = adb_grid.predict(x_test)
y_pred_xgb = xgb_grid.predict(x_test)
y_pred_gbrt = gbrt_grid.predict(x_test)
What should I do to reduce the test error? The dataset consists of only 60 samples, and I use an 80-20 split. Thank you.
Why am I getting a negative score even though I am using scoring = 'neg_mean_squared_error'?
I tried the following code from apparently the source code:
neg_mean_squared_error_scorer = make_scorer(mean_squared_error, greater_is_better=False)
Source Code
However it doesn't work. And I don't see the point of using it if we are supposed to use scoring = 'neg_mean_squared_error'.
Here is the code I used:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from math import sqrt
from sklearn.metrics import \
r2_score, get_scorer, make_scorer, mean_squared_error
from sklearn.linear_model import \
Lasso, Ridge, LassoCV,LinearRegression
from sklearn.preprocessing import \
StandardScaler, PolynomialFeatures
from sklearn.model_selection import \
KFold, RepeatedKFold, GridSearchCV, \
cross_validate, train_test_split
# Features
x1 = np.linspace(-20,20,100)
x1 = np.array(x1).reshape(-1,1)
x2 = pow(x1,2)
x3 = pow(x1,3)
x4 = pow(x1,4)
x5 = pow(x1,5)
# Parameters
beta_0 = 1.75
beta_1 = 5
beta_3 = 0.05
beta_5 = -10.3
eps_mu = 0 # epsilon mean
eps_sigma = sqrt(4) # epsilon standard deviation
eps_size = 100 # epsilon size
np.random.seed(1) # Fixing a seed
eps = np.random.normal(eps_mu, eps_sigma, eps_size)
eps = np.array(eps).reshape(-1,1)
y = beta_0 + beta_1*x1 + beta_3*x3 + beta_5*x5 + eps
data = np.concatenate((y,x1,x2,x3,x4,x5), axis = 1)
X = data[:,1:6]
y = data[:,0]
alphas_to_try = np.linspace(0.00000000000000000000000001,0.002,10) ######## To modify #######
scoring = 'neg_mean_squared_error'
#scoring = (mean_squared_error, greater_is_better=False)
scorer = get_scorer(scoring)
k = 5
cv = KFold(n_splits = k)
for train_index, test_index in cv.split(data):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
validation_scores = []
train_scores = []
results_list = []
test_scores = []
for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha = curr_alpha)
    results = cross_validate(
        regmodel, X, y, scoring=scoring, cv=cv,
        return_train_score = True)
    validation_scores.append(np.mean(results['test_score']))
    train_scores.append(np.mean(results['train_score']))
    results_list.append(results)
    regmodel.fit(X,y)
    y_pred = regmodel.predict(X_test)
    test_scores.append(scorer(regmodel, X_test, y_test))
chosen_alpha_id = np.argmax(validation_scores)
chosen_alpha = alphas_to_try[chosen_alpha_id]
max_validation_score = np.max(validation_scores)
test_score_at_chosen_alpha = test_scores[chosen_alpha_id]
print('chosen_alpha:', chosen_alpha)
print('max_validation_score:', max_validation_score)
print('test_score_at_chosen_alpha:', test_score_at_chosen_alpha)
plt.figure(figsize = (8,8))
sns.lineplot(y = validation_scores, x = alphas_to_try, label = 'validation_data')
sns.lineplot(y = train_scores, x = alphas_to_try, label = 'training_data')
plt.axvline(x=chosen_alpha, linestyle='--')
sns.lineplot(y = test_scores, x = alphas_to_try, label = 'test_data')
plt.xlabel('alpha_parameter')
plt.ylabel(scoring)
plt.title('LASSO Regularisation')
plt.legend()
plt.show()
Why is the code not working? Why am I getting negative scores?
Output:
What I am supposed to get:
I am supposed to get something like the screenshot above, but MSE instead of r2 on the y axis.
As the name suggests, neg_mean_squared_error is the negative of the mean squared error, so negative scores are expected (in fact, it is positive scores that are impossible).
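A small check of the sign convention, on made-up data (the data, the LinearRegression model, and the variable names are assumptions for illustration only): the built-in scorer and the make_scorer(..., greater_is_better=False) version from the question both return the MSE with its sign flipped, so that "greater is better" holds for model selection.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, make_scorer, mean_squared_error

# Made-up regression data (assumption, for illustration only)
rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.randn(50)

model = LinearRegression().fit(X, y)

mse = mean_squared_error(y, model.predict(X))
builtin_score = get_scorer("neg_mean_squared_error")(model, X, y)
custom_score = make_scorer(mean_squared_error,
                           greater_is_better=False)(model, X, y)

print(mse)            # a non-negative number
print(builtin_score)  # the same number with a minus sign
print(custom_score)   # identical to builtin_score
```

To plot MSE itself on the y axis, negate the scores before plotting.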
As to the plots, there's a bigger problem. Your train and validation scores are obtained using cross_validate, and are fine. But your test scores are obtained by fitting the regressor to the entire X, y and then scoring that on X_test, y_test, a subset of the training set! So those scores are quite optimistically biased.
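One way to avoid that bias, sketched here under my own assumptions (synthetic linear data, an 80/20 train_test_split, and an arbitrary alpha grid), is to hold out the test set first, run cross_validate on the training part only, and fit the final model on the training part before scoring it on the held-out data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import get_scorer
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic data (assumption): linear signal plus noise
rng = np.random.RandomState(1)
X = rng.randn(200, 5)
y = X @ np.array([5.0, 0.0, 0.05, 0.0, -10.3]) + rng.normal(0, 2, 200)

# Hold out a test set that is never used for fitting or model selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scoring = "neg_mean_squared_error"
scorer = get_scorer(scoring)
cv = KFold(n_splits=5)
alphas_to_try = np.linspace(0.001, 1.0, 10)

validation_scores, test_scores = [], []
for curr_alpha in alphas_to_try:
    regmodel = Lasso(alpha=curr_alpha)
    # Cross-validate on the training data only
    results = cross_validate(regmodel, X_train, y_train,
                             scoring=scoring, cv=cv)
    validation_scores.append(np.mean(results["test_score"]))
    # Fit on the training data only, then score on the held-out test set
    regmodel.fit(X_train, y_train)
    test_scores.append(scorer(regmodel, X_test, y_test))

chosen_alpha_id = int(np.argmax(validation_scores))
print("chosen_alpha:", alphas_to_try[chosen_alpha_id])
print("test_score_at_chosen_alpha:", test_scores[chosen_alpha_id])
```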
A quick check on the scale of the errors: you have a degree-5 polynomial with the original feature taking values between -20 and 20. So the target takes values on the order of 10^6, and so squared errors may be expected on the order of 10^12.
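The scale argument can be checked numerically. This sketch rebuilds the noise-free part of the target from the coefficients in the question (the noise, with standard deviation 2, is negligible at this scale) and prints typical and extreme magnitudes; typical values come out around 10^6, and the largest are an order of magnitude above that at the ends of the range.

```python
import numpy as np

# Same feature grid and coefficients as in the question, without the noise term
x1 = np.linspace(-20, 20, 100)
y = 1.75 + 5 * x1 + 0.05 * x1**3 - 10.3 * x1**5

median_abs_y = np.median(np.abs(y))
max_abs_y = np.abs(y).max()

print(median_abs_y)      # typical magnitude of the target
print(max_abs_y)         # magnitude at the ends of the range
print(median_abs_y**2)   # rough scale of a squared error of that size
```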
I want to see the individual score of each fitted model to visualize the strength of cross validation (I am doing this to show my coworkers why cross validation is important).
I have a .csv file with 500 rows, 200 independent variables and 1 binary target. I defined skf to fold the data 5 times using StratifiedKFold.
My code looks like this:
X = data.iloc[0:500, 2:202]
y = data["target"]
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
clf = svm.SVC(kernel = "linear")
Scores = [0] * 5
for i, j in skf.split(X, y):
    X_train, y_train = X.iloc[i], y.iloc[i]
    X_test, y_test = X.iloc[j], y.iloc[j]
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
As you can see, I assigned a list of 5 zeroes to Scores. I would like to assign the clf.score(X_test, y_test) of each of the 5 predictions to the list. However, the indices i and j are not {1, 2, 3, 4, 5}. Rather, they are row numbers used to fold the X and y data frames.
How can I assign the test scores of each of the k fitted models to Scores within this loop? Do I need a separate index for this?
I know that cross_val_score does all of this and returns the k scores, which are then usually averaged. However, I want to show my coworkers what happens behind the cross-validation functions that come in the sklearn library.
Thanks in advance!
If I understood the question, and you don't need any particular indexing for Scores:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC
X = np.random.normal(size = (500, 200))
y = np.random.randint(low = 0, high=2, size=500)
# shuffle=True is needed for random_state to take effect
skf = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)
clf = SVC(kernel = "linear")
Scores = []
for i, j in skf.split(X, y):
    X_train, y_train = X[i], y[i]
    X_test, y_test = X[j], y[j]
    clf.fit(X_train, y_train)
    # Append one score per fold; no separate index is needed
    Scores.append(clf.score(X_test, y_test))
The result is:
>>>Scores
[0.5247524752475248, 0.53, 0.5, 0.51, 0.4444444444444444]
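To show that the manual loop is what the library does behind the scenes, you can compare it against cross_val_score using the same splitter. This is a sketch on smaller random data than above (the sizes, the seed, and the names manual_scores and library_scores are my assumptions); with the same folds, both approaches should give the same per-fold accuracies.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Smaller random data than in the answer, seeded for repeatability (assumption)
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 20))
y = rng.randint(low=0, high=2, size=200)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
clf = SVC(kernel="linear")

# Manual loop: one fit and one score per fold
manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))

# Library call with the same splitter
library_scores = cross_val_score(clf, X, y, cv=skf)

print(manual_scores)
print(library_scores)          # array of k per-fold scores
print(library_scores.mean())   # the average is computed by the caller
```

Note that cross_val_score returns the array of fold scores; taking the mean is a separate step.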
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
r = pd.read_csv("vitalsign_test.csv")
clm_list = []
for column in r.columns:
    clm_list.append(column)
X = r[clm_list[1:len(clm_list)-1]].values
y = r[clm_list[len(clm_list)-1]].values
X_train, X_test, y_train, y_test = train_test_split (X,y, test_size = 0.3, random_state=4)
k_range = range(1,25)
scores = []
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
scores.append(metrics.accuracy_score(y_test,y_pred))
plt.plot(k_range,scores)
plt.xlabel('value of k for clf')
plt.ylabel('testing accuracy')
The response I am getting is
ValueError: x and y must have same first dimension
my feature and response shape is:
y.shape
Out[60]: (500,)
X.shape
Out[61]: (500, 6)
This has nothing to do with your X and y; it is about the x and y arguments to plot: your scores has one element, while k_range has 24. The cause is incorrect indentation:
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
scores.append(metrics.accuracy_score(y_test,y_pred))
should be
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors = k)
    clf.fit(X_train,y_train)
    y_pred = clf.predict(X_test)
    scores.append(metrics.accuracy_score(y_test,y_pred))
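A self-contained way to confirm the fix, assuming synthetic data from make_classification in place of vitalsign_test.csv (which is not available here): with the append inside the loop, scores ends up with one entry per value of k, so it has the same length as k_range and can be plotted against it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the CSV data (assumption): 500 rows, 6 features
X, y = make_classification(n_samples=500, n_features=6, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=4)

k_range = range(1, 25)
scores = []
for k in k_range:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Inside the loop: one accuracy value per k
    scores.append(accuracy_score(y_test, y_pred))

print(len(k_range), len(scores))  # both lengths match, so plotting works
```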