I am trying to solve a regression problem on Boston Dataset with help of random forest regressor.I was using GridSearchCV for selection of best hyperparameters.
Problem 1
Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters.
OR
Should I fit it on X, y to get best parameters.(X, y = entire dataset)
Problem 2
Say If I fit it on X, y and get the best parameters and then build a new model on these best parameters.
Now how should I train this new model on ?
Should I train the new model on X_train, y_train or X, y.
Problem 3
If I train new model on X,y then how will I validate the results ?
My code so far
#Dataframes
feature_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','TAX','PTRATIO','B','LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Grid Search to get best hyperparameters
from sklearn.grid_search import GridSearchCV
param_grid = {
'n_estimators': [100, 500, 1000, 1500],
'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}
Train a Model on the max_depth: 10, n_estimators: 100
RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
RMSE: 2.8139766730629394
I just want some guidance with what the correct steps would be
In general, to tune the hyperparameters, you should always train your model over X_train, and use X_test to check the results. You have to tune the parameters based on the results obtained by X_test.
You should never tune hyperparameters over the whole dataset because it would defeat the purpose of the test/train split (as you correctly ask in the Problem 3).
This is a valid concern indeed.
Problem 1
The GridSearchCV does cross validation indeed to find the proper set of hyperparameters. But you should still have a validation set to make sure that the optimal set of parameters is sound for it (so that gives in the end train, test, validation sets).
Problem 2
The GridSearchCV already gives you the best estimator, you don't need to train a new one. But actually CV is just to check if the building is sound, you can train then on the full dataset (see https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation for a full detailed discussion).
Problem 3
What you already validated is the way you trained your model (i.e. you already validated that the hyperparameters you found are sound and the training works as expected for the data you have).
Related
I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up a 5 fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As I know, GridSearchCV uses the mean of 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because to avoid data leakage, we should use only the training set to fit the transformer, but in this case, we have 5 different training sets and I don't know which one the GridSearchCV function uses to fit and transform the validation and testing set.
My code is given below
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size = 0.2, random_state = i)
kf = KFold(n_splits = 4, shuffle = True, random_state = i)
pipe = Pipeline(steps = [("preprocessor", preprocessor),("model", ML_algo)])
grid = GridSearchCV(estimator = pipe, param_grid=param_grid,
scoring = make_scorer(mean_squared_error, greater_is_better=False,
squared=False),cv=kf, return_train_score = True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)
test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
short answer: there is no data leakage, test set is not used (and should not be used) for training the model in your code.
long answer: k fold cross-validation randomly divided your X_other& y_other(training set) into k splits, for each iteration of cross-validation, k-1 fold of data is used to train the model while this model is then evaluated with the 1 fold left using the metric you specified in scoring=. (refer to the below picture from sklearn for details:https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation)
After finding the best set of hyperparameters by GridSearchCV(), all training set data is used in training a final model with the found hyperparameters, then, X_test,y_test (test set) can be transformed by this model. Note that in this process, X_test,y_test is not used and should not be used other than in the final prediction.
I am using ElasticNet to obtain a fit of my data. To determine the hyperparameters (l1, alpha), I am using ElasticNetCV. With the obtained hyperparamers, I refit the model to the whole dataset for production use. I am unsure if this is correct in both, the machine learning aspect and - if so - how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
X_tr, X_te, y_tr, y_te = train_test_split(X,y)
optimizer = ElasticNetCV(l1_ratio = [.1,.5,.7,.9,.99,1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X,y)
Thank you in advance
I am a beginner on this but I would love to share my approach to ElasticNet hyperparameters tuning. I would suggest to use RandomizedSearchCV instead. Here is part of the current code I am writing now:
#-----------------------------------------------
# input:
# X_train, X_test, Y_train, Y_test: datasets
# Returns:
# R² and RMSE Scores
#-----------------------------------------------
# Standardize data before
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# define grid
params = dict()
# values for alpha: 100 values between e^-5 and e^5
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
Warning: you are testing 100 x 100 = 10 000 possible combinations.
# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()
# Call the RanddomizedSearch with Cross Validation using the chosen regressor
rs_cv= RandomizedSearchCV(regressor, params, n_iter = 100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())
# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that in RandomizedSearchCV the number of iterations can be predetermined in advance. The choices of points to be tested are random but 90% (in some cases) faster than GridSearchCV (that tests all possible combinations).
I am using this same approach for other regressors like RandomForests and GradientBoosting who parameters grids are far more complicated and demand much more computer power to run.
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny
I have a dataframe with 36540 rows. the objective is to predict y HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model but model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost which doesn't exist), however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a test train split random will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I trained a model using Logistic Regression to predict whether a name field and description field belong to a profile of a male, female, or brand. My train accuracy is around 99% while my test accuracy is around 83%. I have tried implementing regularization by tuning the C parameter but the improvements were barely noticed. I have around 5,000 examples in my training set. Is this an instance where I just need more data or is there something else I can do in Sci-Kit Learn to get my test accuracy higher?
overfitting is a multifaceted problem. It could be your train/test/validate split (anything from 50/40/10 to 90/9/1 could change things). You might need to shuffle your input. Try an ensemble method, or reduce the number of features. you might have outliers throwing things off
then again, it could be none of these, or all of these, or some combination of these.
for starters, try to plot out test score as a function of test split size, and see what you get
#The 'C' value in Logistic Regresion works very similar as the Support
#Vector Machine (SVM) algorithm, when I use SVM I like to use #Gridsearch
#to find the best posible fit values for 'C' and 'gamma',
#maybe this can give you some light:
# For SVC You can remove the gamma and kernel keys
# param_grid = {'C': [0.1,1, 10, 100, 1000],
# 'gamma': [1,0.1,0.01,0.001,0.0001],
# 'kernel': ['rbf']}
param_grid = {'C': [0.1,1, 10, 100, 1000]}
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report,confusion_matrix
# Train and fit your model to see initial values
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101)
model = SVC()
model.fit(X_train,y_train)
predictions = model.predict(X_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
# Find the best 'C' value
grid = GridSearchCV(SVC(),param_grid,refit=True,verbose=3)
grid.best_params_
c_val = grid.best_estimator_.C
#Then you can re-run predictions on this grid object just like you would with a normal model.
grid_predictions = grid.predict(X_test)
# use the best 'C' value found by GridSearch and reload your LogisticRegression module
logmodel = LogisticRegression(C=c_val)
logmodel.fit(X_train,y_train)
print(confusion_matrix(y_test,grid_predictions))
print(classification_report(y_test,grid_predictions))
I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily done by doing
predictions = model.predict(X_new)
My question is, what is this the error associated to this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my real predictions for the new data should be a random number computed from a gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on your training data set. I would suggest exploring the cross-validation submodule sklearn.cross_validation.
The most basic usage is:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on you training data-
If it's distribution is a good representation of the "real world" and of a sufficient size (see learning theories, as PAC), then I would generally agree.
That said- if you are looking for a practical way to evaluate your model, why won't you use the test set as Kris has suggested?
I usually use grid search for optimizing parameters:
#split to training and test sets
X_train, X_test, y_train, y_test =train_test_split(
X_data[indices], y_data[indices], test_size=0.25)
#cross validation gridsearch
params = dict(logistic__C=[0.1,0.3,1,3, 10,30, 100])
grid_search = GridSearchCV(clf, param_grid=params,cv=5)
grid_search.fit(X_train, y_train)
#print scores and best estimator
print 'best param: ', grid_search.best_params_
print 'best train score: ', grid_search.best_score_
print 'Test score: ', grid_search.best_estimator_.score(X_test,y_test)
The Idea is hiding the test set from your learning algorithm (and yourself)- Don't train and don't optimize parameters using this data.
Finally you should use the test set for performance evaluation (error) only, it should provide an unbiased mse.