I am using ElasticNet to obtain a fit of my data. To determine the hyperparameters (l1, alpha), I am using ElasticNetCV. With the obtained hyperparamers, I refit the model to the whole dataset for production use. I am unsure if this is correct in both, the machine learning aspect and - if so - how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
X_tr, X_te, y_tr, y_te = train_test_split(X,y)
optimizer = ElasticNetCV(l1_ratio = [.1,.5,.7,.9,.99,1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X,y)
Thank you in advance
I am a beginner on this but I would love to share my approach to ElasticNet hyperparameters tuning. I would suggest to use RandomizedSearchCV instead. Here is part of the current code I am writing now:
#-----------------------------------------------
# input:
# X_train, X_test, Y_train, Y_test: datasets
# Returns:
# R² and RMSE Scores
#-----------------------------------------------
# Standardize data before
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# define grid
params = dict()
# values for alpha: 100 values between e^-5 and e^5
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
Warning: you are testing 100 x 100 = 10 000 possible combinations.
# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()
# Call the RanddomizedSearch with Cross Validation using the chosen regressor
rs_cv= RandomizedSearchCV(regressor, params, n_iter = 100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())
# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that in RandomizedSearchCV the number of iterations can be predetermined in advance. The choices of points to be tested are random but 90% (in some cases) faster than GridSearchCV (that tests all possible combinations).
I am using this same approach for other regressors like RandomForests and GradientBoosting who parameters grids are far more complicated and demand much more computer power to run.
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny
Related
I created a very simple function to test XGBoost.
X is an array containing 1000 rows of "7*np.pi" for each row.
Y is simply "1 + 0.5*np.sin(x)"
I split the dataset in 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved to testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, model generalizes well if shuffle=False
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# METRICAS XGB
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I was expecting testing data to have a great RMSE, but it did not happen.
I was surprised to see that XGBoost simple repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
I have a custom scorer function whose inputs depend on the specific train and validation fold, additionally, the estimator's .predict_survival_function output is also needed. To give a more concrete example:
I am trying to run a GridSearch for a Random Survival Forest (scikit-survival package) with the Integrated Brier Score (IBS)
as the scoring method. The challenge is in the fact that the domain of the IBS is data- (and therefore, fold-) specific as it relies on the Kaplan-Meyer estimate at some point. Moreover, the .predict_survival_function method needs to be called every time during the scoring evaluation step and not only at the end of it.
It seems that I managed to to deal with the first issue by creating the following function:
def IB_time_interval(y_train, y_test):
y_times_tr = [i[2] for i in y_train]
y_times_te = [i[2] for i in y_test]
T1 = np.percentile(y_times_tr, 5, interpolation='higher')
T2 = np.percentile(y_times_tr, 95, interpolation='lower')
T3 = np.percentile(y_times_te, 5, interpolation='higher')
T4 = np.percentile(y_times_te, 95, interpolation='lower')
return np.linspace(np.maximum(T1,T3), np.minimum(T2, T4))
That is robust enough to work throughout all the folds. However, I am unable to retrieve the estimator's predictions during the grid search phase, as a non-fitted copy of it seems to passed instead every time the custom scorer function is called. The workaround that I I tried is to re-fit the estimator inside the scoring function, but not only this is conceptually wrong, it also raises errors.
The custom scorer function looks like the following:
def IB_scorer(y_true, y_pred, times=times_linspace, y=y, clf=rsf):
rsf.fit(X_train,y_train) #<--- = conceptually wrong
survs_test = rsf.predict_survival_function(X_test, return_array=False) #<---
T1, T2 = survs_test[0].x.min(), survs_test[0].x.max()
mask2 = np.logical_or(times >= T2, times < T1) # mask outer interval
times = times[~mask2]
#preds has shape (n_y-s, n_times)
preds_test = np.asarray([[fn(t) for t in times] for fn in survs_test])
return integrated_brier_score(y, y_true, preds_test, times)
and I create the scoring object immediately afterwards:
trial_IB_scorer = make_scorer(IB_scorer, greater_is_better=False)
Any suggestions? It would be great to be able to use GridSearch with more complex scoring functions, especially for the survival analysis case!
PS. I will paste the rest of the minimal working code here:
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sksurv.metrics import integrated_brier_score
from sksurv.datasets import load_breast_cancer
X, y = load_breast_cancer()
X = X.drop(["er", "grade"], axis=1)
y_cens = np.array([i[0] for i in y]) #censoring status 1 or 0
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2,
shuffle=True,
random_state=0,
stratify = y_cens)
param_grid = {
'max_depth': [4, 20, None],
'max_features': ["sqrt", None]
}
rsf = RandomSurvivalForest(n_jobs=1, random_state=0)
times_linspace = IB_time_interval(y_train, y_test)
clf = GridSearchCV(rsf, param_grid, refit=True, n_jobs=1,
scoring=trial_IB_scorer)
clf.fit(X_train, y_train)
print("final score clf", clf.score(X_train, y_train))
print(clf.best_params_)
I have a dataframe with 36540 rows. the objective is to predict y HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model but model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost which doesn't exist), however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a test train split random will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I am running a random forest regression model, but my results are not that great. One person recommended to check interaction effects.
Surprisingly I do not see too many questions about this. This one did not help me. I am also not sure how to incorporate to my code sklearn.preprocessing.PolynomialFeatures.
My data is very simple:
My code:
# Split data
y = starbucks_log.iloc[:, 0]
# x as all others
X = starbucks_log.drop('total_amount', axis = 1)
# Set seed for reproducibility
SEED = 1
# Split dataset into 80% train and 20% test
X_train, X_test, y_train, y_test = \
train_test_split(X, y,
test_size = 0.2,
random_state = SEED)
# Instantiate a random forests regressor 'rf' 400 estimators
rf = RandomForestRegressor(n_estimators = 400,
min_samples_leaf = 1,
random_state = SEED)
# Fit 'rf' to the training set
rf.fit(X_train, y_train)
# Predict the test set labels 'y_pred'
y_pred = rf.predict(X_test)
y_pred_train=rf.predict(X_train)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
rmse_train = MSE(y_train, y_pred_train)**(1/2)
# Print the test set RMSE
print('Test set RMSE of rf: {:.5f}'.format(rmse_test))
print('Train set RMSE of rf: {:.5f}'.format(rmse_train))
I would like to add all possible interaction effects of income, age, and male(gender). It would be easier to drop some of them later.
Thanks!
I am trying to solve a regression problem on Boston Dataset with help of random forest regressor.I was using GridSearchCV for selection of best hyperparameters.
Problem 1
Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters.
OR
Should I fit it on X, y to get best parameters.(X, y = entire dataset)
Problem 2
Say If I fit it on X, y and get the best parameters and then build a new model on these best parameters.
Now how should I train this new model on ?
Should I train the new model on X_train, y_train or X, y.
Problem 3
If I train new model on X,y then how will I validate the results ?
My code so far
#Dataframes
feature_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','TAX','PTRATIO','B','LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Grid Search to get best hyperparameters
from sklearn.grid_search import GridSearchCV
param_grid = {
'n_estimators': [100, 500, 1000, 1500],
'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}
Train a Model on the max_depth: 10, n_estimators: 100
RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
RMSE: 2.8139766730629394
I just want some guidance with what the correct steps would be
In general, to tune the hyperparameters, you should always train your model over X_train, and use X_test to check the results. You have to tune the parameters based on the results obtained by X_test.
You should never tune hyperparameters over the whole dataset because it would defeat the purpose of the test/train split (as you correctly ask in the Problem 3).
This is a valid concern indeed.
Problem 1
The GridSearchCV does cross validation indeed to find the proper set of hyperparameters. But you should still have a validation set to make sure that the optimal set of parameters is sound for it (so that gives in the end train, test, validation sets).
Problem 2
The GridSearchCV already gives you the best estimator, you don't need to train a new one. But actually CV is just to check if the building is sound, you can train then on the full dataset (see https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation for a full detailed discussion).
Problem 3
What you already validated is the way you trained your model (i.e. you already validated that the hyperparameters you found are sound and the training works as expected for the data you have).