I'm trying to graph the mean squared error of my data and I'm having a little difficulty figuring out just how to do it. I know you need both the "true" value and the "predicted" value in order to get the mse, but the way my project is laid out is quite confusing.
I have a method in which I generate a model like so:
def fit_curve(X, y, degree):
    poly_features = PolynomialFeatures(degree = degree)
    x_poly = poly_features.fit_transform(X)
    linreg = LinearRegression()
    model = linreg.fit(x_poly, y)
    return model
This returns a model that's already trained.
Then, I'm supposed to find the mean squared error for said model. I'm not sure how I'm supposed to do this since the model has already been trained without returning the predicted values.
Right now my method that calculates mse is:
def mse(X, y, degree, model):
    poly_features = PolynomialFeatures(degree = degree)
    linreg = LinearRegression()
    x_poly = poly_features.fit_transform(X)
    linreg.fit(x_poly, y)
    y_predict = linreg.predict(x_poly)
    mse = mean_squared_error(y_predict, y)
    return mse
I feel like a lot of the code I use in mse is redundant when compared to fit_curve. Unfortunately, the guidelines say that this is the way I need to do it (with mse taking X, y, degree, and model).
I think it's also worth noting that my current mse works correctly only up to about degree 13-14; beyond that, the curve it generates on the graph does not match the solution I was given. I'm not sure why it's not working perfectly, because I thought this was the right idea.
This is how it should be done:
1) Split your X and y into train and test sets. You can use train_test_split for that. You can choose your test_size (I put 0.33 as an example) and random_state (this one helps with reproducibility).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
2) Fit your model (here, a linear regression) using X_train and y_train. You already have a feature generation method (polynomial features), which is great. Fit it on the training data.
poly_features = PolynomialFeatures(degree=degree)
linreg = LinearRegression()
X_train_poly = poly_features.fit_transform(X_train)
linreg.fit(X_train_poly, y_train)
3) Evaluate your fitted model by checking whether it predicts correctly on unseen data (X_test). For that, you can indeed use mean_squared_error with the model's predictions and y_test. Caution: you must apply the same transformation to X_test as you did to X_train (that's why we first call poly_features.transform).
X_test_poly = poly_features.transform(X_test)
print(mean_squared_error(linreg.predict(X_test_poly), y_test))
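If the guidelines require mse to take X, y, degree, and an already trained model, one possible sketch is to rebuild only the polynomial features and reuse the model for prediction, rather than fitting a second regression. This assumes the model was produced by fit_curve with the same degree; the toy data and the names X_demo, y_demo, and model_demo are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures


def fit_curve(X, y, degree):
    # Same idea as in the question: expand the features, then fit a linear model.
    x_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    return LinearRegression().fit(x_poly, y)


def mse(X, y, degree, model):
    # Rebuild the polynomial features (assumes the same degree as in training)
    # and reuse the trained model; nothing is re-fitted here.
    x_poly = PolynomialFeatures(degree=degree).fit_transform(X)
    return mean_squared_error(y, model.predict(x_poly))


# Toy data (made up for illustration): an exact quadratic.
X_demo = np.linspace(-1, 1, 20).reshape(-1, 1)
y_demo = 1 + 2 * X_demo.ravel() + 3 * X_demo.ravel() ** 2
model_demo = fit_curve(X_demo, y_demo, degree=2)
print(mse(X_demo, y_demo, 2, model_demo))  # close to zero for an exact quadratic
```

To score on unseen data instead, pass X_test and y_test to the same function.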
Hope that helps.
I created a very simple function to test XGBoost.
X is an array of 1000 evenly spaced values between 0 and 7*np.pi.
Y is simply "1 + 0.5*np.sin(x)".
I split the dataset into 800 training and 200 testing rows. Shuffle MUST be False to simulate future occurrences, making sure the last 200 rows are reserved for testing.
import numpy as np
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error as MSE
from xgboost import XGBRegressor
N = 1000 # 1000 rows
x = np.linspace(0, 7*np.pi, N) # Simple function
y = 1 + 0.5*np.sin(x) # Generate simple function sin(x) as y
# Train-test split, intentionally use shuffle=False to simulate time series
X = x.reshape(-1,1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=False)
### Interestingly, the model generalizes well if shuffle=True
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
XGB_reg = XGBRegressor(random_state=42)
XGB_reg.fit(X_train,y_train)
# EVALUATE ON TRAIN DATA
yXGBPredicted = XGB_reg.predict(X_train)
rmse = np.sqrt(MSE(y_train, yXGBPredicted))
print("RMSE TRAIN XGB: % f" %(rmse))
# EVALUATE ON TEST DATA
yXGBPredicted = XGB_reg.predict(X_test)
# XGB METRICS
rmse = np.sqrt(MSE(y_test, yXGBPredicted))
print("RMSE TEST XGB: % f" %(rmse))
# Predict full dataset
yXGB = XGB_reg.predict(X)
# Plot and compare
plt.style.use('fivethirtyeight')
plt.rcParams.update({'font.size': 16})
fig, ax = plt.subplots(figsize=(10,5))
plt.plot(x, y)
plt.plot(x, yXGB)
plt.ylim(0,2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I trained the model on the first 800 rows and then predicted the next 200 rows.
I expected the test data to have a good RMSE, but that did not happen.
I was surprised to see that XGBoost simply repeated the last value of the training set on all rows of the predictions (see chart).
Any ideas why this doesn't work?
You're asking your model to "extrapolate" - making predictions for x values that are greater than x values in the training dataset. Extrapolation works with some model types (such as linear models), but it typically does not work with decision tree models and their ensembles (such as XGBoost).
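A small sketch of this effect, using scikit-learn only. As an assumption made to keep the example dependency-light, a single DecisionTreeRegressor stands in for the XGBoost ensemble, and the data is made up. Outside the training range the tree returns one constant value, while the linear model continues its trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Train on x in [0, 10] with a simple linear target (made-up data).
X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2.0 * X_train.ravel() + 1.0

# Points well beyond the training range.
X_future = np.array([[12.0], [15.0], [20.0]])

tree_pred = DecisionTreeRegressor(random_state=0).fit(X_train, y_train).predict(X_future)
linear_pred = LinearRegression().fit(X_train, y_train).predict(X_future)

print("tree:  ", tree_pred)    # the same value repeated: the right-most leaf
print("linear:", linear_pred)  # keeps following y = 2x + 1
```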
If you switch from XGBoost to LightGBM, then you can train extrapolation-capable decision tree ensembles using the "linear tree" approach:
Any ideas why this doesn't work?
Your XGBRegressor is probably over-fitted (has n_estimators = 100 and max_depth = 6). If you decrease those parameter values, then the red line will appear more jagged, and it will be easier for you to see it "working".
Right now, if you ask your over-fitted XGBRegressor to extrapolate, then it basically functions as a giant look-up table. When extrapolating towards +Inf, then the "closest match" is at x = 17.5; when extrapolating towards -Inf, then the "closest match" is at x = 0.0.
I am using ElasticNet to fit my data. To determine the hyperparameters (l1_ratio, alpha), I am using ElasticNetCV. With the hyperparameters obtained, I refit the model on the whole dataset for production use. I am unsure whether this is sound from a machine learning point of view and, if so, whether my implementation is right. The code "works" and presumably does what it should, but I want to be certain that it is also correct.
My procedure is:
X_tr, X_te, y_tr, y_te = train_test_split(X,y)
optimizer = ElasticNetCV(l1_ratio = [.1,.5,.7,.9,.99,1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X,y)
Thank you in advance
I am a beginner at this, but I would love to share my approach to ElasticNet hyperparameter tuning. I would suggest using RandomizedSearchCV instead. Here is part of the code I am currently writing:
#-----------------------------------------------
# input:
# X_train, X_test, Y_train, Y_test: datasets
# Returns:
# R² and RMSE Scores
#-----------------------------------------------
# Standardize data before
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# define grid
params = dict()
# values for alpha: 100 values between 10^-5 and 10^5 (np.logspace uses base 10)
params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
# values for l1_ratio: 100 values between 0 and 1
params['l1_ratio'] = np.arange(0, 1, 0.01)
# Warning: this grid holds 100 x 100 = 10,000 possible combinations; RandomizedSearchCV samples n_iter of them.
# Create an instance of the Elastic Net Regressor
regressor = ElasticNet()
# Call RandomizedSearchCV (randomized search with cross-validation) using the chosen regressor
rs_cv= RandomizedSearchCV(regressor, params, n_iter = 100, scoring=None, cv=5, verbose=0, refit=True)
rs_cv.fit(X_train, Y_train.values.ravel())
# Results
Y_pred = rs_cv.predict(X_test)
R2_score = rs_cv.score(X_test, Y_test)
RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
return R2_score, RMSE_score, rs_cv.best_params_
The advantage of RandomizedSearchCV is that you fix the number of iterations up front. The points to test are chosen at random, but the search can be much faster (up to 90% in some cases) than GridSearchCV, which tests every possible combination.
I am using this same approach for other regressors like RandomForests and GradientBoosting, whose parameter grids are far more complicated and demand much more computing power to run.
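For illustration, here is a self-contained sketch of the randomized search described above. The synthetic data from make_regression, the parameter ranges, and the step names scale and enet are assumptions. Putting the scaler inside a Pipeline is a design choice, so that each cross-validation fold is scaled using only its own training part.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real data set (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("enet", ElasticNet(max_iter=10_000))])

# 50 x 50 = 2,500 candidate combinations; only n_iter of them are tried.
params = {
    "enet__alpha": np.logspace(-3, 1, 50),
    "enet__l1_ratio": np.linspace(0.05, 1.0, 50),
}
search = RandomizedSearchCV(pipe, params, n_iter=20, cv=5, random_state=0)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(search.best_params_, search.score(X_test, y_test), rmse)
```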
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny
I have a dataframe with 36540 rows. The objective is to predict y (HITS_DAY).
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist; the parameter is called objective). However, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So a random train/test split will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
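To put numbers like these in context, one option is to compare the model against a trivial baseline that always predicts the training mean. A minimal sketch under stated assumptions: the data is synthetic, and GradientBoostingRegressor from scikit-learn stands in for XGBRegressor to keep the example dependency-light.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up non-linear data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

mse_baseline = mean_squared_error(y_test, baseline.predict(X_test))
mse_model = mean_squared_error(y_test, model.predict(X_test))
print(f"baseline MSE: {mse_baseline:.4f}, model MSE: {mse_model:.4f}")
```

If the model's error is not clearly below the baseline's, it has learned little beyond the mean of the target.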
I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be large.
I'm not sure whether I'm misunderstanding what the errors mean or whether I made a mistake in my code. I thought that with an accuracy of 92%, the errors would be much smaller and closer to 0.
Here's my code:
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model, metrics

data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
predict = "G3"  # target column (the final grade)
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_accuracy = round(linear.score(x_test, y_test) , 5)
# predictions on the test set, used by the error metrics below
linear_prediction = linear.predict(x_test)
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))
Did I make any errors in the code, or do the errors make sense in this case?
The "accuracy" returned by score for sklearn linear regression is the R^2 metric. It essentially tells you the proportion of the variation in the dependent variable explained by the model predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked at your work, and it seems that you used all the numeric variables as your predictors and that your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get errors of 0. Please let me know if you have any questions. Cheers
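To see why a high R^2 and sizeable errors can coexist, here is a small sketch with made-up grades on a 0 to 20 scale (the numbers are illustrative, not taken from the data set). R^2 compares the squared residuals with the spread of the target, so a target with a wide spread can give a high R^2 even when every prediction is off by a full point.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true grades and predictions that are each off by exactly one point.
y_true = np.array([2.0, 5.0, 8.0, 10.0, 12.0, 14.0, 17.0, 20.0])
y_pred = y_true + np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])

mae = mean_absolute_error(y_true, y_pred)

# R^2 = 1 - SS_res / SS_tot, computed by hand and with scikit-learn.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae, r2_manual, r2_score(y_true, y_pred))
```

Here the mean absolute error is 1.0 grade point while R^2 is about 0.97, because SS_res = 8 is small relative to SS_tot = 254.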
I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily be done with
predictions = model.predict(X_new)
My question is: what is the error associated with this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my real predictions for the new data should be a random number computed from a gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on data held out from your training data set. I would suggest exploring the cross-validation tools in sklearn.model_selection (named sklearn.cross_validation before scikit-learn 0.18).
The most basic usage is:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on your training data.
If its distribution is a good representation of the "real world" and it is of sufficient size (see learning theory, e.g. PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use the test set as Kris has suggested?
I usually use grid search for optimizing parameters:
from sklearn.model_selection import GridSearchCV, train_test_split

# clf is assumed to be a Pipeline with a step named "logistic";
# X_data, y_data and indices are assumed to be defined elsewhere
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)
# cross-validated grid search
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)
# print scores and best estimator
print('best param: ', grid_search.best_params_)
print('best train score: ', grid_search.best_score_)
print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))
The idea is to hide the test set from your learning algorithm (and from yourself): don't train on it and don't tune parameters with it.
Finally, use the test set for performance evaluation (error) only; it should then provide an unbiased estimate of the MSE.
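A self-contained version of this workflow for current scikit-learn, as a sketch. The synthetic data from make_classification and the pipeline with a step named logistic are assumptions, chosen to match the logistic__C parameter used above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_data / y_data (assumption).
X_data, y_data = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out a test set that the search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.25, random_state=0
)

clf = Pipeline([("scale", StandardScaler()), ("logistic", LogisticRegression(max_iter=1000))])

# Cross-validated grid search on the training set only.
params = {"logistic__C": [0.1, 0.3, 1, 3, 10, 30, 100]}
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

print("best param:", grid_search.best_params_)
print("best CV score:", grid_search.best_score_)
print("test score:", grid_search.score(X_test, y_test))
```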