I have a problem. I found this question https://stats.stackexchange.com/questions/56302/what-are-good-rmse-values
Someone wrote:
The RMSE for your training and your test sets should be very similar
if you have built a good model.
and another wrote:
RMSE of test > RMSE of train => OVER FITTING of the data. RMSE of test
< RMSE of train => UNDER FITTING of the data.
I think the RMSE of the test data is computed like this:
y_pred = knn.predict(X_test)
rmse = metrics.mean_squared_error(y_test, y_pred , squared=False)
But how can I get the RMSE (or another metric) of my training data? Perhaps it is:
rmse = metrics.mean_squared_error(X_train, X_test, squared=False)
But with that I got
ValueError: Found input variables with inconsistent numbers of samples: [8880, 2220]
So how can I get the RMSE from my training data? Here is my full code:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)
knn = KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
                          metric_params=None, n_jobs=1, n_neighbors=5,
                          p=2, weights='uniform')
knn.fit(X, y)
y_pred = knn.predict(X_test)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = metrics.mean_squared_error(y_test, y_pred , squared=False)
print(rmse)
First of all, there's something wrong with your code: you are training your model on the whole dataset instead of on the training split you've already created. This makes your held-out test sample useless, because the model has already learnt from it.
You should change your fit like so:
knn.fit(X_train, y_train)
Then, to get the training RMSE, predict on your training data and compare those predictions with the training targets:
y_train_pred = knn.predict(X_train)
rmse = metrics.mean_squared_error(y_train, y_train_pred, squared=False)
Everything else can stay the same.
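Putting the pieces together, a minimal sketch of the full comparison (assuming X and y are defined as in your question):
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)  # fit on the training split only

# RMSE on the training data
rmse_train = metrics.mean_squared_error(y_train, knn.predict(X_train), squared=False)
# RMSE on the test data
rmse_test = metrics.mean_squared_error(y_test, knn.predict(X_test), squared=False)
print(rmse_train, rmse_test)
If the two numbers are close, the model generalises well; a training RMSE that is much lower than the test RMSE points to overfitting, as the quotes you found describe.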
The context of your question is not entirely clear. However, you should pass the predictions and the true target values to the metric, not the feature matrices of the train and test sets, and evaluate the RMSE with the same function you already used:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html
Pay attention to the squared parameter.
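For example, the two variants differ only in that parameter (assuming y_test and y_pred from your code):
from sklearn import metrics

mse = metrics.mean_squared_error(y_test, y_pred)                  # default squared=True: MSE
rmse = metrics.mean_squared_error(y_test, y_pred, squared=False)  # squared=False: RMSE
Note that newer scikit-learn releases replace the squared argument with a separate metrics.root_mean_squared_error function.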
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

#split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols] # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as noted by #gtomer. That means data points present in your training set also appear in your test set. You can run a K-fold cross-validation on your data and see how the accuracy holds up, as in the sketch below. Secondly, try different classifiers too (Random Forests usually perform better than a single Decision Tree).
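A minimal sketch of that check (assuming X and y are the scaled features and target from your code; RandomForestClassifier is the suggested swap, not part of your original):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
rf = RandomForestClassifier(random_state=1)

# accuracy on each of the 5 folds; a perfect score on every fold hints
# at either a very easy task or leakage between the folds
scores = cross_val_score(rf, X, y, scoring='accuracy', cv=cv)
print(scores, scores.mean())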
#REGRESSION ANALYSIS
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

#splitting the dataset into x and y variables
firm1=pd.DataFrame(firm, columns=['Sales', 'Advert', 'Empl', 'Prod'])
print(firm1)
x = firm1.drop(['Sales'], axis=1)
y = firm1['Sales']
print(x)
print(y)
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
#the LR model
M=linear_model.LinearRegression(fit_intercept=True)
M.fit(x_train, y_train)
y_pred=M.predict(x_test)
print(y_pred)
print('Coeff: ', M.coef_)
for i in M.coef_:
    print('{:.4f}'.format(i))
print('Intercept: ','{:.4f}'.format(M.intercept_))
print('MSE: ','{:.4f}'.format(mean_squared_error(y_test, y_pred)))
print('Coefficient of determination (r2): ','{:.4f}'.format(r2_score(y_test, y_pred)))
print(firm1.sample())
This is my linear regression model. Every time I run the code, I get a different set of coefficients for the x variables and a different intercept, so I cannot get a stable equation. Is that normal?
Coeff: [454.83981664 63.77031531 59.31844506]
454.8398
63.7703
59.3184
Intercept: -1073.5124
MSE: 434529.9361
Those are the values (coefficients, intercept and mean squared error). However, when I run it again, I get the different output shown below:
Coeff: [462.0304152 61.17909189 269.41075305]
462.0304
61.1791
269.4108
Intercept: -1462.2449
MSE: 4014768.0049
It is normal to get different coefficients here, but not because of the regression itself: scikit-learn's LinearRegression solves the least-squares problem deterministically, so on the same training data it always returns the same coefficients. The variation comes from your train_test_split call, which has no random_state, so every run trains on a different random 80% of the data. If you want the same coefficients on each run, you can set the seed before splitting with:
numpy.random.seed(42)
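Alternatively, and more explicitly, pass a fixed random_state to the split itself, which pins the split without touching the global seed:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
With the split fixed, LinearRegression returns identical coefficients on every run.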
I have split my data into 3 sets (train, test and validation) as shown below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
I wanted to ask where do I put the validation set in this code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))
I suggest you read up some more on why you would split your data into a train, test and validation set.
In the code you show, you can use the validation data in the same way you use your test data, but that doesn't really make full sense.
There is a lot to it; I think this can get you started. Link
In short and very simplified terms, the general idea is that you use the results on the test data to make adjustments to your model and improve its performance.
The validation data you use only at the very end, for your final model evaluation, to make sure the model actually performs well on unseen data.
(Worst case, if you only use two sets, you might adjust parameters until the model works well on those two datasets but still not on any other; see the sketch below.)
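A minimal sketch of that workflow, using your splits and tuning LogisticRegression's regularisation strength C (the parameter and its candidate values are just an illustration, not something from your question):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# tune on the test split...
best_C, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:
    candidate = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_test, candidate.predict(X_test))
    if acc > best_acc:
        best_C, best_acc = C, acc

# ...and keep the validation split for the very last check on unseen data
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print("Final accuracy on val:", accuracy_score(y_val, final_model.predict(X_val)))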
You can show the validation data's score and accuracy like this:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
print("Accuracy on val is:",accuracy_score(y_val, y_pred_val))
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score for Val: ",precision_score(y_val, y_pred_val, average='macro'))
print("Recall Score for Val: ",recall_score(y_val, y_pred_val, average='macro'))
print("F1-Score for Val :",f1_score(y_val, y_pred_val, average='macro'))
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(regressor, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)
np.mean(np.abs(scores))
regressor is the fitted model, X holds the independent features and y the dependent feature. Is the code right? Also, I'm confused: can RMSE be bigger than 100? From some regression models I'm getting values such as
rmse = 121
Is RMSE used to tell you how good your model is in general, or only how good it is compared to other models?
The RMSE value can be calculated using sklearn.metrics as follows:
import math
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(test, predictions)
rmse = math.sqrt(mse)
print('RMSE: %f' % rmse)
In terms of interpretation, you should compare the RMSE to the mean of your test data to judge the model's accuracy: RMSE is on the same scale as the target variable, so what counts as "small" depends entirely on that scale.
For instance, an RMSE of 5 against a mean of 100 is a good score, as the RMSE is quite small relative to the mean.
On the other hand, an RMSE of 5 against a mean of 2 would not be a good result: the error is large relative to the scale of the data.
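So yes, RMSE can be arbitrarily large: it lives on the scale of the target, not on a 0-100 scale. For a quick sanity check on your own numbers, you can relate the RMSE to the target's mean (a sketch, assuming y is the dependent feature from your cross-validation code):
import numpy as np

rmse = 121.0                     # the value you observed
relative_error = rmse / np.mean(y)
print(relative_error)            # small relative to 1 means the error is small on the target's scale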
If you want RMSE, why are you using mean absolute error for scoring? Change it to this:
scores = cross_val_score(regressor, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Since RMSE is the square root of the mean squared error, we then have to take the square root of each fold's (absolute) score before averaging:
np.mean(np.sqrt(np.abs(scores)))
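Alternatively, newer scikit-learn versions ship an RMSE scorer directly, which avoids the manual square root (a sketch, assuming the same cv and regressor as above):
scores = cross_val_score(regressor, X, y, scoring='neg_root_mean_squared_error',
                         cv=cv, n_jobs=-1)
rmse = np.mean(np.abs(scores))  # mean RMSE across folds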
I have a simple fitted model like this:
from sklearn import linear_model
from sklearn.metrics import accuracy_score

lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)
print(accuracy_score(y_test, predictions))
and using cross-validation I have this:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model, X = X_train, y = y_train, cv = 7)
From the cross-validation, how can I get a single accuracy number that is comparable to print(accuracy_score(y_test, predictions))? Is it accuracies.mean()?
print(accuracies) will give an array with the accuracy on each fold of the cross-validation,
print("Train set score :: {}".format(accuracies.mean())) will give the mean accuracy over the folds, and
print("Train set score :: {} +/- {}".format(accuracies.mean(), accuracies.std()*2)) will give you the mean accuracy together with twice its standard deviation (roughly a 95% interval).