import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(regressor, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)
np.mean(np.abs(scores))
regressor is the fitted model, X holds the independent features, and y is the dependent variable. Is the code right? Also, I'm confused: can RMSE be bigger than 100? I'm getting values such as 121 from some regression models. Is RMSE used to tell you how good your model is in general, or only how good it is compared to other models?
The RMSE value can be calculated using sklearn.metrics as follows:
import math
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(test, predictions)
rmse = math.sqrt(mse)
print('RMSE: %f' % rmse)
In terms of interpretation, you need to compare the RMSE to the mean of your test data to judge how accurate the model is. Standard errors are a measure of how close the mean of a given sample is likely to be to the true population mean.
For instance, an RMSE of 5 against a mean of 100 is a good score, as the RMSE is quite small relative to the mean.
On the other hand, an RMSE of 5 against a mean of 2 would not be a good result - the typical error is large relative to the mean of the test data.
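As a small illustration of that comparison (y_test and predictions here are just hypothetical placeholder arrays, not the asker's data):
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical test targets and predictions
y_test = np.array([100.0, 95.0, 110.0, 105.0])
predictions = np.array([102.0, 93.0, 108.0, 109.0])

rmse = np.sqrt(mean_squared_error(y_test, predictions))
# the smaller this ratio, the smaller the typical error relative to the scale of y
print('RMSE: %.2f, RMSE relative to mean(y): %.1f%%' % (rmse, 100 * rmse / y_test.mean()))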
If you want RMSE, why are you using mean absolute error for scoring? Change it to this:
scores = cross_val_score(regressor, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Since RMSE is the square root of the mean squared error, we have to do this:
np.mean(np.sqrt(np.abs(scores)))
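Putting the pieces together, a minimal end-to-end sketch (make_regression and LinearRegression here just stand in for the asker's data and model):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the asker's X, y and regressor
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
regressor = LinearRegression()

cv = KFold(n_splits=10, random_state=1, shuffle=True)
# scores are negative MSE values, one per fold
scores = cross_val_score(regressor, X, y,
                         scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
# take the square root per fold, then average the per-fold RMSEs
rmse = np.mean(np.sqrt(np.abs(scores)))
print('Cross-validated RMSE: %.3f' % rmse)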
Related
I am testing RandomForestClassifier on a simple dataset from sklearn. When I split the data with train_test_split, I get accuracy = 0.89. If I use cross-validation with cross_val_score with the same classifier parameters, the accuracy is smaller - about 0.83. Why?
Here is the code:
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, make_scorer
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_circles

np.random.seed(42)

# create dataset:
x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

# initialize stratified split:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# create classifier:
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)

# average accuracy on cross-validation:
results = np.mean(cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score)))
print("ACCURACY WITH CV = ", results)  # prints 0.832

# use train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)
clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                             oob_score=True, n_estimators=100, min_samples_leaf=10)
clf.fit(xtrain, ytrain)
ypred = clf.predict(xtest)
print("ACCURACY WITHOUT CV = ", accuracy_score(ytest, ypred))  # prints 0.89
What I got:
ACCURACY WITH CV = 0.83
ACCURACY WITHOUT CV = 0.89
Cross validation is used to run multiple experiments on different splits of data and then average their results. This is to ensure that the result of the experiment is not biased by one split, as it is in your case.
Your chosen seed, along with some luck, gave you a train/test split that has higher accuracy than the average. The higher accuracy is an artifact of random sampling when making a split, not an indicator of better model performance.
Simply put:
Cross-validation makes multiple splits of the data. Your model is trained on each of these different splits and then the performance is averaged.
If you pick one of these splits, you may get lucky and there might be good overlap between the data points in your test and train set. Your model will have high accuracy in this case.
Or you may get unlucky and there might not be a high overlap between the data points in the test and train set. Your model will have a lower accuracy in this case.
Thus, cross validation is used to average the results of various such splits (5 in your case).
Here is your code run in a Google Colab notebook:
https://colab.research.google.com/drive/16-NotF-_WVLESmvGMONSGSZigxrT3KLx?usp=sharing
The last cell makes 5 different splits and then averages their accuracies. Notice how that is the same as the one you got from cross validation. Also notice how some splits have higher and some splits have a lower accuracy.
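For reference, a minimal sketch of what that last cell does, assuming the same data and classifier as in the question (the specific seeds are my own choice):
import numpy as np
from sklearn.datasets import make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, y = make_circles(n_samples=500, factor=0.1, noise=0.35, random_state=42)

accuracies = []
for seed in range(5):  # five different train/test splits
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2, random_state=seed)
    clf = RandomForestClassifier(random_state=42, max_depth=12, n_jobs=-1,
                                 oob_score=True, n_estimators=100,
                                 min_samples_leaf=10)
    clf.fit(xtr, ytr)
    accuracies.append(accuracy_score(yte, clf.predict(xte)))

print("per-split accuracies:", np.round(accuracies, 3))
print("average accuracy:", np.mean(accuracies))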
To further convince yourself, look at the output of:
cross_val_score(clf, x, y, cv=skf, scoring=make_scorer(accuracy_score))
The output is a list of scores (accuracies in your case) for the 5 different splits. You can see that they vary around 0.83.
This is just down to chance for the split and the random state of the Random Forest Classifier. Try leaving random_state=42 out, fit several times, and you'll get a range of different accuracies. By chance, I had one run without CV of "just" 0.78! In contrast, CV will give you an average (your calculated mean) PLUS an idea of how much your accuracy could vary around it.
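A minimal sketch of that idea, assuming x and y from the question are still in scope; the number of repeats is arbitrary:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

# no random_state anywhere on the single splits, so every run differs
single = []
for _ in range(10):
    xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.2)
    clf = RandomForestClassifier(max_depth=12, n_estimators=100,
                                 min_samples_leaf=10, n_jobs=-1).fit(xtr, ytr)
    single.append(accuracy_score(yte, clf.predict(xte)))
print("single-split accuracies:", np.round(single, 3))

# cross-validation reports the average AND the spread in one go
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(RandomForestClassifier(max_depth=12, n_estimators=100,
                                                   min_samples_leaf=10, n_jobs=-1),
                            x, y, cv=skf)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))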
I'm trying to get the "same" metrics using an RFECV and a cross_val_score method. The second method comes in because it's really important for me to get metrics with their standard deviation (uncertainties are cool).
This is the regression model:
from sklearn.linear_model import Lasso

regression = Lasso(alpha=0.1,
                   selection="random",
                   max_iter=10000,
                   random_state=42)
The RFECV method:
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, cross_val_score

min_number_features = df.shape[0] // 10
rfecv = RFECV(estimator=regression,
              step=1,
              min_features_to_select=min_number_features,
              cv=KFold(n_splits=10,
                       shuffle=True,
                       random_state=42),
              scoring='neg_mean_squared_error')
rfecv.fit(X_train, target_train)
score = rfecv.score(X_train, target_train)
On average, it gives an RMSE of 0.84. The cross_val_score method is the following:
metrics_cross_val_score = [
    "neg_root_mean_squared_error",
    "neg_mean_squared_error",
    "r2",
    "explained_variance",
    "neg_mean_absolute_error",
    "max_error",
    "neg_median_absolute_error"
]

mean = target_train.mean()  # normalization constant, see below
metrics = {}
for m in metrics_cross_val_score:
    score = cross_val_score(regression,
                            X_train,
                            target_train,
                            cv=KFold(n_splits=10,
                                     shuffle=True,
                                     random_state=42),
                            scoring=m)
    score = [-score.mean() / mean, score.std() / mean]
    metrics[m] = round(score[0], 2)
    dev = "std_" + m
    metrics[dev] = round(score[1], 2)
For the second method, I normalize every metric by the mean (in an attempt to get a 0-to-1 score). The results tend not to be exactly the same as with the first method (although the RFECV RMSE is within the interval of the cross_val_score RMSE +/- its standard deviation, which is quite big and not good).
So, here comes the questions:
I have read about many ways of normalizing the RMSE (by the mean, by y_max - y_min, by quantiles...) and I don't yet know the best approach for my data. Does anyone have a recommendation for that?
RFECV works with the selected features, while cross_val_score works with all the features. If cross_val_score is run on the very same columns that RFECV selects (as sketched below), the cross_val_score RMSE degrades dramatically, and that really puzzles me.
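A minimal sketch of that check, assuming rfecv, regression, X_train and target_train from above are in scope (rfecv.transform keeps only the selected columns):
from sklearn.model_selection import KFold, cross_val_score

# keep only the columns RFECV selected, then score on that reduced matrix
X_selected = rfecv.transform(X_train)

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(regression, X_selected, target_train,
                         scoring="neg_root_mean_squared_error", cv=cv)
print("RMSE on selected features: %.3f +/- %.3f" % (-scores.mean(), scores.std()))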
Here is a comparison between RFECV RMSE (alg_score), and cross_val_score metrics with standard deviation (everything else).
Hope I made myself understood.
If you feel curious, here is the dashboard with everything related to that:
https://datastudio.google.com/s/gUKsAyZfI5I
I have a dataset with 1025 rows and 14 columns. First I separate the label from the features:
x = dataset.drop('label', axis=1)
y = dataset['label']
The label values are only either 1 or 0. Then I split the data using:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.30)
I then make my Classifier:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
Then whenever I plot my decision tree, it ends up too big:
from sklearn import tree
tree.plot_tree(classifier.fit(X_train, y_train))
The resulting tree has 8 levels and gets too big. I thought this was okay, but then I observed the confusion matrix and classification report:
from sklearn.metrics import classification_report, confusion_matrix
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
It results to:
[[155 3]
[ 3 147]]
precision recall f1-score support
0 0.98 0.98 0.98 158
1 0.98 0.98 0.98 150
accuracy 0.98 308
macro avg 0.98 0.98 0.98 308
weighted avg 0.98 0.98 0.98 308
The high accuracy makes me doubt my solution. What is wrong with my code and how can I tone down the decision tree and accuracy score?
It looks like what you need to do is check to make sure your tree is not overfitting. There are two primary ways we can accomplish this using Decision Trees and sklearn.
Validation Curves
First, check whether your tree is overfitting. You can do so using a validation curve (see here).
An example of a validation curve is below:
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge

np.random.seed(0)
X, y = load_iris(return_X_y=True)
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X, y = X[indices], y[indices]

train_scores, valid_scores = validation_curve(Ridge(), X, y,
                                              param_name="alpha",
                                              param_range=np.logspace(-7, 3, 3),
                                              cv=5)
train_scores
valid_scores
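The Ridge/alpha example above is generic; a sketch of the same idea applied to the asker's tree (assuming X_train and y_train from the question), sweeping max_depth, might look like this:
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# where the validation score stops improving while the training score
# keeps climbing, the tree is overfitting
depths = np.arange(1, 15)
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_train, y_train,
    param_name="max_depth", param_range=depths, cv=5)

for d, tr, va in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print("max_depth=%2d  train=%.3f  valid=%.3f" % (d, tr, va))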
Once you verify that your tree is overfitting, you need to do a thing called pruning, which you can accomplish using hyperparameter optimization, as mentioned by e-zeytinci. You can do that with GridSearchCV.
GridSearchCV
GridSearchCV allows us to optimize the hyperparameters of a decision tree, or any model, by searching over things like maximum depth and maximum nodes (which seem to be the OP's concerns), and it also helps us accomplish proper pruning.
An example of that implementation can be read here
An example set of working code taken from this post is below:
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

def dtree_grid_search(X, y, nfolds):
    # create a dictionary of all values we want to test
    param_grid = {'criterion': ['gini', 'entropy'], 'max_depth': np.arange(3, 15)}
    # decision tree model
    dtree_model = DecisionTreeClassifier()
    # use grid search to test all values
    dtree_gscv = GridSearchCV(dtree_model, param_grid, cv=nfolds)
    # fit model to data
    dtree_gscv.fit(X, y)
    return dtree_gscv.best_params_
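A possible way to use this helper with the asker's training split (variable names as in the question):
best_params = dtree_grid_search(X_train, y_train, nfolds=5)
print(best_params)  # e.g. {'criterion': 'gini', 'max_depth': 4}

# refit a pruned tree with the winning settings
classifier = DecisionTreeClassifier(**best_params)
classifier.fit(X_train, y_train)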
Random Forests
Alternatively, Random Forests can help with Decision Tree overfitting.
You could implement a RandomForestClassifier and follow the same hyperparameter tuning outlined above.
An example from this post is below:
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=3,
                           n_redundant=0,
                           n_repeated=0,
                           n_classes=2,
                           random_state=0,
                           shuffle=False)

rfc = RandomForestClassifier(n_jobs=-1, max_features='sqrt', n_estimators=50, oob_score=True)

param_grid = {
    'n_estimators': [200, 700],
    'max_features': ['sqrt', 'log2']
}

CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
CV_rfc.fit(X, y)
print(CV_rfc.best_params_)
You can validate the score of your decision tree if you also look at your train score alongside your test score (the test score you already have):
print(confusion_matrix(y_train, classifier.predict(X_train)))
print(classification_report(y_train, classifier.predict(X_train)))
If you get similar results for both, your tree is fitting well in terms of accuracy (precision). You can also check this for over- and underfitting.
On the concept of over- and underfitting: in the usual learning-curve plot, the blue curve is the error on the training data, whereas the red curve is the test error. When the blue (training) error keeps going down while the red (test) error is stuck, that is overfitting - the training data influences the model too much.
But the error on your test data is already low, which is an indication that:
A function that is overfitted is likely to request more information about each item in the validation dataset than does the optimal function; gathering this additional unneeded data can be expensive or error-prone, especially if each individual piece of information must be gathered by human observation and manual data-entry.
Always remind yourself that you only have 14 features available. You can find the full list of parameters here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
If you get such an accurate result on balanced data, I would ask myself whether there is a feature (column) that directly influences the target variable. The keyword is data leakage: a feature that is only there because of your target variable and that you would not have in advance in a real test. One hint to get an idea would be the feature importances: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
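As a rough illustration of that hint, assuming X_train and y_train from the question, one could fit a quick forest and look for a single feature that dominates the importances:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# a single feature towering over the rest is a strong hint of leakage
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
importances = pd.Series(forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))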
If you still have the feeling your tree is too deep, you can adjust the maximum depth with:
classifier = DecisionTreeClassifier(max_depth= 4)
I have fitted a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8. I am trying to understand whether my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable when you take the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets that single term to 1,000,000.
You may want to transform the price with the natural logarithm (numpy.log) and work on the log scale; this is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), so see the available kernels for guidance. With this approach you will not get such big values.
Last but not least, check the Mean Absolute Error in order to see that your predictions are not as terrible as they seem.
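A minimal sketch combining both suggestions, reusing the question's train/test split; note this uses np.log1p/np.expm1 rather than plain np.log, and drops the deprecated normalize=True argument:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error

# fit on the log of the price, then undo the transform before scoring
# so the errors are back on the original price scale
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, np.log1p(y_train))
pred = np.expm1(ridge.predict(X_test))

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print('RMSE: %.1f  MAE: %.1f' % (rmse, mae))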
I've been running the implementation of the 'Mean Decrease Accuracy' measure that is shown on this website:
In the example the author uses the random forest regressor RandomForestRegressor, but I am using the random forest classifier RandomForestClassifier. Thus, my question is whether I should also use r2_score for measuring accuracy, or whether I should switch to classic accuracy (accuracy_score) or the Matthews correlation coefficient (matthews_corrcoef).
Does anybody here know if I should switch or not, and why?
Thanks for any help!
Here is the code from the website in case you are too lazy to click :)
import numpy as np
from collections import defaultdict
from sklearn.model_selection import ShuffleSplit  # sklearn.cross_validation no longer exists
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# `boston` is the Boston housing Bunch loaded earlier in the blog post;
# `names` are its feature names.
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r = rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True))
r2_score is for regression (continuous response variable), whereas classic classification (discrete categorical variable) metrics such as accuracy_score, f1_score, or roc_auc (the last two are most appropriate if you have unbalanced y-labels) are the right choices for your task.
Randomly shuffling each feature in the input data matrix and measuring the decline in these classification metrics sounds like a valid approach to ranking feature importances.
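As a rough sketch of what that adaptation could look like for a classifier (using a bundled classification dataset as a stand-in, since the original post uses the Boston housing data):
import numpy as np
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit

# same loop as the blog's, but scoring with accuracy instead of R^2
data = load_breast_cancer()
X, Y, names = data["data"], data["target"], data["feature_names"]
rf = RandomForestClassifier(n_estimators=100, random_state=0)

scores = defaultdict(list)
for train_idx, test_idx in ShuffleSplit(n_splits=30, test_size=0.3, random_state=0).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = accuracy_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = accuracy_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))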