I am trying to evaluate the relevance of features, and I am using DecisionTreeRegressor().
The relevant part of the code is shown below:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop(['Frozen'], axis = 1)
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# TODO: Set a random state.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Frozen'], test_size = 0.25, random_state = 1)
# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# TODO: Report the score of the prediction using the testing set
from sklearn.model_selection import cross_val_score
#score = cross_val_score(regressor, X_test, y_test)
score = regressor.score(X_test, y_test)
print(score)  # plain "print score" on Python 2.x
When I run the code, the following score is printed:
-0.649574327334
You can find the score function documentation and some explanation of it below:
Returns the coefficient of determination R^2 of the prediction.
...
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
I have not grasped the whole concept yet, so this explanation is not very helpful to me. For instance, I cannot understand why the score can be negative and what exactly it indicates (if something is squared, I would expect it can only be positive).
What does this score indicate and why can it be negative?
If you know of any article (for starters) it might be helpful as well!
R^2 can be negative by its definition (https://en.wikipedia.org/wiki/Coefficient_of_determination) if the model fits the data worse than a horizontal line (i.e. worse than always predicting the mean). Basically
R^2 = 1 - SS_res/SS_tot
where SS_res and SS_tot are always non-negative. If SS_res > SS_tot, you have a negative R^2. Look at this answer as well: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
Your code ends up using the score method of scikit-learn's DecisionTreeRegressor (cross_val_score, which is commented out, uses the same metric by default). You may take a look at the documentation of scikit-learn's DecisionTreeRegressor.
Basically, the score you see is R^2, i.e. 1 - u/v, where u is the residual sum of squares of your predictions and v is the total sum of squares (the sum of squared deviations from the sample mean).
u/v can be arbitrarily large when you make really bad predictions, while it can only be as small as zero, since u and v are both sums of squares (>= 0). Hence the score 1 - u/v is at most 1 but can be arbitrarily negative.
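As a quick illustration (a minimal sketch with made-up numbers, not taken from the question), you can reproduce a negative R^2 by predicting worse than the mean of the targets:
import numpy as np
from sklearn.metrics import r2_score
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([4.0, 3.0, 2.0, 1.0])  # deliberately bad predictions
# Manual computation: R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares = 20
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares = 5
print(1 - ss_res / ss_tot)       # -3.0, i.e. worse than predicting the mean
# Same result from scikit-learn
print(r2_score(y_true, y_pred))  # -3.0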
Related
I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. To use eval_metric, you need to provide data to evaluate on via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the AUC using evals_result(), which gives the AUC for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all splits in which the feature is used; see the help page. From your question, I suppose you want the model to maximize AUC, as in cross-validation, but you cannot use AUC as the training objective in xgboost: gradient boosting methods require a differentiable loss function.
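If the goal is to see which columns drive the model under that gain criterion, a minimal sketch (assuming the fitted mod classifier from above) could be:
# Gain-based importances from the fitted booster; keys are the feature names
# (f0, f1, ... when plain numpy arrays were used for training)
gain_importance = mod.get_booster().get_score(importance_type="gain")
print(sorted(gain_importance.items(), key=lambda kv: kv[1], reverse=True))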
With an imbalanced dataset, you can try adjusting the scale_pos_weight parameter to balance the weight of positive and negative examples. This is discussed on the xgboost website.
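And a hedged sketch of handling the imbalance and scoring with the metrics you actually care about (the negative-to-positive weight ratio below is a common rule of thumb, not something taken from your setup):
from sklearn.metrics import f1_score, roc_auc_score
from xgboost import XGBClassifier
# Common heuristic: ratio of negative to positive samples in the training set
ratio = float((y_train == 0).sum()) / (y_train == 1).sum()
mod = XGBClassifier(scale_pos_weight=ratio)
# note: newer xgboost versions may expect eval_metric in the constructor instead of fit()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
# Evaluate with roc_auc / f1 directly instead of relying on mod.score()
proba = mod.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
print("F1:", f1_score(y_test, mod.predict(X_test)))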
I have a dataframe with 36,540 rows. The objective is to predict y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist); however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a random train/test split will likely result in a test set with virtually no samples with a value above 0 (so the model effectively predicts a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I have trouble optimizing the threshold for binary classification. I am using 3 models: Logistic Regression, CatBoost, and sklearn's RandomForestClassifier.
For each model I am doing the following steps:
1) Fit the model.
2) Get 0.0 recall for the first class (which makes up 5% of the dataset) and 1.0 recall for the zero class. (This can't be fixed with grid search and the class_weight='balanced' parameter.) >:(
3) Find the optimal threshold:
fpr, tpr, thresholds = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
4) Enjoy ~0.70 recall for both classes.
5) Predict probabilities for the test dataset and use the optimal_threshold calculated above to get the classes.
Here comes the question: when I run the code again and again without fixing random_state, the optimal threshold varies and shifts quite dramatically. This leads to dramatic changes in the accuracy metrics on the test sample.
Do I need to calculate some average threshold and use it as a hard-coded constant? Or maybe I have to fix random_state everywhere? Or maybe the method of finding optimal_threshold isn't correct?
If you do not set random_state to a fixed value, results will be different in every run. To get reproducible results, set random_state to a fixed value everywhere it is required, or use a fixed numpy random seed via numpy.random.seed.
https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution
The scikit-learn FAQ mentions that it is better to set random_state where required instead of relying on the global random state.
Global Random State Example:
import numpy as np
np.random.seed(42)
Some examples locally setting random_state:
X_train, X_test, y_train, y_test = train_test_split(sample.data, sample.target, test_size=0.3, random_state=0)
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
classifierAlgorithm = LGBMClassifier(objective='binary', random_state=0)
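Tying this back to the threshold question, here is a hedged sketch (the toy make_classification data, the model choice, and the fixed seeds are my additions; the roc_curve-based threshold selection is taken from your step 3). With random_state pinned in the split and in the model, the selected threshold is reproducible across runs:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
# Toy imbalanced data just for illustration (~5% positives, as in the question)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
model = RandomForestClassifier(class_weight="balanced", random_state=0)
model.fit(X_train, y_train)
# Same threshold selection as in the question; reproducible because every
# source of randomness above has a fixed random_state
fpr, tpr, thresholds = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
y_pred = (model.predict_proba(X_test)[:, 1] >= optimal_threshold).astype(int)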
I have fit a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics module from sklearn as follows:
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8. I am trying to understand whether my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable when you take the square root of sklearn's mean_squared_error.
I think you are missing a closing parentheses though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets that sample's contribution to 1,000,000.
You may want to transform the price to log scale with the natural logarithm (numpy.log); this is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), see the available kernels for guidance. With this approach you will not get such big values.
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem.
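A hedged sketch of the log-target idea on your setup (it reuses the X_train/X_test split from your question; np.log1p/np.expm1 instead of plain np.log is my choice to stay safe around zero, and I leave out normalize=True because that argument has been removed from recent scikit-learn releases):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Fit on the log-transformed target
ridge_log = Ridge(alpha=0.1)
ridge_log.fit(X_train, np.log1p(y_train))
# Predict in log space, then transform back to the original price scale
pred = np.expm1(ridge_log.predict(X_test))
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("MAE:", mean_absolute_error(y_test, pred))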
I would like to know the difference between the score returned by GridSearchCV and the R2 metric calculated as below. In other cases the grid search score is highly negative (the same applies to cross_val_score) and I would be grateful for an explanation of what it means.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import (cross_val_score, GridSearchCV)
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn import tree
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X = pd.DataFrame(X)
parameters = {'splitter':('best','random'),
'max_depth':np.arange(1,10),
'min_samples_split':np.arange(2,10),
'min_samples_leaf':np.arange(1,5)}
regressor = GridSearchCV(DecisionTreeRegressor(), parameters, scoring = 'r2', cv = 5)
regressor.fit(X, y)
print('Best score: ', regressor.best_score_)
best = regressor.best_estimator_
print('R2: ', r2_score(y_pred = best.predict(X), y_true = y))
The regressor.best_score_ is the average of r2 scores on left-out test folds for the best parameter combination.
In your example cv=5, so the data will be split into train and test folds 5 times. The model is fitted on the train folds and scored on the test fold. These 5 test scores are averaged to get the final score. Please see the documentation:
"best_score_: Mean cross-validated score of the best_estimator"
The above process repeats for all parameter combinations. And the best average score from it is assigned to the best_score_.
You can look at my other answer for a complete walkthrough of GridSearchCV.
After finding the best parameters, the model is refit on the full data.
r2_score(y_pred = best.predict(X), y_true = y)
is computed on the same data the model was trained on, so in most cases it will be higher.
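If you want the two numbers to be comparable, a hedged sketch (reusing X, y, and parameters from your code; the split size and random_state are arbitrary choices of mine) is to hold out a test set that the grid search never sees:
from sklearn.model_selection import train_test_split
# Hold out data the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
regressor = GridSearchCV(DecisionTreeRegressor(), parameters, scoring='r2', cv=5)
regressor.fit(X_train, y_train)
print('Best CV score:', regressor.best_score_)  # mean r2 over the validation folds
print('Held-out r2:', regressor.best_estimator_.score(X_test, y_test))  # r2 on unseen data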
The question linked by @Davide in the comments has answers explaining why you get a positive R2 score: your model performs better than a constant prediction. At the same time, you can get negative values in other situations, if your models there perform badly.
The reason for the difference in values is that regressor.best_score_ is averaged over the held-out validation folds of the 5-fold split that you do, whereas r2_score(y_pred = best.predict(X), y_true = y) evaluates the same model (regressor.best_estimator_, refit on the full data) on the full sample, i.e. on the very data that was used to train it.