I have fit a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
import numpy as np
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large MSE of 554084039.54321 and an RMSE of 21821.8, and I am trying to understand whether my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easily verifiable when you take the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not able to capture the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price sky-rockets the contribution of that one sample to 1,000,000.
You may want to transform the price with the natural logarithm (numpy.log) and fit in log scale; this is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques), see the available kernels for guidance. With this approach, you will not get such big values.
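A minimal sketch of that approach, reusing X_train etc. from your code (np.log1p/np.expm1 are used so the transform is safe around zero and invertible; I dropped normalize=True since it was deprecated in later sklearn versions):
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Fit on the log-transformed target instead of the raw price
ridge_log = Ridge(alpha=0.1)
ridge_log.fit(X_train, np.log1p(y_train))

# RMSE in log space stays at a sane magnitude
log_pred = ridge_log.predict(X_test)
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_test), log_pred))

# Back-transform when you need predictions in the original price scale
pred_price = np.expm1(log_pred)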
Last but not least, check the Mean Absolute Error to see that your predictions are not as bad as they seem.
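For example, reusing y_test and pred from your code:
from sklearn.metrics import mean_absolute_error

# MAE is in the same units as the target, so it is easier to interpret
mae = mean_absolute_error(y_test, pred)
print("MAE:", mae)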
Related
I'm working on a classification project where I try out various types of models, like logistic regression and decision trees, to see which model can most accurately predict whether a patient is at risk for heart disease (given an existing data set of over 3600 rows).
I'm currently working on my decision tree, and have plotted ROC curves to find the optimized values for tuning the max_depth and min_samples_split hyperparameters. However, when I try to create my new model I get the warning:
"UndefinedMetricWarning: Precision is ill-defined and being set to 0.0
due to no predicted samples. Use zero_division parameter to control
this behavior."
I have already googled the warning, and partly understand why it's happening, but not how to fix it. I don't want to just suppress the warning or ignore the values that weren't predicted; I want to actually fix the issue. From my understanding, it has something to do with how I processed my data, but I'm not sure where I went wrong.
I started off with doing a train-test split, then used StandardScaler like so:
#Let's split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop("TenYearCHD", axis = 1)
y = df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
#Let's scale our data
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
I then created my initial decision tree, and received no warnings:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = "entropy")
#Fit our model and predict
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
After looking at my ROC curve and AUC scores, I attempted to create another, more optimized decision tree, which is where I then received my warning:
dtc3 = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=0.25)
dtc3.fit(X_train, y_train)
dtc3_pred = dtc3.predict(X_test)
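For what it's worth, here's a quick check (a sketch using the variables above) showing whether the tuned tree ever predicts the positive class, which is what the warning is about:
import numpy as np
from sklearn.metrics import precision_score

# See which classes the tuned tree actually predicts; the warning fires
# when the positive class never appears in the predictions
print(np.unique(dtc3_pred, return_counts=True))

# zero_division only controls how the undefined case is reported;
# it does not fix the underlying issue
print(precision_score(y_test, dtc3_pred, zero_division=0))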
Essentially I'm at a loss as to what to do. Should I use a different method, like StratifiedKFold, in addition to the train-test split to process my data? Should I do something else entirely? Any help would be greatly appreciated.
I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. To use eval_metric, you need to provide data to evaluate on using eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the AUC using evals_result(), which gives the AUC for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all splits the feature is used in; see the help page. From your question, I suppose you need the model to maximize AUC, like in cross-validation, but you cannot use AUC as an objective in xgboost: gradient boosting methods require a differentiable loss function.
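Since the question is also about feature_importances_, note that you can pull gain-based scores directly from the underlying booster (a sketch, using the mod fitted above):
# Gain-based importance from the fitted booster; keys are f0, f1, ...
# because X here is a plain numpy array without column names
gain_importance = mod.get_booster().get_score(importance_type='gain')
print(gain_importance)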
With an imbalanced dataset, you can try adjusting the parameter scale_pos_weight to control the balance of positive and negative weights. This is discussed on the xgboost website.
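The heuristic suggested there is sum(negative instances) / sum(positive instances); a sketch using the training split above:
import numpy as np

# Ratio of negative to positive samples in the training set
ratio = float(np.sum(y_train == 0)) / np.sum(y_train == 1)

mod = XGBClassifier(scale_pos_weight=ratio)
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")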
I am new to machine learning and xgboost, and I am solving a regression problem.
My target values are very small (e.g. -1.23e-12).
I am using linear regression and the xgboost regressor, but xgboost always predicts the same values, like:
[1.32620335e-05 1.32620335e-05 ... 1.32620335e-05].
I tried to tune some parameters of XGBRegressor, but it still predicted the same values.
I've seen Scaling of target causes Scikit-learn SVM regression to break down, so I tried to scale my target values (data.target = data.target * (10**12)), and it fixed the problem. But I am not sure it is reasonable to scale my target values, and I don't know whether this problem in xgboost is the same as in SVR.
Here is target value of my data:
count 2.800010e+05
mean -1.722068e-12
std 6.219815e-13
min -4.970697e-12
25% -1.965893e-12
50% -1.490800e-12
75% -1.269998e-12
max -1.111604e-12
And part of my code:
import xgboost
from sklearn import linear_model
from sklearn.model_selection import train_test_split

X = df[feature].values
y = df[target].values * (10**12)
X_train, X_test, y_train, y_test = train_test_split(X, y)
xgb = xgboost.XGBRegressor()
LR = linear_model.LinearRegression()
xgb.fit(X_train,y_train)
LR.fit(X_train,y_train)
xgb_predicted = xgb.predict(X_test)
LR_predicted = LR.predict(X_test)
print('xgb predicted:',xgb_predicted[0:5])
print('LR predicted:',LR_predicted[0:5])
print('ground truth:',y_test[0:5])
Output:
xgb predicted: [-1.5407631 -1.49756 -1.9647646 -2.7702322 -2.5296502]
LR predicted: [-1.60908805 -1.51145989 -1.71565321 -2.25043287 -1.65725868]
ground truth: [-1.6572993 -1.59879922 -2.39709641 -2.26119817 -2.01300088]
And the output with y = df[target].values (i.e., with the target values not scaled):
xgb predicted: [1.32620335e-05 1.32620335e-05 1.32620335e-05 1.32620335e-05
1.32620335e-05]
LR predicted: [-1.60908805e-12 -1.51145989e-12 -1.71565321e-12 -2.25043287e-12
-1.65725868e-12]
ground truth: [-1.65729930e-12 -1.59879922e-12 -2.39709641e-12 -2.26119817e-12
-2.01300088e-12]
Let's try something simpler. I suspect that if you tried to fit a DecisionTreeRegressor (sklearn) to your problem (without scaling), you would likely see similar behavior.
Also, most likely, the nodes in your (xgboost) trees are not getting split at all; check by running xgb.get_booster().get_dump()
Now, try this: run multiple experiments, scaling your y so that it is of the order 1e-1, then 1e-2, and so on. You will see that the decision tree stops splitting below some order of magnitude, as in the sketch below. I believe this is linked to a minimum impurity threshold; for example, the sklearn decision tree's value is here: https://github.com/scikit-learn/scikit-learn/blob/ed5e127b/sklearn/tree/tree.py#L285 (around 1e-7)
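A sketch of that experiment, reusing df[feature] and df[target] from your code (y_raw is a name I introduce for the unscaled ~1e-12 targets):
import numpy as np
from sklearn.tree import DecisionTreeRegressor

y_raw = df[target].values   # unscaled targets, on the order of 1e-12
X = df[feature].values

for exponent in range(13):
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X, y_raw * 10.0 ** exponent)
    # a node_count of 1 means the tree is a single leaf: no split was made
    print("scale=1e%02d  nodes=%d" % (exponent, tree.tree_.node_count))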
This is my best guess at the moment. If someone can add to or verify this then I'll be happy to learn :)
NOTE: I appreciate the massive quantity of comments suggesting that this is an inappropriate way to quantify model performance. However, this is irrelevant to my error, and the error occurs for a variety of other metrics. Also, see here for the appropriate way to respond when you think the OP is "asking the wrong question"
I have an sklearn logistic model for which I am attempting to get the RMSE. However, when I call .predict_proba, I get a matrix of probabilities, while my y_test is in its categorical form, which sklearn.linear_model.LogisticRegression just sort of dealt with automagically.
How do I reconcile these two things to get the RMSE?
>>> sklearn.metrics.mean_squared_error(y_test, pred_proba, sample_weight=weights_test)
ValueError: y_true and y_pred have different number of output (1!=13)
predict_proba predicts the probability that a sample belongs to each class. The argmax of those probabilities is the predicted class (the categorical form). RMSE is not a metric for classification. If you want to evaluate your model, consider a different metric like accuracy_score:
from sklearn.metrics import accuracy_score
predictions = your_model.predict(X_test)
print("Accuracy: %.3f" % accuracy_score(y_test, predictions))
The Brier score, basically the mean squared error of the predicted probabilities, is a known and valid loss function for classification models that produce probability scores; I would take a look at that as well.
For your particular issue, you want to compare the probabilities returned for your target class, i.e. for a binary class problem:
from sklearn.metrics import brier_score_loss
probs = your_model.predict_proba(X_test)
brier_score_loss(y_test, probs[:, 1])
I'm not sure the Brier score is formally defined for multiclass problems. I would point to the idea of mean misclassification error, which averages the error across classes.
To leverage this within the sklearn API, one-hot encode your y_true (i.e. each class gets its own column) and call
sklearn.metrics.mean_squared_error(y_true, probs, multioutput='uniform_average')
Here is how you can calculate RMSE:
import numpy as np
from sklearn.metrics import mean_squared_error
x = np.arange(10)
y = x
rmse = np.sqrt(mean_squared_error(x, y))
One can transform the y_test into a format compatible with the predict_proba output as follows:
import sklearn.linear_model
import sklearn.preprocessing

model = sklearn.linear_model.LogisticRegression().fit(X, y) # or whatever model
label_encoder = sklearn.preprocessing.LabelEncoder()
label_encoder.classes_ = model.classes_
y_test_onehot = sklearn.preprocessing.OneHotEncoder().fit_transform(label_encoder.transform(y_test).reshape((-1,1)))
You can now apply any of the metrics in sklearn.metrics. This is essential for computing, say, the Brier score.
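For instance, a sketch of the RMSE computation the question asked for, using the variables above (the encoder output is sparse, hence .toarray()):
import numpy as np
from sklearn.metrics import mean_squared_error

pred_proba = model.predict_proba(X_test)
rmse = np.sqrt(mean_squared_error(y_test_onehot.toarray(), pred_proba))
print("RMSE:", rmse)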
I am trying to evaluate the relevance of features, and I am using DecisionTreeRegressor()
The relevant part of the code is presented below:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop(['Frozen'], axis = 1)
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# TODO: Set a random state.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Frozen'], test_size = 0.25, random_state = 1)
# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# TODO: Report the score of the prediction using the testing set
from sklearn.model_selection import cross_val_score
#score = cross_val_score(regressor, X_test, y_test)
score = regressor.score(X_test, y_test)
print score # python 2.x
When I run the print function, it returns the given score:
-0.649574327334
You can find the score function implementation and some explanation here and below:
Returns the coefficient of determination R^2 of the prediction.
...
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
I could not grasp the whole concept yet, so this explanation is not very helpful for me. For instance, I could not understand why the score could be negative and what exactly it indicates (if something is squared, I would expect it could only be positive).
What does this score indicate, and why can it be negative?
If you know any article (for starters) it might be helpful as well!
R^2 can be negative from its definition (https://en.wikipedia.org/wiki/Coefficient_of_determination) if the model fits the data worse than a horizontal line, i.e. worse than always predicting the mean. Basically
R^2 = 1 - SS_res/SS_tot
and SS_res and SS_tot are always non-negative. If SS_res > SS_tot, you have a negative R^2. Look at this answer as well: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
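A tiny demonstration (a sketch): a constant prediction far from the mean gives SS_res > SS_tot and hence a negative score:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([10.0, 10.0, 10.0, 10.0])  # far worse than predicting the mean (2.5)

print(r2_score(y_true, y_pred))  # negative, because SS_res > SS_tot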
The article executes cross_val_score, in which DecisionTreeRegressor is implemented. You may take a look at the documentation of scikit-learn's DecisionTreeRegressor.
Basically, the score you see is R^2, or (1 - u/v). u is the residual sum of squares of your prediction, and v is the total sum of squares (the sum of squared deviations from the sample mean).
u/v can be arbitrarily large when you make really bad predictions, while it can only be as small as zero, given that u and v are both sums of squares (>= 0).