xgboost regression predict same value - python

I am new to machine learning and xgboost, and I am solving a regression problem.
My target values are very small (e.g. -1.23e-12).
I am using linear regression and xgboost regressor,
but xgboost always predicts the same values, like:
[1.32620335e-05 1.32620335e-05 ... 1.32620335e-05].
I tried tuning some parameters of xgboost.XGBRegressor, but it still predicted the same value.
I've seen the question "Scaling of target causes Scikit-learn SVM regression to break down", so I tried scaling my target values (data.target = data.target*(10**12)), and that fixed the problem. But I am not sure whether it is reasonable to scale my target values, and I don't know whether this problem in xgboost is the same as the one in SVR.
Here is a summary of the target values in my data:
count 2.800010e+05
mean -1.722068e-12
std 6.219815e-13
min -4.970697e-12
25% -1.965893e-12
50% -1.490800e-12
75% -1.269998e-12
max -1.111604e-12
And part of my code:
X = df[feature].values
y = df[target].values *(10**(12))
X_train, X_test, y_train, y_test = train_test_split(X, y)
xgb = xgboost.XGBRegressor()
LR = linear_model.LinearRegression()
xgb.fit(X_train,y_train)
LR.fit(X_train,y_train)
xgb_predicted = xgb.predict(X_test)
LR_predicted = LR.predict(X_test)
print('xgb predicted:',xgb_predicted[0:5])
print('LR predicted:',LR_predicted[0:5])
print('ground truth:',y_test[0:5])
Output:
xgb predicted: [-1.5407631 -1.49756 -1.9647646 -2.7702322 -2.5296502]
LR predicted: [-1.60908805 -1.51145989 -1.71565321 -2.25043287 -1.65725868]
ground truth: [-1.6572993 -1.59879922 -2.39709641 -2.26119817 -2.01300088]
And the output with y = df[target].values (i.e., without scaling the target values):
xgb predicted: [1.32620335e-05 1.32620335e-05 1.32620335e-05 1.32620335e-05
1.32620335e-05]
LR predicted: [-1.60908805e-12 -1.51145989e-12 -1.71565321e-12 -2.25043287e-12
-1.65725868e-12]
ground truth: [-1.65729930e-12 -1.59879922e-12 -2.39709641e-12 -2.26119817e-12
-2.01300088e-12]

Let's try something simpler. I suspect that if you fit a DecisionTreeRegressor (sklearn) to your problem without scaling, you will see similar behavior.
Also, most likely the nodes in your xgboost trees are not being split at all; you can check by inspecting the output of xgb.get_booster().get_dump().
Now try this: run multiple experiments, first scaling y so that each value is of order 1e-1, then of order 1e-2, and so on. You should see that the decision tree stops splitting at some order of magnitude. I believe this is linked to a minimum impurity threshold; for example, the sklearn decision tree value is here: https://github.com/scikit-learn/scikit-learn/blob/ed5e127b/sklearn/tree/tree.py#L285 (around 1e-7).
This is my best guess at the moment. If someone can add to or verify this, I'll be happy to learn :)
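As a rough illustration of why the magnitude of y could matter: with a variance-based (squared-error) split criterion, the impurity of a node is the variance of its targets, and variance scales with the square of the target's scale. The sketch below uses hypothetical targets of the same magnitude as in the question, not the real data. It only shows how small the numbers get; whether a particular library stops splitting at that magnitude is the guess above, not something this snippet proves.

```python
import statistics

# Hypothetical targets with the same order of magnitude as in the question.
y = [-1.66e-12, -1.60e-12, -2.40e-12, -2.26e-12, -2.01e-12]


def variance_at_scale(values, factor):
    """Population variance of the targets after multiplying them by factor."""
    return statistics.pvariance([v * factor for v in values])


# Multiplying y by 10**k multiplies the variance by 10**(2k).
for exponent in (0, 3, 6, 9, 12):
    factor = 10.0 ** exponent
    print(f"scale 1e{exponent}: node variance {variance_at_scale(y, factor):.3e}")
```

Unscaled, the variance is many orders of magnitude below a threshold such as 1e-7, while after multiplying by 1e12 it is of ordinary size.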

Related

How to fix warning: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples

I'm working on a classification project, where I try out various types of models like logistic regression, decision trees etc, to see which model can most accurately predict if a patient is at risk for heart disease (given an existing data set of over 3600 rows).
I'm currently trying to work on my decision tree, and have plotted ROC curves to find the optimized values for tuning the max_depth and min_samples_split hyperparameters. However when I try to create my new model I get the warning:
"UndefinedMetricWarning: Precision is ill-defined and being set to 0.0
due to no predicted samples. Use zero_division parameter to control
this behavior."
I have already googled the warning, and semi understand why it's happening, but not how to fix it. I don't want to just get rid of the warning or ignore the values that weren't predicted. I want to actually fix the issue. From my understanding, it has something to do with how I processed my data. However, I'm not sure where I went wrong with my data processing.
I started off with doing a train-test split, then used StandardScaler like so:
#Let's split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop("TenYearCHD", axis = 1)
y = df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
#Let's scale our data
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
I then created my initial decision tree, and received no warnings:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = "entropy")
#Fit our model and predict
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
After looking at my ROC curve and AUC scores, I attempted to create another, more optimized decision tree, which is where I then received my warning:
dtc3 = DecisionTreeClassifier(criterion = "entropy", max_depth = 4, min_samples_split= .25)
dtc3.fit(X_train, y_train)
dtc3_pred = dtc3.predict(X_test)
Essentially I'm at a loss as to what to do. Should I use a different method like StratifiedKFold in addition to the train-test split to process my data? Should I do something else entirely? Any help would be greatly appreciated.
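The warning text says precision is undefined when there are no predicted positive samples, i.e. when TP + FP = 0. The sketch below is a hand-rolled precision function (not scikit-learn's implementation) on hypothetical labels, showing the case where a model predicts class 0 for every row. Whether that is what dtc3 does here is an assumption; it could be checked by looking at the distinct values in dtc3_pred.

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision = TP / (TP + FP); returns zero_division if nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        # No predicted positives: the ratio 0/0 is undefined, so use the fallback.
        return zero_division
    return tp / (tp + fp)


# Hypothetical imbalanced test labels and two sets of predictions.
y_true = [0, 0, 0, 1, 0, 0, 1, 0]
all_negative = [0] * len(y_true)          # model never predicts class 1
some_positive = [0, 0, 1, 1, 0, 0, 0, 0]  # one false positive, one true positive

print(precision(y_true, all_negative))
print(precision(y_true, some_positive))
```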

Root Mean Squared Error vs Accuracy Linear Regression

I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be big.
I'm not sure if I'm just not understanding the meaning of the errors correctly or if I made some errors in my code. I thought for the accuracy of 92, the errors should be way smaller and closer to 0.
Here's my code:
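import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model, metrics
predict = "G3"  # target column; not shown in the original snippet, taken from the answer below
data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_prediction = linear.predict(x_test)  # missing in the original snippet; used by the metrics below
linear_accuracy = round(linear.score(x_test, y_test) , 5)  # score() returns R^2
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))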
Did I make any errors in the code, or do these errors make sense in this case?
The "accuracy" that score returns for sklearn linear regression is the R^2 metric. It tells you the fraction of the variation in the dependent variable that is explained by the model's predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked at your work, and it seems that you used all the numeric variables as predictors and your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get errors of 0. Please let me know if you have any questions. Cheers
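To make the distinction concrete, here is a minimal sketch that computes R^2, MAE and RMSE by hand on a few hypothetical grades (made-up numbers, not the student dataset). Every prediction is off by exactly one grade point, yet R^2 is still high because the errors are small relative to the spread of the true values.

```python
import math


def regression_report(y_true, y_pred):
    """Return R^2, mean absolute error and root mean squared error."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Hypothetical true and predicted grades.
grades_true = [10, 12, 14, 16, 18]
grades_pred = [11, 11, 15, 15, 19]

report = regression_report(grades_true, grades_pred)
print(report)
```

R^2 is unitless, while MAE and RMSE are in the units of the target (grade points here), so the two kinds of numbers are not directly comparable.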

Optimal threshold for imbalanced binary classification problem

I have trouble optimizing the threshold for binary classification. I am using 3 models: Logistic Regression, Catboost and Sklearn RandomForestClassifier.
For each model I am doing the following steps:
1) fit model
2) Get 0.0 recall for class 1 (which makes up 5% of the dataset) and 1.0 recall for class 0. (This can't be fixed with grid search or the class_weight='balanced' parameter.) >:(
3) Find the optimal threshold
fpr, tpr, thresholds = roc_curve(y_train, model.predict_proba(X_train)[:, 1])
optimal_threshold = thresholds[np.argmax(tpr - fpr)]
4) Enjoy a ~70% recall for both classes.
5) Predict probabilities for the test dataset and use the optimal_threshold I calculated above to get classes.
Here comes the question: when I run the code again and again without fixing random_state, the optimal threshold varies and shifts quite dramatically. This leads to dramatic changes in the accuracy metrics on the test sample.
Do I need to calculate some average threshold and use it as a hard-coded constant? Or do I have to fix random_state everywhere? Or is the method of finding optimal_threshold incorrect?
If you do not set random_state to a fixed value, results will differ on every run. To get reproducible results, set random_state to a fixed value everywhere it is required, or set a fixed global NumPy seed with numpy.random.seed.
https://scikit-learn.org/stable/faq.html#how-do-i-set-a-random-state-for-an-entire-execution
The scikit-learn FAQ mentions that it is better to pass random_state where required than to rely on the global random state.
Global Random State Example:
import numpy as np
np.random.seed(42)
Some examples locally setting random_state:
X_train, X_test, y_train, y_test = train_test_split(sample.data, sample.target, test_size=0.3, random_state=0)
skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
classifierAlgorithm = LGBMClassifier(objective='binary', random_state=0)
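For reference, the criterion in step 3 of the question (maximizing tpr - fpr, often called Youden's J) can be sketched in plain Python. The labels and scores below are hypothetical, and the function is a simplified stand-in for roc_curve plus argmax, not scikit-learn's implementation. A sample is predicted positive when its score is at or above the threshold.

```python
def youden_threshold(y_true, scores):
    """Return the (threshold, J) pair that maximizes J = TPR - FPR."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_threshold, best_j = None, float("-inf")
    # Try every distinct score as a candidate threshold.
    for threshold in sorted(set(scores)):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical labels and predicted probabilities for class 1.
y_true = [0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.8, 0.9]

threshold, j = youden_threshold(y_true, scores)
print(threshold, j)
```

With only a handful of positive samples, the chosen threshold depends on a few individual scores, which may be part of why it moves between runs when the split and the model are not seeded. That is an interpretation, not something the answer above states.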

How to calculate the RMSE on Ridge regression model

I have performed a ridge regression model on a data set
(link to the dataset: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)
as below:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
y = train['SalePrice']
X = train.drop("SalePrice", axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
ridge = Ridge(alpha=0.1, normalize=True)
ridge.fit(X_train,y_train)
pred = ridge.predict(X_test)
I calculated the MSE using the metrics library from sklearn as
from sklearn.metrics import mean_squared_error
mean = mean_squared_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test,pred)
I am getting a very large value of MSE = 554084039.54321 and RMSE = 21821.8, I am trying to understand if my implementation is correct.
RMSE implementation
Your RMSE implementation is correct, which is easy to verify: it is just the square root of sklearn's mean_squared_error.
I think you are missing a closing parenthesis though, here to be exact:
rmse = np.sqrt(mean_squared_error(y_test,pred)) # the last one was missing
High error problem
Your MSE is high because the model is not capturing the relationships between your variables and the target very well. Bear in mind that each error is squared, so being 1000 off in price contributes 1000000 to the sum.
You may want to transform the price to log scale with the natural logarithm (numpy.log); it is common practice, especially for this problem (I assume you are doing House Prices: Advanced Regression Techniques); see the available kernels for guidance. With this approach, you will not get such big values.
Last but not least, check the Mean Absolute Error to see that your predictions are not as terrible as they seem.
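A small sketch of the three points above, using hypothetical house prices (made-up numbers, not the Kaggle data) and plain Python instead of sklearn: RMSE and MAE on the raw prices are in currency units and therefore large, while RMSE on log prices is a small number that roughly reflects relative error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted sale prices.
prices_true = [200000.0, 150000.0, 300000.0]
prices_pred = [210000.0, 140000.0, 330000.0]

# The same prices on a natural-log scale.
log_true = [math.log(p) for p in prices_true]
log_pred = [math.log(p) for p in prices_pred]

print("RMSE (raw prices):", rmse(prices_true, prices_pred))
print("MAE  (raw prices):", mae(prices_true, prices_pred))
print("RMSE (log prices):", rmse(log_true, log_pred))
```

If a model is trained on log prices, its predictions need to be passed through the exponential to get back to prices.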

Interpreting the DecisionTreeRegressor score?

I am trying to evaluate a relevance of features and I am using DecisionTreeRegressor()
The related part of the code is presented below:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop(['Frozen'], axis = 1)
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# TODO: Set a random state.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Frozen'], test_size = 0.25, random_state = 1)
# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# TODO: Report the score of the prediction using the testing set
from sklearn.model_selection import cross_val_score
#score = cross_val_score(regressor, X_test, y_test)
score = regressor.score(X_test, y_test)
print score # python 2.x
When I run the print function, it returns the given score:
-0.649574327334
You can find the score function implementation and some explanation here and below:
Returns the coefficient of determination R^2 of the prediction.
...
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
I could not grasp the whole concept yet, so this explanation is not very helpful for me. For instance I could not understand why score could be negative and what exactly it indicates (if something is squared, I would expect it can only be positive).
What does this score indicate and why can it be negative?
If you know of any article (for starters), that would be helpful as well!
R^2 can be negative by its definition (https://en.wikipedia.org/wiki/Coefficient_of_determination) if the model fits the data worse than a horizontal line. Basically
R^2 = 1 - SS_res/SS_tot
and SS_res and SS_tot are both non-negative. If SS_res > SS_tot, you have a negative R^2. Look at this answer as well: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
Your code uses cross_val_score and the score method of DecisionTreeRegressor. You may want to take a look at the scikit-learn documentation for DecisionTreeRegressor.
Basically, the score you see is R^2, or (1 - u/v), where u is the residual sum of squares of your predictions and v is the total sum of squares.
u/v can be arbitrarily large when you make really bad predictions, while it can only be as small as zero, given that u and v are sums of squares (>= 0).
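A minimal sketch of this, computing 1 - u/v by hand on hypothetical numbers: predicting the mean of the targets gives a score of 0, perfect predictions give 1, and predictions worse than the mean give a negative score.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - u/v, with u the residual and v the total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    u = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    v = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - u / v


# Hypothetical targets and three sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = list(y_true)                  # u = 0
mean_baseline = [3.0] * len(y_true)     # the "horizontal line": u = v
reversed_pred = [5.0, 4.0, 3.0, 2.0, 1.0]  # worse than the mean: u > v

print(r2_manual(y_true, perfect))
print(r2_manual(y_true, mean_baseline))
print(r2_manual(y_true, reversed_pred))
```

The "squared" in the name is misleading here: the score is defined as one minus a ratio, not as the square of some quantity, so nothing forces it to be non-negative on test data.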
