How to fix warning: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples - python

I'm working on a classification project, where I try out various types of models like logistic regression, decision trees etc, to see which model can most accurately predict if a patient is at risk for heart disease (given an existing data set of over 3600 rows).
I'm currently trying to work on my decision tree, and have plotted ROC curves to find the optimized values for tuning the max_depth and min_samples_split hyperparameters. However when I try to create my new model I get the warning:
"UndefinedMetricWarning: Precision is ill-defined and being set to 0.0
due to no predicted samples. Use zero_division parameter to control
this behavior."
I have already googled the warning, and semi understand why it's happening, but not how to fix it. I don't want to just get rid of the warning or ignore the values that weren't predicted. I want to actually fix the issue. From my understanding, it has something to do with how I processed my data. However, I'm not sure where I went wrong with my data processing.
I started off with doing a train-test split, then used StandardScaler like so:
#Let's split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop("TenYearCHD", axis = 1)
y = df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
#Let's scale our data
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
I then created my initial decision tree, and received no warnings:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = "entropy")
#Fit our model and predict
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
After looking at my ROC curve and AUC scores, I attempted to create another, more optimized decision tree, which is where I received my warning:
dtc3 = DecisionTreeClassifier(criterion = "entropy", max_depth = 4, min_samples_split= .25)
dtc3.fit(X_train, y_train)
dtc3_pred = dtc3.predict(X_test)
Essentially I'm at a loss as to what to do. Should I use a different method like StratifiedKFold in addition to the train-test split to process my data? Should I do something else entirely? Any help would be greatly appreciated.
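For what it's worth, here is a minimal sketch of what the warning means, using made-up labels rather than the heart disease data. Precision is TP / (TP + FP), so it is undefined whenever the model predicts no positive samples at all. The y_true and y_pred arrays are hypothetical; with the real model, dtc3_pred would take the place of y_pred.

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical labels: two positives exist, but the model predicts none.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0])

# Count how many samples were assigned to each class (index = class label).
predicted_counts = np.bincount(y_pred, minlength=2)
print("Predicted per class:", predicted_counts)

# TP + FP == 0, so precision is 0 / 0. zero_division picks the value to report
# and suppresses the warning; it does not change the model's predictions.
precision = precision_score(y_true, y_pred, zero_division=0)
print("Precision:", precision)
```

If the count for class 1 is also zero on the real test set, the tuned tree is predicting only the majority class, which may point at the hyperparameters (for example a large min_samples_split) or at class imbalance rather than at the scaling step.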

Related

Why am I getting a very high test accuracy even when I test my dataset with a single feature

I am writing a small program in which I train a random forest to predict a binary value. My dataset has around 20,000 entries, and each entry has 25 features (continuous and categorical) with a binary target value to predict.
I am getting over 99% test accuracy, which is surprisingly high. I tried reducing the number of features, and even with two features I still get such high accuracy. I just want to make sure I am not doing anything wrong in my code, such as the training set leaking into my test set.
Here is the code snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
data = pd.read_csv(r'test.csv')
data = data.drop_duplicates()
# splitting data
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# preprocessing the dataset by one-hot encoding (fit on the training split only)
l1 = OneHotEncoder(handle_unknown='ignore')
l1.fit(X_train)
X_train = l1.transform(X_train)
X_test = l1.transform(X_test)
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train.to_numpy())
# evaluation
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
Additionally, I forgot to add that my dataset is balanced, and the precision and recall scores are 100%!
This is quite a big dataset. How balanced is your dataset? It might be the case that your test split is filled mostly with entries of one label and the model failed every time the entry was from the other label. Therefore, I would say accuracy is not a good measure to rely on here.
Have a look at this:
Difference of model accuracy and performance
Have a look at your confusion matrix and inspect your splits.
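A small sketch of that check, assuming made-up labels (y_test and y_pred below are hypothetical, not taken from the question's data): print the share of each label in the split, then the confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical test labels and predictions, for illustration only.
y_test = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])

# Share of each label in the split; a heavily skewed split makes accuracy misleading.
labels, counts = np.unique(y_test, return_counts=True)
label_share = dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))
print("Label share:", label_share)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

Off-diagonal entries are the misclassified samples, so a model that only ever predicts one label shows up as an empty column.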

Using Custom Metric for Score Method in XGBoost

I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. First, to use eval_metric, you need to provide the data to evaluate on via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train,eval_set=[(X_test,y_test)],eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
Second, the importance score is calculated from the average gain across all splits the feature is used in (see the help page). From your question, I suppose you need the model to maximize AUC, as in cross-validation, but you cannot use AUC as an objective in xgboost: gradient boosting methods require a differentiable loss function.
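If the goal is to rank columns by their effect on roc_auc rather than by gain, one option not mentioned above is permutation importance from scikit-learn, which accepts a scoring argument. A rough sketch, assuming GradientBoostingClassifier as a stand-in so it runs without xgboost installed; an XGBClassifier could presumably be passed the same way, since it follows the scikit-learn estimator interface.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data["data"], data["target"], random_state=42, test_size=0.2, stratify=data["target"]
)

# Stand-in model (assumption): any scikit-learn compatible classifier works here.
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each column in turn and measure how much ROC AUC drops on held-out data.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)

# Indices of the five features whose shuffling hurts AUC the most.
top_features = result.importances_mean.argsort()[::-1][:5]
for idx in top_features:
    print(data["feature_names"][idx], result.importances_mean[idx])
```

Unlike feature_importances_, these numbers depend on the metric passed to scoring, so swapping in "f1" gives an F1-based ranking instead.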
With an imbalanced dataset, you can try adjusting the scale_pos_weight parameter to control the balance of positive and negative weights. This is discussed on the xgboost website.
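As a starting point for scale_pos_weight, a commonly suggested value is the ratio of negative to positive samples in the training labels. A small sketch with made-up labels (y_train here is hypothetical); the value would then be passed to the XGBClassifier constructor and tuned from there.

```python
import numpy as np

# Hypothetical imbalanced binary labels: 8 negatives, 2 positives.
y_train = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_negative = int(np.sum(y_train == 0))
n_positive = int(np.sum(y_train == 1))

# Ratio of negatives to positives, intended to be passed as
# XGBClassifier(scale_pos_weight=scale_pos_weight).
scale_pos_weight = n_negative / n_positive
print(scale_pos_weight)
```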

Non linear regression using Xgboost

I have a dataframe with 36540 rows. The objective is to predict y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost which doesn't exist), however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a test train split random will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485

Statmodels output different from sklearn regression

I am trying to get something magic from the Boston dataset in sklearn. Without making any changes, I ran one regression with sklearn and another with statsmodels, so that I could easily get the p-value of each of the variables used. However, my results are completely different.
Here it is:
boston_houses=load_boston()
boston=pd.DataFrame(data=boston_houses.data, columns=boston_houses.feature_names)
boston['MEDV']=boston_houses.target
boston.head()
X,y=boston.drop(columns='MEDV'),boston['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42)
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
pred= lin_model.predict(X_test)
from sklearn.metrics import r2_score,mean_squared_error
rSq=r2_score(y_test,pred)
rmse=np.sqrt(mean_squared_error(y_test,pred))
print ('The R-squared for this model {}'.format(rSq))
print ('The Root mean square error for this model {}'.format(rmse))
###### scipy now ###
The R-squared for this model 0.7261570836552478
The Root mean square error for this model 4.55236459846306
X_new=sm.tools.tools.add_constant(X_train)
estimator= sm.OLS(y_train, X_new)
estimator.fit()
print(estimator.fit().summary())
I get 0.739 for the R-squared with statsmodels. Why?
If you are wondering why it is not the same as the R-squared you got from sklearn.metrics.r2_score, the reason is that you have used two different realisations of linear regression with different parameters, which produced different predictions.
If you, for example, change your test_size to 0.25 in train_test_split, you will have one more model with a different result.
I ran the evaluation on the whole data in sklearn, and the results finally matched. Silly of me. I should check how the model performs on the training data before checking how it responds to the test sample. That would have helped me avoid the mistake, since there are two steps.
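To illustrate the two steps, here is a minimal sketch on synthetic data (make_regression is an assumption standing in for the Boston dataset). It computes R-squared once on the training data, which is the in-sample figure an OLS summary would report for a fit on X_train, and once on the held-out test data, which is what the sklearn code in the question measured.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), not the Boston dataset.
X, y = make_regression(n_samples=500, n_features=13, noise=25.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

lin_model = LinearRegression().fit(X_train, y_train)

# R-squared on the data the model was fitted on (in-sample).
train_r2 = r2_score(y_train, lin_model.predict(X_train))
# R-squared on held-out data (out-of-sample).
test_r2 = r2_score(y_test, lin_model.predict(X_test))
print(train_r2, test_r2)
```

The two numbers answer different questions, so they should not be expected to match.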

how to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily be done with
predictions = model.predict(X_new)
My question is, what is the error associated with this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my real predictions for the new data should be a random number computed from a gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on your training data set. I would suggest exploring the cross-validation tools, which now live in sklearn.model_selection (the old sklearn.cross_validation module was removed in scikit-learn 0.20).
The most basic usage is:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on your training data.
If its distribution is a good representation of the "real world" and it is of sufficient size (see learning theory, e.g. PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use the test set as Kris has suggested?
I usually use grid search for optimizing parameters:
from sklearn.model_selection import GridSearchCV, train_test_split
# split into training and test sets
X_train, X_test, y_train, y_test =train_test_split(
X_data[indices], y_data[indices], test_size=0.25)
#cross validation gridsearch
params = dict(logistic__C=[0.1,0.3,1,3, 10,30, 100])
grid_search = GridSearchCV(clf, param_grid=params,cv=5)
grid_search.fit(X_train, y_train)
#print scores and best estimator
print('best param: ', grid_search.best_params_)
print('best train score: ', grid_search.best_score_)
print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))
The idea is to hide the test set from your learning algorithm (and from yourself): don't train and don't optimize parameters using this data.
Finally, you should use the test set for performance evaluation (error) only; it should provide an unbiased MSE.
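A minimal sketch of that last step for the linear regression case, assuming synthetic data from make_regression in place of the asker's X and y: fit on the training split only, then report the error on the untouched test split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the real X and y.
X, y = make_regression(n_samples=400, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit on the training split only; the test split stays hidden until evaluation.
model = LinearRegression().fit(X_train, y_train)

# Error estimate on data the model has never seen.
test_mse = mean_squared_error(y_test, model.predict(X_test))
test_rmse = np.sqrt(test_mse)
print(test_mse, test_rmse)
```

The square root puts the error back in the units of y, which is often easier to read as a typical prediction error.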
