Statmodels output different from sklearn regression - python

I am trying to get something magic from the boston dataset on sklearn. Wihtout making any change I did a regression with sklearn and another with statsmodels to easily get the p-value of my each of the variables used. However, my reults are completely different results.
Here it is:
boston_houses=load_boston()
boston=pd.DataFrame(data=boston_houses.data, columns=boston_houses.feature_names)
boston['MEDV']=boston_houses.target
boston.head()
X,y=boston.drop(columns='MEDV'),boston['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42)
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
pred= lin_model.predict(X_test)
from sklearn.metrics import r2_score,mean_squared_error
rSq=r2_score(y_test,pred)
rmse=np.sqrt(mean_squared_error(y_test,pred))
print ('The R-squared for this model {}'.format(rSq))
print ('The Root mean square error for this model {}'.format(rmse))
###### scipy now ###
The R-squared for this model 0.7261570836552478
The Root mean square error for this model 4.55236459846306
X_new=sm.tools.tools.add_constant(X_train)
estimator= sm.OLS(y_train, X_new)
estimator.fit()
print(estimator.fit().summary())
I get 0.739 for the R-squared with statsmodel,Why??

If you are wondering why it is not the same as R-squared which you received from sklearn.metrics.r2_score than the reason is that you have used two different realisations of linear regression with different parameters which produced different predictions.
If you for example change your test_size to 0.25 in train_test_split you will have one more model with different result.

I have used the test for the whole data on sklearn. The results finally matched.Silly from me. I should check how my training perform before to checking how my the sample test respond. That would help me to avoid the mistake acknowledging that there are 2 steps.

Related

How to fix warning: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples

I'm working on a classification project, where I try out various types of models like logistic regression, decision trees etc, to see which model can most accurately predict if a patient is at risk for heart disease (given an existing data set of over 3600 rows).
I'm currently trying to work on my decision tree, and have plotted ROC curves to find the optimized values for tuning the max_depth and min_samples_split hyperparameters. However when I try to create my new model I get the warning:
"UndefinedMetricWarning: Precision is ill-defined and being set to 0.0
due to no predicted samples. Use zero_division parameter to control
this behavior."
I have already googled the warning, and semi understand why it's happening, but not how to fix it. I don't want to just get rid of the warning or ignore the values that weren't predicted. I want to actually fix the issue. From my understanding, it has something to do with how I processed my data. However, I'm not sure where I went wrong with my data processing.
I started off with doing a train-test split, then used StandardScaler like so:
#Let's split the data
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X = df.drop("TenYearCHD", axis = 1)
y = df["TenYearCHD"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
#Let's scale our data
SS = StandardScaler()
X_train = SS.fit_transform(X_train)
X_test = SS.transform(X_test)
I then created my initial decision tree, and received no warnings:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = "entropy")
#Fit our model and predict
dtc.fit(X_train, y_train)
dtc_pred = dtc.predict(X_test)
After looking at my ROC curve and AOC scores, I attempted to create another more optimized decision tree, which is where I then received my warning:
dtc3 = DecisionTreeClassifier(criterion = "entropy", max_depth = 4, min_samples_split= .25)
dtc3.fit(X_train, y_train)
dtc3_pred = dtc3.predict(X_test)
Essentially i'm at a loss at what to do. Should I use a different method like StratifiedKFolds in addition to train-test split to process my data? Should I do something else entirely? Any help would be greatly appreciated.

Using Custom Metric for Score Method in XGBoost

I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question, to use eval_metric, you need to provide data to evaluate using eval_set = :
mod = XGBClassifier()
mod.fit(X_train, y_train,eval_set=[(X_test,y_test)],eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all splits the feature is used in see help page. From your question, I suppose you need the mdoel to maximize auc, like in cross-validation, but you cannot use the auc as an objective in xgboost. Gradient boosting methods require a differentiable loss function.
With imbalanced dataset, you can try to adjust the parameter scale_pos_weight, to adjust the balance of positive and negative weights. This is discussed in xgboost website

Non linear regression using Xgboost

I have a dataframe with 36540 rows. the objective is to predict y HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model but model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost which doesn't exist), however, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So making a test train split random will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485

Root Mean Squared Error vs Accuracy Linear Regression

I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be big.
I'm not sure if I'm just not understanding the meaning of the errors correctly or if I made some errors in my code. I thought for the accuracy of 92, the errors should be way smaller and closer to 0.
Here's my code:
data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
X = np.array(data.drop([predict], 1))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_accuracy = round(linear.score(x_test, y_test) , 5)
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))
Did I make any errors in the code or errors do make sense in this case?
The accuracy metric in sklearn linear regression is the R^2 metric. It essentially tells you the percent of the variation in the dependent variable explained by the model predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked your work and it seems that you used all the numeric variables as your predictors and your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get 0 errors. Please let me know if you have any questions. Cheers

how to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for them. This can easily done by doing
predictions = model.predict(X_new)
My question is, what is this the error associated to this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
And basically my real predictions for the new data should be a random number computed from a gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on your training data set. I would suggest exploring the cross-validation submodule sklearn.cross_validation.
The most basic usage is:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
It depends on you training data-
If it's distribution is a good representation of the "real world" and of a sufficient size (see learning theories, as PAC), then I would generally agree.
That said- if you are looking for a practical way to evaluate your model, why won't you use the test set as Kris has suggested?
I usually use grid search for optimizing parameters:
#split to training and test sets
X_train, X_test, y_train, y_test =train_test_split(
X_data[indices], y_data[indices], test_size=0.25)
#cross validation gridsearch
params = dict(logistic__C=[0.1,0.3,1,3, 10,30, 100])
grid_search = GridSearchCV(clf, param_grid=params,cv=5)
grid_search.fit(X_train, y_train)
#print scores and best estimator
print 'best param: ', grid_search.best_params_
print 'best train score: ', grid_search.best_score_
print 'Test score: ', grid_search.best_estimator_.score(X_test,y_test)
The Idea is hiding the test set from your learning algorithm (and yourself)- Don't train and don't optimize parameters using this data.
Finally you should use the test set for performance evaluation (error) only, it should provide an unbiased mse.

Categories