Getting AUC from sklearn using preds from Statsmodel instance - python

I have a logistic regression that I want to know the AUC for.
I created the logistic regression model using statsmodels:
import statsmodels.api as sm
y = generate_data(dependent_var) # pseudocode
X = generate_data(independent_var) # pseudocode
X['constant'] = 1
logit_model=sm.Logit(y,X)
result=logit_model.fit()
Then, I use sklearn to get an AUC score for my model predictions:
from sklearn.metrics import roc_auc_score
roc_auc_score(y, result.predict())
The code runs and I get an AUC score; I just want to make sure I am passing variables between the package calls correctly.

Input for roc_auc_score:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
sklearn.metrics.roc_auc_score(y_true, y_score, *, average='macro', sample_weight=None, max_fpr=None, multi_class='raise', labels=None)
y_true array-like of shape (n_samples,) or (n_samples, n_classes)
True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
y_score array-like of shape (n_samples,) or (n_samples, n_classes)
Target scores.
To validate the input parameters:
y_true: correct
y_score: result.predict(X), based on https://www.statsmodels.org/devel/examples/notebooks/generated/predict.html
But your validation is missing the concept of a train/test split, which is necessary in machine learning. Normally you always divide the dataset into a training part and a test part. In pseudocode this would look like:
import statsmodels.api as sm
y = generate_data(dependent_var) # pseudocode
X = generate_data(independent_var) # pseudocode
X['constant'] = 1
# split the data into train and test parts
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
logit_model=sm.Logit(y_train,X_train)
result=logit_model.fit()
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, result.predict(X_test))
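For completeness, here is a minimal end-to-end sketch of that workflow. The data generation below is synthetic and purely illustrative (the original generate_data calls are pseudocode), but the statsmodels/sklearn hand-off is the same: result.predict(X_test) returns predicted probabilities, which is exactly what roc_auc_score expects as y_score.
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# synthetic data standing in for generate_data(...) -- purely illustrative
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

X = sm.add_constant(X)  # same role as X['constant'] = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

result = sm.Logit(y_train, X_train).fit()
# result.predict(X_test) returns predicted probabilities, suitable as y_score
print(roc_auc_score(y_test, result.predict(X_test)))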

Related

how to build an artificial neural network with scikit-learn?

I am trying to run an artificial neural network with scikit-learn.
I want to run the regression, get the model fit results, and generate out-of-sample forecasts.
This is my code below. Any help will be greatly appreciated.
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import MLPRegressor
#import the data
df=pd.read_excel(r"C:\Users\Action\Downloads\Python\Practice_Data\sorted_data v2.xlsx")
#view the data
df.head(5)
# drop the unneeded 'Unnamed: 13' column
df2=df.drop('Unnamed: 13', axis=1)
#view the data
df2.head(5)
Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.metrics import r2_score
Describe the data
df.describe().transpose()
target_column = ['public health care services']
predictors = list(set(list(df.columns))-set(target_column))
df[predictors] = df[predictors]/df[predictors].max()
df.describe().transpose()
Set the X and y
X = df[predictors].values
y = df[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
Import the MLP Classifier and fit the network
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(hidden_layer_sizes=(8,8,8), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)
predict_train = mlp.predict(X_train)
Set up the MLP Classifier
mlp = MLPClassifier(
hidden_layer_sizes=(50, 8),
max_iter=15,
alpha=1e-4,
solver="sgd",
verbose=True,
random_state=1,
learning_rate_init=0.1)
Import the warnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn")
predict_test = mlp.predict(X_test)
To train on the data, I use the MLPClassifier and call the fit function on the training data.
mlp.fit(X_train, y_train)
After this, the neural network is done training.
Once the neural network is trained, the next step is to test it.
Print out the model scores
print(f"Training set score: {mlp.score(X_train, y_train)}")
print(f"Test set score: {mlp.score(X_test, y_test)}")
y_predict = mlp.predict(X_train)
I am getting an error from the code below:
x_ann = y_predict[:, 0]
y_ann = y_predict[:, 1]
The error message is
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
Any help will be greatly appreciated.
The predict function gives you the actual class, and since each point can belong to one and only one class (except in multi-label problems), the output is expected to be one-dimensional.
What is the shape of your y_true labels? It might be the case that your labels are sparse with 2 classes, i.e. 0 and 1, and the model is minimising log loss, as described here:
This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
Also, looking at predict_log_proba(), the documentation says:
log_y_prob : ndarray of shape (n_samples, n_classes)
The predicted log-probability of the sample for each class in the model, where classes are ordered as they are in self.classes_. Equivalent to log(predict_proba(X))
So it means that if the probability is 0.3 the sample is assigned to one class, and if it's 0.7 it is assigned to the other, ASSUMING it's binary classification with a threshold set to 0.5.
What you might be confusing it with is the predict_proba() function, which gives you the probabilities for each class.
That might be the case here. Please post your X and y data shapes and types so that we can understand better.
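As a quick illustration of the difference between predict and predict_proba, here is a toy sketch with made-up data (not the asker's dataset), just to show the output shapes:
import numpy as np
from sklearn.neural_network import MLPClassifier

# toy binary data, purely to illustrate output shapes
X = np.random.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)

clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500).fit(X, y)
print(clf.predict(X).shape)        # (200,)   -> one hard class label per sample
print(clf.predict_proba(X).shape)  # (200, 2) -> one probability per class per sample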

Cross-validation and parameters tuning with XGBoost and hyperopt

One way to do nested cross-validation with a XGB model would be:
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBClassifier
# Let's assume that we have some data for a binary classification
# problem : X (n_samples, n_features) and y (n_samples,)...
gs = GridSearchCV(estimator=XGBClassifier(),
                  param_grid={'max_depth': [3, 6, 9],
                              'learning_rate': [0.001, 0.01, 0.05]},
                  cv=2)
scores = cross_val_score(gs, X, y, cv=2)
However, regarding the tuning of XGB parameters, several tutorials (such as this one) take advantage of the Python hyperopt library. I would like to be able to do nested cross-validation (as above) using hyperopt to tune the XGB parameters.
To do so, I wrote my own Scikit-Learn estimator:
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import train_test_split
from sklearn.exceptions import NotFittedError
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
def optimize_params(X, y, params_space, validation_split=0.2):
    """Estimate a set of 'best' model parameters."""
    # Split X, y into train/validation
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=validation_split, stratify=y)

    # Estimate XGB params
    def objective(_params):
        _clf = XGBClassifier(n_estimators=10000,
                             max_depth=int(_params['max_depth']),
                             learning_rate=_params['learning_rate'],
                             min_child_weight=_params['min_child_weight'],
                             subsample=_params['subsample'],
                             colsample_bytree=_params['colsample_bytree'],
                             gamma=_params['gamma'])
        _clf.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_val, y_val)],
                 eval_metric='auc',
                 early_stopping_rounds=30)
        y_pred_proba = _clf.predict_proba(X_val)[:, 1]
        roc_auc = roc_auc_score(y_true=y_val, y_score=y_pred_proba)
        return {'loss': 1. - roc_auc, 'status': STATUS_OK}

    trials = Trials()
    return fmin(fn=objective,
                space=params_space,
                algo=tpe.suggest,
                max_evals=100,
                trials=trials,
                verbose=0)
class OptimizedXGB(BaseEstimator, ClassifierMixin):
    """XGB with optimized parameters.

    Parameters
    ----------
    custom_params_space : dict or None
        If not None, dictionary whose keys are the XGB parameters to be
        optimized and corresponding values are 'a priori' probability
        distributions for the given parameter value. If None, a default
        parameters space is used.
    """
    def __init__(self, custom_params_space=None):
        self.custom_params_space = custom_params_space

    def fit(self, X, y, validation_split=0.3):
        """Train an XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)
            Data.
        y : ndarray, shape (n_samples,) or (n_samples, n_labels)
            Labels.
        validation_split : float (default: 0.3)
            Float between 0 and 1. Corresponds to the percentage of samples in X which will be used as validation data to estimate the 'best' model parameters.
        """
        # If no custom parameters space is given, use a default one.
        if self.custom_params_space is None:
            _space = {
                'learning_rate': hp.uniform('learning_rate', 0.0001, 0.05),
                'max_depth': hp.quniform('max_depth', 8, 15, 1),
                'min_child_weight': hp.quniform('min_child_weight', 1, 5, 1),
                'subsample': hp.quniform('subsample', 0.7, 1, 0.05),
                'gamma': hp.quniform('gamma', 0.9, 1, 0.05),
                'colsample_bytree': hp.quniform('colsample_bytree', 0.5, 0.7, 0.05)
            }
        else:
            _space = self.custom_params_space

        # Estimate best params using X, y
        opt = optimize_params(X, y, _space, validation_split)

        # Instantiate `xgboost.XGBClassifier` with optimized parameters
        best = XGBClassifier(n_estimators=10000,
                             max_depth=int(opt['max_depth']),
                             learning_rate=opt['learning_rate'],
                             min_child_weight=opt['min_child_weight'],
                             subsample=opt['subsample'],
                             gamma=opt['gamma'],
                             colsample_bytree=opt['colsample_bytree'])
        best.fit(X, y)
        self.best_estimator_ = best
        return self

    def predict(self, X):
        """Predict labels with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict`.')
        else:
            return self.best_estimator_.predict(X)

    def predict_proba(self, X):
        """Predict label probabilities with trained XGB model.

        Parameters
        ----------
        X : ndarray, shape (n_samples, n_features)

        Returns
        -------
        output : ndarray, shape (n_samples,) or (n_samples, n_labels)
        """
        if not hasattr(self, 'best_estimator_'):
            raise NotFittedError('Call `fit` before `predict_proba`.')
        else:
            return self.best_estimator_.predict_proba(X)
My questions are:
Is this a valid approach? For instance, in the fit method of my OptimizedXGB, best.fit(X, y) will train an XGB model on X, y. However, this might lead to overfitting as no eval_set is specified to ensure early stopping.
On a toy example (the famous iris dataset), this OptimizedXGB performs worse than a basic LogisticRegression classifier. Why is that? Is it because the example is too simplistic? See below for the code of the example.
Example:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_iris(return_X_y=True)
X = X[:, :2]
X = X[y < 2]
y = y[y < 2]
skf = StratifiedKFold(n_splits=2, random_state=42)
# With a LogisticRegression classifier
pipe = Pipeline([('scaler', StandardScaler()), ('lr', LogisticRegression())])
gs = GridSearchCV(estimator=pipe, param_grid={'lr__C': [1., 10.]})
lr_scores = cross_val_score(gs, X, y, cv=skf)
# With OptimizedXGB
xgb_scores = cross_val_score(OptimizedXGB(), X, y, cv=skf)
# Print results
print('Accuracy with LogisticRegression = %.4f (+/- %.4f)' % (np.mean(lr_scores), np.std(lr_scores)))
print('Accuracy with OptimizedXGB = %.4f (+/- %.4f)' % (np.mean(xgb_scores), np.std(xgb_scores)))
Outputs:
Accuracy with LogisticRegression = 0.9900 (+/- 0.0100)
Accuracy with OptimizedXGB = 0.9100 (+/- 0.0300)
Although the scores are close, I would have expected the XGB model to score at least as well as a LogisticRegression classifier.
EDIT:
similar post
First, check this post - might help - nested CV.
Regarding your questions:
Yes, that's the right way to go. Once you have selected your hyperparameters, you should fit your model (the selected model) on the entire training data. However, since this model includes a model selection process inside, you can only "score" how well it generalizes using an external CV, like you did.
Since you are scoring the selection process as well (and not just the model, say XGB vs. linear regression), there might be some problem with the selection process. Maybe your hyperparameter space is not properly defined and you are choosing poor parameters?
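For example, if the space is the issue, a less restrictive one could be tried. The ranges below are illustrative assumptions, not tuned recommendations; such a space can be passed to the estimator from the question via custom_params_space:
import numpy as np
from hyperopt import hp

# Illustrative, wider search space (ranges are assumptions, not tuned values)
wider_space = {
    'learning_rate': hp.loguniform('learning_rate', np.log(0.001), np.log(0.3)),
    'max_depth': hp.quniform('max_depth', 2, 12, 1),
    'min_child_weight': hp.quniform('min_child_weight', 1, 10, 1),
    'subsample': hp.uniform('subsample', 0.5, 1.0),
    'gamma': hp.uniform('gamma', 0.0, 5.0),
    'colsample_bytree': hp.uniform('colsample_bytree', 0.3, 1.0),
}

# Reusing the estimator and CV splitter from the question:
# xgb_scores = cross_val_score(OptimizedXGB(custom_params_space=wider_space), X, y, cv=skf)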

GridSearchCV for the multi-class SVM in python

I am trying to learn how to find the best parameters for a classifier, so I am using GridSearchCV for a multi-class classification problem. A dummy example was given in "Does not GridSearchCV support multi-class?"; I am just using that code with n_classes=3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler,label_binarize
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, f1_score, roc_auc_score, make_scorer
X, y = make_classification(n_samples=3000, n_features=10, weights=[0.1, 0.9, 0.3],n_classes=3, n_clusters_per_class=1,n_informative=2)
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', class_weight='auto'))
param_space = dict(svc__C=np.logspace(-5,0,5), svc__gamma=np.logspace(-2, 2, 10))
# use f1_score as the scoring metric
my_scorer = make_scorer(f1_score, greater_is_better=True)
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
I am trying to do one-hot encoding as advised in "Scikit-learn GridSearch giving 'ValueError: multiclass format is not supported' error". Also, sometimes there will be datasets like the Toxic Comment Classification dataset on Kaggle which give you binarized labels.
y = label_binarize(y, classes=[0, 1, 2])
for i in classes:
    gscv.fit(X, y[i])
    print gscv.best_params_
I am getting:
ValueError: bad input shape (2000L, 3L)
I am not sure why I am getting this error. My objective is to find the best parameters for a multi-class classification problem.
There are two problems in the two parts of your code.
1) Let's start with the first part, where you have not one-hot encoded the labels. You see, SVC supports the multi-class case just fine, but f1_score, when used inside GridSearchCV, does not.
f1_score by default returns the score of the positive label in the binary classification case, so it will throw an error in your case.
Alternatively, it can also return an array of scores (one for each class), but GridSearchCV only accepts a single value as the score because it needs a scalar to find the best score and best combination of hyper-parameters. So you need to pass an averaging method to f1_score to get a single value from the array.
According to the f1_score documentation, the following averaging methods are allowed:
average : string, [None, ‘binary’ (default), ‘micro’, ‘macro’,
‘samples’, ‘weighted’]
So change your make_scorer like this:
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
Change the 'average' param above as it suits you.
2) Now coming to the second part: When you one-hot encode the labels, the shape of y becomes 2-d, but SVC only supports a 1-d array as y as specified in the documentation:
fit(X, y, sample_weight=None)
X : {array-like, sparse matrix}, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
But even if you encode the labels and use a classifier which supports 2-d labels, the first error will still have to be solved. So I would advise you not to one-hot encode the labels and just change the f1_score call.
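Putting both fixes together, a minimal sketch (reusing pipe, param_space, X and y exactly as defined in the question, and keeping y as 1-d class labels) would be:
# keep y as a 1-d array of class labels -- no label_binarize
my_scorer = make_scorer(f1_score, greater_is_better=True, average='micro')
gscv = GridSearchCV(pipe, param_space, scoring=my_scorer)
gscv.fit(X, y)
print(gscv.best_params_)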

ValueError: multiclass-multioutput format is not supported using sklearn roc_auc_score function

I am using logistic regression for prediction. My predictions are 0's and 1's. After training my model on the given data, and also when training only on the important features (X_important_train), I am getting a score of around 70%. But when I use roc_auc_score(X, y) or roc_auc_score(X_important_train, y_train) I get a value error:
ValueError: multiclass-multioutput format is not supported
Code:
# Load libraries
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Train the model using the training sets and check the score
model = LogisticRegression()
model.fit(X, y)
model.score(X, y)
model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)
roc_auc_score(X_important_train, y_train)
First of all, the roc_auc_score function expects two arrays of the same shape: the true labels and the predicted scores.
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None)
Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format.
y_true : array, shape = [n_samples] or [n_samples, n_classes]
True binary labels in binary label indicators.
y_score : array, shape = [n_samples] or [n_samples, n_classes]
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).
Now, the inputs should be the true labels and the predicted scores, NOT the training data and labels as you are using in the example that you posted.
In more detail,
model.fit(X_important_train, y_train)
model.score(X_important_train, y_train)
# this is wrong here
roc_auc_score(X_important_train, y_train)
You should do something like:
y_pred = model.predict(X_test_data)
roc_auc_score(y_true, y_pred)
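Note also that roc_auc_score is usually given the predicted probability of the positive class rather than the hard 0/1 predictions, since AUC ranks the scores. A small sketch using the variables from the question (X_important_train and y_train are assumed to exist as in the post; scoring on a held-out test set would be better still):
model.fit(X_important_train, y_train)
y_scores = model.predict_proba(X_important_train)[:, 1]  # probability of the positive class
print(roc_auc_score(y_train, y_scores))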

scikit learn decision tree model evaluation

Here are the related code and documentation. For the default cross_val_score, without explicitly specifying a scorer, does the output array contain precision, AUC, or some other metric?
Using Python 2.7 with miniconda interpreter.
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...
...
array([ 1. , 0.93..., 0.86..., 0.93..., 0.93...,
0.93..., 0.93..., 1. , 0.93..., 1. ])
regards,
Lin
From the user guide:
By default, the score computed at each CV iteration is the score
method of the estimator. It is possible to change this by using the
scoring parameter:
From the DecisionTreeClassifier documentation:
Returns the mean accuracy on the given test data and labels. In
multi-label classification, this is the subset accuracy which is a
harsh metric since you require for each sample that each label set be
correctly predicted.
Don't be confused by "mean accuracy"; it's just the regular way one computes accuracy. Follow the links to the source:
from .metrics import accuracy_score
return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
Now, the source for metrics.accuracy_score:
def accuracy_score(y_true, y_pred, normalize=True, sample_weight=None):
    ...
    # Compute accuracy for each possible representation
    y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    if y_type.startswith('multilabel'):
        differing_labels = count_nonzero(y_true - y_pred, axis=1)
        score = differing_labels == 0
    else:
        score = y_true == y_pred

    return _weighted_sum(score, sample_weight, normalize)
And if you still aren't convinced:
def _weighted_sum(sample_score, sample_weight, normalize=False):
    if normalize:
        return np.average(sample_score, weights=sample_weight)
    elif sample_weight is not None:
        return np.dot(sample_score, sample_weight)
    else:
        return sample_score.sum()
Note: for accuracy_score, the normalize parameter defaults to True, so it simply returns np.average of the boolean numpy array, i.e. the fraction of correct predictions.
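If you do want a different metric than the default accuracy, you can pass the scoring parameter to cross_val_score, as the user guide quote above mentions. A small sketch for the same iris example ('f1_macro' is used here because it works for the 3-class problem; 'roc_auc' would only apply directly to binary targets):
# request macro-averaged F1 instead of the default accuracy
f1_scores = cross_val_score(clf, iris.data, iris.target, cv=10, scoring='f1_macro')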
If a scoring argument isn't given, cross_val_score will default to using the .score method of the estimator you're using. For DecisionTreeClassifier, it's mean accuracy (as shown in the docstring below):
In [11]: DecisionTreeClassifier.score?
Signature: DecisionTreeClassifier.score(self, X, y, sample_weight=None)
Docstring:
Returns the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like, shape = (n_samples, n_features)
Test samples.
y : array-like, shape = (n_samples) or (n_samples, n_outputs)
True labels for X.
sample_weight : array-like, shape = [n_samples], optional
Sample weights.
Returns
-------
score : float
Mean accuracy of self.predict(X) wrt. y.
