Is there any way to calculate the residual deviance of a scikit-learn logistic regression model? This is a standard output from R model summaries, but I couldn't find it in any of sklearn's documentation.
As suggested by @russell-richie, it should be model.predict_proba (not model.predict_log_proba) in the deviance function.
Don't forget the argument normalize=False in metrics.log_loss() so that it returns the sum of the per-sample losses rather than their mean.
So, to complete @ingo's answer, to obtain the model deviance with sklearn.linear_model.LogisticRegression you can compute:
def deviance(X, y, model):
    return 2*metrics.log_loss(y, model.predict_proba(X), normalize=False)
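For completeness, here is a hedged usage sketch of the deviance helper above; the imports and synthetic data are illustrative assumptions, not part of the original answer:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

# synthetic binary classification data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print(deviance(X, y, model))  # residual deviance of the fitted model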
Actually, you can. Deviance is closely related to cross-entropy, which is in sklearn.metrics.log_loss. Deviance is just 2*(loglikelihood_of_saturated_model - loglikelihood_of_fitted_model). Scikit-learn can (without major tweaks) only handle classification of individual instances, so the log-likelihood of the saturated model is zero. The cross-entropy returned by log_loss is the negative log-likelihood. Thus, the deviance is simply
def deviance(X, y, model):
    return 2*metrics.log_loss(y, model.predict_log_proba(X))
I know this is a very late answer, but I hope it helps anyway.
You cannot do it in scikit-learn, but check out statsmodels' GLMResults (API).
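For reference, a minimal sketch (not from the answer) of how statsmodels exposes this directly; the synthetic data is illustrative only:
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

res = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()
print(res.deviance)       # residual deviance
print(res.null_deviance)  # null deviance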
Here is a Python implementation of explained_deviance that implements the discussion from this thread: GitHub code
import numpy as np
from scipy.special import softmax, expit
from sklearn.metrics import log_loss
from sklearn.dummy import DummyClassifier

# deviance function
def explained_deviance(y_true, y_pred_logits=None, y_pred_probas=None,
                       returnloglikes=False):
    """Computes explained_deviance score to be comparable to explained_variance"""
    assert y_pred_logits is not None or y_pred_probas is not None, "Either the predicted probabilities \
(y_pred_probas) or the predicted logit values (y_pred_logits) should be provided. But neither of the two were provided."

    if y_pred_logits is not None and y_pred_probas is None:
        # check if binary or multiclass classification
        if y_pred_logits.ndim == 1:
            y_pred_probas = expit(y_pred_logits)
        elif y_pred_logits.ndim == 2:
            y_pred_probas = softmax(y_pred_logits, axis=-1)  # softmax over the class axis
        else:  # invalid
            raise ValueError(f"logits passed seem to have incorrect shape of {y_pred_logits.shape}")

    if y_pred_probas.ndim == 1:
        y_pred_probas = np.stack([1 - y_pred_probas, y_pred_probas], axis=-1)

    # compute a null model's predicted probability
    X_dummy = np.zeros((len(y_true), 1))  # dummy features, ignored by DummyClassifier
    y_null_probas = DummyClassifier(strategy='prior').fit(X_dummy, y_true).predict_proba(X_dummy)
    # strategy : {"most_frequent", "prior", "stratified", "uniform", "constant"}

    # suggestion from https://stackoverflow.com/a/53215317
    llf = -log_loss(y_true, y_pred_probas, normalize=False)
    llnull = -log_loss(y_true, y_null_probas, normalize=False)
    ### McFadden's pseudo-R-squared: 1 - (llf / llnull)
    explained_deviance = 1 - (llf / llnull)
    ## Cox & Snell's pseudo-R-squared: 1 - exp((llnull - llf)*(2/nobs))
    # explained_deviance = 1 - np.exp((llnull - llf) * (2 / len(y_pred_probas))) ## TODO, not implemented
    if returnloglikes:
        return explained_deviance, {'loglike_model': llf, 'loglike_null': llnull}
    else:
        return explained_deviance
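A short, hedged usage example for explained_deviance; the labels and logits below are made up purely to show the call signature:
# made-up binary labels and logits for illustration
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
logits = np.array([-1.2, 0.8, 2.0, -0.5, 0.3, -2.1, 1.5, 0.9])

score, loglikes = explained_deviance(y_true, y_pred_logits=logits, returnloglikes=True)
print(score, loglikes)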
from math import exp
import numpy as np
from sklearn.linear_model import LogisticRegression
I used the code below from How To Implement Logistic Regression From Scratch in Python:
def predict(row, coefficients):
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

def coefficients_sgd(train, l_rate, n_epoch):
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error**2
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
    return coef
dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]
l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)
[-0.39233141593823756, 1.4791536027917747, -2.316697087065274]
x = np.array(dataset)[:,:2]
y = np.array(dataset)[:,2]
model = LogisticRegression(penalty="none")
model.fit(x,y)
print(model.intercept_.tolist() + model.coef_.ravel().tolist())
[-3.233238244349982, 6.374828107647225, -9.631487530388092]
What should I change to get the same or closer coefficients? How can I establish initial coefficients, learning rate, n_epoch?
Well, there are many nuances here 🙂
First, recall that estimating the coefficients of logistic regression by minimizing the (negative) log-likelihood is possible using various optimization methods, including the SGD you implemented, but there is no exact, closed-form solution. So even if you implement an exact copy of scikit-learn's LogisticRegression, you will need to set the same hyperparameters (number of epochs, learning rate, etc.) and random state to obtain the same coefficients.
Second, LogisticRegression offers five different optimization methods (the solver parameter). You ran LogisticRegression(penalty="none") with its default parameters, and the default solver is 'lbfgs', not SGD, so depending on your data and hyperparameters you may get significantly different results.
What should I change to get the same or closer coefficients?
I would suggest comparing your implementation with SGDClassifier(loss='log') first, since LogisticRegression does not offer an SGD solver; a hedged comparison sketch follows below. Keep in mind, though, that scikit-learn's implementation is more sophisticated, in particular having more hyperparameters for early stopping, such as tol.
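As a starting point, here is a hedged sketch of such a comparison on the toy dataset from the question. The hyperparameter values are illustrative assumptions, not tuned; note that loss='log' was renamed to loss='log_loss' in newer scikit-learn versions:
import numpy as np
from sklearn.linear_model import SGDClassifier

# dataset is the toy list-of-lists defined in the question
x = np.array(dataset)[:, :2]
y = np.array(dataset)[:, 2]

sgd = SGDClassifier(loss="log_loss",          # "log" in older scikit-learn versions
                    penalty="l2", alpha=0.0,  # effectively no regularization
                    learning_rate="constant", eta0=0.3,  # comparable to l_rate above
                    max_iter=100, tol=None,   # run all epochs, like n_epoch above
                    shuffle=False, random_state=0)
sgd.fit(x, y)
print(sgd.intercept_.tolist() + sgd.coef_.ravel().tolist())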
How can I establish initial coefficients, learning rate, n_epoch?
Typically, coefficients for SGD are initialized randomly (e.g., uniform(-1/(2n), 1/(2n))), using some data statistics (e.g., dot(y, w)/dot(w, w) for every coefficient w), or with a pre-trained model's parameters. In contrast, there is no golden rule for the learning rate or the number of epochs. Usually, we set a large number of epochs together with some other stopping criterion (e.g., whether the norm between the current and previous coefficients is smaller than some small tol), pick a moderate learning rate, and at every iteration we reduce the learning rate following some rule (see the learning_rate parameter of SGDClassifier or the User Guide) and check the stopping criterion.
I've been playing around with PySwarms, specifically with discrete.BinaryPSO, to perform feature selection, as it is an optimisation technique that helps with feature subset selection to improve classifier performance. (Link to PySwarms: https://pyswarms.readthedocs.io/en/development/examples/feature_subset_selection.html)
My dataset is text data with a corresponding label (encoded as 1s and 0s). For preprocessing, I applied CountVectorizer and TfidfTransformer to the text data.
However, a simple machine learning classifier using sklearn achieves much higher accuracy than the version incorporating PySwarms. No matter which dataset, preprocessing techniques, or functions I use when incorporating discrete.BinaryPSO, my accuracy, precision, and recall are lower than with a simple machine learning classifier using sklearn.
My code is attached below; any help with the situation is appreciated:
# Create an instance of the classifier
classifier = LogisticRegression()

# Define objective function
def f_per_particle(m, alpha):
    total_features = training_data.shape[1]
    # Get the subset of the features from the binary mask
    if np.count_nonzero(m) == 0:
        X_subset = training_data
    else:
        X_subset = training_data[:, m == 1]
    # Perform classification and store performance in P
    classifier.fit(X_subset, y_train)
    P = (classifier.predict(X_subset) == y_train).mean()
    # Compute for the objective function
    j = (alpha * (1.0 - P)
         + (1.0 - alpha) * (1 - (X_subset.shape[1] / total_features)))
    return j

def f(x, alpha=0.88):
    """Higher-level method to do classification in the
    whole swarm.
    Inputs
    ------
    x: numpy.ndarray of shape (n_particles, dimensions)
        The swarm that will perform the search
    Returns
    -------
    numpy.ndarray of shape (n_particles, )
        The computed loss for each particle
    """
    n_particles = x.shape[0]
    j = [f_per_particle(x[i], alpha) for i in range(n_particles)]
    return np.array(j)

options = {'c1': 0.5, 'c2': 0.5, 'w': 0.9, 'k': 10, 'p': 2}

# Call instance of PSO
dimensions = training_data.shape[1]  # dimensions should be the number of features
optimizer = ps.discrete.BinaryPSO(n_particles=10, dimensions=dimensions, options=options)

# Perform optimization
cost, pos = optimizer.optimize(f, iters=10)
print('selected features = ' + str(sum((pos == 1) * 1)) + '/' + str(len(pos)))

classifier.fit(training_data, y_train)
print('accuracy before FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data), normalize=True) * 100))

X_subset = training_data[:, pos == 1]
classifier.fit(X_subset, y_train)
print('accuracy after FS = ' + str(accuracy_score(y_test, classifier.predict(testing_data[:, pos == 1]), normalize=True) * 100))
Since feature selection is not yielding better performance, I would recommend using all the features in the machine learning model and examining the impact of each feature. You may find SHAP (https://shap.readthedocs.io/en/latest/index.html) helpful for explaining the output and then looking at the significance of each feature.
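For example, a hedged sketch of how SHAP could be applied to the logistic regression classifier above, assuming training_data and testing_data are the (dense) feature matrices from the question:
import shap

# LinearExplainer suits linear models such as LogisticRegression;
# sparse TF-IDF matrices may need .toarray() first
explainer = shap.LinearExplainer(classifier, training_data)
shap_values = explainer.shap_values(testing_data)
shap.summary_plot(shap_values, testing_data)  # overview of per-feature impact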
So I am relatively new to the ML/AI game in python, and I'm currently working on a problem surrounding the implementation of a custom objective function for XGBoost.
My differentiation knowledge is pretty rusty, so I've created a custom objective function with a gradient and hessian that models the mean squared error function that is run as the default objective function in XGBRegressor, to make sure that I am doing all of this correctly. The problem is, the results of the model (the error outputs) are close but not identical for the most part (and way off for some points). I don't know what I'm doing wrong or how that could be possible if I am computing things correctly. If you all could look at this and maybe provide insight into where I am wrong, that would be awesome!
The original code without a custom function is:
import xgboost as xgb

reg = xgb.XGBRegressor(n_estimators=150,
                       max_depth=2,
                       objective="reg:squarederror",
                       n_jobs=-1)

reg.fit(X_train, y_train)

y_pred_test = reg.predict(X_test)
and my custom objective function for MSE is as follows:
def gradient_se(y_true, y_pred):
    # Compute the gradient for squared error.
    return (-2 * y_true) + (2 * y_pred)

def hessian_se(y_true, y_pred):
    # Compute the hessian for squared error.
    return 0*(y_true + y_pred) + 2

def custom_se(y_true, y_pred):
    # Squared error objective. A simplified version of MSE used as
    # objective function.
    grad = gradient_se(y_true, y_pred)
    hess = hessian_se(y_true, y_pred)
    return grad, hess
the documentation reference is here
Thanks!
According to the documentation, the library passes the predicted values (y_pred in your case) and the ground truth values (y_true in your case) in this order.
You pass the y_true and y_pred values in reversed order in your custom_se(y_true, y_pred) function to both the gradient_se and hessian_se functions. For the hessian it doesn't make a difference since the hessian should return 2 for all x values and you've done that correctly.
For the gradient_se function, however, you end up with incorrect signs for y_true and y_pred.
The correct implementation is as follows:
def gradient_se(y_pred, y_true):
    # Compute the gradient for squared error.
    return 2*(y_pred - y_true)

def hessian_se(y_pred, y_true):
    # Compute the hessian for squared error.
    return 0*y_true + 2

def custom_se(y_pred, y_true):
    # Squared error objective. A simplified version of MSE used as
    # objective function.
    grad = gradient_se(y_pred, y_true)
    hess = hessian_se(y_pred, y_true)
    return grad, hess
Update: Please keep in mind that the native XGBoost implementation and the implementation of the sklearn wrapper for XGBoost use a different ordering of the arguments. The native implementation takes predictions first and true labels (dtrain) second, while the sklearn implementation takes the true labels (dtrain) first and the predictions second.
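To illustrate the native ordering, here is a hedged sketch (not part of the original answer) of the same squared-error objective written for xgb.train, where the callback receives (preds, dtrain) and the labels are read from the DMatrix; X_train and y_train are assumed to exist as in the question:
import numpy as np
import xgboost as xgb

def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = 2 * (preds - labels)      # same gradient as gradient_se above
    hess = np.full_like(preds, 2.0)  # constant hessian of 2
    return grad, hess

dtrain = xgb.DMatrix(X_train, label=y_train)
booster = xgb.train({"max_depth": 2}, dtrain, num_boost_round=150, obj=squared_error_obj)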
For my problem, I want to predict customer review scores ranging from 1 to 5.
I thought it would be good to implement this as a regression problem, because a prediction of 1 from the model when 5 is the true value should be a "worse" prediction than a 4.
It is also desired that the model performs roughly equally well across all review score classes.
Because my dataset is highly unbalanced, I want to create a metric/loss that is capable of capturing this (much like F1 for classification).
Therefore I created the following metric (for now just MSE is relevant):
def custom_metric(y_true, y_pred):
    df = pd.DataFrame(np.column_stack([y_pred, y_true]), columns=["Predicted", "Truth"])
    class_mse = 0
    #class_mae = 0
    print("MSE for Classes:")
    for i in df.Truth.unique():
        temp = df[df["Truth"]==i]
        mse = mean_squared_error(temp.Truth, temp.Predicted)
        #mae = mean_absolute_error(temp.Truth, temp.Predicted)
        print("Class {}: {}".format(i, mse))
        class_mse += mse
        #class_mae += mae
    print()
    print("AVG MSE over Classes {}".format(class_mse/len(df.Truth.unique())))
    #print("AVG MAE over Classes {}".format(class_mae/len(df.Truth.unique())))
Now an example prediction:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
# sample predictions: "model" messed up at class 2 and 3
y_true = np.array((1,1,1,2,2,2,3,3,3,4,4,4,5,5,5))
y_pred = np.array((1,1,1,2,2,3,5,4,3,4,4,4,5,5,5))
custom_metric(y_true, y_pred)
Now my question: Is it possible to create a custom TensorFlow loss function that behaves similarly? I also worked on the following implementation, which is not yet ready for TensorFlow but is perhaps closer to it:
def custom_metric(y_true, y_pred):
    mse_class = 0
    num_classes = len(np.unique(y_true))
    stacked = np.vstack((y_true, y_pred))
    for i in np.unique(stacked[0]):
        y_true_temp = stacked[0][np.where(stacked[0]==i)]
        y_pred_temp = stacked[1][np.where(stacked[0]==i)]
        mse = np.mean(np.square(y_pred_temp - y_true_temp))
        mse_class += mse
    return mse_class/num_classes
But still, I am not sure how to work around the for loop for a TensorFlow-style definition.
Thanks in advance for any help!
The for loop should be replaced entirely by vectorized numpy/tensorflow operations on tensors.
A custom metric example would be:
from keras import backend as K

def custom_mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
where y_true is the ground truth label and y_pred are your predictions. You can see there are no explicit for-loops.
The motivation for not using for loops is that vectorized operations (available in both numpy and tensorflow) take advantage of modern CPU architectures, turning many iterative operations into matrix ones. As a rough indication, a dot-product implemented in numpy can take approximately 30 times less time than an equivalent for-loop in pure Python.
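Applied to your per-class MSE, here is a hedged sketch of a loop-free version using TensorFlow ops. It assumes integer review scores 1..5 in y_true; treat it as a starting point rather than a drop-in loss:
import tensorflow as tf

NUM_CLASSES = 5  # review scores 1..5 (an assumption for this sketch)

def class_averaged_mse(y_true, y_pred):
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(y_pred, [-1])
    sq_err = tf.square(y_pred - y_true)
    # one-hot membership mask of shape (batch, NUM_CLASSES)
    mask = tf.one_hot(tf.cast(y_true, tf.int32) - 1, depth=NUM_CLASSES)
    per_class_sum = tf.reduce_sum(mask * sq_err[:, tf.newaxis], axis=0)
    per_class_count = tf.reduce_sum(mask, axis=0)
    per_class_mse = per_class_sum / tf.maximum(per_class_count, 1.0)
    # average only over classes that actually occur in the batch
    present = tf.cast(per_class_count > 0, tf.float32)
    return tf.reduce_sum(per_class_mse) / tf.maximum(tf.reduce_sum(present), 1.0)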
I am trying to use machine learning techniques to predict time to events. My predictions will be probability vectors v of length 20, with v[i] being
the probability that the event occurs in i + 1 days (i ranges from 0 to 19).
How can I test the custom loss and metric functions I write?
I'd like to use the following loss and metric to train a model:
Here's how I tried to implement them:
import numpy as np
from keras import backend as K

def weighted_meansquare(y_true, y_pred):
    w = K.constant(np.array([i + 1 for i in range(20)]))
    return K.sum(K.square(w * y_pred - w * y_true))

def esperance_metric(y_true, y_pred):
    w = K.constant(np.array([i + 1 for i in range(20)]))
    return K.sum(w * y_true - w * y_true)
I expected the model to minimize the metric (which is basically an expectation, since my model returns a probability vector). Yet when I try to fit my model, I see that the metric is always 0.0000e+00.
What I'm looking for is:
some specific tips about how to code these functions
some general tips about testing keras.backend functions
You have a typo in your definition of esperance_metric: you use y_true - y_true instead of y_pred - y_true, which is why your metric is always 0.
I also see a mistake in weighted_meansquare. You should multiply by w after taking the square, as follows:
K.sum(w * K.square(y_pred - y_true))
In general, if you want to test backend functions you can try evaluating them with K.eval. For example:
y_pred = K.constant([1.] * 20)
y_true = K.constant([0.] * 20)
print(K.eval(esperance_metric(y_true, y_pred)))
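Putting the fixes together, here is a hedged sketch of the corrected functions plus a quick K.eval check on dummy tensors; the expected value 210 is simply the sum of the weights 1..20:
import numpy as np
from keras import backend as K

w = K.constant(np.arange(1, 21, dtype="float32"))

def weighted_meansquare(y_true, y_pred):
    # weight applied after squaring the error, as suggested above
    return K.sum(w * K.square(y_pred - y_true))

def esperance_metric(y_true, y_pred):
    # y_pred - y_true, fixing the typo noted above
    return K.sum(w * y_pred - w * y_true)

y_pred = K.constant([1.0] * 20)
y_true = K.constant([0.0] * 20)
print(K.eval(weighted_meansquare(y_true, y_pred)))  # 210.0
print(K.eval(esperance_metric(y_true, y_pred)))     # 210.0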