scikit-learn Logistic Regression prediction not same as self-implementation - python

I trained a model using scikit-learn's LogisticRegression classifier (multinomial/multiclass). I then saved the coefficients from the model to a file. Next, I loaded the coefficients into my own self-implementation of softmax, which is what scikit-learn's documentation claims is used by the Logistic Regression classifier for the multinomial case. However, the predictions do not align.
Training mlogit model with scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import json
# Split data into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# Train model
mlr = LogisticRegression(random_state=21, multi_class='multinomial', solver='newton-cg')
mlr.fit(X_train, y_train)
y_pred = mlr.predict(X_test)
# Save test data and coefficients
json.dump(X_test.tolist(), open('X_test.json', 'w'), indent=4)
json.dump(y_pred.tolist(), open('y_pred.json', 'w'), indent=4)
json.dump(mlr.classes_.tolist(), open('classes.json', 'w'), indent=4)
json.dump(mlr.coef_.tolist(), open('weights.json', 'w'), indent=4)
Self-implementation of softmax via Scipy
from scipy.special import softmax
import numpy as np
import json
def predict(x, w, classes):
    z = np.dot(x, np.transpose(w))
    sm = softmax(z, axis=1)
    return [classes[i] for i in sm.argmax(axis=1)]
x = json.load(open('X_test.json'))
w = json.load(open('weights.json'))
classes = json.load(open('classes.json'))
y_pred_self = predict(x, w, classes)
Results do not match
Essentially, when I compare y_pred_self with y_pred, they are not the same (about 85% similar).
So my question is whether the scikit-learn softmax or predict implementation has some non-standard/hidden tweaks?
Side-note: I have also tried a self-implementation in Ruby and it also gives predictions that are off.

There are a few differences that I noticed at first glance. Please have a look at the following points:
1. Regularization
According to the docs, scikit-learn uses a regularization term:
This class implements regularized logistic regression [...]. Note that regularization is applied by default.
So you could either deactivate the regularization term in the scikit-learn model (see the sketch below) or add regularization to your own implementation.
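A minimal sketch of the first option, i.e. refitting without the penalty (the exact spelling depends on your scikit-learn version: newer releases accept penalty=None, older ones penalty='none'; a very large C has practically the same effect):
from sklearn.linear_model import LogisticRegression

# no L2 penalty, so the coefficients are the plain (unregularized) maximum-likelihood estimates
mlr_unreg = LogisticRegression(random_state=21, multi_class='multinomial',
                               solver='newton-cg', penalty=None)  # or penalty='none' / C=1e12
mlr_unreg.fit(X_train, y_train)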
2. Bias
In the docs you can read that a bias (intercept) term is used by default:
fit_intercept : bool, default=True
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
So you could either deactivate the bias in the scikit-learn model (fit_intercept=False) or add the bias term to your implementation, as sketched below.
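A minimal sketch of adding the bias to your own predict, assuming you also dump mlr.intercept_ to a file (the name intercept.json is just an example):
import json
import numpy as np
from scipy.special import softmax

def predict(x, w, b, classes):
    # decision function: X @ W.T + b, which is what scikit-learn's predict uses internally
    z = np.dot(x, np.transpose(w)) + b
    sm = softmax(z, axis=1)
    return [classes[i] for i in sm.argmax(axis=1)]

x = json.load(open('X_test.json'))
w = json.load(open('weights.json'))
b = json.load(open('intercept.json'))  # saved via json.dump(mlr.intercept_.tolist(), open('intercept.json', 'w'))
classes = json.load(open('classes.json'))
y_pred_self = predict(x, w, b, classes)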
It would also help to use a well-known dataset from the scikit-learn library or to share your own dataset, so the problem is easier to reproduce. Let me know how it works out.

Related

Using Custom Metric for Score Method in XGBoost

I am using xgboost for a classification problem with an imbalanced dataset. I plan on using some combination of an f1-score or roc-auc as my primary criteria for judging the model.
Currently the default value returned from the score method is accuracy, but I would really like to have a specific evaluation metric returned instead. My big motivation for doing this is that I presume the feature_importances_ attribute from the model is determined from what's affecting the score method, and the columns that impact predictive accuracy might very well be different from the columns that impact roc-auc. Right now I am passing in values to eval_metric but it does not seem to be making a difference.
Here is some sample code:
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
data = load_breast_cancer()
X = data['data']
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, stratify=y)
mod = XGBClassifier()
mod.fit(X_train, y_train)
Now at this point, mod.score(X_test, y_test) will return a value of ~ 0.96, and the roc_auc_score is ~ 0.99.
I was hoping the following snippet:
mod.fit(X_train, y_train, eval_metric='auc')
Would then allow mod.score(X_test, y_test) to return the roc_auc_score value, but it is still returning predictive accuracy, not roc_auc.
The purpose of this exercise is estimating the influence of different columns on the outcome, so if I could get feature_importances_ returned using f1 or roc_auc as the measure of impact this would be a huge boon, but I do not seem to be on the right path as of now.
Thank you.
There are two parts to your question. First, to use eval_metric you need to provide data to evaluate on via eval_set:
mod = XGBClassifier()
mod.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="auc")
You can check the auc using evals_result(), and it gives the auc for every iteration:
mod.evals_result()
{'validation_0': OrderedDict([('auc',
[0.965939,
0.9833,
0.984788,
[...]
0.991402,
0.991071,
0.991402,
0.991733])])}
The importance score is calculated based on the average gain across all the splits a feature is used in (see the help page). From your question, I suppose you want the model to maximize AUC, as in cross-validation, but you cannot use AUC as an objective in xgboost: gradient boosting methods require a differentiable loss function.
With an imbalanced dataset, you can also try adjusting the parameter scale_pos_weight to control the balance of positive and negative weights; this is discussed on the xgboost website.
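If the goal is simply to report AUC instead of accuracy, a short sketch (reusing mod, X_train, X_test, y_train and y_test from above) is to compute it from the predicted probabilities, or to pass scoring='roc_auc' to scikit-learn's model-selection tools instead of relying on .score():
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# AUC on the held-out test set
test_auc = roc_auc_score(y_test, mod.predict_proba(X_test)[:, 1])

# AUC-driven cross-validation; GridSearchCV accepts the same scoring argument
cv_auc = cross_val_score(XGBClassifier(), X_train, y_train, cv=5, scoring='roc_auc')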

What is the expected_value field of TreeExplainer for a Random Forest?

I used SHAP to explain my RF
import shap
from sklearn.ensemble import RandomForestRegressor

RF_best_parameters = RandomForestRegressor(random_state=24, n_estimators=100)
RF_best_parameters.fit(X_train, y_train.values.ravel())
shap_explainer_model = shap.TreeExplainer(RF_best_parameters)
The TreeExplainer class has an attribute expected_value.
My first guess was that this field is the mean of the predicted y over X_train (I also read this here), but it is not.
The output of the command:
shap_explainer_model.expected_value
is 0.2381.
The output of the command:
RF_best_parameters.predict(X_train).mean()
is 0.2389.
As we can see, the values are not the same.
So what is the meaning of the expected_value here?
This is due to a peculiarity of the method when used with the Random Forest algorithm; quoting from the response in the relevant GitHub thread "shap explainer expected_value is different from model expected value":
It is because of how sklearn records the training samples in the tree models it builds. Random forests use a random subsample of the data to train each tree, and it is that random subsample that is used in sklearn to record the leaf sample weights in the model. Since TreeExplainer uses the recorded leaf sample weights to represent the training dataset, it will depend on the random sampling used during training. This will cause small variations like the ones you are seeing.
We can actually verify that this behavior is not present with other algorithms, say Gradient Boosting Trees:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
But for RF we do indeed get a "small variation", as mentioned by the main SHAP developer in the thread above:
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
rf_explainer = shap.TreeExplainer(rf)
rf_explainer.expected_value
# array([-11.59166808])
mean_pred_rf = np.mean(rf.predict(X_train))
mean_pred_rf
# -11.280125877556388
np.isclose(mean_pred_rf, rf_explainer.expected_value)
# array([False])
Just try:
shap_explainer_model = shap.TreeExplainer(RF_best_parameters, data=X_train, feature_perturbation="interventional", model_output="raw")
Then the shap_explainer_model.expected_value should give you the mean prediction of your model on train data.
Otherwise, TreeExplainer uses feature_perturbation="tree_path_dependent"; according to the documentation:
The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset and so is used by default when no background dataset is provided.
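As a sanity check (a sketch reusing the rf and X_train defined in the previous answer, not part of the original posts): with an explicit background dataset and interventional perturbation, the expected value should come out essentially equal to the mean training prediction:
rf_explainer_int = shap.TreeExplainer(rf, data=X_train,
                                      feature_perturbation="interventional",
                                      model_output="raw")
# the base value is now the average model output over the supplied background data
np.isclose(np.mean(rf.predict(X_train)), rf_explainer_int.expected_value)
# expected: True (up to numerical tolerance)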

How can I use generalized regression neural network in python?

I'm trying to find a Python library or package that implements newgrnn (Generalized Regression Neural Network), which is described here. Is there any package or library available where I can use this kind of neural network for regression?
I found the library neupy which solved my problem:
from neupy import algorithms
from neupy.algorithms.rbfn.utils import pdf_between_data
import numpy as np

grnn = algorithms.GRNN(std=0.003)
grnn.train(X, y)
# In this part of the code you can do any modifications you want
ratios = pdf_between_data(grnn.input_train, X, grnn.std)
predicted = (np.dot(grnn.target_train.T, ratios) / ratios.sum(axis=0)).T
This is the link for the library: http://neupy.com/apidocs/neupy.algorithms.rbfn.grnn.html
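If you just need plain predictions without the manual pdf_between_data step, neupy's GRNN also exposes the usual predict method; a minimal sketch (X_new stands for any new samples with the same features as X):
# equivalent, simpler prediction path
y_new = grnn.predict(X_new)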
A more recent alternative is pyGRNN, which offers, in addition to the standard GRNN, an Anisotropic GRNN that optimizes the hyperparameters automatically:
from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from pyGRNN import GRNN
# get the data set
diabetes = datasets.load_diabetes()
X = diabetes.data
y = diabetes.target
X_train, X_test, y_train, y_test = train_test_split(preprocessing.minmax_scale(X),
                                                    preprocessing.minmax_scale(y.reshape((-1, 1))),
                                                    test_size=0.25)
# use Anisotropic GRNN with Limited-Memory BFGS algorithm
# to select the optimal bandwidths
AGRNN = GRNN(calibration = 'gradient_search')
AGRNN.fit(X_train, y_train.ravel())
sigma = AGRNN.sigma
y_pred = AGRNN.predict(X_test)
mse_AGRNN = MSE(y_test, y_pred)
mse_AGRNN ## 0.030437040

how to properly use sklearn to predict the error of a fit

I'm using sklearn to fit a linear regression model to some data. In particular, my response variable is stored in an array y and my features in a matrix X.
I train a linear regression model with the following piece of code
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X,y)
and everything seems to be fine.
Then let's say I have some new data X_new and I want to predict the response variable for it. This can easily be done with
predictions = model.predict(X_new)
My question is: what is the error associated with this prediction?
From my understanding I should compute the mean squared error of the model:
from sklearn.metrics import mean_squared_error
model_mse = mean_squared_error(model.predict(X),y)
Basically, my real predictions for the new data would then be random numbers drawn from a Gaussian distribution with mean predictions and sigma^2 = model_mse. Do you agree with this, and do you know if there's a faster way to do this in sklearn?
You probably want to validate your model on held-out data. I would suggest exploring the cross-validation tools in the sklearn.model_selection submodule (formerly sklearn.cross_validation).
The most basic usage is:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
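A minimal sketch of using that split to estimate the prediction error (the held-out MSE, and hence the residual standard deviation the question is after), reusing X, y and LinearRegression from the question:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

model = LinearRegression()
model.fit(X_train, y_train)

# out-of-sample error estimate, less optimistic than the in-sample MSE
test_mse = mean_squared_error(y_test, model.predict(X_test))
residual_std = np.sqrt(test_mse)  # sigma to use if you model the prediction noise as Gaussian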
It depends on your training data:
If its distribution is a good representation of the "real world" and it is of sufficient size (see learning theories such as PAC), then I would generally agree.
That said, if you are looking for a practical way to evaluate your model, why not use the test set as Kris suggested?
I usually use grid search for optimizing parameters:
from sklearn.model_selection import train_test_split, GridSearchCV

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_data[indices], y_data[indices], test_size=0.25)

# cross-validated grid search
params = dict(logistic__C=[0.1, 0.3, 1, 3, 10, 30, 100])
grid_search = GridSearchCV(clf, param_grid=params, cv=5)
grid_search.fit(X_train, y_train)

# print scores and best estimator
print('best param: ', grid_search.best_params_)
print('best train score: ', grid_search.best_score_)
print('Test score: ', grid_search.best_estimator_.score(X_test, y_test))
The idea is to hide the test set from your learning algorithm (and from yourself): don't train and don't optimize parameters using this data.
Finally, use the test set for performance evaluation (error) only; it should then provide an unbiased MSE estimate.

Restricted Boltzmann Machine in Scikit-learn: Iris Classification

I'm working on an example of applying a Restricted Boltzmann Machine to the Iris dataset. Essentially, I'm trying to make a comparison between RBM and LDA. LDA seems to produce a reasonably correct output, but the RBM doesn't. Following a suggestion, I binarized the feature inputs using sklearn.preprocessing.Binarizer, and also tried different threshold parameter values. I tried several different ways to apply the binarization, but none seemed to work for me.
Below is my modified version of the code, based on the version by user covariance.
Any helpful comments are greatly appreciated.
from sklearn import linear_model, datasets, preprocessing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.neural_network import BernoulliRBM
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)

# Models we will use
rbm = BernoulliRBM(random_state=0, verbose=True)
binarizer = preprocessing.Binarizer(threshold=0.01, copy=True)
X_binarized = binarizer.fit_transform(X_train)
hidden_layer = rbm.fit_transform(X_binarized, Y_train)
logistic = linear_model.LogisticRegression()
logistic.coef_ = hidden_layer
classifier = Pipeline(steps=[('rbm', rbm), ('logistic', logistic)])
lda = LDA(n_components=2)

#########################################################################
# Training RBM-Logistic Pipeline
logistic.fit(X_train, Y_train)
classifier.fit(X_binarized, Y_train)

#########################################################################
# Get predictions
print("The RBM model:")
print("Predict: ", classifier.predict(X_test))
print("Real:    ", Y_test)
print()
print("Linear Discriminant Analysis:")
lda.fit(X_train, Y_train)
print("Predict: ", lda.predict(X_test))
print("Real:    ", Y_test)
RBM and LDA are not directly comparable, as an RBM doesn't perform classification on its own. You are using it as a feature-engineering step with logistic regression at the end, whereas LDA is itself a classifier, so the comparison isn't very meaningful.
The BernoulliRBM in scikit-learn only handles binary inputs (or values in [0, 1], which it treats as probabilities). The iris dataset has no sensible binarization, so you aren't going to get any meaningful outputs.
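That said, if you still want to wire up the RBM + logistic-regression pipeline correctly (independently of whether iris suits an RBM), a minimal sketch in the spirit of scikit-learn's own RBM example is to scale the features into [0, 1] and fit the whole pipeline, instead of binarizing by hand and assigning coef_ yourself; the hyperparameter values below are arbitrary:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# scale features into [0, 1] so BernoulliRBM can treat them as probabilities
rbm_pipeline = Pipeline([
    ('scale', MinMaxScaler()),
    ('rbm', BernoulliRBM(n_components=10, learning_rate=0.05, n_iter=50, random_state=0)),
    ('logistic', LogisticRegression(max_iter=1000)),
])
rbm_pipeline.fit(X_train, y_train)
print(rbm_pipeline.score(X_test, y_test))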
