Understand shap values for binary classification - python

I have trained my imbalanced dataset (binary classification) using CatBoostClassifier. Now, I am trying to interpret the model using the SHAP library. Below is the code to fit the model and calculate the SHAP values:
from catboost import CatBoostClassifier, Pool
import shap

weights = y.value_counts()[0] / y.value_counts()[1]
catboost_clf = CatBoostClassifier(loss_function='Logloss', iterations=100, verbose=True, \
l2_leaf_reg=6, scale_pos_weight=weights,eval_metric="MCC")
catboost_clf.fit(X, y)
trainx_preds = catboost_clf.predict(X_test)
explainer = shap.TreeExplainer(catboost_clf)
shap_values = explainer.shap_values(Pool(X,y))
#Class 0 samples 1625125
#Class 1 samples 122235
The size of shap_values is (1747360, 13), i.e. (number of instances, number of features). I was expecting the SHAP values to be a 3D array, i.e. (number of classes, number of instances, number of features): SHAP values for each of the positive and negative classes. How do I achieve that? How do I extract class-wise Shapley values to better understand the model?
Also, explainer.expected_value shows one base value instead of two.
Is there anything missing or incorrect in the code?
Thanks in advance!

Adding 'MultiClass' as the loss_function solved the problem. Referred to the documentation: CatBoost
model = CatBoostClassifier(loss_function = 'MultiClass')
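A minimal sketch of what that looks like end to end (reusing the X, y and imports from the question; the exact return type depends on your SHAP version):
model = CatBoostClassifier(loss_function='MultiClass', iterations=100, verbose=False)
model.fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(Pool(X, y))
# Depending on the SHAP version, shap_values is either a list of per-class arrays
# or a single 3D array, e.g. shap_values[0] -> class 0, shap_values[1] -> class 1.
# explainer.expected_value now also holds one base value per class.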

Related

XGBoost XGBRegressor predict with different dimensions than fit

I am using the xgboost XGBRegressor to train on a data of 20 input dimensions:
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)
trainX is 2000 x 19, and trainy is 2000 x 1.
In other words, I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy) during training.
When I am making a prediction:
yhat = model.predict(x_input)
x_input has to be 19 dimensions.
I am wondering if there is a way to keep training with the 19 dimensions to predict the 20th dimension, but have x_input contain only 4 dimensions at prediction time. It is kind of like transfer learning to a different input dimension.
Does xgboost support such a feature? I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).
Does xgboost support such a feature?
Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It will not be possible with some other ML frameworks (such as Scikit-Learn) that do not work with sparse input by default.
Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline which contains feature imputation step(s).
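A rough sketch of that alternative, assuming the trainX/trainy arrays from the question and scikit-learn's SimpleImputer:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
import xgboost as xgb

pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),  # fills missing feature values with training means
    xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20),
)
pipeline.fit(trainX, trainy)  # imputer statistics are learnt from the training data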
I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
You should represent missing values as float("NaN") (not as None).
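For example, a hedged sketch with toy data (the feature values here are made up purely for illustration):
import numpy as np
import xgboost as xgb

trainX = np.random.rand(2000, 19)
trainy = np.random.rand(2000)
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)

x_input = np.full((1, 19), np.nan)       # every feature starts out "missing"
x_input[0, :4] = [0.1, 0.5, 0.3, 0.9]    # fill in the 4 features you actually have
yhat = model.predict(x_input)            # XGBoost routes NaNs down its default branches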
If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only 4 features to make a prediction.
That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X,Y), where Y is your label and X is your features. If you try to change the dimensionality of X, it will no longer belong to that distribution (at least intuitively; I am not a mathematician, so I cannot come up with a proof for this).
For instance, let's assume your data lies on a 3D cube. That means that you need three coordinate axes to represent a point on it. You cannot place a point using 2 dimensions without assuming the value of the remaining dimension.
You can assume the values of the features you try to drop, but they may not represent the data you originally trained on.

Get parameter estimates from logistic regression model using pycaret

I am training and tuning a model in pycaret such as:
from pycaret.classification import *
clf1 = setup(data = train, target = 'target', feature_selection = True, test_data = test, remove_multicollinearity = True, multicollinearity_threshold = 0.4)
# create model
lr = create_model('lr')
# tune model
tuned_lr = tune_model(lr)
# optimize threshold
optimized_lr = optimize_threshold(tuned_lr)
I would like to get the parameter estimates for the features in the logistic regression, so I could proceed with understanding the effect size of each feature on the target. The object optimized_lr has a method optimized_lr.get_params(), but that returns the hyperparameters of the model; I am not interested in my tuning decisions, but in the actual fitted parameters of the logistic regression.
How could I get them using pycaret? (I could easily get them using other packages such as statsmodels, but I want to know how to do it in pycaret.)
How about:
for f, c in zip(optimized_lr.feature_names_in_, tuned_lr.coef_[0]):
    print(f, c)
To get the coefficients, use this code:
tuned_lr.feature_importances_ #this will give you the coefficients
get_config('X_train').columns #this code will give you the names of the columns.
Now we can create a dataframe so that we can clearly see how each feature relates to the target variable.
import pandas as pd
Coeff = pd.DataFrame({"Feature": get_config('X_train').columns.tolist(), "Coefficients": tuned_lr.feature_importances_})
print(Coeff)
# It would give me the Coefficient with the names of the respective columns. Hope it helps.

Tensorflow model architecture for sparse dataset

I have a regression dataset where approximately 95% of the target variables are zeros (the other 5% are between 1 and 30) and I am trying to design a Tensorflow model to model that data. I am thinking of implementing a model that combines a classifier and a regressor (check the output of the classifier submodel; if it's less than a threshold, then pass it to the regression submodel). I have the intuition that this should be built using the functional API, but I couldn't find helpful resources on that. Any ideas?
Here is the code that generates the data that I am using to replicate the problem:
import numpy as np

n = 10000
zero_percentage = 0.95
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1,30,size=round(n * (1- zero_percentage)))
y = np.concatenate((zeros,non_zeros))
np.random.shuffle(y)
a = 50
b = 10
x = np.array([np.random.randint(31,60) if element == 0 else (element - b) / a for element in y])
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution), and they provided good results but the training was unstable (loss diverges very often).
Instead of trying to find some heuristic to balance the training between the zero values and the others, you might want to try some input preprocessing method that can handle imbalanced training sets better (usually by mapping to another space before running the model, then doing the inverse with the results); for example, an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply an activation function before evaluating the model on the data.
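For instance, a minimal sketch of the last suggestion (min-max scaling the targets from the generated data above to [-1, 1] and bounding the output with tanh; the layer sizes here are arbitrary):
import tensorflow as tf

y_min, y_max = y.min(), y.max()
y_scaled = 2 * (y - y_min) / (y_max - y_min) - 1          # map targets to [-1, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1, activation='tanh'),          # output bounded to [-1, 1]
])
model.compile(optimizer='adam', loss='mse')
model.fit(x.reshape(-1, 1), y_scaled, epochs=10, verbose=0)

preds_scaled = model.predict(x.reshape(-1, 1)).ravel()
preds = (preds_scaled + 1) / 2 * (y_max - y_min) + y_min  # invert the mapping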

How to forecast time series using AutoReg in python

I'm trying to build an old-school model using only the auto-regression algorithm. I found out that there's an implementation of it in the statsmodels package. I've read the documentation, and as I understand it, it should work like ARIMA. So, here's my code:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
And when I want to predict new values, I'm trying to follow the documentation:
y_pred = model.predict(start=df_test.index.min(), end=df_test.index.max())
# or
y_pred = model.predict(start=100, end=1000)
Both return a list of NaNs.
Also, when I type model.predict(0, df_train.size - 1) it predicts real values, but model.predict(0, df_train.size) returns a list of NaNs.
Am I doing something wrong?
P.S. I know there's ARIMA, ARMA or SARIMAX algorithms, that can be used as basic auto regression. But I need exactly AutoReg.
We can do the forecasting in a couple of ways:
by directly using the predict() function and
by using the definition of AR(p) process and the parameters learnt with AutoReg(): this will be helpful for short-term predictions, as we shall see.
Let's start with a sample dataset from statsmodels; the data looks like the following:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
plt.plot(range(len(data)), data)
Let's fit an AR(p) process to model the time series and use the partial autocorrelation (PACF) plot to find the order p, as shown below.
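The PACF plot itself is not reproduced here; it can be generated, for example, with statsmodels' plot_pacf (the number of lags shown is an arbitrary choice):
from statsmodels.graphics.tsaplots import plot_pacf
plot_pacf(data, lags=30)
plt.show()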
As seen from the PACF, the first few values remain significant; let's use p=10 for the AR(p) model.
Let's divide the data into training and validation (test) datasets and fit auto-regressive model of order 10 using the training data:
from statsmodels.tsa.ar_model import AutoReg
n = len(data)
ntrain = int(n*0.9)
ntest = n - ntrain
lag = 10
res = AutoReg(data[:ntrain], lags = lag).fit()
Now, use the predict() function for forecasting all values corresponding to the held-out dataset:
preds = res.model.predict(res.params, start=n-ntest, end=n-1)  # end is inclusive, so this yields exactly ntest forecasts
Notice that we can get exactly the same predictions using the parameters from the trained model, as shown below:
x = data[ntrain-lag:ntrain].values
preds1 = []
for t in range(ntrain, n):
    pred = res.params[0] + np.sum(res.params[1:]*x[::-1])
    x[:lag-1], x[lag-1] = x[-(lag-1):], pred
    preds1.append(pred)
Note that the forecast values generated this way are the same as the ones obtained using the predict() function above.
np.allclose(preds.values, np.array(preds1))
# True
Now, let's plot the forecast values for the test data:
As can be seen, for long-term prediction the quality of forecasting is not that good (since the forecasted values are themselves fed back in for further long-term predictions).
Let's instead go for short-term predictions now and use the last lag points from the dataset to forecast the next value, as shown in the next code snippet.
preds = []
for t in range(ntrain, n):
    pred = res.params[0] + np.sum(res.params[1:]*data[t-lag:t].values[::-1])
    preds.append(pred)
As can be seen from the next plot, short-term forecasting works much better:
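The comparison plot can be recreated roughly as follows (a sketch using the variables defined above):
plt.plot(range(ntrain, n), data[ntrain:].values, label='actual')
plt.plot(range(ntrain, n), preds1, label='long-term forecast')
plt.plot(range(ntrain, n), preds, label='short-term (1-step) forecast')
plt.legend()
plt.show()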
You can use this code for forecasting:
import statsmodels.api as sm
model = sm.tsa.AutoReg(df_train.beer, 12).fit()
y_pred = model.model.predict(model.params, start=df_test.index.min(), end=df_test.index.max())
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(dataset[''], lags=1)
ARFit = model.fit()
forecasted = ARFit.predict(start=len(dataset), end=len(dataset)+12)
# visualization
dataset[''].plot(figsize=(12, 8), legend=True)
forecasted.plot(legend=True)

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values all stay the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de-facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the smaller-range feature gets squashed into a tiny sliver of the normalized range.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also does unit variance (which is some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your data:
Either scale training and test data together, and then split afterwards,
or fit the scaler on the training data only, and then use the same scaler to transform your test data (see the sketch after this note).
Never fit_transform your test set separately from the training data!
Since you have potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though they might be a subset not exactly following the same properties (due to small sample size etc.)
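A minimal sketch of the second option (the X_train/X_test names are placeholders for your own split):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics for the test data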
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
Turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset I shuffled it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with this.
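One way to avoid this kind of bug (a sketch; the data/labels names are placeholders) is to draw a single permutation and apply it to both arrays, so each label stays attached to its feature row:
import numpy as np

perm = np.random.permutation(len(labels))
data, labels = data[perm], labels[perm]  # same reordering for features and labels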
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.
