I've been working on an XGBoost classifier with 5 classes, for which I used to get the confidence of each prediction from the predict_proba() function (the confidences summed to 1, easy enough). Now that I've switched to binary classification, predict_proba() returns (e.g.) this: [[-1.0231217 2.1702857]]
I am extremely confused as to what exactly is being measured in this matrix, or whether I've done something wrong when training the model. How would I go about finding the prediction confidence for each prediction?
My model parameters, in case it's something obvious I've missed during training:
xgb1 = XGBClassifier(learning_rate=0.0125,
                     n_estimators=1000,
                     #n_jobs=-1,
                     max_depth=6,
                     min_child_weight=1,
                     gamma=0.1,
                     subsample=0.7,
                     colsample_bytree=0.6,
                     objective='multi:softmax',
                     nthread=4,
                     num_class=2,
                     seed=42,
                     use_label_encoder=False,
                     eval_metric='mlogloss',
                     tree_method='gpu_hist',
                     gpu_id=0)
One thing I noticed: you could change the objective to objective='binary:logistic'. You can try it, but I'm not sure whether this is the reason.
It would also help if you showed how you call predict_proba(). To get probabilities rather than raw margins, the output_margin option (if you are setting it anywhere) should be False.
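A minimal sketch of what that change might look like (X_train, y_train and X_test are placeholders for your own data; the other values just mirror your settings):
from xgboost import XGBClassifier

# Sketch only: with a binary objective, predict_proba() returns probabilities.
clf = XGBClassifier(objective='binary:logistic',   # probabilities instead of raw scores
                    learning_rate=0.0125,
                    n_estimators=1000,
                    eval_metric='logloss',          # binary counterpart of 'mlogloss'
                    use_label_encoder=False,
                    random_state=42)
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # shape (n_samples, 2); each row sums to 1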
I have a simple model in TensorFlow which is being trained on the first 1000 images of the MNIST dataset. From my previous experience, the learning rates I used were on the order of 0.001; however, for this model to converge the learning rate needs to be far higher, at least greater than 1. The model is shown below.
def gen_model():
    return tf.keras.models.Sequential([
        tf.keras.Input(shape=(28, 28,)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)
Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.
The dataset was created with the following code:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train = x_train.reshape(60000, 28, 28)[:1000]
y_train = y_train[:1000]
y_train = tf.one_hot(y_train, 10)
Generally speaking, models that require learning rates larger than 1 raise a red flag for me. Your model looks like a vanilla multilayer perceptron, so there's nothing overly complicated there, but a couple of things about your setup stand out:
The output from your model uses a softmax, which is normally used to represent values from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using is typically used for optimizing Gaussian or regression outputs. You might want to try using a cross-entropy loss to see if that helps.
The output from your model is in probability space, so the values you get out are in [0, 1]. The loss you're using averages the squared differences between the model output and the target one-hot vector (whose values are in {0, 1}), so the loss value is always smaller than 1. With both that and a learning rate less than 1, the updates applied to your model weights will always be small. Sometimes that's a good thing, but my guess is that in this case, particularly at the start of training when the weights aren't near their optimal values, it makes progress quite slow.
Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge.
You could also try replacing your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. You'd then need to change your dataset labels to also represent target log-probability values, which isn't possible exactly, but could be approximated with something like -1e8 * (1 - one_hot) (very negative for the wrong classes, zero for the true class). But if you wanted to go this route, you'd effectively be implementing a cross-entropy loss yourself; see the first point.
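As a concrete illustration of the first point, here is a minimal sketch of the same model compiled with a cross-entropy loss instead of mean squared error (reusing your gen_model, x_train and y_train from above); with that change, a learning rate in the more usual 0.001-0.01 range should be workable, though you'd want to verify that on your data:
# Sketch: categorical cross-entropy matches the softmax output and the
# one-hot labels built with tf.one_hot above.
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=1000, epochs=100)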
https://www.tensorflow.org/tutorials/estimator/linear
I am following the Tensorflow documentation to implement a Linear Classifier, but I'd like to use my own data instead of the tutorial set. I just have a few general questions.
My dataset is as follows. It's not a time series.
row[0] - float (changed to binary, 0 = negative, 1 = positive) VALUE TO ESTIMATE
row[1] - string (categorical, changed to vocabulary, ints 1,2,3,4,5,6,7,8,9)
row[2-19] - float (positive and negative)
row[20-60] - ints (percentile ranks, ints 10,20,30,40,50,60,70,80,90)
row[61-95] - ints (binary 1, 0)
I started by using 50k (45k training) rows of data and num_epochs=100, batch_size=256.
{'accuracy': 0.8912, 'accuracy_baseline': 0.8932, 'auc': 0.7101819, 'auc_precision_recall': 0.2830853, 'average_loss': 0.30982444, 'label/mean': 0.1068, 'loss': 0.31013006, 'precision': 0.4537037, 'prediction/mean': 0.11840516, 'recall': 0.0917603, 'global_step': 17600}
Does the column I want to estimate need to be a column of binaries for this model?
Is it a bad idea to mix data types like this? Would it be necessary to normalize the data using something like preprocessing.Normalization?
Should I alter the epochs/batch if I want to use more data?
The accuracy seems high, but the loss also seems quite high; why is that?
Any other suggestions?
Thanks for any help or advice.
Here are answers to your questions.
By default tf.estimator.LinearClassifier assumes binary classification with n_classes=2, but you can set more than 2 classes as well.
For a linear classifier, normalizing the data won't change accuracy much; a non-linear classifier typically sees a bigger accuracy change after normalizing the same data.
Watch how accuracy and loss change as training proceeds: if they don't change much for about 5-10 epochs, you can cap the number of epochs there. You can then repeat the same check while varying the batch size.
Accuracy and loss do not move in lockstep. In your case of classifying 0 vs. 1, a model that always predicts 0.51 for the true class has the same accuracy as one that predicts 0.99, but a much higher loss; good accuracy together with high loss means the model made large errors on a few examples (see the quick check below).
The best approach is to tune your hyperparameters based on several such observations and to feed quality, preprocessed data; additional data, when available, is always good to have.
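To make the accuracy-versus-loss point concrete, here is a quick sketch with made-up predictions showing two models with identical accuracy but very different log loss:
# Made-up numbers: same accuracy, very different loss.
from sklearn.metrics import accuracy_score, log_loss

y_true = [1, 1, 1, 1]
p_confident = [0.99, 0.99, 0.99, 0.99]   # predicts the true class with p=0.99
p_hesitant = [0.51, 0.51, 0.51, 0.51]    # predicts the true class with p=0.51

for p in (p_confident, p_hesitant):
    preds = [int(pi > 0.5) for pi in p]
    print(accuracy_score(y_true, preds), log_loss(y_true, p, labels=[0, 1]))
# Both report accuracy 1.0, but log loss ~0.01 vs ~0.67.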
I am trying to run a random forest on a highly unbalanced sample. There are issues with both the sample weights and the class weights. However, when I follow the sklearn documentation to include the appropriate weights, I still get highly unbalanced predictions. For example, I have class weights of
{'A': 0.05555555555555555, 'B': 1.0, 'C': 1.0}
This should reweight the data to be about 60% A, 25% B, 15% C. However, when I run the model with weights, I get more or less the same fitted class probabilities. I also tried the "balanced" option just to test, and I still got highly skewed results (predicted probabilities close to 1 for every observation of A). I've tried this with and without the sample weights, and with and without the class weights, and I get more or less the same results each time. Am I implementing this incorrectly?
# I tried both with and without class weights:
clf = RandomForestClassifier(n_estimators=1000, class_weight=class_weights)
clf = RandomForestClassifier(n_estimators=1000)

clf.fit(x, y, sample_weight=weights)
print("Accuracy: ", metrics.accuracy_score(y, clf.predict(x)))

new_arts = pd.DataFrame(data=clf.predict_proba(full_data_scaled),
                        columns=clf.classes_,
                        index=full_data_scaled.index.values)
The first thing you could check is the capacity of your classifier relative to your dataset. In both cases you use 1000 estimators, which can heavily overfit a small dataset.
Second, I assume you are splitting with the Gini criterion. You could check whether criterion='entropy' gives the same output.
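If it helps, here is a sketch of the kind of variations I would compare side by side (the estimator count and settings are illustrative only; x, y, weights and class_weights are the objects from your snippet):
from sklearn.ensemble import RandomForestClassifier

candidates = {
    'fewer_trees': RandomForestClassifier(n_estimators=100, class_weight=class_weights),
    'balanced': RandomForestClassifier(n_estimators=100, class_weight='balanced'),
    'entropy': RandomForestClassifier(n_estimators=100, criterion='entropy',
                                      class_weight=class_weights),
}
for name, clf in candidates.items():
    clf.fit(x, y, sample_weight=weights)
    print(name, clf.predict_proba(x).mean(axis=0))  # mean predicted probability per class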
I'm doing a multi-class prediction with XGBClassifier and get strange results for the probabilities (clearly not what I would expect, and very different from what I get with svm.SVC, for example).
Code:
clf = XGBClassifier(learning_rate=0.00005, objective='multi:softprob')
[...]
clf.fit(X, Y, eval_metric='mlogloss')
[...]
clf.predict_proba(data)
All the returned probabilities look strange (nearly uniform across the classes):
INFO:root:[[0.16740549 0.16724858 0.16669136 0.1662821 0.16619198 0.16618045]]
INFO:root:[[0.16658343 0.16709101 0.16700828 0.16666834 0.16638225 0.16626666]]
INFO:root:[[0.16706458 0.16723593 0.16682376 0.16645898 0.16622521 0.16619155]]
INFO:root:[[0.1670872 0.16725858 0.16679683 0.16641934 0.16624773 0.16619037]]
INFO:root:[[0.16655219 0.1669247 0.16697693 0.16680391 0.1664368 0.16630547]]
INFO:root:[[0.16774052 0.16720766 0.16651934 0.1662414 0.16615131 0.16613977]]
INFO:root:[[0.16740549 0.16724858 0.16669136 0.1662821 0.16619198 0.16618045]]
INFO:root:[[0.16658343 0.16709101 0.16700828 0.16666834 0.16638225 0.16626666]]
Any idea?
Thanks
To add to #bhaskarc's pertinent comment:
It seems your model is not learning, since it predicts nearly the same probability for every class.
One other reason for this might be that your learning rate is far too small.
Try changing it to something bigger and re-check the predictions, e.g.:
learning_rate=0.001
You can also try playing with other parameters (max_depth, n_estimators, gamma, ...).
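A sketch of the kind of change meant here (the exact values are placeholders to tune, and X, Y, data are the objects from your snippet):
from xgboost import XGBClassifier

clf = XGBClassifier(objective='multi:softprob',
                    learning_rate=0.1,   # much larger than 0.00005
                    n_estimators=300,
                    max_depth=6)
clf.fit(X, Y, eval_metric='mlogloss')
print(clf.predict_proba(data))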
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days, I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for every input, and that was the optimal solution it had found. (This is illustrated in the scatter plot below.)
(Scatter plot: a random sample of 50 data points from my testing set vs. what the network thinks they should be.)
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but three hidden layers with over 32 nodes each in between.
I then stripped out the excess layers and moved to just a single hidden layer with a couple of nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all still the same.
I've tried so many different things: different permutations of optimisers, learning rates and network configurations, and nothing helps. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split

    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)

    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])
def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
(Graph: training and validation loss curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01.)
(Same graph, but for the linear-regression model with a gradient descent optimiser.)
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That matters when two features have very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore): the feature with the larger range ends up dominating the scaled values.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), it also scales each column to unit variance (which makes an assumption about your data, but can potentially help, too).
To transform your data, use something along these lines:
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your data:
Either scale training and test data together and then split afterwards,
or fit the scaler on the training data only and then use that same scaler to transform your test data (sketched below).
Never fit_transform your test set separately from your training data!
Otherwise you have potentially different mean/min/max values and can end up with totally wrong predictions. In a sense, the StandardScaler is your definition of the "data source distribution", which is inherently still the same for your test set, even though the test set is a subset that may not exactly follow the same properties (due to small sample size, etc.).
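In code, the second option (fit on the training split only, then reuse the scaler) might look like this, where X_train and X_test stand in for your raw feature arrays:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuses the same mean/std for the test set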
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum (0.9 is a good rule-of-thumb choice in practice) for your SGD.
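A minimal sketch of both alternatives, mirroring the tf.train API already used above (the values are just common starting points, not recommendations):
# SGD with momentum, or Adam, as drop-in replacements for GradientDescentOptimizer.
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
# or
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
self.model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])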
Turns out the error was a really stupid and easy-to-miss bug.
When I import my dataset I shuffle it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being paired with a completely random feature set, and of course the model didn't know what to do with that.
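For anyone hitting the same thing, the fix is to shuffle features and labels with the same permutation, something like this (a sketch, since my actual loading code isn't shown here; features and labels stand in for my arrays):
import numpy as np

# One permutation, applied to both arrays, keeps each label paired with its features.
idx = np.random.permutation(len(features))
features, labels = features[idx], labels[idx]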
Thanks to #dennlinger for suggesting I look in the place where I eventually found this bug.