XGBoost - Strange results with XGBClassifier predict_proba (python)

I'm doing a multi-class prediction with XGBClassifier and get strange results for the probabilities (clearly not what one would expect, and very different from those returned by SVM.SVC, for example).
Code:
clf = XGBClassifier(learning_rate=0.00005, objective='multi:softprob')
[...]
clf.fit(X, Y, eval_metric='mlogloss')
[...]
clf.predict_proba(data)
All of the returned probabilities look strange; they are essentially uniform across the six classes:
INFO:root:[[0.16740549 0.16724858 0.16669136 0.1662821 0.16619198 0.16618045]]
INFO:root:[[0.16658343 0.16709101 0.16700828 0.16666834 0.16638225 0.16626666]]
INFO:root:[[0.16706458 0.16723593 0.16682376 0.16645898 0.16622521 0.16619155]]
INFO:root:[[0.1670872 0.16725858 0.16679683 0.16641934 0.16624773 0.16619037]]
INFO:root:[[0.16655219 0.1669247 0.16697693 0.16680391 0.1664368 0.16630547]]
INFO:root:[[0.16774052 0.16720766 0.16651934 0.1662414 0.16615131 0.16613977]]
INFO:root:[[0.16740549 0.16724858 0.16669136 0.1662821 0.16619198 0.16618045]]
INFO:root:[[0.16658343 0.16709101 0.16700828 0.16666834 0.16638225 0.16626666]]
Any idea?
Thanks

To add to #bhaskarc's pertinent comment.
It seems your model is not learning since it's predicting the same probability for all classes.
One other reason for this might be that your learning rate is too small.
Try changing it to something bigger and re-check the predictions:
learning_rate=0.001
You can also try playing with the other parameters (max_depth, n_estimators, gamma, ...).
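For instance, a minimal sketch of retraining with a larger learning rate; the placeholder data and the specific values below are illustrative assumptions, not settings from the original question:
from xgboost import XGBClassifier
import numpy as np

# placeholder 6-class data standing in for the real X, Y
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(300, 8)), rng.integers(0, 6, size=300)

clf = XGBClassifier(learning_rate=0.1,          # much larger than 0.00005
                    n_estimators=200,
                    max_depth=6,
                    objective='multi:softprob')
clf.fit(X, Y)
print(clf.predict_proba(X[:3]))  # rows should now differ clearly across classes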

Related

Binary classification prediction confidence 2: Electric Boogaloo

I've been working on an XGBoost classifier with 5 classes, for which I used to get the confidence of each prediction from the predict_proba() function (the confidences summed to 1, easy enough). Now that I've switched to binary classification, predict_proba() returns (e.g.) this: [[-1.0231217 2.1702857]]
I am extremely confused as to what exactly is being measured in this matrix, or whether I've done something wrong when training the model. How would I go about finding the prediction confidence for each prediction?
My model parameters, in case it's something obvious I've missed during training:
xgb1 = XGBClassifier(learning_rate=0.0125,
                     n_estimators=1000,
                     # n_jobs=-1,
                     max_depth=6,
                     min_child_weight=1,
                     gamma=0.1,
                     subsample=0.7,
                     colsample_bytree=0.6,
                     objective='multi:softmax',
                     nthread=4,
                     num_class=2,
                     seed=42,
                     use_label_encoder=False,
                     eval_metric='mlogloss',
                     tree_method='gpu_hist',
                     gpu_id=0)
One thing I noticed is that you can change the objective to 'binary:logistic' for a two-class problem. You can try it, but I'm not sure whether this is the reason.
It would also help if you showed how you call predict_proba(). To output probabilities rather than raw margins, the output_margin argument of predict_proba() should be False.
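A minimal sketch of the suggested change, using placeholder data; the hyperparameters are copied loosely from the question and are not recommendations:
from xgboost import XGBClassifier
import numpy as np

# placeholder binary-labelled data standing in for the real training set
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 10)), rng.integers(0, 2, size=200)
X_test = rng.normal(size=(5, 10))

# binary objective instead of 'multi:softmax'; num_class is not needed here
xgb1 = XGBClassifier(learning_rate=0.0125,
                     n_estimators=100,
                     max_depth=6,
                     objective='binary:logistic')
xgb1.fit(X_train, y_train)
probs = xgb1.predict_proba(X_test)  # each row is [P(class 0), P(class 1)] and sums to 1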

Pytorch Linear Regression wrong result on particular dataset

I'm using pytorch to implement a simple linear regression model.
The code works perfectly for randomly created datasets, but when it comes to the dataset I wanted to train, it gives significantly wrong results.
Here is the code:
import torch

# 'data' is the dataset loaded elsewhere in the notebook
x = torch.linspace(1, 100, steps=100)
learn_rate = 0.000001
x_train = x[:100]
x_test = x[100:]
y_train = data[:100]
y_test = data[100:]
# y_train = -0.01*x_train + torch.randn(100)*10  # code for generating random data
w = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)
for i in range(1000):
    loss = torch.mean((y_train - (w*x_train + b))**2)
    if i % 100 == 0:
        print(loss)
    loss.backward()
    w.data.add_(-w.grad.data*learn_rate)
    b.data.add_(-b.grad.data*learn_rate)
    w.grad.data.zero_()
    b.grad.data.zero_()
The result it gives makes no sense.
However, when I used a randomly generated dataset, it works perfectly:
The dataset actually looks similar. I am not sure of the reason for the inaccuracy of this model.
Code for plotting data:
plt.plot(x_train.numpy(),y_train.numpy())
plt.plot(x_train.numpy(),(w*x_train+b).data.numpy())
plt.show()
--
Now the problem seems to be that the weight converges much faster than the bias. At the current learning rate, the bias will not converge to its optimum. However, if I increase the learning rate even slightly, the weight simply diverges. I have to set two learning rates.
However, I'm wondering whether setting different learning rates is the best solution for such a simple model, because I've found that not many models actually use different learning rates for different parameters.
Your code seems to be correct, but your model converges more slowly when there is a large bias in your data (because it has to update the bias parameter many times before reaching the correct value).
You could try running it for more iterations or increasing the learning rate.
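If you do end up wanting separate learning rates for the weight and the bias, as discussed above, one option is optimizer parameter groups; a minimal sketch with placeholder data (the specific rates are illustrative assumptions):
import torch

# placeholder data standing in for the real x_train / y_train
x_train = torch.linspace(1, 100, steps=100)
y_train = -0.01 * x_train + 50 + torch.randn(100)

w = torch.rand(1, requires_grad=True)
b = torch.rand(1, requires_grad=True)

# one parameter group per tensor, each with its own learning rate
optimizer = torch.optim.SGD([{'params': [w], 'lr': 1e-6},    # small step for the weight
                             {'params': [b], 'lr': 1e-2}],   # larger step so the bias can catch up
                            lr=1e-6)                         # default, overridden per group

for i in range(1000):
    optimizer.zero_grad()
    loss = torch.mean((y_train - (w * x_train + b))**2)
    loss.backward()
    optimizer.step()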

The size of test data does not fit into model (python)

I have a problem with testing my model. When I train my model, it works well; however, when I try to put the test data into the model, it gives an error saying the sizes do not fit, which I expected. I have split my data into 70% training and 30% testing. I understand why this happens, yet I couldn't solve it.
net = Net(n_feature=244, n_hidden=10, n_output=244)
print(net)
optimizer = torch.optim.SGD(net.parameters(), lr=0.2)
loss_func = torch.nn.MSELoss()
There is also some more code here:
def test():
    Xtest = torch.FloatTensor(X_test)
    ypred_test = net(Xtest)
    plt.scatter(Xtest[:100], y_test[:100])
    plt.plot(Xtest.detach().numpy()[:100], ypred_test.detach().numpy()[:100], "red")
    plt.xlabel("X")
    plt.ylabel("Y")
    plt.show()
I am using a Jupyter notebook, that's why it is a bit messy. Any help would be great. Thanks in advance!
The error can be seen in the attached image.
Sure, the data has a size of (349, 2). I used train_test_split to split the data into 30% test and 70% train (244 data pairs for training and 105 data pairs for testing). There are two features, x and y. I am trying to build a neural network with PyTorch to do regression. Then, I want to do an HSIC test for the causality. #AbhishekKumar
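For context, a minimal sketch of the split described above, with placeholder data of shape (349, 2); the variable names are assumptions, not from the original notebook:
import numpy as np
from sklearn.model_selection import train_test_split

data = np.random.rand(349, 2)          # placeholder: 349 (x, y) pairs
X, y = data[:, :1], data[:, 1]

# 70/30 split as described: 244 training pairs and 105 test pairs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)     # (244, 1) (105, 1)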

Using sample and class weights in sklearn

I am trying to run a random forest on a highly unbalanced sample. There are issues both with the sample weights and the class weights. However, when I use the sklearn documentation to include the appropriate weights, I still get highly unbalanced predictions. For example, I have class weights of
{'A': 0.05555555555555555, 'B': 1.0, 'C': 1.0}
This should reweight the data to be about 60% A, 25% B, 15% C. However, when I run the model with weights, I get more or less the same results on the fitted class probabilities. I also tried doing the "balanced" option just to test, and I still got highly skewed results (predicting probabilities close to 1 for every observation of A). And I've tried this with and without the sample weights and with and without the class weights and I get more or less the same results. Am I implementing this incorrectly?
clf = RandomForestClassifier(n_estimators=1000, class_weight=class_weights)
clf = RandomForestClassifier(n_estimators=1000)
clf.fit(x, y, sample_weight=weights)
print("Accuracy: ", metrics.accuracy_score(y, clf.predict(x)))
new_arts = pd.DataFrame(data=clf.predict_proba(full_data_scaled),
                        columns=clf.classes_,
                        index=full_data_scaled.index.values)
The first thing you could check is the complexity of your classifier relative to your dataset: in both cases you use 1000 estimators, which can overfit heavily on a small dataset.
Second, I am assuming you use the Gini criterion to split. Maybe you can check whether the 'entropy' criterion gives the same output.
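A minimal sketch of those two checks; the placeholder data and the reduced n_estimators value are illustrative assumptions, not tuned settings:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# placeholder imbalanced data standing in for the real x, y, and sample weights
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 5))
y = np.array(['A'] * 240 + ['B'] * 40 + ['C'] * 20)
weights = np.ones(len(y))

clf = RandomForestClassifier(n_estimators=100,         # fewer trees to limit overfitting
                             class_weight='balanced',  # reweight classes inversely to frequency
                             criterion='entropy')      # sanity check against the Gini default
clf.fit(x, y, sample_weight=weights)
print("Accuracy:", metrics.accuracy_score(y, clf.predict(x)))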

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all still the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the feature with the larger range dominates and the other one is squashed into a tiny interval.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also gives unit variance (which is some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either scale the training and test data together and then split afterwards,
or fit the scaler on the training data only and then use the same scaler to transform your test data (see the sketch below).
Never fit_transform your test set separately from the training data!
Since you would potentially get different mean/min/max values, you could end up with totally wrong predictions. In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset that does not exactly follow the same properties (due to small sample size etc.).
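A minimal sketch of the second option, assuming the data has already been split into X_train and X_test (placeholder arrays below, not from the original post):
from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1.0, 40.0], [2.0, 80.0], [3.0, 120.0]])  # placeholder training features
X_test = np.array([[2.0, 60.0]])                              # placeholder test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/variance for the test set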
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
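For example, a minimal sketch of compiling a small stand-in model with momentum-SGD (or Adam) via the tf.keras optimizer API; exact argument names and import paths depend on your TensorFlow/Keras version:
from tensorflow import keras

# stand-in for the single-layer model defined in the question
inp = keras.Input(shape=(5,))
out = keras.layers.Dense(1)(inp)
model = keras.Model(inp, out)

optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)  # or keras.optimizers.Adam(0.001)
model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])     # mirrors the compile call from the question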
Turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset, I shuffled it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with it.
Thanks to #dennlinger for suggesting that I look in the place where I eventually found this bug.
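For reference, a minimal sketch of shuffling features and labels with a single shared permutation so they stay aligned (the array names are placeholders, not from the original code):
import numpy as np

# placeholder arrays standing in for the real dataset
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

rng = np.random.default_rng(seed=0)
perm = rng.permutation(len(features))   # one permutation shared by both arrays
features_shuffled = features[perm]
labels_shuffled = labels[perm]          # each label stays with its feature row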
