Prediction using RandomForest Algorithm, Sensitivity and Specificity - python

I'm new to ML and developing a simple Prediction model using Random Forest algorithm. I'm also evaluating my model using Confusion Matrix and computing the Sensitivity and Specificity scores of my built model.
While I conceptually understand these two, I'm now stuck at a point where my model always produces a sensitivity score of 1 and a specificity score of 0. It gets an accuracy score of 91% (91/100 training samples are predicted correctly).
Now what do I infer from this? Is my model too sensitive and not specific? Should I understand that my model will tend to emit false positives and never false negatives? (This may be good in my scenario, as the domain I'm dealing with is fraud prediction, where false positives are preferable.)
As a side note, in a typical ML workflow, how are the sensitivity and specificity scores used in practice? Are they used to set some sort of threshold on the client systems, so that predictions are taken with a pinch of salt? Which parameters can I tune to get different values for these? At the moment I have only tweaked the Random Forest variables, such as the number of trees and the number of nodes, but these scores always stay the same.

I would argue that the best inference you can make from your specificity and sensitivity numbers is that your model is misspecified, that there is a data error, or that some other non-parameter aspect of your data or code is causing the problem.
As you know (using TP = true positives, FP = false positives, TN = true negatives, FN = false negatives):
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
To get sensitivity = 1, you need TP = TP + FN, which means FN = 0 (and TP > 0): the model produces no false negatives.
Additionally, a specificity of 0 implies that TN = 0 (given that FP > 0): the model produces no true negatives.
Taken together, FN = 0 and TN = 0 mean the model is predicting every sample as positive. Treating the counts as fractions of the total, an accuracy of 0.91 then forces TP = 0.91 and FP = 0.09 (since the denominator must equal 1).
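To make the arithmetic concrete, here is a minimal sketch (the y_true/y_pred arrays are a made-up 91%-positive example, not your data) that reproduces this exact pattern with sklearn's confusion_matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1] * 91 + [0] * 9)        # hypothetical labels: 91 positives, 9 negatives
y_pred = np.ones(100, dtype=int)             # a model that predicts everything as positive

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # 1.0  (FN = 0)
specificity = tn / (tn + fp)                 # 0.0  (TN = 0)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 0.91
print(sensitivity, specificity, accuracy)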
One possibility, mentioned by @desertnaut, is that you have a serious class imbalance issue. That is, the minority class (the one you want to predict) is a very, very small fraction of the total number of observations in your data. To look into this, simply plot the distribution of your target. If you see that the minority class is less than, say, 10% of the total (in general, one category dominating the other 10:1), then you're likely to start seeing problems in your model.
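For instance, a quick way to check the target distribution (a sketch; y here is a hypothetical pandas Series standing in for your target column):

import pandas as pd
import matplotlib.pyplot as plt

y = pd.Series([0] * 91 + [1] * 9)            # hypothetical target with a 91:9 imbalance
print(y.value_counts(normalize=True))        # fraction of samples in each class
y.value_counts().plot(kind="bar")            # quick visual of the imbalance
plt.show()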
Here's a good tutorial on dealing with imbalanced data.
And here's a paper if you're more technically minded and want to confront an imbalance in a more academic way.
That said, I would also look at the code you have for your model. Check into how you're separating training and testing:
Is the balance of classes in the training set representative of the balance in the whole set?
How are you doing cross-validation, if at all? Does changing the method give you different results for spec/sens? (A minimal sketch of both checks follows this list.)
Are the data types for the categories correct throughout (and for the rest of the variables, for that matter)? Since RF can deal with categorical variables it might not throw an error if you have what you expect to be an int coded as a string (just to pick an example), which might produce odd results.
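A minimal sketch of the first two checks, using a toy imbalanced dataset from make_classification in place of your own X and y (all names and numbers here are illustrative):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # ~90:10 imbalance

# stratify=y keeps the class balance of the full set in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# per-fold recall (= sensitivity); a large spread across folds is itself a warning sign
scores = cross_val_score(rf, X, y, cv=StratifiedKFold(n_splits=5), scoring="recall")
print(scores)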
Seeing results with perfect scores (1/0) is usually a sign that something other than just the model parameters (or hyperparameters) is incorrectly specified. Not that it can't happen; it just rarely does.

Related

How are the votes of individual trees calculated for Random Forest and Extra Trees in Sklearn?

I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against Sklearn's implementation of XT, but I constantly get different results. At first I thought there must be a bug in my code, but now I realize it's not a bug, just a different method of calculating votes amongst the trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of data. For example, if we are traversing a tree and find ourselves at a leaf node that has 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at Sklearn's documentation for XT (As seen here), I read the following line in regards to the predict method
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here. How are these weights calculated at a high level, what are the mathematics behind it? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier
from sklearn.ensemble import ExtraTreesClassifier

classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5, random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification
proba = self.predict_proba(X)

if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of transactions by taking the argmax of proba. What I fail to understand is how this proba value is made in the first place. I believe that the predict_proba method that predict uses is defined here at line 650-ish. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)

# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)

# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba, lock)
    for e in self.estimators_)

for proba in all_proba:
    proba /= len(self.estimators_)

if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely random trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and then the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the probability estimates, followed by the division by the number of estimators.
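To see the soft-voting step end to end, here is a small sketch (toy data from make_classification, not your transactions) that reproduces predict by averaging the per-tree probability estimates manually:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = ExtraTreesClassifier(n_estimators=5, max_depth=3, random_state=0).fit(X, y)

# each tree returns leaf-proportion probabilities; the ensemble averages them
mean_proba = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
manual_pred = clf.classes_.take(np.argmax(mean_proba, axis=1), axis=0)

print(np.array_equal(manual_pred, clf.predict(X)))   # expected: True

If you want hard (majority) voting instead, you can take the mode of each tree's predict output yourself; sklearn's forest classifiers do not expose a switch for that.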

Likelihood Ratio Test and pseudo Rsquared differ significantly (Logistic Regression)

I apply a logistic regression and I would like to test the statistical significance of my overall model.
Now, McFadden's pseudo-R-squared, Rsquared = 1 - lnL(c)/lnL(null), is meant to capture how much the model explains, where lnL(c) denotes the maximized log-likelihood of the fitted model and lnL(null) the corresponding value for the null model (no covariates, only an intercept).
The likelihood ratio test statistic is LR = 2 * (lnL(c) - lnL(null)), which follows a chi-squared distribution and can be tested for significance according to the model's degrees of freedom.
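Written as code, the two quantities above look like this (a sketch assuming ll_model and ll_null hold the maximized log-likelihoods, e.g. result.llf and result.llnull from a statsmodels Logit fit):

from scipy import stats

def mcfadden_r2(ll_model, ll_null):
    return 1.0 - ll_model / ll_null

def lr_test_pvalue(ll_model, ll_null, df):
    lr = 2.0 * (ll_model - ll_null)   # LR statistic, chi-squared with df = number of covariates
    return stats.chi2.sf(lr, df)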
Anyway, I use the chi-squared distribution to calculate a p-value, which is highly significant, but the pseudo-R-squared is only around 0.021.
Why do the R-squared and the overall p-value differ so much?
Using an accuracy calculation on some test data, metrics.accuracy_score(y_test, y_pred), I see that the accuracy on the test data is only around 55% (for the training data it's around 60%).
Can someone help me to interpret my results?
Maybe there is a correlation which is significant, but the impact is still small. Since you are doing classification, you could check whether examples with this variable equal to 1 (in the binary case) have a slightly higher or lower probability of belonging to class 1 than those with the variable equal to 0. For example:
examples with the variable equal to 1 have a 50% chance of belonging to class 1, while examples with the variable equal to 0 have a 48% chance of belonging to class 1.
If there are lots of examples with that variable, the effect may still be significant (the p-value), but the variable alone will hardly predict the right class (explain the variance, i.e. the R-squared).
This might be the reference which could help you understand this graphically for another problem: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values
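Here is a small synthetic sketch of exactly that situation (made-up data, statsmodels used for convenience): a weak but real effect, fitted on many observations, gives a tiny McFadden pseudo-R-squared together with a very small LR-test p-value:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-0.1 * x))           # weak true effect on the log-odds
y = rng.binomial(1, p)

result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(result.prsquared)                      # McFadden pseudo R^2: close to 0
print(result.llr_pvalue)                     # LR-test p-value: typically highly significant here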

Can I set feature priors in sklearn Bayesian classifier?

I have done some simple Bayesian classification
from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0]]   ### there are more data of course
Y = [1, 0]
classifier = BernoulliNB()
classifier.fit(X, Y)
Now I have got some "insider tips" that the first element in every X is more important than the others.
Can I incorporate this knowledge before I train the model please?
If sklearn doesn't allow it, is there any other classifier or other library that allows us to incorporate our prior before model training please?
I do not know the answer to question 2, but I can answer question 1.
The approach suggested in the comments, multiplying the first element of each observation by different values, is wrong.
When you are using BernoulliNB or a binomial model, the way to incorporate prior knowledge is to add it to the sample (data).
Let's say you are flipping a coin and you know that it is rigged towards heads. Then you add more samples that show heads: if your prior knowledge says 70% heads and 30% tails, you can add 100 samples in total, 70 heads and 30 tails, to your data X.
Think about what the algorithm is actually doing. Naive Bayes performs the following classification:
p(class = k | data) ~ p(class = k) * p(data | class = k)
In words: The (posterior) probability of an observation being in class k is proportional to the probability of any observation being in class k (that's the prior) times the probability of seeing the observation, given it came from class k (the likelihood).
Usually when we don't know anything, we assume that p(class = k) just reflects the distribution of the observed data.
In your case, you're saying that you have some information, in addition to the observed data, that leads you to believe that the prior, p(class = k) should be amended. This is perfectly legitimate. In fact, that's the beauty of Bayesian inference. Whatever your prior knowledge is, you should incorporate that into this term. So in your case, perhaps that's increasing the probability of being in a particular class (i.e. increasing its weight as suggested in the comments), if you know that it's more likely to occur than the data suggests.
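As a concrete sketch of adjusting the class prior (note: this is the class prior p(class = k), not a per-feature weight; the data and numbers below are made up), sklearn's BernoulliNB accepts a class_prior argument:

from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]   # toy data
Y = [1, 0, 1, 0]

# Prior belief: class 1 is more likely than the training data suggests.
# class_prior is ordered like classifier.classes_ (here: [0, 1]).
classifier = BernoulliNB(class_prior=[0.3, 0.7])
classifier.fit(X, Y)
print(classifier.predict_proba([[1, 0, 0]]))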

random forest with specified false positive and sensitivity

Using the randomForest package in R, I was able to train a random forest that minimized the overall error rate. However, what I want to do is train two random forests: one that first minimizes the false positive rate (~0) and then the overall error rate, and one that first maximizes sensitivity (~1) and then minimizes the overall error. Another way of framing the problem: given a false positive rate and a sensitivity rate, train two different random forests that each satisfy one of the rates, and then minimize the overall error rate. Does anyone know if there's an R package or Python package, or any other software out there, that does this, and/or how to do it? Thanks for the help.
This is a workaround that may be worth trying. (Sorry that I do not have enough reputation to put it as a comment.)
As
sensitivity = TP/(TP + FN)
specificity = TN/(TN + FP)
ACC = (TP + TN)/(TP + FN + TN + FP) = 1 - ER (overall error rate)
(Notations from Sensitivity_and_specificity)
If you duplicate some positive/negative samples (or increase their weights), the accuracy ACC (and with it the overall error rate) will approximate the sensitivity/specificity.
So if you want to maximize sensitivity, you can sample/duplicate some positive samples in the dataset and then train your RF on it. To maximize specificity, do the same with negative samples.
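A toy sketch of that duplication step (the arrays and the duplication factor are made up): repeat the positive rows before training, so that overall accuracy is dominated by performance on the positives.

import numpy as np

X = np.array([[0.1], [0.4], [0.35], [0.8]])   # toy feature matrix
y = np.array([0, 0, 1, 1])

dup = 10                                      # extra copies of each positive sample
pos = y == 1
X_aug = np.vstack([X, np.repeat(X[pos], dup, axis=0)])
y_aug = np.hstack([y, np.repeat(y[pos], dup)])
# train the random forest on (X_aug, y_aug) instead of (X, y)
print(X_aug.shape, np.bincount(y_aug))        # positives now dominate the training set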
You can also do a grid search over the 'regularization' parameters to best match your target behavior (see the sketch after the list below).
Parameters of interest:
max depth
number of features
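In Python (sklearn rather than R's randomForest, so treat this only as a sketch), the weighting idea and the grid search over these parameters can be combined; class_weight plays the role of the sample weights mentioned above, and recall is used as the sensitivity target:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)  # stand-in for your data

param_grid = {
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", 0.5],
    "class_weight": [{0: 1, 1: w} for w in (1, 5, 10)],   # up-weight the positive class
}
search = GridSearchCV(RandomForestClassifier(n_estimators=100, random_state=0),
                      param_grid, scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)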

sigmoid output for detection class returns incorrect performance

Summary of my problem: I have a detection task (binary classification, unbalanced problem). I use a sigmoid to classify samples. The reported f-score, precision and recall seem to consider both classes, e.g. true positives seem to be the total number of correctly classified samples, and not the total number of samples belonging to class '1' that are correctly classified.
Longer explanation: In my experiment I have demographic data about persons and I have to predict whether they bought a product or not. I used PCA to reduce the initial features to just 4 features, and the data is stored in a csv file (the first column has the class labels, '0' and '1'). Note that most people didn't buy, so the two classes are very unbalanced. I use the CSVDataset class to read it:
dataset: &train !obj:pylearn2.datasets.csv_dataset.CSVDataset {
    path: 'input.csv',
    task: 'classification'
}
I want to start with a simple classification model and I use f-score as performance measure. Therefore, my first idea was to use a MLP model with a single sigmoid layer (default monitor 'detection' provides recall, precision, f-score):
model: !obj:pylearn2.models.mlp.MLP {
    layers: [
        !obj:pylearn2.models.mlp.Sigmoid {
            layer_name: 'y',
            dim: 2,
            irange: .005
        }
    ],
    nvis: 4,
}
My initial idea was to set dim to 1 (the decision rule would be: if the output > 0.5 choose class '1', if < 0.5 choose class '0'). However, I got the error ValueError: Can't convert to VectorSpace of dim 1. Expected either dim=2 (merged one-hots) or 2 (concatenated one-hots), and so I decided to set dim to 2 (the decision rule would then be: if out1 > out0 choose '1', otherwise choose '0').
In my train.yaml I follow more or less the softmax example notebook provided in the documentation. For example, I use BGD algorithm and set the batch_size as the total number of examples in the training set (74164 examples, a small dataset!) just to avoid confusion when checking the performance manually.
The model was trained with the train.py script that is provided and everything seemed fine, until I had a look at the results. As mentioned earlier, it is a detection problem where the class to detect ('1') happens very rarely. Therefore, I was very surprised to see high values for the reported train_y_f1 (best result is approx. 94%, after one epoch).
To check this, I computed the f-score manually using the provided script predict_csv.py and then loading the predictions. I saw that in fact there were only misses (all '1' were classified as '0'), so precision, recall and f-score should be all zero. Why does the detection monitor report higher values?
After some investigation, I found that the MLP has an output for each class, and I verified (I computed it manually and got the same numbers) that the true positives and false positives defined in get_detection_channels_from_state() actually refer to both classes, '1' and '0': e.g. true positives is the number of vectors belonging to '1' classified as '1' plus the number of vectors belonging to '0' classified as '0'. So the MLP is classifying everything as '0', and since nearly all vectors belong to '0', the performance looks good. This is a known problem for unbalanced detection problems, where the correct classification rate is not a suitable measure, and it is the reason why we have measures such as the f-score or AUC. However, if tp and fp in get_detection_channels_from_state() consider both classes, then the reported f-score is not useful (not to me at least).
I can imagine that this is known to the designer of the Sigmoid class, so I can only assume that I am doing something wrong. Hopefully somebody can give me a hint :)
Note: I have submitted this question to the pylearn2 user mailing list. If I get an answer I will copy it here...
The pylearn monitor calculates the f1 score, % misclass, etc., for each batch, not for the entire epoch. When it generates the report, the f1 score is the mean of the f1 scores for all the batches in the epoch. Reporting the mean over all the batches works just fine when you look at quantities like misclass:
misclass[n] is the score for the nth batch
misclass_epoch = mean(misclass[0], misclass[1], ..., misclass[n])
however, you can't construct the same statement for the f1 score:
f1_epoch != mean(f1[0], f1[1], ..., f1[n])
where f1[n] = 2*precision[n]*recall[n]/(precision[n] + recall[n])
For demonstration purposes, try setting the batch size to be the size of the data set (you can get away with this in the mnist example). The f1 score will then be correct.
So the best advice is to keep an eye on quantities in the monitor like misclass, where the mean over the batches is the same as the value for the epoch. Once you've trained the NN, you can then make predictions for your entire validation set and calculate the f1 score at that point.
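A small numeric sketch (made-up batches) of why the mean of per-batch f1 scores differs from the f1 score computed over the whole epoch:

import numpy as np
from sklearn.metrics import f1_score

y_true_batches = [np.array([1, 1, 0, 0]), np.array([1, 0, 0, 0])]
y_pred_batches = [np.array([1, 0, 0, 0]), np.array([1, 1, 1, 0])]

per_batch = [f1_score(t, p) for t, p in zip(y_true_batches, y_pred_batches)]
print(np.mean(per_batch))                       # mean of the batch F1 scores (~0.583)
print(f1_score(np.concatenate(y_true_batches),
               np.concatenate(y_pred_batches))) # F1 over the full epoch (~0.571)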
