Random forest with specified false positive rate and sensitivity - Python

Using the randomForest package in R, I was able to train a random forest that minimized the overall error rate. However, what I want to do is train two random forests: one that first minimizes the false positive rate (~0) and then the overall error rate, and one that first maximizes sensitivity (~1) and then minimizes the overall error. Another formulation of the problem would be: given a target false positive rate and a target sensitivity, train two different random forests, each satisfying one of the targets, and then minimize the overall error rate. Does anyone know if there's an R package, Python package, or any other software out there that does this, and/or how to do it? Thanks for the help.

This is a workaround that may be worth trying. (Sorry that I do not have enough reputation to put it as a comment.)
As
sensitivity = TP/(TP + FN)
specificity = TN/(TN + FP)
accuracy = (TP + TN)/(TP + FN + TN + FP), i.e. 1 minus the overall error rate
(notation from the Wikipedia article on sensitivity and specificity),
if you duplicate some positive/negative samples (or increase their weights), the overall accuracy will approximate the sensitivity/specificity.
So if you want to maximize sensitivity, you can duplicate (oversample) some positive samples in the dataset and then train your RF on it. To maximize specificity, do the same thing with the negative samples.
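As a minimal sketch of that idea in Python (assuming scikit-learn rather than the R randomForest package), you can bias the forest toward sensitivity or specificity with class_weight instead of physically duplicating rows; the weights below are illustrative, not tuned:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic data just to make the sketch runnable.
X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Forest biased toward sensitivity: positives weighted heavily, so the
# error being minimized is dominated by false negatives.
rf_sensitive = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 1, 1: 10},  # illustrative weights, tune for your data
    random_state=0,
).fit(X_train, y_train)

# Forest biased toward a low false positive rate: negatives weighted heavily.
rf_specific = RandomForestClassifier(
    n_estimators=500,
    class_weight={0: 10, 1: 1},  # illustrative weights, tune for your data
    random_state=0,
).fit(X_train, y_train)

# Inspect the trade-off on held-out data.
for name, model in [("sensitive", rf_sensitive), ("specific", rf_specific)]:
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    print(name, "sensitivity:", tp / (tp + fn), "FPR:", fp / (fp + tn))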

You can do a grid search over the 'regularization' parameters to best match your target behavior (see the sketch after this list).
Parameters of interest:
max depth
number of features
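A hedged sketch of that grid search with scikit-learn, scoring on recall (i.e. sensitivity) rather than accuracy; the parameter grid and the scorer are assumptions you would adapt to your own target rates:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=0)

# Search max_depth and max_features, scoring on recall (= sensitivity).
# Swap in a custom scorer if you need to hit a specific false positive rate instead.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "max_features": ["sqrt", 0.3, 0.6, None],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)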

Related

ConvergenceWarning: Regressors in active set degenerate

I am running various Regressions in Python with lots of variables. For a more sparse variable selection, I implemented a relaxed Lasso (https://relaxedlasso.readthedocs.io/en/latest/content.html#implementation).
The code runs fine, and I get a sparser variable selection and reasonable R² scores. The code is displayed below:
# alpha and theta taken from the fitted RelaxedLassoLarsCV
relaxed_lasso = RelaxedLassoLars(alpha=alpha_cv, theta=theta_cv).fit(X_train, y_train.values.ravel())
print("Training set score: {:.2f}".format(relaxed_lasso.score(X_train, y_train.values.ravel())))
print("Test set score: {:.2f}".format(relaxed_lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(relaxed_lasso.coef_ != 0)))
But I always get the following warning:
ConvergenceWarning: Regressors in active set degenerate. Dropping a regressor, after 10 iterations, i.e. alpha=1.165e-05, with an active set of 10 regressors, and the smallest cholesky pivot element being 1.825e-08. Reduce max_iter or increase eps parameters.
I would like to know what this means for the validity of my result. I couldn't find any meaningful explanation online. That's why I'm asking here.
Thanks!
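The warning is raised by scikit-learn's LARS path routine when two active regressors become nearly collinear and one of them is dropped. One hedged sanity check (my own suggestion, not part of the relaxed-lasso package) is to refit a plain LassoLars at the reported alpha with a larger eps and confirm that the selected support barely changes:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLars

# Synthetic data with a near-duplicate column, just to make the sketch runnable.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10, random_state=0)
X[:, 1] = X[:, 0] + 1e-6 * np.random.RandomState(0).randn(200)

# Fit at the same alpha with the default and with a larger eps, then compare
# which coefficients are non-zero. A stable support suggests the warning is benign.
default_fit = LassoLars(alpha=1.165e-05).fit(X, y)
looser_fit = LassoLars(alpha=1.165e-05, eps=1e-6).fit(X, y)

support_default = np.flatnonzero(default_fit.coef_)
support_looser = np.flatnonzero(looser_fit.coef_)
print("support unchanged:", np.array_equal(support_default, support_looser))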

Likelihood Ratio Test and pseudo Rsquared differ significantly (Logistic Regression)

I apply a logistic regression and I would like to test the statistical significance of my overall model.
Now, McFadden's pseudo-R-squared, Rsquared = 1 - L(c)/L(null), is often read as the share of variation explained by the model, where L(c) denotes the maximized log-likelihood of the fitted model and L(null) the corresponding value for the null model (intercept only, no covariates).
The likelihood ratio test statistic is LR = 2 * (L(c) - L(null)), which follows a Chi-squared distribution and can be tested for significance according to the model's degrees of freedom.
Anyway, I use the Chi-squared distribution to calculate a p-value, which is highly significant, but the pseudo-R-squared is only around 0.021?
Why do the R-squared and the overall p-value differ so much?
Using an accuracy calculation for some test data, metrics.accuracy_score(y_test, y_pred), I see that the accuracy on the test data is only around 55% (on the training data it's around 60%).
Can someone help me to interpret my results?
Maybe there is a correlation that is significant, but its impact is still small: since you are doing classification, you could check whether examples with this variable (= 1 in the binary case) have a slightly higher or lower probability of belonging to class 1 than those without it (= 0 in the binary case). For example:
examples with the variable being 1 have a 50% chance of belonging to class 1, while examples with the variable being 0 have a 48% chance of belonging to class 1.
If many examples with that variable exist, the effect may still be significant (p-value), but the variable alone will hardly predict the right class (explain the variance - R-squared).
This reference may help you understand this graphically for another problem: https://blog.minitab.com/blog/adventures-in-statistics-2/how-to-interpret-a-regression-model-with-low-r-squared-and-low-p-values
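For concreteness, here is a minimal sketch (assuming statsmodels, which the question does not mention) showing that the LR test p-value and McFadden's pseudo-R-squared are built from the same two log-likelihoods and can legitimately point in different directions when a weak effect is estimated on many observations:

import numpy as np
import statsmodels.api as sm

# Synthetic data: a weak but real effect over many observations.
rng = np.random.default_rng(0)
x = rng.normal(size=20000)
p = 1 / (1 + np.exp(-(0.05 * x)))  # tiny true coefficient
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=False)

print("McFadden pseudo-R2:", model.prsquared)  # 1 - llf/llnull, usually tiny here
print("LR statistic:", model.llr)              # 2 * (llf - llnull)
print("LR p-value:", model.llr_pvalue)         # can still be highly significant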

Difference between batch-average and global Fscore

I am facing a false positive reduction problem, and the ratio of positive to negative samples is approximately 1.7:1.
I learned from the answer that one can use precision, recall, F-score, or even weight true positives, false positives, true negatives, and false negatives differently depending on cost, to evaluate different models for a given classification task.
Since precision, recall, and F-score were removed from Keras, I found some ways to track those metrics during training, such as the GitHub repo keras-metrics.
Besides that, I also found other solutions that define precision like this:
from keras import backend as K

def precision(y_true, y_pred):
    """Precision metric.

    Only computes a batch-wise average of precision.

    Computes the precision, a metric for multi-label classification of
    how many selected items are relevant.
    """
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + K.epsilon())
    return precision
However, all of those methods track the metrics during training, and all of them state that they compute a batch-wise average rather than a global value.
I wonder how necessary it is to keep track of those metrics during training. Or should I just focus on the loss and accuracy during training, and afterwards evaluate all models with validation functions from a library such as scikit-learn, comparing the metrics computed globally?
In Keras, all training metrics are measured batch-wise.
To obtain a global metric, Keras will average these batch-metrics.
Something like sum(batch_metrics) / batches.
Since most metrics are mean values considering the "number of samples", doing that kind of averaging will not change the global value too much.
If samples % batch_size == 0, then we can say that:
sum(all_samples_metrics) / samples == sum(all_batch_metrics) / batches
But these specific metrics you are talking about are not divided by the "number of samples", but by the number of samples "that satisfy a condition". Thus, the divisor in each batch is different. Mathematically, the result of averaging the batch-metrics to obtain a global result will not reflect the true global result.
So, can we say that they're not good for training?
Well, no. They may be good for training. Sometimes "accuracy" is a terrible metric for a specific problem.
The key to use these metrics batch-wise is to have a batch size that is big enough to avoid too much variation in the divisors.
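A small sketch of that pitfall in plain NumPy (my own illustration, not Keras code): averaging per-batch precision values does not, in general, reproduce the precision computed over all samples at once, because each batch has a different number of predicted positives:

import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_pred = rng.integers(0, 2, size=1000)

def precision(t, p):
    predicted_positives = p.sum()
    return (t & p).sum() / predicted_positives if predicted_positives else 0.0

# Global precision over the whole set.
global_precision = precision(y_true, y_pred)

# Batch-wise precision, then averaged over batches (what Keras reports).
batch_size = 32
batch_values = [
    precision(y_true[i:i + batch_size], y_pred[i:i + batch_size])
    for i in range(0, len(y_true), batch_size)
]
print(global_precision, np.mean(batch_values))  # close, but generally not equal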

Prediction using RandomForest Algorithm, Sensitivity and Specificity

I'm new to ML and developing a simple prediction model using the random forest algorithm. I'm also evaluating my model with a confusion matrix and computing the sensitivity and specificity scores of the built model.
While I conceptually understand these two, I'm now stuck at a point where my model always produces a sensitivity of 1 and a specificity of 0. It gets an accuracy score of 91% (91 of 100 training samples are predicted correctly).
Now what do I infer from this? Is my model too sensitive and not specific? Should I conclude that my model is prone to emitting false positives but never false negatives? (This may be acceptable in my scenario, as the domain I'm dealing with is fraud prediction, where false positives are preferable.)
As a side note, in a typical ML workflow, how are sensitivity and specificity scores used in practice? Are they used as some sort of threshold in client systems so that predictions are taken with a pinch of salt? What parameters can I tune to get different values for these? At the moment I have only tweaked random forest variables such as the number of trees and the number of nodes, but these scores always stay the same.
I would argue that the best inference you can make from your specificity and sensitivity numbers is that your model is somehow misspecified, there is a data error, or some other non-parameter aspect of your data or code is causing this problem.
As you know (writing TP, FP, TN, FN for true/false positives and negatives):
Sensitivity = TP/(TP + FN)
Specificity = TN/(FP + TN)
Accuracy = (TP + TN) / (TP + FP + TN + FN)
To get sensitivity = 1, you need TP = TP + FN, which (since TP cannot be 0 here) means FN = 0: the model produces no false negatives.
Additionally, a specificity of 0 implies that TN = 0, regardless of what FP is. But here too I would look at how FP and TN are distributed.
If TN = 0, then accuracy = 0.9 means TP accounts for 90% of the samples, so FP and/or FN must be > 0 and together make up the remaining 10% (since the four counts sum to the total).
One possibility, mentioned by @desertnaut, is that you have a serious class imbalance issue. That is, the minority class (the one you want to predict) is a very small fraction of all observations in your data. To look into this, simply plot the distribution of your target. If you see that the minority class is less than, say, 10% of the total (in general, when one category dominates the other 10:1), then you are likely to start seeing problems like this in your model.
Here's a good tutorial on dealing with imbalanced data.
And here's a paper if you're more technically minded and want to confront an imbalance in a more academic way.
That said, I would also look at the code you have for your model. Check into how you're separating training and testing:
Is the balance of classes in the training set representative of the balance in the whole set?
How are you doing cross-validation, if at all? Does changing the method give you different results for spec/sens?
Are the data types for the categories correct throughout (and for the rest of the variables, for that matter)? Since RF can deal with categorical variables it might not throw an error if you have what you expect to be an int coded as a string (just to pick an example), which might produce odd results.
Seeing any results with perfect scores (1/0) is usually a sign that something other than just model parameters (or hyperparameters) is incorrectly specified. Not that it can't happen. It just so rarely does.
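A hedged sketch of the checks suggested above (class balance, confusion matrix, and sensitivity/specificity computed from it), assuming scikit-learn and synthetic data rather than anything specific to the asker's setup:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Deliberately imbalanced synthetic data to mimic the failure mode.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
print("class balance:", np.bincount(y))  # first check: how skewed is the target?

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))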

Can I set feature priors in sklearn Bayesian classifier?

I have done some simple Bayesian classification
from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0]]  # there are more data of course
Y = [1, 0]
classifier = BernoulliNB()
classifier.fit(X, Y)
Now I have got some "insider tips" that the first element in every X is more important than the others.
Can I incorporate this knowledge before I train the model please?
If sklearn doesn't allow it, is there any other classifier or other library that allows us to incorporate our prior before model training please?
I do not know the answer to question 2, but I can answer question 1.
The suggestion in the comments to "multiply the first element for each observation by different values" is the wrong approach.
When you are using BernoulliNB (or a binomial model), the way to incorporate prior knowledge is to add it to the sample (data).
Say you are flipping a coin and you know it is rigged toward heads: you add more samples showing heads. If your prior knowledge says 70% heads and 30% tails, you can add 100 extra samples, 70 heads and 30 tails, to your data X.
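A tiny sketch of that pseudo-sample trick (my own illustration, not from the answer above; sample_weight is used as an equivalent of physically duplicating rows):

from sklearn.naive_bayes import BernoulliNB

# Coin-flip data: label 1 = heads, 0 = tails; the single feature is a dummy.
X = [[0]] * 10
y = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # observed: 50/50

# Encode a 70/30 prior belief by weighting the heads observations more heavily,
# which has the same effect as duplicating them.
weights = [1.4 if label == 1 else 0.6 for label in y]

nb = BernoulliNB().fit(X, y, sample_weight=weights)
print(nb.class_log_prior_)  # the fitted class prior is now shifted toward heads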
Think about what the algorithm is actually doing. Naive Bayes performs the following classification:
p(class = k | data) ~ p(class = k) * p(data | class = k)
In words: The (posterior) probability of an observation being in class k is proportional to the probability of any observation being in class k (that's the prior) times the probability of seeing the observation, given it came from class k (the likelihood).
Usually, when we don't know anything, we assume that p(class = k) simply reflects the distribution of the classes in the observed data.
In your case, you're saying that you have some information, in addition to the observed data, that leads you to believe that the prior, p(class = k) should be amended. This is perfectly legitimate. In fact, that's the beauty of Bayesian inference. Whatever your prior knowledge is, you should incorporate that into this term. So in your case, perhaps that's increasing the probability of being in a particular class (i.e. increasing its weight as suggested in the comments), if you know that it's more likely to occur than the data suggests.
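As a concrete sketch of that (an assumption on my part: this adjusts the class prior, which is what scikit-learn's BernoulliNB exposes via class_prior; a true per-feature prior is not supported directly):

from sklearn.naive_bayes import BernoulliNB

X = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]  # toy data
Y = [1, 0, 0, 1]

# Default: the prior p(class = k) is estimated from the class frequencies in Y.
default_nb = BernoulliNB().fit(X, Y)

# With outside knowledge that class 1 is a priori more likely, override the prior
# (order follows the sorted class labels, here [0, 1]).
informed_nb = BernoulliNB(class_prior=[0.3, 0.7]).fit(X, Y)

print(default_nb.predict_proba([[1, 0, 0]]))
print(informed_nb.predict_proba([[1, 0, 0]]))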
