Change threshold value for Random Forest classifier - python

I need to develop a model which will be free (or as close to free as possible) of false negatives. To do so I've plotted a recall-precision curve and determined that the threshold value should be set to 0.11.
My question is: how do I define the threshold value during model training? There's no point in defining it later, during evaluation, because it won't carry over to new data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

rfc_model = RandomForestClassifier(random_state=101)
rfc_model.fit(X_train, y_train)
rfc_preds = rfc_model.predict(X_test)

# sweep candidate thresholds and record recall/precision for each
predicted_proba = rfc_model.predict_proba(X_test)   # probabilities only need computing once
recall_precision_vals = []
for val in np.linspace(0, 1, 101):
    predicted = (predicted_proba[:, 1] >= val).astype('int')
    recall_sc = recall_score(y_test, predicted)
    precis_sc = precision_score(y_test, predicted)
    recall_precision_vals.append({
        'Threshold': val,
        'Recall val': recall_sc,
        'Precis val': precis_sc
    })

recall_prec_df = pd.DataFrame(recall_precision_vals)
Any ideas?

how to define threshold value upon model training?
There is simply no threshold during model training; Random Forest is a probabilistic classifier, and it only outputs class probabilities. "Hard" classes (i.e. 0/1), which indeed require a threshold, are neither produced nor used at any stage of model training - only during prediction, and even then only in the cases where we indeed require a hard classification (not always the case). Please see Predict classes or class probabilities? for more details.
In fact, the scikit-learn implementation of RF doesn't employ a threshold at all, even for hard class prediction; reading the docs for the predict method closely:
the predicted class is the one with highest mean probability estimate across the trees
In simple words, this means that the actual RF output is [p0, p1] (assuming binary classification), from which the predict method simply returns the class with the highest value, i.e. 0 if p0 > p1 and 1 otherwise.
Assuming that what you actually want to do is return 1 if p1 is greater than some threshold less than 0.5, you have to ditch predict, use predict_proba instead, and then manipulate these returned probabilities to get what you want. Here is an example with dummy data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           n_classes=2, random_state=0, shuffle=False)

clf = RandomForestClassifier(n_estimators=100, max_depth=2,
                             random_state=0)
clf.fit(X, y)
Here, simply using predict for, say, the first element of X, will give 0:
clf.predict(X)[0]
# 0
because
clf.predict_proba(X)[0]
# array([0.85266881, 0.14733119])
i.e. p0 > p1.
To get what you want (i.e. here returning class 1, since p1 > threshold for a threshold of 0.11), here is what you have to do:
prob_preds = clf.predict_proba(X)
threshold = 0.11  # define threshold here
preds = [1 if prob_preds[i][1] > threshold else 0 for i in range(len(prob_preds))]
after which, it is easy to see that now for the first predicted sample we have:
preds[0]
# 1
since, as shown above, for this sample we have p1 = 0.14733119 > threshold.
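For larger test sets, the same thresholding can also be done in a vectorized way; here is a minimal sketch that simply reuses the clf, X and threshold already defined above:
# boolean mask of samples whose p1 exceeds the threshold, cast to 0/1
preds = (clf.predict_proba(X)[:, 1] > threshold).astype(int)
preds[0]
# 1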

Related

Strong correlation (89%) but negative coefficient with value (-0.00)

My dataframe has 8 columns (target_var='I62'). When I run pandas corr() in Python, it gives me a very good correlation between my target_var and IOF2 and H6.
# This data is till 2019-Dec
df3Train = pd.read_csv('I62Trainv7.csv', parse_dates=['Value_Date'],index_col=['Value_Date'])
df3Train.corr()
Please see the image below:
[correlation matrix screenshot]
Using 'I62' (y) and [IOF2, H6] (x), I created a LinearRegression model from sklearn. When I look at the model's coefficients, it shows a negative coefficient (-0.004) for H6:
# Model building with I62 and IOF2 and H6
y = df3TrainClean["I62"]
# Independent Vars
x = df3TrainClean[['IOF2', 'H6']]
# Training on 80% Testing 20%
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=1)
lrModel = LinearRegression()
lrModel.fit(x_train, y_train)
print(lrModel.score(x_test, y_test))
print(lrModel.score(x_train, y_train))
# Let's train the model (refit on the full data)
lrModel.fit(x,y)
yPred = lrModel.predict(x_test)
# Coefficients
coeff_df = pd.DataFrame(lrModel.coef_, x.columns, columns=['Coefficient'])
coeff_df
[coefficient table screenshot]
I am a bit lost: when there is a good correlation, how come the coefficient is so low (and negative)?
Can anyone please explain this, and the impact of this negative coefficient on the predictions?
Thanks
Negative correlation means that when one value increases, the other decreases. Correlation can take a value of -1, which means that two values always move in opposite directions. A correlation of 0 means that the movement of one does not imply the movement of another -- i.e. they are not correlated at all.
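A quick sketch of those three cases with made-up data (np.corrcoef returns the correlation matrix; the [0, 1] entry is the correlation between the two inputs):
import numpy as np

x = np.arange(1000, dtype=float)
rng = np.random.default_rng(0)

print(np.corrcoef(x, 2 * x)[0, 1])                  # +1.0: the two always move together
print(np.corrcoef(x, -3 * x)[0, 1])                 # -1.0: they always move in opposite directions
print(np.corrcoef(x, rng.normal(size=1000))[0, 1])  # close to 0: no linear relationship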

train_test_split random_state not working; produces different output every time

So, I've been using KNN on a set of data, with random_state = 4 during the train_test_split phase. Despite using the random state, the accuracy, classification report, predictions, etc. are different each time. I was wondering why that is?
Here's the head of the data: (predicting the position based on all_time_runs and order)
   order position  all_time_runs
0     10   NO BAT           1304
1      2  CAN BAT           7396
2      3   NO BAT           6938
3      6  CAN BAT           4903
4      6  CAN BAT           3761
And here's the code for the classification and prediction:
#splitting data into features and target
X = posdf.drop('position',axis=1)
y = posdf['position']
knn = KNeighborsClassifier(n_neighbors = 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#fitting the KNN model
knn.fit(X_train, y_train)
#predicting with the model
prediction = knn.predict(X_test)
#knn score
score = knn.score(X_test, y_test)
Although train_test_split has a random factor associated with it, and fixing it is necessary to avoid random results, it's not the only source of randomness you should address.
KNN is a model that takes each row of the test set, finds the k nearest training vectors and classifies it by majority vote, and even in case of ties the decision is random. You need to set a seed (e.g. np.random.seed(x) in Python) to ensure the method is replicable.
Documentation states:
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
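A minimal, self-contained sketch of how fixing the seeds makes the whole pipeline reproducible (it uses make_classification as a stand-in for posdf, so the numbers themselves are not the asker's):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(4)   # fixes NumPy's global RNG, in case anything else draws random numbers

# synthetic two-feature data standing in for the real dataframe
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=4)   # fixed split

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # prints the same accuracy on every run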

Why when I use GridSearchCV with roc_auc scoring, the score is different for grid_search.score(X,y) and roc_auc_score(y, y_predict)?

I am using stratified 10-fold cross-validation to find the model that predicts y (a binary outcome) from X (X has 34 labels) with the highest AUC. I set up GridSearchCV:
log_reg = LogisticRegression()
parameter_grid = {'penalty': ["l1", "l2"], 'C': np.arange(0.1, 3, 0.1)}
cross_validation = StratifiedKFold(n_splits=10, shuffle=True, random_state=100)
grid_search = GridSearchCV(log_reg, param_grid=parameter_grid, scoring='roc_auc',
                           cv=cross_validation)
And then do the cross-validation:
grid_search.fit(X, y)
y_pr=grid_search.predict(X)
I do not understand the following:
Why do grid_search.score(X, y) and roc_auc_score(y, y_pr) give different results (the former is 0.74 and the latter is 0.63)? Why don't these commands do the same thing in my case?
This is due to the different initialization of the roc_auc scorer when it is used inside GridSearchCV.
Look at the source code here
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True,
                             needs_threshold=True)
Observe the third parameter, needs_threshold. When it is true, the scorer requires continuous values for y_pred, such as probabilities or confidence scores, which in the grid search are calculated from log_reg.decision_function().
When you explicitly call roc_auc_score with y_pr, you are using .predict(), which outputs the predicted class labels of the data, not probabilities. That accounts for the difference.
Try :
y_pr=grid_search.decision_function(X)
roc_auc_score(y, y_pr)
If the results are still not the same, please update the question with the complete code and some sample data.
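For reference, here is a self-contained sketch of the discrepancy on dummy data (make_classification instead of the asker's X and y, so the exact numbers will differ from 0.74/0.63):
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=100)
log_reg = LogisticRegression(max_iter=1000).fit(X, y)

print(roc_auc_score(y, log_reg.predict(X)))            # AUC computed from hard 0/1 labels
print(roc_auc_score(y, log_reg.decision_function(X)))  # AUC from continuous scores - what the roc_auc scorer uses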

Interpreting the DecisionTreeRegressor score?

I am trying to evaluate the relevance of features and I am using DecisionTreeRegressor().
The related part of the code is presented below:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
new_data = data.drop(['Frozen'], axis = 1)
# TODO: Split the data into training and testing sets(0.25) using the given feature as the target
# TODO: Set a random state.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(new_data, data['Frozen'], test_size = 0.25, random_state = 1)
# TODO: Create a decision tree regressor and fit it to the training set
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=1)
regressor.fit(X_train, y_train)
# TODO: Report the score of the prediction using the testing set
from sklearn.model_selection import cross_val_score
#score = cross_val_score(regressor, X_test, y_test)
score = regressor.score(X_test, y_test)
print(score)
When I run the print function, it returns the given score:
-0.649574327334
You can find the score function implementation and some explanation here, and quoted below:
Returns the coefficient of determination R^2 of the prediction.
...
The best possible score is 1.0 and it can be negative (because the
model can be arbitrarily worse).
I have not grasped the whole concept yet, so this explanation is not very helpful to me. For instance, I could not understand why the score could be negative and what exactly it indicates (if something is squared, I would expect it to only be positive).
What does this score indicate, and why can it be negative?
If you know any article (for starters) it might be helpful as well!
R^2 can be negative by its definition (https://en.wikipedia.org/wiki/Coefficient_of_determination) if the model fits the data worse than a horizontal line. Basically
R^2 = 1 - SS_res/SS_tot
and SS_res and SS_tot are always positive. If SS_res > SS_tot, you have a negative R^2. Look at this answer as well: https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative
The article executes cross_val_score, in which DecisionTreeRegressor is used. You may take a look at the documentation of the scikit-learn DecisionTreeRegressor.
Basically, the score you see is R^2, or (1 - u/v), where u is the residual sum of squares of your prediction and v is the total sum of squares (the sum of squared deviations from the sample mean).
u/v can be arbitrarily large when you make really bad predictions, while it can only be as small as zero, since u and v are both sums of squares (>= 0).
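A tiny worked example with made-up numbers, showing both the formula and that a prediction worse than just predicting the mean gives a negative score:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.array([4.0, 4.0, 4.0, 4.0])       # a deliberately poor prediction

u = ((y_true - y_bad) ** 2).sum()             # residual sum of squares = 14
v = ((y_true - y_true.mean()) ** 2).sum()     # total sum of squares = 5

print(1 - u / v)                 # -1.8
print(r2_score(y_true, y_bad))   # -1.8 as well: same definition, negative because u > v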

What is the theoretical foundation for the scikit-learn dummy classifier?

From the documentation I read that a dummy classifier can be used as a baseline to compare against a classification algorithm.
This classifier is useful as a simple baseline to compare with other
(real) classifiers. Do not use it for real problems.
What does the dummy classifier do when it uses the stratified approach? I know that the documentation says:
generates predictions by respecting the training set’s class
distribution.
Could anybody give me a more theoretical explanation of why this is a test of the performance of the classifier?
The dummy classifier gives you a measure of "baseline" performance--i.e. the success rate one should expect to achieve even if simply guessing.
Suppose you wish to determine whether a given object possesses or does not possess a certain property. If you have analyzed a large number of those objects and have found that 90% contain the target property, then guessing that every future instance of the object possesses the target property gives you a 90% likelihood of guessing correctly. Structuring your guesses this way is equivalent to using the most_frequent method in the documentation you cite.
Because many machine learning tasks attempt to increase the success rate of (e.g.) classification tasks, evaluating the baseline success rate can afford a floor value for the minimal value one's classifier should out-perform. In the hypothetical discussed above, you would want your classifier to get more than 90% accuracy, because 90% is the success rate available to even "dummy" classifiers.
If one trains a dummy classifier with the stratified parameter using the data discussed above, that classifier will predict that there is a 90% probability that each object it encounters possesses the target property. This is different from training a dummy classifier with the most_frequent parameter, as the latter would guess that all future objects possess the target property. Here's some code to illustrate:
from sklearn.dummy import DummyClassifier
import numpy as np

two_dimensional_values = []
class_labels = []

for i in range(90):
    two_dimensional_values.append([1, 1])
    class_labels.append(1)

for i in range(10):
    two_dimensional_values.append([0, 0])
    class_labels.append(0)

# now 90% of the training data contains the target property
X = np.array(two_dimensional_values)
y = np.array(class_labels)

# train a dummy classifier to make predictions based on the most_frequent class value
dummy_classifier = DummyClassifier(strategy="most_frequent")
dummy_classifier.fit(X, y)

# this produces 100 predictions that say "1"
for i in two_dimensional_values:
    print(dummy_classifier.predict([i]))

# train a dummy classifier to make predictions based on the class distribution
new_dummy_classifier = DummyClassifier(strategy="stratified")
new_dummy_classifier.fit(X, y)

# this produces roughly 90 guesses that say "1" and roughly 10 guesses that say "0"
for i in two_dimensional_values:
    print(new_dummy_classifier.predict([i]))
A major motivation for DummyClassifier is the F-score when the positive class is in the minority (i.e. imbalanced classes). This classifier is used as a sanity test for an actual classifier. The dummy classifier completely ignores the input data. In the case of the 'most_frequent' strategy, it simply checks the occurrence of the most frequent label.
Using the docs: to illustrate DummyClassifier, first let's create an imbalanced dataset:
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> y[y != 1] = -1
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
Next, let’s compare the accuracy of SVC and most_frequent:
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.svm import SVC
>>> clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.63...
>>> clf = DummyClassifier(strategy='most_frequent',random_state=0)
>>> clf.fit(X_train, y_train)
DummyClassifier(constant=None, random_state=0, strategy='most_frequent')
>>> clf.score(X_test, y_test)
0.57...
We see that SVC doesn’t do much better than a dummy classifier. Now, let’s change the kernel:
>>> clf = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.97...
We see that the accuracy was boosted to almost 100%. So this is better.
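To connect this back to the F-score point above, one can also compare F1 on the minority class; this is only a sketch reusing the X_train/X_test split from the docs example, and f1_score with pos_label=1 is just one reasonable choice of metric here:
from sklearn.metrics import f1_score

dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
svc = SVC(gamma='scale', kernel='rbf', C=1).fit(X_train, y_train)

# the most_frequent dummy never predicts the minority class 1, so its F1 for that class is 0
print(f1_score(y_test, dummy.predict(X_test), pos_label=1))
print(f1_score(y_test, svc.predict(X_test), pos_label=1))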
