Get precison model through GridSearchCV for recall optimization

Get precison model through GridSearchCV for recall optimization - python

Given a machine learning model RBF SVC called 'm', I performed a gridSearchCV on gamma value, to optimize recall.
I'm looking to answer to this:
"The grid search should find the model that best optimizes for recall. How much better is the recall of this model than the precision?"
So I did the gridSearchCV:
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
y_decision_fn_scores_re = grid_m_re.decision_function(X_test)
print('Grid best parameter (max. recall): ', grid_m_re.best_params_)
print('Grid best score (recall): ', grid_m_re.best_score_)
This tell me the best model is for gamma=0.001 and it has a recall score of 1.
I'm wondering how to get the precision for this model to get the trade of of this model, cause the GridSearchCV only has attribute to get what it was optimize for.([Doc sklearn.GridSearchCV][1])

Not sure if there's an easier/more direct way to get this, but this approach also allows you to capture the 'best' model to play around with later:
First do you CV fit on training data:
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
Once you're done, you can pull out the 'best' model (as determined by your scoring criteria during CV), and then use it however you want:
m_best = grid_m_re.best_estimator_
and in your specific case:
from sklearn.metrics import precision_score
y_pred = m_best.predict(X_test)
precision_score(y_test, y_pred)

You can easily overfit if you don't optimize both, C and gamma at the same time.
if you plot the SVC with C on the X axis, gamma on the y axis and recall as color you get some kind of V-Shape, see here
So if you do grid search, better optimize for both, C and gamma, at the same time.
The problem is that usually you get the best results for small C-Values, and in that area the V-shape has it's pointy end: is not very big and difficult to hit.
I recently used:
make a random grid of 10 points
every point contains C, gamma, direction, speed
cut the dataset with stratifiedShuffleSplit
fit & estimate score with cross validation
repeat:
kill the worst two points
the best two points spawn a kid
move every point in its direction with just a little bit of random,
fit & estimate score with cross validation
(if a point notice it goes downward, turn around and half speed)
until break criterion is hit
Worked like a charm.
I used the max distance in the feature space divided by four as initial speed,
the direction had a maximum random of pi/4
Well, the cross validation was a bit costly.
Cleptocreatively inspired by this paper.
... and another edit:
I used between 10 and 20 cycles in the loop to get the perfect points.
If your dataset is too big to do several fits, make a representative subset for the first few trainings...

Related

Building ML classifier with imbalanced data

I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I am not figuring out if this is due to a problem of overfitting/imbalance data or to feature selection (I used Pearson correlation since all values are numeric/boolean).
I am thinking that the steps followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X,y):
if type(X)!=np.ndarray:
X=X.values
y=y.values
test_size=1/5
proportion_of_true=y[y==1].shape[0]/y.shape[0]
num_test_samples=math.ceil(y.shape[0]*test_size)
num_test_true_labels=math.floor(num_test_samples*proportion_of_true)
num_test_false_labels=math.floor(num_test_samples-num_test_true_labels)
y_test=np.concatenate([y[y==0][:num_test_false_labels],y[y==1][:num_test_true_labels]])
y_train=np.concatenate([y[y==0][num_test_false_labels:],y[y==1][num_test_true_labels:]])
X_test=np.concatenate([X[y==0][:num_test_false_labels] ,X[y==1][:num_test_true_labels]],axis=0)
X_train=np.concatenate([X[y==0][num_test_false_labels:],X[y==1][num_test_true_labels:]],axis=0)
return X_train,X_test,y_train,y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
precision recall f1-score support
0 1.00 1.00 1.00 24
1 1.00 1.00 1.00 70
accuracy 1.00 94
macro avg 1.00 1.00 1.00 94
weighted avg 1.00 1.00 1.00 94
Has anyone experienced similar issues in building a classifier when data has imbalance, using CV and/or under sampling? Happy to share the whole dataset, in case you might want to replicate the output.
What I would like to ask you for some clear answer to follow that can show me the steps and what I am doing wrong.
I know that, to reduce overfitting and work with balance data, there are some methods such as random sampling (over/under), SMOTE, CV. My idea is
Split the data on train/test taking into account imbalance
Perform CV on trains set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set
(f1-score)
as also outlined in this question: CV and under sampling on a test fold .
I think the steps above should make sense, but happy to receive any feedback that you might have on this.

When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has less values.
Another option is to train your algorithm with less data. If you have a good dataset that should not be a problem. In this case you grab first the samples from the less represented class use the size of the set to compute how many samples to get from the other class:
This code may help you split your dataset that way:
def split_dataset(dataset: pd.DataFrame, train_share=0.8):
"""Splits the dataset into training and test sets"""
all_idx = range(len(dataset))
train_count = int(len(all_idx) * train_share)
train_idx = random.sample(all_idx, train_count)
test_idx = list(set(all_idx).difference(set(train_idx)))
train = dataset.iloc[train_idx]
test = dataset.iloc[test_idx]
return train, test
def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
"""Splits the dataset as in `split_dataset` but with stratification"""
data_pos = dataset[dataset[target_attr] == positive_class]
data_neg = dataset[dataset[target_attr] != positive_class]
if len(data_pos) < len(data_neg):
train_pos, test_pos = split_dataset(data_pos, train_share)
train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
# set.difference makes the test set larger
test_neg = test_neg.iloc[0:len(test_pos)]
else:
train_neg, test_neg = split_dataset(data_neg, train_share)
train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
# set.difference makes the test set larger
test_pos = test_pos.iloc[0:len(test_neg)]
return train_pos.append(train_neg).sample(frac = 1).reset_index(drop = True), \
test_pos.append(test_neg).sample(frac = 1).reset_index(drop = True)
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.

There is another solution that is in the model-level - using models that support weights of samples, such as Gradient Boosted Trees. Of those, CatBoost is usually the best as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
label_ratio = (y==1).sum() / (y==0).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because Catboost treats each sample with a weight, so you can determine class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equal to oversampling (but requires less memory).
Also, a major part of treating imbalanced data, is making sure your metrics are weighted as well, or at least well-defined, as you might want equal performance (or skewed performance) on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)

your implementation of stratified train/test creation is not optimal, as it lacks randomness. Very often data comes in batches, so it is not a good practice to take sequences of data as is, without shuffling.
as #sturgemeister mentioned, classes ratio 3:7 is not critical, so you should not worry too much of class imbalance. When you artificially change data balance in training you will need to compensate it by multiplication by prior for some algorithms.
as for your "perfect" results either your model overtrained or the model is indeed classifies the data perfectly. Use different train/test split to check this.
another point: your test set is only 94 data points. It is definitely not 1/5 of 1400. Check your numbers.
to get realistic estimates, you need lots of test data. This is the reason why you need to apply Cross Validation strategy.
as for general strategy for 5-fold CV I suggest following:
split your data to 5 folds with respect to labels (this is called stratified split and you can use StratifiedShuffleSplit function)
take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits.
apply the model to the remaining part. Do not under/over sample data in the test part. This way you get realistic performance estimate. Save the results.
repeat 2. and 3. for all test splits (totally 5 times obviously). Important: do not change parameters (e.g. tree depth) of the model when training - they should be the same for all splits.
now you have all your data points tested without being trained on them. This is the core idea of cross validation. Concatenate all the saved results, and estimate the performance .

Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data in a train/validation/test set, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You will start by taking out a test set from your data, as your create_cv function does. Then, you split the rest of the training data in e.g. 3 splits. Then, you do, for i in {1, 2, 3}: train on data j != i, evaluate on data i. The documentation explains it with prettier and colorful figures, you should have a look! It can be quite cumbersome to implement, but hopefully scikit does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization, not too big (under-fitting) nor too small (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right metric to consider to estimate the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratified=True)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(svc, parameters, cv=5) # Specifying cv does StratifiedShuffleSplit, see documentation
clf.fit(iris.data, iris.target)
sorted(clf.cv_results_.keys())
You can also replace the cv variable by a fancier shuffler, such as StratifiedGroupKFold (no intersection between groups).
I would also advise looking towards random trees, which are less interpretable but said to have better performances in practice.

Just wanted to add thresholding and cost sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists in finding a new threshold for classifying positive vs negative classes (generally is 0.5 but it can be treated as an hyper parameter). The latter consists on weighting the classes to cope with their unbalancedness. This article was really useful to me to understand how to deal with unbalanced data sets. In it, you can find also cost sensitive learning with a specific explanation using decision tree as a model. Also all other approaches are really nicely reviewed including: Adaptive Synthetic Sampling, informed undersampling etc.

Time series regression - having problems trying to predict with KRR and gaussian process regression

I need to predict what the output will be depending on the time. I want to make it so i can train my model on the first 20% of the data, and then make a model, that will follow the behavior, and predict the remaining 80%.
The data i am working on looks as follows:
My data
But when i try to make regressions to do this, i either get something way off target (or something quite close, but then it is linear), which is not accepted.
I maybe think my problem is the choice of my kernel, or the way i am making the regressions. Right now i am making the with the sklearn package as follows:
gpr=GaussianProcessRegressor(kernel=1.15**2*RBF(length_scale=41.4) + WhiteKernel(noise_level=1.32e-4),
n_restarts_optimizer=10,
optimizer='fmin_l_bfgs_b',
normalize_y=True,
alpha=0.051)
gpr.fit(X_train, y_train)
y_gpr, y_std = gpr.predict(X_test, return_std=True)
But after a few predictions, the predictions just become the same steady value, instead of having the curve as in the data. Also, the standard variations for the prediction becomes very large.
The GPR prediction on the real data
When doing the Kernel Ridge Regression in python, i can't seem to get the curve to follow the data aswell. Either it drops to 0 in a few predictions, or it has to be a linear prediction.
The KRR model, but linear instead - which is not good enough
The KRR model is made as follows (and i know the kernel=polynomial with a degree of 1, but i cant seem to figure out/find an appropiate kernel that will follow my data):
#The kernel ridge regression
krr = KernelRidge(alpha=0.051,kernel='polynomial',degree=1)
# krr = KernelRidge(alpha=0.051,kernel=RBF(0.5))
krr.fit(X_train,y_train)
list_y_pred=krr.predict(X_test)
So if possible, i would like to get some inputs, of how it should be done instead, or if a different approach to the problem would be better. But i am really hoping i can get the KRR to fit the data, and the gaussian process regression aswell.

There is nothing absolutely wrong with your code. I believe your parameters are wrong and the guess is not the best because of it.
My sugestion would be to use grid search and pipelines to estimate the best parameters.
An example of how it would work would be
param_grid = [
{'alpha': [1, 10, 100, 1000], 'kernel': ['linear']}, ## test linear kernel with varying alpha
{'alpha': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}, # test rbf kernel while varying gamma and alpha
]
## you can have as many dictionaries as you want inside this list, or just 1. Keep in
## mind this takes O(n^n)*time_per_fit where n is the number of arguments you try to test, so
## it can take a long time
estimator = KernelRidge()
clf = clf = GridSearchCV(estimator, param_grid)
clf.fit(X_train,y_train)
list_y_pred=clf.predict(X_test)
For a more comprehensive tutorial try taking a look at the oficial docs here and here, or even here for a faster, but less thorough search.
Keep in mind my parameters are way off, I just copied the example from the docs

Underfitting, Overfitting, Good_Generalization

So as a part of my assignment I'm applying linear and lasso regressions, and here's Question 7.
Based on the scores from question 6, what gamma value corresponds to a
model that is underfitting (and has the worst test set accuracy)? What
gamma value corresponds to a model that is overfitting (and has the
worst test set accuracy)? What choice of gamma would be the best
choice for a model with good generalization performance on this
dataset (high accuracy on both training and test set)?
Hint: Try plotting the scores from question 6 to visualize the
relationship between gamma and accuracy. Remember to comment out the
import matplotlib line before submission.
This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization) Please note there is only one correct solution.
I really need help, I can't really think of any way to solve this last question. What code should I use to determine (Underfitting, Overfitting, Good_Generalization) and why???
Thanks,
Data set: http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io
Here's my code from question 6:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
def answer_six():
# SVC requires kernel='rbf', C=1, random_state=0 as instructed
# C: Penalty parameter C of the error term
# random_state: The seed of the pseudo random number generator
# used when shuffling the data for probability estimates
# e radial basis function kernel, or RBF kernel, is a popular
# kernel function used in various kernelized learning algorithms,
# In particular, it is commonly used in support vector machine
# classification
model = SVC(kernel='rbf', C=1, random_state=0)
# Return numpy array numbers spaced evenly on a log scale (start,
# stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
gamma = np.logspace(-4,1,6)
# Create a Validation Curve for model and subsets.
# Create parameter name and range regarding gamma. Test Scoring
# requires accuracy.
# Validation curve requires X and y.
train_scores, test_scores = validation_curve(model, X_subset, y_subset, param_name='gamma', param_range=gamma, scoring ='accuracy')
# Determine mean for scores and tests along columns (axis=1)
sc = (train_scores.mean(axis=1), test_scores.mean(axis=1))
return sc
answer_six()

Well, make yourself familiar with overfitting. You are supposed to produce something like this: Article on this topic
On the left you have underfitting, on the right overfitting... Where both errors are low you have good generalisation.
And these things are a function of gamma (the regularizor)

Overfitting = your model false
if model false
scatter it
change linear to poly or suport vector with working kernel...
Underfitting = your dataset false
add new data ideal correleated ...
check by nubers
score / accuracy of test and train if test and train high and no big difference you are doiing good ...
if test low or train low then you facing overfitting / underfitting
hope explained you ...

Inverse ROC-AUC value?

I have a classification problem where I need to predict a class of (0,1) given a data. Basically I have a dataset with more than 300 features (including a target value for prediction) and more than 2000 rows (samples). I applied different classifiers as follows:
1. DecisionTreeClassifier()
2. RandomForestClassifier()
3. GradientBoostingClassifier()
4. KNeighborsClassifier()
Almost all the classifiers gave me similar results around 0.50 AUC value except Random forest around 0.28. I would like to know that whether it is correct if I inverse the RandomForest result like:
1-0.28= 0.72
And report it as the AUC? Is it correct?

Your intuition is not wrong: if a binary classifier performs indeed worse than random (i.e. AUC < 0.5), a valid strategy is to simply invert its predictions, i.e. report a 0 whenever the classifier predicts a 1, and vice versa); from the relevant Wikipedia entry (emphasis added):
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor could simply be inverted to obtain a good predictor.
Nevertheless, the formally correct AUC for this inverted classifier, would be to first invert the individual probabilistic predictions prob of your model:
prob_invert = 1 - prob
and then calculate the AUC using these predictions prob_invert (arguably the process should give similar results with the naive approach you describe of simply subtracting the AUC from 1, but I'm not quire sure of the exact result - see also this Quora answer).
Needless to say, all this is based on the assumption that your whole process is correct, i.e. you don't have any modeling or coding errors (constructing a worse-than-random classifier is not exactly trivial).

Why are the grid_scores_ higher than the score for full training set? (sklearn, Python, GridSearchCV)

I'm building a logistic regression model as follows:
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01,0.1,1,10],
'model__penalty': ['l1','l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'roc_auc')
I looked at the roc_auc score for the best estimator:
grid_search_object.best_score_
Out[195]: 0.94505225726738229
However, when I used the best estimator to score the full training set, I got a worse score:
grid_search_object.best_estimator_.score(X,Y)
Out[196]: 0.89636762322433028
How can this be? What am I doing wrong?
Edit: Nevermind. I'm an idiot. grid_search_object.best_estimator_.score calculates accuracy, not auc_roc. Right?
But if that is the case, how does GridSearchCV compute the grid_scores_? Does it pick the best decision threshold for each parameter, or is the decision threshold always at 0.5? For area under the ROC curve, decision threshold doesn't matter, but it does for say, f1_score.

If you evaluated the best_estimator_ on the full training set it is not surprising that the scores are different from the best_score_, even if the scoring methods are the same:
The best_score_ is the average over your cross-validation fold scores of the best model (best in exactly that sense: scores highest on average over folds).
When scoring on the whole training set, your score may be higher or lower than this. Especially if you have some sort of temporal structure in your data and you are using the wrong data splitting, scores on the full set can be worse.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.