As part of my assignment I'm applying linear and lasso regression, and here's Question 7.
Based on the scores from question 6, what gamma value corresponds to a
model that is underfitting (and has the worst test set accuracy)? What
gamma value corresponds to a model that is overfitting (and has the
worst test set accuracy)? What choice of gamma would be the best
choice for a model with good generalization performance on this
dataset (high accuracy on both training and test set)?
Hint: Try plotting the scores from question 6 to visualize the
relationship between gamma and accuracy. Remember to comment out the
import matplotlib line before submission.
This function should return one tuple with the degree values in this order: (Underfitting, Overfitting, Good_Generalization) Please note there is only one correct solution.
I really need help; I can't think of any way to solve this last question. What code should I use to determine (Underfitting, Overfitting, Good_Generalization), and why???
Thanks,
Data set: http://archive.ics.uci.edu/ml/datasets/Mushroom?ref=datanews.io
Here's my code from question 6:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

def answer_six():
    # SVC requires kernel='rbf', C=1, random_state=0 as instructed
    # C: penalty parameter of the error term
    # random_state: the seed of the pseudo random number generator
    # used when shuffling the data for probability estimates
    # The radial basis function (RBF) kernel is a popular kernel
    # function used in various kernelized learning algorithms,
    # in particular in support vector machine classification.
    model = SVC(kernel='rbf', C=1, random_state=0)
    # numpy array of numbers spaced evenly on a log scale:
    # np.logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0)
    gamma = np.logspace(-4, 1, 6)
    # Build a validation curve for the model over the gamma range,
    # scoring by accuracy on X_subset and y_subset.
    train_scores, test_scores = validation_curve(
        model, X_subset, y_subset,
        param_name='gamma', param_range=gamma, scoring='accuracy')
    # Mean of the scores across the CV folds (axis=1)
    sc = (train_scores.mean(axis=1), test_scores.mean(axis=1))
    return sc

answer_six()
Well, make yourself familiar with overfitting. You are supposed to produce something like this: Article on this topic
On the left you have underfitting, on the right overfitting... Where both errors are low you have good generalisation.
And these things are a function of gamma (which acts as a regularizer).
Overfitting = your model is at fault.
If the model is at fault, scatter-plot the data and change the model, e.g. switch from a linear kernel to a polynomial one, or use a support vector machine with a kernel that actually works.
Underfitting = your dataset is at fault.
Add new, ideally well-correlated data.
Then check the numbers: the score/accuracy of test and train. If test and train are both high with no big difference between them, you are doing well.
If the test score is low (or the train score is low), then you are facing overfitting (or underfitting).
Hope that explains it.
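To turn that into code for the assignment: here is a minimal sketch (not the official solution) that plots the scores from answer_six() and reads the three gamma values off the validation curve. It assumes answer_six() returns the (train_means, test_means) tuple above and that the same gamma = np.logspace(-4, 1, 6) range is reused; the selection heuristics are just one way to read the curve.
import numpy as np
# import matplotlib.pyplot as plt   # remember to comment this out before submission

def answer_seven():
    train_means, test_means = answer_six()
    gamma = np.logspace(-4, 1, 6)

    # Optional visual check of the validation curve
    # plt.semilogx(gamma, train_means, label='train')
    # plt.semilogx(gamma, test_means, label='test')
    # plt.legend(); plt.show()

    # Underfitting: training (and test) accuracy are both poor -> lowest training score
    underfitting = gamma[np.argmin(train_means)]
    # Overfitting: high training accuracy but the largest train/test gap
    overfitting = gamma[np.argmax(train_means - test_means)]
    # Good generalization: the gamma with the best test accuracy
    good_generalization = gamma[np.argmax(test_means)]

    return (underfitting, overfitting, good_generalization)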
Related
I'm using sklearn's LinearSVC and LogisticRegression models for classification. I want the optimization in training to minimize BER (balanced error rate). I noticed that you can set the class_weight parameter in the models to 'balanced', which should (according to the documentation) 'automatically adjust weights inversely proportional to class frequencies in the input data'. However, you can also pass sample weights as a parameter to the fit() method. I'm wondering if these are equivalent ways of achieving the same goal? Anything I should be aware of or concerned about here?
--
(Added detail below based on bot suggestion)
For example:
You could define
model = sklearn.svm.LinearSVC(class_weight = 'balanced')
and then fit
model.fit(X_train, y_train)
OR you could define
model = sklearn.svm.LinearSVC()
and then fit
wts = ...  # an array of length n (number of samples), formulated to balance the classes
model.fit(X_train,y_train, sample_weight = wts)
...
Are these equivalent and do either/both satisfy my goal of minimizing BER?
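No answer is recorded here, but for what it's worth: scikit-learn's compute_sample_weight helper produces the same 'balanced' weights that class_weight='balanced' derives from class frequencies, so the two routes below should be equivalent in practice. A minimal sketch of the comparison (not a guarantee that either exactly minimizes BER; class-balanced weighting optimizes a balanced surrogate loss):
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_sample_weight

# Route 1: let the estimator derive per-class weights from class frequencies
clf_a = LinearSVC(class_weight='balanced')
clf_a.fit(X_train, y_train)

# Route 2: compute the equivalent per-sample weights explicitly and pass them to fit()
wts = compute_sample_weight(class_weight='balanced', y=y_train)
clf_b = LinearSVC()
clf_b.fit(X_train, y_train, sample_weight=wts)

# Both reweight the hinge loss inversely to class frequency, so the fitted
# models should coincide up to numerical details.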
I have a dataset with 1400 obs and 19 columns. The Target variable has values 1 (value that I am most interested in) and 0. The distribution of classes shows imbalance (70:30).
Using the code below I am getting weird values (all 1s). I can't figure out whether this is due to overfitting/imbalanced data or to a problem with feature selection (I used Pearson correlation, since all values are numeric/boolean).
I suspect the steps I followed are wrong.
import numpy as np
import math
import pandas as pd
import sklearn.metrics as metrics
from sklearn.metrics import classification_report, f1_score
from sklearn.tree import DecisionTreeClassifier

y = df['Label']
X = df.drop('Label', axis=1)

def create_cv(X, y):
    if type(X) != np.ndarray:
        X = X.values
        y = y.values
    test_size = 1/5
    proportion_of_true = y[y==1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)

    y_test = np.concatenate([y[y==0][:num_test_false_labels], y[y==1][:num_test_true_labels]])
    y_train = np.concatenate([y[y==0][num_test_false_labels:], y[y==1][num_test_true_labels:]])
    X_test = np.concatenate([X[y==0][:num_test_false_labels], X[y==1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y==0][num_test_false_labels:], X[y==1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = create_cv(X, y)
X_train, X_crossv, y_train, y_crossv = create_cv(X_train, y_train)

tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)

print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70

    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94
Has anyone experienced similar issues when building a classifier on imbalanced data, using CV and/or undersampling? I'm happy to share the whole dataset in case you want to replicate the output.
What I would like is a clear answer I can follow, showing the steps and what I am doing wrong.
I know that, to reduce overfitting and to work with imbalanced data, there are methods such as random over/under sampling, SMOTE and CV. My idea is to:
Split the data into train/test, taking the imbalance into account
Perform CV on the train set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set (f1-score)
as also outlined in this question: CV and under sampling on a test fold.
I think the steps above make sense, but I'm happy to receive any feedback you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has fewer samples.
Another option is to train your algorithm with less data. If you have a good dataset, that should not be a problem. In this case you first grab the samples from the less represented class and use the size of that set to compute how many samples to take from the other class:
This code may help you split your dataset that way:
import random
import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]

    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]

    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
There is another solution at the model level: using models that support sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier
y = df['Label']
X = df.drop('Label',axis=1)
label_ratio = (y==1).sum() / (y==0).sum()
model = CatBoostClassifier(scale_pos_weight = label_ratio)
model.fit(X, y)
And so forth.
This works because CatBoost gives each sample a weight, so you can set the class weights in advance (scale_pos_weight).
This is better than downsampling, and is technically equivalent to oversampling (but requires less memory).
Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well defined, since you may want equal (or deliberately skewed) performance across classes on these metrics.
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
Your implementation of the stratified train/test split is not optimal, as it lacks randomness. Data very often comes in batches, so it is not good practice to take sequences of data as-is, without shuffling.
As #sturgemeister mentioned, a 3:7 class ratio is not critical, so you should not worry too much about class imbalance. When you artificially change the data balance during training, you will need to compensate for it by multiplying by the prior for some algorithms.
As for your "perfect" results: either your model is overtrained or it really does classify the data perfectly. Use a different train/test split to check this.
Another point: your test set is only 94 data points. That is definitely not 1/5 of 1400. Check your numbers.
To get realistic estimates you need lots of test data. This is the reason why you need to apply a cross-validation strategy.
As for a general strategy for 5-fold CV, I suggest the following (a sketch follows the list):
Split your data into 5 folds with respect to the labels (this is called a stratified split, and you can use the StratifiedShuffleSplit function)
Take 4 splits and train your model. If you want to use under/oversampling, modify the data in those 4 training splits only.
Apply the model to the remaining part. Do not under/oversample the data in the test part; this way you get a realistic performance estimate. Save the results.
Repeat steps 2 and 3 for all test splits (5 times in total). Important: do not change the parameters (e.g. tree depth) of the model between trainings - they should be the same for all splits.
Now you have all your data points tested without the model having been trained on them. This is the core idea of cross-validation. Concatenate all the saved results and estimate the performance.
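A minimal sketch of that strategy, using the asker's X, y and DecisionTreeClassifier; the manual undersampling step and the random seeds are only illustrative assumptions:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

X_arr, y_arr = X.values, y.values            # X, y from the question's df
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rng = np.random.default_rng(0)

oof_true, oof_pred = [], []
for train_idx, test_idx in skf.split(X_arr, y_arr):
    X_tr, y_tr = X_arr[train_idx], y_arr[train_idx]

    # (optional) undersample the majority class in the TRAINING folds only
    idx_pos = np.where(y_tr == 1)[0]
    idx_neg = np.where(y_tr == 0)[0]
    n = min(len(idx_pos), len(idx_neg))
    keep = np.concatenate([rng.choice(idx_pos, n, replace=False),
                           rng.choice(idx_neg, n, replace=False)])
    X_tr, y_tr = X_tr[keep], y_tr[keep]

    # same model parameters for every fold
    model = DecisionTreeClassifier(max_depth=5)
    model.fit(X_tr, y_tr)

    # evaluate on the untouched test fold and save the results
    oof_true.append(y_arr[test_idx])
    oof_pred.append(model.predict(X_arr[test_idx]))

# concatenate the saved out-of-fold results and estimate the performance
print(classification_report(np.concatenate(oof_true), np.concatenate(oof_pred)))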
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data into train/validation/test sets, which is fine and often sufficient when the number of training samples is large (say, > 2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You start by taking a test set out of your data, as your create_cv function does. Then you split the rest of the training data into e.g. 3 splits and, for i in {1, 2, 3}, train on the splits j != i and evaluate on split i. The documentation explains it with prettier and more colorful figures; you should have a look! It can be quite cumbersome to implement, but fortunately scikit-learn does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit-learn handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyper-parameters. You want the right amount of regularization: not too much (under-fitting) nor too little (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right knob to tune to control the regularization of your method.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)

tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(tree, parameters, cv=5)  # with a classifier, an integer cv uses a stratified K-fold split, see the documentation
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
You can also replace the cv argument with a fancier splitter, such as StratifiedGroupKFold (which keeps groups from overlapping between folds).
I would also advise looking at randomized trees (e.g. random forests), which are less interpretable but are said to perform better in practice.
Just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists of finding a new threshold for classifying positive vs. negative classes (generally 0.5, but it can be treated as a hyperparameter). The latter consists of weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with imbalanced datasets. In it you can also find cost-sensitive learning, with a specific explanation using a decision tree as the model. All the other approaches are also nicely reviewed, including adaptive synthetic sampling, informed undersampling, etc.
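As a rough illustration of the thresholding idea, using the asker's fitted tree and their X_crossv/y_crossv validation split (the F1 criterion is just one possible choice):
import numpy as np
from sklearn.metrics import precision_recall_curve

# probabilities for the positive class on the validation split
proba = tree.predict_proba(X_crossv)[:, 1]

# sweep the candidate thresholds and keep the one that maximizes F1
precision, recall, thresholds = precision_recall_curve(y_crossv, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # the last precision/recall pair has no threshold

# apply the tuned threshold on the test set instead of the default 0.5
y_pred = (tree.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)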
Could you briefly describe what the lines of code below mean? This is the code of a logistic regression in Python.
What do test_size=0.25 and random_state=0 mean? And what is train_test_split? What was done in this line of code?
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
And what was done in these lines of code ?
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
Have a look at the description of the function here:
random_state sets the seed for the random number generator so that you get the same result with each run; this is especially useful in educational settings, where everyone should get an identical result.
test_size refers to the proportion used in the test split: here 75% of the data is used for training and 25% for testing the model.
The other lines simply run the logistic regression on the training dataset. You then use the test dataset to check the goodness of the fitted regression.
What do test_size=0.25 and random_state=0 mean?
test_size=0.25 -> 25% of the data goes to the test split, 75% to the training split.
random_state=0 -> for reproducible results; this can be any number.
What was done in this line of code?
It splits X and y into X_train, X_test, y_train and y_test.
And what was done in these lines of code?
It trains the logistic regression model with fit(X_train, y_train) and then makes predictions on the test set X_test.
Later you will probably compare y_pred to y_test to see what the accuracy of the model is.
Based on the documentation:
test_size : float, int or None, optional (default=None)
If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
This gives you the split between your train data and test data. If you have 1000 data points in total, test_size=0.25 would mean that you have:
750 data points for train
250 data points for test
The perfect size is still under discussion; for large datasets (1,000,000+) I currently prefer to set it to 0.1. Even before that, I keep a separate validation dataset completely aside until I have decided how to run the algorithm.
random_state : int, RandomState instance or None, optional
(default=None)
If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
For machine learning you should set this to a value: if you do, you will be able to open your program on another day and still produce the same results. random_state is normally also available in all classifiers/regression models, so that you can start working and tuning and keep it reproducible.
To comment on your regression:
logistic_regression= LogisticRegression()
logistic_regression.fit(X_train,y_train)
y_pred=logistic_regression.predict(X_test)
The first line creates your regression object; in Python this just instantiates and names it.
The second line fits your logistic regression on the training set; in this example it uses the 750 training samples. Training means that the weights of the logistic regression are optimized on those 750 entries so that the estimates fit your y_train.
The third line uses the weights learned in step 2 to produce the estimate y_pred from X_test.
After that you can test your results: you now have the y_pred you calculated and the real y_test, so you can compute some accuracy scores and see how well the regression was trained.
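For example (a small illustration with the standard sklearn metrics):
from sklearn.metrics import accuracy_score, confusion_matrix

print(accuracy_score(y_test, y_pred))    # fraction of correct predictions
print(confusion_matrix(y_test, y_pred))  # per-class breakdown of correct and incorrect predictions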
This line:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
divides your source data into a train set and a test set; 0.25 means 25% of the source will be used for testing and the remainder for training.
For, random_state = 0, here is a brief discussion.
An excerpt from the above link:
if you use random_state=some_number, then you can guarantee that the
output of Run 1 will be equal to the output of Run 2,
logistic_regression= LogisticRegression() #Creates logistic regressor
Sets up the regressor that will calculate the model values from your source data. Recommended read:
logistic_regression.fit(X_train,y_train)
An excerpt from the above link:
Here the fit method, when applied to the training dataset,learns the
model parameters (for example, mean and standard deviation)
....
It doesn't matter what the actual random_state number is - 42, 0, 21, ... The important thing is that every time you use 42 you will always get the same output as the first time you made the split. This is useful if you want reproducible results.
Performs prediction on the test set based on what was learned from the training set.
y_pred=logistic_regression.predict(X_test)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=0)
The line above splits your data into training and testing data randomly.
X is your dataset minus the output variable
y is your output variable
test_size=0.25 means you are dividing the data 75%/25%, where 25% is your testing dataset
random_state is used to generate the same sample again whenever you run the code
Refer to the train_test_split documentation
Given an RBF SVC machine learning model called 'm', I performed a GridSearchCV on the gamma value to optimize recall.
I'm looking to answer to this:
"The grid search should find the model that best optimizes for recall. How much better is the recall of this model than the precision?"
So I did the gridSearchCV:
grid_values = {'gamma': [0.001, 0.01, 0.05, 0.1, 1, 10, 100]}
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
y_decision_fn_scores_re = grid_m_re.decision_function(X_test)
print('Grid best parameter (max. recall): ', grid_m_re.best_params_)
print('Grid best score (recall): ', grid_m_re.best_score_)
This tells me the best model is at gamma=0.001 and that it has a recall score of 1.
I'm wondering how to get the precision of this model so I can see its trade-off, because GridSearchCV only exposes attributes for the metric it was optimized for. ([Doc sklearn.GridSearchCV][1])
Not sure if there's an easier/more direct way to get this, but this approach also allows you to capture the 'best' model to play around with later:
First do your CV fit on the training data:
grid_m_re = GridSearchCV(m, param_grid = grid_values, scoring = 'recall')
grid_m_re.fit(X_train, y_train)
Once you're done, you can pull out the 'best' model (as determined by your scoring criteria during CV), and then use it however you want:
m_best = grid_m_re.best_estimator_
and in your specific case:
from sklearn.metrics import precision_score
y_pred = m_best.predict(X_test)
precision_score(y_test, y_pred)
You can easily overfit if you don't optimize both C and gamma at the same time.
If you plot the SVC score with C on the x axis, gamma on the y axis and recall as the color, you get some kind of V-shape, see here.
So if you do a grid search, you had better optimize for both C and gamma at the same time.
The problem is that you usually get the best results for small C values, and in that area the V-shape has its pointy end: it is not very big and is difficult to hit.
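For a plain joint grid search over both parameters, something like the sketch below would do (the ranges are just illustrative):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C':     [0.01, 0.1, 1, 10, 100],
              'gamma': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(), param_grid, scoring='recall', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)   # the best (C, gamma) pair rather than gamma alone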
I recently used:
make a random grid of 10 points
every point contains C, gamma, a direction and a speed
cut the dataset with StratifiedShuffleSplit
fit & estimate the score with cross-validation
repeat:
    kill the worst two points
    the best two points spawn a kid
    move every point in its direction with just a little bit of randomness
    fit & estimate the score with cross-validation
    (if a point notices it is going downhill, it turns around and halves its speed)
until a break criterion is hit
Worked like a charm.
I used the maximum distance in the feature space divided by four as the initial speed,
and the direction had a maximum random perturbation of pi/4.
Well, the cross validation was a bit costly.
Cleptocreatively inspired by this paper.
... and another edit:
I used between 10 and 20 cycles in the loop to get the perfect points.
If your dataset is too big to do several fits, make a representative subset for the first few trainings...
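Here is a rough reconstruction of that search, not the original code; the parameter ranges, cycle count, initial speed and recall scoring are all assumptions:
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

rng = np.random.default_rng(0)
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

def cv_score(point, X, y):
    """Cross-validated recall for one point = (log10 C, log10 gamma)."""
    C, gamma = 10.0 ** point
    return cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=cv, scoring='recall').mean()

def evolve(X, y, n_points=10, n_cycles=15):
    # random grid of points in (log10 C, log10 gamma) space, each with a direction and a speed
    pos = rng.uniform(low=[-2.0, -4.0], high=[3.0, 1.0], size=(n_points, 2))
    angle = rng.uniform(0.0, 2.0 * np.pi, n_points)
    speed = np.full(n_points, 0.5)
    score = np.array([cv_score(p, X, y) for p in pos])

    for _ in range(n_cycles):
        order = np.argsort(score)
        # kill the worst two points; the best two points spawn a kid
        pos[order[0]] = pos[order[-1]]
        pos[order[1]] = (pos[order[-1]] + pos[order[-2]]) / 2.0
        # move every point in its direction with a little bit of randomness
        jitter = rng.uniform(-np.pi / 4, np.pi / 4, n_points)
        pos += speed[:, None] * np.column_stack([np.cos(angle + jitter),
                                                 np.sin(angle + jitter)])
        new_score = np.array([cv_score(p, X, y) for p in pos])
        # if a point went downhill, turn it around and halve its speed
        downhill = new_score < score
        angle[downhill] += np.pi
        speed[downhill] /= 2.0
        score = new_score

    best = int(np.argmax(score))
    return {'C': 10.0 ** pos[best, 0], 'gamma': 10.0 ** pos[best, 1]}, score[best]

best_params, best_recall = evolve(X_train, y_train)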
I'm using cross-validation to evaluate the performance of a classifier with scikit-learn and I want to plot the Precision-Recall curve. I found an example on scikit-learn's website for plotting the PR curve, but it doesn't use cross-validation for the evaluation.
How can I plot the Precision-Recall curve in scikit-learn when using cross-validation?
I did the following, but I'm not sure if it's the correct way to do it (pseudocode):
for each k-fold:
    precision, recall, _ = precision_recall_curve(y_test, probs)
    mean_precision += precision
    mean_recall += recall

mean_precision /= num_folds
mean_recall /= num_folds

plt.plot(recall, precision)
What do you think?
Edit:
it doesn't work because the sizes of the precision and recall arrays are different after each fold.
anyone?
Instead of recording the precision and recall values after each fold, store the predictions on the test samples after each fold. Next, collect all the test (i.e. out-of-bag) predictions and compute precision and recall.
## let K = the number of folds
## let test_samples[k] = test samples for the kth fold (list of lists)
## let train_samples[k] = training samples for the kth fold (list of lists)

predictions_fold = [None] * K
for k in range(0, K):
    model = train(parameters, train_samples[k])
    predictions_fold[k] = predict(model, test_samples[k])

# collect predictions
predictions_combined = [p for preds in predictions_fold for p in preds]

## let predictions = rearranged predictions s.t. they are in the original order
## use predictions and labels to compute lists of TP, FP, FN
## use TP, FP, FN to compute precisions and recalls for one run of k-fold cross-validation
Under a single, complete run of k-fold cross-validation, the predictor makes one and only one prediction for each sample. Given n samples, you should have n test predictions.
(Note: These predictions are different from training predictions, because the predictor makes the prediction for each sample without having previously seen it.)
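In scikit-learn the same idea can be written compactly (a sketch, assuming a binary problem and any classifier clf that supports predict_proba):
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import precision_recall_curve, auc

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# out-of-fold probabilities: each sample is predicted by a model that never saw it
probs = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]

precision, recall, _ = precision_recall_curve(y, probs)
pr_auc = auc(recall, precision)  # uses linear interpolation; see the caveat below about PR space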
Unless you are using leave-one-out cross-validation, k-fold cross-validation generally requires a random partitioning of the data. Ideally, you would do repeated (and stratified) k-fold cross-validation. Combining precision-recall curves from different rounds, however, is not straightforward, since you cannot use simple linear interpolation between precision-recall points, unlike with ROC (see Davis and Goadrich 2006).
I personally calculated AUC-PR using the Davis-Goadrich method for interpolation in PR space (followed by numerical integration) and compared the classifiers using the AUC-PR estimates from repeated stratified 10-fold cross validation.
For a nice plot, I showed a representative PR curve from one of the cross-validation rounds.
There are, of course, many other ways of assessing classifier performance, depending on the nature of your dataset.
For instance, if the proportion of (binary) labels in your dataset is not skewed (i.e. it is roughly 50-50), you could use the simpler ROC analysis with cross-validation:
Collect predictions from each fold and construct ROC curves (as before), collect all the TPR-FPR points (i.e. take the union of all TPR-FPR tuples), then plot the combined set of points with possible smoothing. Optionally, compute AUC-ROC using simple linear interpolation and the composite trapezoid method for numerical integration.
This is currently the best way to plot a Precision-Recall curve for an sklearn classifier using cross-validation. The best part is that it plots the PR curves for ALL classes, so you get multiple neat-looking curves as well.
from scikitplot.classifiers import plot_precision_recall_curve
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

clf = LogisticRegression()
plot_precision_recall_curve(clf, X, y)
plt.show()
The function automatically takes care of cross-validating the given dataset, concatenating all out of fold predictions, and calculating the PR Curves for each class + averaged PR Curve. It's a one-line function that takes care of it all for you.
Precision Recall Curves
Disclaimer: Note that this uses the scikit-plot library, which I built.