How to check machine learning accuracy without cross validation - python

I have training sample X_train, and Y_train to train and X_estimated.
I got task to make my classificator learn as accurate as it can, and then predict vector of results over X_estimated to get close results to Y_estimated (which i have now, and I have to be as much precise as it can). If I split my training data to like 75/25 to train and test it, I can get accuracy using sklearn.metrics.accuracy_score and confusion matrix. But I am losing that 25% of samples, that would make my predictions more accurate.
Is there any way, I could learn by using 100% of the data, and still be able to see accuracy score (or percentage), so I can predict it many times, and save best (%) result?
I am using random forest with 500 estimators, and usually get like 90% accuracy. I want to save best prediction vector as possible for my task, without splitting any data (not wasting anything), but still be able to calculate accuracy (so I can save best prediction vector) from multiple attempts (random forest always shows different results)
Thank you

Splitting your data is critical for evaluation.
There is no way that you could train your model on 100% of the data and be able to get a correct evaluation accuracy unless you expand your dataset. I mean, you could change your train/test split, or try to optimize your model in other ways, but i guess the simple answer to your question would be no.

As per your requirement, you can try K Fold Cross Validation. If you split it in 90|10 i.e for Train|Test. Achieving to take 100% data for training is not possible as you have to test the data then only you can validate the same that how good your model is. K Fold CV takes your whole train data into consideration in each fold and randomly takes test data sample from the train data. And lastly calculates the accuracy by taking summation of all the folds. Then finally you can test the accuracy by using 10% of the data.
More you can read here and here
K Fold Cross Validation
Skearn provides simple methods for performing K fold cross validation. Simply you have to pass no of folds in the method. But then remember, more the folds, it takes more time to train the model. More you can check here

It is not necessary to do 75|25 split of your data all the time. 75
|25 is kind of old school now. It greatly depends on the amount of data that you have. For example, if you have 1 billion sentences for training a language model, it is not necessary to reserve 25% for testing.
Also, I second the previous answer of trying K-fold cross-validation. As a side note, you could consider looking at the other metrics like precision and recall as well.

In general splitting your data set is critical for evaluation. So I would recommend you always do that.
Said that, there are methods that in some sense allow you to train on all your data and still get an estimate of your performance or to estimate the generalization accuracy.
One particularly prominent method is leveraging out-of-bag samples of models based on bootstrapping, i.e. RandomForests.
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, oob_score=True)
rf.fit(X, y)
print(rf.oob_score_)

if you are doing classification always go with stratified k-fold cv(https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/).
if you're doing regression then go with simple k-fold cv or you can divide the target as bins and do stratified k-fold cv. by this way you can use your data completely in model training.

Related

GridSearchCV does not improve my test accuracy

I am making multiple classifier models and the test accuracy for all of them is 0.508.
I find it weird that multiple models have the same accuracy. The models I used are Logistic Regressor,DesicionTreeClassifier, MLPClassifier, RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, XGBClassifier, SVC, and VotingClassifier.
After using GridSearchCV to improve the models, all of their test accuracy scores improved. But the test accuracy scores did not change.
I wish I could say I changed something, but I don't know why the test scores did not change. After using gridsearch, I expected the test scores to improve but it didn't
I would like to confirm, you mean your training scores improve but you testing scores did not change? If yes, there are a lot of possibility behind this.
You might want to reconfigure and add your hyper parameter range for example if using KNN you can increase the number of k or by adding more distance metric calculation
If you want to you can change the hyper parameter optimization technique like randomized search or bayesian search
I don't have any information about your data but sometimes turn on or turn off the shuffle mode when splitting can affect the scores for instance if you have time series data you have not to shuffle the dataset
There can be several reasons why the test accuracy didn't change after using GridSearchCV:
The best parameters found by GridSearchCV might not be optimal for the test data.
The test data may have a different distribution than the training data, leading to low test accuracy.
The models might be overfitting to the training data and not generalizing well to the test data.
The test data size might be small, leading to high variance in test accuracy scores.
The problem itself might be challenging, and a test accuracy of 0.508 might be the best that can be achieved with the current models and data.
It would be useful to have more information about the data, the problem, and the experimental setup to diagnose the issue further.
Looking at your accuracy, first of all I would say: are you performing a binary classification task? Because if it is the case, your models are almost not better than random on the test set, which may suggest that something is wrong with your training.
Otherwise, GridSearchCV, like RandomSearchCV and other hyperparameters optimization techniques try to find optimal parameters among a range that you define. If, after optimization, your optimal parameter has the value of one bound of your range, it may suggest that you need to explore beyond this bound, that is to say set another range on purpose and run the optimization again.
By the way, I don't know the size of your dataset but if it is big I would recommend you to use RandomSearchCV instead of GridSearchCV. As it is not exhaustive, it takes less time and gives results that are (nearly) optimized.

Classification: Tweet Sentiment Analysis - Order of steps

I am currently working on a tweet sentiment analysis and have a few questions regarding the right order of the steps. Please assume that the data was already preprocessed and prepared accordingly. So this is how I would proceed:
use train_test_split (80:20 ratio) to withhold a test
data set.
vectorize x_train since the tweets are not numerical.
In the next steps, I would like to identify the best classifier. Please assume those were already imported. So I would go on by:
hyperparameterization (grid-search) including a cross-validation approach.
In this step, I would like to identify the best parameters of each
classifier. For KNN the code is as follows:
model = KNeighborsClassifier()
n_neighbors = range(1, 10, 2)
weights = ['uniform', 'distance']
metric = ['euclidean', 'manhattan', 'minkowski']
# define grid search
grid = dict(n_neighbors=n_neighbors, weights=weights ,metric=metric)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(train_tf, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
compare the accuracy (depending on the best hyperparameters) of the classifiers
choose the best classifier
take the withheld test data set (from train_test_split()) and use the best classifier on the test data
Is this the right approach or would you recommend changing something (e. g. doing the cross-validation alone and not within the hyperparametrization)? Does it make sense to test the test data as the final step or should I do it earlier to assess the accuracy for an unknown data set?
There are lots of ways to do this and people have strong opinions about it and I'm not always convinced they fully understand what they advocate.
TL;DR: Your methodology looks great and you're asking sensible questions.
Having said that, here are some things to consider:
Why are you doing train-test split validation?
Why are you doing hyperparameter tuning?
Why are you doing cross-validation?
Yes, each of these techniques are good at doing something specific; but that doesn't necessarily mean they should all be part of the same pipeline.
First off, let's answer these questions:
Train-Test Split is useful for testing your classifier's inference abilities. In other words, we want to know how well a classifier performs in general (not on the data we used for training). The test portion allows us to evaluate our classifier without using our training portion.
Hyperparameter-Tuning is useful for evaluating the effect of hyperparameters on the performance of a classifier. For it to be meaningful, we must compare two (or more) models (using different hyperparameters) but trained preferably using the same training portion (to eliminate selection bias). What do we do once we know the best performing hyperparameters? Will this set of hyperparameters always perform optimally? No. You will see that, due to the stochastic nature of classification, one hyperparameter set may work best in experiment A then another set of hyperparameters may work best on experiment B. Rather, hyperparameter tuning is good for generalizing about which hyperparameters to use when building a classifier.
Cross-validation is used to smooth out some of the stochastic randomness associated with building classifiers. So, a machine learning pipeline may produce a classifier that is 94% accurate using 1 test-fold and 83% accuracy using another test-fold. What does it mean? It might mean that 1 fold contains samples that are easy. Or it might mean that the classifier, for whatever reason, is actually better. You don't know because it's a black box.
Practically, how is this helpful?
I see little value in using test-train split and cross-validation. I use cross-validation and report accuracy as an average over the n-folds. It is already testing my classifier's performance. I don't see why dividing your training data further to do another round of train-test validation is going to help. Use the average. Having said that, I use the best performing model of the n-fold models created during cross-validation as my final model. As I said, it's black-box, so we can't know which model is best but, all else being equal, you may as well use the best performing one. It might actually be better.
Hyperparameter-tuning is useful but it can take forever to do extensive tuning. I suggest adding hyperparameter tuning to your pipeline but only test 2 sets of hyperparameters. So, keep all your hyperparameters constant except 1. e.g. Batch size = {64, 128}. Run that, and you'll be able to say with confidence, "Oh, that made a big difference: 64 works better than 128!" or "Well, that was a waste of time. It didn't make much difference either way." If the difference is small, ignore that hyperparameter and try another pair. This way, you'll slowly tack towards optimal without all the wasted time.
In practice, I'd say leave the extensive hyperparameter-tuning to academics and take a more pragmatic approach.
But yeah, you're methodology looks good as it is. I think you thinking about what you're doing and that already puts you a step ahead of the pack.

python- Best techniques to split datase to get high performance accuracy

I have applied these 4 methods:
Train and Test Sets.
K-fold Cross Validation.
Leave One Out Cross
Validation. Repeated Random Test-Train Splits.
The method "Train and Test Sets" achieve high accuracy but the remaining methods achieve same accuracy but lower then first approach.
I want to know which method should I choose?
Each of Train and Test Sets and Cross Validation used in certain case,Cross Validation used if you want to compare different models.Accuracy always increase if you use bigger training data that's why sometimes Leave One Out Cross perform better than K-fold Cross Validation,it's depends on your dataset size and sometimes on algorithm you are using.On the other hand Train and Test Sets usually used if you aren't comparing diffrent models, and if the time requirements for running the cross validation aren't worth it,mean it's not needed to make Cross Validation in this case.In most cases Cross Validation is preferred,but, what method you should choose? this usually depend on your choices while training your data such way you handle data and algorithm such you are trainning data using Random Forests usually it's not needed to do Cross Validation but you can and do it in case need more you usually not doing Cross Validation in Random Forests when you use Out of Bag estimate .
Training a model comprises tuning model accuracy as well as model generalization. If model is not generalized it may be Underfit or Overfit model.
In this case, model may perform better on training data but accuracy may decrease on test or unknown data.
We use training data to improve the accuracy of model. As training data size increases model accuracy may also increase.
Similarly we use different training samples to generalize the model.
So Train-Test splitting methods depend on the size of available data and algorithm used for model design.
First train-test method has a fix size training and testing data. So on each iteration, we use same train data to train model and same test data for model's accuracy assessment.
Second k-fold method has fix size train and test data but on each iteration, test and train data changes. So it may be a better approach irrespective of data size.
Leave one out approach is useful only if data size is small. Here we use almost whole data for training purpose. So training accuracy of model will be better but may not be a generalized model.
Randomised Train-test method is also a good approach for training and testing model's performance. Here we randomly select train and test data each time. So it may perform better than Leave one out method if data size is small.
And last each splitting approach has some pros and cons. So it depends on you which splitting method is good to your model. It also depends on data size and data selection means how we are selecting data from sample while splitting.

How LeaveOneOut validation works in GridSearchCV? [duplicate]

It seems that GridSearchCV of scikit-learn collects the scores of its (inner) cross-validation folds and then averages across the scores of all folds. I was wondering about the rationale behind this. At first glance, it would seem more flexible to instead collect the predictions of its cross-validation folds and then apply the chosen scoring metric to the predictions of all folds.
The reason I stumbled upon this is that I use GridSearchCV on an imbalanced data set with cv=LeaveOneOut() and scoring='balanced_accuracy' (scikit-learn v0.20.dev0). It doesn't make sense to apply a scoring metric such as balanced accuracy (or recall) to each left-out sample. Rather, I would want to collect all predictions first and then apply my scoring metric once to all predictions. Or does this involve an error in reasoning?
Update: I solved it by creating a custom grid search class based on GridSearchCV with the difference that predictions are first collected from all inner folds and the scoring metric is applied once.
GridSearchCVuses the scoring to decide what internal hyperparameters to set in the model.
If you want to estimate the performance of the "optimal" hyperparameters, you need to do an additional step of cross validation.
See http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
EDIT to get closer to answering the actual question:
For me it seems reasonable to collect predictions for each fold and then score them all, if you want to use LeaveOneOut and balanced_accuracy. I guess you need to make your own grid searcher to do that. You could use model_selection.ParameterGrid and model_selection.KFold for that.

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poorly when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test set at random and the data set i pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximatly)
While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most importantly than anything else, you should consider adapting your input data. The very first thing I'd try is, assuming you are really trying to predict a price as your code implies, to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very very hard to fit that single object that was sold for 1 Million $ ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, in case reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical data for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!

Categories