Questions on ensemble technique in machine learning


I am studying ensemble methods in machine learning, and while reading some articles online I ran into two questions.
1.
In this article, it mentions
Instead, model 2 may have a better overall performance on all the data
points, but it has worse performance on the very set of points where
model 1 is better. The idea is to combine these two models where they
perform the best. This is why creating out-of-sample predictions have
a higher chance of capturing distinct regions where each model
performs the best.
But I still don't get the point: why wouldn't training on all of the training data avoid the problem?
2.
From this article, in the prediction section, it mentions
Simply, for a given input data point, all we need to do is to pass it
through the M base-learners and get M number of predictions, and send
those M predictions through the meta-learner as inputs
But during training we use k-fold splits of the training data to train the M base learners, so should I also train the M base learners on all of the training data before predicting on new input?

Assume red and blue were the best models you could find.
One works better in region 1, the other on region 2.
Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.
Do the validation on the outside. You can overfit if you give the two inner models access to data that the meta model does not see.
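A minimal sketch of the stacking procedure discussed here, assuming scikit-learn and synthetic data. The choice of base learners and of logistic regression as the meta-learner is only illustrative and not taken from the articles. The meta-learner is fit on out-of-fold predictions, and the base learners are then refit on all of the training data for prediction time, which is also what scikit-learn's StackingClassifier does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two hypothetical base learners (M = 2).
base_learners = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold predictions: each training row is predicted by a copy of the
# model that did not see that row during fitting.
oof = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# The meta-learner only ever sees out-of-fold predictions.
meta = LogisticRegression().fit(oof, y_train)

# For prediction, refit each base learner on all of the training data.
fitted = [m.fit(X_train, y_train) for m in base_learners]
test_meta_features = np.column_stack(
    [m.predict_proba(X_test)[:, 1] for m in fitted]
)
stacked_pred = meta.predict(test_meta_features)
```

The held-out X_test is used only at the very end, which is the "validation on the outside".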

The idea behind ensembles is that a group of weak predictors can outperform a single strong predictor. If we train different models that make different predictions and take the majority vote as the ensemble's final result, that result is often better than what one single model achieves. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. A single classifier may then either overfit or produce inaccurate results.
You can read this tutorial to learn more about ensembles and bagging and boosting.

1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We hold that data out for validation purposes, just as we do in k-fold cross-validation.
2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give the same data to all the learners, their outputs would all be the same and there would be no use in creating them. So we give a subset of the data to each learner.

For question 1, I will show by contradiction why we train two models.
Suppose you train one model on all the data points. During training, whenever the model sees a data point belonging to the red class, it tries to fit itself so that it can classify red points with minimal error. The same is true for data points belonging to the blue class. So during training the model is pulled towards one kind of data point or the other (red or blue). In the end the model tries to fit itself so that it does not make many mistakes on either kind of point, and the final model is an average model.
If you instead train two models on the two different datasets, each model is trained on a specific dataset and does not have to care about data points that belong to the other class.
The following metaphor should make it clearer.
Suppose there are two people who specialize in two completely different jobs. When a job comes in, you tell them that both of them have to work on it and that each must do 50% of it. Think about what kind of result you will get. Now think about what the result would be if you told each person to work only on the job they are best at.

For question 2, you have to split the training dataset into M datasets and, during training, give the M datasets to the M base learners.


In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? I have checked several tutorials (e.g. this one, this one) and several questions on SO (e.g. this one), but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, since there are data from 250 participants, if I apply shap to each model there will be 250 outputs! So I want a single output that presents the performance of the 250 models.

You seem to be training a model on 250 data points while doing LOOCV. That is about choosing a model with hyperparameters that ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters (note that LOOCV with 250 folds is already overkill; would you do that with 250'000 rows?). You are rather trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
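A rough sketch of the "average the per-fold matrices" idea mentioned above, on synthetic data. To keep it self-contained it does not call shap: a linear model is used, and coef * (x - training mean) serves as a stand-in attribution, which is my assumption and not something from the question. For a tree model such as XGBRegressor, that line would instead be something like shap.Explainer(model)(X).values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
n_samples, n_features = 20, 4  # small stand-in for the 250 participants
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

fold_attributions = []
for train_idx, _ in LeaveOneOut().split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # Stand-in attribution for a linear model: coef * (x - training mean).
    # With the shap package and a tree model, this would be replaced by
    # shap.Explainer(model)(X).values (not run here).
    attributions = model.coef_ * (X - X[train_idx].mean(axis=0))
    fold_attributions.append(attributions)

# One matrix instead of one per fold: average over the LOOCV models.
mean_attributions = np.mean(fold_attributions, axis=0)      # (n_samples, n_features)
# A single global ranking: mean absolute attribution per feature.
global_importance = np.abs(mean_attributions).mean(axis=0)  # (n_features,)
```

If feature selection drops different columns in different folds, the per-fold matrices would first have to be aligned to a common set of columns before averaging.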

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data. The problem is that I am using cross-validation, and I can't find a reference on how to adjust my model to use data augmentation. My cross-validation indexes my data more or less by hand.
There are articles and general content about data augmentation, but very little, and nothing general, on combining cross-validation with data augmentation.
I need to augment the training data by simply rotating and zooming, cross-validate for the best weights, and save them, but I don't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?

When training machine learning models, you should not test the model on samples used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for all parts one by one. This way you essentially test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7. This depends on the amount of labeled data at one's disposal: if you have lots of data, you can afford to leave less data for training the model and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, as the training part grows, but the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models that have similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test part. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it to the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting data in a linear way can lead to unbalanced datasets, and it is not the best way of splitting the data. You should consider the functions I've mentioned above.
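A small sketch of the split-first, augment-second order described above, assuming NumPy arrays of images and scikit-learn's KFold. The augment helper is hypothetical and uses a 90-degree rotation and a flip as simple stand-ins for rotation and zoom; the model fitting step is left as a comment.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
images = rng.random((20, 8, 8))        # 20 toy "images", illustration only
labels = rng.integers(0, 2, size=20)


def augment(batch):
    """Hypothetical augmentation: a 90-degree rotation and a horizontal flip."""
    rotated = np.rot90(batch, k=1, axes=(1, 2))
    flipped = batch[:, :, ::-1]
    return np.concatenate([rotated, flipped])


fold_sizes = []
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(images):
    x_train, y_train = images[train_idx], labels[train_idx]
    x_test, y_test = images[test_idx], labels[test_idx]

    # Augment only the training part of this fold, after the split,
    # so no augmented copy of a test image can end up in training.
    x_train_aug = np.concatenate([x_train, augment(x_train)])
    y_train_aug = np.concatenate([y_train, y_train, y_train])

    # Fit the model on (x_train_aug, y_train_aug) here and evaluate it
    # on the untouched (x_test, y_test).
    fold_sizes.append((len(x_train_aug), len(x_test)))
```

Each fold trains on three times as many images as the raw training part, while the test part keeps its original size.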

Will it lead to Overfitting / Curse of Dimensionality

Dataset contains :
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?

Yes, you can apply an ML model, but first you need to understand your problem statement and all of the feature names available in the dataset. If you have a big dataset, try to convert it into a cluster of 2, or else take a small sample of it to analyze what your data says.
That is why population and sampling come into practical use.
You have to check whether the accuracy on the training set and on the test set is about the same. If it is not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.

No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
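The sanity check described above could look roughly like this, assuming scikit-learn. The data is a scaled-down synthetic stand-in (more features than training rows) and Ridge is just an example model; neither comes from the question.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
# Scaled-down synthetic stand-in for a wide dataset.
X = rng.normal(size=(150, 300))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=150)

scores = cross_validate(
    Ridge(alpha=1.0), X, y, cv=5, scoring="r2", return_train_score=True
)
train_r2 = scores["train_score"].mean()  # performance on data the model saw
test_r2 = scores["test_score"].mean()    # performance on held-out folds
gap = train_r2 - test_r2                 # a large gap suggests overfitting

print(f"train R2={train_r2:.3f}  test R2={test_r2:.3f}  gap={gap:.3f}")
```

A high training score together with a much lower held-out score is the overfitting pattern this answer describes.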

Best way to scale across different datasets

I have come across a peculiar situation when preprocessing data.
Let's say I have a dataset A. I split the dataset into A_train and A_test. I fit any of the available scalers (scikit-learn) on A_train and transform A_test with that scaler. Now training the neural network with A_train and validating on A_test works well. No overfitting and performance is good.
Let's say I have dataset B with the same features as in A, but with different ranges of values for the features. A simple example of A and B could be Boston and Paris housing datasets respectively (This is just an analogy to say that features ranges like the cost, crime rate, etc vary significantly ). To test the performance of the above trained model on B, we transform B according to scaling attributes of A_train and then validate. This usually degrades performance, as this model is never shown the data from B.
The peculiar thing is if I fit and transform on B directly instead of using scaling attributes of A_train, the performance is a lot better. Usually, this reduces performance if I test this on A_test. In this scenario, it seems to work, although it's not right.
Since I work mostly on climate datasets, training on every dataset is not feasible. Therefore I would like to know the best way to scale such different datasets with the same features to get better performance.
Any ideas, please.
PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from datasets and applied QuantileTransformer, it improved performance but could be better.

One possible solution could be like this.
1. Normalize (pre-process) dataset A such that the range of each feature is within a fixed interval, e.g., [-1, 1].
2. Train your model on the normalized set A.
3. Whenever you are given a new dataset like B:
(3.1) Normalize the new dataset such that the features have the same range as they have in A ([-1, 1]).
(3.2) Apply your trained model (step 2) to the normalized new set (3.1).
As you have a one-to-one mapping between set B and its normalized version, you can read off the predictions for set B from the predictions on the normalized set B.
Note that you do not need access to set B in advance (or to such sets, if there are hundreds of them). You normalize each one as soon as you are given it and want to test your trained model on it.
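The steps above can be sketched with scikit-learn's MinMaxScaler, assuming synthetic arrays A and B with the same features but very different value ranges. For contrast, the sketch also shows what happens when A's scaler is reused on B.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins: same 3 features, very different value ranges.
A = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
B = rng.normal(loc=500.0, scale=80.0, size=(60, 3))

# Step 1: normalize A into [-1, 1].
scaler_a = MinMaxScaler(feature_range=(-1, 1)).fit(A)
A_scaled = scaler_a.transform(A)

# Step 3.1: normalize B on its own, so its features also span [-1, 1].
B_own = MinMaxScaler(feature_range=(-1, 1)).fit_transform(B)

# For contrast: reusing A's scaler leaves B far outside the range
# the model was trained on.
B_with_a = scaler_a.transform(B)

print(B_own.min(), B_own.max())
print(B_with_a.min(), B_with_a.max())
```

Scaling B on its own only matches the ranges. It assumes that the relative position of a value within its own dataset is what carries the signal, which may or may not hold for a given problem.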

Is the test set used to update weights in a deep learning model with Keras?

I'm wondering whether the results on the test set are used to optimize the model's weights. I'm trying to build a model, but the issue is that I don't have much data because the subjects are medical research patients. The number of patients is limited in my case (61), and I have 5 feature vectors per patient. What I tried is to create a deep learning model by excluding one subject and using the excluded subject as the test set. My problem is that there is large variability in the subject features, and my model fits the training set (60 subjects) well but not the 1 excluded subject.
So I'm wondering whether the test set (in my case the excluded subject) could be used in some way to make the model converge so that it better classifies the excluded subject?

You should not use the test data of your dataset in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation, so I highly recommend using this technique in your training process. How to use Deep Learning when you have Limited Data is one of the good tutorials about data augmentation.

No, you shouldn't use your test set for training, in order to prevent overfitting. If you follow cross-validation principles, you need to split your data into exactly three datasets: a training set, which you use to train your model; a validation set, to test different values of your hyperparameters; and a test set, to finally test your model. If you use all your data for training, your model will obviously overfit.
One thing to remember: deep learning works well if you have a large and very rich dataset.
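A sketch of the leave-one-subject-out setup from the question, assuming scikit-learn's LeaveOneGroupOut and synthetic data with 61 subjects and 5 feature vectors each. LogisticRegression is only a lightweight stand-in for the Keras model; the point is that the excluded subject's vectors never take part in fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
n_subjects, vectors_per_subject, n_features = 61, 5, 8

# One group id per subject, repeated for each of that subject's vectors.
groups = np.repeat(np.arange(n_subjects), vectors_per_subject)
subject_label = np.arange(n_subjects) % 2          # synthetic labels
y = np.repeat(subject_label, vectors_per_subject)
X = rng.normal(size=(len(y), n_features)) + y[:, None]

held_out_acc = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # The excluded subject contributes nothing to the training fold.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    held_out_acc.append(model.score(X[test_idx], y[test_idx]))

mean_acc = float(np.mean(held_out_acc))
print(f"mean accuracy over {len(held_out_acc)} held-out subjects: {mean_acc:.3f}")
```

Averaging over all 61 held-out subjects gives a more stable picture than the result on a single excluded subject.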
