Best way to scale across different datasets - python

I have come across a peculiar situation when preprocessing data.
Let's say I have a dataset A. I split it into A_train and A_test, fit one of the scikit-learn scalers on A_train, and transform A_test with that scaler. Training the neural network on A_train and validating on A_test works well: no overfitting and performance is good.
Let's say I have a dataset B with the same features as A, but with different value ranges for those features. A simple example of A and B could be the Boston and Paris housing datasets respectively (this is just an analogy to say that feature ranges such as cost, crime rate, etc. vary significantly). To test the performance of the trained model on B, we transform B according to the scaling attributes of A_train and then validate. This usually degrades performance, as the model has never seen data from B.
The peculiar thing is that if I fit and transform on B directly, instead of using the scaling attributes of A_train, the performance is a lot better. Usually this reduces performance when I test on A_test, but in this scenario it seems to work, although it doesn't feel right.
Since I work mostly on climate datasets, training on every dataset is not feasible. Therefore I would like to know the best way to scale such different datasets with the same features to get better performance.
Any ideas, please.
PS: I know training my model with more data can improve performance, but I am more interested in the right way of scaling. I tried removing outliers from the datasets and applied a QuantileTransformer; it improved performance, but it could be better.

One possible solution could be like this.
Normalize (pre-process) dataset A so that the range of each feature lies within a fixed interval, e.g. [-1, 1].
Train your model on the normalized set A.
Whenever you are given a new dataset like B:
(3.1) Normalize the new dataset so that its features have the same range as they had in A ([-1, 1]).
(3.2) Apply your trained model (step 2) on the normalized new set (3.1).
Since you have a one-to-one mapping between set B and its normalized version, you can read off the predictions on set B from the predictions on the normalized set B.
Note that you do not need access to set B (or hundreds of such sets) in advance. You normalize each one as soon as you are given it and want to test your trained model on it.
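A minimal sketch of this per-dataset normalization using scikit-learn's MinMaxScaler; model, A_train, y_A_train and B are hypothetical placeholders standing in for your own objects:

from sklearn.preprocessing import MinMaxScaler

def fit_on_A_predict_on_B(model, A_train, y_A_train, B):
    """Steps 1-3 above: scale each dataset to [-1, 1] with its own scaler."""
    scaler_A = MinMaxScaler(feature_range=(-1, 1)).fit(A_train)
    model.fit(scaler_A.transform(A_train), y_A_train)

    # A new dataset B gets its own scaler, so its features land in the same
    # [-1, 1] interval the model saw during training.
    scaler_B = MinMaxScaler(feature_range=(-1, 1)).fit(B)
    return model.predict(scaler_B.transform(B))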

In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to explain a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and several questions (e.g. this one) on SO, but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply SHAP to each model there will be 250 outputs! Thus, I want a single output that summarizes the 250 models.
You seem to train a model on 250 data points while doing LOOCV. This is about choosing a model with hyperparams that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams (note that 250-fold LOOCV is already overkill; will you do that with 250,000 rows?). Rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
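If you do want a single output, one common approach (an assumption here, not something the answer above prescribes) is to explain only the held-out sample in each LOOCV iteration and stack the per-sample SHAP values into one matrix; this only works if the feature set is the same in every iteration. A minimal sketch with toy data:

import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

X, y = make_regression(n_samples=50, n_features=5, random_state=0)  # toy stand-in data

shap_rows, data_rows = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor().fit(X[train_idx], y[train_idx])
    explainer = shap.Explainer(model)
    shap_rows.append(explainer(X[test_idx]).values)  # SHAP values for the held-out sample
    data_rows.append(X[test_idx])

all_shap = np.vstack(shap_rows)        # shape: (n_samples, n_features)
all_data = np.vstack(data_rows)
shap.summary_plot(all_shap, all_data)  # one plot summarizing all per-iteration models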

How to use data augmentation with cross validation

I need to use data augmentation on what would be my training data in each cross-validation split. The problem is that I am using cross-validation, so I can't find a reference on how to adjust my model to use data augmentation. My cross-validation is done somewhat by hand, indexing my data manually.
There are articles and general content about data augmentation, but very little of it generalizes to cross-validation with data augmentation.
I need to apply data augmentation to the training data by simply rotating and adding zoom, cross-validate for the best weights, and save them, but I wouldn't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on samples used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for each part in turn. This way you essentially test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross-validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data for training and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve (as the training part grows) and the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap of training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you trained and train a new model on all the available data, assuming its performance will be close to the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test set. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it to the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine; see the sketch below.
Please note that splitting data in a linear way can lead to unbalanced datasets, and it is not the best way of splitting the data. You should consider the functions I mentioned above.
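A minimal sketch of that procedure (not the asker's exact code): the rotation/zoom augmentation is applied only to the training fold inside each split, and the best weights per fold are checkpointed. build_model, X (a rank-4 array of images) and y are hypothetical placeholders.

import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def cross_validate_with_augmentation(X, y, build_model, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        x_tr, y_tr = X[train_idx], y[train_idx]
        x_te, y_te = X[test_idx], y[test_idx]

        # Augment only the training part: random rotation and zoom.
        datagen = ImageDataGenerator(rotation_range=20, zoom_range=0.15)
        train_flow = datagen.flow(x_tr, y_tr, batch_size=32)

        model = build_model()  # hypothetical factory returning a model compiled with an accuracy metric
        ckpt = ModelCheckpoint(f"fold_{fold}_best.h5", monitor="val_accuracy",
                               save_best_only=True)
        model.fit(train_flow, epochs=20, validation_data=(x_te, y_te),
                  callbacks=[ckpt], verbose=0)
        scores.append(model.evaluate(x_te, y_te, verbose=0)[1])
    return scores  # one accuracy per fold; the checkpoints hold each fold's best weights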

How to balance a dataset without oversampling

I am trying to balance my dataset, but I am struggling to find the right way to do it. Let me set up the problem. I have a multiclass dataset with the following class weights:
class weight
2.0 0.700578
4.0 0.163401
3.0 0.126727
1.0 0.009294
As you can see the dataset is pretty unbalanced. What I would like to do is to obtain a balanced dataset in which each class is represented with the same weight.
There are a lot of questions about this, but:
Scikit-learn balanced subsampling: these subsamples can overlap, which is wrong for my approach. Moreover, I would like to do this using sklearn or other well-tested packages.
How to perform undersampling (the right way) with python scikit-learn?: here they suggest using the unbalanced dataset with a balanced class-weight vector; however, I need the balanced dataset itself, it is not a matter of which model and which weights.
https://github.com/scikit-learn-contrib/imbalanced-learn: a lot of questions refer to this package. Below is an example of how I am trying to use it.
Here is the example:
from imblearn.ensemble import EasyEnsembleClassifier
eec = EasyEnsembleClassifier(random_state=42, sampling_strategy='not minority', n_estimators=2)
eec.fit(data_for, label_all.loc[data_for.index,'LABEL_O_majority'])
new_data = eec.estimators_samples_
However, the returned indices are all the indices of the initial data, repeated n_estimators times.
Here is the result:
[array([ 0, 1, 2, ..., 1196, 1197, 1198]),
array([ 0, 1, 2, ..., 1196, 1197, 1198])]
Finally, a lot of techniques use oversampling, but I would like not to use them. Only for class 1 can I tolerate oversampling, as it is very predictable.
I am wondering whether sklearn, or this contrib package, really does not have a function that does this.
Based on my experience, under-sampling doesn't always help, as we are not utilizing all the available data, and this approach might lead to a lot of overfitting. The Synthetic Minority Over-sampling Technique (SMOTE) has worked well with most types of data (both structured and unstructured data such as images), although it can sometimes be slow. It is easy to use and available through imblearn. In case you want to try oversampling techniques, this particular article might help: https://medium.com/@adib0073/how-to-use-smote-for-dealing-with-imbalanced-image-dataset-for-solving-classification-problems-3aba7d2b9cad
But for undersampling, as mentioned in the comments above, you would have to slice your dataframe or array of the majority class to match the size of the minority class; a sketch is shown below.
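A minimal pandas sketch of that manual undersampling, assuming a DataFrame df with a hypothetical 'label' column; imbalanced-learn's RandomUnderSampler(sampling_strategy='not minority') achieves the same effect via fit_resample(X, y) if you prefer a tested package.

import pandas as pd

def undersample(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Randomly drop rows from the larger classes until every class matches the smallest one."""
    n_min = df[label_col].value_counts().min()
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=seed)))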
Try applying iterative-stratification.
From On the Stratification of Multi-label Data:
Stratified sampling is a sampling method that takes into account the existence of disjoint groups within a population and produces samples where the proportion of these groups is maintained. In single-label classification tasks, groups are differentiated based on the value of the target variable. In multi-label learning tasks, however, where there are multiple target variables, it is not clear how stratified sampling could/should be performed. This paper investigates stratification in the multi-label data context. It considers two stratification methods for multi-label data and empirically compares them along with random sampling on a number of datasets and based on a number of evaluation criteria. The results reveal some interesting conclusions with respect to the utility of each method for particular types of multi-label datasets.

Questions on ensemble technique in machine learning

I am studying ensemble machine learning, and when I read some articles online I encountered two questions.
1.
In this article, it mentions
Instead, model 2 may have a better overall performance on all the data points, but it has worse performance on the very set of points where model 1 is better. The idea is to combine these two models where they perform the best. This is why creating out-of-sample predictions have a higher chance of capturing distinct regions where each model performs the best.
But I still cannot get the point: why wouldn't training on all the training data avoid the problem?
2.
From this article, in the prediction section, it mentions
Simply, for a given input data point, all we need to do is to pass it through the M base-learners and get M number of predictions, and send those M predictions through the meta-learner as inputs
But in the training process we use k-fold training data to train the M base-learners, so should I also train the M base-learners on all the training data before using them for prediction?
Assume red and blue were the best models you could find.
One works better in region 1, the other in region 2.
Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.
Do the validation on the outside. You can overfit if you give the two inner models access to data that the meta model does not see.
The idea in ensembles is that a group of weak predictors outperforms a strong predictor. So, if we train different models with different predictive results and use majority rule as the final result of our ensemble, this result is better than just trying to train one single model. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. Then using a single classifier can either overfit or produce inaccurate results.
You can read this tutorial to learn more about ensembles and bagging and boosting.
1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We will hold that data for validation purpose, just like the way we do in K-fold
2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give same data to all the learners then the output of all of them would be same and there is no use in creating them. So we will give a subset of data to each learner.
For question 1, I will argue by contradiction why we train two models.
Suppose you train a single model with all the data points. During training, whenever the model sees a data point belonging to the red class, it will try to fit itself so that it can classify red points with minimal error. The same is true for data points belonging to the blue class. Therefore, during training the model leans towards whichever data points it is currently seeing (either red or blue), and at the end it tries to fit itself so that it does not make too many mistakes on either kind of data point; the final model is an average model.
But if instead you train two models on the two different datasets, then each model is trained on a specific dataset and does not have to care about data points that belong to the other class.
It will be clearer with the following metaphor.
Suppose there are two people who are specialized in two completely different jobs. Now, when a job comes in, if you tell them that both of them have to do it and each needs to do 50% of the work, think about what kind of result you will get. Now also think about what the result would be if you told them that each person should work only on the job at which they are best.
In question 2 you have to split the training dataset into M datasets and, during training, give the M datasets to the M base learners.
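For the mechanics behind question 2, scikit-learn's StackingClassifier is one concrete implementation (not necessarily what the quoted articles use): the meta-learner is trained on out-of-fold predictions from the base learners, the base learners are then refit on the full training data, and at prediction time a point passes through the base learners and then the meta-learner. A minimal sketch with toy data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # toy stand-in data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,  # k-fold used to build the out-of-fold predictions for the meta-learner
)
stack.fit(X_tr, y_tr)           # base learners are refit on all of X_tr afterwards
print(stack.score(X_te, y_te))  # each test point goes through both base learners, then the meta-learner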

How to calculate probability(confidence) of SVM classification for small data set?

Use case:
I have a small dataset with about 3-10 samples in each class. I am using sklearn's SVC with an RBF kernel to classify them.
I need the confidence of the prediction along with the predicted class, so I used the predict_proba method of SVC.
I was getting weird results with that. I searched a bit and found out that it only makes sense for larger datasets.
I found this question on Stack Overflow: Scikit-learn predict_proba gives wrong answers.
The author of that question verified this by multiplying, i.e. duplicating, the dataset.
My questions:
1) If I multiply my dataset by, let's say, 100, so that each sample appears 100 times, it increases the "correctness" of predict_proba. What side effects will it have? Overfitting?
2) Is there any other way I can calculate the confidence of the classifier, like the distance from the hyperplane?
3) For this small sample size, is SVM a recommended algorithm or should I choose something else?
First of all: Your data set seems very small for any practical purposes. That being said, let's see what we can do.
SVMs are mainly popular in high-dimensional settings. It is currently unclear whether that applies to your project. They build planes on a handful of (or even single) supporting instances, and in situations with large training sets they are often outperformed by neural nets. A priori they might not be your worst choice.
Oversampling your data will do little for an SVM-based approach. SVM is based on the notion of support vectors, which are basically the outliers of a class that define what is in the class and what is not. Oversampling will not construct new support vectors (I am assuming you are already using the train set as the test set).
Plain oversampling in this scenario will also not give you any new information on confidence, other than artifacts constructed by unbalanced oversampling, since the instances will be exact copies and no distribution changes will occur. You might be able to find some information by using SMOTE (Synthetic Minority Oversampling Technique): you generate synthetic instances based on the ones you have. In theory this provides new instances that are not exact copies of the ones you have and might thus fall a little outside the usual classification. Note: by definition, all these examples will lie in between the original examples in your sample space. That does not mean they will lie in between them in your projected SVM space, so you may end up learning effects that aren't really true.
Lastly, you can estimate confidence with the distance to the hyperplane. Please see: https://stats.stackexchange.com/questions/55072/svm-confidence-according-to-distance-from-hyperline
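A minimal sketch of that last option, on toy data: SVC's decision_function gives the signed distance to the separating boundary, which can be used to rank predictions by confidence (it is not a calibrated probability).

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20, n_features=4, random_state=0)  # tiny set, as in the question
clf = SVC(kernel="rbf").fit(X, y)

distances = clf.decision_function(X)  # signed distance to the decision boundary
# The larger |distance|, the further a sample lies from the boundary,
# i.e. the more "confident" the classifier is about that prediction.
for pred, dist in zip(clf.predict(X), distances):
    print(pred, round(float(dist), 3))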
