Using a column for cross validation folds - python

I have a dataset with more than 100k rows and about 1k columns, including the target column, for a binary classification problem. I am using H2O GBM (latest 3.30.x.x) in Python with 5-fold cross validation and an 80-20 train-test split. I have noticed that H2O automatically stratifies the folds, which is good. The issue is that this whole dataset comes from one product, with several sub-products identified by a separate column (group). Each of these sub-products has a decent size of 5k to 10k rows, so I thought it would be worth checking a separate model on each of them. I am looking for a way to specify this sub-product column for cross validation in H2O model training. Currently I am looping over the sub-products while doing a train-test split, because it is not clear to me from the documentation I have read so far how to do it otherwise. Is there an option in H2O that lets me use this sub-product column directly for cross validation? That way I would have fewer model outputs to manage in my scripts.
I hope the question is clear. If not, let me know. Thank you.

The fold_column option does what you want; there are some brief examples in the docs:
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2o.grid.H2OGridSearch
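A minimal sketch of how that could look, assuming the file path and the column names "target" and "sub_product" are placeholders for your own data:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
df = h2o.import_file("product_data.csv")          # placeholder path
df["target"] = df["target"].asfactor()            # binary target
df["sub_product"] = df["sub_product"].asfactor()  # grouping / sub-product column

predictors = [c for c in df.columns if c not in ("target", "sub_product")]

gbm = H2OGradientBoostingEstimator(keep_cross_validation_predictions=True)
# fold_column replaces nfolds: each sub-product level becomes its own CV fold
gbm.train(x=predictors, y="target", training_frame=df, fold_column="sub_product")

print(gbm.cross_validation_metrics_summary())
```

With fold_column, H2O builds one cross-validation model per level of the sub-product column, so you get per-group metrics without looping over the groups yourself.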

Related

In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to explain a machine learning model? I have checked several tutorials (e.g. this one, this one) and also several SO questions (e.g. this one), but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap to each model, there will be 250 outputs! Thus, I want to get a single output that summarises the 250 models.
You seem to be training a model on 250 data points while doing LOOCV. This is about choosing a model, with hyperparameters, that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters (note, 250-fold LOOCV is already overkill; would you do that with 250,000 rows?). You are rather trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may differ, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
That doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
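If you do want a single picture out of LOOCV, one way is to collect the SHAP values of each held-out participant and stack them into one matrix. A minimal sketch under that assumption (the synthetic data, feature names and XGBRegressor settings are placeholders):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

# Synthetic stand-in for the 250-participant dataset
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(250, 5)), columns=[f"f{i}" for i in range(5)])
y = (2 * X["f0"] + rng.normal(size=250)).to_numpy()

shap_rows = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor(n_estimators=50).fit(X.iloc[train_idx], y[train_idx])
    explainer = shap.Explainer(model)
    # SHAP values for the single held-out participant of this fold
    shap_rows.append(explainer(X.iloc[test_idx]).values)

all_shap = np.vstack(shap_rows)   # shape: (250, n_features)
shap.summary_plot(all_shap, X)    # one plot summarising all 250 folds
```

Every row of the summary then comes from a model that never saw that participant, which is as close to a single "LOOCV explanation" as you can get.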

Splitting movielens data into train-validation-test datasets

I'm working on a recommender systems project written in Python using Bayesian Personalized Ranking optimization. I am pretty confident my model learns the data I provided well enough, but now it's time to find the exact model hyperparameters and try to avoid overfitting. Since the movielens dataset only provides 5-fold train-test splits without validation sets, I want to split the original dataset myself to validate my model.
Since the movielens dataset holds 943 users, each guaranteed to have rated at least 20 movies, I'm thinking of splitting the data so that both the TRAIN and TEST datasets contain the same number of users (i.e. all 943), distributing 80% of the implicit feedback data to TRAIN and the rest to TEST. After training, validation will take place using the mean Recall@k over all 943 users.
Is this the right way to split the dataset? I am curious because the original movielens test dataset doesn't seem to contain test data for all 943 users. If a certain user doesn't have any test data to predict, how do I evaluate using Recall@k, when doing so would cause a division by zero? Should I just skip that user and compute the mean over the remaining users?
Thanks for the lengthy read, I hope you're not as confused as I am.
I would split the whole data set into 80% (train), 10% (validation) and 10% (test). It should work out :)
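A minimal sketch of doing that split per user, assuming a `ratings` DataFrame with a "user_id" column (the column name is a placeholder). Splitting per user keeps every user in all three sets, which also avoids the Recall@k division-by-zero problem:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_user(ratings: pd.DataFrame, seed: int = 42):
    """80/10/10 split of each user's interactions, so all users appear everywhere."""
    train_parts, val_parts, test_parts = [], [], []
    for _, user_df in ratings.groupby("user_id"):
        # 80% train, then split the remaining 20% in half for validation/test
        train_u, rest = train_test_split(user_df, test_size=0.2, random_state=seed)
        val_u, test_u = train_test_split(rest, test_size=0.5, random_state=seed)
        train_parts.append(train_u)
        val_parts.append(val_u)
        test_parts.append(test_u)
    return pd.concat(train_parts), pd.concat(val_parts), pd.concat(test_parts)
```

With at least 20 interactions per user, even the smallest user ends up with a couple of held-out items in validation and test.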

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data. The problem is that I am using cross validation, so I can't find a reference for how to adjust my model to use data augmentation. My cross validation is somewhat indexing my data by hand.
There are articles and general content about data augmentation, but very little, and nothing general, about combining cross validation with data augmentation.
I need to apply data augmentation to the training data by simply rotating and adding zoom, cross validate for the best weights and save them, but I wouldn't know how.
The example can be copy-pasted for better reproducibility; in short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on samples that were used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (folds), use one part as the test set while training on all the rest, and repeat this procedure for each part in turn. This way you effectively test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross validation folds is usually 5 or 7; it depends on how much labeled data you have at your disposal. If you have lots of data, you can afford to leave less data for training and increase the test set size. The higher the number of folds, the better the accuracy estimate you can get, since the training part grows, but also the more time you have to invest in the procedure. In the extreme case you have the leave-one-out procedure: train on everything but one single sample, which makes the number of folds equal to the number of data samples.
So for 5-fold CV you train 5 different models, which have a large overlap in training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you trained and train a new model on all the available data, assuming its performance will be around the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test set. Each data point created from the training part should be used only for training, and the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it to the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting data in a linear way can lead to unbalanced data sets, and it is not the best way of splitting the data. You should consider the functions I mentioned above.
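A minimal sketch of that procedure, assuming a Keras image model; the toy data, the tiny network and the checkpoint file names are placeholders. Split first, augment (rotation + zoom) only the training fold, and let a checkpoint keep the best weights per fold:

```python
import numpy as np
from sklearn.model_selection import KFold
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint

# Stand-in images and binary labels; replace with your own data
X = np.random.rand(500, 32, 32, 3).astype("float32")
y = np.random.randint(0, 2, size=500)

def build_model():
    """Placeholder model factory; swap in your own architecture."""
    model = models.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, 3, activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(kf.split(X)):
    model = build_model()
    # Rotation + zoom applied only to the training fold, never to validation data
    aug = ImageDataGenerator(rotation_range=15, zoom_range=0.1)
    ckpt = ModelCheckpoint(f"fold{fold}_best.h5", monitor="val_accuracy",
                           save_best_only=True)
    model.fit(aug.flow(X[tr_idx], y[tr_idx], batch_size=32),
              validation_data=(X[va_idx], y[va_idx]),
              epochs=5, callbacks=[ckpt])
```

Because the generator is created inside the loop and only fed the training indices, no augmented copy of a validation image ever reaches the model during training.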

Running get_dummies on train and test data returns a different number of columns - is it ok to concat the two sets and split after feature engineering?

My train and test data sets are two separate csv files.
I've done some feature engineering on the test set and have used pd.get_dummies(), which works as expected.
Training classes:

| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
| Excelent  |
My issue is that there is a mismatch when I try to predict values, as the test set has a different number of columns after pd.get_dummies().
Test set:

| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
Notice that Excelent is missing! Overall, after creating dummies, I'm about 20 columns short of the training dataframe.
My question: is it acceptable to join train.csv and test.csv, run all my feature engineering, scaling, etc., and then split back into the two dataframes before the training phase?
Or is there another better solution?
It is acceptable to join the train and test as you say, but I would not recommend it.
In particular, because when you deploy the model and start scoring "real data", you don't get the chance to join it back to the train set to produce the dummy variables.
There are alternative solutions using the OneHotEncoder class from Scikit-learn, Feature-engine or Category encoders. These are all open source Python packages, with classes that implement the fit / transform functionality.
With fit, the class learns the dummy variables to create from the train set, and with transform it creates them. In the example you provide, the test set will also have 4 dummies, and the dummy "Excellent" will contain all 0s.
You can find examples of the OneHotEncoder from Scikit-learn, Feature-engine and Category encoders in the provided links.
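A minimal sketch with scikit-learn's OneHotEncoder, using the Condition column from the example above (the tiny frames stand in for the real train/test files):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Condition": ["Poor", "Ok", "Good", "Excelent"]})
test = pd.DataFrame({"Condition": ["Poor", "Ok", "Good"]})

# On older scikit-learn versions use sparse=False instead of sparse_output=False
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(train[["Condition"]])                 # learns all 4 categories from train only

train_dummies = enc.transform(train[["Condition"]])
test_dummies = enc.transform(test[["Condition"]])   # still 4 columns; "Excelent" is all 0s
print(enc.get_feature_names_out())            # same column layout for both sets
```

Because the encoder is fitted only on the train set and then reused, the same transformation can be applied to the test set and, later, to live data at scoring time.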

(How) can you train a model twice (multiple times) in sklearn using fit()?

Example: my data doesn't fit into memory, so can I do:

    model = my_model
    for i in range(20):
        model.fit(X_i, Y_i)

This will discard the first 19 fits and keep only the last one.
How can I avoid this? Can I retrain a model that has been saved and loaded?
Thank you
Some models have a "warm_start" parameter, where it will initialise model parameters with the previous solution from fit()
See for instance SGDClassifier
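A minimal sketch of that idea with SGDClassifier, assuming your data arrives in chunks (the random arrays below are stand-ins):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(warm_start=True)
for i in range(20):
    X_i = np.random.rand(1000, 10)            # stand-in for chunk i
    y_i = np.random.randint(0, 2, size=1000)
    model.fit(X_i, y_i)                       # continues from the previous coefficients
```

Note that with warm_start each call to fit() still needs to see all classes in that chunk.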
You need to read "6.1.3. Incremental learning" in the sklearn documentation: http://scikit-learn.org/0.15/modules/scaling_strategies.html
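A minimal sketch of the incremental-learning approach from that section, using partial_fit on SGDClassifier (the random chunks stand in for data loaded from disk piece by piece):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])    # all labels must be known before the first call
model = SGDClassifier()

for i in range(20):
    X_i = np.random.rand(1000, 10)            # stand-in for chunk i
    y_i = np.random.randint(0, 2, size=1000)
    model.partial_fit(X_i, y_i, classes=classes)
```

Unlike repeated fit() calls, each partial_fit() updates the same estimator instead of replacing the previous solution.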
One option is to reduce your data first: either extract a subset of it, or reduce its dimensions with the help of a database. This would significantly reduce the amount of data.
You may also want to create multiple models from randomly picked subsets of your data, then compare the models and use the one that yields the highest accuracy.
