Splitting MovieLens data into train-validation-test datasets - Python

I'm working on a recommender systems project written in Python using Bayesian Personalized Ranking (BPR) optimization. I am pretty confident my model learns the data I provided well enough, but now it's time to tune the model hyperparameters and try to avoid overfitting. Since the MovieLens dataset only provides 5-fold train-test splits without validation sets, I want to split the original dataset myself to validate my model.
Since the MovieLens dataset holds data for 943 users, each guaranteed to have rated at least 20 movies, I'm thinking of splitting the data so that both TRAIN and TEST datasets contain the same set of users (all 943), distributing 80% of each user's implicit feedback to TRAIN and the rest to TEST. After training, validation will take place using the mean Recall@k across all 943 users.
Is this the right way to split the dataset? I am asking because the original MovieLens test dataset doesn't seem to contain test data for all 943 users. If a certain user doesn't have any test data to predict, how do I evaluate with Recall@k, when doing so would cause a division by zero? Should I just skip that user and calculate the mean over the remaining users?
Thanks for the lengthy read, I hope you're not as confused as I am.

I would split the whole data set 80% (train) / 10% (validation) / 10% (test). It should work out :)
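If it helps, here is a rough sketch of that per-user 80/10/10 split, plus a mean Recall@k that simply skips users with no held-out items. The names here are illustrative assumptions: the ratings are assumed to sit in a pandas DataFrame with user_id and item_id columns, and recommend stands for whatever function returns your BPR model's top-k item ids for a user.

```python
import numpy as np
import pandas as pd

def split_per_user(ratings, train_frac=0.8, val_frac=0.1, seed=42):
    """Split each user's interactions into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for _, user_df in ratings.groupby("user_id"):
        idx = rng.permutation(len(user_df))
        n_train = int(len(idx) * train_frac)
        n_val = int(len(idx) * val_frac)
        train.append(user_df.iloc[idx[:n_train]])
        val.append(user_df.iloc[idx[n_train:n_train + n_val]])
        test.append(user_df.iloc[idx[n_train + n_val:]])
    return pd.concat(train), pd.concat(val), pd.concat(test)

def mean_recall_at_k(recommend, test, all_users, k=10):
    """Average Recall@k over users, skipping users with no held-out items."""
    test_items = test.groupby("user_id")["item_id"].apply(set)
    recalls = []
    for user_id in all_users:
        relevant = test_items.get(user_id, set())
        if not relevant:                 # no test data for this user -> skip
            continue
        top_k = recommend(user_id, k)    # the model's top-k recommended item ids
        recalls.append(len(relevant & set(top_k)) / len(relevant))
    return float(np.mean(recalls))
```

Skipping users without held-out items, as in the sketch, is the usual way to avoid the division by zero you describe.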

Related

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data. The problem is that I am using cross-validation, and I can't find a reference on how to adjust my setup to use data augmentation with it. My cross-validation currently indexes my data more or less by hand.
There are articles and general content about data augmentation, but very little that generalizes to combining cross-validation with data augmentation.
I need to augment the training data by simply rotating and zooming, cross-validate for the best weights and save them, but I wouldn't know how.
This example can be copy-pasted for better reproducibility. In short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on samples used during the training phase (if you care about realistic results).
Cross-validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, and repeat this procedure for each part in turn. This way you essentially test your model on all the available data without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross-validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data for training and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, since the training portion increases, but the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for 5-fold CV you train 5 different models, which have a large overlap in their training data. As a result, you should get 5 models with similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be roughly the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test part. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it with the original. Then you follow the regular CV procedure.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting the data in a linear way can lead to unbalanced data sets; it is not the best way of splitting. You should consider the functions I've mentioned above.
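As a rough sketch of that workflow with scikit-learn's KFold: build_model and augment below are hypothetical callables you would supply (augment is assumed to return rotated/zoomed copies of only the samples it is given), and X, y are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate_with_augmentation(X, y, build_model, augment, n_splits=5):
    """Run k-fold CV, augmenting only the training part of each fold."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X):
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]

        # Augment only the training fold, then add the copies to the originals;
        # the test fold is never augmented, so nothing leaks across the split.
        X_aug, y_aug = augment(X_train, y_train)
        X_train = np.concatenate([X_train, X_aug])
        y_train = np.concatenate([y_train, y_aug])

        model = build_model()
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return scores
```

Saving the best weights then just means keeping the model (or its weights file) from whichever fold, or final full-data fit, scores highest.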

Using a column for cross validation folds

I have a dataset with more than 100k rows and about 1k columns, including the target column, for a binary classification problem. I am using H2O GBM (latest 3.30.x) in Python with 5-fold cross-validation and an 80-20 train-test split. I have noticed that H2O automatically stratifies it, which is good. The issue is that this whole dataset comes from one product, with several sub-products within it identified by a separate column or group. Each of these sub-products has a decent size of 5k to 10k rows, so I thought it would be worthwhile to check a separate model on each of them. I would like to know if I can specify this sub-product group column for cross-validation in H2O model training. Currently I am looping over these sub-products while doing a train-test split, as it is not clear to me how to do it otherwise based on the documentation I have read so far. Is there any option in H2O to use this sub-product column directly for cross-validation? That way I would have less model output to manage in my scripts.
I hope the question is clear. If not, let me know. Thank you.
The fold_column option works; there are some brief examples in the docs:
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2o.grid.H2OGridSearch
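Very roughly, the usage looks like the sketch below (untested against the current H2O release; the file path, target name, and sub_product column name are placeholders for your own):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
data = h2o.import_file("product_data.csv")          # placeholder path
predictors = [c for c in data.columns if c not in ("target", "sub_product")]

gbm = H2OGradientBoostingEstimator(ntrees=200, seed=1234)
# fold_column tells H2O to form the CV folds from the sub_product column
# instead of assigning folds randomly, so each sub-product becomes a holdout fold.
gbm.train(x=predictors, y="target",
          training_frame=data,
          fold_column="sub_product")

print(gbm.cross_validation_metrics_summary())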

Split into test and train set before or after generating document term matrix?

I'm working on simple machine learning problems, and I'm trying to build a classifier that can differentiate between spam and non-spam SMS. I'm confused as to whether I should generate the document-term matrix before splitting into train and test sets, or after.
I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn't the accuracy be the same? Does the order of these operations make any difference?
Qualitatively, it might not seem to matter which way you do it. However, proper procedure requires that you keep your training and test data entirely separate. The overall concept is that the test data are not directly represented in the training; this helps reduce overfitting. The test data (and later validation data) are samples that the trained model has never encountered during training.
Therefore, the test data should not be included in your pre-processing, i.e. the document-term matrix. Including it breaks the separation, in that the model has, in one respect, "seen" the test data during training.
Quantitatively, you need to do the split first, because that matrix is to be used for training the model against only the training set. When you included the test data in the matrix, you obtained a matrix that is slightly inaccurate in representing the training data: it no longer properly represents the data you're actually training against. This is why your model isn't quite as good as the one that followed proper separation procedures.
It's a subtle difference, most of all because the training and test sets are supposed to be random samples of the same population of possible inputs. Random differences provide the small surprise you encountered.
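In scikit-learn terms, the "split first" procedure looks roughly like the sketch below (sms_texts and labels stand in for your raw messages and spam/ham labels; the classifier is only illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 1. Split the raw text first, so the test messages stay unseen.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    sms_texts, labels, test_size=0.2, random_state=42)

# 2. Build the document-term matrix from the training messages only...
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)

# 3. ...and only *transform* the test messages with that fixed vocabulary.
X_test = vectorizer.transform(X_test_text)

clf = MultinomialNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```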

Will it lead to Overfitting / Curse of Dimensionality

Dataset contains:
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but before that you need to understand your problem statement together with the features available in the dataset. If you have a big dataset, try splitting it into a couple of clusters, or else take a smaller sample to analyze what your data tells you.
That is why population and sampling come into practical use.
You should check whether the accuracy on the training set and the test set are about the same; if not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
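For example, a quick sanity check of that kind with scikit-learn might look like the sketch below (the classifier choice is only illustrative; X and y stand for your 15000×3000 feature matrix and labels):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv = cross_validate(model, X, y, cv=5,
                    return_train_score=True, scoring="accuracy")

print("train accuracy:     ", cv["train_score"].mean())
print("validation accuracy:", cv["test_score"].mean())
# A large gap between the two (e.g. 0.99 vs 0.70) suggests overfitting.
```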

Is the test set used to update weights in a deep learning model with Keras?

I'm wondering whether the result on the test set is used to optimize the model's weights. I'm trying to build a model, but the issue is that I don't have much data because the subjects are medical research patients. The number of patients in my case is limited (61), and I have 5 feature vectors per patient. What I tried is to create a deep learning model by excluding one subject and using the excluded subject as the test set. My problem is that there is large variability in subject features, and my model fits the training set (60 subjects) well but not the one excluded subject.
So I'm wondering if the test set (in my case the excluded subject) could be used in some way to make the model converge so that it classifies the excluded subject better?
You should not use the test data of your data set in your training process. If your training data is not enough, one approach used a lot these days (especially for medical images) is data augmentation. So I highly recommend you use this technique in your training process. How to use Deep Learning when you have Limited Data is a good tutorial about data augmentation.
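If you go the augmentation route with Keras and image data, a minimal sketch with rotation and zoom might look like this (model, X_train, y_train, X_val and y_val are assumed to exist already; the parameter values are only illustrative):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rotations and zoom produce extra training samples on the fly;
# the validation data is passed through unchanged.
train_datagen = ImageDataGenerator(rotation_range=20, zoom_range=0.15)

model.fit(train_datagen.flow(X_train, y_train, batch_size=32),
          validation_data=(X_val, y_val),
          epochs=50)
```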
No, you shouldn't use your test set for training; that would invite overfitting. If you follow cross-validation principles, you need to split your data into three datasets: a train set, which you use to train your model; a validation set, to test different values of your hyperparameters; and a test set, to finally evaluate your model. If you use all your data for training, your model will obviously overfit.
One thing to remember: deep learning works well if you have a large and very rich dataset.
