I have a dataset with about 500,000 records and they are grouped. I would like to shuffle and split into 10 smaller datasets based on the percentage weightings of each group. I want each dataset to contain all groups. Is there a library or method to do this in python?
I tried numpy's array_split, which just splits the dataset without stratification.
Stratification in scikit-learn does not really help, since train_test_split only produces training and test splits.
You can use k-fold splitting to achieve what you're looking for: with n_splits=10, each fold's test indices give you one of the 10 stratified subsets. Something like:
from sklearn.model_selection import StratifiedKFold
folds = list(StratifiedKFold(n_splits=k, shuffle=True, random_state=1).split(X_train, y_train))
See the documentation here https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
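To make that concrete, here is a minimal sketch of splitting a dataset into 10 shuffled, stratified subsets (the column name "group" and the synthetic data are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real dataset: a value column plus a group label.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "value": rng.normal(size=1000),
    "group": rng.choice(["A", "B", "C"], size=1000, p=[0.5, 0.3, 0.2]),
})

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
# Each fold's *test* indices form one of the 10 subsets; together they
# partition the data, and each subset preserves the group proportions.
subsets = [df.iloc[test_idx] for _, test_idx in skf.split(df, df["group"])]
```

Each subset contains all groups in roughly the original percentage weightings.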
I am working on a rather different problem, where I need to split a dataset into overlapping (non-disjoint) subsets, similar to k-fold validation, in Python. I was wondering if there is any way to do that.
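One possibility (an assumption on my part, since the desired overlap pattern isn't specified) is scikit-learn's ShuffleSplit, which draws each split independently, so unlike KFold the resulting test sets can share rows:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(100).reshape(-1, 1)

# Each split draws its test set independently, so test sets from
# different splits may overlap (non-disjoint, unlike KFold).
ss = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
splits = [test_idx for _, test_idx in ss.split(X)]
```

With 5 splits of 30 rows each drawn from 100 rows, overlap between splits is guaranteed by the pigeonhole principle.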
I am a beginner using Keras, and I am trying to preprocess data for training in order to build a neural network. However, I was told that in the csv file where I am getting my data, the first 6 columns are the x values while the rest are y values. How can I deal with this situation in order to split the data correctly for training and testing? The data is all numerical, not categorical. It will be used to predict movement.
When splitting data into training and testing sets, you aren't splitting along the columns, you're splitting along the rows: both the training and test sets will have identical columns, but different rows.
You can use scikit-learn's train_test_split (docs) to do this for you. So to create an 80-20 split you would do:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(<path to csv>)
train, test = train_test_split(df, test_size=0.20, shuffle=True, random_state=42)
Note that in the docs the example splits the label column out too; however, you don't need to do this if you wish to keep the labels and features together.
The random_state parameter (choose any number you like) just ensures that when you re-run the code, the split will be exactly the same (i.e. it is reproducible) each time.
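Since the first 6 columns are features and the rest are targets, you can select the columns after the row-wise split. A minimal sketch with synthetic data (the column layout is assumed from the question, with 3 target columns chosen arbitrarily):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the csv: 6 feature columns, 3 target columns.
df = pd.DataFrame(
    np.random.rand(100, 9),
    columns=[f"x{i}" for i in range(6)] + [f"y{i}" for i in range(3)],
)

train, test = train_test_split(df, test_size=0.20, shuffle=True, random_state=42)

# Column selection happens after the row split, so corresponding
# feature and target rows stay together.
X_train, y_train = train.iloc[:, :6], train.iloc[:, 6:]
X_test, y_test = test.iloc[:, :6], test.iloc[:, 6:]
```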
I am trying to split my dataset into a train and a test set using scikit-learn's stratified shuffle split, but it does not work because one of the classes has just one instance.
It would be okay if that one instance goes into either of train or test set. Is there any way I can achieve that?
A stratified split expects at least two instances of each label in order to split the dataset correctly.
You can duplicate the samples with unique labels so that you can perform the split, fit on them, and check that the model is able to predict them.
I would do it as follows:
# Find labels that occur only once, then duplicate those rows.
vc = df['y'].value_counts()
unique_label = vc[vc == 1].index
df = pd.concat([df, df[df['y'].isin(unique_label)]])
NOTE: It might be wise to remove these samples instead, as your model will have difficulty learning and predicting them.
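Putting the pieces together, here is a runnable sketch (with toy data) of duplicating the singleton class and then performing a stratified split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: class "c" has only one instance, which breaks stratification.
df = pd.DataFrame({
    "x": range(13),
    "y": ["a"] * 6 + ["b"] * 6 + ["c"],
})

# Duplicate rows whose label occurs only once so each class has >= 2 rows.
vc = df["y"].value_counts()
singletons = vc[vc == 1].index
df = pd.concat([df, df[df["y"].isin(singletons)]], ignore_index=True)

# The stratified split now succeeds, and "c" lands in both halves.
train, test = train_test_split(df, test_size=0.5, stratify=df["y"], random_state=0)
```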
I have a dataset with more than 100k rows and about 1k columns, including the target column, for a binary classification problem. I am using H2O GBM (latest 3.30.x) in Python with 5-fold cross-validation and an 80-20 train-test split. I have noticed that H2O stratifies automatically, which is good. The problem is that this whole dataset comes from one product, with some sub-products identified by a separate column (group). Each sub-product has a decent size of 5k to 10k rows, so I thought it would be good to check a separate model on each of them. I am looking for a way to specify these sub-product groups for cross-validation in H2O model training. Currently I loop over the sub-products while doing a train-test split, as it is not clear to me from the documentation I have read so far how to do it otherwise. Is there an option within H2O to use this sub-product column directly for cross-validation? That way I would have fewer model outputs to manage in my scripts.
I hope the question is clear. If not, let me know. Thank you.
The fold_column option works; some brief examples are in the docs:
http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/modeling.html#h2o.grid.H2OGridSearch
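For intuition, H2O's fold_column assigns each row to the cross-validation fold named in that column, so with the sub-product column every sub-product becomes its own fold. The equivalent semantics in scikit-learn (shown here only to illustrate the idea, not as the H2O API; the group labels are hypothetical) is LeaveOneGroupOut:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
# Hypothetical sub-product labels; each distinct value becomes one fold.
groups = np.array(["p1"] * 4 + ["p2"] * 4 + ["p3"] * 4)

logo = LeaveOneGroupOut()
# Each fold holds out exactly one sub-product and trains on the rest.
folds = [(train_idx, test_idx) for train_idx, test_idx in logo.split(X, y, groups)]
```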
I have two dataframes, train and test. They both have the same exact column names which contain categorical string features.
I'm trying to map these features to dummy variables in the training set, train a regression model, then do the same exact mapping for the test set and apply the trained model to it.
The problem I came across is that, since test is smaller than train, it happens to not contain all the possible values for some of the categorical features. Since pandas.get_dummies() seems to just look at data.Series.unique() to create new columns, after adding dummy columns in the same way for train and test, test now has fewer columns.
So how can I instead add dummy columns for train, and then use the same exact column names for test, even if for particular features in test, test.feature.unique() is a subset of train.feature.unique()? I looked at the pd.get_dummies documentation, but I don't think I see anything that'll do what I'm looking for. Any help is greatly appreciated!
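One common approach (a sketch, not an answer from the original thread) is to dummy-encode each frame and then reindex the test frame to the training columns, filling missing dummy columns with 0:

```python
import pandas as pd

# Toy frames: test is missing the "green" and "blue" categories.
train = pd.DataFrame({"color": ["red", "green", "blue"], "y": [1, 2, 3]})
test = pd.DataFrame({"color": ["red", "red"], "y": [1, 1]})

train_d = pd.get_dummies(train, columns=["color"])
# Align test to the training columns; dummies for categories absent
# from test are created and filled with 0.
test_d = pd.get_dummies(test, columns=["color"]).reindex(
    columns=train_d.columns, fill_value=0
)
```

An alternative is to convert the feature to pd.Categorical with the categories fixed from the training set before calling get_dummies, which produces the same column set for both frames.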