I have two dataframes, train and test. They both have the same exact column names which contain categorical string features.
I'm trying to map these features to dummy variables in the training set, train a regression model, then do the same exact mapping for the test set and apply the trained model to it.
The problem I came across is that since test is smaller than train, it happens to not contain all the possible values for some of the categorical features. Since pandas.get_dummies() seems to just look at pd.Series.unique() to create new columns, after adding dummy columns in the same way for train and test, test now has fewer columns.
So how can I add dummy columns for train and then use the same exact column names for test, even if for particular features test.feature.unique() is a subset of train.feature.unique()? I looked at the pd.get_dummies documentation, but I don't see anything that will do what I'm looking for. Any help is greatly appreciated!
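The closest workaround I can think of is reindexing the test dummies against the train columns. A minimal sketch (the frame and column names are made up):

import pandas as pd

train = pd.DataFrame({'feature': ['a', 'b', 'c', 'a']})
test = pd.DataFrame({'feature': ['a', 'b']})  # 'c' never appears in test

train_dummies = pd.get_dummies(train['feature'])
# Reuse train's columns; categories missing from test become all-zero columns
test_dummies = pd.get_dummies(test['feature']).reindex(
    columns=train_dummies.columns, fill_value=0
)

Is this the right approach, or is there something built in?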
Related
I am trying to split my dataset into a train and a test set using scikit-learn's StratifiedShuffleSplit, but it does not work because one of the classes has just one instance.
It would be okay if that one instance goes into either the train or the test set. Is there any way I can achieve that?
Stratified splitting requires at least two instances of each label to split the dataset correctly.
You can duplicate the samples with a unique label so that you can perform the split, fit on them, and ensure that the model is able to predict them.
I would do it as follows:
import pandas as pd

# Duplicate the rows whose label occurs only once
vc = df['y'].value_counts()
unique_label = vc[vc == 1].index
df = pd.concat([df, df[df['y'].isin(unique_label)]])
NOTE: It might be wise to remove these samples instead, as your model will have difficulty learning and predicting them.
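After the duplication, the stratified split should go through. A minimal sketch, assuming the same df and 'y' column as above:

from sklearn.model_selection import train_test_split

# Every class now has at least two members, so stratification works
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns='y'), df['y'], test_size=0.2, stratify=df['y']
)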
I have data with numerical and categorical variables. I have split the data into train and test. I would like to do one-hot encoding after imputation. The test set contains categories that are unseen in the training set.
I understand handle_unknown='ignore' fixes this issue. However, I would also like to drop one column (drop='first') to avoid multicollinearity. OneHotEncoder cannot take both of these.
Is there a way to handle the unseen data and also avoid multicollinearity?
Note: I am using ColumnTransformer.
You could use array slicing and design a custom transformer, so that you can still apply it within your ColumnTransformer. Here you will find an example of how to create a custom transformer.
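For instance, a minimal sketch (the class name and details are illustrative, not a definitive implementation): fit an encoder with handle_unknown='ignore', then slice away the first dummy of each feature after transforming:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class DropFirstOneHot(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # sparse_output=False needs scikit-learn >= 1.2; older versions use sparse=False
        self.encoder_ = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        self.encoder_.fit(X)
        # Keep every output column except the first dummy of each feature
        keep, start = [], 0
        for cats in self.encoder_.categories_:
            keep.extend(range(start + 1, start + len(cats)))
            start += len(cats)
        self.keep_ = np.asarray(keep)
        return self

    def transform(self, X):
        return self.encoder_.transform(X)[:, self.keep_]

Note that an unseen category then encodes to all zeros, the same as the dropped first category; that ambiguity is inherent to combining the two behaviours.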
I am trying to train a model which takes a mixture of numerical, categorical and text features.
My question is which one of the following should I do for vectorizing my text and categorical features?
I split my data into train, cv and test before feature vectorization, i.e. use vectorizer.fit(train), then vectorizer.transform(cv) and vectorizer.transform(test).
Use vectorizer.fit_transform on the entire data.
My goal is to hstack all the above features and apply Naive Bayes. I think I should split my data into train and test before this point, in order to find the optimal hyperparameters for NB.
Please share some thoughts on this. I am new to data science.
If you are going to fit anything like an imputer or a StandardScaler to the data, I recommend doing that after the split, since this way you avoid any of the test dataset leaking into your training set. However, things like formatting and simple transformations of the data, such as one-hot encoding, should be safe to do on the entire dataset, which avoids some extra work.
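For what it's worth, a minimal sketch of the first option, fitting on train only (the frame and column names 'text', 'category' and 'y' are made up):

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

text_vec = TfidfVectorizer()
cat_enc = OneHotEncoder(handle_unknown='ignore')  # tolerates categories unseen in train

X_train = hstack([
    text_vec.fit_transform(train['text']),    # fit the vocabulary on train only
    cat_enc.fit_transform(train[['category']]),
])
X_test = hstack([
    text_vec.transform(test['text']),         # reuse the fitted vocabulary
    cat_enc.transform(test[['category']]),
])

model = MultinomialNB().fit(X_train, train['y'])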
I think you should go with the 2nd option, i.e. vectorizer.fit_transform on the entire data, because if you split the data first, some values that appear in the test set may not appear in the train set, and in that case some classes may remain unrecognised.
My train and test datasets are two separate CSV files.
I've done some feature engineering on the test set and have used pd.get_dummies(), which works as expected.
Training classes:

| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
| Excelent  |
My issue is that there is a mismatch when I try to predict values, because the test set has a different number of columns after pd.get_dummies().
Test set:
| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
Notice that Excelent is missing! Overall, after creating dummies, I'm about 20 columns short of the training dataframe.
My question: is it acceptable to join train.csv and test.csv, run all my feature engineering, scaling, etc., and then split back into the two dataframes before the training phase?
Or is there another better solution?
It is acceptable to join the train and test sets as you say, but I would not recommend it.
Particularly because when you deploy the model and start scoring "real data", you don't get the chance to join it back to the train set to produce the dummy variables.
There are alternative solutions using the OneHotEncoder class from Scikit-learn, Feature-engine or Category Encoders. All of these are open-source Python packages with classes that implement the fit/transform functionality.
With fit, the class learns from the train set which dummy variables will be created, and with transform it creates the dummy variables. In the example that you provide, the test set will also have 4 dummies, and the dummy for Excelent will contain all 0s.
You can find examples of the OneHotEncoder from Scikit-learn, Feature-engine and Category Encoders in the provided links.
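A minimal sketch of the Scikit-learn variant, assuming the Condition column from the question:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Condition': ['Poor', 'Ok', 'Good', 'Excelent']})
test = pd.DataFrame({'Condition': ['Poor', 'Ok', 'Good']})

# sparse_output=False needs scikit-learn >= 1.2; older versions use sparse=False
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(train[['Condition']])                  # learns all 4 categories from train

test_dummies = pd.DataFrame(
    enc.transform(test[['Condition']]),
    columns=enc.get_feature_names_out(),
)
# test_dummies has 4 columns; 'Condition_Excelent' is all zeros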
In most academic examples, we convert categorical features using get_dummies or OneHotEncoder. Let's say I want to use Country as a feature, and the dataset has 100 unique countries. When we apply get_dummies on Country, we get 100 columns, and the model is trained with 100 country columns plus other features.
Let's say we have deployed this model into production and we receive only 10 countries. When we pre-process the data using get_dummies, the model will fail to predict, because the number of features the model was trained with does not match the features passed: we are passing only 10 country columns plus other features.
I came across the article below, where scores are calculated using the supervised ratio and weight of evidence. But how do we calculate the score when we want to predict the target in production, i.e. which number should be assigned to each country?
https://www.kdnuggets.com/2016/08/include-high-cardinality-attributes-predictive-model.html
Can you please help me to understand how to handle such scenarios?
There are two things you can do.
Apply OHE after combining your training set and test/validation set data, not before (see the sketch after this list).
Skip OHE and apply StandardScaler, because "if a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected."
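A minimal sketch of the first suggestion (frame and column names are illustrative): build the dummies on the combined frame, then split back:

import pandas as pd

combined = pd.concat([train, test], keys=['train', 'test'])
combined = pd.get_dummies(combined, columns=['Country'])
# Split back; both frames now share the same dummy columns
train_enc = combined.loc['train']
test_enc = combined.loc['test']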
I usually try the second option when I have many unique values in a categorical feature, which could otherwise cause a column mismatch with my test/validation set.
Feel free to correct me.