I have data with numerical and categorical variables. I have split the data into train and test sets, and I would like to do one-hot encoding after imputation. There are unseen categories in the test set.
I understand that handle_unknown='ignore' fixes this issue. However, I would also like to drop one column (drop='first') to avoid multicollinearity, and OneHotEncoder cannot take both of these options at once.
Is there a way to handle the unseen data and also avoid multicollinearity?
Note: I am using ColumnTransformer.
You can design a custom transformer and still apply it within your ColumnTransformer; inside it, array slicing lets you drop the first dummy column of each feature. A sketch of such a transformer follows.
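Here is a minimal sketch, assuming scikit-learn's OneHotEncoder with handle_unknown='ignore' under the hood; the class name OneHotDropFirst is made up for illustration. It fits a regular encoder, then slices away the first dummy column of each original feature to mimic drop='first':

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder

class OneHotDropFirst(BaseEstimator, TransformerMixin):
    """One-hot encode, ignore unseen categories, drop each feature's first dummy."""

    def fit(self, X, y=None):
        self.encoder_ = OneHotEncoder(handle_unknown='ignore')
        self.encoder_.fit(X)
        # Keep every output column except the first dummy of each feature.
        keep, start = [], 0
        for cats in self.encoder_.categories_:
            keep.extend(range(start + 1, start + len(cats)))
            start += len(cats)
        self.keep_ = keep
        return self

    def transform(self, X):
        # Column slicing works on the sparse output as well.
        return self.encoder_.transform(X)[:, self.keep_]
```

You can then plug it into the ColumnTransformer like any built-in transformer, e.g. ('cat', OneHotDropFirst(), categorical_cols). One caveat: with this scheme, an unseen category also comes out as all zeros, i.e. the same pattern as the dropped first category.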
What is the difference between OneHotEncoder and pd.get_dummies? Sometimes the get_dummies function gives the same results as one-hot encoding, but people tend to use the one-hot encoded dataframe to fit their model. So what is the difference, and will it affect my model?
Thanks!
In fact, they produce the same result when transforming a categorical variable into dummies. The difference is that OneHotEncoder stores the transformation in an object: once you have an instance of OneHotEncoder(), you can save it and reuse it later as a preprocessing step in your prediction pipeline.
If you are just running some experiments, you can use either of them. But if you want your preprocessing to be better organized, you'd better use OneHotEncoder.
If you plan to use it for categorical feature treatment, you can also look at LabelEncoder.
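To make the difference concrete, here is a short sketch with a made-up 'color' column: pd.get_dummies encodes each dataframe independently, while a fitted OneHotEncoder remembers the categories from fit and reuses them on new data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'green', 'blue']})
test = pd.DataFrame({'color': ['green', 'green']})

# pd.get_dummies only looks at the frame it is given:
# train gets 3 dummy columns, test gets just 1.
print(pd.get_dummies(train).shape)  # (3, 3)
print(pd.get_dummies(test).shape)   # (2, 1)

# OneHotEncoder remembers the categories seen during fit,
# so the test set comes out with the same 3 columns.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)
print(enc.transform(test).toarray().shape)  # (2, 3)
```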
I am trying to train a model which takes a mixture of numerical, categorical and text features.
My question is: which of the following should I do to vectorize my text and categorical features?

1. Split my data into train, cv and test for the purpose of feature vectorization, i.e. use vectorizer.fit(train) and then vectorizer.transform(cv) and vectorizer.transform(test).
2. Use vectorizer.fit_transform on the entire data.

My goal is to hstack() all of the above features and apply Naive Bayes. I think I should split my data into train and test before this point, in order to find the optimal hyperparameters for NB.
Please share some thoughts on this. I am new to data science.
If you are going to fit anything like an imputer or a StandardScaler to the data, I recommend doing that after the split, since this way you avoid any of the test dataset leaking into your training set. However, things like formatting and simple transformations of the data, such as one-hot encoding, can be done safely on the entire dataset without issue, and this avoids some extra work.
I think you should go with the 2nd option, i.e. vectorizer.fit_transform on the entire data, because if you split the data first, it may happen that some values present in the test set do not appear in the train set; in that case some classes would remain unrecognised.
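For reference, a minimal sketch of the fit-on-train / transform-on-test pattern from the first option, combined with the hstack + Naive Bayes step from the question; the toy data, column contents and labels are all made up:

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import OneHotEncoder

train_text = ['good product', 'bad service']
test_text = ['good service']
train_cat = [['electronics'], ['apparel']]  # one categorical feature
test_cat = [['apparel']]
y_train = [1, 0]

vec = CountVectorizer().fit(train_text)                      # vocabulary from train only
ohe = OneHotEncoder(handle_unknown='ignore').fit(train_cat)  # unseen test values become all-zero rows

# hstack the sparse text and categorical blocks, then fit Naive Bayes.
X_train = sp.hstack([vec.transform(train_text), ohe.transform(train_cat)])
X_test = sp.hstack([vec.transform(test_text), ohe.transform(test_cat)])

clf = MultinomialNB().fit(X_train, y_train)
print(clf.predict(X_test))
```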
I would like to know how I should manage the following situation:
I have a dataset which I need to analyze. It is labeled data, and I need to perform a classification task on it. Some features are numerical and others are categorical (non-ordinal), and my problem is that I don't know how to handle the categorical ones.
Before classifying, I usually apply a MinMaxScaler, but I can't do this on this particular dataset because of the categorical features.
I've read about one-hot encoding, but I don't understand how to apply it to my case: my dataset has some numerical features and 10 categorical features, and one-hot encoding generates more columns in the dataframe. I don't know how I need to prepare the resulting dataframe to send it to the decision tree classifier.
To clarify the situation, the code I'm using so far is the following:
y = df['class']                # 'class' is a reserved word in Python, so use bracket access
X = df.drop(['class'], axis=1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
# call DecisionTree classifier
When the df has categorical features I get the following error: TypeError: data type not understood. If I apply one-hot encoding, I get a dataframe with many columns, and I don't know if the DecisionTree classifier is going to understand the real situation of my data. I mean, how can I express to the classifier that a group of columns belongs to a specific feature? Am I misunderstanding the whole situation? Sorry if this is a confusing question, but I am a newbie and I feel pretty confused about how to handle this.
I don't have enough reputation to comment, but note that decision tree classifiers don't require their input to be scaled. So if you're using a decision tree classifier, just use the features as they appear.
If you're using a method that requires feature scaling, then you should probably do one-hot-encoding and feature scaling separately - see this answer: https://stackoverflow.com/a/43798994/9988333
Alternatively, you could use a method that handles categorical variables 'out of the box', such as LGBM.
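To make the first suggestion concrete, here is a minimal sketch assuming pd.get_dummies for the encoding; the dataframe and column names are made up. The tree simply treats each dummy column as an ordinary binary feature, so there is no need to tell it which columns came from the same original feature:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    'size': [1.0, 2.5, 0.7, 3.1],              # numerical feature
    'color': ['red', 'blue', 'red', 'green'],  # categorical feature
    'class': [0, 1, 0, 1],
})

y = df['class']
X = pd.get_dummies(df.drop(['class'], axis=1))  # expands 'color' into dummies, no scaling needed

clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(X.head(2)))
```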
My train and test data sets are two separate csv files.
I've done some feature engineering on the test set and have used pd.get_dummies(), which works as expected.
Training classes:

| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
| Excellent |
My issue is that there is a mismatch when I try to predict the values, as the test set has a different number of columns after pd.get_dummies().
Test set:

| Condition |
|-----------|
| Poor      |
| Ok        |
| Good      |
Notice that Excellent is missing! Overall, after creating dummies, I'm about 20 columns short of the training dataframe.
My question is: is it acceptable to join train.csv and test.csv, run all my feature engineering, scaling etc., and then split back into the two dataframes before the training phase?
Or is there another better solution?
It is acceptable to join the train and test sets as you say, but I would not recommend it.
In particular, because when you deploy a model and start scoring "real data", you don't get the chance to join it back to the train set to produce the dummy variables.
There are alternative solutions using the OneHotEncoder class from either Scikit-learn, Feature-engine or Category Encoders. These are all open-source Python packages, with classes that implement the fit/transform functionality.
With fit, the class learns the dummy variables to create from the train set, and with transform it creates them. In the example you provide, the test set will also have 4 dummies, and the dummy "Excellent" will contain all 0s.
Find examples of the OneHotEncoder from Scikit-learn, Feature-engine and Category encoders in the provided links
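As a brief sketch of that fit/transform behaviour, using scikit-learn's OneHotEncoder and the toy Condition column from the question:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Condition': ['Poor', 'Ok', 'Good', 'Excellent']})
test = pd.DataFrame({'Condition': ['Poor', 'Ok', 'Good']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)                       # learns all 4 categories from the train set

X_test = enc.transform(test).toarray()
print(X_test.shape)                  # (3, 4): same 4 columns as train
print(enc.get_feature_names_out())   # includes Condition_Excellent, which is all 0s here
```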
I have two dataframes, train and test. They both have the same exact column names which contain categorical string features.
I'm trying to map these features to dummy variables in the training set, train a regression model, then do the same exact mapping for the test set and apply the trained model to it.
The problem I came across is that, since test is smaller than train, it happens to not contain all the possible values for some of the categorical features. Since pandas.get_dummies() seems to just look at pd.Series.unique() to create new columns, after adding dummy columns in the same way for train and test, test now has fewer columns.
So how can I add dummy columns for train, and then use the same exact column names for test, even if for particular features test.feature.unique() is a subset of train.feature.unique()? I looked at the pd.get_dummies documentation, but I don't think I see anything that will do what I'm looking for. Any help is greatly appreciated!
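One common workaround, sketched here under the assumption that train and test are the two dataframes from the question: build the dummies separately, then align the test columns to the train columns with reindex, filling the missing ones with 0:

```python
import pandas as pd

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# Add any train-only dummy columns to test (filled with 0), drop any
# test-only ones, and keep the exact column order used for training.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```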