Patsy: New levels in categorical fields in test data - python

I am trying to use Patsy (with sklearn, pandas) for creating a simple regression model. The R style formula creation is a major draw.
My data contains a field called 'ship_city' which can have any city from India. Since I am partitioning the data into train and test sets, there are several cities which appear only in one of the sets. A code snippet is given below:
df_train_Y, df_train_X = dmatrices(formula, data=df_train, return_type='dataframe')
df_train_Y_design_info, df_train_X_design_info = df_train_Y.design_info, df_train_X.design_info
df_test_Y, df_test_X = build_design_matrices([df_train_Y_design_info.builder, df_train_X_design_info.builder], df_test, return_type='dataframe')
The last line throws the following error:
patsy.PatsyError: Error converting data to categorical: observation
with value 'Kolkata' does not match any of the expected levels
I believe this is a very common use case where training data will not have all levels of all categorical fields. Sklearn's DictVectorizer handles this quite well.
Is there any way I can make this work with Patsy?

The problem of course is that if you just give patsy a raw list of values, it has no way to know that there are other values that could potentially happen as well. You have to somehow tell it what the complete set of possible values is.
One way is by using the levels= argument to C(...), like:
# If you have a data frame with all the data before splitting:
all_cities = sorted(df_all["Cities"].unique())
# Alternative approach:
all_cities = sorted(set(df_train["Cities"]).union(set(df_test["Cities"])))
dmatrices("y ~ C(Cities, levels=all_cities)", data=df_train)
Another option if you're using pandas's default categorical support is to record the set of possible values when you set up your data frame; if patsy detects that the object you've passed it is a pandas categorical then it automatically uses the pandas categories attribute instead of trying to guess what the possible categories are by looking at the data.
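The pandas categorical option could be sketched roughly as below. The city names and the tiny data frames are made up for illustration, and the example assumes you know the full list of cities before splitting the data.

```python
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Hypothetical complete set of levels, known before the train/test split
all_cities = ["Delhi", "Kolkata", "Mumbai"]
city_type = pd.CategoricalDtype(categories=all_cities)

# Toy data: 'Kolkata' appears only in the test set
df_train = pd.DataFrame({"Cities": ["Delhi", "Mumbai", "Delhi"]})
df_test = pd.DataFrame({"Cities": ["Kolkata", "Mumbai"]})
df_train["Cities"] = df_train["Cities"].astype(city_type)
df_test["Cities"] = df_test["Cities"].astype(city_type)

# patsy takes the levels from the categorical dtype, not from the observed values
train_X = dmatrix("~ Cities", df_train, return_type="dataframe")
test_X = build_design_matrices(
    [train_X.design_info], df_test, return_type="dataframe"
)[0]
```

Because both frames share the same categorical dtype, the train and test design matrices should end up with the same columns even though 'Kolkata' never occurs in the training rows.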

I ran into a similar problem and I built the design matrices prior to splitting the data.
df_Y, df_X = dmatrices(formula, data=df, return_type='dataframe')
df_train_X, df_test_X, df_train_Y, df_test_Y = \
train_test_split(df_X, df_Y, test_size=test_size)
Then as an example of applying a fit:
model = smf.OLS(df_train_Y, df_train_X)
model2 = model.fit()
predicted = model2.predict(df_test_X)
Technically I haven't built a test case, but I haven't run into the Error converting data to categorical error again since implementing the above.


Cannot use workaround for Python PMML Pipeline

I'm trying to build a simple preprocessing Pipeline for a clustering model that uses K-Means and export it to PMML format.
I managed to make the Pipeline work, but I can't export it to PMML.
I have divided the pipeline into two steps: one handles numerical data and the other handles categorical data.
numeric_features = ['column1','column2','column3']
categorical_features = ['column4','column5']
num_mapper = sklearn_pandas.DataFrameMapper([([numeric_column], SimpleImputer(strategy='median')) for numeric_column in numeric_features], default=None, df_out=True)
categorical_mapper = sklearn_pandas.DataFrameMapper([([categorical_column], LabelBinarizer()) for categorical_column in categorical_features], default=None, df_out=True)
pipeline = PMMLPipeline(steps=[
Note that I have set default to None in the first DataFrameMapper, since this lets the output DataFrame preserve the columns that were not selected (columns that the second mapper needs).
These workarounds work fine; the problem comes later, when I try to export the pipeline to PMML.
The export call yields the following error:
java.lang.IllegalArgumentException: Attribute 'sklearn_pandas.dataframe_mapper.DataFrameMapper.default' has a missing (None/null) value
at org.jpmml.sklearn.PyClassDict.get(
at org.jpmml.sklearn.PyClassDict.getObject(
I know this error is probably caused by setting default to None in both DataFrameMappers, but it was the only workaround I found to preserve the columns needed by the second mapper.
Is there any other workaround I could use? I know I could do all the transformations in the first DataFrameMapper, but I don't like that idea since I want to keep the numerical transformations separate from the categorical ones.
I recently came to understand how FeatureUnion works, and realized it could be an elegant solution.
Create the same mappers:
numeric_features = ['column1','column2','column3']
categorical_features = ['column4','column5']
num_mapper = sklearn_pandas.DataFrameMapper([([numeric_column],SimpleImputer(strategy='median')) for numeric_column in numeric_features])
categorical_mapper = sklearn_pandas.DataFrameMapper([([categorical_column],LabelBinarizer()) for categorical_column in categorical_features])
preprocessing = FeatureUnion(transformer_list=[('num_mapper',num_mapper),('cat_mapper',categorical_mapper)])
pipeline = PMMLPipeline(steps=[
With this workaround I even managed to avoid the df_out and default flags in the DataFrameMapper calls.
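To illustrate why FeatureUnion removes the need to pass columns through, here is a minimal sketch that uses plain scikit-learn only. The select_columns helper, the FunctionTransformer selectors and OneHotEncoder are stand-ins chosen for this example, not the DataFrameMapper setup above, and nothing here says whether such a pipeline exports to PMML.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

numeric_features = ["column1"]
categorical_features = ["column4"]


def select_columns(df, columns):
    """Hypothetical helper: return only the requested DataFrame columns."""
    return df[columns]


# Each branch receives the full input frame and picks its own columns
num_branch = make_pipeline(
    FunctionTransformer(select_columns, kw_args={"columns": numeric_features}),
    SimpleImputer(strategy="median"),
)
cat_branch = make_pipeline(
    FunctionTransformer(select_columns, kw_args={"columns": categorical_features}),
    OneHotEncoder(),
)
preprocessing = FeatureUnion([("num", num_branch), ("cat", cat_branch)])

df = pd.DataFrame({"column1": [1.0, np.nan, 3.0], "column4": ["a", "b", "a"]})
features = preprocessing.fit_transform(df)
# OneHotEncoder returns a sparse matrix by default, so the union may be sparse
if hasattr(features, "toarray"):
    features = features.toarray()
```

The branches run side by side on the same input and their outputs are concatenated column-wise, so neither branch has to forward columns for the other.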

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using pd.get_dummies on data frame columns).
I have trained an XGBoost model using the preprocessed data (centre and scale using MinMaxScaler). Thereby, I am in a similar situation where feature names are lost.
For instance:
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)
my_model_name = XGBClassifier().fit(X, Y)
where X and Y are the training data and labels respectively. The scaling above returns a 2D NumPy array, thereby discarding feature names from pandas DataFrame.
Thus, when I try to use plot_importance(my_model_name), it leads to the plot of feature importance, but only with feature names such as f0, f1, f2 etc., and not the actual feature names from the original data set.
Is there a way to map the feature names from the original training data to the feature importance plot generated, so that the original feature names are plotted in the graph? Any help in this regard is highly appreciated.
You can get the features names by:
You are right that when you pass a NumPy array to the fit method of XGBoost, you lose the feature names. In that case, calling model.get_booster().feature_names is not useful, because the returned names have the form [f0, f1, ..., fn], and these are the names shown in the output of plot_importance as well.
But there are several ways to achieve what you want, provided you stored your original feature names somewhere, e.g. orig_feature_names = ['f1_name', 'f2_name', ..., 'fn_name'], or directly orig_feature_names = list(X.columns) if X was a pandas DataFrame.
Then you should be able to:
change the stored feature names (model.get_booster().feature_names = orig_feature_names) and then call plot_importance, which should pick up the updated names and show them on the plot
or, since this method returns a matplotlib Axes, modify the labels with plot_importance(model).set_yticklabels(orig_feature_names) (but you have to put your feature names in the order in which they appear on the plot)
or take model.feature_importances_ and combine it with your original feature names yourself (i.e. plot it yourself)
similarly, you can use the model.get_booster().get_score() method and combine it with your feature names
or try the Learning API with an xgboost DMatrix and specify your feature names when creating the dataset (after scaling) with train_data = xgb.DMatrix(X, label=Y, feature_names=orig_feature_names) (but I do not have much experience with this way of training, since I usually use the Scikit-Learn API)
Thanks to @Noob Programmer (see comments below): there might be some "inconsistencies" depending on which feature importance method you use. These are the most important ones:
xgboost.plot_importance uses "weight" as the default importance type (see plot_importance)
model.get_booster().get_score() also uses "weight" as the default (see get_score)
model.feature_importances_ depends on the importance_type parameter (model.importance_type), and it seems that the result is normalized to sum to 1 (see this comment)
For more info on this topic, look at How to get feature importance.
I tried the answers above, but they didn't work when loading the model after training.
So, the working code for me is:
It returns a list of the feature names.
I think it is best to turn the NumPy array back into a pandas DataFrame, e.g.:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier
X_df = pd.read_csv("train.csv")
orig_feature_names = list(X_df.columns)
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled_np = scaler.fit_transform(X_df)
X_scaled_df = pd.DataFrame(X_scaled_np, columns=orig_feature_names)
my_model_name = XGBClassifier(max_depth=2, n_estimators=2).fit(X_scaled_df, Y)
This will show the original names.

SyntaxError while trying to perform RobustScaler on Pandas Dataframe

I am working with the House Prices Kaggle dataset. I am trying to apply RobustScaler from sklearn only to the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler on the numerical columns of my dataframe, which I select with select_dtypes(exclude=['object']). Once the transformer has been fit, I call its transform function on those same columns and try to assign the result back to the select_dtypes(exclude=['object']) call. When I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been cleaned of null values. What does work is assigning the transform results to a new variable, but then I get them back as a numpy.ndarray:
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This doesn't work
test = transformer.transform(df_train.select_dtypes(exclude=['object'])) # This DOES work, but not in the format I need
All I want is for the transformed attributes to go back into the original pandas data frame at their corresponding locations. Is there some workaround I can implement if I can't convert the original dataframe results directly?
I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track:
num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])
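The SyntaxError in the question arises because a function call such as select_dtypes(...) cannot be the target of an assignment, whereas indexing the frame by column labels can. A small self-contained sketch of the working pattern follows; the toy frame is made up, and include="number" is used here in place of exclude=['object'] as an assumption that only numeric columns should be scaled.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy frame with one numeric and one string column
df_train = pd.DataFrame(
    {"LotArea": [1.0, 2.0, 3.0, 4.0, 5.0], "Street": ["a", "b", "c", "d", "e"]}
)

# Select the numeric column labels once, then assign by label
num_cols = df_train.select_dtypes(include="number").columns
scaler = RobustScaler().fit(df_train[num_cols])
df_train[num_cols] = scaler.transform(df_train[num_cols])
```

Keeping the fitted scaler in a variable also lets you call scaler.transform on the same columns of a test frame later, so that the test data is scaled with the training statistics.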

How to handle string data in ML classification

Hello, I am a beginner in machine learning. I have previously worked on some binary ML tasks where the data was numerical. Now I am facing a problem where I have to find the probability of a particular combination. I cannot disclose the dataset or the code at this point. My data is a dataframe of 10 columns. I have to train my model on 8 columns and predict the probability of the last 2 columns; that is, my labels are a combination of the last 2 columns. The problem is that these column values are not numerical. I have tried everything I came across but can't find a suitable way of converting them to numerical values. I have tried LabelEncoder from sklearn, which works for the labels but throws a memory error if I use it again. I have tried to_numeric from pandas, which reads all the values as NaN. The values are in the form '2be74fad-4d4'. Any suggestions on how to handle this would be highly appreciated.
To convert categorical data to numerical, you can try these approaches in sklearn:
Label Encoding
Label Binarizer
OneHot Encoding
Now, for your problem, you can use LabelEncoder. But there is a catch. With other sklearn transformers, you can declare the object once and then fit and transform a number of columns in one go.
With LabelEncoder, you have to fit_transform the encoder on one column of the train data and then transform the same column of the test data, and then repeat the process for the next categorical column.
You can iterate over a list of categorical columns to make it simple. Consider the snippet below:
from sklearn.preprocessing import LabelEncoder

cat_cols = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
            'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type', 'Item_Type_Combined']
enc = LabelEncoder()
for col in cat_cols:
    train[col] = train[col].astype('str')
    test[col] = test[col].astype('str')
    # Refit the encoder on each train column, then reuse that fit for the test column
    train[col] = enc.fit_transform(train[col])
    test[col] = enc.transform(test[col])
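As a possible alternative to the per-column loop, and not part of the answer above, scikit-learn's OrdinalEncoder can encode several feature columns at once and, assuming scikit-learn 0.24 or later, can map values unseen during fitting to a placeholder instead of raising an error. The column names and values below are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical string-valued feature columns
cat_cols = ["id", "kind"]
train = pd.DataFrame(
    {"id": ["2be74fad-4d4", "9c1e0b77-a10", "2be74fad-4d4"], "kind": ["x", "y", "x"]}
)
test = pd.DataFrame({"id": ["9c1e0b77-a10", "ffffffff-000"], "kind": ["y", "x"]})

# Values not seen during fit are encoded as -1 rather than causing an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_encoded = enc.fit_transform(train[cat_cols])
test_encoded = enc.transform(test[cat_cols])
```

Here the id 'ffffffff-000' occurs only in the test frame, so it comes out as -1 while the known values keep the integer codes learned from the training data.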
You can create a dictionary that maps each string to an integer. Then you can use one-hot encoding or just feed the integers to the neural network. If the characters have some meaning, you could also do this on a per-character basis instead of per word, but that depends on the task. If the string is just a unique identifier, leave it out and don't feed it to your model.
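A minimal sketch of the dictionary mapping described here, using made-up values in the format from the question:

```python
# Hypothetical string values from one column
values = ["2be74fad-4d4", "9c1e0b77-a10", "2be74fad-4d4"]

# Map each distinct string to an integer (sorted so the mapping is reproducible)
mapping = {value: index for index, value in enumerate(sorted(set(values)))}
encoded = [mapping[value] for value in values]

# Optional one-hot encoding built from the integer codes
one_hot = [
    [1 if mapping[value] == index else 0 for index in range(len(mapping))]
    for value in values
]
```

The same mapping dictionary has to be reused for any later data, and you need to decide separately how to treat strings that are not in it.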

formatting design matrix for regression

I am given a test set without the response variable. I have already built the model and need to predict the response variable in the testing set.
I am having trouble formatting the test design matrix so that it would be compatible.
I am using patsy library to construct the matrix.
I want to do something like this, except the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')
What is the right approach? thanks
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you. And suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were, and generate a test matrix with 10 columns (some of them all-zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code is probably going to crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can silently give you badly wrong results.
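The centering case can be seen in a small sketch. The numbers are made up, and build_design_matrices is used here as one way of re-using the saved design_info.

```python
import pandas as pd
from patsy import dmatrix, build_design_matrices

# Toy data: the training mean of var1 is 2.0
train = pd.DataFrame({"var1": [1.0, 2.0, 3.0]})
test = pd.DataFrame({"var1": [10.0]})

train_X = dmatrix("~ center(var1)", train, return_type="dataframe")

# First approach: re-use the design info, so the training mean is subtracted
test_reused = build_design_matrices(
    [train_X.design_info], test, return_type="dataframe"
)[0]

# Second approach: build from scratch, so the center is recomputed on the test data
test_rebuilt = dmatrix("~ center(var1)", test, return_type="dataframe")

reused_value = test_reused.iloc[0, 1]    # 10.0 minus the training mean 2.0
rebuilt_value = test_rebuilt.iloc[0, 1]  # 10.0 minus the test mean 10.0
```

With the re-used design info the test value is centered against the training mean, whereas the rebuilt matrix centers the single test row against itself.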
