How to correctly use model explainer with unseen data? - python

I trained my classifier using a pipeline:
param_tuning = {
    'classifier__learning_rate': [0.01, 0.1],
    'classifier__max_depth': [3, 5, 7, 10],
    'classifier__min_child_weight': [1, 3, 5],
    'classifier__subsample': [0.5, 0.7],
    'classifier__n_estimators': [100, 200, 500],
}

cat_pipe = Pipeline(
    [
        ('selector', ColumnSelector(categorical_features)),
        ('encoder', ce.one_hot.OneHotEncoder())
    ]
)

num_pipe = Pipeline(
    [
        ('selector', ColumnSelector(numeric_features)),
        ('scaler', StandardScaler())
    ]
)

preprocessor = FeatureUnion(
    transformer_list=[
        ('cat', cat_pipe),
        ('num', num_pipe)
    ]
)

xgb_pipe = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('classifier', xgb.XGBClassifier())
    ]
)

grid = GridSearchCV(xgb_pipe, param_tuning, cv=5, n_jobs=-1, scoring='accuracy')
xgb_model = grid.fit(X_train, y_train)
The training data contain categorical features, so the transformed training data have shape (x, 100). After that, I try to explain the model's prediction on unseen data. Since I pass a single unseen example directly to the model, it gets preprocessed to shape (1, 15) (a single observation does not contain every categorical level seen in training, so one-hot encoding produces fewer columns).
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns = xgb['classifier'].get_booster().feature_names)
And I got:
ValueError: Shape of passed values is (1, 15), indices imply (1, 100).
This occurs because the model was trained on the whole preprocessed dataset with shape (x, 100), but I pass a single observation with shape (1, 15) to the explainer. How do I correctly pass a single unseen observation to the explainer?

We never use .fit_transform() on unseen data; the correct way is to use the .transform() method of the preprocessor that has already been fitted on your training data (here xgb['preprocessor']). That way we ensure that the (transformed) unseen data have the same features as our (transformed) training data, and so are compatible with the model built on the latter.
So, you should replace .fit_transform(df) here:
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].fit_transform(df), columns = xgb['classifier'].get_booster().feature_names)
with .transform(df):
eli5.show_prediction(xgb['classifier'], xgb['preprocessor'].transform(df), columns = xgb['classifier'].get_booster().feature_names)
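For completeness, here is a minimal sketch of the whole flow, assuming the fitted GridSearchCV from the question is xgb_model, that df holds a single unseen row with the same raw columns as X_train, and using eli5's feature_names argument to label the transformed columns:

import eli5

# the best pipeline found by the grid search
best_pipe = xgb_model.best_estimator_
preprocessor = best_pipe.named_steps['preprocessor']
classifier = best_pipe.named_steps['classifier']

# transform (not fit_transform) the unseen row with the already-fitted preprocessor,
# so it comes out with the same 100 columns the model was trained on
row = preprocessor.transform(df)

eli5.show_prediction(
    classifier,
    row,
    feature_names=classifier.get_booster().feature_names
)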

Related

Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of

I have created a sklearn pipeline for preprocessing and then running the model over the processed data. The preprocessing step takes care of missing values, yet it still throws the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Below is my code:
def test_sklearn_pipeline(random_state_num):
    numeric_features = ["x", "y"]
    categorical_features = ["wconfid", "pctid"]
    missing_features = ["x"]
    missing_transformer = Pipeline(
        steps=[("imputer", SimpleImputer(strategy="mean"))]
    )
    scale_transformer = Pipeline(
        steps=[("scaler", StandardScaler())]
    )
    categorical_transformer = Pipeline(
        steps=[('ohe', OneHotEncoder(handle_unknown="ignore"))]
    )
    preprocessor = ColumnTransformer(
        transformers=[
            ("miss", missing_transformer, missing_features),
            ("cat", categorical_transformer, categorical_features),
            ('outlier_remover', outlier_removal, numeric_features),
            ("num", scale_transformer, numeric_features)
        ], remainder='passthrough'
    )
    clf = Pipeline(
        steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
    )
    df = pd.read_csv('accelerometer_modified.csv')
    df = df.drop(columns=['random'])
    X, y = df.drop(columns=['z']), df.loc[:, 'z']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
    clf.fit(X_train, y_train)
    print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Numeric features and missing features have the column x in common. ColumnTransformer runs each transformation on the raw input dataframe, not on the output of the previous transformer, so you are running the StandardScaler on the raw column (which still contains NaN) and not on the imputed one. You need the two steps to run sequentially, i.e. put them into a small Pipeline, as you have already done for the categorical features, with steps that first impute and then scale; a sketch is shown below.
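A minimal sketch of that fix, reusing categorical_transformer, numeric_features and categorical_features as defined in the question and leaving the outlier_remover step aside for brevity:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# impute and scale the numeric columns sequentially inside one pipeline,
# so the scaler sees imputed values rather than the raw NaNs
numeric_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    remainder="passthrough",
)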

Sklearn Pipelines: Value Error - Expected number of features

I created a pipeline that basically loops over models and scalers and performs recursive feature elimination (RFE) as follows:
def train_models(models, scalers, X_train, y_train, X_val, y_val):
    best_results = {'f1_score': 0}
    for model in models:
        for scaler in scalers:
            for n_features in list(range(
                len(X_train.columns),
                int(len(X_train.columns) / 2),
                -10
            )):
                rfe = RFE(
                    estimator=model,
                    n_features_to_select=n_features,
                    step=10
                )
                pipe = Pipeline([
                    ('scaler', scaler),
                    ('selector', rfe),
                    ('model', model)
                ])
                pipe.fit(X_train, y_train)
                y_pred = pipe.predict(X_val)
                results = evaluate(y_val, y_pred)  # Returns a dictionary of values
                results['pipeline'] = pipe
                results['y_pred'] = y_pred
                if results['f1_score'] > best_results['f1_score']:
                    best_results = results
    print("Best F1: {}".format(best_results['f1_score']))
    return best_results
The pipeline works fine inside the function and is able to predict and score the results properly.
However, when I call pipeline.predict() outside the function, e.g.
best_result = train_models(models, scalers, X_train, y_train, X_val, y_val)
pipeline = best_result['pipeline']
pipeline.predict(X_val)
I get the following error:
Here is what the pipeline looks like:
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector',
                 RFE(estimator=LogisticRegression(C=1, max_iter=1000,
                                                  penalty='l1',
                                                  solver='liblinear'),
                     n_features_to_select=78, step=10)),
                ('model',
                 LogisticRegression(C=1, max_iter=1000, penalty='l1',
                                    solver='liblinear'))])
I'm guessing the model in the pipeline is expecting 48 features instead of 78, but I don't understand where the number 48 is coming from since n_features_to_select is set to 78 in the previous RFE step!
Any help would be greatly appreciated!
I do not have your data, but doing some math based on the info you have shared, 48 seems to be the last n_features that your nested loop tries. This makes me suspect that the culprit is a shallow copy: every pipeline you build stores a reference to the same model object, so each iteration refits it, and the "best" pipeline you return ends up holding a model that was last fitted on 48 features. I suggest you change the following:
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', model)
])
to
pipe = Pipeline([
    ('scaler', scaler),
    ('selector', rfe),
    ('model', copy.deepcopy(model))
])
and try again (after first doing an import copy too, of course).
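If it helps to see why sharing one estimator object matters, here is a tiny self-contained sketch (unrelated to the question's data) showing that two pipelines built around the same model instance share the fitted state, whereas a deep copy does not:

import copy
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

model = LogisticRegression()

# both pipelines reference the very same estimator object
pipe_a = Pipeline([('model', model)])
pipe_b = Pipeline([('model', model)])

y = np.array([0, 1] * 10)
X1 = np.random.rand(20, 5)
X2 = np.random.rand(20, 3)   # different number of features

pipe_a.fit(X1, y)
pipe_b.fit(X2, y)            # refits the shared model on 3 features

print(pipe_a.named_steps['model'].n_features_in_)   # 3, not 5 -- pipe_a was silently changed

# with a deep copy, each pipeline keeps its own fitted model
pipe_c = Pipeline([('model', copy.deepcopy(model))])
pipe_c.fit(X1, y)
print(pipe_c.named_steps['model'].n_features_in_)   # 5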

XGBRegressor using Pipeline

I'm using XGBRegressor with a Pipeline. The pipeline contains preprocessing steps and the model (XGBRegressor).
Below are the complete preprocessing steps (I have already defined numeric_cols and cat_cols):
numerical_transfer = SimpleImputer()

cat_transfer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transfer, numeric_cols),
        ('cat', cat_transfer, cat_cols)
    ])
And the final pipeline is
my_model = Pipeline(steps = [('preprocessor', preprocessor), ('model', model)])
When I try to fit without using early_stopping_rounds, the code works fine.
(my_model.fit(X_train, y_train))
But when I use early_stopping_rounds as shown below, I get an error.
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid, y_valid)])
I'm getting the error at model__eval_set=[(X_valid, y_valid)], and the error is:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, CentralAir, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, MiscFeature, SaleType, SaleCondition
Does this mean that I should preprocess X_valid before passing it to my_model.fit(), or have I done something wrong?
If the problem is that we need to preprocess X_valid before calling fit(), how do I do that with the preprocessor I have defined above?
Edit: I tried to preprocess X_valid without the Pipeline, but I got an error about a feature mismatch.
The problem is that the pipeline does not apply its preprocessing to eval_set. So, as you said, you need to preprocess X_valid yourself. The easiest way to do that is to use your pipeline without the 'model' step. Use the following code before fitting your pipeline:
# Make a copy to avoid changing the original data
X_valid_eval = X_valid.copy()

# Build the pipeline without the model step
eval_set_pipe = Pipeline(steps=[('preprocessor', preprocessor)])

# Fit on X_train, then transform the copy of X_valid
X_valid_eval = eval_set_pipe.fit(X_train, y_train).transform(X_valid_eval)
Then fit your pipeline after changing model__eval_set as follows:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid_eval, y_valid)])

How to output Pandas object from sklearn pipeline

I have constructed a pipeline that takes a pandas dataframe that has been split into categorical and numerical columns. I am trying to run GridSearchCV on my results and ultimately look at the ranked features of importance for the best performing model that GridSearchCV selects. The problem I am encountering is that sklearn pipelines output numpy array objects and lose any column information along the way. Thus when I go to examine the most important coefficients of the model I am left with an unlabeled numpy array.
I have read that building a custom transformer might be a possible solution to this, but I do not have any experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to try and implement something that might not be updated in parallel with sklearn. Can anyone suggest what they believe is the best path to go about getting around this issue? I am also open to any literature that has hands on application of pandas and sklearn pipelines.
My Pipeline:
# impute and standardize numeric data
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])
Cross Validation:
kf = KFold(n_splits=4, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()
Grid Search:
param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)
Examining Coefficients:
model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_
Here is the output of the model coefficients as it currently stands when performed on the seaborn "mpg" dataset:
array([-4.64782052e-01, 1.47805207e+00, -3.28948689e-01, -5.37033173e+00,
2.80000700e-01, 2.71523808e+00, 6.29170887e-01, 9.51627968e-01,
...
-1.50574860e+00, 1.88477450e+00, 4.57285471e+00, -6.90459868e-01,
5.49416409e+00])
Ideally I would like to preserve the pandas dataframe information and retrieve the derived column names after OneHotEncoder and the other methods are called.
I would actually go for creating the column names from the input. If your input is already divided into numerical and categorical, you can use pd.get_dummies to get the number of distinct categories for each categorical feature.
Then you can just create proper names for the columns, as shown in the last part of this working example based on the question, with some artificial data.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# create artificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0]})
categorical_features = ['cat1', 'cat2']

X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1': [2, 3], 'x2': [0.2, 0.3], 'cat1': [0, 1], 'cat2': [2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})

# impute and standardize numeric data
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])

kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv=kf)
gs.fit(X_train, y_train)

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ', model.named_steps['ridge'].coef_, '\n')

# create column names for the one-hot encoded categorical data
columns_names_to_map = list(np.copy(numeric_features))
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)

print('columns after preprocessing :', columns_names_to_map, '\n')
print('#' * 80)
print('\n', 'dataframe of rescaled features with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
print('#' * 80)
print('\n', 'dataframe of ridge coefficients with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(model.named_steps['ridge'].coef_.T, columns_names_to_map)}))
The code above prints out (at the end) a dataframe that maps each parameter name to its value.
I would use model.named_steps['transform'].get_feature_names_out().
It will return the feature names like this:
array(['num__cylinders', 'num__displacement', 'num__horsepower',
'num__weight', 'num__acceleration', 'num__model_year',
'cat__origin_europe', 'cat__origin_japan', 'cat__origin_usa',...])
Then you can use the feature names to transform the output to a dataframe:
weights_df = pd.DataFrame(model.named_steps['ridge'].coef_,index=model.named_steps['transform'].get_feature_names_out()).T
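As a small follow-up, a sketch of how the labeled coefficients could then be inspected, assuming a single-output Ridge (so coef_ is one-dimensional) and a pandas version that supports the key argument of sort_values (>= 1.1):

import pandas as pd

coefs = pd.Series(
    model.named_steps['ridge'].coef_,
    index=model.named_steps['transform'].get_feature_names_out(),
)
# largest coefficients by absolute value first
print(coefs.sort_values(key=abs, ascending=False).head(10))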

Add Features to An Sklearn Classifier

I'm building an SGDClassifier and using a tf-idf transformer. Aside from the features created by tf-idf, I'd also like to add additional features like document length or other ratings. How can I add these features to the feature set? Here is how the classifier is constructed in a pipeline:
data = fetch_20newsgroups(subset='train', categories=None)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])

parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
    'tfidf__use_idf': (True, False),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print(grid_search.best_score_)
You can use a FeatureUnion: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces
There is a nice example in the documentation (https://scikit-learn.org/0.18/auto_examples/hetero_feature_union.html) which I think fits your requirements exactly; see the TextStats transformer. A sketch of the same idea follows below.
[Update: the example was for scikit-learn <= 0.18.]
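Since that example targets an older scikit-learn, here is a minimal sketch of the same idea with the current API; the TextLength transformer and its placement are my own illustration, not the linked example verbatim:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import FeatureUnion, Pipeline

class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: document length in characters."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one column per document, so it can sit next to the tf-idf matrix
        return np.array([[len(doc)] for doc in X])

data = fetch_20newsgroups(subset='train', categories=None)

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),   # replaces CountVectorizer + TfidfTransformer
        ('length', TextLength()),       # extra hand-crafted feature
    ])),
    ('clf', SGDClassifier()),
])

pipeline.fit(data.data, data.target)
print(pipeline.score(data.data, data.target))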
