In relation to the question posted in "One Hot Encoding preserve the NAs for imputation", I am trying to create a custom transformer that handles NAs when one-hot encoding categorical variables. The setup should be suitable for a train/test split and for modelling with an sklearn Pipeline.
A simple reproducible example of my problem:
#Packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.impute import SimpleImputer
# Make some categorical data X and a response y and split it.
X = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],[np.nan,"B"],["B","A"],["A","B"],["C","B"],["D","E"]])
y = pd.DataFrame(data = np.array([1,5,4,6,2,3,9,9]))
X_train, X_test, Y_train, Y_test = train_test_split(X,y,test_size=0.2,random_state=42)
I've then created a custom function that does OHE with nan (using the procedure described in Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn)
class OHE_with_nan(BaseEstimator,TransformerMixin):
    """ OHE with NAN. Not super pretty but works..
    """
    def __init__(self, copy=True):
        self.copy = copy

    def fit(self, X, y = None):
        """ This transformer does not use a fit procedure """
        return self

    def transform(self, X, y = None):
        """ Return the new object here"""
        # Replace nans with "Missing" such that OneHotEncoder can work.
        enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
        data1 = pd.DataFrame(columns=X.columns,data = enc_missing.fit_transform(X))
        #Perform standard OHE
        OHE = OneHotEncoder(sparse=False,handle_unknown="ignore")
        OHE_fit = OHE.fit_transform(data1)
        #save feature names of the OHE dataframe
        data_OHE = pd.DataFrame(columns=OHE.get_feature_names(data1.columns),data = OHE_fit)
        # Initialize
        Column_names = data1.columns
        Final_OHE = pd.DataFrame()
        # Loop over columns to replace 0s with nan the correct places.
        for i in range(len(data1.columns)):
            tmp_data = data_OHE[data_OHE.columns[pd.Series(data_OHE.columns).str.startswith(Column_names[i])]]
            missing_name = tmp_data.iloc[:,-1:].columns
            missing_index = np.where(tmp_data[missing_name]==1)[0]
            tmp_data.loc[missing_index,:] = np.nan
            tmp_data1 = tmp_data.drop(missing_name,axis=1)
            Final_OHE = pd.concat([Final_OHE, tmp_data1], axis=1)
        return Final_OHE
This is then combined into a pipeline that predicts y using ridge regression (random choice of model, just for the example..)
Estimator = Pipeline([
('Ohe_with_NA',OHE_with_nan()),
("Imputer",KNNImputer(n_neighbors=1)),
('Model',Ridge(alpha = 0.01))
])
The procedure can be fitted:
pipe_fit = Estimator.fit(X_train,Y_train)
But testing on unseen data fails:
pipe_fit.score(X_test, Y_test)
ValueError: X has 2 features, but KNNImputer is expecting 7 features as input.
This is because the handle_unknown = "ignore" setting of the OneHotEncoder within OHE_with_nan is no longer "active" now that it has been wrapped into my custom function.
If one simply uses OneHotEncoder(handle_unknown = "ignore") directly in the pipeline, everything works fine (but that's not my intention as this "removes" the nans from the data I try to impute.)
My question
How do I enable handle_unknown = "ignore" in my custom function such that it can perform in a pipeline setup on unseen data as well?
Hope you understand my situation - any help is highly appreciated!
I think the main problem is that you need to save more information (especially, the internal OneHotEncoder) at fit time. I also made the missing-column identification a little more robust (I think maybe you were relying on the ordering putting that last, but that only held for your sample data because of alphabetical order?). I didn't spend much time cleaning things up or looking for efficiencies.
class OHE_with_nan(BaseEstimator, TransformerMixin):
    """One-hot encode, propagating NaNs.
    Requires a dataframe as input!
    """
    def fit(self, X, y=None):
        self.orig_cols_ = X.columns
        self.imputer_ = SimpleImputer(strategy="constant", fill_value="MISSING")
        X_filled = self.imputer_.fit_transform(X)
        self.ohe_ = OneHotEncoder(sparse=False, handle_unknown="ignore")
        self.ohe_.fit(X_filled)
        self.ohe_colnames_ = self.ohe_.get_feature_names(X.columns)
        self.missing_value_columns = np.array(["MISSING" in col for col in self.ohe_colnames_])
        return self

    def transform(self, X, y=None):
        raw_ohe = pd.DataFrame(self.ohe_.transform(self.imputer_.transform(X)), columns=self.ohe_colnames_)
        out_list = []
        # Loop over columns to replace 0s with nan the correct places.
        for orig_col in self.orig_cols_:
            tmp_data = raw_ohe[self.ohe_colnames_[pd.Series(self.ohe_colnames_).str.startswith(orig_col)]]
            missing_name = tmp_data.columns[["MISSING" in col for col in tmp_data.columns]]
            missing_indices = np.where(tmp_data[missing_name]==1)[0]
            tmp_data.loc[missing_indices, :] = np.nan
            tmp_data1 = tmp_data.drop(missing_name, axis=1)
            out_list.append(tmp_data1)
        out = pd.concat(out_list, axis=1)
        return out
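For comparison, here is a minimal sketch of the same idea (learn the categories in fit, reuse them in transform) written without OneHotEncoder, so it should not depend on a particular scikit-learn version. The class name NanPreservingOneHot and the "column_category" naming are my own assumptions, not part of the answer above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class NanPreservingOneHot(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: one-hot encode a DataFrame, keeping missing rows as NaN."""

    def fit(self, X, y=None):
        # Learn the category list per column from the training data only.
        self.categories_ = {
            col: sorted(X[col].dropna().unique()) for col in X.columns
        }
        return self

    def transform(self, X):
        blocks = []
        for col, cats in self.categories_.items():
            values = X[col]
            # One indicator per category seen at fit time; a category that
            # was not seen at fit time yields an all-zero row.
            block = pd.DataFrame(
                {f"{col}_{cat}": (values == cat).astype(float) for cat in cats},
                index=X.index,
            )
            # Rows that were missing in the input become NaN in every indicator.
            block.loc[values.isna(), :] = np.nan
            blocks.append(block)
        return pd.concat(blocks, axis=1)


train = pd.DataFrame({"c": ["A", "B", np.nan]})
test = pd.DataFrame({"c": ["C", np.nan, "A"]})
encoder = NanPreservingOneHot().fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```

Because the column layout is fixed at fit time, the test set always produces the same number of features as the training set, which is what KNNImputer needs in the next pipeline step.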
Related
I'm creating a model with scikit-learn. The pipeline that seems to be working best is:
mutual_info_classif with a threshold - i.e. only include fields whose mutual information score is above a given threshold.
PCA
LogisticRegression
I'd like to do them all using sklearn's pipeline object, but I'm not sure how to get the mutual info classification in. For the second and third steps I do:
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
[
('dim_red', pca),
('pred', lr)
]
)
But I don't see a way to include the first step. I know I can create my own class to do this, and I will if I have to, but is there a way to do this within sklearn?
You can implement your Estimator by subclassing BaseEstimator. Then, you can pass it as estimator to a SelectFromModel instance, which can be used in a Pipeline:
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
X = [[ 0.87, -1.34, 0.31 ],
[-2.79, -0.02, -0.85 ],
[-1.34, -0.48, -2.55 ],
[ 1.92, 1.48, 0.65 ]]
y = [0, 1, 0, 1]
class MutualInfoEstimator(BaseEstimator):
    def __init__(self, discrete_features='auto', n_neighbors=3, copy=True, random_state=None):
        self.discrete_features = discrete_features
        self.n_neighbors = n_neighbors
        self.copy = copy
        self.random_state = random_state

    def fit(self, X, y):
        self.feature_importances_ = mutual_info_classif(X, y, discrete_features=self.discrete_features,
                                                        n_neighbors=self.n_neighbors,
                                                        copy=self.copy, random_state=self.random_state)
        # Return self, as the scikit-learn estimator convention expects.
        return self
feat_sel = SelectFromModel(estimator=MutualInfoEstimator(random_state=0))
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
[
('feat_sel', feat_sel),
('pca', pca),
('pred', lr)
]
)
print(pipe)
Pipeline(steps=[('feat_sel',
SelectFromModel(estimator=MutualInfoEstimator(random_state=0))),
('pca', PCA(random_state=100)),
('pred', LogisticRegression(random_state=200))])
Note that of course the new estimator should expose the parameters you want to tweak during optimisation. Here I just exposed all of them.
Yeah, I do not think there is another way to do it. At least not that I know!
How about SelectKBest or SelectPercentile:
from sklearn.feature_selection import SelectKBest
mi_best = SelectKBest(score_func=mutual_info_classif, k=10)
pca = PCA(random_state=100)
lr = LogisticRegression(random_state=200)
pipe = Pipeline(
[
('select', mi_best),
('dim_red', pca),
('pred', lr),
]
)
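SelectKBest keeps a fixed number of features, whereas the question asks for a score threshold. A minimal sketch of a threshold-based selector follows; MIThresholdSelector and its attribute names are hypothetical (my own, not an sklearn class), and the toy data is made up purely for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class MIThresholdSelector(BaseEstimator, TransformerMixin):
    """Hypothetical selector: keep features whose mutual information exceeds a threshold."""

    def __init__(self, threshold=0.05, random_state=None):
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        X = np.asarray(X)
        # Score every feature against the target on the training data.
        self.scores_ = mutual_info_classif(X, y, random_state=self.random_state)
        self.support_ = self.scores_ > self.threshold
        return self

    def transform(self, X):
        # Keep only the columns selected at fit time.
        return np.asarray(X)[:, self.support_]


# Toy data: column 0 tracks the label closely, column 1 is pure noise.
rng = np.random.RandomState(0)
y = np.repeat([0, 1], 100)
X = np.column_stack([y + 0.1 * rng.randn(200), rng.randn(200)])

pipe = Pipeline(
    [
        ('select', MIThresholdSelector(threshold=0.2, random_state=0)),
        ('dim_red', PCA(random_state=100)),
        ('pred', LogisticRegression(random_state=200)),
    ]
)
pipe.fit(X, y)
selector = pipe.named_steps['select']
print(selector.scores_, selector.support_)
```

Since the threshold is a constructor argument, it shows up in pipe.get_params() as select__threshold and can be included in a parameter grid.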
It is possible to do this, but the utility of doing this will vary depending on where in the ML workflow you are. I will describe how I got something similar to work.
High Level:
A selector in a ColumnTransformer is just a callable that returns a list of columns when it is passed the dataframe. We can use this to do what you're trying to do. We can define it as follows:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
from sklearn.dummy import DummyRegressor
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
def mi_selector(mi_threshold=0.05, target_label=None, random_state=None):
    # random_state is taken as an argument here; the original snippet
    # referenced it without defining it.
    def selector_to_return(df,):
        mi_df = compute_mutual_information(
            df=df,
            target_label=target_label,
            random_state=random_state)
        matching_variables = mi_df[mi_df.loc[:,'mutual_information'] > mi_threshold].index.tolist()
        matching_features = []
        # Remove target
        if target_label in matching_variables:
            matching_variables.remove(target_label)
        # Only return from features that were in original df
        # since we compute more than that as we impute, encode etc.
        for feature_name in df.columns.tolist():
            if feature_name in matching_variables:
                matching_features.append(feature_name)
        return matching_features
    return selector_to_return
What this does is it uses the mutual_information computed by compute_mutual_information to create a selector which can be plugged into a Pipeline.
Since we're computing mutual information, we need to know what the target is. We also need to know whether it's a classification or regression problem. This is the part that compute_mutual_information solves using the following approach:
Figure out the metadata for the dataframe, physical datatypes as well as whether or not the feature is numeric or not etc., along with some other stats (second last code snippet)
Figure out whether it's a regression or a classification problem to set up the right function for computing mutual information
Impute and encode/scale the data and reconstruct the processed dataframe
Compute mutual information on the processed dataframe
def compute_mutual_information(df, target_label, random_state):
    # Analyze data frame
    meta_df = df_metadata(df, numerical_threshold=50)
    target_is_numerical = meta_df.loc[meta_df.variable == target_label][
        'is_numerical'].iloc[0]
    # Determine problem type
    if target_is_numerical:
        problem_type = 'regression'
        mutual_information_function = mutual_info_regression
    else:
        problem_type = 'classification'
        mutual_information_function = mutual_info_classif
    # Select feature types
    my_numerical_selector = feature_type_selector(dtype_include='numerical')
    my_categorical_selector = feature_type_selector(dtype_include='categorical')
    numerical_features = my_numerical_selector(df)
    categorical_features = my_categorical_selector(df)
    # Remove target label from features
    for feature_list in [numerical_features, categorical_features]:
        if target_label in feature_list:
            feature_list.remove(target_label)
    # Transform df
    imputation_preprocessor = ColumnTransformer(
        [('numerical_imputer',
          SimpleImputer(strategy='median', add_indicator=True),
          numerical_features),
         ('categorical_imputer',
          SimpleImputer(strategy='most_frequent', add_indicator=True),
          categorical_features)],
        remainder='passthrough')
    # We need to figure out the indices to the features that are supposed to be scaled and encoded by the next
    # step
    post_imputation_np = imputation_preprocessor.fit_transform(df)
    feature_name_np_array = imputation_preprocessor.get_feature_names_out()
    categorical_feature_indices = np.zeros(len(categorical_features))
    numerical_feature_indices = np.zeros(len(numerical_features))
    for position, feature in enumerate(categorical_features):
        categorical_feature_indices[position] = np.where(
            feature_name_np_array == 'categorical_imputer__' + feature)[0]
    for position, feature in enumerate(numerical_features):
        numerical_feature_indices[position] = np.where(
            feature_name_np_array == 'numerical_imputer__' + feature)[0]
    categorical_feature_indices = categorical_feature_indices.astype(
        int).tolist()
    numerical_feature_indices = numerical_feature_indices.astype(int).tolist()
    numeric_and_categorical_transformer = ColumnTransformer(
        [('OneHotEncoder', OneHotEncoder(),
          categorical_feature_indices),
         ('StandardScaler', StandardScaler(),
          numerical_feature_indices)],
        remainder='passthrough')
    preprocessor = Pipeline(
        [('imputation_preprocessor', imputation_preprocessor),
         ('numeric_and_categorical_transformer',
          numeric_and_categorical_transformer)])
    df_transformed_np = preprocessor.fit_transform(df)
    preprocessed_feature_names = list(preprocessor.get_feature_names_out())
    df_transformed = pd.DataFrame(
        df_transformed_np.todense(),
        columns=preprocessed_feature_names)
    df_transformed = df_transformed.rename(shorten_param, axis=1)
    estimated_mutual_information = mutual_information_function(
        X=df_transformed, y=df[target_label], random_state=random_state)
    estimated_mutual_information_df = pd.DataFrame(
        estimated_mutual_information.T.reshape(
            1, -1), columns=preprocessed_feature_names)
    estimated_mutual_information_df = estimated_mutual_information_df.rename(
        shorten_param,
        axis=1)
    estimated_mutual_information_df = estimated_mutual_information_df.T
    estimated_mutual_information_df.columns = ['mutual_information']
    estimated_mutual_information_df = estimated_mutual_information_df.sort_values(
        by=['mutual_information'])
    return estimated_mutual_information_df
The above uses a feature_type_selector, which is defined as follows:
def feature_type_selector(dtype_include=None):
    def nested_function(df,):
        meta_df = df_metadata(df)
        if dtype_include == 'numerical':
            return meta_df.loc[meta_df.is_numerical, 'variable'].tolist()
        else:
            return meta_df.loc[meta_df.is_numerical ==
                               False, 'variable'].tolist()
    return nested_function
The metadata analysis of the dataframe does the following:
Determine variable types
Figure out, with some threshold, which features are really categoricals encoded as numericals
Percentage missing data etc.
def df_metadata(df, numerical_threshold=50):
    list_of_variables = list(df.dtypes.index)
    list_of_dtypes = [df.dtypes[variable] for variable in list_of_variables]
    categorical_selector = selector(dtype_include=object)
    numerical_selector = selector(dtype_exclude=object)
    unique_value_counts = [df[variable].nunique()
                           for variable in list_of_variables]
    categorical_features = categorical_selector(df)
    numerical_features = numerical_selector(df)
    is_numerical_init = [True] * len(list_of_variables)
    metadata_frame = pd.DataFrame(
        {'variable': list_of_variables, 'dtype': list_of_dtypes,
         'is_numerical': is_numerical_init,
         'unique_value_counts': unique_value_counts})
    null_sum = df.isnull().sum()
    null_sum.name = 'null_sum'
    metadata_frame = pd.merge(
        metadata_frame,
        null_sum,
        left_on='variable',
        right_index=True)
    metadata_frame['samples_missing'] = metadata_frame['null_sum'] > 0
    total_samples = len(df)
    metadata_frame['percent_missing'] = metadata_frame['null_sum'] / total_samples
    for feature in categorical_features:
        metadata_frame.loc[metadata_frame.variable ==
                           feature, ['is_numerical']] = False
    for feature in numerical_features:
        if df[feature].nunique() < numerical_threshold:
            metadata_frame.loc[metadata_frame.variable ==
                               feature, ['is_numerical']] = False
    return metadata_frame
The shorten_param function is defined as follows:
def shorten_param(param_name):
    if "__" in param_name:
        if len(param_name.rsplit(" ", 1)) < 2:
            return param_name.rsplit("__", 1)[1]
        else:
            return str(shorten_param(param_name.rsplit(" ", 1)[
                0])) + " " + shorten_param(' '.join(param_name.rsplit(" ", 1)[1:]))
    return param_name
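To make the effect of shorten_param concrete, here is a compact restatement that, as far as I can tell, gives the same result: for each space-separated token, everything up to the last "__" is dropped. The name strip_step_prefixes and the example feature names are hypothetical and only meant to mimic what get_feature_names_out tends to produce.

```python
def strip_step_prefixes(name):
    """Drop everything up to the last '__' in each space-separated token."""
    return " ".join(token.rsplit("__", 1)[-1] for token in name.split(" "))


# Made-up names in the "step__feature" style used by ColumnTransformer.
examples = [
    "categorical_imputer__Gender",
    "OneHotEncoder__categorical_imputer__Gender_M",
    "Height",
]
shortened = [strip_step_prefixes(name) for name in examples]
print(shortened)
```

Names without a "__" separator pass through unchanged, which is why the original columns can still be matched after preprocessing.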
With all this in place, you can do something like the following to run your model.
standard_scaler_transformer = StandardScaler()
identity_transformer = ColumnTransformer([('unused_scaler', standard_scaler_transformer, []),],remainder='passthrough')
my_mi_selector = mi_selector(mi_threshold = 0.2, target_label=target_label)
mi_filter = ColumnTransformer([('identity_transformer', identity_transformer, my_mi_selector)], remainder='drop')
brute_imputer = SimpleImputer(strategy='most_frequent')
ames_target = ames_data[ames_target_name]
my_model = Pipeline([('mi_filter', mi_filter), ('brute_imputer', brute_imputer), ('Ridge', Ridge())])
my_model.fit(X=ames_data, y=ames_target)
To my original point about the utility of this, I think it's useful very early on in the process when you're trying to figure out which features are important, and which are not. Here are some challenges that I ran into:
Since feature selection happens at runtime, you are limited in what feature engineering you can do. For example, in imputation, I had to use most_frequent because that works for whichever feature type.
It would be really cool if there was a way to track variable locations as they enter numpy land. For example, in compute_mutual_information, post transformation, the ndarray needs to be put back into a DataFrame and then the mutual_information computed in order to have traceability of the mutual_information values for each feature
You cannot treat mi_threshold as a hyperparameter, because the selector is a callable, and model.get_params() has essentially a memory address.
I look forward to the day when mutual_information is a hyperparameter that you can tune, and you set policies for the types of feature engineering you want to apply.
I want to know how I can insert into an sklearn pipeline a step that multiplies the values of two columns and deletes the original ones.
I'm doing something like that.
After loading the Dataframe, I multiply the target columns and delete them.
Prepare X, Y, training set and test set.
Configure pipeline with StandardScaler and some ML method (for example Linear Regression)
Fit and predict.
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
# df is a pandas dataframe with columns A, B, C, Y
df['BC']=df['B']*df['C']
df.drop(columns=['B','C'], inplace=True)
X = df.loc[:,['A','BC']]
Y = df['Y']
x_train, x_test, y_train, y_test = train_test_split(X,Y,train_size=0.8)
pipe = Pipeline([
('minmax',StandardScaler()),
('linear',LinearRegression())
])
pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)
With this approach, when I want to make some prediction of new data, I must pass the multiplication, for example A=1, B=3, C=4
print(pipe.predict(np.array([[1,12]])))
And I want an approach like
print(pipe.predict(np.array([[1,3,4]])))
What I want is to modify the pipeline to something like
pipe = Pipeline([
('product', CustomFunction(columns_to_multiply, result_name_column)),
('minmax',StandardScaler()),
('linear',LinearRegression())
])
Is it possible with scikit-learn or custom functions? How?
I am unable to fully test your code because the data is missing. However, you may be able to adapt FunctionTransformer as follows:
Code:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def CustomMultiplier(arrs):
    # Keep the first column and replace the remaining columns by their product.
    a = arrs[:,0]
    b = np.prod(arrs[:,1:], axis=1)
    return np.column_stack((a, b))

if __name__ == '__main__':
    transformer = FunctionTransformer(CustomMultiplier)
    X = np.array([[1,3,4], [2,4,5]])
    result = transformer.transform(X)
    print(result)
Result:
[[ 1 12]
[ 2 20]]
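To get the pipeline the question asks for, the transformer can be placed as the first step. The following is a sketch on made-up data: the function name multiply_b_and_c, the column order (A, B, C) and the synthetic target are my assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def multiply_b_and_c(arr):
    """Map columns (A, B, C) to (A, B*C)."""
    arr = np.asarray(arr, dtype=float)
    return np.column_stack((arr[:, 0], arr[:, 1] * arr[:, 2]))


pipe = Pipeline([
    ('product', FunctionTransformer(multiply_b_and_c)),
    ('scale', StandardScaler()),
    ('linear', LinearRegression()),
])

# Synthetic data where the target depends on A and on the product B*C.
rng = np.random.RandomState(0)
X = rng.uniform(1, 5, size=(50, 3))
y = 2.0 * X[:, 0] + 3.0 * X[:, 1] * X[:, 2]

pipe.fit(X, y)
# Raw A, B, C values can now be passed directly.
pred = pipe.predict(np.array([[1, 3, 4]]))
print(pred)
```

With this layout the multiplication is part of the fitted pipeline, so new data no longer has to be pre-multiplied by hand.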
So I am trying to use make_pipeline in scikit-learn to clean my data (replace missing values and then clean for outliers), apply an encoding function to the categorical variables, and then finally add a Random Forest Regressor through RandomForestRegressor. The input is a DataFrame. Eventually I'd like to put this through GridSearchCV to search over optimal hyperparameters for the regressor.
In order to do this I built some custom classes which inherit the TransformerMixin class as advised here. Here is what I have so far
from sklearn.pipeline import make_pipeline
from sklearn.base import TransformerMixin
import pandas as pd
class Cleaning(TransformerMixin):
    def __init__(self, column_labels):
        self.column_labels = column_labels

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Given a dataframe X with predictors, clean it."""
        X_imputed, medians_X = median_imputer(X) # impute all missing numeric data with median
        quantiles_X = get_quantiles(X_imputed, self.column_labels)
        X_nooutliers, _ = replace_outliers(X_imputed, self.column_labels, medians_X, quantiles_X)
        return X_nooutliers

class Encoding(TransformerMixin):
    def __init__(self, encoder_list):
        self.encoder_list = encoder_list

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        """Takes in dataframe X and applies encoding transformation to them"""
        return encode_data(self.encoder_list, X)
However, when I run the following line of code I am getting an error:
import category_encoders as ce
from sklearn.ensemble import RandomForestRegressor
pipeline_cleaning = Cleaning(column_labels = train_labels)
OneHot_binary = ce.OneHotEncoder(cols = ['new_store'])
OneHot = ce.OneHotEncoder(cols= ['transport_availability'])
Count = ce.CountEncoder(cols = ['county'])
pipeline_encoding = Encoding([OneHot_binary, OneHot, Count])
baseline = RandomForestRegressor(n_estimators=500, random_state=12)
make_pipeline([pipeline_cleaning, pipeline_encoding,baseline])
The error is saying Last step of Pipeline should implement fit or be the string 'passthrough'. I don't understand why?
EDIT: slight typo in the last line, now corrected. The third element in the list passed to make_pipeline is the regressor.
Change the line:
make_pipeline([pipeline_cleaning, pipeline_encoding,baseline])
to (without list):
make_pipeline(pipeline_cleaning, pipeline_encoding,baseline)
Pipeline(steps=[('cleaning', <__main__.Cleaning object at 0x7f617260c1d0>),
('encoding', <__main__.Encoding object at 0x7f617260c278>),
('randomforestregressor',
RandomForestRegressor(n_estimators=500, random_state=12))])
and you're good to go.
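The reason is that make_pipeline takes the steps as separate positional arguments and names them after their lowercased class names. If the steps already live in a list, they can be unpacked with *. A small sketch with stock estimators (chosen only for illustration, they are not the classes from the question):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

steps = [StandardScaler(), LinearRegression()]

# Unpack the list so each estimator becomes its own positional argument.
pipe = make_pipeline(*steps)

# Step names are derived from the class names.
names = [name for name, _ in pipe.steps]
print(names)
```

Passing the list itself makes the whole list the single (and therefore last) step, and a list has no fit method, which is what the error message complains about.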
I am trying to preprocess and prepare a dataset to train a model. The dataset has some numerical and categorical features. Let's assume this is the dataset (presented as a Pandas DataFrame):
Age,Gender,Height
23,M,170
33,M,195
31,F,165
and we want to convert it to the following format, using only the Transform functions in scikit-learn 0.19.0 and a single Pipeline object:
Age,Gender_M,Gender_F,Height
0.0,1,0,170
1.0,1,0,195
0.8,0,1,165
The transforms are as follows:
Age: MinMaxScaler
Gender: OneHotEncoder
Height: Unchanged
Is this even possible (an alternative to make it slightly easier is to drop the Height feature for now)?
In my own attempts to come up with a solution, I noticed there is no way to apply a Transform function to a subset of all features, which made me believe this cannot be done using a single Pipeline. But I will be happy to be proven wrong.
Thanks
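You can write your own transformer to do the desired operations as follows.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler

class CustomTransformer(BaseEstimator, TransformerMixin):
    # Expects a pandas DataFrame with "Age" and "Gender" columns.
    # String categories need scikit-learn >= 0.20; the OneHotEncoder in
    # 0.19 only accepts integer-coded features.
    def fit(self, X, y=None):
        self.one_hot_encoder_ = OneHotEncoder(handle_unknown="ignore").fit(X[["Gender"]])
        self.min_max_scaler_ = MinMaxScaler().fit(X[["Age"]])
        return self

    def transform(self, X):
        X = X.copy()  # do not modify the caller's frame
        X["Age"] = self.min_max_scaler_.transform(X[["Age"]]).ravel()
        gender = self.one_hot_encoder_.transform(X[["Gender"]]).toarray()
        for i, category in enumerate(self.one_hot_encoder_.categories_[0]):
            X["Gender_" + str(category)] = gender[:, i]
        return X.drop(columns="Gender")
Now you can use the Pipeline object on this transformer as follows:
from sklearn.pipeline import Pipeline
# est stands for whatever estimator you want to train.
pipeline = Pipeline([("transform", CustomTransformer()), ("estimator", est)])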
pipeline.fit(X, y)
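On scikit-learn 0.20 or later (so not on the 0.19.0 named in the question), ColumnTransformer applies a transformer to a subset of columns without a custom class. A minimal sketch on the data from the question; the step names are arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

df = pd.DataFrame({
    "Age": [23, 33, 31],
    "Gender": ["M", "M", "F"],
    "Height": [170, 195, 165],
})

preprocess = ColumnTransformer(
    [
        ("age", MinMaxScaler(), ["Age"]),
        ("gender", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
    ],
    remainder="passthrough",  # Height is passed through unchanged
    sparse_threshold=0,       # always return a dense array
)

# Columns come out as: Age (scaled), Gender_F, Gender_M, Height.
out = preprocess.fit_transform(df)
print(out)
```

Note that the encoder sorts the categories, so the F indicator comes before the M indicator, which differs from the column order shown in the question. The ColumnTransformer itself can be used as a step in a Pipeline in front of an estimator.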
If I exclude my custom transformer, GridSearchCV runs fine, but with it included, it raises an error.
Here is a fake dataset:
import pandas
import numpy
from sklearn_pandas import DataFrameMapper
from sklearn_pandas import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
import sklearn_pandas
from sklearn.preprocessing import MinMaxScaler
df = pandas.DataFrame({"Letter":["a","b","c","d","a","b","c","d","a","b","c","d","a","b","c","d"],
"Number":[1,2,3,4,1,2,3,4,1,2,3,4,1,2,3,4],
"Label":["G","G","B","B","G","G","B","B","G","G","B","B","G","G","B","B"]})
class MyTransformer(TransformerMixin):
    def transform(self, x, **transform_args):
        x["Number"] = x["Number"].apply(lambda row: row*2)
        return x

    def fit(self, x, y=None, **fit_args):
        return self
x_train = df
y_train = x_train.pop("Label")
mapper = DataFrameMapper([
("Number", MinMaxScaler()),
("Letter", LabelBinarizer()),
])
pipe = Pipeline([
("custom", MyTransformer()),
("mapper", mapper),
("classifier", RandomForestClassifier()),
])
param_grid = {"classifier__min_samples_split":[10,20], "classifier__n_estimators":[2,3,4]}
model_grid = sklearn_pandas.GridSearchCV(pipe, param_grid, verbose=2, scoring="accuracy")
model_grid.fit(x_train, y_train)
and the error is
list indices must be integers, not str
How can I make GridSearchCV work while there is a custom transformer in my pipeline?
I know this answer comes rather late, but I've encountered the same behavior with sklearn and BaseSearchCV derivative classes. The problem actually seems to stem from the _PartitionIterator class in the sklearn cross_validation module, as it assumes that everything emitted from every TransformerMixin class in the pipeline is going to be array-like, and thus it generates slices of indices that are used to index incoming X args in an array-like manner. Here's the __iter__ method:
def __iter__(self):
    ind = np.arange(self.n)
    for test_index in self._iter_test_masks():
        train_index = np.logical_not(test_index)
        train_index = ind[train_index]
        test_index = ind[test_index]
        yield train_index, test_index
And the BaseSearchCV grid search base class calls cross_validation's _fit_and_score, which uses a method called _safe_split. Here's the relevant line:
X_subset = [X[idx] for idx in indices]
This will absolutely produce unexpected results if X is a pandas dataframe, which you're emitting from your transform function.
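The mismatch can be reproduced without scikit-learn. A small sketch (the frame and the index array are made up): integer positions select rows of a NumPy array, but plain [] on a DataFrame looks up column labels.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"Number": [1, 2, 3, 4], "Letter": list("abcd")})
indices = np.array([0, 2])

# Positional row selection works on the underlying array.
rows_from_array = frame.to_numpy()[indices]

# Plain [] on a DataFrame treats the values as column labels, so it fails
# here because there are no columns named 0 or 2.
try:
    frame[indices]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .iloc is the DataFrame equivalent of positional row selection.
rows_from_frame = frame.iloc[indices]
print(lookup_failed, rows_from_frame["Number"].tolist())
```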
There are two ways I've found to fix this:
Make sure to return an array from your transformer:
return x.as_matrix()  # as_matrix() was removed in pandas 1.0; use x.to_numpy() or x.values there
This is a hack. If the pipe of transformers demands the input to the next transformer be a DataFrame, as was my case, you can write a utilities script that is essentially the same as the sklearn grid_search module, but includes some clever validation methods that are called in the _fit method of the BaseSearchCV class:
def _validate_X(X):
    """Returns X if X isn't a pandas frame, otherwise
    the underlying matrix in the frame. """
    return X if not isinstance(X, pd.DataFrame) else X.as_matrix()

def _validate_y(y):
    """Returns y if y isn't a series, otherwise the array"""
    if y is None:
        return y
    # if it's a series
    elif isinstance(y, pd.Series):
        return np.array(y.tolist())
    # if it's a dataframe:
    elif isinstance(y, pd.DataFrame):
        # check it's X dims
        if y.shape[1] > 1:
            raise ValueError('matrix provided as y')
        return y[y.columns[0]].tolist()
    # bail and let the sklearn function handle validation
    return y
As an example, here's my "custom grid_search module".
Short version: pandas and scikit-learn's cross validation methods didn't like to talk in that way (in my version, 0.15); this may be fixed simply by updating scikit-learn to 0.16/stable or 0.17/dev.
The GridSearchCV class validates the data and converts it to an array (so that it can perform CV splits correctly). So you don't get to use Pandas DataFrame features inside of built-in cross validation loops.
You will have to make your own cross-validation routines that don't do the validation if you want to do this kind of thing.
EDIT: This is my experience with scikit-learn's cross validation routines. It is why sklearn-pandas provides cross_val_score. However, so far as I can tell, GridSearchCV is not specialized by sklearn-pandas; your import of it accidentally imports the default sklearn version. Therefore, you may have to implement your own grid search using ParameterGrid and sklearn-pandas's cross_val_score.
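For readers on a recent scikit-learn (0.20 or later), here is a sketch of how the same setup might look today. It makes several assumptions of mine: ColumnTransformer is swapped in for DataFrameMapper, OneHotEncoder for LabelBinarizer, GridSearchCV comes from sklearn.model_selection, and the transformer (renamed DoubleNumber) copies its input instead of modifying it in place. The grid values and cv=2 are chosen only to suit the tiny fake dataset.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


class DoubleNumber(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyTransformer; returns a modified copy."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["Number"] = X["Number"] * 2
        return X


df = pd.DataFrame({
    "Letter": list("abcd") * 4,
    "Number": [1, 2, 3, 4] * 4,
    "Label": list("GGBB") * 4,
})
X = df.drop(columns="Label")
y = df["Label"]

pipe = Pipeline([
    ("custom", DoubleNumber()),
    ("columns", ColumnTransformer([
        ("number", MinMaxScaler(), ["Number"]),
        ("letter", OneHotEncoder(handle_unknown="ignore"), ["Letter"]),
    ])),
    ("classifier", RandomForestClassifier(random_state=0)),
])
param_grid = {
    "classifier__min_samples_split": [2, 4],
    "classifier__n_estimators": [2, 3, 4],
}
search = GridSearchCV(pipe, param_grid, scoring="accuracy", cv=2)
search.fit(X, y)
print(search.best_params_)
```

Inheriting from BaseEstimator gives the transformer get_params and set_params, which GridSearchCV relies on when it clones the pipeline for each parameter combination.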