Staged_predict from a Pipeline object

Staged_predict from a Pipeline object - python

I am having the same issue which was outlined years ago here:
https://github.com/scikit-learn/scikit-learn/issues/10197
It seems to not have been resolved so I am looking for a work around. The example given there no longer works so here is one I wrote based on https://scikit-learn.org/stable/auto_examples/inspection/plot_partial_dependence.html
from sklearn.datasets import fetch_openml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from time import time
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
bikes = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
# Make an explicit copy to avoid "SettingWithCopyWarning" from pandas
X, y = bikes.data.copy(), bikes.target
X["weather"].replace(to_replace="heavy_rain", value="rain", inplace=True)
mask_training = X["year"] == 0.0
X = X.drop(columns=["year"])
X_train, y_train = X[mask_training], y[mask_training]
X_test, y_test = X[~mask_training], y[~mask_training]
numerical_features = [
"temp",
"feel_temp",
"humidity",
"windspeed",
]
categorical_features = X_train.columns.drop(numerical_features)
hgbdt_preprocessor = ColumnTransformer(
transformers=[
("cat", OrdinalEncoder(), categorical_features),
("num", "passthrough", numerical_features),
],
sparse_threshold=1,
verbose_feature_names_out=False,
).set_output(transform="pandas")
hgbdt_model = make_pipeline(
hgbdt_preprocessor,
HistGradientBoostingRegressor(
categorical_features=categorical_features, random_state=0
),
)
hgbdt_model.fit(X_train, y_train)
staged_predict_train = [i for i in hgbdt_model.staged_predict(X_train)]
This produces AttributeError: 'Pipeline' object has no attribute 'staged_predict'
The first thing I tried was to just pass it directly to the model in the pipeline
staged_predict_train = [i for i in hgbdt_model['histgradientboostingregressor'].staged_predict(X_train)]
This fails because X_train is no longer encoded by the prior step in the pipeline.

It is pretty simple to get columntransformer to transform the columns
enc = clf['columntransformer']
X_train_encoded = enc.fit_transform(X_train)
X_test_encoded = enc.fit_transform(X_test)
staged_predict_train = [i.transpose()[1] for i in clf['histgradientboostingclassifier'].staged_predict_proba(X_train_encoded)]

Related

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

i am trying make pipeline with scaler, onhotencoder, polynomialfeature, and finally linear regression model
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
('scaler', StandardScaler(), num_cols),
('polynom', PolynomialFeatures(3), num_cols),
('encoder', OneHotEncoder(), cat_cols),
('linear_regression', LinearRegression() )
])
but when i fit the pipeline i have ValueError: too many values to unpack (expected 2)
pipeline.fit(x_train,y_train)
pipeline.score(x_test, y_test)

If I understand correctly, you want to apply some steps of the pipeline to specific columns. Instead of doing it by adding the column names ad the end of the pipeline stage (which is incorrect and causes the error), you have to use a ColumnTransformer. Here you can find another similar example.
In your case, you could do something like this:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer
# Fake data.
train_data = pd.DataFrame({'n1': range(10), 'n2': range(10)})
train_data['c1'] = 0
train_data['c1'][5:] = 1
y_train = [0]*10
y_train[5:] = [1]*5
# Here I assumed you are using a DataFrame. If not, use integer indices instead of column names.
num_cols = ['n1', 'n2']
cat_cols = ['c1']
# Pipeline to transform the numerical features.
numerical_transformer = Pipeline([('scaler', StandardScaler()),
('polynom', PolynomialFeatures(3))
])
# Apply the numerical transformer only on the numerical columns.
# Spearately, apply the OneHotEncoder.
ct = ColumnTransformer([('num_transformer', numerical_transformer, num_cols),
('encoder', OneHotEncoder(), cat_cols)])
# Main pipeline for fitting.
pipeline = Pipeline([
('column_transformer', ct),
('linear_regression', LinearRegression() )
])
pipeline.fit(train_data, y_train)
Schematically, the layout of your pipeline would be like this:

ColumnTransformer(s) in various parts of a pipeline do not play well

I am using sklearn and mlxtend.regressor.StackingRegressor to build a stacked regression model.
For example, say I want the following small pipeline:
A Stacking Regressor with two regressors:
A pipeline which:
Performs data imputation
1-hot encodes categorical features
Performs linear regression
A pipeline which:
Performs data imputation
Performs regression using a Decision Tree
Unfortunately this is not possible, because StackingRegressor doesn't accept NaN in its input data.
This is even if its regressors know how to handle NaN, as it would be in my case where the regressors are actually pipelines which perform data imputation.
However, this is not a problem: I can just move data imputation outside the stacked regressor.
Now my pipeline looks like this:
Perform data imputation
Apply a Stacking Regressor with two regressors:
A pipeline which:
1-hot encodes categorical features
Standardises numerical features
Performs linear regression
An sklearn.tree.DecisionTreeRegressor.
One might try to implement it as follows (the entire minimal working example in this gist, with comments):
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler()),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
sr_tree = DecisionTreeRegressor()
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out())
)),
('model', StackingRegressor(
regressors=[sr_linear, sr_tree],
meta_regressor=DecisionTreeRegressor(),
use_features_in_secondary=True
))
])
Note that the "outer" ColumnTransformer (in stacked_regressor) returns a numpy matrix.
But the "inner" ColumnTransformer (in sr_linear) expects a pandas.DataFrame, so I had to convert the matrix back to a data frame using step back_to_pandas.
(To use get_feature_names_out I had to use the nightly version of sklearn, because the current stable 1.0.2 version does not support it yet. Fortunately it can be installed with one simple command.)
The above code fails when calling stacked_regressor.fit(), with the following error (the entire stacktrace is again in the gist):
ValueError: make_column_selector can only be applied to pandas dataframes
However, because I added the back_to_pandas step to my outer pipeline, the inner pipelines should be getting a pandas data frame!
In fact, if I only fit_transform() my ct_imputation object, I clearly obtain a pandas data frame.
I cannot understand where and when exactly the data which gets passed around ceases to be a data frame.
Why is my code failing?

Imo the issue has to be ascribed to StackingRegressor. Actually, I am not an expert on its usage and still I have not explored its source code, but I've found this sklearn issue - #16473 which seems implying that << the concatenation [of regressors and meta_regressors] does not preserve dataframe >> (though this is referred to sklearn StackingRegressor instance, rather than on mlxtend one).
Indeed, have a look at what happens once you replace it with your sr_linear pipeline:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from mlxtend.regressor import StackingRegressor
import numpy as np
import pandas as pd
# We use the Ames house prices dataset for this example
d = fetch_openml('house_prices', as_frame=True).frame
# Small data preprocessing:
for column in d.columns:
if d[column].dtype == object or column == 'MSSubClass':
d[column] = pd.Categorical(d[column])
d.drop(columns='Id', inplace=True)
# Prepare the data for training
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
# Train the stacked regressor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
sr_linear = Pipeline(steps=[
('preprocessing', ColumnTransformer(transformers=[
('categorical',
make_pipeline(OneHotEncoder(), StandardScaler(with_mean=False)),
make_column_selector(dtype_include='category')),
('numerical',
StandardScaler(),
make_column_selector(dtype_include=np.number))
])),
('model', LinearRegression())
])
ct_imputation = ColumnTransformer(transformers=[
('categorical',
SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='None'),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacked_regressor = Pipeline(steps=[
('imputation', ct_imputation),
('back_to_pandas', FunctionTransformer(
func=lambda values: pd.DataFrame(values, columns=ct_imputation.get_feature_names_out()).astype(types)
)),
('mdl', sr_linear)
])
stacked_regressor.fit(X_train, y_train)
Observe that I had to slightly modify the 'back_to_pandas' step because for some reason pd.DataFrame was changing the dtypes of the columns to 'object' only (from 'category' and 'float64'), therefore clashing with the imputation performed in sr_linear. For this reason, I've applied .astype(types) to the pd.DataFrame constructor, where types is defined as follows (based on the implementation of .get_feature_names_out() method of the SimpleImputer from the dev version of sklearn):
types = {}
for col in d.columns[:-1]:
if d[col].dtype == 'category':
types['categorical__' + col] = str(d[col].dtype)
else:
types['numerical__' + col] = str(d[col].dtype)

The correct thing to do was:
Move from mlxtend's to sklearn's StackingRegressor. I believe the former was creater when sklearn still didn't have a stacking regressor. Now there is no need to use more 'obscure' solutions. sklearn's stacking regressor works pretty well.
Move the 1-hot-encoding step to the outer pipeline, because (surprisingly!) sklearn's DecisionTreeRegressor cannot handle categorical data among the features.
A working version of the code is given below:
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.ensemble import StackingRegressor
import numpy as np
import pandas as pd
def set_correct_categories(df: pd.DataFrame) -> pd.DataFrame:
for column in df.columns:
if df[column].dtype == object or 'MSSubClass' in column:
df[column] = pd.Categorical(df[column])
return df
d = fetch_openml('house_prices', as_frame=True).frame
d = set_correct_categories(d).drop(columns='Id')
sr_linear = Pipeline(steps=[
('preprocessing', StandardScaler()),
('model', LinearRegression())
])
ct_preprocessing = ColumnTransformer(transformers=[
('categorical',
make_pipeline(
SimpleImputer(strategy='constant', fill_value='None'),
OneHotEncoder(sparse=False, handle_unknown='ignore')
),
make_column_selector(dtype_include='category')),
('numerical',
SimpleImputer(strategy='median'),
make_column_selector(dtype_include=np.number))
])
stacking_regressor = Pipeline(steps=[
('preprocessing', ct_preprocessing),
('model', StackingRegressor(
estimators=[('linear_regression', sr_linear), ('regression_tree', DecisionTreeRegressor())],
final_estimator=DecisionTreeRegressor(),
passthrough=True
))
])
label = 'SalePrice'
features = [col for col in d.columns if col != label]
X, y = d[features], d[label]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
stacking_regressor.fit(X_train, y_train)
Thanks to user amiola for his answer putting me on the right track.

OneHotEncoder ValueError: Input contains NaN

I have downloaded this data, and this is my code:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.multiclass import unique_labels
import plotly.figure_factory as ff
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import make_column_transformer
random_state = 27912
df_train = pd.read_csv("...")
df_test = pd.read_csv("...")
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(["Survived", "Ticket", "Cabin", "Name", "PassengerId"],
axis = 1),
df_train["Survived"], test_size=0.2,
random_state=42)
numeric_col_names = ["Age", "SibSp", "Parch", "Fare"]
ordinal_col_names = ["Pclass"]
one_hot_col_names = ["Embarked", "Sex"]
ct = make_column_transformer(
(SimpleImputer(strategy="median"), numeric_col_names),
(SimpleImputer(strategy="most_frequent"), ordinal_col_names + one_hot_col_names),
(OrdinalEncoder(), ordinal_col_names),
(OneHotEncoder(), one_hot_col_names),
(StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))
preprocessing_pipeline = Pipeline([("transformers", ct)])
preprocessing_pipeline.fit_transform(X_train)
I'm trying make column_transformer for preprocessing step, however, the OneHotEncoding step is giving me an error, ValueError: Input contains NaN. I don't really know why this is happening, because I'm imputing the values before. Any clues on why this is happening?
Trying something like this doesn't help neither
preprocessing_pipeline = Pipeline([("transformers", ct_first)])
ct_second = make_column_transformer((OneHotEncoder(), one_hot_col_names),(StandardScaler(), ordinal_col_names + one_hot_col_names + numeric_col_names))
pipeline = Pipeline([("transformer1", preprocessing_pipeline), ("transformer2", ct_second)])
pipeline.fit_transform(X_train)
I would like to know why is this happening and why the above code, first and second tries, are not correct.
Thanks

You need to create a pipeline for each column type to make sure that the different steps are applied sequentially (i.e. to make sure that the missing values are imputed prior to encoding and scaling), see also this example in the scikit-learn documentation.
import pandas as pd
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
# Load the data (from https://www.kaggle.com/c/titanic/data)
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
# Extract the features
X_train = df_train.drop(labels=['Survived', 'Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
X_test = df_test.drop(labels=['Ticket', 'Cabin', 'Name', 'PassengerId'], axis=1)
# Map the feature names to the corresponding
# types (numerical, ordinal or categorical)
numeric_col_names = ['Age', 'SibSp', 'Parch', 'Fare']
ordinal_col_names = ['Pclass']
one_hot_col_names = ['Embarked', 'Sex']
# Define the numerical features pipeline
numeric_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Define the ordinal features pipeline
ordinal_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OrdinalEncoder()),
('scaler', StandardScaler())
])
# Define the categorical features pipeline
one_hot_col_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(sparse=False)),
('scaler', StandardScaler())
])
# Create the overall preprocessing pipeline
preprocessing_pipeline = make_column_transformer(
(numeric_col_transformer, numeric_col_names),
(ordinal_col_transformer, ordinal_col_names),
(one_hot_col_transformer, one_hot_col_names),
)
# Fit the pipeline to the training data
preprocessing_pipeline.fit(X_train)
# Apply the pipeline to the training and test data
X_train_ = preprocessing_pipeline.transform(X_train)
X_test_ = preprocessing_pipeline.transform(X_test)

How to access ColumnTransformer elements in GridSearchCV

I wanted to find out the correct naming convention when referring to individual preprocessor included in ColumnTransformer (which is part of a pipeline) in param_grid for grid_search.
Environment & sample data:
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
df = sns.load_dataset('titanic')[['survived', 'age', 'embarked']]
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='survived'), df['survived'], test_size=0.2,
random_state=123)
Pipeline:
num = ['age']
cat = ['embarked']
num_transformer = Pipeline(steps=[('imputer', SimpleImputer()),
('discritiser', KBinsDiscretizer(encode='ordinal', strategy='uniform')),
('scaler', MinMaxScaler())])
cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num),
('cat', cat_transformer, cat)])
pipe = Pipeline(steps=[('preprocessor', preprocessor),
('classiffier', LogisticRegression(random_state=1, max_iter=10000))])
param_grid = dict([SOMETHING]imputer__strategy = ['mean', 'median'],
[SOMETHING]discritiser__nbins = range(5,10),
classiffier__C = [0.1, 10, 100],
classiffier__solver = ['liblinear', 'saga'])
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid_search.fit(X_train, y_train)
Basically, what should I write instead of [SOMETHING] in my code?
I have looked at this answer which answered the question for make_pipeline - so using the similar idea, I tried 'preprocessor__num__', 'preprocessor__num_', 'pipeline__num__', 'pipeline__num_' - no luck so far.
Thank you

You were close, the correct way to declare it is like this:
param_grid = {'preprocessor__num__imputer__strategy' : ['mean', 'median'],
'preprocessor__num__discritiser__n_bins' : range(5,10),
'classiffier__C' : [0.1, 10, 100],
'classiffier__solver' : ['liblinear', 'saga']}
Here is the full code:
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, KBinsDiscretizer, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
df = sns.load_dataset('titanic')[['survived', 'age', 'embarked']]
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='survived'), df['survived'], test_size=0.2,
random_state=123)
num = ['age']
cat = ['embarked']
num_transformer = Pipeline(steps=[('imputer', SimpleImputer()),
('discritiser', KBinsDiscretizer(encode='ordinal', strategy='uniform')),
('scaler', MinMaxScaler())])
cat_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[('num', num_transformer, num),
('cat', cat_transformer, cat)])
pipe = Pipeline(steps=[('preprocessor', preprocessor),
('classiffier', LogisticRegression(random_state=1, max_iter=10000))])
param_grid = {'preprocessor__num__imputer__strategy' : ['mean', 'median'],
'preprocessor__num__discritiser__n_bins' : range(5,10),
'classiffier__C' : [0.1, 10, 100],
'classiffier__solver' : ['liblinear', 'saga']}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=10)
grid_search.fit(X_train, y_train)
One simply way to check the available parameter names is like this:
print(pipe.get_params().keys())
This will print out the list of all the available parameters which you can copy directly into your params dictionary.
I have written a utility function which you can use to check if a parameter exist in a pipeline/classifier by simply passing in a keyword.
def check_params_exist(esitmator, params_keyword):
all_params = esitmator.get_params().keys()
available_params = [x for x in all_params if params_keyword in x]
if len(available_params)==0:
return "No matching params found!"
else:
return available_params
Now if you are unsure of the exact name, just pass imputer as the keyword
print(check_params_exist(pipe, 'imputer'))
This will print the following list:
['preprocessor__num__imputer',
'preprocessor__num__imputer__add_indicator',
'preprocessor__num__imputer__copy',
'preprocessor__num__imputer__fill_value',
'preprocessor__num__imputer__missing_values',
'preprocessor__num__imputer__strategy',
'preprocessor__num__imputer__verbose',
'preprocessor__cat__imputer',
'preprocessor__cat__imputer__add_indicator',
'preprocessor__cat__imputer__copy',
'preprocessor__cat__imputer__fill_value',
'preprocessor__cat__imputer__missing_values',
'preprocessor__cat__imputer__strategy',
'preprocessor__cat__imputer__verbose']

How to add a feature to a vectorized data set?

I want to write a Naive Base text classificator.
Because sklearn does not accept 'text form' features I am transforming them using TfidfVectorizer.
I was successfully able to create such classificatory using only the transformed data as features. The code looks like this:
### text vectorization--go from strings to lists of numbers
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')
X_train_transformed = vectorizer.fit_transform(X_train_raw['url'])
X_test_transformed = vectorizer.transform(X_test_raw['url'])
### feature selection, because text is super high dimensional and
### can be really computationally chewy as a result
selector = SelectPercentile(f_classif, percentile=1)
selector.fit(X_train_transformed, y_train_raw)
X_train = selector.transform(X_train_transformed).toarray()
X_test = selector.transform(X_test_transformed).toarray()
clf = GaussianNB()
clf.fit(X_train, y_train_raw)
.....
Everything works as intended but I am having problems when I want to add another feature eg. flag indicating weather the given text contains a certain keyword.
I tried multiple things to properly transform the 'url' feature and then combine the transformed feature with another boolean feature but I was unsuccessfully.
Any tips how it should be done assuming that I have a pandas frame containing two features: 'url' (which I want to transform) and 'contains_keyword' flag?
The solution which failed looks like this:
vectorizer = CountVectorizer(min_df=1)
X_train_transformed = vectorizer.fit_transform(X_train_raw['url'])
X_test_transformed = vectorizer.transform(X_test_raw['url'])
selector = SelectPercentile(f_classif, percentile=1)
selector.fit(X_train_transformed, y_train_raw)
X_train_selected = selector.transform(X_train_transformed)
X_test_selected = selector.transform(X_test_transformed)
X_train_raw['transformed_url'] = X_train_selected.toarray().tolist()
X_train_without = X_train_raw.drop(['url'], axis=1)
X_train = X_train_without.values
This produces rows containing a boolean flag and a list which is a wrong input for sklearn model. I have no idea how should i properly transform this. Grateful for any help.
Here are test data:
url,target,ads_keyword
googleadapis l google com,1,True
googleadapis l google com,1,True
clients1 google com,1,False
c go-mpulse net,1,False
translate google pl,1,False
url - splitted domain taken from dns query
target - target class for classification
ads_keyword - flag indicating weather the 'url' contains the 'ads' word.
I want to transform the 'url' using the TfidfVectorizer and use the transformed data together with 'ads_keyword' (and possibly more features in the future) as features used to train the Naive Bayes model.

Here is a demo, showing how to union features and how to tune up hyperparameters using GridSearchCV.
Unfortunately your sample data set is too tiny to train a real model...
try:
from pathlib import Path
except ImportError: # Python 2
from pathlib2 import Path
import os
import re
from pprint import pprint
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer, LabelEncoder, LabelBinarizer, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.externals import joblib
from scipy.sparse import csr_matrix, hstack
class ColumnSelector(BaseEstimator, TransformerMixin):
def __init__(self, name=None, position=None,
as_cat_codes=False, sparse=False):
self.name = name
self.position = position
self.as_cat_codes = as_cat_codes
self.sparse = sparse
def fit(self, X, y=None):
return self
def transform(self, X, **kwargs):
if self.name is not None:
col_pos = X.columns.get_loc(self.name)
elif self.position is not None:
col_pos = self.position
else:
raise Exception('either [name] or [position] parameter must be not-None')
if self.as_cat_codes and X.dtypes.iloc[col_pos] == 'category':
ret = X.iloc[:, col_pos].cat.codes
else:
ret = X.iloc[:, col_pos]
if self.sparse:
ret = csr_matrix(ret.values.reshape(-1,1))
return ret
union = FeatureUnion([
('text',
Pipeline([
('select', ColumnSelector('url')),
#('pct', SelectPercentile(percentile=1)),
('vect', TfidfVectorizer(sublinear_tf=True, max_df=0.5,
stop_words='english')),
]) ),
('ads',
Pipeline([
('select', ColumnSelector('ads_keyword', sparse=True,
as_cat_codes=True)),
#('scale', StandardScaler(with_mean=False)),
]) )
])
pipe = Pipeline([
('union', union),
('clf', MultinomialNB())
])
param_grid = [
{
'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
max_df=0.5,
stop_words='english')],
'clf': [SGDClassifier(max_iter=500)],
'union__text__vect__ngram_range': [(1,1), (2,5)],
'union__text__vect__analyzer': ['word','char_wb'],
'clf__alpha': np.logspace(-5, 0, 6),
#'clf__max_iter': [500],
},
{
'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
max_df=0.5,
stop_words='english')],
'clf': [MultinomialNB()],
'union__text__vect__ngram_range': [(1,1), (2,5)],
'union__text__vect__analyzer': ['word','char_wb'],
'clf__alpha': np.logspace(-4, 2, 7),
},
#{ # NOTE: does NOT support sparse matrices!
# 'union__text__vect': [TfidfVectorizer(sublinear_tf=True,
# max_df=0.5,
# stop_words='english')],
# 'clf': [GaussianNB()],
# 'union__text__vect__ngram_range': [(1,1), (2,5)],
# 'union__text__vect__analyzer': ['word','char_wb'],
#},
]
gs_kwargs = dict(scoring='roc_auc', cv=3, n_jobs=1, verbose=2)
X_train, X_test, y_train, y_test = \
train_test_split(df[['url','ads_keyword']], df['target'], test_size=0.33)
grid = GridSearchCV(pipe, param_grid=param_grid, **gs_kwargs)
grid.fit(X_train, y_train)
# prediction
predicted = grid.predict(X_test)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Staged_predict from a Pipeline object - python

Related

Sklearn pipeline transform specific columns - ValueError: too many values to unpack (expected 2)

ColumnTransformer(s) in various parts of a pipeline do not play well

OneHotEncoder ValueError: Input contains NaN

How to access ColumnTransformer elements in GridSearchCV

How to add a feature to a vectorized data set?

Categories

Resources