XGBRegressor using Pipeline - python

I'm using XGBRegressor with Pipeline. Pipeline contains preprocessing steps and model (XGBRegressor).
Below is complete preprocessing steps. (I have already defined numeric_cols and cat_cols)
numerical_transfer = SimpleImputer()
cat_transfer = Pipeline(steps = [
('imputer', SimpleImputer(strategy = 'most_frequent')),
('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])
preprocessor = ColumnTransformer(
transformers = [
('num', numerical_transfer, numeric_cols),
('cat', cat_transfer, cat_cols)
])
And the final pipeline is
my_model = Pipeline(steps = [('preprocessor', preprocessor), ('model', model)])
When I tried to fit without using early_stopping_rounds code is working fine.
(my_model.fit(X_train, y_train))
But when I use early_stopping_rounds as shown below I'm getting error.
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid, y_valid)])
I'm getting error at:
model__eval_set=[(X_valid, y_valid)]) and the error is
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in fields MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, CentralAir, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, MiscFeature, SaleType, SaleCondition
Did it mean that I should preprocess X_valid before applying to my_model.fit() or I have done something wrong ?
If the problem is we need to preprocess X_valid before applying fit() how to do that with preprocessor I have defined above ?
Edit : I tried to preprocess X_valid without Pipeline, but I got error saying feature mismatch.

The problem is that pipelines do not fit eval_set. So, as you said, you need to preprocess X_valid. To do that the easiest way is using your pipeline without the 'model' step. Use the following code before fitting your pipeline:
# Make a copy to avoid changing original data
X_valid_eval=X_valid.copy()
# Remove the model from pipeline
eval_set_pipe = Pipeline(steps = [('preprocessor', preprocessor)])
# fit transform X_valid.copy()
X_valid_eval = eval_set_pipe.fit(X_train, y_train).transform (X_valid_eval)
Then fit your pipeline after changing model__eval_set as follows:
my_model.fit(X_train, y_train, model__early_stopping_rounds=5, model__eval_metric = "mae", model__eval_set=[(X_valid_eval, y_valid)])

Related

Sklearn Pipeline classifier throwing ValueError even when the missing values are taken care of

I have created sklearn pipeline for preprocessing and then running the model over the processed data. The preprocessing step takes care of missing values even after that it throws the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
The below is my code :
def test_sklearn_pipeline(random_state_num):
numeric_features = ["x","y"]
categorical_features = ["wconfid","pctid"]
missing_features = ["x"]
missing_transformer = Pipeline(
steps=[("imputer", SimpleImputer(strategy="mean"))]
)
scale_transformer = Pipeline(
steps=[("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
steps=[('ohe',OneHotEncoder(handle_unknown="ignore"))]
)
preprocessor = ColumnTransformer(
transformers=[
("miss", missing_transformer, missing_features),
("cat", categorical_transformer, categorical_features),
('outlier_remover',outlier_removal,numeric_features),
("num", scale_transformer, numeric_features)
],remainder='passthrough'
)
clf = Pipeline(
steps=[("preprocessor", preprocessor), ("classifier", LinearRegression())]
)
df = pd.read_csv('accelerometer_modified.csv')
df = df.drop(columns=['random'])
X,y = df.drop(columns=['z']),df.loc[:,'z']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=random_state_num)
clf.fit(X_train, y_train)
print("MSE: %.3f" % mean_squared_error(clf.predict(X_test), y_test))
Numeric features and missing features do have the column x in common. Columntransformer runs each transformation in the input dataframe. This means you are running the standard scaler in the raw column and not the imputed one. You probably need two transformers that run sequentially, or rather put a small Pipeline as you've done already with steps that are first impute second scale

sklearn Stacking Estimator passthrough skips preprocessing and passes original data

This issue has been discussed here but there has been no comments: https://github.com/scikit-learn/scikit-learn/issues/16473
I have some numerical features and categorical features in X. The categorical features were one hot encoded. So my pipeline is something similar to the sklearn docs example:
cat_proc_lin = make_pipeline(
SimpleImputer(missing_values=None,
strategy='constant',
fill_value='missing'),
OneHotEncoder(categories=categories)
)
num_proc_lin = make_pipeline(
SimpleImputer(strategy='mean'),
StandardScaler()
)
processor_lin = make_column_transformer(
(cat_proc_lin, cat_cols),
(num_proc_lin, num_cols),
remainder='passthrough')
lasso_pipeline = make_pipeline(processor_lin,
LassoCV())
rf_pipeline = make_pipeline(processor_nlin,
RandomForestRegressor(random_state=42))
gradient_pipeline = make_pipeline(
processor_nlin,
HistGradientBoostingRegressor(random_state=0))
estimators = [('Random Forest', rf_pipeline),
('Lasso', lasso_pipeline),
('Gradient Boosting', gradient_pipeline)]
stacking_regressor = StackingRegressor(estimators=estimators,
final_estimator=RidgeCV())
But if I change passthrough=True, it will raise a TypeError because the passthrough gives the original X and skips the preprocessing part of the pipeline:
/usr/local/lib/python3.6/dist-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
ValueError: could not convert string to float: 'RL'
Is there anyway to make the passthrough include the first preprocessing part of the pipeline?
I also cannot add the preprocessing pipeline infront of the final estimator because it will concatenate the original X dataframe with the final layer predictions which is a numpy array as mentioned in the github discussion link at the top of this post. My exact preprocessing pipeline has several custom transformers that operates on pandas dataframe.
Thank you for any help!

How to Save and Load Machine Learning (One-vs-Rest) Models (PYTHON)

I have here my code wherein it loops through each label or category then creates a model out of it. However, what I want is to create a general model that will be able to accept new predictions that are inputs from a user.
I'm aware that the code below saves the model that is fit for the last category in the loop. How can I fix this so that models for each category will be saved so that when I load those models, i would be able to predict a label for a new text?
vectorizer = TfidfVectorizer(strip_accents='unicode',
stop_words=stop_words, analyzer='word', ngram_range=(1,3), norm='l2')
vectorizer.fit(train_text)
vectorizer.fit(test_text)
x_train = vectorizer.transform(train_text)
y_train = train.drop(labels = ['question_body'], axis=1)
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels = ['question_body'], axis=1)
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
for category in categories:
print('... Processing {}'.format(category))
# train the SVC model using X_dtm & y
SVC_pipeline.fit(x_train, train[category])
# compute the testing accuracy of SVC
svc_prediction = SVC_pipeline.predict(x_test)
print("SVC Prediction:")
print(svc_prediction)
print('Test accuracy is {}'.format(f1_score(test[category], svc_prediction)))
print("\n")
#save the model to disk
filename = 'svc_model.sav'
pickle.dump(SVC_pipeline, open(filename, 'wb'))
There are multiple mistakes in your code.
You are fitting your TfidfVectorizer on both train and test:
vectorizer.fit(train_text)
vectorizer.fit(test_text)
This is wrong. Calling fit() is not incremental. It will not learn on both data if called two times. The most recent call to fit() will forget everything from past calls. You never fit (learn) something on test data.
What you need to do is this:
vectorizer.fit(train_text)
The pipeline does not work the way you think:
# Using pipeline for applying linearSVC and one vs rest classifier
SVC_pipeline = Pipeline([
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See that you are passing LinearSVC inside the OneVsRestClassifier, so it will automatically use that without the need of Pipeline. Pipeline will not do anything here. Pipeline is of use when you sequentially want to pass your data through multiple models. Something like this:
pipe = Pipeline([
('pca', pca),
('logistic', LogisticRegression())
])
What the above pipe will do is pass the data to PCA which will transform it. Then that new data is passed to LogisticRegression and so on..
Correct usage of pipeline in your case can be:
SVC_pipeline = Pipeline([
('vectorizer', vectorizer)
('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
See more examples here:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#examples-using-sklearn-pipeline-pipeline
You need to describe more about your "categories". Show some examples of your data. You are not using y_train and y_test anywhere. Is the categories different from "question_body"?

python imblearn make_pipeline TypeError: Last step of Pipeline should implement fit

I am trying to implement SMOTE of imblearn inside the Pipeline. My data sets are text data stored in pandas dataframe. Please see below the code snippet
text_clf =Pipeline([('vect', TfidfVectorizer()),('scale', StandardScaler(with_mean=False)),('smt', SMOTE(random_state=5)),('clf', LinearSVC(class_weight='balanced'))])
After this I am using GridsearchCV.
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but tuning parameters mostly for TfidfVectorizer().
I am getting the following error.
All intermediate steps should be transformers and implement fit and transform. 'SMOTE
Post this error, I have changed the code to as follows.
vect = TfidfVectorizer(use_idf=True,smooth_idf = True, max_df = 0.25, sublinear_tf = True, ngram_range=(1,2))
X = vect.fit_transform(X).todense()
Y = vect.fit_transform(Y).todense()
X_Train,X_Test,Y_Train,y_test = train_test_split(X,Y, random_state=0, test_size=0.33, shuffle=True)
text_clf =make_pipeline([('smt', SMOTE(random_state=5)),('scale', StandardScaler(with_mean=False)),('clf', LinearSVC(class_weight='balanced'))])
grid = GridSearchCV(text_clf, parameters, cv=4, n_jobs=-1, scoring = 'accuracy')
Where parameters are nothing but tuning Cin SVC classifiers.
This time I am getting the following error:
Last step of Pipeline should implement fit.SMOTE(....) doesn't
What is going here? Can anyone please help?
imblearn.SMOTE has no transform method. Docs is here.
But all steps except the last in a pipeline should have it, along with fit.
To use SMOTE with sklearn pipeline you should implement a custom transformer calling SMOTE.fit_sample() in transform method.
Another easier option is just to use ibmlearn pipeline:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline
# This doesn't work with sklearn.pipeline.Pipeline because
# SMOTE doesn't have a .tranform() method.
# (It has .fit_sample() or .sample().)
pipe = imbPipeline([
...
('oversample', SMOTE(random_state=5)),
('clf', LinearSVC(class_weight='balanced'))
])

sklearn pipeline - how to apply different transformations on different columns

I am pretty new to pipelines in sklearn and I am running into this problem: I have a dataset that has a mixture of text and numbers i.e. certain columns have text only and rest have integers (or floating point numbers).
I was wondering if it was possible to build a pipeline where I can for example call LabelEncoder() on the text features and MinMaxScaler() on the numbers columns. The examples I have seen on the web mostly point towards using LabelEncoder() on the entire dataset and not on select columns. Is this possible? If so any pointers would be greatly appreciated.
The way I usually do it is with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.
Important notes:
You have to define your functions with def since annoyingly you can't use lambda or partial in FunctionTransformer if you want to pickle your model
You need to initialize FunctionTransformer with validate=False
Something like this:
from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer
def get_text_cols(df):
return df[['name', 'fruit']]
def get_num_cols(df):
return df[['height','age']]
vec = make_union(*[
make_pipeline(FunctionTransformer(get_text_cols, validate=False), LabelEncoder()))),
make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler())))
])
Since v0.20, you can use ColumnTransformer to accomplish this.
An Example of ColumnTransformer might help you:
# FOREGOING TRANSFORMATIONS ON 'data' ...
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]
# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
('impute', SimpleImputer(strategy='most_frequent')),
('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])
featurisation = ColumnTransformer(transformers=[
("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
('numeric', StandardScaler(), ['num_children', 'income'])
])
# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
('features', featurisation),
('learner', neural_net)])
# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])
You can find the entire code under: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

Categories