Sklearn Pipeline persistence with custom class not working - python

I am using an sklearn pipeline in my code and saving the pipeline object to deploy in another environment. I have one custom class to drop features. I save the model successfully, but when I use the pipeline object in another environment that has the same version of sklearn, it throws an error. The pipeline works fine when I do not include my custom class DropFeatures. Below is the code:
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.externals import joblib
import pandas as pd
# Load the Iris dataset
df = pd.read_csv('Iris.csv')
label = 'Species'
labels = df[label]
df.drop(['Species'],axis=1,inplace=True)
# Set up a pipeline with a feature selection preprocessor that
# selects the top 2 features to use.
# The pipeline then uses a RandomForestClassifier to train the model.
class DropFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, features_to_drop=None):
        self.features = features_to_drop

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # drop the configured features, if any
        if len(self.features) != 0:
            X = X.copy()
            X = X.drop(self.features, axis=1)
            return X
        return X
pipeline = Pipeline([
    ('drop_features', DropFeatures(['Id'])),
    ('feature_selection', SelectKBest(chi2, k=1)),
    ('classification', RandomForestClassifier())
])
pipeline.fit(df, labels)
print(pipeline.predict(query))  # query: sample rows to predict on (defined elsewhere in the original post)
# Export the classifier to a file
joblib.dump(pipeline, 'model.joblib')
When I use model.joblib in the other environment, I get an error. Below is the code to load the model, followed by the error:
from sklearn.externals import joblib
model = joblib.load('model.joblib')
print(model)
Error stack trace: (posted as an image in the original question; not reproduced here)
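A likely cause, given these symptoms: pickle serializes custom classes by reference, so a DropFeatures defined in the training script's __main__ cannot be resolved when the file is loaded elsewhere. Below is a minimal sketch of the usual workaround, assuming a hypothetical shared module custom_transformers.py that is importable in both environments:
# custom_transformers.py (hypothetical shared module, installed in both environments)
from sklearn.base import BaseEstimator, TransformerMixin

class DropFeatures(BaseEstimator, TransformerMixin):
    def __init__(self, features_to_drop=None):
        self.features = features_to_drop

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # drop the configured features, if any
        if self.features:
            return X.drop(self.features, axis=1)
        return X
Both the training script and the loading script then do from custom_transformers import DropFeatures before calling joblib.dump or joblib.load, so pickle can resolve the class by its module path.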

Related

Cache only a single step in sklearn's Pipeline

I want to use UMAP in my sklearn Pipeline, and I would like to cache that step to speed things up. However, since I have a custom transformer, the suggested method doesn't work.
Example code:
from sklearn.preprocessing import FunctionTransformer
from tempfile import mkdtemp
from sklearn.pipeline import Pipeline
from umap import UMAP
from hdbscan import HDBSCAN
import seaborn as sns
iris = sns.load_dataset("iris")
X = iris.drop(columns='species')
y = iris.species
# FunctionTransformer
def transform_something(iris):
    iris = iris.copy()
    iris['sepal_sum'] = iris.sepal_length + iris.sepal_width
    return iris
cachedir = mkdtemp()
pipe = Pipeline([
    ('transformer', FunctionTransformer(transform_something)),
    ('umap', UMAP()),
    ('hdb', HDBSCAN()),
], memory=cachedir)
pipe.fit_predict(X)
If you run this, you will get a PicklingError, saying it cannot pickle the custom transformer. But I only need to cache the UMAP step. Any suggestions to make it work?
Not the cleanest, but you could nest pipelines, so that only the picklable steps sit inside the cached inner pipeline:
pipe = Pipeline([
    ('transformer', FunctionTransformer(transform_something)),
    ('the_rest', Pipeline([
        ('umap', UMAP()),
        ('hdb', HDBSCAN()),
    ], memory=cachedir)),
])
What also works is, instead of using the FunctionTransformer, writing your custom transformer from scratch like this:
from sklearn.base import BaseEstimator, TransformerMixin
class transform_something(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X['sepal_sum'] = X.sepal_length + X.sepal_width
        return X
Unfortunately it is a bit more code, but it is picklable.
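For completeness, a minimal sketch (same data and cachedir as above) of the class-based transformer wired into the cached pipeline:
pipe = Pipeline([
    ('transformer', transform_something()),  # class instance, picklable
    ('umap', UMAP()),
    ('hdb', HDBSCAN()),
], memory=cachedir)
labels = pipe.fit_predict(X)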

Stacking classifier: Using custom classifier returns error

I'm using a StackingClassifier in sklearn, where I want the component models to be custom classifiers. To test this, I wrote some dummy code where the custom classifier is exactly the same as an already existing model (KNN, in this example). However, this throws an error, and I'm not sure I understand why; I'm looking for help with this. It's probably something fairly obvious (I'm new to writing custom classifiers and using ClassifierMixin), but I can't seem to figure out what I'm missing:
Code -- the baseline example without my custom class (works):
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = StackingClassifier(estimators=[
    ('tree', Pipeline([('tree', DecisionTreeClassifier(random_state=42))])),
    ('knn', Pipeline([('knn', KNeighborsClassifier())])),
])
model.fit(X, y)
Code -- with my custom class (doesn't work):
class MyOwnClassifier(ClassifierMixin):
    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y):
        self.classifier.fit(X, y)
        return self

    def predict(self, X):
        return self.classifier.predict(X)

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)
model = StackingClassifier(estimators=[
    ('tree', Pipeline([('tree', DecisionTreeClassifier(random_state=42))])),
    ('knn', Pipeline([('knn', MyOwnClassifier(KNeighborsClassifier()))])),
])
model.fit(X, y)
returns the error
AttributeError: 'MyOwnClassifier' object has no attribute 'classes_'
What really puzzles me about this is that in this answer, an identity transform could be used as part of the pipeline, and I can't imagine that object had 'classes_' either.
You've got 3 problems with your code:
1. StackingClassifier expects an attribute classes_ to be available on a fitted classifier, which is clearly stated in the error message. The linked example does have it, whereas yours doesn't. You can check this by running dir(MyOwnClassifier(KNeighborsClassifier()).fit(X, y)).
2. BaseEstimator is missing from your class definition (you can do without it, but its presence makes life easier).
3. The Pipelines in your code are extraneous clutter: they are not necessary to reproduce the problem and only complicate debugging.
Once you correct these problems, you have working code:
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.base import ClassifierMixin, BaseEstimator
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
class MyOwnClassifier(ClassifierMixin, BaseEstimator):
    def __init__(self, classifier):
        self.classifier = classifier

    def fit(self, X, y):
        self.classifier.fit(X, y)
        self.classes_ = self.classifier.classes_
        return self

    def predict(self, X):
        return self.classifier.predict(X)

    def predict_proba(self, X):
        return self.classifier.predict_proba(X)
model = StackingClassifier(estimators=[
    ('tree', DecisionTreeClassifier(random_state=42)),
    ('knn', MyOwnClassifier(KNeighborsClassifier()))])
model.fit(X, y)
StackingClassifier(estimators=[('tree',
                                DecisionTreeClassifier(random_state=42)),
                               ('knn',
                                MyOwnClassifier(classifier=KNeighborsClassifier()))])

Output of shape for training after oversampling with imbalanced-learn

I am using imbalanced-learn to oversample my data. I want to know how many entries in each class there are after using the oversampling method.
This code works nicely:
from imblearn.over_sampling import SMOTE
from collections import Counter
def oversample(x_values, y_values):
    oversampler = SMOTE(random_state=42, n_jobs=-1)
    x_oversampled, y_oversampled = oversampler.fit_resample(x_values, y_values)
    print("Oversampling training set from {0} to {1} using {2}".format(
        dict(Counter(y_values)), dict(Counter(y_oversampled)),
        type(oversampler).__name__))
    return x_oversampled, y_oversampled
But I switched to using a pipeline so I can use GridSearchCV to find the best oversampling method (out of ADASYN, SMOTE and BorderlineSMOTE). Therefore I never actually call fit_resample myself, and I lose my output when using something like this:
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier())])
pipe.fit(x_values, y_values)
The oversampling works, but I lose the output showing how many entries each class has in the training set.
Is there a way to get output similar to the first example when using a pipeline?
In theory, yes. When an over-sampler is fitted, an attribute sampling_strategy_ is created, containing the number of samples to be generated for the minority class(es) when fit_resample is invoked. You can use it to get output similar to your example above. Here is a modified example based on your code:
# Imports
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
# Create toy dataset
X, y = make_classification(weights=[0.20, 0.80], random_state=0)
init_class_distribution = Counter(y)
min_class_label, _ = init_class_distribution.most_common()[-1]
print(f'Initial class distribution: {dict(init_class_distribution)}')
# Create and fit pipeline
pipe = Pipeline([('scaler', MinMaxScaler()), ('sampler', SMOTE(random_state=42, n_jobs=-1)), ('estimator', RandomForestClassifier(random_state=23))])
pipe.fit(X, y)
sampling_strategy = dict(pipe.steps).get('sampler').sampling_strategy_
expected_n_samples = sampling_strategy.get(min_class_label)
print(f'Expected number of generated samples: {expected_n_samples}')
# Fit and resample over-sampler pipeline
sampler_pipe = Pipeline(pipe.steps[:-1])
X_res, y_res = sampler_pipe.fit_resample(X, y)
actual_class_distribution = Counter(y_res)
print(f'Actual class distribution: {actual_class_distribution}')
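Since the question mentions choosing the best over-sampler with GridSearchCV: a whole pipeline step can be swapped by listing estimator objects under the step's name in the parameter grid, and the same sampling_strategy_ attribute is then available on the refit best pipeline. A sketch building on the code above (parameter values are illustrative):
from sklearn.model_selection import GridSearchCV
from imblearn.over_sampling import ADASYN, BorderlineSMOTE
param_grid = {'sampler': [SMOTE(random_state=42),
                          ADASYN(random_state=42),
                          BorderlineSMOTE(random_state=42)]}
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X, y)
best_sampler = grid.best_estimator_.named_steps['sampler']
print(f'Best sampler: {type(best_sampler).__name__}')
print(f'Expected generated samples per class: {best_sampler.sampling_strategy_}')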

How to pass pipeline model to a class without instantiating it

I would like to create a class that takes a scikit-learn pipeline and loop over it (as in the code example below).
In the example below I can however only pass the instance of my pipeline to the class and not create a new one in order to start with a fresh model.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
class my_class:
    def __init__(self, model):
        self.model = model

    def evaluate(self, X, y):
        results = []
        for i in range(10):
            self.model.fit(X, y)  # I always use the same instance here.
            y_pred = self.model.predict(X)
            results.append(accuracy_score(y_pred=y_pred, y_true=y))
        return results
iris = load_iris()
X = iris.data
y = iris.target
pipeline = Pipeline([
    ('classifier', AdaBoostClassifier())
])
test = my_class(pipeline)
scores = test.evaluate(X,y)
Your code might be initializing different models, but the result will always be the same because you are training and testing on the same data X every time, with no change in the hyperparameters and random_state left as None. That is why the results list will always contain the same value.
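If the goal is to start every iteration with a fresh, unfitted model, one option (my suggestion, not part of the answer above) is sklearn.base.clone, which constructs a new unfitted estimator with the same parameters:
from sklearn.base import clone
from sklearn.metrics import accuracy_score

class my_class:
    def __init__(self, model):
        self.model = model

    def evaluate(self, X, y):
        results = []
        for i in range(10):
            model = clone(self.model)  # fresh, unfitted copy each iteration
            model.fit(X, y)
            y_pred = model.predict(X)
            results.append(accuracy_score(y_pred=y_pred, y_true=y))
        return results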

Include feature extraction in pipeline sklearn

For a text classification project I made a pipeline for the feature selection and the classifier. Now my question is whether it is possible to include the feature extraction module in the pipeline, and how. I looked some things up about it, but they don't seem to fit with my current code.
This is what I have now:
# feature_extraction module.
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_extraction import DictVectorizer
import numpy as np
vec = DictVectorizer()
X = vec.fit_transform(instances)
scaler = StandardScaler(with_mean=False) # we use cross validation, no train/test set
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
# Feature selection and classification pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import linear_model
from sklearn.pipeline import Pipeline
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = linear_model.LogisticRegression()
pipe = Pipeline([('mutual_info', feat_sel), ('logistregress', clf)])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
How can I put everything from the DictVectorizer up to the LabelEncoder into the pipeline?
Here's how you would do it. Assuming instances is an iterable of dicts, as the DictVectorizer API specifies, just build your pipeline like so:
pipe = Pipeline([('vectorizer', DictVectorizer()),
                 ('scaler', StandardScaler(with_mean=False)),
                 ('mutual_info', feat_sel),
                 ('logistregress', clf)])
To predict, then call cross_val_predict, passing instances as X:
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)
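One caveat: the LabelEncoder stays outside the pipeline, because sklearn pipelines transform only X, never y. The target is still encoded separately, exactly as in the question's code:
enc = LabelEncoder()
y = enc.fit_transform(labels)  # target encoding happens outside the pipeline
y_pred = model_selection.cross_val_predict(pipe, instances, y, cv=10)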
